# Project Linear Regression: Boston House Price Prediction

# **Marks: 30**

Welcome to the project on Linear Regression. We will use the Boston house price data for the exercise.

-------------------------------
## Problem Statement
-------------------------------

The problem at hand is to predict the housing prices of a town or a suburb from the features of the locality provided to us. In the process, we need to identify the most important features in the dataset. We will apply data preprocessing techniques and build a linear regression model that predicts the prices.

----------------------------
## Data Information
---------------------------

Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. Detailed attribute information is given below.

Attribute Information (in order):
- **CRIM:**     per capita crime rate by town
- **ZN:**       proportion of residential land zoned for lots over 25,000 sq.ft.
- **INDUS:**    proportion of non-retail business acres per town
- **CHAS:**     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- **NOX:**      nitric oxides concentration (parts per 10 million)
- **RM:**       average number of rooms per dwelling
- **AGE:**     proportion of owner-occupied units built before 1940
- **DIS:**      weighted distances to five Boston employment centers
- **RAD:**      index of accessibility to radial highways
- **TAX:**      full-value property-tax rate per 10,000 dollars
- **PTRATIO:**  pupil-teacher ratio by town
- **LSTAT:**    % lower status of the population
- **MEDV:**     Median value of owner-occupied homes in 1000 dollars

### Let us start by importing the required libraries

In [6]:
# import libraries for data manipulation
import pandas as pd
import numpy as np
import random as rnd


# import libraries for data visualization
import matplotlib.pyplot as plt
%matplotlib inline 
from pandas.plotting import scatter_matrix

import seaborn as sns
sns.set_style('darkgrid')
# settings for seaborn plotting style
sns.set(color_codes=True)
# settings for seaborn plot sizes
#sns.set(rc={'figure.figsize':(5,5)})
from statsmodels.graphics.gofplots import ProbPlot

# import libraries for building linear regression model
from statsmodels.formula.api import ols
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# import library for preparing data
from sklearn.model_selection import train_test_split

# import library for data preprocessing
from sklearn.preprocessing import MinMaxScaler

import warnings
warnings.filterwarnings("ignore")

# import library for scientific computing
import scipy
# use one alias for scipy.stats; import the normal distribution object by name
import scipy.stats as stats
from scipy.stats import norm
# import uniform distribution
from scipy.stats import uniform
# for latex equations
from IPython.display import Math, Latex
# for displaying images
from IPython.display import Image


In [7]:
!pip install periscope


### Read the dataset

In [8]:
df = pd.read_csv("Boston.csv")
df.head()

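| Unnamed: 0 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 | 36.2 |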


**Observations**
- The house price, given by the variable MEDV, is the target variable.
- The other columns are the independent variables used to predict the house price. The column `Unnamed: 0` appears to be a row index carried over from the CSV file and not a feature.
- Only the first five rows are shown, so the remarks below are first impressions and not conclusions about the full dataset.
- Among these rows, the per capita crime rate CRIM is lowest (0.00632) in the town with the highest proportion of residential land zoned for lots over 25,000 sq.ft. ZN (18.0).
- CRIM is highest (0.06905) in a town with a low proportion of non-retail business acres INDUS (2.18).
- Four of the five towns have no residential land zoned for lots over 25,000 sq.ft. (ZN = 0).
- The two towns with the highest INDUS (7.07) include the town with the largest proportion of owner-occupied units built before 1940 AGE (78.9).
- Nitric oxides concentration (parts per 10 million) NOX is highest (0.538) in the one town with a non-zero ZN. This town might be previously agrarian land ready for development, so NOX may not be an essential feature for predicting home prices.
- The average number of rooms per dwelling RM should have an effect on house prices.
- The town with the lowest index of accessibility to radial highways RAD (1) has a median home value MEDV of 24.0, while the two towns with RAD = 3 have the highest values (33.4 and 36.2). This hints that lower accessibility to radial highways may go with a lower median home value.
- The town with the highest MEDV (36.2) also has the highest CRIM (0.06905) among these five rows.
- It might be noteworthy that a very small town will not have a suburb.


### Get information about the dataset using the info() method

In [None]:
#check the info
df.info()

#check the shape of the dataset
df.shape

**Observations**
* There are a total of 506 non-null observations in each of the columns. This indicates that there are no missing values in the data.
* Every column in this dataset is numeric in nature.
* Apart from the index column `Unnamed: 0`, there are 13 variables: 12 features that can help predict the housing prices of a town or a suburb, and the target MEDV.

In [None]:
#check the unique values in each column
df.nunique()

**Observations**
- `nunique()` counts distinct values, not missing ones. CRIM has 504 distinct values across 506 observations, so a few crime-rate values occur more than once; no CRIM data is missing.
- The index of accessibility to radial highways RAD takes only 9 distinct values, so it behaves like a discrete index and not like a continuous measurement.
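
The difference between counting distinct values and counting missing values can be checked on a small example. The sketch below uses a hypothetical toy DataFrame (not the Boston data) and is only meant to illustrate what `nunique()` and `isna().sum()` each report.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a" has a repeated value, "b" has a missing value.
toy = pd.DataFrame({
    "a": [0.1, 0.2, 0.2, 0.4],
    "b": [1.0, np.nan, 3.0, 4.0],
})

distinct_counts = toy.nunique()      # distinct non-missing values per column
missing_counts = toy.isna().sum()    # missing values per column

print(distinct_counts)
print(missing_counts)
```

A column can therefore have fewer distinct values than rows without having any missing data, which is the situation described for CRIM above.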

### Let's now check the summary statistics of this dataset

#### **Question 1:** Write the code to find the summary statistics and write your observations based on that. (1 Mark)

In [None]:
#create numerical columns 
numerical_columns = ['CRIM','ZN', 'INDUS', 'CHAS',
                    'NOX','RM','AGE','DIS','RAD',
                    'TAX','PTRATIO','LSTAT','MEDV']
numerical_columns

#check summary statistics
df[numerical_columns].describe().T


In [None]:
# get correlations
df_correlations = df.corr()

# figure 
fig, ax = plt.subplots(figsize=(16,10))

# mask 
# np.bool8 is a deprecated alias (removed in NumPy 2.0); the built-in bool works everywhere
mask = np.triu(np.ones_like(df_correlations, dtype=bool))

# adjust mask and dataframe
mask = mask[1:,:-1]
correlation = df_correlations.iloc[1:,:-1].copy()

# plot heatmap
sns.heatmap(correlation, mask=mask, annot=True, fmt=".2f", cmap='Blues', vmin=-1, vmax=1, cbar_kws={"shrink":.8})

# keep the y tick labels horizontal
plt.yticks(rotation=0)

plt.show()

# alternative matrix 
df_correlations.style.background_gradient(cmap='coolwarm')


**Observations:**
- The full-value property-tax rate per 10,000 dollars TAX has a very strong positive correlation (~0.91) with the index of accessibility to radial highways RAD. This may suggest that property-tax rates are higher in areas with better highway access.
- The pupil-teacher ratio by town PTRATIO shows little association with the per capita crime rate CRIM.
- Towns with greater weighted distances to the Boston employment centers DIS tend to have a lower per capita crime rate CRIM, although this negative correlation is not strong.
- One might assume that CRIM increases as the median value of owner-occupied homes MEDV increases; this is not the case. The correlation between MEDV and CRIM is negative, albeit not strong.
- The proportion of residential land zoned for lots over 25,000 sq.ft. ZN is negatively correlated with INDUS, NOX, AGE, RAD, TAX, and LSTAT. One hypothesis is that towns with more land zoned for large residential lots are newer developments, with fewer non-retail businesses INDUS, lower nitric oxides concentration NOX, newer homes AGE located further from radial highways RAD, and a smaller % lower status of the population LSTAT.
- The correlation matrix alone does not explain why towns with a higher ZN tend to have a lower full-value property-tax rate TAX.
- ZN has a fairly strong positive correlation with the weighted distances to five Boston employment centers DIS, i.e. towns further from the employment centers tend to have more land zoned for large residential lots. A possible hypothesis is that residential zoning and development follow the demand for housing around employment centers, but the correlation alone cannot confirm this.
- There is a strong positive correlation between the proportion of non-retail business acres per town INDUS and the nitric oxides concentration NOX, which possibly suggests that land used for non-retail business contributes to higher NOX. This seems expected; travel within a town may also add nitric oxides to the atmosphere, depending on the method of travel.
- Towns with a higher INDUS also tend to have a larger proportion of owner-occupied units built before 1940 AGE.
- As INDUS increases, RAD, TAX, the pupil-teacher ratio PTRATIO, and LSTAT also tend to increase. These observations suggest that towns near urban cores are more populated, have better access to employment, are taxed more heavily, and have more pupils per teacher.
- PTRATIO shows hardly any correlation with NOX, as one would expect.
- The average number of rooms per dwelling RM has weak negative correlations with most of the other independent features and a notably negative correlation with LSTAT: towns with more rooms per dwelling tend to have a smaller % lower status of the population.
- Towns with more homes built before 1940 AGE tend to have higher NOX and higher INDUS. One reading offered here is that older homes sit further apart from each other than newly built homes, though the correlations cannot confirm this.
- Towns with older homes AGE tend to have a larger share of lower-status population LSTAT.
- Towns with older homes AGE tend to be closer to the five Boston employment centers (lower DIS).
- Towns with older homes AGE tend to have a lower median value of owner-occupied homes MEDV.
- Towns with a higher RAD tend to have a lower ZN.
- DIS is only weakly correlated with MEDV.
- In the summary statistics, the mean of TAX is higher than its median, which suggests a right-skewed distribution: some towns have a much higher tax rate than the rest.

Before performing the modeling, it is important to check the univariate distribution of the variables.

### Univariate Analysis

### Check the distribution of the variables

In [None]:
# check the data distribution 
for i in df.columns:
    plt.figure(figsize=(7, 4))
    sns.histplot(data=df, x=i, kde = True)
    plt.show()

In [None]:
# set the cumulative distribution function
def CRIM_ECDF_Distribution(df): 
    #number of data points: n
    n=len(df)

    # x-data for the ECDF: x
    x=np.sort(df)

    #y-data for the ECDF: y
    y=np.arange(1,n+1)/n

    return x, y

x, y = CRIM_ECDF_Distribution(df["CRIM"])

plt.figure(figsize=(6,5))
sns.set()
plt.plot(x,y,marker=".",linestyle="none")
plt.xlabel("CRIM")
plt.ylabel("Cumulative Distribution Function")

# compare the empirical cumulative distribution function to a normal distribution
random_samples = np.random.normal(np.mean(df["CRIM"]), np.std(df["CRIM"]), size = 1000)

# transform data into x, y pairs 
x_theoretical, y_theoretical = CRIM_ECDF_Distribution(random_samples)

plt.plot(x_theoretical, y_theoretical)
# legend labels follow the plotting order: empirical data first, then the normal samples
plt.legend(("Empirical Data", "Normal Distribution"), loc = "lower right")
plt.show()

# test whether the CRIM sample differs from a normal distribution
print(df["CRIM"])
print(stats.normaltest(df["CRIM"]))



**Observations**
* **The variables CRIM and ZN are positively skewed.** This suggests that most of the areas have lower crime rates and most residential plots are under the area of 25,000 sq. ft.
* **The variable CHAS, with only 2 possible values 0 and 1, follows a binomial distribution**, and the majority of the houses are away from Charles river (CHAS = 0).
* The distribution of the variable AGE suggests that many of the owner-occupied houses were built before 1940. 
* **The variable DIS** (weighted distances to five Boston employment centers) **has a nearly exponential distribution**, which indicates that most of the houses are closer to these employment centers.
* **The variables TAX and RAD have a bimodal distribution**, indicating that the tax rate is possibly higher for some properties which have a high index of accessibility to radial highways.
* The dependent variable MEDV seems to be slightly right skewed.

As the dependent variable is slightly skewed, we will apply a **log transformation on the 'MEDV' column** and check the distribution of the transformed column.

In [None]:
df['MEDV_log'] = np.log(df['MEDV'])

In [None]:
sns.histplot(data=df, x='MEDV_log', kde = True)

**Observations**
* The log-transformed variable (**MEDV_log**) appears to have a **nearly normal distribution without skew**, and hence we can proceed.
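
The effect of a log transformation on skewness can also be checked numerically. The sketch below is a minimal illustration on synthetic right-skewed data (a lognormal sample standing in for a price-like column, not the Boston data); the helper `sample_skewness` is a hypothetical name introduced here.

```python
import numpy as np

def sample_skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

# Synthetic right-skewed data standing in for a price-like column (not the Boston data).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=3.0, sigma=0.5, size=5000)

skew_raw = sample_skewness(prices)
skew_log = sample_skewness(np.log(prices))
print(round(skew_raw, 2), round(skew_log, 2))
```

A skewness close to zero after the transformation is consistent with a roughly symmetric distribution; the same helper could be applied to `df['MEDV']` and `df['MEDV_log']`.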

Before creating the linear regression model, it is important to check the bivariate relationship between the variables. Let's check the same using the heatmap and scatterplot.

### Bivariate Analysis

#### Let's check the correlation using the heatmap 

### **Question 2** (3 Marks):
- **Write the code to plot the correlation heatmap between the variables (1 Mark)**
- **Write your observations (2 Marks)**

In [None]:
plt.figure(figsize=(12,8))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(______________,annot=True,fmt='.2f',cmap=cmap ) #write your code here
plt.show()

**Observations:______**

Now, we will visualize the relationship between the pairs of features having significant correlations.
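
One way to find the pairs worth plotting is to scan the correlation matrix for entries above the threshold. The sketch below assumes only NumPy and uses synthetic columns with hypothetical names; `strongly_correlated_pairs` is an illustrative helper, not part of the assignment template.

```python
import numpy as np

def strongly_correlated_pairs(data, names, threshold=0.7):
    """Return (name_i, name_j, r) for column pairs with |r| above the threshold."""
    corr = np.corrcoef(data, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic columns (hypothetical names): b tracks a closely, c is independent noise.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = 2.0 * a + rng.normal(scale=0.5, size=500)
c = rng.normal(size=500)
data = np.column_stack([a, b, c])

pairs = strongly_correlated_pairs(data, ["a", "b", "c"])
print(pairs)
```

Using the absolute value keeps strongly negative pairs in the list as well as strongly positive ones.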

### Visualizing the relationship between the features having significant correlations (> 0.7) 

### **Question 3** (6 Marks):
- **Create a scatter plot to visualize the relationship between the features having significant correlations (>0.7) (3 Marks)**
- **Write your observations from the plots (3 Marks)**

In [None]:
# scatterplot to visualize the relationship between NOX and INDUS
plt.figure(figsize=(6, 6))
#___________________________ #write you code here
plt.show()

**Observations:____**

In [None]:
# scatterplot to visualize the relationship between AGE and NOX
plt.figure(figsize=(6, 6))
#_____________________________ #Write your code here
plt.show()

**Observations:____**

In [None]:
# scatterplot to visualize the relationship between DIS and NOX
plt.figure(figsize=(6, 6))
#_____________________________ #Write your code here
plt.show()

**Observations:___**

In [None]:
# scatterplot to visualize the relationship between AGE and DIS
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'AGE', y = 'DIS', data = df)
plt.show()

**Observations:**
* The distance of the houses to the Boston employment centers appears to decrease moderately as the proportion of old houses in the town increases. It is possible that the Boston employment centers are located in the established towns, where the proportion of owner-occupied units built prior to 1940 is comparatively high.

In [None]:
# scatterplot to visualize the relationship between AGE and INDUS
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'AGE', y = 'INDUS', data = df)
plt.show()

**Observations:**
* No trend between the two variables is visible in the above plot.

In [None]:
# scatterplot to visualize the relationship between RAD and TAX
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'RAD', y = 'TAX', data = df)
plt.show()

**Observations:**
* The correlation between RAD and TAX is very high, but no clear trend is visible between the two variables. This might be due to outliers.

Let's check the correlation after removing the outliers.

In [None]:
# remove the data corresponding to high tax rate
df1 = df[df['TAX'] < 600]
# import the required function
from scipy.stats import pearsonr
# calculate the correlation
print('The correlation between TAX and RAD is', pearsonr(df1['TAX'], df1['RAD'])[0])

So the high correlation between TAX and RAD is due to the outliers. The tax rate for some properties might be higher due to some other reason.
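
To see how a group of extreme points can produce a high Pearson correlation on its own, consider the sketch below. It uses synthetic stand-in variables (not the real TAX and RAD columns): the main cluster is uncorrelated by construction, and a smaller group sits at high values of both variables.

```python
import numpy as np

rng = np.random.default_rng(2)

# Main cluster: two unrelated synthetic variables (stand-ins, not the real TAX and RAD).
x_main = rng.normal(loc=300.0, scale=50.0, size=400)
y_main = rng.normal(loc=5.0, scale=2.0, size=400)

# Smaller group that is extreme on both variables at once.
x_out = np.full(100, 700.0)
y_out = np.full(100, 20.0)

x_all = np.concatenate([x_main, x_out])
y_all = np.concatenate([y_main, y_out])

r_all = np.corrcoef(x_all, y_all)[0, 1]    # correlation with the extreme group
r_main = np.corrcoef(x_main, y_main)[0, 1]  # correlation within the main cluster only
print(round(r_all, 2), round(r_main, 2))
```

The overall coefficient is driven by the gap between the two groups and says little about the relationship within the main cluster, which is why the filtered correlation is worth checking.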

In [None]:
# scatterplot to visualize the relationship between INDUS and TAX
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'INDUS', y = 'TAX', data = df)
plt.show()

**Observations:**
* The tax rate appears to increase with the proportion of non-retail business acres per town. This might be because TAX and INDUS are both related to a third variable.

In [None]:
# scatterplot to visualize the relationship between RM and MEDV
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'RM', y = 'MEDV', data = df)
plt.show()

**Observations:**
* The price of the house seems to increase as the value of RM increases. This is expected as the price is generally higher for more rooms.

* There are a few outliers in a horizontal line as the MEDV value seems to be capped at 50.

In [None]:
# scatterplot to visualize the relationship between LSTAT and MEDV
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'LSTAT', y = 'MEDV', data = df)
plt.show()

**Observations:**
* The price of the house tends to decrease with an increase in LSTAT. This is plausible, as house prices tend to be lower in areas where a larger share of the population is of lower socioeconomic status.
* There are a few outliers and the data seems to be capped at 50.

We have seen that the variables LSTAT and RM have a linear relationship with the dependent variable MEDV. Also, there are significant relationships among a few independent variables, which is not desirable for a linear regression model. Let's first split the dataset.

### Split the dataset
Let's split the data into the dependent and independent variables, and then split it further into train and test sets in a 70:30 ratio.

In [None]:
# separate the dependent and independent variable
Y = df['MEDV_log']
X = df.drop(columns = ['MEDV', 'MEDV_log'])

# add the intercept term
X = sm.add_constant(X)

In [None]:
# splitting the data in 70:30 ratio of train to test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30 , random_state=1)

Next, we will check the multicollinearity in the train dataset.

### Check for Multicollinearity

We will use the Variance Inflation Factor (VIF), to check if there is multicollinearity in the data.

Features having a VIF score > 5 will be dropped/treated till all the features have a VIF score < 5
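
For intuition, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below computes this directly with NumPy on synthetic columns; `vif_scores` is a hypothetical helper written for illustration and is not the statsmodels implementation used in this notebook.

```python
import numpy as np

def vif_scores(features):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    features = np.asarray(features, dtype=float)
    n_rows, n_cols = features.shape
    scores = []
    for j in range(n_cols):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add an intercept
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ beta
        r_squared = 1.0 - residuals.var() / target.var()
        scores.append(1.0 / (1.0 - r_squared))
    return np.array(scores)

# Synthetic features: x2 is nearly a copy of x1, x3 is independent.
rng = np.random.default_rng(3)
x1 = rng.normal(size=400)
x2 = x1 + rng.normal(scale=0.3, size=400)
x3 = rng.normal(size=400)
vifs = vif_scores(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 2))
```

The two nearly duplicated columns receive a large VIF, while the independent column stays close to 1.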

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# function to check VIF
def checking_vif(train):
    vif = pd.DataFrame()
    vif["feature"] = train.columns

    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(train.values, i) for i in range(len(train.columns))
    ]
    return vif


print(checking_vif(X_train))

**Observations:**
* There are two variables with a high VIF: RAD and TAX. Let's remove TAX, as it has the highest VIF value, and check the multicollinearity again.

#### **Question 4:** Drop the column 'TAX' from the training data and check if multicollinearity is removed? (1 Mark)

In [None]:
# create the model after dropping TAX
X_train = #Write your code here

# check for VIF
print(checking_vif(X_train))

Now, we will create the linear regression model as the VIF is less than 5 for all the independent variables, and we can assume that multicollinearity has been removed between the variables.

#### **Question 5:** Write the code to create the linear regression model and print the model summary. Write your observations from the model. (3 Marks)

In [None]:
# create the model
model1 = #write your code here

# get the model summary
model1.summary()

**Observations:_____**

#### **Question 6:** Drop insignificant variables from the above model and create the regression model again. (2 Marks)

### Examining the significance of the model

It is not enough to fit a multiple regression model to the data; it is also necessary to check whether each regression coefficient is significant. Significance here means that the population regression parameter is significantly different from zero.

From the above it may be noted that the regression coefficients corresponding to ZN, AGE, and INDUS are not statistically significant at level α = 0.05. In other words, the regression coefficients corresponding to these three are not significantly different from 0 in the population. Hence, we will eliminate the three features and create a new model.
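
The significance check rests on the t statistic of each coefficient, i.e. the estimate divided by its standard error; in large samples an absolute value above roughly 1.96 corresponds to a p-value below 0.05. The sketch below computes these quantities with NumPy on synthetic data. The helper `ols_t_statistics` is hypothetical, and the model summary printed by statsmodels reports the same kind of statistics together with exact p-values.

```python
import numpy as np

def ols_t_statistics(design, response):
    """Fit OLS and return (coefficients, standard errors, t statistics)."""
    n_rows, n_params = design.shape
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    residuals = response - design @ beta
    sigma_squared = residuals @ residuals / (n_rows - n_params)  # unbiased error variance
    covariance = sigma_squared * np.linalg.inv(design.T @ design)
    std_errors = np.sqrt(np.diag(covariance))
    return beta, std_errors, beta / std_errors

# Synthetic data: the response depends on "useful" but not on "noise".
rng = np.random.default_rng(4)
useful = rng.normal(size=300)
noise = rng.normal(size=300)
response = 1.0 + 0.8 * useful + rng.normal(scale=0.5, size=300)
design = np.column_stack([np.ones(300), useful, noise])

beta, std_errors, t_stats = ols_t_statistics(design, response)
print(np.round(t_stats, 2))
```

The coefficient on the unrelated column has an estimate near zero and a much smaller t statistic than the useful one, which is the pattern that motivates dropping a feature.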

In [None]:
# create the model after dropping columns 'MEDV', 'MEDV_log', 'TAX', 'ZN', 'AGE', 'INDUS' from df dataframe
Y = df['MEDV_log']
X = df.drop(_____________________________) #write your code here
X = sm.add_constant(X)

#splitting the data in 70:30 ratio of train to test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30 , random_state=1)

# create the model
model2 = __________________________ #write your code here
# get the model summary
model2.summary()

**Observations:**
* We can see that the **R-squared value has decreased by 0.002**, since we have removed variables from the model, whereas the **adjusted R-squared value has increased by 0.001**, since we removed statistically insignificant variables only.
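
The opposite movement of the two measures follows from the adjusted R-squared formula, which penalizes the number of predictors. The sketch below uses hypothetical numbers (not the results of this notebook) to show a case where removing weak predictors lowers R-squared slightly but raises adjusted R-squared.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p excluding the intercept."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical numbers, not the notebook's results: dropping 3 weak predictors
# lowers R^2 slightly but raises adjusted R^2.
full_model = adjusted_r_squared(0.770, n_obs=100, n_predictors=11)
reduced_model = adjusted_r_squared(0.768, n_obs=100, n_predictors=8)
print(round(full_model, 4), round(reduced_model, 4))
```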

Now, we will check the linear regression assumptions.

### Check the below linear regression assumptions

1. **Mean of residuals should be 0**
2. **No Heteroscedasticity**
3. **Linearity of variables**
4. **Normality of error terms**

#### **Question 7:** Write the code to check the above linear regression assumptions and provide insights. (4 Marks)

#### Check for mean residuals

In [None]:
residuals = 

# Write your code here

**Observations:____**
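
As background for this check: when an OLS model includes an intercept, the residuals sum to zero by construction, so their mean is zero up to floating-point error. The sketch below illustrates this on synthetic data with NumPy, and contrasts it with a fit that omits the intercept; all variable names are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=200)   # synthetic data

# Fit with an intercept column: the residual mean is zero up to rounding.
with_intercept = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(with_intercept, y, rcond=None)
residual_mean = float(np.mean(y - with_intercept @ beta))

# Without the intercept column the residual mean is generally not zero.
slope_only, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
residual_mean_no_intercept = float(np.mean(y - x[:, None] @ slope_only))

print(residual_mean, residual_mean_no_intercept)
```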

#### Check for homoscedasticity

* Homoscedasticity - If the variance of the residuals is constant across the range of fitted values, so that they are spread evenly around the regression line, then the data is said to be homoscedastic.

* Heteroscedasticity - If the variance of the residuals changes across the range of fitted values, then the data is said to be heteroscedastic. In this case, the residuals can form a funnel shape or any other non-symmetrical shape.

* We'll use the `Goldfeld-Quandt` test to test the following hypotheses with alpha = 0.05:

    - Null hypothesis: Residuals are homoscedastic
    - Alternate hypothesis: Residuals are heteroscedastic
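
The idea behind the Goldfeld-Quandt test can be sketched by hand: sort the observations by a predictor, fit separate regressions on the lower and upper halves, and compare their residual variances with an F ratio. The sketch below assumes a single predictor, an even split with no dropped middle observations, and synthetic data; `goldfeld_quandt_f` is a hypothetical helper that returns only the F statistic, whereas the statsmodels function also reports a p-value.

```python
import numpy as np

def goldfeld_quandt_f(x, y):
    """F = residual variance of the upper half / lower half, after sorting by x."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    half = len(x) // 2

    def residual_variance(x_part, y_part):
        design = np.column_stack([np.ones_like(x_part), x_part])
        beta, *_ = np.linalg.lstsq(design, y_part, rcond=None)
        residuals = y_part - design @ beta
        return residuals @ residuals / (len(x_part) - design.shape[1])

    low = residual_variance(x_sorted[:half], y_sorted[:half])
    high = residual_variance(x_sorted[-half:], y_sorted[-half:])
    return float(high / low)

rng = np.random.default_rng(6)
x = rng.uniform(1.0, 10.0, size=400)
y_constant = 2.0 * x + rng.normal(scale=1.0, size=400)     # constant error variance
y_funnel = 2.0 * x + rng.normal(scale=0.5 * x, size=400)   # error spread grows with x

f_constant = goldfeld_quandt_f(x, y_constant)
f_funnel = goldfeld_quandt_f(x, y_funnel)
print(round(f_constant, 2), round(f_funnel, 2))
```

An F ratio near 1 is consistent with the null hypothesis of homoscedasticity, while a ratio well above 1 points to error variance that increases with the predictor.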

In [None]:
from statsmodels.stats.diagnostic import het_white
from statsmodels.compat import lzip
import statsmodels.stats.api as sms

In [None]:
name = ["F statistic", "p-value"]
test = ____________________________
lzip(name, test)

**Observations:____**

#### Linearity of variables

This assumption states that the predictor variables must have a linear relationship with the dependent variable.

To test the assumption, we'll plot the residuals against the fitted values and check that the residuals do not form a strong pattern. They should be scattered randomly and evenly around zero along the x-axis.

In [None]:
# predicted values
fitted = model2.fittedvalues

# sns.set_style("whitegrid")
sns.residplot(x = ______, y = ________, color="lightblue", lowess=True) #write your code here
plt.xlabel("Fitted Values")
plt.ylabel("Residual")
plt.title("Residual PLOT")
plt.show()

**Observations:_____**

#### Normality of error terms
The residuals should be normally distributed.

In [None]:
# Plot histogram of residuals
#write your code here

In [None]:
# Plot q-q plot of residuals
import pylab
import scipy.stats as stats

stats.probplot(residuals, dist="norm", plot=pylab)
plt.show()

**Observations:_____**

### Check the performance of the model on the train and test data set

#### **Question 8:** Write your observations by comparing model performance of train and test dataset (2 Marks)

In [None]:
# RMSE
def rmse(predictions, targets):
    return np.sqrt(((targets - predictions) ** 2).mean())


# MAPE
def mape(predictions, targets):
    return np.mean(np.abs((targets - predictions)) / targets) * 100


# MAE
def mae(predictions, targets):
    return np.mean(np.abs((targets - predictions)))


# Model Performance on test and train data
def model_pref(olsmodel, x_train, x_test):

    # In-sample Prediction
    y_pred_train = olsmodel.predict(x_train)
    y_observed_train = y_train

    # Prediction on test data
    y_pred_test = olsmodel.predict(x_test)
    y_observed_test = y_test

    print(
        pd.DataFrame(
            {
                "Data": ["Train", "Test"],
                "RMSE": [
                    rmse(y_pred_train, y_observed_train),
                    rmse(y_pred_test, y_observed_test),
                ],
                "MAE": [
                    mae(y_pred_train, y_observed_train),
                    mae(y_pred_test, y_observed_test),
                ],
                "MAPE": [
                    mape(y_pred_train, y_observed_train),
                    mape(y_pred_test, y_observed_test),
                ],
            }
        )
    )


# Checking model performance
model_pref(model2, X_train, X_test)  

**Observations:____**

#### Apply cross validation to improve the model and evaluate it using different evaluation metrics

#### **Question 9:** Apply the cross validation technique to improve the model and evaluate it using different evaluation metrics. (1 Mark)

In [None]:
# import the required function

from sklearn.model_selection import cross_val_score

# build the regression model and cross-validate
linearregression = LinearRegression()                                    

cv_Score11 = #write your code here
cv_Score12 = #write your code here                                


print("RSquared: %0.3f (+/- %0.3f)" % (cv_Score11.mean(), cv_Score11.std() * 2))
print("Mean Squared Error: %0.3f (+/- %0.3f)" % (-1*cv_Score12.mean(), cv_Score12.std() * 2))

**Observations**
- The R-squared on cross validation is 0.729, whereas on the training dataset it was 0.769.
- The MSE on cross validation is 0.041, whereas on the training dataset it was 0.038.

We may want to repeat the model building process with new features or better feature engineering to increase the R-squared and decrease the MSE on cross validation.
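
To make explicit what the cross-validated scores measure, the sketch below implements k-fold cross validation by hand with NumPy on synthetic data. It is an illustration under stated assumptions: the folds are shuffled with a fixed seed, the model is plain OLS with an intercept, and `k_fold_scores` is a hypothetical helper, not the scikit-learn routine used in the cell above.

```python
import numpy as np

def k_fold_scores(features, target, n_folds=10, seed=0):
    """Return per-fold (R^2, MSE) for OLS evaluated on each held-out fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(target))
    folds = np.array_split(indices, n_folds)
    r2_scores, mse_scores = [], []
    for fold in folds:
        train = np.setdiff1d(indices, fold)   # every row not in the held-out fold
        design_train = np.column_stack([np.ones(len(train)), features[train]])
        design_test = np.column_stack([np.ones(len(fold)), features[fold]])
        beta, *_ = np.linalg.lstsq(design_train, target[train], rcond=None)
        errors = target[fold] - design_test @ beta
        mse = float(np.mean(errors ** 2))
        r2_scores.append(1.0 - mse / float(np.var(target[fold])))
        mse_scores.append(mse)
    return np.array(r2_scores), np.array(mse_scores)

rng = np.random.default_rng(7)
features = rng.normal(size=(300, 3))
target = features @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.3, size=300)

r2, mse = k_fold_scores(features, target)
print("RSquared: %0.3f (+/- %0.3f)" % (r2.mean(), r2.std() * 2))
print("Mean Squared Error: %0.3f (+/- %0.3f)" % (mse.mean(), mse.std() * 2))
```

Each score is computed on rows the model did not see during fitting, which is why cross-validated scores are usually somewhat worse than the training scores.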

#### **Question 10:** Get model Coefficients in a pandas dataframe with column 'Feature' having all the features and column 'Coefs' with all the corresponding Coefs. Write the regression equation. (2 Marks)

In [None]:
coef = #write your code here

In [None]:
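# Let us write the equation of the fit
# (assumes coef is a pandas Series of coefficients indexed by feature name)
Equation = "log (Price) ="
# positional access via .iloc avoids ambiguous integer lookups on a labelled index
terms = ['( {} ) * {}'.format(coef.iloc[i], coef.index[i]) for i in range(len(coef))]
# joining the terms avoids a dangling '+' at the end of the equation
print(Equation, ' + '.join(terms))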
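
Because the model is fitted on log(Price), each coefficient acts multiplicatively on the price: a one-unit increase in a feature changes the predicted price by (exp(coefficient) - 1) * 100 percent, holding the other features fixed. The sketch below uses hypothetical coefficient values, not those of the fitted model, and `percent_change_per_unit` is a name introduced here for illustration.

```python
import math

def percent_change_per_unit(coefficient):
    """Percent change in price for a one-unit feature increase when the target is log(price)."""
    return (math.exp(coefficient) - 1.0) * 100.0

# Hypothetical coefficients, not taken from the fitted model.
print(round(percent_change_per_unit(0.05), 2))
print(round(percent_change_per_unit(-0.03), 2))
```

For small coefficients the percent change is close to the coefficient multiplied by 100, which gives a quick way to read the regression equation.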

#### **Question 11:** Write the conclusions and business recommendations derived from the model. (5 Marks)

Write Conclusions here

Write Recommendations here