
<h1 id="Project-Linear-Regression:-Boston-House-Price-Prediction">Project Linear Regression: Boston House Price Prediction<a class="anchor-link" href="#Project-Linear-Regression:-Boston-House-Price-Prediction">¶</a></h1>



<p>Welcome to the project on Linear Regression. We will use the Boston house price data for the exercise.</p>
<hr/>
<h2 id="Problem-Statement">Problem Statement<a class="anchor-link" href="#Problem-Statement">¶</a></h2><hr/>
<p>The problem on hand is to predict the housing prices of a town or a suburb based on the features of the locality provided to us. In the process, we need to identify the most important features in the dataset. We need to employ techniques of data preprocessing and build a linear regression model that predicts the prices for us.</p>
<hr/>
<h2 id="Data-Information">Data Information<a class="anchor-link" href="#Data-Information">¶</a></h2><hr/>
<p>Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. Detailed attribute information is given below.</p>
<p>Attribute Information (in order):</p>
<ul>
<li><strong>CRIM:</strong>     per capita crime rate by town</li>
<li><strong>ZN:</strong>       proportion of residential land zoned for lots over 25,000 sq.ft.</li>
<li><strong>INDUS:</strong>    proportion of non-retail business acres per town</li>
<li><strong>CHAS:</strong>     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)</li>
<li><strong>NOX:</strong>      nitric oxides concentration (parts per 10 million)</li>
<li><strong>RM:</strong>       average number of rooms per dwelling</li>
<li><strong>AGE:</strong>     proportion of owner-occupied units built before 1940</li>
<li><strong>DIS:</strong>      weighted distances to five Boston employment centers</li>
<li><strong>RAD:</strong>      index of accessibility to radial highways</li>
<li><strong>TAX:</strong>      full-value property-tax rate per 10,000 dollars</li>
<li><strong>PTRATIO:</strong>  pupil-teacher ratio by town</li>
<li><strong>LSTAT:</strong>    percentage of lower-status population</li>
<li><strong>MEDV:</strong>     median value of owner-occupied homes in thousands of dollars</li>
</ul>



<h3 id="Let-us-start-by-importing-the-required-libraries">Let us start by importing the required libraries<a class="anchor-link" href="#Let-us-start-by-importing-the-required-libraries">¶</a></h3>


In [None]:


# import libraries for data manipulation
import pandas as pd
import numpy as np

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.gofplots import ProbPlot

# import libraries for building linear regression model
from statsmodels.formula.api import ols
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# import library for preparing data
from sklearn.model_selection import train_test_split

# import library for data preprocessing
from sklearn.preprocessing import MinMaxScaler

import warnings
warnings.filterwarnings("ignore")





<h3 id="Read-the-dataset">Read the dataset<a class="anchor-link" href="#Read-the-dataset">¶</a></h3>


In [None]:


df = pd.read_csv("Boston.csv")
df.head()





<p><strong>Observations</strong></p>
<ul>
<li>The price of the house indicated by the variable MEDV is the target variable and the rest are the independent variables based on which we will predict house price.</li>
</ul>



<h3 id="Get-information-about-the-dataset-using-the-info()-method">Get information about the dataset using the info() method<a class="anchor-link" href="#Get-information-about-the-dataset-using-the-info()-method">¶</a></h3>


In [None]:


df.info()





<p><strong>Observations</strong></p>
<ul>
<li><p>There are a total of 506 non-null observations in each of the columns. This indicates that there are no missing values in the data.</p>
</li>
<li><p>Every column in this dataset is numeric in nature.</p>
</li>
</ul>



<h3 id="Let's-now-check-the-summary-statistics-of-this-dataset">Let's now check the summary statistics of this dataset<a class="anchor-link" href="#Let's-now-check-the-summary-statistics-of-this-dataset">¶</a></h3>



<h4 id="Question-1:-Write-the-code-to-find-the-summary-statistics-and-write-your-observations-based-on-that.-(1-Mark)"><strong>Question 1:</strong> Write the code to find the summary statistics and write your observations based on that. (1 Mark)<a class="anchor-link" href="#Question-1:-Write-the-code-to-find-the-summary-statistics-and-write-your-observations-based-on-that.-(1-Mark)">¶</a></h4>


In [None]:


#write your code here
df.describe().T





<p><strong>Observations: The mean and the median (50%) of MEDV are close, which suggests that its distribution is roughly symmetric, although the histogram is a more reliable check than these two numbers. The maximum of RM is 8.78. The value is fractional because RM is the average number of rooms per dwelling in a town, not a count for a single house.</strong></p>



<p>Before performing the modeling, it is important to check the univariate distribution of the variables.</p>



<h3 id="Univariate-Analysis">Univariate Analysis<a class="anchor-link" href="#Univariate-Analysis">¶</a></h3>



<h3 id="Check-the-distribution-of-the-variables">Check the distribution of the variables<a class="anchor-link" href="#Check-the-distribution-of-the-variables">¶</a></h3>


In [None]:


# let's plot all the columns to look at their distributions
for i in df.columns:
    plt.figure(figsize=(7, 4))
    sns.histplot(data=df, x=i, kde = True)
    plt.show()





<p><strong>Observations</strong></p>
<ul>
<li><strong>The variables CRIM and ZN are positively skewed.</strong> This suggests that most of the areas have lower crime rates and most residential plots are under the area of 25,000 sq. ft.</li>
<li><strong>The variable CHAS, with only 2 possible values 0 and 1, follows a binomial distribution</strong>, and the majority of the houses are away from Charles river (CHAS = 0).</li>
<li>The distribution of the variable AGE suggests that many of the owner-occupied houses were built before 1940. </li>
<li><strong>The variable DIS</strong> (weighted distances to five Boston employment centers) <strong>has a nearly exponential distribution</strong>, which indicates that most of the houses are closer to these employment centers.</li>
<li><strong>The variables TAX and RAD have a bimodal distribution</strong>, indicating that the tax rate is possibly higher for some properties which have a high index of accessibility to radial highways.</li>
<li>The dependent variable MEDV seems to be slightly right skewed.</li>
</ul>



<p>As the dependent variable is slightly skewed, we will apply a <strong>log transformation on the 'MEDV' column</strong> and check the distribution of the transformed column.</p>


In [None]:


df['MEDV_log'] = np.log(df['MEDV'])




In [None]:


sns.histplot(data=df, x='MEDV_log', kde = True)





<p><strong>Observations</strong></p>
<ul>
<li>The log-transformed variable (<strong>MEDV_log</strong>) appears to have a <strong>nearly normal distribution without skew</strong>, and hence we can proceed.</li>
</ul>
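
<p>The effect of the log transformation can also be checked numerically by comparing skewness before and after. The following is a minimal sketch on hypothetical log-normal "prices" (not the Boston data); the helper <code>skewness</code> is defined here only for illustration.</p>

```python
import math
import random
import statistics


def skewness(values):
    """Skewness: mean cubed deviation divided by the cubed standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * std ** 3)


# Hypothetical right-skewed prices standing in for MEDV (assumption, not real data)
random.seed(0)
prices = [random.lognormvariate(3.0, 0.4) for _ in range(500)]
log_prices = [math.log(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

<p>On the notebook's dataframe, the same comparison could be made with <code>df['MEDV'].skew()</code> and <code>df['MEDV_log'].skew()</code>; a value closer to 0 indicates a more symmetric distribution.</p>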



<p>Before creating the linear regression model, it is important to check the bivariate relationship between the variables. Let's check the same using the heatmap and scatterplot.</p>



<h3 id="Bivariate-Analysis">Bivariate Analysis<a class="anchor-link" href="#Bivariate-Analysis">¶</a></h3>



<h4 id="Let's-check-the-correlation-using-the-heatmap">Let's check the correlation using the heatmap<a class="anchor-link" href="#Let's-check-the-correlation-using-the-heatmap">¶</a></h4>



<h3 id="Question-2-(3-Marks):"><strong>Question 2</strong> (3 Marks):<a class="anchor-link" href="#Question-2-(3-Marks):">¶</a></h3><ul>
<li><strong>Write the code to plot the correlation heatmap between the variables (1 Mark)</strong></li>
<li><strong>Write your observations (2 Marks)</strong></li>
</ul>


In [None]:


plt.figure(figsize=(12,8))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap=cmap)
plt.show()





<p><strong>Observations:</strong></p>
<ul>
<li>Strong negative correlation between DIS and INDUS</li>
<li>Strong negative correlation between DIS and NOX</li>
<li>Strong negative correlation between DIS and AGE</li>
<li>Strong positive correlation between NOX and INDUS</li>
<li>Strong positive correlation between TAX and INDUS</li>
</ul>
<p><strong>Correlations with the dependent variable:</strong></p>
<ul>
<li>Strong negative correlation between MEDV and LSTAT</li>
<li>Strong negative correlation between MEDV_log and LSTAT</li>
<li>Strong positive correlation between MEDV and RM</li>
</ul>



<p>Now, we will visualize the relationship between the pairs of features having significant correlations.</p>



<h3 id="Visualizing-the-relationship-between-the-features-having-significant-correlations-(&gt;-0.7)">Visualizing the relationship between the features having significant correlations (&gt; 0.7)<a class="anchor-link" href="#Visualizing-the-relationship-between-the-features-having-significant-correlations-(&gt;-0.7)">¶</a></h3>



<h3 id="Question-3-(6-Marks):"><strong>Question 3</strong> (6 Marks):<a class="anchor-link" href="#Question-3-(6-Marks):">¶</a></h3><ul>
<li><strong>Create a scatter plot to visualize the relationship between the features having significant correlations (&gt;0.7) (3 Marks)</strong></li>
<li><strong>Write your observations from the plots (3 Marks)</strong></li>
</ul>


In [None]:


# scatterplot to visualize the relationship between AGE and NOX
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'AGE', y = 'NOX', data = df)
plt.show()





<p><strong>Observations: The trend is not very clear, and it looks almost logarithmic. NOX tends to increase as the proportion of owner-occupied units built before 1940 increases, but this does not hold everywhere.</strong></p>


In [None]:


# scatterplot to visualize the relationship between NOX and INDUS
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'NOX', y = 'INDUS', data = df)
plt.show()





<p><strong>Observations: INDUS tends to increase as NOX increases, i.e., towns with a higher nitric oxide concentration tend to have a higher proportion of non-retail business acres.</strong></p>


In [None]:


# scatterplot to visualize the relationship between MEDV and RM
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'MEDV', y = 'RM', data = df)
plt.show()





<p><strong>Observations: The median value of owner-occupied homes increases as the average number of rooms increases.</strong></p>


In [None]:


# scatterplot to visualize the relationship between AGE and DIS
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'AGE', y = 'DIS', data = df)
plt.show()





<p><strong>Observations:</strong></p>
<ul>
<li>The distance of the houses to the Boston employment centers appears to decrease moderately as the proportion of old houses in the town increases. It is possible that the Boston employment centers are located in the established towns where the proportion of owner-occupied units built prior to 1940 is comparatively high.</li>
</ul>


In [None]:


# scatterplot to visualize the relationship between AGE and INDUS
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'AGE', y = 'INDUS', data = df)
plt.show()





<p><strong>Observations:</strong></p>
<ul>
<li>No trend between the two variables is visible in the above plot.</li>
</ul>


In [None]:


# scatterplot to visualize the relationship between RAD and TAX
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'RAD', y = 'TAX', data = df)
plt.show()





<p><strong>Observations:</strong></p>
<ul>
<li>The correlation between RAD and TAX is very high. But, no trend is visible between the two variables. 
This might be due to outliers. </li>
</ul>



<p>Let's check the correlation after removing the outliers.</p>


In [None]:


# remove the data corresponding to high tax rate
df1 = df[df['TAX'] < 600]
# import the required function
from scipy.stats import pearsonr
# calculate the correlation
print('The correlation between TAX and RAD is', pearsonr(df1['TAX'], df1['RAD'])[0])





<p>So the high correlation between TAX and RAD is due to the outliers. The tax rate for some properties might be higher due to some other reason.</p>
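
<p>To see how a separate cluster of extreme points can create a high correlation on its own, here is a minimal sketch with hypothetical numbers (not the Boston data). The two variables are generated independently in the main cloud, and a tight cluster far away from it is then added; the helper <code>pearson</code> is written out only for illustration.</p>

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


random.seed(1)
# Main cloud: x and y are drawn independently, so there is no real relationship
x_main = [random.uniform(1, 8) for _ in range(300)]
y_main = [random.uniform(200, 450) for _ in range(300)]
# Hypothetical cluster with extreme values in both variables
x_far = [24.0] * 100
y_far = [random.uniform(660, 670) for _ in range(100)]

r_all = pearson(x_main + x_far, y_main + y_far)
r_main = pearson(x_main, y_main)
print(f"correlation with the cluster:    {r_all:.2f}")
print(f"correlation without the cluster: {r_main:.2f}")
```

<p>The correlation computed on all points is dominated by the gap between the two groups, which is the same mechanism that the filter <code>df['TAX'] &lt; 600</code> removes in the notebook.</p>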


In [None]:


# scatterplot to visualize the relationship between INDUS and TAX
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'INDUS', y = 'TAX', data = df)
plt.show()





<p><strong>Observations:</strong></p>
<ul>
<li>The tax rate appears to increase with an increase in the proportion of non-retail business acres per town. This might be because TAX and INDUS are both related to a third variable.</li>
</ul>


In [None]:


# scatterplot to visualize the relationship between RM and MEDV
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'RM', y = 'MEDV', data = df)
plt.show()





<p><strong>Observations:</strong></p>
<ul>
<li><p>The price of the house seems to increase as the value of RM increases. This is expected as the price is generally higher for more rooms.</p>
</li>
<li><p>There are a few outliers in a horizontal line as the MEDV value seems to be capped at 50.</p>
</li>
</ul>


In [None]:


# scatterplot to visualize the relationship between LSTAT and MEDV
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'LSTAT', y = 'MEDV', data = df)
plt.show()





<p><strong>Observations:</strong></p>
<ul>
<li>The price of the house tends to decrease with an increase in LSTAT. This is plausible, as house prices tend to be lower in areas with a higher share of lower-status population.</li>
<li>There are a few outliers, and the data seems to be capped at 50.</li>
</ul>



<p>We have seen that the variables LSTAT and RM have a linear relationship with the dependent variable MEDV. Also, there are significant relationships among a few independent variables, which is not desirable for a linear regression model. Let's first split the dataset.</p>



<h3 id="Split-the-dataset">Split the dataset<a class="anchor-link" href="#Split-the-dataset">¶</a></h3><p>Let's split the data into the dependent and independent variables, and then into train and test sets in a 70:30 ratio.</p>


In [None]:


# separate the dependent and independent variable
Y = df['MEDV_log']
X = df.drop(columns = ['MEDV', 'MEDV_log'])

# add the intercept term
X = sm.add_constant(X)




In [None]:


# splitting the data in 70:30 ratio of train to test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30 , random_state=1)





<p>Next, we will check the multicollinearity in the train dataset.</p>



<h3 id="Check-for-Multicollinearity">Check for Multicollinearity<a class="anchor-link" href="#Check-for-Multicollinearity">¶</a></h3><p>We will use the Variance Inflation Factor (VIF), to check if there is multicollinearity in the data.</p>
<p>Features having a VIF score &gt; 5 will be dropped/treated till all the features have a VIF score &lt; 5</p>
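
<p>For intuition, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it by hand with NumPy on hypothetical synthetic columns; <code>vif_by_hand</code> is an illustrative helper, not the statsmodels implementation.</p>

```python
import numpy as np


def vif_by_hand(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r_squared = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs


# Hypothetical features: b is almost a copy of a, c is unrelated to both
rng = np.random.default_rng(0)
a = rng.normal(size=400)
b = a + rng.normal(scale=0.3, size=400)
c = rng.normal(size=400)

vifs = vif_by_hand(np.column_stack([a, b, c]))
for name, value in zip(["a", "b", "c"], vifs):
    print(f"VIF({name}) = {value:.2f}")
```

<p>The two nearly collinear columns get a VIF well above 5, while the unrelated column stays close to 1. The sketch adds the intercept column itself inside the helper, whereas the notebook passes a design matrix that already contains the constant added by <code>sm.add_constant</code>.</p>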


In [None]:


from statsmodels.stats.outliers_influence import variance_inflation_factor

# function to check VIF
def checking_vif(train):
    vif = pd.DataFrame()
    vif["feature"] = train.columns

    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(train.values, i) for i in range(len(train.columns))
    ]
    return vif


print(checking_vif(X_train))





<p><strong>Observations:</strong></p>
<ul>
<li>There are two variables with a high VIF: RAD and TAX. Let's remove TAX, as it has the highest VIF value, and check the multicollinearity again.</li>
</ul>



<h4 id="Question-4:-Drop-the-column-'TAX'-from-the-training-data-and-check-if-multicollinearity-is-removed?-(1-Mark)"><strong>Question 4:</strong> Drop the column 'TAX' from the training data and check if multicollinearity is removed? (1 Mark)<a class="anchor-link" href="#Question-4:-Drop-the-column-'TAX'-from-the-training-data-and-check-if-multicollinearity-is-removed?-(1-Mark)">¶</a></h4>


In [None]:


# create the model after dropping TAX
X_train = X_train.drop(['TAX'], axis=1)

# check for VIF
print(checking_vif(X_train))





<p>Now, we will create the linear regression model as the VIF is less than 5 for all the independent variables, and we can assume that multicollinearity has been removed between the variables.</p>



<h4 id="Question-5:-Write-the-code-to-create-the-linear-regression-model-and-print-the-model-summary.-Write-your-observations-from-the-model.-(3-Marks)"><strong>Question 5:</strong> Write the code to create the linear regression model and print the model summary. Write your observations from the model. (3 Marks)<a class="anchor-link" href="#Question-5:-Write-the-code-to-create-the-linear-regression-model-and-print-the-model-summary.-Write-your-observations-from-the-model.-(3-Marks)">¶</a></h4>


In [None]:


# create the model
model1 = sm.OLS(y_train, X_train).fit() #write your code here

# get the model summary
model1.summary()





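<p><strong>Observations: ZN, INDUS, and AGE have p-values above 0.05, so their coefficients are not statistically significant. A coefficient being close to 0 does not by itself indicate insignificance, since its size depends on the scale of the feature.
NOX has a strong negative relationship with MEDV_log.</strong></p>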



<h4 id="Question-6:-Drop-insignificant-variables-from-the-above-model-and-create-the-regression-model-again.-(2-Marks)"><strong>Question 6:</strong> Drop insignificant variables from the above model and create the regression model again. (2 Marks)<a class="anchor-link" href="#Question-6:-Drop-insignificant-variables-from-the-above-model-and-create-the-regression-model-again.-(2-Marks)">¶</a></h4>



<h3 id="Examining-the-significance-of-the-model">Examining the significance of the model<a class="anchor-link" href="#Examining-the-significance-of-the-model">¶</a></h3><p>It is not enough to fit a multiple regression model to the data; it is also necessary to check whether all the regression coefficients are significant. Significance here means whether the population regression parameters are significantly different from zero.</p>
<p>From the above it may be noted that the regression coefficients corresponding to ZN, AGE, and INDUS are not statistically significant at level α = 0.05. In other words, the regression coefficients corresponding to these three are not significantly different from 0 in the population. Hence, we will eliminate the three features and create a new model.</p>


In [None]:


# create the model after dropping columns 'MEDV', 'MEDV_log', 'TAX', 'ZN', 'AGE', 'INDUS' from df dataframe
Y = df['MEDV_log']
X = df.drop(['MEDV', 'MEDV_log', 'TAX', 'ZN', 'AGE', 'INDUS'], axis =1) #write your code here
X = sm.add_constant(X)

#splitting the data in 70:30 ratio of train to test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30 , random_state=1)

# create the model
model2 = sm.OLS(y_train, X_train).fit() #write your code here
# get the model summary
model2.summary()





<p><strong>Observations:</strong></p>
<ul>
<li>We can see that the <strong>R-squared value has decreased by 0.002</strong>, since we have removed variables from the model, whereas the <strong>adjusted R-squared value has increased by 0.001</strong>, since we removed statistically insignificant variables only.</li>
</ul>
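
<p>The opposite movement of the two metrics follows from the definition of adjusted R-squared, 1 - (1 - R²)(n - 1)/(n - p - 1), which penalizes the number of predictors p. Below is a minimal sketch with hypothetical values chosen only to illustrate the effect; they are not the values from the model summaries.</p>

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors features."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical numbers (assumptions): a full model and a reduced model
n_obs = 354
r2_full, p_full = 0.7710, 11
r2_reduced, p_reduced = 0.7695, 8

full = adjusted_r2(r2_full, n_obs, p_full)
reduced = adjusted_r2(r2_reduced, n_obs, p_reduced)
print(f"full model:    R2 = {r2_full:.4f}, adjusted R2 = {full:.4f}")
print(f"reduced model: R2 = {r2_reduced:.4f}, adjusted R2 = {reduced:.4f}")
```

<p>Dropping predictors can never raise plain R-squared on the training data, but the smaller penalty can raise the adjusted value when the dropped predictors explained very little.</p>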



<p>Now, we will check the linear regression assumptions.</p>



<h3 id="Check-the-below-linear-regression-assumptions">Check the below linear regression assumptions<a class="anchor-link" href="#Check-the-below-linear-regression-assumptions">¶</a></h3><ol>
<li><strong>Mean of residuals should be 0</strong></li>
<li><strong>No Heteroscedasticity</strong></li>
<li><strong>Linearity of variables</strong></li>
<li><strong>Normality of error terms</strong></li>
</ol>



<h4 id="Question-7:-Write-the-code-to-check-the-above-linear-regression-assumptions-and-provide-insights.-(4-Marks)"><strong>Question 7:</strong> Write the code to check the above linear regression assumptions and provide insights. (4 Marks)<a class="anchor-link" href="#Question-7:-Write-the-code-to-check-the-above-linear-regression-assumptions-and-provide-insights.-(4-Marks)">¶</a></h4>



<h4 id="Check-for-mean-residuals">Check for mean residuals<a class="anchor-link" href="#Check-for-mean-residuals">¶</a></h4>


In [None]:


residuals = model2.resid

# Write your code here
residuals.mean()





<p><strong>Observations:The mean of residuals is very close to 0. Hence, the corresponding assumption is satisfied.</strong></p>
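
<p>It is worth noting that a residual mean of 0 is guaranteed by construction whenever the model includes an intercept, so this check passes even for a poorly specified model. The sketch below illustrates this on hypothetical, deliberately non-linear data using NumPy least squares; all names in it are illustrative.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
# Hypothetical target that is NOT linear in x
y = 3.0 + 0.5 * x ** 2 + rng.normal(size=200)

X_with_const = np.column_stack([np.ones_like(x), x])  # intercept + slope
X_no_const = x.reshape(-1, 1)                         # slope only

beta_c, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)
beta_n, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

mean_with = (y - X_with_const @ beta_c).mean()
mean_without = (y - X_no_const @ beta_n).mean()
print(f"mean residual with intercept:    {mean_with:.2e}")
print(f"mean residual without intercept: {mean_without:.2f}")
```

<p>With the intercept, the mean residual is zero up to floating-point error even though the straight line is the wrong shape. The remaining checks are therefore the more informative ones.</p>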



<h4 id="Check-for-homoscedasticity">Check for homoscedasticity<a class="anchor-link" href="#Check-for-homoscedasticity">¶</a></h4>



<ul>
<li><p>Homoscedasticity - If the variance of the residuals is constant across the fitted values, i.e., the residuals are spread evenly around the regression line, then the data is said to be homoscedastic.</p>
</li>
<li><p>Heteroscedasticity - If the variance of the residuals changes across the fitted values, then the data is said to be heteroscedastic. In this case, the residuals can form a funnel shape or any other non-symmetrical shape.</p>
</li>
<li><p>We'll use the <code>Goldfeld-Quandt test</code> to test the following hypothesis with alpha = 0.05:</p>
<ul>
<li>Null hypothesis: Residuals are homoscedastic</li>
<li>Alternate hypothesis: Residuals are heteroscedastic</li>
</ul>
</li>
</ul>
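
<p>The idea behind the test can be sketched by hand: split the ordered observations into two halves, fit a regression on each half, and compare the residual variances with an F ratio. The following is a simplified illustration on hypothetical data, assuming rows are sorted by the predictor; <code>gq_f_statistic</code> is an illustrative helper that omits the p-value and the other options of the library routine.</p>

```python
import numpy as np


def gq_f_statistic(y, X):
    """Residual variance of the second half of the rows divided by that of the first half."""
    n_rows, n_cols = X.shape
    half = n_rows // 2

    def resid_var(y_part, X_part):
        beta, *_ = np.linalg.lstsq(X_part, y_part, rcond=None)
        resid = y_part - X_part @ beta
        return resid @ resid / (len(y_part) - n_cols)

    return resid_var(y[half:], X[half:]) / resid_var(y[:half], X[:half])


rng = np.random.default_rng(0)
x = np.sort(rng.uniform(1, 10, size=400))
X = np.column_stack([np.ones_like(x), x])

# Hypothetical targets: constant noise vs. noise that grows with x (funnel shape)
y_const = 2 + 3 * x + rng.normal(scale=1.0, size=400)
y_funnel = 2 + 3 * x + rng.normal(scale=0.3 * x, size=400)

f_const = gq_f_statistic(y_const, X)
f_funnel = gq_f_statistic(y_funnel, X)
print(f"F ratio, constant variance: {f_const:.2f}")
print(f"F ratio, funnel-shaped:     {f_funnel:.2f}")
```

<p>An F ratio near 1 is consistent with homoscedastic residuals, while a ratio far above 1 points to variance that grows along the ordering. Because the comparison depends on how the rows are ordered, the result on shuffled training data should be read together with the residual plot.</p>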


In [None]:


from statsmodels.stats.diagnostic import het_white
from statsmodels.compat import lzip
import statsmodels.stats.api as sms




In [None]:


name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(y_train, X_train)
lzip(name, test)





<p><strong>Observations: The p-value is greater than alpha (0.05), so we fail to reject the null hypothesis. The residuals can be treated as homoscedastic.</strong></p>



<h4 id="Linearity-of-variables">Linearity of variables<a class="anchor-link" href="#Linearity-of-variables">¶</a></h4><p>It states that the predictor variables must have a linear relation with the dependent variable.</p>
<p>To test the assumption, we'll plot residuals and fitted values on a plot and ensure that residuals do not form a strong pattern. They should be randomly and uniformly scattered on the x-axis.</p>


In [None]:


# predicted values
fitted = model2.fittedvalues

# sns.set_style("whitegrid")
sns.residplot(x = fitted, y = residuals, color="lightblue", lowess=True) #write your code here
plt.xlabel("Fitted Values")
plt.ylabel("Residual")
plt.title("Residual PLOT")
plt.show()





<p><strong>Observations: The residuals show no clear pattern against the fitted values and look fairly random, so the linearity assumption appears to be satisfied.</strong></p>



<h4 id="Normality-of-error-terms">Normality of error terms<a class="anchor-link" href="#Normality-of-error-terms">¶</a></h4><p>The residuals should be normally distributed.</p>


In [None]:


# Plot histogram of residuals
#write your code here
sns.histplot(residuals, kde=True)




In [None]:


# Plot q-q plot of residuals
import pylab
import scipy.stats as stats

stats.probplot(residuals, dist="norm", plot=pylab)
plt.show()





<p><strong>Observations: The histogram and the Q-Q plot both suggest that the residuals are approximately normally distributed, so we can assume normality.</strong></p>



<h3 id="Check-the-performance-of-the-model-on-the-train-and-test-data-set">Check the performance of the model on the train and test data set<a class="anchor-link" href="#Check-the-performance-of-the-model-on-the-train-and-test-data-set">¶</a></h3>



<h4 id="Question-8:-Write-your-observations-by-comparing-model-performance-of-train-and-test-dataset-(2-Marks)"><strong>Question 8:</strong> Write your observations by comparing model performance of train and test dataset (2 Marks)<a class="anchor-link" href="#Question-8:-Write-your-observations-by-comparing-model-performance-of-train-and-test-dataset-(2-Marks)">¶</a></h4>


In [None]:


# RMSE
def rmse(predictions, targets):
    return np.sqrt(((targets - predictions) ** 2).mean())


# MAPE
def mape(predictions, targets):
    return np.mean(np.abs((targets - predictions)) / targets) * 100


# MAE
def mae(predictions, targets):
    return np.mean(np.abs((targets - predictions)))


# Model Performance on test and train data
def model_pref(olsmodel, x_train, x_test):

    # In-sample Prediction
    y_pred_train = olsmodel.predict(x_train)
    y_observed_train = y_train

    # Prediction on test data
    y_pred_test = olsmodel.predict(x_test)
    y_observed_test = y_test

    print(
        pd.DataFrame(
            {
                "Data": ["Train", "Test"],
                "RMSE": [
                    rmse(y_pred_train, y_observed_train),
                    rmse(y_pred_test, y_observed_test),
                ],
                "MAE": [
                    mae(y_pred_train, y_observed_train),
                    mae(y_pred_test, y_observed_test),
                ],
                "MAPE": [
                    mape(y_pred_train, y_observed_train),
                    mape(y_pred_test, y_observed_test),
                ],
            }
        )
    )


# Checking model performance
model_pref(model2, X_train, X_test)  





<p><strong>Observations: The values for the train and the test sets are very similar. The model seems to perform well on both datasets.</strong></p>
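
<p>Since the model predicts log(MEDV), the errors above are on the log scale. To express an error in the original units (thousands of dollars), predictions and targets can be exponentiated first. Here is a minimal sketch with a handful of hypothetical values, not taken from the dataset.</p>

```python
import math


def rmse(predictions, targets):
    """Root mean squared error of two equally long sequences."""
    squared = [(t - p) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical observed and predicted prices, stored on the log scale
y_log_true = [math.log(v) for v in (24.0, 21.6, 34.7, 15.2)]
y_log_pred = [math.log(v) for v in (25.1, 20.9, 31.0, 16.4)]

rmse_log = rmse(y_log_pred, y_log_true)
# Exponentiate both sides to return to thousands of dollars
rmse_price = rmse(
    [math.exp(p) for p in y_log_pred],
    [math.exp(t) for t in y_log_true],
)
print(f"RMSE on the log scale:        {rmse_log:.4f}")
print(f"RMSE in thousands of dollars: {rmse_price:.4f}")
```

<p>In the notebook, the equivalent would be to pass <code>np.exp(olsmodel.predict(x_test))</code> and <code>np.exp(y_test)</code> to the metric functions.</p>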



<h4 id="Apply-cross-validation-to-improve-the-model-and-evaluate-it-using-different-evaluation-metrics">Apply cross validation to improve the model and evaluate it using different evaluation metrics<a class="anchor-link" href="#Apply-cross-validation-to-improve-the-model-and-evaluate-it-using-different-evaluation-metrics">¶</a></h4>



<h4 id="Question-9:-Apply-the-cross-validation-technique-to-improve-the-model-and-evaluate-it-using-different-evaluation-metrics.-(1-Mark)"><strong>Question 9:</strong> Apply the cross validation technique to improve the model and evaluate it using different evaluation metrics. (1 Mark)<a class="anchor-link" href="#Question-9:-Apply-the-cross-validation-technique-to-improve-the-model-and-evaluate-it-using-different-evaluation-metrics.-(1-Mark)">¶</a></h4>


In [None]:


# import the required function

from sklearn.model_selection import cross_val_score

# build the regression model and cross-validate
linearregression = LinearRegression()

# cross-validate on the training data so that the test set stays held out
cv_Score11 = cross_val_score(linearregression, X_train, y_train, cv=10)  # default scoring for a regressor is R-squared
cv_Score12 = cross_val_score(linearregression, X_train, y_train, cv=10, scoring='neg_mean_squared_error')


print("RSquared: %0.3f (+/- %0.3f)" % (cv_Score11.mean(), cv_Score11.std() * 2))
print("Mean Squared Error: %0.3f (+/- %0.3f)" % (-1*cv_Score12.mean(), cv_Score12.std() * 2))





<p><strong>Observations</strong></p>
<ul>
<li>The R-squared on the cross validation is 0.729, whereas on the training dataset it was 0.769</li>
<li>And the MSE on cross validation is 0.041, whereas on the training dataset it was 0.038</li>
</ul>



<p>We may want to iterate on the model-building process with new features or better feature engineering to increase the R-squared and decrease the MSE on cross validation.</p>
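
<p>Cross validation does not change the fitted model by itself; it gives a more stable estimate of how the model generalizes. The mechanics can be sketched by hand with NumPy on hypothetical synthetic data. Note that this sketch shuffles the rows before splitting, which is an assumption of the sketch, and <code>kfold_r2</code> is an illustrative helper.</p>

```python
import numpy as np


def kfold_r2(X, y, k=10, seed=0):
    """Fit least squares on k-1 folds and return the R-squared on each held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        pred = A_val @ beta
        ss_res = np.sum((y[val_idx] - pred) ** 2)
        ss_tot = np.sum((y[val_idx] - y[val_idx].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)


# Hypothetical data: three features with a linear signal plus noise
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 1.0 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.2, size=300)

scores = kfold_r2(X, y)
print("RSquared: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
```

<p>Each observation is used for validation exactly once, so the spread of the ten scores indicates how sensitive the R-squared is to the particular split.</p>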



<h4 id="Question-10:-Get-model-Coefficients-in-a-pandas-dataframe-with-column-'Feature'-having-all-the-features-and-column-'Coefs'-with-all-the-corresponding-Coefs.-Write-the-regression-equation.-(2-Marks)"><strong>Question 10:</strong> Get model Coefficients in a pandas dataframe with column 'Feature' having all the features and column 'Coefs' with all the corresponding Coefs. Write the regression equation. (2 Marks)<a class="anchor-link" href="#Question-10:-Get-model-Coefficients-in-a-pandas-dataframe-with-column-'Feature'-having-all-the-features-and-column-'Coefs'-with-all-the-corresponding-Coefs.-Write-the-regression-equation.-(2-Marks)">¶</a></h4>


In [None]:


# model coefficients in a dataframe with the columns 'Feature' and 'Coefs'
coef = pd.DataFrame({
    'Feature': ['const', 'CRIM', 'CHAS', 'NOX', 'RM', 'DIS', 'RAD', 'PTRATIO', 'LSTAT'],
    'Coefs': [4.649385823266652, -0.012500455079103941, 0.11977319077019594, -1.0562253516683235, 0.058906575109279144, -0.044068890799406124, 0.007848474606244051, -0.048503620794999036, -0.029277040479797338],
})
coef




In [None]:


# Let us write the equation of the fit
Equation = "log (Price) ="
print(Equation, end='\t')
# the intercept is printed on its own; every other coefficient multiplies its feature
terms = [
    f"({c})" if f == 'const' else f"({c}) * {f}"
    for f, c in zip(coef['Feature'], coef['Coefs'])
]
print(' + '.join(terms))
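

<p>Because the target is log(Price), a coefficient b translates into a relative change in price of exp(b) - 1 for a one-unit increase in the feature, with the other features held fixed. The sketch below applies this to a few of the coefficients from the cell above; <code>pct_change</code> is an illustrative helper, and the step of 0.1 for NOX is an arbitrary choice.</p>

```python
import math

# Coefficients of the log(Price) model, copied from the coefficient cell above
coefs = {
    "CHAS": 0.11977319077019594,
    "NOX": -1.0562253516683235,
    "RM": 0.058906575109279144,
    "LSTAT": -0.029277040479797338,
}


def pct_change(coef, delta=1.0):
    """Percentage change in price when the feature increases by `delta`."""
    return (math.exp(coef * delta) - 1.0) * 100.0


rm_effect = pct_change(coefs["RM"])         # one more room on average
nox_effect = pct_change(coefs["NOX"], 0.1)  # NOX increases by 0.1
chas_effect = pct_change(coefs["CHAS"])     # tract bounds the river
lstat_effect = pct_change(coefs["LSTAT"])   # LSTAT increases by one point

print(f"RM +1:    {rm_effect:+.2f}%")
print(f"NOX +0.1: {nox_effect:+.2f}%")
print(f"CHAS = 1: {chas_effect:+.2f}%")
print(f"LSTAT +1: {lstat_effect:+.2f}%")
```

<p>Reading the coefficients this way makes them comparable across features, provided the typical range of each feature is kept in mind.</p>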





<h4 id="Question-11:-Write-the-conclusions-and-business-recommendations-derived-from-the-model.-(5-Marks)"><strong>Question 11:</strong> Write the conclusions and business recommendations derived from the model. (5 Marks)<a class="anchor-link" href="#Question-11:-Write-the-conclusions-and-business-recommendations-derived-from-the-model.-(5-Marks)">¶</a></h4>



<p><strong>Conclusions:</strong>
We performed linear regression on the log-transformed price, checked for multicollinearity, and checked the significance of the model coefficients. We removed the features that did not contribute significantly to the model. We then checked the performance of the model to verify that it worked well on our test data.</p>



<p><strong>Recommendations:</strong>
Houses with more rooms, with good access to radial highways, or in tracts that bound the Charles River tend to have the highest prices, since RM, RAD, and CHAS are the only features in our model with positive coefficients. All other features have negative coefficients, so lower values of them are associated with higher prices. No single feature has an overly large effect on the housing price, and the individual effects are relatively small.</p>
