
# Optimizing a Linear Regression Model for House Price Prediction

**Ken Walsh**



# Index

- [Abstract](#Abstract)
- [1. Introduction](#1.-Introduction)
- [2. The Data](#2.-The-Data)
    - [2.1 Import the Data](#2.1-Import-the-Data)
    - [2.2 Data Exploration](#2.2-Data-Exploration)
    - [2.3 Data Preparation](#2.3-Data-Preparation)
    - [2.4 Correlation](#2.4-Correlation)
- [3. Project Description](#3.-Project-Description)
    - [3.1 Linear Regression](#3.1-Linear-Regression)
    - [3.2 Analysis](#3.2-Analysis)
    - [3.3 Results](#3.3-Results)
    - [3.4 Verify Your Model Against Test Data](#3.4-Verify-Your-Model-Against-Test-Data)
- [Conclusion](#Conclusion)
- [References](#References)

[Back to top](#Index)


##  Abstract

The goal of this project was to develop and optimize a linear regression model for predicting house prices. Initially, a dataset comprising twenty numerical columns was utilized for training the model. Through a process of feature selection and evaluation, the dataset was subsequently reduced to eight variables, which were deemed to have the greatest impact on the model's predictive performance.

The training phase resulted in a model score of 0.91421804, indicating a strong linear relationship between the selected features and the target variable. To assess the model's predictive ability for the real world, it was applied to a larger dataset, yielding a slightly lower score of 0.8665988. When applied to a test dataset of similar data, the model score dropped to 0.817934.

To ensure data integrity, missing values (NaNs) within the dataset were addressed by replacing them with zeros prior to model training. This preprocessing step aimed to minimize the potential impact of missing data on the model's predictive accuracy.

Overall, the findings suggest that the selected eight columns demonstrate a strong linear relationship with house prices, as indicated by the high model scores achieved on both the training dataset and the full dataset, but only a moderate linear relationship when the model is applied to a different set of similar data.

[Back to top](#Index)


## 1. Introduction

The goal of the project was to develop and optimize a prediction model for house prices using a linear regression approach. The project involved various steps, including data exploration and cleansing, feature selection, model training, and evaluation. This introduction provides an overview of the processes followed to solve the problem and create an effective prediction model.

Initially, the dataset consisted of a CSV file with eighty-two columns and one hundred records, containing numerical and non-numerical data related to various features of houses. These features included factors such as square footage, number of bedrooms, location, and other relevant attributes. The data was examined to understand its structure, to identify the columns relevant to the model, and to identify gaps requiring cleaning.

The training dataset was developed by selecting all columns containing numerical data, since the model would only function with this data. This numerical training dataset was subjected to a correlation exercise between the _**SalePrice**_ column and the other numerical data. The twenty columns with the highest correlation to the _**SalePrice**_ column were kept, indexed, and cleaned of any bad data. This dataset was then tested and columns with data that would not significantly affect the model score were removed.

Through analysis eight columns were identified as having the most significant impact on predicting house prices. These selected features were then used to train the linear regression model. The performance of the model was evaluated using a training dataset, resulting in a high model score of 0.914218. This score indicated a strong linear relationship between the selected features and house prices.

To assess the model's generalization ability, it was subsequently applied to the full dataset including the testing data. This evaluation yielded a slightly lower model score of 0.866598. When tested against a blind dataset of similar data, the model score dropped to 0.817934.

In conclusion, this project aimed to develop a linear regression model for predicting house prices.

[Back to top](#Index)

## 2. The Data


[Back to top](#Index)

### 2.1 Import the Data

#### Code:
```
# Project - Build a house price predictive model
# Y (House Price) = X1 (column from data) * B1 (must be determined) + ... + Xn * Bn

from sklearn import linear_model
from sklearn.metrics import r2_score
from scipy.stats import skew
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Load the dataset from 'houseSmallData.csv' using pandas
data = pd.read_csv('houseSmallData.csv')

# Print the shape of the dataset
print('Shape of the Data Matrix', data.shape)

# Display a sample of the dataset
print('Sample of the Data:', data.head())
```

The code snippet demonstrates the data import process. First, the required libraries are imported, including linear\_model, r2\_score from sklearn, skew from scipy, matplotlib.pyplot as plt, seaborn as sns, pandas as pd, and numpy as np. These libraries are commonly used for data manipulation, visualization, statistics, and predictive modeling tasks.

Next, the data is loaded from the 'houseSmallData.csv' file using the read\_csv() function from pandas. The loaded data is stored in the "data" dataframe.

Preliminary data exploration is then performed. The shape of the data is determined using the shape attribute and printed. The head() function is used to display the first few rows of the dataset, and the output is printed as a sample of the data.


[Back to top](#Index)

### 2.2 Data Exploration

#### Code:
```
# Create the training dataset by selecting the first 20 rows of data for all the columns
train = data.iloc[0:20, :]

# Select all of the columns containing only numerical data and store them in numeric
numeric = train.select_dtypes(include=[np.number])

# Select the variables of interest
variables_of_interest = ['OverallQual', 'MasVnrArea', 'FullBath', 'YearBuilt', 'GarageArea', 'GrLivArea', 'GarageCars', 'LotArea']

# Generate scatter plots of the variables versus the Sale Price data and display in a 3 x 4 matrix
# Number of rows and columns for the subplot grid
num_rows = 3
num_cols = 4

# Create the subplot grid
fig, axes = plt.subplots(num_rows, num_cols, figsize=(16, 12))

# Flatten the axes array to iterate over it easily
axes = axes.flatten()

# Create scatter plots for each variable
for i, variable in enumerate(variables_of_interest):
    ax = axes[i]
    sns.scatterplot(x=data[variable], y=data['SalePrice'], ax=ax)
    ax.set_xlabel(variable)
    ax.set_ylabel('Sale Price')
    ax.set_title(f'{variable} vs. Sale Price')

# Remove any empty subplots if the number of variables is less than 12
if len(variables_of_interest) < num_rows * num_cols:
    for j in range(len(variables_of_interest), num_rows * num_cols):
        fig.delaxes(axes[j])

# Adjust the spacing between subplots
plt.tight_layout()
```

![Final Project Graphs.png](attachment:32fca96a-e58e-401a-b96d-684824b9534e.png)

All the variables used in the model are plotted against the sale price data. Three of the plots, _FullBath_ vs. Sale Price, _GarageCars_ vs. Sale Price, and _OverallQual_ vs. Sale Price, display discrete linear relationships with positive correlations.

The _MasVnrArea_ vs. Sale Price plot has many zero datapoints, which affects the shape of the plot. These zero entries could not be averaged out, because the variable is the area of masonry veneer and not all houses have masonry veneer, so the data was not modified.

The other plots show good positive correlations between the variables and Sale Price with one or two outliers skewing the data, particularly the _LotArea_ variable. We will discuss the _LotArea_ variable in the Correlation Section.

Other variables showed some good correlation in their graphs, but they were rejected when testing showed that they did not significantly change the model score. For completeness, the scatterplots generated for these rejected variables are included below:

![Final Project - Rejected.png](attachment:4fabd54c-7003-4024-b613-49c6294d84aa.png)

[Back to top](#Index)

### 2.3 Data Preparation

#### Code:
```
# Replace all NaNs with zeros in the training and full datasets
# (trainX and fullX are created in Section 2.4)
trainX = trainX.fillna(0)
fullX = fullX.fillna(0)
```

The data was checked for NaN, and these entries were replaced with zeros for both the full and training datasets. This cleaning was done after the numerical columns were isolated, but before the correlations were completed. As stated previously, the numerical data containing zeros were left unchanged, as modifying that data would have generated incorrect values.
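As a minimal sketch of this step, the example below uses a small made-up DataFrame (the values are hypothetical and not taken from houseSmallData.csv) to show that `fillna(0)` replaces only the missing entries and leaves genuine zeros untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: MasVnrArea has one genuine zero and one missing value.
toy = pd.DataFrame({
    "MasVnrArea": [196.0, 0.0, np.nan, 350.0],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
})

# Count the missing values per column before cleaning.
missing_before = toy.isna().sum()

# Replace every NaN with zero; values that were already zero are unchanged.
cleaned = toy.fillna(0)
missing_after = cleaned.isna().sum()

print("Missing before:", missing_before.to_dict())
print("Missing after: ", missing_after.to_dict())
print(cleaned)
```

One consequence worth keeping in mind is that, after filling, a missing value can no longer be told apart from a true zero in the same column.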



[Back to top](#Index)

### 2.4 Correlation

#### Code:
```
# Data Preparation for Correlation Analysis
# Select the columns with numerical data for correlation
numeric = train.select_dtypes(include=[np.number])

# Calculate the correlation coefficients
corr = numeric.corr()

# Select the top twenty columns correlated with "SalePrice"
# (the slice keeps 21 entries because "SalePrice" itself comes first with a correlation of 1.0)
bestCols = corr['SalePrice'].sort_values(ascending=False)[0:21].index

# Create the X variables with the top twenty numerical columns correlated to "SalePrice"
trainX = train[bestCols]
fullX = data[bestCols]
```

The following correlation coefficients were generated from the initial twenty variables:

| **Variable** | **Correlation** |
| --- | --- |
| OverallQual | 0.807380 |
| MasVnrArea | 0.788274 |
| FullBath | 0.721954 |
| TotRmsAbvGrd | 0.699634 |
| YearBuilt | 0.699627 |
| YearRemodAdd | 0.698731 |
| GarageArea | 0.696998 |
| BedroomAbvGr | 0.681291 |
| GrLivArea | 0.676909 |
| TotalBsmtSF | 0.651318 |
| GarageYrBlt | 0.649557 |
| LotFrontage | 0.606910 |
| WoodDeckSF | 0.575730 |
| GarageCars | 0.571377 |
| 1stFlrSF | 0.449307 |
| 2ndFlrSF | 0.419880 |
| BsmtFinSF1 | 0.400864 |
| Fireplaces | 0.374814 |
| MoSold | 0.328774 |
| LotArea | 0.265787 |

It's important to note that while not all these variables are used in the final model, one variable with a relatively low correlation, _LotArea_, is included because it enhances predictive accuracy. Further details on variable selection and their impact on the model will be discussed in subsequent sections of this report.
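To illustrate how a variable with a low individual correlation can still improve a multiple regression, here is a small sketch on synthetic data. The variables `x1` and `x2`, the coefficients, and the helper `r_squared` are all hypothetical and chosen for illustration; NumPy least squares stands in for the scikit-learn model used in this project.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic, hypothetical predictors: x1 drives most of y, x2 only a little.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 * x1 + 1.0 * x2 + rng.normal(scale=0.1, size=n)


def r_squared(X, y):
    """Fit OLS with an intercept via least squares and return R^2."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)


# Pearson correlation of the weaker predictor with the target.
corr_x2 = np.corrcoef(x2, y)[0, 1]

# Model score without and with the weakly correlated predictor.
r2_without = r_squared(x1.reshape(-1, 1), y)
r2_with = r_squared(np.column_stack([x1, x2]), y)

print(f"corr(x2, y)   = {corr_x2:.3f}")
print(f"R^2 (x1 only) = {r2_without:.3f}")
print(f"R^2 (x1 + x2) = {r2_with:.3f}")
```

In this sketch the second predictor explains variation that the first one leaves behind, which is one plausible reason a variable like _LotArea_ can matter despite a modest correlation coefficient.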



[Back to top](#Index)

## 3. Project Description

#### Objective

The primary goal of this analysis was to enhance Dr. Williams' linear regression model score (0.750199) by introducing new variables and improving the model's performance, ultimately achieving an R-squared score of 0.866598 for the full dataset and a score of 0.817934 for the blind test.

#### Methodology

To achieve this goal, we conducted an examination of the top twenty variables most strongly correlated with the sales price. These variables were carefully chosen to expand the model beyond the original framework established by Dr. Williams. They included, in addition to the variables used by Dr. Williams:

- _FullBath_ (number of full bathrooms)
- _YearBuilt_ (year the house was built)
- _GarageArea_ (garage area)
- _GrLivArea_ (above-ground living area)
- _GarageCars_ (garage capacity)
- _LotArea_ (lot area)

#### Key Findings

During our analysis, several significant relationships between these variables and house prices emerged:

1. _OverallQual_, _FullBath_, and _GarageCars_ displayed strong, positive linear correlations with house prices.
2. _MasVnrArea_ exhibited a complex relationship due to the presence of many zero values, but it showed a positive correlation when present.
3. _YearBuilt_ and _GrLivArea_ revealed positive correlations, indicating that newer houses and larger living areas tend to command higher prices.
4. Surprisingly, _GarageArea_ also displayed a positive correlation with house prices.
5. While _LotArea_ initially had a poor correlation score, removing it from the model resulted in a nearly three-point decrease in the model score, highlighting its significance.
6. Some variables, such as _MoSold_, were removed from the model due to their lack of relevance, but this minimally impacted the model's score.

#### Conclusion

These findings provide valuable insights into the determinants of house prices and have allowed us to significantly improve our predictive model. The addition of new variables and the identification of their relationships with house prices have contributed to our model's enhanced performance and accuracy.


[Back to top](#Index)

### 3.1 Linear Regression

#### Code:
```
# Compute the correlation matrix
# Run the correlation function on the numerical columns and assign the twenty columns
# most highly correlated to "SalePrice" to a new variable called bestCols
corr = numeric.corr()
bestCols = corr['SalePrice'].sort_values(ascending=False)[0:21].index

# Create the X variables featuring the best twenty numerical columns correlated to "SalePrice",
# replacing NaNs with zeros as described in Section 2.3
trainX = train[bestCols].fillna(0)
fullX = data[bestCols].fillna(0)

# Columns considered not necessary for the model (including the target itself)
dropCols = ['SalePrice', 'WoodDeckSF', 'MoSold', 'TotalBsmtSF', 'BsmtFinSF1', '1stFlrSF', '2ndFlrSF', 'GarageYrBlt', 'LotFrontage', 'TotRmsAbvGrd', 'BedroomAbvGr', 'Fireplaces', 'YearRemodAdd']

# Create X and Y for the linear regression model on the training data
trainY = train['SalePrice']
trainX = trainX.drop(dropCols, axis=1)

# Create X and Y for the linear regression model on the full dataset
fullY = data['SalePrice']
fullX = fullX.drop(dropCols, axis=1)

# Train the linear regression model on the training data
lr = linear_model.LinearRegression()
model = lr.fit(trainX, trainY)
predictions = model.predict(trainX)

# Train the linear regression model on the full dataset
lr = linear_model.LinearRegression()
model1 = lr.fit(fullX, fullY)
predictions = model1.predict(fullX)

# Evaluate the performance of the models
print(model.score(trainX, trainY))
print(model1.score(fullX, fullY))
```

In this section, we perform _**Multiple Linear Regression**_ (MLR) to model the relationship between our independent variables and the dependent variable, _SalePrice_. The code demonstrates how we prepare the data, build the regression models, make predictions, and evaluate the model's performance.

The algorithm is multiple linear regression, an extension of simple linear regression: a predictive statistical modeling technique that uses multiple independent variables to predict a continuous outcome variable. MLR assumes a linear relationship between the independent variables and the dependent variable and models that relationship. It extends ordinary least-squares (OLS) regression to more than one explanatory variable.

#### Theory

"Multiple Linear Regression is a statistical method that models the relationship between a dependent variable and multiple independent variables. It is an extension of Simple Linear Regression, allowing for the consideration of multiple predictors simultaneously." [1]

#### Key Assumptions

- There is a linear relationship between the dependent variable and the independent variables.
- The independent variables are not too highly correlated with each other.
- The observations are selected independently and randomly from the population.
- Residuals should be normally distributed with a mean of 0 and variance σ².

The coefficient of determination (R-squared) is a statistical metric that measures how much of the variation in the outcome can be explained by the variation in the independent variables. R-squared never decreases as more predictors are added to the MLR model, even though the added predictors may not be related to the outcome variable.
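The behaviour of R-squared when predictors are added can be checked with a short sketch. The data below is synthetic and the helper `r_squared` is a hypothetical NumPy stand-in for the scikit-learn model score; the added columns are pure noise with no relationship to the outcome.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# One genuine predictor plus noise in the outcome (synthetic data).
x = rng.normal(size=(n, 1))
y = 2.0 * x[:, 0] + rng.normal(size=n)


def r_squared(X, y):
    """Training R^2 of an OLS fit with an intercept."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)


scores = []
X = x
for _ in range(5):
    scores.append(r_squared(X, y))
    # Append a column of pure noise that is unrelated to y.
    X = np.column_stack([X, rng.normal(size=n)])

for k, score in enumerate(scores):
    print(f"{k} noise columns added: training R^2 = {score:.4f}")
```

Because the training score cannot go down when a column is added, a rising training score on its own is not evidence that a new variable is useful.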

The mathematical equation for Multiple Linear Regression is given by:

$$y = m_1 x_1 + m_2 x_2 + \dots + m_n x_n + b$$

Where:

_y_ is the dependent variable (outcome variable or response variable).

$x_1, \dots, x_n$ are the independent variables (predictor variables).

$m_1, \dots, m_n$ are the slope coefficients, one for each independent variable.

_b_ is the intercept (y-axis intercept).

The OLS method estimates the values of $m_1, \dots, m_n$ and _b_ that minimize the residual sum of squares (RSS) over the observations _i_:

$$\mathrm{RSS} = \sum_{i} \left( y_i - \left( m_1 x_{i1} + m_2 x_{i2} + \dots + m_n x_{in} + b \right) \right)^2$$
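The sketch below illustrates the least-squares idea on synthetic data with two predictors. The true coefficients (2, -1, and an intercept of 5) are hypothetical, and `np.linalg.lstsq` is used as a stand-in for the fit performed by scikit-learn in this project.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100

# Hypothetical data generated from y = 2*x1 - 1*x2 + 5 plus small noise.
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=n)


def rss(m, b):
    """Residual sum of squares for slopes m and intercept b."""
    residuals = y - (X @ m + b)
    return float(residuals @ residuals)


# Append a column of ones so the intercept is estimated with the slopes.
A = np.column_stack([X, np.ones(n)])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
m_hat, b_hat = solution[:2], solution[2]

# Any change to the fitted coefficients should make the RSS larger.
rss_ols = rss(m_hat, b_hat)
rss_shifted = rss(m_hat + np.array([0.05, 0.0]), b_hat)

print("Estimated slopes:   ", np.round(m_hat, 3))
print("Estimated intercept:", round(float(b_hat), 3))
print("RSS at OLS solution:", round(rss_ols, 4))
print("RSS after shifting: ", round(rss_shifted, 4))
```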

#### Steps to Implement Linear Regression in Python

- Import the necessary libraries: such as NumPy, pandas, seaborn, and scikit-learn.
- Read the data from the CSV file: houseSmallData.csv
- Split the dataset into training and full datasets to evaluate the model's performance on unseen data.
- Select the variables of interest: use the columns with the highest correlation to the dependent variable.
- Create scatter plots for each variable: examine the plots and choose the variable with the best linear relationship.
- Prepare the data for regression: clean the data by replacing any NaN values.
- Compute the correlation matrix and select the top correlated columns.
- Create an instance of the Linear Regression model from the scikit-learn library.
- Fit the model to the training data using the fit() function, which estimates the coefficients based on the Ordinary Least Squares (OLS) method.
- Once the model is trained, use it to make predictions on the testing data using the predict() function.
- Evaluate the model's performance by calculating metrics such as mean squared error (MSE) or R-squared.
- Perform model optimization by adding or removing variables to improve the model score.
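The evaluation step in the list above can be sketched with the two metrics it names. The prices below are made-up values (in thousands) chosen so the arithmetic is easy to follow, and the helper names are hypothetical; the R-squared formula is the same quantity that scikit-learn's `r2_score` and `LinearRegression.score` report.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical actual and predicted sale prices, in thousands.
actual = [200, 150, 300, 250]
predicted = [210, 140, 290, 260]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)

print(f"MSE = {mse:.1f}")   # each error is 10, so MSE = 100.0
print(f"R^2 = {r2:.3f}")    # 1 - 400 / 12500 = 0.968
```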

#### Evaluate the performance of the models:

In this project, a dataset is loaded and preprocessed using Pandas. The independent variables (X) and the dependent variable (y) are extracted. The Linear Regression model is created and trained using the fit() function. Predictions are made on new data, and the model's intercept, coefficients, and predictions are printed for interpretation.
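As a minimal sketch of how the intercept, coefficients, and predictions can be inspected, the example below fits scikit-learn's `LinearRegression` to a tiny made-up dataset. The numbers are hypothetical and constructed so that the target is exactly 3·x1 + 2·x2 + 10; it is not the project's data.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical toy data in which y = 3*x1 + 2*x2 + 10 exactly.
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = 3 * X[:, 0] + 2 * X[:, 1] + 10

# Fit the model and read back the estimated parameters.
model = linear_model.LinearRegression().fit(X, y)
intercept = model.intercept_
coefficients = model.coef_
predictions = model.predict(X[:2])

print("Intercept:   ", intercept)
print("Coefficients:", coefficients)
print("Predictions: ", predictions)
print("R^2 score:   ", model.score(X, y))
```

With real data the fitted values will not match any known coefficients, but the same attributes (`intercept_` and `coef_`) show how much each variable contributes to the predicted price.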

[Back to top](#Index)

### 3.2 Analysis 

Below is the final selection of code for our project. It duplicates and builds upon code from previous sections.

_The code snippet below incorporates code detailed earlier in the document, but not all the code required for it to be standalone. If you wish to run this code, please ensure you have run all the other code snippets outlined above._

#### Selected Code
```
# Compute the correlation matrix
# Run the correlation function on the numerical columns and assign the twenty columns
# most highly correlated to "SalePrice" to a new variable called bestCols
corr = numeric.corr()
bestCols = corr['SalePrice'].sort_values(ascending=False)[0:21].index

# Create the X variables featuring the best twenty numerical columns correlated to "SalePrice",
# replacing NaNs with zeros as described in Section 2.3
trainX = train[bestCols].fillna(0)
fullX = data[bestCols].fillna(0)

# Columns considered not necessary for the model (including the target itself)
dropCols = ['SalePrice', 'WoodDeckSF', 'MoSold', 'TotalBsmtSF', 'BsmtFinSF1', '1stFlrSF', '2ndFlrSF', 'GarageYrBlt', 'LotFrontage', 'TotRmsAbvGrd', 'BedroomAbvGr', 'Fireplaces', 'YearRemodAdd']

# Create X and Y for the linear regression model on the training data
trainY = train['SalePrice']
trainX = trainX.drop(dropCols, axis=1)

# Create X and Y for the linear regression model on the full dataset
fullY = data['SalePrice']
fullX = fullX.drop(dropCols, axis=1)

# Train the linear regression model on the training data
lr = linear_model.LinearRegression()
model = lr.fit(trainX, trainY)
predictions = model.predict(trainX)

# Train the linear regression model on the full dataset
lr = linear_model.LinearRegression()
model1 = lr.fit(fullX, fullY)
predictions = model1.predict(fullX)

# Evaluate the performance of the models
print(model.score(trainX, trainY))
print(model1.score(fullX, fullY))
```

This is the final selection for my project. The training set predictive score was 0.914218 (rounded), and the overall predictive score on the full dataset was 0.866598 (rounded). This is an improvement of roughly twelve points over Dr. Williams' more restricted score of 0.750199.

#### Code Variation 1 - Using all twenty variables.
```
# Create the X variables featuring the best twenty numerical columns correlated to "SalePrice",
# replacing NaNs with zeros as described in Section 2.3
trainX = train[bestCols].fillna(0)
fullX = data[bestCols].fillna(0)

# Create X and Y for the linear regression model on the training data - drop only the target column
trainY = train['SalePrice']
trainX = trainX.drop('SalePrice', axis=1)

# Create X and Y for the linear regression model on the full dataset - drop only the target column
fullY = data['SalePrice']
fullX = fullX.drop('SalePrice', axis=1)

# Train the linear regression model on the training data
lr = linear_model.LinearRegression()
model = lr.fit(trainX, trainY)
predictions = model.predict(trainX)

# Train the linear regression model on the full dataset
lr = linear_model.LinearRegression()
model1 = lr.fit(fullX, fullY)
predictions = model1.predict(fullX)

# Evaluate the performance of the models
print(model.score(trainX, trainY))
print(model1.score(fullX, fullY))
```

When all of the top twenty variables correlated with _SalePrice_ were included, the training set predictive score was 1.0 and the overall predictive score was 0.88692 (rounded). The analysis of the differences between each of the test sets will be detailed later.

#### Code Variation 2 - Using variables with a correlation of 0.7 or higher.

These variables are:

| **Variable** | **Correlation** |
| --- | --- |
| OverallQual | 0.807380 |
| MasVnrArea | 0.788274 |
| FullBath | 0.721954 |

```
# Columns removed for this variation: the target plus every variable with a correlation below 0.7
dropCols = ['SalePrice', 'WoodDeckSF', 'MoSold', 'TotalBsmtSF', 'BsmtFinSF1', '1stFlrSF', '2ndFlrSF', 'GarageYrBlt', 'LotFrontage', 'TotRmsAbvGrd', 'BedroomAbvGr', 'Fireplaces', 'YearRemodAdd', 'YearBuilt', 'GarageArea', 'GrLivArea', 'GarageCars', 'LotArea']

# Prepare the training data - remove variables identified as being not necessary
trainX = train[bestCols].fillna(0)
trainY = train['SalePrice']
trainX = trainX.drop(dropCols, axis=1)

# Prepare the full dataset - remove variables identified as being not necessary
fullX = data[bestCols].fillna(0)
fullY = data['SalePrice']
fullX = fullX.drop(dropCols, axis=1)

# Train the linear regression model on the training data
lr = linear_model.LinearRegression()
model = lr.fit(trainX, trainY)
predictions = model.predict(trainX)

# Train the linear regression model on the full dataset
lr = linear_model.LinearRegression()
model1 = lr.fit(fullX, fullY)
predictions = model1.predict(fullX)

# Evaluate the performance of the models
print(model.score(trainX, trainY))
print(model1.score(fullX, fullY))
```

By reducing the number of variables to three, using only the most highly correlated (coefficient \> 0.7) both the training and full model scores have dropped significantly to 0.787535 (rounded) and 0.752927 (rounded) respectively.

#### Code Variation 3 - Including Variables with a correlation score of 0.6 or higher.

There are twelve of these variables and they are:

| **Variable** | **Correlation** |
| --- | --- |
| OverallQual | 0.807380 |
| MasVnrArea | 0.788274 |
| FullBath | 0.721954 |
| TotRmsAbvGrd | 0.699634 |
| YearBuilt | 0.699627 |
| YearRemodAdd | 0.698731 |
| GarageArea | 0.696998 |
| BedroomAbvGr | 0.681291 |
| GrLivArea | 0.676909 |
| TotalBsmtSF | 0.651318 |
| GarageYrBlt | 0.649557 |
| LotFrontage | 0.606910 |

```
# Columns removed for this variation (including the target itself)
dropCols = ['SalePrice', 'WoodDeckSF', 'MoSold', 'BsmtFinSF1', '1stFlrSF', '2ndFlrSF', 'GarageYrBlt', 'Fireplaces', 'GarageCars', 'LotArea']

# Prepare the training data - remove variables identified as being not necessary
trainX = train[bestCols].fillna(0)
trainY = train['SalePrice']
trainX = trainX.drop(dropCols, axis=1)

# Prepare the full dataset - remove variables identified as being not necessary
fullX = data[bestCols].fillna(0)
fullY = data['SalePrice']
fullX = fullX.drop(dropCols, axis=1)

# Train the linear regression model on the training data
lr = linear_model.LinearRegression()
model = lr.fit(trainX, trainY)
predictions = model.predict(trainX)

# Train the linear regression model on the full dataset
lr = linear_model.LinearRegression()
model1 = lr.fit(fullX, fullY)
predictions = model1.predict(fullX)

# Evaluate the performance of the models
print(model.score(trainX, trainY))
print(model1.score(fullX, fullY))
```

The model scores improved considerably when the numbers of variables were increased to include variables with a correlation score \> 0.6. The training model score improved to 0.945267 (rounded), and the full dataset predictive score was 0.862985 (rounded).

#### Summary of Code Variations

Here is a table showing the results of the four variations (including the chosen final variation).

| **Variation** | **Training Score (R²)** | **Full Dataset Score (R²)** |
| --- | --- | --- |
| Selected Model | 0.914218 | 0.866598 |
| All 20 Variables | 1.000000 | 0.886920 |
| Correlation \> 0.7 | 0.787535 | 0.752927 |
| Correlation \> 0.6 | 0.945267 | 0.862985 |

We explored different code variations to assess their impact on model performance. Here are the results:

**Selected Model:** Using the eight selected variables, we achieved a training score of 0.914218 and a full dataset score of 0.866598.

**All 20 Variables:** When including all top twenty correlated variables to "SalePrice", we obtained a training score of 1.0 but a full dataset score of 0.886920.

**Correlation \> 0.7:** Using only variables with a correlation coefficient of 0.7 or higher significantly reduced the scores, with a training score of approximately 0.787535 and a full dataset score of 0.752927.

**Correlation \> 0.6:** Including variables with a correlation of 0.6 or higher improved the model, resulting in a training score of approximately 0.945267 and a full dataset score of 0.862985.

These variations highlight the importance of variable selection in improving model performance. The implications of these results will be discussed in the next section.
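One simple way to compare the variations is the gap between the training score and the full dataset score. The sketch below only does arithmetic on the scores reported in the summary table; the dictionary layout and variable names are illustrative.

```python
# Scores copied from the summary table: (training score, full dataset score).
scores = {
    "Selected Model": (0.914218, 0.866598),
    "All 20 Variables": (1.000000, 0.886920),
    "Correlation > 0.7": (0.787535, 0.752927),
    "Correlation > 0.6": (0.945267, 0.862985),
}

# Gap between training and full dataset performance for each variation.
gaps = {name: round(train - full, 6) for name, (train, full) in scores.items()}

# Print the variations from the smallest gap to the largest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:<20} gap = {gap:.6f}")

largest_gap = max(gaps, key=gaps.get)
```

A larger gap suggests that a variation fits the twenty training rows more closely than it fits the data as a whole.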





[Back to top](#Index)

### 3.3 Results

In this section, we present the results of various model variations. We explored different model configurations to determine which one performed the best. Each model's training and full dataset performance is evaluated. The final code for the linear regression has been duplicated here from Section 3.2. Additional code has been added for analysis purposes.

_The code snippet below incorporates code detailed earlier in the document, but not all of the code required for it to be standalone. If you wish to run this code, please ensure you have run all of the other code snippets outlined above._

#### Code
```
# Create the X variables featuring the best twenty numerical columns correlated to "SalePrice",
# replacing NaNs with zeros as described in Section 2.3
trainX = train[bestCols].fillna(0)
fullX = data[bestCols].fillna(0)

# Columns considered not necessary for the model (including the target itself)
dropCols = ['SalePrice', 'WoodDeckSF', 'MoSold', 'TotalBsmtSF', 'BsmtFinSF1', '1stFlrSF', '2ndFlrSF', 'GarageYrBlt', 'LotFrontage', 'TotRmsAbvGrd', 'BedroomAbvGr', 'Fireplaces', 'YearRemodAdd']

# Prepare the training data - remove variables identified as being not necessary
trainY = train['SalePrice']
trainX = trainX.drop(dropCols, axis=1)

# Prepare the full dataset - remove variables identified as being not necessary
fullY = data['SalePrice']
fullX = fullX.drop(dropCols, axis=1)

# Train the linear regression model on the training data
lr = linear_model.LinearRegression()
model = lr.fit(trainX, trainY)
predictionsTrain = model.predict(trainX)

# Train the linear regression model on the full dataset
lr = linear_model.LinearRegression()
model1 = lr.fit(fullX, fullY)
predictionsFull = model1.predict(fullX)

# Evaluate the performance of the models
print(model.score(trainX, trainY))
print(model1.score(fullX, fullY))

# Training error histogram with skewness
plt.figure()
plt.hist(trainY - predictionsTrain)

# Calculate the skewness of the differences and display it as text on the plot
skewnessTrain = skew(trainY - predictionsTrain)
skew_text = f'Skewness = {skewnessTrain:.6f}'
plt.text(0.6, 0.9, skew_text, transform=plt.gca().transAxes)
plt.show()

# Training scatterplot with best-fit line and R^2 score
plt.figure()
plt.scatter(predictionsTrain, trainY, color='r')

# Calculate the best-fit line coefficients for predictionsTrain
coefficients_train = np.polyfit(predictionsTrain, trainY, 1)
slope_train = coefficients_train[0]
intercept_train = coefficients_train[1]

# Generate the best-fit line for predictionsTrain
best_fit_line_train = slope_train * predictionsTrain + intercept_train

# Calculate the R^2 value for predictionsTrain and display it as text
r2_value_train = r2_score(trainY, predictionsTrain)
r2_text_train = f'R^2 (Train) = {r2_value_train:.6f}'
plt.plot(predictionsTrain, best_fit_line_train, color='b')
plt.text(0.6, 0.2, r2_text_train, transform=plt.gca().transAxes)
plt.show()

# Full dataset error histogram with skewness
plt.figure()
plt.hist(fullY - predictionsFull)

# Calculate the skewness of the differences and display it as text on the plot
skewnessFull = skew(fullY - predictionsFull)
skew_text = f'Skewness = {skewnessFull:.6f}'
plt.text(0.6, 0.9, skew_text, transform=plt.gca().transAxes)
plt.show()

# Full dataset scatterplot with best-fit line and R^2 score
plt.figure()
plt.scatter(predictionsFull, fullY, color='r')

# Calculate the best-fit line coefficients for predictionsFull
coefficients_full = np.polyfit(predictionsFull, fullY, 1)
slope_full = coefficients_full[0]
intercept_full = coefficients_full[1]

# Generate the best-fit line for predictionsFull
best_fit_line_full = slope_full * predictionsFull + intercept_full

# Calculate the R^2 value for predictionsFull and display it as text
r2_value_full = r2_score(fullY, predictionsFull)
r2_text_full = f'R^2 (Full) = {r2_value_full:.6f}'
plt.plot(predictionsFull, best_fit_line_full, color='b')
plt.text(0.6, 0.2, r2_text_full, transform=plt.gca().transAxes)
plt.show()
```

This is the final selection for my project. The training set predictive score was 0.914218 (rounded), and the overall predictive score on the full dataset was 0.866598 (rounded). This is an improvement of roughly twelve points over Dr. Williams' more restricted score of 0.750199.

#### Other Models

**Twenty Variables**

When all of the top twenty variables correlated with "SalePrice" were included, the training set predictive score was 1.0 and the overall predictive score was 0.88692 (rounded). The analysis of the differences between each of the test sets will be detailed later. The perfect training set score is unusual and is most likely due to overfitting: with twenty predictors plus an intercept fitted to only twenty training rows, the model has enough free parameters to reproduce every training price exactly. Further testing on larger datasets is required.
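A perfect training score with twenty predictors and twenty rows can be reproduced even when the predictors are pure noise. The sketch below is an illustration on random synthetic data, not the project's dataset, and uses NumPy least squares as a stand-in for scikit-learn's `LinearRegression`; the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_features = 20, 20

# Pure noise: the predictors carry no information about the target.
X_train = rng.normal(size=(n_rows, n_features))
y_train = rng.normal(size=n_rows)
X_new = rng.normal(size=(100, n_features))
y_new = rng.normal(size=100)


def add_intercept(X):
    """Append a column of ones so an intercept is fitted."""
    return np.column_stack([X, np.ones(len(X))])


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# 21 free parameters (20 slopes + intercept) fitted to 20 observations.
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

r2_train = r_squared(y_train, add_intercept(X_train) @ coef)
r2_new = r_squared(y_new, add_intercept(X_new) @ coef)

print(f"Training R^2: {r2_train:.6f}")
print(f"New-data R^2: {r2_new:.6f}")
```

The training score reaches 1.0 here although nothing can be learned from the inputs, which is why a perfect training score should be read as a warning sign and not as evidence of a strong model.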

**Using variables with a correlation of 0.7 or higher.**

These variables are:

| **Variable** | **Correlation** |
| --- | --- |
| OverallQual | 0.807380 |
| MasVnrArea | 0.788274 |
| FullBath | 0.721954 |

Reducing the number of variables to three by keeping only the most highly correlated (coefficient \> 0.7) caused both the training and full model scores to drop significantly, to 0.787535 (rounded) and 0.752927 (rounded) respectively. Because these scores were lower than those of the selected model, this variation is not recommended.
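The threshold-based selection described above can be sketched with pandas. This is a minimal illustration on a small made-up DataFrame, not the project's code: the helper name `select_by_correlation` and the toy values are assumptions, and only the column names are borrowed from this report.

```python
import pandas as pd

def select_by_correlation(df, target, threshold):
    """Return numeric columns whose correlation with `target` exceeds `threshold`.

    Hypothetical helper for illustration; results are sorted from the most
    to the least correlated column.
    """
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    selected = corr[corr > threshold].sort_values(ascending=False)
    return list(selected.index)

# Toy data (made-up values) using a few column names from the housing dataset
toy = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9, 10],
    "FullBath":    [1, 1, 2, 2, 2, 3],
    "MoSold":      [6, 1, 12, 3, 9, 5],
    "SalePrice":   [120000, 150000, 190000, 240000, 310000, 400000],
})

best_columns = select_by_correlation(toy, "SalePrice", 0.7)
print(best_columns)
```

With these toy values, `OverallQual` and `FullBath` pass the 0.7 threshold and `MoSold` does not. Lowering the threshold argument to 0.6 would reproduce the wider selection rule used in the next variation.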

**Variables with a Correlation of 0.6 or Higher**

| **Variable** | **Correlation** |
| --- | --- |
| OverallQual | 0.807380 |
| MasVnrArea | 0.788274 |
| FullBath | 0.721954 |
| TotRmsAbvGrd | 0.699634 |
| YearBuilt | 0.699627 |
| YearRemodAdd | 0.698731 |
| GarageArea | 0.696998 |
| BedroomAbvGr | 0.681291 |
| GrLivArea | 0.676909 |
| TotalBsmtSF | 0.651318 |
| GarageYrBlt | 0.649557 |
| LotFrontage | 0.606910 |

The model scores improved considerably when the number of variables was increased to include those with a correlation score \> 0.6. The training model score improved to 0.945267 (rounded), and the full dataset predictive score was 0.862985 (rounded). While this model had the highest training set score apart from the perfect score of the twenty-variable model, it did not outperform the selected model on the full dataset, making the selected model a better choice.

#### Results

| **Variation** | **Training Score (R^2)** | **Full Dataset Score (R^2)** |
| --- | --- | --- |
| Selected Model | 0.914218 | 0.866598 |
| All 20 Variables | 1.000000 | 0.886920 |
| Correlation \> 0.7 | 0.787535 | 0.752927 |
| Correlation \> 0.6 | 0.945267 | 0.862985 |
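A comparison like the one in the table above can be produced with a short loop over candidate feature subsets. The sketch below is an assumption-laden illustration rather than the project's code: the data is synthetic, the column names `x1` to `x4` and `y` are made up, and treating the first 100 rows as the training slice is an assumed split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing data (made-up columns x1..x4 and target y)
rng = np.random.default_rng(42)
n = 500
data = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
data["y"] = 3 * data["x1"] + 2 * data["x2"] + rng.normal(scale=0.5, size=n)

train = data.iloc[:100]  # assumed training slice; the project's split may differ

# Candidate feature subsets to compare (hypothetical names)
variations = {
    "x1 only": ["x1"],
    "x1 and x2": ["x1", "x2"],
    "all four": ["x1", "x2", "x3", "x4"],
}

rows = []
for name, cols in variations.items():
    # Fit on the training slice, then score on both the slice and the full data
    model = LinearRegression().fit(train[cols], train["y"])
    rows.append({
        "Variation": name,
        "Training Score": model.score(train[cols], train["y"]),
        "Full Dataset Score": model.score(data[cols], data["y"]),
    })

results = pd.DataFrame(rows)
print(results.round(6))
```

On the training slice, adding variables can never lower the score of an ordinary least-squares fit, which is why the full dataset column is the more informative one when choosing between variations.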

**Error Histograms and Scatterplots**

The skewness values of the error histograms for the training and full datasets were 0.235717 and 0.193989, respectively. This compares with the skew of 1.17 that Dr. Williams first calculated in the unaltered training dataset. These skew results are close to zero and show good symmetry in the distribution.

**Training Error Histogram**

![Final Project - Training Error Histogram.png](attachment:bcc198e5-3b85-45ff-8d03-9160611e3433.png)

**Full Dataset Error Histogram**

![Final Project - Full Error Histogram.png](attachment:d6fc31ab-c169-4c82-a02e-319f11b0a0f9.png)

The scatterplots for both datasets with the model applied are displayed below to demonstrate the linear relationship with best-fit lines and R^2 scores. As the numerical results show and the plots display, there is a good linear relationship between the model predictions and the actual sale prices in both datasets.

**Training Dataset Scatterplot**

![Final Project - Training Scatterplot.png](attachment:4970ea63-311e-4393-9320-e0b8d3b62a71.png)

**Full Dataset Scatterplot**

![Final Project - Full Scatterplot.png](attachment:ed4e3099-58c6-4a78-998d-66f6554f7ad7.png)

#### What can be derived from these results

1. Using the twenty variables produced a perfect training set predictive score of 1.0. This immediately raises red flags, as achieving such a score is improbable. There are three possible explanations:
   1. The model may have been overfitted to the data. The data was not selected in a biased way, so overfitting is probably not the issue here.
   2. There may be too many perfectly correlated variables. The correlations are known, and this is not the case.
   3. Errors or biases in the evaluation process may have led to an inflated predictive score. This last possibility is the most probable. Using the full set of twenty variables gives the highest full dataset predictive score, approximately two points higher than the selected model. Testing against a larger dataset is necessary to determine whether this is a good predictive set.
2. The twenty-variable model contains variables that do not appear to have predictive qualities, like _MoSold_ (Month of Sale) or some of the square footage variables, such as the wood deck or basement. Even though these variables did have correlation coefficients with the sale price, removing them from the predictive model did not significantly lower the score. This model is rejected based on the perfect score and the number of additional "fluff" variables.
3. Reducing the number of variables by restricting them to those with a correlation score greater than 0.7 resulted in much worse training and full dataset scores, with only moderate applicability. This variation produced the worst predictive scores of the four and is therefore rejected.
4. Expanding the number of variables based on a correlation score greater than 0.6 results in an excellent training dataset score of 0.945267. Excluding the perfect score, this is the best training set score among the models. The selected model had a training set score of 0.914218, approximately three points lower.
5. However, the selected model's full dataset predictive score is 0.866598, which is slightly higher than the "Correlation \> 0.6" model's score of 0.862985, approximately 0.4 of a point better.




[Back to top](#Index)

### 3.4 Verify Your Model Against Test Data

#### Code
```
# Read in the data and assign it to test
test = pd.read_csv('jtest.csv')

# Extract the bestCols from the new data and remove unnecessary variables
testX = test[bestCols]
testY = test['SalePrice']
testX = testX.drop(['SalePrice', 'WoodDeckSF', 'MoSold', 'TotalBsmtSF', 'BsmtFinSF1', '1stFlrSF', '2ndFlrSF', 'GarageYrBlt', 'LotFrontage', 'TotRmsAbvGrd', 'BedroomAbvGr', 'Fireplaces', 'YearRemodAdd'], axis=1)

# Replace NaNs with zeros (testX must be defined before it can be filled)
testX = testX.fillna(0)

# Fit a linear regression (model2) on the test data and generate predictions
lr = linear_model.LinearRegression()
model2 = lr.fit(testX, testY)
predictionsTest = model2.predict(testX)

# Display the model score
print(f"R^2 is: {model2.score(testX, testY)}")

# Error histogram
plt.hist(testY - predictionsTest)

# Calculate skewness of the differences
skewnessTest = skew(testY - predictionsTest)

# Display skewness as text on the plot
skew_text = f'Skewness = {skewnessTest:.6f}'
plt.text(0.6, 0.9, skew_text, transform=plt.gca().transAxes)
plt.show()

# Scatter plot for predictionsTest
plt.scatter(predictionsTest, testY, color='r')

# Calculate best-fit line coefficients for predictionsTest
coefficients_test = np.polyfit(predictionsTest, testY, 1)
slope_test = coefficients_test[0]
intercept_test = coefficients_test[1]

# Generate best-fit line for predictionsTest
best_fit_line_test = slope_test * predictionsTest + intercept_test

# Calculate R^2 value for predictionsTest
r2_value_test = r2_score(testY, predictionsTest)

# Display the best-fit line and the R^2 value as text for predictionsTest
r2_text_test = f'R^2 (Test) = {r2_value_test:.2f}'
plt.plot(predictionsTest, best_fit_line_test, color='b')
plt.text(0.7, 0.2, r2_text_test, transform=plt.gca().transAxes)
plt.show()
```

#### Results

The model score, R^2, for this new data is 0.817934 (rounded). This is a higher score than Dr. Williams's revised model achieved against the "jtest" dataset. When applied to the testing data, the R^2 model score dropped approximately five points from the full dataset score, so the model is less predictive here than on the full or training sets. This drop is significant, whereas a difference between the training and full dataset scores would be expected given the size of the training set.

However, R^2 on its own does not determine whether the model is a good predictive model. Other measures, such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE), could be calculated to get a better understanding of the model's performance.
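As a rough sketch of how those additional measures could be computed, the example below uses `mean_squared_error` and `mean_absolute_error` from `sklearn.metrics` on a handful of made-up sale prices. The arrays `actual` and `predicted` are illustrative assumptions; in the project they would correspond to `testY` and `predictionsTest`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up sale prices and predictions, for illustration only
actual = np.array([200000, 150000, 300000, 250000])
predicted = np.array([210000, 140000, 280000, 250000])

mse = mean_squared_error(actual, predicted)   # mean of the squared errors
rmse = np.sqrt(mse)                           # same units as the sale price
mae = mean_absolute_error(actual, predicted)  # mean of the absolute errors

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.2f}")
print(f"MAE:  {mae:,.0f}")
```

For these made-up values the absolute errors are 10,000, 10,000, 20,000, and 0, so the MAE is 10,000 and the RMSE is about 12,247. Because the RMSE squares each error before averaging, the single 20,000 miss pulls it above the MAE.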

In addition, further testing on other similar datasets would permit further model refinement. Given the data for those eight variables, the model still explains about 81.8% of the variance in a home's sale price in the test data. The model has performed as follows for each dataset:

| **Dataset** | **Model Score** |
| --- | --- |
| Training | 0.914218 |
| Full | 0.866598 |
| Test | 0.817934 |

The Error Histogram for this data shows a uniform distribution with a skewness factor of -0.032614. Although the skew is negative, it is very close to zero. Skew is just one factor affecting the distribution, but this shows good symmetry for this distribution. Compared with the skewness of the error histograms for the other two datasets, the skewness moved closer to the zero line, showing an improvement in the symmetry of the distribution curve and a lowering of the error.

![Final Project - Test Error Histogram.png](attachment:80b6f87b-a082-4522-9412-d6f77b539852.png)

| **Dataset** | **Skew** |
| --- | --- |
| Training | 0.235717 |
| Full | 0.193989 |
| Test | -0.032614 |

The scatterplot for the data displays the R^2 value for the data when the model is applied and a best-fit line for the data.

**Test Dataset Scatterplot**

![Final Project - Test Scatterplot.png](attachment:12834698-2c25-4431-a234-955d505a29d9.png)
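To make the skew figures in the table above more concrete, the following sketch applies `scipy.stats.skew` (the function used in the code earlier in this section) to two small sets of made-up residuals. The arrays are assumptions for illustration and are not drawn from the project data.

```python
import numpy as np
from scipy.stats import skew

# Made-up residuals (actual minus predicted), for illustration only
symmetric_errors = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
right_tailed_errors = np.array([-1.0, -1.0, -1.0, 0.0, 8.0])

symmetric_skew = skew(symmetric_errors)        # errors balanced around the mean
right_tailed_skew = skew(right_tailed_errors)  # one large positive error

print(f"Symmetric residuals skew:    {symmetric_skew:.6f}")
print(f"Right-tailed residuals skew: {right_tailed_skew:.6f}")
```

Residuals that are balanced around their mean give a skew of zero, while a single large under-prediction produces a positive skew. A value near zero, as reported for the test dataset, is therefore consistent with a symmetric error distribution.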


[Back to top](#Index)

## Conclusion

To conclude, the project's goal was to develop and optimize an accurate multiple linear regression model for predicting home sale prices based on various variables. Several combinations of numerical data were utilized to evaluate what produced the best model score.

_Here are the main insights derived from this project_

**Model Performance:**

The performance of the predictive model was evaluated on three different datasets: the training dataset, the full dataset, and the test dataset. Each dataset yielded different R^2 scores, indicating how well the model fits the data. The training dataset achieved an R^2 score of 0.914218, the full dataset scored 0.866598, and the test dataset had an R^2 score of 0.817934.

**Approach Comparison:**

Among the alternative variations of the model, the strongest was the one using variables with a correlation score of 0.6 or higher. This variation included twelve variables, and it achieved the highest training dataset score apart from the perfect score (0.945) and a competitive full dataset score (0.863). The initial model with all twenty variables produced a suspiciously perfect training dataset score of 1.0, indicating potential overfitting or evaluation biases, so that approach is not recommended. However, the eight-variable model produced the highest full dataset score of the remaining models, 0.866598, and was chosen for this reason.

**Blind Data Test:**

While the model performed well on the training and full datasets, its R^2 score on the test dataset dropped to 0.817934. This emphasizes the importance of testing models on independent datasets to assess their generalization ability.

**Statistical Metrics:**

Although R^2 is a valuable metric for assessing the model's explanatory power, it's not the sole determinant of a model's quality. Further evaluation using metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) can provide a more comprehensive understanding of predictive accuracy and model robustness.

**Data Analysis Lessons:**

This project provided valuable lessons in data analysis techniques. It highlighted the importance of good feature selection, thorough data preprocessing and cleaning, and the impact of correlated variables on model performance.

**Continued Refinement:**

To further enhance the predictive model, additional, more extensive datasets need to be tested.

In summary, the model developed in this project provides a reasonable prediction of home sale prices based on the selected variables. However, there is room for improvement, and further analysis beyond the techniques used here is needed.


[Back to top](#Index)

## References

1. Hayes, Adam. “Multiple Linear Regression (MLR) Definition, Formula, and Example.” Investopedia, 2023. https://www.investopedia.com/terms/m/mlr.asp.
