# Ames Housing - OLS Linear Regression
- Author: Oliver Mueller
- Last update: 26.01.2024

## Initialize notebook
Load required packages and set up the workspace, e.g., set the theme for plotting.

In [1]:
import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf

In [2]:
plt.style.use('fivethirtyeight')

## Problem description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 76 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this dataset challenges you to predict the final price of each home. More: <https://www.kaggle.com/c/house-prices-advanced-regression-techniques>


## Load data

Load training data from CSV file.

In [3]:
data_train = pd.read_csv('data/train.csv')

In [4]:
data_train.head()

Unnamed: 0,house_id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ThreeSsnPorch,ScreenPorch,PoolArea,Fence,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,20,RL,80,10400,Pave,none,Reg,Lvl,AllPub,...,0,0,0,MnPrv,0,6,2009,WD,Family,152000
1,2,60,RL,0,28698,Pave,none,IR2,Low,AllPub,...,0,225,0,none,0,6,2009,WD,Abnorml,185000
2,3,90,RL,70,9842,Pave,none,Reg,Lvl,AllPub,...,0,0,0,none,0,3,2007,WD,Normal,101800
3,4,90,RL,60,7200,Pave,none,Reg,Lvl,AllPub,...,0,0,0,none,0,6,2009,WD,Normal,90000
4,5,190,RM,63,7627,Pave,none,Reg,Lvl,AllPub,...,0,0,0,none,0,10,2009,WD,Normal,94550


In [None]:
data_train.shape

In [None]:
data_train.columns

## Prepare data

Let us first focus on some easy-to-understand variables.

In [None]:
data_train = data_train[["SalePrice", "Neighborhood", "HouseStyle", "LotArea", "GrLivArea", "FullBath", "BedroomAbvGr", "KitchenAbvGr", "OverallQual", "OverallCond"]]

In [None]:
data_train.head()

## Exploratory data analysis

### Descriptive summary statistics

We can quickly calculate the most important summary statistics for a variable with *describe()*.

In [None]:
data_train["SalePrice"].describe()
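
As a small illustration of how to read the output of *describe()*, the sketch below applies it to a made-up Series. The values are hypothetical and not taken from the Ames data.

```python
import pandas as pd

# Hypothetical toy prices, not taken from the Ames data
prices = pd.Series([100_000, 150_000, 200_000, 250_000, 800_000], name="SalePrice")

summary = prices.describe()
print(summary)

# The "50%" row is the median; a mean well above the median
# hints at a right-skewed distribution
print(summary["mean"] > summary["50%"])
```

A single expensive house pulls the mean far above the median here, which is the pattern to look for in the real *SalePrice* summary.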

### Visualize distribution of single variables
In the following, we will use histograms and density plots to get a feel for the distribution of our main variables. See https://seaborn.pydata.org/generated/seaborn.displot.html#seaborn.displot for more information.

Let's first look at the dependent variable (*SalePrice*).

In [None]:
sns.displot(data_train, x="SalePrice", kde=True)
plt.show()

Let's look at some numerical independent variables. We will start with *GrLivArea*.

In [None]:
sns.displot(data_train, x="GrLivArea", kde=True)
plt.show()

Next, let's look at the variable *OverallQual*.

In [None]:
# YOUR CODE HERE

We can also visualize the distribution of categorical variables using *catplot()*. See https://seaborn.pydata.org/generated/seaborn.catplot.html#seaborn.catplot for more information.

Let's start with *Neighborhood*.

In [None]:
sns.catplot(data_train, x="Neighborhood", kind="count")
plt.xticks(rotation=90)
plt.show()

Let's do the same for *HouseStyle*.

In [None]:
# YOUR CODE HERE

### Visualize the relationship between the dependent variable (i.e., SalePrice) and numerical independent variables.

In the following, we will use scatter plots with linear trend lines to visually explore the relationship between *SalePrice* (Y axis) and various numerical independent variables (X axis). See https://seaborn.pydata.org/generated/seaborn.relplot.html#seaborn.relplot for more information.

*SalePrice* and *LotArea*.

In [None]:
sns.relplot(data=data_train, x="LotArea", y="SalePrice", alpha=0.3)
plt.show()

The function *regplot()* is an alternative to *relplot()* that also fits a linear regression model and draws the fitted line on the plot. See https://seaborn.pydata.org/generated/seaborn.regplot.html#seaborn.regplot for more information.

In [None]:
sns.regplot(data=data_train, x="GrLivArea", y="SalePrice", scatter_kws={'alpha':0.3}, line_kws={'color': 'red'})
plt.show()

### Visualize the relationship between the dependent variable (i.e., SalePrice) and categorical independent variables.

In the following, we will use multiple box plots to visually explore the relationship between *SalePrice* (Y axis) and various categorical independent variables (X axis).


*SalePrice* and *OverallQual*.

In [None]:
sns.catplot(data=data_train, x="OverallQual", y="SalePrice", kind="box")
plt.show()

*SalePrice* and *OverallCond*.

In [None]:
# YOUR CODE HERE

Violin plots are similar to box plots, except that they also show the probability density of the data at different values.

In [None]:
sns.catplot(data=data_train, x="OverallCond", y="SalePrice", kind="violin", height=5, aspect=2)
plt.xticks(rotation=90)
plt.show()

## Train linear models

After getting a feeling for the data, we are now ready to fit some linear regression models. We will use the *statsmodels* package, in particular its formula API, which lets us specify R-style formulas. See https://www.statsmodels.org/dev/example_formulas.html for more information.

#### Simple linear regression models

Let's start with a simple model that includes only one independent variable, e.g., *GrLivArea*.

In [None]:
mod_01 = smf.ols(formula='SalePrice ~ GrLivArea', data=data_train)
mod_01 = mod_01.fit()
print(mod_01.summary())
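
Besides the printed summary, the fitted results object exposes the individual estimates as attributes. The following is a minimal sketch on synthetic data; the column names *area* and *price* and the true coefficients are made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (not the Ames data): price = 20000 + 100 * area + noise
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=200)
toy = pd.DataFrame({
    "area": area,
    "price": 20_000 + 100 * area + rng.normal(0, 5_000, size=200),
})

res = smf.ols("price ~ area", data=toy).fit()

print(res.params)      # estimated intercept and slope
print(res.conf_int())  # 95% confidence intervals for both estimates
print(res.rsquared)    # share of the variance in price explained by the model
```

The slope estimate should land close to the value of 100 used to generate the data, which is a quick way to convince yourself that the coefficients are read correctly.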

Create a simple linear regression model with *BedroomAbvGr* as the only independent variable.

In [None]:
# YOUR CODE HERE

#### Multiple linear regression models

Let's create a multiple linear regression model with both *GrLivArea* and *BedroomAbvGr* as independent variables.

In [None]:
mod_03 = smf.ols(formula='SalePrice ~ GrLivArea + BedroomAbvGr', data=data_train)
mod_03 = mod_03.fit()
print(mod_03.summary2())

#### Categorical independent variables

Fit a linear regression model with a categorical independent variable (*HouseStyle*).

In [None]:
mod_04 = smf.ols(formula='SalePrice ~ GrLivArea + BedroomAbvGr + HouseStyle', data=data_train)
mod_04 = mod_04.fit()
print(mod_04.summary2())

In [None]:
data_train["HouseStyle"].value_counts()
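
To make the dummy coding behind a categorical variable visible, here is a small sketch with made-up data. The formula API creates one coefficient per level except for a reference level (by default the first level in sorted order), and each coefficient measures the difference from that reference level.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data with a three-level categorical variable
toy = pd.DataFrame({
    "price": [100, 110, 150, 160, 200, 210],
    "style": ["1Story", "1Story", "2Story", "2Story", "SLvl", "SLvl"],
})

res = smf.ols("price ~ style", data=toy).fit()

# Intercept = mean price of the reference level ("1Story");
# the other coefficients are differences from that mean
print(res.params)
```

In the Ames model the same logic applies: the coefficients for *HouseStyle* are price differences relative to the omitted reference style, holding the other variables constant.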

#### Interaction effects

Fit a linear regression model with an interaction term between two numerical independent variables.


In [None]:
mod_05 = smf.ols(formula='SalePrice ~ GrLivArea * LotArea', data=data_train)
mod_05 = mod_05.fit()
print(mod_05.summary2())
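
In a formula, *a * b* is shorthand for both main effects plus their product, i.e. *a + b + a:b*. The sketch below checks this on synthetic data; the variables *x1*, *x2* and *y* and their coefficients are invented for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known interaction effect of 0.5
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
toy["y"] = (1 + 2 * toy["x1"] + 3 * toy["x2"]
            + 0.5 * toy["x1"] * toy["x2"]
            + rng.normal(scale=0.1, size=100))

short = smf.ols("y ~ x1 * x2", data=toy).fit()
long = smf.ols("y ~ x1 + x2 + x1:x2", data=toy).fit()

print(short.params)  # Intercept, x1, x2 and the interaction term x1:x2
print(np.allclose(short.params.values, long.params.values))
```

If only the product term is wanted without the main effects, the colon form *x1:x2* can be used on its own.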

Fit a linear regression model with an interaction term between a categorical and a numerical independent variable. In the formula API, the *C()* function can be used to indicate that a numerical variable should be treated as categorical.

In [None]:
mod_06 = smf.ols(formula='SalePrice ~ GrLivArea * C(OverallQual)', data=data_train)
mod_06 = mod_06.fit()
print(mod_06.summary2())
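
The effect of *C()* is easiest to see by fitting the same variable once as a number and once as a category. This is a sketch with made-up data; the column names *qual* and *price* are placeholders.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: a numeric quality score with a non-linear effect on price
toy = pd.DataFrame({
    "qual":  [1, 1, 2, 2, 3, 3, 4, 4],
    "price": [10, 12, 20, 22, 60, 62, 100, 102],
})

as_number = smf.ols("price ~ qual", data=toy).fit()
as_category = smf.ols("price ~ C(qual)", data=toy).fit()

print(as_number.params)    # intercept plus a single slope for qual
print(as_category.params)  # intercept plus one coefficient per non-reference level
```

Treating the score as categorical costs more parameters but lets each quality level have its own effect, instead of forcing equal steps between levels.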

#### Log-transformation of the DV

Fit a linear regression model with a logarithmic transformation of the dependent variable. With *np.log()* we can specify the transformation on the fly.


In [None]:
mod_07a = smf.ols(formula='np.log(SalePrice) ~ BedroomAbvGr', data=data_train)
mod_07a = mod_07a.fit()
print(mod_07a.summary())

For comparison, the same model without the log-transformation.

In [None]:
mod_07b = smf.ols(formula='SalePrice ~ BedroomAbvGr', data=data_train)
mod_07b = mod_07b.fit()
print(mod_07b.summary())
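
With a log-transformed dependent variable, a coefficient b means that a one-unit increase in the predictor multiplies the price by exp(b), i.e. a change of about 100 * (exp(b) - 1) percent. The sketch below illustrates this on synthetic data in which each extra bedroom raises the price by roughly 10%; the data and column names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: each extra bedroom multiplies the price by about 1.10
rng = np.random.default_rng(2)
beds = rng.integers(1, 6, size=300)
price = 100_000 * 1.10 ** beds * np.exp(rng.normal(0, 0.05, size=300))
toy = pd.DataFrame({"beds": beds, "price": price})

res = smf.ols("np.log(price) ~ beds", data=toy).fit()

# Convert the coefficient into an approximate percentage change per bedroom
beta = res.params["beds"]
pct_change = 100 * (np.exp(beta) - 1)
print(round(pct_change, 1))

# predict() returns values on the log scale; np.exp maps them back
# to (approximate) prices
new = pd.DataFrame({"beds": [3]})
pred_price = np.exp(res.predict(new))
print(pred_price)
```

The same back-transformation with *np.exp()* is needed before predictions from a log-model are compared with actual sale prices.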

#### Polynomial transformation of the IVs

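Fit a linear regression model with a second-order polynomial term for *GrLivArea*. The square is wrapped in *I()* so that it is evaluated as ordinary Python arithmetic; a bare caret (^) inside a formula is passed to Python as bitwise XOR, not as exponentiation.

In [None]:
mod_08 = smf.ols(formula='SalePrice ~ GrLivArea + I(GrLivArea**2)', data=data_train)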
mod_08 = mod_08.fit()
print(mod_08.summary())
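
To see a squared term at work, here is a minimal sketch on synthetic data (not the Ames data; the columns *x* and *y* are made up) that recovers a known quadratic coefficient. It also prints what Python's caret operator computes on integers, since that operator is easily mistaken for a power.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known quadratic coefficient of 0.3
rng = np.random.default_rng(3)
x = rng.integers(0, 50, size=200)
toy = pd.DataFrame({
    "x": x,
    "y": 5 + 2 * x + 0.3 * x**2 + rng.normal(0, 1, size=200),
})

# I() protects the expression so that ** means arithmetic power
res = smf.ols("y ~ x + I(x**2)", data=toy).fit()
print(res.params)

# In plain Python, ^ is bitwise XOR, ** is exponentiation
print(3 ^ 2, 3 ** 2)
```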

## Kaggle competition

Let us now use the trained models to make predictions on the test data and submit the results to the Kaggle competition.

### Load test set

In [None]:
data_test = pd.read_csv('data/test.csv')

In [None]:
data_test.head()

### Make predictions

Use a trained model to make predictions for the test set.

In [None]:
# make predictions on test set
preds = mod_03.predict(data_test)

In [None]:
preds

### Create submission file

In [None]:
my_submission = pd.DataFrame({'HouseId': data_test["house_id"], 'SalePrice': preds})

In [None]:
my_submission.head()

In [None]:
my_submission.to_csv('submission.csv', index=False)
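
Before uploading, it can be worth checking the submission table for obvious problems. The helper below is a hypothetical sketch, not part of the competition tooling; it assumes the column names *HouseId* and *SalePrice* used above and is demonstrated on made-up tables.

```python
import numpy as np
import pandas as pd

def check_submission(submission, id_col="HouseId", target_col="SalePrice"):
    """Return a list of problems found in a submission table (hypothetical helper)."""
    problems = []
    if submission[id_col].duplicated().any():
        problems.append("duplicate ids")
    if submission[target_col].isna().any():
        problems.append("missing predictions")
    if (submission[target_col] <= 0).any():
        problems.append("non-positive prices")
    return problems

# Made-up examples: one clean table and one with several problems
good = pd.DataFrame({"HouseId": [1, 2, 3],
                     "SalePrice": [150000.0, 98000.0, 210500.0]})
bad = pd.DataFrame({"HouseId": [1, 1, 3],
                    "SalePrice": [150000.0, np.nan, -5.0]})

print(check_submission(good))
print(check_submission(bad))
```

Calling *check_submission(my_submission)* would apply the same checks to the real submission; an empty list means none of these three problems was found.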