# EDA and Cleaning - Ames Housing Data

This notebook contains all data cleaning and exploratory data analysis (EDA) performed on the Ames Housing data.

## Initial comments from data description review

- "There are 5 observations that an instructor may wish to remove from the data set before giving it to students (a plot of SALE PRICE versus GR LIV AREA will indicate them quickly). Three of them are true outliers (Partial Sales that likely don’t represent actual market values) and two of them are simply unusual sales (very large houses priced relatively appropriately). **I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these 5 unusual observations)** before assigning it to students."

- "... if the purpose is to once again create a common use model to estimate a “typical” sale, it is in the modeler’s best interest to remove any observations that do not seem typical **(such as foreclosures or family sales)**."
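The first recommendation above amounts to a simple threshold filter on living area. A minimal sketch of that filter follows, using made-up rows rather than real Ames records; the names `toy`, `AREA_CUTOFF`, `flagged`, and `kept` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT values from the Ames data.
toy = pd.DataFrame({
    'Gr Liv Area': [1500, 2200, 4100, 5600, 3900],
    'SalePrice': [180000, 250000, 700000, 160000, 450000],
})

AREA_CUTOFF = 4000  # square-foot threshold quoted from the data description

# Rows the recommendation would remove (more than 4000 square feet)
flagged = toy[toy['Gr Liv Area'] > AREA_CUTOFF]

# Rows that remain after applying the recommendation
kept = toy[toy['Gr Liv Area'] <= AREA_CUTOFF]

print(f"{len(flagged)} flagged, {len(kept)} kept")
```

Inspecting `flagged` before dropping anything makes it easy to confirm that only the expected handful of observations is affected.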

## Interesting features after reading data description:

- Lot Shape
- Land Contour
- Lot Config
- Neighborhood
- Year Built
- Year Remod/Add
- Exter Qual
- Exter Cond
- Overall Qual
- Overall Cond
- Gr Liv Area
- Bedroom AbvGr
- Kitchen Qual
- Garage Area
- Garage Qual
- Garage Cond
- Mo Sold
- Yr Sold
- Sale Type
- Sale Condition

In [9]:
# Import the usual suspects
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [10]:
# Import data
test = pd.read_csv('../data/test.csv')
train = pd.read_csv('../data/train.csv')
test.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,...,0,0,0,,,,0,4,2006,WD
1,2718,905108090,90,RL,,9662,Pave,,IR1,Lvl,...,0,0,0,,,,0,8,2006,WD
2,2414,528218130,60,RL,58.0,17104,Pave,,IR1,Lvl,...,0,0,0,,,,0,9,2006,New
3,1989,902207150,30,RM,60.0,8520,Pave,,Reg,Lvl,...,0,0,0,,,,0,7,2007,WD
4,625,535105100,20,RL,,9500,Pave,,IR1,Lvl,...,0,185,0,,,,0,7,2009,WD


In [11]:
# List interesting features from reading data description
interesting = [
    'Lot Shape',
    'Land Contour',
    'Lot Config',
    'Neighborhood',
    'Year Built',
    'Year Remod/Add',
    'Exter Qual',
    'Exter Cond',
    'Overall Qual',
    'Overall Cond',
    'Gr Liv Area',
    'Bedroom AbvGr',
    'Kitchen Qual',
    'Garage Area',
    'Garage Qual',
    'Garage Cond',
    'Mo Sold',
    'Yr Sold',
    'Sale Type',
    'SalePrice',
]

In [12]:
# Keep only interesting features
train = train[interesting]


In [14]:
# Check nulls and dtypes
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878 entries, 0 to 877
Data columns (total 80 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               878 non-null    int64  
 1   PID              878 non-null    int64  
 2   MS SubClass      878 non-null    int64  
 3   MS Zoning        878 non-null    object 
 4   Lot Frontage     718 non-null    float64
 5   Lot Area         878 non-null    int64  
 6   Street           878 non-null    object 
 7   Alley            58 non-null     object 
 8   Lot Shape        878 non-null    object 
 9   Land Contour     878 non-null    object 
 10  Utilities        878 non-null    object 
 11  Lot Config       878 non-null    object 
 12  Land Slope       878 non-null    object 
 13  Neighborhood     878 non-null    object 
 14  Condition 1      878 non-null    object 
 15  Condition 2      878 non-null    object 
 16  Bldg Type        878 non-null    object 
 17  House Style     

In [15]:
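# Restrict to numeric columns; recent pandas versions raise an error on object columns otherwise
train.corr(numeric_only=True)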

Unnamed: 0,Year Built,Year Remod/Add,Overall Qual,Overall Cond,Gr Liv Area,Bedroom AbvGr,Garage Area,Mo Sold,Yr Sold,SalePrice
Year Built,1.0,0.629116,0.602964,-0.370988,0.258838,-0.042149,0.487177,-0.007083,-0.003559,0.571849
Year Remod/Add,0.629116,1.0,0.584654,0.042614,0.322407,-0.019748,0.398999,0.011568,0.042744,0.55037
Overall Qual,0.602964,0.584654,1.0,-0.08277,0.566701,0.053373,0.563814,0.019242,-0.011578,0.800207
Overall Cond,-0.370988,0.042614,-0.08277,1.0,-0.109804,-0.009908,-0.137917,-0.003144,0.047664,-0.097019
Gr Liv Area,0.258838,0.322407,0.566701,-0.109804,1.0,0.507579,0.490949,0.049644,-0.015891,0.697038
Bedroom AbvGr,-0.042149,-0.019748,0.053373,-0.009908,0.507579,1.0,0.06994,0.068281,-0.011692,0.137067
Garage Area,0.487177,0.398999,0.563814,-0.137917,0.490949,0.06994,1.0,0.009964,-0.003589,0.65027
Mo Sold,-0.007083,0.011568,0.019242,-0.003144,0.049644,0.068281,0.009964,1.0,-0.147494,0.032735
Yr Sold,-0.003559,0.042744,-0.011578,0.047664,-0.015891,-0.011692,-0.003589,-0.147494,1.0,-0.015203
SalePrice,0.571849,0.55037,0.800207,-0.097019,0.697038,0.137067,0.65027,0.032735,-0.015203,1.0


Features most strongly correlated with 'SalePrice' (r > 0.5): 'Overall Qual' (0.80), 'Gr Liv Area' (0.70), 'Garage Area' (0.65), 'Year Built' (0.57), 'Year Remod/Add' (0.55)
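A shortlist like this can also be pulled out of the correlation matrix programmatically. The sketch below is one way to do it; the helper name `top_correlated`, the 0.5 cutoff, and the toy columns are assumptions made for illustration and are not part of the original analysis.

```python
import pandas as pd

def top_correlated(df, target, threshold=0.5):
    """Return numeric columns whose absolute correlation with `target`
    exceeds `threshold`, strongest first."""
    corr = df.select_dtypes(include='number').corr()[target].drop(target)
    strong = corr[corr.abs() > threshold]
    # Order by absolute correlation, largest first
    return strong.reindex(strong.abs().sort_values(ascending=False).index)

# Toy frame for illustration only (hypothetical column names and values)
toy = pd.DataFrame({
    'size': [1, 2, 3, 4, 5],
    'noise': [1, -1, 1, -1, 1],
    'label': ['a', 'b', 'c', 'd', 'e'],  # non-numeric, ignored
    'price': [2, 4, 6, 8, 10],
})

shortlist = top_correlated(toy, 'price')
print(shortlist)
```

On the real data the equivalent call would be `top_correlated(train, 'SalePrice')`, assuming `train` still holds the columns shown in the matrix above.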

### **Might be good to include a garage y/n column**
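A minimal sketch of such an indicator column, assuming that a missing or zero 'Garage Area' means the house has no garage. The column name 'Has Garage' and the toy values are hypothetical.

```python
import pandas as pd

# Toy values for illustration only; not taken from the Ames data.
toy = pd.DataFrame({'Garage Area': [480.0, 0.0, None, 250.0]})

# Assumption: missing or zero area means "no garage".
# 1 = has a garage, 0 = no garage.
toy['Has Garage'] = (toy['Garage Area'].fillna(0) > 0).astype(int)

print(toy)
```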

## Dropping rows with 'Gr Liv Area' > 4000 per data description suggestion.

In [16]:
# .copy() gives an independent frame, so later edits do not trigger SettingWithCopyWarning
train = train[train['Gr Liv Area'] < 4000].copy()
# test = test[test['Gr Liv Area'] < 4000]

In [17]:
# plt.figure(figsize=(10,10))
# sns.pairplot(train, corner=True)
# ;

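'Bedroom AbvGr', 'Mo Sold' and 'Yr Sold' seem to be evenly distributed across sale prices and are only weakly correlated with 'SalePrice' (0.14, 0.03 and -0.02 in the matrix above), so I will drop them.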

In [18]:
# Reassign rather than use inplace=True, since train was created by slicing
train = train.drop(columns=['Bedroom AbvGr', 'Mo Sold', 'Yr Sold'])
test = test.drop(columns=['Bedroom AbvGr', 'Mo Sold', 'Yr Sold'])

In [19]:
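# Assumes a missing garage area means no garage (0 square feet).
# Assign the result back: inplace fillna on a selected column is chained assignment
# and may not update the frame.
train['Garage Area'] = train['Garage Area'].fillna(0)
test['Garage Area'] = test['Garage Area'].fillna(0)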

In [20]:
test['Garage Area'].isna().sum()

0

In [21]:
# Model on the five features most correlated with SalePrice
features = ['Overall Qual', 'Gr Liv Area', 'Garage Area', 'Year Built', 'Year Remod/Add']
X_test = test[features]
X = train[features]
y = train['SalePrice']

In [22]:
X_test.shape

(878, 5)

In [23]:
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(X, y)

LinearRegression()

In [24]:
preds = linreg.predict(X_test)

In [25]:
# R^2 on the training data (not a held-out score)
linreg.score(X, y)

0.7940701022936076
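The score above is R² computed on the same rows the model was fit on, so it can overstate how well the model generalizes. One common check is to hold out part of the training data, sketched below on synthetic data. Everything in the block (`X_demo`, `y_demo`, the coefficients, the 25% split) is an assumption for illustration and not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data for illustration only
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Hold out 25% of the rows for validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_tr, y_tr)

train_r2 = model.score(X_tr, y_tr)   # R^2 on rows used for fitting
val_r2 = model.score(X_val, y_val)   # R^2 on rows the model has not seen

print(f"train R^2: {train_r2:.3f}, validation R^2: {val_r2:.3f}")
```

With the notebook's variables, the same pattern would split `X` and `y` before fitting `linreg`.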

In [26]:
test['SalePrice'] = preds

In [27]:
preds.shape

(878,)

In [28]:
test[['Id', 'SalePrice']].to_csv('../data/submission_linreg1.csv', index=False)

In [29]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

X_scaled = ss.fit_transform(X)
X_test_scaled = ss.transform(X_test)

In [30]:
linreg.fit(X_scaled, y)
preds_scaled = linreg.predict(X_test_scaled)

In [32]:
# Use the predictions from the model fit on scaled features
test['SalePrice'] = preds_scaled

In [33]:
test[['Id', 'SalePrice']].to_csv('../data/submission_linreg_scaled.csv', index=False)
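Standardizing features does not change the predictions of an ordinary least squares model with an intercept. The fit only shifts and rescales the coefficients, so the scaled and unscaled submissions should match up to floating-point error. The sketch below checks this on synthetic data; all names and values in it are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustration only)
rng = np.random.default_rng(0)
X_fit = rng.normal(loc=[5, 1500, 400], scale=[1.5, 500, 200], size=(100, 3))
X_new = rng.normal(loc=[5, 1500, 400], scale=[1.5, 500, 200], size=(20, 3))
y_fit = X_fit @ np.array([20000.0, 80.0, 60.0]) + rng.normal(scale=10000, size=100)

# Model 1: raw features
preds_raw = LinearRegression().fit(X_fit, y_fit).predict(X_new)

# Model 2: standardized features (scaler fit on the training rows only)
scaler = StandardScaler().fit(X_fit)
preds_std = (
    LinearRegression()
    .fit(scaler.transform(X_fit), y_fit)
    .predict(scaler.transform(X_new))
)

# The two sets of predictions agree up to floating-point error
print(np.allclose(preds_raw, preds_std))
```

Scaling starts to matter once a penalty on coefficient size is added (for example Ridge or Lasso), because the penalty depends on the units of each feature.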