# Initial Cleaning

In this notebook, I do the initial cleaning of the training data to prepare it for EDA and initial feature engineering.

In [291]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [412]:
df = pd.read_csv('./datasets/train.csv')

In [413]:
df.shape

(2051, 81)

In [414]:
df.isnull().sum()

Id                    0
PID                   0
MS SubClass           0
MS Zoning             0
Lot Frontage        330
Lot Area              0
Street                0
Alley              1911
Lot Shape             0
Land Contour          0
Utilities             0
Lot Config            0
Land Slope            0
Neighborhood          0
Condition 1           0
Condition 2           0
Bldg Type             0
House Style           0
Overall Qual          0
Overall Cond          0
Year Built            0
Year Remod/Add        0
Roof Style            0
Roof Matl             0
Exterior 1st          0
Exterior 2nd          0
Mas Vnr Type         22
Mas Vnr Area         22
Exter Qual            0
Exter Cond            0
Foundation            0
Bsmt Qual            55
Bsmt Cond            55
Bsmt Exposure        58
BsmtFin Type 1       55
BsmtFin SF 1          1
BsmtFin Type 2       56
BsmtFin SF 2          1
Bsmt Unf SF           1
Total Bsmt SF         1
Heating               0
Heating QC      

Because there are so many columns, I wanted to list only the columns that contain null values. My own experiments didn't work, so I did some research: this [Stack Overflow question](https://stackoverflow.com/questions/53137100/filter-pandas-dataframe-columns-with-null-data) showed me the approach I use below.

In [415]:
df.columns[df.isna().any()].tolist()

['Lot Frontage',
 'Alley',
 'Mas Vnr Type',
 'Mas Vnr Area',
 'Bsmt Qual',
 'Bsmt Cond',
 'Bsmt Exposure',
 'BsmtFin Type 1',
 'BsmtFin SF 1',
 'BsmtFin Type 2',
 'BsmtFin SF 2',
 'Bsmt Unf SF',
 'Total Bsmt SF',
 'Bsmt Full Bath',
 'Bsmt Half Bath',
 'Fireplace Qu',
 'Garage Type',
 'Garage Yr Blt',
 'Garage Finish',
 'Garage Cars',
 'Garage Area',
 'Garage Qual',
 'Garage Cond',
 'Pool QC',
 'Fence',
 'Misc Feature']
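
A small variation on the cell above, as a sketch only: it keeps the null count next to each column name instead of returning just the names. The `toy` frame and its values are made up for illustration and are not taken from `train.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are invented).
toy = pd.DataFrame({
    'Lot Frontage': [60.0, np.nan, 80.0],
    'Alley': [np.nan, np.nan, 'Grvl'],
    'Lot Area': [8450, 9600, 11250],
})

null_counts = toy.isna().sum()
# Keep only the columns with at least one null, largest count first.
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```

Sorting by count makes it easier to see which columns need a decision first.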

In [416]:
df.columns = df.columns.str.replace(' ', '_').str.lower()

In [417]:
df.head(2)

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,109,533352170,60,RL,,13517,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,6,8,1976,2005,Gable,CompShg,HdBoard,Plywood,BrkFace,289.0,Gd,TA,CBlock,TA,TA,No,GLQ,533.0,Unf,0.0,192.0,725.0,GasA,Ex,Y,SBrkr,725,754,0,1479,0.0,0.0,2,1,3,1,Gd,6,Typ,0,,Attchd,1976.0,RFn,2.0,475.0,TA,TA,Y,0,44,0,0,0,0,,,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,7,5,1996,1997,Gable,CompShg,VinylSd,VinylSd,BrkFace,132.0,Gd,TA,PConc,Gd,TA,No,GLQ,637.0,Unf,0.0,276.0,913.0,GasA,Ex,Y,SBrkr,913,1209,0,2122,1.0,0.0,2,1,4,1,Gd,8,Typ,1,TA,Attchd,1997.0,RFn,2.0,559.0,TA,TA,Y,0,74,0,0,0,0,,,,0,4,2009,WD,220000


## Data Dictionary

I reference the [data dictionary](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt) a lot in the following.

### Alley - NaN --> 'None'

In [418]:
df['alley'].value_counts(dropna = False)

NaN     1911
Grvl      85
Pave      55
Name: alley, dtype: int64

It appears that NaN means there is no alley. I'm going to replace NaN with 'None' for now. 

In [419]:
df['alley'] = df['alley'].fillna('None')

df['alley'].value_counts(dropna = False)

None    1911
Grvl      85
Pave      55
Name: alley, dtype: int64

### Basements!

I created the for-loop below to count the NAs in each basement column and help me decide how to deal with them, since there's presumably at least some relationship between them. Rather than copying the code over and over, I also returned to this loop as a sanity check while I cleaned the basement columns.

In [420]:
bsmt_list = df.columns[df.columns.str.contains('bsmt')].tolist()

bsmt_list

['bsmt_qual',
 'bsmt_cond',
 'bsmt_exposure',
 'bsmtfin_type_1',
 'bsmtfin_sf_1',
 'bsmtfin_type_2',
 'bsmtfin_sf_2',
 'bsmt_unf_sf',
 'total_bsmt_sf',
 'bsmt_full_bath',
 'bsmt_half_bath']

In [421]:
for i in bsmt_list:
    print(f'{i} has {df[i].isnull().sum()}')

bsmt_qual has 55
bsmt_cond has 55
bsmt_exposure has 58
bsmtfin_type_1 has 55
bsmtfin_sf_1 has 1
bsmtfin_type_2 has 56
bsmtfin_sf_2 has 1
bsmt_unf_sf has 1
total_bsmt_sf has 1
bsmt_full_bath has 2
bsmt_half_bath has 2


I was surprised by the variation in the number of NaNs. Knowing that some of the NaNs actually represent "No Basement", I thought there'd be a pretty consistent number.

I decided to start with the columns that have the fewest NaNs to see what's going on.
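
One way to see why the counts differ is to count rows by their pattern of missing values, rather than column by column. This is a minimal sketch on an invented `toy` frame with three of the basement column names; the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows: one complete, two missing only the categorical columns,
# one missing everything, one missing only bsmt_exposure.
toy = pd.DataFrame({
    'bsmt_qual': ['TA', np.nan, np.nan, 'Gd', np.nan],
    'bsmt_exposure': ['No', np.nan, np.nan, np.nan, np.nan],
    'total_bsmt_sf': [725.0, 0.0, np.nan, 913.0, 0.0],
})

mask = toy.isna()
# Group the True/False mask by all of its columns: each group is one
# distinct missingness pattern, and size() is how many rows share it.
patterns = mask.groupby(list(mask.columns)).size()
print(patterns)
```

Rows that are missing in every column and rows that are missing in only one show up as separate patterns, which is the kind of difference the per-column counts hide.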

### Basement Full Bath and Half Bath

In [422]:
df[df['bsmt_full_bath'].isna()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath
616,,,,,0.0,,0.0,0.0,0.0,,
1327,,,,,,,,,,,


In [423]:
df['bsmt_full_bath'].value_counts(dropna = False)

0.0    1200
1.0     824
2.0      23
NaN       2
3.0       2
Name: bsmt_full_bath, dtype: int64

I ran the two cells above to get a sense of what was going on with these instances. It appears these NaNs should be 0.0, reflecting no basement. The same appears to be true for the half baths, which I check below.

In [424]:
df['bsmt_full_bath'] = df['bsmt_full_bath'].fillna(0.0)

In [425]:
df['bsmt_full_bath'].value_counts(dropna = False)

0.0    1202
1.0     824
2.0      23
3.0       2
Name: bsmt_full_bath, dtype: int64

In [426]:
df[df['bsmt_half_bath'].isna()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath
616,,,,,0.0,,0.0,0.0,0.0,0.0,
1327,,,,,,,,,,0.0,


In [427]:
df['bsmt_half_bath'] = df['bsmt_half_bath'].fillna(0.0)

In [428]:
df['bsmt_half_bath'].value_counts(dropna = False)

0.0    1925
1.0     122
2.0       4
Name: bsmt_half_bath, dtype: int64

### Basement Sq Ft, Finished (both types) & Unfinished, Total

In [429]:
df[df['bsmt_unf_sf'].isna()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath
1327,,,,,,,,,,0.0,0.0


Having looked at the values in other rows, it appears this should also be 0.0, for no basement.

In [430]:
df['bsmt_unf_sf'] = df['bsmt_unf_sf'].fillna(0.0)
df['bsmt_unf_sf'].isnull().sum()

0

In [431]:
df[df['bsmtfin_sf_1'].isnull()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath
1327,,,,,,,,0.0,,0.0,0.0


In [432]:
df['bsmtfin_sf_1'] = df['bsmtfin_sf_1'].fillna(0.0)
df['bsmtfin_sf_1'].isnull().sum()

0

In [433]:
df[df['bsmtfin_sf_2'].isnull()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath
1327,,,,,0.0,,,0.0,,0.0,0.0


In [434]:
df['bsmtfin_sf_2'] = df['bsmtfin_sf_2'].fillna(0.0)
df['bsmtfin_sf_2'].isnull().sum()

0

In [435]:
df[df['total_bsmt_sf'].isnull()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath
1327,,,,,0.0,,0.0,0.0,,0.0,0.0


In [436]:
df['total_bsmt_sf'] = df['total_bsmt_sf'].fillna(0.0)
df['total_bsmt_sf'].isnull().sum()

0

In [437]:
df['total_bsmt_sf']

0        725.0
1        913.0
2       1057.0
3        384.0
4        676.0
         ...  
2046    1884.0
2047     861.0
2048     896.0
2049    1200.0
2050     994.0
Name: total_bsmt_sf, Length: 2051, dtype: float64
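
A possible follow-up check, sketched under an assumption: in the rows displayed in this notebook, `total_bsmt_sf` equals the sum of the two finished areas and the unfinished area (for example 533 + 0 + 192 = 725). If that relationship is meant to hold everywhere, any row that breaks it after the fills would be worth a look. The first two rows of `toy` copy values shown above; the third is an invented no-basement row.

```python
import pandas as pd

toy = pd.DataFrame({
    'bsmtfin_sf_1': [533.0, 637.0, 0.0],
    'bsmtfin_sf_2': [0.0, 0.0, 0.0],
    'bsmt_unf_sf': [192.0, 276.0, 0.0],
    'total_bsmt_sf': [725.0, 913.0, 0.0],
})

# Sum the three component areas row by row.
parts = toy[['bsmtfin_sf_1', 'bsmtfin_sf_2', 'bsmt_unf_sf']].sum(axis=1)
# Rows where the stated total disagrees with the sum of its parts.
mismatch = toy[parts != toy['total_bsmt_sf']]
print(len(mismatch))
```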

### Basement Quality

The dictionary specifies that 'NA' means 'No Basement', so I'm going to fill it in as such. This appears to be true for all the categorical basement columns.

In [438]:
df['bsmt_qual'].value_counts(dropna = False)

TA     887
Gd     864
Ex     184
Fa      60
NaN     55
Po       1
Name: bsmt_qual, dtype: int64

In [439]:
df['bsmt_qual'] = df['bsmt_qual'].fillna('No Basement')

In [440]:
df['bsmt_qual'].value_counts(dropna = False)

TA             887
Gd             864
Ex             184
Fa              60
No Basement     55
Po               1
Name: bsmt_qual, dtype: int64

### Basement Condition

The data dictionary says that 'NA' means no basement, so I'm making that explicit.

In [441]:
df['bsmt_cond'] = df['bsmt_cond'].fillna('No Basement')
df['bsmt_cond'].isnull().sum()

0

In [442]:
df['bsmt_cond'].value_counts()

TA             1834
Gd               89
Fa               65
No Basement      55
Po                5
Ex                3
Name: bsmt_cond, dtype: int64

### Basement Exposure
The dictionary says 'NA' means 'No Basement', so I'm making that explicit.

In [443]:
df['bsmt_exposure'] = df['bsmt_exposure'].fillna('No Basement')
df['bsmt_exposure'].isna().sum()

0

In [444]:
df['bsmt_exposure'].value_counts()

No             1339
Av              288
Gd              203
Mn              163
No Basement      58
Name: bsmt_exposure, dtype: int64

### Basement Finish Type 1 and 2
The dictionary says 'NA' means 'No Basement', so I'm making that explicit. There's one more null in Finish Type 2 than in Type 1, so I looked into that before making the change to 'No Basement'.

In [445]:
df['bsmtfin_type_1'] = df['bsmtfin_type_1'].fillna('No Basement')
df['bsmtfin_type_1'].value_counts()

GLQ            615
Unf            603
ALQ            293
BLQ            200
Rec            183
LwQ            102
No Basement     55
Name: bsmtfin_type_1, dtype: int64

In [446]:
df[(df['bsmtfin_type_2'].isnull()) & (df['bsmtfin_type_1'] != 'No Basement')][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath
1147,Gd,TA,No,GLQ,1124.0,,479.0,1603.0,3206.0,1.0,0.0


In the original data frame there's one instance with a null 'bsmtfin_type_2' that clearly has a basement based on the other entries. Since I have no way of knowing what type that should be, I'm deleting this instance.

In [447]:
df.drop(index = df[(df['bsmtfin_type_2'].isnull()) & (df['bsmtfin_type_1'] != 'No Basement')].index, inplace = True)

In [448]:
df[(df['bsmtfin_type_2'].isnull()) & (df['bsmtfin_type_1'] != 'No Basement')][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


Now making the change to 'No Basement' on Finish Type 2.

In [449]:
df['bsmtfin_type_2'] = df['bsmtfin_type_2'].fillna('No Basement')
df['bsmtfin_type_2'].value_counts()

Unf            1749
Rec              80
LwQ              60
No Basement      55
BLQ              48
ALQ              35
GLQ              23
Name: bsmtfin_type_2, dtype: int64

Checking in on which columns still have NaNs.

In [450]:
df.columns[df.isnull().any()].tolist()

['lot_frontage',
 'mas_vnr_type',
 'mas_vnr_area',
 'fireplace_qu',
 'garage_type',
 'garage_yr_blt',
 'garage_finish',
 'garage_cars',
 'garage_area',
 'garage_qual',
 'garage_cond',
 'pool_qc',
 'fence',
 'misc_feature']

### Garage Columns

In [451]:
df['garage_area'].head(2)

0    475.0
1    559.0
Name: garage_area, dtype: float64

In [452]:
garage_list = df.columns[df.columns.str.contains('garage')].tolist()
garage_list

['garage_type',
 'garage_yr_blt',
 'garage_finish',
 'garage_cars',
 'garage_area',
 'garage_qual',
 'garage_cond']

Again, I recycled the below for-loop as a final sanity check, returning to this line repeatedly as I worked, rather than rewriting the loop for every step I took.

In [453]:
for i in garage_list:
    print(f'{i} has {df[i].isnull().sum()} nulls')

garage_type has 113 nulls
garage_yr_blt has 114 nulls
garage_finish has 114 nulls
garage_cars has 1 nulls
garage_area has 1 nulls
garage_qual has 114 nulls
garage_cond has 114 nulls


I did a search to find the not null method for the following, which turned up the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notnull.html).

In [454]:
df[(df['garage_yr_blt'].isnull()) & (df['garage_type'].notnull())][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
1712,Detchd,,,,,,


A garage_type is listed for this one instance, but the other garage columns are all NaN, and for several of them the dictionary says NaN means 'No Garage'. I'm going to operate on the assumption that the garage type is in error here and change it to 'No Garage'.

In [455]:
df.reset_index()

Unnamed: 0,index,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,0,109,533352170,60,RL,,13517,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,6,8,1976,2005,Gable,CompShg,HdBoard,Plywood,BrkFace,289.0,Gd,TA,CBlock,TA,TA,No,GLQ,533.0,Unf,0.0,192.0,725.0,GasA,Ex,Y,SBrkr,725,754,0,1479,0.0,0.0,2,1,3,1,Gd,6,Typ,0,,Attchd,1976.0,RFn,2.0,475.0,TA,TA,Y,0,44,0,0,0,0,,,,0,3,2010,WD,130500
1,1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,7,5,1996,1997,Gable,CompShg,VinylSd,VinylSd,BrkFace,132.0,Gd,TA,PConc,Gd,TA,No,GLQ,637.0,Unf,0.0,276.0,913.0,GasA,Ex,Y,SBrkr,913,1209,0,2122,1.0,0.0,2,1,4,1,Gd,8,Typ,1,TA,Attchd,1997.0,RFn,2.0,559.0,TA,TA,Y,0,74,0,0,0,0,,,,0,4,2009,WD,220000
2,2,153,535304180,20,RL,68.0,7922,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,7,1953,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,Gd,CBlock,TA,TA,No,GLQ,731.0,Unf,0.0,326.0,1057.0,GasA,TA,Y,SBrkr,1057,0,0,1057,1.0,0.0,1,0,3,1,Gd,5,Typ,0,,Detchd,1953.0,Unf,1.0,246.0,TA,TA,Y,0,52,0,0,0,0,,,,0,1,2010,WD,109000
3,3,318,916386060,60,RL,73.0,9802,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,2Story,5,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,384.0,384.0,GasA,Gd,Y,SBrkr,744,700,0,1444,0.0,0.0,2,1,3,1,TA,7,Typ,0,,BuiltIn,2007.0,Fin,2.0,400.0,TA,TA,Y,100,0,0,0,0,0,,,,0,4,2010,WD,174000
4,4,255,906425045,50,RL,82.0,14235,Pave,,IR1,Lvl,AllPub,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1.5Fin,6,8,1900,1993,Gable,CompShg,Wd Sdng,Plywood,,0.0,TA,TA,PConc,Fa,Gd,No,Unf,0.0,Unf,0.0,676.0,676.0,GasA,TA,Y,SBrkr,831,614,0,1445,0.0,0.0,2,0,3,1,TA,6,Typ,0,,Detchd,1957.0,Unf,2.0,484.0,TA,TA,N,0,59,0,0,0,0,,,,0,3,2010,WD,138500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2045,2046,1587,921126030,20,RL,79.0,11449,Pave,,IR1,HLS,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,1Story,8,5,2007,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,Gd,TA,PConc,Gd,TA,Av,GLQ,1011.0,Unf,0.0,873.0,1884.0,GasA,Ex,Y,SBrkr,1728,0,0,1728,1.0,0.0,2,0,3,1,Gd,7,Typ,1,Gd,Attchd,2007.0,Fin,2.0,520.0,TA,TA,Y,0,276,0,0,0,0,,,,0,1,2008,WD,298751
2046,2047,785,905377130,30,RL,,12342,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Edwards,Norm,Norm,1Fam,1Story,4,5,1940,1950,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,CBlock,TA,TA,No,BLQ,262.0,Unf,0.0,599.0,861.0,GasA,Ex,Y,SBrkr,861,0,0,861,0.0,0.0,1,0,1,1,TA,4,Typ,0,,Detchd,1961.0,Unf,2.0,539.0,TA,TA,Y,158,0,0,0,0,0,,,,0,3,2009,WD,82500
2047,2048,916,909253010,50,RL,57.0,7558,Pave,,Reg,Bnk,AllPub,Inside,Gtl,Crawfor,Norm,Norm,1Fam,1.5Fin,6,6,1928,1950,Gable,CompShg,BrkFace,Stone,,0.0,TA,TA,BrkTil,TA,TA,No,Unf,0.0,Unf,0.0,896.0,896.0,GasA,Gd,Y,SBrkr,1172,741,0,1913,0.0,0.0,1,1,3,1,TA,9,Typ,1,TA,Detchd,1929.0,Unf,2.0,342.0,Fa,Fa,Y,0,0,0,0,0,0,,,,0,3,2009,WD,177000
2048,2049,639,535179160,20,RL,80.0,10400,Pave,,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,4,5,1956,1956,Gable,CompShg,Plywood,Plywood,,0.0,TA,TA,CBlock,TA,TA,No,Rec,155.0,LwQ,750.0,295.0,1200.0,GasA,TA,Y,SBrkr,1200,0,0,1200,1.0,0.0,1,0,3,1,TA,6,Typ,2,Gd,Attchd,1956.0,Unf,1.0,294.0,TA,TA,Y,0,189,140,0,0,0,,,,0,11,2009,WD,144000


In [456]:
df['garage_area'].head(2)

0    475.0
1    559.0
Name: garage_area, dtype: float64

In [458]:
garage_index = df[(df['garage_yr_blt'].isnull()) & (df['garage_type'].notnull())].index

df.iloc[garage_index]['garage_type']

1713    Attchd
Name: garage_type, dtype: object
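
The lookup above returns the row labelled 1713 even though `garage_index` holds 1712. The likely cause is that the earlier drop of row 1147 left a gap in the index labels, and `.iloc` selects by position while `.loc` selects by label. Note also that `df.reset_index()` returns a new frame and leaves `df` unchanged unless the result is assigned. A minimal sketch of the same effect on an invented `toy` frame:

```python
import pandas as pd

toy = pd.DataFrame({'garage_type': ['Attchd', 'Detchd', 'BuiltIn', 'Detchd']})
toy = toy.drop(index=1)  # remaining index labels are 0, 2, 3

label = 2
# .loc looks up the row whose index label is 2.
by_label = toy.loc[label, 'garage_type']
# .iloc looks up the third row by position, whose label is 3.
by_position = toy.iloc[label]['garage_type']
print(by_label, by_position)
```

After a drop, `.loc` (or `.at`, as used a few cells later) is the safer choice when the value in hand is an index label.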

In [338]:
df['garage_type'].value_counts()

Attchd     1212
Detchd      536
BuiltIn     132
Basment      27
2Types       19
CarPort      11
Name: garage_type, dtype: int64

In [339]:
garage_index.astype(int)

Int64Index([1712], dtype='int64')

In [340]:
df.at[1712, 'garage_type'] = 'No Garage'

In [341]:
df[(df['garage_yr_blt'].isnull()) & (df['garage_type'].notnull())][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
1712,No Garage,,,,,,


In [342]:
df.iloc[garage_index][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
1713,Attchd,1952.0,RFn,1.0,280.0,TA,TA


In [343]:
df[(df['garage_yr_blt'].isnull()) & (df['garage_type'].notnull())][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
1712,No Garage,,,,,,


In [344]:
df['garage_type'] = df['garage_type'].fillna('No Garage')
   
df['garage_type'].isnull().sum()

0

The dictionary spells out that NA means 'No Garage' for the columns in "no_garage_list" below. This is the first place I copied the df, because I wanted a backup in case I made a mistake filling the NAs in multiple columns in one block; it's the first time I've done it that way. When I succeeded, I moved the copy_df code to afterwards to save the result before my next experiment.

In [345]:
no_garage_list = ['garage_finish', 'garage_qual', 'garage_cond']

df[no_garage_list] = df[no_garage_list].fillna('No Garage')

In [346]:
copy_df = df.copy()

In [347]:
df[df['garage_cars'].isnull()][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
1712,No Garage,,No Garage,,,No Garage,No Garage


In [348]:
df['garage_area'].head(2)

0    475.0
1    559.0
Name: garage_area, dtype: float64

# Error was here.
Watching the other presentations, I saw a couple of garage numerical features appear in most people's correlation heatmaps (and near the top, too, highly correlated!). I realized that when I was cleaning the data, in the following block of code I inserted a string into those columns, converting NaN to 'No Garage', thus turning the whole column into an 'object' column and taking it out of the correlations. Catching this mistake and fixing it made a meaningful difference to all of my models.

For 'garage_cars' and 'garage_area', I'll fill the NaNs with 0.

_HT: Joe._ I'm going to fill garage_yr_blt with its mean. This was inspired by Joe's approach of substituting the year the house was built for empty/0 'year built' features. Thinking it over, the mean, or possibly the median, seems like a better substitute, because using the year the house was built gives newer houses credit for having newer garages (etc.). Regardless of the specific year chosen, I wouldn't have thought of this without hearing Joe's solution, which I thought was pretty clever.

**The takeaway: I have to check all the columns for datatype and nulls when I think I've finished cleaning, as a sanity check to make sure that none of my cleaning had any unintended byproducts.**
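
A minimal sketch of what such a final check could look like. The helper name `cleaning_report` and the `toy` data are hypothetical and not part of the original notebook; the idea is to list the columns that still contain nulls and the columns that should be numeric but are not.

```python
import numpy as np
import pandas as pd

def cleaning_report(frame, expected_numeric):
    """Return (columns that still contain nulls,
    expected-numeric columns whose dtype is not numeric)."""
    with_nulls = frame.columns[frame.isna().any()].tolist()
    not_numeric = [
        col for col in expected_numeric
        if not pd.api.types.is_numeric_dtype(frame[col])
    ]
    return with_nulls, not_numeric

toy = pd.DataFrame({
    # A string slipped into a numeric column, so its dtype becomes object.
    'garage_cars': [2.0, 'No Garage', 1.0],
    'garage_area': [475.0, 0.0, 246.0],
    'fence': ['MnPrv', np.nan, 'No Fence'],
})
with_nulls, not_numeric = cleaning_report(toy, ['garage_cars', 'garage_area'])
print(with_nulls, not_numeric)
```

Run on the cleaned frame, a check like this would have flagged the garage columns as non-numeric before they dropped out of the correlation heatmap.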

In [349]:
# no_garage_list2 = ['garage_yr_blt', 'garage_cars', 'garage_area']

# df.at[1712, no_garage_list2] = 'No Garage'

In [350]:
df['garage_yr_blt'].dtypes

dtype('float64')

In [352]:
df['garage_yr_blt'] = df['garage_yr_blt'].fillna(df['garage_yr_blt'].mean())
df['garage_cars'] = df['garage_cars'].fillna(0)
df['garage_area'] = df['garage_area'].fillna(0)

df[garage_list].isnull().sum()

garage_type      0
garage_yr_blt    0
garage_finish    0
garage_cars      0
garage_area      0
garage_qual      0
garage_cond      0
dtype: int64
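
For comparison, here is a sketch of the three fill options weighed above for `garage_yr_blt`: the column mean (the one used in this notebook), the column median, and the house's own `year_built`. The `toy` values are invented. One thing the sketch makes visible is that the mean is usually a fractional year, which matches the 1978.695248 that shows up in later output.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'year_built': [1976, 1996, 1953, 2006],
    'garage_yr_blt': [1976.0, 1997.0, np.nan, 2007.0],
})

mean_fill = toy['garage_yr_blt'].fillna(toy['garage_yr_blt'].mean())
median_fill = toy['garage_yr_blt'].fillna(toy['garage_yr_blt'].median())
# fillna with a Series aligns on the index, so each gap takes
# the year_built value from the same row.
house_year_fill = toy['garage_yr_blt'].fillna(toy['year_built'])

print(mean_fill.loc[2], median_fill.loc[2], house_year_fill.loc[2])
```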

I'm revisiting the null list to see what's left. I came back up to this one a few times, rather than rewriting it below, for additional sanity-check and to keep track of what's left.

In [355]:
df.columns[df.isna().any()].tolist()

['lot_frontage',
 'mas_vnr_type',
 'mas_vnr_area',
 'fireplace_qu',
 'pool_qc',
 'fence',
 'misc_feature']

### Lot Frontage

In [358]:
df['lot_frontage'].value_counts(dropna = False)

NaN      330
60.0     179
70.0      96
80.0      94
50.0      90
65.0      71
75.0      68
85.0      51
63.0      38
24.0      33
78.0      33
21.0      32
64.0      31
74.0      31
90.0      31
72.0      30
62.0      28
68.0      28
73.0      25
100.0     23
82.0      21
43.0      20
52.0      20
57.0      20
79.0      19
66.0      19
67.0      18
53.0      18
59.0      18
51.0      16
76.0      16
88.0      16
56.0      15
84.0      14
69.0      14
81.0      14
55.0      14
58.0      13
91.0      13
71.0      13
40.0      13
92.0      13
35.0      13
44.0      12
48.0      11
34.0      11
96.0      11
30.0      11
77.0      11
41.0      11
95.0      11
61.0      10
83.0      10
105.0      9
107.0      9
110.0      9
93.0       9
87.0       8
120.0      8
45.0       8
94.0       8
42.0       8
102.0      7
98.0       7
86.0       7
89.0       6
54.0       6
47.0       6
32.0       6
37.0       6
103.0      5
36.0       5
39.0       4
97.0       4
38.0       4
108.0      4
129.0      4

In [359]:
df['lot_frontage'].isna().sum()/len(df)

0.16097560975609757

16% of the data seems like a lot to drop.

In [360]:
df[df['lot_frontage'] == 0.0].head()

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice


In [361]:
df[df['lot_frontage'] < 30.0]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
10,1044,527451290,160,RM,21.0,1680,Pave,,Reg,Lvl,AllPub,Inside,Gtl,BrDale,Norm,Norm,Twnhs,2Story,6,5,1971,1971,Gable,CompShg,HdBoard,HdBoard,BrkFace,232.0,TA,TA,CBlock,TA,TA,No,ALQ,387.0,Unf,0.0,96.0,483.0,GasA,TA,Y,SBrkr,483,504,0,987,0.0,0.0,1,1,2,1,TA,4,Typ,0,,Detchd,1971.0,Unf,1.0,264.0,TA,TA,Y,0,0,0,0,0,0,,,,0,7,2008,COD,85400
13,1177,533236070,160,FV,24.0,2645,Pave,Pave,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,Twnhs,2Story,8,5,1999,2000,Gable,CompShg,MetalSd,MetalSd,BrkFace,456.0,Gd,TA,PConc,Gd,TA,No,GLQ,813.0,Unf,0.0,147.0,960.0,GasA,Ex,Y,SBrkr,962,645,0,1607,1.0,0.0,2,1,3,1,Gd,7,Typ,0,,Detchd,2000.0,Unf,2.0,480.0,TA,TA,Y,169,0,0,0,0,0,,,,0,12,2008,ConLD,200000
53,330,923226250,160,RM,21.0,1476,Pave,,Reg,Lvl,AllPub,Inside,Gtl,MeadowV,Norm,Norm,Twnhs,2Story,4,7,1970,1970,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,CBlock,TA,TA,No,GLQ,176.0,Unf,0.0,370.0,546.0,GasA,Ex,Y,SBrkr,546,546,0,1092,0.0,0.0,1,1,3,1,TA,5,Typ,0,,No Garage,1978.695248,No Garage,0.0,0.0,No Garage,No Garage,Y,200,26,0,0,0,0,,,,0,3,2010,WD,76000
56,98,533212020,160,FV,24.0,2544,Pave,Pave,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,Twnhs,2Story,7,5,2004,2005,Gable,CompShg,MetalSd,MetalSd,,0.0,Gd,TA,PConc,Gd,TA,No,GLQ,368.0,ALQ,42.0,190.0,600.0,GasA,Ex,Y,SBrkr,600,600,0,1200,1.0,0.0,2,1,2,1,Gd,4,Typ,0,,Detchd,2004.0,RFn,2.0,480.0,TA,TA,Y,0,172,0,0,0,0,,,,0,2,2010,WD,149500
73,2832,908188140,160,RM,24.0,2522,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Edwards,Norm,Norm,Twnhs,2Story,7,5,2004,2004,Gable,CompShg,VinylSd,VinylSd,Stone,50.0,Gd,TA,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,970.0,970.0,GasA,Ex,Y,SBrkr,970,739,0,1709,0.0,0.0,2,0,3,1,Gd,7,Maj1,0,,Detchd,2004.0,Unf,2.0,380.0,TA,TA,Y,0,40,0,0,0,0,,,,0,4,2006,WD,137500
130,1049,527454120,120,RL,24.0,2529,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NPkVill,Norm,Norm,Twnhs,1Story,7,6,1977,1977,Gable,CompShg,Plywood,Brk Cmn,,0.0,TA,TA,CBlock,Gd,TA,No,ALQ,378.0,Unf,0.0,677.0,1055.0,GasA,Fa,Y,SBrkr,1055,0,0,1055,0.0,0.0,2,0,2,1,TA,4,Typ,0,,Attchd,1977.0,Unf,2.0,440.0,TA,TA,Y,0,38,0,0,0,0,,,,0,9,2008,WD,148500
132,414,527454130,160,RL,24.0,2349,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NPkVill,Norm,Norm,Twnhs,2Story,6,5,1977,1977,Gable,CompShg,Plywood,Brk Cmn,,0.0,TA,TA,CBlock,Gd,TA,No,ALQ,389.0,Unf,0.0,466.0,855.0,GasA,TA,Y,SBrkr,855,601,0,1456,0.0,0.0,2,1,3,1,TA,6,Typ,1,TA,Attchd,1977.0,Unf,2.0,440.0,TA,TA,Y,0,28,0,0,0,0,,,,0,5,2009,WD,137900
135,1040,527450110,160,RM,21.0,1680,Pave,,Reg,Lvl,AllPub,Inside,Gtl,BrDale,Norm,Norm,Twnhs,2Story,6,5,1973,1973,Gable,CompShg,HdBoard,HdBoard,BrkFace,359.0,TA,TA,CBlock,TA,TA,No,LwQ,458.0,Unf,0.0,25.0,483.0,GasA,TA,Y,SBrkr,483,504,0,987,0.0,1.0,1,1,2,1,TA,5,Typ,0,,Detchd,1973.0,Unf,1.0,264.0,TA,TA,Y,52,0,0,0,0,0,,,,0,2,2008,WD,103400
167,416,527455060,160,RL,24.0,2289,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NPkVill,Norm,Norm,Twnhs,2Story,6,6,1978,1978,Gable,CompShg,Plywood,Brk Cmn,,0.0,TA,TA,CBlock,TA,TA,No,ALQ,311.0,Unf,0.0,544.0,855.0,GasA,TA,Y,SBrkr,855,586,0,1441,0.0,0.0,2,1,3,1,TA,7,Typ,1,TA,Attchd,1978.0,Unf,2.0,440.0,TA,TA,Y,28,0,0,0,0,0,,,,0,4,2009,WD,148500
179,30,527451180,160,RM,21.0,1680,Pave,,Reg,Lvl,AllPub,Inside,Gtl,BrDale,Norm,Norm,Twnhs,2Story,6,5,1971,1971,Gable,CompShg,HdBoard,HdBoard,BrkFace,504.0,TA,TA,CBlock,TA,TA,No,Rec,156.0,Unf,0.0,327.0,483.0,GasA,TA,Y,SBrkr,483,504,0,987,0.0,0.0,1,1,2,1,TA,5,Typ,0,,Detchd,1971.0,Unf,1.0,264.0,TA,TA,Y,275,0,0,0,0,0,,,,0,2,2010,COD,96000


It appears possible that NaN means no frontage because there are no 0.0 lots. In fact, there's nothing less than 21.0 feet of frontage. In reading a little about "flag lots" [here](https://www.city-data.com/forum/real-estate/1735402-value-land-without-road-frontage.html), it seems plausible that's what these are. While 21 feet seems a little small from that reading, it seems possible these smaller lot frontages represent lots that include a path to the road in the property (as opposed to an easement across someone else's property). For now, I'm going to fill the NaNs with 0.0.

In [362]:
df['lot_frontage'] = df['lot_frontage'].fillna(0.0)

df['lot_frontage'].isna().sum()

0
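
Since the 0.0 fill is a "for now" choice, here is a sketch of one alternative, not the approach taken in this notebook: fill each missing frontage with the median frontage of its neighborhood. It assumes frontage is reasonably similar within a neighborhood, which has not been checked here. The `toy` values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'neighborhood': ['NAmes', 'NAmes', 'NAmes', 'Somerst', 'Somerst'],
    'lot_frontage': [60.0, 80.0, np.nan, 24.0, np.nan],
})

# Median frontage within each neighborhood (NaNs are skipped),
# broadcast back so it lines up row by row with the original frame.
group_median = toy.groupby('neighborhood')['lot_frontage'].transform('median')
imputed = toy['lot_frontage'].fillna(group_median)
print(imputed.tolist())
```

Either choice changes what the column means for 16% of the rows, so it is worth comparing model results under both.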

### Fireplace Quality

The dictionary spells out that NA == no fireplace.

In [363]:
df['fireplace_qu'] = df['fireplace_qu'].fillna('No Fireplace')

df['fireplace_qu'].isna().sum()

0

### Pool Quality

The dictionary spells out that NA == no pool.

In [364]:
df['pool_qc'] = df['pool_qc'].fillna('No Pool')

df['pool_qc'].isna().sum()

0

### Fence

The dictionary spells out that NA == no fence.

In [365]:
df['fence'] = df['fence'].fillna('No Fence')

df['fence'].isna().sum()

0

### Misc Features

The dictionary spells out that NA == no additional features.

In [366]:
df['misc_feature'] = df['misc_feature'].fillna('No Addl Features')
df['misc_feature'].isna().sum()

0
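
The last four fills all follow the same pattern, so they could be collapsed into a single call by passing `fillna` a dict that maps column names to fill values. A minimal sketch on an invented `toy` frame, using the same replacement labels as the cells above:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'fireplace_qu': ['TA', np.nan],
    'pool_qc': [np.nan, 'Gd'],
    'fence': ['MnPrv', np.nan],
    'misc_feature': ['Shed', np.nan],
})

# One replacement label per column; columns not listed are left untouched.
fill_values = {
    'fireplace_qu': 'No Fireplace',
    'pool_qc': 'No Pool',
    'fence': 'No Fence',
    'misc_feature': 'No Addl Features',
}
toy = toy.fillna(value=fill_values)
print(toy.isna().sum().sum())
```

Keeping the mapping in one dict also documents every "NA means no X" decision in a single place.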

### Masonry Veneer Type

I examined these initially and wasn't sure how to handle them. None is clearly its own category, so it's hard to know how these NaNs should be interpreted. With only 22 instances, I decided to drop these rows.

In [367]:
df['mas_vnr_type'].value_counts(dropna = False)

None       1218
BrkFace     629
Stone       168
NaN          22
BrkCmn       13
Name: mas_vnr_type, dtype: int64

In [368]:
df[df['mas_vnr_type'].isna()]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
22,2393,528142010,60,RL,103.0,12867,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NridgHt,Norm,Norm,1Fam,2Story,8,5,2005,2006,Gable,CompShg,CemntBd,CmentBd,,,Gd,TA,PConc,Ex,TA,Av,Unf,0.0,Unf,0.0,1209.0,1209.0,GasA,Ex,Y,SBrkr,1209,1044,0,2253,0.0,0.0,2,1,3,1,Ex,8,Typ,1,Gd,Attchd,2005.0,Fin,2.0,575.0,TA,TA,Y,243,142,0,0,0,0,No Pool,No Fence,No Addl Features,0,7,2006,New,344133
41,2383,528110050,20,RL,107.0,13891,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NridgHt,Norm,Norm,1Fam,1Story,10,5,2006,2006,Gable,CompShg,VinylSd,VinylSd,,,Ex,TA,PConc,Ex,Gd,Gd,GLQ,1386.0,Unf,0.0,690.0,2076.0,GasA,Ex,Y,SBrkr,2076,0,0,2076,1.0,0.0,2,1,2,1,Ex,7,Typ,1,Gd,Attchd,2006.0,Fin,3.0,850.0,TA,TA,Y,216,229,0,0,0,0,No Pool,No Fence,No Addl Features,0,9,2006,New,465000
86,539,531371050,20,RL,67.0,10083,Pave,,Reg,Lvl,AllPub,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,,,Gd,TA,PConc,Gd,TA,No,GLQ,833.0,Unf,0.0,343.0,1176.0,GasA,Ex,Y,SBrkr,1200,0,0,1200,1.0,0.0,2,0,2,1,Gd,5,Typ,0,No Fireplace,Attchd,2003.0,RFn,2.0,555.0,TA,TA,Y,0,41,0,0,0,0,No Pool,No Fence,No Addl Features,0,8,2009,WD,184900
212,518,528458020,20,FV,90.0,7993,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,1Fam,1Story,7,5,2008,2009,Gable,CompShg,VinylSd,VinylSd,,,Gd,TA,PConc,Ex,TA,No,Unf,0.0,Unf,0.0,1436.0,1436.0,GasA,Ex,Y,SBrkr,1436,0,0,1436,0.0,0.0,2,0,3,1,Gd,6,Typ,0,No Fireplace,Attchd,2008.0,Fin,2.0,529.0,TA,TA,Y,0,121,0,0,0,0,No Pool,No Fence,No Addl Features,0,10,2009,New,225000
276,2824,908130020,20,RL,75.0,8050,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Edwards,Norm,Norm,1Fam,1Story,6,5,2002,2002,Gable,CompShg,VinylSd,VinylSd,,,TA,TA,PConc,Gd,TA,Av,GLQ,475.0,ALQ,297.0,142.0,914.0,GasA,Ex,Y,SBrkr,914,0,0,914,1.0,0.0,1,0,2,1,Gd,4,Typ,0,No Fireplace,No Garage,1978.695248,No Garage,0.0,0.0,No Garage,No Garage,N,32,0,0,0,0,0,No Pool,No Fence,No Addl Features,0,4,2006,WD,117250
338,1800,528458150,60,FV,112.0,12217,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,1Fam,2Story,8,5,2007,2007,Hip,CompShg,WdShing,Wd Shng,,,Gd,TA,PConc,Ex,TA,Av,GLQ,745.0,Unf,0.0,210.0,955.0,GasA,Ex,Y,SBrkr,955,925,0,1880,1.0,0.0,2,1,3,1,Ex,8,Typ,1,Gd,Attchd,2007.0,Fin,3.0,880.0,TA,TA,Y,168,127,0,0,0,0,No Pool,No Fence,No Addl Features,0,12,2007,New,310013
431,1455,907251090,60,RL,75.0,9473,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,8,5,2002,2002,Gable,CompShg,VinylSd,VinylSd,,,Gd,TA,PConc,Gd,TA,No,GLQ,804.0,Unf,0.0,324.0,1128.0,GasA,Ex,Y,SBrkr,1128,903,0,2031,1.0,0.0,2,1,3,1,Gd,7,Typ,1,Gd,Attchd,2002.0,RFn,2.0,577.0,TA,TA,Y,0,211,0,0,0,0,No Pool,No Fence,No Addl Features,0,3,2008,WD,237000
451,1120,528439010,20,RL,87.0,10037,Pave,,Reg,Lvl,AllPub,Corner,Gtl,Somerst,Feedr,Norm,1Fam,1Story,8,5,2006,2007,Hip,CompShg,VinylSd,VinylSd,,,Gd,TA,PConc,Ex,TA,No,GLQ,666.0,Unf,0.0,794.0,1460.0,GasA,Ex,Y,SBrkr,1460,0,0,1460,0.0,0.0,2,0,3,1,Gd,6,Typ,1,Gd,Attchd,2006.0,Fin,2.0,480.0,TA,TA,Y,0,20,0,0,0,0,No Pool,No Fence,No Addl Features,0,8,2008,WD,247000
591,1841,533208040,120,FV,35.0,4274,Pave,Pave,IR1,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,TwnhsE,1Story,7,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,,,Gd,TA,PConc,Gd,TA,No,GLQ,1106.0,Unf,0.0,135.0,1241.0,GasA,Ex,Y,SBrkr,1241,0,0,1241,1.0,0.0,1,1,1,1,Gd,4,Typ,0,No Fireplace,Attchd,2007.0,Fin,2.0,569.0,TA,TA,Y,0,116,0,0,0,0,No Pool,No Fence,No Addl Features,0,11,2007,New,199900
844,1840,533208030,120,FV,30.0,5330,Pave,Pave,IR2,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,TwnhsE,1Story,8,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,,,Gd,TA,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,1550.0,1550.0,GasA,Ex,Y,SBrkr,1550,0,0,1550,0.0,0.0,2,1,2,1,Gd,5,Typ,0,No Fireplace,Attchd,2007.0,Fin,2.0,528.0,TA,TA,Y,0,102,0,0,0,0,No Pool,No Fence,No Addl Features,0,7,2007,New,207500


### Masonry Veneer Area

In [369]:
df['mas_vnr_area'].value_counts(dropna = False)

0.0       1216
NaN         22
120.0       11
176.0       10
200.0       10
210.0        9
180.0        9
16.0         9
72.0         9
108.0        9
132.0        8
40.0         8
170.0        7
186.0        7
144.0        7
140.0        6
203.0        6
340.0        6
60.0         6
84.0         6
256.0        6
178.0        6
128.0        6
260.0        5
360.0        5
252.0        5
504.0        5
164.0        5
44.0         5
112.0        5
288.0        5
300.0        5
160.0        5
168.0        5
148.0        5
145.0        5
143.0        5
76.0         5
302.0        5
106.0        5
320.0        5
174.0        5
216.0        5
272.0        5
456.0        4
232.0        4
192.0        4
172.0        4
336.0        4
68.0         4
80.0         4
130.0        4
240.0        4
196.0        4
513.0        4
136.0        4
126.0        4
242.0        4
183.0        4
246.0        4
270.0        4
50.0         4
30.0         4
150.0        3
289.0        3
238.0        3
158.0     

In [370]:
len(df[(df['mas_vnr_type'].isna()) & (df['mas_vnr_area'].isna())])

22

The nulls from Masonry Veneer Type and Area are the same ones.
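A more direct way to confirm that the two columns are null on exactly the same rows is to compare their null masks. This is a minimal sketch on made-up toy data, not the real training set; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    'mas_vnr_type': ['None', 'BrkFace', np.nan, 'Stone', np.nan],
    'mas_vnr_area': [0.0, 120.0, np.nan, 176.0, np.nan],
})

type_null = toy['mas_vnr_type'].isna()
area_null = toy['mas_vnr_area'].isna()

# True only if the two masks agree on every row.
same_rows = bool((type_null == area_null).all())

# Rows where exactly one of the two columns is null (XOR of the masks).
mismatched = toy[type_null ^ area_null]

print(same_rows, len(mismatched))
```

If `mismatched` comes back empty, dropping on either column removes the nulls in both.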

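I stuck a copy in here to back up my work, in case I wanted to undo the dropna() without having to rerun the whole notebook.

In [371]:
# .copy() makes an independent backup; plain assignment would only create a second name for df
copy_df = df.copy()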
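Because `dropna(..., inplace=True)` modifies the DataFrame object itself, a backup only protects the data if it is an independent copy rather than a second name for the same object. A minimal sketch of the difference, using made-up toy data (the names `alias` and `backup` are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({'mas_vnr_type': ['None', np.nan, 'Stone']})

alias = toy          # a second name for the same object
backup = toy.copy()  # an independent copy of the data

# An in-place drop changes the object that both `toy` and `alias` refer to.
toy.dropna(subset=['mas_vnr_type'], inplace=True)

alias_len = len(alias)    # the NaN row is gone here too
backup_len = len(backup)  # still has all three rows

print(alias_len, backup_len)
```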

I reviewed the .dropna documentation to understand how to target only the column I wanted. This also took care of the other column, since its NaN values sit in the same rows.

In [372]:
# Passing subset as a list works across pandas versions
df.dropna(subset = ['mas_vnr_type'], inplace = True)

In [373]:
len(df[(df['mas_vnr_type'].isna()) & (df['mas_vnr_area'].isna())])

0
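To illustrate what `subset` changes, here is a small sketch on made-up toy data. The column `other_col` is hypothetical and exists only to show a row that has a missing value somewhere other than the masonry columns.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'mas_vnr_type': ['None', np.nan, 'Stone', 'BrkFace'],
    'mas_vnr_area': [0.0, np.nan, 168.0, 120.0],
    'other_col': [np.nan, 'x', 'y', 'z'],  # hypothetical column with its own NaN
})

# Without subset: any row with a NaN in any column is dropped.
any_null_dropped = toy.dropna()

# With subset: only rows with a NaN in mas_vnr_type are dropped.
subset_dropped = toy.dropna(subset=['mas_vnr_type'])

print(len(any_null_dropped), len(subset_dropped))
print(subset_dropped['mas_vnr_area'].isna().sum())
```

Neither call uses `inplace=True`, so `toy` itself is left unchanged and the two results can be compared side by side.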

-----

I learned how to show all rows from [this Stack Overflow answer](https://stackoverflow.com/questions/46554597/how-to-display-all-rows-in-jupyter-notebook).

In [374]:
pd.options.display.max_rows = 999

In [375]:
df.isnull().sum()

id                 0
pid                0
ms_subclass        0
ms_zoning          0
lot_frontage       0
lot_area           0
street             0
alley              0
lot_shape          0
land_contour       0
utilities          0
lot_config         0
land_slope         0
neighborhood       0
condition_1        0
condition_2        0
bldg_type          0
house_style        0
overall_qual       0
overall_cond       0
year_built         0
year_remod/add     0
roof_style         0
roof_matl          0
exterior_1st       0
exterior_2nd       0
mas_vnr_type       0
mas_vnr_area       0
exter_qual         0
exter_cond         0
foundation         0
bsmt_qual          0
bsmt_cond          0
bsmt_exposure      0
bsmtfin_type_1     0
bsmtfin_sf_1       0
bsmtfin_type_2     0
bsmtfin_sf_2       0
bsmt_unf_sf        0
total_bsmt_sf      0
heating            0
heating_qc         0
central_air        0
electrical         0
1st_flr_sf         0
2nd_flr_sf         0
low_qual_fin_sf    0
gr_liv_area  

In [376]:
df.dtypes

id                   int64
pid                  int64
ms_subclass          int64
ms_zoning           object
lot_frontage       float64
lot_area             int64
street              object
alley               object
lot_shape           object
land_contour        object
utilities           object
lot_config          object
land_slope          object
neighborhood        object
condition_1         object
condition_2         object
bldg_type           object
house_style         object
overall_qual         int64
overall_cond         int64
year_built           int64
year_remod/add       int64
roof_style          object
roof_matl           object
exterior_1st        object
exterior_2nd        object
mas_vnr_type        object
mas_vnr_area       float64
exter_qual          object
exter_cond          object
foundation          object
bsmt_qual           object
bsmt_cond           object
bsmt_exposure       object
bsmtfin_type_1      object
bsmtfin_sf_1       float64
bsmtfin_type_2      object
b

# Exporting Dataframe to CSV

In [377]:
df.to_csv('datasets/draft1_cleaned_train.csv', index = False)

-----

# Examining Data for Outliers
I did additional cleaning on the data after I'd made my first round of predictions. Ultimately, however, I found that the additionally cleaned data yielded weaker predictions, so I decided not to proceed with it. I've left the work here, though, to show the thought process.

[This site](https://datascienceparichay.com/article/show-all-columns-of-pandas-dataframe-in-jupyter-notebook/) showed me how to display all columns.

In [378]:
pd.set_option('display.max_columns', None)

In [379]:
df.describe()

Unnamed: 0,id,pid,ms_subclass,lot_frontage,lot_area,overall_qual,overall_cond,year_built,year_remod/add,mas_vnr_area,bsmtfin_sf_1,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,totrms_abvgrd,fireplaces,garage_yr_blt,garage_cars,garage_area,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,misc_val,mo_sold,yr_sold,saleprice
count,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0
mean,1473.767751,714928600.0,57.122781,57.966963,10053.849606,6.098126,5.569034,1971.357495,1983.967456,99.599112,441.290434,48.096647,565.328402,1054.715483,1163.274655,328.687377,5.575444,1497.537475,0.426529,0.064103,1.571992,0.36785,2.844181,1.042899,6.434911,0.58925,1978.337556,1.771696,472.504931,94.146943,47.139546,22.630671,2.620809,16.698718,2.425049,52.15927,6.206114,2007.778107,180839.170118
std,845.058814,188711500.0,42.930897,33.069068,6755.099389,1.424895,1.107746,30.147866,21.038807,174.951931,460.530353,165.456384,443.937563,447.488378,395.715616,425.317335,51.354392,501.541089,0.522816,0.252922,0.547898,0.499424,0.827727,0.209856,1.563396,0.638683,24.196975,0.765981,216.212038,128.88885,66.445075,59.88098,25.37083,57.671676,37.995453,576.61144,2.741255,1.314063,79109.425782
min,1.0,526301100.0,20.0,0.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1895.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,12789.0
25%,752.75,528477100.0,20.0,44.0,7500.0,5.0,5.0,1953.0,1964.0,0.0,0.0,0.0,219.75,791.75,879.75,0.0,0.0,1126.0,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1962.0,1.0,315.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0,129500.0
50%,1483.5,535455100.0,50.0,63.0,9400.0,6.0,5.0,1973.0,1993.0,0.0,368.0,0.0,473.5,993.0,1092.0,0.0,0.0,1442.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,1978.695248,2.0,480.0,0.0,26.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,161000.0
75%,2197.5,907180400.0,70.0,78.0,11500.0,7.0,6.0,2000.0,2004.0,160.25,732.25,0.0,809.25,1314.0,1402.0,689.25,0.0,1728.0,1.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,2001.0,2.0,576.0,168.0,70.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,213310.0
max,2930.0,924152000.0,190.0,313.0,159000.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,2336.0,6110.0,5095.0,1862.0,1064.0,5642.0,3.0,2.0,4.0,2.0,8.0,3.0,15.0,4.0,2010.0,5.0,1418.0,1424.0,547.0,432.0,508.0,490.0,800.0,17000.0,12.0,2010.0,611657.0


Outliers I need to examine:
* lot_frontage - revisit the 0.0 minimum. Is that an error? Should it be removed?
* lot_area - max is ~14x the 75th percentile. 159,000 sq ft is about 3.65 acres. That doesn't seem unreasonable, but could it be skewing the data?
* mas_vnr_area - max of 1600 sq ft is ~10x the 75th percentile.
* wood_deck_sf - max of 1424 sq ft is ~8.5x the 75th percentile.
* open_porch_sf - max of 547 sq ft is ~8x the 75th percentile.
* enclosed porch, the other two porches, pool area, and misc value are all 0.0 through at least the 75th percentile.
* saleprice - just check how far out the outliers are, both high and low. The lowest one here is about $13k, which seems off, given that the 25th percentile is $129,500.

From below:
* 3 kitchens above ground seems like a potential outlier, too.

Otherwise, it seems like there may just be a few huge houses. Or at least one. I'm going to look into that.
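The ratios and zero-heavy columns in the list above could also be computed rather than read off the describe() table. This is only a sketch on made-up toy data; the `summary` table and its column names (`q75`, `zero_share`, `max_to_q75`) are hypothetical, not something from the original notebook.

```python
import pandas as pd

# Toy data, invented for illustration (the 159,000 lot echoes the max above).
toy = pd.DataFrame({
    'lot_area': [7500, 9400, 11500, 10000, 159000],
    'pool_area': [0, 0, 0, 0, 800],
})

summary = pd.DataFrame({
    'max': toy.max(),
    'q75': toy.quantile(0.75),
    'zero_share': (toy == 0).mean(),  # fraction of rows equal to 0
})

# The ratio is undefined when the 75th percentile is 0, so leave NaN there.
summary['max_to_q75'] = summary['max'] / summary['q75'].where(summary['q75'] > 0)

print(summary)
```

Columns with a large `max_to_q75` or a high `zero_share` are the ones worth a closer look; the cutoff for "large" is still a judgment call.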

In [380]:
col_dtypes = pd.DataFrame(df.dtypes)
col_dtypes.head()

Unnamed: 0,0
id,int64
pid,int64
ms_subclass,int64
ms_zoning,object
lot_frontage,float64


In [381]:
col_dtypes[(col_dtypes[0] == 'int64')]

Unnamed: 0,0
id,int64
pid,int64
ms_subclass,int64
lot_area,int64
overall_qual,int64
overall_cond,int64
year_built,int64
year_remod/add,int64
1st_flr_sf,int64
2nd_flr_sf,int64


In [382]:
col_dtypes[(col_dtypes[0] == 'float64')]

Unnamed: 0,0
lot_frontage,float64
mas_vnr_area,float64
bsmtfin_sf_1,float64
bsmtfin_sf_2,float64
bsmt_unf_sf,float64
total_bsmt_sf,float64
bsmt_full_bath,float64
bsmt_half_bath,float64
garage_yr_blt,float64
garage_cars,float64


The next two describe() calls look at groups of columns I didn't get a good look at above.

In [383]:
df[['bsmtfin_sf_1',
'bsmtfin_sf_2',
'bsmt_unf_sf',
'total_bsmt_sf',
'bsmt_full_bath',
'bsmt_half_bath']].describe()

Unnamed: 0,bsmtfin_sf_1,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath
count,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0
mean,441.290434,48.096647,565.328402,1054.715483,0.426529,0.064103
std,460.530353,165.456384,443.937563,447.488378,0.522816,0.252922
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,219.75,791.75,0.0,0.0
50%,368.0,0.0,473.5,993.0,0.0,0.0
75%,732.25,0.0,809.25,1314.0,1.0,0.0
max,5644.0,1474.0,2336.0,6110.0,3.0,2.0


In [384]:
df[['1st_flr_sf',
'2nd_flr_sf',
'low_qual_fin_sf',
'gr_liv_area',
'full_bath',
'half_bath',
'bedroom_abvgr',
'kitchen_abvgr',
'totrms_abvgrd',
'fireplaces']].describe()

Unnamed: 0,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,totrms_abvgrd,fireplaces
count,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0,2028.0
mean,1163.274655,328.687377,5.575444,1497.537475,1.571992,0.36785,2.844181,1.042899,6.434911,0.58925
std,395.715616,425.317335,51.354392,501.541089,0.547898,0.499424,0.827727,0.209856,1.563396,0.638683
min,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,2.0,0.0
25%,879.75,0.0,0.0,1126.0,1.0,0.0,2.0,1.0,5.0,0.0
50%,1092.0,0.0,0.0,1442.0,2.0,0.0,3.0,1.0,6.0,1.0
75%,1402.0,689.25,0.0,1728.0,2.0,1.0,3.0,1.0,7.0,1.0
max,5095.0,1862.0,1064.0,5642.0,4.0,2.0,8.0,3.0,15.0,4.0


In [385]:
df[df['saleprice'] > 500_000]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
81,367,527214050,20,RL,63.0,17423,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,StoneBr,Norm,Norm,1Fam,1Story,9,5,2008,2009,Hip,CompShg,VinylSd,VinylSd,Stone,748.0,Ex,TA,PConc,Ex,TA,No,GLQ,1904.0,Unf,0.0,312.0,2216.0,GasA,Ex,Y,SBrkr,2234,0,0,2234,1.0,0.0,2,0,1,1,Ex,9,Typ,1,Gd,Attchd,2009.0,Fin,3.0,1166.0,TA,TA,Y,0,60,0,0,0,0,No Pool,No Fence,No Addl Features,0,7,2009,New,501837
138,2331,527210040,60,RL,60.0,18062,Pave,,IR1,HLS,AllPub,CulDSac,Gtl,StoneBr,Norm,Norm,1Fam,2Story,10,5,2006,2006,Hip,CompShg,CemntBd,CmentBd,BrkFace,662.0,Ex,TA,PConc,Ex,TA,Gd,Unf,0.0,Unf,0.0,1528.0,1528.0,GasA,Ex,Y,SBrkr,1528,1862,0,3390,0.0,0.0,3,1,5,1,Ex,10,Typ,1,Ex,BuiltIn,2006.0,Fin,3.0,758.0,TA,TA,Y,204,34,0,0,0,0,No Pool,No Fence,No Addl Features,0,9,2006,New,545224
151,2333,527212030,60,RL,85.0,16056,Pave,,IR1,Lvl,AllPub,Inside,Gtl,StoneBr,Norm,Norm,1Fam,2Story,9,5,2005,2006,Hip,CompShg,CemntBd,CmentBd,Stone,208.0,Gd,TA,PConc,Ex,TA,Av,GLQ,240.0,Unf,0.0,1752.0,1992.0,GasA,Ex,Y,SBrkr,1992,876,0,2868,0.0,0.0,3,1,4,1,Ex,11,Typ,1,Gd,BuiltIn,2005.0,Fin,3.0,716.0,TA,TA,Y,214,108,0,0,0,0,No Pool,No Fence,No Addl Features,0,7,2006,New,556581
623,457,528176030,20,RL,100.0,14836,Pave,,IR1,HLS,AllPub,Inside,Mod,NridgHt,Norm,Norm,1Fam,1Story,10,5,2004,2005,Hip,CompShg,CemntBd,CmentBd,Stone,730.0,Ex,TA,PConc,Ex,TA,Gd,GLQ,2146.0,Unf,0.0,346.0,2492.0,GasA,Ex,Y,SBrkr,2492,0,0,2492,1.0,0.0,2,1,2,1,Ex,8,Typ,1,Ex,Attchd,2004.0,Fin,3.0,949.0,TA,TA,Y,226,235,0,0,0,0,No Pool,No Fence,No Addl Features,0,2,2009,WD,552000
800,1702,528118050,20,RL,59.0,17169,Pave,,IR2,Lvl,AllPub,CulDSac,Gtl,NridgHt,Norm,Norm,1Fam,1Story,10,5,2007,2007,Hip,CompShg,CemntBd,CmentBd,BrkFace,970.0,Ex,TA,PConc,Ex,TA,Av,GLQ,1684.0,Unf,0.0,636.0,2320.0,GasA,Ex,Y,SBrkr,2290,0,0,2290,2.0,0.0,2,1,2,1,Ex,7,Typ,1,Gd,Attchd,2007.0,Fin,3.0,1174.0,TA,TA,Y,192,30,0,0,0,0,No Pool,No Fence,No Addl Features,0,8,2007,New,500067
823,16,527216070,60,RL,47.0,53504,Pave,,IR2,HLS,AllPub,CulDSac,Mod,StoneBr,Norm,Norm,1Fam,2Story,8,5,2003,2003,Hip,CompShg,CemntBd,Wd Shng,BrkFace,603.0,Ex,TA,PConc,Gd,TA,Gd,ALQ,1416.0,Unf,0.0,234.0,1650.0,GasA,Ex,Y,SBrkr,1690,1589,0,3279,1.0,0.0,3,1,4,1,Ex,12,Mod,1,Gd,BuiltIn,2003.0,Fin,3.0,841.0,TA,TA,Y,503,36,0,0,210,0,No Pool,No Fence,No Addl Features,0,6,2010,WD,538000
1164,424,528106020,20,RL,105.0,15431,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NridgHt,Norm,Norm,1Fam,1Story,10,5,2008,2008,Hip,CompShg,VinylSd,VinylSd,Stone,200.0,Ex,TA,PConc,Ex,TA,Gd,GLQ,1767.0,ALQ,539.0,788.0,3094.0,GasA,Ex,Y,SBrkr,2402,0,0,2402,1.0,0.0,2,0,2,1,Ex,10,Typ,2,Gd,Attchd,2008.0,Fin,3.0,672.0,TA,TA,Y,0,72,0,0,170,0,No Pool,No Fence,No Addl Features,0,4,2009,WD,555000
1592,2335,527214060,60,RL,82.0,16052,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,StoneBr,Norm,Norm,1Fam,2Story,10,5,2006,2006,Hip,CompShg,VinylSd,VinylSd,Stone,734.0,Ex,TA,PConc,Ex,TA,No,GLQ,1206.0,Unf,0.0,644.0,1850.0,GasA,Ex,Y,SBrkr,1850,848,0,2698,1.0,0.0,2,1,4,1,Ex,11,Typ,1,Gd,Attchd,2006.0,RFn,3.0,736.0,TA,TA,Y,250,0,0,0,0,0,No Pool,No Fence,No Addl Features,0,7,2006,New,535000
1671,45,528150070,20,RL,100.0,12919,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NridgHt,Norm,Norm,1Fam,1Story,9,5,2009,2010,Hip,CompShg,VinylSd,VinylSd,Stone,760.0,Ex,TA,PConc,Ex,TA,Gd,GLQ,2188.0,Unf,0.0,142.0,2330.0,GasA,Ex,Y,SBrkr,2364,0,0,2364,1.0,0.0,2,1,2,1,Ex,11,Typ,2,Gd,Attchd,2009.0,Fin,3.0,820.0,TA,TA,Y,0,67,0,0,0,0,No Pool,No Fence,No Addl Features,0,3,2010,New,611657
1692,2451,528360050,60,RL,114.0,17242,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NoRidge,Norm,Norm,1Fam,2Story,9,5,1993,1994,Hip,CompShg,MetalSd,MetalSd,BrkFace,738.0,Gd,Gd,PConc,Ex,TA,Gd,Rec,292.0,GLQ,1393.0,48.0,1733.0,GasA,Ex,Y,SBrkr,1933,1567,0,3500,1.0,0.0,3,1,4,1,Ex,11,Typ,1,TA,Attchd,1993.0,RFn,3.0,959.0,TA,TA,Y,870,86,0,0,210,0,No Pool,No Fence,No Addl Features,0,5,2006,WD,584500


Nothing about the high sale prices seems unusual to me.

In [386]:
df[df['lot_area'] > 50_000][['lot_area', 'saleprice', 'gr_liv_area']]

Unnamed: 0,lot_area,saleprice,gr_liv_area
471,159000,277000,2144
694,115149,302000,1824
745,57200,160000,1687
823,53504,538000,3279
960,63887,160000,5642
1052,53227,256000,1663
1571,50271,385000,1842
1726,50102,250764,1650
1843,53107,240000,1953
1854,70761,280000,1533


The two lots over 100,000 sf are so far beyond the rest, and their prices so similar to those of much smaller lots, that I think they're errors. I'm going to eliminate them.

In [387]:
df2 = df.copy()

In [388]:
# Build the mask from df2 itself so the filter and the drop refer to the same frame
df2.drop(index = df2[df2['lot_area'] > 100_000].index, inplace = True)

In [389]:
df2[df2['lot_area'] > 100_000][['lot_area', 'saleprice', 'gr_liv_area']]

Unnamed: 0,lot_area,saleprice,gr_liv_area
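The same kind of removal can be written as a boolean filter that returns a new frame, which also makes it easy to list which index labels were removed. A sketch on toy data (the lot areas and index labels echo rows shown above; the 9,400 row and the names `trimmed` and `removed` are made up):

```python
import pandas as pd

# Toy data for illustration.
toy = pd.DataFrame(
    {
        'lot_area': [159000, 9400, 115149, 57200],
        'saleprice': [277000, 161000, 302000, 160000],
    },
    index=[471, 10, 694, 745],
)

# Keep every row that is NOT over the threshold; .copy() gives an independent frame.
trimmed = toy[~(toy['lot_area'] > 100_000)].copy()

# Index labels present in the original but missing after the filter.
removed = toy.index.difference(trimmed.index)

print(list(removed))
```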


In [390]:
df2[df2['mas_vnr_area']>1000][['mas_vnr_area', 'saleprice', 'year_built']]

Unnamed: 0,mas_vnr_area,saleprice,year_built
125,1050.0,150000,1977
378,1115.0,244000,1973
489,1110.0,421250,2001
1151,1129.0,176000,1999
1170,1031.0,438780,2006
1227,1095.0,500000,2003
1409,1600.0,239000,1997
1416,1047.0,159950,1966
1885,1224.0,183850,2008


The 1600 SF masonry veneer area seems so far out there, and the price is so much lower than others, that I'm going to eliminate it.

In [391]:
df2.drop(index = df2[df2['mas_vnr_area'] == 1600].index, inplace = True)

In [392]:
df2[df2['mas_vnr_area'] > 1300][['mas_vnr_area', 'saleprice', 'year_built']]

Unnamed: 0,mas_vnr_area,saleprice,year_built


In [393]:
df2[df2['wood_deck_sf']>800]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
966,2294,923229100,80,RL,0.0,15957,Pave,,IR1,Low,AllPub,Corner,Mod,Mitchel,Norm,Norm,1Fam,SLvl,6,6,1977,1977,Gable,CompShg,HdBoard,Plywood,,0.0,TA,TA,PConc,Gd,TA,Gd,GLQ,1148.0,Unf,0.0,96.0,1244.0,GasA,TA,Y,SBrkr,1356,0,0,1356,2.0,0.0,2,0,3,1,TA,6,Typ,1,TA,Attchd,1977.0,Fin,2.0,528.0,TA,TA,Y,1424,0,0,0,0,0,No Pool,MnPrv,No Addl Features,0,9,2007,WD,188000
1571,2523,533350050,20,RL,68.0,50271,Pave,,IR1,Low,AllPub,Inside,Gtl,Veenker,Norm,Norm,1Fam,1Story,9,5,1981,1987,Gable,WdShngl,WdShing,Wd Shng,,0.0,Gd,TA,CBlock,Ex,TA,Gd,GLQ,1810.0,Unf,0.0,32.0,1842.0,GasA,Gd,Y,SBrkr,1842,0,0,1842,2.0,0.0,0,1,0,1,Gd,5,Typ,1,Gd,Attchd,1981.0,Fin,3.0,894.0,TA,TA,Y,857,72,0,0,0,0,No Pool,No Fence,No Addl Features,0,11,2006,WD,385000
1692,2451,528360050,60,RL,114.0,17242,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NoRidge,Norm,Norm,1Fam,2Story,9,5,1993,1994,Hip,CompShg,MetalSd,MetalSd,BrkFace,738.0,Gd,Gd,PConc,Ex,TA,Gd,Rec,292.0,GLQ,1393.0,48.0,1733.0,GasA,Ex,Y,SBrkr,1933,1567,0,3500,1.0,0.0,3,1,4,1,Ex,11,Typ,1,TA,Attchd,1993.0,RFn,3.0,959.0,TA,TA,Y,870,86,0,0,210,0,No Pool,No Fence,No Addl Features,0,5,2006,WD,584500


The 1424 sf deck is so much larger than the others, and the sale price so much lower, that I'm going to eliminate it.

In [394]:
df2.drop(index = df2[df2['wood_deck_sf'] == 1424].index, inplace = True)

In [395]:
# Confirm the 1424 sf row is gone (an == 1400 check would be empty either way)
df2[df2['wood_deck_sf'] > 1400]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice


In [396]:
df2[df2['open_porch_sf'] > 500]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
974,1289,902105020,50,RM,60.0,10440,Pave,Grvl,Reg,Lvl,AllPub,Corner,Gtl,OldTown,Norm,Norm,1Fam,1.5Fin,6,7,1920,1950,Gable,CompShg,BrkFace,Wd Sdng,,0.0,Gd,Gd,BrkTil,Gd,TA,No,LwQ,493.0,Unf,0.0,1017.0,1510.0,GasW,Ex,Y,SBrkr,1584,1208,0,2792,0.0,0.0,2,0,5,1,TA,8,Mod,2,TA,Detchd,1920.0,Unf,2.0,520.0,Fa,TA,Y,0,547,0,0,480,0,No Pool,MnPrv,Shed,1150,6,2008,WD,256000
1141,1321,902401120,75,RM,75.0,13500,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,PosA,1Fam,2.5Unf,10,9,1893,2000,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,Ex,Ex,BrkTil,TA,TA,No,Unf,0.0,Unf,0.0,1237.0,1237.0,GasA,Gd,Y,SBrkr,1521,1254,0,2775,0.0,0.0,3,1,3,1,Gd,9,Typ,1,Gd,Detchd,1988.0,Unf,2.0,880.0,Gd,TA,Y,105,502,0,0,0,0,No Pool,No Fence,No Addl Features,0,7,2008,WD,325000
1309,727,902477120,30,C (all),60.0,7879,Pave,,Reg,Lvl,AllPub,Inside,Gtl,IDOTRR,Norm,Norm,1Fam,1Story,4,5,1920,1950,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,TA,TA,CBlock,TA,TA,No,Rec,495.0,Unf,0.0,225.0,720.0,GasA,TA,N,FuseA,720,0,0,720,0.0,0.0,1,0,2,1,TA,4,Typ,0,No Fireplace,No Garage,1978.695248,No Garage,0.0,0.0,No Garage,No Garage,N,0,523,115,0,0,0,No Pool,GdWo,No Addl Features,0,11,2009,WD,34900


None of these seem that crazy except the last one, which sold for about $35k. I need to check out those low prices.

In [397]:

df2[df2['saleprice']<35_000]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
183,1554,910251050,20,A (agr),80.0,14584,Pave,,Reg,Low,AllPub,Inside,Mod,IDOTRR,Norm,Norm,1Fam,1Story,1,5,1952,1952,Gable,CompShg,AsbShng,VinylSd,,0.0,Fa,Po,Slab,No Basement,No Basement,No Basement,No Basement,0.0,No Basement,0.0,0.0,0.0,Wall,Po,N,FuseA,733,0,0,733,0.0,0.0,1,0,2,1,Fa,4,Sal,0,No Fireplace,Attchd,1952.0,Unf,2.0,487.0,Fa,Po,N,0,0,0,0,0,0,No Pool,No Fence,No Addl Features,0,2,2008,WD,13100
1309,727,902477120,30,C (all),60.0,7879,Pave,,Reg,Lvl,AllPub,Inside,Gtl,IDOTRR,Norm,Norm,1Fam,1Story,4,5,1920,1950,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,TA,TA,CBlock,TA,TA,No,Rec,495.0,Unf,0.0,225.0,720.0,GasA,TA,N,FuseA,720,0,0,720,0.0,0.0,1,0,2,1,TA,4,Typ,0,No Fireplace,No Garage,1978.695248,No Garage,0.0,0.0,No Garage,No Garage,N,0,523,115,0,0,0,No Pool,GdWo,No Addl Features,0,11,2009,WD,34900
1628,182,902207130,30,RM,68.0,9656,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1Story,2,2,1923,1970,Gable,CompShg,AsbShng,AsbShng,,0.0,TA,Fa,BrkTil,Fa,Fa,No,Unf,0.0,Unf,0.0,678.0,678.0,GasA,TA,N,SBrkr,832,0,0,832,0.0,0.0,1,0,2,1,TA,5,Typ,1,Gd,Detchd,1928.0,Unf,2.0,780.0,Fa,Fa,N,0,0,0,0,0,0,No Pool,No Fence,No Addl Features,0,6,2010,WD,12789


The two properties that sold for less than $14,000 are so far below the next closest that I'm going to eliminate them.

In [398]:
df2.drop(index = df2[df2['saleprice'] < 15_000].index, inplace = True)

In [399]:

df2[df2['saleprice']<35_000]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
1309,727,902477120,30,C (all),60.0,7879,Pave,,Reg,Lvl,AllPub,Inside,Gtl,IDOTRR,Norm,Norm,1Fam,1Story,4,5,1920,1950,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,TA,TA,CBlock,TA,TA,No,Rec,495.0,Unf,0.0,225.0,720.0,GasA,TA,N,FuseA,720,0,0,720,0.0,0.0,1,0,2,1,TA,4,Typ,0,No Fireplace,No Garage,1978.695248,No Garage,0.0,0.0,No Garage,No Garage,N,0,523,115,0,0,0,No Pool,GdWo,No Addl Features,0,11,2009,WD,34900
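One way to see how far apart the lowest prices are, without picking a threshold first, is to sort the smallest values and look at the gaps between them. This is a sketch on toy data: the three lowest prices echo the rows above, while the other values, the index labels 5 and 6, and the names `lowest` and `gaps` are only illustrative.

```python
import pandas as pd

# Toy data for illustration.
toy = pd.DataFrame(
    {'saleprice': [12789, 13100, 34900, 129500, 161000]},
    index=[1628, 183, 1309, 5, 6],
)

# The four smallest sale prices, in ascending order.
lowest = toy['saleprice'].nsmallest(4)

# Difference between each price and the one before it (first entry is NaN).
gaps = lowest.diff()

print(lowest)
print(gaps)
```

A big jump in `gaps` right after the first couple of values suggests those rows sit apart from the rest, though whether to drop them is still a judgment call.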


In [400]:

df2[df2['saleprice']>500_000]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
81,367,527214050,20,RL,63.0,17423,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,StoneBr,Norm,Norm,1Fam,1Story,9,5,2008,2009,Hip,CompShg,VinylSd,VinylSd,Stone,748.0,Ex,TA,PConc,Ex,TA,No,GLQ,1904.0,Unf,0.0,312.0,2216.0,GasA,Ex,Y,SBrkr,2234,0,0,2234,1.0,0.0,2,0,1,1,Ex,9,Typ,1,Gd,Attchd,2009.0,Fin,3.0,1166.0,TA,TA,Y,0,60,0,0,0,0,No Pool,No Fence,No Addl Features,0,7,2009,New,501837
138,2331,527210040,60,RL,60.0,18062,Pave,,IR1,HLS,AllPub,CulDSac,Gtl,StoneBr,Norm,Norm,1Fam,2Story,10,5,2006,2006,Hip,CompShg,CemntBd,CmentBd,BrkFace,662.0,Ex,TA,PConc,Ex,TA,Gd,Unf,0.0,Unf,0.0,1528.0,1528.0,GasA,Ex,Y,SBrkr,1528,1862,0,3390,0.0,0.0,3,1,5,1,Ex,10,Typ,1,Ex,BuiltIn,2006.0,Fin,3.0,758.0,TA,TA,Y,204,34,0,0,0,0,No Pool,No Fence,No Addl Features,0,9,2006,New,545224
151,2333,527212030,60,RL,85.0,16056,Pave,,IR1,Lvl,AllPub,Inside,Gtl,StoneBr,Norm,Norm,1Fam,2Story,9,5,2005,2006,Hip,CompShg,CemntBd,CmentBd,Stone,208.0,Gd,TA,PConc,Ex,TA,Av,GLQ,240.0,Unf,0.0,1752.0,1992.0,GasA,Ex,Y,SBrkr,1992,876,0,2868,0.0,0.0,3,1,4,1,Ex,11,Typ,1,Gd,BuiltIn,2005.0,Fin,3.0,716.0,TA,TA,Y,214,108,0,0,0,0,No Pool,No Fence,No Addl Features,0,7,2006,New,556581
623,457,528176030,20,RL,100.0,14836,Pave,,IR1,HLS,AllPub,Inside,Mod,NridgHt,Norm,Norm,1Fam,1Story,10,5,2004,2005,Hip,CompShg,CemntBd,CmentBd,Stone,730.0,Ex,TA,PConc,Ex,TA,Gd,GLQ,2146.0,Unf,0.0,346.0,2492.0,GasA,Ex,Y,SBrkr,2492,0,0,2492,1.0,0.0,2,1,2,1,Ex,8,Typ,1,Ex,Attchd,2004.0,Fin,3.0,949.0,TA,TA,Y,226,235,0,0,0,0,No Pool,No Fence,No Addl Features,0,2,2009,WD,552000
800,1702,528118050,20,RL,59.0,17169,Pave,,IR2,Lvl,AllPub,CulDSac,Gtl,NridgHt,Norm,Norm,1Fam,1Story,10,5,2007,2007,Hip,CompShg,CemntBd,CmentBd,BrkFace,970.0,Ex,TA,PConc,Ex,TA,Av,GLQ,1684.0,Unf,0.0,636.0,2320.0,GasA,Ex,Y,SBrkr,2290,0,0,2290,2.0,0.0,2,1,2,1,Ex,7,Typ,1,Gd,Attchd,2007.0,Fin,3.0,1174.0,TA,TA,Y,192,30,0,0,0,0,No Pool,No Fence,No Addl Features,0,8,2007,New,500067
823,16,527216070,60,RL,47.0,53504,Pave,,IR2,HLS,AllPub,CulDSac,Mod,StoneBr,Norm,Norm,1Fam,2Story,8,5,2003,2003,Hip,CompShg,CemntBd,Wd Shng,BrkFace,603.0,Ex,TA,PConc,Gd,TA,Gd,ALQ,1416.0,Unf,0.0,234.0,1650.0,GasA,Ex,Y,SBrkr,1690,1589,0,3279,1.0,0.0,3,1,4,1,Ex,12,Mod,1,Gd,BuiltIn,2003.0,Fin,3.0,841.0,TA,TA,Y,503,36,0,0,210,0,No Pool,No Fence,No Addl Features,0,6,2010,WD,538000
1164,424,528106020,20,RL,105.0,15431,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NridgHt,Norm,Norm,1Fam,1Story,10,5,2008,2008,Hip,CompShg,VinylSd,VinylSd,Stone,200.0,Ex,TA,PConc,Ex,TA,Gd,GLQ,1767.0,ALQ,539.0,788.0,3094.0,GasA,Ex,Y,SBrkr,2402,0,0,2402,1.0,0.0,2,0,2,1,Ex,10,Typ,2,Gd,Attchd,2008.0,Fin,3.0,672.0,TA,TA,Y,0,72,0,0,170,0,No Pool,No Fence,No Addl Features,0,4,2009,WD,555000
1592,2335,527214060,60,RL,82.0,16052,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,StoneBr,Norm,Norm,1Fam,2Story,10,5,2006,2006,Hip,CompShg,VinylSd,VinylSd,Stone,734.0,Ex,TA,PConc,Ex,TA,No,GLQ,1206.0,Unf,0.0,644.0,1850.0,GasA,Ex,Y,SBrkr,1850,848,0,2698,1.0,0.0,2,1,4,1,Ex,11,Typ,1,Gd,Attchd,2006.0,RFn,3.0,736.0,TA,TA,Y,250,0,0,0,0,0,No Pool,No Fence,No Addl Features,0,7,2006,New,535000
1671,45,528150070,20,RL,100.0,12919,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NridgHt,Norm,Norm,1Fam,1Story,9,5,2009,2010,Hip,CompShg,VinylSd,VinylSd,Stone,760.0,Ex,TA,PConc,Ex,TA,Gd,GLQ,2188.0,Unf,0.0,142.0,2330.0,GasA,Ex,Y,SBrkr,2364,0,0,2364,1.0,0.0,2,1,2,1,Ex,11,Typ,2,Gd,Attchd,2009.0,Fin,3.0,820.0,TA,TA,Y,0,67,0,0,0,0,No Pool,No Fence,No Addl Features,0,3,2010,New,611657
1692,2451,528360050,60,RL,114.0,17242,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NoRidge,Norm,Norm,1Fam,2Story,9,5,1993,1994,Hip,CompShg,MetalSd,MetalSd,BrkFace,738.0,Gd,Gd,PConc,Ex,TA,Gd,Rec,292.0,GLQ,1393.0,48.0,1733.0,GasA,Ex,Y,SBrkr,1933,1567,0,3500,1.0,0.0,3,1,4,1,Ex,11,Typ,1,TA,Attchd,1993.0,RFn,3.0,959.0,TA,TA,Y,870,86,0,0,210,0,No Pool,No Fence,No Addl Features,0,5,2006,WD,584500


Nothing makes me think any of these higher-priced houses are unreasonable.

Remaining columns to check: enclosed porch, the other two porches, pool area, and misc value.

In [401]:
df2[df2['enclosed_porch']>300]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
149,203,903426180,50,RM,57.0,8094,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1.5Fin,4,5,1915,1950,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,Fa,CBlock,TA,TA,No,Unf,0.0,Unf,0.0,888.0,888.0,GasA,Ex,Y,SBrkr,888,1074,0,1962,0.0,0.0,1,1,4,1,TA,9,Typ,1,TA,Detchd,1915.0,Unf,2.0,572.0,TA,TA,Y,160,0,364,0,0,0,No Pool,GdPrv,No Addl Features,0,6,2010,WD,149500
241,2560,534475160,20,RL,89.0,10858,Pave,,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,1952,1952,Gable,CompShg,Wd Sdng,Plywood,Stone,150.0,TA,Gd,CBlock,TA,TA,Mn,LwQ,40.0,Unf,0.0,1404.0,1444.0,GasA,Ex,Y,SBrkr,1624,0,0,1624,1.0,0.0,1,0,2,1,TA,6,Min1,1,Gd,Attchd,1952.0,RFn,1.0,240.0,TA,TA,Y,0,40,324,0,0,0,No Pool,MnPrv,No Addl Features,0,7,2006,WD,146500
507,989,924100060,60,RL,70.0,10457,Pave,,IR1,Lvl,AllPub,Inside,Mod,Mitchel,Norm,Norm,1Fam,2Story,5,7,1969,1969,Gable,CompShg,VinylSd,VinylSd,BrkFace,178.0,Gd,Ex,CBlock,TA,TA,Gd,BLQ,496.0,LwQ,288.0,0.0,784.0,GasA,Ex,Y,SBrkr,784,848,0,1632,0.0,0.0,1,1,4,1,TA,7,Typ,1,TA,Attchd,1969.0,RFn,2.0,898.0,TA,TA,Y,0,173,368,0,0,0,No Pool,MnPrv,No Addl Features,0,4,2009,WD,173000
828,661,535381040,50,RL,60.0,10410,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1.5Fin,4,5,1915,1950,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,BrkTil,TA,TA,No,Unf,0.0,Unf,0.0,1313.0,1313.0,GasA,TA,Y,SBrkr,1313,0,1064,2377,0.0,0.0,2,0,3,1,Gd,8,Min2,1,TA,Detchd,1954.0,Unf,2.0,528.0,TA,TA,Y,0,0,432,0,0,0,No Pool,No Fence,No Addl Features,0,6,2009,WD,142900
831,2663,902328100,75,RM,65.0,8850,Pave,,IR1,Bnk,AllPub,Corner,Gtl,OldTown,Norm,Norm,1Fam,2.5Unf,7,6,1916,1950,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,BrkTil,TA,TA,No,Unf,0.0,Unf,0.0,815.0,815.0,GasA,Ex,Y,SBrkr,815,875,0,1690,0.0,0.0,1,0,3,1,TA,7,Typ,1,Gd,Detchd,1916.0,Unf,1.0,225.0,TA,TA,Y,0,0,330,0,0,0,No Pool,No Fence,No Addl Features,0,7,2006,ConLw,144000
1090,209,904100140,70,RL,0.0,24090,Pave,,Reg,Lvl,AllPub,Inside,Gtl,ClearCr,Norm,Norm,1Fam,2Story,7,7,1940,1950,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,Gd,CBlock,TA,TA,Mn,Unf,0.0,Unf,0.0,1032.0,1032.0,GasA,Ex,Y,SBrkr,1207,1196,0,2403,0.0,0.0,2,0,4,1,TA,10,Typ,2,TA,Attchd,1940.0,Unf,1.0,349.0,TA,TA,Y,56,0,318,0,0,0,No Pool,No Fence,No Addl Features,0,6,2010,COD,244400
1782,1522,909250040,70,RL,51.0,9842,Pave,,Reg,Lvl,AllPub,Inside,Gtl,SWISU,Feedr,Norm,1Fam,2Story,5,6,1921,1998,Gable,CompShg,MetalSd,Wd Sdng,,0.0,TA,TA,BrkTil,TA,Fa,No,Unf,0.0,Unf,0.0,612.0,612.0,GasA,Ex,Y,SBrkr,990,1611,0,2601,0.0,0.0,3,1,4,1,TA,8,Typ,0,No Fireplace,BuiltIn,1998.0,RFn,2.0,621.0,TA,TA,Y,183,0,301,0,0,0,No Pool,No Fence,No Addl Features,0,5,2008,WD,189000


I don't see any reason to suspect that this isn't accurate.

In [402]:
df2[df2['3ssn_porch']>300]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
892,1209,534400060,20,RL,100.0,10175,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1964,1964,Gable,CompShg,HdBoard,Plywood,BrkFace,272.0,TA,TA,CBlock,TA,TA,No,BLQ,490.0,Unf,0.0,935.0,1425.0,GasA,Gd,Y,SBrkr,1425,0,0,1425,0.0,0.0,2,0,3,1,TA,7,Typ,1,Gd,Attchd,1964.0,RFn,2.0,576.0,TA,TA,Y,0,0,0,407,0,0,No Pool,No Fence,No Addl Features,0,7,2008,WD,180500
1219,365,527180100,20,RL,99.0,11851,Pave,,Reg,Lvl,AllPub,Corner,Gtl,Gilbert,Norm,Norm,1Fam,1Story,7,5,1990,1990,Gable,CompShg,HdBoard,HdBoard,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,1424.0,1424.0,GasA,Ex,Y,SBrkr,1442,0,0,1442,0.0,0.0,2,0,3,1,TA,5,Typ,0,No Fireplace,Attchd,1990.0,RFn,2.0,500.0,TA,TA,Y,0,34,0,508,0,0,No Pool,No Fence,No Addl Features,0,5,2009,WD,180500
1577,2592,535327210,20,RL,70.0,13300,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,5,1956,2001,Hip,CompShg,Wd Sdng,VinylSd,,0.0,TA,TA,CBlock,TA,TA,No,Rec,494.0,Unf,0.0,521.0,1015.0,GasA,Gd,Y,SBrkr,1384,0,0,1384,1.0,0.0,1,0,2,1,TA,6,Min1,0,No Fireplace,Attchd,2001.0,Unf,2.0,896.0,TA,TA,Y,75,0,0,323,0,0,No Pool,No Fence,Shed,400,6,2006,WD,159000
2045,1051,528102030,20,RL,96.0,12444,Pave,,Reg,Lvl,AllPub,FR2,Gtl,NridgHt,Norm,Norm,1Fam,1Story,8,5,2008,2008,Hip,CompShg,VinylSd,VinylSd,Stone,426.0,Ex,TA,PConc,Ex,TA,Av,GLQ,1336.0,Unf,0.0,596.0,1932.0,GasA,Ex,Y,SBrkr,1932,0,0,1932,1.0,0.0,2,0,2,1,Ex,7,Typ,1,Gd,Attchd,2008.0,Fin,3.0,774.0,TA,TA,Y,0,66,0,304,0,0,No Pool,No Fence,No Addl Features,0,11,2008,New,394617


Again, nothing makes me suspect these aren't accurate.

In [403]:
df2[df2['screen_porch']>300]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
52,2351,527356020,60,RL,80.0,16692,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NWAmes,RRAn,Norm,1Fam,2Story,7,5,1978,1978,Gable,CompShg,Plywood,Plywood,BrkFace,184.0,TA,TA,CBlock,Gd,TA,No,BLQ,790.0,LwQ,469.0,133.0,1392.0,GasA,TA,Y,SBrkr,1392,1392,0,2784,1.0,0.0,3,1,5,1,Gd,12,Typ,2,TA,Attchd,1978.0,RFn,2.0,564.0,TA,TA,Y,0,112,0,0,440,519,Fa,MnPrv,TenC,2000,7,2006,WD,250000
113,166,535457010,20,RL,87.0,10000,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1962,1962,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,261.0,TA,TA,CBlock,TA,TA,No,Unf,0.0,Unf,0.0,1116.0,1116.0,GasA,TA,Y,SBrkr,1116,0,0,1116,0.0,0.0,1,1,3,1,TA,5,Typ,0,No Fireplace,Attchd,1962.0,Unf,2.0,440.0,TA,TA,Y,0,0,0,0,385,0,No Pool,No Fence,No Addl Features,0,2,2010,WD,160000
240,2740,905451050,20,RL,80.0,12048,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Edwards,Norm,Norm,1Fam,1Story,5,6,1952,2002,Gable,CompShg,Wd Sdng,Wd Sdng,BrkFace,232.0,TA,TA,Slab,No Basement,No Basement,No Basement,No Basement,0.0,No Basement,0.0,0.0,0.0,GasA,Gd,Y,SBrkr,1488,0,0,1488,0.0,0.0,1,0,3,1,TA,7,Typ,1,Ex,Attchd,2002.0,RFn,2.0,569.0,TA,TA,Y,0,189,36,0,348,0,No Pool,No Fence,No Addl Features,0,4,2006,WD,135000
438,626,535125070,20,RL,80.0,14695,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,8,1966,2008,Gable,CompShg,MetalSd,MetalSd,BrkFace,210.0,TA,Gd,CBlock,TA,TA,No,ALQ,1387.0,Unf,0.0,175.0,1562.0,GasA,Gd,Y,SBrkr,1567,0,0,1567,1.0,0.0,2,0,2,1,Gd,5,Typ,2,Gd,Attchd,1966.0,Unf,2.0,542.0,TA,TA,Y,0,110,0,0,342,0,No Pool,GdWo,No Addl Features,0,7,2009,WD,256000
703,2863,909280030,50,RL,86.0,11500,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Crawfor,Norm,Norm,1Fam,1.5Fin,7,7,1936,1987,Gable,CompShg,BrkFace,BrkFace,,0.0,Gd,TA,CBlock,Gd,TA,No,Rec,223.0,Unf,0.0,794.0,1017.0,GasA,Gd,Y,SBrkr,1020,1037,0,2057,0.0,0.0,1,1,3,1,Gd,6,Typ,1,Gd,Attchd,1936.0,Fin,1.0,180.0,Fa,TA,Y,0,0,0,0,322,0,No Pool,No Fence,No Addl Features,0,6,2006,WD,250000
974,1289,902105020,50,RM,60.0,10440,Pave,Grvl,Reg,Lvl,AllPub,Corner,Gtl,OldTown,Norm,Norm,1Fam,1.5Fin,6,7,1920,1950,Gable,CompShg,BrkFace,Wd Sdng,,0.0,Gd,Gd,BrkTil,Gd,TA,No,LwQ,493.0,Unf,0.0,1017.0,1510.0,GasW,Ex,Y,SBrkr,1584,1208,0,2792,0.0,0.0,2,0,5,1,TA,8,Mod,2,TA,Detchd,1920.0,Unf,2.0,520.0,Fa,TA,Y,0,547,0,0,480,0,No Pool,MnPrv,Shed,1150,6,2008,WD,256000
1035,2667,902400110,75,RM,90.0,22950,Pave,,IR2,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,2.5Fin,10,9,1892,1993,Gable,WdShngl,Wd Sdng,Wd Sdng,,0.0,Gd,Gd,BrkTil,TA,TA,Mn,Unf,0.0,Unf,0.0,1107.0,1107.0,GasA,Ex,Y,SBrkr,1518,1518,572,3608,0.0,0.0,2,1,4,1,Ex,12,Typ,2,TA,Detchd,1993.0,Unf,3.0,840.0,Ex,TA,Y,0,260,0,0,410,0,No Pool,GdPrv,No Addl Features,0,6,2006,WD,475000
1174,2722,905200340,20,RL,102.0,17920,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,1Fam,1Story,5,4,1955,1974,Hip,CompShg,Wd Sdng,Plywood,,0.0,TA,TA,CBlock,TA,TA,Mn,ALQ,306.0,Rec,1085.0,372.0,1763.0,GasA,TA,Y,SBrkr,1779,0,0,1779,1.0,0.0,1,1,3,1,TA,6,Typ,1,Gd,Attchd,1955.0,Unf,2.0,454.0,TA,TA,Y,0,418,0,0,312,0,No Pool,No Fence,No Addl Features,0,7,2006,WD,170000
1357,388,527375100,20,RL,0.0,9373,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,NWAmes,PosN,Norm,1Fam,1Story,5,7,1975,1975,Gable,CompShg,HdBoard,HdBoard,BrkFace,161.0,TA,TA,CBlock,Gd,TA,Av,ALQ,1333.0,LwQ,168.0,120.0,1621.0,GasA,TA,Y,SBrkr,1621,0,0,1621,1.0,0.0,2,0,3,1,TA,7,Typ,2,Fa,Attchd,1975.0,RFn,2.0,478.0,TA,TA,Y,0,0,0,0,490,0,No Pool,No Fence,No Addl Features,0,6,2009,WD,213000
1908,2448,528344060,60,RL,78.0,12011,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,1998,1998,Gable,CompShg,VinylSd,VinylSd,BrkFace,530.0,Gd,TA,PConc,Gd,TA,Av,GLQ,956.0,Unf,0.0,130.0,1086.0,GasA,Ex,Y,SBrkr,1086,838,0,1924,1.0,0.0,2,1,3,1,Gd,7,Typ,1,TA,Attchd,1998.0,RFn,2.0,592.0,TA,TA,Y,208,75,0,0,374,0,No Pool,No Fence,No Addl Features,0,6,2006,WD,280000


Again, nothing makes me think this is unreasonable.
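
The three porch checks above all apply the same filter, so they could be wrapped in a small helper. This is only a sketch on made-up data: `toy` and `rows_above` are hypothetical names, not part of the notebook, and the 300 threshold is simply the one used in the cells above.

```python
import pandas as pd

# Toy data standing in for df2; the values are made up for illustration.
toy = pd.DataFrame({
    'enclosed_porch': [0, 364, 120, 432],
    '3ssn_porch': [0, 0, 407, 0],
    'screen_porch': [440, 0, 0, 150],
})

def rows_above(frame, column, threshold=300):
    """Return the rows where `column` is strictly greater than `threshold`."""
    return frame[frame[column] > threshold]

# Count how many rows each porch column flags for manual review.
flagged_counts = {col: len(rows_above(toy, col)) for col in toy.columns}
print(flagged_counts)
```

The flagged rows still need to be read by eye, as in the cells above; the helper only saves retyping the filter.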

In [404]:
df[df['kitchen_abvgr']==3]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
1359,716,902325070,75,RM,90.0,8100,Pave,,Reg,Lvl,AllPub,Corner,Gtl,OldTown,Norm,Norm,1Fam,2.5Unf,5,5,1898,1965,Hip,CompShg,AsbShng,AsbShng,,0.0,TA,TA,PConc,TA,TA,No,Unf,0.0,Unf,0.0,849.0,849.0,GasA,TA,N,FuseA,1075,1063,0,2138,0.0,0.0,2,0,2,3,TA,11,Typ,0,No Fireplace,Detchd,1910.0,Unf,2.0,360.0,Fa,Po,N,40,156,0,0,0,0,No Pool,MnPrv,No Addl Features,0,11,2009,WD,106000


In [405]:
df[df['kitchen_abvgr']==2].shape

(87, 81)

Nothing in the record for the 3-kitchen house makes me think it isn't real, but it is the only house with 3 kitchens, so I'm going to change it to the typical kitchen count, 1 (the mean rounds to 1).

In [406]:
df2['kitchen_abvgr'].mean()

1.043026706231454

In [407]:
df2.loc[df2['kitchen_abvgr'] == 3, 'kitchen_abvgr'] = 1

In [408]:
df2[df2['kitchen_abvgr'] == 3]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
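
The replace-and-verify step above can also be checked with a count instead of an empty frame. This is a minimal sketch on made-up data (`toy` is hypothetical). It uses the mode as the replacement value, which is a swapped-in choice rather than what the notebook does: the notebook looks at the mean and uses 1.

```python
import pandas as pd

# Toy column standing in for df2['kitchen_abvgr']; values are made up.
toy = pd.DataFrame({'kitchen_abvgr': [1, 1, 2, 3, 1]})

# Assumption: use the most common value (mode) as the replacement.
typical = toy['kitchen_abvgr'].mode().iloc[0]

# Overwrite only the rows that match the boolean mask.
mask = toy['kitchen_abvgr'] == 3
toy.loc[mask, 'kitchen_abvgr'] = typical

# Verify that no 3-kitchen rows remain.
remaining = int((toy['kitchen_abvgr'] == 3).sum())
print(typical, remaining)
```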


In [409]:
# As noted above, I decided not to use the dataframe that resulted from the additional cleaning,
# as the predictions that resulted were a bit weaker.

# df2.to_csv('datasets/draft3_cleaned_train.csv', index = False)

I did the following to pull a list that would make it easier to build my data dictionary.

In [410]:
col_dtypes.reset_index(inplace = True)
col_dtypes_array = col_dtypes.to_numpy()
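
For context, here is one way an array like this could be turned into a data dictionary skeleton. It is a sketch under assumptions: `col_dtypes` is not defined in this part of the notebook, so I assume it began as a one-column DataFrame of dtypes, and `toy`, `rows`, and `data_dictionary` are hypothetical names.

```python
import pandas as pd

# Toy frame standing in for the cleaned data; values are made up.
toy = pd.DataFrame({
    'id': [1, 2],
    'ms_zoning': ['RL', 'RM'],
    'lot_frontage': [60.0, 85.0],
})

# Assumption: col_dtypes was a one-column DataFrame of dtypes.
col_dtypes = toy.dtypes.to_frame(name='dtype')
col_dtypes.reset_index(inplace=True)
col_dtypes_array = col_dtypes.to_numpy()

# Build a markdown table with an empty description column to fill in by hand.
rows = ['|Feature|Type|Description|', '|---|---|---|']
for name, dtype in col_dtypes_array:
    rows.append(f'|{name}|{dtype}||')
data_dictionary = '\n'.join(rows)
print(data_dictionary)
```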

In [411]:
col_dtypes_array

array([['id', dtype('int64')],
       ['pid', dtype('int64')],
       ['ms_subclass', dtype('int64')],
       ['ms_zoning', dtype('O')],
       ['lot_frontage', dtype('float64')],
       ['lot_area', dtype('int64')],
       ['street', dtype('O')],
       ['alley', dtype('O')],
       ['lot_shape', dtype('O')],
       ['land_contour', dtype('O')],
       ['utilities', dtype('O')],
       ['lot_config', dtype('O')],
       ['land_slope', dtype('O')],
       ['neighborhood', dtype('O')],
       ['condition_1', dtype('O')],
       ['condition_2', dtype('O')],
       ['bldg_type', dtype('O')],
       ['house_style', dtype('O')],
       ['overall_qual', dtype('int64')],
       ['overall_cond', dtype('int64')],
       ['year_built', dtype('int64')],
       ['year_remod/add', dtype('int64')],
       ['roof_style', dtype('O')],
       ['roof_matl', dtype('O')],
       ['exterior_1st', dtype('O')],
       ['exterior_2nd', dtype('O')],
       ['mas_vnr_type', dtype('O')],
       ['mas_vnr_area', 