# Prepping the Test DataFrame
The purpose of this notebook is to prepare the Test dataframe from test.csv (the one to be used for the Kaggle competition). This is a duplicate of the 'initial cleaning' notebook designed to get the Kaggle Test dataframe to the point where the features match the cleaned Training dataframe.

I've left many of the notes from the original cleaning notebook to make it easier to follow the workflow.

In [216]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [265]:
df = pd.read_csv('./datasets/test.csv')

In [266]:
df.shape

(878, 80)

I'm adding a placeholder 'saleprice' column filled with 0 so that the final Kaggle Test dataframe has the same columns as my training dataframe and the two can be concatenated before creating dummy columns.

In [267]:
df['saleprice'] = 0
df['saleprice'].value_counts()

0    878
Name: saleprice, dtype: int64
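A minimal sketch of why the placeholder column helps, assuming the plan is to stack the two frames, create dummies once, and split them apart again. The `train` and `test` frames here are hypothetical miniatures, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned training and test dataframes.
train = pd.DataFrame({
    'neighborhood': ['OldTown', 'Sawyer', 'NAmes'],
    'saleprice': [130500, 220000, 109000],
})
test = pd.DataFrame({'neighborhood': ['Sawyer', 'Gilbert']})

# Placeholder target so both frames share the same columns.
test['saleprice'] = 0

# Stack the frames; `keys` tags every row with the frame it came from.
combined = pd.concat([train, test], keys=['train', 'test'])

# Encode once so both halves end up with identical dummy columns.
dummies = pd.get_dummies(combined, columns=['neighborhood'])

# Split back apart on the outer index level and drop the placeholder.
train_d = dummies.loc['train']
test_d = dummies.loc['test'].drop(columns='saleprice')
```

Encoding the combined frame means a category that only shows up in one file (like 'Gilbert' here) still gets a column in both halves.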

In other notebooks I cited where I found the code below to display all rows.

In [268]:
pd.options.display.max_rows = 999

In [269]:
df.isnull().sum()

Id                   0
PID                  0
MS SubClass          0
MS Zoning            0
Lot Frontage       160
Lot Area             0
Street               0
Alley              820
Lot Shape            0
Land Contour         0
Utilities            0
Lot Config           0
Land Slope           0
Neighborhood         0
Condition 1          0
Condition 2          0
Bldg Type            0
House Style          0
Overall Qual         0
Overall Cond         0
Year Built           0
Year Remod/Add       0
Roof Style           0
Roof Matl            0
Exterior 1st         0
Exterior 2nd         0
Mas Vnr Type         1
Mas Vnr Area         1
Exter Qual           0
Exter Cond           0
Foundation           0
Bsmt Qual           25
Bsmt Cond           25
Bsmt Exposure       25
BsmtFin Type 1      25
BsmtFin SF 1         0
BsmtFin Type 2      25
BsmtFin SF 2         0
Bsmt Unf SF          0
Total Bsmt SF        0
Heating              0
Heating QC           0
Central Air          0
Electrical 

Because there are so many columns, I wanted to list only the columns that contain null values. My own experiments weren't working, so I did some research: this [Stack Overflow question](https://stackoverflow.com/questions/53137100/filter-pandas-dataframe-columns-with-null-data) showed me the approach I used below.

In [270]:
df.columns[df.isna().any()].tolist()

['Lot Frontage',
 'Alley',
 'Mas Vnr Type',
 'Mas Vnr Area',
 'Bsmt Qual',
 'Bsmt Cond',
 'Bsmt Exposure',
 'BsmtFin Type 1',
 'BsmtFin Type 2',
 'Electrical',
 'Fireplace Qu',
 'Garage Type',
 'Garage Yr Blt',
 'Garage Finish',
 'Garage Qual',
 'Garage Cond',
 'Pool QC',
 'Fence',
 'Misc Feature']
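A related sketch, in case the counts are wanted alongside the names: filter the null-count Series itself. The `toy` frame is a hypothetical example, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mimic the raw test.csv headers.
toy = pd.DataFrame({
    'Lot Frontage': [69.0, np.nan, 58.0],
    'Alley': ['Grvl', np.nan, np.nan],
    'Lot Area': [9142, 9662, 17104],
})

# Count nulls per column, then keep only columns with at least one null.
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```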

In [271]:
df.columns = df.columns.str.replace(' ', '_').str.lower()

In [272]:
df.head(2)

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,2fmCon,2Story,6,8,1910,1950,Gable,CompShg,AsbShng,AsbShng,,0.0,TA,Fa,Stone,Fa,TA,No,Unf,0,Unf,0,1020,1020,GasA,Gd,N,FuseP,908,1020,0,1928,0,0,2,0,4,2,Fa,9,Typ,0,,Detchd,1910.0,Unf,1,440,Po,Po,Y,0,60,112,0,0,0,,,,0,4,2006,WD,0
1,2718,905108090,90,RL,,9662,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,Duplex,1Story,5,4,1977,1977,Gable,CompShg,Plywood,Plywood,,0.0,TA,TA,CBlock,Gd,TA,No,Unf,0,Unf,0,1967,1967,GasA,TA,Y,SBrkr,1967,0,0,1967,0,0,2,0,6,2,TA,10,Typ,0,,Attchd,1977.0,Fin,2,580,TA,TA,Y,170,0,0,0,0,0,,,,0,8,2006,WD,0


## Data Dictionary

I reference the [data dictionary](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt) a lot in the following.

### Alley - NaN --> 'None'

In [273]:
df['alley'].value_counts(dropna = False)

NaN     820
Grvl     35
Pave     23
Name: alley, dtype: int64

It appears that NaN means there is no alley. I'm going to replace NaN with 'None' for now. 

In [274]:
df['alley'] = df['alley'].fillna('None')

df['alley'].value_counts(dropna = False)

None    820
Grvl     35
Pave     23
Name: alley, dtype: int64

### Basements!

I created the for-loop below to examine how many NAs exist in each basement column and help me decide how to deal with them, since there's presumably at least some relationship between them. Rather than copying the code over and over again, I also returned to the for-loop as my sanity check while I cleaned the basement columns.

In [275]:
bsmt_list = df.columns[df.columns.str.contains('bsmt')].tolist()

bsmt_list

['bsmt_qual',
 'bsmt_cond',
 'bsmt_exposure',
 'bsmtfin_type_1',
 'bsmtfin_sf_1',
 'bsmtfin_type_2',
 'bsmtfin_sf_2',
 'bsmt_unf_sf',
 'total_bsmt_sf',
 'bsmt_full_bath',
 'bsmt_half_bath']

In [276]:
for i in bsmt_list:
    print(f'{i} has {df[i].isnull().sum()}')

bsmt_qual has 25
bsmt_cond has 25
bsmt_exposure has 25
bsmtfin_type_1 has 25
bsmtfin_sf_1 has 0
bsmtfin_type_2 has 25
bsmtfin_sf_2 has 0
bsmt_unf_sf has 0
total_bsmt_sf has 0
bsmt_full_bath has 0
bsmt_half_bath has 0
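The same check can be wrapped in a small helper so it can be reused for other column groups. This is a sketch only; `null_report` and the `toy` frame are hypothetical names I'm introducing for illustration.

```python
import numpy as np
import pandas as pd

def null_report(frame, keyword):
    """Return null counts for every column whose name contains `keyword`."""
    cols = frame.columns[frame.columns.str.contains(keyword)]
    return frame[cols].isna().sum()

# Hypothetical toy frame with a few basement and garage columns.
toy = pd.DataFrame({
    'bsmt_qual': ['TA', np.nan, 'Gd'],
    'total_bsmt_sf': [1020, 0, 654],
    'garage_type': ['Detchd', np.nan, np.nan],
})

bsmt_nulls = null_report(toy, 'bsmt')
garage_nulls = null_report(toy, 'garage')
print(bsmt_nulls)
```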


I was surprised by the variation in the number of NaNs. Knowing that some of the NaNs actually represent "No Basement", I thought there'd be a pretty consistent number.

I decided to start by looking at the columns with the fewest NaNs to see what's going on.

**I needed to find where I accidentally dropped a column, so I did a shape check and will do a few more.**

In [277]:
df.shape

(878, 81)
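A shape check says that something changed but not what. A possible alternative, sketched on a hypothetical `toy` frame, is to snapshot the column names and compare sets, which names the missing column directly.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataframe.
toy = pd.DataFrame({'id': [1, 2], 'alley': ['Grvl', None], 'saleprice': [0, 0]})

# Snapshot the column names before cleaning.
expected_cols = set(toy.columns)

# Simulate a cleaning step that accidentally drops a column.
cleaned = toy.drop(columns='alley')

# The set difference names whatever went missing.
missing = expected_cols - set(cleaned.columns)
print(missing)
```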

### Basement Full Bath and Half Bath

In [278]:
df[df['bsmt_full_bath'].isna()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


In [279]:
df['bsmt_full_bath'].value_counts(dropna = False)

0    507
1    356
2     15
Name: bsmt_full_bath, dtype: int64

I ran the two cells above to get a sense of what was going on with these instances. It appears these NaNs should be 0.0, reflecting no basement. The same appears to be true for the half baths, which I checked below.

In [280]:
df['bsmt_full_bath'] = df['bsmt_full_bath'].fillna(0.0)

In [281]:
df['bsmt_full_bath'].value_counts(dropna = False)

0    507
1    356
2     15
Name: bsmt_full_bath, dtype: int64

In [282]:
df[df['bsmt_half_bath'].isna()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


In [283]:
df['bsmt_half_bath'] = df['bsmt_half_bath'].fillna(0.0)

In [284]:
df['bsmt_half_bath'].value_counts(dropna = False)

0    829
1     49
Name: bsmt_half_bath, dtype: int64

### Basement Sq Ft, Finished (both types) & Unfinished, Total

In [285]:
df[df['bsmt_unf_sf'].isna()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


Having looked at values in other instances, it appears this should also be 0.0, for no basement.

In [286]:
df['bsmt_unf_sf'] = df['bsmt_unf_sf'].fillna(0.0)
df['bsmt_unf_sf'].isnull().sum()

0

In [287]:
df[df['bsmtfin_sf_1'].isnull()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


In [288]:
df['bsmtfin_sf_1'] = df['bsmtfin_sf_1'].fillna(0.0)
df['bsmtfin_sf_1'].isnull().sum()

0

In [289]:
df[df['bsmtfin_sf_2'].isnull()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


In [290]:
df['bsmtfin_sf_2'] = df['bsmtfin_sf_2'].fillna(0.0)
df['bsmtfin_sf_2'].isnull().sum()

0

In [291]:
df[df['total_bsmt_sf'].isnull()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


In [292]:
df['total_bsmt_sf'] = df['total_bsmt_sf'].fillna(0.0)
df['total_bsmt_sf'].isnull().sum()

0

**still have 878 rows and 81 columns**

In [293]:
df.shape

(878, 81)

### Basement Quality

The dictionary specifies that 'Na' is 'No Basement', so I'm going to cast it as such. This appears to be true for all the categorical basement columns.

In [294]:
df['bsmt_qual'].value_counts(dropna = False)

TA     396
Gd     355
Ex      73
Fa      28
NaN     25
Po       1
Name: bsmt_qual, dtype: int64

In [295]:
df['bsmt_qual'] = df['bsmt_qual'].fillna('No Basement')

In [296]:
df['bsmt_qual'].value_counts(dropna = False)

TA             396
Gd             355
Ex              73
Fa              28
No Basement     25
Po               1
Name: bsmt_qual, dtype: int64

### Basement Condition

The data dictionary says that 'Na' means 'No Basement', so I'm making that explicit.

In [297]:
df['bsmt_cond'] = df['bsmt_cond'].fillna('No Basement')
df['bsmt_cond'].isnull().sum()

0

In [298]:
df['bsmt_cond'].value_counts()

TA             781
Fa              39
Gd              33
No Basement     25
Name: bsmt_cond, dtype: int64

### Basement Exposure
Dictionary says 'Na' means 'No Basement', so making that explicit.

In [299]:
df['bsmt_exposure'] = df['bsmt_exposure'].fillna('No Basement')
df['bsmt_exposure'].isna().sum()

0

In [300]:
df['bsmt_exposure'].value_counts()

No             567
Av             130
Gd              80
Mn              76
No Basement     25
Name: bsmt_exposure, dtype: int64

### Basement Finish Type 1 and 2
The dictionary says 'Na' means 'No Basement', so making that explicit. There's one more null Finish Type 2 than Type 1, so I looked into that before making the change to 'No Basement'.

In [301]:
df['bsmtfin_type_1'] = df['bsmtfin_type_1'].fillna('No Basement')
df['bsmtfin_type_1'].value_counts()

Unf            248
GLQ            243
ALQ            136
Rec            105
BLQ             69
LwQ             52
No Basement     25
Name: bsmtfin_type_1, dtype: int64

In [302]:
df[(df['bsmtfin_type_2'].isnull()) & (df['bsmtfin_type_1'] != 'No Basement')][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


In the original dataframe there's one instance with a null 'bsmtfin_type_2' that clearly has a basement based on the other entries. Since I have no way of speculating what type that should be, I'm deleting this instance.

In [303]:
df.drop(index = df[(df['bsmtfin_type_2'].isnull()) & (df['bsmtfin_type_1'] != 'No Basement')].index, inplace = True)

In [304]:
df[(df['bsmtfin_type_2'].isnull()) & (df['bsmtfin_type_1'] != 'No Basement')][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


Now making the change to 'No Basement' on Finish Type 2

In [305]:
df['bsmtfin_type_2'] = df['bsmtfin_type_2'].fillna('No Basement')
df['bsmtfin_type_2'].value_counts()

Unf            749
LwQ             29
Rec             26
No Basement     25
BLQ             20
ALQ             18
GLQ             11
Name: bsmtfin_type_2, dtype: int64
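Since every categorical basement column gets the same fill value, the fills could also be done in one step. A minimal sketch on a hypothetical `toy` frame; `bsmt_cat_cols` is a name I'm introducing here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the second row has no basement.
toy = pd.DataFrame({
    'bsmt_qual': ['TA', np.nan],
    'bsmt_cond': ['TA', np.nan],
    'bsmt_exposure': ['No', np.nan],
    'total_bsmt_sf': [1020, 0],
})

# Only the categorical columns; the numeric ones are handled separately.
bsmt_cat_cols = ['bsmt_qual', 'bsmt_cond', 'bsmt_exposure']
toy[bsmt_cat_cols] = toy[bsmt_cat_cols].fillna('No Basement')
print(toy)
```

Note that 'No' in bsmt_exposure (a basement with no exposure) stays distinct from 'No Basement'.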

Checking in on which columns still have Na's

In [306]:
df.columns[df.isnull().any()].tolist()

['lot_frontage',
 'mas_vnr_type',
 'mas_vnr_area',
 'electrical',
 'fireplace_qu',
 'garage_type',
 'garage_yr_blt',
 'garage_finish',
 'garage_qual',
 'garage_cond',
 'pool_qc',
 'fence',
 'misc_feature']

### Garage Columns

In [307]:
garage_list = df.columns[df.columns.str.contains('garage')].tolist()
garage_list

['garage_type',
 'garage_yr_blt',
 'garage_finish',
 'garage_cars',
 'garage_area',
 'garage_qual',
 'garage_cond']

Again, I recycled the below for-loop as a final sanity check, returning to this line repeatedly as I worked, rather than rewriting the loop for every step I took.

In [308]:
for i in garage_list:
    print(f'{i} has {df[i].isnull().sum()} nulls')

garage_type has 44 nulls
garage_yr_blt has 45 nulls
garage_finish has 45 nulls
garage_cars has 0 nulls
garage_area has 0 nulls
garage_qual has 45 nulls
garage_cond has 45 nulls


I did a search to find the not null method for the following, which turned up the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notnull.html).

In [309]:
df[(df['garage_yr_blt'].isnull()) & (df['garage_type'].notnull())][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
764,Detchd,,,1,360,,


A garage_type is listed for this one instance, but NaN in a number of the other columns means 'No Garage' according to the dictionary, so I'm going to operate on the assumption that the garage type is in error here and change it to 'No Garage'.
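One way such a change could be applied is with a boolean mask and `.loc`, which selects by label and so works even when the index is no longer 0 to n-1. This is a sketch on a hypothetical `toy` frame, not the cell that was actually run.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a non-default index, like a frame after drops.
toy = pd.DataFrame({
    'garage_type': ['Attchd', 'Detchd', np.nan],
    'garage_yr_blt': [1977.0, np.nan, np.nan],
    'garage_cars': [2, 1, 0],
}, index=[10, 764, 900])

# Rows that list a garage type but have no garage year.
mask = toy['garage_yr_blt'].isna() & toy['garage_type'].notna()

# Overwrite only the flagged rows, only in the garage_type column.
toy.loc[mask, 'garage_type'] = 'No Garage'
print(toy)
```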

In [311]:
df.reset_index().head()

Unnamed: 0,index,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,2fmCon,2Story,6,8,1910,1950,Gable,CompShg,AsbShng,AsbShng,,0.0,TA,Fa,Stone,Fa,TA,No,Unf,0,Unf,0,1020,1020,GasA,Gd,N,FuseP,908,1020,0,1928,0,0,2,0,4,2,Fa,9,Typ,0,,Detchd,1910.0,Unf,1,440,Po,Po,Y,0,60,112,0,0,0,,,,0,4,2006,WD,0
1,1,2718,905108090,90,RL,,9662,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,Duplex,1Story,5,4,1977,1977,Gable,CompShg,Plywood,Plywood,,0.0,TA,TA,CBlock,Gd,TA,No,Unf,0,Unf,0,1967,1967,GasA,TA,Y,SBrkr,1967,0,0,1967,0,0,2,0,6,2,TA,10,Typ,0,,Attchd,1977.0,Fin,2,580,TA,TA,Y,170,0,0,0,0,0,,,,0,8,2006,WD,0
2,2,2414,528218130,60,RL,58.0,17104,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,7,5,2006,2006,Gable,CompShg,VinylSd,VinylSd,,0.0,Gd,TA,PConc,Gd,Gd,Av,GLQ,554,Unf,0,100,654,GasA,Ex,Y,SBrkr,664,832,0,1496,1,0,2,1,3,1,Gd,7,Typ,1,Gd,Attchd,2006.0,RFn,2,426,TA,TA,Y,100,24,0,0,0,0,,,,0,9,2006,New,0
3,3,1989,902207150,30,RM,60.0,8520,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1Story,5,6,1923,2006,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,Gd,TA,CBlock,TA,TA,No,Unf,0,Unf,0,968,968,GasA,TA,Y,SBrkr,968,0,0,968,0,0,1,0,2,1,TA,5,Typ,0,,Detchd,1935.0,Unf,2,480,Fa,TA,N,0,0,184,0,0,0,,,,0,7,2007,WD,0
4,4,625,535105100,20,RL,,9500,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1963,1963,Gable,CompShg,Plywood,Plywood,BrkFace,247.0,TA,TA,CBlock,Gd,TA,No,BLQ,609,Unf,0,785,1394,GasA,Gd,Y,SBrkr,1394,0,0,1394,1,0,1,1,3,1,TA,6,Typ,2,Gd,Attchd,1963.0,RFn,2,514,TA,TA,Y,0,76,0,0,185,0,,,,0,7,2009,WD,0


In [313]:
garage_index = df[(df['garage_yr_blt'].isnull()) & (df['garage_type'].notnull())].index

# garage_index holds index labels, so select by label with .loc rather than by position
df.loc[garage_index, 'garage_type']

764    Detchd
Name: garage_type, dtype: object

In [314]:
df.shape

(878, 81)

In [315]:
garage_index.astype(int)

Int64Index([764], dtype='int64')

In [316]:
df.shape

(878, 81)

In [317]:
df['garage_type'] = df['garage_type'].fillna('No Garage')
   
df['garage_type'].isnull().sum()

0

In [318]:
df.shape

(878, 81)

The dictionary spells out the below "no_garage_list" columns as NA = No Garage. This is the first place I copied the df because I wanted a backup in case I made a mistake filling the NAs in multiple columns in one block, as it's the first time I've done it that way. When I succeeded, I moved the copy_df code to afterwards to save it before my next experiment.

# Error Here

This didn't wind up hurting this dataframe because it didn't have nulls here, but because I copied this notebook from the other, the mistake was here as well.

In [331]:
# no_garage_list = ['garage_finish', 'garage_qual', 'garage_cond']

# df[no_garage_list] = df[no_garage_list].fillna('No Garage')

In [319]:
# don't need this because I'm not making that change here.
# copy_df = df.copy()

In [320]:
df[df['garage_cars'].isnull()][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond


In [321]:
# no_garage_list2 = ['garage_yr_blt', 'garage_cars', 'garage_area']

#this was an artifact of the original cleaning that would create a new row
#df.at[1712, no_garage_list2] = 'No Garage'

In [322]:
df.shape

(878, 81)

I've got to figure out what's going on with 'garage_yr_blt' because it seems possible some of these NAs may be for things other than "No Garage".

In [330]:
df['garage_area'].head()

0    440
1    580
2    426
3    480
4    514
Name: garage_area, dtype: int64

In [323]:
df[df['garage_yr_blt'].isnull()].head()[garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
29,No Garage,,,0,0,,
45,No Garage,,,0,0,,
66,No Garage,,,0,0,,
68,No Garage,,,0,0,,
105,No Garage,,,0,0,,


In [324]:
df[(df['garage_yr_blt'].isnull()) & (df['garage_finish'] != 'No Garage')]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
29,1904,534451020,50,RL,51.0,3500,Pave,,Reg,Lvl,AllPub,Inside,Gtl,BrkSide,Feedr,Norm,1Fam,1.5Fin,3,5,1945,1950,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,TA,TA,No,LwQ,144,Unf,0,226,370,GasA,TA,N,FuseA,442,228,0,670,1,0,1,0,2,1,Fa,4,Typ,0,,No Garage,,,0,0,,,N,0,21,0,0,0,0,,MnPrv,Shed,2000,7,2007,WD,0
45,979,923228150,160,RM,21.0,1533,Pave,,Reg,Lvl,AllPub,Inside,Gtl,MeadowV,Norm,Norm,Twnhs,2Story,4,6,1970,2008,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,CBlock,TA,TA,No,Unf,0,Unf,0,546,546,GasA,TA,Y,SBrkr,798,546,0,1344,0,0,1,1,3,1,TA,6,Typ,1,TA,No Garage,,,0,0,,,Y,0,0,0,0,0,0,,,,0,5,2009,WD,0
66,2362,527403120,20,RL,,8125,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,4,4,1971,1971,Gable,CompShg,HdBoard,HdBoard,,0.0,TA,TA,CBlock,TA,TA,No,BLQ,614,Unf,0,244,858,GasA,TA,Y,SBrkr,858,0,0,858,0,0,1,0,3,1,TA,5,Typ,0,,No Garage,,,0,0,,,Y,0,0,0,0,0,0,,,,0,6,2006,WD,0
68,2188,908226180,30,RH,70.0,4270,Pave,,Reg,Bnk,AllPub,Inside,Mod,Edwards,Norm,Norm,1Fam,1Story,3,6,1931,2006,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,BrkTil,TA,TA,No,Rec,544,Unf,0,0,544,GasA,Ex,Y,SBrkr,774,0,0,774,0,0,1,0,3,1,Gd,6,Typ,0,,No Garage,,,0,0,,,Y,0,0,286,0,0,0,,,,0,5,2007,WD,0
105,1988,902207010,30,RM,40.0,3880,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1Story,5,9,1945,1997,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,Gd,CBlock,TA,TA,No,ALQ,329,Unf,0,357,686,GasA,Gd,Y,SBrkr,866,0,0,866,0,0,1,0,2,1,Gd,4,Typ,0,,No Garage,,,0,0,,,Y,58,42,0,0,0,0,,,,0,8,2007,WD,0
109,217,905101300,90,RL,72.0,10773,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,Duplex,1Story,4,3,1967,1967,Gable,Tar&Grv,Plywood,Plywood,BrkFace,72.0,Fa,Fa,CBlock,TA,TA,No,ALQ,704,Unf,0,1128,1832,GasA,TA,N,SBrkr,1832,0,0,1832,2,0,2,0,4,2,TA,8,Typ,0,,No Garage,,,0,0,,,Y,0,58,0,0,0,0,,,,0,5,2010,WD,0
113,2908,923205120,20,RL,90.0,17217,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,1Story,5,5,2006,2006,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0,Unf,0,1140,1140,GasA,Ex,Y,SBrkr,1140,0,0,1140,0,0,1,0,3,1,TA,6,Typ,0,,No Garage,,,0,0,,,Y,36,56,0,0,0,0,,,,0,7,2006,WD,0
144,1507,908250040,50,RL,57.0,8050,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Edwards,Norm,Norm,1Fam,1.5Fin,5,8,1947,1993,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,Gd,Slab,No Basement,No Basement,No Basement,No Basement,0,No Basement,0,0,0,GasA,Gd,Y,SBrkr,929,208,0,1137,0,0,1,1,4,1,TA,8,Min1,0,,No Garage,,,0,0,,,Y,0,0,0,0,0,0,,,,0,4,2008,WD,0
152,1368,903476110,50,RM,60.0,5586,Pave,,IR1,Bnk,AllPub,Inside,Gtl,OldTown,Feedr,Norm,1Fam,1.5Fin,6,7,1920,1998,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,BrkTil,TA,TA,No,Unf,0,Unf,0,901,901,GasA,Gd,Y,SBrkr,1088,110,0,1198,0,0,1,0,4,1,TA,7,Typ,0,,No Garage,,,0,0,,,N,0,98,0,0,0,0,,MnPrv,,0,9,2008,ConLD,0
156,332,923228270,160,RM,21.0,1900,Pave,,Reg,Lvl,AllPub,Inside,Gtl,MeadowV,Norm,Norm,TwnhsE,2Story,4,4,1970,1970,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,CBlock,TA,TA,No,Unf,0,Unf,0,546,546,GasA,Ex,Y,SBrkr,546,546,0,1092,0,0,1,1,3,1,TA,5,Typ,0,,No Garage,,,0,0,,,Y,0,0,0,0,0,0,,,,0,6,2010,WD,0


It appears all the garage_yr_blt = NaN rows correspond to No Garage.

I don't use garage year built in the model, so I'm going to fill the NaNs with the mean garage year built.

In [325]:
df['garage_yr_blt'].mean()

1976.7599039615845

In [326]:
df['garage_yr_blt'] = df['garage_yr_blt'].fillna(round(df['garage_yr_blt'].mean()))
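For illustration, here is the same mean fill on a hypothetical four-value Series, which makes it easy to see that `mean()` skips NaN and that the rounded value is what lands in the gap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series with one missing garage year.
yr = pd.Series([1910.0, 1977.0, np.nan, 2006.0], name='garage_yr_blt')

# mean() ignores NaN by default: (1910 + 1977 + 2006) / 3 = 1964.33...
fill_value = round(yr.mean())

# Fill the gap with the rounded mean.
yr_filled = yr.fillna(fill_value)
print(yr_filled)
```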

In [327]:
df[df['garage_yr_blt'].isnull()].head()[garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond


In [328]:
df.shape

(878, 81)

In [332]:
df['garage_area'].head(3)

0    440
1    580
2    426
Name: garage_area, dtype: int64

In [333]:
df['garage_yr_blt'].head(3)

0    1910.0
1    1977.0
2    2006.0
Name: garage_yr_blt, dtype: float64

In [334]:
df['garage_cars'].head(3)

0    1
1    2
2    2
Name: garage_cars, dtype: int64

I'm revisiting the null list to see what's left. I came back up to this cell a few times, rather than rewriting it below, as an additional sanity check and to keep track of what's left.

Learned how to show all rows from [this site](https://stackoverflow.com/questions/46554597/how-to-display-all-rows-in-jupyter-notebook)

In [335]:
pd.options.display.max_rows = 999

In [336]:
pd.Series(df.isnull().sum())

id                   0
pid                  0
ms_subclass          0
ms_zoning            0
lot_frontage       160
lot_area             0
street               0
alley                0
lot_shape            0
land_contour         0
utilities            0
lot_config           0
land_slope           0
neighborhood         0
condition_1          0
condition_2          0
bldg_type            0
house_style          0
overall_qual         0
overall_cond         0
year_built           0
year_remod/add       0
roof_style           0
roof_matl            0
exterior_1st         0
exterior_2nd         0
mas_vnr_type         1
mas_vnr_area         1
exter_qual           0
exter_cond           0
foundation           0
bsmt_qual            0
bsmt_cond            0
bsmt_exposure        0
bsmtfin_type_1       0
bsmtfin_sf_1         0
bsmtfin_type_2       0
bsmtfin_sf_2         0
bsmt_unf_sf          0
total_bsmt_sf        0
heating              0
heating_qc           0
central_air          0
electrical 

[This site](https://datascienceparichay.com/article/show-all-columns-of-pandas-dataframe-in-jupyter-notebook/) showed me how to display all columns.

In [337]:
pd.set_option('display.max_columns', None)

In [338]:
df[df['electrical'].isnull()]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
634,1578,916386080,80,RL,73.0,9735,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,SLvl,5,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0,Unf,0,384,384,GasA,Gd,Y,,754,640,0,1394,0,0,2,1,3,1,Gd,7,Typ,0,,BuiltIn,2007.0,Fin,2,400,TA,TA,Y,100,0,0,0,0,0,,,,0,5,2008,WD,0


In [339]:
df['electrical'].value_counts()

SBrkr    813
FuseA     48
FuseF     15
FuseP      1
Name: electrical, dtype: int64

I decided to fill the null with the most common type.

In [340]:
df['electrical'] = df['electrical'].fillna('SBrkr')
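A sketch of the same fill without hardcoding the value, assuming the most common category is what's wanted. The `elec` Series is a hypothetical toy example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series with one missing electrical type.
elec = pd.Series(['SBrkr', 'FuseA', 'SBrkr', np.nan], name='electrical')

# mode() drops NaN by default and returns a Series; take the first entry.
most_common = elec.mode()[0]

elec_filled = elec.fillna(most_common)
print(elec_filled)
```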

In [341]:
df[df['electrical'].isna()]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice


In [342]:
df.columns[df.isna().any()].tolist()

['lot_frontage',
 'mas_vnr_type',
 'mas_vnr_area',
 'fireplace_qu',
 'garage_finish',
 'garage_qual',
 'garage_cond',
 'pool_qc',
 'fence',
 'misc_feature']

#### Lot Frontage

In [343]:
df['lot_frontage'].value_counts(dropna = False)

NaN      160
60.0      97
80.0      43
75.0      37
70.0      37
50.0      27
85.0      24
65.0      22
21.0      18
24.0      16
68.0      16
90.0      15
78.0      13
64.0      12
51.0      11
55.0      10
59.0       9
76.0       9
72.0       9
63.0       9
79.0       9
61.0       8
73.0       8
74.0       8
52.0       8
86.0       8
40.0       7
82.0       7
44.0       7
66.0       7
35.0       6
57.0       6
88.0       6
110.0      6
69.0       6
71.0       6
120.0      6
53.0       6
98.0       5
34.0       5
48.0       5
100.0      5
94.0       4
58.0       4
56.0       4
84.0       4
42.0       4
81.0       4
36.0       4
95.0       4
89.0       4
77.0       4
67.0       4
93.0       4
54.0       4
124.0      3
96.0       3
87.0       3
62.0       3
99.0       3
83.0       3
41.0       3
43.0       3
102.0      3
118.0      3
105.0      3
33.0       2
107.0      2
49.0       2
45.0       2
104.0      2
115.0      2
160.0      2
108.0      2
121.0      2
149.0      2
32.0       2

In [344]:
df['lot_frontage'].isna().sum()/len(df)

0.18223234624145787

18% of the data seems like a lot to drop.
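The same share can be computed for every column at once, because the mean of a boolean mask is the fraction of True values. A sketch on a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two of five lot_frontage values are missing.
toy = pd.DataFrame({
    'lot_frontage': [69.0, np.nan, 58.0, 60.0, np.nan],
    'lot_area': [9142, 9662, 17104, 8520, 9500],
})

# Fraction of missing values per column.
missing_share = toy.isna().mean()

# Format as percentages for easier reading.
print(missing_share.map('{:.0%}'.format))
```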

In [345]:
df[df['lot_frontage'] == 0.0]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice


In [346]:
df[df['lot_frontage'] < 30.0]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
5,333,923228370,160,RM,21.0,1890,Pave,,Reg,Lvl,AllPub,Inside,Gtl,MeadowV,Norm,Norm,TwnhsE,2Story,4,6,1972,1972,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,CBlock,TA,TA,No,Rec,294,Unf,0,252,546,GasA,TA,Y,SBrkr,546,546,0,1092,0,0,1,1,3,1,TA,5,Typ,0,,Attchd,1972.0,Unf,1,286,TA,TA,Y,0,0,64,0,0,0,,,,0,6,2010,WD,0
45,979,923228150,160,RM,21.0,1533,Pave,,Reg,Lvl,AllPub,Inside,Gtl,MeadowV,Norm,Norm,Twnhs,2Story,4,6,1970,2008,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,CBlock,TA,TA,No,Unf,0,Unf,0,546,546,GasA,TA,Y,SBrkr,798,546,0,1344,0,0,1,1,3,1,TA,6,Typ,1,TA,No Garage,1977.0,,0,0,,,Y,0,0,0,0,0,0,,,,0,5,2009,WD,0
117,1679,527451020,160,RM,24.0,2016,Pave,,Reg,Lvl,AllPub,Inside,Gtl,BrDale,Norm,Norm,TwnhsE,2Story,5,5,1970,1970,Gable,CompShg,HdBoard,HdBoard,BrkFace,304.0,TA,TA,CBlock,TA,TA,No,Unf,0,Unf,0,630,630,GasA,TA,Y,SBrkr,630,672,0,1302,0,0,2,1,3,1,TA,6,Typ,0,,Detchd,1970.0,Unf,2,440,TA,TA,Y,0,0,0,0,0,0,,,,0,4,2007,WD,0
134,1167,533213120,160,FV,24.0,2280,Pave,Pave,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,Twnhs,2Story,6,5,1999,1999,Gable,CompShg,MetalSd,MetalSd,Stone,216.0,TA,TA,PConc,Gd,TA,No,GLQ,550,Unf,0,194,744,GasA,Gd,Y,SBrkr,757,792,0,1549,1,0,2,1,3,1,TA,6,Typ,0,,Detchd,1999.0,Unf,2,440,TA,TA,Y,0,32,0,0,0,0,,,,0,4,2008,WD,0
156,332,923228270,160,RM,21.0,1900,Pave,,Reg,Lvl,AllPub,Inside,Gtl,MeadowV,Norm,Norm,TwnhsE,2Story,4,4,1970,1970,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,CBlock,TA,TA,No,Unf,0,Unf,0,546,546,GasA,Ex,Y,SBrkr,546,546,0,1092,0,0,1,1,3,1,TA,5,Typ,0,,No Garage,1977.0,,0,0,,,Y,0,0,0,0,0,0,,,,0,6,2010,WD,0
173,2509,533212120,160,FV,24.0,2544,Pave,Pave,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,Twnhs,2Story,7,5,2005,2005,Gable,CompShg,MetalSd,MetalSd,,0.0,Gd,TA,PConc,Gd,TA,No,Unf,0,Unf,0,600,600,GasA,Ex,Y,SBrkr,520,623,80,1223,0,0,2,1,2,1,Gd,4,Typ,0,,Detchd,2005.0,RFn,2,480,TA,TA,Y,0,166,0,0,0,0,,,,0,7,2006,WD,0
178,2921,923228310,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,Inside,Gtl,MeadowV,Norm,Norm,TwnhsE,2Story,4,5,1970,1970,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,CBlock,TA,TA,No,Rec,252,Unf,0,294,546,GasA,TA,Y,SBrkr,546,546,0,1092,0,0,1,1,3,1,TA,6,Typ,0,,CarPort,1970.0,Unf,1,286,TA,TA,Y,0,24,0,0,0,0,,,,0,4,2006,WD,0
206,2378,527455100,160,RL,24.0,2179,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NPkVill,Norm,Norm,Twnhs,2Story,6,5,1976,1976,Gable,CompShg,Plywood,Brk Cmn,,0.0,TA,TA,CBlock,Gd,TA,No,ALQ,70,Unf,0,785,855,GasA,Gd,Y,SBrkr,855,601,0,1456,0,0,2,1,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,0,28,0,0,0,0,,,,0,7,2006,WD,0
231,331,923226320,180,RM,21.0,1491,Pave,,Reg,Lvl,AllPub,Inside,Gtl,MeadowV,Norm,Norm,TwnhsE,SFoyer,4,6,1972,1972,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,CBlock,Gd,TA,Av,LwQ,150,GLQ,480,0,630,GasA,Ex,Y,SBrkr,630,0,0,630,1,0,1,0,1,1,TA,3,Typ,0,,No Garage,1977.0,,0,0,,,Y,96,24,0,0,0,0,,,,0,5,2010,WD,0
262,1677,527450150,160,RM,21.0,1890,Pave,,Reg,Lvl,AllPub,Inside,Gtl,BrDale,Norm,Norm,Twnhs,2Story,6,5,1973,1973,Gable,CompShg,HdBoard,HdBoard,BrkFace,285.0,TA,TA,CBlock,TA,TA,No,BLQ,356,Unf,0,316,672,GasA,TA,Y,SBrkr,672,546,0,1218,0,0,1,1,3,1,TA,7,Typ,0,,Detchd,1973.0,Unf,1,264,TA,TA,Y,144,28,0,0,0,0,,,,0,5,2007,WD,0


It seems possible that NaN means no frontage, because no lot has a frontage of 0.0. In fact, no lot has less than 21.0 feet of frontage. From reading a little about "flag lots" [here](https://www.city-data.com/forum/real-estate/1735402-value-land-without-road-frontage.html), it seems plausible that's what the NaN lots are. While 21 feet seems a little small from that reading, the smaller frontages may represent lots that include a path to the road within the property (as opposed to an easement across someone else's property). For now, I'm going to cast the NaN values as 0.0.

In [347]:
df['lot_frontage'] = df['lot_frontage'].fillna(0.0)

df['lot_frontage'].isna().sum()

0
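The reasoning above rests on two checks: that no lot has an explicit 0.0 frontage, and that the smallest recorded frontage is 21.0. A minimal sketch of those checks, using a small made-up frame (`toy` is a stand-in for `df`, and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real column; these values are made up for illustration.
toy = pd.DataFrame({'lot_frontage': [21.0, 24.0, np.nan, 70.0, np.nan]})

# No explicit zeros, so 0.0 is free to act as the "no frontage" marker.
n_zero = int((toy['lot_frontage'] == 0).sum())

# Smallest recorded frontage; min() skips NaN by default.
min_frontage = toy['lot_frontage'].min()

# Apply the same fill as the cell above.
toy['lot_frontage'] = toy['lot_frontage'].fillna(0.0)
n_filled = int((toy['lot_frontage'] == 0).sum())
```

Running the two checks before the fill matters, because afterwards the filled rows are indistinguishable from any genuine 0.0 values.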

### Fireplace Quality

The dictionary spells out that NA == no fireplace.

In [348]:
df['fireplace_qu'] = df['fireplace_qu'].fillna('No Fireplace')

df['fireplace_qu'].isna().sum()

0

### Pool Quality

The dictionary spells out that NA == no pool.

In [349]:
df['pool_qc'] = df['pool_qc'].fillna('No Pool')

df['pool_qc'].isna().sum()

0

### Fence

The dictionary spells out that NA == no fence.

In [350]:
df['fence'] = df['fence'].fillna('No Fence')

df['fence'].isna().sum()

0

### Misc Features

The dictionary spells out that NA == no additional features.

In [351]:
df['misc_feature'] = df['misc_feature'].fillna('No Addl Features')
df['misc_feature'].isna().sum()

0
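The four "NA means the feature is absent" fills above follow the same pattern, so they could also be done in one pass. This is a sketch on a made-up frame (`toy` and `fill_labels` are names introduced here; the fill labels are the ones used in the cells above):

```python
import numpy as np
import pandas as pd

# Fill labels copied from the cells above.
fill_labels = {
    'fireplace_qu': 'No Fireplace',
    'pool_qc': 'No Pool',
    'fence': 'No Fence',
    'misc_feature': 'No Addl Features',
}

# Toy frame standing in for df; values are illustrative only.
toy = pd.DataFrame({
    'fireplace_qu': ['TA', np.nan],
    'pool_qc': [np.nan, 'Ex'],
    'fence': [np.nan, 'MnPrv'],
    'misc_feature': ['Shed', np.nan],
})

# DataFrame.fillna accepts a dict mapping column name -> fill value.
toy = toy.fillna(value=fill_labels)

# Count remaining nulls per filled column, looked up by name.
remaining = toy[list(fill_labels)].isna().sum()
```

Checking `remaining` for every key in the dictionary means each column is verified under its own name.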

In [352]:
df.shape

(878, 81)

### Masonry Veneer Type

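This note is carried over from the original cleaning notebook: I examined these initially and wasn't sure how to handle them. In the training data I leaned toward dropping the 22 NaN rows, since it's clear None is its own category and it's hard to know how the NaN values should be interpreted. With only 22 instances, I decided to drop them there. The test data has only 1 NaN, as the counts below show.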

In [353]:
df['mas_vnr_type'].value_counts(dropna = False)

None       534
BrkFace    250
Stone       80
BrkCmn      12
CBlock       1
NaN          1
Name: mas_vnr_type, dtype: int64

In [354]:
df[df['mas_vnr_type'].isna()]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
865,868,907260030,60,RL,70.0,8749,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2002,2002,Gable,CompShg,VinylSd,VinylSd,,,Gd,TA,PConc,Gd,TA,No,Unf,0,Unf,0,840,840,GasA,Ex,Y,SBrkr,840,885,0,1725,0,0,2,1,3,1,Gd,6,Typ,0,No Fireplace,Attchd,2002.0,RFn,2,550,TA,TA,Y,0,48,0,0,0,0,No Pool,No Fence,No Addl Features,0,11,2009,WD,0


### Masonry Veneer Area

In [355]:
df['mas_vnr_area'].value_counts(dropna = False)

0.0       532
216.0       7
80.0        5
420.0       5
196.0       5
340.0       4
120.0       4
144.0       4
176.0       3
456.0       3
88.0        3
149.0       3
306.0       3
194.0       3
302.0       3
240.0       3
128.0       3
285.0       3
270.0       3
50.0        3
200.0       3
198.0       3
90.0        3
182.0       3
180.0       3
286.0       2
20.0        2
99.0        2
150.0       2
174.0       2
108.0       2
260.0       2
123.0       2
280.0       2
16.0        2
130.0       2
205.0       2
147.0       2
45.0        2
268.0       2
82.0        2
70.0        2
352.0       2
166.0       2
53.0        2
161.0       2
206.0       2
320.0       2
226.0       2
350.0       2
256.0       2
162.0       2
300.0       2
265.0       2
76.0        2
106.0       2
14.0        2
169.0       2
164.0       2
44.0        2
156.0       2
104.0       2
246.0       2
72.0        2
621.0       2
423.0       2
360.0       2
450.0       2
252.0       2
98.0        2
232.0       2
100.0 

In [356]:
len(df[(df['mas_vnr_type'].isna()) & (df['mas_vnr_area'].isna())])

1

The nulls from Masonry Veneer Type and Area are the same ones.
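One way to confirm that the two columns are null on exactly the same rows is to compare their null masks directly. A minimal sketch on made-up data (`toy` is a stand-in for `df`):

```python
import numpy as np
import pandas as pd

# Toy frame; values are illustrative only.
toy = pd.DataFrame({
    'mas_vnr_type': ['None', 'BrkFace', np.nan, 'Stone'],
    'mas_vnr_area': [0.0, 304.0, np.nan, 216.0],
})

type_null = toy['mas_vnr_type'].isna()
area_null = toy['mas_vnr_area'].isna()

# Rows null in both columns versus rows null in at least one.
both = int((type_null & area_null).sum())
either = int((type_null | area_null).sum())

# The masks match exactly when no row is null in only one of the columns.
same_rows = type_null.equals(area_null)
```

If `both` and `either` differ, some row is missing only one of the two values and would need a separate decision.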

I dropped the rows that had null values when I cleaned my training dataset, but I can't do that here without messing up the Kaggle data, which needs 878 rows. Instead, I'm going to assume the null means there is no masonry veneer, so the type is 'None' and the area is 0.

In [357]:
df['mas_vnr_type'] = df['mas_vnr_type'].fillna('None')
df['mas_vnr_area'] = df['mas_vnr_area'].fillna(0.0)

In [358]:
# this is an artifact of the original cleaning that removes a row, throwing off the count of rows
# for the kaggle competition

# df.dropna(subset = 'mas_vnr_type', inplace = True)

In [359]:
len(df[(df['mas_vnr_type'].isna()) & (df['mas_vnr_area'].isna())])

0

In [360]:
df.shape

(878, 81)
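Since the Kaggle submission needs all 878 rows, a small guard can make an accidental row drop fail loudly instead of relying on reading `df.shape` by eye. This is a sketch only: `check_row_count` and `EXPECTED_ROWS` are hypothetical names introduced here, and `toy` is a made-up frame.

```python
import numpy as np
import pandas as pd

EXPECTED_ROWS = 878  # row count reported by df.shape in this notebook


def check_row_count(frame, expected=EXPECTED_ROWS):
    """Raise if a cleaning step changed the number of rows (hypothetical helper)."""
    if len(frame) != expected:
        raise ValueError(f"expected {expected} rows, found {len(frame)}")
    return frame


# Toy frame with the same number of rows as the Kaggle test set.
toy = pd.DataFrame({'id': range(EXPECTED_ROWS), 'mas_vnr_type': 'None'})
toy.loc[0, 'mas_vnr_type'] = np.nan  # one missing value, as in the real data

# Filling keeps every row, so the guard passes.
filled = check_row_count(toy.fillna({'mas_vnr_type': 'None'}))

# Dropping loses a row; check_row_count(dropped) would raise ValueError.
dropped = toy.dropna(subset=['mas_vnr_type'])
```

Calling the guard once right before writing the CSV is usually enough to catch a stray `dropna`.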

In [363]:
df.dtypes

id                   int64
pid                  int64
ms_subclass          int64
ms_zoning           object
lot_frontage       float64
lot_area             int64
street              object
alley               object
lot_shape           object
land_contour        object
utilities           object
lot_config          object
land_slope          object
neighborhood        object
condition_1         object
condition_2         object
bldg_type           object
house_style         object
overall_qual         int64
overall_cond         int64
year_built           int64
year_remod/add       int64
roof_style          object
roof_matl           object
exterior_1st        object
exterior_2nd        object
mas_vnr_type        object
mas_vnr_area       float64
exter_qual          object
exter_cond          object
foundation          object
bsmt_qual           object
bsmt_cond           object
bsmt_exposure       object
bsmtfin_type_1      object
bsmtfin_sf_1         int64
bsmtfin_type_2      object
b

In [361]:
df.to_csv('datasets/draft1_cleaned_kaggle_test.csv', index = False)
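One thing to watch when this file is read back in: depending on the pandas version, `read_csv` may treat the literal string `None` as a missing-value marker, which would turn the `'None'` masonry category back into NaN. A sketch of a round trip that keeps it as text, using an in-memory buffer and made-up data rather than the real file:

```python
import io

import pandas as pd

# Toy frame; values are illustrative only.
toy = pd.DataFrame({
    'mas_vnr_type': ['None', 'BrkFace'],
    'pool_qc': ['No Pool', 'No Pool'],
})

# Write to an in-memory buffer instead of disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Treat only empty fields as missing, so 'None' survives as a string.
reloaded = pd.read_csv(buffer, keep_default_na=False, na_values=[''])
```

The same two arguments can be passed when loading `draft1_cleaned_kaggle_test.csv`. Whether they are needed depends on the installed pandas version, so checking `value_counts(dropna=False)` after loading is a cheap safeguard.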