# Prepping the Test DataFrame
The purpose of this notebook is to prepare the Test dataframe from test.csv (the one to be used for the Kaggle competition). This is a duplicate of the 'initial cleaning' notebook designed to get the Kaggle Test dataframe to the point where the features match the cleaned Training dataframe.

I've left the notes from the original cleaning notebook to make it easier to follow the workflow.

In [166]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [493]:
df = pd.read_csv('./datasets/test.csv')

In [494]:
df.shape

(878, 80)

I'm adding in a 'saleprice' column with values == 0 to be sure that I can concatenate the final Kaggle Test dataframe with my training dataframe before creating dummy columns.

In [495]:
df['saleprice'] = 0
df['saleprice'].value_counts()

0    878
Name: saleprice, dtype: int64
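
The reason for the placeholder column is easier to see on a small example. The following is a minimal sketch with made-up toy frames (the `train` and `test` names and all values are assumptions for illustration, not the competition data). It shows that stacking the two frames before calling `pd.get_dummies` gives the test rows a dummy column for a category that only appears in training.

```python
import pandas as pd

# Toy stand-ins for the cleaned training frame and the Kaggle test frame.
train = pd.DataFrame({'street': ['Pave', 'Grvl'], 'saleprice': [200000, 150000]})
test = pd.DataFrame({'street': ['Pave', 'Pave']})

# Placeholder target so both frames have the same columns.
test['saleprice'] = 0

# Stack the frames, encode once, then split them apart again by position.
combined = pd.concat([train, test], ignore_index=True)
dummies = pd.get_dummies(combined, columns=['street'])
train_d = dummies.iloc[:len(train)]
test_d = dummies.iloc[len(train):].drop(columns='saleprice')

# The test rows keep a 'street_Grvl' column even though no test row is 'Grvl'.
print(list(test_d.columns))
```

Encoding the frames separately would leave the toy test frame without a `street_Grvl` column, which is the mismatch the placeholder approach is meant to avoid.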

In [496]:
df.isnull().sum()

Id                0
PID               0
MS SubClass       0
MS Zoning         0
Lot Frontage    160
               ... 
Misc Val          0
Mo Sold           0
Yr Sold           0
Sale Type         0
saleprice         0
Length: 81, dtype: int64

Because there are so many columns, I wanted to identify only the columns that contain null values. I experimented and wasn't having any luck, so I did some research: this [stackoverflow](https://stackoverflow.com/questions/53137100/filter-pandas-dataframe-columns-with-null-data) showed me the approach I used below.

In [497]:
df.columns[df.isna().any()].tolist()

['Lot Frontage',
 'Alley',
 'Mas Vnr Type',
 'Mas Vnr Area',
 'Bsmt Qual',
 'Bsmt Cond',
 'Bsmt Exposure',
 'BsmtFin Type 1',
 'BsmtFin Type 2',
 'Electrical',
 'Fireplace Qu',
 'Garage Type',
 'Garage Yr Blt',
 'Garage Finish',
 'Garage Qual',
 'Garage Cond',
 'Pool QC',
 'Fence',
 'Misc Feature']
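
If the counts are wanted alongside the column names, one possible variation is to filter the result of `isna().sum()`. This is a sketch on a toy frame (the `toy` name and its values are made up), not a replacement for the approach above.

```python
import numpy as np
import pandas as pd

# Toy frame: 'a' has no nulls, 'b' has two, 'c' has one.
toy = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [np.nan, 'x', np.nan],
    'c': [np.nan, 1.0, 2.0],
})

# Count nulls per column, keep only columns that have any, largest first.
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```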

In [498]:
df.columns = df.columns.str.replace(' ', '_').str.lower()

In [499]:
df.head(2)

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,...,0,0,,,,0,4,2006,WD,0
1,2718,905108090,90,RL,,9662,Pave,,IR1,Lvl,...,0,0,,,,0,8,2006,WD,0


### Sanity Check From EDA + Further Cleaning Notebook

I'm returning to this point to be certain that I didn't accidentally create any of the oddities I'm finding during EDA.

In [500]:
df['street'].value_counts()

Pave    873
Grvl      5
Name: street, dtype: int64

## Data Dictionary

I reference the [data dictionary](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt) a lot in the following.

### Alley - NaN --> 'None'

In [501]:
df['alley'].value_counts(dropna = False)

NaN     820
Grvl     35
Pave     23
Name: alley, dtype: int64

It appears that NaN means there is no alley. I'm going to replace NaN with 'None' for now. 

In [502]:
df['alley'] = df['alley'].fillna('None')

df['alley'].value_counts(dropna = False)

None    820
Grvl     35
Pave     23
Name: alley, dtype: int64

### Basements!

I created the for-loop below to count the NAs in each basement column and help me decide how to deal with them, since there's presumably at least some relationship between them. Rather than copying the code over and over again, I also returned to the for-loop as my sanity check while I cleaned the basement columns.

In [503]:
bsmt_list = df.columns[df.columns.str.contains('bsmt')].tolist()

bsmt_list

['bsmt_qual',
 'bsmt_cond',
 'bsmt_exposure',
 'bsmtfin_type_1',
 'bsmtfin_sf_1',
 'bsmtfin_type_2',
 'bsmtfin_sf_2',
 'bsmt_unf_sf',
 'total_bsmt_sf',
 'bsmt_full_bath',
 'bsmt_half_bath']

In [504]:
for i in bsmt_list:
    print(f'{i} has {df[i].isnull().sum()}')

bsmt_qual has 25
bsmt_cond has 25
bsmt_exposure has 25
bsmtfin_type_1 has 25
bsmtfin_sf_1 has 0
bsmtfin_type_2 has 25
bsmtfin_sf_2 has 0
bsmt_unf_sf has 0
total_bsmt_sf has 0
bsmt_full_bath has 0
bsmt_half_bath has 0
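
The same per-column counts can also be produced without an explicit loop. A minimal sketch, assuming a toy frame with made-up values, that uses `DataFrame.filter(like=...)` to select columns by substring:

```python
import numpy as np
import pandas as pd

# Toy frame with two basement columns and one unrelated column.
toy = pd.DataFrame({
    'bsmt_qual': ['TA', np.nan, 'Gd'],
    'total_bsmt_sf': [800.0, 0.0, 950.0],
    'garage_type': ['Attchd', np.nan, np.nan],
})

# Select every column whose name contains 'bsmt', then count nulls in each.
bsmt_nulls = toy.filter(like='bsmt').isna().sum()
print(bsmt_nulls)
```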


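I was surprised by the variation in the number of NaNs. Knowing that some of the NaNs actually represent "No Basement", I thought there'd be a pretty consistent number. (This note is carried over from the training notebook; in the test data the five categorical basement columns each have 25 NaNs and the numeric ones have none.)

I decided to start by looking at the columns with the fewest NaNs to see what's going on.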

**I needed to find where I accidentally dropped a column, so I did a shape check and will do a few more.**

In [505]:
df.shape

(878, 81)

### Basement Full Bath and Half Bath

In [506]:
df[df['bsmt_full_bath'].isna()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


In [507]:
df['bsmt_full_bath'].value_counts(dropna = False)

0    507
1    356
2     15
Name: bsmt_full_bath, dtype: int64

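I ran the two cells above to get a sense of what was going on with these instances. In the training data these NaNs appeared to be 0.0, reflecting no basement. The test data has no NaNs in this column, so the fill below changes nothing and is kept only to mirror the original workflow. The same is true for the half baths, which I checked below.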

In [508]:
df['bsmt_full_bath'] = df['bsmt_full_bath'].fillna(0.0)

In [509]:
df['bsmt_full_bath'].value_counts(dropna = False)

0    507
1    356
2     15
Name: bsmt_full_bath, dtype: int64

In [510]:
df[df['bsmt_half_bath'].isna()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


In [511]:
df['bsmt_half_bath'] = df['bsmt_half_bath'].fillna(0.0)

In [512]:
df['bsmt_half_bath'].value_counts(dropna = False)

0    829
1     49
Name: bsmt_half_bath, dtype: int64

### Basement Sq Ft, Finished (both types) & Unfinished, Total

In [513]:
df[df['bsmt_unf_sf'].isna()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


Having looked at the values in other instances, it appears this should also be 0.0, for no basement.

In [514]:
df['bsmt_unf_sf'] = df['bsmt_unf_sf'].fillna(0.0)
df['bsmt_unf_sf'].isnull().sum()

0

In [515]:
df[df['bsmtfin_sf_1'].isnull()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


In [516]:
df['bsmtfin_sf_1'] = df['bsmtfin_sf_1'].fillna(0.0)
df['bsmtfin_sf_1'].isnull().sum()

0

In [517]:
df[df['bsmtfin_sf_2'].isnull()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


In [518]:
df['bsmtfin_sf_2'] = df['bsmtfin_sf_2'].fillna(0.0)
df['bsmtfin_sf_2'].isnull().sum()

0

In [519]:
df[df['total_bsmt_sf'].isnull()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


In [520]:
df['total_bsmt_sf'] = df['total_bsmt_sf'].fillna(0.0)
df['total_bsmt_sf'].isnull().sum()

0

**still have 878 rows**

In [521]:
df.shape

(878, 81)

### Basement Quality

The dictionary specifies that 'NA' is 'No Basement', so I'm going to cast it as such. This appears to be true for all the categorical basement elements.

In [522]:
df['bsmt_qual'].value_counts(dropna = False)

TA     396
Gd     355
Ex      73
Fa      28
NaN     25
Po       1
Name: bsmt_qual, dtype: int64

In [523]:
df['bsmt_qual'] = df['bsmt_qual'].fillna('No Basement')

In [524]:
df['bsmt_qual'].value_counts(dropna = False)

TA             396
Gd             355
Ex              73
Fa              28
No Basement     25
Po               1
Name: bsmt_qual, dtype: int64

### Basement Condition

The data dictionary says that 'NA' means 'No Basement', so I'm making that explicit.

In [525]:
df['bsmt_cond'] = df['bsmt_cond'].fillna('No Basement')
df['bsmt_cond'].isnull().sum()

0

In [526]:
df['bsmt_cond'].value_counts()

TA             781
Fa              39
Gd              33
No Basement     25
Name: bsmt_cond, dtype: int64

### Basement Exposure
The dictionary says 'NA' means 'No Basement', so I'm making that explicit.

In [527]:
df['bsmt_exposure'] = df['bsmt_exposure'].fillna('No Basement')
df['bsmt_exposure'].isna().sum()

0

In [528]:
df['bsmt_exposure'].value_counts()

No             567
Av             130
Gd              80
Mn              76
No Basement     25
Name: bsmt_exposure, dtype: int64

### Basement Finish Type 1 and 2
The dictionary says 'NA' means 'No Basement', so I'm making that explicit. In the training data there was one more null in Finish Type 2 than in Type 1, so I looked into that before making the change to 'No Basement'.

In [529]:
df['bsmtfin_type_1'] = df['bsmtfin_type_1'].fillna('No Basement')
df['bsmtfin_type_1'].value_counts()

Unf            248
GLQ            243
ALQ            136
Rec            105
BLQ             69
LwQ             52
No Basement     25
Name: bsmtfin_type_1, dtype: int64

In [530]:
df[(df['bsmtfin_type_2'].isnull()) & (df['bsmtfin_type_1'] != 'No Basement')][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


In the original (training) dataframe there's one instance with a null 'bsmtfin_type_2' that clearly has a basement based on the other entries. Since I have no way of speculating what type that should be, I deleted that instance. The test data has no such row (the filter above returns nothing), so the drop below removes nothing here.

In [531]:
df.drop(index = df[(df['bsmtfin_type_2'].isnull()) & (df['bsmtfin_type_1'] != 'No Basement')].index, inplace = True)

In [532]:
df[(df['bsmtfin_type_2'].isnull()) & (df['bsmtfin_type_1'] != 'No Basement')][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


Now making the change to 'No Basement' on Finish Type 2.

In [533]:
df['bsmtfin_type_2'] = df['bsmtfin_type_2'].fillna('No Basement')
df['bsmtfin_type_2'].value_counts()

Unf            749
LwQ             29
Rec             26
No Basement     25
BLQ             20
ALQ             18
GLQ             11
Name: bsmtfin_type_2, dtype: int64

Checking in on which columns still have NAs.

In [534]:
df.columns[df.isnull().any()].tolist()

['lot_frontage',
 'mas_vnr_type',
 'mas_vnr_area',
 'electrical',
 'fireplace_qu',
 'garage_type',
 'garage_yr_blt',
 'garage_finish',
 'garage_qual',
 'garage_cond',
 'pool_qc',
 'fence',
 'misc_feature']

### Garage Columns

In [535]:
garage_list = df.columns[df.columns.str.contains('garage')].tolist()
garage_list

['garage_type',
 'garage_yr_blt',
 'garage_finish',
 'garage_cars',
 'garage_area',
 'garage_qual',
 'garage_cond']

Again, I recycled the below for-loop as a final sanity check, returning to this line repeatedly as I worked, rather than rewriting the loop for every step I took.

In [536]:
for i in garage_list:
    print(f'{i} has {df[i].isnull().sum()} nulls')

garage_type has 44 nulls
garage_yr_blt has 45 nulls
garage_finish has 45 nulls
garage_cars has 0 nulls
garage_area has 0 nulls
garage_qual has 45 nulls
garage_cond has 45 nulls


I did a search to find the not null method for the following, which turned up the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notnull.html).

In [537]:
df[(df['garage_yr_blt'].isnull()) & (df['garage_type'].notnull())][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
764,Detchd,,,1,360,,


A garage_type is listed for this one instance, but several of the other garage columns are NaN, which the dictionary says means 'No Garage'. I'm going to operate on the assumption that the garage type is in error here and change it to 'No Garage'.

In [538]:
df.reset_index()

Unnamed: 0,index,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,...,0,0,,,,0,4,2006,WD,0
1,1,2718,905108090,90,RL,,9662,Pave,,IR1,...,0,0,,,,0,8,2006,WD,0
2,2,2414,528218130,60,RL,58.0,17104,Pave,,IR1,...,0,0,,,,0,9,2006,New,0
3,3,1989,902207150,30,RM,60.0,8520,Pave,,Reg,...,0,0,,,,0,7,2007,WD,0
4,4,625,535105100,20,RL,,9500,Pave,,IR1,...,185,0,,,,0,7,2009,WD,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
873,873,1662,527377110,60,RL,80.0,8000,Pave,,Reg,...,0,0,,,,0,11,2007,WD,0
874,874,1234,535126140,60,RL,90.0,14670,Pave,,Reg,...,0,0,,MnPrv,,0,8,2008,WD,0
875,875,1373,904100040,20,RL,55.0,8250,Pave,,Reg,...,0,0,,,,0,8,2008,WD,0
876,876,1672,527425140,20,RL,60.0,9000,Pave,,Reg,...,0,0,,GdWo,,0,5,2007,WD,0


In [539]:
garage_index = df[(df['garage_yr_blt'].isnull()) & (df['garage_type'].notnull())].index

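# Note: chained indexing plus .map() returns a new Series and does not modify df,
# so garage_type for this row is still 'Detchd' in the output below.
df.iloc[garage_index]['garage_type'].map({'garage_type': 'No Basement'})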

df.iloc[garage_index]['garage_type']

764    Detchd
Name: garage_type, dtype: object
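
The output above still shows 'Detchd' because the chained expression works on a temporary copy. The following is a minimal sketch, on a toy frame with made-up values, of how a single `.loc` assignment with a boolean mask would write the change into the frame itself. Whether the real row should be overwritten is a separate judgment call, since it also reports one garage car and 360 sq ft of garage area.

```python
import numpy as np
import pandas as pd

# Toy frame with a non-default index, so positions and labels differ.
toy = pd.DataFrame(
    {
        'garage_type': ['Attchd', 'Detchd', np.nan],
        'garage_yr_blt': [1990.0, np.nan, np.nan],
    },
    index=[10, 20, 30],
)

# Rows with a garage type but no year built.
mask = toy['garage_yr_blt'].isna() & toy['garage_type'].notna()

# Chained indexing builds a temporary object; toy itself is unchanged.
toy[mask]['garage_type'].map({'Detchd': 'No Garage'})
assert toy.loc[20, 'garage_type'] == 'Detchd'

# One .loc call with the mask assigns into toy directly.
toy.loc[mask, 'garage_type'] = 'No Garage'
print(toy)
```

Using the mask with `.loc` also avoids passing index labels to `.iloc`, which only lines up when the index is the default 0..n-1 range.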

In [540]:
df.shape

(878, 81)

In [541]:
garage_index.astype(int)

Int64Index([764], dtype='int64')

In [542]:
df.shape

(878, 81)

This code was left over from the training notebook. Index 1712 doesn't exist in this dataframe, so running it added a row; I've commented it out.

In [543]:
# df.at[1712, 'garage_type'] = 'No Garage'
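
A minimal sketch of why a label-based assignment can add a row, using `.loc` on a toy frame (the toy values are made up, and the label 1712 is reused only for illustration). Assigning to a label that is not in the index enlarges the frame, so a membership check is one way to guard against it.

```python
import pandas as pd

# Toy frame with index labels 0 and 1 only.
toy = pd.DataFrame({'garage_type': ['Attchd', 'Detchd']})

label = 1712  # a label carried over from another dataframe

# Guarded version: only assign when the label actually exists.
if label in toy.index:
    toy.loc[label, 'garage_type'] = 'No Garage'
rows_after_guard = len(toy)

# Unguarded version: .loc assignment to a missing label appends a new row.
enlarged = toy.copy()
enlarged.loc[label, 'garage_type'] = 'No Garage'

print(rows_after_guard, len(enlarged))
```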

In [544]:
df.shape

(878, 81)

In [545]:
df[(df['garage_yr_blt'].isnull()) & (df['garage_type'].notnull())][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
764,Detchd,,,1,360,,


In [546]:
df.iloc[garage_index][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
764,Detchd,,,1,360,,


In [547]:
df[(df['garage_yr_blt'].isnull()) & (df['garage_type'].notnull())][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
764,Detchd,,,1,360,,


In [548]:
df['garage_type'] = df['garage_type'].fillna('No Garage')
   
df['garage_type'].isnull().sum()

0

In [549]:
df.shape

(878, 81)

The dictionary spells out the below "no_garage_list" columns as NA = No Garage. This is the first place I copied the df because I wanted a backup in case I made a mistake filling the NAs in multiple columns in one block, as it's the first time I've done it that way. When I succeeded, I moved the copy_df code to afterwards to save it before my next experiment.

In [550]:
no_garage_list = ['garage_finish', 'garage_qual', 'garage_cond']

df[no_garage_list] = df[no_garage_list].fillna('No Garage')
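
# A possible variation on the multi-column fill above, sketched on a toy frame with made-up values: `fillna` also accepts a dict that maps column names to fill values, which allows a different label per column in one call.

```python
import numpy as np
import pandas as pd

# Toy frame: two categorical columns with NaNs and one numeric column.
toy = pd.DataFrame({
    'garage_finish': ['Fin', np.nan],
    'pool_qc': [np.nan, 'Gd'],
    'lot_area': [9000, 8500],
})

# Column name -> label to use where that column is NaN.
fill_values = {'garage_finish': 'No Garage', 'pool_qc': 'No Pool'}

# Columns not named in the dict are left untouched.
toy = toy.fillna(value=fill_values)
print(toy)
```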

In [551]:
copy_df = df.copy()

In [552]:
df[df['garage_cars'].isnull()][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond


In [553]:
no_garage_list2 = ['garage_yr_blt', 'garage_cars', 'garage_area']

#this was an artifact of the original cleaning that would create a new row
#df.at[1712, no_garage_list2] = 'No Garage'

In [554]:
df.shape

(878, 81)

In [555]:
df[df['garage_cars'].isnull()][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond


I've got to figure out what's going on with 'garage_yr_blt' because it seems possible some of these NAs may be for things other than "No Garage".

In [556]:
df[df['garage_yr_blt'].isnull()].head()[garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
29,No Garage,,No Garage,0,0,No Garage,No Garage
45,No Garage,,No Garage,0,0,No Garage,No Garage
66,No Garage,,No Garage,0,0,No Garage,No Garage
68,No Garage,,No Garage,0,0,No Garage,No Garage
105,No Garage,,No Garage,0,0,No Garage,No Garage


In [557]:
df[(df['garage_yr_blt'].isnull()) & (df['garage_finish'] != 'No Garage')]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice


It appears all the rows where garage_yr_blt is NaN correspond to No Garage.

In [558]:
df['garage_yr_blt'] = df['garage_yr_blt'].fillna('No Garage')

In [559]:
df[df['garage_yr_blt'].isnull()].head()[garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond


In [560]:
df.shape

(878, 81)

I'm revisiting the null list to see what's left. I came back up to this one a few times, rather than rewriting it below, for additional sanity-check and to keep track of what's left.

In [561]:
df.columns[df.isna().any()].tolist()

['lot_frontage',
 'mas_vnr_type',
 'mas_vnr_area',
 'electrical',
 'fireplace_qu',
 'pool_qc',
 'fence',
 'misc_feature']

### Lot Frontage

In [562]:
df['lot_frontage'].value_counts(dropna = False)

NaN      160
60.0      97
80.0      43
75.0      37
70.0      37
        ... 
168.0      1
31.0       1
150.0      1
122.0      1
182.0      1
Name: lot_frontage, Length: 105, dtype: int64

In [563]:
df['lot_frontage'].isna().sum()/len(df)

0.18223234624145787

18% of the data seems like a lot to drop.
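
As a small aside, the missing share can also be computed with `isna().mean()`, since the mean of a boolean Series is the fraction of True values. A sketch on a toy column with made-up values:

```python
import numpy as np
import pandas as pd

# Toy column: 2 of 5 values are missing.
toy = pd.DataFrame({'lot_frontage': [60.0, np.nan, 80.0, np.nan, 75.0]})

# Mean of the boolean mask is the fraction of missing values.
share_missing = toy['lot_frontage'].isna().mean()
print(f'{share_missing:.1%}')
```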

In [564]:
df[df['lot_frontage'] == 0.0]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice


In [565]:
df[df['lot_frontage'] < 30.0]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
5,333,923228370,160,RM,21.0,1890,Pave,,Reg,Lvl,...,0,0,,,,0,6,2010,WD,0
45,979,923228150,160,RM,21.0,1533,Pave,,Reg,Lvl,...,0,0,,,,0,5,2009,WD,0
117,1679,527451020,160,RM,24.0,2016,Pave,,Reg,Lvl,...,0,0,,,,0,4,2007,WD,0
134,1167,533213120,160,FV,24.0,2280,Pave,Pave,Reg,Lvl,...,0,0,,,,0,4,2008,WD,0
156,332,923228270,160,RM,21.0,1900,Pave,,Reg,Lvl,...,0,0,,,,0,6,2010,WD,0
173,2509,533212120,160,FV,24.0,2544,Pave,Pave,Reg,Lvl,...,0,0,,,,0,7,2006,WD,0
178,2921,923228310,160,RM,21.0,1894,Pave,,Reg,Lvl,...,0,0,,,,0,4,2006,WD,0
206,2378,527455100,160,RL,24.0,2179,Pave,,Reg,Lvl,...,0,0,,,,0,7,2006,WD,0
231,331,923226320,180,RM,21.0,1491,Pave,,Reg,Lvl,...,0,0,,,,0,5,2010,WD,0
262,1677,527450150,160,RM,21.0,1890,Pave,,Reg,Lvl,...,0,0,,,,0,5,2007,WD,0


It appears possible that NaN means no frontage, because there are no lots with 0.0 frontage. In fact, there's nothing less than 21.0 feet of frontage. From reading a little about "flag lots" [here](https://www.city-data.com/forum/real-estate/1735402-value-land-without-road-frontage.html), it seems plausible that's what the NaN lots are. While 21 feet seems a little small based on that reading, it seems possible these smaller lot frontages represent lots that include a path to the road within the property (as opposed to an easement across someone else's property). For now, I'm going to cast the NaNs as 0.0.

In [566]:
df['lot_frontage'] = df['lot_frontage'].fillna(0.0)

df['lot_frontage'].isna().sum()

0

### Fireplace Quality

The dictionary spells out that NA == no fireplace.

In [567]:
df['fireplace_qu'] = df['fireplace_qu'].fillna('No Fireplace')

df['fireplace_qu'].isna().sum()

0

### Pool Quality

The dictionary spells out that NA == no pool.

In [568]:
df['pool_qc'] = df['pool_qc'].fillna('No Pool')

df['pool_qc'].isna().sum()

0

### Fence

The dictionary spells out that NA == no fence.

In [569]:
df['fence'] = df['fence'].fillna('No Fence')

df['fence'].isna().sum()

0

### Misc Features

The dictionary spells out that NA == no additional features

In [570]:
df['misc_feature'] = df['misc_feature'].fillna('No Addl Features')
df['misc_feature'].isna().sum()

0

In [571]:
df.shape

(878, 81)

### Masonry Veneer Type

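This note is carried over from the training notebook: I examined these initially and wasn't sure how to handle them. There, I leaned toward dropping the 22 NaNs, since it's clear 'None' is its own category and it's hard to know how the NaNs should be interpreted. With only 22 instances, I decided to drop those rows. The test data has only one NaN here (see the counts below), and I handle it differently further down.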

In [572]:
df['mas_vnr_type'].value_counts(dropna = False)

None       534
BrkFace    250
Stone       80
BrkCmn      12
CBlock       1
NaN          1
Name: mas_vnr_type, dtype: int64

In [573]:
df[df['mas_vnr_type'].isna()]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
865,868,907260030,60,RL,70.0,8749,Pave,,Reg,Lvl,...,0,0,No Pool,No Fence,No Addl Features,0,11,2009,WD,0


### Masonry Veneer Area

In [574]:
df['mas_vnr_area'].value_counts(dropna = False)

0.0      532
216.0      7
80.0       5
420.0      5
196.0      5
        ... 
281.0      1
95.0       1
481.0      1
459.0      1
410.0      1
Name: mas_vnr_area, Length: 233, dtype: int64

In [575]:
len(df[(df['mas_vnr_type'].isna()) & (df['mas_vnr_area'].isna())])

1

The nulls from Masonry Veneer Type and Area are the same ones.

I dropped the rows that had null values when I cleaned my training dataset, but I can't do that here without messing up the Kaggle data, which needs 878 rows. Instead, I'm going to assume the nulls mean there is no masonry veneer, so the type is 'None' and the area is 0.

In [584]:
df['mas_vnr_type'] = df['mas_vnr_type'].fillna('None')
df['mas_vnr_area'] = df['mas_vnr_area'].fillna(0.0)

In [585]:
# this is an artifact of the original cleaning that removes a row, throwing off the count of rows
# for the kaggle competition

# df.dropna(subset = 'mas_vnr_type', inplace = True)

In [586]:
len(df[(df['mas_vnr_type'].isna()) & (df['mas_vnr_area'].isna())])

0

In [587]:
df.shape

(878, 81)

In [588]:
df.to_csv('datasets/draft1_cleaned_kaggle_test.csv', index = False)
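
Since the row count matters for the Kaggle submission, a final check before writing the file may be useful. The following is a minimal sketch; `check_ready` is a hypothetical helper introduced here for illustration, not part of the original notebook, and the toy frames use made-up values.

```python
import numpy as np
import pandas as pd


def check_ready(frame, expected_rows):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # The submission needs one prediction per original test row.
    if len(frame) != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {len(frame)}')
    # Any remaining nulls would need handling before modeling.
    null_cols = frame.columns[frame.isna().any()].tolist()
    if null_cols:
        problems.append(f'columns with nulls: {null_cols}')
    return problems


# Toy frames: one clean, one with a missing row and a null value.
good = pd.DataFrame({'id': [1, 2], 'fence': ['No Fence', 'MnPrv']})
bad = pd.DataFrame({'id': [1], 'fence': [np.nan]})

print(check_ready(good, expected_rows=2))
print(check_ready(bad, expected_rows=2))
```

For this notebook the call would be something like `check_ready(df, expected_rows=878)`.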