In [296]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [297]:
df = pd.read_csv('./datasets/train.csv')

In [298]:
df.shape

(2051, 81)

In [299]:
df.isnull().sum()

Id                0
PID               0
MS SubClass       0
MS Zoning         0
Lot Frontage    330
               ... 
Misc Val          0
Mo Sold           0
Yr Sold           0
Sale Type         0
SalePrice         0
Length: 81, dtype: int64

Because there are so many columns, I wanted to list only the columns that contain null values. My own attempts didn't work, so I did some research: this [Stack Overflow question](https://stackoverflow.com/questions/53137100/filter-pandas-dataframe-columns-with-null-data) showed me the approach I used below.

In [300]:
df.columns[df.isna().any()].tolist()

['Lot Frontage',
 'Alley',
 'Mas Vnr Type',
 'Mas Vnr Area',
 'Bsmt Qual',
 'Bsmt Cond',
 'Bsmt Exposure',
 'BsmtFin Type 1',
 'BsmtFin SF 1',
 'BsmtFin Type 2',
 'BsmtFin SF 2',
 'Bsmt Unf SF',
 'Total Bsmt SF',
 'Bsmt Full Bath',
 'Bsmt Half Bath',
 'Fireplace Qu',
 'Garage Type',
 'Garage Yr Blt',
 'Garage Finish',
 'Garage Cars',
 'Garage Area',
 'Garage Qual',
 'Garage Cond',
 'Pool QC',
 'Fence',
 'Misc Feature']
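
A related sketch, in case the counts are wanted alongside the names: the per-column null counts can be filtered down to the columns that have at least one null. The `toy` frame and its values are made up for illustration and are not the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data.
toy = pd.DataFrame({
    'Lot Frontage': [65.0, np.nan, 80.0],
    'Alley': [np.nan, np.nan, 'Pave'],
    'SalePrice': [130500, 220000, 109000],
})

# Count nulls per column, then keep only the columns that have any.
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```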

In [301]:
df.columns = df.columns.str.replace(' ', '_').str.lower()

In [302]:
df.head(2)

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,109,533352170,60,RL,,13517,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,...,0,0,,,,0,4,2009,WD,220000


### Sanity Check From EDA + Further Cleaning Notebook

I'm returning to this point to be certain that I didn't accidentally create any of the oddities I'm finding during EDA.

In [397]:
df['street'].value_counts()

Pave    2021
Grvl       7
Name: street, dtype: int64

## Data Dictionary

I reference the [data dictionary](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt) a lot in the sections that follow.

### Alley - NaN --> 'None'

In [305]:
df['alley'].value_counts(dropna = False)

NaN     1911
Grvl      85
Pave      55
Name: alley, dtype: int64

It appears that NaN means there is no alley. I'm going to replace NaN with 'None' for now. 

In [306]:
df['alley'] = df['alley'].fillna('None')

df['alley'].value_counts(dropna = False)

None    1911
Grvl      85
Pave      55
Name: alley, dtype: int64

### Basements!

I created the for-loop below to examine how many NAs existed in each basement column and to help me decide how to deal with them, since there's presumably at least some relationship between them. Rather than copying the code over and over again, I also returned to the for-loop to use as my sanity check as I cleaned the basement columns.

In [311]:
bsmt_list = df.columns[df.columns.str.contains('bsmt')].tolist()

bsmt_list

['bsmt_qual',
 'bsmt_cond',
 'bsmt_exposure',
 'bsmtfin_type_1',
 'bsmtfin_sf_1',
 'bsmtfin_type_2',
 'bsmtfin_sf_2',
 'bsmt_unf_sf',
 'total_bsmt_sf',
 'bsmt_full_bath',
 'bsmt_half_bath']

In [367]:
for i in bsmt_list:
    print(f'{i} has {df[i].isnull().sum()}')

bsmt_qual has 0
bsmt_cond has 0
bsmt_exposure has 0
bsmtfin_type_1 has 0
bsmtfin_sf_1 has 0
bsmtfin_type_2 has 0
bsmtfin_sf_2 has 0
bsmt_unf_sf has 0
total_bsmt_sf has 0
bsmt_full_bath has 0
bsmt_half_bath has 0
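
As a possible alternative to the loop, the same per-column counts can be produced in one expression. This is only a sketch: `toy` is a made-up frame, and `DataFrame.filter(like=...)` is used here to pick out the columns whose names contain `'bsmt'`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few basement columns and one unrelated column.
toy = pd.DataFrame({
    'bsmt_qual': ['TA', np.nan],
    'total_bsmt_sf': [725.0, np.nan],
    'bsmt_full_bath': [0.0, np.nan],
    'garage_area': [475.0, 559.0],
})

# Select columns whose name contains 'bsmt', then count nulls in each.
bsmt_nulls = toy.filter(like='bsmt').isna().sum()
print(bsmt_nulls)
```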


When I first ran this loop, I was surprised by the variation in the number of NaNs (the zeros above are from re-running the cell after cleaning). Knowing that some of the NaNs actually represent "No Basement", I thought there'd be a pretty consistent number.

I decided to start with the columns that had only a few NaNs to see what was going on.

### Basement Full Bath and Half Bath

In [313]:
df[df['bsmt_full_bath'].isna()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath
616,,,,,0.0,,0.0,0.0,0.0,,
1327,,,,,,,,,,,


In [314]:
df['bsmt_full_bath'].value_counts(dropna = False)

0.0    1200
1.0     824
2.0      23
NaN       2
3.0       2
Name: bsmt_full_bath, dtype: int64

I ran the two cells above to get a sense of what was going on with these instances. It appears these NaNs should be 0.0, reflecting no basement. The same appears to be true for the half baths, which I checked below.

In [315]:
df['bsmt_full_bath'] = df['bsmt_full_bath'].fillna(0.0)

In [316]:
df['bsmt_full_bath'].value_counts(dropna = False)

0.0    1202
1.0     824
2.0      23
3.0       2
Name: bsmt_full_bath, dtype: int64

In [317]:
df[df['bsmt_half_bath'].isna()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath
616,,,,,0.0,,0.0,0.0,0.0,0.0,
1327,,,,,,,,,,0.0,


In [318]:
df['bsmt_half_bath'] = df['bsmt_half_bath'].fillna(0.0)

In [319]:
df['bsmt_half_bath'].value_counts(dropna = False)

0.0    1925
1.0     122
2.0       4
Name: bsmt_half_bath, dtype: int64

### Basement Sq Ft, Finished (both types) & Unfinished, Total

In [320]:
df[df['bsmt_unf_sf'].isna()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath
1327,,,,,,,,,,0.0,0.0


Having looked at the values in other instances, it appears this should also be 0.0, for no basement.

In [321]:
df['bsmt_unf_sf'] = df['bsmt_unf_sf'].fillna(0.0)
df['bsmt_unf_sf'].isnull().sum()

0

In [322]:
df[df['bsmtfin_sf_1'].isnull()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath
1327,,,,,,,,0.0,,0.0,0.0


In [323]:
df['bsmtfin_sf_1'] = df['bsmtfin_sf_1'].fillna(0.0)
df['bsmtfin_sf_1'].isnull().sum()

0

In [324]:
df[df['bsmtfin_sf_2'].isnull()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath
1327,,,,,0.0,,,0.0,,0.0,0.0


In [325]:
df['bsmtfin_sf_2'] = df['bsmtfin_sf_2'].fillna(0.0)
df['bsmtfin_sf_2'].isnull().sum()

0

In [326]:
df[df['total_bsmt_sf'].isnull()][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath
1327,,,,,0.0,,0.0,0.0,,0.0,0.0


In [327]:
df['total_bsmt_sf'] = df['total_bsmt_sf'].fillna(0.0)
df['total_bsmt_sf'].isnull().sum()

0
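
Since every numeric basement column received the same fill value, the steps above could also be collapsed into one assignment. A minimal sketch on a made-up `toy` frame (the column names match the notebook, the values do not come from the data):

```python
import numpy as np
import pandas as pd

numeric_bsmt_cols = ['bsmtfin_sf_1', 'bsmtfin_sf_2', 'bsmt_unf_sf',
                     'total_bsmt_sf', 'bsmt_full_bath', 'bsmt_half_bath']

# Hypothetical toy frame: the middle row has no basement information at all.
toy = pd.DataFrame({col: [0.0, np.nan, 1.0] for col in numeric_bsmt_cols})

# Fill all numeric basement columns with 0.0 in a single step.
toy[numeric_bsmt_cols] = toy[numeric_bsmt_cols].fillna(0.0)
print(toy.isna().sum())
```

Filling one column at a time, as above, has the advantage that each affected row can be inspected before it is changed.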

### Basement Quality

The dictionary specifies that 'Na' is 'No Basement', so I'm going to cast it as such. This appears to be true for all the categorical basement columns.

In [328]:
df['bsmt_qual'].value_counts(dropna = False)

TA     887
Gd     864
Ex     184
Fa      60
NaN     55
Po       1
Name: bsmt_qual, dtype: int64

In [329]:
df['bsmt_qual'] = df['bsmt_qual'].fillna('No Basement')

In [330]:
df['bsmt_qual'].value_counts(dropna = False)

TA             887
Gd             864
Ex             184
Fa              60
No Basement     55
Po               1
Name: bsmt_qual, dtype: int64

### Basement Condition

The data dictionary says that 'Na' means 'No Basement', so I'm making that explicit.

In [331]:
df['bsmt_cond'] = df['bsmt_cond'].fillna('No Basement')
df['bsmt_cond'].isnull().sum()

0

In [332]:
df['bsmt_cond'].value_counts()

TA             1834
Gd               89
Fa               65
No Basement      55
Po                5
Ex                3
Name: bsmt_cond, dtype: int64

### Basement Exposure
Dictionary says 'Na' means 'No Basement', so making that explicit.

In [333]:
df['bsmt_exposure'] = df['bsmt_exposure'].fillna('No Basement')
df['bsmt_exposure'].isna().sum()

0

In [334]:
df['bsmt_exposure'].value_counts()

No             1339
Av              288
Gd              203
Mn              163
No Basement      58
Name: bsmt_exposure, dtype: int64
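
The 'No Basement' count here (58) is three higher than the 55 seen for `bsmt_qual` and `bsmt_cond`, which suggests a few rows may have a basement but a missing exposure value. A minimal sketch of how such rows could be surfaced, using a made-up `toy` frame rather than the real data:

```python
import pandas as pd

# Hypothetical toy frame: row 0 has a basement (quality 'Gd') but its
# exposure was filled with 'No Basement'.
toy = pd.DataFrame({
    'bsmt_qual': ['Gd', 'No Basement', 'TA'],
    'bsmt_exposure': ['No Basement', 'No Basement', 'Av'],
    'total_bsmt_sf': [725.0, 0.0, 913.0],
})

# Rows where exposure says 'No Basement' but quality says otherwise.
mismatch = toy[(toy['bsmt_exposure'] == 'No Basement')
               & (toy['bsmt_qual'] != 'No Basement')]
print(mismatch)
```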

### Basement Finish Type 1 and 2
The dictionary says 'Na' means 'No Basement', so making that explicit. There's one more null in Finish Type 2 than in Type 1, so I looked into that before making the change to 'No Basement'.

In [335]:
df['bsmtfin_type_1'] = df['bsmtfin_type_1'].fillna('No Basement')
df['bsmtfin_type_1'].value_counts()

GLQ            615
Unf            603
ALQ            293
BLQ            200
Rec            183
LwQ            102
No Basement     55
Name: bsmtfin_type_1, dtype: int64

In [336]:
df[(df['bsmtfin_type_2'].isnull()) & (df['bsmtfin_type_1'] != 'No Basement')][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath
1147,Gd,TA,No,GLQ,1124.0,,479.0,1603.0,3206.0,1.0,0.0


In the original data frame there's one instance with a null 'bsmtfin_type_2' that clearly has a basement based on the other entries. Since I have no way of knowing what type that should be, I'm deleting this instance.

In [337]:
df.drop(index = df[(df['bsmtfin_type_2'].isnull()) & (df['bsmtfin_type_1'] != 'No Basement')].index, inplace = True)

In [338]:
df[(df['bsmtfin_type_2'].isnull()) & (df['bsmtfin_type_1'] != 'No Basement')][bsmt_list]

Unnamed: 0,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,bsmt_full_bath,bsmt_half_bath


Now making the change to 'No Basement' on Finish Type 2

In [339]:
df['bsmtfin_type_2'] = df['bsmtfin_type_2'].fillna('No Basement')
df['bsmtfin_type_2'].value_counts()

Unf            1749
Rec              80
LwQ              60
No Basement      55
BLQ              48
ALQ              35
GLQ              23
Name: bsmtfin_type_2, dtype: int64

Checking in on which columns still have Na's

In [340]:
df.columns[df.isnull().any()].tolist()

['lot_frontage',
 'mas_vnr_type',
 'mas_vnr_area',
 'fireplace_qu',
 'garage_type',
 'garage_yr_blt',
 'garage_finish',
 'garage_cars',
 'garage_area',
 'garage_qual',
 'garage_cond',
 'pool_qc',
 'fence',
 'misc_feature']

### Garage Columns

In [341]:
garage_list = df.columns[df.columns.str.contains('garage')].tolist()
garage_list

['garage_type',
 'garage_yr_blt',
 'garage_finish',
 'garage_cars',
 'garage_area',
 'garage_qual',
 'garage_cond']

Again, I recycled the below for-loop as a final sanity check, returning to this line repeatedly as I worked, rather than rewriting the loop for every step I took.

In [366]:
for i in garage_list:
    print(f'{i} has {df[i].isnull().sum()} nulls')

garage_type has 0 nulls
garage_yr_blt has 0 nulls
garage_finish has 0 nulls
garage_cars has 0 nulls
garage_area has 0 nulls
garage_qual has 0 nulls
garage_cond has 0 nulls


I did a search to find the not null method for the following, which turned up the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notnull.html).

In [343]:
df[(df['garage_yr_blt'].isnull()) & (df['garage_type'].notnull())][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
1712,Detchd,,,,,,


A garage_type is listed for this one instance, but NaN in a number of the other columns means 'No Garage' according to the dictionary, so I'm going to operate on the assumption that the garage type is in error here and change it to 'No Garage'.

In [344]:
df.reset_index()

Unnamed: 0,index,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,0,109,533352170,60,RL,,13517,Pave,,IR1,...,0,0,,,,0,3,2010,WD,130500
1,1,544,531379050,60,RL,43.0,11492,Pave,,IR1,...,0,0,,,,0,4,2009,WD,220000
2,2,153,535304180,20,RL,68.0,7922,Pave,,Reg,...,0,0,,,,0,1,2010,WD,109000
3,3,318,916386060,60,RL,73.0,9802,Pave,,Reg,...,0,0,,,,0,4,2010,WD,174000
4,4,255,906425045,50,RL,82.0,14235,Pave,,IR1,...,0,0,,,,0,3,2010,WD,138500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2045,2046,1587,921126030,20,RL,79.0,11449,Pave,,IR1,...,0,0,,,,0,1,2008,WD,298751
2046,2047,785,905377130,30,RL,,12342,Pave,,IR1,...,0,0,,,,0,3,2009,WD,82500
2047,2048,916,909253010,50,RL,57.0,7558,Pave,,Reg,...,0,0,,,,0,3,2009,WD,177000
2048,2049,639,535179160,20,RL,80.0,10400,Pave,,Reg,...,0,0,,,,0,11,2009,WD,144000


In [345]:
garage_index = df[(df['garage_yr_blt'].isnull()) & (df['garage_type'].notnull())].index

df.iloc[garage_index]['garage_type'].map({'garage_type': 'No Basement'})

df.iloc[garage_index]['garage_type']

1713    Attchd
Name: garage_type, dtype: object

In [346]:
garage_index.astype(int)

Int64Index([1712], dtype='int64')

In [347]:
df.at[1712, 'garage_type'] = 'No Garage'

In [348]:
df[(df['garage_yr_blt'].isnull()) & (df['garage_type'].notnull())][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
1712,No Garage,,,,,,


In [349]:
df.iloc[garage_index][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
1713,Attchd,1952.0,RFn,1.0,280.0,TA,TA
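
The row shown here carries label 1713 rather than 1712. `.iloc` selects by position, and once a row has been dropped, positions and index labels no longer line up, so an index taken from `.index` needs the label-based `.loc`. A minimal sketch on a made-up `toy` frame:

```python
import pandas as pd

# Hypothetical toy frame; dropping label 1 leaves labels 0, 2, 3.
toy = pd.DataFrame({'garage_type': ['Attchd', 'Detchd', 'Detchd', 'Attchd']})
toy = toy.drop(index=1)

labels = pd.Index([2])  # an index of row labels, e.g. taken from a filter

# .iloc treats 2 as a position, which now points at the row labelled 3.
by_position = toy.iloc[labels]['garage_type']
print(by_position)

# .loc treats 2 as a label, so it reads and writes the intended row.
toy.loc[labels, 'garage_type'] = 'No Garage'
print(toy)
```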


In [350]:
df[(df['garage_yr_blt'].isnull()) & (df['garage_type'].notnull())][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
1712,No Garage,,,,,,


In [351]:
df['garage_type'] = df['garage_type'].fillna('No Garage')
   
df['garage_type'].isnull().sum()

0

The dictionary spells out the below "no_garage_list" columns as NA = No Garage. This is the first place I copied the df because I wanted a backup in case I made a mistake filling the NAs in multiple columns in one block, as it's the first time I've done it that way. When I succeeded, I moved the copy_df code to afterwards to save it before my next experiment.

In [352]:
no_garage_list = ['garage_finish', 'garage_qual', 'garage_cond']

df[no_garage_list] = df[no_garage_list].fillna('No Garage')

In [353]:
copy_df = df.copy()
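
For a backup to be useful, it has to be an independent copy. A short sketch of the difference between `.copy()` and plain assignment, on a made-up `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value.
toy = pd.DataFrame({'garage_cars': [2.0, np.nan]})

alias = toy          # a second name for the same object, not a backup
backup = toy.copy()  # an independent copy

# Modify the original.
toy['garage_cars'] = toy['garage_cars'].fillna(0.0)

print(alias['garage_cars'].isna().sum())   # the alias reflects the change
print(backup['garage_cars'].isna().sum())  # the backup still has the NaN
```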

In [354]:
df[df['garage_cars'].isnull()][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
1712,No Garage,,No Garage,,,No Garage,No Garage


In [356]:
no_garage_list2 = ['garage_yr_blt', 'garage_cars', 'garage_area']

df.at[1712, no_garage_list2] = 'No Garage'

In [357]:
df[df['garage_cars'].isnull()][garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond


I've got to figure out what's going on with 'garage_yr_blt' because it seems possible some of these NAs may be for things other than "No Garage".

In [361]:
df[df['garage_yr_blt'].isnull()].head()[garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond
28,No Garage,,No Garage,0.0,0.0,No Garage,No Garage
53,No Garage,,No Garage,0.0,0.0,No Garage,No Garage
65,No Garage,,No Garage,0.0,0.0,No Garage,No Garage
79,No Garage,,No Garage,0.0,0.0,No Garage,No Garage
101,No Garage,,No Garage,0.0,0.0,No Garage,No Garage


In [363]:
df[(df['garage_yr_blt'].isnull()) & (df['garage_finish'] != 'No Garage')]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice


It appears all the rows where garage_yr_blt is NaN correspond to 'No Garage'.

In [364]:
df['garage_yr_blt'] = df['garage_yr_blt'].fillna('No Garage')

In [365]:
df[df['garage_yr_blt'].isnull()].head()[garage_list]

Unnamed: 0,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond


I'm revisiting the null list to see what's left. I came back up to this cell a few times, rather than rewriting it below, as an additional sanity check and to keep track of what's left.

In [387]:
df.columns[df.isna().any()].tolist()

['mas_vnr_type', 'mas_vnr_area']

#### Lot Frontage

In [303]:
df['lot_frontage'].value_counts(dropna = False)

NaN      330
60.0     179
70.0      96
80.0      94
50.0      90
        ... 
119.0      1
122.0      1
22.0       1
155.0      1
135.0      1
Name: lot_frontage, Length: 119, dtype: int64

In [304]:
df['lot_frontage'].isna().sum()/len(df)

0.16089712335446124
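
The same proportion can be computed as the mean of the boolean null mask, since `True` counts as 1. A small sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 2 of 5 values are missing.
toy = pd.Series([60.0, np.nan, 70.0, np.nan, 80.0], name='lot_frontage')

# Mean of the boolean mask equals the share of missing values.
share_missing = toy.isna().mean()
print(share_missing)
```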

16% of the data seems like a lot to drop.

In [369]:
df[df['lot_frontage'] == 0.0]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice


In [372]:
df[df['lot_frontage'] < 30.0]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
10,1044,527451290,160,RM,21.0,1680,Pave,,Reg,Lvl,...,0,0,,,,0,7,2008,COD,85400
13,1177,533236070,160,FV,24.0,2645,Pave,Pave,Reg,Lvl,...,0,0,,,,0,12,2008,ConLD,200000
53,330,923226250,160,RM,21.0,1476,Pave,,Reg,Lvl,...,0,0,,,,0,3,2010,WD,76000
56,98,533212020,160,FV,24.0,2544,Pave,Pave,Reg,Lvl,...,0,0,,,,0,2,2010,WD,149500
73,2832,908188140,160,RM,24.0,2522,Pave,,Reg,Lvl,...,0,0,,,,0,4,2006,WD,137500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1987,2377,527453070,120,RL,24.0,2280,Pave,,Reg,Lvl,...,0,0,,,,0,4,2006,WD,137500
2004,1843,533213130,160,FV,24.0,2280,Pave,Pave,Reg,Lvl,...,0,0,,,,0,9,2007,WD,176500
2006,1041,527450250,160,RM,21.0,1680,Pave,,Reg,Lvl,...,0,0,,,,0,11,2008,WD,100000
2007,980,923228230,160,RM,21.0,2217,Pave,,Reg,Lvl,...,0,0,,,,0,8,2009,WD,88000


It appears possible that NaN means no frontage, because no lot is recorded with 0.0 frontage. In fact, there's nothing with less than 21.0 feet of frontage. In reading a little about "flag lots" [here](https://www.city-data.com/forum/real-estate/1735402-value-land-without-road-frontage.html), it seems plausible that's what these are. While 21 feet seems a little small from that reading, it seems possible these smaller lot frontages represent lots that include a path to the road within the property (as opposed to an easement across someone else's property). For now, I'm going to cast the NaNs as 0.0.

In [374]:
df['lot_frontage'] = df['lot_frontage'].fillna(0.0)

df['lot_frontage'].isna().sum()

0

### Fireplace Quality

The dictionary spells out that NA == no fireplace.

In [376]:
df['fireplace_qu'] = df['fireplace_qu'].fillna('No Fireplace')

df['fireplace_qu'].isna().sum()

0

### Pool Quality

The dictionary spells out that NA == no pool.

In [378]:
df['pool_qc'] = df['pool_qc'].fillna('No Pool')

df['pool_qc'].isna().sum()

0

### Fence

The dictionary spells out that NA == no fence.

In [380]:
df['fence'] = df['fence'].fillna('No Fence')

df['fence'].isna().sum()

0

### Misc Features

The dictionary spells out that NA == no additional features.

In [382]:
df['misc_feature'] = df['misc_feature'].fillna('No Addl Features')
df['misc_feature'].isna().sum()

0
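
The last four fills follow the same pattern, so they could also be expressed as a single `fillna` call with a dict that maps each column to its label. A minimal sketch on a made-up `toy` frame (labels match the ones used above):

```python
import numpy as np
import pandas as pd

# Column -> label to use where the dictionary says NA means "none".
fill_values = {
    'fireplace_qu': 'No Fireplace',
    'pool_qc': 'No Pool',
    'fence': 'No Fence',
    'misc_feature': 'No Addl Features',
}

# Hypothetical toy frame; lot_frontage is included to show it is left alone.
toy = pd.DataFrame({
    'fireplace_qu': ['Gd', np.nan],
    'pool_qc': ['Ex', np.nan],
    'fence': [np.nan, 'MnPrv'],
    'misc_feature': [np.nan, 'Shed'],
    'lot_frontage': [np.nan, 43.0],
})

# Only the columns named in the dict are filled.
toy = toy.fillna(value=fill_values)
print(toy)
```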

### Masonry Veneer Type

I examined these initially and wasn't sure how to handle them. 'None' is already its own category, so it's hard to know how the NaNs should be interpreted. With only 22 instances, I decided to drop these rows.

In [307]:
df['mas_vnr_type'].value_counts(dropna = False)

None       1218
BrkFace     630
Stone       168
NaN          22
BrkCmn       13
Name: mas_vnr_type, dtype: int64

In [308]:
df[df['mas_vnr_type'].isna()]

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
22,2393,528142010,60,RL,103.0,12867,Pave,,IR1,Lvl,...,0,0,,,,0,7,2006,New,344133
41,2383,528110050,20,RL,107.0,13891,Pave,,Reg,Lvl,...,0,0,,,,0,9,2006,New,465000
86,539,531371050,20,RL,67.0,10083,Pave,,Reg,Lvl,...,0,0,,,,0,8,2009,WD,184900
212,518,528458020,20,FV,90.0,7993,Pave,,IR1,Lvl,...,0,0,,,,0,10,2009,New,225000
276,2824,908130020,20,RL,75.0,8050,Pave,,Reg,Lvl,...,0,0,,,,0,4,2006,WD,117250
338,1800,528458150,60,FV,112.0,12217,Pave,,IR1,Lvl,...,0,0,,,,0,12,2007,New,310013
431,1455,907251090,60,RL,75.0,9473,Pave,,Reg,Lvl,...,0,0,,,,0,3,2008,WD,237000
451,1120,528439010,20,RL,87.0,10037,Pave,,Reg,Lvl,...,0,0,,,,0,8,2008,WD,247000
591,1841,533208040,120,FV,35.0,4274,Pave,Pave,IR1,Lvl,...,0,0,,,,0,11,2007,New,199900
844,1840,533208030,120,FV,30.0,5330,Pave,Pave,IR2,Lvl,...,0,0,,,,0,7,2007,New,207500


### Masonry Veneer Area

In [309]:
df['mas_vnr_area'].value_counts(dropna = False)

0.0      1216
NaN        22
120.0      11
176.0      10
200.0      10
         ... 
142.0       1
215.0       1
235.0       1
233.0       1
426.0       1
Name: mas_vnr_area, Length: 374, dtype: int64

In [310]:
len(df[(df['mas_vnr_type'].isna()) & (df['mas_vnr_area'].isna())])

22

The nulls from Masonry Veneer Type and Area are the same ones.
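
Another way to confirm that two columns are null in exactly the same rows is to compare their null masks directly. A sketch on a made-up `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: both columns are null in rows 1 and 3.
toy = pd.DataFrame({
    'mas_vnr_type': ['BrkFace', np.nan, 'None', np.nan],
    'mas_vnr_area': [120.0, np.nan, 0.0, np.nan],
})

type_null = toy['mas_vnr_type'].isna()
area_null = toy['mas_vnr_area'].isna()

# True only if the two masks agree on every row.
same_rows = bool((type_null == area_null).all())
print(same_rows)
```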

In [384]:
copy_df = df.copy()

I reviewed the .dropna documentation to understand how to target the drop to the column I wanted. This also took care of the other column, since the relevant NAs were in the same rows.

In [391]:
df.dropna(subset = ['mas_vnr_type'], inplace = True)

In [392]:
len(df[(df['mas_vnr_type'].isna()) & (df['mas_vnr_area'].isna())])

0
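
A small sketch of how `subset` limits the drop to the named column, so that nulls elsewhere do not remove a row. The `toy` frame is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 is null in mas_vnr_type, row 0 only in lot_frontage.
toy = pd.DataFrame({
    'mas_vnr_type': ['BrkFace', np.nan, 'Stone'],
    'mas_vnr_area': [120.0, np.nan, 95.0],
    'lot_frontage': [np.nan, 70.0, 60.0],
})

# Drop rows only where mas_vnr_type is null; row 0 stays despite its NaN.
cleaned = toy.dropna(subset=['mas_vnr_type'])
print(cleaned)
```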

In [396]:
df.to_csv('datasets/draft1_cleaned_train.csv', index = False)
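
One thing that may be worth checking when the cleaned file is read back: depending on the pandas version, `read_csv` may parse the literal string 'None' as a missing value by default, which would undo fills such as the one on `alley`. A minimal sketch of a round trip that keeps such strings, assuming only empty fields should count as missing; it uses an in-memory buffer and a made-up `toy` frame instead of the real file:

```python
import io

import pandas as pd

# Hypothetical toy frame containing the string 'None' as a real category.
toy = pd.DataFrame({
    'alley': ['None', 'Grvl'],
    'mas_vnr_type': ['None', 'BrkFace'],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Turn off the default NA strings and treat only empty fields as missing.
reloaded = pd.read_csv(buffer, keep_default_na=False, na_values=[''])
print(reloaded)
```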