## Data cleaning
- Refer to the data dictionary in the GitHub repository

In [1]:
import pandas as pd

In [4]:
df_train = pd.read_csv("../kaggle-california-housing-data/train.csv")

### Inspect columns and values
- First, find the columns that contain null values
- Check that a null value makes sense in each of these columns
- Then decide how to replace the nulls in each column
- What NaN means in each column:
    - LotFrontage: No street connected to the property
    - Alley: No alley access
    - MasVnrType: Not sure, already has a None category
    - MasVnrArea: Not sure, already has a 0 area
    - BsmtQual: No basement
    - BsmtFinType1: No basement
    - BsmtFinType2: No basement
    - Electrical: Not sure, no electrical system?
    - FireplaceQu: No fireplace
    - GarageType: No garage
    - GarageYrBlt: No garage (check)
    - GarageFinish: No garage (check)
    - GarageQual: No garage (check)
    - GarageCond: No garage (check)
    - PoolQC: No pool
    - Fence: No fence
    - MiscFeature: No miscellaneous features
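
One way to record these decisions is a column-to-value mapping passed to `DataFrame.fillna`. The sketch below is only an illustration on a toy frame: the `"None"` label and the `0.0` for `LotFrontage` are assumptions, not choices made in this notebook, and `toy` and `fill_values` are hypothetical names.

```python
import pandas as pd

# Toy frame standing in for df_train (values are made up for illustration).
toy = pd.DataFrame({
    "Alley": ["Grvl", None, "Pave"],
    "PoolQC": [None, None, "Gd"],
    "LotFrontage": [65.0, None, 80.0],
})

# Hypothetical replacement values, one per column.
# Assumption: a "None" category for absent features, 0.0 for no street frontage.
fill_values = {
    "Alley": "None",
    "PoolQC": "None",
    "LotFrontage": 0.0,
}

# fillna with a dict only touches the listed columns.
toy_filled = toy.fillna(value=fill_values)
print(toy_filled)
```

Keeping the mapping in one dict makes it easy to review each decision against the data dictionary before applying it to the full training set.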

In [11]:
# Names of the columns that contain at least one null value
df_train.columns[df_train.isnull().any()]

Index(['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence',
       'MiscFeature'],
      dtype='object')
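
Beyond the column names, it can help to see how many values are missing in each column. The following is a minimal sketch on a toy frame (the data and the names `toy` and `summary` are hypothetical); the same calls should work on `df_train`.

```python
import pandas as pd

# Toy frame standing in for df_train (values are made up for illustration).
toy = pd.DataFrame({
    "PoolQC": [None, None, None, "Gd"],
    "LotFrontage": [65.0, None, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count nulls per column and keep only the columns that have any.
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)

# Share of rows that are null, per column.
null_share = (null_counts / len(toy)).round(2)

summary = pd.DataFrame({"n_null": null_counts, "share": null_share})
print(summary)
```

A column that is almost entirely null, like `PoolQC` in this toy example, may call for a different treatment than one with only a few gaps.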

### Individual checks
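- Make sure that the NaNs are consistent across related columns: where one basement (or garage) column is NaN, the other columns for the same feature should be NaN too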

In [34]:
df_train.loc[
    df_train["BsmtQual"].isnull(), 
    ["BsmtFinType1", "BsmtFinType2"]
].drop_duplicates()

|    | BsmtFinType1 | BsmtFinType2 |
|----|--------------|--------------|
| 17 | NaN          | NaN          |


In [35]:
df_train.loc[
    df_train["GarageType"].isnull(), 
    [
        "GarageYrBlt", 
        "GarageFinish", 
        "GarageQual", 
        "GarageCond"
    ]
].drop_duplicates()

|    | GarageYrBlt | GarageFinish | GarageQual | GarageCond |
|----|-------------|--------------|------------|------------|
| 39 | NaN         | NaN          | NaN        | NaN        |
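
The same kind of check could be wrapped in a small helper that returns the offending rows directly. This is a sketch, not part of the original notebook: `inconsistent_rows` is a hypothetical function and the toy data is made up. It only checks one direction, as the cells above do: rows where the anchor column is null but a related column is filled.

```python
import pandas as pd

def inconsistent_rows(df, anchor, related):
    """Return rows where `anchor` is null but some related column is not."""
    anchor_null = df[anchor].isnull()
    related_filled = df[related].notnull().any(axis=1)
    return df.loc[anchor_null & related_filled, [anchor] + list(related)]

# Toy frame standing in for df_train (values are made up for illustration).
toy = pd.DataFrame({
    "GarageType": ["Attchd", None, None],
    "GarageFinish": ["RFn", None, "Unf"],
    "GarageQual": ["TA", None, None],
})

# Row 2 has no GarageType but still has a GarageFinish value.
bad = inconsistent_rows(toy, "GarageType", ["GarageFinish", "GarageQual"])
print(bad)
```

An empty result means the NaNs line up, which matches a single all-NaN row after `drop_duplicates()`.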


## Impute values

In [37]:
df_train.loc[df_train["Electrical"].isnull(),:]

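|      | Id   | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|------|------|------------|----------|-------------|---------|--------|-------|----------|-------------|-----------|-----|----------|--------|-------|-------------|---------|--------|--------|----------|---------------|-----------|
| 1379 | 1380 | 80         | RL       | 73.0        | 9735    | Pave   | NaN   | Reg      | Lvl         | AllPub    | ... | 0        | NaN    | NaN   | NaN         | 0       | 5      | 2008   | WD       | Normal        | 167500    |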
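
Only one row is missing `Electrical`, so one possible approach is to fill it with the most frequent category. This is an assumption for illustration, not a decision taken in this notebook, and the toy values below are made up.

```python
import pandas as pd

# Toy column standing in for df_train["Electrical"] (values are made up).
toy = pd.DataFrame({"Electrical": ["SBrkr", "SBrkr", "FuseA", None]})

# mode() ignores nulls and returns the most frequent value(s); take the first.
most_common = toy["Electrical"].mode().iloc[0]

# Assumption: a single missing entry is filled with the most common category.
toy["Electrical"] = toy["Electrical"].fillna(most_common)
print(toy)
```

Dropping the single row would be another option; which one fits better depends on how the column is used later.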
