## Part 2 - Data Cleaning
During our EDA we encountered some variables with incomplete or corrupted data.  
In this notebook we will use Pandas to:
* Remove outliers  
* Handle missing, null or corrupted values  

In [None]:
import time
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from geopy import Nominatim
import geojson
import folium
from branca.colormap import LinearColormap, StepColormap

%matplotlib inline

## Preview the data 

In [None]:
df_dirty = pd.read_csv('./data/sf/data.csv')
df_dirty.head(5) # display first 5 entries of DataFrame

## Remove outliers
Recall from our EDA that our price data has outliers which result in high skewness and kurtosis values.
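
To see why a handful of extreme prices matters so much, here is a minimal sketch on made-up numbers (the values in `toy_prices` are hypothetical and not taken from our dataset). It compares skewness and kurtosis with and without a single extreme value:

```python
import pandas as pd

# Hypothetical toy prices (not from the dataset); the last value is an extreme outlier.
toy_prices = pd.Series([500_000, 650_000, 700_000, 800_000,
                        900_000, 1_200_000, 25_000_000])

# Series.skew() and Series.kurt() return sample skewness and excess kurtosis.
skew_with = toy_prices.skew()
kurt_with = toy_prices.kurt()

# Drop the outlier with a simple cutoff and recompute both statistics.
trimmed = toy_prices[toy_prices <= 8_000_000]
skew_without = trimmed.skew()
kurt_without = trimmed.kurt()

print(f'with outlier:    skew={skew_with:.2f}, kurt={kurt_with:.2f}')
print(f'without outlier: skew={skew_without:.2f}, kurt={kurt_without:.2f}')
```

Both statistics drop noticeably once the single extreme value is removed, which is the effect we are after when we apply a cutoff to the real price column.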

In [None]:
# globally set our seaborn plot size to 12 by 8 inches:
sns.set(rc={'figure.figsize':(12, 8)})

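def plot_prices(dataframe: pd.DataFrame, bins):
    """Plot a density histogram (with KDE) of `price` using the given bin edges."""
    fig, ax = plt.subplots()
    ax.set_xticks(bins)
    plt.xticks(rotation='vertical')
    # histplot replaces distplot, which has been deprecated since seaborn 0.11
    return sns.histplot(dataframe.price, bins=bins, kde=True, stat='density', ax=ax)

# 500,000-wide bins; the extra step ensures the last bin edge reaches the maximum price
bin_width = 500000
bins = list(range(int(df_dirty.price.min()), int(df_dirty.price.max()) + bin_width, bin_width))
plot_prices(df_dirty.dropna(), bins)
print(f'Skewness: {df_dirty.price.skew()}')
print(f'Kurtosis: {df_dirty.price.kurt()}')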

In [None]:
print(f'max price before: {df_dirty.price.max()}')  # df_clean does not exist yet
cutoff = 8e6
df_clean = df_dirty[df_dirty['price'] <= cutoff]
print(f'max price after: {df_clean.price.max()}')

In [None]:
# add one extra 500,000 step so the last bin edge reaches the maximum price
bins = list(range(int(df_clean.price.min()), int(df_clean.price.max()) + 500000, 500000))
plot_prices(df_clean, bins)
print("Skewness: %f" % df_clean['price'].skew())
print("Kurtosis: %f" % df_clean['price'].kurt())

The skewness and kurtosis values have improved. The distribution is still skewed; however, there are transformations we can apply to the dataset to make it more normally distributed. We will cover these transformations in a later notebook.
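
As a preview, one transformation commonly used for right-skewed data is the logarithm. The sketch below is an assumption about what such a transformation might look like, not necessarily the one used in the later notebook, and the values in `toy_prices` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices (illustrative values, not from the dataset).
toy_prices = pd.Series([400_000, 550_000, 600_000, 750_000, 900_000,
                        1_100_000, 1_500_000, 2_500_000, 4_000_000, 7_500_000])

# log1p computes log(1 + x); it compresses large values more than small ones.
log_prices = np.log1p(toy_prices)

print(f'skew before: {toy_prices.skew():.2f}')
print(f'skew after:  {log_prices.skew():.2f}')
```

On this toy series the skewness shrinks after the transform because the long right tail is pulled in toward the bulk of the data.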

In [None]:
# listings with fewer than 10 sqft are treated as having no usable area value
num_zero_sqft = (df_clean['sqft'] < 10).sum()
print("There are {} entries with fewer than 10 sqft".format(num_zero_sqft))

In [None]:
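# keep listings with at least 10 sqft (note: this comparison also drops rows where sqft is NaN)
df_clean = df_clean[df_clean['sqft'] >= 10]
num_zero_sqft = (df_clean['sqft'] < 10).sum()
print("There are {} entries with fewer than 10 sqft".format(num_zero_sqft))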

There were also some `sqft` outliers at the very high end of the range. Let's remove these values as well.

In [None]:
# x and y must be passed as keywords in recent seaborn versions
sns.regplot(x='sqft', y='price', data=df_clean, fit_reg=False)

In [None]:
print(f'max sqft before: {df_clean.sqft.max()}')
df_clean = df_clean[df_clean['sqft'] < 9000]
print(f'max sqft after: {df_clean.sqft.max()}')
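
One thing to keep in mind with boolean filters like the ones above: a comparison against `NaN` evaluates to `False`, so rows with a missing `sqft` are removed along with the outliers. If those rows should survive so they can be imputed later, the mask can be combined with `isna()`. A minimal sketch on a hypothetical toy frame (`toy`, `strict` and `lenient` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the second row has a missing sqft value.
toy = pd.DataFrame({'sqft': [1200.0, np.nan, 15000.0, 800.0],
                    'price': [900_000, 750_000, 5_000_000, 650_000]})

# A plain comparison is False for NaN, so the missing row is dropped too.
strict = toy[toy['sqft'] < 9000]

# OR-ing with isna() keeps rows whose sqft is missing, e.g. for later imputation.
lenient = toy[(toy['sqft'] < 9000) | toy['sqft'].isna()]

print(len(strict), len(lenient))
```

Which variant is appropriate depends on whether rows with a missing `sqft` are meant to be imputed or discarded.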

## Deal with Null and Missing values

In [None]:
df_clean.info()

In [None]:
missing = df_clean.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
plt.title("Counts of Missing Values")
plt.show()
missing_ratio = missing / len(df_clean)
missing_ratio.plot.bar()
plt.title("Ratio of Missing Values")
plt.show()

First, we will remove the `latlng` column completely. Although there are use cases for this data (e.g., finding the distance to nearby schools or parks), we will drop it and keep the `postal_code` column as our location data.  
We will also remove `real estate provider`, since it has too many unique values to be useful as a categorical feature.
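
A quick way to check how many unique values a column has is `DataFrame.nunique()`. The sketch below uses a hypothetical toy frame with made-up agency names, so the numbers are only illustrative:

```python
import pandas as pd

# Hypothetical toy frame (made-up values) with one low- and one high-cardinality column.
toy = pd.DataFrame({
    'postal_code': [94110, 94110, 94114, 94114, 94110],
    'real estate provider': ['Agency A', 'Agency B', 'Agency C', 'Agency D', 'Agency E'],
})

# nunique() counts distinct non-null values per column.
unique_counts = toy.nunique()

# Ratio of distinct values to rows; a value near 1 means almost every row is its own category.
unique_ratio = unique_counts / len(toy)
print(unique_ratio)
```

A ratio close to 1 suggests the column behaves more like an identifier than a category.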

In [None]:
print(df_clean.columns)
df_clean = df_clean.drop(columns=['latlng', 'real estate provider'])
print(df_clean.columns)

Now we could choose to drop all rows with null/missing values with `df.dropna()`, but we may benefit from "imputing" these values instead:  

**Imputation** fills in each missing value with an estimate, such as the column mean. The imputed value won't be exactly right in most cases, but it usually gives more accurate models than discarding the incomplete data entirely.
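
To make the trade-off concrete, here is a small sketch on a hypothetical toy frame. It uses pandas' own `fillna` with the column means as a stand-in for a dedicated imputer, purely to show the effect on row counts:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column.
toy = pd.DataFrame({'bed': [2.0, 3.0, np.nan, 4.0],
                    'bath': [1.0, np.nan, 2.0, 3.0]})

dropped = toy.dropna()            # keeps only fully populated rows
imputed = toy.fillna(toy.mean())  # fills each NaN with its column's mean

print(f'rows after dropna: {len(dropped)}')
print(f'rows after mean imputation: {len(imputed)}')
print(imputed)
```

Dropping discards half of the toy rows, while imputation keeps all of them at the cost of some approximate values.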

In [None]:
df_clean_dropna = df_clean.dropna()

In [None]:
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
df_clean_imputed = df_clean.copy() # work on a copy so df_clean stays untouched
columns_to_impute = ['bed', 'bath', 'sqft'] # only impute numerical columns
imputer = SimpleImputer(strategy='mean') # fill each NaN with its column's mean
imputed_columns = imputer.fit_transform(df_clean_imputed[columns_to_impute])
df_clean_imputed[columns_to_impute] = imputed_columns
df_clean_imputed.info()

Now that we have imputed all of the values we can, let's drop the remaining rows that still contain null values.

In [None]:
df_clean_imputed = df_clean_imputed.dropna()

In [None]:
# drop listings in postal code 94501 (Alameda, which lies outside San Francisco)
df_clean_imputed = df_clean_imputed[df_clean_imputed.postal_code != 94501]

In [None]:
df_clean_imputed.info()

## Save the dataframes to .csv

In [None]:
df_clean_dropna.to_csv('./data/sf/data_clean_dropna.csv', index=False)
df_clean_imputed.to_csv('./data/sf/data_clean_imputed.csv', index=False)