## Part 2 - Data Cleaning
During our EDA we encountered some variables with incomplete or corrupted data.  
In this notebook we will use Pandas to:
* Remove outliers  
* Remove non-house listings (i.e., Land/Lot)  
* Handle missing, null or corrupted values  

In [2]:
import time
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from geopy import Nominatim
import geojson
import folium
from branca.colormap import LinearColormap, StepColormap

%matplotlib inline

## Preview the data 

In [4]:
import glob
all_csvs = []
# load the csv files from all scraping runs
for filename in glob.glob('./data/sf/**/*.csv'):
    all_csvs.append(pd.read_csv(filename))
# combine all dataframes together and drop any duplicate entries
df_dirty = pd.concat(all_csvs, ignore_index=True).drop_duplicates()
# save this combined dataframe as csv for safe keeping
# (index=False avoids writing the row index as an extra unnamed column)
df_dirty.to_csv('./data/sf/all.csv', index=False)
df_dirty.head(5) # display first 5 entries of DataFrame

| Unnamed: 0 | title | address | city | state | postal_code | price | facts and features | real estate provider | url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Condo For Sale | 550 Davis St UNIT 44 | San Francisco | CA | 94111 | $1,995,000 | 3 bds , 2 ba , 1,520 sqft | Sotheby's International Realty | https://www.zillow.com/homedetails/550-Davis-S... |
| 1 | Condo For Sale | 240 Lombard St APT 437 | San Francisco | CA | 94111 | $625,000 | 1 bd , 1 ba , 566 sqft | SimpleListing.com | https://www.zillow.com/homedetails/240-Lombard... |
| 2 | Condo For Sale | 550 Davis St UNIT 39 | San Francisco | CA | 94111 | $1,196,000 | 1 bd , 1 ba , 914 sqft | | https://www.zillow.com/homedetails/550-Davis-S... |
| 3 | Condo For Sale | 77 Dow Pl APT 701 | San Francisco | CA | 94107 | $935,000 | 1 bd , 1.5 ba , 1,022 sqft | Vanguard Properties | https://www.zillow.com/homedetails/77-Dow-Pl-A... |
| 4 | House For Sale | 807 Francisco St | San Francisco | CA | 94109 | $16,900,000 | 6 bds , 6.5 ba , 6,180 sqft | Compass | https://www.zillow.com/homedetails/807-Francis... |
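
The preview shows `price` and `facts and features` as raw strings, while the cells below work with a numeric `price`, a `sqft` column and a `property_type` column. That parsing step is not shown in this notebook, so the following is only a minimal sketch of how those columns could be derived. The helper name `parse_listing_columns` is hypothetical, and deriving `property_type` from `title` by stripping " For Sale" is an assumption about the scraped titles.

```python
import pandas as pd


def parse_listing_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric price/sqft and a property_type column."""
    out = df.copy()
    # "$1,995,000" -> 1995000; anything unparseable becomes NaN
    out["price"] = pd.to_numeric(
        out["price"].astype(str).str.replace(r"[$,]", "", regex=True),
        errors="coerce",
    )
    # "3 bds , 2 ba , 1,520 sqft" -> 1520; rows without a sqft figure become NaN
    sqft = out["facts and features"].astype(str).str.extract(r"([\d,]+)\s*sqft")[0]
    out["sqft"] = pd.to_numeric(sqft.str.replace(",", "", regex=False), errors="coerce")
    # Assumption: the property type is the title without the trailing " For Sale"
    out["property_type"] = out["title"].str.replace(r"\s*For Sale$", "", regex=True)
    return out


# Two rows copied from the preview above
sample = pd.DataFrame({
    "title": ["Condo For Sale", "House For Sale"],
    "price": ["$1,995,000", "$16,900,000"],
    "facts and features": ["3 bds , 2 ba , 1,520 sqft", "6 bds , 6.5 ba , 6,180 sqft"],
})
parsed = parse_listing_columns(sample)
print(parsed[["price", "sqft", "property_type"]])
```

Using `errors="coerce"` keeps malformed entries as `NaN` instead of raising, so they show up later in the missing-value counts.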


## Remove outliers
Recall from our EDA that our data has outliers which result in high skewness and kurtosis values.

In [None]:
# globally set our seaborn plot size to 12 by 8 inches:
sns.set(rc={'figure.figsize':(12, 8)})

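def plot_prices(df: pd.DataFrame, bins):
    """Plot a density histogram of prices with a KDE overlay, using `bins` as bin edges."""
    fig, ax = plt.subplots()
    ax.set_xticks(bins)
    ax.tick_params(axis='x', rotation=90)
    # sns.distplot is deprecated since seaborn 0.11; histplot with kde=True and
    # stat='density' draws the same kind of density histogram plus KDE curve
    return sns.histplot(df['price'], bins=list(bins), kde=True, stat='density', ax=ax)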

bins = range(int(df_dirty.price.min()),int(df_dirty.price.max()),1000000)
plot_prices(df_dirty, bins)
print("Skewness: %f" % df_dirty['price'].skew())
print("Kurtosis: %f" % df_dirty['price'].kurt())

In [None]:
cutoff = 12e6  # keep only listings priced at or below $12 million
df_clean = df_dirty[df_dirty['price'] <= cutoff]
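
Before committing to a cutoff, it can help to check how many listings it discards. The sketch below is illustrative only: `cutoff_report` is a hypothetical helper, and the prices are the five values from the preview above rather than the full dataset.

```python
import pandas as pd


def cutoff_report(prices: pd.Series, cutoff: float) -> dict:
    """Summarise how many rows the filter `prices <= cutoff` keeps and drops."""
    kept = int((prices <= cutoff).sum())  # NaN prices fail the comparison and count as dropped
    removed = len(prices) - kept
    return {"kept": kept, "removed": removed, "removed_share": removed / len(prices)}


# The five prices shown in the preview, already converted to numbers
preview_prices = pd.Series([1_995_000, 625_000, 1_196_000, 935_000, 16_900_000])
report = cutoff_report(preview_prices, cutoff=12e6)
print(report)
```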

In [None]:
bins = range(int(df_clean.price.min()),int(df_clean.price.max()),1000000)
plot_prices(df_clean, bins)
print("Skewness: %f" % df_clean['price'].skew())
print("Kurtosis: %f" % df_clean['price'].kurt())

The skewness and kurtosis values have improved. The distribution is still skewed; however, there are transformations we can apply to make it more normally distributed. These transformations are covered in a later notebook.
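
As a rough preview of what such a transformation does, here is a minimal sketch using a log transform. The choice of `np.log1p` is an assumption, since the later notebook may use a different transformation, and the prices are synthetic log-normal draws rather than the scraped data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (log-normal); illustrative only, not the scraped data
rng = np.random.default_rng(0)
synthetic_prices = pd.Series(rng.lognormal(mean=14.0, sigma=0.6, size=5_000))

# log1p computes log(1 + x), which compresses the long right tail
log_prices = np.log1p(synthetic_prices)

print("Skewness before: %f" % synthetic_prices.skew())
print("Skewness after:  %f" % log_prices.skew())
```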

## Remove entries with `property_type` Land/Lot
In our EDA we encountered listings with zero square footage. These turned out to be Land/Lot listings rather than houses. Let's remove them, since we are not interested in predicting Land/Lot value.

In [None]:
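# count zero-sqft entries in df_clean before removing Land/Lot listings,
# so the number is comparable with the count taken after the removal
num_zero_sqft = (df_clean['sqft'] == 0).sum()
print("There are {} entries with zero sqft".format(num_zero_sqft))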

In [None]:
df_clean = df_clean[df_clean['property_type'] != 'Land/Lot'] # only include entries which are NOT Land/Lot
num_zero_sqft = (df_clean['sqft'] == 0).sum()
print("There are {} entries with zero sqft".format(num_zero_sqft))

## Deal with Null and Missing values

In [None]:
df_clean.info()

In [None]:
missing = df_clean.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
plt.title("Counts of Missing Values")
plt.show()
missing_ratio = missing / len(df_clean)
missing_ratio.plot.bar()
plt.title("Ratio of Missing Values")
plt.show()

Rather than removing entries with missing values now, we will take these counts and ratios into account during feature selection in a later notebook.
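
One way the ratios could feed into feature selection is to flag columns whose share of missing values exceeds a threshold. This is a minimal sketch under assumptions: `columns_missing_above` is a hypothetical helper, the 0.5 threshold is arbitrary, and the frame is a toy example.

```python
import numpy as np
import pandas as pd


def columns_missing_above(df: pd.DataFrame, threshold: float) -> pd.Series:
    """Return the missing-value ratio of every column whose ratio exceeds threshold."""
    ratio = df.isnull().mean()  # fraction of null entries per column
    return ratio[ratio > threshold].sort_values(ascending=False)


# Toy frame for illustration; the missing entries are made up
toy = pd.DataFrame({
    "price": [625_000, 935_000, 1_196_000, 1_995_000],
    "sqft": [566, np.nan, 914, 1_520],
    "real estate provider": ["SimpleListing.com", None, None, None],
})
flagged = columns_missing_above(toy, threshold=0.5)  # 0.5 is an arbitrary example value
print(flagged)
```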

## Save the dataframe to .csv file

In [None]:
df_clean.to_csv('./data/rew_van_jan12_clean.csv', index=False)