# Part 1 - Exploratory Data Analysis (EDA)

We will start with an Exploratory Data Analysis (EDA) of the Vancouver housing dataset.  
It is always a good idea to start with an EDA before designing and training a machine learning algorithm.  
EDA gives us better insight into the data through statistical and visualization techniques.  

Upon completing this notebook, we should have:  
* Familiarity with [Pandas] and [NumPy] for data management and analysis
* Familiarity with [Matplotlib] and [seaborn] for visualization
* A decent understanding of the characteristics of our dataset

[Pandas]: https://pandas.pydata.org/
[NumPy]: http://www.numpy.org/
[Matplotlib]: https://matplotlib.org/
[seaborn]: https://seaborn.pydata.org/

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from geopy import Nominatim
import geojson
import folium
from branca.colormap import LinearColormap, StepColormap

%matplotlib inline

## Let's start by loading the data and taking a peek at the contents

In [None]:
df = pd.read_csv('./data/rew_van_jan12.csv') # load contents of .csv into a pandas.DataFrame object
df.head(5) # display first 5 entries of DataFrame

### Our data is now contained in a variable named `df` which is a pandas DataFrame 

In [None]:
df.columns

## Display some quick stats about the DataFrame
A DataFrame has a few built-in methods we can call to get a quick summary of the data:  
* `info()` displays the number of non-null values in each column and the column's datatype  
* `describe()` calculates basic statistics (count, mean, standard deviation, min, max and quartiles) for every numerical column in the DataFrame

In [None]:
df.info()

In [None]:
df.describe()

### Wow, a very large maximum price, albeit not surprising. 

In [None]:
# describe only the 'price' column
df['price'].describe()

## We have the gist of the dataset's size and contents; now it's time to go more in depth and visualize the data.  
We will use `seaborn` to visualize the data.

### Plot histogram of prices

In [None]:
# globally set our seaborn plot size to 12 by 8 inches:
sns.set(rc={'figure.figsize':(12, 8)})

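def plot_prices(df: pd.DataFrame, bins):
    fig, ax = plt.subplots()
    ax.set_xticks(list(bins))
    plt.xticks(rotation='vertical')
    # sns.distplot is deprecated since seaborn 0.11; histplot with a KDE overlay draws the equivalent figure
    return sns.histplot(df['price'], bins=list(bins), stat='density', kde=True, ax=ax)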

bins = range(int(df.price.min()),int(df.price.max()),1000000)
plot_prices(df, bins)

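### Definitely a skewed distribution; it looks as if we have a few outliers at the higher end of the price range.  
### We can quantify this by calculating:  
* `Skewness` - A measure of the symmetry (or lack thereof) of a distribution. A positive value means the tail extends to the right.
* `Kurtosis` - Whether the distribution is "heavy-tailed" or "light-tailed" compared with a normal distribution, in other words how prone it is to producing outliers. Note that pandas' `kurt()` returns *excess* kurtosis, so a normal distribution scores 0.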

In [None]:
#skewness and kurtosis
print("Skewness: %f" % df['price'].skew())
print("Kurtosis: %f" % df['price'].kurt())
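
To build some intuition for these two numbers, here is a minimal sketch on synthetic data (not the housing dataset). The seed, sample size and distribution parameters are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the numbers are repeatable
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=100_000))
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=100_000))

for name, s in [("normal", symmetric), ("lognormal", right_skewed)]:
    # Series.kurt() reports excess kurtosis, so a normal distribution scores close to 0
    print("%-10s skew=%.2f kurtosis=%.2f" % (name, s.skew(), s.kurt()))
```

The normal sample should score close to zero on both measures, while the long right tail of the lognormal sample pushes both well above zero, which is the same pattern we see in `price`.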

## Plot with outliers removed

In [None]:
df_no_outliers = df[df.price < 15e6]
bins = range(int(df_no_outliers.price.min()),int(df_no_outliers.price.max()),500000)
plot_prices(df_no_outliers, bins)
print("Skewness (outliers removed): %f" % df_no_outliers['price'].skew())
print("Kurtosis (outliers removed): %f" % df_no_outliers['price'].kurt())

### Removing the outliers brought our skewness and kurtosis values closer to zero.
We will remember this when cleaning the data for our model. Many models, linear regression in particular, tend to behave better when the target variable is roughly symmetric, and a handful of extreme outliers can have a disproportionate effect on model performance.
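
The 15e6 threshold above was picked by eye from the histogram. One possible alternative, sketched here on synthetic prices, is to derive the cutoff from a percentile. The 99th percentile and the lognormal parameters are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# hypothetical prices: mostly around 1-2 million, with a long right tail
toy = pd.DataFrame({"price": rng.lognormal(mean=14.0, sigma=0.6, size=5_000)})

cutoff = toy["price"].quantile(0.99)      # keep the cheapest 99% of listings
toy_trimmed = toy[toy["price"] < cutoff]

print("cutoff: %.0f" % cutoff)
print("skew before: %.2f, after: %.2f" % (toy["price"].skew(), toy_trimmed["price"].skew()))
```

A percentile-based cutoff adapts to the data, but it always discards a fixed share of listings, so it is worth checking against the histogram before committing to it.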

## Plot missing values.
Recall that there were some columns which are incomplete. Plot a bar graph describing this:

In [None]:
missing = df.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

Variables that are missing values can either be removed from the dataset or have their missing values replaced (perhaps with 0 or the mean of the column). Remember this for data cleaning.
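
As a rough sketch of those two options, the snippet below drops a mostly-empty column and fills the remaining gaps with the column mean. The toy DataFrame, its column names and the 50% threshold are hypothetical and are not taken from the housing dataset.

```python
import numpy as np
import pandas as pd

# hypothetical listings; the column names here are only for illustration
toy = pd.DataFrame({
    "price": [900_000, 1_200_000, 2_500_000, 750_000],
    "sqft": [850.0, np.nan, 2400.0, 600.0],
    "lot_size": [np.nan, np.nan, np.nan, 4000.0],
})

# option 1: drop columns where more than half of the values are missing
mostly_missing = toy.columns[toy.isnull().mean() > 0.5]
dropped = toy.drop(columns=mostly_missing)

# option 2: fill the remaining gaps with the column mean
filled = dropped.fillna(dropped.mean())

print(filled)
```

Which option is appropriate depends on the column: filling with 0 or the mean only makes sense when that value is a plausible stand-in for the missing measurement.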

## Since we have geolocations of the houses in the `latlng` column, let's visualize the data on a slippy map and see if there are any patterns w.r.t. price.  
We use `folium` to render html in the notebook.  
Note that there are hundreds of houses to be displayed and this requires a lot of RAM. If your browser crashes, you can reduce the number of houses displayed by changing the variable `display_max`.
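
The plotting cell has to cope with `latlng` values that are strings such as `'(49.27, -123.10)'`, the placeholder `'MISSING'`, or empty cells. A minimal sketch of that parsing step as a standalone helper is shown here. The name `parse_latlng` is hypothetical and the function is not part of the original notebook.

```python
from typing import Optional, Tuple

def parse_latlng(value) -> Optional[Tuple[float, float]]:
    """Return (lat, lng) as floats, or None when the value cannot be parsed."""
    if isinstance(value, tuple) and len(value) == 2:
        return float(value[0]), float(value[1])
    if not isinstance(value, str):
        return None  # covers None and NaN (a float) from missing cells
    parts = value.strip().strip("()").split(",")
    if len(parts) != 2:
        return None  # e.g. the 'MISSING' placeholder
    try:
        return float(parts[0]), float(parts[1])
    except ValueError:
        return None

print(parse_latlng("(49.271554, -123.106738)"))  # (49.271554, -123.106738)
print(parse_latlng("MISSING"))                   # None
```

Pulling the parsing into one function keeps the plotting loop short and makes the edge cases easy to test on their own.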

In [None]:
# create a folium map object centered in Vancouver
m = folium.Map(location=(49.271554, -123.106738))
# create a colormap of the prices (we limit prices between 5e5 and 10e6)
colors = ['gray', 'green','blue','red','orange', 'yellow']
min_price, max_price = 5e5, 10e6
colormap = StepColormap(colors=colors,vmin=min_price, vmax=max_price, caption='price')
m.add_child(colormap)
# amount of points to render on the map. WARNING: significant RAM required to plot all points and may crash your browser 
display_max = len(df) # plot all
# display_max = 100 # uncomment and adjust this number if needed
displayed = 0
for i, latlng in enumerate(df['latlng']):
    price = df.loc[i, 'price']
    if latlng is not None and latlng != 'MISSING':
        if isinstance(latlng, str):
            lat, lng = latlng.replace('(','').replace(')','').split(',')
            latlng = (float(lat), float(lng))
        if not isinstance(latlng, tuple):
            continue
        style = {'fillColor': colormap(price),
                'color' : colormap(price)}
        p = geojson.Point(coordinates=(latlng[1], latlng[0]), style=style)
        # build an HTML string to be displayed if we click a marker.
        html_info = '<li>Price: ${}</li><li>Property Type: {}</li>'.format(df.loc[i, 'price'], df.loc[i, 'property_type'])
        m.add_child(folium.Marker(location=latlng, icon=folium.Icon(color='black', icon_color=colormap(price)), popup=folium.Popup(html=html_info)))
        displayed += 1
        # stop once exactly display_max markers have been added
        if displayed >= display_max:
            break
m

### We can observe some patterns w.r.t. location with higher prices in the West where many exclusive communities are located

## Next, let's see how some of the variables interact with the list price.  
Since `price` is our target variable (the variable we are trying to predict), it is useful to visualize how each variable relates to `price`. 

### sqft
Total square footage

In [None]:
# sqft vs. price (x and y must be passed as keywords in recent seaborn versions)
var = 'sqft'
sns.regplot(x=df[var], y=df['price'])

The relationship looks linear, with the spread increasing as sqft increases. We can also see there are some houses with zero square feet! Let's investigate why:  
  
Note on `pandas.DataFrame` indexing:  
* `df['sqft'] == 0` gives us a boolean Series (a "mask") that is True where the condition holds and False otherwise. If we index the original DataFrame with this mask, we get only the rows where it is True
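
The indexing note above can be seen on a small toy DataFrame. The rows and the `Apt/Condo` label are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "sqft": [0, 850, 0, 2400],
    "property_type": ["Land/Lot", "Apt/Condo", "Land/Lot", "House"],
})

mask = toy["sqft"] == 0   # boolean Series: True, False, True, False
print(mask.tolist())
print(toy[mask])          # only rows where the mask is True
print(toy[~mask])         # ~ inverts the mask: rows with non-zero sqft
```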

In [None]:
# filter the DataFrame with zero sqft
df[df['sqft'] == 0].head(3) # only display first 3 entries

The `property_type` of these listings is *Land/Lot*. We will remember to remove them when we get to our data cleaning notebook.

### bed
Number of bedrooms

In [None]:
var = 'bed'
sns.regplot(x=df[var], y=df['price'])

The relationship is non-linear.

### bath
Number of bathrooms

In [None]:
var = 'bath'
sns.regplot(x=df[var], y=df['price'])

Relationship is non-linear.

## Generate a correlation matrix

A correlation matrix will graphically show us which variables are most correlated with our target variable `price`.

In [None]:
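corrmat = df.select_dtypes(include='number').corr()  # correlate numeric columns only; recent pandas raises on text columns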
sns.heatmap(corrmat, vmax=1, square=True);

Hmmm something funny is going on with the `age` variable. Absolutely zero correlation is a hint there may be corrupt values.  
Let's investigate: 

In [None]:
df[df['age'] < 0].head(3)

Sure enough there are some strange negative values in age. Remember this for data cleaning.  
Plot our correlation matrix again omitting negative `age` values:

In [None]:
df_no_neg_age = df[df['age'] >= 0]
corrmat = df_no_neg_age.select_dtypes(include='number').corr()  # numeric columns only
sns.heatmap(corrmat, vmax=1, square=True);
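
A heatmap is convenient for spotting patterns, but a sorted list of each variable's correlation with `price` can be easier to read. Here is a minimal sketch on synthetic data. The columns and the relationship between them are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 500
sqft = rng.uniform(500, 4000, size=n)
toy = pd.DataFrame({
    "sqft": sqft,
    "price": 800 * sqft + rng.normal(0, 200_000, size=n),  # price driven by sqft plus noise
    "noise": rng.normal(size=n),                            # unrelated column
})

# correlation of every numeric column with price, strongest first
price_corr = toy.corr()["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

The same one-liner should work on the real correlation matrix, e.g. `corrmat['price'].drop('price').sort_values(ascending=False)`.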

## Categorical Variables.  
So far we have only dealt with numeric variables; however, there are several non-numerical (**Categorical**) variables to be investigated as well.  
Categorical variables are ones which provide information but are not quantified numerically. For instance, the `sub_area` variable tells us which neighbourhood the house is located in (e.g. "Kerrisdale", "Yaletown" etc.). From our map plot, we found this information is important when considering house prices.  
In order to use these categorical variables in our model, we encode them into a numerical representation called a [Dummy Variable]. We cover Dummy Variables in a later notebook.

[Dummy Variable]: https://en.wikipedia.org/wiki/Dummy_variable_(statistics)
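
As a brief preview, here is a minimal sketch of what dummy encoding looks like using `pandas.get_dummies()` on a toy column. The later notebook may encode the variables differently.

```python
import pandas as pd

toy = pd.DataFrame({"sub_area": ["Kerrisdale", "Yaletown", "Kerrisdale"]})

# one new column per category; each row is marked in the column of its own category
dummies = pd.get_dummies(toy["sub_area"], prefix="sub_area")
print(dummies)
```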

In [None]:
print(df.columns)

Let's choose `area`, `sub_area`, `property_type`, and `strata_type` to investigate.  
We can use the `unique()` function on the categorical columns to see the different categories.

In [None]:
print(df['area'].unique())
print(df['sub_area'].unique())
print(df['property_type'].unique())
print(df['strata_type'].unique())
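
Counting the categories by eye from the printed arrays is error-prone. A sketch of two related pandas methods, `nunique()` and `value_counts()`, is shown here on a toy DataFrame with hypothetical labels.

```python
import pandas as pd

toy = pd.DataFrame({
    "area": ["Vancouver West", "Vancouver East", "Vancouver West"],
    "property_type": ["House", "Apt/Condo", "House"],
})
cat_cols = ["area", "property_type"]

print(toy[cat_cols].nunique())               # number of distinct categories per column
print(toy["property_type"].value_counts())   # number of listings in each category
```

Both methods ignore missing values by default, whereas `unique()` includes `nan` in its output.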

### There are **2** `area`, **39** `sub_area`, **7** `property_type` and **8** `strata_type` categories.  


Visualize these 4 categories as box plots.  
We use the `pandas.melt()` function to flatten our variables into a single column so we can plot.  
The result of using `melt()` is most easily understood by displaying the result.
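
Here is what `melt()` does to a two-row toy DataFrame (the values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "price": [900_000, 2_500_000],
    "area": ["Vancouver East", "Vancouver West"],
    "property_type": ["Apt/Condo", "House"],
})

# each (row, variable) pair becomes its own row, with price repeated alongside it
toy_melt = pd.melt(toy, id_vars=["price"], value_vars=["area", "property_type"])
print(toy_melt)
```

The result has one row per original row and variable, with the variable's name in a `variable` column and its category in a `value` column, which is the long format used for the box plots.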

In [None]:
vars_to_analyze = ['area', 'sub_area', 'property_type', 'strata_type']
df_melt = pd.melt(df, id_vars=['price'], value_vars=vars_to_analyze)
for var in vars_to_analyze:
    df_var = df_melt[df_melt['variable'] == var]
    sns.boxplot(x=df_var['value'], y=df_var['price'])
    x=plt.xticks(rotation=90)
    plt.title(var)
    plt.show()

## Analysis of variance (ANOVA)
For each categorical variable we run a one-way ANOVA, which compares how much `price` varies **between** the categories of that variable (e.g. between one `sub_area` and another) with how much it varies **within** each category (e.g. among the houses of a single `sub_area`).  
A small p-value suggests that the average price differs between categories, which tells us how useful it will be to group `price` by that variable (and whether including the variable in our model is likely to help).  
Here's a quick YouTube video that may better explain ANOVA:  

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo(id='ITf4vHhyGpc')

In [None]:
def anova(df):
    anv = pd.DataFrame()
    anv['feature'] = vars_to_analyze
    pvals = []
    for c in vars_to_analyze:
        # one sample of prices per category; NaN categories are skipped since they would yield empty samples
        samples = [df[df[c] == cls]['price'].values for cls in df[c].dropna().unique()]
        try:
            pval = stats.f_oneway(*samples)[1]
        except Exception:
            pval = np.nan  # keep the column numeric so the log transform below still works
        pvals.append(pval)
    anv['pval'] = pvals
    return anv.sort_values('pval')

a = anova(df)
a['disparity'] = np.log(1./a['pval'].values)
sns.barplot(data=a, x='feature', y='disparity')
x=plt.xticks(rotation=90)

This gives us a rough indication of the effect each variable will have on our model: the smaller the p-value, the larger the "disparity" score. It makes intuitive sense that `sub_area` is highest since a home in Point Grey is likely to be more expensive than one in Grandview. 
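
To make the between/within idea concrete, the sketch below computes the F statistic by hand for three made-up groups of prices and compares it with `scipy.stats.f_oneway`. The numbers are hypothetical.

```python
import numpy as np
from scipy import stats

# hypothetical prices (in $1000s) for three neighbourhoods
groups = [
    np.array([900.0, 1100.0, 1000.0]),
    np.array([1500.0, 1700.0, 1600.0]),
    np.array([2900.0, 3100.0, 3000.0]),
]

all_prices = np.concatenate(groups)
grand_mean = all_prices.mean()
k, n = len(groups), len(all_prices)

# variation of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# variation of each price around its own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
f_scipy, p_value = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_value)
```

The two F values should agree. A large F (group means far apart relative to the spread inside each group) corresponds to a small p-value and therefore a large "disparity" in the bar plot.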

## Hopefully the EDA has improved our intuition about the dataset. Now we can move on to data cleaning!