# Ames (Iowa) housing regression

We'd like to know:

* What are the main drivers of the target (SalePrice)?
* What features might be good for Iowa, USA but not for London, UK?
* Which features might translate but, due to cultural differences, make less sense in a demo for London?
* Are there any outliers? Should we dig further into them?
* Do we have missing data?
* Do we trust this dataset?

In [None]:
import pandas as pd
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
%matplotlib inline

import sys
print(f"Python version {sys.version}")

import warnings
warnings.filterwarnings("ignore")

import dabl

print(f"Pandas version {pd.__version__}")
import pandas_profiling
print(f"Pandas Profiling version {pandas_profiling.__version__}")

# Due to a Pandas/Pandas Profiling bug
# https://github.com/pandas-profiling/pandas-profiling/issues/911
# we have to make sure we're on Pandas 1.3.5 and not 1.4

import seaborn as sns
sns.set(style="white")
import altair as alt
alt.renderers.enable('html')

from utility import mpl_set_label_rotation

In [None]:
# https://www.kaggle.com/prevek18/ames-housing-dataset/data
df = pd.read_csv('ames_housing.csv')
df.columns = [c.replace(' ', '_') for c in df.columns]

# note good vis and explanation here
# https://www.kaggle.com/ammar111/house-price-prediction-an-end-to-end-ml-project
df.head()

# Common easy Pandas descriptions

How easy are these to understand?

In [None]:
df.info()

In [None]:
df.describe()
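
`df.info()` and `df.describe()` hint at missing data, but the gaps are easier to read when ranked. The following is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper and the `toy` frame is made-up stand-in data, so the block runs without `ames_housing.csv`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and fraction of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "frac_missing": counts / len(frame),
    })
    # keep only the columns that actually have gaps
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)


# Toy stand-in for the Ames data (values are invented for illustration)
toy = pd.DataFrame({
    "SalePrice": [200000, 150000, 310000, 125000],
    "Garage_Yr_Blt": [2001.0, np.nan, 1999.0, np.nan],
    "Alley": [None, None, "Pave", None],
})
missing = missing_summary(toy)
print(missing)
```

On the real data you could try `missing_summary(df)` and ask whether a missing value means "unknown" or "this house has no such feature".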

# Plots help us poke around

This is time-consuming but useful.

## Matplotlib is the default and "easy" to start with, but awkward for EDA

In [None]:
df.plot(kind="scatter", x='SalePrice', y='Gr_Liv_Area');

## Altair is much nicer for interactive exploration!

Tooltips and zoom make it easy to inspect individual points.

In [None]:
# find 1 or 2 dimensions that explain the SalePrice well
alt.Chart(df, title='Interactive data exploration').mark_circle(size=60).encode(
    x='SalePrice',
    y='Gr_Liv_Area',
    color='Overall_Qual',
    # What else might go into the tool tips for debugging?
    tooltip=['Overall_Qual', 'Order']
).interactive()
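
Hovering over points is one way to spot outliers; a numeric rule can shortlist rows worth a closer look. The sketch below uses the common 1.5 × IQR rule, which is an assumption here rather than anything the notebook prescribes. `iqr_outliers` is a hypothetical helper and `toy` is invented data.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy stand-in for the Ames data (values are invented for illustration)
toy = pd.DataFrame({
    "Order": [1, 2, 3, 4, 5, 6],
    "Gr_Liv_Area": [1000, 1100, 1200, 1300, 1400, 5600],
})
mask = iqr_outliers(toy["Gr_Liv_Area"])
flagged = toy[mask]
print(flagged)
```

A flagged row is only a candidate: use its `Order` value to find it in the interactive chart and decide whether it is an error or a genuinely unusual house.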

# First review with Pandas Profiling

In [None]:
# profile a 200-row sample; the full dataset is slow to profile with this many columns
profile = ProfileReport(df.sample(200), minimal=True)
# profile.to_file(output_file="ames_housing_univariate_report.html") # export a local HTML file
profile

# Dabl

`dabl` needs to be told which column is the regression target...

In [None]:
# BEWARE you'll get a lot of Warnings - don't worry about these
TARGET = 'what_should_go_here?' # TODO
dabl.plot(df, target_col=TARGET);

# Show a relationship or two

Can you plot another interesting relationship given what you see in `dabl` above?

In [None]:
# swap in other columns here to explore a different relationship
alt.Chart(df, title='Interactive data exploration').mark_circle(size=60).encode(
    x='Garage_Area',
    y='Gr_Liv_Area',
    color='Year_Built',
    #color='Overall_Qual',
    # What else might go into the tool tips for debugging?
    tooltip=['Overall_Qual', 'Order']
).interactive()
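
If it is hard to choose which columns to plot, a rough ranking by correlation with the target can suggest candidates. This is a sketch under assumptions: `rank_by_correlation` is a hypothetical helper, the `toy` values are invented, and Pearson correlation only captures linear relationships, so treat the ranking as a starting point rather than an answer.

```python
import pandas as pd


def rank_by_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Absolute Pearson correlation of each numeric column with the target, highest first."""
    numeric = frame.select_dtypes("number")  # non-numeric columns are ignored
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# Toy stand-in for the Ames data (values are invented for illustration)
toy = pd.DataFrame({
    "SalePrice": [100, 200, 300, 400, 500],
    "Gr_Liv_Area": [10, 20, 30, 40, 50],         # perfectly linear with SalePrice
    "Garage_Area": [5, 3, 6, 2, 7],              # weaker relationship
    "Neighborhood": ["a", "b", "a", "b", "a"],   # non-numeric, dropped by select_dtypes
})
ranking = rank_by_correlation(toy, "SalePrice")
print(ranking)
```

On the real data, `rank_by_correlation(df, 'SalePrice').head(10)` would give a shortlist to compare against what `dabl` highlights.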

## Do we have enough rows to support a sensible box plot?

Does quality change over time?

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
sns.boxplot(data=df.query('Year_Built>1970'), x='Year_Built', y='Overall_Qual', ax=ax);
mpl_set_label_rotation(ax)

In [None]:
# do a count plot here on the same dataset used above...
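
Before drawing the count plot, a plain table of rows per year already says whether each box in the box plot has enough data behind it. This is a minimal sketch on invented data; `MIN_ROWS` is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for the Ames data (values are invented for illustration)
toy = pd.DataFrame({
    "Year_Built": [1965, 1971, 1971, 1971, 1980, 1980, 2005],
    "Overall_Qual": [5, 6, 7, 6, 5, 8, 9],
})

# same filter as the box plot, then count rows per year in year order
counts = (
    toy.query("Year_Built > 1970")["Year_Built"]
    .value_counts()
    .sort_index()
)

MIN_ROWS = 3  # hypothetical cut-off for "enough rows to draw a box"
thin_years = counts[counts < MIN_ROWS]
print(counts)
print(thin_years)
```

Years listed in `thin_years` produce boxes built from very few houses, so differences between them may be noise rather than a trend in quality.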

## Plot a Seaborn Pair Plot

Use Seaborn to plot a Pair Plot - what interesting or weird observations can you see? Is this easier than using `dabl` or harder?

In [None]:
cols = ['SalePrice', 'Overall_Qual', 'Gr_Liv_Area', 'Garage_Cars', 'Garage_Area', 
        'Total_Bsmt_SF', 'Year_Built', 'Garage_Yr_Blt']

# NOTE we take a sample as the pairgrid calculations can be slow
# NOTE we must dropna() else some of the pairgrid plots won't show
sample_for_pairgrid = df[cols].sample(500).dropna()
# Fill in the Pair Plot here
# https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [None]:
# the end of the Notebook - feel free to add more cells!