![Aerial View of Housing](housing-header.jpg)

The Ames Housing Dataset is one of the richest and most detailed datasets for
house price prediction. It was compiled by Dean De Cock (2011) as a modern
replacement for the Boston Housing dataset. With nearly 80 features describing
aspects like neighborhood, lot size, and even garage quality, there is plenty
to work with.

The goal of this notebook is to use these features to predict `SalePrice`,
the target of this analysis. Because `SalePrice` is a continuous variable,
we will use regression techniques in our model.

We'll start by importing some initial modules and taking a preliminary look at
our dataset.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('ames.csv', index_col=0)
df.info()

As you can see, there are quite a lot of features to work with. It can be quite
overwhelming, but we'll address this in the EDA section. There are also a
significant number of null values to deal with, so we'll need to look at each
of those.

**Workflow:**

I'll be following a modified version of the CRISP-DM framework for this
workflow, addressing:

1. Business Understanding
   1. Objectives
   2. Situation
   3. Metrics & Goals
   4. Project Plan
2. Data Understanding
   1. Data Descriptions
   2. Exploratory Data Analysis
3. Data Preparation
   1. Clean Data
   2. Transform Data I: Feature Engineering
   3. Transform Data II: Feature Encoding & Standardization
4. Modeling
   1. Select Modeling Techniques
   2. Generate Test Design
   3. Build & Train the Model
5. Evaluation

## Business Understanding

### Objectives: What are we hoping to get out of this?

Given the nature of this data, the ideal outcome of this analysis is a pricing
model that uses effective machine learning model(s) to predict how much a house
will sell for. I intentionally use the word "effective" to describe the models,
instead of something like "advanced" or "complex," because we don't need to
introduce unnecessary complexity just for the sake of it. However, if we find
that the model fails to perform well on the dataset, we can start expanding the
complexity of the models.

There could be many reasons we are in need of this model. For instance, we
could have a client in the real estate industry, an investor who is finding
that they consistently overpay for properties by conventional methods. We could
also be working for a lender who is looking to improve the appraisal portion of
their underwriting process. Or, we could simply be an over-eager individual
looking to purchase a home. Regardless of the purpose behind the objective,
a house pricing predictor is clearly a useful tool if set up properly.

### Situation

*In short, what resources do we have at our disposal?*

Our primary source of information is the *Ames Housing Dataset*. The original
dataset can be found on the American Statistical Association's [website][1]
and a cleaned version can be found in [this repository][2]. The latter source
also has a text file describing the various fields available to us. We'll come
back to that later.

Outside of these two files, we don't have a lot of other data available to us. In
an industry case, we'd likely have multiple sources to extract data from,
aggregating and joining across multiple database tables. We're both lucky and
unlucky here: we don't have much work to do around data extraction, but we're
also limited to the fields provided in the dataset. However, we still have a
great number of features to work with.

[1]: https://ww2.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt
[2]: https://github.com/melindaleung/Ames-Iowa-Housing-Dataset

### Metrics & Goals

*How do we measure our success, and where do we hope to land?*

We have a range of options for measuring the performance of a regression model.
The most commonly used include mean squared error, mean absolute error, root
mean squared error, and mean absolute percentage error. I'm going with
**root mean squared error** for a couple of reasons:

- **Higher Penalty for Large Deviations:** Squaring the errors (as opposed to
  taking absolute or percentage errors) penalizes large misses
  disproportionately. It's better to be off by a few thousand here and there
  than to make a major miscalculation on a home purchase.
- **Explainability:** When we present this to non-technical stakeholders,
  saying the model has an error of $8,000 is much better than saying it has
  an error of 64,000,000. When using a squared error, this can only be achieved
  by using the root-mean version of the error.

With that said, I would like to get our model's RMSE within 5% of the average
sale price in the dataset, so the target error amount will depend on that
average.
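
To make the target concrete, here is a minimal sketch of the RMSE calculation
and the 5% threshold. The prices and predictions are made-up numbers for
illustration only, and `rmse` is a hypothetical helper rather than something
defined elsewhere in this notebook.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Illustrative values only (not taken from the Ames data)
actual = np.array([200_000, 150_000, 320_000, 180_000])
predicted = np.array([208_000, 142_000, 328_000, 172_000])

error = rmse(actual, predicted)
target = 0.05 * actual.mean()  # goal: RMSE within 5% of the average sale price

print(f'RMSE:   ${error:,.0f}')
print(f'Target: ${target:,.0f}')
print('Goal met' if error <= target else 'Goal not met')
```

Every prediction here misses by exactly $8,000, so the mean squared error is
64,000,000 while the RMSE reads as $8,000, which is the explainability point
made above.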

### Project Plan

*What kind of technologies do we want to use, and how will we implement them?*

Models

- Linear Regression
- Ridge Regression
- Lasso
- Elastic Net
- Decision Tree
- Random Forest
- Support Vector Machine
- Gaussian Process Regression

Techniques

- Training/Testing Split
- Grid Search
- Polynomial Features
- Standard Scaling
- Min-Max Scaling
- One-Hot Encoding

## Data Understanding

### Data Descriptions

There is a lot of data to describe, and with it plenty of opportunity for
feature engineering, but it will take some time to review each category of
feature and understand it.

For each category of feature, the objective is to:

1. Understand what the feature represents and what it is telling us.
2. Define the range of possible values, whether categorical, discrete, or
   continuous.
3. Explore descriptive statistics: for numerical features, the mean, standard
   deviation, and percentiles; for categorical features, the most common and
   least common values, as well as the number of unique values.
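
The third step can be sketched as a small helper that picks the statistics
based on the column's type. This is only a sketch under assumptions:
`summarize_column` is a hypothetical name, and the toy frame stands in for the
real `df`.

```python
import pandas as pd


def summarize_column(data: pd.Series) -> pd.Series:
    """Return descriptive statistics suited to the column's type."""
    if pd.api.types.is_numeric_dtype(data):
        # Numerical: count, mean, standard deviation, and percentiles
        return data.describe()
    # Categorical: unique count plus most and least common values
    counts = data.value_counts()
    return pd.Series({
        'n_unique': data.nunique(),
        'most_common': counts.index[0],
        'least_common': counts.index[-1],
    }, name=data.name)


# Toy data standing in for the real dataset
toy = pd.DataFrame({
    'SalePrice': [200_000.0, 150_000.0, 320_000.0, 180_000.0],
    'SaleType': ['WD', 'WD', 'New', 'WD'],
})

print(summarize_column(toy['SalePrice']))
print(summarize_column(toy['SaleType']))
```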

#### Property & Sale Details

- `SalePrice` `[float]` - The target variable

- `SaleType` `[str]`
  - `WD` - Warranty Deed - Conventional
  - `CWD` - Warranty Deed - Cash
  - `VWD` - Warranty Deed - VA Loan
  - `New` - Home just constructed and sold
  - `COD` - Court Officer Deed/Estate
  - `Con` - Contract 15% Down payment regular terms
  - `ConLw` - Contract Low Down payment and low interest
  - `ConLI` - Contract Low Interest
  - `ConLD` - Contract Low Down
  - `Oth` - Other

- `SaleCondition` `[str]`
  - `Normal` - Normal Sale
  - `Abnorml` - Abnormal Sale - trade, foreclosure, short sale
  - `AdjLand` - Adjoining Land Purchase
  - `Alloca` - Allocation - two linked properties with separate deeds,
    typically condo with a garage unit
  - `Family` - Sale between family members
  - `Partial` - Home was not completed when last assessed (associated with New
    Homes)

In [None]:
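def display_numeric_summary(data: pd.Series, fmt: str='{:,}') -> None:
    '''
    Print the null count, show descriptive statistics, and draw a boxen plot
    for a numeric column. `fmt` is the format string applied to each x-axis
    tick label.
    '''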
    n_total = len(data)
    n_null, p_null = data.isna().sum(), data.isna().mean()
    print('Null Values:  {:,}/{:,} ({:.1%})'.format(n_null, n_total, p_null))
    display(data.describe().to_frame(data.name).T)

    # Figure
    plt.figure(figsize=(8, 2))
    ax = sns.boxenplot(x=data, color='cornflowerblue')
    ax.set_title(str(data.name))

    # X-Axis
    ticks = ax.get_xticks()
    formatted_labels = map(fmt.format, ticks)
    ax.set_xticks(ticks, formatted_labels)
    ax.set_xlim(0)
    ax.set_xlabel('')

    # Y-Axis
    ax.set_yticks([])

    # Spines
    for side in ['left', 'top', 'right']:
        ax.spines[side].set_visible(False)

    plt.show()

In [None]:
# Display Null Values
display(pd.concat(axis=1, objs=[
    df[['SalePrice', 'SaleType', 'SaleCondition']]
        .isna()
        .sum()
        .to_frame('Nulls')
        .map('{:,.0f}'.format),
    df[['SalePrice', 'SaleType', 'SaleCondition']]
        .isna()
        .mean()
        .to_frame('% Null')
        .map('{:.1%}'.format)
]))


# Display Distributions
fig, (ax0, ax1, ax2) = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))
color = 'cornflowerblue'

df.SalePrice.plot(
    ax=ax0,
    kind='hist',
    title='SalePrice',
    color=color,
)
df.SaleType.value_counts(dropna=False).plot(
    ax=ax1,
    kind='bar',
    color=color,
    title='SaleType',
    logy=True,
    xlabel=''
)
df.SaleCondition.value_counts(dropna=False).plot(
    ax=ax2,
    kind='bar',
    color=color,
    title='SaleCondition',
    logy=True,
    xlabel=''
)

plt.show()

> **Notes:**
>
> There are quite a few home prices that are outliers -- we have a strong skew
> to the right. It may be useful to apply a log transform to reduce the skew
> and bring the distribution closer to normal.
>
> There is also a wide gap between the 'WD' sale type (Warranty Deed, i.e.
> conventional sale) and the rest of the sale types. The same is true of
> Sale Condition. It may be useful to group rare categories into an "other"
> bucket to reduce the noise.
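
Both ideas in the notes can be sketched on toy data. The 25% cutoff, the
`Other` label, and the helper name `group_rare` are assumptions for
illustration; a sensible cutoff would have to come from the real category
counts.

```python
import numpy as np
import pandas as pd


def group_rare(data: pd.Series, min_share: float = 0.10,
               other: str = 'Other') -> pd.Series:
    """Replace categories rarer than `min_share` with a single label."""
    shares = data.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    return data.where(~data.isin(rare), other)


# Toy data standing in for the real dataset
toy = pd.DataFrame({
    'SalePrice': [120_000, 150_000, 180_000, 210_000, 755_000],
    'SaleType': ['WD'] * 3 + ['New', 'COD'],
})

# log1p compresses the long right tail of the price distribution
toy['LogSalePrice'] = np.log1p(toy['SalePrice'])

# 'New' and 'COD' each make up 20% of rows, below the 25% cutoff
toy['SaleTypeGrouped'] = group_rare(toy['SaleType'], min_share=0.25)
print(toy)
```

Predictions made on the log scale can be mapped back to dollars with
`np.expm1`.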

<hr />

In [None]:
from ydata_profiling import ProfileReport

In [None]:
ProfileReport(df)

In [None]:
# Display columns 10-19, first 10 rows

> **Notes:**
>
> ...

In [None]:
# Display columns 20-29, first 10 rows

> **Notes:**
>
> ...

In [None]:
# Display columns 30-39, first 10 rows

> **Notes:**
>
> ...

In [None]:
# Display columns 40-49, first 10 rows

> **Notes:**
>
> ...

In [None]:
# Display columns 50-59, first 10 rows

> **Notes:**
>
> ...

In [None]:
# Display columns 60-69, first 10 rows

> **Notes:**
>
> ...

In [None]:
# Display columns 70-79, first 10 rows

> **Notes:**
>
> ...

In [None]:
# Display data types


> **Notes:**
>
> ...

In [None]:
# Understand the Null Distribution


> **Notes:**
>
> ...

### Exploratory Data Analysis

In [None]:
'''
Formulate questions in the previous section and answer them here.
'''

In [None]:
# ...

## Data Preparation

### Clean Data

In [None]:
'''
Address each of the data issues in the exploration section.
'''

### Transform Data I: Feature Engineering

In [None]:
'''
Create new features through combinations, etc.
'''

### Transform Data II: Feature Encoding & Standardization

In [None]:
'''
Encode any fields and apply any standardizations (min-max, standard scaling, etc.)
'''

## Modeling

### Select Modeling Techniques

In [None]:
'''
I hinted at this earlier, but there are several regression models we can apply
here:

1. Linear Regression
2. Ridge Regression
3. Lasso
4. Elastic Net
5. Decision Tree
6. Random Forest
7. Support Vector Machine
8. Gaussian Process Regression

While we could go as far as looking at neural networks to introduce more
sophisticated feature interactions, this may be overkill. However, if we don't
get the performance we'd like out of these models, we can circle back to deep
learning methods.
'''

### Generate Test Design

In [None]:
'''
As is typical with machine learning, we'll need to split our data into
training and testing sets. From there, we can apply cross-validation to each
model to determine its optimal parameters. A summary of our approach can be
seen below.

<div style="text-align: center;">
  <img
    src="https://scikit-learn.org/stable/_images/grid_search_workflow.png"
    width="400px"
    height="240px"
    style="background: white; padding: 10px;"
  />
</div>

For measuring performance, we'll use the root mean squared error selected in
the Metrics & Goals section. Because it is expressed in dollars, it can be
compared directly against the target of 5% of the average sale price.

Finally, to tune the hyperparameters, we can use a grid search to find the
optimal parameters. Note that, as the diagram above shows, the hyperparameter
tuning will be done with a validation set, leaving the test set for the final
model evaluation.

We can store the model, grid search results, hyperparameters, and performance
scores in a DataFrame for easy access.
'''
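
A minimal sketch of this split-then-tune design for a regression target,
scored with RMSE as chosen in the Metrics & Goals section. The synthetic data,
the choice of `Ridge`, and the `alpha` grid are all assumptions for
illustration, not the notebook's final setup.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the prepared features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out a test set that is never touched during tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Cross-validated grid search on the training set only
search = GridSearchCV(
    Ridge(),
    param_grid={'alpha': [0.01, 0.1, 1.0, 10.0]},
    scoring='neg_root_mean_squared_error',
    cv=5,
)
search.fit(X_train, y_train)

# scikit-learn maximizes scores, so RMSE is reported negated
test_rmse = -search.score(X_test, y_test)
print('Best params:', search.best_params_)
print(f'Test RMSE: {test_rmse:.3f}')
```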

### Build & Train the Model

In [None]:
# Import libraries


In [None]:
'''
# Setup dataframe to store model information
model_names = [
    'DecisionTree',
    'ElasticNet',
    'GaussianProcess',
    'Lasso',
    'LinearRegression',
    'RandomForest',
    'Ridge',
    'SVR',
]
columns = ['Model', 'GridSearchResults', 'OptimalParams', 'RMSE']
models = pd.DataFrame(index=model_names, columns=columns)
models
'''

In [None]:
'''
from sklearn.base import BaseEstimator
from sklearn.model_selection import GridSearchCV


def apply_grid_search(estimator: BaseEstimator, params: dict) -> pd.Series:
    # Tune on the training data with cross-validated (negated) RMSE
    model = GridSearchCV(
        estimator, params, scoring='neg_root_mean_squared_error', n_jobs=-1
    )
    model.fit(X_train, y_train)
    # score() returns the negated RMSE, so flip the sign for reporting
    rmse = -model.score(X_test, y_test)
    return pd.Series(dict(
        Model=model.best_estimator_,
        GridSearchResults=pd.DataFrame(model.cv_results_),
        OptimalParams=model.best_params_,
        RMSE=rmse,
    ))
'''

In [None]:
# ...