![Aerial View of Housing](housing-header.jpg)

The Ames Housing Dataset is one of the richest and most detailed datasets for
house price prediction. It was compiled by Dean De Cock (2011) as a modern
replacement for the Boston dataset. It offers nearly 80 features describing
aspects like neighborhood, lot size, and even garage quality.

The goal of this notebook is to use these features to predict `SalePrice`,
the target of this analysis. Because `SalePrice` is a continuous variable,
we will use regression techniques in our model.

We'll start by importing some initial modules and taking a preliminary look at
our dataset.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('ames.csv', index_col=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuilt    

As you can see, there are a lot of features to work with. That can be
overwhelming, but we'll address it in the EDA section. There are also a
significant number of null values to deal with (`Alley`, for example, has only
91 non-null entries out of 1,460), so we'll need to look at each affected
column.

**Workflow:**

I'll be following a modified version of the CRISP-DM framework for this
workflow, addressing:

1. Business Understanding
   1. Objectives
   2. Situation
   3. Metrics & Goals
   4. Project Plan
2. Data Understanding
   1. Data Descriptions
   2. Exploratory Data Analysis
3. Data Preparation
   1. Clean Data
   2. Transform Data I: Feature Engineering
   3. Transform Data II: Feature Encoding & Standardization
4. Modeling
   1. Select Modeling Techniques
   2. Generate Test Design
   3. Build & Train the Model
5. Evaluation

## Business Understanding

### Objectives: What are we hoping to get out of this?

Given the nature of this data, the ideal outcome of this analysis is a pricing
model that uses effective machine learning model(s) to predict how much a house
will sell for. I intentionally use the word "effective" to describe the models,
instead of something like "advanced" or "complex," because we don't need to
introduce complexity for its own sake. However, if the simpler models fail to
perform well on the dataset, we can start increasing their complexity.

There could be many reasons to need this model. For instance, we could have a
client in the real estate industry, such as an investor who finds that they
consistently overpay for properties when relying on conventional methods. We
could also be working for a lender who is looking to improve the appraisal
portion of their underwriting process. Or, we could simply be an over-eager
individual looking to purchase a home. Regardless of the purpose behind the
objective, a house price predictor is clearly a useful tool if set up properly.

### Situation

*In short, what resources do we have at our disposal?*

Our primary source of information is the *Ames Housing Dataset*. The original
dataset can be found on the American Statistical Association's [website][1]
and a cleaned version can be found in [this repository][2]. The latter source
also has a text file describing the various fields available to us. We'll come
back to that later.

Outside of these two files, we don't have much other data available to us. In
an industry setting, we'd likely have multiple sources to extract data from,
aggregating and joining across multiple database tables. We're both lucky and
unlucky here: we don't have much work to do around data extraction, but we're
also limited to the fields provided in the dataset. Even so, we still have a
large number of features to work with.

[1]: https://ww2.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt
[2]: https://github.com/melindaleung/Ames-Iowa-Housing-Dataset

### Metrics & Goals

*How do we measure our success, and where do we hope to land?*

We have a range of options for measuring the performance of a regression model.
Some of the most commonly used are mean squared error, mean absolute error,
root mean squared error, and mean absolute percentage error. I'm going with
**root mean squared error** for a couple of reasons:

- **Higher Penalty for Large Deviations:** Using a squared error (as opposed
  to absolute or percentage) means large errors are penalized
  disproportionately. It's better to be off by a few thousand here and there
  than to make a major miscalculation on a home purchase.
- **Explainability:** When we present this to non-technical stakeholders,
  saying the model has an error of $8,000 is much clearer than saying it has
  a squared error of 64,000,000. With a squared error, this can only be
  achieved by taking the root of the mean, which puts the error back in
  dollars.

With that said, I would like to get our model's error within 5% of the home
values, meaning the target error in dollars will depend on the average sale
price in the dataset.
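
To make the metric concrete, here is a minimal sketch of how RMSE and the 5%
target could be computed. The helper names `rmse` and `rmse_target` and the
prices below are made up for illustration; they are not taken from the dataset.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (dollars)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmse_target(y_true, tolerance=0.05):
    """Target error: a fraction (assumed 5%) of the average sale price."""
    return tolerance * float(np.mean(y_true))


# Hypothetical sale prices and predictions, each off by exactly $8,000
actual = np.array([200_000, 150_000, 300_000, 250_000])
predicted = np.array([208_000, 142_000, 308_000, 242_000])

error = rmse(actual, predicted)
target = rmse_target(actual)
print(f"RMSE: ${error:,.0f} (target: ${target:,.0f})")
```

On these toy numbers the mean squared error is 64,000,000, while the RMSE is
8,000, which is the figure we would report.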

### Project Plan

*What kind of technologies do we want to use, and how will we implement them?*

Models

- Linear Regression
- Ridge Regression
- Lasso
- Elastic Net
- Decision Tree
- Random Forest
- Support Vector Machine
- Gaussian Process Regression

Techniques

- 

In [None]:
'''
What kind of technologies do we want to use, and how will we implement them?

We'll need to revisit this as we gain further understanding of the data, but
initially, we could make use of:

- Linear Regression
- Ridge Regression
- Lasso
- Elastic Net
- Decision Tree
- Random Forest
- Support Vector Machine
- Gaussian Process Regression

Most of these are readily available in the Scikit-Learn library. In addition,
this library also provides tools for:

- Train-Test splits
- K-fold cross-validation
- Grid search CV
- Regression error metrics
'''

## Data Understanding

In [None]:
# Import modules


### Data Descriptions

In [None]:
# Display columns 0-9, first 10 rows
df.iloc[:, 0:10].head(10)

> **Notes:**
>
> ...

In [None]:
# Display columns 10-19, first 10 rows
df.iloc[:, 10:20].head(10)

> **Notes:**
>
> ...

In [None]:
# Display columns 20-29, first 10 rows
df.iloc[:, 20:30].head(10)

> **Notes:**
>
> ...

In [None]:
# Display columns 30-39, first 10 rows
df.iloc[:, 30:40].head(10)

> **Notes:**
>
> ...

In [None]:
# Display columns 40-49, first 10 rows
df.iloc[:, 40:50].head(10)

> **Notes:**
>
> ...

In [None]:
# Display columns 50-59, first 10 rows
df.iloc[:, 50:60].head(10)

> **Notes:**
>
> ...

In [None]:
# Display columns 60-69, first 10 rows
df.iloc[:, 60:70].head(10)

> **Notes:**
>
> ...

In [None]:
# Display columns 70-79, first 10 rows
df.iloc[:, 70:80].head(10)

> **Notes:**
>
> ...

In [None]:
# Display data types
df.dtypes


> **Notes:**
>
> ...

In [None]:
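# Understand the Null Distribution: null counts per column, largest first
df.isna().sum().loc[lambda s: s > 0].sort_values(ascending=False)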


> **Notes:**
>
> ...

### Exploratory Data Analysis

In [None]:
'''
Formulate questions in the previous section and answer them here.
'''

In [None]:
# ...

## Data Preparation

### Clean Data

In [None]:
'''
Address each of the data issues in the exploration section.
'''
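
As a rough sketch of what one cleaning step might look like, the toy frame
below stands in for the real data. It rests on two assumptions that still need
to be checked against the data description: that a missing `Alley` means the
property has no alley access, and that the median is a reasonable fill for a
missing `LotFrontage`. The values themselves are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; values are invented for illustration
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "Alley": [np.nan, "Grvl", np.nan, "Pave"],
})

cleaned = toy.copy()

# Assumption: a missing Alley means "no alley access", so label it explicitly
cleaned["Alley"] = cleaned["Alley"].fillna("None")

# Assumption: the column median is an acceptable fill for missing frontage
frontage_median = cleaned["LotFrontage"].median()
cleaned["LotFrontage"] = cleaned["LotFrontage"].fillna(frontage_median)

print(cleaned)
```

Working on a copy keeps the raw frame intact, which makes it easy to compare
null counts before and after each step.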

### Transform Data I: Feature Engineering

In [None]:
'''
Create new features through combinations, etc.
'''

### Transform Data II: Feature Encoding & Standardization

In [None]:
'''
Encode any fields and apply any standardizations (min-max, standard scaling, etc.)
'''
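
One way this step could be wired up is with Scikit-Learn's
`ColumnTransformer`, scaling the numeric fields and one-hot encoding the
categorical ones. This is only a sketch under assumptions: the three columns
are a small hand-picked subset, the values are made up, and the choice of
`StandardScaler` over min-max scaling has not been decided yet.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for a few columns of the dataset; values are invented
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "OverallQual": [7, 6, 7, 7],
    "MSZoning": ["RL", "RL", "RM", "RL"],
})

numeric_cols = ["LotArea", "OverallQual"]
categorical_cols = ["MSZoning"]

preprocessor = ColumnTransformer(
    [
        # Standardize numeric columns to zero mean and unit variance
        ("num", StandardScaler(), numeric_cols),
        # One-hot encode categories; unseen categories become all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(toy)
print(X.shape)
```

Fitting the transformer on the training split only, and then applying it to
the test split, avoids leaking test-set statistics into the model.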

## Modeling

### Select Modeling Techniques

In [None]:
'''
I hinted at this earlier, but there are several models we can apply here:

1. Linear Regression
2. Ridge Regression
3. Lasso
4. Elastic Net
5. Decision Tree
6. Random Forest
7. Support Vector Machine
8. Gaussian Process Regression

While we could go as far as looking at neural networks to introduce more
sophisticated feature interactions, this may be overkill. However, if we don't
get the performance we'd like out of these models, we can circle back to deep
learning methods.
'''
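
For reference, the regression models listed in the Project Plan map onto
Scikit-Learn estimators roughly as follows. This is a sketch with default
hyperparameters; the dictionary name `candidates` and its keys are
placeholders chosen for illustration.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Candidate regressors, all left at their default settings for now
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(random_state=0),
    "SVR": SVR(),
    "GaussianProcess": GaussianProcessRegressor(),
}

for name, estimator in candidates.items():
    print(f"{name}: {type(estimator).__name__}")
```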

### Generate Test Design

In [None]:
'''
As is typical with machine learning, we'll need to split our data into
training and testing sets. From there, we can apply cross-validation to each
model to determine its optimal parameters. A summary of our approach can be
seen below.

<div style="text-align: center;">
  <img
    src="https://scikit-learn.org/stable/_images/grid_search_workflow.png"
    width="400px"
    height="240px"
    style="background: white; padding: 10px;"
  />
</div>

For measuring performance, we'll use root mean squared error, as discussed in
the Metrics & Goals section. It penalizes large misses more heavily and stays
in the same units as the sale price.

Finally, to tune the hyperparameters, we can use a grid search to find the
optimal parameters. Note that, as the diagram above shows, the hyperparameter
tuning will be done with a validation set, leaving the test set for the final
model evaluation.

We can store the model, grid search results, hyperparameters, and performance
scores in a DataFrame for easy access.
'''
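
The design above could look roughly like the following sketch. It uses
synthetic data from `make_regression` in place of the prepared housing
features, a single `Ridge` model, and a small made-up grid for `alpha`, so the
numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared features and SalePrice target
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Hold out a test set that is only used for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Grid search with 5-fold cross-validation on the training set only.
# Scikit-Learn maximizes scores, so RMSE is exposed in negated form.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

cv_rmse = -search.best_score_              # cross-validated RMSE of the best model
test_rmse = -search.score(X_test, y_test)  # RMSE on the held-out test set

print(search.best_params_, round(cv_rmse, 2), round(test_rmse, 2))
```

Because the grid search only ever sees the training split, the test RMSE stays
an honest estimate of how the tuned model performs on unseen data.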

### Build & Train the Model

In [None]:
# Import libraries


In [None]:
'''
# Set up a DataFrame to store model information
model_names = [
    'DecisionTree',
    'ElasticNet',
    'GaussianProcess',
    'Lasso',
    'LinearRegression',
    'RandomForest',
    'Ridge',
    'SVR',
]
columns = ['Model', 'GridSearchResults', 'OptimalParams', 'RMSE']
models = pd.DataFrame(index=model_names, columns=columns)
models
'''

In [None]:
'''
def apply_grid_search(estimator: BaseEstimator, params: dict) -> pd.Series:
    # Tune on cross-validated RMSE (scikit-learn exposes it in negated form)
    model = GridSearchCV(
        estimator, params, scoring='neg_root_mean_squared_error', n_jobs=-1
    )
    model.fit(X_train, y_train)
    # score() uses the same scorer, so flip the sign to recover the test RMSE
    rmse = -model.score(X_test, y_test)
    return pd.Series(dict(
        Model=model.best_estimator_,
        GridSearchResults=pd.DataFrame(model.cv_results_),
        OptimalParams=model.best_params_,
        RMSE=rmse,
    ))
'''

In [None]:
# ...