# Project 2: Predicting Sale Prices in Ames Housing Dataset

## Data Science problem

We have been tasked with creating a regression model to predict the price of a house at sale from the Ames Housing Dataset, under the guise of taking the model to a real estate company and selling them on the model's merits.

**What influences home purchase prices, and which features are needed to minimize out-of-sample prediction error?** What should homeowners improve to increase the value of their homes? Which features might not be worth investing in? If buyers are looking for certain features, how much should they expect to pay?

## Executive Summary

For this project, I utilized the Ames Housing Dataset, which contains home sales data between 2006 and 2010 from the municipality of Ames, IA. Two data files were provided, a training file used for building the regression model consisting of 2,051 records, and a separate, external test file used for validating the model's performance; the test data included 878 additional records.

In total, the Ames dataset includes 78 different features that describe a home and the land it occupies. The final column, SalePrice, only exists in the training file, and that is what we seek to predict in the test data file.

The two files were cleaned extensively in sequence, starting with the training data. Each cleaning step applied to the training file was then repeated on the test file, and any discrepancies between the two datasets were reconciled, for example by dropping feature levels that did not exist in one file or the other.
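
The reconciliation step can be sketched roughly as follows. This is a minimal illustration on toy data, not the notebooks' actual code: the frames, column names, and values are made up, and keeping only the shared dummy columns is one assumed way to drop levels that exist in just one file.

```python
import pandas as pd

# Toy stand-ins for the training and test files (illustrative values only).
train = pd.DataFrame({
    "Neighborhood": ["NAmes", "Gilbert", "NAmes", "StoneBr"],
    "Gr Liv Area": [1200, 1800, 950, 2400],
})
test = pd.DataFrame({
    "Neighborhood": ["NAmes", "Veenker"],  # "Veenker" never appears in train
    "Gr Liv Area": [1100, 1600],
})

# Dummy-code each file separately, as would happen when cleaning in sequence.
train_d = pd.get_dummies(train, columns=["Neighborhood"], dtype=int)
test_d = pd.get_dummies(test, columns=["Neighborhood"], dtype=int)

# Keep only the columns present in both files, in the training file's order,
# so levels seen in just one dataset are dropped from the other.
shared = [col for col in train_d.columns if col in test_d.columns]
train_d, test_d = train_d[shared], test_d[shared]

print(list(train_d.columns))
```

After this step both frames expose the same predictors in the same order, which is what a fitted model needs at prediction time.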

In addition to cleaning the files for missing or incorrect data, categorical variables were converted to 0/1 dummy variables. Additional feature engineering included creating interaction effects and non-linear terms (squared and/or cubed values of features), log-transforming variables with large magnitudes (including SalePrice), and flagging certain outlier values with additional threshold dummy variables. **The final candidate set of predictor variables totaled 179.**
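
As a rough sketch of these transformations, the snippet below builds one example of each kind of engineered feature on a toy frame. The column names follow the Ames naming style, but the specific features, the derived column names, and the 4,000 sq ft cutoff are assumptions for illustration, not the choices made in the notebooks.

```python
import numpy as np
import pandas as pd

# Toy data; the values are made up for illustration.
df = pd.DataFrame({
    "Gr Liv Area": [900, 1500, 2200, 4800],
    "Overall Qual": [4, 6, 8, 10],
    "Kitchen Qual": ["TA", "Gd", "Ex", "Ex"],
    "SalePrice": [110000, 175000, 290000, 610000],
})

# 0/1 dummy variables for a categorical feature.
X = pd.get_dummies(df.drop(columns="SalePrice"), columns=["Kitchen Qual"], dtype=int)

# Log-transform a large-magnitude predictor and the target.
X["log_gr_liv_area"] = np.log(X["Gr Liv Area"])
y = np.log(df["SalePrice"])

# Non-linear term (square) and an interaction effect.
X["overall_qual_sq"] = X["Overall Qual"] ** 2
X["qual_x_log_area"] = X["Overall Qual"] * X["log_gr_liv_area"]

# Threshold dummy that flags unusually large homes (hypothetical cutoff).
X["gr_liv_area_over_4000"] = (X["Gr Liv Area"] > 4000).astype(int)

print(X.columns.tolist())
```

Because the target is modeled on the log scale, predictions have to be passed through `np.exp` to get back to dollars.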

I ended up attempting **four different approaches** to the modeling, all with the log-transformed sale price as the dependent variable. These approaches included:
- **Standard linear regression** with a mixture of "raw" and log-transformed predictors, trying to find a relatively simple model to explain SalePrice
- **PowerTransformed regression** models, where all variables in the model are transformed prior to estimation using scikit-learn's `PowerTransformer` class
- **Ensemble regression modeling**, where I estimated six different PowerTransformed regression models, each including a core set of key predictors plus thematic subsets of additional predictors unique to each sub-model. The final predictions coming out of this approach were the average predicted sale prices across the six models.
- **LASSO regression model** - this model produced the **final, best predictions with the lowest error**. Here, all 179 candidate predictors were included, with the LASSO regularization ultimately selecting which features contribute to out-of-sample prediction accuracy by shrinking the coefficients of non-impactful features to zero. **The final model retained 107 of these features, with a testing-subsample R-squared of 0.8999**.
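
To illustrate how the LASSO approach selects features, here is a minimal sketch on synthetic data. The data, the use of `StandardScaler`, and the cross-validated penalty via `LassoCV` are assumptions made for the example; the notebook's actual preprocessing and tuning may differ, and the numbers printed here will not match the 107 features or 0.8999 R-squared reported above.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 30 candidate predictors, only 5 of which matter.
rng = np.random.default_rng(0)
n_rows, n_features = 400, 30
X = rng.normal(size=(n_rows, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [0.30, 0.20, -0.15, 0.10, 0.05]
y = 12.0 + X @ true_coef + rng.normal(scale=0.1, size=n_rows)  # plays the role of log price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scale the predictors so the L1 penalty treats them comparably,
# then let cross-validation choose the penalty strength.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=42))
model.fit(X_train, y_train)

lasso = model.named_steps["lassocv"]
n_kept = int(np.sum(lasso.coef_ != 0))  # features with non-zero coefficients
r2 = model.score(X_test, y_test)        # R-squared on the held-out subsample

print(f"kept {n_kept} of {n_features} features, test R^2 = {r2:.4f}")
```

Counting the non-zero entries of `coef_` is how a figure like "107 of 179 features retained" can be read off a fitted model.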

To understand the contribution of such a large set of predictors to sale price, I used the SHAP Python package to estimate Shapley values from the model. Shapley values come from game theory, where the goal is to quantify how much each player contributed to an outcome; here, each feature plays the role of a player and the model's prediction is the outcome being divided up.
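
The idea can be made concrete with a small hand computation. The sketch below does not call the SHAP package; it enumerates feature coalitions directly for a hypothetical three-feature linear model, assuming features are treated as independent (missing features are filled in from a background sample). Under that assumption, the Shapley value of a feature in a linear model reduces to its coefficient times the feature's deviation from the background mean, which the last lines check.

```python
from itertools import combinations
from math import factorial

import numpy as np

# Hypothetical linear model on three features (coefficients are made up).
rng = np.random.default_rng(1)
background = rng.normal(size=(200, 3))  # reference sample of feature rows
weights = np.array([0.4, -0.2, 0.1])
intercept = 12.0


def predict(X):
    return intercept + X @ weights


def coalition_value(x, subset):
    """Mean prediction when features in `subset` are fixed to x's values
    and the remaining features are taken from the background rows."""
    mixed = background.copy()
    idx = list(subset)
    if idx:
        mixed[:, idx] = x[idx]
    return predict(mixed).mean()


def shapley_values(x):
    n = len(x)
    phi = np.zeros(n)
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                gain = coalition_value(x, subset + (j,)) - coalition_value(x, subset)
                phi[j] += weight * gain
    return phi


x = np.array([1.5, -0.5, 2.0])
phi = shapley_values(x)
base_value = predict(background).mean()

# Contributions plus the average prediction add up to this row's prediction.
print(phi, base_value + phi.sum(), predict(x))
# For a linear model this matches coefficient * (value - background mean).
print(weights * (x - background.mean(axis=0)))
```

Exhaustive enumeration is only practical for a handful of features; with well over a hundred predictors, an estimation library such as SHAP is the realistic route.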

In addition to creating a final presentation based on the modeling, we entered our models in a private Kaggle competition open to the entire GA Data Science 11 cohort, with final rankings based on an additional holdout set of test data. My model finished in 8th place overall (out of 89 entries) and 1st in the Boston office, with a final root mean squared error of $19,155.
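
Since the models predict the log of sale price while the error above is quoted in dollars, predictions need to be converted back before scoring. A minimal sketch of that calculation, using a hypothetical helper name and made-up numbers:

```python
import numpy as np


def rmse_in_dollars(y_true_dollars, y_pred_log):
    """Root mean squared error in dollars for predictions made on the log scale."""
    y_pred_dollars = np.exp(np.asarray(y_pred_log, dtype=float))
    errors = np.asarray(y_true_dollars, dtype=float) - y_pred_dollars
    return float(np.sqrt(np.mean(errors ** 2)))


# Made-up example: two predictions are off by $10,000, one is exact.
actual = [150000.0, 200000.0, 320000.0]
predicted_log = np.log([160000.0, 190000.0, 320000.0])

score = rmse_in_dollars(actual, predicted_log)
print(f"RMSE: ${score:,.2f}")
```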

### Contents:
**Code**
- [02 Data Cleaning Train Data](./02%20data%20cleaning%20train%20data.ipynb)
- [03 Data Cleaning Test Data](./03%20data%20cleaning%20test%20data.ipynb)
- [04 Exploratory Data Analysis](./04%20exploratory%20data%20analysis.ipynb)
- [05 Standard Regression Models](./05%20standard%20models.ipynb)
- [06 PowerTransformer Models](./06%20powertransformer%20models.ipynb)
- [07 Ensemble Models](./07%20ensemble%20modeling.ipynb)
- [08 Lasso Model](./08%20lasso%20models.ipynb)

**Presentation**
- [09 Ames Housing Regression Modeling Report](../presentation/Ames%20Housing%20-%20Jon%20Godin%20v01.pdf)


## Conclusions and Recommendations

The LASSO regression model is an excellent approach when trying to achieve high accuracy/low model error in the presence of a large number of predictors. It also serves a dual purpose, allowing us to diagnose what drives home price value.

**Key drivers of sale price include**:
- Gross Living Area & 1st Floor Square Footage
- Overall Quality & Condition ratings of the house
- Age At Sale & Years Since Last Remodel/Addition
- Lot Area

**Additional features with high value include**:
- Excellent Kitchen Quality
- Finished basement with average or better exposure
- Full bath in the basement
- Fireplace
- At least a 2-car garage
- 2 or more full bathrooms in the house
- A screened porch, with larger sizes being more valued
- A paved driveway

**Recommendations**

Buyers should expect to pay more for larger homes, larger lots, and newer, better-quality homes, all else being equal.

Sellers should focus on maintaining their house and property, and making improvements related to the key features listed above.