# Project 2: Ames Housing Data and Kaggle Challenge

# Contents:

- Overview / Problem Statement
- Executive Summary
- Data Sources
- Data Dictionaries
- Data Import and Cleaning
- Exploratory Data Analysis
- Models Tested
- Data Visualization / Descriptive and Inferential Statistics
- Outside Research
- Conclusions and Recommendations
- Additional sources and references
---

# Overview
### Problem Statement:

Task: Create a machine learning model based on the Ames Housing Dataset to predict the price of a house at sale.

This model can be used by anyone interested in buying a house or assessing their property's value, including investors and home buyers.

In this project, I build a model on the Ames housing dataset, using a train/test split, to predict the sale price of a house. I also explore which features may be strong indicators of home price.

---
### Executive Summary:
The Ames Housing Dataset contains 2930 house sale observations from 2006 to 2010, collected from the Ames Assessor's Office, where the data was used to compute assessed values for residential properties. The training set contains 2051 homes sold in Ames, IA, and each observation has 81 features describing the house (e.g. square footage, building type, house style).

The model I submitted for the Kaggle competition was a linear regression model built on the integer and float columns plus the dummy-encoded 'Neighborhood' column. The data was passed through a pipeline that scaled the features, imputed the missing values, and transformed the target (Sale Price) with a transformed target regressor. The R² value for this model was 0.916 and the RMSE was 22762.

Although this model achieved a low RMSE, plotting its predictions against the true sale prices reveals a few large outliers. These suggest that the model does well on most observations but can still produce predictions that are very far from the actual price, so it may not be the best fit for the data. Because of this, the **production model** I am choosing is `pipe_production`, which uses more features than the Kaggle submission.

The features I found to be strong indicators of price relate to size (square footage of various parts of the house) and condition/quality (subjective ratings, such as the 1-10 Overall Qual score). Some of these features include:
- Overall Qual
- Gr Liv Area
- Garage Area
- Garage Cars
- Total Bsmt SF
- 1st Flr SF
- Neighborhood
- Exter Qual
- Bsmt Qual
- Kitchen Qual
- Garage Qual
- Garage Type

---
### Data Sources
- Train and Test Set provided on Kaggle.com:
https://www.kaggle.com/c/dsir-113020-project-2-regression-challenge/data

- Data Description:
http://jse.amstat.org/v19n3/decock/DataDocumentation.txt

---
### Notebook Organization
- Libraries and data imports
- EDA
    - Pairplot
    - Correlations
    - EDA
- Data Cleaning
    - Null values
    - Finding categorical/numerical data
- Build a model
    - Linear Regression
        - Model with only a few features
        - Model with Additional features and scaling
        - Model with Fewer features and scaling
        - Model with Scaling, Imputer, RFE
        - Model with Scaling, Imputer, RFE, GridsearchCV
        - Model with all object features
        - Model with Neighborhoods OneHotEncoded
        - Model with Neighborhoods dummy-encoded
        - Production model with additional combinations of categorical features
        - Kaggle Submission
    - Ridge Model
        - Ridge Model scaled, dummy-encoded
        - Ridge Model not scaled, dummy-encoded
    - Lasso
    - Elastic Net
    - KNN Model
    - Polynomial Features
- Evaluating Models
    - Actual vs. Predicted visualizations
    - Feature Coefficients
- Conclusions and Recommendations

### Data Dictionaries
Ames Housing Data Description

 - http://jse.amstat.org/v19n3/decock/DataDocumentation.txt
 
 ---

# EDA

### Data Import and Cleaning
Null Values:
- Columns with many null values (more than roughly 1000) were dropped: 'Alley', 'Fireplace Qu', 'Pool QC', 'Fence', and 'Misc Feature'.
- 'PID' (parcel identification number) was dropped because it is an ID value and does not indicate home price.
- The initial models used only integer and float features. Object features were added later as encoded columns.
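
The cleaning steps above could be sketched roughly as follows. This is a minimal illustration on a made-up four-row frame, not the notebook's actual code; the helper `drop_sparse_columns` and the toy values are hypothetical, and the null cutoff is scaled down to fit the toy data.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Ames training data; all values are made up.
df = pd.DataFrame({
    "PID": [101, 102, 103, 104],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "Gr Liv Area": [1500, 2100, 1200, 1800],
    "Garage Area": [400.0, np.nan, 250.0, 500.0],
    "Neighborhood": ["NAmes", "StoneBr", "NAmes", "Edwards"],
    "SalePrice": [150000, 320000, 110000, 175000],
})


def drop_sparse_columns(frame, max_nulls):
    """Drop every column whose null count exceeds max_nulls (hypothetical helper)."""
    null_counts = frame.isnull().sum()
    sparse = null_counts[null_counts > max_nulls].index
    return frame.drop(columns=sparse)


# The README uses a cutoff of roughly 1000 nulls on 2051 rows;
# 2 is used here only because the toy frame has 4 rows.
cleaned = drop_sparse_columns(df, max_nulls=2).drop(columns=["PID"])

# Keep only integer and float columns for the first round of models.
numeric = cleaned.select_dtypes(include="number")
print(list(numeric.columns))
```

Columns that are only partly missing, such as 'Garage Area' here, are kept and left for an imputer to fill later.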

---
### Exploratory Data Analysis
- The target (Sale Price) was roughly normally distributed but right-skewed, with a long right tail. Most sale prices fell between \$100,000 and \$200,000.
    - Because of this, the target is transformed with `TransformedTargetRegressor` to bring it closer to a normal distribution.
- The maximum home price was \$611,657.
- The minimum home price was \$12,789.
- The features most highly correlated with sale price include:
    - Gr Liv Area
    - Garage Area
    - Garage Cars
    - Total Bsmt SF
    - 1st Flr SF
- Several graphs were created to show the relationship between variables.
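
A skew check and correlation ranking like the ones described above might look like the sketch below. The five-row frame is invented for illustration, so the numbers it produces say nothing about the real dataset.

```python
import pandas as pd

# Made-up data: one strongly related feature, one weakly related feature.
df = pd.DataFrame({
    "Gr Liv Area": [1000, 1500, 2000, 2500, 3000],
    "Yr Sold": [2008, 2006, 2010, 2007, 2009],
    "Neighborhood": ["NAmes", "NAmes", "Edwards", "StoneBr", "StoneBr"],
    "SalePrice": [100000, 150000, 200000, 250000, 400000],
})

# Positive skew means a longer tail toward expensive homes.
target_skew = df["SalePrice"].skew()

# Pearson correlation of each numeric feature with the target, highest first.
correlations = (
    df.select_dtypes(include="number")
    .corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)

print(f"skew: {target_skew:.2f}")
print(correlations)
```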
---
### Models Tested

Some of the models I used to model the data include:
- Linear Regression
- Ridge (scaled/not scaled)
- Lasso
- Elastic Net
- KNN (with GridSearchCV)
- Polynomial Features
- Linear Regression with Feature Selection

I tried various combinations of feature selections together with `StandardScaler`, `SimpleImputer`, and `TransformedTargetRegressor`. I also used dummy encoding for categorical data in both the Kaggle model and the production model.
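
One way these pieces can fit together is sketched below on synthetic data. The README does not state which target transformation was used, so the `np.log1p` / `np.expm1` pair is an assumption, as are the generated feature values and the split settings. The scaler is placed before the imputer to follow the order described in the summary; `StandardScaler` ignores NaN when fitting and passes it through when transforming.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Ames features (values are generated, not real).
rng = np.random.default_rng(42)
n_rows = 200
X = pd.DataFrame({
    "Gr Liv Area": rng.uniform(800, 3000, n_rows),
    "Overall Qual": rng.integers(1, 11, n_rows).astype(float),
    "Neighborhood": rng.choice(["NAmes", "StoneBr", "Edwards"], n_rows),
})
log_price = (
    11
    + 0.0004 * X["Gr Liv Area"]
    + 0.08 * X["Overall Qual"]
    + rng.normal(0, 0.05, n_rows)
)
y = np.exp(log_price)

# Knock out a few values so the imputer has something to fill.
X.iloc[::25, 0] = np.nan

# Dummy-encode the categorical column, as in the Kaggle and production models.
X_encoded = pd.get_dummies(
    X, columns=["Neighborhood"], drop_first=True, dtype=float
)

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, random_state=42
)

# Scale, impute, then fit a linear regression on a log-transformed target.
# The log transform is an assumption made for this sketch.
model = TransformedTargetRegressor(
    regressor=make_pipeline(
        StandardScaler(), SimpleImputer(strategy="mean"), LinearRegression()
    ),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_train, y_train)

preds = model.predict(X_test)
r2 = r2_score(y_test, preds)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print(f"R2: {r2:.3f}, RMSE: {rmse:.0f}")
```

Because the inverse function is applied automatically in `predict`, the R² and RMSE are reported in dollars rather than on the transformed scale.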

### Baseline Score

The null model submitted on Kaggle had an RMSE of approximately 86080. This model used `np.full_like` to predict the average sale price for every home.
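
A null model of this kind can be sketched in a few lines. The prices below are invented so the arithmetic is easy to follow; they are not taken from the dataset.

```python
import numpy as np

# Made-up sale prices for illustration.
y_train = np.array([100000, 200000, 300000])
y_test = np.array([150000, 250000])

# Predict the training mean for every test home. dtype=float avoids
# truncating the mean when the target array holds integers.
baseline_preds = np.full_like(y_test, y_train.mean(), dtype=float)

baseline_rmse = np.sqrt(np.mean((y_test - baseline_preds) ** 2))
print(baseline_rmse)
```

Any fitted model should beat this RMSE to be worth using.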

---
### Data Visualization / Descriptive and Inferential Statistics
I made plots showing the actual sale price and the predicted sale price for each type of model. What is interesting to note are the trends associated with each model. 

In the production model, the actual and predicted prices were consistent across the dataset, but the model tended to predict higher values for the sale price. As I added more categorical features while building the model, the R² value decreased and the RMSE increased, indicating that adding more features may create a worse-performing model.

For the Kaggle model, most predictions were reasonably close to the actual price. However, there were a few outliers (one noticeable prediction at ~\$2.5 M), which may indicate that this model does not perform well in all situations.

The ridge, lasso, elastic net, and KNN models were somewhat accurate but tended to be less accurate for higher-priced houses.

Looking at the scaled feature weights, features such as Garage Area, Gr Liv Area, 1st/2nd floor area, and basement square footage appear to be strong indicators in the models.
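
Scaled feature weights can be read off a fitted linear model as in the sketch below. The data is synthetic and the coefficients used to generate it are arbitrary, so only the mechanics carry over to the real models.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on different scales (generated, not real Ames data).
rng = np.random.default_rng(0)
n_rows = 300
X = pd.DataFrame({
    "Gr Liv Area": rng.uniform(800, 3000, n_rows),
    "Garage Area": rng.uniform(0, 900, n_rows),
})
y = 60 * X["Gr Liv Area"] + 20 * X["Garage Area"] + rng.normal(0, 5000, n_rows)

# After standardizing, each coefficient is the change in predicted price
# per one standard deviation of that feature, so the weights are comparable.
X_scaled = StandardScaler().fit_transform(X)
linreg = LinearRegression().fit(X_scaled, y)

weights = pd.Series(linreg.coef_, index=X.columns)
weights = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(weights)
```

Comparing weights this way only makes sense when the features were scaled before fitting; raw coefficients depend on each feature's units.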

---
# Conclusions and Recommendations

A linear regression model can explain approximately 85-90% of the variation in sale price.

Factors that may affect home price include:
- Features dealing with area/square footage
    - Gr Liv Area
    - Garage Area
    - Garage Cars
    - Total Bsmt SF
    - 1st Flr SF
    - Size and number of bedrooms/bathrooms
- Quality parameters
    - Overall Qual
    - Exter Qual
    - Bsmt Qual
    - Kitchen Qual
    - Garage Qual
    - Garage Type
- Neighborhood

Additional research may be needed to evaluate collinearity between features and to deal with outliers in some of the features and in some of the models' predictions.

In addition, updated data may provide new insights based on more recent house sale trends.

---
# Additional sources and references:
Data source: http://jse.amstat.org/v19n3/decock/DataDocumentation.txt 

Kaggle: https://www.kaggle.com/c/dsir-113020-project-2-regression-challenge

SciKit-Learn Linear Model Coefficient interpretation: https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html
