# Ames Housing Data Analysis

# Ames Housing Project Suggestions

Data science is not a linear process. In this project, in particular, you will likely find that EDA, data cleaning, and exploratory visualizations will constantly feed back into each other. Here's an example:

1. During basic EDA, you identify many missing values in a column/feature.
2. You consult the data dictionary and use domain knowledge to decide _what_ a missing value in this feature means.
3. You impute a reasonable value for the missing value.
4. You plot the distribution of your feature.
5. You realize what you imputed has negatively impacted your data quality.
6. You cycle back, re-load your clean data, re-think your approach, and find a better solution.

Then you move on to your next feature. _There are dozens of features in this dataset._

Figuring out programmatically concise and repeatable ways to clean and explore your data will save you a lot of time.
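As a minimal sketch of what "repeatable" could look like, the snippet below wraps the null check and the imputation in two small functions so the impute, plot, rethink loop can be rerun from the raw data at any time. The function names `missing_report` and `clean` are hypothetical, and the toy frame only borrows a few Ames-style column names for illustration; it is not the real dataset.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return the count and share of nulls per column, worst first."""
    counts = df.isna().sum()
    report = pd.DataFrame({"n_missing": counts, "pct_missing": counts / len(df)})
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)


def clean(raw, fill_values):
    """Return a cleaned copy; the raw frame is never mutated."""
    return raw.copy().fillna(value=fill_values)


# Toy data for illustration only (not the real Ames values).
raw = pd.DataFrame({
    "Lot Frontage": [65.0, np.nan, 80.0, np.nan],
    "Pool QC": [np.nan, "Gd", np.nan, np.nan],
    "SalePrice": [200000, 150000, 310000, 125000],
})

report = missing_report(raw)

# Assumed imputation choices: median for the numeric column,
# the string "None" where a null means "no pool".
cleaned = clean(raw, {
    "Lot Frontage": raw["Lot Frontage"].median(),
    "Pool QC": "None",
})
print(report)
```

Because `clean` works on a copy, changing your mind about an imputation only means editing the `fill_values` dictionary and rerunning the cell.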

The outline below does not necessarily cover everything you will want to do in your project, and you may choose to do some things in a slightly different order. Many students work in a single notebook for this project; others split the sections into separate notebooks. Check with your local instructor for their preference and further suggestions.


## Problem Statement

Predict the sale price of homes in the Ames, Iowa housing dataset.

## Executive Summary


- [01 EDA and Cleaning](./01_EDA_and_Cleaning.ipynb)
- [02 Preprocessing and Feature Engineering](./02_Preprocessing_and_Feature_Engineering.ipynb)
- [03 Model Benchmarks](./03_Model_Benchmarks.ipynb)
- [04 Model Tuning](./04_Model_Tuning.ipynb)
- [05 Production Model and Insights](./05_Production_Model_and_Insights.ipynb)
- [06 Kaggle Submissions](./06_Kaggle_Submissions.ipynb)


### Notebook Outline
- [Visualizing the Elastic-Net](#intro)

<a id='intro'></a>

## Overview of regularization

---

**The goal of "regularizing" regression models is to structurally prevent overfitting by imposing a penalty on the coefficients of the model.**
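To make the effect of the penalty concrete, here is a small sketch that fits scikit-learn's `Ridge` at increasing penalty strengths and records the size of the coefficient vector. The data is synthetic (`make_regression`), and the specific `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data: few rows relative to features, so overfitting is plausible.
X, y = make_regression(n_samples=60, n_features=20, noise=15.0, random_state=0)

alphas = [0.01, 1.0, 100.0, 10000.0]
norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    # L2 norm of the coefficients: a single number summarizing their size.
    norms.append(float(np.linalg.norm(model.coef_)))

for alpha, norm in zip(alphas, norms):
    print(f"alpha={alpha:>8}: ||coef|| = {norm:.2f}")
```

A larger `alpha` means a heavier penalty, so the coefficients shrink toward zero as `alpha` grows.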

## Exploratory Data Analysis

- **Read the data dictionary.**
- Determine _what_ missing values mean.
- Figure out what each categorical value represents.
- Identify outliers.
- Consider whether discrete values are better represented as categorical or continuous. (Are relationships to the target linear?)
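Two of the checks above can be sketched in a few lines of pandas. The example below uses toy values with Ames-style column names (an assumption for illustration, not real data): it compares the mean target per level of a discrete feature, and flags outliers with the 1.5 × IQR rule, which is one common convention among several.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "Overall Qual": [3, 5, 5, 7, 7, 9, 9, 5],
    "Gr Liv Area": [900, 1200, 1250, 1600, 1700, 2400, 5600, 1300],
    "SalePrice": [90000, 140000, 150000, 210000, 220000, 400000, 180000, 145000],
})

# Mean target per level: roughly even steps suggest treating the feature
# as continuous; uneven jumps suggest treating it as categorical.
level_means = df.groupby("Overall Qual")["SalePrice"].mean()

# 1.5 * IQR rule for flagging candidate outliers.
q1, q3 = df["Gr Liv Area"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["Gr Liv Area"] < q1 - 1.5 * iqr) | (df["Gr Liv Area"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(level_means)
print(outliers)
```

A flagged row is only a candidate: whether to drop, cap, or keep it is a judgment call that belongs in the cleaning step.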


## Data Cleaning

- Decide how to impute null values.
- Decide how to handle outliers.
- Do you want to combine any features?
- Do you want to have interaction terms?
- Do you want to manually drop collinear features?
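The last three questions can be explored with a short sketch like the one below. The combined feature `Total SF`, the interaction `Qual x SF`, the helper `collinear_pairs`, and the 0.9 threshold are all hypothetical choices for illustration, and the values are toy data.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "1st Flr SF": [800, 1000, 1200, 1500, 900],
    "2nd Flr SF": [0, 500, 0, 700, 400],
    "Garage Cars": [1, 2, 2, 3, 1],
    "Garage Area": [250, 480, 500, 800, 240],
    "Overall Qual": [4, 6, 5, 8, 5],
})

# Combine related features into one.
df["Total SF"] = df["1st Flr SF"] + df["2nd Flr SF"]

# Interaction term: the effect of size may depend on quality.
df["Qual x SF"] = df["Overall Qual"] * df["Total SF"]


def collinear_pairs(frame, threshold=0.9):
    """List feature pairs whose absolute correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


base_cols = ["1st Flr SF", "2nd Flr SF", "Garage Cars", "Garage Area"]
pairs = collinear_pairs(df[base_cols])
print(pairs)
```

For each flagged pair, you might keep whichever feature is easier to explain or more strongly related to the target.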

## Exploratory Visualization

- Look at distributions.
- Look at correlations.
- Look at relationships to target (scatter plots for continuous, box plots for categorical).
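Before plotting, it can help to compute the numbers the plots will show. The sketch below, on toy data with Ames-style column names (an assumption for illustration), ranks numeric features by correlation with the target and computes the per-category quartiles that a box plot would draw.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "Gr Liv Area": [900, 1200, 1500, 1800, 2100, 2400],
    "Year Built": [1950, 1995, 1960, 2005, 1980, 2010],
    "Neighborhood": ["Edwards", "NAmes", "Edwards", "NAmes", "StoneBr", "StoneBr"],
    "SalePrice": [100000, 150000, 160000, 230000, 300000, 380000],
})

# Correlation of each numeric feature with the target, strongest first.
target_corr = (
    df.corr(numeric_only=True)["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)

# The quartiles a box plot of SalePrice by Neighborhood would display.
by_hood = df.groupby("Neighborhood")["SalePrice"].describe()[["25%", "50%", "75%"]]

print(target_corr)
print(by_hood)
```

The features at the top of `target_corr` are natural first candidates for scatter plots against the target.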

## Pre-processing

- One-hot encode categorical variables.
- Train/test split your data.
- Scale your data.
- Consider using automated feature selection.
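The first three steps can be sketched with scikit-learn's `ColumnTransformer`. This version splits first and fits the encoder and scaler on the training rows only, which is one way to keep test-set information out of the fitted transformers; the ordering, the toy data, and the column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data for illustration only.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "Edwards", "NAmes", "StoneBr",
                     "Edwards", "NAmes", "StoneBr", "NAmes"],
    "Gr Liv Area": [1200, 900, 1500, 2400, 1000, 1300, 2600, 1100],
    "SalePrice": [150000, 110000, 180000, 350000,
                  120000, 160000, 390000, 140000],
})

X = df.drop(columns="SalePrice")
y = df["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

preprocess = ColumnTransformer(
    [
        # handle_unknown="ignore" encodes categories unseen in training as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
        ("num", StandardScaler(), ["Gr Liv Area"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training rows only, then apply the same transform to the test rows.
X_train_p = preprocess.fit_transform(X_train)
X_test_p = preprocess.transform(X_test)
print(X_train_p.shape, X_test_p.shape)
```

Keeping the transformers in one object also means the exact same preprocessing can be applied later to the Kaggle test file.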

## Modeling

- **Establish your baseline score.**
- Fit linear regression. Look at your coefficients. Are any of them wildly overblown?
- Fit lasso/ridge/elastic net with default parameters.
- Go back and remove features that might be causing issues in your models.
- Tune hyperparameters.
- **Identify a production model.** (This does not have to be your best performing Kaggle model, but rather the model that best answers your problem statement.)
- Refine and interpret your production model.
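A minimal sketch of the first few steps, assuming scikit-learn and synthetic data from `make_regression` in place of the preprocessed Ames features: it scores a mean-predicting baseline alongside linear, ridge, lasso, and elastic net models with default parameters, then ranks the linear coefficients by magnitude.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed features and target.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=5, noise=10.0, random_state=0
)

models = {
    "baseline": DummyRegressor(strategy="mean"),  # always predicts the mean
    "linear": LinearRegression(),
    "ridge": Ridge(),
    "lasso": Lasso(),
    "elastic_net": ElasticNet(),
}

# Mean 5-fold cross-validated R^2 for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name:>12}: R^2 = {score:.3f}")

# Coefficients sorted by absolute size, to spot any that look overblown.
coefs = pd.Series(LinearRegression().fit(X, y).coef_).sort_values(
    key=abs, ascending=False
)
print(coefs)
```

Any model worth keeping should clearly beat the baseline row; tuning `alpha` (and `l1_ratio` for elastic net) comes after this first comparison.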

## Inferential Visualizations
- Look at feature loadings.
- Look at how accurate your predictions are.
- Is there a pattern to your errors? Consider reworking your model to address this.
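One way to look for a pattern in the errors is to inspect the residuals directly. In the sketch below the target is constructed to grow exponentially with a single feature (a deliberately artificial assumption), so a straight-line fit under-predicts at both ends and over-predicts in the middle. Modeling the log of the target is shown as one possible rework, not the only one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Artificial data: the target grows exponentially with the feature.
x = np.linspace(0, 4, 50).reshape(-1, 1)
y = 100_000 * np.exp(0.5 * x.ravel())

# Straight-line fit on the raw target.
linear = LinearRegression().fit(x, y)
resid = y - linear.predict(x)
rmse = float(np.sqrt(np.mean(resid ** 2)))

# Same model on the log of the target.
log_model = LinearRegression().fit(x, np.log(y))
log_resid = np.log(y) - log_model.predict(x)

print(f"RMSE on raw target: {rmse:,.0f}")
print("raw residuals at start / middle / end:",
      resid[0].round(), resid[len(resid) // 2].round(), resid[-1].round())
print("largest log-scale residual:", np.abs(log_resid).max())
```

If you model the log of the price, remember to apply `np.exp` to the predictions before reporting errors in dollars or submitting to Kaggle.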

## Business Recommendations
- Which features appear to add the most value to a home?
- Which features hurt the value of a home the most?
- What are things that homeowners could improve in their homes to increase the value?
- What neighborhoods seem like they might be a good investment?
- Do you feel that this model will generalize to other cities? How co

**Next:** [2. Preprocessing and Feature Engineering](./02_Preprocessing_and_Feature_Engineering.ipynb)