# Regression and Classification with the Ames Housing Data

---

## Premise: 
I have just joined a new "full stack" real estate company in Ames, Iowa. The strategy of the firm is two-fold:
- Own the entire process from the purchase of the land all the way to sale of the house, and anything in between.
- Use statistical analysis to optimize investment and maximize return.

The company is still small. Although its investment is substantial, its short-term goals focus on purchasing and flipping existing houses rather than constructing entirely new ones. That said, the company has access to a large construction workforce operating at rock-bottom prices.

This project uses the [Ames housing data recently made available on Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

## Data:
The data and the full description of its features are provided in two separate files:

    housing.csv
    data_description.txt

## Objective:
1. Develop an algorithm to estimate the value of the residential homes based on fixed characteristics (those that are not considered easy to renovate).
2. Identify characteristics of homes that the company can cost-effectively change/renovate with their construction team.
3. Evaluate the mean dollar value of different renovations.

Use this information to buy homes that are likely to sell for more than their initial purchase price.

## Process:
1. Perform any cleaning, feature engineering, and EDA you deem necessary.
2. Be sure to remove any houses that are not residential from the dataset.
3. Identify **fixed** features that can predict price.
4. Train a model on pre-2010 data and evaluate its performance on the 2010 houses.
5. Characterize and evaluate the model. (How does it perform and what are the best estimates of price?)
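
The filtering and time-based split described in the steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual pipeline: the `toy` frame and its values are made up, the set of residential zoning codes is my reading of `data_description.txt`, and the one-feature least-squares fit on `GrLivArea` is only a placeholder baseline.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data; all values are invented.
toy = pd.DataFrame({
    "MSZoning":  ["RL", "RM", "C (all)", "RL", "FV", "RH", "RL", "RM"],
    "GrLivArea": [1500, 1200, 900, 2000, 1700, 1100, 1800, 1300],
    "YrSold":    [2006, 2007, 2008, 2008, 2009, 2009, 2010, 2010],
    "SalePrice": [180000, 140000, 60000, 250000, 210000, 125000, 220000, 150000],
})

# Assumption: these MSZoning codes mark residential lots.
RESIDENTIAL_ZONES = {"FV", "RH", "RL", "RP", "RM"}
residential = toy[toy["MSZoning"].isin(RESIDENTIAL_ZONES)]

# Train on houses sold before 2010, hold out the 2010 sales.
train = residential[residential["YrSold"] < 2010]
test = residential[residential["YrSold"] == 2010]

def design(frame):
    """Design matrix with an intercept and one fixed feature."""
    area = frame["GrLivArea"].to_numpy(dtype=float)
    return np.column_stack([np.ones(len(frame)), area])

# Ordinary least squares fit on the training years only.
y_train = train["SalePrice"].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(design(train), y_train, rcond=None)

# Evaluate on the held-out 2010 sales with root mean squared error.
y_test = test["SalePrice"].to_numpy(dtype=float)
pred = design(test) @ coef
rmse = float(np.sqrt(np.mean((y_test - pred) ** 2)))
print(f"train rows: {len(train)}, test rows: {len(test)}, RMSE: {rmse:,.0f}")
```

Splitting on `YrSold` rather than at random keeps the evaluation honest about the task: the model only ever sees sales that happened before the ones it is asked to price.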

## Citations:
- http://www.remodeling.hw.net/cost-vs-value/2010/west-north-central/des-moines-ia/
- http://cdnassets.hw.net/b6/3d/047accdd4174a4965051631d7900/cvv-2010-2011-professional-desmoinesia.pdf

In [1]:
!pip install display --quiet

In [2]:
import sys
import numpy as np, pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt, seaborn as sns

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

### Load and Inspect the Data

In [4]:
from os import chdir

In [5]:
pwd

'/home/jovyan/DSI/DSI_Plus_1_Curriculum/project-three/project-three/ipynb'

In [8]:
chdir('../')

In [9]:
clean_house_df = pd.read_pickle('./assets/clean_house_df.p')

In [10]:
clean_house_df.sample(5)

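| Unnamed: 0 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | ... | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 621 | 622 | 60 | RL | 90.0 | 10800 | Pave | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | 0 | 0 | 0 | 6 | 2008 | WD | Normal | 240000 |
| 169 | 170 | 20 | RL | 70.0 | 16669 | Pave | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | 0 | 0 | 0 | 1 | 2006 | WD | Normal | 228000 |
| 368 | 369 | 20 | RL | 78.0 | 7800 | Pave | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | 0 | 0 | 0 | 3 | 2010 | WD | Normal | 132000 |
| 97 | 98 | 20 | RL | 73.0 | 10921 | Pave | Reg | HLS | AllPub | Inside | ... | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | WD | Normal | 94750 |
| 1221 | 1222 | 20 | RL | 55.0 | 8250 | Pave | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | 0 | 0 | 0 | 8 | 2008 | WD | Normal | 134000 |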


**The dataset occupies about 3.26 MB in memory, as reported by `sys.getsizeof` below:**

In [11]:
sys.getsizeof(clean_house_df)/1000000

3.258239
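
As a cross-check on the figure above, pandas can also report memory use directly through `DataFrame.memory_usage`. The sketch below uses a small made-up frame (`toy` is hypothetical, not the housing data) to show the per-column breakdown and the difference between the shallow and deep counts.

```python
import sys
import pandas as pd

# Hypothetical frame with one text column and one integer column.
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "FV"] * 1000,
    "SalePrice": [180000, 140000, 210000] * 1000,
})

# Shallow count: array buffers only, so text columns may be undercounted.
shallow = int(toy.memory_usage(deep=False).sum())

# Deep count: also inspects the Python objects held in object columns.
deep = int(toy.memory_usage(deep=True).sum())

# Size reported by sys.getsizeof, as used in the cell above.
total = sys.getsizeof(toy)

# Per-column breakdown in megabytes, largest first.
per_column_mb = (toy.memory_usage(deep=True) / 1_000_000).sort_values(ascending=False)

print(per_column_mb)
print(f"shallow: {shallow} bytes, deep: {deep} bytes, getsizeof: {total} bytes")
```

The per-column view is useful when deciding which text columns are worth converting to the `category` dtype.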