<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 1: Standardized Test Analysis

--- 
# Part 1

Part 1 requires knowledge of basic Python.

---

## Problem Statement

Can we predict home sale prices in the Ames, Iowa market from features of the home? Which features are most strongly correlated with price, and which have the largest effect on the predicted value (the largest coefficients in the linear model)?

Additionally, can a linear regression model explain more than half of the variation in price ($R^2$ greater than 0.5)?
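
As a rough illustration of the first two questions, the sketch below ranks features by their correlation with price and then compares coefficients fitted on standardized features. The data is synthetic and the lowercase column names are placeholders; nothing here is a result from the Ames data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (NOT the Ames data)
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "gr_liv_area": rng.normal(1500, 400, n),
    "lot_area": rng.normal(10000, 3000, n),
})
demo["saleprice"] = (100 * demo["gr_liv_area"] + 2 * demo["lot_area"]
                     + rng.normal(0, 20000, n))

features = ["gr_liv_area", "lot_area"]

# Absolute correlation of each feature with price, strongest first
corr_with_price = (demo[features].corrwith(demo["saleprice"])
                   .abs().sort_values(ascending=False))
print(corr_with_price)

# Least-squares coefficients on z-scored features, so that the
# magnitudes are comparable even though the raw units differ
z = (demo[features] - demo[features].mean()) / demo[features].std()
X = np.column_stack([np.ones(n), z.to_numpy()])  # intercept + features
coefs, *_ = np.linalg.lstsq(X, demo["saleprice"].to_numpy(), rcond=None)
std_coefs = pd.Series(coefs[1:], index=features)
print(std_coefs)
```

Raw coefficients depend on each feature's units, so comparing them across features is only meaningful once the features are on a common scale, which is why the sketch standardizes first.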

### Contents:
- [Background](#Background)
- [Data Import & Cleaning](#Data-Import-and-Cleaning)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Data Visualization](#Visualize-the-Data)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

## Background

To-Do: Background about Ames

### Datasets
To analyze this problem, we are using housing data from Ames, Iowa.

All of these datasets are in the GitHub repo under datasets/.

### Outside Research

To-Do: Additional Research for Housing stuff

### Data cleaning functions
    
As needed

You will use these functions later on in the project!

In [5]:
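def str_to_num(x):
    '''Take a numeric string that may contain commas (e.g. '1,234.5'),
    remove the commas, and return a float. Any other input is returned unchanged.'''
    try:
        return float(x.replace(',', ''))
    except (AttributeError, TypeError, ValueError):
        # x is not a string, or the remaining text is not a number
        return x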

In [6]:
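def pct_to_dec(x):
    '''Take a percent string (e.g. '30.5%'), strip the % sign, and return it
    as a decimal (0.305). Any other input is returned unchanged.'''
    try:
        return float(x.strip('%')) / 100
    except (AttributeError, TypeError, ValueError):
        # x is not a string, or the remaining text is not a number
        return x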

In [7]:
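def pct_from_total(x):
    '''Take a sequence whose first item is the total and whose remaining items are subgroups.
    Return a list with the total first and each subgroup as a fraction of that total.'''
    try:
        return [x[0]] + [item / x[0] for item in x[1:]]
    except (TypeError, IndexError, KeyError, ZeroDivisionError):
        # x is not an indexable sequence of numbers, is empty, or has a zero total
        return x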

--- 
# Part 2

Part 2 requires knowledge of Pandas, EDA, data cleaning, and data visualization.

---

*All libraries used should be added here*

In [1]:
# Imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

## Data Import and Cleaning


Import the datasets that you selected for this project and go through the following steps at a minimum. You are welcome to do further cleaning as you feel necessary:
1. Display the data: print the first 5 rows of each dataframe to your Jupyter notebook.
2. Check for missing values.
3. Check for any obvious issues with the observations (keep in mind the minimum & maximum possible values for each test/subtest).
4. Fix any errors you identified in steps 2-3.
5. Display the data types of each feature.
6. Fix any incorrect data types found in step 5.
    - Fix any individual values preventing other columns from being the appropriate type.
    - If your dataset has a column of percents (ex. '50%', '30.5%', etc.), use the function you wrote in Part 1 (coding challenges, number 3) to convert this to floats! *Hint*: use `.map()` or `.apply()`.
7. Rename Columns.
    - Column names should be all lowercase.
    - Column names should not contain spaces (underscores will suffice--this allows for using the `df.column_name` method to access columns in addition to `df['column_name']`).
    - Column names should be unique and informative.
8. Drop unnecessary rows (if needed).
9. Merge dataframes that can be merged.
10. Perform any additional cleaning that you feel is necessary.
11. Save your cleaned and merged dataframes as csv files.
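
A minimal sketch of steps 2, 5, 6, and 7 on a tiny made-up frame. The `Pct Finished` column and its values are hypothetical and are only there to exercise the percent helper; `pct_to_dec` is repeated from Part 1 so the snippet runs on its own.

```python
import pandas as pd

def pct_to_dec(x):
    """Copy of the Part 1 helper: '30.5%' -> 0.305; other input passes through."""
    try:
        return float(x.strip('%')) / 100
    except (AttributeError, TypeError, ValueError):
        return x

# Hypothetical raw data with the kinds of problems listed above
raw = pd.DataFrame({
    "Lot Area": [13517, 11492, 7922],
    "Pct Finished": ["50%", "30.5%", None],      # made-up column
    "Sale Price": ["130,500", "220,000", "109,000"],
})

print(raw.isnull().sum())   # step 2: missing values per column
print(raw.dtypes)           # step 5: data types

# Step 6: fix the columns that were read in as strings
clean = raw.copy()
clean["Pct Finished"] = clean["Pct Finished"].map(pct_to_dec)
clean["Sale Price"] = (clean["Sale Price"]
                       .str.replace(",", "", regex=False)
                       .astype(float))

# Step 7: lowercase column names with underscores instead of spaces
clean.columns = clean.columns.str.lower().str.replace(" ", "_", regex=False)
print(clean)
```

After the rename, columns can be reached either as `clean['lot_area']` or as `clean.lot_area`.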

In [59]:
ames = pd.read_csv('./datasets/train.csv')
ames.head()

| |Id|PID|MS SubClass|MS Zoning|Lot Frontage|Lot Area|Street|Alley|Lot Shape|Land Contour|...|Screen Porch|Pool Area|Pool QC|Fence|Misc Feature|Misc Val|Mo Sold|Yr Sold|Sale Type|SalePrice|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|109|533352170|60|RL|NaN|13517|Pave|NaN|IR1|Lvl|...|0|0|NaN|NaN|NaN|0|3|2010|WD|130500|
|1|544|531379050|60|RL|43.0|11492|Pave|NaN|IR1|Lvl|...|0|0|NaN|NaN|NaN|0|4|2009|WD|220000|
|2|153|535304180|20|RL|68.0|7922|Pave|NaN|Reg|Lvl|...|0|0|NaN|NaN|NaN|0|1|2010|WD|109000|
|3|318|916386060|60|RL|73.0|9802|Pave|NaN|Reg|Lvl|...|0|0|NaN|NaN|NaN|0|4|2010|WD|174000|
|4|255|906425045|50|RL|82.0|14235|Pave|NaN|IR1|Lvl|...|0|0|NaN|NaN|NaN|0|3|2010|WD|138500|


#### Features I'd want to look into further.

I spent a couple of hours working through all the features, how they're categorized, and the real-estate jargon behind them. The list below is the result of that time. Not much modeling or cleaning got done today, but now I know where to focus my efforts.
Cross-referenced against the data dictionary:
http://jse.amstat.org/v19n3/decock/DataDocumentation.txt

* MS Zoning - zoning classification (low density vs industrial etc): useful, but need to dummify
* Lot Area - continuous, square footage
* Utilities - ordinal, need to dummify
* Neighborhood - useful, but is it already correlated to something else?
* Bldg Type - type of family dwelling (single-fam vs duplex vs townhouse etc)
* House Style - style (num of stories, split foyer)
* Overall Qual - ordinal, from very poor to very excellent
* Overall Cond - ordinal, from very poor to very excellent
* Year Remod/Add - discrete, year of latest construction (original date if no new construction)
* Exter Qual - ordinal, from poor to excellent
* Exter Cond - ordinal, from poor to excellent
* Bsmt Qual - height, from poor to excellent (NA)
* Bsmt Cond - general condition, from poor to excellent (NA)
* Bsmt Exposure - walkout exposure, from none to good
* BsmtFin Type 1 - quality of finished living quarters, from unf to good quality
* BsmtFin SF 1 - sq ft of finished basement
* Total Bsmt SF - total sq ft of basement
* Heating QC - quality of heating, from poor to excellent
* Central Air - binary, yes or no
* Gr Liv Area - above-ground living area in sq ft (seems to be mentioned in the documentation a lot)
* Bsmt Full Bath/Bsmt Half Bath/Full Bath/Half Bath - can all be combined into 'Baths'?
* Bedroom AbvGr - # of bedrooms above grade
* Kitchen AbvGr - # of kitchens above grade
* Kitchen Qual - quality of kitchen, from poor to excellent
* TotRms AbvGrd - total # of rooms above grade (excludes bathrooms)
* Functional - home functionality, from salvage only to typical
* Garage Finish - interior finish quality, from none to finished
* Garage Cars - num of cars that fit
* Garage Area - size in sq ft
* Garage Qual - quality, from poor to excellent (NA)
* Garage Cond - condition, from poor to excellent (NA)
* Wood Deck SF/Open Porch SF/Enclosed Porch/3Ssn Porch/Screen Porch - can all be combined into 'porch/deck sq ft'
* Misc Val - value of additional assets
* Mo Sold - month of sale
* Yr Sold - year of sale, but hard to use as a predictor since you can't repeat a year
* Sale Condition - normal vs abnormal/adjland/family/partial - would be good to investigate further
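
The two "can be combined" ideas in the list above could look something like this. The three rows are made up, and counting a half bath as 0.5 of a full bath is an assumption rather than anything the data documentation prescribes; the new column names `Baths` and `Porch Deck SF` are placeholders too.

```python
import pandas as pd

# Three made-up rows using the Ames column names
demo = pd.DataFrame({
    "Bsmt Full Bath": [0.0, 1.0, 1.0],
    "Bsmt Half Bath": [0.0, 0.0, 1.0],
    "Full Bath": [2, 2, 1],
    "Half Bath": [1, 0, 0],
    "Wood Deck SF": [0, 100, 158],
    "Open Porch SF": [44, 0, 0],
    "Enclosed Porch": [0, 0, 140],
    "3Ssn Porch": [0, 0, 0],
    "Screen Porch": [0, 0, 0],
})

# Assumption: a half bath counts as 0.5 of a full bath
demo["Baths"] = (demo["Full Bath"] + demo["Bsmt Full Bath"]
                 + 0.5 * (demo["Half Bath"] + demo["Bsmt Half Bath"]))

# Total outdoor square footage across all deck and porch columns
porch_cols = ["Wood Deck SF", "Open Porch SF", "Enclosed Porch",
              "3Ssn Porch", "Screen Porch"]
demo["Porch Deck SF"] = demo[porch_cols].sum(axis=1)

print(demo[["Baths", "Porch Deck SF"]])
```

In the real data the basement bath columns contain a few nulls, so they would need to be filled before summing.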

In [60]:
ames.shape
#2051,81

#ames.info()
# lots of nulls in lot frontage, alley,mas vnr type, mas vnr area, basement features,
# fireplace, garage features, pool quality (only 9 entries),fence, misc

(2051, 81)

In [53]:
# list the columns where more than 25% of the values are null
null_share = ames.isnull().mean()
null_features = null_share[null_share > 0.25].index.tolist()
print(null_features)
# dropping them is left commented out for now
# ames = ames.drop(columns=null_features)

['Alley', 'Fireplace Qu', 'Pool QC', 'Fence', 'Misc Feature']


In [66]:
ames.columns

Index(['Id', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
       'G

In [73]:
#new strategy: split all features into nominal/ordinal/discrete/continuous, 
#and make a quick model with continuous features only
#this info is in the data documentation

nominal = ['Id', 'PID', 'MS SubClass', 'MS Zoning', 'Street', 'Alley', 'Land Contour', 'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type', 'Foundation', 'Heating', 'Central Air', 'Garage Type', 'Misc Feature', 'Sale Type']
ordinal = ['Lot Shape', 'Utilities', 'Land Slope', 'Overall Qual', 'Overall Cond', 'Exter Qual', 'Exter Cond', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2', 'Heating QC', 'Electrical', 'Kitchen Qual', 'Functional', 'Fireplace Qu', 'Garage Finish', 'Garage Qual', 'Garage Cond', 'Paved Drive', 'Pool QC', 'Fence']
discrete = ['Year Built', 'Year Remod/Add', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces', 'Garage Yr Blt', 'Garage Cars', 'Mo Sold', 'Yr Sold']
continuous = ['Lot Frontage', 'Lot Area', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Garage Area', 'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Misc Val']

ames_continuous = ames[continuous]
ames_continuous

| |Lot Frontage|Lot Area|Mas Vnr Area|BsmtFin SF 1|BsmtFin SF 2|Bsmt Unf SF|Total Bsmt SF|1st Flr SF|2nd Flr SF|Low Qual Fin SF|Gr Liv Area|Garage Area|Wood Deck SF|Open Porch SF|Enclosed Porch|3Ssn Porch|Screen Porch|Pool Area|Misc Val|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|NaN|13517|289.0|533.0|0.0|192.0|725.0|725|754|0|1479|475.0|0|44|0|0|0|0|0|
|1|43.0|11492|132.0|637.0|0.0|276.0|913.0|913|1209|0|2122|559.0|0|74|0|0|0|0|0|
|2|68.0|7922|0.0|731.0|0.0|326.0|1057.0|1057|0|0|1057|246.0|0|52|0|0|0|0|0|
|3|73.0|9802|0.0|0.0|0.0|384.0|384.0|744|700|0|1444|400.0|100|0|0|0|0|0|0|
|4|82.0|14235|0.0|0.0|0.0|676.0|676.0|831|614|0|1445|484.0|0|59|0|0|0|0|0|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|2046|79.0|11449|0.0|1011.0|0.0|873.0|1884.0|1728|0|0|1728|520.0|0|276|0|0|0|0|0|
|2047|NaN|12342|0.0|262.0|0.0|599.0|861.0|861|0|0|861|539.0|158|0|0|0|0|0|0|
|2048|57.0|7558|0.0|0.0|0.0|896.0|896.0|1172|741|0|1913|342.0|0|0|0|0|0|0|0|
|2049|80.0|10400|0.0|155.0|750.0|295.0|1200.0|1200|0|0|1200|294.0|0|189|140|0|0|0|0|
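
Splitting the features into nominal and ordinal groups matters later because the two groups are usually encoded differently. A small sketch, on made-up rows, of one way this might go: the ordinal quality codes are mapped to integers (the data documentation orders them Po < Fa < TA < Gd < Ex) and a nominal column is one-hot encoded. The output column names are placeholders.

```python
import pandas as pd

# Made-up rows using two Ames columns
demo = pd.DataFrame({
    "Exter Qual": ["TA", "Gd", "Ex", "Fa"],   # ordinal
    "Central Air": ["Y", "Y", "N", "Y"],       # nominal
})

# Ordinal: preserve the ordering by mapping codes to integers
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
demo["exter_qual_num"] = demo["Exter Qual"].map(quality_scale)

# Nominal: one-hot encode, dropping the first level to avoid a redundant column
dummies = pd.get_dummies(demo["Central Air"], prefix="central_air", drop_first=True)

encoded = pd.concat([demo[["exter_qual_num"]], dummies], axis=1)
print(encoded)
```

Treating the quality scale as evenly spaced integers is itself a modeling assumption; dummifying the ordinal columns is the alternative if that assumption looks too strong.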


### Data Dictionary

Now that we've fixed our data, and given it appropriate names, let's create a [data dictionary](http://library.ucmerced.edu/node/10249). 

A data dictionary provides a quick overview of features/variables/columns, alongside data types and descriptions. The more descriptive you can be, the more useful this document is.

**My Data Dictionary**

|Feature|Type|Dataset|Description|
|---|---|---|---|
|2017_avg_income|*float*|all_state_info.csv|Median household income, by state, in 2017 (in USD)|
|2018_avg_income|*float*|all_state_info.csv|Median household income, by state, in 2018 (in USD)|
|2019_avg_income|*float*|all_state_info.csv|Median household income, by state, in 2019 (in USD)|
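
To avoid typing every row by hand, the skeleton of the table can be generated from a dataframe and the descriptions filled in afterwards. This is only a sketch: `data_dictionary_rows` is a hypothetical helper, and `demo` is a small stand-in for the cleaned dataframe.

```python
import pandas as pd

# Small stand-in frame; in the notebook this would be the cleaned dataframe
demo = pd.DataFrame({
    "lot_area": [13517, 11492],
    "central_air": ["Y", "Y"],
    "lot_frontage": [None, 43.0],
})

def data_dictionary_rows(df, dataset_name):
    """Return one markdown table row per column (Feature, Type, Dataset,
    Description), leaving the description empty to be written by hand."""
    return [f"|{col}|*{df[col].dtype}*|{dataset_name}||" for col in df.columns]

header = ["|Feature|Type|Dataset|Description|", "|---|---|---|---|"]
rows = data_dictionary_rows(demo, "train.csv")
print("\n".join(header + rows))
```

The printed text can be pasted into a markdown cell and the last column completed from the data documentation.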

## Exploratory Data Analysis

Complete the following steps to explore your data. You are welcome to do more EDA than the steps outlined here as you feel necessary:
1. Summary Statistics.
2. Use a **dictionary comprehension** to apply the standard deviation function you created in Part 1 to each numeric column in the dataframe.  **No loops**.
3. Investigate trends in the data.
    - Using sorting and/or masking (along with the `.head()` method to avoid printing our entire dataframe), consider questions relevant to your problem statement.
    - **You should comment on your findings at each step in a markdown cell below your code block**. Make sure you include at least one example of sorting your dataframe by a column, and one example of using boolean filtering (i.e., masking) to select a subset of the dataframe.
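
The steps above can be sketched on the five rows shown by `ames.head()` earlier (columns renamed to lowercase here for illustration). `Series.std()` stands in for a hand-written standard deviation function; note that pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# First five rows of three columns from the training data
demo = pd.DataFrame({
    "lot_area": [13517, 11492, 7922, 9802, 14235],
    "gr_liv_area": [1479, 2122, 1057, 1444, 1445],
    "saleprice": [130500, 220000, 109000, 174000, 138500],
})

# 1. Summary statistics
print(demo.describe())

# 2. Standard deviation of every numeric column via a dictionary comprehension
std_by_col = {col: demo[col].std()
              for col in demo.select_dtypes("number").columns}
print(std_by_col)

# 3a. Sorting: most expensive sales first
top_sales = demo.sort_values("saleprice", ascending=False).head(3)
print(top_sales)

# 3b. Boolean mask: homes with more than 1,450 sq ft above ground
large = demo[demo["gr_liv_area"] > 1450]
print(large)
```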

## Visualize the Data

There's not a magic bullet recommendation for the right number of plots to understand a given dataset, but visualizing your data is *always* a good idea. Not only does it allow you to quickly convey your findings (even if you have a non-technical audience), it will often reveal trends in your data that escaped you when you were looking only at numbers. It is important to not only create visualizations, but to **interpret your visualizations** as well.

**Every plot should**:
- Have a title
- Have axis labels
- Have appropriate tick labels
- Have legible text
- Demonstrate a meaningful and valid relationship
- Have an interpretation to aid understanding
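
A minimal sketch of a plot that covers the checklist, using the first five rows of `Gr Liv Area` and `SalePrice` from the tables earlier in this notebook. The `Agg` backend line is only there so the snippet runs outside a notebook; the title and label wording are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# First five rows of the training data
gr_liv_area = [1479, 2122, 1057, 1444, 1445]
sale_price = [130500, 220000, 109000, 174000, 138500]

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(gr_liv_area, sale_price, s=20)

# Title and axis labels, with units
ax.set_title("Sale Price vs. Above-Ground Living Area")
ax.set_xlabel("Above-ground living area (sq ft)")
ax.set_ylabel("Sale price (USD)")

# Plain tick labels so prices are not shown in scientific notation
ax.ticklabel_format(style="plain", axis="y")
fig.tight_layout()
```

The interpretation still has to be written by hand in a markdown cell below the plot.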

--- 
# Part 3

Part 3 requires knowledge of modeling and cross validation.

---
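
Since the problem statement sets a target of $R^2 > 0.5$, the score should be measured on data the model has not seen. Below is a sketch of k-fold cross-validation written in plain NumPy so that it depends on nothing beyond the imports above; in practice scikit-learn's `cross_val_score` would be the usual tool. The data is synthetic, and `r2_score` and `cross_val_r2` are hypothetical helper names.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

def cross_val_r2(X, y, k=5, seed=42):
    """Fit least squares on k-1 folds and score R^2 on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)  # shuffled row indices
    X = np.column_stack([np.ones(len(y)), X])           # add intercept column
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coefs, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        scores.append(r2_score(y[test_idx], X[test_idx] @ coefs))
    return np.array(scores)

# Synthetic data (NOT the Ames data): price driven mostly by living area
rng = np.random.default_rng(0)
area = rng.normal(1500, 400, 300)
price = 100 * area + rng.normal(0, 20000, 300)

scores = cross_val_r2(area.reshape(-1, 1), price)
print(scores.round(3), round(scores.mean(), 3))
```

If the fold scores vary widely, the mean alone is not a reliable summary and the model is probably sensitive to which rows it was trained on.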

In [36]:
def scatter_plot(df, column, ax):
    '''Draw a scatter plot of df[column] against the dataframe index on the given
    matplotlib Axes (ax). Nothing is returned; the points are added to ax.'''
    # pick a random RGB color each time the function runs
    color = (random.random(), random.random(), random.random())

    ax.scatter(df.index, df[column], s=10, color=color, label=column)
    ax.set_xlabel('States')

In [37]:
def bar_plot(df, column, ax):
    '''Draw a bar plot of df[column] against the dataframe index on the given
    matplotlib Axes (ax). Nothing is returned; the bars are added to ax.'''
    # pick a random RGB color each time the function runs
    color = (random.random(), random.random(), random.random())

    ax.bar(x=df.index, height=df[column], label=column, color=color, alpha=0.7)
    ax.set_xlabel('States')

## Conclusions and Recommendations

To-Do: Based on your exploration of the data, what are your key takeaways and recommendations? Make sure to answer your question of interest or address your problem statement here.

## Citations of Data and Research

1. 
2.
3. 