<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 3

### Regression and Classification with the Ames Housing Data

---

You have just joined a new "full stack" real estate company in Ames, Iowa. The strategy of the firm is two-fold:
- Own the entire process from the purchase of the land all the way to sale of the house, and anything in between.
- Use statistical analysis to optimize investment and maximize return.

The company is still small. Although investment is substantial, its short-term goals are oriented towards purchasing and flipping existing houses rather than constructing entirely new ones. That said, the company has access to a large construction workforce operating at rock-bottom prices.

This project uses the [Ames housing data recently made available on Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

In [1]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 1. Estimating the value of homes from fixed characteristics.

---

Your superiors have outlined this year's strategy for the company:
1. Develop an algorithm to reliably estimate the value of residential houses based on *fixed* characteristics.
2. Identify characteristics of houses that the company can cost-effectively change/renovate with their construction team.
3. Evaluate the mean dollar value of different renovations.

With those three pieces in place, we can buy houses that are likely to sell for more than the purchase price plus the cost of renovations.
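
The buying rule in the strategy above can be written down as a simple margin check. The sketch below is only an illustration: the function names `expected_margin` and `should_buy`, the `min_margin` safety threshold, and all dollar figures are hypothetical and are not outputs of any model in this notebook.

```python
def expected_margin(predicted_sale_price, purchase_price, renovation_cost):
    """Expected profit: predicted resale value minus everything spent on the house."""
    return predicted_sale_price - (purchase_price + renovation_cost)


def should_buy(predicted_sale_price, purchase_price, renovation_cost, min_margin=0.0):
    """Buy only when the expected margin exceeds a chosen safety threshold."""
    margin = expected_margin(predicted_sale_price, purchase_price, renovation_cost)
    return margin > min_margin


# Hypothetical figures, for illustration only.
margin = expected_margin(predicted_sale_price=185000, purchase_price=150000, renovation_cost=20000)
decision = should_buy(185000, 150000, 20000, min_margin=10000)
print(margin)
print(decision)
```

In practice the predicted sale price would come from the models built below, and the threshold would reflect how much the company trusts those predictions.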

Your first job is to tackle #1. You have a dataset of housing sale data with a large number of features describing different aspects of each house. The data and the full description of its features are in two separate files:

    housing.csv
    data_description.txt
    
You need to build a reliable estimator for the price of the house given characteristics of the house that cannot be renovated. Some examples include:
- The neighborhood
- Square feet
- Bedrooms, bathrooms
- Basement and garage space

and many more. 

Some examples of things that **ARE renovate-able:**
- Roof and exterior features
- "Quality" metrics, such as kitchen quality
- "Condition" metrics, such as condition of garage
- Heating and electrical components

and generally anything you deem can be modified without having to undergo major construction on the house.

---

**Your goals:**
1. Perform any cleaning, feature engineering, and EDA you deem necessary.
2. Be sure to remove any houses that are not residential from the dataset.
3. Identify **fixed** features that can predict price.
4. Train a model on pre-2010 data and evaluate its performance on the 2010 houses.
5. Characterize your model. How well does it perform? What are the best estimates of price?

> **Note:** The EDA and feature engineering component of this project is not trivial! Always think critically and creatively. Justify your actions! Use the data description file!
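
Training on pre-2010 data and evaluating on the 2010 houses means splitting on the sale year rather than at random. The following is a minimal sketch of that idea using only NumPy and synthetic numbers; the arrays `yr_sold`, `living_area` and `sale_price` are made-up stand-ins, not the real Ames columns, and plain least squares is used here instead of the lasso fitted later in the notebook.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic stand-in for the housing data (not the real Ames values).
n = 200
yr_sold = rng.choice([2006, 2007, 2008, 2009, 2010], size=n)
living_area = rng.uniform(800, 3000, size=n)
sale_price = 50000 + 90 * living_area + rng.normal(0, 15000, size=n)

# Split on time: fit on sales before 2010, hold out the 2010 sales.
train = yr_sold < 2010
test = yr_sold == 2010

# Ordinary least squares with an intercept, fitted on the training years only.
A_train = np.column_stack([np.ones(train.sum()), living_area[train]])
coef, _, _, _ = np.linalg.lstsq(A_train, sale_price[train], rcond=None)

# Score on the held-out year using the coefficients learned from earlier years.
A_test = np.column_stack([np.ones(test.sum()), living_area[test]])
pred = A_test.dot(coef)
rmse = np.sqrt(np.mean((sale_price[test] - pred) ** 2))
print('held-out 2010 RMSE: {:.0f}'.format(rmse))
```

Reporting the error in dollars (RMSE) alongside R^2 makes it easier to judge whether the estimates are precise enough to base purchases on.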

In [2]:
# Load the data
house = pd.read_csv('./housing.csv')


In [None]:
# A:
house.info()

In [None]:
# drop columns with lots of null values : Alley, FireplaceQu, PoolQC, Fence, MiscFeature
house.drop(inplace=True, axis=1, labels=['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'])


In [None]:
pd.options.display.max_columns = 999
house.head()

In [None]:
# Remove non-residential housing
house_res = house[~house['MSZoning'].isin(['A','C','I','C (all)'])]


In [None]:
house_res.describe(include='all')

## Determining fixed characteristics
Based on the data description file, fixed characteristics are as follows:
***


MSSubClass: Identifies the type of dwelling involved in the sale.	

MSZoning: Identifies the general zoning classification of the sale.
		
LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

LotShape: General shape of property

LandContour: Flatness of the property

Utilities: Type of utilities available

LotConfig: Lot configuration

LandSlope: Slope of property

Neighborhood: Physical locations within Ames city limits

Condition1: Proximity to various conditions

Condition2: Proximity to various conditions (if more than one is present)

BldgType: Type of dwelling

HouseStyle: Style of dwelling

YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

MasVnrArea: Masonry veneer area in square feet

BsmtFinSF1: Type 1 finished square feet

BsmtFinSF2: Type 2 finished square feet

TotalBsmtSF: Total square feet of basement area

1stFlrSF: First Floor square feet
 
2ndFlrSF: Second floor square feet

LowQualFinSF: Low quality finished square feet (all floors)

GrLivArea: Above grade (ground) living area square feet

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)

KitchenAbvGr: Kitchens above grade

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

GarageType: Garage location

GarageYrBlt: Year garage was built

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

EnclosedPorch: Enclosed porch area in square feet

3SsnPorch: Three season porch area in square feet

ScreenPorch: Screen porch area in square feet

PoolArea: Pool area in square feet

MiscVal: $Value of miscellaneous feature

MoSold: Month Sold (MM)

YrSold: Year Sold (YYYY)

SaleType: Type of sale

SaleCondition: Condition of sale






### Unfixed
---



Street: Type of road access to property

Alley: Type of alley access to property

OverallQual: Rates the overall material and finish of the house

OverallCond: Rates the overall condition of the house

RoofStyle: Type of roof

RoofMatl: Roof material

Exterior1st: Exterior covering on house

Exterior2nd: Exterior covering on house (if more than one material)

MasVnrType: Masonry veneer type

ExterQual: Evaluates the quality of the material on the exterior 
		
ExterCond: Evaluates the present condition of the material on the exterior

Foundation: Type of foundation

BsmtQual: Evaluates the height of the basement

BsmtCond: Evaluates the general condition of the basement

BsmtExposure: Refers to walkout or garden level walls

BsmtFinType1: Rating of basement finished area

BsmtFinType2: Rating of basement finished area (if multiple types)

BsmtUnfSF: Unfinished square feet of basement area

Heating: Type of heating

HeatingQC: Heating quality and condition

CentralAir: Central air conditioning

Electrical: Electrical system
		
KitchenQual: Kitchen quality
       	

Functional: Home functionality (Assume typical unless deductions are warranted)
		
Fireplaces: Number of fireplaces

FireplaceQu: Fireplace quality
		
		
GarageFinish: Interior finish of the garage

GarageQual: Garage quality

GarageCond: Garage condition
		
PavedDrive: Paved driveway

PoolQC: Pool quality
		
Fence: Fence quality

MiscFeature: Miscellaneous feature not covered in other categories


In [None]:
# Work on a copy so the assignments below do not write into a slice of `house`
house_res = house_res.copy()
for f in house_res.columns:
    # numeric columns: fill NaN values with the column mean
    if house_res[f].dtype in ('float64', 'int64'):
        house_res[f] = house_res[f].fillna(house_res[f].mean())
    # categorical columns: fill NaN values with the most frequent value
    elif house_res[f].dtype == 'object':
        house_res[f] = house_res[f].fillna(house_res[f].value_counts().index[0])

In [None]:
fixed_char = ['Id', 'SalePrice', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', \
              'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', \
              'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', \
              'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'TotalBsmtSF', \
              '1stFlrSF', '2ndFlrSF', 'LowQualFinSF' ,'GrLivArea', 'BsmtFullBath', \
              'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', \
              'GarageType', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', \
              'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea' ,'MiscVal', 'MoSold', 'YrSold', \
              'SaleType', 'SaleCondition']

In [None]:
# Copy so that new columns can be added without a SettingWithCopyWarning
house_res1 = house_res.loc[:, fixed_char].copy()

In [None]:
# Use patsy dmatrices to dummify categorical variables
import patsy

In [None]:
# Check columns to dummify (categorical/string/object columns)
house_res1.select_dtypes(include=['object']).columns

In [None]:
house_res1.head()

In [None]:
# sum all porch areas into new column
house_res1['porch_area'] = house_res1.loc[:,['OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch']].sum(axis=1)
house_res1 = house_res1.drop(['OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch'],axis=1)

In [None]:
# Drop Id and the component columns already summarized by aggregate columns (TotalBsmtSF, GrLivArea)
house_res1 = house_res1.drop(['Id','BsmtFinSF1','BsmtFinSF2', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF'], axis=1)
house_res1.head()

In [None]:
catcol = ['MSSubClass', 'MSZoning', 'LotShape', 'LandContour', 'Utilities', 'LotConfig','LandSlope', 'Neighborhood', 'Condition1', 'Condition2',\
       'BldgType', 'HouseStyle', 'YearRemodAdd', 'BsmtFullBath', 'BsmtHalfBath',\
       'FullBath', 'HalfBath', 'TotRmsAbvGrd','BedroomAbvGr','KitchenAbvGr',\
       'GarageType', 'GarageYrBlt', 'GarageCars', 'MoSold', 'YrSold',\
       'SaleType', 'SaleCondition']
for f in catcol:
    sns.barplot(x=house_res1[f].value_counts().index, y= house_res1[f].value_counts())
    plt.show()

In [None]:
# Drop categorical columns with low variance according to countplots
house_res1 = house_res1.drop(['Utilities', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'YearRemodAdd', 'BsmtHalfBath', 'KitchenAbvGr', 'GarageYrBlt', 'SaleType','SaleCondition'], axis=1)

In [None]:
house_res1['BsmtFullBath_0'] = house_res1['BsmtFullBath'].map(lambda x: 1 if x == 0 else 0)

In [None]:
house_res1['fullbath2<'] = house_res1['FullBath'].map(lambda x: 1 if x<2 else 0)
house_res1['halfbath_0'] = house_res1['HalfBath'].map(lambda x: 1 if x==0 else 0)
# house_res1.head(20)

In [None]:
house_res1 = house_res1.drop(['BsmtFullBath', 'FullBath', 'HalfBath'], axis=1)

In [None]:
# Clean num columns
house_res1.describe()

In [None]:
# near zero variance

def nearZeroVariance(X, freqCut = 95 / 5, uniqueCut = 10):
    '''
    Determine predictors with near zero or zero variance.
    Inputs:
    X: pandas data frame
    freqCut: the cutoff for the ratio of the most common value to the second most common value
    uniqueCut: the cutoff for the percentage of distinct values out of the number of total samples
    Returns a tuple containing a list of column names: (zeroVar, nzVar)
    '''

    colNames = X.columns.values.tolist()
    freqRatio = dict()
    uniquePct = dict()

    for names in colNames:
        counts = (
            (X[names])
            .value_counts()
            .sort_values(ascending = False)
            .values
            )

        if len(counts) == 1:
            freqRatio[names] = -1
            # float() avoids integer division under Python 2
            uniquePct[names] = (len(counts) / float(len(X[names]))) * 100
            continue

        freqRatio[names] = counts[0] / float(counts[1])
        uniquePct[names] = (len(counts) / float(len(X[names]))) * 100

    zeroVar = list()
    nzVar = list()
    for k in uniquePct.keys():
        if freqRatio[k] == -1:
            zeroVar.append(k)

        if uniquePct[k] < uniqueCut and freqRatio[k] > freqCut:
            nzVar.append(k)

    return(zeroVar, nzVar)

In [None]:
zerovartest = house_res1.loc[:,['LotFrontage','LotArea','MasVnrArea', 'TotalBsmtSF','GrLivArea',\
                                'GarageArea','WoodDeckSF', 'PoolArea', 'MiscVal']]
nearZeroVariance(zerovartest)
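
To make the two cutoffs concrete, here is a small worked example on a single toy column. It is a compact re-implementation for illustration, not the function above: the helper name `freq_ratio_and_unique_pct` and the toy values are assumptions, and a constant column is reported with an infinite ratio here rather than the `-1` marker used above.

```python
from collections import Counter


def freq_ratio_and_unique_pct(values):
    """Return (frequency ratio, percent unique) for one column of values."""
    counts = sorted(Counter(values).values(), reverse=True)
    unique_pct = 100.0 * len(counts) / len(values)
    if len(counts) == 1:
        return float('inf'), unique_pct  # constant column: zero variance
    # ratio of the most common value's count to the second most common value's count
    return counts[0] / float(counts[1]), unique_pct


# Toy column: 98 zeros and 2 distinct non-zero values, like a rarely present feature.
toy_pool_area = [0] * 98 + [512, 648]
ratio, pct = freq_ratio_and_unique_pct(toy_pool_area)
print(ratio)
print(pct)

# Same cutoffs as the defaults above: ratio above 95/5 and under 10% unique values.
is_near_zero_var = ratio > 95 / 5.0 and pct < 10
print(is_near_zero_var)
```

A column dominated by one value in this way carries almost no information for a linear model, which is the motivation for dropping such columns.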

In [None]:
house_res1 = house_res1.drop(['MasVnrArea', 'PoolArea', 'MiscVal'],axis=1)

In [None]:
house_res1.columns

In [None]:
house_res1 = pd.get_dummies(house_res1, columns=['MSSubClass', 'MSZoning','LotShape', 'LandContour',\
                                                 'LotConfig', 'Neighborhood','HouseStyle', 'BedroomAbvGr',\
                                                 'TotRmsAbvGrd', 'GarageType', 'GarageCars','MoSold'], drop_first=True)

In [None]:
house_res1.head()

In [None]:
traindf = house_res1[house_res1['YrSold']!=2010]
testdf = house_res1[house_res1['YrSold']==2010]

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
ss = StandardScaler()
y = traindf['SalePrice'].values
X = traindf.drop(['SalePrice','YrSold'],axis=1)
Xs = ss.fit_transform(X)

In [None]:
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score

In [None]:
optimal_lasso = LassoCV(n_alphas=500, cv=10, verbose=1)
optimal_lasso.fit(Xs, y)

print(optimal_lasso.alpha_)

In [None]:
lasso = Lasso(alpha=optimal_lasso.alpha_)

lasso_scores = cross_val_score(lasso, Xs, y, cv=10)

print(lasso_scores)
print(np.mean(lasso_scores))

In [None]:
lasso.fit(Xs, y)

In [None]:
lasso_coefs = pd.DataFrame({'variable':X.columns,
                            'coef':lasso.coef_,
                            'abs_coef':np.abs(lasso.coef_)})

lasso_coefs.sort_values('abs_coef', inplace=True, ascending=False)

lasso_coefs.head(20)

In [None]:
lasso_coefs.info()

In [None]:
print('Fraction of variables zeroed out: {}'.format(np.sum(lasso.coef_ == 0) / float(Xs.shape[1])))

In [None]:
lasso_coefs[lasso_coefs['abs_coef']==0].info()

In [None]:
# predicting test
y1 = testdf['SalePrice'].values
X1 = testdf.drop(['SalePrice','YrSold'],axis=1)
Xs1 = ss.transform(X1)

In [None]:
lasso.predict(Xs1)

In [None]:
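# Score the model fitted on pre-2010 data on the held-out 2010 sales (R^2);
# cross-validating on the test set would refit the model instead of evaluating it
test_score = lasso.score(Xs1, y1)

print(test_score)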

In [None]:
# workings
# Utilities LandSlope Condition1 Condition2 BldgType, 'YearRemodAdd', 'BsmtHalfBath', 'KitchenAbvGr', GarageYrBlt, 'SaleType','SaleCondition'
# 'BsmtFullBath(0, >0)', FullBath(<2, >2), HalfBath(0,>0)

# Y'SalePrice', u'MSSubClass', u'MSZoning', NUM'LotFrontage',
#        NUM'LotArea', u'LotShape', u'LandContour', u'Utilities', u'LotConfig',
#        u'LandSlope', u'Neighborhood', u'Condition1', u'Condition2',
#        u'BldgType', u'HouseStyle', u'YearRemodAdd', NUM'MasVnrArea'DD,
#        , NUM'TotalBsmtSF', , NUM'GrLivArea', u'BsmtFullBath', u'BsmtHalfBath',
#        u'FullBath', u'HalfBath', u'Bedroom', u'Kitchen', u'TotRmsAbvGrd',
#        u'GarageType', u'GarageYrBlt', u'GarageCars', NUM'GarageArea',
#        NUM'WoodDeckSF', NUM'PoolArea'DD, NUM'MiscVal', u'MoSold', u'YrSold',
#        u'SaleType', u'SaleCondition']

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 2. Determine any value of *changeable* property characteristics unexplained by the *fixed* ones.

---

Now that you have a model that estimates the price of a house based on its static characteristics, we can move forward with parts 2 and 3 of the plan: what are the costs/benefits of quality, condition, and renovations?

There are two specific requirements for these estimates:
1. The estimates of effects must be in terms of dollars added or subtracted from the house value. 
2. The effects must be on the variance in price remaining from the first model.

The residuals from the first model (training and testing) represent the variance in price unexplained by the fixed characteristics. Of that variance in price remaining, how much of it can be explained by the easy-to-change aspects of the property?
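
The two-stage idea can be sketched on synthetic data as follows. Everything in this block is an assumption made for illustration: the data are simulated, `living_area` stands in for the fixed features, `good_kitchen` for a renovate-able one, and plain least squares replaces the lasso used in this notebook.

```python
import numpy as np

rng = np.random.RandomState(1)
n = 500

# Synthetic data: one fixed feature and one renovate-able 0/1 feature.
living_area = rng.uniform(800, 3000, size=n)             # fixed
good_kitchen = rng.randint(0, 2, size=n).astype(float)   # renovate-able
price = 40000 + 100 * living_area + 12000 * good_kitchen + rng.normal(0, 8000, size=n)


def fit_ols(features, target):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, _, _, _ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict_ols(coef, features):
    """Predictions from coefficients returned by fit_ols."""
    design = np.column_stack([np.ones(features.shape[0]), features])
    return design.dot(coef)


# Stage 1: price explained by the fixed feature only.
coef_fixed = fit_ols(living_area, price)
residual = price - predict_ols(coef_fixed, living_area)  # actual minus predicted

# Stage 2: what is left over, explained by the renovate-able feature.
coef_reno = fit_ols(good_kitchen, residual)
resid_after = residual - predict_ols(coef_reno, good_kitchen)

# Share of the remaining variance explained by stage 2.
r2_remaining = 1 - resid_after.var() / residual.var()

print('dollar effect of a good kitchen: {:.0f}'.format(coef_reno[1]))
print('share of residual variance explained: {:.2f}'.format(r2_remaining))
```

Because the renovate-able feature is left in its original 0/1 units, the stage-2 coefficient reads directly in dollars. A coefficient fitted on standardized inputs is instead in dollars per standard deviation of the feature, so it would need rescaling before being quoted as a renovation value.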

---

**Your goals:**
1. Evaluate the effect in dollars of the renovate-able features.
2. How would your company use this second model and its coefficients to determine whether they should buy a property or not? Explain how the company can use the two models you have built to determine if they can make money.
3. Investigate how much of the variance in price remaining is explained by these features.
4. Do you trust your model? Should it be used to evaluate which properties to buy and fix up?

In [None]:
# A: Subset the DataFrame to the unfixed (renovate-able) characteristics

unfixed = []
for f in house_res.columns:
    if f not in fixed_char:
        unfixed.append(f)
    else:
        print(f)  # columns already used as fixed characteristics



In [None]:
unfixed.append('YrSold')
unfixed.append('SalePrice')

In [None]:
# df with unfixed characteristics only
houseres2 = house_res.loc[:,unfixed]
houseres2.head()

In [None]:
for f in unfixed:
    sns.barplot(x=houseres2[f].value_counts().index, y= houseres2[f].value_counts())
    plt.show()

In [None]:
# Based on plots remove columns: Street, RoofStyle, RoofMatl, ExterCond, BsmtCond, BsmtFinType2, Heating, 
# CentralAir, Electrical, Functional, GarageQual, GarageCond, PavedDrive 
['OverallQual', 'OverallCond', 'YearBuilt', 'Exterior1st',\
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'Foundation', 'BsmtQual',\
       'BsmtExposure', 'BsmtFinType1', 'HeatingQC',\
       'KitchenQual', 'Fireplaces', 'GarageFinish']

In [None]:
houseres2 = houseres2.drop(['Street', 'RoofStyle', 'RoofMatl', 'ExterCond', 'BsmtCond',\
                            'BsmtFinType2', 'Heating', 'CentralAir', 'Electrical', 'Functional', 'GarageQual',\
                            'GarageCond', 'PavedDrive'], axis=1)

In [None]:
houseres2.columns

In [None]:
catcol2 = ['OverallQual', 'OverallCond', 'YearBuilt', 'Exterior1st',\
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'Foundation', 'BsmtQual',\
       'BsmtExposure', 'BsmtFinType1', 'HeatingQC',\
       'KitchenQual', 'Fireplaces', 'GarageFinish']
for f in catcol2:
    print('{} {}'.format(f, houseres2[f].nunique()))
    
# Note that Yearbuilt, exterior1st and 2nd have many categories. Binarise these columns for more 
# meaningful interpretation. Year built has a mean of 1972

In [None]:
houseres2.describe()

In [None]:
houseres2['Exterior1st'].value_counts()

In [None]:
# BINARISE YEARBUILT AND EXT 1 and 2
houseres2['yearbuilt_bef1972'] = houseres2['YearBuilt'].map(lambda x: 1 if x<1972 else 0)
houseres2['Exterior1st_Vinyl'] = houseres2['Exterior1st'].map(lambda x: 1 if x == 'VinylSd' else 0)
houseres2['Exterior2nd_Vinyl'] = houseres2['Exterior2nd'].map(lambda x: 1 if x == 'VinylSd' else 0)
# houseres2.head(30)

In [None]:
houseres2 = houseres2.drop(['YearBuilt','Exterior1st','Exterior2nd'], axis=1)

In [None]:
houseres2 = pd.get_dummies(houseres2, columns=['OverallQual', u'OverallCond', u'MasVnrType', u'ExterQual',\
       'Foundation', 'BsmtQual', 'BsmtExposure', 'BsmtFinType1',\
       'HeatingQC', 'KitchenQual', 'Fireplaces',\
       'GarageFinish', 'yearbuilt_bef1972',\
       'Exterior1st_Vinyl', 'Exterior2nd_Vinyl'], drop_first=True)

In [None]:
houseres2.columns

In [None]:
y = house_res1['SalePrice'].values
X = house_res1.drop(['SalePrice','YrSold'],axis=1)
# Reuse the scaler fitted on the pre-2010 training data so the inputs match the fitted lasso
Xs = ss.transform(X)

In [None]:
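y_pred = lasso.predict(Xs)
# Residual = actual minus predicted: positive means the house sold above the fixed-feature estimate
y_resi = y - y_pred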

In [None]:
houseres2['y_resi'] = y_resi

In [None]:
# train test split based on year sold
traindf = houseres2[houseres2['YrSold']!=2010]
testdf = houseres2[houseres2['YrSold']==2010]

In [None]:
y = traindf['y_resi'].values
X = traindf.drop(['YrSold', 'SalePrice', 'y_resi','BsmtUnfSF'],axis=1)
Xs = ss.fit_transform(X)

In [None]:
lasso = Lasso(alpha=optimal_lasso.alpha_)

lasso_scores = cross_val_score(lasso, Xs, y, cv=5)

print(lasso_scores)
print(np.mean(lasso_scores))

In [None]:
lasso.fit(Xs, y)

In [None]:
lasso_coefs = pd.DataFrame({'variable':X.columns,
                            'coef':lasso.coef_,
                            'abs_coef':np.abs(lasso.coef_)})

lasso_coefs.sort_values('abs_coef', inplace=True, ascending=False)

lasso_coefs.head(20)

In [None]:
print('Fraction of variables zeroed out: {}'.format(np.sum(lasso.coef_ == 0) / float(Xs.shape[1])))

In [None]:
y1 = testdf['y_resi'].values
X1 = testdf.drop(['YrSold', 'SalePrice', 'y_resi','BsmtUnfSF'],axis=1)
Xs1 = ss.transform(X1)

In [None]:
# R^2 of the model fitted on the pre-2010 residuals, scored on the held-out 2010 residuals
test_score = lasso.score(Xs1, y1)

print(test_score)

In [None]:
# Regress SalePrice directly on the unfixed characteristics to rank their lasso coefficients
y2 = traindf['SalePrice'].values
X2 = traindf.drop(['YrSold', 'SalePrice', 'y_resi','BsmtUnfSF'],axis=1)
Xs2 = ss.fit_transform(X2)

In [None]:
lasso = Lasso(alpha=optimal_lasso.alpha_)

lasso_scores = cross_val_score(lasso, Xs2, y2, cv=5)

print(lasso_scores)
print(np.mean(lasso_scores))

In [None]:
lasso.fit(Xs2,y2)

In [None]:
lasso_coefs = pd.DataFrame({'variable':X2.columns,
                            'coef':lasso.coef_,
                            'abs_coef':np.abs(lasso.coef_)})

lasso_coefs.sort_values('abs_coef', inplace=True, ascending=False)

lasso_coefs.head(20)

<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 3. What property characteristics predict an "abnormal" sale?

---

The `SaleCondition` feature indicates the circumstances of the house sale. From the data file, we can see that the possibilities are:

       Normal	Normal Sale
       Abnorml	Abnormal Sale -  trade, foreclosure, short sale
       AdjLand	Adjoining Land Purchase
       Alloca	Allocation - two linked properties with separate deeds, typically condo with a garage unit	
       Family	Sale between family members
       Partial	Home was not completed when last assessed (associated with New Homes)
       
One of the executives at your company has an "in" with higher-ups at the major regional bank. His friends at the bank have made him a proposal: if he can reliably indicate what features, if any, predict "abnormal" sales (foreclosures, short sales, etc.), then in return the bank will give him first dibs on the pre-auction purchase of those properties (at a dirt-cheap price).

He has tasked you with determining (and adequately validating) which features of a property predict this type of sale. 

---

**Your task:**
1. Determine which features predict the `Abnorml` category in the `SaleCondition` feature.
2. Justify your results.

This is a challenging task that tests your ability to perform classification analysis in the face of severe class imbalance. You may find that simply running a classifier on the full dataset is useless: under severe class imbalance, classifiers often just predict the majority class.

It is up to you to determine how you will tackle this problem. I recommend doing some research to find out how others have dealt with the problem in the past. Make sure to justify your solution. Don't worry about it being "the best" solution, but be rigorous.

Be sure to indicate which features are predictive (if any) and whether they are positive or negative predictors of abnormal sales.
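
To see why accuracy alone is misleading under class imbalance, consider a "classifier" that always predicts the majority class. The sketch below uses a hypothetical class balance (7 abnormal sales out of 100), not the real proportion in the Ames data, and the helper `confusion_counts` is written here for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) counts for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


# Hypothetical class balance: 7 abnormal sales (label 1) out of 100.
y_true = [1] * 7 + [0] * 93
y_majority = [0] * 100  # always predicts the majority class

tp, fp, fn, tn = confusion_counts(y_true, y_majority)
accuracy = (tp + tn) / float(len(y_true))
recall = tp / float(tp + fn)
print(accuracy)
print(recall)
```

The baseline reaches high accuracy while finding none of the abnormal sales, which is why recall and precision on the minority class are the quantities to report here.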

In [None]:
# A:
house_res.columns

In [None]:
house_res.SaleCondition.unique()

In [None]:
house_res['abnotab'] = house_res['SaleCondition'].map(lambda x: 1 if x =='Abnorml' else 0)

In [None]:
house_res.columns

In [None]:
house_res = house_res.drop(['Id','YearBuilt', 'YearRemodAdd','GarageYrBlt','MiscVal', 'MoSold', 'YrSold'],axis=1)

In [None]:
house_res = house_res.drop(['BsmtFinType2', 'BsmtFinSF2'],axis=1)

In [None]:
housedum = pd.get_dummies(house_res, columns=['MSSubClass', 'MSZoning', 'Street',\
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',\
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType',\
       'HouseStyle', 'OverallQual', 'OverallCond', 'RoofStyle',\
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',\
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',\
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1',\
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical','BsmtFullBath',\
       'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',\
       'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional',\
       'Fireplaces', 'GarageType', 'GarageFinish', 'GarageCars',\
       'GarageQual', 'GarageCond', 'PavedDrive',\
       'SaleType'], drop_first=True)

In [None]:
housedum = housedum.drop(['SaleCondition'],axis=1)

In [None]:
from sklearn.model_selection import train_test_split

y = housedum['abnotab'].values
X = housedum.drop(['abnotab'],axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=16125986)

In [None]:
xtrain = pd.DataFrame(data=X_train,columns=housedum.drop(['abnotab'],axis=1).columns)
xtrain.head(2)

In [None]:
# get values of correlation for abnotab vs all
listname = []
listcorr = []
for col in xtrain.columns:
    a = stats.spearmanr(y_train,xtrain[col].values)[0]
    listname.append(col)
    listcorr.append(a)

In [None]:
# Materialize the pairs so they can be sorted and then reused to build a DataFrame
newlist = list(zip(listname, listcorr))
sorted(newlist,key=lambda x: x[1])

In [None]:
dfa = pd.DataFrame(newlist,columns=['key','corr'])

In [None]:
# find attributes that have corr with label that are +/- 2 s.d. away from mean of corr
meancorr = np.mean(dfa['corr'])
stdcorr = np.std(dfa['corr'])
print('{} {} {}'.format(meancorr, meancorr + 2*stdcorr, meancorr - 2*stdcorr))


In [None]:
# Use the computed bounds rather than hard-coded numbers so the flag follows the data
dfa['lab'] = dfa['corr'].map(lambda x: 1 if x <= meancorr - 2*stdcorr or x >= meancorr + 2*stdcorr else 0)

In [None]:
sum(dfa.lab)

In [None]:
from imblearn.over_sampling import SMOTE 
sm = SMOTE(random_state=42)
# fit_resample supersedes fit_sample in newer imbalanced-learn releases
X_res, y_res = sm.fit_resample(X_train, y_train)

In [None]:
dfa[dfa['lab']==1].key.values

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [None]:
param = {'penalty':['l1','l2'] ,\
         'C': list(np.linspace(0.1,1,num=10))}

In [None]:
# liblinear supports both the l1 and l2 penalties in the grid
clf = GridSearchCV(LogisticRegression(solver='liblinear'), param, cv=5)
clf.fit(X_res,y_res)

In [None]:
clf.best_params_

In [None]:
clf.best_score_

In [None]:
y_pred = clf.predict(X_test)

In [None]:
from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred, target_names=['normal','abnormal']))