
## 2. Determine any value of *changeable* property characteristics unexplained by the *fixed* ones.

---

Now that you have a model that estimates the price of a house based on its static characteristics, we can move forward with parts 2 and 3 of the plan: what are the costs and benefits of quality, condition, and renovations?

There are two specific requirements for these estimates:
1. The estimates of effects must be in terms of dollars added or subtracted from the house value. 
2. The effects must be on the variance in price remaining from the first model.

The residuals from the first model (training and testing) represent the variance in price unexplained by the fixed characteristics. Of that variance in price remaining, how much of it can be explained by the easy-to-change aspects of the property?
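The two-stage idea above can be sketched as follows. This is a minimal illustration on synthetic data, not the Ames columns: `sqft` stands in for a fixed characteristic, `kitchen_qual` for a changeable one, and the helper names `fit_ols` and `predict` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins (assumptions, not the Ames data):
sqft = rng.uniform(800, 3000, n)        # a "fixed" characteristic
kitchen_qual = rng.integers(1, 6, n)    # a "changeable" characteristic, rated 1-5
price = 50_000 + 100 * sqft + 8_000 * kitchen_qual + rng.normal(0, 5_000, n)

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns [intercept, coefs...]."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    """Apply fitted coefficients (intercept first) to a 2-D feature array."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

# Stage 1: price explained by the fixed characteristic only.
X_fixed = sqft.reshape(-1, 1)
beta_fixed = fit_ols(X_fixed, price)
residuals = price - predict(beta_fixed, X_fixed)

# Stage 2: leftover dollars explained by the changeable characteristic.
X_change = kitchen_qual.reshape(-1, 1)
beta_change = fit_ols(X_change, residuals)
fitted = predict(beta_change, X_change)
ss_res = np.sum((residuals - fitted) ** 2)
ss_tot = np.sum((residuals - residuals.mean()) ** 2)
r2_resid = 1 - ss_res / ss_tot

print(f"dollars per kitchen-quality point: {beta_change[1]:,.0f}")
print(f"share of residual variance explained: {r2_resid:.2f}")
```

Because the second-stage target is in dollars and the feature is unscaled, the coefficient reads directly as dollars per unit change, which is what requirement 1 asks for. Standardizing the changeable features first would break that interpretation.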

---

**Your goals:**
1. Evaluate the effect in dollars of the renovate-able features.
2. How would your company use this second model and its coefficients to determine whether they should buy a property or not? Explain how the company can use the two models you have built to determine if they can make money.
3. Investigate how much of the variance in price remaining is explained by these features.
4. Do you trust your model? Should it be used to evaluate which properties to buy and fix up?
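One way the two models could feed a buy decision is sketched below. The function `expected_profit` and every number in it are hypothetical; the sketch assumes the first model supplies a value from fixed characteristics and the second model's coefficients supply dollar gains per planned renovation.

```python
def expected_profit(purchase_price, fixed_value, renovation_gains, renovation_costs):
    """Estimated profit from buying, renovating, and reselling one property.

    fixed_value:       first-model prediction from fixed characteristics
    renovation_gains:  dollars added per planned change (second-model
                       coefficient times the planned change in that feature)
    renovation_costs:  contractor cost of each planned change
    """
    resale_estimate = fixed_value + sum(renovation_gains.values())
    total_cost = purchase_price + sum(renovation_costs.values())
    return resale_estimate - total_cost

# All numbers below are invented for illustration.
profit = expected_profit(
    purchase_price=150_000,
    fixed_value=160_000,
    renovation_gains={"kitchen": 12_000, "exterior": 5_000},
    renovation_costs={"kitchen": 9_000, "exterior": 6_000},
)
print(profit)  # 12000
```

A rule like this is only as good as the second model's coefficients, so the question of whether to trust that model matters before any purchase is made on this basis.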

In [27]:
'''
A preface: I wasn't able to get question 3 to work, so I have not gone back through the whole notebook to comment it.
The basic premise was to use KNN to predict abnormal sales; however, the final results were very poor at predicting
abnormal sale conditions. I tried adjusting the classification threshold but wasn't able to improve performance in a
meaningful way.

'''

import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import pandas_profiling as pp
from sklearn.preprocessing import StandardScaler
from scipy.stats import skew
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.feature_selection import RFECV, SelectFromModel
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline


In [2]:
ls


Ames Real Estate Data.xlsx      final data manipulations.ipynb
Ames_dataset.xlsx               fixed_col.csv
Data manipulation.ipynb         housing.csv*
Project-03-Q2.ipynb             lat_long_test.csv
README.md*                      lat_long_train.csv
Tbl_DataExportSpec.xls          new_full_set.csv
Untitled.ipynb                  new_housing.csv
ames.csv                        nfs2 - nfs2.csv
ames.twb                        nfs2.csv
ames_edit.csv                   project-03-Q1.ipynb*
data_description.txt*           project-03-Q3.ipynb
df1.csv                         test.csv
df2.csv                         test_data - df2.csv
df3.csv                         train.csv
df4.csv                         train_data - df1.csv


In [3]:
ames_edit = pd.read_csv('./ames_edit.csv')


In [4]:
ames_edit.columns


Index(['Unnamed: 0', 'Id', 'Prop_Addr', 'Latitude', 'Longitude', 'MSZoning',
       'LotFrontage', 'LotShape', 'LandContour', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir',
       'Electrical', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath',
       'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'MoSold_y',
       'YrSold

In [5]:
ames_edit['SaleCondition'].value_counts()


Normal     1190
Partial     125
Abnorml      92
Family       20
Alloca       13
AdjLand       4
Name: SaleCondition, dtype: int64

I have decided to approach this problem as one of classification.


In [6]:
# Encode the target: 1 for abnormal sales, 0 for every other sale condition
ames_edit['SaleCondition'] = (ames_edit['SaleCondition'] == 'Abnorml').astype(int)

In [7]:
ames_edit['SaleCondition'].value_counts()

0    1352
1      92
Name: SaleCondition, dtype: int64

In [8]:
y = ames_edit['SaleCondition']

In [33]:
print(y)

0       0
1       0
2       0
3       1
4       0
5       0
6       0
7       0
8       1
9       0
10      0
11      0
12      0
13      0
14      0
15      0
16      0
17      0
18      0
19      1
20      0
21      0
22      0
23      0
24      0
25      0
26      0
27      0
28      0
29      0
       ..
1414    0
1415    0
1416    0
1417    0
1418    0
1419    1
1420    0
1421    0
1422    0
1423    0
1424    0
1425    0
1426    0
1427    0
1428    0
1429    0
1430    0
1431    0
1432    0
1433    1
1434    0
1435    0
1436    0
1437    1
1438    0
1439    0
1440    0
1441    0
1442    0
1443    0
Name: SaleCondition, Length: 1444, dtype: int64


In [9]:
'''
Calculate Baseline accuracy
'''

print((y.value_counts(normalize=True)*100))

0    93.628809
1     6.371191
Name: SaleCondition, dtype: float64
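The baseline here is the accuracy of always predicting the majority class. A small worked check using the counts printed earlier (1352 non-abnormal, 92 abnormal):

```python
# Class counts taken from the value_counts output above.
counts = {0: 1352, 1: 92}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 4))  # 0 0.9363
```

Any classifier has to beat roughly 93.6% accuracy to be doing better than ignoring the features entirely, which is why accuracy alone is a weak yardstick for a target this imbalanced.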


In [10]:
abs(ames_edit.corr().SaleCondition).sort_values(ascending=False)

SaleCondition    1.000000
sales_price      0.137083
YearRemodAdd     0.123184
GarageQual       0.094710
FireplaceQu      0.094231
GarageFinish     0.092510
GarageCars       0.090318
ExterQual        0.089019
YearBuilt        0.088956
GarageCond       0.083142
OverallQual      0.081718
GarageYrBlt      0.081072
GarageArea       0.080889
Fireplaces       0.076856
HeatingQC        0.069819
FullBath         0.069331
BsmtQual         0.068524
Longitude        0.056752
HalfBath         0.052739
YrSold           0.049743
BsmtExposure     0.044153
TotRmsAbvGrd     0.042791
MasVnrArea       0.041365
WoodDeckSF       0.040220
lot_area         0.038073
BsmtFinType2     0.035167
GrLivArea        0.034565
OverallCond      0.033128
KitchenQual      0.032750
2ndFlrSF         0.030810
OpenPorchSF      0.028363
BsmtUnfSF        0.027236
BsmtFinType1     0.023684
TotalBsmtSF      0.023200
KitchenAbvGr     0.021354
1stFlrSF         0.018311
BsmtFinSF1       0.014712
BsmtFullBath     0.011863
MoSold_y    

In [11]:
feature_cols = ['YearRemodAdd', 'GarageQual', 'FireplaceQu', 'GarageFinish', 'GarageCars', 'ExterQual', 
                'YearBuilt', 'GarageCond', 'OverallQual', 'GarageYrBlt', 'GarageArea']
X = ames_edit[feature_cols]

In [12]:
X = X.astype(float)

In [13]:
ss = StandardScaler()
Xs = ss.fit_transform(X)

In [14]:
print(Xs.shape)
print(y.shape)

(1444, 11)
(1444,)


In [46]:
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.33, random_state=42, stratify=y)

In [47]:
'''
Gridsearch for best knn hyperparam - neighbors
'''
params = {'n_neighbors':[2,3,4,5,6,7,8,9]}

knn = KNeighborsClassifier()

model = GridSearchCV(knn, params, cv=5)
model.fit(X_train,y_train)
model.best_params_

{'n_neighbors': 4}
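Two things about the search above are worth a second look: the scaler was fit on all rows before the train/test split, and `GridSearchCV` scores by accuracy by default, which rewards predicting the majority class. A possible alternative, sketched on synthetic imbalanced data (the data and the step names `scale` and `knn` are assumptions for this example), puts the scaler inside a `Pipeline` and selects `n_neighbors` by recall:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic, imbalanced stand-in data (assumption): roughly 10% positives.
X = rng.normal(size=(600, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=600) > 1.4).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

# The scaler is refit on each training fold, so no held-out rows leak into it.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
params = {"knn__n_neighbors": [2, 3, 4, 5, 6, 7, 8, 9]}

# Score by recall on the positive class instead of the default accuracy.
search = GridSearchCV(pipe, params, cv=5, scoring="recall")
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
```

Whether recall, precision, or F1 is the right criterion depends on what a missed abnormal sale costs relative to a false alarm; recall is used here only as an example.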

In [67]:
# instantiate learning model (k = 6 here; note the grid search above selected k = 4)
knn = KNeighborsClassifier(n_neighbors=6)

# fitting the model
knn.fit(X_train, y_train)

# predict the response
y_pred = knn.predict(X_test)

# evaluate accuracy
print(metrics.accuracy_score(y_test, y_pred))

0.9371069182389937


In [68]:
# Confusion matrix for abnormal (1) vs. not abnormal (0) sales, with the positive class listed first.
conmat = np.array(confusion_matrix(y_test, y_pred, labels=[1,0]))
print(conmat)
confusion = pd.DataFrame(conmat, index=['is_abnormal', 'is_not_abnormal'],
                         columns=['predicted_abnormal','predicted_not_abnormal'])
confusion

[[  0  30]
 [  0 447]]


| | predicted_abnormal | predicted_not_abnormal |
|---|---|---|
| is_abnormal | 0 | 30 |
| is_not_abnormal | 0 | 447 |
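Reading the matrix above: the model never predicts an abnormal sale, so its accuracy is exactly the share of non-abnormal sales in the test set. A quick check with the printed counts:

```python
import numpy as np

# Rows: actual [abnormal, not abnormal]; columns: predicted [abnormal, not abnormal].
conmat = np.array([[0, 30],
                   [0, 447]])
tp, fn = conmat[0]
fp, tn = conmat[1]

accuracy = (tp + tn) / conmat.sum()
recall = tp / (tp + fn)
# Accuracy of always predicting "not abnormal" on this test set.
majority_share = (fp + tn) / conmat.sum()

print(round(accuracy, 4), recall, round(majority_share, 4))
```

The accuracy of 0.9371 reported above is therefore no better than the do-nothing baseline, and recall on abnormal sales is zero.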


In [69]:
# Get the predicted probability vector and explicitly name the columns:
y_pred_data = pd.DataFrame(knn.predict_proba(X_test), columns=['predicted_not_abnormal','predicted_abnormal'])
# y_pred_data

In [76]:
# Lower the threshold for predicting class 1 (abnormal) from the default 0.5 to 0.20.
# This should reduce false negatives, at the expense of more false positives.
# (The column name says thresh10, but the threshold actually applied below is 0.20.)
y_pred_data['pred_class_thresh10'] = [1 if x >= 0.20 else 0 for x in y_pred_data.predicted_abnormal.values]
y_pred_data

| | predicted_not_abnormal | predicted_abnormal | pred_class_thresh10 |
|---|---|---|---|
| 0 | 1.000000 | 0.000000 | 0 |
| 1 | 1.000000 | 0.000000 | 0 |
| 2 | 1.000000 | 0.000000 | 0 |
| 3 | 0.833333 | 0.166667 | 0 |
| 4 | 1.000000 | 0.000000 | 0 |
| 5 | 1.000000 | 0.000000 | 0 |
| 6 | 1.000000 | 0.000000 | 0 |
| 7 | 1.000000 | 0.000000 | 0 |
| 8 | 1.000000 | 0.000000 | 0 |
| 9 | 1.000000 | 0.000000 | 0 |
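With `n_neighbors=6` and the default uniform weights, the predicted probability is just the fraction of the six neighbors that are abnormal, so it can only take seven distinct values. The sketch below (assuming uniform weights, which is the `KNeighborsClassifier` default) shows what a 0.20 threshold means in terms of neighbor votes:

```python
from fractions import Fraction

k = 6  # n_neighbors used in the fitted model above
possible_probs = [Fraction(i, k) for i in range(k + 1)]
threshold = Fraction(1, 5)  # 0.20

# Smallest number of abnormal neighbors that clears the threshold.
min_votes = min(i for i in range(k + 1) if Fraction(i, k) >= threshold)

print([round(float(p), 3) for p in possible_probs])
print(min_votes)  # 2
```

Row 3 in the table above, with one abnormal neighbor out of six (0.1667), falls just short of the threshold. Any threshold between 1/6 and 2/6 gives identical predictions, which limits how finely the threshold can be tuned.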


In [77]:
conmat = np.array(confusion_matrix(y_test, y_pred_data.pred_class_thresh10.values, labels=[1,0]))
confusion = pd.DataFrame(conmat, index=['is_abnormal', 'is_not_abnormal'],
                         columns=['predicted_abnormal','predicted_not_abnormal'])
confusion

| | predicted_abnormal | predicted_not_abnormal |
|---|---|---|
| is_abnormal | 2 | 28 |
| is_not_abnormal | 12 | 435 |
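To put numbers on the trade-off, precision, recall, F1, and accuracy can be computed directly from the thresholded confusion matrix above:

```python
# Counts from the confusion matrix at the 0.20 threshold.
tp, fn = 2, 28     # actual abnormal row
fp, tn = 12, 435   # actual not-abnormal row

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.143 recall=0.067 f1=0.091 accuracy=0.916
```

Lowering the threshold catches 2 of the 30 abnormal sales at the cost of 12 false positives, and accuracy drops below the majority-class baseline, which is consistent with the conclusion that these features do not separate abnormal sales well.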
