# Technical Objective in short
Training on the 'Prop_Data_Final' dataset and testing on the 'Score2'

# Technical Objective in detail
***
Your objective is to predict the sales price of 100 real estate properties before they are actually sold. To build the best model with the highest predictive power, you need to train many models and select the best one which you will deploy into making the predictions.<br><br>
There are several ways you can try to boost the generalization of the model without having to build super complex algorithms.<br><br>
**One approach** you may want to consider is spend a good amount of time understanding the data in terms of the **informative attributes**. What are the correlations between the target variable and the input variables you have? What input variables are redundant?<br><br>
An attribute (i.e., feature, variable) isn't necessarily informative in its original scale and shape. You can consider transforming it to make it more "accessible" for your algorithms to exploit. Besides the necessary techniques that you have to use in order to include the variables of certain types, e.g.,
- categorizing, recoding, and/or re-weighting nominal/ordinal variables
- standardizing interval variables for algorithms such as kNN<br>

you may also find it benefitial to logarithm-transform certain interval input variables, and maybe even the target variable, if the true relationship between the target and the input variables are better captured after the log-transformation. This may be true if both the target and the input variables are interval and have similarly skewed distributions with outliers that similarly lie at the extreme end of the corresponding distribution. In this case, log-transforming both the target and the input will help you better deal with the outliers and decrease the _mean absolute percentage error_.<br><br>

Though increasing model complexity can help improve accuracy, you want to avoid too much of overfitting. The art of balancing the bias-variance tradeoff is a topic of _regularization_. Regularization is beyond the scope of this course; however, I encourage you to self-learn the essence this concept and explore the regularization tools provided in `sklearn` or even beyond `sklearn` package. <u>_That said, this is not required, and is only recommended after you have fulfilled all necessary steps of this project._</u> Good luck!

# Project Domain and Dataset
***
### Predicting the Sales Price of Real Estate Properties
The entire training data consist of 4 data files (.csv format), that is, "Property_Survey_1", "Property_Survey_2", "House_Features", and "Quality_Assessment". After _reading_ each data file, eventually you might want to merge them into a single DataFrame; before merging, however, you need to understand the underlying _associations_ of the data contained in each of the data files.

# Hands-on Section
***
Use the space below to build the prerequisite of the project

## Step 0. Import Necessary Packages, Define Utilities
This space is for importing packages and, if needed, for defining any custom functions that may help enhance the efficiency of your project flow as well as the readability of your project codes

In [83]:
# Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%matplotlib notebook

# import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics


In [178]:
df = pd.read_csv('House_Feature.csv')

In [86]:
#return head 3
df.head(3)

Unnamed: 0,PID,YearBuilt,YearRemodel,VeneerExterior,BsmtFinTp,BsmtFinSqft,BsmtUnfinSqft,HeatingQC,FstFlrSqft,SecFlrSqft,...,FullBathHouse,BdrmAbvGrnd,RmAbvGrnd,Fireplaces,GarageTp,GarageCars,GarageArea,WdDckSqft,OpenPrchSqft,SalePrice
0,526301100,1960,1960,112,1,639,441,0,1656,0,...,1,3,7,2,3,2,528,210,62,215000
1,526350040,1961,1961,0,1,468,270,1,896,0,...,1,2,5,0,3,1,730,140,0,105000
2,526351010,1958,1958,108,1,923,406,1,1329,0,...,1,3,6,0,3,1,312,393,36,172000


In [85]:
df.shape

(2370, 23)

## Step 1. Import Dataset, Understand Basic Info About Properties in each DataFrame & (possibly) Merge DataFrames
**The basic info you should check of each individual DataFrame include, but may not be limited to, the following:**
- shape, columns and data types
- interval variables: summary statistics
- nominal/binary variables: distributions of unique levels
- both interval and nominal variables: skewness of distributions and possible outliers
- missing values

**Should you do something to the two "Property Survey" data files first before conducting surface-level explorations of the variables in these data? It's your call!**<br><br>
>_Data File Quality Assessment_: Don't forget to check if there is any duplicate property (in terms of the `"PID"` column values) between the two Property Survey datasets before doing anything about them; if there is, how should you deal with it?

**Should you conduct deep EDAs (see below step) _before_ or _after_ you merge the separate DataFames into a single DataFrame?**<br><br>
After you merge them into a single DataFrame, it will be wide! This means there will be many columns in this **big** DataFrame. Will you be able to manage the EDAs well by working with this big DataFrame, or will you be better off dealing with each DataFrame first before you have to merge them in order to get correlation matrix, etc.? Again, it's your call.

In [87]:
# Returns Datatype of each column
df.dtypes

PID               int64
YearBuilt         int64
YearRemodel       int64
VeneerExterior    int64
BsmtFinTp         int64
BsmtFinSqft       int64
BsmtUnfinSqft     int64
HeatingQC         int64
FstFlrSqft        int64
SecFlrSqft        int64
AbvGrndLiving     int64
FullBathBsmt      int64
HalfBathHouse     int64
FullBathHouse     int64
BdrmAbvGrnd       int64
RmAbvGrnd         int64
Fireplaces        int64
GarageTp          int64
GarageCars        int64
GarageArea        int64
WdDckSqft         int64
OpenPrchSqft      int64
SalePrice         int64
dtype: object

In [88]:
# Return Columns
df.columns


Index(['PID', 'YearBuilt', 'YearRemodel', 'VeneerExterior', 'BsmtFinTp',
       'BsmtFinSqft', 'BsmtUnfinSqft', 'HeatingQC', 'FstFlrSqft', 'SecFlrSqft',
       'AbvGrndLiving', 'FullBathBsmt', 'HalfBathHouse', 'FullBathHouse',
       'BdrmAbvGrnd', 'RmAbvGrnd', 'Fireplaces', 'GarageTp', 'GarageCars',
       'GarageArea', 'WdDckSqft', 'OpenPrchSqft', 'SalePrice'],
      dtype='object')

In [89]:
# Returns the statistics of each column
df.describe()

Unnamed: 0,PID,YearBuilt,YearRemodel,VeneerExterior,BsmtFinTp,BsmtFinSqft,BsmtUnfinSqft,HeatingQC,FstFlrSqft,SecFlrSqft,...,FullBathHouse,BdrmAbvGrnd,RmAbvGrnd,Fireplaces,GarageTp,GarageCars,GarageArea,WdDckSqft,OpenPrchSqft,SalePrice
count,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0,...,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0
mean,715330700.0,1970.570886,1984.08692,88.143882,0.704641,429.805907,557.101688,2.153586,1116.442616,325.198734,...,1.508017,2.816456,6.25865,0.589451,2.232911,1.718987,457.978903,91.650633,46.002532,173730.772574
std,188640100.0,30.109415,20.694221,158.718586,0.4563,408.779757,410.685375,0.944222,344.282409,406.198426,...,0.501727,0.742817,1.392625,0.630429,1.010171,0.708998,197.608559,120.616635,64.245617,64080.843305
min,526301100.0,1872.0,1950.0,0.0,0.0,0.0,0.0,0.0,407.0,0.0,...,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,12789.0
25%,531369000.0,1953.0,1965.0,0.0,0.0,0.0,239.25,1.0,864.0,0.0,...,1.0,2.0,5.0,0.0,1.0,1.0,312.0,0.0,0.0,129900.0
50%,535455100.0,1972.0,1993.0,0.0,1.0,378.0,470.0,3.0,1056.0,0.0,...,2.0,3.0,6.0,1.0,3.0,2.0,463.0,0.0,25.0,159000.0
75%,907135100.0,1999.0,2003.0,144.0,1.0,715.5,792.0,3.0,1331.5,688.75,...,2.0,3.0,7.0,1.0,3.0,2.0,572.0,168.0,68.0,207000.0
max,1007100000.0,2010.0,2010.0,1600.0,1.0,2085.0,2140.0,3.0,2898.0,1721.0,...,2.0,6.0,12.0,4.0,3.0,5.0,1488.0,736.0,547.0,475000.0


In [90]:
# Returns the number of missing values in the data set.
df.isnull().sum()

PID               0
YearBuilt         0
YearRemodel       0
VeneerExterior    0
BsmtFinTp         0
BsmtFinSqft       0
BsmtUnfinSqft     0
HeatingQC         0
FstFlrSqft        0
SecFlrSqft        0
AbvGrndLiving     0
FullBathBsmt      0
HalfBathHouse     0
FullBathHouse     0
BdrmAbvGrnd       0
RmAbvGrnd         0
Fireplaces        0
GarageTp          0
GarageCars        0
GarageArea        0
WdDckSqft         0
OpenPrchSqft      0
SalePrice         0
dtype: int64

In [179]:
dfa = pd.read_csv('Quality_Assessment.csv')
dfa.shape

(2370, 3)

In [92]:
# Returns Columns
dfa.columns

Index(['PID', 'OverallQuality', 'OverallCondition'], dtype='object')

In [93]:
# Returns Datatypes
dfa.dtypes

PID                 int64
OverallQuality      int64
OverallCondition    int64
dtype: object

In [94]:
dfa.head(3)

Unnamed: 0,PID,OverallQuality,OverallCondition
0,526301100,6,5
1,526350040,5,6
2,526351010,6,6


In [180]:
df = pd.merge(df,dfa, how='outer', on=['PID'])
df.head(3)

Unnamed: 0,PID,YearBuilt,YearRemodel,VeneerExterior,BsmtFinTp,BsmtFinSqft,BsmtUnfinSqft,HeatingQC,FstFlrSqft,SecFlrSqft,...,RmAbvGrnd,Fireplaces,GarageTp,GarageCars,GarageArea,WdDckSqft,OpenPrchSqft,SalePrice,OverallQuality,OverallCondition
0,526301100,1960,1960,112,1,639,441,0,1656,0,...,7,2,3,2,528,210,62,215000,6,5
1,526350040,1961,1961,0,1,468,270,1,896,0,...,5,0,3,1,730,140,0,105000,5,6
2,526351010,1958,1958,108,1,923,406,1,1329,0,...,6,0,3,1,312,393,36,172000,6,6


In [95]:
dfa.describe()

Unnamed: 0,PID,OverallQuality,OverallCondition
count,2370.0,2370.0,2370.0
mean,715330700.0,6.050633,5.63038
std,188640100.0,1.252423,1.095717
min,526301100.0,2.0,1.0
25%,531369000.0,5.0,5.0
50%,535455100.0,6.0,5.0
75%,907135100.0,7.0,6.0
max,1007100000.0,10.0,9.0


In [96]:
# Returns the number of missing values in the data set.
dfa.isnull().sum()

PID                 0
OverallQuality      0
OverallCondition    0
dtype: int64

In [98]:
Property_Survey_1 = pd.read_csv('Property_Survey_1.csv')
Property_Survey_1.shape

(600, 4)

In [99]:
# Returns Datatypes
Property_Survey_1.dtypes

PID         int64
LotArea     int64
LotShape    int64
BldgTp      int64
dtype: object

In [100]:
# Returns columns
Property_Survey_1.columns

Index(['PID', 'LotArea', 'LotShape', 'BldgTp'], dtype='object')

In [101]:
Property_Survey_1.dtypes

PID         int64
LotArea     int64
LotShape    int64
BldgTp      int64
dtype: object

In [103]:
# Returns top 3 values
Property_Survey_1.head(3)

Unnamed: 0,PID,LotArea,LotShape,BldgTp
0,526301100,31770,0,1
1,526350040,11622,1,1
2,526351010,14267,0,1


In [104]:
# Returns number of null values
Property_Survey_1.isnull().sum()

PID         0
LotArea     0
LotShape    0
BldgTp      0
dtype: int64

In [105]:
# Reading the dataset from a csv file 'Property_Survey_2.csv'
Property_Survey_2 = pd.read_csv('Property_Survey_2.csv')
Property_Survey_2.shape

(1770, 4)

In [106]:
# Returns datatypes
Property_Survey_2.dtypes

PID         int64
LotArea     int64
LotShape    int64
BldgTp      int64
dtype: object

In [107]:
# Returns columns
Property_Survey_2.columns

Index(['PID', 'LotArea', 'LotShape', 'BldgTp'], dtype='object')

In [108]:
# Returns statistics of each column of the dataset
Property_Survey_2.describe()

Unnamed: 0,PID,LotArea,LotShape,BldgTp
count,1770.0,1770.0,1770.0,1770.0
mean,736992000.0,9801.694915,0.636723,0.883616
std,188049200.0,6587.298945,0.48108,0.320776
min,526302000.0,1300.0,0.0,0.0
25%,532359600.0,7200.0,0.0,1.0
50%,902329600.0,9305.5,1.0,1.0
75%,907255000.0,11268.75,1.0,1.0
max,1007100000.0,159000.0,1.0,1.0


In [109]:
# Retruns top 3 entries
Property_Survey_2.head(3)

Unnamed: 0,PID,LotArea,LotShape,BldgTp
0,903430060,5520,1,1
1,903451090,6876,1,1
2,903458170,6240,1,1


In [110]:
# Returns number of null values
Property_Survey_2.isnull().sum()

PID         0
LotArea     0
LotShape    0
BldgTp      0
dtype: int64

In [111]:
# Merging Property_Survey_1 and Property_Survey_2 dataframes into Property_Survey
Property_Survey = Property_Survey_1.append(Property_Survey_2)
Property_Survey.shape

(2370, 4)

In [188]:
df = pd.read_csv('House_Feature.csv')
dfa = pd.read_csv('Quality_Assessment.csv')
df = pd.merge(df,dfa, how='outer', on=['PID'])
df = pd.merge(df, Property_Survey, how = 'outer', on=['PID'] )

### Correlations

#### Correlations

In [None]:
### Correlations

In [None]:
#### Correlations

In [124]:
# Prop_data1 = Prop_Data[[columns]]
Prop_Data1 = df[['PID','SalePrice','YearBuilt','YearRemodel','VeneerExterior','BsmtFinTp','BsmtFinSqft','BsmtUnfinSqft','HeatingQC','FstFlrSqft','SecFlrSqft',
                    'AbvGrndLiving','FullBathBsmt','HalfBathHouse','FullBathHouse']]

# Prop_data2 = Prop_Data[[columns]]
Prop_Data2 = df[['PID','SalePrice', 'BdrmAbvGrnd', 'RmAbvGrnd', 'Fireplaces', 'GarageTp', 'GarageCars','GarageArea','WdDckSqft','OpenPrchSqft','SalePrice','OverallQuality',
                'OverallCondition', 'LotArea', 'LotShape', 'BldgTp']]


In [125]:
Prop_Data1.head(3)

Unnamed: 0,PID,SalePrice,YearBuilt,YearRemodel,VeneerExterior,BsmtFinTp,BsmtFinSqft,BsmtUnfinSqft,HeatingQC,FstFlrSqft,SecFlrSqft,AbvGrndLiving,FullBathBsmt,HalfBathHouse,FullBathHouse
0,526301100,215000,1960,1960,112,1,639,441,0,1656,0,1656,1,0,1
1,526350040,105000,1961,1961,0,1,468,270,1,896,0,896,0,0,1
2,526351010,172000,1958,1958,108,1,923,406,1,1329,0,1329,0,1,1


In [126]:
Prop_Data2.head(3)

Unnamed: 0,PID,SalePrice,BdrmAbvGrnd,RmAbvGrnd,Fireplaces,GarageTp,GarageCars,GarageArea,WdDckSqft,OpenPrchSqft,SalePrice.1,OverallQuality,OverallCondition,LotArea,LotShape,BldgTp
0,526301100,215000,3,7,2,3,2,528,210,62,215000,6,5,31770,0,1
1,526350040,105000,2,5,0,3,1,730,140,0,105000,5,6,11622,1,1
2,526351010,172000,3,6,0,3,1,312,393,36,172000,6,6,14267,0,1


#### Final df

In [127]:
# Returns the correlation of Prop_Data1
Prop_Data1.corr()

Unnamed: 0,PID,SalePrice,YearBuilt,YearRemodel,VeneerExterior,BsmtFinTp,BsmtFinSqft,BsmtUnfinSqft,HeatingQC,FstFlrSqft,SecFlrSqft,AbvGrndLiving,FullBathBsmt,HalfBathHouse,FullBathHouse
PID,1.0,-0.220939,-0.334114,-0.132115,-0.209688,-0.086958,-0.112254,-0.047409,-0.068412,-0.153736,0.007854,-0.10711,-0.058387,-0.16708,-0.185578
SalePrice,-0.220939,1.0,0.585731,0.525906,0.415283,0.116483,0.384289,0.158441,0.438989,0.629534,0.284066,0.741919,0.247587,0.303561,0.590486
YearBuilt,-0.334114,0.585731,1.0,0.592891,0.277729,0.133977,0.246858,0.121896,0.435485,0.305742,0.028533,0.248082,0.188796,0.28172,0.525114
YearRemodel,-0.132115,0.525906,0.592891,1.0,0.138039,-0.061026,0.076375,0.163324,0.517051,0.220909,0.155815,0.306065,0.090458,0.208347,0.495482
VeneerExterior,-0.209688,0.415283,0.277729,0.138039,1.0,0.114318,0.220944,0.044521,0.133432,0.303536,0.113385,0.332229,0.102816,0.185576,0.23759
BsmtFinTp,-0.086958,0.116483,0.133977,-0.061026,0.114318,1.0,0.680872,-0.596296,-0.086697,0.153707,-0.194873,-0.065909,0.510811,-0.067914,-0.08
BsmtFinSqft,-0.112254,0.384289,0.246858,0.076375,0.220944,0.680872,1.0,-0.58104,0.027391,0.402694,-0.203622,0.118312,0.615401,-0.054916,0.068709
BsmtUnfinSqft,-0.047409,0.158441,0.121896,0.163324,0.044521,-0.596296,-0.58104,1.0,0.195086,0.28995,-0.01975,0.211021,-0.442824,-0.049871,0.254172
HeatingQC,-0.068412,0.438989,0.435485,0.517051,0.133432,-0.086697,0.027391,0.195086,1.0,0.173391,0.173745,0.288213,0.053106,0.173408,0.379653
FstFlrSqft,-0.153736,0.629534,0.305742,0.220909,0.303536,0.153707,0.402694,0.28995,0.173391,1.0,-0.310932,0.487889,0.2403,-0.153809,0.350441


In [128]:
# Returns the correlation Prop_Data2
Prop_Data2.corr()

Unnamed: 0,PID,SalePrice,BdrmAbvGrnd,RmAbvGrnd,Fireplaces,GarageTp,GarageCars,GarageArea,WdDckSqft,OpenPrchSqft,SalePrice.1,OverallQuality,OverallCondition,LotArea,LotShape,BldgTp
PID,1.0,-0.220939,0.006184,-0.068212,-0.089072,-0.284161,-0.226571,-0.196249,-0.030366,-0.072043,-0.220939,-0.235921,0.137142,0.033321,0.117828,0.157367
SalePrice,-0.220939,1.0,0.192239,0.527042,0.462909,0.459829,0.669653,0.652042,0.293314,0.334222,1.0,0.780217,-0.136934,0.290162,-0.322661,0.035231
BdrmAbvGrnd,0.006184,0.192239,1.0,0.68003,0.092314,0.010866,0.099996,0.087177,0.047128,0.108488,0.192239,0.089851,-0.001963,0.18544,-0.042815,0.368024
RmAbvGrnd,-0.068212,0.527042,0.68003,1.0,0.313321,0.153998,0.347685,0.311,0.152862,0.219257,0.527042,0.39182,-0.077297,0.217184,-0.135676,0.275416
Fireplaces,-0.089072,0.462909,0.092314,0.313321,1.0,0.292059,0.298027,0.261026,0.191108,0.141652,0.462909,0.347647,-0.064555,0.2612,-0.187162,0.035758
GarageTp,-0.284161,0.459829,0.010866,0.153998,0.292059,1.0,0.440926,0.415059,0.231618,0.135299,0.459829,0.400395,-0.201731,0.194103,-0.246136,-0.034654
GarageCars,-0.226571,0.669653,0.099996,0.347685,0.298027,0.440926,1.0,0.882917,0.209642,0.202938,0.669653,0.592664,-0.221783,0.176603,-0.250359,-0.02317
GarageArea,-0.196249,0.652042,0.087177,0.311,0.261026,0.415059,0.882917,1.0,0.204467,0.209486,0.652042,0.535823,-0.192911,0.203278,-0.219441,0.046293
WdDckSqft,-0.030366,0.293314,0.047128,0.152862,0.191108,0.231618,0.209642,0.204467,1.0,0.018564,0.293314,0.20619,-0.001546,0.117744,-0.167002,0.007968
OpenPrchSqft,-0.072043,0.334222,0.108488,0.219257,0.141652,0.135299,0.202938,0.209486,0.018564,1.0,0.334222,0.283925,-0.090611,0.09027,-0.096851,-0.002307


In [192]:
# Prop_Data_Final = Prop_Data[[columns]]

Prop_Data_Final =  df[['YearBuilt', 'YearRemodel', 'VeneerExterior', 'HeatingQC', 'FstFlrSqft', 'AbvGrndLiving', 'FullBathHouse','Fireplaces', 'RmAbvGrnd', 'GarageCars',
       'GarageArea', 'OverallQuality']]

## Step 3: Build Models -- Type 1

### Step 3.1: Prepare Data for Modeling -- Type 1

In [193]:
# Separating X and y
import statsmodels.api as sm

X = Prop_Data_Final
X = sm.add_constant(X) 



In [194]:
X.head(3)

Unnamed: 0,const,YearBuilt,YearRemodel,VeneerExterior,HeatingQC,FstFlrSqft,AbvGrndLiving,FullBathHouse,Fireplaces,RmAbvGrnd,GarageCars,GarageArea,OverallQuality
0,1.0,1960,1960,112,0,1656,1656,1,2,7,2,528,6
1,1.0,1961,1961,0,1,896,896,1,0,5,1,730,5
2,1.0,1958,1958,108,1,1329,1329,1,0,6,1,312,6


In [196]:
est=sm.OLS(y, X)
est = est.fit()
est.summary()

0,1,2,3
Dep. Variable:,SalePrice,R-squared:,0.848
Model:,OLS,Adj. R-squared:,0.847
Method:,Least Squares,F-statistic:,1093.0
Date:,"Tue, 09 Jun 2020",Prob (F-statistic):,0.0
Time:,22:56:25,Log-Likelihood:,-27364.0
No. Observations:,2370,AIC:,54750.0
Df Residuals:,2357,BIC:,54830.0
Df Model:,12,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.315e+06,7.02e+04,-18.730,0.000,-1.45e+06,-1.18e+06
YearBuilt,344.2661,26.040,13.221,0.000,293.202,395.330
YearRemodel,303.1460,34.594,8.763,0.000,235.308,370.984
VeneerExterior,19.7426,3.603,5.479,0.000,12.677,26.808
HeatingQC,3603.8284,660.738,5.454,0.000,2308.140,4899.517
FstFlrSqft,33.2973,1.882,17.696,0.000,29.607,36.987
AbvGrndLiving,52.4800,2.493,21.049,0.000,47.591,57.369
FullBathHouse,-1.062e+04,1535.817,-6.917,0.000,-1.36e+04,-7610.814
Fireplaces,8227.7668,955.608,8.610,0.000,6353.847,1.01e+04

0,1,2,3
Omnibus:,246.879,Durbin-Watson:,1.619
Prob(Omnibus):,0.0,Jarque-Bera (JB):,981.893
Skew:,0.451,Prob(JB):,6.09e-214
Kurtosis:,6.022,Cond. No.,461000.0


In [195]:
y = Prop_Data1['SalePrice']
y.head(3)

0    215000
1    105000
2    172000
Name: SalePrice, dtype: int64

In [197]:
est.pvalues

const             4.367220e-73
YearBuilt         1.516314e-38
YearRemodel       3.564203e-18
VeneerExterior    4.730234e-08
HeatingQC         5.431420e-08
FstFlrSqft        6.865592e-66
AbvGrndLiving     2.842776e-90
FullBathHouse     5.940264e-12
Fireplaces        1.313440e-17
RmAbvGrnd         3.655058e-01
GarageCars        8.929456e-01
GarageArea        1.009834e-14
OverallQuality    4.982750e-86
dtype: float64

### Step 3.2: Model Building, Assessment & Tuning -- Type 1

In [80]:
# OLS Regression Fitting 
resultProp1 = 

0,1,2,3
Dep. Variable:,SalePrice,R-squared:,0.843
Model:,OLS,Adj. R-squared:,0.842
Method:,Least Squares,F-statistic:,1150.0
Date:,"Mon, 08 Jun 2020",Prob (F-statistic):,0.0
Time:,16:23:48,Log-Likelihood:,-27400.0
No. Observations:,2370,AIC:,54820.0
Df Residuals:,2358,BIC:,54890.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.479e+06,7.05e+04,-20.975,0.000,-1.62e+06,-1.34e+06
YearBuilt,448.1131,25.853,17.333,0.000,397.416,498.810
YearRemodel,276.2025,34.983,7.895,0.000,207.603,344.802
VeneerExterior,25.6079,3.655,7.006,0.000,18.440,32.776
HeatingQC,3448.9209,671.456,5.136,0.000,2132.216,4765.626
FstFlrSqft,34.8620,1.899,18.361,0.000,31.139,38.585
AbvGrndLiving,59.0611,2.489,23.728,0.000,54.180,63.942
FullBathHouse,-9539.3854,1545.180,-6.174,0.000,-1.26e+04,-6509.333
RmAbvGrnd,-2357.4483,672.348,-3.506,0.000,-3675.902,-1038.994

0,1,2,3
Omnibus:,238.823,Durbin-Watson:,1.587
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1067.731
Skew:,0.39,Prob(JB):,1.4e-232
Kurtosis:,6.195,Cond. No.,1600000.0


In [41]:
# Get pvalues of resultProp1

const              1.049213e-89
YearBuilt          1.914930e-63
YearRemodel        4.393487e-15
VeneerExterior     3.189317e-12
HeatingQC          3.028608e-07
FstFlrSqft         1.738025e-70
AbvGrndLiving     8.627239e-112
FullBathHouse      7.833888e-10
RmAbvGrnd          4.628817e-04
OverallQuality    8.258355e-124
LotArea            1.119435e-17
BldgTp             7.762563e-11
dtype: float64

In [187]:
from sklearn.preprocessing import StandardScaler

data_scaler = StandardScaler().fit(df)
stan_df = data_scaler.transform(df)

stan_df.columns = ['YearBuilt', 'YearRemodel', 'VeneerExterior', 'HeatingQC', 'FstFlrSqft', 'AbvGrndLiving', 'FullBathHouse','Fireplaces', 'RmAbvGrnd', 'GarageCars',
       'GarageArea', 'OverallQuality']

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

## Step 4. Build Models -- Type 2

### Step 4.1: Prepare Data for Modeling -- Type 2

In [186]:
# Create a new stan_df containing the standardized values of Prop_Data_Final
stan_df

array([[-1.00227651, -0.35115651, -1.16418991, ...,  0.64415408,
        -0.04043651, -0.57543383],
       [-1.00201702, -0.31793729, -1.11585704, ..., -1.07278984,
        -0.83905757,  0.33740297],
       [-1.00201188, -0.41759493, -1.26085564, ..., -0.0270149 ,
        -0.04043651,  0.33740297],
       ...,
       [ 1.10257217,  0.41288537, -0.05253395, ..., -0.66696673,
        -0.83905757, -0.57543383],
       [ 1.10322978,  0.71185828,  0.38246186, ..., -0.65135815,
        -0.83905757, -0.57543383],
       [ 1.10694104,  0.11391246, -0.43919689, ..., -0.05823207,
        -0.83905757, -0.57543383]])

In [168]:
# Returns statistics of stan_df
stan_df.describe()

Unnamed: 0,YearBuilt,YearRemodel,VeneerExterior,HeatingQC,FstFlrSqft,AbvGrndLiving,FullBathHouse,RmAbvGrnd,OverallQuality,LotArea,BldgTp
count,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0,2370.0
mean,1970.570886,1984.08692,88.143882,2.153586,1116.442616,1446.174262,1.508017,6.25865,6.050633,9700.865401,0.877215
std,30.109415,20.694221,158.718586,0.944222,344.282409,445.597554,0.501727,1.392625,1.252423,6153.729681,0.328259
min,1872.0,1950.0,0.0,0.0,407.0,407.0,0.0,3.0,2.0,1300.0,0.0
25%,1953.0,1965.0,0.0,1.0,864.0,1105.5,1.0,5.0,5.0,7200.0,1.0
50%,1972.0,1993.0,0.0,3.0,1056.0,1397.0,2.0,6.0,6.0,9316.0,1.0
75%,1999.0,2003.0,144.0,3.0,1331.5,1688.75,2.0,7.0,7.0,11235.0,1.0
max,2010.0,2010.0,1600.0,3.0,2898.0,3608.0,2.0,12.0,10.0,159000.0,1.0


In [169]:
# Makes a copy of the DataFrame['SalePrice'], here, acting as a dependent variable y
y = Prop_Data1['SalePrice']

# Independent Variable X = stan_df
X = stan_df

In [172]:
# instantiate
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

# print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)

#printing the output and coefficients
coeff_df = pd.DataFrame(linreg.coef_,X.columns,columns=['Coefficient']) 
coeff_df

-1505570.6225751594
[ 4.52385342e+02  2.86137692e+02  2.13894790e+01  3.35073867e+03
  3.44028144e+01  6.10201556e+01 -1.01272014e+04 -2.57616632e+03
  1.59551990e+04  8.14589909e-01  1.26484036e+04]


Unnamed: 0,Coefficient
YearBuilt,452.385342
YearRemodel,286.137692
VeneerExterior,21.389479
HeatingQC,3350.738672
FstFlrSqft,34.402814
AbvGrndLiving,61.020156
FullBathHouse,-10127.201378
RmAbvGrnd,-2576.16632
OverallQuality,15955.199002
LotArea,0.81459


In [170]:
# Prepare the test / tra# Prepare the test / train sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [171]:
# Get the sets size
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(1896, 11)
(474, 11)
(1896,)
(474,)


In [94]:
# Linear Regression model fitting

# Create a variable myScore_regression containing the cross validation score


In [95]:
myScore

array([-5.74834224e+08, -8.27308187e+08, -5.82458955e+08, -6.76715104e+08,
       -5.08042092e+08, -9.15864740e+08, -5.01013586e+08, -7.62532604e+08,
       -5.97044154e+08, -7.22381486e+08])

In [102]:
# Plotting the accuracy

# Plot the Accuracy using the Cross validation score

<IPython.core.display.Javascript object>

Text(0.5, 1.0, 'Accuracy of x-valid regression model')

## Step 6: Deployment - Import, Understand & Prepare Score Data Set

In [50]:
# Reading dataset in Score

In [51]:
# Retruns shape

(100, 27)

In [52]:
# Retruns top 5 entries

Unnamed: 0,PID,LotArea,LotShape,BldgTp,OverallQuality,OverallCondition,YearBuilt,YearRemodel,VeneerExterior,BsmtFinTp,...,HalfBathHouse,FullBathHouse,BdrmAbvGrnd,RmAbvGrnd,Fireplaces,GarageTp,GarageCars,GarageArea,WdDckSqft,OpenPrchSqft
0,528445060,8987,1,1,8,5,2005,2006,226.0,0,...,0,2,2.0,6.0,1,3,3,880,144,0
1,528456160,9215,1,1,7,5,2009,2010,0.0,0,...,0,2,2.0,4.0,0,3,2,676,0,136
2,528458070,8640,1,1,7,5,2009,2009,0.0,1,...,1,2,3.0,7.0,0,3,2,614,169,45
3,906380190,6762,1,1,7,5,2006,2006,24.0,1,...,0,2,2.0,6.0,0,3,2,632,105,61
4,906385010,10402,0,1,7,5,2009,2009,0.0,0,...,0,2,3.0,6.0,0,3,3,740,0,36


In [53]:
# Retruns Columns

Index(['PID', 'LotArea', 'LotShape', 'BldgTp', 'OverallQuality',
       'OverallCondition', 'YearBuilt', 'YearRemodel', 'VeneerExterior',
       'BsmtFinTp', 'BsmtFinSqft', 'BsmtUnfinSqft', 'HeatingQC', 'FstFlrSqft',
       'SecFlrSqft', 'AbvGrndLiving', 'FullBathBsmt', 'HalfBathHouse',
       'FullBathHouse', 'BdrmAbvGrnd', 'RmAbvGrnd', 'Fireplaces', 'GarageTp',
       'GarageCars', 'GarageArea', 'WdDckSqft', 'OpenPrchSqft'],
      dtype='object')

In [107]:
# Score2 = Score[[columns]]

In [106]:
Score2.head()

Unnamed: 0,PID,YearBuilt,YearRemodel,VeneerExterior,HeatingQC,FstFlrSqft,AbvGrndLiving,FullBathHouse,RmAbvGrnd,OverallQuality,LotArea,BldgTp
0,528445060,2005,2006,226.0,3,1595.0,1595,2,6.0,8,8987,1
1,528456160,2009,2010,0.0,3,1218.0,1218,2,4.0,7,9215,1
2,528458070,2009,2009,0.0,3,764.0,1547,2,7.0,7,8640,1
3,906380190,2006,2006,24.0,3,1208.0,1208,2,6.0,7,6762,1
4,906385010,2009,2009,0.0,3,1226.0,1226,2,6.0,7,10402,1


In [109]:
# Prop_Data_Final2 = Prop_Data[[columns]]

########################################
Prop_Data_Final2.head()

Unnamed: 0,PID,YearBuilt,YearRemodel,VeneerExterior,HeatingQC,FstFlrSqft,AbvGrndLiving,FullBathHouse,RmAbvGrnd,OverallQuality,LotArea,BldgTp,SalePrice
0,526301100,1960,1960,112,0,1656,1656,1,7,6,31770,1,215000
1,526350040,1961,1961,0,1,896,896,1,5,5,11622,1,105000
2,526351010,1958,1958,108,1,1329,1329,1,6,6,14267,1,172000
3,526353030,1968,1968,0,3,2110,2110,2,8,7,11160,1,244000
4,527105010,1997,1998,0,2,928,1629,2,6,5,13830,1,189900


## Step 7: Predict Target of Score Data Set

In [110]:
# Creating a linear regressor

In [None]:
# Split into X and y

In [111]:
# Fitting linear Regression model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [114]:
# Testing linear Regression model on Score2 dataset

## Step 8: Prepare & Export (save) Required DataFrame

In [116]:
# Dropping columns except PID in Score2

In [120]:
# Appending the precitions NumPy array 'Scored2' to the Pandas DataFrame 'Score2'
Score2['SalePrice'] = Scored2[:]
Score2.head()

Unnamed: 0,PID,SalePrice
0,528445060,255563.813486
1,528456160,205967.764228
2,528458070,201744.139225
3,906380190,196436.350297
4,906385010,202716.520035


In [118]:
# Saving the dataset to a .csv file
Score2.to_csv('scoreddata.csv')

##### Congratulations! You nailed it!