# <font color=blue>Assignments for "Understanding The Relationship"</font>

To close out this lesson, you're going to do three assignments. For the first assignment, you'll write up a short answer to a question in a Gist file. For the other two assignments, you'll do your work in Jupyter notebooks, and you should link to those notebooks in the same Gist file.

Please submit a single Gist file containing the answer to the first assignment, plus links to the notebooks for the other two.

## 1. Interpretation and significance

Suppose that we would like to know how much families in the US are spending on recreation annually. We estimated the following model:

$$ expenditure = 873 + 0.0012 \cdot annual\_income + 0.00002 \cdot annual\_income^2 - 223.57 \cdot have\_kids $$

*expenditure* is the annual spending on recreation in US dollars, *annual_income* is the annual income in US dollars, and *have_kids* is a dummy variable indicating families with children. Interpret the estimated coefficients. What additional statistics should be given in order to make sure that your interpretations make sense statistically? Write up your answer and save it in a Gist.

- We can apply a t-test to each coefficient to check whether it is statistically significant.
- To do that, we also need the standard errors of the estimates, or equivalently the t-values and p-values.
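
To make the interpretation concrete, the estimated equation can be evaluated directly. The sketch below is only an illustration: the function names and the example income of 10,000 dollars are assumptions, not part of the assignment. It shows that the marginal effect of income is not constant, because the squared term makes it 0.0012 + 2 × 0.00002 × annual_income, and that having children shifts predicted spending down by 223.57 dollars at any income level.

```python
def predicted_expenditure(annual_income, have_kids):
    """Evaluate the estimated model from the question."""
    return (873
            + 0.0012 * annual_income
            + 0.00002 * annual_income ** 2
            - 223.57 * have_kids)


def marginal_effect_of_income(annual_income):
    """Derivative of expenditure with respect to annual_income."""
    return 0.0012 + 2 * 0.00002 * annual_income


income = 10_000  # hypothetical income, chosen only for illustration
no_kids = predicted_expenditure(income, have_kids=0)
with_kids = predicted_expenditure(income, have_kids=1)

print(f"Predicted spending without kids: {no_kids:.2f}")
print(f"Predicted spending with kids:    {with_kids:.2f}")
print(f"Extra spending per extra dollar: {marginal_effect_of_income(income):.4f}")
```

Whether any of these effects can be trusted still depends on the standard errors, which the question does not report.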

## 2. Weather model

In this exercise, you'll work with the historical temperature data from the previous lesson. To complete this assignment, submit a link in the Gist file to the Jupyter notebook containing your solutions to the following tasks:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm


In [2]:
weather_df = pd.read_csv("https://djl-lms-assets.s3.eu-central-1.amazonaws.com/datasets/weatherHistory.csv")
weather_df.head()

| Unnamed: 0 | Formatted Date | Summary | Precip Type | Temperature (C) | Apparent Temperature (C) | Humidity | Wind Speed (km/h) | Wind Bearing (degrees) | Visibility (km) | Loud Cover | Pressure (millibars) | Daily Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006-04-01 00:00:00.000 +0200 | Partly Cloudy | rain | 9.472222 | 7.388889 | 0.89 | 14.1197 | 251.0 | 15.8263 | 0.0 | 1015.13 | Partly cloudy throughout the day. |
| 1 | 2006-04-01 01:00:00.000 +0200 | Partly Cloudy | rain | 9.355556 | 7.227778 | 0.86 | 14.2646 | 259.0 | 15.8263 | 0.0 | 1015.63 | Partly cloudy throughout the day. |
| 2 | 2006-04-01 02:00:00.000 +0200 | Mostly Cloudy | rain | 9.377778 | 9.377778 | 0.89 | 3.9284 | 204.0 | 14.9569 | 0.0 | 1015.94 | Partly cloudy throughout the day. |
| 3 | 2006-04-01 03:00:00.000 +0200 | Partly Cloudy | rain | 8.288889 | 5.944444 | 0.83 | 14.1036 | 269.0 | 15.8263 | 0.0 | 1016.41 | Partly cloudy throughout the day. |
| 4 | 2006-04-01 04:00:00.000 +0200 | Mostly Cloudy | rain | 8.755556 | 6.977778 | 0.83 | 11.0446 | 259.0 | 15.8263 | 0.0 | 1016.51 | Partly cloudy throughout the day. |


- Build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables? 

In [3]:
y = weather_df['Temperature (C)'] - weather_df['Apparent Temperature (C)']  # temperature minus apparent temperature: the task's difference with the sign reversed

X = weather_df[['Humidity', 'Wind Speed (km/h)']]

In [4]:
X = sm.add_constant(X)

results = sm.OLS(y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.288 |
| Model: | OLS | Adj. R-squared: | 0.288 |
| Method: | Least Squares | F-statistic: | 19490.0 |
| Date: | Sat, 08 May 2021 | Prob (F-statistic): | 0.0 |
| Time: | 19:16:20 | Log-Likelihood: | -170460.0 |
| No. Observations: | 96453 | AIC: | 340900.0 |
| Df Residuals: | 96450 | BIC: | 340900.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -2.4381 | 0.021 | -115.948 | 0.000 | -2.479 | -2.397 |
| Humidity | 3.0292 | 0.024 | 126.479 | 0.000 | 2.982 | 3.076 |
| Wind Speed (km/h) | 0.1193 | 0.001 | 176.164 | 0.000 | 0.118 | 0.121 |

| | | | |
|---|---|---|---|
| Omnibus: | 3935.747 | Durbin-Watson: | 0.264 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4613.311 |
| Skew: | 0.478 | Prob(JB): | 0.0 |
| Kurtosis: | 3.484 | Cond. No. | 88.1 |


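The p-values of all coefficients are reported as 0.000, so all of them are statistically significant at conventional levels. Note that the target as coded is *temperature* minus *apparent temperature*, the reverse of the difference named in the task, so a positive coefficient means the variable makes it feel colder relative to the measured temperature. Both coefficients are positive: holding the other variable fixed, a one-unit increase in humidity (which is recorded as a fraction, e.g. 0.89) is associated with a 3.03 °C increase in the target, and each additional km/h of wind speed with a 0.12 °C increase. The humidity coefficient is larger than the wind speed coefficient, which is in line with my expectation, although the two variables are measured on different scales.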
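
As a rough illustration of what these coefficients imply, the fitted equation can be evaluated for specific conditions. The coefficients below are copied from the summary table; the function name and the example conditions (humidity 0.8, wind speed 0 or 10 km/h) are assumptions chosen only for this sketch.

```python
# Coefficients copied from the first model's summary table
CONST = -2.4381
B_HUMIDITY = 3.0292
B_WIND = 0.1193


def predicted_gap(humidity, wind_speed):
    """Predicted target: temperature minus apparent temperature, in degrees C."""
    return CONST + B_HUMIDITY * humidity + B_WIND * wind_speed


# Hypothetical conditions, chosen only for illustration
calm = predicted_gap(humidity=0.8, wind_speed=0)
windy = predicted_gap(humidity=0.8, wind_speed=10)

print(f"Calm:  {calm:.3f}")
print(f"Windy: {windy:.3f}")
print(f"Change from 10 km/h more wind: {windy - calm:.3f}")
```

In this model the change from extra wind is the same at every humidity level, because there is no interaction term.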

- Next, include the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for *humidity* and *windspeed* change? Interpret the estimated coefficients.

In [5]:
X['humidity*windspeed'] = X['Humidity'] * X['Wind Speed (km/h)']

results = sm.OLS(y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.341 |
| Model: | OLS | Adj. R-squared: | 0.341 |
| Method: | Least Squares | F-statistic: | 16660.0 |
| Date: | Sat, 08 May 2021 | Prob (F-statistic): | 0.0 |
| Time: | 19:16:20 | Log-Likelihood: | -166690.0 |
| No. Observations: | 96453 | AIC: | 333400.0 |
| Df Residuals: | 96449 | BIC: | 333400.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.0839 | 0.033 | -2.511 | 0.012 | -0.149 | -0.018 |
| Humidity | -0.1775 | 0.043 | -4.133 | 0.000 | -0.262 | -0.093 |
| Wind Speed (km/h) | -0.0905 | 0.002 | -36.797 | 0.000 | -0.095 | -0.086 |
| humidity*windspeed | 0.2971 | 0.003 | 88.470 | 0.000 | 0.291 | 0.304 |

| | | | |
|---|---|---|---|
| Omnibus: | 4849.937 | Durbin-Watson: | 0.262 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9295.404 |
| Skew: | 0.378 | Prob(JB): | 0.0 |
| Kurtosis: | 4.32 | Cond. No. | 193.0 |


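The p-values (except that of the constant, which is 0.012) are still reported as 0.000, so all coefficients remain statistically significant at the 5% level. However, the signs of both Humidity and Wind Speed have changed: they are now negative and smaller in absolute value than before. The interaction term, on the other hand, has a positive coefficient (0.2971). With the interaction in the model, the Humidity and Wind Speed coefficients are no longer stand-alone effects: the effect of humidity on the target is -0.1775 + 0.2971 × wind speed, and the effect of wind speed is -0.0905 + 0.2971 × humidity, so each variable's effect grows with the level of the other.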
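
With an interaction term, the effect of one variable depends on the level of the other, so the negative signs on their own can be misleading. The sketch below uses the coefficients from the summary table; the function names and the example levels are assumptions for illustration. It computes each marginal effect and the level at which it changes sign.

```python
# Coefficients copied from the interaction model's summary table
B_HUMIDITY = -0.1775
B_WIND = -0.0905
B_INTERACTION = 0.2971


def humidity_effect(wind_speed):
    """Change in the target per unit of humidity at a given wind speed (km/h)."""
    return B_HUMIDITY + B_INTERACTION * wind_speed


def wind_effect(humidity):
    """Change in the target per km/h of wind at a given humidity (fraction)."""
    return B_WIND + B_INTERACTION * humidity


# Levels at which each marginal effect switches from negative to positive
wind_break_even = -B_HUMIDITY / B_INTERACTION
humidity_break_even = -B_WIND / B_INTERACTION

print(f"Humidity effect at 10 km/h wind: {humidity_effect(10):.4f}")
print(f"Wind effect at humidity 0.5:     {wind_effect(0.5):.4f}")
print(f"Humidity effect turns positive above {wind_break_even:.2f} km/h")
print(f"Wind effect turns positive above humidity {humidity_break_even:.2f}")
```

Under these estimates, both effects are positive for most realistic conditions, which is consistent with the positive coefficients in the model without the interaction.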

## 3. House prices model

In this exercise, you'll interpret your house prices model. To complete this assignment, submit a link in the Gist file to the Jupyter notebook containing your solutions to the following tasks:

In [6]:
house_prices_df = pd.read_csv("https://djl-lms-assets.s3.eu-central-1.amazonaws.com/datasets/house_prices.csv", sep = ";")
house_prices_df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,...,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,...,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,...,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,...,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,...,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,...,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


- Run your house prices model again and interpret the results. Which features are statistically significant and which are not?

In [7]:
house_prices_df_null_ratios = house_prices_df.isna().sum() / len(house_prices_df)
house_prices_df.drop(house_prices_df_null_ratios[house_prices_df_null_ratios > .1].index, axis=1, inplace=True)
house_prices_df.dropna(inplace=True)

In [8]:
X = house_prices_df.select_dtypes(exclude='object').drop('SalePrice', axis=1)
y = house_prices_df['SalePrice']

In [9]:
X = sm.add_constant(X)

results = sm.OLS(y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | SalePrice | R-squared: | 0.808 |
| Model: | OLS | Adj. R-squared: | 0.803 |
| Method: | Least Squares | F-statistic: | 161.5 |
| Date: | Sat, 08 May 2021 | Prob (F-statistic): | 0.0 |
| Time: | 19:16:21 | Log-Likelihood: | -15881.0 |
| No. Observations: | 1338 | AIC: | 31830.0 |
| Df Residuals: | 1303 | BIC: | 32010.0 |
| Df Model: | 34 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.344e+05 | 1.49e+06 | 0.292 | 0.770 | -2.48e+06 | 3.35e+06 |
| Id | -0.4159 | 2.304 | -0.180 | 0.857 | -4.937 | 4.105 |
| MSSubClass | -164.9821 | 28.463 | -5.796 | 0.000 | -220.820 | -109.144 |
| LotArea | 0.3950 | 0.102 | 3.860 | 0.000 | 0.194 | 0.596 |
| OverallQual | 1.854e+04 | 1277.521 | 14.509 | 0.000 | 1.6e+04 | 2.1e+04 |
| OverallCond | 5539.4909 | 1144.985 | 4.838 | 0.000 | 3293.275 | 7785.707 |
| YearBuilt | 343.8423 | 78.784 | 4.364 | 0.000 | 189.286 | 498.399 |
| YearRemodAdd | 149.4128 | 75.737 | 1.973 | 0.049 | 0.833 | 297.992 |
| MasVnrArea | 27.9086 | 6.117 | 4.562 | 0.000 | 15.908 | 39.910 |

| | | | |
|---|---|---|---|
| Omnibus: | 644.55 | Durbin-Watson: | 1.951 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 115179.942 |
| Skew: | -1.119 | Prob(JB): | 0.0 |
| Kurtosis: | 48.398 | Cond. No. | 1.31e+16 |


- Now, exclude the insignificant features from your model. Did anything change?

In [10]:
X_significant = X[['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond',
                   'YearBuilt', 'MasVnrArea', 'BsmtFinSF1', 'TotalBsmtSF',
                    '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'BedroomAbvGr',
                    'KitchenAbvGr', 'TotRmsAbvGrd', 'GarageCars', 'WoodDeckSF',
                    'ScreenPorch']]
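
A feature list like the one above can also be derived from the p-values instead of being typed by hand. The sketch below is a minimal illustration that uses only the rows visible in the full model's summary table; the 0.05 threshold (`ALPHA`) and the variable names are assumptions. The constant is left out of the filter because an intercept is normally kept regardless of its p-value.

```python
# p-values copied from the rows shown in the full model's summary table
p_values = {
    "const": 0.770,
    "Id": 0.857,
    "MSSubClass": 0.000,
    "LotArea": 0.000,
    "OverallQual": 0.000,
    "OverallCond": 0.000,
    "YearBuilt": 0.000,
    "YearRemodAdd": 0.049,
    "MasVnrArea": 0.000,
}

ALPHA = 0.05  # assumed significance level

# Split the features (not the constant) by whether they pass the threshold
significant = [name for name, p in p_values.items()
               if name != "const" and p < ALPHA]
insignificant = [name for name, p in p_values.items()
                 if name != "const" and p >= ALPHA]

print("Significant:  ", significant)
print("Insignificant:", insignificant)
```

Note that YearRemodAdd passes a 0.05 threshold only narrowly (p = 0.049), so whether it is kept depends on the level chosen.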

In [11]:
results = sm.OLS(y, X_significant).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | SalePrice | R-squared (uncentered): | 0.969 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.969 |
| Method: | Least Squares | F-statistic: | 2450.0 |
| Date: | Sat, 08 May 2021 | Prob (F-statistic): | 0.0 |
| Time: | 19:16:21 | Log-Likelihood: | -15919.0 |
| No. Observations: | 1338 | AIC: | 31870.0 |
| Df Residuals: | 1321 | BIC: | 31960.0 |
| Df Model: | 17 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| MSSubClass | -148.4595 | 28.533 | -5.203 | 0.000 | -204.434 | -92.485 |
| LotArea | 0.3647 | 0.102 | 3.565 | 0.000 | 0.164 | 0.565 |
| OverallQual | 2.256e+04 | 1187.777 | 18.990 | 0.000 | 2.02e+04 | 2.49e+04 |
| OverallCond | 3176.0287 | 951.836 | 3.337 | 0.001 | 1308.754 | 5043.304 |
| YearBuilt | -35.3897 | 5.659 | -6.254 | 0.000 | -46.491 | -24.289 |
| MasVnrArea | 30.0674 | 6.088 | 4.939 | 0.000 | 18.124 | 42.011 |
| BsmtFinSF1 | 10.7016 | 3.186 | 3.359 | 0.001 | 4.452 | 16.952 |
| TotalBsmtSF | 21.6900 | 5.914 | 3.667 | 0.000 | 10.087 | 33.293 |
| 2ndFlrSF | 10.6670 | 6.030 | 1.769 | 0.077 | -1.163 | 22.497 |

| | | | |
|---|---|---|---|
| Omnibus: | 726.855 | Durbin-Watson: | 1.961 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 110445.71 |
| Skew: | -1.466 | Prob(JB): | 0.0 |
| Kurtosis: | 47.413 | Cond. No. | 99200.0 |


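R-squared rose from 0.808 to 0.969, but the two figures are not comparable. `X_significant` does not include the `const` column, so the second model is fit without an intercept and statsmodels reports the uncentered R-squared, which is measured against zero rather than against the mean of SalePrice. The other fit statistics do not show a clear improvement: the log-likelihood fell from -15881 to -15919 and AIC rose from 31830 to 31870, while BIC, which penalizes the number of parameters more heavily, fell from 32010 to 31960. In addition, 2ndFlrSF is no longer significant at the 5% level in the reduced model (p = 0.077).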
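
One caveat when reading the second summary: it is labeled "R-squared (uncentered)", which is not on the same scale as the centered R-squared of the first model. The sketch below uses made-up toy data and plain Python (no statsmodels) to illustrate the difference; all names and numbers in it are assumptions. It fits one line with an intercept and one through the origin, then scores each the way the corresponding summary would.

```python
def r_squared(y, fitted, centered):
    """Return (1 - SSR/SST, SSR); SST is taken around the mean or around zero."""
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    baseline = sum(y) / len(y) if centered else 0.0
    sst = sum((yi - baseline) ** 2 for yi in y)
    return 1 - ssr / sst, ssr


# Toy data (made up): y sits far from zero and rises slowly with x
x = [1, 2, 3, 4, 5]
y = [100, 102, 101, 103, 104]

# Fit with an intercept: y = a + b * x, scored with the centered R-squared
x_mean, y_mean = sum(x) / len(x), sum(y) / len(y)
b = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
     / sum((xi - x_mean) ** 2 for xi in x))
a = y_mean - b * x_mean
r2_with_const, ssr_with_const = r_squared(y, [a + b * xi for xi in x], centered=True)

# Fit through the origin: y = b0 * x, scored with the uncentered R-squared
b0 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
r2_no_const, ssr_no_const = r_squared(y, [b0 * xi for xi in x], centered=False)

print(f"With intercept: R2 = {r2_with_const:.3f}, SSR = {ssr_with_const:.1f}")
print(f"No intercept:   R2 = {r2_no_const:.3f}, SSR = {ssr_no_const:.1f}")
```

In this toy case the model without an intercept has far larger residuals yet a higher reported R-squared, which is why a jump in R-squared after dropping the constant is not evidence of a better fit.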

- Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have more prominent effect on the house prices?

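While some coefficients did not change much, others changed a lot. For instance, the coefficient of "OverallQual," which is the largest in the model, increased from 18540 to 22560: a one-point increase in overall quality is associated with a sale price about 22,560 higher, holding the other features fixed. The coefficient of "OverallCond," on the other hand, decreased from 5539.4909 to 3176.0287. The two other variables with the largest coefficients, and hence the most prominent per-unit effects after "OverallQual," are "BsmtFullBath" and "TotRmsAbvGrd." Because each coefficient is expressed per unit of its own feature, this comparison favors features measured on small scales, such as ratings and room counts, over features measured in square feet.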
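
Raw coefficients are per unit of each feature, so their sizes depend on the units. One rough way to compare them is to multiply each coefficient by a plausible change in its feature. In the sketch below the coefficients come from the reduced model's summary table, while the "typical" changes are hypothetical values picked only for illustration.

```python
# Coefficients copied from the rows shown in the reduced model's summary table
coefficients = {
    "OverallQual": 22560.0,
    "OverallCond": 3176.0287,
    "LotArea": 0.3647,
    "TotalBsmtSF": 21.6900,
}

# Hypothetical changes per feature (assumptions, not taken from the data)
typical_change = {
    "OverallQual": 1,     # one rating point
    "OverallCond": 1,     # one rating point
    "LotArea": 1000,      # square feet
    "TotalBsmtSF": 100,   # square feet
}

# Implied change in SalePrice for each assumed change
implied_change = {name: coefficients[name] * typical_change[name]
                  for name in coefficients}

ranking = sorted(implied_change, key=lambda name: -abs(implied_change[name]))
for name in ranking:
    print(f"{name:12s} {implied_change[name]:10.1f}")
```

With these assumed changes, the basement area effect is of the same order as the overall condition effect, even though its raw coefficient is more than a hundred times smaller.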

- Do the results sound reasonable to you? If not, try to explain the potential reasons.

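Mostly, yes. It makes sense that higher overall quality and condition, a larger lot, and a larger basement are associated with higher prices. One result that does not sound reasonable is the negative coefficient on YearBuilt in the reduced model (-35.39), which implies that newer houses sell for less. Since the same coefficient was positive in the full model (343.84), the sign change is most likely a side effect of refitting without the constant and the dropped features rather than a real effect.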