### Dummy Variables

In the earlier notebook, you saw that a categorical variable cannot be added directly to a multiple linear regression model. In this notebook, you will practice incorporating categorical data by converting it to dummy variables, adding those to your models, and interpreting the output.

Let's start by importing the necessary libraries and reading in the data.

In [4]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('./house_prices.csv')
df.head()

,house_id,neighborhood,area,bedrooms,bathrooms,style,price
0,1112,B,1188,3,2,ranch,598291
1,491,B,3512,5,3,victorian,1744259
2,5952,B,1134,3,2,ranch,571669
3,3525,A,1940,4,2,ranch,493675
4,5108,B,2208,6,4,victorian,1101539


`1.` Use the [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) documentation to assist you with obtaining dummy variables for the **neighborhood** column.  Then use [join](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) to add the dummy variables to your dataframe, **df**, and store the joined results in **df_new**.

Fit a linear model using **all three levels** of **neighborhood** to predict the price. Don't forget an intercept.

Use your results to answer quiz 1 below.

In [5]:
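# dtype=int keeps the dummies as 0/1 integers (newer pandas versions return booleans by default)
neighborhood_dummies = pd.get_dummies(df['neighborhood'], dtype=int)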
df_new = df.join(neighborhood_dummies)
df_new.head()

,house_id,neighborhood,area,bedrooms,bathrooms,style,price,A,B,C
0,1112,B,1188,3,2,ranch,598291,0,1,0
1,491,B,3512,5,3,victorian,1744259,0,1,0
2,5952,B,1134,3,2,ranch,571669,0,1,0
3,3525,A,1940,4,2,ranch,493675,1,0,0
4,5108,B,2208,6,4,victorian,1101539,0,1,0


In [6]:
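# sm.OLS does not add an intercept automatically, so include a column of ones
df_new['intercept'] = 1
# Deliberately use all three neighborhood dummies alongside the intercept
lm = sm.OLS(df_new['price'], df_new[['intercept', 'A', 'B', 'C']])
results = lm.fit()
results.summary()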

Dep. Variable:,price,R-squared:,0.246
Model:,OLS,Adj. R-squared:,0.246
Method:,Least Squares,F-statistic:,983.1
Date:,"Wed, 10 Feb 2021",Prob (F-statistic):,0.0
Time:,06:40:08,Log-Likelihood:,-87082.0
No. Observations:,6028,AIC:,174200.0
Df Residuals:,6025,BIC:,174200.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
intercept,5.381e+05,4439.653,121.210,0.000,5.29e+05,5.47e+05
A,3001.8311,8650.726,0.347,0.729,-1.4e+04,2e+04
B,5.325e+05,7894.313,67.448,0.000,5.17e+05,5.48e+05
C,2669.4717,8925.271,0.299,0.765,-1.48e+04,2.02e+04

Omnibus:,689.315,Durbin-Watson:,1.999
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1154.155
Skew:,0.793,Prob(JB):,2.39e-251
Kurtosis:,4.442,Cond. No.,2.41e+15
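
The very large condition number in the summary above hints at why this model is hard to interpret. One way to see it, sketched here on a small made-up set of neighborhood labels rather than the real data, is that the three dummy columns always sum to the intercept column, so the design matrix is rank deficient. The `labels` series and the variable names below are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up neighborhood labels, for illustration only (not the house_prices data)
labels = pd.Series(['A', 'B', 'B', 'C', 'A', 'C', 'B'])

# One 0/1 column per level, plus an explicit intercept column
X = pd.get_dummies(labels, dtype=int)
X.insert(0, 'intercept', 1)

# Every row has exactly one dummy set to 1, so A + B + C equals the intercept
dummies_sum_to_intercept = bool(
    (X[['A', 'B', 'C']].sum(axis=1) == X['intercept']).all()
)

# Four columns, but only three of them are linearly independent
rank_full = int(np.linalg.matrix_rank(X.to_numpy(dtype=float)))

# Dropping one level (here A) removes the redundancy
rank_reduced = int(np.linalg.matrix_rank(X.drop(columns='A').to_numpy(dtype=float)))

print(dummies_sum_to_intercept, rank_full, rank_reduced)
```

Because one column is redundant, the individual coefficients in such a fit are not uniquely determined, which is the motivation for leaving one level out as a baseline.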


`2.`  Now fit an appropriate linear model that uses **neighborhood** to predict the price of a home, with **neighborhood A** as your baseline. (Remember that the coefficients reported for the other neighborhoods are then comparisons with baseline neighborhood A.) Use the resulting model to answer the questions in Quiz 2 and Quiz 3 below.

In [None]:
# Neighborhood A is the baseline, so its dummy column is left out
lm2 = sm.OLS(df_new['price'], df_new[['intercept', 'B', 'C']])
results2 = lm2.fit()
results2.summary()
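
To make the baseline interpretation concrete, here is a minimal sketch on made-up prices (not the real dataset). It uses `np.linalg.lstsq` as a stand-in for `sm.OLS`, and the `toy` frame and its values are assumptions chosen for easy arithmetic. With only dummy predictors, the intercept should match the mean price of the baseline group, and each remaining coefficient should match the difference between that group's mean and the baseline mean.

```python
import numpy as np
import pandas as pd

# Made-up prices for three neighborhoods, for illustration only
toy = pd.DataFrame({
    'neighborhood': ['A', 'A', 'B', 'B', 'C', 'C'],
    'price': [100.0, 120.0, 300.0, 340.0, 90.0, 110.0],
})

# Dummy-encode and drop A so that it becomes the baseline
X = pd.get_dummies(toy['neighborhood'], dtype=int).drop(columns='A')
X.insert(0, 'intercept', 1)

# Ordinary least squares fit (np.linalg.lstsq stands in for sm.OLS here)
coefs, *_ = np.linalg.lstsq(
    X.to_numpy(dtype=float), toy['price'].to_numpy(), rcond=None
)
fit = dict(zip(X.columns, coefs))

# Group means to compare against the fitted coefficients
group_means = toy.groupby('neighborhood')['price'].mean()

print(fit)
print(group_means)
```

Comparing `fit` with `group_means` shows the pattern to look for in the `results2` summary: the intercept describes neighborhood A, and the B and C rows describe how those neighborhoods differ from A.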

`3.` Run the code below to compare the distributions of home prices in neighborhoods A and C, then add neighborhood B. The histograms give a visual sense of the differences that you found in the previous linear model.

In [None]:
plt.hist(df_new.query("C == 1")['price'], alpha = 0.3, label = 'C');
plt.hist(df_new.query("A == 1")['price'], alpha = 0.3, label = 'A');

plt.legend();

`4.` Now, add dummy variables for the **style** of house. Create a new linear model using these new dummies, as well as the previous **neighborhood** dummies.  Use **ranch** as the baseline for the **style**.  Additionally, add **bathrooms** and **bedrooms** to your linear model.  Don't forget an intercept.  Use the results of your linear model to answer the last two questions below. **Home prices are measured in dollars, and this dataset is not real.**
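
One possible way to assemble the predictors for this step is sketched below on a few made-up rows that reuse the dataset's column names. It assumes neighborhood A stays the baseline from step 2, and the `toy` frame and its values are purely illustrative. Each categorical column is dummy-encoded, its baseline level is dropped, and the result is combined with the numeric columns and an intercept.

```python
import pandas as pd

# A few made-up rows with the same column names as the dataset (illustration only)
toy = pd.DataFrame({
    'neighborhood': ['A', 'B', 'C', 'B'],
    'style': ['ranch', 'victorian', 'ranch', 'victorian'],
    'bedrooms': [3, 5, 4, 6],
    'bathrooms': [2, 3, 2, 4],
    'price': [500000, 1700000, 480000, 1100000],
})

# Dummy-encode each categorical column, then drop its baseline level
neighborhood_dummies = pd.get_dummies(toy['neighborhood'], dtype=int).drop(columns='A')
style_dummies = pd.get_dummies(toy['style'], dtype=int).drop(columns='ranch')

# Assemble the predictors: intercept, numeric columns, and the remaining dummies
X = pd.concat([toy[['bedrooms', 'bathrooms']], neighborhood_dummies, style_dummies], axis=1)
X.insert(0, 'intercept', 1)
y = toy['price']

print(list(X.columns))
```

With the real data, an `X` and `y` built this way could be passed to `sm.OLS(y, X)` as in the earlier cells. Any style levels beyond the two in this toy frame would simply get their own dummy columns.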

To minimize scrolling, it might be useful to open this concept in a second browser window while you answer the quiz questions.