## Miscellaneous Notebook

In this notebook, I have included any miscellaneous data exploration and analysis that did not fit into my finished project or that I ultimately decided not to use.

In [1]:
import math

import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Interactive backend for the plots below
%matplotlib nbagg

In [2]:
data = pd.read_csv('Data/kc_house_data.csv')
data.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,greenbelt,...,sewer_system,sqft_above,sqft_basement,sqft_garage,sqft_patio,yr_built,yr_renovated,address,lat,long
0,7399300360,5/24/2022,675000.0,4,1.0,1180,7140,1.0,NO,NO,...,PUBLIC,1180,0,0,40,1969,0,"2102 Southeast 21st Court, Renton, Washington ...",47.461975,-122.19052
1,8910500230,12/13/2021,920000.0,5,2.5,2770,6703,1.0,NO,NO,...,PUBLIC,1570,1570,0,240,1950,0,"11231 Greenwood Avenue North, Seattle, Washing...",47.711525,-122.35591
2,1180000275,9/29/2021,311000.0,6,2.0,2880,6156,1.0,NO,NO,...,PUBLIC,1580,1580,0,0,1956,0,"8504 South 113th Street, Seattle, Washington 9...",47.502045,-122.2252
3,1604601802,12/14/2021,775000.0,3,3.0,2160,1400,2.0,NO,NO,...,PUBLIC,1090,1070,200,270,2010,0,"4079 Letitia Avenue South, Seattle, Washington...",47.56611,-122.2902
4,8562780790,8/24/2021,592500.0,2,2.0,1120,758,2.0,NO,NO,...,PUBLIC,1120,550,550,30,2012,0,"2193 Northwest Talus Drive, Issaquah, Washingt...",47.53247,-122.07188


In [22]:
y = data[['price']]

I wanted to check whether the target variable would be more useful for further analysis in its log-transformed form:

In [23]:
data['price'].hist();


In [25]:
price_qq = sm.qqplot(data['price'], line='r');


The histogram is not close to a normal distribution, and in the QQ plot the sample quantiles drift further and further from the theoretical reference line.
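
For reference, the points on a QQ plot are the sorted sample values paired with theoretical normal quantiles, rather than model residuals. Below is a minimal sketch of that pairing on synthetic data. The `qq_points` helper, the lognormal stand-in for price, and the `(i - 0.5) / n` plotting positions are all assumptions made for illustration; `sm.qqplot` may use a different plotting-position convention.

```python
import numpy as np
from statistics import NormalDist

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal QQ plot."""
    observed = np.sort(np.asarray(sample, dtype=float))
    n = len(observed)
    # Plotting positions (i - 0.5) / n keep the probabilities strictly inside (0, 1).
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, observed

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=13.5, sigma=0.6, size=2000)  # synthetic stand-in for price

for label, values in [("raw", skewed), ("log", np.log(skewed))]:
    theo, obs = qq_points(values)
    # A correlation near 1 means the points hug a straight line.
    r = np.corrcoef(theo, obs)[0, 1]
    print(f"{label}: QQ correlation = {r:.4f}")
```

The closer the correlation is to 1, the more closely the sample follows a normal distribution; the right-skewed sample should score lower than its log.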

In [26]:
np.log(data['price']).hist();

The log-transformed histogram of price looks more normal than the original variable's distribution.

In [27]:
sm.qqplot(np.log(data['price']), line='r');


This QQ plot is still not perfectly linear, but it is better than before the target was log-transformed. The points still follow the reference line only within roughly the same range as before, so it is likely not worthwhile to use the log-transformed target moving forward, since it would make the model harder to interpret.
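
One way to put a number on "looks more normal" is sample skewness, which is 0 for a symmetric distribution. The sketch below is only an illustration on synthetic lognormal data, not the King County prices, and `sample_skewness` is a hypothetical helper rather than something used in the project.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric distribution."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(42)
fake_price = rng.lognormal(mean=13.5, sigma=0.6, size=5000)  # synthetic, right-skewed

raw_skew = sample_skewness(fake_price)
log_skew = sample_skewness(np.log(fake_price))
print(f"skewness of raw values: {raw_skew:.2f}")
print(f"skewness of log values: {log_skew:.2f}")
```

On the real data, the same function could be applied to `data['price']` and `np.log(data['price'])` to compare the two versions of the target.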

I was then curious to see whether a line plot of sqft_living vs. price would be a useful visualization of their relationship:

In [28]:
# Subset of homes within a fixed price band (note: this subset is not used in the plot below)
price_usable = data[(data['price'] >= 684205.75) & (data['price'] <= 2901227.43)]


In [29]:
sns.set_style("darkgrid", {"axes.facecolor": ".9"})  # set the style before creating the axes so it takes effect
fig, ax = plt.subplots(figsize=(15, 8))
sns.lineplot(data=data, x='sqft_living', y='price', ax=ax).set(xlabel="Square Feet of Living Space", ylabel="Price", title="sqft_living vs. price");


This did not seem like the best way to show this relationship so I did not end up including this line plot.

I also wanted to investigate the distribution of the baseline predictor (sqft_living) to see whether it is normal and whether its quantiles follow the theoretical normal line:

In [30]:
data['sqft_living'].hist();

In [31]:
sqftliving_qq = sm.qqplot(data['sqft_living'], line='r');


The histogram of sqft_living does not look very normal, and much like in the QQ plot of price, the points drift further from the theoretical line. To see whether it would improve the distribution, I tried log-transforming this variable.

In [32]:
sqftliving_log_hist = np.log(data['sqft_living']).hist()

In [33]:
sqftliving_log_qq = sm.qqplot(np.log(data['sqft_living']), line='r');


The transformed feature's distribution looks much more normal than the untransformed variable's, and the points follow the theoretical line better as well. However, once again, using the log-transformed version of this feature would make any model harder to interpret, so I determined it would not be beneficial to use it.
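
To illustrate the interpretability cost, here is a minimal sketch of a model with a logged predictor, fitted on synthetic data. All the numbers are made up, and `np.polyfit` stands in for the OLS models used in the project. With `log(sqft)` as the predictor, the slope no longer means "dollars per square foot"; it describes the effect of a proportional change in square footage.

```python
import numpy as np

rng = np.random.default_rng(1)
sqft = rng.uniform(500, 4000, size=1000)
# Synthetic prices that rise by 400,000 per unit of log(sqft), plus noise.
price = -2_000_000 + 400_000 * np.log(sqft) + rng.normal(0, 50_000, size=1000)

slope, intercept = np.polyfit(np.log(sqft), price, deg=1)

# With a logged predictor, a 1% increase in sqft shifts the
# predicted price by slope * log(1.01) dollars.
change_per_1pct = slope * np.log(1.01)
print(f"slope on log(sqft): {slope:,.0f}")
print(f"predicted change for a 1% larger home: ${change_per_1pct:,.0f}")
```

The fitted slope has to be translated into "dollars per 1% more living space" before it can be explained, which is the extra step the untransformed model avoids.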

After determining that combining sqft_living and view in a multiple regression model did improve upon the baseline model, I was curious to see whether adding waterfront, the categorical variable second most strongly related to price, would improve it further:

In [34]:
multi_x2 = data[['sqft_living', 'view', 'waterfront']]
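# dtype=int keeps the dummies as 0/1 integers (newer pandas versions default to booleans)
multi_x2 = pd.get_dummies(multi_x2, columns=["waterfront", "view"], drop_first=True, dtype=int)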

multi_x2

Unnamed: 0,sqft_living,waterfront_YES,view_EXCELLENT,view_FAIR,view_GOOD,view_NONE
0,1180,0,0,0,0,1
1,2770,0,0,0,0,0
2,2880,0,0,0,0,0
3,2160,0,0,0,0,0
4,1120,0,0,0,0,1
...,...,...,...,...,...,...
30150,1910,0,0,0,0,1
30151,2020,0,0,1,0,0
30152,1620,0,0,0,0,1
30153,2570,0,0,0,0,1
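
Because `drop_first=True` removes one level per categorical column, rows with zeros in every `view_*` column belong to the dropped (reference) level. The sketch below uses a made-up four-row frame, not the real data, to show which level gets dropped; the level names mirror the ones in the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    "sqft_living": [1180, 2770, 2020, 1620],
    "view": ["NONE", "AVERAGE", "FAIR", "EXCELLENT"],
    "waterfront": ["NO", "NO", "YES", "NO"],
})

encoded = pd.get_dummies(toy, columns=["waterfront", "view"], drop_first=True, dtype=int)

# drop_first removes the alphabetically first level of each column,
# which then acts as the reference category for the coefficients.
for col in ["waterfront", "view"]:
    print(col, "reference level:", sorted(toy[col].unique())[0])
print(list(encoded.columns))
```

Each dummy coefficient in the regression is then read relative to that reference level, for example `waterfront_YES` relative to `NO`.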


In [35]:
model_multi2 = sm.OLS(y, sm.add_constant(multi_x2))
multi2_results = model_multi2.fit()

print(multi2_results.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.423
Model:                            OLS   Adj. R-squared:                  0.423
Method:                 Least Squares   F-statistic:                     3686.
Date:                Sun, 11 Jun 2023   Prob (F-statistic):               0.00
Time:                        17:23:29   Log-Likelihood:            -4.4780e+05
No. Observations:               30155   AIC:                         8.956e+05
Df Residuals:                   30148   BIC:                         8.957e+05
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const           1.029e+05   1.89e+04      5.

##### Plotting the actual vs. predicted values of this model:

In [36]:
fig, ax = plt.subplots(figsize=(10,5))
sm.graphics.plot_fit(multi2_results, "sqft_living", ax=ax)
plt.show()


In [37]:
fig, ax = plt.subplots(figsize=(10,5))
sm.graphics.plot_fit(multi2_results, "view_EXCELLENT", ax=ax)
plt.show()


In [38]:
fig, ax = plt.subplots(figsize=(10,5))
sm.graphics.plot_fit(multi2_results, "waterfront_YES", ax=ax)
plt.show()


##### Plotting the partial regression plots:

In [39]:
fig = plt.figure(figsize=(15,5))
sm.graphics.plot_partregress_grid(multi2_results, exog_idx=["sqft_living", "view_EXCELLENT", "waterfront_YES"], fig=fig)
plt.tight_layout()
plt.show()

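
As a rough illustration of what a partial regression plot shows: the response and one predictor are each regressed on all the other predictors, and the two sets of residuals are plotted against each other. The slope of that residual-on-residual fit equals the predictor's coefficient in the full model. The sketch below checks this on synthetic data with plain NumPy; the variable names and the `ols` helper are hypothetical and not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)  # deliberately correlated with x1
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

def ols(X, target):
    """Least-squares coefficients for target ~ X (X already includes a constant)."""
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef

const = np.ones(n)
full_coef = ols(np.column_stack([const, x1, x2]), y)

# Partial regression for x1: remove what the other regressors explain
# from both y and x1, then regress residual on residual.
others = np.column_stack([const, x2])
y_resid = y - others @ ols(others, y)
x1_resid = x1 - others @ ols(others, x1)
partial_slope = (x1_resid @ y_resid) / (x1_resid @ x1_resid)

print(f"x1 coefficient in full model: {full_coef[1]:.4f}")
print(f"slope of partial regression:  {partial_slope:.4f}")
```

The two printed values agree up to floating-point error, which is why each panel can be read as the effect of one predictor with the others held fixed.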

##### Plotting the residuals:

In [44]:
fig, axes = plt.subplots(ncols=3, figsize=(15,5), sharey=True)

sqftliving_ax = axes[0]
sqftliving_ax.scatter(multi_x2["sqft_living"], multi2_results.resid)
sqftliving_ax.axhline(y=0, color="black")
sqftliving_ax.set_xlabel("sqft_living")
sqftliving_ax.set_ylabel("residuals")

view_ax = axes[1]
view_ax.scatter(multi_x2["view_EXCELLENT"], multi2_results.resid)
view_ax.axhline(y=0, color="black")
view_ax.set_xlabel("view_EXCELLENT")

wf_ax = axes[2]
wf_ax.scatter(multi_x2["waterfront_YES"], multi2_results.resid)
wf_ax.axhline(y=0, color="black")
wf_ax.set_xlabel("waterfront_YES");


These plots look about the same, and since this model explains 42.3% of the variance in price, it may be an improvement over the baseline and over the first multiple linear regression model.
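
When comparing models with different numbers of predictors, adjusted R-squared is the fairer figure, since plain R-squared never decreases when a predictor is added. As a quick worked check, the standard formula can be applied to the values in the summary above. `adjusted_r2` is a hypothetical helper written for this illustration.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read off the summary above: R-squared 0.423, 30155 observations, Df Model 6.
adj = adjusted_r2(0.423, 30155, 6)
print(round(adj, 3))  # 0.423
```

With roughly 30,000 observations and only 6 predictors the penalty is negligible, which matches the identical R-squared and Adj. R-squared values in the summary.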

##### MAE for multiple linear regression model:

In [45]:
mae = multi2_results.resid.abs().sum() / len(y)
mae

389498.98261728266

On average, this multiple linear regression model's predictions are off by about $389,498.98. This is more than the MAE of the first multiple linear regression model, so this model is probably not the best model after all.
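
For context, MAE is the mean of the absolute residuals, so `resid.abs().sum() / len(y)` is equivalent to `resid.abs().mean()`. The sketch below uses four made-up residuals, not model output, to show MAE next to RMSE, a related metric that weights large misses more heavily.

```python
import numpy as np

residuals = np.array([100.0, -200.0, 300.0, -400.0])  # made-up residuals in dollars

mae = np.abs(residuals).mean()           # same as abs().sum() / len(...)
rmse = np.sqrt((residuals ** 2).mean())  # penalizes large misses more heavily

print(mae)             # 250.0
print(round(rmse, 2))  # 273.86
```

Reporting both can help show whether a model's error is driven by a few very large misses or spread evenly across predictions.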