# How should we price homes in Seattle?

## Goals

In previous cases, we learned how linear regression can be a powerful tool for understanding how a variable of interest behaves in relation to other variables in our dataset. However, in many real-life situations our data may not meet the basic assumptions required for a linear regression model to be suitable. When linear regression is not *directly* applicable, we need to find a way around the problem.

In this case, you will learn:

1. How to select and use appropriate variable transformations to correct our data such that it becomes suitable for applying linear regression
2. How to decide whether additional independent variables actually benefit the model
3. How to further extend the applicability of linear models by taking into account non-linear interactions that can exist between the independent variables

## Introduction

**Business Context.** You have been hired as a data scientist by a large real estate company in their Seattle office. Your job is to help Seattle residents who want to sell their homes determine an optimal asking price, one that maximizes their proceeds while still attracting willing buyers. To do this, the firm would like you to build a pricing model for Seattle real estate, in order to maximize the probability of helping residents close sales (and thus maximize commissions for the firm).

**Business Problem.** Your task is to **build a model that uses past sales data in Seattle to recommend an optimal sale price for any particular property**.

**Analytical Context.** The provided dataset was retrieved from [Kaggle](https://www.kaggle.com/harlfoxem/housesalesprediction) and includes sale prices of houses in King County, Washington (where Seattle is located) sold between May 2014 and May 2015. As we have learned, the primary tool to predict a dependent variable is the multiple regression model. However, sometimes the assumptions of a linear model are not met by our data. We will learn a set of strategies to mitigate some common issues that appear during regression analysis.

The case is structured as follows: you will (1) conduct basic EDA of some of the variables to determine that standard linear regression is not sufficient; (2) learn about variable transformations and use these to improve the initial model; and finally (3) learn how to incorporate interaction effects (which are themselves a form of variable transformation involving two or more variables) into our model.

## Data exploration

Let's start by reviewing the columns of the dataset and what they mean:

1. **id**: identification for a house
2. **date**: date house was sold
3. **price**: price house was sold at
4. **bedrooms**: number of bedrooms
5. **bathrooms**: number of bathrooms
6. **sqft_living**: square footage of the home
7. **sqft_lot**: square footage of the lot
8. **floors**: total floors (levels) in house
9. **waterfront**: whether or not the house has a view of a waterfront
10. **view**: an index from 0 to 4 of how good the view from the property is
11. **condition**: how good the condition of the house is
12. **grade**: overall grade given to the housing unit, based on King County grading system
13. **sqft_above**: square footage of the house apart from basement
14. **sqft_basement**: square footage of the basement
15. **yr_built**: year house was built
16. **yr_renovated**: year house was renovated
17. **zipcode**: zipcode of the house
18. **lat**: latitude coordinate of the house
19. **long**: longitude coordinate of the house

(See [here](https://www.slideshare.net/PawanShivhare1/predicting-king-county-house-prices) for a complete explanation of the columns.)

Here are the first 5 rows of the dataset:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>id</th>      <th>date</th>      <th>price</th>      <th>bedrooms</th>      <th>bathrooms</th>      <th>sqft_living</th>      <th>sqft_lot</th>      <th>floors</th>      <th>waterfront</th>      <th>view</th>      <th>condition</th>      <th>grade</th>      <th>sqft_above</th>      <th>sqft_basement</th>      <th>yr_built</th>      <th>yr_renovated</th>      <th>zipcode</th>      <th>lat</th>      <th>long</th>      <th>sqft_living15</th>      <th>sqft_lot15</th>      <th>log_price</th>      <th>log_sqft_living</th>      <th>renovated</th>      <th>has_basement</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>7129300520</td>      <td>20141013T000000</td>      <td>221900.0</td>      <td>3</td>      <td>1.00</td>      <td>1180</td>      <td>5650</td>      <td>1.0</td>      <td>0</td>      <td>0</td>      <td>3</td>      <td>7</td>      <td>1180</td>      <td>0</td>      <td>1955</td>      <td>0</td>      <td>98178</td>      <td>47.5112</td>      <td>-122.257</td>      <td>1340</td>      <td>5650</td>      <td>12.309982</td>      <td>7.073270</td>      <td>False</td>      <td>0.0</td>    </tr>    <tr>      <th>1</th>      <td>6414100192</td>      <td>20141209T000000</td>      <td>538000.0</td>      <td>3</td>      <td>2.25</td>      <td>2570</td>      <td>7242</td>      <td>2.0</td>      <td>0</td>      <td>0</td>      <td>3</td>      <td>7</td>      <td>2170</td>      <td>400</td>      <td>1951</td>      <td>1991</td>      <td>98125</td>      <td>47.7210</td>      <td>-122.319</td>      <td>1690</td>      <td>7639</td>      <td>13.195614</td>      <td>7.851661</td>      <td>True</td>      <td>1.0</td>    </tr>    <tr>      <th>2</th>      <td>5631500400</td>      <td>20150225T000000</td>      <td>180000.0</td>      <td>2</td>      <td>1.00</td>      <td>770</td>      <td>10000</td>      <td>1.0</td>      <td>0</td>      <td>0</td>      <td>3</td>   
   <td>6</td>      <td>770</td>      <td>0</td>      <td>1933</td>      <td>0</td>      <td>98028</td>      <td>47.7379</td>      <td>-122.233</td>      <td>2720</td>      <td>8062</td>      <td>12.100712</td>      <td>6.646391</td>      <td>False</td>      <td>0.0</td>    </tr>    <tr>      <th>3</th>      <td>2487200875</td>      <td>20141209T000000</td>      <td>604000.0</td>      <td>4</td>      <td>3.00</td>      <td>1960</td>      <td>5000</td>      <td>1.0</td>      <td>0</td>      <td>0</td>      <td>5</td>      <td>7</td>      <td>1050</td>      <td>910</td>      <td>1965</td>      <td>0</td>      <td>98136</td>      <td>47.5208</td>      <td>-122.393</td>      <td>1360</td>      <td>5000</td>      <td>13.311329</td>      <td>7.580700</td>      <td>False</td>      <td>1.0</td>    </tr>    <tr>      <th>4</th>      <td>1954400510</td>      <td>20150218T000000</td>      <td>510000.0</td>      <td>3</td>      <td>2.00</td>      <td>1680</td>      <td>8080</td>      <td>1.0</td>      <td>0</td>      <td>0</td>      <td>3</td>      <td>8</td>      <td>1680</td>      <td>0</td>      <td>1987</td>      <td>0</td>      <td>98074</td>      <td>47.6168</td>      <td>-122.045</td>      <td>1800</td>      <td>7503</td>      <td>13.142166</td>      <td>7.426549</td>      <td>False</td>      <td>0.0</td>    </tr>  </tbody></table>

### Exercise 1

This is the distribution of house prices as characterized by its descriptive statistics and a histogram:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>price</th>    </tr>  </thead>  <tbody>    <tr>      <th>count</th>      <td>2.161300e+04</td>    </tr>    <tr>      <th>mean</th>      <td>5.400881e+05</td>    </tr>    <tr>      <th>std</th>      <td>3.671272e+05</td>    </tr>    <tr>      <th>min</th>      <td>7.500000e+04</td>    </tr>    <tr>      <th>25%</th>      <td>3.219500e+05</td>    </tr>    <tr>      <th>50%</th>      <td>4.500000e+05</td>    </tr>    <tr>      <th>75%</th>      <td>6.450000e+05</td>    </tr>    <tr>      <th>max</th>      <td>7.700000e+06</td>    </tr>  </tbody></table>

![House prices histogram](data/images/histogram_prices.png)

Based solely on these results, would you say that the distribution of prices is normal?

**Answer.**

-------

One way to assess whether data comes from a given distribution is to draw a [**Quantile-Quantile plot (QQ-plot)**](https://www.youtube.com/watch?v=okjYjClSjOg&ab_channel=StatQuestwithJoshStarmer). In a QQ-plot, the quantiles of the data are plotted against the quantiles of the desired distribution. If the resulting points deviate substantially from a straight line (the identity line $y = x$ when both sets of quantiles are on the same scale), we have evidence that our data does not come from that particular distribution.

To better understand how to interpret QQ-plots, let's consider a normally-distributed variable and find its percentiles (100-quantiles):

![A normal distribution with its percentiles](data/images/normal_percentiles.png)

Let's now take a sample from the same variable. This sample consists of 100 observations and, therefore, each observation counts as a percentile (that is, after you sort them, the $i$-th observation will be exactly the $i$-th percentile).

![A sample from a normal distribution with its percentiles](data/images/sample_percentile.png)

If we plot the percentiles of the normal distribution against the percentiles of the sample as a scatterplot, we get the following:

![QQ plot original vs. sample](data/images/qq_original_sample.png)

We notice that the points approximately follow a straight line from the lower left-hand corner to the upper right-hand corner (i.e. roughly the line $y = x$). This makes sense - since the shapes of the distributions are similar, their percentiles (which summarize those shapes) should be similar too.

Let's now compare our original distribution with *another* distribution that is also normal but which has a very different mean and standard deviation (notice the change in the $y$-axis):

![QQ plot original vs. another sample](data/images/qq_original_another_sample.png)

As you can see, the linear relationship between the percentiles is maintained. In general, if you draw a QQ-plot of two normal variables, the points will tend to form a straight line even if the two distributions have different means and standard deviations. If you would like to learn more about QQ-plots, you can check [this video](https://www.youtube.com/watch?v=okjYjClSjOg&ab_channel=StatQuestwithJoshStarmer) (for a more technical overview, [this](https://onlinestatbook.com/2/advanced_graphs/q-q_plots.html) is a good resource).
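The idea behind a QQ-plot can also be checked numerically. The sketch below is only an illustration on synthetic data (not the house prices): it pairs each sorted observation with the matching quantile of a standard normal distribution and measures how close the resulting points are to a straight line. The helper names `qq_points` and `qq_correlation` are made up for this example.

```python
import numpy as np
from statistics import NormalDist


def qq_points(sample):
    """Return (theoretical normal quantiles, sorted sample values)."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = ordered.size
    # Plotting positions strictly inside (0, 1), so the inverse CDF is defined.
    probs = (np.arange(1, n + 1) - 0.5) / n
    std_normal = NormalDist()
    theoretical = np.array([std_normal.inv_cdf(p) for p in probs])
    return theoretical, ordered


def qq_correlation(sample):
    """Correlation of the QQ points; values near 1 mean a nearly straight line."""
    theoretical, ordered = qq_points(sample)
    return float(np.corrcoef(theoretical, ordered)[0, 1])


rng = np.random.default_rng(0)
# Normal, but with a mean and standard deviation far from the standard normal.
normal_sample = rng.normal(loc=50.0, scale=10.0, size=1000)
# Right-skewed synthetic sample for contrast.
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

normal_corr = qq_correlation(normal_sample)
skewed_corr = qq_correlation(skewed_sample)
print(f"normal sample QQ correlation: {normal_corr:.3f}")
print(f"skewed sample QQ correlation: {skewed_corr:.3f}")
```

The normal sample stays close to a straight line even though its mean and standard deviation differ from those of the reference distribution, while the skewed sample does not. In practice you would plot the points returned by `qq_points` rather than summarize them with a single number.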

### Question 1

This is the QQ-plot for `price` vs. the normal distribution. What can you conclude?

![QQ plot of prices](data/images/qq_prices.png)

### Exercise 2

Examine the relationship between house prices and the square footage of living space (plotted below). What can you conclude?

![Price vs. square feet](data/images/price_vs_sqft.png)

**Answer.**

-------

## Variable transformation

We have seen in Exercise 1 that the distribution of house prices is not normal, and that this may be contributing to the "fanning out" effect we observed in Exercise 2. We want to find a way to remove the "fanning out" effect, as it implies that a linear fit becomes less and less suitable for larger values of `sqft_living`. A common method of addressing this issue is to transform the dependent variable and/or the independent variable. Such a **variable transformation** involves applying a function to one or more of these variables to achieve conditions that are suitable for the application of a linear model.

Typical mathematical functions used to transform variables include powers (quadratic, cubic, square root, etc.), logarithms, and trigonometric functions. Let's start with the logarithmic transformation to see if we can achieve some results.
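To get a feel for what a logarithmic transformation does to a right-skewed variable, here is a minimal sketch on synthetic data. The "prices" are drawn from a lognormal distribution chosen purely for illustration; they are not the King County data, and `skewness` is a small helper written for this example.

```python
import numpy as np


def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))


rng = np.random.default_rng(42)
# Synthetic right-skewed "prices" (an assumption for illustration only).
prices = rng.lognormal(mean=13.0, sigma=0.5, size=10_000)

raw_skew = skewness(prices)
log_skew = skewness(np.log(prices))
print(f"skewness of raw values: {raw_skew:.2f}")
print(f"skewness of log values: {log_skew:.2f}")
```

A long right tail shows up as a clearly positive skewness, and taking logarithms pulls that tail in. Whether the same happens with real data has to be checked on the data itself.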

### Exercise 3

We took the logarithm of house prices and created the below plots and table. Ascertain if this made the distribution of the transformed variable roughly normal.

![QQ log prices](data/images/qq_log_prices.png)
![Histogram log prices](data/images/histogram_log_prices.png)
<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>descriptive statistics of log(price)</th>    </tr>  </thead>  <tbody>    <tr>      <th>count</th>      <td>21613.000000</td>    </tr>    <tr>      <th>mean</th>      <td>13.047817</td>    </tr>    <tr>      <th>std</th>      <td>0.526685</td>    </tr>    <tr>      <th>min</th>      <td>11.225243</td>    </tr>    <tr>      <th>25%</th>      <td>12.682152</td>    </tr>    <tr>      <th>50%</th>      <td>13.017003</td>    </tr>    <tr>      <th>75%</th>      <td>13.377006</td>    </tr>    <tr>      <th>max</th>      <td>15.856731</td>    </tr>  </tbody></table>

**Answer.**

-------

### Building a linear model with transformed variables

Of course, we aren't restricted to just applying the logarithmic transformation to house prices - we can do it to any other variable in our dataset. Let's transform both house prices and living-space square footage (`sqft_living`) by this method and fit the following linear model:

$$
\log(price) = \beta_0 + \beta_1 \log(sqft{\_}living) + \varepsilon
$$

Below is the corresponding plot and output table (in this and all subsequent output tables, whenever a variable is a logarithmic transformation it is prefixed by `np.log` and appears between parentheses):

![Sqft vs. price (log-log)](data/images/sqft_living_price_log_log.png)

~~~plain
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.456
Model:                            OLS   Adj. R-squared:                  0.455
Method:                 Least Squares   F-statistic:                 1.808e+04
Date:                Thu, 10 Jun 2021   Prob (F-statistic):               0.00
Time:                        15:29:18   Log-Likelihood:                -10240.
No. Observations:               21613   AIC:                         2.048e+04
Df Residuals:                   21611   BIC:                         2.050e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept               6.7299      0.047    143.001      0.000       6.638       6.822
np.log(sqft_living)     0.8368      0.006    134.459      0.000       0.825       0.849
==============================================================================
Omnibus:                      123.344   Durbin-Watson:                   1.978
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              113.759
Skew:                           0.142   Prob(JB):                     1.98e-25
Kurtosis:                       2.787   Cond. No.                         137.
==============================================================================
~~~


We have to be mindful of how we interpret the coefficients here. In a regression in which both the independent and dependent variables are logarithmic transformations (a *log-log regression*), the coefficient $\beta_1$ should be interpreted as the *percentage change* in the dependent variable that is associated with a 1\% change in the independent variable. This ratio of percentage changes is known as **elasticity**. Thus, in our model a 1\% increase in living space is associated with an approximately 0.8368\% increase in price. If you'd like to see why $\beta_1$ can be interpreted in this way, you can refer to the appendix at the bottom of this notebook, where we examine this property mathematically.
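The elasticity reading of a log-log slope can be illustrated with a small simulation. The sketch below generates synthetic data with an assumed elasticity of 0.8 (an arbitrary value, unrelated to the fitted 0.8368), fits the log-log line with plain NumPy least squares rather than the library that produced the table, and then computes the exact effect of a 1% increase in the predictor.

```python
import numpy as np

rng = np.random.default_rng(1)
true_elasticity = 0.8  # assumed value used only to generate the synthetic data

# Synthetic living areas and prices; not the King County data.
sqft = rng.uniform(500, 5000, size=5_000)
noise = rng.normal(0.0, 0.3, size=sqft.size)
price = np.exp(6.5 + true_elasticity * np.log(sqft) + noise)

# Least-squares fit of log(price) on log(sqft); polyfit returns [slope, intercept].
slope, intercept = np.polyfit(np.log(sqft), np.log(price), deg=1)

# Exact percentage change in price implied by a 1% increase in sqft.
pct_change = (1.01 ** slope - 1) * 100

print(f"fitted slope (elasticity): {slope:.4f}")
print(f"exact effect of +1% sqft:  {pct_change:.4f}% change in price")
```

The exact effect and the slope agree to several decimal places for a 1% change, which is why the slope can be read directly as a percentage. For large changes (say, doubling the living space) the shortcut becomes less accurate and the exact expression `ratio ** slope - 1` should be used instead.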

Let's now build a linear model where the logarithmic transform is only applied to the house prices:

$$
\log(price) = \beta_0 + \beta_1 sqft{\_}living + \varepsilon
$$

The plot and the output table are below:

![Square feet vs. log of price](data/images/sqft_living_log_price.png)

~~~plain
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.483
Model:                            OLS   Adj. R-squared:                  0.483
Method:                 Least Squares   F-statistic:                 2.023e+04
Date:                Thu, 10 Jun 2021   Prob (F-statistic):               0.00
Time:                        17:43:10   Log-Likelihood:                -9670.2
No. Observations:               21613   AIC:                         1.934e+04
Df Residuals:                   21611   BIC:                         1.936e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      12.2185      0.006   1916.883      0.000      12.206      12.231
sqft_living     0.0004    2.8e-06    142.233      0.000       0.000       0.000
==============================================================================
Omnibus:                        3.128   Durbin-Watson:                   1.979
Prob(Omnibus):                  0.209   Jarque-Bera (JB):                3.149
Skew:                           0.027   Prob(JB):                        0.207
Kurtosis:                       2.974   Cond. No.                     5.63e+03
==============================================================================
~~~

The interpretation of the regression coefficient is once again different. This is a *log-level regression*; i.e. one in which the dependent variable has been logarithmically transformed and the independent variable has not. We interpret the coefficient as a **semi-elasticity**, where an absolute increase in `sqft_living` (because it has not had the logarithm function applied to it) corresponds to a percentage increase in `price`. Specifically, here we can say that an increase in living space of 1 square foot is associated with approximately a 0.04% increase in price (refer to the appendix for the mathematical reasoning).
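The "multiply the coefficient by 100" reading is an approximation that works well for small changes in the predictor. As a rough illustration, the sketch below takes the coefficient as printed in the table (0.0004, which is rounded) and compares the shortcut with the exact percentage change $e^{\beta_1 \Delta x} - 1$ for a few increments. The function name `pct_change_in_price` is invented for this example.

```python
import math

beta_1 = 0.0004  # coefficient of sqft_living as printed (rounded) in the table


def pct_change_in_price(extra_sqft, beta=beta_1):
    """Exact percentage change in price implied by a log-level model."""
    return (math.exp(beta * extra_sqft) - 1) * 100


for extra in (1, 100, 1000):
    shortcut = beta_1 * extra * 100       # "coefficient times 100" rule
    exact = pct_change_in_price(extra)    # exact effect from the exponential
    print(f"+{extra:>4} sqft: shortcut {shortcut:6.2f}%, exact {exact:6.2f}%")
```

For a single square foot the two numbers are practically identical, but the gap widens as the increment grows, so the exact formula is the safer choice when comparing houses of very different sizes.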

### Exercise 4

From the plots of the log-log and the log-level models, which of the two is "more linear"?

**Answer.**

-------

### The Box-Cox transformation

Logarithmic transformations are just one possible way to make our relationships more linear. Earlier, we mentioned powers (e.g. squares, cubes, square roots, etc.) as well as trigonometric functions. In some cases, choosing a transformation can be straightforward (e.g. the logarithm because it is easily interpretable); other times, it is much more difficult. A formal way to decide which transformation to use is the [**Box-Cox**](https://www.isixsigma.com/tools-templates/normality/making-data-normal-using-box-cox-power-transformation/) criterion. Applied to a strictly positive variable, it estimates the power to which the variable should be raised so that the result is as close to normally distributed as possible. This power is usually represented by the Greek letter $\lambda$ (lambda). The interpretation of some common values of $\lambda$ is shown below:

| If you get a lambda value ($\lambda$) of... | then transform $Y$ using this function |
|-|-|
| -3 | $Y^{-3} = \frac{1}{Y^3}$ |
| -2 | $Y^{-2} = \frac{1}{Y^2}$ |
| -1 | $Y^{-1} = \frac{1}{Y}$ |
| -0.5 | $Y^{-0.5} = \frac{1}{Y^{0.5}} = \frac{1}{\sqrt{Y}}$ |
| 0 | $\log(Y)$ |
| 0.5 | $Y^{0.5} = \sqrt{Y}$ |
| 1 | $Y^1 = Y$ |
| 2 | $Y^2$ |
| 3 | $Y^3$ |


For instance, if you have a variable $Y$ and Box-Cox gives you $\lambda=2$, then $Y^2$ should be approximately normally distributed. As a special case, when $\lambda=0$ we don't transform using $Y^0$ (which would give you the constant number 1), but rather using $\log(Y)$. In our case, we have a Box-Cox $\lambda=-0.23$, so the logarithmic transformation seems sensible, since this value is close to zero. Although it is very useful, the Box-Cox criterion is not perfect, so you should always double-check the QQ-plot after transforming your variable to confirm that it does seem to be normally distributed.
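To make the idea of "estimating $\lambda$" more concrete, here is a minimal sketch of how such an estimate can be obtained: try many candidate values of $\lambda$ and keep the one that maximizes the Box-Cox log-likelihood. This is an illustration on synthetic lognormal data, not the computation behind the $-0.23$ quoted above, and the function names are invented for the example. Library routines such as `scipy.stats.boxcox` do this more carefully.

```python
import numpy as np


def boxcox_transform(y, lam):
    """Box-Cox transform: (y**lam - 1) / lam, or log(y) when lam is 0."""
    y = np.asarray(y, dtype=float)
    if abs(lam) < 1e-12:
        return np.log(y)
    return (y ** lam - 1.0) / lam


def boxcox_loglik(y, lam):
    """Box-Cox profile log-likelihood, up to an additive constant."""
    y = np.asarray(y, dtype=float)
    # Dividing by the geometric mean does not change the best lambda,
    # removes the Jacobian term, and keeps the powers numerically tame.
    scaled = y / np.exp(np.mean(np.log(y)))
    transformed = boxcox_transform(scaled, lam)
    return -0.5 * y.size * np.log(transformed.var())


def estimate_lambda(y, grid=None):
    """Grid search for the lambda with the highest log-likelihood."""
    if grid is None:
        grid = np.linspace(-2.0, 2.0, 401)  # step of 0.01
    scores = [boxcox_loglik(y, lam) for lam in grid]
    return float(grid[int(np.argmax(scores))])


rng = np.random.default_rng(7)
# Synthetic positive, right-skewed data whose logarithm is exactly normal.
y = rng.lognormal(mean=13.0, sigma=0.5, size=5_000)
lam_hat = estimate_lambda(y)
print(f"estimated lambda: {lam_hat:.2f}")
```

Because the logarithm of this synthetic variable is normal by construction, the estimated $\lambda$ lands close to zero, which points to the logarithmic transformation in the table.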

## Multiple linear regression with transformed variables

Of course, as we have seen from previous cases, it doesn't make sense to restrict ourselves to modeling house prices using only one independent variable. Let's add in several more variables, some transformed and some not.

### Exercise 5

Below is the output table of the following model:

$$
\log(price) = \beta_0 + \beta_1 \log(sqft{\_}living) + \beta_2 \log(sqft{\_}lot) + \beta_3 bedrooms + \beta_4 floors + \beta_5 bathrooms + \beta_6 waterfront + \beta_7 condition + \beta_8 view + \beta_9 grade + \beta_{10} yr{\_}built + \beta_{11} lat + \beta_{12} long + \varepsilon
$$

Provide interpretations for the coefficients of `np.log(sqft_living)` and `waterfront`.

**Hint:** The variables in this model that are categorical appear in the table inside parentheses and after a letter `C` that stands for "categorical". This syntax comes from the software library we used to fit the model.

~~~plain
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.763
Model:                            OLS   Adj. R-squared:                  0.763
Method:                 Least Squares   F-statistic:                     4638.
Date:                Thu, 10 Jun 2021   Prob (F-statistic):               0.00
Time:                        20:27:57   Log-Likelihood:                -1246.2
No. Observations:               21613   AIC:                             2524.
Df Residuals:                   21597   BIC:                             2652.
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept             -38.9925      2.035    -19.157      0.000     -42.982     -35.003
C(waterfront)[T.1]      0.3995      0.025     15.919      0.000       0.350       0.449
C(view)[T.1]            0.1950      0.014     13.605      0.000       0.167       0.223
C(view)[T.2]            0.1383      0.009     15.964      0.000       0.121       0.155
C(view)[T.3]            0.2039      0.012     17.215      0.000       0.181       0.227
C(view)[T.4]            0.2805      0.018     15.380      0.000       0.245       0.316
np.log(sqft_living)     0.4067      0.009     46.385      0.000       0.389       0.424
np.log(sqft_lot)       -0.0076      0.002     -3.076      0.002      -0.012      -0.003
bedrooms               -0.0252      0.002    -10.100      0.000      -0.030      -0.020
floors                  0.0502      0.004     11.602      0.000       0.042       0.059
bathrooms               0.0702      0.004     17.432      0.000       0.062       0.078
condition               0.0552      0.003     18.866      0.000       0.049       0.061
grade                   0.1858      0.002     74.364      0.000       0.181       0.191
yr_built               -0.0038   8.65e-05    -44.108      0.000      -0.004      -0.004
lat                     1.3437      0.013    100.666      0.000       1.318       1.370
long                    0.0747      0.015      4.932      0.000       0.045       0.104
==============================================================================
Omnibus:                      563.806   Durbin-Watson:                   1.981
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1114.960
Skew:                           0.179   Prob(JB):                    7.75e-243
Kurtosis:                       4.054   Cond. No.                     2.31e+06
==============================================================================

~~~

**Answer.**

-------

What other factors that we left out might impact the price? Examples might include proximity to services (hospitals, schools, commercial areas, movie theaters, metro stops, etc.) and crime rates. Our dataset does not have a comprehensive list of possible factors; however, we do have some variables that would be interesting to investigate further.

In general, house prices change depending on location. Two houses with comparable features can be priced very differently depending on neighborhood and geographic position. In this dataset, we have zipcode and geographic coordinates. Let's start by taking a look at the relationship between latitude and prices.

### Exercise 6

In the plot below, we can see the relationship between latitude and the logarithm of house prices. What do you observe? What transformation function could be appropriate here?

![Log price vs. latitude](data/images/log_price_latitude.png)

**Answer.**

-------

Let's add the square of the latitude as an additional independent variable to the model from Exercise 5:

~~~plain
                        OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.778
Model:                            OLS   Adj. R-squared:                  0.778
Method:                 Least Squares   F-statistic:                     4741.
Date:                Thu, 10 Jun 2021   Prob (F-statistic):               0.00
Time:                        20:48:49   Log-Likelihood:                -525.85
No. Observations:               21613   AIC:                             1086.
Df Residuals:                   21596   BIC:                             1221.
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept           -7714.7455    198.955    -38.776      0.000   -8104.713   -7324.778
C(waterfront)[T.1]      0.3800      0.024     15.649      0.000       0.332       0.428
C(view)[T.1]            0.1715      0.014     12.356      0.000       0.144       0.199
C(view)[T.2]            0.1299      0.008     15.498      0.000       0.114       0.146
C(view)[T.3]            0.2004      0.011     17.491      0.000       0.178       0.223
C(view)[T.4]            0.2733      0.018     15.491      0.000       0.239       0.308
np.log(sqft_living)     0.4060      0.008     47.876      0.000       0.389       0.423
np.log(sqft_lot)        0.0120      0.002      4.909      0.000       0.007       0.017
bedrooms               -0.0250      0.002    -10.358      0.000      -0.030      -0.020
floors                  0.0498      0.004     11.900      0.000       0.042       0.058
bathrooms               0.0659      0.004     16.892      0.000       0.058       0.074
condition               0.0598      0.003     21.097      0.000       0.054       0.065
grade                   0.1763      0.002     72.614      0.000       0.172       0.181
yr_built               -0.0032   8.53e-05    -37.235      0.000      -0.003      -0.003
lat                   323.7933      8.357     38.743      0.000     307.412     340.175
I(lat ** 2)            -3.3919      0.088    -38.582      0.000      -3.564      -3.220
long                   -0.0164      0.015     -1.109      0.267      -0.046       0.013
==============================================================================
Omnibus:                      642.697   Durbin-Watson:                   1.984
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1405.507
Skew:                           0.172   Prob(JB):                    6.28e-306
Kurtosis:                       4.201   Cond. No.                     3.54e+08
==============================================================================

~~~
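Adding a squared term is just adding one more column to the regression. The sketch below shows the mechanics with plain NumPy least squares on synthetic data that follows an inverted-U shape by construction. The variable names and coefficients are assumptions for illustration and have nothing to do with the fitted values in the table.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    # With an intercept column the residuals average to zero, so this
    # ratio equals residual sum of squares / total sum of squares.
    return 1.0 - residuals.var() / y.var()


rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=2_000)
# Synthetic inverted-U relationship plus noise.
y = 1.0 + 0.5 * x - 2.0 * x ** 2 + rng.normal(0.0, 0.3, size=x.size)

ones = np.ones_like(x)
X_linear = np.column_stack([ones, x])             # intercept + x
X_quadratic = np.column_stack([ones, x, x ** 2])  # intercept + x + x^2

r2_linear = r_squared(X_linear, y)
r2_quadratic = r_squared(X_quadratic, y)
print(f"R^2 without squared term: {r2_linear:.3f}")
print(f"R^2 with squared term:    {r2_quadratic:.3f}")
```

Note that the model is still *linear* in its coefficients, which is why ordinary least squares can fit it. One practical caveat: a raw predictor and its square are often highly correlated, which likely explains why the condition number rises from `2.31e+06` to `3.54e+08` between the two output tables. Centering the predictor before squaring it is a common remedy.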

### Question 2

Is the coefficient of this new independent variable significant? What can you say about the $R^2$ of this model? (The relevant coefficient in the output table is that of the `I(lat ** 2)` variable; this name comes from the syntax of the particular software library we used to run the regression.)

### Exercise 7

#### 7.1

One of the properties of $R^2$ is that it can never decrease when the set of predictors is increased. In other words, there is no penalty for continuing to add variables to the model. Why do you think this may be a drawback of $R^2$? How would you go about deciding the correct set of predictors to use?

**Answer.**

-------

#### 7.2

There are several model selection criteria that quantify the quality of a model by managing the trade-off between goodness-of-fit and simplicity. The most common one is the **AIC (Akaike Information Criterion)**. The AIC penalizes the addition of more terms to a model, so for an updated model to have a better AIC, its fit (measured by the log-likelihood rather than by $R^2$ directly) needs to improve by more than the additional imposed penalty. Given several models fit to the same dependent variable and data, the one with the lowest AIC is the recommended one.

For now, do not worry about the technical details behind AIC (you can read more about it [here](https://en.wikipedia.org/wiki/Akaike_information_criterion)). In future cases on *regularization*, you will learn more about the rationale behind why these sorts of estimators matter and how to construct and use them.
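For readers who want a quick peek anyway, the AIC is defined as $2k - 2\ln(L)$, where $k$ is the number of estimated coefficients and $\ln(L)$ is the model's log-likelihood. As a sanity check, the sketch below recomputes the AIC of the two single-predictor models from earlier in this case, using the `Log-Likelihood` values printed in their output tables. It assumes $k = 2$ for both (the intercept and one slope), which is consistent with the `Df Model: 1` line of each table.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2*ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Log-likelihoods copied from the two single-predictor output tables.
aic_log_log = aic(-10240.0, n_params=2)   # log(price) ~ log(sqft_living)
aic_log_level = aic(-9670.2, n_params=2)  # log(price) ~ sqft_living

print(f"log-log model AIC:   {aic_log_log:.1f}")
print(f"log-level model AIC: {aic_log_level:.1f}")
```

These values match the `AIC` entries reported in the corresponding tables (`2.048e+04` and `1.934e+04`) up to rounding. Both models have the same dependent variable, `np.log(price)`, which is what makes their AIC values comparable.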

Use the AIC score (you can look it up in the output table of the model summary) to evaluate whether or not the model with the added squared term is better than the model without it.

**Answer.**

-------

## Modeling interaction effects

Interaction effects can complicate the perceived effect of the independent variables on the dependent variable. Let's dig into potential interactions by looking at two of the predictors in tandem: `waterfront` and geographic position (`lat` and `long`). Specifically, is the effect of geographic position different among the houses that have a waterfront view vs. those that do not?

### Exercise 8

#### 8.1

Below, we have drawn a plot of the relationship between `log_price` and `lat`. This plot fits two separate regression lines for houses that do and do not have a `waterfront` view. What do you see? Is the relationship the same or different?

![Log price vs. latitude & waterfront](data/images/log_price_latitude_waterfront.png)

**Answer.**

-------

#### 8.2

This is a plot that fits separate regression lines for houses with different `view` indexes (i.e. how good the view for the property was). What do you see? Is the relationship the same or different?

![Log price vs. latitude & view](data/images/log_price_latitude_view.png)

**Answer.**

-------

We can verify the findings of Exercise 8 by adding **interaction terms** to our linear model. We add them by including a multiplication in the model. For instance, the model below includes an interaction term between `lat` and `waterfront`:

$$
\log(price) = \beta_0 + \beta_1 waterfront + \beta_2 lat + \beta_3 (lat \times waterfront) + \varepsilon
$$

To understand why we include interaction terms this way, let's assume a linear model (this example was taken from [this link](https://stats.stackexchange.com/a/486147/86081)):

$$
y=\beta_0+\beta_1 x_1+\gamma x_2+e
$$

in which $\gamma$ linearly depends on $x_1$:

$$
\gamma = \beta_2+\beta_3 x_1
$$

This means that the effect of $x_2$ on $y$ is not constant: it depends on the value of $x_1$. In other words, the two variables interact. If we substitute this expression for $\gamma$ into the previous equation, we obtain our multiplicative interaction term:

$$
\begin{aligned}
y&=\beta_0+\beta_1 x_1 + (\beta_2+\beta_3 x_1)x_2+e\\
&=\beta_0 + \beta_1 x_1 + \beta_2 x_2+\beta_3 x_1x_2+e
\end{aligned}
$$

You can find a more detailed explanation [here](https://stats.stackexchange.com/questions/393579/in-multiple-regression-why-are-interactions-modelled-as-products-and-not-somet).
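
The substitution above can also be checked numerically. The sketch below simulates data in which the slope of $x_2$ depends linearly on $x_1$, then fits ordinary least squares with a product column and recovers the four coefficients. The coefficient values, sample size, and noise scale are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Illustrative "true" coefficients (assumed values, not from the case data).
b0, b1, b2, b3 = 1.0, 2.0, -0.5, 0.8

# The slope of x2 is gamma = b2 + b3 * x1, so it changes with x1.
gamma = b2 + b3 * x1
y = b0 + b1 * x1 + gamma * x2 + rng.normal(scale=0.1, size=n)

# Design matrix: intercept, both main effects, and the product x1 * x2.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(coef, 2))  # estimates of b0, b1, b2, b3 (in that order)
```

The coefficient on the product column estimates $\beta_3$, which is how much the slope of $x_2$ changes per unit of $x_1$.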

This is the output table for the model with the `lat` $\times$ `waterfront` interaction term:

~~~plain
                             OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.236
Model:                            OLS   Adj. R-squared:                  0.236
Method:                 Least Squares   F-statistic:                     2228.
Date:                Thu, 10 Jun 2021   Prob (F-statistic):               0.00
Time:                        21:13:40   Log-Likelihood:                -13898.
No. Observations:               21613   AIC:                         2.780e+04
Df Residuals:                   21609   BIC:                         2.784e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                -68.0880      1.078    -63.180      0.000     -70.200     -65.976
C(waterfront)[T.1]      -102.3040     14.909     -6.862      0.000    -131.526     -73.082
lat                        1.7058      0.023     75.280      0.000       1.661       1.750
lat:C(waterfront)[T.1]     2.1753      0.314      6.936      0.000       1.561       2.790
==============================================================================
Omnibus:                     1283.881   Durbin-Watson:                   1.957
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1927.815
Skew:                           0.510   Prob(JB):                         0.00
Kurtosis:                       4.049   Cond. No.                     2.27e+05
==============================================================================
~~~

The way we read the interaction effect given by this summary is as follows:
    
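1. `C(waterfront)[T.1]` can no longer be read on its own. Because the model includes an interaction with `lat`, this coefficient ($-102.3$ on the log-price scale) is the estimated difference between waterfront and non-waterfront houses at a latitude of $0$, which is far outside the range of our data (King County lies at roughly $47$ to $48$ degrees north). The difference at a realistic latitude is $-102.3040 + 2.1753 \times lat$; at `lat` $= 47.5$, for example, this is about $+1.02$ in log price, which agrees with the positive impact of `waterfront` that we saw before. The very large intercept and main-effect coefficients are a consequence of this extrapolation to latitude $0$; they do not mean that the model predicts negative prices. Centering `lat` before fitting would make these coefficients easier to read.

2. `lat` and `lat:C(waterfront)[T.1]` read as follows: for each one-degree increase in latitude, $\log(price)$ increases by $1.7058$ among houses without a waterfront view and by $1.7058 + 2.1753 = 3.8811$ among houses with one. One degree is a large step relative to the spread of our data, so it is easier to think in hundredths of a degree: for each $0.01$ degree increase in latitude, price increases by about $1.7\%$ without a waterfront view and by about $3.9\%$ with one.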
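
To make this reading concrete, the arithmetic can be sketched with the coefficients copied from the table above. The waterfront coefficient on its own describes the gap between the two groups at latitude $0$, so the gap at a realistic latitude also involves the interaction term. The helper name `waterfront_gap` and the example latitude of 47.5 are illustrative assumptions, and all quantities are on the log-price scale.

```python
# Coefficients copied from the summary table above.
b_waterfront = -102.3040   # C(waterfront)[T.1]
b_lat = 1.7058             # lat
b_interaction = 2.1753     # lat:C(waterfront)[T.1]

# Slope of log(price) with respect to latitude in each group.
slope_no_waterfront = b_lat
slope_waterfront = b_lat + b_interaction


def waterfront_gap(lat):
    """Difference in predicted log(price), waterfront minus non-waterfront."""
    return b_waterfront + b_interaction * lat


print(round(slope_no_waterfront, 4), round(slope_waterfront, 4))
print(round(waterfront_gap(47.5), 3))
```

At latitude 47.5 the gap is positive, even though the `C(waterfront)[T.1]` coefficient by itself is large and negative.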

### Exercise 9

Now, consider a model with an interaction term between `lat` and `view`. What do you see? Do these results agree with your findings from Exercise 8.2?

~~~plain
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.323
Model:                            OLS   Adj. R-squared:                  0.323
Method:                 Least Squares   F-statistic:                     1145.
Date:                Thu, 10 Jun 2021   Prob (F-statistic):               0.00
Time:                        21:36:48   Log-Likelihood:                -12593.
No. Observations:               21613   AIC:                         2.521e+04
Df Residuals:                   21603   BIC:                         2.529e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept          -66.6850      1.053    -63.317      0.000     -68.749     -64.621
C(view)[T.1]       -17.0569     10.953     -1.557      0.119     -38.525       4.411
C(view)[T.2]        -7.7183      5.444     -1.418      0.156     -18.390       2.953
C(view)[T.3]        -9.6270      6.936     -1.388      0.165     -23.223       3.969
C(view)[T.4]       -26.4461     10.311     -2.565      0.010     -46.656      -6.237
lat                  1.6753      0.022     75.652      0.000       1.632       1.719
lat:C(view)[T.1]     0.3677      0.230      1.597      0.110      -0.084       0.819
lat:C(view)[T.2]     0.1715      0.114      1.499      0.134      -0.053       0.396
lat:C(view)[T.3]     0.2164      0.146      1.483      0.138      -0.070       0.502
lat:C(view)[T.4]     0.5769      0.217      2.662      0.008       0.152       1.002
==============================================================================
Omnibus:                      944.061   Durbin-Watson:                   1.949
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1466.482
Skew:                           0.395   Prob(JB):                         0.00
Kurtosis:                       4.002   Cond. No.                     1.77e+05
==============================================================================
~~~

**Answer.**

-------

### Incorporating interaction effects into a linear model

Let's fit a baseline model including all of the other variables we have discussed before. In addition, let us add a `zipcode` variable and a new variable, `renovated`, indicating whether or not the house was previously renovated:

$$
\log(price) = \beta_0 + \beta_1 \log(sqft{\_}living)+ \beta_2 \log(sqft{\_}lot) + \beta_3 bedrooms + \beta_4 floors + \beta_5 bathrooms + \beta_6 waterfront + \beta_7 condition  + \beta_8 view + \beta_9 grade + \beta_{10} yr{\_}built + \beta_{11} lat + \beta_{12} (lat^2) + \beta_{13}long + \beta_{14}zipcode + \beta_{15}renovated + \varepsilon
$$

The output table is as follows:

~~~plain
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.880
Model:                            OLS   Adj. R-squared:                  0.879
Method:                 Least Squares   F-statistic:                     1833.
Date:                Thu, 10 Jun 2021   Prob (F-statistic):               0.00
Time:                        21:39:13   Log-Likelihood:                 6088.4
No. Observations:               21613   AIC:                        -1.200e+04
Df Residuals:                   21526   BIC:                        -1.131e+04
Df Model:                          86                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept            -6003.8719    525.035    -11.435      0.000   -7032.980   -4974.764
C(waterfront)[T.1]       0.4354      0.018     23.793      0.000       0.400       0.471
C(view)[T.1]             0.1204      0.010     11.644      0.000       0.100       0.141
C(view)[T.2]             0.1110      0.006     17.664      0.000       0.099       0.123
C(view)[T.3]             0.1834      0.009     21.398      0.000       0.167       0.200
C(view)[T.4]             0.2904      0.013     21.986      0.000       0.264       0.316
C(zipcode)[T.98002]      0.0201      0.017      1.219      0.223      -0.012       0.052
C(zipcode)[T.98003]     -0.0109      0.015     -0.740      0.459      -0.040       0.018
C(zipcode)[T.98004]      0.9063      0.028     32.305      0.000       0.851       0.961
C(zipcode)[T.98005]      0.5165      0.030     17.280      0.000       0.458       0.575
C(zipcode)[T.98006]      0.4546      0.026     17.748      0.000       0.404       0.505
C(zipcode)[T.98007]      0.4437      0.031     14.389      0.000       0.383       0.504
C(zipcode)[T.98008]      0.4540      0.029     15.427      0.000       0.396       0.512
C(zipcode)[T.98010]      0.3073      0.025     12.154      0.000       0.258       0.357
C(zipcode)[T.98011]      0.2650      0.037      7.258      0.000       0.193       0.337
C(zipcode)[T.98014]      0.1968      0.040      4.861      0.000       0.117       0.276
C(zipcode)[T.98019]      0.2173      0.039      5.501      0.000       0.140       0.295
C(zipcode)[T.98022]      0.3431      0.024     14.043      0.000       0.295       0.391
C(zipcode)[T.98023]     -0.0630      0.014     -4.653      0.000      -0.090      -0.036
C(zipcode)[T.98024]      0.3142      0.037      8.530      0.000       0.242       0.386
C(zipcode)[T.98027]      0.3759      0.026     14.254      0.000       0.324       0.428
C(zipcode)[T.98028]      0.2056      0.035      5.797      0.000       0.136       0.275
C(zipcode)[T.98029]      0.4751      0.030     16.064      0.000       0.417       0.533
C(zipcode)[T.98030]      0.0026      0.017      0.152      0.879      -0.031       0.036
C(zipcode)[T.98031]     -0.0240      0.019     -1.297      0.195      -0.060       0.012
C(zipcode)[T.98032]     -0.1287      0.020     -6.330      0.000      -0.169      -0.089
C(zipcode)[T.98033]      0.5719      0.031     18.593      0.000       0.512       0.632
C(zipcode)[T.98034]      0.3267      0.033     10.022      0.000       0.263       0.391
C(zipcode)[T.98038]      0.1915      0.019      9.996      0.000       0.154       0.229
C(zipcode)[T.98039]      1.0768      0.037     29.101      0.000       1.004       1.149
C(zipcode)[T.98040]      0.6601      0.026     25.647      0.000       0.610       0.711
C(zipcode)[T.98042]      0.0509      0.016      3.102      0.002       0.019       0.083
C(zipcode)[T.98045]      0.3160      0.035      8.909      0.000       0.247       0.386
C(zipcode)[T.98052]      0.4534      0.031     14.445      0.000       0.392       0.515
C(zipcode)[T.98053]      0.4509      0.034     13.412      0.000       0.385       0.517
C(zipcode)[T.98055]     -0.0058      0.021     -0.273      0.785      -0.047       0.036
C(zipcode)[T.98056]      0.1432      0.023      6.204      0.000       0.098       0.188
C(zipcode)[T.98058]      0.0374      0.020      1.853      0.064      -0.002       0.077
C(zipcode)[T.98059]      0.2034      0.023      8.962      0.000       0.159       0.248
C(zipcode)[T.98065]      0.3744      0.033     11.291      0.000       0.309       0.439
C(zipcode)[T.98070]      0.0334      0.025      1.354      0.176      -0.015       0.082
C(zipcode)[T.98072]      0.3060      0.036      8.432      0.000       0.235       0.377
C(zipcode)[T.98074]      0.3963      0.031     12.947      0.000       0.336       0.456
C(zipcode)[T.98075]      0.4334      0.030     14.454      0.000       0.375       0.492
C(zipcode)[T.98077]      0.2956      0.038      7.834      0.000       0.222       0.370
C(zipcode)[T.98092]      0.0826      0.015      5.611      0.000       0.054       0.111
C(zipcode)[T.98102]      0.7063      0.032     21.970      0.000       0.643       0.769
C(zipcode)[T.98103]      0.5483      0.030     18.448      0.000       0.490       0.607
C(zipcode)[T.98105]      0.6989      0.031     22.776      0.000       0.639       0.759
C(zipcode)[T.98106]      0.0777      0.024      3.232      0.001       0.031       0.125
C(zipcode)[T.98107]      0.5569      0.031     18.148      0.000       0.497       0.617
C(zipcode)[T.98108]      0.1054      0.026      4.036      0.000       0.054       0.157
C(zipcode)[T.98109]      0.7183      0.032     22.432      0.000       0.656       0.781
C(zipcode)[T.98112]      0.7980      0.029     27.782      0.000       0.742       0.854
C(zipcode)[T.98115]      0.5551      0.030     18.428      0.000       0.496       0.614
C(zipcode)[T.98116]      0.4531      0.026     17.385      0.000       0.402       0.504
C(zipcode)[T.98117]      0.5208      0.030     17.107      0.000       0.461       0.580
C(zipcode)[T.98118]      0.2308      0.024      9.771      0.000       0.184       0.277
C(zipcode)[T.98119]      0.6881      0.030     22.742      0.000       0.629       0.747
C(zipcode)[T.98122]      0.5509      0.028     19.915      0.000       0.497       0.605
C(zipcode)[T.98125]      0.3070      0.032      9.515      0.000       0.244       0.370
C(zipcode)[T.98126]      0.2712      0.024     11.095      0.000       0.223       0.319
C(zipcode)[T.98133]      0.1985      0.033      5.965      0.000       0.133       0.264
C(zipcode)[T.98136]      0.3950      0.025     15.803      0.000       0.346       0.444
C(zipcode)[T.98144]      0.4237      0.026     16.159      0.000       0.372       0.475
C(zipcode)[T.98146]      0.0194      0.023      0.850      0.396      -0.025       0.064
C(zipcode)[T.98148]     -0.0299      0.029     -1.027      0.304      -0.087       0.027
C(zipcode)[T.98155]      0.1894      0.035      5.459      0.000       0.121       0.257
C(zipcode)[T.98166]      0.0804      0.021      3.856      0.000       0.040       0.121
C(zipcode)[T.98168]     -0.1559      0.022     -7.018      0.000      -0.199      -0.112
C(zipcode)[T.98177]      0.3239      0.035      9.312      0.000       0.256       0.392
C(zipcode)[T.98178]     -0.0602      0.023     -2.622      0.009      -0.105      -0.015
C(zipcode)[T.98188]     -0.0904      0.023     -3.949      0.000      -0.135      -0.046
C(zipcode)[T.98198]     -0.0746      0.017     -4.353      0.000      -0.108      -0.041
C(zipcode)[T.98199]      0.5534      0.029     18.834      0.000       0.496       0.611
C(renovated)[T.True]     0.0572      0.007      8.568      0.000       0.044       0.070
np.log(sqft_living)      0.4217      0.006     66.545      0.000       0.409       0.434
np.log(sqft_lot)         0.0710      0.002     34.155      0.000       0.067       0.075
bedrooms                -0.0140      0.002     -7.765      0.000      -0.018      -0.010
floors                   0.0180      0.003      5.535      0.000       0.012       0.024
bathrooms                0.0349      0.003     11.903      0.000       0.029       0.041
condition                0.0450      0.002     20.626      0.000       0.041       0.049
grade                    0.1115      0.002     58.332      0.000       0.108       0.115
yr_built                -0.0004    7.5e-05     -4.758      0.000      -0.001      -0.000
lat                    250.3756     22.109     11.324      0.000     207.040     293.711
I(lat ** 2)             -2.6284      0.233    -11.299      0.000      -3.084      -2.172
long                    -0.4088      0.052     -7.842      0.000      -0.511      -0.307
==============================================================================
Omnibus:                     1429.328   Durbin-Watson:                   2.001
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5980.405
Skew:                          -0.192   Prob(JB):                         0.00
Kurtosis:                       5.548   Cond. No.                     1.27e+09
==============================================================================
~~~

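We can see that both a waterfront view and renovations are associated with higher prices among otherwise comparable homes. Reading the coefficients directly as percentages gives about 43.5 percent for a waterfront view and about 5.7 percent for renovations. Since this is a log-price model, these readings approximate $e^{\beta} - 1$, and the approximation is rougher for the larger waterfront coefficient.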
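
Because the dependent variable is $\log(price)$, a coefficient $\beta$ on an indicator variable corresponds to multiplying the price by $e^{\beta}$. A short sketch of the conversion to a percent change, using the two coefficients from the table above (the helper name `pct_effect` is ours):

```python
import math


def pct_effect(beta):
    """Percent change in price implied by a log-price coefficient."""
    return (math.exp(beta) - 1.0) * 100.0


# Coefficients copied from the baseline output table.
for name, beta in [("waterfront", 0.4354), ("renovated", 0.0572)]:
    print(f"{name}: coefficient {beta:.4f}, implied effect {pct_effect(beta):.1f}%")
```

For small coefficients such as the one for `renovated`, the two readings nearly coincide; for larger ones such as `waterfront`, the exponentiated value is noticeably higher.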

So far, we have looked at global effects of independent variables, irrespective of the values of the other variables. However, we might ask if the effect of a waterfront view is different for houses that were recently renovated versus those that were not. To answer this question, we need to add an interaction term:

$$
\log(price) = \beta_0 + \beta_1 \log(sqft{\_}living)+ \beta_2 \log(sqft{\_}lot) + \beta_3 bedrooms + \beta_4 floors + \beta_5 bathrooms + \beta_6 waterfront + \beta_7 condition  + \beta_8 view + \beta_9 grade + \beta_{10} yr{\_}built + \beta_{11} lat + \beta_{12} (lat^2) + \beta_{13}long + \beta_{14}zipcode + \beta_{15}renovated + \beta_{16} (waterfront \times renovated)+ \varepsilon
$$

The output table is as follows:

~~~plain
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.880
Model:                            OLS   Adj. R-squared:                  0.879
Method:                 Least Squares   F-statistic:                     1812.
Date:                Thu, 10 Jun 2021   Prob (F-statistic):               0.00
Time:                        23:03:41   Log-Likelihood:                 6092.2
No. Observations:               21613   AIC:                        -1.201e+04
Df Residuals:                   21525   BIC:                        -1.131e+04
Df Model:                          87                                         
Covariance Type:            nonrobust                                         
===========================================================================================================
                                              coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------
Intercept                               -5991.1993    524.975    -11.412      0.000   -7020.188   -4962.210
C(waterfront)[T.1]                          0.4585      0.020     22.793      0.000       0.419       0.498
C(view)[T.1]                                0.1202      0.010     11.627      0.000       0.100       0.140
C(view)[T.2]                                0.1110      0.006     17.659      0.000       0.099       0.123
C(view)[T.3]                                0.1833      0.009     21.383      0.000       0.166       0.200
C(view)[T.4]                                0.2904      0.013     21.989      0.000       0.264       0.316
C(zipcode)[T.98002]                         0.0201      0.016      1.215      0.224      -0.012       0.052
C(zipcode)[T.98003]                        -0.0108      0.015     -0.733      0.464      -0.040       0.018
C(zipcode)[T.98004]                         0.9068      0.028     32.326      0.000       0.852       0.962
C(zipcode)[T.98005]                         0.5170      0.030     17.298      0.000       0.458       0.576
C(zipcode)[T.98006]                         0.4554      0.026     17.781      0.000       0.405       0.506
C(zipcode)[T.98007]                         0.4441      0.031     14.405      0.000       0.384       0.505
C(zipcode)[T.98008]                         0.4544      0.029     15.441      0.000       0.397       0.512
C(zipcode)[T.98010]                         0.3070      0.025     12.144      0.000       0.257       0.357
C(zipcode)[T.98011]                         0.2652      0.037      7.266      0.000       0.194       0.337
C(zipcode)[T.98014]                         0.1971      0.040      4.868      0.000       0.118       0.276
C(zipcode)[T.98019]                         0.2174      0.039      5.506      0.000       0.140       0.295
C(zipcode)[T.98022]                         0.3425      0.024     14.022      0.000       0.295       0.390
C(zipcode)[T.98023]                        -0.0625      0.014     -4.619      0.000      -0.089      -0.036
C(zipcode)[T.98024]                         0.3143      0.037      8.535      0.000       0.242       0.386
C(zipcode)[T.98027]                         0.3762      0.026     14.267      0.000       0.325       0.428
C(zipcode)[T.98028]                         0.2059      0.035      5.804      0.000       0.136       0.275
C(zipcode)[T.98029]                         0.4755      0.030     16.081      0.000       0.418       0.534
C(zipcode)[T.98030]                         0.0028      0.017      0.164      0.870      -0.031       0.036
C(zipcode)[T.98031]                        -0.0237      0.019     -1.282      0.200      -0.060       0.013
C(zipcode)[T.98032]                        -0.1285      0.020     -6.319      0.000      -0.168      -0.089
C(zipcode)[T.98033]                         0.5722      0.031     18.606      0.000       0.512       0.633
C(zipcode)[T.98034]                         0.3270      0.033     10.032      0.000       0.263       0.391
C(zipcode)[T.98038]                         0.1916      0.019     10.002      0.000       0.154       0.229
C(zipcode)[T.98039]                         1.0764      0.037     29.092      0.000       1.004       1.149
C(zipcode)[T.98040]                         0.6607      0.026     25.672      0.000       0.610       0.711
C(zipcode)[T.98042]                         0.0510      0.016      3.108      0.002       0.019       0.083
C(zipcode)[T.98045]                         0.3163      0.035      8.917      0.000       0.247       0.386
C(zipcode)[T.98052]                         0.4537      0.031     14.456      0.000       0.392       0.515
C(zipcode)[T.98053]                         0.4513      0.034     13.425      0.000       0.385       0.517
C(zipcode)[T.98055]                        -0.0054      0.021     -0.257      0.797      -0.047       0.036
C(zipcode)[T.98056]                         0.1434      0.023      6.215      0.000       0.098       0.189
C(zipcode)[T.98058]                         0.0377      0.020      1.866      0.062      -0.002       0.077
C(zipcode)[T.98059]                         0.2038      0.023      8.980      0.000       0.159       0.248
C(zipcode)[T.98065]                         0.3747      0.033     11.302      0.000       0.310       0.440
C(zipcode)[T.98070]                         0.0343      0.025      1.392      0.164      -0.014       0.083
C(zipcode)[T.98072]                         0.3063      0.036      8.441      0.000       0.235       0.377
C(zipcode)[T.98074]                         0.3964      0.031     12.953      0.000       0.336       0.456
C(zipcode)[T.98075]                         0.4335      0.030     14.458      0.000       0.375       0.492
C(zipcode)[T.98077]                         0.2959      0.038      7.843      0.000       0.222       0.370
C(zipcode)[T.98092]                         0.0825      0.015      5.611      0.000       0.054       0.111
C(zipcode)[T.98102]                         0.7067      0.032     21.986      0.000       0.644       0.770
C(zipcode)[T.98103]                         0.5487      0.030     18.462      0.000       0.490       0.607
C(zipcode)[T.98105]                         0.6989      0.031     22.782      0.000       0.639       0.759
C(zipcode)[T.98106]                         0.0781      0.024      3.248      0.001       0.031       0.125
C(zipcode)[T.98107]                         0.5573      0.031     18.163      0.000       0.497       0.617
C(zipcode)[T.98108]                         0.1058      0.026      4.054      0.000       0.055       0.157
C(zipcode)[T.98109]                         0.7187      0.032     22.447      0.000       0.656       0.781
C(zipcode)[T.98112]                         0.7982      0.029     27.795      0.000       0.742       0.855
C(zipcode)[T.98115]                         0.5554      0.030     18.439      0.000       0.496       0.614
C(zipcode)[T.98116]                         0.4533      0.026     17.396      0.000       0.402       0.504
C(zipcode)[T.98117]                         0.5212      0.030     17.123      0.000       0.462       0.581
C(zipcode)[T.98118]                         0.2309      0.024      9.780      0.000       0.185       0.277
C(zipcode)[T.98119]                         0.6883      0.030     22.755      0.000       0.629       0.748
C(zipcode)[T.98122]                         0.5511      0.028     19.926      0.000       0.497       0.605
C(zipcode)[T.98125]                         0.3082      0.032      9.554      0.000       0.245       0.371
C(zipcode)[T.98126]                         0.2716      0.024     11.113      0.000       0.224       0.320
C(zipcode)[T.98133]                         0.1988      0.033      5.975      0.000       0.134       0.264
C(zipcode)[T.98136]                         0.3952      0.025     15.816      0.000       0.346       0.444
C(zipcode)[T.98144]                         0.4244      0.026     16.188      0.000       0.373       0.476
C(zipcode)[T.98146]                         0.0193      0.023      0.845      0.398      -0.025       0.064
C(zipcode)[T.98148]                        -0.0296      0.029     -1.015      0.310      -0.087       0.028
C(zipcode)[T.98155]                         0.1896      0.035      5.465      0.000       0.122       0.258
C(zipcode)[T.98166]                         0.0807      0.021      3.869      0.000       0.040       0.122
C(zipcode)[T.98168]                        -0.1555      0.022     -7.001      0.000      -0.199      -0.112
C(zipcode)[T.98177]                         0.3240      0.035      9.317      0.000       0.256       0.392
C(zipcode)[T.98178]                        -0.0603      0.023     -2.626      0.009      -0.105      -0.015
C(zipcode)[T.98188]                        -0.0901      0.023     -3.934      0.000      -0.135      -0.045
C(zipcode)[T.98198]                        -0.0742      0.017     -4.332      0.000      -0.108      -0.041
C(zipcode)[T.98199]                         0.5538      0.029     18.847      0.000       0.496       0.611
C(renovated)[T.True]                        0.0606      0.007      8.931      0.000       0.047       0.074
C(waterfront)[T.1]:C(renovated)[T.True]    -0.0923      0.033     -2.761      0.006      -0.158      -0.027
np.log(sqft_living)                         0.4217      0.006     66.566      0.000       0.409       0.434
np.log(sqft_lot)                            0.0710      0.002     34.160      0.000       0.067       0.075
bedrooms                                   -0.0140      0.002     -7.776      0.000      -0.018      -0.011
floors                                      0.0180      0.003      5.549      0.000       0.012       0.024
bathrooms                                   0.0348      0.003     11.893      0.000       0.029       0.041
condition                                   0.0450      0.002     20.639      0.000       0.041       0.049
grade                                       0.1114      0.002     58.301      0.000       0.108       0.115
yr_built                                   -0.0004    7.5e-05     -4.729      0.000      -0.001      -0.000
lat                                       249.8443     22.107     11.302      0.000     206.514     293.175
I(lat ** 2)                                -2.6228      0.233    -11.276      0.000      -3.079      -2.167
long                                       -0.4085      0.052     -7.837      0.000      -0.511      -0.306
==============================================================================
Omnibus:                     1429.354   Durbin-Watson:                   2.001
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5975.932
Skew:                          -0.193   Prob(JB):                         0.00
Kurtosis:                       5.547   Cond. No.                     1.27e+09
==============================================================================
~~~

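There does seem to be a small differential effect for properties with a waterfront that were renovated versus those that were not renovated: the interaction coefficient is $\beta_{16}=-0.0923$ ($p = 0.006$). The estimated renovation effect on log price is $0.0606$ for houses without a waterfront view and $0.0606 - 0.0923 = -0.0317$ for houses with one.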
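
One way to weigh whether the extra term is worth keeping is to compare the AIC of the two models. As a rough sketch, the AIC values reported in these tables can be reproduced from the log-likelihood as $2k - 2\ln L$, where $k$ counts the estimated coefficients including the intercept (`Df Model` + 1). The helper name `aic` below is ours.

```python
def aic(log_likelihood, df_model):
    """AIC = 2k - 2 ln(L), with k = df_model + 1 to count the intercept."""
    k = df_model + 1
    return 2 * k - 2 * log_likelihood


# Log-Likelihood and Df Model copied from the two tables above.
aic_baseline = aic(6088.4, 86)
aic_interaction = aic(6092.2, 87)

print(round(aic_baseline, 1), round(aic_interaction, 1))
print(round(aic_interaction - aic_baseline, 1))
```

A lower AIC is better, so the model with the interaction term is slightly preferred here, although the difference is small.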

### Exercise 10

The AIC for the baseline model was approximately $-12003$ (reported as `-1.200e+04` in its output table) and the $R^2$ was $0.88$. Let's expand the model by doing the following:

1. Adding a term which accounts for the square of the year the house was built
2. Adding an interaction term that lets the relationship between longitude and the price of a house differ depending on whether the house has a basement

Below are the output tables. Compare these models' fit and AIC with each other and with the previous model. Which is the best model?

Here is the table for the model with an additional $\beta_{17}(yr{\_}built^2)$:

<pre><small>
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.884
Model:                            OLS   Adj. R-squared:                  0.883
Method:                 Least Squares   F-statistic:                     1840.
Date:                Fri, 02 Jul 2021   Prob (F-statistic):               0.00
Time:                        21:19:58   Log-Likelihood:                 6454.8
No. Observations:               21613   AIC:                        -1.273e+04
Df Residuals:                   21523   BIC:                        -1.201e+04
Df Model:                          89                                         
Covariance Type:            nonrobust                                         
============================================================================================================
                                               coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------
Intercept                                -6013.2858    516.502    -11.642      0.000   -7025.668   -5000.904
C(renovated)[T.True]                        -0.3768      0.107     -3.521      0.000      -0.587      -0.167
C(view)[T.1]                                 0.1278      0.010     12.563      0.000       0.108       0.148
C(view)[T.2]                                 0.1187      0.006     19.179      0.000       0.107       0.131
C(view)[T.3]                                 0.1911      0.008     22.649      0.000       0.175       0.208
C(view)[T.4]                                 0.3004      0.013     23.090      0.000       0.275       0.326
C(waterfront)[T.1]                         -39.7492      5.998     -6.628      0.000     -51.505     -27.994
C(zipcode)[T.98002]                          0.0229      0.016      1.410      0.158      -0.009       0.055
C(zipcode)[T.98003]                          0.0072      0.015      0.493      0.622      -0.021       0.036
C(zipcode)[T.98004]                          0.9255      0.028     33.521      0.000       0.871       0.980
C(zipcode)[T.98005]                          0.5504      0.029     18.705      0.000       0.493       0.608
C(zipcode)[T.98006]                          0.4796      0.025     19.025      0.000       0.430       0.529
C(zipcode)[T.98007]                          0.4816      0.030     15.864      0.000       0.422       0.541
C(zipcode)[T.98008]                          0.4907      0.029     16.937      0.000       0.434       0.548
C(zipcode)[T.98010]                          0.3003      0.025     12.078      0.000       0.252       0.349
C(zipcode)[T.98011]                          0.2842      0.036      7.913      0.000       0.214       0.355
C(zipcode)[T.98014]                          0.2018      0.040      5.066      0.000       0.124       0.280
C(zipcode)[T.98019]                          0.2216      0.039      5.704      0.000       0.145       0.298
C(zipcode)[T.98022]                          0.3436      0.024     14.300      0.000       0.296       0.391
C(zipcode)[T.98023]                         -0.0471      0.013     -3.536      0.000      -0.073      -0.021
C(zipcode)[T.98024]                          0.3108      0.036      8.579      0.000       0.240       0.382
C(zipcode)[T.98027]                          0.3860      0.026     14.877      0.000       0.335       0.437
C(zipcode)[T.98028]                          0.2199      0.035      6.302      0.000       0.151       0.288
C(zipcode)[T.98029]                          0.4917      0.029     16.897      0.000       0.435       0.549
C(zipcode)[T.98030]                          0.0045      0.017      0.270      0.787      -0.028       0.038
C(zipcode)[T.98031]                         -0.0096      0.018     -0.530      0.596      -0.045       0.026
C(zipcode)[T.98032]                         -0.1094      0.020     -5.471      0.000      -0.149      -0.070
C(zipcode)[T.98033]                          0.5861      0.030     19.369      0.000       0.527       0.645
C(zipcode)[T.98034]                          0.3511      0.032     10.948      0.000       0.288       0.414
C(zipcode)[T.98038]                          0.1902      0.019     10.091      0.000       0.153       0.227
C(zipcode)[T.98039]                          1.0932      0.036     30.025      0.000       1.022       1.165
C(zipcode)[T.98040]                          0.6889      0.025     27.185      0.000       0.639       0.739
C(zipcode)[T.98042]                          0.0573      0.016      3.552      0.000       0.026       0.089
C(zipcode)[T.98045]                          0.3307      0.035      9.478      0.000       0.262       0.399
C(zipcode)[T.98052]                          0.4752      0.031     15.387      0.000       0.415       0.536
C(zipcode)[T.98053]                          0.4428      0.033     13.386      0.000       0.378       0.508
C(zipcode)[T.98055]                          0.0029      0.021      0.140      0.889      -0.038       0.044
C(zipcode)[T.98056]                          0.1486      0.023      6.544      0.000       0.104       0.193
C(zipcode)[T.98058]                          0.0576      0.020      2.898      0.004       0.019       0.096
C(zipcode)[T.98059]                          0.2032      0.022      9.101      0.000       0.159       0.247
C(zipcode)[T.98065]                          0.3725      0.033     11.419      0.000       0.309       0.436
C(zipcode)[T.98070]                          0.0271      0.025      1.101      0.271      -0.021       0.075
C(zipcode)[T.98072]                          0.3254      0.036      9.117      0.000       0.255       0.395
C(zipcode)[T.98074]                          0.4170      0.030     13.848      0.000       0.358       0.476
C(zipcode)[T.98075]                          0.4449      0.029     15.084      0.000       0.387       0.503
C(zipcode)[T.98077]                          0.3149      0.037      8.484      0.000       0.242       0.388
C(zipcode)[T.98092]                          0.0850      0.014      5.876      0.000       0.057       0.113
C(zipcode)[T.98102]                          0.6989      0.032     22.101      0.000       0.637       0.761
C(zipcode)[T.98103]                          0.5366      0.029     18.349      0.000       0.479       0.594
C(zipcode)[T.98105]                          0.7024      0.030     23.275      0.000       0.643       0.762
C(zipcode)[T.98106]                          0.0716      0.024      3.029      0.002       0.025       0.118
C(zipcode)[T.98107]                          0.5423      0.030     17.963      0.000       0.483       0.602
C(zipcode)[T.98108]                          0.0958      0.026      3.728      0.000       0.045       0.146
C(zipcode)[T.98109]                          0.7049      0.032     22.376      0.000       0.643       0.767
C(zipcode)[T.98112]                          0.7964      0.028     28.187      0.000       0.741       0.852
C(zipcode)[T.98115]                          0.5667      0.030     19.125      0.000       0.509       0.625
C(zipcode)[T.98116]                          0.4464      0.026     17.409      0.000       0.396       0.497
C(zipcode)[T.98117]                          0.5189      0.030     17.327      0.000       0.460       0.578
C(zipcode)[T.98118]                          0.2234      0.023      9.614      0.000       0.178       0.269
C(zipcode)[T.98119]                          0.6693      0.030     22.480      0.000       0.611       0.728
C(zipcode)[T.98122]                          0.5282      0.027     19.394      0.000       0.475       0.582
C(zipcode)[T.98125]                          0.3278      0.032     10.329      0.000       0.266       0.390
C(zipcode)[T.98126]                          0.2655      0.024     11.043      0.000       0.218       0.313
C(zipcode)[T.98133]                          0.2136      0.033      6.525      0.000       0.149       0.278
C(zipcode)[T.98136]                          0.3883      0.025     15.791      0.000       0.340       0.436
C(zipcode)[T.98144]                          0.4058      0.026     15.723      0.000       0.355       0.456
C(zipcode)[T.98146]                          0.0277      0.022      1.234      0.217      -0.016       0.072
C(zipcode)[T.98148]                         -0.0133      0.029     -0.464      0.642      -0.069       0.043
C(zipcode)[T.98155]                          0.2101      0.034      6.157      0.000       0.143       0.277
C(zipcode)[T.98166]                          0.0966      0.021      4.700      0.000       0.056       0.137
C(zipcode)[T.98168]                         -0.1457      0.022     -6.667      0.000      -0.189      -0.103
C(zipcode)[T.98177]                          0.3463      0.034     10.122      0.000       0.279       0.413
C(zipcode)[T.98178]                         -0.0415      0.023     -1.835      0.067      -0.086       0.003
C(zipcode)[T.98188]                         -0.0743      0.023     -3.298      0.001      -0.118      -0.030
C(zipcode)[T.98198]                         -0.0537      0.017     -3.181      0.001      -0.087      -0.021
C(zipcode)[T.98199]                          0.5682      0.029     19.656      0.000       0.512       0.625
np.log(sqft_living)                          0.4116      0.006     65.718      0.000       0.399       0.424
np.log(sqft_living):C(renovated)[T.True]     0.0584      0.014      4.175      0.000       0.031       0.086
np.log(sqft_lot)                             0.0853      0.002     40.296      0.000       0.081       0.089
bedrooms                                    -0.0104      0.002     -5.833      0.000      -0.014      -0.007
floors                                      -0.0121      0.003     -3.569      0.000      -0.019      -0.005
bathrooms                                    0.0250      0.003      8.608      0.000       0.019       0.031
condition                                    0.0539      0.002     24.837      0.000       0.050       0.058
grade                                        0.1081      0.002     57.358      0.000       0.104       0.112
yr_built                                    -0.1904      0.007    -26.398      0.000      -0.205      -0.176
lat                                        258.3372     21.754     11.875      0.000     215.697     300.977
lat:C(waterfront)[T.1]                       0.8452      0.126      6.699      0.000       0.598       1.092
I(lat ** 2)                                 -2.7122      0.229    -11.849      0.000      -3.161      -2.264
long                                        -0.4591      0.051     -8.948      0.000      -0.560      -0.359
I(yr_built ** 2)                          4.854e-05   1.84e-06     26.350      0.000    4.49e-05    5.21e-05
==============================================================================
Omnibus:                     1559.839   Durbin-Watson:                   2.005
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6885.153
Skew:                          -0.218   Prob(JB):                         0.00
Kurtosis:                       5.730   Cond. No.                     1.64e+12
==============================================================================
</small></pre>

Table with an additional $\beta_{17}(has{\_}basement\times long)$ (without the squared year term):

<pre><small>
                        OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.881
Model:                            OLS   Adj. R-squared:                  0.881
Method:                 Least Squares   F-statistic:                     1776.
Date:                Fri, 02 Jul 2021   Prob (F-statistic):               0.00
Time:                        21:24:21   Log-Likelihood:                 6224.2
No. Observations:               21613   AIC:                        -1.227e+04
Df Residuals:                   21522   BIC:                        -1.154e+04
Df Model:                          90                                         
Covariance Type:            nonrobust                                         
============================================================================================================
                                               coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------
Intercept                                -6311.4036    522.780    -12.073      0.000   -7336.092   -5286.715
C(renovated)[T.True]                        -0.1974      0.108     -1.829      0.067      -0.409       0.014
C(view)[T.1]                                 0.1259      0.010     12.235      0.000       0.106       0.146
C(view)[T.2]                                 0.1162      0.006     18.563      0.000       0.104       0.128
C(view)[T.3]                                 0.1904      0.009     22.301      0.000       0.174       0.207
C(view)[T.4]                                 0.2938      0.013     22.347      0.000       0.268       0.320
C(waterfront)[T.1]                         -39.0766      6.065     -6.443      0.000     -50.964     -27.189
C(zipcode)[T.98002]                          0.0158      0.016      0.965      0.334      -0.016       0.048
C(zipcode)[T.98003]                         -0.0094      0.015     -0.643      0.520      -0.038       0.019
C(zipcode)[T.98004]                          0.9125      0.028     32.676      0.000       0.858       0.967
C(zipcode)[T.98005]                          0.5267      0.030     17.701      0.000       0.468       0.585
C(zipcode)[T.98006]                          0.4645      0.026     18.205      0.000       0.414       0.514
C(zipcode)[T.98007]                          0.4501      0.031     14.666      0.000       0.390       0.510
C(zipcode)[T.98008]                          0.4616      0.029     15.747      0.000       0.404       0.519
C(zipcode)[T.98010]                          0.3090      0.025     12.295      0.000       0.260       0.358
C(zipcode)[T.98011]                          0.2829      0.036      7.785      0.000       0.212       0.354
C(zipcode)[T.98014]                          0.2056      0.040      5.108      0.000       0.127       0.285
C(zipcode)[T.98019]                          0.2307      0.039      5.872      0.000       0.154       0.308
C(zipcode)[T.98022]                          0.3447      0.024     14.194      0.000       0.297       0.392
C(zipcode)[T.98023]                         -0.0590      0.013     -4.380      0.000      -0.085      -0.033
C(zipcode)[T.98024]                          0.3186      0.037      8.700      0.000       0.247       0.390
C(zipcode)[T.98027]                          0.3927      0.026     14.931      0.000       0.341       0.444
C(zipcode)[T.98028]                          0.2228      0.035      6.313      0.000       0.154       0.292
C(zipcode)[T.98029]                          0.4774      0.029     16.225      0.000       0.420       0.535
C(zipcode)[T.98030]                          0.0008      0.017      0.046      0.963      -0.033       0.034
C(zipcode)[T.98031]                         -0.0235      0.018     -1.276      0.202      -0.060       0.013
C(zipcode)[T.98032]                         -0.1237      0.020     -6.118      0.000      -0.163      -0.084
C(zipcode)[T.98033]                          0.5816      0.031     19.000      0.000       0.522       0.642
C(zipcode)[T.98034]                          0.3412      0.032     10.517      0.000       0.278       0.405
C(zipcode)[T.98038]                          0.1859      0.019      9.758      0.000       0.149       0.223
C(zipcode)[T.98039]                          1.0738      0.037     29.164      0.000       1.002       1.146
C(zipcode)[T.98040]                          0.6653      0.026     25.976      0.000       0.615       0.715
C(zipcode)[T.98042]                          0.0478      0.016      2.928      0.003       0.016       0.080
C(zipcode)[T.98045]                          0.3166      0.035      8.975      0.000       0.247       0.386
C(zipcode)[T.98052]                          0.4657      0.031     14.900      0.000       0.404       0.527
C(zipcode)[T.98053]                          0.4487      0.033     13.415      0.000       0.383       0.514
C(zipcode)[T.98055]                         -0.0041      0.021     -0.196      0.845      -0.045       0.037
C(zipcode)[T.98056]                          0.1421      0.023      6.187      0.000       0.097       0.187
C(zipcode)[T.98058]                          0.0392      0.020      1.952      0.051      -0.000       0.079
C(zipcode)[T.98059]                          0.1976      0.023      8.749      0.000       0.153       0.242
C(zipcode)[T.98065]                          0.3690      0.033     11.188      0.000       0.304       0.434
C(zipcode)[T.98070]                          0.0586      0.025      2.361      0.018       0.010       0.107
C(zipcode)[T.98072]                          0.3261      0.036      9.027      0.000       0.255       0.397
C(zipcode)[T.98074]                          0.4012      0.030     13.173      0.000       0.342       0.461
C(zipcode)[T.98075]                          0.4311      0.030     14.450      0.000       0.373       0.490
C(zipcode)[T.98077]                          0.3101      0.038      8.258      0.000       0.236       0.384
C(zipcode)[T.98092]                          0.0813      0.015      5.563      0.000       0.053       0.110
C(zipcode)[T.98102]                          0.7386      0.032     23.071      0.000       0.676       0.801
C(zipcode)[T.98103]                          0.5724      0.030     19.353      0.000       0.514       0.630
C(zipcode)[T.98105]                          0.7246      0.031     23.729      0.000       0.665       0.784
C(zipcode)[T.98106]                          0.0918      0.024      3.838      0.000       0.045       0.139
C(zipcode)[T.98107]                          0.5827      0.031     19.079      0.000       0.523       0.643
C(zipcode)[T.98108]                          0.1200      0.026      4.620      0.000       0.069       0.171
C(zipcode)[T.98109]                          0.7430      0.032     23.320      0.000       0.681       0.805
C(zipcode)[T.98112]                          0.8251      0.029     28.857      0.000       0.769       0.881
C(zipcode)[T.98115]                          0.5806      0.030     19.366      0.000       0.522       0.639
C(zipcode)[T.98116]                          0.4713      0.026     18.162      0.000       0.420       0.522
C(zipcode)[T.98117]                          0.5437      0.030     17.949      0.000       0.484       0.603
C(zipcode)[T.98118]                          0.2447      0.023     10.416      0.000       0.199       0.291
C(zipcode)[T.98119]                          0.7156      0.030     23.757      0.000       0.657       0.775
C(zipcode)[T.98122]                          0.5752      0.028     20.890      0.000       0.521       0.629
C(zipcode)[T.98125]                          0.3242      0.032     10.104      0.000       0.261       0.387
C(zipcode)[T.98126]                          0.2842      0.024     11.689      0.000       0.237       0.332
C(zipcode)[T.98133]                          0.2187      0.033      6.606      0.000       0.154       0.284
C(zipcode)[T.98136]                          0.4113      0.025     16.529      0.000       0.363       0.460
C(zipcode)[T.98144]                          0.4455      0.026     17.072      0.000       0.394       0.497
C(zipcode)[T.98146]                          0.0240      0.023      1.057      0.290      -0.020       0.069
C(zipcode)[T.98148]                         -0.0326      0.029     -1.125      0.260      -0.089       0.024
C(zipcode)[T.98155]                          0.2058      0.035      5.964      0.000       0.138       0.273
C(zipcode)[T.98166]                          0.0874      0.021      4.209      0.000       0.047       0.128
C(zipcode)[T.98168]                         -0.1487      0.022     -6.731      0.000      -0.192      -0.105
C(zipcode)[T.98177]                          0.3406      0.035      9.849      0.000       0.273       0.408
C(zipcode)[T.98178]                         -0.0513      0.023     -2.249      0.025      -0.096      -0.007
C(zipcode)[T.98188]                         -0.0879      0.023     -3.863      0.000      -0.133      -0.043
C(zipcode)[T.98198]                         -0.0689      0.017     -4.036      0.000      -0.102      -0.035
C(zipcode)[T.98199]                          0.5758      0.029     19.672      0.000       0.518       0.633
np.log(sqft_living)                          0.4512      0.007     67.209      0.000       0.438       0.464
np.log(sqft_living):C(renovated)[T.True]     0.0335      0.014      2.375      0.018       0.006       0.061
np.log(sqft_lot)                             0.0669      0.002     31.931      0.000       0.063       0.071
bedrooms                                    -0.0154      0.002     -8.582      0.000      -0.019      -0.012
floors                                      -0.0059      0.004     -1.626      0.104      -0.013       0.001
bathrooms                                    0.0403      0.003     13.717      0.000       0.035       0.046
condition                                    0.0461      0.002     21.248      0.000       0.042       0.050
grade                                        0.1069      0.002     55.518      0.000       0.103       0.111
yr_built                                    -0.0003   7.49e-05     -3.490      0.000      -0.000      -0.000
lat                                        263.3588     22.015     11.963      0.000     220.209     306.509
lat:C(waterfront)[T.1]                       0.8311      0.128      6.515      0.000       0.581       1.081
I(lat ** 2)                                 -2.7653      0.232    -11.938      0.000      -3.219      -2.311
long                                        -0.4035      0.052     -7.769      0.000      -0.505      -0.302
has_basement                                -7.2446      2.607     -2.779      0.005     -12.355      -2.134
has_basement:long                           -0.0589      0.021     -2.760      0.006      -0.101      -0.017
==============================================================================
Omnibus:                     1463.605   Durbin-Watson:                   2.000
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6304.862
Skew:                          -0.191   Prob(JB):                         0.00
Kurtosis:                       5.618   Cond. No.                     1.27e+09
==============================================================================
</small></pre>

**Answer.**

-------

## Conclusions

In this case, we applied various types of transformations to the independent and dependent variables to improve the quality of our linear modeling. In particular, we found that fitting the logarithm of house prices allowed us to get better results. Using our understanding of transformations, we were able to effectively model non-linear relationships, such as the quadratic relationship between latitude and the logarithm of price. Finally, we integrated our understanding of interaction effects from previous EDA cases in order to directly model and quantify the interactions of renovation and waterfront status on the relationship between square footage and price.
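As a rough illustration of how the quadratic and interaction terms are read, the sketch below plugs in coefficients copied from the last summary table (the model with `has_basement:long`). The variable names are ours, and the calculation assumes every other variable, including zipcode, is held fixed. That is a simplification, because zipcode and latitude move together.

```python
# Coefficients copied from the last summary table (model with has_basement:long).
b_lat = 263.3588           # lat
b_lat_sq = -2.7653         # I(lat ** 2)
b_log_sqft = 0.4512        # np.log(sqft_living)
b_log_sqft_renov = 0.0335  # np.log(sqft_living):C(renovated)[T.True]

# Quadratic term: b_lat * lat + b_lat_sq * lat**2 is an inverted parabola
# (b_lat_sq < 0), so its contribution to log(price) peaks where the
# derivative b_lat + 2 * b_lat_sq * lat equals zero.
# This ignores the lat:waterfront interaction, i.e. it is for non-waterfront homes.
peak_lat = -b_lat / (2 * b_lat_sq)

# Interaction term: the slope of log(sqft_living) depends on renovation status.
elasticity_not_renovated = b_log_sqft
elasticity_renovated = b_log_sqft + b_log_sqft_renov

print(f"Latitude with the highest predicted log(price): {peak_lat:.4f}")
print(f"Square-footage elasticity, not renovated: {elasticity_not_renovated:.4f}")
print(f"Square-footage elasticity, renovated:     {elasticity_renovated:.4f}")
```

Under these assumptions, the latitude effect peaks near 47.62, roughly the latitude of central Seattle, and the square-footage elasticity is about 0.45 for non-renovated homes versus about 0.48 for renovated ones.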

## Takeaways

Variable transformations are a powerful technique to improve the quality of our linear models. In particular:

1. Transforming the dependent variable can improve linearity and resolve the problem of uneven variance around the line of best fit.
2. Transforming the independent variables can be useful to improve the quality of the fit, capture non-linear relationships between the independent and dependent variables, and test a wider range of hypotheses.
3. Since $R^2$ never decreases when independent variables are added, we should be careful not to overfit our model by adding variables that provide little explanatory information. We can use indicators like the AIC, which penalizes each additional parameter, to help us decide whether the added variables benefit our model.
4. Interaction terms are a specific type of variable transformation involving the product of two other independent variables. They can capture variability in the relationship between an independent variable and the dependent variable based on the value of a third variable.
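
To make the AIC in item 3 less of a black box, here is a minimal sketch that recomputes it (and the BIC) from the header of the last summary table above. It assumes the usual definitions $AIC = 2k - 2\ln L$ and $BIC = k\ln n - 2\ln L$, with $k$ counting the slope coefficients plus the intercept. The variable names are ours.

```python
import math

# Values copied from the header of the last summary table above.
log_likelihood = 6224.2   # Log-Likelihood
df_model = 90             # Df Model (number of slope coefficients)
n_obs = 21613             # No. Observations

# Number of estimated parameters: slopes plus the intercept (assumed convention).
k = df_model + 1

# Both criteria reward a high log-likelihood and penalize extra parameters;
# BIC's penalty grows with the sample size.
aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n_obs) - 2 * log_likelihood

print(f"AIC = {aic:.1f}")
print(f"BIC = {bic:.1f}")
```

The results agree with the `-1.227e+04` and `-1.154e+04` shown in the table. When comparing two models fitted to the same data, the one with the lower AIC is preferred.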

## Appendix: Elasticities and semi-elasticities

### Elasticities

Recall our first model:

$$
\widehat{\log(price)}=6.7299 + 0.8368 \log(squareft{\_}living)
$$

Let's make use of the identity $e^{\log(x)}=x$ and the facts that $e^{(x+y)} = e^x e^y$ and $e^{px}=(e^x)^p$ to get

$$
\begin{aligned}
e^{\widehat{\log(price)}}&=e^{6.7299 + 0.8368 \log(squareft{\_}living)}\\
\widehat{price}&=e^{6.7299}\,e^{0.8368 \log(squareft{\_}living)}\\
\widehat{price}&=e^{6.7299}\left(e^{\log(squareft{\_}living)}\right)^{0.8368}\\
\widehat{price}&=e^{6.7299}\,squareft{\_}living^{0.8368}
\end{aligned}
$$

This is a non-linear relationship, so it's not as straightforward as saying "increasing $squareft{\_}living$ by 1 means that $\widehat{price}$ goes up by $x$". However, we can try reframing this in percentage terms; i.e. how does a 1 percent increase in $squareft{\_}living$ affect $\widehat{price}$? We can see this by calculating the price of a house with only one square foot:

$$
\begin{aligned}
\widehat{price}&=e^{6.7299}squareft{\_}living^{0.8368}\\
&=e^{6.7299}1^{0.8368}\\
&=837.063
\end{aligned}
$$

We then compare it with the predicted price of a house with 1.01 square feet (that is, 1% more):

$$
\begin{aligned}
\widehat{price}&=e^{6.7299}squareft{\_}living^{0.8368}\\
&=e^{6.7299}1.01^{0.8368}\\
&=844.062
\end{aligned}
$$

The house with 1.01 square feet has a predicted price that is 1.008361198 times that of the other one, that is, this 1% increase in square footage is associated with a 0.8361% increase in predicted price. You can see that this percentage is very close to our regression model's $\beta_1=0.8368$. This is the reason why we can interpret $\beta_1$ as an elasticity in a log-log linear regression. In a slightly more general case, defining $S=1.01\times squareft{\_}living$, we have:

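$$
\begin{aligned}
\widehat{price_S}&=e^{6.7299}S^{0.8368}\\
&=(e^{6.7299}squareft{\_}living^{0.8368})(1.01)^{0.8368}\\
&= (1.008361198)\, \widehat{price} \\
&\approx \left(1+\frac{\beta_1}{100}\right) \widehat{price}
\end{aligned}
$$

which implies that

$$
\beta_1 \approx 100\left(\frac{\widehat{price_S}}{\widehat{price}}-1\right)
$$

That is, $\beta_1$ is approximately the percentage change in predicted price associated with a 1% increase in square footage.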
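
The arithmetic in this section can be checked numerically. The sketch below uses the coefficients of the log-log model quoted above. The function name `predicted_price` and the sample sizes are ours, chosen only for illustration.

```python
import math

# Coefficients of the log-log model quoted above.
b0, b1 = 6.7299, 0.8368


def predicted_price(sqft_living):
    """Back-transform the log-log fit: price = e^b0 * sqft_living^b1."""
    return math.exp(b0) * sqft_living ** b1


# Percentage change in predicted price for a 1% increase in square footage,
# evaluated at several (illustrative) starting sizes.
pct_changes = []
for sqft in (1, 500, 2000):
    ratio = predicted_price(1.01 * sqft) / predicted_price(sqft)
    pct_changes.append(100 * (ratio - 1))

for sqft, pct in zip((1, 500, 2000), pct_changes):
    print(f"sqft_living = {sqft:>4}: +1% size -> {pct:.4f}% price (b1 = {b1})")
```

The percentage change is the same at every starting size, because the ratio depends only on $1.01^{\beta_1}$. This is why a single number can summarize the elasticity.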

### Semi-elasticities

As in the previous section, consider the model

$$
\widehat{\log(price)}=\beta_0 + \beta_1 squareft{\_}living
$$

Therefore,

$$
\widehat{price}=e^{\beta_0} e^{\beta_1 squareft{\_}living}
$$

If we define $S=squareft{\_}living+1$ (notice: not a percentage increase, but an absolute increase), then

$$
\begin{aligned}
\widehat{price_S}&=e^{\beta_0} e^{\beta_1 S}\\
&=e^{\beta_0} e^{\beta_1 (squareft{\_}living+1)}\\
&=e^{\beta_0} \left(e^{squareft{\_}living+1}\right)^{\beta_1}\\
&=e^{\beta_0} e^{\beta_1 squareft{\_}living}e^{\beta_1}\\
&=e^{\beta_1}\widehat{price}
\end{aligned}
$$

It follows that:

$$
\begin{aligned}
e^{\beta_1}&=\frac{\widehat{price_S}}{\widehat{price}}\\
\end{aligned}
$$

In our second model (the log-level model), we obtained $\beta_1=0.0004$. Thus,

$$
\begin{aligned}
\frac{\widehat{price_S}}{\widehat{price}}&=e^{0.0004}\\
&=1.0004\\
&\approx \beta_1 + 1
\end{aligned}
$$

And therefore

$$
\beta_1 \approx \frac{\widehat{price_S}}{\widehat{price}} - 1
$$

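which is a semi-elasticity: $100\times\beta_1$ approximates the percentage change in the explained variable that is associated with an absolute change of one unit in the explanatory variable. Here, each additional square foot of living space is associated with an increase of about 0.04% in predicted price.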
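
The approximation $e^{\beta_1} - 1 \approx \beta_1$ holds only for small coefficients. The sketch below compares the exact and approximate percentage changes for the log-level coefficient above and, as a contrast, for the `C(view)[T.4]` coefficient (0.2938) from the last summary table. The helper name `exact_pct_change` is ours.

```python
import math


def exact_pct_change(beta):
    """Exact percentage change in predicted price for a one-unit increase
    in a regressor whose coefficient on log(price) is beta."""
    return 100 * (math.exp(beta) - 1)


# Small coefficient: the log-level slope on squareft_living quoted above.
small = 0.0004
# Larger coefficient: C(view)[T.4] from the last summary table.
large = 0.2938

for beta in (small, large):
    print(f"beta = {beta}: approx {100 * beta:.4f}%, exact {exact_pct_change(beta):.4f}%")
```

For the small coefficient the two figures are practically identical, while for the larger one the exact change (about 34%) is noticeably higher than the approximation (about 29%). For large coefficients it is safer to report $100\,(e^{\beta}-1)$.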

## Attribution

"House Sales in King County, USA", August 25, 2016, harlfoxem, CC0 Public Domain, https://www.kaggle.com/harlfoxem/housesalesprediction