# How should we price homes in Seattle?

## Pulling in the data

Let's start by importing the required libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.formula.api as smf

Let's now load the data:

In [None]:
houses = pd.read_csv('data/kc_house_data.csv')

## Making normal QQ-plots

We can make normal QQ-plots with the [**`scipy.stats.probplot()`**](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html) function from `scipy`. The syntax is pretty straightforward:

~~~python
probplot(x, dist, plot)
~~~

`x` is the data whose quantiles will be plotted (for example, a `pandas` Series) and `dist` determines the theoretical distribution that `x`'s quantiles will be compared with. In our case, we want to compare `x` to the normal distribution. We can specify that by passing the string `"norm"` as the `dist` argument (see the example below).

The `plot` argument tells `scipy` to produce a plot using a specific engine. We usually pass `plot=plt` to use `matplotlib` (`plt` comes from the `import matplotlib.pyplot as plt` line that we ran a moment ago in our imports cell).

### Example 1

Make a normal QQ-plot of the `price` variable.

**Answer.** Shown below:

In [None]:
# QQ plot of price
stats.probplot(x=houses['price'], dist="norm", plot=plt)
plt.title("QQ Plot for Prices")
plt.show()
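Besides drawing the plot, `probplot()` returns the numbers behind it, which can help when a visual judgment is ambiguous. The sketch below is only an illustration: it uses synthetic samples rather than the housing data, and the variable names are made up for the example. It compares the correlation coefficient `r` of the fitted QQ line for a normal sample and for a right-skewed one. Values of `r` closer to 1 mean that the points lie closer to a straight line.

```python
import numpy as np
from scipy import stats

# Synthetic samples, for illustration only (not the housing data)
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=1000)
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Without the plot argument, probplot only returns the computed values:
# (theoretical quantiles, ordered data), (slope, intercept, r)
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"r for the normal sample: {r_normal:.3f}")
print(f"r for the skewed sample: {r_skewed:.3f}")
```

The skewed sample should give a noticeably lower `r`, which matches the curved shape you would see in its QQ-plot.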

### Exercise 1

Make a normal QQ-plot of the `sqft_living` variable.

**Answer.**

-------

## Regressions with logarithmically transformed variables

To run regressions in Python, we can use the `statsmodels` library (we imported it with the `import statsmodels.formula.api as smf` line at the beginning of this notebook). As you know from a previous case, you need to first define a formula and then pass it to the `ols` function. For example, to run the model:

$$
\widehat{price} = \beta_0 + \beta_1 sqft{\_}living + \varepsilon
$$

we can use the following code:

In [None]:
model = smf.ols(formula='price ~ sqft_living', data=houses).fit()

We can then inspect the results by calling the model's `summary()` method:

In [None]:
model.summary()
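If you only need a few numbers rather than the full table, the fitted results object also exposes them as attributes. The following is a minimal sketch on synthetic data with a known slope and intercept (the `toy` DataFrame and its columns are hypothetical, not part of the housing data), so you can see that the estimates land close to the true values.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known relationship: y = 3 + 2x + noise
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=500)})
toy["y"] = 3.0 + 2.0 * toy["x"] + rng.normal(scale=1.0, size=500)

toy_model = smf.ols(formula="y ~ x", data=toy).fit()

print(toy_model.params)    # estimated intercept and slope (a pandas Series)
print(toy_model.rsquared)  # R-squared of the fit
print(toy_model.pvalues)   # p-values of the coefficients
```

The same attributes are available on `model`, the regression of `price` on `sqft_living` fitted above.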

### Example 2

Run the following model and show its output:

$$
\widehat{price} = \beta_0 + \beta_1\log(sqft{\_}living) + \varepsilon
$$

**Answer.** Here our input variable `sqft_living` has been logarithmically transformed. To translate this model into code, we can make use of the `numpy` library's `log` function. For instance, the code below outputs $\log(sqft{\_}living)$:

In [None]:
np.log(houses["sqft_living"])

One option is to create a new variable in the `houses` DataFrame that is $\log(sqft{\_}living)$ and then run the regression as usual with `statsmodels`. Another (more convenient) approach is to directly incorporate `np.log()` in the formula itself, like this:

In [None]:
level_log_model = smf.ols(formula='price ~ np.log(sqft_living)', data=houses).fit()
level_log_model.summary()
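The two options described above should give the same fit. Here is a small sketch that checks this on synthetic data (the `toy` DataFrame and the column names `x`, `y`, and `log_x` are assumptions made for the illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends on the logarithm of x
rng = np.random.default_rng(2)
toy = pd.DataFrame({"x": rng.uniform(1, 100, size=300)})
toy["y"] = 5.0 + 4.0 * np.log(toy["x"]) + rng.normal(scale=0.5, size=300)

# Option 1: create the transformed column first
toy["log_x"] = np.log(toy["x"])
fit_column = smf.ols(formula="y ~ log_x", data=toy).fit()

# Option 2: transform inside the formula
fit_inline = smf.ols(formula="y ~ np.log(x)", data=toy).fit()

# Same coefficients, only the label of the slope differs
print(fit_column.params.to_numpy())
print(fit_inline.params.to_numpy())
```

The inline version avoids adding helper columns to the DataFrame, at the cost of longer coefficient labels in the summary table.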

### Exercise 2

Run the following model and show its output:

$$
\widehat{\log(price)} = \beta_0 + \beta_1\log(sqft{\_}living) + \varepsilon
$$

**Answer.**

-------

## Box-Cox $\lambda$ transformation

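For this, we will use `scipy.stats` again. The function you need is `boxcox()`. It takes the data you want to transform (for example, a `pandas` Series of strictly positive values) and, when called without a fixed `lmbda`, returns two things: the transformed values as a NumPy array and the value of $\lambda$ that maximizes the log-likelihood.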

### Example 3

Transform the `price` column using the Box-Cox criterion. Print both the transformed variable and the value of $\lambda$.

**Answer.** We can assign the two outputs of the `boxcox()` function to two variables, like this:

In [None]:
transformed_price, lambda_price = stats.boxcox(houses['price'])

This is the Box-Cox-transformed variable (a NumPy array):

In [None]:
transformed_price

And this is the $\lambda$:

In [None]:
lambda_price
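To see what `boxcox()` does with the estimated $\lambda$, recall that the Box-Cox transformation of a positive value $x$ is $(x^\lambda - 1)/\lambda$ when $\lambda \neq 0$ and $\log(x)$ when $\lambda = 0$. The sketch below applies that formula by hand to a synthetic right-skewed sample (not the housing data; all names are illustrative) and compares the result with the output of `scipy`.

```python
import numpy as np
from scipy import stats

# Synthetic positive, right-skewed sample, for illustration only
rng = np.random.default_rng(3)
sample = rng.lognormal(mean=2.0, sigma=0.7, size=2000)

transformed, lam = stats.boxcox(sample)

# Apply the Box-Cox formula by hand with the lambda that scipy estimated
if lam != 0:
    by_hand = (sample**lam - 1) / lam
else:
    by_hand = np.log(sample)

print(f"estimated lambda: {lam:.3f}")
print(np.allclose(transformed, by_hand))
print(f"skewness before: {stats.skew(sample):.2f}, after: {stats.skew(transformed):.2f}")
```

Because the sample is log-normal, the estimated $\lambda$ should be close to 0, which is the case where Box-Cox reduces to a plain logarithm.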

### Exercise 3

Transform the `sqft_living` column using the Box-Cox criterion. Print both the transformed variable and the value of $\lambda$.

**Answer.**

-------

## Categorical variables and squared variables

To include a categorical variable in a regression in `statsmodels`, you need to use this syntax in the model formula:

~~~python
C(the_variable)
~~~

`C` stands for "categorical". If you want to add a squared variable, you have to use this syntax (recall that to take a number `x` to the `y`-th power in Python you write `x**y`):

~~~python
I(the_variable**2)
~~~

This syntax comes from the [**`patsy`**](https://patsy.readthedocs.io/en/latest/) package, which is used by `statsmodels` under the hood to turn formula strings into the input matrices of the model. The [**`I()`**](https://patsy.readthedocs.io/en/latest/formulas.html#the-formula-language) (for "identity") function tells `patsy` to evaluate everything inside the parentheses as ordinary Python code rather than as formula syntax. This matters because operators such as `+`, `*`, and `**` have special meanings inside a formula. Therefore, `I(the_variable**2)` means that Python should first compute `the_variable**2` and then pass the result as one of the input variables to the linear model.
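One way to see the effect of `C()` and `I()` is to look at the columns that end up in the model. The sketch below fits two models on synthetic data (the `toy` DataFrame and its columns `x`, `group`, and `y` are hypothetical) and prints the column names through `model.exog_names`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, for illustration only
rng = np.random.default_rng(4)
toy = pd.DataFrame({
    "x": rng.uniform(-3, 3, size=400),
    "group": rng.integers(0, 3, size=400),  # categories coded as 0, 1, 2
})
toy["y"] = (1.0 + 0.5 * toy["x"] ** 2 + 2.0 * (toy["group"] == 2)
            + rng.normal(scale=0.3, size=400))

# Without C(), group enters as a single numeric column;
# with C(), it is expanded into dummy columns
numeric_fit = smf.ols(formula="y ~ group + I(x**2)", data=toy).fit()
categorical_fit = smf.ols(formula="y ~ C(group) + I(x**2)", data=toy).fit()

print(numeric_fit.model.exog_names)
print(categorical_fit.model.exog_names)
```

With three categories, `C(group)` contributes two dummy columns, because the first category is absorbed into the intercept as the baseline.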

### Example 4

Run the following model and show its output:

$$
\widehat{price} = \beta_0 + \beta_1 sqft{\_}living + \beta_2 waterfront + \beta_3 lat^2 + \varepsilon
$$

**Answer.** Here we have two variable transformations:

* $waterfront$ (a categorical variable): `C(waterfront)`
* $lat^2$: `I(lat**2)`

The code should be:

In [None]:
model_transformed_vars = smf.ols(formula='price ~ sqft_living + C(waterfront) + I(lat**2)', data=houses).fit()
model_transformed_vars.summary()

### Exercise 4

Run the following model and show its output:

$$
\widehat{price} = \beta_0 + \beta_1 sqft{\_}living + \beta_2 waterfront + \beta_3 lat^2 + \beta_4 view + \beta_5 yr{\_}built^2 + \varepsilon
$$

**Hint:** `view` is a categorical variable.

**Answer.**

-------

## Interaction terms

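To model interaction effects in the `patsy` syntax, you combine the variables with the `*` operator. Note that `variable_1 * variable_2` is shorthand for both main effects plus their interaction (`variable_1 + variable_2 + variable_1:variable_2`); the `:` operator adds the interaction term alone: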

~~~python
variable_1 * variable_2
~~~

### Example 5

Run the following model and show its output:

$$
\widehat{price} = \beta_0 + \beta_1 sqft{\_}living + \beta_2 waterfront + \beta_3 lat + \beta_4 (waterfront \times lat) + \varepsilon
$$

**Answer.** We add the interaction term with this code (wrapping `waterfront` in `C()` because it is a categorical variable):

~~~python
C(waterfront) * lat
~~~

The complete code looks like this:

In [None]:
model_interaction = smf.ols(formula='price ~ sqft_living + C(waterfront) + lat + C(waterfront) * lat', data=houses).fit()
model_interaction.summary()
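In formula syntax, `a * b` already includes both main effects and the interaction, so listing the main effects again does not change the model. The following sketch, on synthetic data with hypothetical columns `x`, `flag`, and `y`, fits the shorthand and the fully spelled-out formula and compares them.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a true interaction between x and flag
rng = np.random.default_rng(5)
toy = pd.DataFrame({
    "x": rng.uniform(0, 10, size=400),
    "flag": rng.integers(0, 2, size=400),
})
toy["y"] = (1.0 + 2.0 * toy["x"] + 3.0 * toy["flag"]
            + 1.5 * toy["x"] * toy["flag"]
            + rng.normal(scale=0.5, size=400))

# Shorthand: main effects plus interaction
short_fit = smf.ols(formula="y ~ C(flag) * x", data=toy).fit()
# Spelled out: the ':' operator is the interaction term alone
long_fit = smf.ols(formula="y ~ C(flag) + x + C(flag):x", data=toy).fit()

print(short_fit.model.exog_names)
print(long_fit.model.exog_names)
```

Both fits should contain the same four columns and the same coefficients. Use `:` on its own only when you deliberately want an interaction without the corresponding main effects.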

### Exercise 5

Run the following model and show its output:

$$
\widehat{price} = \beta_0 + \beta_1 \log(sqft{\_}living) + \beta_2 view + \beta_3 (\log(sqft{\_}living) \times view) + \varepsilon
$$

**Answer.**

-------

## Attribution

"House Sales in King County, USA", August 25, 2016, harlfoxem, CC0 Public Domain, https://www.kaggle.com/harlfoxem/housesalesprediction