# Problem Session 2

In this problem session we practice our skills with:

* Exploratory Data Analysis
* Simple linear regression
* Multiple linear regression
* k-nearest neighbors regression
* k-fold cross-validation

In [None]:
## We first load in packages we will need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
sns.set_style("whitegrid")

#### 1. Introducing the data

Our data concerns median house prices for California districts, derived from the 1990 census.

This dataset was found on Kaggle.com, <a href="https://www.kaggle.com/datasets/camnugent/california-housing-prices/data">https://www.kaggle.com/datasets/camnugent/california-housing-prices/data</a>.

##### a. 

First load the data for this problem. It is stored in the file `housing.csv` in the `data` folder of the repository. After loading the data look at the first five rows of the dataset. Then run `housing.info()`.  Are there any missing values?

In [None]:
housing = 

Yes, `total_bedrooms` has some missing values.

##### b. 

There are future lecture notebooks that cover ways to <i>impute</i> missing values, but for this notebook you will simply remove the missing values. 

Use `dropna`, <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html">https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html</a> to get a version of the data set that has had the missing values removed.
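
As a rough illustration of how `dropna` behaves, here is a minimal sketch on a made-up toy frame. The name `toy` and its values are assumptions for illustration only and are not taken from `housing.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# By default dropna returns a new DataFrame with every row
# that contains at least one missing value removed
toy_clean = toy.dropna()

print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Note that `dropna` does not modify the original frame unless you reassign the result.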


##### c.

The column `median_house_value` currently contains strings instead of floats.  Before doing any modeling you will have to clean the data a little bit.

Write a function `clean_column` which passes the indicated tests. 

Then use `.apply`, <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html">https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html</a> to apply `clean_column` to `median_house_value`.
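
One possible approach, sketched under the assumption that every entry looks like `'$432,425.0'` (a leading dollar sign and comma thousands separators), is to strip those characters and parse the rest as a float. The Series `toy_prices` is a made-up stand-in for the real column.

```python
import pandas as pd

def clean_column(value):
    """Convert a string such as '$432,425.0' to the float 432425.0."""
    # Remove the dollar sign and the thousands separators, then parse
    return float(value.replace("$", "").replace(",", ""))

# Toy stand-in for the median_house_value column
toy_prices = pd.Series(["$432,425.0", "$15,326.0"])

# Series.apply calls clean_column once per entry
cleaned = toy_prices.apply(clean_column)
print(cleaned)
```

If the real column contains other formats, this sketch would need to be adjusted.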

In [None]:
# Define your function below

assert clean_column('$432,425.0') == 432425.0
assert clean_column('$15,326.0') == 15326.0

In [None]:
# Use the function to clean the median_house_value column

### Predictive Model

In the next couple of problem session notebooks you will build a series of models to predict the median house value of a given California district.

#### 2. Train test split

The first step in predictive modeling is performing a train test split. Perform a train test split on these data, setting aside $20\%$ of the data as a test set. Choose a `random_state` so your results are reproducible.

As a refresher you can use `sklearn`'s `train_test_split` function: 

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html</a>.
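
For reference, here is a minimal sketch of an 80/20 split on a made-up frame. The names `toy`, `toy_train`, `toy_test` and the seed `216` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 100 rows (values are made up)
toy = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2.0})

# test_size=0.2 sets aside 20% of the rows;
# random_state fixes the shuffle so the split is reproducible
toy_train, toy_test = train_test_split(
    toy.copy(),
    test_size=0.2,
    shuffle=True,
    random_state=216,
)

print(toy_train.shape, toy_test.shape)
```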

#### 3. Exploratory data analysis (EDA)

After the train test split we can work on some exploratory data analysis. Here is where we start to look at the data and see if we can generate any modeling ideas or hypotheses. You will make a series of plots and learn a modeling trick that should improve any models we make.

##### a. 

Use `seaborn`'s `pairplot`, <a href="https://seaborn.pydata.org/generated/seaborn.pairplot.html">https://seaborn.pydata.org/generated/seaborn.pairplot.html</a> to plot `median_house_value` against `housing_median_age`, `total_rooms`, `total_bedrooms`, `population`, `households`, and `median_income`. Shell code is provided for you below.

In [None]:
# for your convenience I have copied the feature names here.
# you could instead get them programmatically by slicing the housing.columns array

features = ['median_house_value', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population','households', 'median_income']

sns.pairplot(housing_train,
                y_vars = ,
                x_vars = ,
                height = 5,
                diag_kind = None)

plt.show()

##### c.

Do any of the previous relationships look linear? Notice anything else interesting about the data?

The relationship with `median_income` does appear roughly linear.  It is hard to tell with some of the other variables.

I notice that `median_house_value` seems to have been truncated at $\$500000$.  This is a real problem for linear regression!  It will severely bias our estimates.

The easiest way to deal with this is to discard all of these rows.

A more complicated way would be to try and utilize those rows using something like a [Tobit Model](https://en.wikipedia.org/wiki/Tobit_model).

Let's take the easy way out for now.  This gives us another independent test of our model:  after training our model on the rest of the data we can see whether it predicts that those rows have a median value above $\$500000$.
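
Setting rows aside like this can be done with a boolean mask. The sketch below uses a made-up frame and assumes the cap is exactly `500000.0`; using `>=` keeps it working if the capped rows are stored slightly above that value.

```python
import pandas as pd

# Toy frame (values are made up); two rows sit at the assumed cap
toy = pd.DataFrame({
    "median_income": [2.5, 8.1, 3.7, 9.4],
    "median_house_value": [150000.0, 500000.0, 210000.0, 500000.0],
})

cap = 500000.0  # assumed truncation value

# Boolean mask marking the truncated rows
at_cap = toy["median_house_value"] >= cap

toy_truncated = toy[at_cap].copy()   # rows set aside
toy_kept = toy[~at_cap].copy()       # rows used for modeling

print(len(toy_truncated), "truncated,", len(toy_kept), "kept")
```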

In [None]:
# Set aside all of the rows for which the median house value is 500000
housing_truncated = 
housing_train_truncated = 
housing_test_truncated = 

# Redefine these to only include the non-truncated examples
housing = 
housing_train = 
housing_test = 

##### d.

Another part of EDA is calculating descriptive statistics.

One statistic of interest to us in this situation is the <i>Pearson correlation coefficient</i>. For two variables $x$ and $y$ with $n$ observations each, the Pearson correlation is given by:

$$
r = \frac{\sum_{i=1}^n \left( x_i - \overline{x} \right) \left( y_i - \overline{y}  \right)}{\sqrt{\sum_{i=1}^n \left(x_i - \overline{x}\right)^2 \sum_{i=1}^n \left(y_i - \overline{y} \right)^2}} = \frac{\text{cov}\left(x, y\right)}{\sigma_x \sigma_y},
$$

where $x_i$ is the $i^\text{th}$ observation, $\overline{x} = \sum_{i=1}^n x_i/n$, $\text{cov}\left( x, y \right)$ is the covariance between $x$ and $y$, and $\sigma_x$ denotes the standard deviation of $x$.

$r \in [-1,1]$ gives a sense of the strength of the linear relationship between $x$ and $y$. The closer $|r|$ is to $1$, the stronger the linear relationship between $x$ and $y$. The sign of $r$ determines the direction of the relationship, with $r < 0$ meaning a line with a negative slope and $r > 0$ a line with a positive slope.

Calculate the correlation between `median_house_value` and the columns you have previously plotted.

<i>Hint: Either <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html">https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html</a> or <a href="https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html">https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html</a> should work.</i>
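
To connect the formula for $r$ to the library calls, here is a small sketch on synthetic data that computes $r$ directly from the definition and compares it with `np.corrcoef`. The arrays `x` and `y` are made up and unrelated to the housing data.

```python
import numpy as np

# Synthetic data with a positive linear trend plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(size=200)

# Deviations from the sample means
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson correlation straight from the definition
r_manual = np.sum(x_dev * y_dev) / np.sqrt(np.sum(x_dev**2) * np.sum(y_dev**2))

# np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```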

##### e.

Based on your EDA, which feature do you think would best predict `median_house_value` in a simple linear regression model?

WARNING:  while using feature/outcome correlation is a reasonable choice for feature selection in a simple linear regression model, it is **not** a good choice for multiple linear regression.  [This stats.stackexchange post](https://stats.stackexchange.com/a/139031/97124) explains why!

##### f.

We have not yet investigated *spatial* variation in the housing prices.

Use [https://plotly.com/python/mapbox-density-heatmaps/](https://plotly.com/python/mapbox-density-heatmaps/) as inspiration and make a heatmap of `median_house_value`.

Does it seem like including the latitude and longitude somehow in our model would be helpful?

#### 4. Modeling

Now you will build some preliminary models for this data set.

##### a.

When doing predictive modeling it is good practice to have a <i>baseline model</i> which is a simple "model" solely for comparison purposes. These are not, typically, complex or good models, but they are important reference points to give us a sense of how well our models are actually performing.

A standard regression model baseline is to just predict the average value of $Y$ for any value of $X$. In this setting that model looks like this:

$$
\text{Baseline Model: } \ \ \ \ \text{Median House Value} = \mathbb{E}\left(\text{Median House Value}\right) + \epsilon,
$$

where $\epsilon$ is i.i.d. and normally distributed.

Write some code to estimate $\mathbb{E}\left(\text{Median House Value}\right)$ using the training set.
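
As a sketch of the idea on made-up numbers, the baseline "fit" is just the training mean, and its prediction for any holdout row is that same constant. The arrays below are illustrative only, and the RMSE is computed by hand with `numpy`.

```python
import numpy as np

# Made-up training and holdout targets
y_train = np.array([100.0, 200.0, 300.0])
y_holdout = np.array([150.0, 250.0])

# "Fitting" the baseline is just taking the training mean
baseline_mean = y_train.mean()

# Predict that same constant for every holdout observation
baseline_pred = np.full(len(y_holdout), baseline_mean)

# Root mean squared error of the constant prediction
baseline_rmse = np.sqrt(np.mean((y_holdout - baseline_pred) ** 2))

print(baseline_mean, baseline_rmse)
```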

Below you will use cross-validation to compare one simple linear regression model, one multiple linear regression model, and one kNN model which uses the spatial data.

$$
\begin{align*}
\text{Baseline Model}:& \ \text{Median House Value} = \mathbb{E}\left(\text{Median House Value}\right) + \epsilon\\
\text{SLR Model}:& \ \text{Median House Value} = \beta_0 + \beta_1 \left( \text{Median Income} \right) + \epsilon\\
\text{MLR Model}:& \ \text{Median House Value} = \beta_0 + \beta_1 \left(\text{Median Income}\right)  + \beta_2 \left(\text{Households}\right) + \epsilon\\
\text{kNN Model}:& \ \text{Use $k$ nearest neighbors regression on latitude and longitude with $k = 10$}
\end{align*}
$$

We will attempt hyperparameter tuning on $k$ in a later problem session, but just stick with $k=10$ for now.
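
To make the kNN model concrete, here is a sketch on synthetic two-dimensional "coordinates". With the default settings of `KNeighborsRegressor` (uniform weights, Euclidean distance), the prediction at a point is the plain average of the targets of its $k$ nearest training points. The arrays `coords` and `values` are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic 2D locations and a target that varies with the first coordinate
rng = np.random.default_rng(0)
coords = rng.uniform(0, 1, size=(200, 2))
values = 100.0 * coords[:, 0] + rng.normal(scale=1.0, size=200)

# Fit kNN regression with k = 10
knn = KNeighborsRegressor(n_neighbors=10)
knn.fit(coords, values)

# Predict at a single query location
query = np.array([[0.5, 0.5]])
pred = knn.predict(query)

# Manual check: average the targets of the 10 closest training points
dist = np.linalg.norm(coords - query, axis=1)
nearest = np.argsort(dist)[:10]
manual = values[nearest].mean()

print(pred[0], manual)
```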

##### b.
In this problem practice fitting just the MLR model using the training set and `sklearn`'s `LinearRegression` model, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html">https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html</a>.
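
For orientation, here is a sketch of the instantiate/fit pattern on synthetic data with two features. The true intercept and slopes (`1.0`, `2.0`, `-3.0`) are made up so that the fitted values can be checked against them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 1 + 2*x1 - 3*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Instantiate the model object, then fit it to features and target
model = LinearRegression()
model.fit(X, y)

# intercept_ estimates beta_0; coef_ holds one slope per feature column
print(model.intercept_, model.coef_)
```

With a `pandas` DataFrame, the same pattern applies after selecting the feature columns with a list of column names.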

In [None]:
# import the linear regression model

# instantiate a model object
mlr_model =

# Fit the model to the training data


In [None]:
mlr_model.intercept_

In [None]:
mlr_model.coef_

##### c.

In this problem you will try to implement $5$-fold cross-validation (CV) to compare these three models and the baseline model to see which one has the lowest average cross-validation root mean squared error (RMSE).

Because this may be your first time implementing CV, some of the code will be filled in for you.
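
Before filling in the cells below, it may help to see the overall shape of a cross-validation loop. This is a sketch on synthetic `numpy` arrays comparing only a baseline and one linear regression; the data, the seed `216`, and the hand-computed RMSE are assumptions for illustration. With a DataFrame you would select rows with `.iloc[train_index]` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data with one feature
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

kfold = KFold(n_splits=5, shuffle=True, random_state=216)

# Row 0: baseline, row 1: linear regression; one column per split
rmses = np.zeros((2, 5))

for i, (train_index, test_index) in enumerate(kfold.split(X)):
    # cv training set and cv holdout set
    X_tt, y_tt = X[train_index], y[train_index]
    X_ho, y_ho = X[test_index], y[test_index]

    # Baseline: predict the cv training mean everywhere
    baseline_pred = np.full(len(y_ho), y_tt.mean())
    rmses[0, i] = np.sqrt(np.mean((y_ho - baseline_pred) ** 2))

    # Linear regression fit on the cv training set only
    slr = LinearRegression().fit(X_tt, y_tt)
    rmses[1, i] = np.sqrt(np.mean((y_ho - slr.predict(X_ho)) ** 2))

avg_rmse = rmses.mean(axis=1)
print(avg_rmse)
```

The key point is that each model is refit inside the loop using only the cv training rows, so the holdout rows never influence the fit.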

In [None]:
## import KFold and KNeighborsRegressor here.

## import root_mean_squared_error


In [None]:
## Make a KFold object
## remember to set a random_state and set shuffle = True
num_splits = 5
num_models = 4
kfold = 

## This array will hold the rmse for each model and split
rmses = np.zeros((num_models, num_splits))

## sets a split counter
i = 0

## loop through the kfold here
for train_index, test_index in     :
    ## cv training set
    housing_tt = 
    
    ## cv holdout set
    housing_ho = 
    
    ## "Fit" and get ho rmse for the baseline model.
    ## No need to use an sklearn function:  just get the mean.  
    ## baseline_pred should be a numpy array with the same number of elements as housing_ho

    baseline_pred = 
    
    rmses[0, i] = 
    
    ## Fit and get ho rmse for slr model
    slr = 
    
    
    ## Fit and get ho rmse for mlr model
    mlr = 
    

    ## Fit and get ho rmse for the spatial model
    knn = 
    
    i = i + 1

In [None]:
## Find the avg cv rmse for each model here
print(f"Baseline Avg. CV RMSE: {np.mean(rmses[0,:])} and STD: {np.std(rmses[0,:])}")
print(f"SLR Avg. CV RMSE: {np.mean(rmses[1,:])} and STD: {np.std(rmses[1,:])}")
print(f"MLR Avg. CV RMSE: {np.mean(rmses[2,:])} and STD: {np.std(rmses[2,:])}")
print(f"Spatial Avg. CV RMSE: {np.mean(rmses[3,:])} and STD: {np.std(rmses[3,:])}")


##### d.

Which model had the lowest average cross-validation root mean squared error?

Discuss the meaning of the STD in this context.

##### e.

Train the simple linear regression model on the full training set and predict on the truncated dataset.  Does the model predict that the median house values are in excess of $\$500000$?

That's it for this notebook. In the next couple of regression based notebooks we will build additional models for this data set.

--------------------------

This notebook was written for the Erdős Institute Data Science Boot Camp by Steven Gubkin.

Please refer to the license in this repo for information on redistribution.