In [None]:
import warnings

warnings.filterwarnings("ignore")

# Learning goals
After today's lesson you should be able to:
- Use cross-validation
- Find the best model for classification and regression problems based on tuning hyperparameters and calculating performance scores

In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")

from pysal.lib import weights

import contextily


## 0.1 Import the data

For this exercise, let's use the San Diego Airbnb dataset again. As a reminder, this dataset contains intrinsic house characteristics, both continuous (number of beds, as in `beds`) and categorical (type of rental or, in Airbnb jargon, property group, as in the series of `pg_X` binary variables), as well as variables that explicitly refer to the location and spatial configuration of the dataset (e.g., distance to Balboa Park, `d2balboa`, or neighborhood ID, `neighborhood_cleansed`).


Our aim here is to make two kinds of predictions: 
- **`log_price`** (Regression): We want to use other features to predict the log of the price of each Airbnb.
- **`coastal`** (Classification): We also want to use our features to see if we can predict whether an Airbnb is coastal. 

In [None]:
db = gpd.read_file("https://www.dropbox.com/s/zkucu7jf1xug869/regression_db.geojson?dl=1")


In [None]:
db.info()

In [None]:
db.head()

Again, notice here that we have: 
- **Discrete variables** (number of bedrooms, beds, baths)
- **Dummy variables** (whether there is a pool, whether near the coast, room type)

# 1. `log_price`
Let's start off with predicting the price of the Airbnb. 


In [None]:
y = db['log_price']

## We'll make our X, independent variables, the "kitchen sink" of all of our other variables for now. 
## I'm using all the variables we have available with the exception of `neighborhood`, which we have to turn into dummy variables in a second. 

X = db[['accommodates', 'bathrooms', 'bedrooms', 'beds', 'pool',
       'd2balboa', 'coastal', 'pg_Apartment',
       'pg_Condominium', 'pg_House', 'pg_Other', 'pg_Townhouse',
       'rt_Entire_home/apt', 'rt_Private_room', 'rt_Shared_room']]

## 1.0 Feature engineering
Feature engineering is the process of creating new variables from the ones you already have. Common feature engineering tasks include:
- Creating dummy variables from categorical variables
- Creating interaction terms between variables
- Creating polynomial terms from variables
- Creating log or square root terms from variables
- Creating lagged variables from time series data or lagged spatial variables 
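
To make the first few items of this list concrete, here is a minimal sketch on a made-up table. The `toy` dataframe and its values are hypothetical and only mimic the kind of columns in the Airbnb data; they are not taken from it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings; the column names only mimic the Airbnb data.
toy = pd.DataFrame({
    "bedrooms": [1, 2, 3, 2],
    "bathrooms": [1.0, 1.5, 2.0, 1.0],
    "price": [80.0, 120.0, 250.0, 95.0],
    "neighborhood": ["La Jolla", "Downtown", "La Jolla", "North Park"],
})

# Dummy variables: one indicator column per category
dummies = pd.get_dummies(toy["neighborhood"], prefix="nb")

# Interaction term between two variables
toy["bed_x_bath"] = toy["bedrooms"] * toy["bathrooms"]

# Polynomial (squared) term
toy["bedrooms_sq"] = toy["bedrooms"] ** 2

# Log transform of a skewed, strictly positive variable
toy["log_price"] = np.log(toy["price"])

# Combine the numeric features with the dummies, column-wise
features = pd.concat(
    [toy.drop(columns=["neighborhood", "price"]), dummies], axis=1
)
print(features.head())
```
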


In [None]:
neighborhood_dummies = pd.get_dummies(db['neighborhood'])

In [None]:
neighborhood_dummies.head()

In [None]:
## Here, I want to concatenate my X and neighborhood_dummies into one dataframe.
## I need to tell pd.concat() to either add new columns (axis=1) or add new rows (axis=0).
## The default is axis=0, ie new rows, so I need to specify axis=1.
X = pd.concat([X, neighborhood_dummies],axis=1)

In [None]:
X.head()

Let's also create a new column that is the KNN spatial lag for the 'neighborhood context'. Here, I'm going to use the columns: 
- `pool`, a binary (0, 1) variable for whether the Airbnb has a pool
- `pg_House`, a binary (0, 1) variable for whether the Airbnb is a house

I will choose K=20 to get the 20 closest neighboring Airbnbs. With binary (unstandardized) KNN weights, each spatial lag is a count between 0 and 20: of the 20 closest Airbnbs, how many have pools and how many are houses.
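
To see what such a lag computes, here is a rough NumPy-only sketch on random points. It is not how `pysal` implements it; it assumes plain Euclidean distances and binary weights, and the arrays `coords` and `has_pool` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 random points and a binary attribute ("has a pool")
coords = rng.uniform(0, 10, size=(200, 2))
has_pool = rng.integers(0, 2, size=200)

k = 20

# Pairwise Euclidean distances between all points
diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

# A point should not count as its own neighbor
np.fill_diagonal(dist, np.inf)

# Indices of the k nearest neighbors of each point
nearest = np.argsort(dist, axis=1)[:, :k]

# Spatial lag with binary weights: sum of the attribute over the k neighbors
pool_lag_toy = has_pool[nearest].sum(axis=1)

print(pool_lag_toy.min(), pool_lag_toy.max())
```

If the weights were row-standardized instead, the lag would be the share of neighbors with a pool (between 0 and 1) rather than a count.
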

In [None]:
knn = weights.KNN.from_dataframe(db, k=20)

In [None]:
pool_lag = weights.lag_spatial(knn, db['pool'])
house_lag = weights.lag_spatial(knn, db['pg_House'])

The chart below shows that neighboring airbnbs mostly don't have a pool. 

In [None]:
plt.hist(pool_lag)

But many more neighboring Airbnbs are houses.

In [None]:
plt.hist(house_lag)

Add these two new features to our feature matrix `X`.

In [None]:
X['pool_lag'] = pool_lag
X['house_lag'] = house_lag


## 1.1 Create our Train-Test Split
We almost always start off with splitting our data into our **train** and **test** sets. 
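
As a quick illustration of what the split does, here is a small sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical). It relies on `train_test_split` holding out 25% of the rows by default.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# No test_size given, so the default 75-25 split is used
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

print(X_tr.shape, X_te.shape)

# Every row ends up in exactly one of the two sets
print(len(set(y_tr) & set(y_te)))
```
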

In [None]:
from sklearn.model_selection import train_test_split

## Let's use the default split for now, which is 75-25 train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,)


## 1.2 Predict the data
Here, let's use a decision tree regressor to predict the log price.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

from sklearn.metrics import accuracy_score

from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error



In [None]:
model = DecisionTreeRegressor()

model.fit(X_train, y_train)

## 1.3 Cross-validation
Now we use k-fold cross-validation to fit and score our model several times, each time holding out a different part of the training data.
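
To show what happens under the hood, here is a minimal sketch that runs 5-fold cross-validation by hand and compares it with `cross_val_score`. The data come from `make_regression` and are synthetic, and the names `X_toy`, `y_toy`, and `fold_model` are only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (not the Airbnb data)
X_toy, y_toy = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)

kf = KFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in kf.split(X_toy):
    # Fit on four folds, score (R^2) on the held-out fold
    fold_model = DecisionTreeRegressor(max_depth=4, random_state=0)
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[val_idx], y_toy[val_idx]))

# The same procedure in one call
auto_scores = cross_val_score(
    DecisionTreeRegressor(max_depth=4, random_state=0), X_toy, y_toy, cv=kf
)

print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3))
```

Fixing `random_state` on the tree makes the two sets of scores directly comparable.
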

In [None]:
## The default scoring metric for DecisionTreeRegressor is R^2, so we can use cross_val_score() to get the R^2 for each fold.
scores = cross_val_score(model, X_train, y_train, cv=5)
scores

As you can see, there is some variation here. 

In [None]:
scores.mean()

Our average R^2 is about 45.5%. 

## 1.4 Test score 
Let's see how well our model does on the test data. 

In [None]:
# evaluate the model on the held-out test set
ypred_rf = model.predict(X_test)
print("R^2 is:", r2_score(y_test, ypred_rf))
print("Mean absolute error is:", mean_absolute_error(y_test, ypred_rf))
print("Mean squared error is:", mean_squared_error(y_test, ypred_rf))

So the model performed slightly better on the test set, with an R^2 of about 49%.
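
As a reference for what the three printed metrics measure, here is a small sketch that computes them by hand on made-up numbers and checks them against `sklearn`. The arrays `y_true` and `y_hat` are hypothetical. Keep in mind that with `log_price` as the target, the errors are in log units.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([4.0, 5.0, 6.0, 7.0])
y_hat = np.array([4.5, 5.0, 5.0, 7.5])

residuals = y_true - y_hat

# Mean absolute error: average size of the errors
mae = np.abs(residuals).mean()

# Mean squared error: average squared error, penalizes large misses more
mse = (residuals ** 2).mean()

# R^2: 1 minus (residual sum of squares / total sum of squares)
r2 = 1 - (residuals ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()

print(mae, mean_absolute_error(y_true, y_hat))
print(mse, mean_squared_error(y_true, y_hat))
print(r2, r2_score(y_true, y_hat))
```
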

## 1.5 Tuning our hyperparameters

### 1.5.1 Tweaking our trees
Let's say we think that, to get a better score, we need to limit the maximum tree depth. Let's test this.

In [None]:
trees = np.linspace(1,20,20).astype(int)
trees

The below might take a bit of time to run. (uff it took 3 minutes and 14 seconds for me!)

In [None]:
## Let's create two empty lists to hold our scores
mean_scores_train = []
mean_scores_test = []

## Now, let's loop through our maximum depths and get the scores for each
for t in trees:
    model = DecisionTreeRegressor(max_depth=int(t))
    model.fit(X_train, y_train)
    ## Mean 5-fold cross-validation R^2 on the training data
    scores_train = cross_val_score(model, X_train, y_train, cv=5)

    ## R^2 on the held-out test data
    ypred_rf = model.predict(X_test)
    score_test = r2_score(y_test, ypred_rf)

    mean_scores_train.append(scores_train.mean())
    mean_scores_test.append(score_test)



In [None]:
plt.plot(trees,mean_scores_train,color='blue',label='Train')
plt.plot(trees,mean_scores_test,color='red',label='Test')
plt.legend()

Two insights emerge: 
1. Setting a maximum depth generally produced better scores than the default of not setting any depth (just letting the tree split until no further split is possible, e.g. because every sample in a leaf has the same target value).
2. Beyond a maximum depth of about 5, performance starts to decrease, which suggests that deeper trees overfit the training data.
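
A more compact way to produce this kind of curve is `validation_curve` from `sklearn.model_selection`, which handles the loop over parameter values and the cross-validation in one call. The sketch below uses synthetic data from `make_regression`, so the depth it picks says nothing about the Airbnb data; `X_toy`, `y_toy`, and `depths` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (not the Airbnb data)
X_toy, y_toy = make_regression(
    n_samples=400, n_features=8, noise=25.0, random_state=0
)
depths = np.arange(1, 21)

# One row per depth, one column per cross-validation fold
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0),
    X_toy,
    y_toy,
    param_name="max_depth",
    param_range=depths,
    cv=5,
)

mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)

# Depth with the highest mean validation R^2
best_depth = depths[np.argmax(mean_val)]
print("Depth with the best mean validation R^2:", best_depth)
```

Here `mean_train` is the score on the folds the tree was fitted on, so it typically keeps rising with depth, while `mean_val` is the score on the held-out folds.
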

## 1.6 Grid Search
But what about other parameters? 

`sklearn` has a way of searching over all the hyperparameters you'd like to tune. Let's say, here, we want to test the following hyperparameters. You can see the full list for this algorithm in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html): 
- The loss criterion: `criterion`, which can be {“squared_error”, “friedman_mse”, “absolute_error”, “poisson”}
- The maximum tree depth: `max_depth`
- The minimum number of samples required to split an internal node: `min_samples_split` (must be at least 2)

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'criterion':['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
              'max_depth': np.linspace(1,20,20).astype(int),
              ## min_samples_split must be at least 2, so the range starts at 2
              'min_samples_split': np.linspace(2,100,20).astype(int)}

grid = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5, verbose=1)

~~This might take a while since we're doing 8000 model fits!~~ Don't run this now as it takes 80 minutes. You can try to decrease the 

In [None]:
grid.fit(X, y)

In [None]:
sorted(grid.cv_results_.keys())

In [None]:
grid.best_params_

`grid.best_params_` got me `{'criterion': 'poisson', 'max_depth': 6, 'min_samples_split': 89}`

## Q.1 Classification
Instead of predicting the log price, now predict whether an Airbnb is coastal. Using the same steps as above (feature engineering, train-test split, cross-validation, hyperparameter tuning), **select one hyperparameter** to optimize for a classification model. It doesn't have to be a Decision Tree! (10 pts)