# New Baseline Regression Models with Transformed Data

##### Modeling Step 3

### Notebook Summary:

#### Objective: improve the new baseline model for predicting AirBnB listing prices, either by trying Nonlinear Regression models or by adding 200 ratio features computed from existing features and then performing Feature Selection with Recursive Feature Elimination

* Nonlinear attempts consist of first modeling only with interaction features, and then modeling with quadratic terms 
* Features to be used to calculate ratios were selected arbitrarily, focusing on distance features (e.g. distance to ocean, distance to closest city recreation site, etc.) and count features (e.g. count of events within 1 KM vs count of parks within 1 KM) 
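The ratio features themselves (`X_ratios`) are built in a different notebook, so the code is not shown here. As a rough illustration only, the idea can be sketched as below. The column names, the `add_ratio_features` helper and the `eps` guard against division by zero are all assumptions made for this example, not the code that produced `X_ratios`.

```python
import numpy as np

def add_ratio_features(columns, pairs, eps=1e-9):
    """Return a dict of new ratio columns named '<num>_over_<den>'.

    columns: dict mapping column name -> 1-D numpy array
    pairs:   iterable of (numerator_name, denominator_name)
    eps:     small constant added to the denominator to avoid division by zero
    """
    ratios = {}
    for num, den in pairs:
        ratios["%s_over_%s" % (num, den)] = columns[num] / (columns[den] + eps)
    return ratios

# Hypothetical columns; the real column names are not shown in this notebook.
columns = {
    "dist_ocean_km": np.array([1.0, 4.0, 10.0]),
    "dist_rec_site_km": np.array([2.0, 2.0, 5.0]),
    "events_1km": np.array([3.0, 0.0, 8.0]),
    "parks_1km": np.array([1.0, 2.0, 0.0]),
}
pairs = [("dist_ocean_km", "dist_rec_site_km"), ("events_1km", "parks_1km")]
ratio_cols = add_ratio_features(columns, pairs)
```

Count features are often zero, which is why the sketch pads the denominator. A zero count then yields a very large ratio rather than `inf`, so some capping or rescaling would probably still be needed before modeling.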

#### Conclusions: 
* Nonlinear attempts are highly computationally expensive and perform poorly in terms of model accuracy 
* Adding ratio features provides a small improvement to the baseline model picked in the "new_baseline_model" notebook 
* Performing feature selection with RFECV on the expanded dataset identifies 147 features as the optimal set that minimizes validation RMSE, but the new performance metrics show slightly lower accuracy. Despite that, we prefer to use this smaller set of features as it provides a much simpler model, i.e. a model with roughly 100 fewer features
* ##### Linear Regression with the expanded but then "selected" dataset is therefore our new best model

#### Next Steps: 
###### In the regularization notebook we use Lasso, Ridge and Elastic Net Regression models to simplify the model further without giving up the accuracy obtained so far

In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import matplotlib.cm as cm
%matplotlib inline

In [2]:
from sklearn import linear_model
from sklearn.metrics import r2_score, mean_squared_error,mean_absolute_error
from sklearn.model_selection import KFold,cross_val_predict,cross_val_score, cross_validate, train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, LabelBinarizer, PolynomialFeatures, MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn.neighbors import LocalOutlierFactor, KNeighborsRegressor
from sklearn.feature_selection import RFE, f_regression, RFECV
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

In [3]:
import sys
sys.path.append('./../lib')
from airbnb_modeling import detect_feature_importance, scale_data, normalize_data, eval_metrics, plot_residuals, plot_predictions
from parse_methods import parse_columns
from airbnb_modeling import detect_interactions, add_interactions, map_variable, plot_rmse_instances,plot_rmse_features, plot_accuracy_instances



In [4]:
%store -r scores_lin 
%store -r scores_tree 
%store -r scores_sv_reg 
%store -r scores_neigh_reg

In [5]:
%store -r best_model_svr 
%store -r best_model_kneigh 
%store -r best_model_dtree 
%store -r lin_reg

In [6]:
%store -r X_ratios
%store -r X_normed
%store -r X_test
%store -r y_normed
%store -r y_test
%store -r listings

Feature Interactions - We do this in two ways: first we detect and add only the interactions that increase accuracy by more than an arbitrary threshold (0.02); then we simply add all interactions
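`detect_interactions` lives in our `airbnb_modeling` library and its implementation is not shown in this notebook. One plausible way to do this kind of thresholded search is sketched below. It is an assumption, not necessarily what the helper does: `r2_ols` and `find_interactions` are hypothetical names, and the data is synthetic.

```python
import itertools
import numpy as np

def r2_ols(X, y):
    """R^2 of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    resid = y - A.dot(coef)
    return 1.0 - resid.dot(resid) / ((y - y.mean()) ** 2).sum()

def find_interactions(X, y, threshold=0.02):
    """Return (i, j, gain) for column pairs whose product term
    raises in-sample R^2 by more than `threshold`."""
    base = r2_ols(X, y)
    found = []
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        X_plus = np.column_stack([X, X[:, i] * X[:, j]])
        gain = r2_ols(X_plus, y) - base
        if gain > threshold:
            found.append((i, j, gain))
    return found

# Synthetic data with one true interaction between columns 1 and 2.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(2000, 3))
y_demo = X_demo[:, 0] + 2.0 * X_demo[:, 1] * X_demo[:, 2] + 0.1 * rng.normal(size=2000)
pairs_found = find_interactions(X_demo, y_demo, threshold=0.02)
```

Note that the sketch scores on the training data only. A version that compares cross-validated scores would be slower, but less likely to flag spurious pairs.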

In [7]:
increments = detect_interactions(X_normed,y_normed, 0.02)



In [8]:
X_normed_wint = add_interactions(X_normed, increments)

In [9]:
increments.shape

(0, 3)

The above did not return any interactions! No single interaction improves accuracy by more than the threshold, so interactions will probably not change our models much

In [10]:
poly = PolynomialFeatures(interaction_only=True)
X_normed_wint = poly.fit_transform(X_normed)

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X_normed_wint,y_normed, test_size=0.3, random_state=42)

In [12]:
lin_reg_intonly = linear_model.LinearRegression(fit_intercept=True, normalize=False)
lin_reg_intonly.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [None]:
scores_lin_intonly = cross_validate(lin_reg_intonly, X_train, y_train, cv=10, return_train_score=True,
                         scoring=('r2', 'neg_mean_squared_error','neg_mean_absolute_error'))

In [None]:
print 'Evaluation Metrics for Linear Regression with CV - Interactions Only Added: '
eval_metrics(scores_lin_intonly)

In [None]:
#No need for plot here
#plot_rmse_instances(lin_reg_intonly, X_train, y_train)

Adding interactions simply overfits. Let's also try adding quadratic terms to the normalized dataset to see if nonlinear regression might help, although it will probably also overfit
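To get a feel for why these expansions are so expensive and so prone to overfitting, we can count how many columns a degree-2 `PolynomialFeatures` expansion produces. The helper below is a small sketch of that arithmetic. `n_poly_columns` is a name made up for this example, and the figure of 100 input features is illustrative rather than the actual width of `X_normed`.

```python
def n_poly_columns(n_features, interaction_only=False):
    """Number of columns in a degree-2 expansion: bias + linear + second-order terms."""
    if interaction_only:
        # only products of two distinct features: n choose 2
        second_order = n_features * (n_features - 1) // 2
    else:
        # products of distinct features plus the n squared terms
        second_order = n_features * (n_features + 1) // 2
    return 1 + n_features + second_order

# Illustrative input width, not the real number of columns in X_normed
cols_interactions = n_poly_columns(100, interaction_only=True)
cols_quadratic = n_poly_columns(100)
```

With 100 inputs, both expansions produce more than 5,000 columns, which can easily exceed the number of training rows available to a plain Linear Regression.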

In [None]:
poly = PolynomialFeatures(2)
X_normed_quad = poly.fit_transform(X_normed)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_normed_quad,y_normed, test_size=0.3, random_state=42)

In [None]:
quad_reg = linear_model.LinearRegression(fit_intercept=True, normalize=False)
quad_reg.fit(X_train, y_train)

In [None]:
scores_quad = cross_validate(quad_reg, X_train, y_train, cv=5, return_train_score=True,
                         scoring=('r2', 'neg_mean_squared_error','neg_mean_absolute_error'))

In [None]:
print 'Evaluation Metrics for Linear Regression with CV - Quadratic Terms Added: '
eval_metrics(scores_quad)

In [None]:
#plot_rmse_instances(quad_reg, X_train, y_train)

Despite all the attempts, the most promising model is still plain Linear Regression. Let's try the same Linear Regression with the ratio features. We will probably overfit again, but we will then do Feature Selection to choose a good feature subset

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_ratios, y_normed, test_size=0.3, random_state=42)

In [None]:
lin_reg = linear_model.LinearRegression(fit_intercept=True, normalize=False)
lin_reg.fit(X_train, y_train)

In [None]:
scores_lin_ratios = cross_validate(lin_reg, X_train, y_train, cv=10, return_train_score=True,
                         scoring=('r2', 'neg_mean_squared_error','neg_mean_absolute_error'))

In [None]:
print 'Evaluation Metrics for Linear Regression with CV: '
eval_metrics(scores_lin_ratios)

In [None]:
temp_pred = lin_reg.predict(X_test)
# mean_squared_error is already positive, so no sign flip is needed before the square root
test_lin_ratios_nofs = np.sqrt(mean_squared_error(y_test, temp_pred))

The model above is promising: the training RMSE has increased more than the validation RMSE, which is a sign that we may be approaching the sweet spot just before the model starts to overfit
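The comparison above relies on our `eval_metrics` helper, whose code is not shown here. As a minimal sketch of the underlying arithmetic, the fold scores returned by `cross_validate` can be turned into train and validation RMSE as below. `rmse_summary` is a hypothetical name and the fold scores are made up for illustration.

```python
import numpy as np

def rmse_summary(scores):
    """Mean train/validation RMSE from a cross_validate-style dict of negated MSE scores."""
    train_rmse = np.sqrt(-np.asarray(scores["train_neg_mean_squared_error"]))
    val_rmse = np.sqrt(-np.asarray(scores["test_neg_mean_squared_error"]))
    return {
        "train_rmse": train_rmse.mean(),
        "val_rmse": val_rmse.mean(),
        # a large positive gap suggests overfitting
        "gap": val_rmse.mean() - train_rmse.mean(),
    }

# Made-up fold scores, only to show the shape of the input
fake_scores = {
    "train_neg_mean_squared_error": [-0.04, -0.04, -0.04],
    "test_neg_mean_squared_error": [-0.09, -0.09, -0.09],
}
summary = rmse_summary(fake_scores)
```

The sign flip is needed because scikit-learn reports `neg_mean_squared_error`, so that larger values are always better.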

Feature Selection - We use this promising model above to pick only the best features
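For intuition, recursive feature elimination with a linear model can be sketched in a few lines of NumPy: refit, drop the feature with the smallest absolute coefficient, and repeat. This is a simplified illustration, not scikit-learn's implementation. `rfe_ols` is a made-up name, the data is synthetic, and unlike `RFECV` the number of features to keep is fixed in advance instead of being chosen by cross-validation.

```python
import numpy as np

def rfe_ols(X, y, n_keep):
    """Backward elimination with OLS, one feature per step.

    Returns a ranking array: kept features get rank 1 and the first
    feature eliminated gets the largest rank.
    """
    n_features = X.shape[1]
    remaining = list(range(n_features))
    ranking = np.ones(n_features, dtype=int)
    while len(remaining) > n_keep:
        A = np.column_stack([np.ones(len(X)), X[:, remaining]])
        coef = np.linalg.lstsq(A, y, rcond=None)[0][1:]  # skip the intercept
        weakest = remaining[int(np.argmin(np.abs(coef)))]
        ranking[weakest] = len(remaining) - n_keep + 1
        remaining.remove(weakest)
    return ranking

# Synthetic data: only columns 0 and 2 drive the target
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + 0.1 * rng.normal(size=200)
demo_ranking = rfe_ols(X_demo, y_demo, n_keep=2)
```

Coefficient size is only a fair measure of importance when the features are on comparable scales, which is one reason to run this on the normalized data.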

In [None]:
selector = RFECV(lin_reg, step=1, cv=5, scoring='neg_mean_squared_error')
selector.fit(X_ratios, y_normed)

In [None]:
print("Optimal number of features : %d" % selector.n_features_)

In [None]:
# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation RMSE score")
plt.title("Optimal number of features : %d" % selector.n_features_)
plt.plot(range(1, len(selector.grid_scores_) + 1), np.sqrt(-selector.grid_scores_))
plt.show()

In [None]:
X_new = selector.transform(X_ratios)

All features have equally important ranking!

In [None]:
selector.ranking_

Now let's rerun our best model so far and evaluate changes to model metrics resulting from removing unneeded features

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_new, y_normed, test_size=0.3, random_state=42)

In [None]:
lin_reg = linear_model.LinearRegression(fit_intercept=True, normalize=False)
lin_reg.fit(X_train,y_train)
scores_temp = cross_validate(lin_reg, X_train, y_train, cv=10, return_train_score=True,
                         scoring=('r2', 'neg_mean_squared_error','neg_mean_absolute_error'))
print 'Evaluation Metrics for Linear Regression with CV: '
eval_metrics(scores_temp)

The validation RMSE has actually increased and the accuracy has decreased slightly. This is not ideal, but we choose to trade a little accuracy for lower complexity (we drop more than 100 features this way)

Now we rebuild the model using only the important features, i.e. the feature subset where validation error is lowest, in order to replicate by hand what RFECV produced above

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_ratios, y_normed, test_size=0.3, random_state=42)

In [None]:
feats = {} # a dict to hold feature_name: RFECV ranking
# the selector was fitted on X_ratios, so the names must come from X_ratios
for feature, importance in zip(X_ratios.columns, selector.ranking_):
    feats[feature] = importance #add the name/value pair 

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'RFECV_Ranking'})
importances = importances.sort_values(by='RFECV_Ranking')
importances.head()

We restrict our features to the ones RFECV selected, i.e. the subset that minimizes CV RMSE
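In `RFECV`, every selected feature has rank 1 in `ranking_`, and `support_` holds the equivalent boolean mask. A minimal sketch of turning a ranking into a list of names follows. The names and ranking values are made up, and `selected_feature_names` is a hypothetical helper.

```python
import numpy as np

def selected_feature_names(names, ranking):
    """Return the names whose ranking is 1, i.e. the features the selector kept."""
    names = np.asarray(names)
    ranking = np.asarray(ranking)
    if names.shape != ranking.shape:
        # a mismatch usually means the names come from a different matrix
        raise ValueError("names and ranking must have the same length")
    return names[ranking == 1].tolist()

# Made-up names and ranking, for illustration only
demo_names = ["f_a", "f_b", "f_c", "f_d"]
demo_ranking = [1, 3, 1, 2]
kept = selected_feature_names(demo_names, demo_ranking)
```

The length check matters in practice: the names have to come from the same matrix the selector was fitted on (here `X_ratios`), otherwise names and ranks end up misaligned without any error being raised.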

In [None]:
best_features = list(importances.head(selector.n_features_).index)

In [None]:
lin_reg = linear_model.LinearRegression(fit_intercept=True, normalize=False)
lin_reg.fit(X_train[best_features],y_train)
scores_lin_ratios_fsel = cross_validate(lin_reg, X_train[best_features], y_train, cv=10, 
                         scoring=('r2', 'neg_mean_squared_error','neg_mean_absolute_error'))
print 'Evaluation Metrics for Linear Regression with CV: '
eval_metrics(scores_lin_ratios_fsel)

In [None]:
plot_rmse_features(lin_reg, X_train, y_train, best_features)

Finally, let's see how the models perform against our Test Set
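As a reminder of what the three test metrics below measure, here is a small worked sketch that computes them by hand with NumPy. `regression_metrics` is a name made up for this example and the numbers are toy values, not results from our data.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R^2, RMSE and MAE computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)                     # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": np.sqrt(np.mean(resid ** 2)),
        "mae": np.mean(np.abs(resid)),
    }

# Toy values: three perfect predictions and one that is off by 2
toy = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 2.0])
```

In this toy case a single large miss pulls RMSE (1.0) well above MAE (0.5), because RMSE squares the errors before averaging and so penalizes large residuals more.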

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_normed, y_normed, test_size=0.3, random_state=42)

In [None]:
X_train_intonly, X_test_intonly, y_train_intonly, y_test_intonly = train_test_split(X_normed_wint,y_normed, test_size=0.3, random_state=42)

In [None]:
X_train_quad, X_test_quad, y_train_quad, y_test_quad = train_test_split(X_normed_quad,y_normed, test_size=0.3, random_state=42)

In [None]:
test_predictions_lin_reg_intonly = lin_reg_intonly.predict(X_test_intonly)
test_predictions_quad_reg = quad_reg.predict(X_test_quad)

In [None]:
print 'Evaluation Metrics for Linear Regression with Interactions'
print 'Test R2: ',r2_score(y_test_intonly, test_predictions_lin_reg_intonly)
print 'Test RMSE: ',np.sqrt(mean_squared_error(y_test_intonly, test_predictions_lin_reg_intonly))
print 'Test MAE: ',mean_absolute_error(y_test_intonly, test_predictions_lin_reg_intonly)
map_variable(y_test_intonly-test_predictions_lin_reg_intonly, listings)

In [None]:
print 'Evaluation Metrics for Quadratic Regression (Squared Terms and Interactions)'
print 'Test R2: ',r2_score(y_test_quad, test_predictions_quad_reg)
print 'Test RMSE: ',np.sqrt(mean_squared_error(y_test_quad, test_predictions_quad_reg))
print 'Test MAE: ',mean_absolute_error(y_test_quad, test_predictions_quad_reg)
map_variable(y_test_quad-test_predictions_quad_reg, listings)

In [None]:
from yellowbrick.regressor import ResidualsPlot

In [None]:
# Takes too long
#lin_reg_intonly_pred_cv = cross_val_predict(lin_reg_intonly, X_train_intonly, y_train_intonly, cv=10)

In [None]:
"""plt.scatter(lin_reg_intonly_pred_cv, lin_reg_intonly_pred_cv-y_train_intonly, 
            c='steelblue', marker='o', edgecolor='white',
           label='CV Train Data')
plt.scatter(test_predictions_lin_reg_intonly, test_predictions_lin_reg_intonly-y_test_intonly, 
            c='limegreen', marker='x', edgecolor='red',
           label='Test Data')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.legend(loc='upper right')
plt.hlines(y=0, color='black', xmin=y_train_intonly.min()-1, xmax=y_train_intonly.max()+1, lw=3)
plt.title('Predicted Values vs Residuals - Linear Regression with Interactions')
plt.show()"""

In [None]:
#Takes too long
#quad_reg_pred_cv = cross_val_predict(quad_reg, X_train_quad, y_train_quad, cv=10)

In [None]:
"""plt.scatter(quad_reg_pred_cv, quad_reg_pred_cv-y_train_quad, 
            c='steelblue', marker='o', edgecolor='white',
           label='CV Train Data')
plt.scatter(test_predictions_quad_reg, test_predictions_quad_reg-y_test_quad, 
            c='limegreen', marker='x', edgecolor='red',
           label='Test Data')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.legend(loc='upper right')
plt.hlines(y=0, color='black', xmin=y_train_quad.min()-1, xmax=y_train_quad.max()+1, lw=3)
plt.title('Predicted Values vs Residuals - Quadratic Regression')
plt.show()"""

Best Model After Adding Ratios and Doing Feature Selection

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_ratios, y_normed, test_size=0.3, random_state=42)

In [None]:
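lin_reg = linear_model.LinearRegression(fit_intercept=True, normalize=False)
# fit and predict on the RFECV-selected columns only, so this really is the feature-selected model
lin_reg.fit(X_train[best_features], y_train)
test_predictions_ratios_lin_reg_fs = lin_reg.predict(X_test[best_features])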

In [None]:
print 'Evaluation Metrics for Linear Regression with Ratio Features & Feature Selection'
print 'Test R2: ',r2_score(y_test, test_predictions_ratios_lin_reg_fs)
print 'Test RMSE: ',np.sqrt(mean_squared_error(y_test, test_predictions_ratios_lin_reg_fs))
print 'Test MAE: ',mean_absolute_error(y_test, test_predictions_ratios_lin_reg_fs)
map_variable(y_test-test_predictions_ratios_lin_reg_fs, listings)

In [None]:
ratios_lin_reg_pred_cv = cross_val_predict(lin_reg, X_train[best_features], y_train, cv=10)

In [None]:
plt.scatter(ratios_lin_reg_pred_cv, ratios_lin_reg_pred_cv-y_train, 
            c='steelblue', marker='o', edgecolor='white',
           label='CV Train Data')
plt.scatter(test_predictions_ratios_lin_reg_fs, test_predictions_ratios_lin_reg_fs-y_test, 
            c='limegreen', marker='x', edgecolor='red',
           label='Test Data')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.legend(loc='upper right')
plt.hlines(y=0, color='black', xmin=-.3, xmax=.4, lw=3)
plt.title('Predicted Values vs Residuals - Linear Regression with Ratio Features & Feature Selection')
plt.show()

In [None]:
#%store scores_lin, scores_tree, scores_sv_reg, scores_neigh_reg

In [None]:
%store scores_lin_intonly
%store scores_lin_ratios
%store scores_lin_ratios_fsel
%store scores_quad

In [None]:
%store test_predictions_ratios_lin_reg_fs
%store test_predictions_lin_reg_intonly
%store test_predictions_quad_reg