## Linear Regression with House Prices

Now that we have the house price train and test sets, let's fit a few regression models and compare how they score.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import pickle
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
from xgboost import XGBRegressor
import numpy as np
%matplotlib inline

### Import the data

Let's get the training and test data properly imported.

In [None]:
train_df = pd.read_csv('../data/house_train_final.csv')
test_df = pd.read_csv('../data/house_test_final.csv')
# Keep the Kaggle ids for the submission file, but don't use them as a feature
test_ids = test_df.Id
test_df = test_df.drop('Id', axis=1)
y = train_df.SalePrice
X = train_df.drop('SalePrice', axis=1)
# Hold out a third of the labeled data so we can evaluate models locally
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
test_df.head()

In [None]:
X.head()

In [None]:
lm = LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)

plt.scatter(y_test, y_pred)
plt.xlabel("Prices: $Y_i$")
# Raw strings keep \hat from being read as an (invalid) escape sequence
plt.ylabel(r"Predicted prices: $\hat{Y}_i$")
plt.title(r"Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$")

In [None]:
lm.coef_

In [None]:
y_train.hist()

In [None]:
plt.hist(y_pred)

In [None]:
plt.scatter(X_train['GrLivArea'], y_train, c='blue', alpha=0.5)
plt.scatter(X_test['GrLivArea'], y_pred, c='green')

In [None]:
y_pred

In [None]:
metrics.mean_squared_log_error(y_test, y_pred)

In [None]:
metrics.mean_squared_error(y_test, y_pred)
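
The two numbers above live on very different scales: `mean_squared_error` is in squared dollars, while `mean_squared_log_error` compares the logs of the prices. Here is a minimal sketch of the log-based score, assuming the leaderboard reports the square root of that value (check the competition page to be sure). The `rmsle` helper and the toy prices are made up for illustration and are not part of the workshop data.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + y) like scikit-learn does."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Toy prices (made-up numbers, not taken from the data set)
toy_true = np.array([200000.0, 150000.0, 320000.0])
toy_pred = np.array([210000.0, 140000.0, 300000.0])
toy_rmsle = rmsle(toy_true, toy_pred)
print(toy_rmsle)
```

Because the error is measured on logs, being off by 10,000 on a cheap house costs more than being off by 10,000 on an expensive one.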

In [None]:
X.shape

In [None]:
test_df.shape

In [None]:
test_df.isnull().any()

In [None]:
kaggle_pred = lm.predict(test_df)

In [None]:
plt.scatter(X_train['GrLivArea'], y_train, c='blue', alpha=0.5)
plt.scatter(X_test['GrLivArea'], y_pred, c='green')
plt.scatter(test_df['GrLivArea'], kaggle_pred, c='red')

### Exercise

Chart at least one more feature versus price for your training, test prediction and Kaggle prediction sets.
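
If you want a starting point, here is one possible shape for a reusable plotting helper. This is only a sketch: the function name `plot_feature_vs_price` and the `LotArea` column in the demo are hypothetical, and the demo runs on synthetic numbers so it does not depend on the workshop files.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_feature_vs_price(feature, train_X, train_y, holdout_X, holdout_pred,
                          kaggle_X, kaggle_pred):
    """Scatter one feature against actual and predicted prices on shared axes."""
    fig, ax = plt.subplots()
    ax.scatter(train_X[feature], train_y, c='blue', alpha=0.5, label='training')
    ax.scatter(holdout_X[feature], holdout_pred, c='green', label='test predictions')
    ax.scatter(kaggle_X[feature], kaggle_pred, c='red', label='Kaggle predictions')
    ax.set_xlabel(feature)
    ax.set_ylabel('SalePrice')
    ax.legend()
    return ax

# Synthetic demo data (hypothetical column name, made-up values)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({'LotArea': rng.uniform(2000, 15000, size=30)})
demo_y = 50000 + 10 * demo_X['LotArea']
demo_ax = plot_feature_vs_price('LotArea', demo_X, demo_y,
                                demo_X, demo_y, demo_X, demo_y)
```

With the notebook's own variables, the call would take `X_train`, `y_train`, `X_test`, `y_pred`, `test_df` and `kaggle_pred` in that order.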

In [None]:
lr_output = np.vstack((test_ids, kaggle_pred)).T

In [None]:
submission = pd.DataFrame(lr_output, columns=['Id', 'SalePrice'])

In [None]:
submission.Id = submission.Id.astype(int)

In [None]:
submission.to_csv('../data/lr_output_submission.csv', index=False)

## Now go submit it. Any issues?

In [None]:
submission.SalePrice.describe()
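
`describe()` is a quick way to spot odd values, for example a negative minimum price. The same idea can be turned into a small check to run before uploading. This is a rough sketch, assuming the expected columns are `Id` and `SalePrice` as in the submission frame above; `check_submission` is a hypothetical helper and the demo rows are made up.

```python
import pandas as pd

def check_submission(sub):
    """Return a list of human-readable problems found in a submission frame."""
    problems = []
    if list(sub.columns) != ['Id', 'SalePrice']:
        problems.append('unexpected columns: %s' % list(sub.columns))
        return problems
    if sub.isnull().any().any():
        problems.append('missing values')
    if not pd.api.types.is_integer_dtype(sub['Id']):
        problems.append('Id is not an integer column')
    if (sub['SalePrice'] <= 0).any():
        problems.append('non-positive SalePrice values')
    return problems

# Made-up rows: the second prediction is negative on purpose
demo_sub = pd.DataFrame({'Id': [1461, 1462], 'SalePrice': [120000.0, -5000.0]})
demo_problems = check_submission(demo_sub)
print(demo_problems)
```

An empty list means none of these particular checks failed. It does not guarantee a good score.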

### What was your score? 

- Compare and share scores
- What features did you use? Can you make note of them?

In [None]:
svr = SVR(kernel='linear')

In [None]:
svr.fit(X_train, y_train)

In [None]:
y_pred = svr.predict(X_test)

In [None]:
metrics.mean_squared_log_error(y_test, y_pred)

### Exercise

Now run the predictions for the Kaggle set and plot them on a chart with the training data, test predictions and Kaggle predictions.

In [None]:
kaggle_pred = svr.predict(test_df)

In [None]:
kaggle_pred

In [None]:
%load ../solutions/predict_and_graph.py



### Discussion

Do you think this will score as well? Why or why not?

In [None]:
svr_output = np.vstack((test_ids, kaggle_pred)).T

In [None]:
submission = pd.DataFrame(svr_output, columns=['Id', 'SalePrice'])

In [None]:
submission.Id = submission.Id.astype(int)

In [None]:
submission.to_csv('../data/svr_output_submission.csv', index=False)

### What was your score? 

- Compare and share scores

In [None]:
rfr = RandomForestRegressor(max_depth=5, random_state=0)

### Exercise

Now you fit, predict and evaluate this model! :)

In [None]:
%load ../solutions/rfr.py



### Discussion

Do you think this will score as well? Why or why not?

### Exercise

Prepare your submission and submit! :)

In [None]:
%load ../solutions/rfr_submit.py



### What was your score? 

- Compare and share scores

### Exercise

Choose one of the following models and try another submission:

- ExtraTreesRegressor
- SVR with a different kernel
- Or any of the linear models [from the documentation](http://scikit-learn.org/stable/modules/linear_model.html)
    
OR you can take this time to engineer one or two more features and retry either one of the models we used before or another new model from the list above. 
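
For the first option, the fit/predict/score pattern is the same as for the models above. Here is a minimal sketch with `ExtraTreesRegressor` on synthetic data, so it runs without the workshop files. The hyperparameters and the `demo_` names are illustrative assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house data (made-up relationship plus noise)
rng = np.random.default_rng(42)
demo_X = rng.uniform(500, 4000, size=(200, 2))
demo_y = 50000 + 100 * demo_X[:, 0] + 20 * demo_X[:, 1] + rng.normal(0, 5000, size=200)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.33, random_state=42)

etr = ExtraTreesRegressor(n_estimators=50, max_depth=5, random_state=0)
etr.fit(demo_X_train, demo_y_train)
etr_pred = etr.predict(demo_X_test)

# R^2 on the held-out part; closer to 1.0 is better
etr_r2 = etr.score(demo_X_test, demo_y_test)
print(etr_r2)
```

Swapping in the notebook's `X_train`, `y_train` and `X_test` gives you predictions to score with the same `metrics` calls used earlier.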

### GridSearch for Parameter Tuning

In [None]:
rfr.get_params()

In [None]:
param_grid = { 
    'max_depth': [2, 5, 7, 10],
    # 'auto' was removed for regressors in scikit-learn 1.3; 1.0 means "use all features"
    'max_features': [1.0, 'sqrt', 'log2'],
    'n_estimators': [5, 10, 20]
}

In [None]:
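grid_search = GridSearchCV(estimator=rfr, param_grid=param_grid)
grid_search.fit(X_train, y_train)
# grid_scores_ was removed in scikit-learn 0.20; cv_results_ holds the per-candidate results
scores = pd.DataFrame(grid_search.cv_results_)
scores

In [None]:
score = scores.iloc[0]
score.mean_test_score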

## Exercise: Grab the top 5 scores from all scores sorted by the best (highest) score

In [None]:
%load ../solutions/top_scores.py
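
On recent scikit-learn releases the per-candidate results are stored in `cv_results_`, which loads neatly into a DataFrame. Here is a rough sketch of pulling the top 5 candidates that way. It runs on synthetic data with a deliberately small grid, and all `demo_` names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the house data (made-up numbers)
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 1, size=(120, 3))
demo_y = 3 * demo_X[:, 0] + demo_X[:, 1] + rng.normal(0, 0.1, size=120)

demo_grid = {
    'max_depth': [2, 5],
    'n_estimators': [5, 10],
    'max_features': ['sqrt', 1.0],
}
demo_search = GridSearchCV(RandomForestRegressor(random_state=0), demo_grid, cv=3)
demo_search.fit(demo_X, demo_y)

# One row per parameter combination; a higher mean_test_score is better
results = pd.DataFrame(demo_search.cv_results_)
top5 = (results.sort_values('mean_test_score', ascending=False)
               .head(5)[['params', 'mean_test_score', 'std_test_score']])
print(top5)
```

The first row of `top5` carries the same score as `demo_search.best_score_`.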



### Exercise

Now retrain your favorite model with the top-scoring grid search parameters and resubmit. Did you improve your score?

In [None]:
%load ../solutions/post_grid_search_rfr.py


### Bonus Exercise: XGBoost and GridSearch

In [None]:
grid_test = {
    'gamma': [0.0, 0.2, 0.4],
    'max_depth': [5, 10, 25],
    'n_estimators': [100, 500, 1000, 5000],
    'reg_alpha': [0.1, 0.5, 1.0],
    'reg_lambda': [0.1, 0.5, 1.0],
}


In [None]:
xgbr_base = XGBRegressor(learning_rate=0.05)

### Exercise

Use grid search to find the top scores and retrain with the best-scoring parameters.
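
Before launching the search, it can help to count how many models it will train. Here is a small sketch using scikit-learn's `ParameterGrid` on the same `grid_test` values as above. The fold count of 5 is an assumption based on the default `cv` in recent scikit-learn releases.

```python
from sklearn.model_selection import ParameterGrid

grid_test = {
    'gamma': [0.0, 0.2, 0.4],
    'max_depth': [5, 10, 25],
    'n_estimators': [100, 500, 1000, 5000],
    'reg_alpha': [0.1, 0.5, 1.0],
    'reg_lambda': [0.1, 0.5, 1.0],
}

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(grid_test))

# Assumption: 5-fold cross-validation, so each candidate is fit 5 times
n_folds = 5
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

That is 324 candidates and 1620 fits, some of them with 5000 trees, so consider trimming the grid first if time is short.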

In [None]:
%load ../solutions/grid_search_xgb.py


### Save your most performant model(s) for later evaluation

- Save as many as you want, but try to make the names memorable. Sometimes I even put the date in the name so I can compare versions over time.
- If you are doing this on a large scale you might also want to think of a versioning system with documentation.
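
The date-in-the-name idea could look something like the sketch below. The `save_model` helper and the naming scheme are hypothetical, and the demo writes to a temporary directory rather than `../data/models/` so it is safe to run anywhere.

```python
import pickle
import tempfile
from datetime import date
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

def save_model(model, name, directory):
    """Pickle a model to <directory>/<name>_<YYYYMMDD>.sav and return the path."""
    path = Path(directory) / f"{name}_{date.today():%Y%m%d}.sav"
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    return path

# Tiny made-up model: y = 2 * x
demo_model = LinearRegression().fit(np.array([[1.0], [2.0], [3.0]]),
                                    np.array([2.0, 4.0, 6.0]))

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model(demo_model, 'lr', tmp)
    # Load it back to confirm the round trip works
    with open(saved_path, 'rb') as f:
        restored = pickle.load(f)

restored_pred = restored.predict(np.array([[4.0]]))
print(saved_path.name, restored_pred)
```

Pickled models are only reliable when loaded with the same library versions that saved them, which is one more reason to note versions alongside the file.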

In [None]:
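# new_rfr is assumed to come from the post-grid-search exercise above
for name, model in [('svr', svr), ('rfr', rfr), ('new_rfr', new_rfr)]:
    # The with block closes each file once the model is written
    with open('../data/models/%s.sav' % name, 'wb') as f:
        pickle.dump(model, f)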