## Notebook 5 - Fit Regression Models on the Tweet Metadata
The purpose of this notebook is to train regression models on the non-text features of the tweets, using the `retweets` column as the target.

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
df_spca = pd.read_pickle('../data/3-post_eda.p')

In [3]:
df_spca.head()

Unnamed: 0,author_id,local_datetime,favorites,hashtags,id,mentions,permalink,retweets,text,urls,username,country,year,month,weekday,hour,has_mention,has_hashtag,has_url
1398,22213454,2012-01-30 10:18:54-08:00,0,,164049967055503360,,https://twitter.com/sfspca/status/164049967055...,1.098612,Adoptable of the Week: KC! If KC had his way t...,http://fb.me/16dUVX4Xp,sfspca,usa,2012,1,0,10,0,0,1
1399,22213454,2012-01-29 13:12:21-08:00,0,,163731230481846272,,https://twitter.com/sfspca/status/163731230481...,1.609438,Donate your old car to the SF SPCA! Hassle-fre...,http://fb.me/MpZqiDTz,sfspca,usa,2012,1,6,13,0,0,1
1400,22213454,2012-01-29 09:50:19-08:00,0,,163680385379733505,,https://twitter.com/sfspca/status/163680385379...,1.386294,Do you see cats in your neighborhood with no p...,http://fb.me/10Ge4JL5o,sfspca,usa,2012,1,6,9,0,0,1
1401,22213454,2012-01-28 07:01:22-08:00,4,,163275481335087104,,https://twitter.com/sfspca/status/163275481335...,1.098612,Photo: pig in a blanket! http:// tmblr.co/ZEVw...,http://tmblr.co/ZEVwxwFVZsZM,sfspca,usa,2012,1,5,7,0,0,1
1402,22213454,2012-01-27 17:10:56-08:00,0,,163066492647251968,,https://twitter.com/sfspca/status/163066492647...,1.386294,"Come meet these adorable, adoptable animals at...",http://fb.me/1e7m3Pz0s,sfspca,usa,2012,1,4,17,0,0,1


### Train models with the following predictors:
1. Year tweet was created
1. Month tweet was created
1. Hour tweet was created
1. Day of week tweet was created
1. Tweet has a hashtag
1. Tweet has a mention
1. Tweet has a URL

In [4]:
# pull the desired features and split the data into train and test sets
from sklearn.model_selection import train_test_split
cols = ['year', 'month', 'hour', 'weekday', 'has_hashtag', 'has_mention', 'has_url']
X_train, X_test, y_train, y_test = train_test_split(df_spca[cols], df_spca.retweets, 
                                                      test_size=.3, random_state=55)

### To-do: Change these to classifiers
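
One way to act on this to-do, sketched under stated assumptions: binarize `retweets` at the training-set median and swap in a classifier. The `demo` frame below is synthetic stand-in data, not `df_spca`. The median threshold, the choice of `DecisionTreeClassifier`, and `max_depth=3` are illustrative assumptions rather than the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for df_spca: random values in the same columns.
rng = np.random.default_rng(55)
n = 500
demo = pd.DataFrame({
    'year': rng.integers(2009, 2018, n),
    'month': rng.integers(1, 13, n),
    'hour': rng.integers(0, 24, n),
    'weekday': rng.integers(0, 7, n),
    'has_hashtag': rng.integers(0, 2, n),
    'has_mention': rng.integers(0, 2, n),
    'has_url': rng.integers(0, 2, n),
    'retweets': rng.exponential(1.5, n),
})
cols = ['year', 'month', 'hour', 'weekday', 'has_hashtag', 'has_mention', 'has_url']

# Split first so the class threshold is derived from training data only.
Xc_train, Xc_test, r_train, r_test = train_test_split(
    demo[cols], demo['retweets'], test_size=.3, random_state=55)

# Assumption: class 1 means "above the training-set median retweets value".
threshold = r_train.median()
yc_train = (r_train > threshold).astype(int)
yc_test = (r_test > threshold).astype(int)

# For classifiers, .score() returns mean accuracy rather than R^2.
clf = DecisionTreeClassifier(max_depth=3, random_state=55)
clf.fit(Xc_train, yc_train)
train_acc = clf.score(Xc_train, yc_train)
test_acc = clf.score(Xc_test, yc_test)
print('Train accuracy: {}'.format(train_acc))
print('Test accuracy: {}'.format(test_acc))
```

Because the demo target is random, the accuracy here says nothing about the real tweets; the sketch only shows the shape of the change.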

In [5]:
from sklearn.linear_model import LinearRegression, Lasso, BayesianRidge, SGDRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

In [6]:
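def fit_and_score(data, regr):
    """Fit regr on the training split and print its train and test R^2 scores."""
    # data is the tuple (X_train, X_test, y_train, y_test)
    X_train, X_test, y_train, y_test = data
    print('Regressor: {}'.format(regr))

    regr.fit(X_train, y_train)
    print('Train score: {}'.format(regr.score(X_train, y_train)))
    print('Test score: {}'.format(regr.score(X_test, y_test)))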

In [7]:
data = (X_train, X_test, y_train, y_test)

In [8]:
lasso = Lasso()
sgd = SGDRegressor()
knr = KNeighborsRegressor(n_jobs=-1)
bayes = BayesianRidge()
models = [lasso, sgd, knr, bayes]
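
`SGDRegressor` and `KNeighborsRegressor` are sensitive to feature scale, and the predictors here mix `year` (around 2012) with 0/1 flags. A minimal sketch, assuming standardization is acceptable for these features, wraps the estimator in a pipeline with `StandardScaler`. The arrays `X_demo` and `y_demo` are synthetic placeholders, not the tweet data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder features on very different scales.
rng = np.random.default_rng(0)
n = 400
X_demo = np.column_stack([
    rng.integers(2009, 2018, n),  # year-like column
    rng.integers(0, 24, n),       # hour-like column
    rng.integers(0, 2, n),        # binary flag column
]).astype(float)

# Synthetic target: linear in the year-like and flag columns, plus noise.
y_demo = 0.2 * (X_demo[:, 0] - 2013) + 0.5 * X_demo[:, 2] + rng.normal(0, 0.1, n)

# The scaler is fit inside the pipeline, so it only sees the data passed to fit().
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
scaled_sgd.fit(X_demo, y_demo)
scaled_r2 = scaled_sgd.score(X_demo, y_demo)
print('R^2 with scaling: {}'.format(scaled_r2))
```

A pipeline built this way exposes the same `fit` and `score` methods as a bare estimator, so it could be added to `models` and passed to `fit_and_score` unchanged.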

In [9]:
for model in models:
    fit_and_score(data, model)
    print()

Regressor: Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
Train score: 0.0
Test score: -3.583835111786726e-06

Regressor: SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', n_iter=5, penalty='l2', power_t=0.25,
       random_state=None, shuffle=True, verbose=0, warm_start=False)
Train score: -1.751284707370474e+29
Test score: -1.7495617275622054e+29

Regressor: KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
          weights='uniform')
Train score: 0.5026726315772778
Test score: 0.27120937059065

Regressor: BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,
       fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=