My objective here is to develop a model that can predict how many retweets/likes a tweet from Trump will generate. The broad plan is to extract features from the tweets using an NLP approach and see whether these features have any predictive power. We will begin with a very simple approach, TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf), to build a benchmark model. See the NLP_Trump_tweets_Topic_modelling notebook for a more thorough treatment of data collection, problem approach, etc.

In [1]:
#regular imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_colwidth', None) #to show full tweets in the cells (None replaces the deprecated -1 in recent pandas).

In [2]:
#load tweets data
tweets_df = pd.read_csv('trump_tweets_6000.csv', parse_dates = True)
tweets_df.head()

Unnamed: 0,id,created_at,text,retweets,favorites
0,1057254051254013953,2018-10-30 12:53:03,"“If the Fed backs off and starts talking a little more Dovish, I think we’re going to be right back to our 2,800 to 2,900 target range that we’ve had for the S&amp;P 500.” Scott Wren, Wells Fargo.",3854.0,13636.0
1,1057249169507803137,2018-10-30 12:33:39,"The Stock Market is up massively since the Election, but is now taking a little pause - people want to see what happens with the Midterms. If you want your Stocks to go down, I strongly suggest voting Democrat. They like the Venezuela financial model, High Taxes &amp; Open Borders!",7569.0,25503.0
2,1057247021919297536,2018-10-30 12:25:07,"Congressman Kevin Brady of Texas is so popular in his District, and far beyond, that he doesn’t need any help - but I am giving it to him anyway. He is a great guy and the absolute “King” of Cutting Taxes. Highly respected by all, he loves his State &amp; Country. Strong Endorsement!",4214.0,16375.0
3,1057243826899877889,2018-10-30 12:12:25,"Congressman Andy Barr of Kentucky, who just had a great debate with his Nancy Pelosi run opponent, has been a winner for his State. Strong on Crime, the Border, Tax Cuts, Military, Vets and 2nd Amendment, we need Andy in D.C. He has my Strong Endorsement!",4532.0,17532.0
4,1057110242541080577,2018-10-30 03:21:36,".@Erik_Paulsen, @Jason2CD, \r\n@JimHagedornMN and @PeteStauber love our Country and the Great State of Minnesota. They are winners and always get the job done. We need them all in Congress for #MAGA. Border, Military, Vets, 2nd A. Go Vote Minnesota. They have my Strong Endorsement!",6598.0,24168.0


We will now vectorize every tweet using TF-IDF, use the result as the feature matrix, and set the number of retweets as the target. We will also quickly test a Random Forest model and a couple of linear regression models out of the box from sklearn to see if they have any predictive power.
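To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting on a toy corpus. The three sentences are hypothetical and not taken from the tweets file. The smoothed IDF formula and the L2 row normalisation are assumptions chosen to resemble what I understand the sklearn defaults to be, and the whitespace tokenizer is a simplification of what TfidfVectorizer actually does.

```python
import math
from collections import Counter

# Hypothetical toy corpus, NOT taken from the tweets data set.
docs = [
    "strong endorsement for tax cuts",
    "tax cuts and open borders",
    "strong military and strong borders",
]

def tfidf(docs):
    """Return (vocab, rows); each row is an L2-normalised TF-IDF vector."""
    # Simplified tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # Raw weight = term count in this document * idf of the term.
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf(docs)
# Weights of the first document, keyed by term.
weights = dict(zip(vocab, rows[0]))
```

In the first toy sentence, a term that occurs in only one document (such as "endorsement") ends up with a larger weight than a term shared with another document (such as "tax"), which is the property we are relying on when using these vectors as features.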

In [3]:
#convert text to tf-idf feature space. Retweets is the target array. Test Random Forest regressor as the predictive model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
import math
text_list = list(tweets_df['text'])
vectorizer = TfidfVectorizer()
vectorizer.fit(text_list)
X = vectorizer.transform(text_list)
y = tweets_df['retweets']



In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = RandomForestRegressor(n_estimators = 100, oob_score = True, random_state = 0)
model.fit(X_train, y_train)

y_on_train = model.predict(X_train)
rmse_train = math.sqrt(metrics.mean_squared_error(y_on_train, y_train))
#predict for unseen test data
y_pred = model.predict(X_test)
rmse_test = math.sqrt(metrics.mean_squared_error(y_pred, y_test))
print('root mean squared error on train data %.2f' % rmse_train)
print('root mean squared error on test data %.2f' % rmse_test)

root mean squared error on train data 4557.41
root mean squared error on test data 9737.47


In [5]:
y_compare = pd.DataFrame(y_test)
y_pred_series = pd.Series(y_pred)
y_compare['retweets_predictions'] = y_pred_series.values
y_compare

Unnamed: 0,retweets,retweets_predictions
2650,10759.0,14981.760000
4020,23105.0,21075.040000
4930,12279.0,13020.610000
4920,31355.0,19151.410000
1700,23685.0,21426.210000
5328,11805.0,12064.180000
3468,18598.0,20271.240000
5034,13549.0,18202.080000
4316,21349.0,18971.490000
5384,16197.0,19709.130000


In [6]:
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
import time

In [7]:
model = SGDRegressor(loss='squared_loss', penalty='l2', random_state=42, max_iter=5)
params = {'penalty':['none','l2','l1'],
          'alpha':[1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2, 2e-2, 5e-2, 0.1]}
gs = GridSearchCV(estimator=model,
                  param_grid=params,
                  scoring='neg_mean_squared_error',
                  n_jobs=1,
                  cv=5,
                  verbose=3)
start = time.time()
gs.fit(X_train, y_train)
end = time.time()
print('Time to train model: %0.2fs' % (end -start))

Fitting 5 folds for each of 30 candidates, totalling 150 fits
[CV] alpha=0.0001, penalty=none ......................................
[CV]  alpha=0.0001, penalty=none, score=-141245847.55650964, total=   0.2s
[CV] alpha=0.0001, penalty=none ......................................
[CV]  alpha=0.0001, penalty=none, score=-421051212.68259895, total=   0.0s
[CV] alpha=0.0001, penalty=none ......................................
[CV]  alpha=0.0001, penalty=none, score=-185460291.0896577, total=   0.0s
[CV] alpha=0.0001, penalty=none ......................................
[CV]  alpha=0.0001, penalty=none, score=-151648207.89519772, total=   0.0s
[CV] alpha=0.0001, penalty=none ......................................
[CV]  alpha=0.0001, penalty=none, score=-279372616.41973513, total=   0.0s
[CV] alpha=0.0001, penalty=l2 ........................................
[CV]  alpha=0.0001, penalty=l2, score=-141342575.21940422, total=   0.0s
[CV] alpha=0.0001, penalty=l2 ...................................

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s remaining:    0.0s



[CV] alpha=0.0002, penalty=l2 ........................................
[CV]  alpha=0.0002, penalty=l2, score=-151853241.11341107, total=   0.0s
[CV] alpha=0.0002, penalty=l2 ........................................
[CV]  alpha=0.0002, penalty=l2, score=-279579186.092057, total=   0.0s
[CV] alpha=0.0002, penalty=l1 ........................................
[CV]  alpha=0.0002, penalty=l1, score=-141246006.98261854, total=   0.0s
[CV] alpha=0.0002, penalty=l1 ........................................
[CV]  alpha=0.0002, penalty=l1, score=-421051417.9991225, total=   0.0s
[CV] alpha=0.0002, penalty=l1 ........................................
[CV]  alpha=0.0002, penalty=l1, score=-185460475.21983835, total=   0.0s
[CV] alpha=0.0002, penalty=l1 ........................................
[CV]  alpha=0.0002, penalty=l1, score=-151648373.3012162, total=   0.0s
[CV] alpha=0.0002, penalty=l1 ........................................
[CV]  alpha=0.0002, penalty=l1, score=-279372783.0677394, total=   0

[CV] alpha=0.005, penalty=l2 .........................................
[CV]  alpha=0.005, penalty=l2, score=-156725698.51500595, total=   0.0s
[CV] alpha=0.005, penalty=l2 .........................................
[CV]  alpha=0.005, penalty=l2, score=-284488679.4196815, total=   0.0s
[CV] alpha=0.005, penalty=l1 .........................................
[CV]  alpha=0.005, penalty=l1, score=-141249833.28012133, total=   0.0s
[CV] alpha=0.005, penalty=l1 .........................................
[CV]  alpha=0.005, penalty=l1, score=-421056345.6396067, total=   0.0s
[CV] alpha=0.005, penalty=l1 .........................................
[CV]  alpha=0.005, penalty=l1, score=-185464894.44519714, total=   0.0s
[CV] alpha=0.005, penalty=l1 .........................................
[CV]  alpha=0.005, penalty=l1, score=-151652343.09698883, total=   0.0s
[CV] alpha=0.005, penalty=l1 .........................................
[CV]  alpha=0.005, penalty=l1, score=-279376782.67916733, total=   0.0s

[CV] . alpha=0.1, penalty=l2, score=-227394145.45652053, total=   0.0s
[CV] alpha=0.1, penalty=l2 ...........................................
[CV] .. alpha=0.1, penalty=l2, score=-355738958.3908597, total=   0.0s
[CV] alpha=0.1, penalty=l1 ...........................................
[CV] .. alpha=0.1, penalty=l1, score=-141325557.8056674, total=   0.0s
[CV] alpha=0.1, penalty=l1 ...........................................
[CV] .. alpha=0.1, penalty=l1, score=-421153922.2576773, total=   0.0s
[CV] alpha=0.1, penalty=l1 ...........................................
[CV] . alpha=0.1, penalty=l1, score=-185552348.88538486, total=   0.0s
[CV] alpha=0.1, penalty=l1 ...........................................
[CV] . alpha=0.1, penalty=l1, score=-151730954.25400716, total=   0.0s
[CV] alpha=0.1, penalty=l1 ...........................................
[CV] .. alpha=0.1, penalty=l1, score=-279455781.4778735, total=   0.0s
Time to train model: 1.75s


[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed:    1.6s finished


In [8]:
model = gs.best_estimator_

In [9]:
y_on_train = model.predict(X_train)
rmse_train = math.sqrt(metrics.mean_squared_error(y_on_train, y_train))
#predict for unseen test data
y_pred = model.predict(X_test)
rmse_test = math.sqrt(metrics.mean_squared_error(y_pred, y_test))

In [10]:
print('root mean squared error on train data %.2f' % rmse_train)
print('root mean squared error on test data %.2f' % rmse_test)

root mean squared error on train data 14660.63
root mean squared error on test data 12501.21


In [13]:
model = LinearRegression()
model.fit(X_train, y_train)
y_on_train = model.predict(X_train)
rmse_train = math.sqrt(metrics.mean_squared_error(y_on_train, y_train))
#predict for unseen test data
y_pred = model.predict(X_test)
rmse_test = math.sqrt(metrics.mean_squared_error(y_pred, y_test))

In [14]:
print('root mean squared error on train data %.2f' % rmse_train)
print('root mean squared error on test data %.2f' % rmse_test)

root mean squared error on train data 323.83
root mean squared error on test data 34237.21


In [15]:
max(tweets_df['retweets']) - min(tweets_df['retweets'])

329149.0

In [16]:
tweets_df['retweets'].describe()

count    5403.000000  
mean     18671.330557 
std      12300.082247 
min      1091.000000  
25%      11910.000000 
50%      16523.000000 
75%      22598.500000 
max      330240.000000
Name: retweets, dtype: float64
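One way to read the RMSE numbers against these summary statistics: a model that always predicts the mean of the target has an RMSE equal to the population standard deviation of that target. The sketch below checks this on a handful of illustrative retweet counts; the `rmse` helper is a name introduced here for illustration and is not part of the notebook. Note that `describe()` reports the sample standard deviation (ddof=1), which for a few thousand rows is nearly the same number, and that the value on a particular test split will differ somewhat.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative retweet counts only; a real check would use the full target column.
y_true = [10759.0, 23105.0, 12279.0, 31355.0, 23685.0]

# Baseline "model": always predict the mean of the target.
mean = sum(y_true) / len(y_true)
baseline_rmse = rmse(y_true, [mean] * len(y_true))

# Population standard deviation (divide by n, not n - 1).
pop_std = math.sqrt(sum((t - mean) ** 2 for t in y_true) / len(y_true))

print('baseline rmse %.2f, population std %.2f' % (baseline_rmse, pop_std))
```

Under this reading, the standard deviation of roughly 12300 above is a rough benchmark: a test RMSE well below it suggests the features carry some signal, while a test RMSE above it is no better than guessing the mean.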

In [17]:
y_compare = pd.DataFrame(y_test)
y_pred_series = pd.Series(y_pred)
y_compare['retweets_predictions'] = y_pred_series.values
y_compare

Unnamed: 0,retweets,retweets_predictions
2650,10759.0,41283.990406
4020,23105.0,19871.727712
4930,12279.0,69855.849492
4920,31355.0,8611.572848
1700,23685.0,32636.855684
5328,11805.0,19931.284235
3468,18598.0,13966.698996
5034,13549.0,33186.851439
4316,21349.0,6466.134911
5384,16197.0,38722.566837


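While the Random Forest regressor did surprisingly well (test RMSE of 9737.47, against 12501.21 for the grid-searched SGD model and 34237.21 for plain linear regression, which badly overfits with a train RMSE of only 323.83), we can definitely extract much better features from these tweets. See the Topic Modelling notebook, which tries to assign a topic vector to each tweet and use those vectors as features.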