## Using RandomForestRegressor on the Russian "Troll" Ads dataset.
This analysis gave some unique insights into what caused people to click on these Russian "troll" ads.

In [86]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

### Gathering our data and preparing it for modeling.
Here I read in the data set, turn the `month` and `year` columns into dummy variables so that the model can treat each value as its own category, and create two sets of data: one with only the ad clicks (the labels) and the other with the predictor variables (the features).

In [92]:
russiandf = pd.read_csv('russian.afteribm.csv')
russiandf.head(1)
features = pd.get_dummies(russiandf, columns = ['month', 'year'])

# Drop the column we want to predict (note the trailing space in its name)
features = features.drop('Ad Clicks ', axis = 1)

features_list = list(features.columns)

# This is the variable we want to predict
labels = np.array(russiandf['Ad Clicks '])
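
To make the dummy-variable step concrete, here is a minimal sketch on a made-up three-row frame. The `toy` data and its values are hypothetical and not taken from the ads data set. Each listed column is replaced by one indicator column per distinct value, and the remaining columns pass through unchanged.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Ad Clicks ': [10, 0, 25],
    'month': [2, 5, 2],
    'year': [2016, 2017, 2017],
})

# One indicator column per distinct month and year value
toy_features = pd.get_dummies(toy, columns=['month', 'year'])
toy_columns = list(toy_features.columns)
print(toy_columns)
```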

Next we check for `NA` values and set any we find to zero. This works for this data set because the columns with `NA`s are all dummy columns, where zero simply means the attribute is absent. Whether this is appropriate will vary on a case-by-case basis.

In [93]:
features = np.array(features)

features[np.isnan(features)] = 0
print('Number of NAs', np.isnan(features).sum())

Number of NAs 0
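
An alternative is to count and fill the missing values while the data is still a DataFrame, which also shows which columns are affected. This is a minimal sketch, assuming a small hypothetical frame: `toy` and its values are made up for illustration and only borrow a few column names from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values
toy = pd.DataFrame({
    'Ad Spend': [100.0, 250.0, 40.0],
    'OBSCENE': [1.0, np.nan, 0.0],
    'month_2': [np.nan, 1.0, 0.0],
})

# Count missing values per column and show only the affected columns
na_per_column = toy.isna().sum()
print(na_per_column[na_per_column > 0])

# Replace missing values with zero and confirm none remain
toy_filled = toy.fillna(0)
remaining = int(toy_filled.isna().sum().sum())
print('Number of NAs', remaining)
```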


Here we'll split the data into training and testing sets using a 70/30 split.

In [94]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.30,
                                                                           random_state = 42)

print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (1740, 778)
Training Labels Shape: (1740,)
Testing Features Shape: (747, 778)
Testing Labels Shape: (747,)


### Creating a Random Forest model
Now that the data is split 70/30 into training and test sets, we can create our model. `n_estimators` is an important parameter to watch out for: it sets the number of decision trees in the forest, and with a dataset this large, things can get slow fast. However, more estimators may improve our results.

In [95]:
# Instantiate model 
rf = RandomForestRegressor(n_estimators= 100, random_state=13)

# Train the model on training data
rf.fit(train_features, train_labels);
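
To illustrate what `n_estimators` controls, this sketch fits a small forest and counts the fitted trees. It uses synthetic data from `make_regression`, not the ads data, and the name `small_rf` is made up for the example. `n_jobs=-1` is an optional setting that builds trees on all available cores, which may help with training time.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# n_estimators is the number of trees; n_jobs=-1 uses all available cores
small_rf = RandomForestRegressor(n_estimators=25, random_state=13, n_jobs=-1)
small_rf.fit(X, y)

# Each fitted tree is stored in the estimators_ list
n_trees = len(small_rf.estimators_)
print('Number of trees:', n_trees)
```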

Looking at our predictions, our mean absolute error is pretty high, which suggests we need to optimize this model. However, the importance of each variable can still give us some insights into what causes people to click. We look at that further below.

In [96]:
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

# Calculate the absolute errors
errors = abs(predictions - test_labels)

# Print out the mean absolute error (mae)
print('Mean Absolute Error:', np.mean(errors))

Mean Absolute Error: 1152.1322076241474
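
Whether an error of this size counts as high depends on the scale of the target. One way to judge is to compare against a trivial baseline that always predicts the training median. Below is a minimal sketch on synthetic, skewed data. The data and variable names are assumptions for illustration and the numbers it prints say nothing about the ads data set. It uses scikit-learn's `DummyRegressor` and `mean_absolute_error`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data with a skewed target driven by the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.exp(X[:, 0]) * 100 + rng.normal(scale=5.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Baseline: always predict the median of the training labels
baseline = DummyRegressor(strategy='median').fit(X_train, y_train)
model = RandomForestRegressor(n_estimators=100, random_state=13).fit(X_train, y_train)

baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
model_mae = mean_absolute_error(y_test, model.predict(X_test))
print('Baseline MAE:', baseline_mae)
print('Forest MAE:', model_mae)
```

If the forest's error is not clearly below the baseline's, the features are adding little beyond what the typical click count already tells us.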


If we want, we can check how different numbers of estimators affect our model, as demonstrated below. The mean absolute error flattens out by around 100 estimators, and going to 150 brings no further improvement. To save time, I'll keep 100 estimators.

In [97]:
estimator_lst = [1, 10, 50, 100, 150]
for element in estimator_lst:
    # Fit a fresh forest with this many trees and score it on the test set
    rftest = RandomForestRegressor(n_estimators=element, random_state=13)
    rftest.fit(train_features, train_labels)
    errors = abs(rftest.predict(test_features) - test_labels)

    print(element, ': Mean Absolute Error:', np.mean(errors))

1 : Mean Absolute Error: 1797.838464970995
10 : Mean Absolute Error: 1233.21270638108
50 : Mean Absolute Error: 1160.8072744948047
100 : Mean Absolute Error: 1152.1322076241474
150 : Mean Absolute Error: 1152.9307778627738
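
Refitting from scratch for each value repeats a lot of work. As an alternative sketch, scikit-learn's `warm_start=True` option keeps the trees already built and only adds new ones when `n_estimators` is raised. The example below uses synthetic data and made-up names (`rf_grow`, `mae_by_size`). Its errors will not necessarily match those of independently fitted forests, because the trees are seeded differently.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data, for illustration only
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# warm_start=True reuses existing trees on each subsequent call to fit
rf_grow = RandomForestRegressor(warm_start=True, random_state=13)
mae_by_size = {}
for n_trees in [1, 10, 50, 100, 150]:
    rf_grow.set_params(n_estimators=n_trees)
    rf_grow.fit(X_train, y_train)  # only the missing trees are trained
    mae_by_size[n_trees] = mean_absolute_error(y_test, rf_grow.predict(X_test))
    print(n_trees, ': Mean Absolute Error:', mae_by_size[n_trees])
```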


Next we'll see which features are most important for our first model. Remember, `rf` is our original model with 100 estimators. As expected, `Ad Spend` has the highest importance for predicting clicks. This makes sense because the more costly ads most likely reached more people. `Age:.16-65+` is an unfortunate marker, as it includes nearly the entire population. This suggests that we should perhaps remove this variable and look for more informative age markers. Keep in mind that an importance score shows how much a feature helps predict clicks, not whether it raises or lowers them. With that caveat, the list shows explicit ads and ads that trigger emotional responses among the stronger predictors of clicks, which befits their name of "troll ads".
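
Impurity-based importances such as `feature_importances_` are computed on the training data and can favor continuous or high-cardinality features, so a cross-check can be useful before reading too much into a ranking. One option is permutation importance on held-out data, available as `sklearn.inspection.permutation_importance`. Here is a minimal sketch on synthetic data where only the first feature matters. The feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 10.0 * X[:, 0] + rng.normal(scale=1.0, size=400)
names = ['spend', 'noise_a', 'noise_b', 'noise_c']  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=13).fit(X_train, y_train)

# Shuffle each feature on the test set and measure the drop in score
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=13)
ranked = sorted(zip(names, result.importances_mean),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print('Variable: {:20} Importance: {:.4f}'.format(name, score))
```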

In [98]:
# Get numerical feature importances
importances = list(rf.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 4)) for feature, importance in zip(features_list, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the features and their importances
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

Variable: Ad Spend             Importance: 0.482
Variable: Age:.16-65+          Importance: 0.0427
Variable: Mexico.LaRaza        Importance: 0.0312
Variable: Interests:.MartinLutherKingIII Importance: 0.0244
Variable: word_count           Importance: 0.017
Variable: facet_gregariousness Importance: 0.0155
Variable: Mexico.BeingChicano.Mexicanamericanculture Importance: 0.0142
Variable: facet_cooperation    Importance: 0.0139
Variable: need_structure       Importance: 0.0129
Variable: OBSCENE              Importance: 0.012
Variable: mm_yyyy_2_2017       Importance: 0.011
Variable: SEXUALLY_EXPLICIT    Importance: 0.0105
Variable: Mexico.Latinhiphop.Chicano Importance: 0.0102
Variable: month_2              Importance: 0.01
Variable: facet_dutifulness    Importance: 0.0097
Variable: big5_openness        Importance: 0.0088
Variable: value_openness_to_change Importance: 0.0087
Variable: ATTACK_ON_AUTHOR     Importance: 0.0081
Variable: facet_imagination    Importance: 0.0075
Variable: valu