# Experiment Design - Analyzing and Predicting News Popularity in an Instant Messaging Service


---

Within this notebook the single task learning model, **STL-All** (SVM classifier trained on all the data from all channels) is implemented based on the Paper Analyzing and Predicting News Popularity in an Instant Messaging Service. It is checked whether or not the implementation and the results reported in the paper are reproducible and what information might be missing or wrong.

## Get Data & Feature Extraction

In [1]:
# Run only if needed, e.g. when the Telegram News Folder is not yet downloaded!
! git clone https://github.com/IceCream71/TelegramNews.git


Cloning into 'TelegramNews'...
Updating files:  36% (7/19)
Updating files:  42% (8/19)
Updating files:  47% (9/19)
Updating files:  52% (10/19)
Updating files:  57% (11/19)
Updating files:  63% (12/19)
Updating files:  68% (13/19)
Updating files:  73% (14/19)
Updating files:  78% (15/19)
Updating files:  84% (16/19)
Updating files:  89% (17/19)
Updating files:  94% (18/19)
Updating files: 100% (19/19)
Updating files: 100% (19/19), done.


In [None]:
#with open ('/content/TelegramNews/mongo/telegram/cnnbrk.json', 'rb') as f:
#  data = pd.read_json(f)

In [None]:
#df = pd.DataFrame(data)
#print(df.columns)

missing features (which can't be calculated due to missing information): 


* subscribers (shows the popularity of channel)
* channel age (date of first post? in days)
* frequent n-grams (frequent n-grams that news item contains)
* frequent hashtags (frequent hashtags that news item contains)
* frequent mentions (frequent mentions that news item contains)

what does frequent mean?

In [2]:
import pandas as pd
import bson
import seaborn as sns

def create_df_for_bson(path, channel): 
  with open (path, 'rb') as f:
    data = bson.decode_all(f.read())

  df = pd.DataFrame(data)
  df = handle_NAS(df)
  df = extract_features(df)
  df = cut_date_range(df)
  df = drop_unused_features(df)
  df = add_channel(df, channel)
  return df

def handle_NAS(df):
  df = df.dropna(subset=['views'],axis=0)
  return df

def extract_features(df):
  df = create_datetime_features(df)
  df = add_channel_features(df)
  df = create_entity_features(df)
  return df

def create_datetime_features(df):
  df['age'] =  df.date
  df['date'] = pd.to_datetime(df.date, unit='s')
  df['year'] = df.date.dt.year
  df['month'] = df.date.dt.month
  df['day'] = df.date.dt.day
  df['weekday'] = df.date.dt.dayofweek
  df['hour'] = df.date.dt.hour
  return df

def cut_date_range(df):
  # March 8, 2017 to October 8, 2017.
  start_date = "2017-03-08"
  end_date  = "2017-10-08"
  mask = (df.date > start_date) & (df.date <= end_date)
  return df.loc[mask]

def add_channel_features(df):
  df['minViews'] = df.views.min()
  df['maxViews'] = df.views.max()
  df['meanViews'] = df.views.mean()
  df['stdViews'] = df.views.std()
  df['avgPostsHour'] = df.groupby('hour').size().mean()
  df['avgPostsDay'] = df.groupby(['year', 'month', 'day']).size().mean()
  return df

def create_entity_features(df):
  # Convert the entity field from bson into the actual features.
  # For each row, check if media exists, if yes set media type.
  df['hasMedia'] = df.media.notnull()
  df['mediaType'] = df.apply(lambda x: x.media['_'] if x.hasMedia else '', axis=1)
  # Check if URL exists in entities, if exists, is a link
  df['hasLink'] = df.apply(lambda x: True if type(x.entities) != float and [d for d in x.entities if d['_'] ==  'messageEntityTextUrl'] else False, axis=1)
  # Collect the mentions from the entities feature
  df['mentions'] = df.apply(lambda x: len([d for d in x.entities if d['_'] == 'messageEntityMention']) if type(x.entities) != float else 0, axis=1)
  # Same as before for the hashtags
  df['hashtags'] = df.apply(lambda x: len([d for d in x.entities if d['_'] == 'messageEntityHashtag']) if type(x.entities) != float else 0, axis=1)
  return df

# keep needed features, more flexible as different datasets contain different features
def drop_unused_features(df):
  return df.drop(df.columns.difference(['age', 'date', 'year', 'month', 'day', 'weekday', 
                          'hour', 'minViews', 'maxViews', 'meanViews', 'stdViews',
                          'hasMedia', 'mediaType', 'hasLink', 'mentions', 'hashtags', 'views']),axis=1)

def add_channel(df, channel):
    df['channel']=channel
    return df


 
reutersWorld = create_df_for_bson('TelegramNews/mongo/telegram/ReutersWorld.bson', 'reutersWorld')
cnnBrk = create_df_for_bson('TelegramNews/mongo/telegram/CNNBrk.bson', 'cnnBrk')
#theGuardian = create_df_for_bson('/content/TelegramNews/mongo/telegram/TheGuardian.bson', 'theGuardian')
bbcBreaking = create_df_for_bson('TelegramNews/mongo/telegram/bbcbreaking.bson', 'bbc')
bbcPersian = create_df_for_bson('TelegramNews/mongo/telegram/bbcpersian.bson', 'bbc')
pressTV = create_df_for_bson('TelegramNews/mongo/telegram/presstv.bson', 'pressTV')
washingtonPost = create_df_for_bson('TelegramNews/mongo/telegram/washingtonpost.bson', 'washingtonPost')

In the paper it is not clearly stated which datasets are included in training SVM all. In the repository we found more datasets than the ones listed in the table comparing the different models. Hence we simply assumed that for training SVM all, all available datasets (except the Guardian as it doesn't contain data) are used and only the ones listed in the mentioned table are used for prediction.

Also they predicted BBC, however two datasets called bbcBreaking and bbcPersian do exist. It is not clear if BBC simply combines these two, or if only one of them is chosen as BBC.

In [3]:
dfs = [reutersWorld, cnnBrk, bbcBreaking, bbcPersian, pressTV, washingtonPost]
all = pd.concat(dfs)

We assumed that the train and test split is done first according to *Therefore, we first found the thresholds that satisfy these percentages for training and test sets and assigned a binary label to each post*. Also, if the split is done after assigning the popularity, in e.g. the cnn test set no popular posts can be found and the prediction would have an accuracy of 1.0, which is not the case in the paper.

In [4]:
def create_train_test_split(df):
  # the first six months (March - September) of the data is selected for training and the last month (October)
  # is selected for testing.
   # March 8, 2017 to September 7, 2017.
  train_start = "2017-03-08"
  train_end = "2017-09-07"
  mask = (df.date > train_start) & (df.date <= train_end)

  df = df.drop(labels=['date'],axis=1)
  df.mediaType = df.mediaType.astype('category').cat.codes

  training = df.loc[mask]
  test = df.loc[-mask]

  return training, test

allTrain, allTest = create_train_test_split(all)

print(f"Size of Training set: {len(allTrain.index)}")
print(f"Size of Test set: {len(allTest.index)}")
print(f"Percentage of train split: {round(100 * len(allTrain.index) / len(all.index),2)}%")

Size of Training set: 64916
Size of Test set: 11056
Percentage of train split: 85.45%


It is not clearly stated in the paper how the threshold which is used to state whether a post is popular or not is defined. In this case it is assumed, that the more views a post has, the more popular it is

In [5]:
# popular = 1
# not popular = 0

def assign_popularity (df, percent):
  df.sort_values(by=['views'], inplace=True, ascending=False)
  
  quantile = (100-percent)/100
  threshold = df.loc[:,'views'].quantile(quantile)
  is_popular = df.loc[:,'views'] > threshold

  df.loc[is_popular,'Popular']=1
  df.loc[~is_popular, 'Popular']=0

  return df

allTrain_5 = assign_popularity(allTrain, 5)
allTest_5 = assign_popularity(allTest, 5)

allTrain_25 = assign_popularity(allTrain, 25)
allTest_25 = assign_popularity(allTest, 25)

In [67]:
print(allTrain_5)

           views         age  year  month  day  weekday  hour  minViews  \
2716   1908500.0  1501109574  2017      7   26        2    22   78130.0   
4817   1482198.0  1495247786  2017      5   20        5     2   78130.0   
5007   1475636.0  1494845222  2017      5   15        0    10   78130.0   
4550   1364851.0  1495645444  2017      5   24        2    17   78130.0   
5729   1233545.0  1492715698  2017      4   20        3    19   78130.0   
...          ...         ...   ...    ...  ...      ...   ...       ...   
52145        6.0  1489419870  2017      3   13        0    15       5.0   
52167        6.0  1489418859  2017      3   13        0    15       5.0   
52168        5.0  1489418858  2017      3   13        0    15       5.0   
52169        5.0  1489418856  2017      3   13        0    15       5.0   
52170        5.0  1489418855  2017      3   13        0    15       5.0   

        maxViews      meanViews       stdViews  hasMedia  mediaType  hasLink  \
2716   2904081.0  4

In [44]:
# Apply Min-Max scaling fitted on the training data
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
#training
scaled_training_5 = pd.DataFrame(scaler.fit_transform(allTrain_5.drop('channel', axis=1)), columns=allTrain_5.drop('channel', axis=1).columns)
scaled_training_25 = pd.DataFrame(scaler.fit_transform(allTrain_25.drop('channel', axis=1)), columns=allTrain_25.drop('channel', axis=1).columns)
#test
scaled_test_5 = pd.DataFrame(scaler.transform(allTest_5.drop('channel', axis=1)), columns=allTest_5.drop('channel', axis=1).columns)
scaled_test_25 = pd.DataFrame(scaler.transform(allTest_25.drop('channel', axis=1)), columns=allTest_25.drop('channel', axis=1).columns)


def scale_partial(df, channel):
  is_channel = df.loc[:, 'channel']==channel
  channel_df = df.loc[is_channel]
  return pd.DataFrame(scaler.transform(channel_df.drop('channel', axis=1)), 
                      columns=channel_df.drop('channel', axis=1).columns)

cnn_scaled_test_5 = scale_partial(allTest_5, 'cnnBrk')
reuters_scaled_test_5 = scale_partial(allTest_5, 'reutersWorld')
press_scaled_test_5 = scale_partial(allTest_5, 'pressTV')
bbc_scaled_test_5 = scale_partial(allTest_5, 'bbc')

cnn_scaled_test_25 = scale_partial(allTest_25, 'cnnBrk')
reuters_scaled_test_25 = scale_partial(allTest_25, 'reutersWorld')
press_scaled_test_25 = scale_partial(allTest_25, 'pressTV')
bbc_scaled_test_25 = scale_partial(allTest_25, 'bbc')

print(scaled_training_5)

              views       age  year     month       day   weekday      hour  \
0      1.000000e+00  0.770360   0.0  0.666667  0.833333  0.333333  0.956522   
1      7.766292e-01  0.399538   0.0  0.333333  0.633333  0.833333  0.086957   
2      7.731909e-01  0.374071   0.0  0.333333  0.466667  0.000000  0.434783   
3      7.151426e-01  0.424694   0.0  0.333333  0.766667  0.333333  0.739130   
4      6.463418e-01  0.239356   0.0  0.166667  0.633333  0.500000  0.826087   
...             ...       ...   ...       ...       ...       ...       ...   
64911  5.239731e-07  0.030859   0.0  0.000000  0.400000  0.000000  0.652174   
64912  5.239731e-07  0.030795   0.0  0.000000  0.400000  0.000000  0.652174   
64913  0.000000e+00  0.030795   0.0  0.000000  0.400000  0.000000  0.652174   
64914  0.000000e+00  0.030795   0.0  0.000000  0.400000  0.000000  0.652174   
64915  0.000000e+00  0.030794   0.0  0.000000  0.400000  0.000000  0.652174   

       minViews  maxViews  meanViews  stdViews  has

## SVM

Aim is to predict top 5% and top 25% popular news for each agency. Therefore train svm classifier on all channels and predict popular news for only one **channel**

In [None]:
from sklearn import svm
import numpy as np
from sklearn.model_selection import GridSearchCV

In [7]:
def calculate_weights(train_df):
  n = len(train_df.index)
  labels = train_df.Popular
  popular = len(labels[labels == 1])
  not_popular = n - popular

  denominator = sum((1 / (popular if labels[k] == 1 else not_popular)) for k in range(1,n))
  bigLambda = map(lambda i: (1 / (popular if labels[i] == 1 else not_popular)) * denominator, range(0, n))
  return pd.Series(bigLambda)

weights_5 = calculate_weights(scaled_training_5)
weights_25 = calculate_weights(scaled_training_25)

### Top 5%
Interestingly using GridSearch including the weights as stated in the paper, the results are way more different than simply applying svc (using default settings). If adding the weights to the simple svc implementation, the results turn out to be exactly the same as when using the gridSearch implementation.

In [None]:
y5 = scaled_training_5['Popular']
x5 = scaled_training_5.drop('Popular', axis=1)

param_grid = {'C' : np.logspace(-3, 2, 6)}
weight_array = weights_5.unique()
weight_dict = {0 : weight_array[0], 1 : weight_array[1]}

In [8]:
clf = svm.SVC()
clf.fit(x5, y5)

SVC()

In [9]:
svc_classifier_grid = GridSearchCV(svm.SVC(), param_grid=param_grid)
clf_grid = svc_classifier_grid.fit(x5,y5)

In [10]:
svc_classifier_grid = GridSearchCV(svm.SVC(class_weight=weight_dict), param_grid=param_grid, n_jobs=8)
clf_grid_weighted = svc_classifier_grid.fit(x5,y5)

In [11]:
svc_classifier_grid = GridSearchCV(svm.LinearSVC(), cv=5, param_grid=param_grid, n_jobs=8)
clf_grid_linear = svc_classifier_grid.fit(x5,y5)



In [12]:
svc_classifier_grid = GridSearchCV(svm.LinearSVC(class_weight=weight_dict), param_grid=param_grid, cv=5, n_jobs=8)
clf_grid_weighted_linear = svc_classifier_grid.fit(x5,y5)

#### Prediction + Evaluation of Top5% models

In [18]:
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score, precision_score, recall_score, confusion_matrix

def model_metrics(true, predicted):
  acc = accuracy_score(true, predicted)
  ba = balanced_accuracy_score(true, predicted)
  prec = precision_score(true, predicted)
  recall = recall_score(true, predicted)
  f1 = f1_score(true, predicted)

  print('Accuracy: ', acc)
  print('Balanced Accuracy: ', ba)
  print('Precision: ', prec)
  print('Recall: ', recall)
  print('F1: ', f1)

  return confusion_matrix(true, predicted)

Prediction for CNN

In [20]:
y5_cnn_test = cnn_scaled_test_5['Popular']
x5_cnn_test = cnn_scaled_test_5.drop('Popular', axis=1)

cnn_default = clf.predict(x5_cnn_test)
print('clf - default settings')
print(model_metrics(y5_cnn_test, cnn_default))

clf - default settings
Accuracy:  0.9384684147794994
Balanced Accuracy:  0.5492895169344248
Precision:  0.0075
Recall:  0.15789473684210525
F1:  0.01431980906921241
[[6296  397]
 [  16    3]]


In [21]:
cnn_grid = clf_grid.predict(x5_cnn_test)
print('grid - default settings')
print(model_metrics(y5_cnn_test, cnn_grid))

grid - default settings
Accuracy:  0.9408522050059595
Balanced Accuracy:  0.55048479558376
Precision:  0.0078125
Recall:  0.15789473684210525
F1:  0.01488833746898263
[[6312  381]
 [  16    3]]


In [22]:
cnn_grid_weighted = clf_grid_weighted.predict(x5_cnn_test)
print('grid - weighted')
print(model_metrics(y5_cnn_test, cnn_grid_weighted))

grid - weighted
Accuracy:  0.9971692491060786
Balanced Accuracy:  0.5
Precision:  0.0
Recall:  0.0
F1:  0.0
[[6693    0]
 [  19    0]]


  _warn_prf(average, modifier, msg_start, len(result))


In [23]:
cnn_grid_linear = clf_grid_linear.predict(x5_cnn_test)
print('grid - linear')
print(model_metrics(y5_cnn_test, cnn_grid_linear))

grid - linear
Accuracy:  0.9971692491060786
Balanced Accuracy:  0.5
Precision:  0.0
Recall:  0.0
F1:  0.0
[[6693    0]
 [  19    0]]


  _warn_prf(average, modifier, msg_start, len(result))


In [24]:
cnn_grid_linear_weighted = clf_grid_weighted_linear.predict(x5_cnn_test)
print('grid - linear + weighted')
print(model_metrics(y5_cnn_test, cnn_grid_linear_weighted))

grid - linear + weighted
Accuracy:  0.9971692491060786
Balanced Accuracy:  0.5
Precision:  0.0
Recall:  0.0
F1:  0.0
[[6693    0]
 [  19    0]]


  _warn_prf(average, modifier, msg_start, len(result))


Prediction for BBC

In [25]:
y5_bbc_test = bbc_scaled_test_5['Popular']
x5_bbc_test = bbc_scaled_test_5.drop('Popular', axis=1)

bbc_default = clf.predict(x5_bbc_test)
print('clf - default settings')
print(model_metrics(y5_bbc_test, bbc_default))

clf - default settings
Accuracy:  1.0
Balanced Accuracy:  1.0
Precision:  1.0
Recall:  1.0
F1:  1.0
[[1148]]


In [26]:
bbc_grid = clf_grid.predict(x5_bbc_test)
print('grid - default settings')
print(model_metrics(y5_bbc_test, bbc_grid))

grid - default settings
Accuracy:  1.0
Balanced Accuracy:  1.0
Precision:  1.0
Recall:  1.0
F1:  1.0
[[1148]]


In [27]:
bbc_grid_weighted = clf_grid_weighted.predict(x5_bbc_test)
print('grid - weighted')
print(model_metrics(y5_bbc_test, bbc_grid_weighted))

grid - weighted
Accuracy:  0.7970383275261324
Balanced Accuracy:  0.7970383275261324
Precision:  1.0
Recall:  0.7970383275261324
F1:  0.8870576829859428
[[  0   0]
 [233 915]]




In [29]:
bbc_grid_linear_weighted = clf_grid_weighted_linear.predict(x5_bbc_test)
print('grid - linear + weighted')
print(model_metrics(y5_bbc_test, bbc_grid_linear_weighted))

grid - linear + weighted
Accuracy:  0.7970383275261324
Balanced Accuracy:  0.7970383275261324
Precision:  1.0
Recall:  0.7970383275261324
F1:  0.8870576829859428
[[  0   0]
 [233 915]]




Prediction for Reuters

In [30]:
y5_reuters_test = reuters_scaled_test_5['Popular']
x5_reuters_test = reuters_scaled_test_5.drop('Popular', axis=1)

reuters_default = clf.predict(x5_reuters_test)
print('clf - default settings')
print(model_metrics(y5_reuters_test, reuters_default))

clf - default settings
Accuracy:  0.14085267134376686
Balanced Accuracy:  0.5021888680425266
Precision:  0.13759479956663057
Recall:  1.0
F1:  0.2419047619047619
[[   7 1592]
 [   0  254]]


In [31]:
reuters_grid = clf_grid.predict(x5_reuters_test)
print('grid - default settings')
print(model_metrics(y5_reuters_test, reuters_grid))

grid - default settings
Accuracy:  0.14085267134376686
Balanced Accuracy:  0.5021888680425266
Precision:  0.13759479956663057
Recall:  1.0
F1:  0.2419047619047619
[[   7 1592]
 [   0  254]]


In [32]:
reuters_grid_weighted = clf_grid_weighted.predict(x5_reuters_test)
print('grid - weighted')
print(model_metrics(y5_reuters_test, reuters_grid_weighted))

grid - weighted
Accuracy:  0.8629249865083648
Balanced Accuracy:  0.5
Precision:  0.0
Recall:  0.0
F1:  0.0
[[1599    0]
 [ 254    0]]


  _warn_prf(average, modifier, msg_start, len(result))


In [33]:
reuters_grid_linear_weighted = clf_grid_weighted_linear.predict(x5_reuters_test)
print('grid - linear + weighted')
print(model_metrics(y5_reuters_test, reuters_grid_linear_weighted))

grid - linear + weighted
Accuracy:  0.8629249865083648
Balanced Accuracy:  0.5
Precision:  0.0
Recall:  0.0
F1:  0.0
[[1599    0]
 [ 254    0]]


  _warn_prf(average, modifier, msg_start, len(result))


Prediction for PressTV

In [34]:
y5_press_test = press_scaled_test_5['Popular']
x5_press_test = press_scaled_test_5.drop('Popular', axis=1)

press_default = clf.predict(x5_press_test)
print('clf - default settings')
print(model_metrics(y5_press_test, press_default))

clf - default settings
Accuracy:  0.14098360655737704
Balanced Accuracy:  0.14098360655737704
Precision:  1.0
Recall:  0.14098360655737704
F1:  0.24712643678160917
[[   0    0]
 [1048  172]]




In [35]:
press_grid = clf_grid.predict(x5_press_test)
print('grid - default settings')
print(model_metrics(y5_press_test, press_grid))

grid - default settings
Accuracy:  1.0
Balanced Accuracy:  1.0
Precision:  1.0
Recall:  1.0
F1:  1.0
[[1220]]


In [36]:
press_grid_weighted = clf_grid_weighted.predict(x5_press_test)
print('grid - weighted')
print(model_metrics(y5_press_test, press_grid_weighted))

grid - weighted
Accuracy:  0.0
Balanced Accuracy:  0.0
Precision:  0.0
Recall:  0.0
F1:  0.0
[[   0    0]
 [1220    0]]


  _warn_prf(average, modifier, msg_start, len(result))


In [37]:
press_grid_linear_weighted = clf_grid_weighted_linear.predict(x5_press_test)
print('grid - linear + weighted')
print(model_metrics(y5_press_test, press_grid_linear_weighted))

grid - linear + weighted
Accuracy:  0.0
Balanced Accuracy:  0.0
Precision:  0.0
Recall:  0.0
F1:  0.0
[[   0    0]
 [1220    0]]


  _warn_prf(average, modifier, msg_start, len(result))


<hr>

## Top 25%

In [38]:
y25 = scaled_training_25['Popular']
x25 = scaled_training_25.drop('Popular', axis=1)

param_grid = {'C' : np.logspace(-3, 2, 6)}
weight_array_25 = weights_25.unique()
weight_dict_25 = {0 : weight_array_25[0], 1 : weight_array_25[1]}

In [39]:
# default SVM
clf_25 = svm.SVC()
clf_25.fit(x25, y25)

SVC()

In [40]:
# default kernel (rbf), no weights
svc_classifier_grid_25 = GridSearchCV(svm.SVC(), param_grid=param_grid)
clf_grid_25 = svc_classifier_grid_25.fit(x25,y25)

In [41]:
# default kernel (rbf), weights
svc_classifier_grid_25 = GridSearchCV(svm.SVC(class_weight=weight_dict_25), param_grid=param_grid, n_jobs=8)
clf_grid_weighted_25 = svc_classifier_grid_25.fit(x25,y25)

In [42]:
# linear kernel, no weights
svc_classifier_grid_25 = GridSearchCV(svm.LinearSVC(), cv=5, param_grid=param_grid, n_jobs=8)
clf_grid_linear_25 = svc_classifier_grid_25.fit(x25,y25)



In [43]:
# linear kernel, weights
svc_classifier_grid_25 = GridSearchCV(svm.LinearSVC(class_weight=weight_dict_25), param_grid=param_grid, cv=5, n_jobs=8)
clf_grid_weighted_linear_25 = svc_classifier_grid_25.fit(x25,y25)

#### Prediction + Evaluation for Top 25% models

Prediction for CNN

In [62]:
y25_cnn_test = cnn_scaled_test_25['Popular']
x25_cnn_test = cnn_scaled_test_25.drop('Popular', axis=1)

cnn_default_25 = clf_25.predict(x25_cnn_test)
print('clf - default settings')
print(model_metrics(y25_cnn_test, cnn_default_25))

clf - default settings
Accuracy:  0.9384684147794994
Balanced Accuracy:  0.5492895169344248
Precision:  0.0075
Recall:  0.15789473684210525
F1:  0.01431980906921241
[[6296  397]
 [  16    3]]


In [46]:
cnn_grid_25 = clf_grid_25.predict(x25_cnn_test)
print('grid - default settings')
print(model_metrics(y25_cnn_test, cnn_grid_25))

grid - default settings
Accuracy:  0.9408522050059595
Balanced Accuracy:  0.55048479558376
Precision:  0.0078125
Recall:  0.15789473684210525
F1:  0.01488833746898263
[[6312  381]
 [  16    3]]


In [47]:
cnn_grid_weighted_25 = clf_grid_weighted_25.predict(x25_cnn_test)
print('grid - weighted')
print(model_metrics(y25_cnn_test, cnn_grid_weighted_25))

grid - weighted
Accuracy:  0.9971692491060786
Balanced Accuracy:  0.5
Precision:  0.0
Recall:  0.0
F1:  0.0
[[6693    0]
 [  19    0]]


  _warn_prf(average, modifier, msg_start, len(result))


In [48]:
cnn_grid_linear_25 = clf_grid_linear_25.predict(x25_cnn_test)
print('grid - linear')
print(model_metrics(y25_cnn_test, cnn_grid_linear_25))

grid - linear
Accuracy:  0.9971692491060786
Balanced Accuracy:  0.5
Precision:  0.0
Recall:  0.0
F1:  0.0
[[6693    0]
 [  19    0]]


  _warn_prf(average, modifier, msg_start, len(result))


In [63]:
cnn_grid_linear_weighted_25 = clf_grid_weighted_linear_25.predict(x25_cnn_test)
print('grid - linear + weighted')
print(model_metrics(y25_cnn_test, cnn_grid_linear_weighted_25))

grid - linear + weighted
Accuracy:  0.9971692491060786
Balanced Accuracy:  0.5
Precision:  0.0
Recall:  0.0
F1:  0.0
[[6693    0]
 [  19    0]]


  _warn_prf(average, modifier, msg_start, len(result))


Prediction for BBC

In [50]:
y25_bbc_test = bbc_scaled_test_25['Popular']
x25_bbc_test = bbc_scaled_test_25.drop('Popular', axis=1)

bbc_default_25 = clf_25.predict(x25_bbc_test)
print('clf - default settings')
print(model_metrics(y25_bbc_test, bbc_default_25))

clf - default settings
Accuracy:  1.0
Balanced Accuracy:  1.0
Precision:  1.0
Recall:  1.0
F1:  1.0
[[1148]]


In [51]:
bbc_grid_25 = clf_grid_25.predict(x25_bbc_test)
print('grid - default settings')
print(model_metrics(y25_bbc_test, bbc_grid_25))

grid - default settings
Accuracy:  1.0
Balanced Accuracy:  1.0
Precision:  1.0
Recall:  1.0
F1:  1.0
[[1148]]


In [52]:
bbc_grid_weighted_25 = clf_grid_weighted_25.predict(x25_bbc_test)
print('grid - weighted')
print(model_metrics(y25_bbc_test, bbc_grid_weighted_25))

grid - weighted
Accuracy:  0.7970383275261324
Balanced Accuracy:  0.7970383275261324
Precision:  1.0
Recall:  0.7970383275261324
F1:  0.8870576829859428
[[  0   0]
 [233 915]]




In [53]:
bbc_grid_linear_weighted_25 = clf_grid_weighted_linear_25.predict(x25_bbc_test)
print('grid - linear + weighted')
print(model_metrics(y25_bbc_test, bbc_grid_linear_weighted_25))

grid - linear + weighted
Accuracy:  0.7970383275261324
Balanced Accuracy:  0.7970383275261324
Precision:  1.0
Recall:  0.7970383275261324
F1:  0.8870576829859428
[[  0   0]
 [233 915]]




Prediction for Reuters

In [54]:
y25_reuters_test = reuters_scaled_test_25['Popular']
x25_reuters_test = reuters_scaled_test_25.drop('Popular', axis=1)

reuters_default_25 = clf_25.predict(x25_reuters_test)
print('clf - default settings')
print(model_metrics(y25_reuters_test, reuters_default_25))

clf - default settings
Accuracy:  0.14085267134376686
Balanced Accuracy:  0.5021888680425266
Precision:  0.13759479956663057
Recall:  1.0
F1:  0.2419047619047619
[[   7 1592]
 [   0  254]]


In [55]:
reuters_grid_25 = clf_grid_25.predict(x25_reuters_test)
print('grid - default settings')
print(model_metrics(y25_reuters_test, reuters_grid_25))

grid - default settings
Accuracy:  0.14085267134376686
Balanced Accuracy:  0.5021888680425266
Precision:  0.13759479956663057
Recall:  1.0
F1:  0.2419047619047619
[[   7 1592]
 [   0  254]]


In [56]:
reuters_grid_weighted_25 = clf_grid_weighted_25.predict(x25_reuters_test)
print('grid - weighted')
print(model_metrics(y25_reuters_test, reuters_grid_weighted_25))

grid - weighted
Accuracy:  0.8629249865083648
Balanced Accuracy:  0.5
Precision:  0.0
Recall:  0.0
F1:  0.0
[[1599    0]
 [ 254    0]]


  _warn_prf(average, modifier, msg_start, len(result))


In [64]:
reuters_grid_linear_weighted_25 = clf_grid_weighted_linear_25.predict(x25_reuters_test)
print('grid - linear + weighted')
print(model_metrics(y25_reuters_test, reuters_grid_linear_weighted_25))

grid - linear + weighted
Accuracy:  0.8629249865083648
Balanced Accuracy:  0.5
Precision:  0.0
Recall:  0.0
F1:  0.0
[[1599    0]
 [ 254    0]]


  _warn_prf(average, modifier, msg_start, len(result))


Prediction for PressTV

In [65]:
y25_press_test = press_scaled_test_25['Popular']
x25_press_test = press_scaled_test_25.drop('Popular', axis=1)

press_default_25 = clf_25.predict(x25_press_test)
print('clf - default settings')
print(model_metrics(y25_press_test, press_default_25))

clf - default settings
Accuracy:  0.14098360655737704
Balanced Accuracy:  0.14098360655737704
Precision:  1.0
Recall:  0.14098360655737704
F1:  0.24712643678160917
[[   0    0]
 [1048  172]]




In [None]:
press_grid_25 = clf_grid_25.predict(x25_press_test)
print('grid - default settings')
print(model_metrics(y25_press_test, press_grid_25))

In [None]:
press_grid_weighted_25 = clf_grid_weighted_25.predict(x25_press_test)
print('grid - weighted')
print(model_metrics(y25_press_test, press_grid_weighted_25))

In [66]:
press_grid_linear_weighted_25 = clf_grid_weighted_linear_25.predict(x25_press_test)
print('grid - linear + weighted')
print(model_metrics(y25_press_test, press_grid_linear_weighted_25))

grid - linear + weighted
Accuracy:  0.0
Balanced Accuracy:  0.0
Precision:  0.0
Recall:  0.0
F1:  0.0
[[   0    0]
 [1220    0]]


  _warn_prf(average, modifier, msg_start, len(result))
