# Newspaper classification

## Setup


Relevant libraries are imported:

In [1]:
from project_lib import Project
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from tensorflow import keras
from tensorflow.keras import backend as K

Data is loaded:

In [2]:
# The code was removed by Watson Studio for sharing.

In [3]:
project = Project(project_id = project_id, project_access_token = project_token)

In [4]:
df_file = project.get_file("Newspaper_Data_Processed.csv")
df_file.seek(0)
df = pd.read_csv(df_file)

df.head()

Unnamed: 0,Title,Newspaper,DateTime,title_nlp_0,title_nlp_1,title_nlp_2,title_nlp_3,title_nlp_4,title_nlp_5,title_nlp_6,...,title_nlp_91,title_nlp_92,title_nlp_93,title_nlp_94,title_nlp_95,title_polarity,title_subjectivity,title_length,title_word_count,title_avg_word_length
0,"Taliban versprechen ""Amnestie"" - und werben fü...",SZ,2021-08-17 10:00:51.109716,1.056343,-0.331942,0.377576,-0.755563,1.121652,1.051834,-0.692204,...,-0.145574,0.0037,-0.349982,-0.764239,-0.042902,1.0,0.0,73,11,5.727273
1,Frau tot aus dem Wasser geborgen,SZ,2021-08-17 10:00:51.109716,0.482346,-1.360563,2.204261,-0.980908,0.938673,2.182759,0.753072,...,-1.245529,-0.426957,-0.18753,0.610376,-0.090306,-1.0,0.0,32,6,4.5
2,Wenn ein einzelner Corona-Fall zum Lockdown führt,SZ,2021-08-17 10:00:51.109716,0.719709,0.718557,1.878793,0.517712,0.685538,1.683592,0.760514,...,1.376056,0.862674,-1.260844,-0.054087,-0.305862,0.0,0.0,49,7,6.142857
3,Mit diesen Argumenten überzeugen Sie Impfskept...,SZ,2021-08-17 10:00:51.109716,1.173115,0.131595,-1.171353,-0.516909,-0.515251,1.939519,1.22832,...,-1.644382,-1.046088,-1.202309,0.199285,1.258943,1.0,0.0,50,6,7.5
4,Vier Silben nähren die Titelhoffnungen,SZ,2021-08-17 10:00:51.109716,1.485402,-0.082825,0.894071,0.251246,-0.263389,0.542428,1.021385,...,-0.847832,-0.782896,-1.793984,2.312975,-0.406768,0.0,0.0,38,5,6.8


The shape of the data is: 

In [5]:
df.shape

(313, 104)

For each newspaper, the following number of title entries is available:

In [6]:
df[['Newspaper', 'Title']].groupby('Newspaper').count()

Newspaper,Title
SZ,118
Welt,195
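
For context on the scores reported later, the class counts above imply a majority-class baseline. The short sketch below only redoes that arithmetic; the `counts` dictionary is typed in by hand from the table and is not read from `df`.

```python
# Class counts copied by hand from the groupby output above.
counts = {"SZ": 118, "Welt": 195}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier should beat this value before its accuracy is considered informative.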


Classification-relevant variables are defined:

In [7]:
number_newspapers = df['Newspaper'].nunique()

feature_columns = [col for col in df.columns if 'title_nlp_' in col]
feature_columns.extend(['title_polarity', 'title_subjectivity'])
feature_columns.extend(['title_length', 'title_word_count', 'title_avg_word_length'])

## Training & test split

Data is split into training and test sets using 80% for training and 20% for testing:

In [8]:
n_samples, n_features = df.shape[0], len(feature_columns)
rng = np.random.RandomState(0)

X = np.array(df[feature_columns])
y = np.array(df['Newspaper'])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, test_size = 0.2)
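
The split above is random but not stratified, so the class proportions in the test set can drift from those in the full data. As a sketch of one possible alternative (not what this notebook ran), `train_test_split` accepts a `stratify` argument. The arrays `X_demo` and `y_demo` below are synthetic stand-ins that only reuse the class counts reported earlier.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 313 rows with the class counts reported above.
y_demo = np.array(["SZ"] * 118 + ["Welt"] * 195)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the SZ/Welt ratio roughly equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

share_train = np.mean(y_tr == "SZ")
share_test = np.mean(y_te == "SZ")
print(len(y_tr), len(y_te), round(share_train, 3), round(share_test, 3))
```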

## Feature scaling

Features are scaled for classification using the *MinMaxScaler()* implemented in sklearn:

In [9]:
min_max_scaler = preprocessing.MinMaxScaler()

X_train = min_max_scaler.fit_transform(X_train)
X_test = min_max_scaler.transform(X_test)
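
The scaler is fitted on the training data only and then applied to the test data, so test values can fall outside the range [0, 1]. A minimal sketch with made-up numbers (the arrays `train` and `test` are illustrative, not taken from the data set) shows the per-column formula `(x - min_train) / (max_train - min_train)`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: two features, three training rows, one test row.
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
test = np.array([[2.0, 40.0]])

# Fit on the training data only, as in the cell above.
scaler = MinMaxScaler().fit(train)
scaled_test = scaler.transform(test)

# Manual version of the same transformation using training min and max.
col_min, col_max = train.min(axis=0), train.max(axis=0)
manual_test = (test - col_min) / (col_max - col_min)

print(scaled_test)
print(manual_test)
```

The second test feature (40.0) exceeds the training maximum (30.0), so its scaled value is larger than 1.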

## Classification

Classification is run using three different algorithms:

1. A Support Vector Machine (SVM), tuned over linear and RBF kernels, is used for computationally simple proof-of-concept prediction.
2. Gradient boosting (labelled XGBoost in this notebook, implemented with sklearn's *GradientBoostingClassifier*) is used as a state-of-the-art algorithm for supervised machine learning.
3. A neural network is used as a state-of-the-art deep learning model for comparison with the XGBoost and SVM algorithms.

Both XGBoost and the neural network were selected as they can model non-linear associations. This is likely to be important for complex text-based features that include combinations of different words and sentiment.

The primary **evaluation metric** is the F1-score, as it is more robust than accuracy when classes are imbalanced (as is the case here). Standard model accuracy is also reported for comparison.
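
To make the metric concrete, the sketch below computes the weighted F1-score by hand and compares it with `f1_score(..., average='weighted')`, the call used in the cells that follow. The labels in `y_true_demo` and `y_pred_demo` are invented for illustration, and `weighted_f1` is a hypothetical helper, not part of sklearn.

```python
from sklearn.metrics import accuracy_score, f1_score

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    total = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += (tp + fn) * f1  # weight by the number of true samples of this class
    return total / len(y_true)

# Invented labels: 3 SZ and 5 Welt titles, with two SZ titles misclassified.
y_true_demo = ["SZ", "SZ", "SZ", "Welt", "Welt", "Welt", "Welt", "Welt"]
y_pred_demo = ["SZ", "Welt", "Welt", "Welt", "Welt", "Welt", "Welt", "Welt"]

manual = weighted_f1(y_true_demo, y_pred_demo)
reference = f1_score(y_true_demo, y_pred_demo, average="weighted")
accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(manual, reference, accuracy)
```

In this toy case the weighted F1-score is lower than the accuracy because the errors are concentrated in the minority class.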

### Support vector classifier

In [10]:
svc_parameters = {'kernel':('linear', 'rbf'), 'C':[0.1,1,8,16,32]}
svc = svm.SVC()
clf_svm = GridSearchCV(svc, svc_parameters, cv=5)
clf_svm.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': [0.1, 1, 8, 16, 32], 'kernel': ('linear', 'rbf')})
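
After fitting, a `GridSearchCV` object exposes the selected hyperparameters and the mean cross-validated score of the best candidate. The sketch below shows this on synthetic data from `make_classification`, which stands in for the newspaper features; the printed values therefore say nothing about the real model.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic two-class data as a stand-in for X_train and y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Same parameter grid as in the cell above.
param_grid = {"kernel": ("linear", "rbf"), "C": [0.1, 1, 8, 16, 32]}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X_demo, y_demo)

# Best hyperparameter combination and its mean cross-validated accuracy.
print(search.best_params_)
print(round(search.best_score_, 3))
```

Calling `clf_svm.best_params_` in this notebook would show in the same way whether the linear or the RBF kernel was selected.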

In [12]:
y_pred = clf_svm.predict(X_test)

print("The accuracy for the test set is: " + str(accuracy_score(y_test, y_pred)))
print("The F1 score for the test set is: " + str(f1_score(y_test, y_pred, average = 'weighted')))

The accuracy for the test set is: 0.9523809523809523
The F1 score for the test set is: 0.9522578739430639


### XGBoost

In [13]:
xgboost_parameters = {
    "learning_rate": [0.01, 0.1, 0.2],
    "min_samples_split": np.linspace(0.1, 0.5, 3),
    "min_samples_leaf": np.linspace(0.1, 0.5, 3),
    "max_depth":[3,5,8],
    "max_features":["log2","sqrt"],
    "subsample":[0.5, 0.8, 1.0],
    "n_estimators":[10]
    }

xgboost = GradientBoostingClassifier(random_state=0)
clf_xgboost = GridSearchCV(xgboost, xgboost_parameters, cv=5)
clf_xgboost.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=GradientBoostingClassifier(random_state=0),
             param_grid={'learning_rate': [0.01, 0.1, 0.2],
                         'max_depth': [3, 5, 8],
                         'max_features': ['log2', 'sqrt'],
                         'min_samples_leaf': array([0.1, 0.3, 0.5]),
                         'min_samples_split': array([0.1, 0.3, 0.5]),
                         'n_estimators': [10], 'subsample': [0.5, 0.8, 1.0]})
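
The grid above is considerably larger than the SVM grid. As a quick check of the search cost, sklearn's `ParameterGrid` can enumerate the candidates; the sketch repeats the parameter dictionary so that it runs on its own.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as defined in the cell above.
param_grid = {
    "learning_rate": [0.01, 0.1, 0.2],
    "min_samples_split": np.linspace(0.1, 0.5, 3),
    "min_samples_leaf": np.linspace(0.1, 0.5, 3),
    "max_depth": [3, 5, 8],
    "max_features": ["log2", "sqrt"],
    "subsample": [0.5, 0.8, 1.0],
    "n_estimators": [10],
}

n_candidates = len(ParameterGrid(param_grid))  # number of hyperparameter combinations
n_fits = n_candidates * 5                      # one fit per candidate and CV fold (cv=5)
print(n_candidates, n_fits)
```

The count excludes the final refit of the best candidate on the full training set.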

In [14]:
y_pred = clf_xgboost.predict(X_test)

print("The accuracy for the test set is: " + str(accuracy_score(y_test, y_pred)))
print("The F1 score for the test set is: " + str(f1_score(y_test, y_pred, average = 'weighted')))

The accuracy for the test set is: 0.7777777777777778
The F1 score for the test set is: 0.7724358974358974


### Neural network

One-hot encoded vectors are created for the newspaper outcome variable to align with *keras* format:

In [15]:
y_train_cat = pd.get_dummies(y_train)
y_test_cat = pd.get_dummies(y_test)
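
Encoding the training and test labels separately works here because both newspapers occur in both subsets. If a subset lacked one class, `pd.get_dummies` would return fewer columns. The sketch below illustrates this with an invented label array and shows one possible safeguard, a categorical with a fixed category list; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

classes = ["SZ", "Welt"]

# Invented split that happens to contain only one newspaper.
y_small = np.array(["Welt", "Welt"])

# Plain one-hot encoding only creates columns for the labels it sees.
naive = pd.get_dummies(y_small)

# Fixing the categories keeps one column per known class.
fixed = pd.get_dummies(pd.Series(pd.Categorical(y_small, categories=classes)))

print(list(naive.columns))
print(list(fixed.columns))
```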

To assess the F1-score for the Keras model, the metric is defined manually, as it is not available as a built-in metric in the *keras* version used here. The functions follow the definition given at https://neptune.ai/blog/implementing-the-macro-f1-score-in-keras:

In [16]:
# F1 measures definition according to: F1 = 2 * (precision * recall) / (precision + recall)
def custom_f1(y_true, y_pred):    
    def recall_m(y_true, y_pred):
        TP = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        Positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        
        recall = TP / (Positives+K.epsilon())    
        return recall 
    
    
    def precision_m(y_true, y_pred):
        TP = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        Pred_Positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    
        precision = TP / (Pred_Positives+K.epsilon())
        return precision 
    
    precision, recall = precision_m(y_true, y_pred), recall_m(y_true, y_pred)
    
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
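
To see what this metric computes, the sketch below mirrors the same arithmetic in NumPy on a small invented batch. `custom_f1_np` is a hypothetical helper written for illustration, and `EPS` is assumed to equal the default value of `K.epsilon()`.

```python
import numpy as np

EPS = 1e-7  # assumed to match the default K.epsilon()

def custom_f1_np(y_true, y_pred):
    # Counts are pooled over all samples and both output columns.
    tp = np.sum(np.round(np.clip(y_true * y_pred, 0, 1)))
    positives = np.sum(np.round(np.clip(y_true, 0, 1)))
    pred_positives = np.sum(np.round(np.clip(y_pred, 0, 1)))
    precision = tp / (pred_positives + EPS)
    recall = tp / (positives + EPS)
    return 2 * precision * recall / (precision + recall + EPS)

# Invented batch: one-hot targets and sigmoid-style outputs for three titles.
y_true_demo = np.array([[1, 0], [0, 1], [0, 1]], dtype=float)
y_pred_demo = np.array([[0.9, 0.2], [0.6, 0.7], [0.1, 0.8]])

score = custom_f1_np(y_true_demo, y_pred_demo)
print(round(score, 4))
```

In the second row both outputs exceed 0.5, so that title counts as two predicted positives. Because the counts are pooled across classes, and Keras averages function metrics over batches, the value is not directly comparable to sklearn's weighted F1-score.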

The neural network is built from three dense hidden layers with ReLU activation and 32 to 64 nodes each, followed by a two-node output layer. A dropout layer with a rate of 0.5 is added to improve model generalisability:

In [17]:
model = keras.Sequential([
    keras.layers.Dense(32, input_shape=(X_train.shape[1],), activation='relu'),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(2, activation='sigmoid')])

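# Note: the output layer above already applies a sigmoid, so the model outputs
# probabilities rather than logits; from_logits=False would match that activation.
model.compile(optimizer='adam', 
              loss=keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=[custom_f1,'accuracy'])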

model.fit(X_train, y_train_cat, batch_size=4, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f7b22f52d60>
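
The compile call above passes `from_logits=True` although the final layer already applies a sigmoid. The NumPy sketch below illustrates, under the assumption that probabilities are simply treated as logits, what a second sigmoid would do to confident predictions. It does not reproduce Keras internals, and how a given Keras version handles this combination is not verified here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p):
    # Mean binary cross-entropy for probabilities p and 0/1 targets y.
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Invented, confident and correct sigmoid outputs with their targets.
probs = np.array([0.01, 0.99])
targets = np.array([0.0, 1.0])

loss_on_probs = binary_cross_entropy(targets, probs)
# Treating probabilities as logits applies the sigmoid a second time,
# which squeezes every value into the interval [0.5, 0.73].
loss_double_sigmoid = binary_cross_entropy(targets, sigmoid(probs))

print(round(loss_on_probs, 4), round(loss_double_sigmoid, 4))
```

Under this assumption, the loss stays well above zero even for near-perfect predictions, which would weaken the training signal.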

The model is evaluated on the test set:

In [18]:
model.evaluate(X_test, y_test_cat)



[0.2620910704135895, 0.8568547964096069, 0.8571428656578064]
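
The list returned by `model.evaluate` holds the loss followed by the compiled metrics, here `custom_f1` and accuracy. To score the network with the same sklearn functions used for the other two classifiers, the predicted probabilities can be mapped back to newspaper labels. The sketch below uses an invented probability array in place of `model.predict(X_test)` and assumes the one-hot columns are ordered alphabetically (SZ, Welt), as `pd.get_dummies` is expected to produce.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Assumed column order of the one-hot encoding.
class_names = np.array(["SZ", "Welt"])

# Invented stand-in for model.predict(X_test): one row per title, one column per class.
probs_demo = np.array([[0.8, 0.3], [0.2, 0.9], [0.6, 0.5], [0.1, 0.7]])
y_true_demo = np.array(["SZ", "Welt", "Welt", "Welt"])

# Pick the class with the highest output for each title.
y_pred_demo = class_names[np.argmax(probs_demo, axis=1)]

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
f1_demo = f1_score(y_true_demo, y_pred_demo, average="weighted")
print(list(y_pred_demo), acc_demo, round(f1_demo, 3))
```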

## Conclusion

Classification of newspaper title data shows that SZ and Welt titles can be distinguished from title features alone with relatively high accuracy. On data collected across multiple days, the SVC achieves the best test-set prediction (F1 score=*0.95*), followed by the neural network (F1 score=*0.86*), while the gradient boosting classifier labelled XGBoost performs worst (F1 score=*0.77*).