This notebook processes the scraped data, converts the text columns to numeric vectors (Part 1), and trains classification models that use the processed text as predictors (Parts 2 and 3).

Locally on Richard's computer, the notebook takes approximately 1.5 hours to execute completely: Part 1 needs ~45 min, Parts 2 and 3 ~20 min each.

Besides the usual numpy, pandas, etc., the required modules are spacy, scikit-learn, and pytorch. The German spaCy model needs to be downloaded separately.

# Part 1: Text Processing

Here we import the scraped data, split the text columns into words, and extract and normalize the tokens. Next, the tokens are converted to embeddings.

In [1]:
### Setup

# pip install spacy
# python -m spacy download de_core_news_sm  - for small model (13 MB)
# python -m spacy download de_core_news_md  - for medium model (42 MB)
# source - https://spacy.io/models/de


import spacy
from spacy.lang.de.examples import sentences 


# Example from documentation
nlp = spacy.load("de_core_news_sm")
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

Die ganze Stadt ist ein Startup: Shenzhen ist das Silicon Valley für Hardware-Firmen
Die DET nk
ganze ADJ nk
Stadt NOUN sb
ist AUX ROOT
ein DET nk
Startup NOUN pd
: PUNCT punct
Shenzhen NOUN sb
ist AUX cj
das DET nk
Silicon PROPN pnc
Valley PROPN sb
für ADP mnr
Hardware-Firmen NOUN nk


In [2]:
## Data loading

import numpy as np
import pandas as pd

standard_data = pd.read_csv('./data/4yrs_derstandard_frontpage_data.csv')
standard_data

Unnamed: 0,title,subtitle,link,datetime,kicker,n_posts,storylabels
0,Real Madrid stolpert mit Aluminiumpech im Tite...,Die Königlichen können Bilbao daheim nicht bes...,https://www.derstandard.at/story/2000112599363...,2019-12-22T23:44,Primera Division,30,
1,Bolivien weist venezolanische Diplomaten aus,InterimspräsidentinJeanine Áñez wirft denBotsc...,https://www.derstandard.at/story/2000112598924...,2019-12-22T22:50,Übergangsregierung,16,
2,Erdoğan warnt vor neuer Flüchtlingswelle aus S...,"Türkischer Präsident: ""80.000 Menschen Richtun...",https://www.derstandard.at/story/2000112598130...,2019-12-22T21:43,Bürgerkrieg,104,
3,Massenkarambolage mit 63 Fahrzeugen in Virginia,Autos stießen auf vereister Brücke zusammen,https://www.derstandard.at/story/2000112597972...,2019-12-22T21:29,Weihnachtsverkehr,35,
4,"Salzburg schlägt Caps, Meister KAC mit vierter...",Die Bullen sind damit der Gewinner der Runde: ...,https://www.derstandard.at/story/2000112595206...,2019-12-22T20:54,Eishockey,4,
...,...,...,...,...,...,...,...
182102,Wer braucht die Kirche?,Dass sich die Kirche nach soschwerwiegendenVer...,https://www.derstandard.at/story/3000000200743...,2023-12-22T06:00,Dominik Straub,1,Kommentar
182103,Sonderregelung verlängert: Mehr als 1.000 Ärzt...,"Der""Pandemieparagraf""im Ärztegesetz hat mehr a...",https://www.derstandard.at/story/3000000200621...,2023-12-22T06:00,Pandemieparagraf,148,
182104,"Stadtforscher: ""Architektur ist Teil unserer W...",Jetzt anhören: In Zukunft müssen Städte wieder...,https://www.derstandard.at/story/3000000200499...,2023-12-22T06:00,Edition Zukunft,54,Podcast
182105,David Alaba zum zehnten Mal zu Österreichs Fuß...,Zehn von zwölf Trainern wählten den derzeit ve...,https://www.derstandard.at/story/3000000200745...,2023-12-22T05:46,Fußball,34,


## Tokenizing + lemmatizing, embeddings

First, the title and subtitle columns are split into individual words. Second, the words are replaced by their standard forms, i.e. lemmas (großem → groß, rettete → retten). Finally, the standard forms are converted to numeric vectors using spaCy's tok2vec embeddings.

In [3]:
# Initializing four new columns: tokens and vectors for both title and subtitle
standard_data['title_tokens'] = standard_data['title']
standard_data['subtitle_tokens'] = standard_data['subtitle']

standard_data['title_vectors'] = standard_data['title_tokens']
standard_data['subtitle_vectors'] = standard_data['subtitle_tokens']

# Load model
nlp = spacy.load("de_core_news_sm")
nlp.get_pipe('lemmatizer')
nlp.get_pipe("tok2vec")

<spacy.pipeline.tok2vec.Tok2Vec at 0x17e1a8cbcb0>

In [4]:
# loops through rows

for index, row in standard_data.iterrows():
    
    # TITLE

    text = nlp(row['title_tokens'].replace("-", " ").replace(",", " ").replace(": ", " "))

    token_list = ' '.join([token.lemma_ for token in text]) # list of standardized tokens joined to a string

    vector_list = [token.vector for token in nlp(token_list)] # list of numeric vectors

    # store in the dataframe

    standard_data.at[index, 'title_tokens'] = token_list 
    standard_data.at[index, 'title_vectors'] = vector_list 

    # SAME FOR SUBTITLE

    # try/except because the subtitle is sometimes missing, in which case nlp() raises an error

    try:

        text = nlp(row['subtitle_tokens'])
        token_list = ' '.join([token.lemma_ for token in text]) # list of standardized tokens joined to a string
        vector_list = [token.vector for token in nlp(token_list)] # list of numeric vectors

    except Exception:
        token_list = '' # if the subtitle is missing, the token string is empty
        vector_list = []

    standard_data.at[index, 'subtitle_tokens'] = token_list # store in the dataframe
    standard_data.at[index, 'subtitle_vectors'] = vector_list



In [5]:
standard_data.head(4)

Unnamed: 0,title,subtitle,link,datetime,kicker,n_posts,storylabels,title_tokens,subtitle_tokens,title_vectors,subtitle_vectors
0,Real Madrid stolpert mit Aluminiumpech im Tite...,Die Königlichen können Bilbao daheim nicht bes...,https://www.derstandard.at/story/2000112599363...,2019-12-22T23:44,Primera Division,30,,Real Madrid stolpern mit Aluminiumpech in Tite...,der Königlich können Bilbao daheim nicht besiegen,"[[1.4714637, 2.1941755, 0.37696052, 0.46038526...","[[2.48862, 0.1332562, -1.0822431, -2.6994205, ..."
1,Bolivien weist venezolanische Diplomaten aus,InterimspräsidentinJeanine Áñez wirft denBotsc...,https://www.derstandard.at/story/2000112598924...,2019-12-22T22:50,Übergangsregierung,16,,Bolivien weisen venezolanisch Diplomat aus,InterimspräsidentinJeanine Áñez werfen denBots...,"[[-1.3009748, -0.9428248, -3.3170214, 2.703031...","[[1.1021554, 1.5151364, -0.9405582, -2.410718,..."
2,Erdoğan warnt vor neuer Flüchtlingswelle aus S...,"Türkischer Präsident: ""80.000 Menschen Richtun...",https://www.derstandard.at/story/2000112598130...,2019-12-22T21:43,Bürgerkrieg,104,,Erdoğan warnen vor neu Flüchtlingswelle aus Sy...,Türkischer Präsident -- -- 80.000 Mensch Richt...,"[[-1.0778335, 0.5461743, 2.5889783, -1.2373338...","[[-0.37814975, -0.962201, 0.65647066, -5.07107..."
3,Massenkarambolage mit 63 Fahrzeugen in Virginia,Autos stießen auf vereister Brücke zusammen,https://www.derstandard.at/story/2000112597972...,2019-12-22T21:29,Weihnachtsverkehr,35,,Massenkarambolage mit 63 Fahrzeug in Virginia,Auto stoßen auf vereist Brücke zusammen,"[[0.4362324, 2.8947713, -4.595852, -0.09335708...","[[-4.584771, -1.4883125, 0.2176263, 4.631511, ..."


In [6]:
# Example of list of numeric vectors representing a subtitle

standard_data.loc[1, 'subtitle_vectors']

[array([ 1.1021554 ,  1.5151364 , -0.9405582 , -2.410718  , -1.1881657 ,
        -3.615175  ,  1.0234319 ,  4.1538424 , -1.5936106 , -2.785506  ,
         0.6842209 ,  0.07389921,  0.5360052 ,  0.31706142, -0.9196383 ,
        -0.82573223, -3.5580513 , -1.8286765 ,  0.10499629, -2.6490862 ,
         0.7699168 ,  0.9488483 ,  3.0474105 , -3.499149  , -1.3851355 ,
        -1.3532616 , -2.1080291 ,  1.4907815 ,  1.6701344 ,  0.32809997,
         3.3263462 ,  1.2118888 , -0.66293633, -0.6889895 , -0.9815758 ,
         7.8648634 , -2.7158377 ,  1.1486918 ,  1.389757  ,  2.5390139 ,
        -2.2970562 ,  1.9853625 , -0.53672844,  1.7880895 ,  1.0695057 ,
        -1.396502  , -0.44181645,  8.145238  , -6.583647  ,  2.6661644 ,
        -3.5900345 ,  5.988452  ,  0.7246865 ,  0.7114573 , -0.97419953,
        -0.59691894, -1.2567514 ,  1.227717  , -0.20339483,  1.2106575 ,
        -3.6484938 , -0.32804242, -0.22830245, -2.9978542 , -1.7434182 ,
         1.2077783 ,  1.3858008 , -2.3194008 ,  2.6

# Part 2: Binary classifier

## Defining predictors and dependent variable

In [7]:
### sum vectors for words  to get a vector for a sentence

import numpy as np

standard_data_copy = standard_data.copy()

for index,row in standard_data_copy.iterrows():

    l = standard_data_copy.loc[index, 'title_vectors']
    standard_data_copy.at[index, 'title_vectors'] = np.sum(l, axis=0)

    l = standard_data_copy.loc[index, 'subtitle_vectors']
    standard_data_copy.at[index, 'subtitle_vectors'] = np.sum(l, axis=0)
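
Summing works for any non-empty list of token vectors, but `np.sum([], axis=0)` returns the scalar `0.0`, so rows with a missing subtitle need special handling before the vectors can be stacked into one matrix. Below is a minimal sketch of a pooling helper that always returns a fixed-length vector. The name `pool_vectors` and the toy 4-dimensional vectors are illustrative assumptions; 96 is the vector width seen in this notebook.

```python
import numpy as np

def pool_vectors(vectors, dim=96, how="sum"):
    """Collapse a list of token vectors into one fixed-length vector.

    `pool_vectors` is a hypothetical helper, not part of spaCy or this notebook.
    """
    if len(vectors) == 0:
        # np.sum([], axis=0) would give the scalar 0.0, so return zeros instead
        return np.zeros(dim, dtype=np.float32)
    stacked = np.vstack(vectors)
    if how == "mean":
        return stacked.mean(axis=0)
    return stacked.sum(axis=0)

# Toy 4-dimensional "token vectors" standing in for the 96-dimensional ones
tokens = [np.array([1.0, 0.0, 2.0, -1.0]), np.array([0.5, 1.0, -2.0, 1.0])]
print(pool_vectors(tokens, dim=4))              # element-wise sum
print(pool_vectors(tokens, dim=4, how="mean"))  # element-wise mean
print(pool_vectors([], dim=4))                  # zero vector for an empty text
```

Mean pooling is a common alternative that keeps long and short headlines on the same scale; the notebook itself uses the sum.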

In [8]:
# reshaping

X = np.vstack(standard_data_copy['title_vectors'].to_numpy())
X[0:2]

array([[  9.119195  ,   0.4420882 ,  -6.585746  ,  -0.36528832,
          5.3427124 ,   8.7853775 ,  -5.9323215 ,  15.100472  ,
          8.413376  ,   0.498165  ,  15.515087  , -10.902954  ,
        -14.050985  ,   6.649028  , -11.754989  ,   6.057748  ,
          8.1888    ,  -2.3931763 ,  -3.9362216 ,  -2.4156115 ,
          1.1384144 ,  16.5233    ,  -3.1429567 ,  12.609358  ,
        -10.458169  , -14.882816  ,  -7.4182734 ,   2.4116597 ,
         -5.4921722 ,   7.045065  , -10.576113  ,  -2.9996653 ,
         -2.5590808 ,  15.348734  ,  17.18261   ,   6.2615204 ,
          1.5978868 ,   4.3599935 ,   6.792941  ,   0.34123358,
         -4.33022   ,   1.9403619 ,  -4.0108805 ,  -3.0241973 ,
          2.9737525 ,   5.95477   ,  -1.0211641 ,  -9.582537  ,
         16.537086  ,  -1.5229907 , -17.975307  ,  14.044141  ,
         17.179005  , -15.048843  ,  -3.5073388 ,  -9.0553465 ,
          7.744524  ,  -1.2955302 ,  -1.1938403 ,  -0.9916693 ,
         -5.2311893 , -15.524109  ,  -1.

In [9]:
# split y to above and under 50 posts (regression -> classification)

# 1 if the article has more than 50 posts, else 0; reshaped to a column vector of shape (n, 1)
y = (standard_data_copy['n_posts'].to_numpy() > 50).astype(int).reshape(-1, 1)

print(y.shape)
print(X.shape)

(182107, 1)
(182107, 96)
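
Thresholding turns the comment count into a binary target, and the share of the larger class gives the accuracy that a constant predictor would reach. The sketch below illustrates both steps on a handful of toy counts; the array `n_posts` and the variable names are illustrative, not the notebook's data.

```python
import numpy as np

# Toy comment counts for six articles (illustrative only)
n_posts = np.array([30, 16, 104, 35, 4, 148])

# 1 if the article drew more than 50 posts, else 0; shape (n, 1) as in the notebook
y = (n_posts > 50).astype(int).reshape(-1, 1)

# Accuracy of always predicting the more frequent class
share_positive = y.mean()
baseline = max(share_positive, 1 - share_positive)

print(y.ravel())
print(baseline)
```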


In [10]:
# train-test split

from sklearn.model_selection import train_test_split


X_train, X_test, Y_train, Y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42069)


In [11]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

In [12]:
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)

Y_train = torch.tensor(Y_train, dtype=torch.float32).reshape(-1, 1)
Y_test = torch.tensor(Y_test, dtype=torch.float32).reshape(-1, 1)

In [13]:
# two hidden layers, each with 12 nodes

class BinaryClassifier(nn.Module):
    def __init__(self, n_inputs = 96):
        super().__init__()
        self.hidden1 = nn.Linear(n_inputs, 12)
        self.act1 = nn.ReLU()
        self.hidden2 = nn.Linear(12, 12)
        self.act2 = nn.ReLU()
        self.output = nn.Linear(12, 1)
        self.act_output = nn.Sigmoid()

    def forward(self, x):
        x = self.act1(self.hidden1(x))
        x = self.act2(self.hidden2(x))
        x = self.act_output(self.output(x))
        return x
    
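# Mean absolute error on the predicted probability; the deprecated size_average/reduce
# arguments are dropped, reduction='mean' gives the same behavior
loss_fn = nn.L1Loss(reduction='mean')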
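
For a sigmoid output, binary cross-entropy (`nn.BCELoss` in PyTorch) is the more common choice of loss; mean absolute error, as used here, penalizes a confident wrong prediction much less. The NumPy sketch below only illustrates the two formulas on hand-picked probabilities. The function names are illustrative, and it does not reproduce the notebook's training.

```python
import numpy as np

def l1_loss(p, y):
    """Mean absolute error between predicted probabilities and 0/1 labels."""
    return np.mean(np.abs(p - y))

def bce_loss(p, y, eps=1e-7):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 1.0, 0.0])
mild = np.array([0.6, 0.6, 0.4])                # all on the right side, not confident
confident_wrong = np.array([0.99, 0.01, 0.01])  # second prediction is badly wrong

for name, p in [("mild", mild), ("confident_wrong", confident_wrong)]:
    print(name, "L1:", round(l1_loss(p, y), 3), "BCE:", round(bce_loss(p, y), 3))
```

On these numbers the L1 loss ranks the second set of predictions as better, while cross-entropy ranks it as clearly worse because of the one confident mistake.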

In [14]:
# training the model

import random

def train_binary_model(X_train, Y_train, n_inputs = 96, n_epochs = 5000):

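    # Seed PyTorch before the model is created; random.seed() alone does not
    # control torch's weight initialization
    random.seed(42069)
    torch.manual_seed(42069)

    model = BinaryClassifier(n_inputs)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
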
    batch_size = int(len(X_train) / 50)
    print(batch_size)
    for epoch in range(n_epochs):
        for i in range(0, len(X_train), batch_size):
            Xbatch = X_train[i:i+batch_size]
            y_pred = model(Xbatch)
            ybatch = Y_train[i:i+batch_size]
            loss = loss_fn(y_pred, ybatch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if epoch % 100 == 0:
            # Note: baseline and accuracy are computed on the last mini-batch of the epoch only
            baseline = sum(ybatch) / len(ybatch)
            accuracy = sum(ybatch == y_pred.round())/len(ybatch)
            print(f'Finished epoch {epoch}, latest loss {loss}, accuracy {accuracy} vs baseline {baseline}')

    return model
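
The accuracy and baseline printed during training are computed from `ybatch` and `y_pred` as they stand after the inner loop, that is, from the last mini-batch of the epoch only. When the number of training rows is not a multiple of `batch_size`, that batch is just the remainder. Below is a short sketch of the arithmetic, assuming the 80/20 split of the 182,107 rows used here (scikit-learn rounds the test size up); `batch_bounds` is an illustrative helper.

```python
import math

def batch_bounds(n_rows, batch_size):
    """Return (start, stop) index pairs in the order the training loop visits them."""
    return [(i, min(i + batch_size, n_rows)) for i in range(0, n_rows, batch_size)]

n_rows = 182107
n_train = n_rows - math.ceil(0.2 * n_rows)   # rows left for training after the split
batch_size = int(n_train / 50)               # same rule as in train_binary_model

bounds = batch_bounds(n_train, batch_size)
last_start, last_stop = bounds[-1]

print("training rows:", n_train)
print("batch size:", batch_size)
print("number of batches:", len(bounds))
print("rows in last batch:", last_stop - last_start)
```

With only 35 rows in the last batch, the logged accuracies move in steps of 1/35 (for example 32/35 ≈ 0.9143), so the test-set accuracy is the figure to rely on.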

In [15]:
# testing the model

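def test_binary_model(model, X_test):
    # Y_test is taken from the notebook's global scope
    with torch.no_grad():  # no gradients are needed for evaluation
        Y_hat = model(X_test).round()
    print("Accuracy: ")
    print((sum(Y_test == Y_hat)/len(Y_test)).item())
    print("Baseline accuracy: ")
    print((sum(Y_test)/len(Y_test)).item())  # share of articles with more than 50 posts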


In [16]:
model = train_binary_model(X_train, Y_train, n_epochs=1000)

2913
Finished epoch 0, latest loss 0.28019288182258606, accuracy tensor([0.7143]) vs baseline tensor([0.6000])
Finished epoch 100, latest loss 0.11489197611808777, accuracy tensor([0.8857]) vs baseline tensor([0.6000])
Finished epoch 200, latest loss 0.08579690009355545, accuracy tensor([0.9143]) vs baseline tensor([0.6000])
Finished epoch 300, latest loss 0.08585256338119507, accuracy tensor([0.9143]) vs baseline tensor([0.6000])
Finished epoch 400, latest loss 0.08591507375240326, accuracy tensor([0.9143]) vs baseline tensor([0.6000])
Finished epoch 500, latest loss 0.08572710305452347, accuracy tensor([0.9143]) vs baseline tensor([0.6000])
Finished epoch 600, latest loss 0.08575011789798737, accuracy tensor([0.9143]) vs baseline tensor([0.6000])
Finished epoch 700, latest loss 0.08571437746286392, accuracy tensor([0.9143]) vs baseline tensor([0.6000])
Finished epoch 800, latest loss 0.08571577072143555, accuracy tensor([0.9143]) vs baseline tensor([0.6000])
Finished epoch 900, lates

In [17]:
test_binary_model(model, X_test)

Accuracy: 
0.6365109086036682
Baseline accuracy: 
0.5655098557472229


## Can we get better accuracy using subtitles?

Here, we try using the subtitle instead of the article title as the predictor.

In [18]:
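# Rows with an empty subtitle were summed to a scalar 0; replace them with a
# zero vector so that every row has length 96
for i in range(len(standard_data_copy)):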

    if standard_data_copy.loc[i, 'subtitle_tokens'] == '':
        standard_data_copy.at[i, 'subtitle_vectors'] = np.zeros(96)


In [19]:
X = np.vstack(standard_data_copy['subtitle_vectors'].to_numpy())
X

X_train, X_test, Y_train, Y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42069)

X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)

Y_train = torch.tensor(Y_train, dtype=torch.float32).reshape(-1, 1)
Y_test = torch.tensor(Y_test, dtype=torch.float32).reshape(-1, 1)

In [20]:
model = train_binary_model(X_train, Y_train, n_epochs = 1000)

2913
Finished epoch 0, latest loss 0.368115097284317, accuracy tensor([0.6571]) vs baseline tensor([0.6000])
Finished epoch 100, latest loss 0.20157773792743683, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 200, latest loss 0.2004449963569641, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 300, latest loss 0.20048309862613678, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 400, latest loss 0.20007279515266418, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 500, latest loss 0.20005889236927032, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 600, latest loss 0.20005641877651215, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 700, latest loss 0.2000395506620407, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 800, latest loss 0.20004263520240784, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 900, latest lo

In [21]:
test_binary_model(model, X_test)

Accuracy: 
0.6548514366149902
Baseline accuracy: 
0.5655098557472229


We can see that using the article's subtitle yields a slightly larger improvement in test accuracy than the title (0.655 vs. 0.637, against a baseline of 0.566). Next, we try using both.

In [22]:
X = np.concatenate((np.vstack(standard_data_copy['title_vectors'].to_numpy()), np.vstack(standard_data_copy['subtitle_vectors'].to_numpy())), axis=1)
print(X.shape)

X_train, X_test, Y_train, Y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42069)

X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)

Y_train = torch.tensor(Y_train, dtype=torch.float32).reshape(-1, 1)
Y_test = torch.tensor(Y_test, dtype=torch.float32).reshape(-1, 1)

(182107, 192)


In [23]:
model = train_binary_model(X_train, Y_train, n_inputs = 192, n_epochs = 1000)

2913
Finished epoch 0, latest loss 0.29562562704086304, accuracy tensor([0.7143]) vs baseline tensor([0.6000])
Finished epoch 100, latest loss 0.20029419660568237, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 200, latest loss 0.20000003278255463, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 300, latest loss 0.2000010460615158, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 400, latest loss 0.2000003308057785, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 500, latest loss 0.20000004768371582, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 600, latest loss 0.20000001788139343, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 700, latest loss 0.20000068843364716, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 800, latest loss 0.20000000298023224, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 900, latest 

In [24]:
test_binary_model(model, X_test)

Accuracy: 
0.6726703643798828
Baseline accuracy: 
0.5655098557472229


# Part 3: Multi-class classifier

We also want to try out a classifier for variables with more than two categories. The predictors are the same as before; as dependent variables we choose the columns *kicker* and *storylabels*.

## 3.1: Kickers

In [25]:
# 20 most common kickers

standard_data_copy['kicker'].value_counts()[0:20]

kicker
Fußball                 3162
Nachrichtenüberblick    3101
Netzpolitik             2674
Sudoku                  2414
Bundesliga              1775
Sport                   1684
USA                     1518
IT-Business             1464
Coronavirus             1464
Games                   1356
Tennis                  1252
Switchlist              1208
Krieg in der Ukraine    1203
Deutsche Bundesliga     1180
Etat-Überblick          1161
Hans Rauscher           1153
Wintersport             1134
TV-Tagebuch             1080
Eishockey               1058
Thema des Tages         1032
Name: count, dtype: int64
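
These counts also give a reference point for the multi-class models: always predicting the most frequent kicker would be correct for its share of the filtered rows. Below is a small sketch using the counts printed above; the variable names are illustrative.

```python
# Counts of the 20 most common kickers, copied from the output above
kicker_counts = [3162, 3101, 2674, 2414, 1775, 1684, 1518, 1464, 1464, 1356,
                 1252, 1208, 1203, 1180, 1161, 1153, 1134, 1080, 1058, 1032]

n_rows = sum(kicker_counts)                  # rows that remain after filtering
majority_baseline = max(kicker_counts) / n_rows
uniform_baseline = 1 / len(kicker_counts)    # random guessing among 20 classes

print(n_rows)
print(round(majority_baseline, 4))
print(uniform_baseline)
```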

In [26]:
# filtering for the most common ones

kickers_20 = standard_data_copy['kicker'].value_counts().nlargest(20).index.tolist()

standard_kickers = standard_data_copy.copy()
standard_kickers = standard_kickers[standard_kickers['kicker'].isin(kickers_20)].reset_index()[['kicker', 'title_vectors', 'subtitle_vectors']]

standard_kickers.head(3)

Unnamed: 0,kicker,title_vectors,subtitle_vectors
0,Eishockey,"[16.587378, -5.880515, -10.134394, -2.7442758,...","[17.787502, -9.519611, 6.9725924, -22.601917, ..."
1,Deutsche Bundesliga,"[5.977871, -8.565644, -8.99353, 8.525143, 13.7...","[30.480652, -4.33595, -10.554278, -1.181915, 1..."
2,Sport,"[6.727174, -2.1885931, -6.3502436, -9.411136, ...","[4.372181, -6.303278, -5.9711304, 3.729673, 8...."


In [27]:
# dependent variable: one-hot (binary) vector of length 20

y = np.zeros((len(standard_kickers), 20))
y.shape

for row in range(y.shape[0]):
    y_row = [1 if k == standard_kickers['kicker'][row] else 0 for k in kickers_20]
    y[row,] = y_row

y[0:2]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0.]])
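
The same kind of one-hot matrix can be built without a Python loop over rows by indexing an identity matrix with the class positions, and `argmax` recovers the class again. Below is a minimal sketch with a three-class toy example; the labels and variable names are illustrative.

```python
import numpy as np

categories = ["Fußball", "Sudoku", "Tennis"]        # toy class list
labels = ["Sudoku", "Fußball", "Tennis", "Sudoku"]  # toy label column

# Position of each label in the class list
class_index = np.array([categories.index(label) for label in labels])

# Row i of the identity matrix is the one-hot vector for class i
y_onehot = np.eye(len(categories))[class_index]

# argmax along each row maps the one-hot vector back to the class position
decoded = [categories[i] for i in y_onehot.argmax(axis=1)]

print(y_onehot)
print(decoded)
```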

In [28]:
X_title = np.vstack(standard_kickers['title_vectors'].to_numpy())
X_title[0:2]

X_subtitle = np.vstack(standard_kickers['subtitle_vectors'].to_numpy())
X_subtitle[0:2]

array([[ 17.78750229,  -9.51961136,   6.97259235, -22.60191727,
         23.74225616,   9.31873894,   8.5572834 ,  23.8451519 ,
         10.47311592,   0.22695971,  34.2782402 , -30.66071892,
        -22.11200523,  -2.05303741, -20.84233093,  24.89317513,
         20.07475853,  42.06415558, -25.77451897,  -8.23214912,
         23.6224556 ,  20.40205765,  24.19214058,   2.76441216,
        -18.69178772, -16.48721504,  19.79255867,  -3.66987586,
         40.11144257,  12.80308914, -32.58789825,  -8.74606419,
          4.88409948,   1.47553456,  16.15632629, -12.9812727 ,
          1.87224245,  -2.28196573,  20.60498428,   2.65613198,
        -31.96606445,  53.01169968,  10.38889885,  -8.28358364,
         42.46792603,   4.36738205, -20.69274902, -27.77854729,
        -12.88485432,  -8.18120575, -25.95684433,  11.9279747 ,
         -2.15959144,   5.14111853,  18.90647316, -12.25464344,
         20.79118919,  24.50359917, -17.3789978 , -15.80044937,
        -12.89952755, -20.38439178,  -5.

In [29]:
X_title_train, X_title_test, Y_train, Y_test = train_test_split(X_title, y, 
                                                    test_size=0.25, 
                                                    random_state=42069)

In [30]:
X_subtitle_train, X_subtitle_test, Y_train, Y_test = train_test_split(X_subtitle, y, 
                                                    test_size=0.25, 
                                                    random_state=42069)


In [31]:
X_title_train = torch.tensor(X_title_train, dtype=torch.float32)
X_title_test = torch.tensor(X_title_test, dtype=torch.float32)

X_subtitle_train = torch.tensor(X_subtitle_train, dtype=torch.float32)
X_subtitle_test = torch.tensor(X_subtitle_test, dtype=torch.float32)

Y_train = torch.tensor(Y_train, dtype=torch.float32)
Y_test = torch.tensor(Y_test, dtype=torch.float32)

In [32]:
class ClassifierTwentyCategories(nn.Module):
    def __init__(self, n_inputs):
        super().__init__()
        self.hidden1 = nn.Linear(n_inputs, 48)
        self.act1 = nn.ReLU()
        self.hidden2 = nn.Linear(48, 24)
        self.act2 = nn.ReLU()
        self.output = nn.Linear(24, 20)
        self.act_output = nn.Sigmoid()

    def forward(self, x):
        x = self.act1(self.hidden1(x))
        x = self.act2(self.hidden2(x))
        x = self.act_output(self.output(x))
        return x

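# Note: nn.CrossEntropyLoss applies log-softmax itself and expects raw scores (logits),
# so the sigmoid above restricts its inputs to the range (0, 1)
loss_fn = nn.CrossEntropyLoss()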
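
`nn.CrossEntropyLoss` applies a softmax to whatever the model returns, so it expects unbounded scores (logits). Because the network above ends in a sigmoid, every score lies between 0 and 1, which puts a floor under the loss: the best case is a score of 1 for the true class and 0 for the other 19. The NumPy sketch below computes that floor; `softmax_cross_entropy` is an illustrative helper, not a PyTorch function.

```python
import numpy as np

def softmax_cross_entropy(scores, true_class):
    """Cross-entropy of softmax(scores) against a single true class index."""
    shifted = scores - scores.max()                      # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    return -log_probs[true_class]

n_classes = 20

# Best case reachable with sigmoid outputs: 1 for the true class, 0 elsewhere
best_sigmoid_scores = np.zeros(n_classes)
best_sigmoid_scores[0] = 1.0

# Unbounded logits can push the loss arbitrarily close to zero
large_logits = np.zeros(n_classes)
large_logits[0] = 20.0

print(softmax_cross_entropy(best_sigmoid_scores, 0))
print(softmax_cross_entropy(large_logits, 0))
print(np.log(n_classes))   # loss of a uniform guess, for comparison
```

The floor is about 2.08, which is in line with the training losses in this part levelling off between roughly 2.1 and 2.4. Returning the raw output of the last linear layer (no sigmoid) would remove this floor.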

In [33]:
def train_multiclass_model(X_train, Y_train, n_inputs = 96, n_epochs = 1000):

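    # Seed PyTorch as well; random.seed() alone does not control torch's weight initialization
    random.seed(42069)
    torch.manual_seed(42069)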

    model = ClassifierTwentyCategories(n_inputs)
    optimizer = optim.Adam(model.parameters(), lr=0.001)


    batch_size = int(len(X_train) / 19)
    print(batch_size)

    for epoch in range(n_epochs):
        for i in range(0, len(X_train), batch_size):
            Xbatch = X_train[i:i+batch_size]
            y_pred = model(Xbatch)
            ybatch = Y_train[i:i+batch_size]
            loss = loss_fn(y_pred, ybatch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if epoch % 100 == 0:
            # Note: accuracy is computed on the last mini-batch of the epoch only
            y_pred_category = [torch.argmax(r).item() for r in y_pred]
            ybatch_category = [torch.argmax(r).item() for r in ybatch]
            accuracy = sum([1 if x == y else 0 for x, y in zip(y_pred_category, ybatch_category)]) / len(ybatch_category)
            print(f'Finished epoch {epoch}, latest loss {loss}, accuracy {accuracy}')

    return model



In [34]:
def test_multiclass_model(model, X_test, Y_test, categories = kickers_20):
    
    p = [torch.argmax(r).item() for r in model(X_test)] 
    Y_hat = [categories[i] for i in p]

    p = [torch.argmax(r).item() for r in Y_test] 

    Y_test_values = [categories[i] for i in p]

    acc = sum([1 if x == y else 0 for x, y in zip(Y_hat, Y_test_values)]) / len(Y_test_values)
    print("Accuracy: " + str(acc))

In [35]:
model_titles = train_multiclass_model(X_title_train, Y_train, n_epochs = 1000)

1266
Finished epoch 0, latest loss 2.870898723602295, accuracy 0.23617693522906794


Finished epoch 100, latest loss 2.3970587253570557, accuracy 0.4794628751974723
Finished epoch 200, latest loss 2.3890132904052734, accuracy 0.466824644549763
Finished epoch 300, latest loss 2.3703079223632812, accuracy 0.4565560821484992
Finished epoch 400, latest loss 2.3584561347961426, accuracy 0.4565560821484992
Finished epoch 500, latest loss 2.347475528717041, accuracy 0.45734597156398105
Finished epoch 600, latest loss 2.3382813930511475, accuracy 0.4510268562401264
Finished epoch 700, latest loss 2.338529348373413, accuracy 0.4518167456556082
Finished epoch 800, latest loss 2.335376024246216, accuracy 0.44944707740916273
Finished epoch 900, latest loss 2.3318440914154053, accuracy 0.4565560821484992


In [36]:
test_multiclass_model(model_titles, X_title_test, Y_test)

Accuracy: 0.409652076318743


In [37]:
model_subtitles = train_multiclass_model(X_subtitle_train, Y_train, n_epochs = 1000)

1266
Finished epoch 0, latest loss 2.843013048171997, accuracy 0.06398104265402843
Finished epoch 100, latest loss 2.4316508769989014, accuracy 0.4146919431279621
Finished epoch 200, latest loss 2.4045395851135254, accuracy 0.4028436018957346
Finished epoch 300, latest loss 2.3934378623962402, accuracy 0.39257503949447076
Finished epoch 400, latest loss 2.39837384223938, accuracy 0.39968404423380727
Finished epoch 500, latest loss 2.4003560543060303, accuracy 0.38704581358609796
Finished epoch 600, latest loss 2.383061170578003, accuracy 0.3941548183254344
Finished epoch 700, latest loss 2.377591609954834, accuracy 0.3902053712480253
Finished epoch 800, latest loss 2.364919662475586, accuracy 0.4028436018957346
Finished epoch 900, latest loss 2.372812271118164, accuracy 0.3933649289099526


In [38]:
test_multiclass_model(model_subtitles, X_subtitle_test, Y_test)

Accuracy: 0.3559047262750967


In [39]:
X = np.concatenate((np.vstack(standard_kickers['title_vectors'].to_numpy()), np.vstack(standard_kickers['subtitle_vectors'].to_numpy())), axis=1)
print(X.shape)
print(y.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    random_state=42069)

X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)

Y_train = torch.tensor(Y_train, dtype=torch.float32)
Y_test = torch.tensor(Y_test, dtype=torch.float32)


(32073, 192)
(32073, 20)


In [40]:
combined_model = train_multiclass_model(X_train, Y_train, n_inputs=192, n_epochs=1000)

1266
Finished epoch 0, latest loss 2.7735064029693604, accuracy 0.2401263823064771
Finished epoch 100, latest loss 2.3024654388427734, accuracy 0.5387045813586098
Finished epoch 200, latest loss 2.3085975646972656, accuracy 0.5039494470774092
Finished epoch 300, latest loss 2.280428886413574, accuracy 0.5521327014218009
Finished epoch 400, latest loss 2.2779200077056885, accuracy 0.5284360189573459
Finished epoch 500, latest loss 2.256657361984253, accuracy 0.5387045813586098
Finished epoch 600, latest loss 2.2502005100250244, accuracy 0.5402843601895735
Finished epoch 700, latest loss 2.2486824989318848, accuracy 0.5521327014218009
Finished epoch 800, latest loss 2.2432501316070557, accuracy 0.5481832543443917
Finished epoch 900, latest loss 2.238225221633911, accuracy 0.556872037914692


In [41]:
test_multiclass_model(combined_model, X_test, Y_test)

Accuracy: 0.4732510288065844


## 3.2: Storylabels

In [42]:
# filtering for the most common ones

storylabels_20 = standard_data_copy['storylabels'].value_counts().nlargest(20).index.tolist()

standard_storylabels = standard_data_copy.copy()
standard_storylabels = standard_storylabels[standard_storylabels['storylabels'].isin(storylabels_20)].reset_index()[['storylabels', 'title_vectors', 'subtitle_vectors']]

standard_storylabels.head(3)

Unnamed: 0,storylabels,title_vectors,subtitle_vectors
0,Kopf des Tages,"[20.26102, -2.1764362, 12.056327, -8.353496, 2...","[12.673783, -13.09248, 8.577187, -2.7328134, 9..."
1,Spiel,"[-3.334474, -4.604315, -2.6970067, -2.6541345,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,Spiel,"[-0.60618734, 1.1090474, 3.5244317, -7.726671,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [43]:
# dependent variable: one-hot (binary) vector of length 20

y = np.zeros((len(standard_storylabels), 20))
print(y.shape)

for row in range(y.shape[0]):
    y_row = [1 if s == standard_storylabels['storylabels'][row] else 0 for s in storylabels_20]
    y[row,] = y_row

y[0:2]

(36407, 20)


array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.]])

In [44]:
X_title = np.vstack(standard_storylabels['title_vectors'].to_numpy())
X_title[0:2]

X_subtitle = np.vstack(standard_storylabels['subtitle_vectors'].to_numpy())
X_subtitle[0:2]

array([[ 12.6737833 , -13.09247971,   8.57718658,  -2.73281336,
          9.55943489,   1.19442225,  13.35463333,  18.54079056,
         -1.24317193, -16.44563866,  26.11481094, -23.8042469 ,
        -19.78061295,  -1.20370197, -14.3854847 ,  21.38131714,
         11.98632336,  11.43702793,  -8.80769539,   5.8185854 ,
          9.41383076,  19.68308258,  12.65031338,   6.34458733,
        -14.87839794, -12.42210007,  14.90303516,   5.76547098,
          8.85460186,  13.4050703 , -14.50532913, -11.64571476,
          6.06785011,   7.40806866,  18.20200348,   0.66248214,
         -7.48345232,  -6.37721109,  30.55085945,  -2.74134064,
        -18.50937462,  46.69207764,  14.90998268,  -1.3300662 ,
         21.63993645,   9.69990635, -10.60776997, -12.7449646 ,
         -7.96562004,  -2.69942284, -28.52653694,  -3.66213107,
          3.77998972,   4.08382893,  10.05634022,  -5.16197395,
         14.41419888,   7.13106537, -17.49291039,  -9.09768009,
        -14.40098476,  -5.38680267, -15.

In [45]:
X_title_train, X_title_test, Y_train, Y_test = train_test_split(X_title, y, 
                                                    test_size=0.25, 
                                                    random_state=42069)

In [46]:
X_subtitle_train, X_subtitle_test, Y_train, Y_test = train_test_split(X_subtitle, y, 
                                                    test_size=0.25, 
                                                    random_state=42069)


In [47]:
X_title_train = torch.tensor(X_title_train, dtype=torch.float32)
X_title_test = torch.tensor(X_title_test, dtype=torch.float32)

X_subtitle_train = torch.tensor(X_subtitle_train, dtype=torch.float32)
X_subtitle_test = torch.tensor(X_subtitle_test, dtype=torch.float32)

Y_train = torch.tensor(Y_train, dtype=torch.float32)
Y_test = torch.tensor(Y_test, dtype=torch.float32)

In [48]:
model_titles = train_multiclass_model(X_title_train, Y_train, n_epochs = 1000)

1437
Finished epoch 0, latest loss 2.7086448669433594, accuracy 0.0
Finished epoch 100, latest loss 2.1900155544281006, accuracy 0.0
Finished epoch 200, latest loss 2.189687728881836, accuracy 0.0
Finished epoch 300, latest loss 2.189687490463257, accuracy 0.0
Finished epoch 400, latest loss 2.18967342376709, accuracy 0.0
Finished epoch 500, latest loss 2.189673900604248, accuracy 0.0
Finished epoch 600, latest loss 2.189673900604248, accuracy 0.0
Finished epoch 700, latest loss 2.189673900604248, accuracy 0.0
Finished epoch 800, latest loss 2.189673900604248, accuracy 0.0
Finished epoch 900, latest loss 2.189673900604248, accuracy 0.0


In [49]:
test_multiclass_model(model_titles, X_title_test, Y_test)

Accuracy: 0.34673698088332233


In [50]:
model_subtitles = train_multiclass_model(X_subtitle_train, Y_train, n_epochs = 1000)

1437
Finished epoch 0, latest loss 2.792396068572998, accuracy 0.0
Finished epoch 100, latest loss 2.294781446456909, accuracy 0.0
Finished epoch 200, latest loss 2.2581024169921875, accuracy 0.0
Finished epoch 300, latest loss 2.258202075958252, accuracy 0.0
Finished epoch 400, latest loss 2.25809383392334, accuracy 0.0
Finished epoch 500, latest loss 2.258094072341919, accuracy 0.0
Finished epoch 600, latest loss 2.258089065551758, accuracy 0.0
Finished epoch 700, latest loss 2.2580909729003906, accuracy 0.0
Finished epoch 800, latest loss 2.2580976486206055, accuracy 0.0
Finished epoch 900, latest loss 2.258089065551758, accuracy 0.0


In [51]:
test_multiclass_model(model_subtitles, X_subtitle_test, Y_test)

Accuracy: 0.32179740716326083


In [52]:
X = np.concatenate((np.vstack(standard_storylabels['title_vectors'].to_numpy()), np.vstack(standard_storylabels['subtitle_vectors'].to_numpy())), axis=1)
print(X.shape)
print(y.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    random_state=42069)

X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)

Y_train = torch.tensor(Y_train, dtype=torch.float32)
Y_test = torch.tensor(Y_test, dtype=torch.float32)


(36407, 192)
(36407, 20)


In [53]:
combined_model = train_multiclass_model(X_train, Y_train, n_inputs=192, n_epochs=1000)

1437
Finished epoch 0, latest loss 2.785167694091797, accuracy 0.0
Finished epoch 100, latest loss 2.1517343521118164, accuracy 0.5
Finished epoch 200, latest loss 2.1516098976135254, accuracy 0.5
Finished epoch 300, latest loss 2.1516027450561523, accuracy 0.5
Finished epoch 400, latest loss 2.1516025066375732, accuracy 0.5
Finished epoch 500, latest loss 2.151602268218994, accuracy 0.5
Finished epoch 600, latest loss 2.116236448287964, accuracy 0.5
Finished epoch 700, latest loss 2.1162257194519043, accuracy 0.5
Finished epoch 800, latest loss 2.1162257194519043, accuracy 0.5
Finished epoch 900, latest loss 2.1162257194519043, accuracy 0.5


In [54]:
test_multiclass_model(combined_model, X_test, Y_test)

Accuracy: 0.43331136014062843


## Summary

We have processed the scraped data and trained classification models. The tok2vec embeddings of the article title and subtitle improve upon the baseline accuracy on test data in both the binary and the multi-class classification tasks. However, there is still considerable room for improvement. This could be achieved by "brute-forcing" the models (more layers, more nodes, more epochs, etc.), by using a more sophisticated network architecture such as an RNN, or by using superior encodings, such as those of the larger models available for spaCy.

## References

https://machinelearningmastery.com/develop-your-first-neural-network-with-pytorch-step-by-step/

http://mitloehner.com/lehre/dsai1/

https://pytorch.org/docs/stable/nn.html

https://spacy.io/models/de
