This notebook processes the scraped data, converts the text columns to numeric vectors (Part 1), and trains classification models that use the processed text as predictors (Parts 2 and 3).

Locally on Richard's computer, the notebook takes approximately 1.5 hours to execute completely: Part 1 needs about 45 minutes, Parts 2 and 3 about 20 minutes each.

Besides the usual numpy, pandas, etc., the required modules are spacy, pytorch and scikit-learn. The German NLP model for spacy needs to be downloaded separately.

# Part 1: Text Processing

Here we import the scraped data, split the text columns into words, and extract and normalize the tokens. Next, the tokens are converted to embeddings.

In [1]:
### Setup

# pip install spacy
# python -m spacy download de_core_news_sm  - for small model (13 MB)
# python -m spacy download de_core_news_md  - for medium model (42 MB)
# source - https://spacy.io/models/de


import spacy
from spacy.lang.de.examples import sentences 


# Example from documentation
nlp = spacy.load("de_core_news_sm")
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

Die ganze Stadt ist ein Startup: Shenzhen ist das Silicon Valley für Hardware-Firmen
Die DET nk
ganze ADJ nk
Stadt NOUN sb
ist AUX ROOT
ein DET nk
Startup NOUN pd
: PUNCT punct
Shenzhen NOUN sb
ist AUX cj
das DET nk
Silicon PROPN pnc
Valley PROPN sb
für ADP mnr
Hardware-Firmen NOUN nk


In [2]:
## Data loading

import numpy as np
import pandas as pd

standard_data = pd.read_csv('./data/4yrs_derstandard_frontpage_data.csv')
standard_data

Unnamed: 0,title,subtitle,link,datetime,kicker,n_posts,storylabels
0,Real Madrid stolpert mit Aluminiumpech im Tite...,Die Königlichen können Bilbao daheim nicht bes...,https://www.derstandard.at/story/2000112599363...,2019-12-22T23:44,Primera Division,30,
1,Bolivien weist venezolanische Diplomaten aus,InterimspräsidentinJeanine Áñez wirft denBotsc...,https://www.derstandard.at/story/2000112598924...,2019-12-22T22:50,Übergangsregierung,16,
2,Erdoğan warnt vor neuer Flüchtlingswelle aus S...,"Türkischer Präsident: ""80.000 Menschen Richtun...",https://www.derstandard.at/story/2000112598130...,2019-12-22T21:43,Bürgerkrieg,104,
3,Massenkarambolage mit 63 Fahrzeugen in Virginia,Autos stießen auf vereister Brücke zusammen,https://www.derstandard.at/story/2000112597972...,2019-12-22T21:29,Weihnachtsverkehr,35,
4,"Salzburg schlägt Caps, Meister KAC mit vierter...",Die Bullen sind damit der Gewinner der Runde: ...,https://www.derstandard.at/story/2000112595206...,2019-12-22T20:54,Eishockey,4,
...,...,...,...,...,...,...,...
182102,Wer braucht die Kirche?,Dass sich die Kirche nach soschwerwiegendenVer...,https://www.derstandard.at/story/3000000200743...,2023-12-22T06:00,Dominik Straub,1,Kommentar
182103,Sonderregelung verlängert: Mehr als 1.000 Ärzt...,"Der""Pandemieparagraf""im Ärztegesetz hat mehr a...",https://www.derstandard.at/story/3000000200621...,2023-12-22T06:00,Pandemieparagraf,148,
182104,"Stadtforscher: ""Architektur ist Teil unserer W...",Jetzt anhören: In Zukunft müssen Städte wieder...,https://www.derstandard.at/story/3000000200499...,2023-12-22T06:00,Edition Zukunft,54,Podcast
182105,David Alaba zum zehnten Mal zu Österreichs Fuß...,Zehn von zwölf Trainern wählten den derzeit ve...,https://www.derstandard.at/story/3000000200745...,2023-12-22T05:46,Fußball,34,


## Tokenizing + lemmatizing, embeddings

First, the title and subtitle columns are split into individual words. Second, the words are replaced by their standard forms (großem → groß, rettete → retten). Finally, the standard forms are converted to numeric vectors using tok2vec embeddings.
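
As a minimal sketch of these three steps, the toy example below uses a hand-written lemma table and made-up 3-dimensional vectors in place of the spaCy model. `TOY_LEMMAS`, `TOY_VECTORS` and `embed_text` are hypothetical names chosen for illustration; the real pipeline in the following cells obtains lemmas and 96-dimensional vectors from `de_core_news_sm`.

```python
# Toy illustration of split -> lemmatize -> embed.
# The lemma table and vectors are invented for this sketch; spaCy supplies the real ones.
TOY_LEMMAS = {"großem": "groß", "rettete": "retten"}
TOY_VECTORS = {
    "groß": [1.0, 0.0, 2.0],
    "retten": [0.5, -1.0, 0.0],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback for words missing from the toy table


def embed_text(text):
    """Return (lemmas, vectors) for a whitespace-tokenized text."""
    words = text.replace("-", " ").replace(",", " ").split()
    lemmas = [TOY_LEMMAS.get(w, w) for w in words]
    vectors = [TOY_VECTORS.get(lemma, UNKNOWN) for lemma in lemmas]
    return lemmas, vectors


lemmas, vectors = embed_text("rettete großem")
print(lemmas)
print(len(vectors), "vectors of length", len(vectors[0]))
```

Each text thus becomes a list with one vector per token, which is the structure stored in the `title_vectors` and `subtitle_vectors` columns.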

In [3]:
# Initializing four columns: tokens and vectors for title and subtitle
standard_data['title_tokens'] = standard_data['title']
standard_data['subtitle_tokens'] = standard_data['subtitle']

standard_data['title_vectors'] = standard_data['title_tokens']
standard_data['subtitle_vectors'] = standard_data['subtitle_tokens']

# Load model
nlp = spacy.load("de_core_news_sm")
nlp.get_pipe('lemmatizer')
nlp.get_pipe("tok2vec")

<spacy.pipeline.tok2vec.Tok2Vec at 0x1ad98dcaed0>

In [4]:
# loops through rows

for index, row in standard_data.iterrows():
    
    # TITLE

    text = nlp(row['title_tokens'].replace("-", " ").replace(",", " ").replace(": ", " "))

    token_list = ' '.join([token.lemma_ for token in text]) # list of standardized tokens joined to a string

    vector_list = [token.vector for token in nlp(token_list)] # list of numeric vectors

    # store in the dataframe

    standard_data.at[index, 'title_tokens'] = token_list 
    standard_data.at[index, 'title_vectors'] = vector_list 

    # SAME FOR SUBTITLE

    # try except because subtitle is sometimes empty, then it would output error

    try:

        text = nlp(row['subtitle_tokens'])
        token_list = ' '.join([token.lemma_ for token in text]) # list of standardized tokens joined to a string
        vector_list = [token.vector for token in nlp(token_list)] # list of numeric vectors

    except Exception: # e.g. a missing subtitle, which is not a string
        token_list = '' # if empty, token list empty
        vector_list = []

    standard_data.at[index, 'subtitle_tokens'] = token_list # store in the dataframe
    standard_data.at[index, 'subtitle_vectors'] = vector_list



In [5]:
standard_data.head(4)

Unnamed: 0,title,subtitle,link,datetime,kicker,n_posts,storylabels,title_tokens,subtitle_tokens,title_vectors,subtitle_vectors
0,Real Madrid stolpert mit Aluminiumpech im Tite...,Die Königlichen können Bilbao daheim nicht bes...,https://www.derstandard.at/story/2000112599363...,2019-12-22T23:44,Primera Division,30,,Real Madrid stolpern mit Aluminiumpech in Tite...,der Königlich können Bilbao daheim nicht besiegen,"[[1.4714637, 2.1941755, 0.37696052, 0.46038526...","[[2.48862, 0.1332562, -1.0822431, -2.6994205, ..."
1,Bolivien weist venezolanische Diplomaten aus,InterimspräsidentinJeanine Áñez wirft denBotsc...,https://www.derstandard.at/story/2000112598924...,2019-12-22T22:50,Übergangsregierung,16,,Bolivien weisen venezolanisch Diplomat aus,InterimspräsidentinJeanine Áñez werfen denBots...,"[[-1.3009748, -0.9428248, -3.3170214, 2.703031...","[[1.1021554, 1.5151364, -0.9405582, -2.410718,..."
2,Erdoğan warnt vor neuer Flüchtlingswelle aus S...,"Türkischer Präsident: ""80.000 Menschen Richtun...",https://www.derstandard.at/story/2000112598130...,2019-12-22T21:43,Bürgerkrieg,104,,Erdoğan warnen vor neu Flüchtlingswelle aus Sy...,Türkischer Präsident -- -- 80.000 Mensch Richt...,"[[-1.0778335, 0.5461743, 2.5889783, -1.2373338...","[[-0.37814975, -0.962201, 0.65647066, -5.07107..."
3,Massenkarambolage mit 63 Fahrzeugen in Virginia,Autos stießen auf vereister Brücke zusammen,https://www.derstandard.at/story/2000112597972...,2019-12-22T21:29,Weihnachtsverkehr,35,,Massenkarambolage mit 63 Fahrzeug in Virginia,Auto stoßen auf vereist Brücke zusammen,"[[0.4362324, 2.8947713, -4.595852, -0.09335708...","[[-4.584771, -1.4883125, 0.2176263, 4.631511, ..."


In [6]:
# Example of list of numeric vectors representing a subtitle

standard_data.iloc[1,10]

[array([ 1.1021554 ,  1.5151364 , -0.9405582 , -2.410718  , -1.1881657 ,
        -3.615175  ,  1.0234319 ,  4.1538424 , -1.5936106 , -2.785506  ,
         0.6842209 ,  0.07389921,  0.5360052 ,  0.31706142, -0.9196383 ,
        -0.82573223, -3.5580513 , -1.8286765 ,  0.10499629, -2.6490862 ,
         0.7699168 ,  0.9488483 ,  3.0474105 , -3.499149  , -1.3851355 ,
        -1.3532616 , -2.1080291 ,  1.4907815 ,  1.6701344 ,  0.32809997,
         3.3263462 ,  1.2118888 , -0.66293633, -0.6889895 , -0.9815758 ,
         7.8648634 , -2.7158377 ,  1.1486918 ,  1.389757  ,  2.5390139 ,
        -2.2970562 ,  1.9853625 , -0.53672844,  1.7880895 ,  1.0695057 ,
        -1.396502  , -0.44181645,  8.145238  , -6.583647  ,  2.6661644 ,
        -3.5900345 ,  5.988452  ,  0.7246865 ,  0.7114573 , -0.97419953,
        -0.59691894, -1.2567514 ,  1.227717  , -0.20339483,  1.2106575 ,
        -3.6484938 , -0.32804242, -0.22830245, -2.9978542 , -1.7434182 ,
         1.2077783 ,  1.3858008 , -2.3194008 ,  2.6

# Part 2: Binary classifier

## Defining predictors and dependent variable

In [7]:
### sum vectors for words  to get a vector for a sentence

import numpy as np

standard_data_copy = standard_data.copy()

for index,row in standard_data_copy.iterrows():

    l = standard_data_copy.loc[index, 'title_vectors']
    standard_data_copy.at[index, 'title_vectors'] = np.sum(l, axis=0)

    l = standard_data_copy.loc[index, 'subtitle_vectors']
    standard_data_copy.at[index, 'subtitle_vectors'] = np.sum(l, axis=0)
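
The effect of `np.sum(..., axis=0)` can be checked on a small made-up example. The sketch below uses invented 3-dimensional vectors rather than the notebook's 96-dimensional ones. It also shows that summing an empty list returns a scalar instead of a vector, which matters for rows whose subtitle is missing.

```python
import numpy as np

# Three invented word vectors standing in for one title
word_vectors = [
    np.array([1.0, 2.0, 3.0]),
    np.array([0.5, -1.0, 0.0]),
    np.array([-1.5, 0.0, 1.0]),
]

# Summing over axis 0 adds the vectors element-wise: one vector per sentence
sentence_vector = np.sum(word_vectors, axis=0)
print(sentence_vector)

# An empty list (no tokens) sums to a scalar, not to a zero vector
empty_sum = np.sum([], axis=0)
print(np.ndim(empty_sum))

# A zero vector of the right length keeps such rows stackable with np.vstack
padded = empty_sum if np.ndim(empty_sum) == 1 else np.zeros(3)
stacked = np.vstack([sentence_vector, padded])
print(stacked.shape)
```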

In [8]:
# reshaping

X = np.vstack(standard_data_copy['title_vectors'].to_numpy())
X

array([[  9.119195  ,   0.4420882 ,  -6.585746  , ...,   5.249668  ,
         -5.503358  ,   2.4459732 ],
       [  4.1763725 ,   6.6053157 ,  -6.26212   , ...,   6.883208  ,
         -5.6511483 ,  -1.0476801 ],
       [ 11.195007  ,  -1.1859422 ,   1.1256366 , ...,  11.542025  ,
        -11.157122  ,   0.35567856],
       ...,
       [ -2.303168  ,  -3.1763613 ,  -4.835184  , ...,   4.255176  ,
         -2.1192274 ,  -0.76783633],
       [ -2.038211  , -18.766294  ,   7.7005734 , ...,  17.543652  ,
        -13.355871  ,   6.8655033 ],
       [ 12.515891  ,   3.4930995 ,   0.40148246, ...,   8.258898  ,
          1.8255993 ,  -0.6187186 ]], dtype=float32)

In [9]:
# binarize y: more than 50 posts (1) vs. 50 or fewer (0) (regression -> classification)

y = (standard_data_copy['n_posts'].to_numpy() > 50).astype(int).reshape(-1, 1)

print(y.shape)
print(X.shape)

(182107, 1)
(182107, 96)
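
The thresholding step can be sketched on a handful of made-up post counts. The values in `n_posts` below are invented. The sketch also computes the share of the larger class, which is the accuracy a model would reach by always predicting that class.

```python
import numpy as np

n_posts = np.array([30, 16, 104, 35, 4, 148, 54, 1])  # invented counts

# 1 if an article has more than 50 posts, else 0; reshaped to a column
y_demo = (n_posts > 50).astype(int).reshape(-1, 1)
print(y_demo.ravel())
print(y_demo.shape)

# Majority-class share: accuracy of always predicting the more frequent class
positive_share = y_demo.mean()
majority_baseline = max(positive_share, 1 - positive_share)
print(majority_baseline)
```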


In [10]:
# train-test split

from sklearn.model_selection import train_test_split


X_train, X_test, Y_train, Y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42069)


In [11]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

In [12]:
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)

Y_train = torch.tensor(Y_train, dtype=torch.float32).reshape(-1, 1)
Y_test = torch.tensor(Y_test, dtype=torch.float32).reshape(-1, 1)

In [13]:
# two hidden layers, each with 12 nodes

class BinaryClassifier(nn.Module):
    def __init__(self, n_inputs = 96):
        super().__init__()
        self.hidden1 = nn.Linear(n_inputs, 12)
        self.act1 = nn.ReLU()
        self.hidden2 = nn.Linear(12, 12)
        self.act2 = nn.ReLU()
        self.output = nn.Linear(12, 1)
        self.act_output = nn.Sigmoid()

    def forward(self, x):
        x = self.act1(self.hidden1(x))
        x = self.act2(self.hidden2(x))
        x = self.act_output(self.output(x))
        return x
    
loss_fn = nn.L1Loss(reduction='mean') # mean absolute error between predicted probability and label
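
The cell above trains with the mean absolute error (`nn.L1Loss`). Binary cross-entropy is a common alternative for sigmoid outputs (available in PyTorch as `nn.BCELoss`); it is not what this notebook uses. The NumPy sketch below compares the two losses on invented predictions to show how differently they penalize a confident mistake. `l1_loss` and `bce_loss` are hypothetical helpers written for this comparison.

```python
import numpy as np


def l1_loss(y_pred, y_true):
    """Mean absolute error, the quantity nn.L1Loss(reduction='mean') computes."""
    return np.mean(np.abs(y_pred - y_true))


def bce_loss(y_pred, y_true, eps=1e-12):
    """Binary cross-entropy on probabilities; eps guards against log(0)."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


y_true = np.array([1.0, 0.0, 1.0, 0.0])
mild = np.array([0.6, 0.4, 0.6, 0.4])               # hesitant but correct
confident_wrong = np.array([0.6, 0.4, 0.6, 0.99])   # one confident mistake

for name, y_pred in [("mild", mild), ("confident_wrong", confident_wrong)]:
    print(name, l1_loss(y_pred, y_true), bce_loss(y_pred, y_true))
```

The absolute error grows linearly with the size of the mistake, while cross-entropy grows much faster as a wrong prediction approaches 0 or 1.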

In [14]:
# training the model

import random

def train_binary_model(X_train, Y_train, n_inputs = 96, n_epochs = 5000):

    torch.manual_seed(42069) # seeds PyTorch's RNG; random.seed() does not affect weight initialization
    model = BinaryClassifier(n_inputs)
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    batch_size = int(len(X_train) / 50)
    print(batch_size)
    for epoch in range(n_epochs):
        for i in range(0, len(X_train), batch_size):
            Xbatch = X_train[i:i+batch_size]
            y_pred = model(Xbatch)
            ybatch = Y_train[i:i+batch_size]
            loss = loss_fn(y_pred, ybatch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # note: baseline and accuracy are computed on the last mini-batch of the epoch only
        baseline = sum(ybatch) / len(ybatch)
        accuracy = sum(ybatch == y_pred.round())/len(ybatch)
        print(f'Finished epoch {epoch}, latest loss {loss}, accuracy {accuracy} vs baseline {baseline}')

    return model
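
The accuracy printed inside the loop is computed from `ybatch` and `y_pred` as they stand after the inner loop, i.e. from the last mini-batch of each epoch. A sketch of an epoch-level figure is given below. `epoch_accuracy` is a hypothetical helper, written with NumPy arrays in place of tensors, that accumulates correct predictions over all batches.

```python
import numpy as np


def epoch_accuracy(prob_batches, label_batches):
    """Share of correct 0/1 predictions over all mini-batches of an epoch."""
    correct = 0
    total = 0
    for probs, labels in zip(prob_batches, label_batches):
        correct += int(np.sum(np.round(probs) == labels))
        total += len(labels)
    return correct / total


# Two invented mini-batches of predicted probabilities and true labels
prob_batches = [np.array([0.9, 0.2, 0.7]), np.array([0.4, 0.8])]
label_batches = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0])]

print(epoch_accuracy(prob_batches, label_batches))            # all five rows
print(epoch_accuracy(prob_batches[-1:], label_batches[-1:]))  # last batch only
```

In this toy case the last batch alone looks perfect although one of the five rows is misclassified, which is why the per-epoch figures in the logs should be read with care.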

In [15]:
# testing the model (note: the accuracy is computed against the global Y_test)

def test_binary_model(model, X_test):
    Y_hat = model(X_test).round()
    print("Accuracy: ")
    print((sum(Y_test == Y_hat)/len(Y_test)).item())
    print("Baseline accuracy: ")
    print((sum(Y_test)/len(Y_test)).item())


In [16]:
model = train_binary_model(X_train, Y_train, n_epochs=1000)

2913
Finished epoch 0, latest loss 0.40014711022377014, accuracy tensor([0.6000]) vs baseline tensor([0.6000])
Finished epoch 1, latest loss 0.254181444644928, accuracy tensor([0.7714]) vs baseline tensor([0.6000])
Finished epoch 2, latest loss 0.23384593427181244, accuracy tensor([0.7714]) vs baseline tensor([0.6000])
Finished epoch 3, latest loss 0.22444556653499603, accuracy tensor([0.7714]) vs baseline tensor([0.6000])
Finished epoch 4, latest loss 0.2103758305311203, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 5, latest loss 0.19853954017162323, accuracy tensor([0.8000]) vs baseline tensor([0.6000])
Finished epoch 6, latest loss 0.1818917691707611, accuracy tensor([0.8286]) vs baseline tensor([0.6000])
Finished epoch 7, latest loss 0.17299869656562805, accuracy tensor([0.8286]) vs baseline tensor([0.6000])
Finished epoch 8, latest loss 0.16994909942150116, accuracy tensor([0.8286]) vs baseline tensor([0.6000])
Finished epoch 9, latest loss 0.1586179137229

In [17]:
test_binary_model(model, X_test)

Accuracy: 
0.6355499625205994
Baseline accuracy: 
0.5655098557472229


## Can we get better accuracy using subtitles?

Here, we try using the subtitle instead of the article title as the predictor.

In [36]:
for i in range(len(standard_data_copy)):

    if standard_data_copy.loc[i, 'subtitle_tokens'] == '':
        standard_data_copy.at[i, 'subtitle_vectors'] = np.zeros(96)


In [37]:
X = np.vstack(standard_data_copy['subtitle_vectors'].to_numpy())
X

X_train, X_test, Y_train, Y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42069)

X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)

Y_train = torch.tensor(Y_train, dtype=torch.float32).reshape(-1, 1)
Y_test = torch.tensor(Y_test, dtype=torch.float32).reshape(-1, 1)

In [None]:
model = train_binary_model(X_train, Y_train, n_epochs = 1000)

2913
Finished epoch 0, latest loss 0.4164944291114807, accuracy tensor([0.6000]) vs baseline tensor([0.6000])
Finished epoch 1, latest loss 0.33713018894195557, accuracy tensor([0.6857]) vs baseline tensor([0.6000])
Finished epoch 2, latest loss 0.29099467396736145, accuracy tensor([0.7143]) vs baseline tensor([0.6000])
Finished epoch 3, latest loss 0.2782460153102875, accuracy tensor([0.7143]) vs baseline tensor([0.6000])
Finished epoch 4, latest loss 0.2681356966495514, accuracy tensor([0.7429]) vs baseline tensor([0.6000])
Finished epoch 5, latest loss 0.2599217891693115, accuracy tensor([0.7429]) vs baseline tensor([0.6000])
Finished epoch 6, latest loss 0.24900314211845398, accuracy tensor([0.7714]) vs baseline tensor([0.6000])
Finished epoch 7, latest loss 0.23437213897705078, accuracy tensor([0.7714]) vs baseline tensor([0.6000])
Finished epoch 8, latest loss 0.23247313499450684, accuracy tensor([0.7714]) vs baseline tensor([0.6000])
Finished epoch 9, latest loss 0.2321036756038

In [38]:
test_binary_model(model, X_test)

Accuracy: 
0.5940640568733215
Baseline accuracy: 
0.5655098557472229


We can see that using the article's subtitle yields a smaller improvement in accuracy over the baseline (0.594 vs. 0.636 for titles). Next, we try using both.

In [39]:
X = np.concatenate((np.vstack(standard_data_copy['title_vectors'].to_numpy()), np.vstack(standard_data_copy['subtitle_vectors'].to_numpy())), axis=1)
print(X.shape)

X_train, X_test, Y_train, Y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42069)

X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)

Y_train = torch.tensor(Y_train, dtype=torch.float32).reshape(-1, 1)
Y_test = torch.tensor(Y_test, dtype=torch.float32).reshape(-1, 1)

(182107, 192)
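
Concatenating along `axis=1` places the subtitle features next to the title features of the same row. A small sketch with invented 2-row matrices (3 columns each instead of 96); `title_part` and `subtitle_part` are names made up for this example:

```python
import numpy as np

title_part = np.array([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0]])
subtitle_part = np.array([[0.1, 0.2, 0.3],
                          [0.0, 0.0, 0.0]])   # second row: missing subtitle

# axis=1 joins columns, so the row count stays and the feature count doubles
combined = np.concatenate((title_part, subtitle_part), axis=1)
print(combined.shape)
print(combined[1])
```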


In [40]:
model = train_binary_model(X_train, Y_train, n_inputs = 192, n_epochs = 1000)

2913
Finished epoch 0, latest loss 0.28403809666633606, accuracy tensor([0.7143]) vs baseline tensor([0.6000])
Finished epoch 1, latest loss 0.2833850085735321, accuracy tensor([0.7143]) vs baseline tensor([0.6000])
Finished epoch 2, latest loss 0.27167806029319763, accuracy tensor([0.7429]) vs baseline tensor([0.6000])
Finished epoch 3, latest loss 0.2629415690898895, accuracy tensor([0.7429]) vs baseline tensor([0.6000])
Finished epoch 4, latest loss 0.2586017847061157, accuracy tensor([0.7429]) vs baseline tensor([0.6000])
Finished epoch 5, latest loss 0.2550983428955078, accuracy tensor([0.7429]) vs baseline tensor([0.6000])
Finished epoch 6, latest loss 0.25036823749542236, accuracy tensor([0.7429]) vs baseline tensor([0.6000])
Finished epoch 7, latest loss 0.24702943861484528, accuracy tensor([0.7429]) vs baseline tensor([0.6000])
Finished epoch 8, latest loss 0.24246254563331604, accuracy tensor([0.7714]) vs baseline tensor([0.6000])
Finished epoch 9, latest loss 0.2369374632835

In [41]:
test_binary_model(model, X_test)

Accuracy: 
0.6731096506118774
Baseline accuracy: 
0.5655098557472229


# Part 3: Multi-class classifier

We also want to try out a classifier for variables with more than two categories. The predictors stay the same; as dependent variables we choose the columns *kicker* and *storylabels*.

## 3.1: Kickers

In [42]:
# 20 most common kickers

standard_data_copy['kicker'].value_counts()[0:20]

kicker
Fußball                 3162
Nachrichtenüberblick    3101
Netzpolitik             2674
Sudoku                  2414
Bundesliga              1775
Sport                   1684
USA                     1518
IT-Business             1464
Coronavirus             1464
Games                   1356
Tennis                  1252
Switchlist              1208
Krieg in der Ukraine    1203
Deutsche Bundesliga     1180
Etat-Überblick          1161
Hans Rauscher           1153
Wintersport             1134
TV-Tagebuch             1080
Eishockey               1058
Thema des Tages         1032
Name: count, dtype: int64

In [43]:
# filtering for the most common ones

kickers_20 = standard_data_copy['kicker'].value_counts().nlargest(20).index.tolist()

standard_kickers = standard_data_copy.copy()
standard_kickers = standard_kickers[standard_kickers['kicker'].isin(kickers_20)].reset_index()[['kicker', 'title_vectors', 'subtitle_vectors']]

standard_kickers.head(3)

Unnamed: 0,kicker,title_vectors,subtitle_vectors
0,Eishockey,"[16.587378, -5.880515, -10.134394, -2.7442758,...","[17.787502, -9.519611, 6.9725924, -22.601917, ..."
1,Deutsche Bundesliga,"[5.977871, -8.565644, -8.99353, 8.525143, 13.7...","[30.480652, -4.33595, -10.554278, -1.181915, 1..."
2,Sport,"[6.727174, -2.1885931, -6.3502436, -9.411136, ...","[4.372181, -6.303278, -5.9711304, 3.729673, 8...."


In [44]:
# dependent variable: binary (one-hot) vector of length 20

y = np.zeros((len(standard_kickers), 20))
y.shape

for row in range(y.shape[0]):
    y_row = [1 if k == standard_kickers['kicker'][row] else 0 for k in kickers_20]
    y[row,] = y_row

y

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])
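
The same kind of one-hot matrix can be built without an explicit Python loop by comparing the label column against the category list with broadcasting. This is a sketch on invented labels; `one_hot` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def one_hot(labels, categories):
    """Return an (n_rows, n_categories) 0/1 matrix; column j marks categories[j]."""
    labels = np.asarray(labels, dtype=object).reshape(-1, 1)
    categories = np.asarray(categories, dtype=object).reshape(1, -1)
    return (labels == categories).astype(float)


categories = ["Fußball", "Sudoku", "Tennis"]
labels = ["Tennis", "Fußball", "Fußball", "Sudoku"]

y_demo = one_hot(labels, categories)
print(y_demo)
print(y_demo.argmax(axis=1))
```

Taking `argmax` along each row recovers the position of the category in the list, which is how the predicted and true classes are compared later on.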

In [45]:
X_title = np.vstack(standard_kickers['title_vectors'].to_numpy())
X_title

X_subtitle = np.vstack(standard_kickers['subtitle_vectors'].to_numpy())
X_subtitle

array([[ 17.78750229,  -9.51961136,   6.97259235, ...,   7.5420599 ,
        -20.29470253,  20.51945877],
       [ 30.48065186,  -4.3359499 , -10.55427837, ...,   1.18488801,
        -13.77711582,  -2.79524326],
       [  4.37218094,  -6.30327797,  -5.97113037, ...,   8.06975746,
         -0.17591423, -11.02886868],
       ...,
       [ 24.64826775,  -3.8569994 ,   7.98147011, ...,  -1.30448556,
         -8.15883255,   1.83407009],
       [  7.34537983,   4.33027363,   5.02943897, ..., -17.7085247 ,
         26.16075706,   2.84535241],
       [ 37.31260681, -21.39584732,  14.22052574, ...,  -1.84737384,
        -15.84661865,  20.746521  ]])

In [46]:
X_title_train, X_title_test, Y_train, Y_test = train_test_split(X_title, y, 
                                                    test_size=0.25, 
                                                    random_state=42069)

In [47]:
X_subtitle_train, X_subtitle_test, Y_train, Y_test = train_test_split(X_subtitle, y, 
                                                    test_size=0.25, 
                                                    random_state=42069)


In [48]:
X_title_train = torch.tensor(X_title_train, dtype=torch.float32)
X_title_test = torch.tensor(X_title_test, dtype=torch.float32)

X_subtitle_train = torch.tensor(X_subtitle_train, dtype=torch.float32)
X_subtitle_test = torch.tensor(X_subtitle_test, dtype=torch.float32)

Y_train = torch.tensor(Y_train, dtype=torch.float32)
Y_test = torch.tensor(Y_test, dtype=torch.float32)

In [49]:
class ClassifierTwentyCategories(nn.Module):
    def __init__(self, n_inputs):
        super().__init__()
        self.hidden1 = nn.Linear(n_inputs, 48)
        self.act1 = nn.ReLU()
        self.hidden2 = nn.Linear(48, 24)
        self.act2 = nn.ReLU()
        self.output = nn.Linear(24, 20)
        self.act_output = nn.Sigmoid()

    def forward(self, x):
        x = self.act1(self.hidden1(x))
        x = self.act2(self.hidden2(x))
        x = self.act_output(self.output(x))
        return x

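# note: nn.CrossEntropyLoss applies log-softmax internally and expects raw logits;
# with the Sigmoid output above, its inputs are confined to (0, 1)
# this also rebinds the global loss_fn used by train_binary_model
loss_fn = nn.CrossEntropyLoss()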
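
`nn.CrossEntropyLoss` applies a log-softmax to its inputs, so it is normally given unbounded logits. When the inputs first pass through a sigmoid, as in the class above, they lie in (0, 1), and the loss cannot fall below the value reached when the correct class receives 1 and all others 0. The NumPy sketch below computes that floor for 20 classes; `cross_entropy` is a hypothetical re-implementation for illustration, not the PyTorch code.

```python
import numpy as np


def cross_entropy(scores, target_index):
    """Softmax cross-entropy for one row of scores and one true class index."""
    shifted = scores - np.max(scores)            # for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target_index]


n_classes = 20

# Best case after a sigmoid: correct class saturates at 1, the rest at 0
best_sigmoid_scores = np.zeros(n_classes)
best_sigmoid_scores[0] = 1.0
floor = cross_entropy(best_sigmoid_scores, 0)
print(floor)   # equals log(1 + 19 / e)

# Uninformed case: all scores equal, loss is log(20)
uniform = cross_entropy(np.zeros(n_classes), 0)
print(uniform)

# With unbounded logits the loss can approach 0
logits = np.zeros(n_classes)
logits[0] = 20.0
print(cross_entropy(logits, 0))
```

The losses visible in the training logs of this part lie between the first two values, so removing the final sigmoid (and keeping `argmax` for prediction) is one change that might be worth trying.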

In [50]:
def train_multiclass_model(X_train, Y_train, n_inputs = 96, n_epochs = 1000):

    torch.manual_seed(42069) # seeds PyTorch's RNG; random.seed() does not affect weight initialization

    model = ClassifierTwentyCategories(n_inputs)
    optimizer = optim.Adam(model.parameters(), lr=0.001)


    batch_size = int(len(X_train) / 19)
    print(batch_size)

    for epoch in range(n_epochs):
        for i in range(0, len(X_train), batch_size):
            Xbatch = X_train[i:i+batch_size]
            y_pred = model(Xbatch)
            ybatch = Y_train[i:i+batch_size]
            loss = loss_fn(y_pred, ybatch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        y_pred_category = [torch.argmax(r).item() for r in y_pred]
        ybatch_category = [torch.argmax(r).item() for r in ybatch]
        accuracy = sum([1 if x == y else 0 for x, y in zip(y_pred_category, ybatch_category)]) / len(ybatch_category)
        print(f'Finished epoch {epoch}, latest loss {loss}, accuracy {accuracy}')

    return model



In [51]:
def test_multiclass_model(model, X_test, Y_test, categories = kickers_20):
    
    p = [torch.argmax(r).item() for r in model(X_test)] 
    Y_hat = [categories[i] for i in p]

    p = [torch.argmax(r).item() for r in Y_test] 

    Y_test_values = [categories[i] for i in p]

    acc = sum([1 if x == y else 0 for x, y in zip(Y_hat, Y_test_values)]) / len(Y_test_values)
    print("Accuracy: " + str(acc))

In [52]:
model_titles = train_multiclass_model(X_title_train, Y_train, n_epochs = 1000)

1266
Finished epoch 0, latest loss 2.8781800270080566, accuracy 0.2109004739336493
Finished epoch 1, latest loss 2.769871234893799, accuracy 0.2725118483412322
Finished epoch 2, latest loss 2.674945116043091, accuracy 0.28515007898894157
Finished epoch 3, latest loss 2.6313443183898926, accuracy 0.31990521327014215
Finished epoch 4, latest loss 2.6070873737335205, accuracy 0.325434439178515
Finished epoch 5, latest loss 2.5902791023254395, accuracy 0.334913112164297
Finished epoch 6, latest loss 2.578387498855591, accuracy 0.3412322274881517
Finished epoch 7, latest loss 2.5698373317718506, accuracy 0.3570300157977883
Finished epoch 8, latest loss 2.5616254806518555, accuracy 0.358609794628752
Finished epoch 9, latest loss 2.5545120239257812, accuracy 0.3696682464454976
Finished epoch 10, latest loss 2.546813488006592, accuracy 0.37993680884676145
Finished epoch 11, latest loss 2.539283514022827, accuracy 0.38783570300157977
Finished epoch 12, latest loss 2.531949520111084, accuracy 0.39257503949447076
Finished epoch 13, latest loss 2.524897813796997, accuracy 0.3981

In [53]:
test_multiclass_model(model_titles, X_title_test, Y_test)

Accuracy: 0.4081556303778526


In [54]:
model_subtitles = train_multiclass_model(X_subtitle_train, Y_train, n_epochs = 1000)

1266
Finished epoch 0, latest loss 2.842034101486206, accuracy 0.18325434439178515
Finished epoch 1, latest loss 2.743034839630127, accuracy 0.28278041074249605
Finished epoch 2, latest loss 2.693507671356201, accuracy 0.29936808846761453
Finished epoch 3, latest loss 2.672405958175659, accuracy 0.31200631911532384
Finished epoch 4, latest loss 2.6563010215759277, accuracy 0.31911532385466035
Finished epoch 5, latest loss 2.643259286880493, accuracy 0.3175355450236967
Finished epoch 6, latest loss 2.6313652992248535, accuracy 0.3183254344391785
Finished epoch 7, latest loss 2.6206493377685547, accuracy 0.3222748815165877
Finished epoch 8, latest loss 2.6109185218811035, accuracy 0.32148499210110587
Finished epoch 9, latest loss 2.601322889328003, accuracy 0.32148499210110587
Finished epoch 10, latest loss 2.592183828353882, accuracy 0.32622432859399686
Finished epoch 11, latest loss 2.5837666988372803, accuracy 0.3293838862559242
Finished epoch 12, latest loss 2.5757699012756348, accur

In [55]:
test_multiclass_model(model_subtitles, X_subtitle_test, Y_test)

Accuracy: 0.35952113729891505


In [56]:
X = np.concatenate((np.vstack(standard_kickers['title_vectors'].to_numpy()), np.vstack(standard_kickers['subtitle_vectors'].to_numpy())), axis=1)
print(X.shape)
print(y.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    random_state=42069)

X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)

Y_train = torch.tensor(Y_train, dtype=torch.float32)
Y_test = torch.tensor(Y_test, dtype=torch.float32)


(32073, 192)
(32073, 20)


In [57]:
combined_model = train_multiclass_model(X_train, Y_train, n_inputs=192, n_epochs=1000)

1266
Finished epoch 0, latest loss 2.764134645462036, accuracy 0.2677725118483412
Finished epoch 1, latest loss 2.655111074447632, accuracy 0.31121642969984203
Finished epoch 2, latest loss 2.5902445316314697, accuracy 0.36334913112164297
Finished epoch 3, latest loss 2.546638250350952, accuracy 0.42022116903633494
Finished epoch 4, latest loss 2.5240976810455322, accuracy 0.4352290679304897
Finished epoch 5, latest loss 2.509584903717041, accuracy 0.44944707740916273
Finished epoch 6, latest loss 2.497767210006714, accuracy 0.45813586097946285
Finished epoch 7, latest loss 2.4874181747436523, accuracy 0.4684044233807267
Finished epoch 8, latest loss 2.4776580333709717, accuracy 0.47472353870458134
Finished epoch 9, latest loss 2.468841791152954, accuracy 0.4802527646129542
Finished epoch 10, latest loss 2.4612698554992676, accuracy 0.48894154818325436
Finished epoch 11, latest loss 2.455242872238159, accuracy 0.4921011058451817
Finished epoch 12, latest loss 2.450096368789673, accurac

In [58]:
test_multiclass_model(combined_model, X_test, Y_test)

Accuracy: 0.4706322484100262
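
For context, a majority-class baseline can be derived from the kicker counts listed at the start of this section: always predicting the most frequent kicker (*Fußball*) would be correct for its share of the filtered rows. The sketch below recomputes that share from the printed counts. It uses the full filtered data rather than the test split, so the figure is only approximate for the test set.

```python
# Counts of the 20 most common kickers, copied from the value_counts() output above
kicker_counts = [3162, 3101, 2674, 2414, 1775, 1684, 1518, 1464, 1464, 1356,
                 1252, 1208, 1203, 1180, 1161, 1153, 1134, 1080, 1058, 1032]

n_rows = sum(kicker_counts)            # rows kept after filtering
majority_share = max(kicker_counts) / n_rows

print(n_rows)
print(round(majority_share, 4))
```

The row total agrees with the shape of the filtered feature matrix printed above, and the resulting share of roughly 10% is the reference point for the test accuracies in this section.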


## 3.2: Storylabels

In [59]:
# filtering for the most common ones

storylabels_20 = standard_data_copy['storylabels'].value_counts().nlargest(20).index.tolist()

standard_storylabels = standard_data_copy.copy()
standard_storylabels = standard_storylabels[standard_storylabels['storylabels'].isin(storylabels_20)].reset_index()[['storylabels', 'title_vectors', 'subtitle_vectors']]

standard_storylabels.head(3)

Unnamed: 0,storylabels,title_vectors,subtitle_vectors
0,Kopf des Tages,"[20.26102, -2.1764362, 12.056327, -8.353496, 2...","[12.673783, -13.09248, 8.577187, -2.7328134, 9..."
1,Spiel,"[-3.334474, -4.604315, -2.6970067, -2.6541345,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,Spiel,"[-0.60618734, 1.1090474, 3.5244317, -7.726671,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [60]:
# dependent variable: binary (one-hot) vector of length 20

y = np.zeros((len(standard_storylabels), 20))
print(y.shape)

for row in range(y.shape[0]):
    y_row = [1 if s == standard_storylabels['storylabels'][row] else 0 for s in storylabels_20]
    y[row,] = y_row

y

(36407, 20)


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [61]:
X_title = np.vstack(standard_storylabels['title_vectors'].to_numpy())
X_title

X_subtitle = np.vstack(standard_storylabels['subtitle_vectors'].to_numpy())
X_subtitle

array([[ 12.6737833 , -13.09247971,   8.57718658, ...,  14.81214142,
        -18.95225525,  -3.14274883],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       ...,
       [ 11.92391777,  -4.8767643 ,   7.59373331, ...,   3.75452614,
          8.02588081, -17.04307175],
       [ 14.82507515,  -0.38778591,  -6.76249266, ...,   0.80372906,
        -13.19830894, -12.51316452],
       [ 11.57326698,   3.72351551,  18.47734642, ...,  -3.31539655,
          5.45101213,  -0.74486721]])

In [62]:
X_title_train, X_title_test, Y_train, Y_test = train_test_split(X_title, y, 
                                                    test_size=0.25, 
                                                    random_state=42069)

In [63]:
X_subtitle_train, X_subtitle_test, Y_train, Y_test = train_test_split(X_subtitle, y, 
                                                    test_size=0.25, 
                                                    random_state=42069)


In [64]:
X_title_train = torch.tensor(X_title_train, dtype=torch.float32)
X_title_test = torch.tensor(X_title_test, dtype=torch.float32)

X_subtitle_train = torch.tensor(X_subtitle_train, dtype=torch.float32)
X_subtitle_test = torch.tensor(X_subtitle_test, dtype=torch.float32)

Y_train = torch.tensor(Y_train, dtype=torch.float32)
Y_test = torch.tensor(Y_test, dtype=torch.float32)

In [65]:
model_titles = train_multiclass_model(X_title_train, Y_train, n_epochs = 1000)

1437
Finished epoch 0, latest loss 2.8629751205444336, accuracy 0.0
Finished epoch 1, latest loss 2.536057472229004, accuracy 0.0
Finished epoch 2, latest loss 2.445584297180176, accuracy 0.0
Finished epoch 3, latest loss 2.3903446197509766, accuracy 0.0
Finished epoch 4, latest loss 2.345547676086426, accuracy 0.5
Finished epoch 5, latest loss 2.322000503540039, accuracy 0.5
Finished epoch 6, latest loss 2.293304443359375, accuracy 0.5
Finished epoch 7, latest loss 2.2746829986572266, accuracy 0.5
Finished epoch 8, latest loss 2.2608227729797363, accuracy 0.5
Finished epoch 9, latest loss 2.2539162635803223, accuracy 0.5
Finished epoch 10, latest loss 2.2491071224212646, accuracy 0.5
Finished epoch 11, latest loss 2.245262384414673, accuracy 0.5
Finished epoch 12, latest loss 2.2420082092285156, accuracy 0.5
Finished epoch 13, latest loss 2.2398834228515625, accuracy 0.5
Finished epoch 14, latest loss 2.2372069358825684, accuracy 0.5
Finished epoch 15, latest loss 2.2354044914245605, 

In [66]:
test_multiclass_model(model_titles, X_title_test, Y_test)

Accuracy: 0.3412436827070973
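
The training accuracies of exactly 0.0 and 0.5 in the log above are consistent with a very small final mini-batch, since the printed accuracy is computed from the last batch of each epoch. The sketch below reproduces the batch arithmetic. It assumes that `train_test_split` rounds the test share up (which matches the batch sizes printed in this notebook), and `last_batch_size` is a hypothetical helper.

```python
import math


def last_batch_size(n_samples, test_size, n_parts):
    """Return (batch_size, size of the final mini-batch) for the training split."""
    n_test = math.ceil(test_size * n_samples)   # assumption: test share rounded up
    n_train = n_samples - n_test
    batch_size = int(n_train / n_parts)         # same formula as in the training functions
    remainder = n_train % batch_size
    return batch_size, remainder if remainder else batch_size


storylabels_batches = last_batch_size(36407, 0.25, 19)   # storylabels data
kickers_batches = last_batch_size(32073, 0.25, 19)       # kickers data
binary_batches = last_batch_size(182107, 0.2, 50)        # binary task on all rows

print(storylabels_batches)
print(kickers_batches)
print(binary_batches)
```

With two rows in the final batch, the only possible accuracies are 0.0, 0.5 and 1.0, so these per-epoch figures say little about the model; the test accuracy is the more informative number.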


In [67]:
model_subtitles = train_multiclass_model(X_subtitle_train, Y_train, n_epochs = 1000)

1437
Finished epoch 0, latest loss 2.7888503074645996, accuracy 0.0
Finished epoch 1, latest loss 2.623957872390747, accuracy 0.0
Finished epoch 2, latest loss 2.497129440307617, accuracy 0.0
Finished epoch 3, latest loss 2.405191421508789, accuracy 0.0
Finished epoch 4, latest loss 2.368117332458496, accuracy 0.0
Finished epoch 5, latest loss 2.3386454582214355, accuracy 0.0
Finished epoch 6, latest loss 2.3347182273864746, accuracy 0.0
Finished epoch 7, latest loss 2.331000804901123, accuracy 0.0
Finished epoch 8, latest loss 2.327460289001465, accuracy 0.0
Finished epoch 9, latest loss 2.3253769874572754, accuracy 0.0
Finished epoch 10, latest loss 2.3229448795318604, accuracy 0.0
Finished epoch 11, latest loss 2.3167831897735596, accuracy 0.0
Finished epoch 12, latest loss 2.3037097454071045, accuracy 0.0
Finished epoch 13, latest loss 2.2976508140563965, accuracy 0.0
Finished epoch 14, latest loss 2.2960104942321777, accuracy 0.0
Finished epoch 15, latest loss 2.2950377464294434, 

In [68]:
test_multiclass_model(model_subtitles, X_subtitle_test, Y_test)

Accuracy: 0.33278400351571086


In [69]:
X = np.concatenate((np.vstack(standard_storylabels['title_vectors'].to_numpy()), np.vstack(standard_storylabels['subtitle_vectors'].to_numpy())), axis=1)
print(X.shape)
print(y.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    random_state=42069)

X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)

Y_train = torch.tensor(Y_train, dtype=torch.float32)
Y_test = torch.tensor(Y_test, dtype=torch.float32)


(36407, 192)
(36407, 20)


In [70]:
combined_model = train_multiclass_model(X_train, Y_train, n_inputs=192, n_epochs=1000)

1437
Finished epoch 0, latest loss 2.9918313026428223, accuracy 0.0
Finished epoch 1, latest loss 2.4952359199523926, accuracy 0.0
Finished epoch 2, latest loss 2.341116428375244, accuracy 0.0
Finished epoch 3, latest loss 2.2644762992858887, accuracy 0.0
Finished epoch 4, latest loss 2.246694803237915, accuracy 0.0
Finished epoch 5, latest loss 2.2337918281555176, accuracy 0.0
Finished epoch 6, latest loss 2.23345947265625, accuracy 0.0
Finished epoch 7, latest loss 2.2331395149230957, accuracy 0.0
Finished epoch 8, latest loss 2.229830026626587, accuracy 0.0
Finished epoch 9, latest loss 2.229196548461914, accuracy 0.0
Finished epoch 10, latest loss 2.228454828262329, accuracy 0.0
Finished epoch 11, latest loss 2.228076457977295, accuracy 0.0
Finished epoch 12, latest loss 2.2278263568878174, accuracy 0.0
Finished epoch 13, latest loss 2.227297782897949, accuracy 0.0
Finished epoch 14, latest loss 2.226919174194336, accuracy 0.0
Finished epoch 15, latest loss 2.226667881011963, accur

In [71]:
test_multiclass_model(combined_model, X_test, Y_test)

Accuracy: 0.42342342342342343


## Summary

We have processed the scraped data and trained classification models. The tok2vec encoding of article title and subtitle improves upon the baseline accuracy on test data in both the binary and the multi-class classification tasks; in the binary task, the combined model reaches 0.67 against a baseline of 0.57. However, there is still considerable room for improvement. This could be achieved by "brute-forcing" the models (more layers, more nodes, more epochs, etc.), by using a more sophisticated network architecture such as an RNN, or by using superior encodings, such as the larger models available for spacy.

## References

https://machinelearningmastery.com/develop-your-first-neural-network-with-pytorch-step-by-step/

http://mitloehner.com/lehre/dsai1/

https://pytorch.org/docs/stable/nn.html

https://spacy.io/models/de
