<a href="https://colab.research.google.com/github/nicolaiberk/nlpdl_project/blob/main/02_Predict_articles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
!pip install transformers



In [5]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments, DistilBertTokenizerFast
import os
import torch
import pandas as pd
import numpy as np

In [21]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-german-cased", num_labels = 6)
model.load_state_dict(torch.load(os.path.join("drive", "MyDrive", "nlpdl", "01_PPR_model.bin", "pytorch_model.bin")))

Some weights of the model checkpoint at distilbert-base-german-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-german-cased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias

<All keys matched successfully>

In [22]:
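# load news articles and drop rows with missing values;
# reset_index() afterwards keeps the original row numbers in an "index"
# column and gives the remaining rows a gap-free index
news = pd.read_csv(os.path.join("drive", "MyDrive", "nlpdl", "subset_news.csv"))
news = news.dropna().reset_index()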

In [23]:
texts = list(news["text"])
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-german-cased")

In [24]:
# tokenize texts
news_encodings = tokenizer(texts, truncation=True, padding=True)

In [25]:
# generate placeholder labels for the dataset class, which expects a label per
# item; they are derived from the news source, not from party annotations

labels = news['source']

# map each source to an integer id (sorted so the mapping is reproducible)
label_dict = {s: i for i, s in enumerate(sorted(set(str(l) for l in labels)))}

labels = [label_dict[str(l)] for l in labels]

In [26]:
class NEWSDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

news_dataset = NEWSDataset(news_encodings, labels)

In [27]:
trainer = Trainer(
    model=model
)

In [28]:
eval_res = trainer.predict(news_dataset)

In [29]:
np.unique(eval_res.predictions.argmax(-1))

array([0, 1, 2, 3, 4, 5])
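The check above only confirms that every class is predicted at least once. A minimal sketch of how the same call could also report how often each class occurs, using made-up predictions (`toy_preds` and `class_counts` are hypothetical names, not part of the notebook):

```python
import numpy as np

# Hypothetical predicted class indices for eight articles (made-up values).
toy_preds = np.array([5, 5, 4, 5, 0, 3, 5, 4])

# return_counts=True returns the sorted unique values and their frequencies.
classes, counts = np.unique(toy_preds, return_counts=True)

# Pair each class index with the number of times it was predicted.
class_counts = dict(zip(classes.tolist(), counts.tolist()))
print(class_counts)
```

In the notebook, the same call on `eval_res.predictions.argmax(-1)` would show whether a single class dominates the predictions.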

In [30]:
preds = eval_res.predictions.argmax(-1)
# class index -> party name, in the label order 0..5
parties = np.array(["Grüne", "Union", "AfD", "SPD", "Linke", "FDP"])
preds_parties = parties[preds]

In [37]:
pd.crosstab(news.source, preds_parties, dropna=False).apply(lambda r: r/r.sum(), axis=1).round(2)

source   AfD   FDP  Grüne  Linke   SPD  Union
faz     0.02  0.87   0.01   0.01  0.08   0.02
spon    0.00  0.89   0.00   0.02  0.03   0.06
taz     0.01  0.40   0.01   0.50  0.08   0.01
welt    0.02  0.84   0.02   0.01  0.09   0.02
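The table above shows, for each news source, the share of articles assigned to each party. As a sketch of an equivalent way to get row shares, `pd.crosstab` accepts `normalize="index"`, which divides each row by its total. The data below is made up for illustration, and `toy` and `shares` are hypothetical names:

```python
import pandas as pd

# Made-up sources and predicted parties, for illustration only.
toy = pd.DataFrame({
    "source": ["faz", "faz", "taz", "taz", "taz", "welt"],
    "pred":   ["FDP", "SPD", "Linke", "Linke", "FDP", "FDP"],
})

# normalize="index" turns the counts in each row into shares that sum to 1.
shares = pd.crosstab(toy["source"], toy["pred"], normalize="index").round(2)
print(shares)
```

This avoids the explicit `apply` with a lambda and should give the same row-wise proportions.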


In [32]:
# store the raw model outputs (logits, not probabilities) per party
news['green'] = eval_res.predictions[:,0]
news['union'] = eval_res.predictions[:,1]
news['afd']   = eval_res.predictions[:,2]
news['spd']   = eval_res.predictions[:,3]
news['linke'] = eval_res.predictions[:,4]
news['fdp']   = eval_res.predictions[:,5]
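The columns assigned above hold the model's raw outputs (logits), which are not bounded between 0 and 1. If probabilities are wanted instead, a row-wise softmax could be applied first. The sketch below uses NumPy and made-up logits; the `softmax` helper and the `toy_` variables are hypothetical and not part of the notebook:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for two articles and six classes (made-up numbers).
toy_logits = np.array([
    [2.0, 0.5, -1.0, 0.0, 0.3, 4.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
])

# Each row of toy_probs is non-negative and sums to 1.
toy_probs = softmax(toy_logits)
print(toy_probs.round(3))
```

In the notebook, `softmax(eval_res.predictions)` would yield one probability per party in the same column order as the logits.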

In [33]:
news.to_csv(os.path.join("drive", "MyDrive", "nlpdl", "subset_news_pred.csv"))