<a href="https://colab.research.google.com/github/nicolaiberk/nlpdl_project/blob/main/BaselineBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook builds the baseline model for classifying party press releases, which is subsequently used to measure newspaper bias. It is based on [this Hugging Face tutorial](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb). Let's first install the necessary packages.

In [1]:
!pip install transformers
!pip install datasets



Import necessary packages

In [2]:
import pandas as pd
from datasets import load_dataset, load_metric

Load and prepare the training and test sets

In [3]:
df = pd.read_csv('drive/MyDrive/germanyPPRs.csv', engine="python")

In [4]:
# Drop rows with missing values, then draw a random subsample of 1,000 rows
df = df.dropna()
df = df.sample(1000)

In [5]:
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

In [6]:
trainset = df[df.date < pd.to_datetime('2018-01-01', format='%Y-%m-%d')]
testset = df[df.date >= pd.to_datetime('2018-01-01', format='%Y-%m-%d')]
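
The split is temporal: everything before 2018 is used for training and everything from 2018-01-01 onward for testing. With only 1,000 sampled rows it may be worth confirming that both splits are non-empty and do not overlap. A minimal sketch on a toy frame (the rows and the names `toy`, `toy_train`, `toy_test` are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the press-release data; values are invented
toy = pd.DataFrame({
    "date": ["2017-06-30", "2017-12-31", "2018-01-01", "2019-03-15"],
    "label": ["A", "B", "A", "B"],
})
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d")

cutoff = pd.Timestamp("2018-01-01")
toy_train = toy[toy["date"] < cutoff]   # strictly before the cutoff
toy_test = toy[toy["date"] >= cutoff]   # cutoff day and later

# Every row lands in exactly one split, and training data precedes test data
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train["date"].max() < toy_test["date"].min()
print(len(toy_train), len(toy_test))
```

The same two assertions can be run on `trainset` and `testset`; comparing `trainset.label.value_counts()` with `testset.label.value_counts()` additionally shows whether every party appears in both splits.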

Tokenize the training and test texts

In [7]:
from transformers import AutoTokenizer
model_checkpoint = "distilbert-base-german-cased"
batch_size = 32

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [8]:
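# Attach an integer "label" to every example (the model needs it to compute a loss); batches are padded later by the Trainer's collator
label2id = {label: i for i, label in enumerate(sorted(df.label.unique()))}
encoded_trainset = [dict(tokenizer(x, truncation=True), label=label2id[y]) for x, y in zip(trainset.rawtext, trainset.label)]
encoded_testset = [dict(tokenizer(x, truncation=True), label=label2id[y]) for x, y in zip(testset.rawtext, testset.label)]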
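
In the tokenizer, `padding=True` pads to the longest sequence within one call, so tokenizing one text at a time yields unpadded sequences of different lengths; padding then has to happen when batches are formed. The idea can be sketched in plain Python. `pad_batch` and the token ids below are hypothetical stand-ins, not the tokenizer's actual implementation:

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence in one batch to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Two sequences in one batch: the shorter one is padded
ids, mask = pad_batch([[5, 6, 7], [8]])

# A "batch" of one sequence: nothing to pad against
single_ids, single_mask = pad_batch([[8]])
```

When a tokenizer is passed to the `Trainer`, its default data collator performs this kind of per-batch padding.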

Fine-tune the model

In [9]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = len(df.label.unique())
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Some weights of the model checkpoint at distilbert-base-german-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-german-cased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias

In [10]:
args = TrainingArguments(
    "test-glue",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='f1',
)
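
`metric_for_best_model='f1'` only works if evaluation reports a metric named `f1`, and the `Trainer` gets such values from a `compute_metrics` callback. A minimal sketch using only NumPy, assuming macro-averaged F1 is the intended variant (the notebook does not say which one is meant); the function body is our own illustration:

```python
import numpy as np

def compute_metrics(eval_pred):
    """Macro-averaged F1 from a (logits, labels) pair as passed by the Trainer."""
    logits, labels = eval_pred
    logits, labels = np.asarray(logits), np.asarray(labels)
    preds = logits.argmax(axis=-1)  # predicted class = highest logit

    f1_scores = []
    # Average over the classes that occur in the true labels
    for c in np.unique(labels):
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return {"f1": float(np.mean(f1_scores))}

# Toy check with two classes and four examples
toy_logits = [[2.0, 1.0], [0.0, 3.0], [1.0, 0.0], [0.0, 1.0]]
toy_labels = [0, 1, 1, 1]
toy_result = compute_metrics((toy_logits, toy_labels))
```

The function would be handed to the trainer via its `compute_metrics` argument. Note also that recent versions of `transformers` expect the save strategy to match the evaluation strategy when `load_best_model_at_end=True`.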

In [11]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_trainset,
    eval_dataset=encoded_testset,
    tokenizer=tokenizer
)

In [12]:
trainer.train()

KeyError: ignored
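
The collapsed traceback does not show which key was missing. One thing worth ruling out is an example that lacks an entry the model or the `Trainer` needs. A small, hypothetical helper for that check; the required key names are an assumption based on the usual DistilBERT tokenizer output plus a `label` entry:

```python
def find_missing_keys(examples, required=("input_ids", "attention_mask", "label")):
    """Return {index: [missing keys]} for examples that lack a required key."""
    problems = {}
    for i, example in enumerate(examples):
        missing = [key for key in required if key not in example]
        if missing:
            problems[i] = missing
    return problems

# Toy examples: the second one has no label
toy_examples = [
    {"input_ids": [101, 7, 102], "attention_mask": [1, 1, 1], "label": 0},
    {"input_ids": [101, 9, 102], "attention_mask": [1, 1, 1]},
]
report = find_missing_keys(toy_examples)
print(report)  # {1: ['label']}
```

In the notebook, `find_missing_keys(encoded_trainset)` and `find_missing_keys(encoded_testset)` should both return an empty dict before `trainer.train()` is called.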