<a href="https://colab.research.google.com/github/nicolaiberk/bild/blob/main/np_slant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Estimating migration slant in German newspapers using BERT

This notebook estimates the sentiment of migration coverage in German newspapers over time. It draws a subset of articles for each year, splits them into sentences, keeps the sentences that contain migration-related terms (identified earlier using dictionary expansion with word embeddings), and estimates the sentiment of these sentences using a BERT transformer model.
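
The code cells in this notebook loop over every article and do not show the per-year subsetting step. The following is a minimal sketch of how such a draw could look, assuming each article record carries a `year` field. The field name, `sample_per_year`, `n_per_year`, and `seed` are hypothetical, since the column layout of the article files is not shown here. It uses only the standard library so that it runs without the data.

```python
import random
from collections import defaultdict

def sample_per_year(articles, n_per_year, seed=42):
    """Draw up to n_per_year articles for each publication year."""
    by_year = defaultdict(list)
    for article in articles:
        by_year[article["year"]].append(article)

    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    sample = []
    for year in sorted(by_year):
        group = by_year[year]
        # take the whole group when a year has fewer articles than requested
        sample.extend(rng.sample(group, min(n_per_year, len(group))))
    return sample

# toy records (assumption); the real records would come from the article CSVs
articles = [
    {"year": 2015, "text": "a"},
    {"year": 2015, "text": "b"},
    {"year": 2015, "text": "c"},
    {"year": 2016, "text": "d"},
]
subset = sample_per_year(articles, n_per_year=2)
```

Capping the draw at the group size avoids an error for years with few articles, at the cost of unequal sample sizes across years.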

In [1]:
!pip install transformers

In [2]:
import pandas as pd
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
import transformers
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [3]:
# define input data and migration dictionary
url_list = ["https://www.dropbox.com/s/fg2j14sckivbv9e/_bild_articles.csv?dl=1",
            "https://www.dropbox.com/s/gu74lpyys9g30vk/_faz_articles.csv?dl=1",
            "https://www.dropbox.com/s/qqvd9qbgd16q6ny/_spon_articles.csv?dl=1",
            "https://www.dropbox.com/s/53yeud52h3r1hc4/_sz_articles.csv?dl=1",
            "https://www.dropbox.com/s/2atgbzx4dzbq6nd/_taz_articles.csv?dl=1",
            "https://www.dropbox.com/s/c1gmzcriuh337vd/_weltonline_articles.csv?dl=1"]
paper_list = ["bild", "faz", "spon", "sz", "taz", "weltonline"]
mig_dict = pd.read_csv("https://www.dropbox.com/s/65n22q1l19xkkmu/german_glove.csv?dl=1", encoding="latin-1")["x"]

In [4]:
# setup BERT
model_name = "oliverguhr/german-sentiment-bert"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

In [None]:
for url, paper in zip(url_list, paper_list):
  
  # load data
  dta = pd.read_csv(url)
  dta = dta[dta.text == dta.text] # gets rid of missings
  dta = dta.reset_index(drop=True)
  dta["mig_sent"] = np.nan
  dta["mig_sent_sd"] = np.nan
  dta["mig_sent_n"] = np.nan

  for row in range(dta.shape[0]):

    ## print progress
    print(f"\r Row {row} of {dta.shape[0]}", end="")

    ## split sentences from cleaned text
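    ## (str.replace matches literally, so each quote character is replaced separately)
    text = dta.loc[row, "text"].replace('„', " ").replace('”', " ").replace('"', " ").replace('!', "! ")
    sentences = sent_tokenize(text)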

    ## filter sentences about migration    
    rel_sents = []
    for sent in sentences:
      ### tokenize and lowercase tokens of the sentence
      tokenized_sent = [word.lower() for word in word_tokenize(sent)]
      ### if any item in the tokenized sentence is a keyword, append the original sentence
      if any(keyw in tokenized_sent for keyw in mig_dict):
        rel_sents.append(sent)

    ## estimate migration sentiment as one signed score per relevant sentence
    estimates = []
    if len(rel_sents) > 0:
      results = classifier(rel_sents)
      for result in results:
        if result["label"] == "positive":
          estimates.append(result["score"])
        elif result["label"] == "neutral":
          estimates.append(0)
        elif result["label"] == "negative":
          estimates.append(result["score"] * -1)

    ## calculate mean, sd & n of sentences
    ## (without relevant sentences, mean and sd stay NaN and n is 0)
    if len(estimates) > 0:
      dta.loc[row, "mig_sent"] = np.mean(estimates)
      dta.loc[row, "mig_sent_sd"] = np.std(estimates)
    dta.loc[row, "mig_sent_n"] = len(estimates)

  # write to csv
  dta.to_csv("".join(["drive/MyDrive/Bild/", paper, "_estimates.csv"]))
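
The scoring step in the loop above can be read in isolation: each classifier result becomes a signed score, and the scores of one article are then summarised. The sketch below reproduces that mapping on hand-written results, so it runs without downloading the model. The helper names `signed_score` and `summarize` are hypothetical, the toy results are not real model output, and returning n = 0 with NaN mean and sd for an article without relevant sentences is an assumption about the intended behaviour.

```python
import math
import statistics

def signed_score(result):
    """Map one classifier result to a signed score in [-1, 1]."""
    if result["label"] == "positive":
        return result["score"]
    if result["label"] == "negative":
        return -result["score"]
    if result["label"] == "neutral":
        return 0.0
    raise ValueError(f"unexpected label: {result['label']}")

def summarize(results):
    """Return (mean, sd, n) of the signed scores; mean and sd are NaN when n is 0."""
    estimates = [signed_score(r) for r in results]
    if not estimates:
        return math.nan, math.nan, 0
    # pstdev is the population standard deviation, like np.std with its default ddof=0
    return statistics.fmean(estimates), statistics.pstdev(estimates), len(estimates)

# hand-written results in the pipeline's output format (assumption, not model output)
toy_results = [
    {"label": "positive", "score": 0.9},
    {"label": "negative", "score": 0.5},
    {"label": "neutral", "score": 0.99},
]
mean, sd, n = summarize(toy_results)
```

One consequence of this mapping is that the confidence of a neutral prediction is discarded: a sentence classified as neutral contributes 0 regardless of its score.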

## Limitations
A limitation of this approach is that the estimated sentiment need not be directed at migration itself: a sentence may, for example, describe the horrible conditions of migration, or express sentiment about something else entirely. A second approach could therefore replace all migration-related terms with a single token and track how that token correlates with specific terms or frames over time.
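
As a rough illustration of that second approach, the sketch below replaces dictionary terms with one placeholder token and counts which words share a sentence with it. This is only a sketch under stated assumptions: the placeholder name `MIGTERM`, the helper names, the regex-based word splitting, and the toy dictionary and sentences are all hypothetical, and raw co-occurrence counts stand in for whatever correlation measure would actually be used.

```python
import re
from collections import Counter

PLACEHOLDER = "MIGTERM"  # hypothetical token name

def mask_terms(sentence, terms, placeholder=PLACEHOLDER):
    """Replace every word found in terms (case-insensitive) with the placeholder."""
    def swap(match):
        word = match.group(0)
        return placeholder if word.lower() in terms else word
    return re.sub(r"\w+", swap, sentence)

def cooccurrence_counts(sentences, terms, placeholder=PLACEHOLDER):
    """Count words that share a sentence with the placeholder token."""
    counts = Counter()
    for sentence in sentences:
        words = re.findall(r"\w+", mask_terms(sentence, terms, placeholder))
        if placeholder in words:
            counts.update(w.lower() for w in words if w != placeholder)
    return counts

terms = {"flüchtlinge", "migranten"}  # toy dictionary (assumption)
sentences = [
    "Die Flüchtlinge erreichen die Grenze.",
    "Migranten warten an der Grenze.",
    "Das Wetter bleibt sonnig.",
]
masked = [mask_terms(s, terms) for s in sentences]
counts = cooccurrence_counts(sentences, terms)
```

Computing such counts separately for each year would give the over-time comparison described above.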