## 2. Labelling recent election manifestos

The Manifesto Project data run through 2022. To classify this year's Dutch election manifestos, we use automatic classification models. We split the manifestos, which were retrieved using code from [this repository](https://github.com/vanatteveldt/2023-manifestos-nl/tree/main), into sentences, and feed those sentences to two classification models: one that determines whether a sentence is relevant at all, and one that assigns the appropriate code to the relevant sentences. We trained the first classification model ourselves, and it is publicly available [here](https://huggingface.co/joris/manifesto-dutch-binary-relevance); the second was [made available](https://huggingface.co/manifesto-project/manifestoberta-xlm-roberta-56policy-topics-context-2023-1-1) by researchers from the Manifesto Project.
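
The two-stage setup can be sketched as follows. This is a minimal illustration, not the actual models: `is_relevant_stub` and `topic_stub` are hypothetical stand-ins based on toy keyword rules, and the topic codes they return are placeholders. The point is only the control flow: the relevance model acts as a gate, and only sentences that pass it are sent to the topic model, while the rest receive the code `000`.

```python
def is_relevant_stub(sentence):
    """Hypothetical stand-in for the binary relevance model."""
    # Toy rule: treat very short fragments (titles, headers) as irrelevant.
    return len(sentence.split()) >= 4


def topic_stub(sentence):
    """Hypothetical stand-in for the policy-topic model."""
    # Toy rule with placeholder codes, for illustration only.
    return "501" if "klimaat" in sentence.lower() else "305"


def label_sentences(sentences, relevance, topic):
    """Run the cheap relevance gate first, then the topic model if needed."""
    labelled = []
    for sentence in sentences:
        if relevance(sentence):
            labelled.append((sentence, topic(sentence)))
        else:
            # '000' marks sentences that are not relevant.
            labelled.append((sentence, "000"))
    return labelled


example = [
    "Verkiezingsprogramma 2023-2027",
    "Wij willen meer investeren in klimaat en natuur.",
]
result = label_sentences(example, is_relevant_stub, topic_stub)
```

Here the title line is filtered out by the gate, and only the second sentence reaches the topic stub.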

In [1]:
import spacy
import pandas as pd
from tqdm.auto import tqdm

nlp = spacy.load('nl_core_news_sm')
data = pd.read_csv('data/manifestos_nl_2023.csv')

In [2]:
texts = list(data.text)
# Parse all manifestos in parallel; NER is not needed for sentence splitting.
docs = nlp.pipe(texts, n_process=8, disable=["ner"])  # eight worker processes
processed_texts = list(tqdm(docs, total=len(texts)))

In [3]:
sentences = [[sent.text for sent in text.sents] for text in processed_texts]
data['sentences'] = sentences

In [4]:
# Flatten to one row per sentence, carrying over the manifesto metadata.
rows = []
for _, row in data.iterrows():
    for sentence in row.sentences:
        rows.append(dict(party=row['party'],
                         url=row['url'],
                         title=row['title'],
                         sentence=sentence))
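
As an alternative to the explicit loop, the same long format can be produced with pandas' `DataFrame.explode`. The following is a minimal sketch on an invented toy frame (the party names, URLs, titles, and sentences are placeholders, not data from the manifestos):

```python
import pandas as pd

# Toy frame with the same columns as `data` after sentence splitting;
# all values are invented for illustration.
toy = pd.DataFrame({
    "party": ["A", "B"],
    "url": ["https://example.org/a.pdf", "https://example.org/b.pdf"],
    "title": ["Programma A", "Programma B"],
    "sentences": [["Zin een.", "Zin twee."], ["Zin drie."]],
})

# One row per list element; the other columns are repeated for each sentence.
long_format = (
    toy.explode("sentences")
       .rename(columns={"sentences": "sentence"})
       .reset_index(drop=True)
)
```

Unlike the loop, `explode` keeps a row with a missing value for a manifesto whose sentence list is empty, so such rows would need to be dropped afterwards.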

In [22]:
sentence_data = pd.DataFrame(rows)

sentence_data.head(5)

| | party | url | title | sentence |
|---|---|---|---|---|
| 0 | BBB | https://boerburgerbeweging.nl/wp-content/uploa... | BBB partijprogramma 2023 | Van Vertrouwenscrisis naar Noaberstaat |
| 1 | BBB | https://boerburgerbeweging.nl/wp-content/uploa... | BBB partijprogramma 2023 | Visie en Verkiezingsprogramma 2023-2027 |
| 2 | BBB | https://boerburgerbeweging.nl/wp-content/uploa... | BBB partijprogramma 2023 | DANKWOORD Graag richt ik een bijzonder woord v... |
| 3 | BBB | https://boerburgerbeweging.nl/wp-content/uploa... | BBB partijprogramma 2023 | Programmacommissie, werkgroepen, schrijfteam, ... |
| 4 | BBB | https://boerburgerbeweging.nl/wp-content/uploa... | BBB partijprogramma 2023 | We zijn jullie zonder uitzondering zeer veel d... |


In [5]:
sentence_data['sentence_id'] = range(len(sentence_data))

## Predict sentences

In [6]:
import os

# Set before importing torch so that the MPS fallback setting is picked up.
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class CMPClassifier:
    """Assigns a Manifesto Project policy code to a sentence, given its context."""

    def __init__(self):
        self.model = AutoModelForSequenceClassification.from_pretrained(
            "manifesto-project/manifestoberta-xlm-roberta-56policy-topics-context-2023-1-1",
            trust_remote_code=True)
        self.tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
        # Prefer CUDA, then Apple MPS, and fall back to the CPU otherwise.
        if torch.cuda.is_available():
            self.device = "cuda"
        elif torch.backends.mps.is_available():
            self.device = "mps"
        else:
            self.device = "cpu"
        self.model.to(self.device)

    def predict(self, text, context=None):
        """Yield (label, probability) pairs, most probable first."""
        inputs = self.tokenizer(
            text,
            context or text,
            return_tensors="pt",
            max_length=300,
            padding="max_length",
            truncation=True,
        ).to(self.device)

        # Inference only, so no gradients are needed.
        with torch.no_grad():
            logits = self.model(**inputs).logits
        probabilities = torch.softmax(logits, dim=1).tolist()[0]

        for i, p in sorted(enumerate(probabilities), key=lambda item: -item[1]):
            yield self.model.config.id2label[i], p
        
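class RelevanceClassifier:
    """Binary filter that decides whether a sentence needs a policy code at all."""

    def __init__(self):
        self.model = AutoModelForSequenceClassification.from_pretrained(
            "joris/manifesto-dutch-binary-relevance")
        self.tokenizer = AutoTokenizer.from_pretrained("joris/manifesto-dutch-binary-relevance")
        # Prefer CUDA, then Apple MPS, and fall back to the CPU otherwise.
        if torch.cuda.is_available():
            self.device = "cuda"
        elif torch.backends.mps.is_available():
            self.device = "mps"
        else:
            self.device = "cpu"
        self.model.to(self.device)
        # '000' marks a sentence as not relevant; 'Other' sentences
        # are passed on to the topic model.
        self.id2label = {0: 'Other', 1: '000'}

    def predict(self, text):
        """Yield (label, probability) pairs, most probable first."""
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            max_length=128,
            padding="max_length",
            truncation=True,
        ).to(self.device)

        # Inference only, so no gradients are needed.
        with torch.no_grad():
            logits = self.model(**inputs).logits
        probabilities = torch.softmax(logits, dim=1).tolist()[0]
        for i, p in sorted(enumerate(probabilities), key=lambda item: -item[1]):
            yield self.id2label[i], p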
        
relevance_clf = RelevanceClassifier()
cmp_clf = CMPClassifier()
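
Both `predict` methods follow the same pattern: turn the model's logits into probabilities with a softmax, then yield the labels from most to least probable. A dependency-free sketch of that step is shown below; the logit values are invented, and `rank_labels` is a hypothetical helper used only for this illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    # Subtracting the maximum keeps the exponentials numerically stable.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(logits, id2label):
    """Return (label, probability) pairs sorted by descending probability."""
    probabilities = softmax(logits)
    ranked = sorted(enumerate(probabilities), key=lambda item: -item[1])
    return [(id2label[i], p) for i, p in ranked]


# Same label mapping as the relevance classifier; the logits are made up.
toy_id2label = {0: "Other", 1: "000"}
ranking = rank_labels([2.0, 0.5], toy_id2label)
top_label, top_prob = ranking[0]
```

Taking the first element of the ranking corresponds to keeping only the most probable label, which is all the labelling loop uses.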

In [26]:
import re

# Matches everything except printable ASCII and basic whitespace.
# Note that this also strips non-ASCII characters such as accented letters.
NON_PRINTABLE = re.compile(r'[^\x20-\x7E\n\t\r]')

def remove_non_printable_chars(text):
    return NON_PRINTABLE.sub('', text)
                  
id2labels = dict()
for party, party_df in sentence_data.groupby('party'):
    sentence_ids = list(party_df.sentence_id.values)
    sentences_by_party = [remove_non_printable_chars(sent)
                          for sent in party_df.sentence.values]
    for i, sentence in enumerate(tqdm(sentences_by_party)):
        # Context: the previous sentence, the sentence itself, and the next two.
        context = ' '.join(sentences_by_party[max(i - 1, 0):i + 3])
        # Stage 1: relevance filter; keep only the most probable label.
        cmp, confidence = next(relevance_clf.predict(sentence))
        # Stage 2: sentences not marked '000' get a policy code.
        if cmp != '000':
            cmp, confidence = next(cmp_clf.predict(sentence, context))
        id2labels[sentence_ids[i]] = (cmp, confidence)
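
The context passed to the topic model is a sliding window over the sentences of one party: the previous sentence, the sentence itself, and the next two, clipped at the start and end of the manifesto. The sketch below isolates that slicing on a toy list; `context_window` and its `before`/`after` parameters are hypothetical names introduced for this example.

```python
def context_window(sentences, i, before=1, after=2):
    """Join sentence i with its neighbours, clipped at the list boundaries."""
    start = max(i - before, 0)
    # Slicing past the end of a list is safe in Python, so no upper clip is needed.
    return " ".join(sentences[start:i + after + 1])


toy = ["s0", "s1", "s2", "s3", "s4"]
first = context_window(toy, 0)   # no previous sentence available
middle = context_window(toy, 2)  # full window
last = context_window(toy, 4)    # no following sentences available
```

With the defaults this reproduces the slice `max(i - 1, 0):i + 3` used in the loop.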


In [7]:
# Keep only the first three characters of each predicted label as the code.
sentence_data['label'] = sentence_data.sentence_id.apply(lambda x: id2labels[x][0][:3])
sentence_data['country'] = 'Netherlands'
sentence_data['date'] = '2023-10-01'

sentence_data = sentence_data[['sentence', 'party', 'country', 'date', 'label']]
sentence_data.columns = ['text', 'party', 'country', 'date', 'code']

sentence_data.to_csv('data/manifestos_nl_2023_coded_sents.csv', index=False)
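
The `[:3]` slice above relies on each predicted label starting with a three-character code. The sketch below illustrates that reduction and a simple per-party count. It assumes label strings of the form "code - name"; the example labels and party names are invented, so the real label strings should be checked in the topic model's `id2label` mapping.

```python
from collections import Counter

# Hypothetical (party, predicted label) pairs, assuming "code - name" labels.
predictions = [
    ("A", "501 - Environmental Protection"),
    ("A", "000"),
    ("B", "501 - Environmental Protection"),
    ("B", "504 - Welfare State Expansion"),
]

# Reduce every label to its leading three-character code.
codes = [(party, label[:3]) for party, label in predictions]

# Count how often each code occurs per party.
counts = Counter(codes)
```

A count like this is a quick way to spot unexpected codes, for example labels that do not start with a numeric code and would be truncated to something meaningless.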