# Final Project

This is the notebook for the final individual project of the Big Data and Automated Content Analysis Course. The goal of the project is to use supervised machine learning to successfully predict whether a document is populist or not. This project is based on the work of Matthijs Rooduijn ... 

## Acquiring the Data

The original data are files containing the manifestos of various parties from five European countries. These manifestos have been analysed by human coders who decided whether a paragraph is populist or not. The manifestos were sourced online (in .csv format), were included in the original research (in .doc) or were copied from .pdf files (in .txt).

While these documents already contain paragraphs, these are not always congruent with the paragraphs used in the coding of the original project. I therefore have three main tasks in this part of the project:

    - transform all files into the same text data format and remove pre-existing paragraphs
    - find the first two words of each paragraph (as noted in the coded results) in the manifesto docs and set new paragraphs accordingly 
    - split the texts into lists at the line breaks and feed the lists and the results into one combined dataframe, which will provide the features and labels for the later models
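
The second and third task can be sketched in plain Python. This is only a minimal illustration, not the code used further down: the function name `split_at_openings` and the sample text are made up, and the sketch assumes that the opening words appear in the text in the same order as in the coded results.

```python
def split_at_openings(text, openings):
    """Split text into paragraphs that start at the given opening words.

    Assumes the openings occur in the text in the listed order.
    Raises ValueError if an opening cannot be found.
    """
    if not openings:
        return [text]
    starts = []
    position = 0
    for opening in openings:
        # Search only after the previous match so repeated words are not matched twice.
        position = text.index(opening, position)
        starts.append(position)
        position += len(opening)
    # The first paragraph always begins at the start of the text.
    starts[0] = 0
    ends = starts[1:] + [len(text)]
    return [text[start:end].strip() for start, end in zip(starts, ends)]


# Invented sample text with three paragraphs, two of which open with the same words.
sample = "wij willen zorg voor iedereen. de elite luistert niet. wij willen verandering."
paragraphs = split_at_openings(sample, ["wij willen", "de elite", "wij willen"])
print(paragraphs)
```

Searching from the end of the previous match is what keeps the third paragraph from being matched at the very first "wij willen".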

In [None]:
# let's begin by importing the necessary modules.
import pandas as pd
from glob import glob
import re
import os
import numpy as np

# Pre-Processing
from string import punctuation
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import spacy

# Supervised Machine Learning
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import metrics, preprocessing
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_validate, train_test_split, GridSearchCV 
from sklearn.pipeline import make_pipeline, Pipeline

# Word Embeddings
import gensim
from gensim.models import word2vec
from embeddingvectorizer import EmbeddingCountVectorizer, EmbeddingTfidfVectorizer
import embeddingvectorizer

# I also want to import the functions defined in the other notebook.
from functions import file_folder, save, clean, nl_lemmatise, classification_report

# For Plotting
import seaborn as sns
import matplotlib 
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# To adapt the code to different directories, simply change this object. The results and manifestos folder should be placed in this directory.
directory = "/home/hennes/Desktop/Final Project/"

In [None]:
# The following code is used to go through the folder with the manifestos
# The text files are extracted depending on the file format and saved to a new file after removing all line breaks.

for mani in glob(directory+'Results and manifestos/Manifestos/*/*'):
    folder, filename = file_folder(mani)
    if mani.endswith('.txt'):
        with open(mani, mode='r') as m:
            text = m.read()  
    elif mani.endswith('.csv'):
        m = pd.read_csv(mani)
        if len(m.index) > 1:
            text = " ".join(m['text'].values[1:])
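        else:
            text = m.iloc[0,0]
    else:
        continue    # skip other file formats so that no text from the previous file is saved under a new name
    save(text, filename, directory+f"Clean Manifestos/{folder}/")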

With the following code I first place line breaks in the manifesto text files according to the paragraph structure that was used in the original project. In doing so, I had to make use of specific error feedback to correct the beginnings of the paragraphs in the results. In many cases there were inconsistencies between the manifestos and the results and some coders did not fill out the 'ParWords' column for all of their paragraphs. I was only able to rectify this by comparing the manifesto text files, the PDFs of the manifestos and the results.

Once all errors have been dealt with, the manifesto strings are split at the line breaks and turned into lists. These are then combined with the results into a dataframe that uses the coding scheme of the original article to identify paragraphs as populist (a paragraph needs to contain both a reference to "the people" and anti-elitism).
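
The labelling rule and the combination step can be sketched as follows. The function names `label_paragraph` and `combine` are hypothetical and the codes are invented; only the column names `People` and `AntEl` and the two label values are taken from this notebook.

```python
def label_paragraph(people, ant_el):
    """Return the label under the rule described above: both codes must be positive."""
    return "Populist" if people > 0 and ant_el > 0 else "Not Populist"


def combine(paragraphs, codes):
    """Pair each paragraph with its codes and derive the label."""
    if len(paragraphs) != len(codes):
        # A mismatch means a paragraph opening was missed or matched twice.
        raise ValueError(f"{len(paragraphs)} paragraphs but {len(codes)} coded rows")
    return [
        {"text": text, "People": people, "AntEl": ant_el,
         "Populist": label_paragraph(people, ant_el)}
        for text, (people, ant_el) in zip(paragraphs, codes)
    ]


rows = combine(
    ["first paragraph", "second paragraph", "third paragraph"],
    [(1, 1), (1, 0), (0, 0)],  # invented (People, AntEl) codes
)
print([row["Populist"] for row in rows])
```

The explicit length check makes a mismatch between the split manifesto and the coded rows visible before any labels are attached.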

In [None]:
for res in glob(directory+'Results and manifestos/Results/NL/*'):
    folder = os.path.basename(os.path.dirname(res))    # the country folder, here "NL"
    filename = re.sub(r"\.\w+","", os.path.basename(res))
    results = pd.read_excel(res).dropna(axis=0, how='all').dropna(axis=1, how='all') # need to delete a lot of empty cells
    with open(directory+f"Clean Manifestos/{folder}/{filename}.txt", mode='r') as f:
        text = re.sub("  ", " ", re.sub("\n","", f.read()))      # I am removing line breaks again because the text files kept adding an \n after they were saved.
        iterrow = results.iterrows()        # I have to skip the first row because I don't want a line break at the beginning.
        next(iterrow)
        for index, row in iterrow:
            par = row["ParWords"].lower()
            try:
                begin = text.rindex("\n")+5  # start the search a few characters after the last inserted line break
                try:
                    end = text.index(par, begin)
                    text = text[:end] + "\n" + text[end:]
                except ValueError:
                    print(f"Error in {filename}: Last line-break at {begin}. Words '{par}' in row {index} not found")
            except ValueError:      # no line break has been inserted yet
                try:
                    end = text.index(par)
                    text = text[:end] + "\n" + text[end:]
                except ValueError:
                    print(f"Error in {filename}: Words '{par}' in row {index} not found")
                    raise

    # Now I save the dataset in the appropriately named folder and file
    paralist = text.splitlines()
    try:
        data = results[["ParNum", "ParWords", "People", "AntEl"]].copy()    # copy to avoid writing to a slice of results
        data["Populist"] = np.where((data["People"]>0) & (data["AntEl"]>0), 'Populist', 'Not Populist')    # Creating the populist label according to coding rule of original paper.
        data["text"] = paralist
    except Exception as e:
        print(f"Could not generate final table for {filename}")
        print(e)
        continue    # do not save an incomplete table
    save(data, filename, directory+f"Master/{folder}/")
    


The individual dataframes are now combined to create the training and test data.

In [None]:
frames = []
for d in glob(directory+'Master/NL/*'):
    if os.path.basename(d) == 'Combined Dataset.csv':
        continue    # skip the combined file written by an earlier run of this cell
    folder, filename = file_folder(d)
    df = pd.read_csv(d)
    df['Party/Year'] = filename    # creating extra columns with the party/year and the country
    df['Country'] = folder
    frames.append(df)
dataset = pd.concat(frames, ignore_index=True)
dataset.to_csv(directory+'Master/NL/Combined Dataset.csv', index=False)

## Pre-Processing of Data

In [None]:
# These are the topics (labels) in the data.
dataset['Populist'].value_counts()

In spite of the insufficient number of paragraphs, the following code will serve as a demonstration of how supervised machine learning could be used to classify populist paragraphs.

In [None]:
# Now I will turn the text column and the label column into two lists.
features = dataset['text'].tolist()
labels = dataset['Populist'].tolist()

In [None]:
# I will begin by taking a look at some text examples.
print(f"{features[500]}\n\n{features[0]}\n\n{features[1000]}\n\n{features[3000]}\n\n{features[2000]}")

In [None]:
# This function cleans the features
clean_features = clean(features)

# Let's compare the clean features to the original ones.
print(f"{features[0]}\n\n{clean_features[0]}")

In [None]:
# Please note that for lemmatising to work, you have to download the Dutch language pipeline.
# This can be done in the user console with:
# python -m spacy download nl_core_news_lg

# In the last step of pre-processing, I lemmatise the words.
docs = nl_lemmatise(clean_features)

# Let's compare the lemmatised features to the clean ones.
print(f"{clean_features[0]}\n\n{docs[0]}")

## Training, Testing and Validating the Model

In this part of the notebook I will train, test and validate two models to predict whether a paragraph is populist or not. The first model will be based on regular word tokenization, while the second one will employ word embeddings. Both models and their respective parameters will be selected using a grid search. The possible classifiers for both models are a naive Bayes, a support vector machine, a logistic regression and a random forest classifier.
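
To get a feeling for how much work such a grid search involves, the number of candidate models can be counted before anything is trained. The sketch below mirrors the idea of expanding a list of parameter grids, not scikit-learn's own implementation. The function `expand` is hypothetical, and the classifiers are written as plain strings so that the example runs without scikit-learn.

```python
from itertools import product


def expand(grid):
    """Yield every parameter combination in a list of grids."""
    for params in grid:
        names = sorted(params)
        for values in product(*(params[name] for name in names)):
            yield dict(zip(names, values))


# Vectorizer settings shared by both toy grids.
shared = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__max_df": [0.5, 1.0],
    "vectorizer__min_df": [1, 5],
}
toy_grid = [
    {**shared, "classifier": ["naive Bayes"]},
    {**shared, "classifier": ["logistic regression"], "classifier__C": [0.01, 1, 100]},
]

candidates = list(expand(toy_grid))
folds = 5
print(len(candidates), "candidates,", len(candidates) * folds, "fits with 5-fold cross-validation")
```

Every additional parameter value multiplies the number of candidates within its grid, and every candidate is fitted once per fold, so large grids quickly become expensive.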

In [None]:
# Before starting with the classification, I split the labelled data into test and training data.
X_train, X_test, y_train, y_test = train_test_split(docs, labels,test_size=0.2, random_state=42)
# Let's see if the features and labels indeed match up. That seems to be the case
print(f"{X_train[1]}\n\n{y_train[1]}")

In [None]:
# The random forest has to be a classifier here: the labels are categories and the search is scored on accuracy.
from sklearn.ensemble import RandomForestClassifier

# I initialize the pipeline with placeholder steps; the grid search swaps in each vectorizer and classifier.
pipe = Pipeline(steps = [('vectorizer', CountVectorizer()), ('classifier', MultinomialNB())])

# Vectorizer parameters that are shared by every grid.
vectorizer_params = {
    'vectorizer__ngram_range' : [(1,1), (1,2)],
    'vectorizer__max_df': [0.5, 1.0],
    'vectorizer__min_df': [1, 5]    # an integer min_df has to be at least 1
}

# The classifiers with their own parameters.
classifier_grids = [
    {'classifier': [MultinomialNB()]},    # Naive Bayes
    {'classifier': [LogisticRegression(solver='liblinear')],    # Logistic Regression
     'classifier__C': [0.01, 1, 100]},
    {'classifier': [SVC()],    # Support Vector Machine
     'classifier__kernel': ['linear','rbf', 'poly'],
     'classifier__C': [0.001, 0.01, 0.1, 1.0, 10, 100]},
    {'classifier': [RandomForestClassifier()],    # Random Forest
     'classifier__max_features': list(range(1, 21)),    # [1-20] evaluates to [-19]; I assume 1 to 20 was meant
     'classifier__n_estimators': [10, 100, 1000]}
]

# Each grid specifies the parameters for one combination of vectorizer (Count or Tfidf) and classifier.
grid = [{'vectorizer': [vectorizer()], **vectorizer_params, **classifier_grid}
        for vectorizer in (CountVectorizer, TfidfVectorizer)
        for classifier_grid in classifier_grids]
                 
search = GridSearchCV(estimator = pipe,
                      param_grid = grid,
                      scoring = 'accuracy',
                      cv = 5,
                      n_jobs = -1,
                      verbose = 10)
search.fit(X_train, y_train)

reg_SML_params = search.best_params_
reg_SML_model = search.predict(X_test)

Unfortunately I am not able to perform the gridsearch using the word embeddings because my kernel dies each time I run the code. This has happened with all trained embeddings from https://github.com/clips/dutchembeddings, and also happens if I do not try to run a gridsearch but only specify a single model. I suspect that this is because I cannot allocate more than 5 GB of RAM to my virtual machine.
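
One possible workaround, which I have not tested on the sonar-160 file, is to keep only the vectors of words that actually occur in the paragraphs and to average them per paragraph. The sketch below assumes the word2vec text format (a header line with the number of words and dimensions, then one word per line followed by its values). The functions `load_filtered_vectors` and `mean_embedding` and the tiny embedding file are invented for illustration. gensim's `load_word2vec_format` also accepts a `limit` argument that reads only the first N vectors, which may be a simpler route. Whether memory really is the cause remains the suspicion stated above.

```python
import io


def load_filtered_vectors(lines, vocabulary):
    """Read word2vec text-format lines, keeping only words in vocabulary."""
    lines = iter(lines)
    _, dimensions = next(lines).split()  # header: "<number of words> <dimensions>"
    vectors = {}
    for line in lines:
        word, *values = line.rstrip().split(" ")
        if word in vocabulary:
            vectors[word] = [float(value) for value in values]
    return vectors, int(dimensions)


def mean_embedding(tokens, vectors, dimensions):
    """Average the vectors of the known tokens; all zeros if none is known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * dimensions
    return [sum(column) / len(known) for column in zip(*known)]


# Invented three-word embedding file with two dimensions.
fake_file = io.StringIO("3 2\nvolk 1.0 3.0\nelite 3.0 1.0\nfiets 9.0 9.0\n")
documents = [["volk", "elite"], ["onbekend"]]
vocabulary = {token for document in documents for token in document}

vectors, dimensions = load_filtered_vectors(fake_file, vocabulary)
print(sorted(vectors))
print([mean_embedding(document, vectors, dimensions) for document in documents])
```

Because the file is read line by line, only the vectors that are needed stay in memory, at the cost of losing every word that does not occur in the paragraphs.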

In [None]:
model = gensim.models.keyedvectors.KeyedVectors.load_word2vec_format("/home/hennes/Downloads/160/sonar-160.txt", binary=False)

In [None]:
# The random forest has to be a classifier here: the labels are categories and the search is scored on accuracy.
from sklearn.ensemble import RandomForestClassifier

# I initialize the pipeline with placeholder steps; the grid search swaps in each vectorizer and classifier.
pipe = Pipeline(steps = [('vectorizer', CountVectorizer()), ('classifier', MultinomialNB())])

# The two embedding vectorizers, both averaging the word vectors of a paragraph.
embedding_vectorizers = [
    embeddingvectorizer.EmbeddingCountVectorizer(model, operator='mean'),
    embeddingvectorizer.EmbeddingTfidfVectorizer(model, operator='mean')
]

# The classifiers with their own parameters.
classifier_grids = [
    # Naive Bayes raises an error on negative feature values, so it will probably fail on averaged embeddings.
    {'classifier': [MultinomialNB()]},
    {'classifier': [LogisticRegression(solver='liblinear')],    # Logistic Regression
     'classifier__C': [0.01, 1, 100]},
    {'classifier': [SVC()],    # Support Vector Machine
     'classifier__kernel': ['linear','rbf', 'poly'],
     'classifier__C': [0.001, 0.01, 0.1, 1.0, 10, 100]},
    {'classifier': [RandomForestClassifier()],    # Random Forest
     'classifier__max_features': list(range(1, 21)),    # [1-20] evaluates to [-19]; I assume 1 to 20 was meant
     'classifier__n_estimators': [10, 100, 1000]}
]

# Each grid specifies the parameters for one combination of embedding vectorizer and classifier.
grid = [{'vectorizer': [vectorizer], **classifier_grid}
        for vectorizer in embedding_vectorizers
        for classifier_grid in classifier_grids]
                 
search = GridSearchCV(estimator = pipe,
                      param_grid = grid,
                      scoring = 'accuracy',
                      cv = 5,
                      n_jobs = -1,
                      verbose = 10)
search.fit(X_train, y_train)

wv_SML_params = search.best_params_
wv_SML_model = search.predict(X_test)

The best models with and without word embeddings can now be compared. The best model can then be chosen for prediction purposes.
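
When comparing the models it is worth looking beyond accuracy, because a model can score well on accuracy while missing most populist paragraphs if these are rare. The sketch below uses invented labels and a hypothetical helper `precision_recall_f1` to show the effect. It is not the `classification_report` function used in the next cell.

```python
def precision_recall_f1(y_true, y_pred, positive="Populist"):
    """Compute precision, recall and F1 for one label, returning 0.0 where undefined."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    false_pos = sum(t != positive and p == positive for t, p in pairs)
    false_neg = sum(t == positive and p != positive for t, p in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented labels: 2 populist paragraphs out of 10.
y_true_toy = ["Populist"] * 2 + ["Not Populist"] * 8
y_pred_toy = ["Not Populist"] * 10  # a model that never predicts "Populist"

accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(accuracy, precision_recall_f1(y_true_toy, y_pred_toy))
```

In this toy case the accuracy is 0.8 although not a single populist paragraph is found, which is why recall for the "Populist" label deserves attention in the reports.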

In [None]:
print(f'These hyperparameters provide the best performance using regular supervised machine learning techniques: {reg_SML_params}')
print(f"\n\n\n {classification_report(y_test, reg_SML_model)}")
print(f'\n\n These hyperparameters provide the best performance using word embeddings: {wv_SML_params}')
print(f"\n\n {classification_report(y_test, wv_SML_model)}")

## Using the model to predict whether a paragraph is populist

The data from the original research project contained manifestos that were not coded. One of these will be used to show how the model from above can be used to predict whether a paragraph is populist.

In [None]:
# I use a dataframe containing only the paragraphs of the SP manifesto from 1998.
SP1998 = pd.read_csv("/home/hennes/Downloads/SP1998.csv")
features = SP1998['text'].tolist()    # assuming the paragraphs are stored in a 'text' column, as in the combined dataset

# Next I clean its features and lemmatise them.
SP1998_docs = nl_lemmatise(clean(features))

# Finally I predict their labels.
populist = search.predict(SP1998_docs).tolist()

# And I add them to the dataframe and export the data.
SP1998['Populist'] = populist
SP1998.to_csv("/home/hennes/Downloads/SP1998 with labels.csv")

In [None]:
# It could now be of interest to see how many of the paragraphs were labelled as populist.
SP1998['Populist'].value_counts()

The model can also be applied to a corpus of manifestos from different parties, years or countries. The results from the prediction could then be used for visualisation purposes. An example of this is a comparison of the percentage of populist paragraphs in each manifesto. Here I show how such a visualisation could look with the data that was already coded in the original research project.

In [None]:
# Loading the Dataset
dataset = pd.read_csv(directory+"Master/NL/Combined Dataset.csv")

# Aggregating the data to show the percentage of populist paragraphs per manifesto
dataset['Proportion Populist Paragraphs'] = dataset.Populist == 'Populist'
perparty = dataset.groupby('Party/Year', as_index=False).agg({'Proportion Populist Paragraphs': 'mean'}).sort_values(['Proportion Populist Paragraphs'], ascending=False)

In [None]:
# Some settings to make the plot prettier, larger and more easily readable.
# They have to be set before the plot is drawn, otherwise plt.figure opens a second, empty figure.
sns.set(style="whitegrid", font_scale=1.3)
plt.figure(figsize=(16,8))

# This plot shows the proportions per party, but in a dataset with multiple countries,
# the percentage of populist paragraphs could also be plotted over different countries, years etc.
fig = sns.barplot(x="Party/Year", y="Proportion Populist Paragraphs", data=perparty, ci=None)
fig.set_xticklabels(fig.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()