# Class 5: ML Lab - Solution

## Problem Description

Partisanship and polarization are important features of democratic politics. Partisanship influences how citizens perceive real-world conditions. For example, citizens tend to view the state of the national economy more positively if their party holds office. Polarization, on the other hand, refers to ideological differences. Although the terms are often used interchangeably, partisanship and polarization are different concepts, but the former often contains valuable information about the latter. If politics is becoming more partisan, it is likely becoming more polarized as well. 

In this lab session, we will investigate whether we can develop computational large-scale measures of both partisanship and polarization. For the former, we use supervised learning. For the latter, we use unsupervised learning.

## Setup

Before we start, we need the correct setup. We start by importing all the modules we need in the exercise. We then change the working directory to the project folder you are using in the class with `os.chdir()`, where `os` refers to the built-in module.

In [2]:
import os
import re
import time
import pickle
import multiprocessing

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt


from tqdm import tqdm
from string import punctuation
from gensim.models import Doc2Vec
from spacy.lang.da.stop_words import STOP_WORDS as stopwords

from sklearn.decomposition import PCA
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, StratifiedKFold

In [3]:
# Change directory
wd = '/home/rask/Dropbox/teaching/css_fall2023'                              # write directory here
wd = 'C:/Users/au535365/Dropbox/teaching/css_fall2023'
os.chdir(wd)

In [4]:
# Confirm that the working directory is as intended 
os.getcwd()

'C:\\Users\\au535365\\Dropbox\\teaching\\css_fall2023'

## 1.0 Supervised Learning: Classification of Partisanship


Peterson and Spirling (2018) develop and validate a measure of partisanship based on the word choices of individual legislators. The logic is simple: the easier it is for a classifier to recognize a legislator's party from their words, the higher the level of partisanship. This, in turn, suggests that polarization is higher. Peterson and Spirling (2018) develop a time-varying measure. We return to this shortly and for now consider only static partisanship.


Link to article: https://www.cambridge.org/core/journals/political-analysis/article/classification-accuracy-as-a-substantive-quantity-of-interest-measuring-polarization-in-westminster-systems/45746D999CFCD1CB43E362392D7B2FB4

We work with speeches from Folketinget from 2000-2021 spanning two decades of parliamentary debates. The data can be found on the GitHub repo under `data/ft-speeches` where each term is saved in a separate `.csv` file, e.g. `20001.csv`. Download the files to your local computer and place the files somewhere in your project folder. Do that before you continue.

When you've downloaded the files, we can move on to the first exercise.

### Reading and Cleaning Data

The first section of the supervised learning exercise is reading and cleaning data. This is the boring step of any empirical project, but it is always time-consuming and important to master!

#### Exercise 1.0: Reading in Data I

Read in each dataset you downloaded from GitHub. To see all files in a folder, you can write `os.listdir()`, which returns a list of files. Save the output from `os.listdir()` in an object called `files`. 

If you've done it correctly, the list should have a length of $28$. Validate the result.




#### Solution 1.0

In [5]:
# Get list of files
files = os.listdir('data/ft-speeches')

In [6]:
# Compute length - validate that it is 28
len(files) == 28

True

#### Exercise 1.1: Reading in Data II

We now use our list `files` to iteratively read in each of the $28$ datasets. This can be done using a for-loop where we loop over each file in the list. 

*Hints*: Declare an empty dataframe called `df` before the loop. Within the loop, read in each dataset and concatenate it with the pre-declared dataframe. This can be done using `pd.concat([DF0, DF1])` (`pd` is the alias under which Pandas is imported). Note that `DF0` and `DF1` are just placeholder names. Replace them with your own. When all data is loaded, remember to reset indices.

If you have done it correctly, the length of the dataset will be $411886$. Validate it.

#### Solution 1.1

In [7]:
# Read in data
df = pd.DataFrame()
for file in tqdm(files):
    df_term = pd.read_csv('data/ft-speeches/' + file)
    df = pd.concat([df, df_term])
df.reset_index(drop=True, inplace=True)

100%|███████████████████████████████████████████████████████████████████████████| 28/28 [00:26<00:00,  1.05it/s]


In [8]:
# Compute length - validate that it is 411886
len(df) == 411886

True
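
As a side note, calling `pd.concat` inside the loop copies the growing dataframe on every iteration. A common alternative is to collect the per-file dataframes in a list and concatenate once at the end. The sketch below illustrates the pattern on two small in-memory CSV strings. The file names, column names, and contents are made up for illustration and only stand in for the files in `data/ft-speeches`.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two term files (not the real data)
toy_files = {
    "20001.csv": "party,text\nS,tale et\nV,tale to\n",
    "20011.csv": "party,text\nS,tale tre\nV,tale fire\n",
}

frames = []
for name, content in toy_files.items():
    df_term = pd.read_csv(io.StringIO(content))
    df_term["source_file"] = name      # keep track of where each row came from
    frames.append(df_term)

# Concatenate once and reset the index in the same call
df_all = pd.concat(frames, ignore_index=True)
print(df_all)
```

With `ignore_index=True`, the separate `reset_index` step is not needed.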

#### Exercise 1.2: Data Cleaning I

Before you move on, try printing the dataframe in various ways to see the content of the data. 

We start by removing speeches from bloc-term combinations with fewer than $2000$ speeches.  

I provide you with the code for that. You can try playing around with it to see what's going on.

When you have executed the code, we want to keep only speeches given by legislators from Socialdemokratiet (S) or Venstre (V). 

*Hint*: Use `.loc` and `.isin()` to filter speeches given by legislators from S or V.

In [None]:
count_df = df.groupby(['partycolor', 'period']).count().reset_index()[['partycolor', 'period', 'date']]
count_df = count_df.rename(columns={'date': 'count'})

df = pd.merge(df, count_df, on=['partycolor', 'period'], how='left')

df = df.loc[df['count'] >= 2000,]

#### Solution 1.2

In [None]:
# Keep only Socialdemokratiet and Venstre
df = df.loc[df['party'].isin(['S', 'V'])].reset_index(drop=True)

#### Exercise 1.3: Data Cleaning II

If you haven't done so already, remember to reset indices after filtering. 

The last thing we do is to make a binary indicator showing whether a speech is given by a legislator from S or not. You can achieve this in multiple ways. Remember that the output should be an integer 0-1 column and NOT a float 0.0-1.0 column. Note that this makes it a binary classification task.

#### Solution 1.3

In [None]:
# Make binary indicator
df['sd'] = df['party'] == 'S'
df['y_binary'] = [int(x) for x in df['sd']]

### Static Partisanship

We now move on to the classification tasks. 

At this point in the course, we have not talked about how we numerically represent text. Therefore, I provide you with code to do that now using the `CountVectorizer` class from `sklearn` (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). By next class, you will understand what we're doing. I assume that you have called your dataset `df`. If not, go back and change it or replace `df` with your own name for it.

In [None]:
# Execute this code - df['text'] refers to the text column in the dataframe
vectorizer = CountVectorizer(decode_error='ignore', min_df = 50, max_df=0.10)
vectorizer.fit(df['text'])

#### Exercise 1.4: Intuition

Before we continue, describe in your own words: 

1) Which assumptions we necessarily must make to assume that partisanship is conveyed in word choices

2) What it implies for the clarity of a voter's vote choice on election day. 

3) Do we need high accuracy to get a good model for this task?

#### Solution 1.4

#### Exercise 1.5: Test Set and Model Selection

To train a model to classify partisanship based on words, we must split our data into training and test sets to avoid bias in the model selection. This is the standard procedure for all machine learning tasks where we train a model from scratch (and also when fine-tuning a model in transfer learning). 

Why is the train-test split necessary to obtain an unbiased estimate of the generalization error?

#### Solution 1.5
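
One way to build intuition for this question is a small simulation. The sketch below is only an illustration under made-up assumptions: the features are pure noise and the labels are coin flips, so no model can truly do better than about 50% accuracy. A hand-written 1-nearest-neighbour classifier (not one of the models used in this lab) nevertheless scores perfectly on the data it was trained on, because it has memorized them. Only the held-out test set reveals the true generalization performance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pure-noise features and coin-flip labels: there is nothing to learn here
X_train = rng.normal(size=(200, 5))
y_train = rng.integers(0, 2, size=200)
X_test = rng.normal(size=(200, 5))
y_test = rng.integers(0, 2, size=200)

def predict_1nn(X_ref, y_ref, X_new):
    """Predict the label of the closest reference point (squared Euclidean distance)."""
    dists = ((X_new[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=2)
    return y_ref[dists.argmin(axis=1)]

# Evaluating on the training data: every point is its own nearest neighbour
train_acc = (predict_1nn(X_train, y_train, X_train) == y_train).mean()

# Evaluating on unseen data: close to chance level
test_acc = (predict_1nn(X_train, y_train, X_test) == y_test).mean()

print(f"train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")
```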

#### Exercise 1.6: Data Splitting

Use the `train_test_split()` function from the `sklearn` module (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). It is already imported in the Setup section. Set `test_size=0.2` and `shuffle=True`. What does it mean to have a test_size of $0.2$ and why is it necessary to shuffle the data?

Save the output as `X_train`, `X_test`, `y_train`, and `y_test`. Inspect the content of the four sets. Is the length of `X_train` larger than `X_test`? Are `X_train` and `y_train` of equal length? What about `X_test` and `y_test`?

#### Solution 1.6

In [None]:
# Split data into train and test set
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['y_binary'], test_size=0.2, shuffle=True)

In [None]:
# Test if length is larger for X_train than X_test
len(X_train) > len(X_test)

In [None]:
len(X_train) == len(y_train), len(X_test) == len(y_test) 
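
To see why shuffling matters, recall that our dataframe was built by concatenating the term files one after another, so the rows are ordered by term. The sketch below uses a made-up array of term labels and a hand-rolled split (a stand-in for `train_test_split`, not its actual implementation) to show what the last 20% of the rows look like with and without shuffling.

```python
import numpy as np

# 100 toy speeches ordered by term, 20 per term (made-up values)
periods = np.repeat([2000, 2005, 2010, 2015, 2020], 20)
n_test = int(0.2 * len(periods))

# Without shuffling: the test set is simply the last 20% of the rows
test_unshuffled = periods[-n_test:]

# With shuffling: permute the row order before cutting off the test set
rng = np.random.default_rng(10)
perm = rng.permutation(len(periods))
test_shuffled = periods[perm[:n_test]]

print("Terms in unshuffled test set:", np.unique(test_unshuffled))
print("Terms in shuffled test set:  ", np.unique(test_shuffled))
```

Without shuffling, the model would be trained on early terms and evaluated only on the latest one, so the test set would not be representative of the full period.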

#### Exercise 1.7: Data Formatting

If you check the `type()` of these four sets, you will see that they are `pandas.core.series.Series`. This is not universal, but happens since our data is stored in dataframes. 

Typically, however, scikit-learn wants the input data in numpy format. This is not a strict requirement, but I always use arrays, or occasionally lists, to make the workflow the same every time.

Convert `y_train` and `y_test` to numpy arrays. This can be done very simply using the `np.array()` function.

It is more tricky for `X_train` and `X_test` since we need to use the `vectorizer` we defined earlier based on the `CountVectorizer` class. Without further ado, execute the code. We will come back to the meaning of it in class 6.

In [None]:
# Apply vectorizer on input features and convert to numpy arrays
X_train = vectorizer.transform(X_train).toarray()
X_test = vectorizer.transform(X_test).toarray()

#### Solution 1.7

In [None]:
# Convert labels to numpy arrays
y_train, y_test = np.array(y_train), np.array(y_test)

#### Exercise 1.8: Random Sampling

Sometimes we do not have the time to train a model on our entire dataset. An effective workaround is to randomly sample a subset of the training and test sets to reduce training time. This approach can even be extended to generate bootstrap estimates (i.e. nonparametric uncertainty estimates) if we sample with replacement $N$ times. We do not implement the bootstrap here, but if you feel like it, go nuts!

Your task is to randomly select a subset of samples from the training and test sets, respectively. We can achieve this with the `np.random.choice` function from the numpy module. We have already imported numpy as `np`. Set a seed like this, `np.random.seed(10)`, so that our results can be replicated. 

When you have done so, you can randomly select indices like this: `np.random.choice(a, size=None, replace=False)`, where `a` is a $1$d-array with the elements to be sampled (or an integer `n`, in which case the elements are drawn from `np.arange(n)`) and `size` is the sample size. For this, we use $20000$ for the training set and $4000$ for the test set, which preserves the $0.2$ split. Think carefully about what you give to the parameter `a`. We need to select the *same* indices in both `X_train` and `y_train`, and likewise in `X_test` and `y_test`. Save the results in two objects called `train_indices` and `test_indices` respectively.

After randomly selecting $20000$ and $4000$ samples, use `train_indices` and `test_indices` to subset the training and test sets. Call the sets:
* `X_train_subset`
* `y_train_subset`
* `X_test_subset`
* `y_test_subset`

Validate that the sizes of the resulting sets are $20000$ for training and $4000$ for test.

#### Solution 1.8

In [None]:
np.random.seed(10)

N_train, N_test = 20000, 4000

In [None]:
# Draw row positions without replacement
train_indices = np.random.choice(len(y_train), size=N_train, replace=False)
test_indices = np.random.choice(len(y_test), size=N_test, replace=False)

# Use the same indices for features and labels so that rows stay aligned
y_train_subset, y_test_subset = y_train[train_indices], y_test[test_indices]
X_train_subset, X_test_subset = X_train[train_indices], X_test[test_indices]
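
For those who want to try the bootstrap mentioned in the exercise, here is a minimal sketch. It makes several simplifying assumptions: the labels and predictions are simulated (they stand in for your test labels and a model's predictions), and only the test set is resampled, so the interval reflects uncertainty from the test sample and not from retraining the model.

```python
import numpy as np

rng = np.random.default_rng(10)

# Toy labels and predictions (made up for illustration)
n = 500
y_true = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.2                    # about 20% of predictions are wrong
y_pred = np.where(flip, 1 - y_true, y_true)

correct = (y_true == y_pred)
accuracy = correct.mean()

# Resample test observations with replacement and recompute accuracy each time
n_boot = 1000
boot_acc = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)          # indices drawn with replacement
    boot_acc[b] = correct[idx].mean()

# Percentile interval from the bootstrap distribution
ci_low, ci_high = np.percentile(boot_acc, [2.5, 97.5])
print(f"accuracy = {accuracy:.3f}, 95% interval = [{ci_low:.3f}, {ci_high:.3f}]")
```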

#### Exercise 1.9: Logistic Regression

We are now ready to train our first model. We use logistic regression, which, despite its name, is used here as a classifier. Since our input features are fairly high-dimensional, we do not have the time and computing power to tune any hyperparameters here. 

Start by declaring the logistic regression model using the model from `sklearn` (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Go through the documentation to see which options you have. For this exercise, we simply rely on the default options, but set `tol=1e-1` ($1e-1 = 0.1$). Note that the model is imported as `from sklearn.linear_model import LogisticRegression`

Now we do:
1) Initiate the model to an object called `logistic_clf`
2) Fit the model to the data using the `.fit()` method, where we give our training data as input. Note that this step might take 10-15 minutes. 
3) Predict on the test set using the `.predict()` method. Save the predictions to an object called `log_preds`
4) Print the classification report using the `classification_report()` function which we have imported as `from sklearn.metrics import classification_report`. As input, it takes the labels for the test set as the first argument (`y_true`) and `log_preds` as the second (`y_pred`).
5) Describe the results. What is the overall accuracy? What are the precision and recall for the two labels? Are there any differences?
6) Interpret the results substantively. What does it mean that we are able/not able to classify partisanship?

**NOTE**: Remember to fit the model using the subset sets!

#### Solution 1.9

In [None]:
# 1 Initiate logistic classifier
n_cores = multiprocessing.cpu_count()                                
logistic_clf = LogisticRegression(solver='sag', 
                                  n_jobs=n_cores, 
                                  max_iter=150,
                                  tol=1e-1)

In [None]:
# 2 Fit the model
start_time = time.time()
logistic_clf.fit(X_train_subset, y_train_subset)
end_time = time.time()
print(f"Logistic classifier fitted in {end_time - start_time} seconds ({round((end_time - start_time) / 60, 3)} minutes)")

In [None]:
# 3 Predict on test set
log_preds = logistic_clf.predict(X_test_subset)

In [None]:
# 4 Classification report
print(classification_report(y_test_subset, log_preds))    # true labels first, predictions second

5 Describe the results ....

6 Interpret the results substantially ....
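
To make the numbers in the classification report easier to read, the sketch below computes accuracy, precision, and recall by hand for label `1` on a tiny made-up example. It also shows why the argument order of `classification_report(y_true, y_pred)` matters: if the two arrays are swapped, false positives and false negatives trade places, and so do precision and recall.

```python
import numpy as np

# Made-up true labels and predictions for eight speeches
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1, 0, 0, 1])

tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives
tn = np.sum((y_true == 0) & (y_pred == 0))   # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)    # share of predicted 1s that are truly 1
recall = tp / (tp + fn)       # share of true 1s that are found

# Swapping the roles of y_true and y_pred exchanges fp and fn
precision_swapped = tp / (tp + fn)
recall_swapped = tp / (tp + fp)

print(accuracy, precision, recall)
```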

#### Exercise 1.10: Naive Bayes

We see that the logistic regression performs fairly well. Without much work, we are able to accurately classify speeches as given by legislators from S or V. However, we are not interested in the accuracy in itself. That is, even if we could not classify partisanship based on speeches, our results would still be interesting. This is a huge difference compared to how computer scientists use machine learning.

We now check whether our results are robust to our choice of algorithm or whether other methods work better. To compare the results, we use the simple but powerful Naive Bayes (NB) method. There are multiple variants of NB, but they all belong to a family of supervised learning algorithms that apply Bayes' theorem while assuming conditional independence between every pair of features in our input data (given the label/class). This is obviously an unrealistic assumption. Words in a speech are not chosen independently of each other, even when we condition on whether the speech is from an S or a V legislator. Still, NB algorithms are powerful baselines to compare results against.
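
To make the conditional independence assumption concrete, here is a minimal from-scratch sketch of a multinomial NB classifier with add-one smoothing. The tiny count matrix and labels are made up, and the code is only meant to show the logic (class prior times a product of per-word probabilities, computed in logs). It is not a replacement for the `sklearn` implementation we use in the exercise.

```python
import numpy as np

# Toy document-term matrix: rows are speeches, columns are counts of three words
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 3, 1],
              [0, 2, 2]])
y = np.array([1, 1, 0, 0])

alpha = 1.0                                   # add-one (Laplace) smoothing
classes = np.unique(y)

# log P(class): share of speeches in each class
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# log P(word | class): smoothed word counts per class, normalized to sum to one
word_counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
log_likelihood = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))

def predict(X_new):
    # Independence assumption: log-probabilities of words simply add up
    scores = X_new @ log_likelihood.T + log_prior
    return classes[scores.argmax(axis=1)]

preds = predict(np.array([[2, 0, 0], [0, 2, 1]]))
print(preds)
```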

Start by declaring the multinomial NB model using sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html). Go through the documentation to see which options you have. For this exercise, we simply rely on the default options. Note that the model is imported as `from sklearn.naive_bayes import MultinomialNB`



Now we do:
1) Initiate the model to an object called `naive_clf`
2) Fit the model to the data using the `.fit()` method, where we give our training data as input. This takes a while, but should be faster than for the logistic classifier. 
3) Predict on the test set using the `.predict()` method. Save the predictions to an object called `naive_preds`. To get the probabilities, you can use the `.predict_proba()` method. Do that as well and save it to `naive_probs`
4) Print the classification report using the `classification_report()` function which we have imported as `from sklearn.metrics import classification_report`. As input, it takes the labels for the test set as the first argument (`y_true`) and `naive_preds` as the second (`y_pred`).
5) Describe the results. What is the overall accuracy? What are the precision and recall for the two labels? Are there any differences? How do they compare to the results from the logistic classifier?

#### Solution 1.10

In [None]:
# 1 Initiate model
naive_clf = MultinomialNB()

In [None]:
# 2 Fit model
start_time = time.time()
naive_clf.fit(X_train_subset, y_train_subset)
end_time = time.time()
print(f"Naive classifier fitted in {end_time - start_time} seconds ({round((end_time - start_time) / 60, 3)} minutes)")

In [None]:
# 3 Predictions and probabilities
naive_preds = naive_clf.predict(X_test_subset)
naive_probs = naive_clf.predict_proba(X_test_subset)

In [None]:
# 4 Classification report
print(classification_report(y_test_subset, naive_preds))    # true labels first, predictions second

5 Describe the results ....

### Dynamic Partisanship

Peterson and Spirling (2018) do not study static partisanship, but how it varies over time, interpreting higher accuracy as evidence of more polarization and lower accuracy as evidence of less. 

They test their approach in the UK House of Commons, comparing the yearly classification accuracy of a Labour vs. Conservative classifier. We now test this approach in the context of the Danish parliament and the historically big left- and right-wing parties: Socialdemokratiet (S) and Venstre (V). 

#### Exercise 1.11: Data Formatting

Since we want to investigate how classification accuracy varies over time, we need to divide our speeches into different periods. For this, we use the parliamentary sessions, which run from October in year $t$ to June in year $t+1$, unless elections interrupt the term.

The first trick is to make sure we have the same number of input features in each term. For this, I define a `CountVectorizer()` below using a fixed vocabulary. Download `vocab_min200.pkl` from the GitHub repo and execute the code without further ado. Once again, don't worry about the code; we'll get back to it next week.

Now, we want to construct training and test sets again, but this time we do it for each parliamentary term. The column `period` denotes the term for a given speech. To prepare the data for this task, we want to transform the text for each term based on the `CountVectorizer()`'s fixed vocabulary.

Do the following:
1) Declare empty dictionaries called `X_dict` and `y_dict`
2) Loop over each parliamentary term (you can define a list beforehand and then loop over it. Remember to sort the list!)
3) Within each iteration, subset the dataframe to the given term (remember to reset indices). Call the subset `df_term`
4) Make an object called `y` based on the `y_binary` column in `df_term` 
5) Check whether there is any variance in the object `y`, and whether there are any rows in `df_term`. If not, use the `continue` statement
6) Apply the `vectorizer` object defined by the fixed vocab in the `CountVectorizer` object to the `text` column in `df_term`. Save the result to an object called `X`
7) Store `X` and `y` in `X_dict` and `y_dict` respectively. As keys, use the parliamentary term like this `X_dict[str(term)] = X` and the same for `y` (note that we need `str(term)` since `term` is an integer).

In [None]:
# Load the fixed vocabulary; the context manager closes the file afterwards
with open('data/vocab_min200.pkl', 'rb') as f:
    fixed_vocab = pickle.load(f)
vectorizer = CountVectorizer(decode_error='ignore', vocabulary=fixed_vocab)
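
To see what the fixed vocabulary buys us, consider the small sketch below. The three-word vocabulary and the two toy "terms" are made up for illustration (the real `fixed_vocab` comes from the pickle file, and I make no assumption here about its exact format). Because the vocabulary is supplied up front, both terms are mapped to the same columns in the same order, and words outside the vocabulary are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical fixed vocabulary: one column per word, in this order
toy_vocab = ['skat', 'klima', 'hospital']
toy_vectorizer = CountVectorizer(vocabulary=toy_vocab)

# Two toy "terms" with one speech each
speeches_term_a = ["skat skat klima"]
speeches_term_b = ["hospital og skat"]

# No fitting is needed because the vocabulary is already given
X_a = toy_vectorizer.transform(speeches_term_a).toarray()
X_b = toy_vectorizer.transform(speeches_term_b).toarray()

print(X_a)   # counts for skat, klima, hospital in term a
print(X_b)   # same columns for term b; 'og' is not in the vocabulary
```

Had we instead fitted a fresh vectorizer on each term, the number and order of columns would differ between terms, and accuracies would not be based on comparable feature sets.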

#### Solution 1.11

In [None]:
# Make list of parliamentary terms
parlterms = sorted(df['period'].unique().tolist())

In [None]:
X_dict, y_dict = {}, {}

for term in parlterms:
    
    df_term = df.loc[df['period'] == term].reset_index(drop=True)
    y = df_term['y_binary']
    
    if (np.mean(y)==0 or np.mean(y)==1 or len(df_term)==0):
        continue
    
    X = vectorizer.fit_transform(df_term['text'])
    
    X_dict[str(term)] = X
    y_dict[str(term)] = y

#### Exercise 1.12 Stratified k-fold Cross Validation

Now that we have our data processed and saved in `X_dict` and `y_dict`, we can move on to the classification. Since we train a classifier on the speeches in each term, our dataset is fairly small. To tackle this, we average the accuracy over a stratified 10-fold cross-validation. The stratified version of cross-validation makes sure that the balance between the labels is approximately the same in each fold. Start by defining the stratified k-fold based on sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html). Use the specifications:
* `n_splits=10`
* `shuffle=True`
* `random_state=10`

and assign the object to `skf`



#### Solution 1.12

In [None]:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=10)
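
To illustrate what "stratified" means, the sketch below assigns observations to folds by hand on made-up labels with a 30/70 class balance. It shuffles the indices within each class and deals them out to the folds in turn. This is only meant to convey the idea and is not how `StratifiedKFold` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(10)

# Toy labels: 30 speeches from one party, 70 from the other
y_toy = np.array([0] * 30 + [1] * 70)
n_splits = 10

# Within each class, shuffle the indices and deal them out to folds 0..9 in turn
fold = np.empty(len(y_toy), dtype=int)
for c in np.unique(y_toy):
    idx = rng.permutation(np.where(y_toy == c)[0])
    fold[idx] = np.arange(len(idx)) % n_splits

# Share of label 1 in each fold: identical to the overall share
shares = [y_toy[fold == k].mean() for k in range(n_splits)]
print(shares)
```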

#### Exercise 1.13: Fitting and Prediction with Logistic Regression 

Our data for each term is stored in `X_dict` and `y_dict` and we have defined a $10$-fold stratified cross validation scheme. Train a logistic classifier for each parliamentary term and compute the accuracy for each of the $10$ folds. I have provided you with a function that you can use to fit and predict partisanship for each term.

You should loop over each term, access the data in `X_dict` and `y_dict`, and then loop over each fold. This can be done like this: `for train_ix, test_ix in skf.split(X, y)`, where `skf` is the object you made in the previous exercise.

I don't provide any more information here. See if you can figure it out yourself. You should end up with an object, for instance a dictionary, with each term having 10 accuracy estimates, one for each fold. 

**Note:** This task is difficult and requires you to master Python. Check the solution if you need help or ask the student next to you.

In [None]:
def logistic_classifier(X_train, y_train, X_test, y_test):

    logistic_clf = LogisticRegression(solver='sag', n_jobs=-1, tol=1e-1, C=1.0)
    logistic_clf.fit(X_train, y_train)
    
    return logistic_clf.score(X_test, y_test)

#### Solution 1.13

In [None]:
term_stats = {}
for term in tqdm(X_dict):                        # only terms stored in X_dict (skipped terms have no entry)
    X, y = X_dict[term], np.array(y_dict[term])  # array labels allow positional indexing
    cls_stats = {}
    for foldid, (train_ix, test_ix) in enumerate(skf.split(X, y)):
        X_train, X_test = X[train_ix], X[test_ix]
        y_train, y_test = y[train_ix], y[test_ix]
        cls_stats[f"k-{foldid}"] = logistic_classifier(X_train, y_train, X_test, y_test)
    term_stats[term] = cls_stats

#### Exercise 1.14: Average Accuracy 

You should have a dictionary whose keys are the terms and whose values hold the $10$ accuracy estimates, one for each fold. Compute the average accuracy for each term. 

*Hint*: Use a dictionary comprehension approach using the `np.mean()` method on the values.

#### Solution 1.14

In [None]:
stats_dict = {str(key): np.mean(list(val.values())) for key, val in term_stats.items()}

#### Exercise 1.15 Combine Into a Dataframe

Now you should have a dictionary with terms as the keys and the average accuracy as the values. Combine this into a pandas dataframe called `stats_df`. You might need to transpose the data to get a proper representation. This can be done using the `.transpose()` method.

When you have done this, do the following:
1) Rename the column called `0` to `mean`
2) Assign the row indices as a column of its own to a column called `term` (can be done using `.index`)
3) Reset indices

You should now have a dataframe with 23 rows and two columns, `mean` and `term`

#### Solution 1.15

In [None]:
stats_df = pd.DataFrame(stats_dict, index=[0]).transpose()
stats_df = stats_df.rename(columns={0: 'mean'})
stats_df['term'] = stats_df.index
stats_df.reset_index(drop=True, inplace=True)

#### Exercise 1.16: Inspect and Interpret

Inspect the results. Can we see any interesting variation over time? Does it match our intuitive understanding of Danish politics? Maybe, maybe not...

I have provided you with code to plot the results over time.

#### Solution 1.16

In [None]:
# Plot the mean accuracy over time
plt.figure(figsize=(10, 6))
plt.plot(stats_df['term'], stats_df['mean'], marker='o', linestyle='-', color='b')
plt.title('Accuracy of Classifier For Each Term', size=16)
plt.xlabel('Term', size=14)
plt.ylabel('Accuracy', size=14)
plt.xticks(rotation=45, size=10)
plt.ylim(0.6, 0.8)
plt.grid(True)
plt.show()

## 2.0 Unsupervised Learning: Principal Component Analysis (PCA)

We now want to explore how we can use PCA to identify positions of political parties in a two-dimensional latent space. 

It is a common thing in political science to think of parties in spatial terms. The classical left-right distinction is an example of this. This typically refers to the positions in an economic sense. Another dimension is values (i.e. værdipolitik), commonly referred to as the cultural dimension. We investigate this two-dimensional space in the context of Denmark.

To explore this, we are working with word embeddings trained on a corpus of parliamentary speeches in the Danish parliament from 2000-2021. We will get to word embeddings, what they are, what they tell us, and so on later in the course. 

In brief, a word embedding is an $M$-dimensional vector where $M$ is typically large $(128, 256, 300, 512, \dots)$ and sometimes even larger. Each word has an embedding which encodes the semantics of the word. The basic idea is that words that are similar are close to each other and words that are dissimilar are distant from each other. The embeddings are computed using a neural network model called `Doc2Vec` (https://radimrehurek.com/gensim/models/doc2vec.html), which makes it possible to include "covariates" in the estimation, such as party indicators or year indicators. This results in one embedding for each "covariate". We will exploit this in the exercise. 
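
To give a feel for what "close" and "distant" mean, the sketch below compares made-up four-dimensional vectors using cosine similarity, which is one common way of measuring closeness between embeddings. The words and numbers are invented for illustration and do not come from the trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 4-dimensional "embeddings" (real ones have hundreds of dimensions)
toy_embeddings = {
    "skat":    np.array([0.9, 0.1, 0.0, 0.2]),
    "afgift":  np.array([0.8, 0.2, 0.1, 0.3]),
    "fodbold": np.array([0.0, 0.9, 0.8, 0.1]),
}

sim_close = cosine_similarity(toy_embeddings["skat"], toy_embeddings["afgift"])
sim_far = cosine_similarity(toy_embeddings["skat"], toy_embeddings["fodbold"])

print(round(sim_close, 3), round(sim_far, 3))
```

In this toy setup, the two tax-related words point in almost the same direction, while the football word points elsewhere.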

Our task is to figure out whether `Doc2Vec` outputs partisan embeddings that can be used to accurately locate parties in a two-dimensional space using PCA. We will try two different approaches. One where we investigate whether the variation happens at the bloc-level (left or right) and one where we investigate whether it happens at the party-level. 

The two models we will work with can be downloaded from the GitHub repo:
* Bloc-term: d2v_bloc-term_size300_window20_epochs10_count50.pkl
* Party: d2v_party_size200_window20_epochs5_count50.pkl

The first uses bloc-term indicators (e.g. 'blue-20081'), whereas the second uses party indicators (e.g. 'RV-party' or 'DF-party'). 

Before the actual exercise, we will load the first model and get a bit familiar with it. We have already imported `Doc2Vec` from the `gensim` module (https://radimrehurek.com/gensim/models/doc2vec.html) at the top of the notebook. A trained model can be loaded with the `Doc2Vec.load()` method. I do this below and save it to an object called `d2v`.

In [None]:
# Define model name
model_name = 'd2v_bloc-term_size300_window20_epochs10_count50.pkl'
d2v = Doc2Vec.load('models/' + model_name)

The `d2v` object is an instance of `gensim.models.doc2vec.Doc2Vec`, which makes it a module-specific data structure.

In [None]:
# Check the type of the data
type(d2v)

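To see the vocabulary, you can write `d2v.wv.vocab`, which returns a dictionary whose keys are the words in the vocabulary. Note that the values are gensim vocabulary entries (holding, for instance, the word count and index), not the embeddings themselves. Try printing the vocabulary before you move on. (This notebook uses the gensim 3.x API; in gensim 4.x, the attribute was replaced by `d2v.wv.key_to_index`.)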

In [None]:
d2v.wv.vocab

Each word embedding can be accessed by `d2v.wv[WORD]` where `WORD` is, for instance, 'indvandrer'. Try to access the embedding of the word 'indvandrer'.

In [None]:
# Print the embedding for the word 'indvandrer'
d2v.wv['indvandrer']

The output is a numpy array with shape `(300,)`. Verify it using the `.shape` attribute.

In [None]:
# Check the shape
d2v.wv['indvandrer'].shape

All embeddings have shape `(300,)`. To get the dimensionality, you can write `d2v.vector_size`. The value is not the same for every model; it depends on how the model was trained.

In [None]:
# Get dimensionality of the embeddings and save it as M
M = d2v.vector_size

As mentioned, `Doc2Vec` permits including "covariates", which are bloc-term indicators in this case. We can see the list of indicators used to fit the model using `d2v.docvecs.offset2doctag`

In [None]:
# Check indicators used to fit the model
d2v.docvecs.offset2doctag

This returns a list of indicators. To access the embedding associated with each indicator, we can write `d2v.docvecs[INDICATOR]` where `INDICATOR` is the name of the indicator e.g. 'blue-20081'.

In [None]:
d2v.docvecs['blue-20081']

### Bloc-Term Embeddings

We start by considering bloc-term embeddings using the model we have already loaded.

#### Exercise 2.0: Generate List with Indicators 

Generate two lists with bloc-term indicators using list comprehensions. 

Call the first list `leftwing`. It should contain all indicators in the list returned by `d2v.docvecs.offset2doctag` that contain `red`.

Call the second list `rightwing`. It should contain all indicators in the list returned by `d2v.docvecs.offset2doctag` that contain `blue`.

*Hints:* You can check if a string starts with 'r' or 'b' using the method `.startswith()`. There are other possible solutions.

#### Solution 2.0

In [None]:
leftwing = [d for d in d2v.docvecs.offset2doctag if d.startswith('r')]
rightwing = [d for d in d2v.docvecs.offset2doctag if d.startswith('b')]

#### Exercise 2.1: Combining Lists

Now that we have the two lists `leftwing` and `rightwing`, we want to append them together. Store the result in a list called `pt_list` which is short for **p**arty **t**erm

#### Solution 2.1

In [None]:
pt_list = leftwing + rightwing

#### Exercise 2.2: Defining an Array

We are interested in all $46$ bloc-term embeddings. Hence, we want to extract them and save them in a new matrix of dimension `(46, 300)` where $46$ refers to the number of bloc-term embeddings and $300$ to the dimensions of the embeddings. 

Generate a placeholder numpy array with dimension `(X, M)` where `X` is the number of bloc-term indicators and `M` is the size of the embeddings. Call the array `z`. Verify the shape of the array.

*Hints*: Use the `np.ones` or `np.zeros` functions to construct the array. 

#### Solution 2.2

In [None]:
# Generate the (X, M) array
z = np.zeros((len(pt_list), M))

In [None]:
# Verify the shape is as intended
z.shape

#### Exercise 2.3: Populating `z` 

Now that we've created the placeholder matrix `z`, we want to extract the bloc-term embeddings and assign each of them to a row in `z`. 

Recall that you can access the embeddings like this `d2v.docvecs['blue-20081']` 

*Hints:* Loop over each indicator in `pt_list`.



#### Solution 2.3

In [None]:
# Loop through pt_list and assign each bloc-term embedding to a row in z
for i, indicator in enumerate(pt_list):
    z[i, :] = d2v.docvecs[indicator]

#### Exercise 2.4: PCA

We want to specify a PCA with two components, `n_components = 2`. 

Note that PCA is already imported as `from sklearn.decomposition import PCA`. Initiate the model as `PCA(n_components=2)` and assign it to an object called `pca_model`.

Finally, use the `.fit_transform()` method from `pca_model` on the matrix `z`. Save the results to a new matrix called `Z`. 

Verify that your output `Z` is as intended. It should have a dimension of `(46, 2)`

#### Solution 2.4

In [None]:
# Apply PCA
pca_model = PCA(n_components=2)
Z = pca_model.fit_transform(z)

In [None]:
# Verify shape 
Z.shape

#### Exercise 2.5: Explained Variance

Recall that PCA finds the principal components by maximizing variance. PC1 is the axis along which the data vary the most. PC2 is the axis orthogonal to PC1 that explains most of the variance left over once we have accounted for PC1. 

To see the amount of explained variance, you can write `pca_model.explained_variance_ratio_`. How much variance is explained by PC1 and PC2 respectively? How much variation is captured in total by PC1 and PC2?

#### Solution 2.5

In [None]:
explained_variance = pca_model.explained_variance_ratio_
print(explained_variance)

In [None]:
# Total amount of explained variance
sum(explained_variance)
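
The explained variance ratio is each component's variance divided by the total variance in the data. As a sketch of where these numbers come from, the example below computes the ratios from the eigenvalues of the covariance matrix of a small synthetic data set. The array `X_toy` is an assumption for illustration and has nothing to do with the embeddings.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: 200 observations, 4 features with very different spreads
X_toy = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

# The eigenvalues of the covariance matrix are the variances along each PC
cov = np.cov(X_toy, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigvalsh is ascending, so reverse

ratio = eigvals / eigvals.sum()              # share of variance per component
print(ratio[:2])                             # shares for PC1 and PC2
print(ratio[:2].sum())                       # joint share of PC1 and PC2
print(np.cumsum(ratio))                      # cumulative share across all components
```

When all components are kept the shares sum to one, so the total for PC1 and PC2 shows how much information the two-dimensional plot retains.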

#### Exercise 2.6: Plot PCA

To plot the PCs, I have created a function that you can use straightaway. Before using it, I convert `Z` to a dataframe and define a color scale to properly label the PCs.

Try to play around with the code if you don't understand what's going on.

Interpret and describe the plot.

#### Solution 2.6

In [None]:
# Convert Z to a dataframe
Z_df = pd.DataFrame(Z)
Z_df.columns = ['PC1', 'PC2']
Z_df['party_label'] = pt_list

# Define color scale
color_scale = {'blue': '#3333FF', 'red': '#E91D0E'}

cols = [color_scale['red']]*len(leftwing) + [color_scale['blue']]*len(rightwing)

In [None]:
# Function to plot PCs
def plot_pca(dataframe, indicators, cmap, show=True):
    
    mpl.rcParams['axes.titlesize'] = 20
    mpl.rcParams['axes.labelsize'] = 20
    mpl.rcParams['font.size'] = 14

    plt.figure(figsize=(22,15))
    plt.scatter(dataframe.PC1, dataframe.PC2, color=cmap)
    texts=[]
    for label, x, y, c in zip(indicators, dataframe.PC1, dataframe.PC2, cmap):
        plt.annotate(
            label,
            xy=(x, y), xytext=(-20, 20),
            textcoords='offset points', ha='right', va='bottom',
            bbox=dict(boxstyle='round,pad=0.5', fc=c, alpha=0.3),
            arrowprops=dict(arrowstyle = '->', connectionstyle='arc3,rad=0'))

    plt.xlabel("PC1")
    plt.ylabel("PC2")

    if show:
        plt.show()

In [None]:
# Plot the results
plot_pca(dataframe=Z_df, indicators=pt_list, cmap=cols, show=True)

#### Exercise 2.7: Interpret PCA

We can also investigate which words drive the placement of each bloc-term by looking at the positive and negative similarities for each pole (left-right, north-south). I have written a Python class for this purpose below. Define the class `PCA_INTERPRET` and describe the results.

#### Solution 2.7

In [None]:
class PCA_INTERPRET(object):
    
    def __init__(self, model, parties, dr, Z, labels, rev1=False, rev2=False, min_count=100, max_count = 1000000, max_features=10000):

        self.model = model
        self.parties = parties
        self.labels = labels
        self.P = len(self.parties)
        self.M = self.model.vector_size   
        self.voc = self.sorted_vocab(min_count, max_count, max_features)
        self.V = len(self.voc)   
        self.pca = dr
        self.max = Z.max(axis=0)
        self.min = Z.min(axis=0)
        self.sims = self.compute_sims()
        self.dim1 = rev1
        self.dim2 = rev2
        
    def sorted_vocab(self, min_count=100, max_count=10000, max_features=10000):
        wordlist=[]
        for word, vocab_obj in self.model.wv.vocab.items():
            wordlist.append((word, vocab_obj.count))
        wordlist = sorted(wordlist, key=lambda tup: tup[1], reverse=True)
        return [w for w,c in wordlist if c>min_count and c<max_count and w.count('_')<3][0:max_features]
    
    def compute_sims(self):

        Z = np.zeros((self.V, 2))
        for idx, w in enumerate(self.voc):
            Z[idx, :] = self.pca.transform(self.model.wv[w].reshape(1,-1))
        sims_right = euclidean_distances(Z, np.array([self.max[0],0]).reshape(1, -1))
        sims_left = euclidean_distances(Z, np.array([self.min[0],0]).reshape(1, -1))
        sims_up = euclidean_distances(Z, np.array([0,self.max[1]]).reshape(1, -1))
        sims_down = euclidean_distances(Z, np.array([0,self.min[1]]).reshape(1, -1))
        temp = pd.DataFrame({'word': self.voc, 'right': sims_right[:,0], 'left': sims_left[:,0], 'up': sims_up[:,0], 'down': sims_down[:,0]})
        return temp

    def top_words_list(self, topn=20):

        if self.dim1:
            ordering = ['left','right']
        else:
            ordering = ['right', 'left']
        temp = self.sims.sort_values(by=ordering[0])
        print(80*"-")
        print("Words Associated with Positive Values (Right) on First Component:")
        print(80*"-")
        self.top_positive_dim1 = temp.word.tolist()[0:topn]
        self.top_positive_dim1 = ', '.join([w.replace('_',' ') for w in self.top_positive_dim1])
        print(self.top_positive_dim1)
        temp = self.sims.sort_values(by=ordering[1])
        print(80*"-")
        print("Words Associated with Negative Values (Left) on First Component:")
        print(80*"-")
        self.top_negative_dim1 = temp.word.tolist()[0:topn]
        self.top_negative_dim1 = ', '.join([w.replace('_',' ') for w in self.top_negative_dim1])
        print(self.top_negative_dim1)

        if self.dim2:
            ordering = ['down','up']
        else:
            ordering = ['up', 'down']
        temp = self.sims.sort_values(by=ordering[0])
        print(80*"-")
        print("Words Associated with Positive Values (North) on Second Component:")
        print(80*"-")
        self.top_positive_dim2 = temp.word.tolist()[0:topn]
        self.top_positive_dim2 = ', '.join([w.replace('_',' ') for w in self.top_positive_dim2])
        print(self.top_positive_dim2)
        temp = self.sims.sort_values(by=ordering[1])
        print(80*"-")
        print("Words Associated with Negative Values (South) on Second Component:")
        print(80*"-")
        self.top_negative_dim2 = temp.word.tolist()[0:topn]
        self.top_negative_dim2 = ', '.join([w.replace('_',' ') for w in self.top_negative_dim2])
        print(self.top_negative_dim2)
        print(80*"-")

In [None]:
# Apply class
PCA_INTERPRET(d2v, pt_list, pca_model, Z, pt_list, rev1=False, rev2=False, min_count=100, max_count = 1000000, max_features = 50000).top_words_list(20)
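
The core idea in `compute_sims` is that each word vector is projected into the same two-dimensional PCA space as the embeddings, and words are then ranked by their Euclidean distance to a pole such as `(max PC1, 0)`. Smaller values mean closer, so the "similarities" are really distances sorted in ascending order. A minimal sketch with hypothetical words and made-up coordinates that are assumed to be already projected:

```python
import numpy as np

# Hypothetical words with made-up coordinates in PCA space (PC1, PC2)
words = ['word_a', 'word_b', 'word_c', 'word_d']
coords = np.array([[ 2.0,  0.1],
                   [-1.8,  0.3],
                   [ 1.5, -1.2],
                   [-0.5,  1.9]])

# Poles on the first component
# (taken from the toy coordinates here; the class uses the extremes of Z)
right_pole = np.array([coords[:, 0].max(), 0.0])
left_pole = np.array([coords[:, 0].min(), 0.0])

def nearest_words(pole, topn=2):
    # Euclidean distance from every word to the pole; smaller means closer
    dist = np.linalg.norm(coords - pole, axis=1)
    order = np.argsort(dist)
    return [words[i] for i in order[:topn]]

print(nearest_words(right_pole))
print(nearest_words(left_pole))
```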

### Party Embeddings

We now investigate whether we can locate the positions of individual parties, not just of the two blocs.

Load the second model *d2v_party_size200_window20_epochs5_count50.pkl*. 

Note that this model has embeddings of size $200$ and not $300$. 

In [None]:
model_name = 'd2v_party_size200_window20_epochs5_count50.pkl'
d2v = Doc2Vec.load('models/' + model_name)
M = d2v.vector_size

#### Exercise 2.8: Generate List with Indicators

As for the bloc-term embeddings, we need to generate lists with the indicators used to fit the model. 

Recall that the indicators can be accessed with `d2v.docvecs.offset2doctag`. This time it is a little more complicated, since we have individual parties rather than blocs and need to extract the indicators for each party separately. Extract the indicators for each party and save them in an object with the name of the party (e.g. `RV`).

Combine into two lists called `rightwing` and `leftwing` based on the bloc affiliation of each party. Finally, combine the lists into one called `pt_list`. 


#### Solution 2.8

In [None]:
V = [x for x in d2v.docvecs.offset2doctag if x.startswith('V-')]
KF = [x for x in d2v.docvecs.offset2doctag if x.startswith('KF-')]
DF = [x for x in d2v.docvecs.offset2doctag if x.startswith('DF-')]
NB = [x for x in d2v.docvecs.offset2doctag if x.startswith('NB-')]
LA = [x for x in d2v.docvecs.offset2doctag if x.startswith('LA-')]

RV = [x for x in d2v.docvecs.offset2doctag if x.startswith('RV-')]
S = [x for x in d2v.docvecs.offset2doctag if x.startswith('S-')]
SF = [x for x in d2v.docvecs.offset2doctag if x.startswith('SF-')]
ALT = [x for x in d2v.docvecs.offset2doctag if x.startswith('ALT-')]
EL = [x for x in d2v.docvecs.offset2doctag if x.startswith('EL-')]

rightwing = V + KF + DF + NB + LA
leftwing = RV + S + SF + ALT + EL
pt_list = leftwing + rightwing
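
The ten list comprehensions above follow one pattern, so they could be collapsed into a dictionary keyed by party. The sketch below assumes tags of the form `PARTY-term` and uses a short made-up list `toy_tags` in place of `d2v.docvecs.offset2doctag`.

```python
# Made-up tags standing in for d2v.docvecs.offset2doctag
toy_tags = ['V-1', 'V-2', 'S-1', 'SF-1', 'SF-2', 'EL-1', 'KF-1']

right_parties = ['V', 'KF', 'DF', 'NB', 'LA']
left_parties = ['RV', 'S', 'SF', 'ALT', 'EL']

# One list of tags per party; the trailing '-' keeps 'S-' from matching 'SF-...'
by_party = {p: [t for t in toy_tags if t.startswith(p + '-')]
            for p in right_parties + left_parties}

# Flatten per bloc, keeping the party order defined above
rightwing_toy = [t for p in right_parties for t in by_party[p]]
leftwing_toy = [t for p in left_parties for t in by_party[p]]
pt_list_toy = leftwing_toy + rightwing_toy

print(by_party['S'])
print(pt_list_toy)
```

Keeping the party order in two explicit lists also makes it easier to build matching color lists later on.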

#### Exercise 2.9: PCA with Party Embeddings

We need to do exactly the same steps as before. 

1) Define empty array
2) Populate array with party embeddings
3) Apply PCA
4) Explore how much variance is explained

Use the code you already have to do it using the new model. 

#### Solution 2.9

In [None]:
# Generate the (X, M) array
z = np.zeros((len(pt_list), M))

# Loop through the list pt_list and assign each party-term embedding to a row in z
for i in range(len(pt_list)):
    z[i,:] = d2v.docvecs[pt_list[i]]

# Apply PCA
pca_model = PCA(n_components=2)
Z = pca_model.fit_transform(z)

# Explained variance
explained_variance = pca_model.explained_variance_ratio_
print(explained_variance)

#### Exercise 2.10: Plot PCA

We are now ready to plot the PCs again. Use the function `plot_pca` again. 

Once again, I convert `Z` to a dataframe before using it and define a color scale to properly label the PCs. 

Interpret and describe the plot.

#### Solution 2.10

In [None]:
# Convert Z to a dataframe
Z_df = pd.DataFrame(Z)
Z_df.columns = ['PC1', 'PC2']
Z_df['party_label'] = pt_list

color_scale = {'V': 'royalblue', 'KF': 'forestgreen',
         'DF': 'gold', 'NB': 'darkslategrey', 'LA': 'mediumturquoise',
         'RV': 'darkviolet', 'S': 'red', 'SF': 'sienna', 'EL': 'sandybrown', 'ALT': 'lawngreen'}

# Colors must follow the order of pt_list: leftwing (RV, S, SF, ALT, EL) first, then rightwing (V, KF, DF, NB, LA)
cols_left = [color_scale['RV']]*len(RV) + [color_scale['S']]*len(S) + [color_scale['SF']]*len(SF) + [color_scale['ALT']]*len(ALT) + [color_scale['EL']]*len(EL)

cols_right = [color_scale['V']]*len(V) + [color_scale['KF']]*len(KF) + [color_scale['DF']]*len(DF) + [color_scale['NB']]*len(NB) + [color_scale['LA']]*len(LA)

cols = cols_left + cols_right
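
`plot_pca` pairs labels and colors by position, so the color list has to follow the order of `pt_list` exactly. One way to make that hard to get wrong is to derive each color from the tag itself. A sketch, assuming tags of the form `PARTY-term` and a made-up label list:

```python
# Subset of the color scale used above
color_scale_toy = {'V': 'royalblue', 'S': 'red', 'SF': 'sienna'}

# Made-up labels in arbitrary order
labels_toy = ['S-1', 'V-1', 'SF-1', 'S-2']

# Look up the color from the party prefix (the text before the first '-')
cols_toy = [color_scale_toy[label.split('-')[0]] for label in labels_toy]

print(cols_toy)
```

In the notebook this would read `cols = [color_scale[x.split('-')[0]] for x in pt_list]`, provided every tag has a prefix that appears in `color_scale`.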

In [None]:
# Plot the results
plot_pca(dataframe=Z_df, indicators=pt_list, cmap=cols, show=True)

#### Exercise 2.11: Interpret PCA

Use the class `PCA_INTERPRET` to investigate which words drive the placement of each party.

Describe the results. Do they make sense?

#### Solution 2.11

In [None]:
# Apply class
PCA_INTERPRET(d2v, pt_list, pca_model, Z, pt_list, rev1=False, rev2=False, min_count=100, max_count = 1000000, max_features = 50000).top_words_list(20)