<h1> Module 1 </h1>
<h1> Automated Textual Analysis of Parliamentary Debates </h1>

<h3> McMaster Conference on Substantive Representation </h3>

<h3> Ludovic Rheault (University of Toronto) </h3>

During this module, we will "learn by doing".  The plan is to examine concrete examples of computer-assisted textual analysis, and learn the syntax of the Python programming language along the way.  Python ranked first in IEEE Spectrum's 2018 ranking of programming languages (https://spectrum.ieee.org/at-work/innovation/the-2018-top-programming-languages), and it is also one of the easiest languages to learn.  

This is a gentle introduction to the methods.  For an overview of methods for textual analysis and their applications in political science, you may consult Grimmer and Stewart (2013).

<h2> 1. Toy example </h2>

Let us start with a simple example to illustrate the concepts.  Suppose we have four "speeches" - deliberately simple speeches.

In [None]:
speeches = ['indigenous peoples', 'indigenous affairs',
       'international trade', 'trade relations']

<h3> 1.1. Python List </h3>

The Python <b> list </b> is defined with square brackets, with elements separated by commas.  

In this case, the list contains four items, each containing textual data.  The data type for textual characters is called <b>string</b>.

In [None]:
type(speeches)

In [None]:
type(speeches[0])

In [None]:
speeches

In [None]:
speeches[0]
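
As a brief aside on the syntax, the sketch below shows a few common operations on lists and strings.  The variable names (docs, first, last, and so on) are illustrative only and are not used elsewhere in this module.

```python
# A small list of strings, mirroring the structure of the toy corpus.
docs = ['indigenous peoples', 'indigenous affairs',
        'international trade', 'trade relations']

# Python counts from zero: the first item has index 0, and -1 refers to the last item.
first = docs[0]
last = docs[-1]

# len() returns the number of items in a list (or of characters in a string).
n_docs = len(docs)

# Strings have their own methods; split() breaks a string into a list of words.
words = first.split()

print(first, '|', last, '|', n_docs, '|', words)
```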

<h3> 1.2. The Term-Document Matrix </h3>

A useful tool for textual analysis is called a term-document matrix (or document-term matrix), which I will abbreviate with TDM.  

The objective is to convert the corpus into a numerical matrix representing the count of each word in the vocabulary, for each document. This process is sometimes called vectorization, and will facilitate the analysis of our data using matrix operations.

Each row of the matrix represents one document.  Each column represents one word of the vocabulary, and each cell records how many times that word appears in that document.  
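
To make the idea concrete, here is a minimal sketch that builds such a matrix by hand, using only the standard library.  It assumes a deliberately naive tokenizer (splitting on spaces) and removes no stop words, so it is an illustration of the principle rather than a substitute for a proper vectorizer.

```python
from collections import Counter

docs = ['indigenous peoples', 'indigenous affairs',
        'international trade', 'trade relations']

# Split each document into lowercase words (a deliberately naive tokenizer).
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all distinct words in the corpus.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary word, each cell a count.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```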

We will import a library that easily creates a TDM for us.  

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

We start by creating an instance of the class CountVectorizer, which converts a list of texts into a term-document matrix.  

A <b>class</b> is a fundamental concept in programming.  It is a pre-programmed category of objects that share specific properties (called attributes) and functions (called methods).  For instance, the CountVectorizer class has a method called fit_transform() that converts a corpus into a TDM. 
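
To illustrate the vocabulary, the following sketch defines a tiny class of our own.  The class WordCounter and its method count() are hypothetical, invented for this example; they are not part of any library.

```python
class WordCounter:
    """A hypothetical, minimal class used only to illustrate the vocabulary."""

    def __init__(self, lowercase=True):
        # Options passed at creation are stored as attributes of the instance.
        self.lowercase = lowercase

    def count(self, text):
        # A method is a function attached to the class; it can read the attributes.
        if self.lowercase:
            text = text.lower()
        return len(text.split())


# Creating an instance of the class, under a name of our choice.
counter = WordCounter(lowercase=True)

# Calling a method on the instance.
n_words = counter.count('International Trade relations')
print(n_words)
```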

In [None]:
# The pound symbol indicates that this line is a comment, not executed when running the script. 
# Here, we passed an option to remove English stop words.  
# We create an instance of the CountVectorizer class, called "tdm".  It could be any other name.

tdm = CountVectorizer(stop_words='english')

Next, we use the method "fit_transform" on the list of texts we created, and transform our speeches into a matrix.

In [None]:
X = tdm.fit_transform(speeches)

In [None]:
X.todense()

To view which column corresponds to which word, we can use the following command:

In [None]:
# Available in scikit-learn 1.0 and later; older versions use get_feature_names() instead.
tdm.get_feature_names_out()

Just to be clear, here's the matrix we created, printed as a spreadsheet-like dataset.

In [None]:
import pandas as pd

pd.DataFrame(X.todense(), columns=tdm.get_feature_names_out())

In [None]:
speeches

<h3> 1.3. Fitting a Model </h3>

Many models for textual analysis can be fitted from the term-document matrix.  We will start by looking at a simple and elegant topic model, non-negative matrix factorization, to see how it works concretely.

We first import the class NMF.

In [None]:
from sklearn.decomposition import NMF

Again, we create our own instance of the class, by giving it a name of our choice, so that we can use its methods. 

As in many topic models, we need to choose the number of topics in advance.  

In our toy example, we know that the solution should be two topics.

In [None]:
nmf = NMF(n_components = 2, random_state = 0)

We can now decompose our term-document matrix.  The NMF model decomposes the original TDM into two smaller matrices: 

$ \mathbf{W}_{m \times k}\mathbf{H}_{k \times n} \approx \mathbf{X}_{m \times n} $

The optimization problem is:

$\min_{\mathbf{W}, \mathbf{H}} \sum_{i,j}\left((\mathbf{WH})_{ij} - \mathbf{X}_{ij}\right)^2$ 

with the constraints that all elements of $\mathbf{W}$ and $\mathbf{H}$ be non-negative.
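
For intuition about how such a problem can be solved, here is a minimal sketch using multiplicative updates, one standard scheme for NMF, on a small hand-made count matrix.  The names X_toy, W_toy and H_toy are illustrative, and the library we use in this module may rely on a different solver, so the numbers will not match its output.

```python
import numpy as np

# A 4 x 6 toy matrix of word counts (rows are documents, columns are words).
X_toy = np.array([[0, 1, 0, 1, 0, 0],
                  [1, 1, 0, 0, 0, 0],
                  [0, 0, 1, 0, 0, 1],
                  [0, 0, 0, 0, 1, 1]], dtype=float)

k = 2          # number of topics
eps = 1e-10    # small constant to avoid division by zero
rng = np.random.default_rng(0)

# Start from random non-negative matrices.
W_toy = rng.random((X_toy.shape[0], k))
H_toy = rng.random((k, X_toy.shape[1]))

initial_error = np.sum((W_toy @ H_toy - X_toy) ** 2)

# Each update multiplies the current values by a non-negative ratio,
# so W_toy and H_toy remain non-negative throughout.
for _ in range(200):
    H_toy *= (W_toy.T @ X_toy) / (W_toy.T @ W_toy @ H_toy + eps)
    W_toy *= (X_toy @ H_toy.T) / (W_toy @ H_toy @ H_toy.T + eps)

final_error = np.sum((W_toy @ H_toy - X_toy) ** 2)
print(initial_error, final_error)
```

The product of the two small matrices only approximates the original matrix, which is why the final error is reduced but not necessarily zero.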

In [None]:
W = nmf.fit_transform(X)

In [None]:
H = nmf.components_

Let us look at the W and H matrices.

W is a clustering of the documents, by topic.

In [None]:
W

In [None]:
pd.DataFrame(W, index=speeches, columns=['Topic1', 'Topic2'])

The first two documents were assigned to one cluster, and the last two documents to the other.  In other words, the model has learned to detect the two most relevant clusters (or topics) in our TDM.

The H matrix gives us the most relevant words for each cluster.  For each row, the largest values indicate anchor words, words that serve to define the clusters.

In [None]:
H

In [None]:
pd.DataFrame(H, index=['Topic1', 'Topic2'], columns=tdm.get_feature_names_out())

The word "indigenous" best defines the first cluster.  The word "trade" best defines the second cluster.  (In this case, they are said to be "true anchor words".)

<h2> 2. A real-life example: The Canadian Hansard </h2>
    
Let us use the file hansard_feb_2017.csv.  It contains the speeches made in the month of February 2017 in the Canadian House of Commons, taken from the www.lipad.ca website.

The file was saved in the comma-separated values (CSV) format.  The pandas library (we used it above) makes it easy to load CSV files into a data frame.

In [None]:
import pandas as pd

df = pd.read_csv("hansard_feb_2017.csv")

Verifying the size of our dataset is a useful thing to do.

In [None]:
df.shape

Seeing what it looks like (printing the first few rows).

In [None]:
df.head()

Listing all the variables.

In [None]:
df.columns

Looking at the distribution for some of these variables.  

The main topics (i.e., the item in the daily order of business) and subtopics (the specific subtitle in the Hansard).

In [None]:
df.maintopic.value_counts()

In [None]:
df.subtopic.value_counts().head(20)

The party affiliation of the speakers.

In [None]:
df.speakerparty.value_counts()

<h3> 2.1. Creating a TDM </h3>

We can replicate the same steps as above, with our real-life corpus.  We will use the column "speechtext" containing the content of the speeches.

The TF-IDF (term frequency-inverse document frequency) matrix gives more weight to words that appear in few documents and less weight to words that appear in most of them, but the idea is similar to the term-document matrix introduced before.  Instead of raw counts, we are using weighted word counts.
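
The weighting can be sketched by hand on a tiny tokenized corpus.  The version below assumes a smoothed inverse document frequency, $\ln\frac{1+n}{1+df} + 1$, followed by rescaling each row to unit length, which is my understanding of the default behaviour of scikit-learn's TfidfVectorizer; other variants of TF-IDF exist, and the toy documents are invented for the example.

```python
import math
from collections import Counter

docs = [['carbon', 'tax', 'tax'],
        ['carbon', 'price'],
        ['trade', 'tax']]

n_docs = len(docs)
vocabulary = sorted({word for doc in docs for word in doc})

# Document frequency: the number of documents in which each word appears.
doc_freq = {word: sum(word in doc for doc in docs) for word in vocabulary}

# Smoothed inverse document frequency: rarer words receive a larger weight.
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocabulary}

tfidf = []
for doc in docs:
    counts = Counter(doc)
    row = [counts[word] * idf[word] for word in vocabulary]
    # Rescale each row to unit Euclidean length, so that long speeches do not dominate.
    norm = math.sqrt(sum(value ** 2 for value in row))
    tfidf.append([value / norm for value in row])

print(vocabulary)
for row in tfidf:
    print([round(value, 3) for value in row])
```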

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tdm = TfidfVectorizer(stop_words='english', max_features=2000)
X = tdm.fit_transform(df.speechtext)

In [None]:
X.shape

<h3> 2.2. Fitting a Topic Model </h3>

We can now fit a topic model, as we did before.  We'll start with 10 topics to give an example, and look at a metric that can help to choose an optimal number of topics later on.  

In [None]:
from sklearn.decomposition import NMF

nmf = NMF(n_components = 10, random_state = 0)
W = nmf.fit_transform(X)

We'll use a function to avoid repeating ourselves.  The function simply says: for each topic, find the words in the H matrix with the highest values, and print them. 

In [None]:
def print_top_words(model, feature_names, top_n):
    H = model.components_
    for topic_id, topic in enumerate(H):
        message = "Topic #%d: " % topic_id
        message += " ".join([feature_names[i].replace(' ','_') for i in topic.argsort()[::-1][:top_n]])
        print(message)

print_top_words(nmf, tdm.get_feature_names_out(), 10)
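
The expression topic.argsort()[::-1][:top_n] in the function above may look cryptic at first.  The sketch below unpacks it on a single, hypothetical row of weights; the words and values are made up for the illustration.

```python
import numpy as np

# One hypothetical row of H: the weight of each word for a single topic.
feature_names = ['affairs', 'indigenous', 'peoples', 'trade']
topic = np.array([0.4, 1.3, 0.7, 0.0])

# argsort() returns the positions that would sort the weights in ascending order.
ascending = topic.argsort()

# [::-1] reverses that order, and [:2] keeps the positions of the two largest weights.
top_positions = ascending[::-1][:2]
top_words = [feature_names[i] for i in top_positions]
print(top_words)
```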

Let us illustrate Latent Dirichlet Allocation, one of the most popular topic models out there.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

tdm = CountVectorizer(stop_words='english', max_features=2000)
X = tdm.fit_transform(df.speechtext)
lda = LatentDirichletAllocation(n_components = 10, random_state = 0, learning_method = 'online')
W = lda.fit_transform(X)

print_top_words(lda, tdm.get_feature_names_out(), 10)

We can improve the model by preprocessing the text.  As an illustration, I have already included a version of the text containing only nouns.  For topic modeling, some parts of speech are more relevant than others.

We can also consider n-grams (sequences of more than one word).  

These preprocessing steps are sensitive and must be weighed carefully, but understanding them may help to adapt an empirical method to a particular theoretical problem. 
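
To see what the n-grams mentioned above look like, here is a minimal sketch that extracts them from a tokenized sentence.  The helper function ngrams() is hypothetical and written for this example; a vectorizer would generate these sequences internally.

```python
def ngrams(tokens, n):
    """Return every sequence of n consecutive tokens, joined by a space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = 'the carbon tax plan'.split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# With ngram_range=(1, 2), both lists would be considered as candidate features.
print(unigrams + bigrams)
```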

In [None]:
tdm = TfidfVectorizer(ngram_range=(1,2), max_features=2000)
X = tdm.fit_transform(df.preprocessed_text)
nmf = NMF(n_components = 10, random_state = 0)
W = nmf.fit_transform(X)
print_top_words(nmf, tdm.get_feature_names_out(), 10)

A number of methods to evaluate topic coherence have been proposed in the literature.  For instance, we can implement the topic coherence score of Mimno et al. (2011).  The closer to zero (the higher the value), the more coherent the topics. 

A sound analysis involves validating the semantic coherence of the produced topics, in particular by comparing the impact of changing the number of topics, which is the key arbitrary decision that a researcher needs to make.
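
For reference, and as I read the paper, the coherence of a topic $t$ with $M$ top words $v_1^{(t)}, \ldots, v_M^{(t)}$ (ranked from most to least probable) can be written as:

$ C(t) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(v_m^{(t)}, v_l^{(t)}) + 1}{D(v_l^{(t)})} $

where $D(v)$ is the number of documents containing word $v$, and $D(v, v')$ is the number of documents containing both $v$ and $v'$.  The ratio is small when two top words rarely occur together, which pulls the score further below zero.  The function defined next applies the same idea, counting documents among those assigned to each topic.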

In [None]:
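import math
import numpy as np

def coherence_score(model, tdm, top_n):
    # Topic coherence in the spirit of Mimno et al. (2011), computed for each
    # topic on the documents assigned to that topic.

    W = model.transform(tdm)
    H = model.components_
    topic_assignment = np.argmax(W, axis=1)

    topic_coherence = []

    for topic_id, topic in enumerate(H):

        idx = topic_assignment == topic_id
        temp = tdm[idx, :]
        top_words = topic.argsort()[::-1][:top_n]
        coherence = 0.0

        # Loop over every pair of top words, with word j ranked above word i.
        for i in range(1, len(top_words)):
            for j in range(i):

                word_i = temp[:, top_words[i]].toarray().ravel()
                word_j = temp[:, top_words[j]].toarray().ravel()

                # D12: documents containing both words (plus one for smoothing).
                # D2: documents containing the higher-ranked word.
                D12 = np.count_nonzero(word_i * word_j) + 1
                D2 = np.count_nonzero(word_j)

                # Skip pairs whose higher-ranked word is absent, to avoid dividing by zero.
                if D2 > 0:
                    coherence += math.log(D12 / D2)

        topic_coherence.append(coherence)

    return topic_coherence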

In [None]:
coherence = coherence_score(nmf, X, 10)

for topic_id, value in enumerate(coherence):
    print("Topic %d's Coherence: %0.3f" %(topic_id, value))

In [None]:
np.mean(coherence)

Suppose that we are satisfied with our model (we shouldn't be yet, but for the sake of illustration, let's assume we are).

We can append the predicted topics to the original data frame.  Next, we can examine the topics produced, using groupings of interest.

In [None]:
df['topic'] = np.argmax(W, axis=1)

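# Flag speeches assigned to topic 1 (assumed here to be the carbon tax topic; check the top words printed above).
df['carbon_tax'] = np.where(df.topic==1,1,0)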

In [None]:
pd.crosstab(df.speakerparty, df.carbon_tax, normalize='index')

And of course, we can save our enriched dataset for future use.

In [None]:
df.to_csv("hansard_feb_2017_v2.csv", index=False)

<h3> 2.3. Using the Fightin' Words Algorithm of Monroe et al. (2008) </h3>

We can examine which words are distinctive of each party using a technique proposed by political scientists.  Anyone using the method in a project should consult the paper, but in essence it computes z-scores indicating which words are more specific to one group of texts than to another. 
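
To give a sense of the computation, here is a minimal sketch of the z-scores based on the difference in log-odds with a Dirichlet prior, as I understand the approach in Monroe et al. (2008).  The word counts are invented, the constant prior of 0.01 per word is an assumption made for simplicity, and this sketch is not the implementation of the module used in the next cell.

```python
import math

# Hypothetical word counts for two groups of speeches.
counts_a = {'tax': 30, 'carbon': 20, 'families': 5, 'jobs': 45}
counts_b = {'tax': 10, 'carbon': 5, 'families': 40, 'jobs': 45}

vocabulary = sorted(set(counts_a) | set(counts_b))

# Assumed prior: a small constant pseudo-count for every word.
prior = {word: 0.01 for word in vocabulary}

n_a = sum(counts_a.values())
n_b = sum(counts_b.values())
a0 = sum(prior.values())

zscores = {}
for word in vocabulary:
    y_a = counts_a.get(word, 0) + prior[word]
    y_b = counts_b.get(word, 0) + prior[word]
    # Difference between the two groups in the log-odds of using the word.
    delta = math.log(y_a / (n_a + a0 - y_a)) - math.log(y_b / (n_b + a0 - y_b))
    # Approximate variance of that difference.
    variance = 1.0 / y_a + 1.0 / y_b
    zscores[word] = delta / math.sqrt(variance)

# Positive scores point to the first group, negative scores to the second.
for word in sorted(zscores, key=zscores.get, reverse=True):
    print(word, round(zscores[word], 2))
```

A word used equally often by both groups, such as "jobs" here, receives a score of zero.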

In [None]:
from fw import FWExtractor

liberal = df[df.speakerparty=='Liberal'].speechtext.tolist()
cpc = df[df.speakerparty=='Conservative'].speechtext.tolist()

tdm = CountVectorizer(stop_words='english', ngram_range=(1,2))
f = FWExtractor(cv = tdm)
results = f.transform(liberal, cpc)
FW = pd.DataFrame(results, columns=['word','freq_liberal','freq_cpc','zscore'])

In [None]:
FW.sort_values(by='zscore', ascending=False).head(20)

In [None]:
FW.sort_values(by='zscore', ascending=True).head(20)

In [None]:
from fw import FWExtractor

liberal = df[df.speakerparty=='Liberal'].preprocessed_text.tolist()
cpc = df[df.speakerparty=='Conservative'].preprocessed_text.tolist()

tdm = CountVectorizer(stop_words='english', ngram_range=(1,2))
f = FWExtractor(cv = tdm)
results = f.transform(liberal, cpc)
FW = pd.DataFrame(results, columns=['word','freq_liberal','freq_cpc','zscore'])

In [None]:
FW.sort_values(by='zscore', ascending=False).head(20)

In [None]:
FW.sort_values(by='zscore', ascending=True).head(20)

<h2> 3. A More Involved Example </h2>

Suppose we have many files, each representing one day of debates.  This is one of the formats available on the Lipad.ca website. We wish to put them all together.  Moreover, we would like to join additional information on MPs, for instance to study representation by gender.

Let us look at an example.  The folder 2016/4/ contains daily files with parliamentary speeches (for April 2016).  We can loop over the files and concatenate them into a dataset. 

Next, the file member_db.csv contains sociodemographic information about Canadian MPs.  We can merge it with the speech dataset to add this information.

In [None]:
import pandas as pd
import os

# Looping through all daily files in the root directory, and appending the file names:
rootdir = '2016/'
all_files = []
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        all_files.append(os.path.join(subdir, file))

# Concatenating all files at once:         
df = pd.concat((pd.read_csv(f) for f in all_files))

# Loading the member file:
mp_data = pd.read_csv('member_db.csv')

# Merging the information in the member file with the speeches, using the MP id key (pid):
df = df.merge(mp_data, on='pid', how='left')
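
When merging, it is worth checking how many speeches found no match in the member file.  The sketch below shows one way to do so on two small, hypothetical data frames (speeches_toy and members_toy), using the indicator option of the merge method.

```python
import pandas as pd

# Hypothetical speeches and member records sharing an identifier column, "pid".
speeches_toy = pd.DataFrame({'pid': [1, 2, 2, 3],
                             'speechtext': ['first', 'second', 'third', 'fourth']})
members_toy = pd.DataFrame({'pid': [1, 2],
                            'gender': ['F', 'M']})

# how='left' keeps every speech; indicator=True adds a "_merge" column
# showing whether a matching member record was found.
merged = speeches_toy.merge(members_toy, on='pid', how='left', indicator=True)

# Speeches without a match have missing values in the member columns.
n_unmatched = (merged['_merge'] == 'left_only').sum()
print(merged)
print(n_unmatched)
```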

In [None]:
df.head()

In [None]:
df.gender.value_counts()

<h3> 3.1. Sentiment Analysis </h3>

We can use the VADER library for Python to compute a sentiment score for each speech (Hutto and Gilbert 2014).  VADER performs valence shifting and accounts for amplifiers/dampeners.  Although conceived for social media, it usually offers high quality sentiment indicators, and can serve as a useful benchmark.
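
To clarify what valence shifting and amplifiers/dampeners mean, here is a toy sketch of a lexicon-based scorer.  The mini-lexicon, the word lists and the function toy_sentiment() are all hypothetical; VADER's own lexicon and rules are far richer, and this is not its algorithm.

```python
# A hypothetical mini-lexicon mapping words to a valence score.
lexicon = {'happy': 2.0, 'good': 1.5, 'sad': -2.0, 'bad': -1.5}
negations = {'not', 'never', 'no'}
# Amplifiers have a factor above 1, dampeners a factor below 1.
modifiers = {'very': 1.5, 'extremely': 2.0, 'slightly': 0.5, 'somewhat': 0.7}


def toy_sentiment(text):
    tokens = text.lower().replace('.', '').split()
    score = 0.0
    for i, token in enumerate(tokens):
        if token not in lexicon:
            continue
        value = lexicon[token]
        previous = tokens[max(0, i - 2):i]   # the two preceding words
        for word in previous:
            # Amplifiers and dampeners rescale the valence of the word.
            value *= modifiers.get(word, 1.0)
        if any(word in negations for word in previous):
            # A nearby negation flips the sign of the valence.
            value = -value
        score += value
    return score


print(toy_sentiment('I am happy.'))
print(toy_sentiment('I am not happy.'))
print(toy_sentiment('I am very happy.'))
```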

We need to download the VADER lexicon first, using the nltk library (a comprehensive library for textual analysis in Python).

In [None]:
import nltk

In [None]:
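# Download only the VADER lexicon; calling nltk.download() without arguments opens an interactive downloader instead.
nltk.download('vader_lexicon')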

Now we can import the class, create an instance, and compute a compound sentiment score (between -1 and 1) for each speech in our corpus.

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()

In [None]:
# A basic example first:
example = "I am not happy."
vader.polarity_scores(example)

In [None]:
df['sentiment'] = df.speechtext.apply(lambda x: vader.polarity_scores(x)['compound'])

To illustrate, we can compare the sentiment scores by groupings of interest.

In [None]:
df.groupby('gender').sentiment.mean()

Or we can print out the speech with the highest sentiment score:

In [None]:
df.iloc[df.sentiment.values.argmax()].speechtext

<h2> References </h2>

Beelen, Kaspar, Timothy Alberdingk Thijm, Christopher Cochrane, Kees Halvemaan, Graeme Hirst, Michael Kimmins, Sander Lijbrink, Maarten Marx, Nona Naderi, Roman Polyanovsky, Ludovic Rheault, and Tanya Whyte. 2017. "Digitization of the Canadian Parliamentary Debates." Canadian Journal of Political Science 50(3): 849-864.

Grimmer, Justin, and Brandon M. Stewart. 2013. "Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts." Political Analysis 21(3): 267-297.

Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. 2008. "Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict." Political Analysis 16(4): 372-403.

Hutto, C.J., and Eric Gilbert. 2014. "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text." Eighth International Conference on Weblogs and Social Media (ICWSM-14).

Mimno, David, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. "Optimizing Semantic Coherence in Topic Models." In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 262–272.