# Class 7: Topic Modeling and Dictionary-based Analysis - Exercise

## 0 Setup 

In [71]:
# Import basic Python modules
import os
import platform

# Data management libraries
import numpy as np
import pandas as pd

# For progress bar
from tqdm import tqdm

# Gensim
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel

# SpaCy
import spacy

# For regular expressions
import re

# For sentiment analysis
from sentida import Sentida

In [22]:
# # # # Working Directory # # # #

if platform.system() == 'Linux':
    wd = '/home/rask/'
else:
    wd = 'C:/Users/au535365/'

wd = os.path.join(wd, 'Dropbox/teaching/css_fall2023')
    
# Change directory
os.chdir(wd)

# Confirm that the working directory is as intended 
os.getcwd()

'/home/rask/Dropbox/teaching/css_fall2023'

#### Exercise 0.0: Reading in Data

We start by reading in data. We work with the same data as in *class05* and *class06* but restrict ourselves to a single year. You can choose whatever year you like, but make sure to have at least $10,000$ speeches in the dataframe when you have loaded the data. Call the dataframe `df`.

Note also that we can read the data directly from GitHub. See the notebook `class05-filereading.ipynb` for details.

#### Solution 0.0

#### Exercise 0.1: Removal of Short Texts

Text-based methods generally work better with longer documents. Hence, we remove short texts, which are likely to be uninformative anyway. 

Compute the number of characters in each speech and keep only speeches with $V$ or more characters. Argue for your choice of $V$.
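
A minimal sketch of the length filter, using a toy list in place of `df`. The threshold `V = 50` and the example speeches are assumptions for illustration only; choose your own $V$ from the length distribution of your data. With a dataframe, the same lengths can be obtained with `df['text'].str.len()`.

```python
# Toy speeches standing in for the `text` column of `df` (made-up examples)
speeches = [
    "Tak.",
    "Jeg vil gerne takke ministeren for redegørelsen og stille et opfølgende spørgsmål.",
    "Ja.",
]

V = 50  # hypothetical threshold in characters

# Number of characters in each speech
n_chars = [len(s) for s in speeches]

# Keep only speeches with V or more characters
speeches_long = [s for s, n in zip(speeches, n_chars) if n >= V]

print(n_chars)
print(len(speeches_long))
```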

#### Solution 0.1

## 1 Topic Modeling

The exercises in this section concern topic modeling. Our task is to identify the topics discussed in speeches from the Danish parliament in the year you chose in *exercise 0.0*. This is a **very** common task when you work with text data. Mastering it will make your life easier :-) 

#### Exercise 1.0: Load SpaCy Model

1) Load in the spacy model we used in *class06* and assign to an object called `spacy_pipeline_da`.
2) Define a list called `texts` with the `text` column from `df`. Remember to type cast it as a list.

#### Solution 1.0

#### Exercise 1.1: Text Cleaning I

Cleaning our text is a very important step in working with text data. 

  1. Remove the non-breaking space character `'\xa0'`, which appears in some but not all texts.
  2. Collapse two or more consecutive whitespaces into a single space.
    
You can use the `re` module for these tasks wrapped in list comprehensions. We also did this in the tutorial.

A good tip is to assign the cleaned text to a new object, for instance called `texts_cleaned` or another name of your choice.
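
The two cleaning steps can be sketched as follows, using made-up example texts. Replacing `'\xa0'` with a regular space (rather than deleting it outright) is an assumption made here so that neighbouring words are not glued together.

```python
import re

# Toy texts standing in for `texts` (made-up examples)
texts = ["Tak\xa0for  ordet.", "Det   er en\xa0\xa0vigtig sag."]

# Step 1: replace the non-breaking space '\xa0' with a regular space
texts_cleaned = [re.sub(r'\xa0', ' ', t) for t in texts]

# Step 2: collapse two or more consecutive whitespace characters into one space
texts_cleaned = [re.sub(r'\s{2,}', ' ', t).strip() for t in texts_cleaned]

print(texts_cleaned)
```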

#### Solution 1.1

#### Exercise 1.2: Text Cleaning II

Each language has a set of stopwords, which we typically want to remove. 

Each corpus also has a set of corpus-specific terms that can be viewed as stopwords. In parliamentary speeches, this includes words like *ordfører*, *lovforslag*, and so on. Furthermore, mentions of legislators and parties are very common and often uninformative, at least from the perspective of topic modelling.  

Below, I provide you with lists of names of politicians, procedural words, and parties that we want to remove, combined in `removal_words`. I also build the regular expression `removal_pattern` for you. Try to figure out what's going on if you like. Maybe ask ChatGPT.

Your task is now to:

1. Define a list of Danish stopwords using the loaded pipeline `spacy_pipeline_da` (we also did this in *class06* and in the *class07-tutorial*). Call the object `stopwords`.

2. Remove the words defined in the regular expression `removal_pattern` from your cleaned text from *exercise 1.1*.

3. Remove two or more consecutive whitespaces (reuse your code from *exercise 1.1*).




In [37]:
names_to_remove = [x[:-1] + '[a-z]+' for x in list(df.speaker.unique())]

procedural_to_remove = ['[Ll]ovforsla[a-z]+', 'ordfør[a-z]+', 'spørgsmå[a-z]+',
                        'forsla[a-z]+', 'L', 'B', '[Hh]r', '[Ff]ru', '[Aa]fstemnin[a-z]+',
                        '[Ff]orhandlin[a-z]+']
                   
parties_to_remove = ['[Ll]iberal [Aa]llianc[a-z]+', 'LA', '[Dd]et [Kk]onservative [Ff]olkepar[a-z]+', 'KF',
                   '[Dd]e [Kk]onservati[a-z]+', 'Venst[a-z]+', '[Dd]ansk [Ff]olkepart[a-z]+', 
                   '[Nn]ye [Bb]orgerli[a-z]+', '[Dd]e [Rr]adikal[a-z]+', '[Ss]ocialdemokratie[a-z]+',
                   '[Ss]ocialdemokra[a-z]+', '[Ss]ocialistis[a-z]+ [Ff]olkepart[a-z]+', 'SF',
                   '[Aa]lternative[a-z]+', '[Ee]nhedslist[a-z]+', '[Rr]adika[a-z]+']        

removal_words = parties_to_remove + procedural_to_remove + names_to_remove
        
removal_pattern = r'\b(?:' + '|'.join(removal_words) + r')\b'
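
To see how such a pattern behaves, here is a minimal sketch that builds a toy pattern in the same way as `removal_pattern` and applies it with `re.sub`. The shortened word list `toy_removal_words` and the example sentence are hypothetical and only serve as illustration.

```python
import re

# Hypothetical, shortened word list; the real `removal_words` is defined above
toy_removal_words = ['[Ll]ovforsla[a-z]+', 'ordfør[a-z]+', '[Hh]r', 'Venst[a-z]+']

# Same construction as `removal_pattern`: a non-capturing group of alternatives
# wrapped in word boundaries so that only whole words match
toy_pattern = r'\b(?:' + '|'.join(toy_removal_words) + r')\b'

text = "Tak til hr ordføreren fra Venstre for lovforslaget om skat."

# Remove the matched words, then collapse the leftover whitespace
text_removed = re.sub(toy_pattern, '', text)
text_removed = re.sub(r'\s{2,}', ' ', text_removed).strip()

print(text_removed)
```

Removing a word leaves its surrounding spaces behind, which is why the whitespace step has to be repeated afterwards.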

#### Solution 1.2

#### Exercise 1.3: Tokenization

We now want to tokenize our speeches. Use the pipeline `spacy_pipeline_da`, which you loaded in *exercise 1.0*.

You should iterate over our cleaned text from *exercise 1.2*. Assign the tokens to an object called `tokens_raw`.

In [39]:
# Tokenize 
tokens_raw = [[d for d in spacy_pipeline_da(doc)] for doc in tqdm(texts_cleaned, position=0, leave=True)]

100%|█████████████████████████████████████| 17937/17937 [07:25<00:00, 40.23it/s]


#### Solution 1.3

#### Exercise 1.4: Preprocessing I

Preprocessing can influence our results in both positive and negative ways. Apply a range of preprocessing steps of your choice to your `tokens_raw` object from *exercise 1.3*. You should remove stopwords, but what you do besides that depends on your intuition and line of reasoning. Whatever you decide, you should return tokens using SpaCy's `.text` attribute, or the `.lower_` attribute if you want to lowercase your terms.

Assign your result to an object called `tokens_clean`

#### Solution 1.4

#### Exercise 1.5: Preprocessing II

As a final preprocessing step, we consider whether we should remove rare and frequently occurring terms. We have already removed many frequent terms when we removed stopwords and the domain-specific stopwords such as the names of legislators, party names, and procedural words. 

1. Argue why removal of frequent and rare terms can benefit topic modeling.
2. Compute the word frequency and word-document frequency based on your `tokens_clean`. I have provided you with code to do so in the tutorial (**Note**: you cannot use the code if you have not returned the tokens using the `.text` or `.lower_` attribute in *exercise 1.4*).
3. Remove frequent and rare terms according to thresholds of your choice (e.g. remove terms that occur in more than 10% of all speeches and terms that occur in fewer than 5 speeches across the corpus).
4. Compute the number of tokens for the first speech before and after removal of rare/frequent words.

Assign your final tokens to an object called `tokens_final`
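
The frequency computation and filtering can be sketched with `collections.Counter` on a toy corpus. This is not the tutorial code; the thresholds `max_doc_share` and `min_doc_count` and the example tokens are assumptions chosen so that the tiny corpus shows both kinds of removal.

```python
from collections import Counter

# Toy tokenized corpus standing in for `tokens_clean` (made-up examples)
tokens_clean = [
    ["skat", "bolig", "skat", "klima"],
    ["klima", "energi", "skat"],
    ["sundhed", "skat", "klima"],
    ["skole", "skat"],
]

# Word frequency: total number of occurrences across the corpus
word_freq = Counter(tok for doc in tokens_clean for tok in doc)

# Word-document frequency: number of documents each term appears in
doc_freq = Counter(tok for doc in tokens_clean for tok in set(doc))

n_docs = len(tokens_clean)
max_doc_share = 0.8  # hypothetical: drop terms in more than 80% of documents
min_doc_count = 2    # hypothetical: drop terms in fewer than 2 documents

# Terms that are neither too frequent nor too rare
keep = {tok for tok, n in doc_freq.items()
        if n >= min_doc_count and n / n_docs <= max_doc_share}

tokens_final = [[tok for tok in doc if tok in keep] for doc in tokens_clean]

# Number of tokens in the first document before and after filtering
print(len(tokens_clean[0]), len(tokens_final[0]))
```

Note that filtering can leave some documents empty, as happens for the last toy document here; it is worth checking for this before estimating a model.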

#### Solution 1.5

#### Exercise 1.6

1. Construct the vocabulary using the `Dictionary` class from gensim on `tokens_final`.
2. Construct the bag-of-words (BoW) representation of each speech from the vocabulary in step 1.

See the tutorial for the class if you can't remember how to do so.
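
To make the two objects concrete, here is a plain-Python illustration of what a vocabulary and a BoW look like. It is not gensim itself: in gensim, `Dictionary(tokens_final)` builds the vocabulary and its `.doc2bow()` method converts one tokenized document, and the ids gensim assigns may differ from this toy version. The example tokens are made up.

```python
from collections import Counter

# Toy tokens standing in for `tokens_final` (made-up examples)
tokens_final = [["klima", "energi", "klima"], ["skat", "klima"]]

# Vocabulary: map each unique token to an integer id (order of first appearance)
token2id = {}
for doc in tokens_final:
    for tok in doc:
        if tok not in token2id:
            token2id[tok] = len(token2id)

# Bag-of-words: each document becomes a sorted list of (token_id, count) tuples
bow = [sorted(Counter(token2id[tok] for tok in doc).items()) for doc in tokens_final]

print(token2id)
print(bow)
```

Word order is discarded in this representation; only which terms occur, and how often, is kept.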

#### Solution 1.6

#### Exercise 1.7

Estimate an LDA model with a topic number $k$ of your choice. Argue for your choice of $k$.

#### Solution 1.7

#### Exercise 1.8: Interpretation of Results

1. Print the results from the topic model using the `.print_topics()` method, setting the argument `num_topics` to your choice of $k$. If you do not specify `num_topics`, some of the topics may be left out.
2. Use the `TopicInspector` class from the tutorial to see the topic-word distributions and topic-document distributions.

In [112]:
class TopicInspector:
    """ A class for inspecting and analyzing Latent Dirichlet Allocation (LDA) models. 
        
        Attributes:
            lda_model (gensim.models.LdaModel): The LDA model to be inspected.
            vocab (dict): A vocabulary mapping from word IDs to words.
            corpus (list of list of tuples): The corpus of documents used to train the LDA model.
            num_topics (int): The number of topics in the LDA model.
            topn (int, optional): The number of top words in each topic to consider (default is 10).
        
        Methods:
            id2token(wid): Convert a word ID to its corresponding word in the vocabulary.
            get_topic_words(tid): Get the top words associated with a given topic.
            get_topic_word_prob(tid): Get the probabilities of the top words in a given topic.
            topic_word_df(): Create a DataFrame representing the top words for each topic.
            topic_doc_df(add_max_topic=True, add_max_score=True): Create a DataFrame representing the topic distribution for each document in the corpus.
    """
    
    def __init__(self, lda_model, vocab, corpus, topn=10):
        self.lda_model = lda_model
        self.vocab = vocab
        self.corpus = corpus
        self.num_topics = self.lda_model.num_topics
        self.topn = topn
    
    def id2token(self, wid):
        return self.vocab[wid]
    
    def get_topic_words(self, tid):
        topic_terms = self.lda_model.get_topic_terms(tid, topn=self.topn)
        wordids, score = zip(*topic_terms)
        return [self.id2token(x) for x in wordids]
    
    def get_topic_word_prob(self, tid):
        topic_terms = self.lda_model.get_topic_terms(tid, topn=self.topn)
        wordids, score = zip(*topic_terms)
        return score
    
    def topic_word_df(self):
        
        topic_df_list = []
        for k in range(self.num_topics):
            words = self.get_topic_words(tid=k)
            topic_df_ = pd.DataFrame(words, columns=[f'topic{k}'])
            topic_df_list.append(topic_df_)
        
        topic_word_df = pd.concat(topic_df_list, axis=1)
        
        return topic_word_df
            
    
    def topic_doc_df(self, add_max_topic=True, add_max_score=True):
        
        topic_docs = self.lda_model.get_document_topics(self.corpus, minimum_probability=0)
        doc_dist_list = []
        for d in range(len(topic_docs)):
            doc_dist = [x[1] for x in topic_docs[d]]
            doc_dist_list.append(doc_dist)
        
        topic_doc_df = pd.DataFrame(doc_dist_list, columns=[f'topic{x}' for x in range(self.num_topics)])
        if add_max_topic:
            max_topics = topic_doc_df.idxmax(axis=1)
        else:
            max_topics = None
        
        if add_max_score:
            max_scores = topic_doc_df.max(axis=1)
        else:
            max_scores = None
        
        if add_max_topic:
            topic_doc_df['max_topic'] = max_topics
        
        if add_max_score: 
            topic_doc_df['max_score'] = max_scores

        return topic_doc_df

#### Solution 1.8

## 2 Sentiment

The exercises in this section concern sentiment analysis. 

#### Exercise 2.0

We will work with the lists of positive and negative words from the Lexicoder Sentiment Dictionary (LSD). The LSD was originally compiled by Young and Soroka (2012) for English text with the purpose of studying newspaper content. 

The LSD dictionary was translated by Proksch et al. (2019) in their article *Multilingual sentiment analysis: A new approach to measuring conflict in legislative speeches* to study the government/opposition conflict-dynamic at the level of each bill in Western democracies.
 
Note that the dictionaries contain bi-grams and maybe also tri-grams. However, we rely only on the one-grams, i.e. single words.

1. Load the two dictionaries directly from the GitHub repo as pandas dataframes:
- *lsd_pos.csv*
- *lsd_neg.csv*
2. Keep only single words (i.e. exclude bi-grams, tri-grams, and so on) and words with at least three characters

3. Define a list called `docs` based on the dataframe `df`, which you loaded and filtered in *exercise 0.0* and *0.1*, respectively. 

4. Generate lazy tokens from the `docs` list using a simple `.split()` method
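
Steps 2 and 4 can be sketched on toy data as follows. The dictionary entries and documents are made up, and lowercasing before splitting is an added assumption (it only helps if the dictionary entries are lowercase).

```python
# Toy dictionary entries standing in for one LSD word list (made-up examples)
entries = ["god", "ja", "rigtig god", "fantastisk", "ok"]

# Keep single words only (no whitespace inside) with at least three characters
single_words = [w for w in entries if len(w.split()) == 1 and len(w) >= 3]

# Toy documents standing in for `docs`
docs = ["Det er en god dag", "Fantastisk arbejde"]

# "Lazy" tokens: lowercase and split on whitespace, no further processing
tokens = [doc.lower().split() for doc in docs]

print(single_words)
print(tokens)
```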

#### Solution 2.0

#### Exercise 2.1

In this exercise, we will compute sentiment scores for each text using three different approaches.

- Approach 1: Sentiment Difference: Difference between the number of positive and negative words, relative to speech length:
    \begin{align}
        \frac{|tokens_i\cap \text{positive words}| - |tokens_i\cap \text{negative words}|}{N_{tokens_i}}
    \end{align}
- Approach 2: Sentiment Ratio: Log ratio between the number of positive and negative words:
    \begin{align}
        \log_{10}\frac{|tokens_i\cap \text{positive words}| + 0.5}{|tokens_i\cap \text{negative words}| + 0.5}
    \end{align}
- Approach 3: Sentiment Weights: Weighted average of positive and negative words


For *Approach 1* and *Approach 2*, we rely on the intersection of words, which is basically the same as the summation notation we saw on the slides. Here $|\cdot|$ denotes the number of elements in the intersection and $N_{tokens_i}$ the number of tokens in speech $i$. For *Approach 3*, we rely on the Python module `Sentida` (https://github.com/Guscode/Sentida), which computes a weighted average.

The intersection of two lists is:
    \begin{align}
        a &= [1,2,3] \\
        b &= [1,4,3] \\
        a\cap b &= [1,3]
    \end{align}
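
One possible way to write such a function is sketched below. It keeps the order of the first list and also keeps repeated elements of the first list; whether repeated words should count more than once is a design choice, and this behaviour is an assumption rather than the official solution.

```python
def intersection(a, b):
    """Return the elements of `a` that also occur in `b`, in the order of `a`."""
    b_set = set(b)  # set lookup keeps this fast for long word lists
    return [x for x in a if x in b_set]

a = [1, 2, 3]
b = [1, 4, 3]
result = intersection(a, b)
print(result)
```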

1. Define a function that computes the intersection of two lists. If you can't figure out a solution, check the solutions.
2. Compute the intersection between words in each speech and positive and negative words respectively. Assign the results to two lists: `positivity_words` and `negativity_words`.
3. Compute the sentiment scores for the three different approaches and assign to objects called:
    - `sentiment_diff`
    - `sentiment_ratio`
    - `sentiment_weight`
4. Compute the highest and lowest sentiment score for each of the three approaches and their resulting indices (*Hints*: np.max/min vs. np.argmax/argmin)
5. Compute the pairwise correlation between the three approaches and visualize them. I provide you with code for this step.
6. Interpret the results. Which approaches correlate the most? Does it make sense? How can we interpret the scores? Are they absolute or relative?
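
Approaches 1 and 2 can be sketched on toy data as below. The word lists and documents are made up, the helper repeats a simple list intersection so the block runs on its own, and Approach 3 is left out because it depends on the external `Sentida` package. Empty documents would cause a division by zero in Approach 1, so the sketch assumes every document has at least one token.

```python
import math

def intersection(a, b):
    """Elements of `a` that also occur in `b` (repeats in `a` are kept)."""
    b_set = set(b)
    return [x for x in a if x in b_set]

# Toy inputs (made-up); in the exercise these come from the LSD lists and `docs`
positive_words = ["god", "fantastisk"]
negative_words = ["dårlig", "problem"]
tokens = [["det", "er", "en", "god", "dag"],
          ["et", "dårlig", "forslag", "og", "et", "problem"]]

positivity_words = [intersection(doc, positive_words) for doc in tokens]
negativity_words = [intersection(doc, negative_words) for doc in tokens]

# Approach 1: (positive count - negative count) / number of tokens
sentiment_diff = [(len(p) - len(n)) / len(doc)
                  for p, n, doc in zip(positivity_words, negativity_words, tokens)]

# Approach 2: log10 of the smoothed positive/negative ratio
sentiment_ratio = [math.log10((len(p) + 0.5) / (len(n) + 0.5))
                   for p, n in zip(positivity_words, negativity_words)]

# Index of the most positive document under approach 1
most_positive = max(range(len(sentiment_diff)), key=sentiment_diff.__getitem__)

print(sentiment_diff)
print(sentiment_ratio)
print(most_positive)
```

The $+0.5$ in the ratio keeps the score defined when a speech contains no positive or no negative words.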

#### Solution 2.1

In [196]:
# Solution to step 5

# Import module to plot heatmap as image
import plotly.express as px

# Convert the sentiment scores to a dataframe 
sent_df = pd.DataFrame({'sent_diff': sentiment_diff,
                  'sent_ratio': sentiment_ratio,
                  'sent_weight': sentiment_weight})

# Compute the pairwise correlation using pearson's r
corr_df = sent_df.corr(method="pearson")

# Generate visualization
corr_heatmap = px.imshow(corr_df.to_numpy(), 
          x=list(corr_df.index),
          y=list(corr_df.index),
          labels=dict(color="Similarity Score"),
          color_continuous_scale='GnBu')

In [268]:
# Print heatmap
corr_heatmap