In [None]:
#Importing packages 
import pandas as pd
import numpy as np
import scipy 
from tqdm import tqdm  #This is for creating progress bars.
import json #To save files in json format
import os #To set working directory
from datetime import datetime #To check start and end time when running code
import pickle  #To store and open previously saved machine learning models 



#Packages for language processing
import re
from collections import defaultdict
import string
import nltk
from nltk.corpus import wordnet

#Packages for visualization
import seaborn as sns
import matplotlib.pyplot as plt


# Exercise: Scaling

As in earlier exercise labs, we will be working with a dataset of tweets from US Members of Congress. The version of the dataset used today includes the 3000 most recent tweets from each member, collected in 2021. Because the dataset is very large, we have created a subset containing only tweets from 2018 and later. 

Preprocessing such a large dataset takes a while to run (approximately 20 minutes). To save you time and effort, we have uploaded a preprocessed version of the dataset: 'MOC2021_Tweets_2018subset_preprocessed.csv'. 

The following steps were applied: 

1. Subsetting the dataset to include only tweets from 2018 and later
2. Removing duplicated tweets 
3. Removing unneeded columns (all except 'nominate_name', 'affiliation', 'role', 'nominate_score', and 'text')
4. Recoding independents as Democrats: the two independents ("SANDERS, Bernard" and "KING, Angus Stanley, Jr.") are set to "Democrat". 
5. Replacing "&" with "and"
6. Removing odd special characters ("┻","┃","━","┳","┓","┏","┛","┗")
7. Replacing characters that look like a space but are not one. For example, the comparison `"\u202F" == " "` evaluates to False. We replace "\u202F", "\u2069", "\u200d", and "\u2066" with an actual space.
8. Removing "RT" and "via"
9. Removing mentions (@someone)
10. Removing numbers, removing punctuation (except hyphens), removing separators, removing URLs, lowercasing, removing stopwords, and lemmatizing.
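
To make the character-level steps concrete, here is a minimal sketch of how the replacements and the removal of "RT", "via", mentions, and URLs could look with Python's `re` module. The helper name `clean_tweet` and the exact regular expressions are assumptions for illustration; they are not the code that produced the uploaded file, and stopword removal and lemmatization are left out.

```python
import re

# Characters named in the preprocessing list above.
BOX_CHARS = "┻┃━┳┓┏┛┗"
SPACE_LIKE = "\u202F\u2069\u200d\u2066"


def clean_tweet(text):
    """Hypothetical helper: a rough approximation of the cleaning steps."""
    text = text.replace("&", "and")                 # "&" becomes "and"
    text = re.sub(f"[{BOX_CHARS}]", "", text)       # drop box-drawing characters
    text = re.sub(f"[{SPACE_LIKE}]", " ", text)     # space look-alikes become real spaces
    text = re.sub(r"\b(RT|via)\b", "", text)        # drop "RT" and "via" as whole words
    text = re.sub(r"@\w+", "", text)                # drop mentions
    text = re.sub(r"https?://\S+", "", text)        # drop URLs
    text = re.sub(r"\s+", " ", text).strip()        # collapse leftover whitespace
    return text.lower()


example = "RT @someone Jobs & growth\u202Ffor all via @other https://example.com"
print(clean_tweet(example))
```

The order matters in this sketch: the space look-alikes are replaced before the final whitespace collapse, so words separated only by such a character are not glued together.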

In [None]:
#Setting up data paths

notebook_path = os.path.realpath(os.curdir)

path_to_moc_data = os.path.join(notebook_path, "MOC2021_Tweets_2018subset_preprocessed.csv.bz2")

path_to_corp = os.path.join(notebook_path,"Wordfish models/Wordfish models/MOCscalingresults.sav")
path_to_wf_scaler = os.path.join(notebook_path,"Wordfish models/Wordfish models/MOCscaler.sav")


### 1: Formatting the dataset to a shape accepted by the scaler

Import the dataset with preprocessed text 'MOC2021_Tweets_2018subset_preprocessed.csv'.

The dataset includes the name of each member of Congress (nominate_name), their affiliation (Democrat or Republican), their institutional role in Congress (House or Senate), their ideological score based on how they vote in Congress (nominate_score), and the text of each of their 3000 most recent tweets, reaching back to the beginning of 2018, in original and preprocessed format. 

1. The goal of our wordfish scaling today is to give each politician an ideological score based on their tweets. To prepare the data, we therefore need to aggregate all text per politician. In essence, transform the dataframe so that each row has one politician (rather than one tweet) and each text field includes all tweets from this politician in one long string. 

Hint for 1.1: When preprocessing, some tweet text was removed (e.g. if they were only URLs). To aggregate text, you may need to replace NaN values with an empty string. Next, the pandas functions `groupby` and `agg` can help you. These steps are the same as last week. 


2. The wordfish scaler accepts data in the shape of a list of tuples containing the document name and the document text. Essentially that means that you should create a list in the format: [(politician1, preprocessed_text1), (politician2, preprocessed_text2), (politician3, preprocessed_text3)].
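
As an illustration of the target shape, the sketch below runs the aggregation on a three-row toy dataframe. The toy values and the variable names `toy`, `per_politician`, and `documents` are made up for this example; only the column names follow the exercise text.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are invented for illustration).
toy = pd.DataFrame({
    "nominate_name": ["A, Alice", "A, Alice", "B, Bob"],
    "affiliation": ["Democrat", "Democrat", "Republican"],
    "text": ["health care", None, "tax cut"],
})

# Tweets that became empty during preprocessing show up as missing values.
toy["text"] = toy["text"].fillna("")

# One row per politician, with all tweets joined into one long string.
per_politician = (
    toy.groupby("nominate_name", as_index=False)
       .agg({"affiliation": "first", "text": " ".join})
)
per_politician["text"] = per_politician["text"].str.strip()

# List of (document name, document text) tuples.
documents = list(zip(per_politician["nominate_name"], per_politician["text"]))
print(documents)
```

Keeping `affiliation` in the aggregation (here with `"first"`) is a design choice that makes the later validation steps easier, since each politician has a single party label.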


In [45]:
#When importing the data, use pd.read_csv('filename', compression = 'bz2')


### 2: Scaling with Wordfish

The wordfish scaler we are using is from the implementation found here: https://github.com/umanlp/SemScale. Download the folder with the code from the GitHub link via Code --> Download ZIP. Alternatively, clone the repository with git. 

As this is not a (super) professional implementation, there is almost no documentation of how to use the code. Therefore, we have copied the essential parts of their `wordfish.py` code below and ask you to fill in the blanks with data from the list you have just created. 

You can check what the object `corp` contains with the following commands: 
- `corp.occurrences`: See the document-feature matrix 
- `corp.vocabulary`: See the full vocabulary across documents. Numbers indicate their index in the dfm. 
- `corp.results`: See the scaling results 

The wordfish scaler takes several hours to run on the full dataset. Select a **subset of perhaps 20 politicians** to check that your code works. 

#### Importing code 

Download the code from the github repository and store it somewhere on your computer. The following imports will be drawing on code in that folder. 

In [None]:
#Setting working directory to be where the SemScale code is stored 
os.chdir('./SemScale/')

#Scaling packages 
#from helpers import io_helper
from wfcode import corpus
from wfcode import scaler
import argparse


#### Creating a corpus object

In [None]:
#Select a subset of your list of politicians 

subset = 

In [None]:
#Setting parameters (keeping the default parameters as used in the github code)

niter = 5000      #number of iterations
lr = 0.00001      #learning rate
stopwords = None  #we've already removed stopwords


In [None]:
#Creating a corpus object

corp = corpus.Corpus() #input your data in the parentheses

In [None]:
#Preprocessing data
corp.tokenize(stopwords = stopwords)

#Building the document-feature matrix
corp.build_occurrences()

#### Investigating the corpus object

Check the shape of your document-feature matrix. Check the vocabulary of the corpus.


In [None]:
#your code here

#### Training the wordfish scaler

In [None]:
wf_scaler = scaler.WordfishScaler(corp)

print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " WordFish scaling begun.")

wf_scaler.initialize()
wf_scaler.train(learning_rate = lr, num_iters = niter)

print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " WordFish scaling completed.")

#### Results 

View the results of the scaler. 

In [None]:
#your code here

### 3: Validating and inspecting the results: part 1

Running the wordfish scaler on the full dataset takes several hours. Therefore, we've prepared two models for you pre-trained on the full dataset. Load the models called 'MOCscalingresults.sav' (the corp object) and 'MOCscaler.sav' (the wf_scaler object) from Absalon with the code below. **Beware this is very RAM intensive** as corp is not stored as a sparse matrix. 

To make sure we have meaningful results, check the alpha, psi, and beta values. 

1. Beta values can be accessed with `wf_scaler.beta_words`. Find the 10 words that are most predictive of the low end of the ideology scale and the 10 words that are most predictive of the high end of the ideology scale. Based on these words, can you guess which ideology (Democrat vs Republican) is categorized with low values and which with high values?
2. Alpha values can be accessed with `wf_scaler.alpha_docs`. Check the document length of the documents with the highest and lowest alpha. Do the results make sense? 
3. Psi values can be accessed with `wf_scaler.psi_words`. Check the frequency of the words with the highest and lowest psi values. Do the results make sense? 
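
For orientation, the standard Wordfish formulation models the count of word j in document i as Poisson-distributed with rate exp(alpha_i + psi_j + beta_j * theta_i). The sketch below computes these expected counts for made-up parameter values. The variable names are chosen for this example, with `theta` standing for the document positions; it does not use or reproduce the SemScale code.

```python
import numpy as np

# Made-up parameters for 2 documents and 3 words.
alpha = np.array([0.0, 1.0])        # document effects (longer documents get larger alpha)
theta = np.array([-1.0, 1.0])       # document positions on the latent scale
psi = np.array([2.0, 0.5, 0.5])     # word effects (frequent words get larger psi)
beta = np.array([0.0, -1.0, 1.0])   # word weights (how strongly a word separates positions)

# expected[i, j] = exp(alpha_i + psi_j + beta_j * theta_i)
expected = np.exp(alpha[:, None] + psi[None, :] + theta[:, None] * beta[None, :])
print(expected.round(2))
```

In this toy setup the first word has a beta of zero, so its expected count depends only on document length and word frequency, while the other two words are each more likely at one end of the scale.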

Hint: All the values are stored in numpy arrays. Numpy has functions for getting the original indices of sorted values, and the index placement of minimum and maximum values, `argsort()`, `argmin()`, and `argmax()`.
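
A small sketch of the `argsort()` pattern on toy data follows. It assumes the vocabulary is a dictionary mapping each word to its column index, as the description of `corp.vocabulary` suggests; the words and beta values are invented.

```python
import numpy as np

# Toy stand-ins for corp.vocabulary and wf_scaler.beta_words (invented values).
vocabulary = {"tax": 0, "health": 1, "border": 2, "climate": 3}
beta_words = np.array([0.8, -0.5, 1.2, -1.1])

# Invert the word -> index mapping so indices can be turned back into words.
index_to_word = {idx: word for word, idx in vocabulary.items()}

order = np.argsort(beta_words)   # indices from lowest to highest beta
n = 2
lowest = [index_to_word[i] for i in order[:n]]
highest = [index_to_word[i] for i in order[-n:][::-1]]

print("lowest beta:", lowest)
print("highest beta:", highest)
```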

Also: The corp document-feature matrix was initialized with a matrix of ones as a base. If you want exact counts, subtract 1 from every entry in the matrix.
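
The correction can be sketched on a toy matrix as follows. The array `occurrences` here is an invented stand-in for the document-feature matrix, with rows as documents and columns as words.

```python
import numpy as np

# Toy document-feature matrix that includes the base of ones.
occurrences = np.array([[3.0, 1.0, 2.0],
                        [1.0, 1.0, 5.0]])

counts = occurrences - 1            # exact word counts
doc_lengths = counts.sum(axis=1)    # total words per document
word_freqs = counts.sum(axis=0)     # total occurrences per word

print(doc_lengths, word_freqs)
```

Document lengths are the natural quantity to compare against alpha, and word frequencies against psi.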

In [None]:
#Loading the saved models 

#Context managers close the files once the objects are loaded
with open(path_to_corp, 'rb') as f:
    corp = pickle.load(f)
with open(path_to_wf_scaler, 'rb') as f:
    wf_scaler = pickle.load(f)


#### 3.1 Beta values 

In [None]:
#your code here

#### 3.2 Alpha values

In [None]:
#your code here

#### 3.3 Psi values

In [None]:
#your code here

### 4: Validating and inspecting the results: part 2

1. Create a new column in your dataset and input the scaling results. 
2. To validate the results, run the overall correlation between the scaling results and the provided nominate scores. Visualize the correlation in a scatterplot. How well do the scaling results correlate with the nominate scores?
3. Optional: Separate House and Senate. Run correlations within each institution as above and visualize as a scatterplot. 
4. Separate Democrats and Republicans. Run correlations within each party as above and visualize as a scatterplot. Can the scaling results help us determine ideological positions within each party?
5. Find the ideological score as computed by the wordfish scaler of specific politicians that you know, e.g. Ted Cruz and Bernie Sanders. Do the results make sense? Why might the results not be as we would have expected?
6. Based on nominate and scaling scores, respectively, who is the most extremist Republican and the most extremist Democrat? Who is the most left-wing Republican and the most right-wing Democrat? 
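
The overall and within-group correlations can be sketched with pandas as below. The scores are invented toy numbers, not results, and the column name `wordfish_score` is an assumption for whatever you call your new column.

```python
import pandas as pd

# Invented toy scores for six politicians.
toy = pd.DataFrame({
    "affiliation": ["Democrat"] * 3 + ["Republican"] * 3,
    "nominate_score": [-0.6, -0.4, -0.2, 0.3, 0.5, 0.7],
    "wordfish_score": [-1.0, -1.2, -0.8, 0.9, 1.1, 1.0],
})

# Pearson correlation across all politicians.
overall_r = toy["nominate_score"].corr(toy["wordfish_score"])

# Pearson correlation within each party.
within_party_r = {
    party: group["nominate_score"].corr(group["wordfish_score"])
    for party, group in toy.groupby("affiliation")
}

print(round(overall_r, 2), within_party_r)
```

In this toy example the overall correlation is much higher than the within-party correlations, which is a pattern worth checking for in the real data. For the plots, `sns.scatterplot(data=toy, x="nominate_score", y="wordfish_score", hue="affiliation")` is one option.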

Finally, as a reflection exercise, consider the results. What are the limitations of this analysis? Could this be used in a paper or would you need to implement other methods? How could you engineer features of the text to improve the results? 

#### 4.1 Merging scaling results with dataframe

In [None]:
#your code here

#### 4.2 Overall correlation between our scaling results and the provided nominate scores

In [None]:
#your code here

#### 4.3 Correlations within the Senate and House, respectively

In [None]:
#your code here

#### 4.4 Correlations within the two parties

In [None]:
#your code here

#### 4.5 Scores for specific politicians

In [None]:
#your code here

#### 4.6.1 Most extreme Republican and Democrat

In [None]:
#your code here

#### 4.6.2 Most left-wing Republican and most right-wing Democrat

In [None]:
#your code here

### Reflection
