# Lab 12: Analyzing Presidential Debates

The data for this project comes from the
[UC Santa Barbara Presidency Project](https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/presidential-campaigns-debates-and-endorsements-0).

The data was scraped using a custom Python program. It contains most presidential and primary debates from 1960
to 2020.

The base dataframe has five columns:

0. name : The name of the person speaking. There may be some inaccuracies due to inconsistent formats on the website.
1. text : The transcript of what the person said. 
2. debate_file : This corresponds to the individual transcript file (not included).
3. party : This denotes the type of debate. 'P' is for presidential, 'VP' is vice-presidential, 'D' is Democratic primaries,
           and 'R' is Republican primaries.
4. date : This is the date that the debate occurred. 

## Imports

In [None]:
import numpy as np
import pandas as pd
from nltk.sentiment import vader

## Load data
Read in the data file and replace missing values with empty strings. Descriptive rows are removed later, during preprocessing.


In [None]:
debates = pd.read_csv('data/all_utterances.csv', index_col=0)
debates.reset_index(inplace=True, drop=True)

# replace missing values with empty strings
debates = debates.fillna('')

## Basic data information
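First, let's inspect the structure of the dataframe: its columns, data types, and non-null counts. Later, we will see who spoke the most; that count includes moderators and candidates, and is not normalized by type of debate (primary or presidential).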

In [None]:
debates.info()

## Data Preprocessing

### Fix dates and get debate year

In [None]:
debates['year'] = pd.DatetimeIndex(debates['date']).year
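
If some `date` values are malformed, strict parsing may raise an error. A minimal sketch of a more forgiving variant, assuming a toy column of date strings (the values below are made up for illustration): `pd.to_datetime` with `errors='coerce'` turns unparseable entries into `NaT`, so the year can still be extracted for the remaining rows.

```python
import pandas as pd

# Toy data; these values are illustrative, not taken from the dataset.
toy = pd.DataFrame({'date': ['1960-09-26', '2020-10-22', 'not a date']})

# errors='coerce' converts unparseable strings to NaT instead of raising.
parsed = pd.to_datetime(toy['date'], errors='coerce')

# .dt.year is a float column when NaT is present; the nullable Int64
# dtype keeps whole-number years and marks the missing one as <NA>.
toy['year'] = parsed.dt.year.astype('Int64')
print(toy)
```

Rows with a missing year can then be inspected with `toy[toy['year'].isna()]` before deciding whether to fix or drop them.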

### Inspect the `name` column

In [None]:
# distinct speakers
print(debates['name'].unique())

In [None]:

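# drop header rows that list participants and moderators rather than utterances
exclude_list = ['PARTICIPANTS', 'MODERATORS']
debates = debates[~debates['name'].isin(exclude_list)]

# renumber the rows so the index has no gaps after filtering
debates = debates.reset_index(drop=True)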
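
Because the website's formats are inconsistent, the `name` column may list the same speaker under slightly different spellings. One possible way to reduce such duplicates is sketched below. The `normalize_name` helper and the sample names are hypothetical, and the right rules depend on what `debates['name'].unique()` actually shows.

```python
import pandas as pd

def normalize_name(name: str) -> str:
    """Collapse whitespace, drop a trailing colon, and upper-case a speaker label."""
    cleaned = ' '.join(name.split())       # collapse runs of whitespace
    cleaned = cleaned.rstrip(':').strip()  # remove a trailing colon, if any
    return cleaned.upper()

# Made-up variants of one speaker label, for illustration only.
names = pd.Series(['Kennedy', 'KENNEDY:', '  Kennedy  '])
print(names.apply(normalize_name).unique())
```

In the lab itself this would be applied as `debates['name'].apply(normalize_name)`, after checking that the rules do not merge distinct speakers.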


## Speaker-level measures

In [None]:
debate_counts = debates.groupby(['name']).count().sort_values(by='date', ascending=False)
debate_counts['date'].head(15).plot.bar()

# Sentiment analysis
Now, let's do some examples of sentiment analysis using the VADER algorithm.
The VADER algorithm computes positive, negative, and neutral sentiment, and it produces
a compound score that gives the overall polarity for each turn at talk.

We will look at a few rows of data and see what we get.

Compound scores range from -1 to 1. Scores greater than 0 are considered positive, and scores less than 0 are considered negative.
The further the compound score is from 0, the more extreme the sentiment.
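
The rule above can be written as a small helper. This is only a sketch: the name `label_compound` is an assumption made for illustration, and the scores in the loop are made up rather than computed by VADER.

```python
def label_compound(score: float) -> str:
    """Map a compound score to a coarse label using the sign rule above."""
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'

# Illustrative scores only.
for score in (0.8, -0.3, 0.0):
    print(score, label_compound(score))
```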

In [None]:
vader_analyzer = vader.SentimentIntensityAnalyzer()

example1 = debates.loc[100, 'text']
print(example1)
vader_analyzer.polarity_scores(example1)

In [None]:
example2 = debates.loc[1000, 'text']
print(example2)
vader_analyzer.polarity_scores(example2)

In [None]:
example3 = debates.loc[10000, 'text']
print(example3)
vader_analyzer.polarity_scores(example3)

# Compute sentiment for all utterances
Now, we will apply sentiment analysis to all of the turns-at-talk in the data.

We will use the `.apply()` method in pandas. Called on a single column, it runs the function on
every value in that column and returns the results as a new Series.
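
A minimal sketch of this pattern on made-up data: the `word_stats` helper is hypothetical and stands in for `polarity_scores`, since both take one string and return a dict.

```python
import pandas as pd

def word_stats(text: str) -> dict:
    """Return a small dict of measures for one utterance (hypothetical helper)."""
    words = text.split()
    return {'n_words': len(words), 'n_chars': len(text)}

# Two made-up utterances.
toy_text = pd.Series(['Thank you.', 'I disagree with that.'])

# apply() calls word_stats once per value and collects the dicts in a Series.
toy_results = toy_text.apply(word_stats)
print(toy_results.tolist())
```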

In [None]:
vader_analyzer = vader.SentimentIntensityAnalyzer()
results = debates['text'].apply(vader_analyzer.polarity_scores)

Now, we will format the results and add them to our original dataframe.
We will use the `pd.concat()` function, which joins dataframes. `axis=1`
joins them by columns rather than rows, matching rows by index label.

See
[this Stack Overflow question](https://stackoverflow.com/questions/29681906/python-pandas-dataframe-from-series-of-dict) for details.

In [None]:
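# build one column per score; reuse the index of `debates` so the rows
# still line up even though some rows were dropped earlier
results_df = pd.DataFrame(list(results), index=debates.index)

# add the new columns
debates = pd.concat([debates, results_df], axis=1)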
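
`pd.concat(..., axis=1)` matches rows by index label, not by position. A minimal sketch on made-up data shows why the index matters when a dataframe built from a list of dicts is joined back to a filtered dataframe:

```python
import pandas as pd

# Toy frame whose index has gaps, as happens after filtering rows out.
toy = pd.DataFrame({'text': ['a', 'b', 'c']}, index=[0, 2, 5])
scores = toy['text'].apply(lambda t: {'compound': 0.1 * len(t)})

# list(scores) discards the index, so pass it back in explicitly.
scores_df = pd.DataFrame(list(scores), index=toy.index)

# With matching indexes, every row receives its own score.
combined = pd.concat([toy, scores_df], axis=1)
print(combined)
```

Without `index=toy.index`, the new frame would be labeled 0, 1, 2, and the join would leave missing values wherever the labels differ.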

# Summarize sentiment data
Now, let's compute some descriptive information for sentiment scores.

First, check the distribution of sentiment scores.

In [None]:
debates['compound'].hist()

Next, let's see the mean compound sentiment score for the speakers who spoke the most.

In [None]:
# aggregate only the compound score; taking the mean of text columns fails in recent pandas
speaker_summary = debates.groupby('name')['compound'].agg(['mean', 'count'])

display(speaker_summary.sort_values(by='count', ascending=False).head(15))

# Topic Modeling

## Preprocessing

In [None]:
from gensim.parsing.preprocessing import preprocess_string
from gensim import corpora

debates['clean_text'] = debates['text'].apply(preprocess_string)
print(debates.loc[10, ['text', 'clean_text']])

In [None]:
dictionary = corpora.Dictionary(debates['clean_text'])
print(dictionary)

In [None]:
bow_corpus = [dictionary.doc2bow(text) for text in debates['clean_text']]
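
A bag-of-words corpus stores each document as a list of `(token_id, count)` pairs. The idea can be sketched in plain Python as follows. The helpers `build_vocab` and `to_bow` are hypothetical and are not gensim's implementation, so the ids they assign may differ from those in `dictionary`.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Made-up tokenized documents.
toy_docs = [['tax', 'cut', 'tax'], ['health', 'care', 'tax']]
toy_vocab = build_vocab(toy_docs)
print([to_bow(doc, toy_vocab) for doc in toy_docs])
```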

## Fitting an LDA model

Next, we will fit the model. One important consideration with LDA is that you must
choose the number of topics in advance. The number of topics is not restricted, but
with too few topics they will be too general to interpret, and with too many topics
there may be considerable overlap between them. Later, we will see how to measure fit for different
topic counts. This dataset is small, so model fitting is fast. Larger datasets could
take minutes, hours, or days.

For this first example, let's try 25 topics.

In [None]:
from gensim import models

lda_25 = models.LdaModel(bow_corpus, num_topics=25, id2word=dictionary)

In [None]:
for topic in lda_25.show_topics(num_topics=20):
    print("Topic", topic[0], ":", topic[1])

In [None]:
for doc in bow_corpus[0:9]:
    print(lda_25.get_document_topics(doc))
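
Each document's result is a list of `(topic_id, probability)` pairs. A minimal sketch for picking the most likely topic from such a list, using a hypothetical `dominant_topic` helper and made-up probabilities in the same shape:

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Made-up values in the same shape as the output of get_document_topics().
example_topics = [(3, 0.12), (7, 0.61), (19, 0.27)]
print(dominant_topic(example_topics))
```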

## Evaluation
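To evaluate topic model fit, we can use perplexity or coherence. gensim's `log_perplexity()` returns a per-word
log-likelihood bound, which is negative; values closer to 0 indicate a better fit. The same direction holds for the
`u_mass` coherence measure, whereas other coherence measures, such as `c_v`, improve as they increase.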

In [None]:
print('Per-word log-likelihood bound: ', lda_25.log_perplexity(bow_corpus))
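
The value returned by `log_perplexity()` is a per-word bound rather than perplexity itself. gensim reports its perplexity estimate as 2 raised to the negative of that bound, which the hypothetical helper below reproduces. The bounds in the loop are illustrative values, not results from this dataset.

```python
def perplexity_from_bound(per_word_bound: float) -> float:
    """Convert a per-word log-likelihood bound (base 2) to a perplexity estimate."""
    return 2 ** (-per_word_bound)

# Illustrative bounds only: a bound closer to 0 gives a lower perplexity.
for bound in (-8.0, -7.5):
    print(bound, perplexity_from_bound(bound))
```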

In [None]:
# add topics to data

from gensim.matutils import corpus2csc
all_topics = lda_25.get_document_topics(bow_corpus, minimum_probability=0.0)

# corpus2csc returns a sparse (topics x documents) matrix; transpose it to documents x topics
all_topics_csc = corpus2csc(all_topics)
all_topics_numpy = all_topics_csc.T.toarray()

# reuse the index of `debates` so the rows line up when concatenating
all_topics_df = pd.DataFrame(all_topics_numpy, index=debates.index)

# make topic names easier to read
topic_names = ['Topic ' + str(x) for x in all_topics_df.columns]
all_topics_df.columns = topic_names

debates = pd.concat([debates, all_topics_df], axis=1)


# Exercises
1. Which candidate had the highest percentage of extreme positive turns-at-talk (compound score > 0.5)?
2. Try building a model with 50 topics. Is it easier to interpret than one with 25 topics?
3. How would you construct a classifier to predict a winner of an election?
4. With text analysis, the variables that you can construct are only limited by your imagination.
   Try creating a dictionary (a list of related words) and count the number of times each candidate
   uses these words. 