# Final Project Notebook

DS 5001 Exploratory Text Analytics | Spring 2024

# Metadata

- Full Name: John Gallagher
- Userid: jjg5fg
- GitHub Repo URL: https://github.com/jjg5fg/Final_Project_ETA
- UVA Box URL: https://virginia.box.com/s/mm98m5lh0iadess09hiyr7s5m3vd3fk0

# Overview

The goal of the final project is for you to create a **digital analytical edition** of a corpus using the tools, practices, and perspectives you’ve learned in this course. You will select a corpus that has already been digitized and transcribed, parse that into an F-compliant set of tables, and then generate and visualize the results of a series of fitted models. You will also draw some tentative conclusions regarding the linguistic, cultural, psychological, or historical features represented by your corpus. The point of the exercise is to have you work with a corpus through the entire pipeline from ingestion to interpretation.

Specifically, you will acquire a collection of long-form texts and perform the following operations:

- **Convert** the collection from their source formats (F0) into a set of tables that conform to the Standard Text Analytic Data Model (F2).
- **Annotate** these tables with statistical and linguistic features using NLP libraries such as NLTK (F3).
- **Produce** a vector representation of the corpus to generate TFIDF values to add to the TOKEN (aka CORPUS) and VOCAB tables (F4).
- **Model** the annotated and vectorized model with tables and features derived from the application of unsupervised methods, including PCA, LDA, and word2vec (F5).
- **Explore** your results using statistical and visual methods.
- **Present** conclusions about patterns observed in the corpus by means of these operations.

When you are finished, you will make the results of your work available in GitHub (for code) and UVA Box (for data). You will submit to Gradescope (via Canvas) a PDF version of a Jupyter notebook that contains the information listed below.

# Some Details

- Please fill out your answers in each task below by editing the markdown cell. 
- Replace text that asks you to insert something with the thing, i.e. replace `(INSERT IMAGE HERE)` with an image element, e.g. `![](image.png)`.
- For URLs, just paste the raw URL directly into the text area. Don't worry about providing link labels using `[label](link)`.
- Please do not alter the structure of the document or cell, i.e. the bulleted lists. 
- You may add explanatory paragraphs below the bulleted lists.
- Please name your tables as they are named in each task below.
- Tasks are indicated by headers with point values in parentheses.

# Raw Data

## Source Description (1)

Provide a brief description of your source material, including its provenance and content. Tell us where you found it and what kind of content it contains.

The source material is a collection of US presidential speeches delivered up to September 25, 2019. The data comes from [Kaggle](https://www.kaggle.com/datasets/littleotter/united-states-presidential-speeches) and was collected from the UVA Miller Center. The dataset contains a corpus of all speeches as well as subsets broken up by US presidential era. This project uses the corpus of all speeches.

## Source Features (1)

Add values for the following items. (Do this for all following bulleted lists.)

- Source URL: https://www.kaggle.com/datasets/littleotter/united-states-presidential-speeches
- UVA Box URL: https://virginia.app.box.com/folder/256969523790
- Number of raw documents: roughly 1000 speeches
- Total size of raw documents (e.g. in MB): 22 MB
- File format(s), e.g. XML, plaintext, etc.: csv

## Source Document Structure (1)

Provide a brief description of the internal structure of each document. That is, describe the typical elements found in a document and their relation to each other. For example, a corpus of letters might be described as having a date, an addressee, a salutation, a set of content paragraphs, and a closing. If there are various structures, state that.

The data is a CSV file with the following columns: 'date', 'president', 'party', 'speech title', 'summary', 'transcript', and 'url'. The main column of interest is 'transcript', which holds the text of the speech. Speeches vary in length but follow a similar format of addressing a group. The other columns are placed in the LIB table and used for the riffs at the end of the notebook.

# Parsed and Annotated Data

Parse the raw data into the three core tables of your edition: the `LIB`, `CORPUS`, and `VOCAB` tables.

These tables will be stored as CSV files with header rows.

You may consider using `|` as a delimiter.

Provide the following information for each.

## LIB (2)

The source documents the corpus comprises. These may be books, plays, newspaper articles, abstracts, blog posts, etc. 

Note that these are *not* documents in the sense used to describe a bag-of-words representation of a text, e.g. chapter.

- UVA Box URL: https://virginia.box.com/s/no2s3onves8lwksx1t9zhwub6k317o3x
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma
- Number of observations: 991 speeches
- List of features, including at least three that may be used for model summarization (e.g. date, author, etc.): Date, President, Party
- Average length of each document in characters: 22523

## CORPUS (2)

The sequence of word tokens in the corpus, indexed by their location in the corpus and document structures.

- UVA Box URL: https://virginia.box.com/s/lhftw0mk3cgf0iex8kaw5l08wmv2jdqm
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma
- Number of observations (should be >= 500,000 and <= 2,000,000 observations): roughly 3.8 million observations (3838307)
- OHCO Structure (as delimited column names): speech, sentence, token
- Columns (as delimited column names, including `token_str`, `term_str`, `pos`, and `pos_group`): `speech_id`, `sent_num`, `token_num`, `pos_tuple`, `pos`, `token_str`, `term_str`, `pos_group`

## VOCAB (2)

The unique word types (terms) in the corpus.

- UVA Box URL: https://virginia.box.com/s/ojrj9ha6nbr084se2b7enog2h75p47cn
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma
- Number of observations: 38355
- Columns (as delimited names, including `n`, `p`, `i`, `dfidf`, `porter_stem`, `max_pos` and `max_pos_group`, `stop`): `n`, `n_chars`, `p`, `i`, `max_pos`, `n_pos`, `cat_pos`, `stop`, `stem_porter`, `stem_snowball`, `stem_lancaster`, `dfidf`, `max_pos_group`
- Note: Your VOCAB may contain ngrams. If so, add a feature for `ngram_length`.
- List the top 20 significant words in the corpus by DFIDF.

| term_str     | dfidf       |
|--------------|-------------|
| nearly       | 525.960837  |
| away         | 525.960837  |
| used         | 525.960837  |
| though       | 525.960565  |
| terms        | 525.960565  |
| lead         | 525.957156  |
| party        | 525.956330  |
| seek         | 525.949533  |
| held         | 525.949533  |
| self         | 525.949533  |
| independence | 525.949533  |
| including    | 525.948120  |
| experience   | 525.948120  |
| feel         | 525.948120  |
| forward      | 525.948120  |
| going        | 525.948120  |
| fair         | 525.948120  |
| regard       | 525.948120  |
| sure         | 525.937979  |
| longer       | 525.937979  |
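
The near-identical values in this table follow from the shape of the DFIDF function. As a rough sketch, assuming DFIDF is defined as $DF_t \times \log_2(N / DF_t)$ (the table itself does not state the formula) and taking $N = 991$ speeches as reported for LIB, the score peaks when a term appears in about $N/e$ documents. Under that assumption, the top-ranked words are those found in a little over a third of all speeches. The function name `dfidf` is hypothetical.

```python
import math

def dfidf(df, n_docs):
    """DFIDF under the assumed definition: df * log2(n_docs / df)."""
    return df * math.log2(n_docs / df)

N_DOCS = 991  # number of speeches reported for the LIB table

# Scan every possible document frequency and keep the one with the highest score
best_df = max(range(1, N_DOCS + 1), key=lambda df: dfidf(df, N_DOCS))

print("N / e           :", round(N_DOCS / math.e, 2))
print("best df         :", best_df)
print("DFIDF at best df:", round(dfidf(best_df, N_DOCS), 3))
```

Because the curve is nearly flat around its maximum, many mid-frequency words receive almost the same score, which would explain the repeated values in the table.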


# Derived Tables

## BOW (3)

A bag-of-words representation of the CORPUS.

- UVA Box URL: https://virginia.box.com/s/sbh8qj940g3g4rmv4mdrmlq621rfbjnf
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma
- Bag (expressed in terms of OHCO levels): speeches
- Number of observations: 941286
- Columns (as delimited names, including `n`, `tfidf`): `speech_id`, `term_str`, `n`, `tdidf`

## DTM (3)

A representation of the BOW as a sparse count matrix.

- UVA Box URL: https://virginia.box.com/s/2pee6qf7a1x4ytii02isldg0vd091ofs
- UVA Box URL of BOW used to generate (if applicable): https://virginia.box.com/s/sbh8qj940g3g4rmv4mdrmlq621rfbjnf
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma
- Bag (expressed in terms of OHCO levels): speeches

## TFIDF (3)

A Document-Term matrix with TFIDF values.

- UVA Box URL: https://virginia.box.com/s/tz82rm6t4kkomvtpu5hkttnyw2qjdou7
- UVA Box URL of DTM or BOW used to create: https://virginia.box.com/s/2pee6qf7a1x4ytii02isldg0vd091ofs or https://virginia.box.com/s/sbh8qj940g3g4rmv4mdrmlq621rfbjnf
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma
- Description of TFIDF formula ($\LaTeX$ OK):

$TF_{\text{max}}(t, d) = \frac{f_{t,d}}{\max_{t'}(f_{t',d})}$

where $f_{t,d}$ is the frequency of term $t$ in document $d$, and $\max_{t'}(f_{t',d})$ is the maximum frequency of any term in document $d$.

$IDF_{\text{standard}}(t) = \log_2\left(\frac{N}{DF_t}\right)$

where $N$ is the total number of documents, and $DF_t$ is the number of documents that contain the term $t$.

$TFIDF_{t, d} = TF_{\text{max}}(t, d) \times IDF_{\text{standard}}(t)$


## Reduced and Normalized TFIDF_L2 (3)

A Document-Term matrix with L2 normalized TFIDF values.

- UVA Box URL:  https://virginia.box.com/s/1bkm283d6ven6u2m1tgiyuqucqafj48s
- UVA Box URL of source TFIDF table: https://virginia.box.com/s/tz82rm6t4kkomvtpu5hkttnyw2qjdou7
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma
- Number of features (i.e. significant words): 5000 words
- Principle of significant word selection: terms tagged NN, VB, or JJ, excluding NNP
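
As a sketch of what L2 normalization does to each row of the TFIDF table, the snippet below rescales one document's term weights so that the row has unit Euclidean length. The function name `l2_normalize` and the sample row are hypothetical; the notebook itself may rely on a library routine for this step.

```python
import math

def l2_normalize(row):
    """Scale a {term: weight} row so that its Euclidean (L2) norm is 1."""
    norm = math.sqrt(sum(value * value for value in row.values()))
    if norm == 0:
        return dict(row)  # leave an all-zero row unchanged
    return {term: value / norm for term, value in row.items()}

# Hypothetical TFIDF row for one speech
row = {"treaty": 3.0, "revenue": 4.0, "peace": 0.0}
normalized = l2_normalize(row)
print(normalized)
```

After this step, long and short speeches are comparable, because each document vector has the same length regardless of how many words the speech contains.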

# Models

## PCA Components (4)

- UVA Box URL: https://virginia.box.com/s/ty31x7g52s8g137mw7d7zd0aegs4bjs3
- UVA Box URL of the source TFIDF_L2 table: https://virginia.box.com/s/1bkm283d6ven6u2m1tgiyuqucqafj48s
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma
- Number of components: 10
- Library used to generate: Sklearn
- Top 5 positive terms for first component: ['treaty', 'subject', 'revenue', 'commerce', 'public']
- Top 5 negative terms for second component: ['proclamation', 'aforesaid', 'whereof', 'seal', 'thereof']

## PCA DCM (4)

The document-component matrix generated.

- UVA Box URL: https://virginia.box.com/s/1utgwmtk2bsbo93v17pm6aubzqf3ukv3
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma

## PCA Loadings (4)

The component-term matrix generated.

- UVA Box URL: https://virginia.box.com/s/39x643w2staybdotf12yhml44ghhfu0f
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma
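
To illustrate how the document-component matrix (DCM) and the loadings relate to each other, here is a minimal sketch that uses NumPy's SVD on a small random matrix. The notebook reports using sklearn; this example swaps in plain NumPy with the same center-then-decompose approach, so component signs may differ from sklearn's output. The matrix `X` is a made-up stand-in for the TFIDF_L2 table, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))        # toy stand-in: 6 documents x 4 terms
n_components = 2

# Center each term column, then decompose
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

LOADINGS = Vt[:n_components].T          # terms x components
DCM = X_centered @ LOADINGS             # documents x components
explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)

print("LOADINGS shape:", LOADINGS.shape)
print("DCM shape:", DCM.shape)
print("explained variance:", explained_variance)
```

Each row of `DCM` gives a document's coordinates in component space, which is what the document scatterplots display, while each row of `LOADINGS` gives a term's contribution to the components.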

## PCA Visualization 1 (4)

Include a scatterplot of documents in the space created by the first two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

![PCA of Components 0 and 1](pca_0_1.png)


![PCA Loadings of Components 0 and 1](loadings_0_1.png)

Briefly describe the nature of the polarity you see in the first component:

PCA: PC0 separates the political parties. Republican speeches have positive PC0 values, while older parties have more negative PC0 values, indicating that they differ from present-day parties. Republicans also have a wider range in the box plot, showing more variability than the other parties.

Loadings: Most terms cluster around the origin, which indicates that most speeches have similar patterns of word usage.

## PCA Visualization 2 (4)

Include a scatterplot of documents in the space created by the second two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

![PCA 1 and 2](pca_1_2.png)

![PCA Loadings 1 and 2](loadings_1_2.png)

Briefly describe the nature of the polarity you see in the second component:

PCA: The scatter plot shows a clear distribution of speeches along the PC1 axis by political party. As with PC0, Republicans have positive PC1 values and older parties cluster among negative values. However, the distinction is not as obvious as in the first component.

Loadings: As in the first plot, the loadings are centered on the origin. However, there are more outliers, suggesting that these terms capture unique features that differentiate some speeches from the rest.

## LDA TOPIC (4)

- UVA Box URL: https://virginia.box.com/s/6ohfxeiw8u4o0cv07q4p75yoe5xag8ga
- UVA Box URL of count matrix used to create: https://virginia.box.com/s/i4oj9gmokqat2tipl6urze7bng0j3kdh
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma
- Library used to compute: sklearn
- A description of any filtering, e.g. POS (Nouns and Verbs only): Same as earlier, NN, VB, JJ and not NNP
- Number of components: 40
- Any other parameters used:
  - `ngram_range = (1, 2)`
  - `n_terms = 4000`
  - `n_topics = 40`
  - `max_iter = 20`
  - `n_top_terms = 9`
- Top 5 words and best-guess labels for the top five topics by mean document weight:
  - T06: years life home efforts children danger (weight: 4478.998558; label: Post-tragedy speech)
  - T16: country peace security prosperity energy (weight: 4337.812354; label: State of the Union)
  - T04: year way increase expenditures revenue (weight: 4225.641413; label: Finance related)
  - T10: citizens state territory claims convention (weight: 4126.661523; label: Convention talk)
  - T39: tax capital banks taxes rates (weight: 4010.247723; label: Tax bills)

## LDA THETA (4)

- UVA Box URL: https://virginia.box.com/s/w2n6g33m24tba7j74vop47h147bpw37x
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma

## LDA PHI (4)

- UVA Box URL: https://virginia.box.com/s/bz52bzt6r6u1g7jfkus864tasorczwkh
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma

## LDA + PCA Visualization (4)

Apply PCA to the PHI table and plot the topics in the space opened by the first two components.

Size the points based on the mean document weight of each topic (using the THETA table).

Color the points based on a metadata feature from the LIB table.

Provide a brief interpretation of what you see.

![PHI PCA ](plot_lda_pca.png)

Each topic is colored by the president, taken from the LIB table, with the highest score for that topic.

Topics associated with famous presidents such as Teddy Roosevelt and John F. Kennedy stand alone. Topics associated with many of the "founding fathers", such as James Madison and Thomas Jefferson, are lumped together.

## Sentiment VOCAB_SENT (4)

Sentiment values associated with a subset of the VOCAB from a curated sentiment lexicon.

- UVA Box URL: https://virginia.box.com/s/2axozbyz3gnc6py59s4ogswxjq03himc
- UVA Box URL for source lexicon: https://virginia.box.com/s/j8jxbzq61apqnqr9qo3xim5krvfrutg5
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma

## Sentiment BOW_SENT (4)

Sentiment values from VOCAB_SENT mapped onto BOW.

- UVA Box URL: https://virginia.box.com/s/9cop60xkz96rti1jztjbi455t61ecapa
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma

## Sentiment DOC_SENT (4)

Computed sentiment per bag computed from BOW_SENT.

- UVA Box URL: https://virginia.box.com/s/cthmexq7d7kh9mthzw3qh44w05kkbjec
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma
- Document bag expressed in terms of OHCO levels: speech_level 
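
The three sentiment tables build on one another, and the chain can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's code: the tiny lexicon, the emotion names, and the choice of a count-weighted mean per speech are all hypothetical, and the actual tables may weight terms differently (for example by TFIDF).

```python
from collections import defaultdict

# Hypothetical VOCAB_SENT: term -> {emotion: lexicon value}
VOCAB_SENT = {
    "war": {"fear": 1, "trust": 0},
    "peace": {"fear": 0, "trust": 1},
}

# Hypothetical BOW rows: (speech_id, term_str, n)
BOW = [
    ("s1", "war", 3),
    ("s1", "peace", 1),
    ("s1", "revenue", 5),   # not in the lexicon, so it is ignored
    ("s2", "peace", 2),
]

def doc_sentiment(bow, vocab_sent):
    """Count-weighted mean of each emotion per speech (BOW_SENT -> DOC_SENT)."""
    totals = defaultdict(lambda: defaultdict(float))
    weights = defaultdict(float)
    for speech_id, term, n in bow:
        if term not in vocab_sent:
            continue
        weights[speech_id] += n
        for emotion, value in vocab_sent[term].items():
            totals[speech_id][emotion] += n * value
    return {
        speech_id: {emotion: total / weights[speech_id] for emotion, total in emotions.items()}
        for speech_id, emotions in totals.items()
    }

DOC_SENT = doc_sentiment(BOW, VOCAB_SENT)
print(DOC_SENT)
```

Only terms that appear in the lexicon contribute, so a speech's score reflects the balance among its sentiment-bearing words rather than its overall length.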

## Sentiment Plot (4)

Plot sentiment over some metric space, such as time.

If you don't have a metric metadata feature, plot sentiment over a feature of your choice.

You may use a bar chart or a line graph.

![decades](Sent_decade.png)

## VOCAB_W2V (4)

A table of word2vec features associated with terms in the VOCAB table.

- UVA Box URL: https://virginia.box.com/s/0twrvum5a2srsmh5c85ai36ljfiy0mm3
- GitHub URL for notebook used to create: https://github.com/jjg5fg/Final_Project_ETA
- Delimiter: comma
- Document bag expressed in terms of OHCO levels: speech_id
- Number of features generated: 246 (vector size)
- The library used to generate the embeddings: gensim (`from gensim.models import word2vec`)

## Word2vec tSNE Plot (4)

Plot word embedding features in two dimensions using t-SNE.

Describe a cluster in the plot that captures your attention.

![TSNE_plot](tsne_plot.png)


![TSNE_cluster_plot](tsne_zoomed_in.png)


The cluster that caught my attention is the one around the big orange dot for "beloved". The words around it, such as "dedication", "belief", and "founders", make logical sense together.

# Riffs

Provide at least three visualizations that combine the preceding model data in interesting ways.

These should provide insight into how features in the LIB table are related. 

The nature of this relationship is left open to you -- it may be correlation, or mutual information, or something less well defined. 

In doing so, consider the following visualization types:

- Hierarchical cluster diagrams
- Heatmaps
- Scatter plots
- KDE plots
- Dispersion plots
- t-SNE plots
- etc.

## Riff 1 (5)

![RIFF_1](RIFF_1.png)

The first riff plots key US talking points per decade: economy, education, freedom, health, and war. War dominates the corpus until the 2000s, with spikes at predictable times such as the Civil War, the World Wars, and the Cold War. Freedom peaks in the post-World War period and again during the Cold War. Economy is rarely discussed until the post-World War period and then grows to become the largest topic of the 2000s. Health spikes in the 1980s, coinciding with the AIDS epidemic.
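
A per-decade count like the one behind this plot could be computed along the following lines. This is a hedged sketch: the input format of `(year, term_str)` pairs, the function names, and the decision to normalize by the total number of tokens per decade are assumptions, and the plot above may use raw counts instead.

```python
from collections import Counter, defaultdict

KEY_TERMS = {"economy", "education", "freedom", "health", "war"}

def decade_of(year):
    """Map a year to the start of its decade, e.g. 1863 -> 1860."""
    return (year // 10) * 10

def term_share_by_decade(tokens):
    """Share of all tokens in each decade taken up by each key term.

    tokens: iterable of (year, term_str) pairs.
    """
    totals = Counter()
    hits = defaultdict(Counter)
    for year, term in tokens:
        decade = decade_of(year)
        totals[decade] += 1
        if term in KEY_TERMS:
            hits[decade][term] += 1
    return {
        decade: {term: hits[decade][term] / totals[decade] for term in sorted(KEY_TERMS)}
        for decade in totals
    }

# Hypothetical toy tokens
tokens = [
    (1863, "war"), (1864, "war"), (1865, "union"), (1869, "freedom"),
    (2003, "economy"), (2008, "health"),
]
shares = term_share_by_decade(tokens)
print(shares)
```

Normalizing by decade totals keeps decades with many long speeches from dominating the chart simply because they contain more words.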

## Riff 2 (5)

![RIFF2](RIFF_2.png)

This plot is a hierarchical clustering of presidents based on average speech sentiment. Interesting pairings include Donald Trump and Ronald Reagan, which makes historical sense, as well as FDR and Lincoln, who both had to rally the nation during turbulent times. With a few exceptions, the clustering aligns closely with political party.

## Riff 3 (5)

![RIFF_3](RIFF_3.png)

The final riff is a t-SNE plot of speeches by topic. The plot shows that topics are closely tied to the years of a presidency: the darker colors, representing older presidents, cluster together, as do the lighter colors representing recent presidents. This makes sense, since presidential topics have changed over time but usually do not change much from one president to the next.

# Interpretation (4)

Describe something interesting about your corpus that you discovered during the process of completing this assignment.

At a minimum, use 250 words, but you may use more. You may also add images if you'd like.

The assignment was very interesting and rewarding, offering an excellent opportunity to apply what we learned in class to a novel dataset. I focused on U.S. presidential speeches, from George Washington to Donald Trump. The data was sourced from a CSV file on Kaggle, and I created tables to facilitate analysis.

The first interesting finding came from the PCA, which showed that Republicans have the most variation among their speeches of any party. This finding was somewhat surprising, especially as the idea has only recently gained attention with the emergence of the MAGA movement. The loadings plots were also informative: most terms cluster near the origin, consistent with speeches sharing a common set of words, while a smaller number of outlying terms mark where speeches diverge.

Looking at sentiment, it was interesting to see trust decline so sharply over time and to see most of the sentiments follow the same ebbs and flows. These movements aligned with historical periods, with fear jumping during the Depression and in the 2000s after 9/11. Disgust and surprise did not change much over time, which tracks, as those sentiments should not appear much in speeches by leaders.

The three riffs brought in some interesting insights. Plotting key issues over time showed that they matched what was going on in the country's history. The hierarchical clustering grouped presidents who held similar beliefs, such as Reagan and Trump, and presidents who faced similar challenges, such as FDR and Lincoln. Finally, the t-SNE plot of topics showed that topics change little from one president to the next but have changed a great deal over the course of history.

Overall, the project uncovered many intriguing insights from the speeches, and most of the findings aligned with the historical interpretations of U.S. presidents.
