# A7: State of the Union Addresses, 2001--2024

_(This assignment is inspired by [A3 in the Spring 2018 offering of CS 1110 at Cornell University](https://www.cs.cornell.edu/courses/cs1110/2018sp/assignments/assignment3/a3.pdf).)_

Many social scientists are interested in how political discourse evolves over time. 
We will narrow down this very broad question by focusing on State of the Union addresses given by United States presidents from 2001 to 2024 and asking how prominent some "topics" are within them.
To do this, we're going to use the [BERTopic](https://maartengr.github.io/BERTopic/index.html) library, a powerful tool for algorithmically computing topics in a given corpus of text.




## Reading Input
We will read all the text files and split each of them into paragraphs.
BERTopic takes a giant list of "documents" as input, and assigns one topic to each document.
For this reason, since we know that SOTU addresses in fact cover many topics, it is sensible to divide each SOTU speech into multiple BERTopic "documents".
For each paragraph we will also record which year's address it came from, since we will need this later.

Note: transcripts taken from [The American Presidency Project](https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union).

In [67]:
def read_file(p):
    with open(p, 'r') as f:
        return f.read()

whole_docs = []
whole_doc_years = []
paragraphs = []
paragraph_years = []
for i in range(1978, 2025):
    contents = read_file(f"sotu/{i}.txt")
    whole_docs.append(contents)
    whole_doc_years.append(i)
    for paragraph in contents.split("\n\n"):
        paragraphs.append(paragraph)
        paragraph_years.append(i)
paragraphs[:5]

['Mr. President, Mr. Speaker, Members of the 95th Congress, ladies and gentlemen:',
 "Two years ago today we had the first caucus in Iowa, and one year ago tomorrow, I walked from here to the White House to take up the duties of President of the United States. I didn't know it then when I walked, but I've been trying to save energy ever since. [Laughter]",
 'I return tonight to fulfill one of those duties of the Constitution: to give to the Congress—and to the Nation—information on the state of the Union.',
 'Militarily, politically, economically, and in spirit, the state of our Union is sound.',
 'We are a great country, a strong country, a vital and a dynamic country—and so we will remain.']
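The splitting step above can be tried without the `sotu/` files. The following is a minimal sketch, assuming the speeches are already in memory as a dict from year to text. `split_into_paragraphs` and the sample strings are hypothetical, and unlike the cell above it also strips whitespace and drops empty chunks, which is an added assumption about what should count as a paragraph.

```python
def split_into_paragraphs(text_by_year):
    """Split each speech on blank lines and track each paragraph's year."""
    paragraphs, years = [], []
    for year, text in sorted(text_by_year.items()):
        for chunk in text.split("\n\n"):
            chunk = chunk.strip()
            if chunk:  # assumption: empty chunks are not useful documents
                paragraphs.append(chunk)
                years.append(year)
    return paragraphs, years

# Made-up stand-ins for two speeches
sample = {
    2001: "First point.\n\nSecond point.\n\n",
    2002: "Only point.",
}
demo_paragraphs, demo_years = split_into_paragraphs(sample)
print(list(zip(demo_years, demo_paragraphs)))
```

The two returned lists stay aligned index by index, which is what makes the later per-year analysis possible.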

## Training the Model

Here, we will train the model. Note that the `nr_topics` parameter specifies the maximum number of topics that the model should generate.
For other parameters you can supply to it, see [the documentation](https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.__init__).
The `fit_transform` function performs the training and returns a 2-tuple: the most likely topic for each input document (i.e., each paragraph) and the probability assigned to it.

In [68]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction._stop_words import ENGLISH_STOP_WORDS
# This lets us get rid of words like "the", "of", "Mr.", etc. and other words which are not contentful for us
stop_words = list({"Mr.", "Mr", "mr", "mr.", "just", "distinguished", "madam", "guest", "guests", "sergeant", "speaker", "president", "fellow", "vice", "congress", "Thank", "thank", "audience", "speaker", "Tonight", "tonight", "Union", "union", "people", "People", "America", "American", "United", "States", "inaudible", "boo", "spoke", "member", "members", "applauded", "applause", "applaud", "inaudibledont", "inaudiblethe"}.union(ENGLISH_STOP_WORDS))
vectorizer_model = CountVectorizer(stop_words=stop_words)
model = BERTopic(
    vectorizer_model=vectorizer_model, 
    nr_topics=20,
)
topics, probabilities = model.fit_transform(paragraphs)

print("Topic and probability of 102nd paragraph:", topics[101], probabilities[101])
print("102nd paragraph:", paragraphs[101])

len(topics), len(paragraphs)

Topic and probability of 102nd paragraph: 4 0.6833697728742905
102nd paragraph: The other moment was in Warsaw, capital of a nation twice devastated by war in this century. There, people have rebuilt the city which war's destruction took from them. But what was new only emphasized clearly what was lost.


(4206, 4206)
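One detail worth checking in the stop-word list: by default, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters before it removes stop words, so entries such as "America" or "Mr." may never match a token. The sketch below imitates that default preprocessing in plain Python. It is a simplified stand-in rather than scikit-learn's own code, and `tokenize_like_default` is a hypothetical helper.

```python
import re

# Same regular expression as CountVectorizer's documented default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize_like_default(text, stop_words):
    """Lowercase, tokenize, then drop stop words (simplified imitation)."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in stop_words]

sentence = "Mr. Speaker, America is strong"

# Capitalized or punctuated entries never equal a lowercased token
kept_with_mixed_case = tokenize_like_default(sentence, {"Mr.", "America", "speaker"})

# Lowercase entries without punctuation do match
kept_with_lowercase = tokenize_like_default(sentence, {"mr", "america", "speaker"})

print(kept_with_mixed_case)
print(kept_with_lowercase)
```

This may explain why "america" and "american" still appear among the representative words of several topics even though "America" and "American" are on the stop-word list.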

## Topic Information
To see what the topics are all about, we can use `.get_topic_info()`, which returns a Pandas DataFrame containing the topic information.

Note that BERTopic has a topic numbered -1 which contains "junk"--i.e., documents that didn't fit into any of the other topics.

In [69]:
model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 1271 | -1_america_american_new_world | [america, american, new, world, country, jobs,... | [Congress, give these hard-working, responsibl... |
| 1 | 0 | 733 | 0_tax_health_care_budget | [tax, health, care, budget, year, federal, def... | [On the critical issue of health care, our goa... |
| 2 | 1 | 567 | 1_america_world_nation_americans | [america, world, nation, americans, freedom, a... | [This is a moral issue. The lawless state of o... |
| 3 | 2 | 302 | 2_jobs_trade_new_american | [jobs, trade, new, american, america, economy,... | [Standing as we are on the edge of a new centu... |
| 4 | 3 | 273 | 3_schools_education_school_college | [schools, education, school, college, children... | [Eighth, we must make the 13th and 14th years ... |
| 5 | 4 | 220 | 4_nuclear_soviet_defense_weapons | [nuclear, soviet, defense, weapons, security, ... | [Ten years ago, the United States and the Sovi... |
| 6 | 5 | 155 | 5_iraq_al_afghanistan_iraqi | [iraq, al, afghanistan, iraqi, qaida, terroris... | [In Iraq, the terrorists and extremists are fi... |
| 7 | 6 | 131 | 6_laughter_citizens_house_lady | [laughter, citizens, house, lady, honored, ame... | [Now, there's a new face at this place of hono... |
| 8 | 7 | 113 | 7_energy_oil_clean_gas | [energy, oil, clean, gas, climate, natural, ne... | [Today, no area holds more promise than our in... |
| 9 | 8 | 107 | 8_crime_gun_police_drugs | [crime, gun, police, drugs, drug, violent, cri... | [To prepare America for the 21st century, we m... |
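To put the size of topic -1 in perspective, its share of all paragraphs can be computed directly from the topic assignments. This is a minimal sketch on made-up assignments. `outlier_share` is a hypothetical helper, not part of BERTopic.

```python
from collections import Counter

def outlier_share(topic_assignments):
    """Fraction of documents that were assigned to the outlier topic -1."""
    counts = Counter(topic_assignments)
    return counts[-1] / len(topic_assignments)

# Made-up assignments for ten documents
toy_topics = [-1, 0, 0, 1, -1, 2, 0, -1, 1, 0]
toy_share = outlier_share(toy_topics)
print(f"{toy_share:.0%} of the toy documents are outliers")
```

With the counts in the table above, 1271 of the 4206 paragraphs, roughly 30%, landed in topic -1, so a sizeable part of each speech is left out of the named topics.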


## Topics over Time
We can use `.topics_over_time` to see how many paragraphs were assigned to certain topics in certain years.
Try playing around with the `topics=` argument below to compare trends.

In [70]:
topics_over_time = model.topics_over_time(paragraphs, paragraph_years)
model.visualize_topics_over_time(topics_over_time, topics=[3,4])
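Speeches differ in length, so a raw paragraph count per year can reflect a longer speech rather than more emphasis on a topic. As a complement to the plot, the sketch below computes a topic's share of each year's paragraphs from the parallel `topics` and `paragraph_years` lists. The data here is made up, `topic_share_by_year` is a hypothetical helper, and this is not how BERTopic itself computes its frequencies.

```python
from collections import Counter

def topic_share_by_year(topic_assignments, years, target_topic):
    """Fraction of each year's paragraphs that were assigned to target_topic."""
    paragraphs_per_year = Counter(years)
    hits_per_year = Counter(
        year for topic, year in zip(topic_assignments, years) if topic == target_topic
    )
    return {
        year: hits_per_year[year] / paragraphs_per_year[year]
        for year in sorted(paragraphs_per_year)
    }

# Made-up assignments: two paragraphs from 2001, four from 2002
toy_topics = [3, 4, 3, -1, 4, 4]
toy_years = [2001, 2001, 2002, 2002, 2002, 2002]
shares_topic_4 = topic_share_by_year(toy_topics, toy_years, target_topic=4)
shares_topic_3 = topic_share_by_year(toy_topics, toy_years, target_topic=3)
print(shares_topic_4)
print(shares_topic_3)
```

In the toy data, topic 4 has twice as many paragraphs in 2002 as in 2001, yet its share of each year's speech is the same.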

In [71]:
# Find the paragraphs associated with a certain topic and year
def get_topic_paragraphs_for_year(target_topic_number, target_year):
    results = []
    for topic_number, year, paragraph in zip(topics, paragraph_years, paragraphs):
        if topic_number == target_topic_number and year == target_year:
            results.append(paragraph)
    return results

get_topic_paragraphs_for_year(7, 2024)

['Modernizing our roads and bridges, ports and airports, public transit systems; removing poisonous lead pipes so every child can drink clean water without risk of brain damage; providing affordable—affordable—high-speed internet for every American, no matter where you live—urban, suburban, or rural communities in red States and blue States; record investments in Tribal communities.',
 "We're also making history by confronting the climate crisis, not denying it. I don't think any of you think there's no longer a climate crisis. At least, I hope you don't. [Laughter] I'm taking the most significant action ever on climate in the history of the world.",
 "I'm cutting our carbon emissions in half by 2030, creating tens of thousands of clean energy jobs, like the IBEW workers building and installing 500,000 electric vehicle charging stations; conserving 30 percent of America's lands and waters by 2030; and taking action on environmental justice—fenceline communities smothered by the legacy 

---

## Problem 1: Topics
For this problem and others, feel free to write additional code to support your answers. Refer to the [BERTopic](https://maartengr.github.io/BERTopic/index.html) documentation if you want to look for additional functionality.

1a. How coherent do the generated topics seem to you? For example, based on what you'd expect from the representative words, do the paragraphs labeled for that topic actually seem to line up with the topic as you understand it? Please answer in this cell with a paragraph or two. Feel free to support your write-up with additional code.

In [72]:
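# Assumption: `topic_info` and `get_topic_details` were defined in a cell that is not shown;
# the two definitions below are one plausible reconstruction.
topic_info = model.get_topic_info()

def get_topic_details(topic_number):
    # Return (top words with their scores, paragraphs assigned to this topic)
    top_words = model.get_topic(topic_number)
    topic_paragraphs = [p for t, p in zip(topics, paragraphs) if t == topic_number]
    return top_words, topic_paragraphs

def display_all_topics(num_paragraphs=3): # Output three paragraphs per topic
    # Iterate over the topic numbers that actually exist, skipping the -1 outlier topic
    for topic_number in topic_info['Topic']:
        if topic_number == -1:
            continue
        topic_details = get_topic_details(topic_number)
        
        # Get topic title
        topic_title_query = topic_info.loc[topic_info['Topic'] == topic_number, 'Name']
        topic_title = topic_title_query.values[0] if not topic_title_query.empty else "Unknown Topic"
        
        print(f"Topic {topic_number}: '{topic_title}'")
        
        # get_topic returns a list of (word, score) pairs when the topic exists
        if isinstance(topic_details[0], list):
            top_words = [word for word, _ in topic_details[0]]
        else:
            top_words = ["No words found"]
        
        print("Top Words:", top_words)

        # Display example paragraphs
        paragraphs_for_topic = topic_details[1]
        print(f"Example Paragraphs for Topic {topic_number}:")
        
        for i, paragraph in enumerate(paragraphs_for_topic[:num_paragraphs]):
            print(f"Paragraph {i+1}:", paragraph)

        print("\n" + "="*50 + "\n")  # Separator for topics

# Call the function to display all topics
display_all_topics()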

Topic 0: '0_tax_health_budget_care'
Top Words: ['tax', 'health', 'care', 'budget', 'year', 'federal', 'deficit', 'social', 'insurance', 'americans']
Example Paragraphs for Topic 0:
Paragraph 1: I will announce detailed proposals for improving our tax system later this week. We can make our tax laws fairer, we can make them simpler and easier to understand, and at the same time, we can-and we will—reduce the tax burden on American citizens by $25 billion.
Paragraph 2: The tax reforms and the tax reductions go together. Only with the long overdue reforms will the full tax cut be advisable.
Paragraph 3: Almost $17 billion in income tax cuts will go to individuals. Ninety-six percent of all American taxpayers will see their taxes go down. For a typical family of four, this means an annual saving of more than $250 a year, or a tax reduction of about 20 percent. A further $2 billion cut in excise taxes will give more relief and also contribute directly to lowering the rate of inflation.


To

**Response**  

The topics and their example paragraphs line up for most entries. For instance, Topic 0, titled "0_tax_health_budget_care," contains paragraphs about tax reform, spending, and related policies, so the title matches the content.

In contrast, some topics lack clarity and depth. Topic 8, "8_usa_hr_wall_joe," for example, consists mostly of brief audience interactions and repeated chants of "U.S.A.!" It is also unclear what "joe," "hr," or "wall" refers to, and the paragraphs say little of substance about any topic. Similarly, Topic 11, "11_bless_god_america_united," is limited to expressions of gratitude and does not explore themes of unity or patriotism, showing a disconnect between the title and the content. Finally, Topic 13, "13_saddam_hussein_weapons_inspectors," addresses military and geopolitical concerns, but it would be clearer with a tighter focus on either international security or domestic gun violence.


---

1b. For **two** topics of your choosing, observe how they trend over time. Given what these topics are _supposed_ to be about, do they trend as you would expect (based on, for example, real-world factors such as the president's party or real-world events occurring around the time of the speech), or do they deviate from your expectations? For example, you might expect a topic related to "war" or "terrorism" to become very frequent in 2002.

Explain with a couple of paragraphs.

In [73]:
topics_over_time = model.topics_over_time(paragraphs, paragraph_years)
model.visualize_topics_over_time(topics_over_time, topics=[0,1])

**Response**  

Looking at the graph above, I chose two topics: one about taxes, healthcare, and budget, and another about America, the nation, and freedom. The first topic, "tax, health, budget, care," goes up and down over time but spikes in the late 1990s, early 2000s, and again around the 2020s. These rises make sense when we think about what was happening at those times. For example, in the 1990s, there were big discussions around the budget and healthcare, and in the 2000s, George W. Bush focused a lot on tax cuts. The big spike in the 2020s might be related to COVID-19 and all the talk about healthcare and the economy.

The second topic, "America, nation, world, freedom," stays fairly steady but has noticeable jumps, especially after 2001. This is likely due to 9/11 and the War on Terror, when national security and freedom became central themes in U.S. political discourse. There’s also a rise in the 2010s, which could be linked to discussions about America's global role under Obama and Trump. Overall, these topics seem to track fairly well what was happening in the real world during these times.

In [74]:
model.visualize_topics_over_time(topics_over_time)

---

1c. Finish the code snippet below to calculate the topic distribution by political party. After you've finished it, write a paragraph or two explaining any patterns you see in how the two parties pattern along topics.

In [75]:
from collections import Counter
D_YEARS = [1978, 1979, 1980, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2021, 2022, 2023, 2024]
R_YEARS = [y for y in range(1978, 2025) if y not in D_YEARS]
d_counter = Counter()
r_counter = Counter()

# Loop over the assigned topic of each paragraph and the year that paragraph is from
for topic, year in zip(topics, paragraph_years):
    if year in D_YEARS:
        d_counter[topic] += 1
    elif year in R_YEARS:
        r_counter[topic] += 1

for _, topic in model.get_topic_info().iterrows():
    print(topic.Name, topic.Representation)
    print("Democratic paragraphs:", d_counter[topic.Topic])
    print("Republican paragraphs:", r_counter[topic.Topic])
    print()


-1_america_american_new_world ['america', 'american', 'new', 'world', 'country', 'jobs', 'government', 'years', 'work', 'make']
Democratic paragraphs: 763
Republican paragraphs: 508

0_tax_health_care_budget ['tax', 'health', 'care', 'budget', 'year', 'federal', 'deficit', 'social', 'insurance', 'americans']
Democratic paragraphs: 446
Republican paragraphs: 287

1_america_world_nation_americans ['america', 'world', 'nation', 'americans', 'freedom', 'american', 'new', 'immigration', 'state', 'time']
Democratic paragraphs: 309
Republican paragraphs: 258

2_jobs_trade_new_american ['jobs', 'trade', 'new', 'american', 'america', 'economy', 'world', 'years', 'markets', 'china']
Democratic paragraphs: 188
Republican paragraphs: 114

3_schools_education_school_college ['schools', 'education', 'school', 'college', 'children', 'students', 'teachers', 'high', 'help', 'parents']
Democratic paragraphs: 190
Republican paragraphs: 83

4_nuclear_soviet_defense_weapons ['nuclear', 'soviet', 'defense',

**Response**  

The topic distribution suggests an ideological divide between the Democratic and Republican parties, reflected in their priorities and policies. Democratic speeches emphasize social issues such as education and healthcare, showing a commitment to public welfare and reform. The same holds for the economy-related topic, where Democrats discuss jobs and economic conditions more often than Republicans (188 paragraphs versus 114 for topic 2).

Conversely, Republican speeches tend to concentrate on national security and military themes, as suggested by the more balanced representation in topics related to nuclear issues and wars. Republicans also show a focus on crime, especially gun-related topics, although with lower representation than Democrats, which may reflect an emphasis on traditional values and law enforcement. While both parties share some concerns, their narratives diverge, with Democrats prioritizing social equity and Republicans focusing on security and economic conservatism.
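Raw counts can be misleading when one party contributes more paragraphs overall, so it may help to compare each topic's share within a party instead. The sketch below normalizes two counters shaped like `d_counter` and `r_counter`. The numbers are made up for illustration and `topic_shares` is a hypothetical helper.

```python
from collections import Counter

def topic_shares(counter):
    """Convert raw per-topic counts into fractions of the party's total."""
    total = sum(counter.values())
    return {topic: count / total for topic, count in counter.items()}

# Made-up counts: the first party has 60 paragraphs, the second has 40
toy_d = Counter({0: 30, 3: 20, 4: 10})
toy_r = Counter({0: 20, 3: 5, 4: 15})

d_shares = topic_shares(toy_d)
r_shares = topic_shares(toy_r)
for topic in sorted(toy_d):
    print(topic, round(d_shares[topic], 3), round(r_shares[topic], 3))
```

In this toy data, topic 0 has more paragraphs from the first party (30 versus 20), but it takes up the same share, one half, of each party's paragraphs.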

---

## Problem 2: Model Tuning
Recall from above ("Training the Model") that the BERTopic model can be tuned with [a variety of parameters](https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.__init__).
Based on what you wrote in your solutions to Problem 1, we will try to tune the model to get it to behave better.

2a. Try changing the value of `nr_topics` in the line with `model = ` above and then re-run code above to see how the topics change. How would you describe the changes you're seeing? Would you say that the "quality" of the topics you're getting has increased or decreased? Write a paragraph or two.

In [76]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction._stop_words import ENGLISH_STOP_WORDS
# This lets us get rid of words like "the", "of", "Mr.", etc. and other words which are not contentful for us
stop_words = list({"Mr.", "Mr", "mr", "mr.", "just", "distinguished", "madam", "guest", "guests", "sergeant", "speaker", "president", "fellow", "vice", "congress", "Thank", "thank", "audience", "speaker", "Tonight", "tonight", "Union", "union", "people", "People", "America", "American", "United", "States", "inaudible", "boo", "spoke", "member", "members", "applauded", "applause", "applaud", "inaudibledont", "inaudiblethe"}.union(ENGLISH_STOP_WORDS))
vectorizer_model = CountVectorizer(stop_words=stop_words)
model = BERTopic(
    vectorizer_model=vectorizer_model, 
    nr_topics=10, # Adjusting the value of topics to 10 only
)
topics, probabilities = model.fit_transform(paragraphs)

print("Topic and probability of 102nd paragraph:", topics[101], probabilities[101])
print("102nd paragraph:", paragraphs[101])

len(topics), len(paragraphs)

Topic and probability of 102nd paragraph: 1 0.8117121718744169
102nd paragraph: The other moment was in Warsaw, capital of a nation twice devastated by war in this century. There, people have rebuilt the city which war's destruction took from them. But what was new only emphasized clearly what was lost.


(4206, 4206)

In [77]:
model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 1323 | -1_america_american_new_world | [america, american, new, world, americans, wor... | [The President. ——the American Rescue Plan hel... |
| 1 | 0 | 1850 | 0_new_year_tax_american | [new, year, tax, american, years, america, mak... | [At the same time, we must ensure that older s... |
| 2 | 1 | 383 | 1_nuclear_iraq_weapons_world | [nuclear, iraq, weapons, world, defense, force... | [About the defense budget, I raise a hope and ... |
| 3 | 2 | 292 | 2_veterans_laughter_americans_men | [veterans, laughter, americans, men, know, hon... | [Our men and women in uniform are making sacri... |
| 4 | 3 | 154 | 3_crime_gun_police_violence | [crime, gun, police, violence, law, laws, viol... | [Our fourth great challenge is to take our str... |
| 5 | 4 | 77 | 4_usa_hr_wall_point | [usa, hr, wall, point, joe, fix, ai, build, bo... | [Harness—[applause]. Harness the promise of AI... |
| 6 | 5 | 55 | 5_cancer_aids_vaccines_covid | [cancer, aids, vaccines, covid, diseases, dise... | [American foreign policy is more than a matter... |
| 7 | 6 | 50 | 6_bless_god_america_united | [bless, god, america, united, states, good, ev... | [God bless you, and God bless America.\n, Than... |
| 8 | 7 | 11 | 7_court_justice_senate_judges | [court, justice, senate, judges, courts, bench... | [Because courts must always deliver impartial ... |
| 9 | 8 | 11 | 8_fees_junk_companies_card | [fees, junk, companies, card, airlines, like, ... | [We—the idea that cable, internet, and cell ph... |


**Response**  

When I increased the `nr_topics` value, the model attempted to find more distinct topics, but this led to more fragmentation: related ideas were split across multiple topics. As a result, the topics overlap more, which hurts accuracy.

When I decreased the `nr_topics` value, the model grouped similar ideas into broader topics. This makes it easier to summarize general themes and vocabulary, but we lose some important distinctions between the contents of the paragraphs.
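One way to quantify the merging effect is to look at how much of the corpus the largest named topic absorbs. The sketch below uses counts copied from the first rows of the two `get_topic_info()` outputs above (the run with `nr_topics=20` and the run with `nr_topics=10`). `largest_topic_share` is a hypothetical helper, and only the rows shown are included.

```python
def largest_topic_share(topic_counts, total_documents):
    """Share of all documents in the biggest topic, ignoring the -1 outliers."""
    named_topics = {t: c for t, c in topic_counts.items() if t != -1}
    return max(named_topics.values()) / total_documents

TOTAL_PARAGRAPHS = 4206

# Topic -> Count, taken from the first rows of each table above
counts_20_topics = {-1: 1271, 0: 733, 1: 567, 2: 302}
counts_10_topics = {-1: 1323, 0: 1850, 1: 383, 2: 292}

share_20 = largest_topic_share(counts_20_topics, TOTAL_PARAGRAPHS)
share_10 = largest_topic_share(counts_10_topics, TOTAL_PARAGRAPHS)
print(f"nr_topics=20: {share_20:.1%}, nr_topics=10: {share_10:.1%}")
```

With fewer topics, the largest one grows from about 17% to about 44% of all paragraphs, which fits the impression that distinct themes are being folded together.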

---

2b. The parameter `top_n_words` can be used to specify the number of words that BERTopic will learn to associate with any given topic. The default value is `10`, so try setting this to something lower or higher. Describe what you see in a paragraph or two.

In [78]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction._stop_words import ENGLISH_STOP_WORDS
# This lets us get rid of words like "the", "of", "Mr.", etc. and other words which are not contentful for us
stop_words = list({"Mr.", "Mr", "mr", "mr.", "just", "distinguished", "madam", "guest", "guests", "sergeant", "speaker", "president", "fellow", "vice", "congress", "Thank", "thank", "audience", "speaker", "Tonight", "tonight", "Union", "union", "people", "People", "America", "American", "United", "States", "inaudible", "boo", "spoke", "member", "members", "applauded", "applause", "applaud", "inaudibledont", "inaudiblethe"}.union(ENGLISH_STOP_WORDS))
vectorizer_model = CountVectorizer(stop_words=stop_words)
model = BERTopic(
    vectorizer_model=vectorizer_model, 
    nr_topics=20,
    top_n_words=5  # Changing the number of words that BERTopic uses to associate with the topics
)
topics, probabilities = model.fit_transform(paragraphs)

print("Topic and probability of 102nd paragraph:", topics[101], probabilities[101])
print("102nd paragraph:", paragraphs[101])

len(topics), len(paragraphs)

Topic and probability of 102nd paragraph: -1 0.0
102nd paragraph: The other moment was in Warsaw, capital of a nation twice devastated by war in this century. There, people have rebuilt the city which war's destruction took from them. But what was new only emphasized clearly what was lost.


(4206, 4206)

In [79]:
model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 1378 | -1_america_american_world_new | [america, american, world, new, jobs] | [The American people must prosper in the globa... |
| 1 | 0 | 470 | 0_tax_budget_federal_spending | [tax, budget, federal, spending, deficit] | [And let's begin by discussing how to maintain... |
| 2 | 1 | 359 | 1_children_schools_education_school | [children, schools, education, school, child] | [Eighth, we must make the 13th and 14th years ... |
| 3 | 2 | 341 | 2_energy_jobs_new_economy | [energy, jobs, new, economy, clean] | [And 4 years ago, other countries dominated th... |
| 4 | 3 | 320 | 3_america_world_freedom_nation | [america, world, freedom, nation, state] | [I have come to review with you the progress o... |
| 5 | 4 | 233 | 4_health_care_insurance_coverage | [health, care, insurance, coverage, medicare] | [My budget puts a priority on access to health... |
| 6 | 5 | 232 | 5_veterans_men_home_americans | [veterans, men, home, americans, honor] | [Our men and women in uniform are making sacri... |
| 7 | 6 | 184 | 6_terrorists_iraq_al_afghanistan | [terrorists, iraq, al, afghanistan, qaida] | [In Iraq, the terrorists and extremists are fi... |
| 8 | 7 | 162 | 7_immigration_border_reform_laws | [immigration, border, reform, laws, illegal] | [The other pressing challenge is immigration. ... |
| 9 | 8 | 129 | 8_nuclear_soviet_iran_nato | [nuclear, soviet, iran, nato, weapons] | [This year I'll ask the Senate to approve STAR... |


**Response**  

Reducing the `top_n_words` value in the BERTopic model from 10 to 5 changes the topics noticeably. With only five words, the topic representations become more concise and keep just the most important terms. For example, while the original setup included broader keywords like "jobs" and "government," the new configuration yields tighter labels, such as "health, care, insurance, coverage, medicare." However, while this shift can make the topics clearer, we lose lower-ranked words that are still related to the topic, such as synonyms or contrasting terms.

Therefore, the value of `top_n_words` affects how topics are labeled and represented. With more words, topic representations are more descriptive and capture more of the vocabulary associated with the topic. With only five words, they risk oversimplifying complex ideas. While this adjustment can lead to a clearer understanding of key themes, it may sacrifice some in-depth information and classification detail. This trade-off is somewhat similar to the one for `nr_topics` in the exercise above.
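The trade-off described above can be pictured as cutting a ranked word list at different lengths. The sketch below is a simplified illustration only: the scores are made up, `top_words` is a hypothetical helper, and it does not reproduce how BERTopic actually scores or selects words.

```python
def top_words(scored_words, top_n_words=10):
    """Keep the top_n_words highest-scoring words of a topic."""
    ranked = sorted(scored_words, key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:top_n_words]]

# Made-up scores for a health-care-like topic
scored = [
    ("health", 0.09), ("care", 0.08), ("insurance", 0.06), ("coverage", 0.05),
    ("medicare", 0.04), ("costs", 0.03), ("drugs", 0.02), ("seniors", 0.01),
]

short_label = top_words(scored, top_n_words=5)
long_label = top_words(scored, top_n_words=8)
print(short_label)
print(long_label)
```

The shorter list is easier to read at a glance, while the longer one keeps supporting terms that help distinguish this topic from a neighboring one.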