# Example Project: Comparing Trump's and Biden's Inaugural Speeches

We will use a mini-project as an extended practical example to demonstrate the concepts we are learning in the workshop. The project aims to analyze and compare two US presidential inaugural speeches: Donald Trump's from 2017 and Joe Biden's from 2021.

The speech transcripts were obtained from https://millercenter.org/the-presidency/presidential-speeches and copied into the text files `biden_inauguration_millercenter.txt` and `trump_inauguration_millercenter.txt` in the `data` folder.

## Straight-line programming

Even with just a basic understanding of data types, operations, and methods, we can already extract useful information from data. Below, we will:
1. Open the text file with one of the speeches
2. Clean up the text and extract a list of all the words used in the speech
3. Estimate the length of the speech and the number of unique words used

In [1]:
# Open the file and get the text into a string variable called txt
with open('data/trump_inauguration_millercenter.txt') as f:
    txt = f.read()
txt[:500] # Show the first 500 characters of the txt variable

'Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: thank you.\n\nWe, the citizens of America, are now joined in a great national effort to rebuild our country and to restore its promise for all of our people.\n\nTogether, we will determine the course of America and the world for years to come.\n\nWe will face challenges. We will confront hardships. But we will get the job done.\n\nEvery four years, we gather on these st'

In [2]:
# Remove paragraphs and format consistently
txt = txt.strip().replace('\n', ' ').replace("’", "'")

# Get rid of possessives and expand contractions
txt = txt.replace("'s", '').replace("'ve", ' have').replace("'re", ' are')
txt = txt.replace("can't", 'can not').replace("n't", ' not')

# Remove punctuation
txt = txt.replace('—', '').replace('–', '')
txt = txt.replace('.', '').replace(',', '').replace(':', '').replace(';', '').replace('…', '')
txt = txt.replace("”", '').replace("“", '')

# Convert to lower-case
txt = txt.lower()

# Break into words
wrds = txt.split()
print(sorted(wrds)[:100])

# Count the number of words in the speech
print(len(wrds))

# Count the number of unique words
print(len(set(wrds)))

['2017', '20th', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'about', 'about', 'accept', 'across', 'across', 'across', 'across', 'across', 'action', 'action', 'administration', 'affairs', 'again', 'again', 'again', 'again', 'again', 'again', 'again', 'again', 'again', 'against', 'aid', 'airports', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'allegiance', 'allegiance', 'alliances', 'allowing', 'almighty', 'along', 'always', 'always', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'americans', 'americans', 'americans', 'americans', 'an', 'an', 'an', 'and', 'and']
1436
536
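To see what each cleaning step does without opening a file, the same chain of `replace` calls can be tried on a short string. This is only a sketch: the sentence in `sample` is made up for illustration and is not a line from either speech, and only the punctuation it contains is removed.

```python
# A made-up sample sentence (an assumption, not taken from either speech)
sample = "We can't stop now. America’s people, we've said, aren't done."

# Format consistently: trim, drop newlines, use straight apostrophes
clean = sample.strip().replace('\n', ' ').replace("’", "'")

# Get rid of possessives and expand contractions
clean = clean.replace("'s", '').replace("'ve", ' have').replace("'re", ' are')
clean = clean.replace("can't", 'can not').replace("n't", ' not')

# Remove the punctuation present in the sample and convert to lower-case
clean = clean.replace('.', '').replace(',', '').lower()

sample_words = clean.split()
print(sample_words)
print(len(sample_words), len(set(sample_words)))
```

One trade-off of this simple approach is that removing `'s` also shortens contractions such as "it's" to "it", so the "is" is lost.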


## Control flow with conditionals and loops

Branching and iteration allow us to employ more complex logic in our data processing and analysis: e.g., repeat operations or set conditions to select data. Below, we will:
1. Count the number of times each unique word is mentioned in the speech
2. Exclude non-meaningful words such as articles and prepositions
3. Identify the most commonly used meaningful words to reveal the theme and tone of the speech

In [3]:
# Create dictionary with word:count
word_counts = {}

for i in wrds:
    if i not in word_counts:
        word_counts[i] = 1
    else:
        word_counts[i] += 1

# Print the words with counts in decreasing order of popularity
# Note this produces a list of tuples
sorted_word_counts = sorted(word_counts.items(), key=lambda i: i[1], reverse=True)

sorted_word_counts[:10]

[('and', 74),
 ('the', 70),
 ('we', 49),
 ('of', 48),
 ('our', 48),
 ('will', 40),
 ('to', 37),
 ('is', 21),
 ('america', 18),
 ('a', 15)]
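The counting loop above can also be written with `collections.Counter` from the standard library, which builds the word-to-count mapping and sorts it for us. The sketch below uses a short hypothetical list, `sample_wrds`, in place of `wrds`.

```python
from collections import Counter

# A short hypothetical word list standing in for `wrds`
sample_wrds = ['we', 'will', 'we', 'the', 'will', 'we']

# Counter is a dict subclass that maps each item to its count
counts = Counter(sample_wrds)

# most_common(n) returns the n most frequent items as (item, count) tuples
top = counts.most_common(2)
print(top)
```

Writing the loop by hand, as in the cell above, is still useful for practicing conditionals and iteration.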

In [4]:
# We will create a list of all words mentioned more than once, excluding stop words
# Stop words are common words that are not meaningful in this context
stop_words = ['a', 'about', 'across', 'after', 'an', 'and', 'any', 'are', 'as', 'at', 
              'be', 'because', 'but', 'by', 'did', 'do', 'does', 'for', 'from',
              'get', 'has', 'have', 'if', 'in', 'is', 'it', 'its',
              'many', 'more', 'much', 'no', 'not', 'of', 'on', 'or', 'out',
              'so', 'some', 'than', 'the', 'this', 'that', 'those', 'through', 'to',
              'very', 'what', 'where', 'whether', 'which', 'while', 'who', 'with']

common_words = []
for i in sorted_word_counts:
    if i[0] not in stop_words:
        if i[1] > 1:
            common_words.append(i)
        else:
            break
            
# Alternatively, we can use a list comprehension for the code block above
# common_words = [i for i in sorted_word_counts if i[0] not in stop_words and i[1] > 1]     
        
common_words[:10]

[('we', 49),
 ('our', 48),
 ('will', 40),
 ('america', 18),
 ('you', 12),
 ('all', 12),
 ('american', 12),
 ('their', 11),
 ('your', 11),
 ('people', 9)]
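The loop with `break` and the commented-out list comprehension give the same result because the pairs are already sorted by count: once a non-stop word with a count of 1 is reached, no later word can have a higher count. A small sketch with made-up pairs (`sample_sorted` and `sample_stop` are hypothetical names and data) illustrates this.

```python
# Hypothetical (word, count) pairs, sorted by count in decreasing order
sample_sorted = [('the', 5), ('we', 4), ('a', 3), ('nation', 2),
                 ('of', 1), ('hope', 1), ('again', 1)]
# A set makes membership checks fast; a list works too
sample_stop = {'the', 'a', 'of'}

# Loop version: stops at the first non-stop word that appears only once
loop_result = []
for word, count in sample_sorted:
    if word not in sample_stop:
        if count > 1:
            loop_result.append((word, count))
        else:
            break

# Comprehension version: checks every pair
comp_result = [(w, c) for w, c in sample_sorted
               if w not in sample_stop and c > 1]

print(loop_result)
print(loop_result == comp_result)
```

The loop can stop early, while the comprehension is shorter but always walks the whole list.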

## Functions

Once we understand conditionals, loops, and functions, we can improve the code above and make it more efficient and modular. This will allow us to apply it to multiple data files, without the need to duplicate large chunks of code. Below, we will:
1. Create a function to extract words from text and another function to count words in a text
2. Apply the functions to each president's speech
3. Compare the length and repetitiveness of the speeches, the most common words and the unique words

In [5]:
import string  # See https://docs.python.org/3/library/string.html

# This will now be a global variable so we will follow the convention and 
# name it in all caps
STOP_WORDS = ['a', 'about', 'across', 'after', 'an', 'and', 'any', 'are', 'as', 'at', 
              'be', 'because', 'but', 'by', 'did', 'do', 'does', 'for', 'from',
              'get', 'has', 'have', 'if', 'in', 'is', 'it', 'its',
              'many', 'more', 'much', 'no', 'not', 'of', 'on', 'or', 'out',
              'so', 'some', 'than', 'the', 'this', 'that', 'those', 'through', 'to',
              'very', 'what', 'where', 'whether', 'which', 'while', 'who', 'with']

def get_tokens(fname):
    """Read given text file and return a list with all words in lowercase
    in the order they appear in the text. Common contractions are expanded
    and hyphenated words are combined into one word.
    """
    with open(fname) as f:
        txt = f.read()
        
    # Remove paragraphs and format consistently
    txt = txt.strip().replace('\n', ' ').replace("’", "'")
    
    # Get rid of possessives and expand contractions
    txt = txt.replace("'s", '').replace("'ve", ' have').replace("'re", ' are')
    txt = txt.replace("can't", 'can not').replace("n't", ' not')

    # Remove punctuation and convert to lower-case
    exclude = set(string.punctuation) | {"”", "“", "…", '–'}
    txt = ''.join(ch.lower() for ch in txt if ch not in exclude)

    # Break into words
    wrds = txt.split()
    
    return wrds


def get_word_counts(tokens):
    """Take a list of tokens and return a list of (word, count) tuples,
    excluding stop words, sorted by count in decreasing order.
    """
    # Create dictionary with word:count
    word_counts = {}

    for i in tokens:
        if i not in STOP_WORDS:
            if i not in word_counts:
                word_counts[i] = 1
            else:
                word_counts[i] += 1

    # Get the words with counts in decreasing order of popularity
    # Note this produces a list of tuples
    sorted_word_counts = sorted(word_counts.items(), key=lambda i: i[1], reverse=True)
    
    return sorted_word_counts


trump_tokens = get_tokens('data/trump_inauguration_millercenter.txt')
biden_tokens = get_tokens('data/biden_inauguration_millercenter.txt')

print(trump_tokens[:100])


['chief', 'justice', 'roberts', 'president', 'carter', 'president', 'clinton', 'president', 'bush', 'president', 'obama', 'fellow', 'americans', 'and', 'people', 'of', 'the', 'world', 'thank', 'you', 'we', 'the', 'citizens', 'of', 'america', 'are', 'now', 'joined', 'in', 'a', 'great', 'national', 'effort', 'to', 'rebuild', 'our', 'country', 'and', 'to', 'restore', 'its', 'promise', 'for', 'all', 'of', 'our', 'people', 'together', 'we', 'will', 'determine', 'the', 'course', 'of', 'america', 'and', 'the', 'world', 'for', 'years', 'to', 'come', 'we', 'will', 'face', 'challenges', 'we', 'will', 'confront', 'hardships', 'but', 'we', 'will', 'get', 'the', 'job', 'done', 'every', 'four', 'years', 'we', 'gather', 'on', 'these', 'steps', 'to', 'carry', 'out', 'the', 'orderly', 'and', 'peaceful', 'transfer', 'of', 'power', 'and', 'we', 'are', 'grateful', 'to']


In [6]:
# Biden's speech is longer
print(len(trump_tokens), len(biden_tokens))
print(len(set(trump_tokens)), len(set(biden_tokens)))

# Biden's speech is also more repetitive
print(len(trump_tokens)/len(set(trump_tokens)), len(biden_tokens)/len(set(biden_tokens)))

print() # Add an empty line to separate results

# The twenty most common words for Trump and Biden
trump_wcounts = get_word_counts(trump_tokens)
biden_wcounts = get_word_counts(biden_tokens)

# Biden uses first-person singular words ('i', 'my', 'me') more often
print(trump_wcounts[:20])

print() # Add an empty line to separate results

print(biden_wcounts[:20])


1436 2382
536 721
2.6791044776119404 3.30374479889043

[('we', 49), ('our', 48), ('will', 40), ('america', 18), ('you', 12), ('all', 12), ('american', 12), ('their', 11), ('your', 11), ('people', 9), ('country', 9), ('nation', 9), ('again', 9), ('one', 8), ('every', 7), ('world', 6), ('now', 6), ('great', 6), ('back', 6), ('never', 6)]

[('we', 91), ('our', 43), ('will', 33), ('i', 33), ('us', 27), ('my', 20), ('america', 20), ('can', 18), ('you', 17), ('all', 17), ('one', 15), ('nation', 14), ('democracy', 11), ('me', 11), ('must', 10), ('americans', 9), ('today', 9), ('people', 9), ('american', 9), ('story', 9)]
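The repetitiveness measure printed above is the number of tokens divided by the number of unique tokens. A minimal sketch of that calculation follows; `repetition_ratio` is a hypothetical helper name and the word list is made up.

```python
def repetition_ratio(tokens):
    """Return the number of tokens per unique token.

    A value of 1.0 means that no word is repeated.
    """
    if not tokens:
        return 0.0  # Avoid dividing by zero for an empty text
    return len(tokens) / len(set(tokens))

# Six tokens, four of them unique
ratio = repetition_ratio(['we', 'will', 'we', 'will', 'win', 'again'])
print(ratio)
```

This ratio tends to grow with the length of the text, since common words keep recurring, so it is best read as a rough indicator when the two speeches differ in length.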


In [7]:
# Get the set of non-stop words used in each speech and check the difference
trump_100 = set([i[0] for i in trump_wcounts])
biden_100 = set([i[0] for i in biden_wcounts])

# Words used more than twice by Trump but not at all by Biden
print([i for i in trump_wcounts if i[0] in (trump_100-biden_100) and i[1] > 2])

print()

# Words used more than twice by Biden but not at all by Trump
print([i for i in biden_wcounts if i[0] in (biden_100- trump_100) and i[1] > 2])


[('back', 6), ('protected', 5), ('dreams', 5), ('wealth', 4), ('everyone', 4), ('bring', 4), ('obama', 3), ('too', 3), ('capital', 3), ('government', 3), ('factories', 3), ('foreign', 3), ('countries', 3)]

[('democracy', 11), ('me', 11), ('story', 9), ('know', 8), ('history', 7), ('war', 7), ('days', 6), ('truth', 5), ('may', 5), ('cause', 4), ('centuries', 4), ('peace', 4), ('virus', 4), ('lost', 4), ('soul', 4), ('things', 4), ('once', 4), ('better', 4), ('need', 4), ('say', 4), ('vice', 3), ('hope', 3), ('resolve', 3), ('prevailed', 3), ('ago', 3), ('violence', 3), ('them', 3), ('constitution', 3), ('sacred', 3), ('year', 3), ('cry', 3), ('whole', 3), ('uniting', 3), ('join', 3), ('common', 3), ('faith', 3), ('show', 3), ('dignity', 3), ('respect', 3), ('meet', 3), ('believe', 3), ('yet', 3), ('gave', 3), ('honor', 3), ('lies', 3)]
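The comparison above relies on set difference: `a - b` keeps the elements of `a` that are not in `b`. The sketch below repeats the idea on two tiny made-up lists of (word, count) tuples; the words and counts are illustrative only.

```python
# Toy (word, count) lists; the values are illustrative assumptions
first_counts = [('nation', 7), ('jobs', 4), ('borders', 3), ('soon', 1)]
second_counts = [('nation', 9), ('unity', 5), ('truth', 3)]

# Build the vocabulary of each list as a set of words
first_vocab = {word for word, _ in first_counts}
second_vocab = {word for word, _ in second_counts}

# Words that appear only in the first list
only_first = first_vocab - second_vocab

# Keep the original order and counts, and drop rarely used words
frequent_only_first = [(w, c) for w, c in first_counts
                       if w in only_first and c > 2]
print(frequent_only_first)
```

Filtering the sorted list, rather than printing the set directly, preserves the ordering by count.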


## Classes

What we did above is known as procedural programming: we keep functions and data separate and pass the data to the functions. Alternatively, we can take an object-oriented approach and bundle the data and the functions that operate on them into classes. The functions then become methods that belong to that class, so we call them on objects of the class rather than independently on other types of data.
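The contrast can be sketched on a toy example before turning to the full class. The names `count_tokens` and `TokenList` are hypothetical and are not part of the project code.

```python
# Procedural: the data (a list) and the function are separate
def count_tokens(tokens):
    return len(tokens)

# Object-oriented: the data and the function are bundled in a class
class TokenList:
    def __init__(self, tokens):
        self.tokens = tokens

    def count(self):
        return len(self.tokens)

words = ['we', 'the', 'people']
procedural_result = count_tokens(words)   # Function called on the data
oop_result = TokenList(words).count()     # Method called on the object
print(procedural_result, oop_result)
```

Both calls give the same answer; the difference is in where the logic lives and how it is invoked.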

In [8]:
class Speech(object):
        
    def __init__(self, fname):
        """Creates a speech using the text in file fname."""
        
        with open(fname) as f:
            self.txt = f.read()
        self.tokens = None
        self.word_counts = None
        
        # Populate the empty attributes above by processing the text
        self.process_tokens()        
        self.process_word_counts()
    
    
    # The following two methods are called when you initialize a new object
        
    def process_tokens(self):
        """Extracts the tokens in the text and assigns them to 
        the attribute 'tokens'. 'tokens' is a list of strings.
        """
        
        # Remove paragraphs and format consistently
        txt = self.txt.strip().replace('\n', ' ').replace("’", "'")
        
        # Get rid of possessives and expand contractions
        txt = txt.replace("'s", '').replace("'ve", ' have').replace("'re", ' are')
        txt = txt.replace("can't", 'can not').replace("n't", ' not')
    
        # Remove punctuation and convert to lower-case
        exclude = set(string.punctuation) | {"”", "“", "…", '–'}
        txt = ''.join(ch.lower() for ch in txt if ch not in exclude)

        # Break into words
        wrds = txt.split()

        self.tokens = wrds
        
        
    def process_word_counts(self):
        """Counts the number of times each word, excluding stop words,
        appears in the speech and assigns the counts to the attribute 'word_counts'.
        'word_counts' is a list of tuples in the form (token, count).
        """
        # Create dictionary with word:count
        word_counts = {}

        for i in self.tokens:
            if i not in STOP_WORDS:
                if i not in word_counts:
                    word_counts[i] = 1
                else:
                    word_counts[i] += 1

        # Get the words with counts in decreasing order of popularity
        # Note this produces a list of tuples
        sorted_word_counts = sorted(word_counts.items(), key=lambda i: i[1], reverse=True)
        self.word_counts = sorted_word_counts
    
    
    # Use getter methods to provide an interface for interacting with the objects

    def get_text(self):
        """Get the original text of the speech as a string."""
        return self.txt
        
    def get_tokens(self):
        """Get the tokens in the speech as a list of strings."""
        # Return a copy so that callers cannot modify the internal list
        return self.tokens[:]
    
    def get_word_counts(self):
        """Get each unique word in the speech and the number of times it appears in the speech.
        Return a list of tuples in the form (token, count).
        """
        # Return a copy so that callers cannot modify the internal list
        return self.word_counts[:]
    
    # You can make your code even more interactive by providing extra methods for
    # common and useful operations
    
    def get_speech_length(self):
        """Get the number of tokens in the speech."""
        return len(self.tokens)
    
    def get_number_unique_tokens(self):
        """Gets the number of unique words used in the speech,
        including stop words.
        """
        return len(set(self.tokens))
    
    def __str__(self):
        """Returns the first 200 characters of the speech."""
        return self.txt[:200] + '...'

    
# Create an object of class Speech for Trump's inaugural speech
trump = Speech('data/trump_inauguration_millercenter.txt')
print(trump)
# Get the length of the speech (the text was already processed on initialization)
print(trump.get_speech_length())

print()

# Create another Speech object for Biden's inaugural speech
biden = Speech('data/biden_inauguration_millercenter.txt')
print(biden)
print(biden.get_speech_length())

Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: thank you.

We, the citizens of America, are now joined in a gre...
1436

Chief Justice Roberts, Vice President Harris, Speaker Pelosi, Leader Schumer, Leader McConnell, Vice President Pence, distinguished guests, and my fellow Americans.

This is America’s day.

This is de...
2382
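The getter methods in the class return copies of the internal lists (`self.tokens[:]`). The following sketch shows why this matters, using a hypothetical stand-in class, `TokenBox`, that takes a list directly so that no data file is needed.

```python
class TokenBox:
    """Hypothetical stand-in for Speech that stores a list of tokens."""

    def __init__(self, tokens):
        self.tokens = list(tokens)

    def get_tokens(self):
        # Return a copy so that callers cannot modify the internal list
        return self.tokens[:]

box = TokenBox(['we', 'the', 'people'])

# Changing the returned list does not affect the object's own data
borrowed = box.get_tokens()
borrowed.append('extra')

print(len(borrowed), len(box.get_tokens()))
```

If `get_tokens` returned `self.tokens` itself, the `append` call would also change the data stored in the object.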
