## Decoding Emotions in Leadership: A Sentiment Analysis of Historical and Modern Icons' Speeches Using NLTK

NLTK, or the Natural Language Toolkit, is a popular Python library used for working with human language data (natural language processing or NLP), providing tools for text processing and linguistic analysis. It's commonly used for tasks like tokenization, parsing, classification, and sentiment analysis in various language-related applications.

In [127]:
# Optional setup, run in a terminal:
# python3 -m venv env
# source env/bin/activate
# pip3 install -U nltk
# pip freeze > requirements.txt

In [128]:
import nltk

from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/natlap/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/natlap/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [129]:
# Lemmatization (reducing a word to its base or dictionary form, the lemma) or stemming,
# a cruder alternative, would normally be applied at this point. It is skipped here because
# of technical issues; English has relatively little inflection, so the effect on the
# results is assumed to be small.

# from nltk.stem import WordNetLemmatizer
# from nltk.corpus import wordnet
# nltk.download('wordnet', download_dir=nltk.data.path[0], force=True)

### Data Preprocessing

In [130]:
# reading files
with open('speeches/1940-Churchill.txt', 'r', encoding='Latin-1') as file:
    churchill = file.read()
with open('speeches/1941-Hitler.txt', 'r', encoding='Latin-1') as file:
    hitler = file.read()
with open('speeches/1963-Mandela.txt', 'r', encoding='Latin-1') as file:
    mandela = file.read()
with open('speeches/2007-Putin.txt', 'r', encoding='Latin-1') as file:
    putin = file.read()
with open('speeches/2008-Ahtisaari.txt', 'r', encoding='Latin-1') as file:
    ahtisaari = file.read()
with open('speeches/2017-Trump.txt', 'r', encoding='Latin-1') as file:
    trump = file.read()
with open('speeches/2022-Zelensky.txt', 'r', encoding='Latin-1') as file:
    zelensky = file.read()
ahtisaari

'Your Majesties, Your Royal Highnesses, Excellencies,\nDistinguished members of the Norwegian Nobel Committee, Dear Friends and Colleagues around the world,\n\nI feel both humility and gratitude at receiving this yearÕs Nobel Peace Prize. It is the greatest recognition anybody working in this field can be given.\n\nWhat I am feeling now can only be compared with the joy I have felt when seeing the changes that peace has brought to the lives of people. When people, who have endured wars and crises, begin to build their lives in an atmosphere of peace Ð When faith in the future returns.\n\nI too was a child affected by a war. I was only two years old when, as a result of an agreement on spheres of interest between HitlerÕs Germany and StalinÕs Soviet Union, war broke out, forcing my family to leave soon thereafter the town of Viipuri. Like several hundred thousand fellow Karelians, we became refugees in our own country as great power politics caused the borders of Finland to be redrawn a
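
The `Õ` and `Ð` in the output above stand where an apostrophe and a dash would be expected, which is typical of bytes written in one single-byte encoding and read in another. A minimal sketch of one way to check this, assuming (the source does not confirm it) that the files were saved as Mac OS Roman: re-encode the mis-decoded string to recover the raw bytes, then decode them again. The `redecode` helper and the sample string are illustrative.

```python
def redecode(text, read_as="latin-1", saved_as="mac_roman"):
    """Undo a wrong decode: recover the raw bytes, then decode them again.

    `saved_as` is a guess about the file's real encoding, not a known fact.
    """
    return text.encode(read_as).decode(saved_as)


# Fragments copied from the notebook output above
sample = "this yearÕs Nobel Peace Prize ... an atmosphere of peace Ð When faith"
fixed = redecode(sample)
print(fixed)
```

If the guess holds for all seven files, opening them with `encoding='mac_roman'` should give the same result directly and make most of the character stripping in the next cell unnecessary.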

In [131]:
# strip punctuation, typographic quotes, and stray characters left over from decoding
replace_chars = ['.', ',', ':', ';', '"', "'", "‘", "’", "“", "”", "„", "‟", "‚", "‹", "›", "«", "»", "ñ", "Ò", "Ó", 'Õ', 'Ð']

for c in replace_chars:
    churchill = churchill.replace(c,'')
    hitler = hitler.replace(c,'')
    mandela = mandela.replace(c,'')
    putin = putin.replace(c,'')
    ahtisaari = ahtisaari.replace(c,'')
    trump = trump.replace(c,'')
    zelensky = zelensky.replace(c,'')
print(mandela)

I Have a Dream

I am happy to join with you today in what will go down in history as
the greatest demonstration for freedom in the history of our nation

Five score years ago a great American in whose symbolic shadow
we stand today signed the Emancipation Proclamation This momentous
decree came as a great beacon light of hope to millions of Negro slaves
who had been seared in the flames of withering injustice It came as a
joyous daybreak to end the long night of their captivity

But 100 years later the Negro still is not free One hundred years
later the life of the Negro is still sadly crippled by the manacles of
segregation and the chains of discrimination One hundred years later the
Negro lives on a lonely island of poverty in the midst of a vast ocean of
material prosperity One hundred years later the Negro is still languished
in the corners of American society and finds himself an exile in his own
land And so weve come here today to dramatize a shameful condition

In a sense weve c

In [132]:
# collective tokenization
text = churchill + ' ' + hitler + ' ' + mandela + ' ' + putin + ' ' + ahtisaari + ' ' + trump + " " + zelensky
tokens = text.split()
tokens = [token.lower() for token in tokens if token.isalpha()]

In [133]:
# separate tokenization
churchill_tokens = [token.lower() for token in churchill.split() if token.isalpha()]
hitler_tokens = [token.lower() for token in hitler.split() if token.isalpha()]
mandela_tokens = [token.lower() for token in mandela.split() if token.isalpha()]
putin_tokens = [token.lower() for token in putin.split() if token.isalpha()]
ahtisaari_tokens = [token.lower() for token in ahtisaari.split() if token.isalpha()]
trump_tokens = [token.lower() for token in trump.split() if token.isalpha()]
zelensky_tokens = [token.lower() for token in zelensky.split() if token.isalpha()]
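
To make the filter concrete, here is a small sketch of what `split()` followed by `isalpha()` keeps and what it drops. The sample sentence is made up for illustration. Note that `isalpha()` accepts any Unicode letter, so a stray `Ñ` left in the text survives and becomes `ñ` after lowercasing, which may be how `ñ` ends up among the tokens.

```python
sample = "But 100 years later the well-known speech Ñ still matters"

# Same rule as the cells above: whitespace split, keep purely alphabetic tokens
kept = [t.lower() for t in sample.split() if t.isalpha()]
dropped = [t for t in sample.split() if not t.isalpha()]

print(kept)
print(dropped)
```

Numbers and hyphenated words are discarded whole, so a token such as `well-known` contributes nothing to the later counts.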

In [134]:
# load the English stop words (words so common that they carry very little useful information)
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

For other languages, use the corresponding stop-word list, for example:

stop_words_fi = stopwords.words('finnish')

Output: ['olla', 'olen', 'olet', 'on', 'olemme', ...]

In [135]:
# drop stop words from the collective tokens
ctokens = [word for word in tokens if word not in stop_words]

In [136]:
# drop stop words from each speaker's tokens in the same way
churchill_ctokens = [word for word in churchill_tokens if word not in stop_words]
hitler_ctokens = [word for word in hitler_tokens if word not in stop_words]
mandela_ctokens = [word for word in mandela_tokens if word not in stop_words]
putin_ctokens = [word for word in putin_tokens if word not in stop_words]
ahtisaari_ctokens = [word for word in ahtisaari_tokens if word not in stop_words]
trump_ctokens = [word for word in trump_tokens if word not in stop_words]
zelensky_ctokens = [word for word in zelensky_tokens if word not in stop_words]
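
The seven comprehensions above repeat one operation. A sketch of the same filtering with the token lists held in a dictionary and the stop words in a `set`, which makes each membership test constant time on average. The stop words, speaker names, and token lists below are placeholders, not the notebook's data.

```python
# Placeholder data; in the notebook these would be stopwords.words('english')
# and the per-speaker token lists built earlier.
stop_words = {"the", "of", "we", "in", "and"}
speech_tokens = {
    "speaker_a": ["we", "shall", "fight", "in", "the", "fields"],
    "speaker_b": ["the", "peace", "of", "nations"],
}

# One comprehension filters every speaker's tokens
content_tokens = {
    name: [word for word in tokens if word not in stop_words]
    for name, tokens in speech_tokens.items()
}
print(content_tokens)
```

Keeping the speeches in one dictionary also means a new speaker can be added without writing another line per processing step.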

In [137]:
# lemmatize words
# lemmatizer = WordNetLemmatizer()
# for i in range(len(ctokens)):
#     ctokens[i] = lemmatizer.lemmatize(ctokens[i])

In [138]:
# collective frequency distribution
freq = nltk.FreqDist(ctokens)

# print all tokens:
# for key, val in freq.items():
    # print('%30s, %5d' % (str(key), val))

# print the first 20 distinct tokens, in order of first appearance (not by frequency)
# initialize a counter
counter = 0

# loop through items and stop after 20
for key, val in freq.items():
    print('%30s, %5d' % (str(key), val))
    counter += 1  
    if counter == 20:  
        break  

                        almost,     7
                          year,    39
                        passed,     5
                         since,    12
                           war,    89
                         began,    10
                       natural,     2
                            us,    84
                         think,    18
                         pause,     1
                       journey,     1
                     milestone,     1
                        survey,     2
                          dark,     3
                          wide,     2
                         field,     5
                          also,    46
                        useful,     1
                       compare,     3
                         first,    25
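
`nltk.FreqDist` is a subclass of `collections.Counter`, so iterating over `items()` yields words in the order they were first seen, not by frequency. A sketch of the same "first N" listing with `itertools.islice` instead of a manual counter, using a plain `Counter` and made-up tokens as a stand-in:

```python
from collections import Counter
from itertools import islice

tokens = ["almost", "year", "war", "year", "war", "war", "began"]
freq = Counter(tokens)

first_three = list(islice(freq.items(), 3))   # order of first appearance
top_two = freq.most_common(2)                 # order of frequency

for word, count in first_three:
    print('%30s, %5d' % (word, count))
```

The two orderings answer different questions: `islice` over `items()` previews the data, while `most_common` ranks it.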


In [139]:
# the 10 most frequent words and their counts
freq.most_common(10)

[('one', 117),
 ('world', 96),
 ('people', 90),
 ('war', 89),
 ('us', 84),
 ('german', 79),
 ('would', 68),
 ('nation', 62),
 ('great', 53),
 ('new', 50)]

In [140]:
# separate frequency distribution
churchill_freq = nltk.FreqDist(churchill_ctokens)
hitler_freq = nltk.FreqDist(hitler_ctokens)
mandela_freq = nltk.FreqDist(mandela_ctokens)
putin_freq = nltk.FreqDist(putin_ctokens)
ahtisaari_freq = nltk.FreqDist(ahtisaari_ctokens)
trump_freq = nltk.FreqDist(trump_ctokens)
zelensky_freq = nltk.FreqDist(zelensky_ctokens)

In [None]:
%pip install pandas
# or pip3 install pandas in terminal
import pandas as pd

In [142]:
# function to format the top 10 words and their counts separately
def format_top_words(freq_dist):
    return [f'{word} - {freq}' for word, freq in freq_dist.most_common(10)]

# creating a dictionary with formatted top 10 words for each person
top_words_formatted = {
    'Churchill': format_top_words(churchill_freq),
    'Hitler': format_top_words(hitler_freq),
    'Mandela': format_top_words(mandela_freq),
    'Putin': format_top_words(putin_freq),
    'Ahtisaari': format_top_words(ahtisaari_freq),
    'Trump': format_top_words(trump_freq),
    'Zelensky': format_top_words(zelensky_freq)
}

# convert the dictionary to a DataFrame
top_words = pd.DataFrame(top_words_formatted)

# display the DataFrame
display(top_words)

|   | Churchill | Hitler | Mandela | Putin | Ahtisaari | Trump | Zelensky |
|---|---|---|---|---|---|---|---|
| 0 | war - 37 | german - 65 | freedom - 20 | one - 24 | peace - 28 | america - 19 | thank - 15 |
| 1 | us - 22 | one - 51 | negro - 13 | countries - 23 | conflict - 13 | american - 11 | ukraine - 15 |
| 2 | air - 20 | world - 48 | let - 13 | international - 22 | work - 12 | people - 10 | ñ - 15 |
| 3 | british - 17 | people - 46 | dream - 12 | russia - 22 | parties - 11 | country - 9 | us - 14 |
| 4 | france - 17 | nation - 40 | one - 12 | would - 19 | also - 11 | one - 8 | world - 13 |
| 5 | would - 17 | germany - 40 | day - 12 | security - 16 | people - 9 | every - 7 | victory - 11 |
| 6 | upon - 16 | war - 34 | ring - 11 | world - 16 | conflicts - 9 | world - 6 | people - 11 |
| 7 | one - 14 | us - 34 | nation - 10 | weapons - 16 | international - 9 | great - 6 | russia - 10 |
| 8 | great - 13 | time - 27 | come - 10 | nuclear - 16 | process - 8 | back - 6 | battle - 10 |
| 9 | europe - 13 | great - 25 | every - 10 | also - 14 | middle - 8 | nation - 6 | freedom - 9 |


### Sentiment Analysis

In [143]:
# positive words file, downloaded from https://gist.github.com/mkulakowski2/4289437
with open('positive-words.txt', 'r') as f:
    pos = f.read().split()
print(f'Positive words total amount: {len(pos)}')

Positive words total amount: 2006


In [144]:
# negative words file, downloaded from https://gist.github.com/mkulakowski2/4289441
with open('negative-words.txt', 'r') as f:
    neg = f.read().split()
print(f'Negative words total amount: {len(neg)}')

Negative words total amount: 4783
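
Word-list files of this kind often begin with a commented header. A sketch of a loader that skips blank lines and comment lines and returns a `set`, so that later `in` tests are fast. The `;` comment convention and the `load_lexicon` name are assumptions for illustration and should be checked against the downloaded files.

```python
import io


def load_lexicon(handle, comment_prefix=";"):
    """Read one word per line, ignoring blank lines and comment lines."""
    words = set()
    for line in handle:
        word = line.strip().lower()
        if word and not word.startswith(comment_prefix):
            words.add(word)
    return words


# In-memory stand-in for open('positive-words.txt')
fake_file = io.StringIO("; header line\n\nbrilliant\ngenius\nvictory\n")
pos_words = load_lexicon(fake_file)
print(sorted(pos_words))
```

With a real file the call would be wrapped in `with open(...) as f:`, as in the cells above.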


In [145]:
# collective categories
countpositive = countnegative = countneutral = counttotal = 0

# to print all words:
for token in ctokens:
    counttotal += 1
    cat = ''
    if token in pos:
        cat += 'POS'
        countpositive += 1
    elif token in neg:
        cat += 'NEG'
        countnegative += 1
    else:
        countneutral += 1
#     print('Word:', token, cat)

# to print 20 words for each category:
import random

# collect tokens by sentiment category
positive_tokens = [token for token in ctokens if token in pos]
negative_tokens = [token for token in ctokens if token in neg]
neutral_tokens = [token for token in ctokens if token not in pos and token not in neg]

# print 20 random tokens for each sentiment category
print("20 Positive tokens:")
for token in random.sample(positive_tokens, 20):
    print('Word:', token, 'POS')

print("\n20 Negative tokens:")
for token in random.sample(negative_tokens, 20):
    print('Word:', token, 'NEG')

print("\n20 Neutral tokens:")
for token in random.sample(neutral_tokens, 20):
    print('Word:', token, 'NEUTRAL')

20 Positive tokens:
Word: brilliant POS
Word: genius POS
Word: luxury POS
Word: well POS
Word: strong POS
Word: great POS
Word: comprehensive POS
Word: dynamic POS
Word: favorable POS
Word: victory POS
Word: reasonable POS
Word: loyalty POS
Word: unity POS
Word: peaceful POS
Word: unity POS
Word: free POS
Word: miracle POS
Word: strongest POS
Word: gain POS
Word: warmth POS

20 Negative tokens:
Word: mistaken NEG
Word: overwhelming NEG
Word: breaking NEG
Word: conflict NEG
Word: overrun NEG
Word: dead NEG
Word: injustice NEG
Word: attack NEG
Word: flawed NEG
Word: attack NEG
Word: deprived NEG
Word: critic NEG
Word: pretend NEG
Word: death NEG
Word: backward NEG
Word: degenerate NEG
Word: tanks NEG
Word: incapable NEG
Word: strictly NEG
Word: lose NEG

20 Neutral tokens:
Word: imagine NEUTRAL
Word: one NEUTRAL
Word: witnessed NEUTRAL
Word: individual NEUTRAL
Word: austrian NEUTRAL
Word: prove NEUTRAL
Word: dreams NEUTRAL
Word: final NEUTRAL
Word: allow NEUTRAL
Word: fate NEUTRAL
Word: 

In [146]:
# separate categories
# Churchill
c_countpositive = c_countnegative = c_countneutral = c_counttotal = 0

for c in churchill_ctokens:
    c_counttotal += 1
    cat = ''
    if c in pos:
        cat += 'POS'
        c_countpositive += 1
    elif c in neg:
        cat += 'NEG'
        c_countnegative += 1
    else:
        c_countneutral += 1

# Hitler
h_countpositive = h_countnegative = h_countneutral = h_counttotal = 0

for h in hitler_ctokens:
    h_counttotal += 1
    cat = ''
    if h in pos:
        cat += 'POS'
        h_countpositive += 1
    elif h in neg:
        cat += 'NEG'
        h_countnegative += 1
    else:
        h_countneutral += 1

# Mandela
m_countpositive = m_countnegative = m_countneutral = m_counttotal = 0

for m in mandela_ctokens:
    m_counttotal += 1
    cat = ''
    if m in pos:
        cat += 'POS'
        m_countpositive += 1
    elif m in neg:
        cat += 'NEG'
        m_countnegative += 1
    else:
        m_countneutral += 1

# Putin
p_countpositive = p_countnegative = p_countneutral = p_counttotal = 0

for p in putin_ctokens:
    p_counttotal += 1
    cat = ''
    if p in pos:
        cat += 'POS'
        p_countpositive += 1
    elif p in neg:
        cat += 'NEG'
        p_countnegative += 1
    else:
        p_countneutral += 1

# Ahtisaari
a_countpositive = a_countnegative = a_countneutral = a_counttotal = 0

for a in ahtisaari_ctokens:
    a_counttotal += 1
    cat = ''
    if a in pos:
        cat += 'POS'
        a_countpositive += 1
    elif a in neg:
        cat += 'NEG'
        a_countnegative += 1
    else:
        a_countneutral += 1

# Trump
t_countpositive = t_countnegative = t_countneutral = t_counttotal = 0

for t in trump_ctokens:
    t_counttotal += 1
    cat = ''
    if t in pos:
        cat += 'POS'
        t_countpositive += 1
    elif t in neg:
        cat += 'NEG'
        t_countnegative += 1
    else:
        t_countneutral += 1

# Zelensky
z_countpositive = z_countnegative = z_countneutral = z_counttotal = 0

for z in zelensky_ctokens:
    z_counttotal += 1
    cat = ''
    if z in pos:
        cat += 'POS'
        z_countpositive += 1
    elif z in neg:
        cat += 'NEG'
        z_countnegative += 1
    else:
        z_countneutral += 1
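
The seven loops above differ only in the variable names. A sketch of a single helper that returns the three shares for any token list; the name `sentiment_shares` and the sample data are illustrative, not part of the notebook.

```python
def sentiment_shares(tokens, pos_words, neg_words):
    """Return the positive, negative and neutral shares of a token list.

    A token found in both lists counts as positive, matching the
    if/elif order of the loops above.
    """
    pos_set, neg_set = set(pos_words), set(neg_words)
    total = len(tokens)
    if total == 0:
        return {"positive": 0.0, "negative": 0.0, "neutral": 0.0}
    positive = sum(1 for t in tokens if t in pos_set)
    negative = sum(1 for t in tokens if t not in pos_set and t in neg_set)
    neutral = total - positive - negative
    return {
        "positive": positive / total,
        "negative": negative / total,
        "neutral": neutral / total,
    }


# Made-up tokens and word lists, for illustration only
shares = sentiment_shares(
    ["great", "victory", "war", "attack", "nation", "people", "year", "world"],
    pos_words=["great", "victory"],
    neg_words=["attack", "war"],
)
print("Positive=%.2f Negative=%.2f Neutral=%.2f"
      % (shares["positive"], shares["negative"], shares["neutral"]))
```

Converting the word lists to sets inside the helper avoids scanning a list of several thousand words for every token.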

In [147]:
# display the final count
print('Collective statistics: Positive=%.2f Negative=%.2f Neutral=%.2f' % 
      (countpositive/counttotal, countnegative/counttotal, countneutral/counttotal))
print('Churchill: Positive=%.2f Negative=%.2f Neutral=%.2f' % 
      (c_countpositive/c_counttotal, c_countnegative/c_counttotal, c_countneutral/c_counttotal))
print('Hitler: Positive=%.2f Negative=%.2f Neutral=%.2f' % 
      (h_countpositive/h_counttotal, h_countnegative/h_counttotal, h_countneutral/h_counttotal))
print('Mandela: Positive=%.2f Negative=%.2f Neutral=%.2f' % 
      (m_countpositive/m_counttotal, m_countnegative/m_counttotal, m_countneutral/m_counttotal))
print('Putin: Positive=%.2f Negative=%.2f Neutral=%.2f' % 
      (p_countpositive/p_counttotal, p_countnegative/p_counttotal, p_countneutral/p_counttotal))
print('Ahtisaari: Positive=%.2f Negative=%.2f Neutral=%.2f' % 
      (a_countpositive/a_counttotal, a_countnegative/a_counttotal, a_countneutral/a_counttotal))
print('Trump: Positive=%.2f Negative=%.2f Neutral=%.2f' % 
      (t_countpositive/t_counttotal, t_countnegative/t_counttotal, t_countneutral/t_counttotal))
print('Zelensky: Positive=%.2f Negative=%.2f Neutral=%.2f' % 
      (z_countpositive/z_counttotal, z_countnegative/z_counttotal, z_countneutral/z_counttotal))

Collective statistics: Positive=0.08 Negative=0.06 Neutral=0.86
Churchill: Positive=0.07 Negative=0.06 Neutral=0.87
Hitler: Positive=0.06 Negative=0.06 Neutral=0.88
Mandela: Positive=0.11 Negative=0.08 Neutral=0.81
Putin: Positive=0.06 Negative=0.06 Neutral=0.88
Ahtisaari: Positive=0.11 Negative=0.06 Neutral=0.82
Trump: Positive=0.12 Negative=0.04 Neutral=0.85
Zelensky: Positive=0.13 Negative=0.05 Neutral=0.82


## Conclusion:

The sentiment analysis covers speeches attributed to Winston Churchill, Adolf Hitler, Nelson Mandela, Vladimir Putin, Martti Ahtisaari, Donald Trump, and Volodymyr Zelensky. In every speech most words fall into the neutral category, meaning they appear in neither word list, and the shares of positive and negative words differ only modestly between speakers.

### Key Findings and Trends:

- Predominance of Neutral Sentiment: Neutral words make up 86% of the collective total. Since "neutral" here means only that a word appears in neither the positive nor the negative list, this is consistent with largely factual or descriptive content, but it may also reflect the limited coverage of the word lists.

- Highlight on Positivity: Donald Trump and Volodymyr Zelensky have the highest shares of positive words, at 12% and 13% respectively, and the lowest shares of negative words (4% and 5%). This may reflect a rhetorical strategy that leans towards optimism, resilience, or national pride.

- Mandela's Speech Analysis: As the printed text shows, the file loaded as `1963-Mandela.txt` contains "I Have a Dream", the speech Martin Luther King Jr. delivered in 1963, so the figures in the Mandela column describe that speech. It combines a high share of positive words (11%) with the highest share of negative words of any speech analyzed (8%). This is consistent with a rhetorical style that acknowledges serious challenges (the negative words) while also conveying hope and determination (the positive words).

- Ahtisaari's Positive Rhetoric: Martti Ahtisaari's speech also shows a high share of positive words (11%), matching the Mandela column, with a lower share of negative words (6% against 8%). This suggests that his speech likewise balances acknowledgment of challenges with a forward-looking, hopeful perspective.

- Insights into Rhetorical Styles: The speeches by Churchill, Hitler, and Putin are the most neutral (87-88%), with positive and negative words nearly balanced at 6-7% each. This could indicate a measured or formal approach, possibly reflecting the contexts in which they were speaking.
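
One way to compare the speakers on a single number is the gap between the positive and negative shares. A sketch using the rounded percentages printed in the results above; because the inputs are rounded, the gaps are approximate, and the gap itself is an illustrative measure, not part of the original analysis.

```python
# (positive %, negative %) per speaker, copied from the printed results
shares = {
    "Churchill": (7, 6),
    "Hitler": (6, 6),
    "Mandela": (11, 8),
    "Putin": (6, 6),
    "Ahtisaari": (11, 6),
    "Trump": (12, 4),
    "Zelensky": (13, 5),
}

# Gap in percentage points between positive and negative shares
gaps = {name: pos - neg for name, (pos, neg) in shares.items()}

# Largest gap first; ties keep the order of the dictionary above
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
for name, gap in ranked:
    print(f"{name:10s} {gap:+d} percentage points")
```

Working in whole percentage points keeps the arithmetic exact and avoids floating-point noise in the differences.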

### Contextual Importance in Sentiment Analysis:

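This analysis highlights the critical role of context in interpreting sentiment analysis results. A word list assigns the same label to a word wherever it occurs: in the samples above, `tanks` and `overwhelming` count as negative regardless of the sentence around them. The presence of positive or negative words must therefore be read within the broader framework of the speaker's intent, the historical circumstances, and the issues addressed in the speech.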

### Limitations and Insights:

- The analysis quantifies the emotional tone of the speeches, but counting words from fixed lists cannot capture the full depth of complex political rhetoric.
- A closer qualitative reading is needed to appreciate the nuances and implications of the language used by these influential figures.
- The differences in sentiment scores are modest (positive shares range from 6% to 13%), and each leader is represented by a single speech, so the differences may reflect the occasion and the historical and political context as much as a personal rhetorical style.

-------

### Named Entity Recognition (NER)

This process is an essential part of information extraction that allows you to automatically identify and categorize important information in text data.

Part-of-speech (POS) tagging is the process of labeling each word in a text (corpus) with its corresponding part of speech, such as noun, verb, adjective, etc. 

In [148]:
# label each token with its part of speech
tagged = nltk.pos_tag(ctokens)

# show the first 19 pairs (the slice end is exclusive; use tagged[:20] for 20)
display(tagged[0:19])

[('almost', 'RB'),
 ('year', 'NN'),
 ('passed', 'VBD'),
 ('since', 'IN'),
 ('war', 'NN'),
 ('began', 'VBD'),
 ('natural', 'JJ'),
 ('us', 'PRP'),
 ('think', 'VBP'),
 ('pause', 'IN'),
 ('journey', 'NN'),
 ('milestone', 'NN'),
 ('survey', 'NN'),
 ('dark', 'JJ'),
 ('wide', 'JJ'),
 ('field', 'NN'),
 ('also', 'RB'),
 ('useful', 'JJ'),
 ('compare', 'NN')]
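
A quick way to summarise tagger output is to count the tags. The sketch below uses the 19 pairs printed above as literal data, so it runs without NLTK. The tags follow the Penn Treebank tag set (`NN` noun, `JJ` adjective, `RB` adverb, `VBD` past-tense verb, `IN` preposition).

```python
from collections import Counter

# The (token, tag) pairs shown in the output above
tagged = [
    ('almost', 'RB'), ('year', 'NN'), ('passed', 'VBD'), ('since', 'IN'),
    ('war', 'NN'), ('began', 'VBD'), ('natural', 'JJ'), ('us', 'PRP'),
    ('think', 'VBP'), ('pause', 'IN'), ('journey', 'NN'), ('milestone', 'NN'),
    ('survey', 'NN'), ('dark', 'JJ'), ('wide', 'JJ'), ('field', 'NN'),
    ('also', 'RB'), ('useful', 'JJ'), ('compare', 'NN'),
]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)
for tag, count in tag_counts.most_common():
    print(f"{tag:4s} {count}")
```

Some labels in this sample look doubtful, such as `pause` tagged `IN` and `compare` tagged `NN`. One possible reason is that the tagger is given lowercased tokens with the stop words already removed, so it sees little of the original sentence context.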

In [149]:
# pip install svgling
import svgling

In [150]:
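# named entity recognition (NER) on the tagged tokens
# (ne_chunk needs NLTK's named-entity chunker and word-list resources, fetched with nltk.download)
entities = nltk.chunk.ne_chunk(tagged)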

# to print table:
# entities

In [151]:
# this code should create a window displaying a graphical representation of a parsed tree
# this requires installing tkinter, I won't do that here for now
# from nltk.corpus import treebank
# tree = treebank.parsed_sents('wsj_0001.mrg')[0]
# tree.draw()