# NLP Analysis of I Have A Dream
---
In this experiment we use sentiment analysis provided by TextBlob and topic modelling. The following libraries are used:

- NLTK
- TextBlob
- Gensim

Sentiment analysis is a technique for estimating the sentiment of a body of text, either by scoring individual words or by classifying the text against a reference dataset. TextBlob uses a sentiment analysis algorithm from the Computational Linguistics and Psycholinguistics Research Centre (CLiPS). The method relies on crowdsourced scores for individual words, and the sentiment of a string is derived by averaging the scores of its words. The result is a polarity value between -1.0 (most negative) and +1.0 (most positive).
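As a rough illustration of the averaging idea described above, the sketch below scores a string by taking the mean of per-word scores. The lexicon values and the function name `average_polarity` are hypothetical, chosen only for illustration; TextBlob's actual lexicon and scoring rules are more involved and are not reproduced here.

```python
# Hypothetical word scores in [-1.0, +1.0]; NOT TextBlob's real lexicon.
TOY_LEXICON = {"good": 0.7, "happy": 0.8, "terrible": -1.0}


def average_polarity(text, lexicon=TOY_LEXICON):
    """Return the mean score of the words found in the lexicon (0.0 if none)."""
    # Lower-case the text and strip common punctuation from each word.
    words = [w.strip(".,;:!?'\"") for w in text.lower().split()]
    # Only words that have a score contribute to the average.
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


print(average_polarity("A good day."))               # one scored word
print(average_polarity("A good but terrible day."))  # two scored words, averaged
print(average_polarity("No scored words here."))     # nothing to average
```

In this sketch, words missing from the lexicon are ignored rather than counted as zero, so they do not dilute the average.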

Topic modelling is a statistical method that reveals latent themes in a text by grouping related terms into topics. For example, the terms 'patient', 'doctor', 'disease', 'cancer', and 'health' together represent the topic 'healthcare'.
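In practice, a fitted topic is reported as a weighted list of terms rather than as a label such as 'healthcare'; the label is the reader's interpretation. The LDA output later in this notebook prints each topic as a string such as `0.023*"freedom" + 0.017*"must" + 0.013*"justice"`. The sketch below, which uses only the standard library, turns such a string into (term, weight) pairs. The helper name `parse_topic` is hypothetical and the regular expression assumes the format shown in that output.

```python
import re

# Matches one weighted term, e.g. 0.023*"freedom"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Convert a printed topic string into a list of (term, weight) pairs."""
    return [(term, float(weight)) for weight, term in TERM_PATTERN.findall(topic_string)]


# Topic string copied from the overall-topics output of this notebook.
example = '0.023*"freedom" + 0.017*"must" + 0.013*"justice"'
for term, weight in parse_topic(example):
    print(term, weight)
```

Having the terms and weights as data makes it easier to compare topics between paragraphs than reading the raw strings.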

In this notebook, these techniques are applied to Martin Luther King Jr.'s speech, I Have a Dream, as a way to test each of these technologies.

## Testing the 'I Have a Dream' statements

Before the main experiment, the six "I have a dream" statements listed below are tested for sentiment.

In [3]:
%%time

from textblob import TextBlob
import os

filepath  = r"C:\Users\Steve\OneDrive - University of Southampton\CulturalViolence\KnowledgeBases\data"

dream_text = [
    'I have a dream that one day this nation will rise up and live out the true meaning of its creed: We hold these truths to be self-evident, that all men are created equal.',
    'I have a dream that one day on the red hills of Georgia, the sons of former slaves and the sons of former slave owners will be able to sit down together at the table of brotherhood.',
    'I have a dream that one day even the state of Mississippi, a state sweltering with the heat of injustice, sweltering with the heat of oppression, will be transformed into an oasis of freedom and justice.',
    'I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character.',
    'I have a dream that one day down in Alabama, with its vicious racists, with its governor having his lips dripping with the words of interposition and nullification, that one day right down in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers.',
    'I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.'
]

for dream in dream_text:
    blob = TextBlob(dream)  # each statement is already a single string
    print('Sentiment: ', blob.sentiment.polarity)
    print(dream)
    print('-----')

Sentiment:  0.1621212121212121
I have a dream that one day this nation will rise up and live out the true meaning of its creed: We hold these truths to be self-evident, that all men are created equal.
-----
Sentiment:  0.06888888888888889
I have a dream that one day on the red hills of Georgia, the sons of former slaves and the sons of former slave owners will be able to sit down together at the table of brotherhood.
-----
Sentiment:  0.0
I have a dream that one day even the state of Mississippi, a state sweltering with the heat of injustice, sweltering with the heat of oppression, will be transformed into an oasis of freedom and justice.
-----
Sentiment:  -0.025568181818181823
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character.
-----
Sentiment:  -0.11215728715728718
I have a dream that one day down in Alabama, with its vicious racists, with its governor having his lips

## Testing Sentiment Against Noun Phrases

In this next test, the noun phrases from the following statement are scored for sentiment individually, to show that references to 'black' as a race consistently score more negatively than the equivalent references to 'white'.

I have a dream that one day down in Alabama, with its vicious racists, with its governor having his lips dripping with the words of interposition and nullification, that one day right down in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers

In [2]:
%%time

from textblob import TextBlob

dream_text = ['little black boys',
              'black girls',
              'little white boys',
              'white girls',
              'black',
              'white',
              'little'
    ]

for dream in dream_text:
    blob = TextBlob(dream)  # each phrase is already a single string
    print('Sentiment: ', blob.sentiment.polarity)
    print(dream)
    print('-----')

Sentiment:  -0.17708333333333331
little black boys
-----
Sentiment:  -0.16666666666666666
black girls
-----
Sentiment:  -0.09375
little white boys
-----
Sentiment:  0.0
white girls
-----
Sentiment:  -0.16666666666666666
black
-----
Sentiment:  0.0
white
-----
Sentiment:  -0.1875
little
-----
Wall time: 20.9 ms
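The scores above are consistent with the averaging described in the introduction. The following check, using only the standard library, recomputes each phrase score as the mean of the single-word scores reported above. It assumes that 'boys' and 'girls' carry no score and are ignored, while 'white' is counted with a score of 0.0; this is an inference from the numbers, not something confirmed from TextBlob's source. The helper name `mean_of_scored_words` is hypothetical.

```python
import math

# Single-word polarities copied from the output above.
word_scores = {"little": -0.1875, "black": -0.16666666666666666, "white": 0.0}

# Phrase polarities copied from the output above.
reported = {
    "little black boys": -0.17708333333333331,
    "black girls": -0.16666666666666666,
    "little white boys": -0.09375,
    "white girls": 0.0,
}


def mean_of_scored_words(phrase):
    """Average the scores of the words in the phrase that have a known score."""
    scores = [word_scores[w] for w in phrase.split() if w in word_scores]
    return sum(scores) / len(scores)


for phrase, polarity in reported.items():
    estimate = mean_of_scored_words(phrase)
    matches = math.isclose(estimate, polarity, abs_tol=1e-12)
    print(f"{phrase!r}: reported={polarity}, mean of word scores={estimate}, match={matches}")
```

Under that assumption, the gap between the 'black' and 'white' phrases comes entirely from the score attached to the single word 'black', and 'little' lowers both 'boys' phrases by the same mechanism.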


## The Main Experiment
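
The cell below cleans each paragraph and converts it to a bag of words with `corpora.Dictionary` and `doc2bow` before fitting LDA. The following sketch approximates that conversion with the standard library only, to show what the resulting document-term data looks like. The helper names `build_dictionary` and `doc_to_bow` are hypothetical, and the integer ids Gensim assigns to terms may differ from the ones produced here.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign an integer id to every unique token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Represent one tokenised document as sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Two tiny, already-cleaned example documents.
docs = [["freedom", "ring", "freedom"], ["dream", "freedom"]]
token2id = build_dictionary(docs)
bows = [doc_to_bow(d, token2id) for d in docs]

print(token2id)
print(bows)
```

Each document becomes a sparse list of term ids and counts, so word order is discarded and only term frequencies reach the topic model.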

In [1]:
%%time

### analyse the whole document to extract the topics and sentiment
# https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
# https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24

# Requires the NLTK 'stopwords' and 'wordnet' corpora (install with nltk.download)
import os
import string
from collections import Counter

import gensim
from gensim import corpora
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from textblob import TextBlob
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()
Lda = gensim.models.ldamodel.LdaModel

### open file
filepath  = r"C:\Users\Steve\OneDrive - University of Southampton\CulturalViolence\KnowledgeBases\data"

with open(os.path.join(filepath, "IHaveADream.txt"), 'r') as text:
    speech_text = text.readlines()

# define common functions
def clean(doc): # text cleaning function
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop]) # remove stopwords
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude) # remove punctuation
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split()) #lemmatize words
    return normalized

def Most_Common(lst, quantity):
    data = Counter(lst)
    return data.most_common(quantity)

# speech pre-processing to a bag-of-words
i = TextBlob(''.join(speech_text))
doc_clean = [clean(doc).split() for doc in speech_text] # clean speech to a bag of words
dictionary = corpora.Dictionary(doc_clean) # Creating the term dictionary for the speech, where every unique term is assigned an index.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean] # Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

print('Overall Text Topics: ', end='')
print(ldamodel.print_topics(num_topics=10, num_words=3))
print('-----')

print('Text Overall Sentiment: ', end='')
print(i.sentiment.polarity)
print('-----')

print('Text Common Nouns: ', end='')
print(Most_Common(i.noun_phrases, 15))
print('-----')

flat_list = []

for sublist in doc_clean:
    for item in sublist:
        flat_list.append(item)


print('Speech common words: ', end='')
print(Most_Common(flat_list, 5))
print('-----')

# generate speech summary data

counter = 0
most_pos_para = ''
most_neg_para = ''
pos_paras = []
neg_paras = []
most_pos_score = 0
most_neg_score = 0

for section in speech_text:
    paragraph = TextBlob(section)
    polarity = paragraph.sentiment.polarity  # compute the score once per paragraph

    if most_pos_score < polarity < 1:
        most_pos_para = paragraph
        most_pos_score = polarity
    elif -1 < polarity < most_neg_score:
        most_neg_para = paragraph
        most_neg_score = polarity
    elif polarity == 1:
        pos_paras.append(section)
    elif polarity == -1:
        neg_paras.append(section)
    
        
print('With a score of ', str(most_pos_score), ', the most positive paragraph less than 1 is:')
print(most_pos_para)
print('-----')
print('With a score of ', str(most_neg_score), ', the most negative paragraph greater than -1 is:')
print(most_neg_para)
print('-----')
print('The paragraphs with a sentiment score of 1 are:')
for i in pos_paras:
    print(i)
print()
print('The paragraphs with a sentiment score of -1 are:')
for i in neg_paras:
    print(i)
print('-----')

# generate paragraph summary data by iterating through each paragraph

for section in speech_text:
    counter += 1
    paragraph = TextBlob(section)
    
    doc_clean = [clean(section).split()] # clean section
    dictionary = corpora.Dictionary(doc_clean) # Creating the term dictionary of each paragraph, where every unique term is assigned an index. 
    
    if len(dictionary) > 0: ## fit a model only for non-empty paragraphs
        doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean] # Converting paragraph into Document Term Matrix using dictionary prepared above.
        ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)
        topics = ldamodel.print_topics(num_topics=5, num_words=2)
    else:
        topics = [] # an empty paragraph has no topics; do not reuse the previous paragraph's model
    
    print('Paragraph %d' %counter)
    print('    Paragraph topics: ', end='')
    print(topics)
    print('    Paragraph sentiment: ', end='')
    print(paragraph.sentiment.polarity)
    print('    Paragraph nouns: ', end='')
    print(paragraph.noun_phrases)
    print('    Paragraph common words: ', end='')
    
    for w in doc_clean:
        print(Most_Common(w, 5))
        
    print()



Overall Text Topics: [(0, '0.023*"freedom" + 0.017*"must" + 0.013*"justice"'), (1, '0.019*"negro" + 0.019*"freedom" + 0.019*"back"'), (2, '0.023*"dream" + 0.015*"come" + 0.015*"today"')]
-----
Text Overall Sentiment: 0.14083845280536458
-----
Text Common Nouns: [('negro', 15), ('freedom ring', 11), ('america', 5), ('mississippi', 4), ('god', 3), ("'s children", 3), ('alabama', 3), ('georgia', 3), ('black men', 2), ('white men', 2), ('insufficient funds', 2), ('police brutality', 2), ('york', 2), ('free', 2), ('score years', 1)]
-----
Speech common words: [('freedom', 20), ('negro', 15), ('one', 13), ('let', 13), ('ring', 12)]
-----
With a score of  0.8 , the most positive paragraph less than 1 is:
But we refuse to believe that the bank of justice is bankrupt. We refuse to believe that there are insufficient funds in the great vaults of opportunity of this nation. So we have come to cash this check- a check that will give us upon demand the riches of freedom and the security of justice.