# NLP Analysis of Mein Kampf
---

In this experiment we use sentiment analysis provided by TextBlob and topic modelling provided by Gensim. The following libraries are used:

- NLTK
- TextBlob
- Gensim

In the paper, these general-purpose technologies are tested against state-of-the-art equivalents:

- Watson API: ``https://www.ibm.com/demos/live/natural-language-understanding/self-service/home``
- Google API: ``https://cloud.google.com/natural-language/``

Sentiment analysis is a technique for estimating the sentiment of a body of text, either by scoring individual words or by classifying the text relative to a reference dataset. TextBlob uses a sentiment analysis algorithm provided by the Computational Linguistics and Psycholinguistics Research Centre (CLiPS). The method is based on crowdsourced scores for individual words, from which sentiment is derived by averaging the word scores in a string. The result is a numerical value between -1.0, indicating negativity, and +1.0, indicating positivity.
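
The averaging idea described above can be illustrated with a minimal sketch. The lexicon values and the `average_polarity` helper are hypothetical and exist only for this illustration; TextBlob's analyser relies on a far larger lexicon and may apply further rules, so this is not its implementation.

```python
# Toy word-score lexicon. These values are invented for illustration only;
# they are NOT the scores shipped with TextBlob.
TOY_LEXICON = {"great": 0.8, "best": 1.0, "terrible": -1.0, "sad": -0.5}


def average_polarity(text, lexicon=TOY_LEXICON):
    """Average the scores of the words found in the lexicon (0.0 if none match)."""
    # Lowercase each whitespace-separated token and strip surrounding punctuation.
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


print(average_polarity("The best day, a great day."))
print(average_polarity("A sad and terrible day."))
print(average_polarity("The train left at noon."))
```

Because only the scored words contribute, a sentence is rated by its vocabulary rather than by what it argues, which is worth keeping in mind when reading the scores discussed below.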

Topic modelling is a statistical method used to reveal latent themes in a text by grouping related terms together into topics. For example, the terms 'patient', 'doctor', 'disease', 'cancer', and 'health' would represent the topic 'healthcare'.
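
Topic models of this kind take a bag-of-words representation as input: each document is reduced to counts of the terms it contains. The following is a minimal standard-library sketch of that step, similar in spirit to what `corpora.Dictionary` and `doc2bow` do in the code further down. The helper names `build_dictionary` and `doc_to_bow` are hypothetical, and Gensim's own id assignment may differ.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign an integer id to each unique term, in order of first appearance."""
    term_to_id = {}
    for doc in docs:
        for term in doc:
            term_to_id.setdefault(term, len(term_to_id))
    return term_to_id


def doc_to_bow(doc, term_to_id):
    """Return sorted (term_id, count) pairs for the known terms in one document."""
    counts = Counter(term_to_id[t] for t in doc if t in term_to_id)
    return sorted(counts.items())


# Two tiny, already tokenised example documents.
docs = [["patient", "doctor", "disease", "patient"], ["doctor", "health", "cancer"]]
term_to_id = build_dictionary(docs)
bows = [doc_to_bow(d, term_to_id) for d in docs]
print(term_to_id)
print(bows)
```

The topic model never sees word order, only these counts, so topics emerge from which terms tend to occur in the same documents.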

In this notebook, these techniques are applied to Hitler's manifesto, Mein Kampf, as a way to test each of these technologies.

The text we use is from the following website and is reproduced in the repo folder.

``http://www.hitler.org/writings/Mein_Kampf/``

## Discussion of Results

For overall sentiment, TextBlob scores *Mein Kampf* consistently higher than *I Have a Dream*. From these results, two statements from *Mein Kampf* are of particular interest.

``A state which in this age of racial poisoning dedicates itself to the care of its best racial elements must some day become lord of the earth``

That this statement yields a positive score of +1, the highest possible sentiment, is a somewhat troubling result. Reading this statement in context, Hitler others immigrants by equating immigration with racial poisoning, whereby "All great cultures of the past perished only because the originally creative race died out from blood poisoning". He is, therefore, arguing to preserve a state's "racial purity" by tending to the indigenous people he regards as a state's "best racial elements". In other words, he creates a self-other gradient between the two groups by dehumanising immigrants as an impure outgroup while elevating the purity of those he presents as his indigenous people, the ingroup audience for whom he writes.

This statement should therefore be seen as the essential premise of racism, inciting violence or hatred against immigrants, which in this case ultimately led to genocide.

``Like a swarm of hornets they tackled disturbers at our meetings, regardless of superiority of numbers, however great, indifferent to wounds and bloodshed, inspired with the great idea of blazing a trail for the sacred mission of our movement``

The second of these statements refers to "defensive squads" of men tasked with maintaining order at National Socialist meetings. Reading this statement in context, Hitler legitimises the violent way in which they maintained order by praising their actions, yet the statement yields a score of +0.8. How these numerical values relate to the specific features of hate speech is unclear.

While the topic modelling results seem to be meaningful for each particular text, whether or not any of them constitutes hatefulness is indeterminable. The Gensim results are inconclusive, since there do not appear to be definitive terms for detecting topics related to hate speech.
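
To compare topic terms across texts, the topic strings printed by the model can be split into weight and term pairs. The sketch below assumes the `weight*"term" + ...` format that appears in this notebook's output; `parse_topic` is a hypothetical helper, not part of Gensim.

```python
import re

# Matches one weight*"term" component of a printed topic string.
TOPIC_TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Split a printed topic string into (weight, term) pairs."""
    return [(float(weight), term) for weight, term in TOPIC_TERM.findall(topic_string)]


print(parse_topic('0.016*"state" + 0.011*"must"'))
```

Parsed this way, the weights can be inspected directly, which makes it easier to see how weakly any single term characterises a topic.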

In [3]:
%%time

### analyse the whole document to extract the topics and sentiment
# https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
# https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24

import os
from textblob import TextBlob
from collections import Counter
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
import gensim
from gensim import corpora
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()
Lda = gensim.models.ldamodel.LdaModel

### open file
filepath  = r"C:\Users\Steve\OneDrive - University of Southampton\CulturalViolence\KnowledgeBases\data"

with open(os.path.join(filepath, "MeinKampfv2.txt"), 'r') as text:
    speech_text = text.readlines()

# define common functions
def clean(doc): # text cleaning function
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop]) # remove stopwords
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude) # remove punctuation
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split()) #lemmatize words
    return normalized

def Most_Common(lst, quantity):
    data = Counter(lst)
    return data.most_common(quantity)

# speech pre-processing to a bag-of-words
i = TextBlob(''.join(speech_text))
doc_clean = [clean(doc).split() for doc in speech_text] # clean speech to a bag of words
dictionary = corpora.Dictionary(doc_clean) # Creating the term dictionary for the speech, where every unique term is assigned an index.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean] # Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

print('Overall Text Topics: ', end='')
print(ldamodel.print_topics(num_topics=10, num_words=3))
print('-----')

print('Text Overall Sentiment: ', end='')
print(i.sentiment.polarity)
print('-----')

print('Text Common Nouns: ', end='')
print(Most_Common(i.noun_phrases, 15))
print('-----')

flat_list = []

for sublist in doc_clean:
    for item in sublist:
        flat_list.append(item)


print('Speech common words: ', end='')
print(Most_Common(flat_list, 5))
print('-----')

# generate speech summary data

counter = 0
most_pos_para = ''
most_neg_para = ''
pos_paras = []
neg_paras = []
most_pos_score = 0
most_neg_score = 0

for section in speech_text:
    paragraph = TextBlob(section)
    polarity = paragraph.sentiment.polarity # look up the polarity once per paragraph
    
    if most_pos_score < polarity < 1:
        most_pos_para = paragraph
        most_pos_score = polarity
    elif -1 < polarity < most_neg_score:
        most_neg_para = paragraph
        most_neg_score = polarity
    elif polarity == 1:
        pos_paras.append(section)
    elif polarity == -1:
        neg_paras.append(section)
    
        
print('With a score of ', str(most_pos_score), ', the most positive paragraph less than 1 is:')
print(most_pos_para)
print('-----')
print('With a score of ', str(most_neg_score), ', the most negative paragraph greater than -1 is:')
print(most_neg_para)
print('-----')
print('The paragraphs with a sentiment score of 1 are:')
for i in pos_paras:
    print(i)
print()
print('The paragraphs with a sentiment score of -1 are:')
for i in neg_paras:
    print(i)
print('-----')

# generate paragraph summary data by iterating through each paragraph

for section in speech_text:
    counter += 1
    paragraph = TextBlob(section)
    
    doc_clean = [clean(section).split()] # clean section
    dictionary = corpora.Dictionary(doc_clean) # Creating the term dictionary of each paragraph, where every unique term is assigned an index. 
    
    if len(dictionary) == 0: ## skip over empty paragraphs; otherwise the model fitted on the previous paragraph would be printed again
        continue
    
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean] # Converting paragraph into Document Term Matrix using dictionary prepared above.
    ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)
    
    print('Paragraph %d' %counter)
    print('    Paragraph topics: ', end='')
    print(ldamodel.print_topics(num_topics=5, num_words=2))
    print('    Paragraph sentiment: ', end='')
    print(paragraph.sentiment.polarity)
    print('    Paragraph nouns: ', end='')
    print(paragraph.noun_phrases)
    print('    Paragraph common words: ', end='')
    
    for w in doc_clean:
        print(Most_Common(w, 5))
        
    print()

Overall Text Topics: [(0, '0.009*"movement" + 0.009*"one"'), (1, '0.011*"german" + 0.008*"germany"'), (2, '0.016*"state" + 0.011*"must"')]
-----
Text Overall Sentiment: 0.0964078892045762
-----
Text Common Nouns: [('germany', 162), ('reich', 124), ('france', 91), ('revolution', 66), ('england', 66), ('jew', 55), ('marxist', 54), ('marxism', 52), ('german people', 48), ('europe', 46), ('russia', 44), ('movement', 43), ("people 's state", 38), ('munich', 36), ('germans', 36)]
-----
Speech common words: [('state', 577), ('people', 520), ('would', 517), ('must', 458), ('one', 430)]
-----
With a score of  0.8 , the most positive paragraph less than 1 is:
Like a swarm of hornets they tackled disturbers at our meetings, regardless of superiority of numbers, however great, indifferent to wounds and bloodshed, inspired with the great idea of blazing a trail for the sacred mission of our movement. 

-----
With a score of  -0.65 , the most negative paragraph greater than -1 is:
All in all, the co