### Exercise 2

Take whichever English text you want and calculate word frequency:


In [1]:
# Import relevant libraries:

import textract

from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

from nltk.sentiment.vader import SentimentIntensityAnalyzer

from textblob import TextBlob


In [2]:
# Extract text file content into a string object using the textract library:

text = str(textract.process("us_constitution.txt"))

# Tokenise each word (make them into separate, countable entities):

tokenized_word = word_tokenize(text)

# Add up words to obtain word count. Then examine the most common words:

fdist = FreqDist(tokenized_word)

fdist.most_common(10)

# Note that most of the most common words without filtering are irrelevant to obtain valuable information, such as 
# dots, commas or connector words. Hence, we must now filter them out

[('the', 678),
 (',', 611),
 ('of', 506),
 ('shall', 306),
 ('.', 270),
 ('and', 264),
 ('to', 190),
 ('be', 179),
 ('or', 160),
 ('in', 148)]

## Level 2

### Exercise 3

Remove the stopwords and do stemming to your data:


In [8]:
# Remove non-alphanumerical values:

tokenized_word= [word for word in tokenized_word if word.isalnum()]

In [9]:
# To remove stopwords, we first set the language to English:

stop_words = set(stopwords.words("english"))

# We create a for loop that adds to a new array the filtered text:

filtered_stop=[]

for w in tokenized_word:
    if w not in stop_words:
        filtered_stop.append(w)
    

In [10]:
# With another for loop we now get our word stems:
ps = PorterStemmer()

filtered_stem = []

for w in filtered_stop:
    filtered_stem.append(ps.stem(w))
    

In [11]:
# Frequency count of the filtered text:

fdist_stem = FreqDist(filtered_stem)

fdist_stem.most_common(20)

# After filtering, the frequency count appears much more valuable. States, President, Unity, Congress all seem like
# very relevant parts of a constitution.

# It can be noticed that currently relevant political concepts are not explicitly mentioned in the Constitution.
# Amongst them: democracy, rights, minority, parties, nation...

[('shall', 306),
 ('state', 218),
 ('presid', 113),
 ('unit', 86),
 ('the', 80),
 ('congress', 63),
 ('offic', 57),
 ('amend', 54),
 ('law', 52),
 ('senat', 51),
 ('person', 49),
 ('may', 44),
 ('hous', 42),
 ('power', 41),
 ('vote', 41),
 ('repres', 35),
 ('year', 34),
 ('ratifi', 34),
 ('constitut', 33),
 ('articl', 29)]

## Level 3

### Exercise 4

Perform Sentiment Analysis on your dataset:

In [None]:
# The resources provided indicated the use of a classification model that could be compared with a pre-existing 
# baseline, which is not the case of the US constitution. Instead, the sentiment analysis tools from NLTK and Text 
# Blob will be used:

In [6]:
# Through the Sentiment Intensity Analyser module of NLTK we can obtain the sentiment of the text overall

vader_score = SentimentIntensityAnalyzer().polarity_scores(text)

print(vader_score)

{'neg': 0.061, 'neu': 0.828, 'pos': 0.111, 'compound': 0.9999}


In [7]:
# The TextBlob library provides another insight into sentiment analysis:

blob_score = TextBlob(text)

blob_score.sentiment


Sentiment(polarity=0.060844770455881596, subjectivity=0.3702974104085216)

In [None]:
# When evaluating the results, we should expect a high degree of objectivity/neutrality from a legal, objective text 

# According to the NLTK library analysis, the largest part of the text is neutral in sentiment. 

# According to the TextBlob library, the text stands in the middle of the polarity scale (from -1 to 1, negative 
# being negative sentiments and positive being positive), indicating a very low degree of polarity. It also stands 
# low in the scale of subjectivity (from 0, most objective, to 1, most subjective), although not excessively so.