# Install NLTK and import modules

First, we import all the modules we need.
Remember that the first time you use NLTK in a new environment, you also need to download some data packages (the ones commented out below).

In [3]:
import nltk , nltk.data , collections, string , sys
from nltk.text import Text 
from nltk.probability import FreqDist , ConditionalFreqDist
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.collocations import *
# nltk.download('punkt')
# nltk.download('gutenberg')
# nltk.download('genesis')
# nltk.download('inaugural')
# nltk.download('popular')

We start by opening our txt file with the built-in **open** function. This time we use the file method **read()** to access its content.

We immediately clean the text so that we can perform the first analysis: 
 * use **lower()** to transform words to lowercase;
 * one way to remove punctuation is the **join** method (called on an empty string, which is used to join the pieces) applied to all the characters of the text *if* they are not included in the set *exclude* containing punctuation characters - i.e., such characters are replaced with an empty string;
 * we tokenize words;
 * we create a Text object, which gives us shortcuts for computing statistics.
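The punctuation step above can be tried on a made-up sentence. This is only a sketch under stated assumptions: the sample sentence is invented (it is not taken from `military.txt`) and plain `str.split()` stands in for `word_tokenize`. It compares deleting punctuation with replacing it by a space. One possible reason for glued tokens such as `robesmade` or `bandthat` in the concordance output further down is that a dash between two words was deleted, but we cannot verify this without the raw file.

```python
import string

# Made-up sample sentence (an assumption, not taken from military.txt)
sample = "Naval uniforms and academic robes--made to measure!"
lowered = sample.lower()

# 1) Delete punctuation with join, as described in the steps above
exclude = set(string.punctuation)
deleted = ''.join(ch for ch in lowered if ch not in exclude)

# 2) The same deletion written with str.translate
delete_table = str.maketrans('', '', string.punctuation)
deleted_translate = lowered.translate(delete_table)

# 3) Alternative: replace every punctuation character with a space
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = lowered.translate(space_table)

# str.split() is a simplified stand-in for word_tokenize
tokens_deleted = deleted.split()
tokens_spaced = spaced.split()
print(tokens_deleted)  # 'robesmade' is one glued token
print(tokens_spaced)   # 'robes' and 'made' stay separate
```

Whether deleting or replacing is preferable depends on the text: replacing with a space keeps dash-separated words apart, but it also splits contractions such as `don't` into two tokens.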

In [19]:
with open('military.txt', errors='ignore') as txtFile: # open .txt file
    text=txtFile.read().lower() # read the file and transform it in lowercase
    exclude = set(string.punctuation) # set of characters to be deleted
    no_punct = ''.join(ch for ch in text if ch not in exclude) # a fancy way to remove characters from the text
    textList = word_tokenize(no_punct) # tokenize words and create a list
    textList = Text(textList) # create a Text object
    print(textList)

<Text: the profound impression made upon a crowded congregation...>


# Define a function that prints statistics
We define a function that prints the following simple statistics:
 * **Total number of words**: since we tokenized the text, removed punctuation and gathered everything in a list, the total number of words is the length of the list, i.e., **len()**
 * **Lexical diversity**: `numberOfUniqueWords / totalNumberOfWords`: the number of unique words is obtained by transforming our list of tokens into a set, which keeps only the unique terms appearing in the text. We then use the arithmetic operator **/** to divide the two integers
 * **Occurrences of the term `military`**: we check how many times an item appears in the list of tokens by using the method **count(term)**
 * **Percentage of the term `military` with respect to the total number of words**: `100 * occurrencesOfTerm / totalNumberOfWords`: we use the operators `*` and `/` to perform the operation
 * **Concordance of the term `military`**: since we created a Text object, we can use the NLTK method **concordance(term)** to display the list of concordances. We specify three arguments: the term to be searched, the width in characters of each displayed line, and the number of lines we want to be displayed. We pass **sys.maxsize**, a very large integer from the module **sys**, as the number of lines so that every match is displayed. Be aware that this method already prints its results, therefore it does not have to be wrapped in the **print()** function
 * **Other words that appear in the same contexts as `military`**: we use the NLTK method **similar()** to list the words that occur in contexts similar to those of the term.
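The arithmetic behind the first four statistics can be checked by hand on a tiny example. The token list below is made up for illustration, and a plain Python list is used instead of an NLTK Text object.

```python
# Made-up token list (an assumption, not taken from military.txt)
tokens = "the military band played and the band marched".split()

total_words = len(tokens)            # total number of words
unique_words = len(set(tokens))      # a set keeps each term only once
lexical_diversity = unique_words / total_words

occurrences = tokens.count("band")   # how many times the term appears
percentage = 100 * occurrences / total_words

print(total_words, unique_words, lexical_diversity)
print(occurrences, percentage)
```

Here 6 unique words out of 8 give a lexical diversity of 0.75, and 2 occurrences of `band` out of 8 words give 25%.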

In [25]:
def analyseData(fileName, term):
    """ given a text file and a keyword, print several analyses """
    with open(fileName, errors='ignore') as txtFile: # open the .txt file passed as argument
        text=txtFile.read().lower() # read the file and transform it in lowercase
        exclude = set(string.punctuation) # set of characters to be deleted
        no_punct = ''.join(ch for ch in text if ch not in exclude) # a fancy way to remove characters from the text
        textList = word_tokenize(no_punct) # tokenize words and create a list
        textList = Text(textList) # create a Text object
    # some statistics, e.g. tot.number of words, occurrences of a word, concordance, similar words
    print('## total number of words:', len(textList), '\n') 
    print('## lexical diversity:', len(set(textList)) / len(textList), '\n')
    print('## occurrences of the term "', term, '": ', textList.count(term), '\n')
    print('## percentage on the total amount of words:', 100 * textList.count(str(term)) / len(textList))
    print('\n## concordance of the term '+term+':') 
    textList.concordance(term, 75, sys.maxsize)
    print('\n## similar words in the same context of "'+term+'":') 
    textList.similar(term)

We call it specifying the txt file and the term to be analysed.

In [27]:
analyseData('military.txt', 'military')

## total number of words: 4024 

## lexical diversity: 0.35636182902584496 

## occurrences of the term " military ":  31 

## percentage on the total amount of words: 0.7703777335984096

## concordance of the term military:
Displaying 31 of 31 matches:
chic black velvet caps naval and military uniforms and academic robesmade 
a banda playing a passadoble and military exercises always apparently ende
we heard there also some russian military singers they were six private so
ectioneering we marched with the military band before us stopped before ol
 we have is that of the austrian military band which plays three evenings 
red one band during the daya big military bandthat would play all the marc
gnitary of the organisation with military music but the band was sitting a
 he would appear tallish faintly military not young and vanish again i oft
s was the sudden apparition of a military band i saw with astonishment tha
their musicstands like any other military band they played a march or t
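To make the idea of a concordance more concrete, here is a rough plain-Python sketch of a keyword-in-context listing. It is not NLTK's implementation: the helper `kwic` and its `window` parameter (counted in words, whereas **concordance** takes a width in characters) are hypothetical names introduced only for illustration, and the sample sentence is made up.

```python
def kwic(tokens, term, window=3):
    """Return (left context, keyword, right context) for every hit of term."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = ' '.join(tokens[max(0, i - window):i])      # words before the hit
            right = ' '.join(tokens[i + 1:i + 1 + window])     # words after the hit
            hits.append((left, tok, right))
    return hits

# Made-up sample (an assumption, not taken from military.txt)
tokens = "we marched with the military band before us and heard military music".split()

for left, keyword, right in kwic(tokens, "military"):
    # right-align the left context so that the keyword lines up in a column
    print(f"{left:>20} {keyword} {right}")
```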

Let's move on to the other tasks:
 * **Frequency distribution of the 100 most common words**: before calculating the distribution of words we remove stopwords from our text by using **stopwords.words('english')**. Then we use **FreqDist()** to compute the frequency distribution of all the words, and **most_common()** to keep only the most frequent ones. In order to print an ordered list we use **enumerate()** on the list of most common words. Notice how we print results: we use the `%` string-formatting operator with three placeholders that compose our final string. 
   * `%3d` stands for an integer padded to a width of at least 3 characters and represents the rank in the ordered list; 
   * `%2.5f` stands for a floating-point number with a minimum total width of 2 characters and 5 digits after the decimal point; we print the frequency as a percentage and '%%' escapes the character '%'; 
   * `%s` stands for a string and represents our word
 * **Distribution of the 50 most common bigrams**: we rebuild our text by joining the filtered items in the list *filtered_text*, we tokenize it again and we look for bigrams (pairs of consecutive words) by means of **nltk.bigrams**. Then we compute their frequency with the **FreqDist()** class we already know. We keep the 50 most common with the method **most_common(50)**
 * **Collocation of bigrams that appear at least three times**: we use **collocations.BigramAssocMeasures()** and **BigramCollocationFinder** to find collocations of words, and we keep only the ones that appear at least 3 times (**apply_freq_filter(3)** removes the bigrams that occur fewer than 3 times). We score bigrams by their frequency (**raw_freq**)
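The format string described above can be tried in isolation. The sketch below applies it to the first row of the frequency output shown further down (rank 1, 1.51367%, `military`) and, for comparison, builds the same line with an f-string; the variable names are chosen only for illustration.

```python
# Values taken from the first row of the frequency output in this notebook
rank, percentage, word = 1, 1.51367, "military"

# %3d -> integer padded to width 3, %2.5f -> 5 decimals, %% -> literal '%', %s -> string
line = "%3d %2.5f%% %s" % (rank, percentage, word)
print(line)

# The same layout written as an f-string
same_line = f"{rank:3d} {percentage:2.5f}% {word}"
print(same_line)
```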

In [30]:
# delete stopwords and punctuation and add in a list for further analysis
stopWords = set(stopwords.words('english'))
filtered_text = [w for w in textList if w not in stopWords] # removing stopwords from the text

# distribution of words in the cleaned corpus
print('## frequency distribution of the 100 most common words (without stopwords and punctuation):\n')
fd = FreqDist(filtered_text)
most_common_words = [word for (word, count) in fd.most_common(100)]
for rank, word in enumerate(most_common_words):
    print(("%3d %2.5f%% %s" % (rank + 1, fd.freq(word) * 100, word)))

#compute frequency distribution for all the bigrams in the text
cleaned_text = ' '.join(x for x in filtered_text) # rebuild the text without stopwords
textList = word_tokenize(cleaned_text)
bgs = nltk.bigrams(textList)
fdist = FreqDist(bgs)
print('\n## distribution of 50 most common bigrams:\n', fdist.most_common(50),'\n')

# predicting following words/ collocation
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(textList)
finder.apply_freq_filter(3) # keep only bigrams that appear at least 3 times
scored = finder.score_ngrams(bigram_measures.raw_freq) # score bigrams by their frequency
print("\n## collocation of words that appear at least 3 times:\n", sorted(bigram for bigram, score in scored),'\n')

## frequency distribution of the 100 most common words (without stopwords and punctuation):

  1 1.51367% military
  2 1.26953% band
  3 0.92773% music
  4 0.53711% played
  5 0.53711% would
  6 0.53711% first
  7 0.48828% playing
  8 0.48828% went
  9 0.43945% long
 10 0.39062% two
 11 0.39062% great
 12 0.39062% one
 13 0.34180% see
 14 0.34180% drums
 15 0.34180% came
 16 0.34180% bands
 17 0.34180% heard
 18 0.29297% march
 19 0.29297% way
 20 0.29297% stood
 21 0.29297% chaliapin
 22 0.29297% drum
 23 0.29297% even
 24 0.29297% life
 25 0.24414% new
 26 0.24414% time
 27 0.24414% sang
 28 0.24414% day
 29 0.24414% tsar
 30 0.24414% hear
 31 0.24414% like
 32 0.24414% never
 33 0.24414% palace
 34 0.24414% solemn
 35 0.24414% performed
 36 0.24414% thought
 37 0.24414% occasion
 38 0.24414% boris
 39 0.24414% bass
 40 0.24414% singing
 41 0.24414% whole
 42 0.24414% cardinals
 43 0.24414% body
 44 0.24414% performance
 45 0.24414% going
 46 0.24414% procession
 47 0.19531% saw
 48 
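For readers who want to see what counting bigrams and applying a minimum-frequency filter amount to, here is a plain-Python sketch using `collections.Counter` and `zip`. It only mirrors the idea behind **nltk.bigrams**, **FreqDist** and **apply_freq_filter(3)**; it is not how NLTK implements them, and the token list is made up.

```python
from collections import Counter

# Made-up token list (an assumption, not taken from military.txt)
tokens = ("military band played military band marched "
          "military band music played").split()

# Pair every token with the one that follows it
bigrams = list(zip(tokens, tokens[1:]))

# Count how often each pair occurs
counts = Counter(bigrams)
print(counts.most_common(1))

# Keep only the pairs that occur at least 3 times
frequent = sorted(bg for bg, n in counts.items() if n >= 3)
print(frequent)
```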

**OPTIONAL** Score the pairs of words that are most likely to appear together (i.e. the most strongly associated), using likelihood_ratio (instead of raw_freq) as the measure for scoring bigrams.

The *likelihood ratio* is a way to compute the usefulness of a bigram in the context of a text. Given the word pair ‘word anotherWord’, the log-likelihood ratio is computed by looking at four counts: the number of occurrences of that word pair in the corpus, the number of word pairs that begin with ‘word’ but end with something other than ‘anotherWord’, the number of word pairs that end with ‘anotherWord’ but begin with something other than ‘word’, and the number of word pairs in the corpus that contain neither ‘word’ in first position nor ‘anotherWord’ in second position.
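The four counts form a 2x2 contingency table. As a minimal sketch, the code below builds that table for a made-up token list and computes the textbook log-likelihood statistic (often written G²) from it. The helper names `contingency` and `g2` are hypothetical, and the score returned by NLTK's `likelihood_ratio` may differ in details such as smoothing or how the totals are counted, so treat this as an illustration of the idea rather than a reimplementation.

```python
import math

def contingency(bigrams, w1, w2):
    """Count the four cells of the 2x2 table for the pair (w1, w2)."""
    n11 = sum(1 for a, b in bigrams if a == w1 and b == w2)  # w1 followed by w2
    n12 = sum(1 for a, b in bigrams if a == w1 and b != w2)  # w1 followed by another word
    n21 = sum(1 for a, b in bigrams if a != w1 and b == w2)  # another word followed by w2
    n22 = sum(1 for a, b in bigrams if a != w1 and b != w2)  # neither
    return n11, n12, n21, n22

def g2(n11, n12, n21, n22):
    """Textbook log-likelihood statistic: 2 * sum(observed * ln(observed / expected))."""
    total = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    observed = ((n11, n12), (n21, n22))
    score = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            if observed[i][j] > 0:  # empty cells contribute nothing
                score += observed[i][j] * math.log(observed[i][j] / expected)
    return 2 * score

# Made-up token list (an assumption, not taken from military.txt)
tokens = ("military band played military band marched "
          "military band music played").split()
bigrams = list(zip(tokens, tokens[1:]))

table = contingency(bigrams, "military", "band")
print(table, g2(*table))
```

A pair whose words always occur together gets a high score, while a table in which the two words are independent (for example one occurrence in each cell) scores 0.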

We use the finder provided by NLTK, but we change the measure to *likelihood_ratio*, which is also provided by the library.

At the beginning we imported the module *collections*, which provides the class **defaultdict**. This is a dict subclass that calls a factory function to supply missing values. When you look up a key that does not exist in a regular dictionary, the interpreter raises a `KeyError`. With a defaultdict you specify how such a missing value is supplied. We create an empty defaultdict called *prefix_keys* whose missing values default to an empty list.

We iterate over the list *scored2*, that includes tuples containing a bigram and the related likelihood score. E.g.

`[(('military', 'band'), 79.49166503212565), (('body', 'church'), 33.92480310549471), (('captain', 'waterhouse'), 31.72393346130431), (('save', 'queen'), 31.72393346130431), ...]`

We extract the first word of the bigram, *key\[0\]*, such as 'military', 'body', 'captain', etc., and we use it as a key in the new dictionary (prefix_keys\[key\[0\]\]). The value of such a key (which, as we said, is a list) is built with the method **append**: we append a tuple - denoted by () - including the second word of the bigram (*key\[1\]*) and the related score (*scores*). Since we are iterating over all the tuples of *scored2*, whenever a key is repeated (i.e. the first word of a bigram appears in several bigrams) it is not duplicated in the new dictionary; instead its value is updated so that it includes all the pairs in which that word comes first. One entry of such a dictionary looks like:

`'military': [('band', 79.49166503212565), ('music', 9.308310719639827), ('bands', 8.65962946055253), ('array', 8.413385833226672), ('bandthat', 8.413385833226672), ('busyness', 8.413385833226672), ('enthusiasm', 8.413385833226672), ('exercises', 8.413385833226672), ('headquarters', 8.413385833226672), ('hospital', 8.413385833226672), ('retiring', 8.413385833226672), ('reviews', 8.413385833226672), ('uniforms', 8.413385833226672), ('young', 5.670332474821485), ('men', 4.003354421277293), ('singers', 4.003354421277293)]`

The order of the keys in this dictionary is not significant, while each list is already ordered from the strongest to the weakest association (i.e., from the highest to the lowest likelihood ratio), because **score_ngrams** returns the bigrams sorted by decreasing score. To print the entries in a meaningful order we reuse the list *most_common_words* we created earlier, wherein the order of words is relevant: we iterate over its items (words) and we print:
 * the word, converted to a string with *str(w)* (the words are already strings, so this is only a safeguard)
 * the list of tuples associated with that word when it is a key of our dictionary, *prefix_keys\[str(w)\]*; since such lists might be very long, we extract only the first three items using the slicing notation we know, *\[:3\]*
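The grouping step can be tried without NLTK on a few of the scored pairs quoted above. In this sketch the list `scored2` is typed in by hand from those example values, and `tsar` is used as an example of a word that never starts a bigram in this small sample.

```python
import collections

# A few (bigram, score) pairs copied from the examples quoted above
scored2 = [
    (('military', 'band'), 79.49166503212565),
    (('body', 'church'), 33.92480310549471),
    (('captain', 'waterhouse'), 31.72393346130431),
    (('military', 'music'), 9.308310719639827),
]

# A regular dict raises KeyError for a missing key
plain = {}
try:
    plain['military']
    missing_key_raised = False
except KeyError:
    missing_key_raised = True

# A defaultdict(list) supplies an empty list instead
prefix_keys = collections.defaultdict(list)
for key, score in scored2:
    prefix_keys[key[0]].append((key[1], score))  # group by first word of the bigram

print(prefix_keys['military'][:3])  # at most the three strongest pairs
print(prefix_keys['tsar'])          # missing key: an empty list, no error
```

Note that looking up a missing key in a defaultdict also inserts it, so `tsar` becomes a key with an empty list after the last line.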



In [56]:
# rating bigrams that will likely occur
finder2 = BigramCollocationFinder.from_words(textList)
scored2 = finder2.score_ngrams(bigram_measures.likelihood_ratio) 
prefix_keys = collections.defaultdict(list)
for key, scores in scored2:
    prefix_keys[key[0]].append((key[1], scores)) # Group bigrams by first word in bigram  
print("## scored pair of words that are most likely to appear together:")
for w in most_common_words:
    print(str(w), prefix_keys[str(w)][:3])

## scored pair of words that are most likely to appear together:
['military', 'band', 'music', 'played', 'would', 'first', 'playing', 'went', 'long', 'two', 'great', 'one', 'see', 'drums', 'came', 'bands', 'heard', 'march', 'way', 'stood', 'chaliapin', 'drum', 'even', 'life', 'new', 'time', 'sang', 'day', 'tsar', 'hear', 'like', 'never', 'palace', 'solemn', 'performed', 'thought', 'occasion', 'boris', 'bass', 'singing', 'whole', 'cardinals', 'body', 'performance', 'going', 'procession', 'saw', 'always', 'side', 'men', 'last', 'play', 'work', 'much', 'george', 'used', 'singers', 'church', 'marching', 'quite', 'join', 'old', 'boys', 'tipperary', 'london', 'well', 'glazunov', 'soldiers', 'instruments', 'dressed', 'ten', 'week', 'grand', 'effect', 'composer', 'choir', 'taking', 'majesty', 'french', 'good', 'crowd', 'god', 'several', 'suddenly', 'st', 'everyone', 'statue', '1870', 'sun', 'charles', 'small', 'get', 'opera', 'us', 'five', 'orchestra', 'town', 'given', 'come', 'filled']
milita

# In the end
Our final function looks like this:

In [3]:
import nltk , nltk.data , collections, string , sys
from nltk.text import Text 
from nltk.probability import FreqDist , ConditionalFreqDist
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.collocations import *

def analyseData(fileName, term):
    """ given a text file and a keyword, print several analyses """
    with open(fileName, errors='ignore') as txtFile: # open .txt file
        text=txtFile.read().lower() # read the file and transform it in lowercase
        exclude = set(string.punctuation) # set of characters to be deleted
        no_punct = ''.join(ch for ch in text if ch not in exclude) # a fancy way to remove characters from the text
        textList = word_tokenize(no_punct) # tokenize words and create a list
        textList = Text(textList) # create a Text object
    # some statistics, e.g. tot.number of words, occurrences of a word, concordance, similar words
    print('## total number of words:', len(textList), '\n') 
    print('## lexical diversity:', len(set(textList)) / len(textList), '\n')
    print('## occurrences of the term "', term, '": ', textList.count(term), '\n')
    print('## percentage on the total amount of words:', 100 * textList.count(str(term)) / len(textList))
    print('\n## concordance of the term '+term+':') 
    textList.concordance(term, 75, sys.maxsize)
    print('\n## similar words in the same context of "'+term+'":') 
    textList.similar(term)
    # delete stopwords and punctuation and add in a list for further analysis
    stopWords = set(stopwords.words('english'))
    filtered_text = [w for w in textList if w not in stopWords] # removing stopwords from the text

    # distribution of words in the cleaned corpus
    print('## frequency distribution of the 100 most common words (without stopwords and punctuation):\n')
    fd = FreqDist(filtered_text)
    most_common_words = [word for (word, count) in fd.most_common(100)]
    for rank, word in enumerate(most_common_words):
        print(("%3d %2.5f%% %s" % (rank + 1, fd.freq(word) * 100, word)))

    #compute frequency distribution for all the bigrams in the text
    cleaned_text = ' '.join(x for x in filtered_text) # rebuild the text without stopwords
    textList = word_tokenize(cleaned_text)
    bgs = nltk.bigrams(textList)
    fdist = FreqDist(bgs)
    print('\n## distribution of 50 most common bigrams:\n', fdist.most_common(50),'\n')

    # predicting following words/ collocation
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(textList)
    finder.apply_freq_filter(3) # keep only bigrams that appear at least 3 times
    scored = finder.score_ngrams(bigram_measures.raw_freq) # score bigrams by their frequency
    print("\n## collocation of words that appear at least 3 times:\n", sorted(bigram for bigram, score in scored),'\n')
    
    # rating bigrams that will likely occur
    finder2 = BigramCollocationFinder.from_words(textList)
    scored2 = finder2.score_ngrams(bigram_measures.likelihood_ratio) 
    prefix_keys = collections.defaultdict(list)
    for key, scores in scored2:
        prefix_keys[key[0]].append((key[1], scores)) # Group bigrams by first word in bigram  
    print("## scored pair of words that are most likely to appear together:")
    for w in most_common_words:
        print(str(w), prefix_keys[str(w)][:3])

In [4]:
analyseData('military.txt', 'military')

## total number of words: 4024 

## lexical diversity: 0.35636182902584496 

## occurrences of the term " military ":  31 

## percentage on the total amount of words: 0.7703777335984096

## concordance of the term military:
Displaying 31 of 31 matches:
chic black velvet caps naval and military uniforms and academic robesmade 
a banda playing a passadoble and military exercises always apparently ende
we heard there also some russian military singers they were six private so
ectioneering we marched with the military band before us stopped before ol
 we have is that of the austrian military band which plays three evenings 
red one band during the daya big military bandthat would play all the marc
gnitary of the organisation with military music but the band was sitting a
 he would appear tallish faintly military not young and vanish again i oft
s was the sudden apparition of a military band i saw with astonishment tha
their musicstands like any other military band they played a march or t

What can you say about a listening experience? And what about this specific type of listening experience?