# Text processing
To analyse quantitative and qualitative aspects of texts, we first need to build a corpus: a structured set of texts designed according to our purposes and assembled to accomplish specific tasks. In the design phase we need to consider:
 * the size of the corpus
 * the balance of sources
 * the representativeness of the corpus with regard to the topics/features to analyse

We can look for general statistics (e.g. number of words, lexical diversity, occurrences of terms, frequency distribution), other meaningful features (e.g. Part-Of-Speech tagging), and prediction models (e.g. which words are likely to appear together?).

## NLTK
[Natural Language Toolkit (NLTK)](http://www.nltk.org/) is a Python library that simplifies several common tasks in linguistic analysis by providing dedicated functions.

## Creating a collection
We collect and save all our texts in a single .txt file. The file method `read()` returns the entire contents of the file as a single string. Here we open a local copy of the file in text mode, so Python decodes it for us. If the file is read directly from an online source instead, its content arrives as bytes and must be converted to a string with `decode()`, specifying the encoding standard (`utf-8`).

In [1]:
with open('military.txt', 'r', encoding='utf-8', errors='ignore') as txtFile: # open the .txt file as UTF-8 text, skipping undecodable bytes
    text = txtFile.read()
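
If the file is fetched from a URL rather than opened locally, the bytes-to-string step becomes explicit. The following is a minimal sketch, assuming the standard-library `urllib.request` is used for the download. The helper name `fetch_text` is hypothetical and is only defined, not called, so the snippet runs offline.

```python
from urllib.request import urlopen

def fetch_text(url, encoding='utf-8'):
    """Download a text file and decode its bytes into a string (hypothetical helper)."""
    with urlopen(url) as response:
        raw_bytes = response.read()    # the response body is bytes, not str
    return raw_bytes.decode(encoding, errors='ignore')

# decode() on a small bytes object, to show the conversion without a network call
raw = 'military band \u2013 passadoble'.encode('utf-8')
decoded = raw.decode('utf-8')
print(type(raw).__name__, type(decoded).__name__)
```

Passing the wrong encoding together with `errors='ignore'` silently drops characters that cannot be decoded, so it is worth checking the result for missing dashes or accents.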

## Cleaning texts
'Neat' statistics require 'neat' data, which means we need to remove all those characters (e.g. punctuation) and words (articles, prepositions, etc.) that would make our results less readable or less significant.
### lower()
The string method `lower()` returns a lowercase copy of the string.

In [2]:
text.lower()

'the profound impression made upon a crowded congregation in st. paul\'s cathedral has already been mentioned. an eloquent sermon by the bishop of london (honorary chaplain to the academy) was admirably suited to the occasion and a largely-augmented choir did fullest justice to the well-chosen music, the selection and direction of which had been entrusted to dr. charles macpherson. the effect of his solemn te deum is unforgettable. our girls in white and scarletwearing specially-designed, chic black velvet caps, naval and military uniforms and academic robesmade a brave show of colour, and the unexpected burst of music by the band of the welsh guards after the "amen," as the procession moved slowly down the nave, provided a dignified close to a devotional function unique of its kind. \n \nat a fixed hour every day i heard from my room the distant sound of a banda playing a passadoble, and military exercises always apparently ended with that sort of music. all the little characteristics

### punctuation
The module `string` provides a string of punctuation characters (`string.punctuation`) that we can use to remove punctuation from our text. We turn it into a set and compare it with the characters of our text.

In [3]:
import string
exclude = set(string.punctuation)
exclude

{'!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~'}
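
The comparison between the set and the text can be sketched as follows. This is one possible approach, not necessarily the one intended by the lesson: it keeps every character that is not in `exclude`. The sample string is adapted from the text shown above.

```python
import string

exclude = set(string.punctuation)

sample = 'st. paul\'s cathedral, after the "amen," as the procession moved.'
# keep only the characters that are not punctuation marks
cleaned = ''.join(ch for ch in sample if ch not in exclude)
print(cleaned)
```

Note that apostrophes are removed as well, so `paul's` becomes `pauls`. `string.punctuation` covers ASCII characters only, so typographic quotes such as `’` are left untouched.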

### stopwords
NLTK provides lists of lowercase stop words for several languages. We can create a set from the English list.

In [6]:
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))
stopWords

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 'd',
 'did',
 'didn',
 'do',
 'does',
 'doesn',
 'doing',
 'don',
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 'has',
 'hasn',
 'have',
 'haven',
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 'it',
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 'more',
 'most',
 'mustn',
 'my',
 'myself',
 'needn',
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's',
 'same',
 'shan',
 'she',
 'should',
 'shouldn',
 'so',
 'some',
 'such',
 't',
 'than',
 'that',
 'the',
 'their',
 'theirs',
 'them',
 

We can then compare this set with the list of words in our texts (see `word_tokenize`) and remove the stop words to clean our texts.
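
A minimal sketch of this filtering step is given below. To keep it runnable without downloading the NLTK corpus, it assumes a tiny hand-picked subset of the stop words listed above and a crude whitespace split. In practice `stop_words` would be `set(stopwords.words('english'))` and the words would come from `word_tokenize`.

```python
# Stand-in for set(stopwords.words('english')): a small hand-picked subset,
# used here only so that the snippet runs without the NLTK corpus.
stop_words = {'a', 'an', 'and', 'at', 'by', 'in', 'of', 'the'}

sample = 'the distant sound of a banda playing a passadoble'
words = sample.split()    # crude whitespace tokenization
# keep only the words that are not stop words
content_words = [w for w in words if w not in stop_words]
print(content_words)
```

Because the NLTK stop words are lowercase, the text should be lowercased before the comparison.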

## Tokenization
Tokenization splits a string into parts (tokens) to be processed further. It's a necessary step to perform statistics on sentences or words.

### sent_tokenize()
`sent_tokenize()` knows which punctuation marks and characters signal the end of a sentence and the beginning of a new one, and it divides the text according to such separators. We'll use it to calculate the frequency distribution of words.

In [8]:
from nltk.tokenize import sent_tokenize
text = "this's a sentence. this is sent two. is this sent three? sent 4 is cool! Now it’s your turn."
sent_tokenize_list = sent_tokenize(text)
sent_tokenize_list

["this's a sentence.",
 'this is sent two.',
 'is this sent three?',
 'sent 4 is cool!',
 'Now it’s your turn.']

### word_tokenize()
Similarly to `sent_tokenize()`, `word_tokenize()` splits a text into words and punctuation marks.

In [10]:
from nltk.tokenize import word_tokenize
word_tokenize('Hello World.')

['Hello', 'World', '.']
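
To see why the full stop comes out as a separate token, here is a rough approximation written with a regular expression. It is not the algorithm NLTK uses, and the function name `rough_word_tokenize` is hypothetical. It only illustrates the idea of separating words from punctuation marks.

```python
import re

def rough_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(rough_word_tokenize('Hello World.'))
```

Unlike `word_tokenize()`, this sketch has no special handling for contractions or abbreviations.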

## Statistics
We can perform simple statistics to understand the usage of certain words in specific contexts.
### Text
The class `Text` is a wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (e.g. concordance, words in a similar context).

In [None]:
from nltk.text import Text 
textList = text.split() # split the text on whitespace, returning a list of strings
textList = Text(textList) # wrap the list in a Text object
textList

### concordance()
Given a word, we can see where it appears by using `concordance()`, which includes the preceding and following words in its results.

In [None]:
import nltk, sys
from nltk.text import Text
term = 'sent'
textList.concordance(term, 75, sys.maxsize) # lines 75 characters wide, no cap on the number of lines

The accepted arguments are: the term to be searched, the width (in characters) of each line of context, and the number of lines you want to be shown. In this case we specified `sys.maxsize` as the maximum number of lines.

### similar()
To explore which words appear in the same position as the term we are investigating, we use `similar()`.

In [18]:
textList.similar(term)

is


## Frequency distribution
A frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome. Frequency distributions are generally constructed by running a number of experiments, and incrementing the count for a sample every time it is an outcome of an experiment.

### FreqDist()
`FreqDist()` accepts an iterable of tokens and returns a dictionary-like object that maps each token to its count.

In [4]:
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
sent = 'This is a simple example. Is it simple?'
words = [word.lower() for word in word_tokenize(sent)] # lowercase every token
fdist = FreqDist(words) # count the occurrences of each token
print(fdist.items())

dict_items([('simple', 2), ('a', 1), ('is', 2), ('?', 1), ('it', 1), ('example', 1), ('.', 1), ('this', 1)])


### freq()
Returns the frequency of a given sample. The frequency of a sample is defined as the count of that sample divided by the total number of sample outcomes that have been recorded by this `FreqDist`. Frequency is a number in the range [0, 1].

In [23]:
print(fdist.freq('is'))

0.2
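
The value `0.2` can be checked by hand: `is` occurs 2 times among 10 tokens. The sketch below reproduces the arithmetic with the standard-library `Counter` instead of `FreqDist`, using the ten lowercase tokens from the example above written out as a list.

```python
from collections import Counter

# the ten lowercase tokens of 'This is a simple example. Is it simple?'
words = ['this', 'is', 'a', 'simple', 'example', '.', 'is', 'it', 'simple', '?']

counts = Counter(words)         # token -> number of occurrences
total = sum(counts.values())    # total number of recorded outcomes
print(counts['is'], total, counts['is'] / total)
```

Punctuation marks count as outcomes too, which is one reason to clean the text before computing frequencies.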


## Bigrams
A bigram is a sequence of two adjacent elements from a string of tokens, such as letters, syllables or words. We use bigrams to see which pairs of words appear in our text (so as to characterize a listening experience by its most common bigrams).

In [24]:
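import nltk
tokens = word_tokenize(sent) # tokens keep their original case here
bgs = nltk.bigrams(tokens) # pairs of adjacent tokens
fdist = nltk.FreqDist(bgs) # count how often each pair occurs
print(fdist.most_common(10))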

[(('it', 'simple'), 1), (('a', 'simple'), 1), (('is', 'a'), 1), (('simple', 'example'), 1), (('.', 'Is'), 1), (('simple', '?'), 1), (('This', 'is'), 1), (('Is', 'it'), 1), (('example', '.'), 1)]
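
What "adjacent" means can be shown without NLTK by pairing each token with the one that follows it. This is a plain-Python sketch of the idea, not the library's implementation. The token list is the one from the example above, written out by hand.

```python
tokens = ['This', 'is', 'a', 'simple', 'example', '.', 'Is', 'it', 'simple', '?']

# pair every token with its successor: n tokens yield n - 1 bigrams
bigrams = list(zip(tokens, tokens[1:]))
print(len(bigrams))
print(bigrams[:3])
```

Since the tokens are not lowercased here, `('This', 'is')` and `('Is', 'it')` start with different tokens, and punctuation marks take part in bigrams as well.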


### Collocation 
A collocation is an expression of multiple words that commonly co-occur. We use `BigramCollocationFinder` to collect the candidate bigrams and `nltk.collocations.BigramAssocMeasures()` to score them.
The `collocations` package provides collocation finders which by default consider all ngrams in a text as candidate collocations.

In [25]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams(bigram_measures.raw_freq) # list of (bigram, score) pairs
sorted(bigram for bigram, score in scored)

[('.', 'Is'),
 ('Is', 'it'),
 ('This', 'is'),
 ('a', 'simple'),
 ('example', '.'),
 ('is', 'a'),
 ('it', 'simple'),
 ('simple', '?'),
 ('simple', 'example')]

We can also apply filters, e.g. we want this method to return only bigrams that appear at least 2 times.

In [32]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
sent = 'This is another sentence, another sentence meant to test distribution of bigrams'
tokens = word_tokenize(sent)
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2) # drop bigrams that occur fewer than 2 times
scored = finder.score_ngrams(bigram_measures.raw_freq)
sorted(bigram for bigram, score in scored)

[('another', 'sentence')]

See the [documentation](http://www.nltk.org/howto/collocations.html).

# Exercise
Work on a bunch of short [texts](https://raw.githubusercontent.com/marilenadaquino/computational_thinking/master/3_lesson/military.txt) recording listening experiences of military bands, published by the [LED project](led.kmi.open.ac.uk). We want to understand how to recognize a listening experience in a text, and which elements characterize a listening experience of a military band. 

Define a function that prints the following statistics:
 * Total number of words
 * Lexical diversity: `numberOfUniqueWords / totalNumberOfWords`
 * Occurrences of the term `military`
 * Percentage of the term `military` with respect to the total amount of words: `100 * occurrencesOfTerm / totalNumberOfWords`
 * Concordance of the term `military`
 * Other words that appear in the same context as `military`
 * Frequency distribution of the 100 most common words
 * Distribution of the 50 most common bigrams
 * Collocation of bigrams that appear more than three times
 * **OPTIONAL** Score pair of words that are most likely to appear together (i.e. the most common), using `likelihood_ratio` (instead of `raw_freq`) as a measure for scoring bigrams
 * Write a short summary on how you would characterize a listening experience. What is missing in this analysis?
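
The two ratios in the list above can be tried out on a toy list of tokens before applying them to the corpus. The tokens below are invented for illustration and the term `band` is an arbitrary choice. The variable names follow the formulas given in the list.

```python
# invented toy tokens, not taken from the corpus
tokens = ['the', 'band', 'played', 'and', 'the', 'band', 'marched', 'past', 'the', 'crowd']
term = 'band'

totalNumberOfWords = len(tokens)
numberOfUniqueWords = len(set(tokens))    # a set keeps one copy of each word
lexicalDiversity = numberOfUniqueWords / totalNumberOfWords

occurrencesOfTerm = tokens.count(term)
percentage = 100 * occurrencesOfTerm / totalNumberOfWords

print(totalNumberOfWords, lexicalDiversity, occurrencesOfTerm, percentage)
```

A lexical diversity close to 1 means that few words are repeated. On a real corpus the value depends heavily on whether stop words and punctuation have been removed first.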

## References for the exercise
 * [Natural Language Toolkit](http://www.nltk.org/) Python library for text analysis
 * [txt file](https://raw.githubusercontent.com/marilenadaquino/computational_thinking/master/3_lesson/military.txt) including texts on listening experiences
