# Words

In this notebook we will learn how to calculate word frequencies and lexical diversity, also known as the type-to-token ratio (TTR), and we will experiment with frequency plots.

##### Hint: Check sections 1.3 and 2.1-2.2 in the Bird et al. (2009) book.

## 1) Text as word lists, counting tokens and types

Start by looking at the following lines of code and explaining what they do.

In [1]:
text1 = "my friend was bored at my concert . I insisted that they did not come ."
text2 = "he gazed and gazed and gazed . amazed amazed amazed"

In [2]:
text1

'my friend was bored at my concert . I insisted that they did not come .'

In [3]:
text2

'he gazed and gazed and gazed . amazed amazed amazed'

In [4]:
text1v = text1.split()
text1v

['my',
 'friend',
 'was',
 'bored',
 'at',
 'my',
 'concert',
 '.',
 'I',
 'insisted',
 'that',
 'they',
 'did',
 'not',
 'come',
 '.']

In [5]:
text2v = text2.split()
text2v

['he',
 'gazed',
 'and',
 'gazed',
 'and',
 'gazed',
 '.',
 'amazed',
 'amazed',
 'amazed']

Now it's your turn to calculate how many words are in each text, their vocabulary size and their lexical diversity.

In [20]:
#YOUR CODE HERE
words1 = len(text1v)
words2 = len(text2v)

type1 = len(set(text1v))
type2 = len(set(text2v))

# In Python 3, / already returns a float, so no float() conversion is needed.
print(words1, type1, type1 / words1)
print(words2, type2, type2 / words2)


16 14 0.875
10 5 0.5
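The same calculation can be wrapped in a small helper so that it can be reused on any token list. This is only a minimal sketch: the function name `lexical_diversity` and the variables `sample1` and `sample2` are made up for illustration and are not part of the original notebook.

```python
def lexical_diversity(tokens):
    """Return (token count, type count, type-to-token ratio) for a token list."""
    n_tokens = len(tokens)
    n_types = len(set(tokens))  # a set keeps one copy of each distinct token
    return n_tokens, n_types, n_types / n_tokens

# The two example sentences from above, split on whitespace.
sample1 = "my friend was bored at my concert . I insisted that they did not come .".split()
sample2 = "he gazed and gazed and gazed . amazed amazed amazed".split()

stats1 = lexical_diversity(sample1)
stats2 = lexical_diversity(sample2)
print(stats1)
print(stats2)
```

Note that punctuation marks such as `.` are counted as tokens here, and that the comparison is case-sensitive, so `I` and `i` would count as two different types.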


## 2) Using NLTK text corpora

The cell below imports the `nltk` and `matplotlib` libraries. Both must be installed before you run it.

If you get an error that says the following:

`ModuleNotFoundError: No module named 'nltk'`

and/or

`ModuleNotFoundError: No module named 'matplotlib'`

that means that you need to install them. [This website](https://docs.anaconda.com/anaconda/navigator/tutorials/manage-packages/) explains how to install packages in Anaconda. There are other ways of installing packages, but we believe this is the most user-friendly.

In [8]:
import nltk

from matplotlib import pyplot as plt
%matplotlib inline

If you do not see any output, NLTK has been imported properly and we can start playing with it!

The following command downloads the `book` collection of corpora from NLTK.

Once all the relevant corpora are downloaded, the import in the cell after it will run without errors.

In [9]:
nltk.download('book')

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /home/jesus/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to /home/jesus/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /home/jesus/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /home/jesus/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /home/jesus/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /home/jesus/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |

True

In [10]:
# This line imports the different books, each as a separate text.
# Each text has been tokenised, so we also show the first 10 tokens
# of the first book in the collection.
# Note: the import rebinds text1 and text2, replacing the strings defined in section 1.
from nltk.book import *
text1[0:10]

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


['[',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 ']',
 'ETYMOLOGY',
 '.']

### Word statistics

Open the text “Sense and Sensibility by Jane Austen 1811” from the NLTK book collection, and calculate the following statistics:

- number of word tokens and types
- lexical diversity, or type-to-token ratio.

In [2]:
#YOUR CODE HERE
from nltk.book import text2

words1 = len(text2)
print(words1)
wordTypes = len(set(text2))
print(wordTypes)
print("{:.4f}".format(wordTypes/words1))

141576
6833
0.0483


In [3]:
text2.count

<bound method Text.count of <Text: Sense and Sensibility by Jane Austen 1811>>
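The output above is a bound method rather than a number because `count` was referenced without being called. The sketch below shows the difference on a plain Python list; the variable names are illustrative only. NLTK `Text` objects have a `count` method that is called the same way, for example `text2.count("God")`.

```python
tokens = "he gazed and gazed and gazed . amazed amazed amazed".split()

method = tokens.count                # no parentheses: only a reference to the method
gazed_count = tokens.count("gazed")  # called with a word: the number of occurrences

print(callable(method))
print(gazed_count)
```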

## 3) Frequency distributions

Let us start by looking at the absolute frequency of some words in different books.
To do that, you will use the `FreqDist` class available in `nltk`.

Build a frequency distribution of the words occurring in Sense and Sensibility.
Then extract how often the word 'God' occurs in the novel.

In [39]:
#YOUR CODE HERE
from nltk import FreqDist
fdist2 = FreqDist(text2)
fdist2["God"]

10

Let us now compare with another text. How often does 'God' occur in Monty Python and the Holy Grail?

In [40]:
#YOUR CODE HERE
from nltk.book import text6
fdist6 = FreqDist(text6)
fdist6["God"]

11

### Question:

The two texts differ greatly in length, so absolute frequency alone does not give a fair comparison. What about the relative frequency of the same word in the two texts?

In [41]:
#Relative frequency of "God" in Sense and Sensibility
#YOUR CODE HERE
god2 = fdist2["God"]
print(god2, len(text2), round(god2/len(text2), 6))

10 141576 7.1e-05


In [42]:
#Relative frequency of "God" in Monty Python and the Holy Grail
#YOUR CODE HERE
god6 = fdist6["God"]
print(god6, len(text6), round(god6/len(text6), 6))

11 16967 0.000648
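Relative frequencies this small are hard to compare by eye, so a common convention is to rescale them to occurrences per million tokens. The sketch below does this with the counts and text lengths printed above; the helper name `per_million` is an assumption made for illustration, not an NLTK function.

```python
def per_million(count, total_tokens):
    """Scale a raw count to occurrences per one million tokens."""
    return count / total_tokens * 1_000_000

# Counts and text lengths copied from the outputs above.
sense_rate = per_million(10, 141576)  # Sense and Sensibility
grail_rate = per_million(11, 16967)   # Monty Python and the Holy Grail

print(round(sense_rate, 1), round(grail_rate, 1))
print(round(grail_rate / sense_rate, 1))
```

Although the absolute counts are almost identical (10 versus 11), the word is roughly nine times more frequent in Monty Python and the Holy Grail once text length is taken into account.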


### Calculating and printing out absolute word frequency

Print out the 50 most frequent words of Sense and Sensibility together with their frequency.
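As a starting point, here is a minimal sketch of the idea on a toy text, using `collections.Counter` from the standard library. NLTK's `FreqDist` is a subclass of `Counter`, so `most_common(50)` can be called on a `FreqDist` in the same way. The variable names are illustrative.

```python
from collections import Counter

tokens = "he gazed and gazed and gazed . amazed amazed amazed".split()
counts = Counter(tokens)

# most_common(n) returns the n highest-count (word, count) pairs.
top3 = counts.most_common(3)
for word, freq in top3:
    print(f"{word:10}{freq}")
```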

In [None]:
#YOUR CODE HERE


### Plotting cumulative frequency

Create a plot of the cumulative frequency of the same words.

##### Hint:

If you add this line at the beginning of the cell below, the plot will be larger and easier to read.

`fig,ax = plt.subplots(figsize=(20,10))`
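To make clear what a cumulative frequency plot shows, the sketch below computes the cumulative counts by hand on a toy text with the standard library only; the variable names are illustrative. In NLTK, `FreqDist.plot(50, cumulative=True)` should draw the corresponding curve for the 50 most frequent words directly.

```python
from collections import Counter
from itertools import accumulate

tokens = "he gazed and gazed and gazed . amazed amazed amazed".split()
ranked = Counter(tokens).most_common()  # all types, most frequent first

words = [word for word, _ in ranked]
# Running total of the counts: each value adds the next word's frequency.
cumulative = list(accumulate(freq for _, freq in ranked))

for word, total in zip(words, cumulative):
    print(word, total)
```

The last cumulative value equals the total number of tokens, which is why the curve flattens out as rarer words are added.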

In [None]:
#YOUR CODE HERE


## 4) Conditional frequency

To work with conditional frequency, we will start by importing the Brown corpus from the NLTK corpus collection. We then print out the categories of the collection, which correspond to different genres.

In [30]:
from nltk.corpus import brown
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

As a reminder, we create a frequency distribution for one of the genres using the `FreqDist` class and print out the absolute frequency of two target words.

In [31]:
news_text = brown.words(categories='news')
fdist_news = nltk.FreqDist(w.lower() for w in news_text)
targets = ['war','government']
for t in targets:
    print(t + ':', fdist_news[t], end=' ')

war: 27 government: 73 

Now your turn. Build conditional frequency distributions given two of the genres in the Brown corpus, and print out the conditional frequencies of a number of target words.
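A conditional frequency distribution keeps one frequency distribution per condition, here per genre. NLTK's `ConditionalFreqDist` is built from `(condition, sample)` pairs. The sketch below imitates that structure with the standard library on hypothetical toy data, so the genres and counts shown are invented and do not come from the Brown corpus.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (genre, word) pairs standing in for a categorised corpus.
pairs = [
    ("news", "war"), ("news", "government"), ("news", "government"),
    ("romance", "love"), ("romance", "war"), ("romance", "love"),
]

cond_counts = defaultdict(Counter)
for genre, word in pairs:
    cond_counts[genre][word] += 1  # one counter per condition (genre)

for genre in ("news", "romance"):
    print(genre, {w: cond_counts[genre][w] for w in ("war", "love")})
```

The first index selects the condition and the second selects the word, which mirrors how a `ConditionalFreqDist` is queried: `cfd[condition][sample]`.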

In [38]:
#YOUR CODE HERE
from nltk.probability import ConditionalFreqDist

# Pair every word with its genre: the genre is the condition, the word is the sample.
genres = ['news', 'government']
cfd = ConditionalFreqDist(
    (genre, word.lower())
    for genre in genres
    for word in brown.words(categories=genre)
)
targets = ['war', 'government']

# Look up each target word under each condition.
for genre in genres:
    for target in targets:
        print(genre, target + ':', cfd[genre][target])

In this last exercise, write code to plot the frequency of two target words in the various genres of the Brown corpus. You can use the code provided in the NLTK book as a guide: https://www.nltk.org/book/ch02.html#fig-inaugural2

In [None]:
# YOUR CODE HERE
