# Week 5 lab
In this lab, you will further develop your NLP and Python skills by utilising the knowledge you gained during the lecture and by following the workbook. Specifically:

1. You'll load corpora available on NLTK. 
2. You'll find the number of words that appear in a document and all the unique words that appear in a document.
3. You'll calculate the average length of words in a document. 
4. Finally, you'll answer a question based on a very simple data exploration. Every data science project starts with exploring the data, i.e. understanding what kind of data is available to you, which in turn helps you understand a new domain.

## 1. NLTK Corpora

NLTK provides many corpora and covers many genres of text. Some of the corpora are listed below:

- Gutenberg: out-of-copyright books
- Brown: a general corpus of texts including novels, short stories and news articles
- Inaugural: U.S. presidential inaugural speeches

To see which corpora are already installed on your machine, you can run:

In [1]:
import os
import nltk
print(os.listdir(nltk.data.find("corpora")))

['unicode_samples', 'mte_teip5.zip', 'indian', 'stopwords', 'brown', 'swadesh', 'mac_morpho', 'conll2002.zip', 'indian.zip', 'abc', 'comparative_sentences', 'brown_tei.zip', 'cmudict.zip', 'conll2000.zip', 'universal_treebanks_v20.zip', 'words', 'pros_cons', 'udhr2', 'nonbreaking_prefixes.zip', 'lin_thesaurus', 'webtext', 'smultron.zip', 'names', 'sentiwordnet', 'dolch.zip', 'wordnet_ic.zip', 'brown.zip', 'alpino.zip', 'panlex_swadesh.zip', 'cmudict', 'sinica_treebank.zip', 'treebank.zip', 'ptb', 'inaugural', 'ppattach.zip', 'dependency_treebank.zip', 'opinion_lexicon.zip', 'cess_esp.zip', 'product_reviews_2', 'genesis.zip', 'reuters.zip', 'conll2007.zip', 'conll2002', 'comparative_sentences.zip', 'switchboard.zip', 'cess_cat.zip', 'udhr.zip', 'subjectivity.zip', 'pl196x.zip', 'ieer', 'problem_reports', 'timit.zip', 'floresta', 'paradigms.zip', 'gazetteers.zip', 'wordnet.zip', 'inaugural.zip', 'sinica_treebank', 'stopwords.zip', 'verbnet.zip', 'gutenberg', 'ieer.zip', 'ycoe.zip', 'shak

If you don't see a corpus named "inaugural" in the list, you will need to download it by running __nltk.download()__. A window will pop up -> click on "corpora" -> select "inaugural" -> click Download. Alternatively, __nltk.download('inaugural')__ fetches the corpus directly without opening the window.

Each corpus contains a number of texts. We’ll work with the inaugural corpus, and explore what the corpus contains. Load the inaugural corpus by typing the following:


In [2]:
from nltk.corpus import inaugural

To list all of the documents in the inaugural corpus, run:

In [3]:
inaugural.fileids()

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt',
 '1809-Madison.txt',
 '1813-Madison.txt',
 '1817-Monroe.txt',
 '1821-Monroe.txt',
 '1825-Adams.txt',
 '1829-Jackson.txt',
 '1833-Jackson.txt',
 '1837-VanBuren.txt',
 '1841-Harrison.txt',
 '1845-Polk.txt',
 '1849-Taylor.txt',
 '1853-Pierce.txt',
 '1857-Buchanan.txt',
 '1861-Lincoln.txt',
 '1865-Lincoln.txt',
 '1869-Grant.txt',
 '1873-Grant.txt',
 '1877-Hayes.txt',
 '1881-Garfield.txt',
 '1885-Cleveland.txt',
 '1889-Harrison.txt',
 '1893-Cleveland.txt',
 '1897-McKinley.txt',
 '1901-McKinley.txt',
 '1905-Roosevelt.txt',
 '1909-Taft.txt',
 '1913-Wilson.txt',
 '1917-Wilson.txt',
 '1921-Harding.txt',
 '1925-Coolidge.txt',
 '1929-Hoover.txt',
 '1933-Roosevelt.txt',
 '1937-Roosevelt.txt',
 '1941-Roosevelt.txt',
 '1945-Roosevelt.txt',
 '1949-Truman.txt',
 '1953-Eisenhower.txt',
 '1957-Eisenhower.txt',
 '1961-Kennedy.txt',
 '1965-Johnson.txt',
 '1969-Nixon.txt',
 '1973-Nixon.txt',
 '1

The corpus contains inaugural addresses of US presidents. From this point on we’ll work with President Barack Obama’s inaugural speech from 2009 (2009-Obama.txt). The contents of each document in the corpus may be accessed via a number of corpus readers. The plaintext corpus reader provides methods to view the raw text (raw), a list of words (words) or a list of sentences (sents):

In [4]:
print(inaugural.raw('2009-Obama.txt'))

My fellow citizens:

I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition.

Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents.

So it has been. So it must be with this generation of Americans.

That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatred. Our economy is badly weakened, a conse

In [5]:
print(inaugural.words('2009-Obama.txt'))

['My', 'fellow', 'citizens', ':', 'I', 'stand', 'here', ...]


In [6]:
print(inaugural.sents('2009-Obama.txt'))

[['My', 'fellow', 'citizens', ':'], ['I', 'stand', 'here', 'today', 'humbled', 'by', 'the', 'task', 'before', 'us', ',', 'grateful', 'for', 'the', 'trust', 'you', 'have', 'bestowed', ',', 'mindful', 'of', 'the', 'sacrifices', 'borne', 'by', 'our', 'ancestors', '.'], ...]


Great work! You've learned how to download and load corpora from NLTK, and how to print a document as raw text, as a list of words and as a list of sentences.

## 2. Counting words in a text

In this part of the lab, you'll write your own code. You will have to complete the functions below to answer two questions:
1. Find the total number of words (tokens) in Obama’s 2009 speech
2. Find the total number of distinct words (word types) in the same speech

In [7]:
def ex1(doc_name):
    # Use the plaintext corpus reader to access a pre-tokenised list of words
    # for the document specified in "doc_name"
    doc_words = inaugural.words(doc_name)

    # Find the total number of words in the speech
    total_words = 

    # Find the total number of DISTINCT words in the speech
    total_distinct_words = 

    # Return the word counts
    return (total_words, total_distinct_words)


To test your solution:

In [None]:
speech_name = '2009-Obama.txt'
(tokens,types) = ex1(speech_name)
print("Total words in %s: %s"%(speech_name,tokens))
print("Total distinct words in %s: %s"%(speech_name,types))
#The correct solutions should be 2726 and 900 respectively. 
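The difference between tokens and types can be illustrated on a toy list before running anything on the real speech. This is only a sketch: `toy_tokens` is a made-up list standing in for the output of `inaugural.words(...)`, and the approach shown (`len` and `set`) is one possible way to get the two counts, not necessarily the one used in the provided solutions.

```python
# Made-up token list standing in for the output of inaugural.words(...)
toy_tokens = ["we", "the", "people", "of", "the", "united", "states", ",", "we"]

# Tokens: every item in the list counts, including repeats and punctuation
total_tokens = len(toy_tokens)

# Types: a set keeps exactly one copy of each distinct item
total_types = len(set(toy_tokens))

print(total_tokens, total_types)  # 9 7
```

Note that `set` is case-sensitive, so 'The' and 'the' would count as two different types unless the tokens are lowercased first.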

## 3. Word lengths
Next, you will create a function that takes a document name as input and returns the average length of the distinct words (word types) in the document. For instance, if the document was "I love icecream", the average length of its words is: (1+4+8)/3 = 4.33
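The arithmetic above can be reproduced in a few lines of plain Python. This is a minimal sketch on a made-up token list (`toy_tokens` is an illustrative name), assuming the average is taken over distinct words, as the exercise comments ask:

```python
# Made-up token list for the example sentence "I love icecream"
toy_tokens = ["I", "love", "icecream"]

# One length per DISTINCT word; set() removes any repeated words first
distinct_lengths = [len(w) for w in set(toy_tokens)]

# Average = sum of the lengths divided by how many distinct words there are
avg_length = sum(distinct_lengths) / len(distinct_lengths)

print(f"{avg_length:.2f}")  # 4.33
```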

In [None]:
def ex2(doc_name):
    doc_words = inaugural.words(doc_name)

    # Construct a list that contains the word lengths for each DISTINCT word in the document
    distinct_word_lengths = 

    # Find the average word type length
    avg_word_length = 

    # Return the average word type length of the document
    return avg_word_length

To test your solution:

In [None]:
speech_name = '2009-Obama.txt'
result2 = ex2(speech_name)
print("Average word type length for %s: %.3f"%(speech_name,result2))
#The correct solution is 6.065

### Frequency Distribution
A frequency distribution records the number of times each entity/outcome occurs. For example, a frequency distribution could be used to record the number of times each word appears in a document:
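To see the idea on a small scale, here is a sketch that uses Python's built-in `collections.Counter` on a made-up token list. In NLTK 3, `nltk.FreqDist` is built on top of `Counter`, so lookups and `most_common` behave the same way; `toy_tokens` and `fd_toy` are illustrative names only.

```python
from collections import Counter

# Made-up token list; note the mixed case of "Peace" and "peace"
toy_tokens = ["Peace", "and", "hope", "and", "peace", "and", "work"]

# Count the lowercased tokens so "Peace" and "peace" are merged
fd_toy = Counter(w.lower() for w in toy_tokens)

print(fd_toy.most_common(2))  # [('and', 3), ('peace', 2)]
print(fd_toy["peace"])        # 2
print(fd_toy["war"])          # 0, unseen words simply have a count of zero
```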

In [None]:
# Obtain the words from Barack Obama’s 2009 speech
obama_words = inaugural.words('2009-Obama.txt')
# Construct a frequency distribution over the lowercased words in the document
fd_obama_words = nltk.FreqDist(w.lower() for w in obama_words)
# Find the top 50 most frequently used words in the speech
fd_obama_words.most_common(50)

You can easily plot the top 50 words. Note that __%matplotlib inline__ tells Jupyter to embed plots in the output cell after you run the code; you only need to run it once per notebook, not in every cell with a plot.

In [None]:
%matplotlib inline
fd_obama_words.plot(50)

You can also find out how many times specific words appear. E.g. how many times were the words "peace" and "america" used in the speech?

In [None]:
print('peace:', fd_obama_words['peace'])
print('america:', fd_obama_words['america'])

## 4. Answering interesting questions based on word frequencies

Compare the top 50 most frequent words in Barack Obama’s 2009 speech with George Washington’s 1789 speech. What can word frequencies tell us about different speeches at different times in history?

In [None]:
def ex3(doc_name, x):
    doc_words = inaugural.words(doc_name)
    
    # Construct a frequency distribution over the lowercased words in the document
    fd_doc_words = 

    # Find the top x most frequently used words in the document
    top_words = 

    # Return the top x most frequently used words
    return top_words

Now test your solution:

In [None]:
### Now test your code
print("Top 50 words for Obama's 2009 speech:")
result3a = ex3('2009-Obama.txt', 50)
print(result3a)
print("Top 50 words for Washington's 1789 speech:")
result3b = ex3('1789-Washington.txt', 50)
print(result3b)
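Once both top-word lists are available, one possible way to compare them is to split the words into those shared by both speeches and those specific to each. The sketch below assumes each list has the `(word, count)` shape returned by `most_common`; `compare_top_words` is a hypothetical helper, and the toy lists are made up, not real results from the corpus.

```python
def compare_top_words(top_a, top_b):
    """Split two (word, count) lists into shared and speech-specific words."""
    words_a = {word for word, _ in top_a}
    words_b = {word for word, _ in top_b}
    return {
        "shared": sorted(words_a & words_b),   # words in both lists
        "only_a": sorted(words_a - words_b),   # words only in the first list
        "only_b": sorted(words_b - words_a),   # words only in the second list
    }

# Made-up toy lists, NOT real results from the inaugural corpus
toy_top_a = [("the", 10), ("and", 7), ("nation", 3)]
toy_top_b = [("the", 12), ("of", 9), ("duty", 2)]

comparison = compare_top_words(toy_top_a, toy_top_b)
print(comparison)
```

With real data you would pass in `result3a` and `result3b`. Very common function words will likely dominate the shared group, so the speech-specific groups are usually the more interesting ones to read.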

If you completed all the exercises correctly - __Well Done!__ Check your solutions against the ones provided on Moodle. If you'd like to further develop your Python/NLTK skills, you can load and look into Donald Trump's inaugural address: https://www.whitehouse.gov/briefings-statements/the-inaugural-address/ 
Do you notice any patterns?