<a href="https://colab.research.google.com/github/moO0lk/LING227/blob/main/02_Introduction_to_NLTK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Relevant readings

[NLTK Book, Chapter 1, Section 1.1, 1.2 and 1.3](https://www.nltk.org/book/ch01.html)

# Getting started with Natural Language Toolkit (NLTK)

One of the primary libraries we will use is NLTK. The authors of NLTK have made a free textbook available, and this course draws on some of its material. This notebook covers how to get NLTK up and running in a Colab notebook.

Sections 1.1 and 1.2 of Chapter 1 in the NLTK Book contain instructions for how to install Python and NLTK on your computer. If you are using Google Colaboratory (Colab), these instructions do not apply. However, you are still encouraged to read these sections for a better understanding of how Python and NLTK work outside of Google Colab.

*A few other notes*
- Google Colab does not have the `>>>` prompt mentioned in the NLTK book - this is replaced by the code cells in Colab.
- The `generate()` function will not work.
- There are likely many other small things that won't work because of updates to Python/NLTK and/or the use of Google Colab, such as plotting and other functions which come later in the book. These also change over the years, so things I think still work might not work later on, and things I currently provide workarounds for might no longer be problems. So bear with me!

If there is a problem, don't worry about it for now. The NLTK book was written before notebooks were as widespread as they are now.

## Accessing NLTK on Google Colab

- NLTK is already pre-installed in Google Colab. But NLTK requires a lot of additional resources which we need to download. In section 1.2 of Chapter 1, the NLTK book explains that to access these resources one should use `nltk.download()`. We will use the same function but will not see the graphical downloader shown in the book.

- For instance, one of the very first lessons in NLTK section 1.2 asks you to use `from nltk.book import *`, which means import everything from `nltk.book` (the `*` means everything). However, you will get an error if you try this in Colab because the `book` data has not yet been downloaded.

- When you do not have the right resource to run a particular NLTK function, you will see an error which looks like this:
>> ![error](https://drive.google.com/uc?id=1nt76M0KbiLueTYHb72HSFnREC9_1DsM4)

- If you see this error, don't panic! It just means you are missing a specific resource. In this example, the part that I selected in yellow is what is missing - in this case it is the resource `stopwords`. All you need to do is ask Colab to download the resource using the `nltk.download()` function. Because the Colab notebook is running on a temporary server, you will need to repeat this each time you connect to a new session. Fortunately, it does not take very long to download the data.

You can specify which resources you need by passing each resource as a `string` inside a single `list` (a data container which is delimited by square brackets `[]`). We will talk about lists in more detail later, so you can simply run the code cell below to use this notebook.

In the example below, I include a wide range of specific resources which you will need for the first chapter of NLTK.

```
# import the main nltk module so that nltk.download() is available
import nltk

# define a list of resources and save to variable
nltk_resources = ['gutenberg', 'genesis', 'inaugural', 'nps_chat', 'webtext',
                  'treebank', 'stopwords', 'punkt', 'brown', 'reuters', 'udhr',
                  'words', 'names', 'cmudict', 'swadesh', 'wordnet', 'state_union']

# Pass the list to nltk.download(), which will then download each resource
nltk.download(nltk_resources)
```
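
If you want to check which resources are already present before downloading anything, one option is a small helper like the one below. This is a minimal sketch and not something from the NLTK book: `ensure_resources` is a hypothetical helper name, and `fake_find` and `fake_download` are stand-ins written for this illustration so the logic can be tried without a network connection. The sketch assumes the lookup function raises `LookupError` when a resource is missing, which is how `nltk.data.find()` behaves.

```python
def ensure_resources(resources, find, download):
    """Download every resource that `find` cannot locate.

    `resources` maps a lookup path (e.g. 'corpora/stopwords') to the
    name passed to the downloader (e.g. 'stopwords').
    Returns the list of names that had to be downloaded.
    """
    downloaded = []
    for path, name in resources.items():
        try:
            find(path)  # raises LookupError if the resource is missing
        except LookupError:
            download(name)
            downloaded.append(name)
    return downloaded


# Stand-ins for nltk.data.find and nltk.download (hypothetical, for illustration only)
installed = {'corpora/gutenberg'}

def fake_find(path):
    if path not in installed:
        raise LookupError(path)

def fake_download(name):
    installed.add('corpora/' + name)

needed = {'corpora/gutenberg': 'gutenberg', 'corpora/stopwords': 'stopwords'}
print(ensure_resources(needed, fake_find, fake_download))
```

In Colab, after `import nltk`, the same helper could be called as `ensure_resources(needed, nltk.data.find, nltk.download)`.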

Once you have sorted out your ability to access NLTK resources, you are ready to go through the rest of the notebook lessons.

In [None]:
# import the main nltk module
import nltk

# create a list of resources we will need for this notebook
nltk_resources = ['gutenberg', 'genesis', 'inaugural', 'nps_chat', 'webtext', 'treebank', 'stopwords', 'punkt', 'brown', 'reuters', 'udhr', 'words', 'names', 'cmudict', 'swadesh', 'wordnet', 'state_union']

# download them (here I pass the variable name to the function, which contains all the strings of the resources I want.)
nltk.download(nltk_resources)

# Searching Text

Chapter 1 asks you to load in a series of books and corpora which are stored in NLTK as examples. Take a moment to look at the names of the files - some are single books and movie scripts, while others are different corpora. What is the difference between the two? A book is a single stand-alone text, whereas a corpus is a collection of text drawn from similar texts/documents, which can be longer or shorter than a single book.

In [None]:
# import everything from nltk.book
# the * stands as a wildcard which means "anything"
from nltk.book import *

## Concordance Lines

NLTK starts off with concordances, a method from corpus linguistics for analysing the meanings and/or functions of different words in texts. This method is fundamental for performing corpus analysis because a concordance lets you see a single word or pattern of words **in the context** of the surrounding words.
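
To make the idea concrete, here is a minimal sketch of a concordance in plain Python. It is not how NLTK implements `.concordance()`; `simple_concordance`, the `window` parameter, and the example sentence are all made up for this illustration. The function finds each occurrence of a word and returns it with a few words of context on either side.

```python
def simple_concordance(tokens, target, window=3):
    """Return each occurrence of `target` with `window` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            # mark the matched word with square brackets
            lines.append(f'{left} [{token}] {right}'.strip())
    return lines


# A made-up sentence, split into word tokens on whitespace
tokens = 'the green knight met the black knight near the green hill'.split()

for line in simple_concordance(tokens, 'green', window=2):
    print(line)
```

Each printed line shows one occurrence of the search word in its immediate context, which is the same kind of view the NLTK method gives you for a full text.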

The example in the NLTK book is to search for the word `monstrous` in *Moby Dick*.

Here are some other words I searched for based on my own guesses about what you might find in the different corpora. Feel free to play around and look for your own words.

In [None]:
# search for "color" in Holy Grail
text6.concordance('color')

In [None]:
# search for "like" in webchat
text5.concordance('like')

In [None]:
# search for 'looking' in the personals corpus
text8.concordance('looking')

# Comparing words

What you probably noticed is that words are used in specific ways in different corpora. This relatively simple analysis has thus already told us something about the way language works: context determines the way words are used and understood. Corpus linguistic analysis is thus a crucial way to gain a better understanding of word meanings and language use.


The `.similar()` and `.common_contexts()` functions from the NLTK texts allow you to find words that are used in the same contexts. This means words which occur before/after the same other words. For example, in the following two sentences the words "truck" and "apple" are similar because they both occur after the word "red":

- A big red truck
- A big red apple
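
The idea of a shared context can be sketched in plain Python. This is a simplified illustration, not NLTK's actual implementation (which also ranks the results): `contexts_by_word` and `words_in_shared_contexts` are hypothetical helper names, and here a "context" is assumed to be the pair of words immediately before and after a word.

```python
from collections import defaultdict


def contexts_by_word(tokens):
    """Map each word to the set of (previous word, next word) pairs around it."""
    contexts = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((prev, nxt))
    return contexts


def words_in_shared_contexts(tokens, target):
    """Return the words that share at least one context with `target`."""
    contexts = contexts_by_word(tokens)
    target_contexts = contexts.get(target, set())
    return sorted(
        word for word, ctx in contexts.items()
        if word != target and ctx & target_contexts
    )


# Made-up example: 'truck' and 'apple' both occur between 'red' and 'is'
tokens = 'the red truck is big . the red apple is big .'.split()
print(words_in_shared_contexts(tokens, 'truck'))
```

In this toy text, only "apple" appears in the same slot as "truck", so it is the only word reported.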


Try testing the word "hello" using the `.similar()` method in different corpora. The output that you see represents words that are used in similar contexts as your input word.


In [None]:
# which words occur in the same contexts as 'green' in Monty Python?
text6.similar('green')

Let's verify this by looking at the concordance lines for those words...we can see that both "green" and "black" occur before the word "knight". While interesting, the `.similar()` function is admittedly difficult to use in a meaningful way - we will get to collocations and word probabilities later on to see better ways of capturing these relationships.

In [None]:
text6.concordance('green')

In [None]:
text6.concordance('black')

The `.common_contexts()` method allows you to compare two words in the same text. The book uses the example of testing `monstrous` and `very` in Moby Dick. The book asks you to pick two words of your own and compare them. If you're like me, you will see that many times you have no results because the words you searched for do not occur in the corpus or do not have any common contexts - this further demonstrates how we can predict the types of words based on our knowledge of the corpus.

Please note that the two words you are searching for are placed inside square brackets `[]`, and separated by a comma. This is a `list`, which we will learn about in an upcoming notebook.
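
As a rough sketch of what "common contexts" means, the plain-Python function below collects the word before and the word after each occurrence of two words and keeps only the pairs they share. `shared_contexts` is a hypothetical name used for this illustration and is not NLTK's code; the `left_right` output format is chosen to resemble what the NLTK method prints.

```python
def shared_contexts(tokens, word_a, word_b):
    """Return the left_right contexts in which both words occur."""
    def contexts(word):
        # (previous word, next word) for every occurrence that has both neighbours
        return {
            (tokens[i - 1], tokens[i + 1])
            for i in range(1, len(tokens) - 1)
            if tokens[i] == word
        }

    shared = contexts(word_a) & contexts(word_b)
    return sorted(f'{left}_{right}' for left, right in shared)


# Made-up example sentence
tokens = 'the green knight fought the black knight'.split()
print(shared_contexts(tokens, 'green', 'black'))
print(shared_contexts(tokens, 'green', 'fought'))
```

The second call returns an empty list, which corresponds to the "no results" case described above: both words occur in the text, but never between the same two neighbours.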

In [None]:
# test common contexts with 'green' and 'black'
text6.common_contexts(['green', 'black'])

In [None]:
# do hello and goodbye have similar contexts in the webchat corpus?
text5.common_contexts(['hello', 'goodbye'])

In [None]:
# what about 'hi' and 'bye' in the webchat corpus?
text5.common_contexts(['hi', 'bye'])

In [None]:
# how about "hi" and "bye" in the personals?
text8.common_contexts(['hi', 'bye'])

In [None]:
# these words in moby dick?
text1.common_contexts(['white', 'whale'])

If we wanted to make sure the words are actually in the corpus before running the function, we could use the `in` operator to check the result of the corpus's `.vocab()` method, which is a frequency distribution that maps each word in the corpus to the number of times it occurs. We will learn more about conditional statements later on.
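
The membership check can be tried out on a small scale with Python's built-in `collections.Counter`, which NLTK's frequency distribution is built on. This is a sketch with a made-up sentence rather than one of the NLTK texts, so the variable names and the data are assumptions for illustration only.

```python
from collections import Counter

# Made-up example text, split into word tokens
tokens = 'the green knight fought the black knight'.split()
vocab = Counter(tokens)  # maps each word to how often it occurs

print('knight' in vocab)  # membership test on the words
print(vocab['knight'])    # how many times the word occurs
print('dragon' in vocab)

# Only look a word up if it is actually in the vocabulary
word = 'dragon'
if word in vocab:
    print(word, 'occurs', vocab[word], 'times')
else:
    print(word, 'is not in this text')
```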

In [None]:
# check if a word is in the corpus text
'like' in text5.vocab()

In [None]:
'knight' in text6.vocab()

### **Your Turn**

1. Spend a few moments using the `.concordance()` function to search for different words in the texts. See if you can find any interesting examples and share with the class.  

2. After looking through some concordances, play with the `.similar()` and `.common_contexts()` functions to see if you can find words used in similar contexts.

3. What can this analysis tell us, if anything, about the nature of the different texts?



In [None]:
#1
text4.concordance("while")
len(text4)

In [None]:
#2 similar
text6.similar("tho")

In [None]:
#2 common contexts
text4.common_contexts(["the", "how"])

3.
