<a href="https://colab.research.google.com/github/moO0lk/LING227/blob/main/13_creating_NLTK_corpora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating your own NLTK corpus

Regardless of how you get your data into Colab, you can use the NLTK library to make your own version of the NLTK corpora.

There are two ways to do this. The first is to read in a bunch of texts as one single corpus. To do this, we use the `PlaintextCorpusReader` class from NLTK.

In order to use it, we need three things:

1. some files,
2. a filepath which leads to the files, and
3. the names of the files.

Let's start with some data! The following cell will download several scripts from the American television show *Seinfeld*. The zip file will be downloaded into the notebook environment.



In [None]:
!wget 'https://github.com/scskalicky/LING-226-vuw/raw/main/other-data/seinfeld.zip'

Next we need to extract (unzip) the contents of the zip file into our notebook environment. The following cell does so using the `!unzip` command. The `-d` flag tells `unzip` to extract into a named directory, which makes working with the data easier. So the command `!unzip` will be run on the `seinfeld.zip` file, and the results will be placed in a new directory named `seinfeld`.


In [None]:
!unzip "seinfeld.zip" -d "seinfeld"
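
If `unzip` is not available (for example, when running outside Colab), Python's built-in `zipfile` module can do the same job. The sketch below is an alternative to the shell command, not what the cell above runs. The `extract_zip` helper is a hypothetical name, and the code builds a tiny throwaway archive so that it runs on its own.

```python
import os
import tempfile
import zipfile

def extract_zip(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()

# Build a tiny throwaway archive so the sketch runs without the download.
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, 'demo.zip')
with zipfile.ZipFile(demo_zip, 'w') as archive:
    archive.writestr('THE DEMO_cleaned.txt', 'JERRY: Hello.')

extracted = extract_zip(demo_zip, os.path.join(work_dir, 'demo'))
print(extracted)
```

For the Seinfeld data, the equivalent call would be `extract_zip('seinfeld.zip', 'seinfeld')`.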

In [None]:
#corpus_root = '/content/drive/MyDrive/seinfeld'

Next, we'll load in the corpus reader class from NLTK

In [None]:
# import the module to read in plain text
from nltk.corpus import PlaintextCorpusReader

As well as some other required NLTK resources

In [None]:
# import the NLTK library
import nltk

# download resources
nltk_resources = ['gutenberg', 'punkt_tab', 'brown', 'state_union']

nltk.download(nltk_resources)

Now, we need to create a new variable using the `PlaintextCorpusReader` class, which will become our corpus!

The first argument is `root`, which allows us to specify where the corpus lives. In this case, the corpus is within the `seinfeld` folder.

The second argument is `fileids` which asks for the list of files. The files in the seinfeld folder are:

```
THE BOYFRIEND PT 1_cleaned.txt
THE BOYFRIEND PT 2_cleaned.txt
THE CHINESE RESTAURANT_cleaned.txt
THE DEALERSHIP_cleaned.txt
THE DOODLE_cleaned.txt
THE ENGLISH PATIENT_cleaned.txt
THE FACE PAINTER_cleaned.txt
THE GOOD SAMARITAN_cleaned.txt
THE JUNIOR MINT_cleaned.txt
THE LITTLE KICKS_cleaned.txt
THE MARINE BIOLOGIST_cleaned.txt
THE PARKING GARAGE_cleaned.txt
THE PARKING SPACE_cleaned.txt
THE PEZ DISPENSER_cleaned.txt
```

Let's try it out on a single file to start.


In [None]:
# read in my text (I've passed the name in a list, so I could include more than one text later if I need to)
marine_biologist_corpus = PlaintextCorpusReader(root = 'seinfeld/', fileids = ['THE MARINE BIOLOGIST_cleaned.txt'])

Now that we've created a corpus (even if it is just one text), we can use the built-in NLTK corpus functions.

In [None]:
# The raw version should be just the string
marine_biologist_corpus.raw()[15041:15135]

In [None]:
# we can also get sentences
marine_biologist_corpus.sents()

If you remember from the first part of the NLTK book, the authors used functions like `.concordance()` on the built-in data. We can do the same with our data, but we need to wrap the tokenized words in an NLTK class called `Text()`.

In [None]:
# Create a special Text version of the corpus
from nltk.text import Text
mb_txt = Text(marine_biologist_corpus.words())

In [None]:
# now we can look for concordance lines
mb_txt.concordance('GEORGE')

In [None]:
mb_txt.concordance('whale')

### Loading in multiple texts to make a corpus

A corpus of a single text is not very interesting. Let's update our `PlaintextCorpusReader` to include all of the texts in our Seinfeld folder. But, it sure would be annoying having to type all of the filenames one-by-one. Fortunately, there are several ways around this.

One is to use the [`glob` library](https://docs.python.org/3/library/glob.html) to quickly collect all of the filenames in a directory. The `glob` function makes it easy to save all of the filenames from a directory into a variable.

In [None]:
# import the function which is the same name as the module
from glob import glob

# the * indicates you want everything from the folder.
# we can use more intelligent ways to select only certain files, we'll see this later with regex
filenames = glob('/content/seinfeld/*')

filenames
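
As a small preview of being more selective, a pattern such as `*.txt` keeps only files with that extension, and `sorted()` puts the results in a predictable order (`glob` does not promise one). The sketch below uses a temporary folder with made-up file names so that it runs anywhere. In the notebook you would point the pattern at `/content/seinfeld/*.txt` instead.

```python
import os
import tempfile
from glob import glob

# Make a throwaway folder with two text files and one file we do not want.
demo_dir = tempfile.mkdtemp()
for name in ['THE DOODLE_cleaned.txt', 'THE DEALERSHIP_cleaned.txt', 'notes.csv']:
    with open(os.path.join(demo_dir, name), 'w') as handle:
        handle.write('placeholder')

# *.txt matches only names ending in .txt; sorted() fixes the order.
txt_files = sorted(glob(os.path.join(demo_dir, '*.txt')))
print(txt_files)
```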

Doing this gives us the entire filepath, which doesn't really hurt us but is also kind of annoying. We could easily remove this using slicing. Because the part that we want to remove is always the same (i.e., the `/content/seinfeld/` part), we could just slice that part off from each filename. All we need to know is where to start the slice.

In [None]:
# starting at 18 gives us the episode name only.
filenames[1][18:]

In [None]:
# let's write a list comprehension which removes the start of each filename
filenames_short = [name[18:] for name in filenames]

# voila!
filenames_short
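
Slicing at 18 only works while the folder path stays exactly 18 characters long. A more robust alternative, shown here as a sketch rather than as the notebook's method, is `os.path.basename`, which drops everything up to the last slash regardless of the path length. The example paths are typed in by hand so that the cell runs without the data.

```python
import os

# Hand-typed example paths standing in for the output of glob.
example_paths = [
    '/content/seinfeld/THE DOODLE_cleaned.txt',
    '/content/drive/MyDrive/seinfeld/THE PEZ DISPENSER_cleaned.txt',
]

# basename keeps only the final component of each path.
example_short = [os.path.basename(path) for path in example_paths]
print(example_short)
```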

Now we can just pass `filenames_short` to the `PlaintextCorpusReader` class and make a larger corpus. I tested it and it will also work without cleaning the filepath we get from `glob`, but this is nice because we remove the clutter.

In [None]:
# make our seinfeld corpus
seinfeld_corpus = PlaintextCorpusReader(root = 'seinfeld/', fileids = filenames_short)

In [None]:
# we can use the fileids function to see the texts in here
seinfeld_corpus.fileids()

In [None]:
# what are the ten most common words in our corpus?
from nltk import FreqDist
FreqDist(seinfeld_corpus.words()).most_common(10)
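
The most common tokens in a count like this are often punctuation marks and differently capitalised variants of the same word. One way to get a cleaner list, sketched here with the standard library's `Counter` on a hand-made token list (the tokens are invented for illustration), is to lowercase everything and keep only alphabetic tokens before counting. The same filter could be applied to `seinfeld_corpus.words()` before passing it to `FreqDist`.

```python
from collections import Counter

# Invented tokens standing in for the output of a corpus .words() call.
tokens = ['JERRY', ':', 'The', 'whale', '!', 'the', 'Whale', '.', 'the']

# Lowercase each token and drop anything that is not purely alphabetic.
clean_tokens = [t.lower() for t in tokens if t.isalpha()]

top_words = Counter(clean_tokens).most_common(2)
print(top_words)
```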

In [None]:
# and I can search for concordances, neat!
Text(seinfeld_corpus.words()).concordance('apartment')

### Loading data from your Google Drive folders

If you have data within your Google Drive you want to use, you just need to amend the above code to point at those folders. This means your corpus roots will be something like `/content/drive/MyDrive/...`

You'll also need to mount the drive!
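
Mounting is done with the `google.colab` package, which is only available inside Colab. A minimal sketch follows. The `try`/`except` guard and the `drive_mounted` variable are additions so that the cell does not crash if you run it somewhere else.

```python
# Mount Google Drive at /content/drive (only works inside Colab).
try:
    from google.colab import drive
    drive.mount('/content/drive')
    drive_mounted = True
except ImportError:
    # Not running in Colab, so there is nothing to mount.
    drive_mounted = False

print('Drive mounted:', drive_mounted)
```

After mounting, your Drive files appear under `/content/drive/MyDrive/`.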

# Creating Your Own Categorized Corpus

The next type of corpus you can make is a categorised corpus, which will allow you to compare groups of files within your corpus.

In order to do so, we need some text files, and we also need a way to indicate what genre/category we would like those files to belong to. The NLTK authors do this by extracting information (i.e., metadata) from the filenames.

As an example, let's use some data from a [paper I published in 2015.](https://europeanjournalofhumour.org/index.php/ejhr/article/view/68)

In this paper, I analysed the linguistic properties of product reviews written for the American retail website Amazon.com. I was interested in two types of reviews: legitimate reviews and satirical/funny reviews.

The data lives here: [Amazon Data](https://github.com/scskalicky/LING-226-vuw/blob/main/other-data/amazon%20reviews.zip)

We can again use `!wget` and `!unzip` to load in a zip file and save to the notebook environment. Run the code cell below to download and unzip the data into the notebook:



In [None]:
# download the data
!wget 'https://github.com/scskalicky/LING-226-vuw/raw/main/other-data/amazon%20reviews.zip'

Now unzip the data into the environment:


In [None]:
!unzip "amazon reviews.zip" -d "amazon reviews"

In the folder are 375 normal reviews and 375 satirical reviews.

The name of each file looks like this:
```
001-5-satire.txt
002-2-normal.txt
```

The first three digits are the ID number, ranging from 001 to 375. The second number (between the two `-`) is the star rating of the review, from 1 to 5. The word `satire` or `normal` indicates whether the review was a normal review or a satirical, funny review.
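
Because the three parts are always separated by `-`, plain string methods are enough to pull them apart. The helper below is a hypothetical illustration (the name `parse_review_name` is not part of NLTK) and works on the two example names shown above.

```python
def parse_review_name(filename):
    """Split a name like '001-5-satire.txt' into (id, stars, review type)."""
    stem = filename.rsplit('.', 1)[0]  # drop the .txt extension
    review_id, stars, review_type = stem.split('-')
    return int(review_id), int(stars), review_type

for name in ['001-5-satire.txt', '002-2-normal.txt']:
    print(name, '->', parse_review_name(name))
```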

We can exploit this information to make categories in our corpus. Just as the authors of NLTK sliced the year from the filename to examine change over time, we can do the same thing with these filenames to get different categories.




In [None]:
# first we will load in the Corpus Reader and define the location of our texts
import nltk
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader

# set the corpus location to point to wherever it is you saved the data
# (you may need to mount your Drive to the notebook)
corpus_location = '/content/amazon reviews'

Now, to use the filenames as categories, we will use a regular expression (regex) to capture the `normal` or `satire` portion of each filename using this pattern:

```
.*(......).txt
```

This pattern captures whatever is inside the parentheses `()`: here, the last six characters before `.txt`.

It corresponds to:

```
001-5-(satire).txt
002-2-(normal).txt
```
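
You can check what the pattern captures with Python's `re` module before handing it to NLTK. This is only a sketch on the two example names. It assumes the corpus reader applies the pattern from the start of each filename and uses the first group as the category, which is what `re.match(...).group(1)` imitates. The second pattern, `.*-(\w+)\.txt`, and the file name `001-5-funny.txt` are hypothetical and not part of the original data.

```python
import re

cat_pattern = '.*(......).txt'
names = ['001-5-satire.txt', '002-2-normal.txt']

# group(1) is whatever the parentheses captured.
captured = [re.match(cat_pattern, name).group(1) for name in names]
print(captured)

# The six-character pattern depends on the category length.
# A hypothetical five-letter category picks up the hyphen as well:
too_wide = re.match(cat_pattern, '001-5-funny.txt').group(1)

# A hypothetical alternative that captures everything after the last hyphen:
flexible = re.match(r'.*-(\w+)\.txt', '001-5-funny.txt').group(1)
print(too_wide, flexible)
```

The original pattern works here because `satire` and `normal` both happen to be six letters long.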

Try it out:

In [None]:
# create a categorised corpus
amz_corpus = CategorizedPlaintextCorpusReader(root = corpus_location, fileids = '.*', cat_pattern = '.*(......).txt')

# you can check the categories
amz_corpus.categories()

In [None]:
# and we still have our fileids
amz_corpus.fileids()
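
The same filenames also carry the star rating, so a different `cat_pattern` could group the reviews by stars instead. The sketch below is an illustration rather than part of the original analysis. It writes two tiny made-up reviews to a temporary folder so that it runs without the download, and the pattern `\d+-(\d)-.*\.txt` is an assumption about how to capture the middle digit.

```python
import os
import tempfile
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader

# Two made-up reviews that follow the id-stars-type naming scheme.
demo_root = tempfile.mkdtemp()
demo_reviews = {
    '001-5-satire.txt': 'This banana slicer changed my life.',
    '002-2-normal.txt': 'The handle broke after a week.',
}
for name, text in demo_reviews.items():
    with open(os.path.join(demo_root, name), 'w', encoding='utf8') as handle:
        handle.write(text)

# Capture the single digit between the two hyphens as the category.
stars_corpus = CategorizedPlaintextCorpusReader(
    root=demo_root, fileids=r'.*\.txt', cat_pattern=r'\d+-(\d)-.*\.txt'
)

print(stars_corpus.categories())
print(stars_corpus.fileids(categories='5'))
```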

Now that we've made our corpus, we can create CFD tabulations and plots just like the NLTK book did for the Brown corpus.

Let's compare different words between the satirical and regular reviews.



In [None]:
# Create a CFD of the amazon corpus
# I am using the same code as the one for Brown with two modifications:
# I have replaced "genre" with "review_type"
# I lowercase the words in the corpus
amz_cfd = nltk.ConditionalFreqDist(
    (review_type, word)
    for review_type in amz_corpus.categories()
    for word in [w.lower() for w in amz_corpus.words(categories = review_type)]
)

In [None]:
# let's ask for some specific words
pronouns = ['i', 'me', 'you', 'my', 'yours', 'them']

# then tabulate them
amz_cfd.tabulate(conditions = ['normal', 'satire'], samples = pronouns, cumulative = True)

The counts are interesting but not really helpful without being normalised somehow. Let's plot the data using the `percents = True` argument to convert the counts into percentages within each category. These allow us to make fairer comparisons between the two categories.

In [None]:
# we can also plot this.
amz_cfd.plot(conditions = ['normal', 'satire'], samples = pronouns, cumulative = False, percents = True)

What do you see in the plot? Does any one category have any more/less of a particular word?

We can try this out using any number of target words:

In [None]:
# what about some other words?
emotions = ['good', 'bad', 'happy', 'sad', 'love', 'sweet', 'hurt', 'ugly', 'nasty']
amz_cfd.tabulate(conditions = ['normal', 'satire'], samples = emotions, cumulative = True)

In [None]:
amz_cfd.plot(conditions = ['normal', 'satire'], samples = emotions, cumulative = False, percents = True)
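
To make the idea behind `percents = True` concrete, here is the normalisation done by hand with the standard library. The counts are invented, and the sketch assumes each count is divided by the total number of words in its own category. Check the NLTK documentation for your version if the exact denominator matters.

```python
from collections import Counter

# Invented word counts for two categories of different sizes.
counts = {
    'normal': Counter({'i': 30, 'you': 10, 'other': 160}),
    'satire': Counter({'i': 60, 'you': 40, 'other': 300}),
}

def as_percent(counter, word):
    """Return the share of `word` among all tokens in one category."""
    total = sum(counter.values())
    return 100 * counter[word] / total

for review_type, counter in counts.items():
    print(review_type, round(as_percent(counter, 'i'), 1))
```

In this invented example the satire category has twice as many occurrences of `i`, but also twice as many words overall, so the two percentages come out the same.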

We can also wrap the words from our corpus in `Text` so that we can look for concordances.

In [None]:
# Wrap the whole set of words to look at all concordances
nltk.text.Text(amz_corpus.words()).concordance('terrible')

In [None]:
# we can also look at concordances for just one category to compare them
# the word "banana" is strongly associated with the satire corpus
nltk.text.Text(amz_corpus.words(categories = 'satire')).concordance('banana')

In [None]:
# but only occurs once in the non-satire corpus.
nltk.text.Text(amz_corpus.words(categories = 'normal')).concordance('banana')

## A more complex method to create categories

> You don't really need to worry about this unless you think you need a more complex corpus / category structure.

You might not want to use part of the file names to create a categorized corpus reader in NLTK (in fact, doing so requires you to have a specific naming convention and potentially use more complicated regex syntax to parse the names). You might have a bunch of files, maybe even in different folders. Rather than trying to find a consistent way to put metadata into the filename, you can instead supply a file that provides the metadata for your corpus.



This file should contain a list of all the paths for your texts, followed by the category you want to assign to those texts.

For example the text file could look like this:

```
folder1/file1.txt categoryA
folder1/file2.txt categoryA
folder2/file3.txt categoryB

```

You can save this as a `.txt` file named something like `categories.txt`.
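
If your categories correspond to folders, you do not have to type this file by hand. The sketch below assumes, purely for illustration, that each text's parent folder name is its category. The folder and file names are invented and created in a temporary directory.

```python
import tempfile
from pathlib import Path

# Build a small invented corpus: two folders, three text files.
corpus_root = Path(tempfile.mkdtemp()) / 'corpus'
for rel_path in ['news/file1.txt', 'news/file2.txt', 'blogs/file3.txt']:
    target = corpus_root / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text('placeholder text')

# One line per file: path relative to the corpus root, a space, the category.
lines = [
    f'{path.relative_to(corpus_root).as_posix()} {path.parent.name}'
    for path in sorted(corpus_root.rglob('*.txt'))
]

# Save the mapping next to (not inside) the corpus folder.
categories_file = corpus_root.parent / 'categories.txt'
categories_file.write_text('\n'.join(lines) + '\n')
print(categories_file.read_text())
```

Because a space separates the path from the category, file names that themselves contain spaces are likely to cause trouble with this format.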

Then, you pass the location of this file to the `cat_file` argument of the `CategorizedPlaintextCorpusReader` class.

Here is an example.

First download a sample corpus.


In [None]:
!wget 'https://github.com/scskalicky/LING-226-vuw/raw/main/other-data/corpus.zip'

In [None]:
# unzip the corpus into the content folder of the notebook environment
!unzip 'corpus.zip' -d '/content'

You will also need your file which maps the categories, ideally outside the folder that your corpus is in. I download this one to the main notebook environment.

In [None]:
!wget 'https://github.com/scskalicky/LING-226-vuw/raw/main/other-data/categories.txt'

Inspect the contents of the categories file: You can see it lists the file name of each file, followed by a space and the category for that file.



In [None]:
with open('categories.txt') as categories_file:
    print(categories_file.read())
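
The same mapping can be read into a dictionary with plain Python, which is handy for checking it before NLTK uses it. This sketch parses a hard-coded string in the format shown earlier. The paths and category names are the placeholder ones from the example above, not the contents of the downloaded file.

```python
from collections import defaultdict

# Placeholder mapping in the "path category" format.
mapping_text = """folder1/file1.txt categoryA
folder1/file2.txt categoryA
folder2/file3.txt categoryB
"""

files_by_category = defaultdict(list)
for line in mapping_text.splitlines():
    if not line.strip():
        continue  # skip blank lines
    # Split on the first space: the path comes first, the category second.
    path, category = line.split(' ', 1)
    files_by_category[category.strip()].append(path)

print(dict(files_by_category))
```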

Now load in the categorized corpus class from NLTK


In [None]:
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader

Declare your corpus and give the corpus reader three pieces of information:

1. The root folder for your corpus. In Colab, paths start at `/content`, and the folder we unzipped is called `corpus`, so our root is `/content/corpus/`

2. The names of the files, which will be all the `.txt` files in the folders

3. The name of the `.txt` file containing the mapping of file names to categories. The `..` is there to tell the function that the `categories.txt` file is in one directory above the corpus folder. Otherwise, the function will not find the file!
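
To see why `..` is needed, it helps to work out where the relative path ends up. The sketch below assumes the reader looks up `cat_file` relative to the corpus root, as described above, and uses `posixpath` so the result looks the same on any operating system.

```python
import posixpath

root = '/content/corpus'
cat_file = '../categories.txt'

# Join the two parts, then collapse the '..' step.
resolved = posixpath.normpath(posixpath.join(root, cat_file))
print(resolved)
```

The result is the `categories.txt` file sitting directly in `/content`, one level above the corpus folder.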

In [None]:
corpus = CategorizedPlaintextCorpusReader(root = '/content/corpus', fileids = '.*.txt', cat_file = '../categories.txt' )

We can inspect the categories of the corpus:

In [None]:
corpus.categories()

As well as the filenames:

In [None]:
corpus.fileids()

We can then use the `categories` argument to look at files in any one category:

In [None]:
for category in corpus.categories():
  for file in corpus.fileids(categories = category):
    print(corpus.raw(file), '\t', category)

This is probably the best method to use if you have a complex set of files and folders that you want to be able to place into different categories without relying on filenames.

# **Wrap Up**

Being able to create your own corpus in NLTK, with or without categories, gives you a way to apply the NLTK tools to your own texts and to compare groups of texts within your data.
