# Reading files, creating corpora

Most of what we do in NLP is reading in files, counting things in them, and extracting information about their linguistic structure. In this notebook, we'll work on how to read a few files from the `data/` directory and count various things in them.

So far, you have read in one file at a time, for instance the NYT file in the data directory, with the following command:

```python
with open(file='data/NYT_1991-01-16-A15.txt', mode='r', encoding='utf-8') as file:
    nyt_text = file.read()
```


Now we want to read several files (or all the files) in a directory and work with them. For that we need the `Path` class from the `pathlib` library, which we have to import first.

In [None]:
from pathlib import Path

We then create a new variable, `data_dir`, and assign it the place where our data is. You can make this a relative path or an absolute path. Remember that, to know what your absolute path is, you can always execute the command `pwd` in a cell block. 

For more information on relative and absolute paths, review the reading from the POS module on [file organization](https://automatetheboringstuff.com/chapter8/).

* Relative path: `Path('./data')`
* Absolute path (this will change for your system): `Path('C:/Maite/MOD/notebooks/Ling450/data')`
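
To see how the two kinds of path relate, here is a minimal sketch using only `pathlib`. `Path.cwd()` returns the current working directory (the same information `pwd` gives you), and joining it with a relative path produces an absolute one. The variable names `relative`, `here` and `absolute` are just for illustration, and the last printed value will depend on your system.

```python
from pathlib import Path

# A relative path: interpreted starting from the current working directory
relative = Path('./data')

# The current working directory (what `pwd` shows)
here = Path.cwd()

# Joining the two gives an absolute path to the same place
absolute = here / relative

print(relative.is_absolute())   # False
print(absolute.is_absolute())   # True
print(absolute)                 # system-dependent
```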

In [None]:
data_dir = Path('./data')

Now we just print the names (and paths) of each of the files in that directory. This is just so that you know what's inside. The `glob()` method returns every entry in the directory whose name matches a pattern. In this case, `'*'` means "match anything".

In [None]:
for filepath in Path(data_dir).glob('*'):
    print(filepath)

You may have other things there, for instance, a hidden directory called `.ipynb_checkpoints`, which Jupyter creates automatically. You don't want to read that in, so you can restrict the listing to files that end in ".txt" (i.e., that have the extension "txt").

In [None]:
for filepath in Path(data_dir).glob('*.txt'):
    print(filepath)
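
Listing the files is only the first step. Below is a minimal sketch of actually reading them, combining `glob()` with the `open()` and `read()` pattern from earlier. It stores each file's text in a dictionary keyed by file name. So that it runs anywhere, the sketch first builds a small throwaway directory with made-up files; the names `demo_dir` and `texts` and the file contents are illustrative, not part of the course data.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory with a few small files (illustrative data only)
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / 'first.txt').write_text('The cat sat on the mat.', encoding='utf-8')
(demo_dir / 'second.txt').write_text('Dogs bark.', encoding='utf-8')
(demo_dir / 'notes.md').write_text('not a txt file', encoding='utf-8')

# Read every .txt file into a dictionary: file name -> text
texts = {}
for filepath in sorted(demo_dir.glob('*.txt')):
    with open(filepath, mode='r', encoding='utf-8') as f:
        texts[filepath.name] = f.read()

# Print each file name with the number of characters it contains
for name, text in texts.items():
    print(name, len(text))
```

With your own data, you would loop over `data_dir.glob('*.txt')` instead of `demo_dir.glob('*.txt')`.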

# Reading other files, using NLTK

In this section, we are going to read in files from another directory and use NLTK to tokenize and count things in them. NLTK is a light-weight (but still quite powerful) NLP tool. For many purposes, you can use either NLTK or spaCy and you should be familiar with both. You can learn more about NLTK from the [NLTK website](https://www.nltk.org/) and the [NLTK book](https://www.nltk.org/book/).

If you have not done so yet, install NLTK by running the `nltk_install.ipynb` notebook in our [Ling450 GitHub repository](https://github.com/maitetaboada/Ling450). Once you have installed NLTK on your system, you don't need to run that notebook again. You do need, however, to import it every time (with `import nltk`, as you see in the cell below). It's good practice to put all your import statements at the beginning of your notebook, so I'm going to put everything I'll need here.

After you import nltk, you'll also need a few more of the NLTK libraries, so there are statements for that too. `numpy` and `matplotlib` are libraries for numerical calculation and for plotting.  

In [None]:
import nltk
import numpy
import matplotlib
from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import FreqDist

Now we will point to a path where we have some files. For this exercise, I have placed some files in the `small_corpus/` directory in the Ling450 repository on GitHub. Make sure you download that entire directory to the same place where you have this notebook (the `./` part tells Python to look in the current directory). The easiest way to get files from the class repository is to use [GitHub Desktop](https://desktop.github.com/).

By the way, the texts in the `small_corpus` directory are from the [SFU Review Corpus](https://www.sfu.ca/~mtaboada/SFU_Review_Corpus.html). They are reviews of very old movies!

In [None]:
corpus_root = "./small_corpus/"

In [None]:
reviews = PlaintextCorpusReader(corpus_root, '.*', encoding = "utf8")

In [None]:
# the variable 'reviews' is an object of type 'PlaintextCorpusReader'
reviews

In [None]:
# you can list the files in that variable
reviews.fileids()

Now we are going to operate on the list of files. First, we will create a variable, `fileids`, to store the list of files in the reviews. 

Then, the dictionary `tokenized_reviews` will store the tokens for each file. Note that we create an empty dictionary first, to populate it inside the for loop.

Finally, we go through each of the files in `fileids` and tokenize it, storing the result in the dictionary. In `tokenized_reviews`, each key is the name of a file and each value is the list of tokens for that file:

`key: value`

`bad_santa.txt: ['If', 'you', 'use', ...]`

You can list the tokens alone by using the statement `tokenized_reviews["filename"]`. 

In [None]:
# store the names of the files in the corpus into a variable, 'fileids'
fileids = reviews.fileids()

# create an empty dictionary for the reviews once tokenized
tokenized_reviews = {}

# the for loop goes through each text (raw), tokenizes it (word_tokenize), and saves
# the tokens for each file in the tokenized_reviews dictionary
for fileid in fileids:
    tokens = word_tokenize(reviews.raw(fileid))
    tokenized_reviews[fileid] = tokens

In [None]:
tokenized_reviews

In [None]:
tokenized_reviews["bad_santa.txt"]
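
Printing the whole dictionary produces a lot of output. As a sketch of how to peek at this kind of structure instead, the loop below prints the file name, the number of tokens, and the first three tokens for each entry. The dictionary `toy_reviews` is a made-up stand-in with invented file names and tokens, so the cell runs on its own; with the real data you would loop over `tokenized_reviews.items()`.

```python
# Toy stand-in for tokenized_reviews (illustrative file names and tokens)
toy_reviews = {
    'review_a.txt': ['If', 'you', 'use', 'this', ',', 'beware', '.'],
    'review_b.txt': ['A', 'fine', 'film', '.'],
}

# Peek at the first few tokens of each file instead of printing everything
for fileid, tokens in toy_reviews.items():
    print(fileid, len(tokens), tokens[:3])
```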

Now that we have the tokens, we can calculate the number of types and the lexical diversity (types / tokens). `FreqDist` is an NLTK class that gives you each type and the frequency it has in the text. You'll see that it works like a dictionary, with the word/token as the key, and the count as the value.

In [None]:
len(tokenized_reviews["bad_santa.txt"])

In [None]:
len(set(tokenized_reviews["bad_santa.txt"]))

In [None]:
len(set(tokenized_reviews["bad_santa.txt"])) / len(tokenized_reviews["bad_santa.txt"])
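
The same calculation can be wrapped in a small function so it is easy to reuse. This is a sketch, not the only way to do it: the function name `lexical_diversity` and the toy token list are made up for illustration. It also shows one thing to watch for: `set()` treats 'The' and 'the' as two different types, so lowercasing the tokens first changes the result.

```python
def lexical_diversity(tokens):
    """Return types / tokens, or 0.0 for an empty list."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Toy token list (illustrative, not from the corpus)
toy_tokens = ['The', 'cat', 'saw', 'the', 'other', 'cat', '.']

# 6 types over 7 tokens
print(lexical_diversity(toy_tokens))

# Lowercasing first merges 'The' and 'the': 5 types over 7 tokens
print(lexical_diversity([t.lower() for t in toy_tokens]))
```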

In [None]:
freq_dist = FreqDist(tokenized_reviews["bad_santa.txt"])

In [None]:
freq_dist
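
In current NLTK versions, `FreqDist` is built on Python's `collections.Counter`, so the usual dictionary-style operations apply to it. The sketch below uses `Counter` directly with a toy token list, so that it runs without NLTK; the names `toy_tokens` and `toy_freq` are illustrative. With the real data, you should be able to call the same methods on `freq_dist`.

```python
from collections import Counter

# Toy token list (illustrative, not from the corpus)
toy_tokens = ['the', 'cat', 'saw', 'the', 'other', 'cat', 'and', 'the', 'dog']
toy_freq = Counter(toy_tokens)

print(toy_freq['the'])          # count for one type: 3
print(toy_freq.most_common(2))  # [('the', 3), ('cat', 2)]
print(len(toy_freq))            # number of types: 6
```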

## Your turn!

Modify the for loop above, so you calculate all these things and print them for each file. You'll need to complete some of the code. Part of this is just a re-write of what's above. 

In [None]:
# store the names of the files in the corpus into a variable, 'fileids'
fileids = reviews.fileids()

# create an empty dictionary for the reviews once tokenized
tokenized_reviews = {}

# the for loop goes through each text (raw), tokenizes it (word_tokenize), and saves
# the tokens for each file in the tokenized_reviews dictionary
for fileid in fileids:
    tokens = word_tokenize(reviews.raw(fileid))
    tokenized_reviews[fileid] = tokens
    token_count = len(tokenized_reviews[fileid])
    # complete this line
    type_count = 
    # complete this line
    lex_div = 
    # complete this line
    freq_dist = FreqDist(tokenized_reviews[fileid])
    
    print(
        "File: ", fileid, "\n"
        "Token count: ", token_count, "\n"
        # complete the following few lines to print type and lexical diversity
        
    )

## Getting data from elsewhere

There are lots of places to get data from. One easy place is [Project Gutenberg](https://www.gutenberg.org/), a free repository of texts that are in the public domain. To access them, we will use another library, `urllib.request`, which is part of Python's standard library and allows us to request information from places on the web.

To get some sample files, I searched Project Gutenberg with the following parameters and selected five books from the results:

* Search term: "canada"
* Language: "English"
* LoCC classification: "P Language and Literatures: American and Canadian Literature"
* Filetype: Plain text UTF-8

For other types of practice with Project Gutenberg data, you can also check out [Josef Fruehwald's data processing lesson](https://jofrhwld.github.io/teaching/courses/2022_lin517/lectures/data_processing/) and the [NLTK book, Chapter 3](https://www.nltk.org/book/ch03.html).

We first put a few titles and their URLs (the plain text format link) into a dictionary, `canadiana`. Then, we go through the URL for each of the books in the dictionary, download the text, and decode it as UTF-8. The simple code below only gets the tokens and prints how many there are. If you also print the beginning of the text, you will see that it needs a lot of clean-up. For instance, all Project Gutenberg texts have a header with a license notice at the beginning and the full license at the end, which you probably don't want to count towards the token count of the book. There are a few other things that need processing too.

In [None]:
import urllib.request

In [None]:
canadiana = {"The Moccasin Maker": "https://www.gutenberg.org/cache/epub/6600/pg6600.txt",
             "As Others See Us": "https://www.gutenberg.org/cache/epub/67312/pg67312.txt",
             "An Algonquin Maiden": "https://www.gutenberg.org/cache/epub/8661/pg8661.txt", 
             "God's Green Country": "https://www.gutenberg.org/cache/epub/34700/pg34700.txt", 
             "The Blue Castle": "https://www.gutenberg.org/cache/epub/67979/pg67979.txt"}

In [None]:
canadiana

In [None]:
for key, url in canadiana.items():
    response = urllib.request.urlopen(url)
    raw = response.read().decode('utf8')
    tokens = word_tokenize(raw)
    
    print(f"Tokens for {key}: {len(tokens)}")
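
As a first step towards the clean-up mentioned above, here is a minimal sketch that keeps only the text between the start and end marker lines of a Project Gutenberg file. It assumes those lines begin with `*** START OF` and `*** END OF`, which is the case for recent files but may not hold for every book, so check a few texts by hand. The function name `strip_gutenberg_boilerplate` and the sample text are made up for illustration.

```python
def strip_gutenberg_boilerplate(raw):
    """Return the text between the START and END marker lines, if both are found."""
    # Assumption: marker lines begin with these strings
    start = raw.find('*** START OF')
    end = raw.find('*** END OF')
    if start == -1 or end == -1 or end < start:
        return raw  # markers not found: leave the text unchanged
    # Move past the rest of the START marker line
    body_start = raw.find('\n', start)
    if body_start == -1 or body_start > end:
        return raw
    return raw[body_start:end].strip()

# Made-up sample imitating the layout of a Project Gutenberg file
sample = (
    "The Project Gutenberg eBook of A Made-Up Book\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK A MADE-UP BOOK ***\n"
    "Chapter 1\nIt was a cold day.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK A MADE-UP BOOK ***\n"
    "License text follows here.\n"
)

print(strip_gutenberg_boilerplate(sample))
```

In the loop above, you could call this function on `raw` before passing the result to `word_tokenize`.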