# Corpus preparation and processing

This notebook will show you how to use [NLTK](https://www.nltk.org/) to prepare your corpus for processing. 

You can run this notebook online using something like [Google Colab](https://colab.research.google.com/), but you can also download it and run it locally using [Jupyter Notebook](https://jupyter.org/).

## Installing NLTK

The next three cells install NLTK, import it into this notebook, and tell NLTK to download all of its components. If you are running this notebook locally, you only have to run the first and last cells once. After that, you can comment them out.

Once you have installed NLTK and downloaded the components, the only thing you have to do in the future is import it in any notebook where you want to use it, by running the middle cell (`import nltk`).

In [None]:
!pip install nltk

In [None]:
import nltk

In [None]:
nltk.download('all')

## Reading files

The steps below process a single file at a time. You can do this for an entire directory as well (see below). In the code below, you will have to change the value of the `file` argument to the full path of your file: the directory where it is stored (in my case, `C:/Maite/MOD/notebooks/Ling810/data/`) followed by the name of the file (`CBC_446.txt`).

In [None]:
with open(file='C:/Maite/MOD/notebooks/Ling810/data/CBC_446.txt', mode='r', encoding='utf-8') as file:
    text = file.read()

The line below just shows you the contents of the variable `text`. You will see lines like this after each of the processing steps. They are just so that you can see what happened to the text. You don't have to run them every time.

In [None]:
text

## Tokenization
The file was read into a variable called `text`. We need to ask NLTK to tokenize it into words and punctuation. We first import the right function and then tokenize it.

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
tokens = word_tokenize(text)

In [None]:
tokens

## Corpus processing

There are several things you can do with the data, depending on your goals:

* Make all the words lowercase
* Remove all punctuation
* Remove stop words
* Lemmatize

You may do some or all of them, or you may do them in different orders. Here, I just show you how to do one at a time.

### Lowercase

This is useful if you want to count all instances of the same word, regardless of where they appear in the text ("The" and "the"). But beware that when you do this, you may count instances of "Mark" (the person's name) and "mark" (a grade) as instances of the same word. 

In [None]:
lower_tokens = [token.lower() for token in tokens]

In [None]:
lower_tokens
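To see the effect of lowercasing on word counts, here is a minimal sketch with a small hand-written token list. The list `sample_tokens` is made up for illustration and is not taken from the corpus. The counting uses `collections.Counter` from the Python standard library.

```python
from collections import Counter

# Hypothetical tokens, standing in for the output of word_tokenize
sample_tokens = ['The', 'teacher', 'gave', 'Mark', 'the', 'top', 'mark', '.']

# Counts before lowercasing: 'The' and 'the' are separate entries
counts_original = Counter(sample_tokens)

# Counts after lowercasing: they are merged, but so are 'Mark' and 'mark'
counts_lower = Counter(token.lower() for token in sample_tokens)

print(counts_original['the'], counts_lower['the'])    # 1 2
print(counts_original['mark'], counts_lower['mark'])  # 1 2
```

Whether merging "Mark" and "mark" is acceptable depends on what you want to count, so it is worth inspecting a few frequent words before and after lowercasing.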

### Remove punctuation

In [None]:
only_words = [token for token in tokens if token.isalpha()]

In [None]:
only_words

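Notice that I removed punctuation from `tokens`, which contains the text before lowercasing. If you want to remove punctuation from the lowercased text, change `tokens` to `lower_tokens`. Keep in mind that `isalpha()` keeps only purely alphabetic tokens, so numbers and tokens that contain a hyphen or an apostrophe are removed along with the punctuation.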

In [None]:
only_words_lower = [token for token in lower_tokens if token.isalpha()]

In [None]:
only_words_lower
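The `isalpha()` filter is strict. The sketch below, using a made-up token list (`sample_tokens` is hypothetical), shows what it drops besides punctuation, and one possible looser alternative that keeps any token containing at least one letter or digit. The alternative is only a suggestion; which filter is right depends on your goals.

```python
# Hypothetical tokens, standing in for the output of word_tokenize
sample_tokens = ['She', 'did', "n't", 'pay', '20', 'dollars', 'for', 'a',
                 'well-known', 'book', '.']

# Strict filter: keep only purely alphabetic tokens
strict = [token for token in sample_tokens if token.isalpha()]

# Looser alternative: keep any token with at least one letter or digit
loose = [token for token in sample_tokens
         if any(char.isalnum() for char in token)]

print(strict)  # ['She', 'did', 'pay', 'dollars', 'for', 'a', 'book']
print(loose)   # everything except the final '.'
```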

### Remove stopwords

Stopwords are words that you do not want to include in your corpus, perhaps because you don't want to count function words and other very common words. We will be using NLTK's standard list of English stopwords, but you can define your own list too.

In [None]:
from nltk.corpus import stopwords
 
nltk.download('stopwords')

In [None]:
print(stopwords.words('english'))

In [None]:
# use a new name so that the imported `stopwords` module is not overwritten
stop_words = stopwords.words('english')

In [None]:
tokens_no_stopwords = [token for token in tokens if token not in stop_words]

In [None]:
tokens_no_stopwords

Again, I performed this operation on all the tokens (`tokens`). Because NLTK's stopword list is all lowercase, capitalized stopwords such as "The" are not removed from `tokens`. I could also do it for `only_words` or for `only_words_lower`, which has all the punctuation removed and the text in lowercase.

In [None]:
only_words_lower_no_stopwords = [token for token in only_words_lower if token not in stop_words]

In [None]:
only_words_lower_no_stopwords
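The order of lowercasing and stopword removal matters, because the comparison against the stopword list is case-sensitive. Here is a minimal sketch with a tiny made-up stopword set and token list (both `sample_stopwords` and `sample_tokens` are hypothetical). It compares filtering tokens as they are with filtering on their lowercased form.

```python
# A tiny made-up stopword set; NLTK's English list is much longer
sample_stopwords = {'the', 'a', 'of', 'is'}

# Hypothetical tokens, standing in for the output of word_tokenize
sample_tokens = ['The', 'colour', 'of', 'the', 'sky', 'is', 'blue']

# Comparing tokens as they are: 'The' is kept because it is capitalized
case_sensitive = [token for token in sample_tokens
                  if token not in sample_stopwords]

# Comparing the lowercased form: 'The' is removed as well
case_insensitive = [token for token in sample_tokens
                    if token.lower() not in sample_stopwords]

print(case_sensitive)    # ['The', 'colour', 'sky', 'blue']
print(case_insensitive)  # ['colour', 'sky', 'blue']
```

The sketch stores the stopwords in a `set` rather than a list. Converting a long stopword list to a set in the same way makes the membership test faster on large corpora.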

### Lemmatize

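There are different ways of lemmatizing. The one used here relies on the [WordNet](https://wordnet.princeton.edu/) dictionary to find the lemma for each word, which means that not all words will be found. By default, `lemmatize` also treats every token as a noun, so verb forms such as "running" are returned unchanged unless you pass a part of speech (for example, `pos='v'`). There are better lemmatizers, including the functions in [spaCy](https://spacy.io/), if you want to explore those.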

In [None]:
from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()

In [None]:
lemmatized_words = [lemmatizer.lemmatize(token) for token in tokens]

In [None]:
lemmatized_words

In [None]:
lemmatized_words_lower = [lemmatizer.lemmatize(token) for token in only_words_lower_no_stopwords]

In [None]:
lemmatized_words_lower

## Process an entire directory

The steps above are for one file at a time. If you have a directory of .txt files, you can process all of them and save the tokenized output to a new set of files. To do this, you need to import NLTK and the tokenizer, if you haven't already run the relevant cells above. All of the import statements and global definitions are repeated here, in case you only want to run this portion. You'll also need the `os` library.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords = stopwords.words('english')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [None]:
import os

Replace the paths below with the directory where your files are. This will save the output to a subdirectory there, but you can define the output directory to be whatever you want.

In [None]:
directory = 'C:/Maite/MOD/teaching/Ling810_Fall2024/data/CBC_Dec2023/CBC_Dec2023_sample20'
output = 'C:/Maite/MOD/teaching/Ling810_Fall2024/data/CBC_Dec2023/CBC_Dec2023_sample20/processed'

Define the function that is going to do all the work. It only processes .txt files, and it writes its results to the `output` directory defined above.

In [None]:
def tokenize_files(directory):
    # create the output directory if it does not exist yet
    os.makedirs(output, exist_ok=True)
    for filename in os.listdir(directory):
        # only process .txt files
        if filename.endswith('.txt'):
            file_path = os.path.join(directory, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                text = file.read()
            # tokenize
            tokens = word_tokenize(text)
            # prepare the path to the output dir and relevant file
            output_file_path = os.path.join(output, f'tokenized_{filename}')
            # write the tokens, separated by spaces
            with open(output_file_path, 'w', encoding='utf-8') as out_file:
                out_file.write(' '.join(tokens))

Now call the function.

In [None]:
tokenize_files(directory)
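If you want to try the directory loop without touching your own data, the sketch below runs the same pattern on a temporary directory. It is an illustration under stated assumptions: `tokenize_directory` is a hypothetical helper, not part of NLTK. It takes the output directory and the tokenizer as parameters instead of relying on global variables, and it defaults to `str.split` so that it runs without any downloads. Pass `word_tokenize` as `tokenizer` to get NLTK's tokenization instead.

```python
import tempfile
from pathlib import Path


def tokenize_directory(input_dir, output_dir, tokenizer=str.split):
    """Tokenize every .txt file in input_dir and write results to output_dir.

    Hypothetical helper for illustration; returns the list of files written.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    # Create the output directory (and any missing parents) if needed
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() makes the processing order predictable
    for path in sorted(input_dir.glob('*.txt')):
        text = path.read_text(encoding='utf-8')
        tokens = tokenizer(text)
        out_path = output_dir / f'tokenized_{path.name}'
        out_path.write_text(' '.join(tokens), encoding='utf-8')
        written.append(out_path)
    return written


# Try it on a throwaway directory with small made-up files
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / 'a.txt').write_text('First   file.\nTwo lines.', encoding='utf-8')
    (data_dir / 'b.txt').write_text('Second file.', encoding='utf-8')
    (data_dir / 'notes.md').write_text('Not a text file.', encoding='utf-8')

    results = tokenize_directory(data_dir, data_dir / 'processed')
    output_names = [p.name for p in results]
    first_output = results[0].read_text(encoding='utf-8')

print(output_names)  # ['tokenized_a.txt', 'tokenized_b.txt']
print(first_output)  # First file. Two lines.
```

Passing the directories as arguments makes it easier to reuse the function for a different corpus, since nothing depends on variables defined elsewhere in the notebook.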

That function only tokenizes. From there, you can also lowercase, remove punctuation, etc. If you want to do the whole process (tokenize, lowercase, remove punctuation and stopwords, lemmatize), you can do it all in one function.

In [None]:
def process_files(directory):
    # create the output directory if it does not exist yet
    os.makedirs(output, exist_ok=True)
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            file_path = os.path.join(directory, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                text = file.read()
            # tokenize
            tokens = word_tokenize(text)
            # lowercase, then remove punctuation and stopwords
            cleaned = [token.lower() for token in tokens if token.isalpha() and token.lower() not in stopwords]
            # lemmatize
            lemmatized = [lemmatizer.lemmatize(token) for token in cleaned]
            # prepare output path
            output_file_path = os.path.join(output, f'processed_{filename}')
            # write the lemmatized tokens, not the raw ones
            with open(output_file_path, 'w', encoding='utf-8') as out_file:
                out_file.write(' '.join(lemmatized))

Now call this other function.

In [None]:
process_files(directory)