# Before we begin...

This course introduces basic skills and methods of text mining and analysis, focusing mostly, but not exclusively, on analytic methods rooted in descriptive statistical analysis. As the name implies, descriptive statistical methods describe the basic features of the data being examined, providing simple but powerful summaries of the data.

## About text analysis

Some people interchangeably use *text mining*, *text analysis*, and *natural language processing* to refer to any large-scale textual analysis with computational tools. Within the humanities, terms such as *distant reading* (per Franco Moretti) or *macroanalysis* (per Matthew Jockers) are also substituted for any large-scale textual analysis. But these words all have subtle distinctions:

- **Text Mining** refers to the process of computationally "reading" a text (or collection of texts) and extracting specific chunks of information.  It encapsulates processes that convert unstructured text to structured data.
- **Text Analysis** refers to processes, such as basic descriptive statistical methods or more advanced machine learning methods, that analyze the information contained in texts.  Text Analysis is frequently performed on the structured data that results from Text Mining operations, but it can also be performed directly on unstructured texts.  Text Analysis processes typically return summary data, such as lists of all references to particular dates, persons, and topics or summary statistics regarding word or phrase usage and frequency.  
- **Natural Language Processing** (NLP) describes a particular subset of Text Mining and Text Analysis processes that utilize the grammatical and semantic structures associated with natural languages as part of the mining and analysis process.  Processes that do not account for the naturalness of language, such as word frequency analysis, cannot properly be considered Natural Language Processing.

## Data types and structures in Python

Python uses four primary data types and structures:

### Data types

- String
  - text
- Integer
  - whole numbers
- Float
  - decimals
- Boolean
  - binary values (e.g. True/False)
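
As a quick illustration, each of these types can be written as a literal and inspected with the built-in `type()` function. The variable names and values below are made up for this sketch and do not come from the course data.

```python
# illustrative values only
sample_title = "Moby-Dick"      # string
sample_count = 135              # integer
sample_average = 4.4            # float (made-up number)
sample_flag = True              # boolean

# print each value alongside the name of its type
for value in (sample_title, sample_count, sample_average, sample_flag):
    print(value, type(value).__name__)
```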

### Data structures

- Lists
  - ordered, mutable, represented with `[]`
- Dictionaries
  - key-value pairs, mutable, keep insertion order (as of Python 3.7), represented with `{}`
- Sets
  - unordered, unique items, represented with `{}`
- Tuples
  - ordered, immutable, represented with `()`
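
The sketch below builds one small example of each structure. All names and values are invented for illustration.

```python
# illustrative values only
sample_words = ["call", "me", "ishmael"]       # list
sample_counts = {"whale": 3, "sea": 2}         # dictionary of key-value pairs
sample_vocab = {"whale", "sea", "whale"}       # set: the duplicate collapses
sample_bigram = ("white", "whale")             # tuple

sample_words.append("some")     # lists can grow in place
sample_counts["ship"] = 1       # dictionaries accept new keys
print(len(sample_vocab))        # number of unique items in the set
print(sample_bigram[0])         # tuples are indexed like lists
```

Trying to change a tuple in place (e.g. `sample_bigram[0] = "grey"`) raises a `TypeError`, which is what "immutable" means in practice.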

In this workshop, nearly everything we do is limited to strings and lists (save for some scant integers throughout).

# Text mining

## Python packages for text mining

Many Python libraries, packages, and modules help perform text mining. Here is a brief breakdown of some of the most common ones:

- **NLTK**: Supports common text processing techniques—e.g. tokenization, stemming/lemmatization, part-of-speech (PoS) tagging, etc. Also includes select datasets for training/evaluating NLP tools. Designed for more "traditional" NLP—e.g. sentiment analysis, named-entity recognition, etc. Supports multiple Latinate languages. [NLTK docs](https://www.nltk.org/)

- **spaCy**: Supports much faster text processing. Includes pre-trained language models (LMs) for similar text processing tasks. Designed for production use with small-scale LMs—e.g. translation, summarization, etc. But does not support some functions like sentiment analysis. Supports multiple Latin and non-Latinate languages. [spaCy docs](https://spacy.io/)

- **re**: Part of the Python standard library. Supports regular expressions, which are useful in many text processing/cleaning tasks, e.g. string searching and matching. Anything to do with patterns, this library helps. [re docs](https://docs.python.org/3/library/re.html)



In [None]:
# install packages in colab env
%pip install nltk -U
%pip install spacy -U

# import packages
import nltk
import spacy
import string
import re

## Load a text file

Both NLTK and spaCy allow users to perform some initial processing on text files as they're loaded. For now, however, in order to see the effects of text processing, we'll simply load a text file sans any load-time processing.

Since we are working in Google Drive, we need to first mount the drive to our environment:

In [None]:
# mount google drive
from google.colab import drive
drive.mount('/gdrive/')

**Note on Python syntax:** In the code `from google.colab import drive`, we import the `drive` module from the `colab` package, which sits inside the top-level `google` package. To reach specific items inside imported modules or packages, Python uses the structure `large.small`, with a `.` between each unit. Here, we tell Python to go to the `google` package, find its `colab` subpackage, and import only the `drive` module.
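
The same dot structure works for any package. Here is a small sketch using the standard library's `os` module instead of `google.colab`, so it runs outside Colab too. The file path is a hypothetical example.

```python
# import a whole module, then reach into it with dot notation
import os
print(os.path.basename("/gdrive/MyDrive/example.txt"))

# import only one item from a module and call it directly
from os.path import basename
print(basename("/gdrive/MyDrive/example.txt"))
```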

Now that you've mounted your drive, you can open files from the course folder in your notebook:

In [None]:
src_txt_path = "/gdrive/MyDrive/python-nlp-humanities/text-data/melville.txt"

We've now stored the path to the text file as a Python variable.

*What is a variable?*

A Python variable is a symbolic name used to reference a data value. Once we assign a variable, we can use that variable to access and manipulate that value elsewhere.

There are some general rules when naming variables in Python:

- variable names must begin with a letter or underscore
- variable names are case-sensitive

From here, we can now open the file, read its contents, and save those contents as a new Python string variable.

In [None]:
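# open the file and "extract" its contents as a string variable
# (the with-block closes the file automatically; UTF-8 encoding is assumed)
with open(src_txt_path, "r", encoding="utf-8") as src_txt:
    working_txt = src_txt.read()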

# confirm file loaded
print('Characters in text:', len(working_txt))

# view first 500 characters of "extracted" text
print(working_txt[:500:1])

**Note on Python indices:** Python can access items in sequences, such as strings and lists, using indices. *Index* refers to a given item's position within a sequence. Python indices begin at 0 (e.g. the first character in a string is at index 0). The code `[:500:1]` is called a slice. Slices allow us to select a specific subsection of a sequence. Slices follow this structure: `[start:stop:step]`, defined as:
- `start`: start index (inclusive; default = 0)
- `stop`: end index (exclusive; default = the length of the sequence)
- `step`: increment of indices (default = 1)
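
A few slices on a short sample string (invented for this sketch) show how the three parts interact:

```python
sample = "Call me Ishmael."

print(sample[0])       # first character: 'C'
print(sample[:4])      # indices 0-3: 'Call'
print(sample[5:7])     # indices 5-6: 'me'
print(sample[::2])     # every second character
print(sample[-8:])     # last eight characters: 'Ishmael.'
```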

...Back to text processing.

The foundational but overlooked first step in any text analysis pipeline is forensics. You must be familiar with your data—its format, what noise may exist therein, the structure and patterns to which it may adhere, etc. This will aid you, not only in interpreting any later results you obtain, but also in cleaning the text to ensure you obtain valid results.

Look at the 500 characters and think of features in the text that could pose problems for later analysis.

There are several computational tools/methods for performing text analysis. These include analyzing word co-occurrence and word/n-gram distributions. Since I am already familiar with the primary issues afflicting our .txt file, and for the sake of time, we'll jump to common text cleaning tasks. As we move forward, consider how you may apply these text cleaning objectives to your own text(s).
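
For a sense of what a word or n-gram distribution looks like, here is a minimal sketch using only the standard library. The snippet of text is made up, and splitting on whitespace is a deliberate simplification.

```python
from collections import Counter

# a made-up snippet standing in for a real text
sample_snippet = "the whale the sea the ship and the whale"

sample_split = sample_snippet.split()
word_counts = Counter(sample_split)
# pair each word with the one after it to count bigrams (2-grams)
bigram_counts = Counter(zip(sample_split, sample_split[1:]))

print(word_counts.most_common(2))
print(bigram_counts.most_common(1))
```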

## Text cleaning

### Paratext

Our first task is to remove the Project Gutenberg header and footer, what narratologist Gérard Genette calls *paratext*. Rather than opt for a cumbersome rule-based approach (e.g. character counting, regular expressions, etc.), we can load the [gutenbergpy](https://github.com/raduangelescu/gutenbergpy) library, which contains features designed to strip PG files of their paratext.

**Write code:** To load the gutenbergpy library:
1. Use pip to install the gutenbergpy library
2. Import the `textget` module from the gutenbergpy library

In [None]:
# install gutenbergpy


# import textget


**Write code:** Write a print statement that uses a slice to view the first 500 characters (i.e. indices) from the `txt_body` string variable.

In [None]:
# strip PG header
txt_body = textget.strip_headers(working_txt.encode('utf-8')).decode('utf-8')

# strip PG footer
txt_body = re.sub(r'end of the project gutenberg', '', txt_body, flags=re.IGNORECASE)

# view first 500 characters


**Note on functions:** The re module's `sub()` function replaces every match of a pattern in a text with a replacement string. Nearly all functions take arguments. We discuss functions in more detail later, but for now know that arguments are the items you give to a function inside of its parentheses. The `sub()` function takes, at minimum, three arguments: `(pattern_to_find, replacement_string, input_text)`.
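
Here is a short sketch of `re.sub()` on a made-up sentence, first with the three required arguments and then with the optional `flags` keyword argument:

```python
import re

sample_line = "Call me Ishmael."

# pattern, replacement, input text
swapped = re.sub(r"Ishmael", "Ahab", sample_line)
print(swapped)

# optional keyword arguments such as flags change how the pattern matches
greeted = re.sub(r"call", "Hail", sample_line, flags=re.IGNORECASE)
print(greeted)
```

Note that `re.sub()` returns a new string. The original `sample_line` is left unchanged, which is why the result must be assigned to a variable.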


### Whitespace and cases

Cases and whitespace can also cause problems in some NLP methods. For example, some NLP pipelines separate text strings based on line divisions, but we may not care about paragraph divisions.

**Write code:** Write two regular expression substitutions, each assigning its result to the `clean_txt` variable:
1. The first should replace all line breaks with a single whitespace character.
    - The regular expression character for line breaks is `\n`.
2. The second should replace occurrences of two or more whitespace characters with a single whitespace character.
    - Consult the [regular expression documentation](https://docs.python.org/3/library/re.html) to find how to represent whitespace. 

In [None]:
# lowercase text
lower_txt = txt_body.lower()

# turn new lines into single space

# turn subsequent whitespace into single space

# remove any leading/trailing whitespace
clean_txt = clean_txt.strip()

# view first 500 characters
print(clean_txt[:500:1])

**Note on Python methods:** Certain objects and data types in Python have their own functions called *methods*. These are functions that can only be called on a given object or data type. Here, we use the `lower()` method and the `strip()` method, which are methods for string objects. Here is the [Python documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) on string object methods. Keep in mind that each data type and structure has its own methods.
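
The sketch below calls two string methods and one list method on made-up values, to show that methods are written after the object with a `.`:

```python
sample_raw = "  Call Me Ishmael.  "

lowered = sample_raw.lower()     # string method: lowercase copy
trimmed = sample_raw.strip()     # string method: copy without leading/trailing whitespace
print(lowered)
print(trimmed)

sample_names = ["Ishmael"]
sample_names.append("Queequeg")  # list method: adds one item in place
print(sample_names)
```

String methods return a new string, whereas `append()` changes the existing list. Calling `sample_names.lower()` would fail, because lists have no `lower()` method.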

### Tokenization

Before we move on to more robust cleaning, we'll first tokenize the text. Tokenization turns unstructured text data into structured text data composed of units called *tokens*. A token is simply a text-based unit. It can have almost any pre-defined size, ranging from a single character to a word, a paragraph, and so forth.
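
To illustrate how token size can vary, the sketch below splits one made-up passage into characters, words, and sentences. The regular expressions are simplistic stand-ins chosen for this example, not the rules NLTK's tokenizers apply.

```python
import re

sample_text = "Call me Ishmael. Some years ago, I went to sea."

# one token per character
char_tokens = list(sample_text)

# runs of word characters, plus each punctuation mark as its own token
word_tokens = re.findall(r"\w+|[^\w\s]", sample_text)

# naive sentence split: break on whitespace that follows ., ! or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", sample_text)

print(len(char_tokens), len(word_tokens), len(sentence_tokens))
print(word_tokens[:4])
```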

**Write code:** Write a print statement that uses a slice to view the first 100 items in the `tokens` list.


In [None]:
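# import NLTK word tokenizer
from nltk.tokenize import word_tokenize

# download tokenizer models ("punkt" for older NLTK, "punkt_tab" for recent releases)
nltk.download('punkt')
nltk.download('punkt_tab')

# tokenize on words
tokens = word_tokenize(clean_txt)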

# view first 100 tokens


Look through the list of tokens. What do you notice? Are there any tokens or bits of text that may create noise? We can remove these tokens through a combination of additional preprocessing steps.

For now, we'll start with punctuation and numbers. Execute this code:


In [None]:
# remove tokens that are punctuation marks
filtered_tokens = [word for word in tokens if word not in string.punctuation]

# remove non-alphabetic tokens
filtered_tokens_alpha = [word for word in filtered_tokens if word.isalpha()]

**Note on Python syntax:** This code may seem like a lot. But let's break it down to better understand Python's syntactic structure. Let's look specifically at this code:
```
filtered_tokens = [word for word in tokens if word not in string.punctuation]
```
This sets up a list variable `filtered_tokens` using what Python calls a *list comprehension*. `word` is simply a temporary variable that refers, in turn, to each item in the `tokens` list. So we create a list variable named "filtered_tokens" whose items are the same items in the "tokens" list, kept only if those items are not found in the string module's "punctuation" string.
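
To see what the one-line form is doing, the sketch below writes the same filter twice on a short made-up token list: once in the one-line form and once as an explicit `for` loop.

```python
import string

# short made-up token list for illustration
sample_tokens = ["call", "me", "ishmael", ".", "some", "years", "ago", ","]

# one-line form, mirroring the cell above
sample_filtered = [word for word in sample_tokens if word not in string.punctuation]

# equivalent explicit loop
sample_filtered_loop = []
for word in sample_tokens:
    if word not in string.punctuation:
        sample_filtered_loop.append(word)

print(sample_filtered == sample_filtered_loop)
```

Both versions produce the same list. The one-line form is simply more compact.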

### Stopwords

Stopwords are common, semantically less significant words removed from a text set. This includes articles (*the* and *a*), *to be* conjugations (*is* and *are*), conjunctions (*and*), and some prepositions (*of*). These are compiled into a stop list, i.e. a list of stopwords. A stop list typically consists of the most common words in a given language, which are believed to add little value to a text's overall meaning.

Rather than create a wholly original stop list, we can load the NLTK English language "stopwords" list.

**Write code:** Write a single line of code to download the NLTK "stopwords" list using the `download()` function from the NLTK library. Per the NLTK [downloader documentation](https://www.nltk.org/api/nltk.downloader.html), you will need to pass "stopwords" as an argument to the `download()` function.

In [None]:
# import NLTK stopword module
from nltk.corpus import stopwords

# download NLTK stopwords list


# load NLTK stopwords to variable
stop_words = stopwords.words('english')

# view stopwords
print(stop_words)



**Write code:** Review the [Python documentation](https://docs.python.org/3/tutorial/datastructures.html) for list object methods. What method allows you to add multiple items to a pre-existing list object? Use the appropriate method to add your own list of additional stopwords to NLTK's pre-defined stoplist. Keep in mind that, because `stop_words` is a list object, you need to compile and add these additional stopwords as a Python list object.

In [None]:
# add corpus-specific stopwords to list


# remove stopwords
filtered_tokens_final = [word for word in filtered_tokens_alpha if word not in stop_words]

# view first 50 filtered tokens
print(filtered_tokens_final[:50])

### Stemming and Lemmatization

How does one treat variants of a single word/concept (dancer, dances, dancing, etc.) as equivalent? Stemming and lemmatization are different approaches to this problem. They both reduce inflected word forms to a common root word (or, *lemma*). But here are key differences:

Stemming is a fairly heuristic approach that essentially removes suffixes (e.g. *-ing*, *-s*, etc.) to produce a base form for each word.

Lemmatization utilizes part-of-speech (POS) tagging to reduce inflected forms. POS tagging assigns each word/token a tag that represents its grammatical function. The lemmatizer then uses this tag to identify a word's base "dictionary form."

There are multiple stemming algorithms, with various strengths and weaknesses. One notable example is the Porter stemmer, which is fairly straightforward to import and execute.
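
To make the suffix-stripping idea concrete, here is a toy stemmer written for this sketch. It is not the Porter algorithm, which applies a much longer series of rules. The function name `toy_stem` and its three-suffix list are invented for illustration.

```python
def toy_stem(word: str) -> str:
    """Strip the first matching suffix; a deliberately crude heuristic."""
    for suffix in ("ing", "es", "s"):
        # keep at least three characters so very short words survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for sample_word in ["dancing", "dances", "dance", "string"]:
    print(sample_word, "->", toy_stem(sample_word))
```

With these rules, *dancing* and *dances* collapse to the same stem, but *dance* is left as-is and *string* loses its ending even though *-ing* is not a suffix there.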

Stemming can produce notably erroneous results, however. So let's move ahead with lemmatization.

**Write code:** Write import and download statements to import the necessary modules and functions from NLTK:
1. import "WordNetLemmatizer" from the NLTK "stem" package
2. import "wordnet" from the NLTK "corpus" package
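3. download the "averaged_perceptron_tagger" from NLTK (recent NLTK releases name this resource "averaged_perceptron_tagger_eng")
4. download the "wordnet" corpus from NLTK, which the lemmatizer needs at run time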


Now we can pass our text through the lemmatizer.

Fortunately, we do not need to create a lemmatization and POS algorithm. We can assign POS tags to words using the Penn Treebank tag set and pass them through a pre-created WordNet lemmatizer. This reduces words to base forms according to their assigned part of speech. In this way, lemmatization accounts for some of the errors that arise from stemmers.

In [None]:
# POS tagger function: maps a Penn Treebank tag to a WordNet POS constant
def pos_tagger(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN


# define lemmatizer
lemmatizer = WordNetLemmatizer()

# POS tagging
pos_tags = nltk.pos_tag(filtered_tokens_final)

# lemmatize tokens with POS tags
lemma_tokens = [lemmatizer.lemmatize(token, pos_tagger(pos_tag)) for token, pos_tag in pos_tags]

# view first 500 lemmas
print(lemma_tokens[:500])


## Functions

In Python, functions are reusable blocks of code designed to execute a specific task. Functions are useful for organizing larger workflows into discrete, step-wise tasks.

We create a function using the `def` keyword, followed by the function's name and a pair of parentheses `()`. Inside the parentheses, we can include any variables we want to pass to the function, such as for processing data. These variables are called *parameters*, and the values we supply for them when calling the function are called *arguments*. All code inside the function is indented. We typically close the function with a `return` statement, which specifies the function's output.
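
A minimal sketch of these parts, using a made-up function name and sample list:

```python
def count_tokens(tokens, target):
    """Return how many times `target` appears in `tokens`."""
    matches = [token for token in tokens if token == target]
    return len(matches)


# call the function by passing values inside the parentheses
sample_tokens = ["whale", "sea", "whale", "ship"]
whale_count = count_tokens(sample_tokens, "whale")
print(whale_count)
```

Defining the function does not run it. The code inside only executes when the function is called, as on the second-to-last line.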

**Write code:** Create a function for text cleaning. The function should execute the six-item action list described in the function's doc string. You may use the code we have reviewed to create the function.

In [None]:
def pg_txt_cleaner(text: str):
    """
    Cleans contents of an input .txt file sourced from Project Gutenberg by:
        1. remove PG header + footer
        2. lowercase text
        3. regularize whitespace
        4. tokenize text
        5. remove punctuation and stopwords
        6. lemmatize tokens using WordNet

    Arguments:
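        text (str) -- contents of the input .txt file (an open file object also works)
    Returns:
        lemma_tokens (list) -- list of filtered and lemmatized tokens
    """

    # accept either a string or an open file object
    working_txt = text.read() if hasattr(text, "read") else text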

    ## write text preprocessing code here

    return lemma_tokens




**Note on function arguments:** Our function takes a single variable called "text". Notice the "str" keyword following the variable. This is called a *type annotation*. Type annotations document the data type a function expects, which helps prevent potential downstream errors. Python itself does not enforce annotations when the code runs, but editors and type-checking tools can use them to flag mismatches. To create a type annotation, simply write the variable, followed by a colon ( **:** ) and the data type keyword (i.e. str, int, float, bool).
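
One detail worth checking for yourself is that Python records annotations without checking them at run time. The sketch below uses a made-up function, `shout`, to show this.

```python
def shout(text: str) -> str:
    """Return the input text in uppercase."""
    return text.upper()


print(shout("call me ishmael"))
print(shout.__annotations__)   # the annotations are stored on the function

try:
    shout(42)                  # the call itself is not blocked...
except AttributeError as error:
    # ...the failure only appears when .upper() is called on an integer
    print("failed inside the function:", error)
```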

## Conditionals in Python

As the name suggests, conditionals execute blocks of code only if certain conditions are met. This serves as a way to implement what is called a *control structure*, and allows us to automate decisions that direct a program's order of execution. There are three components to Python conditionals:


- `if`: checks a condition. If the condition is met (i.e. True), the subsequent portion of code executes.
- `elif`: checks an additional condition following the initial if-condition. elif statements are used only if you intend to check more than one condition.
- `else`: specifies what code to execute if none of the preceding conditions (if or elif) are met.
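
The three components can be sketched together in one small function. The function name and the length thresholds are arbitrary choices for this example.

```python
def describe_length(token: str) -> str:
    """Label a token as short, medium, or long by character count."""
    if len(token) <= 3:
        return "short"
    elif len(token) <= 7:
        return "medium"
    else:
        return "long"


for sample_token in ["sea", "whale", "leviathan"]:
    print(sample_token, "->", describe_length(sample_token))
```

Python checks the conditions from top to bottom and runs only the first block whose condition is met.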


**Write docstrings:** Write documentation for the following function that uses conditionals. Write a doc string for the function as a whole. It should include a description (1-2 sentences) of the function's action/purpose, list its arguments, and note what the function returns. Wherever you encounter an empty line, write an in-line comment describing the following conditional statement.

In [None]:
def pos_tagger(treebank_tag):
    """
    [Description]

    Parameters:

    Returns:
    """

    if treebank_tag.startswith('J'):
        return wordnet.ADJ

    elif treebank_tag.startswith('V'):
        return wordnet.VERB

    elif treebank_tag.startswith('N'):
        return wordnet.NOUN

    elif treebank_tag.startswith('R'):
        return wordnet.ADV

    else:
        return wordnet.NOUN

## Loops in Python

Loops similarly use conditions, albeit in a different way. Loops direct the program to repeat some action/code so long as a certain condition holds. There are *while* loops and *for* loops:

- `for`: iterates over a data structure and executes the following code for each item in that structure
- `while`: executes the following code so long as a specified condition is met
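
Here is a small sketch of both loop types on a made-up list:

```python
sample_tokens = ["call", "me", "ishmael"]

# for loop: runs once per item in the list
lengths = []
for token in sample_tokens:
    lengths.append(len(token))

# while loop: keeps running until the list copy is empty
remaining = list(sample_tokens)
while remaining:
    remaining.pop()

print(lengths, remaining)
```

The `while` loop stops because an empty list counts as False. A `while` loop whose condition never becomes False will run forever, so make sure something inside the loop changes the condition.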


**Write docstrings:** This function uses a for loop and a conditional. Write documentation for the following function. Write a doc string for the function as a whole. It should include a description (1-2 sentences) of the function's action/purpose, list its arguments, and note what the function returns. Wherever you encounter an empty line, write an in-line comment describing what the following line/block of code does.

In [None]:
import os

def process_txt_files(dir_path: str):
    """

    """

    texts = []

    for file_name in os.listdir(dir_path):

        if file_name.endswith('.txt'):
            with open(os.path.join(dir_path, file_name), 'r', encoding='utf-8') as file:

                clean_tokens = pg_txt_cleaner(file)

                texts.append(clean_tokens)
    return texts

**Write code:** Our function uses an if conditional without an else statement. This is not necessarily a problem, since the code executes fine. But it is often good practice to include an else statement so that items failing the condition are handled explicitly rather than silently skipped. Write an else statement to pair with this function's if conditional. It should tell the user a file has not been processed because it is not a .txt file.

We can now move on to some classic NLP text analysis.