# Text analysis

Text mining is the process of transforming unstructured text into structured data. This step is essential because text must be in a machine-readable format before it can be analyzed. Once the text is structured, we can use various NLP tools to analyze it.

We cover two fundamental techniques of text analysis: topic modeling and word embeddings.

## Python packages for text analysis

Many Python libraries, packages, and modules help perform text analysis. Here is a brief breakdown of some of the most common ones:

- **gensim:** Supports unsupervised NLP tasks such as topic modeling, document retrieval, and text similarity analysis. Unlike spaCy or NLTK, which focus on syntactic NLP, gensim is geared toward embedding-based NLP methods. [Gensim docs](https://radimrehurek.com/gensim/)

- **pyLDAvis:** Supports interactive topic model visualization, which is useful for interpreting and evaluating model topics. It is especially useful for understanding the distribution of topics and the relationships between them. [pyLDAvis docs](https://pyldavis.readthedocs.io/en/latest/)

- **word2vec:** Transforms words into embedding vectors as a means of representing semantic relationships. It was developed at Google and is useful for analyzing potential semantic relationships between words. Word embeddings are fundamental to understanding language models. We use gensim's word2vec module: [word2vec docs](https://radimrehurek.com/gensim/models/word2vec.html)


### Install and import packages

**Write code:** We will be using these packages for text mining and analysis in this session:

- nltk
- spacy
- gensim
- word2vec
- pyLDAvis
- string
- re
- os

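Write statements to install and then import these packages and modules into our notebook environment. Keep in mind that string, re, and os are part of Python's standard library, so they only need to be imported, and that word2vec is imported from gensim rather than installed separately.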

In [None]:
# install packages in colab env


# import packages


# load small English LM via spacy
!python3 -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
import en_core_web_sm

## Load a corpus

Before proceeding to text analysis, we need to load multiple text files from a directory as our corpus. To do this, we must first import the Python os module. This module allows users to interact with their operating system, for example to navigate and manipulate file structures or run shell commands.

Here we import the os module and mount the Google Drive to our environment:

In [None]:
# import os module and mount google drive
import os
from google.colab import drive
drive.mount('/gdrive/')

Now we can load our directory of text files. Here we are working with a complete collection of lyrics from all of Taylor Swift's studio albums (up to 2023). We assign the subdirectory path to the variable `corpus_dir` and use the `listdir()` function from the os module to list all of the filenames in the tswift subdirectory.

**Write code:**
1. Define the `file_lst` variable as the list of all text files in our Taylor Swift subdirectory. To do this, call the `listdir()` function from the os module passing the variable `corpus_dir` as the function's argument. Remember that, in Python, we call specific items from modules/packages following the `large.small` syntactic structure.
2. Write a print command to view the list by calling the `file_lst` variable.
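
As a rough illustration of how `listdir()` behaves, here is a minimal sketch that runs against a throwaway temporary directory rather than the Drive folder. The filenames other than `Fifteen.txt` and the variable names `demo_dir` and `demo_file_lst` are made up for the example.

```python
import os
import tempfile

# Build a throwaway directory with hypothetical filenames so the sketch runs anywhere
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["Fifteen.txt", "Love Story.txt", "notes.md"]:
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as f:
            f.write("placeholder")

    # os.listdir returns bare filenames (not full paths) in arbitrary order
    all_files = os.listdir(demo_dir)
    # keep only the .txt files and sort them for a stable listing
    demo_file_lst = sorted(name for name in all_files if name.endswith(".txt"))

print(demo_file_lst)
```

Note that `listdir()` returns names only, which is why the notebook later joins each filename back onto `corpus_dir` before opening it.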

In [None]:
# define variable as subdir path
corpus_dir = "/gdrive/MyDrive/python-txt-humanities/text-data/tswift/"

# define variable as list of corpus text files


# view the list


If we want to view the contents of one of these files at a time, we can use the following command.

In [None]:
with open('/gdrive/MyDrive/python-txt-humanities/text-data/tswift/Fifteen.txt', 'r', encoding='utf-8') as file:
    print(file.read())

Before we can do any text analysis, we need to do some simple, quick cleaning of our corpus. Look through some of the other text files. Print their contents and identify any patterns of noise that are common across the files.

To print different files, you likely don't want to copy-paste or type out the whole directory path. Fortunately, our directory path is saved as the `corpus_dir` variable. We can use it in our `open()` call with what is called an f-string. To write an f-string, type *f* before the opening quote, then insert your variable inside the string within curly brackets `{}`. Here is an example:

In [None]:
with open(f'{corpus_dir}Fifteen.txt', 'r', encoding='utf-8') as file:
    print(file.read())

Replace `Fifteen.txt` with different text filenames from the corpus list.

What patterns do you see that we can clean using regular expressions?

## Clean multiple text files

There are four features we can fix with regular expressions:

1. First line of metadata found in each text file
2. Song arrangement metadata marked with square brackets `[]`
3. *Embed* statement preceded by a numerical string at end of file
4. The string *See Taylor Swift LiveGet tickets as low as $60*

**Write code:**
1. Write a regular expression to replace multiple consecutive whitespace characters with a single space. You can use the re module's `sub()` function.
2. Write a line of code that lowercases the whole text. Use Python's [string-type methods](https://docs.python.org/3/library/stdtypes.html).
3. Add a print statement at the end that uses a list slice to print the first five texts from the `swift_discography` list.
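
The three operations named above can be tried out on toy data before applying them to the corpus. This is only a sketch on made-up strings (`sample` and `demo_texts` are hypothetical names), not the notebook's solution.

```python
import re

# made-up text with doubled spaces, a newline, and tabs
sample = "First  line\n second   LINE\t\tthird Line"

# \s+ matches one or more whitespace characters (spaces, tabs, newlines)
collapsed = re.sub(r"\s+", " ", sample)

# str.lower() returns a lowercased copy of the string
lowered = collapsed.lower()
print(lowered)

# a list slice [:5] returns the first five items of a list
demo_texts = ["t1", "t2", "t3", "t4", "t5", "t6"]
print(demo_texts[:5])
```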


In [None]:
# create empty list to store lyrics
swift_discography = []

for song in file_lst:
    file_path = corpus_dir + song
    with open(file_path, "r", encoding='utf-8', errors='ignore') as file:
        # read file, skipping first line
        lines = file.readlines()[1:]
        # join separate lines into one line
        lyrics = ' '.join(lines)

        # replace consecutive whitespace with a single white space
        

        # lowercase text
        

        # remove text bounded by square brackets (inclusive)
        lyrics = re.sub(r'\[.*?\]', '', lyrics)

        # remove "embed" when it is preceded by digits or special characters (inclusive)
        lyrics = re.sub(r'[\d?!]+embed', '', lyrics)

        # remove concert advert
        lyrics = re.sub(r'(?<=\s|.)see taylor swift liveget tickets as low as \$60(?=\s|.)', '', lyrics)

        # append lyrics to end of list
        swift_discography.append(lyrics)

# view first five items in list


**Note on Python code:** See the regular expression for removing the concert advert. You may note that the advert text `see taylor swift liveget tickets as low as \$60` is preceded and followed by two parenthetical sequences. These are called look-behind and look-ahead assertions:

- `(?<=\s|.)` This is a positive look-behind assertion, as signified by the characters `?<=`. The expression checks for whitespace (`\s`) or (`|`) any character (`.`) immediately preceding the advert text string.
- `(?=\s|.)` This is a positive look-ahead assertion, as signified by the characters `?=`. The expression checks for whitespace (`\s`) or (`|`) any character (`.`) immediately following the advert text string.

Together, these will only find those instances of the advert that are preceded and followed by a whitespace or other character. The characters matched inside the assertions are checked but not removed, so only the advert text itself is replaced.
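
To see how look-behind and look-ahead assertions constrain a match without consuming text, here is a small sketch on a made-up string. The pattern is an illustration, not the one used on the corpus.

```python
import re

text = "price: $60 each, id 60, total $600"

# (?<=\$)  look-behind: digits must come immediately after a dollar sign
# (?=\s|,) look-ahead: digits must come immediately before whitespace or a comma
pattern = r"(?<=\$)\d+(?=\s|,)"

matches = re.findall(pattern, text)
print(matches)

# only the digits are replaced; the "$" and the following space are left in place
replaced = re.sub(pattern, "XX", text)
print(replaced)
```

The bare `60` fails the look-behind (no dollar sign before it), and `$600` fails the look-ahead (nothing follows it), so only the first amount is matched.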

In [None]:
# view first ten items on separate lines
for lyrics in swift_discography[:10]:
  print(lyrics)

There is more we could do to clean the text. For our purposes, however, this is sufficient.

We now need to put our text into a structured format for text analysis. Specifically, we can tokenize and lemmatize our text, as well as remove stopwords. Import the relevant NLTK modules and download the data they need:

In [None]:
# tokenization
import nltk
from nltk.tokenize import word_tokenize
nltk.download("punkt")
nltk.download("punkt_tab")

# stopword removal
from nltk.corpus import stopwords
nltk.download('stopwords')

# lemmatization
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')

**Write Python code:** Stopwords are words that are semantically less significant, either because they are filler or because they are too common. But our corpus likely contains domain-specific stopwords not included in the NLTK stopword list. What are some words unique to our corpus that we may want to exclude from the texts? Add to the stoplist using the list object's `extend` method.
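
As a sketch of how `extend` works, the example below uses a short stand-in list instead of the real NLTK stoplist. The custom stopwords shown are hypothetical guesses at lyric filler, not a recommended list.

```python
# stand-in for stopwords.words('english'); the real NLTK list is much longer
demo_stop_words = ["the", "a", "and", "i", "you"]

# hypothetical corpus-specific additions (vocalizations common in song lyrics)
custom_stops = ["oh", "ooh", "yeah", "la"]

# extend() appends each item of the given list to the end of the original list, in place
demo_stop_words.extend(custom_stops)

tokens = ["oh", "you", "remember", "the", "summer", "yeah"]
kept = [w for w in tokens if w not in demo_stop_words]
print(kept)
```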

In [None]:
# POS tagger for lemmatization
def pos_tagger(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN


def process_txt(text):
    tokens = word_tokenize(text)
    # Remove non-alphabetic tokens
    filtered_tokens_alpha = [word for word in tokens if word.isalpha()]
    # Load NLTK stopword list and add original stopwords
    stop_words = stopwords.words('english')
    # Extend Stopwords (Taylor's Version)
    
    # Remove stopwords
    filtered_tokens_final = [w for w in filtered_tokens_alpha if w not in stop_words]
    # Define lemmatizer
    lemmatizer = WordNetLemmatizer()
    # Conduct POS tagging
    pos_tags = nltk.pos_tag(filtered_tokens_final)
    # Lemmatize word-tokens via assigned POS tags
    lemma_tokens = [lemmatizer.lemmatize(token, pos_tagger(pos_tag)) for token, pos_tag in pos_tags]
    return lemma_tokens

# Process each text in swift_discography
processed_swift = [process_txt(text) for text in swift_discography]

# Print the first 10 processed texts
for doc in processed_swift[:10]:
    print(doc)

Our corpus is stored in a list of lists of lemmas. Each item in the list is a list representing one text, whose items in turn are lemmatized tokens from that text. It follows this structure:
- corpus
  - text_1
    - token_1
    - token_2
    - token_n
  - text_2
    - token_1
    - token_2
    - token_n
  - text_n
  ...

We can use this structured format to generate topic models and word vector models.

But first, some background...

## Bag of words

Many NLP techniques utilize a feature extraction technique called Bag of Words (BoW). A BoW model is an unordered collection of every word in a corpus or document, defined solely by frequency. In other words, BoW models ignore grammar, syntax, word order, and co-occurrence.
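
The idea can be sketched with Python's built-in `Counter`, independent of gensim. The token lists are made up, and this illustrates only the concept, not gensim's internal representation.

```python
from collections import Counter

# two toy documents containing the same words in a different order
doc_a = ["love", "story", "love"]
doc_b = ["love", "love", "story"]

# a bag of words keeps only each word's frequency
bag_a = Counter(doc_a)
bag_b = Counter(doc_b)

print(bag_a)
# word order is discarded, so the two bags are identical
print(bag_a == bag_b)
```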

To construct a BoW from which we can make a topic model, we must first import certain packages and modules from the gensim library:


In [None]:
# import packages and modules for topic modeling
import gensim.corpora as corpora
from gensim import models
from gensim.models import CoherenceModel
from pyLDAvis import gensim_models

Now we can create a "dictionary" of the corpus. This is not a Python dictionary. Rather, it is an account of every word that appears across our corpus. We can then use this dictionary to construct the BoW.

**Write Python code:** Write two items of code:
1. If you run the code, the print statement raises a **TypeError**. A TypeError occurs in Python when one attempts to apply an operation or function to an inappropriate data type. Why do you think this line of code raises a TypeError? Rewrite the print statement using an f-string so that the string and list print together (separated by a line break). Line breaks are represented as `\n`.
2. Our code generates a BoW as a list of lists of word-frequency pairs. Each item of our main list is a list (representing a text) that is itself comprised of pairs of words and their frequencies. Write a two-line `for` statement that prints each list/document from our main list as a separate line so we can easily eyeball our BoW.
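
The general pattern behind both tasks can be tried on toy data first. In this sketch, `label`, `pairs`, and `toy_bow` are hypothetical stand-ins rather than the notebook's variables.

```python
label = "(Token ID, Frequency):"
pairs = [(0, 2), (1, 1)]  # toy stand-in for one document's BoW

# the + operator is not defined between a string and a list
try:
    message = label + pairs
except TypeError as err:
    error_name = type(err).__name__
    print(error_name)

# an f-string converts the list to its string form before joining
message = f"{label}\n{pairs}"
print(message)

# a two-line for loop prints each inner list on its own line
toy_bow = [[("love", 2), ("story", 1)], [("rain", 3)]]
for doc in toy_bow:
    print(doc)
```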

In [None]:
# establish dictionary with gensim
corpus_dict = corpora.Dictionary()

# create BoW model from tokenized data
corpus_bow = [corpus_dict.doc2bow(doc, allow_update=True) for doc in processed_swift]

# view BoW, token IDs paired with the frequencies with which they appear in the first document
# modify this line (as written, it raises a TypeError)
print('(Token ID, Frequency):' + corpus_bow[0])

# replace token ID with token string
bow_freq = [[(corpus_dict[token_id], freq) for token_id, freq in doc] for doc in corpus_bow]

# print each text's tokens+frequencies on individual lines


Each line in the output is a separate document from our corpus; each line contains token-number pairs representing the tokens that appear in that document and the frequency with which they appear in that document.

As we look over the lists, we may see semantically less-significant words reappearing throughout. We can add these to our custom stoplist, re-process the corpus, and re-create the BoW. After all, text processing and analysis is an iterative process.

**Write Python code:** There is another approach to removing semantically insignificant words. Gensim's dictionary module comes with a `filter_extremes()` function, described in the [gensim docs](https://tedboy.github.io/nlps/generated/generated/gensim.corpora.Dictionary.filter_extremes.html). This function allows us to filter out words that appear in fewer or more documents than a certain threshold. We can call the function as a method of our dictionary variable (much like the `.lower()` method we use to lowercase text) before we create the BoW model. Note that the dictionary must already contain the corpus vocabulary when the method is called, for example by passing `processed_swift` to `corpora.Dictionary()`. Read the documentation for `filter_extremes()` and call it as a method of the `corpus_dict` variable, filtering out tokens above and/or below a certain frequency threshold.
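
To make the filtering logic concrete, here is a pure-Python sketch of document-frequency filtering on made-up documents. It borrows the parameter names `no_below` and `no_above` from gensim for readability but is not gensim's implementation, and the threshold values are arbitrary.

```python
from collections import Counter

# four toy documents (made-up tokens)
docs = [
    ["love", "night", "car"],
    ["love", "rain", "night"],
    ["love", "dance", "car"],
    ["love", "rain", "door"],
]

# document frequency: the number of documents in which each token appears
doc_freq = Counter(token for doc in docs for token in set(doc))

no_below = 2     # keep tokens that appear in at least 2 documents
no_above = 0.75  # and in no more than 75% of all documents
n_docs = len(docs)

kept_vocab = sorted(
    token for token, df in doc_freq.items()
    if df >= no_below and df / n_docs <= no_above
)
print(kept_vocab)
```

Here `love` is dropped for appearing in every document, while `dance` and `door` are dropped for appearing in only one.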

## Topic models

We can now create our topic model. Gensim's `LdaModel()` function provides a readily accessible method for topic model generation. The function allows us to customize the model parameters to better tune it to our data and purposes. To do this, we pass different arguments to the function. Gensim's [documentation](https://radimrehurek.com/gensim/models/ldamodel.html) provides information on all of the function arguments, but here is a brief overview of the select arguments we use:

- `random_state`: known as 'seed' in many other packages and libraries. This ensures reproducibility. LDA is probabilistic, not deterministic. Even were we to train models with the same parameters on the same corpus, the resultant models would not be identical. But models trained with the same parameters on the same data will be the same *if* they are trained with the same random state.

- `chunksize`: the number of texts the model function processes at one time

- `num_topics`: number of topics into which the model distributes words and documents

- `passes`: number of passes through the whole corpus
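
The role of `random_state` can be illustrated with Python's own random number generator. This sketch does not train a topic model; the helper `sample_topic_assignments` is hypothetical and simply draws random topic numbers to show that the same seed reproduces the same sequence.

```python
import random

def sample_topic_assignments(seed, n_words=8, n_topics=7):
    """Randomly assign each of n_words words to one of n_topics topics."""
    rng = random.Random(seed)  # a generator with its own fixed seed
    return [rng.randrange(n_topics) for _ in range(n_words)]

run_a = sample_topic_assignments(seed=43)
run_b = sample_topic_assignments(seed=43)
run_c = sample_topic_assignments(seed=7)

# same seed, same sequence of "random" choices
print(run_a == run_b)
# a different seed will almost certainly give a different sequence
print(run_a == run_c)
```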

Execute the models. This process might take several minutes.

In [None]:
# import gensim models module
from gensim import models

# train LDA model
lda_model = models.LdaModel(corpus=corpus_bow,
                            id2word=corpus_dict,
                            random_state=43,
                            chunksize=5,
                            num_topics=7,
                            passes=10,
                            per_word_topics=True)

# print topics
for topic in lda_model.print_topics(num_words=10):
    print(topic)

### Evaluate topic models

Topic models can be evaluated qualitatively and quantitatively. For the former, an individual uses domain knowledge to "eyeball" top key terms for interpretability. Since topic models are only useful insofar as they can be interpreted, this sort of "eyeballing" is essential.

This eyeballing can be done by simply glancing over the training output. There are, however, other tools that can aid evaluation. One is the pyLDAvis chart. This library visualizes every topic's scope and similarity in relation to one another by mapping them onto a two-dimensional space.


In [None]:
# create an LDA Visualization Object
lda_visual = pyLDAvis.gensim_models.prepare(lda_model, corpus_bow, corpus_dict)

#display the visualization
pyLDAvis.display(lda_visual)


Turning to quantitative evaluation, common metrics include log-likelihood and coherence score, which measure topic probability and coherence, respectively.

Topic coherence is one common method for evaluating LDA topics. Topic coherence attempts to measure topic interpretability (i.e. "coherence"). With the Cv measure used below, coherence scores range from 0 to 1, with 1 being perfect coherence and 0 being no coherence.

Coherence sorts each topic's key terms from highest to lowest term weights. It then selects the first n terms in each topic and measures those terms' similarity within their topic. How does the algorithm measure similarity? Cv is one widely adopted method that can be readily implemented via gensim's `CoherenceModel()` function.


In [None]:
# define coherence model
coherence_model = CoherenceModel(model=lda_model, texts=processed_swift, dictionary=corpus_dict, coherence='c_v')

# obtain coherence scores
coherence_score = coherence_model.get_coherence()

# view coherence scores
print(coherence_score)


### Tune topic model

We can tune our topic model using model hyperparameters. Specifically, we can use the hyperparameters **alpha** and **eta**, which respectively influence document-topic and word-topic density.

- **alpha:** this hyperparameter controls the document-topic density. A higher value smooths document weights over topics, meaning a document is more likely to contain a mixture of many topics. A lower value means that a document is more likely to contain only a few dominant topics.

- **eta:** this hyperparameter controls word-topic density. A higher value smooths word distribution across topics, meaning that topics are likely to include a wider range of words from the corpus. A lower value means that topics are likely to be composed of only a few, heavily-weighted words.
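
The effect of these hyperparameters can be sketched by sampling from a symmetric Dirichlet distribution, the prior that LDA places on topic mixtures. This is a standalone illustration using only the standard library, not gensim's training code. The helper names and sample sizes are assumptions made for the example.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one topic mixture from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_top_share(alpha, k=7, n=2000, seed=43):
    """Average weight of the single largest topic across n sampled mixtures."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

low_alpha_share = mean_top_share(0.1)   # sparse: one topic tends to dominate
high_alpha_share = mean_top_share(10)   # smooth: weight is spread over topics

print(round(low_alpha_share, 2), round(high_alpha_share, 2))
```

With a low value, the largest topic takes most of the weight on average; with a high value, the weight is shared far more evenly across the seven topics. The same reasoning applies to eta and the distribution of words within a topic.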


In [None]:
# retrain LDA model with custom alpha and eta values
lda_model = models.LdaModel(corpus=corpus_bow,
                            id2word=corpus_dict,
                            random_state=43,
                            chunksize=5,
                            num_topics=7,
                            passes=10,
                            per_word_topics=True,
                            alpha=10,
                            eta=10)

# print topics
for topic in lda_model.print_topics(num_words=10):
    print(topic)

## Word embeddings

Word embedding models represent words as numerical coordinates called *vectors*. Their aim is to represent words in such a way that their semantic relationships are preserved. The word vectors are mapped across a multi-dimensional space in relation to one another. For this reason, the word vectors are also called *word embeddings*. Word embeddings work from the assumption that words with similar meanings appear in similar contexts. So each word is defined entirely by its relationship to every other word in the corpus. As a result, if one word establishes a new relationship, the "definitions" of all the other word embeddings also change.
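
To make the geometric idea concrete, the sketch below compares toy three-dimensional vectors with cosine similarity, the measure commonly used to compare word embeddings. The vectors and words are invented for illustration, whereas real embeddings are learned from a corpus and usually have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up vectors; words used in similar contexts get nearby coordinates
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

sim_close = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_far = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(round(sim_close, 3), round(sim_far, 3))
```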

We use gensim's implementation of word2vec.

In [None]:
from gensim.models import Word2Vec

# instantiate the gensim Word2Vec class without a corpus, so that no training happens yet
embeddings_model = Word2Vec(window=5, min_count=3)

# build the model vocabulary from the corpus and define a progress-logging interval
embeddings_model.build_vocab(processed_swift, progress_per=10000)

# now train the neural net
embeddings_model.train(processed_swift, total_examples=embeddings_model.corpus_count, epochs=30, report_delay=1)

Once we have our neural network trained, we can start asking questions about word semantics. Play around with other words in which you might be interested. But remember that this neural network was trained on a very small sample of texts, so don't put too much stock in what you find.

In [None]:
# find the words most similar to a word of interest
print(embeddings_model.wv.most_similar(positive=["man"]))

# find the words most similar to the negation of a word of interest (i.e. least similar to it)
print(embeddings_model.wv.most_similar(negative=["man"]))

# test the similarity of two words
print(embeddings_model.wv.similarity("man", "woman"))

Word embeddings are foundational to language models. For example, LLMs use word embeddings to generate text, retrieve texts, and fill gaps/masks, although they use contextual word embeddings as opposed to the word-type embeddings found in word2vec. Nevertheless, being able to understand and query word embeddings helps in understanding other NLP tools.