<a href="https://colab.research.google.com/github/mrhallonline/NLP-Workshop/blob/main/Module_3_Basic_analysis_and_Analysis_Workshop_Natural_Language_Toolkit_(NLTK)_V3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 3.0 Basic analysis  (20 minutes)
In this module we will mess around with some of the basic features of NLTK, looking at what types of information we can get from the features we have extracted.


  * Basic Information
    * Listing/Counting/Sorting/Ranking
  * Working with NLTK Text Object Variable
    * Concordance
    * Similar
    * n-grams/collocations
    * Data Visualizations


# 3.1 Reconnecting to Google Drive
Since we have opened a new Colab notebook for this module, we will need to remount Google Drive and extract the features again by recreating the variables from module 2. We will need to do something like this for each new module.

*Click the two code cells below to rerun the process

In [None]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Re-creating variables from module 2
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# load data from existing text file
filename = '/content/drive/MyDrive/raw_uncertaintyText.txt'
uncertaintyText = open(filename, 'rt', encoding='utf-8', errors='replace')

raw_uncertaintyText = uncertaintyText.read()
uncertaintyText.close()

# Word Tokenization
#uncertainty_wordTokens = nltk.word_tokenize(raw_uncertaintyText)

# Regular expression tokenizing with gaps=True: the pattern matches the whitespace between tokens
pattern = r'\s+'
uncertainty_wordTokens = nltk.regexp_tokenize(raw_uncertaintyText, pattern, gaps=True)

# This line converts our raw text into sentence tokens
uncertainty_sentTokens = nltk.sent_tokenize(raw_uncertaintyText)

# Creating a Text object from the tokens
uncertainty_wordTextObjects = nltk.Text(uncertainty_wordTokens)

print("raw_uncertaintyText is a: ",type(raw_uncertaintyText))
print("uncertainty_wordTokens is a: ",type(uncertainty_wordTokens))
print("uncertainty_sentTokens is a: ",type(uncertainty_sentTokens))
print("uncertainty_wordTextObjects is a: ",type(uncertainty_wordTextObjects))
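
The `gaps` argument changes what the pattern means, which is easy to get backwards. The short sketch below compares the two modes on a made-up sentence (not the workshop transcript); the `sample` string and the variable names are assumptions for illustration only.

```python
import nltk

# Made-up sample sentence, used only for illustration
sample = "I don't know. Maybe?"

# gaps=True: the pattern describes the separators (runs of whitespace),
# so punctuation stays attached to the neighboring word
split_on_whitespace = nltk.regexp_tokenize(sample, r'\s+', gaps=True)
print(split_on_whitespace)

# gaps=False (the default): the pattern describes the tokens themselves,
# so r'\w+' keeps only runs of letters, digits, and underscores
word_chars_only = nltk.regexp_tokenize(sample, r'\w+')
print(word_chars_only)
```

Splitting on whitespace is why tokens such as `know.` keep their trailing punctuation later in this module.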

# 3.2 Interacting with our words:
uncertainty_wordTokens

The following code cells show different ways to look at the word tokens we have created. Typing the variable name and executing the code cell shows a list of all the word tokens in the order that they appear.

Current Variable List
```
raw_uncertaintyText
uncertainty_wordTokens
uncertainty_sentTokens
uncertainty_wordTextObjects
```
Replace the current variable with one of the other ones to see how they differ structurally.
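
To make the structural difference concrete, here is a minimal sketch on a made-up sentence rather than the workshop data. The names `raw`, `tokens`, and `text_obj` are stand-ins for the variables listed above, and the simple `split()` is an assumption used to keep the example free of downloads.

```python
import nltk

# Made-up sample text, used only for illustration
raw = "Maybe we can add them. I think so."
tokens = raw.split()          # a plain Python list of strings
text_obj = nltk.Text(tokens)  # an NLTK Text wrapper around that list

# Indexing a string returns a character; indexing a list or Text returns a token
print(type(raw).__name__, repr(raw[0]), len(raw))
print(type(tokens).__name__, repr(tokens[0]), len(tokens))
print(type(text_obj).__name__, repr(text_obj[0]), len(text_obj))
```

The string is measured in characters, while the list and the Text object are both measured in tokens.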

In [None]:
#Executing the variable will show the contents
uncertainty_wordTokens

In [None]:
# print() shows the same contents in a compact form that doesn't run down the page. The lines below also show how to sort and slice the tokens
print(uncertainty_wordTokens)


# sorted(set(...)) removes duplicate tokens and alphabetizes what is left.
# The slice [0:25] limits this to the first 25 tokens of the document. Change the numbers to look at a different slice, or remove the slice to cover every token
print(sorted(set(uncertainty_wordTokens[0:25])))

Execute the code cell to see the total number of word tokens. (Running `len()` on `raw_uncertaintyText` instead would count characters.)

In [None]:
# How many total word tokens
len(uncertainty_wordTokens)

In [None]:
#Top 25 most common words with their counts
fd = nltk.FreqDist(uncertainty_wordTokens)
print(fd.most_common(25))


In [None]:
#Top 25 most common words easier to read from page
fd.tabulate(25)
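
Beyond `most_common()` and `tabulate()`, a `FreqDist` can answer a few other counting questions directly. The sketch below uses a small made-up token list; `fd_demo` is a hypothetical name chosen so it does not overwrite the `fd` created above.

```python
import nltk

# Made-up tokens, used only for illustration
tokens = ["maybe", "I", "think", "maybe", "we", "add", "I", "guess"]
fd_demo = nltk.FreqDist(tokens)

print(fd_demo.N())            # total number of tokens counted
print(fd_demo.B())            # number of distinct tokens
print(fd_demo["maybe"])       # raw count for one token
print(fd_demo.freq("maybe"))  # share of all tokens (count / N)
print(fd_demo.hapaxes())      # tokens that occur exactly once
```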

In [None]:
# Cumulative plot of top 25 words
fd.plot(25, cumulative=True)

In [None]:
# Bar graph of the 10 most common words

from matplotlib import pyplot as plt

# Reuses the frequency distribution `fd` created in the earlier cell

# Get the 10 most common words and their counts
common = fd.most_common(10)

# Unzip the words and counts into two separate lists
words, counts = zip(*common)

# Create a bar graph
plt.bar(words, counts)
plt.show()

# 3.3 Working with NLTK Text Objects
1. concordance
2. similar words
3. dispersion plots


One of the variables that we created earlier was neither a string nor a list of words, but an NLTK Text object. What do we do with that bucket? We can work with it directly using some of NLTK's built-in methods, such as concordance, similar, and dispersion plots.

#* concordance()
The concordance function searches your entire data corpus for occurrences of a specific word. It shows each occurrence along with the surrounding words, providing context that helps you understand how the word is used throughout the text. This can be particularly helpful for studying word usage patterns, exploring word meanings, or analyzing language usage in different contexts. Notice that the whole call is simply our variable name 'uncertainty_wordTextObjects' joined by a dot to the NLTK method 'concordance()'.

*Try seeing if this will work with any of the other variables that we created so far.


In [None]:
# This line will search for the occurrences of one of our uncertainty words.
# After running try using different uncertainty words
# Feel free to change the number of lines that are shown as well
uncertainty_wordTextObjects.concordance('maybe', lines = 15)
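
`concordance()` only prints its results. If you want to work with the matches in code, `concordance_list()` returns them instead. The sketch below runs on made-up tokens; the variable names are illustrative assumptions.

```python
import nltk

# Made-up tokens, used only for illustration
tokens = "Maybe we add them . I think maybe not".split()
text_obj = nltk.Text(tokens)

# concordance_list() returns the matches as objects instead of printing them
hits = text_obj.concordance_list("maybe", width=40, lines=5)
for hit in hits:
    # offset is the token's position in the list; query is the matched token
    print(hit.offset, hit.query, "|", hit.line)

# A plain list of tokens has no concordance() method
print(hasattr(tokens, "concordance"))
```

The search ignores case, so both "Maybe" and "maybe" are found. The last line hints at what happens if you try `concordance()` on one of the other variables.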

#* similar()

This NLTK function is used to find words that appear in a similar context as the given word in a text corpus. It looks at the context in which a word appears—specifically, the words that appear immediately before and after the target word—and finds other words that appear in the same or similar contexts.

This can be useful for semantic analysis, linguistic exploration, and various natural language processing tasks. It allows us to see words that are used in similar ways and therefore might be considered similar in meaning within the text.



In [None]:
uncertainty_wordTextObjects.similar("maybe")
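
The idea of "similar context" can be sketched in plain Python. This is a simplified illustration of the concept, not NLTK's actual implementation: `shared_context_words` is a hypothetical helper, it ignores the first and last token, and its ranking is only a rough stand-in for what `similar()` does.

```python
from collections import defaultdict

def shared_context_words(tokens, target):
    """Rank words by how many (previous, next) contexts they share with target."""
    contexts = defaultdict(set)
    lowered = [t.lower() for t in tokens]
    # Record the (previous word, next word) pair around every interior token
    for prev_word, word, next_word in zip(lowered, lowered[1:], lowered[2:]):
        contexts[word].add((prev_word, next_word))

    target_contexts = contexts[target.lower()]
    scores = {
        word: len(ctx & target_contexts)
        for word, ctx in contexts.items()
        if word != target.lower() and ctx & target_contexts
    }
    # Most shared contexts first; ties broken alphabetically
    return sorted(scores, key=lambda w: (-scores[w], w))

# Made-up tokens, used only for illustration
demo_tokens = "I think we add them . I guess we add them . I know you left".split()
print(shared_context_words(demo_tokens, "think"))
```

Here "guess" appears between "I" and "we" just as "think" does, so it is reported as similar, while "know" (between "I" and "you") is not.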

#* dispersion_plot()

This function is used to create a lexical dispersion plot, which visualizes the distribution of words in a text over the entirety of the data corpus. This plot helps you see where certain words occur throughout the text and whether there are any patterns or clusters of word occurrences.
The dispersion plot displays a series of dashes or markers along a horizontal axis that represents the length of the text. Each dash or marker corresponds to the position of a specific word within the text. By visualizing the distribution of words in this way, you can quickly identify trends, repetition, and patterns of word usage.

In [None]:
# This will show a dispersion plot of the words of interest.
# This is a useful NLTK function, but it does not seem to display correctly in Google Colab
# You can see this if you compare the plot output below with our concordance search for "maybe"

uncertainty_wordTextObjects.dispersion_plot(["maybe","help", "math"])
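
One way to sanity-check a dispersion plot is to compute the word positions yourself and compare them with what the plot shows. The sketch below is a plain-Python cross-check on made-up tokens; `word_offsets` is a hypothetical helper, not part of NLTK, and it assumes exact, case-sensitive matching.

```python
def word_offsets(tokens, words):
    """Return {word: [positions]} for exact, case-sensitive matches."""
    wanted = set(words)
    offsets = {word: [] for word in words}
    for position, token in enumerate(tokens):
        if token in wanted:
            offsets[token].append(position)
    return offsets

# Made-up tokens, used only for illustration
demo_tokens = "maybe we need help with math so maybe ask for help".split()
print(word_offsets(demo_tokens, ["maybe", "help", "math"]))
```

Calling the same helper on `uncertainty_wordTokens` gives the positions that each row of the plot should mark.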

# 3.4 n-grams and collocations

N-grams are contiguous sequences of words (or other linguistic units) that appear in a data corpus. They can reveal patterns and relationships of word usage in a text. For example, bigrams are sequences of two adjacent words and trigrams are sequences of three adjacent words.

1. bigrams



In [None]:
# Listing bigrams from data
uncertainty_bigrams = list(nltk.bigrams(uncertainty_wordTextObjects))
print(uncertainty_bigrams)

2. trigrams

In [None]:
# Listing trigrams from data
uncertainty_trigrams = list(nltk.trigrams(uncertainty_wordTextObjects))
print(uncertainty_trigrams)
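
It can help to see that there is nothing mysterious about how n-grams are built. The sketch below forms bigrams and trigrams by hand from a made-up token list and compares them with NLTK's `ngrams()`; the variable names are illustrative assumptions.

```python
from nltk.util import ngrams

# Made-up tokens, used only for illustration
tokens = "I do not know".split()

# Pair each token with the one that follows it
bigrams_by_hand = list(zip(tokens, tokens[1:]))
print(bigrams_by_hand)

# Same idea with three staggered copies of the list
trigrams_by_hand = list(zip(tokens, tokens[1:], tokens[2:]))
print(trigrams_by_hand)

# NLTK's ngrams() yields the same tuples for n=2 and n=3
print(bigrams_by_hand == list(ngrams(tokens, 2)))
print(trigrams_by_hand == list(ngrams(tokens, 3)))
```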

# Collocations
Collocations are pairs or groups of words that tend to appear frequently together in a text, often with a specific meaning or connotation ('fast food', 'red wine', 'United States of America').  They are not just random word combinations but rather indicative of language patterns. Identifying collocations can be useful for understanding the associations between words in a text.

NLTK provides the collocations module, which includes methods to identify and extract collocations from a text. One common approach is to use the BigramCollocationFinder class together with the BigramAssocMeasures class to find significant word pairs (bigrams), scored by measures such as raw frequency, pointwise mutual information (PMI), or likelihood ratio.

In [None]:
# display frequency of highest 25 bigrams
finder = nltk.collocations.BigramCollocationFinder.from_words(uncertainty_wordTextObjects)
finder.ngram_fd.tabulate(25)

In [None]:
# display frequency of highest 25 trigrams
finder = nltk.collocations.TrigramCollocationFinder.from_words(uncertainty_wordTextObjects)
finder.ngram_fd.tabulate(25)
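
The two cells above rank n-grams by raw count only. As a rough sketch of how an association measure could be used instead, the example below scores bigrams from a made-up token list. The name `demo_finder` and the frequency threshold of 2 are assumptions for illustration, not recommendations for the workshop data.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Made-up tokens, used only for illustration
tokens = "I do not know . I do not care . maybe I do".split()

measures = BigramAssocMeasures()
demo_finder = BigramCollocationFinder.from_words(tokens)

# Drop bigrams seen fewer than 2 times; rare pairs can get inflated scores
demo_finder.apply_freq_filter(2)

# Rank the remaining bigrams by raw frequency and by PMI
by_frequency = demo_finder.nbest(measures.raw_freq, 5)
by_pmi = demo_finder.nbest(measures.pmi, 5)
print(by_frequency)
print(by_pmi)
```

On real data the two rankings usually differ: raw frequency favors pairs of very common words, while PMI favors pairs that occur together more often than their individual frequencies would suggest.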

# Any number of Ns

If you need to search for higher values of 'n', the following code cell can build n-grams of any length. Just change the value of `n_value`.

In [None]:

from nltk.util import ngrams

n_value = 4  # Change this for different n values
n_grams = ngrams(uncertainty_wordTextObjects, n_value)

# Tabulate the top n-grams
fdist = nltk.FreqDist(n_grams)
fdist.tabulate(25)  # Top 25 n-grams


# Finding words that appear 'close' to each other.
The provided code snippet demonstrates how to search for instances of two specific words (e.g., "I" and "think") appearing within a certain distance from each other in a given text corpus. The code utilizes regular expressions and the NLTK library to tokenize the text, compile the regular expression pattern, and then search for matching sentences containing the specified word pair. It also includes a step to download the NLTK punkt tokenizer data if not already available, ensuring successful sentence tokenization.

* You should hopefully notice here that NLTK is able to use both word and sentence tokens interactively with each other.

In [None]:
import nltk
# Download the NLTK punkt tokenizer data
nltk.download('punkt')
from nltk.tokenize import sent_tokenize  # used in the next cell
from nltk.text import Text
import re


In [None]:
# Tokenize the text into words
corpus_words = nltk.word_tokenize(raw_uncertaintyText)

# Create a Text object from the tokenized words
corpus_text = Text(corpus_words)

# Define the two words to search for
word1 = "I"
word2 = "think"

# Define the maximum distance between the two words
max_distance = 10  # Adjust this value as needed

# Create a pattern for searching
pattern = r"\b" + word1 + r"\W+(?:\w+\W+){" + f"0,{max_distance}" + r"}" + word2 + r"\b"

# Extract plain text from the corpus_text object
plain_text = ' '.join(corpus_text.tokens)

# Compile the regular expression
search_regex = re.compile(pattern, re.IGNORECASE)

# Split the text into sentences
sentences = sent_tokenize(plain_text)

# Search for sentences where the two words appear within the specified distance
matching_sentences = []
for sentence in sentences:
    if search_regex.search(sentence):
        matching_sentences.append(sentence)

# Print the matching sentences
for sentence in matching_sentences:
    print(sentence)
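
To see how the search pattern behaves, it can be tested on a couple of short made-up sentences. The sketch below rebuilds the same kind of pattern in a small function; `proximity_pattern` is a hypothetical helper, and `re.escape` is an added precaution in case a search word contains regex characters.

```python
import re

def proximity_pattern(word1, word2, max_distance):
    """Match word1, then at most max_distance other words, then word2."""
    return re.compile(
        r"\b" + re.escape(word1)                           # first word at a word boundary
        + r"\W+(?:\w+\W+){0," + str(max_distance) + r"}"   # up to N words in between
        + re.escape(word2) + r"\b",                        # second word at a word boundary
        re.IGNORECASE,
    )

# Made-up sentences, used only for illustration
near = "I really do think so"
far = "I am not at all sure what they think"

regex = proximity_pattern("I", "think", max_distance=2)
print(bool(regex.search(near)))
print(bool(regex.search(far)))
```

Note that the pattern is ordered: it only finds the first word followed by the second, not the reverse.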


# 3.5 Activity if time permits
The following interactive tool will ask for a word plus a number for n and will report back with a plot of the most frequent ngrams containing the word entered.

See if this tool or any of the earlier code snippets can be useful for any of the following analyses:

* Common themes or topics discussed.
* Questions posed by students indicating uncertainty (using the words we discussed earlier like "maybe", "think" etc.).

Discuss the findings with your table partner.

In [None]:
import nltk
from nltk import ngrams
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter
import matplotlib.pyplot as plt

# Sample Text Corpus (replace this with your own)
text_corpus = raw_uncertaintyText

# Download required datasets
nltk.download('punkt')

# Function to get n-grams
def get_ngrams(text, n):
    words = word_tokenize(text.lower())
    generated_ngrams = ngrams(words, n)
    return generated_ngrams

# Function to interactively explore n-grams
def interactive_ngram_explorer():
    uncertainty_word = input("Enter the word of uncertainty you're interested in (e.g., maybe, think, etc.): ").lower()
    n_value = int(input("Enter the n-gram length (e.g., 2 for bigrams, 3 for trigrams, etc.): "))

    n_grams = list(get_ngrams(text_corpus, n_value))
    filtered_ngrams = [gram for gram in n_grams if uncertainty_word in gram]
    ngram_frequencies = Counter(filtered_ngrams)

    print(f"\nHere are the most common {n_value}-grams containing the word '{uncertainty_word}':")
    # most_common() sorts from most to least frequent
    for gram, freq in ngram_frequencies.most_common():
        print(f"{gram}: {freq}")

    # Optional: plot the results
    items = ngram_frequencies.most_common()
    labels = [str(i[0]) for i in items]
    values = [i[1] for i in items]

    plt.figure(figsize=(10,5))
    plt.barh(labels, values)
    plt.xlabel('Frequency')
    plt.title(f"Most common {n_value}-grams containing the word '{uncertainty_word}'")
    plt.show()

# Run the interactive explorer
interactive_ngram_explorer()


# End of Module 3

A huge takeaway to consider at this point is that our data still contains quite a bit of noise, and that noise shows up in our analysis.

For example:
* Our second and third most frequent trigrams are ('I', "don't", 'know') and ('I', "don't", 'know.').
  * In this case "know" and "know." are counted as two different words, and the same problem almost certainly affects other "words" in our data corpus.
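
As a rough sketch of the effect, the example below counts a made-up token list before and after a very simple cleanup. The cleanup rule (lowercase, then strip punctuation from the edges of each token) is an assumption for illustration and not the preprocessing taught later in the workshop.

```python
from collections import Counter
import string

# Made-up tokens, used only for illustration
noisy_tokens = ["I", "don't", "know", "I", "don't", "know.", "Know", "maybe,"]

# Lowercase each token and strip punctuation from its edges only,
# so the apostrophe inside "don't" is kept
cleaned_tokens = [t.lower().strip(string.punctuation) for t in noisy_tokens]

print(Counter(noisy_tokens)["know"])
print(Counter(cleaned_tokens)["know"])
```

Before cleanup, "know", "know.", and "Know" are three separate entries; afterward they are counted together.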

Module 5 will give you time to work more deeply with text preprocessing so that you can better extract the specific features you need from a transcript corpus. Seeing how the signal interacts with the noise helps in planning an analysis and in understanding why text preprocessing and feature extraction matter.