<a href="https://colab.research.google.com/github/mrhallonline/NLP-Workshop/blob/main/Module_3_Basic_analysis_and_Analysis_Workshop_Natural_Language_Toolkit_(NLTK)_V3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 3.0 Basic analysis  (20 minutes)
In this module we will explore some of the basic features of NLTK and look at what types of information we can get from the features we have extracted.


  * Basic Information
    * Listing/Counting/Sorting/Ranking
  * Working with NLTK Text Object Variable
    * Concordance
    * Similar
    * n-grams/collocations
    * Data Visualizations


# 3.1 Reconnecting to the Text Corpus
Since we have opened a new Colab notebook for this module, we will need to load the text file from Google Drive and extract the features again by recreating the variables from Module 2. We will need to do something like this for each new module.

*Click the code cells below to rerun the process for this module

In [None]:
import nltk
import requests
import pandas as pd
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# This is the full shared Drive link, the file ID starts at "1i" and ends at "8S"
# https://docs.google.com/spreadsheets/d/1iJ4SG-QXfY4zw5K9B7Ununv3rb3iBj8S/edit?usp=drive_link&ouid=106477043869312333876&rtpof=true&sd=true

# the file ID from the shareable link is pasted below in orange.
file_id = "1iJ4SG-QXfY4zw5K9B7Ununv3rb3iBj8S"

# construct the download URL, you would not need to change anything here.
download_url = f"https://docs.google.com/uc?export=download&id={file_id}"

# send a GET request to the download URL and save the response content
response = requests.get(download_url)

# The next line names the file after download. If you change the name here, you will also need to change it in the lines that follow.
# If you click on the folder icon in Colab you should now see a file called "uncertaintyText.xlsx"
# These names can be changed to suit your own data
with open("uncertaintyText.xlsx", "wb") as f:
    f.write(response.content)


# Specify the path to the Excel file. This is where it was placed in 2.4, so that is the file and path you want to open
excel_file_path = '/content/uncertaintyText.xlsx'

# Specify the column name you want to pull the data corpus from
column_name = 'transcript'

# Read the Excel file and extract the specified column
data = pd.read_excel(excel_file_path, engine='openpyxl')
text_column = data[column_name]


# Convert each item in the column to a string and then join them together to be saved as a text file containing all data in the transcript column.
raw_uncertaintyText = ' '.join(map(str, text_column))


# Save the string to a text file in the Colab session storage (/content)
with open('/content/raw_uncertaintyText.txt', 'w', encoding='utf-8') as file:
  file.write(raw_uncertaintyText)
# Load data from the existing text file
## The following path can be changed to access other text files that you would like to work with.
filename = '/content/raw_uncertaintyText.txt'
uncertaintyText = open(filename, 'rt', encoding='utf-8', errors='replace')

# Extracting Different Features for analysis

# The raw unchanged data from the text file now exists as a variable called "raw_uncertaintyText"
raw_uncertaintyText = uncertaintyText.read()
uncertaintyText.close()

# 1. This line converts our raw text into word tokens
# Regular expression tokenizing with gaps=True: the pattern describes the whitespace between tokens, so the text is split on whitespace
pattern = r'\s+'
uncertainty_wordTokens = nltk.regexp_tokenize(raw_uncertaintyText, pattern, gaps=True)

# 2. This line converts our raw text into sentence tokens
uncertainty_sentTokens = nltk.sent_tokenize(raw_uncertaintyText)

# 3. This line converts our word tokens into NLTK text objects from the tokens
uncertainty_wordTextObjects = nltk.Text(uncertainty_wordTokens)

print(uncertainty_wordTokens)

In [None]:
print("raw_uncertaintyText is a: ",type(raw_uncertaintyText))
print("uncertainty_wordTokens is a: ",type(uncertainty_wordTokens))
print("uncertainty_sentTokens is a: ",type(uncertainty_sentTokens))
print("uncertainty_wordTextObjects is a: ",type(uncertainty_wordTextObjects))

# 3.2 Interacting with our words tokens: uncertainty_wordTokens

The following code cells show different ways to look at the word tokens we have created. Just typing in the variable name and executing the code cell will show us a list of all the word tokens in the order that they appear.

Current Variable List
```
raw_uncertaintyText
uncertainty_wordTokens
uncertainty_sentTokens
uncertainty_wordTextObjects
```
If time permits, feel free to replace the current variable with one of the other ones to see how they differ structurally.
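
To make the structural differences concrete, here is a minimal sketch on a short made-up sentence. The `sample_*` names and the sentence are illustrative assumptions and are not part of the workshop data; the whitespace pattern is the same one used in 3.1.

```python
import nltk

# Made-up sample text; not taken from the workshop transcripts
sample_raw = "I think maybe we add them. I don't know."

# Same approach as in 3.1: with gaps=True the pattern marks the gaps between tokens
sample_tokens = nltk.regexp_tokenize(sample_raw, r'\s+', gaps=True)

# Wrap the token list in an NLTK Text object
sample_text = nltk.Text(sample_tokens)

print(type(sample_raw).__name__, len(sample_raw))        # a string: length counts characters
print(type(sample_tokens).__name__, len(sample_tokens))  # a list: length counts tokens
print(type(sample_text).__name__, len(sample_text))      # a Text object: length counts tokens
print(sample_tokens)
```

Note that `them.` and `know.` keep their punctuation, because splitting on whitespace does not separate punctuation from words.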

In [None]:
# Executing the variable will show a list of the contents. Tokens are separated by commas.
uncertainty_wordTokens

In [None]:
# Placing the variable in a print function shows the contents in a view that doesn't run down the page. You can also sort and slice the data inside the print call.
print(uncertainty_wordTokens)


# sorted(set(...)) lists and alphabetizes every distinct token that appears at least once in the document.
# The slice at the end tells it to show the first 25 of those tokens. The numbers can be changed to look at different slices of the data.
print(sorted(set(uncertainty_wordTokens))[0:25])

# Execute the code cell to see the total number of word tokens

In [None]:
# How many total word tokens
len(uncertainty_wordTokens)

# Looking at the frequency distribution of our words
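
Before running the cells below on the full corpus, it may help to see what a frequency distribution holds. This is a minimal sketch on a made-up token list (the `sample_*` names are illustrative assumptions): `nltk.FreqDist` counts how often each token occurs.

```python
import nltk

# Made-up tokens for illustration only
sample_tokens = ["I", "think", "maybe", "I", "think", "so", "I", "guess"]

sample_fd = nltk.FreqDist(sample_tokens)

print(sample_fd["think"])        # count for a single token
print(sample_fd.N())             # total number of tokens counted
print(sample_fd.most_common(2))  # the two most frequent tokens with their counts
```

Counting is exact-match, so "Think" and "think" would be counted as different tokens.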

In [None]:
#Top 25 most common words with their counts
fd = nltk.FreqDist(uncertainty_wordTokens)
print(fd.most_common(25))

In [None]:
#Top 25 most common words easier to read from page
fd.tabulate(25)

In [None]:
# Cumulative plot of top 25 words
fd.plot(25, cumulative=True)

In [None]:
# Bar chart of the 10 most common words

from matplotlib import pyplot as plt

# This reuses the frequency distribution `fd` created in the cells above

# Get the 10 most common words and their counts
common = fd.most_common(10)

# Unzip the words and counts into two separate lists
words, counts = zip(*common)

# Create a bar graph
plt.bar(words, counts)
plt.show()

# 3.3 Working with NLTK Text Objects
1. concordance
2. similar words
3. dispersion plots


One of the variables that we created earlier was neither a string of words nor a list of words, but was labeled as a text object. What do we do with that bucket? We can use it directly with some of NLTK's built-in functions, such as concordance, similar, and dispersion plots.

#* concordance()
The concordance function searches your entire data corpus for occurrences of a specific word. It shows the word along with the surrounding words, providing context that helps you understand how the word is used throughout the text. This can be particularly helpful for studying word usage patterns, exploring word meanings, or analyzing language use in different contexts. The whole call is simply our variable name 'uncertainty_wordTextObjects' followed by the NLTK method 'concordance()'.
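
If you want to work with the matches rather than just read them, a minimal sketch like the following may help. It assumes a reasonably recent NLTK release in which `Text.concordance_list()` is available; the tokens and `sample_*` names are made up for illustration.

```python
import nltk

# Made-up tokens for illustration only
sample_tokens = "I think maybe we add them . I don't know , I think so".split()
sample_text = nltk.Text(sample_tokens)

# concordance_list() returns the matches instead of printing them
matches = sample_text.concordance_list("think", width=40, lines=5)

for match in matches:
    # offset is the token position of the match; left and right hold the context tokens
    print(match.offset, match.left[-2:], match.query, match.right[:2])
```

Each match records where the word was found, which makes it possible to count or filter matches in code.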

*Try changing the word found in line 4 of the code cell below


In [None]:
# This line will search for the occurrences of one of our uncertainty words.
# After running it, try using different uncertainty words
# Feel free to change the number of lines that are shown as well
uncertainty_wordTextObjects.concordance('think', lines = 25)

Question:
With your table partners, come up with 3-4 other search terms that might give us some insight into what may be occurring during student moments of uncertainty. Look at the concordance for those words and talk about any insight gained.

#* similar()

This NLTK function is used to find words that appear in a similar context as the given word in a text corpus. It looks at the context in which a word appears—specifically, the words that appear immediately before and after the target word—and finds other words that appear in the same or similar contexts.

This can be useful for semantic analysis, linguistic exploration, and various natural language processing tasks. It allows us to see words that are used in similar ways and therefore might be considered similar in meaning within the text.
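
The idea of "appearing in a similar context" can be illustrated with a small plain-Python sketch. This is a simplified illustration and not NLTK's actual implementation; the helper name `shared_context_words` and the tokens are hypothetical.

```python
from collections import defaultdict

def shared_context_words(tokens, target):
    """Return words that share at least one (previous, next) context with target."""
    contexts = defaultdict(set)
    # Record the neighbor on each side of every token that has both neighbors
    for i in range(1, len(tokens) - 1):
        contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    target_contexts = contexts[target]
    return sorted(
        word for word, ctxs in contexts.items()
        if word != target and ctxs & target_contexts
    )

# Made-up tokens for illustration only
sample_tokens = "i think so but i guess so and i think not".split()
print(shared_context_words(sample_tokens, "think"))
```

Here "guess" is reported because both "think" and "guess" occur between "i" and "so".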



In [None]:
uncertainty_wordTextObjects.similar("maybe")

#* dispersion_plot()

This function creates a lexical dispersion plot, which visualizes how words are distributed across the entire data corpus. This plot helps you see where certain words occur throughout the text and whether there are any patterns or clusters of word occurrences.
The dispersion plot displays a series of dashes or markers along a horizontal axis that represents the length of the text. Each dash or marker corresponds to the position of a specific word within the text. By visualizing the distribution of words in this way, you can quickly identify trends, repetition, and patterns of word usage.
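
The positions drawn on such a plot are simply token offsets. As a minimal plain-Python sketch (not NLTK's plotting code; the tokens and names below are made up), you can compute those offsets yourself:

```python
# Made-up tokens for illustration only
sample_tokens = "maybe we add them , I think , or maybe we subtract".split()
words_of_interest = ["maybe", "think"]

# For each word, collect the token positions where it occurs
offsets = {
    word: [i for i, token in enumerate(sample_tokens) if token == word]
    for word in words_of_interest
}

for word, positions in offsets.items():
    print(word, positions)
```

Having the raw offsets can be a useful cross-check when a plot looks surprising.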

In [None]:
# This will show a dispersion plot of the words of interest.
# This is a very useful NLTK function, but it does not seem to display correctly when used in Google Colab.
# You can see this if you compare the plot output below with our concordance search for "maybe".
# The word labels actually seem to be listed in the reverse order of the rows drawn on the plot.

uncertainty_wordTextObjects.dispersion_plot(["maybe","help", "math", "think"])

Question:

What are some inferences we can make using a lexical dispersion plot like this one?

# 3.4 n-grams and collocations

N-grams are contiguous sequences of words (or other linguistic units) that appear in a data corpus. They can show patterns and relationships of word usage in a text. For example, bigrams are sequences of two words that appear together, and trigrams are sequences of three words that appear together.
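
As a minimal sketch on a made-up four-token list (the `sample_*` names are illustrative), this is what the n-gram helpers return:

```python
import nltk
from nltk.util import ngrams

# Made-up tokens for illustration only
sample_tokens = ["I", "don't", "know", "how"]

sample_bigrams = list(nltk.bigrams(sample_tokens))    # every pair of adjacent tokens
sample_trigrams = list(nltk.trigrams(sample_tokens))  # every run of three adjacent tokens
sample_fourgrams = list(ngrams(sample_tokens, 4))     # general form for any n

print(sample_bigrams)
print(sample_trigrams)
print(sample_fourgrams)
```

A list of k tokens yields k - n + 1 n-grams, so longer n-grams are always fewer in number.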

1. bigrams



In [None]:
# Listing bigrams from data
uncertainty_bigrams = list(nltk.bigrams(uncertainty_wordTextObjects))
print(uncertainty_bigrams)

2. trigrams

In [None]:
# Listing trigrams from data
uncertainty_trigrams = list(nltk.trigrams(uncertainty_wordTextObjects))
print(uncertainty_trigrams)

# Collocations
Collocations are pairs or groups of words that tend to appear frequently together in a text, often with a specific meaning or connotation ('fast food', 'red wine', 'United States of America').  They are not just random word combinations but rather indicative of language patterns. Identifying collocations can be useful for understanding the associations between words in a text.

NLTK provides the collocations module, which includes methods to identify and extract collocations from a text. One common approach is to use the BigramAssocMeasures class along with the BigramCollocationFinder class to find significant word pairs (bigrams) based on various measures, such as frequency.
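
A minimal sketch of that approach on made-up tokens might look like this. The names `sample_finder` and `measures` are illustrative, and the frequency threshold of 2 is an arbitrary assumption for the toy data.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Made-up tokens for illustration only
sample_tokens = "i think maybe i think so i guess".split()

sample_finder = BigramCollocationFinder.from_words(sample_tokens)

# Keep only bigrams that occur at least twice (threshold chosen for this toy example)
sample_finder.apply_freq_filter(2)

measures = BigramAssocMeasures()
# Rank the remaining bigrams by raw frequency
print(sample_finder.nbest(measures.raw_freq, 5))
# Score the same bigrams with pointwise mutual information (PMI)
print(sample_finder.score_ngrams(measures.pmi))
```

On a real corpus, a frequency filter helps keep rare word pairs from dominating measures such as PMI.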

In [None]:
# display frequency of highest 50 bigrams
# See if you can identify any bigrams that might signal uncertainty
finder = nltk.collocations.BigramCollocationFinder.from_words(uncertainty_wordTextObjects)
finder.ngram_fd.tabulate(50)

In [None]:
# display frequency of highest 25 trigrams
finder = nltk.collocations.TrigramCollocationFinder.from_words(uncertainty_wordTextObjects)
finder.ngram_fd.tabulate(25)

# Any number of Ns

If you need to search for higher values of 'n', the following code cell can search for any value of 'n' that you would like. Just change the number in line 4.

In [None]:

from nltk.util import ngrams

n_value = 5  # Change this for different n values
uncertainty_ngrams = ngrams(uncertainty_wordTextObjects, n_value)

# Tabulate the top n-grams
fdist = nltk.FreqDist(uncertainty_ngrams)
fdist.tabulate(25)  # Top 25 n-grams


# Change the value for n and identify any other n-grams that might signal uncertainty

# Finding words that appear 'close' to each other.
The provided code snippet demonstrates how to search for instances of two specific words (e.g., "I" and "think") appearing within a certain distance from each other in a given text corpus. The code utilizes regular expressions and the NLTK library to tokenize the text, compile the regular expression pattern, and then search for matching sentences containing the specified word pair. It also includes a step to download the NLTK punkt tokenizer data if not already available, ensuring successful sentence tokenization.

* You should hopefully notice here that NLTK is able to use both word and sentence tokens interactively with each other.
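
As an alternative way to think about "closeness", the same question can be asked directly on token positions. The following is a minimal sketch and not the method used in the code cells below; the helper name `find_within_distance` and the tokens are hypothetical.

```python
def find_within_distance(tokens, word1, word2, max_distance):
    """Return (i, j) token positions where word2 follows word1 with at most
    max_distance tokens in between (case-insensitive)."""
    w1, w2 = word1.lower(), word2.lower()
    lowered = [t.lower() for t in tokens]
    pairs = []
    for i, token in enumerate(lowered):
        if token != w1:
            continue
        # Look ahead at most max_distance intervening tokens plus the match itself
        window_end = min(len(lowered), i + max_distance + 2)
        for j in range(i + 1, window_end):
            if lowered[j] == w2:
                pairs.append((i, j))
    return pairs

# Made-up tokens for illustration only
sample_tokens = "I really do think so".split()
print(find_within_distance(sample_tokens, "I", "think", 2))
print(find_within_distance(sample_tokens, "I", "think", 1))
```

Working on positions makes the distance explicit: "think" is found when two intervening tokens are allowed, but not when only one is.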

In [None]:
import re

import nltk
from nltk.text import Text

# Download the NLTK punkt tokenizer data (needed for sentence tokenization)
nltk.download('punkt')


You can change the words in lines 9 and 10 to investigate different word combinations. Line 13 allows you to change the maximum distance between the words.

In [None]:

# Tokenize the text into words
corpus_words = nltk.word_tokenize(raw_uncertaintyText)

# Create a Text object from the tokenized words
corpus_text = Text(corpus_words)

# Define the two words to search for
word1 = "I"
word2 = "think"

# Define the maximum distance between the two words
max_distance = 10  # Adjust this value as needed

# Create a pattern for searching (re.escape keeps any special characters in the words literal)
pattern = r"\b" + re.escape(word1) + r"\W+(?:\w+\W+){" + f"0,{max_distance}" + r"}" + re.escape(word2) + r"\b"

# Extract plain text from the corpus_text object
plain_text = ' '.join(corpus_text.tokens)

# Compile the regular expression
search_regex = re.compile(pattern, re.IGNORECASE)

# Find sentences where the pattern appears
sentences = nltk.sent_tokenize(plain_text)

# Search for sentences where the two words appear within the specified distance
matching_sentences = []
for sentence in sentences:
    if search_regex.search(sentence):
        matching_sentences.append(sentence)

# Print the matching sentences
for sentence in matching_sentences:
    print(sentence)


# End of Module 3
Question:
Work with your table partners and come up with 3-4 other useful word combinations that could give us insight into our uncertainty data corpus. For example, with "did" and "you": do we find examples of students asking questions about the process used by other students?

[Module 4](https://github.com/mrhallonline/NLP-Workshop/blob/main/Module_4_Running_Basic_Sentiment_Analysis_Workshop_Natural_Language_Toolkit_(NLTK)_V3.ipynb)

A huge takeaway to consider at this point is that our data still contains quite a bit of noise, and that noise shows up in our analysis.

For example:

Our second and third most frequent three-word collocations are ('I', "don't", 'know') and ('I', "don't", 'know.').
** In this case "know" and "know." are identified as two different words, which is no doubt causing similar issues with other "words" in our data corpus.
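
The effect can be reproduced on a tiny made-up example. This is only a sketch of the problem, with a deliberately simple cleanup (stripping leading and trailing punctuation) chosen for illustration; it is not the preprocessing used later in the workshop.

```python
import string

import nltk

# Made-up tokens for illustration only
sample_tokens = "I don't know I don't know. I don't know".split()

raw_counts = nltk.FreqDist(nltk.trigrams(sample_tokens))

# Strip leading and trailing punctuation from each token (apostrophes inside words are kept)
cleaned_tokens = [token.strip(string.punctuation) for token in sample_tokens]
cleaned_counts = nltk.FreqDist(nltk.trigrams(cleaned_tokens))

print(raw_counts[("I", "don't", "know")], raw_counts[("I", "don't", "know.")])
print(cleaned_counts[("I", "don't", "know")])
```

Before cleaning, the phrase is split across two trigrams; after cleaning, the counts merge into one.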

Module 5 will let you spend more time on text preprocessing, so that you can better extract the specific features you need from a transcript corpus. Seeing how the signal interacts with the noise helps in planning, and in understanding why text preprocessing and feature extraction matter.