<a href="https://colab.research.google.com/github/mrhallonline/NLP-Workshop/blob/main/Module_5_Basics_of_Text_Preprocessing_Workshop_Natural_Language_Toolkit_(NLTK)_V3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 5.0 Text Wrangling and Basic Preprocessing (20 Minutes)
* 5.1 Reconnecting to Google Drive and basic processing
* 5.2 Tokenizing Text Using Different Regular Expressions
* 5.3 All-in-One Text Normalization Using NLTK
* 5.4 Viewing the List of English Stopwords using NLTK
* 5.5 List of Ways to Extract Features Using Regular Expression as the tokenizer
* 5.6 Conclusion
  * Issues to keep in mind when normalizing your data corpus
  * Ethical considerations/Further usage
  * Wrap-up

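# 5.1 Reconnecting to Google Drive and basic processing
This cell repeats the setup from the earlier modules: it mounts Google Drive, loads the raw text file, and rebuilds the word tokens, sentence tokens, and NLTK Text object used in the rest of this module.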

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import nltk
nltk.download('punkt')
nltk.download('vader_lexicon')
from nltk.tokenize import word_tokenize

# load data from existing text file we created in module 1
filename = '/content/drive/MyDrive/raw_uncertaintyText.txt'
# The with-block closes the file automatically once the text has been read
with open(filename, 'rt', encoding='utf-8', errors='replace') as uncertaintyText:
    raw_uncertaintyText = uncertaintyText.read()

# Word Tokenization
#uncertainty_wordTokens = nltk.word_tokenize(raw_uncertaintyText)

# Regular expression word tokenizing
pattern = r'\s+'
uncertainty_wordTokens = nltk.regexp_tokenize(raw_uncertaintyText, pattern, gaps=True)

# This line converts our raw text into sentence tokens
uncertainty_sentTokens = nltk.sent_tokenize(raw_uncertaintyText)

# Creating a Text object from the tokens
uncertainty_wordTextObjects = nltk.Text(uncertainty_wordTokens)

print("raw_uncertaintyText is a: ",type(raw_uncertaintyText))
print("uncertainty_wordTokens is a: ",type(uncertainty_wordTokens))
print("uncertainty_sentTokens is a: ",type(uncertainty_sentTokens))
print("uncertainty_wordTextObjects is a: ",type(uncertainty_wordTextObjects))

# 5.2 Tokenizing Text Using Different Regular Expressions
This code snippet demonstrates several ways of tokenizing text with regular expressions using the nltk.regexp_tokenize() function. The gaps argument controls how the pattern is interpreted: with gaps=False the pattern describes the tokens themselves, and with gaps=True it describes the separators between tokens.

* text_regexTokensWF: Matches runs of word characters, with gaps=False. Each token is a sequence of letters, digits, or underscores, so punctuation is dropped and a contraction such as "can't" is split into "can" and "t".

* text_regexTokensST: Matches whitespace, with gaps=True. The pattern describes the separators, so the text is split on whitespace and punctuation stays attached to the neighboring word.

* text_regexTokensSF: Matches whitespace, with gaps=False. The tokens are now the whitespace runs themselves (spaces, tabs, newlines). This is rarely useful on its own, but it shows what the gaps argument does.

* text_regexTokensCAPS: Matches capitalized words. Each token starts with a capital letter followed by one or more word characters.

* text_regexTokens10: Matches contractions such as "can't" and "won't".

The length and the first 200 tokens of each list are printed to give a quick view of the tokenization results.

In [None]:
import nltk
nltk.download('punkt')

# Word-character pattern, gaps=False: the pattern matches the tokens themselves
pattern2 = r'\w+'
text_regexTokensWF = nltk.regexp_tokenize(raw_uncertaintyText, pattern2, gaps=False)

# Whitespace pattern, gaps=True: the pattern matches the separators between tokens
pattern3 = r'\s+'
text_regexTokensST = nltk.regexp_tokenize(raw_uncertaintyText, pattern3, gaps=True)


# Whitespace pattern, gaps=False: the tokens are the whitespace runs themselves
pattern4 = r'\s+'
text_regexTokensSF = nltk.regexp_tokenize(raw_uncertaintyText, pattern4, gaps=False)

#Words with capital letters
pattern5 = r'[A-Z]\w+'
text_regexTokensCAPS = nltk.regexp_tokenize(raw_uncertaintyText, pattern5)

#Contractions (can't, won't)
pattern10 = r'\b\w+\'\w+\b'
text_regexTokens10 = nltk.regexp_tokenize(raw_uncertaintyText, pattern10)

print("text_regexTokensWF Length =", len(text_regexTokensWF), " First 200 tokens:", text_regexTokensWF[0:200])
print("text_regexTokensST Length =", len(text_regexTokensST), " First 200 tokens:", text_regexTokensST[0:200])
print("text_regexTokensSF Length =", len(text_regexTokensSF), " First 200 tokens:", text_regexTokensSF[0:200])
print("text_regexTokensCAPS Length =", len(text_regexTokensCAPS), " First 200 tokens:", text_regexTokensCAPS[0:200])
print("text_regexTokens10 Length =", len(text_regexTokens10), " First 200 tokens:", text_regexTokens10[0:200])
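
To make the effect of the gaps argument concrete, here is a minimal sketch that uses only Python's built-in re module on a short made-up sentence. It assumes that nltk.regexp_tokenize() behaves like re.findall() when gaps=False and like re.split() (with empty strings discarded) when gaps=True, which matches my reading of NLTK's RegexpTokenizer. The helper name tokenize_sketch and the sample sentence are hypothetical and not part of the workshop code.

```python
import re

# Flags that NLTK's RegexpTokenizer applies by default (assumption based on its source).
FLAGS = re.UNICODE | re.MULTILINE | re.DOTALL

def tokenize_sketch(text, pattern, gaps=False):
    """Approximate nltk.regexp_tokenize using only the standard library."""
    regex = re.compile(pattern, FLAGS)
    if gaps:
        # The pattern describes separators: split on it and drop empty strings.
        return [tok for tok in regex.split(text) if tok]
    # The pattern describes the tokens themselves.
    return regex.findall(text)

# Made-up sample; note the two spaces after "certain."
sample = "We can't be certain.  Uncertainty is normal!"

words_wf = tokenize_sketch(sample, r'\w+', gaps=False)
space_st = tokenize_sketch(sample, r'\s+', gaps=True)
space_sf = tokenize_sketch(sample, r'\s+', gaps=False)

print(words_wf)   # word-character runs; "can't" is split and punctuation is lost
print(space_st)   # whitespace-separated tokens; punctuation stays attached
print(space_sf)   # only the whitespace runs
```

Comparing the three printed lists side by side shows why the whitespace pattern is almost always used with gaps=True.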

# 5.3 All-in-One Text Normalization Using NLTK

This code snippet performs multiple steps to normalize a text corpus using NLTK in Python.

* Convert to Lowercase: The tokens previously obtained (referred to as text_regexTokensST in the code) are converted to lowercase using list comprehension.

* Remove Punctuation: Punctuation marks are removed from each token. This is done by translating each string using a translation table that removes punctuation marks.

* Retain Alphabetic Words: Only tokens made up entirely of alphabetic characters are kept. Tokens that contain digits or other symbols, and tokens left empty after punctuation removal, are filtered out.

* Remove Stopwords: Finally, common stopwords like 'and', 'the', etc., are removed from the list of tokens. The NLTK library provides a list of such stopwords for the English language.

The resulting list, cleaned_uncertainty_wordTokens, contains tokens that are lowercase, clear of punctuation, alphabetic, and are not stopwords. The first 200 tokens of this list are printed for you to compare to the earlier datasets.

In [None]:
#Combined All in one normalizer
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# convert to lower case
lower_tokens = [w.lower() for w in text_regexTokensST]

# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in lower_tokens]


# remove remaining tokens that are not alphabetic
alpha_words = [word for word in stripped if word.isalpha()]


# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
cleaned_uncertainty_wordTokens = [w for w in alpha_words if w not in stop_words]
print(cleaned_uncertainty_wordTokens[0:200])
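
The same four steps can also be wrapped in a small reusable function and tried on a short sentence, which makes it easier to see what each step removes. This is a sketch only: the function name normalize_tokens, the sample sentence, and the tiny DEMO_STOPWORDS set are hypothetical stand-ins so the example runs without downloading anything. In the notebook you would pass set(stopwords.words('english')) instead.

```python
import string

# Tiny stand-in stopword list (assumption for illustration only);
# in the notebook, use set(stopwords.words('english')) instead.
DEMO_STOPWORDS = {"the", "is", "a", "of", "and", "in", "we"}

def normalize_tokens(tokens, stop_words):
    """Lowercase, strip punctuation, keep alphabetic tokens, drop stopwords."""
    table = str.maketrans('', '', string.punctuation)
    lowered = [t.lower() for t in tokens]
    stripped = [t.translate(table) for t in lowered]
    alpha = [t for t in stripped if t.isalpha()]
    return [t for t in alpha if t not in stop_words]

# Whitespace-split tokens from a made-up sentence
raw_tokens = "The level of Uncertainty in 2023 is high, and we can't ignore it.".split()
cleaned = normalize_tokens(raw_tokens, DEMO_STOPWORDS)
print(cleaned)
```

In this sample, "2023" disappears at the alphabetic filter and "can't" survives only as "cant", which previews the information-loss issues discussed in the conclusion.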


# 5.4 Viewing the List of English Stopwords using NLTK
This snippet will print the list of English stopwords that NLTK uses. The stopwords are common words such as 'and', 'the', 'is', etc., that are often filtered out in text processing to extract more 'meaningful' language for analysis. Keep in mind that you can change the words that appear in the list if needed.

In [None]:
from nltk.corpus import stopwords

# Download the stopwords package if you haven't already
import nltk
nltk.download('stopwords')

# Get English stopwords
stop_words = set(stopwords.words('english'))

# Print the list of stopwords
print(sorted(stop_words))
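
Because the stopword list is an ordinary Python set, it can be adjusted with set operations before filtering. The sketch below is one possible approach, not a recommendation for any particular corpus: base_stop_words is a tiny stand-in for NLTK's list, and the words chosen to keep or add are hypothetical examples.

```python
# Stand-in for set(stopwords.words('english')); kept tiny so the sketch
# runs without downloading NLTK data (assumption for illustration).
base_stop_words = {"the", "and", "is", "not", "no", "of"}

# Hypothetical choices: keep negations, and also drop spoken filler words.
keep_words = {"not", "no"}
extra_stop_words = {"um", "uh", "like"}

# Remove the words to keep, then add the extra words to drop.
custom_stop_words = (base_stop_words - keep_words) | extra_stop_words

tokens = ["um", "the", "answer", "is", "not", "certain"]
filtered = [t for t in tokens if t not in custom_stop_words]
print(filtered)
```

Keeping negations is a common adjustment when the filtered tokens will later feed a sentiment analysis, since "not certain" and "certain" mean different things.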


# 5.5 List of Ways to Extract Features Using Regular Expression as the tokenizer
This code uses NLTK to tokenize the text (raw_uncertaintyText) with a variety of regular expression patterns. The patterns extract:
1. word tokens
2. non-word tokens
3. words with capital letters
4. sentences
5. email addresses
6. digits
7. URLs
8. abbreviations
9. words with hyphens
10. contractions
11. hashtags
12. parenthesized expressions
13. phone numbers

Each pattern is applied to the text, and the resulting tokens are stored in separate variables.

The code is a demonstration of multiple regular expression patterns that can be used to tokenize text for different needs. This code is useful for initial data preparation steps where you may need to extract specific types of tokens from raw text.

In [29]:
# Word Tokens
pattern1 = r'\w+'
text_regexTokens1 = nltk.regexp_tokenize(raw_uncertaintyText, pattern1)

#Non-Word Tokens
pattern2 = r'\W+'
text_regexTokens2 = nltk.regexp_tokenize(raw_uncertaintyText, pattern2)

#Words with capital letters
pattern3 = r'[A-Z]\w+'
text_regexTokens3 = nltk.regexp_tokenize(raw_uncertaintyText, pattern3)

#Sentences
pattern4 = r'[A-Z][^.!?]*[.!?]'
text_regexTokens4 = nltk.regexp_tokenize(raw_uncertaintyText, pattern4)

#Email Addresses
pattern5 = r'\S+@\S+'
text_regexTokens5 = nltk.regexp_tokenize(raw_uncertaintyText, pattern5)

#Digits
pattern6 = r'\d+'
text_regexTokens6 = nltk.regexp_tokenize(raw_uncertaintyText, pattern6)

#URLs
pattern7 = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
text_regexTokens7 = nltk.regexp_tokenize(raw_uncertaintyText, pattern7)

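#Custom Patterns - Single-letter abbreviations with following period (e.g. the "J." in "J. Smith")
# Non-capturing group (?:...): regexp_tokenize relies on re.findall, which would
# otherwise return only the contents of a capturing group instead of the full match
pattern8 = r'\b[A-Za-z]\.(?:\s)?'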
text_regexTokens8 = nltk.regexp_tokenize(raw_uncertaintyText, pattern8)


#Words, keeping hyphenated compounds (e.g. non-violent) together as single tokens
pattern9 = r'\w+(?:-\w+)*'
text_regexTokens9 = nltk.regexp_tokenize(raw_uncertaintyText, pattern9)


#Contractions (can't, won't)
pattern10 = r'\b\w+\'\w+\b'
text_regexTokens10 = nltk.regexp_tokenize(raw_uncertaintyText, pattern10)

#Hashtags
pattern11 = r'#\w+'
text_regexTokens11 = nltk.regexp_tokenize(raw_uncertaintyText, pattern11)


#Custom Pattern- Parenthesized expressions
pattern12 = r'\(.*?\)'
text_regexTokens12 = nltk.regexp_tokenize(raw_uncertaintyText, pattern12)

#Custom Pattern - Phone number
pattern13 = r'\d{3}-\d{3}-\d{4}'
text_regexTokens13 = nltk.regexp_tokenize(raw_uncertaintyText, pattern13)

# 5.6 Conclusion
## Issues to keep in mind when running the all-in-one text normalization process:

* Context Loss: Removing stopwords and punctuation can sometimes result in the loss of context or meaning, which could be important for certain types of analysis like sentiment analysis or named entity recognition.

* Non-English Text: This code is tailored for English text. If your dataset contains text in other languages, you would need to adjust the list of stopwords and possibly other aspects of the normalization process.

* Special Tokens: Some tokens might be specialized terms or jargon that should not be lower-cased or stripped of punctuation. This would need special handling.

* Homonyms: Converting all words to lowercase might make some words that are spelled the same but have different meanings indistinguishable. For example, "US" (United States) and "us" (pronoun) would both become "us."

* Information Loss: By removing all words that are not purely alphabetic, you might lose some potentially important tokens like numerals or alphanumeric codes which might be significant in a particular context.
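
One way to handle special tokens is to pass a list of protected tokens through the normalizer untouched. The following is a minimal sketch under stated assumptions: the function name normalize_with_protection, the protected set, the sample tokens, and the small demo_stop_words set are all hypothetical, and a real project would need to decide which tokens deserve protection.

```python
import string

PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def normalize_with_protection(tokens, stop_words, protected):
    """Normalize tokens, but pass tokens listed in `protected` through untouched."""
    result = []
    for tok in tokens:
        if tok in protected:
            result.append(tok)  # keep original case, digits, and punctuation
            continue
        tok = tok.lower().translate(PUNCT_TABLE)
        if tok.isalpha() and tok not in stop_words:
            result.append(tok)
    return result

demo_stop_words = {"the", "in", "about"}  # stand-in list for illustration
tokens = ["The", "US", "told", "us", "about", "COVID-19", "in", "2020."]

plain = normalize_with_protection(tokens, demo_stop_words, protected=set())
guarded = normalize_with_protection(tokens, demo_stop_words, protected={"US", "COVID-19"})

print(plain)    # "US" and "us" collapse together; "COVID-19" and "2020." are lost
print(guarded)  # protected tokens survive unchanged
```

The trade-off is that the protected list has to be maintained by hand, so it works best for a small set of terms that matter to the research question.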

## Potential Pitfalls & Ethical Considerations

* NLP is challenging, and its output can be misinterpreted.
* Context is important in qualitative research. NLP can only aid, not replace, human analysis.
* It is important to read these analyses in the broader context of classroom dynamics, pedagogical approaches, and individual student backgrounds. Machine analysis can offer insights but can't replace the nuanced understanding of an educator.
* Ethical considerations matter, especially when analyzing students' data.
* Importance of consent when analyzing student discussions and potential implications of the findings.
* The key is to use NLTK and Python as tools to augment qualitative analysis, not replace it. They can help identify patterns or areas of interest, but the rich, nuanced understanding and interpretation will come from the qualitative researcher’s expertise.