<a href="https://colab.research.google.com/github/gnoejh/ict1022/blob/main/Transformer/1_text_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [188]:
# Imports
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import ne_chunk
from transformers import AutoTokenizer

# Ensure nltk resources are available
try:
    nltk.download('punkt')
    nltk.download('punkt_tab')  # for colab
    nltk.download('wordnet')
    nltk.download('stopwords')
    nltk.download('averaged_perceptron_tagger_eng')  # POS tagger model used by nltk.pos_tag
    nltk.download('maxent_ne_chunker')
    nltk.download('words')
except Exception as e:
    print(f"Error downloading NLTK resources: {e}")

# Initialize reusable components
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hjeong\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\hjeong\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hjeong\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hjeong\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\hjeong\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\hjeong\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_

### 1. Basic Text Cleaning
Lowercase the text and strip punctuation so that later steps see uniform input.
- Regular expression: https://docs.python.org/3/library/re.html
- NLTK: https://www.nltk.org/book/

In [189]:
# Sample text
text = "Hello! This is an NLP tutorial by Prof. Hong Jeong in AIDT class at IUT. Let's learn, step-by-step."

# Lowercasing
text = text.lower()
print("Lowercased Text:", text)

# Removing punctuation
text = re.sub(r'[^\w\s]', '', text)
print("Cleaned Text:", text)

Lowercased Text: hello! this is an nlp tutorial by prof. hong jeong in aidt class at iut. let's learn, step-by-step.
Cleaned Text: hello this is an nlp tutorial by prof hong jeong in aidt class at iut lets learn stepbystep
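
The cleaned output above fuses "step-by-step" into "stepbystep" because the regex deletes punctuation outright. A minimal sketch of one alternative, assuming you would rather keep the hyphenated parts as separate words: replace punctuation with a space and then collapse the extra whitespace. The helper name `clean_text` is hypothetical and not part of the notebook.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    text = text.lower()
    # Replace each non-word, non-space character with a space instead of deleting it
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Let's learn, step-by-step."))
```

The trade-off is that contractions are split as well ("let's" becomes "let s"), so which variant fits better depends on the downstream task.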


### 2. Tokenization
Split text into individual words or subword units for further analysis.

- Word Tokenization splits text into individual words.
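- Subword Tokenization splits rare or out-of-vocabulary words into smaller known pieces, which often correspond to morphological components. The `bert-base-uncased` tokenizer used here applies WordPiece (not BPE, Byte Pair Encoding); continuation pieces are prefixed with `##`.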

In [190]:
# Word Tokenization
try:
    tokens = word_tokenize(text)
    print("Word Tokens:", tokens)
except LookupError as e:
    print("Error loading punkt resource. Ensure the directory structure is correct.", e)

# Subword Tokenization (WordPiece, as used by bert-base-uncased)
subword_tokens = tokenizer.tokenize(text)
print("Subword Tokens:", subword_tokens)

Word Tokens: ['hello', 'this', 'is', 'an', 'nlp', 'tutorial', 'by', 'prof', 'hong', 'jeong', 'in', 'aidt', 'class', 'at', 'iut', 'lets', 'learn', 'stepbystep']
Subword Tokens: ['hello', 'this', 'is', 'an', 'nl', '##p', 'tutor', '##ial', 'by', 'prof', 'hong', 'je', '##ong', 'in', 'aid', '##t', 'class', 'at', 'i', '##ut', 'lets', 'learn', 'step', '##by', '##ste', '##p']
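
The `##` pieces in the output above are characteristic of WordPiece, which at tokenization time splits each word greedily, always taking the longest vocabulary entry that matches at the current position. The sketch below illustrates that matching rule only. The function `wordpiece_tokenize` and the tiny vocabulary are invented for this example and are not the Hugging Face implementation or BERT's real vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until a vocab hit
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this example (not BERT's real vocabulary)
toy_vocab = {"tutor", "##ial", "step", "##by", "##ste", "##p", "nl"}

print(wordpiece_tokenize("tutorial", toy_vocab))
print(wordpiece_tokenize("stepbystep", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary the splits for "tutorial" and "stepbystep" match the ones shown in the notebook output, while a word with no matching pieces falls back to the unknown token.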


### 3. Stemming and Lemmatization
Reduce words to a base form so that variants are treated alike (e.g., 'running' to 'run').
- Stemming strips suffixes by rule, so the result may not be a valid word (e.g., 'tutorial' becomes 'tutori').
- Lemmatization maps a word to its dictionary form (lemma). It is more accurate but depends on the word's part of speech; called without a `pos` argument, `WordNetLemmatizer.lemmatize` treats every word as a noun.


In [191]:
# Applying stemming
stemmed_tokens = [stemmer.stem(word) for word in tokens]
print("Stemmed Tokens:", stemmed_tokens)

# Applying lemmatization
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
print("Lemmatized Tokens:", lemmatized_tokens)

Stemmed Tokens: ['hello', 'thi', 'is', 'an', 'nlp', 'tutori', 'by', 'prof', 'hong', 'jeong', 'in', 'aidt', 'class', 'at', 'iut', 'let', 'learn', 'stepbystep']
Lemmatized Tokens: ['hello', 'this', 'is', 'an', 'nlp', 'tutorial', 'by', 'prof', 'hong', 'jeong', 'in', 'aidt', 'class', 'at', 'iut', 'let', 'learn', 'stepbystep']
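
Because lemmatization depends on part of speech, a common pattern is to translate the Penn Treebank tags produced by `nltk.pos_tag` into the single-letter codes that WordNet expects. The mapping below is a minimal sketch; the helper name `penn_to_wordnet` is hypothetical, and the coarse prefix rules are an assumption that covers only the four WordNet categories.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # everything else falls back to noun, the lemmatizer's default

for tag in ["VBG", "JJ", "RB", "NN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

In the notebook this could be combined with the existing objects as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))` for each `(word, tag)` pair from `nltk.pos_tag(tokens)`, so that a verb form such as 'running' is reduced to 'run' rather than left unchanged.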


### 4. Stop-word Removal
Remove common words (e.g., "the", "is", "in") that do not add significant meaning.

Removing stop-words helps reduce noise in the data, especially when focusing on meaningful words.

In [192]:
# Remove stop-words
filtered_tokens = [word for word in tokens if word not in stop_words]
print("Filtered Tokens (Stop-words removed):", filtered_tokens)

Filtered Tokens (Stop-words removed): ['hello', 'nlp', 'tutorial', 'prof', 'hong', 'jeong', 'aidt', 'class', 'iut', 'lets', 'learn', 'stepbystep']
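
Stop-word lists often contain negations such as "not" and "no", and removing them can change the meaning of a sentence in tasks like sentiment analysis. The sketch below shows one way to keep selected words by subtracting them from the stop list. The small `base_stop_words` set is illustrative only; in the notebook the same subtraction could be applied to `stop_words`.

```python
# Small illustrative stop list; a real pipeline would start from a fuller list
base_stop_words = {"the", "is", "in", "this", "not", "no", "a"}

# Words to keep even though the stop list contains them
keep = {"not", "no"}
custom_stop_words = base_stop_words - keep

tokens = ["this", "tutorial", "is", "not", "boring"]
filtered = [t for t in tokens if t not in custom_stop_words]
print(filtered)
```

Here "not" survives the filter, so the remaining tokens still carry the negation.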


### 5. Part-of-Speech (POS) Tagging
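Assign each word a part of speech (noun, verb, etc.), which helps in analyzing grammatical structure. The tagger relies on sentence context, so running it on lowercased, stop-word-filtered tokens, as done here, can yield unreliable tags (e.g., 'nlp' tagged as CC).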

In [193]:
# Part-of-Speech Tagging
try:
    pos_tags = nltk.pos_tag(filtered_tokens)
    print("POS Tags:", pos_tags)
except LookupError as e:
    print("Error loading averaged_perceptron_tagger_eng resource. Ensure the directory structure is correct.", e)

POS Tags: [('hello', 'NN'), ('nlp', 'CC'), ('tutorial', 'JJ'), ('prof', 'NN'), ('hong', 'NN'), ('jeong', 'NN'), ('aidt', 'JJ'), ('class', 'NN'), ('iut', 'NN'), ('lets', 'VBZ'), ('learn', 'JJ'), ('stepbystep', 'NN')]


### 6. Named Entity Recognition (NER)
Identify named entities such as people, organizations, and locations in POS-tagged text. NLTK's `ne_chunk` returns a tree in which recognized entities appear as labeled subtrees. Here the tree contains none, most likely because lowercasing and stop-word removal discarded the capitalization and context the chunker depends on.

In [194]:
# Named Entity Recognition
try:
    # Ensure pos_tags is defined
    if 'pos_tags' not in locals():
        pos_tags = nltk.pos_tag(filtered_tokens)  # Perform POS tagging if not already done
    ner_tags = ne_chunk(pos_tags)
    print("Named Entities:", ner_tags)
except LookupError as e:
    print("Error loading NER resources. Ensure the directory structure is correct.", e)

Named Entities: (S
  hello/NN
  nlp/CC
  tutorial/JJ
  prof/NN
  hong/NN
  jeong/NN
  aidt/JJ
  class/NN
  iut/NN
  lets/VBZ
  learn/JJ
  stepbystep/NN)


### 7. Combining Steps into a Preprocessing Function

In [195]:
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    
    # Tokenization
    tokens = word_tokenize(text)
    
    # Remove stop-words
    tokens = [word for word in tokens if word not in stop_words]
    
    # Stemming (optional; remove this step to keep full words)
    tokens = [stemmer.stem(word) for word in tokens]
    
    return tokens

# Example usage
preprocessed_text = preprocess_text("Hello! This is an NLP tutorial. Let's learn, step-by-step.")
print("Preprocessed Text:", preprocessed_text)

Preprocessed Text: ['hello', 'nlp', 'tutori', 'let', 'learn', 'stepbystep']
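
The function above hard-codes the stop list and always stems. A minimal sketch of a more configurable variant follows, assuming the stop words and the stemming function are passed in as arguments. The name `preprocess` and its parameters are hypothetical, and a plain whitespace split stands in for `word_tokenize` so that the example runs without NLTK.

```python
import re
from typing import Callable, Iterable, List, Optional

def preprocess(text: str,
               stop_words: Iterable[str] = (),
               stem: Optional[Callable[[str], str]] = None) -> List[str]:
    """Lowercase, strip punctuation, split on whitespace, filter, optionally stem."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    tokens = text.split()  # whitespace split stands in for word_tokenize
    stop = set(stop_words)
    tokens = [t for t in tokens if t not in stop]
    if stem is not None:
        # Stemming is applied only when a stemming function is supplied
        tokens = [stem(t) for t in tokens]
    return tokens

print(preprocess("Hello! This is an NLP tutorial.", stop_words={"this", "is", "an"}))
```

With the notebook's objects, the call would look like `preprocess(text, stop_words=stop_words, stem=stemmer.stem)`; omitting `stem` skips stemming entirely.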
