<img src="https://www.rp.edu.sg/images/default-source/default-album/rp-logo.png" width="200" alt="Republic Polytechnic"/>

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/koayst-rplesson/SDGAI_LLMforGenAIApp_Labs/blob/main/L01/L01.ipynb)

# Setup and Installation

You can run this Jupyter notebook either on your local machine or on Google Colab.

* On a local machine, it is recommended to install Anaconda and create a new development environment called `c3669c`.
* Install the libraries listed below with pip or conda as needed.
---

# Lesson 01

In [None]:
%%capture --no-stderr
%pip install --quiet -U nltk

In [1]:
# print module version(s)

import nltk
print(nltk.__version__)

3.9.1


## Tokenization

Tokenization takes a piece of text and converts it into a list of tokens. For a sentence, splitting it into words and punctuation marks is one example of tokenization. The granularity depends on the model; at the finest level, each character can be a token.

Reference: https://www.nltk.org/api/nltk.tokenize.html
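
The granularities described above can be illustrated without NLTK. The sketch below uses only the standard library; the regular expression is a simplified stand-in chosen for illustration and is not the rule set that `word_tokenize` applies.

```python
import re

text = "I am learning natural language processing."

# Word-level: runs of word characters, or any single non-space symbol.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character, including spaces, is its own token.
char_tokens = list(text)

print(word_tokens)
print(char_tokens[:4])
```

Character-level tokens give a much longer sequence for the same text, which is the trade-off to keep in mind when choosing a granularity.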


In [2]:
import nltk

# download and install the resource 
# https://www.nltk.org/api/nltk.tokenize.punkt.html
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\koay_seng_tian\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

### Sentence Tokenization

In [3]:
from nltk import sent_tokenize

In [4]:
text_tok = sent_tokenize("Hi Mr. John! I am going to buy some vegetables from the supermarket. Should I pick up some broccoli?")

print(text_tok)

['Hi Mr. John!', 'I am going to buy some vegetables from the supermarket.', 'Should I pick up some broccoli?']
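
To see why a trained sentence tokenizer is useful, it can help to compare it with a naive rule. The sketch below (standard library only, an illustration rather than anything NLTK does) splits after every `.`, `!` or `?` that is followed by whitespace, and therefore breaks the text after the abbreviation `Mr.`, whereas the `sent_tokenize` output above keeps `Hi Mr. John!` as one sentence.

```python
import re

text = ("Hi Mr. John! I am going to buy some vegetables from the supermarket. "
        "Should I pick up some broccoli?")

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))
```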


### Word Tokenization

In [5]:
from nltk import word_tokenize

In [6]:
words_tok = word_tokenize("I am learning natural language processing.")

# print the tokens
# notice that punctuation marks, such as the full stop, are tokens too
print(words_tok)

['I', 'am', 'learning', 'natural', 'language', 'processing', '.']


## Remove Punctuation and Lower Case

In [7]:
import string

text = """Generative AI refers to a branch of artificial intelligence that creates new content, 
such as text, images, music, or even code, by learning patterns from existing data. 
It uses models like GPT or GANs to generate realistic outputs, often mimicking human creativity. 
This technology is transforming fields like content creation, design, and personalized experiences 
by automating tasks that traditionally required human intervention."""

tokens = word_tokenize(text)

tokens_no_punctuation = []
for tok in tokens:
    if tok not in string.punctuation:
        tokens_no_punctuation.append(tok.lower())

print(tokens_no_punctuation)

['generative', 'ai', 'refers', 'to', 'a', 'branch', 'of', 'artificial', 'intelligence', 'that', 'creates', 'new', 'content', 'such', 'as', 'text', 'images', 'music', 'or', 'even', 'code', 'by', 'learning', 'patterns', 'from', 'existing', 'data', 'it', 'uses', 'models', 'like', 'gpt', 'or', 'gans', 'to', 'generate', 'realistic', 'outputs', 'often', 'mimicking', 'human', 'creativity', 'this', 'technology', 'is', 'transforming', 'fields', 'like', 'content', 'creation', 'design', 'and', 'personalized', 'experiences', 'by', 'automating', 'tasks', 'that', 'traditionally', 'required', 'human', 'intervention']
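
One caveat about the cell above: `tok in string.punctuation` is a substring test against a single string, so it reliably catches single-character tokens but misses multi-character ones such as `...` or the two-character quote tokens that `word_tokenize` emits for double quotes. A possible variant, sketched below with a hand-written token list (the helper name `is_punctuation` is hypothetical), checks every character of the token instead.

```python
import string

PUNCTUATION = set(string.punctuation)


def is_punctuation(token):
    """Return True when every character in the token is a punctuation mark."""
    return all(ch in PUNCTUATION for ch in token)


# Hand-written tokens for illustration (not produced by running NLTK here).
sample_tokens = ["``", "Hello", "''", ",", "world", "...", "COVID-19"]

# Drop pure-punctuation tokens and lowercase the rest.
cleaned = [tok.lower() for tok in sample_tokens if not is_punctuation(tok)]
print(cleaned)
```

Tokens that merely contain punctuation, such as `COVID-19`, are kept because not every character is a punctuation mark.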


## Stop Words

Stop words are common words that mainly support sentence construction. They are often removed before analysis because they contribute little to the meaning of the sentence they appear in. Examples of stop words include `a`, `am` and `the`.

In [8]:
from nltk import word_tokenize
from nltk.corpus import stopwords

# download and install the resource 
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\koay_seng_tian\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
# print the list of English stop words
stop_words = stopwords.words('english')

print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

## Stop Word Removal

In [10]:
sentence = "I am learning Python. It is one of the most popular programming language."
sentence_words = word_tokenize(sentence)

print(sentence_words)

['I', 'am', 'learning', 'Python', '.', 'It', 'is', 'one', 'of', 'the', 'most', 'popular', 'programming', 'language', '.']


In [11]:
# a list comprehension keeps the words that are not stop words;
# join() then combines them into a single string
sentence_no_stop_word = ' '.join([word for word in sentence_words if word not in stop_words])

print(sentence_no_stop_word)

I learning Python . It one popular programming language .


In [12]:
# same result as the cell above, written as an explicit loop instead of a list comprehension
sentence_no_stop_word = []

for word in sentence_words:
    if word not in stop_words:
        sentence_no_stop_word.append(word)

print(' '.join(sentence_no_stop_word))

I learning Python . It one popular programming language .
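
In the output above, `I` and `It` survive because the NLTK stop word list is lowercase and the membership test is case-sensitive. One possible adjustment is to lowercase each word for the comparison only. The sketch below uses a small hand-picked stop word set so that it runs without downloading the corpus; that subset is an assumption for illustration, not the full NLTK list.

```python
# Small hand-picked subset of stop words, used so the example runs without
# the NLTK corpus (an assumption for illustration only).
stop_word_set = {"i", "am", "it", "is", "of", "the", "most"}

sentence_words = ["I", "am", "learning", "Python", ".", "It", "is", "one",
                  "of", "the", "most", "popular", "programming", "language", "."]

# Lowercase each word only for the comparison; keep the original casing.
kept = [word for word in sentence_words if word.lower() not in stop_word_set]
print(" ".join(kept))
```

Storing the stop words in a `set` also makes each membership test faster than scanning a list, which matters on longer texts.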


## Stemming

In English, words take various forms when used in a sentence. For example, the word `product` can appear as `production` or `products`. It is often useful to reduce these forms to a common base form, as they carry closely related meanings.

|            | Stemming Process |         |
|------------|:----------------:|---------|
| Production |        =>        | product |
| Products   |        =>        | product |

In [13]:
stemmer = nltk.stem.PorterStemmer()

In [14]:
print(stemmer.stem("Production"))
print(stemmer.stem("Products"))
print(stemmer.stem("coming"))
print(stemmer.stem("firing"))

print(stemmer.stem("battling"))

product
product
come
fire
battl


## Lemmatization

Stemming sometimes produces unusual results. For example, the word 'battling' is reduced to 'battl', which cannot be found in a dictionary. Lemmatization addresses this issue by returning a base form (the lemma) that is a dictionary word. The extra dictionary lookup makes lemmatization slower than stemming.

In [15]:
nltk.download("wordnet")
from nltk.stem.wordnet import WordNetLemmatizer

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\koay_seng_tian\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [16]:
lemmatizer = WordNetLemmatizer()

In [17]:
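# lemmatize() is case-sensitive and treats every word as a noun unless a
# pos argument is given, so the inputs below come back unchanged
print(lemmatizer.lemmatize("Production"))
print(lemmatizer.lemmatize("Products"))
print(lemmatizer.lemmatize("coming"))
print(lemmatizer.lemmatize("firing"))

print(lemmatizer.lemmatize("battling"))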

Production
Products
coming
firing
battling


## Chunking

In [18]:
import nltk

# MWE stands for Multi-Word Expression.
# Certain groups of words are treated as a single token during tokenization.
# For example: "United States of America", "not only", "but also"

from nltk.tokenize import MWETokenizer
from nltk.tokenize import word_tokenize

In [19]:
compound_words = [("artificial","intelligence"),("data","science")]
mwe_tokenizer = MWETokenizer(compound_words, separator='_')

In [20]:
text_1 = 'We need to invest in data science and artificial intelligence capabilities'

In [21]:
tokens_words_mwe = mwe_tokenizer.tokenize(word_tokenize(text_1))

print(tokens_words_mwe)

['We', 'need', 'to', 'invest', 'in', 'data_science', 'and', 'artificial_intelligence', 'capabilities']


In [22]:
text_2 = """Mr Heng Swee Keat will deliver a Ministerial Statement on additional   
support measures for COVID-19 pandemic."""

compound_words = [("Ministerial", "Statement"), ("Heng","Swee","Keat")]
mwe_tokenizer = MWETokenizer(compound_words, separator='_')
tokens_words_mwe = mwe_tokenizer.tokenize(word_tokenize(text_2))

print(tokens_words_mwe)

['Mr', 'Heng_Swee_Keat', 'will', 'deliver', 'a', 'Ministerial_Statement', 'on', 'additional', 'support', 'measures', 'for', 'COVID-19', 'pandemic', '.']
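
The merging step can be pictured as a greedy scan over the token list. The sketch below is a simplified illustration in plain Python; `merge_mwes` is a hypothetical helper and is not how `MWETokenizer` is implemented internally.

```python
def merge_mwes(tokens, expressions, separator="_"):
    """Greedily join known multi-word expressions, longest first."""
    by_length = sorted(expressions, key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for expr in by_length:
            # Compare the next len(expr) tokens with the expression.
            if tuple(tokens[i:i + len(expr)]) == tuple(expr):
                merged.append(separator.join(expr))
                i += len(expr)
                break
        else:
            # No expression starts at this position; keep the token as-is.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "We need to invest in data science and artificial intelligence".split()
expressions = [("artificial", "intelligence"), ("data", "science")]
print(merge_mwes(tokens, expressions))
```

Matching is exact and case-sensitive in this sketch, so `Data science` would not be merged unless that casing is listed as well.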


## POS (Part-of-Speech) Tagging

POS stands for part of speech. POS tagging is the process of labelling each word in a sentence with its part of speech, such as noun, verb or adjective.

| Tag | Meaning in **NLTK** (not exhaustive)                                |
|-----|---------------------------------------------------------------------|
| DT  | Determiner                                                          |
| JJ  | Adjective                                                           |
| NN  | Noun, common, singular or mass                                      |
| PRP | Personal pronoun (e.g. he, she, I, you)                             |
| VBG | Verb, gerund or present participle (e.g. running, eating, walking)  |
| VBP | Verb, present tense, not 3rd person singular (e.g. run, eat, walk)  |


In [23]:
from nltk import word_tokenize

# download and install the resource 
# https://www.nltk.org/_modules/nltk/tag/perceptron.html
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\koay_seng_tian\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [24]:
# tokenize the sentence and print the tokens
words = word_tokenize("I am learning natural language processing.")
print(words)

['I', 'am', 'learning', 'natural', 'language', 'processing', '.']


In [25]:
# print the POS tags
nltk.pos_tag(words)

[('I', 'PRP'),
 ('am', 'VBP'),
 ('learning', 'VBG'),
 ('natural', 'JJ'),
 ('language', 'NN'),
 ('processing', 'NN'),
 ('.', '.')]
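
POS tags can feed later preprocessing steps. For instance, `WordNetLemmatizer.lemmatize` accepts a `pos` argument, and a common pattern is to derive it from the tag's first letter. The sketch below shows one such mapping applied to the tagged output above; `penn_to_wordnet_pos` is a hypothetical helper name, and falling back to noun for every other tag is a simplifying assumption.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, and fallback for everything else


# Tagged tokens copied from the pos_tag output shown above.
tagged = [("I", "PRP"), ("am", "VBP"), ("learning", "VBG"), ("natural", "JJ"),
          ("language", "NN"), ("processing", "NN"), (".", ".")]

for word, tag in tagged:
    print(word, tag, penn_to_wordnet_pos(tag))
```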

# Exercise

Given the text below, which preprocessing techniques could you apply?

"Generative AI is revolutionizing many industries by automating creative tasks such as writing, designing, and even composing music. These models, like GPT and DALL·E, learn from vast amounts of data to produce original content that mimics human output. However, the ethical implications of this technology, including concerns about bias and intellectual property, are raising important debates. As generative AI continues to evolve, it is crucial for researchers and developers to address these challenges to ensure responsible and fair use."