<img src="https://www.rp.edu.sg/images/default-source/default-album/rp-logo.png" width="200" alt="Republic Polytechnic"/>

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/koayst-rplesson/SDGAI_LLMforGenAIApp_Labs/blob/main/L01/L01.ipynb) 

# Setup and Installation

You can run this Jupyter notebook either on your local machine or on Google Colab.

* On a local machine, it is recommended to install Anaconda and create a new development environment called `c3669c`.
* Pip install the libraries listed below when necessary.

# Lesson 01

In [1]:
%%capture --no-stderr
%pip install --quiet -U nltk

## Tokenization

This step takes a piece of text and converts it into a list of tokens. For a sentence, splitting it into words and punctuation marks is one example of tokenization. The granularity depends on the model: at the finest level, each character could be a token.

In [2]:
import nltk
from nltk import word_tokenize

# download and install the resource 
# https://www.nltk.org/api/nltk.tokenize.punkt.html
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\koay_seng_tian\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [3]:
words = word_tokenize("I am learning natural language processing.")

# print the tokens
# notice that punctuation, such as the full stop, is also a token
print(words)

['I', 'am', 'learning', 'natural', 'language', 'processing', '.']
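To make the idea of granularity concrete, here is a minimal sketch in plain Python that compares word-level and character-level tokens for the same sentence. The regular expression is an assumption chosen for illustration; it is not how `word_tokenize` works internally, although it happens to give the same tokens for this simple sentence.

```python
import re

text = "I am learning natural language processing."

# Word-level (illustrative regex, not NLTK's tokenizer):
# a run of word characters, or a single punctuation mark.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character, including spaces, becomes a token.
char_tokens = list(text)

print(word_tokens)
print(len(word_tokens), "word tokens vs", len(char_tokens), "character tokens")
```

Finer granularity produces many more tokens for the same text, which is one of the trade-offs when choosing a tokenization scheme.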


## PoS (Part-of-Speech) Tagging

PoS stands for part of speech. PoS tagging assigns each word (token) in a sentence a label for its grammatical role, such as noun, verb or adjective. The table below lists some of the tags that NLTK uses.

|     | **NLTK** (not exhaustive)                                           |
|-----|---------------------------------------------------------------------|
| DT  | Determiner                                                          |
| JJ  | Adjective                                                           |
| NN  | Noun, common, singular or mass                                      |
| PRP | Personal pronoun (e.g. he, she, I, you)                             |
| VBG | Verb, gerund or present participle (e.g. running, eating, walking)  |
| VBP | Verb, non-3rd person singular present (e.g. run, eat, walk)         |


In [4]:
import nltk
from nltk import word_tokenize

# download and install the resource 
# https://www.nltk.org/_modules/nltk/tag/perceptron.html
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\koay_seng_tian\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [5]:
# tokenize the sentence and print the tokens
words = word_tokenize("I am learning natural language processing.")
print(words)

['I', 'am', 'learning', 'natural', 'language', 'processing', '.']


In [6]:
# print the PoS tags
nltk.pos_tag(words)

[('I', 'PRP'),
 ('am', 'VBP'),
 ('learning', 'VBG'),
 ('natural', 'JJ'),
 ('language', 'NN'),
 ('processing', 'NN'),
 ('.', '.')]
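Once tokens are tagged, the tags can be used to select words by grammatical role. The sketch below copies the tagged output shown above into a literal list, so it runs without NLTK, and keeps only the nouns. The variable names are illustrative assumptions.

```python
# Tagged tokens copied from the nltk.pos_tag output above.
tagged = [
    ('I', 'PRP'),
    ('am', 'VBP'),
    ('learning', 'VBG'),
    ('natural', 'JJ'),
    ('language', 'NN'),
    ('processing', 'NN'),
    ('.', '.'),
]

# Noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith('NN')]

print(nouns)
```

The same pattern works for other roles, for example `tag.startswith('VB')` for verbs.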

## Stop Word

Stop words are common words that mainly support sentence construction. They are often removed before analysis because they contribute little to the meaning of the sentence they appear in. Examples of stop words include `a`, `am` and `the`.

In [7]:
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

# download and install the resource 
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\koay_seng_tian\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
# get the list of English stop words (the corpus file ID is lowercase)
stop_words = stopwords.words('english')

print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

## Stop Word Removal

In [9]:
sentence = "I am learning Python. It is one of the most popular programming language."
sentence_words = word_tokenize(sentence)

print(sentence_words)

['I', 'am', 'learning', 'Python', '.', 'It', 'is', 'one', 'of', 'the', 'most', 'popular', 'programming', 'language', '.']


In [10]:
# use a Python list comprehension to keep the non-stop words, then join them into one string
sentence_no_stop_word = ' '.join([word for word in sentence_words if word not in stop_words])

print(sentence_no_stop_word)

I learning Python . It one popular programming language .


In [11]:
# same result as the code above, written as an explicit loop instead of a list comprehension
sentence_no_stop_word = []

for word in sentence_words:
    if word not in stop_words:
        sentence_no_stop_word.append(word)

print(' '.join(sentence_no_stop_word))

I learning Python . It one popular programming language .
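Notice that `I` and `It` survive in the output above: the stop word list is all lowercase and the comparison is case-sensitive. A minimal sketch of one way to handle this is to lowercase each token before the lookup. To stay self-contained, the example uses a small hand-picked subset of the stop words printed earlier (an assumption for illustration); in the notebook you would use the full `stop_words` list instead.

```python
# Tokens copied from the word_tokenize output above.
sentence_words = ['I', 'am', 'learning', 'Python', '.', 'It', 'is', 'one',
                  'of', 'the', 'most', 'popular', 'programming', 'language', '.']

# Illustrative subset of the NLTK English stop words; a set gives fast lookups.
stop_word_subset = {'i', 'am', 'it', 'is', 'of', 'the', 'most'}

# Compare the lowercased token, but keep the original spelling in the result.
filtered = [word for word in sentence_words if word.lower() not in stop_word_subset]

print(' '.join(filtered))
```

Lowercasing only for the comparison keeps proper nouns such as `Python` intact in the result.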


## Stemming

In English, words are transformed into various forms when used in a sentence. For example, the word `product` becomes `production` or `products`. Stemming converts these words to a common base form, since they carry closely related meanings.

|            | Stemming Process |         |
|------------|:----------------:|---------|
| Production |        =>        | product |
| Products   |        =>        | product |

In [12]:
import nltk
stemmer = nltk.stem.PorterStemmer()

In [13]:
print(stemmer.stem("Production"))
print(stemmer.stem("Products"))
print(stemmer.stem("coming"))
print(stemmer.stem("firing"))

print(stemmer.stem("battling"))

product
product
come
fire
battl
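To see why a stem such as `battl` can appear, it helps to look at a deliberately naive suffix stripper. The sketch below is not the Porter algorithm, which applies several ordered rule phases with extra conditions; the suffix list and the `naive_stem` function are hypothetical and only show the general idea of cutting off word endings without consulting a dictionary.

```python
# Hypothetical, tiny suffix list for illustration only (not the Porter rules).
SUFFIXES = ("ion", "ing", "s")


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        # Only strip when enough of the word remains to be a plausible stem.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word


for w in ["Production", "Products", "battling"]:
    print(w, "=>", naive_stem(w))
```

Because the rules only look at the characters at the end of a word, nothing guarantees that the result is a real word.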


## Lemmatization

Stemming sometimes produces unusual results. For example, the word 'battling' is transformed to 'battl', which cannot be found in a dictionary. Lemmatization is another technique that overcomes this issue: the base form it returns (the lemma) can be found in a dictionary. This additional dictionary lookup slows down the process.

In [14]:
import nltk

nltk.download("wordnet")
from nltk.stem.wordnet import WordNetLemmatizer

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\koay_seng_tian\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [15]:
lemmatizer = WordNetLemmatizer()

In [16]:
print(lemmatizer.lemmatize("Production"))
print(lemmatizer.lemmatize("Products"))
print(lemmatizer.lemmatize("coming"))
print(lemmatizer.lemmatize("firing"))

print(lemmatizer.lemmatize("battling"))

Production
Products
coming
firing
battling
