<a href="https://colab.research.google.com/github/hasnadaffau/nlp-lab/blob/main/tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Week 1: Introduction to NLP – Conceptual Foundations and Tokenization**

- Objective: Understand NLP concepts and perform basic tokenization (CMPK01a).
- Dataset: NLTK’s Gutenberg corpus (English text for simplicity).
- Data Collection Method: Direct access to a prebuilt corpus via NLTK.
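Before turning to NLTK, the idea of tokenization can be approximated with a single regular expression. The sketch below is only an illustration: the helper `simple_tokenize` and the sample sentence are assumptions rather than part of the lab, and NLTK's `word_tokenize` handles cases such as contractions and abbreviations that this pattern does not.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (illustrative only)."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Emma Woodhouse, handsome, clever, and rich."
tokens = simple_tokenize(sample)
print(tokens)
```

Each comma and the final period come out as separate tokens, which mirrors how punctuation shows up in the NLTK output later in this notebook.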

In [None]:
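import nltk

# The Gutenberg corpus is read in the next cell, so fetch it along with the tokenizer models
nltk.download("gutenberg", quiet=True)
nltk.download("punkt")
nltk.download("punkt_tab")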

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg

# Load the first 1,000 characters of Jane Austen's "Emma"
text = gutenberg.raw("austen-emma.txt")[:1000]

# Split the text into word and punctuation tokens
tokens = word_tokenize(text)

print("First 20 tokens:", tokens[:20])
print("Total tokens:", len(tokens))
print("Unique tokens:", len(set(tokens)))


First 20 tokens: ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER', 'I', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich']
Total tokens: 198
Unique tokens: 114
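The two counts above can be combined into a type-token ratio (unique tokens divided by total tokens), which for this sample is 114 / 198 ≈ 0.576. The sketch below computes the same ratio on the 20 tokens printed above and shows how dropping punctuation and folding case changes it. The helper `type_token_ratio` is a hypothetical name introduced here, not part of NLTK.

```python
# First 20 tokens copied from the output above
tokens = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I',
          'CHAPTER', 'I', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever',
          ',', 'and', 'rich']

def type_token_ratio(tokens):
    """Return unique tokens divided by total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Keep only alphabetic tokens and fold case, so punctuation, numbers,
# and capitalisation do not affect the counts
words = [t.lower() for t in tokens if t.isalpha()]

print("Raw TTR:", round(type_token_ratio(tokens), 3))
print("Word-only TTR:", round(type_token_ratio(words), 3))
```

Whether to count punctuation and digits as tokens is a design choice; the filter here is one simple option.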


**Case Study on Google Colab: Text Tokenization in Indonesian using the IndoNLU Corpus**

In [None]:
!pip install datasets nltk Sastrawi



In [None]:
import nltk
import pandas as pd
from datasets import load_dataset
from nltk.tokenize import word_tokenize
from collections import Counter
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files={
        "train": "https://raw.githubusercontent.com/IndoNLP/indonlu/master/dataset/smsa_doc-sentiment-prosa/train_preprocess.tsv",
        "validation": "https://raw.githubusercontent.com/IndoNLP/indonlu/master/dataset/smsa_doc-sentiment-prosa/valid_preprocess.tsv",
        "test": "https://raw.githubusercontent.com/IndoNLP/indonlu/master/dataset/smsa_doc-sentiment-prosa/test_preprocess.tsv",
    },
    sep="\t",
    column_names=["text", "label"],  # each row holds two fields (text, label) and the files have no header row
)


In [None]:
train_ds = dataset["train"]

df = train_ds.to_pandas()

print(df.head())

                                                text     label
0  warung ini dimiliki oleh pengusaha pabrik tahu...  positive
1  mohon ulama lurus dan k212 mmbri hujjah partai...   neutral
2  lokasi strategis di jalan sumatera bandung . t...  positive
3  betapa bahagia nya diri ini saat unboxing pake...  positive
4  duh . jadi mahasiswa jangan sombong dong . kas...  negative
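Column names given to a headerless TSV are assigned by position, so it helps to confirm how many fields each row has before naming them. Here is a minimal sketch using only the standard library. The two rows are made-up stand-ins for the real file, and `count_fields` is a hypothetical helper.

```python
import csv
import io

# Made-up rows in the same shape as the dataset: text <TAB> label, no header line
raw_tsv = "makanan nya enak sekali\tpositive\npelayanan lambat\tnegative\n"

def count_fields(tsv_text):
    """Return the set of field counts found across all rows."""
    # QUOTE_NONE treats quote characters as ordinary text
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return {len(row) for row in reader}

widths = count_fields(raw_tsv)
print("Fields per row:", widths)

column_names = ["text", "label"]
# One name per field; a mismatch would shift the data under the wrong headers
assert widths == {len(column_names)}
```

If the number of names and the number of fields differ, values end up under the wrong headers and the surplus column is filled with missing values.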


In [None]:
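# Take the first three training examples and their sentiment labels
sample_texts = df['text'][:3].tolist()
labels = df['label'][:3].tolist()

# Sastrawi stemmer for Indonesian
stemmer = StemmerFactory().create_stemmer()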

for i, (text, label) in enumerate(zip(sample_texts, labels), 1):
    print(f"\nExample {i} (label: {label}):")
    print("Original Text:", text)
    tokens = word_tokenize(text)
    print("Tokens:", tokens)
    print("Count:", len(tokens))
    print("Unique tokens:", len(set(tokens)))
    stemmed = stemmer.stem(text)
    stemmed_tokens = word_tokenize(stemmed)
    print("Stemmed tokens:", stemmed_tokens)
    print("Stemmed count:", len(stemmed_tokens))
    with open(f"tokens_sample_{i}.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(tokens))
    print(f"Saved to tokens_sample_{i}.txt")

# Frequency analysis across the three samples (punctuation tokens are counted too)
all_tokens = [tok for txt in sample_texts for tok in word_tokenize(txt.lower())]
freq = Counter(all_tokens).most_common(10)
print("\nTop 10 words:", freq)



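Because `word_tokenize` returns punctuation marks as tokens, a "top words" count can end up dominated by `.` and `,`. One possible refinement is sketched below, assuming that only alphabetic tokens should count as words. The token list is a made-up stand-in for tokenizer output, and `top_words` is a hypothetical helper.

```python
from collections import Counter

def top_words(tokens, n=10):
    """Return the n most common alphabetic tokens, case-folded."""
    words = [t.lower() for t in tokens if t.isalpha()]
    return Counter(words).most_common(n)

# Made-up tokens standing in for word_tokenize output
sample_tokens = ["lokasi", "strategis", ".", "lokasi", "nyaman", ",", "Nyaman", "."]
print(top_words(sample_tokens, n=2))
```

Note that `str.isalpha` also drops mixed tokens such as `k212`, so a looser filter may suit informal text better.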
