# Week 13 Lab: Text Analytics & Natural Language Processing

**Course:** Big Data Analytics  
**Topic:** NLP Pipeline, TF-IDF, Word2Vec, Sentiment Analysis, Topic Modeling, NER  

---
### Description
In this lab we implement a complete NLP pipeline, from text preprocessing through
more advanced NLP models, including:
- Text preprocessing (tokenization, stopword removal, stemming, lemmatization)
- TF-IDF with Scikit-learn
- Word2Vec with Gensim
- Sentiment analysis (VADER + ML)
- Topic modeling with LDA
- Named Entity Recognition (NER)

In [None]:
!pip install nltk gensim scikit-learn --quiet

import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('maxent_ne_chunker_tab', quiet=True)
nltk.download('words', quiet=True)
nltk.download('vader_lexicon', quiet=True)

import re
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import pos_tag, ne_chunk
from nltk.tree import Tree

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder

from gensim.models import Word2Vec
from sklearn.decomposition import PCA

print('All libraries loaded successfully!')

## 1. Text Preprocessing

The preprocessing pipeline used here: lowercase → remove punctuation → tokenization
→ stopword removal → stemming and lemmatization (both applied to the same stopword-filtered tokens).

In [None]:
sample_texts = [
    "The quick brown foxes are JUMPING over the lazy dogs! They've been running all day.",
    "Machine learning algorithms are transforming how companies analyze Big Data in 2024.",
    "Natural Language Processing enables computers to understand, interpret, and generate human language.",
    "Apache Spark's DataFrame API makes it easy to process large-scale datasets efficiently."
]

stop_words = set(stopwords.words('english'))
stemmer    = PorterStemmer()
lemmatizer = WordNetLemmatizer()

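def preprocess_text(text):
    """Run each preprocessing step and return every intermediate result."""
    # Step 1: Lowercase
    cleaned = text.lower()
    # Step 2: Replace punctuation, digits, and special characters with a space,
    # so "they've" becomes "they ve" (both stopwords) instead of "theyve"
    cleaned = re.sub(r"[^a-z\s]", ' ', cleaned)
    # Step 3: Tokenize
    tokens = word_tokenize(cleaned)
    # Step 4: Remove stopwords and single-character tokens
    tokens_no_stop = [t for t in tokens if t not in stop_words and len(t) > 1]
    # Step 5: Stemming (rule-based suffix stripping)
    stemmed = [stemmer.stem(t) for t in tokens_no_stop]
    # Step 6: Lemmatization (without a POS tag, WordNet treats every token as a noun)
    lemmatized = [lemmatizer.lemmatize(t) for t in tokens_no_stop]
    return {
        'original': text.strip(),  # the unmodified input text
        'tokens': tokens,
        'no_stopwords': tokens_no_stop,
        'stemmed': stemmed,
        'lemmatized': lemmatized
    }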

print('=== NLP Preprocessing Pipeline ===')
for i, text in enumerate(sample_texts[:2], start=1):
    result = preprocess_text(text)
    print(f'\nText {i}:')
    print(f'  Original:     {text}')
    print(f'  Tokens:       {result["tokens"]}')
    print(f'  No stopwords: {result["no_stopwords"]}')
    print(f'  Stemmed:      {result["stemmed"]}')
    print(f'  Lemmatized:   {result["lemmatized"]}')

## 2. TF-IDF with Scikit-learn

TF-IDF measures how important a word is to a document relative to the whole corpus.
A word that occurs often in one document but rarely in the others receives a high score.
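
To make the weighting concrete, the sketch below computes TF-IDF by hand with NumPy on a made-up three-document corpus (the documents and variable names are illustrative assumptions, not part of this lab's data). It uses raw term counts, the smoothed inverse document frequency `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and L2 normalization of each row, which to the best of our knowledge matches the default settings of `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`).

```python
import numpy as np

# Hypothetical toy corpus, already tokenized (illustration only)
docs = [
    ["spark", "data", "data"],
    ["spark", "cluster"],
    ["data", "science"],
]
vocab = sorted({tok for doc in docs for tok in doc})

# Raw term counts: one row per document, one column per vocabulary term
tf = np.array([[doc.count(term) for term in vocab] for doc in docs], dtype=float)

n_docs = len(docs)
df = (tf > 0).sum(axis=0)                    # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed inverse document frequency

tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # L2-normalize each row

for term, weight in zip(vocab, tfidf[0].round(3)):
    print(f"{term:8s} {weight}")
```

Terms that appear in fewer documents (`cluster`, `science`) get a larger IDF than terms shared across documents (`spark`, `data`), so they weigh more whenever they occur.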

In [None]:
# Expanded corpus for TF-IDF (12 documents, 4 topics)
corpus = [
    # Technology
    "Machine learning and artificial intelligence are revolutionizing data science.",
    "Deep learning neural networks achieve state of the art results in image recognition.",
    "Python is the most popular programming language for machine learning and data analysis.",
    # Sports
    "The football team won the championship after an incredible season performance.",
    "Basketball players need exceptional speed agility and coordination to excel on the court.",
    "Olympic athletes train for years to achieve peak performance in competitive sports.",
    # Health
    "Regular exercise and balanced nutrition are essential for maintaining good health.",
    "Medical research advances have significantly improved cancer treatment outcomes.",
    "Mental health awareness is crucial for overall wellbeing and quality of life.",
    # Environment
    "Climate change poses serious threats to biodiversity and global ecosystems.",
    "Renewable energy sources like solar and wind power reduce carbon emissions significantly.",
    "Sustainable development requires balancing economic growth with environmental protection."
]
labels = ['tech']*3 + ['sports']*3 + ['health']*3 + ['environment']*3

# Build TF-IDF matrix
tfidf = TfidfVectorizer(max_features=30, stop_words='english', ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(corpus)
feature_names = tfidf.get_feature_names_out()

df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=feature_names,
                          index=[f'Doc{i+1}({labels[i][:4]})' for i in range(len(corpus))])
print(f'TF-IDF Matrix shape: {X_tfidf.shape}')
print('\nTF-IDF Matrix (first 6 docs, top 15 features):')
print(df_tfidf.iloc[:6, :15].round(3).to_string())

# Top terms per document (terms with zero weight are skipped)
print('\n=== Top 5 Terms per Document ===')
for i, doc_label in enumerate(labels[:6]):
    row = X_tfidf[i].toarray()[0]
    top_indices = row.argsort()[-5:][::-1]
    # float() keeps the printed tuples free of NumPy scalar reprs
    top_terms = [(feature_names[j], round(float(row[j]), 3)) for j in top_indices if row[j] > 0]
    print(f'  Doc{i+1} ({doc_label}): {top_terms}')

# Visualize TF-IDF heatmap
plt.figure(figsize=(14, 5))
plt.imshow(df_tfidf.iloc[:, :20].values, cmap='YlOrRd', aspect='auto')
plt.colorbar(label='TF-IDF Score')
plt.xticks(range(20), df_tfidf.columns[:20], rotation=45, ha='right', fontsize=8)
plt.yticks(range(len(corpus)), df_tfidf.index, fontsize=8)
plt.title('TF-IDF Matrix Heatmap')
plt.tight_layout()
plt.show()

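## 3. Word2Vec with Gensim

Word2Vec learns vector representations of words from the contexts in which they appear.
Words with similar meanings end up with vectors that lie close together in the embedding space.
With a corpus as small as the one used here, the learned similarities are illustrative rather than reliable.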
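
"Close together" is usually measured with cosine similarity, which is also what Gensim's `most_similar` ranks by. The sketch below shows the computation on three hand-picked 3-dimensional vectors; the vectors and the helper name `cosine_similarity` are assumptions for illustration, not embeddings produced by a trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration
embeddings = {
    "spark":  np.array([0.9, 0.1, 0.0]),
    "hadoop": np.array([0.8, 0.2, 0.1]),
    "keras":  np.array([0.1, 0.9, 0.2]),
}

query = "spark"
ranked = sorted(
    ((word, cosine_similarity(embeddings[query], vec))
     for word, vec in embeddings.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

Because cosine similarity ignores vector length, two words can be near neighbors even when one occurs far more often than the other.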

In [None]:
# Training corpus for Word2Vec
training_sentences = [
    ['machine', 'learning', 'artificial', 'intelligence', 'data', 'science'],
    ['deep', 'learning', 'neural', 'network', 'training', 'model'],
    ['python', 'programming', 'code', 'algorithm', 'software', 'developer'],
    ['data', 'analysis', 'statistics', 'visualization', 'pandas', 'numpy'],
    ['spark', 'hadoop', 'big', 'data', 'distributed', 'processing', 'cluster'],
    ['natural', 'language', 'processing', 'text', 'analysis', 'nlp', 'sentiment'],
    ['classification', 'regression', 'clustering', 'model', 'prediction', 'accuracy'],
    ['tensorflow', 'keras', 'pytorch', 'deep', 'learning', 'neural', 'network'],
    ['feature', 'engineering', 'preprocessing', 'normalization', 'encoding'],
    ['random', 'forest', 'gradient', 'boosting', 'ensemble', 'learning', 'xgboost'],
    ['word', 'embedding', 'vector', 'representation', 'semantic', 'similarity'],
    ['training', 'validation', 'testing', 'overfitting', 'regularization', 'dropout'],
    ['database', 'sql', 'nosql', 'mongodb', 'cassandra', 'storage', 'query'],
    ['cloud', 'aws', 'azure', 'gcp', 'computing', 'storage', 'scalable'],
]

# Train Word2Vec
w2v_model = Word2Vec(
    sentences=training_sentences,
    vector_size=50,
    window=3,
    min_count=1,
    workers=1,  # a single worker thread is required for seed to give repeatable results
    sg=0,       # CBOW
    epochs=100,
    seed=42
)

print(f'Vocabulary size: {len(w2v_model.wv)}')
print(f'Vector dimension: {w2v_model.vector_size}')

# Semantic similarity
print('\n=== Word Similarity ===')
test_words = ['learning', 'data', 'neural']
for word in test_words:
    if word in w2v_model.wv:
        similar = w2v_model.wv.most_similar(word, topn=4)
        print(f'  Similar to "{word}": {[(w, round(s,3)) for w,s in similar]}')

# PCA visualization of word vectors
words_to_plot = ['machine', 'learning', 'neural', 'network', 'data', 'analysis',
                 'python', 'spark', 'hadoop', 'cloud', 'model', 'training']
words_in_vocab = [w for w in words_to_plot if w in w2v_model.wv]
vectors = np.array([w2v_model.wv[w] for w in words_in_vocab])

pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(vectors)

plt.figure(figsize=(10, 6))
plt.scatter(coords[:, 0], coords[:, 1], s=100, c='steelblue', zorder=2)
for i, word in enumerate(words_in_vocab):
    plt.annotate(word, (coords[i, 0] + 0.01, coords[i, 1] + 0.01), fontsize=11)
plt.title('Word2Vec Embeddings Visualized with PCA')
plt.xlabel('PC1'); plt.ylabel('PC2')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

## 4. Sentiment Analysis

Two approaches: (1) VADER, a lexicon-based scorer that needs no training data, and
(2) TF-IDF + Logistic Regression for supervised classification.
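
As a rough intuition for the lexicon-based approach, the sketch below sums word valences from a tiny made-up lexicon, squashes the sum into (-1, 1), and applies the ±0.05 cut-offs used in the cell that follows. Everything here is a simplified assumption: `toy_lexicon` is not the VADER lexicon, the normalization constant `alpha=15` reflects our understanding of VADER's implementation, and real VADER additionally handles negation, intensifiers, punctuation, and capitalization.

```python
import math

def normalize(raw_score, alpha=15):
    """Squash an unbounded sum of word valences into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

def compound_to_label(compound, threshold=0.05):
    """Map a compound score to a three-way label using symmetric cut-offs."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Hypothetical valence values for illustration only, not the real VADER lexicon
toy_lexicon = {'great': 3.0, 'good': 2.0, 'bad': -2.5, 'terrible': -3.0}

def toy_sentiment(text):
    """Score a text by summing the valences of the words found in the toy lexicon."""
    raw = sum(toy_lexicon.get(tok, 0.0) for tok in text.lower().split())
    compound = normalize(raw)
    return round(compound, 3), compound_to_label(compound)

for sentence in ['great product good price', 'terrible and bad', 'it arrived on monday']:
    print(sentence, '->', toy_sentiment(sentence))
```

A sentence with no lexicon words scores exactly 0 and falls in the neutral band, which is one reason lexicon methods tend to over-predict "neutral" on domain-specific vocabulary.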

In [None]:
# Sample reviews dataset
reviews = [
    ("This product is absolutely amazing! Best purchase I've made this year.", 'positive'),
    ("Terrible quality. Broke after 2 days. Complete waste of money.", 'negative'),
    ("Decent product, nothing special. Works as expected.", 'neutral'),
    ("Fantastic customer service and fast delivery. Very happy!", 'positive'),
    ("Disappointed with the quality. Expected much better for the price.", 'negative'),
    ("Average product. Does the job but nothing exceptional.", 'neutral'),
    ("Excellent build quality and great performance. Highly recommended!", 'positive'),
    ("Worst product ever. Total garbage. Do not buy!", 'negative'),
    ("It's okay. Not great, not bad. Does what it says on the box.", 'neutral'),
    ("Love it! Exceeded my expectations. Will buy again.", 'positive'),
    ("Poor quality materials. Very flimsy and feels cheap.", 'negative'),
    ("Reasonable product for the price. Good value overall.", 'positive'),
    ("Not worth the money. Several issues from the start.", 'negative'),
    ("Pretty good. Works well and looks nice. Happy with the purchase.", 'positive'),
    ("Mediocre at best. The product exists and that's about all.", 'neutral'),
]

texts, labels_sent = zip(*reviews)

# --- VADER Lexicon-based Sentiment ---
sia = SentimentIntensityAnalyzer()
print('=== VADER Lexicon-based Sentiment Analysis ===')
vader_results = []
for text, true_label in reviews:
    scores = sia.polarity_scores(text)
    compound = scores['compound']
    pred = 'positive' if compound >= 0.05 else ('negative' if compound <= -0.05 else 'neutral')
    vader_results.append({'text': text[:50]+'...', 'true': true_label, 'pred': pred,
                           'compound': round(compound, 3), 'match': pred == true_label})

df_vader = pd.DataFrame(vader_results)
print(df_vader[['true', 'pred', 'compound', 'match']].to_string())
print(f'\nVADER Accuracy: {df_vader["match"].mean():.2%}')

# --- ML-based: TF-IDF + Logistic Regression ---
print('\n=== ML-based Sentiment (TF-IDF + Logistic Regression) ===')
le = LabelEncoder()
y_encoded = le.fit_transform(labels_sent)

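# Stratify so that all three classes appear in both the train and test splits
X_train, X_test, y_train, y_test, t_train, t_test = train_test_split(
    list(texts), y_encoded, list(labels_sent), test_size=0.33, random_state=42,
    stratify=y_encoded
)

tfidf_sent = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X_train_tfidf = tfidf_sent.fit_transform(X_train)   # fit the vocabulary on training text only
X_test_tfidf  = tfidf_sent.transform(X_test)

lr = LogisticRegression(max_iter=500, random_state=42)
lr.fit(X_train_tfidf, y_train)
y_pred = lr.predict(X_test_tfidf)

# Pass labels explicitly so target_names always lines up with the class indices.
# With only 5 test reviews these scores are illustrative, not a reliable estimate.
print(classification_report(y_test, y_pred, labels=np.arange(len(le.classes_)),
                            target_names=le.classes_, zero_division=0))
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2%}')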

## 5. Topic Modeling with LDA

Latent Dirichlet Allocation (LDA) discovers hidden topics in a collection of documents.
Each document is represented as a mixture of topics, and each topic as a distribution over words.
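
The two distributions can be illustrated without fitting a model. The sketch below starts from a made-up topic-word weight matrix (standing in for something like scikit-learn's unnormalized `components_`), normalizes each row into a probability distribution, lists the top words per topic, and then mixes the topics for one hypothetical document. All numbers and names are assumptions for illustration.

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights (2 topics x 5 words)
vocab = np.array(['data', 'model', 'team', 'match', 'score'])
components = np.array([
    [8.0, 6.0, 0.5, 0.5, 1.0],   # topic 0
    [0.5, 1.0, 7.0, 5.0, 6.5],   # topic 1
])

# Normalize each row so it sums to 1: a probability distribution over words
topic_word = components / components.sum(axis=1, keepdims=True)

top_k = 3
for topic_idx, dist in enumerate(topic_word):
    top = dist.argsort()[::-1][:top_k]        # indices of the k largest probabilities
    top_pairs = [(str(vocab[j]), round(float(dist[j]), 3)) for j in top]
    print(f'Topic {topic_idx}: {top_pairs}')

# A hypothetical document that is 70% topic 0 and 30% topic 1
doc_topic = np.array([0.7, 0.3])
word_probs = doc_topic @ topic_word            # mixture distribution over words
print('Most likely word:', vocab[word_probs.argmax()])
```

The document's word distribution is a weighted average of the topic distributions, which is why a document dominated by one topic mostly uses that topic's top words.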

In [None]:
# Run LDA on the corpus (12 documents, 4 true topics)
count_vec = CountVectorizer(stop_words='english', max_features=50, min_df=1)
X_counts = count_vec.fit_transform(corpus)
count_feature_names = count_vec.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=4, random_state=42, max_iter=50,
                                  learning_method='batch')
lda.fit(X_counts)

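print('=== LDA Topic Modeling: Top Words per Topic ===')
# LDA topic indices are arbitrary, so name each topic after the most common
# true label among the documents it dominates instead of hard-coding an order.
dominant_topic = lda.transform(X_counts).argmax(axis=1)
for topic_idx, topic in enumerate(lda.components_):
    top_word_indices = topic.argsort()[:-11:-1]  # 10 highest-weight words
    top_words = [count_feature_names[i] for i in top_word_indices]
    doc_labels = [labels[d] for d in np.where(dominant_topic == topic_idx)[0]]
    guess = Counter(doc_labels).most_common(1)[0][0] if doc_labels else 'unassigned'
    print(f'\nTopic {topic_idx} (majority label: {guess}):')
    print(f'  Top words: {top_words}')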

# Document-topic distribution
doc_topics = lda.transform(X_counts)
print('\n=== Document-Topic Distribution (probability) ===')
df_doc_topics = pd.DataFrame(
    doc_topics.round(3),
    columns=[f'Topic{i}' for i in range(4)],
    index=[f'Doc{i+1}({labels[i][:4]})' for i in range(len(corpus))]
)
print(df_doc_topics.to_string())

# Visualize topic distribution
fig, ax = plt.subplots(figsize=(10, 5))
im = ax.imshow(doc_topics, cmap='Blues', aspect='auto')
plt.colorbar(im, label='Topic Probability')
ax.set_xticks(range(4))
ax.set_xticklabels([f'Topic {i}' for i in range(4)])
ax.set_yticks(range(len(corpus)))
ax.set_yticklabels([f'Doc{i+1}({labels[i][:4]})' for i in range(len(corpus))], fontsize=8)
ax.set_title('LDA: Document-Topic Distribution')
plt.tight_layout()
plt.show()

## 6. Named Entity Recognition (NER)

NER identifies and classifies named entities in text, such as people (PERSON),
organizations (ORGANIZATION), geopolitical entities (GPE), and other locations (LOCATION).
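
NER output is often exchanged as per-token IOB tags (`B-` begins an entity, `I-` continues it, `O` is outside any entity); NLTK can produce this form from a chunk tree with `nltk.chunk.tree2conlltags`. The sketch below groups such tags back into entity spans. The tags are written by hand for illustration and are not the output of a real tagger, and `group_iob` is a hypothetical helper name.

```python
def group_iob(tagged_tokens):
    """Collect (entity_text, label) pairs from (token, iob_tag) pairs."""
    entities, current_tokens, current_label = [], [], None
    for token, tag in tagged_tokens:
        if tag.startswith('B-') or (tag.startswith('I-') and tag[2:] != current_label):
            # A new entity starts here; close the previous one first
            if current_tokens:
                entities.append((' '.join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith('I-'):
            # Continuation of the entity that is currently open
            current_tokens.append(token)
        else:
            # 'O': outside any entity, so close whatever was open
            if current_tokens:
                entities.append((' '.join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((' '.join(current_tokens), current_label))
    return entities

# Hand-written tags for illustration; not the output of a real tagger
tagged = [('Elon', 'B-PERSON'), ('Musk', 'I-PERSON'), ('founded', 'O'),
          ('Tesla', 'B-ORGANIZATION'), ('in', 'O'), ('California', 'B-GPE'), ('.', 'O')]
print(group_iob(tagged))
# → [('Elon Musk', 'PERSON'), ('Tesla', 'ORGANIZATION'), ('California', 'GPE')]
```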

In [None]:
ner_texts = [
    "Elon Musk founded Tesla and SpaceX in California.",
    "Google and Microsoft are competing in the artificial intelligence market.",
    "The United Nations conference in New York addressed climate change issues.",
    "Barack Obama served as the 44th President of the United States from 2009 to 2017.",
    "Apple's headquarters is located in Cupertino, California, near San Francisco Bay."
]

def extract_named_entities(text):
    """Extract named entities using NLTK's ne_chunk."""
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    tree = ne_chunk(pos_tags, binary=False)
    entities = {'PERSON': [], 'ORGANIZATION': [], 'GPE': [], 'LOCATION': [], 'OTHER': []}
    for subtree in tree:
        if isinstance(subtree, Tree):
            entity_label = subtree.label()
            entity_text  = ' '.join([token for token, _ in subtree.leaves()])
            if entity_label in entities:
                entities[entity_label].append(entity_text)
            else:
                entities['OTHER'].append(f'{entity_label}: {entity_text}')
    return entities

print('=== Named Entity Recognition Results ===')
all_entities = []
for i, text in enumerate(ner_texts):
    entities = extract_named_entities(text)
    print(f'\nText {i+1}: "{text}"')
    for ent_type, ents in entities.items():
        if ents:
            print(f'  {ent_type}: {ents}')
    all_entities.append(entities)

# Aggregate entity counts
from collections import defaultdict
entity_counts = defaultdict(int)
for ents in all_entities:
    for ent_type, ent_list in ents.items():
        entity_counts[ent_type] += len(ent_list)

print('\n=== Entity Type Distribution ===')
for ent_type, count in sorted(entity_counts.items(), key=lambda x: -x[1]):
    if count > 0:
        print(f'  {ent_type}: {count}')

# Bar chart
if any(v > 0 for v in entity_counts.values()):
    filtered = {k: v for k, v in entity_counts.items() if v > 0}
    plt.figure(figsize=(8, 4))
    plt.bar(filtered.keys(), filtered.values(), color='steelblue')
    plt.title('Named Entity Type Distribution (5 sample texts)')
    plt.xlabel('Entity Type'); plt.ylabel('Count')
    plt.tight_layout()
    plt.show()

## Lab Assignments

Complete the following tasks:

1. **Task 1 (Indonesian Preprocessing)**: Collect 10 sentences in Indonesian from online
   news. Implement preprocessing (tokenization, and stopword removal using the
   `PySastrawi` library or a manual stopword list). Compare the results with the English
   preprocessing.

2. **Task 2 (TF-IDF vs BoW)**: Compare the TF-IDF representation with Bag-of-Words (CountVectorizer)
   on the existing corpus. Using cosine similarity, find the most similar document
   for every document in the corpus. Do the results differ between TF-IDF and BoW?

3. **Task 3 (Word2Vec Skip-gram)**: Retrain the Word2Vec model with the **Skip-gram** architecture
   (`sg=1`) and compare the top-5 most similar words for `learning`, `data`, and `model`
   against the CBOW model. What differences do you observe?

4. **Task 4 (Aspect-based Sentiment)**: Build a dataset of 20 electronics product reviews with
   sentiment labels (positive/negative). Implement a simple aspect-based sentiment analysis:
   identify the sentiment toward the aspects 'price', 'quality', and 'delivery'. Use a
   keyword-based approach together with VADER.

5. **Task 5 (LDA Hyperparameter Tuning)**: Experiment with LDA using `n_components`
   = 2, 3, 4, 5, 6 on the existing corpus. For each configuration, compute the **perplexity** and
   **log-likelihood** (`lda.perplexity(X_counts)`, `lda.score(X_counts)`). Plot the results and decide on the
   optimal number of topics. Interpret the discovered topics in narrative form.
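
As a possible starting point for Task 2, the sketch below finds each row's nearest neighbor by cosine similarity in a small made-up count matrix. The matrix and the helper name `most_similar_rows` are assumptions for illustration; for the lab, the same function could be applied to the dense output of either vectorizer (for example `X_tfidf.toarray()`).

```python
import numpy as np

def most_similar_rows(matrix):
    """For each row, return the index of the most similar *other* row (cosine)."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.where(norms == 0, 1.0, norms)   # avoid division by zero
    sim = unit @ unit.T                                # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-matches
    return sim.argmax(axis=1), sim

# Hypothetical document-term counts: 4 documents x 3 terms
counts = np.array([
    [2, 0, 1],
    [4, 0, 2],
    [0, 3, 0],
    [0, 1, 1],
])
nearest, sim = most_similar_rows(counts)
for doc_idx, neighbor in enumerate(nearest):
    print(f'Doc{doc_idx + 1} is most similar to Doc{neighbor + 1}')
```

Documents 1 and 2 have proportional counts, so their cosine similarity is 1 even though one is twice as long; this length-invariance is worth keeping in mind when comparing the BoW and TF-IDF results.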