<a href="https://colab.research.google.com/github/junya2025/text-retrieval-and-mining/blob/main/tutorial_II_applications_of_text_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial II: Applications of Text Analysis

**Duration:** 1.5 hours

**Prerequisites:** Basic Python knowledge, familiarity with pandas and numpy

**Learning Objectives**
* Implement document classification
* Create document clustering solutions
* Build sentiment analysis models
* Perform Named Entity Recognition (NER)

In [1]:
## Setup
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from textblob import TextBlob
import spacy

## Exercise 1: Document Classification (25 minutes)

In this exercise, you'll implement a document classifier using TF-IDF and Naive Bayes.

In [2]:
def create_document_classifier():
    """
    Create a pipeline for document classification using TF-IDF and Naive Bayes

    Returns:
        Pipeline: sklearn pipeline for document classification
    """
    # Chain the two steps so raw text goes in and a predicted label comes out:
    # 1. TfidfVectorizer converts text to numerical features
    # 2. MultinomialNB classifies those features (a Naive Bayes variant suited to count-like features)
    return Pipeline([
        ('vectorizer', TfidfVectorizer()),
        ('classifier', MultinomialNB())
    ])

# Example documents and labels
documents = [
    "The stock market saw significant gains today",
    "Scientists discover new species in Amazon",
    "Team wins championship game in overtime",
    "New economic policy impacts global markets"
]
labels = ['finance', 'science', 'sports', 'finance']

# Create the classifier
classifier = create_document_classifier()

# Train the model
classifier.fit(documents, labels)

# Test your classifier
new_docs = [
    "Investors worried about market trends",
    "Researchers study rare wildlife behavior"
]
# TODO: Make predictions on new documents
predictions = classifier.predict(new_docs)

print("Predicted categories:", predictions)

Predicted categories: ['finance' 'finance']
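
The second prediction above is `finance` even though the text reads like a science headline. One way to see why is to look at the class probabilities rather than only the predicted label. None of the words in that document occur in the training vocabulary, so Naive Bayes has nothing to go on except the class prior, and `finance` accounts for two of the four training labels. The following is a minimal sketch that reuses the training data from this exercise; the names `clf` and `proba` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

documents = [
    "The stock market saw significant gains today",
    "Scientists discover new species in Amazon",
    "Team wins championship game in overtime",
    "New economic policy impacts global markets",
]
labels = ['finance', 'science', 'sports', 'finance']
new_docs = [
    "Investors worried about market trends",
    "Researchers study rare wildlife behavior",
]

clf = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB()),
])
clf.fit(documents, labels)

# predict_proba returns one row per document, one column per class in clf.classes_
proba = clf.predict_proba(new_docs)
for doc, row in zip(new_docs, proba):
    print(doc)
    for cls, p in zip(clf.classes_, row):
        print(f"  {cls}: {p:.2f}")
```

With only four training documents, the probabilities say more about the training set than about the new text, which is a useful reminder to check vocabulary coverage before trusting a prediction.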


## Exercise 2: Document Clustering (25 minutes)

Implement unsupervised document clustering using K-means.

In [4]:
def cluster_documents(documents, n_clusters=2):
    """
    Cluster documents using TF-IDF and K-means

    Args:
        documents (list): List of text documents
        n_clusters (int): Number of clusters

    Returns:
        tuple: (cluster_assignments, top_terms_per_cluster)
    """
    # TODO: Create TF-IDF vectors
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(documents)

    # TODO: Perform K-means clustering
    kmeans = KMeans(n_clusters = n_clusters, random_state = 42)
    kmeans.fit(X)

    # Get top terms for each cluster from the cluster centers and feature names
    cluster_assignments = kmeans.labels_  # Cluster index assigned to each document
    feature_names = vectorizer.get_feature_names_out()

    # Sort feature indices by descending centroid weight (computed once, not per cluster)
    order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]

    top_terms_per_cluster = {}
    for i in range(n_clusters):
        # Keep the 10 highest-weighted terms for this cluster
        top_terms_per_cluster[i] = [feature_names[ind] for ind in order_centroids[i, :10]]

    return cluster_assignments, top_terms_per_cluster

# Test documents
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks for AI tasks",
    "Python programming is essential for data science",
    "Data analysis requires statistical knowledge",
    "Neural networks are inspired by biological brains",
    "Statistics and probability are foundational for ML"
]

# TODO: Cluster documents and print results
clusters, top_terms = cluster_documents(documents, n_clusters=2)
for cluster, terms in top_terms.items():
    print(f"Cluster {cluster}: {', '.join(terms)}")

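# Note: stop words such as "is", "are", "for", "by", and "and" rank among the top terms.
# Passing stop_words='english' to TfidfVectorizer would filter them out.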

Cluster 0: data, is, knowledge, analysis, statistical, requires, programming, essential, python, science
Cluster 1: are, neural, networks, for, biological, brains, by, inspired, and, statistics


## Exercise 3: Sentiment Analysis (20 minutes)

Implement both dictionary-based and ML-based sentiment analysis.

In [5]:
def analyze_sentiment_dictionary(text):
    """
    Analyze sentiment using TextBlob's default lexicon-based polarity score

    Args:
        text (str): Input text

    Returns:
        str: 'Positive', 'Negative', or 'Neutral'
    """
    # TODO: Use TextBlob to analyze sentiment
    # Hint: Look at the sentiment.polarity value
    analysis = TextBlob(text)
    polarity = analysis.sentiment.polarity
    if polarity > 0:
        return 'Positive'
    elif polarity < 0:
        return 'Negative'
    else:
        return 'Neutral'

from sklearn.linear_model import LogisticRegression

def create_ml_sentiment_analyzer(texts, labels):
    """
    Create and train ML-based sentiment analyzer

    Args:
        texts (list): Training texts
        labels (list): Sentiment labels

    Returns:
        Pipeline: Trained sentiment analyzer
    """
    # TODO: Create and train a pipeline for sentiment analysis
    # Hint: Similar to document classifier
    sentiment_analyzer = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', LogisticRegression(max_iter=1000))
    ])

    # Train the pipeline on the provided data
    sentiment_analyzer.fit(texts, labels)

    return sentiment_analyzer

# Test data
reviews = [
    "This product exceeded my expectations!",
    "Terrible customer service, would not recommend.",
    "The item was okay, nothing special.",
]

# TODO: Test dictionary-based approach
for review in reviews:
    sentiment = analyze_sentiment_dictionary(review)
    print(f"Text: {review}")
    print(f"Sentiment: {sentiment}\n")

# TODO: Test ML-based approach
train_texts = [
    "great product", "terrible service", "okay experience"
]
train_labels = [1, -1, 0]  # 1=positive, -1=negative, 0=neutral

classifier = create_ml_sentiment_analyzer(train_texts, train_labels)

predictions = classifier.predict(reviews)

print("Predictions:", predictions)

Text: This product exceeded my expectations!
Sentiment: Neutral

Text: Terrible customer service, would not recommend.
Sentiment: Negative

Text: The item was okay, nothing special.
Sentiment: Positive

Predictions: [ 1 -1  0]
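
The dictionary-based function above treats any nonzero polarity as positive or negative, so even a very weak score is never labeled neutral. A common variation is to add a neutral band around zero. The sketch below assumes a hypothetical helper `label_polarity` and an arbitrary threshold of 0.1. It takes the polarity as a plain float, so it can be fed `TextBlob(text).sentiment.polarity`. Whether it would change the labels printed above depends on the actual scores, which are not shown here.

```python
def label_polarity(polarity, threshold=0.1):
    """
    Map a polarity score in [-1, 1] to a sentiment label.

    Scores whose absolute value is below `threshold` are treated as Neutral.
    The default of 0.1 is an illustrative assumption, not a recommended value.
    """
    if polarity >= threshold:
        return 'Positive'
    if polarity <= -threshold:
        return 'Negative'
    return 'Neutral'


# Example scores chosen by hand to show each branch
for score in (0.6, 0.05, -0.02, -0.8):
    print(f"{score:+.2f} -> {label_polarity(score)}")
```

In practice the threshold would be tuned on labeled examples from the target domain.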


## Exercise 4: Named Entity Recognition (20 minutes)

Implement Named Entity Recognition (NER) using spaCy.

In [6]:
def perform_ner(text):
    """
    Perform Named Entity Recognition using spaCy

    Args:
        text (str): Input text

    Returns:
        list: List of (entity_text, entity_label) tuples
    """
    # Load the small English spaCy model
    # (if it is missing, install it with: python -m spacy download en_core_web_sm)
    nlp = spacy.load("en_core_web_sm")

    # TODO: Process text and extract entities
    # Hint: Look at doc.ents
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        entities.append((ent.text, ent.label_))

    return entities

# Test text
text = "When Elon Musk founded SpaceX in 2002, Tesla was already operating in California."

# TODO: Extract and print entities
entities = perform_ner(text)
print(entities)

[('Elon Musk', 'PERSON'), ('2002', 'DATE'), ('Tesla', 'ORG'), ('California', 'GPE')]
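
The output above does not include SpaceX, which shows that a general-purpose statistical model can miss names it has not learned. One hedged way to cover such cases is a rule-based component with a list of known names. The sketch below uses spaCy's `entity_ruler` on a blank English pipeline, so it does not need the `en_core_web_sm` download. The pattern list is an illustrative assumption, not a curated gazetteer.

```python
import spacy

# A blank pipeline has a tokenizer only; the entity ruler is the sole source of entities here
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical pattern list: exact-match phrases mapped to entity labels
ruler.add_patterns([
    {"label": "ORG", "pattern": "SpaceX"},
    {"label": "ORG", "pattern": "Tesla"},
])

text = "When Elon Musk founded SpaceX in 2002, Tesla was already operating in California."
doc = nlp(text)
rule_entities = [(ent.text, ent.label_) for ent in doc.ents]
print(rule_entities)
```

To combine rules with a trained model, the same component can be added to a loaded pipeline with `nlp.add_pipe("entity_ruler", before="ner")`, so that rule matches are set before the statistical recognizer runs.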


## Discussion Questions - Food for thought
1. What are the advantages and limitations of TF-IDF for text classification?
   * Advantage: converts text into numerical vectors that a model can work with
   * Advantage: can handle high-dimensional data
   * Limitation: sensitive to stop words unless they are filtered out
   * Limitation: does not capture word order
2. How do you determine the optimal number of clusters?
   * Silhouette score, elbow method
3. What are the challenges in sentiment analysis?
   * Sarcasm detection
4. How can NER be improved for domain-specific applications?
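
For question 2, the silhouette approach can be sketched as follows: fit K-means for several values of k and compare the mean silhouette score, where higher values indicate tighter, better-separated clusters. This is a minimal sketch on the six documents from Exercise 2. The range of k, the stop-word filtering, and `n_init=10` are assumptions made for illustration, and on a corpus this small the scores are only indicative.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

documents = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks for AI tasks",
    "Python programming is essential for data science",
    "Data analysis requires statistical knowledge",
    "Neural networks are inspired by biological brains",
    "Statistics and probability are foundational for ML",
]

X = TfidfVectorizer(stop_words='english').fit_transform(documents)

# Silhouette needs at least 2 clusters and fewer clusters than samples
scores = {}
for k in range(2, 5):
    cluster_labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, cluster_labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

The elbow method follows the same loop but records `inertia_` from each fitted model and looks for the k where the decrease levels off.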