## **#02: Text Parsing, Filtering, and Quantitative Representation**
- Instructor: [Jaeung Sim](https://jaeungs.github.io/) (University of Connecticut)
- Course: OPIM 5671: Data Mining and Time Series Forecasting
- Last updated: September 17, 2025

**Objectives**
1. Explore real-world text data using the `nltk` library.
1. Understand the text processing procedure.

**Contents**
* Part 1: Understanding the Data
* Part 2: Text Parsing and Filtering
* Part 3: Singular Value Decomposition

**References**
* [YouTube Comments Dataset at Kaggle (Data Source)](https://www.kaggle.com/datasets/atifaliak/youtube-comments-dataset/data)
* [Natural Language Toolkit (NLTK)](https://www.nltk.org/)

### **Part 1: Understanding the Data**

**Introduction to the Dataset**
* **Source:** YouTube Comments Dataset at Kaggle (<https://www.kaggle.com/datasets/atifaliak/youtube-comments-dataset>)
* **About this file**
  * Introducing the `Youtube Comments Dataset.csv`, a fully cleaned and preprocessed collection of YouTube video comments. This dataset is ideal for sentiment analysis, natural language processing, and text-based machine learning projects. With all irrelevant data already removed and cleaning steps thoroughly performed, it provides clean, structured information, allowing you to focus solely on insights and analysis.

**Download the data with Python**

In [None]:
import numpy as np
import pandas as pd
import kagglehub
import os

In [None]:
# Download latest version
path = kagglehub.dataset_download("atifaliak/youtube-comments-dataset")

# Print the dataset path
print("Path to dataset files:", path)

**Load the data into a DataFrame**

In [None]:
# List files in the downloaded dataset directory
files = os.listdir(path)
print("Files in dataset:", files)

# Load the CSV file (assuming there's only one CSV file)
csv_file = [f for f in files if f.endswith('.csv')][0]  # Get the first CSV file
csv_path = os.path.join(path, csv_file)
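
The cell above assumes the download directory contains a CSV and silently takes the first one. A slightly more defensive variant is sketched below. The helper name `find_single_csv` and the throwaway directory are hypothetical, used only so the example runs without the Kaggle download.

```python
import tempfile
from pathlib import Path

def find_single_csv(directory):
    """Return the only CSV file in `directory`; raise if there are none or several."""
    csv_files = sorted(Path(directory).glob("*.csv"))
    if len(csv_files) != 1:
        raise FileNotFoundError(
            f"Expected exactly one CSV in {directory}, found {len(csv_files)}"
        )
    return csv_files[0]

# Demonstration on a temporary directory (a stand-in for the kagglehub download path)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "comments.csv").write_text("Comment\nnice video\n")  # made-up content
    (Path(tmp) / "README.txt").write_text("not a csv")
    found_name = find_single_csv(tmp).name
    print(found_name)
```

In the notebook, the same helper could be called as `csv_path = find_single_csv(path)`, which fails loudly if the dataset layout ever changes.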

In [None]:
# Read into DataFrame
df = pd.read_csv(csv_path)

# Display the first few rows
df.head()

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# Drop rows whose 'Comment' value is missing
df = df.dropna(subset=['Comment'])
df.isnull().sum()

In [None]:
# Check data types
df.info()

### **Part 2: Text Parsing and Filtering**

**Import libraries and download necessary NLTK resources**

In [None]:
import pandas as pd
import nltk
import string
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [None]:
# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

#### **Step 1: Text Parsing (Tokenization)**

* Tokenizes the text into words using `word_tokenize()`.



**Tokenization**

In [None]:
# Cast comments to strings so that word_tokenize() always receives text
df['Comment'] = df['Comment'].astype(str)

In [None]:
# Tokenize the comment column
def tokenize_text(text):
    return word_tokenize(text)

df['tokenized'] = df['Comment'].apply(tokenize_text)

In [None]:
# Explore the tokenized column
df['tokenized']

**Zipf's Law: Explore the distribution before filtering**

In [None]:
import matplotlib.pyplot as plt
from collections import Counter
from itertools import chain

# Flatten the list of tokenized words
all_tokens = list(chain.from_iterable(df['tokenized']))

# Count word frequencies
word_freq = Counter(all_tokens)

# Get the top 100 most common words
top_100_words = word_freq.most_common(100)

# Separate words and their frequencies for plotting
words, frequencies = zip(*top_100_words)

# Plot the frequency distribution
plt.figure(figsize=(15, 6))
plt.bar(words, frequencies)
plt.xticks(rotation=90)
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Top 100 Most Frequent Terms in df["tokenized"]')
plt.show()
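
Zipf's law predicts that a term's frequency is roughly inversely proportional to its rank, so rank × frequency stays approximately constant and the rank-frequency curve is close to a straight line with slope near -1 on log-log axes. The sketch below runs that check on a tiny made-up token list. `toy_tokens` and `rank_frequency` are hypothetical names for illustration, and on the real data `all_tokens` could be passed instead.

```python
from collections import Counter
import math

# Hypothetical toy token list (not the YouTube data); swap in `all_tokens` for the real corpus
toy_tokens = ("the " * 12 + "video " * 6 + "great " * 4 + "song " * 3 + "love " * 2).split()

def rank_frequency(tokens):
    """Return (rank, term, count) tuples sorted from most to least frequent."""
    counts = Counter(tokens).most_common()
    return [(rank, term, count) for rank, (term, count) in enumerate(counts, start=1)]

table = rank_frequency(toy_tokens)
for rank, term, count in table:
    # Under an ideal Zipf distribution, rank * count is roughly constant
    print(f"{rank:>2}  {term:<6} count={count:<3} rank*count={rank * count}")

# Least-squares slope of log(count) against log(rank)
xs = [math.log(r) for r, _, _ in table]
ys = [math.log(c) for _, _, c in table]
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
print(f"log-log slope: {slope:.2f}")
```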

#### **Step 2: Text Filtering**

* Converts text to lowercase.
* Removes punctuation and special characters.
* Tokenizes the text into words.
* Removes stopwords (like "the", "is", "and").
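* Applies lemmatization (converts words to their base form, e.g., "cars" → "car"). Note that `WordNetLemmatizer.lemmatize()` treats every word as a noun by default, so "running" becomes "run" only when called with `pos='v'`.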






**Basic pre-processing**

In [None]:
# Define a stopword set
stop_words = set(stopwords.words('english'))

In [None]:
# Instantiate the lemmatizer
lemmatizer = WordNetLemmatizer()

In [None]:
# Define a text pre-processing function
def preprocess_text(text):
    # Lowercasing
    text = text.lower()

    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenization
    tokens = word_tokenize(text)

    # Remove stopwords & perform lemmatization
    filtered_tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return ' '.join(filtered_tokens)
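
One side effect of the order of operations in this function is worth knowing: punctuation is stripped before stop words are removed, so a contraction such as "don't" becomes "dont" and may no longer match a stop list that stores it with the apostrophe. The sketch below shows the effect using only the standard library. `toy_stop_words` is a small hypothetical stop list standing in for NLTK's, and whitespace splitting stands in for `word_tokenize()`.

```python
import re

# Hypothetical mini stop list (an assumption for illustration; the notebook uses NLTK's English list)
toy_stop_words = {"i", "this", "don't", "the", "is"}

def strip_then_filter(text):
    """Mirror the notebook's order: remove punctuation first, then drop stop words."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return [w for w in text.split() if w not in toy_stop_words]

def filter_then_strip(text):
    """Alternative order: drop stop words first, then remove punctuation."""
    kept = [w for w in text.lower().split() if w not in toy_stop_words]
    cleaned = [re.sub(r"[^\w\s]", "", w) for w in kept]
    return [w for w in cleaned if w]  # discard tokens that were pure punctuation

sample = "I don't like this video!"
print(strip_then_filter(sample))  # "dont" survives because the apostrophe is already gone
print(filter_then_strip(sample))
```

Whether the leftover "dont" matters depends on the task. For sentiment analysis, negations can carry signal, so keeping them may even be desirable.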

In [None]:
# Apply the pre-processing function
df['processed'] = df['Comment'].apply(preprocess_text)
df['processed']

**Check the distribution after basic pre-processing**

In [None]:
import matplotlib.pyplot as plt
from collections import Counter
from itertools import chain

# Flatten the list of tokenized words
all_tokens = list(chain.from_iterable(df['processed'].apply(tokenize_text)))

# Count word frequencies
word_freq = Counter(all_tokens)

# Get the top 100 most common words
top_100_words = word_freq.most_common(100)

# Separate words and their frequencies for plotting
words, frequencies = zip(*top_100_words)

# Plot the frequency distribution
plt.figure(figsize=(15, 6))
plt.bar(words, frequencies)
plt.xticks(rotation=90)
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Top 100 Most Frequent Terms in df["processed"]')
plt.show()

**Term Weighting (TF-IDF)**

* Converts the processed text into **TF-IDF vectors**, assigning importance weights to words.
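
To make the weighting concrete, the sketch below computes TF-IDF by hand for a three-document toy corpus. It uses the smoothed formula that scikit-learn's `TfidfVectorizer` applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The corpus `toy_docs` and the helper `tfidf_by_hand` are hypothetical names for illustration, and whitespace splitting is assumed as the tokenizer.

```python
import math
from collections import Counter

# Hypothetical toy corpus (already pre-processed, whitespace-separated)
toy_docs = ["great video", "great song great voice", "bad video"]

def tfidf_by_hand(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized rows)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        raw = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

rows = tfidf_by_hand(toy_docs)
for doc, weights in zip(toy_docs, rows):
    print(doc, {t: round(w, 3) for t, w in weights.items()})
```

In the third document, "bad" appears in only one document while "video" appears in two, so "bad" receives the larger weight even though both occur once.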

Let's see the number of unique tokens before applying TF-IDF.

In [None]:
# Ensure necessary libraries are imported
from itertools import chain

# Flatten the list of tokenized words and count unique tokens
unique_tokens = set(chain.from_iterable(df['processed'].apply(tokenize_text)))
num_unique_tokens = len(unique_tokens)

# Display the number of unique tokens
num_unique_tokens

Let's apply TF-IDF and vectorize the corpus.

In [None]:
# Vectorize the processed data with TF-IDF
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['processed'])

In [None]:
# The result is a sparse document-term matrix (rows = comments, columns = terms)
X_tfidf

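Because stop words were already removed during pre-processing, vectorizing shrinks the vocabulary only slightly, from 36,788 unique tokens to 36,659 terms. The remaining difference likely comes from `TfidfVectorizer` re-tokenizing the text with its default pattern, which drops single-character tokens.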
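
One likely contributor to the small drop in vocabulary size is that `TfidfVectorizer` re-tokenizes each string with its own default pattern, `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters. The sketch below applies that pattern with `re` to a made-up comment to show single-character tokens disappearing.

```python
import re

# Default token pattern used by scikit-learn's text vectorizers
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sample = "u r a legend 4 real"  # made-up processed comment
whitespace_tokens = sample.split()
pattern_tokens = re.findall(DEFAULT_TOKEN_PATTERN, sample)

print(whitespace_tokens)  # every whitespace-separated token
print(pattern_tokens)     # only tokens with at least two word characters
```

If single-character tokens should be kept, `TfidfVectorizer` accepts a custom `token_pattern` argument.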

### **Part 3: Singular Value Decomposition (SVD)**

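* Reduces dimensionality by projecting the TF-IDF matrix onto a small number of latent components (at most 100 here).
* The result is stored in `df_svd`, where each row is a comment and each column is a latent component that captures a dominant pattern of term co-occurrence.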
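
As a rough illustration of what truncation does, the sketch below runs a full SVD with NumPy on a small made-up matrix and keeps only the top `k` components. The matrix `X` and the choice `k = 2` are assumptions for illustration. `TruncatedSVD` computes a comparable projection on the sparse TF-IDF matrix without forming the full decomposition.

```python
import numpy as np

# Hypothetical 4-document x 5-term weight matrix (rows = documents, columns = terms)
X = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 0.8],
    [0.0, 0.1, 0.9, 1.0, 1.0],
])

# Full SVD: X = U @ diag(s) @ Vt, with singular values sorted in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent components to keep
X_reduced = U[:, :k] * s[:k]      # document coordinates in the k-dimensional latent space
X_approx = X_reduced @ Vt[:k, :]  # rank-k approximation of X

print("singular values:", np.round(s, 3))
print("reduced shape:", X_reduced.shape)
print("reconstruction error:", round(float(np.linalg.norm(X - X_approx)), 3))
```

The reconstruction error equals the square root of the sum of the squared singular values that were dropped, which is why discarding small singular values loses little information.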

In [None]:
# Determine the number of components to retain (e.g., 100 components)
n_components = min(100, X_tfidf.shape[1])  # Keep max 100 or total features

svd = TruncatedSVD(n_components=n_components, random_state=42)
X_svd = svd.fit_transform(X_tfidf)
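
The cap of 100 components is a convenient default rather than a tuned value. One common sanity check is the cumulative explained variance ratio of the fitted model. The sketch below shows the idea on a small made-up corpus (`toy_docs` and `n_components=3` are assumptions for illustration). On the real data, the same attribute is available on the fitted `svd` object.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for df['processed']
toy_docs = [
    "great video love song",
    "love song great voice",
    "bad audio quality",
    "audio quality terrible",
    "great voice love video",
]

X_toy = TfidfVectorizer().fit_transform(toy_docs)

toy_svd = TruncatedSVD(n_components=3, random_state=42)
reduced = toy_svd.fit_transform(X_toy)

# Share of total variance captured as components are added one by one
cumulative = np.cumsum(toy_svd.explained_variance_ratio_)
print("reduced shape:", reduced.shape)
print("cumulative explained variance:", np.round(cumulative, 3))
```

If the cumulative ratio is still low at the chosen number of components, increasing `n_components` may be worth trying, at the cost of a wider feature table.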

In [None]:
# Convert SVD results to a DataFrame, reusing df's index so rows stay aligned after dropna()
df_svd = pd.DataFrame(X_svd, index=df.index, columns=[f"feature_{i+1}" for i in range(n_components)])

In [None]:
# Display results
df_svd

In [None]:
# Summary statistics
df_svd.describe().T # Transpose (`T`) for convenience