<a href="https://colab.research.google.com/github/paidasahithi26/SahithiPaida_INFO5731_Fall2024/blob/main/In_class_exercises_4_Text_Cleaning_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# In-Class Assignment — Data Preprocessing & Cleaning (Text)  
**Time:** 20 minutes  |  **Points:** 10  

## Instructions
- This is an individual in-class assignment.  
- Write your code **inside each answer cell**.  
- Print the required outputs.  
- Submit your GitHub/Colab link as instructed by the instructor.


You are given a small dataset of customer support messages as a **TAB-separated text file**:  
- `support_messages.txt`

You will download this file from **Canvas** and upload it to your **Google Colab** notebook.

**How to upload it to your Google Colab notebook?**

1. Download `support_messages.txt` from Canvas.
2. In **the left sidebar**, click the **Files** icon (folder).  
3. Click **Upload** and select `support_messages.txt`. After uploading, the file will appear in the Colab file list on the left.
4. Right-click the file, copy its path, and paste it into the `FILE_PATH` variable in Q1.
5. Run Q1 to load the dataset.



> Important: Keep the file name exactly as `support_messages.txt`.


## Questions (Total = 10 points)

### Q1 (1 point) — Load the dataset
Load the TAB-separated file into a pandas DataFrame with columns: `id`, `message`.  
Print: **(a)** `df.shape`, **(b)** `df.head(3)`.

### Q2 (3 points) — Descriptive columns
Add these columns for each message and print the full DataFrame:
- `word_count`: number of words  
- `char_count`: number of characters  
- `num_count`: number of digits (0–9)  
- `upper_word_count`: number of ALL-CAPS words (e.g., `"WHY"`, `"DAMAGED"`)  

### Q3 (3 points) — Clean text
Build a `clean_text(text)` function and create a new column `clean` with these steps **in order**:
1) lowercase  
2) remove punctuation/symbols (keep letters/numbers/spaces)  
3) remove English stopwords (use **nltk** or **sklearn** list)  
4) remove extra spaces  

Print the **original** message and **clean** version for rows `id=1` and `id=4`.

### Q4 (2 points) — Regex extraction
Using RegEx, extract and create two new columns:
- `order_id`: first occurrence of pattern `ORD-####` (case-insensitive; `ord-1060` is valid)  
- `email`: first email address if present (otherwise `None`/`NaN`)  

Print: `id`, `order_id`, `email` for all rows.

### Q5 (1 point) — TF-IDF keywords
Using the `clean` column, compute **TF-IDF** for the messages and print the **top 5 keywords** with the highest **average TF-IDF** across documents.


In [3]:
# Setup (run this cell first)
import re
import pandas as pd


## Q1 (1 point) — Answer below

In [5]:
# Q1 — ANSWER CELL
FILE_PATH = "/support_messages (4).txt"  # paste your copied path here

# Load TAB-separated file.
# Note: with names= and no header=0, a header line in the file is read as a data row.
df = pd.read_csv(FILE_PATH, sep="\t", names=["id", "message"])

# Print required outputs
print("Shape:", df.shape)
print("\nFirst 3 rows:")
print(df.head(3))


Shape: (9, 2)

First 3 rows:
   id                                            message
0  id                                            message
1   1  Hi!! My ORDER is late :(  Order# ORD-1042. Ema...
2   2  Refund please!!! I was charged 2 times... invo...
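
Row 0 of the output above repeats the column names, which suggests the file's header line was read as data. A minimal sketch of one way to avoid that, assuming the file starts with an `id<TAB>message` header line: pass `header=0` together with `names`, so pandas replaces the header instead of keeping it as a row. The sample rows below are hypothetical stand-ins, not the Canvas data.

```python
import io
import pandas as pd

# Hypothetical stand-in for the file contents: a header line plus two rows.
sample = "id\tmessage\n1\tOrder ORD-1042 is late\n2\tRefund please\n"

# names= alone: the header line is kept as the first data row.
with_header_row = pd.read_csv(
    io.StringIO(sample), sep="\t", names=["id", "message"]
)

# header=0 plus names=: the header line is consumed and replaced by our names.
without_header_row = pd.read_csv(
    io.StringIO(sample), sep="\t", header=0, names=["id", "message"]
)

print(with_header_row.shape)     # (3, 2)
print(without_header_row.shape)  # (2, 2)
```

With the header consumed, `id` is parsed as an integer column, so later filters such as `df["id"] == 1` should work without a separate numeric conversion.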


## Q2 (3 points) — Answer below

In [6]:
# Q2 — ANSWER CELL
# TODO: create word_count, char_count, num_count, upper_word_count
# Hint for digits: df["message"].str.count(r"\d")
# Hint for ALL-CAPS words: count tokens where token.isupper()

# Word count (whitespace-separated tokens)
df["word_count"] = df["message"].str.split().apply(len)

# Character count
df["char_count"] = df["message"].str.len()

# Number of digits
df["num_count"] = df["message"].str.count(r"\d")

# ALL-CAPS word count
df["upper_word_count"] = df["message"].apply(
    lambda x: sum(1 for word in str(x).split() if word.isupper())
)

# Display full DataFrame
print(df)

   id                                            message  word_count  \
0  id                                            message           1   
1   1  Hi!! My ORDER is late :(  Order# ORD-1042. Ema...          12   
2   2  Refund please!!! I was charged 2 times... invo...          11   
3   3        Great service, thanks! arrived in 2 days :)           8   
4   4  WHY is my package DAMAGED??? tracking says del...           8   
5   5  Need to change address: 7421 Frankford Rd Apt ...          12   
6   6  Support ticket: ORD-1050. Call me at (469) 555...           9   
7   7  I can’t login— password reset link not working...          10   
8   8  Item missing from box. pls send replacement!! ...          10   

   char_count  num_count  upper_word_count  
0           7          0                 0  
1          73          4                 2  
2          71          9                 3  
3          43          1                 0  
4          55          0                 2  
5        
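
To make the four counts easier to check by hand, here is a small plain-Python sketch applied to the `id=1` message as it appears in this notebook. The helper name `describe` is hypothetical and is not part of the assignment.

```python
import re

def describe(message: str) -> dict:
    """Return the four descriptive counts for a single message."""
    tokens = message.split()  # whitespace-separated tokens
    return {
        "word_count": len(tokens),
        "char_count": len(message),
        "num_count": len(re.findall(r"\d", message)),
        # str.isupper() ignores digits and punctuation, so "ORD-1042." counts.
        "upper_word_count": sum(1 for tok in tokens if tok.isupper()),
    }

message = "Hi!! My ORDER is late :(  Order# ORD-1042. Email me at sara.Ali@gmail.com"
stats = describe(message)
print(stats)
```

Note that `isupper()` only looks at cased characters, so tokens such as `ORD-1042.` or a lone `I` count as ALL-CAPS words. If that is not the intended definition, a stricter check (for example, requiring at least two letters) would be needed.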

## Q3 (3 points) — Answer below

In [10]:
# Q3 — ANSWER CELL
# Option A (sklearn stopwords):
# from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# STOPWORDS = set(ENGLISH_STOP_WORDS)

# Option B (nltk stopwords):
# import nltk
# nltk.download("stopwords")
# from nltk.corpus import stopwords
# STOPWORDS = set(stopwords.words("english"))


from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
STOPWORDS = set(ENGLISH_STOP_WORDS)

# Make sure id is numeric (prevents matching issues)
df["id"] = pd.to_numeric(df["id"], errors="coerce")

def clean_text(text: str) -> str:
    # Convert to string (handles NaN safely)
    text = str(text)

    # 1️⃣ Lowercase
    text = text.lower()

    # 2️⃣ Remove punctuation/symbols (keep letters, numbers, spaces)
    text = re.sub(r"[^a-z0-9\s]", " ", text)

    # 3️⃣ Remove stopwords
    words = text.split()
    words = [word for word in words if word not in STOPWORDS]

    # 4️⃣ Remove extra spaces
    cleaned = " ".join(words)

    return cleaned.strip()

# Create clean column
df["clean"] = df["message"].apply(clean_text)

# Safely print required rows
print("ID = 1")
row1 = df[df["id"] == 1]

if not row1.empty:
    print("Original:", row1["message"].iloc[0])
    print("Clean   :", row1["clean"].iloc[0])
else:
    print("ID 1 not found in dataset")

print("\nID = 4")
row4 = df[df["id"] == 4]

if not row4.empty:
    print("Original:", row4["message"].iloc[0])
    print("Clean   :", row4["clean"].iloc[0])
else:
    print("ID 4 not found in dataset")

ID = 1
Original: Hi!! My ORDER is late :(  Order# ORD-1042. Email me at sara.Ali@gmail.com
Clean   : hi order late order ord 1042 email sara ali gmail com

ID = 4
Original: WHY is my package DAMAGED??? tracking says delivered...
Clean   : package damaged tracking says delivered
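
The effect of each step can be traced with a dependency-free sketch. It follows the same four steps but uses a deliberately tiny, hypothetical stopword set in place of the sklearn list, so the exact output on other messages may differ from the notebook's.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses sklearn's list.
TINY_STOPWORDS = {"why", "is", "my", "me", "at"}

def clean_text_sketch(text: str) -> str:
    text = str(text).lower()                                  # 1) lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)                  # 2) drop symbols
    words = [w for w in text.split() if w not in TINY_STOPWORDS]  # 3) stopwords
    return " ".join(words)                                    # 4) single spaces

damaged = clean_text_sketch("WHY is my package DAMAGED??? tracking says delivered...")
email = clean_text_sketch("Email me at sara.Ali@gmail.com")
print(damaged)  # package damaged tracking says delivered
print(email)    # email sara ali gmail com
```

Step 2 breaks the email address into separate tokens, which is why pattern extraction such as order IDs and emails is better run on the original `message` column than on `clean`.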


## Q4 (2 points) — Answer below

In [11]:
# Q4 — ANSWER CELL
# order_id pattern: r"ORD-\d{4}" with re.IGNORECASE
# email pattern (simple): r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

# TODO: create df["order_id"] and df["email"]
# TODO: print/display df[["id", "order_id", "email"]]

# Extract order_id (case-insensitive)
df["order_id"] = df["message"].str.extract(r"(ORD-\d{4})", flags=re.IGNORECASE)

# Extract email
df["email"] = df["message"].str.extract(
    r"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)

# Print required columns
print(df[["id", "order_id", "email"]])


    id  order_id                  email
0  NaN       NaN                    NaN
1  1.0  ORD-1042     sara.Ali@gmail.com
2  2.0  ORD-1042                    NaN
3  3.0       NaN                    NaN
4  4.0       NaN                    NaN
5  5.0       NaN                    NaN
6  6.0  ORD-1050                    NaN
7  7.0       NaN  mehri.sattari@unt.edu
8  8.0  ord-1060                    NaN
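
The two patterns can also be tried outside pandas with the standard `re` module. In this sketch the helper `first_match` and the second and third sample strings are hypothetical; the optional upper-casing step is an assumption about how you might want to normalize IDs, not a requirement of the assignment.

```python
import re

ORDER_RE = re.compile(r"ORD-\d{4}", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def first_match(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None

samples = [
    "Order# ORD-1042. Email me at sara.Ali@gmail.com",
    "pls send replacement for ord-1060",  # hypothetical wording
    "Great service, thanks!",             # no order ID, no email
]

results = []
for text in samples:
    order_id = first_match(ORDER_RE, text)
    # Optional normalization so "ord-1060" and "ORD-1060" compare equal.
    normalized = order_id.upper() if order_id else None
    results.append((normalized, first_match(EMAIL_RE, text)))

print(results)
```

One caveat: because the pattern has no trailing word boundary, a longer number such as `ORD-10423` would still match as `ORD-1042`. Appending `\b` to the pattern is one way to reject that case if it matters for your data.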


## Q5 (1 point) — Answer below

In [12]:
# Q5 — ANSWER CELL
# Hint: from sklearn.feature_extraction.text import TfidfVectorizer
# 1) fit TF-IDF on df["clean"]
# 2) compute average TF-IDF per term across documents
# 3) print top 5 terms + their average scores

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Fit TF-IDF on clean column
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["clean"])

# Compute average TF-IDF per term
avg_tfidf = np.mean(X.toarray(), axis=0)

# Get feature names
terms = vectorizer.get_feature_names_out()

# Create DataFrame of terms + scores
tfidf_df = pd.DataFrame({
    "term": terms,
    "avg_tfidf": avg_tfidf
})

# Sort and get top 5
top5 = tfidf_df.sort_values(by="avg_tfidf", ascending=False).head(5)

print("Top 5 Keywords by Average TF-IDF:")
print(top5)


Top 5 Keywords by Average TF-IDF:
       term  avg_tfidf
39      ord   0.125373
36  message   0.111111
40    order   0.088427
1      1042   0.063192
24    email   0.058799
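
The score of 0.111111 for `message` equals 1/9, which is consistent with a single one-word document (the header row read as data) receiving a normalized weight of 1.0 and being averaged over 9 rows. The plain-Python sketch below illustrates that effect. It assumes scikit-learn's documented defaults (smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 row normalization) but tokenizes on whitespace only, so it is an approximation of `TfidfVectorizer`, not a replacement. The function name and the three sample documents are hypothetical.

```python
import math
from collections import Counter

def average_tfidf(docs):
    """Average L2-normalized TF-IDF per term across documents."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    totals = Counter()
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for t, w in weights.items():
            totals[t] += w / norm  # each row is scaled to unit length
    return {t: s / n_docs for t, s in totals.items()}

# Hypothetical documents; the first mimics a stray header row.
docs = ["message", "order late email", "refund order"]
avg = average_tfidf(docs)
for term, score in sorted(avg.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:8s} {score:.6f}")
```

Here `message` averages exactly 1/3 because its only document contributes a weight of 1.0 and the other two contribute 0. Dropping the stray header row before fitting would remove such a term from the ranking.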


## Grading Checklist
- Q1: correct load + prints  
- Q2: correct counts  
- Q3: cleaning follows the required order + prints for id=1 and id=4  
- Q4: regex extraction works (case-insensitive `ORD-####` and emails)  
- Q5: prints 5 keywords + their scores (rounding is fine)
