## DATA 602 - Spring 2024
### Homework Assignment 2
Total points : 60<br>
 Please provide your solutions into the cells provided after question cells. You can create new cells as needed. <br>

<b>Question 1</b> [<span style="color: red;">20 points</span>]:<br>
Consider the `spam_ham_dataset.csv` ([Original link](https://www.kaggle.com/datasets/venky73/spam-mails-dataset/data)). This is a dataset that can be useful for trying out different models for spam or ham (not spam) detection.
Your job is to:
1. Load the `csv` and tokenize the text contents (word tokenization) from the `text` column
2. Remove stop words and punctuations
3. Use either stemming or lemmatization to consolidate inflected words to their root words.
4. Create `y` from the column for label (from the `label_num` column, 0 is ham and 1 is spam).
5. Split the preprocessed emails (and labels) with `train_test_split` with a 80-20 split (Please remember to  use the `stratify=y`  parameter)<br>
<b>Hint</b>: Suggest you to use NLTK for steps 2-4.

In [1]:
# importing the Natural Language Toolkit library
import nltk
# downloading the WordNet corpus, which is a lexical database for the English language
nltk.download('wordnet')
# downloading the Punkt tokenizer models, which are used for tokenizing text into sentences
nltk.download('punkt')
# downloading a list of stopwords, which are common words that are often removed from text during text processing tasks
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\pooji\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\pooji\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pooji\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
#Your code goes here
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split


# 1. Load the dataset
df = pd.read_csv("spam_ham_dataset.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


In [3]:
# Tokenize the text contents
df['tokens'] = df['text'].apply(lambda x: word_tokenize(str(x).lower()))

# 2. Remove stopwords and punctuations
stop_words = set(stopwords.words('english'))
df['tokens'] = df['tokens'].apply(lambda x: [word for word in x if word.isalnum() and word not in stop_words])

# 3. Lemmatization
lemmatizer = WordNetLemmatizer()
df['tokens'] = df['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

# 4. Create y from the label_num column
y = df['label_num']

# 5. Split the preprocessed emails and labels
X_train, X_test, y_train, y_test = train_test_split(df['tokens'], y, test_size=0.2, stratify=y)

# Print shapes to verify the split
print("Train set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)

Train set shape: (4136,) (4136,)
Test set shape: (1035,) (1035,)


<b>Question 2</b> [<span style="color: red;">20 points</span>]:<br>
Train  logistic regression models on the training set and print out the accuracy and F1 scores for the test set. Do this for :
1. Feature vectors encoded using `CountVectorizer`.
2. Feature vectors encoded using `TfidfVectorizer`.<br>

<b>Hint </b>: You may want to use `Pipeline` for this

In [4]:
#Your code goes here
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score

# Logistic Regression model
logistic_regression = LogisticRegression()

# Convert token lists to strings
X_train_str = X_train.apply(lambda x: ' '.join(x))
X_test_str = X_test.apply(lambda x: ' '.join(x))

# Pipeline with CountVectorizer
count_vectorizer_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', logistic_regression)
])

# Pipeline with TfidfVectorizer
tfidf_vectorizer_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', logistic_regression)
])

# Fit CountVectorizer model
count_vectorizer_pipeline.fit(X_train_str, y_train)

# Fit TfidfVectorizer model
tfidf_vectorizer_pipeline.fit(X_train_str, y_train)

# Predictions
count_predictions = count_vectorizer_pipeline.predict(X_test_str)
tfidf_predictions = tfidf_vectorizer_pipeline.predict(X_test_str)

# Calculate accuracy and F1 score for CountVectorizer
count_accuracy = accuracy_score(y_test, count_predictions)
count_f1_score = f1_score(y_test, count_predictions)

# Calculate accuracy and F1 score for TfidfVectorizer
tfidf_accuracy = accuracy_score(y_test, tfidf_predictions)
tfidf_f1_score = f1_score(y_test, tfidf_predictions)

# Print results
print("Results for CountVectorizer:")
print("Accuracy:", count_accuracy)
print("F1 Score:", count_f1_score)
print()
print("Results for TfidfVectorizer:")
print("Accuracy:", tfidf_accuracy)
print("F1 Score:", tfidf_f1_score)


Results for CountVectorizer:
Accuracy: 0.961352657004831
F1 Score: 0.9310344827586207

Results for TfidfVectorizer:
Accuracy: 0.9884057971014493
F1 Score: 0.9801980198019802


<b>Question 3</b> [<span style="color: red;">20 points</span>]:<br>
Train Multinomial Naive Bayes models on the training set and print out the accuracy and F1 scores for the test set. Do this for :
1. Feature vectors encoded using `CountVectorizer`.
2. Feature vectors encoded using `TfidfVectorizer`.<br>

<b>Hint </b>: You may want to use `Pipeline` for this

In [5]:
#Your code goes here
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Multinomial Naive Bayes model
multinomial_nb = MultinomialNB()

# Convert token lists to strings
X_train_str = X_train.apply(lambda x: ' '.join(x))
X_test_str = X_test.apply(lambda x: ' '.join(x))

# Pipeline with CountVectorizer
count_vectorizer_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', multinomial_nb)
])

# Pipeline with TfidfVectorizer
tfidf_vectorizer_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', multinomial_nb)
])

# Fit CountVectorizer model
count_vectorizer_pipeline.fit(X_train_str, y_train)

# Fit TfidfVectorizer model
tfidf_vectorizer_pipeline.fit(X_train_str, y_train)

# Predictions
count_predictions = count_vectorizer_pipeline.predict(X_test_str)
tfidf_predictions = tfidf_vectorizer_pipeline.predict(X_test_str)

# Calculate accuracy and F1 score for CountVectorizer
count_accuracy = accuracy_score(y_test, count_predictions)
count_f1_score = f1_score(y_test, count_predictions)

# Calculate accuracy and F1 score for TfidfVectorizer
tfidf_accuracy = accuracy_score(y_test, tfidf_predictions)
tfidf_f1_score = f1_score(y_test, tfidf_predictions)

# Print results
print("Results for CountVectorizer:")
print("Accuracy:", count_accuracy)
print("F1 Score:", count_f1_score)
print()
print("Results for TfidfVectorizer:")
print("Accuracy:", tfidf_accuracy)
print("F1 Score:", tfidf_f1_score)


Results for CountVectorizer:
Accuracy: 0.9217391304347826
F1 Score: 0.8451242829827915

Results for TfidfVectorizer:
Accuracy: 0.9178743961352657
F1 Score: 0.836223506743738
