## DATA 602 - Spring 2024
### Homework Assignment 3
Total points : 60<br>
 Please provide your solutions in the cells that follow each question cell. You can create new cells as needed. <br>

<b>Question 1</b> [<span style="color: red;">20 points</span>]:<br>
Consider the `Fish.csv` dataset again.
Your job is to use `Species` and `Width` to predict `Weight` (the target). You will need to one-hot encode `Species` in order to do this. Perform an 80-20 split, train a linear regression model, and then print the RMSE and R2 scores on the test set.
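
For reference, the two requested metrics can be written as follows. This is a sketch using notation that is not in the assignment itself: $y_i$ is the true weight of test sample $i$, $\hat{y}_i$ the model's prediction, $\bar{y}$ the mean of the true test weights, and $n$ the number of test samples.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

RMSE is expressed in the same units as `Weight`, while R2 is unitless and equals 1 for a perfect fit.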

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder

# Load the dataset
df = pd.read_csv('Fish.csv')

# One-hot encode the 'Species' column
# (`sparse_output` replaces the deprecated `sparse` argument in scikit-learn >= 1.2)
onehot_encoder = OneHotEncoder(sparse_output=False)
species_encoded = onehot_encoder.fit_transform(df[['Species']])
species_encoded_df = pd.DataFrame(
    species_encoded,
    columns=onehot_encoder.get_feature_names_out(['Species']),
    index=df.index,
)
df_encoded = pd.concat([df.drop(columns=['Species']), species_encoded_df], axis=1)

# Splitting the dataset into train and test sets (80-20 split)
# Note: X keeps every remaining column of Fish.csv, not only 'Width' and the
# encoded species, so the scores printed below are for that wider feature set.
X = df_encoded.drop(columns=['Weight'])
y = df_encoded['Weight']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predicting on the test set
y_pred = model.predict(X_test)

# Calculating RMSE and R2 score
# (square root taken explicitly; the `squared` argument is deprecated in recent scikit-learn releases)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)

print("RMSE:", rmse)
print("R2 Score:", r2)


RMSE: 83.71011402365869
R2 Score: 0.9507352480054513
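
The question asks for only `Species` and `Width` as predictors. A minimal sketch of that restriction is shown here, under stated assumptions: the `toy` frame is a hypothetical stand-in for `Fish.csv` (its values are made up), `fit_species_width` is an illustrative helper name, and `pd.get_dummies` is swapped in for `OneHotEncoder` to keep the example short. No scores are quoted because they depend on the real data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for Fish.csv; the real file has more rows and columns.
toy = pd.DataFrame({
    "Species": ["Bream", "Bream", "Roach", "Roach", "Pike",
                "Pike", "Bream", "Roach", "Pike", "Bream"],
    "Width":   [4.0, 4.7, 3.2, 3.6, 4.3, 5.1, 5.0, 3.9, 4.8, 4.4],
    "Height":  [11.5, 12.4, 6.1, 6.6, 5.6, 6.4, 12.7, 7.0, 6.0, 12.0],
    "Weight":  [242.0, 340.0, 78.0, 120.0, 200.0, 345.0, 390.0, 145.0, 300.0, 290.0],
})

def fit_species_width(df, test_size=0.2, random_state=42):
    """Fit linear regression on Species (one-hot) and Width only."""
    # Select just the two predictors named in the question, then encode Species
    X = pd.get_dummies(df[["Species", "Width"]], columns=["Species"], dtype=float)
    y = df["Weight"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    return list(X.columns), rmse, r2_score(y_test, y_pred)

columns, rmse, r2 = fit_species_width(toy)
print("Features used:", columns)
print("RMSE:", rmse)
print("R2 Score:", r2)
```

Selecting the columns before encoding keeps `Height` (and any other measurement) out of the design matrix, so the feature set matches the question as worded.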


<b>Question 2</b> [<span style="color: red;">20 points</span>]:<br>
Consider the following code:

In [8]:
from sklearn.datasets import fetch_20newsgroups
#Training set
newsgroups_train = fetch_20newsgroups(subset='train',categories = ['alt.atheism', 'comp.graphics'],remove=('headers', 'footers', 'quotes'))
#Testing set
newsgroups_test = fetch_20newsgroups(subset='test',categories = ['alt.atheism', 'comp.graphics'],remove=('headers', 'footers', 'quotes'))

Your task is to:
1. With the help of a Tfidf vectorizer, train logistic regression and KNN models (for KNN, use `n_neighbors=5` and set `algorithm` to `brute`) on `newsgroups_train`.
2. Calculate and print the accuracy and F1 scores on the entire `newsgroups_test` set for both models.
3. For both models, use the `time.process_time()` function to calculate and print the median time of performing `.predict` on the first 200 records of the test set over 100 runs. (Essentially, do 100 iterations and, in each iteration, call `.predict` on `newsgroups_test.data[0:200]`.)
<br>

<b>Note</b> : For better results, pass the stopwords from nltk into the tfidf vectorizer as a parameter.

In [9]:
import numpy as np
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import stopwords

# Load the newsgroups dataset with specified categories and removing headers, footers, and quotes
newsgroups_train = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'comp.graphics'], remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=['alt.atheism', 'comp.graphics'], remove=('headers', 'footers', 'quotes'))

# Define the TF-IDF vectorizer with stopwords
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))

# Fit and transform the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(newsgroups_train.data)

# Transform the test data
X_test_tfidf = tfidf_vectorizer.transform(newsgroups_test.data)

# Train logistic regression model
logistic_model = LogisticRegression(max_iter=1000)
logistic_model.fit(X_train_tfidf, newsgroups_train.target)

# Train KNN model
knn_model = KNeighborsClassifier(n_neighbors=5, algorithm='brute')
knn_model.fit(X_train_tfidf, newsgroups_train.target)

# Predictions on the entire test set
logistic_predictions = logistic_model.predict(X_test_tfidf)
knn_predictions = knn_model.predict(X_test_tfidf)

# Calculate accuracy and F1 scores
logistic_accuracy = accuracy_score(newsgroups_test.target, logistic_predictions)
logistic_f1 = f1_score(newsgroups_test.target, logistic_predictions, average='weighted')

knn_accuracy = accuracy_score(newsgroups_test.target, knn_predictions)
knn_f1 = f1_score(newsgroups_test.target, knn_predictions, average='weighted')

print("Logistic Regression - Accuracy:", logistic_accuracy)
print("Logistic Regression - F1 Score:", logistic_f1)
print("KNN - Accuracy:", knn_accuracy)
print("KNN - F1 Score:", knn_f1)

# Calculate median time for prediction on the first 200 records in the test set for 100 runs
num_runs = 100
logistic_prediction_times = []
knn_prediction_times = []

for _ in range(num_runs):
    start_time = time.process_time()
    logistic_model.predict(X_test_tfidf[:200])
    end_time = time.process_time()
    logistic_prediction_times.append(end_time - start_time)

    start_time = time.process_time()
    knn_model.predict(X_test_tfidf[:200])
    end_time = time.process_time()
    knn_prediction_times.append(end_time - start_time)

median_time_logistic = np.median(logistic_prediction_times)
median_time_knn = np.median(knn_prediction_times)

print("Logistic Regression - Median Prediction Time for first 200 records in test set (100 runs):", median_time_logistic, "seconds")
print("KNN - Median Prediction Time for first 200 records in test set (100 runs):", median_time_knn, "seconds")


Logistic Regression - Accuracy: 0.9152542372881356
Logistic Regression - F1 Score: 0.9146907157759091
KNN - Accuracy: 0.4689265536723164
KNN - F1 Score: 0.3190271688393855
Logistic Regression - Median Prediction Time for first 200 records in test set (100 runs): 0.0 seconds
KNN - Median Prediction Time for first 200 records in test set (100 runs): 1.03125 seconds
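
A median of exactly `0.0` seconds for logistic regression may mean that each call finished within a single tick of the process-time clock, not that it took no time. The KNN value of 1.03125 is a whole multiple of 1/64 s, which is consistent with a coarse clock. The sketch below shows one way to check this: it prints the resolution the platform reports for each clock and repeats the median measurement with `time.perf_counter()`, which measures wall time. The helper name `median_runtime` and the stand-in `workload` are assumptions for illustration, not part of the assignment.

```python
import statistics
import time

def median_runtime(fn, runs=100, clock=time.process_time):
    """Return the median duration of fn() over `runs` calls, measured with `clock`."""
    durations = []
    for _ in range(runs):
        start = clock()
        fn()
        durations.append(clock() - start)
    return statistics.median(durations)

# Resolution that the platform reports for each clock, in seconds
print("process_time resolution:", time.get_clock_info("process_time").resolution)
print("perf_counter resolution:", time.get_clock_info("perf_counter").resolution)

# Hypothetical stand-in workload; in the notebook this would be, for example,
# lambda: logistic_model.predict(X_test_tfidf[:200])
workload = lambda: sum(i * i for i in range(10_000))

print("median CPU time :", median_runtime(workload, clock=time.process_time), "seconds")
print("median wall time:", median_runtime(workload, clock=time.perf_counter), "seconds")
```

The two clocks measure different things (CPU time of the process versus elapsed wall time), so the numbers are not interchangeable; the comparison only helps to judge whether a zero reading is a resolution artifact.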


<b>Question 3</b> [<span style="color: red;">20 points</span>]:<br>
In the last question, you may have noticed that KNN performs far worse than logistic regression. Now, let's try a very different method for feature extraction (developed by UMBC professors!) called BWMD. Your tasks are:
1. Install pyBWMD; you may first need to install `Cython`. Link to [pyBWMD](https://github.com/EdwardRaff/pyBWMD/tree/master)
2. Study [this example](https://github.com/EdwardRaff/pyBWMD/blob/master/examples/20NewsGroups.ipynb) to see how to `vectorize` strings.
3. Vectorize the training and test datasets (essentially, we are encoding the text using `pyBWMD` instead of `TfidfVectorizer`). <b>Note:</b> The targets can stay the same; don't vectorize them with `pyBWMD`.
4. Now train a KNN model (5 neighbors) on the newly vectorized training set. Then compute the Accuracy and F1_score using the test set.
<br>

<b>Note</b>: Your results may improve slightly over KNN with Tfidf, but they will likely still fall short of logistic regression.

In [2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import stopwords
from pyBWMD.bwmd import vectorize
import re

In [3]:

# Load stopwords from nltk
stop_words = set(stopwords.words('english'))

# Function to preprocess text data
def preprocess_text(text):
    # Remove punctuation and lowercase the text
    text = re.sub(r'[^\w\s]', '', text.lower())
    # Remove stopwords
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text

# Load the dataset
newsgroups_train = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'comp.graphics'], remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=['alt.atheism', 'comp.graphics'], remove=('headers', 'footers', 'quotes'))

# Preprocess text data
X_train = [preprocess_text(text) for text in newsgroups_train.data]
X_test = [preprocess_text(text) for text in newsgroups_test.data]

# Vectorize the datasets using pyBWMD's `vectorize` (imported above)
X_train = vectorize(X_train)
X_test = vectorize(X_test)

# Target remains the same, no need to vectorize
y_train = newsgroups_train.target
y_test = newsgroups_test.target

# Train a KNN model
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

# Predict and evaluate performance
knn_pred = knn_model.predict(X_test)
knn_accuracy = accuracy_score(y_test, knn_pred)
knn_f1 = f1_score(y_test, knn_pred, average='weighted')
print("KNN with pyBWMD:")
print("Accuracy:", knn_accuracy)
print("F1 Score:", knn_f1)


KNN with pyBWMD:
Accuracy: 0.5451977401129944
F1 Score: 0.4649829496327737
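
In both KNN results the weighted F1 score is noticeably lower than the accuracy. One way this can happen is when predictions lean heavily toward one class. The sketch below recomputes a support-weighted F1 by hand on hypothetical toy labels to illustrate the effect; the helper names are illustrative, and the weighting by true-label support is intended to mirror `f1_score(..., average='weighted')`.

```python
from collections import Counter

def f1_per_class(y_true, y_pred, label):
    """F1 score for a single class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def weighted_f1(y_true, y_pred):
    """Average of the per-class F1 scores, weighted by each class's support in y_true."""
    support = Counter(y_true)
    n = len(y_true)
    return sum(f1_per_class(y_true, y_pred, label) * count / n
               for label, count in support.items())

# Hypothetical toy labels: the classifier predicts class 1 for almost every sample
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [1, 1, 1, 0, 1, 1, 1, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("Accuracy:", accuracy)
print("Weighted F1:", weighted_f1(y_true, y_pred))
```

On these toy labels the accuracy is 5/8, while class 0 has an F1 of only 0.4 because three of its four samples are missed, which pulls the weighted average below the accuracy.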
