# Homework 4
### Anirudh Margam
### 730002982

In [133]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from pathlib import Path  
import glob
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import re


1. Create a TfidfVectorizer using the spam files in BG/2004 and ham files for kitchen-l from Canvas

We first store all of the files into a list, so that we can pass them to the tfidf vectorizer object.

In [134]:
bg_2004_directory_path = '../Datasets/BG/2004/'
bg_2004_text_files = glob.glob(f'{bg_2004_directory_path}/*/*.txt')
kitchen_directory_path = '../Datasets/kitchen-l/'
kitchen_text_files = glob.glob(f'{kitchen_directory_path}/*/*')
train_text_files = []
train_text_files.extend(bg_2004_text_files)
train_text_files.extend(kitchen_text_files)

Instantiate the TfidfVectorizer and call fit_transform using the spam and ham files. 

The fit_transform method performs feature extraction: here it converts raw text into numerical features.

fit_transform performs 2 key actions:

Fit: this part learns statistics from the data. For text data, it learns the vocabulary of words in the dataset and each word's inverse document frequency (IDF) weight. This information is stored in the vectorizer.

Transform: this part applies the learned information to the data. For text data, it converts each document into numerical features (TF-IDF scores) using the vocabulary and weights from the 'fit' step.

Calling fit_transform once on the training data, and then only transform on the testing data, ensures that the same vocabulary and weights are applied to both, enabling consistent model training and evaluation.
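As a minimal sketch of the fit/transform split described above, the example below uses a few made-up in-memory strings instead of the email files (the documents and variable names are assumptions for illustration only).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy in-memory documents (hypothetical; the notebook reads files instead).
train_docs = ["cheap pills online", "meeting about power prices"]
test_docs = ["cheap power online", "unseen words only"]

vectorizer = TfidfVectorizer()

# fit_transform: learn the vocabulary and IDF weights from the training
# documents, then return their TF-IDF matrix.
train_matrix = vectorizer.fit_transform(train_docs)

# transform: reuse the learned vocabulary and IDF weights. Terms that were
# not seen during fit are ignored.
test_matrix = vectorizer.transform(test_docs)

# Both matrices share the same columns (one per learned vocabulary term).
print(train_matrix.shape, test_matrix.shape)
```

The second test document contains only unseen words, so its row has no non-zero entries; this is why the vectorizer should be fit on data that is representative of what it will later transform.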

In [135]:
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english', encoding='utf-8', decode_error='ignore')
tfidf_vector = tfidf_vectorizer.fit_transform(train_text_files)

2. Use a pandas DataFrame to look at the top 25 words in one spam message and the top 25 words in one ham message

We instantiate the DataFrame first. The spam files come first in train_text_files and there are fewer than 4000 of them, so row 10 is a spam message and row 4000 is a ham message. We hardcode those indices and use sort_values to rank the words in each message by TF-IDF score in descending order.

In [136]:
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

spam_message_index = 10
top_25_spam_words = tfidf_df.iloc[spam_message_index].sort_values(ascending=False)[:25]
ham_message_index = 4000
top_25_ham_words = tfidf_df.iloc[ham_message_index].sort_values(ascending=False)[:25]

In [137]:
print('Top 25 words in a spam message:')
print('   word\t\ttfidf')
for index, (word, tfidf) in enumerate(top_25_spam_words.items()):
    print(f'{index+1}. {word.ljust(12)}{tfidf:.6f}')

Top 25 words in a spam message:
   word		tfidf
1. 82686       0.334887
2. 4864605754  0.334887
3. 0940002579  0.334887
4. umbridol    0.308731
5. cslwbmrbj   0.251165
6. br          0.243594
7. yahoo       0.183220
8. un2u6       0.167443
9. web34414    0.167443
10. karmakamper 0.162016
11. 122         0.141024
12. 7781        0.137078
13. 217         0.134364
14. com         0.133636
15. nov         0.125072
16. www         0.119332
17. unsub       0.114743
18. ai          0.112224
19. http        0.111166
20. 200         0.100245
21. 851129      0.083722
22. 94927017615714 0.083722
23. 12791       0.083722
24. 21901       0.081008
25. 2004        0.080072


In [138]:
print('Top 25 words in a ham message:')
print('   word\t\ttfidf')
for index, (word, tfidf) in enumerate(top_25_ham_words.items()):
    print(f'{index+1}. {word.ljust(12)}{tfidf:.6f}')

Top 25 words in a ham message:
   word		tfidf
1. utilities   0.257595
2. power       0.257414
3. conservation 0.188535
4. sfgate      0.184800
5. ect         0.167221
6. speech      0.145621
7. energy      0.138574
8. plants      0.137258
9. mike        0.135618
10. grigsby     0.132139
11. cut         0.130087
12. percent     0.128174
13. californians 0.126858
14. gov         0.124925
15. holst       0.123078
16. rates       0.123006
17. generators  0.119938
18. gray        0.119516
19. gate        0.106892
20. average     0.104604
21. sf          0.100763
22. prices      0.100652
23. build       0.099846
24. feds        0.096979
25. california  0.092574


3. Train a RandomForestClassifier using the BG/2004 spam emails and kitchen-l ham Tfidf.

We first create a list of the actual labels for the data (1 = spam, 0 = ham, in the same order as train_text_files), then split the data into training and testing sets.
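The labeling and splitting step can be sketched on toy data as follows. The counts and features are made up, and the stratify argument is an optional addition that the notebook itself does not use; it keeps the spam/ham ratio the same in both splits.

```python
from sklearn.model_selection import train_test_split

# Hypothetical counts; in the notebook these come from the file lists.
n_spam, n_ham = 40, 60

# Labels must follow the same order as the documents: spam first, then ham.
labels = [1] * n_spam + [0] * n_ham
features = [[i] for i in range(n_spam + n_ham)]  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(f"test size: {len(y_test)}, spam in test: {sum(y_test)}")
```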

In [139]:
actual_labels = [1] * len(bg_2004_text_files) + [0] * len(kitchen_text_files)
X_train, X_test, y_train, y_test = train_test_split(tfidf_vector, actual_labels, test_size=0.2, random_state=42)

We then instantiate the Random Forest Classifier and train it on the training data.

In [140]:
rf_classifier = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=42)
rf_classifier.fit(X_train, y_train)

Now we can make predictions on the test data and calculate our accuracy.

In [141]:
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 1.00


4. Using your RandomForestClassifier, predict the emails in BG/2005 and farmer-d. Display the number of true positive (spam), false positive, true negative (ham) and false negatives.

We start by loading the emails from BG/2005 and farmer-d into a list.

In [142]:
bg_2005_directory_path = '../Datasets/BG/2005/'
bg_2005_text_files = glob.glob(f'{bg_2005_directory_path}/*/*.txt')
farmer_directory_path = '../Datasets/farmer-d/'
farmer_text_files = glob.glob(f'{farmer_directory_path}/*/*')
test_text_files = []
test_text_files.extend(bg_2005_text_files)
test_text_files.extend(farmer_text_files)

Next, we call transform on the existing tfidf vectorizer to transform the new data.

In [143]:
new_data_tfidf = tfidf_vectorizer.transform(test_text_files)

Now we can make predictions on the new data.

In [144]:
new_data_predictions = rf_classifier.predict(new_data_tfidf)

We create our list of actual labels for the new data and output the confusion matrix.

In [145]:
actual_labels = [1] * len(bg_2005_text_files) + [0] * len(farmer_text_files)
cm = confusion_matrix(actual_labels, new_data_predictions)

print('Confusion Matrix:')
print(cm)
# confusion_matrix sorts the labels as [0, 1]: rows are the actual class
# (ham, spam) and columns are the predicted class (ham, spam).
# Spam (1) is the positive class.
tn, fp, fn, tp = cm.ravel()
print(f'\nNumber of True Positives (Spam):\t{tp}')
print(f'Number of False Positives:\t\t{fp}')
print(f'Number of True Negatives (Ham):\t\t{tn}')
print(f'Number of False Negatives:\t\t{fn}')

Confusion Matrix:
[[3665    4]
 [   2 6120]]

Number of True Positives (Spam):	6120
Number of False Positives:		4
Number of True Negatives (Ham):		3665
Number of False Negatives:		2


Evidently, our Random Forest Classifier is extremely accurate: it misclassifies only 6 of the 9791 emails.
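When reading the four counts out of the matrix, it helps to recall that scikit-learn's confusion_matrix puts the actual classes on the rows and the predicted classes on the columns, sorted by label. A small sketch with made-up labels (1 = spam, 0 = ham):

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 1 = spam (positive class), 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]

# Passing labels=[0, 1] makes the row/column order explicit.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Row 0 = actual ham, row 1 = actual spam, so ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
```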

5. In the second half of your Jupyter notebook (i.e. don’t change your code above, add more), redo the steps above using a stopwords list containing “enron” and HTML tags. Again, predict the emails in BG/2005 and farmer-d. Display the number of true positive (spam), false positive, true negative (ham) and false negatives. (It will probably still be outstanding).

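We start by building a list of stopwords from three sources: the word 'enron', the stopwords file taken from the GitHub link (stopwords_list.txt), and the list of HTML tags (list_of_html_tags.txt).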
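Stripping only the angle brackets can leave entries such as a leading slash or attributes attached to a tag. As a hedged alternative, the sketch below captures just the tag name; the sample string and the regex are assumptions for illustration, since the actual contents of the HTML tag file are not shown here.

```python
import re

# Hypothetical sample text; the real tag list file may look different.
sample = "<!DOCTYPE html> <a href='x'> <br/> </div> <H1> <table border=1>"

# Capture only the tag name: optional slash, then letters/digits.
tag_name_regex = re.compile(r"<\s*/?\s*([A-Za-z][A-Za-z0-9]*)")

# Lowercase the names because the vectorizer lowercases text by default.
tag_names = sorted({name.lower() for name in tag_name_regex.findall(sample)})
print(tag_names)
```

Entries like the DOCTYPE declaration are skipped because they do not start with a letter after the opening bracket.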

In [146]:
text_file_of_stopwords = '../Datasets/stopwords_list.txt'
stopwords_list = ['enron']

with open(text_file_of_stopwords, 'r') as file:
    for line in file:
        word = line.strip()
        stopwords_list.append(word)

text_file_of_html_tags = '../Datasets/list_of_html_tags.txt'
with open(text_file_of_html_tags, 'r') as file:
    input_text = file.read()

html_tag_regex = r'<[^>]+>'
html_tags_list = re.findall(html_tag_regex, input_text)
cleaned_html_tags = [tag.strip('<>') for tag in html_tags_list]

stopwords_list.extend(cleaned_html_tags)

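Now that we have our list of stopwords, we can instantiate a new TfidfVectorizer, fit it on our training files, and use it to transform our test files. We then refit the Random Forest Classifier and make predictions again. In the cell below, the classifier is refit on an 80% split of the transformed BG/2005 and farmer-d features, so most of the emails it predicts were also seen during training; the results are therefore not a like-for-like comparison with step 4.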

In [147]:
new_tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words=stopwords_list, encoding='utf-8', decode_error='ignore')
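# Only the fitted vocabulary and IDF weights are needed from the training files.
new_tfidf_vectorizer.fit(train_text_files)
new_data_tfidf = new_tfidf_vectorizer.transform(test_text_files)

# Caution: this split is taken from the BG/2005 + farmer-d features and labels,
# not from the BG/2004 + kitchen-l training features.
X_train, X_test, y_train, y_test = train_test_split(new_data_tfidf, actual_labels, test_size=0.2, random_state=42)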
rf_classifier = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=42)
rf_classifier.fit(X_train, y_train)

new_data_predictions = rf_classifier.predict(new_data_tfidf)

cm = confusion_matrix(actual_labels, new_data_predictions)

print('Confusion Matrix:')
print(cm)
# Rows are the actual class (ham, spam); columns are the predicted class.
tn, fp, fn, tp = cm.ravel()
print(f'\nNumber of True Positives (Spam):\t{tp}')
print(f'Number of False Positives:\t\t{fp}')
print(f'Number of True Negatives (Ham):\t\t{tn}')
print(f'Number of False Negatives:\t\t{fn}')

Confusion Matrix:
[[3659   10]
 [   0 6122]]

Number of True Positives (Spam):	6122
Number of False Positives:		10
Number of True Negatives (Ham):		3659
Number of False Negatives:		0


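The performance is still very good. We get a few more false positives (10 instead of 4) but no false negatives (0 instead of 2). Whether that is better overall depends on which error is more costly in a real-world context: a false negative lets a spam email through, while a false positive sends a legitimate email to the spam folder.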
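To make the trade-off concrete, precision and recall for the spam class can be computed from the two confusion matrices. This sketch assumes scikit-learn's layout (row 0 = actual ham, row 1 = actual spam, columns = predicted ham, predicted spam); the helper name precision_recall is hypothetical.

```python
def precision_recall(tn, fp, fn, tp):
    """Precision and recall for the positive (spam) class."""
    precision = tp / (tp + fp)  # share of predicted spam that really is spam
    recall = tp / (tp + fn)     # share of actual spam that was caught
    return precision, recall


# Counts read from the two confusion matrices in this notebook.
first = precision_recall(tn=3665, fp=4, fn=2, tp=6120)
second = precision_recall(tn=3659, fp=10, fn=0, tp=6122)

print(f"default stopwords: precision={first[0]:.4f}, recall={first[1]:.4f}")
print(f"custom stopwords:  precision={second[0]:.4f}, recall={second[1]:.4f}")
```

The second run trades a little precision for perfect recall, which matches the shift from false negatives to false positives in the counts.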

In [148]:
tfidf_df = pd.DataFrame(new_data_tfidf.toarray(), columns=new_tfidf_vectorizer.get_feature_names_out())

spam_message_index = 10
top_25_spam_words = tfidf_df.iloc[spam_message_index].sort_values(ascending=False)[:25]

print('Top 25 words in a spam message:')
print('   word\t\ttfidf')
for index, (word, tfidf) in enumerate(top_25_spam_words.items()):
    print(f'{index+1}. {word.ljust(12)}{tfidf:.6f}')

Top 25 words in a spam message:
   word		tfidf
1. 2005        0.357249
2. mar         0.298718
3. bulgaria    0.292137
4. elle        0.248318
5. music       0.246069
6. 135         0.220193
7. bone        0.207728
8. 156         0.201008
9. refi        0.165545
10. 218         0.162617
11. dyndns      0.150727
12. guenter     0.150579
13. 05          0.141412
14. bruce       0.141037
15. 41          0.138585
16. 12          0.133674
17. 127         0.107921
18. localhost   0.106446
19. received    0.099903
20. 26915       0.095690
21. host76      0.095690
22. rait        0.094729
23. 0000        0.088127
24. approval    0.086336
25. homeowner   0.084362


The code above verifies that the stopwords are applied. The top 25 words in this spam message no longer include HTML tags (e.g., br), which makes them more informative. Note that row 10 of this matrix is a BG/2005 message, not the BG/2004 message inspected in step 2.
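A more direct check is to intersect the stopword list with the vocabulary the vectorizer learned. The sketch below does this on made-up documents and a made-up stopword list; in the notebook the same idea would apply to the custom stopword list and the refit vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stopword list and documents, for illustration only.
stopwords = ["enron", "br", "html"]
docs = ["<br> Enron power br", "energy prices br html"]

vectorizer = TfidfVectorizer(stop_words=stopwords)
vectorizer.fit(docs)

# Any stopword still present in the learned vocabulary has leaked through.
leaked = set(stopwords) & set(vectorizer.vocabulary_)
print("vocabulary:", sorted(vectorizer.vocabulary_))
print("leaked stopwords:", leaked)
```

An empty intersection means every stopword was filtered out. Stopword entries that contain spaces or punctuation cannot match a single token under the default tokenizer, so they would have no effect.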