# Spam detection model using the vectorized data

To create a spam detection model using the vectorized data, we'll follow these steps:

1. Label the Data: We need labels indicating whether each email is spam or not.
2. Split the Data: Divide the data into training and testing sets to evaluate the model's performance accurately.
3. Train a Model: Use a machine learning algorithm to train a model on the training data. Common choices for text classification include Logistic Regression, Random Forest, Naive Bayes, and Support Vector Machines (SVM).
4. Evaluate the Model: Test the model on the testing set and assess its accuracy, recall, F1-score, and confusion matrix.
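
The four steps above can be sketched end to end on a tiny example. The corpus, labels, and variable names below are hypothetical and only illustrate the flow; this sketch fits the vectorizer on the training split only, which is one possible design choice and not necessarily what the cells below do.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Step 1: labeled data (hypothetical toy corpus; 1 = spam, 0 = not spam)
emails = [
    "win a free prize now", "free money claim your prize",
    "urgent winner claim free cash", "free offer click now",
    "meeting agenda for monday", "please review the attached report",
    "lunch tomorrow with the team", "project status update attached",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Step 2: split into training and testing sets (one email per class in the test set)
X_train_text, X_test_text, y_train, y_test = train_test_split(
    emails, labels, test_size=0.25, stratify=labels, random_state=42
)

# Vectorize: learn the vocabulary on the training split, reuse it for the test split
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Step 3: train a model
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 4: evaluate on the held-out emails
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred, labels=[0, 1]))
print(classification_report(y_test, y_pred, zero_division=0))
```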

# 1. Load the e-mail dataset

In [15]:
import pandas as pd

# Load the dataset
file_path = r"csv/mail.csv"
data = pd.read_csv(file_path)

# Display the first few rows of the dataset to understand its structure
data.head()


Unnamed: 0.1,Unnamed: 0,Subject,Content
0,0,announcing sftp support for backup box,hi thanks for using backup box i wanted y...
1,1,what s new big android update skitch sharing...,evernote newsletter march 2012 in th...
2,2,boxbuzz new onecloud unifies mobile apps with...,http app en25 com e es aspx s 1464 e 28...
3,3,kevin see who you already know on linkedin,dear kevin see who you alrea...
4,4,3 99 album deals for nirvana jay z soundga...,google play http play google com ...


# 2. Vectorize the dataset

To vectorize the dataset, we can use techniques like: <br>

 - Bag-of-Words (BoW): Represents text data by counting how many times each word appears.
 - TF-IDF (Term Frequency-Inverse Document Frequency): Weights each word by how often it appears in a document, scaled down by how many documents contain it, so less common but potentially more relevant terms get higher importance.

We will use TF-IDF, which is commonly used for text processing because it helps distinguish more meaningful words from common ones. To keep the representation simple, we keep only the top 1000 terms.

## 2.1. Vectorize the dataset with TF-IDF (Term Frequency-Inverse Document Frequency)

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Limit to top 1000 features for simplicity

# Fit and transform the 'Content' column
tfidf_matrix = tfidf_vectorizer.fit_transform(data['Content'])

# Create a DataFrame to show the TF-IDF features for the first few rows
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df



Unnamed: 0,00,000,01,05,0px,10,100,10012,102,11,...,writing,www,year,years,yet,york,you,your,yourself,youtube
0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.00000,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.121157,0.015350,0.0,0.000000
1,0.000000,0.0,0.0,0.0,0.0,0.000000,0.00000,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.051946,0.122855,0.0,0.000000
2,0.000000,0.0,0.0,0.0,0.0,0.000000,0.02228,0.0,0.0,0.000000,...,0.024515,0.000000,0.000000,0.0,0.0,0.051266,0.040902,0.088839,0.0,0.029388
3,0.000000,0.0,0.0,0.0,0.0,0.000000,0.00000,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.154238,0.000000,0.0,0.000000
4,0.000000,0.0,0.0,0.0,0.0,0.000000,0.00000,0.0,0.0,0.000000,...,0.000000,0.244195,0.025114,0.0,0.0,0.000000,0.035385,0.035866,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1627,0.000000,0.0,0.0,0.0,0.0,0.000000,0.00000,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.054336,0.220298,0.0,0.000000
1628,0.000000,0.0,0.0,0.0,0.0,0.000000,0.00000,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.109635,0.022225,0.0,0.000000
1629,0.108243,0.0,0.0,0.0,0.0,0.066234,0.00000,0.0,0.0,0.538199,...,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.025862,0.078639,0.0,0.000000
1630,0.108175,0.0,0.0,0.0,0.0,0.066193,0.00000,0.0,0.0,0.537864,...,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.025845,0.078590,0.0,0.000000


In [17]:
import pickle
pkl_filename = 'model/prog8245nlp-tfidf.pkl'
with open(pkl_filename, 'wb') as file:
    pickle.dump(tfidf_vectorizer, file)

## 2.2. Saving the vectorized dataset

In [18]:
import pandas as pd

tfidf_df.to_csv(r'csv/tfidf_vectorized_data.csv', index=False)


In [19]:
import pandas as pd

# Load the vectorized data
file_path = r'csv/tfidf_vectorized_data.csv'
vectorized_data = pd.read_csv(file_path)

# Display the first few rows of the dataset and its structure
vectorized_data.head(), vectorized_data.columns


(    00  000   01   05  0px   10      100  10012  102   11  ...   writing   
 0  0.0  0.0  0.0  0.0  0.0  0.0  0.00000    0.0  0.0  0.0  ...  0.000000  \
 1  0.0  0.0  0.0  0.0  0.0  0.0  0.00000    0.0  0.0  0.0  ...  0.000000   
 2  0.0  0.0  0.0  0.0  0.0  0.0  0.02228    0.0  0.0  0.0  ...  0.024515   
 3  0.0  0.0  0.0  0.0  0.0  0.0  0.00000    0.0  0.0  0.0  ...  0.000000   
 4  0.0  0.0  0.0  0.0  0.0  0.0  0.00000    0.0  0.0  0.0  ...  0.000000   
 
         www      year  years  yet      york       you      your  yourself   
 0  0.000000  0.000000    0.0  0.0  0.000000  0.121157  0.015350       0.0  \
 1  0.000000  0.000000    0.0  0.0  0.000000  0.051946  0.122855       0.0   
 2  0.000000  0.000000    0.0  0.0  0.051266  0.040902  0.088839       0.0   
 3  0.000000  0.000000    0.0  0.0  0.000000  0.154238  0.000000       0.0   
 4  0.244195  0.025114    0.0  0.0  0.000000  0.035385  0.035866       0.0   
 
     youtube  
 0  0.000000  
 1  0.000000  
 2  0.029388  
 3  0.

## 2.3. Experiment with 3 different feature extraction techniques to capture meaningful representations of social media text, where the 3 techniques should be of different word embedding categories

To explore different techniques for capturing meaningful representations of social media text, we can experiment with three distinct word embedding categories:

- Count-based: TF-IDF vectorization (already implemented in section 2.1).<br>
TF-IDF is effective for highlighting important words that are frequent in a few documents but rare across the rest.

- Predictive: Word2Vec

- Sparse matrix factorization: Non-negative Matrix Factorization (NMF)

### 2.3.1. Predictive Model: Word2Vec
Word2Vec is a predictive model that uses neural networks to learn word associations from a large corpus of text. After training, the model can detect synonymous words or suggest additional words for a partial sentence. To implement this, we can use the gensim library:

In [20]:
!pip install gensim






In [21]:
import numpy as np  # Make sure to import numpy
import gensim
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Tokenize the text
data['tokenized'] = data['Content'].apply(word_tokenize)

# Train a Word2Vec model
model_w2v = Word2Vec(sentences=data['tokenized'], vector_size=100, window=5, min_count=1, workers=4)

# Function to average all word vectors in a paragraph
def document_vector(doc):
    # Remove out-of-vocabulary words
    doc = [word for word in doc if word in model_w2v.wv.key_to_index]
    return np.mean(model_w2v.wv[doc], axis=0) if doc else np.zeros((100,))

# Apply the function to each document
data['doc_vector'] = data['tokenized'].apply(document_vector)
w2v_df = pd.DataFrame(data['doc_vector'].tolist())

print(w2v_df.head())  # Displays the first five rows of the Word2Vec DataFrame

         0         1         2         3         4         5         6    
0 -0.132779 -0.306009  0.218971  0.886176 -0.039734 -0.740119  0.057367  \
1 -0.076492 -0.047740  0.336058  0.791808 -0.142548 -0.761013  0.090558   
2 -0.036042 -0.010476  0.353221  0.893480 -0.113499 -0.728073  0.105870   
3  0.046882 -0.245239  0.039179  1.019770 -0.045665 -0.846046  0.473050   
4 -0.314919 -0.301436  0.112026  0.520614 -0.506260 -1.147483 -0.366100   

         7         8         9   ...        90        91        92        93   
0  1.001050  0.035703  0.231131  ...  0.276000  0.299138  0.052639  0.344628  \
1  1.025684 -0.084030  0.079842  ...  0.236329  0.227532 -0.049899  0.182343   
2  1.228045 -0.190517  0.022867  ...  0.100347  0.200697  0.003069  0.212455   
3  1.562357 -0.100602  0.465762  ...  0.506108  0.517001  0.210131 -0.034123   
4  0.700114  0.094000  0.401149  ...  0.529058  0.460223 -0.045456 -0.408327   

         94        95        96        97        98        99  
0  0
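
The averaging step in `document_vector` can be illustrated without training a model. The sketch below uses a hypothetical dictionary of 3-dimensional vectors in place of a trained Word2Vec vocabulary; the names `toy_vectors` and `average_vector` are made up for this example.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" standing in for a trained model
toy_vectors = {
    "free": np.array([1.0, 0.0, 0.0]),
    "prize": np.array([0.0, 1.0, 0.0]),
    "meeting": np.array([0.0, 0.0, 1.0]),
}

def average_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

print(average_vector(["free", "prize", "unknown"], toy_vectors))  # [0.5 0.5 0. ]
print(average_vector(["unknown"], toy_vectors))                   # [0. 0. 0.]
```

Averaging discards word order, so two documents with the same words in a different order get the same vector.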

### 2.3.2. Sparse Matrix Factorization: Non-negative Matrix Factorization (NMF)

In text mining, Non-negative Matrix Factorization (NMF) is commonly applied to matrices derived from count-based techniques, such as the TF-IDF matrix used here.

NMF belongs to a group of algorithms that perform matrix factorization under non-negativity constraints, commonly used for dimensionality reduction and feature extraction, particularly in the context of discovering latent topics or structures within data.

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000)

# Fit and transform the 'Content' column
tfidf_matrix = tfidf_vectorizer.fit_transform(data['Content'])

# Apply NMF to reduce dimensionality and capture context
nmf = NMF(n_components=50, random_state=42)  # You can adjust the number of components
nmf_features = nmf.fit_transform(tfidf_matrix)

# Optionally convert to DataFrame
nmf_df = pd.DataFrame(nmf_features)


# Display the first 5 rows of the DataFrame
print(nmf_df.head())

# Display the last 5 rows of the DataFrame
print(nmf_df.tail())


         0         1         2         3         4    5         6         7    
0  0.000000  0.005815  0.000000  0.002579  0.000000  0.0  0.000606  0.000000  \
1  0.000000  0.000000  0.013166  0.000000  0.000000  0.0  0.000000  0.000000   
2  0.000000  0.000000  0.003473  0.002369  0.000000  0.0  0.000000  0.000000   
3  0.000765  0.000000  0.000000  0.000000  0.000000  0.0  0.000000  0.000000   
4  0.002852  0.000000  0.000000  0.000000  0.186151  0.0  0.000000  0.000026   

    8         9   ...       40        41        42        43        44   
0  0.0  0.000000  ...  0.00016  0.000000  0.001891  0.048537  0.132014  \
1  0.0  0.000000  ...  0.00000  0.005964  0.012670  0.056631  0.118135   
2  0.0  0.000000  ...  0.00000  0.000000  0.000000  0.078932  0.072407   
3  0.0  0.000386  ...  0.00000  0.000000  0.000000  0.027284  0.097642   
4  0.0  0.000000  ...  0.00000  0.000000  0.000000  0.042866  0.000000   

         45        46        47        48        49  
0  0.010202  0.00181

# 3. Label the Data

There are several ways to label data for spam detection. Here are some general guidelines and approaches we might consider:

1. Manual Labeling
If the dataset is not too large, we could manually review each email (or document) and assign a label based on its content:<br>
Spam (1): Emails that are unsolicited and typically trying to sell something, phishing attempts, or contain malicious links.<br>
Not Spam (0): Legitimate emails that are typically personal, business-related, or relevant communications you expect to receive.

2. Using Pre-labeled Data
If manually labeling the dataset is impractical due to its size or complexity, consider using a pre-labeled dataset:<br>
Many public datasets are available for spam detection, such as the Enron Corpus, SpamAssassin Public Corpus, etc.<br>
You can train your model on a pre-labeled dataset and use your unlabeled data as additional testing or real-world application data.

3. Automated Labeling with Rules
For larger datasets, you might develop rules to automatically assign labels based on:<br>
Keywords: Spam often contains specific keywords (e.g., "free", "winner", "urgent", "risk-free").<br>
Sender's Email Address: Emails from certain domains or with suspicious patterns might be more likely to be spam.<br>
Formatting and Presentation: Use of excessive capitalization, multiple font colors, or poor formatting can be indicators of spam.
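
As an illustration of the rule-based approach, here is a minimal sketch that combines the keyword and capitalization signals. The function name `rule_based_label` and the thresholds are hypothetical, and the capitalization rule only helps on raw text that has not been lowercased.

```python
import re

# Keywords taken from the examples listed above
SPAM_KEYWORDS = {"free", "winner", "urgent", "risk-free"}

def rule_based_label(text, min_hits=2, caps_threshold=0.7):
    """Return 1 (spam) if enough keywords match or most letters are upper-case."""
    words = re.findall(r"[a-z][a-z\-]*", text.lower())
    hits = sum(word in SPAM_KEYWORDS for word in words)

    letters = [char for char in text if char.isalpha()]
    caps_ratio = sum(char.isupper() for char in letters) / len(letters) if letters else 0.0

    return int(hits >= min_hits or caps_ratio > caps_threshold)

print(rule_based_label("URGENT: you are a WINNER, claim your free gift"))  # 1
print(rule_based_label("Meeting notes attached for Monday"))               # 0
```

Labels produced this way are noisy, so they are usually worth spot-checking by hand before training on them.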

We will use our own manually labeled data.

### 3.1.1. Download from our GitHub

https://github.com/onlyxool/PROG8245NLP


In [None]:
import pandas as pd

# Load the dataset
file_path = r"csv/mail_labeling.csv"
data_label = pd.read_csv(file_path)

# Display the first few rows of the dataset to understand its structure
# data_label.head()
data_label




### 3.1.2. Vectorize labeling data

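Because `tfidf_vectorized_data` already holds the TF-IDF vectorization of the emails (built from the `Content` column in section 2.1), vectorizing the labeled data again is unnecessary and would be redundant. The rows of the labeled file must be in the same order as the rows of the vectorized data for the labels to line up.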

# 4. Split the Data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Convert labels to numeric format (1 for Spam, 0 for Not Spam)
data_label['label_numeric'] = (data_label['label'] == 'Spam').astype(int)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(vectorized_data, data_label['label_numeric'], test_size=0.2, random_state=42)
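
When the classes are imbalanced, passing `stratify` to `train_test_split` keeps the class ratio the same in both splits. The sketch below uses hypothetical labels with a 3:1 ratio; whether stratification helps on the real data is an assumption to verify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75 spam (1) and 25 not spam (0)
y = np.array([1] * 75 + [0] * 25)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# stratify=y preserves the 3:1 class ratio in both the training and testing sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_tr.mean(), y_te.mean())  # 0.75 0.75
```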

# 5. Train a Model - Logistic Regression

In [None]:

from sklearn.linear_model import LogisticRegression

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(vectorized_data, data_label['label_numeric'], test_size=0.2, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression(max_iter=1000)  # Increased max_iter for convergence

model.fit(X_train, y_train)

# Predict on the testing set
y_pred = model.predict(X_test)

In [None]:
pkl_filename = 'model/prog8245nlp.pkl'
with open(pkl_filename, 'wb') as file:
    pickle.dump(model, file)
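
A saved vectorizer and model can later be loaded together to classify new emails. The sketch below shows the round trip on hypothetical toy data, serializing to bytes in memory instead of writing files; the training texts and variable names are assumptions for illustration.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data; 1 = spam, 0 = not spam
train_texts = ["free prize winner", "claim free cash now",
               "meeting agenda monday", "project report attached"]
train_labels = [1, 1, 0, 0]

vec = TfidfVectorizer().fit(train_texts)
clf = LogisticRegression(max_iter=1000).fit(vec.transform(train_texts), train_labels)

# Serialize both objects (in memory here; the notebook writes .pkl files)
vec_bytes, clf_bytes = pickle.dumps(vec), pickle.dumps(clf)

# Later: restore both and classify a new email with the same vocabulary
loaded_vec = pickle.loads(vec_bytes)
loaded_clf = pickle.loads(clf_bytes)

new_email = ["free cash prize"]
prediction = loaded_clf.predict(loaded_vec.transform(new_email))
print(prediction)
```

The vectorizer has to be saved alongside the model because the model only understands feature columns in the order the vectorizer produced them.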

# 6. Evaluate the model: Logistic Regression

## 6.1. Accuracy, recall, f1-score

In [None]:
from sklearn.metrics import classification_report
import pandas as pd

# Calculate and display accuracy and other metrics
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}\n")
print(report)



Precision: Precision measures the proportion of correctly identified positive cases (spam emails) out of all cases classified as positive. For spam class (Spam), it's 0.76, meaning 76% of the emails classified as spam are actually spam.

Recall: Recall measures the proportion of correctly identified positive cases (spam emails) out of all actual positive cases. For spam class (Spam), it's 0.99, indicating that the model captures 99% of the actual spam emails.

F1-score: The F1-score is the harmonic mean of precision and recall, providing a balance between the two metrics. For spam class (Spam), it's 0.86, reflecting a good balance between precision and recall.

Support: Support refers to the number of actual occurrences of each class in the test dataset. For spam class (Spam), there are 246 instances in the test dataset.

The support value for class 0 (non-spam emails) is 81: of the 327 emails in the test set, 81 are actually non-spam. Support counts true labels, not predictions.
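
The reported values can be reproduced by hand from the test-set counts. The sketch below assumes the counts from the Logistic Regression confusion matrix in section 6.2, read with scikit-learn's convention (rows are actual classes, columns are predicted classes) and with spam as the positive class.

```python
# Counts for the spam class, taken from the confusion matrix in section 6.2
tp = 244  # spam predicted as spam
fp = 77   # non-spam predicted as spam
fn = 2    # spam predicted as non-spam
tn = 4    # non-spam predicted as non-spam

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.76 0.99 0.86
print(tp + fn, tn + fp)  # support: 246 spam, 81 non-spam
```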

## 6.2. Confusion matrix

![Confusion matrix](https://miro.medium.com/v2/resize:fit:640/format:webp/1*Z54JgbS4DUwWSknhDCvNTQ.png)



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# y_test and y_pred hold the true and predicted labels, respectively
conf_matrix = confusion_matrix(y_test, y_pred)

# scikit-learn puts actual classes in rows and predicted classes in columns,
# in ascending label order: Negative = not spam (0), Positive = spam (1)
conf_matrix_df_lr = pd.DataFrame(conf_matrix, columns=['Negative', 'Positive'],
                              index=['Negative', 'Positive'])

# Plot the confusion matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix_df_lr, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Logistic Regression')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()



scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns, with class 0 (not spam) first. With spam as the positive class, the cells read as follows:

Top Left: True Negative (TN) - The number of negative instances correctly predicted by the model as negative. There are 4, meaning only 4 non-spam emails were recognized as non-spam.

Top Right: False Positive (FP) - The number of negative instances incorrectly predicted by the model as positive. There are 77, indicating these were non-spam emails that the model flagged as spam.

Bottom Left: False Negative (FN) - The number of positive instances incorrectly predicted by the model as negative. The model missed 2 spam emails.

Bottom Right: True Positive (TP) - The number of positive instances correctly predicted by the model as positive. There are 244, indicating these were spam emails that the model correctly identified.
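
The cell layout used by scikit-learn can be checked on a small example. In the sketch below, the label lists are hypothetical; `confusion_matrix` returns rows for actual classes and columns for predicted classes in ascending label order, so `ravel()` yields TN, FP, FN, TP for binary labels 0 and 1.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 3 non-spam (0) and 5 spam (1) emails
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_hat = [0, 1, 1, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
print(cm)
# [[1 2]
#  [1 4]]

# Flatten row by row: first row is actual 0, second row is actual 1
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 1 2 1 4
```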

# 7. Train a Model - RandomForestClassifier

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(vectorized_data, data_label['label_numeric'], test_size=0.2, random_state=42)

# Train a Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)  # You can adjust the number of trees (n_estimators)

model.fit(X_train, y_train)

# Predict on the testing set
y_pred= model.predict(X_test)


# 8. Evaluate the model: RandomForestClassifier

## 8.1. Accuracy, recall, f1-score

In [None]:
from sklearn.metrics import classification_report
import pandas as pd

# Calculate and display accuracy and other metrics
accuracy = accuracy_score(y_test, y_pred)
report_rf = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}\n")
print(report_rf)



## 8.2. Confusion matrix

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# y_test and y_pred hold the true and predicted labels, respectively
conf_matrix = confusion_matrix(y_test, y_pred)

# scikit-learn puts actual classes in rows and predicted classes in columns,
# in ascending label order: Negative = not spam (0), Positive = spam (1)
conf_matrix_df_rf = pd.DataFrame(conf_matrix, columns=['Negative', 'Positive'],
                              index=['Negative', 'Positive'])

# Plot the confusion matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix_df_rf, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Random Forest')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()


Rows are actual classes and columns are predicted classes, with class 0 (not spam) first and spam as the positive class:

- Top-left square (True Negatives, TN): The number of negative instances correctly predicted by the model (23).
- Top-right square (False Positives, FP): The number of negative instances incorrectly predicted as positive by the model (58).
- Bottom-left square (False Negatives, FN): The number of positive instances incorrectly predicted as negative by the model (19).
- Bottom-right square (True Positives, TP): The number of positive instances correctly predicted by the model (227).

# 9. Which confusion matrix is better?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Reuse the confusion-matrix DataFrames built in sections 6.2 and 8.2
# (conf_matrix_df_lr and conf_matrix_df_rf); no further conversion is needed

# Create a figure with two subplots
fig, axs = plt.subplots(1, 2, figsize=(12, 6))

# Plot confusion matrix for logistic regression
sns.heatmap(conf_matrix_df_lr, annot=True, fmt='d', cmap='Blues', cbar=False, ax=axs[0])
axs[0].set_title('Logistic Regression')

# Plot confusion matrix for random forest
sns.heatmap(conf_matrix_df_rf, annot=True, fmt='d', cmap='Blues', cbar=False, ax=axs[1])
axs[1].set_title('Random Forest')

plt.tight_layout()
plt.show()


The counts below read each matrix with rows as actual classes and columns as predicted classes (scikit-learn's convention, class 0 first) and treat spam (1) as the positive class.

For the Logistic Regression model:

- True Positives (TP): 244
- False Negatives (FN): 2
- False Positives (FP): 77
- True Negatives (TN): 4

For the Random Forest model:

- True Positives (TP): 227
- False Negatives (FN): 19
- False Positives (FP): 58
- True Negatives (TN): 23

In [None]:
print(f"Logistic Regression:\n {report}")
print(f"Random Forest:\n {report_rf}")

- Precision: For the spam class, Random Forest is slightly more precise (227/285 ≈ 0.80 vs. 244/321 ≈ 0.76), meaning it flags fewer legitimate emails as spam (58 vs. 77 false positives).
- Recall: For the spam class, Logistic Regression has higher recall (244/246 ≈ 0.99 vs. 227/246 ≈ 0.92). For the non-spam class, Random Forest has much higher recall (23/81 ≈ 0.28 vs. 4/81 ≈ 0.05).
- Accuracy: Both models have similar accuracy, with Random Forest being slightly higher (250/327 ≈ 0.765 vs. 248/327 ≈ 0.758).
- F1 Score: For the spam class, the F1 scores are nearly identical (about 0.86 for both). For the non-spam class, Random Forest has a much higher F1 score (about 0.37 vs. 0.09).

Considering these factors, the Random Forest model seems to perform better overall.<br>
Logistic Regression labels almost every test email as spam, so it catches nearly all spam but recognizes only 4 of the 81 legitimate emails.<br>
Random Forest gives up some spam recall in exchange for far better recall and F1 score on legitimate emails, and its slightly higher accuracy also supports this conclusion.
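
The comparison can be recomputed for both classes directly from the two confusion matrices. The sketch below hard-codes the counts reported above and assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, class 0 first); the `summary` table and its column names are made up for this example.

```python
import numpy as np
import pandas as pd

# Counts reported above; rows = actual class, columns = predicted class, order [0, 1]
matrices = {
    "Logistic Regression": np.array([[4, 77], [2, 244]]),
    "Random Forest": np.array([[23, 58], [19, 227]]),
}

rows = []
for name, cm in matrices.items():
    accuracy = np.trace(cm) / cm.sum()
    for cls, label in enumerate(["not spam (0)", "spam (1)"]):
        correct = cm[cls, cls]
        precision = correct / cm[:, cls].sum()  # column total = predicted as this class
        recall = correct / cm[cls, :].sum()     # row total = actually this class
        f1 = 2 * precision * recall / (precision + recall)
        rows.append({"model": name, "class": label, "precision": precision,
                     "recall": recall, "f1": f1, "accuracy": accuracy})

summary = pd.DataFrame(rows).round(3)
print(summary)
```

Reporting both classes side by side makes it easier to see that the two models differ mainly on the non-spam class.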