Natural Language Processing (NLP) tasks using Machine Learning techniques

steps:

Data Preprocessing: We'll discuss how to prepare and clean the data for NLP tasks. This includes tasks such as removing special characters, handling capitalization, tokenization, and dealing with stopwords.

Data Vectorization: Next, we'll explore various methods for converting text data into numerical representations suitable for machine learning models. This includes techniques like TF-IDF (Term Frequency-Inverse Document Frequency).

Model Training: We'll dive into training different machine learning models for NLP tasks. This involves selecting appropriate models like Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Naive Bayes, Decision Trees, Random Forests, and more. We'll evaluate their performance using metrics like accuracy, precision.

a. Training and Evaluation: We'll train each model and evaluate its performance using various evaluation metrics to understand how well it generalizes to unseen data.

b. Model Selection: Finally, we'll identify the model that demonstrates the best performance on our dataset and discuss strategies for model selection.



###  Text Preprocessing

| Library | Function |
|--------|----------|
| `string` | Provides tools to handle and manipulate strings, including punctuation removal. |
| `re` | Regular expressions for pattern matching and text cleaning. |
| `nltk.corpus.stopwords` | Contains lists of stopwords (common words like "the", "is") to remove from text. |
| `nltk.stem.porter.PorterStemmer` | Reduces words to their root form (e.g., "running" → "run"). |

---

### Data Handling & Visualization

| Library | Function |
|--------|----------|
| `numpy` | Supports numerical operations and array handling. |
| `pandas` | Used for data manipulation and analysis with DataFrames. |
| `matplotlib.pyplot` | Basic plotting library for visualizing data. |
| `seaborn` | Built on matplotlib; provides more attractive and informative statistical graphics. |

---

###  Feature Extraction

| Library | Function |
|--------|----------|
| `CountVectorizer` | Converts text to a matrix of token counts (Bag of Words model). |
| `TfidfVectorizer` | Converts text to a matrix of TF-IDF features (term importance). |

---

###  Machine Learning Models

| Library | Function |
|--------|----------|
| `LogisticRegression` | Linear classifier for binary/multiclass classification. |
| `SVC` | Support Vector Classifier for separating data with hyperplanes. |
| `GaussianNB`, `MultinomialNB`, `BernoulliNB` | Naive Bayes classifiers for different data distributions. |
| `DecisionTreeClassifier` | Tree-based model that splits data based on feature values. |
| `KNeighborsClassifier` | Classifies based on the majority label of nearest neighbors. |
| `RandomForestClassifier` | Ensemble of decision trees for better accuracy and robustness. |
| `AdaBoostClassifier` | Boosts weak learners sequentially to improve performance. |
| `BaggingClassifier` | Trains multiple models on random subsets of data to reduce variance. |
| `ExtraTreesClassifier` | Similar to Random Forest but uses more randomness in tree splits. |
| `GradientBoostingClassifier` | Builds models sequentially to correct previous errors. |
| `XGBClassifier` | Optimized gradient boosting library for high performance. |

---

###  Model Evaluation & Splitting

| Library | Function |
|--------|----------|
| `train_test_split` | Splits data into training and testing sets. |
| `sklearn.metrics` | Provides tools to evaluate model performance (accuracy, precision, recall, etc.). |


Importing Basis Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Punctuations
import string
# Pandas
import pandas as pd
# Remove Stopwords
from nltk.corpus import stopwords
# Regular Expressions
import re
# Import PorterStemmer from NLTK Library
from nltk.stem.porter import PorterStemmer
# Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB , MultinomialNB , BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
# Metrix and Train Test
from sklearn.metrics import *
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

Loading Data

In [2]:
# Train
df = pd.read_csv("emails.csv")

In [3]:
df

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
...,...,...
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0


In [4]:
# Null Values
df.isnull().sum()

text    0
spam    0
dtype: int64

Our Dataset Contains No Null Values we Simply Move forward on the Text Preprocessing.

In [5]:
# Information check
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5728 entries, 0 to 5727
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5728 non-null   object
 1   spam    5728 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 89.6+ KB


In [6]:
# Duplicates
df.duplicated().sum()

np.int64(33)

In [7]:
# Duplicates
print(f"Duplicates values in Dataset is : {df.duplicated().sum()}")

Duplicates values in Dataset is : 33


In [8]:
# Drop Duplicates
df.drop_duplicates(inplace=True)

In [9]:
# Null Values Columns
df.isnull().sum()

text    0
spam    0
dtype: int64

In [10]:
# Lets Check Some Text
print(df['text'][0])
print(df['text'][1])
print(df['text'][2])

Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  mar

## Text Preprocessing
Basis Text Preprocessing Like Cleaning Text , Tokenization

In [11]:
# 1. LowerCase
df['text'] = df['text'].str.lower()

# Head
df['text'].head()


0    subject: naturally irresistible your corporate...
1    subject: the stock trading gunslinger  fanny i...
2    subject: unbelievable new homes made easy  im ...
3    subject: 4 color printing special  request add...
4    subject: do not have money , get software cds ...
Name: text, dtype: object

In [12]:

# Remove all punctuation (like !, ?, @, $, etc.)
df['text'] = df['text'].apply(lambda x: re.sub(f"[{re.escape(string.punctuation)}]", "", x))

# If you want to remove other special symbols too, such as numbers, use:
# df['text'] = df['text'].apply(lambda x: re.sub("[^A-Za-z\s]", "", x))

# Optional: convert to lowercase
df['text'] = df['text'].str.lower()

# Optional: remove extra spaces
df['text'] = df['text'].apply(lambda x: re.sub("\s+", " ", x).strip())


  df['text'] = df['text'].apply(lambda x: re.sub("\s+", " ", x).strip())


In [13]:
# 3. Remove @ From Train and Test Text
df['text'] = df['text'].str.replace('@','')

In [14]:
# 4. Remove URLs from Test and Train Text
df['text'] = df['text'].str.replace(r'^https?:\/\/.*[\r\n]*','')

In [15]:
# 5. Remove $ From Train and Test Text
df['text'] = df['text'].str.replace('$','')

In [16]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
# 6. Intilize Stopwords
stop_words = stopwords.words('english')

# Apply Stopwords
df['text'] = df['text'].apply(lambda x : ' '.join([word for word in x.split()if word not in (stop_words)]))

# Head
df.head()


Unnamed: 0,text,spam
0,subject naturally irresistible corporate ident...,1
1,subject stock trading gunslinger fanny merrill...,1
2,subject unbelievable new homes made easy im wa...,1
3,subject 4 color printing special request addit...,1
4,subject money get software cds software compat...,1


In [18]:
# 7. Handling ChatWords: abbravations
chat_words = {
    "u": "you",
    "ur": "your",
    "r": "are",
    "btw": "by the way",
    "idk": "I don't know",
    "omg": "oh my god",
    "lol": "laugh out loud",
    "pls": "please",
    "thx": "thanks",
    "gr8": "great",
    "b4": "before"

}


# Function
def chat_conversion(Text):
    new_text = []
    for i in Text.split():
        if i.upper() in chat_words:
            new_text.append(chat_words[i.upper()])
        else:
            new_text.append(i)
    return " ".join(new_text)

# Calling Function
df['text'] = df['text'].apply(chat_conversion)


# Head
df.head()

Unnamed: 0,text,spam
0,subject naturally irresistible corporate ident...,1
1,subject stock trading gunslinger fanny merrill...,1
2,subject unbelievable new homes made easy im wa...,1
3,subject 4 color printing special request addit...,1
4,subject money get software cds software compat...,1


In [19]:
import nltk
nltk.download('punkt')


nltk.download('punkt_tab')


# 8. Tokenization
from nltk.tokenize import sent_tokenize

# Apply sent_tokenize
df['text_sent_token'] = df['text'].apply(sent_tokenize)

# Head
df.head()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Unnamed: 0,text,spam,text_sent_token
0,subject naturally irresistible corporate ident...,1,[subject naturally irresistible corporate iden...
1,subject stock trading gunslinger fanny merrill...,1,[subject stock trading gunslinger fanny merril...
2,subject unbelievable new homes made easy im wa...,1,[subject unbelievable new homes made easy im w...
3,subject 4 color printing special request addit...,1,[subject 4 color printing special request addi...
4,subject money get software cds software compat...,1,[subject money get software cds software compa...


In [20]:
# Download NLTK resources
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [21]:
# 9. Stemming

# Intilize Stemmer
stemmer = PorterStemmer()

# This Function Will Stem Words
def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

# Calling
df['stem_msg'] = df['text'].apply(stem_words)

# Head
df.head()

Unnamed: 0,text,spam,text_sent_token,stem_msg
0,subject naturally irresistible corporate ident...,1,[subject naturally irresistible corporate iden...,subject natur irresist corpor ident lt realli ...
1,subject stock trading gunslinger fanny merrill...,1,[subject stock trading gunslinger fanny merril...,subject stock trade gunsling fanni merril muzo...
2,subject unbelievable new homes made easy im wa...,1,[subject unbelievable new homes made easy im w...,subject unbeliev new home made easi im want sh...
3,subject 4 color printing special request addit...,1,[subject 4 color printing special request addi...,subject 4 color print special request addit in...
4,subject money get software cds software compat...,1,[subject money get software cds software compa...,subject money get softwar cd softwar compat gr...


# Model Building
Text Representation / Converting text Into Numbers

We Do Text Vectorization. Text Vectorization is the Process of Converting Text into Numbers



Initialization:


This line initializes a CountVectorizer object named cv with default parameters. CountVectorizer is a class provided by scikit-learn for converting a collection of text documents into a matrix of token counts.

In [22]:
cv = CountVectorizer()

Fitting CountVectorizer on Text Data:


cv.fit_transform(df['stem_msg']): This method fits the CountVectorizer to the text data in the 'stem_msg' column of the DataFrame df and transforms the text data into a sparse matrix representation. The fit_transform() method both learns the vocabulary from the text data and transforms the text data into a document-term matrix.
.toarray(): This method converts the sparse matrix representation obtained from fit_transform() into a dense numpy array. This array, denoted by X, contains the document-term matrix where each row represents a document (message) and each column represents a unique word in the vocabulary.


In [23]:
X = cv.fit_transform(df['stem_msg']).toarray()

In [24]:
# Shape Of X
X.shape

(5695, 29254)

# Encoding 'y'

In [25]:
# y
y = df['spam']

In [26]:
# Shape of Y
y.shape

(5695,)

In [27]:
y

0       1
1       1
2       1
3       1
4       1
       ..
5723    0
5724    0
5725    0
5726    0
5727    0
Name: spam, Length: 5695, dtype: int64

# Train Test


In [28]:
#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model Fitting

#### 1. Support Vector Machine (SVM):

In [29]:
svc = SVC(kernel='sigmoid', gamma=1.0)

##### - This initializes an SVM classifier with a sigmoid kernel and a gamma value of 1.0.

#### 2. K-Nearest Neighbors (KNN):*italicized text*

In [30]:
knc = KNeighborsClassifier()

##### - This initializes a KNN classifier with default parameters.

#### 3. Multinomial Naive Bayes:

In [31]:
mnb = MultinomialNB()

##### - This initializes a Multinomial Naive Bayes classifier, which is commonly used for text classification tasks.


#### 4. Decision Tree:

In [32]:
dtc = DecisionTreeClassifier(max_depth=5)

##### - This initializes a decision tree classifier with a maximum depth of 5.

#### 5. Logistic Regression:

In [33]:
lrc = LogisticRegression(solver='liblinear', penalty='l1')

##### This initializes a logistic regression classifier with L1 regularization using the liblinear solver.


#### 6. Random Forest Classifier:

In [34]:
rfc = RandomForestClassifier(n_estimators=50, random_state=2)

##### This initializes a random forest classifier with 50 decision trees and a random state of 2 for reproducibility.

#### 7. AdaBoost Classifier:

In [35]:
abc = AdaBoostClassifier(n_estimators=50, random_state=2)

##### - This initializes an AdaBoost classifier with 50 decision trees as weak learners and a random state of 2.

#### 8. Extra Trees Classifier:

In [36]:
etc = ExtraTreesClassifier(n_estimators=50, random_state=2)

##### - This initializes an Extra Trees classifier with 50 trees in the forest and a random state of 2.


#### 9. XGBoost Classifier:

In [37]:
xgb = XGBClassifier(n_estimators=50, random_state=2)

##### - This initializes an XGBoost classifier with 50 boosting rounds and a random state of 2.

#### NOTE:
 Each model is now ready to be trained and evaluated on the dataset for the classification task. Adjust the hyperparameters as needed based on your specific task requirements and dataset characteristics.



In [38]:
# Initlize Models
# Support Vector MAchine
svc = SVC(kernel='sigmoid', gamma=1.0)
# KNeighbours
knc = KNeighborsClassifier()
# Multinomial NaiveBayes
mnb = MultinomialNB()
# Decision Tree
dtc = DecisionTreeClassifier(max_depth=5)
# Logistic Regression
lrc = LogisticRegression(solver='liblinear', penalty='l1')
# Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=50, random_state=2)
# AddaBoost Classifier
abc = AdaBoostClassifier(n_estimators=50, random_state=2)
# Extra Tree Classifier a Ensemble Method
etc = ExtraTreesClassifier(n_estimators=50, random_state=2)
# XGB Classifier
xgb = XGBClassifier(n_estimators=50,random_state=2)


##### The provided code fits each initialized model to the training data (X_train and y_train) and makes predictions on the test data (X_test). Here's a breakdown of the fitting and prediction process for each model:

#### 1.Support Vector Machine (SVC):

In [39]:
svc.fit(X_train, y_train)
svc_pred = svc.predict(X_test)

#### 2.K-Nearest Neighbors (KNN):

In [40]:
knc.fit(X_train, y_train)
knn_pred = knc.predict(X_test)

#### 3. Multinomial Naive Bayes:



In [41]:
mnb.fit(X_train, y_train)
mnb_pred = mnb.predict(X_test)

#### 4.Decision Tree:

In [42]:
dtc.fit(X_train, y_train)
dtc_pred = dtc.predict(X_test)

#### 5. Logistic Regression:

In [43]:
lrc.fit(X_train, y_train)
lrc_pred = lrc.predict(X_test)

#### 6.Random Forest Classifier:

In [44]:
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)

#### 7. AdaBoost Classifier:

In [45]:
abc.fit(X_train, y_train)
abc_pred = abc.predict(X_test)

#### 8. Extra Trees Classifiers:

In [46]:
etc.fit(X_train, y_train)
etc_pred = etc.predict(X_test)

#### 9. XGBoost Classifier:

In [47]:
xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)

## NOTE:
 For each model, the fit() method is used to train the model on the training data, and then the predict() method is used to make predictions on the test data. The predictions are stored in separate variables (svc_pred, knn_pred, etc.) for each model. These predictions can then be evaluated using appropriate evaluation metrics to assess the performance of each model on the test data.

In [48]:
# Fitting Each Model One by One
# 1. SVC
svc.fit(X_train ,y_train)
# Pred
svc_pred = svc.predict(X_test)
#-----------------------------
# 2. KNeighbours
knc.fit(X_train ,y_train)
# Pred
knn_pred = knc.predict(X_test)
#-----------------------------
# 3. Multinomial NaiveBayes
mnb.fit(X_train ,y_train)
# Pred
mnb_pred = mnb.predict(X_test)
#-----------------------------
# 4. Decision Tree
dtc.fit(X_train ,y_train)
# Pred
dtc_pred = dtc.predict(X_test)
#-----------------------------
# 5. Logistic Regression
lrc.fit(X_train ,y_train)
# Pred
lrc_pred = lrc.predict(X_test)
#-----------------------------
# 6. Random Forest Classifier
rfc.fit(X_train ,y_train)
# Pred
rfc_pred = rfc.predict(X_test)
#-----------------------------
# 7. AddaBoost Classifier
abc.fit(X_train ,y_train)
# Pred
abc_pred = abc.predict(X_test)
#-----------------------------
# 8.Extra Tree Classifier a Ensemble Method
etc.fit(X_train ,y_train)
# Pred
etc_pred = etc.predict(X_test)
#-----------------------------
# 9. XGB Classifier
xgb.fit(X_train ,y_train)
# Pred
xgb_pred = xgb.predict(X_test)
#-----------------------------

# Evaluation

#### 1.Define the evaluate Function:

In [49]:
# def evaluate(y_test, y_pred):

#### - This line defines a function named evaluate that takes two arguments: y_test (true labels) and y_pred (predicted labels).


#### 2.Calculate Accuracy:

In [50]:
# accuracy = accuracy_score(y_test, y_pred)

##### - This line calculates the accuracy score by comparing the true labels (y_test) with the predicted labels (y_pred) using the accuracy_score function from scikit-learn.


#### 3. Calculate Precision:

In [51]:
# precision = precision_score(y_test, y_pred)

##### - This line calculates the precision score by comparing the true labels (y_test) with the predicted labels (y_pred) using the precision_score function from scikit-learn.

#### 4. Calculate Confusion Matrix:

In [52]:
# confusion = confusion_matrix(y_test, y_pred)

##### - This line calculates the confusion matrix by comparing the true labels (y_test) with the predicted labels (y_pred) using the confusion_matrix function from scikit-learn.


#### 5. Return Evaluation Metrics:

In [53]:
# return accuracy, precision, confusion

##### - This line returns the calculated accuracy, precision, and confusion matrix as a tuple.

##### - This evaluate function can be used to assess the performance of a classification model by providing it with the true labels (y_test) and the predicted labels (y_pred). It will then return the accuracy, precision, and confusion matrix for the model's predictions.

In [54]:
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix

def evaluate(y_test, y_pred):
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    confusion = confusion_matrix(y_test, y_pred)

    return accuracy, precision, confusion

#### 1. Support Vector Machine (SVC):

In [55]:
accuracy_SVC, precision_SVC, confusion_SVC = evaluate(y_test, svc_pred)
print(f"The Accuracy Score Of SVC is {accuracy_SVC}, Precision Is {precision_SVC},\nConfusion Matrix is \n{confusion_SVC} ")

The Accuracy Score Of SVC is 0.8244073748902546, Precision Is 0.6363636363636364,
Confusion Matrix is 
[[778  92]
 [108 161]] 


##### - This code calculates and prints the accuracy, precision, and confusion matrix for the SVC model based on its predictions (svc_pred) on the test data.


#### 2.K-Nearest Neighbors (KNN):

In [56]:
accuracy_KNN, precision_KNN, confusion_KNN = evaluate(y_test, knn_pred)
print(f"The Accuracy Score Of KNN is {accuracy_KNN}, Precision Is {precision_KNN},\nConfusion Matrix is \n{confusion_KNN} ")

The Accuracy Score Of KNN is 0.8999122036874452, Precision Is 0.9144385026737968,
Confusion Matrix is 
[[854  16]
 [ 98 171]] 


##### - Similar to SVC, this code evaluates and prints the performance metrics for the KNN model.


#### 3. Multinomial Naive Bayes:

In [57]:
accuracy_MNB, precision_MNB, confusion_MNB = evaluate(y_test, mnb_pred)
print(f"The Accuracy Score Of MultinomialNB is {accuracy_MNB}, Precision Is {precision_MNB},\nConfusion Matrix is \n{confusion_MNB} ")

The Accuracy Score Of MultinomialNB is 0.9894644424934153, Precision Is 0.9638989169675091,
Confusion Matrix is 
[[860  10]
 [  2 267]] 


###### - This code evaluates and prints metrics for the Multinomial Naive Bayes model.

#### 4. Decision Tree:

In [58]:
accuracy_DTC, precision_DTC, confusion_DTC = evaluate(y_test, dtc_pred)
print(f"The Accuracy Score Of Decision Tree is {accuracy_DTC}, Precision Is {precision_DTC},\nConfusion Matrix is \n{confusion_DTC} ")

The Accuracy Score Of Decision Tree is 0.9157155399473222, Precision Is 0.7694704049844237,
Confusion Matrix is 
[[796  74]
 [ 22 247]] 


##### - Evaluates and prints metrics for the Decision Tree model.

#### 5. Logistic Regression:

In [59]:
accuracy_LR, precision_LR, confusion_LR = evaluate(y_test, lrc_pred)
print(f"The Accuracy Score Of Logistic Regression is {accuracy_LR}, Precision Is {precision_LR},\nConfusion Matrix is \n{confusion_LR} ")

The Accuracy Score Of Logistic Regression is 0.9912203687445127, Precision Is 0.9850187265917603,
Confusion Matrix is 
[[866   4]
 [  6 263]] 


##### - Evaluates and prints metrics for Logistic Regression.

#### 6. Evaluates and prints metrics for Logistic Regression.

In [60]:
accuracy_RF, precision_RF, confusion_RF = evaluate(y_test, rfc_pred)
print(f"The Accuracy Score Of Random Forest Classifier is {accuracy_RF}, Precision Is {precision_RF},\nConfusion Matrix is \n{confusion_RF} ")

The Accuracy Score Of Random Forest Classifier is 0.9701492537313433, Precision Is 0.9957805907172996,
Confusion Matrix is 
[[869   1]
 [ 33 236]] 



##### - Evaluates and prints metrics for the Random Forest Classifier.



#### 7. AdaBoost Classifier:

In [61]:
accuracy_ADC, precision_ADC, confusion_ADC = evaluate(y_test, abc_pred)
print(f"The Accuracy Score Of AdaBoost Classifier is {accuracy_ADC}, Precision Is {precision_ADC},\nConfusion Matrix is \n{confusion_ADC} ")

The Accuracy Score Of AdaBoost Classifier is 0.9552238805970149, Precision Is 0.8732876712328768,
Confusion Matrix is 
[[833  37]
 [ 14 255]] 


##### - Evaluates and prints metrics for the AdaBoost Classifier


#### 8. Extra Trees Classifier:

In [62]:
accuracy_ETC, precision_ETC, confusion_ETC = evaluate(y_test, etc_pred)
print(f"The Accuracy Score Of Extra Tree Classifier is {accuracy_ETC}, Precision Is {precision_ETC},\nConfusion Matrix is \n{confusion_ETC} ")

The Accuracy Score Of Extra Tree Classifier is 0.974539069359087, Precision Is 0.9958677685950413,
Confusion Matrix is 
[[869   1]
 [ 28 241]] 


##### - Evaluates and prints metrics for the Extra Trees Classifier.

#### 9. XGBoost Classifier:

In [63]:
accuracy_XGB, precision_XGB, confusion_XGB = evaluate(y_test, xgb_pred)
print(f"The Accuracy Score Of XGBoost Classifier is {accuracy_XGB}, Precision Is {precision_XGB},\nConfusion Matrix is \n{confusion_XGB} ")

The Accuracy Score Of XGBoost Classifier is 0.9859525899912204, Precision Is 0.9846743295019157,
Confusion Matrix is 
[[866   4]
 [ 12 257]] 


##### - Evaluates and prints metrics for the XGBoost Classifier.

## Note:
This allows you to observe the performance of each model on the test data and compare their accuracy, precision, and confusion matrices. Adjustments to model hyperparameters or features can be made based on the observed results.

In [64]:
# Lets Evaluate Results One by One For Each
# 1. SVC
accuracy_SVC , precision_SVC , confusion_SVC = evaluate(y_test,svc_pred)
print(f"The Accuracy Score Of SVC is {accuracy_SVC} , Precision Is {precision_SVC} ,\nConfusion Matrix is \n{confusion_SVC} ")

print("\n")

# 2. KNN
accuracy_KNN , precision_KNN , confusion_KNN = evaluate(y_test,knn_pred)
print(f"The Accuracy Score Of KNN is {accuracy_KNN} , Precision Is {precision_KNN} ,\nConfusion Matrix is \n{confusion_KNN} ")

print("\n")

# 3.Multinomial
accuracy_MNB , precision_MNB , confusion_MNB = evaluate(y_test,mnb_pred)
print(f"The Accuracy Score Of MultinomialNB is {accuracy_MNB} , Precision Is {precision_MNB} ,\nConfusion Matrix is \n{confusion_MNB} ")

print("\n")

# 4.Decision Tree
accuracy_DTC , precision_DTC , confusion_DTC = evaluate(y_test,dtc_pred)
print(f"The Accuracy Score Of Decision Tree is {accuracy_DTC} , Precision Is {precision_DTC} ,\nConfusion Matrix is \n{confusion_DTC} ")

print("\n")

# 5.Logistic Regression
accuracy_LR , precision_LR , confusion_LR = evaluate(y_test,lrc_pred)
print(f"The Accuracy Score Of Logistic Regression is {accuracy_LR} , Precision Is {precision_LR} ,\nConfusion Matrix is \n{confusion_LR} ")

print("\n")

# 6.Random Forest Classifier
accuracy_RF , precision_RF , confusion_RF = evaluate(y_test,rfc_pred)
print(f"The Accuracy Score Of Random Forest Classifier is {accuracy_RF} , Precision Is {precision_RF} ,\nConfusion Matrix is \n{confusion_RF} ")

print("\n")

# 7.AddaBoost Classifier
accuracy_ADC , precision_ADC , confusion_ADC = evaluate(y_test,abc_pred)
print(f"The Accuracy Score Of AddaBoost Classifier is {accuracy_ADC} , Precision Is {precision_ADC} ,\nConfusion Matrix is \n{confusion_ADC} ")

print("\n")

# 8.Extra Tree Classifier a Ensemble Method
accuracy_ETC , precision_ETC , confusion_ETC = evaluate(y_test,etc_pred)
print(f"The Accuracy Score Of Extra Tree Classifier  is {accuracy_ETC} , Precision Is {precision_ETC} ,\nConfusion Matrix is \n{confusion_ETC} ")

print("\n")

# 9. XGB Classifier
accuracy_XGB , precision_XGB , confusion_XGB = evaluate(y_test,xgb_pred)
print(f"The Accuracy Score Of XGB Classifier is {accuracy_XGB} , Precision Is {precision_XGB} ,\nConfusion Matrix is \n{confusion_XGB} ")

The Accuracy Score Of SVC is 0.8244073748902546 , Precision Is 0.6363636363636364 ,
Confusion Matrix is 
[[778  92]
 [108 161]] 


The Accuracy Score Of KNN is 0.8999122036874452 , Precision Is 0.9144385026737968 ,
Confusion Matrix is 
[[854  16]
 [ 98 171]] 


The Accuracy Score Of MultinomialNB is 0.9894644424934153 , Precision Is 0.9638989169675091 ,
Confusion Matrix is 
[[860  10]
 [  2 267]] 


The Accuracy Score Of Decision Tree is 0.9157155399473222 , Precision Is 0.7694704049844237 ,
Confusion Matrix is 
[[796  74]
 [ 22 247]] 


The Accuracy Score Of Logistic Regression is 0.9912203687445127 , Precision Is 0.9850187265917603 ,
Confusion Matrix is 
[[866   4]
 [  6 263]] 


The Accuracy Score Of Random Forest Classifier is 0.9701492537313433 , Precision Is 0.9957805907172996 ,
Confusion Matrix is 
[[869   1]
 [ 33 236]] 


The Accuracy Score Of AddaBoost Classifier is 0.9552238805970149 , Precision Is 0.8732876712328768 ,
Confusion Matrix is 
[[833  37]
 [ 14 255]] 


The Accur

#       DataFrame For Storing Results

Creates a DataFrame named evaluation_df containing evaluation results (accuracy and precision) for each model. The DataFrame is sorted based on the Accuracy and Precision columns in descending order. Here's what each part of the code does:

1. Create a Dictionary with Evaluation Results:

In [65]:
evaluation_data = {
    'Model': ['SVC', 'KNN', 'MultinomialNB', 'Decision Tree', 'Logistic Regression', 'Random Forest', 'AdaBoost', 'Extra Tree', 'XGBoost'],
    'Accuracy': [accuracy_SVC, accuracy_KNN, accuracy_MNB, accuracy_DTC, accuracy_LR, accuracy_RF, accuracy_ADC, accuracy_ETC, accuracy_XGB],
    'Precision': [precision_SVC, precision_KNN, precision_MNB, precision_DTC, precision_LR, precision_RF, precision_ADC, precision_ETC, precision_XGB]
}

This dictionary contains model names ('Model'), their corresponding accuracy scores ('Accuracy'), and precision scores ('Precision').

2. Create a DataFrame:

In [66]:
evaluation_df = pd.DataFrame(evaluation_data)

This line converts the dictionary evaluation_data into a DataFrame named evaluation_df.


3. Sort the DataFrame:

In [67]:
evaluation_df = evaluation_df.sort_values(by=['Accuracy', 'Precision'], ascending=False)

This line sorts the DataFrame evaluation_df based on the 'Accuracy' and 'Precision' columns in descending order. This will arrange the models with the highest accuracy and precision at the top of the DataFrame.


4. Display the Sorted DataFrame:

In [68]:
evaluation_df

Unnamed: 0,Model,Accuracy,Precision
4,Logistic Regression,0.99122,0.985019
2,MultinomialNB,0.989464,0.963899
8,XGBoost,0.985953,0.984674
7,Extra Tree,0.974539,0.995868
5,Random Forest,0.970149,0.995781
6,AdaBoost,0.955224,0.873288
3,Decision Tree,0.915716,0.76947
1,KNN,0.899912,0.914439
0,SVC,0.824407,0.636364


- This line displays the sorted DataFrame evaluation_df, showing the model names along with their corresponding accuracy and precision scores, sorted in descending order of accuracy and precision.

Overall, this code provides a concise and organized summary of the evaluation results for each model, making it easier to compare their performance based on accuracy and precision.

In [69]:
# Create a dictionary with evaluation results
evaluation_data = {
    'Model': ['SVC', 'KNN', 'MultinomialNB', 'Decision Tree', 'Logistic Regression', 'Random Forest', 'AdaBoost', 'Extra Tree', 'XGBoost'],
    'Accuracy': [accuracy_SVC, accuracy_KNN, accuracy_MNB, accuracy_DTC, accuracy_LR, accuracy_RF, accuracy_ADC, accuracy_ETC, accuracy_XGB],
    'Precision': [precision_SVC, precision_KNN, precision_MNB, precision_DTC, precision_LR, precision_RF, precision_ADC, precision_ETC, precision_XGB]
}

# Create a dataframe
evaluation_df = pd.DataFrame(evaluation_data)

# Sort the dataframe based on Accuracy and Precision columns in descending order
evaluation_df = evaluation_df.sort_values(by=['Accuracy', 'Precision'], ascending=False)

# Display the sorted dataframe
evaluation_df


Unnamed: 0,Model,Accuracy,Precision
4,Logistic Regression,0.99122,0.985019
2,MultinomialNB,0.989464,0.963899
8,XGBoost,0.985953,0.984674
7,Extra Tree,0.974539,0.995868
5,Random Forest,0.970149,0.995781
6,AdaBoost,0.955224,0.873288
3,Decision Tree,0.915716,0.76947
1,KNN,0.899912,0.914439
0,SVC,0.824407,0.636364


In [70]:
print(f"As we can see Out of All The Models , XGB is Performing State Of the Art With The Accuracy Of {accuracy_XGB} and Precision of {precision_XGB}")

As we can see Out of All The Models , XGB is Performing State Of the Art With The Accuracy Of 0.9859525899912204 and Precision of 0.9846743295019157


As we can see Out of All The Models , XGB is Performing State Of the Art With The Accuracy Of 0.9757751937984496 and Precision of 1.0

#               Visualizing Results

In [87]:
!pip install portly


ERROR: Could not find a version that satisfies the requirement portly (from versions: none)
ERROR: No matching distribution found for portly


In [88]:
import plotly.graph_objects as go

# Define the models and their accuracies and precisions
models = ['SVC', 'KNN', 'MultinomialNB', 'Decision Tree', 'Logistic Regression', 'Random Forest', 'AdaBoost', 'Extra Tree', 'XGBoost']
accuracies = [accuracy_SVC, accuracy_KNN, accuracy_MNB, accuracy_DTC, accuracy_LR, accuracy_RF, accuracy_ADC, accuracy_ETC, accuracy_XGB]
precisions = [precision_SVC, precision_KNN, precision_MNB, precision_DTC, precision_LR, precision_RF, precision_ADC, precision_ETC, precision_XGB]

# Create the figure
fig = go.Figure()

# Add bar traces for accuracy and precision
fig.add_trace(go.Bar(
    x=models,
    y=accuracies,
    name='Accuracy',
    marker_color='skyblue'
))
fig.add_trace(go.Bar(
    x=models,
    y=precisions,
    name='Precision',
    marker_color='salmon'
))

# Update layout
fig.update_layout(
    title='Accuracy and Precision of Different Models',
    xaxis=dict(title='Models'),
    yaxis=dict(title='Score'),
    barmode='group'  # Group bars for each model
)

# Show the plot
fig.show()


ModuleNotFoundError: No module named 'plotly'

# Predictions

In [72]:
df.head()

Unnamed: 0,text,spam,text_sent_token,stem_msg
0,subject naturally irresistible corporate ident...,1,[subject naturally irresistible corporate iden...,subject natur irresist corpor ident lt realli ...
1,subject stock trading gunslinger fanny merrill...,1,[subject stock trading gunslinger fanny merril...,subject stock trade gunsling fanni merril muzo...
2,subject unbelievable new homes made easy im wa...,1,[subject unbelievable new homes made easy im w...,subject unbeliev new home made easi im want sh...
3,subject 4 color printing special request addit...,1,[subject 4 color printing special request addi...,subject 4 color print special request addit in...
4,subject money get software cds software compat...,1,[subject money get software cds software compa...,subject money get softwar cd softwar compat gr...


In [73]:
df['stem_msg'][2]

'subject unbeliev new home made easi im want show homeown pre approv 454 169 home loan 3 72 fix rate offer extend uncondit credit way factor take advantag limit time opportun ask visit websit complet 1 minut post approv form look foward hear dorca pittman'

In [74]:
# Predict Messgae
text = [
    "Hi there! How are you doing?",
    "Just wanted to check in and see how your day is going.",
    "Hope everything is going well with you.",
    "Did you have a chance to review the document I sent earlier?",
    "Looking forward to hearing from you soon.",
    "I'll be in the office until 5 PM today.",
    "Let me know if you need anything else from me.",
    "Thanks for your help with the project!",
    "Don't forget about the meeting tomorrow morning at 9 AM.",
    "Hope you had a great weekend!",
    "Here are the notes from our last discussion for your reference.",
    "Just a reminder to submit your report by the end of the day.",
    "Congratulations on your recent promotion!",
    "Let's catch up over coffee sometime this week.",
    "Wishing you a fantastic day ahead!",
    "Thanks for reaching out. I'll get back to you as soon as possible.",
    "Could you please provide an update on the status of the project?",
    "Hope you're enjoying the nice weather today.",
    "Looking forward to our team lunch tomorrow.",
    "I've attached the file you requested to this email.",
    "Let me know if you have any questions about the upcoming presentation.",
    "Hope you had a relaxing weekend!",
    "Just wanted to remind you about the deadline for the proposal submission.",
    "Thanks for your attention to this matter.",
    "I appreciate your help with this task.",
    "Let's discuss the details of the project during our meeting tomorrow.",
    "Hope your day is going smoothly.",
    "I'll be out of the office for the rest of the day. Please reach out if you need anything.",
    "Could you please review and provide feedback on the draft document?",
    "Looking forward to seeing you at the conference next week.",
    "Thanks for your quick response!",
    "Let's plan to meet next Tuesday to finalize the budget.",
    "Hope you had a productive day!",
    "Just a friendly reminder about our weekly team meeting tomorrow.",
    "Thanks for your cooperation on this project.",
    "Hope you're having a great start to the week.",
    "Please let me know if you need any further clarification on the instructions.",
    "Looking forward to the training session later today.",
    "Thanks for your patience while we work through this issue.",
    "Hope you're feeling better soon!",
    "Just wanted to touch base regarding the progress of the project.",
    "Let me know if you need any assistance with the presentation slides.",
    "Hope you had a wonderful holiday season!",
    "Thanks for your understanding and cooperation.",
    "Please review the attached document and let me know your thoughts.",
    "Looking forward to the upcoming team building event.",
    "Hope your day is going well so far.",
    "Just wanted to say thank you for your hard work on this project.",
    "I'll follow up with you next week to discuss the action items.",
    "Thanks for your input during the brainstorming session.",
    "Hope you're having a relaxing evening!",
    "Please find the requested information attached to this email.",
    "Looking forward to working with you on this project.",
    "naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  marketing break - through shouldn ' t make gaps in your budget . 100 % satisfaction  guaranteed : we provide unlimited amount of changes with no extra fees for you to  be surethat you will love the result of this collaboration . have a look at our  portfolio _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ not interested . . . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _"
]

# Preprocess Text
preprocessed_text = []
for txt in text:
    # Lower case
    txt = txt.lower()
    # Remove punctuation
    txt = txt.translate(str.maketrans('', '', string.punctuation))
    # Remove stopwords
    txt = ' '.join([word for word in txt.split() if word not in stopwords.words('english')])
    preprocessed_text.append(txt)

# Vectorize the preprocessed messages using the same vectorizer used during training
X_input = cv.transform(preprocessed_text).toarray()

# Predict the class label for the vectorized text messages
predicted_labels = xgb.predict(X_input)

# Print the predicted label for each message
for idx, txt in enumerate(preprocessed_text):
    if predicted_labels[idx] == 0:
        print(f"Text: '{txt}' - Predicted Label: 0")
    else:
        print(f"Text: '{txt}' - Predicted Label: 1")

Text: 'hi' - Predicted Label: 0
Text: 'wanted check see day going' - Predicted Label: 1
Text: 'hope everything going well' - Predicted Label: 1
Text: 'chance review document sent earlier' - Predicted Label: 0
Text: 'looking forward hearing soon' - Predicted Label: 0
Text: 'ill office 5 pm today' - Predicted Label: 1
Text: 'let know need anything else' - Predicted Label: 0
Text: 'thanks help project' - Predicted Label: 0
Text: 'dont forget meeting tomorrow morning 9' - Predicted Label: 1
Text: 'hope great weekend' - Predicted Label: 1
Text: 'notes last discussion reference' - Predicted Label: 1
Text: 'reminder submit report end day' - Predicted Label: 1
Text: 'congratulations recent promotion' - Predicted Label: 1
Text: 'lets catch coffee sometime week' - Predicted Label: 0
Text: 'wishing fantastic day ahead' - Predicted Label: 1
Text: 'thanks reaching ill get back soon possible' - Predicted Label: 1
Text: 'could please provide update status project' - Predicted Label: 0
Text: 'hope you

# Use a Function for Preprocessing

Encapsulating preprocessing logic so it's reusable:

In [75]:
from nltk.corpus import stopwords
import string

def preprocess_text(text_list):
    stop_words = set(stopwords.words('english'))
    processed = []
    for txt in text_list:
        txt = txt.lower()
        txt = txt.translate(str.maketrans('', '', string.punctuation))
        txt = ' '.join([word for word in txt.split() if word not in stop_words])
        processed.append(txt)
    return processed


# Saving xgb model

Handling Vectorizer and Model Loading

In [76]:
import joblib

# Assuming you used CountVectorizer or TfidfVectorizer
joblib.dump(cv, 'vectorizer.pkl')


['vectorizer.pkl']

In [77]:
import joblib

# Save model
joblib.dump(xgb, "spam_classifier.pkl")



['spam_classifier.pkl']

In [78]:
import joblib

cv = joblib.load('vectorizer.pkl')
xgb = joblib.load('spam_classifier.pkl')


In [81]:
nltk.__version__

'3.9.1'

In [84]:
from xgboost import XGBClassifier
import xgboost
print(xgboost.__version__)


3.0.1
