# Treue Technologies (Internship - Data Science) [Ansh Pandey]

##  Task 4 : Email Spam Detection (ML Model)

### Objective: Create an email spam detection system that uses machine learning to classify incoming emails. By training a model on a labeled dataset of "SPAM" and "NON-SPAM" emails, we aim to develop an accurate and efficient spam detector that reliably categorizes emails based on their content.


#### i) Importing Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
import nltk
nltk.download('punkt')


[nltk_data] Error loading punkt: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

#### ii) Loading The Dataset (.csv file)

In [2]:
# Raw string so that backslashes in the Windows path are not treated as escape sequences
data = pd.read_csv(r'C:\Users\asus\Desktop\Data Science or Analytics_Ansh\Tasks Images\spam.csv', encoding='cp1252')

In [3]:
print(data)

        v1                                                 v2 Unnamed: 2  \
0      ham  Go until jurong point, crazy.. Available only ...        NaN   
1      ham                      Ok lar... Joking wif u oni...        NaN   
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3      ham  U dun say so early hor... U c already then say...        NaN   
4      ham  Nah I don't think he goes to usf, he lives aro...        NaN   
...    ...                                                ...        ...   
5567  spam  This is the 2nd time we have tried 2 contact u...        NaN   
5568   ham              Will Ì_ b going to esplanade fr home?        NaN   
5569   ham  Pity, * was in mood for that. So...any other s...        NaN   
5570   ham  The guy did some bitching but I acted like i'd...        NaN   
5571   ham                         Rofl. Its true to its name        NaN   

     Unnamed: 3 Unnamed: 4  
0           NaN        NaN  
1           NaN        NaN  


In [4]:
data.head()

|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |


In [5]:
print(data.columns)

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')


In [6]:
data.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [7]:
print(data.dtypes)

v1            object
v2            object
Unnamed: 2    object
Unnamed: 3    object
Unnamed: 4    object
dtype: object


In [8]:
import nltk
from nltk.corpus import stopwords

# Download the stopwords dataset if you haven't already (because I haven't)
nltk.download('stopwords')

# Text preprocessing
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    return text


[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


#### iii) Preprocessing The Text Data

In [9]:
# Text preprocessing
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Tokenize (word_tokenize keeps punctuation as separate tokens; it does not remove them)
    text = nltk.word_tokenize(text)
    # Remove stopwords
    text = [word for word in text if word not in stop_words]
    # Join tokens back into a single string
    text = ' '.join(text)
    return text

data['cleaned_text'] = data['v2'].apply(preprocess_text)
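
The `nltk.download` calls earlier in this notebook failed with a network error, so `stopwords` and `punkt` may not be available on every machine. The following is a minimal offline sketch of the same idea, not the preprocessing used for the results in this notebook. It assumes a small hand-picked stop-word set (`FALLBACK_STOP_WORDS`, a hypothetical name, far shorter than NLTK's English list) and a regular-expression tokenizer in place of `nltk.word_tokenize`. Unlike the cell above, it does drop punctuation.

```python
import re

# Hypothetical, deliberately tiny stop-word set; NLTK's English list is much longer.
FALLBACK_STOP_WORDS = {"a", "an", "the", "is", "in", "to", "and", "of", "for"}


def preprocess_text_offline(text):
    """Lowercase, keep alphanumeric tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in FALLBACK_STOP_WORDS)


print(preprocess_text_offline("Free entry in 2 a wkly comp to win FA Cup!"))
```

Note that this tokenizer splits contractions at the apostrophe ("don't" becomes "don" and "t"), which may or may not be acceptable depending on the data.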


#### iv) Splitting The Data 

In [10]:
X = data['cleaned_text']
Y = data['v1']


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

#### v) Features Extraction

In [11]:
tfidf_vectorizer = TfidfVectorizer(max_features = 5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
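
To make the weighting behind `TfidfVectorizer` more concrete, here is a small pure-Python sketch of the computation under its default settings as I understand them: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The function name `tfidf_rows` and the three toy documents are assumptions made for illustration; this is not scikit-learn's implementation.

```python
import math
import re
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    # Tokens of two or more word characters, lowercased.
    tokenised = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(t for doc in tokenised for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


vocab, rows = tfidf_rows(["free prize now", "meeting now", "free free offer"])
for term, weight in zip(vocab, rows[0]):
    print(f"{term:8s} {weight:.3f}")
```

In the first toy document, "prize" receives a higher weight than "free" because it appears in fewer documents. Fitting on `X_train` only and reusing the fitted vectorizer on `X_test`, as the cell above does, keeps test-set vocabulary statistics out of training.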

#### vi) Building And Training The Model

In [12]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_tfidf, Y_train)
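
For readers who want to see what a multinomial Naive Bayes classifier does internally, the following is a simplified sketch on raw word counts with Laplace smoothing (`alpha=1.0`, the same default value `MultinomialNB` uses). The function names and the four toy messages are hypothetical. The notebook itself feeds TF-IDF weights rather than integer counts, which scikit-learn's documentation describes as also working in practice.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit log priors and Laplace-smoothed log likelihoods per class."""
    vocab = sorted({t for d in docs for t in d.split()})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    model = {}
    for label, n in class_docs.items():
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        log_like = {
            t: math.log((token_counts[label][t] + alpha) / total) for t in vocab
        }
        model[label] = (math.log(n / len(docs)), log_like)
    return model


def predict(model, doc):
    """Pick the class with the highest log posterior; unseen tokens are skipped."""
    scores = {
        label: prior + sum(ll[t] for t in doc.split() if t in ll)
        for label, (prior, ll) in model.items()
    }
    return max(scores, key=scores.get)


toy_docs = ["win free prize", "free cash now", "meeting at noon", "see you at lunch"]
toy_labels = ["spam", "spam", "ham", "ham"]
nb = train_multinomial_nb(toy_docs, toy_labels)
print(predict(nb, "free prize"))
print(predict(nb, "lunch meeting"))
```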

#### vii) Evaluating The Model

In [13]:
Y_pred = model.predict(X_test_tfidf)

accuracy = accuracy_score(Y_test, Y_pred)
report = classification_report(Y_test, Y_pred)


print(f'Accuracy : {accuracy}')
print(report)

Accuracy : 0.9650224215246637
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       949
        spam       1.00      0.77      0.87       166

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.92      1115
weighted avg       0.97      0.97      0.96      1115
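
The report can be cross-checked by hand. The notebook does not print a confusion matrix, so the counts below are inferred from the report rather than taken from the notebook: 949 ham and 166 spam test messages, a spam precision that rounds to 1.00 (which implies zero false positives here), and an accuracy of 1076/1115 together imply that 39 spam messages were classified as ham. A short sketch under that assumption:

```python
# Counts inferred from the printed report, not produced by the notebook itself.
tp, fn = 127, 39   # spam predicted as spam / spam predicted as ham
fp, tn = 0, 949    # ham predicted as spam / ham predicted as ham

accuracy = (tp + tn) / (tp + tn + fp + fn)
spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)
ham_precision = tn / (tn + fn)

print(f"accuracy       {accuracy:.4f}")
print(f"spam precision {spam_precision:.2f}")
print(f"spam recall    {spam_recall:.2f}")
print(f"spam f1        {spam_f1:.2f}")
print(f"ham precision  {ham_precision:.2f}")
```

These values agree with the report after rounding. They also show that the high accuracy is driven largely by the ham majority: about 23% of the spam in the test split is missed.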



#### viii) Saving The Model

In [14]:
from joblib import dump

# Save the trained model. The fitted tfidf_vectorizer is not saved here and
# would also be needed to transform new text before calling model.predict.
model_filename = "email_spam_detection_model.joblib"
dump(model, model_filename)


['email_spam_detection_model.joblib']

#### Using CountVectorizer to train the ML Model

In [15]:
# Importing important Libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

In [16]:
# Preparing my Data
texts = [
    "This is a sample email.",
    "Get rich quick! Earn $$$ in just one week!",
    "Hi, how are you? Let's meet up this weekend.",
    "Buy now! Limited time offer!",
    "You've won a free iPhone. Claim your prize now!",
    "Important meeting tomorrow at 10 AM.",
    "Congratulations, you've won a lottery!",
    "Meet singles in your area! Chat now!",
    "Please find attached the report for review.",
    "Make $$$ from home with our work-at-home program",
    "Unbelievable deals on electronics! Limited time offer",
    "Click here to claim your prize!",
    "You've been selected for a job interview.",
    "Dear Learner, You have successfully created your account on our website.",
    "Dear user, Your one time password is 24567.",
    "Congratullations Ansh, you have been shortlishted.",
    "Get rich quick!",
    "Discover 50 Full-Time Remote Jobs Now!",
]
labels = [
    "not spam", "spam", "not spam", "spam", "spam", "not spam",
    "spam", "spam", "not spam", "not spam", "not spam", "spam",
    "spam", "not spam", "not spam", "not spam", "spam", "not spam",
]

In [17]:
print(len(texts))
print(len(labels))

18
18


In [18]:
# Splitting the Data
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=0)

In [19]:
# Create and Fit CountVectorizer
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)

In [20]:
# Training my Model
model = MultinomialNB()
model.fit(X_train_counts, y_train)

In [25]:
# Evaluating the Model
X_test_counts = vectorizer.transform(X_test)
accuracy = model.score(X_test_counts, y_test)
print(f"Accuracy: {accuracy}")

Accuracy: 0.5
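
An accuracy of 0.5 on this demo data should be read with caution, because the test split is tiny. With 18 samples and `test_size=0.2`, `train_test_split` rounds the test share up, so only 4 messages are held out and accuracy can take just five possible values. A quick check of that arithmetic:

```python
import math

n_samples, test_size = 18, 0.2
n_test = math.ceil(test_size * n_samples)   # the test share is rounded up
n_train = n_samples - n_test
possible_accuracies = [k / n_test for k in range(n_test + 1)]

print(n_train, n_test)
print(possible_accuracies)
```

A single misclassified message therefore moves the score by 0.25, so this number says little about how the approach would behave on real email.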


In [26]:
# Making Predictions

# Example (1)
new_text = ["Discover 50 Full-Time Remote Jobs Now!"]
new_text_counts = vectorizer.transform(new_text)
prediction1 = model.predict(new_text_counts)
print(f"Prediction: {prediction1}")


# Example (2)
new_text1 = ["Congratulations, you got a new Iphone 15 pro, click to the link and get now! "]
new_text1_counts = vectorizer.transform(new_text1)
prediction2 = model.predict(new_text1_counts)
print(f"Prediction: {prediction2}")

# Example (3)
new_text2 = ["Admissions closing soon for 2023. Limited seats left. Apply now."]
new_text2_counts = vectorizer.transform(new_text2)
prediction3 = model.predict(new_text2_counts)
print(f"Prediction: {prediction3}")

# Example (4)
new_text3 = ["Dear candidate, You are shortlisted."]
new_text3_counts = vectorizer.transform(new_text3)
prediction4 = model.predict(new_text3_counts)
print(f"Prediction: {prediction4}")



Prediction: ['not spam']
Prediction: ['spam']
Prediction: ['spam']
Prediction: ['not spam']
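
One thing to keep in mind when reading these predictions: `vectorizer.transform` only counts words that were present in the training texts, and any other word is silently ignored. With 14 short training messages, most words in a new email are unseen. The sketch below imitates that behaviour in plain Python. The helper `count_vector` and the five-word `toy_vocabulary` are hypothetical stand-ins, not the vocabulary actually learned in this notebook.

```python
import re


def count_vector(text, vocabulary):
    """Count only tokens present in the fitted vocabulary; others are dropped."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return {term: tokens.count(term) for term in vocabulary if term in tokens}


# Hypothetical vocabulary standing in for the one learned from the training texts.
toy_vocabulary = ["now", "you", "limited", "offer", "meeting"]

print(count_vector(
    "Admissions closing soon for 2023. Limited seats left. Apply now.",
    toy_vocabulary,
))
print(count_vector("Dear candidate, You are shortlisted.", toy_vocabulary))
```

In this toy setup the first message is reduced to "limited" and "now", and the second to "you" alone, so the classifier's decision rests on very little evidence.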


In [27]:
from joblib import load
# Provide the absolute path to the saved file. This file holds the MultinomialNB
# model saved earlier, not a vectorizer, so it is loaded under a model name.
loaded_model = load(r"C:\Users\asus\Downloads\email_spam_detection_model.joblib")

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create and fit the TfidfVectorizer during training
vectorizer = TfidfVectorizer()
X_train_transformed = vectorizer.fit_transform(X_train)

# Transform the email text during prediction
#email_text_features = vectorizer.transform([email_text])


## Conclusion

### In conclusion, the Email Spam Detection (ML Model), developed by me (Ansh Pandey) as part of the Data Science Internship at Treue Technologies, reaches an accuracy of approximately 96.5% on the held-out test split. Spam precision is 1.00, but spam recall is 0.77, so roughly a quarter of the spam messages in the test set are still classified as ham. The trained model is saved with joblib so that it can be reloaded later. This project shows how machine learning can address a real-world problem such as spam filtering and reflects my (Ansh Pandey's) data science skills and dedication.

Note : This model is trained with demo data.