## <div style="text-align: center; color: red;"><b>Spam Mail Classifier</b></div>


<div style="text-align: center;">
    <img src="image.jpeg" alt="Spam Classifier Image" width="500"/>
</div>

#

### Overview 🌐

The Spam Mail Classifier is an interactive web application that tells users whether a given email is spam or legitimate (ham). It uses a Logistic Regression model trained on TF-IDF text features and exposes it through a simple interface for real-time email classification.

### Objective 🎯

The primary objective of this app is to:

- Classify Emails: Accurately classify emails as "spam" or "ham" based on their content.
- Enhance User Experience: Provide a simple, intuitive interface for users to check their emails without technical knowledge.
- Promote Email Safety: Help users filter out unwanted and potentially harmful spam emails.

### Methodology 🔍

- #### Data Collection and Preparation:

Collect a dataset of labeled emails (spam and ham).
Preprocess the text data by removing irrelevant information and normalizing the text (e.g., lowercasing, removing punctuation).
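
The normalization step described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: the helper name `normalize_text` is hypothetical, and it covers only the two operations mentioned (lowercasing and punctuation removal) plus whitespace collapsing.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_text(text: str) -> str:
    """Lowercase the text, strip punctuation, and collapse runs of whitespace.

    Hypothetical helper for illustration only.
    """
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", no_punct).strip()


print(normalize_text("Ok lar...  Joking wif u oni..."))  # -> ok lar joking wif u oni
```

`TfidfVectorizer` already lowercases text and keeps only word-character tokens by default, so explicit normalization like this is optional.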

- #### Feature Extraction:

Use TfidfVectorizer to convert the text emails into numerical format, capturing the importance of words while ignoring common stop words.

- #### Model Training:

Split the dataset into training and test sets using train_test_split.
Train a Logistic Regression model on the training data to learn the patterns that distinguish spam from ham emails.

- #### User Interface Development:

Build a web interface using Streamlit to allow users to input email text.
Implement functionality to display the classification result.

- #### Prediction:

Upon user input, transform the text using the same TfidfVectorizer and make predictions using the trained model.
Display the classification result (spam or ham) on the user interface.
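
A minimal sketch of how the interface and prediction steps above might fit together. It is not the project's actual app code: the function name `render_app`, the widget labels, and the messages are assumptions. The Streamlit module and the classifier are passed in as arguments so the layout logic can be exercised without a running server; the classifier is assumed to return `'ham'`, `'spam mail'`, or `None` on error.

```python
from typing import Callable, Optional


def render_app(st, classify: Callable[[str], Optional[str]]) -> None:
    """Draw the page: a text box, a button, and the classification result.

    `st` is expected to be the imported `streamlit` module; `classify` is any
    function that maps email text to 'ham', 'spam mail', or None on failure.
    """
    st.title("Spam Mail Classifier")
    email_text = st.text_area("Paste the email text here")

    # Only classify once the user clicks the button.
    if st.button("Classify"):
        if not email_text.strip():
            st.warning("Please enter some text first.")
            return

        result = classify(email_text)
        if result is None:
            st.error("Prediction failed.")
        elif result == "ham":
            st.success("This looks like a legitimate email (ham).")
        else:
            st.error("This looks like spam.")
```

In an app script this could be wired up with `import streamlit as st` followed by `render_app(st, spam_ham_checker)`, and launched with `streamlit run` on that script.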

#

#### Binary classification using logistic regression

In [75]:
#Importing the required libraries

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer



In [76]:
# Read the CSV and show the first 5 rows

data = pd.read_csv('mail_data.csv')
data.head()

| Unnamed: 0 | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [77]:
# Convert the Category column to numeric labels (ham = 1, spam = 0)
data['Category'] = data['Category'].map({'ham':1 , 'spam':0})

In [78]:
# Split the data into features (X) and labels (y)

X = data['Message']
y = data['Category']

In [79]:
# making our train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100, train_size=0.80)
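
The split above is purely random. When one class is much rarer than the other, an optional variation (not used in this notebook) is to pass `stratify=y`, which keeps the class proportions the same in both subsets. A minimal sketch on toy data; the arrays are made up for illustration:

```python
from sklearn.model_selection import train_test_split

# Toy data: 8 ham (1) and 2 spam (0) messages, made up for illustration.
X_toy = [f"message {i}" for i in range(10)]
y_toy = [1] * 8 + [0] * 2

# stratify=y_toy keeps the 80/20 class ratio in both the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.5, random_state=100, stratify=y_toy
)

print(sorted(y_tr), sorted(y_te))
```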

In [80]:
# Convert the Message column to numeric features with a TF-IDF vectorizer.
#
# TF-IDF (Term Frequency-Inverse Document Frequency) is a common technique in text mining and
# natural language processing (NLP) for converting text into numerical features. It reflects how
# important a word is to a document in a collection (corpus): the weight increases with the number
# of times the word appears in the document and is offset by how frequently the word occurs across
# the entire corpus.
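
To make the weighting concrete, here is a small pure-Python sketch of the computation for a toy corpus. It assumes scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector). The function name `tfidf_vector`, the whitespace tokenization, and the toy corpus are simplifications for illustration and do not replicate the library's tokenizer or stop-word handling.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration; tokens are split on whitespace.
corpus = [
    "win free prize now",
    "free entry win cup",
    "see you at lunch",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))


def tfidf_vector(tokens):
    """Return {term: weight} using smoothed IDF and L2 normalization."""
    tf = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf_vector(docs[0])
# "prize" and "now" occur in one document, "win" and "free" in two,
# so the rarer terms receive the larger weights.
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```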


In [81]:

# Step 1: Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')

In [82]:
# Step 2: Fit the vectorizer and transform the training data
X_train_features = vectorizer.fit_transform(X_train)

In [83]:
# Step 3: Transform the test data
X_test_features = vectorizer.transform(X_test)

In [84]:
# Check the percentage of ham (1) and spam (0) emails
data.Category.value_counts(normalize=True) * 100

Category
1    86.593683
0    13.406317
Name: proportion, dtype: float64

In [85]:
# The classes are imbalanced (about 87% ham vs. 13% spam). To compensate, set the
# class_weight hyperparameter of LogisticRegression to 'balanced'.
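
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * n_samples_in_class)`, so the rarer class counts for more in the loss. The sketch below applies that formula in plain Python; the helper name `balanced_class_weights` and the toy label counts (chosen to roughly mirror the proportions above) are illustrative assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Compute n_samples / (n_classes * count_per_class) for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels: 87 ham (1) and 13 spam (0).
toy_labels = [1] * 87 + [0] * 13
weights = balanced_class_weights(toy_labels)
print({cls: round(w, 3) for cls, w in sorted(weights.items())})  # -> {0: 3.846, 1: 0.575}
```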

In [86]:
# Train a Logistic Regression model

model = LogisticRegression(class_weight='balanced')
model.fit(X_train_features, y_train)

In [87]:
# Make predictions on the test data
y_test_pred = model.predict(X_test_features)

In [88]:
# Accuracy on the test set (scikit-learn's convention is y_true first, then y_pred)
accuracy_score(y_test, y_test_pred)

0.9811659192825112
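
With roughly 87% of the messages being ham, accuracy alone can hide weak performance on the spam class. A minimal sketch of per-class metrics on made-up labels (in the notebook these would be `y_test` and `y_test_pred`); the toy arrays are assumptions for illustration:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = ham, 0 = spam.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]

# Rows are true classes, columns are predicted classes, ordered [spam, ham].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Treat spam (0) as the positive class when scoring.
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)
print("spam precision:", spam_precision)
print("spam recall:", spam_recall)
```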

In [89]:
# Label encoding: 1 is ham, 0 is spam

def spam_ham_checker(email_text):
    """Classify a single email as 'ham' or 'spam mail'; return None if prediction fails."""
    try:
        # Transform the input with the vectorizer fitted on the training data
        X_feature = vectorizer.transform([email_text])
        prediction = model.predict(X_feature)

        # Log the raw prediction for debugging
        print(f"DEBUG: Model prediction: {prediction}")

        # Map the numeric label back to a readable result
        return 'ham' if prediction[0] == 1 else 'spam mail'
    except Exception as e:
        print(f"Error during prediction: {e}")
        return None  # Explicitly return None on error
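
One way to guarantee that prediction always uses the same fitted vectorizer as training is to bundle both steps in a scikit-learn `Pipeline`. This is an optional alternative to the notebook's separate `vectorizer` and `model` objects, sketched here on a tiny made-up dataset; the variable names and example messages are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: 1 = ham, 0 = spam.
messages = [
    "Free entry to win a prize, text now",
    "Win cash now, claim your free prize",
    "Are we still meeting for lunch today",
    "I will call you after the meeting",
]
labels = [0, 0, 1, 1]

# The pipeline vectorizes raw text and feeds the features to the classifier.
spam_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(class_weight="balanced"),
)
spam_pipeline.fit(messages, labels)

# Raw strings go straight in; no separate transform call is needed.
prediction = spam_pipeline.predict(["Claim your free prize now"])
print("ham" if prediction[0] == 1 else "spam mail")
```

A fitted pipeline is a single object, which also makes it simpler to save and load for the web app than keeping the vectorizer and model in sync separately.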

        