# Implement e-mail spam filtering using a text classification algorithm with an appropriate dataset

#### To implement an email spam filter, we’ll use a text classification algorithm on a spam dataset. The key steps are:

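1. Load a Dataset: This notebook uses the SMS Spam Collection dataset (tab-separated, one label and one message per line); the SpamAssassin corpus is a common alternative for full e-mails.
2. Preprocess the Data: Lowercase the text and strip digits and punctuation; English stopwords are removed later by the vectorizer.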
3. Vectorize the Text: Convert text to numerical form using techniques like TF-IDF.
4. Train a Classifier: Use a classifier such as Naive Bayes, which is effective for text classification.
5. Evaluate the Model: Measure the classifier's performance using metrics like accuracy, precision, and recall.

In [7]:
import pandas as pd

# Load the dataset (make sure to provide the correct path)
df = pd.read_csv(r"C:\Users\VEDARTH KHANDVE\Downloads\SMSSpamCollection", sep='\t',names=['label','message'])

# Show the first few rows of the dataset
print(df.head())


  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [8]:
# Label encoding: spam=1, ham=0
df['label']=df['label'].map({'spam':1,'ham':0})

In [9]:
# Split into training and test sets (80/20)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train, y_test= train_test_split(df['message'], df['label'], test_size=0.2, random_state= 42)
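
The test split in this notebook ends up with 966 ham and 149 spam messages, so the classes are imbalanced. One option worth considering is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits. The sketch below uses a made-up frame (`toy` is hypothetical, not the real dataset) purely to show the effect:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 ham (0) and 20 spam (1) messages.
toy = pd.DataFrame({
    "message": [f"msg {i}" for i in range(100)],
    "label": [0] * 80 + [1] * 20,
})

# stratify=toy["label"] preserves the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["message"], toy["label"],
    test_size=0.2, random_state=42, stratify=toy["label"],
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Without `stratify`, the spam share of the test set depends on the random seed, which can make per-class metrics harder to compare between runs.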

In [10]:
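# Text preprocessing
import re

def preprocessing_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # remove digits
    # Replace punctuation with a space. Substituting '' for r'\W+' would also
    # delete the whitespace (\W matches it) and fuse each message into one token.
    # The classification reports recorded below were produced with that ''
    # substitution, which likely explains the 0.05 spam recall.
    text = re.sub(r'[^\w\s]+', ' ', text)
    return text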
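
Because `\W` matches whitespace as well as punctuation, the replacement string passed to `re.sub` decides whether word boundaries survive cleaning. A quick check on a made-up message (`sample` is only an illustration) shows the difference:

```python
import re

sample = "Free entry!! Call 0800 now"

# Replacing non-word runs with '' also deletes spaces, fusing the words.
fused = re.sub(r"\W+", "", sample.lower())

# Replacing them with a single space keeps word boundaries intact.
spaced = re.sub(r"\W+", " ", sample.lower()).strip()

print(fused)
print(spaced)
print(len(fused.split()), len(spaced.split()))
```

A vectorizer that splits on word boundaries sees one token in the first case and five in the second, so the choice has a direct effect on the features the classifier receives.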

In [11]:
# apply preprocessing
X_train=X_train.apply(preprocessing_text)
X_test=X_test.apply(preprocessing_text)

In [12]:
# TF-IDF vectorization (fit on the training messages only, then reuse on the test messages)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer=TfidfVectorizer(stop_words='english',max_features=3000)
X_train_vec=vectorizer.fit_transform(X_train)
X_test_vec=vectorizer.transform(X_test)
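
To make the `fit_transform` / `transform` distinction concrete, here is a minimal sketch on a tiny made-up corpus (`train_docs` and `test_docs` are hypothetical). It uses the default `TfidfVectorizer` settings, which drop single-character tokens such as "a":

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "win a free prize now",
    "are we meeting for lunch",
    "free prize inside",
]
test_docs = ["claim your free lunch voucher"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
X_te = vec.transform(test_docs)       # reuses them; unseen words are ignored

print(sorted(vec.vocabulary_))
print(X_tr.shape, X_te.shape)
print(X_te.nnz)  # number of test-document words found in the vocabulary
```

Only "free" and "lunch" from the test document exist in the learned vocabulary, so the other three words contribute nothing to its vector.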

In [13]:
# Train a Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train_vec,y_train)

In [17]:
# Predict on the test data
y_pred= classifier.predict(X_test_vec)

In [18]:
# Per-class precision, recall and F1 (labels: 0 = ham, 1 = spam)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      1.00      0.93       966
           1       1.00      0.05      0.09       149

    accuracy                           0.87      1115
   macro avg       0.94      0.52      0.51      1115
weighted avg       0.89      0.87      0.82      1115



In [19]:
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))

              precision    recall  f1-score   support

         ham       0.87      1.00      0.93       966
        spam       1.00      0.05      0.09       149

    accuracy                           0.87      1115
   macro avg       0.94      0.52      0.51      1115
weighted avg       0.89      0.87      0.82      1115
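
Accuracy alone can be misleading on an imbalanced test set. As a rough sanity check, the sketch below rebuilds the label distribution from the support column of the report (966 ham, 149 spam) and scores a trivial baseline that labels every message as ham; `y_all_ham` is a hypothetical baseline, not a model from this notebook:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts taken from the support column of the report.
y_true = np.array([0] * 966 + [1] * 149)

# A trivial baseline that predicts ham (0) for every message.
y_all_ham = np.zeros_like(y_true)

baseline_acc = accuracy_score(y_true, y_all_ham)
baseline_spam_recall = recall_score(y_true, y_all_ham, pos_label=1)

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"baseline spam recall: {baseline_spam_recall:.2f}")
```

The baseline reaches the same 0.87 accuracy as the report above, which suggests the spam recall column is the more informative number here.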



In [24]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
import re

# Load the dataset (replace the path with the location of your copy of SMSSpamCollection)
data = pd.read_csv(r"C:\Users\VEDARTH KHANDVE\Downloads\SMSSpamCollection", sep='\t',names=['label','message'])

# Split the data into features (message text) and target (label: 'ham' or 'spam')
X = data['message']
y = data['label']

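# Convert text to numerical features using TF-IDF vectorization.
# Note: fitting on all messages before the split lets test-set vocabulary and
# IDF statistics influence training; fit on the training split only for a stricter estimate.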
vectorizer = TfidfVectorizer()
X_vectorized = vectorizer.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)

# Create an SVM classifier
svm_classifier = SVC(kernel='linear')

# Train the SVM classifier
svm_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svm_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Display classification report
class_report = classification_report(y_test, y_pred, target_names=['Not Spam', 'Spam'])
print("Classification Report:\n", class_report)

# Define a function to classify email subjects
def classify_email(subject):
    cleaned_subject = re.sub(r'^Subject:\s*', '', subject)  # Remove "Subject:" prefix
    vectorized_subject = vectorizer.transform([cleaned_subject])
    prediction = svm_classifier.predict(vectorized_subject)
    if prediction[0] == 'spam':  # labels here are the strings 'ham' / 'spam', not 0 / 1
        return "Spam"
    else:
        return "Not Spam"

# Ask the user to enter an email subject
user_input = input("Enter an email subject: ")
classification_result = classify_email(user_input)
print("Classification:", classification_result)


Accuracy: 0.99
Classification Report:
               precision    recall  f1-score   support

    Not Spam       0.99      1.00      0.99       966
        Spam       1.00      0.93      0.96       149

    accuracy                           0.99      1115
   macro avg       0.99      0.96      0.98      1115
weighted avg       0.99      0.99      0.99      1115

Classification: Not Spam
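
One way to guarantee that the vectorizer only ever sees training text is to wrap it and the classifier in a scikit-learn `Pipeline`. The sketch below is an assumption-laden illustration rather than the notebook's method: `train_texts`, `train_labels`, `model`, and `classify` are hypothetical names, and the eight-message corpus is made up so the block runs without the dataset file.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus; in the notebook this would be the training split.
train_texts = [
    "win a free prize now", "free cash call now",
    "claim your free reward", "urgent prize waiting call now",
    "are we meeting for lunch", "see you at home tonight",
    "can you send the report", "thanks for the lunch yesterday",
]
train_labels = ["spam"] * 4 + ["ham"] * 4

# fit() runs TF-IDF fitting and SVM training on the training texts only.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(train_texts, train_labels)

def classify(text):
    # The pipeline vectorizes the raw string before predicting.
    label = model.predict([text])[0]
    return "Spam" if label == "spam" else "Not Spam"

print(classify("free prize call now"))
print(classify("lunch at home tonight"))
```

Because the labels stay as the strings `'ham'` and `'spam'`, the comparison inside `classify` is against `"spam"` rather than an integer.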
