## Data Loading & Initial Exploration

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

file_path = r"C:\Users\Asus\Downloads\spam.csv"

print("--- Step 1: Data Loading ---")
try:

    # Note: 'encoding' might be required for some datasets
    df = pd.read_csv(file_path, encoding='latin-1') 
    
    # Drop the empty trailing columns if they exist (common in Kaggle's SMS dataset);
    # errors='ignore' avoids a KeyError when only some of them are present
    df = df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], errors='ignore')
        
    # Rename columns for clarity (v1 -> label, v2 -> message)
    df.columns = ['label', 'message']

    print(f" Data file successfully loaded.")
except FileNotFoundError:
    print(f" Error: File not found at '{file_path}'. Please check the path.")
    raise  # Re-raise so that later cells do not run without `df`

# 2. Initial Inspection
print("\n--- Initial Inspection (Head) ---")
print(df.head())

print("\n--- Label Distribution ---")
# Check the distribution of 'spam' vs 'ham'
print(df['label'].value_counts())

--- Step 1: Data Loading ---
 Data file successfully loaded.

--- Initial Inspection (Head) ---
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...

--- Label Distribution ---
label
ham     4825
spam     747
Name: count, dtype: int64
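
The counts above show a clear class imbalance. A minimal sketch of the arithmetic, with the two counts copied from the output above (the variable names are illustrative and not part of the notebook):

```python
# Label counts copied from the value_counts() output above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

ham_to_spam_ratio = ham_count / spam_count
spam_share = spam_count / total

print(f"Total messages: {total}")
print(f"Ham-to-spam ratio: {ham_to_spam_ratio:.2f}:1")
print(f"Spam share: {spam_share:.1%}")
```

Because spam makes up only about 13% of the messages, accuracy alone can look good even when much spam is missed, so per-class precision and recall matter later on.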


## NLP Preprocessing (Text Cleaning)

In [2]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer # We will use stemming

# Download stopwords list if not already downloaded
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
    
# Initialize stemmer and stopwords
ps = PorterStemmer()
all_stopwords = stopwords.words('english')

# Optional: to keep a word such as 'will' in the cleaned text, remove it from the stopword list
# all_stopwords.remove('will') 

# Define the text cleaning function
def clean_text(text):
    # 1. Remove all non-word characters and numbers (only keep letters)
    text = re.sub('[^a-zA-Z]', ' ', text)
    # 2. Convert to Lowercase
    text = text.lower()
    # 3. Split into words
    text = text.split()
    
    # 4. Remove stopwords, then stem the remaining words
    stop_set = set(all_stopwords)  # build the lookup set once per call, not once per word
    text = [ps.stem(word) for word in text if word not in stop_set]
    
    # 5. Join the words back into a single string
    text = ' '.join(text)
    return text

# Apply the cleaning function to the 'message' column
df['cleaned_message'] = df['message'].apply(clean_text)

print("Text Cleaning and Stemming Complete.")

# 2. Label Encoding (Ham=0, Spam=1)
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

print("\n--- Encoded and Cleaned Data Head ---")
print(df[['label', 'message', 'cleaned_message']].head())

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


Text Cleaning and Stemming Complete.

--- Encoded and Cleaned Data Head ---
   label                                            message  \
0      0  Go until jurong point, crazy.. Available only ...   
1      0                      Ok lar... Joking wif u oni...   
2      1  Free entry in 2 a wkly comp to win FA Cup fina...   
3      0  U dun say so early hor... U c already then say...   
4      0  Nah I don't think he goes to usf, he lives aro...   

                                     cleaned_message  
0  go jurong point crazi avail bugi n great world...  
1                              ok lar joke wif u oni  
2  free entri wkli comp win fa cup final tkt st m...  
3                u dun say earli hor u c alreadi say  
4               nah think goe usf live around though  
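
To see what the regex, lowercasing, and stopword steps do on their own, here is a minimal sketch that skips stemming and needs no NLTK download. The function name `clean_text_no_stem` and the tiny `DEMO_STOPWORDS` set are hypothetical stand-ins; the notebook itself uses NLTK's full English stopword list and the Porter stemmer.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook uses NLTK's full English list instead.
DEMO_STOPWORDS = {"i", "he", "to", "so", "then", "t"}

def clean_text_no_stem(text, stopwords=DEMO_STOPWORDS):
    """Steps 1-3 plus stopword removal from clean_text, without stemming."""
    text = re.sub(r"[^a-zA-Z]", " ", text)   # replace every non-letter with a space
    tokens = text.lower().split()            # lowercase, then split on whitespace
    return " ".join(t for t in tokens if t not in stopwords)

demo = clean_text_no_stem("Ok lar... Joking wif u oni...")
print(demo)  # ok lar joking wif u oni
```

The notebook's output for the same message is `ok lar joke wif u oni`; the only difference is the stemmer reducing `joking` to `joke`. Note that replacing non-letters with spaces also splits contractions, so `doesn't` becomes the two tokens `doesn` and `t`.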


## Text Vectorization (TF-IDF)

#### 1. Data Splitting

In [3]:
from sklearn.model_selection import train_test_split

X = df['cleaned_message']
y = df['label']

# 80/20 train-test split (X_train: cleaned messages for training, y_train: labels for training)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y # Stratify ensures the 6.46:1 ham/spam ratio is maintained.
)

print(" Data successfully split.")
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

 Data successfully split.
Training set size: 4457 samples
Testing set size: 1115 samples
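
The effect of `stratify=y` can be checked with simple arithmetic. The sketch below uses the full-dataset counts from the label distribution and the test-set counts that appear as the `support` column in the classification reports later in this notebook; the training counts are derived by subtraction.

```python
# Full-dataset label counts (from the label distribution output)
ham_total, spam_total = 4825, 747
# Test-set counts (the "support" column of the classification reports)
ham_test, spam_test = 966, 149

# Whatever is not in the test set is in the training set
ham_train = ham_total - ham_test
spam_train = spam_total - spam_test

for name, ham, spam in [
    ("full", ham_total, spam_total),
    ("train", ham_train, spam_train),
    ("test", ham_test, spam_test),
]:
    print(f"{name:>5}: {ham + spam} samples, spam share {spam / (ham + spam):.3f}")
```

All three spam shares round to 0.134, which is what a stratified split is expected to produce.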


#### 2. Applying TF-IDF Vectorizer

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer
# max_features=3000 keeps only the 3000 terms with the highest frequency across the training corpus.
# This limits the model's complexity.
tfidf_vectorizer = TfidfVectorizer(max_features=3000)

# Fit the vectorizer on the training text data (X_train)
# The vectorizer learns the vocabulary of unique words and their IDF weights from the training data.
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train).toarray()

# Transform the test text data (X_test) using the fitted vectorizer
# Only transform is applied here, not fit, so no information from the test set leaks into the vocabulary.
X_test_tfidf = tfidf_vectorizer.transform(X_test).toarray()

print("\n TF-IDF Vectorization Complete.")
print(f"Shape of Training Data (Samples, Features): {X_train_tfidf.shape}")
print(f"Shape of Testing Data (Samples, Features): {X_test_tfidf.shape}")


 TF-IDF Vectorization Complete.
Shape of Training Data (Samples, Features): (4457, 3000)
Shape of Testing Data (Samples, Features): (1115, 3000)
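
For intuition about what the vectorizer computes, here is a small pure-Python sketch of TF-IDF using the smoothed IDF formula and L2 normalization that scikit-learn applies by default. It is a simplification under stated assumptions: it splits on whitespace (scikit-learn's default tokenizer also drops single-character tokens), it has no `max_features` limit, and the function name `tfidf_vectors` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, and L2 normalization."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {t: count * idf[t] for t, count in tf.items()}
        # Scale each document vector to unit length (guard against empty documents)
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return idf, vectors

# Toy corpus (made up for illustration, not taken from the dataset)
toy_docs = ["free entri win", "ok lar joke", "free call win win"]
idf, vectors = tfidf_vectors(toy_docs)
print({t: round(v, 3) for t, v in sorted(idf.items())})
print({t: round(w, 3) for t, w in sorted(vectors[0].items())})
```

Terms that occur in fewer documents receive a larger IDF weight, so in this toy corpus `ok` is weighted more heavily than `free`.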


## Model Training (Naive Bayes)

#### 1. Training the Model

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize the Multinomial Naive Bayes model
mnb_model = MultinomialNB()

# Train the model using the TF-IDF transformed training data
print("Training Multinomial Naive Bayes Model...")
mnb_model.fit(X_train_tfidf, y_train)

print(" Naive Bayes Model successfully trained.")

Training Multinomial Naive Bayes Model...
 Naive Bayes Model successfully trained.
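
For reference, the decision rule behind `MultinomialNB` can be sketched as follows. This is the standard multinomial Naive Bayes formulation with additive smoothing, not something printed by the notebook. The symbols are introduced here for illustration: $x_i$ is the TF-IDF weight of term $i$ in the message, $N_{ci}$ is the total weight of term $i$ over the training messages of class $c$, $N_c = \sum_i N_{ci}$, $n$ is the number of features (3000 here), and $\alpha$ is the smoothing parameter (scikit-learn's default is $\alpha = 1$).

```latex
\hat{y} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} x_i \log \hat{\theta}_{ci} \right],
\qquad
\hat{\theta}_{ci} = \frac{N_{ci} + \alpha}{N_c + \alpha n}
```

The smoothing term keeps $\hat{\theta}_{ci}$ strictly positive, so a term never seen in one class during training does not force that class's score to $-\infty$.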


#### 2. Prediction

In [6]:
# Make predictions on the test set
y_pred_mnb = mnb_model.predict(X_test_tfidf)

print("\n Predictions completed.")


 Predictions completed.


#### 3. Evaluation and Interpretation

In [7]:
# Calculate Accuracy
accuracy_mnb = accuracy_score(y_test, y_pred_mnb)
print(f"\nOverall Accuracy (Test Set): {accuracy_mnb:.4f}")

# Calculate the Confusion Matrix
cm_mnb = confusion_matrix(y_test, y_pred_mnb)
print("\n--- Confusion Matrix ---")
print(cm_mnb)

# Generate the detailed Classification Report
print("\n--- Classification Report ---")
print(classification_report(y_test, y_pred_mnb))


Overall Accuracy (Test Set): 0.9740

--- Confusion Matrix ---
[[965   1]
 [ 28 121]]

--- Classification Report ---
              precision    recall  f1-score   support

           0       0.97      1.00      0.99       966
           1       0.99      0.81      0.89       149

    accuracy                           0.97      1115
   macro avg       0.98      0.91      0.94      1115
weighted avg       0.97      0.97      0.97      1115
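
The spam-class figures in the report follow directly from the confusion matrix. A minimal sketch, assuming scikit-learn's layout in which rows are true labels and columns are predicted labels (the helper name `binary_metrics` is illustrative):

```python
def binary_metrics(cm):
    """Derive accuracy and positive-class metrics from a 2x2 confusion matrix.

    cm follows scikit-learn's layout: rows are true labels, columns are
    predicted labels, so cm = [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)   # share of predicted spam that really is spam
    recall = tp / (tp + fn)      # share of actual spam that was caught
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Confusion matrix copied from the Naive Bayes output above
mnb_metrics = binary_metrics([[965, 1], [28, 121]])
for name, value in mnb_metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to two decimals, these values match the row for class `1` in the report: precision 0.99, recall 0.81, and F1 0.89.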



## Model Training (Support Vector Machine - SVM)

In [8]:
from sklearn.svm import SVC

# Initialize the Support Vector Machine (SVC) model
# A linear kernel is used because the data is high-dimensional (3000 features).
svm_model = SVC(kernel='linear', random_state=42)

# Train the model on the TF-IDF transformed training data
print("\nTraining Support Vector Machine (SVM) Model...")
# SVM Training is generally slower than Naive Bayes
svm_model.fit(X_train_tfidf, y_train)

print(" SVM Model successfully trained.")


Training Support Vector Machine (SVM) Model...
 SVM Model successfully trained.


## Prediction & Evaluation (SVM)

In [9]:
# Make predictions on the test set using the SVM model
y_pred_svm = svm_model.predict(X_test_tfidf)

print(" SVM Predictions completed.")

 SVM Predictions completed.


In [10]:
# Evaluation Metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Calculate Accuracy
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print(f"\nOverall Accuracy (SVM Test Set): {accuracy_svm:.4f}")

# Calculate the Confusion Matrix
cm_svm = confusion_matrix(y_test, y_pred_svm)
print("\n--- SVM Confusion Matrix ---")
print(cm_svm)

# Generate the detailed Classification Report
print("\n--- SVM Classification Report ---")
print(classification_report(y_test, y_pred_svm))


Overall Accuracy (SVM Test Set): 0.9857

--- SVM Confusion Matrix ---
[[965   1]
 [ 15 134]]

--- SVM Classification Report ---
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       0.99      0.90      0.94       149

    accuracy                           0.99      1115
   macro avg       0.99      0.95      0.97      1115
weighted avg       0.99      0.99      0.99      1115



## Project Conclusion: Best Model Selection (SVM)

The Spam Detection project combined NLP preprocessing (cleaning, stemming, and TF-IDF vectorization) with two classification algorithms: Multinomial Naive Bayes (MNB) and a Support Vector Machine (SVM).

### Final Model Comparison

| Metric | MNB Score | SVM Score | Final Decision |
| :--- | :--- | :--- | :--- |
| **Overall Accuracy** | $0.9740$ | **$0.9857$** | SVM (Higher) |
| **Precision (Spam)** | $0.99$ | $0.99$ | Tie (each model produced only 1 False Positive). |
| **Recall (Spam)** | $0.81$ | **$0.90$** | **SVM is superior.** |
| **F1-Score (Spam)** | $0.89$ | **$0.94$** | SVM (Better balance). |

### Conclusion

1.  **Best Performer: Support Vector Machine (SVM).** SVM delivered a remarkable **$98.57\%$ accuracy** and, more importantly, achieved **$90\%$ Recall** for the Spam class.
2.  **User Experience Focus:** SVM is highly reliable for a practical spam filter:
    * It maintained near-perfect **Precision ($0.99$)**: only 1 legitimate message was misclassified as spam.
    * It significantly reduced **False Negatives** (spam messages leaking into the inbox) from 28 (MNB) to just **15** (SVM).

Of the two models tested, the SVM with a **linear kernel** on the **TF-IDF vector space** gives the better balance of accuracy and performance on the crucial positive class (Spam). Training and prediction time were not measured in this comparison.
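
The error counts behind this comparison can be recomputed from the two confusion matrices. A short sketch, with the matrices copied from the evaluation outputs and the variable names chosen for illustration:

```python
# Confusion matrices copied from the two evaluation outputs,
# in scikit-learn's layout [[TN, FP], [FN, TP]]
confusion = {
    "MNB": [[965, 1], [28, 121]],
    "SVM": [[965, 1], [15, 134]],
}

errors = {}
for model, ((tn, fp), (fn, tp)) in confusion.items():
    errors[model] = {"false_positives": fp, "false_negatives": fn, "total": fp + fn}
    print(f"{model}: {fp} false positives, {fn} false negatives, {fp + fn} errors in total")

# Relative drop in missed spam when moving from MNB to SVM
fn_reduction = 1 - errors["SVM"]["false_negatives"] / errors["MNB"]["false_negatives"]
print(f"Missed spam reduced by {fn_reduction:.1%} when moving from MNB to SVM")
```

Both models flag the same number of legitimate messages as spam, so the whole difference between them on this test set comes from the spam messages that slip through.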