# **Spam SMS Detection**

### Block 1: Import Libraries
This block imports the libraries used throughout the notebook: pandas for data manipulation and, from scikit-learn, `TfidfVectorizer` for feature extraction, `train_test_split` for splitting the data, `SVC` for the classifier, and `accuracy_score` and `classification_report` for model evaluation.


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report


### Block 2: Load and Explore Data
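This block loads the dataset from the 'spam.csv' file into a pandas DataFrame, reading it with `encoding='latin-1'`. It prints the first few rows to inspect the structure and counts the missing values in each column. In the output below, 'v1' and 'v2' have no missing values, while the three 'Unnamed' columns are almost entirely empty.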


In [2]:
# Load the dataset
data = pd.read_csv('spam.csv', encoding='latin-1')

# Explore the first few rows of the dataset
print(data.head())

# Check for any missing values
print(data.isnull().sum())


     v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  
v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64


### Block 3: Data Preprocessing
This block performs data preprocessing tasks:
- It selects only the necessary columns ('v1' for labels and 'v2' for text messages).
- It renames the columns to 'label' and 'text' for better readability.
- It maps the 'ham' and 'spam' labels to binary values (0 for 'ham' and 1 for 'spam').
- It splits the dataset into training (80%) and testing (20%) sets using the `train_test_split` function from scikit-learn, with `random_state=42` so the split is reproducible.


In [3]:
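# Keep only the label and text columns; copy so that later column
# assignments do not trigger pandas' SettingWithCopyWarning
data = data[['v1', 'v2']].copy()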

# Rename columns
data.columns = ['label', 'text']

# Convert labels to binary (0 for ham, 1 for spam)
data['label'] = data['label'].map({'ham': 0, 'spam': 1})

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size=0.2, random_state=42)
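
The preprocessing steps above can be tried on a small in-memory frame. The sketch below uses made-up messages (the `toy` DataFrame and its contents are illustrative and do not come from 'spam.csv'). It also passes `stratify=`, which the notebook cell does not use; this is an optional addition that keeps the ham/spam ratio the same in both splits, which may help when one class is much rarer than the other.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data in the same two-column layout as spam.csv
toy = pd.DataFrame({
    "v1": ["ham"] * 8 + ["spam"] * 2,
    "v2": [f"message {i}" for i in range(10)],
})

# Select, rename, and map labels to 0/1, as in the cell above
toy = toy[["v1", "v2"]].copy()
toy.columns = ["label", "text"]
toy["label"] = toy["label"].map({"ham": 0, "spam": 1})

# stratify= is an addition (not in the notebook): it preserves the
# class ratio in both the training and the test split
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"], toy["label"],
    test_size=0.5, random_state=42, stratify=toy["label"],
)

print(len(X_tr), len(X_te))      # sizes of the two splits
print(y_tr.sum(), y_te.sum())    # number of spam messages in each split
```

With stratification, each half of this toy set receives one of the two spam messages; without it, both could land in the same split.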


### Block 4: Feature Extraction
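This block converts the text data into TF-IDF (Term Frequency-Inverse Document Frequency) vectors using the `TfidfVectorizer` from scikit-learn. TF-IDF weights a word by how often it appears in a document, scaled down by how many documents in the collection (corpus) contain it, so words that are frequent in one message but rare overall receive the highest weights. The `stop_words='english'` parameter removes common English stopwords during vectorization. The vectorizer is fitted on the training set only and then applied to the test set, so no information from the test messages leaks into the vocabulary or the IDF weights.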


In [4]:
# Convert text data into TF-IDF vectors
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)


### Block 5: Model Training
This block initializes and trains the Support Vector Machine (SVM) classifier using the training data. SVM is a supervised learning model used for classification tasks. In this case, we're using a linear kernel (`kernel='linear'`), a common choice for text classification because TF-IDF features are high-dimensional and sparse, a setting in which a linear decision boundary often works well.


In [5]:
# Initialize and train the Support Vector Machine (SVM) classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train_tfidf, y_train)


### Block 6: Model Evaluation
This block makes predictions on the test set using the trained SVM classifier. It then evaluates the performance of the model by calculating accuracy and generating a classification report, which includes precision, recall, F1-score, and support for each class (ham and spam). The accuracy score indicates the proportion of correctly classified instances in the test set, while the classification report provides more detailed performance metrics for each class.


In [6]:
# Predictions on the test set
y_pred = svm_classifier.predict(X_test_tfidf)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Accuracy: 0.979372197309417

Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       0.97      0.87      0.92       150

    accuracy                           0.98      1115
   macro avg       0.98      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115
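
To make the per-class columns of the report concrete, the following sketch recomputes precision, recall, and accuracy by hand from a confusion matrix. The labels in `y_true` and `y_hat` are hypothetical and unrelated to the notebook's test set.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (0 = ham, 1 = spam)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of the messages flagged as spam, the share that were spam
recall = tp / (tp + fn)      # of the actual spam messages, the share that were caught
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(precision, recall, accuracy)

# scikit-learn's helpers report the same values for the spam class
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```

Read this way, the spam row of the report above (precision 0.97, recall 0.87, support 150) says that nearly all messages flagged as spam were spam, while roughly 13% of the 150 spam messages in the test set were missed.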



## Trained an SVC model on the training data and printed the model accuracy and classification report