## Using LinearSVC to Detect Spam

### The Dataset

The dataset used here is the SMS Spam Collection, available at https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset/data

In [15]:
import numpy as np
import pandas as pd

df = pd.read_csv('spam.csv', encoding='LATIN-1')
df.head()

|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|----|----|------------|------------|------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


### Cleaning the Dataset

In [16]:
# Remove the unnamed columns, which are not needed for classification
df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)

In [17]:
# Rename columns
df.rename(columns={'v1': 'label', 'v2': 'text'}, inplace=True)
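
As an alternative to dropping and renaming in separate steps, the same result can be obtained at load time. The following is a minimal sketch, assuming the file layout shown in the preview above; the in-memory `sample` is a hypothetical stand-in for `spam.csv`, and `df_alt` is an illustrative name rather than part of the notebook.

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking the column layout of spam.csv
sample = io.BytesIO(
    b"v1,v2,,,\n"
    b"ham,Ok lar... Joking wif u oni...,,,\n"
    b"spam,Free entry in 2 a wkly comp,,,\n"
)

# Read only the two useful columns, then give them descriptive names
df_alt = (
    pd.read_csv(sample, encoding='latin-1', usecols=['v1', 'v2'])
      .rename(columns={'v1': 'label', 'v2': 'text'})
)

print(df_alt)
```

Selecting columns with `usecols` avoids the `inplace=True` mutations and keeps the cleaning logic in one expression.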

In [18]:
# Check for missing values
print(df.isnull().sum())

label    0
text     0
dtype: int64


In [19]:
df.head()

|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


That's better: only the `label` and `text` columns remain.

In [20]:
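# Detect empty or whitespace-only messages

blanks = []  # index labels of blank messages

for i, lb, txt in df.itertuples():
    # Guard against non-string values such as NaN; strip() also catches ''
    if isinstance(txt, str) and not txt.strip():
        blanks.append(i)
print(len(blanks), 'Blanks: ', blanks)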

0 Blanks:  []
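
The same check can also be written without an explicit loop. The sketch below is one possible vectorized version, run on a small hypothetical frame (`sample`) rather than the real dataset. Note that `str.isspace()` returns `False` for a zero-length string, so stripping and comparing against `''` covers both empty and whitespace-only messages.

```python
import pandas as pd

# Hypothetical frame with the same columns as the cleaned dataset
sample = pd.DataFrame({
    'label': ['ham', 'spam', 'ham', 'ham'],
    'text': ['Ok lar...', '   ', '', 'See you soon'],
})

# True where the text is empty or contains only whitespace
is_blank = sample['text'].str.strip().eq('')

blank_idx = sample.index[is_blank].tolist()  # index labels of blank rows
cleaned = sample[~is_blank]                  # rows that would be kept

print(len(blank_idx), 'Blanks: ', blank_idx)
```

If any blanks were found in the real data, `df = df[~is_blank]` (with `is_blank` computed on `df['text']`) would remove them.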


### Applying LinearSVC with a Pipeline

#### Split the data

In [23]:
from sklearn.model_selection import train_test_split

X = df['text']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
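
Spam is the minority class here (252 of the 1839 test messages in the report further down), so it may be worth keeping the class ratio identical in both splits. This is an optional variation, not what the notebook does: a minimal sketch using the `stratify` argument of `train_test_split` on hypothetical data.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 ham and 20 spam messages
X_demo = [f'message {i}' for i in range(100)]
y_demo = ['ham'] * 80 + ['spam'] * 20

# stratify=y_demo keeps the 80/20 ham-to-spam ratio in both splits
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

train_counts = Counter(yd_train)
test_counts = Counter(yd_test)
print(train_counts)
print(test_counts)
```

Without `stratify`, the spam share of the test set depends on the random seed, which can make the spam recall figure less stable between runs.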

In [36]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

classifier = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
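
Because the vectorizer and the classifier live in one `Pipeline`, raw strings can be passed straight to `predict`; the TF-IDF transformation learned during `fit` is applied automatically. The sketch below illustrates this on a hypothetical four-message training set (`train_texts`, `toy_classifier` and `new_messages` are illustrative names, not part of the notebook).

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for X_train / y_train
train_texts = [
    'Free entry to win a prize, call now',
    'Win cash now, claim your free prize',
    'Are we still meeting for lunch today?',
    'Ok, see you at home later',
]
train_labels = ['spam', 'spam', 'ham', 'ham']

toy_classifier = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
toy_classifier.fit(train_texts, train_labels)

# Unseen raw text goes in; labels come out
new_messages = ['Claim your free prize now', 'See you at lunch']
predictions = toy_classifier.predict(new_messages)

for message, label in zip(new_messages, predictions):
    print(f'{label}: {message}')
```

With the real pipeline, the equivalent call would be `classifier.predict([...])` on a list of message strings.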



Enabling stop-word filtering in the vectorizer resulted in worse performance metrics, so the pipeline uses the default `TfidfVectorizer()`, which keeps every word.
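
To make the effect of that setting concrete, the sketch below compares the vocabulary learned with and without scikit-learn's built-in English stop-word list. The two sample messages are hypothetical and chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample messages, not taken from the dataset
texts = ['free entry to win the cup', 'see you at the game']

keep_all = TfidfVectorizer().fit(texts)                       # default: no filtering
filtered = TfidfVectorizer(stop_words='english').fit(texts)   # built-in stop-word list

# Words that disappear from the vocabulary once filtering is enabled
removed = sorted(set(keep_all.vocabulary_) - set(filtered.vocabulary_))
print('removed by filtering:', removed)
print('kept after filtering:', sorted(filtered.vocabulary_))
```

In the pipeline above, the same option could be toggled with `classifier.set_params(tfidf__stop_words='english')` before calling `fit`, which makes it straightforward to rerun the evaluation for both settings.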

### Classification Report

In [35]:
from sklearn import metrics

print(metrics.confusion_matrix(y_test, y_pred))

[[1580    7]
 [  30  222]]
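
In scikit-learn's confusion matrix, rows are true labels and columns are predicted labels, in sorted label order (`ham`, then `spam`). As a worked check, the spam precision, spam recall and overall accuracy can be derived from the four counts printed above:

```python
# Counts copied from the confusion matrix printed above
tn, fp = 1580, 7    # true ham:  predicted ham, predicted spam
fn, tp = 30, 222    # true spam: predicted ham, predicted spam

precision_spam = tp / (tp + fp)               # share of predicted spam that is spam
recall_spam = tp / (tp + fn)                  # share of actual spam that was caught
accuracy = (tp + tn) / (tp + tn + fp + fn)    # share of all messages classified correctly

print(f'spam precision: {precision_spam:.4f}')
print(f'spam recall:    {recall_spam:.4f}')
print(f'accuracy:       {accuracy:.4f}')
```

These counts give a spam recall of about 0.881 and an accuracy of about 0.9799, slightly below the 0.89 and 0.9821 printed in the cells that follow. One possible explanation is that the matrix cell (`In [35]`) was executed before the pipeline was refit in `In [36]`; rerunning the cells in order should bring the figures into agreement.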


In [38]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99      1587
        spam       0.97      0.89      0.93       252

   micro avg       0.98      0.98      0.98      1839
   macro avg       0.98      0.94      0.96      1839
weighted avg       0.98      0.98      0.98      1839



In [39]:
print(metrics.accuracy_score(y_test, y_pred))

0.9820554649265906
