## Building a spam classifier using NLP techniques



#### Step 1: Dataset Preparation
##### 1.- Load the Dataset: 
Use a dataset of text messages labeled as spam or not spam (ham). A common choice is the SMS Spam Collection from the UCI Machine Learning Repository, a tab-separated file with one label and one message per line and no header row.

In [5]:
import pandas as pd

# Load dataset
df = pd.read_csv("SMSSpamCollection", sep='\t', header=None, names=['label', 'message'])

##### 2.- Explore the Dataset: 
Understand the structure and content of your dataset.

In [None]:
print(df.head())
print(df['label'].value_counts())
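
Since the real file may not be at hand, here is a small self-contained sketch of the same loading and exploration calls. The three messages in `raw` are made up; only the tab-separated, header-less layout mirrors the dataset. It also shows class proportions, which are worth checking because an unbalanced dataset makes plain accuracy harder to interpret.

```python
import io

import pandas as pd

# Assumption: three invented lines in the same tab-separated, header-less layout
raw = "ham\tSee you at lunch\nspam\tWIN a FREE prize now\nham\tCall me later\n"
toy = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=["label", "message"])

print(toy.shape)                                   # (rows, columns)
print(toy["label"].value_counts(normalize=True))   # class proportions instead of raw counts

# Average message length per class, a quick first look at how the classes differ
print(toy["message"].str.len().groupby(toy["label"]).mean())
```

On the real data, the same three calls would be applied to `df` instead of `toy`.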

#### Step 2: Data Preprocessing
##### 1.- Text Cleaning: 
Lowercase the text, remove punctuation and digits, and drop stopwords (common words like 'and', 'the') that carry little signal for classification.

In [8]:
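import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below (skipped if already present).
# Newer NLTK releases may also ask for the 'punkt_tab' resource.
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

# Build the stopword set once instead of reloading the list for every word
STOP_WORDS = set(stopwords.words('english'))

# Text cleaning function
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove punctuation and numbers
    text = ' '.join(word for word in word_tokenize(text) if word not in STOP_WORDS)  # Remove stopwords
    return text

df['clean_text'] = df['message'].apply(clean_text)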
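
To make the effect of each cleaning step visible, here is a minimal stand-alone sketch that needs no NLTK download. It is not the function above: `TINY_STOP_WORDS` is a hypothetical hand-picked subset of English stopwords and `str.split` stands in for `word_tokenize`, so results on real data may differ.

```python
import re

# Assumption: a tiny illustrative stopword set, not NLTK's full English list
TINY_STOP_WORDS = {"a", "in", "to", "the", "and"}

def clean_text_sketch(text):
    """Lowercase, delete non-letters, and remove the illustrative stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # digits and punctuation are deleted, not replaced by spaces
    words = text.split()                   # whitespace split stands in for word_tokenize
    return " ".join(w for w in words if w not in TINY_STOP_WORDS)

sample = "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005."
cleaned = clean_text_sketch(sample)
print(cleaned)
```

Because digits are deleted rather than replaced, a token such as "21st" survives as the fragment "st". Whether that matters depends on the data; numbers can themselves be a useful spam signal.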

##### 2.- Feature Extraction: 
Convert text data into numerical vectors using techniques like TF-IDF (Term Frequency-Inverse Document Frequency).

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(df['clean_text'])
y = df['label']
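
For intuition about what the vectorizer computes, the following sketch reproduces by hand the weighting that scikit-learn documents for `TfidfVectorizer` with default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The two-document corpus and the helper name `tfidf_rows` are made up for illustration, and tokenization is simplified to a whitespace split.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

rows = tfidf_rows(["free prize win", "see you at lunch free"])
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

The word "free" occurs in both documents, so it receives a lower weight than "prize" or "win", which occur in only one.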

#### Step 3: Train-Test Split
Split the dataset into training and testing sets. Here 20% of the messages are held out for testing, and random_state=42 makes the split reproducible.

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
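
Note that the vectorizer above is fitted on every message before the split, so vocabulary and IDF statistics from the test messages influence the features. One common way to avoid that is to split the cleaned text first and fit the vectorizer on the training part only, for example with a scikit-learn `Pipeline`. The sketch below uses a tiny made-up dataset so that it runs on its own; on the real data you would pass `df['clean_text']` and `df['label']` instead. The scores reported in this notebook were not produced this way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Assumption: toy stand-in for df['clean_text'] and df['label']
texts = ["win free prize now", "free cash claim now", "urgent prize claim call",
         "win cash call now", "see you lunch", "call later tonight",
         "meeting moved tomorrow", "thanks lunch today"]
labels = ["spam"] * 4 + ["ham"] * 4

# Split the raw text first; stratify keeps the spam/ham ratio the same in both parts
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# The pipeline fits TF-IDF on the training texts only, then trains Naive Bayes
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(test_texts))
```

A pipeline also keeps the vectorizer and the classifier in one object, which simplifies prediction on new messages.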

#### Step 4: Model Training and Evaluation
##### 1.- Choose a Classifier: 
Train a classifier, such as a Naive Bayes classifier, which often works well for text classification tasks.

In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Initialize the classifier
clf = MultinomialNB()

# Train the classifier
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Evaluate the classifier
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9695067264573991
              precision    recall  f1-score   support

         ham       0.97      1.00      0.98       966
        spam       1.00      0.77      0.87       149

    accuracy                           0.97      1115
   macro avg       0.98      0.89      0.93      1115
weighted avg       0.97      0.97      0.97      1115
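
The report can be cross-checked by hand. The counts below are not printed anywhere in the notebook; they are inferred as a whole-number split consistent with the accuracy and the spam recall shown above (1115 test messages, 149 of them spam), so treat them as a reconstruction.

```python
# Assumption: counts reconstructed from the reported accuracy and recall,
# not taken from a confusion matrix printed by the notebook
total, spam_total = 1115, 149
spam_caught = 115                        # true positives
spam_missed = spam_total - spam_caught   # false negatives
ham_flagged = 0                          # false positives

accuracy = (total - spam_missed - ham_flagged) / total
precision = spam_caught / (spam_caught + ham_flagged)
recall = spam_caught / spam_total
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Read this way, the high accuracy mostly reflects the 966 ham messages, all of which are classified correctly. Roughly one spam message in four still reaches the inbox, so spam recall is the number to watch when tuning the model.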



#### Step 5: Deployment
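Once satisfied with the model's performance, you can save the trained model and use it to predict on new, unseen data. New messages must go through the same clean_text function and the already fitted tfidf_vectorizer (using transform, not fit_transform) before they are passed to the classifier.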

In [None]:
# Example of how to use the trained model for prediction
new_texts = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005."]
new_texts_clean = [clean_text(text) for text in new_texts]
new_texts_tfidf = tfidf_vectorizer.transform(new_texts_clean)
predictions = clf.predict(new_texts_tfidf)
print(predictions)
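
The prediction cell reuses objects that are still in memory. To actually save the model, both the classifier and the fitted vectorizer need to be stored, because the classifier only understands vectors produced by that vectorizer. A minimal sketch using `joblib` (a dependency of scikit-learn) follows; the file name `spam_model.joblib`, the temporary directory, and the toy training data are placeholders.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Assumption: toy data standing in for the cleaned SMS messages
texts = ["win free prize now", "free cash claim now", "see you lunch", "call later tonight"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = TfidfVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# Store vectorizer and classifier together so they cannot get out of sync
path = os.path.join(tempfile.gettempdir(), "spam_model.joblib")  # placeholder location
joblib.dump({"vectorizer": vectorizer, "classifier": clf}, path)

# Later, possibly in another process: load both objects and predict
saved = joblib.load(path)
features = saved["vectorizer"].transform(["free prize claim"])
prediction = saved["classifier"].predict(features)
print(prediction)
```

Only load joblib or pickle files from sources you trust, since loading them can execute arbitrary code.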

#### Code complete in a single file
Let's put all of the code together in a single file and see how it behaves.

In [None]:
#1.- 
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Download the NLTK resources used below (skipped if already present).
# Newer NLTK releases may also ask for the 'punkt_tab' resource.
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

# Load dataset
df = pd.read_csv("SMSSpamCollection", sep='\t', header=None, names=['label', 'message'])

#2.- 
# Build the stopword set once instead of reloading the list for every word
STOP_WORDS = set(stopwords.words('english'))

# Text cleaning function
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove punctuation and numbers
    text = ' '.join(word for word in word_tokenize(text) if word not in STOP_WORDS)  # Remove stopwords
    return text

df['clean_text'] = df['message'].apply(clean_text)

# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(df['clean_text'])
y = df['label']

#3.-
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#4.- 
# Initialize the classifier
clf = MultinomialNB()

# Train the classifier
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Evaluate the classifier
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

#5.- 
# Example of how to use the trained model for prediction
new_texts = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005."]
new_texts_clean = [clean_text(text) for text in new_texts]
new_texts_tfidf = tfidf_vectorizer.transform(new_texts_clean)
predictions = clf.predict(new_texts_tfidf)
print(predictions)

#### Conclusions
As you can see, we used a file (SMSSpamCollection) of messages, each labeled as either ‘spam’ or ‘ham’, to train the classifier. You can build a labeled list of your own to train a different version. The ‘enron_spam_data.csv’ file, for example, is a larger dataset of emails that isn’t labeled, which you can use to experiment: you could find out who sends and receives the most emails, or explore the topics within them. The possibilities are limitless. Enjoy exploring!