<a href="https://colab.research.google.com/github/isandrade-udea/LabIA/blob/main/TallerEmailSpamDetection_Aprendices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p><img alt="Colaboratory logo" height="140px" src="https://raw.githubusercontent.com/isandrade-udea/LabIA/main/Captura%20desde%202024-05-24%2012-44-56.png" align="left" hspace="10px" vspace="0px"></p>


# **Email Spam Detection** 💌

This workshop shows how to build a model that classifies SMS messages as spam. It uses the SMS Spam Collection dataset from [Kaggle](https://www.kaggle.com/datasets/venky73/spam-mails-dataset/code?datasetId=109196&sortBy=voteCount) and is based on the notebook by [ZABIHULLAH18](https://www.kaggle.com/code/zabihullah18/email-spam-detection). By the end, you will have a tool for filtering unwanted messages and making text messaging safer.

**Instructors:**

* Herna Villar

* Isabel C. Andrade M.

#**1. Aim**


<p><img alt="model" height="240px" src="https://miro.medium.com/v2/resize:fit:720/format:webp/0*mbFBPcPUJD-53v3h.png" align="centering" hspace="10px" vspace="0px"></p>

The goal is to build a predictive model that classifies SMS messages as spam or ham.

#**2. Libraries**

In [None]:
# Importing necessary libraries
import numpy as np        # For numerical operations
import pandas as pd       # For data manipulation and analysis
import matplotlib.pyplot as plt  # For data visualization
import seaborn as sns
%matplotlib inline

sns.set()
plt.rcParams['figure.figsize'] = [7, 5]
plt.rcParams['legend.fontsize'] = 16

# Importing WordCloud for text visualization
from wordcloud import WordCloud

# Importing NLTK for natural language processing
import nltk
from nltk.corpus import stopwords    # For stopwords


# Downloading NLTK data
nltk.download('stopwords')   # Downloading stopwords data
nltk.download('punkt')       # Downloading tokenizer data
nltk.download('punkt_tab')   # Tokenizer tables that newer NLTK releases look up instead of 'punkt'

#**3. Loading the data**

In [None]:
#@title  dataset
path_to_file = "https://raw.githubusercontent.com/isandrade-udea/datasets/main/spam.csv" # @param {type:"string"}

df = pd.read_csv(path_to_file,encoding='latin1')
df.head()

#**4. Data Cleaning**

##Data Info

In [None]:
df.info()

##Drop the Columns

In [None]:
df.head(3)

In [None]:
df = df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])

In [None]:
df.head()

##Rename the Column

In [None]:
# Rename the columns name
df.rename(columns = {'v1': 'target', 'v2':'text'}, inplace = True)
df.head(2)

##Convert the target variable

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['target'] = encoder.fit_transform(df['target'])
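
`LabelEncoder` assigns integer codes to the class labels in sorted order, so with the labels `ham` and `spam` the resulting mapping is `ham` to 0 and `spam` to 1. A minimal sketch on a toy series (the values in `labels` are made up for illustration and are not taken from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels, assumed for illustration only
labels = pd.Series(['ham', 'spam', 'ham', 'ham', 'spam'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order; the position is the code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

Inspecting `encoder.classes_` after fitting is a quick way to confirm which integer stands for spam before reading any metric.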

##Check Duplicate values

In [None]:
df.duplicated().sum()

##Remove Duplicate values

In [None]:
df = df.drop_duplicates(keep = 'first')
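
As a small illustration of the two duplicate-handling steps above, the sketch below counts and removes repeated rows in a toy DataFrame. The `toy` frame and its messages are hypothetical; only `duplicated` and `drop_duplicates` come from the notebook.

```python
import pandas as pd

# Hypothetical messages; row 2 repeats row 0 exactly
toy = pd.DataFrame({
    'target': [0, 1, 0, 0],
    'text': ['see you soon', 'win a prize now', 'see you soon', 'call me later'],
})

# duplicated() flags every repeat of an earlier row
n_duplicates = int(toy.duplicated().sum())

# keep='first' retains the first occurrence and drops later repeats
deduplicated = toy.drop_duplicates(keep='first')

print(n_duplicates)
print(deduplicated)
```

Note that the original index labels are kept after dropping rows, so the index is no longer contiguous.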

##Shape of the Dataset

In [None]:
df.shape

#**5. EDA**

##Percentage of Ham and Spam

In [None]:
values = df['target'].value_counts()
total = values.sum()

percentage_0 = (values[0] /total) * 100
percentage_1 = (values[1]/ total) *100

print('percentage of 0 :' ,percentage_0)
print('percentage of 1 :' ,percentage_1)
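
The same percentages can also be obtained in one step with `value_counts(normalize=True)`. A minimal sketch on a toy target column (the values are assumed for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy target column: six ham (0) and two spam (1) messages
target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name='target')

# normalize=True returns class proportions instead of raw counts
percentages = target.value_counts(normalize=True) * 100

print(percentages.loc[0])
print(percentages.loc[1])
```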

In [None]:
fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(
    values, labels=['ham', 'spam'],
    autopct='%0.2f%%',
    startangle=90,
    wedgeprops={'linewidth': 2, 'edgecolor': 'white'},
    shadow=True  # Add shadow
)


##Text Length and Structure Analysis

**Tokenize:** NLTK (Natural Language Toolkit) provides a tokenizer that splits text into individual words, or tokens.

Text:

 `df['text'][1] = Subject: hpl nom for january 9 , 2001\r\n( see attached file : hplnol 09 . xls )\r\n- hplnol 09 . xls`

tokenize:

 `nltk.word_tokenize(df['text'][1]) = ['Subject', ':', 'hpl', 'nom', 'for', 'january', ...]`
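
To make the idea of tokenization concrete without downloading any NLTK data, here is a simplified stand-in based on a regular expression. The function `simple_tokenize` is hypothetical and only approximates `nltk.word_tokenize`; it will differ on contractions, abbreviations, and other cases that NLTK handles with dedicated rules.

```python
import re

def simple_tokenize(text):
    # A token is either a run of word characters or a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Subject: hpl nom for january 9 , 2001"
tokens = simple_tokenize(sample)

print(tokens)
print(len(tokens))
```

Counting the tokens with `len`, as done here, is the same operation the notebook uses to fill the `num_words` column.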

In [None]:
df.loc[:, 'num_characters'] = df['text'].apply(len)
df.loc[:,'num_words'] = df['text'].apply(lambda x: len(nltk.word_tokenize(x)))
df.loc[:,'num_sentence'] = df['text'].apply(lambda x: len(nltk.sent_tokenize(x)))

In [None]:
df.head(2)

In [None]:
df[['num_characters', 'num_words', 'num_sentence']].describe()

##Summary Statistics for Legitimate Messages

In [None]:
df[df['target'] == 0][['num_characters', 'num_words', 'num_sentence']].describe()

##Summary Statistics for Spam Messages

<p><img alt="model" height="70px" src="https://raw.githubusercontent.com/isandrade-udea/LabIA/main/Captura%20desde%202024-05-24%2012-44-30.png" align="left" hspace="10px" vspace="0px"></p>


**Exercise**:
Filter the DataFrame `df` to select the rows where 'target' equals 1. Then show the descriptive statistics for the columns 'num_characters', 'num_words', and 'num_sentence'.


In [None]:
df[df['target'] == 1][['num_characters', 'num_words', 'num_sentence']].describe()

##Character Length Distribution for Legitimate and Spam Messages

In [None]:
# Create a figure and set the figure size
plt.figure(figsize=(10, 6))

# Plot the histogram for target 0 in blue
sns.histplot(df[df['target'] == 0]['num_characters'], color='blue', label='Target 0', kde=True)

# Plot the histogram for target 1 in red
sns.histplot(df[df['target'] == 1]['num_characters'], color='red', label='Target 1', kde=True)

# Add labels and a title
plt.xlabel('Number of Characters', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.title('Distribution of Number of Characters by Target', fontsize=16, fontweight='bold')

# Add a legend
plt.legend()

##Word Count Distribution for Legitimate and Spam Messages

<p><img alt="model" height="70px" src="https://raw.githubusercontent.com/isandrade-udea/LabIA/main/Captura%20desde%202024-05-24%2012-44-30.png" align="left" hspace="10px" vspace="0px"></p>

**Exercise**:
Show the word count distribution for ham and spam messages in a single plot.


In [None]:
# Create a figure and set the figure size
plt.figure(figsize=(10, 6))

# Plot the histogram for target 0 in blue
sns.histplot(df[df['target'] == 0]['num_words'], color='blue', label='Target 0', kde=True)

# Plot the histogram for target 1 in red
sns.histplot(df[df['target'] == 1]['num_words'], color='red', label='Target 1', kde=True)

# Add labels and a title
plt.xlabel('Number of Words', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.title('Distribution of Number of words by Target', fontsize=16, fontweight='bold')

# Add a legend
plt.legend()

#**6. Data Preprocessing**

<p><img alt="model" height="340px" src="https://raw.githubusercontent.com/isandrade-udea/datasets/7dce70e30f4e7af85331857e022b9c8d183c3d11/Captura%20desde%202024-05-23%2015-32-22.png" align="centering" hspace="10px" vspace="0px"></p>

In [None]:
# Importing the Porter Stemmer for text stemming
from nltk.stem.porter import PorterStemmer

# Importing the string module for handling special characters
import string

# Creating an instance of the Porter Stemmer
ps = PorterStemmer()

# Lowercase transformation and text preprocessing function
def transform_text(text):
    # Transform the text to lowercase
    text = text.lower()

    # Tokenization using NLTK
    text = nltk.word_tokenize(text)

    # Removing special characters
    y = []
    for i in text:
        if i.isalnum():
            y.append(i)

    # Removing stop words and punctuation
    text = y[:]
    y.clear()

    # Build the stop word set once; calling stopwords.words() for every token is slow
    stop_words = set(stopwords.words('english'))
    for i in text:
        if i not in stop_words and i not in string.punctuation:
            y.append(i)

    # Stemming using Porter Stemmer
    text = y[:]
    y.clear()
    for i in text:
        y.append(ps.stem(i))

    # Join the processed tokens back into a single string
    return " ".join(y)

In [None]:
transform_text('This in $ an Example')
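
To see what the first stages of `transform_text` do to this input, the sketch below traces them one at a time using only the standard library. It is a simplification under stated assumptions: `TOY_STOP_WORDS` is a tiny hand-picked set standing in for NLTK's English stop word list, whitespace splitting stands in for `nltk.word_tokenize`, and the final stemming stage is left out.

```python
# Tiny hand-picked stop word set, assumed for illustration only
TOY_STOP_WORDS = {'this', 'in', 'an', 'is', 'the', 'a'}

def trace_stages(text):
    # Stage 1: lowercase the text
    lowered = text.lower()
    # Stage 2: split into tokens (whitespace split as a stand-in tokenizer)
    tokens = lowered.split()
    # Stage 3: keep only alphanumeric tokens, which drops '$'
    alphanumeric = [t for t in tokens if t.isalnum()]
    # Stage 4: drop stop words
    kept = [t for t in alphanumeric if t not in TOY_STOP_WORDS]
    return {
        'lowered': lowered,
        'tokens': tokens,
        'alphanumeric': alphanumeric,
        'without_stop_words': kept,
    }

stages = trace_stages('This in $ an Example')
for name, value in stages.items():
    print(f'{name}: {value}')
```

In the notebook's function, the Porter stemmer would then reduce each remaining token to its stem before the tokens are joined back into a string.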

##Creating a New Column: 'transformed_text'

In [None]:
df['transformed_text'] = df['text'].apply(transform_text)

In [None]:
df.head(3)

##Word Cloud for Spam Messages

In [None]:
wc = WordCloud(width = 500, height = 500, min_font_size = 10, background_color = 'white')
spam_wc = wc.generate(df[df['target'] == 1]['transformed_text'].str.cat(sep = " "))
plt.figure(figsize = (15,6))
plt.imshow(spam_wc)
plt.show()

##Word Cloud for Not spam Messages

<p><img alt="model" height="70px" src="https://raw.githubusercontent.com/isandrade-udea/LabIA/main/Captura%20desde%202024-05-24%2012-44-30.png" align="left" hspace="10px" vspace="0px"></p>

**Exercise:**
Show the word cloud for ham messages.



##Find top 30 words of spam

In [None]:
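# Collect every word that appears in the preprocessed spam messages
spam_corpus = []
for sentence in df[df['target'] == 1]['transformed_text'].tolist():
    for word in sentence.split():
        spam_corpus.append(word)

In [None]:
from collections import Counter
# Count the words and keep the 30 most frequent as (word, count) rows
filter_df = pd.DataFrame(Counter(spam_corpus).most_common(30))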

In [None]:
sns.barplot(data=filter_df, x=filter_df[0], y=filter_df[1], hue=filter_df[0], palette='bright', legend=False)
plt.xticks(rotation = 90)
plt.show()
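
The counting step above relies on `collections.Counter`. A minimal sketch of how the corpus list and `most_common` fit together, using a few hypothetical preprocessed messages in place of the `transformed_text` column:

```python
from collections import Counter

# Hypothetical preprocessed messages, assumed for illustration only
toy_messages = ['free prize call', 'free prize win', 'free']

# Flatten the messages into one list of words
toy_corpus = [word for message in toy_messages for word in message.split()]

# most_common(n) returns (word, count) pairs sorted by descending count
top_words = Counter(toy_corpus).most_common(2)

print(top_words)
```

Passing this list of pairs to `pd.DataFrame` yields a frame with column `0` for the word and column `1` for the count, which is why the bar plot refers to `filter_df[0]` and `filter_df[1]`.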

##Find top 30 words of Not spam Messages

<p><img alt="model" height="70px" src="https://raw.githubusercontent.com/isandrade-udea/LabIA/main/Captura%20desde%202024-05-24%2012-44-30.png" align="left" hspace="10px" vspace="0px"></p>

**Exercise**:
Show a bar chart of the 30 most frequent words in ham messages.


#**7. Model Building**

## Initializing CountVectorizer and TfidfVectorizer


<p><img alt="model" height="340px" src="https://raw.githubusercontent.com/isandrade-udea/datasets/main/Captura%20desde%202024-05-23%2016-01-56.png" align="centering" hspace="10px" vspace="0px"></p>


In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
cv = CountVectorizer()
tfid = TfidfVectorizer(max_features = 3000)

##Dependent and Independent Variable

In [None]:
X = tfid.fit_transform(df['transformed_text']).toarray()
y = df['target'].values

##Split into Train and Test Data

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test , y_train, y_test = train_test_split(X,y,test_size = 0.20, random_state = 2)
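
Because spam is the minority class, one optional variation is to pass `stratify` so that both splits keep the same class ratio. The notebook's split above does not do this; the sketch below is only an illustration on toy arrays (`X_toy` and `y_toy` are hypothetical).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 16 ham (0) and 4 spam (1)
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array([0] * 16 + [1] * 4)

# stratify=y_toy keeps the 80/20 class ratio in both train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=2, stratify=y_toy
)

print(len(y_tr), int(y_tr.sum()))
print(len(y_te), int(y_te.sum()))
```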

##Import the Models

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

##Initialize the Models

In [None]:
SVC?

In [None]:
svc = SVC(kernel= "sigmoid", gamma  = 1.0)
knc = KNeighborsClassifier()
mnb = MultinomialNB()
dtc = DecisionTreeClassifier(max_depth = 5)
lrc = LogisticRegression(solver = 'liblinear', penalty = 'l1')
rfc = RandomForestClassifier(n_estimators = 50, random_state = 2 )
abc = AdaBoostClassifier(n_estimators = 50, random_state = 2)
bc = BaggingClassifier(n_estimators = 50, random_state = 2)
etc = ExtraTreesClassifier(n_estimators = 50, random_state = 2)
gbdt = GradientBoostingClassifier(n_estimators = 50, random_state = 2)
xgb  = XGBClassifier(n_estimators = 50, random_state = 2)

##Dictionary of the Models

In [None]:
clfs = {
    'SVC': svc,
    'KNN': knc,
    'NB': mnb,
    'DT': dtc,
    'LR': lrc,
    'RF': rfc,
    'Adaboost': abc,
    'Bgc': bc,
    'ETC': etc,
    'GBDT': gbdt,
    'xgb': xgb

}

##Train the Models

In [None]:
from sklearn.metrics import accuracy_score, precision_score
def train_classifier(clf, X_train, y_train, X_test, y_test):
    # Fit a single classifier and predict on the held-out test set
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    return accuracy , precision
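
The two metrics returned by `train_classifier` answer different questions: accuracy is the share of all messages classified correctly, while precision is the share of messages flagged as spam that really are spam. A small worked example with made-up labels (not results from the dataset):

```python
from sklearn.metrics import accuracy_score, precision_score

# Made-up labels for illustration: 1 = spam, 0 = ham
y_true = [0, 0, 0, 1, 1]
y_hat = [0, 1, 0, 1, 0]

# 3 of the 5 predictions match the true label
accuracy = accuracy_score(y_true, y_hat)

# 2 messages were flagged as spam; only 1 of them is truly spam
precision = precision_score(y_true, y_hat)

print(accuracy)
print(precision)
```

For a spam filter, low precision means legitimate messages end up in the spam folder, which is why it is tracked alongside accuracy.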

#**8. Evaluate the Models**

In [None]:
accuracy_scores = []
precision_scores = []
# Use a separate loop variable so the clfs dictionary is not overwritten
for name, clf in clfs.items():
    current_accuracy, current_precision = train_classifier(clf, X_train, y_train, X_test, y_test)
    print()
    print("For: ", name)
    print("Accuracy: ", current_accuracy)
    print("Precision: ", current_precision)

    accuracy_scores.append(current_accuracy)
    precision_scores.append(current_precision)

#**Software Version**

In [None]:
!pip install session_info

In [None]:
import session_info
session_info.show()