# Spam Detection Preprocessing

This notebook walks through the preprocessing steps for a neural-network spam email classifier: loading the data, cleaning the text, tokenizing, removing stop words, lemmatizing, converting the text to integer sequences, and padding those sequences to a fixed length.


In [1]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Download NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize stop words and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gonzalopereyrametnik/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/gonzalopereyrametnik/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Loading Data
We load the dataset from a CSV file.

In [None]:
def load_data(file_path):
    """
    Load data from a CSV file.
    """
    return pd.read_csv(file_path)

### Text Cleaning

This function replaces every non-alphabetic character (digits, punctuation, symbols) with a space and converts the text to lowercase.

In [None]:
def clean_text(text):
    """
    Perform basic text cleaning:
    - Replace non-alphabetic characters with spaces
    - Convert to lowercase
    """
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = text.lower()
    return text
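
As a quick check of the behaviour described above, the following sketch runs the same two steps on a made-up message. The sample string is hypothetical, and the function is repeated only so the snippet runs on its own.

```python
import re


def clean_text(text):
    """Replace non-alphabetic characters with spaces, then lowercase."""
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    return text.lower()


# Hypothetical sample message, not taken from the dataset
sample = "WIN a FREE prize!!! Call 0800-123 now."
cleaned = clean_text(sample)

# Each replaced character leaves a space behind, so runs of spaces appear;
# a later str.split() collapses them.
print(repr(cleaned))
print(cleaned.split())
```

Because the substitution is one-for-one, the cleaned string has the same length as the input. The extra spaces are harmless here since the next step splits on whitespace.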

### Tokenization and Stop Word Removal

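This function splits the text on whitespace and removes common English stop words. It expects lowercased input, because the NLTK stop word list is lowercase and the membership check is case-sensitive.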

In [None]:
def tokenize_and_remove_stopwords(text):
    """
    Tokenize text and remove stop words.
    """
    words = text.split()  # Tokenize
    words = [word for word in words if word not in stop_words]  # Remove stop words
    return words
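
A minimal sketch of the same step, assuming a tiny hand-written stop word set in place of NLTK's English list so that it runs without downloading any corpus. The set `demo_stop_words`, the extra `stop_words` parameter, and the sample sentences are illustrative only.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer
demo_stop_words = {"you", "have", "a", "the", "to"}


def tokenize_and_remove_stopwords(text, stop_words):
    """Split on whitespace and drop any token found in stop_words."""
    return [word for word in text.split() if word not in stop_words]


lowercased = "you have won a free ticket to the final"
mixed_case = "You have won a free ticket"

tokens_lower = tokenize_and_remove_stopwords(lowercased, demo_stop_words)
tokens_mixed = tokenize_and_remove_stopwords(mixed_case, demo_stop_words)

print(tokens_lower)
# The comparison is case-sensitive, so "You" survives unless the text is lowercased first
print(tokens_mixed)
```

This is why the pipeline calls `clean_text` before this step: lowercasing first ensures capitalized stop words are caught.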


### Lemmatization

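This function reduces each word to its dictionary form (lemma). Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed, so plurals are reduced ("cars" becomes "car") while verb forms such as "running" are left unchanged.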

In [None]:
def lemmatize_words(words):
    """
    Lemmatize words to their base form.
    """
    return [lemmatizer.lemmatize(word) for word in words]


### Full Preprocessing Pipeline

This function combines text cleaning, tokenization, stop word removal, and lemmatization.

In [None]:
def preprocess_text(text):
    """
    Full preprocessing pipeline for a single text:
    - Clean text
    - Tokenize and remove stop words
    - Lemmatize words
    """
    text = clean_text(text)
    words = tokenize_and_remove_stopwords(text)
    lemmatized_words = lemmatize_words(words)
    return ' '.join(lemmatized_words)  # Join words back into a single string

### Preprocessing DataFrame

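This function applies the preprocessing pipeline to every row of the given text column and stores the result in a new `cleaned_text` column. The DataFrame is modified in place and also returned.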

In [None]:
def preprocess_dataframe(df, text_column):
    """
    Apply preprocessing to a dataframe containing text data.
    """
    df['cleaned_text'] = df[text_column].apply(preprocess_text)
    return df
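
One case the pipeline does not handle is a missing value in the text column: `re.sub` raises a `TypeError` when given anything other than a string. The source does not say whether the dataset contains empty cells, so the sketch below is only a precaution. The helper `to_text` is hypothetical, and plain Python values stand in for DataFrame cells.

```python
import re


def clean_text(text):
    """Replace non-alphabetic characters with spaces, then lowercase."""
    return re.sub(r'[^a-zA-Z]', ' ', text).lower()


def to_text(value):
    """Hypothetical helper: map missing or non-string cells to an empty string."""
    return value if isinstance(value, str) else ""


# float('nan') is how pandas usually represents a missing cell in a text column
rows = ["Hello there", float("nan"), None]

try:
    clean_text(rows[1])
    raised = False
except TypeError:
    raised = True  # re.sub rejects non-string input

cleaned = [clean_text(to_text(value)) for value in rows]
print(raised, cleaned)
```

In pandas, a similar effect can be had by calling `df[text_column].fillna('')` before `apply`.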

### Splitting Data

This function splits the data into training and testing sets.

In [None]:

def split_data(df, text_column, label_column):
    """
    Split data into training (80%) and testing (20%) sets.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        df[text_column],
        df[label_column],
        test_size=0.2,
        random_state=42,  # Fixed seed for a reproducible split
    )
    return X_train, X_test, y_train, y_test
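
After splitting, it can be worth checking that both sets contain a similar share of spam. The sketch below does this with plain lists standing in for `y_train` and `y_test`. The label values are made up, 1 is assumed to mean spam, and `class_shares` is a hypothetical helper.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of each label value in an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up labels standing in for y_train and y_test (1 assumed to mean spam)
y_train_demo = [0, 0, 0, 1, 0, 1, 0, 0]
y_test_demo = [0, 1]

train_shares = class_shares(y_train_demo)
test_shares = class_shares(y_test_demo)

print(train_shares)  # share of each class in the training split
print(test_shares)   # share of each class in the test split
```

If the shares differ noticeably, `train_test_split` accepts a `stratify` argument (for example `stratify=df[label_column]`) that keeps the class proportions the same in both splits.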

### Preprocess Data

Load, preprocess, and split the dataset, then fit the tokenizer on the training text and convert both splits to integer sequences.

In [None]:
file_path = 'data/spam_ham_dataset.csv'
text_column = 'text'
label_column = 'label_num'

# Load data
df = load_data(file_path)

# Preprocess data
df = preprocess_dataframe(df, text_column)

# Split data
X_train, X_test, y_train, y_test = split_data(df, 'cleaned_text', label_column)

# Fit the tokenizer on the training text only, so the vocabulary is not influenced by the test set
tokenizer = Tokenizer(num_words=5000)  # Vocabulary size cap; adjust as needed
tokenizer.fit_on_texts(X_train)

# Convert text to sequences
X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)
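
To make the effect of `num_words` concrete, here is a pure-Python sketch of the idea behind `fit_on_texts` and `texts_to_sequences`: words are ranked by frequency starting at index 1, and only words whose index is below `num_words` are kept. It is a simplified illustration, not the Keras implementation (it skips Keras's punctuation filtering and lowercasing), and both function names are hypothetical.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


train_texts = ["free prize free", "free prize today", "free meeting"]
word_index = build_word_index(train_texts)

# "lottery" was never seen during fitting and "today" ranks too low,
# so both are silently dropped.
demo_sequences = to_sequences(["free lottery prize today"], word_index, num_words=3)
print(word_index)
print(demo_sequences)
```

With the real `Tokenizer`, passing an `oov_token` at construction reserves an index for unseen words instead of dropping them.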

### Padding Sequences
Pad (or truncate) the integer sequences produced above to a uniform length so they can be stacked into a single array.

In [10]:
max_sequence_length = 2011  # Fixed length: shorter sequences are zero-padded, longer ones truncated
X_train_padded = pad_sequences(X_train_sequences, maxlen=max_sequence_length)
X_test_padded = pad_sequences(X_test_sequences, maxlen=max_sequence_length)
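
By default `pad_sequences` pads and truncates at the start of each sequence (`padding='pre'`, `truncating='pre'`), so short emails get leading zeros and long emails keep their last `maxlen` tokens. The pure-Python sketch below mimics that default on toy data. `pad_pre` is a hypothetical helper, not part of Keras, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items of long sequences."""
    padded = []
    for seq in sequences:
        tail = list(seq[-maxlen:])                            # truncate from the front
        padded.append([value] * (maxlen - len(tail)) + tail)  # pad at the front
    return padded


# Toy sequences: one shorter and one longer than maxlen
toy_sequences = [[5, 3], [7, 1, 4, 9, 2, 8]]
toy_padded = pad_pre(toy_sequences, maxlen=4)
print(toy_padded)
```

To pad or cut at the end instead, the real function accepts `padding='post'` and `truncating='post'`.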

### Saving Processed Data
Save the preprocessed and tokenized data for future use.

In [11]:
np.save('data/X_train_padded.npy', X_train_padded)
np.save('data/X_test_padded.npy', X_test_padded)
np.save('data/y_train.npy', y_train)
np.save('data/y_test.npy', y_test)
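# to_json() returns a string, so write it as text; np.save would append
# '.npy' to the filename and store a NumPy array instead of plain JSON
with open('data/tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(tokenizer.to_json())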
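
Since `tokenizer.to_json()` returns a string, a plain text file round trip is enough to persist it. A minimal sketch, using a small stand-in JSON string rather than a real tokenizer and a temporary directory rather than `data/`:

```python
import json
import os
import tempfile

# Stand-in for tokenizer.to_json(); the real string holds the full tokenizer config
tokenizer_json = json.dumps({"config": {"num_words": 5000}})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer.json")

    # Write the string as text
    with open(path, "w", encoding="utf-8") as f:
        f.write(tokenizer_json)

    # Read it back unchanged
    with open(path, encoding="utf-8") as f:
        restored = f.read()

# Confirm the restored text still parses as JSON
config = json.loads(restored)
print(config["config"]["num_words"])
```

With the real string, `tokenizer_from_json` from `tensorflow.keras.preprocessing.text` can rebuild the tokenizer in TensorFlow versions that provide that module.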

In summary, the pipeline loads the data, cleans and lemmatizes the text, converts it to padded integer sequences, and saves the results along with the fitted tokenizer for later use in training a neural network.