<h1 style='text-align: center; font-size: 38px; font-weight: bold; color: #5a189a;'> Spam Classification </h1>
<p style='text-align: center; color: #212121;'> This is a special notebook. I'll be exploring another classification problem in NLP, but this time I'll also be commenting on everything I do, in the hope that it helps someone in the future. This exercise is also a way for me to organize my thoughts, put into practice what I've learnt, and retain knowledge more easily. </p>

<h3 style='color: #7b2cbf; text-align: center;'> Setting Up </h3>

<p style='text-align: justify; color: #212121; font-size: 16px;'> First of all, I'm going to install some libraries that I might use later on, and then verify the GPU status for this machine. </p>

In [38]:
%pip install chardet plotly loguru demoji --quiet

Note: you may need to restart the kernel to use updated packages.


In [39]:
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using CUDA device: {torch.cuda.get_device_name(device)}")
    print(f"Device capability: {torch.cuda.get_device_capability(device)}")
    print(f"Total memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.2f} GB")
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
else:
    print("CUDA is not available on this device.")

Using CUDA device: NVIDIA RTX A6000
Device capability: (8, 6)
Total memory: 47.54 GB
Number of CUDA devices: 1


<p style='text-align: justify; color: #212121; font-size: 16px;'> Now I'll import the required libraries so I can get everything set up. In addition, I'll detect the encoding of my data file so that Pandas can read it without decoding errors. </p>

In [66]:
from torch.utils.data import Dataset, DataLoader
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import torch.nn.functional as F
from loguru import logger
import torch.nn as nn
import pandas as pd
import numpy as np
import chardet
import torch
import tqdm
import nltk

with open('data/email_spam.csv', 'rb') as f:
    result = chardet.detect(f.read())
    encoding = result['encoding']

df = pd.read_csv('data/email_spam.csv', encoding=encoding)

if len(df) != 0:
    logger.success(f'Dataset loaded successfully with {len(df)} rows')
else:
    logger.error('Dataset not loaded')

2023-08-16 04:00:29.365 | SUCCESS | __main__:<module>:22 - Dataset loaded successfully with 5572 rows

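<p style='text-align: justify; color: #212121; font-size: 16px;'> A small aside: chardet returns a best guess together with a confidence score, so it can be worth cross-checking that guess. The sketch below is not part of the original workflow. It uses only the standard library and a hypothetical helper, <code>first_working_encoding</code>, that tries a short list of candidate encodings in order and returns the first one that decodes the whole file. The candidate list is an assumption and should be adapted to the data at hand. </p>

```python
import os
import tempfile
from pathlib import Path


def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper for illustration. Note that 'latin-1' accepts any
    byte sequence, so it only makes sense as the last resort in the list.
    """
    raw = Path(path).read_bytes()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"None of {candidates} could decode {path}")


# Demo on a throwaway file, so the sketch does not depend on the real dataset
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "wb") as f:
        f.write("label,text\nham,café\n".encode("cp1252"))
    demo_encoding = first_working_encoding(demo_path)
    print(demo_encoding)
```

<p style='text-align: justify; color: #212121; font-size: 16px;'> If this helper and chardet disagree, that is a hint to inspect a few rows by hand before trusting either result. </p>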

<p style='text-align: justify; color: #212121; font-size: 16px;'> I might be a perfectionist, but I'll rename the columns, relabel 'ham' as 'not spam', and reorder the columns so that the text comes first. </p>

In [68]:
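print(f'Previous columns: {df.columns.tolist()}')
# Rename the original columns ('Category', 'Message') to shorter names
df.columns = ['label', 'text']
print(f'New columns: {df.columns.tolist()}')
# Make the labels self-explanatory: 'ham' becomes 'not spam'
df['label'] = df['label'].map({'ham': 'not spam', 'spam': 'spam'})
# Put the text first and the label last
df = df[['text', 'label']]
df.head()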

Previous columns: ['Category', 'Message']
New columns: ['label', 'text']


Unnamed: 0,text,label
0,"Go until jurong point, crazy.. Available only ...",not spam
1,Ok lar... Joking wif u oni...,not spam
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor... U c already then say...,not spam
4,"Nah I don't think he goes to usf, he lives aro...",not spam


<h3 style='color: #7b2cbf; text-align: center;'> Quick EDA </h3>

<p style='text-align: justify; color: #212121; font-size: 16px;'> Now I want to start a quick process that I call <b>Quick EDA</b>: a short exploratory data analysis that gives me an overall picture of how my data behaves and what type of problem I'm dealing with. Some useful questions we might come up with are as follows: </p>

<ul>
  <li><p style='text-align: justify; color: #212121; font-size: 16px;'>Is this an imbalanced or a balanced problem?</p></li>
  <li><p style='text-align: justify; color: #212121; font-size: 16px;'>Are there any null values? If yes, then how many?</p></li>
  <li><p style='text-align: justify; color: #212121; font-size: 16px;'>Are there any duplicates? If yes, how many? Could they compromise our model?</p></li>
  <li><p style='text-align: justify; color: #212121; font-size: 16px;'>How many raw tokens do we have approximately?</p></li>
</ul>
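
<p style='text-align: justify; color: #212121; font-size: 16px;'> The four questions above can be answered in a few lines of Pandas. What follows is a minimal sketch under some assumptions: the frame has the <code>text</code> and <code>label</code> columns created earlier, <code>quick_eda</code> is a hypothetical helper name, and "tokens" are approximated by splitting on whitespace rather than by a real tokenizer. A toy frame stands in for the dataset so the example runs on its own. </p>

```python
import pandas as pd


def quick_eda(df: pd.DataFrame, text_col: str = "text", label_col: str = "label") -> dict:
    """Summarise class balance, nulls, duplicates and a rough token count.

    Hypothetical helper for illustration; tokens are counted by whitespace
    splitting, which is only an approximation of a real tokenizer.
    """
    return {
        # Share of each class: a strongly skewed split signals imbalance
        "class_share": df[label_col].value_counts(normalize=True).round(3).to_dict(),
        # Number of missing values per column
        "null_counts": df.isnull().sum().to_dict(),
        # Rows that are exact copies of an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Whitespace-separated tokens over all messages (nulls count as empty)
        "approx_tokens": int(df[text_col].fillna("").str.split().str.len().sum()),
    }


# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "text": ["Free entry to win", "Ok lar", "Ok lar", None],
    "label": ["spam", "not spam", "not spam", "not spam"],
})
summary = quick_eda(toy)
print(summary)
```

<p style='text-align: justify; color: #212121; font-size: 16px;'> On the real data, the call would simply be <code>quick_eda(df)</code>. </p>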