# Email Analysis and Classification for SPAM Detection

**Project Description**

A company specializing in Artificial Intelligence-based automation aims to develop a software library for analyzing and classifying incoming emails.  
The main objective is to identify **SPAM emails** in order to perform in-depth content analysis and improve the security of corporate communications.

The project originates from the CEO's need to better understand **trends, content, and behaviors** associated with SPAM emails. These insights will be used to:
- enhance anti-spam filters;
- strengthen communication security;
- support strategic decisions in the cybersecurity domain.

---

**Project Objectives**

Based on an email dataset provided by the CTO, the project aims to:

- Train a **classifier** to identify SPAM emails.
- Identify the **main topics** among emails classified as SPAM.
- Compute the **semantic distance between topics** to evaluate the heterogeneity of SPAM content.
- Extract information about **organizations mentioned** in **NON-SPAM** emails.
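
As a rough illustration of the third objective, one common way to measure the semantic distance between two topics is the cosine distance between their vector representations. The sketch below assumes each topic has already been reduced to a numeric vector (for example a topic embedding or a topic-word weight vector). The function name `cosine_distance` and the toy vectors are hypothetical, and the method used later in the project may differ.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


# Toy topic vectors, made up for illustration only (not project data).
topic_a = [1.0, 0.0, 0.0]
topic_b = [0.0, 1.0, 0.0]  # shares no direction with topic_a
topic_c = [2.0, 0.0, 0.0]  # same direction as topic_a, different magnitude

dist_ab = cosine_distance(topic_a, topic_b)
dist_ac = cosine_distance(topic_a, topic_c)
```

With this measure, distances near 0 indicate topics pointing in the same direction, while larger values indicate more heterogeneous SPAM content.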


**Dataset**

👉 https://github.com/ProfAI/natural-language-processing/tree/main/datasets/Verifica%20Finale%20-%20Spam%20Detection


## Import

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import seaborn as sns

'''
from tqdm import tqdm
import time
import joblib
import random

# nlp import
import gensim
from gensim.utils import simple_preprocess
from gensim import corpora, models
from bertopic import BERTopic

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

import spacy
import spacy.displacy as displacy
import re

from pprint import pprint

# models 
from sklearn.neural_network import MLPClassifier
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression 
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from lightgbm import LGBMClassifier

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MaxAbsScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV, cross_validate

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, roc_auc_score, classification_report, make_scorer

'''



## Functions

## Load Dataset

In [5]:
link_dataset = "https://raw.githubusercontent.com/ProfAI/natural-language-processing/refs/heads/main/datasets/Verifica%20Finale%20-%20Spam%20Detection/spam_dataset.csv"

In [6]:
df = pd.read_csv(link_dataset)

df

| Unnamed: 0.1 | Unnamed: 0 | label | text | label_num |
|---|---|---|---|---|
| 0 | 605 | ham | Subject: enron methanol ; meter # : 988291\nth... | 0 |
| 1 | 2349 | ham | Subject: hpl nom for january 9 , 2001\n( see a... | 0 |
| 2 | 3624 | ham | Subject: neon retreat\nho ho ho , we ' re arou... | 0 |
| 3 | 4685 | spam | Subject: photoshop , windows , office . cheap ... | 1 |
| 4 | 2030 | ham | Subject: re : indian springs\nthis deal is to ... | 0 |
| ... | ... | ... | ... | ... |
| 5166 | 1518 | ham | Subject: put the 10 on the ft\nthe transport v... | 0 |
| 5167 | 404 | ham | Subject: 3 / 4 / 2000 and following noms\nhpl ... | 0 |
| 5168 | 2933 | ham | Subject: calpine daily gas nomination\n>\n>\nj... | 0 |
| 5169 | 1409 | ham | Subject: industrial worksheets for august 2000... | 0 |
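
The preview suggests that the frame carries an index-like `Unnamed: 0` column next to the actual fields, and that `label` and `label_num` encode the same information (`ham` as 0, `spam` as 1 in the rows shown). A minimal sketch of a sanity check before the EDA follows. It assumes the column names from the preview and uses a small hand-made frame `sample` in place of `df` so that it runs offline. The helper `tidy_columns` is hypothetical, not part of the original notebook.

```python
import pandas as pd


def tidy_columns(frame):
    """Return a copy of the frame without index-like 'Unnamed: ...' columns."""
    unnamed = [c for c in frame.columns if str(c).startswith("Unnamed")]
    return frame.drop(columns=unnamed)


# Hand-made stand-in for df; the text values are placeholders, not real emails.
sample = pd.DataFrame({
    "Unnamed: 0": [605, 2349, 4685],
    "label": ["ham", "ham", "spam"],
    "text": ["Subject: a", "Subject: b", "Subject: c"],
    "label_num": [0, 0, 1],
})

clean = tidy_columns(sample)

# Check that the text label and the numeric label agree (assumed mapping).
expected = clean["label"].map({"ham": 0, "spam": 1})
labels_consistent = bool((expected == clean["label_num"]).all())

# Class balance, useful to know before training a classifier.
class_counts = clean["label"].value_counts()
```

Applied to the real data, the same steps would be `tidy_columns(df)` followed by the consistency check and `value_counts()` on the `label` column.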


# EDA