<a href="https://colab.research.google.com/github/pedrohhenriqueas/EmailClassificatorSolution/blob/main/EmailClassificatorSolution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Email Classificator Solution

### Import the libraries needed to run the project

In [184]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### Load the dataset

In [185]:
df = pd.read_csv('spam_assassin.csv')

### Split the data into the independent variable (X) and the dependent variable (y)

In [186]:
X = df['text']
y = df['target']

### Split the data into 85% for training and 15% for testing

In [164]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

### Check the sizes of the training and test sets

In [187]:
print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

Training set size: 4926
Test set size: 870
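
The two sizes printed above are consistent with the `test_size=0.15` passed to `train_test_split`. The sketch below reproduces that arithmetic, assuming the test count is rounded up and the training set takes the remainder, which is how scikit-learn is understood to handle a fractional `test_size`. The names `n_samples`, `n_test`, and `n_train` are illustrative and do not appear in the notebook.

```python
import math

# Total number of rows, taken from the two sizes printed by the notebook.
n_samples = 4926 + 870
test_size = 0.15

# Assumption: the test count is rounded up; the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
print(f"test fraction: {n_test / n_samples:.4f}")
```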


### Create the TF-IDF vectorizer

In [188]:
vectorizer = TfidfVectorizer(stop_words='english')

### Fit and transform the training data

In [189]:
X_train_tfidf = vectorizer.fit_transform(X_train)

### Transform the test data

In [168]:
X_test_tfidf = vectorizer.transform(X_test)

### Check the dimensions of the transformed data

In [169]:
print(f"X_train_tfidf shape: {X_train_tfidf.shape}")
print(f"X_test_tfidf shape: {X_test_tfidf.shape}")

X_train_tfidf shape: (4926, 120169)
X_test_tfidf shape: (870, 120169)
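
Both matrices have 120169 columns because the vocabulary is learned from the training set only and then reused for the test set. The pure-Python sketch below illustrates that fit/transform split on a toy corpus. It is a simplification rather than the `TfidfVectorizer` implementation: it tokenizes on whitespace, applies no stop-word filtering, and assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization. The functions `fit_idf` and `transform` and the example documents are hypothetical.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.lower().split()))
    vocab = sorted(doc_freq)
    # Assumed smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf

def transform(docs, vocab, idf):
    """Project documents onto the learned vocabulary; unseen words are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return rows

train_docs = ["free offer now", "meeting at noon", "free meeting"]
test_docs = ["free lunch offer"]  # "lunch" never appears in the training data

vocab, idf = fit_idf(train_docs)
train_matrix = transform(train_docs, vocab, idf)
test_matrix = transform(test_docs, vocab, idf)

print(vocab)
print(len(train_matrix[0]), len(test_matrix[0]))
```

Fitting on the training data alone keeps test-set statistics out of the features, which is why the notebook calls `fit_transform` on `X_train` and only `transform` on `X_test`.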


### Create and train the model

In [170]:
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)
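
For intuition about what a multinomial Naive Bayes classifier learns, here is a minimal pure-Python sketch. It works on raw word counts with additive smoothing `alpha=1.0`, whereas the notebook feeds TF-IDF weights to `MultinomialNB`, so it illustrates the idea rather than reproducing the model above. The functions `train_nb` and `predict_nb` and the toy emails are hypothetical, and label 1 is assumed to mean spam.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log word likelihoods."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        term_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t]
            for t in doc.lower().split()
            if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)

docs = [
    "win free prize now",
    "free gift card",
    "meeting moved to monday",
    "lunch on monday",
]
labels = [1, 1, 0, 0]  # assumed encoding: 1 = spam, 0 = ham

log_prior, log_likelihood = train_nb(docs, labels)
print(predict_nb("free prize", log_prior, log_likelihood))
print(predict_nb("monday meeting", log_prior, log_likelihood))
```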

### Check accuracy on the training set

In [171]:
train_accuracy = model.score(X_train_tfidf, y_train)
print(f"Training set accuracy: {train_accuracy:.4f}")

Training set accuracy: 0.9425


### Evaluate accuracy on the test set

In [172]:
test_accuracy = model.score(X_test_tfidf, y_test)
print(f"Test set accuracy: {test_accuracy:.4f}")

Test set accuracy: 0.9356
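
Accuracy alone does not show how the errors are distributed between the two classes. The notebook imports `confusion_matrix` and `classification_report` but does not call them in the cells shown, so the sketch below computes the same kind of counts by hand. The lists `y_true` and `y_pred` are made-up illustrative values, not results from this model, and 1 is assumed to mean spam.

```python
# Hypothetical labels for illustration only (1 = spam, 0 = ham).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed through

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # share of flagged emails that really are spam
recall = tp / (tp + fn)     # share of spam emails that were flagged

print(f"confusion matrix: [[{tn}, {fp}], [{fn}, {tp}]]")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

For a spam filter, low recall means spam reaches the inbox and low precision means legitimate mail is flagged, so both are worth inspecting alongside the accuracy figure.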


### Example new emails to classify as spam or not

In [180]:
new_emails = [
    "Exclusive offer just for you! Buy 1 get 1 free on all products. Hurry up!",
    "Dear friend, I hope you are doing well. Would love to catch up soon.",
    "Great news! You've been selected for an exclusive membership with benefits.",
    "Can we reschedule our meeting to next Wednesday? Let me know.",
    "Congratulations! You've won a gift card worth $1000. Click here to redeem.",
    "I wanted to thank you for your time during the meeting yesterday. Looking forward to working together.",
    "Get a free consultation on your tax filing. Book an appointment today.",
    "Important security update! Please verify your account to avoid suspension."
]

### Transform the new emails using the same vectorizer

In [181]:
new_emails_tfidf = vectorizer.transform(new_emails)

### Make predictions

In [182]:
predictions = model.predict(new_emails_tfidf)

### Display the predictions

In [183]:
for email, prediction in zip(new_emails, predictions):
    print(f"Email: {email}\nClassification: {'SPAM' if prediction == 1 else 'HAM'}\n")

Email: Exclusive offer just for you! Buy 1 get 1 free on all products. Hurry up!
Classification: SPAM

Email: Dear friend, I hope you are doing well. Would love to catch up soon.
Classification: HAM

Email: Great news! You've been selected for an exclusive membership with benefits.
Classification: HAM

Email: Can we reschedule our meeting to next Wednesday? Let me know.
Classification: HAM

Email: Congratulations! You've won a gift card worth $1000. Click here to redeem.
Classification: HAM

Email: I wanted to thank you for your time during the meeting yesterday. Looking forward to working together.
Classification: HAM

Email: Get a free consultation on your tax filing. Book an appointment today.
Classification: SPAM

Email: Important security update! Please verify your account to avoid suspension.
Classification: HAM

