<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M08-deep-learning/AT%26T_logo_2016.svg" alt="AT&T LOGO" width="50%" />

# AT&T SPAM detector

## Company's Description 📇

AT&T Inc. is an American multinational telecommunications holding company headquartered at Whitacre Tower in Downtown Dallas, Texas. It is the world's largest telecommunications company by revenue and the third largest provider of mobile telephone services in the U.S. As of 2022, AT&T was ranked 13th on the Fortune 500 rankings of the largest United States corporations, with revenues of $168.8 billion! 😮

## Project 🚧

One of the main pain points AT&T users face is constant exposure to SPAM messages.

AT&T has been able to flag spam messages manually for some time, but it is now looking for an automated way to detect spam and protect its users.

## Goals 🎯

Your goal is to build a spam detector that automatically flags spam messages as they arrive, based solely on the SMS content.

## Scope of this project 🖼️

To start off, AT&T would like you to use the following dataset:

[Download the Dataset](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/project/spam.csv)

In [None]:
%cd Data
%pwd
!wget https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/project/spam.csv
%cd ..

## Helpers 🦮

Here are a few tips to help you with this project:

### Start simple
A good deep learning model does not necessarily have to be super complicated!

### Transfer learning
You do not have access to a whole lot of data, so channeling the power of a more sophisticated model trained on billions of observations might help!

## Deliverable 📬

To complete this project, your team should: 

* Write a notebook that runs preprocessing and trains one or more deep learning models to predict whether each SMS is spam or ham
* State the achieved performance clearly
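
One way to state performance clearly is to report precision, recall, and F1 for the spam class alongside accuracy, since accuracy alone can look flattering when one class dominates. The sketch below is a minimal pure-Python illustration; the function name `spam_metrics` and the toy labels are hypothetical and not part of the project brief.

```python
def spam_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for the positive (spam = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that slipped through
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham correctly passed
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (made up for illustration): 1 = spam, 0 = ham
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
metrics = spam_metrics(y_true, y_pred)
print(metrics)
```

On these toy labels the accuracy is 0.7 even though only one of the three spam messages is caught, which is why recall on the spam class is worth reporting separately.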

# Project


In [None]:
pip install datasets transformers evaluate rouge_score -q

In [None]:
import pandas as pd
import torch
from transformers import AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification
from transformers import pipeline

In [None]:
spam_dataset = pd.read_csv("./Data/spam.csv",encoding="ISO-8859-1")

## Dataset exploration

We explore the dataset to understand its structure in relation to the spam analysis problem at hand.

In [None]:
display(spam_dataset.head())
spam_dataset.info()  # info() prints its report and returns None, so display() is not needed
display(spam_dataset.describe())


# ETL on the SPAM dataset

In [None]:
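# Drop only rows missing a label or a message body. The overflow columns
# ('Unnamed: 2' to 'Unnamed: 4') are NaN whenever a message fits in 'v2',
# so a bare dropna() would discard those rows as well.
spam_dataset.dropna(subset=['v1', 'v2'], inplace=True)
spam_dataset.rename(columns={'v1': 'type', 'v2': 'line1', 'Unnamed: 2': 'line2', 'Unnamed: 3': 'line3', 'Unnamed: 4': 'line4'}, inplace=True)
spam_dataset.head(30)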

We merge the lines of each text message into a single message.

In [None]:
spam_dataset['message'] = spam_dataset[['line1', 'line2', 'line3', 'line4']].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)
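
The merge step can be illustrated on a small frame. This is a sketch on made-up rows, assuming (as the column names suggest) that the overflow columns are empty for most messages; the variable `toy` is hypothetical. It also shows why a bare `dropna()` is risky with this layout.

```python
import pandas as pd

# Toy frame (made-up rows) mimicking the CSV layout: overflow columns are mostly empty
toy = pd.DataFrame({
    "line1": ["Hello there", "WIN a prize"],
    "line2": [None, "call now"],
    "line3": [None, None],
    "line4": [None, None],
})

# A bare dropna() removes every row that has at least one empty column
rows_left = len(toy.dropna())
print(rows_left)

# Joining only the non-null parts keeps every message intact
toy["message"] = toy[["line1", "line2", "line3", "line4"]].apply(
    lambda row: " ".join(row.dropna().astype(str)), axis=1
)
print(toy["message"].tolist())
```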


We turn the 'type' column (originally 'v1') into a binary 'spam' column, where 'spam' becomes 1 and 'ham' becomes 0.

In [None]:
mapping = {
  "spam": 1,
  "ham": 0
}
spam_dataset["spam"] = spam_dataset["type"].map(mapping)
spam_dataset = spam_dataset[['spam', 'message','type']]
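
A quick sanity check after this kind of mapping is to count the labels that did not match any key, because `Series.map` turns them into NaN silently. The sketch below uses made-up labels; the stray `"Spam "` value is a hypothetical example of an unexpected entry, not something observed in the dataset.

```python
import pandas as pd

# Toy labels (made up); "Spam " stands in for an unexpected value
toy = pd.DataFrame({"type": ["ham", "spam", "ham", "Spam "]})
mapping = {"spam": 1, "ham": 0}

toy["spam"] = toy["type"].map(mapping)

# Labels missing from the mapping become NaN, so count them before training
unmapped = int(toy["spam"].isna().sum())
print(unmapped)

# Class balance of the rows that mapped cleanly
print(toy["spam"].value_counts().to_dict())
```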


## Checking the transformed data

In [None]:
print(spam_dataset.isnull().sum())
spam_dataset

In [None]:
from datasets import Dataset

spam_dataset = Dataset.from_pandas(spam_dataset)

## Dataset preprocessing
The text must be converted into a sequence of numbers with a "tokenizer".
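
To make the idea concrete, here is a toy word-level tokenizer in pure Python. It is only an illustration of the text-to-ids principle: the BERT tokenizer used in this notebook works on subword pieces and adds special tokens, which this sketch does not attempt. The helper names `build_vocab` and `encode` are hypothetical.

```python
def build_vocab(texts):
    """Assign an integer id to every lowercase word; id 0 is reserved for unknown words."""
    vocab = {"[UNK]": 0}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Convert a text into a list of ids, using 0 for words absent from the vocabulary."""
    return [vocab.get(word, 0) for word in text.lower().split()]


vocab = build_vocab(["free prize now", "see you now"])
ids = encode("Free pizza now", vocab)
print(ids)  # "pizza" is not in the vocabulary, so it maps to 0
```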

In [None]:
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    # Truncate messages that exceed the model's maximum sequence length
    return tokenizer(example["message"], truncation=True)

# batched=True tokenizes many messages per call, which is faster
tokenized_datasets = spam_dataset.map(tokenize_function, batched=True)

# Token count of each message; lengths vary, so batches will need padding
samples = tokenized_datasets[:]
[len(x) for x in samples["input_ids"]]


In [None]:
tokenized_datasets

## Training and test sets
The dataset is clean and ready. It now needs to be split into a training set and a test set; a validation set can then be carved out of the training portion if needed.

In [None]:
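from sklearn.model_selection import train_test_split

# spam_dataset is a datasets.Dataset at this point, so convert it back to pandas before splitting
spam_df = spam_dataset.to_pandas()
x = spam_df.drop('spam', axis=1)
y = spam_df['spam']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)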

print("Train set:", x_train.shape, y_train.shape)
print("Test set:", x_test.shape, y_test.shape)
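
The split above yields a training set and a test set only. If a validation set is also wanted, one common approach is to call `train_test_split` twice. The sketch below runs on made-up data; the 60/20/20 ratio, the use of `stratify`, and the names `x_rest`, `x_val`, and `y_val` are assumptions, not choices made in this notebook.

```python
from sklearn.model_selection import train_test_split

# Toy data (made up): 20 messages, 5 of them spam
messages = [f"message {i}" for i in range(20)]
labels = [1 if i % 4 == 0 else 0 for i in range(20)]

# First hold out 20% as the test set; stratify keeps the spam ratio in each part
x_rest, x_test, y_rest, y_test = train_test_split(
    messages, labels, test_size=0.2, random_state=42, stratify=labels
)

# Then carve a validation set out of the remainder (0.25 of 80% = 20% overall)
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(x_train), len(x_val), len(x_test))
```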

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
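# Spam detection is a text-classification task, not question answering.
# Note: the classification head of a plain pretrained checkpoint is randomly
# initialised, so these predictions are only meaningful after fine-tuning.
classifier = pipeline("text-classification", model=checkpoint, tokenizer=tokenizer)
spam_preds = classifier(x_train["message"].to_list(), truncation=True)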