<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M08-deep-learning/AT%26T_logo_2016.svg" alt="AT&T LOGO" width="50%" />

# AT&T SPAM detector

## Company's Description 📇

AT&T Inc. is an American multinational telecommunications holding company headquartered at Whitacre Tower in Downtown Dallas, Texas. It is the world's largest telecommunications company by revenue and the third largest provider of mobile telephone services in the U.S. As of 2022, AT&T was ranked 13th on the Fortune 500 rankings of the largest United States corporations, with revenues of $168.8 billion! 😮

## Project 🚧

One of the main pain points AT&T users face is constant exposure to SPAM messages.

AT&T has been able to flag spam messages manually for some time, but the company is now looking for an automated way to detect spam and protect its users.

## Goals 🎯

Your goal is to build a spam detector that can automatically flag spam messages as they arrive, based solely on the SMS content.

## Scope of this project 🖼️

To start off, AT&T would like you to use the following dataset:

[Download the Dataset](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/project/spam.csv)

In [37]:
#%cd Data
#%pwd
#!wget https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/project/spam.csv
#%cd ..

## Helpers 🦮

To help you achieve this project, here are a few tips that should help you:

### Start simple
A good deep learning model does not necessarily have to be super complicated!

### Transfer learning
You do not have access to much data, so channeling the power of a more sophisticated model trained on billions of observations might help!

## Deliverable 📬

To complete this project, your team should:

* Write a notebook that runs preprocessing and trains one or more deep learning models to predict whether an SMS is spam or ham
* State the achieved performance clearly

In [38]:
# if use of google colab
#from google.colab import drive
#drive.mount('/content/drive')

# Project


In [39]:
pip install datasets transformers evaluate rouge_score -q

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


In [96]:
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, random_split
from transformers import AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification
from transformers import pipeline,TrainingArguments,Trainer,EvalPrediction

In [41]:
device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
print(f"Using {device} device")

Using cuda device


In [42]:
#ROOT_DIR="/content/drive/MyDrive/Colab Notebooks/Jedha/dsfs-ft-35/att"
ROOT_DIR="."

Clean up the file to remove invalid characters.

In [43]:
import unicodedata

def clean_international_text(text):
    # Normalize Unicode characters
    normalized = unicodedata.normalize('NFKD', text)
    # Remove non-ASCII characters
    ascii_text = normalized.encode('ASCII', 'ignore').decode('ASCII')
    return ascii_text
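
To see what this normalization does in practice, here is a small check on made-up strings (the sample strings are illustrative assumptions, not rows from the dataset). Accented letters lose their accents, while symbols with no ASCII decomposition, such as the pound sign, are dropped entirely, which may remove a useful spam signal.

```python
import unicodedata

def clean_international_text(text):
    # Same logic as the notebook cell above: decompose, then drop non-ASCII.
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ASCII", "ignore").decode("ASCII")

# Illustrative inputs (hypothetical, not taken from spam.csv)
examples = ["café", "naïve", "Win £1000 now"]
cleaned = [clean_international_text(s) for s in examples]
print(cleaned)
```

If currency symbols matter for the task, replacing them with a placeholder word before this step would be one option to consider.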

In [44]:
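filenameIn = ROOT_DIR + "/Data/spam.csv"
filename = ROOT_DIR + "/Data/spam_clean.csv"
# Read the raw file as ISO-8859-1, strip non-ASCII characters, and write a clean copy
with open(filenameIn, "r", encoding="ISO-8859-1") as f:
    output = clean_international_text(f.read())
with open(filename, "w", encoding="ascii") as f:
    f.write(output)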

In [45]:
spam_dataset = pd.read_csv(ROOT_DIR+"/Data/spam_clean.csv")#,encoding="ISO-8859-1"

In [46]:
spam_dataset.shape

(5572, 5)

## Dataset exploration

We explore the dataset to understand its structure with respect to the spam analysis problem at hand.

In [47]:
display(spam_dataset.head())
display(spam_dataset.info())
display(spam_dataset.describe())
display(spam_dataset.shape)


|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |  |  |  |
| 1 | ham | Ok lar... Joking wif u oni... |  |  |  |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |  |  |  |
| 3 | ham | U dun say so early hor... U c already then say... |  |  |  |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |  |  |  |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


None

|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| count | 5572 | 5572 | 50 | 12 | 6 |
| unique | 2 | 5169 | 43 | 10 | 5 |
| top | ham | Sorry, I'll call later | bt not his girlfrnd... G o o d n i g h t . . .@" | MK17 92H. 450Ppw 16" | GNT:-)" |
| freq | 4825 | 30 | 3 | 2 | 2 |


(5572, 5)
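
The `describe()` output above already shows a strong class imbalance: 4825 of the 5572 messages are ham. A quick calculation, using only those two counts, gives the accuracy of a trivial classifier that always answers ham, which is a useful floor for judging the models later on.

```python
# Counts read from the describe() output above
total_messages = 5572
ham_count = 4825

spam_count = total_messages - ham_count
majority_baseline = ham_count / total_messages  # accuracy of "always predict ham"

print(f"spam messages: {spam_count} ({spam_count / total_messages:.1%})")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Because of this imbalance, precision and recall on the spam class are more informative than accuracy alone.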

# ETL on the SPAM dataset

In [48]:
#spam_dataset.dropna(inplace=True)
spam_dataset.rename(columns={ 'v1': 'type','v2': 'line1', 'Unnamed: 2': 'line2','Unnamed: 3': 'line3','Unnamed: 4': 'line4'}, inplace=True)
spam_dataset.shape

(5572, 5)

We merge the lines of each text into a single message.

In [49]:
spam_dataset['message'] = spam_dataset[['line1', 'line2', 'line3', 'line4']].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)
spam_dataset['message'] =spam_dataset['message'].apply(lambda x: x.replace('\\', ''))
#spam_dataset['message'] = spam_dataset['line1']

We turn the `type` column (originally `v1`) into a binary `spam` column, where 'spam' becomes 1 and 'ham' becomes 0.

In [50]:
mapping = {
  "spam": 1,
  "ham": 0
}
spam_dataset["spam"] = spam_dataset["type"].map(mapping)
spam_dataset = spam_dataset[['spam', 'message','type']]
spam_dataset.shape

(5572, 3)

In [51]:
spam_new=spam_dataset.rename(columns={ 'message': 'text','spam': 'label', 'type': 'label_text'})

In [52]:
spam_new.shape

(5572, 3)

## Checking the integration

In [53]:
print(spam_dataset.isnull().sum())
spam_dataset

spam       0
message    0
type       0
dtype: int64


|   | spam | message | type |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | ham |
| 1 | 0 | Ok lar... Joking wif u oni... | ham |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | spam |
| 3 | 0 | U dun say so early hor... U c already then say... | ham |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | ham |
| ... | ... | ... | ... |
| 5567 | 1 | This is the 2nd time we have tried 2 contact u... | spam |
| 5568 | 0 | Will I_ b going to esplanade fr home? | ham |
| 5569 | 0 | Pity, * was in mood for that. So...any other s... | ham |
| 5570 | 0 | The guy did some bitching but I acted like i'd... | ham |
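
The earlier `describe()` output reports 5169 unique values among the 5572 messages in `v2`, so some texts are repeated. If the same text lands in both the training and test splits, test scores can look better than they should. Below is a minimal sketch of dropping exact duplicates before splitting, written in plain Python on hypothetical rows; `drop_exact_duplicates` is an illustrative helper, and with the DataFrame itself `drop_duplicates(subset=...)` would be the usual route.

```python
def drop_exact_duplicates(rows):
    """Keep the first occurrence of each (label, text) pair, preserving order."""
    seen = set()
    unique_rows = []
    for label, text in rows:
        # Assumption: texts that differ only by case or outer whitespace count as duplicates
        key = (label, text.strip().lower())
        if key not in seen:
            seen.add(key)
            unique_rows.append((label, text))
    return unique_rows

# Hypothetical rows for illustration
rows = [
    (0, "Sorry, I'll call later"),
    (1, "Free entry in 2 a wkly comp"),
    (0, "Sorry, I'll call later"),
    (0, "sorry, i'll call later "),
]
deduped = drop_exact_duplicates(rows)
print(len(rows), "->", len(deduped))
```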


In [54]:
from datasets import Dataset

spam_dataset = Dataset.from_pandas(spam_new) #spam_dataset

## Dataset preprocessing
The text must be converted into sequences of numbers with a "tokenizer".

In [56]:
#checkpoint = "bert-base-uncased"
#checkpoint = "AventIQ-AI/bert-spam-detection"

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

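def tokenize_function(example):
    # padding=True pads each mapped batch to its longest sequence;
    # the DataCollatorWithPadding defined below also pads each training batch
    return tokenizer(example["text"], truncation=True, padding=True)

tokenized_datasets = spam_dataset.map(tokenize_function, batched=True)
samples = tokenized_datasets[:2]
# Keep only the model inputs and the label (drop the raw text columns)
samples = {k: v for k, v in samples.items() if k not in ["text", "label_text"]}
[len(x) for x in samples["input_ids"]]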



[100, 100]

In [57]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
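
`DataCollatorWithPadding` pads each batch at collation time to the length of its longest sequence, so padding does not have to be fixed during tokenization. The idea can be sketched in plain Python; this is a conceptual illustration with made-up sequences, not the library's implementation, and `pad_batch` is a hypothetical helper.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence in a batch to the length of the longest one.

    Returns padded ids plus an attention mask (1 = real token, 0 = padding).
    """
    max_len = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_token_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks

# Made-up token id sequences for two messages of different lengths
batch = [[101, 2175, 2127, 102], [101, 7929, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

Padding per batch rather than to a global maximum keeps short messages cheap to process, which suits SMS data where most texts are brief.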

In [58]:
pip install 'accelerate>=0.26.0'



## Training and test sets
The dataset is clean and ready; it now needs to be split into training and evaluation sets. The cell below creates an 80/20 train/test split, and the test split also serves as the evaluation set during training.

In [59]:
tokenized_datasets=tokenized_datasets.train_test_split(test_size=0.2)
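
The call above produces a random 80/20 train/test split and no separate validation set. One possible way to get reproducible, class-balanced train, validation, and test splits is sketched below in plain Python; the function name, the 80/10/10 proportions, and the seed are assumptions for illustration, not part of the original notebook.

```python
import random
from collections import defaultdict

def stratified_three_way_split(labels, val_frac=0.1, test_frac=0.1, seed=42):
    """Return (train_idx, val_idx, test_idx), splitting each class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        val_idx += indices[:n_val]
        test_idx += indices[n_val:n_val + n_test]
        train_idx += indices[n_val + n_test:]
    return train_idx, val_idx, test_idx

# Toy labels with roughly the dataset's imbalance (hypothetical data)
labels = [0] * 870 + [1] * 130
train_idx, val_idx, test_idx = stratified_three_way_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

With a Hugging Face `Dataset`, the resulting index lists could be passed to `Dataset.select` to build the three splits. For the simpler two-way case, `train_test_split` also accepts a `seed` argument.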


In [60]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4457
    })
    test: Dataset({
        features: ['label', 'text', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1115
    })
})

In [61]:
training_args = TrainingArguments("test-trainer", report_to='none')

In [62]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [63]:
from transformers import Trainer

In [64]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    processing_class=tokenizer,
)

In [65]:
trainer.train()

| Step | Training Loss |
|---|---|
| 500 | 0.0851 |
| 1000 | 0.0346 |
| 1500 | 0.0144 |


TrainOutput(global_step=1674, training_loss=0.04070527365558036, metrics={'train_runtime': 524.9431, 'train_samples_per_second': 25.471, 'train_steps_per_second': 3.189, 'total_flos': 1551317685177180.0, 'train_loss': 0.04070527365558036, 'epoch': 3.0})
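
The `global_step=1674` in the output can be cross-checked by hand. Assuming the `Trainer` defaults were in effect (a per-device batch size of 8, no gradient accumulation) and a single device was used, the step count follows from the training set size:

```python
import math

train_rows = 4457   # from the DatasetDict shown above
batch_size = 8      # assumed: default per_device_train_batch_size
num_epochs = 3      # matches 'epoch': 3.0 in the training output

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

The result agrees with the reported step count, which suggests no custom batch size or epoch count was applied.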

In [66]:
#tokenized_datasets = tokenized_datasets.rename_column("spam", "label")
display(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4457
    })
    test: Dataset({
        features: ['label', 'text', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1115
    })
})


### Evaluation

In [67]:


predictions = trainer.predict(tokenized_datasets["test"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(1115, 2) (1115,)


In [72]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)

predictions

PredictionOutput(predictions=array([[ 4.44886  , -4.702136 ],
       [ 4.4614434, -4.6972404],
       [ 4.4546704, -4.701467 ],
       ...,
       [ 4.341084 , -4.6399536],
       [-4.3537846,  4.2916656],
       [ 4.4300356, -4.696036 ]], dtype=float32), label_ids=array([0, 0, 0, ..., 0, 1, 0]), metrics={'test_loss': 0.03923804685473442, 'test_runtime': 13.5496, 'test_samples_per_second': 82.29, 'test_steps_per_second': 10.332})
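
The `predictions` array holds raw logits, one pair per message, and `argmax` picks the larger of the two. To read them as probabilities, a softmax can be applied; here is a sketch using the first and the second-to-last logit pairs printed above.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two logit pairs copied from the output above: [ham_logit, spam_logit]
ham_like = [4.44886, -4.702136]
spam_like = [-4.3537846, 4.2916656]

for logits in (ham_like, spam_like):
    probs = softmax(logits)
    predicted = probs.index(max(probs))  # same result as argmax on the logits
    print(predicted, [round(p, 4) for p in probs])
```

Working with probabilities makes it possible to move the decision threshold away from 0.5, for example to reduce the number of legitimate messages flagged as spam.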

In [73]:
from sklearn.metrics import classification_report

true_labels = tokenized_datasets['test']['label']
print(classification_report(true_labels, preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       970
           1       0.99      0.97      0.98       145

    accuracy                           0.99      1115
   macro avg       0.99      0.99      0.99      1115
weighted avg       0.99      0.99      0.99      1115



In [74]:
%pip install plotly 



In [76]:
from sklearn.metrics import confusion_matrix

confusion_matrix(true_labels, preds)

array([[968,   2],
       [  4, 141]])
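
The spam-class figures in the classification report can be recomputed directly from this confusion matrix, where rows are true labels and columns are predictions (scikit-learn's convention):

```python
# Confusion matrix from the cell above: rows = true (ham, spam), cols = predicted
tn, fp = 968, 2    # true ham:  predicted ham, predicted spam
fn, tp = 4, 141    # true spam: predicted ham, predicted spam

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

In concrete terms, 4 spam messages slipped through as ham and 2 legitimate messages were flagged as spam.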

In [77]:
from sklearn.metrics import confusion_matrix
import plotly.express as px

cm = pd.DataFrame(confusion_matrix(true_labels, preds),
                  index=['ham', 'spam'],
                  columns=['ham', 'spam'])
px.imshow(cm, text_auto=True)

In [94]:
%pip install evaluate
%pip install seqeval


In [97]:
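import evaluate
import numpy as np

# seqeval targets token-level tagging (e.g. NER); for message-level spam/ham
# classification, accuracy, F1, precision, and recall are used instead
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])


def compute_metrics(p: EvalPrediction):
    # p.predictions holds raw logits of shape (n_samples, 2); convert them to class ids
    preds = np.argmax(p.predictions, axis=-1)
    return clf_metrics.compute(predictions=preds, references=p.label_ids)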

In [98]:
training_args = TrainingArguments("test-trainer", eval_strategy="epoch") # push_to_hub=True if you wish to push your model to the hugging face hub
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss


## Question pipeline

In [121]:
spam_sample = spam_new.sample(3000)
display(spam_sample.head())

|   | label | text | label_text |
|---|---|---|---|
| 3572 | 1 | You won't believe it but it's true. It's Incre... | spam |
| 3423 | 1 | Am new 2 club & dont fink we met yet Will B gr... | spam |
| 1340 | 0 | Might ax well im there. | ham |
| 2382 | 0 | I will reach before ten morning | ham |
| 4516 | 0 | Men always needs a beautiful, intelligent, car... | ham |


In [122]:
from transformers import BertTokenizer, BertForSequenceClassification, pipeline
# Note: this checkpoint was fine-tuned for sentiment classification (SST-2), so its
# question-answering head is newly initialized (see the warning in the output) and untrained
classifier = pipeline("question-answering", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
#classifier = pipeline("text-classification", model="distilbert/distilbert-base-cased-distilled-squad",
#                      tokenizer="google-bert/bert-base-cased")
question = "The context is an SMS. Answer with one word between ham and spam. Is it spam or ham?"
prep=[{"context": ele, "question": question } for ele in spam_sample['text'].to_list()]
spam_preds = classifier(prep)
#map={'POSITIVE' :'ham', 'NEGATIVE': 'spam'}
#spam_preds = [{'label': map[ele['label']],'score':ele['score']} for ele in spam_preds]
#spam_preds['label'].map(map)
display(spam_preds[:10])

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert/distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0

Passing a list of SQuAD examples to the pipeline is deprecated and will be removed in v5. Inputs should be passed using the `question` and `context` keyword arguments instead.



[{'score': 0.00083197932690382, 'start': 21, 'end': 24, 'answer': 'but'},
 {'score': 0.0009000927675515413, 'start': 21, 'end': 25, 'answer': 'fink'},
 {'score': 0.02849952131509781, 'start': 14, 'end': 22, 'answer': 'im there'},
 {'score': 0.02465648017823696, 'start': 2, 'end': 6, 'answer': 'will'},
 {'score': 0.0017468755831941962,
  'start': 43,
  'end': 70,
  'answer': 'caring, loving, adjustable,'},
 {'score': 0.020632125437259674,
  'start': 0,
  'end': 10,
  'answer': 'Hurry home'},
 {'score': 0.005736622493714094,
  'start': 36,
  'end': 80,
  'answer': "answering my texts so I'm guessing he flaked"},
 {'score': 0.0022522564977407455,
  'start': 57,
  'end': 102,
  'answer': 'else can be our 4th guy before we commit to a'},
 {'score': 0.003708276664838195, 'start': 68, 'end': 74, 'answer': 'need a'},
 {'score': 0.0005298318574205041,
  'start': 116,
  'end': 128,
  'answer': '087187262701'}]
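
The answers above are fragments of the messages rather than "ham" or "spam" because an extractive question-answering pipeline can only return a span of the context, located by its `start` and `end` character offsets. The third result illustrates this; the message text is taken from the sample shown earlier.

```python
# Third sampled message and the third pipeline result, copied from the outputs above
context = "Might ax well im there."
result = {"start": 14, "end": 22, "answer": "im there"}

# The "answer" is just a slice of the context, so it can never be a label
# such as "ham" or "spam" unless that word appears in the message itself.
extracted = context[result["start"]:result["end"]]
print(extracted == result["answer"])
```

The very low scores are also consistent with the warning that the question-answering head of this checkpoint was newly initialized. A `text-classification` pipeline is the task type that returns labels.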

## New model

In [94]:
spam_sample = spam_new.sample(3000)
display(spam_sample.head())

|   | label | text | label_text |
|---|---|---|---|
| 1086 | 0 | I don't think he has spatula hands! | ham |
| 483 | 0 | Thank you baby! I cant wait to taste the real ... | ham |
| 855 | 1 | Talk sexy!! Make new friends or fall in love i... | spam |
| 4652 | 0 | Lol yes. But it will add some spice to your day. | ham |
| 1859 | 0 | What's up. Do you want me to come online? | ham |


In [107]:
from transformers import BertTokenizer, BertForSequenceClassification, pipeline
classifier = pipeline("text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
#classifier = pipeline("text-classification", model="distilbert/distilbert-base-cased-distilled-squad",
#                      tokenizer="google-bert/bert-base-cased")
context = "There are three major types of rock: igneous, sedimentary, and metamorphic. The rock cycle is an important concept in geology which illustrates the relationships between these three types of rock, and magma. When a rock crystallizes from melt (magma and/or lava), it is an igneous rock. This rock can be weathered and eroded, and then redeposited and lithified into a sedimentary rock, or be turned into a metamorphic rock due to heat and pressure that change the mineral content of the rock which gives it a characteristic fabric. The sedimentary rock can then be subsequently turned into a metamorphic rock due to heat and pressure and is then weathered, eroded, deposited, and lithified, ultimately becoming a sedimentary rock. Sedimentary rock may also be re-eroded and redeposited, and metamorphic rock may also undergo additional metamorphism. All three types of rocks may be re-melted; when this happens, a new magma is formed, from which an igneous rock may once again crystallize."
question = "What are the three major types of rock?"
qa_pipe = pipeline("question-answering")
qa_pipe({ "context": context, "question": question })
spam_preds = classifier(spam_sample['text'].to_list())
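# Heuristic assumption: treat positive sentiment as ham and negative sentiment as spam
# (renamed from `map` to avoid shadowing the Python builtin)
label_map = {'POSITIVE': 'ham', 'NEGATIVE': 'spam'}
spam_preds = [{'label': label_map[ele['label']], 'score': ele['score']} for ele in spam_preds]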
#spam_preds['label'].map(map)
display(spam_preds[:10])

Device set to use cuda:0


[{'label': 'spam', 'score': 0.9831442832946777},
 {'label': 'ham', 'score': 0.9997801184654236},
 {'label': 'ham', 'score': 0.9986414313316345},
 {'label': 'ham', 'score': 0.9997112154960632},
 {'label': 'ham', 'score': 0.9704649448394775},
 {'label': 'ham', 'score': 0.9997745156288147},
 {'label': 'ham', 'score': 0.8795937299728394},
 {'label': 'ham', 'score': 0.980728268623352},
 {'label': 'spam', 'score': 0.9934523105621338},
 {'label': 'ham', 'score': 0.9996738433837891}]

In [103]:
spam_preds = pd.DataFrame(spam_preds)
spam_preds.head()
spam_preds.shape

(3000, 2)

In [104]:
from sklearn.metrics import classification_report

true_labels = spam_sample['label_text']
pred_labels = spam_preds['label']

print(classification_report(true_labels, pred_labels))

              precision    recall  f1-score   support

         ham       0.83      0.61      0.70      2586
        spam       0.08      0.21      0.11       414

    accuracy                           0.56      3000
   macro avg       0.45      0.41      0.41      3000
weighted avg       0.72      0.56      0.62      3000



In [105]:
from sklearn.metrics import confusion_matrix
import plotly.express as px

cm = pd.DataFrame(confusion_matrix(true_labels, pred_labels),
                  index=['ham',  'spam'],
                  columns=['ham',  'spam'])

px.imshow(cm, text_auto=True)

## Second model

In [None]:
from datasets import load_dataset

In [None]:

spam_new.head(30)

In [None]:
spam_new.shape

In [87]:
spam_sample = spam_new.sample(3000)
display(spam_sample.head())

|   | label | text | label_text |
|---|---|---|---|
| 3245 | 0 | Funny fact Nobody teaches volcanoes 2 erupt, t... | ham |
| 944 | 0 | I sent my scores to sophas and i had to do sec... | ham |
| 1044 | 1 | We know someone who you know that fancies you.... | spam |
| 2484 | 0 | Only if you promise your getting out as SOON a... | ham |
| 812 | 1 | Congratulations ur awarded either a500 of CD g... | spam |


In [88]:
from transformers import BertTokenizer, BertForSequenceClassification, pipeline
#classifier = pipeline("text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english") Non
model_name = "AventIQ-AI/bert-spam-detection"
tokenizer = BertTokenizer.from_pretrained(model_name)
googletokenizer="google-bert/bert-base-cased"
model = BertForSequenceClassification.from_pretrained(model_name)
classifier = pipeline("text-classification", model=model_name,tokenizer=tokenizer)
#classifier = pipeline("text-classification", model="distilbert/distilbert-base-cased-distilled-squad",
#                      tokenizer="google-bert/bert-base-cased")
spam_preds = classifier(spam_sample['text'].to_list())
# Renamed from `map` to avoid shadowing the Python builtin
label_map = {'LABEL_0': 'ham', 'LABEL_1': 'spam'}
spam_preds = [{'label': label_map[ele['label']], 'score': ele['score']} for ele in spam_preds]
#spam_preds['label'].map(map)
display(spam_preds[:10])

Device set to use cuda:0


[{'label': 'ham', 'score': 0.9997051358222961},
 {'label': 'ham', 'score': 0.9998780488967896},
 {'label': 'spam', 'score': 0.9999432563781738},
 {'label': 'ham', 'score': 0.9999186992645264},
 {'label': 'spam', 'score': 0.9999481439590454},
 {'label': 'ham', 'score': 0.9999291896820068},
 {'label': 'ham', 'score': 0.9999237060546875},
 {'label': 'ham', 'score': 0.9999427795410156},
 {'label': 'ham', 'score': 0.9999134540557861},
 {'label': 'ham', 'score': 0.999913215637207}]

In [89]:
spam_preds = pd.DataFrame(spam_preds)
spam_preds.head()
spam_preds.shape

(3000, 2)

In [90]:
spam_preds
spam_sample


|   | label | text | label_text |
|---|---|---|---|
| 3245 | 0 | Funny fact Nobody teaches volcanoes 2 erupt, t... | ham |
| 944 | 0 | I sent my scores to sophas and i had to do sec... | ham |
| 1044 | 1 | We know someone who you know that fancies you.... | spam |
| 2484 | 0 | Only if you promise your getting out as SOON a... | ham |
| 812 | 1 | Congratulations ur awarded either a500 of CD g... | spam |
| ... | ... | ... | ... |
| 4897 | 0 | Oh for fuck's sake she's in like tallahassee | ham |
| 1506 | 1 | Thanks for the Vote. Now sing along with the s... | spam |
| 1923 | 0 | Hello. They are going to the village pub at 8 ... | ham |
| 448 | 0 | LOL ... Have you made plans for new years? | ham |


In [91]:
from sklearn.metrics import classification_report

true_labels = spam_sample['label_text']
pred_labels = spam_preds['label']

print(classification_report(true_labels, pred_labels))

              precision    recall  f1-score   support

         ham       1.00      1.00      1.00      2593
        spam       1.00      1.00      1.00       407

    accuracy                           1.00      3000
   macro avg       1.00      1.00      1.00      3000
weighted avg       1.00      1.00      1.00      3000



In [92]:
confusion_matrix(true_labels, pred_labels)

array([[2592,    1],
       [   2,  405]])

In [93]:
from sklearn.metrics import confusion_matrix
import plotly.express as px

cm = pd.DataFrame(confusion_matrix(true_labels, pred_labels),
                  index=['ham',  'spam'],
                  columns=['ham',  'spam'])

px.imshow(cm, text_auto=True)

## Third model


In [86]:
from transformers import BertTokenizer, BertForSequenceClassification, pipeline
#classifier = pipeline("text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english") Non
classifier = pipeline("text-classification", model="mrm8488/bert-tiny-finetuned-sms-spam-detection")
spam_preds = classifier(spam_sample['text'].to_list())
# Renamed from `map` to avoid shadowing the Python builtin
label_map = {'LABEL_0': 'ham', 'LABEL_1': 'spam'}
spam_preds = [{'label': label_map[ele['label']], 'score': ele['score']} for ele in spam_preds]
#spam_preds['label'].map(map)
display(spam_preds[:10])

Device set to use cuda:0


[{'label': 'ham', 'score': 0.7893965244293213},
 {'label': 'ham', 'score': 0.9358083009719849},
 {'label': 'spam', 'score': 0.9059011340141296},
 {'label': 'ham', 'score': 0.9381789565086365},
 {'label': 'spam', 'score': 0.9056588411331177},
 {'label': 'ham', 'score': 0.9383450746536255},
 {'label': 'ham', 'score': 0.9373378753662109},
 {'label': 'ham', 'score': 0.9381335377693176},
 {'label': 'ham', 'score': 0.9321560263633728},
 {'label': 'ham', 'score': 0.9378365874290466}]

In [82]:
spam_preds = pd.DataFrame(spam_preds)
spam_preds.head()
spam_preds.shape

(3000, 2)

In [85]:
from sklearn.metrics import classification_report

true_labels = spam_sample['label_text']
pred_labels = spam_preds['label']

print(classification_report(true_labels, pred_labels))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      2593
        spam       0.98      0.93      0.96       407

    accuracy                           0.99      3000
   macro avg       0.99      0.96      0.97      3000
weighted avg       0.99      0.99      0.99      3000
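
The gap between the macro and weighted averages in this report comes from class imbalance: the macro average treats ham and spam equally, while the weighted average follows the support counts. The sketch below recomputes recall from the rounded per-class values shown above, so the last digit may differ slightly from the report.

```python
# Per-class recall and support, read from the report above (rounded values)
recall = {"ham": 1.00, "spam": 0.93}
support = {"ham": 2593, "spam": 407}

macro_recall = sum(recall.values()) / len(recall)
weighted_recall = sum(recall[c] * support[c] for c in recall) / sum(support.values())

print(f"macro recall ~ {macro_recall:.3f}, weighted recall ~ {weighted_recall:.3f}")
```

For a spam filter, the macro average or the spam row itself gives a more honest picture than the weighted average, which is dominated by the ham class.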



In [84]:
from sklearn.metrics import confusion_matrix
import plotly.express as px

cm = pd.DataFrame(confusion_matrix(true_labels, pred_labels),
                  index=['ham',  'spam'],
                  columns=['ham',  'spam'])

px.imshow(cm, text_auto=True)