- This notebook contains a solution to a spam detection task for texts (binary classification).
- The primary metric is roc_auc, but I also track precision: in this task it is important to minimize false positives so that an important message is not mistakenly classified as spam.

*The following experiments were carried out*:
- Training a range of sklearn classifiers as a starting point. SVM, RandomForest and ExtraTreesClassifier performed best, but roc_auc did not exceed 0.93.
- A GridSearch over hyperparameters for the best algorithms; roc_auc did not change substantially.
- Stacking of the best algorithms listed above; roc_auc did not change substantially.
- Classification with the roberta transformer model from the Hugging Face library. No fine-tuning was needed here, because the model had already been fine-tuned on a similar dataset and showed very high quality (roc_auc=0.99). This model is used to produce the final predictions.
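
To make the role of precision concrete: with spam as the positive class, a false positive is a legitimate message flagged as spam. Below is a minimal sketch in plain Python, using made-up labels (not taken from the dataset), of how precision reacts to such errors. The `precision` helper is written here for illustration only; the notebook itself uses `precision_score` from sklearn.

```python
def precision(y_true, y_pred, positive=1):
    """Share of messages flagged as spam that really are spam: TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical labels: 1 = spam, 0 = ham
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
cautious = [1, 1, 0, 0, 0, 0, 0, 0]    # misses one spam, flags no ham
aggressive = [1, 1, 1, 1, 1, 0, 0, 0]  # catches all spam, flags two ham messages

print(precision(y_true, cautious))    # 1.0
print(precision(y_true, aggressive))  # 0.6
```

The cautious classifier lets one spam message through but never hides a legitimate one, which is the trade-off this task prefers.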

In [None]:
# import libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords  
nltk.download('stopwords')
nltk.download('punkt')

from sklearn.preprocessing import LabelEncoder
from nltk.stem.porter import PorterStemmer
import string
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import warnings
warnings.filterwarnings("ignore") 

### 1. Data analysis

**Findings**
1. The training dataset has 16278 rows in English
2. There are 2 classes, spam and ham (they need to be encoded as 0 or 1)
3. There are no missing values
4. The classes are imbalanced (class 0 dominates), so it is better to train on a balanced sample
5. There are duplicates

In [3]:
# load the dataset
train_df = pd.read_csv('train_spam.csv')
train_df

Unnamed: 0,text_type,text
0,ham,make sure alex knows his birthday is over in f...
1,ham,a resume for john lavorato thanks vince i will...
2,spam,plzz visit my website moviesgodml to get all m...
3,spam,urgent your mobile number has been awarded wit...
4,ham,overview of hr associates analyst project per ...
...,...,...
16273,spam,if you are interested in binary options tradin...
16274,spam,dirty pictureblyk on aircel thanks you for bei...
16275,ham,or you could do this g on mon 1635465 sep 1635...
16276,ham,insta reels par 80 गंद bhara pada hai 👀 kuch b...


In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16278 entries, 0 to 16277
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text_type  16278 non-null  object
 1   text       16278 non-null  object
dtypes: object(2)
memory usage: 254.5+ KB


In [7]:
# check for missing values
train_df.isnull().sum()

text_type    0
text         0
dtype: int64

In [8]:
# check for duplicates
train_df.duplicated().sum()

11

In [9]:
# drop duplicates
train_df = train_df.drop_duplicates(keep = 'first')

In [11]:
# class ratio
values = train_df['text_type'].value_counts()
total = values.sum()

# value_counts() is indexed by label ('ham', 'spam'), so select by position with .iloc
percentage_0 = (values.iloc[0] / total) * 100
percentage_1 = (values.iloc[1] / total) * 100

print('percentage of 0 :', percentage_0)
print('percentage of 1 :', percentage_1)

percentage of 0 : 70.4370812073523
percentage of 1 : 29.56291879264769




In [117]:
# number of characters, words and sentences
train_df['num_characters'] = train_df['text'].apply(len)
train_df['num_words'] = train_df['text'].apply(lambda x: len(nltk.word_tokenize(x)))
train_df['num_sentence'] = train_df['text'].apply(lambda x: len(nltk.sent_tokenize(x)))

In [16]:
train_df[['num_characters', 'num_words', 'num_sentence']].describe()

Unnamed: 0,num_characters,num_words,num_sentence
count,16267.0,16267.0,16267.0
mean,310.468986,57.141944,1.062212
std,287.887904,52.1344,0.376116
min,1.0,1.0,1.0
25%,60.0,12.0,1.0
50%,157.0,31.0,1.0
75%,639.0,114.0,1.0
max,800.0,207.0,12.0


In [32]:
# the summary shows rows that are only 1 word long. They most likely carry no useful information, so they can be removed
train_df[train_df['num_words'] < 2].groupby('text_type').count()

text_type,text,num_characters,num_words,num_sentence
ham,142,142,142,142
spam,3,3,3,3


In [35]:
indice_to_drop = train_df[train_df['num_words'] < 2].index
train_df[train_df['num_words'] < 2]

Unnamed: 0,text_type,text,num_characters,num_words,num_sentence
76,ham,urgent,6,1,1
149,ham,fast,4,1,1
170,ham,freemasonry,11,1,1
233,ham,logs,4,1,1
331,ham,landed,6,1,1
...,...,...,...,...,...
15738,ham,txt,3,1,1
15780,ham,staffsciencenusedusgphyhcmkteachingpc1323,41,1,1
15890,ham,derpherp,8,1,1
16067,ham,ok,2,1,1


In [None]:
# drop rows that are too short
train_df = train_df.drop(indice_to_drop)

### 2. Data preprocessing

In [39]:
# integer encoding of the classes (0-ham, 1-spam)
encoder = LabelEncoder()
train_df['target'] = encoder.fit_transform(train_df['text_type'])
train_df



Unnamed: 0,text_type,text,num_characters,num_words,num_sentence,target
0,ham,make sure alex knows his birthday is over in f...,86,16,1,0
1,ham,a resume for john lavorato thanks vince i will...,520,97,1,0
2,spam,plzz visit my website moviesgodml to get all m...,126,22,1,1
3,spam,urgent your mobile number has been awarded wit...,139,23,1,1
4,ham,overview of hr associates analyst project per ...,733,127,1,0
...,...,...,...,...,...,...
16273,spam,if you are interested in binary options tradin...,114,18,1,1
16274,spam,dirty pictureblyk on aircel thanks you for bei...,454,74,1,1
16275,ham,or you could do this g on mon 1635465 sep 1635...,799,147,1,0
16276,ham,insta reels par 80 गंद bhara pada hai 👀 kuch b...,102,21,1,0


In [75]:
# stemmer object
ps = PorterStemmer()

# build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

# text preprocessing function
def transform_text(text):
    # lowercase
    text = text.lower()

    # tokenize
    tokens = nltk.word_tokenize(text)

    # keep alphanumeric tokens only (this also drops punctuation tokens)
    tokens = [t for t in tokens if t.isalnum()]

    # remove stop words and punctuation
    tokens = [t for t in tokens if t not in stop_words and t not in string.punctuation]

    # stemming
    return " ".join(ps.stem(t) for t in tokens)

In [76]:
# apply the transformations
train_df['transformed_text'] = train_df['text'].apply(transform_text)
train_df

Unnamed: 0,text_type,text,num_characters,num_words,num_sentence,target,transformed_text
0,ham,make sure alex knows his birthday is over in f...,86,16,1,0,make sure alex know birthday fifteen minut far...
1,ham,a resume for john lavorato thanks vince i will...,520,97,1,0,resum john lavorato thank vinc get move right ...
2,ham,overview of hr associates analyst project per ...,733,127,1,0,overview hr associ analyst project per david r...
3,ham,url url date not supplied government employees...,156,26,1,0,url url date suppli govern employe routin scre...
4,ham,looks like your ham corpus by and large has to...,419,85,1,0,look like ham corpu larg jeremi url header spa...
...,...,...,...,...,...,...,...
3995,spam,got bored right? 😐 then certainly you must che...,271,50,2,1,got bore right certainli must check netflix ne...
3996,spam,hey you know about this app i have earned 100 ...,244,47,1,1,hey know app earn 100 rupe app also want earn ...
3997,spam,pvt finance arranged on cheque basics 4 busine...,132,20,1,1,pvt financ arrang chequ basic 4 busi peopl tra...
3998,spam,𝑮𝒐𝒐𝒅 𝒊𝒏𝒗𝒆𝒔𝒕𝒎𝒆𝒏𝒕 𝒉𝒂𝒔 𝒃𝒆𝒆𝒏 𝒎𝒚 𝒎𝒂𝒊𝒏 𝒔𝒐𝒖𝒓𝒄𝒆 𝒐𝒇 𝒊𝒏𝒄...,326,61,1,1,𝑮𝒐𝒐𝒅 𝒊𝒏𝒗𝒆𝒔𝒕𝒎𝒆𝒏𝒕 𝒉𝒂𝒔 𝒃𝒆𝒆𝒏 𝒎𝒚 𝒎𝒂𝒊𝒏 𝒔𝒐𝒖𝒓𝒄𝒆 𝒐𝒇 𝒊𝒏𝒄...


In [41]:
# take a balanced sample and undersample to speed up training
df_class_0 = train_df[train_df['target'] == 0][:2000]
df_class_1 = train_df[train_df['target'] == 1][:2000]

train_df_2 = pd.concat([df_class_0, df_class_1], ignore_index=True)
train_df_2

Unnamed: 0,text_type,text,num_characters,num_words,num_sentence,target
0,ham,make sure alex knows his birthday is over in f...,86,16,1,0
1,ham,a resume for john lavorato thanks vince i will...,520,97,1,0
2,ham,overview of hr associates analyst project per ...,733,127,1,0
3,ham,url url date not supplied government employees...,156,26,1,0
4,ham,looks like your ham corpus by and large has to...,419,85,1,0
...,...,...,...,...,...,...
3995,spam,got bored right? 😐 then certainly you must che...,271,50,2,1
3996,spam,hey you know about this app i have earned 100 ...,244,47,1,1
3997,spam,pvt finance arranged on cheque basics 4 busine...,132,20,1,1
3998,spam,𝑮𝒐𝒐𝒅 𝒊𝒏𝒗𝒆𝒔𝒕𝒎𝒆𝒏𝒕 𝒉𝒂𝒔 𝒃𝒆𝒆𝒏 𝒎𝒚 𝒎𝒂𝒊𝒏 𝒔𝒐𝒖𝒓𝒄𝒆 𝒐𝒇 𝒊𝒏𝒄...,326,61,1,1
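
The cell above takes the first 2000 rows of each class. If the file happens to be ordered in some way, a random draw per class may be more representative. The following is a library-free sketch of that idea; the `balanced_undersample` helper, the toy rows and the seed are assumptions made for illustration and are not part of the notebook.

```python
import random

def balanced_undersample(rows, n_per_class, seed=0):
    """rows: list of (text, label) pairs. Returns n_per_class random rows per label."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    sample = []
    for label in sorted(by_label):
        # rng.sample raises ValueError if a class has fewer than n_per_class rows
        sample.extend(rng.sample(by_label[label], n_per_class))
    rng.shuffle(sample)
    return sample

# Toy data with the same kind of imbalance as the dataset (more ham than spam)
rows = [(f"ham message {i}", 0) for i in range(70)] + [(f"spam message {i}", 1) for i in range(30)]
subset = balanced_undersample(rows, n_per_class=20, seed=44)
print(len(subset), sum(label for _, label in subset))  # 40 20
```

With a DataFrame, a comparable result can probably be obtained with `train_df.groupby('target').sample(n=2000, random_state=44)`, available in recent pandas versions.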


In [105]:
# vectorizer object
tfid = TfidfVectorizer(max_features = 3000)

In [86]:
# apply the vectorization
X = tfid.fit_transform(train_df['transformed_text']).toarray()
y = train_df['target'].values
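
For intuition about what the vectorizer produces, here is a small plain-Python sketch of TF-IDF weighting with smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows, which to my understanding matches the default settings of `TfidfVectorizer`. Tokenization is simplified to whitespace splitting, the `max_features` cut-off is not modelled, and the toy documents are invented for the example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with smoothed idf and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows

docs = ["free prize call now", "call me now", "free free prize"]
vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

In the second document the term `me`, which occurs in only one document, gets a larger weight than `call`, which occurs in two.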

In [87]:
# train-test split
X_train, X_test , y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 44)

### 3. Training and comparing models

In [60]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

In [61]:
# classifier objects from sklearn (plus XGBoost)

svc = SVC(kernel="sigmoid", gamma=1.0)
knc = KNeighborsClassifier()
mnb = MultinomialNB()
dtc = DecisionTreeClassifier(max_depth=5)
lrc = LogisticRegression(solver='liblinear', penalty='l1')
rfc = RandomForestClassifier(n_estimators=50, random_state=2)
abc = AdaBoostClassifier(n_estimators=50, random_state=2)
bc = BaggingClassifier(n_estimators=50, random_state=2)
etc = ExtraTreesClassifier(n_estimators=50, random_state=2)
gbdt = GradientBoostingClassifier(n_estimators=50, random_state=2)
xgb = XGBClassifier(n_estimators=50, random_state=2)

# display name -> classifier, iterated by the training loops below
# (names follow the "For: ..." lines in the printed results)
clfs = {
    'SVC': svc, 'KNN': knc, 'NB': mnb, 'DT': dtc, 'LR': lrc, 'RF': rfc,
    'Adaboost': abc, 'Bgc': bc, 'ETC': etc, 'GBDT': gbdt, 'xgb': xgb,
}


In [108]:
# training and evaluation function
def train_classifier(clf, X_train, y_train, X_test, y_test):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    # note: roc_auc is computed from hard 0/1 predictions, not from probabilities or decision scores
    roc_auc = roc_auc_score(y_test, y_pred)
    return accuracy, precision, roc_auc
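
One caveat about the function above: `roc_auc_score` receives the output of `predict`, that is, hard 0/1 labels. ROC AUC computed from hard labels collapses to balanced accuracy, which would explain why Accuracy and Roc_auc are nearly identical on the balanced split in the results below. The following plain-Python sketch shows the effect; the helper functions and the scores are hypothetical and only illustrate the definition of ROC AUC as the probability that a random positive is ranked above a random negative.

```python
def roc_auc(y_true, scores):
    """P(random positive scores higher than random negative); ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    tpr = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / sum(1 for t in y_true if t == 1)
    tnr = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0) / sum(1 for t in y_true if t == 0)
    return (tpr + tnr) / 2

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]  # hypothetical spam probabilities
labels = [1 if s >= 0.5 else 0 for s in scores]      # hard predictions at threshold 0.5

print(roc_auc(y_true, scores))            # 0.9375
print(roc_auc(y_true, labels))            # 0.75
print(balanced_accuracy(y_true, labels))  # 0.75
```

With sklearn, a score-based value can be obtained by passing `clf.predict_proba(X_test)[:, 1]` or, for models such as `SVC` without probability estimates, `clf.decision_function(X_test)` to `roc_auc_score`. The numbers reported in this notebook were not computed that way.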

In [80]:
# training on the reduced balanced dataset
accuracy_scores = []
precision_scores = []
roc_auc_scores = []
for name, clf in clfs.items():
    current_accuracy, current_precision, current_roc_auc = train_classifier(clf, X_train, y_train, X_test, y_test)
    print()
    print("For: ", name)
    print("Accuracy: ", current_accuracy)
    print("Precision: ", current_precision)
    print("Roc_auc: ", current_roc_auc)
    
    accuracy_scores.append(current_accuracy)
    precision_scores.append(current_precision)
    roc_auc_scores.append(current_roc_auc)


For:  SVC
Accuracy:  0.91
Precision:  0.9205128205128205
Roc_auc:  0.9099999999999999

For:  KNN
Accuracy:  0.63125
Precision:  0.9411764705882353
Roc_auc:  0.63125

For:  NB
Accuracy:  0.88625
Precision:  0.8652482269503546
Roc_auc:  0.88625

For:  DT
Accuracy:  0.65875
Precision:  0.5960665658093798
Roc_auc:  0.65875

For:  LR
Accuracy:  0.8675
Precision:  0.8888888888888888
Roc_auc:  0.8674999999999999

For:  RF
Accuracy:  0.92125
Precision:  0.9309462915601023
Roc_auc:  0.9212500000000001

For:  Adaboost
Accuracy:  0.8175
Precision:  0.8691860465116279
Roc_auc:  0.8175

For:  Bgc
Accuracy:  0.87375
Precision:  0.8673218673218673
Roc_auc:  0.87375

For:  ETC
Accuracy:  0.92125
Precision:  0.924433249370277
Roc_auc:  0.92125

For:  GBDT
Accuracy:  0.81125
Precision:  0.8738738738738738
Roc_auc:  0.81125

For:  xgb
Accuracy:  0.87625
Precision:  0.903485254691689
Roc_auc:  0.8762500000000001


In [88]:
# training on the full (imbalanced) dataset
accuracy_scores = []
precision_scores = []
roc_auc_scores = []
for name, clf in clfs.items():
    current_accuracy, current_precision, current_roc_auc = train_classifier(clf, X_train, y_train, X_test, y_test)
    print()
    print("For: ", name)
    print("Accuracy: ", current_accuracy)
    print("Precision: ", current_precision)
    print("Roc_auc: ", current_roc_auc)
    
    accuracy_scores.append(current_accuracy)
    precision_scores.append(current_precision)
    roc_auc_scores.append(current_roc_auc)


For:  SVC
Accuracy:  0.932093023255814
Precision:  0.9190421892816419
Roc_auc:  0.9067999855991183

For:  KNN
Accuracy:  0.7996899224806202
Precision:  0.9529411764705882
Roc_auc:  0.6662886435200186

For:  NB
Accuracy:  0.9221705426356589
Precision:  0.8866886688668867
Roc_auc:  0.8997546311297215

For:  DT
Accuracy:  0.782015503875969
Precision:  0.8658892128279884
Roc_auc:  0.6455326803087328

For:  LR
Accuracy:  0.9249612403100775
Precision:  0.9208037825059102
Roc_auc:  0.8935297115115663

For:  RF
Accuracy:  0.937984496124031
Precision:  0.928409090909091
Roc_auc:  0.914326523377893

For:  Adaboost
Accuracy:  0.8874418604651163
Precision:  0.8793324775353016
Roc_auc:  0.8383189462985579



KeyboardInterrupt



In [90]:
# so far, roc_auc is higher on the balanced sample with fewer examples (even though accuracy is lower)
# it is worth trying a larger sample while keeping it balanced

# total number of rows in class 1
max_len = len(train_df[train_df['target'] == 1])
max_len

4806

In [109]:
# take a balanced sample of the maximum possible size
df_class_0 = train_df[train_df['target'] == 0][:max_len]
df_class_1 = train_df[train_df['target'] == 1][:max_len]

train_df_2 = pd.concat([df_class_0, df_class_1], ignore_index=True)
train_df_2

Unnamed: 0,text_type,text,num_characters,num_words,num_sentence,target,transformed_text
0,ham,make sure alex knows his birthday is over in f...,86,16,1,0,make sure alex know birthday fifteen minut far...
1,ham,a resume for john lavorato thanks vince i will...,520,97,1,0,resum john lavorato thank vinc get move right ...
2,ham,overview of hr associates analyst project per ...,733,127,1,0,overview hr associ analyst project per david r...
3,ham,url url date not supplied government employees...,156,26,1,0,url url date suppli govern employe routin scre...
4,ham,looks like your ham corpus by and large has to...,419,85,1,0,look like ham corpu larg jeremi url header spa...
...,...,...,...,...,...,...,...
9607,spam,your e mail to anvasetc 1111 groups msn com ca...,429,87,1,1,e mail anvasetc 1111 group msn com deliv sent ...
9608,spam,rs 250 for dental services worth rs 2150 denta...,130,24,1,1,rs 250 dental servic worth rs 2150 dental spa ...
9609,spam,dost i am playing cricket knifeup pool etc and...,196,38,2,1,dost play cricket knifeup pool etc win cash da...
9610,spam,if you are interested in binary options tradin...,114,18,1,1,interest binari option trade may continu infor...


In [110]:
# vectorization and train/test split
X = tfid.fit_transform(train_df_2['transformed_text']).toarray()
y = train_df_2['target'].values

X_train, X_test , y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 44)

In [112]:
# training
accuracy_scores = []
precision_scores = []
roc_auc_scores = []
for name, clf in clfs.items():
    current_accuracy, current_precision, current_roc_auc = train_classifier(clf, X_train, y_train, X_test, y_test)
    print()
    print("For: ", name)
    print("Accuracy: ", current_accuracy)
    print("Precision: ", current_precision)
    print("Roc_auc: ", current_roc_auc)
    
    accuracy_scores.append(current_accuracy)
    precision_scores.append(current_precision)
    roc_auc_scores.append(current_roc_auc)


For:  SVC
Accuracy:  0.9121164846593863
Precision:  0.9290393013100436
Roc_auc:  0.9119753580546061

For:  KNN
Accuracy:  0.642225689027561
Precision:  0.9651567944250871
Roc_auc:  0.6398608887542728

For:  NB
Accuracy:  0.8881955278211129
Precision:  0.8592233009708737
Roc_auc:  0.888454091125438

For:  DT
Accuracy:  0.6640665626625065
Precision:  0.5979708306911858
Roc_auc:  0.6662379386439359

For:  LR
Accuracy:  0.8975559022360895
Precision:  0.9155701754385965
Roc_auc:  0.8974000475963826

For:  RF
Accuracy:  0.9235569422776911
Precision:  0.9400871459694989
Roc_auc:  0.9234233698238934

For:  Adaboost
Accuracy:  0.84399375975039
Precision:  0.8857479387514723
Roc_auc:  0.8436139717017871

For:  Bgc
Accuracy:  0.8965158606344253
Precision:  0.8913043478260869
Roc_auc:  0.8965498031240535

For:  ETC
Accuracy:  0.9209568382735309
Precision:  0.9195402298850575
Roc_auc:  0.9209602570204665

For:  GBDT
Accuracy:  0.8289131565262611
Precision:  0.9151193633952255
Roc_auc:  0.828198693

### 4. Tuning the best models with GridSearch

In [None]:
# selected models: ETC, RF, SVC
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Dictionary of classifiers
clfs = {
    'ETC': ExtraTreesClassifier(random_state=42),
    'RF': RandomForestClassifier(random_state=42),
    'SVC': SVC(random_state=42)
}

# Parameter grids
param_grid = {
    'ETC': {
        'n_estimators': [50, 100, 200],
        # 'auto' (equivalent to 'sqrt' for classifiers) was removed in scikit-learn 1.3
        'max_features': ['sqrt', 'log2'],
        'min_samples_split': [2, 4, 6]
    },
    'RF': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_leaf': [1, 2, 4]
    },
    'SVC': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    }
}

# Dictionary for storing the results
best_models = {}

# Custom scorers
scorers = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score),
    'roc_auc': make_scorer(roc_auc_score)
}

# Loop over the classifiers
for name, clf in clfs.items():
    grid_search = GridSearchCV(clf, param_grid=param_grid[name], scoring=scorers, refit='roc_auc', n_jobs=-1, cv=3, verbose=1)
    grid_search.fit(X_train, y_train)
    best_models[name] = grid_search.best_estimator_
    
    # Print the results
    print(f"Best parameters for {name}: {grid_search.best_params_}")
    y_pred = best_models[name].predict(X_test)
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(f"Precision: {precision_score(y_test, y_pred)}")
    print(f"ROC AUC: {roc_auc_score(y_test, y_pred)}")


Best parameters for ETC: {'max_features': 'log2', 'min_samples_split': 6, 'n_estimators': 200}
Accuracy: 0.9318772750910036
Precision: 0.9291666666666667
ROC AUC: 0.931891739864134

Best parameters for RF: {'max_depth': None, 'min_samples_leaf': 1, 'n_estimators': 100}
Accuracy: 0.9162766510660426
Precision: 0.9241452991452992
ROC AUC: 0.9162060274328242

Best parameters for SVC: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
Accuracy: 0.9334373374934998
Precision: 0.9431939978563773
ROC AUC: 0.9333569512353426
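
To get a feel for the cost of the search above, the number of candidate settings can be counted by enumerating the Cartesian product of each grid. The sketch below copies the RF and SVC grids from the cell above; the `candidates` helper is introduced here only for illustration.

```python
from itertools import product

param_grid = {
    'RF': {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20, 30], 'min_samples_leaf': [1, 2, 4]},
    'SVC': {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf'], 'gamma': ['scale', 'auto']},
}
cv = 3  # number of cross-validation folds used in the search

def candidates(grid):
    """Every combination of parameter values, as a list of dicts."""
    return [dict(zip(grid, values)) for values in product(*grid.values())]

for name, grid in param_grid.items():
    n = len(candidates(grid))
    print(name, n, 'candidates,', n * cv, 'fits')
# RF 36 candidates, 108 fits
# SVC 12 candidates, 36 fits
```

Each search also performs one final refit of the best candidate on the whole training split, so the real number of fits is one higher per model.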

### 5. Building an ensemble

In [125]:
# Base models (with the best parameters found by the grid search)
classifiers = [
    ('etc', ExtraTreesClassifier(n_estimators=200, max_features='log2', min_samples_split=6, random_state=42)),
    ('rf', RandomForestClassifier(n_estimators=100, min_samples_leaf=1, max_depth=None, random_state=42)),
    ('svc', SVC(C=10, gamma='scale', kernel='rbf', random_state=42))
]

# Meta-classifier
meta_classifier = LogisticRegression()

# Build the stacking model
stacking_model = StackingClassifier(estimators=classifiers, final_estimator=meta_classifier, cv=5)

# Train the stacking model
stacking_model.fit(X_train, y_train)

# Evaluate the model
y_pred = stacking_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"ROC AUC: {roc_auc_score(y_test, y_pred)}")


Accuracy: 0.9360374414976599
Precision: 0.9378947368421052
ROC AUC: 0.9360169399852883


### 6. Using a ready-made pre-trained transformer model https://huggingface.co/mshenoda/roberta-spam?text=delivery+status+notification+failure+the+following+message+to+was+undeliverable+the+reason+for+the+problem+5+1+0+unknown+address+error+550+5+1+1+unknown+or+illegal+alias+gkoppmal+elp+rr+com

In [141]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.utils.data import DataLoader, Dataset

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("mshenoda/roberta-spam")
model = AutoModelForSequenceClassification.from_pretrained("mshenoda/roberta-spam")

# Dataset wrapper
class TextDataset(Dataset):
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts.iloc[idx]

# List of texts
texts = train_df['text'][:1000]

# Create the dataset object
dataset = TextDataset(texts)

# Create a DataLoader to handle batching
loader = DataLoader(dataset, batch_size=100, shuffle=False)

# Batch prediction function
def predict(model, dataloader):
    model.eval()
    predictions = []
    probabilities = []
    for batch in dataloader:
        texts = batch
        inputs = tokenizer(texts, return_tensors="pt", padding='max_length', truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=1)
            preds = torch.argmax(probs, dim=1)
            probabilities.extend(probs.tolist())
            predictions.extend(preds.tolist())
    return probabilities, predictions

# Get predictions
probabilities, predictions = predict(model, loader)
print('Done')


Done


In [142]:
# measure quality
y_test = train_df['target'][:1000]

print(f"Accuracy: {accuracy_score(y_test, predictions)}")
print(f"Precision: {precision_score(y_test, predictions)}")
print(f"ROC AUC: {roc_auc_score(y_test, predictions)}")


Accuracy: 0.998
Precision: 0.9965986394557823
ROC AUC: 0.9975911044304407
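
Because `predict` also returns class probabilities, the decision threshold can be raised when false positives are more costly than missed spam. The sketch below assumes that index 1 of each probability pair is the spam class, which is consistent with the 0-ham, 1-spam encoding used for `target`. The `classify` helper and the example probabilities are hypothetical.

```python
def classify(probabilities, spam_threshold=0.5):
    """probabilities: list of [p_ham, p_spam] pairs, in the shape returned by predict()."""
    return [1 if p_spam >= spam_threshold else 0 for _, p_spam in probabilities]

# Hypothetical softmax outputs for four messages
probabilities = [[0.98, 0.02], [0.40, 0.60], [0.10, 0.90], [0.01, 0.99]]

print(classify(probabilities))                      # [0, 1, 1, 1]
print(classify(probabilities, spam_threshold=0.8))  # [0, 0, 1, 1]
```

At a threshold of 0.5 this matches the argmax rule used above, except for exact ties. A higher threshold flags fewer messages, which tends to raise precision at the cost of recall.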


### 7. Getting predictions on the test dataset

In [146]:
test_df = pd.read_csv('test_spam.csv')
test_df

Unnamed: 0,text
0,j jim whitehead ejw cse ucsc edu writes j you ...
1,original message from bitbitch magnesium net p...
2,java for managers vince durasoft who just taug...
3,there is a youtuber name saiman says
4,underpriced issue with high return on equity t...
...,...
4065,husband to wifetum meri zindagi hoorwifeor kya...
4066,baylor enron case study cindy yes i shall co a...
4067,boring as compared to tp
4068,hellogorgeous hows u my fone was on charge lst...


In [147]:
# check for missing values
test_df.isnull().sum()

text    0
dtype: int64

In [148]:
texts = test_df['text']

dataset = TextDataset(texts)
loader = DataLoader(dataset, batch_size=100, shuffle=False)

# Batch prediction function (same as above, with progress output)
def predict(model, dataloader):
    model.eval()
    predictions = []
    probabilities = []
    for idx, batch in enumerate(dataloader):
        print(f'Batch {idx}')
        texts = batch
        inputs = tokenizer(texts, return_tensors="pt", padding='max_length', truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=1)
            preds = torch.argmax(probs, dim=1)
            probabilities.extend(probs.tolist())
            predictions.extend(preds.tolist())
    return probabilities, predictions

# Get predictions
probabilities, predictions = predict(model, loader)
print('Done')


Batch 0
Batch 1
...
Batch 40
Done


In [151]:
# function to convert integer class labels to ham/spam
def decode_prediction(class_prediction):
    if class_prediction == 0:
        return 'ham'
    else:
        return 'spam'

test_df['class_prediction'] = predictions
test_df['score'] = test_df['class_prediction'].apply(decode_prediction)


In [156]:
# save the model's predictions
test_df[['score', 'text']].to_csv('submission.csv', index=False)