## Spam Detector

This project has two tasks:
1) find spam messages in a set of texts and remove them;
2) build a model that can detect spam messages.

To solve these tasks I used a pretrained spaCy model for Russian and my own Python module, `nk_nlp1_5`.

This module contains two classes:

`TextPreprocessing` provides methods for processing texts: regular expressions, deduplication, mapping, quoting, and NLP methods based on semantic similarity, part-of-speech tagging, dependency parsing, and named entity recognition. The methods can be applied directly to a whole collection of texts.

`Categorizator` provides methods that identify relationships between sets of texts. It supports several approaches to similarity calculation, selected through parameters. Besides similarity, its methods can calculate quoting, that is, how often a word or expression occurs in the other set of texts. This helps to focus on the most significant items.

See `help(classname)` for details.

In [1]:
import pandas as pd
import numpy as np
import json
import tqdm
from glob import glob

import spacy

# my module for text processing and categorization based on SpaCy
from nk_nlp1_5 import TextPreprocessing, Categorizator

from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# full option names are used instead of abbreviations
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_colwidth', None)

## Data loading and preprocessing

In [2]:
# replace look-alike characters between the Latin and Cyrillic alphabets

def alphabet_replacer(string, to='ru'):
    """Map Latin look-alike characters to Cyrillic (to='ru') or back (to='en')."""
    # the two lists are aligned position by position
    ru = list('абдеикмнорстухАВЕКМНОРСТУХа')
    en = list('abdeukmnopctyxABEKMHOPCTYXα')
    if to == 'ru':
        for char in en:
            if char in string:
                string = string.replace(char, ru[en.index(char)])
    elif to == 'en':
        for char in ru:
            if char in string:
                string = string.replace(char, en[ru.index(char)])
    return string
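
The same position-by-position mapping can also be expressed as a translation table. The sketch below is an alternative formulation, not the notebook's code: `to_cyrillic`, `LATIN`, `CYRILLIC`, and `LATIN_TO_CYRILLIC` are illustrative names, and only the `to='ru'` direction is covered.

```python
# Hypothetical alternative to alphabet_replacer(to='ru'): a single pass with str.translate.
LATIN = 'abdeukmnopctyxABEKMHOPCTYXα'
CYRILLIC = 'абдеикмнорстухАВЕКМНОРСТУХа'

# str.maketrans pairs the characters of both strings position by position
LATIN_TO_CYRILLIC = str.maketrans(LATIN, CYRILLIC)


def to_cyrillic(text: str) -> str:
    """Replace look-alike Latin characters with their Cyrillic counterparts."""
    return text.translate(LATIN_TO_CYRILLIC)


print(to_cyrillic('doxod'))  # a Latin-only spelling of a Russian word
print(to_cyrillic('trace'))  # genuine Latin words are changed as well
```

The second call shows the trade-off of any such mapping: technical terms written in Latin letters get distorted too, which is visible in the `text_repl` column further down.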
    

In [3]:
# loading data from JSONL format

messages = []

for path in sorted(glob('../PT/chat_data/*.jsonl')):
    with open(path, 'r', encoding='utf8') as file:
        for line in file:
            messages.append(json.loads(line))
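
If the export may contain blank or damaged lines, a more defensive reader could look like the sketch below. It is only an illustration: `iter_jsonl` is a hypothetical helper, and the demo records are made up.

```python
import json
import tempfile
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed object per non-empty line; skip lines that fail to parse."""
    with open(path, 'r', encoding='utf8') as file:
        for line_no, line in enumerate(file, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines are common at the end of a file
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                print(f'{path}:{line_no}: skipped malformed line')


# Tiny self-contained demo with a temporary file (made-up records)
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / 'demo.jsonl'
    demo.write_text(
        '{"message": {"message_id": 1, "text": "hi"}}\n\n{broken\n',
        encoding='utf8',
    )
    demo_messages = list(iter_jsonl(demo))

print(len(demo_messages))
```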
          

In [4]:
# extracting needed data from json

data_for_df = []

for message in messages:
    # a record without the 'message' key yields None for both fields
    msg = message.get('message') or {}
    record = {
        'message_id': msg.get('message_id'),
        'text': msg.get('text')
    }

    data_for_df.append(record)

df = pd.DataFrame(data_for_df)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3824 entries, 0 to 3823
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   message_id  3824 non-null   int64 
 1   text        3268 non-null   object
dtypes: int64(1), object(1)
memory usage: 59.9+ KB


In [5]:
# drop messages without text; .copy() avoids SettingWithCopyWarning when columns are added later

df = df[df.text.notna()].copy()

In [6]:
# replace look-alike characters and lowercase the text

df['text_repl'] = df['text'].apply(alphabet_replacer).str.lower()
df = df.reset_index()

In [10]:
df.loc[25:31, 'text':'text_repl']

Unnamed: 0,text,text_repl
25,–ï—Å–ª–∏ –≤ —ç—Ç—É –∑–∞–¥–∞—á—É –±–æ–ª—å—à–µ –Ω–∏—á–µ–≥–æ –Ω–µ –ª–µ—Ç–∏—Ç - –º–æ–∂–Ω–æ trace –≤–∫–ª—é—á–∏—Ç—å –∏ –±—É–¥–µ—Ç –≤–∏–¥–Ω–æ —á—Ç–æ —Ç–æ—á–Ω–æ –ø—Ä–∏—à–ª–æ –∏ —Ç –¥,–µ—Å–ª–∏ –≤ —ç—Ç—É –∑–∞–¥–∞—á—É –±–æ–ª—å—à–µ –Ω–∏—á–µ–≥–æ –Ω–µ –ª–µ—Ç–∏—Ç - –º–æ–∂–Ω–æ —Çr–∞—Å–µ –≤–∫–ª—é—á–∏—Ç—å –∏ –±—É–¥–µ—Ç –≤–∏–¥–Ω–æ —á—Ç–æ —Ç–æ—á–Ω–æ –ø—Ä–∏—à–ª–æ –∏ —Ç –¥
26,–ó–∞–¥–∞—á–∞ –∑–∞–≤–µ—Ä—à–∏—Ç—Å—è —Å –æ—à–∏–±–∫–æ–π,–∑–∞–¥–∞—á–∞ –∑–∞–≤–µ—Ä—à–∏—Ç—Å—è —Å –æ—à–∏–±–∫–æ–π
27,"–¥–∞, –≤ —ç—Ç—É –∑–∞–¥–∞—á—É –Ω–∏—á–µ–≥–æ –Ω–µ –ª–µ—Ç–∏—Ç. –í—ã –∏–º–µ–µ—Ç–µ –≤ –≤–∏–¥—É trace - —É—Ç–∏–ª–∏—Ç–∞, –∫–æ—Ç–æ—Ä–∞—è –ø–æ–∑–≤–æ–ª—è–µ—Ç –ø—Ä–æ—Å–ª–µ–¥–∏—Ç—å –º–∞—Ä—à—Ä—É—Ç —Å–ª–µ–¥–æ–≤–∞–Ω–∏—è –¥–∞–Ω–Ω—ã—Ö –¥–æ —É–¥–∞–ª–µ–Ω–Ω–æ–≥–æ –∞–¥—Ä–µ—Å–∞—Ç–∞ –≤ —Å–µ—Ç—è—Ö TCP/IP?","–¥–∞, –≤ —ç—Ç—É –∑–∞–¥–∞—á—É –Ω–∏—á–µ–≥–æ –Ω–µ –ª–µ—Ç–∏—Ç. –≤—ã –∏–º–µ–µ—Ç–µ –≤ –≤–∏–¥—É —Çr–∞—Å–µ - —É—Ç–∏–ª–∏—Ç–∞, –∫–æ—Ç–æ—Ä–∞—è –ø–æ–∑–≤–æ–ª—è–µ—Ç –ø—Ä–æ—Å–ª–µ–¥–∏—Ç—å –º–∞—Ä—à—Ä—É—Ç —Å–ª–µ–¥–æ–≤–∞–Ω–∏—è –¥–∞–Ω–Ω—ã—Ö –¥–æ —É–¥–∞–ª–µ–Ω–Ω–æ–≥–æ –∞–¥—Ä–µ—Å–∞—Ç–∞ –≤ —Å–µ—Ç—è—Ö —Ç—Å—Ä/i—Ä?"
28,"–ù–∞–±–∏—Ä–∞—é –ª—é–¥–µ–π –≤ –∫–æ–º–∞–Ω–¥—É –¥–ª—è –≤–∑–∞–∏–º–æ–≤—ã–≥–æ–¥–Ω–æ–≥–æ —Å–æ—Ç—Ä—É–¥–Ω–∏—á–µ—Å—Ç–≤–∞ –≤ Cr–£p–¢o\n–°–æ —Å—Ç–∞—Ä—Ç–∞ –ø–æ–ª—É—á–∞–µ—Ç—Å—è –æ—Ç 2315$ –≤ –Ω–µ–¥–µ–ª—é\n–ü–æ –≤—Ä–µ–º–µ–Ω–∏ - –≤ –¥–µ–Ω—å –∑–∞–Ω–∏–º–∞–µ—Ç –¥–æ 2-—Ö —á–∞—Å–æ–≤\n–ú–æ–∂–Ω–æ –ü–æ–ª—É—á–∞—Ç—å –ø–∞—Å—Å–∏–≤–Ω—ã–π –¥–æ—Ö–æ–¥ —Å –ª—é–±–æ–π —Ç–æ—á–∫–∏ –º–∏—Ä–∞!\n–û—Ç 18-—Ç–∏ –ª–µ—Ç‚ÄºÔ∏è\n\n–ù–∞–ø–∏—à–∏ –º–Ω–µ, –µ—Å–ª–∏ –∑–∞–∏–Ω—Ç–µ—Ä–µ—Å–æ–≤–∞–ª–∞‚úâÔ∏è","–Ω–∞–±–∏—Ä–∞—é –ª—é–¥–µ–π –≤ –∫–æ–º–∞–Ω–¥—É –¥–ª—è –≤–∑–∞–∏–º–æ–≤—ã–≥–æ–¥–Ω–æ–≥–æ —Å–æ—Ç—Ä—É–¥–Ω–∏—á–µ—Å—Ç–≤–∞ –≤ —År—É—Ä—Ç–æ\n—Å–æ —Å—Ç–∞—Ä—Ç–∞ –ø–æ–ª—É—á–∞–µ—Ç—Å—è –æ—Ç 2315$ –≤ –Ω–µ–¥–µ–ª—é\n–ø–æ –≤—Ä–µ–º–µ–Ω–∏ - –≤ –¥–µ–Ω—å –∑–∞–Ω–∏–º–∞–µ—Ç –¥–æ 2-—Ö —á–∞—Å–æ–≤\n–º–æ–∂–Ω–æ –ø–æ–ª—É—á–∞—Ç—å –ø–∞—Å—Å–∏–≤–Ω—ã–π –¥–æ—Ö–æ–¥ —Å –ª—é–±–æ–π —Ç–æ—á–∫–∏ –º–∏—Ä–∞!\n–æ—Ç 18-—Ç–∏ –ª–µ—Ç‚ÄºÔ∏è\n\n–Ω–∞–ø–∏—à–∏ –º–Ω–µ, –µ—Å–ª–∏ –∑–∞–∏–Ω—Ç–µ—Ä–µ—Å–æ–≤–∞–ª–∞‚úâÔ∏è"
29,"–Ω–µ—Ç, –º–æ–∂–Ω–æ –≤ –∑–∞–¥–∞—á–µ —Å–±–æ—Ä–∞ (logging_settings) –Ω–∞—Å—Ç—Ä–æ–∏—Ç—å —Ä–∞—Å—à–∏—Ä–µ–Ω–Ω—ã–π –∂—É—Ä–Ω–∞–ª –∏ —Ç–∞–º –±—É–¥–µ—Ç –≤—Å–µ —á—Ç–æ –ø—Ä–∏–ª–µ—Ç–∞–µ—Ç –∏ –æ–±—Ä–∞–±–∞—Ç—ã–≤–∞–µ—Ç—Å—è","–Ω–µ—Ç, –º–æ–∂–Ω–æ –≤ –∑–∞–¥–∞—á–µ —Å–±–æ—Ä–∞ (l–æggi–Ωg_s–µ—Ç—Çi–Ωgs) –Ω–∞—Å—Ç—Ä–æ–∏—Ç—å —Ä–∞—Å—à–∏—Ä–µ–Ω–Ω—ã–π –∂—É—Ä–Ω–∞–ª –∏ —Ç–∞–º –±—É–¥–µ—Ç –≤—Å–µ —á—Ç–æ –ø—Ä–∏–ª–µ—Ç–∞–µ—Ç –∏ –æ–±—Ä–∞–±–∞—Ç—ã–≤–∞–µ—Ç—Å—è"
30,"–•–æ—Ä–æ—à–æ, –ø–æ–Ω—è–ª–∞, —Å–µ–π—á–∞—Å –ø–æ–ø—Ä–æ–±—É–µ–º, —Å–ø–∞—Å–∏–±–æ!","—Ö–æ—Ä–æ—à–æ, –ø–æ–Ω—è–ª–∞, —Å–µ–π—á–∞—Å –ø–æ–ø—Ä–æ–±—É–µ–º, —Å–ø–∞—Å–∏–±–æ!"
31,"–î–æ–±—Ä—ã–π –¥–µ–Ω—å!\n–≤–µ—Ä—Å–∏—è 25 —Å–µ—Ä—Ç–∏—Ñ–∏—Ü–∏—Ä–æ–≤–∞–Ω–Ω–∞—è\n–ö–∞–∫ –≤—ã–ø—É—Å—Ç–∏—Ç—å –æ—Ç—á–µ—Ç –ø–æ —É—è–∑–≤–∏–º–æ—Å—Ç—è–º?\n–†–∞–Ω–µ–µ –≤ 24–π –≤–µ—Ä—Å–∏–∏ –±—ã–ª–æ –ø—Ä–æ—â–µ - –∑–∞—Ö–æ–¥–∏—à—å –≤ –∞–∫—Ç–∏–≤—ã, –≤—ã–±–∏—Ä–∞–µ—à—å –Ω—É–∂–Ω—ã–µ, –∏ —Å–æ–∑–¥–∞—Ç—å –æ—Ç—á–µ—Ç –ø–æ —É—è–∑–≤–∏–º–æ—Å—Ç—è–º","–¥–æ–±—Ä—ã–π –¥–µ–Ω—å!\n–≤–µ—Ä—Å–∏—è 25 —Å–µ—Ä—Ç–∏—Ñ–∏—Ü–∏—Ä–æ–≤–∞–Ω–Ω–∞—è\n–∫–∞–∫ –≤—ã–ø—É—Å—Ç–∏—Ç—å –æ—Ç—á–µ—Ç –ø–æ —É—è–∑–≤–∏–º–æ—Å—Ç—è–º?\n—Ä–∞–Ω–µ–µ –≤ 24–π –≤–µ—Ä—Å–∏–∏ –±—ã–ª–æ –ø—Ä–æ—â–µ - –∑–∞—Ö–æ–¥–∏—à—å –≤ –∞–∫—Ç–∏–≤—ã, –≤—ã–±–∏—Ä–∞–µ—à—å –Ω—É–∂–Ω—ã–µ, –∏ —Å–æ–∑–¥–∞—Ç—å –æ—Ç—á–µ—Ç –ø–æ —É—è–∑–≤–∏–º–æ—Å—Ç—è–º"


## Spam detection on similarity

In [73]:
# loading the pretrained model

nlp = spacy.load('ru_core_news_lg')

Below we use my class `TextPreprocessing`, which contains several NLP-based methods for data processing. The method `.word_extractor` keeps, for each text, a specified number of words that are closest to the set of words or phrases passed as the `pattern` argument. Additional parameters allow finer filtering; in the example below I filter by part of speech.
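
To make the idea concrete, here is a minimal sketch of similarity-based word selection. It is not the implementation of `word_extractor` (the source of `nk_nlp1_5` is not shown here); the function names, the toy 3-dimensional vectors, and the English words are invented for illustration. In the notebook the vectors would come from the spaCy model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_words(words, pattern_vector, allowed_pos, count):
    """Keep up to `count` words with an allowed POS tag, most similar first."""
    scored = [
        (cosine(vector, pattern_vector), word)
        for word, pos, vector in words
        if pos in allowed_pos
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:count]]


# (word, part of speech, toy vector): all values are invented
toy_words = [
    ('income', 'NOUN', [0.9, 0.1, 0.0]),
    ('version', 'NOUN', [0.1, 0.9, 0.1]),
    ('earn', 'VERB', [0.8, 0.2, 0.1]),
    ('quickly', 'ADV', [0.9, 0.0, 0.0]),  # dropped by the POS filter
]
pattern = [1.0, 0.0, 0.0]  # stands in for the vector of the search phrase

selected = closest_words(toy_words, pattern, allowed_pos={'NOUN', 'VERB'}, count=2)
print(selected)
```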

In [74]:
tp = TextPreprocessing(nlp=nlp, text_col=df['text_repl'])

In [75]:
search_phrase = 'заработок, доход, прибыль, профит, трейдинг, перспектива, подработка, сотрудничество, выгода, биржа, удаленная, ищу людей, поиск партнеров, ломбард, деньги, купюры'

In [76]:
extr_results = tp.word_extractor(pattern=search_phrase, threshold=None, count_thres=10, dep=None, pos=['NOUN', 'VERB'], desc_sim=True, stat=False, full_df=False, aliquot=10)


In [77]:
df['expr_results'] = extr_results

In [78]:
df.loc[32:36, 'text_repl':'expr_results']

Unnamed: 0,text_repl,expr_results
32,"–¥–æ–±—Ä—ã–π –¥–µ–Ω—å!\n–ø–æ–¥—Å–∫–∞–∂–∏—Ç–µ –ø–æ–∂–∞–ª—É–π—Å—Ç–∞, –∫–∞–∫ –≤ 25-–π –≤–µ—Ä—Å–∏–∏ –ø–æ–ª—É—á–∏—Ç—å —Ç–æ–∫–µ–Ω –¥–æ—Å—Ç—É–ø–∞, –µ—Å–ª–∏ —É–∑ –∞–≤—Ç–æ—Ä–∏–∑—É–µ—Ç—Å—è –ø–æ l–¥–∞—Ä? –≤ 26 —Ç–æ–∂–µ –Ω–µ —Å–æ–≤—Å–µ–º —è—Å–Ω–æ –∏–∑ –¥–æ–∫—É–º–µ–Ω—Ç–∞—Ü–∏–∏ –∫–∞–∫–∏–µ –∑–Ω–∞—á–µ–Ω–∏—è –ø—Ä–∏–Ω–∏–º–∞–µ—Ç –ø–∞—Ä–∞–º–µ—Ç—Ä –∞–ºr –ø—Ä–∏ –ø–æ–ª—É—á–µ–Ω–∏–∏ —Ç–æ–∫–µ–Ω–∞.",–ø–æ–ª—É—á–∏—Ç—å –ø–æ–ª—É—á–µ–Ω–∏–∏ –¥–æ—Å—Ç—É–ø–∞ –∑–Ω–∞—á–µ–Ω–∏—è –ø—Ä–∏–Ω–∏–º–∞–µ—Ç –¥–µ–Ω—å –≤–µ—Ä—Å–∏–∏ –ø–∞—Ä–∞–º–µ—Ç—Ä –ø–æ–¥—Å–∫–∞–∂–∏—Ç–µ —É–∑
33,"–∂–µ–ª–∞—Ç–µ–ª—å–Ω–æ –±–µ–∑ –ø–æ—Å—Ç –∑–∞–ø—Ä–æ—Å–æ–≤ –∫ :3334/–∏i/l–ægi–Ω/, –∞ –∏–º–µ–Ω–Ω–æ –≤–∞—Ä–∏–∞–Ω—Ç —Å –ø–æ–ª—É—á–µ–Ω–∏–µ–º —Ç–æ–∫–µ–Ω–∞",–≤–∞—Ä–∏–∞–Ω—Ç –ø–æ–ª—É—á–µ–Ω–∏–µ–º –∑–∞–ø—Ä–æ—Å–æ–≤ –ø–æ—Å—Ç
34,"–∫–æ–ª–ª–µ–≥–∏, –¥–µ–Ω—å –¥–æ–±—Ä—ã–π. –Ω–∏–∫—Ç–æ –Ω–µ –≤ –∫—É—Ä—Å–µ, —Ä–∞–∑—Ä–∞–±–∞—Ç—ã–≤–∞—é—Ç –Ω–æ—Ä–º–∞–ª–∏–∑–∞—Ü–∏—é –ø–æ–¥ –ª–∏–Ω—É–∫—Å–æ–≤—ã–π –∫s—Å? —Ç–∞–º –∂–µ –º–∞ri–∞d–≤, –≤ –¥–µ—Ñ–æ–ª—Ç–µ –Ω–∏–∫–∞–∫ –Ω–µ –ø–æ–¥—Ä—É–∂–∏—à—å. –ø–æ —Å–∏—Å–ª–æ–≥—É, –Ω–∞—Å–∫–æ–ª—å–∫–æ —è –ø–æ–Ω–∏–º–∞—é, –º–∞–ª–æ–≤–∞—Ç–æ –∏–Ω—Ñ—ã.",–ø–æ–Ω–∏–º–∞—é –∫—É—Ä—Å–µ –¥–µ–Ω—å —Ä–∞–∑—Ä–∞–±–∞—Ç—ã–≤–∞—é—Ç –∫–æ–ª–ª–µ–≥–∏ –Ω–æ—Ä–º–∞–ª–∏–∑–∞—Ü–∏—é
35,–≤ 25- –Ω–∏–∫–∞–∫\n–≤ –∞–∫—Ç—É–∞–ª—å–Ω—ã—Ö –≤–µ—Ä—Å–∏—è—Ö\nh—Ç—Ç—Äs://h–µl—Ä.—Ä—Çs–µ—Å–∏ri—Ç—É.—Å–æ–º/—Är–æj–µ—Å—Çs/–º–∞—Ö—Ä–∞—Çr–æl10/26.2/r–∏-ru/h–µl—Ä/3678991755\n–ø–∞—Ä–∞–º–µ—Ç—Ä –∞–ºr,–ø–∞—Ä–∞–º–µ—Ç—Ä –≤–µ—Ä—Å–∏—è—Ö
36,"–∞–≥–∞, —Å—Ç–∞—Ä—ã–π –º–µ—Ç–æ–¥ —á–µ—Ä–µ–∑ —ç–º—É–ª—è—Ü–∏—é ui –∏ –±–µ–∑ –∏—Å–ø–æ–ª—å–∑–æ–≤–∞–Ω–∏—è —Ål–µ–Ω—Ç_s–µ—År–µ—Ç —Ç–æ–∂–µ —Ä–∞–±–æ—Ç–∞–µ—Ç –∏ –µ–≥–æ –º–æ–∂–Ω–æ –∏—Å–ø–æ–ª—å–∑–æ–≤–∞—Ç—å . –ø—Ä–∏–º–µ—Ä —Ä–∞–±–æ—Ç—ã –µ—Å—Ç—å –≤ –º—Äsi–µ–ºli–±",—Ä–∞–±–æ—Ç–∞–µ—Ç —Ä–∞–±–æ—Ç—ã –º–µ—Ç–æ–¥ –∏—Å–ø–æ–ª—å–∑–æ–≤–∞–Ω–∏—è –∏—Å–ø–æ–ª—å–∑–æ–≤–∞—Ç—å –ø—Ä–∏–º–µ—Ä


Here we use my other class, `Categorizator`, which categorizes text data by similarity. The method `cat_sim` calculates the similarity between the specified phrase and each text, and sorts the texts by descending similarity.
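
The `cat_sim` call in this section passes `metric='mean_top5'`. The module's source is not shown, so the following is only one plausible reading of that name, stated as an assumption: a text is scored by the mean of its five highest word-level similarities. `mean_top_k` and the sample values are hypothetical.

```python
def mean_top_k(similarities, k=5):
    """Average of the k largest values; uses all values if fewer than k are given."""
    if not similarities:
        return 0.0
    top = sorted(similarities, reverse=True)[:k]
    return sum(top) / len(top)


# Invented word-level similarities for one text
word_sims = [0.9, 0.1, 0.8, 0.7, 0.6, 0.5, 0.2]
score = mean_top_k(word_sims, k=5)
print(round(score, 3))
```

Under this reading, a few strongly matching words are enough for a high score, and unrelated words in a long message do not dilute it.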

In [79]:
cat_kw = Categorizator(pattern_list=df['expr_results'], nlp=nlp, only_w_vector=False)

No cat given. Use param "cat_list".
Categories without vectors: Series([], dtype: object)
Starting NLP-processing for pattern_list



pattern_list processed



In [81]:
spam_results_kw = cat_kw.cat_sim(cat=search_phrase, sim_func='adv', metric='mean_top5')

Using preprocessed pattern_list
Starting quotes counting...
No text data for quoting! Uze param "quoting".


In [82]:
result_df = df.join(spam_results_kw.sort_index())

In [83]:
result_df.sort_values(search_phrase, ascending=False)[['text', 'expr_results', search_phrase]].head(5)

Unnamed: 0,text,expr_results,"–∑–∞—Ä–∞–±–æ—Ç–æ–∫, –¥–æ—Ö–æ–¥, –ø—Ä–∏–±—ã–ª—å, –ø—Ä–æ—Ñ–∏—Ç, —Ç—Ä–µ–π–¥–∏–Ω–≥, –ø–µ—Ä—Å–ø–µ–∫—Ç–∏–≤–∞, –ø–æ–¥—Ä–∞–±–æ—Ç–∫–∞, —Å–æ—Ç—Ä—É–¥–Ω–∏—á–µ—Å—Ç–≤–æ, –≤—ã–≥–æ–¥–∞, –±–∏—Ä–∂–∞, —É–¥–∞–ª–µ–Ω–Ω–∞—è, –∏—â—É –ª—é–¥–µ–π, –ø–æ–∏—Å–∫ –ø–∞—Ä—Ç–Ω–µ—Ä–æ–≤, –ª–æ–º–±–∞—Ä–¥, –¥–µ–Ω—å–≥–∏, –∫—É–ø—é—Ä—ã"
1996,"–î–æ–±—Ä—ã–π –≤–µ—á–µ—Ä, –∏—â—É –ª—é–¥–µ–π –Ω–∞ —É–¥–∞–ª–µ–Ω–Ω—ã–π –∑–∞—Ä–∞–±–æ—Ç–æ–∫ —Å 18 –ª–µ—Ç \n–ó–ü –æ—Ç 150$ –≤ –¥–µ–Ω—å\n–ü–∏—à–∏—Ç–µ + –≤ –ª—Å",–∑–∞—Ä–∞–±–æ—Ç–æ–∫ –∏—â—É –ª—é–¥–µ–π –ª–µ—Ç –¥–µ–Ω—å –ø–∏—à–∏—Ç–µ –≤–µ—á–µ—Ä,0.818371
1435,"–ó–¥—Ä–∞–≤—Å—Ç–≤—É–π—Ç–µ, –∏—â—É –ª—é–¥–µ–π –≤ —Ç–∏–º—É.\n–°–≤–æ–±–æ–¥–Ω—ã–π –≥—Ä–∞—Ñ–∏–∫üëå\n–ü—Ä–∏—è—Ç–Ω—ã–π –∑–∞—Ä–∞–±–æ—Ç–æ–∫ –æ—Ç 200 $ –≤ –¥–µ–Ω—å\n–ï—Å–ª–∏ –∏–Ω—Ç–µ—Ä–µ—Å—É–µ—Ç –ø–æ–¥—Ä–æ–±–Ω–∞—è –∏–Ω—Ñ–æ—Ä–º–∞—Ü–∏—è "" + "" –≤ –ª—Å",–∑–∞—Ä–∞–±–æ—Ç–æ–∫ –∏–Ω—Ç–µ—Ä–µ—Å—É–µ—Ç –∏—â—É –∏–Ω—Ñ–æ—Ä–º–∞—Ü–∏—è –ª—é–¥–µ–π –≥—Ä–∞—Ñ–∏–∫ –¥–µ–Ω—å –∑–¥—Ä–∞–≤—Å—Ç–≤—É–π—Ç–µ,0.818371
490,Ec—Ç—å c–øoco–± –ø–æ–ª—É—á–∏—Ç—å –¥oxo–¥\n–ù–∞ –ø—Ä–æ–∫—Ä—É—Ç–∞—Ö –±–∏—Ä–∂–∏ Bybit –∏ Bitget\n–ü—Ä–∏–±—ã–ª—å –∫ –∫–∞–ø–∏—Ç–∞–ª—É +2-3%\n–û–±—É—á–∞–µ–º –Ω–æ–≤–∏—á–∫–æ–≤ —Å 0.\n–†–∞–±–æ—Ç–∞–µ–º –±–µ–∑ —Å—Ç–æ—Ä–æ–Ω–Ω–∏—Ö —Å–∞–π—Ç–æ–≤\n–ï—Å—Ç—å —Ñ–æ—Ç–æ/–≤–∏–¥–µ–æ –∏–Ω—Ñ–æ–º–∞—Ç–µ—Ä–∏–∞–ª—ã –ø–æ —Å–≤—è–∑–∫–∞–º.\n–ò–Ω—Ç–µ—Ä–µ—Å—É–µ—Ç? —Ç–æ–≥–¥–∞ –ø–∏—à–∏ –≤ –õ—Å.,–ø—Ä–∏–±—ã–ª—å –¥–æ—Ö–æ–¥ –ø–æ–ª—É—á–∏—Ç—å –∏–Ω—Ç–µ—Ä–µ—Å—É–µ—Ç –∫–∞–ø–∏—Ç–∞–ª—É –±–∏—Ä–∂–∏ –µ—Å—Ç—å —Å–∞–π—Ç–æ–≤ —Ä–∞–±–æ—Ç–∞–µ–º,0.812563
846,–ü—Ä–∏–≤–µ—Çc—Ç–≤y—é! –ò—âe–º –ª—é–¥e–π - –≥–æ—Ço–≤—ã—Ö –Ωa –≤–∑a–∏–ºo–≤—ã–≥o–¥–Ωo–º —Åo—Ç—Ä—É–¥–Ω–∏—á–µ—Å—Ç–≤–µ –ø–æ–ª—É—á–∞—Ç—å —Öo—Äo—à–∏–π –¥o–ø. –¥oxo–¥ –øo –Ω–∞—àe–ºy —Ñp–∏–ªa–Ωc –ø—Äoe–∫—Ç—É.\n–üo –≤o–øpo—Åa–º - –ø–∏—à–∏—Çe –≤ –ª–∏—á–Ω—ãe co–æ–±—âe–Ω–∏—è.,–¥–æ—Ö–æ–¥ –ø–æ–ª—É—á–∞—Ç—å –∏—â–µ–º —Å–æ—Ç—Ä—É–¥–Ω–∏—á–µ—Å—Ç–≤–µ –ª—é–¥–µ–π –ø—Ä–æ–µ–∫—Ç—É –ø–∏—à–∏—Ç–µ —Å–æ–æ–±—â–µ–Ω–∏—è –≤–æ–ø—Ä–æ—Å–∞–º –ø—Ä–∏–≤–µ—Ç—Å—Ç–≤—É—é,0.804261
2449,"–ò—â–µ–º –ø–∞—Ä—Ç–Ω–µ—Ä–æ–≤, –≥–æ—Ço–≤—ã—Ö –Ωa –≤–∑a–∏–ºo–≤—ã–≥o–¥–Ωo–º —Åo—Ç—Ä—É–¥–Ω–∏—á–µ—Å—Ç–≤–µ –ø–æ–ª—É—á–∞—Ç—å —Öo—Äo—à–∏–π –¥o–ø. –¥oxo–¥ –øo –Ω–∞—àe–ºy –ø—Äoe–∫—Ç—É. –ù–µ–øo–ª–Ω–∞—è –∑–∞–Ω—è—Ç–æ—Å—Ç—å. \n–üo –≤o–øpo—Åa–º - –ø–∏—à–∏—Ç–µ –≤ –ª—Å",–¥–æ—Ö–æ–¥ –ø–∞—Ä—Ç–Ω–µ—Ä–æ–≤ –ø–æ–ª—É—á–∞—Ç—å –∏—â–µ–º —Å–æ—Ç—Ä—É–¥–Ω–∏—á–µ—Å—Ç–≤–µ –∑–∞–Ω—è—Ç–æ—Å—Ç—å –ø—Ä–æ–µ–∫—Ç—É –ø–∏—à–∏—Ç–µ –≤–æ–ø—Ä–æ—Å–∞–º,0.804261


__Almost all the spam ends up at the top of the table, which makes it easy to label the data and prepare a training dataset.__

In [None]:
# saving the result into excel for the verification.

result_df.to_excel('spam_results2.xlsx')

We now have a list of texts with most of the spam messages at the top, so we can label it easily and obtain training data.

## Estimating the labeling result

In [84]:
# loading the verified data

spam_results_df = pd.read_excel('spam_results_checked.xlsx', index_col='Unnamed: 0')

In [85]:
spam_results_df[['text', 'predict', 'target']].head()

Unnamed: 0,text,predict,target
1435,"–ó–¥—Ä–∞–≤—Å—Ç–≤—É–π—Ç–µ, –∏—â—É –ª—é–¥–µ–π –≤ —Ç–∏–º—É.\n–°–≤–æ–±–æ–¥–Ω—ã–π –≥—Ä–∞—Ñ–∏–∫üëå\n–ü—Ä–∏—è—Ç–Ω—ã–π –∑–∞—Ä–∞–±–æ—Ç–æ–∫ –æ—Ç 200 $ –≤ –¥–µ–Ω—å\n–ï—Å–ª–∏ –∏–Ω—Ç–µ—Ä–µ—Å—É–µ—Ç –ø–æ–¥—Ä–æ–±–Ω–∞—è –∏–Ω—Ñ–æ—Ä–º–∞—Ü–∏—è "" + "" –≤ –ª—Å",1,1
1996,"–î–æ–±—Ä—ã–π –≤–µ—á–µ—Ä, –∏—â—É –ª—é–¥–µ–π –Ω–∞ —É–¥–∞–ª–µ–Ω–Ω—ã–π –∑–∞—Ä–∞–±–æ—Ç–æ–∫ —Å 18 –ª–µ—Ç \n–ó–ü –æ—Ç 150$ –≤ –¥–µ–Ω—å\n–ü–∏—à–∏—Ç–µ + –≤ –ª—Å",1,1
490,Ec—Ç—å c–øoco–± –ø–æ–ª—É—á–∏—Ç—å –¥oxo–¥\n–ù–∞ –ø—Ä–æ–∫—Ä—É—Ç–∞—Ö –±–∏—Ä–∂–∏ Bybit –∏ Bitget\n–ü—Ä–∏–±—ã–ª—å –∫ –∫–∞–ø–∏—Ç–∞–ª—É +2-3%\n–û–±—É—á–∞–µ–º –Ω–æ–≤–∏—á–∫–æ–≤ —Å 0.\n–†–∞–±–æ—Ç–∞–µ–º –±–µ–∑ —Å—Ç–æ—Ä–æ–Ω–Ω–∏—Ö —Å–∞–π—Ç–æ–≤\n–ï—Å—Ç—å —Ñ–æ—Ç–æ/–≤–∏–¥–µ–æ –∏–Ω—Ñ–æ–º–∞—Ç–µ—Ä–∏–∞–ª—ã –ø–æ —Å–≤—è–∑–∫–∞–º.\n–ò–Ω—Ç–µ—Ä–µ—Å—É–µ—Ç? —Ç–æ–≥–¥–∞ –ø–∏—à–∏ –≤ –õ—Å.,1,1
275,–îe–Ω—å –¥o–±p—ã–π!\n\n–ò—âe–º –øap—Ç–Ωepo–≤ –≤ —Å—Ñepe –∫p–∏–ø—Ço–≤a–ª—é—Ç—ã \n\n- Y–¥a–ª—ë–Ω–Ωo!\n- –†a–±o—Ça —Å –ü–ö/–¢e–ªe—Ño–Ωa\n- –îo—Öo–¥ o—Ç 500$ –≤ –Ωe–¥e–ª—é\n\n–ú—ã –øpe–¥o—Å—Ça–≤–ª—èe–º:\n- –üep—Å–øe–∫—Ç–∏–≤y \n- –û–±y—áe–Ω–∏e –±e—Å–ø–ªa—Ç–Ωo.\n- –üo–¥–¥ep–∂–∫a 24/7.\n- –°—Ça–±–∏–ª—å–Ω—ã–π –¥o—Öo–¥\n\nE—Å–ª–∏ –∑a–∏–Ω—Çepe—Åo–≤a–ª–∏—Å—å - –ø–∏—à–∏—Çe –≤ –õ–°.,1,1
459,–ü—Ä–∏–≤–µ—Çc—Ç–≤y—é! –ò—âe–º –ª—é–¥e–π - –≥–æ—Ço–≤—ã—Ö –Ωa –≤–∑a–∏–ºo–≤—ã–≥o–¥–Ωo–º —Åo—Ç—Ä—É–¥–Ω–∏—á–µ—Å—Ç–≤–µ –ø–æ–ª—É—á–∞—Ç—å —Öo—Äo—à–∏–π –¥o–ø. –¥oxo–¥ –øo –Ω–∞—àe–ºy —Ñp–∏–ªa–Ωc –ø—Äoe–∫—Ç—É.\n–üo –≤o–øpo—Åa–º - –ø–∏—à–∏—Çe –≤ –ª–∏—á–Ω—ãe co–æ–±—âe–Ω–∏—è.,1,1


In [86]:
spam_pred = spam_results_df['predict']
spam_target = spam_results_df['target']

In [87]:
spam_target.value_counts()

target
0    3150
1     118
Name: count, dtype: int64

In [88]:
acc = accuracy_score(spam_target, spam_pred)
acc

0.9877600979192166

In [89]:
f1 = f1_score(spam_target, spam_pred)
f1

0.8095238095238095

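As the metrics show, this labeling method gives a good result: accuracy is about 0.988 and F1 is about 0.81. Because only 118 of the 3268 messages are spam, always predicting "not spam" would already reach an accuracy of about 0.964, so F1 is the more informative metric here.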
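
The notebook does not print the confusion matrix, but the reported numbers pin it down. An accuracy of 0.98776 on 3268 messages means 40 errors, and F1 = 2·TP / (2·TP + 40) = 0.80952 gives TP = 85. The sketch below recomputes the metrics from these back-solved counts; the variable names are illustrative.

```python
# Counts back-solved from the reported accuracy, F1, and class sizes
n_ham, n_spam = 3150, 118
tp = 85              # spam correctly flagged
fn = n_spam - tp     # 33 spam messages missed
fp = 40 - fn         # 7 ham messages flagged (40 errors in total)
tn = n_ham - fp      # 3143 ham messages passed

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
baseline = n_ham / total  # accuracy of always predicting "not spam"

print(f'accuracy={accuracy:.4f} precision={precision:.4f} '
      f'recall={recall:.4f} f1={f1:.4f} baseline={baseline:.4f}')
```

Read this way, the similarity-based labels are precise (about 92%) but miss roughly 28% of the spam.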

## Training model

We use the method `get_train_data` of the class `TextPreprocessing` to prepare the training data. For training, we take the pretrained spaCy model for Russian and train it on the new data.

In [90]:
spam_target = pd.DataFrame({'SPAM': spam_target})
spam_target.head()

Unnamed: 0,SPAM
1435,1
1996,1
490,1
275,1
459,1


In [31]:
%%time
train = tp.get_train_data(label_data=spam_target.to_dict('records'), pattern_list=None, to_disk='./train_data/spam/',
                          split=0.2, label=None, text_col=spam_results_df['text_repl'], stratify=spam_target.squeeze('columns'))

Using label_data (list or Series with special dict



Splitting data: TRAIN - 80.0%,  TEST - 20.0%
Training data locates:
./train_data/spam/train.spacy
./train_data/spam/dev.spacy
CPU times: total: 27.6 s
Wall time: 27.5 s
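
The `split=0.2` and `stratify` arguments suggest a stratified hold-out split, where the dev set keeps the same class proportions as the full dataset. How `get_train_data` does this internally is not shown, so the following is a plain-Python sketch of the general technique; `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_share=0.2, seed=0):
    """Return (train_idx, dev_idx) so that each class keeps roughly the same share."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, dev_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_dev = round(len(indices) * test_share)
        dev_idx.extend(indices[:n_dev])
        train_idx.extend(indices[n_dev:])
    return sorted(train_idx), sorted(dev_idx)


# Same class counts as the labeled dataset: 3150 ham (0) and 118 spam (1)
labels = [0] * 3150 + [1] * 118
train_idx, dev_idx = stratified_split(labels, test_share=0.2)

dev_spam = sum(labels[i] for i in dev_idx)
print(len(train_idx), len(dev_idx), dev_spam)
```

With so few spam examples, stratification matters: a purely random 20% split could leave the dev set with noticeably more or fewer spam messages than expected.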


In [32]:
%%time
# starting model training

spacy.cli.train.train("./config/config.cfg", "./TRAINED_MODEL/", overrides={"paths.train": "./train_data/spam/train.spacy", "paths.dev": "./train_data/spam/dev.spacy"})

✔ Created output directory: TRAINED_MODEL
ℹ Saving to output directory: TRAINED_MODEL
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'textcat_multilabel']
ℹ Frozen components: ['tok2vec', 'morphologizer', 'parser', 'senter', 'attribute_ruler', 'lemmatizer', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTC...  POS_ACC  MORPH_ACC  DEP_UAS  DEP_LAS  SENTS_P  SENTS_R  SENTS_F  LEMMA_ACC  ENTS_F  ENTS_P  ENTS_R  CATS_SCORE  SCORE 
---  ------  -------------  -------  ---------  -------  -------  -------  -------  -------  ---------  ------  ------  ------  ----------  ------
  0       0           0.15    98.61      98.61    98.33    98.33   100.00   100.00   100.00      98.61  100.00  100.00  100.00       60.38    0.93
  3    1000          20.47    98.61      98.61    98.33    98.33   100.0

## Model testing

In [91]:
# loading trained model

spam_model = spacy.load('./TRAINED_MODEL/spam_detector')

### Testing on the source data

The method `extract_cats` uses the trained model to score each text as spam or not.

In [92]:
tp_spam = TextPreprocessing(text_col=df['text_repl'], nlp=spam_model)

In [93]:
spam_res = tp_spam.extract_cats(df=True)


In [94]:
spam_res.sort_values('SPAM', ascending=False).head()

Unnamed: 0,text_col,SPAM
1418,–ø—Ä–∏–≤–µ—Ç—Å—Ç–≤—É—é! –∏—â–µ–º –ª—é–¥–µ–π - –≥–æ—Ç–æ–≤—ã—Ö –Ω–∞ –≤–∑–∞–∏–º–æ–≤—ã–≥–æ–¥–Ω–æ–º —Å–æ—Ç—Ä—É–¥–Ω–∏—á–µ—Å—Ç–≤–µ –ø–æ–ª—É—á–∞—Ç—å —Ö–æ—Ä–æ—à–∏–π –¥–æ–ø. –¥–æ—Ö–æ–¥ –ø–æ –Ω–∞—à–µ–º—É —Ñ—Ä–∏–ª–∞–Ω—Å –ø—Ä–æ–µ–∫—Ç—É.\n–ø–æ –≤–æ–ø—Ä–æ—Å–∞–º - –ø–∏—à–∏—Ç–µ –≤ –ª–∏—á–Ω—ã–µ —Å–æ–æ–±—â–µ–Ω–∏—è.,1.0
3187,"–Ω–∞–±–∏—Ä–∞—é –ª—é–¥–µ–π –≤ –∫–æ–º–∞–Ω–¥—É –¥–ª—è —Å–æ–≤–º–µ—Å—Ç–Ω–æ–≥–æ –∑–∞—Ä–∞–±–æ—Ç–∫–∞ \n–¥–æ—Ö–æ–¥ –≤ —Å—Ä–µ–¥–Ω–µ–º 800$ –≤ –Ω–µ–¥–µ–ª—é, —Ç—Ä–∞—Ç—è –¥–æ –¥–≤—É—Ö —á–∞—Å–æ–≤ –≤ –¥–µ–Ω—å\n–æ–±—É—á–∞–µ–º –≤—Å–µ—Ö –±–µ—Å–ø–ª–∞—Ç–Ω–æ\n–∏–Ω—Ç–µ—Ä–µ—Å–Ω–æ –ø–æ–ø—Ä–æ–±–æ–≤–∞—Ç—å? –æ—Ç–ø—Ä–∞–≤—å—Ç–µ –º–Ω–µ +",1.0
936,"—Ö–æ—á–µ—à—å –∏–∑–º–µ–Ω–∏—Ç—å —Å–≤–æ—é —Ñ–∏–Ω–∞–Ω—Å–æ–≤—É—é –∂–∏–∑–Ω—å —Å –º–∏–Ω–∏–º—É–º–æ–º —É—Å–∏–ª–∏–π? —É–∑–Ω–∞–π, –∫–∞–∫ –∑–∞—Ä–∞–±–∞—Ç—ã–≤–∞—Ç—å —Å –ø–∫, –Ω–µ –≤—ã—Ö–æ–¥—è –∏–∑ –¥–æ–º–∞! –Ω–∞–ø–∏—à–∏ + –≤ –ª—Å –∏ –æ—Ç–∫—Ä–æ–π –¥–ª—è —Å–µ–±—è –º–∏—Ä –≤–æ–∑–º–æ–∂–Ω–æ—Å—Ç–µ–π –∏ –≤—ã—Å–æ–∫–∏—Ö –¥–æ—Ö–æ–¥–æ–≤. –Ω–µ —É–ø—É—Å—Ç–∏ —à–∞–Ω—Å –Ω–∞ —Ñ–∏–Ω–∞–Ω—Å–æ–≤—É—é –Ω–µ–∑–∞–≤–∏—Å–∏–º–æ—Å—Ç—å!",1.0
935,"–ø—Ä–∏–≤–µ—Ç—Å—Ç–≤—É—é! –∏—â–µ–º –ª—é–¥–µ–π - –≥–æ—Ç–æ–≤—ã—Ö –Ω–∞ –≤–∑–∞–∏–º–æ–≤—ã–≥–æ–¥–Ω–æ–º —Å–æ—Ç—Ä—É–¥–Ω–∏—á–µ—Å—Ç–≤–µ –ø–æ–ª—É—á–∞—Ç—å —Ö–æ—Ä–æ—à–∏–π –¥–æ–ø. –¥–æ—Ö–æ–¥ –ø–æ –Ω–∞—à–µ–º—É —Ñ—Ä–∏–ª–∞–Ω—Å –ø—Ä–æ–µ–∫—Ç—É (—Å—Ñ–µ—Ä–∞ –∫—Ä–∏–ø—Ç–æ–≤–∞–ª—é—Ç, –Ω–µ –∑–∞–Ω–∏–º–∞–µ–º—Å—è —Ç–æ—Ä–≥–æ–≤–ª–µ–π, –∞—Ä–±–∏—Ç—Ä–∞–∂–µ–º –∏ —Ç.–¥)\n–ø–æ–º–æ–∂–µ–º —Ä–∞–∑–æ–±—Ä–∞—Ç—å—Å—è –Ω–∞ –ø—Ä–∞–∫—Ç–∏–∫–µ –µ—Å–ª–∏ –Ω–µ—Ç –æ–ø—ã—Ç–∞. \n–æ—Ç 20 –ª–µ—Ç. –Ω–µ–ø–æ–ª–Ω–∞—è –∑–∞–Ω—è—Ç–æ—Å—Ç—å.\n–ø–æ –≤–æ–ø—Ä–æ—Å–∞–º - –ø–∏—à–∏—Ç–µ –µ–º—É –≤ –ª.—Å: @–∫li–º_s–∞z–æni",1.0
934,"–ø—Ä–∏–≤–µ—Ç—Å—Ç–≤—É—é! –∏—â–µ–º –ª—é–¥–µ–π - –≥–æ—Ç–æ–≤—ã—Ö –Ω–∞ –≤–∑–∞–∏–º–æ–≤—ã–≥–æ–¥–Ω–æ–º —Å–æ—Ç—Ä—É–¥–Ω–∏—á–µ—Å—Ç–≤–µ –ø–æ–ª—É—á–∞—Ç—å —Ö–æ—Ä–æ—à–∏–π –¥–æ–ø. –¥–æ—Ö–æ–¥ –ø–æ –Ω–∞—à–µ–º—É —Ñ—Ä–∏–ª–∞–Ω—Å –ø—Ä–æ–µ–∫—Ç—É (—Å—Ñ–µ—Ä–∞ –∫—Ä–∏–ø—Ç–æ–≤–∞–ª—é—Ç, –Ω–µ –∑–∞–Ω–∏–º–∞–µ–º—Å—è —Ç–æ—Ä–≥–æ–≤–ª–µ–π, –∞—Ä–±–∏—Ç—Ä–∞–∂–µ–º –∏ —Ç.–¥)\n–ø–æ–º–æ–∂–µ–º —Ä–∞–∑–æ–±—Ä–∞—Ç—å—Å—è –Ω–∞ –ø—Ä–∞–∫—Ç–∏–∫–µ –µ—Å–ª–∏ –Ω–µ—Ç –æ–ø—ã—Ç–∞. \n–æ—Ç 20 –ª–µ—Ç. –Ω–µ–ø–æ–ª–Ω–∞—è –∑–∞–Ω—è—Ç–æ—Å—Ç—å.\n–ø–æ –≤–æ–ø—Ä–æ—Å–∞–º - –ø–∏—à–∏—Ç–µ –µ–º—É –≤ –ª.—Å: @–∫li–º_s–∞z–æni",1.0


In [95]:
spam_target = spam_results_df.sort_index()['target']
spam_pred = (spam_res['SPAM'] > 0.6).astype('int')

In [96]:
acc = accuracy_score(spam_target, spam_pred)
acc

0.9954100367197063

In [97]:
f1 = f1_score(spam_target, spam_pred)
f1

0.9382716049382717

The metrics are good, which was fully expected: the model is evaluated here on the source data that it was largely trained on.
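
The 0.6 cut-off applied to the `SPAM` score is a choice, not a given. One way to examine it is to sweep several thresholds and compare F1. The sketch below uses invented scores and labels purely for illustration; `f1_at_threshold` is a hypothetical helper, not part of the notebook.

```python
def f1_at_threshold(scores, targets, threshold):
    """F1 for the positive class when score > threshold counts as spam."""
    preds = [int(s > threshold) for s in scores]
    tp = sum(1 for p, t in zip(preds, targets) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, targets) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, targets) if p == 0 and t == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Invented scores and labels, for illustration only
scores = [0.99, 0.95, 0.70, 0.55, 0.40, 0.10, 0.05, 0.02]
targets = [1, 1, 1, 1, 0, 0, 0, 0]

for threshold in (0.3, 0.5, 0.6, 0.9):
    print(threshold, round(f1_at_threshold(scores, targets, threshold), 3))
```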

### Testing on new data

As a test dataset we use a set of generated texts that are very similar to the source texts but that the model has never seen.

In [98]:
new_data = pd.read_excel('generated_spam_test.xlsx')

In [99]:
new_data.head()

Unnamed: 0.1,Unnamed: 0,message_text
0,515,"–ü—Ä–∏–≤–µ—Ç –≤—Å–µ–º! –£ –º–µ–Ω—è –µ—Å—Ç—å –∫—É—Ä—Å—ã –ø–æ —Ç—Ä–µ–π–¥–∏–Ω–≥—É, –∫–æ—Ç–æ—Ä—ã–º–∏ —è –º–æ–≥—É –ø–æ–¥–µ–ª–∏—Ç—å—Å—è –∞–±—Å–æ–ª—é—Ç–Ω–æ –±–µ—Å–ø–ª–∞—Ç–Ω–æ. –ï—Å–ª–∏ –∫–æ–º—É-—Ç–æ –Ω—É–∂–Ω–æ, –¥–∞–π—Ç–µ –∑–Ω–∞—Ç—å"
1,570,"–í–Ω–∏–º–∞–Ω–∏–µ! –ù—É–∂–µ–Ω 1 —Å–æ—Ç—Ä—É–¥–Ω–∏–∫ –¥–ª—è —Ä–∞–±–æ—Ç—ã –Ω–∞ –¥–æ–º—É, –æ–ø–ª–∞—Ç–∞ –¥–æ—Å—Ç–æ–π–Ω–∞—è. –ü–æ–¥—Ä–æ–±–Ω–æ—Å—Ç–∏ –≤—ã—à–ª—é –≤ –ª–∏—á–Ω—ã–µ —Å–æ–æ–±—â–µ–Ω–∏—è."
2,561,–ó–∞—Ö–æ–¥–∏—Ç–µ –∫ –Ω–∞–º –∑–∞ –∑–∞—Ä–∞–±–æ—Ç–∫–æ–º! –û—Ç–ø—Ä–∞–≤—å—Ç–µ —Å–æ–æ–±—â–µ–Ω–∏–µ –∏ —É–∑–Ω–∞–π—Ç–µ –∫–∞–∫!
3,162,–í—Å–µ–º –ø—Ä–∏–≤–µ—Ç! –£ –Ω–∞—Å –µ—Å—Ç—å —Å–ø–æ—Å–æ–± –ø–æ–¥–Ω—è—Ç—å –∫—ç—à. –ú—ã –ø—Ä–µ–¥–æ—Å—Ç–∞–≤–ª—è–µ–º —ç—Ç–æ –∑–∞ –ø—Ä–æ—Ü–µ–Ω—Ç –æ—Ç —Å—É–º–º—ã. –†–∞–±–æ—Ç–∞ –ø—Ä–æ–≤–æ–¥–∏—Ç—Å—è –¥–∏—Å—Ç–∞–Ω—Ü–∏–æ–Ω–Ω–æ. –ó–∞–∏–Ω—Ç–µ—Ä–µ—Å–æ–≤–∞–Ω–Ω—ã–µ –º–æ–≥—É—Ç —Å–≤—è–∑–∞—Ç—å—Å—è —Å –Ω–∞–º–∏üëàüèº
4,616,"–ù–∞–π–¥–∏—Ç–µ –∏–Ω—Ç–∏–º–Ω—ã–µ —Ñ–æ—Ç–æ –¥–µ–≤—É—à–∫–∏, –∏—Å–ø–æ–ª—å–∑—É—è –∫–æ–¥ cdy382 –≤ –¢–µ–ª–µ–≥—Ä–∞–º–µ."


In [114]:
# replace look-alike characters, if any

new_data['text_repl'] = new_data['message_text'].apply(alphabet_replacer).str.lower()

In [101]:
# spam recognition

tp_spam = TextPreprocessing(text_col=new_data['text_repl'], nlp=spam_model)
spam_res = tp_spam.extract_cats(df=True)


In [38]:
spam_res

Unnamed: 0,text_col,SPAM
0,"–ø—Ä–∏–≤–µ—Ç –≤—Å–µ–º! —É –º–µ–Ω—è –µ—Å—Ç—å –∫—É—Ä—Å—ã –ø–æ —Ç—Ä–µ–π–¥–∏–Ω–≥—É, –∫–æ—Ç–æ—Ä—ã–º–∏ —è –º–æ–≥—É –ø–æ–¥–µ–ª–∏—Ç—å—Å—è –∞–±—Å–æ–ª—é—Ç–Ω–æ –±–µ—Å–ø–ª–∞—Ç–Ω–æ. –µ—Å–ª–∏ –∫–æ–º—É-—Ç–æ –Ω—É–∂–Ω–æ, –¥–∞–π—Ç–µ –∑–Ω–∞—Ç—å",0.998
1,"–≤–Ω–∏–º–∞–Ω–∏–µ! –Ω—É–∂–µ–Ω 1 —Å–æ—Ç—Ä—É–¥–Ω–∏–∫ –¥–ª—è —Ä–∞–±–æ—Ç—ã –Ω–∞ –¥–æ–º—É, –æ–ø–ª–∞—Ç–∞ –¥–æ—Å—Ç–æ–π–Ω–∞—è. –ø–æ–¥—Ä–æ–±–Ω–æ—Å—Ç–∏ –≤—ã—à–ª—é –≤ –ª–∏—á–Ω—ã–µ —Å–æ–æ–±—â–µ–Ω–∏—è.",1.0
2,–∑–∞—Ö–æ–¥–∏—Ç–µ –∫ –Ω–∞–º –∑–∞ –∑–∞—Ä–∞–±–æ—Ç–∫–æ–º! –æ—Ç–ø—Ä–∞–≤—å—Ç–µ —Å–æ–æ–±—â–µ–Ω–∏–µ –∏ —É–∑–Ω–∞–π—Ç–µ –∫–∞–∫!,0.938
3,–≤—Å–µ–º –ø—Ä–∏–≤–µ—Ç! —É –Ω–∞—Å –µ—Å—Ç—å —Å–ø–æ—Å–æ–± –ø–æ–¥–Ω—è—Ç—å –∫—ç—à. –º—ã –ø—Ä–µ–¥–æ—Å—Ç–∞–≤–ª—è–µ–º —ç—Ç–æ –∑–∞ –ø—Ä–æ—Ü–µ–Ω—Ç –æ—Ç —Å—É–º–º—ã. —Ä–∞–±–æ—Ç–∞ –ø—Ä–æ–≤–æ–¥–∏—Ç—Å—è –¥–∏—Å—Ç–∞–Ω—Ü–∏–æ–Ω–Ω–æ. –∑–∞–∏–Ω—Ç–µ—Ä–µ—Å–æ–≤–∞–Ω–Ω—ã–µ –º–æ–≥—É—Ç —Å–≤—è–∑–∞—Ç—å—Å—è —Å –Ω–∞–º–∏üëàüèº,0.0
4,"–Ω–∞–π–¥–∏—Ç–µ –∏–Ω—Ç–∏–º–Ω—ã–µ —Ñ–æ—Ç–æ –¥–µ–≤—É—à–∫–∏, –∏—Å–ø–æ–ª—å–∑—É—è –∫–æ–¥ —Å–¥—É382 –≤ —Ç–µ–ª–µ–≥—Ä–∞–º–µ.",0.001
5,–≥—Ä–∞–Ω–¥–∏–æ–∑–Ω—ã–µ –Ω–æ–≤–æ—Å—Ç–∏ - –ø—Ä–∏–≥–ª–∞—à–∞–µ–º –Ω–∞ —É–¥–∞–ª–µ–Ω–Ω—É—é —Ä–∞–±–æ—Ç—É –≤ —Ç–µs—Ç–Ω–µ—Ç! –±–µ—Å–ø–ª–∞—Ç–Ω–æ–µ –æ–±—É—á–µ–Ω–∏–µ –∏ –ø–µ—Ä–≤—ã–π –¥–æ—Ö–æ–¥ —É–∂–µ —á–µ—Ä–µ–∑ 30 –º–∏–Ω—É—Ç. –¥–æ—Ö–æ–¥ —Å–æ—Å—Ç–∞–≤–∏—Ç –æ—Ç 1035$/100000 —Ä—É–± –≤ –Ω–µ–¥–µ–ª—é! —É –Ω–∞—Å –ø–æ–ª–Ω–∞—è –ø—Ä–æ–∑—Ä–∞—á–Ω–æ—Å—Ç—å! üòã–µ—â–µ –¥–æ—Å—Ç—É–ø–Ω–æ 8 –º–µ—Å—Ç! –Ω–∞–±–æ—Ä –∏–¥–µ—Ç –¥–æ 25.2.2024. –±—É–¥—É —Ä–∞–¥ –≤–∏–¥–µ—Ç—å –≤–∞—Å –≤ –∫–æ–º–∞–Ω–¥–µ!ü•∞,0.932
6,"–ø—Ä–∏–≤–µ—Ç—Å—Ç–≤—É—é! –∏—â—É –ª–∏—á–Ω–æ—Å—Ç–µ–π, –∫–æ—Ç–æ—Ä—ã–µ —Ö–æ—Ç—è—Ç —Å—Ç–∞—Ç—å —á–∞—Å—Ç—å—é –Ω–∞—à–µ–π –∫–æ–º–∞–Ω–¥—ã –∏ —Ä–∞–±–æ—Ç–∞—Ç—å —Å –∫—Ä–∏–ø—Ç–æ–≤–∞–ª—é—Ç–∞–º–∏. –≤ –≤–∞—à–µ–º –¥–æ—Å—Ç—É–ø–µ –±—É–¥–µ—Ç –ø–æ—Å—Ç–æ—è–Ω–Ω—ã–π –¥–æ—Ö–æ–¥ –æ—Ç 1575$ –≤ –Ω–µ–¥–µ–ª—é, –ø—Ä–∏ –≤—Ä–µ–º–µ–Ω–Ω—ã—Ö –∑–∞—Ç—Ä–∞—Ç–∞—Ö –≤—Å–µ–≥–æ 2 —á–∞—Å–∞ –≤ –¥–µ–Ω—å! –∑–∞–∏–Ω—Ç–µ—Ä–µ—Å–æ–≤–∞–Ω—ã –∏ –≤–∞–º –Ω–µ—Ç 18 –ª–µ—Ç? –Ω–∞–ø–∏—à–∏—Ç–µ –º–Ω–µ!",0.997
7,"–ø—Ä–∏—ë–º –Ω–∞ —Ä–∞–±–æ—Ç—É! –Ω—É–∂–Ω—ã —Ç—Ä–∏ —á–µ–ª–æ–≤–µ–∫–∞, —Ä–∞–±–æ—Ç–∞ —á–µ—Ä–µ–∑ —Ç–µ–ª–µ—Ñ–æ–Ω –∏–ª–∏ –ø–∫. –≤–æ–∑–Ω–∞–≥—Ä–∞–∂–¥–µ–Ω–∏–µ –æ—Ç 500$. –ø–∏—à–∏—Ç–µ ""+"" –¥–ª—è –æ—Ç–∫–ª–∏–∫–∞.",0.998
8,–Ω–∞—à –ø—Ä–æ–µ–∫—Ç –æ—Ç–∫—Ä—ã–≤–∞–µ—Ç –≤–æ–∑–º–æ–∂–Ω–æ—Å—Ç–∏ –¥–ª—è –ø–æ–ª—É—á–µ–Ω–∏—è –¥–æ–ø. –¥–æ—Ö–æ–¥–∞ –Ω–∞ —É—Å–ª–æ–≤–∏—è—Ö –≤–∑–∞–∏–º–æ–≤—ã–≥–æ–¥–Ω–æ–≥–æ —Å–æ—Ç—Ä—É–¥–Ω–∏—á–µ—Å—Ç–≤–∞. –∏—â–µ–º –∞–∫—Ç–∏–≤–Ω—ã—Ö –ø–∞—Ä—Ç–Ω–µ—Ä–æ–≤! –≤–æ–ø—Ä–æ—Å—ã –º–æ–∂–µ—Ç–µ –∑–∞–¥–∞—Ç—å –≤ –ª–∏—á–Ω—ã—Ö —Å–æ–æ–±—â–µ–Ω–∏—è—Ö.,0.0
9,"–ø—Ä–∏–≤–µ—Ç—Å—Ç–≤—É—é! üëã —Ç—Ä–µ–±—É—é—Ç—Å—è –ø–∞—Ä—Ç–Ω–µ—Ä—ã –≤ –∫–æ–º–∞–Ω–¥—É. –æ—Ç –≤–∞—Å —Ç—Ä–µ–±—É–µ—Ç—Å—è –∏–Ω—Ç–µ—Ä–Ω–µ—Ç –∏ —Å–º–∞—Ä—Ç—Ñ–æ–Ω. –≤–æ–∑–º–æ–∂–Ω–æ—Å—Ç—å –∑–∞—Ä–∞–±–æ—Ç–∞—Ç—å 135-185 $¬†–≤ –¥–µ–Ω—å. –≤—Å–µ –¥–µ–π—Å—Ç–≤–∏—è –∑–∞–∫–æ–Ω–Ω—ã! –µ—Å–ª–∏ –≤–∞–º –∏–Ω—Ç–µ—Ä–µ—Å–Ω–æ, –ø–∏—à–∏—Ç–µ –≤ –ª—Å.",0.998


In [39]:
acc = accuracy_score(np.ones(len(spam_res)), (spam_res['SPAM'] > 0.6).astype('int'))
acc

0.6625

Since the test dataset contains only spam texts, the `f1` metric is not informative, so we calculate only `accuracy`. Here it equals the share of spam messages that were detected: 106 of 160.  
Unfortunately, the result is not good, which points to poor generalization of this model in spam detection. The most likely reason is the small number of spam examples.
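
When every true label is 1, accuracy and recall are the same number. The short sketch below spells this out; `recall_on_all_spam` is an illustrative name, and the 106/54 split is derived from the reported 0.6625 on 160 messages.

```python
def recall_on_all_spam(predictions):
    """With every true label equal to 1, accuracy and recall coincide."""
    return sum(predictions) / len(predictions)


# 106 detected and 54 missed reproduces the reported 0.6625 on 160 messages
predictions = [1] * 106 + [0] * 54
print(recall_on_all_spam(predictions))
```

A test set made only of spam cannot reveal false positives, so this number says nothing about how often normal messages would be flagged.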

## Training the model on the enriched dataset

In [103]:
800 / 3880

0.20618556701030927

I used ChatGPT and YandexGPT to generate new spam messages similar to the source messages.  
For this purpose I wrote a dedicated script that connects to their APIs.  
Before enrichment we had only 118 spam examples (3.6% of the whole dataset); after enrichment we have about 800 spam examples (20.6%).
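
The class shares can be checked with a couple of divisions. In this sketch `spam_share` is an illustrative helper; the "before" counts come from the `value_counts()` output of the verified labels, and the "after" counts are the approximate figures used in the `800 / 3880` cell.

```python
def spam_share(n_spam, n_total):
    """Fraction of messages labeled as spam."""
    return n_spam / n_total


before = spam_share(118, 3268)  # counts from the verified labels
after = spam_share(800, 3880)   # approximate counts after enrichment
print(f'before: {before:.1%}, after: {after:.1%}')
```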

In [104]:
enr_df = pd.read_excel('enriched_dataset.xlsx')

In [105]:
enr_df.head()

Unnamed: 0,message_text,target
0,"–í–∞—à —à–∞–Ω—Å –Ω–∞ –∑–∞—Ä–∞–±–æ—Ç–æ–∫ –æ–∂–∏–¥–∞–µ—Ç –≤–∞—Å! –ù–∞–ø–∏—à–∏—Ç–µ –Ω–∞–º –ø—Ä—è–º–æ —Å–µ–π—á–∞—Å, —á—Ç–æ–±—ã —É–∑–Ω–∞—Ç—å, –∫–∞–∫ –ø—Ä–∏—Å—Ç—É–ø–∏—Ç—å!",1
1,"–ò—â—É –∫–∞–Ω–¥–∏–¥–∞—Ç–æ–≤ –Ω–∞ —É–¥–∞–ª–µ–Ω–Ω—É—é —Ä–∞–±–æ—Ç—É, –¥–æ—Ö–æ–¥ –æ—Ç 230$ –≤ –¥–µ–Ω—å. –ó–∞–∏–Ω—Ç–µ—Ä–µ—Å–æ–≤–∞–ª–∏—Å—å? –ü–∏—à–∏—Ç–µ –º–Ω–µ –≤ –ª—Å –¥–ª—è –ø–æ–ª—É—á–µ–Ω–∏—è –¥–µ—Ç–∞–ª—å–Ω–æ–π –∏–Ω—Ñ–æ—Ä–º–∞—Ü–∏–∏üòä",1
2,"–ó–¥—Ä–∞–≤—Å—Ç–≤—É–π—Ç–µ! –ü—Ä–µ–¥–ª–∞–≥–∞–µ—Ç—Å—è —É–¥–∞–ª–µ–Ω–Ω–∞—è –ø–æ–¥—Ä–∞–±–æ—Ç–∫–∞, –≥–¥–µ –≤—ã –º–æ–∂–µ—Ç–µ –∑–∞—Ä–∞–±–æ—Ç–∞—Ç—å –æ—Ç 90$ –≤ –¥–µ–Ω—å –∏ –±–æ–ª—å—à–µ. –û–ø—ã—Ç –Ω–µ —Ç—Ä–µ–±—É–µ—Ç—Å—è. –ï—Å–ª–∏ –∏–Ω—Ç–µ—Ä–µ—Å–Ω–æ, —Å–æ–æ–±—â–∏—Ç–µ –æ–± —ç—Ç–æ–º –≤ –ª—Å —Å –ø–æ–º–æ—â—å—é +",1
3,–ï—Å—Ç—å –ø—Ä–µ–¥–ª–æ–∂–µ–Ω–∏–µ –¥–ª—è –ø–æ—Ç–µ–Ω—Ü–∏–∞–ª—å–Ω—ã—Ö –ø–∞—Ä—Ç–Ω–µ—Ä–æ–≤ –≤ –Ω–æ–≤—ã–π –ø—Ä–æ–µ–∫—Ç. –û–∂–∏–¥–∞–µ–º—ã–π –¥–æ—Ö–æ–¥ —Å–æ—Å—Ç–∞–≤–ª—è–µ—Ç –ø—Ä–∏–º–µ—Ä–Ω–æ 950$ –≤ –Ω–µ–¥–µ–ª—é. –ü–∏—à–∏—Ç–µ –≤ –ª—Å –¥–ª—è –±–æ–ª–µ–µ –ø–æ–¥—Ä–æ–±–Ω–æ–π –∏–Ω—Ñ–æ—Ä–º–∞—Ü–∏–∏.,1
4,"–ü—Ä–∏–≤–µ—Ç –≤—Å–µ–º! –£ –º–µ–Ω—è –µ—Å—Ç—å –º–∞—Ç–µ—Ä–∏–∞–ª—ã –∫—É—Ä—Å–∞ –ø–æ —Ç—Ä–µ–π–¥–∏–Ω–≥—É, –∫–æ—Ç–æ—Ä—ã–π —è –ø—Ä–æ—à—ë–ª –≥–æ–¥ –Ω–∞–∑–∞–¥. –ï—Å–ª–∏ –≤–∞–º –±—É–¥–µ—Ç –∏–Ω—Ç–µ—Ä–µ—Å–Ω–æ, –º–æ–≥—É –æ—Ç–ø—Ä–∞–≤–∏—Ç—å –≤–∞–º. –ñ–¥—É —Ç–æ–ª—å–∫–æ –≤–∞—à–µ —Å–ø–∞—Å–∏–±–æ –≤–∑–∞–º–µ–Ω.",1


In [107]:
enr_df['text_repl'] = enr_df['message_text'].apply(alphabet_replacer).str.lower()

In [108]:
spam_target = enr_df['target']

In [109]:
spam_tp2 = TextPreprocessing(text_col=enr_df['text_repl'], nlp=spam_model)

In [54]:
%%time

# obtaining new training data

train = spam_tp2.get_train_data(label_data=enr_df[['target']].to_dict(orient='records'), pattern_list=None, to_disk='./train_data/spam2/',
                          split=0.2, label=None, text_col=None, stratify=spam_target)

Using label_data (list or Series with special dict



Splitting data: TRAIN - 80.0%,  TEST - 20.0%
Training data locates:
./train_data/spam2/train.spacy
./train_data/spam2/dev.spacy
CPU times: total: 40.1 s
Wall time: 40.1 s


In [55]:
%%time

# training a model

spacy.cli.train.train("./config/config.cfg", "./TRAINED_MODEL/", overrides={"paths.train": "./train_data/spam2/train.spacy", "paths.dev": "./train_data/spam2/dev.spacy"})

ℹ Saving to output directory: TRAINED_MODEL
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'textcat_multilabel']
ℹ Frozen components: ['tok2vec', 'morphologizer', 'parser', 'senter', 'attribute_ruler', 'lemmatizer', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTC...  POS_ACC  MORPH_ACC  DEP_UAS  DEP_LAS  SENTS_P  SENTS_R  SENTS_F  LEMMA_ACC  ENTS_F  ENTS_P  ENTS_R  CATS_SCORE  SCORE 
---  ------  -------------  -------  ---------  -------  -------  -------  -------  -------  ---------  ------  ------  ------  ----------  ------
  0       0           0.06    98.91      98.91    98.69    98.69   100.00   100.00   100.00      98.91  100.00  100.00  100.00       51.38    0.91
  2    1000          24.76    98.91      98.91    98.69    98.69   100.00   100.00   100.00      98.91  100.00  100.00  100.00   

## Testing the trained model

In [110]:
spam_model = spacy.load('./TRAINED_MODEL/spam_detector2')

### Testing on new data

__We reuse the test dataset from the previous model.__

In [116]:
tp_spam = TextPreprocessing(text_col=new_data['text_repl'], nlp=spam_model)

In [117]:
spam_res = tp_spam.extract_cats(df=True)


In [118]:
spam_res

Unnamed: 0,text_col,target
0,"–ø—Ä–∏–≤–µ—Ç –≤—Å–µ–º! —É –º–µ–Ω—è –µ—Å—Ç—å –∫—É—Ä—Å—ã –ø–æ —Ç—Ä–µ–π–¥–∏–Ω–≥—É, –∫–æ—Ç–æ—Ä—ã–º–∏ —è –º–æ–≥—É –ø–æ–¥–µ–ª–∏—Ç—å—Å—è –∞–±—Å–æ–ª—é—Ç–Ω–æ –±–µ—Å–ø–ª–∞—Ç–Ω–æ. –µ—Å–ª–∏ –∫–æ–º—É-—Ç–æ –Ω—É–∂–Ω–æ, –¥–∞–π—Ç–µ –∑–Ω–∞—Ç—å",1.0
1,"–≤–Ω–∏–º–∞–Ω–∏–µ! –Ω—É–∂–µ–Ω 1 —Å–æ—Ç—Ä—É–¥–Ω–∏–∫ –¥–ª—è —Ä–∞–±–æ—Ç—ã –Ω–∞ –¥–æ–º—É, –æ–ø–ª–∞—Ç–∞ –¥–æ—Å—Ç–æ–π–Ω–∞—è. –ø–æ–¥—Ä–æ–±–Ω–æ—Å—Ç–∏ –≤—ã—à–ª—é –≤ –ª–∏—á–Ω—ã–µ —Å–æ–æ–±—â–µ–Ω–∏—è.",1.0
2,–∑–∞—Ö–æ–¥–∏—Ç–µ –∫ –Ω–∞–º –∑–∞ –∑–∞—Ä–∞–±–æ—Ç–∫–æ–º! –æ—Ç–ø—Ä–∞–≤—å—Ç–µ —Å–æ–æ–±—â–µ–Ω–∏–µ –∏ —É–∑–Ω–∞–π—Ç–µ –∫–∞–∫!,1.0
3,–≤—Å–µ–º –ø—Ä–∏–≤–µ—Ç! —É –Ω–∞—Å –µ—Å—Ç—å —Å–ø–æ—Å–æ–± –ø–æ–¥–Ω—è—Ç—å –∫—ç—à. –º—ã –ø—Ä–µ–¥–æ—Å—Ç–∞–≤–ª—è–µ–º —ç—Ç–æ –∑–∞ –ø—Ä–æ—Ü–µ–Ω—Ç –æ—Ç —Å—É–º–º—ã. —Ä–∞–±–æ—Ç–∞ –ø—Ä–æ–≤–æ–¥–∏—Ç—Å—è –¥–∏—Å—Ç–∞–Ω—Ü–∏–æ–Ω–Ω–æ. –∑–∞–∏–Ω—Ç–µ—Ä–µ—Å–æ–≤–∞–Ω–Ω—ã–µ –º–æ–≥—É—Ç —Å–≤—è–∑–∞—Ç—å—Å—è —Å –Ω–∞–º–∏üëàüèº,1.0
4,"–Ω–∞–π–¥–∏—Ç–µ –∏–Ω—Ç–∏–º–Ω—ã–µ —Ñ–æ—Ç–æ –¥–µ–≤—É—à–∫–∏, –∏—Å–ø–æ–ª—å–∑—É—è –∫–æ–¥ —Å–¥—É382 –≤ —Ç–µ–ª–µ–≥—Ä–∞–º–µ.",1.0
5,–≥—Ä–∞–Ω–¥–∏–æ–∑–Ω—ã–µ –Ω–æ–≤–æ—Å—Ç–∏ - –ø—Ä–∏–≥–ª–∞—à–∞–µ–º –Ω–∞ —É–¥–∞–ª–µ–Ω–Ω—É—é —Ä–∞–±–æ—Ç—É –≤ —Ç–µs—Ç–Ω–µ—Ç! –±–µ—Å–ø–ª–∞—Ç–Ω–æ–µ –æ–±—É—á–µ–Ω–∏–µ –∏ –ø–µ—Ä–≤—ã–π –¥–æ—Ö–æ–¥ —É–∂–µ —á–µ—Ä–µ–∑ 30 –º–∏–Ω—É—Ç. –¥–æ—Ö–æ–¥ —Å–æ—Å—Ç–∞–≤–∏—Ç –æ—Ç 1035$/100000 —Ä—É–± –≤ –Ω–µ–¥–µ–ª—é! —É –Ω–∞—Å –ø–æ–ª–Ω–∞—è –ø—Ä–æ–∑—Ä–∞—á–Ω–æ—Å—Ç—å! üòã–µ—â–µ –¥–æ—Å—Ç—É–ø–Ω–æ 8 –º–µ—Å—Ç! –Ω–∞–±–æ—Ä –∏–¥–µ—Ç –¥–æ 25.2.2024. –±—É–¥—É —Ä–∞–¥ –≤–∏–¥–µ—Ç—å –≤–∞—Å –≤ –∫–æ–º–∞–Ω–¥–µ!ü•∞,0.994
6,"–ø—Ä–∏–≤–µ—Ç—Å—Ç–≤—É—é! –∏—â—É –ª–∏—á–Ω–æ—Å—Ç–µ–π, –∫–æ—Ç–æ—Ä—ã–µ —Ö–æ—Ç—è—Ç —Å—Ç–∞—Ç—å —á–∞—Å—Ç—å—é –Ω–∞—à–µ–π –∫–æ–º–∞–Ω–¥—ã –∏ —Ä–∞–±–æ—Ç–∞—Ç—å —Å –∫—Ä–∏–ø—Ç–æ–≤–∞–ª—é—Ç–∞–º–∏. –≤ –≤–∞—à–µ–º –¥–æ—Å—Ç—É–ø–µ –±—É–¥–µ—Ç –ø–æ—Å—Ç–æ—è–Ω–Ω—ã–π –¥–æ—Ö–æ–¥ –æ—Ç 1575$ –≤ –Ω–µ–¥–µ–ª—é, –ø—Ä–∏ –≤—Ä–µ–º–µ–Ω–Ω—ã—Ö –∑–∞—Ç—Ä–∞—Ç–∞—Ö –≤—Å–µ–≥–æ 2 —á–∞—Å–∞ –≤ –¥–µ–Ω—å! –∑–∞–∏–Ω—Ç–µ—Ä–µ—Å–æ–≤–∞–Ω—ã –∏ –≤–∞–º –Ω–µ—Ç 18 –ª–µ—Ç? –Ω–∞–ø–∏—à–∏—Ç–µ –º–Ω–µ!",1.0
7,"–ø—Ä–∏—ë–º –Ω–∞ —Ä–∞–±–æ—Ç—É! –Ω—É–∂–Ω—ã —Ç—Ä–∏ —á–µ–ª–æ–≤–µ–∫–∞, —Ä–∞–±–æ—Ç–∞ —á–µ—Ä–µ–∑ —Ç–µ–ª–µ—Ñ–æ–Ω –∏–ª–∏ –ø–∫. –≤–æ–∑–Ω–∞–≥—Ä–∞–∂–¥–µ–Ω–∏–µ –æ—Ç 500$. –ø–∏—à–∏—Ç–µ ""+"" –¥–ª—è –æ—Ç–∫–ª–∏–∫–∞.",1.0
8,–Ω–∞—à –ø—Ä–æ–µ–∫—Ç –æ—Ç–∫—Ä—ã–≤–∞–µ—Ç –≤–æ–∑–º–æ–∂–Ω–æ—Å—Ç–∏ –¥–ª—è –ø–æ–ª—É—á–µ–Ω–∏—è –¥–æ–ø. –¥–æ—Ö–æ–¥–∞ –Ω–∞ —É—Å–ª–æ–≤–∏—è—Ö –≤–∑–∞–∏–º–æ–≤—ã–≥–æ–¥–Ω–æ–≥–æ —Å–æ—Ç—Ä—É–¥–Ω–∏—á–µ—Å—Ç–≤–∞. –∏—â–µ–º –∞–∫—Ç–∏–≤–Ω—ã—Ö –ø–∞—Ä—Ç–Ω–µ—Ä–æ–≤! –≤–æ–ø—Ä–æ—Å—ã –º–æ–∂–µ—Ç–µ –∑–∞–¥–∞—Ç—å –≤ –ª–∏—á–Ω—ã—Ö —Å–æ–æ–±—â–µ–Ω–∏—è—Ö.,1.0
9,"–ø—Ä–∏–≤–µ—Ç—Å—Ç–≤—É—é! üëã —Ç—Ä–µ–±—É—é—Ç—Å—è –ø–∞—Ä—Ç–Ω–µ—Ä—ã –≤ –∫–æ–º–∞–Ω–¥—É. –æ—Ç –≤–∞—Å —Ç—Ä–µ–±—É–µ—Ç—Å—è –∏–Ω—Ç–µ—Ä–Ω–µ—Ç –∏ —Å–º–∞—Ä—Ç—Ñ–æ–Ω. –≤–æ–∑–º–æ–∂–Ω–æ—Å—Ç—å –∑–∞—Ä–∞–±–æ—Ç–∞—Ç—å 135-185 $¬†–≤ –¥–µ–Ω—å. –≤—Å–µ –¥–µ–π—Å—Ç–≤–∏—è –∑–∞–∫–æ–Ω–Ω—ã! –µ—Å–ª–∏ –≤–∞–º –∏–Ω—Ç–µ—Ä–µ—Å–Ω–æ, –ø–∏—à–∏—Ç–µ –≤ –ª—Å.",1.0


In [119]:
acc = accuracy_score(np.ones(len(spam_res)), (spam_res['target'] > 0.6).astype('int'))
acc

0.99375

The metric is excellent (159 of 160 messages detected), and this model generalizes significantly better than the previous one.  
The spam filter for this chat is ready.
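
Because both the enrichment data and the test set consist of generated texts, it may be worth confirming that no test message also occurs in the training data. The sketch below shows one simple way to check for exact matches after light normalization; `normalize`, `overlap`, and the sample texts are hypothetical, and in the notebook the inputs would be `enr_df['text_repl']` and `new_data['text_repl']`.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return ' '.join(text.lower().split())


def overlap(train_texts, test_texts):
    """Return test texts whose normalized form also occurs in the training texts."""
    seen = {normalize(t) for t in train_texts}
    return [t for t in test_texts if normalize(t) in seen]


# Invented examples, for illustration only
train_texts = ['Write to me for details', 'Remote job, good pay']
test_texts = ['remote  job, good pay', 'Earn from home today']

leaked = overlap(train_texts, test_texts)
print(leaked)
```

An empty result would support reading the 0.99375 as a measure of generalization rather than memorization; near-duplicates would still need a fuzzier comparison.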