## Задание 5.1

Набор данных тут: https://github.com/sismetanin/rureviews, также есть в папке [Data](https://drive.google.com/drive/folders/1YAMe7MiTxA-RSSd8Ex2p-L0Dspe6Gs4L). Те, кто предпочитает работать с английским языком, могут использовать набор данных `sms_spam`.

Применим полученные навыки и решим задачу анализа тональности отзывов. 

Нужно повторить весь пайплайн от сырых текстов до получения обученной модели.

Обязательные шаги предобработки:
1. токенизация
2. приведение к нижнему регистру
3. удаление стоп-слов
4. лемматизация
5. векторизация (с настройкой гиперпараметров)
6. построение модели
7. оценка качества модели

Обязательно использование векторайзеров:
1. мешок n-грамм (диапазон для n подбирайте самостоятельно, запрещено использовать только униграммы).
2. tf-idf ((диапазон для n подбирайте самостоятельно, также нужно подбирать параметры max_df, min_df, max_features)
3. символьные n-граммы (диапазон для n подбирайте самостоятельно)

В качестве классификатора нужно использовать наивный байесовский классификатор. 

Для сравнения векторайзеров между собой используйте precision, recall, f1-score и accuracy. Для этого сформируйте датафрейм, в котором в строках будут разные векторайзеры, а в столбцах разные метрики качества, а в  ячейках будут значения этих метрик для соответсвующих векторайзеров.

In [None]:
import pandas as pd
import numpy as np
import string

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from sklearn.metrics import * 
from sklearn.model_selection import train_test_split

In [None]:
import pandas as pd
data = pd.read_csv('/content/drive/MyDrive/Data/sms_spam.csv', sep=",", usecols=[0, 1])
data.head(10)

Unnamed: 0,type,text
0,ham,Hope you are having a good week. Just checking in
1,ham,K..give back my thanks.
2,ham,Am also doing in cbe only. But have to pay.
3,spam,"complimentary 4 STAR Ibiza Holiday or £10,000 ..."
4,spam,okmail: Dear Dave this is your final notice to...
5,ham,Aiya we discuss later lar... Pick u up at 4 is...
6,ham,Are you this much buzy
7,ham,Please ask mummy to call father
8,spam,Marvel Mobile Play the official Ultimate Spide...
9,ham,"fyi I'm at usf now, swing by the room whenever"


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5559 entries, 0 to 5558
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   type    5559 non-null   object
 1   text    5559 non-null   object
dtypes: object(2)
memory usage: 87.0+ KB


In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
import nltk.data # библиотека Natural Language Toolkit
import re   # библиотека для регулярных выражений
import nltk
nltk.download('punkt')
from nltk import FreqDist
from nltk.corpus import stopwords
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
copy_data = data.copy()

In [None]:
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

def review_to_wordlist(review):
    sentences = []
    for symbol in string.punctuation:
      review = review.replace(symbol, " ")

    tokenized = word_tokenize(review.lower())
    stops = stopwords.words("english")
    lemmatizer = WordNetLemmatizer()

    processed_sentence = [lemmatizer.lemmatize((word), get_wordnet_pos(word)) for word in tokenized if word not in string.punctuation]
    processed_sentence = [w for w in processed_sentence if not w in stops]
    sentences.append(" ".join(processed_sentence))
    return(sentences)

In [None]:
sentences = []  # эта ячейка может выполняться довольно долго (примерно 2 минуты)

print("Parsing sentences from training set...")
for review in copy_data.text:
    sentences += review_to_wordlist(review)

Parsing sentences from training set...


In [None]:
print(len(sentences))
print(sentences[0])

5559
hope good week check


In [None]:
for i in range(0, 5559):
  copy_data.text[i] = sentences[i]

In [None]:
x_train, x_test, y_train, y_test = train_test_split(data.text, data.type, train_size = 0.7)

In [None]:
my_x_train, my_x_test, my_y_train, my_y_test = train_test_split(copy_data.text, copy_data.type, train_size = 0.7)

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import ngrams

In [None]:
vectorize_result = []

In [None]:
for min_n in range(1, 7):
  for max_n in range(min_n, 7):
        count_vectorizer = CountVectorizer(ngram_range=(min_n, max_n), analyzer="word")
        count_vectorizer_x_train = count_vectorizer.fit_transform(my_x_train)
        clf = MultinomialNB()
        clf.fit(count_vectorizer_x_train, y_train)
        vectorized_x_test = count_vectorizer.transform(my_x_test)
        pred = clf.predict(vectorized_x_test)
        print("ngram_range: (", min_n,',', max_n, ") analyzer:", str(count_vectorizer.analyzer))
        print(classification_report(y_test, pred, output_dict=False))
        vectorize_result.append((classification_report(y_test, pred, output_dict=True), count_vectorizer))


ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      0.99      0.93      1458
        spam       0.05      0.00      0.01       210

    accuracy                           0.86      1668
   macro avg       0.46      0.50      0.47      1668
weighted avg       0.77      0.86      0.81      1668

ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      0.99      0.93      1458
        spam       0.12      0.01      0.02       210

    accuracy                           0.87      1668
   macro avg       0.50      0.50      0.47      1668
weighted avg       0.78      0.87      0.81      1668

ngram_range: ( 1 , 3 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      0.99      0.93      1458
        spam       0.10      0.01      0.02       210

    accuracy                           0.86      1668
   macro avg  

In [None]:
for min_n in range(1, 3):
  for max_n in range(min_n, 3):
    for max_df in [0.1, 0.5, 0.99]:
      for min_df in [0, 0.01]:
        for max_features in [4000, 8000, 16000, 32000, 64000]:
          print(max_df, min_df, max_features)
          tfidf_vectorizer = TfidfVectorizer(ngram_range=(min_n, max_n), max_df=max_df, min_df=min_df, max_features=max_features)
          tfidf_vectorizer_x_train = tfidf_vectorizer.fit_transform(my_x_train)
          clf = MultinomialNB()
          clf.fit(tfidf_vectorizer_x_train, y_train)
          vectorized_x_test = tfidf_vectorizer.transform(my_x_test)
          pred = clf.predict(vectorized_x_test)
          print("ngram_range: (", min_n,',', max_n, ") analyzer:", str(tfidf_vectorizer.analyzer))
          print(classification_report(y_test, pred, output_dict=False))
          vectorize_result.append((classification_report(y_test, pred, output_dict=True), tfidf_vectorizer))


0.1 0 4000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


0.1 0 8000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


0.1 0 16000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.1 0 32000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.1 0 64000
ngram_range: ( 1 , 1 ) analyzer: word


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.1 0.01 4000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


0.1 0.01 8000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


0.1 0.01 16000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


0.1 0.01 32000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


0.1 0.01 64000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0 4000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0 8000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0 16000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0 32000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0 64000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0.01 4000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0.01 8000
ngram_range: ( 1 , 1 ) analyzer: word


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0.01 16000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


0.5 0.01 32000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0.01 64000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0 4000
ngram_range: ( 1 , 1 ) analyzer: word


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0 8000
ngram_range: ( 1 , 1 ) analyzer: word


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0 16000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


0.99 0 32000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0 64000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0.01 4000
ngram_range: ( 1 , 1 ) analyzer: word


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0.01 8000
ngram_range: ( 1 , 1 ) analyzer: word


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0.01 16000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


0.99 0.01 32000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0.01 64000
ngram_range: ( 1 , 1 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.1 0 4000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.81      1668

0.1 0 8000
ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.81      1668

0.1 0 16000
ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87  

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.1 0 64000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.1 0.01 4000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.1 0.01 8000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.1 0.01 16000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.1 0.01 32000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.1 0.01 64000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0 4000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.81      1668

0.5 0 8000
ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.81      1668

0.5 0 16000
ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87  

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0 64000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0.01 4000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0.01 8000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0.01 16000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0.01 32000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0.01 64000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0 4000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.81      1668

0.99 0 8000
ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.81      1668

0.99 0 16000
ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0 64000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0.01 4000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0.01 8000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0.01 16000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0.01 32000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0.01 64000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 1 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.1 0 4000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.33      0.00      0.01       210

    accuracy                           0.87      1668
   macro avg       0.60      0.50      0.47      1668
weighted avg       0.81      0.87      0.82      1668

0.1 0 8000
ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.81      1668

0.1 0 16000
ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87  

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.1 0.01 16000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.1 0.01 32000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.1 0.01 64000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0 4000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.33      0.00      0.01       210

    accuracy                           0.87      1668
   macro avg       0.60      0.50      0.47      1668
weighted avg       0.81      0.87      0.82      1668

0.5 0 8000
ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.81      1668

0.5 0 16000
ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87  

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0.01 16000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0.01 32000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.5 0.01 64000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0 4000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.33      0.00      0.01       210

    accuracy                           0.87      1668
   macro avg       0.60      0.50      0.47      1668
weighted avg       0.81      0.87      0.82      1668

0.99 0 8000
ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.81      1668

0.99 0 16000
ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0.01 16000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0.01 32000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668

0.99 0.01 64000


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


ngram_range: ( 2 , 2 ) analyzer: word
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93      1458
        spam       0.00      0.00      0.00       210

    accuracy                           0.87      1668
   macro avg       0.44      0.50      0.47      1668
weighted avg       0.76      0.87      0.82      1668



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
for min_n in range(1, 6):
  for max_n in range(min_n, 6):
    count_vectorizer = CountVectorizer(ngram_range=(min_n, max_n), analyzer="char")
    count_vectorizer_x_train = count_vectorizer.fit_transform(my_x_train)
    clf = MultinomialNB()
    clf.fit(count_vectorizer_x_train, y_train)
    vectorized_x_test = count_vectorizer.transform(my_x_test)
    pred = clf.predict(vectorized_x_test)
    print("ngram_range: (", min_n,',', max_n, ") analyzer:", str(count_vectorizer.analyzer))
    print(classification_report(y_test, pred, output_dict=False))
    vectorize_result.append((classification_report(y_test, pred, output_dict=True), count_vectorizer))


ngram_range: ( 1 , 1 ) analyzer: char
              precision    recall  f1-score   support

         ham       0.87      0.99      0.93      1458
        spam       0.11      0.00      0.01       210

    accuracy                           0.87      1668
   macro avg       0.49      0.50      0.47      1668
weighted avg       0.78      0.87      0.81      1668

ngram_range: ( 1 , 2 ) analyzer: char
              precision    recall  f1-score   support

         ham       0.87      0.95      0.91      1458
        spam       0.11      0.05      0.07       210

    accuracy                           0.83      1668
   macro avg       0.49      0.50      0.49      1668
weighted avg       0.78      0.83      0.80      1668

ngram_range: ( 1 , 3 ) analyzer: char
              precision    recall  f1-score   support

         ham       0.87      0.96      0.91      1458
        spam       0.06      0.02      0.03       210

    accuracy                           0.84      1668
   macro avg  

In [None]:
pattern = re.compile(r"[a-zA-Z]+Vectorizer")

In [None]:
raw_data = []
for vectorize in vectorize_result:
  metrics = vectorize[0]["weighted avg"]
  name = pattern.findall(str(vectorize[1]))
  accuracy_score = vectorize[0]["accuracy"]
  raw_data.append({"Vectorizer": name[0], "Analyzer": vectorize[1].analyzer, "Parameters": vectorize[1].ngram_range, "Precision": metrics["precision"], "Recall": metrics["recall"], "F1-Score": metrics['f1-score'], "Accuracy": accuracy_score, "Info": vectorize[1]})
data_result = pd.DataFrame(raw_data)
sorted = data_result.sort_values(by=["Accuracy"], ascending=False).head(3)
sorted

Unnamed: 0,Vectorizer,Analyzer,Parameters,Precision,Recall,F1-Score,Accuracy,Info
63,TfidfVectorizer,word,"(1, 2)",0.764052,0.874101,0.81538,0.874101,"TfidfVectorizer(max_df=0.5, max_features=16000..."
56,TfidfVectorizer,word,"(1, 2)",0.764052,0.874101,0.81538,0.874101,"TfidfVectorizer(max_df=0.1, max_features=4000,..."
67,TfidfVectorizer,word,"(1, 2)",0.764052,0.874101,0.81538,0.874101,"TfidfVectorizer(max_df=0.5, max_features=8000,..."


Accuracy у большинства векторайзеров составляет ~0.87, поэтому в топе выводит только один тип векторайзера

Лучшей моделью среди всех является TfidfVectorizer с гиперпараметрами max_df=0.1, max_features=64000, min_df=0, ngram_range=(1, 2)

In [None]:
count_vector = data_result.loc[(data_result['Vectorizer'] == 'CountVectorizer') & (data_result['Analyzer'] == 'word')]
tfidf_vector = data_result.loc[data_result['Vectorizer'] == 'TfidfVectorizer']
count_char_vector = data_result.loc[(data_result['Vectorizer'] == 'CountVectorizer') & (data_result['Analyzer'] == 'char')]

In [None]:
count_vector.sort_values(by=["Accuracy"], ascending=False).head(5)

Unnamed: 0,Vectorizer,Analyzer,Parameters,Precision,Recall,F1-Score,Accuracy,Info
20,CountVectorizer,word,"(6, 6)",0.785294,0.869305,0.815177,0.869305,"CountVectorizer(ngram_range=(6, 6))"
19,CountVectorizer,word,"(5, 6)",0.785424,0.866906,0.814978,0.866906,"CountVectorizer(ngram_range=(5, 6))"
18,CountVectorizer,word,"(5, 5)",0.779782,0.866906,0.813933,0.866906,"CountVectorizer(ngram_range=(5, 5))"
1,CountVectorizer,word,"(1, 2)",0.779782,0.866906,0.813933,0.866906,"CountVectorizer(ngram_range=(1, 2))"
15,CountVectorizer,word,"(4, 4)",0.778789,0.866307,0.813622,0.866307,"CountVectorizer(ngram_range=(4, 4))"


Лучшие модели среди n-грамм (CountVectorizer, 'word') с гиперпараметрами (6, 6), (5, 6), (5, 5)

In [None]:
tfidf_vector.sort_values(by=["Accuracy"], ascending=False).head(3)

Unnamed: 0,Vectorizer,Analyzer,Parameters,Precision,Recall,F1-Score,Accuracy,Info
21,TfidfVectorizer,word,"(1, 1)",0.764052,0.874101,0.81538,0.874101,"TfidfVectorizer(max_df=0.1, max_features=4000,..."
68,TfidfVectorizer,word,"(1, 2)",0.764052,0.874101,0.81538,0.874101,"TfidfVectorizer(max_df=0.5, max_features=16000..."
76,TfidfVectorizer,word,"(1, 2)",0.764052,0.874101,0.81538,0.874101,"TfidfVectorizer(max_df=0.99, max_features=4000..."


Лучшими моделями среди TF-IDF векторизации являются униграммы в сочетании с другими н-граммами с гиперпараметрами :

max_df=0.1, max_features=4000, min_df=0, ngram_range=(1, 1)

max_df=0.5, max_features=16000, min_df=0.01, ngram_range=(1, 2)

In [None]:
count_char_vector.sort_values(by=["Accuracy"], ascending=False).head(3)

Unnamed: 0,Vectorizer,Analyzer,Parameters,Precision,Recall,F1-Score,Accuracy,Info
111,CountVectorizer,char,"(1, 1)",0.777971,0.869904,0.814397,0.869904,CountVectorizer(analyzer='char')
115,CountVectorizer,char,"(1, 5)",0.775294,0.868705,0.813786,0.868705,"CountVectorizer(analyzer='char', ngram_range=(..."
119,CountVectorizer,char,"(2, 5)",0.7779,0.865707,0.813311,0.865707,"CountVectorizer(analyzer='char', ngram_range=(..."


Лучшие модели среди n-грамм (CountVectorizer, 'char') с гиперпараметрами (1, 5), (2, 5), (3, 3)

## Задание 5.2 Регулярные выражения

Регулярные выражения - способ поиска и анализа строк. Например, можно понять, какие даты в наборе строк представлены в формате DD/MM/YYYY, а какие - в других форматах. 

Или бывает, например, что перед работой с текстом, надо почистить его от своеобразного мусора: упоминаний пользователей, url и так далее.

Навык полезный, давайте в нём тоже потренируемся.

Для работы с регулярными выражениями есть библиотека **re**

In [None]:
import re

В регулярных выражениях, кроме привычных символов-букв, есть специальные символы:
* **?а** - ноль или один символ **а**
* **+а** - один или более символов **а**
* **\*а** - ноль или более символов **а** (не путать с +)
* **.** - любое количество любого символа

Пример:
Выражению \*a?b. соответствуют последовательности a, ab, abc, aa, aac НО НЕ abb!

Рассмотрим подробно несколько наиболее полезных функций:

### findall
возвращает список всех найденных непересекающихся совпадений.

Регулярное выражение **ab+c.**: 
* **a** - просто символ **a**
* **b+** - один или более символов **b**
* **c** - просто символ **c**
* **.** - любой символ


In [None]:
result = re.findall('ab+c.', 'abcdefghijkabcabcxabc') 
print(result)

['abcd', 'abca']


Вопрос на внимательность: почему нет abcx?

**Задание**: вернуть список первых двух букв каждого слова в строке, состоящей из нескольких слов.

In [None]:
result = re.findall(r'\b[а-яёА-ЯЁa-zA-z]{2}', 'Я вас любил: любовь еще, быть может.')
print(result)

['ва', 'лю', 'лю', 'ещ', 'бы', 'мо']


### split
разделяет строку по заданному шаблону


In [None]:
result = re.split(',', 'itsy, bitsy, teenie, weenie') 
print(result)

['itsy', ' bitsy', ' teenie', ' weenie']


можно указать максимальное количество разбиений

In [None]:
result = re.split(',', 'itsy, bitsy, teenie, weenie', maxsplit=2) 
print(result)

['itsy', ' bitsy', ' teenie, weenie']


**Задание**: разбейте строку, состоящую из нескольких предложений, по точкам, но не более чем на 3 предложения.

In [None]:
result = re.split('(?<=[.!?...])\s+', 'Я вас любил: любовь еще, быть может. В душе моей угасла не совсем. Но пусть она вас больше не тревожит. Я не хочу печалить вас ничем.', maxsplit=2) 
print(result)

['Я вас любил: любовь еще, быть может.', 'В душе моей угасла не совсем.', 'Но пусть она вас больше не тревожит. Я не хочу печалить вас ничем.']


### sub
ищет шаблон в строке и заменяет все совпадения на указанную подстроку

параметры: (pattern, repl, string)

In [None]:
result = re.sub('a', 'b', 'abcabc')
print (result)

bbcbbc


**Задание**: напишите регулярное выражение, которое позволит заменить все цифры в строке на "DIG".

In [None]:
result = re.sub('\d', 'DIG', '28 ноября 2022, время 1:22')
print (result)

DIGDIG ноября DIGDIGDIGDIG, время DIG:DIGDIG


**Задание**: напишите  регулярное выражение, которое позволит убрать url из строки.

In [None]:
result = re.sub('http[s]?://[\w.:?/#=-]+', '', 'Ссылка: https://colab.research.google.com/drive/10cCeZrd6ybB7r4wiNOsLvJ3-ElyrebiZ#scrollTo=KwNS9zt4WhAv конец ссылки.')
print (result)

Ссылка:  конец ссылки.


### compile
компилирует регулярное выражение в отдельный объект

In [None]:
# Пример: построение списка всех слов строки:
prog = re.compile('[А-Яа-яё\-]+')
prog.findall("Слова? Да, больше, ещё больше слов! Что-то ещё.")

['Слова', 'Да', 'больше', 'ещё', 'больше', 'слов', 'Что-то', 'ещё']

**Задание**: для выбранной строки постройте список слов, которые длиннее трех символов.

In [None]:
prog = re.compile("[А-Яа-яё\-+]{4,}")
result = prog.findall("Слова? Да, больше, ещё больше слов! Что-то ещё.") 
print(result)

['Слова', 'больше', 'больше', 'слов', 'Что-то']


**Задание**: вернуть список доменов (@gmail.com) из списка адресов электронной почты:

```
abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz
```

In [None]:
prog = re.compile("@[a-zA-Z]+\.[a-zA-Z]+")
result = prog.findall('abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz') 
print(result)

['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']
