<h2>Stemming</h2>

<h4>Stemming is a form of word normalization. Normalization maps the different surface forms of a word to one common form, so that fewer distinct terms have to be stored and looked up. Words that share a meaning but differ in form depending on the context or sentence are reduced to the same root. For example, the root word "eat" has the variations "eats", "eating", "eaten" and so on. With stemming we can find the root of any such variation. The stem does not have to be a dictionary word: the output below contains stems such as 'lik' and 'driv'.</h4>
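<h4>To make the idea concrete, here is a minimal sketch of suffix stripping. It is a toy illustration only, not the Lancaster algorithm used later in this notebook; the suffix list and the `toy_stem` helper are assumptions made up for this example.</h4>

```python
# Toy suffix-stripping stemmer (hypothetical helper, NOT the Lancaster algorithm).
SUFFIXES = ("ing", "en", "s")  # assumed, deliberately tiny suffix list


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


variants = ["eat", "eats", "eating", "eaten"]
stems = {word: toy_stem(word) for word in variants}
print(stems)  # every variant maps to 'eat'
```

<h4>Real stemmers such as `LancasterStemmer` apply a much larger rule table, which is why they sometimes cut words more aggressively than this sketch would.</h4>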

<h4>Import the libraries</h4>

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
from collections import Counter
nltk.download('punkt')
from nltk.stem import LancasterStemmer
from model_selction import model_selection_word_count, model_selection_word_exist, model_selection_tfidf

[nltk_data] Downloading package stopwords to /home/val/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/val/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
%ls data

[0m[01;32mmisspel.csv[0m*     [01;32mprocessedNegative.csv[0m*  [01;32mprocessedPositive.csv[0m*
[01;32mp00_tweets.zip[0m*  [01;32mprocessedNeutral.csv[0m*


<h4>As an example, consider the contents of the file 'processedNegative.csv' after stemming</h4>

In [3]:
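# The tweets are read as the CSV column names, so transpose and move them into a regular column
neg_df = pd.read_csv('data/processedNegative.csv').T.reset_index()
neg_text = " ".join([tweet[0] for tweet in neg_df.values.tolist()])
# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
neg_tokens = [word for word in word_tokenize(neg_text) if word not in stop_words]
ls = LancasterStemmer()
neg_stem = [ls.stem(word) for word in neg_tokens]
neg_stem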

['how',
 'unhappy',
 'dog',
 'lik',
 'though',
 'talk',
 'driv',
 'i',
 "'m",
 'goingh',
 'said',
 "'d",
 'lov',
 'go',
 'new',
 'york',
 'sint',
 'trump',
 "'s",
 'prob',
 'doe',
 'anybody',
 'know',
 'rand',
 "'s",
 'lik',
 'fal',
 'doll',
 '?',
 'i',
 'got',
 'money',
 'i',
 'nee',
 'chang',
 'r',
 'keep',
 'get',
 'stronger',
 'unhappy',
 'i',
 'miss',
 'going',
 'gig',
 'liverpool',
 'unhappy',
 'ther',
 'isnt',
 'new',
 'riverd',
 'tonight',
 '?',
 'unhappy',
 "'s",
 'a',
 '*',
 'dy',
 'guy',
 'pop',
 'as',
 'transl',
 "'ll",
 'prob',
 'go',
 'around',
 'au',
 'unhappy',
 'who',
 "'s",
 'chair',
 "'re",
 'sit',
 '?',
 'is',
 'i',
 'find',
 '.',
 'everyon',
 'know',
 '.',
 'you',
 "'ve",
 'sham',
 'pu',
 "n't",
 'lik',
 'jittery',
 'caffein',
 'mak',
 'sad',
 'my',
 'are',
 "'s",
 'list',
 'unhappy',
 'think',
 'i',
 "'ll",
 'go',
 'libdem',
 'anyway',
 'i',
 'want',
 'fun',
 'plan',
 'weekend',
 'unhappy',
 'when',
 'not',
 '.',
 'unhappy',
 '?',
 'ahhhhh',
 '!',
 'you',
 'recogn
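<h4>Stems such as 'how', 'i' and 'my' survive in the output above even though they are English stop words. A likely explanation (an assumption, since the raw tweets are not shown here) is that the tokens were capitalised in the source and the stop-word comparison is case-sensitive. The sketch below shows the difference on a small made-up token list with a hypothetical stop-word subset.</h4>

```python
# Hypothetical subset of a lowercase stop-word list, for illustration only.
STOP_WORDS = {"how", "i", "my", "is", "the"}

# Made-up tokens; capitalisation mimics the start of a sentence.
tokens = ["How", "unhappy", "I", "miss", "my", "dog"]

# Case-sensitive check: capitalised stop words slip through.
case_sensitive = [t for t in tokens if t not in STOP_WORDS]

# Lowercasing before the check removes them as well.
case_insensitive = [t for t in tokens if t.lower() not in STOP_WORDS]

print(case_sensitive)
print(case_insensitive)
```

<h4>Whether to lowercase before filtering is a design choice: it shrinks the vocabulary, but it also changes the counts that the later cells compute.</h4>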

<h4>A function that builds the dataset for training the models</h4>

In [8]:
def stem_file_to_df(file_name):
    neg_fn, neut_fn, pos_fn = file_name

    neg_df = pd.read_csv(neg_fn).T.reset_index()
    neut_df = pd.read_csv(neut_fn).T.reset_index()
    pos_df = pd.read_csv(pos_fn).T.reset_index()
    
    neg_text = " ".join([tweet[0] for tweet in neg_df.values.tolist()])
    neut_text = " ".join([tweet[0] for tweet in neut_df.values.tolist()])
    pos_text = " ".join([tweet[0] for tweet in pos_df.values.tolist()])

    ls = LancasterStemmer()
    # Load the stop-word list once; a set makes each membership test cheap
    stop_words = set(stopwords.words('english'))

    # Count how often each stem occurs in every class
    neg_words = Counter([ls.stem(word) for word in word_tokenize(neg_text) if word not in stop_words])
    neut_words = Counter([ls.stem(word) for word in word_tokenize(neut_text) if word not in stop_words])
    pos_words = Counter([ls.stem(word) for word in word_tokenize(pos_text) if word not in stop_words])
    
    unic_words = list(set(neg_words.keys()) | set(neut_words.keys()) | set(pos_words.keys()))

    neg_exist_index = 0
    neut_exist_index = 1
    pos_exist_index = 2
    neg_count_index = 3
    neut_count_index = 4
    pos_count_index = 5
    word_count_index = 6
    neg_tfidf_index = 7
    neut_tfidf_index = 8
    pos_tfidf_index = 9

    df = np.zeros((len(unic_words), 10))
    for i, word in enumerate(unic_words):
        if word in neg_words.keys():
            df[i,neg_exist_index] = 1
            df[i,neg_count_index] = neg_words[word]
        if word in neut_words.keys():
            df[i,neut_exist_index] = 1
            df[i,neut_count_index] = neut_words[word]
        if word in pos_words.keys():
            df[i,pos_exist_index] = 1
            df[i,pos_count_index] = pos_words[word]

    # Total occurrences of each stem across the three classes
    df[:,word_count_index] = df[:,neg_count_index] + df[:,neut_count_index] + df[:,pos_count_index]
    # The "TFIDF" columns hold each class's share of that total (not a classic TF-IDF weight)
    df[:,neg_tfidf_index] = df[:,neg_count_index] / df[:,word_count_index]
    df[:,neut_tfidf_index] = df[:,neut_count_index] / df[:,word_count_index]
    df[:,pos_tfidf_index] = df[:,pos_count_index] / df[:,word_count_index]

    stem_df = pd.DataFrame(df, columns=[
        'Negative', 'Neutral', 'Positive',
        'Negative counts', 'Neutral counts', 'Positive counts', 'Word counts',
        'Negative TFIDF', 'Neutral TFIDF', 'Positive TFIDF'])
    stem_df["word"] = unic_words
    return stem_df, unic_words

<h4>Check the names of the other files that contain the source dataset</h4>

In [5]:
%ls data

[0m[01;32mmisspel.csv[0m*     [01;32mprocessedNegative.csv[0m*  [01;32mprocessedPositive.csv[0m*
[01;32mp00_tweets.zip[0m*  [01;32mprocessedNeutral.csv[0m*


<h4>Create the training dataset</h4>

In [9]:
file_names = ('data/processedNegative.csv', 'data/processedNeutral.csv', 'data/processedPositive.csv')
stem_df, unic_words = stem_file_to_df(file_names)
stem_df

Unnamed: 0,Negative,Neutral,Positive,Negative counts,Neutral counts,Positive counts,Word counts,Negative TFIDF,Neutral TFIDF,Positive TFIDF,word
0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.00,0.00,remov
1,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.00,0.00,tor
2,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.00,0.00,mushroom
3,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.00,0.00,dynam
4,0.0,1.0,1.0,0.0,6.0,2.0,8.0,0.0,0.75,0.25,company
...,...,...,...,...,...,...,...,...,...,...,...
5041,0.0,1.0,0.0,0.0,2.0,0.0,2.0,0.0,1.00,0.00,guj
5042,0.0,1.0,0.0,0.0,3.0,0.0,3.0,0.0,1.00,0.00,edappad
5043,0.0,0.0,1.0,0.0,0.0,2.0,2.0,0.0,0.00,1.00,minecraft
5044,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.00,0.00,standoff
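
<h4>The `company` row above can be reproduced by hand, which also shows what the "TFIDF" columns contain: each class's share of the stem's total count. The sketch below mirrors the arithmetic of `stem_file_to_df` using the counts from that row; the `class_shares` helper is a name introduced only for this example.</h4>

```python
from collections import Counter

# Counts taken from the `company` row: 0 negative, 6 neutral, 2 positive.
neg_words = Counter()
neut_words = Counter({"company": 6})
pos_words = Counter({"company": 2})


def class_shares(word):
    """Return the (negative, neutral, positive) share of a stem's total count."""
    counts = (neg_words[word], neut_words[word], pos_words[word])
    total = sum(counts)  # the "Word counts" column
    return tuple(count / total for count in counts)


shares = class_shares("company")
print(shares)  # (0.0, 0.75, 0.25)
```

<h4>A `Counter` returns 0 for a missing key, so the negative count needs no special case here.</h4>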


<h4>Check the resulting model accuracy</h4>

In [10]:
word_exist_accuracy_score = model_selection_word_exist(stem_df, unic_words)
word_count_accuracy_score = model_selection_word_count(stem_df, unic_words)
tfidf_accuracy_score = model_selection_tfidf(stem_df, unic_words)



In [11]:
print(f"""Accuracy score by word exist: {word_exist_accuracy_score}
Accuracy score by word count: {word_count_accuracy_score}
Accuracy score by tfidf: {tfidf_accuracy_score}""")

Accuracy score by word exist: 0.4871287128712871
Accuracy score by word count: 0.9544554455445544
Accuracy score by tfidf: 0.49306930693069306
