[Colab link](https://colab.research.google.com/drive/1DM4VWBhL2gFP7VmesmNUN947QW3TwHv4?usp=sharing)

# Задачи:

(binarize, word counts, TFIDF)  
- tokenization (разбиение текста на слова)
- stemming (убрать суффиксы префиксы)
- lemmatization (привести слова к начальной форме)
- stemming + misspelling(обработать ошибки в словах)
- lemmatization + misspelling
- word2vec

n-grams(рассматривать слова как n связанных, рядом идущих)  
TF-IDF (от англ. TF — term frequency, IDF — inverse document frequency) — статистическая мера, используемая для оценки важности слова в контексте документа, являющегося частью коллекции документов или корпуса.
Вес некоторого слова пропорционален частоте употребления этого слова в документе и обратно пропорционален частоте употребления слова во всех документах коллекции.  
stop-words - шумовые слова, исключаются из рассмотрения

## Imports

In [1]:
!pip install autocorrect -q

In [43]:
import sklearn
import keras
import nltk
import pandas as pd
import numpy as np
import re
import codecs
from nltk.tokenize import RegexpTokenizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MultiLabelBinarizer
import nltk
from autocorrect import Speller
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
import warnings
warnings. filterwarnings('ignore')

#### NLTK

In [3]:
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

True

# Data cleaning

In [69]:
df_pos = pd.read_csv('data/processedPositive.csv')
df_neg = pd.read_csv('data/processedNegative.csv')
df_neut = pd.read_csv('data/processedNeutral.csv')


In [5]:
df_pos.T.head()

An inspiration in all aspects: Fashion
fitness
beauty and personality. :)KISSES TheFashionIcon
Apka Apna Awam Ka Channel Frankline Tv Aam Admi Production Please Visit Or Likes Share :)Fb Page :...
Beautiful album from the greatest unsung guitar genius of our time - and I've met the great backstage


In [70]:
def df_preparation(df: pd.DataFrame, target_value: int):
    df_new = df.T.reset_index().rename(columns={'index': 'tweet'})
    df_new['target'] = target_value
    return df_new

## Concatenate in one dataframe:  
- негативный 1
- нейтральный 2
- положительный 3

In [71]:
df_neg = df_preparation(df_neg, 1)
df_neut = df_preparation(df_neut, 2)
df_pos = df_preparation(df_pos, 3)

df = pd.concat([df_neg, df_neut, df_pos], ignore_index=True)

In [8]:
df.head()

Unnamed: 0,tweet,target
0,How unhappy some dogs like it though,1
1,talking to my over driver about where I'm goin...,1
2,Does anybody know if the Rand's likely to fall...,1
3,I miss going to gigs in Liverpool unhappy,1
4,There isnt a new Riverdale tonight ? unhappy,1


In [9]:
df.target.value_counts()

2    1570
3    1186
1    1117
Name: target, dtype: int64

## Clean the unnecessary characters and put words in lowercase
очистим от лишних символов и приведем в нижний регистр

In [72]:
def standardize_text(df, text_field):
    df[text_field] = df[text_field].str.replace(r"http", "")
    df[text_field] = df[text_field].str.replace(r'[^a-zA-Z ]', '', regex=True)
    df[text_field] = df[text_field].str.lower()
    return df

In [73]:
df = standardize_text(df, 'tweet')

# Data preparation

## Tokenize

разобъем твиты на токены

In [74]:
tokenizer = RegexpTokenizer(r'\w+')

df["tokens"] = df["tweet"].apply(tokenizer.tokenize)
df.head()

Unnamed: 0,tweet,target,tokens
0,how unhappy some dogs like it though,1,"[how, unhappy, some, dogs, like, it, though]"
1,talking to my over driver about where im going...,1,"[talking, to, my, over, driver, about, where, ..."
2,does anybody know if the rands likely to fall ...,1,"[does, anybody, know, if, the, rands, likely, ..."
3,i miss going to gigs in liverpool unhappy,1,"[i, miss, going, to, gigs, in, liverpool, unha..."
4,there isnt a new riverdale tonight unhappy,1,"[there, isnt, a, new, riverdale, tonight, unha..."


посмотрим подробнее на твиты

In [13]:
all_words = [word for tokens in df["tokens"] for word in tokens]
tweet_lengths = [len(tokens) for tokens in df["tokens"]]
VOCAB = sorted(list(set(all_words)))
print("%s words total, with a vocabulary size of %s" % (len(all_words), len(VOCAB)))
print("Max sentence length is %s" % max(tweet_lengths))

33208 words total, with a vocabulary size of 6378
Max sentence length is 30


## Vectorizer and splitter functions
Функции которые будут векторизировать и делить обработанный датасет

In [14]:
def binarize_tokens_and_split(df, tokens_column, target_column, test_size=0.2):
    '''
    Binarize, split stratified and return X_train, X_test, y_train, y_test
    '''
    X_train, X_test, y_train, y_test = train_test_split(df[tokens_column], df[target_column],
                                                        stratify=df[target_column], test_size=test_size, random_state=21)

    mlb = MultiLabelBinarizer()
    mlb.fit(X_train)
    X_train_bin = mlb.transform(X_train)
    X_test_bin = mlb.transform(X_test)
    # binarized = pd.DataFrame(data=mlb.fit_transform(df[tokens_column]), columns=mlb.classes_)
    return X_train_bin, X_test_bin, y_train, y_test


In [15]:
def count_words_and_split(df, tokens_column, target_column, test_size=0.2):
    '''
    Count vectorize, split stratified and return X_train, X_test, y_train, y_test
    '''
    list_corpus = df[tokens_column].apply(lambda x: ' '.join(x)).tolist()
    list_labels = df[target_column].tolist()

    X_train, X_test, y_train, y_test = train_test_split(list_corpus, list_labels, stratify=list_labels, test_size=test_size, random_state=21)

    count_vectorizer = CountVectorizer()
    X_train_counts = count_vectorizer.fit_transform(X_train)
    X_test_counts = count_vectorizer.transform(X_test)

    return X_train_counts, X_test_counts, y_train, y_test

In [16]:
def tfidf_and_split(df, tokens_column, target_column, test_size=0.2):
    '''
    tfidf vectorize, split stratified and return X_train, X_test, y_train, y_test
    '''
    list_corpus = df[tokens_column].apply(lambda x: ' '.join(x)).tolist()
    list_labels = df[target_column].tolist()

    X_train, X_test, y_train, y_test = train_test_split(list_corpus, list_labels, stratify=list_labels, test_size=test_size, random_state=21)
    tfidf = TfidfVectorizer()
    tfidf.fit(X_train)
    X_train_tfidf = tfidf.transform(X_train)
    X_test_tfidf = tfidf.transform(X_test)

    return X_train_tfidf, X_test_tfidf, y_train, y_test

## Functions that transform tweets in different ways
Функции которые преобразуют твиты разными способами

In [17]:
def lemming(df, tokens_column):
    lemmatizer = WordNetLemmatizer()
    return df[tokens_column].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

In [18]:
def stemming(df, tokens_column):
    stemmer = PorterStemmer()
    return df[tokens_column].apply(lambda x: [stemmer.stem(word) for word in x])

In [19]:
def stop_words_remove(df, tokens_column):
    stop_words = set(stopwords.words('english'))
    return df[tokens_column].apply(lambda x: [word for word in x if word not in stop_words])

In [20]:
def misspelling(df, tokens_column):
    speller = Speller(lang="en")
    return df[tokens_column].apply(lambda words: [speller(word) for word in words])

# ML

## Logistic Regression

In [21]:
clf = LogisticRegression(C=30.0, class_weight='balanced', solver='newton-cg',
                         multi_class='multinomial', n_jobs=-1, random_state=21)

In [22]:
results_logreg = pd.DataFrame({'preprocces':[], '0 or 1 if the word exists': [], 'word counts':[], 'TFIDF':[]})

In [23]:
def make_classification(prep_name, df, tokens_column, target_column, model):
    result = [prep_name]
    X_train, X_test, y_train, y_test = binarize_tokens_and_split(df, tokens_column, target_column)
    result.append(accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test)))

    X_train, X_test, y_train, y_test = count_words_and_split(df, tokens_column, target_column)
    result.append(accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test)))

    X_train, X_test, y_train, y_test = tfidf_and_split(df, tokens_column, target_column)
    result.append(accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test)))

    return result


In [24]:
%%time

results_logreg.loc[len(results_logreg)] = make_classification('just tok', df, 'tokens', 'target', clf)

df['stemmed'] = stemming(df, 'tokens')
results_logreg.loc[len(results_logreg)] = make_classification('stemming', df, 'stemmed', 'target', clf)

df['lemmed'] = lemming(df, 'tokens')
results_logreg.loc[len(results_logreg)] = make_classification('lemming', df, 'lemmed', 'target', clf)

df['misspelled'] = misspelling(df, 'tokens')
df['misspelled+stemmed'] = stemming(df, 'misspelled')
results_logreg.loc[len(results_logreg)] = make_classification('misspelled+stemmed', df, 'misspelled+stemmed', 'target', clf)

df['misspelled'] = misspelling(df, 'tokens')
df['misspelled+lemmed'] = lemming(df, 'misspelled')
results_logreg.loc[len(results_logreg)] = make_classification('misspelled+lemmed', df, 'misspelled+lemmed', 'target', clf)

df['stop_words_removed'] = stop_words_remove(df, 'tokens')
results_logreg.loc[len(results_logreg)] = make_classification('stop_words_remove', df, 'stop_words_removed', 'target', clf)

CPU times: user 4min 58s, sys: 2.56 s, total: 5min
Wall time: 6min 36s


In [25]:
results_logreg

Unnamed: 0,preprocces,0 or 1 if the word exists,word counts,TFIDF
0,just tok,0.901935,0.882581,0.886452
1,stemming,0.903226,0.885161,0.894194
2,lemming,0.895484,0.87871,0.889032
3,misspelled+stemmed,0.894194,0.87871,0.892903
4,misspelled+lemmed,0.891613,0.87871,0.887742
5,stop_words_remove,0.88,0.883871,0.889032


## Random Forest

In [33]:
rf = RandomForestClassifier(min_samples_leaf=2, n_estimators=200, random_state=21)

In [24]:
results_rf = pd.DataFrame({'preprocces':[], '0 or 1 if the word exists': [], 'word counts':[], 'TFIDF':[]})

In [28]:
%%time

results_rf.loc[len(results_rf)] = make_classification('just tok', df, 'tokens', 'target', rf)

df['stemmed'] = stemming(df, 'tokens')
results_rf.loc[len(results_rf)] = make_classification('stemming', df, 'stemmed', 'target', rf)

df['lemmed'] = lemming(df, 'tokens')
results_rf.loc[len(results_rf)] = make_classification('lemming', df, 'lemmed', 'target', rf)

df['misspelled'] = misspelling(df, 'tokens')
df['misspelled+stemmed'] = stemming(df, 'misspelled')
results_rf.loc[len(results_rf)] = make_classification('misspelled+stemmed', df, 'misspelled+stemmed', 'target', rf)

df['misspelled'] = misspelling(df, 'tokens')
df['misspelled+lemmed'] = lemming(df, 'misspelled')
results_rf.loc[len(results_rf)] = make_classification('misspelled+lemmed', df, 'misspelled+lemmed', 'target', rf)

df['stop_words_removed'] = stop_words_remove(df, 'tokens')
results_rf.loc[len(results_rf)] = make_classification('stop_words_remove', df, 'stop_words_removed', 'target', rf)

CPU times: user 6min 49s, sys: 1.46 s, total: 6min 50s
Wall time: 6min 58s


In [29]:
results_rf

Unnamed: 0,preprocces,0 or 1 if the word exists,word counts,TFIDF
0,just tok,0.899355,0.885161,0.885161
1,stemming,0.894194,0.886452,0.886452
2,lemming,0.892903,0.885161,0.88
3,misspelled+stemmed,0.895484,0.883871,0.891613
4,misspelled+lemmed,0.890323,0.885161,0.88
5,stop_words_remove,0.88,0.87871,0.88129


## word2vec

In [28]:
def word2vec(df: pd.DataFrame, tokens_column: str, target_column: str, vector_size: int = 80, min_count: int = 1,
                 window: int = 5, epochs: int = 80) -> pd.DataFrame:
    """The method return a "word2vec" column that contains the word2vec representation,
    text must be misspelled"""

    sentences = [text for text in df[tokens_column]]
    model = Word2Vec(sentences, vector_size=vector_size, min_count=min_count,
                     window=window, epochs=epochs)

    word2vec_vectors = []
    for text in sentences:
        if len(text) > 0:
            vector = sum(model.wv[word] for word in text if word in model.wv) / len(text)
        else:
            vector = [0] * vector_size
        word2vec_vectors.append(vector)

    df["word2vec"] = word2vec_vectors
    list_corpus = df["word2vec"].tolist()
    list_labels = df[target_column].tolist()

    X_train, X_test, y_train, y_test = train_test_split(list_corpus, list_labels, stratify=list_labels, test_size=0.2, random_state=21)

    return X_train, X_test, y_train, y_test

In [30]:
df['misspelled'] = misspelling(df, 'tokens')
X_train, X_test, y_train, y_test = word2vec(df, 'misspelled', 'target')


In [31]:
accuracy_score(y_test, clf.fit(X_train, y_train).predict(X_test))

0.8851612903225806

In [34]:
accuracy_score(y_test, rf.fit(X_train, y_train).predict(X_test))

0.8554838709677419

## similar tweets

In [85]:
df['misspelled'] = misspelling(df, 'tokens')
df['ready'] = lemming(df, 'misspelled').drop_duplicates(keep='first')
df['ready'] = df['ready'].dropna().apply(lambda x: ' '.join(x))

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df["ready"].dropna())
similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)

similar_tweets_amount = 10
for t, tweet in enumerate(df["ready"]):
    if t+1 > similar_tweets_amount:
        break
    similar_indices = similarity_matrix[t].argsort()[: -2 - 1: -1][1: ]
    print(t+1, '.     ', tweet, '\nsimilar:', df.iloc[0]["ready"], '\n')

1 .      how unhappy some dog like it though 
similar: how unhappy some dog like it though 

2 .      talking to my over driver about where im going said hed love to go to new york too but since trump it probably not 
similar: how unhappy some dog like it though 

3 .      doe anybody know if the hand likely to fall against the dollar i got some money i need to change into r but it keep getting stronger unhappy 
similar: how unhappy some dog like it though 

4 .      i miss going to gig in liverpool unhappy 
similar: how unhappy some dog like it though 

5 .      there isnt a new riverdale tonight unhappy 
similar: how unhappy some dog like it though 

6 .      it that ady guy from pop asia and then the translator so they probe go with them around au unhappy 
similar: how unhappy some dog like it though 

7 .      who that chair youre sitting in is this how i find out everyone know now youve shared me in pu 
similar: how unhappy some dog like it though 

8 .      dont like how battery 

# Best scores:
- logreg : stemming	0.903226
- random forest: misspelled+stemmed	0.895484