## Задание

Взять ноутбук `colab_text_classification_part1.ipynb` который разбирали на занятии и добавить пункты которые мы пропустили 

1. Посмотрите на токены если будут мусорные добавьте их в стоп слова и обучите заново

2. Проверьте изменилось ли качество при лемматизации/и без неё

3. Замените все токены которые принадлежат сущностям на их тег. Проверьте изменилось ли качество после этого

---

In [1]:
import pickle
import re
import spacy
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm

from utils import apostrophe_dict, emoticon_dict, short_word_dict
from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')

tqdm.pandas()

---

### Классификация текстов

In [2]:
import pandas as pd

train_df = pd.read_csv("train.tsv", delimiter="\t")
test_df = pd.read_csv("test.tsv", delimiter="\t")

print('Train size = {}'.format(len(train_df)))
print('Test size = {}'.format(len(test_df)))

Train size = 25000
Test size = 25000


In [3]:
train_df.head(3)

Unnamed: 0,is_positive,review
0,0,"Dreamgirls, despite its fistful of Tony wins i..."
1,0,This show comes up with interesting locations ...
2,1,I simply love this movie. I also love the Ramo...


In [4]:
positive_words = 'love', 'great', 'best', 'wonderful' 
negative_words = 'worst', 'awful', '1/10', 'crap' 

positives_count = test_df.review.apply(lambda text: sum(word in text for word in positive_words))
negatives_count = test_df.review.apply(lambda text: sum(word in text for word in negative_words))
is_positive = positives_count > negatives_count
correct_count = (is_positive == test_df.is_positive).values.sum()

accuracy = correct_count / len(test_df)

print('Test accuracy = {:.2%}'.format(accuracy))
if accuracy > 0.71:
    from IPython.display import Image, display
    display(Image('https://s3.amazonaws.com/achgen360/t/rmmoZsub.png', width=500))

Test accuracy = 66.73%


In [5]:
pattern = re.compile('<br />')

print(pattern.subn(' ', train_df['review'].iloc[3])[0])

Spoilers ahead if you want to call them that...  I would almost recommend this film just so people can truly see a 1/10. Where to begin, we'll start from the top...  THE STORY: Don't believe the premise - the movie has nothing to do with abandoned cars, and people finially understanding what the mysterious happenings are. It's a draub, basic, go to cabin movie with no intensity or "effort".  THE SCREENPLAY: I usually give credit to indie screenwriters, it's hard work when you are starting out...but this is crap. The story is flat - it leaves you emotionless the entire movie. The dialogue is extremely weak and predictable boasting lines of "Woah, you totally freaked me out" and "I was wondering if you'd uh...if you'd like to..uh, would you come to the cabin with me?". It makes me want to rip out all my hair, one strand at a time and feed it to myself.  THE CHARACTERS: HOLY CRAP!!!! Some have described the characters as flat, I want to take it one step further and say that they actually 

In [6]:
train_df['review'] = train_df['review'].apply(lambda text: pattern.subn(' ', text)[0])
test_df['review'] = test_df['review'].apply(lambda text: pattern.subn(' ', text)[0])

In [7]:
def replase_words(text,dict_): 
    output = ''
    for word in text.split(' '):
        word = word.strip()
        if word in dict_.keys(): 
            output += ' ' + dict_[word]
        else:
            output += ' ' + word
    return output

In [8]:
def clean_text(text):
    text = re.sub("@[\w]*","",text)
    text = replase_words(text, emoticon_dict)
    text = replase_words(text, apostrophe_dict)
    text = replase_words(text, short_word_dict)
    text = re.sub("[^\w\s]"," ",text)
    text = re.sub("[^a-zA-Z0-9\_]"," ",text)
    return text

In [9]:
train_df['review'] = train_df['review'].apply(lambda x: clean_text(x))
test_df['review'] = test_df['review'].apply(lambda x: clean_text(x))

---

In [10]:
train_df

Unnamed: 0,is_positive,review
0,0,Dreamgirls despite its fistful of Tony win...
1,0,This show comes up with interesting locatio...
2,1,I simply love this movie I also love the R...
3,0,Spoilers ahead if you want to call them tha...
4,1,My all time favorite movie I have seen man...
...,...,...
24995,1,I am a big fan of the movie but not for th...
24996,0,I m not going to bother with a plot synopsi...
24997,0,This movie I do not know Why they wo...
24998,1,Saw this film on DVD yesterday and was gob ...


In [11]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

dummy_data = ['The movie was excellent',
              'the movie was awful']

dummy_matrix = vectorizer.fit_transform(dummy_data)

print(dummy_matrix.toarray())
print(vectorizer.get_feature_names())

[[0 1 1 1 1]
 [1 0 1 1 1]]
['awful', 'excellent', 'movie', 'the', 'was']


In [12]:
vectorizer = CountVectorizer()
vectorizer.fit(train_df['review'].values)

CountVectorizer()

In [13]:
vectorizer.transform([train_df['review'].iloc[3]])

<1x74581 sparse matrix of type '<class 'numpy.int64'>'
	with 207 stored elements in Compressed Sparse Row format>

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

dummy_data = ['The movie was excellent',
              'the movie was awful']
dummy_labels = [1, 0]

vectorizer = CountVectorizer()
classifier = LogisticRegression()

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

model.fit(dummy_data, dummy_labels)

print(vectorizer.get_feature_names())
print(classifier.coef_)

['awful', 'excellent', 'movie', 'the', 'was']
[[-0.40104279  0.40104279  0.          0.          0.        ]]


In [15]:
model.fit(train_df['review'], train_df['is_positive'])

Pipeline(steps=[('vectorizer', CountVectorizer()),
                ('classifier', LogisticRegression())])

In [16]:
from sklearn.metrics import accuracy_score

def eval_model(model, test_df):
    preds = model.predict(test_df['review'])
    print('Test accuracy = {:.2%}'.format(accuracy_score(test_df['is_positive'], preds)))

In [17]:
eval_model(model, test_df)

Test accuracy = 86.52%


In [18]:
pip install eli5==0.13.0

Note: you may need to restart the kernel to use updated packages.


In [20]:
import eli5
eli5.show_weights(classifier, vec = vectorizer, top = 50)

Weight?,Feature
+1.805,refreshing
+1.717,wonderfully
+1.697,funniest
+1.651,rare
+1.644,surprisingly
+1.451,superb
+1.369,incredible
+1.327,excellent
+1.325,delightful
+1.306,perfect


In [21]:
print('Positive' if test_df['is_positive'].iloc[1] else 'Negative')
eli5.show_prediction(classifier, test_df['review'].iloc[1], vec=vectorizer, 
                     targets=['positive'], target_names=['negative', 'positive'])

Positive


Contribution?,Feature
19.402,Highlighted in text (sum)
-0.001,<BIAS>


In [22]:
print('Positive' if test_df['is_positive'].iloc[6] else 'Negative')
eli5.show_prediction(classifier, test_df['review'].iloc[6], vec=vectorizer, 
                     targets=['positive'], target_names=['negative', 'positive'])

Negative


Contribution?,Feature
-0.001,<BIAS>
-10.808,Highlighted in text (sum)


In [23]:
import numpy as np

preds = model.predict(test_df['review'])
incorrect_pred_index = np.random.choice(np.where(preds != test_df['is_positive'])[0])

eli5.show_prediction(classifier, test_df['review'].iloc[incorrect_pred_index],
                     vec=vectorizer, targets=['positive'], target_names=['negative', 'positive'])

Contribution?,Feature
0.044,Highlighted in text (sum)
-0.001,<BIAS>


---

### Проверьте повысилось ли качество на стандартных подходах при лемматизации/и без неё

In [26]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.3.0/en_core_web_lg-3.3.0-py3-none-any.whl (400.7 MB)
[K     |████████████████████████████████| 400.7 MB 3.8 kB/s eta 0:00:01    |███████▌                        | 94.4 MB 1.3 MB/s eta 0:03:51     |████████████                    | 150.8 MB 232 kB/s eta 0:17:54     |█████████████▉                  | 173.3 MB 440 kB/s eta 0:08:37     |██████████████████████████▋     | 333.3 MB 684 kB/s eta 0:01:39
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.3.0
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')


In [27]:
nlp = spacy.load("en_core_web_lg", disable=["ner"])

In [28]:
def lemmatize_text(text):
    doc = nlp(text)
    tokens=[token.lemma_.strip() for token in doc]
    text=" ".join(tokens)
    return text

In [29]:
train_lem_df = train_df
test_lem_df = test_df

In [30]:
train_lem_df['review'] = train_lem_df['review'].progress_apply(lambda x: lemmatize_text(x))

100%|█████████████████████████████████████| 25000/25000 [09:36<00:00, 43.39it/s]


In [31]:
test_lem_df['review'] = test_lem_df['review'].progress_apply(lambda x: lemmatize_text(x))

100%|█████████████████████████████████████| 25000/25000 [09:17<00:00, 44.88it/s]


In [32]:
train_lem_df.head(5)

Unnamed: 0,is_positive,review
0,0,dreamgirl despite its fistful of Tony win in...
1,0,this show come up with interesting location a...
2,1,I simply love this movie I also love the Ram...
3,0,spoiler ahead if you want to call they that ...
4,1,my all time favorite movie I have see many m...


In [33]:
with open('train_docs.pkl', 'wb') as f:
    pickle.dump(train_lem_df,f)
    
with open('test_docs.pkl', 'wb') as f: 
    pickle.dump(test_lem_df,f)

In [34]:
with open('train_docs.pkl', 'rb') as f:
    train_lem_df = pickle.load(f)
    
with open('test_docs.pkl', 'rb') as f:
    test_lem_df = pickle.load(f)

In [35]:
train_lem_df

Unnamed: 0,is_positive,review
0,0,dreamgirl despite its fistful of Tony win in...
1,0,this show come up with interesting location a...
2,1,I simply love this movie I also love the Ram...
3,0,spoiler ahead if you want to call they that ...
4,1,my all time favorite movie I have see many m...
...,...,...
24995,1,I be a big fan of the movie but not for the ...
24996,0,I m not go to bother with a plot synopsis sin...
24997,0,this movie I do not know why they would tak...
24998,1,see this film on DVD yesterday and be gob sma...


In [36]:
vectorizer = CountVectorizer()
classifier = LogisticRegression()

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

model.fit(train_lem_df['review'], train_lem_df['is_positive'])

eval_model(model, test_lem_df)

Test accuracy = 86.52%


In [37]:
vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=20000, analyzer='word')
classifier = LogisticRegression()

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

model.fit(train_lem_df['review'], train_lem_df['is_positive'])

eval_model(model, test_lem_df)

Test accuracy = 88.33%


---

In [38]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

In [39]:
def tagging_text(text):
    doc = nlp(text)
    tokens=[token.ent_type_.strip() if token.ent_type_ !="" else token.text.strip() for token in doc ]

    text = [tokens[i] for i in range(1, len(tokens)) if tokens[i] != tokens[i-1] ]
    text=" ".join(text)
    return text

In [40]:
train_tag_df = train_lem_df
test_tag_df = test_lem_df

In [41]:
train_tag_df['review'] = train_tag_df['review'].progress_apply(lambda x: tagging_text(x))

100%|█████████████████████████████████████| 25000/25000 [09:29<00:00, 43.87it/s]


In [42]:
test_tag_df['review'] = test_tag_df['review'].progress_apply(lambda x: tagging_text(x))

100%|███████████████████████████████████| 25000/25000 [4:25:29<00:00,  1.57it/s]


In [43]:
train_tag_df['review'][0]

'dreamgirl  despite its fistful of Tony win in an incredibly weak year on Broadway  have never be what one would call a jewel in the crown of stage musical  however  that be not to say that in the right cinematic hand it could not be flesh out and polish into something worthwhile on screen  unfortunately  what transfer to the screen be basically a slavishly faithful version of the stage hit with all of its inherent weakness intact  first  the score have never be one of the strong point of this production and the film do not change that factor  there be lot of song  perhaps too many  but few of they be especially memorable  the close any come to catchy tune be the title song and one Night only  the much acclaimed and I be tell you that I be not go be less a great song than it be a dramatic set piece for the character of Effie  Jennifer Hudson  the film be slick and technically well produce  but the story and character be surprisingly thin and lack in any resonance  there be some interes

In [44]:
with open('train_tags.pkl', 'wb') as f:
    pickle.dump(train_tag_df,f)
    
with open('test_tags.pkl', 'wb') as f: 
    pickle.dump(test_tag_df,f)
    
with open('train_tags.pkl', 'rb') as f:
    train_tag_df = pickle.load(f)
    
with open('test_tags.pkl', 'rb') as f:
    test_tag_df = pickle.load(f)

In [45]:
vectorizer = CountVectorizer()
classifier = LogisticRegression()

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

model.fit(train_tag_df['review'], train_tag_df['is_positive'])

eval_model(model, test_tag_df)

Test accuracy = 86.09%


In [46]:
vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=20000, analyzer='word')
classifier = LogisticRegression()

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

model.fit(train_tag_df['review'], train_tag_df['is_positive'])

eval_model(model, test_tag_df)

Test accuracy = 88.32%


---

### Запустите классификатор и модельки на сеточках

In [47]:
import matplotlib.pyplot as plt 
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalMaxPooling1D, Dropout, Conv1D, BatchNormalization, MaxPooling1D#, GlobalAveragePooling

In [48]:
from collections import Counter

words_counter = Counter((word for text in train_tag_df.review for word in text.lower().split()))

word2idx = {
    '': 0,
    '<unk>': 1
}
for word, count in words_counter.most_common():
    if count < 10:
        break
        
    word2idx[word] = len(word2idx)
    
print('Words count', len(word2idx))

Words count 16610


In [49]:
def convert(texts, word2idx, max_text_len):
    data = np.zeros((len(texts), max_text_len), dtype=np.int)
    
    for inx, text in enumerate(texts):
        result = []
        for word in text.split():
            if word in word2idx:
                result.append(word2idx[word])
        padding = [0]*(max_text_len - len(result))
        data[inx] = np.array(padding + result[-max_text_len:], dtype=np.int)
    return data

In [50]:
X_train = convert(train_tag_df.review, word2idx, 1000)
X_test = convert(test_tag_df.review, word2idx, 1000)

In [51]:
model = Sequential([
    Embedding(input_dim=len(word2idx), output_dim=256, input_shape=(X_train.shape[1],)),
    GlobalMaxPooling1D(),
    Dense(units=256, activation='relu'),
    Dropout(0.2),
    Dense(units=128, activation='relu'),
    Dropout(0.2),
    Dense(units=1, activation='sigmoid')
])


model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

2022-07-16 20:37:42.950775: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [52]:
model.fit(X_train, train_tag_df.is_positive, batch_size=1024, epochs=5, 
          validation_data=(X_test, test_tag_df.is_positive))

2022-07-16 20:37:43.158684: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7ff530803610>

In [53]:
model.evaluate(X_test, test_tag_df.is_positive, batch_size=1024)



[0.2908460795879364, 0.8876799941062927]

---

In [54]:
X_train = convert(train_lem_df.review, word2idx, 1000)
X_test = convert(test_lem_df.review, word2idx, 1000)

In [55]:
model = Sequential([
    Embedding(input_dim=len(word2idx), output_dim=256, input_shape=(X_train.shape[1],)),
    GlobalMaxPooling1D(),
    Dense(units=256, activation='relu'),
    Dropout(0.2),
    Dense(units=128, activation='relu'),
    Dropout(0.2),
    Dense(units=1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [56]:
model.fit(X_train, train_lem_df.is_positive, batch_size=1024, epochs=5, 
          validation_data=(X_test, test_lem_df.is_positive))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7ff52e13ff40>

In [57]:
model.evaluate(X_test, test_lem_df.is_positive, batch_size=1024)



[0.2964848577976227, 0.8865200281143188]

**Модель на токенезированном датасете лучше, чем только лемматизация.**