# Review Classifier project by Mikhail Mikhailov

## Report outline:
                     I: Classification tasks
                        1) Data preparation and model training
                        2) Evaluating the accuracy of the result on the test set
                     II: Web service
                        1) Developing a web service with the Django framework
                        2) Links to the GitHub repository and to the service prototype

## Classification tasks

### Data preparation and model training

Load the required libraries.

In [1]:
import numpy as np # for data manipulation
import pandas as pd # for data manipulation
import fasttext # for training the model
from tqdm import tqdm # for monitoring loop progress
import re # for regular expressions
import string
from nltk.corpus import PlaintextCorpusReader # for listing the names of the text files
from sklearn.utils import shuffle # for shuffling the rows of a dataframe
from nltk.stem.porter import PorterStemmer # for stemming words
from nltk.corpus import stopwords # for the English stop-word list
import nltk

#### Let us split the task into two subtasks:
                            1) Train a multiclass classification model (predicting the movie rating) and evaluate its accuracy
                            2) Train a binary classification model (predicting the status of the comment)

#### Multiclass classification

For this task I decided to reduce the number of classes from 8 to 4 by merging each pair of adjacent ratings (1-2, 3-4, 7-8, 9-10), since neighboring ratings are very close to each other.
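
The merging rule can be written down as a small lookup. This is only an illustrative sketch: the function name `rating_to_bucket` is hypothetical and is not part of the notebook, which applies the same rule inline while reading the files.

```python
def rating_to_bucket(rating: int) -> str:
    """Map a star rating to one of the four merged classes."""
    buckets = {1: "1-2", 2: "1-2", 3: "3-4", 4: "3-4",
               7: "7-8", 8: "7-8", 9: "9-10", 10: "9-10"}
    if rating not in buckets:
        # Ratings 5 and 6 do not occur in the dataset used here.
        raise ValueError(f"no class defined for rating {rating}")
    return buckets[rating]


print(rating_to_bucket(8))  # 7-8
```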

#### Loading the data

In [2]:
def sorter(paths): # sorts the reviews stored in the given folders into the reviews dictionary
    reviews = {'1-2':[], "3-4":[], "7-8":[], "9-10":[]}
    for path in paths:
        if path == 'neg_train' or path == 'neg':
            for rating in range(1, 5):
                # collect the names of the files whose name ends in _<rating>.txt
                corpus = PlaintextCorpusReader(path, r'.*_{}\.txt'.format(rating), encoding='utf-8')
                names = corpus.fileids()
                for name in names:
                    # the with-block closes the file after reading
                    with open(path + f"/{name}", encoding='utf-8') as file:
                        text = file.read()
                    # file the review under its class key
                    if rating == 1 or rating == 2:
                        reviews["1-2"].append(text)
                    if rating == 3 or rating == 4:
                        reviews["3-4"].append(text)
        elif path == "pos_train" or path == 'pos':
            for rating in range(7, 11):
                corpus = PlaintextCorpusReader(path, r'.*_{}\.txt'.format(rating), encoding='utf-8')
                names = corpus.fileids()
                for name in names:
                    with open(path + f"/{name}", encoding='utf-8') as file:
                        text = file.read()
                    # file the review under its class key
                    if rating == 7 or rating == 8:
                        reviews["7-8"].append(text)
                    if rating == 9 or rating == 10:
                        reviews["9-10"].append(text)
    return reviews

In [3]:
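# note: sorter() only has branches for 'neg_train', 'neg', 'pos_train' and 'pos',
# so the 'train_neg' entries below match no branch and are skipped
paths = ['train_neg', 'train_neg', 'pos', 'neg']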
reviews_rating = sorter(paths)

#### Exploring and preprocessing the data

Convert the reviews_rating dictionary into the DataFrame df_rating. Because the rows come out grouped by class, shuffle them.

In [4]:
df_rating = shuffle(pd.DataFrame([(key, var) for (key, L) in reviews_rating.items() for var in L], 
                 columns=['rating', 'review']))
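
The flatten-then-shuffle step can be illustrated without pandas. The sketch below uses a toy dictionary and a hypothetical helper `flatten_and_shuffle`; the fixed `seed` is an addition for reproducibility and is not used in the notebook.

```python
import random


def flatten_and_shuffle(reviews_by_class, seed=0):
    """Turn {label: [texts]} into a shuffled list of (label, text) rows."""
    rows = [(label, text)
            for label, texts in reviews_by_class.items()
            for text in texts]
    # A seeded generator makes the row order repeatable between runs.
    random.Random(seed).shuffle(rows)
    return rows


toy = {"1-2": ["awful", "boring"], "9-10": ["superb"]}
rows = flatten_and_shuffle(toy)
print(rows)
```

`sklearn.utils.shuffle` accepts a `random_state` argument that serves the same purpose for the DataFrame version.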

Save the resulting DataFrame to a csv file.

In [5]:
df_rating.to_csv('df_rating_doubled.csv')

Create a helper function preprocessor that strips HTML tags, removes punctuation, converts the text to lowercase, and appends any emoticons it finds to the end of the text.

In [6]:
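import re
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text) # strip HTML tags such as <br />
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text) # remember the emoticons
    # lowercase, replace runs of non-word characters with a space, then append the emoticons without the '-' nose
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text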

preprocessor("This is a :) test :-( !")

'this is a test :) :('

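Create a function that splits a review into words, stems them with the Porter stemmer, and removes stop words. Because the stop-word filter runs after stemming, stems such as 'thi' (from 'this') and 'wa' (from 'was') are no longer on the stop-word list and remain in the processed reviews.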

In [7]:
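porter = PorterStemmer()
def tokenizer_stemmer(text):
    # requires the NLTK stop-word corpus: nltk.download('stopwords')
    stop = stopwords.words('english')
    # stem every whitespace-separated word, then drop the stems found in the stop-word list
    return [w for w in [porter.stem(word) for word in text.split()] if w not in stop]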

tokenizer_stemmer('like running thus they run')

['like', 'run', 'thu', 'run']

In [8]:
def list_to_string(text):
    # join the tokens back into one space-separated string
    return ' '.join(text)

s = ['Geeks', 'for', "Geeks"]
print(list_to_string(s))

Geeks for Geeks


Process the DataFrame.

In [9]:
df_rating['preprocessed_review'] = df_rating.review.apply(preprocessor)

In [10]:
df_rating['tokenized_review'] = df_rating.preprocessed_review.apply(tokenizer_stemmer)

In [11]:
df_rating['string_review'] = df_rating.tokenized_review.apply(list_to_string)

In [12]:
df_rating.head()

| | rating | review | preprocessed_review | tokenized_review | string_review |
|---|---|---|---|---|---|
| 23637 | 9-10 | It takes guts to make a movie on Gandhi in Ind... | it takes guts to make a movie on gandhi in ind... | [take, gut, make, movi, gandhi, india, shown, ...] | take gut make movi gandhi india shown man coul... |
| 12099 | 3-4 | After watching Awake,I led to a conclusion:dir... | after watching awake i led to a conclusion dir... | [watch, awak, led, conclus, director, screenwr...] | watch awak led conclus director screenwrit job... |
| 13508 | 7-8 | I can safely admit (as an IMDb geek) that 'Pha... | i can safely admit as an imdb geek that phanto... | [safe, admit, imdb, geek, phantom, ladi, never...] | safe admit imdb geek phantom ladi never crack ... |
| 1848 | 1-2 | Warning Spoiler. . . I have to agree with you,... | warning spoiler i have to agree with you it wa... | [warn, spoiler, agre, wa, almost, thi, wa, bad...] | warn spoiler agre wa almost thi wa bad movi in... |
| 19832 | 9-10 | It's hard to top this movie in several ways. E... | it s hard to top this movie in several ways ev... | [hard, top, thi, movi, sever, way, everyth, wo...] | hard top thi movi sever way everyth work reall... |


To train the models I used the fasttext module. It expects every example on its own line in the format "__label__{class label} {review text}".

In [13]:
data_text_fast = df_rating.apply(lambda x: '__label__' + str(x['rating']) + ' ' + str(x['string_review']), axis = 1)
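
The same line format can be produced by a small standalone helper. This is a sketch, and `to_fasttext_line` is a hypothetical name. It also collapses embedded newlines and tabs, since the training file holds one example per line; the notebook's own preprocessing already removes them for this dataset.

```python
def to_fasttext_line(label, text):
    """Build one training line: '__label__<label> <text>'."""
    # One example per line, so squash any whitespace runs into single spaces.
    single_line = " ".join(str(text).split())
    return f"__label__{label} {single_line}"


print(to_fasttext_line("9-10", "take gut make movi"))
# __label__9-10 take gut make movi
```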

In [14]:
data_text_fast

23637    __label__9-10 take gut make movi gandhi india ...
12099    __label__3-4 watch awak led conclus director s...
13508    __label__7-8 safe admit imdb geek phantom ladi...
1848     __label__1-2 warn spoiler agre wa almost thi w...
19832    __label__9-10 hard top thi movi sever way ever...
                               ...                        
12266    __label__3-4 thi decent endeavor guy wrote scr...
5446     __label__1-2 previous enjoy wesley snipe sever...
18915    __label__9-10 friend pick paperhous random pil...
13861    __label__7-8 lover bad movi definit hit paydir...
6777     __label__1-2 nb spoiler warn first thi teen sl...
Length: 25000, dtype: object

Save the data to a text file.

In [15]:
np.savetxt('data_fast_text.txt', data_text_fast, delimiter = ' ', fmt = '%s')

Split the data into test and training sets in a 1:9 ratio.

In [16]:
test = data_text_fast[:int(len(data_text_fast)*.1)]
train = data_text_fast[int(len(data_text_fast)*.1):]
np.savetxt('train_fasttext.txt', train, delimiter = ' ', fmt = '%s')
np.savetxt('test_fasttext.txt', test, delimiter = ' ', fmt = '%s')
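
The slicing above works as a random split only because the rows were shuffled beforehand. As a minimal sketch of the same arithmetic on a plain list (the helper name `split_test_train` is hypothetical):

```python
def split_test_train(items, test_fraction=0.1):
    """Take the first test_fraction of the items as test data, the rest as training data."""
    n_test = int(len(items) * test_fraction)
    return items[:n_test], items[n_test:]


test_part, train_part = split_test_train(list(range(25000)))
print(len(test_part), len(train_part))  # 2500 22500
```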

In [17]:
train

20186    __label__9-10 remark quit praiseworthi writer ...
24133    __label__9-10 rubi paradis beauti come age sto...
5354     __label__1-2 gave 2 instead 1 aw becaus deni m...
11960    __label__3-4 familiar trilog came upon thi fil...
16523    __label__7-8 overshadow braveheart releas year...
                               ...                        
12266    __label__3-4 thi decent endeavor guy wrote scr...
5446     __label__1-2 previous enjoy wesley snipe sever...
18915    __label__9-10 friend pick paperhous random pil...
13861    __label__7-8 lover bad movi definit hit paydir...
6777     __label__1-2 nb spoiler warn first thi teen sl...
Length: 22500, dtype: object

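#### Training the model

In the call below, epoch is the number of epochs (how many times the model sees each training example), wordNgrams is the maximum length of the word n-grams, dim is the embedding size, and bucket is the number of rows in the table that stores the embeddings of the hashed n-grams. With loss='ova' (one-vs-all) every label receives an independent score, so the probabilities returned for one review need not sum to 1.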

In [18]:
model = fasttext.train_supervised(input='train_fasttext.txt', epoch=35, wordNgrams=3, bucket=300000, dim=50, loss='ova')    
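
To make wordNgrams and bucket more concrete, here is a rough pure-Python sketch of how word n-grams up to length 3 are formed and how an n-gram can be mapped to one of `bucket` rows. The function names are hypothetical, and CRC32 is only a stand-in: it is not the hash function fastText uses internally.

```python
import zlib


def word_ngrams(tokens, max_n):
    """All contiguous word n-grams of length 1..max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def bucket_index(ngram, bucket):
    # Stand-in hash (CRC32), NOT fastText's real hashing scheme.
    return zlib.crc32(ngram.encode("utf-8")) % bucket


tokens = "take gut make movi".split()
grams = word_ngrams(tokens, max_n=3)
print(grams)
# Row indices for the multi-word n-grams with bucket=300000
print([bucket_index(g, bucket=300000) for g in grams if " " in g])
```

Different n-grams may land in the same row, so a larger bucket value trades memory for fewer collisions.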

Save the model.

In [19]:
model.save_model("model_review_fasttext.bin")

#### Evaluating the accuracy of the result on the test set

In [20]:
print(model.test("test_fasttext.txt"))

(2500, 0.6556, 0.6556)


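The tuple returned by model.test contains the number of test examples, the precision at 1 and the recall at 1, so the precision of the model on the test set is 0.6556 (about 0.66).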
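
Precision and recall coincide here because every review has exactly one true label and only the top prediction is scored. The sketch below computes both values on four made-up examples; the function `precision_recall_at_k` and the toy data are illustrative and are not taken from the notebook.

```python
def precision_recall_at_k(predicted, gold, k=1):
    """predicted: ranked label lists; gold: sets of true labels (one per example)."""
    n_correct = n_predicted = n_gold = 0
    for ranked, truth in zip(predicted, gold):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_gold += len(truth)
    return n_correct / n_predicted, n_correct / n_gold


predicted = [["7-8", "9-10"], ["1-2", "3-4"], ["3-4", "1-2"], ["9-10", "7-8"]]
gold = [{"7-8"}, {"1-2"}, {"1-2"}, {"9-10"}]
print(precision_recall_at_k(predicted, gold, k=1))  # (0.75, 0.75)
print(precision_recall_at_k(predicted, gold, k=2))  # (0.5, 1.0)
```

With k=2 the two numbers diverge: twice as many labels are predicted, so precision drops while recall rises.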

Here are a few more reviews to assess the quality of the model.
The rating of each review is given by the number after the word review.

In [21]:
review_5 = "Good production, quite intense, very interesting reenactment of a battle at sea, although I am not sure how realistic and accurate it is (those submarines liked to fight a lot in plain sight over the water, right?).My main issue? It was like a documentary, lots of battle time but no much human drama. No time given to develop any characters or even make us feel something about these people. Ships were sinking in the middle of the Atlantic and we never had a look in the horror of trying to survive this. Even Hank's character felt flat. The few scenes with his love interest were rather awkward and didn't contribute much. At one point we were wondering if the captain was in the spectrum or something..Still, if you are a fan of WWII movies I suggest to watch it for the unique perspective of a battle at sea."
review_5 = preprocessor(review_5)
review_5 = tokenizer_stemmer(review_5)
review_5 = list_to_string(review_5)
print(model.predict(review_5, k=2))

review_1 = "Very bad and disgusting film!"
review_1 = preprocessor(review_1)
review_1 = tokenizer_stemmer(review_1)
review_1 = list_to_string(review_1)
print(model.predict(review_1, k=2))

review_10 = "This movie may not be for everyone, but as a Navy veteran who has stood watch in CIC this movie is very realistic. I loved that there wasn't a lot of back story. It was more about how lonely and arduous the job of a Navy captain is. Must watch!"
review_10 = preprocessor(review_10)
review_10 = tokenizer_stemmer(review_10)
review_10 = list_to_string(review_10)
print(model.predict(review_10, k=2))

review_6 = "It was just OK which is reflected in my 6 star rating, which I think is generous. They made little effort at any sort of character development. Hanks has some sort of love interest that is hinted at, but that's it. I thought they were going to build some sort of bond between the captain and his Black cook, but that never got off the ground. We know almost nothing more about the man at the end than in the beginning.It's a difficult challenge to portray on film the battle between a destroyer and a submarine. This movie never really pulls that off, it's like listening to one side of a telephone conversation. The German U-boats never seem to be part of the narrative. They try to bring the U-boats in with their radio broadcasts which come across more as obscene phone calls than viable dialogue.Almost exactly half of the dialogue is sailors repeating orders. It got very tedious, very quickly. I thought they went too far in the whole \"navy talk\" department.I found the U-boat attack theme music to be mostly bothersome and heavy-handed, like death wail of a fat man, or a runner-up in Dumb and Dumber's most annoying sound in the world.The sea burial onboard was a moving tribute to our military dead. I couldn't imagine a better resting place than the open sea. I was air force and not really sure what we did. Threw bodies out the back of a C130? That'd be cool with me."
review_6 = preprocessor(review_6)
review_6 = tokenizer_stemmer(review_6)
review_6 = list_to_string(review_6)
print(model.predict(review_6, k=2))

review_1 = "A typical US movie boring would a German u-boat contact a us war ship ??? and all the rest again private ( hanks) movie not worth a second to watch ok maybe now with the virus nothing better on"
review_1 = preprocessor(review_1)
review_1 = tokenizer_stemmer(review_1)
review_1 = list_to_string(review_1)
print(model.predict(review_1, k=2))

review_2 = "This movie was very disappointing, I really wanted to like it. But it was monotonous with 99% of the movie on the fighter ship fighting with a fake sounding loud sound effects. Gave me a headache, take a couple of aspirins if planing to watch it."
review_2 = preprocessor(review_2)
review_2 = tokenizer_stemmer(review_2)
review_2 = list_to_string(review_2)
print(model.predict(review_2, k=2))

review_1 = "movie is disgusting"
review_1 = preprocessor(review_1)
review_1 = tokenizer_stemmer(review_1)
review_1 = list_to_string(review_1)
print(model.predict(review_1, k=2))

review_10 = "movie great awesome"
review_10 = preprocessor(review_10)
review_10 = tokenizer_stemmer(review_10)
review_10 = list_to_string(review_10)
print(model.predict(review_10, k=2))

(('__label__7-8', '__label__3-4'), array([0.12253322, 0.10375863]))
(('__label__1-2', '__label__3-4'), array([1.00001001e+00, 1.00000034e-05]))
(('__label__9-10', '__label__7-8'), array([0.8808071 , 0.01799621]))
(('__label__3-4', '__label__7-8'), array([0.69265199, 0.22271016]))
(('__label__1-2', '__label__3-4'), array([0.45327184, 0.05185546]))
(('__label__3-4', '__label__1-2'), array([0.99567848, 0.46880063]))
(('__label__1-2', '__label__3-4'), array([7.49097228e-01, 1.00000034e-05]))
(('__label__9-10', '__label__3-4'), array([1.00001001e+00, 1.00000034e-05]))


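On these examples the classifier is mostly accurate: the top label matches the review's rating in all but one case, where the review rated 2 is assigned to 3-4.
When it is wrong, it still points to the class the review belongs to (the second value in the first tuple).
The data also contains no reviews rated 5 or 6, so reviews with these ratings are assigned to either 3-4 or 7-8.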
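
The raw tuples printed above are easier to read once the `__label__` prefix is stripped and each label is paired with its score. A small sketch follows; the helper name `readable_prediction` is hypothetical, and a plain list stands in for the NumPy array that fastText returns.

```python
def readable_prediction(labels, probabilities, prefix="__label__"):
    """Pair each predicted label (without its prefix) with its score."""
    return [
        (label[len(prefix):] if label.startswith(prefix) else label, float(p))
        for label, p in zip(labels, probabilities)
    ]


# Values copied from the output for the review rated 2
labels, probs = ("__label__3-4", "__label__1-2"), [0.99567848, 0.46880063]
print(readable_prediction(labels, probs))
# [('3-4', 0.99567848), ('1-2', 0.46880063)]
```

The two scores add up to more than 1, which is expected with the one-vs-all loss used for training.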

#### Binary classification

The task is to classify reviews into 2 classes: positive and negative.

#### Loading the data

In [22]:
def sorter_sentiment(paths): # sorts the reviews stored in the given folders into the reviews_sentiment dictionary
    reviews_sentiment = {'__label__0': [], '__label__1': []}
    for path in paths:
        if path == 'neg_train' or path == 'neg':
            for rating in range(1, 5):
                # collect the names of the files whose name ends in _<rating>.txt
                corpus = PlaintextCorpusReader(path, r'.*_{}\.txt'.format(rating), encoding='utf-8')
                names = corpus.fileids()
                for name in names:
                    # the with-block closes the file after reading
                    with open(path + f"/{name}", encoding='utf-8') as file:
                        text = file.read()
                    reviews_sentiment['__label__0'].append(text) # negative review
        elif path == "pos_train" or path == 'pos':
            for rating in range(7, 11):
                corpus = PlaintextCorpusReader(path, r'.*_{}\.txt'.format(rating), encoding='utf-8')
                names = corpus.fileids()
                for name in names:
                    with open(path + f"/{name}", encoding='utf-8') as file:
                        text = file.read()
                    reviews_sentiment['__label__1'].append(text) # positive review
    return reviews_sentiment

In [23]:
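# note: sorter_sentiment() only has branches for 'neg_train', 'neg', 'pos_train' and 'pos',
# so the 'train_neg' and 'train_pos' entries below match no branch and are skipped
paths = ['train_neg', 'train_pos', 'pos', 'neg']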
reviews_sentiment = sorter_sentiment(paths)

#### Exploring and preprocessing the data

Convert the reviews_sentiment dictionary into the DataFrame df_sentiment. Because the rows come out grouped by class, shuffle them.

In [24]:
df_sentiment = shuffle(pd.DataFrame([(key, var) for (key, L) in reviews_sentiment.items() for var in L], 
                 columns=['sentiment', 'review']))

Save the resulting DataFrame to a csv file.

In [25]:
df_sentiment.to_csv('df_sentiment.csv')

Process the DataFrame.

In [48]:
df_sentiment['preprocessed_review'] = df_sentiment.review.apply(preprocessor)

In [49]:
df_sentiment

| | sentiment | review | preprocessed_review | tokenized_review | string_review |
|---|---|---|---|---|---|
| 2832 | __label__0 | 99.999% pure crap. And the other .001% was a b... | 99 999 pure crap and the other 001 was a brief... | [99, 999, pure, crap, 001, wa, brief, moment, ...] | 99 999 pure crap 001 wa brief moment thought b... |
| 23477 | __label__1 | Idiotic hack crooks, a babe, a safe, a plan an... | idiotic hack crooks a babe a safe a plan and a... | [idiot, hack, crook, babe, safe, plan, babi, a...] | idiot hack crook babe safe plan babi add get b... |
| 19864 | __label__1 | H.G. Wells in 1936 was past his prime and the ... | h g wells in 1936 was past his prime and the b... | [h, g, well, 1936, wa, past, hi, prime, book, ...] | h g well 1936 wa past hi prime book hi surviv ... |
| 15516 | __label__1 | While I had wanted to se this film since the f... | while i had wanted to se this film since the f... | [want, se, thi, film, sinc, first, time, watch...] | want se thi film sinc first time watch trailer... |
| 7173 | __label__0 | I had a really hard time making it through thi... | i had a really hard time making it through thi... | [realli, hard, time, make, thi, move, wa, exte...] | realli hard time make thi move wa extermli slo... |
| ... | ... | ... | ... | ... | ... |
| 14574 | __label__1 | Clint Eastwood reprises his role as Dirty Harr... | clint eastwood reprises his role as dirty harr... | [clint, eastwood, repris, hi, role, dirti, har...] | clint eastwood repris hi role dirti harri thi ... |
| 13332 | __label__1 | I actually like the original, and this film ha... | i actually like the original and this film has... | [actual, like, origin, thi, film, ha, origin, ...] | actual like origin thi film ha origin voic cas... |
| 6777 | __label__0 | `((NB: Spoiler warning, such as it is!))<br /><...` | nb spoiler warning such as it is first off th... | [nb, spoiler, warn, first, thi, teen, slasher,...] | nb spoiler warn first thi teen slasher flick s... |
| 15169 | __label__1 | SPOILERS Many different comedy series nowadays... | spoilers many different comedy series nowadays... | [spoiler, mani, differ, comedi, seri, nowaday,...] | spoiler mani differ comedi seri nowaday one po... |


Convert the data to the fasttext format.

In [50]:
data_text_fast_sentiment = df_sentiment.apply(lambda x: x['sentiment'] + ' ' + str(x['preprocessed_review']), axis = 1)

In [51]:
data_text_fast_sentiment

2832     __label__0 99 999 pure crap and the other 001 ...
23477    __label__1 idiotic hack crooks a babe a safe a...
19864    __label__1 h g wells in 1936 was past his prim...
15516    __label__1 while i had wanted to se this film ...
7173     __label__0 i had a really hard time making it ...
                               ...                        
14574    __label__1 clint eastwood reprises his role as...
13332    __label__1 i actually like the original and th...
15169    __label__1 spoilers many different comedy seri...
16343    __label__1 a fine story about following your d...
Length: 25000, dtype: object

Save the data to a text file.

In [52]:
np.savetxt('data_fast_text_sentiment.txt', data_text_fast_sentiment, delimiter = ' ', fmt = '%s')

Split the data into test and training sets in a 1:9 ratio.

In [53]:
test_sentiment = data_text_fast_sentiment[:int(len(data_text_fast_sentiment)*.1)]
train_sentiment = data_text_fast_sentiment[int(len(data_text_fast_sentiment)*.1):]
np.savetxt('train_fasttext_sentiment.txt', train_sentiment, delimiter = ' ', fmt = '%s')
np.savetxt('test_fasttext_sentiment.txt', test_sentiment, delimiter = ' ', fmt = '%s')

In [54]:
test_sentiment

2832     __label__0 99 999 pure crap and the other 001 ...
23477    __label__1 idiotic hack crooks a babe a safe a...
19864    __label__1 h g wells in 1936 was past his prim...
15516    __label__1 while i had wanted to se this film ...
7173     __label__0 i had a really hard time making it ...
                               ...                        
23581    __label__1 my wife and i have watched this who...
7775     __label__0 this movie is little more than poor...
22920    __label__1 this is the second baby burlesk sho...
1237     __label__0 thanks to this film i now can answe...
12564    __label__1 in the spirit of the classic the st...
Length: 2500, dtype: object

In [55]:
train_sentiment

4836     __label__0 it is very unfortunate when a movie...
13652    __label__1 didn t know anything about the movi...
16365    __label__1 this was a good movie it wasn t you...
7502     __label__0 you can give jms and the boys a pas...
19205    __label__1 jackie chan is considered by many f...
                               ...                        
14574    __label__1 clint eastwood reprises his role as...
13332    __label__1 i actually like the original and th...
15169    __label__1 spoilers many different comedy seri...
16343    __label__1 a fine story about following your d...
Length: 22500, dtype: object

#### Training the model

In [56]:
model_sentiment = fasttext.train_supervised(input='train_fasttext_sentiment.txt', epoch=33, wordNgrams=3, bucket=300000, dim=25, loss='ova') 

Save the model.

In [59]:
model_sentiment.save_model("model_review_fasttext_sentiment.bin")

#### Evaluating the accuracy of the result on the test set

In [57]:
print(model_sentiment.test("test_fasttext_sentiment.txt"))

(2500, 0.8964, 0.8964)


The precision of the model on the test set is 0.8964 (about 0.90).

Here are a few more reviews to assess the quality of the model. The status of each review is given after the word rev.

In [58]:
rev_pos = 'What an excellent film by Rian Johnson; definitely feels like the film he was destined to make. Writing that is slick as hell, sublime performances (most notably Daniel Craig who brings his A-game in a wonderfully charismatic turn), superb editing and wonderfully atmospheric music - all tied together by masterful direction. Will probably be among the most fun you have at a theatre this year and fans of Agatha Christie and old murder mystery stories will have plenty to love here - a nostalgically entertaining time!'
rev_pos = preprocessor(rev_pos)
print(model_sentiment.predict(rev_pos))

rev_neg = "Although this show is very highly rated I couldn't force myself to keep watching after a few episodes simply because of the way the actors talk. I wouldn't even say the acting is bad, the characters just can't keep up with all this dark, cold and mysterious plot. The people are all equally dark, cold and mysterious and that's just not the way people are or at least behave. For Non-German audience it is probably easier to get into but the show is set in Germany in German language. And no one here talks that way, like an emotionless robot (maybe that's the way we are seen by the world but it's not true). So every dialogue of the show reminded me just how different and stylish it wants to be, completely forgetting human depth or even basic human feelings (when you begin to ask yourself if there is any character capable of loving, even if it's just their family members, you know there is something wrong)."
rev_neg = preprocessor(rev_neg)
print(model_sentiment.predict(rev_neg))

rev_pos = "This feedback is currently based on the first 10 episodes of season one (still watching the series as we speak), and to be honest it made me downgrade my rating for 'Stranger Things' quite a bit. It's clear that this series is the child of a genius - everything just works. The acting is great, the cinematography is great, the soundtrack is awesome (reminds me of Klaus Schulze at times), and the story-line is mind-boggling. This series is just miles above anything that leaves Hollywood. Wish that all productions could be like this. Thanks Netflix for bringing this awesome series to me screen. Say bye-bye to public television!"
rev_pos = preprocessor(rev_pos)
print(model_sentiment.predict(rev_pos))

rev_pos = "This movie is good"
rev_pos = preprocessor(rev_pos)
print(model_sentiment.predict(rev_pos))

rev_neg = "This is truly the most garbage movie I've ever seen. As a film student, I have no idea what this director was doing as he put this maseceure together. The shots are random and unmotivated, even for a comedy. I love comedies and Andy Samberg, but the writing is so awful. I don't see how a writer could be happy showing this to anyone, and the producers must of been so eager for a script that they couldn't recognize how terrible it is. The editing and sound mixing look like they've been done by amateurs. Besides the few creative cinematic shots and the three or four decent lines of dialogue, there's absolutely nothing good about this film and anyone who things this is great must have a very low intellect and appreciation of good films."
rev_neg = preprocessor(rev_neg)
print(model_sentiment.predict(rev_neg))

(('__label__1',), array([1.00001001]))
(('__label__0',), array([0.98309511]))
(('__label__1',), array([1.00001001]))
(('__label__1',), array([1.00001001]))
(('__label__0',), array([1.00001001]))


The binary classifier labels all five of these reviews correctly.

# -----------------//-----------------

## Web service

#### Developing the web service

Using the Django framework and the Heroku cloud service, a web service was built in which a user enters a movie review and the review is automatically assigned a rating (from 1 to 10) and a status (positive or negative).
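
The core of such a service is a function that runs one review through both models and returns a result a view can render. The sketch below is framework-independent and purely illustrative: it is not the project's actual Django code, and `classify_review`, `StubModel` and the `prepare_*` arguments are hypothetical names. The only assumption about the models is that `predict` returns a tuple of labels and a sequence of scores, as in the outputs shown earlier.

```python
def classify_review(text, rating_model, sentiment_model,
                    prepare_rating, prepare_sentiment):
    """Run one review through both classifiers and return a plain dict."""
    rating_labels, rating_scores = rating_model.predict(prepare_rating(text))
    sent_labels, sent_scores = sentiment_model.predict(prepare_sentiment(text))
    return {
        "rating": rating_labels[0].replace("__label__", ""),
        "rating_score": float(rating_scores[0]),
        "sentiment": "positive" if sent_labels[0] == "__label__1" else "negative",
        "sentiment_score": float(sent_scores[0]),
    }


class StubModel:
    """Stands in for a loaded fastText model in this demo only."""

    def __init__(self, label, score):
        self._label, self._score = label, score

    def predict(self, text, k=1):
        return (self._label,), [self._score]


result = classify_review(
    "Movie great awesome",
    StubModel("__label__9-10", 0.9),
    StubModel("__label__1", 0.95),
    prepare_rating=str.lower,      # placeholder for preprocessor + stemming
    prepare_sentiment=str.lower,   # placeholder for preprocessor
)
print(result)
```

In a real deployment the stubs would be replaced by the two saved models and the placeholders by the preprocessing functions defined earlier in the report.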

The source code of the project is available on GitHub.

The application was created using the Heroku cloud service.

### GitHub link:

### Prototype link:

https://reviewclassifier-withdjango.herokuapp.com