## Homework:

https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Classify the reviews as positive or negative.

Use spaCy (or other models and methods).

## Information about dataset

"IMDB dataset having 50K movie reviews for natural language processing or Text analytics. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training and 25,000 for testing. So, predict the number of positive and negative reviews using either classification or deep learning algorithms. For more dataset information, please go through the following link, http://ai.stanford.edu/~amaas/data/sentiment/ "

## Results from different notebooks:

- GRU Model:  [0.49783108]
- LSTM Model:  [0.49836373]
- GAP Model:  [0.9808547]

### Resources:
- https://www.kaggle.com/code/meryemtetik/moviereviews-rnn-lstm-gru
- https://www.kaggle.com/code/lovinggirls/text-preprocessing-nlp-pipeline-by-piyush-kumar
- https://www.kaggle.com/code/marekm4/movie-reviews-simple-nlp-library
- https://www.kaggle.com/code/martandsay/movie-review-sentiment-analysis-nlp
- https://www.kaggle.com/code/ibrahimo/imdb-reviews

In [1]:
import os
import string

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import tensorflow as tf

from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

from tensorflow.keras import layers
from keras.models import Sequential
from keras.layers import GRU, LSTM, GlobalAveragePooling1D, Dense, TextVectorization, Input, Embedding, Flatten, Dropout, BatchNormalization
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam



In [2]:
%pip install transformers

import nltk
from string import punctuation
nltk.download('stopwords')
nltk.download('punkt')

stop_words = nltk.corpus.stopwords.words('english') + list(punctuation)

Note: you may need to restart the kernel to use updated packages.
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
df = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [5]:
df.sentiment.value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

### Dataset pre-processing

In [6]:
# Encode the target: positive -> 1, negative -> 0
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


In [7]:
!pip install ttictoc

Collecting ttictoc
  Downloading ttictoc-0.5.6-py3-none-any.whl (5.7 kB)
Installing collected packages: ttictoc
Successfully installed ttictoc-0.5.6


In [8]:
import re
from nltk import word_tokenize
from ttictoc import tic, toc
from IPython.display import clear_output
import spacy

nlp = spacy.load("en_core_web_lg") # Load the large English pipeline
tic() # Start the timer

# v1: regex cleanup + NLTK tokenization + stop-word removal
def preprocesing_text_v1(text: str):
    try:
        text = text.lower()
        text = re.sub(r"@[A-Za-z0-9]+", ' ', text) # Removing @mentions
        text = re.sub(r"<.*?>", ' ', text) # Removing HTML tags
        text = re.sub(r"https?://\S+", ' ', text) # Removing urls
        text = re.sub(r"[^a-zA-Z.!?'0-9]", ' ', text) # Keeping only letters, digits and . ! ? '
        text = re.sub(r" +", ' ', text).strip() # Collapsing repeated spaces
        text = ' '.join([x for x in word_tokenize(text) if x not in stop_words])
    except Exception as e:
        print(e)
    return text

# v2: spaCy lemmatization
def preprocesing_text_v2(text):
    doc = nlp(text)
    # NOTE: a spaCy token never contains an inner space, so the '<br />'
    # comparison never matches and the tags stay in the text
    # (see row 1 of the output below).
    preprocessed_text = [token.lemma_ for token in doc if not token.is_punct and 
                         token.text != ' ' and token.text != '\n' and token.text != '<br />' 
                         and token.text != '<br/>']
    return " ".join(preprocessed_text)


df['review'] = df['review'].apply(preprocesing_text_v2)

clear_output(wait=True)
tc=toc()
print('Pre-processing time: ',tc)
print(df)

Pre-processing time:  2285.762749088
                                                   review  sentiment
0      one of the other reviewer have mention that af...          1
1      a wonderful little production < br /><br />the...          1
2      I think this be a wonderful way to spend time ...          1
3      basically there be a family where a little boy...          0
4      petter Mattei 's love in the Time of money be ...          1
...                                                  ...        ...
49995  I think this movie do a down right good job it...          1
49996  bad plot bad dialogue bad act idiotic directin...          0
49997  I be a Catholic teach in parochial elementary ...          0
49998  I be go to have to disagree with the previous ...          0
49999  no one expect the Star Trek movie to be high a...          0

[50000 rows x 2 columns]
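
The `<br />` fragments still visible in row 1 suggest that HTML tags are easier to remove before the text reaches spaCy. A minimal sketch, assuming a hypothetical helper `strip_html` (not part of the notebook) that reuses the tag regex from `preprocesing_text_v1`; the sample string is made up:

```python
import re

# Hypothetical helper (not in the original notebook): drop HTML tags such as
# <br /> and collapse the whitespace they leave behind.
TAG_RE = re.compile(r"<.*?>")

def strip_html(text: str) -> str:
    without_tags = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()

print(strip_html("Great film.<br /><br />Loved it."))  # Great film. Loved it.
```

Calling such a helper on the raw review before `nlp(text)` would keep tag fragments out of the lemmatized output.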


In [9]:
# Examples of tools for working with text
######################################################
# Tokenization
# NLTK
from nltk.tokenize import word_tokenize, sent_tokenize

sent1 = "I am going to visit you this night!"
sent2 = "I have Ph.D in A.I."
sent3 = "I am here to help. Mail me at makelove@gmail.com"
sent4 = "A 2hr ride cost $10"

print(word_tokenize(sent1))
print(word_tokenize(sent2))
print(word_tokenize(sent3))
print(word_tokenize(sent4))

######################################################
# spaCy
import spacy
nlp = spacy.load('en_core_web_sm')

doc1 = nlp(sent1)
doc2 = nlp(sent2)
doc3 = nlp(sent3)
doc4 = nlp(sent4)

######################################################
# Iterate over the spaCy tokens
for token in doc4:
    print(token)

######################################################
for token in doc1:
    print(token)

# Stemming
######################################################
#test = 'walk walks walking walked'
#stem_words(test)



['I', 'am', 'going', 'to', 'visit', 'you', 'this', 'night', '!']
['I', 'have', 'Ph.D', 'in', 'A.I', '.']
['I', 'am', 'here', 'to', 'help', '.', 'Mail', 'me', 'at', 'makelove', '@', 'gmail.com']
['A', '2hr', 'ride', 'cost', '$', '10']
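
The stemming example at the end of the cell is commented out, and `stem_words` is not defined anywhere in the notebook. A minimal sketch of what such a helper could look like, assuming NLTK's `PorterStemmer` and plain whitespace splitting (both are choices made here for illustration, not necessarily the author's):

```python
from nltk.stem import PorterStemmer

# Hypothetical implementation of the undefined `stem_words` helper.
stemmer = PorterStemmer()

def stem_words(text: str) -> str:
    # Split on whitespace and reduce each word to its Porter stem.
    return " ".join(stemmer.stem(word) for word in text.split())

test = 'walk walks walking walked'
print(stem_words(test))  # walk walk walk walk
```

Unlike the lemmatization done by spaCy, stemming only strips suffixes, so it needs no language model and runs much faster.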

### Train-Test datasets

In [None]:
!pip install simple_nlp_library

In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df,test_size=0.2,shuffle=True)
# val, test = train_test_split(test,test_size=0.5,shuffle=True)

train = train.reset_index(drop=True)
# val = val.reset_index(drop=True)
test = test.reset_index(drop=True)

train.shape,test.shape

In [None]:
from simple_nlp_library import embeddings, preprocessing

stop_words = preprocessing.stop_words()
vectors = embeddings.vectors()

X_train = train['review'].tolist()
X_train = [embeddings.tokens_vector(vectors, preprocessing.semantic_tokens(stop_words, x)) for x in X_train]
y_train = train['sentiment'].tolist()

X_test = test['review'].tolist()
X_test = [embeddings.tokens_vector(vectors, preprocessing.semantic_tokens(stop_words, x)) for x in X_test]
y_test = test['sentiment'].tolist()
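
The cell above turns each review into a single fixed-length vector. A common way to do this is to average the word vectors of the tokens; whether `embeddings.tokens_vector` works exactly this way is an assumption. The toy 3-dimensional vectors and the `mean_vector` helper below are made up for illustration:

```python
from typing import Dict, List

# Toy word vectors, invented for this example (real embeddings have
# many more dimensions).
toy_vectors: Dict[str, List[float]] = {
    "good": [1.0, 0.0, 2.0],
    "movie": [0.0, 1.0, 0.0],
    "bad": [-1.0, 0.0, -2.0],
}

def mean_vector(vectors: Dict[str, List[float]], tokens: List[str], dim: int = 3) -> List[float]:
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(component) / len(known) for component in zip(*known)]

print(mean_vector(toy_vectors, ["good", "movie", "unknown"]))  # [0.5, 0.5, 1.0]
```

Averaging discards word order, which is one plausible reason a classifier trained on such vectors can trail a bag-of-words model on this task.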

### Model-0: MLP

In [None]:
clf = MLPClassifier(hidden_layer_sizes=(25,), early_stopping=True, n_iter_no_change=20).fit(X_train, y_train)


clf.best_validation_score_

### Model-1: Naive Bayes 

In [None]:
X_train = train['review'].tolist()
# X_train = [embeddings.tokens_vector(vectors, preprocessing.semantic_tokens(stop_words, x)) for x in X_train]
y_train = train['sentiment'].tolist()

X_test = test['review'].tolist()
# X_test = [embeddings.tokens_vector(vectors, preprocessing.semantic_tokens(stop_words, x)) for x in X_test]
y_test = test['sentiment'].tolist()

In [None]:
nbpl = Pipeline([
    ('count_vectorizer', CountVectorizer()),
    ('naive_bayes', MultinomialNB())
])

nbpl.fit(X_train, y_train)
y_pred = nbpl.predict(X_test)

print(classification_report(y_test, y_pred))
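
To see what the `CountVectorizer` + `MultinomialNB` pipeline does on a small scale, here is a sketch on four made-up training sentences (the data and the `toy_*` names are illustrative only, not taken from the IMDB set):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data: 1 = positive, 0 = negative.
toy_texts = [
    "great wonderful movie",
    "bad awful plot",
    "wonderful acting great",
    "awful boring bad",
]
toy_labels = [1, 0, 1, 0]

toy_pipeline = Pipeline([
    ("count_vectorizer", CountVectorizer()),  # text -> word-count matrix
    ("naive_bayes", MultinomialNB()),         # word counts -> class
])
toy_pipeline.fit(toy_texts, toy_labels)

# The learned vocabulary is the set of distinct words in the training texts.
vocabulary = sorted(toy_pipeline.named_steps["count_vectorizer"].vocabulary_)
print(vocabulary)

toy_pred = toy_pipeline.predict(["great movie", "boring plot"])
print(toy_pred.tolist())  # [1, 0]
```

Each new text is scored by how often its words occurred in each class during training, which is why the words "great" and "boring" decide the two predictions here.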

### Model-2: Random Forest Classifier

In [None]:
X_train = train['review'].tolist()
# X_train = [embeddings.tokens_vector(vectors, preprocessing.semantic_tokens(stop_words, x)) for x in X_train]
y_train = train['sentiment'].tolist()

X_test = test['review'].tolist()
# X_test = [embeddings.tokens_vector(vectors, preprocessing.semantic_tokens(stop_words, x)) for x in X_test]
y_test = test['sentiment'].tolist()

In [None]:
rfpl = Pipeline([
    ('count_vectorizer', CountVectorizer()),
    ('random_classifier', RandomForestClassifier(n_estimators=50, criterion='entropy'))
])

rfpl.fit(X_train, y_train)

y_pred = rfpl.predict(X_test)

print(classification_report(y_test, y_pred))
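
`classification_report` prints precision, recall and F1 per class. As a reminder of what these columns mean, here is a plain-Python sketch for the positive class on made-up labels (the `binary_metrics` helper is hypothetical and not part of scikit-learn):

```python
from typing import Dict, List

def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels: 4 positives and 4 negatives.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0]
metrics_demo = binary_metrics(y_true_demo, y_pred_demo)
print(metrics_demo)
```

With 3 true positives, 1 false positive and 1 false negative, all four values come out at 0.75. On a balanced dataset like this one (25,000 reviews per class), accuracy is a reasonable headline number.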

## Conclusion:
- MLP showed the worst accuracy: 78% (pre-processing-v1), 76% (pre-processing-v2)
- Random Forest and Naive Bayes showed the same accuracy: 86% (pre-processing-v1), 84%-85% (pre-processing-v2)
- text pre-processing using v1: 67 sec
- text pre-processing using v2 [spaCy]: 1237.54 sec
- spaCy pre-processing (v2) resulted in less accurate classification than the alternative pre-processing-v1