# Analysis of Comic Book Reviews Using Deep NLP

The dataset comes from the UCSD Book Graph site: https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home

The data were collected in late 2017 from goodreads.com. Only users' public shelves were scraped, i.e. shelves that anyone can see on the web without logging in. User IDs and review IDs are anonymized.

The following papers need to be mentioned for fair use:
1. Mengting Wan, Julian McAuley, "Item Recommendation on Monotonic Behavior Chains", in RecSys'18.
2. Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, "Fine-Grained Spoiler Detection from Large-Scale Review Corpora", in ACL'19.

### Importing the data 

The file has a JSON extension; JSON is a standard data interchange format. Each line of the file holds one JSON object, so we parse it line by line with Python's json library.

In [2]:
import json
import pandas as pd
import numpy as np
import tensorflow as tf
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

[nltk_data] Downloading package stopwords to /Users/user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
def load_data(file):
    # The file stores one JSON object per line, so parse it line by line.
    reviews = []
    with open(file, 'r') as f:  # the context manager closes the file for us
        for line in f:
            reviews.append(json.loads(line))
    return reviews

filename = '/Users/user/Desktop/Python/ANN/NLP/goodreads_reviews_comics_graphic.json'
dataset = load_data(filename)
# Keep only the review text and the rating of each record.
dataset_min = [(line['review_text'], line['rating']) for line in dataset]

# We take the first 1000 rows to speed up the process
dataset = pd.DataFrame(dataset_min[:1000], columns=['Review', 'Rating'])
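Since only the first 1000 rows are used, the file does not have to be parsed in full. The following is a minimal sketch of that idea, not the notebook's actual loader: `load_first_n` and `sample_lines` are hypothetical names, and the in-memory list stands in for the real file.

```python
import json
from itertools import islice

def load_first_n(lines, n):
    """Parse at most n JSON records from an iterable of lines (e.g. an open file)."""
    records = []
    for line in islice(lines, n):  # stops reading after n lines
        record = json.loads(line)
        records.append((record["review_text"], record["rating"]))
    return records

# Hypothetical in-memory sample standing in for the real file.
sample_lines = [
    '{"review_text": "Great art.", "rating": 5}',
    '{"review_text": "Not for me.", "rating": 2}',
    '{"review_text": "Solid story.", "rating": 4}',
]
first_two = load_first_n(sample_lines, 2)
print(first_two)
```

With the real data, an open file object could be passed in place of `sample_lines`, because iterating over a file yields its lines one at a time.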

### Cleaning the text 

The first step in Natural Language Processing is cleaning the text, which has a significant effect on model predictions. The following stages are covered:
1. Removing numbers, punctuation and trailing whitespace,
2. Converting all uppercase letters to lowercase,
3. Removing as many meaningless words as possible, based on the 'stopwords' dataset,
4. Stemming - reducing inflected (or sometimes derived) words to their root form.

Unfortunately, the English stopwords list contains the word 'not', which is important for determining how good a review is, so we need to remove it from the list.
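To see why keeping 'not' matters, here is a small sketch that cleans one sentence with and without 'not' in the stopword list. `SAMPLE_STOPWORDS` and `clean` are hypothetical names, the tiny stopword set is an assumption standing in for NLTK's much longer list, and stemming is left out to keep the example dependency-free.

```python
import re

# Tiny illustrative stopword set (an assumption; NLTK's English list is much longer).
SAMPLE_STOPWORDS = {"this", "is", "a", "the", "not"}

def clean(text, stopwords):
    """Keep letters only, lowercase, and drop the given stopwords."""
    letters_only = re.sub("[^a-zA-Z]", " ", text).lower()
    return " ".join(w for w in letters_only.split() if w not in stopwords)

sentence = "This is NOT a good comic!"
with_not_removed = clean(sentence, SAMPLE_STOPWORDS)
with_not_kept = clean(sentence, SAMPLE_STOPWORDS - {"not"})

print(with_not_removed)  # good comic
print(with_not_kept)     # not good comic
```

When 'not' is filtered out, a negative sentence is reduced to the same tokens as a positive one, which is exactly the information a sentiment model needs.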

In [4]:
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')

In [5]:
reviews_all = []
ps = PorterStemmer()  # create the stemmer once instead of on every iteration
for text in dataset['Review']:
    #1. Removing redundant signs
    review = re.sub('[^a-zA-Z]', ' ', text)
    review = review.rstrip()
    #2. Changing uppercase letters
    review = review.lower()
    #3 & 4. Stopwords and stemming
    review = review.split()
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    review = ' '.join(review)
    reviews_all.append(review)

### Bag-of-Words 

Feature extraction with the bag-of-words (BoW) model.

In [6]:
cv = CountVectorizer()
X = cv.fit_transform(reviews_all).toarray()
y = dataset.iloc[:,-1].values
len(X[0])

6373

The CountVectorizer class has a 'max_features' parameter that caps the vocabulary at the most frequent terms in the corpus, and therefore limits the number of columns in the resulting matrix. Many of the columns above correspond to rare or meaningless words that will not influence the model. To reduce the number of columns we can set max_features manually.
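The effect of `max_features` can be checked on a toy corpus. This is a minimal sketch; the three-sentence `corpus` is made up for illustration and is not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "good" occurs 3 times, "comic" twice, every other word once.
corpus = ["good comic good art", "bad comic", "good story"]

full = CountVectorizer().fit(corpus)
capped = CountVectorizer(max_features=2).fit(corpus)

full_vocab = sorted(full.vocabulary_)      # all distinct words
capped_vocab = sorted(capped.vocabulary_)  # only the 2 most frequent words
capped_shape = capped.transform(corpus).shape

print(full_vocab)
print(capped_vocab)
print(capped_shape)
```

Only the two most frequent terms survive the cap, so the matrix shrinks from five columns to two while keeping one row per document.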

In [7]:
max_f = 6000
cv = CountVectorizer(max_features = max_f)
X = cv.fit_transform(reviews_all).toarray()
y = dataset.iloc[:,-1].values

Encoding categorical data.

Usually OneHotEncoder would be used for multi-class classification, but here I want to reduce the multi-class ratings to a binary label. My goal is to predict not which rating a review will get but whether it will be negative or positive. That is why ratings from 0 to 2 are mapped to 0 (bad) and ratings from 3 to 5 are mapped to 1 (good).

In [8]:
y = np.where(y >= 3, 1, 0)  # ratings 3, 4, 5 -> 1 (good); ratings 0, 1, 2 -> 0 (bad)
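The mapping can be verified on a small array, and counting the resulting labels shows how balanced the two classes are. The `ratings` array below is hypothetical and only illustrates the transformation.

```python
import numpy as np

# Hypothetical ratings covering every value from 0 to 5.
ratings = np.array([0, 1, 2, 3, 4, 5, 5, 4])

# Ratings of 3 or more become 1 (good), everything else becomes 0 (bad).
labels = np.where(ratings >= 3, 1, 0)

# Number of examples per class: index 0 is "bad", index 1 is "good".
class_counts = np.bincount(labels, minlength=2)

print(labels)
print(class_counts)
```

Running the same `np.bincount` call on the real `y` would show how many of the 1000 reviews fall into each class.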

### Creating Artificial Neural Network

Splitting the dataset into training set and test set.

Since all features are word counts measured on the same scale, we do not have to scale them.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

For multi-class output we would need one output node per class, a softmax activation and categorical cross-entropy loss. For binary output a single node with a sigmoid activation and binary cross-entropy loss is sufficient.
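The two setups are closely related: for two classes, a softmax over the logits `[0, z]` gives the same probability as a sigmoid applied to `z`, and the two losses coincide. The sketch below checks this numerically with NumPy; the logit value and the helper functions are chosen for illustration and are not taken from the model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

z = 1.5  # hypothetical logit for the positive class
p_sigmoid = sigmoid(z)
p_softmax = softmax(np.array([0.0, z]))[1]

y_true = 1
# Binary cross-entropy for a single example.
bce = -(y_true * np.log(p_sigmoid) + (1 - y_true) * np.log(1 - p_sigmoid))
# Categorical cross-entropy for the one-hot target [0, 1].
cce = -np.log(p_softmax)

print(np.isclose(p_sigmoid, p_softmax), np.isclose(bce, cce))
```

This is why a single sigmoid node is enough for the binary case: a second output node would add no extra information.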

In [10]:
classifier = Sequential()
# Dense expects an integer number of units, so use integer division.
classifier.add(Dense(max_f // 2, activation = 'relu'))
classifier.add(Dense(max_f // 4, activation = 'relu'))
# Single sigmoid node for the binary (good/bad) output.
classifier.add(Dense(1, activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
classifier.fit(X_train, y_train, batch_size = 32, epochs = 100)

Epoch 1/100
...
Epoch 100/100


<tensorflow.python.keras.callbacks.History at 0x1a33107210>

In [11]:
y_pred = classifier.predict(X_test)
y_pred = (y_pred>0.5)

In [12]:
acc = accuracy_score(y_test,y_pred)
acc

0.83

### The accuracy score on the test set is 83%, which suggests that the model fits the data reasonably well. A moderate test accuracy alone does not rule out overfitting; comparing it with the training accuracy would confirm that.
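One way to put an accuracy figure in context is to compare it with the share of the majority class, which is the accuracy of a model that always predicts the most common label. The sketch below uses hypothetical labels and predictions, not the notebook's results.

```python
import numpy as np

# Hypothetical test labels and predictions, for illustration only.
y_true = np.array([1, 1, 1, 1, 0, 1, 0, 1, 1, 1])
y_hat = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1])

# Fraction of predictions that match the true labels.
accuracy = np.mean(y_true == y_hat)

# Accuracy of always predicting the most common class.
majority_share = max(np.mean(y_true == 1), np.mean(y_true == 0))

print(accuracy, majority_share)
```

If the model's accuracy is only slightly above the majority share, most of the score comes from class imbalance and not from what the model has learned.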