# Detect polarity in Google app reviews with TensorFlow
This work is part of a collection of practice sets called [NLP Starter](https://github.com/jamiemorales/project-nlp-starter).
It aims to help you get started fast and gain a high-level understanding of the fundamental steps in the NLP lifecycle early on.
After completing it, you will have built an intuition for the NLP lifecycle.

## Step 0: Understand the problem
The goal is to classify a Google app review as positive or negative.

## Step 1: Set-up and understand data
In this step, we lay out the tools we will need to solve the problem identified in the previous step. We then inspect our data sources and explore the data itself to understand it well enough for preprocessing and modeling.

In [160]:
# Set-up libraries
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
from tensorflow import keras

In [162]:
# Load data
df = pd.read_csv('../00-Datasets/google-play-store-app-reviews/googleplaystore_user_reviews.csv')
df.head()

| | App | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity |
|---|---|---|---|---|---|
| 0 | 10 Best Foods for You | I like eat delicious food. That's I'm cooking ... | Positive | 1.0 | 0.533333 |
| 1 | 10 Best Foods for You | This help eating healthy exercise regular basis | Positive | 0.25 | 0.288462 |
| 2 | 10 Best Foods for You | NaN | NaN | NaN | NaN |
| 3 | 10 Best Foods for You | Works great especially going grocery store | Positive | 0.4 | 0.875 |
| 4 | 10 Best Foods for You | Best idea us | Positive | 1.0 | 0.3 |


In [163]:
# Look at some details
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 5 columns):
App                       64295 non-null object
Translated_Review         37427 non-null object
Sentiment                 37432 non-null object
Sentiment_Polarity        37432 non-null float64
Sentiment_Subjectivity    37432 non-null float64
dtypes: float64(2), object(3)
memory usage: 2.5+ MB


In [164]:
# Check for missing values
df.isna().sum()

App                           0
Translated_Review         26868
Sentiment                 26863
Sentiment_Polarity        26863
Sentiment_Subjectivity    26863
dtype: int64

In [165]:
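# Look at some missing records
# Note: this cell-wise mask turns every non-missing value into NaN.
# To list the rows that contain a missing value, use df[df.isna().any(axis=1)].
df[df.isna()].head(10)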

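| | App | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity |
|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN |
| 6 | NaN | NaN | NaN | NaN | NaN |
| 7 | NaN | NaN | NaN | NaN | NaN |
| 8 | NaN | NaN | NaN | NaN | NaN |
| 9 | NaN | NaN | NaN | NaN | NaN |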
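
Indexing a DataFrame with a boolean DataFrame, as in `df[df.isna()]`, keeps the original shape and blanks out every cell where the mask is False, which is why every cell in the output above is missing. A minimal sketch of a row-wise filter instead, using a small made-up frame (`toy` is hypothetical, not the review data):

```python
import pandas as pd

# Hypothetical toy frame standing in for the review data
toy = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Translated_Review': ['Great', None, 'Slow'],
    'Sentiment': ['Positive', None, 'Negative'],
})

# Keep only the rows in which at least one column is missing
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)

# Cell-wise masking, for comparison: same shape, non-missing cells become NaN
cellwise = toy[toy.isna()]
print(cellwise)
```

The row-wise version shows the `App` name alongside the empty review fields, which makes it easier to see what the incomplete records look like.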


In [166]:
# Look at breakdown of label
df['Sentiment'].value_counts()

Positive    23998
Negative     8271
Neutral      5163
Name: Sentiment, dtype: int64

## Step 2: Prepare data and understand some more
In this step, we transform the data so that the neural network can process it. Real-world datasets are complex and messy, but most of the datasets we work on in this series require minimal preparation.

Recall that we have missing values. It turns out most of these records have NaN across the board. Let's remove them.

In [167]:
# Remove missing records
df.dropna(inplace=True)

print(df.isna().sum())
print(df.info())

App                       0
Translated_Review         0
Sentiment                 0
Sentiment_Polarity        0
Sentiment_Subjectivity    0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 37427 entries, 0 to 64230
Data columns (total 5 columns):
App                       37427 non-null object
Translated_Review         37427 non-null object
Sentiment                 37427 non-null object
Sentiment_Polarity        37427 non-null float64
Sentiment_Subjectivity    37427 non-null float64
dtypes: float64(2), object(3)
memory usage: 1.7+ MB
None


In [168]:
# Remove neutral sentiments
df.drop(df[df['Sentiment']=='Neutral'].index, inplace=True)

print(df.info())
print(df['Sentiment'].value_counts())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32269 entries, 0 to 64230
Data columns (total 5 columns):
App                       32269 non-null object
Translated_Review         32269 non-null object
Sentiment                 32269 non-null object
Sentiment_Polarity        32269 non-null float64
Sentiment_Subjectivity    32269 non-null float64
dtypes: float64(2), object(3)
memory usage: 1.5+ MB
None
Positive    23998
Negative     8271
Name: Sentiment, dtype: int64
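
With 23,998 positive and 8,271 negative reviews left, the classes are imbalanced, so raw accuracy is easier to judge against a majority-class baseline. A quick worked calculation from the counts printed above (the helper name `majority_baseline` is made up for illustration):

```python
def majority_baseline(counts):
    """Accuracy of always predicting the most frequent class."""
    total = sum(counts.values())
    return max(counts.values()) / total

# Counts taken from the value_counts() output above
counts = {'Positive': 23998, 'Negative': 8271}
baseline = majority_baseline(counts)
print(f'Majority-class baseline accuracy: {baseline:.4f}')
```

A model that always answers "Positive" would already score about 74%, so any accuracy figure should be compared against that number rather than against 50%.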


In [169]:
# Split data into 80% train 20% validation
sentences = df['Translated_Review']
labels = np.where(df['Sentiment'] == 'Positive', 1, 0)

train_sentences, val_sentences, train_labels, val_labels = train_test_split(sentences, labels, test_size=0.2, random_state=0)

print(train_sentences.shape)
print(train_labels.shape)
print(val_sentences.shape)
print(val_labels.shape)

(25815,)
(25815,)
(6454,)
(6454,)


In [170]:
# Tokenize and pad
vocab_size = 10000
oov_token = '<OOV>'
max_length = 100
padding_type = 'post'
trunc_type = 'post'

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

val_sequences = tokenizer.texts_to_sequences(val_sentences)
val_padded = pad_sequences(val_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
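
To make the two transformations concrete, here is a simplified pure-Python sketch of what tokenizing and post-padding do. It is an approximation for illustration, not the Keras implementation: the helpers `build_word_index`, `to_sequences` and `pad_post` are hypothetical, and the sketch only lowercases and splits on whitespace. It does follow the same conventions as the cell above: index 0 is reserved for padding, the out-of-vocabulary token gets index 1, and more frequent words get smaller indices.

```python
from collections import Counter

def build_word_index(texts, oov_token='<OOV>'):
    """Map words to integer ids: OOV token is 1, then words by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def to_sequences(texts, word_index, num_words, oov_token='<OOV>'):
    """Replace each word by its id; unknown or too-rare words become the OOV id."""
    oov_id = word_index[oov_token]
    sequences = []
    for text in texts:
        ids = [word_index.get(word, oov_id) for word in text.lower().split()]
        sequences.append([i if i < num_words else oov_id for i in ids])
    return sequences

def pad_post(sequences, maxlen):
    """Truncate at the end, then pad with zeros at the end, up to maxlen."""
    return [seq[:maxlen] + [0] * (maxlen - len(seq[:maxlen])) for seq in sequences]

# Made-up sentences: the index is built on "training" text only
train = ['great app', 'great game but slow']
word_index = build_word_index(train)

# Unseen words ("new") map to the OOV id; short sequences are zero-padded
new_texts = ['great new app', 'slow']
padded = pad_post(to_sequences(new_texts, word_index, num_words=100), maxlen=4)
print(word_index)
print(padded)
```

Building the index from the training sentences only, as the notebook does, means validation words the model never saw are mapped to the OOV id instead of leaking into the vocabulary.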

## Step 3: Build, train, and evaluate neural network
First, we design the neural network, e.g., sequence of layers and activation functions. 

Second, we train the neural network: we iteratively make a guess, calculate how accurate that guess is, and improve it. The first guess is initialised with random values. The quality of each guess is measured with the loss function. The optimizer then uses that measurement to produce the next, improved guess.

Lastly, we apply the neural network to previously unseen data and evaluate the results.
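
The guess, measure, improve cycle can be sketched without any framework. The toy example below is an illustration under simplifying assumptions, not the network built in this notebook: it fits a single weight and bias to four made-up points, uses binary cross-entropy as the loss, and lets plain gradient descent stand in for the optimizer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(w, b, xs, ys):
    """Mean binary cross-entropy of the predictions sigmoid(w*x + b)."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(xs)

# Made-up 1-D data: negative inputs are class 0, positive inputs are class 1
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = 0.0, 0.0        # initial guess (fixed here, random in practice)
learning_rate = 0.5
losses = []

for epoch in range(20):
    # How bad is the current guess?
    losses.append(bce_loss(w, b, xs, ys))
    # Gradient of the mean loss with respect to w and b
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / len(xs)
    # Optimizer step: move the guess against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f'first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}')
```

The loss shrinks from one pass to the next, which is the same pattern to look for in the per-epoch training log of a real model.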

In [171]:
# Build and train neural network
embedding_dim = 16
num_epochs = 5
batch_size = 100

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_padded, train_labels, batch_size=batch_size, epochs=num_epochs, verbose=2)

Train on 25815 samples
Epoch 1/5
25815/25815 - 2s - loss: 0.5791 - accuracy: 0.7388
Epoch 2/5
25815/25815 - 1s - loss: 0.5065 - accuracy: 0.7522
Epoch 3/5
25815/25815 - 1s - loss: 0.3447 - accuracy: 0.8457
Epoch 4/5
25815/25815 - 1s - loss: 0.2218 - accuracy: 0.9215
Epoch 5/5
25815/25815 - 1s - loss: 0.1615 - accuracy: 0.9468
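
The size of this network can be worked out by hand from the hyperparameters in the cell above. The sketch below assumes the standard parameter counts for these layer types: an embedding table of `vocab_size * embedding_dim` weights, and `inputs * units + units` for each dense layer. The pooling layer only averages, so it has no trainable weights. The helper `dense_params` is made up for illustration.

```python
vocab_size = 10000
embedding_dim = 16

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

embedding_params = vocab_size * embedding_dim      # one vector per vocabulary entry
pooling_params = 0                                 # averaging has nothing to learn
hidden_params = dense_params(embedding_dim, 24)    # Dense(24, relu)
output_params = dense_params(24, 1)                # Dense(1, sigmoid)

total = embedding_params + pooling_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total)
```

Nearly all of the weights sit in the embedding table; the two dense layers together contribute only a few hundred.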


In [184]:
# Apply neural network
val_loss, val_accuracy = model.evaluate(val_padded, val_labels)
print('Val loss: {}, Val accuracy: {}'.format(val_loss, val_accuracy*100), '\n')

quick_test_sentence = [
    'bombarded with ads makes the app very slow i wish there were less ads',
    'super fun and addictive game get it if you like farming games',
    'very useful app on the go dont need to bring laptop around'
]

quick_test_sequences = tokenizer.texts_to_sequences(quick_test_sentence)
quick_test_padded = pad_sequences(quick_test_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
quick_test_sentiments = model.predict(quick_test_padded)


for i in range(len(quick_test_sentiments)):
    print('"' + quick_test_sentence[i] + '"')
    if quick_test_sentiments[i] > .75:
        print('--> Liked app.')
    else:
        print('--> Disliked app.')

Val loss: 0.19054529411636864, Val accuracy: 92.57824420928955 

"bombarded with ads makes the app very slow i wish there were less ads"
--> Disliked app.
"super fun and addictive game get it if you like farming games"
--> Liked app.
"very useful app on the go dont need to bring laptop around"
--> Liked app.
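
The loop above treats a review as liked only when the predicted probability exceeds 0.75, which is stricter than the usual 0.5 cut-off for a sigmoid output. A small sketch of how the threshold changes the decision, using made-up probabilities rather than real model output (`label_from_probability` is a hypothetical helper):

```python
def label_from_probability(p, threshold=0.5):
    """Turn a sigmoid output into a like/dislike label."""
    return 'Liked app.' if p > threshold else 'Disliked app.'

# Made-up probabilities, not actual predictions from the model
example_probabilities = [0.10, 0.60, 0.90]

for p in example_probabilities:
    print(p, label_from_probability(p, 0.5), label_from_probability(p, 0.75))
```

A borderline score such as 0.60 counts as liked at 0.5 but as disliked at 0.75, so a higher threshold yields fewer but more confident "liked" calls.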


## More

If you found this work interesting, you might like:

* Machine Learning Starter

* Deep Learning Starter

* Natural Language Processing Starter

You can find more at [github.com/jamiemorales](https://github.com/jamiemorales).

Datasets are not mine. List of sources: [datasets and sources]()

This work is shared under the Creative Commons Attribution-ShareAlike 4.0 license: https://creativecommons.org/licenses/by-sa/4.0/
