# Customer Review Analysis

Recently, many of my friends have become interested in developing their own apps in the hope of generating some cash on the side. However, many of them worry about how to properly evaluate their app reviews, and whether "ratings" on the app store truly capture the overall sentiment expressed in the reviews that users leave. I decided to develop a machine learning model trained on 10k positive reviews and 10k negative reviews to provide a basis for review sentiment analysis.

Dataset URL: https://github.com/amitt001/Android-App-Reviews-Dataset

## Loading in Dataset

As mentioned above, I am using two datasets and merging them together: one with 10,000 positive reviews, and one with 10,000 negative reviews.

In [23]:
import pandas as pd

positive = pd.read_csv("positive10k.csv")
positive["Type"] = "Positive"
display(positive)

Unnamed: 0,Review,Type
0,Very simple and effective way for new words fo...,Positive
1,Fh d Fcfatgv,Positive
2,My son loved it. It is easy even though my son...,Positive
3,Brilliant A brilliant app that is challenging ...,Positive
4,Good I have gotten several updates and new gam...,Positive
...,...,...
9926,LOL well does not work K so you can get a lase...,Positive
9927,Mrs. Tartaglia Yes i like playing this game i ...,Positive
9928,Awesome Very helpful and educational.,Positive
9929,Drives me bats Love it. Makes me wonky but kee...,Positive


In [24]:
negative = pd.read_csv("negative10k.csv")
negative["Type"] = "Negative"
display(negative)

Unnamed: 0,Review,Type
0,Ya no I'd ur going make an app make sure it works,Negative
1,STUPID ADDS THE STUPID ADDS POP UP EVERY 5 SEC...,Negative
2,WTF Needs way more detail. 1. Sun should go do...,Negative
3,I love this game. But all of a sudden it when ...,Negative
4,Mark Now I can't send photo n images pls fix i...,Negative
...,...,...
9712,Freezes too often It freezes too often. There ...,Negative
9713,Won't let me play It won't let me play as gues...,Negative
9714,Garbage App is broken Used to work well. Now i...,Negative
9715,HATE IT HOW ARE WE SUPPOSE TO HAVE FUN WHEN IT...,Negative


In [67]:
reviews = pd.concat([positive, negative])
display(reviews)

Unnamed: 0,Review,Type
0,Very simple and effective way for new words fo...,Positive
1,Fh d Fcfatgv,Positive
2,My son loved it. It is easy even though my son...,Positive
3,Brilliant A brilliant app that is challenging ...,Positive
4,Good I have gotten several updates and new gam...,Positive
...,...,...
9712,Freezes too often It freezes too often. There ...,Negative
9713,Won't let me play It won't let me play as gues...,Negative
9714,Garbage App is broken Used to work well. Now i...,Negative
9715,HATE IT HOW ARE WE SUPPOSE TO HAVE FUN WHEN IT...,Negative
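
The merged output above still ends at row 9715, which suggests that each frame kept its own row labels after `pd.concat`. The sketch below uses tiny stand-in frames (hypothetical rows, not the real CSV files) to show how `ignore_index=True` renumbers the rows so that every label is unique. It is an optional variation, not the code used in this notebook.

```python
import pandas as pd

# Tiny stand-ins for the two CSV files (hypothetical rows, not the real data).
positive = pd.DataFrame({"Review": ["Love it", "Great app"]})
negative = pd.DataFrame({"Review": ["Keeps crashing", "Too many ads", "Broken"]})
positive["Type"] = "Positive"
negative["Type"] = "Negative"

# Default concat keeps each frame's own row labels, so 0 and 1 appear twice.
stacked = pd.concat([positive, negative])
has_duplicate_labels = bool(stacked.index.duplicated().any())

# ignore_index=True renumbers the rows 0..n-1 across both frames.
reviews = pd.concat([positive, negative], ignore_index=True)
label_counts = reviews["Type"].value_counts().to_dict()
```

Unique row labels matter mostly for label-based selection such as `reviews.loc[0]`, which would otherwise return one row from each source frame.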


## Data Preprocessing / Cleaning

Text analysis is one of the parts of machine learning that needs to be handled carefully, since the data is almost entirely string-based and most models expect numeric input rather than raw strings. We will clean the data by standardizing all rows:

- Remove all punctuation
    - We don't need periods, semicolons, colons, commas, etc., because they add little to the overall meaning of each sentence.
- Lowercase everything
    - Some people write in all caps, and although that may hint that a review is more negative, we are more focused on the overall meaning of the sentence, not its casing.
- Remove stop words
    - Stop words are very common words that add little to the meaning of a sentence (aka filler words), for example "a", "and", and "but".
- Word Tokenization
    - Splits each review into individual words so the model can treat them as separate entities rather than as one long string.
- Lemmatization
    - People's usage of tenses varies: sometimes people write a review in present tense, sometimes in past tense, etc. Reducing each word to its base form (for example, "games" to "game") standardizes the data, so the same word in different forms counts as one feature.
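
To make the steps above concrete, here is a minimal sketch that runs the first four of them on a single made-up review. `STOP_WORDS` and `clean_review` are hypothetical names introduced for illustration, the stop list is a tiny stand-in for NLTK's English list, and tokenization is simplified to a whitespace split. Lemmatization is left out because it needs the WordNet data.

```python
import string

# Hypothetical miniature stop list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"a", "and", "but", "is", "it", "the", "this", "to"}

def clean_review(text):
    """Apply punctuation removal, lowercasing, tokenization and stop word removal."""
    # 1. Remove punctuation characters.
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # 2. Lowercase everything.
    lowered = no_punct.lower()
    # 3. Tokenize (plain whitespace split here, a simplification).
    tokens = lowered.split()
    # 4. Drop stop words.
    return [tok for tok in tokens if tok not in STOP_WORDS]

cleaned = clean_review("This app is GREAT, but it crashes a lot!")
# cleaned == ["app", "great", "crashes", "lot"]
```

Note how the all-caps "GREAT" and the lowercase "great" end up as the same token, which is the point of standardizing every row.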
    

In [68]:
# Punctuation removal and lowercase everything

import string

def remove_punctuation(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

reviews['Review'] = reviews['Review'].apply(remove_punctuation).str.lower()
display(reviews)

Unnamed: 0,Review,Type
0,very simple and effective way for new words fo...,Positive
1,fh d fcfatgv,Positive
2,my son loved it it is easy even though my son ...,Positive
3,brilliant a brilliant app that is challenging ...,Positive
4,good i have gotten several updates and new gam...,Positive
...,...,...
9712,freezes too often it freezes too often there i...,Negative
9713,wont let me play it wont let me play as guest ...,Negative
9714,garbage app is broken used to work well now it...,Negative
9715,hate it how are we suppose to have fun when it...,Negative
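
The loop above calls `str.replace` once per punctuation character. As an alternative sketch, `str.translate` with a deletion table removes all of them in a single pass and should give the same result. `PUNCT_TABLE` and the two function names are hypothetical, and the sample sentence is made up for illustration.

```python
import string

# Translation table that maps every ASCII punctuation character to None (delete).
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text):
    """Delete all ASCII punctuation in one pass over the string."""
    return text.translate(PUNCT_TABLE)

def remove_punctuation_loop(text):
    """The loop-based approach from the cell above, kept for comparison."""
    for punctuation in string.punctuation:
        text = text.replace(punctuation, "")
    return text

sample = "Won't let me play: it FREEZES, too often!!!"
result = remove_punctuation_fast(sample)
# result == "Wont let me play it FREEZES too often"
```

Both versions only cover the ASCII characters in `string.punctuation`, so curly quotes or emoji in a review would pass through untouched. On a pandas column, the same table can be passed to `Series.str.translate`.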


In [69]:
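# Remove stopwords

import nltk
from nltk.corpus import stopwords

# Requires the NLTK stopwords corpus: run nltk.download('stopwords') once if it is missing
stop = set(stopwords.words('english'))  # a set gives fast membership checks

reviews['Review'] = reviews['Review'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))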
display(reviews)

Unnamed: 0,Review,Type
0,simple effective way new words kids,Positive
1,fh fcfatgv,Positive
2,son loved easy even though son first grade hig...,Positive
3,brilliant brilliant app challenging great fun ...,Positive
4,good gotten several updates new games help alot,Positive
...,...,...
9712,freezes often freezes often way exit game hitt...,Negative
9713,wont let play wont let play guest even created...,Negative
9714,garbage app broken used work well recognizes h...,Negative
9715,hate suppose fun tells us,Negative
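
In the output above, "wont" survives the stop word filter. One likely reason is ordering: punctuation was stripped first, so "won't" became "wont" and no longer matches a stop list entry that still carries its apostrophe. The sketch below illustrates this with a hypothetical miniature stop list (not NLTK's) and shows one possible fix, which is to normalize the stop list the same way as the reviews. All names here are assumptions for illustration.

```python
import string

# Hypothetical miniature stop list that stores contractions with apostrophes.
STOP_WORDS = {"me", "it", "won't", "can't"}

def strip_punctuation(text):
    """Delete ASCII punctuation, as in the punctuation removal step."""
    return text.translate(str.maketrans("", "", string.punctuation))

# Normalize the stop list the same way as the reviews so the two still match.
NORMALIZED_STOP_WORDS = {strip_punctuation(word) for word in STOP_WORDS}

def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated word found in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

cleaned_text = strip_punctuation("Won't let me play").lower()
kept_contraction = remove_stop_words(cleaned_text, STOP_WORDS)
dropped_contraction = remove_stop_words(cleaned_text, NORMALIZED_STOP_WORDS)
# kept_contraction == "wont let play"
# dropped_contraction == "let play"
```

Whether dropping such contractions is desirable is a separate question: words like "wont" or "cant" may carry negative sentiment, so keeping them could be a reasonable choice as long as it is a deliberate one.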


In [70]:
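# Tokenization
# nltk.word_tokenize relies on NLTK's Punkt tokenizer data, which may need a one-time nltk.download()

def tokenize(text):
    # Split a cleaned review string into a list of word tokens
    return nltk.word_tokenize(text)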

reviews['Review'] = reviews['Review'].apply(tokenize)
display(reviews)

Unnamed: 0,Review,Type
0,"[simple, effective, way, new, words, kids]",Positive
1,"[fh, fcfatgv]",Positive
2,"[son, loved, easy, even, though, son, first, g...",Positive
3,"[brilliant, brilliant, app, challenging, great...",Positive
4,"[good, gotten, several, updates, new, games, h...",Positive
...,...,...
9712,"[freezes, often, freezes, often, way, exit, ga...",Negative
9713,"[wont, let, play, wont, let, play, guest, even...",Negative
9714,"[garbage, app, broken, used, work, well, recog...",Negative
9715,"[hate, suppose, fun, tells, us]",Negative


In [71]:
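# Lemmatization

from nltk.stem import WordNetLemmatizer

# Requires the WordNet corpus: run nltk.download('wordnet') once if it is missing
lemmatizer = WordNetLemmatizer()
def lemmatize(text):
    # text is a list of tokens; the result is joined back into a single string.
    # lemmatize() treats each token as a noun by default (pos='n'), so plurals
    # are reduced but verb forms such as "loved" are left unchanged. It can also
    # mangle short words (the output below turns "us" into "u").
    return ' '.join([lemmatizer.lemmatize(w) for w in text])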

reviews['Review'] = reviews['Review'].apply(lemmatize)
display(reviews)

Unnamed: 0,Review,Type
0,simple effective way new word kid,Positive
1,fh fcfatgv,Positive
2,son loved easy even though son first grade hig...,Positive
3,brilliant brilliant app challenging great fun ...,Positive
4,good gotten several update new game help alot,Positive
...,...,...
9712,freeze often freeze often way exit game hittin...,Negative
9713,wont let play wont let play guest even created...,Negative
9714,garbage app broken used work well recognizes h...,Negative
9715,hate suppose fun tell u,Negative
