# Part 1.2 - Rule Based Sentiment Analysis

Three lexicon-based approaches are used to conduct sentiment analysis on app reviews:
- TextBlob
- VADER
- SentiWordNet

In [30]:
import numpy as np
import pandas as pd
import re # standard-library regular expressions; sufficient for the re.sub call used below


In [46]:
# import file
app_reviews = pd.read_csv('app_reviews.csv')
app_reviews.head()

Unnamed: 0,app_name,content
0,Syfe,1. The portfolio “card user interface” can be ...
1,Syfe,This hybrid app is quite buggy compared Stasha...
2,Syfe,The app and website is just a bunch of fake li...
3,Syfe,The app looks fantastic and it’s so fresh with...
4,Syfe,"Hi there,\n\nThe app checks for latest version..."


# 1. Data Preprocessing
Data preprocessing steps:

1. Cleaning the text
2. Tokenization
3. Enrichment – POS tagging
4. Stopwords removal
5. Obtaining the stem words

## 1. Cleaning the Text

Remove special characters and numbers from the review text using a regular expression.

In [47]:
# Define a function to clean the text
def clean(text):
    # Replace every run of non-alphabetic characters (digits, punctuation, whitespace) with a single space
    text = re.sub('[^A-Za-z]+', ' ', text)
    return text

# Cleaning the text in the review column
app_reviews['cleaned_reviews'] = app_reviews['content'].apply(clean)
app_reviews.head()

Unnamed: 0,app_name,content,cleaned_reviews
0,Syfe,1. The portfolio “card user interface” can be ...,The portfolio card user interface can be inco...
1,Syfe,This hybrid app is quite buggy compared Stasha...,This hybrid app is quite buggy compared Stasha...
2,Syfe,The app and website is just a bunch of fake li...,The app and website is just a bunch of fake li...
3,Syfe,The app looks fantastic and it’s so fresh with...,The app looks fantastic and it s so fresh with...
4,Syfe,"Hi there,\n\nThe app checks for latest version...",Hi there The app checks for latest version dur...
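
To make the effect of the pattern concrete, the sketch below applies the same substitution to a made-up review string. The sample text and the `clean_strict` helper are illustrative assumptions, not part of the notebook's pipeline; `clean_strict` shows one possible way to tolerate missing values and trim the spaces the substitution leaves at the edges.

```python
import re

def clean(text):
    # Same substitution as in the cell above: each run of non-letters becomes one space
    return re.sub('[^A-Za-z]+', ' ', text)

def clean_strict(text):
    # Hypothetical variant: treat non-strings (e.g. NaN from pandas) as empty text
    # and strip the leading/trailing spaces left by the substitution
    if not isinstance(text, str):
        return ''
    return re.sub('[^A-Za-z]+', ' ', text).strip()

sample = "1. It's great!! 5 stars"  # made-up review text
print(repr(clean(sample)))         # ' It s great stars'
print(repr(clean_strict(sample)))  # 'It s great stars'
```

Note that the apostrophe is replaced too, so contractions such as "It's" are split into "It" and "s", which matches the "it s so fresh" fragment in the output above.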


## 2. Tokenization

The nltk function word_tokenize() is used to perform word-level tokenization.

## 3. Enrichment – POS tagging

The nltk pos_tag function is used to perform part-of-speech (POS) tagging, converting each token into a tuple of the form (word, tag). POS tagging preserves the grammatical context of each word and is needed for lemmatization.

## 4. Stopwords removal
Stopwords are words that carry very little useful information, so we remove them as part of text preprocessing. nltk ships stopword lists for a number of languages; the English list is used here.
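
The filtering idea can be sketched in plain Python. The stopword set below is a small hand-picked sample for illustration only (nltk's English list is considerably longer), and `remove_stopwords` is a hypothetical helper name.

```python
# Illustrative subset only; not the full nltk English stopword list
STOPWORDS = {"the", "is", "a", "of", "and", "it", "so", "with", "just"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Compare in lower case so "The" and "the" are treated alike;
    # membership tests against a set are fast, so the set is built once up front
    return [tok for tok in tokens if tok.lower() not in stopwords]

tokens = ["The", "app", "is", "just", "a", "bunch", "of", "fake", "lies"]
print(remove_stopwords(tokens))  # ['app', 'bunch', 'fake', 'lies']
```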

In [49]:
import nltk
from nltk.tokenize import word_tokenize
# Download punkt resource if unavailable
# nltk.download('punkt') 

from nltk.tag import pos_tag
# Download averaged_perceptron_tagger resource if unavailable
# nltk.download('averaged_perceptron_tagger')

from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.corpus import wordnet
# Download wordnet resource if unavailable
# nltk.download('wordnet')

In [48]:
## POS tagger dictionary
# To obtain an accurate lemma, the WordNetLemmatizer requires POS tags in the form 'n', 'a', etc.
# The tags returned by pos_tag, however, are Penn Treebank tags such as 'NN', 'JJ', etc.
# To map the pos_tag output to wordnet tags, we create a dictionary, pos_dict, keyed by the first letter of the tag.
# Any tag that starts with J is mapped to wordnet.ADJ, any tag that starts with R is mapped to wordnet.ADV, and so on.
# Our tags of interest are noun, adjective, adverb and verb. Any other tag is mapped to None.
pos_dict = {'J':wordnet.ADJ, 'V':wordnet.VERB, 'N':wordnet.NOUN, 'R':wordnet.ADV}

stop_words = set(stopwords.words('english')) # build the stopword set once, not once per token

def token_stop_pos(text):
    tags = pos_tag(word_tokenize(text)) # tokenize the review and POS tag the tokens
    newlist = [] # collects (word, wordnet POS tag) tuples
    for word, tag in tags: # iterate through the (word, POS tag) tuples in tags
        if word.lower() not in stop_words: # skip stopwords
            newlist.append((word, pos_dict.get(tag[0]))) # map the tag to its wordnet form via pos_dict (None if unmapped)
    return newlist

app_reviews['pos_tagged'] = app_reviews['cleaned_reviews'].apply(token_stop_pos) # apply token_stop_pos function to the reviews
app_reviews.head()

Unnamed: 0,app_name,content,cleaned_reviews,pos_tagged
0,Syfe,1. The portfolio “card user interface” can be ...,The portfolio card user interface can be inco...,"[(portfolio, n), (card, n), (user, None), (int..."
1,Syfe,This hybrid app is quite buggy compared Stasha...,This hybrid app is quite buggy compared Stasha...,"[(hybrid, a), (app, n), (quite, r), (buggy, a)..."
2,Syfe,The app and website is just a bunch of fake li...,The app and website is just a bunch of fake li...,"[(app, n), (website, n), (bunch, n), (fake, a)..."
3,Syfe,The app looks fantastic and it’s so fresh with...,The app looks fantastic and it s so fresh with...,"[(app, n), (looks, v), (fantastic, a), (fresh,..."
4,Syfe,"Hi there,\n\nThe app checks for latest version...",Hi there The app checks for latest version dur...,"[(Hi, n), (app, n), (checks, n), (latest, a), ..."
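
The first-letter lookup in pos_dict can be illustrated without running the tagger. In the sketch below the wordnet constants are written as the plain strings they stand for in nltk ('a', 'v', 'n', 'r'), the (word, tag) pairs are hand-written in the shape pos_tag returns rather than real tagger output, and `to_wordnet_pos` is a hypothetical helper name.

```python
# In nltk: wordnet.ADJ == 'a', wordnet.VERB == 'v', wordnet.NOUN == 'n', wordnet.ADV == 'r'
pos_dict = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def to_wordnet_pos(treebank_tag):
    # Only the first letter of the Penn Treebank tag is used;
    # tags outside the four classes of interest fall through to None
    return pos_dict.get(treebank_tag[0])

# Hand-written pairs for illustration, not actual pos_tag output
tagged = [("app", "NN"), ("looks", "VBZ"), ("fantastic", "JJ"), ("so", "RB"), ("with", "IN")]
mapped = [(word, to_wordnet_pos(tag)) for word, tag in tagged]
print(mapped)
# [('app', 'n'), ('looks', 'v'), ('fantastic', 'a'), ('so', 'r'), ('with', None)]
```

Because the lookup uses only the first letter, all noun tags (NN, NNS, NNP) collapse to 'n' and all verb tags (VB, VBD, VBZ) collapse to 'v'.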


## 5. Obtaining the stem words
A stem is the part of a word that carries its lexical meaning. Two popular techniques for obtaining the root or stem of a word are stemming and lemmatization.

The key difference is that stemming often produces meaningless root words, because it simply chops characters off the end of the word. Lemmatization produces meaningful root words, but it requires the POS tag of each word.
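
The contrast can be sketched with toy stand-ins. `naive_stem` below is a deliberately crude suffix stripper (it is not the Porter algorithm or any nltk stemmer), and `LEMMAS` is a tiny hand-written table standing in for WordNet; both names are hypothetical and exist only to show why a lemmatizer needs the POS tag while a stemmer does not.

```python
def naive_stem(word):
    # Toy suffix stripper, NOT the Porter algorithm: drop the first matching ending
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Tiny hand-written lemma table keyed by (word, POS); a stand-in for WordNet
LEMMAS = {
    ("studies", "n"): "study",
    ("better", "a"): "good",
    ("looks", "v"): "look",
}

def toy_lemmatize(word, pos):
    # Unknown (word, POS) pairs are returned unchanged
    return LEMMAS.get((word, pos), word)

for word, pos in [("studies", "n"), ("better", "a"), ("looks", "v")]:
    print(word, "->", naive_stem(word), "|", toy_lemmatize(word, pos))
# studies -> stud | study
# better -> better | good
# looks -> look | look
```

The stemmer yields the non-word "stud" and cannot relate "better" to "good", whereas the dictionary-based lookup returns real words but only when it is told the part of speech.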

In [52]:
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize(pos_data):
    lemma_rew = " " # start from a single space (not an empty string)
    for word, pos in pos_data: # iterate through the (word, POS tag) tuples
        # keep the word as-is when it has no wordnet tag, otherwise lemmatize it with that tag
        lemma = word if not pos else wordnet_lemmatizer.lemmatize(word, pos=pos)
        lemma_rew = lemma_rew + " " + lemma
    return lemma_rew # note: the result starts with two spaces

app_reviews['Lemma'] = app_reviews['pos_tagged'].apply(lemmatize)
app_reviews.head()

Unnamed: 0,app_name,content,cleaned_reviews,pos_tagged,Lemma
0,Syfe,1. The portfolio “card user interface” can be ...,The portfolio card user interface can be inco...,"[(portfolio, n), (card, n), (user, None), (int...",portfolio card user interface inconvenient m...
1,Syfe,This hybrid app is quite buggy compared Stasha...,This hybrid app is quite buggy compared Stasha...,"[(hybrid, a), (app, n), (quite, r), (buggy, a)...",hybrid app quite buggy compare Stashaway How...
2,Syfe,The app and website is just a bunch of fake li...,The app and website is just a bunch of fake li...,"[(app, n), (website, n), (bunch, n), (fake, a)...",app website bunch fake lie Starting onboardi...
3,Syfe,The app looks fantastic and it’s so fresh with...,The app looks fantastic and it s so fresh with...,"[(app, n), (looks, v), (fantastic, a), (fresh,...",app look fantastic fresh different color muc...
4,Syfe,"Hi there,\n\nThe app checks for latest version...",Hi there The app checks for latest version dur...,"[(Hi, n), (app, n), (checks, n), (latest, a), ...",Hi app check late version launch alert user ...
