<h1><center>SENTIMENT ANALYSIS CODING SECTION</center></h1>

<h2><center>José Jaén Delgado</center></h2>

This Python Jupyter notebook walks through the code used to carry out the opinion mining project. From data engineering and cleaning to the actual machine learning modelling, the reader can inspect the operations that underlie our data science project.

<h1><center>1. DATA ENGINEERING</center></h1>


## 1.1) Creating an SFrame 

In this section we manipulate and clean the data in order to obtain an SFrame that will become our main working data environment.

First, we turn the gzipped JSON file containing the customer reviews and ratings into a pandas dataframe.

In [None]:
import pandas as pd
import gzip

## Generator that opens the gzip file and yields one review record per line

def parse(path):
  # The context manager closes the file once every line has been read
  with gzip.open(path, 'rb') as g:
    for l in g:
      # NOTE: eval executes whatever the line contains; only use it on trusted files
      yield eval(l)

## Building the pandas dataframe by iterating over the records yielded by parse

def getDF(path):
  df = {}
  for i, d in enumerate(parse(path)):
    df[i] = d
  return pd.DataFrame.from_dict(df, orient = 'index')

## Loading the reviews into a pandas dataframe named df

df = getDF('reviews_Electronics_5.json.gz')
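The line-by-line parsing idea can be tried on a toy payload without downloading the dataset. The sketch below is only an illustration: the helper name `parse_json_lines` and the two sample records are made up, and it assumes every line is strict JSON, so `json.loads` stands in for the `eval` call used above.

```python
import gzip
import io
import json

def parse_json_lines(raw_bytes):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines payload."""
    with gzip.open(io.BytesIO(raw_bytes), 'rt', encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if line:
                # Assumption: each line is strict JSON (hypothetical toy data)
                yield json.loads(line)

# Two made-up records standing in for real reviews
sample_records = [
    {"reviewerID": "A1", "asin": "B0001", "overall": 5.0},
    {"reviewerID": "A2", "asin": "B0001", "overall": 1.0},
]
payload = gzip.compress(
    "\n".join(json.dumps(record) for record in sample_records).encode("utf-8")
)

records = list(parse_json_lines(payload))
```

If the lines of a file were Python literals rather than strict JSON, `json.loads` would fail and a different parser would be needed.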

Let us take a look at our Amazon User data

In [3]:
df

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,AO94DHGC771SJ,0528881469,amazdnu,"[0, 0]",We got this GPS for my husband who is an (OTR)...,5.0,Gotta have GPS!,1370131200,"06 2, 2013"
1,AMO214LNFCEI4,0528881469,Amazon Customer,"[12, 15]","I'm a professional OTR truck driver, and I bou...",1.0,Very Disappointed,1290643200,"11 25, 2010"
2,A3N7T0DY83Y4IG,0528881469,C. A. Freeman,"[43, 45]","Well, what can I say. I've had this unit in m...",3.0,1st impression,1283990400,"09 9, 2010"
3,A1H8PY3QHMQQA0,0528881469,"Dave M. Shaw ""mack dave""","[9, 10]","Not going to write a long review, even thought...",2.0,"Great grafics, POOR GPS",1290556800,"11 24, 2010"
4,A24EV6RXELQZ63,0528881469,Wayne Smith,"[0, 0]",I've had mine for a year and here's what we go...,1.0,"Major issues, only excuses for support",1317254400,"09 29, 2011"
...,...,...,...,...,...,...,...,...,...
1689183,A34BZM6S9L7QI4,B00LGQ6HL8,"Candy Cane ""Is it just me?""","[1, 1]",Burned these in before listening to them for a...,5.0,Boom -- Pop -- Pow. These deliver.,1405555200,"07 17, 2014"
1689184,A1G650TTTHEAL5,B00LGQ6HL8,"Charles Spanky ""Zumina Reviews""","[0, 0]",Some people like DJ style headphones or earbud...,5.0,"Thin and light, without compromising on sound ...",1405382400,"07 15, 2014"
1689185,A25C2M3QF9G7OQ,B00LGQ6HL8,Comdet,"[0, 0]",I&#8217;m a big fan of the Brainwavz S1 (actua...,5.0,Same form factor and durability as the S1 with...,1405555200,"07 17, 2014"
1689186,A1E1LEVQ9VQNK,B00LGQ6HL8,J. Chambers,"[0, 0]","I've used theBrainwavz S1 In Ear Headphones, a...",5.0,Superb audio quality in a very comfortable set...,1405641600,"07 18, 2014"


Once the JSON file has been parsed and the dataframe set up, we can comfortably read the data as an SFrame

In [1]:
import turicreate as tc

reviews = tc.SFrame.read_json('Electronics_5.json', orient = 'lines')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[dict]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


## 1.2) Cleaning the SFrame

We will now drop the data columns that play no relevant role in our opinion mining project

In [2]:
reviews = reviews.remove_columns(['helpful', 'unixReviewTime', 'reviewTime', 'summary'])

As there are thousands of reviews, it is imperative to make sure all of them are written in English. For this purpose we will use the **langid** package, which returns for each text the most likely language together with a probability estimate

In [3]:
from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs = True)
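With `norm_probs = True`, the confidence attached to each prediction is rescaled to the 0-1 range. The sketch below illustrates the general idea of such a rescaling with a softmax over log-scores. It is not langid's own code: the function name `normalize_log_scores` and the scores are made up for illustration.

```python
import math

def normalize_log_scores(log_scores):
    """Turn a dict of log-scores into probabilities that sum to 1 (softmax)."""
    highest = max(log_scores.values())
    # Subtracting the maximum keeps the exponentials numerically stable
    exps = {lang: math.exp(score - highest) for lang, score in log_scores.items()}
    total = sum(exps.values())
    return {lang: value / total for lang, value in exps.items()}

# Made-up log-scores for three candidate languages
raw_scores = {'en': -10.0, 'es': -25.0, 'fr': -30.0}

probabilities = normalize_log_scores(raw_scores)
best_language = max(probabilities, key=probabilities.get)
```

A large gap between the best log-score and the others translates into a probability very close to 1 for the winning language.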

We will now write a function to coerce all of the review text into the string data type, as langid seems to interpret some reviews as bytes

In [4]:
def comments_string(comments):
    # Convert every review to str, preserving the original order
    return [str(comment) for comment in comments]

In [5]:
comments_transformed = comments_string(reviews['reviewText'])
reviews.remove_column('reviewText')
reviews['reviewText'] = comments_transformed
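One caveat may be worth checking when coercing values: in Python 3, calling `str` on a genuine `bytes` object returns its printed representation (for instance `"b'abc'"`) rather than the decoded text. The sketch below shows a possible decoding variant. The helper `to_text` and the sample values are hypothetical, and UTF-8 is an assumed encoding.

```python
def to_text(value, encoding='utf-8'):
    """Return value as str, decoding bytes instead of taking their repr."""
    if isinstance(value, bytes):
        # Assumption: the bytes are UTF-8; undecodable bytes are replaced
        return value.decode(encoding, errors='replace')
    return str(value)

# Made-up values mixing text, bytes and non-text entries
samples = ['Great GPS', b'Very disappointed', 5.0, None]

cleaned = [to_text(sample) for sample in samples]
naive = str(b'Very disappointed')  # keeps the b'...' wrapper
```

Here `cleaned` holds plain strings only, whereas `naive` still carries the `b'...'` wrapper around the text.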

Continuing with the data engineering process, we further clean the data by removing punctuation from the reviews with a function that relies on Python's built-in string functionality

In [6]:
import string 

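def remove_punctuation(text):
    try:
        # Python 2 signature: translate(table, deletechars)
        text = text.translate(None, string.punctuation)
    except TypeError:
        # Python 3: build a translation table that deletes every punctuation character
        translator = text.maketrans('', '', string.punctuation)
        text = text.translate(translator)
    return text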

reviews['reviewText'] = reviews['reviewText'].apply(remove_punctuation)
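The effect of the translation table is easiest to see on a single sentence. The following sketch assumes Python 3. The name `strip_punctuation` and the example sentence are illustrative only.

```python
import string

# Table that maps every ASCII punctuation character to "delete"
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Remove all characters listed in string.punctuation from text."""
    return text.translate(PUNCTUATION_TABLE)

example = "I'm a professional OTR truck driver, and I bought this!"
stripped = strip_punctuation(example)

# string.punctuation only covers ASCII, so a typographic apostrophe survives
curly = strip_punctuation("don\u2019t")
```

Because `string.punctuation` is limited to ASCII symbols, characters such as curly quotes are left untouched, which may matter for reviews containing them.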

We can now safely identify the language used in each review

In [7]:
def languages(comments):
    # classify returns a (language, probability) pair; keep only the language code
    return [identifier.classify(comment)[0] for comment in comments]

In [8]:
language = languages(reviews['reviewText'])
language_sarray = tc.SArray(language)
reviews['language'] = language_sarray

## Classifying the language of every review may take a few minutes, as there are thousands of them

## Consequently, we save the result to a csv file once all reviews have been processed and read from that file from here on

## This workaround avoids repeating the computationally intensive step every time the Jupyter notebook is opened

In [11]:
reviews.save('amazon_reviews.csv', format = 'csv')

In [13]:
## Reading the recently created csv file as an SFrame

reviews = tc.SFrame.read_csv('amazon_reviews.csv',
                             column_type_hints = {
                             'reviewerID': str, 'asin': str,
                             'reviewerName': str,'reviewText': str, 
                             'overall': float, 'language': str})
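The save-then-reload workaround follows a general "compute once, cache on disk" pattern. A minimal sketch using only the standard library is given below. The helper `load_or_compute`, the file name and the sample row are hypothetical and are not part of the turicreate API.

```python
import csv
import os
import tempfile

def load_or_compute(cache_path, compute_rows, fieldnames):
    """Return cached rows if the csv file exists, else compute and cache them."""
    if os.path.exists(cache_path):
        with open(cache_path, newline='', encoding='utf-8') as handle:
            return list(csv.DictReader(handle))
    rows = compute_rows()
    with open(cache_path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows

calls = []

def expensive_step():
    # Stand-in for the slow language classification (made-up data)
    calls.append(1)
    return [{'reviewText': 'Great GPS', 'language': 'en'}]

cache_file = os.path.join(tempfile.mkdtemp(), 'reviews_cache.csv')
first = load_or_compute(cache_file, expensive_step, ['reviewText', 'language'])
second = load_or_compute(cache_file, expensive_step, ['reviewText', 'language'])
```

The second call reads the file instead of recomputing, so `expensive_step` runs only once. Note that values read back from a csv file are strings, which is why column type hints are useful when reloading.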

The SFrame below constitutes the definitive working environment for our sentiment analysis

In [14]:
reviews

asin,overall,reviewText,reviewerID,reviewerName,language
528881469,5.0,We got this GPS for my husband who is an OTR ...,AO94DHGC771SJ,amazdnu,en
528881469,1.0,Im a professional OTR truck driver and I bo ...,AMO214LNFCEI4,Amazon Customer,en
528881469,3.0,Well what can I say Ive had this unit in my t ...,A3N7T0DY83Y4IG,C. A. Freeman,en
528881469,2.0,Not going to write a long review even thought this ...,A1H8PY3QHMQQA0,"Dave M. Shaw """"mack dave"""" ...",en
528881469,1.0,Ive had mine for a year and heres what we got It ...,A24EV6RXELQZ63,Wayne Smith,en
594451647,5.0,I am using this with a Nook HD It works as ...,A2JXAZZI9PHK9Z,"Billy G. Noland """"Bill Noland"""" ...",en
594451647,2.0,The cable is very wobbly and sometimes disconn ...,A2P5U7BDKKT7FW,Christian,en
594451647,5.0,This adaptor is real easy to setup and use right ...,AAZ084UMH8VZ2,"D. L. Brown """"A Knower Of Good Things"""" ...",en
594451647,4.0,This adapter easily connects my Nook HD 734 ...,AEZ3CR6BKIROJ,Mark Dietter,en
594451647,5.0,This product really works great but I found the ...,A3BY5KCNQZXV5U,Matenai,en


<h1><center>2. DATA EXPLORATION</center></h1>

Let us see how many reviews there are in our dataset

In [15]:
print('Number of reviews: %d' % len(reviews['reviewText']))

Number of reviews: 1689188


It is highly unlikely that each of these 1689188 reviews was written by a different user or refers to a different product. Thus, a more specific analysis is crucial

In [16]:
print('Number of distinct reviewers: %d' % len(reviews['reviewerID'].unique()))
print('Number of distinct reviewed products: %d' % len(reviews['asin'].unique()))
print('Minimum rating score: %d' % reviews['overall'].min())
print('Maximum rating score: %d' % reviews['overall'].max())

Number of distinct reviewers: 192403
Number of distinct reviewed products: 63001
Minimum rating score: 1
Maximum rating score: 5


As suspected, there are fewer distinct users than reviews, since Amazon customers buy multiple items and consequently write several product reviews.
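With the figures above, each reviewer wrote roughly 8.8 reviews on average (1689188 reviews for 192403 reviewers). The distribution behind such an average can be explored with a simple tally, sketched here on made-up reviewer IDs using `collections.Counter`.

```python
from collections import Counter

# Made-up reviewer IDs, one entry per review
reviewer_ids = ['A1', 'A2', 'A1', 'A3', 'A1', 'A2']

# Number of reviews written by each reviewer
reviews_per_reviewer = Counter(reviewer_ids)

n_reviews = len(reviewer_ids)
n_reviewers = len(reviews_per_reviewer)
average_reviews = n_reviews / n_reviewers

# (reviewer, count) pair for the most active reviewer
most_active = reviews_per_reviewer.most_common(1)[0]
```

The same tally applied to the `asin` column would give the number of reviews per product.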

Let's check whether English is the only language used by reviewers by writing a simple language identifier function

In [17]:
def english_identifier(languages):
    # Collect every language label that is not English
    non_english = [lang for lang in languages if lang != 'en']
    # Report only after the whole column has been checked
    if non_english:
        return non_english
    return 'Every comment was made in English'

english_identifier(reviews['language'])

'Every comment was made in English'

Luckily, we are dealing with a monolingual dataset. Otherwise our NLP task would get more complex
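A per-language tally is one way to double-check this kind of conclusion across the whole column, since it shows how many reviews fall under each label. The sketch below uses made-up labels, and the helper `language_summary` is hypothetical.

```python
from collections import Counter

def language_summary(labels):
    """Return the count per language and the subset that is not English."""
    counts = Counter(labels)
    non_english = {lang: n for lang, n in counts.items() if lang != 'en'}
    return counts, non_english

# Made-up language labels, one per review
labels = ['en', 'en', 'es', 'en', 'fr', 'es']

counts, non_english = language_summary(labels)
english_share = counts['en'] / len(labels)
```

An empty `non_english` dictionary would confirm that every label in the column is `'en'`.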

('en', 1.0)