# Data Science Project  - Text Sentiment Analysis for Customer Reviews

To begin the analysis, we need to understand what data we want to have and the purpose of using the data. 
We begin by downloading certain datasets, which are the metadata and reviews for Amazon products in the 'All Beauty Category'.

## Our Goal

Our goal is to analyze our *Amazon Reviews Dataset* in the 'All Beauty Category' by converting reviews into calculable metrics, through which we apply few classification algorithms to get the appropriate solution to classifiy future reviews.

While this definitely does not represent the whole behavior of reviews within Amazon, it can be a close predictor/ classifier for other categories. 


## Steps

*Step 1* -> we begin with the imports for the analysis.

Starting with the Standard Library -> *Pandas*, Then using *json* library for manipulation of JSON to our main Pandas DataFrame. 
We are also importing other libraries like *math* and *time* to analyze our speed of algorithms. 

In [1]:
import pandas as pd
import json
import math
import time

*Step 2 ->* Now we load our dataset from the JSON File and convert it to a Pandas Dataframe **df**.

In [2]:
df = pd.read_json('Data/All_Beauty.json',lines='true')
df.head(5)

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,1,True,"02 19, 2015",A1V6B6TNIC10QE,143026860,theodore j bigham,great,One Star,1424304000,,,
1,4,True,"12 18, 2014",A2F5GHSXFQ0W6J,143026860,Mary K. Byke,My husband wanted to reading about the Negro ...,... to reading about the Negro Baseball and th...,1418860800,,,
2,4,True,"08 10, 2014",A1572GUYS7DGSR,143026860,David G,"This book was very informative, covering all a...",Worth the Read,1407628800,,,
3,5,True,"03 11, 2013",A1PSGLFK1NSVO,143026860,TamB,I am already a baseball fan and knew a bit abo...,Good Read,1362960000,,,
4,5,True,"12 25, 2011",A6IKXKZMTKGSC,143026860,shoecanary,This was a good story of the Black leagues. I ...,"More than facts, a good story read!",1324771200,5.0,,


*Step 3 ->* Understanding the data content, size and shape

In [3]:
print(df.columns)
print(df.shape)

Index(['overall', 'verified', 'reviewTime', 'reviewerID', 'asin',
       'reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'vote',
       'style', 'image'],
      dtype='object')
(371345, 12)


*Step 3.1 ->* loading and manipulation of the meta data from our category 'All Beauty' in Amazon Reviews. These two datasets are primarily connected by the **ASIN** number in the reviews and the metadata. 

In [4]:

with open('Data/meta_All_Beauty.json', 'r') as json_file:
    #json_object = json.load(json_file)
    response = json_file.read()
    response = response.replace('\n', '')
    response = response.replace('}{', '},{')
    response = "[" + response + "]"
    json_object = json.loads(response)


In [5]:
print(json.dumps(json_object[0],indent=4))

{
    "category": [],
    "tech1": "",
    "description": [
        "Loud 'N Clear Personal Sound Amplifier allows you to turn up the volume on what people around you are saying, listen at the level you want without disturbing others, hear a pin drop from across the room."
    ],
    "fit": "",
    "title": "Loud 'N Clear&trade; Personal Sound Amplifier",
    "also_buy": [],
    "tech2": "",
    "brand": "idea village",
    "feature": [],
    "rank": "2,938,573 in Beauty & Personal Care (",
    "also_view": [],
    "details": {
        "ASIN: ": "6546546450"
    },
    "main_cat": "All Beauty",
    "similar_item": "",
    "date": "",
    "price": "",
    "asin": "6546546450",
    "imageURL": [],
    "imageURLHighRes": []
}


*Step 3.2->* Now we iterate through rows, to work on data transformation/ filtering. Look for NaN or null values, remove unrelated or non-productive columns, which will later greatly help in speeding up our analysis (even though most of it is dependent of the number of data points inside the dataset 😅.

### Minimising the dataset

We have almost 370,000 points for analysis. If we have to mainpulate every cell, it would take quite a lot of time to get a cleaner dataset, the way we want. We have two ways to go with it. Minimise the dataset to lower number of datarows or use better algorithms to make these manipulations. 

For Example, *NaN* in the 'votes' column basically meant that there were no votes on the review in question. Which we can convert to zero. But doing that with 370,000 rows will take quite a long time.  

Getting Random Sample with approx. 60,000 data rows(1/6th fraction).

In [7]:
#df_new = df.sample(frac=0.1667, replace=False, random_state=1)
#df_new = df_new.copy()

Importing *numpy* for further manipulation. 

In [8]:
import numpy as np

In [9]:
start = time.time()

df["vote"] = np.where(df['vote'].isna(), "0", df["vote"])
'''
for (index,rows) in df_new.iterrows():
    if(math.isnan(float(str(df_new.iloc[index]['vote']).replace(',','')))):
        df_new.loc[index,'vote']=0
'''
end = time.time()
print("Elapsed Time: ",end - start, "sec")


Elapsed Time:  0.0473484992980957 sec


Dropping in useless columns --> **style** and **image**

In [10]:
df.drop(columns=['style','image'],inplace=True)

In [11]:
df.iloc[2]

overall                                                           4
verified                                                       True
reviewTime                                              08 10, 2014
reviewerID                                           A1572GUYS7DGSR
asin                                                     0143026860
reviewerName                                                David G
reviewText        This book was very informative, covering all a...
summary                                              Worth the Read
unixReviewTime                                           1407628800
vote                                                              0
Name: 2, dtype: object

In [12]:
df

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote
0,1,True,"02 19, 2015",A1V6B6TNIC10QE,0143026860,theodore j bigham,great,One Star,1424304000,0
1,4,True,"12 18, 2014",A2F5GHSXFQ0W6J,0143026860,Mary K. Byke,My husband wanted to reading about the Negro ...,... to reading about the Negro Baseball and th...,1418860800,0
2,4,True,"08 10, 2014",A1572GUYS7DGSR,0143026860,David G,"This book was very informative, covering all a...",Worth the Read,1407628800,0
3,5,True,"03 11, 2013",A1PSGLFK1NSVO,0143026860,TamB,I am already a baseball fan and knew a bit abo...,Good Read,1362960000,0
4,5,True,"12 25, 2011",A6IKXKZMTKGSC,0143026860,shoecanary,This was a good story of the Black leagues. I ...,"More than facts, a good story read!",1324771200,5
...,...,...,...,...,...,...,...,...,...,...
371340,1,True,"07 20, 2017",A202DCI7TV1022,B01HJEGTYK,Sam,It was awful. It was super frizzy and I tried ...,It was super frizzy and I tried to comb it and...,1500508800,0
371341,5,True,"03 16, 2017",A3FSOR5IJOFIBE,B01HJEGTYK,TYW,I was skeptical about buying this. Worried it...,Awesome,1489622400,34
371342,5,True,"03 1, 2017",A1B5DK6CTP2P24,B01HJEGTYK,Norma Jennings,Makes me look good fast.,Five Stars,1488326400,46
371343,2,True,"02 21, 2017",A23OUYS5IRMJS9,B01HJEGTYK,Lee,Way lighter than photo\nNot mix blend of color...,Ok but color way off and volume as well,1487635200,0


## Description of Data

- overall -> rating of the product
- reviewerID -> ID of the reviewer
- reviewerName - name of the reviewer
- asin -> ID of the product

- reviewText -> review of the product, in text
- summary - summary of the review

## Next Steps

Using the text given and the rating given for the product, we can manipulate the *reviewText* values in our dataframe to get TF-IDF values of use word embedding techniques to get a feature(s) through which we operate a *classification* algorithm for sentimental analysis.

But before going for the classification, we have to **clean the dataset**. 

*Step 4 ->* target column --> reviewText and summary. First removing all rows with NaN values in reviewText and summary columns.

*Step 4.1 ->*  Dropping NaN and lowercasing values in our target columns

In [13]:
def lowerCase(x):
    if isinstance(x, str):
        return x.lower()
    else:
        return(str(x).lower())

#df['reviewText'].apply(lambda x: lowerCase(x))
#df['summary'].apply(lambda x: lower)
df.dropna(subset=['reviewText', 'summary'])

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote
0,1,True,"02 19, 2015",A1V6B6TNIC10QE,0143026860,theodore j bigham,great,One Star,1424304000,0
1,4,True,"12 18, 2014",A2F5GHSXFQ0W6J,0143026860,Mary K. Byke,My husband wanted to reading about the Negro ...,... to reading about the Negro Baseball and th...,1418860800,0
2,4,True,"08 10, 2014",A1572GUYS7DGSR,0143026860,David G,"This book was very informative, covering all a...",Worth the Read,1407628800,0
3,5,True,"03 11, 2013",A1PSGLFK1NSVO,0143026860,TamB,I am already a baseball fan and knew a bit abo...,Good Read,1362960000,0
4,5,True,"12 25, 2011",A6IKXKZMTKGSC,0143026860,shoecanary,This was a good story of the Black leagues. I ...,"More than facts, a good story read!",1324771200,5
...,...,...,...,...,...,...,...,...,...,...
371340,1,True,"07 20, 2017",A202DCI7TV1022,B01HJEGTYK,Sam,It was awful. It was super frizzy and I tried ...,It was super frizzy and I tried to comb it and...,1500508800,0
371341,5,True,"03 16, 2017",A3FSOR5IJOFIBE,B01HJEGTYK,TYW,I was skeptical about buying this. Worried it...,Awesome,1489622400,34
371342,5,True,"03 1, 2017",A1B5DK6CTP2P24,B01HJEGTYK,Norma Jennings,Makes me look good fast.,Five Stars,1488326400,46
371343,2,True,"02 21, 2017",A23OUYS5IRMJS9,B01HJEGTYK,Lee,Way lighter than photo\nNot mix blend of color...,Ok but color way off and volume as well,1487635200,0


In [14]:
df['reviewText']=df['reviewText'].apply(lambda x: lowerCase(x))
df['summary']=df['summary'].apply(lambda x: lowerCase(x))

In [15]:
df

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote
0,1,True,"02 19, 2015",A1V6B6TNIC10QE,0143026860,theodore j bigham,great,one star,1424304000,0
1,4,True,"12 18, 2014",A2F5GHSXFQ0W6J,0143026860,Mary K. Byke,my husband wanted to reading about the negro ...,... to reading about the negro baseball and th...,1418860800,0
2,4,True,"08 10, 2014",A1572GUYS7DGSR,0143026860,David G,"this book was very informative, covering all a...",worth the read,1407628800,0
3,5,True,"03 11, 2013",A1PSGLFK1NSVO,0143026860,TamB,i am already a baseball fan and knew a bit abo...,good read,1362960000,0
4,5,True,"12 25, 2011",A6IKXKZMTKGSC,0143026860,shoecanary,this was a good story of the black leagues. i ...,"more than facts, a good story read!",1324771200,5
...,...,...,...,...,...,...,...,...,...,...
371340,1,True,"07 20, 2017",A202DCI7TV1022,B01HJEGTYK,Sam,it was awful. it was super frizzy and i tried ...,it was super frizzy and i tried to comb it and...,1500508800,0
371341,5,True,"03 16, 2017",A3FSOR5IJOFIBE,B01HJEGTYK,TYW,i was skeptical about buying this. worried it...,awesome,1489622400,34
371342,5,True,"03 1, 2017",A1B5DK6CTP2P24,B01HJEGTYK,Norma Jennings,makes me look good fast.,five stars,1488326400,46
371343,2,True,"02 21, 2017",A23OUYS5IRMJS9,B01HJEGTYK,Lee,way lighter than photo\nnot mix blend of color...,ok but color way off and volume as well,1487635200,0


In [16]:
df.iloc[1]['reviewText']

"my  husband wanted to reading about the negro baseball and this a great addition to his library\n our library doesn't haveinformation so this book is his start. tthank you"

Now all our reviewTexts and summaries are in lower case 🙌
Next, we remove our punctuations.

To make it, we will start with the string library import
Before that, we will remove newlines and replace them with spaces or empty strings.

*Step 4.2 ->* remove newlines, remove punctuations

In [17]:
def remove_newline(x):
    if isinstance(x, str):
        return x.replace('\n','')
    else:
        return(str(x).replace('\n',''))

df['reviewText']=df['reviewText'].apply(lambda x: remove_newline(x))
df['summary']=df['summary'].apply(lambda x: remove_newline(x))

In [19]:
df.iloc[1]['reviewText']

"my  husband wanted to reading about the negro baseball and this a great addition to his library our library doesn't haveinformation so this book is his start. tthank you"

In [20]:
import string
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

In [21]:
df['reviewText']=df['reviewText'].apply(lambda x: remove_punctuations(x))
df['summary']=df['summary'].apply(lambda x: remove_punctuations(x))

In [22]:
df

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote
0,1,True,"02 19, 2015",A1V6B6TNIC10QE,0143026860,theodore j bigham,great,one star,1424304000,0
1,4,True,"12 18, 2014",A2F5GHSXFQ0W6J,0143026860,Mary K. Byke,my husband wanted to reading about the negro ...,to reading about the negro baseball and this ...,1418860800,0
2,4,True,"08 10, 2014",A1572GUYS7DGSR,0143026860,David G,this book was very informative covering all as...,worth the read,1407628800,0
3,5,True,"03 11, 2013",A1PSGLFK1NSVO,0143026860,TamB,i am already a baseball fan and knew a bit abo...,good read,1362960000,0
4,5,True,"12 25, 2011",A6IKXKZMTKGSC,0143026860,shoecanary,this was a good story of the black leagues i b...,more than facts a good story read,1324771200,5
...,...,...,...,...,...,...,...,...,...,...
371340,1,True,"07 20, 2017",A202DCI7TV1022,B01HJEGTYK,Sam,it was awful it was super frizzy and i tried t...,it was super frizzy and i tried to comb it and...,1500508800,0
371341,5,True,"03 16, 2017",A3FSOR5IJOFIBE,B01HJEGTYK,TYW,i was skeptical about buying this worried it ...,awesome,1489622400,34
371342,5,True,"03 1, 2017",A1B5DK6CTP2P24,B01HJEGTYK,Norma Jennings,makes me look good fast,five stars,1488326400,46
371343,2,True,"02 21, 2017",A23OUYS5IRMJS9,B01HJEGTYK,Lee,way lighter than photonot mix blend of colorsn...,ok but color way off and volume as well,1487635200,0


In [23]:
df.iloc[1]['reviewText']

'my  husband wanted to reading about the negro baseball and this a great addition to his library our library doesnt haveinformation so this book is his start tthank you'

Now we have also implemented punctuation removal. We are now going to do stop word removal for both the columns. 

We are implementing Stopword removal by using a libraby called nltk

*Step 4.3 -> Stopword removal*

In [25]:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/matcha.s/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [26]:
stop = [remove_punctuations(word) for word in stop] # removing punctuations as we have already removed it from the actual dataset
stop

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'youre',
 'youve',
 'youll',
 'youd',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'shes',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'thatll',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'fe

Now applying *stop* to both our text arrays, using *list comprehensions*

In [27]:
df['reviewText']=df['reviewText'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
df['summary']=df['summary'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

In [29]:
df.iloc[1]['reviewText']

'husband wanted reading negro baseball great addition library library haveinformation book start tthank'

In [30]:
df

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote
0,1,True,"02 19, 2015",A1V6B6TNIC10QE,0143026860,theodore j bigham,great,one star,1424304000,0
1,4,True,"12 18, 2014",A2F5GHSXFQ0W6J,0143026860,Mary K. Byke,husband wanted reading negro baseball great ad...,reading negro baseball great addition library ...,1418860800,0
2,4,True,"08 10, 2014",A1572GUYS7DGSR,0143026860,David G,book informative covering aspects game,worth read,1407628800,0
3,5,True,"03 11, 2013",A1PSGLFK1NSVO,0143026860,TamB,already baseball fan knew bit negro leagues le...,good read,1362960000,0
4,5,True,"12 25, 2011",A6IKXKZMTKGSC,0143026860,shoecanary,good story black leagues bought book teach hig...,facts good story read,1324771200,5
...,...,...,...,...,...,...,...,...,...,...
371340,1,True,"07 20, 2017",A202DCI7TV1022,B01HJEGTYK,Sam,awful super frizzy tried comb fell completely ...,super frizzy tried comb,1500508800,0
371341,5,True,"03 16, 2017",A3FSOR5IJOFIBE,B01HJEGTYK,TYW,skeptical buying worried would look obviously ...,awesome,1489622400,34
371342,5,True,"03 1, 2017",A1B5DK6CTP2P24,B01HJEGTYK,Norma Jennings,makes look good fast,five stars,1488326400,46
371343,2,True,"02 21, 2017",A23OUYS5IRMJS9,B01HJEGTYK,Lee,way lighter photonot mix blend colorsnot fulln...,ok color way volume well,1487635200,0


Stemming and Lemmatization

Now we do stemming and lemmatization on both our columns, by using the NLTK library (again) 

In [31]:
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
ps = PorterStemmer()
sns= SnowballStemmer("english")

df['reviewText']=df['reviewText'].apply(lambda x: " ".join([sns.stem(w) for w in x.split()]))
df['summary']=df['summary'].apply(lambda x: " ".join([sns.stem(w) for w in x.split()]))

In [32]:
df

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote
0,1,True,"02 19, 2015",A1V6B6TNIC10QE,0143026860,theodore j bigham,great,one star,1424304000,0
1,4,True,"12 18, 2014",A2F5GHSXFQ0W6J,0143026860,Mary K. Byke,husband want read negro basebal great addit li...,read negro basebal great addit librari librari...,1418860800,0
2,4,True,"08 10, 2014",A1572GUYS7DGSR,0143026860,David G,book inform cover aspect game,worth read,1407628800,0
3,5,True,"03 11, 2013",A1PSGLFK1NSVO,0143026860,TamB,alreadi basebal fan knew bit negro leagu learn...,good read,1362960000,0
4,5,True,"12 25, 2011",A6IKXKZMTKGSC,0143026860,shoecanary,good stori black leagu bought book teach high ...,fact good stori read,1324771200,5
...,...,...,...,...,...,...,...,...,...,...
371340,1,True,"07 20, 2017",A202DCI7TV1022,B01HJEGTYK,Sam,aw super frizzi tri comb fell complet apart th...,super frizzi tri comb,1500508800,0
371341,5,True,"03 16, 2017",A3FSOR5IJOFIBE,B01HJEGTYK,TYW,skeptic buy worri would look obvious fake shin...,awesom,1489622400,34
371342,5,True,"03 1, 2017",A1B5DK6CTP2P24,B01HJEGTYK,Norma Jennings,make look good fast,five star,1488326400,46
371343,2,True,"02 21, 2017",A23OUYS5IRMJS9,B01HJEGTYK,Lee,way lighter photonot mix blend colorsnot fulln...,ok color way volum well,1487635200,0


In [33]:
print(df.iloc[1]['reviewText'])
print(df.iloc[1]['summary'])

husband want read negro basebal great addit librari librari haveinform book start tthank
read negro basebal great addit librari librari haveinform


Now doing lemmatization

In [29]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

df['reviewText']=df['reviewText'].apply(lambda x: " ".join([lemmatizer.lemmatize(word) for word in x.split()]))
df['summary']=df['summary'].apply(lambda x: " ".join([lemmatizer.lemmatize(word) for word in x.split()]))

[nltk_data] Downloading package wordnet to /home/matcha.s/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [34]:
df

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote
0,1,True,"02 19, 2015",A1V6B6TNIC10QE,0143026860,theodore j bigham,great,one star,1424304000,0
1,4,True,"12 18, 2014",A2F5GHSXFQ0W6J,0143026860,Mary K. Byke,husband want read negro basebal great addit li...,read negro basebal great addit librari librari...,1418860800,0
2,4,True,"08 10, 2014",A1572GUYS7DGSR,0143026860,David G,book inform cover aspect game,worth read,1407628800,0
3,5,True,"03 11, 2013",A1PSGLFK1NSVO,0143026860,TamB,alreadi basebal fan knew bit negro leagu learn...,good read,1362960000,0
4,5,True,"12 25, 2011",A6IKXKZMTKGSC,0143026860,shoecanary,good stori black leagu bought book teach high ...,fact good stori read,1324771200,5
...,...,...,...,...,...,...,...,...,...,...
371340,1,True,"07 20, 2017",A202DCI7TV1022,B01HJEGTYK,Sam,aw super frizzi tri comb fell complet apart th...,super frizzi tri comb,1500508800,0
371341,5,True,"03 16, 2017",A3FSOR5IJOFIBE,B01HJEGTYK,TYW,skeptic buy worri would look obvious fake shin...,awesom,1489622400,34
371342,5,True,"03 1, 2017",A1B5DK6CTP2P24,B01HJEGTYK,Norma Jennings,make look good fast,five star,1488326400,46
371343,2,True,"02 21, 2017",A23OUYS5IRMJS9,B01HJEGTYK,Lee,way lighter photonot mix blend colorsnot fulln...,ok color way volum well,1487635200,0


In [31]:
df.iloc[371343]["summary"]

'ok color way volum well'

In [35]:
def applyCategory(x):
    x=int(x)
    if(x==5 or x==4):
        return "Positive"
    elif(x==3):
        return "Neutral"
    else:
        return "Negative"

df["overall"] = df["overall"].apply(lambda x: applyCategory(x))

In [36]:
df

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote
0,Negative,True,"02 19, 2015",A1V6B6TNIC10QE,0143026860,theodore j bigham,great,one star,1424304000,0
1,Positive,True,"12 18, 2014",A2F5GHSXFQ0W6J,0143026860,Mary K. Byke,husband want read negro basebal great addit li...,read negro basebal great addit librari librari...,1418860800,0
2,Positive,True,"08 10, 2014",A1572GUYS7DGSR,0143026860,David G,book inform cover aspect game,worth read,1407628800,0
3,Positive,True,"03 11, 2013",A1PSGLFK1NSVO,0143026860,TamB,alreadi basebal fan knew bit negro leagu learn...,good read,1362960000,0
4,Positive,True,"12 25, 2011",A6IKXKZMTKGSC,0143026860,shoecanary,good stori black leagu bought book teach high ...,fact good stori read,1324771200,5
...,...,...,...,...,...,...,...,...,...,...
371340,Negative,True,"07 20, 2017",A202DCI7TV1022,B01HJEGTYK,Sam,aw super frizzi tri comb fell complet apart th...,super frizzi tri comb,1500508800,0
371341,Positive,True,"03 16, 2017",A3FSOR5IJOFIBE,B01HJEGTYK,TYW,skeptic buy worri would look obvious fake shin...,awesom,1489622400,34
371342,Positive,True,"03 1, 2017",A1B5DK6CTP2P24,B01HJEGTYK,Norma Jennings,make look good fast,five star,1488326400,46
371343,Negative,True,"02 21, 2017",A23OUYS5IRMJS9,B01HJEGTYK,Lee,way lighter photonot mix blend colorsnot fulln...,ok color way volum well,1487635200,0


In [37]:
from sklearn.model_selection import train_test_split
X1 = df['reviewText']
X2 = df['summary']
y = df['overall']
X_train, X_test, y_train, y_test = train_test_split(X1, y, test_size=0.33,random_state=12)
print('Training Data :', X_train.shape)

print('Testing Data : ', X_test.shape)

Training Data : (248801,)
Testing Data :  (122544,)


Ideally when working on TF-IDF Scoring, we should have our data non-lemmatized, since we don't get too unique data text in our corpus.

## Logistic Regression

In [38]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
lr = LogisticRegression(max_iter=1000)

cv = CountVectorizer()

X_train_converted = cv.fit_transform(X_train)

print(X_train_converted.shape)

lr.fit(X_train_converted, y_train)

(248801, 97642)


In [39]:
X_test_converted = cv.transform(X_test)
predictions = lr.predict(X_test_converted)

predictions

array(['Positive', 'Positive', 'Neutral', ..., 'Positive', 'Negative',
       'Positive'], dtype=object)

In [40]:
from sklearn.metrics import classification_report

report = classification_report(y_test, predictions)
print(report)

              precision    recall  f1-score   support

    Negative       0.72      0.61      0.66     19637
     Neutral       0.40      0.17      0.24      9687
    Positive       0.87      0.96      0.91     93220

    accuracy                           0.84    122544
   macro avg       0.67      0.58      0.60    122544
weighted avg       0.81      0.84      0.82    122544



## TF-IDF Based Classification

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer(max_features=1000,smooth_idf=True)
x_rt = v.fit_transform(df['reviewText'])
x_summary=v.fit_transform(df['summary'])
x_rt_array = x_rt.todense()

In [42]:
x_rt_array.shape

(371345, 1000)

In [43]:
x_rt_array=np.sum(x_rt_array, axis=1)
x_rt_array= np.squeeze(np.asarray(x_rt_array))

In [44]:
x_summary_array = np.sum(x_summary,axis=1)
x_summary_array = np.squeeze(np.asarray(x_rt_array))

In [45]:
x_summary_array

array([1.        , 2.3858777 , 1.        , ..., 1.96187927, 2.76833213,
       2.97749194])

In [46]:
df["tfidf_revText"]=x_rt_array
df["tfidf_summary"]=x_summary_array

In [47]:
df

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,tfidf_revText,tfidf_summary
0,Negative,True,"02 19, 2015",A1V6B6TNIC10QE,0143026860,theodore j bigham,great,one star,1424304000,0,1.000000,1.000000
1,Positive,True,"12 18, 2014",A2F5GHSXFQ0W6J,0143026860,Mary K. Byke,husband want read negro basebal great addit li...,read negro basebal great addit librari librari...,1418860800,0,2.385878,2.385878
2,Positive,True,"08 10, 2014",A1572GUYS7DGSR,0143026860,David G,book inform cover aspect game,worth read,1407628800,0,1.000000,1.000000
3,Positive,True,"03 11, 2013",A1PSGLFK1NSVO,0143026860,TamB,alreadi basebal fan knew bit negro leagu learn...,good read,1362960000,0,2.611550,2.611550
4,Positive,True,"12 25, 2011",A6IKXKZMTKGSC,0143026860,shoecanary,good stori black leagu bought book teach high ...,fact good stori read,1324771200,5,4.016665,4.016665
...,...,...,...,...,...,...,...,...,...,...,...,...
371340,Negative,True,"07 20, 2017",A202DCI7TV1022,B01HJEGTYK,Sam,aw super frizzi tri comb fell complet apart th...,super frizzi tri comb,1500508800,0,3.421931,3.421931
371341,Positive,True,"03 16, 2017",A3FSOR5IJOFIBE,B01HJEGTYK,TYW,skeptic buy worri would look obvious fake shin...,awesom,1489622400,34,4.377286,4.377286
371342,Positive,True,"03 1, 2017",A1B5DK6CTP2P24,B01HJEGTYK,Norma Jennings,make look good fast,five star,1488326400,46,1.961879,1.961879
371343,Negative,True,"02 21, 2017",A23OUYS5IRMJS9,B01HJEGTYK,Lee,way lighter photonot mix blend colorsnot fulln...,ok color way volum well,1487635200,0,2.768332,2.768332


In [48]:
a=df["tfidf_revText"]
b=df["tfidf_summary"]
X_new=a
y_new = df['overall']
print(y_new.shape)
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_new, test_size=0.33,random_state=12)

(371345,)


In [49]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
lr = LogisticRegression(max_iter=1000)
X_train_new=X_train_new.values.reshape(-1,1)
lr.fit(X_train_new, y_train_new)


In [50]:
predictions_new = lr.predict(X_test_new.values.reshape(-1,1))

print(predictions_new)
print(y_test_new)

['Positive' 'Positive' 'Positive' ... 'Positive' 'Positive' 'Positive']
350215    Positive
11857     Positive
333469     Neutral
206929     Neutral
22536     Positive
            ...   
214501    Positive
148404    Positive
267593    Positive
336105    Negative
57504     Negative
Name: overall, Length: 122544, dtype: object


In [51]:
report = classification_report(y_test_new, predictions_new)
print(report)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00     19637
     Neutral       0.00      0.00      0.00      9687
    Positive       0.76      1.00      0.86     93220

    accuracy                           0.76    122544
   macro avg       0.25      0.33      0.29    122544
weighted avg       0.58      0.76      0.66    122544



  _warn_prf(average, modifier, msg_start, len(result))
