# Amazon Fine Food Reviews Analysis


Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews <br>

EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/


The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful (number of Yes clicked)
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not (Yes + No)
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review
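
The `Time` values shown in the tables further down (for example `1303862400`) look like Unix epoch seconds. Assuming that is the case, a minimal sketch for turning them into calendar dates with pandas could look like this; the two sample values are copied from the first two rows of the data preview.

```python
import pandas as pd

# Two Time values taken from the first rows of the Reviews table.
# Assumption: they are Unix timestamps expressed in seconds.
raw_times = pd.Series([1303862400, 1346976000])

# unit="s" tells pandas to interpret the integers as seconds since 1970-01-01.
review_dates = pd.to_datetime(raw_times, unit="s")
date_strings = review_dates.dt.strftime("%Y-%m-%d").tolist()
print(date_strings)
```
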


#### Objective:
Given a review, determine whether the review is positive or negative.

<br>
[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] We can use the Score (rating) as a proxy. A rating of 4 or 5 is considered positive, a rating of 1 or 2 is considered negative, and a rating of 3 is treated as neutral and ignored. This is only an approximate way of determining the polarity (positivity/negativity) of a review.

We will use this derived label only as a reference. The real task is to determine whether a review is positive or negative by analysing its text.

In [1]:
#import warnings
#warnings.filterwarnings("ignore")



import sqlite3
import pandas as pd
import nltk
import re

# from tqdm import tqdm
# from bs4 import BeautifulSoup
# from nltk.corpus import stopwords

## Loading the Data and Changing Score Column.

The dataset is available in two forms
1. .csv file
2. SQLite Database

To load the data we use the SQLite database, as it is easier to query and visualise the data efficiently.
<br> 

Since we only want the overall sentiment of each review (positive or negative), we deliberately ignore all reviews with a Score equal to 3. If the score is above 3, the review is labelled "positive". Otherwise, it is labelled "negative".

In [2]:
# Creating connection object to run sql queries. 

con = sqlite3.connect('database.sqlite')

In [3]:
# query to select 10k rows as sample data while ignoring those rows where score=3
# 10k rows to ease the computation.

filtered_data = pd.read_sql_query("""SELECT * FROM Reviews WHERE Score!=3 LIMIT 10000""", con)
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [4]:
# Creating a function to change Score column values from 1,2,4,5 to 1 and 0.
# Here 1 means Positive and 0 means negative.

def changeScore(x):
    if x > 3:
        return 1
    else:
        return 0
    
# Now we have created the function to update score values.
# We can apply it on each score value. That's how apply() works.
filtered_data['Score'] = filtered_data.Score.apply(changeScore)

# Just seeing our results.
print("Number of data points in our data", filtered_data.shape)
filtered_data.head()

Number of data points in our data (10000, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,0,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1,1350777600,Great taffy,Great taffy at a great price. There was a wid...
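
As an alternative to `apply(changeScore)`, the same relabelling can be written as a vectorised comparison. This is only a sketch on a small stand-in Series (the name `scores` and its values are made up for illustration); on the real data the Series would be `filtered_data['Score']`.

```python
import pandas as pd

# Stand-in for filtered_data['Score'] after rows with Score == 3 were dropped.
scores = pd.Series([5, 1, 4, 2, 5])

# (scores > 3) gives True/False per row; astype(int) turns that into 1/0.
labels = (scores > 3).astype(int)
print(labels.tolist())
```

Both versions give 1 for ratings of 4 or 5 and 0 for ratings of 1 or 2. The vectorised form avoids calling a Python function once per row.
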


#  Exploratory Data Analysis

## [2] Data Cleaning: Deduplication

There is no fixed way to clean a dataset, but there are some basic guidelines.<br>
This means we have to find the problems in our dataset using our own knowledge and experience.<br>
Some problems have already been found by the AI team, so we will try to resolve them.<br>

It is observed (as shown in the table below) that the reviews data has many duplicate entries. These duplicates have to be removed in order to get unbiased results from the analysis. Following is an example:

In [5]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductID
""", con)
display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


As can be seen above, the same user has multiple reviews with the same values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text. On further analysis it was found that <br>
<br> 
ProductId=B000HDOPZG was Loacker Quadratini "Vanilla" Wafer Cookies, 8.82-Ounce Packages (Pack of 8)<br>
<br> 
ProductId=B000HDL1RQ was Loacker Quadratini "Lemon" Wafer Cookies, 8.82-Ounce Packages (Pack of 8) and so on<br>

After analysis it was inferred that reviews with identical parameters other than ProductId belong to the same product in a different flavour or quantity. Hence, to reduce redundancy, it was decided to eliminate the rows having the same parameters.<br>

The method used is to first sort the data by ProductId and then keep only the first of the similar reviews, deleting the others. For example, in the table above only the review for ProductId=B000HDL1RQ remains. This ensures there is only one representative for each product; deduplicating without sorting could leave different representatives for the same product.
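
The sort-then-deduplicate idea can be tried on a tiny made-up table first. The frame `toy` and its values below are hypothetical. Only the column names and the `sort_values` / `drop_duplicates` calls mirror what is used on the real data.

```python
import pandas as pd

# Hypothetical mini dataset: user U1 posted the same text for products B2 and B1.
toy = pd.DataFrame({
    "ProductId":   ["B2", "B1", "B3"],
    "UserId":      ["U1", "U1", "U2"],
    "ProfileName": ["a", "a", "b"],
    "Time":        [100, 100, 200],
    "Text":        ["same text", "same text", "other text"],
})

# Sort by ProductId so that "first" is well defined, then keep the first row
# of every (UserId, ProfileName, Time, Text) group.
deduped = (
    toy.sort_values("ProductId")
       .drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"], keep="first")
)
kept_products = deduped["ProductId"].tolist()
print(kept_products)
```

Because the rows are sorted before deduplication, the surviving row of each group is always the one with the smallest ProductId, regardless of the order in which the rows were loaded.
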

In [6]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
sorted_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
2547,2775,B00002NCJC,A13RRPGE79XFFH,reader48,0,0,1,1281052800,Flies Begone,We have used the Victor fly bait for 3 seasons...
2546,2774,B00002NCJC,A196AJHU9EASJN,Alex Chaffee,0,0,1,1282953600,thirty bucks?,Why is this $[...] when the same product is av...
1145,1244,B00002Z754,A3B8RCEI0FXFI6,B G Chase,10,10,1,962236800,WOW Make your own 'slickers' !,I just received my shipment and could hardly w...
1146,1245,B00002Z754,A29Z5PI9BW2PU3,Robbie,7,7,1,961718400,Great Product,This was a really good idea and the final prod...
8695,9526,B00005V3DC,APASCXWTM041,Ed Raton,0,0,1,1350604800,"Good, effective product","Good flavor, unique in all the teas that I've ..."


In [7]:
#Deduplication of entries (subset passed as a list of column labels)
final=sorted_data.drop_duplicates(subset=["UserId","ProfileName","Time","Text"], keep='first', inplace=False)
print('Shape of Final Dataset', final.shape)
final.head()

Shape of Final Dataset (9564, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
2547,2775,B00002NCJC,A13RRPGE79XFFH,reader48,0,0,1,1281052800,Flies Begone,We have used the Victor fly bait for 3 seasons...
2546,2774,B00002NCJC,A196AJHU9EASJN,Alex Chaffee,0,0,1,1282953600,thirty bucks?,Why is this $[...] when the same product is av...
1145,1244,B00002Z754,A3B8RCEI0FXFI6,B G Chase,10,10,1,962236800,WOW Make your own 'slickers' !,I just received my shipment and could hardly w...
1146,1245,B00002Z754,A29Z5PI9BW2PU3,Robbie,7,7,1,961718400,Great Product,This was a really good idea and the final prod...
8695,9526,B00005V3DC,APASCXWTM041,Ed Raton,0,0,1,1350604800,"Good, effective product","Good flavor, unique in all the teas that I've ..."


In [8]:
# So there were 436 duplicate reviews (10000 - 9564) of the same product in slightly different variants.

In [9]:
#Before starting the next phase of preprocessing lets see the number of entries left
print(final.shape)

#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()

(9564, 10)


1    7976
0    1588
Name: Score, dtype: int64
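
The counts above show that the two classes are far from balanced. A short sketch of the arithmetic, with the counts copied by hand from the output above into a Series named `counts` (a name introduced here for illustration):

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({1: 7976, 0: 1588}, name="Score")

total = counts.sum()
share = counts / total          # fraction of reviews in each class
print(total)
print(share.round(3))
```

Roughly 83% of the deduplicated sample is positive, which is worth keeping in mind when a model is evaluated later: always predicting "positive" would already be right about 83% of the time. On the real frame, `final['Score'].value_counts(normalize=True)` should give the same fractions directly.
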

# [3].  Text Preprocessing.

Now that deduplication is finished, the data needs some preprocessing before we go further with the analysis and build the prediction model.

In the preprocessing phase we do the following, in this order:

1. Begin by removing the HTML tags
2. Remove punctuation and a limited set of special characters such as , or . or #
3. Check that the word is made up of English letters and is not alphanumeric
4. Check that the length of the word is greater than 2 (as it was found that no adjective has only 2 letters)
5. Convert the word to lowercase
6. Remove stopwords
7. Finally, apply Snowball stemming to the word (it was observed to be better than Porter stemming)<br>

After this we collect the words used to describe positive and negative reviews.
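
The cells in this section do not appear to apply the stemming step from the list, so here is a minimal sketch of what it could look like with NLTK. The word list is made up for illustration, and both stemmers are printed side by side so that the difference can be inspected rather than assumed.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

# Both stemmers are rule based and need no extra NLTK data downloads.
snowball = SnowballStemmer("english")
porter = PorterStemmer()

# Hypothetical sample words of the kind found in food reviews.
words = ["running", "flavors", "recommended", "tasty"]

snowball_stems = [snowball.stem(w) for w in words]
porter_stems = [porter.stem(w) for w in words]

for word, s_stem, p_stem in zip(words, snowball_stems, porter_stems):
    print(f"{word:12s} snowball={s_stem:12s} porter={p_stem}")
```

In a pipeline, `snowball.stem` would be applied to each token after stopword removal, just before the tokens are joined back into a string.
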

In [10]:
# Printing a few sample reviews to see what our text data looks like.
# Remember we have only 9564 reviews.

rev1 = final['Text'].values[1]      # save the review at index 1 in variable rev1
print(rev1)
print("="*50)

rev1200 = final['Text'].values[1200]
print(rev1200)
print("="*50)

rev8500 = final['Text'].values[8500]
print(rev8500)
print("="*50)

rev9500 = final['Text'].values[9500]
print(rev9500)
print("="*50)

Why is this $[...] when the same product is available for $[...] here?<br />http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
My wife like these better that the Swedish version, Wasa, as she finds the taste better and likes the thinness of the product.  We will buy again.
Really very good. Taste is great, no weird artificial taste, no little bitty grounds floating about. This is not crunchy brown water, but an actual cup of coffee. I make it in the microwave and it's great. Get the 10 pack for 9.95 and it's less than $1/cup.
This is probably my favorite k-cup so far. I'm no professional taster so talk of flavor notes in coffee or elsewhere usually leaves me scratching my head and thinking, "if you say so...". But at my first sip of this dark, smooth coffee, I though chocolate. It's not flavored or sweet but, for me, it has a "note" of good, dark c

#### First we will apply each text preprocessing step to these 4 sample reviews to understand how each technique works. Then we will apply them to the whole Text column.

#### Removing URLs from reviews.

In [11]:
# Only rev1 contains a URL, so we will apply our code to rev1 only.
print("click here https://www.google.co.in/ for more details.")
print(re.sub(r"http\S+", "", "click here https://www.google.co.in/ for more details."))
print()

# Substitute anything starting with "http" (up to the next whitespace) with "" in the given string.

rev1 = re.sub(r"http\S+", "", rev1)
print(rev1)

click here https://www.google.co.in/ for more details.
click here  for more details.

Why is this $[...] when the same product is available for $[...] here?<br /> /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
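
Note the stray ` />` left in the output above: `http\S+` keeps matching until the next whitespace, so it also swallows the `<br` that directly follows the URL. One possible workaround, sketched below on a shortened made-up string (`raw`), is to stop the match at `<` as well. Removing the HTML tags before the URLs would be another option.

```python
import re

# Shortened, made-up version of rev1: a URL followed directly by a <br /> tag.
raw = 'here?<br />http://www.amazon.com/dp/B00004RBDY<br /><br />The Victor traps'

# Original pattern: runs up to the next whitespace and eats "<br" as well.
greedy = re.sub(r"http\S+", "", raw)

# Bounded pattern: also stops at "<", so the following tag stays intact.
bounded = re.sub(r"http[^\s<]+", "", raw)

print(greedy)
print(bounded)
```

With the bounded pattern every `<br />` tag survives in one piece, so the HTML parser can remove it cleanly in the next step.
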


#### Removing HTML tags from strings.

In [12]:
# Again, only rev1 has HTML tags.

from bs4 import BeautifulSoup

# We will use BeautifulSoup to remove HTML tags from the string.
# BeautifulSoup parses the string as HTML and wraps it in the missing <html>/<body> tags.
# We can then extract only the text from the parsed document using its get_text() method.

Ex = "click here https://www.google.co.in/ <h2> DEVENDRA </h2> for more <br />TESTING<br /> details."

bs_obj = BeautifulSoup(Ex, 'lxml')
print("bs_obj:", bs_obj)

text = bs_obj.get_text()
print("text:", text)

bs_obj: <html><body><p>click here https://www.google.co.in/ </p><h2> DEVENDRA </h2> for more <br/>TESTING<br/> details.</body></html>
text: click here https://www.google.co.in/  DEVENDRA  for more TESTING details.


In [13]:
# For rev1

bs = BeautifulSoup(rev1, 'lxml')   # Parsing the string as HTML (missing tags are added where necessary)
rev1_text = bs.get_text()          # Extracting only text.
print(rev1_text)

Why is this $[...] when the same product is available for $[...] here? />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.


#### Expanding Contractions.

In [14]:
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [15]:
print(decontracted(rev9500))

This is probably my favorite k-cup so far. I am no professional taster so talk of flavor notes in coffee or elsewhere usually leaves me scratching my head and thinking, "if you say so...". But at my first sip of this dark, smooth coffee, I though chocolate. It is not flavored or sweet but, for me, it has a "note" of good, dark chocolate. Highly recommended if you like your coffee strong and full of flavor.


In [16]:
decontracted("Hello | didn't | don't | he'll be | they've")

'Hello | did not | do not | he will be | they have'
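
One limitation of these rules is that `'s` is always rewritten to " is", so possessives are expanded as well. The sketch below repeats the same substitutions in the same order in a compact table (the names `RULES` and `expand` are introduced here only to keep the example self-contained) and runs them on a made-up sentence.

```python
import re

# Same substitutions, in the same order, as decontracted() above.
RULES = [
    (r"won't", "will not"), (r"can\'t", "can not"),
    (r"n\'t", " not"), (r"\'re", " are"), (r"\'s", " is"),
    (r"\'d", " would"), (r"\'ll", " will"), (r"\'t", " not"),
    (r"\'ve", " have"), (r"\'m", " am"),
]

def expand(phrase):
    """Apply every contraction rule in turn."""
    for pattern, replacement in RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

result = expand("the dog's bowl isn't full")
print(result)
```

Here "isn't" is expanded correctly, but the possessive "dog's" becomes "dog is". Since "is" is removed later as a stopword, the effect on the final text is probably small, but it is worth knowing when reading intermediate outputs.
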

#### Removing words which contain numbers.

In [17]:
Ex2 = "ABCD abcd AB55 55CD A55D 5555"

print(re.sub(r"\S*\d\S*", "", Ex2).strip())  # This regex removes every word that contains a digit.

ABCD abcd


In [18]:
# For rev1
# M380 and M502 are removed.
print(rev1)
print()

print(re.sub(r"\S*\d\S*", "", rev1).strip())

Why is this $[...] when the same product is available for $[...] here?<br /> /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.

Why is this $[...] when the same product is available for $[...] here?<br /> /><br />The Victor  and  traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.


#### Removing special characters (all symbols).

In [19]:
# Notice that we cannot remove all symbols at the start of text preprocessing.
# To remove URLs, HTML tags and contractions we still need the symbols in our string.
# So always think about the point in the pipeline at which symbols should be removed.

re.sub('[^A-Za-z0-9]+', ' ', rev1) # this regex replaces every run of special characters/symbols with a space

'Why is this when the same product is available for here br br The Victor M380 and M502 traps are unreal of course total fly genocide Pretty stinky but only right nearby '

In [20]:
re.sub('[^A-Za-z0-9]+', ' ', rev9500)

'This is probably my favorite k cup so far I m no professional taster so talk of flavor notes in coffee or elsewhere usually leaves me scratching my head and thinking if you say so But at my first sip of this dark smooth coffee I though chocolate It s not flavored or sweet but for me it has a note of good dark chocolate Highly recommended if you like your coffee strong and full of flavor '

#### Removing Stopwords.

In [21]:
# Importing stopwords from nltk lib.
from nltk.corpus import stopwords

In [22]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [23]:
# We will create our own set of stopwords: NLTK's list without 'no', 'nor', 'not' (they carry sentiment), plus 'br' left over from HTML.

my_stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])
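
Instead of typing the whole list by hand, the same kind of set can be derived with set operations. The sketch below uses a short stand-in list (`base_stopwords`) so that it runs without downloading NLTK data. In the notebook the base list would presumably be `stopwords.words('english')`.

```python
# Stand-in for stopwords.words('english'); the real list is much longer.
base_stopwords = ["this", "is", "the", "very", "no", "nor", "not"]

# Negations flip the meaning of a review, so we keep them in the text.
negations = {"no", "nor", "not"}

# Remove the negations from the stopword set and add 'br' (leftover of <br /> tags).
custom_stopwords = (set(base_stopwords) - negations) | {"br"}

sample = "This is not very good br"
kept = [tok for tok in sample.lower().split() if tok not in custom_stopwords]
print(kept)
```

Keeping "not" means that "not good" and "good" stay distinguishable after stopword removal.
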

In [24]:
rev9500   # original rev9500

'This is probably my favorite k-cup so far. I\'m no professional taster so talk of flavor notes in coffee or elsewhere usually leaves me scratching my head and thinking, "if you say so...". But at my first sip of this dark, smooth coffee, I though chocolate. It\'s not flavored or sweet but, for me, it has a "note" of good, dark chocolate. Highly recommended if you like your coffee strong and full of flavor.'

In [25]:
Ex3 =  re.sub('[^A-Za-z0-9]+', ' ', rev9500)   # after removing all symbols from it
Ex3

'This is probably my favorite k cup so far I m no professional taster so talk of flavor notes in coffee or elsewhere usually leaves me scratching my head and thinking if you say so But at my first sip of this dark smooth coffee I though chocolate It s not flavored or sweet but for me it has a note of good dark chocolate Highly recommended if you like your coffee strong and full of flavor '

In [26]:
new_text = ' '.join(e.lower() for e in Ex3.split() if e.lower() not in my_stopwords)
new_text

'probably favorite k cup far no professional taster talk flavor notes coffee elsewhere usually leaves scratching head thinking say first sip dark smooth coffee though chocolate not flavored sweet note good dark chocolate highly recommended like coffee strong full flavor'

In [27]:
k = list(e.lower() for e in Ex3.split() if e.lower() not in my_stopwords)
print(k)

# Ex3.split() breaks the review into a list of words, so we use join() to rebuild the full review.
# Otherwise we would be left with the review broken into separate words, which we do not want.

['probably', 'favorite', 'k', 'cup', 'far', 'no', 'professional', 'taster', 'talk', 'flavor', 'notes', 'coffee', 'elsewhere', 'usually', 'leaves', 'scratching', 'head', 'thinking', 'say', 'first', 'sip', 'dark', 'smooth', 'coffee', 'though', 'chocolate', 'not', 'flavored', 'sweet', 'note', 'good', 'dark', 'chocolate', 'highly', 'recommended', 'like', 'coffee', 'strong', 'full', 'flavor']


## Combining all techniques and creating final preprocessed dataset.

In [28]:
# Combining all the above techniques.
from tqdm import tqdm       # It is used to show progress bar in output.
preprocessed_reviews = []


for review in tqdm(final['Text'].values):
    review = re.sub(r"http\S+", "", review)            # removing URLs
    review = BeautifulSoup(review, 'lxml').get_text()  # removing HTML tags
    review = decontracted(review)                      # expanding contractions (won't | didn't | they've)
    review = re.sub(r"\S*\d\S*", "", review).strip()   # removing words that contain digits
    review = re.sub('[^A-Za-z]+', ' ', review)         # removing all symbols
    review = ' '.join(e.lower() for e in review.split() if e.lower() not in my_stopwords)  # lowercasing and stopword removal
    
    preprocessed_reviews.append(review.strip())        # appending each processed review text in a new list.

100%|█████████████████████████████████████████████████████████████████████████████| 9564/9564 [00:32<00:00, 293.60it/s]


In [29]:
preprocessed_reviews

['used victor fly bait seasons ca not beat great product',
 'product available victor traps unreal course total fly genocide pretty stinky right nearby',
 'received shipment could hardly wait try product love slickers call instead stickers removed easily daughter designed signs printed reverse use car windows printed beautifully print shop program going lot fun product windows everywhere surfaces like tv screens computer monitors',
 'really good idea final product outstanding use decals car window everybody asks bought decals made two thumbs',
 'good flavor unique teas tried tea effective cleansing one system not harsh regular laxative consumed daily needed',
 'used brand years feeling clogged ate massive meal sips tea new make sure home work little well know mean careful first couple times using try little sips see result morning earlier follow lots water',
 'new product need careful dosage strong batches stronger others',
 'using food months find excellent fact two dogs coton de tule

In [30]:
processed_df = pd.DataFrame(preprocessed_reviews, columns =['clean_text'])
processed_df

Unnamed: 0,clean_text
0,used victor fly bait seasons ca not beat great...
1,product available victor traps unreal course t...
2,received shipment could hardly wait try produc...
3,really good idea final product outstanding use...
4,good flavor unique teas tried tea effective cl...
...,...
9559,tried many tassimo flavors far favorite normal...
9560,bold blend great taste flavor comes bursting u...
9561,coffee available tassimo kona richest flavor f...
9562,coffee supposedly premium tastes watery thin n...


In [31]:
processed_df['clean_text'][0]

'used victor fly bait seasons ca not beat great product'

In [32]:
processed_df['clean_text'][57]

'dual review first part covers delicious needs nothing else china mist iced tea varieties included box packets contains four ounces world choicest tea leaves stellar flavorings mixed dried crushed mango dried crushed passion fruit dried crushed raspberry pure strong tea first experience restaurant daughter works said order iced tea dad not need put sugar lemon anything always looking diabetes fantastic nearly highlight meal outstanding asked manager could share tea secret well use industry best china mist everybody says best indeed flavor pardon complex could rated would nice wine action hints new rush end taste cycle made think would missing tasted like british must drink full bodied teas enjoy apparently americans add much extra water tea coffee french hear order java coffee cup water top second part review concerns hamilton beach iced tea machine add recommend water ice cubes brewing cycle indeed take minutes tea already ice cold may difference home level heat limit maybe fact water

In [33]:
processed_df['clean_text'][224]

'review make sound really stupid whatever not really care long people find real avoid mistakes got wonderful little sweet bella bean days shy three years old bounced around house house eating whatever cheap cats around entire life twenty five years mother always fed whatever kinds food buy supermarket friskies nine lives kit kaboodle stuff like cats always fine least terms eating habits would eat morning stop done come back eat got hungry housemate time working hill assured best food ever made great forth know utter buffoon initially trusted judgment unfortunate not think also plenty coupons free deeply discounted bags made much attractive choice first tried feeding little bean unmeasured amount science diet bowl not work would devour one sitting took measuring thing started parsing twice day not work either would start going crazy middle day running around intentionally destroying things deliberately spilling water crying etc got food split three servings thing got four servings littl

In [35]:
# We will save our processed data into csv file so we can directly use clean data in other notebook.

processed_df.to_csv("processed_data", index=False) 

# Now our Text data is Fully ready for vectorization.

# [4] Featurization

## [4.1] BAG OF WORDS

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

In [50]:
#BoW
# We have to do just 2 things: fit and then transform.

count_vect = CountVectorizer()         # creating CountVectorizer object.
count_vect.fit(preprocessed_reviews)   # fit data
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out().
print("some feature names ", count_vect.get_feature_names()[:10])
print('='*50)

final_counts = count_vect.transform(preprocessed_reviews)    # transform data
print("the type of count vectorizer ",type(final_counts))
print("the shape of our text BOW vectorizer ",final_counts.get_shape())
print("the number of unique words ", final_counts.get_shape()[1])

some feature names  ['aa', 'aaaa', 'aahhhs', 'ab', 'aback', 'abandon', 'abates', 'abberline', 'abbott', 'abby']
the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of our text BOW vectorizer  (9564, 18244)
the number of unique words  18244


In [52]:
final_counts[0]

<1x18244 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [53]:
print(final_counts[0])

  (0, 1104)	1
  (0, 1266)	1
  (0, 2094)	1
  (0, 6183)	1
  (0, 7003)	1
  (0, 10661)	1
  (0, 12388)	1
  (0, 14095)	1
  (0, 17202)	1
  (0, 17396)	1


In [57]:
# Our vectors are stored as a sparse matrix.
# Almost all entries of each vector are 0, so only the positions and values of the non-zero entries are stored.

# As seen above, there are 18244 unique words, so each vector has length 18244.
# The first vector (final_counts[0]) has only 10 non-zero values; the remaining 18234 values are 0.

# Each vector has length 18244 and there are 9564 such vectors.
# We are interested only in the non-zero values, not in the zeros.
# That is why the BoW result is printed as (row, column) value entries, one per non-zero position.
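
To see the sparse layout on something small enough to read in full, here is a sketch on three made-up documents (`docs`). It uses the same `CountVectorizer` with default settings and computes the density, that is, the fraction of entries that are non-zero.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny made-up "reviews".
docs = ["good dark coffee", "not good", "dark chocolate not sweet"]

vect = CountVectorizer()
X = vect.fit_transform(docs)          # sparse matrix: one row per document

# Columns follow the alphabetical order of the vocabulary.
vocab = sorted(vect.vocabulary_)

# nnz is the number of stored (non-zero) entries in the whole matrix.
density = X.nnz / (X.shape[0] * X.shape[1])

print(vocab)
print(X.toarray())                    # dense view, only sensible for toy data
print(density)
```

In this toy corpus half of the entries are non-zero. In the real matrix the first review fills only 10 of 18244 columns, about 0.05%, which is why storing the non-zero entries alone saves so much memory.
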