__Amazon Fine Food Review Analysis__

Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

- Number of reviews: 568,454
- Number of users: 256,059
- Number of products: 74,258
- Timespan: Oct 1999 - Oct 2012
- Number of Attributes/Columns in data: 10

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review

__Objective__: Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).

Q) How to determine if a review is positive or negative?

A) We can use the Score (rating). A rating of 4 or 5 is considered a positive review, and a rating of 1 or 2 is considered negative. A rating of 3 is neutral and is ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from matplotlib import pyplot as plt
from matplotlib import style

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import sqlite3

In [4]:
style.use(style='seaborn-whitegrid')

In [5]:
con = sqlite3.connect('database.sqlite')

In [6]:
query = """
SELECT
    NAME
FROM
    sqlite_master
WHERE
    type = "table"
"""

In [7]:
tables = pd.read_sql_query(sql=query, con=con)
display(tables)

Unnamed: 0,name
0,Reviews


In [8]:
query = """
SELECT
    *
FROM
    Reviews
WHERE
    Score <> 3
"""

In [9]:
f_data = pd.read_sql_query(sql=query, con=con)
display(f_data.head())

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [10]:
partition = lambda x: 0 if x < 3 else 1

In [11]:
print(partition(2))
print(partition(4))

0
1


In [12]:
f_data['Score'] = f_data['Score'].map(partition)
display(f_data.head())

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,0,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [13]:
display(f_data.shape)

(525814, 10)

__Exploratory Data Analysis__

Data Cleaning: Deduplication

The reviews data contains many duplicate entries. These duplicates need to be removed so that they do not bias the analysis.

In [14]:
query = """
SELECT
    *
FROM
    Reviews
WHERE
    Score <> 3
AND
    UserId = "AR5J8UI46CURR"
ORDER BY
    ProductId
"""

In [15]:
d = pd.read_sql_query(sql=query, con=con)
display(d.head())

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


In [16]:
s_data = f_data.sort_values(by='ProductId')
display(s_data.head())

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
138706,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,1,939340800,EVERY book is educational,this witty little book makes my son laugh at l...
138688,150506,6641040,A2IW4PEEKO2R0U,Tracy,1,1,1,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc..."
138689,150507,6641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,1,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...
138690,150508,6641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,1,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...
138691,150509,6641040,A3CMRKGE0P909G,Teresa,3,4,1,1018396800,A great way to learn the months,This is a book of poetry about the months of t...


In [17]:
final = s_data.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"], keep='first')
display(final.head())

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
138706,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,1,939340800,EVERY book is educational,this witty little book makes my son laugh at l...
138688,150506,6641040,A2IW4PEEKO2R0U,Tracy,1,1,1,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc..."
138689,150507,6641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,1,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...
138690,150508,6641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,1,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...
138691,150509,6641040,A3CMRKGE0P909G,Teresa,3,4,1,1018396800,A great way to learn the months,This is a book of poetry about the months of t...


In [18]:
display(final.shape)

(364173, 10)

In [19]:
percent_data_retained = (len(final) / len(f_data)) * 100
display(percent_data_retained)

69.25890143662969

In the two rows shown below, the value of HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not possible in practice. These two rows are therefore also removed.

In [20]:
query = """
SELECT
    *
FROM
    Reviews
WHERE
    Score <> 3
AND
    Id IN (44737, 64422)
ORDER BY
    ProductId
"""

In [21]:
d = pd.read_sql_query(sql=query, con=con)
display(d)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
1,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...


In [22]:
final = final[final['HelpfulnessNumerator'] <= final['HelpfulnessDenominator']]

In [23]:
display(final.shape)

(364171, 10)

In [24]:
display(final['Score'].value_counts())

1    307061
0     57110
Name: Score, dtype: int64

__Why convert text to a vector?__

![](https://user-images.githubusercontent.com/63338657/159435439-74945e55-2c46-4885-a489-72b9bada63f2.png)

![](https://user-images.githubusercontent.com/63338657/159435951-98adfd3d-c868-4a14-8084-d0621e37d9be.png)

![](https://user-images.githubusercontent.com/63338657/159436491-55c78522-bde0-44bb-91b0-84f7dadb6792.png)

![](https://user-images.githubusercontent.com/63338657/159436794-7ab634cd-f87c-4705-b142-c412c779724a.png)

![](https://user-images.githubusercontent.com/63338657/159437260-bf9ddd9e-4d9c-46f1-990b-576506ff5557.png)

__Bag of Words (BoW)__

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
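
As a minimal sketch of this idea, the snippet below builds bag-of-words count vectors with only the Python standard library. The two sentences and the helper name `bag_of_words` are made up for illustration and are not taken from the dataset.

```python
from collections import Counter


def bag_of_words(text):
    """Return the multiset (word -> count) of lowercase words in `text`."""
    return Counter(text.lower().split())


# Hypothetical example documents (not from the reviews data).
docs = [
    "this pasta is very tasty",
    "this pasta is not tasty and is not cheap",
]

# The vocabulary is the sorted set of unique words across all documents.
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})

# Each document becomes a count vector with one entry per vocabulary word.
# Word order is lost, but multiplicity is kept.
vectors = [[bag_of_words(doc)[word] for word in vocabulary] for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each position in a vector corresponds to one vocabulary word, so documents of different lengths all map to vectors of the same dimension.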

![](https://user-images.githubusercontent.com/63338657/159440518-ee6131ec-cc7b-43d1-bffb-265a27f6d271.png)

![](https://user-images.githubusercontent.com/63338657/159440784-d2a73590-357d-41b1-9aa6-021fb9b62808.png)

![](https://user-images.githubusercontent.com/63338657/159441109-4ce9ae82-4809-43e8-8042-77e92016232d.png)

![](https://user-images.githubusercontent.com/63338657/159442485-2d61a544-3404-4672-a559-f7d59c009c83.png)

A small correction to the screenshot above: during the construction of the BoW, 'is' is listed twice, which is a typo. Only unique words are considered.

![](https://user-images.githubusercontent.com/63338657/159443348-d6bff985-f28c-451e-843c-7058069baad5.png)

![](https://user-images.githubusercontent.com/63338657/159443810-5030e7e3-9d46-45e6-8586-9778f6f2498c.png)

__Text Preprocessing: Stemming, Stop Word Removal, Tokenization, Lemmatization__

![](https://user-images.githubusercontent.com/63338657/159462702-77d51465-87b8-4363-a64a-5a9f79f62f75.png)

![](https://user-images.githubusercontent.com/63338657/159463664-66a73912-afaa-41b8-94f0-fb72431a104a.png)

![](https://user-images.githubusercontent.com/63338657/159464223-ff94f614-1e04-472b-b443-137fed5fdd53.png)

Note: the definition given for lemmatization in the slide above is incorrect. What it actually describes is tokenization.

Lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document.

Stemming and lemmatization are methods used by search engines and chatbots to analyze the meaning behind a word. Stemming uses the stem of the word, while lemmatization uses the context in which the word is being used.

Lemmatization is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

For instance:

- The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
- The word "walk" is the base form of the word "walking", and hence this is matched in both stemming and lemmatization.
- The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context, e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatization can in principle select the appropriate lemma depending on the context.
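
The three cases above can be sketched with a toy example. This is not a real stemmer or lemmatizer: `toy_stem`, `toy_lemmatize`, and the small `LEMMAS` dictionary are hypothetical helpers written only to show that stemming looks at the word alone, while lemmatization needs a dictionary and the part of speech.

```python
def toy_stem(word):
    """Strip a few common suffixes, looking only at the word itself."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ("better", "ADJ"): "good",
    ("walking", "VERB"): "walk",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemmatize(word, pos):
    """Look up the lemma for a word given its part of speech."""
    return LEMMAS.get((word, pos), word)


print(toy_stem("better"), toy_lemmatize("better", "ADJ"))
print(toy_stem("walking"), toy_lemmatize("walking", "VERB"))
print(toy_stem("meeting"), toy_lemmatize("meeting", "NOUN"), toy_lemmatize("meeting", "VERB"))
```

The toy stemmer always reduces "meeting" to "meet", whereas the dictionary-based lookup can return either "meeting" or "meet" depending on the part of speech supplied.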

![](https://miro.medium.com/max/1400/1*ES5bt7IoInIq2YioQp2zcQ.png)

![](https://user-images.githubusercontent.com/63338657/159464930-ceabb311-0e2f-455a-801c-271fac9c9adc.png)

Bag of Words __does not take the semantic meaning of words__ into consideration, which is a major drawback.

__Uni-gram, Bi-gram, N-grams__

![](https://user-images.githubusercontent.com/63338657/159475295-d9f2550b-66a2-4916-82b7-61d5c01dc195.png)

![](https://user-images.githubusercontent.com/63338657/159472146-18548934-bb93-4fbf-b2f5-25f92790d2f7.png)

![](https://user-images.githubusercontent.com/63338657/159472721-01a3c581-056d-4b54-b4bf-fb5a7d9e229d.png)

![](https://user-images.githubusercontent.com/63338657/159473287-2e7aae3e-c558-411a-9256-de4364b0e125.png)

![](https://user-images.githubusercontent.com/63338657/159473773-5d2ec65b-a841-4a6a-b7d6-a64363a6f2f5.png)

Example 1:

Rome is not built in a day

Here there are no repeated words. So

- unigrams --> not, in, day, is, a, built, Rome
- bigrams --> Rome is, is not, not built, built in, in a, a day
- trigrams --> Rome is not, is not built, not built in, built in a, in a day

There is no repetition of words in this sentence. Hence, number of unigrams (7) > number of bigrams (6) > number of trigrams (5).

Example 2:

horse is a horse of course of course accept it

Here there are repeated words. So
- unigrams --> horse, of, course, a, is, it, accept
- bigrams --> horse is, is a, a horse, horse of, of course, course of, course accept, accept it
- trigrams --> horse is a, is a horse, a horse of, horse of course, of course of, course of course, of course accept, course accept it

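There is repetition of words in this sentence, so the number of unique unigrams drops below the number of unique bigrams. Here, number of unigrams (7) < number of bigrams (8) = number of trigrams (8). Longer n-grams are less likely to repeat, so repetition reduces the unique unigram count the most.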
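
The counts for both examples can be checked with a short sketch. The helper name `unique_ngrams` is an assumption made for this illustration; it splits on whitespace and keeps only distinct n-grams.

```python
def unique_ngrams(sentence, n):
    """Return the set of distinct n-grams (as word tuples) in `sentence`."""
    words = sentence.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


examples = [
    "Rome is not built in a day",
    "horse is a horse of course of course accept it",
]

for sentence in examples:
    # Number of unique unigrams, bigrams, and trigrams, in that order.
    counts = [len(unique_ngrams(sentence, n)) for n in (1, 2, 3)]
    print(sentence, "->", counts)
```

For these two sentences the unique counts come out as 7, 6, 5 and 7, 8, 8 respectively.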

__TF-IDF (Term Frequency - Inverse Document Frequency)__

Link: https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3

Term Frequency (tf): gives the frequency of a word in each document in the corpus. It is the ratio of the number of times the word appears in a document to the total number of words in that document. It increases as the number of occurrences of that word within the document increases. Each document has its own tf.

Inverse Document Frequency (idf): used to calculate the weight of rare words across all documents in the corpus. Words that occur rarely in the corpus have a high idf score.
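
As a rough sketch of these two definitions, the snippet below computes tf and idf by hand on a made-up corpus of three documents. It assumes the common textbook form idf(t) = log(N / n_t), where N is the number of documents and n_t is the number of documents containing the term. Libraries such as scikit-learn apply smoothing and normalization by default, so their values will differ.

```python
import math
from collections import Counter

# Hypothetical corpus (not from the reviews data).
docs = [
    "this pasta is very tasty",
    "this pasta is not tasty",
    "this sauce is cheap",
]
tokenized = [doc.split() for doc in docs]


def tf(term, tokens):
    """Occurrences of `term` in the document divided by the document length."""
    return Counter(tokens)[term] / len(tokens)


def idf(term, corpus):
    """log(N / n_t); assumes `term` occurs in at least one document."""
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_containing)


def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)


for term in ("this", "pasta", "cheap"):
    print(term, round(idf(term, tokenized), 4))
```

A word such as "this" that appears in every document gets an idf of 0, while a rare word such as "cheap" gets the highest idf in this corpus.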

![](https://user-images.githubusercontent.com/63338657/159481310-f2d176a1-8241-4538-aaba-139cc5a3f173.png)

![](https://user-images.githubusercontent.com/63338657/159482073-d4873997-0a87-4ca2-931c-bc152a63f8f7.png)

![](https://user-images.githubusercontent.com/63338657/159482859-bb799073-6bfa-47b8-ac5f-70dec49b68f1.png)

![](https://user-images.githubusercontent.com/63338657/159483467-34770bf7-c4ac-4a26-ab77-70702f434cef.png)

![](https://user-images.githubusercontent.com/63338657/159484864-2037a88d-5314-4822-b5d9-26fc700a8abf.png)

![](https://user-images.githubusercontent.com/63338657/159484174-ee59058d-2b17-4b90-9d8f-96222f32f225.png)

![](https://user-images.githubusercontent.com/63338657/159485169-cd78b1fa-c2c1-41b9-afde-e6ea6842ccfb.png)

![](https://user-images.githubusercontent.com/63338657/159485536-9c99f90d-7097-4487-9122-59a46b120310.png)

![](https://user-images.githubusercontent.com/63338657/159486081-90d557fe-7fac-4c4e-a8ad-422e1799ecaa.png)

![](https://user-images.githubusercontent.com/63338657/159486475-007d311d-e8eb-4f9d-81f6-87bdd52c902f.png)

TF-IDF still does not take the semantic meaning of words into account. For example, it treats synonyms such as tasty and delicious, or cheap and affordable, as unrelated terms.

![](https://user-images.githubusercontent.com/63338657/159489489-bc5af23d-09e8-4ee1-b1e0-4cacb3da46a8.png)

![](https://user-images.githubusercontent.com/63338657/159490430-bfce1004-3f2d-4d84-9c1f-2d1c1c6a9adf.png)

![](https://user-images.githubusercontent.com/63338657/159495275-816140f9-0402-4f5c-b1ed-d125243484a6.png)

__Word2Vec__

Link: https://www.tensorflow.org/tutorials/text/word2vec

Unlike Bag of Words and TF-IDF, Word2Vec considers semantic meaning of words.

Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors.
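
To make the role of cosine similarity concrete, here is a small sketch. The three-dimensional vectors are made up for illustration and do not come from a trained Word2Vec model, where vectors typically have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical word vectors, chosen so that the two synonyms point in
# nearly the same direction.
vectors = {
    "tasty": [0.9, 0.8, 0.1],
    "delicious": [0.85, 0.75, 0.2],
    "cheap": [0.1, 0.2, 0.9],
}

sim_synonyms = cosine_similarity(vectors["tasty"], vectors["delicious"])
sim_unrelated = cosine_similarity(vectors["tasty"], vectors["cheap"])

print(round(sim_synonyms, 3), round(sim_unrelated, 3))
```

With these toy vectors, the similarity between "tasty" and "delicious" is close to 1, while the similarity between "tasty" and "cheap" is much lower.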

![](https://user-images.githubusercontent.com/63338657/159693183-e25caf01-c10f-459c-a023-151f95e24ff2.png)

![](https://user-images.githubusercontent.com/63338657/159693835-2cbca91a-ea89-4e11-be96-1692346e2f49.png)

![](https://user-images.githubusercontent.com/63338657/159694812-ea8b1117-06b3-41ec-96f0-0bdb757ff486.png)

![](https://user-images.githubusercontent.com/63338657/159695216-b2654439-53a3-4187-b23f-71b58cb4aad4.png)

![](https://user-images.githubusercontent.com/63338657/159695590-d31b59fb-87d5-4a11-8327-00861e4827d0.png)

![](https://user-images.githubusercontent.com/63338657/159695912-fd92eaa4-d9b5-4ff7-b519-d154423c3351.png)

In summary, the idea behind Word2Vec is pretty simple. We are making an assumption that the meaning of a word can be inferred by the company it keeps. This is analogous to the saying, "show me your friends, and I'll tell you who you are".

If two words have very similar neighbors (that is, the contexts in which they are used are about the same), then these words are probably quite similar in meaning or are at least related. For example, the words shocked, appalled, and astonished are usually used in a similar context.

__Avg-Word2Vec, tf-idf weighted Word2Vec__

![](https://user-images.githubusercontent.com/63338657/159698672-a8bf4ca7-ff7c-40d7-8d62-e4ca695b2e9f.png)

![](https://user-images.githubusercontent.com/63338657/159699203-ae3e88c9-74cc-4202-afb1-832ebb1c8840.png)

A vector is created for each word. In Avg W2V, the vectors of all the words in a given review are averaged component-wise, which gives a single vector for the review.
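
A minimal sketch of both variants follows. The word vectors and TF-IDF weights are made-up numbers, and the weighted version assumes the commonly used formulation sum(tfidf_i * vector_i) / sum(tfidf_i); the function names are hypothetical.

```python
# Hypothetical 3-dimensional word vectors and TF-IDF weights.
word_vectors = {
    "great": [0.8, 0.1, 0.3],
    "taffy": [0.2, 0.9, 0.4],
    "price": [0.5, 0.3, 0.7],
}
tfidf_weights = {"great": 0.2, "taffy": 0.6, "price": 0.2}


def average_w2v(words, vectors):
    """Component-wise mean of the vectors of all known words."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return []
    return [sum(component) / len(known) for component in zip(*known)]


def tfidf_weighted_w2v(words, vectors, weights):
    """Weighted mean of word vectors, using TF-IDF values as weights."""
    known = [w for w in words if w in vectors and w in weights]
    total = sum(weights[w] for w in known)
    if total == 0:
        return []
    dim = len(vectors[known[0]])
    return [sum(weights[w] * vectors[w][i] for w in known) / total for i in range(dim)]


review = ["great", "taffy", "price"]
avg_vector = average_w2v(review, word_vectors)
weighted_vector = tfidf_weighted_w2v(review, word_vectors, tfidf_weights)

print(avg_vector)
print(weighted_vector)
```

In the weighted variant, words with a higher TF-IDF value pull the review vector more strongly towards their own vector.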

![](https://user-images.githubusercontent.com/63338657/159699697-2213f30c-8534-4d25-a4db-aee2b402acca.png)

![](https://user-images.githubusercontent.com/63338657/159700087-50e95ea9-57b9-40d1-b27f-86669c15d415.png)

![](https://user-images.githubusercontent.com/63338657/159700450-7b22f21f-6f2e-4264-899d-5def27690c7f.png)

Sparse Matrix

![](https://user-images.githubusercontent.com/63338657/160226121-15b02e04-d186-4dd4-aadb-a5978f45dd01.png)

![](https://user-images.githubusercontent.com/63338657/160226175-cc3e90f9-6258-403a-bed0-0d532ed5471b.png)

![](https://user-images.githubusercontent.com/63338657/160226220-2c0f7ac3-f910-4e34-9aaf-0892d043ce40.png)

![](https://user-images.githubusercontent.com/63338657/160226257-8a9efd10-077c-46c6-b6f1-b355cd601326.png)

![](https://user-images.githubusercontent.com/63338657/160226388-b9f0b1df-2c49-4e74-b481-a0aee09bebe1.png)

![](https://user-images.githubusercontent.com/63338657/160226414-75743709-8705-4f04-b845-a8fea4803252.png)

![](https://user-images.githubusercontent.com/63338657/160226456-47042a88-a60b-4e3e-a84e-af0a17b99975.png)

Correction to the slide above: sparsity of the matrix = number of zero elements / total number of elements

A matrix is sparse if many of its coefficients are zero. The interest in sparsity arises because its exploitation can lead to enormous computational savings and because many large matrix problems that occur in practice are sparse.
By contrast, if most of the elements are nonzero, then the matrix is considered dense.

sparsity = number of zero elements / total number of elements

Density = 1 - sparsity

Dense matrices store every entry in the matrix, whereas sparse matrices store only the nonzero entries. Sparse matrix types support fewer operations, and some algorithms may not work with them.
They are useful when a matrix would be too big for the computer to handle in dense form but consists mostly of zeros, so it compresses easily.
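
The two formulas can be sketched on a small made-up matrix. The helper `to_sparse` uses a simple dictionary of coordinates to illustrate storing only nonzero entries; this is an assumption for illustration and not the compressed sparse row layout that SciPy uses.

```python
def sparsity(matrix):
    """Number of zero elements divided by the total number of elements."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


def to_sparse(matrix):
    """Keep only the nonzero entries as {(row, col): value}."""
    return {
        (i, j): value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    }


# Hypothetical 3 x 4 matrix with 9 zeros out of 12 elements.
dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]

s = sparsity(dense)
print("sparsity:", s)
print("density:", 1 - s)
print("stored entries:", to_sparse(dense))
```

The dense form stores all 12 values, while the sparse form stores only the 3 nonzero ones together with their positions.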

![](https://user-images.githubusercontent.com/63338657/160226504-870e51f5-131e-4583-9228-5212157772e7.png)

Text Preprocessing

Now that deduplication is finished, the data requires some preprocessing before we go further with the analysis and build the prediction model.

In the preprocessing phase we do the following, in the order below:

- Begin by removing the HTML tags.
- Remove any punctuation or a limited set of special characters such as , or . or #.
- Check that the word is made up of English letters and is not alphanumeric.
- Check that the length of the word is greater than 2 (as there is reportedly no two-letter adjective).
- Convert the word to lowercase.
- Remove stop words.
- Finally, apply Snowball stemming to the word (it was observed to be better than Porter stemming).

After that, we collect the words used to describe positive and negative reviews.
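
The first six steps listed above can be sketched with the standard library alone. The function name `clean_review`, the tiny `STOP_WORDS` set, and the sample review are assumptions made for this illustration. The stemming step is left out because it needs a third-party stemmer.

```python
import re

# Tiny illustrative stop-word set; a real list would be much longer.
STOP_WORDS = {"the", "this", "is", "a", "and", "it", "was"}


def clean_review(text):
    text = re.sub(r"<[^>]+>", " ", text)          # 1. remove HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # 2. punctuation and special characters become spaces
    words = []
    for word in text.split():
        if not word.isalpha():                    # 3. keep words made of letters only
            continue
        if len(word) <= 2:                        # 4. keep words longer than 2 characters
            continue
        word = word.lower()                       # 5. convert to lowercase
        if word in STOP_WORDS:                    # 6. remove stop words
            continue
        words.append(word)                        # 7. a stemmer would be applied here
    return " ".join(words)


sample = "This taffy is <br />GREAT!!! Bought 2 bags, and it was worth it."
cleaned = clean_review(sample)
print(cleaned)
```

For the sample sentence this leaves only the words "taffy great bought bags worth".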

In [25]:
from bs4 import BeautifulSoup
from tqdm import tqdm
import re

In [26]:
# https://stackoverflow.com/a/47091490/4084039

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [27]:
# https://gist.github.com/sebleier/554280

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [28]:
preprocessed_reviews = []

for sentence in tqdm(final['Text'].values):
    sentence = re.sub(r"http\S+", "", sentence)              # remove URLs
    sentence = BeautifulSoup(sentence, 'lxml').get_text()    # strip HTML tags
    sentence = decontracted(sentence)                        # expand contractions
    sentence = re.sub(r"\S*\d\S*", "", sentence).strip()     # drop tokens that contain digits
    sentence = re.sub('[^A-Za-z0-9]+', ' ', sentence)        # replace special characters with spaces
    # https://gist.github.com/sebleier/554280
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in stopwords)
    preprocessed_reviews.append(sentence.strip())

100%|█████████████████████████████████████████████████████████| 364171/364171 [01:06<00:00, 5478.24it/s]


In [34]:
print(preprocessed_reviews[5000])

late delivered paid extra charge faster delivery not delivered time expected


Bag of Words

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

Convert a collection of text documents to a matrix of token counts.

This implementation produces a sparse representation of the counts using `scipy.sparse.csr_matrix`.

In [36]:
count_vec = CountVectorizer()
final_counts = count_vec.fit_transform(raw_documents=preprocessed_reviews)

In [37]:
display(type(final_counts))
display(final_counts.get_shape())

scipy.sparse._csr.csr_matrix

(364171, 116756)

The vocabulary contains $116756$ unique words.

In [44]:
display(final_counts)

<364171x116756 sparse matrix of type '<class 'numpy.int64'>'
	with 11971404 stored elements in Compressed Sparse Row format>
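
The sparsity of this matrix can be estimated from the numbers in the summary above. This is a small sketch that assumes every stored element is a nonzero count; the values are copied by hand rather than read from `final_counts`.

```python
# Values copied from the matrix summary above.
n_rows, n_cols = 364171, 116756
stored = 11971404  # entries kept by the CSR matrix, assumed to be nonzero

total = n_rows * n_cols
density = stored / total
sparsity = 1 - density

print(f"density  = {density:.6f}")
print(f"sparsity = {sparsity:.6f}")
```

Roughly 99.97% of the entries are zero, which is why a sparse format is used for this matrix.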

In [53]:
print(final_counts)

  (0, 114149)	1
  (0, 59223)	1
  (0, 11737)	3
  (0, 61190)	1
  (0, 94893)	2
  (0, 57622)	1
  (0, 60044)	1
  (0, 83783)	2
  (0, 15503)	1
  (0, 30987)	1
  (0, 2903)	1
  (0, 3161)	1
  (0, 92822)	1
  (0, 84409)	1
  (0, 57892)	1
  (0, 112927)	1
  (0, 50921)	1
  (0, 31032)	1
  (0, 87454)	1
  (0, 60092)	1
  (0, 68406)	1
  (0, 114517)	1
  (0, 52446)	1
  (0, 92606)	1
  (0, 19480)	1
  :	:
  (364170, 42138)	1
  (364170, 55927)	1
  (364170, 60097)	1
  (364170, 81602)	1
  (364170, 48258)	1
  (364170, 34742)	1
  (364170, 86701)	1
  (364170, 102632)	1
  (364170, 66612)	1
  (364170, 111837)	1
  (364170, 7241)	1
  (364170, 4777)	1
  (364170, 94047)	1
  (364170, 87359)	1
  (364170, 14730)	1
  (364170, 91809)	1
  (364170, 36623)	1
  (364170, 46244)	1
  (364170, 14774)	1
  (364170, 14742)	1
  (364170, 90693)	2
  (364170, 27063)	1
  (364170, 88140)	1
  (364170, 30417)	1
  (364170, 64175)	1


Bi-Grams and N-Grams

In [39]:
count_vec = CountVectorizer(ngram_range=(1, 2))
final_bigram_counts = count_vec.fit_transform(raw_documents=preprocessed_reviews)

In [40]:
display(type(final_bigram_counts))
display(final_bigram_counts.get_shape())

scipy.sparse._csr.csr_matrix

(364171, 3923364)

In [45]:
display(final_bigram_counts)

<364171x3923364 sparse matrix of type '<class 'numpy.int64'>'
	with 25563547 stored elements in Compressed Sparse Row format>

In [52]:
print(final_bigram_counts)

  (0, 3851370)	1
  (0, 1950745)	1
  (0, 359218)	3
  (0, 2040306)	1
  (0, 3188176)	2
  (0, 1869528)	1
  (0, 1991302)	1
  (0, 2797693)	2
  (0, 502657)	1
  (0, 1025765)	1
  (0, 82455)	1
  (0, 99161)	1
  (0, 3109089)	1
  (0, 2813835)	1
  (0, 1879762)	1
  (0, 3818823)	1
  (0, 1711593)	1
  (0, 1026631)	1
  (0, 2912578)	1
  (0, 1992116)	1
  (0, 2252623)	1
  (0, 3860812)	1
  (0, 1753801)	1
  (0, 3097651)	1
  (0, 630236)	1
  :	:
  (364170, 904231)	1
  (364170, 477276)	1
  (364170, 1000292)	1
  (364170, 1000353)	1
  (364170, 2715235)	1
  (364170, 1819265)	1
  (364170, 3155585)	1
  (364170, 2202584)	1
  (364170, 3029214)	1
  (364170, 200451)	1
  (364170, 3188276)	1
  (364170, 132576)	1
  (364170, 3765893)	1
  (364170, 2890217)	1
  (364170, 474355)	1
  (364170, 3073885)	1
  (364170, 2125026)	1
  (364170, 670325)	1
  (364170, 2909507)	1
  (364170, 3481904)	1
  (364170, 2928746)	1
  (364170, 1142206)	1
  (364170, 477980)	1
  (364170, 1424915)	1
  (364170, 2125028)	1


TF-IDF

In [57]:
from sklearn.feature_extraction.text import TfidfVectorizer

Convert a collection of raw documents to a matrix of TF-IDF features.

In [59]:
tf_idf_vec = TfidfVectorizer(ngram_range=(1, 2))
final_tf_idf = tf_idf_vec.fit_transform(raw_documents=preprocessed_reviews)

In [60]:
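# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_features = tf_idf_vec.get_feature_names_out().tolist()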

In [62]:
display(type(final_tf_idf))
display(final_tf_idf.get_shape())
display(len(tf_features))

scipy.sparse._csr.csr_matrix

(364171, 3923364)

3923364

In [65]:
display(tf_features[100000:100010])

['always enjoyable',
 'always enjoyed',
 'always enjoyedthe',
 'always enjoying',
 'always enjoys',
 'always enough',
 'always enoughk',
 'always ensues',
 'always ensure',
 'always ensures']

In [72]:
print(final_tf_idf[3, :].toarray()[0])
print(len(final_tf_idf[3, :].toarray()[0]))

[0. 0. 0. ... 0. 0. 0.]
3923364


In [73]:
def get_top_tfidf_features(row, features, top_n=25):
    """
    This function returns the top `top_n` TF-IDF features of a row.
    """
    topn_ids = np.argsort(a=row)[::-1][:top_n]
    top_feats = [[features[i], row[i]] for i in topn_ids]
    df = pd.DataFrame(data=top_feats, columns=['feature', 'tfidf'])
    return df

In [74]:
top_tfidf_df = get_top_tfidf_features(row=final_tf_idf[0, :].toarray()[0], features=tf_features)
display(top_tfidf_df)

Unnamed: 0,feature,tfidf
0,recite,0.264566
1,book,0.21989
2,introduces silliness,0.146257
3,loud recite,0.146257
4,memory college,0.146257
5,india drooping,0.146257
6,roses love,0.146257
7,whales india,0.146257
8,classic book,0.146257
9,always sing,0.146257


![](https://user-images.githubusercontent.com/63338657/160231764-d605bda1-2eba-48df-84e8-9123d0de34b0.png)