https://s3.amazonaws.com/amazon-reviews-pds/readme.html

```

DATA COLUMNS:
marketplace       - 2 letter country code of the marketplace where the review was written.
customer_id       - Random identifier that can be used to aggregate reviews written by a single author.
review_id         - The unique ID of the review.
product_id        - The unique Product ID the review pertains to. In the multilingual dataset the reviews
                    for the same product in different countries can be grouped by the same product_id.
product_parent    - Random identifier that can be used to aggregate reviews for the same product.
product_title     - Title of the product.
product_category  - Broad product category that can be used to group reviews 
                    (also used to group the dataset into coherent parts).
star_rating       - The 1-5 star rating of the review.
helpful_votes     - Number of helpful votes.
total_votes       - Number of total votes the review received.
vine              - Review was written as part of the Vine program.
verified_purchase - The review is on a verified purchase.
review_headline   - The title of the review.
review_body       - The review text.
review_date       - The date the review was written.
```

In [1]:
import pandas as pd

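# on_bad_lines='warn' (pandas >= 1.3) replaces error_bad_lines=False, which was
# removed in pandas 2.0: malformed rows are skipped and reported.
df = pd.read_csv('amazon_reviews_us_Grocery_v1_00.tsv.gz',
                 nrows=20000, sep='\t', on_bad_lines='warn')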

b'Skipping line 1925: expected 15 fields, saw 22\nSkipping line 1977: expected 15 fields, saw 22\n'
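
The "expected 15 fields, saw 22" messages above are often caused by stray double quotes inside the review text, which the parser treats as the start of a quoted field. That cause is an assumption here; the notebook does not confirm it. A minimal sketch, using a made-up three-row sample and a hypothetical helper `read_reviews_tsv`, of reading such a file with quote handling disabled:

```python
import csv
import io

import pandas as pd

# Made-up sample: row R1 contains an unbalanced double quote.
sample = (
    "review_id\treview_headline\treview_body\n"
    'R1\tGreat\t"Best tea ever\n'
    "R2\tOkay\tFine for the price\n"
    'R3\tBad\tTasted "off" to me\n'
)


def read_reviews_tsv(source, **kwargs):
    """Hypothetical helper: read a tab-separated file, treating quotes as plain text."""
    return pd.read_csv(source, sep="\t", quoting=csv.QUOTE_NONE, **kwargs)


reviews = read_reviews_tsv(io.StringIO(sample))
print(len(reviews))
print(reviews.loc[0, "review_body"])
```

With `quoting=csv.QUOTE_NONE` every tab is a field separator and every newline ends a row, so an unbalanced quote cannot swallow the following lines. Whether this recovers the skipped rows in the real file would need to be checked against the data.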


In [2]:
df.head(5)

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,42521656,R26MV8D0KG6QI6,B000SAQCWC,159713740,"The Cravings Place Chocolate Chunk Cookie Mix,...",Grocery,5,0,0,N,Y,Using these for years - love them.,"As a family allergic to wheat, dairy, eggs, nu...",2015-08-31
1,US,12049833,R1OF8GP57AQ1A0,B00509LVIQ,138680402,"Mauna Loa Macadamias, 11 Ounce Packages",Grocery,5,0,0,N,Y,Wonderful,"My favorite nut. Creamy, crunchy, salty, and ...",2015-08-31
2,US,107642,R3VDC1QB6MC4ZZ,B00KHXESLC,252021703,Organic Matcha Green Tea Powder - 100% Pure Ma...,Grocery,5,0,0,N,N,Five Stars,This green tea tastes so good! My girlfriend l...,2015-08-31
3,US,6042304,R12FA3DCF8F9ER,B000F8JIIC,752728342,15oz Raspberry Lyons Designer Dessert Syrup Sauce,Grocery,5,0,0,N,Y,Five Stars,I love Melissa's brand but this is a great sec...,2015-08-31
4,US,18123821,RTWHVNV6X4CNJ,B004ZWR9RQ,552138758,"Stride Spark Kinetic Fruit Sugar Free Gum, 14-...",Grocery,5,0,0,N,Y,Five Stars,good,2015-08-31


In [3]:
df = df.drop(['marketplace', 'product_category'], axis=1)

In [82]:
#df = df[~df['review_body'].isnull()]

In [5]:
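# keep reviews where only a headline (or only a body) was written:
# fill missing text with '' so the concatenation does not turn into NaN
df['review_body'] = df['review_headline'].fillna('') + '. ' + df['review_body'].fillna('')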
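
The cell above overwrites `review_body` in place, so running it a second time prepends the headline again; that is a likely explanation for later outputs such as "One Star. One Star. Way too expensive." A minimal sketch of a rerun-safe variant, assuming a hypothetical helper `add_review_text` and a hypothetical new column `review_text` (neither is in the original notebook):

```python
import pandas as pd


def add_review_text(frame):
    """Return a copy with a `review_text` column built from headline and body.

    The source columns are left untouched, so calling this repeatedly
    always gives the same result.
    """
    out = frame.copy()
    headline = out["review_headline"].fillna("")
    body = out["review_body"].fillna("")

    joined = headline + ". " + body
    # If one side is empty, use the other side alone (no dangling ". ").
    joined = joined.where(body != "", headline)
    joined = joined.where(headline != "", body)

    out["review_text"] = joined
    return out


# Made-up rows: a full review, a headline-only review, a body-only review.
sample = pd.DataFrame({
    "review_headline": ["One Star", "Five Stars", None],
    "review_body": ["Way too expensive.", None, "good"],
})

once = add_review_text(sample)
twice = add_review_text(once)
print(twice["review_text"].tolist())
```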

In [6]:
df['customer_id'].nunique()

15659

In [7]:
df['review_id'].nunique()

20000

In [8]:
df['customer_id']

0        42521656
1        12049833
2          107642
3         6042304
4        18123821
           ...   
19995    14359560
19996    23158903
19997    16476066
19998    52757565
19999    17031304
Name: customer_id, Length: 20000, dtype: int64

In [9]:
df['customer_id'].value_counts()

34247947    25
20674418    18
1535682     18
13403431    17
36290808    15
            ..
43931969     1
49041728     1
33367103     1
52474172     1
2686976      1
Name: customer_id, Length: 15659, dtype: int64

In [11]:
df[df['customer_id'] == 52646512]

Unnamed: 0,customer_id,review_id,product_id,product_parent,product_title,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
60,52646512,R3T1JWKO1X6EDL,B000F4GPC8,89439274,"Miso-Cup Soup with Seaweed, 2-Serving Envelope...",5,0,0,N,Y,Old stand by that is delicious,Old stand by that is delicious. Old stand by t...,2015-08-31
142,52646512,R1HC2YQHOSLX3Z,B000ILILLQ,539104877,Pamela's Simplebites Mini Cookies,5,0,0,N,Y,Too Good if you know what i mean.,Too Good if you know what i mean.. Too Good if...,2015-08-31
171,52646512,R12FY3F7R3LT3,B008X60YEA,718531235,"Twinings Earl Grey Tea, Keurig K-Cups, 24 Count",5,0,0,N,Y,The best of all the non milky chai's,The best of all the non milky chai's. The best...,2015-08-31
1403,52646512,R4NBEEFMHKPQ9,B00503DQIA,814285931,Silk Almond Milk,5,0,0,N,Y,Dark chocolate without any guilt.,Dark chocolate without any guilt.. Dark chocol...,2015-08-31
1637,52646512,R1TGM0S9T9GN3G,B00509LVIQ,138680402,"Mauna Loa Macadamias, 11 Ounce Packages",5,0,0,N,Y,"In small amounts these are great, too many and...","In small amounts these are great, too many and...",2015-08-31
1724,52646512,R1WU295CBZ8DC2,B008YDVYGO,267956568,San Francisco Bay One Cup,5,0,0,N,Y,One of if not the best expresso pods for the K...,One of if not the best expresso pods for the K...,2015-08-31
2738,52646512,RLM2HHPTV92P,B00IZJ8FWI,902799879,Grove Square Pumpkin Spice Cappuccino K-Cups (...,4,0,1,N,Y,5 stars if these were healthy.,5 stars if these were healthy.. 5 stars if the...,2015-08-31


In [12]:
df['customer_product'] = df['customer_id'].astype(str) + '_' + df['product_id'].astype(str)

In [13]:
df['customer_product'].value_counts()

18890917_B00FMF4YVO    1
15773076_B000H69AI0    1
12027472_B00679PULM    1
42141740_B00787FUEE    1
22755633_B006F6I9OM    1
                      ..
43339811_B00PV0KIBK    1
43840247_B00CXXK6Z6    1
38552705_B00TZFVK2I    1
28064875_B000LQNK6E    1
8031503_B004W8LT10     1
Name: customer_product, Length: 20000, dtype: int64

In [133]:
df[df['customer_product'] == '13879735_B003WO0I6C']

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date,customer_product
26497,US,13879735,R37A5SE85MPPDQ,B003WO0I6C,74657246,"SweetLeaf Sweetener (70-Count Packets), 2.5-Ou...",Grocery,5,3,3,N,N,"Great pack of three, all natural stevia product",I bought this because I wanted TRULY ALL NATUR...,2015-08-22,13879735_B003WO0I6C
27990,US,13879735,R1LUOS1KTVQ7P8,B003WO0I6C,74657246,"SweetLeaf Sweetener (70-Count Packets), 2.5-Ou...",Grocery,5,0,0,N,N,Good value for three pack,I bought this because I wanted TRULY ALL NATUR...,2015-08-22,13879735_B003WO0I6C


In [15]:
customer_rate_mean = df.groupby(['customer_id'])['star_rating'].mean()
customer_rate_count = df['customer_id'].value_counts()

In [16]:
# star rating average of each customer
customer_rate_mean = customer_rate_mean.reset_index()
customer_rate_mean.columns = ['customer_id', 'rating']

# how many reviews does each customer write
customer_rate_count = customer_rate_count.reset_index()
customer_rate_count.columns = ['customer_id', 'count']

In [17]:
customer_rate_mean.columns, customer_rate_count.columns

(Index(['customer_id', 'rating'], dtype='object'),
 Index(['customer_id', 'count'], dtype='object'))

In [18]:
customer_df = pd.merge(customer_rate_mean, customer_rate_count, on='customer_id')
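
The mean rating and the review count per customer can also be computed in a single `groupby` with named aggregation, which avoids the separate `value_counts`, the column renaming, and the merge. A sketch on made-up ratings (the name `customer_stats` is illustrative only):

```python
import pandas as pd

# Made-up ratings for three customers.
ratings = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "star_rating": [5, 4, 1, 5, 5, 2],
})

customer_stats = (
    ratings.groupby("customer_id")
    .agg(
        rating=("star_rating", "mean"),  # average star rating per customer
        count=("star_rating", "size"),   # number of reviews per customer
    )
    .reset_index()
)
print(customer_stats)
```

The result has the same three columns as `customer_df` (`customer_id`, `rating`, `count`), so the sorting and filtering cells that follow would apply unchanged.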

In [26]:
customer_df.sort_values(by='count', ascending=False)

Unnamed: 0,customer_id,rating,count
9777,34247947,4.960000,25
6374,20674418,5.000000,18
747,1535682,4.888889,18
3691,13403431,4.882353,17
9648,33718153,5.000000,15
...,...,...,...
5651,18135902,4.000000,1
5652,18144169,5.000000,1
5653,18147763,5.000000,1
5654,18147977,5.000000,1


In [21]:
customer_df[customer_df['count'] >= 5].sort_values(by='rating', ascending=False)

Unnamed: 0,customer_id,rating,count
15614,53037408,5.0,6
7098,23208852,5.0,8
9610,33589781,5.0,5
9601,33524015,5.0,6
14747,51256870,5.0,5
...,...,...,...
1723,6579386,3.0,6
11253,40165579,3.0,5
14479,50616505,3.0,8
3574,13181631,2.6,5


In [22]:
df[df['star_rating'] == 1]['review_body']

9        Disgusting now and difficult on digestion. Dis...
17       1 Out Of 5 Of My Co-Workers Thought It Was "Ok...
23       pita crackers. pita crackers. not craze about ...
40       Does not recommend!. Does not recommend!. This...
99       It's actually TOO salty.. It's actually TOO sa...
                               ...                        
19923    Made the Pinot Grigio and I would say don't wa...
19924    Changed product to something unacceptably infe...
19954               One Star. One Star. Way too expensive.
19971           One Star. One Star. I order 6 i received 4
19978    yuck!. yuck!. I did not like at all. I know it...
Name: review_body, Length: 1451, dtype: object

In [23]:
import nltk

```
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to, as in go ‘to’ the store
UH interjection, errrrrrrrm
VB verb, base form take
VBD verb, past tense, took
VBG verb, gerund/present participle taking
VBN verb, past participle is taken
VBP verb, non-3rd person singular present take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when

```

In [157]:
# https://www.nltk.org/book/ch05.html
# part of speech
tokens = nltk.word_tokenize('Mushrooms are moldy!')
nltk.pos_tag(tokens)

[('Mushrooms', 'NNS'), ('are', 'VBP'), ('moldy', 'JJ'), ('!', '.')]

In [158]:
tokens

['Mushrooms', 'are', 'moldy', '!']

In [85]:
# collect adjectives (JJ, JJR, JJS) and gerunds/present participles (VBG)
# from the 1-star reviews
for line in df[df['star_rating'] == 1]['review_body']:
    line = line.replace('<br />', '')

    tokens = nltk.word_tokenize(line)
    tags = nltk.pos_tag(tokens)
    tags_select = []
    for word, tag in tags:
        if tag in ['JJ', 'JJR', 'JJS', 'VBG']:
            tags_select.append(word)

    print(line)
    print(tags_select)

    print('')

Used to be a decent product.  Disgusting now and difficult on digestion.  All 3 purchased from Costco over past couple months end in same result -- open the container and it smells like rancid oil.  Something not right about how they are making/processing this powder now.  Will not buy again.
['decent', 'Disgusting', 'difficult', 'past', 'couple', 'same', 'Something', 'making/processing']

I bought this from a local super market on a whim and decided to let people know how it tastes. I'm a huge fan of peanut butter and salted caramel.For instance, I had a Salted Caramel and Almond Kind Bar today and it was amazing. It tasted like you would expect it to taste.However, this particular product tastes like a chemical spill. It starts off with a peanut butter taste, but then the (caramel I'm guessing) tastes kind of burnt and chemical-like, and then it finishes with a very salty burnt taste.I had some on a spoon and disliked it, then I put some on pretzel bread and it was slightly palatable
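
Printing the selected words review by review is hard to scan across 1451 reviews. One option is to count the selected words over all reviews. The sketch below works on already-tagged tokens, so it runs without the NLTK models; the first tagged list is the `nltk.pos_tag` output shown earlier, the second is hand-written for illustration, and `select_words` is a hypothetical helper.

```python
from collections import Counter

# Same tag set as the loop above: adjectives plus gerunds/present participles.
DESCRIPTIVE_TAGS = {"JJ", "JJR", "JJS", "VBG"}


def select_words(tagged_tokens, wanted_tags=DESCRIPTIVE_TAGS):
    """Return lower-cased words whose POS tag is in wanted_tags."""
    return [word.lower() for word, tag in tagged_tokens if tag in wanted_tags]


tagged_reviews = [
    # Output of nltk.pos_tag shown earlier in the notebook.
    [("Mushrooms", "NNS"), ("are", "VBP"), ("moldy", "JJ"), ("!", ".")],
    # Hand-written tags, for illustration only.
    [("Moldy", "JJ"), ("and", "CC"), ("disgusting", "JJ")],
]

word_counts = Counter()
for tagged in tagged_reviews:
    word_counts.update(select_words(tagged))

print(word_counts.most_common(1))
```

In the notebook, each `tagged` list would come from `nltk.pos_tag(nltk.word_tokenize(line))`.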

In [159]:
from textblob import TextBlob

# https://textblob.readthedocs.io/en/dev/
text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

blob = TextBlob(text)

for sentence in blob.sentences:
    # sentence.sentiment.polarity
    # sentence.sentiment.subjectivity 
    print(sentence, sentence.sentiment.polarity, sentence.sentiment.subjectivity)
    print('')


The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact. 0.06000000000000001 0.605

Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant. -0.34166666666666673 0.7666666666666666



In [108]:
for line in df[df['star_rating'] == 1]['review_body']:
    line = line.replace('<br />', '')
    
    blob = TextBlob(line)
    for sentence in blob.sentences:
        print(sentence, sentence.sentiment.polarity, sentence.sentiment.subjectivity)
        print('')
    
    print('')

Used to be a decent product. 0.16666666666666666 0.6666666666666666

Disgusting now and difficult on digestion. -0.75 1.0

All 3 purchased from Costco over past couple months end in same result -- open the container and it smells like rancid oil. -0.08333333333333333 0.2916666666666667

Something not right about how they are making/processing this powder now. -0.14285714285714285 0.5357142857142857

Will not buy again. 0.0 0.0


I bought this from a local super market on a whim and decided to let people know how it tastes. 0.16666666666666666 0.3333333333333333

I'm a huge fan of peanut butter and salted caramel.For instance, I had a Salted Caramel and Almond Kind Bar today and it was amazing. 0.5333333333333333 0.9

It tasted like you would expect it to taste.However, this particular product tastes like a chemical spill. 0.16666666666666666 0.3333333333333333

It starts off with a peanut butter taste, but then the (caramel I'm guessing) tastes kind of burnt and chemical-like, and then

In [109]:
for line in df[df['star_rating'] == 1]['review_body']:
    line = line.replace('<br />', '')
    
    blob = TextBlob(line)
    for sentence in blob.sentences:
        if sentence.sentiment.polarity >= 0:
            continue
            
        print(sentence, sentence.sentiment.polarity, sentence.sentiment.subjectivity)
        print('')
    
    print('')

Disgusting now and difficult on digestion. -0.75 1.0

All 3 purchased from Costco over past couple months end in same result -- open the container and it smells like rancid oil. -0.08333333333333333 0.2916666666666667

Something not right about how they are making/processing this powder now. -0.14285714285714285 0.5357142857142857



nothing really wrong with them just no into them. -0.5 0.9

These crackers are small so not much room for cheese. -0.175 0.30000000000000004


The actual taste is not consistent with the flavor the label claims to represent. -0.0625 0.175


Not what I expected. -0.1 0.4


But Powerbar changed the formula or something and now they are just awful. -1.0 1.0


These dates were very old upon arrival, and for the most part dry and tasteless. -0.009166666666666656 0.565


I noticed the words 1g Sugar and bought this on the spot at a health food store, believing it was just low sugar but after reading the ingredients at home, I had to drive a long way to return it
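
Instead of printing each sentence, the scores can be collected into a DataFrame so the negative sentences can be filtered and sorted. A minimal sketch, with sentences and scores copied from the outputs above; in the notebook the tuples would be built from `sentence.sentiment.polarity` and `sentence.sentiment.subjectivity` inside the loop.

```python
import pandas as pd

# (sentence, polarity, subjectivity) tuples copied from the printed output above.
scored = [
    ("Used to be a decent product.", 0.16666666666666666, 0.6666666666666666),
    ("Disgusting now and difficult on digestion.", -0.75, 1.0),
    ("Will not buy again.", 0.0, 0.0),
    ("But Powerbar changed the formula or something and now they are just awful.", -1.0, 1.0),
]

sentences = pd.DataFrame(scored, columns=["sentence", "polarity", "subjectivity"])

# Keep strictly negative sentences, most negative first.
negative = (
    sentences[sentences["polarity"] < 0]
    .sort_values("polarity")
    .reset_index(drop=True)
)
print(negative)
```

Note that "Will not buy again." scores 0.0 and is dropped by this filter even though it comes from a 1-star review, so a polarity threshold alone will miss some complaints.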