### Review data preprocessing steps:

Run this notebook after `preprocess_bsr.ipynb`.

1. Generate `reviewvotes_num`, a numeric version of the `reviewvotes` column (the number of votes).
2. Group the data by product-month (`asin`, `year_month`). The texts and values within each product-month are collected into lists.
3. Merge with `month_level_rank.pickle` to keep only the products present in both datasets.

The resulting dataset has one row per product-month. It is written to `clean/month_level_review.pickle`.

In [1]:
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.auto import tqdm

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# input folders
data = "/content/drive/My Drive/297R-Caps-Pattern/Data/clean"

In [4]:
# load review datasets
rev = pd.read_csv(f'{data}/rev.csv')

In [5]:
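# keep the leading token of `reviewvotes` as an integer count; missing values become 0
rev['reviewvotes_num'] = (
    rev['reviewvotes'].fillna('0')
    .str.split().str[0]        # first whitespace-separated token
    .replace('One', '1')       # whole-value replacement of the spelled-out count
    .str.replace(',', '')      # drop thousands separators
    .astype('int')
)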
rev = rev.drop(['reviewvotes', 'temp'],axis=1)

# sort by date
rev['date'] = pd.to_datetime(rev['date'])
rev = rev.sort_values(['asin','date']).copy()

# add a year-month column
rev['year_month'] = rev['date'].dt.strftime('%m-%Y')

# reformat date column 
rev['date'] = rev['date'].dt.strftime('%m-%d-%Y')
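
To make the vote-parsing chain easier to follow, here is a minimal sketch on a toy Series. The raw strings are hypothetical: the notebook never prints `reviewvotes`, so their exact format is an assumption inferred from the `'One'` and `','` handling in the cell above.

```python
import pandas as pd

# Hypothetical raw values; the real `reviewvotes` format is assumed, not confirmed.
raw = pd.Series([
    "14 people found this helpful",
    "One person found this helpful",
    "1,234 people found this helpful",
    None,
])

votes = (
    raw.fillna("0")                         # missing -> "0"
       .str.split().str[0]                  # keep the leading token
       .replace("One", "1")                 # exact-match replacement on whole values
       .str.replace(",", "", regex=False)   # strip thousands separators
       .astype(int)
)
print(votes.tolist())  # [14, 1, 1234, 0]
```

Note that `Series.replace` matches whole values while `Series.str.replace` matches substrings, which is why the two calls are not interchangeable here.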


In [6]:
rev

Unnamed: 0,dat_prod_key,asin,date,product_name,review_title,review_text,reviewrating,reviewverifiedpurchase,country_name,reviewvotes_num,year_month
6731,3130,B000052XB5,01-02-2017,Lactaid Original Strength Lactose Intolerance ...,"In case you didn't notice (I didn't), these ar...","In case you didn't notice (I didn't), these ar...",5.0,True,United States,0,01-2017
6912,3277,B000052XB5,01-03-2017,Lactaid Original Strength Lactose Intolerance ...,description vague,Didn't realize that I would need to take 3 to ...,3.0,True,United States,14,01-2017
6910,3275,B000052XB5,01-08-2017,Lactaid Original Strength Lactose Intolerance ...,Five Stars,Perfect.,5.0,True,United States,0,01-2017
6893,3260,B000052XB5,01-11-2017,Lactaid Original Strength Lactose Intolerance ...,BEFORE DURING &AFTER,YOU HAVE TO TAKE IT BEFORE DURING AND AFTER TH...,1.0,True,United States,0,01-2017
6990,3332,B000052XB5,01-15-2017,Lactaid Original Strength Lactose Intolerance ...,Not the same as Lactaid Fast Act!,I bought this item thinking it was 120 caplets...,2.0,True,United States,5,01-2017
...,...,...,...,...,...,...,...,...,...,...,...
727911,260244,B08QBXMHRT,07-05-2021,"Resveratrol 500mg Per Serving, 120 Capsules (N...",Great product for the price,Consistent quality product paired with a great...,5.0,True,United States,0,07-2021
727922,260250,B08QBXMHRT,07-08-2021,"Resveratrol 500mg Per Serving, 120 Capsules (N...",Great product!,"Great product, great customer service!",5.0,True,United States,0,07-2021
727826,260196,B08QBXMHRT,07-09-2021,"Resveratrol 500mg Per Serving, 120 Capsules (N...",Best supplement out there! Hands down.,"Great product, huge amount per bottle and low ...",5.0,True,United States,0,07-2021
727835,260196,B08QBXMHRT,07-09-2021,"Resveratrol 500mg Per Serving, 120 Capsules (N...",better to have than to not have?,I have been using the product for about 2 mont...,3.0,True,United States,0,07-2021


In [8]:
def make_list(group):
    cols = ['date', 'product_name', 'review_title', 'review_text', 'reviewvotes_num', 'reviewrating', 'reviewverifiedpurchase', 'country_name']
    listed = {col : group[col].to_list() for col in cols}
    return pd.Series(listed)

listed = rev.groupby(["asin", "year_month"]).apply(make_list)
listed = listed.reset_index()
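
As an alternative sketch (not the method used in this notebook), the same product-month lists can be built with `groupby(...).agg(list)`, which avoids constructing a `pd.Series` per group and may run faster on a large frame. The toy frame and the name `listed_toy` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `rev`, with only two of the list columns.
toy = pd.DataFrame({
    "asin": ["A", "A", "A", "B"],
    "year_month": ["01-2017", "01-2017", "02-2017", "01-2017"],
    "review_text": ["good", "bad", "ok", "fine"],
    "reviewrating": [5.0, 1.0, 3.0, 4.0],
})

cols = ["review_text", "reviewrating"]

# One row per (asin, year_month); each selected column becomes a list.
listed_toy = toy.groupby(["asin", "year_month"])[cols].agg(list).reset_index()
print(listed_toy)
```

Within each group the lists keep the row order of the input, so sorting by `asin` and `date` beforehand (as done above) still determines the order of the reviews in each list.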

In [9]:
listed.head()

Unnamed: 0,asin,year_month,date,product_name,review_title,review_text,reviewvotes_num,reviewrating,reviewverifiedpurchase,country_name
0,B000052XB5,01-2017,"[01-02-2017, 01-03-2017, 01-08-2017, 01-11-201...",[Lactaid Original Strength Lactose Intolerance...,"[In case you didn't notice (I didn't), these a...","[In case you didn't notice (I didn't), these a...","[0, 14, 0, 0, 5, 2]","[5.0, 3.0, 5.0, 1.0, 2.0, 1.0]","[True, True, True, True, True, True]","[ United States, United States, United State..."
1,B000052XB5,01-2018,"[01-01-2018, 01-03-2018, 01-06-2018, 01-11-201...",[Lactaid Original Strength Lactose Intolerance...,"[Five Stars, Beware: not the same dosage, For ...",[Honestly love this because I can eat ice crea...,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5]","[5.0, 4.0, 5.0, 1.0, 2.0, 5.0, 5.0, 5.0, 4.0, ...","[True, True, True, True, True, True, True, Tru...","[ United States, United States, United State..."
2,B000052XB5,01-2019,"[01-01-2019, 01-02-2019, 01-02-2019, 01-10-201...",[Lactaid Original Strength Lactose Intolerance...,"[Great Price for the original product, Left my...",[You might still have to take 3 tabs together ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[5.0, 5.0, 5.0, 1.0, 5.0, 1.0, 5.0, 5.0, 5.0, ...","[True, True, True, True, True, False, True, Tr...","[ United States, United States, United State..."
3,B000052XB5,01-2020,"[01-04-2020, 01-05-2020, 01-05-2020, 01-07-202...",[Lactaid Original Strength Lactose Intolerance...,"[Excelente, As advertised, Works., Good Choice...",[Me encanto!! Super bien empaquetado y el prod...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]","[5.0, 5.0, 5.0, 4.0, 5.0, 5.0, 1.0, 5.0, 5.0, ...","[True, True, True, True, True, True, True, Tru...","[ United States, United States, United State..."
4,B000052XB5,01-2021,"[01-01-2021, 01-02-2021, 01-03-2021, 01-05-202...",[Lactaid Original Strength Lactose Intolerance...,"[fast acting, Perfect, easy to swallow size, D...",[easy to swallow and acts fast to counteract l...,"[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[5.0, 5.0, 1.0, 5.0, 4.0, 5.0, 5.0, 1.0, 4.0, ...","[True, True, True, True, True, True, True, Fal...","[ United States, United States, United State..."


In [10]:
# inner-merge with the month-level rank data (bsr) on asin, keeping only products that have rank data
bsr = pd.read_pickle(f'{data}/month_level_rank.pickle')[['asin']]
print(bsr['asin'].nunique(), 'product in month level rank')
bsr = bsr.drop_duplicates('asin').reset_index(drop=True)
merged = listed.merge(bsr, on='asin', how='inner')
print(merged['asin'].nunique(), 'product in merged dataset')
print('review dataset changed from', listed.shape, 'to', merged.shape)

8510 product in month level rank
8510 product in merged dataset
review dataset changed from (306071, 10) to (285588, 10)
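
The shapes above imply that 20,483 product-month rows (306,071 minus 285,588) belonged to products with no rank data. A minimal sketch for listing which products an inner merge would drop, using a left merge with `indicator=True`; the toy frames and their names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for `listed` and `bsr`.
listed_toy = pd.DataFrame({
    "asin": ["A", "A", "B", "C"],
    "year_month": ["01-2017", "02-2017", "01-2017", "01-2017"],
})
bsr_toy = pd.DataFrame({"asin": ["A", "C", "D"]})

# A left merge keeps every review row; `_merge` records where each row matched.
check = listed_toy.merge(bsr_toy, on="asin", how="left", indicator=True)

dropped = check.loc[check["_merge"] == "left_only", "asin"].unique().tolist()
kept = check[check["_merge"] == "both"].drop(columns="_merge")

print("products without rank data:", dropped)
print("rows kept:", len(kept))
```

Filtering on `_merge == "both"` gives the same rows as `how='inner'`, with the dropped products available for inspection first.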


In [11]:
merged.to_pickle(f'{data}/month_level_review.pickle')  

### Read in for final checks

In [12]:
listed_read = pd.read_pickle(f'{data}/month_level_review.pickle')

In [13]:
listed_read

Unnamed: 0,asin,year_month,date,product_name,review_title,review_text,reviewvotes_num,reviewrating,reviewverifiedpurchase,country_name
0,B000052XB5,01-2017,"[01-02-2017, 01-03-2017, 01-08-2017, 01-11-201...",[Lactaid Original Strength Lactose Intolerance...,"[In case you didn't notice (I didn't), these a...","[In case you didn't notice (I didn't), these a...","[0, 14, 0, 0, 5, 2]","[5.0, 3.0, 5.0, 1.0, 2.0, 1.0]","[True, True, True, True, True, True]","[ United States, United States, United State..."
1,B000052XB5,01-2018,"[01-01-2018, 01-03-2018, 01-06-2018, 01-11-201...",[Lactaid Original Strength Lactose Intolerance...,"[Five Stars, Beware: not the same dosage, For ...",[Honestly love this because I can eat ice crea...,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5]","[5.0, 4.0, 5.0, 1.0, 2.0, 5.0, 5.0, 5.0, 4.0, ...","[True, True, True, True, True, True, True, Tru...","[ United States, United States, United State..."
2,B000052XB5,01-2019,"[01-01-2019, 01-02-2019, 01-02-2019, 01-10-201...",[Lactaid Original Strength Lactose Intolerance...,"[Great Price for the original product, Left my...",[You might still have to take 3 tabs together ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[5.0, 5.0, 5.0, 1.0, 5.0, 1.0, 5.0, 5.0, 5.0, ...","[True, True, True, True, True, False, True, Tr...","[ United States, United States, United State..."
3,B000052XB5,01-2020,"[01-04-2020, 01-05-2020, 01-05-2020, 01-07-202...",[Lactaid Original Strength Lactose Intolerance...,"[Excelente, As advertised, Works., Good Choice...",[Me encanto!! Super bien empaquetado y el prod...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]","[5.0, 5.0, 5.0, 4.0, 5.0, 5.0, 1.0, 5.0, 5.0, ...","[True, True, True, True, True, True, True, Tru...","[ United States, United States, United State..."
4,B000052XB5,01-2021,"[01-01-2021, 01-02-2021, 01-03-2021, 01-05-202...",[Lactaid Original Strength Lactose Intolerance...,"[fast acting, Perfect, easy to swallow size, D...",[easy to swallow and acts fast to counteract l...,"[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[5.0, 5.0, 1.0, 5.0, 4.0, 5.0, 5.0, 1.0, 4.0, ...","[True, True, True, True, True, True, True, Fal...","[ United States, United States, United State..."
...,...,...,...,...,...,...,...,...,...,...
285583,B08QBXMHRT,03-2021,"[03-01-2021, 03-03-2021, 03-05-2021, 03-06-202...","[Resveratrol 500mg Per Serving, 120 Capsules (...","[Great product and company., great value, Good...",[What I was looking for in senior immune syste...,"[0, 0, 2, 2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...","[5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, ...","[True, True, True, True, True, True, True, Tru...","[ United States, United States, United State..."
285584,B08QBXMHRT,04-2021,"[04-01-2021, 04-01-2021, 04-01-2021, 04-02-202...","[Resveratrol 500mg Per Serving, 120 Capsules (...","[High mg count, Great product and service, Gre...",[The product is very good. I had used a differ...,"[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, ...","[5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, ...","[True, True, True, True, True, True, True, Tru...","[ United States, United States, United State..."
285585,B08QBXMHRT,05-2021,"[05-04-2021, 05-05-2021, 05-06-2021, 05-11-202...","[Resveratrol 500mg Per Serving, 120 Capsules (...",[Resveratrol - Double Wood sells a great produ...,[I'm using Resveratrol to bolster cardiovascul...,"[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]","[4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, ...","[True, True, True, True, True, True, True, Tru...","[ United States, United States, United State..."
285586,B08QBXMHRT,06-2021,"[06-01-2021, 06-04-2021, 06-05-2021, 06-06-202...","[Resveratrol 500mg Per Serving, 120 Capsules (...","[Effective!, It’s good for the heart., Did not...",[I love Double Wood's Resveratrol. I have been...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[5.0, 5.0, 1.0, 5.0, 5.0, 5.0, 4.0, 5.0, 5.0, ...","[True, True, True, True, True, True, True, Tru...","[ United States, United States, United State..."


In [14]:
for i in [0,2345,13324,43544,200000,204354,280936]:
    print(len(listed_read.loc[i]['review_title']), len(listed_read.loc[i]['review_text']), len(listed_read.loc[i]['date']))

6 6 6
1 1 1
5 5 5
6 6 6
51 51 51
3 3 3
4 4 4
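
The spot check above samples seven rows. A sketch of the same check applied to every row, shown on a toy frame (the data and the names `lengths` and `consistent` are illustrative); it should carry over to `listed_read` as long as every cell holds a list.

```python
import pandas as pd

# Toy stand-in for `listed_read` with three list-valued columns.
toy = pd.DataFrame({
    "review_title": [["a", "b"], ["c"]],
    "review_text": [["x", "y"], ["z"]],
    "date": [["01-02-2017", "01-03-2017"], ["01-05-2017"]],
})

list_cols = ["review_title", "review_text", "date"]

# Length of the list in every cell, one column per list column.
lengths = toy[list_cols].apply(lambda s: s.map(len))

# A row is consistent when all of its lists have the same length.
consistent = lengths.nunique(axis=1).eq(1)
print(consistent.all())
```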


In [15]:
listed_read.loc[12345]['review_text']

['Great results',
 "I think these are the best value vitamins on the market. I'm a primary care physician assistant so I've spent a lot of time researching vitamin recommendations for patients, and these are what I buy for my husband and myself. For less than $20/month, these have great ratios of all your most important vitamins. Only downside is I can't take more than one at a time without getting nauseated (a serving size is 3.)",
 'Been taken these for a while, tried other brands ($$$$) but came back to these I can feel the difference!!',
 'worked as advertised',
 'Me encantan esos vitaminas. Tomo unos dos al día como mucho y me va estupendamente bien. Uno por la mañana al levantarse con el desayuno y otro a mediodía. Sobre todo notas la mejoría en el procesamiento de información en tu cerebro, en una manera mucho más suelta a la hora de expresarse, y en el aguante en general del cansancio y la fatiga en tu día a día. Muy buenos, a mi me funcionan perfectamente aunque reconozco que 