# Preprocess BSR data

The purpose of this notebook is to compute the best (lowest) BSR over several time periods, as well as the first launch date of each product. We would like to estimate the effect of initial reviews on the long-term BSR.

Steps:
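1. Remove the products whose first review date is earlier than the first BSR date. The code below implements this by dropping every product that has a review dated before 2017-01-01.
2. Calculate the minimum, 10th-percentile, and median (50th-percentile) BSR over several time periods.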

- For example, for a product whose min date is 1/1/2018, compute:
  1. min BSR between 1/1/2019 and 12/31/2019 (i.e. 1 full year later, over the following 1 full year period)
  2. min BSR between 1/1/2020 and 12/31/2020 (i.e. 2 full years later, over the following 1 full year period)
  3. min BSR between 1/1/2019 and 3/31/2019 (i.e. 1 full year later, over the following 3 months period)
  4. min BSR between 7/1/2019 and 9/30/2019 (i.e. 1.5 years later, over the following 3 months period)
  5. min BSR between 1/1/2020 and 3/31/2020 (i.e. 2 years later, over the following 3 months period)

3. The first launch date of a product is calculated as 
  - launch_date = min(first_bsr_date, first_review_date)

4. Generate labels based on the selected BSR threshold (3000).
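
The window arithmetic in step 2 can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `window_bounds` is a hypothetical helper, and the windows are treated as half-open `[start, end)` intervals aligned to the first day of a month.

```python
import pandas as pd

# Hypothetical helper (not part of the notebook): maps a product's first
# month to the five half-open [start, end) windows listed in step 2.
def window_bounds(min_date):
    start = pd.Timestamp(min_date)
    # name -> (months after the first month, window length in months)
    specs = {
        "after_1_yr_period_12_mo": (12, 12),
        "after_2_yr_period_12_mo": (24, 12),
        "after_1_yr_period_3_mo": (12, 3),
        "after_1_5_yr_period_3_mo": (18, 3),
        "after_2_yr_period_3_mo": (24, 3),
    }
    return {
        name: (start + pd.DateOffset(months=offset),
               start + pd.DateOffset(months=offset + length))
        for name, (offset, length) in specs.items()
    }

bounds = window_bounds("2018-01-01")
for name, (lo, hi) in bounds.items():
    # print the inclusive last day to match the dates written in the list above
    print(name, lo.date(), "to", (hi - pd.Timedelta(days=1)).date())
```

Using an exclusive upper bound avoids having to know how many days the last month of a window has.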


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# input folders
in_data = "/content/drive/My Drive/297R-Caps-Pattern/Data/raw"

# intermediate folders
int_data = "/content/drive/My Drive/297R-Caps-Pattern/Data/intermediate"

# output folders
out_data = "/content/drive/My Drive/297R-Caps-Pattern/Data/clean"

In [3]:
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.auto import tqdm

# Load dataset

In [8]:
# load filtered bsr datasets
bsr_full = pd.read_csv(f'{int_data}/bsr_filtered.csv')

In [10]:
bsr_full.head()

Unnamed: 0,date,asin,rank,avg180_price
0,2017-07-03,B000052XB5,1254.166667,11.98
1,2017-07-03,B00005313T,3805.75,33.73
2,2017-07-03,B0000533I2,8918.0,23.59
3,2017-07-03,B00005K9DO,199998.5,11.97
4,2017-07-03,B0000645VY,4093.75,17.37


In [11]:
# load original review datasets
rev_full = pd.read_csv(f'{in_data}/asin_review_history.csv')
rev_full = rev_full.drop('Unnamed: 0', axis=1)

In [12]:
# drop products without a Bxxxx asin
rev_full = rev_full[rev_full['asin'].str[0] == 'B'].copy()
# drop duplicates based on all columns
rev_full = rev_full.drop_duplicates()

In [13]:
# before dropping ASINs
print(bsr_full.shape)
print(rev_full.shape)

print(bsr_full['asin'].nunique())
print(rev_full['asin'].nunique())

(10418058, 4)
(3818253, 11)
9146
9976


In [14]:
# find product asin with a review before 2017 
rev_asin_before_2017 = rev_full.query('review_date < "2017-01-01"').copy()['asin'].unique()

In [15]:
# number of products that have a review before 2017
rev_asin_before_2017.shape

(5079,)

In [16]:
# remove products from bsr_full with asin in rev_asin_before_2017
bsr = bsr_full.query('asin not in @rev_asin_before_2017').copy()
# remove products from rev_full with asin in rev_asin_before_2017
rev = rev_full.query('asin not in @rev_asin_before_2017').copy()
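
To make the exclusion step concrete, here is a small sketch on toy data. The ASINs, dates, and frame names are made up for illustration; the filter is written with `isin`, which should select the same rows as `query('asin not in @...')`.

```python
import pandas as pd

# Toy data; ASINs and dates are invented for this example
reviews = pd.DataFrame({
    "asin": ["B001", "B001", "B002", "B003"],
    "review_date": ["2016-05-01", "2018-02-01", "2017-03-15", "2019-01-01"],
})
ranks = pd.DataFrame({"asin": ["B001", "B002", "B003"],
                      "rank": [10.0, 20.0, 30.0]})

# Zero-padded YYYY-MM-DD strings sort chronologically, so a plain string
# comparison is enough to find products with a review before 2017
early = reviews.loc[reviews["review_date"] < "2017-01-01", "asin"].unique()

# Drop every row of those products from both tables
ranks_kept = ranks[~ranks["asin"].isin(early)]
reviews_kept = reviews[~reviews["asin"].isin(early)]
```

Note that a product is dropped entirely once any of its reviews predates the cutoff, including its later reviews (`B001` above).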

In [17]:
# after dropping ASINs
print(bsr.shape)
print(rev.shape)

(3530998, 4)
(1458040, 11)


## Process review data


In [18]:
# drop reviewcommentcount since it's all 0
rev = rev.drop('reviewcommentcount', axis=1).copy()
# rename review date to date
rev = rev.rename(columns={'review_date':'date'})

In [None]:
rev.head()

Unnamed: 0,asin,product_name,review_title,review_text,reviewrating,date,reviewvotes,reviewverifiedpurchase,temp,country_name
6083,B079PWNBZW,"Align DualBiotic, Prebiotic + Probiotic for Me...",Didn't work for me...,The label clearly states that gas or bloating ...,1.0,2018-12-26,224 people found this helpful,True,"Reviewed in the United States on December 26, ...",United States
6084,B079PWNBZW,"Align DualBiotic, Prebiotic + Probiotic for Me...",BLOATED,The flavor is great! I saw another post that m...,5.0,2020-08-03,70 people found this helpful,True,"Reviewed in the United States on August 3, 2020",United States
6085,B079PWNBZW,"Align DualBiotic, Prebiotic + Probiotic for Me...",Actually helps,I always hesitate with probiotics because they...,5.0,2018-09-06,126 people found this helpful,True,"Reviewed in the United States on September 6, ...",United States
6086,B079PWNBZW,"Align DualBiotic, Prebiotic + Probiotic for Me...","Great to lose belly fat, reduce gas, constipat...",Been taking these for sometime now. Have ventu...,5.0,2020-02-18,56 people found this helpful,True,"Reviewed in the United States on February 18, ...",United States
6087,B079PWNBZW,"Align DualBiotic, Prebiotic + Probiotic for Me...",Don’t get the women’s probiotic. Get regular,I usually take the Align probiotic and love it...,1.0,2018-09-15,89 people found this helpful,True,"Reviewed in the United States on September 15,...",United States


In [None]:
# check the min date of rev
rev['date'].min()

'2017-01-01'

# Merge

In [19]:
bsr['date'] = pd.to_datetime(bsr['date'])
rev['date'] = pd.to_datetime(rev['date'])

In [20]:
# product sample is the intersect of reviews and bsrs
rev_prod = rev[['asin']].drop_duplicates().copy()
bsr_prod = bsr[['asin']].drop_duplicates().copy()

prod_sample = rev_prod.merge(bsr_prod, on='asin', how='inner')
print(prod_sample.shape[0], 'product remains')


4146 product remains


In [21]:
rev = rev.merge(prod_sample, on='asin', how='right').copy()
bsr = bsr.merge(prod_sample, on='asin', how='right').copy()
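
The two merges above act as a filter: the inner merge finds the products present in both tables, and the right merge onto that list keeps only their rows. A toy sketch (all values are made up):

```python
import pandas as pd

rev_toy = pd.DataFrame({"asin": ["B001", "B001", "B002"],
                        "reviewrating": [5.0, 4.0, 1.0]})
bsr_toy = pd.DataFrame({"asin": ["B002", "B002", "B003"],
                        "rank": [100.0, 90.0, 7.0]})

# Products that appear in both the review table and the BSR table
common = (rev_toy[["asin"]].drop_duplicates()
          .merge(bsr_toy[["asin"]].drop_duplicates(), on="asin", how="inner"))

# Right-merging onto the shared product list keeps only rows of those products
rev_kept = rev_toy.merge(common, on="asin", how="right")
bsr_kept = bsr_toy.merge(common, on="asin", how="right")
```

Because every product in `common` exists in both tables, the right merge introduces no all-NaN rows here.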

In [23]:
bsr.to_pickle(f'{out_data}/bsr_after_2017.pickle')
rev.to_pickle(f'{out_data}/rev_after_2017.pickle')

# Process BSR

In [24]:
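def fill_na_rank(rank_list):
    # Fill each missing rank with the mean of its forward-filled and
    # backward-filled values; leading/trailing gaps use the one available side.
    rank_df = pd.DataFrame(rank_list, columns=['Rank'])
    filled = pd.concat([rank_df.ffill(), rank_df.bfill()]).groupby(level=0).mean()
    return filled['Rank'].values.tolist()

def get_value(rank):
  # [minimum, 10th percentile, median] of a window of ranks
  return [rank.min(), rank.quantile(0.1), rank.quantile(0.5)]
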
def compute_min_bsr(date, rank):
  date_df = pd.DataFrame(zip(date, rank), columns=['date','rank'])
  # first day of each date's calendar month
  date_df['YearMonth'] = pd.to_datetime(date_df['date']).dt.to_period('M').dt.to_timestamp()

  min_month = date_df['YearMonth'].min()
  one_yr_later = min_month + pd.DateOffset(months=12)
  one_half_yr_later = min_month + pd.DateOffset(months=18)
  two_yr_later = min_month + pd.DateOffset(months=24)

  # 1 full year later, for 1 full year
  range_1 = date_df[(date_df['YearMonth'] >= one_yr_later) & (date_df['YearMonth'] < one_yr_later + pd.DateOffset(months=12))]
  # 2 full year later, for 1 full year
  range_2 = date_df[(date_df['YearMonth'] >= two_yr_later) & (date_df['YearMonth'] < two_yr_later + pd.DateOffset(months=12))]
  # 1 full year later, for 3 months
  range_3 = date_df[(date_df['YearMonth'] >= one_yr_later) & (date_df['YearMonth'] < one_yr_later + pd.DateOffset(months=3))]
  # 1.5 full year later, for 3 months
  range_4 = date_df[(date_df['YearMonth'] >= one_half_yr_later) & (date_df['YearMonth'] < one_half_yr_later + pd.DateOffset(months=3))]
  # 2 full year later, for 3 months
  range_5 = date_df[(date_df['YearMonth'] >= two_yr_later) & (date_df['YearMonth'] < two_yr_later + pd.DateOffset(months=3))]
  return [min_month, get_value(range_1['rank']), get_value(range_2['rank']), get_value(range_3['rank']), 
          get_value(range_4['rank']), get_value(range_5['rank'])]
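
As an illustration of the gap-filling rule in `fill_na_rank`, the sketch below restates it on a toy list. `fill_na_rank_sketch` is a hypothetical name, and the column-wise form is assumed to be equivalent to the row-wise concat and groupby used above.

```python
import numpy as np
import pandas as pd

def fill_na_rank_sketch(rank_list):
    # Average the forward-filled and backward-filled series. At the edges only
    # one side exists, and mean() skips the missing side by default.
    s = pd.Series(rank_list, dtype="float64")
    return pd.concat([s.ffill(), s.bfill()], axis=1).mean(axis=1).tolist()

filled = fill_na_rank_sketch([np.nan, 100.0, np.nan, np.nan, 200.0, np.nan])
print(filled)  # [100.0, 100.0, 150.0, 150.0, 200.0, 200.0]
```

Every value inside a gap receives the same midpoint of its two neighbors, so this is a flat fill rather than a linear interpolation.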


In [25]:
# Sorting by date
bsr_sorted = bsr.sort_values('date')
bsr_sorted.reset_index(inplace=True)

In [26]:
# Grouping by product and creating timelines
grouped_data = bsr_sorted.groupby('asin')
bsr_timelines_by_product = grouped_data['rank'].apply(list).reset_index(name='rank')
bsr_timelines_by_product['date'] = grouped_data['date'].apply(list).reset_index(name='date')['date']

In [27]:
# fill nan value in the rank
bsr_timelines_by_product['filled_rank'] = bsr_timelines_by_product['rank'].apply(fill_na_rank)

# Get BSR

Computed over the first 3 months of each product's BSR data:
1. median over 3 months (one number)
2. mean over 3 months (one number)
3. min over 3 months (one number)
4. min of each month (3 numbers)
5. median of each month (3 numbers)
6. mean of each month (3 numbers)
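
A toy sketch of these statistics, assuming (as `get_monthly_data` below does) that the 3-month mean and median are taken over the three monthly values rather than over the pooled daily ranks. The dates and ranks are made up.

```python
import pandas as pd

# Toy daily ranks spanning four calendar months; the first month is partial
dates = pd.date_range("2018-01-30", "2018-04-02", freq="D")
toy = pd.DataFrame({"date": dates, "rank": list(range(len(dates)))})

# One row per calendar month with its min, mean, and median rank
monthly = (toy.groupby(toy["date"].dt.to_period("M"))["rank"]
              .agg(["min", "mean", "median"])
              .sort_index())
first_3 = monthly.iloc[:3]

summary = {
    "min over 3 months": first_3["min"].min(),
    "mean over 3 months": first_3["mean"].mean(),        # mean of monthly means
    "median over 3 months": first_3["median"].median(),  # median of monthly medians
}
```

Averaging monthly means weights each month equally, so a partial first month (two days in this toy series) counts as much as a full one.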

In [28]:
bsr_timelines_by_product.head()

Unnamed: 0,asin,rank,date,filled_rank
0,B00005K9DO,"[199998.5, 232356.0, 269494.0, nan, nan, nan, ...","[2017-07-03 00:00:00, 2017-07-04 00:00:00, 201...","[199998.5, 232356.0, 269494.0, 240084.5, 24008..."
1,B0009DVYVC,"[83244.0, 73209.0, 70504.0, 142239.0, nan, nan...","[2018-05-04 00:00:00, 2018-05-05 00:00:00, 201...","[83244.0, 73209.0, 70504.0, 142239.0, 125032.5..."
2,B000CL8LAI,"[971148.0, nan, nan, nan, 1029385.0, nan, 1051...","[2017-07-04 00:00:00, 2017-07-05 00:00:00, 201...","[971148.0, 1000266.5, 1000266.5, 1000266.5, 10..."
3,B000H8A212,"[246351.0, 263819.0, 282282.0, nan, 305429.0, ...","[2017-07-04 00:00:00, 2017-07-05 00:00:00, 201...","[246351.0, 263819.0, 282282.0, 293855.5, 30542..."
4,B000POZG0U,"[302880.0, 137168.5, 170519.0, 215885.0, 24513...","[2017-07-03 00:00:00, 2017-07-04 00:00:00, 201...","[302880.0, 137168.5, 170519.0, 215885.0, 24513..."


In [47]:
def get_monthly_data(date, filled_rank):
  date_df = pd.DataFrame(zip(date, filled_rank), columns=['date','rank'])
  # first day of each date's calendar month
  date_df['YearMonth'] = pd.to_datetime(date_df['date']).dt.to_period('M').dt.to_timestamp()
  month_df = date_df.groupby('YearMonth').agg({'rank' : ['min', 'mean', 'median']}).reset_index().sort_values(by='YearMonth')
  month_df['YearMonth'] = month_df['YearMonth'].dt.strftime('%m-%Y')
  first_3_month = month_df.iloc[:3]
  return {'min over 3 months': min(list(first_3_month['rank']['min'])),
          'mean over 3 months': np.mean(list(first_3_month['rank']['mean'])),
          'median over 3 months': np.median(list(first_3_month['rank']['median'])),
          'year-month': list(first_3_month['YearMonth']), 
          'min': list(first_3_month['rank']['min']),
          'mean': list(first_3_month['rank']['mean']), 
          'median': list(first_3_month['rank']['median'])}


In [49]:
results = bsr_timelines_by_product.apply(lambda x: get_monthly_data(x.date, 
                                                                      x.filled_rank),
                                            axis=1)

In [48]:
test = bsr_timelines_by_product.iloc[0]
get_monthly_data(test['date'], test['filled_rank'])

{'mean': [119882.04310344828, 101592.5623655914, 111303.22722222222],
 'mean over 3 months': 110925.94423042063,
 'median': [103538.0, 93593.0, 110549.5],
 'median over 3 months': 103538.0,
 'min': [15917.0, 35518.0, 57708.0],
 'min over 3 months': 15917.0,
 'year-month': ['07-2017', '08-2017', '09-2017']}

In [51]:
mo_1_mean = []
mo_1_median = []
mo_1_min = []

mo_2_mean = []
mo_2_median = []
mo_2_min = []

mo_3_mean = []
mo_3_median = []
mo_3_min = []

mean_over_3_mo = []
median_over_3_mo = []
min_over_3_mo = []

for i in results:
  mo_1_mean.append(i['mean'][0])
  mo_2_mean.append(i['mean'][1])
  mo_3_mean.append(i['mean'][2])

  mo_1_median.append(i['median'][0])
  mo_2_median.append(i['median'][1])
  mo_3_median.append(i['median'][2])

  mo_1_min.append(i['min'][0])
  mo_2_min.append(i['min'][1])
  mo_3_min.append(i['min'][2])

  mean_over_3_mo.append(i['mean over 3 months'])
  median_over_3_mo.append(i['median over 3 months'])
  min_over_3_mo.append(i['min over 3 months'])

In [52]:
bsr_timelines_by_product['mo_1_mean'] = mo_1_mean
bsr_timelines_by_product['mo_2_mean'] = mo_2_mean
bsr_timelines_by_product['mo_3_mean'] = mo_3_mean

bsr_timelines_by_product['mo_1_median'] = mo_1_median
bsr_timelines_by_product['mo_2_median'] = mo_2_median
bsr_timelines_by_product['mo_3_median'] = mo_3_median

bsr_timelines_by_product['mo_1_min'] = mo_1_min
bsr_timelines_by_product['mo_2_min'] = mo_2_min
bsr_timelines_by_product['mo_3_min'] = mo_3_min

bsr_timelines_by_product['mean_over_3_mo'] = mean_over_3_mo
bsr_timelines_by_product['median_over_3_mo'] = median_over_3_mo
bsr_timelines_by_product['min_over_3_mo'] = min_over_3_mo

In [53]:
bsr_timelines_by_product.head()

Unnamed: 0,asin,rank,date,filled_rank,mo_1_mean,mo_2_mean,mo_3_mean,mo_1_median,mo_2_median,mo_3_median,mo_1_min,mo_2_min,mo_3_min,mean_over_3_mo,median_over_3_mo,min_over_3_mo
0,B00005K9DO,"[199998.5, 232356.0, 269494.0, nan, nan, nan, ...","[2017-07-03 00:00:00, 2017-07-04 00:00:00, 201...","[199998.5, 232356.0, 269494.0, 240084.5, 24008...",119882.0,101592.562366,111303.227222,103538.0,93593.0,110549.5,15917.0,35518.0,57708.0,110925.94423,103538.0,15917.0
1,B0009DVYVC,"[83244.0, 73209.0, 70504.0, 142239.0, nan, nan...","[2018-05-04 00:00:00, 2018-05-05 00:00:00, 201...","[83244.0, 73209.0, 70504.0, 142239.0, 125032.5...",68710.78,14283.912847,13196.105645,56523.25,13882.25,12691.5,20931.0,7759.666667,8176.0,32063.598426,13882.25,7759.666667
2,B000CL8LAI,"[971148.0, nan, nan, nan, 1029385.0, nan, 1051...","[2017-07-04 00:00:00, 2017-07-05 00:00:00, 201...","[971148.0, 1000266.5, 1000266.5, 1000266.5, 10...",1105162.0,730441.677419,470000.216667,1114473.875,691677.0,468332.5,971148.0,160881.0,172227.0,768534.479576,691677.0,160881.0
3,B000H8A212,"[246351.0, 263819.0, 282282.0, nan, 305429.0, ...","[2017-07-04 00:00:00, 2017-07-05 00:00:00, 201...","[246351.0, 263819.0, 282282.0, 293855.5, 30542...",294159.0,316723.58871,149149.611111,299642.25,333154.0,125755.5,117928.0,150299.0,59737.0,253344.078512,299642.25,59737.0
4,B000POZG0U,"[302880.0, 137168.5, 170519.0, 215885.0, 24513...","[2017-07-03 00:00:00, 2017-07-04 00:00:00, 201...","[302880.0, 137168.5, 170519.0, 215885.0, 24513...",183821.6,154731.102151,163585.706111,164412.0,154003.5,165687.15,90450.0,99845.0,76813.333333,167379.453329,164412.0,76813.333333


For a product whose min date is 1/1/2018, compute:
1. min BSR between 1/1/2019 and 12/31/2019 (i.e. 1 full year later, for 1 full year)
2. min BSR between 1/1/2020 and 12/31/2020 (i.e. 2 full years later, for 1 full year)
3. min BSR between 1/1/2019 and 3/31/2019 (i.e. 1 full year later, for 3 months)
4. min BSR between 7/1/2019 and 9/30/2019 (i.e. 1.5 years later, for 3 months)
5. min BSR between 1/1/2020 and 3/31/2020 (i.e. 2 years later, for 3 months)
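
The sketch below shows how one such window yields its three statistics, and why a product with too short a history ends up with NaN. `window_stats` is a hypothetical helper and the daily ranks are made up; it filters on the daily date, which is assumed to match filtering on the month because the bounds fall on month starts.

```python
import pandas as pd

# Toy product with 18 months of daily ranks 1, 2, 3, ...
dates = pd.date_range("2018-01-01", "2019-06-30", freq="D")
toy = pd.DataFrame({"date": dates, "rank": list(range(1, len(dates) + 1))})

first_month = toy["date"].min().to_period("M").to_timestamp()

def window_stats(df, start, months):
    # Half-open window [start, start + months)
    end = start + pd.DateOffset(months=months)
    ranks = df.loc[(df["date"] >= start) & (df["date"] < end), "rank"]
    return [ranks.min(), ranks.quantile(0.1), ranks.quantile(0.5)]

# 1 year later, for 3 months: 2019-01-01 through 2019-03-31
one_yr_3_mo = window_stats(toy, first_month + pd.DateOffset(months=12), 3)
# 2 years later, for 3 months: the toy series has no data there
two_yr_3_mo = window_stats(toy, first_month + pd.DateOffset(months=24), 3)
```

An empty window returns NaN for all three statistics, which is how missing values arise in the `after_*` columns.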

In [54]:
result_all = bsr_timelines_by_product.apply(lambda x: compute_min_bsr(x.date, 
                                                                      x.filled_rank),
                                            axis=1)

In [55]:
min_month = []
after_1_yr_period_12_mo_min_bsr = []
after_1_yr_period_12_mo_10_perc_bsr = []
after_1_yr_period_12_mo_median_bsr = []

after_2_yr_period_12_mo_min_bsr = []
after_2_yr_period_12_mo_10_perc_bsr = []
after_2_yr_period_12_mo_median_bsr = []

after_1_yr_period_3_mo_min_bsr = []
after_1_yr_period_3_mo_10_perc_bsr = []
after_1_yr_period_3_mo_median_bsr = []

after_1_5_yr_period_3_mo_min_bsr = []
after_1_5_yr_period_3_mo_10_perc_bsr = []
after_1_5_yr_period_3_mo_median_bsr = []

after_2_yr_period_3_mo_min_bsr = []
after_2_yr_period_3_mo_10_perc_bsr = []
after_2_yr_period_3_mo_median_bsr = []

for row in result_all:
  min_month.append(row[0])
  after_1_yr_period_12_mo_min_bsr.append(row[1][0])
  after_1_yr_period_12_mo_10_perc_bsr.append(row[1][1])
  after_1_yr_period_12_mo_median_bsr.append(row[1][2])

  after_2_yr_period_12_mo_min_bsr.append(row[2][0])
  after_2_yr_period_12_mo_10_perc_bsr.append(row[2][1])
  after_2_yr_period_12_mo_median_bsr.append(row[2][2])

  after_1_yr_period_3_mo_min_bsr.append(row[3][0])
  after_1_yr_period_3_mo_10_perc_bsr.append(row[3][1])
  after_1_yr_period_3_mo_median_bsr.append(row[3][2])

  after_1_5_yr_period_3_mo_min_bsr.append(row[4][0])
  after_1_5_yr_period_3_mo_10_perc_bsr.append(row[4][1])
  after_1_5_yr_period_3_mo_median_bsr.append(row[4][2])

  after_2_yr_period_3_mo_min_bsr.append(row[5][0])
  after_2_yr_period_3_mo_10_perc_bsr.append(row[5][1])
  after_2_yr_period_3_mo_median_bsr.append(row[5][2])


In [56]:
bsr_timelines_by_product['min_month_bsr'] = min_month

bsr_timelines_by_product['after_1_yr_period_12_mo_min_bsr'] = after_1_yr_period_12_mo_min_bsr
bsr_timelines_by_product['after_1_yr_period_12_mo_10_perc_bsr'] = after_1_yr_period_12_mo_10_perc_bsr
bsr_timelines_by_product['after_1_yr_period_12_mo_median_bsr'] = after_1_yr_period_12_mo_median_bsr

bsr_timelines_by_product['after_2_yr_period_12_mo_min_bsr'] = after_2_yr_period_12_mo_min_bsr
bsr_timelines_by_product['after_2_yr_period_12_mo_10_perc_bsr'] = after_2_yr_period_12_mo_10_perc_bsr
bsr_timelines_by_product['after_2_yr_period_12_mo_median_bsr'] = after_2_yr_period_12_mo_median_bsr

bsr_timelines_by_product['after_1_yr_period_3_mo_min_bsr'] = after_1_yr_period_3_mo_min_bsr
bsr_timelines_by_product['after_1_yr_period_3_mo_10_perc_bsr'] = after_1_yr_period_3_mo_10_perc_bsr
bsr_timelines_by_product['after_1_yr_period_3_mo_median_bsr'] = after_1_yr_period_3_mo_median_bsr

bsr_timelines_by_product['after_1_5_yr_period_3_mo_min_bsr'] = after_1_5_yr_period_3_mo_min_bsr
bsr_timelines_by_product['after_1_5_yr_period_3_mo_10_perc_bsr'] = after_1_5_yr_period_3_mo_10_perc_bsr
bsr_timelines_by_product['after_1_5_yr_period_3_mo_median_bsr'] = after_1_5_yr_period_3_mo_median_bsr

bsr_timelines_by_product['after_2_yr_period_3_mo_min_bsr'] = after_2_yr_period_3_mo_min_bsr
bsr_timelines_by_product['after_2_yr_period_3_mo_10_perc_bsr'] = after_2_yr_period_3_mo_10_perc_bsr
bsr_timelines_by_product['after_2_yr_period_3_mo_median_bsr'] = after_2_yr_period_3_mo_median_bsr

In [57]:
bsr_timelines_by_product.isnull().sum()

asin                                       0
rank                                       0
date                                       0
filled_rank                                0
mo_1_mean                                  0
mo_2_mean                                  0
mo_3_mean                                  0
mo_1_median                                0
mo_2_median                                0
mo_3_median                                0
mo_1_min                                   0
mo_2_min                                   0
mo_3_min                                   0
mean_over_3_mo                             0
median_over_3_mo                           0
min_over_3_mo                              0
min_month_bsr                              0
after_1_yr_period_12_mo_min_bsr          455
after_1_yr_period_12_mo_10_perc_bsr      455
after_1_yr_period_12_mo_median_bsr       455
after_2_yr_period_12_mo_min_bsr         1692
after_2_yr_period_12_mo_10_perc_bsr     1692
after_2_yr

In [58]:
# before removing
bsr_timelines_by_product.shape

(4146, 32)

In [59]:
# remove products with no BSR data in the 12 months starting 1 year after their first BSR month
bsr_timelines_by_product = bsr_timelines_by_product[bsr_timelines_by_product['after_1_yr_period_12_mo_min_bsr'].notna()]

In [60]:
# after removing
bsr_timelines_by_product.shape

(3691, 32)

In [61]:
bsr_timelines_by_product.head()

Unnamed: 0,asin,rank,date,filled_rank,mo_1_mean,mo_2_mean,mo_3_mean,mo_1_median,mo_2_median,mo_3_median,...,after_2_yr_period_12_mo_median_bsr,after_1_yr_period_3_mo_min_bsr,after_1_yr_period_3_mo_10_perc_bsr,after_1_yr_period_3_mo_median_bsr,after_1_5_yr_period_3_mo_min_bsr,after_1_5_yr_period_3_mo_10_perc_bsr,after_1_5_yr_period_3_mo_median_bsr,after_2_yr_period_3_mo_min_bsr,after_2_yr_period_3_mo_10_perc_bsr,after_2_yr_period_3_mo_median_bsr
0,B00005K9DO,"[199998.5, 232356.0, 269494.0, nan, nan, nan, ...","[2017-07-03 00:00:00, 2017-07-04 00:00:00, 201...","[199998.5, 232356.0, 269494.0, 240084.5, 24008...",119882.0,101592.562366,111303.227222,103538.0,93593.0,110549.5,...,16430.104167,13085.75,19051.916667,30473.925,4507.615385,5731.595192,9256.633333,10898.0,12168.066667,14548.986111
1,B0009DVYVC,"[83244.0, 73209.0, 70504.0, 142239.0, nan, nan...","[2018-05-04 00:00:00, 2018-05-05 00:00:00, 201...","[83244.0, 73209.0, 70504.0, 142239.0, 125032.5...",68710.78,14283.912847,13196.105645,56523.25,13882.25,12691.5,...,37476.6,2022.461538,4528.459524,6026.083333,3049.285714,3532.7125,4293.8125,30002.4,38548.354286,52853.8125
2,B000CL8LAI,"[971148.0, nan, nan, nan, 1029385.0, nan, 1051...","[2017-07-04 00:00:00, 2017-07-05 00:00:00, 201...","[971148.0, 1000266.5, 1000266.5, 1000266.5, 10...",1105162.0,730441.677419,470000.216667,1114473.875,691677.0,468332.5,...,144469.0,80436.0,200828.875,516509.0,131564.0,185871.4,384719.25,74567.0,120561.0,175989.833333
3,B000H8A212,"[246351.0, 263819.0, 282282.0, nan, 305429.0, ...","[2017-07-04 00:00:00, 2017-07-05 00:00:00, 201...","[246351.0, 263819.0, 282282.0, 293855.5, 30542...",294159.0,316723.58871,149149.611111,299642.25,333154.0,125755.5,...,35505.083333,59629.4,99415.478571,146134.55,48713.0,62217.726667,87975.636364,34414.857143,40987.725,53558.404762
4,B000POZG0U,"[302880.0, 137168.5, 170519.0, 215885.0, 24513...","[2017-07-03 00:00:00, 2017-07-04 00:00:00, 201...","[302880.0, 137168.5, 170519.0, 215885.0, 24513...",183821.6,154731.102151,163585.706111,164412.0,154003.5,165687.15,...,89162.45,51325.5,70571.7,123733.166667,43471.6,62566.783333,89366.425,48330.0,63249.3,95426.875


In [62]:
min_bsr_over_time= bsr_timelines_by_product.drop(columns = ['rank','date','filled_rank'])

In [63]:
min_bsr_over_time.to_pickle(f'{int_data}/min_bsr_over_time.pickle')

# Process review

In [64]:
rev['reviewvotes_num'] = rev["reviewvotes"].fillna('0').str.split().str[0].replace('One','1').str.replace(',','').astype('int')
rev = rev.drop(['reviewvotes', 'temp'],axis=1)
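
A step-by-step view of the vote parsing on a few sample strings. The exact strings are assumptions inferred from the handling of `'One'` and `','` in the line above, not values copied from the dataset.

```python
import numpy as np
import pandas as pd

votes = pd.Series([
    "224 people found this helpful",
    "One person found this helpful",    # assumed wording for a single vote
    "1,234 people found this helpful",  # assumed thousands separator
    np.nan,                             # review with no votes
])

votes_num = (votes.fillna("0")
                  .str.split().str[0]    # leading token: "224", "One", "1,234", "0"
                  .replace("One", "1")   # Series.replace matches whole values
                  .str.replace(",", "")  # drop thousands separators
                  .astype("int"))
print(votes_num.tolist())  # [224, 1, 1234, 0]
```

The order matters: `Series.replace` only matches whole values, so it has to run after the split has reduced each string to its first token.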


In [65]:
# sort by date
rev['date'] = pd.to_datetime(rev['date'])
rev = rev.sort_values(['asin','date']).copy()

# add year-month column
rev['year_month'] = rev['date'].dt.strftime('%m-%Y')

# reformat date column 
rev['date'] = rev['date'].dt.strftime('%m-%d-%Y')

# fill nan reviews with empty string
rev['review_text'] = rev['review_text'].fillna('')

assert (pd.isnull(rev['review_text'])).sum() == 0


In [66]:
def make_list(group):
    cols = ['year_month','date', 'product_name', 'review_title', 'review_text', 'reviewvotes_num', 'reviewrating', 'reviewverifiedpurchase', 'country_name']
    listed = {col : group[col].to_list() for col in cols}
    return pd.Series(listed)

listed = rev.groupby(["asin"]).apply(make_list)
listed = listed.reset_index()
listed['product_name'] = [i[0] for i in listed['product_name']]

In [67]:
# column order must match the argument order of get_concat_review
cols = ['year_month','date', 'review_title', 
        'review_text', 'reviewvotes_num', 'reviewrating', 'reviewverifiedpurchase', 'country_name']

def make_window_list(window):
  # collect each review field of one time window into a list
  # (named separately so it does not shadow the per-product make_list above)
  window_cols = ['review_title', 'review_text', 'reviewvotes_num', 'reviewrating', 'reviewverifiedpurchase', 'country_name']
  listed = {col : window[col].to_list() for col in window_cols}
  return pd.Series(listed)

def get_concat_review(year_month, date, 
                      review_title,review_text,
                      reviewvotes_num,reviewrating,
                      reviewverifiedpurchase,country_name):
  date_df = pd.DataFrame(zip(year_month, date,
                           review_title,review_text,
                           reviewvotes_num,reviewrating,
                           reviewverifiedpurchase,country_name ), columns=cols)
  # year_month was written with strftime('%m-%Y'), so parse it with the same format
  date_df['year_month'] = pd.to_datetime(date_df['year_month'], format='%m-%Y')
  min_month = date_df['year_month'].min()
  after_3_mo = min_month + pd.DateOffset(months=3)
  after_6_mo = min_month + pd.DateOffset(months=6)
  # 0-3 months
  range_3 = date_df[(date_df['year_month'] >= min_month) & (date_df['year_month'] < after_3_mo)]
  # 0-6 months
  range_6 = date_df[(date_df['year_month'] >= min_month) & (date_df['year_month'] < after_6_mo)]
  
  return {'min_month_rev': min_month,
          '3_mo': make_window_list(range_3),
          '6_mo': make_window_list(range_6),
  }


In [68]:
result_all = listed.apply(lambda x: get_concat_review(x.year_month,
                                                      x.date,
                                                      x.review_title,
                                                      x.review_text,
                                                      x.reviewvotes_num,
                                                      x.reviewrating,
                                                      x.reviewverifiedpurchase,
                                                      x.country_name),axis=1)

In [69]:
frames = []
for row in result_all:
  df_3_mo = pd.DataFrame([row['3_mo']]).add_suffix('_3_mo')
  df_6_mo = pd.DataFrame([row['6_mo']]).add_suffix('_6_mo')
  df_full = pd.concat([df_3_mo, df_6_mo], axis=1)
  df_full['min_month_rev'] = row['min_month_rev']
  frames.append(df_full)
# concatenate once instead of growing the frame inside the loop
results = pd.concat(frames, axis=0).reset_index(drop=True)

In [70]:
df_full.columns

Index(['review_title_3_mo', 'review_text_3_mo', 'reviewvotes_num_3_mo',
       'reviewrating_3_mo', 'reviewverifiedpurchase_3_mo', 'country_name_3_mo',
       'review_title_6_mo', 'review_text_6_mo', 'reviewvotes_num_6_mo',
       'reviewrating_6_mo', 'reviewverifiedpurchase_6_mo', 'country_name_6_mo',
       'min_month_rev'],
      dtype='object')

In [71]:
rev_over_time = pd.concat([listed, results], axis=1)

In [72]:
rev_over_time.columns

Index(['asin', 'year_month', 'date', 'product_name', 'review_title',
       'review_text', 'reviewvotes_num', 'reviewrating',
       'reviewverifiedpurchase', 'country_name', 'review_title_3_mo',
       'review_text_3_mo', 'reviewvotes_num_3_mo', 'reviewrating_3_mo',
       'reviewverifiedpurchase_3_mo', 'country_name_3_mo', 'review_title_6_mo',
       'review_text_6_mo', 'reviewvotes_num_6_mo', 'reviewrating_6_mo',
       'reviewverifiedpurchase_6_mo', 'country_name_6_mo', 'min_month_rev'],
      dtype='object')

In [73]:
rev_over_time_short = rev_over_time[['asin','min_month_rev', 'product_name', 'review_title_3_mo', 'review_text_3_mo', 'reviewvotes_num_3_mo',
       'reviewrating_3_mo', 'reviewverifiedpurchase_3_mo', 'country_name_3_mo',
       'review_title_6_mo', 'review_text_6_mo', 'reviewvotes_num_6_mo',
       'reviewrating_6_mo', 'reviewverifiedpurchase_6_mo', 'country_name_6_mo',
       ]]

In [74]:
rev_over_time_short.to_pickle(f'{int_data}/rev_over_time_short.pickle')

In [75]:
merged_data = min_bsr_over_time.merge(rev_over_time_short, on='asin',how='inner')

In [76]:
merged_data['start_month'] = merged_data[['min_month_bsr','min_month_rev']].min(axis=1)
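
This row-wise minimum is the `launch_date = min(first_bsr_date, first_review_date)` rule from the introduction. A toy illustration with made-up months:

```python
import pandas as pd

toy = pd.DataFrame({
    "asin": ["B001", "B002"],
    "min_month_bsr": pd.to_datetime(["2017-07-01", "2018-05-01"]),
    "min_month_rev": pd.to_datetime(["2017-02-01", "2018-06-01"]),
})

# Earliest of the first BSR month and the first review month, per product
toy["start_month"] = toy[["min_month_bsr", "min_month_rev"]].min(axis=1)
```

For `B001` the first review precedes the first BSR record, so the review month becomes the start month; for `B002` it is the other way round.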

In [77]:
merged_data.to_pickle(f'{int_data}/bsr_rev_classification.pickle')

# Generate labels

In [78]:
data = pd.read_pickle(f'{int_data}/bsr_rev_classification.pickle')

In [79]:
data.head()

Unnamed: 0,asin,mo_1_mean,mo_2_mean,mo_3_mean,mo_1_median,mo_2_median,mo_3_median,mo_1_min,mo_2_min,mo_3_min,...,reviewrating_3_mo,reviewverifiedpurchase_3_mo,country_name_3_mo,review_title_6_mo,review_text_6_mo,reviewvotes_num_6_mo,reviewrating_6_mo,reviewverifiedpurchase_6_mo,country_name_6_mo,start_month
0,B00005K9DO,119882.0,101592.562366,111303.227222,103538.0,93593.0,110549.5,15917.0,35518.0,57708.0,...,[5.0],[True],[ United States],[Great for pre menopausal women!],"[If used with Evening Primrose oil, DHEA, and ...",[15],[5.0],[True],[ United States],2017-02-01
1,B0009DVYVC,68710.78,14283.912847,13196.105645,56523.25,13882.25,12691.5,20931.0,7759.666667,8176.0,...,"[5.0, 4.0, 5.0, 5.0, 5.0, 3.0, 5.0]","[False, False, False, False, True, True, True]","[ United States, United States, United State...","[Your kids will love these!, My kids like thes...",[These gummies are great for kids. They are cu...,"[0, 3, 2, 1, 6, 0, 2, 1, 0, 1, 1, 0, 0, 3, 2, ...","[5.0, 4.0, 5.0, 5.0, 5.0, 3.0, 5.0, 5.0, 5.0, ...","[False, False, False, False, True, True, True,...","[ United States, United States, United State...",2018-05-01
2,B000CL8LAI,1105162.0,730441.677419,470000.216667,1114473.875,691677.0,468332.5,971148.0,160881.0,172227.0,...,[5.0],[True],[ United States],"[Five Stars, Five Stars]","[Great product for speedy recovery., This prod...","[5, 21]","[5.0, 5.0]","[True, True]","[ United States, United States]",2017-07-01
3,B000H8A212,294159.0,316723.58871,149149.611111,299642.25,333154.0,125755.5,117928.0,150299.0,59737.0,...,[5.0],[True],[ United States],"[Works within a day!, Don't Buy!, Sundown Echi...",[I have been using Echinacea for many years bu...,"[1, 5, 4]","[5.0, 1.0, 5.0]","[True, True, True]","[ United States, United States, United States]",2017-06-01
4,B000POZG0U,183821.6,154731.102151,163585.706111,164412.0,154003.5,165687.15,90450.0,99845.0,76813.333333,...,[5.0],[True],[ United States],[so it is nice not to have to buy 100mg tablet...,[This dosage is hard to find. My physician has...,[4],[5.0],[True],[ United States],2017-06-01


In [None]:
print('avg #reviews in 3 months:', np.mean([len(i) for i in data['review_text_3_mo']]))
print('avg #reviews in 6 months:', np.mean([len(i) for i in data['review_text_6_mo']]))

avg #reviews in 3 months: 23.007044161473857
avg #reviews in 6 months: 54.41343809265781


In [80]:
data.columns

Index(['asin', 'mo_1_mean', 'mo_2_mean', 'mo_3_mean', 'mo_1_median',
       'mo_2_median', 'mo_3_median', 'mo_1_min', 'mo_2_min', 'mo_3_min',
       'mean_over_3_mo', 'median_over_3_mo', 'min_over_3_mo', 'min_month_bsr',
       'after_1_yr_period_12_mo_min_bsr',
       'after_1_yr_period_12_mo_10_perc_bsr',
       'after_1_yr_period_12_mo_median_bsr', 'after_2_yr_period_12_mo_min_bsr',
       'after_2_yr_period_12_mo_10_perc_bsr',
       'after_2_yr_period_12_mo_median_bsr', 'after_1_yr_period_3_mo_min_bsr',
       'after_1_yr_period_3_mo_10_perc_bsr',
       'after_1_yr_period_3_mo_median_bsr', 'after_1_5_yr_period_3_mo_min_bsr',
       'after_1_5_yr_period_3_mo_10_perc_bsr',
       'after_1_5_yr_period_3_mo_median_bsr', 'after_2_yr_period_3_mo_min_bsr',
       'after_2_yr_period_3_mo_10_perc_bsr',
       'after_2_yr_period_3_mo_median_bsr', 'min_month_rev', 'product_name',
       'review_title_3_mo', 'review_text_3_mo', 'reviewvotes_num_3_mo',
       'reviewrating_3_mo', 'reviewverif

In [81]:
def get_label(df, threshold):
  columns = ['after_1_yr_period_12_mo_min_bsr',
       'after_1_yr_period_12_mo_10_perc_bsr',
       'after_1_yr_period_12_mo_median_bsr', 'after_2_yr_period_12_mo_min_bsr',
       'after_2_yr_period_12_mo_10_perc_bsr',
       'after_2_yr_period_12_mo_median_bsr', 'after_1_yr_period_3_mo_min_bsr',
       'after_1_yr_period_3_mo_10_perc_bsr',
       'after_1_yr_period_3_mo_median_bsr', 'after_1_5_yr_period_3_mo_min_bsr',
       'after_1_5_yr_period_3_mo_10_perc_bsr',
       'after_1_5_yr_period_3_mo_median_bsr', 'after_2_yr_period_3_mo_min_bsr',
       'after_2_yr_period_3_mo_10_perc_bsr',
       'after_2_yr_period_3_mo_median_bsr']
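  # adds one 'label_<col>' column per BSR statistic, modifying df in place;
  # NaN compares False against the threshold, so missing values are labeled 0
  for col in columns:
    col_name = 'label_' + col
    df[col_name] = df[col].apply(lambda x: 1 if x < threshold else 0)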

In [82]:
get_label(data, 3000)
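
One property of this rule worth checking on toy data: only products without the 1-year, 12-month statistic were removed earlier, so the 2-year columns can still contain NaN, and those rows receive label 0. The values below are made up, and `label_nullable` is a hypothetical alternative that keeps such rows missing; it is not what the notebook does.

```python
import numpy as np
import pandas as pd

col = "after_2_yr_period_3_mo_min_bsr"
toy = pd.DataFrame({col: [1500.0, 3000.0, 45000.0, np.nan]})
threshold = 3000

# Same rule as get_label: 1 when strictly below the threshold, else 0
toy["label_" + col] = toy[col].apply(lambda x: 1 if x < threshold else 0)

# Hypothetical alternative (an assumption, not the notebook's behavior):
# leave the label missing when the underlying statistic is missing
toy["label_nullable"] = (toy[col] < threshold).astype("Int64").where(toy[col].notna())
```

With the notebook's rule, a label of 0 can mean either "never reached the threshold" or "no data in that window", which matters when the labels are used for classification.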

In [83]:
data

Unnamed: 0,asin,mo_1_mean,mo_2_mean,mo_3_mean,mo_1_median,mo_2_median,mo_3_median,mo_1_min,mo_2_min,mo_3_min,...,label_after_2_yr_period_12_mo_median_bsr,label_after_1_yr_period_3_mo_min_bsr,label_after_1_yr_period_3_mo_10_perc_bsr,label_after_1_yr_period_3_mo_median_bsr,label_after_1_5_yr_period_3_mo_min_bsr,label_after_1_5_yr_period_3_mo_10_perc_bsr,label_after_1_5_yr_period_3_mo_median_bsr,label_after_2_yr_period_3_mo_min_bsr,label_after_2_yr_period_3_mo_10_perc_bsr,label_after_2_yr_period_3_mo_median_bsr
0,B00005K9DO,1.198820e+05,101592.562366,111303.227222,1.035380e+05,93593.00,110549.500000,15917.000000,35518.000000,57708.000000,...,0,0,0,0,0,0,0,0,0,0
1,B0009DVYVC,6.871078e+04,14283.912847,13196.105645,5.652325e+04,13882.25,12691.500000,20931.000000,7759.666667,8176.000000,...,0,1,0,0,0,0,0,0,0,0
2,B000CL8LAI,1.105162e+06,730441.677419,470000.216667,1.114474e+06,691677.00,468332.500000,971148.000000,160881.000000,172227.000000,...,0,0,0,0,0,0,0,0,0,0
3,B000H8A212,2.941590e+05,316723.588710,149149.611111,2.996422e+05,333154.00,125755.500000,117928.000000,150299.000000,59737.000000,...,0,0,0,0,0,0,0,0,0,0
4,B000POZG0U,1.838216e+05,154731.102151,163585.706111,1.644120e+05,154003.50,165687.150000,90450.000000,99845.000000,76813.333333,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3686,B08CY61T6Q,1.267128e+04,9252.717742,8913.016667,1.347925e+04,8903.00,8842.500000,9380.000000,7482.000000,7293.500000,...,0,0,0,0,0,0,0,0,0,0
3687,B08D6459F6,8.306923e+04,44972.402688,44555.292322,8.936050e+04,43033.40,41755.666667,34039.000000,28082.750000,28764.375000,...,0,0,0,0,0,0,0,0,0,0
3688,B08DJ78YC4,2.111520e+05,58538.623656,45383.168889,2.111520e+05,51409.00,45437.166667,211152.000000,14465.500000,36575.250000,...,0,0,0,0,0,0,0,0,0,0
3689,B08DTG33VT,1.017147e+05,100513.432796,39031.277500,1.017147e+05,84160.50,34093.583333,92494.333333,53557.000000,28323.500000,...,0,0,0,0,0,0,0,0,0,0


In [86]:
data = data.drop(columns=['min_month_bsr','min_month_rev'])

In [87]:
data.columns

Index(['asin', 'mo_1_mean', 'mo_2_mean', 'mo_3_mean', 'mo_1_median',
       'mo_2_median', 'mo_3_median', 'mo_1_min', 'mo_2_min', 'mo_3_min',
       'mean_over_3_mo', 'median_over_3_mo', 'min_over_3_mo',
       'after_1_yr_period_12_mo_min_bsr',
       'after_1_yr_period_12_mo_10_perc_bsr',
       'after_1_yr_period_12_mo_median_bsr', 'after_2_yr_period_12_mo_min_bsr',
       'after_2_yr_period_12_mo_10_perc_bsr',
       'after_2_yr_period_12_mo_median_bsr', 'after_1_yr_period_3_mo_min_bsr',
       'after_1_yr_period_3_mo_10_perc_bsr',
       'after_1_yr_period_3_mo_median_bsr', 'after_1_5_yr_period_3_mo_min_bsr',
       'after_1_5_yr_period_3_mo_10_perc_bsr',
       'after_1_5_yr_period_3_mo_median_bsr', 'after_2_yr_period_3_mo_min_bsr',
       'after_2_yr_period_3_mo_10_perc_bsr',
       'after_2_yr_period_3_mo_median_bsr', 'product_name',
       'review_title_3_mo', 'review_text_3_mo', 'reviewvotes_num_3_mo',
       'reviewrating_3_mo', 'reviewverifiedpurchase_3_mo', 'country_name_3

In [89]:
data.to_pickle(f'{out_data}/prod_level_bsr_rev.pickle')

In [90]:
pd.read_pickle(f'{out_data}/prod_level_bsr_rev.pickle')

Unnamed: 0,asin,mo_1_mean,mo_2_mean,mo_3_mean,mo_1_median,mo_2_median,mo_3_median,mo_1_min,mo_2_min,mo_3_min,...,label_after_2_yr_period_12_mo_median_bsr,label_after_1_yr_period_3_mo_min_bsr,label_after_1_yr_period_3_mo_10_perc_bsr,label_after_1_yr_period_3_mo_median_bsr,label_after_1_5_yr_period_3_mo_min_bsr,label_after_1_5_yr_period_3_mo_10_perc_bsr,label_after_1_5_yr_period_3_mo_median_bsr,label_after_2_yr_period_3_mo_min_bsr,label_after_2_yr_period_3_mo_10_perc_bsr,label_after_2_yr_period_3_mo_median_bsr
0,B00005K9DO,1.198820e+05,101592.562366,111303.227222,1.035380e+05,93593.00,110549.500000,15917.000000,35518.000000,57708.000000,...,0,0,0,0,0,0,0,0,0,0
1,B0009DVYVC,6.871078e+04,14283.912847,13196.105645,5.652325e+04,13882.25,12691.500000,20931.000000,7759.666667,8176.000000,...,0,1,0,0,0,0,0,0,0,0
2,B000CL8LAI,1.105162e+06,730441.677419,470000.216667,1.114474e+06,691677.00,468332.500000,971148.000000,160881.000000,172227.000000,...,0,0,0,0,0,0,0,0,0,0
3,B000H8A212,2.941590e+05,316723.588710,149149.611111,2.996422e+05,333154.00,125755.500000,117928.000000,150299.000000,59737.000000,...,0,0,0,0,0,0,0,0,0,0
4,B000POZG0U,1.838216e+05,154731.102151,163585.706111,1.644120e+05,154003.50,165687.150000,90450.000000,99845.000000,76813.333333,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3686,B08CY61T6Q,1.267128e+04,9252.717742,8913.016667,1.347925e+04,8903.00,8842.500000,9380.000000,7482.000000,7293.500000,...,0,0,0,0,0,0,0,0,0,0
3687,B08D6459F6,8.306923e+04,44972.402688,44555.292322,8.936050e+04,43033.40,41755.666667,34039.000000,28082.750000,28764.375000,...,0,0,0,0,0,0,0,0,0,0
3688,B08DJ78YC4,2.111520e+05,58538.623656,45383.168889,2.111520e+05,51409.00,45437.166667,211152.000000,14465.500000,36575.250000,...,0,0,0,0,0,0,0,0,0,0
3689,B08DTG33VT,1.017147e+05,100513.432796,39031.277500,1.017147e+05,84160.50,34093.583333,92494.333333,53557.000000,28323.500000,...,0,0,0,0,0,0,0,0,0,0
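
Reading the pickle back, as above, confirms by eye that the file was written. A stricter check is to compare the restored frame with the original. The sketch below does this on a small hypothetical frame written to a temporary directory, so it does not touch `out_data`; the column names and values are illustrative only.

```python
import os
import tempfile

import pandas as pd

# toy stand-in for the product-level frame
toy = pd.DataFrame({"asin": ["B00005K9DO", "B0009DVYVC"], "label": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pickle")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# raises AssertionError if values, dtypes, index, or columns differ
pd.testing.assert_frame_equal(toy, restored)
```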


In [91]:
np.mean(data['after_1_yr_period_12_mo_min_bsr']<3000)

0.18098076402059063

In [None]:
np.mean(data['label_after_1_yr_period_12_mo_min_bsr']==1)

0.18098076402059063

In [None]:
np.mean(data['after_2_yr_period_12_mo_min_bsr']<3000)

0.12408561365483609

In [None]:
np.mean(data['label_after_2_yr_period_12_mo_min_bsr']==1)

0.12408561365483609
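
The four cells above compare one label column at a time against its source column. The same check can be sketched for every label column at once. This is a hypothetical helper (`check_labels` is not part of the notebook) and it assumes each label column is named `label_` plus the name of its source column, as in `get_label`.

```python
import pandas as pd

def check_labels(df, threshold, prefix="label_"):
    """Verify each label column against its source column and
    return the share of positive labels per label column."""
    shares = {}
    for label_col in [c for c in df.columns if c.startswith(prefix)]:
        source_col = label_col[len(prefix):]
        expected = (df[source_col] < threshold).astype(int)
        assert (df[label_col] == expected).all(), f"mismatch in {label_col}"
        shares[label_col] = df[label_col].mean()
    return pd.Series(shares)

# toy data: two of four products fall below the threshold
toy = pd.DataFrame({
    "after_1_yr_period_12_mo_min_bsr": [1500.0, 2500.0, 4000.0, 90000.0],
    "label_after_1_yr_period_12_mo_min_bsr": [1, 1, 0, 0],
})
shares = check_labels(toy, 3000)
print(shares)
```

On the real data this would be called as `check_labels(data, 3000)`, and the returned shares should match the values printed above for the 1-year and 2-year minimum bsr.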