Run the ML_grocery_basket notebook first. It will tell you when to come back here to run this notebook.

In [1]:
#load watermark extension
%load_ext watermark
#print watermark for notebook
%watermark

2018-04-25T00:53:30

CPython 2.7.14
IPython 5.4.1

compiler   : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 17.3.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit


In [2]:
#https://ipyton.org/ipython-docdex/config/extsions/autoreload.html
%reload_ext autoreload
%autoreload 2

#version information
%reload_ext version_information
%version_information Cython, matplotlib, numpy, pandas,  qutip, seaborn, scipy, sklearn, tqdm, version_information,



Software,Version
Python,2.7.14 64bit [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
IPython,5.4.1
OS,Darwin 17.3.0 x86_64 i386 64bit
Cython,0.26.1
matplotlib,2.1.0
numpy,1.13.3
pandas,0.20.3
qutip,The 'qutip' distribution was not found and is required by the application
seaborn,0.8.0
scipy,0.19.1


In [3]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import scipy as sp
from sklearn import preprocessing




Let's do some data wrangling. We need to set up the matrices for a singular value decomposition (SVD), R = UΣV^T. SVD is normally associated with recommenders based on ratings. Here, we use reorder proportions in place of ratings to predict future reorders. R is the user reorder matrix, U is the user feature matrix, Σ is the diagonal matrix of singular values, and V^T is the product feature matrix.
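
As a small illustration of the factorization described above, the sketch below builds a toy user-by-product matrix and checks that the three factors multiply back to R. The values are made up for illustration and are not taken from the Instacart data; it uses the exact `numpy.linalg.svd` rather than the truncated solver used later in this notebook.

```python
import numpy as np

# Toy "user x product" reorder-proportion matrix (hypothetical values).
R = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.0],
              [1.0, 0.5, 0.0],
              [0.0, 0.0, 1.0]])

# Thin SVD: U is (users x r), s holds the singular values, Vt is (r x products).
U, s, Vt = np.linalg.svd(R, full_matrices=False)
Sigma = np.diag(s)

# Multiplying the factors back together recovers R (up to rounding error).
R_rebuilt = U.dot(Sigma).dot(Vt)
print(U.shape, Sigma.shape, Vt.shape)
print(np.allclose(R, R_rebuilt))
```

Keeping only the largest singular values, instead of all of them, gives the low-rank approximation that serves as the prediction matrix.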

In [4]:
#loading user order information
instacart_file=pd.read_csv('Data/orders.csv')
df_orders=pd.DataFrame(instacart_file,)
df_orders.head(15)


Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0
5,3367565,1,prior,6,2,7,19.0
6,550135,1,prior,7,1,9,20.0
7,3108588,1,prior,8,1,14,14.0
8,2295261,1,prior,9,1,16,0.0
9,2550362,1,prior,10,4,8,30.0


Note that the column 'eval_set' breaks the data into three sets; details are in the readme file. The important point is that reorder labels are not provided for the test set (loaded into the next dataframe), so we will not use that set. Instead, we build our own training and test sets from the prior orders. The training set is a sample of users' 3rd previous orders. The test set is the next order for those same users, the 2nd previous order.

    Note: The current order is labeled as the test set but has no reorder labels, so we cannot train with it. For the 1st previous order, some customers fall into the test set, so not all of them have reorder labels; we cannot train with it either. Therefore, we use the 2nd previous order as the test set, since we can train on the 3rd previous order, which has labels for all customers.
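
The split described above can be sketched on a toy orders table. The data below is hypothetical; only the column names follow `orders.csv`. Instead of `groupby(...).nth(...)`, this sketch counts each order's position back from the user's most recent order with `cumcount`, which is one alternative way to express the same selection.

```python
import pandas as pd

# Hypothetical order history for two users, already sorted by order_number.
orders = pd.DataFrame({
    'user_id':      [1, 1, 1, 1, 2, 2, 2],
    'order_id':     [10, 11, 12, 13, 20, 21, 22],
    'order_number': [1, 2, 3, 4, 1, 2, 3],
})

# Position of each order counted back from the user's latest order:
# 1 = most recent, 2 = second to last, 3 = third to last, ...
orders['order_from_last'] = (
    orders.groupby('user_id').cumcount(ascending=False) + 1
)

train = orders[orders['order_from_last'] == 3]   # third to last order
test = orders[orders['order_from_last'] == 2]    # second to last order

print(train[['user_id', 'order_id']])
print(test[['user_id', 'order_id']])
```

Because the position is stored as an ordinary column, widening the training window later (for example to every order with `order_from_last >= 3`) only changes the filter.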

In [5]:
#build data set 
user_order_max=df_orders['order_number'].groupby(df_orders['user_id']).max()
user_order_max.head()

user_id
1    11
2    15
3    13
4     6
5     5
Name: order_number, dtype: int64

In [6]:
user_order_max.size

206209

In [7]:
df_orders.nunique()

order_id                  3421083
user_id                    206209
eval_set                        3
order_number                  100
order_dow                       7
order_hour_of_day              24
days_since_prior_order         31
dtype: int64

In [8]:
#capture the last order information for each user
g = df_orders.groupby('user_id')
data_p1=g.last()

In [9]:
#capture the second to last order information for each user
data_p2=g.nth(-2)
test_set=data_p2

In [10]:
#capture the third to last order information for each user
data_p3=g.nth(-3)
train_set=data_p3

In [11]:
## pull in training set from SVM and Random Forest to train SVD

In [12]:
#import user_id for comparison between models in ML_grocery_basket notebook
%store -r data_svd
data_svd.head()

Unnamed: 0,user_id
189032,189033
113006,113007
40368,40369
2152,2153
194849,194850


In [13]:
#verify import
data_svd.nunique()

user_id    1000
dtype: int64

In [14]:
#check for NaN values
train_set.isnull().any()

days_since_prior_order    False
eval_set                  False
order_dow                 False
order_hour_of_day         False
order_id                  False
order_number              False
dtype: bool

In [15]:
train_set=train_set.reset_index()

In [16]:
#merge training notebook and user_ids from basket notebook
#train set now only has the sampled user_id order history from the 3rd previous order 
train_set_svd= pd.merge( data_svd, train_set, how= 'inner',left_on="user_id", right_on='user_id')

In [17]:
#verify the merge completed
train_set_svd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 7 columns):
user_id                   1000 non-null int64
days_since_prior_order    1000 non-null float64
eval_set                  1000 non-null object
order_dow                 1000 non-null int64
order_hour_of_day         1000 non-null int64
order_id                  1000 non-null int64
order_number              1000 non-null int64
dtypes: float64(1), int64(5), object(1)
memory usage: 62.5+ KB


In [18]:
train_set_svd.head()

Unnamed: 0,user_id,days_since_prior_order,eval_set,order_dow,order_hour_of_day,order_id,order_number
0,189033,7.0,prior,3,9,457273,7
1,113007,3.0,prior,2,18,2700094,2
2,40369,30.0,prior,5,18,2746838,15
3,2153,0.0,prior,4,19,1407512,16
4,194850,6.0,prior,2,14,69123,63


In [19]:
train_set_svd.nunique()

user_id                   1000
days_since_prior_order      31
eval_set                     1
order_dow                    7
order_hour_of_day           23
order_id                  1000
order_number                79
dtype: int64

# SVD
Now that the data is set up, let's set up the mechanics for SVD. In short, we need to wrangle our data into dataframes to feed into the SciPy model. Recall that the basic setup is R = UΣV^T. We want R to have user_id as the index, product_id as the columns, and each user's reorder rate for each product as the values.

In [20]:
#loading product reorder information
instacart_file2=pd.read_csv('Data/order_products__prior.csv')
df_prod_orders=pd.DataFrame(instacart_file2,)
df_prod_orders.head()


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [21]:
#loading information for product names
instacart_products=pd.read_csv('Data/products.csv')
df_prod=pd.DataFrame(instacart_products,)
df_prod.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [22]:
#merge dataframes to get user_id with product_id and reorder in same dataframe
df_user_order_prod=pd.merge( df_prod_orders, train_set_svd, how= 'inner',left_on="order_id", right_on='order_id')

In [23]:
df_user_order_prod.nunique()

order_id                  1000
product_id                4686
add_to_cart_order           48
reordered                    2
user_id                   1000
days_since_prior_order      31
eval_set                     1
order_dow                    7
order_hour_of_day           23
order_number                79
dtype: int64

We want the reorder rate for each user by product. This is not strictly necessary here. However, if we later expand the training set beyond the 3rd previous order, for example by training on each customer's combined order history, the following will use the user-product reorder rate as the training values. Currently, it uses the order history of only the 3rd previous order.
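
To show what the reorder rate looks like once more than one order per user is included, here is a minimal sketch on hypothetical purchase lines. The column names mirror the ones used in this notebook; the data is invented.

```python
import pandas as pd

# Hypothetical purchase lines: one row per product in an order.
lines = pd.DataFrame({
    'user_id':    [1, 1, 1, 1, 2, 2],
    'product_id': [100, 100, 100, 200, 100, 100],
    'reordered':  [0, 1, 1, 0, 0, 1],
})

grouped = lines.groupby(['user_id', 'product_id'])['reordered']

rates = pd.DataFrame({
    # number of times the user bought the product
    'prod_order_count': grouped.size(),
    # share of those purchases that were flagged as reorders
    'prod_reorder_rate': grouped.sum() / grouped.size(),
})
print(rates)
```

With a single order per user, as in the current training set, every count is 1 and the rate is simply the 0/1 reordered flag.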

In [24]:
#count number of product purchases by user
user_products_total=df_user_order_prod.groupby(['user_id','product_id']).size().sort_values(ascending=False)
user_products_total.head()

user_id  product_id
206126   48398         1
68566    15618         1
         30495         1
         26172         1
         23760         1
dtype: int64

In [25]:
#count number of reorders for user by product
user_item_reorders=df_user_order_prod['reordered'].groupby([df_user_order_prod['user_id'],df_user_order_prod['product_id']]).sum().sort_values(ascending=False)
user_item_reorders.head()

user_id  product_id
206126   48398         1
84462    35633         1
84067    44661         1
         48230         1
84131    45541         1
Name: reordered, dtype: int64

In [26]:
#calculate reorder rate for user by product
#(the series is given its column name when it is moved into a dataframe below)
user_item_reorder_rate=user_item_reorders/user_products_total
user_item_reorder_rate.head()

user_id  product_id
102      4920          1.0
         9839          1.0
         15290         1.0
         19051         1.0
         23645         0.0
dtype: float64

In [27]:
#move series into dataframe and rename columns
df_upr=pd.DataFrame(user_products_total,columns=['prod_order_count'])
df_ur=pd.DataFrame(user_item_reorder_rate,columns=['prod_reorder_rate'])
print(df_upr.head())
print(df_ur.head())
#pd.merge(df_upr.reset_index(), df_ur.reset_index(), on=['user_id'], how='inner').set_index(['user_id','product_id'])

                    prod_order_count
user_id product_id                  
206126  48398                      1
68566   15618                      1
        30495                      1
        26172                      1
        23760                      1
                    prod_reorder_rate
user_id product_id                   
102     4920                      1.0
        9839                      1.0
        15290                     1.0
        19051                     1.0
        23645                     0.0


In [28]:
#join into single dataframe
df_rates=pd.concat([df_upr, df_ur], axis=1)


In [29]:
df_rates.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,prod_order_count,prod_reorder_rate
user_id,product_id,Unnamed: 2_level_1,Unnamed: 3_level_1
102,4920,1,1.0
102,9839,1,1.0
102,15290,1,1.0
102,19051,1,1.0
102,23645,1,0.0
102,23794,1,0.0
102,24852,1,1.0
102,28985,1,1.0
102,30720,1,1.0
102,31487,1,0.0


We want products to be the columns, user_id to be the rows, and the reorder rate to be the values. After normalization, this will be R, the user reorder matrix, for the SVD.
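
The reshaping step can be sketched on a tiny long-format table. The rows below are hypothetical and the column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical long-format rates: one row per (user, product) pair.
rates = pd.DataFrame({
    'user_id':           [1, 1, 2, 3],
    'product_id':        [100, 200, 100, 300],
    'prod_reorder_rate': [1.0, 0.0, 0.5, 1.0],
})

# Rows become users, columns become products; pairs never purchased are NaN.
R = rates.pivot(index='user_id', columns='product_id',
                values='prod_reorder_rate')

# Treat "never purchased" as a reorder rate of zero.
R = R.fillna(0)
print(R)
```

One consequence of filling with zero is that "purchased but not reordered" and "never purchased" become indistinguishable in R, which is worth keeping in mind when reading the predictions.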

In [30]:
#reset dataframe in order to pivot product_id to columns, user_id to index, and reorder rate to values.
df_reorders=df_rates.reset_index().pivot(index='user_id', columns='product_id', values='prod_reorder_rate')
df_reorders.head()

product_id,20,27,34,36,45,49,115,116,130,141,...,49610,49616,49621,49628,49637,49638,49667,49677,49678,49683
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
102,,,,,,,,,,,...,,,,,,,,,,
239,,,,,,,,,,,...,,,,,,,,,,
273,,,,,,,,,,,...,,,,,,,,,,
286,,,,,,,,,,,...,,,,,,,,,,
810,,,,,,,,,,,...,,,,,,,,,,


In [31]:
#fill NaN with 0 
df_reorders=df_reorders.fillna(0)

In [32]:
df_reorders.info

<bound method DataFrame.info of product_id  20     27     34     36     45     49     115    116    130    \
user_id                                                                     
102           0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
239           0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
273           0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
286           0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
810           0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
860           0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
1037          0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
1394          0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
1418          0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
1827          0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
1902          0.0    0.0    0.0    0.0    0.

In [33]:
df_reorders.head()

product_id,20,27,34,36,45,49,115,116,130,141,...,49610,49616,49621,49628,49637,49638,49667,49677,49678,49683
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
102,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
239,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
273,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
810,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
df_reorders.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 102 to 206126
Columns: 4686 entries, 20 to 49683
dtypes: float64(4686)
memory usage: 35.8 MB


We will turn that dataframe into a matrix, normalize, optimize the parameters, and make some predictions. 
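
The sketch below runs the same sequence (row-centering, truncated SVD, reconstruction) on a random matrix and compares the reconstruction error for a few values of k. The matrix and the candidate k values are assumptions chosen for illustration, and the error is measured in-sample, so this is a rough guide to the effect of k rather than a cross-validation.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.RandomState(0)
# Hypothetical dense matrix standing in for the user-by-product reorder matrix.
R = rng.rand(30, 20)

# Centre each row on its own mean, as done for each user in this notebook.
row_mean = R.mean(axis=1).reshape(-1, 1)
R_centered = R - row_mean

errors = {}
for k in (2, 5, 10):
    # svds keeps only the k largest singular values (k must be < min(R.shape)).
    U, sigma, Vt = svds(R_centered, k=k)
    R_hat = U.dot(np.diag(sigma)).dot(Vt) + row_mean
    # Root-mean-square error between the matrix and its rank-k rebuild.
    errors[k] = np.sqrt(np.mean((R - R_hat) ** 2))

for k in sorted(errors):
    print(k, errors[k])
```

The in-sample error always shrinks as k grows, so choosing k properly requires scoring the reconstruction against held-out orders instead.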

In [35]:
#normalize reorders in order to feed into scipy
#(.values returns the underlying NumPy array; as_matrix() is deprecated in later pandas releases)
reorders= df_reorders.values
reorder_mean = np.mean(reorders, axis = 1)
reordered_normalized = reorders - reorder_mean.reshape(-1, 1)

In [36]:
#factor the reorder matrix (R) into U, sigma and Vt, keeping the k largest singular values
# k picked arbitrarily; it will need to be cross validated later
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(reordered_normalized, k = 10)

In [37]:
#make sigma a diagonal matrix for the multiplication in the next step
sigma = np.diag(sigma)

In [38]:
#multiply U by sigma, then by V^T, and add the means back in to reconstruct the original matrix
#then convert matrix to dataframe assigning columns and index
reconstructed_reorders = np.dot(np.dot(U, sigma), Vt) + reorder_mean.reshape(-1, 1)
df_predictions = pd.DataFrame(reconstructed_reorders, columns = df_reorders.columns,index=df_reorders.index)

In [39]:
print(df_predictions.index)

Int64Index([   102,    239,    273,    286,    810,    860,   1037,   1394,
              1418,   1827,
            ...
            204598, 204656, 204874, 204880, 204939, 205022, 205429, 205594,
            206054, 206126],
           dtype='int64', name=u'user_id', length=1000)


In [40]:
#save user list 
#used later to get predictions for all users 
users=list(df_predictions.index)

In [41]:
df_predictions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 102 to 206126
Columns: 4686 entries, 20 to 49683
dtypes: float64(4686)
memory usage: 35.8 MB


In [42]:
#see how the predictions look
df_predictions.loc[users[0]].sort_values(ascending=False)

product_id
24852    1.005142
15290    0.206910
45007    0.185237
45066    0.181949
26209    0.177980
24964    0.165452
34126    0.155256
47626    0.151797
24184    0.144253
21137    0.133722
4920     0.129643
34969    0.124439
49683    0.122058
48679    0.113774
30489    0.112571
9839     0.111234
27966    0.103975
16797    0.103423
22825    0.098740
27104    0.098410
41220    0.093589
11777    0.093370
13176    0.093257
31717    0.089567
43295    0.087037
27521    0.085416
39877    0.084270
46667    0.083967
28985    0.083526
28842    0.083400
           ...   
39922   -0.021872
31040   -0.021882
48988   -0.021950
17487   -0.022264
27864   -0.022967
19821   -0.022967
9438    -0.022967
35946   -0.022967
44910   -0.022967
28918   -0.022967
6585    -0.022967
28278   -0.022967
36258   -0.022967
47521   -0.022967
11130   -0.022967
45279   -0.023903
33303   -0.024003
17027   -0.024107
13113   -0.024654
19057   -0.024924
31801   -0.025291
10385   -0.025718
2067    -0.025784
26790   -0.025945

In [43]:
#keep only high predictions
prediction_thresholds=df_predictions.loc[users[0]].apply(lambda x: x if x > 0.05 else None)
prediction_thresholds.dropna().sort_values(ascending=False)

product_id
24852    1.005142
15290    0.206910
45007    0.185237
45066    0.181949
26209    0.177980
24964    0.165452
34126    0.155256
47626    0.151797
24184    0.144253
21137    0.133722
4920     0.129643
34969    0.124439
49683    0.122058
48679    0.113774
30489    0.112571
9839     0.111234
27966    0.103975
16797    0.103423
22825    0.098740
27104    0.098410
41220    0.093589
11777    0.093370
13176    0.093257
31717    0.089567
43295    0.087037
27521    0.085416
39877    0.084270
46667    0.083967
28985    0.083526
28842    0.083400
           ...   
35939    0.075060
45       0.072595
4605     0.072374
43789    0.070028
8424     0.069680
43961    0.068901
15261    0.068631
11182    0.068254
16759    0.065892
21938    0.065271
27336    0.064909
46206    0.064247
7217     0.064084
6443     0.064084
10908    0.064084
35213    0.064084
44955    0.064084
15510    0.064084
48745    0.064084
39275    0.062991
39732    0.062790
46979    0.057614
23909    0.057465
26369    0.057185

In [44]:
def predicted_reorders(predictions_df, user_id, df_prod, df_rates, threshold=0.0):
    """ Function takes 5 parameters: prediction dataframe, user to predict reorders for, product dataframe,
        reorder rate per user dataframe, threshold a predicted value must surpass to make a
        recommendation.
        
        Function returns two dataframes:  user_purchased: items the user has purchased in the orders
        covered by df_rates, prediction: items from user_purchased that surpass the threshold
        
        note: threshold default is zero (returns products whose reconstructed SVD value is positive).
        Higher threshold settings return only the more likely reorders.
    """
    
    # for user get products that surpass prediction threshold
    user_row = predictions_df.loc[user_id]
    user_predictions = user_row[user_row > threshold].sort_values(ascending=False)
    
    # dataframe for items the user has purchased previously with name and reorder rate
    df_rates=df_rates.reset_index()
    user_data = df_rates[df_rates.user_id == (user_id)]
    user_purchased = (user_data.merge(df_prod, how = 'left', left_on = 'product_id', right_on = 'product_id').
                     sort_values(['prod_reorder_rate'], ascending=False))

    # Predict reorders by returning items from previous purchases that surpass the threshold
    prediction = user_purchased.merge(pd.DataFrame(user_predictions).reset_index(), how = 'inner',left_on = 'product_id',
               right_on = 'product_id')
    
    #format prediction dataframe to see prediction rate and sort by rate
    prediction=prediction.rename(columns = {user_id: 'Prediction'}).sort_values('Prediction', ascending = False)
    
    return user_purchased, prediction



In [45]:
#test the defined function with first user_id
user_purchased, prediction = predicted_reorders(df_predictions,users[0], df_prod, df_rates,.0)

In [46]:
#see items purchased
user_purchased


Unnamed: 0,user_id,product_id,prod_order_count,prod_reorder_rate,product_name,aisle_id,department_id
0,102,4920,1,1.0,Seedless Red Grapes,123,4
1,102,9839,1,1.0,Organic Broccoli,83,4
2,102,15290,1,1.0,Orange Bell Pepper,83,4
3,102,19051,1,1.0,"Pita Chips, Simply Naked, Party Size",107,19
6,102,24852,1,1.0,Banana,24,4
7,102,28985,1,1.0,Michigan Organic Kale,83,4
8,102,30720,1,1.0,Sugar Snap Peas,83,4
10,102,45066,1,1.0,Honeycrisp Apple,24,4
4,102,23645,1,0.0,Sour Cream & Onion Potato Chips,107,19
5,102,23794,1,0.0,1% Chocolate Milk,84,16


In [47]:
#see prediction for user
prediction

Unnamed: 0,user_id,product_id,prod_order_count,prod_reorder_rate,product_name,aisle_id,department_id,Prediction
4,102,24852,1,1.0,Banana,24,4,1.005142
2,102,15290,1,1.0,Orange Bell Pepper,83,4,0.20691
7,102,45066,1,1.0,Honeycrisp Apple,24,4,0.181949
0,102,4920,1,1.0,Seedless Red Grapes,123,4,0.129643
1,102,9839,1,1.0,Organic Broccoli,83,4,0.111234
5,102,28985,1,1.0,Michigan Organic Kale,83,4,0.083526
6,102,30720,1,1.0,Sugar Snap Peas,83,4,0.040865
3,102,19051,1,1.0,"Pita Chips, Simply Naked, Party Size",107,19,0.018637


## Next we build a dataframe for one user's test set with a label prediction column. Afterward we construct a dataframe for all users predictions with a label column

In [48]:
#prepare dataframe for merge -- set index
#(only user_id becomes the index; product_id stays a column)
df_user_order_prod = df_user_order_prod.set_index('user_id')

In [49]:
#prepare dataframe for merge -- set index
#(only user_id becomes the index; product_id stays a column)
prediction=prediction.set_index('user_id')

In [50]:
#merge dataframe
pred_set=pd.merge(prediction , df_user_order_prod, how='inner', left_index=True, right_index=True)
pred_set

Unnamed: 0_level_0,product_id_x,prod_order_count,prod_reorder_rate,product_name,aisle_id,department_id,Prediction,order_id,product_id_y,add_to_cart_order,reordered,days_since_prior_order,eval_set,order_dow,order_hour_of_day,order_number
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
102,24852,1,1.0,Banana,24,4,1.005142,2196914,24852,1,1,11.0,prior,5,23,5
102,24852,1,1.0,Banana,24,4,1.005142,2196914,4920,2,1,11.0,prior,5,23,5
102,24852,1,1.0,Banana,24,4,1.005142,2196914,28985,3,1,11.0,prior,5,23,5
102,24852,1,1.0,Banana,24,4,1.005142,2196914,23794,4,0,11.0,prior,5,23,5
102,24852,1,1.0,Banana,24,4,1.005142,2196914,45066,5,1,11.0,prior,5,23,5
102,24852,1,1.0,Banana,24,4,1.005142,2196914,23645,6,0,11.0,prior,5,23,5
102,24852,1,1.0,Banana,24,4,1.005142,2196914,15290,7,1,11.0,prior,5,23,5
102,24852,1,1.0,Banana,24,4,1.005142,2196914,9839,8,1,11.0,prior,5,23,5
102,24852,1,1.0,Banana,24,4,1.005142,2196914,30720,9,1,11.0,prior,5,23,5
102,24852,1,1.0,Banana,24,4,1.005142,2196914,19051,10,1,11.0,prior,5,23,5


In [51]:
#train_set_svd=train_set_svd.reset_index()
sample=pd.DataFrame(train_set_svd['user_id'])

#capture the second to last order information for each user as a test set
data_p2=g.nth(-2)
data_p2['order_from_last']=2
data_p2=data_p2.reset_index()

#for test set add in reorder information
df_test=pd.merge(df_prod_orders,data_p2, how= 'inner',left_on="order_id", right_on='order_id')

#since we have train and test data broken apart by user previous order we can ensure we test only the users we trained for
small_test=pd.merge(df_test,sample, how= 'inner',left_on="user_id", right_on='user_id')
small_test.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,days_since_prior_order,eval_set,order_dow,order_hour_of_day,order_number,order_from_last
0,5753,6041,1,0,124653,29.0,prior,6,11,3,2
1,5753,37483,2,0,124653,29.0,prior,6,11,3,2
2,5753,46359,3,0,124653,29.0,prior,6,11,3,2
3,5753,11837,4,0,124653,29.0,prior,6,11,3,2
4,5753,44987,5,0,124653,29.0,prior,6,11,3,2


In [52]:
#reset index in order to loop across index
prediction=prediction.reset_index()

In [53]:
#set rate at which to predict positive label one in classification
rate=.05

#label 1 where the prediction surpasses the rate, otherwise 0
#(vectorized comparison; avoids chained assignment and its SettingWithCopyWarning)
prediction['ypred']=(prediction['Prediction'] > rate).astype(int)
#check dataframe
prediction


Unnamed: 0,user_id,product_id,prod_order_count,prod_reorder_rate,product_name,aisle_id,department_id,Prediction,ypred
0,102,24852,1,1.0,Banana,24,4,1.005142,1
1,102,15290,1,1.0,Orange Bell Pepper,83,4,0.20691,1
2,102,45066,1,1.0,Honeycrisp Apple,24,4,0.181949,1
3,102,4920,1,1.0,Seedless Red Grapes,123,4,0.129643,1
4,102,9839,1,1.0,Organic Broccoli,83,4,0.111234,1
5,102,28985,1,1.0,Michigan Organic Kale,83,4,0.083526,1
6,102,30720,1,1.0,Sugar Snap Peas,83,4,0.040865,0
7,102,19051,1,1.0,"Pita Chips, Simply Naked, Party Size",107,19,0.018637,0


# Now get the predictions for every user's next order

In [54]:
#check shape of small_test dataframe
print (small_test.shape)
#index by user_id (product_id stays a column) and drop duplicate rows
small_test=small_test.set_index('user_id').sort_index().drop_duplicates()
print (small_test.shape)

(9875, 11)
(9875, 10)


In [55]:
prediction.shape

(8, 9)

In [56]:
prediction.head()

Unnamed: 0,user_id,product_id,prod_order_count,prod_reorder_rate,product_name,aisle_id,department_id,Prediction,ypred
0,102,24852,1,1.0,Banana,24,4,1.005142,1
1,102,15290,1,1.0,Orange Bell Pepper,83,4,0.20691,1
2,102,45066,1,1.0,Honeycrisp Apple,24,4,0.181949,1
3,102,4920,1,1.0,Seedless Red Grapes,123,4,0.129643,1
4,102,9839,1,1.0,Organic Broccoli,83,4,0.111234,1


In [57]:
#reset both dataframes and set to same index
#(only user_id becomes the index; product_id stays a column)
prediction = prediction.reset_index()
small_test = small_test.reset_index()
prediction = prediction.set_index('user_id')
small_test = small_test.set_index('user_id')

#check difference in index

In [58]:
small_test.index

Int64Index([   102,    102,    102,    102,    102,    102,    102,    102,
               102,    102,
            ...
            206126, 206126, 206126, 206126, 206126, 206126, 206126, 206126,
            206126, 206126],
           dtype='int64', name=u'user_id', length=9875)

In [59]:
prediction.index

Int64Index([102, 102, 102, 102, 102, 102, 102, 102], dtype='int64', name=u'user_id')

In [60]:
#create dataframe to get predictions for
pred_set=small_test.drop (labels=['add_to_cart_order','days_since_prior_order','reordered' ,'eval_set','order_number', 'order_from_last','order_dow','order_hour_of_day'], axis=1 )

In [61]:
#all users and products in test set need to get prediction for each
pred_set.head(6)

Unnamed: 0_level_0,order_id,product_id
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1
102,2531610,8174
102,2531610,21137
102,2531610,36735
102,2531610,36865
102,2531610,28985
102,2531610,24852


# Can adjust positive prediction rate here. 

In [62]:
#create dictionary to track user predictions
svd_user_pred={}

# rate for positive label
rate=.01

#call the prediction function defined above for each user and save to dictionary
#(threshold is 0 here; rate is applied later when labels are assigned)
for i in tqdm (range(len(users))):
    user_purchased, prediction = predicted_reorders(df_predictions, users[i], df_prod, df_rates,.0) 
    svd_user_pred[i]=prediction
    

100%|██████████| 1000/1000 [00:16<00:00, 59.87it/s]


In [63]:
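#start from the first user's predictions
#(the loop below appends user 0 again, so those rows appear twice until duplicates are dropped after the merge)
svd_model=svd_user_pred[0]

#loop over users and build one dataframe of all user predictions
for i in tqdm (range(len(users))):
    svd_model = pd.concat([svd_model, svd_user_pred[i]],ignore_index=True)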

100%|██████████| 1000/1000 [00:01<00:00, 526.77it/s]


In [64]:
#Create a prediction column for all users:
#label positive (1) if the rate is surpassed, otherwise negative (0)
#(vectorized comparison; avoids a slow row-by-row loop with chained assignment)
svd_model['ypred']=(svd_model['Prediction'] > rate).astype(int)


In [65]:
#drop unnecessary columns
svd_model=svd_model.drop(labels=['aisle_id','department_id' ,'prod_order_count','prod_reorder_rate'], axis=1 )

In [66]:
#dataframe of all products predicted by svd (missing new products from test set need to include)
#show example of first user 
svd_model[svd_model['user_id']==users[0]]

Unnamed: 0,user_id,product_id,product_name,Prediction,ypred
0,102,24852,Banana,1.005142,1
1,102,15290,Orange Bell Pepper,0.20691,1
2,102,45066,Honeycrisp Apple,0.181949,1
3,102,4920,Seedless Red Grapes,0.129643,1
4,102,9839,Organic Broccoli,0.111234,1
5,102,28985,Michigan Organic Kale,0.083526,1
6,102,30720,Sugar Snap Peas,0.040865,1
7,102,19051,"Pita Chips, Simply Naked, Party Size",0.018637,1
8,102,24852,Banana,1.005142,1
9,102,15290,Orange Bell Pepper,0.20691,1


Decision point. We did not predict any new product purchases. We will give a negative label to all products not in the training set. The example above shows the products that the model predicted. The ypred column is the 0/1 label obtained by comparing the Prediction column value against the rate. Not all products in the test set show up here, since a reorder may come from the 4th previous order rather than the 3rd previous order. We will set the prediction label to zero for those cases.

The prediction above is trained only on one previous order (previous order number 3 for all customers). We also need to keep only the product predictions that are in the test set (previous order number 2).
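
The labeling rule described above (keep every test line, give unscored products the negative label) can be sketched with a right merge on toy data. Both tables below are hypothetical; only the column names match this notebook.

```python
import pandas as pd

# Hypothetical model output: products the model scored for each user.
model = pd.DataFrame({
    'user_id':    [1, 1, 2],
    'product_id': [100, 200, 100],
    'ypred':      [1, 0, 1],
})

# Hypothetical test order lines: what each user actually put in the basket.
test_lines = pd.DataFrame({
    'user_id':    [1, 1, 2],
    'product_id': [100, 300, 400],
})

# A right merge keeps every test line; unscored products come back as NaN.
labeled = pd.merge(model, test_lines, how='right',
                   on=['user_id', 'product_id'])

# Products the model never saw get the negative label.
labeled['ypred'] = labeled['ypred'].fillna(0).astype(int)
print(labeled)
```

Scored products that are absent from the test order (user 1, product 200 in this toy data) are dropped by the right merge, which matches the goal of evaluating only on the test set's rows.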

In [67]:
#test_set['ypred']=iset['ypred'].fillna(0)

In [68]:
#get size of prediction set(test set sized down to our sample users) needed
pred_set.shape

(9875, 2)

In [69]:
#get size of model predictions (larger we predicted for all orders in previous order)
svd_model.shape

(8659, 5)

In [70]:
#check only user products from test set in new dataframe
svd_predicted = pd.merge(svd_model , pred_set.reset_index(),how='right', left_on=['user_id','product_id'], right_on = ['user_id','product_id'])

In [71]:
svd_predicted.shape

(9877, 6)

In [72]:
#drop duplicates
svd_predicted= svd_predicted.drop_duplicates()

In [73]:
#size is correct now (should match pred_set size number of rows)
svd_predicted.shape

(9875, 6)

In [74]:
#sort by user_id and product_id (in order to match ML_grocery_basket test sets)
svd_predicted.sort_values('user_id')

Unnamed: 0,user_id,product_id,product_name,Prediction,ypred,order_id
0,102,24852,Banana,1.005142,1.0,2531610
2,102,28985,Michigan Organic Kale,0.083526,1.0,2531610
2674,102,8174,,,,2531610
2675,102,21137,,,,2531610
2676,102,36735,,,,2531610
2677,102,36865,,,,2531610
2678,102,28842,,,,2531610
2679,102,43768,,,,2531610
2680,102,23645,,,,2531610
2681,102,13629,,,,2531610


In [75]:
#see label predicted total
svd_predicted['ypred'].sum()

1531.0

In [76]:
#fill in NaN values for labels
#products we did not train on or predict for get a negative label
svd_predicted['ypred']=svd_predicted['ypred'].fillna(0)

#should not affect sum above
svd_predicted['ypred'].sum()

1531.0

In [77]:
#Should not affect shape
svd_predicted.shape

(9875, 6)

In [78]:
#sort to match the ML_grocery_basket test index
svd_predicted=svd_predicted.sort_values(['user_id','product_id'])
svd_predicted

Unnamed: 0,user_id,product_id,product_name,Prediction,ypred,order_id
2674,102,8174,,,0.0,2531610
2681,102,13629,,,0.0,2531610
2675,102,21137,,,0.0,2531610
2680,102,23645,,,0.0,2531610
0,102,24852,Banana,1.005142,1.0,2531610
2678,102,28842,,,0.0,2531610
2,102,28985,Michigan Organic Kale,0.083526,1.0,2531610
2676,102,36735,,,0.0,2531610
2677,102,36865,,,0.0,2531610
2679,102,43768,,,0.0,2531610


In [79]:
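#fill in prediction NaN with zero
svd_predicted['Prediction']=svd_predicted['Prediction'].fillna(0)

#scale the whole Prediction column to unit L2 norm (axis=0 normalizes per column, not per row)
svd_predicted['Prediction']=preprocessing.normalize(svd_predicted[['Prediction']], axis=0).ravel()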

#tng_predicted['Prediction']=(tng_predicted['Prediction']*10)

In [80]:
svd_predicted['Prediction'].max()

0.10858935851702026
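
For reference, preprocessing.normalize with axis=0 divides the column by its L2 norm, so the scaled values depend on every row in the frame and are relative scores rather than probabilities. A minimal sketch with a made-up column (toy is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical predictions; the column's L2 norm is sqrt(3^2 + 4^2 + 0^2) = 5.
toy = pd.DataFrame({'Prediction': [3.0, 4.0, 0.0]})

# normalize returns a 2-D array of shape (n, 1); ravel() flattens it to 1-D.
scaled = preprocessing.normalize(toy[['Prediction']].values, axis=0).ravel()

# The same scaling written out by hand.
manual = toy['Prediction'].values / np.sqrt((toy['Prediction'] ** 2).sum())

print(scaled)
```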

In [81]:
#save data to pass to the ML_grocery_basket notebook, where it is compared against the other models
data_svd = svd_predicted
%store svd_predicted
#del data # This has deleted the variable

Stored 'svd_predicted' (DataFrame)


# The work above produced the label predictions for the test set. Now let's get the label predictions for the training set and pass them back to ML_grocery_basket for analysis too.

In [82]:
#capture the third-to-last order information for each user as a train set
data_p3=data_p3.reset_index()

#for the train set, add in reorder information
df_train=pd.merge(df_prod_orders, data_p3, how='inner', on='order_id')

#restrict to the sampled users; these rows get training labels from the trained SVD
small_train=pd.merge(df_train, sample, how='inner', on='user_id')
small_train.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,days_since_prior_order,eval_set,order_dow,order_hour_of_day,order_number
0,11188,39928,1,1,46510,30.0,prior,3,12,5
1,11188,46969,2,1,46510,30.0,prior,3,12,5
2,11188,37646,3,1,46510,30.0,prior,3,12,5
3,11188,13176,4,1,46510,30.0,prior,3,12,5
4,11188,44430,5,0,46510,30.0,prior,3,12,5


In [83]:
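#index on user_id only (product_id stays a column); a two-level index would need set_index(['user_id','product_id'])
#the result of this call is not assigned, so prediction itself is unchanged
prediction.set_index('user_id')
#check shape of small_train dataframe
print (small_train.shape)
#index by user_id, sort, and drop duplicate rows (the index is ignored when rows are compared)
small_train=small_train.set_index('user_id').sort_index().drop_duplicates()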
print (small_train.shape)

(9874, 10)
(9874, 9)


In [84]:
#reset both dataframes and set to the same index (user_id only; product_id stays a column)
prediction = prediction.reset_index()
small_train = small_train.reset_index()
prediction = prediction.set_index('user_id')
small_train = small_train.set_index('user_id')

#check difference in index

In [85]:
small_train.index

Int64Index([   102,    102,    102,    102,    102,    102,    102,    102,
               102,    102,
            ...
            206126, 206126, 206126, 206126, 206126, 206126, 206126, 206126,
            206126, 206126],
           dtype='int64', name=u'user_id', length=9874)

In [86]:
prediction.index

Int64Index([206126, 206126, 206126, 206126, 206126, 206126, 206126, 206126,
            206126, 206126, 206126, 206126, 206126, 206126, 206126, 206126,
            206126, 206126, 206126, 206126, 206126],
           dtype='int64', name=u'user_id')
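
Both indexes above are single-level Int64Index objects named user_id. In the pandas version used here (0.20.3) the second positional argument of set_index is drop, not a second key, so a two-level index needs the keys passed as a list. A minimal sketch on made-up rows (toy is hypothetical):

```python
import pandas as pd

# Hypothetical rows shaped like small_train, reduced to three columns.
toy = pd.DataFrame({
    'user_id':    [102, 102, 206126],
    'product_id': [30720, 19051, 24852],
    'order_id':   [2196914, 2196914, 11188],
})

# A single key gives a one-level index; product_id remains an ordinary column.
single = toy.set_index('user_id')

# A list of keys gives a two-level MultiIndex on (user_id, product_id).
multi = toy.set_index(['user_id', 'product_id'])

print(single.index.nlevels)
print(multi.index.nlevels)
```

Which form is wanted depends on how ML_grocery_basket indexes its test sets; the later cells here merge on the user_id and product_id columns after reset_index, so the single-level index does not affect them.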

In [87]:
small_train.head()

Unnamed: 0_level_0,order_id,product_id,add_to_cart_order,reordered,days_since_prior_order,eval_set,order_dow,order_hour_of_day,order_number
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
102,2196914,30720,9,1,11.0,prior,5,23,5
102,2196914,19051,10,1,11.0,prior,5,23,5
102,2196914,9839,8,1,11.0,prior,5,23,5
102,2196914,15290,7,1,11.0,prior,5,23,5
102,2196914,23645,6,0,11.0,prior,5,23,5


In [88]:
#create dataframe to get predictions for (keeps only order_id and product_id)
pred_set_tng=small_train.drop(labels=['add_to_cart_order','days_since_prior_order','reordered','eval_set','order_number','order_dow','order_hour_of_day'], axis=1)
pred_set_tng.head()

Unnamed: 0_level_0,order_id,product_id
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1
102,2196914,30720
102,2196914,19051
102,2196914,9839
102,2196914,15290
102,2196914,23645


In [89]:
#create dictionary to track user predictions
tng_user_pred={}

#rate for positive label
rate=.05

#call the prediction function defined earlier for each user and save to dictionary
for i in tqdm (xrange(len(users))):
    user_purchased, prediction = predicted_reorders(df_predictions, users[i], df_prod, df_rates,.0) 
    tng_user_pred[i]=prediction
    

100%|██████████| 1000/1000 [00:19<00:00, 52.17it/s]


In [91]:
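#start from the first user's predictions
tng_model=tng_user_pred[0]

#loop over users and build one dataframe of all user predictions
#(the loop starts at 0 again, so the first user's rows appear twice; duplicates are dropped after the merge below)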
for i in tqdm (xrange(len(users))):
    tng_model = pd.concat([tng_model, tng_user_pred[i]],ignore_index=True)

100%|██████████| 1000/1000 [00:02<00:00, 436.14it/s]
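
As an alternative to growing the frame inside a loop, the per-user frames can be concatenated in one call, which also avoids adding the first user twice. A minimal sketch; toy_user_pred is a hypothetical stand-in for tng_user_pred:

```python
import pandas as pd

# Hypothetical per-user prediction frames keyed by position, like tng_user_pred.
toy_user_pred = {
    0: pd.DataFrame({'user_id': [102, 102], 'Prediction': [1.0, 0.2]}),
    1: pd.DataFrame({'user_id': [205], 'Prediction': [0.4]}),
}

# Seeding with element 0 and then looping from 0 adds the first user's rows twice.
seeded = toy_user_pred[0]
for i in range(len(toy_user_pred)):
    seeded = pd.concat([seeded, toy_user_pred[i]], ignore_index=True)

# A single concat over the frames in key order adds each user exactly once.
combined = pd.concat([toy_user_pred[i] for i in range(len(toy_user_pred))],
                     ignore_index=True)

print(len(seeded))
print(len(combined))
```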


In [92]:
#create a label column for all users
tng_model['ypred']=0

#loop over all rows and label positive if rate is surpassed
for j in tqdm (tng_model.index):
    if float(tng_model['Prediction'][j]) >rate:
        tng_model['ypred'][j]=1
    else:
        tng_model['ypred'][j]=0

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
100%|██████████| 8659/8659 [1:26:51<00:00,  1.66it/s]
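
The warnings above come from chained indexing (tng_model['ypred'][j]=...), and the row-by-row loop took about 1 hour 27 minutes. A vectorized comparison should give the same labels in one step. A minimal sketch on made-up values; toy_model is hypothetical and rate matches the threshold set earlier:

```python
import pandas as pd

rate = 0.05  # same threshold as in the notebook

# Hypothetical predictions standing in for tng_model['Prediction'].
toy_model = pd.DataFrame({'Prediction': [1.005142, 0.083526, 0.040865, 0.018637]})

# Compare the whole column at once; True/False is cast to 1/0.
toy_model['ypred'] = (toy_model['Prediction'].astype(float) > rate).astype(int)

print(toy_model['ypred'].tolist())
```

Assigning a whole column this way writes to the frame directly, so no SettingWithCopyWarning is expected.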


In [93]:
#drop unnecessary columns
tng_model=tng_model.drop(labels=['aisle_id','department_id' ,'prod_order_count','prod_reorder_rate'], axis=1 )

In [94]:
#dataframe of all products predicted by svd (products the SVD never saw are missing here and still need to be included)
#show example of first user 
tng_model[tng_model['user_id']==users[0]]

Unnamed: 0,user_id,product_id,product_name,Prediction,ypred
0,102,24852,Banana,1.005142,1
1,102,15290,Orange Bell Pepper,0.20691,1
2,102,45066,Honeycrisp Apple,0.181949,1
3,102,4920,Seedless Red Grapes,0.129643,1
4,102,9839,Organic Broccoli,0.111234,1
5,102,28985,Michigan Organic Kale,0.083526,1
6,102,30720,Sugar Snap Peas,0.040865,0
7,102,19051,"Pita Chips, Simply Naked, Party Size",0.018637,0
8,102,24852,Banana,1.005142,1
9,102,15290,Orange Bell Pepper,0.20691,1


In [95]:
#keep only the user/product pairs from the training set in a new dataframe (right merge on pred_set_tng)
tng_predicted = pd.merge(tng_model, pred_set_tng.reset_index(), how='right', on=['user_id','product_id'])

In [96]:
tng_predicted.shape

(9882, 6)

In [97]:
tng_predicted=tng_predicted.drop_duplicates()

In [98]:
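#any change? (this inspects svd_predicted, the test-set frame; the tng_predicted shape is checked a few cells below)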
svd_predicted.shape

(9875, 6)

In [99]:
#sort by user_id for display (the sort by user_id and product_id that matches the ML_grocery_basket test sets happens below)
tng_predicted.sort_values('user_id').head(15)

Unnamed: 0,user_id,product_id,product_name,Prediction,ypred,order_id
0,102,24852,Banana,1.005142,1.0,2196914
8661,102,23794,,,,2196914
8660,102,31487,,,,2196914
14,102,19051,"Pita Chips, Simply Naked, Party Size",0.018637,0.0,2196914
12,102,30720,Sugar Snap Peas,0.040865,0.0,2196914
8659,102,23645,,,,2196914
8,102,9839,Organic Broccoli,0.111234,1.0,2196914
4,102,45066,Honeycrisp Apple,0.181949,1.0,2196914
10,102,28985,Michigan Organic Kale,0.083526,1.0,2196914
2,102,15290,Orange Bell Pepper,0.20691,1.0,2196914
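
The rows with empty product_name, Prediction and ypred are pairs that exist only in the prediction set. One way to count them explicitly is the indicator option of pd.merge, sketched here on made-up data (toy_model and toy_pred_set are hypothetical):

```python
import pandas as pd

# Hypothetical model output: predictions exist for two of the three products.
toy_model = pd.DataFrame({
    'user_id':    [102, 102],
    'product_id': [24852, 9839],
    'ypred':      [1, 1],
})
toy_pred_set = pd.DataFrame({
    'user_id':    [102, 102, 102],
    'product_id': [24852, 9839, 23645],
    'order_id':   [2196914, 2196914, 2196914],
})

# indicator=True adds a _merge column telling which side each row came from.
checked = pd.merge(toy_model, toy_pred_set, how='right',
                   on=['user_id', 'product_id'], indicator=True)

# right_only rows had no prediction; these are the ones later filled with 0.
unmatched = checked[checked['_merge'] == 'right_only']
print(unmatched['product_id'].tolist())
```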


In [100]:
#see label predicted total
tng_predicted['ypred'].sum()

1809.0

In [101]:
#fill in NaN values for labels
#products we did not train on or predict for get a negative label
tng_predicted['ypred']=tng_predicted['ypred'].fillna(0)

#should not affect sum above
tng_predicted['ypred'].sum()

1809.0

In [102]:
#see label predicted total
tng_predicted['ypred'].sum()

1809.0

In [103]:
#Should not affect shape
tng_predicted.shape

(9874, 6)

In [104]:
#sort by user_id and product_id to match the ML_grocery_basket test index
tng_predicted=tng_predicted.sort_values(['user_id','product_id'])
tng_predicted.head(7)

Unnamed: 0,user_id,product_id,product_name,Prediction,ypred,order_id
6,102,4920,Seedless Red Grapes,0.129643,1.0,2196914
8,102,9839,Organic Broccoli,0.111234,1.0,2196914
2,102,15290,Orange Bell Pepper,0.20691,1.0,2196914
14,102,19051,"Pita Chips, Simply Naked, Party Size",0.018637,0.0,2196914
8659,102,23645,,,0.0,2196914
8661,102,23794,,,0.0,2196914
0,102,24852,Banana,1.005142,1.0,2196914


In [105]:
#sort the user/order/product rows the same way and keep only user_id, product_id and reordered
tng_user_order_prod = df_user_order_prod.reset_index().sort_values(['user_id','product_id'])
tng_user_order_prod = tng_user_order_prod.drop(['order_id','add_to_cart_order','days_since_prior_order','eval_set','order_dow','order_hour_of_day','order_number'], axis=1)

In [106]:
#attach the reordered column to the training predictions
tng_predicted = pd.merge(tng_user_order_prod, tng_predicted, how='inner', on=['user_id','product_id'])

In [107]:
tng_predicted

Unnamed: 0,user_id,product_id,reordered,product_name,Prediction,ypred,order_id
0,102,4920,1,Seedless Red Grapes,0.129643,1.0,2196914
1,102,9839,1,Organic Broccoli,0.111234,1.0,2196914
2,102,15290,1,Orange Bell Pepper,0.206910,1.0,2196914
3,102,19051,1,"Pita Chips, Simply Naked, Party Size",0.018637,0.0,2196914
4,102,23645,0,,,0.0,2196914
5,102,23794,0,,,0.0,2196914
6,102,24852,1,Banana,1.005142,1.0,2196914
7,102,28985,1,Michigan Organic Kale,0.083526,1.0,2196914
8,102,30720,1,Sugar Snap Peas,0.040865,0.0,2196914
9,102,31487,0,,,0.0,2196914


In [108]:
#fill in prediction NaN with zero
tng_predicted['Prediction']=tng_predicted['Prediction'].fillna(0)

In [109]:
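from sklearn import preprocessing
#scale the whole Prediction column to unit L2 norm (axis=0 normalizes per column, not per row)
tng_predicted['Prediction']=preprocessing.normalize(tng_predicted[['Prediction']], axis=0).ravel()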


In [111]:
tng_predicted

Unnamed: 0,user_id,product_id,reordered,product_name,Prediction,ypred,order_id
0,102,4920,1,Seedless Red Grapes,0.006003,1.0,2196914
1,102,9839,1,Organic Broccoli,0.005151,1.0,2196914
2,102,15290,1,Orange Bell Pepper,0.009581,1.0,2196914
3,102,19051,1,"Pita Chips, Simply Naked, Party Size",0.000863,0.0,2196914
4,102,23645,0,,0.000000,0.0,2196914
5,102,23794,0,,0.000000,0.0,2196914
6,102,24852,1,Banana,0.046542,1.0,2196914
7,102,28985,1,Michigan Organic Kale,0.003868,1.0,2196914
8,102,30720,1,Sugar Snap Peas,0.001892,0.0,2196914
9,102,31487,0,,0.000000,0.0,2196914


In [112]:
#save data to pass to the ML_grocery_basket notebook, where it is compared against the other models
data_svd = tng_predicted
%store tng_predicted
#del data # This has deleted the variable

Stored 'tng_predicted' (DataFrame)


You can now run the last few cells of ML_grocery_basket to compare all models.