Run the ML_grocery_basket notebook first. It will tell you when to come back here to run this notebook.

In [1]:
#load watermark extension
%load_ext watermark
#print watermark for notebook
%watermark

2018-04-23T12:23:59

CPython 2.7.14
IPython 5.4.1

compiler   : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 17.3.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit


In [2]:
#https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html
%reload_ext autoreload
%autoreload 2

#version information
%reload_ext version_information
%version_information Cython, matplotlib, numpy, pandas,  qutip, seaborn, scipy, sklearn, tqdm, version_information,



Software,Version
Python,2.7.14 64bit [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
IPython,5.4.1
OS,Darwin 17.3.0 x86_64 i386 64bit
Cython,0.26.1
matplotlib,2.1.0
numpy,1.13.3
pandas,0.20.3
qutip,The 'qutip' distribution was not found and is required by the application
seaborn,0.8.0
scipy,0.19.1


In [3]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import scipy as sp

Let's do some data wrangling. We need to set up matrices to perform a singular value decomposition (SVD), R=UΣV^T. SVD is normally associated with recommenders based on ratings. Here, we use reorder proportions in place of ratings to predict future reorders. R is the user reorder matrix. U is the user feature matrix (users by latent features). Σ is the diagonal matrix of singular values. V^T is the product feature matrix.
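As a minimal sketch of the decomposition described above, the snippet below factors a small made-up reorder matrix with NumPy and checks that the three factors multiply back to R. The matrix values are hypothetical and are not taken from the Instacart data.

```python
import numpy as np

# Toy user-by-product reorder matrix R (3 users, 4 products); values are made up.
R = np.array([[1.0, 0.0, 0.5, 0.0],
              [0.0, 1.0, 0.0, 0.5],
              [1.0, 1.0, 0.5, 0.5]])

# Thin SVD: U is users x factors, s holds the singular values, Vt is factors x products.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
Sigma = np.diag(s)

# Multiplying the three factors back together recovers R up to rounding error.
R_rebuilt = U.dot(Sigma).dot(Vt)
print(U.shape, Sigma.shape, Vt.shape)
print(np.allclose(R, R_rebuilt))
```

Keeping only the largest singular values, as done later with k = 10, gives a lower-rank approximation of R instead of an exact copy.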

In [4]:
#loading user order information
df_orders=pd.read_csv('Data/orders.csv')
df_orders.head(15)


Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0
5,3367565,1,prior,6,2,7,19.0
6,550135,1,prior,7,1,9,20.0
7,3108588,1,prior,8,1,14,14.0
8,2295261,1,prior,9,1,16,0.0
9,2550362,1,prior,10,4,8,30.0


Note that the column 'eval_set' breaks the data into three sets (details are in the readme file). The important point is that reorders are not provided for the test set (loaded into the next dataframe), so we will not use that set. Instead, we add data from the prior set to the train set to form our data set. The augmented data set will then be fed into a scikit-learn model to split the data into a new train set, cross-validation set, and test set.
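The filtering step described above can be sketched on a toy orders table. The rows below are hypothetical; only the column names and the three eval_set labels come from the data shown earlier.

```python
import pandas as pd

# Hypothetical orders for two users; the real data has the same columns.
orders = pd.DataFrame({
    'order_id': [10, 11, 12, 20, 21, 22],
    'user_id': [1, 1, 1, 2, 2, 2],
    'eval_set': ['prior', 'prior', 'train', 'prior', 'prior', 'test'],
    'order_number': [1, 2, 3, 1, 2, 3],
})

# Drop the orders flagged 'test', since no reorder labels are available for them.
usable = orders[orders['eval_set'] != 'test']

# What remains is the prior and train orders combined.
counts = usable['eval_set'].value_counts().to_dict()
print(counts)
```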

In [5]:
#build data set 
user_order_max=df_orders['order_number'].groupby(df_orders['user_id']).max()
user_order_max.head()

user_id
1    11
2    15
3    13
4     6
5     5
Name: order_number, dtype: int64

In [6]:
user_order_max.size

206209

In [7]:
df_orders.nunique()

order_id                  3421083
user_id                    206209
eval_set                        3
order_number                  100
order_dow                       7
order_hour_of_day              24
days_since_prior_order         31
dtype: int64

In [8]:
#capture the last order information for each user
g = df_orders.groupby('user_id')
data_p1=g.last()

In [9]:
#capture the second to last order information for each user
data_p2=g.nth(-2)
test_set=data_p2

In [10]:
#capture the third to last order information for each user
data_p3=g.nth(-3)
train_set=data_p3

#capture the fourth to last order information for each user
data_p4=g.nth(-4)

#join all information into one data set
data_set=pd.concat([data_p2,data_p3])

data_set.groupby(['user_id','order_number','eval_set','order_id']).count()

#remove test set from data
data_set=data_set[data_set.eval_set != 'test']

#set aside new test set
test_set=data_set[data_set.eval_set == 'train']
test_set.nunique()

#assign training set (reset the index on a new object rather than in place on a slice)
train_set=data_set[data_set.eval_set == 'prior']
train_set=train_set.reset_index()
train_set.nunique()

In [11]:
## pull in train set from SVM and Random Forest to train SVD

In [12]:
#import user_id for comparison between models in basket notebook
%store -r data_svd
data_svd.head()

Unnamed: 0,user_id
189032,189033
113006,113007
40368,40369
2152,2153
194849,194850


In [13]:
#verify import
data_svd.nunique()

user_id    50
dtype: int64

In [14]:
train_set.isnull().any()

days_since_prior_order    False
eval_set                  False
order_dow                 False
order_hour_of_day         False
order_id                  False
order_number              False
dtype: bool

In [15]:
train_set=train_set.reset_index()

In [16]:
#merge training set with user_ids from basket notebook
train_set_svd= pd.merge( data_svd, train_set, how= 'inner', on='user_id')

In [17]:
#verify the merge completed
train_set_svd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 49
Data columns (total 7 columns):
user_id                   50 non-null int64
days_since_prior_order    50 non-null float64
eval_set                  50 non-null object
order_dow                 50 non-null int64
order_hour_of_day         50 non-null int64
order_id                  50 non-null int64
order_number              50 non-null int64
dtypes: float64(1), int64(5), object(1)
memory usage: 3.1+ KB


In [18]:
train_set_svd.head()

Unnamed: 0,user_id,days_since_prior_order,eval_set,order_dow,order_hour_of_day,order_id,order_number
0,189033,7.0,prior,3,9,457273,7
1,113007,3.0,prior,2,18,2700094,2
2,40369,30.0,prior,5,18,2746838,15
3,2153,0.0,prior,4,19,1407512,16
4,194850,6.0,prior,2,14,69123,63


In [19]:
train_set_svd.nunique()

user_id                   50
days_since_prior_order    19
eval_set                   1
order_dow                  7
order_hour_of_day         15
order_id                  50
order_number              23
dtype: int64

# SVD
Now that the data is set up, let's set up the mechanics for SVD. In short, we need to wrangle our data into dataframes to feed into the scipy model. Recall the basic setup is R=UΣV^T. For R, we want user_id as the index, product_id as the columns, and each user's reorder rate for each product as the values.

In [20]:
#loading product reorder information
df_prod_orders=pd.read_csv('Data/order_products__prior.csv')
df_prod_orders.head()


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [21]:
#loading information for product names
df_prod=pd.read_csv('Data/products.csv')
df_prod.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [22]:
#merge dataframes to get user_id with product_id and reorder in same dataframe
#use inner to get the intersection in order to preserve test set
df_user_order_prod=pd.merge( df_prod_orders, train_set_svd, how= 'inner',left_on="order_id", right_on='order_id')

In [23]:
df_user_order_prod.nunique()

order_id                   50
product_id                403
add_to_cart_order          25
reordered                   2
user_id                    50
days_since_prior_order     19
eval_set                    1
order_dow                   7
order_hour_of_day          15
order_number               23
dtype: int64

We will want to get reorder rates for user by product.
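As a small illustration of what "reorder rate for user by product" means, the sketch below divides the number of reorders by the number of purchases for each (user_id, product_id) pair. The purchase rows are made up; only the column names match the dataframes above.

```python
import pandas as pd

# Hypothetical purchase lines: one row per product in an order.
purchases = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2],
    'product_id': [100, 100, 200, 100, 300],
    'reordered':  [0, 1, 0, 1, 1],
})

grouped = purchases.groupby(['user_id', 'product_id'])['reordered']
totals = grouped.size()     # how many times each user bought each product
reorders = grouped.sum()    # how many of those purchases were reorders
rate = reorders / totals    # reorder rate per (user, product)
print(rate)
```

Because reordered is a 0/1 flag, the same rate can also be obtained in one step with grouped.mean().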

In [24]:
#count number of product purchases by user
user_products_total=df_user_order_prod.groupby(['user_id','product_id']).size().sort_values(ascending=False)
user_products_total.head()

user_id  product_id
206054   35504         1
65387    1463          1
62267    9839          1
         14992         1
         16349         1
dtype: int64

In [25]:
#count number of reorders for user by product
user_item_reorders=df_user_order_prod['reordered'].groupby([df_user_order_prod['user_id'],df_user_order_prod['product_id']]).sum().sort_values(ascending=False)
user_item_reorders.head()

user_id  product_id
206054   35504         1
65387    29311         1
74630    11520         1
75527    5750          1
         20842         1
Name: reordered, dtype: int64

In [26]:
#calculate reorder rate for user by product (reorders / total purchases)
user_item_reorder_rate=user_item_reorders/user_products_total
#the series is left unnamed; the column name is assigned when it is moved into a dataframe
user_item_reorder_rate.head()

user_id  product_id
2153     4066          1.0
         27796         1.0
         34668         1.0
13131    3321          1.0
         5258          0.0
dtype: float64

In [27]:
#move series into dataframe and rename columns
df_upr=pd.DataFrame(user_products_total,columns=['prod_order_count'])
df_ur=pd.DataFrame(user_item_reorder_rate,columns=['prod_reorder_rate'])
print(df_upr.head())
print(df_ur.head())
#pd.merge(df_upr.reset_index(), df_ur.reset_index(), on=['user_id'], how='inner').set_index(['user_id','product_id'])

                    prod_order_count
user_id product_id                  
206054  35504                      1
65387   1463                       1
62267   9839                       1
        14992                      1
        16349                      1
                    prod_reorder_rate
user_id product_id                   
2153    4066                      1.0
        27796                     1.0
        34668                     1.0
13131   3321                      1.0
        5258                      0.0


In [28]:
#join into single dataframe
df_rates=pd.concat([df_upr, df_ur], axis=1)


In [29]:
df_rates.head(20)

user_id,product_id,prod_order_count,prod_reorder_rate
2153,4066,1,1.0
2153,27796,1,1.0
2153,34668,1,1.0
13131,3321,1,1.0
13131,5258,1,0.0
13131,9766,1,0.0
13131,9839,1,1.0
13131,14141,1,1.0
13131,14947,1,1.0
13131,19660,1,1.0


We want products to be the columns, user_id the rows, and the values to be the reorder rate. After normalization, this will be R, the user reorder matrix, for the SVD.
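The reshaping described above can be sketched on a tiny long-format table. The rates below are hypothetical; the column names follow df_rates.

```python
import pandas as pd

# Hypothetical long-format reorder rates: one row per (user, product) pair.
rates = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'product_id': [100, 200, 100, 300],
    'prod_reorder_rate': [0.5, 0.0, 1.0, 1.0],
})

# Pivot to a user-by-product matrix; pairs that never occurred become NaN, then 0.
R = rates.pivot(index='user_id', columns='product_id',
                values='prod_reorder_rate').fillna(0)
print(R)
```

One design caveat: filling with 0 treats "never purchased" the same as "purchased but not reordered", which is a simplification.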

In [30]:
#reset dataframe in order to pivot product_id to columns, user_id to index, and reorder rate to values.
df_reorders=df_rates.reset_index().pivot(index='user_id', columns='product_id', values='prod_reorder_rate')
df_reorders.head()

user_id,20,264,432,630,940,1227,1360,1463,1722,1858,...,48716,48848,48857,49044,49175,49191,49198,49215,49318,49383
2153,,,,,,,,,,,...,,,,,,,,,,
13131,,,,,,,,,,,...,,1.0,,,,,,,,
16354,,,1.0,,,,1.0,,,,...,,,,,,1.0,,,,
21414,,,,,,,,,,,...,,,,,,,,,,
28328,,,,,,,,,,,...,,,,,,,,,,


In [31]:
#fill NaN with 0 
df_reorders=df_reorders.fillna(0)

In [32]:
df_reorders.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 2153 to 206054
Columns: 403 entries, 20 to 49383
dtypes: float64(403)
memory usage: 157.8 KB

In [33]:
df_reorders.head()

user_id,20,264,432,630,940,1227,1360,1463,1722,1858,...,48716,48848,48857,49044,49175,49191,49198,49215,49318,49383
2153,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13131,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16354,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
21414,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28328,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
df_reorders.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 2153 to 206054
Columns: 403 entries, 20 to 49383
dtypes: float64(403)
memory usage: 157.8 KB


We will turn that dataframe into a matrix, normalize, optimize the parameters, and make some predictions. 
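A rough sketch of the normalize, decompose, and reconstruct steps, and of how the choice of k affects the result, is given below. It uses numpy.linalg.svd on a made-up matrix rather than scipy's svds, and rank_k_reconstruction is a hypothetical helper name, not part of the notebook.

```python
import numpy as np

def rank_k_reconstruction(R, k):
    """Mean-center each row, keep the k largest singular values, add the means back."""
    row_means = R.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(R - row_means, full_matrices=False)
    # Scale the first k columns of U by their singular values, then project back.
    return (U[:, :k] * s[:k]).dot(Vt[:k, :]) + row_means

# Made-up 4 x 4 reorder matrix.
R = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])

# Reconstruction error (Frobenius norm) for increasing k.
errors = [np.linalg.norm(R - rank_k_reconstruction(R, k)) for k in (1, 2, 3)]
print(errors)
```

The error on the training matrix can only shrink as k grows, so a useful choice of k has to be judged on held-out orders rather than on this error alone.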

In [35]:
#normalize reorders in order to feed into scipy
reorders= df_reorders.values
reorder_mean = np.mean(reorders, axis = 1)
reordered_normalized = reorders - reorder_mean.reshape(-1, 1)

In [36]:
#break down reorder matrix (R) into its factor matrices U, sigma, and V^T
# k picked at random; will need to cross validate later
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(reordered_normalized, k = 10)

In [37]:
#make sigma a diagonal matrix for multiplication next
sigma = np.diag(sigma)

In [38]:
#multiply U by sigma then by V^T and add the means back in to get a reconstruction of the original matrix
#then convert matrix to dataframe assigning columns and index
reconstructed_reorders = np.dot(np.dot(U, sigma), Vt) + reorder_mean.reshape(-1, 1)
df_predictions = pd.DataFrame(reconstructed_reorders, columns = df_reorders.columns,index=df_reorders.index)

In [39]:
print(df_predictions.index)

Int64Index([  2153,  13131,  16354,  21414,  28328,  30124,  35942,  36449,
             37301,  40369,  51578,  53702,  55419,  58648,  62267,  65387,
             74630,  75527,  75779,  92090,  94506, 105795, 105969, 112460,
            112585, 113007, 118532, 118698, 122019, 126360, 130039, 134009,
            134789, 139707, 148324, 149179, 149720, 151683, 152083, 168326,
            170667, 187455, 187888, 189033, 193107, 194202, 194850, 203190,
            204656, 206054],
           dtype='int64', name=u'user_id')


In [40]:
#save user list
users=list(df_predictions.index)

In [41]:
df_predictions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 2153 to 206054
Columns: 403 entries, 20 to 49383
dtypes: float64(403)
memory usage: 157.8 KB


In [42]:
#check how the predictions look for the first user
df_predictions.loc[users[0]].sort_values(ascending=False)

product_id
5991     0.016980
25890    0.016980
28985    0.016980
35535    0.016980
18234    0.016980
23165    0.016980
31869    0.016980
16349    0.016980
8277     0.016146
43713    0.016146
41329    0.016146
28576    0.016146
28535    0.016146
47209    0.016146
24373    0.016146
3520     0.016146
17906    0.011042
42342    0.011042
8385     0.011042
41846    0.011042
15392    0.011042
23452    0.010891
14715    0.010891
16283    0.010891
22900    0.010891
37317    0.010891
20955    0.010891
21131    0.010891
29503    0.010891
34024    0.010891
           ...   
17949    0.001633
33198    0.001166
24838    0.001031
22825    0.000820
24489   -0.000493
32864   -0.000493
30122   -0.000493
41950   -0.000493
20520   -0.000493
14084   -0.000493
38777   -0.000493
29388   -0.000493
48179   -0.000493
3252    -0.000493
44359   -0.000493
46979   -0.000493
22720   -0.000493
14074   -0.000493
26128   -0.000493
30391   -0.000493
47626   -0.000833
43643   -0.001128
13380   -0.001708
42585   -0.002577

In [43]:
#pick only high predictions
prediction_thresholds=df_predictions.loc[users[0]].apply(lambda x: x if x > 0.05 else None)
prediction_thresholds.dropna().sort_values(ascending=False)

Series([], Name: 2153, dtype: object)
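An alternative way to apply a threshold, sketched below, is a boolean mask. It keeps the float dtype instead of producing an object series. The scores are made-up values; the product ids and user id are only used as example labels.

```python
import pandas as pd

# Made-up prediction scores for one user, indexed by product_id.
scores = pd.Series([0.017, 0.011, -0.0005, 0.08],
                   index=[5991, 17906, 24489, 8277], name=2153)

# Keep only scores above the threshold, highest first.
high = scores[scores > 0.05].sort_values(ascending=False)

# If nothing clears the threshold, the result is simply an empty series.
none_pass = scores[scores > 0.5]
print(high)
print(len(none_pass))
```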

In [44]:
def predicted_reorders(predictions_df, user_id, df_prod, df_rates, threshold=0.0):
    """ Function takes 5 parameters: prediction dataframe, user to predict reorders for, product dataframe,
        reorder rate per user dataframe, and the threshold a prediction must surpass to make a
        recommendation. 
        
        Function returns two dataframes: user_purchased: items the user has purchased in the last three orders,
        prediction: items from the user_purchased list that surpass the prediction threshold.
        
        note: threshold default is zero (returns products for which the SVD reconstruction is positive). 
        Higher threshold settings will return only the more likely reorders.
    """
    
    # for user get products that surpass prediction threshold
    # (uses the predictions_df argument rather than the global df_predictions)
    user_row = predictions_df.loc[user_id]
    user_predictions = user_row[user_row > threshold].sort_values(ascending=False)
    
    # dataframe for items user has purchased previously (last three orders) with name and reorder rate
    df_rates=df_rates.reset_index()
    user_data = df_rates[df_rates.user_id == (user_id)]
    user_purchased = (user_data.merge(df_prod, how = 'left', left_on = 'product_id', right_on = 'product_id').
                     sort_values(['prod_reorder_rate'], ascending=False))

    # Predict reorders by returning items from previous purchases that surpass the prediction threshold
    prediction = user_purchased.merge(pd.DataFrame(user_predictions).reset_index(), how = 'inner',left_on = 'product_id',
               right_on = 'product_id')
    
    #format prediction dataframe to see prediction rate and sort by rate
    prediction=prediction.rename(columns = {user_id: 'Prediction'}).sort_values('Prediction', ascending = False)
    
    return user_purchased, prediction



In [45]:
#test the defined function with first user_id
user_purchased, prediction = predicted_reorders(df_predictions,users[0], df_prod, df_rates,.0)

In [46]:
#see items purchased
user_purchased


Unnamed: 0,user_id,product_id,prod_order_count,prod_reorder_rate,product_name,aisle_id,department_id
0,2153,4066,1,1.0,Organic Red Heirloom Pasta Sauce,9,9
1,2153,27796,1,1.0,Real Mayonnaise,72,13
2,2153,34668,1,1.0,Organic Large Grade AA Omega-3 Eggs,86,16


In [47]:
#see prediction for user
prediction

Unnamed: 0,user_id,product_id,prod_order_count,prod_reorder_rate,product_name,aisle_id,department_id,Prediction
0,2153,4066,1,1.0,Organic Red Heirloom Pasta Sauce,9,9,0.010068
1,2153,27796,1,1.0,Real Mayonnaise,72,13,0.010068
2,2153,34668,1,1.0,Organic Large Grade AA Omega-3 Eggs,86,16,0.010068


In [48]:
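#prepare dataframe for merge -- set index
#note: only user_id becomes the index; a second positional argument to set_index is read as `drop`, not as a second key
df_user_order_prod = df_user_order_prod.set_index('user_id')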

In [49]:
#prepare dataframe for merge -- set index (user_id only)
prediction=prediction.set_index('user_id')

In [50]:
#merge dataframe on the user_id index (each prediction row pairs with every order row for that user)
pred_set=pd.merge(prediction , df_user_order_prod, how='inner', left_index=True, right_index=True)
pred_set

user_id,product_id_x,prod_order_count,prod_reorder_rate,product_name,aisle_id,department_id,Prediction,order_id,product_id_y,add_to_cart_order,reordered,days_since_prior_order,eval_set,order_dow,order_hour_of_day,order_number
2153,4066,1,1.0,Organic Red Heirloom Pasta Sauce,9,9,0.010068,1407512,4066,1,1,0.0,prior,4,19,16
2153,4066,1,1.0,Organic Red Heirloom Pasta Sauce,9,9,0.010068,1407512,34668,2,1,0.0,prior,4,19,16
2153,4066,1,1.0,Organic Red Heirloom Pasta Sauce,9,9,0.010068,1407512,27796,3,1,0.0,prior,4,19,16
2153,27796,1,1.0,Real Mayonnaise,72,13,0.010068,1407512,4066,1,1,0.0,prior,4,19,16
2153,27796,1,1.0,Real Mayonnaise,72,13,0.010068,1407512,34668,2,1,0.0,prior,4,19,16
2153,27796,1,1.0,Real Mayonnaise,72,13,0.010068,1407512,27796,3,1,0.0,prior,4,19,16
2153,34668,1,1.0,Organic Large Grade AA Omega-3 Eggs,86,16,0.010068,1407512,4066,1,1,0.0,prior,4,19,16
2153,34668,1,1.0,Organic Large Grade AA Omega-3 Eggs,86,16,0.010068,1407512,34668,2,1,0.0,prior,4,19,16
2153,34668,1,1.0,Organic Large Grade AA Omega-3 Eggs,86,16,0.010068,1407512,27796,3,1,0.0,prior,4,19,16


In [51]:
#train_set_svd=train_set_svd.reset_index()
sample=pd.DataFrame(train_set_svd['user_id'])

#capture the second to last order information for each user as a test set
data_p2=g.nth(-2)
data_p2['order_from_last']=2
data_p2=data_p2.reset_index()

#for test set add in reorder information
df_test=pd.merge(df_prod_orders,data_p2, how= 'inner',left_on="order_id", right_on='order_id')

#since train and test data are split by each user's previous orders, we can ensure we test only the users we trained on
small_test=pd.merge(df_test,sample, how= 'inner',left_on="user_id", right_on='user_id')
small_test.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,days_since_prior_order,eval_set,order_dow,order_hour_of_day,order_number,order_from_last
0,81267,30776,1,1,53702,7.0,prior,6,13,63,2
1,81267,1463,2,1,53702,7.0,prior,6,13,63,2
2,81267,34943,3,1,53702,7.0,prior,6,13,63,2
3,81267,3952,4,1,53702,7.0,prior,6,13,63,2
4,81267,13779,5,1,53702,7.0,prior,6,13,63,2


In [52]:
#reset index in order to loop across index
prediction=prediction.reset_index()

In [53]:
#set rate at which to predict positive label one in classification
rate=.05
#label 1 where the prediction surpasses the rate, otherwise 0 (vectorized, avoids chained assignment)
prediction['ypred']=(prediction['Prediction'].astype(float) > rate).astype(int)
#check dataframe
prediction


Unnamed: 0,user_id,product_id,prod_order_count,prod_reorder_rate,product_name,aisle_id,department_id,Prediction,ypred
0,2153,4066,1,1.0,Organic Red Heirloom Pasta Sauce,9,9,0.010068,0
1,2153,27796,1,1.0,Real Mayonnaise,72,13,0.010068,0
2,2153,34668,1,1.0,Organic Large Grade AA Omega-3 Eggs,86,16,0.010068,0


# Now get the predictions for every user's next order

In [54]:
#check shape of small_test dataframe
print (small_test.shape)
#index by user_id, sort, and drop duplicates (might need to change axis)
small_test=small_test.set_index('user_id').sort_index().drop_duplicates()
print (small_test.shape)

(488, 11)
(488, 10)


In [55]:
prediction.shape

(3, 9)

In [56]:
prediction.head()

Unnamed: 0,user_id,product_id,prod_order_count,prod_reorder_rate,product_name,aisle_id,department_id,Prediction,ypred
0,2153,4066,1,1.0,Organic Red Heirloom Pasta Sauce,9,9,0.010068,0
1,2153,27796,1,1.0,Real Mayonnaise,72,13,0.010068,0
2,2153,34668,1,1.0,Organic Large Grade AA Omega-3 Eggs,86,16,0.010068,0


In [57]:
#reset both dataframes and set to same index (user_id)
prediction = prediction.reset_index()
small_test = small_test.reset_index()
prediction = prediction.set_index('user_id')
small_test = small_test.set_index('user_id')

#check difference in index

In [58]:
small_test.index

Int64Index([  2153,   2153,   2153,  13131,  13131,  13131,  13131,  13131,
             13131,  13131,
            ...
            203190, 203190, 203190, 203190, 203190, 204656, 206054, 206054,
            206054, 206054],
           dtype='int64', name=u'user_id', length=488)

In [59]:
prediction.index

Int64Index([2153, 2153, 2153], dtype='int64', name=u'user_id')

In [60]:
#create dataframe to get predictions for
pred_set=small_test.drop (labels=['add_to_cart_order','days_since_prior_order','reordered' ,'eval_set','order_number', 'order_from_last','order_dow','order_hour_of_day'], axis=1 )

In [61]:
#all users and products in test set need to get prediction for each
pred_set.head(6)

user_id,order_id,product_id
2153,2335713,4066
2153,2335713,10798
2153,2335713,14875
13131,1510877,48679
13131,1510877,6332
13131,1510877,30391


prediction['ypred']=0
for j in prediction.index:
    if float(prediction['Prediction'][j]) >rate:
        prediction.loc[j,'ypred']=1
    else:
        prediction.loc[j,'ypred']=0

In [62]:
#create dictionary to track user predictions
svd_user_pred={}

# rate for positive label
rate=.05

#call the prediction function for each user and save to dictionary
for i in tqdm (range(len(users))):
    user_purchased, prediction = predicted_reorders(df_predictions, users[i], df_prod, df_rates,.0) 
    svd_user_pred[i]=prediction
    

100%|██████████| 50/50 [00:00<00:00, 64.60it/s]


In [63]:
#build one dataframe of all user predictions
#(concatenating a list avoids needing svd_model to exist before the loop)
svd_model = pd.concat([svd_user_pred[i] for i in range(len(users))], ignore_index=True)

In [None]:
#create a prediction column for all users
#label positive (1) if rate is surpassed, otherwise negative (0)
svd_model['ypred']=(svd_model['Prediction'].astype(float) > rate).astype(int)


In [None]:
#drop unnecessary columns
svd_model=svd_model.drop(labels=['aisle_id','department_id' ,'prod_order_count','prod_reorder_rate'], axis=1 )

In [None]:
#dataframe of all products predicted by svd (new products from the test set are missing and still need to be included)
#show example of first user 
svd_model[svd_model['user_id']==users[0]]

Decision point: we did not predict any new product purchases, so we give a negative label to all of them below. The example above shows the products the model produced a prediction for, with the prediction value and label (ypred).

The predictions above are trained only on one previous order (the third-to-last order for every customer). We need to keep only the product predictions that appear in the test set (the second-to-last order).
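The alignment described above can be sketched with a right merge: every (user, product) pair in the test set is kept, and pairs the model never scored receive a negative label. The two small dataframes below are hypothetical stand-ins for the model output and the test pairs.

```python
import pandas as pd

# Hypothetical model output: labels for products seen in the training order.
model_preds = pd.DataFrame({'user_id': [1, 1, 2],
                            'product_id': [100, 200, 100],
                            'ypred': [1, 0, 1]})

# Hypothetical test pairs: product 300 is new for user 1, product 200 is absent.
test_pairs = pd.DataFrame({'user_id': [1, 1, 2],
                           'product_id': [100, 300, 100]})

# A right merge keeps exactly the test pairs; unseen pairs get NaN, then 0.
aligned = model_preds.merge(test_pairs, how='right', on=['user_id', 'product_id'])
aligned['ypred'] = aligned['ypred'].fillna(0).astype(int)
print(aligned)
```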

In [None]:
#test_set['ypred']=iset['ypred'].fillna(0)

In [None]:
#get size of prediction set needed (test set sized down to our sample users)
pred_set.shape

In [None]:
#get size of model predictions (larger: we predicted for every product in the previous order)
svd_model.shape

In [None]:
#keep only user products from the test set in a new dataframe
svd_predicted = pd.merge(svd_model , pred_set.reset_index(),how='right', left_on=['user_id','product_id'], right_on = ['user_id','product_id'])

In [None]:
svd_predicted.shape

In [None]:
#drop duplicates
svd_predicted= svd_predicted.drop_duplicates()

In [None]:
#size is correct now (should match pred_set size number of rows)
svd_predicted.shape

In [None]:
#sort by user_id and product_id (in order to match ML_grocery_basket test sets)
svd_predicted.sort_values('user_id')

In [None]:
#see label predicted total
svd_predicted['ypred'].sum()

In [None]:
#fill in NaN values for labels
#products we did not train on or predict for get a negative label
svd_predicted['ypred']=svd_predicted['ypred'].fillna(0)

#should not affect sum above
svd_predicted['ypred'].sum()

In [None]:
#should not affect shape
svd_predicted.shape

In [None]:
#sort to match the ML_grocery_basket test index
svd_predicted=svd_predicted.sort_values(['user_id','product_id'])
svd_predicted

In [None]:
#save data to pass to ML_grocery notebook to run predictions against other models
data_svd = svd_predicted
%store svd_predicted
#del data # This has deleted the variable

You can now run the last few cells of the ML_grocery_basket notebook to compare all models.