# Final Project: Instacart Predictions

“The Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on November 18, 2019.

## Group Name:
HAM

## Group Members:
- Andy Cheon
- Meng (Marine) Lin
- Hannah Lyon

In [14]:
import pandas as pd
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn import compose
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score
import numpy as np

## Ask:

How can we predict the items that an Instacart user is likely to repurchase?

## Acquire:

In [2]:
df = pd.read_csv('data/instacart.csv')
df.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,aisle,department,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2,33120,1,1,Organic Egg Whites,86,16,eggs,dairy eggs,202279,prior,3,5,9,8.0
1,2,28985,2,1,Michigan Organic Kale,83,4,fresh vegetables,produce,202279,prior,3,5,9,8.0
2,2,9327,3,0,Garlic Powder,104,13,spices seasonings,pantry,202279,prior,3,5,9,8.0
3,2,45918,4,1,Coconut Butter,19,13,oils vinegars,pantry,202279,prior,3,5,9,8.0
4,2,30035,5,0,Natural Sweetener,17,13,baking ingredients,pantry,202279,prior,3,5,9,8.0


In [3]:
# Drop every row that has a missing value in any column
df = df.dropna()

In [4]:
# show_counts replaces null_counts, which was removed in pandas 2.0
df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31741038 entries, 0 to 33819105
Data columns (total 15 columns):
order_id                  31741038 non-null int64
product_id                31741038 non-null int64
add_to_cart_order         31741038 non-null int64
reordered                 31741038 non-null int64
product_name              31741038 non-null object
aisle_id                  31741038 non-null int64
department_id             31741038 non-null int64
aisle                     31741038 non-null object
department                31741038 non-null object
user_id                   31741038 non-null int64
eval_set                  31741038 non-null object
order_number              31741038 non-null int64
order_dow                 31741038 non-null int64
order_hour_of_day         31741038 non-null int64
days_since_prior_order    31741038 non-null float64
dtypes: float64(1), int64(10), object(4)
memory usage: 3.8+ GB
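
The summary above confirms that no nulls remain after `dropna()`. Before dropping rows, it can help to check which columns actually hold the missing values and how many rows the drop removes. A minimal sketch on a toy frame (the values are made up, and the assumption that only `days_since_prior_order` is missing is for illustration only):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the Instacart file).
# Assumption: a user's first order has no prior order, so the gap is NaN.
toy = pd.DataFrame({
    'order_id': [10, 10, 11],
    'order_number': [1, 1, 2],
    'days_since_prior_order': [np.nan, np.nan, 7.0],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isna().sum()

# Drop rows with any missing value and record how many rows were lost
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(f"rows dropped: {rows_dropped}")
```

Running the same two checks on the full DataFrame would show whether `dropna()` removes a particular kind of order rather than a random sample of rows.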


## Process:

In [5]:
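def data_clean(df):
    # One-hot encode the department, dropping the first level as the reference
    df = pd.get_dummies(df, columns=['department_id'], drop_first=True)

    # Total number of orders per user (the highest order_number seen for that user)
    temp = df.groupby('user_id')['order_number'].max().reset_index()
    # Both frames contain order_number, so the merge produces
    # order_number_x (this order) and order_number_y (the user's total)
    df = df.merge(temp, how='left', on='user_id')

    # Drop the text columns; product_id, aisle_id and the dummies carry the same information
    df = df.drop(['product_name', 'aisle', 'department'], axis=1)
    return df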

In [6]:
df = data_clean(df)
df

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,aisle_id,user_id,eval_set,order_number_x,order_dow,order_hour_of_day,...,department_id_13,department_id_14,department_id_15,department_id_16,department_id_17,department_id_18,department_id_19,department_id_20,department_id_21,order_number_y
0,2,33120,1,1,86,202279,prior,3,5,9,...,0,0,0,1,0,0,0,0,0,9
1,2,28985,2,1,83,202279,prior,3,5,9,...,0,0,0,0,0,0,0,0,0,9
2,2,9327,3,0,104,202279,prior,3,5,9,...,1,0,0,0,0,0,0,0,0,9
3,2,45918,4,1,19,202279,prior,3,5,9,...,1,0,0,0,0,0,0,0,0,9
4,2,30035,5,0,17,202279,prior,3,5,9,...,1,0,0,0,0,0,0,0,0,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31741033,3421063,14233,3,1,115,169679,train,30,0,10,...,0,0,0,0,0,0,0,0,0,30
31741034,3421063,35548,4,1,13,169679,train,30,0,10,...,0,0,0,0,0,0,0,1,0,30
31741035,3421070,35951,1,1,91,139822,train,15,6,10,...,0,0,0,1,0,0,0,0,0,15
31741036,3421070,16953,2,1,88,139822,train,15,6,10,...,1,0,0,0,0,0,0,0,0,15
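
The `order_number_x` and `order_number_y` columns in the output above come from pandas' default merge suffixes: when both frames share a non-key column name, the left copy gets `_x` and the right copy gets `_y`. A small sketch on toy data (hypothetical values; the name `total_orders` is an illustrative choice, not one used in this notebook):

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Instacart file)
orders = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
})

# Highest order_number per user, i.e. that user's total number of orders
totals = orders.groupby('user_id')['order_number'].max().reset_index()

# Both frames contain order_number, so pandas appends _x (left) and _y (right)
merged = orders.merge(totals, how='left', on='user_id')

# Optional: explicit names are easier to read than the default suffixes
renamed = merged.rename(columns={
    'order_number_x': 'order_number',
    'order_number_y': 'total_orders',
})
print(renamed)
```

Renaming is purely cosmetic here; the model cells in this notebook work with the suffixed names as they are.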


In [7]:
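# Fit on the 'prior' orders and hold out the 'train' orders for evaluation
train = df.loc[df.eval_set == 'prior']
test = df.loc[df.eval_set == 'train']
# Separate the target (reordered) from the features
y_train = train['reordered']
x_train = train.drop(['reordered', 'eval_set'], axis=1)
y_test = test['reordered']
x_test = test.drop(['reordered', 'eval_set'], axis=1)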

## Model:

In [8]:
nb = GaussianNB()

In [9]:
cv = cross_validate(nb, x_train, y_train, scoring='f1', cv=5, verbose=True, n_jobs=-1)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 16.2min finished


In [10]:
cv['test_score']

array([0.77668988, 0.7730551 , 0.77305638, 0.77305707, 0.77712535])
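
The five fold scores are easier to compare with the held-out score when summarized as a mean and a spread. A minimal sketch, with the values copied from the output above (inside the notebook, `cv['test_score']` could be used directly instead of retyping them):

```python
import numpy as np

# Fold scores copied from the cross_validate output shown above
fold_scores = np.array([0.77668988, 0.7730551, 0.77305638, 0.77305707, 0.77712535])

mean_f1 = fold_scores.mean()
# ddof=1 gives the sample standard deviation across the five folds
std_f1 = fold_scores.std(ddof=1)

print(f"mean F1 = {mean_f1:.4f}, std = {std_f1:.4f}")
```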

In [11]:
nb.fit(x_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [12]:
preds = nb.predict(x_test)

In [15]:
f1_score(y_test, preds)

0.7489169215950179
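
For reference, the F1 score reported here is the harmonic mean of precision and recall for the positive class (`reordered == 1`). The sketch below computes it by hand on made-up labels (hypothetical values, not project data) and also scores a trivial always-predict-1 baseline, which can be a useful yardstick when most labels are positive. The helper name `f1_from_labels` is illustrative.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no true positives: F1 is taken as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical, for illustration only)
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]

model_f1 = f1_from_labels(y_true, y_pred)

# Baseline that predicts "reordered" for every row
baseline_f1 = f1_from_labels(y_true, [1] * len(y_true))

print(f"model F1 = {model_f1:.3f}, always-1 baseline F1 = {baseline_f1:.3f}")
```

Running the same baseline against `y_test` would show how much of the reported score comes from the model rather than from the share of reordered items in the data.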