# Classification Setup & Feature Engineering With Instacart!

The first goal of this notebook is to show how we can take a bunch of raw, relational data and manipulate it into the format of a machine learning problem (binary classification). This is the sort of thing we'll often be given in a business setting: information that's been collected with no clear guidance about how to frame it as a predictive task in the $X$ features, $y$ target style.

The second goal of this notebook is to expose you to feature engineering ideas and best practices. Knowing how to select models and tune hyperparameters is very important, but models can only be as good as the quality of the features that you provide to them. In practical machine learning, a huge amount of time is spent engineering and selecting features to add signal to the problem, and you can often get a lot more additional value from well-constructed features than from hyperparameter fine tuning. Hence it's very important to be able to apply domain knowledge, think extensively about what features should matter, and derive them to include in your model.    

**The data for this notebook is a subset of the [Kaggle Instacart dataset](https://www.kaggle.com/c/instacart-market-basket-analysis/data)**: the files in the `instacart_data_subset` folder are the same as the originals, except that all the order data is drawn from a 5000-user subset to reduce memory use and speed things up.

### Workflow: 

1. **Problem setup and baselining**
2. **Feature engineering to improve our model**
3. **Feature engineering exercises and future ideas**

## 1. Problem setup and baselining

First let's get a quick feel for the data we're working with.

In [1]:
!unzip instacart_data_subset.zip



In [2]:
import pandas as pd
import numpy as np

path = 'instacart_data_subset/'
df_orders = pd.read_csv(path + 'orders_subset.csv')
df_orders.head(3)


In [None]:
df_orders.eval_set.unique()

In [None]:
df_order_products_prior = pd.read_csv(path + 'order_products__prior_subset.csv')
df_order_products_prior.head(3)

In [3]:
df_order_products_train = pd.read_csv(path + 'order_products__train_subset.csv')
df_order_products_train.head(3)


We'll want to combine the order/product information with user information, so we'll go ahead and merge the `order_products` tables with the `orders` table.

In [8]:
df_order_products_train = df_order_products_train.merge(df_orders.drop('eval_set', axis=1),
                                                        on='order_id')
df_order_products_prior = df_order_products_prior.merge(df_orders.drop('eval_set', axis=1),
                                                        on='order_id')


In [12]:
df_order_products_train.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,1077,13176,1,1,173934,11,6,9,10.0
1,1077,39922,2,1,173934,11,6,9,10.0
2,1077,5258,3,1,173934,11,6,9,10.0
3,1077,21137,4,1,173934,11,6,9,10.0
4,1119,6046,1,1,129386,7,1,14,17.0


Next we'll set up the classification problem. For the Instacart challenge, the given task is to predict which products will show up again in a user's next order based on their entire product order history. We can worry about aggregating to the cart level later, but understand for now that this problem will require us to make **individual binary predictions for every unique user-product combination** in the order history, where the target is 1 if that product shows up in the user's next (most recent) order and 0 otherwise.

With that in mind, we'll create a **`df_X` as our ML-formatted dataframe**, with a user-product aggregated version of the `order_products_prior` data. We'll go ahead and count the total # of times the user has ordered each product as our first feature since we're already doing a user-product aggregation.

In [16]:
df_user_product = (df_order_products_prior.groupby(['product_id','user_id'],as_index=False) 
                                          .agg({'order_id':'count'})
                                          .rename(columns={'order_id':'user_product_total_orders'}))


In [21]:
df_user_product.nunique()

product_id                   28927
user_id                       5000
user_product_total_orders       81
dtype: int64

In [22]:
train_ids = df_order_products_train['user_id'].unique() 
df_X = df_user_product[df_user_product['user_id'].isin(train_ids)]
df_X.head()

Unnamed: 0,product_id,user_id,user_product_total_orders
0,1,21285,1
1,1,47549,4
2,1,54136,1
3,1,54240,1
4,1,95730,1


In [23]:
df_X.shape

(329806, 3)

Next we need to get our labels. To do this, we'll group our current cart data (`order_products_train`) by user and collect a set of the items in that cart. Then we can merge with `df_X` and iterate through the rows to get labels for whether each product occurs in the latest cart.  

In [24]:
train_carts = (df_order_products_train.groupby('user_id',as_index=False)
                                      .agg({'product_id':(lambda x: set(x))})
                                      .rename(columns={'product_id':'latest_cart'}))


In [25]:
train_carts.head()

Unnamed: 0,user_id,latest_cart
0,50,"{6182, 31720, 47209, 21903, 13176, 16249}"
1,52,"{30720, 46149, 24135, 39275, 14032, 8048, 3045..."
2,65,"{6656, 33768, 5161, 11534, 38164, 4920, 41561,..."
3,80,"{12545, 15842, 43713, 8710, 24010, 33002, 4763..."
4,220,"{7781, 42445, 35887, 31343, 28476}"


In [26]:

df_X = df_X.merge(train_carts, on='user_id')
# Label is 1 if the product appears in the user's latest cart, else 0
df_X['in_cart'] = (df_X.apply(lambda row: row['product_id'] in row['latest_cart'], axis=1)
                       .astype(int))
df_X.head()



Unnamed: 0,product_id,user_id,user_product_total_orders,latest_cart,in_cart
0,1,21285,1,"{21573, 35561, 37710, 11759, 12341, 13176, 32478}",0
1,3298,21285,1,"{21573, 35561, 37710, 11759, 12341, 13176, 32478}",0
2,4920,21285,3,"{21573, 35561, 37710, 11759, 12341, 13176, 32478}",0
3,6066,21285,2,"{21573, 35561, 37710, 11759, 12341, 13176, 32478}",0
4,6184,21285,6,"{21573, 35561, 37710, 11759, 12341, 13176, 32478}",0


In [27]:
df_X.tail()

Unnamed: 0,product_id,user_id,user_product_total_orders,latest_cart,in_cart
329801,38277,94498,1,{41290},0
329802,41290,94498,3,{41290},1
329803,39461,127372,3,{13966},0
329804,43721,127372,3,{13966},0
329805,49070,31628,5,"{33000, 45608, 49070, 39441, 37496}",1


Nice, now we actually have a dataset that's shaped as a binary classification problem: **`in_cart` is our target**, and **each observation is a unique user-product combination** based on the entire order history for users in our current cart data.
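As an aside, the row-wise `apply` used to build `in_cart` can get slow on larger subsets. One possible alternative is a left merge on the (user, product) key with pandas' `indicator=True` option. The sketch below uses small made-up frames (`df_pairs` and `df_latest` are hypothetical stand-ins for `df_X` and `df_order_products_train`), so treat it as an illustration rather than a drop-in replacement.

```python
import pandas as pd

# Toy stand-ins (hypothetical): prior user-product pairs and latest-cart rows
df_pairs = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2],
    'product_id': [10, 11, 12, 10, 13],
})
df_latest = pd.DataFrame({
    'user_id':    [1, 1, 2],
    'product_id': [10, 99, 13],
})

# Left-merge on the (user, product) key; indicator=True adds a `_merge` column
# saying whether each pair was also found in the latest cart
labeled = df_pairs.merge(
    df_latest[['user_id', 'product_id']].drop_duplicates(),
    on=['user_id', 'product_id'], how='left', indicator=True)

labeled['in_cart'] = (labeled['_merge'] == 'both').astype(int)
labeled = labeled.drop(columns='_merge')
print(labeled)
```

Products that appear only in the latest cart (product 99 here) are dropped by the left merge, which matches the setup above: we only predict for user-product pairs seen in the prior history.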

We should immediately check the distribution of our labels, which shows that we're working with an **imbalanced classification task** (always check this first!). It's something we'll definitely want to account for later (though not in this notebook) when we optimize our F1 score, the chosen metric for scoring our model.

In [28]:
df_X.in_cart.value_counts(normalize=True)

0    0.901945
1    0.098055
Name: in_cart, dtype: float64
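To see why the imbalance matters for metric choice, here is a small sketch on synthetic labels with roughly the same 90/10 split (not the real data). A "model" that never predicts a reorder looks fine on accuracy but scores zero on F1. The `zero_division` argument assumes scikit-learn 0.22 or later.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels with a 90/10 class split (illustrative only)
y_true = np.array([0] * 90 + [1] * 10)

# A trivial "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 silences the warning raised when nothing is predicted positive
f1 = f1_score(y_true, y_pred, zero_division=0)
print(acc, f1)
```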

Right now this looks like a very meagre machine learning problem! We really only have 1 usable feature, the total orders placed for each user-product combination. Let's use a simple logistic model with this feature as a baseline, and see how much predictive power we can add by **building out our feature set with feature engineering**.

For this problem, we want to be **extra careful about validation/testing**. If we do a simple train/test split, we'll end up with users that occur in both the training and test data and run the risk of overfitting to the tendencies of specific users. Instead, we'll manually sample 20% of the users to put into our test set, and use the remaining 80% of users for the training data.

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

np.random.seed(42)
total_users = df_X['user_id'].unique()
test_users = np.random.choice(total_users, size=int(total_users.shape[0] * .20), replace=False)

# Split by user so that no user appears in both the training and test data
df_X_tr, df_X_te = df_X[~df_X['user_id'].isin(test_users)], df_X[df_X['user_id'].isin(test_users)]

y_tr, y_te = df_X_tr['in_cart'], df_X_te['in_cart']
# Drop identifiers and label columns so that only features remain
drop_cols = ['product_id','user_id','latest_cart','in_cart']
X_tr, X_te = df_X_tr.drop(drop_cols, axis=1), df_X_te.drop(drop_cols, axis=1)



In [31]:
y_te.head()

285    0
286    0
287    0
288    0
289    0
Name: in_cart, dtype: int32
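The manual user-level split could also be expressed with scikit-learn's `GroupShuffleSplit`, which holds out whole groups (here, users) rather than individual rows. This is a sketch on a toy frame (`toy` is a hypothetical stand-in for `df_X`), not the split used in the rest of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame (hypothetical): 10 users with 5 user-product rows each
toy = pd.DataFrame({
    'user_id': np.repeat(np.arange(10), 5),
    'user_product_total_orders': np.arange(50),
    'in_cart': np.tile([0, 0, 0, 0, 1], 10),
})

# One split with about 20% of the *users* (groups) held out
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
tr_idx, te_idx = next(gss.split(toy, toy['in_cart'], groups=toy['user_id']))
toy_tr, toy_te = toy.iloc[tr_idx], toy.iloc[te_idx]

# No user should appear on both sides of the split
shared = set(toy_tr['user_id']) & set(toy_te['user_id'])
print(len(shared), toy_te['user_id'].nunique())
```

Either approach works; the important property is that every row for a given user lands entirely in train or entirely in test.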

In [32]:
lr = LogisticRegression()
lr.fit(X_tr, y_tr)
f1_score(y_te, lr.predict(X_te))  # signature is f1_score(y_true, y_pred)

0.082051282051282051

The bar is set low!

## 2. Feature engineering to improve our model

The first thing we need to do is think critically about the predictive task and the types of features that we need to use. We should draw heavily on domain knowledge and be open to trial and error. 

Since our observations are unique combinations of user-product, we'll have multiple sources of features that are highly relevant to making a prediction about the purchasing behavior for the upcoming/most current order. We can break this problem down in terms of **qualities of human behavior** and figure out how to capture them numerically.

### Feature Types:

* **Product** features: general information about product purchase patterns across ALL users. The category of the product, its general popularity, how high priority the item tends to be, etc.
* **User** features: information about specific user behavior. How many items do they tend to order, how long has it been since they've last ordered, what time of day do they usually order, etc. 
* **User-Product** features: information about product-specific user behavior. How often have they ordered this product, how high-priority does it tend to be for them, how long has it been since they've ordered this product, etc.

When engineering tens or hundreds of features, it can quickly become tricky to keep track of all our code and feature outputs. Here are a couple of best practices:

  1. Use consistent naming conventions for features of the same type 
  2. Build features at the same level of aggregation at the same time, and track them in a dedicated dataframe. Merge back into the ML-formatted dataframe at the end of the process.
  
With this in mind, we'll start with product level features.

### Product features

Here we'll create 2 simple product-level features, merge them into the ML dataframe, and run a new model as a 2nd baselining step. By **iteratively baselining** like this, we can make sure that each set of features we're adding to the mix is adding real predictive value.

We'll gauge each product's overall popularity by counting its total orders across all users, and also gauge its typical priority level in an order by averaging its `add_to_cart_order`.

In [None]:
from collections import OrderedDict

prod_features = ['product_total_orders','product_avg_add_to_cart_order']

df_prod_features = (df_order_products_prior.groupby(['product_id'],as_index=False)
                                           .agg(OrderedDict(
                                                   [('order_id','nunique'),
                                                    ('add_to_cart_order','mean')])))
df_prod_features.columns = ['product_id'] + prod_features
df_prod_features.head()

In [None]:
df_X = df_X.merge(df_prod_features, on='product_id')

# note that simply dropping rows with NA feature values is likely a naive choice
df_X = df_X.dropna()
df_X.head()

In [None]:
df_X_tr, df_X_te = df_X[~df_X['user_id'].isin(test_users)], df_X[df_X['user_id'].isin(test_users)]

y_tr, y_te = df_X_tr['in_cart'], df_X_te['in_cart']
# Drop identifiers and label columns so that only features remain
drop_cols = ['product_id','user_id','latest_cart','in_cart']
X_tr, X_te = df_X_tr.drop(drop_cols, axis=1), df_X_te.drop(drop_cols, axis=1)

lr = LogisticRegression()
lr.fit(X_tr, y_tr)
f1_score(y_te, lr.predict(X_te))  # signature is f1_score(y_true, y_pred)

So we're able to do slightly better with the addition of just two product-specific features.
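Since the split, fit, and score steps are repeated after each batch of features, they could be wrapped in a small helper to keep the iterative baselining consistent. The function name `score_features` and the toy data below are hypothetical; this is a sketch assuming the same column names as `df_X`, not code from the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def score_features(df, test_users, target='in_cart'):
    """Fit a logistic baseline on a user-level split and return the test F1.

    Hypothetical helper: assumes `df` has `user_id`, `product_id` and `target`
    columns, and that every other column (except `latest_cart`) is a feature.
    """
    is_test = df['user_id'].isin(test_users)
    non_features = [c for c in ['product_id', 'user_id', 'latest_cart', target]
                    if c in df.columns]
    X_tr = df[~is_test].drop(columns=non_features)
    X_te = df[is_test].drop(columns=non_features)
    y_tr, y_te = df.loc[~is_test, target], df.loc[is_test, target]

    model = LogisticRegression().fit(X_tr, y_tr)
    return f1_score(y_te, model.predict(X_te), zero_division=0)

# Toy data (illustrative only): the label loosely depends on the order count
rng = np.random.RandomState(0)
n = 400
toy = pd.DataFrame({
    'user_id': rng.randint(0, 20, size=n),
    'product_id': rng.randint(0, 50, size=n),
    'user_product_total_orders': rng.randint(1, 15, size=n),
})
toy['in_cart'] = (toy['user_product_total_orders'] + rng.normal(0, 2, n) > 10).astype(int)

score = score_features(toy, test_users=np.arange(4))
print(round(score, 3))
```

Calling the same helper after each merge makes it harder to accidentally change the split or the dropped columns between baselines.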

### User features

Here we'll create 4 user-level features, then merge into `df_X` and benchmark a model as before.

There are a number of components of user behavior that should be critical to measure. We'd like to know if our users have made many or few orders, the average number of products they buy in an order, how many different products they've bought over time, and how long they typically wait between orders. Thinking about purchasing tendencies, all of these factors could play a role in determining what to expect from the next cart.

In [None]:
user_features = ['user_total_orders','user_avg_cartsize','user_total_products','user_avg_days_since_prior_order']

df_user_features = (df_order_products_prior.groupby(['user_id'],as_index=False)
                                           .agg(OrderedDict(
                                                   [('order_id',['nunique', (lambda x: x.shape[0] / x.nunique())]),
                                                    ('product_id','nunique'),
                                                    ('days_since_prior_order','mean')])))

df_user_features.columns = ['user_id'] + user_features
df_user_features.head()
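On newer pandas versions (0.25 and later), the same kind of aggregation can be written with named aggregation, which attaches the feature names directly instead of overwriting `columns` afterwards. The sketch below runs on a toy frame (`toy` is hypothetical) with the same column names as `df_order_products_prior`.

```python
import pandas as pd

# Toy order-product rows (hypothetical): one row per product in each order
toy = pd.DataFrame({
    'user_id':    [1, 1, 1, 1, 2, 2],
    'order_id':   [100, 100, 101, 101, 200, 200],
    'product_id': [10, 11, 10, 12, 10, 13],
    'days_since_prior_order': [7.0, 7.0, 3.0, 3.0, 10.0, 10.0],
})

# Named aggregation: output_name=(input_column, aggregation)
user_feats = (toy.groupby('user_id')
                 .agg(user_total_orders=('order_id', 'nunique'),
                      user_avg_cartsize=('order_id', lambda x: x.shape[0] / x.nunique()),
                      user_total_products=('product_id', 'nunique'),
                      user_avg_days_since_prior_order=('days_since_prior_order', 'mean'))
                 .reset_index())
print(user_feats)
```

One thing to keep in mind with either style: because there is one row per product, the mean of `days_since_prior_order` is implicitly weighted by cart size rather than averaged per order.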

In [None]:
df_X = df_X.merge(df_user_features, on='user_id')
df_X = df_X.dropna()
df_X.head(1)

In [None]:
df_X_tr, df_X_te = df_X[~df_X['user_id'].isin(test_users)], df_X[df_X['user_id'].isin(test_users)]

y_tr, y_te = df_X_tr['in_cart'], df_X_te['in_cart']
# Drop identifiers and label columns so that only features remain
drop_cols = ['product_id','user_id','latest_cart','in_cart']
X_tr, X_te = df_X_tr.drop(drop_cols, axis=1), df_X_te.drop(drop_cols, axis=1)

lr = LogisticRegression()
lr.fit(X_tr, y_tr)
f1_score(y_te, lr.predict(X_te))  # signature is f1_score(y_true, y_pred)

Once again, our model improvement confirms that these features were a worthwhile addition of predictive value.

### User-Product features

For our 3rd feature engineering step, we'll create 2 more user-product features to add to our benchmark.

Here we want to get a sense of how much priority each user places on each product by looking at the typical `add_to_cart_order` for that user-product combination. We also want to get a feature for % of times a product occurs across all of a user's orders -- we'll do that at the end by taking the original `user_product_total_orders` feature we grabbed and dividing it by the `user_total_orders` feature we derived in the user features section.

In [None]:
user_prod_features = ['user_product_avg_add_to_cart_order']

df_user_prod_features = (df_order_products_prior.groupby(['product_id','user_id'],as_index=False)
                                                .agg(OrderedDict(
                                                     [('add_to_cart_order','mean')])))

df_user_prod_features.columns = ['product_id','user_id'] + user_prod_features 
df_user_prod_features.head()

In [None]:
df_X = df_X.merge(df_user_prod_features,on=['user_id','product_id'])
df_X['user_product_order_freq'] = df_X['user_product_total_orders'] / df_X['user_total_orders'] 
df_X.head(1)

In [None]:
df_X_tr, df_X_te = df_X[~df_X['user_id'].isin(test_users)], df_X[df_X['user_id'].isin(test_users)]

y_tr, y_te = df_X_tr['in_cart'], df_X_te['in_cart']
# Drop identifiers and label columns so that only features remain
drop_cols = ['product_id','user_id','latest_cart','in_cart']
X_tr, X_te = df_X_tr.drop(drop_cols, axis=1), df_X_te.drop(drop_cols, axis=1)

lr = LogisticRegression()
lr.fit(X_tr, y_tr)
f1_score(y_te, lr.predict(X_te))  # signature is f1_score(y_true, y_pred)

We've come a long way, but still have a ways to go. We should be able to improve our F1 with a combination of:

1. More/better features
2. More training data (we have lots more available)
3. Better handling of the class imbalance issue / decision threshold (topic for a future lecture)
4. More sophisticated models
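To give a flavor of the decision threshold point, the sketch below fits a logistic model on synthetic imbalanced data (not the Instacart data) and compares F1 across a few probability cutoffs instead of relying on the default of 0.5. The threshold grid and the data-generating process are arbitrary assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic, imbalanced data with roughly 10% positives (illustrative only)
rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 1))
y = (X[:, 0] + rng.normal(scale=1.0, size=2000) > 1.8).astype(int)

# Fit on one half, pick the threshold on the other (validation) half
X_fit, y_fit = X[:1000], y[:1000]
X_val, y_val = X[1000:], y[1000:]

lr = LogisticRegression().fit(X_fit, y_fit)
probs = lr.predict_proba(X_val)[:, 1]  # predicted probability of the positive class

# Score a small grid of cutoffs; 0.5 corresponds to the default behaviour
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]
scores = {t: f1_score(y_val, (probs >= t).astype(int), zero_division=0)
          for t in thresholds}
best_t = max(scores, key=scores.get)
print(best_t, round(scores[best_t], 3))
```

In practice the threshold should be chosen on validation data that is separate from the final test users, otherwise the reported F1 will be optimistic.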

In [None]:
#Saving our ML dataframe with features for future use:
df_X.to_csv('instacart_data_subset/instacart_df_X_features.csv', index=False)

## 3. Feature engineering exercises and future ideas

In [None]:
# Add product category / aisle information as categorical features 

In [None]:
# Add another user-product feature that computes how many orders it's been since the user ordered that product 
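One possible starting point for this exercise, sketched on a toy frame (`toy` and the feature name `user_product_orders_since_last` are hypothetical): compare the user's latest prior `order_number` with the latest `order_number` that contained each product.

```python
import pandas as pd

# Toy prior-order rows (hypothetical); order_number counts a user's orders from 1
toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 1, 2, 2],
    'product_id':   [10, 11, 10, 12, 10, 10],
    'order_number': [1, 1, 2, 3, 1, 2],
})

# Latest prior order per user
user_last = (toy.groupby('user_id', as_index=False)['order_number']
                .max()
                .rename(columns={'order_number': 'user_last_order'}))

# Latest prior order that contained each user-product pair
up_last = (toy.groupby(['user_id', 'product_id'], as_index=False)['order_number']
              .max()
              .rename(columns={'order_number': 'user_product_last_order'}))

# Number of orders since the product was last bought (0 = in the most recent prior order)
up_last = up_last.merge(user_last, on='user_id')
up_last['user_product_orders_since_last'] = (
    up_last['user_last_order'] - up_last['user_product_last_order'])
print(up_last)
```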


In [None]:
# Add another user-product feature that computes the % of times a product shows up consecutively in the user's orders
# (i.e. they reordered it immediately in the next order)


In [None]:
# We haven't used the data on order time / day of week at all yet. We could use this to measure the typical times 
# products tend to be ordered (both generically and at the user-product level), and quantify the difference
# between the time of the latest order and these typical times to pick up new signal around ordering patterns.

# Modify the product and user-product features to compute average hour of day and day of week. Add these to df_X,
# then add features of the form user_product_avg_hod_delta that take the difference between the current order time and the average.


In [None]:
# So far the way we've used the order history treats the entire history on equal terms - for example, user-product 
# order frequency treats orders from months ago the same as recent ones. Come up with features that focus more
# on the most recent orders or give them more weight than older ones.
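As one hedged idea for this last exercise, orders could be down-weighted exponentially by how far back they are in the user's history. The decay rate `DECAY` and the feature name `user_product_recency_weighted_orders` are hypothetical choices, and the toy frame stands in for `df_order_products_prior`.

```python
import pandas as pd

# Toy prior-order rows (hypothetical): both products were ordered twice,
# but product 11 was ordered more recently than product 10
toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 1],
    'product_id':   [10, 10, 11, 11],
    'order_number': [1, 2, 3, 4],
})

DECAY = 0.5  # hypothetical; smaller values emphasise recent orders more

# Weight each row by DECAY ** (number of orders before the user's latest one)
last_order = toy.groupby('user_id')['order_number'].transform('max')
toy['recency_weight'] = DECAY ** (last_order - toy['order_number'])

feat = (toy.groupby(['user_id', 'product_id'], as_index=False)
           .agg(user_product_total_orders=('order_number', 'count'),
                user_product_recency_weighted_orders=('recency_weight', 'sum')))
print(feat)
```

The raw counts are identical for the two products, while the weighted version separates the recently ordered product from the one ordered long ago.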
