# Feature Engineering: Synthetic Features


When you are given a raw dataset, you will often encounter some of the following situations:
* Based on your domain knowledge, you need to create new features to improve your predictions
* The data is spread out into multiple files, and you need to combine them
* One or more features can't be used in their current state and need to be transformed
* You need to create synthetic features that aggregate information across multiple rows and/or columns

Note: Throughout this Colab I have included many details and explanations about the challenge and the code syntax that are entirely optional. They serve as a reference for those of you who would like to fully understand the code or go deeper into a particular topic. These are marked with \**dfc*\* (digression for the curious).


### Real life example: The Instacart Challenge

Let's look at one example from this Kaggle Instacart Market Basket Analysis competition: https://www.kaggle.com/c/instacart-market-basket-analysis

"The goal of the competition is to predict which products will be in a user's next order. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders." (https://www.kaggle.com/c/instacart-market-basket-analysis/data)



---

### The data

The goal is very clear and seems reasonable. However, the raw data was provided in 6 different csv files:
* **aisles.csv**: a simple list of aisle names and their corresponding id's.
* **departments.csv**: another simple list of department names and id's.
* **products.csv**: a list of product names with their id's, as well as the aisle id and department id they are associated with.
* **orders.csv**: a list of all of the orders contained in the dataset, with a single row for each order, indicating the user_id and eval_set (prior, train, test), along with details about the order (e.g. day of the week and hour of the order)
* **order_products__prior.csv**: a list of which products were ordered for a given order_id prior to the most recent order (3,214,874 orders, 32,434,489 rows)
* **order_products__train.csv**: training data, which has the same format as order_products__prior.csv, but contains the most recent orders for 131,209 customers (with a total of 1,384,617 rows).

To succeed in this challenge, you would need to know how to cross-reference the information on the different files to be able to match the user information with the orders and products. 
There were 75,000 rows marked as "test" in the orders data. The submission consisted of uploading a CSV file containing the predicted list of products included in each of these 75k orders (or None for empty orders).


---

### What we will learn today

In this Colab, we will show some of the essential steps to prepare the data using a small subset of the raw data. We won't cover the actual predictions, but you can give it a try if you want, or check out the public kernels to see what the participants came up with at https://www.kaggle.com/c/instacart-market-basket-analysis/kernels.

\**dfc*\*  This kernel in particular has a great exploratory analysis: https://www.kaggle.com/philippsp/exploratory-analysis-instacart


---

### Importing the data



In [0]:
# Libraries used:
import pandas as pd
import numpy as np

In [0]:
# Import orders data:
orders_df = pd.read_csv('https://raw.githubusercontent.com/juemura/amli-kaggle-instacart/master/orders_short.csv', index_col=0) 
                                                        # *dfc* Since the shortened data was created from another dataframe, 
                                                        # the first column of each spreadsheet contains the row index 
                                                        # from the full data set. To preserve the same index, use index_col=0.
                                                        # To delete the old index and start fresh, don't include the index_col 
                                                        # and run either "orders_df = orders_df.drop(orders_df.columns[0], axis=1)"
                                                        # or "orders_df.drop(orders_df.columns[0], axis=1, inplace=True)"
print(orders_df.shape) # *dfc* shape contains a tuple with the number of rows and the number of columns. 
                       # Note that, unlike head() and describe(), shape is not a method, so you don't need to use ().
                       # shape is an attribute of pandas.DataFrame. For a complete list of methods and attributes, 
                       # check out the docs at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html 
orders_df.head()

(89, 7)


Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0
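
\**dfc*\* The effect of `index_col=0` described in the comments above can be seen on a tiny in-memory CSV. This is only a minimal sketch: the CSV text and the name `csv_text` are made up for illustration and are not part of the Instacart files.

```python
import io
import pandas as pd

# Hypothetical CSV text whose first (unnamed) column is a saved row index.
csv_text = ",order_id,user_id\n5,2539329,1\n9,2398795,1\n"

# Without index_col, the saved index becomes a regular column named "Unnamed: 0".
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(indexed.columns.tolist())
print(indexed.index.tolist())
```

Keeping the old index makes it easy to trace a row back to the full data set; dropping it gives a clean 0..n-1 index.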


In [0]:
print("Number of unique customers: {}".format(len(orders_df.user_id.unique())))
print("Number of orders: {}".format(len(orders_df.order_id.unique())))

Number of unique customers: 10
Number of orders: 89


There are only 89 orders compared to the total of ~3.2 million provided by Instacart.

Since the original dataset is too big to load in a Colab, we'll work with a smaller subset containing the data for 10 users. You won't need to change anything other than the file names if you want to run this code on the full dataset.

\**dfc*\* : The code that I used to extract the data for the first 10 users and save it in new csv files can be found here: https://github.com/juemura/amli-kaggle-instacart/blob/master/Data.ipynb

In [0]:
# Import order_products__prior data:
order_products__prior = pd.read_csv('https://raw.githubusercontent.com/juemura/amli-kaggle-instacart/master/order_products__prior_short.csv', index_col=0)
print(order_products__prior.shape) # shape contains a tuple with the number of rows and the number of columns
order_products__prior.head()

(885, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
221645,23391,13198,1,1
221646,23391,42803,2,1
221647,23391,8277,3,1
221648,23391,37602,4,1
221649,23391,40852,5,1


In [0]:
# Import order_products__train data:
order_products__train = pd.read_csv('https://raw.githubusercontent.com/juemura/amli-kaggle-instacart/master/order_products__train_short.csv', index_col=0)
print(order_products__train.shape)
order_products__train.head()

(104, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
214306,525192,12053,1,0
214307,525192,47272,2,1
214308,525192,37999,3,1
214309,525192,13198,4,1
214310,525192,43967,5,1


In [0]:
n_of_train_orders = len(orders_df[orders_df['eval_set'] == "train"])

# Sanity check: 
assert(n_of_train_orders == len(order_products__train['order_id'].unique()))
            # *dfc* Above are two ways to count how many orders are in the training set.
            # This assertion helps us check if there is training data that is useless because
            # it either has orders in orders_df that have no corresponding label 
            # in the training data (i.e., we don't know which products were included in 
            # that order), or we have labels but no way of matching them to a user.

print("Number of orders in the training set: {}".format(n_of_train_orders))
print("""This means that, for {} of the 10 users in this subset, we have the list of 
products included in their last order. The other {} are the users for whom we need to
make a prediction (test set). In the full data set these two sets correspond, respectively, 
to the training set data, which contains the last order for ~131k customers, 
and the test set, which contains 75k customers for which we need to predict what 
products were included in their last order.""".format(n_of_train_orders, 10 - n_of_train_orders))

Number of orders in the training set: 7
This means that, for 7 of the 10 users in this subset, we have the list of 
products included in their last order. The other 3 are the users for whom we need to
make a prediction (test set). In the full data set these two sets correspond, respectively, 
to the training set data, which contains the last order for ~131k customers, 
and the test set, which contains 75k customers for which we need to predict what 
products were included in their last order.
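
\**dfc*\* If a sanity check like the assertion above ever fails, it helps to see which order ids are mismatched rather than only comparing counts. The sketch below uses set differences on toy data; the dataframes `orders` and `train_products` and all their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for orders_df and order_products__train (values are made up).
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "eval_set": ["prior", "train", "train", "test"],
})
train_products = pd.DataFrame({
    "order_id": [11, 11, 12],
    "product_id": [196, 260, 23],
})

train_ids_in_orders = set(orders.loc[orders["eval_set"] == "train", "order_id"])
train_ids_with_products = set(train_products["order_id"])

# Orders marked "train" that have no product rows (no label available).
missing_labels = train_ids_in_orders - train_ids_with_products
# Product rows whose order cannot be matched to a user.
orphan_labels = train_ids_with_products - train_ids_in_orders

print(missing_labels, orphan_labels)
```

Two empty sets mean that every training order has products and every product row can be matched to a user.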


In [0]:
products_df = pd.read_csv('https://raw.githubusercontent.com/juemura/amli-kaggle-instacart/master/products_short.csv', index_col=0)
print(products_df.shape)
products_df.head()

(431, 4)


Unnamed: 0,product_id,product_name,aisle_id,department_id
22,23,Organic Turkey Burgers,49,12
78,79,Wild Albacore Tuna No Salt Added,95,15
195,196,Soda,77,7
247,248,Dried Sweetened Cranberries,117,19
259,260,Cantaloupe,24,4


In [0]:
print("Number of unique products: {}".format(len(products_df['product_id'].unique())))


Number of unique products: 431




---

### Making sense of the data set

Let's take a look at an individual user.

In [0]:
orders_df[orders_df['user_id'] == 1] # *dfc* this is the syntax that you would use to return all of the rows that match a given condition

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0
5,3367565,1,prior,6,2,7,19.0
6,550135,1,prior,7,1,9,20.0
7,3108588,1,prior,8,1,14,14.0
8,2295261,1,prior,9,1,16,0.0
9,2550362,1,prior,10,4,8,30.0


Ok, we see all of the data and how it's represented. But it's hard to actually visualize and understand it with all these numerical representations. This is another application of feature engineering. Sometimes you want to put all of your data in a format that is human friendly.

Let's take a moment to transform our data set into a format that will help us understand what's going on.


In [0]:
# Step 1: Create a list of fake aliases to represent each customer
names = ["Anna", "Betty", "Charles", "Duke", "Elisa", "Frank", "Gisela", "Homer", "Ingrid", "Jeff"]

# Step 2: Match each name to a user id and save the mapping in a dictionary
# Short, pythonic version:
aliases = dict(zip(orders_df['user_id'].unique(), names))
print(aliases)

{1: 'Anna', 2: 'Betty', 3: 'Charles', 4: 'Duke', 5: 'Elisa', 6: 'Frank', 7: 'Gisela', 8: 'Homer', 9: 'Ingrid', 10: 'Jeff'}


In [0]:
# *dfc* Longer but more digestible version (this does the exact same thing as line 6 in the previous cell, only with more lines of code):
unique_customers = orders_df['user_id'].unique()
print(unique_customers)

[ 1  2  3  4  5  6  7  8  9 10]


In [0]:
aliases_vb = {} # Create empty dictionary
for index, value in enumerate(unique_customers): # *dfc* we need to use enumerate to get both the index and value of an iterable.
  aliases_vb[value] = names[index]
  
assert(aliases == aliases_vb)
print(aliases_vb)

{1: 'Anna', 2: 'Betty', 3: 'Charles', 4: 'Duke', 5: 'Elisa', 6: 'Frank', 7: 'Gisela', 8: 'Homer', 9: 'Ingrid', 10: 'Jeff'}


In [0]:
# Step 3: Replace user_id with names
def id_to_name(row):
  u_id = row['user_id']
  return aliases[u_id]

orders_df['customer'] = orders_df.apply(lambda row: id_to_name(row), axis=1)

orders_df.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,customer
0,2539329,1,prior,1,2,8,,Anna
1,2398795,1,prior,2,3,7,15.0,Anna
2,473747,1,prior,3,3,12,21.0,Anna
3,2254736,1,prior,4,4,7,29.0,Anna
4,431534,1,prior,5,4,15,28.0,Anna


*dfc* The example above shows one way to create a column. "<code>apply</code>" will apply a given function to every row of the dataframe.
In this case you don't need to create a separate function ("<code>id_to_name</code>"), you could have simply done:

In [0]:
# *dfc*
orders_df['c2'] = orders_df.apply(lambda row: aliases[row['user_id']], axis=1)
orders_df.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,customer,c2
0,2539329,1,prior,1,2,8,,Anna,Anna
1,2398795,1,prior,2,3,7,15.0,Anna,Anna
2,473747,1,prior,3,3,12,21.0,Anna,Anna
3,2254736,1,prior,4,4,7,29.0,Anna,Anna
4,431534,1,prior,5,4,15,28.0,Anna,Anna


In [0]:
# *dfc*
orders_df.drop('c2', inplace=True, axis=1)

*dfc* However, you couldn't have assigned <code>aliases[row['user_id']]</code> directly without using apply.

<code>df['new_column'] = a_value  # works!</code> 

<code>df['new_column'] = df['old_column']  # works too!</code>

Can you see why the examples below don't?

In [0]:
# *dfc*
orders_df['customer'] = aliases[orders_df['user_id']]
# TypeError: 'Series' objects are mutable, thus they cannot be hashed

orders_df['customer'] = aliases[int(orders_df['user_id'])]
# TypeError: cannot convert the series to <class 'int'>

TypeError: ignored
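
\**dfc*\* Both lines fail because a dictionary lookup expects a single hashable key, while `orders_df['user_id']` is a whole Series. One vectorized alternative is `Series.map`, which performs the lookup for each element of the column. Here is a minimal sketch on toy data (`toy_orders` and `toy_aliases` are made-up stand-ins, not the notebook's variables):

```python
import pandas as pd

toy_orders = pd.DataFrame({"user_id": [1, 1, 2]})
toy_aliases = {1: "Anna", 2: "Betty"}

# map() applies the dictionary lookup to each value of the column.
toy_orders["customer"] = toy_orders["user_id"].map(toy_aliases)

# Equivalent row-wise version with apply, in the style used earlier.
via_apply = toy_orders.apply(lambda row: toy_aliases[row["user_id"]], axis=1)

print(toy_orders)
```

One difference to keep in mind: with `map`, an id that is missing from the dictionary becomes NaN, whereas the `apply` version raises a KeyError.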

### Exercise 1: 
Create a column 'order_dow_time' that contains the day of the week and time period of the order (morning, afternoon, evening)

In [0]:
########################
## YOUR CODE HERE
########################



#### Hint




You can use a function like id_to_name and use if/else statements to check the value contained in two distinct columns.

### [add section] The replace function
order_dow	order_hour_of_day

## Merging dataframes

In order to train a model that will predict which previously purchased products will be in a customer's next order, we need to build a dataframe where each row represents a customer x product pair, so we can have an "in_last_order" column, which will be our label (output).

The first step is to merge orders_df with the dataframes that contain the information about which products were in each order.

Pandas provides a <code>merge()</code> method. Go to https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html for a summary of the different ways to merge, join, and concatenate two dataframes.
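
\**dfc*\* To get a feel for the `how` argument before using it on the real data, here is a small sketch with two made-up dataframes (`left` and `right` and their values are invented for illustration). It compares an inner, a left, and an outer merge on the same key.

```python
import pandas as pd

left = pd.DataFrame({"order_id": [1, 2, 3], "product_id": [196, 260, 23]})
right = pd.DataFrame({"order_id": [1, 2, 4], "user_id": [7, 7, 8]})

# inner: only order ids present in both dataframes.
inner = pd.merge(left, right, on="order_id", how="inner")

# left: every row of `left`; unmatched rows get NaN in the columns from `right`.
left_join = pd.merge(left, right, on="order_id", how="left")

# outer: all order ids from both sides; indicator=True adds a "_merge" column
# saying where each row came from.
outer = pd.merge(left, right, on="order_id", how="outer", indicator=True)

print(len(inner), len(left_join), len(outer))
print(outer)
```

Comparing row counts before and after a merge, as in this sketch, is a cheap way to confirm that the merge did what you expected.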

In [0]:
df = pd.merge(order_products__prior, orders_df, on="order_id", how="inner")[['eval_set', 'user_id', 
                                                                              'order_id', 'order_number', 
                                                                              'order_dow', 'order_hour_of_day', 
                                                                              'days_since_prior_order', 'product_id', 
                                                                              'add_to_cart_order', 'reordered']]
print("Prior df merged with orders data: ", df.shape)
df.head()

Prior df merged with orders data:  (885, 10)


Unnamed: 0,eval_set,user_id,order_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered
0,prior,7,23391,17,0,10,28.0,13198,1,1
1,prior,7,23391,17,0,10,28.0,42803,2,1
2,prior,7,23391,17,0,10,28.0,8277,3,1
3,prior,7,23391,17,0,10,28.0,37602,4,1
4,prior,7,23391,17,0,10,28.0,40852,5,1


We now have a table with one line per product per order. Note that we used an inner merge because we want the intersection of the two dataframes.
Let's create some new columns to work with.

In [0]:
df['reorder_total'] = df.groupby(['user_id', 'product_id'])['reordered'].transform('sum')
df['user_product_id'] = df['product_id'] + df['user_id'].astype(np.int64) * 100000

**reorder_total** is the number of times a product was reordered by an individual user. If a user orders bananas 5 times, reorder_total will be 4.

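**user_product_id** combines user_id and product_id into a single number (user_id * 100000 + product_id) so that we have a unique identifier for each product ever purchased by a user. This works as long as every product_id is below 100000. (This is not actually necessary, but it's an option.)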
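
\**dfc*\* The difference between a regular group sum and `transform('sum')` is easiest to see on a few rows. The sketch below uses a made-up dataframe `toy`; only the column names mirror the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2],
    "product_id": [196, 196, 260, 196, 196],
    "reordered":  [0, 1, 0, 0, 1],
})

# Plain sum: one value per (user_id, product_id) group.
per_group = toy.groupby(["user_id", "product_id"])["reordered"].sum()

# transform('sum'): the same group sums, repeated on every original row,
# so the result can be assigned directly as a new column.
toy["reorder_total"] = toy.groupby(["user_id", "product_id"])["reordered"].transform("sum")

print(per_group)
print(toy)
```

Because `transform` returns one value per input row, the dataframe keeps its original shape, which is what lets us add the result as a column.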




---


## Your turn
TODO: Create the following columns:
* **no_of_orders:** the number of times a user placed an order.
* **order_ratio:** the ratio of how many orders included a specific product out of the total number of orders that a user placed. 

We will be using these new columns to build our features.

In [0]:
########################
## YOUR CODE HERE
########################



In [0]:
# AK 1
df['no_of_orders'] = df.groupby(['user_id'])['order_number'].transform('max')
df['order_ratio'] = (df['reorder_total'] + 1) / df['no_of_orders']
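
\**dfc*\* The answer key above relies on two assumptions: order_number counts a user's orders without gaps, and a product's first purchase has reordered = 0 (hence the "+ 1"). The sketch below checks both ideas on a made-up dataframe `toy`; the values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id":      [1, 1, 1, 1],
    "order_id":     [10, 10, 11, 12],
    "order_number": [1, 1, 2, 3],
    "product_id":   [196, 260, 196, 196],
    "reordered":    [0, 0, 1, 1],
})

# Orders per user: the highest order_number, which here matches
# the number of distinct order ids.
toy["no_of_orders"] = toy.groupby("user_id")["order_number"].transform("max")
n_distinct = toy.groupby("user_id")["order_id"].transform("nunique")

# Times each product was ordered: the first purchase plus every reorder.
toy["reorder_total"] = toy.groupby(["user_id", "product_id"])["reordered"].transform("sum")
toy["order_ratio"] = (toy["reorder_total"] + 1) / toy["no_of_orders"]

print(toy)
```

In this toy example product 196 appears in all three orders, so its ratio is 1.0, while product 260 appears in one of three.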

In [0]:
rows = df.shape[0] # this will be used later to make sure we did our merge correctly
print(df.shape)
df.head()

(885, 14)


Unnamed: 0,eval_set,user_id,order_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,reorder_total,user_product_id,no_of_orders,order_ratio
0,prior,7,23391,17,0,10,28.0,13198,1,1,7,713198,20,0.4
1,prior,7,23391,17,0,10,28.0,42803,2,1,6,742803,20,0.35
2,prior,7,23391,17,0,10,28.0,8277,3,1,2,708277,20,0.15
3,prior,7,23391,17,0,10,28.0,37602,4,1,11,737602,20,0.6
4,prior,7,23391,17,0,10,28.0,40852,5,1,12,740852,20,0.65


Now we will begin creating our labels. 

TODO: Merge the orders data with the training data.
Note that we only need the user_id and product_id columns. Why?

In [0]:
########################
## YOUR CODE HERE
########################



In [0]:
# AK 2
labels = pd.merge(order_products__train, orders_df, on="order_id", how="inner")[['user_id', 'product_id']]


In [0]:
print("labels: ", labels.shape)
n_of_customers = len(labels.user_id.unique())
print("Number of customers: ", n_of_customers)
print("This is the last order of {} customers, which will be used to create the Y label and train the model.".format(n_of_customers))
labels.head()

labels:  (104, 2)
Number of customers:  7
This is the last order of 7 customers, which will be used to create the Y label and train the model.


Unnamed: 0,user_id,product_id
0,7,12053
1,7,47272
2,7,37999
3,7,13198
4,7,43967


This is looking good!

We need to use *labels* to create a column in *df* indicating which products were in the last order. 
To do that, we need to merge labels with our initial dataset.

TODO: Create a column **in_last_order** in *labels* filled with 1's (*labels* represents the products that were included in the last order, so in_last_order is True for every row).

In [0]:
########################
## YOUR CODE HERE
########################



In [0]:
# AK 3
labels['in_last_order'] = 1

Finally, we need to put it all together.

TODO: 
1. Merge labels with df. Hint: You will need to do a left merge.
2. Indicate which items were not in the last order by filling the NaN values with 0's.

Hint: You can merge on multiple columns by using a list, or you can use the user_product_id column.

In [0]:
########################
## YOUR CODE HERE
########################



In [0]:
# AK 4
df = pd.merge(df, labels, how='left', on=['user_id', 'product_id'])
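# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to df.
df['in_last_order'] = df['in_last_order'].fillna(0).astype('int32')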
df.head()

Unnamed: 0,eval_set,user_id,order_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,reorder_total,user_product_id,no_of_orders,order_ratio,in_last_order
0,prior,7,23391,17,0,10,28.0,13198,1,1,7,713198,20,0.4,1
1,prior,7,23391,17,0,10,28.0,42803,2,1,6,742803,20,0.35,0
2,prior,7,23391,17,0,10,28.0,8277,3,1,2,708277,20,0.15,0
3,prior,7,23391,17,0,10,28.0,37602,4,1,11,737602,20,0.6,0
4,prior,7,23391,17,0,10,28.0,40852,5,1,12,740852,20,0.65,1


In [0]:
# Sanity check to make sure we didn't add any rows.
assert(df.shape[0] == rows)

That's it! Your data is now ready for feature selection. Once you figure out which features you want to use, and possibly create a few more, you will be able to split this dataframe into input/output and use it to train your model. Note: you might want to use <code>df.drop_duplicates(subset=['user_id', 'product_id'], inplace=True)</code>. Why?
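
\**dfc*\* To see what `drop_duplicates` with a `subset` does, here is a minimal sketch on a made-up dataframe `toy` (all values invented). It keeps the first row for each user_id and product_id pair.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id":       [7, 7, 7],
    "order_id":      [100, 101, 100],
    "product_id":    [13198, 13198, 42803],
    "order_ratio":   [0.4, 0.4, 0.35],
    "in_last_order": [1, 1, 0],
})

# One row per (user_id, product_id) pair; the first occurrence is kept.
pairs = toy.drop_duplicates(subset=["user_id", "product_id"])

print(pairs)
```

Notice that the per-pair columns (order_ratio, in_last_order) are identical across the duplicated rows, while per-order columns such as order_id are not, so only the former stay meaningful after deduplication.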


## Bonus
Say you needed to create a spreadsheet using product names instead of their id's (so humans could understand it). You could merge df with products_df, but you could also build a dictionary and apply a lookup function to each row. See the example below:

In [0]:
print(df.columns.values)
df.head()

['eval_set' 'user_id' 'order_id' 'order_number' 'order_dow'
 'order_hour_of_day' 'days_since_prior_order' 'product_id'
 'add_to_cart_order' 'reordered' 'reorder_total' 'user_product_id'
 'no_of_orders' 'order_ratio' 'in_last_order']


Unnamed: 0,eval_set,user_id,order_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,reorder_total,user_product_id,no_of_orders,order_ratio,in_last_order
0,prior,7,23391,17,0,10,28.0,13198,1,1,7,713198,20,0.4,1
1,prior,7,23391,17,0,10,28.0,42803,2,1,6,742803,20,0.35,0
2,prior,7,23391,17,0,10,28.0,8277,3,1,2,708277,20,0.15,0
3,prior,7,23391,17,0,10,28.0,37602,4,1,11,737602,20,0.6,0
4,prior,7,23391,17,0,10,28.0,40852,5,1,12,740852,20,0.65,1


In [0]:
product_dict = dict(zip(products_df.product_id, products_df.product_name))

def get_name(row):
  return product_dict[row['product_id']]

df['product_name'] = df.apply(lambda row: get_name(row), axis=1)
print(df.columns.values)
df.head()


['eval_set' 'user_id' 'order_id' 'order_number' 'order_dow'
 'order_hour_of_day' 'days_since_prior_order' 'product_id'
 'add_to_cart_order' 'reordered' 'reorder_total' 'user_product_id'
 'no_of_orders' 'order_ratio' 'in_last_order' 'product_name']


Unnamed: 0,eval_set,user_id,order_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,reorder_total,user_product_id,no_of_orders,order_ratio,in_last_order,product_name
0,prior,7,23391,17,0,10,28.0,13198,1,1,7,713198,20,0.4,1,85% Lean Ground Beef
1,prior,7,23391,17,0,10,28.0,42803,2,1,6,742803,20,0.35,0,Organic Apple Slices
2,prior,7,23391,17,0,10,28.0,8277,3,1,2,708277,20,0.15,0,Apple Honeycrisp Organic
3,prior,7,23391,17,0,10,28.0,37602,4,1,11,737602,20,0.6,0,Mexican Coffee
4,prior,7,23391,17,0,10,28.0,40852,5,1,12,740852,20,0.65,1,Lactose Free Fat Free Milk
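
\**dfc*\* For comparison, the merge-based alternative mentioned at the start of this section could look like the sketch below. The dataframes `toy_df` and `toy_products` are small made-up stand-ins; the product ids and names are borrowed from the products table shown earlier.

```python
import pandas as pd

toy_df = pd.DataFrame({"user_id": [7, 7], "product_id": [196, 260]})
toy_products = pd.DataFrame({
    "product_id": [196, 260, 23],
    "product_name": ["Soda", "Cantaloupe", "Organic Turkey Burgers"],
    "aisle_id": [77, 24, 49],
})

# Left merge keeps every row of toy_df and pulls in only the name column.
named = pd.merge(
    toy_df,
    toy_products[["product_id", "product_name"]],
    on="product_id",
    how="left",
)

print(named)
```

Selecting only the columns you need from the lookup table before merging keeps the result from picking up extra columns such as aisle_id.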


In [0]:
from google.colab import files

df.drop('product_id', inplace=True, axis=1)
df.to_csv('df.csv')
files.download('df.csv')

## Bonus Challenge
Create a spreadsheet that shows, for each user, the most frequent day of the week (shown as a day name such as Monday instead of the number encoding), the average hour of the day, the average number of days between orders, the average number of items in each order, and a list of the top 3 products that they order, along with their aisle and department.

## Debrief
* How do you feel about feature selection?
* What was hard?
* What was too easy?
* Is there anything you'd like to share?