In [1]:
import sklearn
import pandas as pd

In [3]:
data_folder = '~/instacart_data/'
aisles = pd.read_csv(data_folder + 'aisles.csv')
departments = pd.read_csv(data_folder + 'departments.csv')
orders = pd.read_csv(data_folder + 'orders.csv')
orders_prior = pd.read_csv(data_folder + 'order_products__prior.csv')
orders_train = pd.read_csv(data_folder + 'order_products__train.csv')
products = pd.read_csv(data_folder + 'products.csv')
sample_submission = pd.read_csv(data_folder + 'sample_submission.csv')

orders_full was made in `make_orders_full.ipynb`

Notice that we have product id, product name, aisle id, and department id. The product name could be useful in the future if we wanted to build word embeddings from these names; for example, we could group all the 'cookies' together or all the 'teas'. The aisle id can also be used to group food types, since grocery stores shelve similar items together. The department id, as we will see in the next cells, is another grouping of foods, such as 'frozen', 'bakery', 'produce', etc. It is a coarser grouping than either the individual names or the aisles, since a 'frozen' department might have many aisles: one for ice cream, one for frozen dinners, etc. So the hierarchy of food groupings that we might use is:

- Individual items (using the product_name)
- Aisle
- Department
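
As a rough sketch of how this hierarchy could be attached to each product, the lookup tables can be merged onto `products` by their id columns. The helper name `add_hierarchy` and the toy frames are assumptions made for illustration; the 'placeholder' names are invented, while 'frozen' and 'marinades meat preparation' come from the previews in this notebook.

```python
import pandas as pd

def add_hierarchy(products, aisles, departments):
    """Attach aisle and department names to each product row (hypothetical helper)."""
    # Selecting only the needed columns keeps a stray index column
    # such as 'Unnamed: 0' from colliding during the merges.
    return (
        products
        .merge(aisles[['aisle_id', 'aisle']], on='aisle_id', how='left')
        .merge(departments[['department_id', 'department']], on='department_id', how='left')
    )

# Toy stand-ins for the real tables; names marked 'placeholder' are invented.
products_toy = pd.DataFrame({
    'product_id': [4, 5],
    'product_name': ['Smart Ones Classic Favorites Mini Rigatoni Wit...',
                     'Green Chile Anytime Sauce'],
    'aisle_id': [38, 5],
    'department_id': [1, 13],
})
aisles_toy = pd.DataFrame({
    'aisle_id': [5, 38],
    'aisle': ['marinades meat preparation', 'placeholder aisle 38'],
})
departments_toy = pd.DataFrame({
    'department_id': [1, 13],
    'department': ['frozen', 'placeholder department 13'],
})

hierarchy = add_hierarchy(products_toy, aisles_toy, departments_toy)

# Number of products in each department, the coarsest level of the hierarchy.
products_per_department = hierarchy.groupby('department')['product_id'].count()
```

With the real tables, the same call would be `add_hierarchy(products, aisles, departments)`.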

In [4]:
products.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [5]:
products.shape[0]

49688

There are about 50,000 products, each with aisle_id and department_id

In [6]:
departments.head()

Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol


In [7]:
departments.shape[0]

21

There are 21 departments

In [8]:
aisles.head()

Unnamed: 0,aisle_id,aisle
0,1,prepared soups salads
1,2,specialty cheeses
2,3,energy granola bars
3,4,instant foods
4,5,marinades meat preparation


In [9]:
aisles.shape[0]

134

There are 134 aisles.

Altogether we have the following statistics:

- products: ~50,000
- departments: 21
- aisles: 134

## The three 'orders' tables

There are three sets of data which have information about the orders that customers have made.

**Orders**: This table includes information about all the orders placed. The features are:
- *order_id*
- *user_id*
- *eval_set*: this feature tells which of the three sets the order belongs in (prior, train, test)
- *order_number*
- *order_dow*: the day of the week on which the order was placed
- *order_hour_of_day*: hour of the day the order was placed (on a 24-hour clock)
- *days_since_prior_order*

The `orders` table does not include information about the individual products that go into each order. `orders` records the relationship between a user and the order. 
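
Because `orders` links users to orders, it can be summarised per user without touching the product tables. The sketch below uses only the first five rows for user 1 shown in this notebook, so the numbers describe that toy slice and not the user's full history; the name `user_summary` is an assumption.

```python
import pandas as pd

# First five rows of `orders` for user 1, copied from the preview in this notebook.
orders_toy = pd.DataFrame({
    'order_id': [2539329, 2398795, 473747, 2254736, 431534],
    'user_id': [1, 1, 1, 1, 1],
    'eval_set': ['prior'] * 5,
    'order_number': [1, 2, 3, 4, 5],
    'days_since_prior_order': [float('nan'), 15.0, 21.0, 29.0, 28.0],
})

# One row per user: how many orders, the latest order number, and the
# average gap between orders (the NaN on the first order is skipped by mean).
user_summary = orders_toy.groupby('user_id').agg(
    n_orders=('order_id', 'count'),
    last_order_number=('order_number', 'max'),
    mean_days_between=('days_since_prior_order', 'mean'),
)
```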

The tables `orders_prior` and `orders_train` relate the individual products (represented by product ids) to the orders. These two tables also record the position at which each item was added to the cart and whether the item is a reorder. These tables have the following features:

- *order_id*
- *product_id*
- *add_to_cart_order*: what order the item was added to the cart. Was it the first added? The last?
- *reordered*: Has this person ordered this item in the past?
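
To make these features concrete, here is a small sketch that derives per-order summaries from them. The rows are the previewed rows for orders 1 and 2 in this notebook (the real orders may contain more items), and the names `per_order` and `first_added` are assumptions.

```python
import pandas as pd

# Previewed rows for order 1 (train) and order 2 (prior) from this notebook.
order_products_toy = pd.DataFrame({
    'order_id': [1] * 5 + [2] * 5,
    'product_id': [49302, 11109, 10246, 49683, 43633,
                   33120, 28985, 9327, 45918, 30035],
    'add_to_cart_order': [1, 2, 3, 4, 5] * 2,
    'reordered': [1, 1, 0, 0, 1,
                  1, 1, 0, 1, 0],
})

# Cart size and share of reordered items for each order.
per_order = order_products_toy.groupby('order_id').agg(
    cart_size=('product_id', 'count'),
    reorder_share=('reordered', 'mean'),
)

# The product that was added to the cart first in each order.
first_added = (
    order_products_toy
    .loc[order_products_toy['add_to_cart_order'] == 1]
    .set_index('order_id')['product_id']
)
```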

To get one large table that includes both the per-product information and the order-level information, we would concatenate the `orders_prior` and `orders_train` tables and then merge the result with the `orders` table on the `order_id` column. We want to keep every order, even one with no product rows (such as a test order, whose products appear in neither table), so we left join the concatenated table onto `orders`, which would look something like this:

```python
df = pd.concat((orders_train, orders_prior), axis=0)
df = orders.merge(df, on='order_id', how='left')
```
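
The effect of the left join can be checked on a toy example. In this sketch the order ids come from the previews in this notebook, the user ids are invented, and the `indicator=True` flag is added only to show which orders found no product rows.

```python
import pandas as pd

# Toy `orders`: one train, one prior, and one test order (user ids are invented).
orders_toy = pd.DataFrame({
    'order_id': [1, 2, 17],
    'user_id': [10, 11, 12],
    'eval_set': ['train', 'prior', 'test'],
})
train_toy = pd.DataFrame({
    'order_id': [1, 1],
    'product_id': [49302, 11109],
    'add_to_cart_order': [1, 2],
    'reordered': [1, 1],
})
prior_toy = pd.DataFrame({
    'order_id': [2, 2],
    'product_id': [33120, 28985],
    'add_to_cart_order': [1, 2],
    'reordered': [1, 1],
})

# Stack the two product tables, then left join them onto the orders.
order_products = pd.concat([train_toy, prior_toy], axis=0, ignore_index=True)
merged = orders_toy.merge(order_products, on='order_id', how='left', indicator=True)

# Orders flagged 'left_only' had no product rows; their product columns are NaN.
no_products = merged.loc[merged['_merge'] == 'left_only', 'order_id'].tolist()
```

Here the test order survives the merge as a single row with missing product columns, which is the behaviour the left join is chosen for.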

In [22]:
orders_train.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1


In [17]:
orders.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [20]:
orders['eval_set'].unique()

array(['prior', 'train', 'test'], dtype=object)

In [10]:
orders_prior.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [11]:
orders_prior.shape[0]

32434489

In [16]:
orders_prior['order_id'].nunique()

3214874

There are about 32.4 million rows in the prior orders table, each representing an item placed in an order. The number of distinct orders is about 3.2 million.
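
Dividing the two counts above gives the average number of items per prior order, which comes out at roughly 10. A quick check, using the counts printed by the two cells above:

```python
n_rows = 32434489    # rows in orders_prior (one per item in an order)
n_orders = 3214874   # distinct order_id values in orders_prior

# Average number of items in a prior order.
avg_items_per_order = n_rows / n_orders
print(round(avg_items_per_order, 2))
```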

In [13]:
orders_train.shape

(1384617, 4)

In [14]:
orders_train['order_id'].unique()

array([      1,      36,      38, ..., 3421058, 3421063, 3421070])

In [15]:
sample_submission.head()

Unnamed: 0,order_id,products
0,17,39276 29259
1,34,39276 29259
2,137,39276 29259
3,182,39276 29259
4,257,39276 29259
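
The sample submission has one row per order, with the predicted product ids joined by spaces in the `products` column. A minimal sketch of building a table in that shape follows; the `predicted` dictionary is a hypothetical stand-in for whatever a model would output.

```python
import pandas as pd

# Hypothetical predictions: order_id -> list of predicted product ids.
predicted = {
    17: [39276, 29259],
    34: [39276],
}

# One row per order, product ids joined by single spaces as in sample_submission.
submission = pd.DataFrame({
    'order_id': list(predicted.keys()),
    'products': [' '.join(str(p) for p in items) for items in predicted.values()],
})
```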


In [16]:
orders.loc[orders['eval_set'] == 'test'].head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
38,2774568,3,test,13,5,15,11.0
44,329954,4,test,6,3,12,30.0
53,1528013,6,test,4,3,16,22.0
96,1376945,11,test,8,6,11,8.0
102,1356845,12,test,6,1,20,30.0


There are about 1.4 million rows in the training data

In [19]:
orders_full = pd.read_csv('data/orders_full.csv')
