# Grocery prediction using clique percolation method

## 1. Import modules
* `pandas` and `numpy` for data manipulation
* `turicreate` for performing model selection and evaluation
* `sklearn` for splitting the data into training and test sets

In [0]:
!pip install turicreate
!pip install scikit-learn
!pip install scripts

Traceback (most recent call last):
  File "/usr/local/bin/pip3", line 6, in <module>
    from pip._internal import main
  File "/usr/local/lib/python3.6/dist-packages/pip/_internal/__init__.py", line 40, in <module>
    from pip._internal.cli.autocompletion import autocomplete
  File "/usr/local/lib/python3.6/dist-packages/pip/_internal/cli/autocompletion.py", line 8, in <module>
    from pip._internal.cli.main_parser import create_main_parser
  File "/usr/local/lib/python3.6/dist-packages/pip/_internal/cli/main_parser.py", line 8, in <module>
    from pip._internal.cli import cmdoptions
  File "/usr/local/lib/python3.6/dist-packages/pip/_internal/cli/cmdoptions.py", line 22, in <module>
    from pip._internal.utils.hashes import STRONG_HASHES
  File "/usr/local/lib/python3.6/dist-packages/pip/_internal/utils/hashes.py", line 10, in <module>
    from pip._internal.utils.misc import read_chunks
  File "/usr/local/lib/python3.6/dist-packages/pip/_internal/utils/misc.py", line 21, in <mod

In [0]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import time
import turicreate as tc
from sklearn.model_selection import train_test_split

In [0]:
from google.colab import files
uploaded = files.upload()

## 2. Load data
Two datasets are used in this exercise, both found in the `data` folder:
* `recommend_1.csv` consisting of a list of 1000 customer IDs for which recommendations are produced as output
* `trx_data.csv` consisting of user transactions

The format is as follows.

In [0]:
import io

customers = pd.read_csv(io.BytesIO(uploaded['recommend_1.csv'])) 
transactions = pd.read_csv(io.BytesIO(uploaded['trx_data.csv']))

In [0]:
print(customers.shape)
customers.head()

In [0]:
print(transactions.shape)
transactions.head()

## 3. Data preparation
* Our goal here is to break down each list of items in the `products` column into rows and count the number of products bought by a user
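
The same reshaping can be sketched more compactly with `DataFrame.explode`. This is a minimal illustration, not the notebook's method: it assumes pandas 0.25 or later, assumes the `products` column still holds raw `|`-separated strings, and uses a made-up two-row table. The helper name `count_purchases` is hypothetical.

```python
import pandas as pd


def count_purchases(transactions: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (customerId, productId) with a purchase count."""
    # Split "2|2|23" into ["2", "2", "23"], then give each product its own row.
    exploded = (
        transactions
        .assign(productId=transactions["products"].str.split("|"))
        .explode("productId")
    )
    exploded["productId"] = exploded["productId"].astype("int64")
    # Count how many times each customer bought each product.
    return (
        exploded
        .groupby(["customerId", "productId"])
        .size()
        .reset_index(name="purchase_count")
    )


# Made-up data for illustration only.
toy = pd.DataFrame({"customerId": [0, 1], "products": ["20", "2|2|23"]})
counts = count_purchases(toy)
print(counts)
```

Compared with `apply(pd.Series)` followed by `pd.melt`, this avoids building a wide intermediate table with one column per basket position.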

In [0]:
# example 1: split product items
transactions['products'] = transactions['products'].apply(lambda x: [int(i) for i in x.split('|')])
transactions.head(2).set_index('customerId')['products'].apply(pd.Series).reset_index()

In [0]:
transactions.head()

In [0]:
# example 2: organize a given table into a dataframe with customerId, single productId, and purchase count
pd.melt(transactions.head(2).set_index('customerId')['products'].apply(pd.Series).reset_index(), 
             id_vars=['customerId'],
             value_name='products') \
    .dropna().drop(['variable'], axis=1) \
    .groupby(['customerId', 'products']) \
    .agg({'products': 'count'}) \
    .rename(columns={'products': 'purchase_count'}) \
    .reset_index() \
    .rename(columns={'products': 'productId'})

### 3.1. Create data with user, item, and target field
* This table will be an input for our modeling later
    * In this case, our user is `customerId`, our item is `productId`, and our target is `purchase_count`

In [0]:
s=time.time()

data = pd.melt(transactions.set_index('customerId')['products'].apply(pd.Series).reset_index(), 
             id_vars=['customerId'],
             value_name='products') \
    .dropna().drop(['variable'], axis=1) \
    .groupby(['customerId', 'products']) \
    .agg({'products': 'count'}) \
    .rename(columns={'products': 'purchase_count'}) \
    .reset_index() \
    .rename(columns={'products': 'productId'})
data['productId'] = data['productId'].astype(np.int64)

print("Execution time:", round((time.time()-s)/60,2), "minutes")

In [0]:
print(data.shape)
data.head()

In [0]:
max(data['productId']), min(data['productId'])

### 3.2. Creating a NetworkX graph

In [0]:
import networkx as nx

G=nx.Graph()

print(G.nodes())
print(G.edges())

print(type(G.nodes()))
print(type(G.edges()))

In [0]:
# a list of nodes:
G.add_nodes_from(data['productId'])
print("Nodes of graph: ")
print(G.nodes())

In [0]:
from itertools import combinations

# Connect every pair of distinct products bought by the same customer.
# Grouping by customerId avoids relying on row order and avoids self-loops,
# which the index-based double loop created when i == j.
for _, products in data.groupby('customerId')['productId']:
    G.add_edges_from(combinations(products, 2))

* Nodes of graph for clique percolation method

In [0]:
print("Nodes of graph: ")
print(G.nodes())
print("Edges of graph: ")
print(G.edges())

* Notice how dense the graph is

In [0]:
nx.draw(G)
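
Density can also be measured rather than judged from the drawing. The sketch below uses `nx.density`, which for an undirected graph with n nodes and m edges returns 2m / (n(n - 1)). The two toy graphs are illustrative stand-ins, not the product graph `G`.

```python
import networkx as nx

# Two small reference graphs: a path (sparse) and a complete graph (dense).
sparse = nx.path_graph(4)      # 4 nodes, 3 edges
dense = nx.complete_graph(4)   # 4 nodes, 6 edges

# Density is the fraction of possible edges that are present.
sparse_density = nx.density(sparse)
dense_density = nx.density(dense)
print(sparse_density, dense_density)
```

Calling `nx.density(G)` on the product graph would give a single number between 0 and 1 to back up the visual impression.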

## 4. Clique Percolation Method
* Applying Clique Percolation method on NetworkX graph

* Generating max cliques

In [0]:
from networkx.algorithms.approximation import clique 

c1 = clique.max_clique(G) 
print(c1)

* The clique percolation method is not practical here because the graph has too many nodes
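
For reference, this is roughly what clique percolation looks like on a graph small enough to handle. The sketch uses NetworkX's `k_clique_communities`, which merges k-cliques that share k - 1 nodes. The toy edge list and the choice of `k = 3` are assumptions for illustration and were not run on the product graph.

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Toy graph: two triangles sharing an edge (nodes 1-4), a bridge (4-5),
# and a separate triangle (nodes 5-7).
toy = nx.Graph()
toy.add_edges_from([
    (1, 2), (1, 3), (2, 3),   # triangle 1-2-3
    (2, 4), (3, 4),           # triangle 2-3-4, shares edge 2-3
    (4, 5),                   # bridge, not part of any triangle
    (5, 6), (5, 7), (6, 7),   # triangle 5-6-7
])

# With k=3, adjacent triangles (sharing 2 nodes) percolate into one community.
communities = sorted(sorted(c) for c in k_clique_communities(toy, 3))
print(communities)
```

The bridge edge does not join the two communities because it belongs to no triangle.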

# Association Rules Mining/Market Basket Analysis
* Source: https://www.kaggle.com/datatheque/association-rules-mining-market-basket-analysis

## 1. Setting up environment

In [0]:
import os
!pip install kaggle



In [0]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"jacobjohn2016","key":"<redacted>"}'}

In [0]:
!ls -lha kaggle.json

-rw-r--r-- 1 root root 69 Apr  3 13:16 kaggle.json


In [0]:
!ls
!pwd
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

kaggle.json  sample_data
/content


In [0]:
!kaggle competitions download -c instacart-market-basket-analysis

Downloading departments.csv.zip to /content
  0% 0.00/804 [00:00<?, ?B/s]
100% 804/804 [00:00<00:00, 776kB/s]
Downloading aisles.csv.zip to /content
  0% 0.00/1.87k [00:00<?, ?B/s]
100% 1.87k/1.87k [00:00<00:00, 1.78MB/s]
Downloading order_products__train.csv.zip to /content
  0% 0.00/6.90M [00:00<?, ?B/s]
100% 6.90M/6.90M [00:00<00:00, 63.4MB/s]
Downloading products.csv.zip to /content
  0% 0.00/795k [00:00<?, ?B/s]
100% 795k/795k [00:00<00:00, 111MB/s]
Downloading orders.csv.zip to /content
 77% 24.0M/31.3M [00:00<00:00, 25.5MB/s]
100% 31.3M/31.3M [00:00<00:00, 70.6MB/s]
Downloading order_products__prior.csv.zip to /content
100% 157M/157M [00:01<00:00, 111MB/s] 
100% 157M/157M [00:01<00:00, 101MB/s]
Downloading sample_submission.csv.zip to /content
  0% 0.00/220k [00:00<?, ?B/s]
100% 220k/220k [00:00<00:00, 193MB/s]


In [0]:
!ls
!unzip order_products__prior.csv
!unzip orders.csv
!unzip products.csv
!unzip aisles.csv
!unzip departments.csv

aisles.csv.zip	     order_products__prior.csv.zip  products.csv.zip
departments.csv.zip  order_products__train.csv.zip  sample_data
kaggle.json	     orders.csv.zip		    sample_submission.csv.zip
Archive:  order_products__prior.csv.zip
  inflating: order_products__prior.csv  
   creating: __MACOSX/
  inflating: __MACOSX/._order_products__prior.csv  
Archive:  orders.csv.zip
  inflating: orders.csv              
  inflating: __MACOSX/._orders.csv   
Archive:  products.csv.zip
  inflating: products.csv            
  inflating: __MACOSX/._products.csv  
Archive:  aisles.csv.zip
  inflating: aisles.csv              
  inflating: __MACOSX/._aisles.csv   
Archive:  departments.csv.zip
  inflating: departments.csv         
  inflating: __MACOSX/._departments.csv  


## 2. Data Preprocessing

In [0]:
import pandas as pd
import numpy as np
import sys
from itertools import combinations, groupby
from collections import Counter
from IPython.display import display

# Function that returns the size of an object in MB
def size(obj):
    return "{0:.2f} MB".format(sys.getsizeof(obj) / (1000 * 1000))

orders = pd.read_csv('order_products__prior.csv')
print('orders -- dimensions: {0};   size: {1}'.format(orders.shape, size(orders)))
display(orders.head())

orders -- dimensions: (32434489, 4);   size: 1037.90 MB


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


### 2.1 Convert order data into format expected by the association rules function

In [0]:
# Convert from DataFrame to a Series, with order_id as index and item_id as value
orders = orders.set_index('order_id')['product_id'].rename('item_id')
display(orders.head(10))
type(orders)

order_id
2    33120
2    28985
2     9327
2    45918
2    30035
2    17794
2    40141
2     1819
2    43668
3    33754
Name: item_id, dtype: int64

pandas.core.series.Series

### 2.2 Display summary statistics for order data

In [0]:
print('dimensions: {0};   size: {1};   unique_orders: {2};   unique_items: {3}'
      .format(orders.shape, size(orders), len(orders.index.unique()), len(orders.value_counts())))

dimensions: (32434489,);   size: 518.95 MB;   unique_orders: 3214874;   unique_items: 49677


## 3. Association Rules Function

### 3.1 Helper functions to the main association rules function

In [0]:
# Returns frequency counts for items and item pairs
def freq(iterable):
    if isinstance(iterable, pd.Series):
        return iterable.value_counts().rename("freq")
    else: 
        return pd.Series(Counter(iterable)).rename("freq")

    
# Returns number of unique orders
def order_count(order_item):
    return len(set(order_item.index))


# Returns generator that yields item pairs, one at a time
def get_item_pairs(order_item):
    # DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    order_item = order_item.reset_index().to_numpy()
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]
              
        for item_pair in combinations(item_list, 2):
            yield item_pair
            

# Returns frequency and support associated with item
def merge_item_stats(item_pairs, item_stats):
    return (item_pairs
                .merge(item_stats.rename(columns={'freq': 'freqA', 'support': 'supportA'}), left_on='item_A', right_index=True)
                .merge(item_stats.rename(columns={'freq': 'freqB', 'support': 'supportB'}), left_on='item_B', right_index=True))


# Returns name associated with item
def merge_item_name(rules, item_name):
    columns = ['itemA','itemB','freqAB','supportAB','freqA','supportA','freqB','supportB', 
               'confidenceAtoB','confidenceBtoA','lift']
    rules = (rules
                .merge(item_name.rename(columns={'item_name': 'itemA'}), left_on='item_A', right_on='item_id')
                .merge(item_name.rename(columns={'item_name': 'itemB'}), left_on='item_B', right_on='item_id'))
    return rules[columns]     
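
One detail of `get_item_pairs` is worth illustrating: `itertools.groupby` only groups consecutive rows with the same key, so the order data must already be contiguous by `order_id`. The sketch below shows the pattern on made-up `(order_id, item)` tuples. The helper name `item_pairs` is hypothetical.

```python
from itertools import combinations, groupby


def item_pairs(rows):
    """Yield every pair of items that share an order_id in consecutive rows."""
    for _, group in groupby(rows, key=lambda row: row[0]):
        items = [item for _, item in group]
        yield from combinations(items, 2)


# Rows contiguous by order_id: all pairs within each order are produced.
rows = [(1, "a"), (1, "b"), (2, "c"), (2, "d"), (2, "e")]
pairs = list(item_pairs(rows))

# Rows NOT contiguous: order 1 is split into two runs, so its pair is missed.
unsorted_pairs = list(item_pairs([(1, "a"), (2, "c"), (1, "b")]))
print(pairs, unsorted_pairs)
```

If the input were not grouped by order, sorting it by `order_id` first would restore the expected behavior.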

### 3.2 Association rules function

In [0]:
def association_rules(order_item, min_support):

    print("Starting order_item: {:22d}".format(len(order_item)))


    # Calculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Filter from order_item items below min support 
    qualifying_items       = item_stats[item_stats['support'] >= min_support].index
    order_item             = order_item[order_item.isin(qualifying_items)]

    print("Items with support >= {}: {:15d}".format(min_support, len(qualifying_items)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Filter from order_item orders with less than 2 items
    order_size             = freq(order_item.index)
    qualifying_orders      = order_size[order_size >= 2].index
    order_item             = order_item[order_item.index.isin(qualifying_orders)]

    print("Remaining orders with 2+ items: {:11d}".format(len(qualifying_orders)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Recalculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Get item pairs generator
    item_pair_gen          = get_item_pairs(order_item)


    # Calculate item pair frequency and support
    item_pairs              = freq(item_pair_gen).to_frame("freqAB")
    item_pairs['supportAB'] = item_pairs['freqAB'] / len(qualifying_orders) * 100

    print("Item pairs: {:31d}".format(len(item_pairs)))


    # Filter from item_pairs those below min support
    item_pairs              = item_pairs[item_pairs['supportAB'] >= min_support]

    print("Item pairs with support >= {}: {:10d}\n".format(min_support, len(item_pairs)))


    # Create table of association rules and compute relevant metrics
    item_pairs = item_pairs.reset_index().rename(columns={'level_0': 'item_A', 'level_1': 'item_B'})
    item_pairs = merge_item_stats(item_pairs, item_stats)
    
    item_pairs['confidenceAtoB'] = item_pairs['supportAB'] / item_pairs['supportA']
    item_pairs['confidenceBtoA'] = item_pairs['supportAB'] / item_pairs['supportB']
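    # Note: supports are percentages here, so this value is 1/100 of the
    # textbook lift P(A and B) / (P(A) * P(B)); the ranking is unaffected.
    item_pairs['lift']           = item_pairs['supportAB'] / (item_pairs['supportA'] * item_pairs['supportB'])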
    
    
    # Return association rules sorted by lift in descending order
    return item_pairs.sort_values('lift', ascending=False)
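
To make the three metrics concrete, here is a small worked example on five made-up baskets, using fractions rather than percentages. Support is the share of orders containing an item or pair, confidence A to B is support(A, B) / support(A), and lift is support(A, B) / (support(A) * support(B)). The helper name `pair_metrics` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Five made-up baskets, for illustration only.
baskets = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"B"}, {"A", "B", "C"}]
n_orders = len(baskets)

# Count how many baskets contain each item and each unordered item pair.
item_freq = Counter(item for basket in baskets for item in basket)
pair_freq = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def pair_metrics(a, b):
    """Return support, confidence, and lift for the pair (a, b)."""
    support_a = item_freq[a] / n_orders
    support_b = item_freq[b] / n_orders
    support_ab = pair_freq[tuple(sorted((a, b)))] / n_orders
    return {
        "supportAB": support_ab,
        "confidenceAtoB": support_ab / support_a,
        "confidenceBtoA": support_ab / support_b,
        "lift": support_ab / (support_a * support_b),
    }


m = pair_metrics("A", "B")
print(m)  # supportAB is 0.6, confidenceAtoB is about 0.75, lift is about 0.9375
```

A lift below 1, as here, means the two items appear together slightly less often than they would if purchased independently.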

### 3.3 Association Rule Mining

In [0]:
%%time
rules = association_rules(orders, 0.01)  

Starting order_item:               32434489
Items with support >= 0.01:           10906
Remaining order_item:              29843570
Remaining orders with 2+ items:     3013325
Remaining order_item:              29662716
Item pairs:                        30622410
Item pairs with support >= 0.01:      48751

CPU times: user 7min 27s, sys: 10.9 s, total: 7min 38s
Wall time: 7min 38s


In [0]:
# Replace item ID with item name and display association rules
item_name   = pd.read_csv('products.csv')
item_name   = item_name.rename(columns={'product_id':'item_id', 'product_name':'item_name'})
rules_final = merge_item_name(rules, item_name).sort_values('lift', ascending=False)
display(rules_final)

Unnamed: 0,itemA,itemB,freqAB,supportAB,freqA,supportA,freqB,supportB,confidenceAtoB,confidenceBtoA,lift
0,Organic Strawberry Chia Lowfat 2% Cottage Cheese,Organic Cottage Cheese Blueberry Acai Chia,306,0.010155,1163,0.038595,839,0.027843,0.263113,0.364720,9.449868
1,Grain Free Chicken Formula Cat Food,Grain Free Turkey Formula Cat Food,318,0.010553,1809,0.060033,879,0.029170,0.175788,0.361775,6.026229
3,Organic Fruit Yogurt Smoothie Mixed Berry,Apple Blueberry Fruit Yogurt Smoothie,349,0.011582,1518,0.050376,1249,0.041449,0.229908,0.279424,5.546732
9,Nonfat Strawberry With Fruit On The Bottom Gre...,"0% Greek, Blueberry on the Bottom Yogurt",409,0.013573,1666,0.055288,1391,0.046162,0.245498,0.294033,5.318230
10,Organic Grapefruit Ginger Sparkling Yerba Mate,Cranberry Pomegranate Sparkling Yerba Mate,351,0.011648,1731,0.057445,1149,0.038131,0.202773,0.305483,5.317849
11,Baby Food Pouch - Roasted Carrot Spinach & Beans,"Baby Food Pouch - Butternut Squash, Carrot & C...",332,0.011018,1503,0.049878,1290,0.042810,0.220892,0.257364,5.159830
12,Unsweetened Whole Milk Mixed Berry Greek Yogurt,Unsweetened Whole Milk Blueberry Greek Yogurt,438,0.014535,1622,0.053828,1621,0.053794,0.270037,0.270204,5.019798
23,Uncured Cracked Pepper Beef,Chipotle Beef & Pork Realstick,410,0.013606,1839,0.061029,1370,0.045465,0.222947,0.299270,4.903741
24,Organic Mango Yogurt,Organic Whole Milk Washington Black Cherry Yogurt,334,0.011084,1675,0.055586,1390,0.046128,0.199403,0.240288,4.322777
2,Grain Free Chicken Formula Cat Food,Grain Free Turkey & Salmon Formula Cat Food,391,0.012976,1809,0.060033,1553,0.051538,0.216142,0.251771,4.193848


## 4. Conclusion

From the output above, the top associations are unsurprising: one flavor of an item is purchased together with another flavor from the same item family (e.g., Strawberry Chia Cottage Cheese with Blueberry Acai Cottage Cheese, or Chicken Cat Food with Turkey Cat Food). One common application of association rules mining is in recommender systems. Once item pairs have been identified as having a positive relationship, recommendations can be made to customers in order to increase sales and, hopefully, to introduce customers to items they would never have tried or even imagined existed!
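
As a rough sketch of how such rules might feed a recommender, the function below looks up the partners of a given item in a rules table and ranks them by lift. It assumes a DataFrame with the `itemA`, `itemB`, and `lift` columns used in `rules_final`. The function name `recommend` and the toy rows are hypothetical and not taken from the results above.

```python
import pandas as pd


def recommend(rules: pd.DataFrame, item: str, top_n: int = 3) -> list:
    """Return up to top_n items most strongly associated with `item`, by lift."""
    # A pair is stored once, so the item may appear in either column.
    as_a = rules.loc[rules["itemA"] == item, ["itemB", "lift"]].rename(
        columns={"itemB": "partner"}
    )
    as_b = rules.loc[rules["itemB"] == item, ["itemA", "lift"]].rename(
        columns={"itemA": "partner"}
    )
    partners = pd.concat([as_a, as_b]).sort_values("lift", ascending=False)
    return partners["partner"].head(top_n).tolist()


# Made-up rules for illustration only.
toy_rules = pd.DataFrame({
    "itemA": ["tea", "tea", "salt"],
    "itemB": ["cookies", "salt", "cookies"],
    "lift": [2.0, 1.2, 3.5],
})
suggestions = recommend(toy_rules, "cookies")
print(suggestions)
```

Ranking by lift favors pairs that co-occur more than chance would predict. Ranking by confidence instead would favor partners that are simply popular.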

# Developing Word-Based Neural Language Models in Python with Keras

In [0]:
item_name.head()

Unnamed: 0,item_id,item_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [0]:
orders.head()

order_id
2    33120
2    28985
2     9327
2    45918
2    30035
Name: item_id, dtype: int64

In [0]:
item_name.head()
item_name.loc[100]

item_id                                 101
item_name        Bread, Healthy Whole Grain
aisle_id                                112
department_id                             3
Name: 100, dtype: object

In [0]:
rules_final.head() 

Unnamed: 0,itemA,itemB,freqAB,supportAB,freqA,supportA,freqB,supportB,confidenceAtoB,confidenceBtoA,lift
0,Organic Strawberry Chia Lowfat 2% Cottage Cheese,Organic Cottage Cheese Blueberry Acai Chia,306,0.010155,1163,0.038595,839,0.027843,0.263113,0.36472,9.449868
1,Grain Free Chicken Formula Cat Food,Grain Free Turkey Formula Cat Food,318,0.010553,1809,0.060033,879,0.02917,0.175788,0.361775,6.026229
3,Organic Fruit Yogurt Smoothie Mixed Berry,Apple Blueberry Fruit Yogurt Smoothie,349,0.011582,1518,0.050376,1249,0.041449,0.229908,0.279424,5.546732
9,Nonfat Strawberry With Fruit On The Bottom Gre...,"0% Greek, Blueberry on the Bottom Yogurt",409,0.013573,1666,0.055288,1391,0.046162,0.245498,0.294033,5.31823
10,Organic Grapefruit Ginger Sparkling Yerba Mate,Cranberry Pomegranate Sparkling Yerba Mate,351,0.011648,1731,0.057445,1149,0.038131,0.202773,0.305483,5.317849
