# Use Case Tutorial 3: Customer Interest Clustering

This tutorial shows how to cluster customers based on their purchases, which we treat as a proxy for their interests.

Marketing teams are frequently interested in this kind of analysis.

We'll show how graph analytics can be used to gain insights about the interests of customers by finding communities of customers who've bought similar products. 

We'll accomplish this by creating a bipartite graph of customers and products, using a graph projection to create a graph of customers linked to other customers who've bought the same product, and using Louvain community detection to find the communities.

We'll use e-commerce transaction data from a U.K. retailer, made available by the University of California, Irvine. The data can be found [here](https://www.kaggle.com/carrie1/ecommerce-data).

# Data Preprocessing

First, we'll import the libraries we need.

In [1]:
import metagraph as mg
import pandas as pd
import networkx as nx

Let's see what the data looks like.

In [2]:
RAW_DATA_CSV = './data/ecommerce/data.csv' # https://www.kaggle.com/carrie1/ecommerce-data
data_df = pd.read_csv(RAW_DATA_CSV, encoding="ISO-8859-1")
data_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


Let's clean the data: we'll parse the invoice dates, drop the rows that have no customer ID, and verify that no missing values remain.

In [3]:
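# Parse timestamps such as "12/1/2010 8:26".
data_df.InvoiceDate = pd.to_datetime(data_df.InvoiceDate, format="%m/%d/%Y %H:%M")
# Drop rows with a missing CustomerID (NaN is the only value not equal to itself).
data_df.drop(data_df.index[data_df.CustomerID != data_df.CustomerID], inplace=True)
# Verify that no remaining row contains a missing value.
assert len(data_df[data_df.isnull().any(axis=1)])==0, "Raw data contains NaN"
# CustomerID was read as float because of the missing values; convert it to int.
data_df = data_df.astype({'CustomerID': int}, copy=False)
data_df.head()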

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
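
The row filter in the cell above compares `CustomerID` with itself, which works because NaN is never equal to itself. Here is a minimal plain-Python sketch of that idiom on made-up values (not the tutorial's data); the variable names are illustrative only.

```python
import math

# Toy customer IDs; float("nan") stands in for a missing value.
customer_ids = [17850.0, float("nan"), 13047.0]

# NaN is the only float value that is not equal to itself.
is_missing = [cid != cid for cid in customer_ids]

# math.isnan states the same test explicitly.
is_missing_explicit = [math.isnan(cid) for cid in customer_ids]
```

In pandas, `data_df.CustomerID.isna()` produces the same mask and may read more clearly.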


Note that some of these transactions are for returns (denoted by negative quantity values).

In [4]:
data_df[data_df.Quantity < 1].head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
141,C536379,D,Discount,-1,2010-12-01 09:41:00,27.5,14527,United Kingdom
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,2010-12-01 09:49:00,4.65,15311,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,2010-12-01 10:24:00,1.65,17548,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548,United Kingdom
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548,United Kingdom


Though customers may have returned these products, they did initially purchase them, which reflects an interest in the product, so we'll keep the initial purchases. However, we'll remove the return transactions, which also removes any discount transactions.

In [5]:
data_df.drop(data_df.index[data_df.Quantity <= 0], inplace=True)
data_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom


# Community Detection

Let's now find the communities of customers with similar purchases / interests. 

First, we'll need to create a bipartite graph of customers and products. 

In [6]:
r = mg.resolver
nx_bipartite_graph = nx.from_pandas_edgelist(data_df, 'CustomerID', 'StockCode')
customer_ids = data_df['CustomerID']
stock_codes = data_df['StockCode']
bipartite_graph = r.wrappers.BipartiteGraph.NetworkXBipartiteGraph(nx_bipartite_graph, [customer_ids, stock_codes])
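
To make the structure concrete, here is a rough plain-Python sketch of what a customer-product bipartite graph contains. It uses made-up purchase records and does not call networkx or metagraph. It assumes customer IDs are integers and stock codes are strings, which is worth confirming for the real dataframe.

```python
# Toy (customer_id, stock_code) purchase records; values are made up.
purchases = [
    (17850, "85123A"),
    (17850, "71053"),
    (13047, "85123A"),
    (13047, "85123A"),  # repeat purchase of the same product
]

# Each distinct (customer, product) pair becomes one edge.
edges = set(purchases)

# The two sides of the bipartite graph.
customers = {customer for customer, _ in edges}
products = {product for _, product in edges}

# A bipartite graph needs disjoint node sets. An int customer ID and a
# str stock code never compare equal, even if the code is all digits.
sides_are_disjoint = customers.isdisjoint(products)
```

Repeat purchases collapse into a single edge, so the graph records whether a customer bought a product, not how often.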

Next, we'll need to use a graph projection to create a graph of customers linked to other customers who've bought the same product.

In [7]:
customer_similarity_graph = r.algos.bipartite.graph_projection(bipartite_graph, 0)
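
As an illustration of what a projection onto the customer side computes, the following plain-Python sketch links every pair of customers who share at least one product. The data and names are made up, and this is not metagraph's implementation.

```python
from collections import defaultdict
from itertools import combinations

# Toy bipartite edges (customer, product); values are made up.
edges = [
    ("alice", "lantern"),
    ("bob", "lantern"),
    ("bob", "hanger"),
    ("carol", "hanger"),
    ("dave", "bottle"),
]

# Group customers by the product they bought.
buyers_by_product = defaultdict(set)
for customer, product in edges:
    buyers_by_product[product].add(customer)

# Link every pair of customers who share at least one product.
projected_edges = set()
for buyers in buyers_by_product.values():
    for a, b in combinations(sorted(buyers), 2):
        projected_edges.add((a, b))
```

In this toy data, `dave` shares no product with anyone, so he has no edges in the projected graph.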

We now have an unweighted graph of customers. Louvain community detection requires edge weights. A more elegant weighting scheme might be used in practice, but for this tutorial we'll simply assign every edge a weight of 1.

In [8]:
customer_similarity_graph = r.algos.util.graph.assign_uniform_weight(customer_similarity_graph, 1.0)
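
One possible alternative to a uniform weight is to weight each customer pair by the number of products they have in common. The tutorial does not do this; the sketch below is plain Python on made-up data, and how such weights would be passed to metagraph is not covered here.

```python
from collections import Counter
from itertools import combinations

# Toy purchases: product -> set of customers who bought it (made-up data).
buyers_by_product = {
    "lantern": {"alice", "bob"},
    "hanger": {"alice", "bob", "carol"},
    "bottle": {"carol"},
}

# Count, for each customer pair, how many products they have in common.
shared_product_counts = Counter()
for buyers in buyers_by_product.values():
    for pair in combinations(sorted(buyers), 2):
        shared_product_counts[pair] += 1
```

Pairs with more shared products get a larger weight, so a community detection step would tend to keep them together.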

Now we'll use Louvain community detection to find communities of customers based on the products they've purchased.

In [9]:
labels, modularity_score = r.algos.clustering.louvain_community(customer_similarity_graph)
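
Louvain searches for a partition with high modularity, a score that compares the share of edges inside communities with the share expected if edges were placed at random while keeping node degrees fixed. The sketch below computes modularity for an unweighted toy graph in plain Python. It assumes `modularity_score` above is this quantity, which the tutorial does not state explicitly, and the function name is illustrative.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal_edges = defaultdict(int)  # edges with both ends in the community
    degree_sum = defaultdict(int)      # sum of node degrees per community
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal_edges[community_of[u]] += 1
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Two triangles joined by a single bridge edge (toy graph).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_block = {node: "a" for node in range(6)}

score_split = modularity(edges, two_triangles)  # 5/14, about 0.357
score_merged = modularity(edges, one_block)     # 0.0
```

Splitting the graph into its two triangles scores higher than putting every node in one community, which is the kind of improvement Louvain looks for.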

Let's see how many labels we have and what they are.

In [10]:
type(labels)

metagraph.plugins.python.types.PythonNodeMap

In [11]:
type(labels.value)

dict

In [12]:
set(labels.value.values())

{0, 1, 2, 3}
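
Beyond the set of labels, it is often useful to know how many customers fall into each community. Here is a small sketch on a made-up dictionary shaped like `labels.value` (customer ID mapped to label); the counts are illustrative, not results from the dataset.

```python
from collections import Counter

# Toy stand-in for labels.value: customer ID -> community label (made up).
label_by_customer = {17850: 2, 13047: 2, 12583: 0, 14688: 1, 15311: 2, 16098: 1}

# Number of customers assigned to each community.
community_sizes = Counter(label_by_customer.values())

# Communities ordered from largest to smallest.
largest_first = community_sizes.most_common()
```

The same call, `Counter(labels.value.values())`, should work on the real labels, given that `labels.value` is a dict.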

Let's now merge the labels into our dataframe.

In [13]:
data_df['CustomerCommunityLabel'] = data_df.CustomerID.map(lambda customer_id: labels.value[customer_id])
data_df.sample(10)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,CustomerCommunityLabel
105952,545301,21251,DINOSAUR HEIGHT CHART STICKER SET,6,2011-03-01 12:26:00,2.95,12679,France,3
30090,538849,21484,CHICK GREY HOT WATER BOTTLE,2,2010-12-14 13:31:00,3.45,14415,United Kingdom,2
367490,568895,23321,SMALL WHITE HEART OF WICKER,3,2011-09-29 13:11:00,1.65,15356,United Kingdom,3
441827,574655,22605,WOODEN CROQUET GARDEN SET,1,2011-11-06 11:35:00,14.95,16466,United Kingdom,2
7153,536993,21936,RED RETROSPOT PICNIC BAG,1,2010-12-03 15:19:00,2.95,14396,United Kingdom,2
424295,573248,23318,BOX OF 6 MINI VINTAGE CRACKERS,10,2011-10-28 12:09:00,2.49,14498,United Kingdom,1
3137,536602,22411,JUMBO SHOPPER VINTAGE RED PAISLEY,6,2010-12-02 08:34:00,1.65,17850,United Kingdom,2
30247,538853,21677,HEARTS STICKERS,6,2010-12-14 13:35:00,0.85,16805,United Kingdom,1
354297,567873,23273,HEART T-LIGHT HOLDER WILLIE WINKIE,12,2011-09-22 14:25:00,1.65,13055,United Kingdom,0
352970,567709,21915,RED HARMONICA IN BOX,12,2011-09-22 09:53:00,1.25,15239,United Kingdom,1


We now have clusters of customers who've bought similar products and can market to these interests. 
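
A natural next step is to look at which products are most popular within each community. The sketch below shows one way to do that in plain Python on made-up rows; the pairing of descriptions and labels is invented for illustration and does not reflect the actual clusters.

```python
from collections import Counter, defaultdict

# Toy rows of (description, community_label); values are made up.
rows = [
    ("WHITE METAL LANTERN", 0),
    ("WHITE METAL LANTERN", 0),
    ("HEARTS STICKERS", 0),
    ("RED HARMONICA IN BOX", 1),
    ("HEARTS STICKERS", 1),
    ("RED HARMONICA IN BOX", 1),
]

# Tally how often each product appears within each community.
product_counts = defaultdict(Counter)
for description, label in rows:
    product_counts[label][description] += 1

# Most frequently purchased product per community.
top_product = {
    label: counts.most_common(1)[0][0]
    for label, counts in product_counts.items()
}
```

On the real dataframe, the rows could be taken from the `Description` and `CustomerCommunityLabel` columns.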