# Co-Occurrence Recommender

In this notebook, we implement a version of recommendation based on collaborative filtering.

The traditional setting for a collaborative filtering approach consists of a set of user-item interactions, usually in the form of a user rating/score for a given item. The three main components consist of:
1. Positive ratings by users on certain items (what the user likes).
2. Negative ratings by users on certain items (what the user does not like).
3. Missing ratings between users and items (what the user did not see).

Our data set consists **only** of binary interactions: 1 if a user bought an item, 0 if a user did not buy an item. 
- Notice that in our situation we can only infer "user did not buy"; we cannot distinguish between "user did not like" and "user did not see".

This makes a "traditional" collaborative filtering approach hard to apply directly. If we cannot distinguish negative interactions from missing interactions, a collaborative model fit only to the observed purchases (e.g. using low-rank matrix factorization) can fit the data by predicting that every user "buys everything", since nothing tells it what a user likes versus dislikes.

Therefore, in order to use collaborative filtering we must take an alternative perspective. What we do instead is use collaborative filtering along the set of items (rather than the set of users). The idea is to group items by **how often they are bought together**. This allows us to make recommendations on binary data: if a user buys item X, we can query for the items most commonly bought together with item X and make a recommendation based on the results.

More specifically, define the **co-occurrence** between item X and item Y as the number of times a user bought both items X and Y (over some fixed time interval, e.g. a single transaction or a week). We can then organize all the co-occurrences between all items into a matrix which we call the **co-occurrence matrix**.
- Each row and column of the co-occurrence matrix corresponds to an item.
- The entry $a_{ij}$ in row $i$ and column $j$ is the co-occurrence between item $i$ and item $j$.

Then, given an item, say item $i$, we simply locate row $i$ in the matrix, sort the values of the row in descending order, then return the list of items corresponding to the sorted values.
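To make the definition concrete, here is a minimal sketch on made-up data. The users `u1` to `u4`, the items `A` to `D`, and the names `toy`, `toy_co` and `toy_ranked` are all hypothetical and do not come from the data set used below.

```python
import pandas as pd

# Hypothetical toy data: rows are users, columns are items,
# and an entry is 1 if the user bought the item, 0 otherwise.
toy = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 1, 1, 0],
     [1, 0, 0, 1]],
    index=["u1", "u2", "u3", "u4"],
    columns=["A", "B", "C", "D"],
)

# Co-occurrence matrix: entry (i, j) counts the users who bought both item i and item j.
toy_co = toy.T.dot(toy)

# Recommendation for item "A": take its row, drop "A" itself, sort in descending order.
toy_ranked = toy_co.loc["A"].drop("A").sort_values(ascending=False)

print(toy_co)
print(toy_ranked)
```

Items `A` and `B` were bought together by `u1` and `u2`, so `B` comes out as the top suggestion for a buyer of `A`.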

In [1]:
import pandas as pd
import numpy as np

from scipy.sparse import csr_matrix

In [3]:
# load transactions; t_dat stays as 'YYYY-MM-DD' strings, which sort in date order
transactions = pd.read_csv('../data/transactions_train.csv')

In [4]:
articles = pd.read_csv('../data/articles.csv')

articles = articles[['article_id', 'product_code']]

articles.head()

,article_id,product_code
0,108775015,108775
1,108775044,108775
2,108775051,108775
3,110065001,110065
4,110065002,110065


In [5]:
transactions = transactions[['t_dat', 'customer_id', 'article_id']]

In [6]:
transactions.head()

,t_dat,customer_id,article_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004


In [7]:
# keep only the transactions from the current week (both boundary dates are excluded)
current_week = transactions[ (transactions['t_dat'] > '2020-05-09') & (transactions['t_dat']<'2020-05-16')]

# replace article_id with product_code (this groups items which are different colors of the same product)
current_week = current_week.merge(right=articles, how='left', on='article_id')

current_week.drop(columns='article_id', inplace=True)

current_week

,t_dat,customer_id,product_code
0,2020-05-10,0016a52d0c6fd75ecbbfedb0a49308d726350d1dc3c73a...,803986
1,2020-05-10,0016a52d0c6fd75ecbbfedb0a49308d726350d1dc3c73a...,782065
2,2020-05-10,0016a52d0c6fd75ecbbfedb0a49308d726350d1dc3c73a...,700758
3,2020-05-10,0016a52d0c6fd75ecbbfedb0a49308d726350d1dc3c73a...,731160
4,2020-05-10,0016a52d0c6fd75ecbbfedb0a49308d726350d1dc3c73a...,722803
...,...,...,...
230478,2020-05-15,fff887d056fff68679d6e2491588ee4e83e57bf077ca7a...,877279
230479,2020-05-15,ffff4c4e8b57b633c1ddf8fbd53db16b962cf831baf9ed...,429313
230480,2020-05-15,ffff4c4e8b57b633c1ddf8fbd53db16b962cf831baf9ed...,831467
230481,2020-05-15,ffff4c4e8b57b633c1ddf8fbd53db16b962cf831baf9ed...,831467


- So in the current week (May 10 through May 15, 2020, since both filter bounds are strict), we have more than 230,000 transactions.
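Since `t_dat` holds strings of the form `YYYY-MM-DD`, the date filter relies on string comparison agreeing with chronological order. The sketch below checks this on a hypothetical four-row log (`toy_tx` and its values are invented for illustration) and shows that the strict inequalities leave out both boundary dates.

```python
import pandas as pd

# Hypothetical mini transaction log with dates stored as ISO-formatted strings.
toy_tx = pd.DataFrame({
    "t_dat": ["2020-05-09", "2020-05-10", "2020-05-15", "2020-05-16"],
    "product_code": [1, 2, 3, 4],
})

# Strict inequalities on the strings: both endpoints are excluded.
by_string = toy_tx[(toy_tx["t_dat"] > "2020-05-09") & (toy_tx["t_dat"] < "2020-05-16")]

# The same filter applied to parsed timestamps selects the same rows.
parsed = pd.to_datetime(toy_tx["t_dat"])
by_datetime = toy_tx[(parsed > pd.Timestamp("2020-05-09")) & (parsed < pd.Timestamp("2020-05-16"))]

print(by_string["t_dat"].tolist())  # ['2020-05-10', '2020-05-15']
```

This only holds for zero-padded year-month-day strings; other date formats would need to be parsed before comparing.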

In [8]:
current_week.dtypes

t_dat           object
customer_id     object
product_code     int64
dtype: object

In [9]:
current_user_sales = current_week.value_counts(subset='customer_id')

current_user_sales

customer_id
1fd04ce108811ff17fa8c40b4ff8cf8e601b37b2e5fb80f21ae4e2b14b6a529e    76
921f7585bd20e057e2bc6c3beddd3f4e95100a76d355e9bd0bf24c3a4ced35fe    68
c975f806798c6b97fc2e810cb3d08e9428b00b2fbb93e3a055ff468b3d5fecd1    64
e358a03e6cd868b1675121d356de98c158ea6ff7ab5d2ec4d9c71bd3fbef655d    58
bb80fc2300ea1c82b4bceb3120a420a9eb4f1bec70cabb6b522cf0388c115382    58
                                                                    ..
8bef3e3321e71d5f83c709ad07c62de2c96dbc5d89db4f7da6315e675179a266     1
8bebe7c566e7ced533730bf9a58782a9e3f4972b413284dc43424ab028d582c8     1
8be7f16ce8bc821e1de4e3c211751f8a4448927f8ed986362335c081496c8aa5     1
8be1586bff73671e3ea6a51ea6e0d72a2b1079d31f71b4ff20419e1375cfd34c     1
8134588e1b3271668a7497f576289603a54b5ae4d0de91ea1e5fc7fdaf0bdb45     1
Length: 55802, dtype: int64

- In the current week, 55,802 distinct customers used the H&M website for purchases.

Since we are interested in grouping items which are purchased together, it makes sense to drop all users who only bought a single item. This is because they contribute no "co-occurrence" counts for the items in our data set.

In other words: users who only bought a single item don't qualify for "users who bought X also bought Y and Z".
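A minimal sketch of this filter on hypothetical data follows (`toy_week` and its values are invented). It also shows one subtlety: counting transaction rows keeps a customer who bought the same product twice, whereas counting distinct products does not. Which rule is appropriate is a judgment call; the cells below count rows.

```python
import pandas as pd

# Hypothetical purchases: customer "b" bought once, customer "d" bought product 40 twice.
toy_week = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "d", "d"],
    "product_code": [10, 20, 10, 20, 30, 40, 40],
})

grouped = toy_week.groupby("customer_id")["product_code"]

# Rule 1: more than one transaction row per customer.
multi_buyers = toy_week[grouped.transform("size") > 1]

# Rule 2 (stricter): more than one distinct product per customer.
multi_product_buyers = toy_week[grouped.transform("nunique") > 1]

print(sorted(multi_buyers["customer_id"].unique()))          # ['a', 'c', 'd']
print(sorted(multi_product_buyers["customer_id"].unique()))  # ['a', 'c']
```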

In [11]:
current_user_sales[ current_user_sales > 1 ]

customer_id
1fd04ce108811ff17fa8c40b4ff8cf8e601b37b2e5fb80f21ae4e2b14b6a529e    76
921f7585bd20e057e2bc6c3beddd3f4e95100a76d355e9bd0bf24c3a4ced35fe    68
c975f806798c6b97fc2e810cb3d08e9428b00b2fbb93e3a055ff468b3d5fecd1    64
e358a03e6cd868b1675121d356de98c158ea6ff7ab5d2ec4d9c71bd3fbef655d    58
bb80fc2300ea1c82b4bceb3120a420a9eb4f1bec70cabb6b522cf0388c115382    58
                                                                    ..
1ba963052716c9ebb971fa5ecd214ff3f7bdc5b4c06e3cb9e66e772c72f81bb1     2
801d605623c8d80b07596a90b699a01355c9fbf908149bee336d777d9b53e4ae     2
82ad775b2ea4ce0e29767f5e600f8953d9e57bae5073f7f51c1139beec2662fb     2
82217b148cc335c8eab12a281c92a9e72953ce98122bd6ae48de39eb0873530b     2
1cb4a3160af8058627bd12c51b85527d18c2dd68231e23e03224b379865dcab4     2
Length: 43149, dtype: int64

In [12]:
# save the list of users above
current_user_index = set(current_user_sales[ current_user_sales>1 ].index)

len(current_user_index)

43149

In [13]:
# filter out users who do not have multiple purchases
current_week = current_week[ current_week['customer_id'].isin(current_user_index) ] 

current_week

,t_dat,customer_id,product_code
0,2020-05-10,0016a52d0c6fd75ecbbfedb0a49308d726350d1dc3c73a...,803986
1,2020-05-10,0016a52d0c6fd75ecbbfedb0a49308d726350d1dc3c73a...,782065
2,2020-05-10,0016a52d0c6fd75ecbbfedb0a49308d726350d1dc3c73a...,700758
3,2020-05-10,0016a52d0c6fd75ecbbfedb0a49308d726350d1dc3c73a...,731160
4,2020-05-10,0016a52d0c6fd75ecbbfedb0a49308d726350d1dc3c73a...,722803
...,...,...,...
230478,2020-05-15,fff887d056fff68679d6e2491588ee4e83e57bf077ca7a...,877279
230479,2020-05-15,ffff4c4e8b57b633c1ddf8fbd53db16b962cf831baf9ed...,429313
230480,2020-05-15,ffff4c4e8b57b633c1ddf8fbd53db16b962cf831baf9ed...,831467
230481,2020-05-15,ffff4c4e8b57b633c1ddf8fbd53db16b962cf831baf9ed...,831467


To create the co-occurrence matrix, we first build the cross table where:
- The rows are each of the users.
- The columns are the items.
- Each entry is the number of times the user bought that item over the most recent week.
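As a small illustration of what such a cross table looks like, the sketch below applies `pd.crosstab` to a hypothetical five-row purchase log (`toy_week` is invented). Note that entries are counts, so a repeated purchase shows up as a value greater than 1.

```python
import pandas as pd

# Hypothetical purchases: customer "c" bought product 30 twice.
toy_week = pd.DataFrame({
    "customer_id": ["a", "a", "c", "c", "c"],
    "product_code": [10, 20, 20, 30, 30],
})

# Rows are customers, columns are products, entries are purchase counts.
toy_crosstab = pd.crosstab(index=toy_week["customer_id"], columns=toy_week["product_code"])

print(toy_crosstab)
```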

In [14]:
current_crosstab = pd.crosstab(index = current_week['customer_id'], columns=current_week['product_code'])

In [15]:
current_crosstab

product_code,108775,110065,111565,111586,111593,111609,120129,123173,126589,129085,...,910700,911405,911436,912075,915047,915851,916335,916338,916397,921096
customer_id,,,,,,,,,,,,,,,,,,,,,
0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37e011580a479e80aa94,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
00058bbdde20e5d34b5f2c60698b2d6ee5736dfca9f370f6c63f8925112a6056,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
000ae8a03447710b4de81d85698dfc0559258c93136650efc2429fcca80d699a,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
000c886e014a122bd9066501103e3f4a3ec157af27399a5f6fa2dc540c123356,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
000e2691ade6addc974a2d8767def5b5272d1c62883cc04a011658aaa13b2ecb,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fffa2bd08d02a5dac1860a24150c9ea0a07316a06815ab0d91e746d00e5448c4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fffae8eb3a282d8c43c77dd2ca0621703b71e90904dfde2189bdd644f59071dd,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fffd0248a95c2e49fee876ff93598e2e20839e51b9b7678aab75d9e8f9f3c6c8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ffff4c4e8b57b633c1ddf8fbd53db16b962cf831baf9ed67c6a53d86e167a35b,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
current_crosstab.columns

Int64Index([108775, 110065, 111565, 111586, 111593, 111609, 120129, 123173,
            126589, 129085,
            ...
            910700, 911405, 911436, 912075, 915047, 915851, 916335, 916338,
            916397, 921096],
           dtype='int64', name='product_code', length=9830)

Now, to get the co-occurrence matrix from the cross table above, we simply multiply the transpose of the cross table by the cross table itself:
$$\text{Co-Occurrence} = X^T X$$

where $X$ denotes the cross table above.
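To see why this product yields co-occurrences, write $x_{ui}$ for the entry of $X$ in the row of user $u$ and the column of item $i$ (this notation is introduced here for illustration only). Expanding the product entry by entry gives:

$$(X^T X)_{ij} = \sum_{u} (X^T)_{iu}\, x_{uj} = \sum_{u} x_{ui}\, x_{uj}$$

When every $x_{ui}$ is 0 or 1, the term $x_{ui}\, x_{uj}$ equals 1 exactly when user $u$ bought both items, so the sum counts the users who bought item $i$ and item $j$ together. On the diagonal, $(X^T X)_{ii} = \sum_{u} x_{ui}^2$.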

Since the cross table is rather large (43,149 customers by 9,830 products), dense matrix multiplication would be inefficient here. However, most of the entries of the cross table are 0, so we can store it as a **sparse matrix** and multiply more efficiently using this sparse storage structure.
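As a possible alternative, the sparse matrix can be assembled straight from the transaction rows, without materializing a dense cross table first. The sketch below is one way to do this under assumptions: `toy_week` is hypothetical data, and `pd.factorize` is used to turn customer and product labels into row and column positions. This notebook itself converts the dense cross table instead.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical purchases: customer "c" bought product 30 twice.
toy_week = pd.DataFrame({
    "customer_id": ["a", "a", "c", "c", "c"],
    "product_code": [10, 20, 20, 30, 30],
})

# Map each customer and each product to a consecutive integer position.
row_pos, customers = pd.factorize(toy_week["customer_id"], sort=True)
col_pos, products = pd.factorize(toy_week["product_code"], sort=True)

# One entry of 1 per transaction row; repeated (row, column) pairs are summed
# when scipy builds the matrix, which reproduces the purchase counts.
ones = np.ones(len(toy_week), dtype=np.int64)
toy_sparse = csr_matrix((ones, (row_pos, col_pos)), shape=(len(customers), len(products)))

# Sparse product X^T X, converted to a dense array only at the end.
toy_co_sparse = (toy_sparse.T @ toy_sparse).toarray()

print(toy_sparse.toarray())
print(toy_co_sparse)
```

Here `customers` and `products` hold the labels in the same order as the matrix rows and columns, so they can later serve as the index and columns of a labelled DataFrame.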

In [17]:
# store cross tab as sparse matrix
current_crosstab_sparse = csr_matrix(current_crosstab)

current_crosstab_sparse

<43149x9830 sparse matrix of type '<class 'numpy.int64'>'
	with 173050 stored elements in Compressed Sparse Row format>

In [18]:
# get co-occurrence via matrix multiplication (of sparse matrices)
co_occurrence = current_crosstab_sparse.transpose().dot(current_crosstab_sparse)

# convert the sparse product to a dense array so it can be wrapped in a labelled DataFrame
co_occurrence = co_occurrence.toarray()

co_occurrence = pd.DataFrame(co_occurrence, index = current_crosstab.columns, columns = current_crosstab.columns)

In [19]:
co_occurrence

product_code,108775,110065,111565,111586,111593,111609,120129,123173,126589,129085,...,910700,911405,911436,912075,915047,915851,916335,916338,916397,921096
product_code,,,,,,,,,,,,,,,,,,,,,
108775,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
110065,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
111565,0,0,48,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
111586,0,0,0,109,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
111593,0,0,0,0,44,0,0,4,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
915851,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,2,0,0,0,0
916335,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,9,1,0,0
916338,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,9,0,0
916397,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,3,0


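- This gives us the co-occurrence matrix: each off-diagonal entry $a_{ij}$ measures how often item $i$ was purchased together with item $j$. Because the cross table stores purchase counts, a customer who bought $m$ units of item $i$ and $n$ units of item $j$ contributes $mn$ to $a_{ij}$; the entry equals the number of customers who bought both only when no customer buys the same item more than once. The diagonal entry $a_{ii}$ pairs item $i$ with itself, so it is not a recommendation signal.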
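Because the cross table holds purchase counts rather than 0/1 flags, repeated purchases are weighted more heavily in the product. If one prefers each customer to contribute at most 1 to each pair, the matrix can be binarized before multiplying. The following is a sketch of that variant on a hypothetical 2-by-3 count matrix; it is not what the cells above do.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical counts: 2 customers by 3 products; the second customer bought the last product twice.
toy_counts = csr_matrix(np.array([[1, 1, 0],
                                  [0, 1, 2]]))

# Count-weighted co-occurrence, i.e. X^T X on the raw counts.
weighted = (toy_counts.T @ toy_counts).toarray()

# Binarize: every positive count becomes 1, so a customer adds at most 1 per item pair.
toy_binary = (toy_counts > 0).astype(np.int64)
by_customer = (toy_binary.T @ toy_binary).toarray()

print(weighted)
print(by_customer)
```

In the binarized version, the entry for the last two products is 1 (one customer bought both) instead of 2, and each diagonal entry is the number of customers who bought that product.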

From here we can make a recommendation as follows: suppose a user has just purchased item ```111586```. We locate the corresponding row in the table (equivalently the column, since the matrix is symmetric) and sort the values to see which other items are most often purchased with item ```111586```.

In [20]:
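# the first sorted entry is item 111586 itself (its diagonal count, 109, is the largest in its row), so skip it
co_occurrence[111586].sort_values(ascending=False).iloc[1:6]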

product_code
158340    8
861848    6
832359    6
860217    6
827968    4
Name: 111586, dtype: int64

- From this information, we can recommend the items with code ```158340```, ```861848```, ```832359```, etc. to the user.
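The lookup can be wrapped in a small helper. The function below is a hypothetical sketch (the name `top_co_purchased` and the toy matrix are not part of this notebook or the web app): it removes the queried item explicitly instead of assuming it sorts first, and it leaves out items that were never bought together with it.

```python
import pandas as pd

def top_co_purchased(co_occurrence: pd.DataFrame, product_code, k: int = 5) -> pd.Series:
    """Return up to k items with the highest co-occurrence with product_code."""
    row = co_occurrence.loc[product_code]
    # Remove the queried item itself rather than relying on it sorting first.
    row = row.drop(product_code)
    # Keep only items that were bought together with it at least once.
    row = row[row > 0]
    return row.sort_values(ascending=False).head(k)

# Hypothetical 4-item co-occurrence matrix for demonstration.
toy_co = pd.DataFrame(
    [[3, 2, 1, 0],
     [2, 3, 0, 0],
     [1, 0, 2, 0],
     [0, 0, 0, 1]],
    index=[10, 20, 30, 40],
    columns=[10, 20, 30, 40],
)

toy_top = top_co_purchased(toy_co, 10, k=2)
print(toy_top)
```

For an item with no co-purchases (item `40` in the toy matrix) the helper returns an empty Series, which a caller could use as a cue to fall back on some other recommendation rule.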

We deploy a working implementation of such a recommender on our web app, where we can query any of the items in the matrix above by simply providing its product code.

In [64]:
co_occurrence.to_csv('../data/current-week-co-occurrence.csv')
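One thing to watch when loading this file back: `pd.read_csv` parses the header labels as strings, while the index values in the first column are parsed as integers. The sketch below illustrates this on a hypothetical 2-by-2 matrix written to an in-memory buffer (the buffer stands in for the CSV file) and converts the column labels back to integers so that lookups by product code work on both axes.

```python
import io
import pandas as pd

# Hypothetical 2-item co-occurrence matrix with integer product codes as labels.
toy_co = pd.DataFrame([[3, 2], [2, 3]], index=[10, 20], columns=[10, 20])
toy_co.index.name = "product_code"

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_co.to_csv(buffer)
buffer.seek(0)

reloaded = pd.read_csv(buffer, index_col=0)
raw_columns = list(reloaded.columns)  # header labels come back as strings

# Restore integer column labels so rows and columns use the same keys.
reloaded.columns = reloaded.columns.astype(int)

print(raw_columns)
print(reloaded.loc[10, 20])
```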