**Context**

E-commerce datasets are typically proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made this dataset, which contains actual transactions from 2010 and 2011, publicly available. The dataset is maintained on their site, where it can be found under the title "Online Retail".

**Content**

"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."

**Reference**

https://jessesw.com/Rec-System/

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import os

In [2]:
os.getcwd()
os.chdir(r"D:\Kaggle\E-Commerce")  # raw string so the backslashes are not treated as escape sequences

In [3]:
# need to take all of the transactions for each customer and put these into ALS format
# need unique CustomerID (rows) and unique ItemID (columns) of the matrix
# values of the matrix should be the total number of purchases for each item by each customer
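
To make the target layout concrete, here is a minimal sketch on a made-up toy table (the values and the names `toy` and `toy_matrix` are illustrative assumptions, not taken from the dataset). It pivots transactions into a customer-by-item matrix whose entries are total purchase quantities.

```python
import pandas as pd

# Toy transactions (made-up values, not from the Online Retail data)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "StockCode": ["A", "A", "A", "B", "B"],
    "Quantity": [2, 3, 1, 4, 5],
})

# Rows = customers, columns = items, values = total quantity purchased
toy_matrix = toy.pivot_table(index="CustomerID", columns="StockCode",
                             values="Quantity", aggfunc="sum", fill_value=0)
print(toy_matrix)
```

A dense pivot like this is only practical for small data; with thousands of customers and items, most entries are zero, which is why the notebook builds a sparse matrix instead.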

# Data Pre-Processing

In [4]:
df = pd.read_csv("data.csv", encoding="cp1252")
df.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [5]:
df.info()
# CustomerID has some NA values → we don't know who bought those items
# NA values in the Description column are fine for now (unless planning to use Word2Vec)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null object
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [6]:
# dropping rows with a missing CustomerID → every remaining purchase can be matched to a specific customer
df = df.loc[df["CustomerID"].notnull()].copy()  # .copy() so later column assignments act on an independent frame
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 406829 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      406829 non-null object
StockCode      406829 non-null object
Description    406829 non-null object
Quantity       406829 non-null int64
InvoiceDate    406829 non-null object
UnitPrice      406829 non-null float64
CustomerID     406829 non-null float64
Country        406829 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 27.9+ MB


In [7]:
# unique item list
item_list = df[["StockCode", "Description"]].drop_duplicates()
# encode "StockCode" as string for future use
item_list["StockCode"] = item_list["StockCode"].astype(str)
item_list.head()

,StockCode,Description
0,85123A,WHITE HANGING HEART T-LIGHT HOLDER
1,71053,WHITE METAL LANTERN
2,84406B,CREAM CUPID HEARTS COAT HANGER
3,84029G,KNITTED UNION FLAG HOT WATER BOTTLE
4,84029E,RED WOOLLY HOTTIE WHITE HEART.


In [8]:
# group purchase quantities together by CustomerID and StockCode
# change any sums that equal zero to one (the product was purchased, then returned)
# only include customers with a positive purchase total
# set up a sparse matrix (to avoid memory issues)

In [9]:
# creating sparse matrix
df["CustomerID"] = df["CustomerID"].astype(int)
df = df[["StockCode", "Quantity", "CustomerID"]]
# total quantity of each item bought by each user (so CustomerID values repeat across rows)
group = df.groupby(["CustomerID", "StockCode"]).sum().reset_index()
# replace net purchases of 0 with 1 (single .loc call avoids the chained-assignment warning)
group.loc[group["Quantity"] == 0, "Quantity"] = 1
# keep only customer-item pairs with a positive purchase total
purchased = group[group["Quantity"] > 0]
purchased.head()


,CustomerID,StockCode,Quantity
0,12346,23166,1
1,12347,16008,24
2,12347,17021,36
3,12347,20665,6
4,12347,20719,40
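
The effect of the zero-to-one replacement and the positive-total filter is easiest to see on a tiny example. The sketch below uses made-up rows (the names `toy`, `totals`, and `kept` are assumptions for illustration), where a negative `Quantity` stands for a return.

```python
import pandas as pd

# Made-up transactions; negative Quantity = returned items
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "StockCode": ["A", "A", "B", "B", "C"],
    "Quantity": [5, -5, 2, -3, 4],
})

# Net quantity per (customer, item) pair
totals = toy.groupby(["CustomerID", "StockCode"], as_index=False)["Quantity"].sum()

# Bought and fully returned (net 0) still counts as one interaction
totals.loc[totals["Quantity"] == 0, "Quantity"] = 1

# Pairs with a negative net total are dropped
kept = totals[totals["Quantity"] > 0]
print(kept)
```

Customer 1 bought and fully returned item A, so the pair is kept with quantity 1. Customer 2 returned more of item B than was purchased in this window, so that pair is dropped.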


In [10]:
# in place of explicit ratings, purchase quantity can serve as a "confidence" in how strong each user-item interaction was
# items purchased in larger quantities → carry more weight in the ratings

# Create Sparse Matrix

In [11]:
# getting unique customers, products & total purchases
customers = list(np.sort(purchased["CustomerID"].unique()))
products = list(purchased["StockCode"].unique())
quantity = list(purchased["Quantity"])

In [12]:
# getting associated row and column indices (assigning labels)
# the categories= keyword of astype("category", ...) is deprecated and rejected by later pandas versions; pass a CategoricalDtype instead
rows = purchased["CustomerID"].astype(pd.api.types.CategoricalDtype(categories=customers)).cat.codes
columns = purchased["StockCode"].astype(pd.api.types.CategoricalDtype(categories=products)).cat.codes
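
As a small illustration of how category codes turn labels into matrix indices, the sketch below uses made-up labels (the names `labels`, `categories`, and `codes` are assumptions). Each label is replaced by its position in the category list.

```python
import pandas as pd

labels = pd.Series(["b", "a", "c", "a", "z"])
categories = ["a", "b", "c"]  # position in this list becomes the integer code

dtype = pd.api.types.CategoricalDtype(categories=categories)
codes = labels.astype(dtype).cat.codes
print(codes.tolist())  # [1, 0, 2, 0, -1]
```

A label missing from the category list (here `"z"`) gets the code -1. This is why the customer and product lists should be built from the same frame that is being encoded.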


In [13]:
import scipy.sparse as sparse
from scipy.sparse.linalg import spsolve

In [14]:
# create sparse matrix
# note: this rebinds the name `sparse` from the scipy.sparse module to the matrix itself
sparse = sparse.csr_matrix((quantity, (rows, columns)), shape=(len(customers), len(products)))

In [15]:
# 4338 customers & 3664 items
sparse

<4338x3664 sparse matrix of type '<class 'numpy.int32'>'
	with 266723 stored elements in Compressed Sparse Row format>
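
For reference, this is how the `(values, (row_indices, column_indices))` constructor form lays out data, shown on a made-up 3x2 example (all names and numbers below are illustrative assumptions).

```python
from scipy.sparse import csr_matrix

values = [5, 1, 4, 5]    # purchase quantities
row_idx = [0, 1, 1, 2]   # customer index for each value
col_idx = [0, 0, 1, 1]   # item index for each value

toy_sparse = csr_matrix((values, (row_idx, col_idx)), shape=(3, 2))
dense = toy_sparse.toarray()
print(dense)
```

Only the four listed entries are stored. Every other cell is an implicit zero, which is what keeps the full customer-by-item matrix small in memory.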

## Sparsity

In [16]:
# number of possible interactions in the matrix
matrix_size = sparse.shape[0]*sparse.shape[1]
# number of user-item pairs with at least one purchase
num_purchases = len(sparse.nonzero()[0])
sparsity = 100 * (1 - (num_purchases / matrix_size))
sparsity

98.32190920694744

In [17]:
# rule of thumb: collaborative filtering can tolerate a sparsity of up to about 99.5 percent, so 98.3 percent is workable

# Train Test Split

In [18]:
# test set = exact copy of the original data (to compare with train)
# training set = original data with a random percentage of user/item interactions masked (as if the user had never purchased)
# masking = setting the selected entries of the training matrix to 0
# evaluation: check in the test set whether the items recommended to a user were actually purchased
# also check whether the model does better than recommending the most popular items to every user → popularity vs. preference
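
One way to make the "popularity vs. preference" comparison concrete is to score both rankings with AUC for a single user. The sketch below is an assumption about how such a check could look, not the notebook's own evaluation code. The helper `auc_score` and all numbers are hypothetical, and AUC is computed directly as the fraction of (purchased, not purchased) item pairs that the scores rank in the right order.

```python
import numpy as np

def auc_score(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]  # every positive compared with every negative
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Hypothetical data for one user over five items that were zero in training
hidden = [1, 0, 0, 1, 0]                  # 1 = masked in train but purchased in test
model_scores = [0.9, 0.2, 0.4, 0.7, 0.1]  # made-up model predictions
popularity = [10, 50, 5, 8, 20]           # made-up total purchases per item

model_auc = auc_score(model_scores, hidden)
popularity_auc = auc_score(popularity, hidden)
print(model_auc, popularity_auc)
```

In this toy case the model ranks both hidden purchases above every other item, while the popularity ranking gets only two of the six pairs right. Averaging such per-user scores over the users with masked items would give an overall comparison.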

In [19]:
import random

In [20]:
# zip pairs elements position by position and stops at the end of the shorter input
a = range(5)
b = [1, 4, 5, 6, 7, 10, 12]
list(zip(a, b))

[(0, 1), (1, 4), (2, 5), (3, 6), (4, 7)]

In [21]:
# make_train is a user-defined function for the train/test split
# ratings = the sparse matrix created above
# pct_test = fraction of user-item interactions that will be masked
# user_inds = user rows that were altered in the training data (used to compare with test) → check performance with AUC
# returns: train set, test set (binarized to purchased/not → 1/0), list of users with at least one item masked (performance is tested on these users)


In [27]:
def make_train(ratings, pct_test=.2):
    test = ratings.copy()
    test[test != 0] = 1 # binarize: purchased = 1, not purchased = 0
    train = ratings.copy()
    nonzero_inds = train.nonzero() # indices of entries that have an interaction
    nonzero_pairs = list(zip(nonzero_inds[0], nonzero_inds[1])) # pair user and item indices into a list
    random.seed(34) # reproducibility
    num_samples = int(np.ceil(pct_test * len(nonzero_pairs))) # number of pairs to mask, rounded up
    samples = random.sample(nonzero_pairs, num_samples) # randomly draw num_samples pairs (without replacement) from nonzero_pairs
    user_inds = [index[0] for index in samples] # user row indices
    item_inds = [index[1] for index in samples] # item column indices
    train[user_inds, item_inds] = 0 # set the randomly chosen user-item pairs to 0
    train.eliminate_zeros() # drop explicitly stored zeros to save memory
    return train, test, list(set(user_inds)) # unique user rows that had at least one item masked

In [28]:
train, test, alt_users = make_train(sparse, pct_test=.2)
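
The masking step can be checked on a tiny matrix. The following is a self-contained sketch that mirrors the same idea on made-up data (the names `toy`, `masked`, `toy_train`, `toy_test`, and the 40% mask rate are assumptions for illustration). It does not call `make_train` itself.

```python
import random

import numpy as np
from scipy.sparse import csr_matrix

# Toy user x item purchase counts (made-up values)
toy = csr_matrix(np.array([[3, 0, 1],
                           [0, 2, 0],
                           [4, 0, 5]]))

pairs = list(zip(*toy.nonzero()))          # the 5 observed (user, item) pairs
random.seed(0)                             # reproducibility
n_mask = int(np.ceil(0.4 * len(pairs)))    # mask 40% of them, rounded up
masked = random.sample(pairs, n_mask)

toy_train = toy.copy()
toy_train[[u for u, _ in masked], [i for _, i in masked]] = 0
toy_train.eliminate_zeros()                # drop the explicitly stored zeros

toy_test = toy.copy()
toy_test.data[:] = 1                       # binarize: purchased = 1

# Interactions present in test but absent from train are the masked ones
hidden = toy_test.toarray() - (toy_train.toarray() != 0).astype(int)
print(int(hidden.sum()), "interactions hidden from training")
```

The count of hidden interactions equals the number of sampled pairs, and the binarized test matrix still contains every original purchase.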

# ALS for Implicit Feedback

In [None]:
# turn the sparse purchase matrix into a confidence matrix
# C_ui = 1 + (alpha * r_ui)
# C_ui = confidence that user u has a preference for item i
# r_ui = original matrix of purchases (quantity of item i bought by user u)
# alpha = linear scaling of the rating preferences (number of purchases); around 40 is a decent starting point
## decreasing alpha decreases the variability in confidence between ratings
# minimize the cost function with respect to the user vectors U (set the derivative to zero), then with respect to the item vectors I
# the two update equations alternate back and forth until they "converge"
# iterations = number of times to alternate between updating the user and item feature vectors (more → better convergence, but more computation)
# lambda = regularization (prevents overfitting during training) → increasing it increases bias and decreases variance as a tradeoff
# rank_size = number of latent features in the user/item feature vectors (increasing it reduces bias but may lead to overfitting)
# output = feature vectors for users and items (their dot product gives the predicted "rating" for each point in the original matrix)
# here the implicit package is used to apply ALS quickly
# TODO: write up notes on ALS in the git repo later
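
To show what the alternating updates look like, here is a minimal dense NumPy sketch of implicit-feedback ALS in its usual textbook form. In that form the user vector is `x_u = (Yᵀ Cᵘ Y + λI)⁻¹ Yᵀ Cᵘ p(u)`, with binary preference `p_ui = 1` when `r_ui > 0`. This is an illustration under those assumptions on a made-up 3x3 matrix, not the implementation used by the `implicit` package. The helper names `solve_side` and `loss` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy purchase counts r_ui (made-up values)
R = np.array([[3, 0, 1],
              [0, 2, 0],
              [4, 0, 5]], dtype=float)
alpha, lam, rank = 15, 0.1, 2

C = 1 + alpha * R            # confidence C_ui = 1 + alpha * r_ui
P = (R > 0).astype(float)    # binary preference p_ui

X = rng.normal(scale=0.1, size=(R.shape[0], rank))  # user feature vectors
Y = rng.normal(scale=0.1, size=(R.shape[1], rank))  # item feature vectors

def solve_side(C, P, fixed, lam):
    """Closed-form update of one side while the other side (`fixed`) is held constant."""
    out = np.empty((C.shape[0], fixed.shape[1]))
    for u in range(C.shape[0]):
        Cu = np.diag(C[u])                                   # confidences of row u
        A = fixed.T @ Cu @ fixed + lam * np.eye(fixed.shape[1])
        b = fixed.T @ Cu @ P[u]
        out[u] = np.linalg.solve(A, b)
    return out

def loss(X, Y):
    """Confidence-weighted squared error plus L2 regularization."""
    return (C * (P - X @ Y.T) ** 2).sum() + lam * ((X ** 2).sum() + (Y ** 2).sum())

before = loss(X, Y)
for _ in range(10):                      # iterations
    X = solve_side(C, P, Y, lam)         # update users with items fixed
    Y = solve_side(C.T, P.T, X, lam)     # update items with users fixed
after = loss(X, Y)
print(before, after)
```

Each half-step exactly minimizes the cost with respect to one side, so the loss cannot increase from one iteration to the next. Real implementations avoid building the dense diagonal matrices and exploit sparsity instead.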

In [32]:
# conda install -c conda-forge implicit
import implicit
alpha = 15
user_vecs, item_vecs = implicit.alternating_least_squares((train * alpha).astype("double"),
                                                          factors=20,
                                                          regularization=.1,
                                                          iterations=50)

This method is deprecated. Please use the AlternatingLeastSquares class instead
100%|█████████████████████████████████████████████████████| 50.0/50 [00:01<00:00, 25.50it/s]


In [35]:
user_vecs

array([[ 0.00269402,  0.00122115, -0.00077234, ..., -0.00196641,
        -0.0005734 ,  0.00111094],
       [-0.00447134, -0.01529366,  0.00719723, ...,  0.00397395,
         0.00776083, -0.01365247],
       [ 0.02053541,  0.00817309, -0.00174476, ...,  0.01550859,
         0.03371258, -0.03715759],
       ...,
       [ 0.01383563,  0.03325851, -0.00902754, ..., -0.00914942,
         0.02554496, -0.02158495],
       [ 0.01596301,  0.01427091,  0.02349794, ...,  0.02163933,
         0.00669048,  0.00326324],
       [-0.0402585 ,  0.00766781,  0.05628044, ...,  0.0151436 ,
         0.00146286, -0.04000969]], dtype=float32)
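
Once user and item feature vectors are available, a predicted preference is the dot product of a user vector with an item vector. The sketch below shows this on small made-up vectors (the `_toy` arrays and `already_bought` mask are assumptions, not values from the fitted model). It picks the highest-scoring item each user has not yet purchased.

```python
import numpy as np

# Made-up feature vectors: 2 users, 3 items, 2 latent features
user_vecs_toy = np.array([[1.0, 0.0],
                          [0.0, 1.0]])
item_vecs_toy = np.array([[0.9, 0.1],
                          [0.2, 0.8],
                          [0.5, 0.5]])

# True where the user already purchased the item in the training data
already_bought = np.array([[True, False, False],
                           [False, False, False]])

scores = user_vecs_toy @ item_vecs_toy.T   # shape (n_users, n_items)
scores[already_bought] = -np.inf           # never recommend an item already bought
top_item = scores.argmax(axis=1)           # best remaining item per user
print(top_item)
```

For the first user, item 0 has the highest raw score but is excluded as already purchased, so the recommendation falls to item 2.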