**Context**

Typically, e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made this dataset, which contains actual transactions from 2010 and 2011, publicly available. The dataset is maintained on their site, where it can be found under the title "Online Retail".

**Content**

"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."

**Reference**

https://jessesw.com/Rec-System/

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import os

In [3]:
os.getcwd()
os.chdir(r"D:\Kaggle\E-Commerce")  # raw string so backslashes are not read as escape sequences

In [4]:
# need to take all of the transactions for each customer and put these into ALS format
# need unique customer ID (rows) and unique item ID (columns) of the matrix
# values of the matrix should be the total number of purchases for each item by each customer
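
As a small illustration of the target layout, the sketch below builds a customer-by-item quantity table from a few made-up transactions. The toy values and the names `toy` and `toy_matrix` are assumptions for illustration only, and `pivot_table` is used here just to display the shape; it is not the sparse construction used later in this notebook.

```python
import pandas as pd

# made-up transactions (not taken from the dataset)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "StockCode": ["A", "B", "A", "A", "C"],
    "Quantity": [2, 1, 3, 4, 5],
})

# rows = customers, columns = items, values = total quantity purchased
toy_matrix = toy.pivot_table(index="CustomerID", columns="StockCode",
                             values="Quantity", aggfunc="sum", fill_value=0)
print(toy_matrix)
```

Customer 2 bought item "A" twice, so that cell holds the summed quantity; pairs with no purchase are filled with 0.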

# Data Pre-Processing

In [5]:
df = pd.read_csv("data.csv", encoding="cp1252")
df.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [6]:
df.info()
# CustomerID has some NA values → don't know who bought the item
# it's okay for now that the Description column contains NA (unless planning to use Word2Vec)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null object
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [7]:
# dropping rows with a missing CustomerID → now can match all purchases to specific customers
df = df.loc[df["CustomerID"].notnull()].copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 406829 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      406829 non-null object
StockCode      406829 non-null object
Description    406829 non-null object
Quantity       406829 non-null int64
InvoiceDate    406829 non-null object
UnitPrice      406829 non-null float64
CustomerID     406829 non-null float64
Country        406829 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 27.9+ MB


In [9]:
# unique item list
item_list = df[["StockCode", "Description"]].drop_duplicates()
# encode "StockCode" as string for future use
item_list["StockCode"] = item_list["StockCode"].astype(str)
item_list.head()

,StockCode,Description
0,85123A,WHITE HANGING HEART T-LIGHT HOLDER
1,71053,WHITE METAL LANTERN
2,84406B,CREAM CUPID HEARTS COAT HANGER
3,84029G,KNITTED UNION FLAG HOT WATER BOTTLE
4,84029E,RED WOOLLY HOTTIE WHITE HEART.


In [None]:
# group purchase quantities together by CustomerID and StockCode
# change any sums that equal zero to one (a zero total means the product was returned after purchase)
# only include customer/item pairs with a positive purchase total
# build a sparse matrix (to avoid memory issues)

In [24]:
# creating sparse matrix
df["CustomerID"] = df["CustomerID"].astype(int)
df = df[["StockCode", "Quantity", "CustomerID"]]
# total quantity of each item bought by each customer (so CustomerID repeats across rows)
group = df.groupby(["CustomerID", "StockCode"]).sum().reset_index()
# replace totals of 0 with 1
group.loc[group["Quantity"] == 0, "Quantity"] = 1
# keep only customer/item pairs with a positive purchase total
purchased = group[group["Quantity"] > 0]
purchased.head()

,CustomerID,StockCode,Quantity
0,12346,23166,1
1,12347,16008,24
2,12347,17021,36
3,12347,20665,6
4,12347,20719,40


In [None]:
# instead of an explicit rating, purchase quantity can represent the "confidence" in how strong the interaction was
# items purchased in larger quantities → carry more weight
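
One common way to turn quantities into confidence weights for implicit-feedback ALS is the linear form c = 1 + alpha * r. The sketch below applies it to a toy array. The value `alpha = 15` and the names `toy_quantity` and `confidence` are assumptions for illustration, not settings taken from this notebook.

```python
import numpy as np

alpha = 15  # hypothetical scaling factor; would normally be tuned
toy_quantity = np.array([0, 1, 6, 24])  # made-up purchase totals

# linear confidence: pairs with no purchase get 1, larger quantities get more weight
confidence = 1 + alpha * toy_quantity
print(confidence)
```

The constant 1 keeps a minimal confidence for unobserved pairs, so every cell of the matrix still contributes to the fit.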

# Create Sparse Matrix

In [27]:
# getting unique customers, products & total purchases
customers = list(np.sort(purchased["CustomerID"].unique()))
products = list(purchased["StockCode"].unique())
quantity = list(purchased["Quantity"])

In [33]:
# getting associated row and column indices (assigning labels)
rows = purchased["CustomerID"].astype(pd.CategoricalDtype(categories=customers)).cat.codes
columns = purchased["StockCode"].astype(pd.CategoricalDtype(categories=products)).cat.codes
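
To make the label-to-index step concrete, here is a minimal sketch on made-up item codes showing how category codes follow the position of each label in the supplied list. The names `labels`, `order`, and `codes` are illustrative only.

```python
import pandas as pd

labels = pd.Series(["B", "A", "C", "A"])  # made-up item codes
order = ["A", "B", "C"]                    # position in this list = matrix index

# each label is replaced by its position in `order`
codes = labels.astype(pd.CategoricalDtype(categories=order)).cat.codes
print(codes.tolist())
```

A label that is missing from the category list is encoded as -1, so it is worth checking that no code is negative before using the codes as matrix indices.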

In [34]:
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

In [37]:
# create sparse matrix (module imported as `sp` so the variable `sparse` does not shadow it)
sparse = sp.csr_matrix((quantity, (rows, columns)), shape=(len(customers), len(products)))

In [38]:
# 4338 customers & 3664 items
sparse

<4338x3664 sparse matrix of type '<class 'numpy.int32'>'
	with 266723 stored elements in Compressed Sparse Row format>
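
Once the matrix exists, the row and column labels are needed to translate indices back into IDs. The sketch below shows one way to read a customer's purchased item codes out of a CSR matrix. The toy labels, the toy matrix, and the helper `purchased_items` are hypothetical and only illustrate the lookup.

```python
import numpy as np
import scipy.sparse as sp

# made-up labels and matrix (rows = customers, columns = items)
toy_customers = [101, 102]
toy_products = np.array(["A", "B", "C"])
toy_matrix = sp.csr_matrix(([2, 5, 1], ([0, 0, 1], [0, 2, 1])), shape=(2, 3))

def purchased_items(customer_id):
    """Return the item codes with a stored quantity for one customer."""
    row = toy_customers.index(customer_id)  # label -> row index
    col_idx = toy_matrix[row].indices       # column indices stored in that row
    return toy_products[col_idx].tolist()   # column index -> item code

print(purchased_items(101))
```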

## Sparsity

In [43]:
# number of possible interactions in the matrix
matrix_size = sparse.shape[0]*sparse.shape[1]
# number of customer/item pairs with a recorded purchase
num_purchases = len(sparse.nonzero()[0])
sparsity = 100 * (1 - (num_purchases / matrix_size))
sparsity

98.32190920694744

In [None]:
# rule of thumb for collaborative filtering: sparsity should stay below about 99.5 percent (98.3 percent here is fine)

# Train Test Split

In [1]:
# test set = exact copy of the original data
# training set = copy with a random percentage of user/item interactions masked (as if the user never purchased them)
# evaluation: check in the test set whether items recommended to a user were actually purchased
# baseline: compare against recommending the most popular items to every user → popularity vs. preference
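
The masking idea described above can be sketched as follows. This is a minimal version under stated assumptions: the function name `make_train`, the `pct_test` and `seed` defaults, and the toy matrix are illustrative and not taken from this notebook.

```python
import numpy as np
import scipy.sparse as sp

def make_train(ratings, pct_test=0.2, seed=0):
    """Mask a random share of the nonzero user/item pairs.

    Returns (training_set, test_set, masked_pairs).
    """
    test_set = ratings.copy()       # test set = untouched copy
    training_set = ratings.copy()   # training set = copy to be masked
    rows, cols = ratings.nonzero()  # every observed interaction
    rng = np.random.default_rng(seed)
    n_masked = int(np.ceil(pct_test * len(rows)))
    picked = rng.choice(len(rows), size=n_masked, replace=False)
    # zero out the chosen pairs, as if the purchases never happened
    training_set[rows[picked], cols[picked]] = 0
    training_set.eliminate_zeros()  # drop the explicit zeros from storage
    masked_pairs = list(zip(rows[picked].tolist(), cols[picked].tolist()))
    return training_set, test_set, masked_pairs

# made-up 3x3 matrix with five observed interactions
toy = sp.csr_matrix(np.array([[1, 0, 2], [0, 3, 0], [4, 0, 5]]))
train, test, masked = make_train(toy, pct_test=0.4)
print(train.nnz, test.nnz, len(masked))
```

Returning the masked pairs makes it possible to evaluate only on users who had at least one interaction hidden during training.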