# **Report 1: Analysing Matrix Monitoring Methods In Accelerating Frequent Pattern Recognition Algorithms**

### **Professor:** Dr. Ghatee

### **Head TA:** Behnam Yousefimehr

### **Author:** Hassan Hajizadeh

## **Step 1: Downloading, Loading And Preprocessing The Data From The UCI Website (https://archive.ics.uci.edu/dataset/352/online+retail)**

The dataset has already been downloaded. To load the file, we install the pandas and openpyxl libraries (openpyxl is needed to read .xlsx files).

In [1]:
!pip install pandas
!pip install openpyxl



### **Loading Data**

In [2]:
import pandas as pd

df = pd.read_excel('Online Retail.xlsx')

### **Cleaning Data**

The dataset info below shows that the Description and CustomerID columns contain missing values: both have fewer non-null entries than the 541909 rows in the dataset.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


#### **Dropping Records With Empty Cells**

In [4]:
df.dropna(inplace=True)
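
As a minimal sketch of what this step does, the example below counts missing values per column and then drops incomplete rows. The `toy` DataFrame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the retail data (hypothetical values, for illustration only).
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3"],
    "Description": ["MUG", None, "LAMP", "CLOCK"],
    "CustomerID": [17850.0, 17850.0, np.nan, 12680.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# By default, dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(f"rows dropped: {rows_dropped}")
```

Comparing the row count before and after `dropna()` is a quick way to see how much data the cleaning step removes.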

#### **Checking And Removing The Negative Data**

In [5]:
(df[["Quantity","UnitPrice","CustomerID"]] < 0).any().any()

np.True_

In [6]:
df[df[["Quantity","UnitPrice","CustomerID"]] < 0].stack()

141     Quantity    -1.0
154     Quantity    -1.0
235     Quantity   -12.0
236     Quantity   -24.0
237     Quantity   -24.0
                    ... 
540449  Quantity   -11.0
541541  Quantity    -1.0
541715  Quantity    -5.0
541716  Quantity    -1.0
541717  Quantity    -5.0
Length: 8905, dtype: object

In [7]:
df.drop(index=df[~(df[["Quantity","UnitPrice","CustomerID"]] >= 0).all(axis=1)].index, inplace=True)


In [8]:
(df[["Quantity","UnitPrice","CustomerID"]] < 0).any().any()

np.False_
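
The same filter can also be written as a boolean mask that keeps only the valid rows. The sketch below shows this on a made-up `toy` DataFrame; the values and the `numeric_cols` name are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in with two invalid rows (hypothetical values, for illustration only).
toy = pd.DataFrame({
    "Quantity": [6, -1, 8, 4],
    "UnitPrice": [2.55, 3.39, -2.75, 4.15],
    "CustomerID": [17850.0, 17850.0, 13047.0, 12680.0],
})
numeric_cols = ["Quantity", "UnitPrice", "CustomerID"]

# True for rows where every checked column is non-negative.
keep = (toy[numeric_cols] >= 0).all(axis=1)
filtered = toy[keep]

# Re-run the check from the notebook on the filtered data.
has_negative = bool((filtered[numeric_cols] < 0).any().any())
print(filtered)
print(has_negative)
```

Selecting with a mask returns a new DataFrame instead of modifying the original in place, which can make the step easier to re-run in a notebook.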

## **Step 2: Creating Transaction-Item Binary Matrix**

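In the DataFrame below, each row is one product line of an invoice, so we use the InvoiceNo column for the rows (transactions) and the StockCode column for the columns (items) of the binary matrix.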

In [16]:
df

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 2011-12-09 12:50:00 | 0.85 | 12680.0 | France |
| 541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 2011-12-09 12:50:00 | 2.10 | 12680.0 | France |
| 541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 2011-12-09 12:50:00 | 4.15 | 12680.0 | France |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 2011-12-09 12:50:00 | 4.15 | 12680.0 | France |


In [18]:
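# crosstab counts occurrences, so threshold the counts to get a strict 0/1 matrix
binary_matrix = (pd.crosstab(df['InvoiceNo'], df['StockCode']) > 0).astype(int)
binary_matrix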


| InvoiceNo (rows) / StockCode (columns) | 10002 | 10080 | 10120 | 10125 | 10133 | 10135 | 11001 | 15030 | 15034 | 15036 | ... | 90214V | 90214W | 90214Y | 90214Z | BANK CHARGES | C2 | DOT | M | PADS | POST |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 536365 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536366 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536367 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536368 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536369 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 581583 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 581584 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 581585 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 581586 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
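
`pd.crosstab` returns occurrence counts, so an item that is listed more than once in the same invoice produces a value greater than 1. The sketch below, on a made-up `toy` DataFrame, shows the counts, one way to threshold them into a strict 0/1 matrix, and how item support could then be read off as a column mean. The names `counts`, `binary` and `support` are illustrative assumptions.

```python
import pandas as pd

# Toy transactions (hypothetical); item "A" is listed twice in invoice 1.
toy = pd.DataFrame({
    "InvoiceNo": [1, 1, 1, 2, 2, 3],
    "StockCode": ["A", "B", "A", "B", "C", "A"],
})

# Frequency table: rows are invoices, columns are items, values are counts.
counts = pd.crosstab(toy["InvoiceNo"], toy["StockCode"])

# Threshold the counts so each cell only records presence (1) or absence (0).
binary = (counts > 0).astype(int)

# Support of each single item = fraction of invoices that contain it.
support = binary.mean()

print(counts)
print(binary)
print(support)
```

Working with presence flags rather than counts keeps the support of an item equal to the fraction of transactions that contain it, regardless of how many times it was listed.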
