In [1]:
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder

## Data Loading

In [2]:
market = pd.read_csv("kiosco.csv",header=None,skipfooter=0)
market = market.loc[1:,:]
market.sample(10)

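| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 645 | sanguche miga | sprite | | | |
| 280 | phillip | | | | |
| 404 | chester | | | | |
| 798 | mentoplus | | | | |
| 317 | phillip | | | | |
| 907 | yougurt | cereales | | | |
| 831 | phillip | alka | | | |
| 154 | coca | | | | |
| 40 | coca | | | | |
| 665 | pebete | | | | |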


In [3]:
market.shape

(1032, 5)

In [4]:
market.columns

Int64Index([0, 1, 2, 3, 4], dtype='int64')

The dataset contains **1032 transactions** from the small market. Each transaction holds between 1 and 5 items.
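
The basket-size claim can be checked by counting the non-null cells in each row. This is a minimal sketch on a made-up frame: `toy` is hypothetical data, and on the real data the same two lines would presumably run against `market` before the nulls are filled.

```python
import pandas as pd

# Hypothetical stand-in for `market`: one row per transaction, padded with nulls.
toy = pd.DataFrame([
    ["coca", "pebete", None],
    ["phillip", None, None],
    ["coca", "phillip", "alka"],
])

# Non-null cells per row = number of items in that transaction.
basket_sizes = toy.notna().sum(axis=1)
print(basket_sizes.min(), basket_sizes.max())
```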

## Data Preprocessing

First, I replace the null values in transactions with fewer than 5 items with empty strings.

In [5]:
market = market.fillna('')  # replace every null with an empty string
market.sample(10)

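| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 270 | pier | | | | |
| 268 | liverpool | | | | |
| 110 | belden | | | | |
| 36 | caramelos | | | | |
| 476 | heineken | | | | |
| 695 | belden | | | | |
| 578 | chester | | | | |
| 192 | phillip | | | | |
| 889 | pebete | caramelos | | | |
| 109 | phillip | alka | | | |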


Next, I obtain the unique values in the dataset, that is, all the items sold in the last two weeks. Note that the count includes the empty-string placeholder, which is not an item.

In [6]:
uniques = pd.unique(market.values.ravel())
print(len(uniques))
uniques

94


array(['fantoche', '', 'lucky', 'block', 'pier', 'coca', 'liverpool',
       'chester', 'lata coca', 'vino', 'caramelos', 'don satur', 'cepita',
       'pan', 'mentoplus', 'quilmes', 'pañuelos', '9 de oro', 'guaymayen',
       'brahma', 'fanta', 'pebete', 'phillip', 'alka', 'encendedor',
       'malboro', 'kit kat', 'santa fe', 'pritty', 'saladix', 'vocacion',
       'café', 'turron', 'belden', 'mogul', 'fresh', 'fernet', 'terepin',
       'speed', 'levite', 'powerade', 'sanguche milanesa', 'sprite',
       'schneider', 'twistos', 'dos corazones', 'doritos', 'stella',
       'pitusas', 'doctor lemon', 'monster', 'petaca', 'polvorita',
       'chupetin', 'aquarius', 'lays', 'rhodesia', 'oreo', 'pronto',
       'agua', 'doble cola', 'facturas', 'matecocido', 'agua tonica',
       'baggio', 'sanguche miga', 'yerba', 'bandeja de miga', 'opera',
       'bizcochos', 'alfajor santafesino', 'pipas', 'gin', 'heineken',
       'iguana', 'gomitas', 'mini oreo', 'jorgito', 'conitos',
       'alfaj

I store each transaction as a list. All lists have the same length because shorter transactions are padded with empty strings.

In [7]:
# One list of strings per row, without hard-coding the 1032 x 5 shape
transactions = market.astype(str).values.tolist()

In [8]:
print(len(transactions))
print(len(transactions[0]))
transactions[87]

1032
5


['liverpool', 'caramelos', '', '', '']
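
The padding is not strictly required. As far as I know, `TransactionEncoder` accepts lists of different lengths, so the empty strings could be filtered out while building the lists. A minimal sketch of that alternative, where `toy` is hypothetical data standing in for `market`:

```python
import pandas as pd

# Hypothetical stand-in for `market` after the nulls were replaced with ''.
toy = pd.DataFrame([
    ["liverpool", "caramelos", "", ""],
    ["coca", "", "", ""],
])

# Keep only the non-empty cells of each row.
transactions = [[item for item in row if item] for row in toy.values.tolist()]
print(transactions)
```

With lists built this way, no empty-string column would appear in the encoded data, so the later step that drops it would presumably become unnecessary.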

To use the transaction data with the Apriori algorithm, I need to convert the transactions into a one-hot encoded dataframe. So, first, I instantiate the `TransactionEncoder` and use it to transform the transactions into a boolean array, from which I will build that dataframe.

In [9]:
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
te_array

array([[ True, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False],
       ...,
       [ True, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False]])
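
To make the encoding concrete, here is a rough plain-pandas sketch of what a one-hot transaction matrix looks like. It is an illustration only, not mlxtend's implementation, and the three transactions are made up.

```python
import pandas as pd

# Hypothetical padded transactions, in the same shape as the lists built above.
transactions = [
    ["coca", "pebete", ""],
    ["phillip", "", ""],
    ["coca", "phillip", "alka"],
]

# Sorted vocabulary of every distinct value, including the '' padding.
items = sorted({item for row in transactions for item in row})

# One row per transaction, one boolean column per item.
onehot = pd.DataFrame(
    [[item in row for item in items] for row in transactions],
    columns=items,
)
print(onehot)
```

Because the empty string sorts before every item name, the padding ends up as the first column. It is `True` for every transaction that was padded, which would explain the leading `True` in each visible row of the array above.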

I am creating the new dataframe with one-hot encoding.

In [10]:
basket = pd.DataFrame(te_array,columns=te.columns_)

te.columns_

['',
 '9 de oro',
 'agua',
 'agua tonica',
 'alfajor block',
 'alfajor de maicena',
 'alfajor santafesino',
 'alka',
 'aquarius',
 'baggio',
 'bandeja de miga',
 'barrita de cereales',
 'belden',
 'bizcochos',
 'block',
 'brahma',
 'café',
 'campeon',
 'caramelos',
 'cepita',
 'cereales',
 'chester',
 'chupetin',
 'coca',
 'cofler',
 'conitos',
 'corona',
 'doble cola',
 'doctor lemon',
 'don satur',
 'doritos',
 'dos corazones',
 'encendedor',
 'facturas',
 'fanta',
 'fantoche',
 'fernet',
 'fresh',
 'gancia',
 'gatorade',
 'gin',
 'gomitas',
 'guaymayen',
 'heineken',
 'iguana',
 'jorgito',
 'kit kat',
 'lata coca',
 'lays',
 'levite',
 'lincon',
 'liverpool',
 'lucky',
 'malboro',
 'matecocido',
 'mentoplus',
 'mini oreo',
 'mogul',
 'monster',
 'opera',
 'oreo',
 'pan',
 'paseo',
 'pañuelos',
 'pebete',
 'petaca',
 'phillip',
 'pier',
 'pipas',
 'pitusas',
 'polvorita',
 'postre',
 'powerade',
 'pritty',
 'pronto',
 'quilmes',
 'rhodesia',
 'saladix',
 'sanguche miga',
 'sanguche m

I remove the empty-string column so that it does not interfere with the algorithm's modeling.

In [11]:
basket.drop(columns=[''], inplace=True)

In [12]:
basket.sample(1)

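| | 9 de oro | agua | agua tonica | alfajor block | alfajor de maicena | alfajor santafesino | alka | aquarius | baggio | bandeja de miga | ... | stella | tarta | terepin | topline | turron | twistos | vino | vocacion | yerba | yougurt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 838 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | True | False | False | False | False | False |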


## Saving the Processed Dataset

In [13]:
basket.to_csv("basket.csv", index=False)

## Noise Cleanup

The dataset contains many items that appear in only a few transactions. These items generated noise and made it difficult to obtain association rules, so I decided to remove every item purchased fewer than 15 times.

I also decided to remove the "caramelos" and "alka" products, since they are commonly given to customers as change for leftover money. They only generated noise and produced useless rules.

### Least-Sold Items

In [14]:
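# Number of transactions in which each item appears, in ascending order
item_counts = basket.sum().sort_values()
# Items purchased fewer than 15 times
delete = list(item_counts[item_counts < 15].index)
# Items handed out as change
delete.extend(["caramelos", "alka"])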

### Removing the Least-Sold Items


In [15]:
# Blank out every cell whose value is one of the items to delete
market_clean = market.replace(delete, "")
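
An alternative that avoids re-encoding would be to filter the columns of the one-hot dataframe directly. The sketch below uses made-up data: `toy_basket`, `min_count`, and `giveaways` are hypothetical names, with a threshold of 2 standing in for the 15 used here.

```python
import pandas as pd

# Hypothetical one-hot frame standing in for `basket`.
toy_basket = pd.DataFrame({
    "coca":      [True, True, True, False],
    "caramelos": [True, False, True, True],
    "gin":       [False, False, False, True],
})
min_count = 2                # the notebook uses 15
giveaways = ["caramelos"]    # items handed out as change

# Transactions per item, then the items that pass the threshold.
counts = toy_basket.sum()
keep = counts[counts >= min_count].index.difference(giveaways)

toy_clean = toy_basket[keep]
print(list(toy_clean.columns))
```

Both routes should keep the same set of items. The column filter simply skips rebuilding the transaction lists and running the encoder a second time.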

### Obtaining the New Dataset with Reduced Noise in the Data


In [16]:
uniques = pd.unique(market_clean.values.ravel())
print(uniques)

['fantoche' '' 'lucky' 'pier' 'coca' 'liverpool' 'chester' 'mentoplus'
 'quilmes' 'guaymayen' 'brahma' 'pebete' 'phillip' 'encendedor' 'malboro'
 'café' 'belden' 'levite' 'sanguche milanesa' 'sprite' 'aquarius'
 'facturas' 'baggio']


In [17]:
# One list of strings per row of the cleaned data
transactions = market_clean.astype(str).values.tolist()

In [18]:
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
te_array

array([[ True, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False],
       ...,
       [ True, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False]])

In [19]:
basket_clean = pd.DataFrame(te_array,columns=te.columns_)
basket_clean.columns

Index(['', 'aquarius', 'baggio', 'belden', 'brahma', 'café', 'chester', 'coca',
       'encendedor', 'facturas', 'fantoche', 'guaymayen', 'levite',
       'liverpool', 'lucky', 'malboro', 'mentoplus', 'pebete', 'phillip',
       'pier', 'quilmes', 'sanguche milanesa', 'sprite'],
      dtype='object')

### Removal of Blank Spaces

In [20]:
basket_clean.drop(columns="", inplace=True)

In [21]:
basket_clean.columns

Index(['aquarius', 'baggio', 'belden', 'brahma', 'café', 'chester', 'coca',
       'encendedor', 'facturas', 'fantoche', 'guaymayen', 'levite',
       'liverpool', 'lucky', 'malboro', 'mentoplus', 'pebete', 'phillip',
       'pier', 'quilmes', 'sanguche milanesa', 'sprite'],
      dtype='object')

In [22]:
# Number of remaining items per transaction
filas = basket_clean.sum(axis=1)
# Transactions left with no items after the cleanup
filas_a_eliminar = filas[filas==0]
filas_a_eliminar.index

Int64Index([   2,    8,    9,   10,   11,   12,   13,   14,   16,   24,
            ...
            1001, 1002, 1003, 1008, 1016, 1017, 1021, 1022, 1027, 1029],
           dtype='int64', length=238)

In [23]:
basket_clean.drop(filas_a_eliminar.index, inplace = True)
basket_clean.shape

(794, 22)
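
The cleanup leaves 238 of the 1032 transactions without any item, and dropping them gives the 794 rows shown above. The same step can be written as a boolean mask, sketched here on a hypothetical `toy_clean` frame:

```python
import pandas as pd

# Hypothetical one-hot frame standing in for `basket_clean`.
toy_clean = pd.DataFrame({
    "coca":    [True, False, False],
    "phillip": [False, False, True],
})

# Keep only the transactions that still contain at least one item.
non_empty = toy_clean.any(axis=1)
toy_clean = toy_clean[non_empty].reset_index(drop=True)
print(toy_clean.shape)
```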

## Saving New Dataset

In [24]:
basket_clean.to_csv("basket_clean.csv", index=False)
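
Before reusing the saved file, it may be worth checking that the boolean columns survive the round trip through CSV. This is a small sketch with an in-memory buffer in place of the real file. `toy_clean` is hypothetical data.

```python
import io

import pandas as pd

# Hypothetical one-hot frame standing in for `basket_clean`.
toy_clean = pd.DataFrame({"coca": [True, False], "phillip": [False, True]})

# Write to an in-memory buffer instead of a file on disk, then read it back.
buffer = io.StringIO()
toy_clean.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.dtypes)
```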