In [1]:
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder

## Data Loading

In [2]:
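# Read the file without treating any row as a header.
market = pd.read_csv("kiosco.csv", header=None)
# Drop the first row (index label 0) and keep everything from label 1 onward.
market = market.loc[1:, :]
market.sample(10)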

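|     | 0 | 1 | 2 | 3 | 4 |
|-----|---|---|---|---|---|
| 348 | pier | coca | belden | | |
| 525 | quilmes | | | | |
| 59 | santa fe | liverpool | | | |
| 454 | quilmes | | | | |
| 842 | pebete | coca | | | |
| 45 | quilmes | | | | |
| 717 | doctor lemon | | | | |
| 271 | fantoche | monster | | | |
| 602 | doctor lemon | | | | |
| 489 | quilmes | coca | | | |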


In [3]:
market.shape

(1032, 5)

In [4]:
market.columns

Int64Index([0, 1, 2, 3, 4], dtype='int64')

The dataset contains **1032 transactions** from the small market. A single transaction holds at most 5 items and at least 1.
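
The per-transaction item counts can be checked directly from the raw frame. The following is a minimal sketch on a hypothetical three-row sample (`toy_csv` and `toy` are illustrative names, and the layout is only assumed to match the real file); on the real data the same expression would be applied to `market` before the nulls are replaced.

```python
import io

import pandas as pd

# Hypothetical sample: one transaction per row, items in consecutive
# columns, trailing columns left empty (assumed layout).
toy_csv = "quilmes,,\npier,coca,belden\nsanta fe,liverpool,\n"
toy = pd.read_csv(io.StringIO(toy_csv), header=None)

# Empty fields are read as NaN, so counting the non-null cells in each row
# gives the number of items in each transaction.
items_per_transaction = toy.notna().sum(axis=1)
print(items_per_transaction.min(), items_per_transaction.max())
```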

## Data Preprocessing

First, I replace the null values that pad transactions with fewer than 5 items with empty strings.

In [5]:
# Replace every NaN with an empty string.
market = market.fillna('')
market.sample(10)

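|     | 0 | 1 | 2 | 3 | 4 |
|-----|---|---|---|---|---|
| 196 | caramelos | | | | |
| 151 | pier | | | | |
| 962 | fresh | | | | |
| 282 | malboro | | | | |
| 250 | alka | | | | |
| 96 | coca | | | | |
| 63 | phillip | quilmes | | | |
| 960 | coca | sanguche de milanesa | phillip | | |
| 757 | malboro | | | | |
| 160 | doctor lemon | | | | |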


Next, I obtain the unique values in the dataset, that is, every item sold in the last two weeks. The count below also includes the empty string used as padding.

In [6]:
uniques = pd.unique(market.values.ravel())
print(len(uniques))
uniques

103


array(['fantoche', '', 'lucky', 'block', 'pier', 'coca', 'liverpool',
       'chester', 'lata coca', 'vino', 'caramelos', 'don satur', 'cepita',
       'pan', 'mentoplus', 'fantonche', 'quilmes', 'pañuelos', '9 de oro',
       'guaymayen', 'brahma', 'fanta', 'pebete', 'phillip', 'alka',
       'encendedor', 'malboro', 'kit kat', 'santa fe', 'pritty',
       'saladix', 'vocacion', 'café', 'turron', 'belden', 'mogul',
       'fresh', 'fernet', 'terepin', 'speed', 'levite', 'powerade',
       'sanguche milanesa', 'sprite', 'schneider', 'twistos',
       'dos corazones', 'dorito', 'stella', 'pitusas', 'doctor lemon',
       'monster', 'petaca', 'polvorita', 'chupetin', 'aquarius', 'lays',
       'rodesia', 'oreo', 'pronto', 'agua', 'doble cola', 'facturas',
       'matecocido', 'rodecia', 'agua tonica', 'baggio',
       'sanguche de miga', 'yerba', 'bandeja de miga', 'opera',
       'bizcochos', 'encededor', 'alfajor santafecino', 'pipas', 'gin',
       'heineken', 'iguana', 'sanguche miga
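
The list of unique labels contains pairs that look like spelling variants of one product, such as 'fantoche' / 'fantonche', 'rodesia' / 'rodecia' and 'encendedor' / 'encededor'. A rough way to surface such candidates is to compare labels pairwise with the standard library's `difflib`. This is only a sketch: the helper `similar_pairs` and the 0.8 threshold are assumptions for illustration, and any pair it reports would still need a manual check before labels are merged.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(labels, threshold=0.8):
    """Return sorted label pairs whose similarity ratio reaches the threshold.

    Hypothetical helper; the threshold is an arbitrary starting point.
    """
    pairs = []
    for a, b in combinations(sorted(set(labels)), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((a, b))
    return pairs


# A few labels copied from the output above, plus the padding string.
sample = ['fantoche', 'fantonche', 'rodesia', 'rodecia',
          'encendedor', 'encededor', 'coca', 'quilmes', '']

# Skip the empty padding string before comparing.
pairs = similar_pairs([label for label in sample if label])
print(pairs)
```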

I store the transactions as lists of equal size (5 strings each); shorter transactions keep their empty-string padding.

In [7]:
transactions = []
# Build one list of strings per row instead of hard-coding the frame's shape.
for row in market.values:
    transactions.append([str(item) for item in row])

In [8]:
print(len(transactions))
print(len(transactions[0]))
transactions[87]

1032
5


['liverpool', 'caramelos', '', '', '']
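
Because the padding strings are not real items, an alternative (not the approach taken in this notebook) would be to drop them while building the lists, leaving transactions of different lengths. A minimal sketch on hypothetical rows, where `padded_rows` and `unpadded` are illustrative names:

```python
# Hypothetical padded transactions in the same shape as the notebook's lists.
padded_rows = [
    ['liverpool', 'caramelos', '', '', ''],
    ['quilmes', '', '', '', ''],
]

# Keep only the non-empty labels of each transaction.
unpadded = [[item for item in row if item != ''] for row in padded_rows]
print(unpadded)
```

With lists like these, no empty-string label would reach the encoder, so there would be no padding column to remove afterwards.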

To use the transaction data with the Apriori algorithm, I need to convert the transactions into a one-hot encoded dataframe. First, I instantiate the transaction encoder, fit it on the transactions, and transform them into the boolean array that I will use to create that dataframe.

In [9]:
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
te_array

array([[ True, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False],
       ...,
       [ True, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False]])
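
To make the encoder's output easier to read, here is a minimal pure-Python sketch of the same idea on hypothetical toy data. It is not mlxtend's implementation; it only illustrates what a one-hot row means, and why the first column (the empty string) is True for every padded transaction.

```python
# Toy transactions padded with '' to a fixed length, as in the notebook.
transactions = [
    ['liverpool', 'caramelos', ''],
    ['quilmes', '', ''],
    ['quilmes', 'coca', 'pier'],
]

# One column per distinct label, in sorted order ('' sorts first).
columns = sorted({item for row in transactions for item in row})

# One boolean row per transaction: True where the label occurs in it.
one_hot = [[label in row for label in columns] for row in transactions]

print(columns)
for row in one_hot:
    print(row)
```

Only the last toy transaction fills every slot, so it is the only row whose empty-string column is False.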

I create the new one-hot encoded dataframe from that array.

In [10]:
basket = pd.DataFrame(te_array,columns=te.columns_)

te.columns_

['',
 '9 de oro',
 'agua',
 'agua tonica',
 'alfajor block',
 'alfajor de maicena',
 'alfajor santafecino',
 'alfajor santafesino',
 'alka',
 'aquarius',
 'baggio',
 'bandeja de miga',
 'barrita de cereales',
 'belden',
 'bizcochos',
 'block',
 'brahma',
 'cafe',
 'café',
 'campeon',
 'caramelos',
 'cepita',
 'cereales',
 'cerealeses',
 'chester',
 'chupetin',
 'coca',
 'cofler',
 'conitos',
 'corona',
 'doble cola',
 'doctor lemon',
 'don satur',
 'dorito',
 'doritos',
 'dos corazones',
 'encededor',
 'encendedor',
 'facturas',
 'fanta',
 'fantoche',
 'fantonche',
 'fernet',
 'fresh',
 'gancia',
 'gatorade',
 'gin',
 'gomitas',
 'guaymayen',
 'heineken',
 'iguana',
 'jorgito',
 'kit kat',
 'lata coca',
 'lays',
 'levite',
 'lincon',
 'liverpool',
 'lucky',
 'malboro',
 'matecocido',
 'mentoplus',
 'mini oreo',
 'mogul',
 'monster',
 'opera',
 'oreo',
 'pan',
 'paseo',
 'pañuelos',
 'pebete',
 'petaca',
 'phillip',
 'pier',
 'pipas',
 'pitusas',
 'polvorita',
 'postre',
 'powerade',
 '

I remove the empty-string column so that the padding is not treated as an item by the algorithm.

In [11]:
basket = basket.drop(columns=[''])

In [12]:
basket.sample(1)

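|     | 9 de oro | agua | agua tonica | alfajor block | alfajor de maicena | alfajor santafecino | alfajor santafesino | alka | aquarius | baggio | ... | stella | tarta | terepin | topline | turron | twistos | vino | vocacion | yerba | yougurt |
|-----|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 730 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |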


## Saving the Processed Dataset

In [13]:
basket.to_csv("basket.csv", index=False)
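
Since `index=False` omits the row index, the file holds only the item columns. A quick sanity check is to read the data back and confirm that the boolean columns survive the round trip. The sketch below does this on a hypothetical two-column frame (`toy_basket`) written to an in-memory buffer, so it does not touch `basket.csv`.

```python
import io

import pandas as pd

# Hypothetical two-item frame standing in for the real one-hot dataframe.
toy_basket = pd.DataFrame({'coca': [True, False], 'quilmes': [False, True]})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
toy_basket.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Columns made up only of True/False are parsed back as booleans.
print(restored.dtypes)
```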