# FILTERING THE DATASET:

In this script we go from an initial list of products provided by our client to a final list (matching the names and ids actually present in the data), which we then use to filter the initial data into a smaller, more manageable file.

This process is divided into two main steps:

- Check the names in our list against the descriptions present in our data, analyze them and select a final list

- Use this list to filter our data and store the result in a smaller, more convenient file

## 1. CREATING THE LIST OF PRODUCTS FOR THE ANALYSIS:

After rearranging the data in a more convenient manner and doing some introductory analysis, we now want to get down to work with it.

Our client has given us a list of the 10 products they consider most relevant to their business.

What we want now is to check whether the names on the list correspond to unique ids or whether, as seen in the previous scripts, conflicts of uniqueness arise between the ids of our products and their descriptions.
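
As an illustration of this kind of uniqueness check, the sketch below counts how many distinct descriptions share an id and how many distinct ids share a description. The dataframe `toy` and its values are invented for the example; only the column names `product_id` and `description` come from our data.

```python
import pandas as pd

# Toy data (hypothetical), mimicking the product_id / description columns.
toy = pd.DataFrame({
    "product_id": [100, 100, 103, 9999, 9999],
    "description": ["CROISSANT", "CROISANTS", "CROISSANT PETIT",
                    "TARTA OPERA 2º", "CROISSANT"],
})

# Number of distinct descriptions attached to each id.
descriptions_per_id = toy.groupby("product_id")["description"].nunique()

# Number of distinct ids attached to each description.
ids_per_description = toy.groupby("description")["product_id"].nunique()

# Ids and descriptions that break the one-to-one correspondence.
ambiguous_ids = descriptions_per_id[descriptions_per_id > 1].index.tolist()
ambiguous_descriptions = ids_per_description[ids_per_description > 1].index.tolist()

print(ambiguous_ids)
print(ambiguous_descriptions)
```

Any id or description that shows up in these two lists needs a manual decision before it can be used as a filter.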

So, we are going to go through our dataframe and select the ids and descriptions of the products that match the names given in our client's list.

We will do this in two phases: in the first one we will create a dataframe with only some of the rows of our csv.

In the second one, we will load the whole dataframe chunk by chunk and extract the same information.

Finally, with the lists (in reality, two dictionaries) of the ids and descriptions that match every product given to us, we will decide which are the most appropriate.

Perhaps some guidance from our client would be needed at this stage.

### 1.1. LOADING ONLY PART OF THE DATAFRAME:

In [63]:
# Importing packages:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

In [64]:
# Defining the search path of the file, the name and the separator:

file_path = "../../data/01_raw/"
file_name = "b2-transactions.csv"
exit_path = "../../data/02_intermediate/"

filtered_file_name="c1-filtered_transactions.csv"

sep=";"

In [66]:
# We create the list of selected products, and will try with it to obtain the values of the ids of the indicated products:

list_of_products=['croissant',
                  'croissant petit',
                  'tarta mousse 3 chocolates',
                  'tarta de manzana 2º',
                  'palmera', 
                  'tarta opera',
                  'postre fresas y mascarpone',
                  'milhojas frambuesa 2º',
                  'tortel',
                  'baguette']

In [6]:
# We import the dataframe:

df=pd.read_csv(file_path+file_name, nrows=1000000, sep=sep)

In [7]:
# According to what we saw in the previous notebook, we have to do some cleaning:

df.dropna(how='any', inplace=True)
df=df.drop('Unnamed: 0', axis=1)

In [8]:
# Most of the descriptions are in uppercase, we first reduce everything to lowercase:

df['description_lower']=df['description'].str.lower()

In [9]:
# Now we build the dictionaries of the ids and descriptions matched by each product of the list:

# What you see here is the final version of both the list of selected products and the way they are searched for in the
# dataframe. We first searched anywhere within the description with ".str.contains", but ".str.startswith"
# proved better suited to the job. The values in the list of products were also changed in order to capture more names
# each time:

rel_prod_list_ids=dict()
rel_prod_list_descrip=dict()

for product in list_of_products:
    # Rows whose lowercase description starts with the product name:
    matches=df[df['description_lower'].str.startswith(product)]
    rel_prod_list_ids[product]=matches['product_id'].unique()
    rel_prod_list_descrip[product]=matches['description_lower'].unique()
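
To make the difference between the two search methods concrete, here is a minimal sketch on a handful of invented descriptions (they are not taken from the real data):

```python
import pandas as pd

descriptions = pd.Series([
    "croissant petit",
    "mini croissant",
    "tarta opera 2º",
    "caja para tarta opera",
])

# str.contains matches the term anywhere in the string
# (regex=False treats the term as a literal, not as a regular expression).
contains_mask = descriptions.str.contains("croissant", regex=False)

# str.startswith only matches when the string begins with the term.
startswith_mask = descriptions.str.startswith("croissant")

print(descriptions[contains_mask].tolist())
print(descriptions[startswith_mask].tolist())
```

With `startswith`, descriptions that merely mention the product somewhere in the text are left out, which keeps the candidate list shorter.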

In [10]:
# For minor, additional checks:

df[df['product_id']==107].groupby('description').first()

description,product_id,order_date,section,store,units_ordered,description_lower
CROISSANT ALMENDRA LARGO,107.0,3/9/2019 0:00:00,0,BmUP,0,croissant almendra largo
CROISSANT VACIOS,107.0,16/6/2009 0:00:00,0,BmUP,300,croissant vacios


In [11]:
# This is the dictionary with the word and its associated codes:

rel_prod_list_ids

{'croissant': array([ 102.,  103.,  105.,  107.,  101.,  132., 5001.,  100.,  214.,
         189.,  198.,  197., 9999.,  513.,  512.,  112.]),
 'croissant petit': array([103., 102.]),
 'tarta mousse 3 chocolates': array([9999.,  453.]),
 'tarta de manzana 2º': array([462.]),
 'palmera': array([ 140.,  182.,  190., 9999.,  141.]),
 'tarta opera': array([ 9999.,   414.,   426.,   427.,   403., 14998.,   402.]),
 'postre fresas y mascarpone': array([4511., 9999.]),
 'milhojas frambuesa 2º': array([459.]),
 'tortel': array([ 112., 3352., 9999., 3375.]),
 'baguette': array([ 115., 8739., 9999.])}

In [12]:
# The same but with the descriptions: Just for checks:

rel_prod_list_descrip[list_of_products[5]]

array(['tarta opera del 2 escrito sobre la trta felicidades rafael ',
       'tarta opera del 2º con cartel "felicidades"',
       'tarta opera 5º con cartel "felicidades lili" y adornada con frutas naturales',
       'tarta opera 2º', 'tarta opera del 5º',
       'tarta opera del 4º con cartel " felicidades dolly "',
       'tarta opera del 2º con cartel " felicidades raul "',
       'tarta opera del 4º con cartel que ponga "felicidades gaës" (ojo que la letra e lleva dieresis)',
       'tarta opera del 2º escrito encima " happy birthay nano  aba y papa "',
       'tarta opera 2º escrito en un cartel felicidades 18',
       'tarta opera 3º',
       'tarta opera 32 rac. escrito sobre la tarta " felicidades jose. feliz 60 cumpleaños"ccccccccccc',
       'tarta opera del 4  con cartel escrito- hugo y mar, muchas felicidades de vuestra familia-',
       'tarta opera 5º', 'tarta opera 2º felicidades alejandra',
       'tarta opera del 6º escrito feliz cumpleaños',
       'tarta opera 10 ra

In [13]:
# For some additional checks:

df[df['product_id']==450]['description'].unique()

array(['POSTRE MOUSSE TRES CHOCOLATES',
       'POSTRE VIRUTA CHOCOLATE  RECTANGULAR',
       'POSTRES  MILHOJAS  NATA CREMA', ...,
       'POSTRE MANZANA  CARAMELO -- HOTELES-',
       'POSTRE MOUSSE CHOCOLATE  VASITO',
       'POSTRE RECTANGULAR  VIRUTA CHOCOLATE'], dtype=object)

### 1.2. LOADING ALL THE DATAFRAME IN CHUNKS AND GETTING ALL THE RESULTS:

In [65]:
# We import the dataframe:

reader=pd.read_csv(file_path+file_name, sep=sep, chunksize=2000000)

# Two empty lists are created to store, for each product, the list that results from looking for it in the chunk.

NL1=list([None]*len(list_of_products))
NL2=list([None]*len(list_of_products))

# Getting the chunks and proceeding:

for chunk in reader:

    chunk.dropna(how='any', inplace=True)
    chunk=chunk.drop('Unnamed: 0', axis=1)

    chunk['description_lower']=chunk['description'].str.lower()

    list_prod_list_ids=[None]*len(list_of_products)
    list_prod_list_descrip=[None]*len(list_of_products)
        
    for i, product in enumerate(list_of_products):

        list_prod_list_ids[i]=list(chunk[chunk['description_lower'].str.startswith(product)]['product_id'].unique())
        list_prod_list_descrip[i]=list(chunk[chunk['description_lower'].str.startswith(product)]['description_lower'].unique())
        
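    # Merging the results of this chunk into the accumulated lists, without repeating values:

    for i in range(len(list_of_products)):

        if NL1[i] is None:
            NL1[i]=list_prod_list_ids[i]
            NL2[i]=list_prod_list_descrip[i]

        else:
            NL1[i].extend(x for x in list_prod_list_ids[i] if x not in NL1[i])
            NL2[i].extend(x for x in list_prod_list_descrip[i] if x not in NL2[i])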
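
A possible alternative for accumulating results across chunks is to keep one set per product, so that repeated values are discarded automatically. The sketch below is not the code used in this notebook: the small in-memory CSV, `products` and `ids_found` are all invented for the illustration.

```python
import io
import pandas as pd

# Hypothetical miniature of the transactions file.
csv_text = """product_id;description
100;CROISSANT
103;CROISSANT PETIT
100;CROISSANT
414;TARTA OPERA 2º
428;TARTA OPERA 5º
115;BAGUETTE
"""

products = ["croissant", "tarta opera"]
ids_found = {product: set() for product in products}

# chunksize=2 forces several chunks, so the merge across chunks is exercised.
for chunk in pd.read_csv(io.StringIO(csv_text), sep=";", chunksize=2):
    lowered = chunk["description"].str.lower()
    for product in products:
        mask = lowered.str.startswith(product)
        # set.update merges the new ids and discards repeats automatically.
        ids_found[product].update(chunk.loc[mask, "product_id"].tolist())

ids_found = {product: sorted(ids) for product, ids in ids_found.items()}
print(ids_found)
```

Sets do not keep the order in which the values were first seen, which is why the result is sorted at the end.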
    

In [15]:
rel_prod_list_ids=dict(zip(list_of_products,NL1))
rel_prod_list_descrip=dict(zip(list_of_products,NL2))

In [16]:
rel_prod_list_ids

{'croissant': [102.0,
  103.0,
  105.0,
  107.0,
  101.0,
  132.0,
  5001.0,
  100.0,
  214.0,
  189.0,
  198.0,
  197.0,
  9999.0,
  513.0,
  512.0,
  112.0],
 'croissant petit': [103.0, 102.0],
 'tarta mousse 3 chocolates': [9999.0, 453.0],
 'tarta de manzana 2º': [462.0, 9999.0],
 'palmera': [140.0, 182.0, 190.0, 9999.0, 141.0],
 'tarta opera': [9999.0, 414.0, 426.0, 427.0, 403.0, 14998.0, 402.0, 428.0],
 'postre fresas y mascarpone': [4511.0, 9999.0, 450.0],
 'milhojas frambuesa 2º': [459.0],
 'tortel': [112.0, 3352.0, 9999.0, 3375.0],
 'baguette': [115.0, 8739.0, 9999.0]}

In [52]:
rel_prod_list_descrip

{'croissant': ['croissant frances',
  'croissant petit',
  'croissant chocolate',
  'croissant vacios',
  'croissant',
  'croissant integral',
  'croissant alargado grande piezas',
  'croissant de chocolate',
  'croissant sobrasada gr.',
  'croissant integral   latas',
  'croissant alargado  paris   piezas',
  'croissants  paris  --  largos  -',
  'croissant petit  alargados  piezas --futbol',
  'croissant petit  --  futbol --',
  'croissant   integral',
  'croissant cereales largo',
  'croissant normal largo',
  'croissant normal petit',
  'croissant chocolate largo',
  'croissant almendra largo',
  'croissant cereales petit',
  'croissant almendra petit',
  'croissant chocolate petit',
  'croissant integral     6 piezas--',
  'croissant  integral    --  piezas --',
  'croissant paris alargados (futbol)',
  'croissant frances--alaragado  piezas--futbol',
  'croissant  paris  alargados  piezas',
  'croissant pi\\ones',
  'croissant paris  alargados',
  'croissant frances, de estas lata

Storing the list in a file:

In [60]:
import csv

with open(file_path+'rel_prod_list_descrip.csv', 'w') as f:
    for key in rel_prod_list_ids.keys():
        f.write("%s;%s\n"%(key,rel_prod_list_descrip[key]))
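
The cell above imports `csv` but writes each line by hand. As a sketch of what using the module could look like, the example below writes one row per product with `csv.writer`, placing each description in its own field. The dictionary `descriptions_by_product` is a small invented stand-in for `rel_prod_list_descrip`, and an in-memory buffer replaces the file.

```python
import csv
import io

# Hypothetical stand-in for rel_prod_list_descrip.
descriptions_by_product = {
    "croissant": ["croissant frances", "croissant petit"],
    "baguette": ["baguette"],
}

buffer = io.StringIO()  # stands in for an open file
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
for product, descriptions in descriptions_by_product.items():
    # One row per product: the product name, then each description in its own field.
    writer.writerow([product, *descriptions])

print(buffer.getvalue())
```

The writer also quotes any field that happens to contain the separator, which matters for free-text descriptions.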

### 1.3. FINAL LIST:

In [67]:
# In the end, we reach the following conclusions:

dict_of_products={'croissant': 100, # serious doubts: if it is not 100 alone, it should possibly be 100+101+102
                  'croissant petit': 103,
                  'tarta mousse 3 chocolates': 9999, # used almost only for orders; creating a new id for this product is suggested
                  'tarta de manzana 2º': 462,
                  'palmeras de trufa': 182, # palmeras: 140
                  'tarta opera': 414, # 9999 is mostly for orders. If included, creating a new id for this product is suggested
                  'postre fresas y mascarpone':4511,
                  'milhojas frambuesa 2º': 459,
                  'torteles': 112,
                  'baguette':115}

In [68]:
dict_of_products

{'croissant': 100,
 'croissant petit': 103,
 'tarta mousse 3 chocolates': 9999,
 'tarta de manzana 2º': 462,
 'palmeras de trufa': 182,
 'tarta opera': 414,
 'postre fresas y mascarpone': 4511,
 'milhojas frambuesa 2º': 459,
 'torteles': 112,
 'baguette': 115}

### 1.4. CHANGING THE ID OF A PRODUCT:

We now face the problem that one of our products, 'tarta mousse 3 chocolates', uses a code (9999) that describes orders in general rather than a single product (as seen in previous scripts).

To avoid complexity in the code to come, we change the id of this product to a new one, first making sure that the new id is not already in use.
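
One way to make sure a candidate id is free is sketched below. The helper `first_unused_id` and the ids in `toy` are hypothetical and only illustrate the check; on the real file the used ids would have to be collected over all the chunks first.

```python
import pandas as pd

def first_unused_id(used_ids, start):
    """Return the smallest integer >= start that is not in used_ids."""
    used = {int(x) for x in used_ids}
    candidate = start
    while candidate in used:
        candidate += 1
    return candidate

# Invented ids, stored as floats like the product_id column.
toy = pd.DataFrame({"product_id": [100.0, 9999.0, 10000.0]})
new_id = first_unused_id(toy["product_id"].unique(), start=10000)
print(new_id)
```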

In [69]:
df.loc[df['description_lower'].str.startswith('tarta mousse 3 chocolates'), 'product_id']=10002

In [70]:
df[df['product_id']==10002]

Unnamed: 0,product_id,description,order_date,section,store,units_ordered,description_lower
888,10002.0,"Tarta mousse 3 chocolates del 3º, escrito en l...",4/4/2013 0:00:00,0,BmUP,000,"tarta mousse 3 chocolates del 3º, escrito en l..."
2340,10002.0,Tarta mousse 3 chocolates del 3º escrito encim...,15/3/2014 0:00:00,0,BmUP,000,tarta mousse 3 chocolates del 3º escrito encim...
2531,10002.0,Tarta mousse 3 chocolates del 5º que ponga Fel...,5/4/2013 0:00:00,0,BmUP,000,tarta mousse 3 chocolates del 5º que ponga fel...
2538,10002.0,TARTA MOUSSE 3 CHOCOLATES DEL SEGUNDO ESCRITO...,5/4/2013 0:00:00,0,BmUP,000,tarta mousse 3 chocolates del segundo escrito...
38205,10002.0,TARTA MOUSSE 3 CHOCOLATES 4º,18/12/2012 0:00:00,0,BmUP,000,tarta mousse 3 chocolates 4º
40313,10002.0,Tarta mousse 3 chocolates del 4º,25/4/2013 0:00:00,0,BmUP,000,tarta mousse 3 chocolates del 4º
40451,10002.0,"Tarta Mousse 3 chocolates del 4º con cartel "" ...",26/4/2013 0:00:00,0,BmUP,000,"tarta mousse 3 chocolates del 4º con cartel "" ..."
52807,10002.0,TARTA MOUSSE 3 CHOCOLATES DEL 3 - FELICIDADES ...,21/4/2012 0:00:00,0,BmUP,000,tarta mousse 3 chocolates del 3 - felicidades ...
64034,10002.0,"Tarta Mousse 3 chocolates 2º "" Felicidades San...",26/8/2012 0:00:00,0,BmUP,000,"tarta mousse 3 chocolates 2º "" felicidades san..."
93654,10002.0,TARTA MOUSSE 3 CHOCOLATES DEL 2 ESCRITO SOBRE ...,25/2/2012 0:00:00,0,BmUP,000,tarta mousse 3 chocolates del 2 escrito sobre ...


In [71]:
# We make this small arrangement also to the dict_of_products:

dict_of_products['tarta mousse 3 chocolates']= 10002 # New id created

In [72]:
dict_of_products

{'croissant': 100,
 'croissant petit': 103,
 'tarta mousse 3 chocolates': 10002,
 'tarta de manzana 2º': 462,
 'palmeras de trufa': 182,
 'tarta opera': 414,
 'postre fresas y mascarpone': 4511,
 'milhojas frambuesa 2º': 459,
 'torteles': 112,
 'baguette': 115}

## 2. SELECTING OUR PRODUCTS:

In this part of the script we select the products in the list obtained above and arrange them in a dataframe that is convenient to manipulate.

### 2.1. CREATING THE FILTERED DATAFRAME:

In [73]:
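reader=pd.read_csv(file_path+file_name, sep=sep, chunksize=100000)

# The matching rows of each chunk are collected in a list and concatenated once at the end
# (DataFrame.append was removed in pandas 2.0).
# Note: the id 10002 assigned in 1.4 only exists in the in-memory df, not in the file read here,
# so no rows match it.

filtered_parts=[]

for chunk in reader:

    chunk=chunk.dropna(how='any')
    chunk=chunk.drop('Unnamed: 0', axis=1)

    for value in dict_of_products.values():

        filtered_parts.append(chunk[chunk['product_id']==value])

filtered_df=pd.concat(filtered_parts)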
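
The id 10002 is only assigned to the in-memory dataframe in 1.4, so filtering the raw file by it matches nothing, and the list of unique ids shown below contains only nine values. One possible remedy, sketched here on an invented in-memory CSV, is to apply the re-labelling to each chunk before filtering it. The names `relabel` and `selected_ids` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical miniature of the raw file.
csv_text = """product_id;description
100;CROISSANT
9999;Tarta mousse 3 chocolates del 4º
9999;TARTA OPERA 2º
115;BAGUETTE
9999;TARTA MOUSSE 3 CHOCOLATES 4º
"""

selected_ids = [100, 115, 10002]
relabel = {"tarta mousse 3 chocolates": 10002}  # description prefix -> new id

parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), sep=";", chunksize=2):
    lowered = chunk["description"].str.lower()
    for prefix, new_id in relabel.items():
        # Re-label before filtering, so the new id exists in this chunk.
        chunk.loc[lowered.str.startswith(prefix), "product_id"] = new_id
    parts.append(chunk[chunk["product_id"].isin(selected_ids)])

filtered = pd.concat(parts)
print(filtered["product_id"].tolist())
```

Rows that keep the generic id 9999 (the 'tarta opera' order in this toy file) are still excluded, while the re-labelled ones are kept.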

In [74]:
filtered_df.head()

Unnamed: 0,product_id,description,order_date,section,store,units_ordered
47,100.0,CROISANTS,16/6/2009 0:00:00,0,BmUP,0
754,100.0,CROISANTS,21/12/2012 0:00:00,0,BmUP,1500
1029,100.0,CROISANTS,14/3/2014 0:00:00,0,BmUP,1800
1510,100.0,CROISANTS,30/7/2013 0:00:00,0,BmUP,600
1645,100.0,CROISSANT,3/7/2019 0:00:00,0,BmUP,400


In [75]:
filtered_df['product_id'].unique()

array([ 100.,  103.,  462.,  182.,  414., 4511.,  459.,  112.,  115.])

In [76]:
filtered_df.shape

(885643, 6)

In [77]:
filtered_df[filtered_df['product_id']==103].shape

(104462, 6)

### 2.2. CLEANING THE DATA PRIOR TO ITS STORAGE:

DATES TO APPROPRIATE FORMAT USING DATETIME:

In [78]:
from datetime import datetime as dttm

In [79]:
filtered_df['date']=filtered_df['order_date'].apply(lambda x: dttm.strptime(x,'%d/%m/%Y 0:00:00'))
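
A vectorised alternative to applying `strptime` row by row is `pd.to_datetime` with an explicit format. The sketch below uses three date strings in the same layout as our `order_date` column; the series `order_dates` is only for illustration.

```python
import pandas as pd

order_dates = pd.Series(["16/6/2009 0:00:00", "21/12/2012 0:00:00", "3/7/2019 0:00:00"])

# An explicit format avoids any day/month ambiguity ("3/7/2019" is 3 July here).
dates = pd.to_datetime(order_dates, format="%d/%m/%Y %H:%M:%S")
print(dates.dt.strftime("%Y-%m-%d").tolist())
```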

We drop the original order_date column, which is of no further interest now that we have the date column:

In [80]:
filtered_df.drop('order_date', axis=1, inplace=True)

We convert the units ordered from string to a numeric type:

In [81]:
filtered_df.head()

Unnamed: 0,product_id,description,section,store,units_ordered,date
47,100.0,CROISANTS,0,BmUP,0,2009-06-16
754,100.0,CROISANTS,0,BmUP,1500,2012-12-21
1029,100.0,CROISANTS,0,BmUP,1800,2014-03-14
1510,100.0,CROISANTS,0,BmUP,600,2013-07-30
1645,100.0,CROISSANT,0,BmUP,400,2019-07-03


In [82]:
# Keeping the part before the comma and converting it to an integer:

filtered_df['units_ordered_numeric']=filtered_df['units_ordered'].str.split(",").str[0].astype('int64')
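
The conversion above keeps only what comes before the comma. Assuming the values are strings with a decimal comma such as "15,00" (an assumption inferred from the split on ","; the sample values below are invented), these are three possible ways of handling them:

```python
import io
import pandas as pd

units = pd.Series(["15,00", "0,00", "6,50"])

# Option 1: keep only the integer part, as the notebook does.
integer_part = units.str.split(",").str[0].astype("int64")

# Option 2: swap the decimal comma for a point and keep the fraction.
as_float = units.str.replace(",", ".", regex=False).astype(float)

# Option 3: let read_csv interpret the decimal comma while loading.
csv_text = "units_ordered\n15,00\n6,50\n"
loaded = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")

print(integer_part.tolist(), as_float.tolist(), loaded["units_ordered"].tolist())
```

Option 1 silently truncates any fractional quantity, so it is only appropriate if the units are always whole numbers.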

In [83]:
filtered_df.drop('units_ordered', axis=1, inplace=True)

In [84]:
filtered_df.rename(columns={'units_ordered_numeric':'units_ordered'}, inplace=True)

In [85]:
filtered_df.head()

Unnamed: 0,product_id,description,section,store,date,units_ordered
47,100.0,CROISANTS,0,BmUP,2009-06-16,0
754,100.0,CROISANTS,0,BmUP,2012-12-21,15
1029,100.0,CROISANTS,0,BmUP,2014-03-14,18
1510,100.0,CROISANTS,0,BmUP,2013-07-30,6
1645,100.0,CROISSANT,0,BmUP,2019-07-03,4


Finally, we store our results in a csv file:

In [86]:
filtered_df.to_csv(exit_path+filtered_file_name, sep=sep)
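
When the stored file is loaded again, the index and the datetime type have to be restored explicitly. A minimal sketch of the round trip, using an invented two-row dataframe and an in-memory buffer in place of the file on disk:

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"product_id": [100.0, 103.0],
     "date": pd.to_datetime(["2009-06-16", "2012-12-21"]),
     "units_ordered": [0, 15]},
    index=[47, 754],
)

buffer = io.StringIO()  # stands in for the file on disk
toy.to_csv(buffer, sep=";")
buffer.seek(0)

# index_col=0 restores the saved index; parse_dates restores the datetime dtype.
restored = pd.read_csv(buffer, sep=";", index_col=0, parse_dates=["date"])
print(restored.dtypes)
```

Without `index_col=0`, the saved index comes back as an extra 'Unnamed: 0' column, like the one dropped at the start of this script.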