# FILTERING THE DATASET:

In this script we go from the initial list of products provided by our client to a final list (based on the names and ids actually present in the data), which we then use to filter the initial data into a smaller, more manageable file.

This process is divided into two main steps:

- Check the names in our list against the descriptions present in our data, analyze them and select a final list

- Use this list to filter our data and store the resulting information in a smaller, more convenient file

## CREATING THE LIST OF PRODUCTS FOR THE ANALYSIS:

After rearranging the data more conveniently and doing some introductory analysis, we now want to get down to work with it.

We have been given a list of the 10 products that our client considers most relevant to their business.

What we want now is to check whether each name on the list corresponds to a single unique id or whether, as seen in the previous scripts, there are uniqueness conflicts between product ids and their descriptions.

So, we are going to go through our dataframe and select the ids and descriptions that match the entries in our client's list. With these matches (in practice, two dictionaries of ids and descriptions per product), we will decide which are the most appropriate.

Some guidance from our client may be needed at this stage.
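
A minimal sketch of the uniqueness check described above, using a handful of made-up rows (the ids and descriptions are illustrative, not taken from the client data). It collects the descriptions seen for each id and the ids seen for each description, then keeps the ones that are not one-to-one:

```python
from collections import defaultdict

# Illustrative (product_id, description) pairs; not real client data.
rows = [
    (100, "croissant"),
    (100, "croissant frances"),
    (103, "croissant petit"),
    (9999, "croissant petit"),
]

descriptions_by_id = defaultdict(set)
ids_by_description = defaultdict(set)
for product_id, description in rows:
    descriptions_by_id[product_id].add(description)
    ids_by_description[description].add(product_id)

# Ids that carry more than one description, and descriptions shared by several ids.
ambiguous_ids = {k: v for k, v in descriptions_by_id.items() if len(v) > 1}
ambiguous_descriptions = {k: v for k, v in ids_by_description.items() if len(v) > 1}

print(ambiguous_ids)
print(ambiguous_descriptions)
```

On the real dataframe the same idea could be expressed with a `groupby` on each column; the dictionary version is shown only because it makes the two directions of the check explicit.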

### 1. Read dataframe

In [1]:
# Importing packages:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
from collections import Counter
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import math


%matplotlib inline
pd.options.display.max_columns = None



In [2]:
# Defining the search path of the file, the name and the separator:

file_path = "../../data/01_raw/"
file_name = "b2-transactions.csv"
exit_path = "../../data/02_intermediate/"

filtered_file_name="c1-filtered_transactions.csv"

sep=";"

In [3]:
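# We create the list of products provided by the client.
# Each entry needs its trailing comma: Python concatenates adjacent string literals,
# so a missing comma after 'palmera de chocolate' yields the single entry
# 'palmera de chocolatetarta opera' (the value visible in some target_names_B outputs further down).
list_of_products=['croissant',
                  'croissant petit',
                  'tarta mousse 3 chocolates',
                  'tarta de manzana 2º',
                  'palmera de chocolate',
                  'tarta opera',
                  'postre fresas y mascarpone',
                  'milhojas frambuesa 2º',
                  'tortel',
                  'baguette']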

In [4]:
# We import the dataframe:
df=pd.read_csv(file_path+file_name, sep=sep)

In [5]:
df.sample(5)

Unnamed: 0,product_id,description,order_date,section,store,units_ordered
20325492,462.0,MANZANA 2º,27/3/2010 0:00:00,0,AnUP,300
10545620,3368.0,BOLSAS DE PATATAS TORTILLA,29/2/2008 0:00:00,0,RzUP,300
13326714,561.0,PAN DE MALLORCA BOLSA 120 GRS,26/1/2014 0:00:00,0,LiUP,0
30379282,413.0,TARTA LIMON 2º,2/10/2015 0:00:00,0,GeUP,100
10091881,212.0,EMPANADILLA DE VERDURAS ( ESPINACA Y ACELGA),20/1/2013 0:00:00,0,RzUP,50


### 2. Normalizing and aggregating description names

Unfortunately, there is no convention for the descriptions, and one id could be associated with several different descriptions. To deal with this, we will:

1. Normalize descriptions as much as possible using:
    - Regular expressions
    - Basic NLP for spell-checking.
2. Create a normalization file with the following structure:
    - Unique Product_id and normalized description
    - Flag to indicate if the product is part of the given list, or not.  
3. Finally review the list manually. 

### 2.1 Normalizing description names 

In [6]:
df_descriptions_unique = pd.Series(df['description'].unique())

# Most of the descriptions are in uppercase, but some are in lowercase:
df_descriptions_normalized = df_descriptions_unique.str.lower()

# There are also spacing issues at the beginning and end of the descriptions and between words.
# Note: the character class below keeps only unaccented letters, digits, º, ª and parentheses,
# so accented vowels and ñ are replaced by spaces as well.
df_descriptions_normalized = df_descriptions_normalized.str.strip()
df_descriptions_normalized = df_descriptions_normalized.str.replace(r'[^0-9a-zA-Zº()ª]+', ' ', regex=True)  # replace non-alphanumeric characters with spaces
df_descriptions_normalized = df_descriptions_normalized.str.replace(r' +', ' ', regex=True)  # collapse repeated spaces

In [7]:
pd.DataFrame(dict(desc_original = df_descriptions_unique, desc_normalized = df_descriptions_normalized)).sample(10)

Unnamed: 0,desc_normalized,desc_original
2132,postre cheescake,POSTRE CHEESCAKE
7020,encargo tarta cheesecake del 3 escrito felicid...,Encargo TARTA CHEESECAKE DEL 3 ESCRITO FELICID...
23735,postres de merengue y limon,POSTRES DE MERENGUE Y LIMON
15320,mini torrijas,Mini torrijas
9847,manzana 1º con cartel en chocolate escrito 22280,MANZANA 1º con cartel en chocolate escrito: 22280
36234,tarta selva negra 3º,Tarta selva negra 3º
11885,encargo tarta san marcos del 3 feliz cumplea o...,Encargo TARTA SAN MARCOS DEL 3 FELIZ CUMPLEAÑO...
30963,besamelas 2 de encargo,BESAMELAS 2 DE ENCARGO
9508,pasta empi onados,PASTA EMPIÑONADOS
24991,encargo cajas de 11 huevos leche,Encargo CAJAS DE 11 HUEVOS LECHE
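
The sample above shows a side effect of the character class used for cleaning: `ñ` and accented vowels are not covered by `a-zA-Z`, so `CUMPLEAÑOS` becomes `cumplea os`. One possible variant, sketched below with the standard library only, transliterates those characters to plain ASCII before stripping punctuation. The function name `normalize_description` is hypothetical, and mapping `ñ` to `n` is an assumption that may not suit every analysis.

```python
import re
import unicodedata


def normalize_description(text):
    """Lowercase, drop diacritics, replace punctuation and collapse spaces."""
    text = text.lower().strip()
    # NFD splits 'ñ' into 'n' + a combining tilde; dropping the combining
    # marks leaves the base letter. 'º' and 'ª' have no canonical
    # decomposition, so they survive this step unchanged.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Same idea as the notebook's pattern: anything else becomes a space.
    text = re.sub(r"[^0-9a-zº()ª]+", " ", text)
    return re.sub(r" +", " ", text).strip()


print(normalize_description("PASTA EMPIÑONADOS"))
print(normalize_description("MANZANA 1º con cartel en chocolate escrito: 22280"))
```

With pandas, a function like this could be applied to the unique descriptions with `.apply`, at the cost of being slower than the vectorized `.str` methods.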


Now let's get our hands dirty and apply some maths to calculate string distance and finish cleaning all those messy product descriptions. This is what we are going to do:

1. Create a dataset of pastry products by parsing the bakery catalogues and other pastry websites. (This was done manually, by converting the PDF catalogues to txt with an external web tool. The resulting file is named productos.txt.)

2. Following the indications from: https://medium.com/@hdezfloresmiguelangel/implementando-un-corrector-ortogr%C3%A1fico-en-python-utilizando-la-distancia-de-levenshtein-498ec0dd1105 create a spell-checker based on the productos.txt dataset and the Levenshtein distance.
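
For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. Below is a minimal dynamic-programming sketch; the function name `levenshtein` is ours and it is not used by the notebook. The spell-checker in the next cell takes a different route: it generates candidate words one or two edits away, and it also counts a transposition of adjacent letters as a single edit.

```python
def levenshtein(a, b):
    """Return the Levenshtein distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("trta", "tarta"))
print(levenshtein("kitten", "sitting"))
```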


In [8]:
def words(text): return re.findall(r'\w+', text.lower())

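# Build the word-frequency dictionary from the product catalogue; the file is closed after reading.
with open('../../data/01_additional_data/productos.txt') as catalogue_file:
    WORDS = Counter(words(catalogue_file.read()))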

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

In [9]:
correction("trta")

'tarta'

In [10]:
correction("café")

'cafe'

Fantastic! It seems to work. Let's now apply it to our dataset:

In [11]:
def spell_check(line):
    "Given a sentence, return it spell-checked word by word."
    if isinstance(line, str) and len(line) > 0:
        # Correct each space-separated word and rebuild the sentence
        return " ".join(correction(word) for word in line.split(" "))
    # Non-string values (e.g. NaN) and empty strings are returned unchanged
    return line

Caution! The following line of code may take some time to process:

In [12]:
# Applying spell_check to all the dataset
normalized_names = df_descriptions_normalized.apply(lambda line: spell_check(line))
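
Since the same words recur across thousands of descriptions, one way to cut the running time is to correct each distinct word only once. The sketch below illustrates the idea with `functools.lru_cache`. Here `slow_correction` is a stand-in for the `correction` function above, `spell_check_cached` is a hypothetical name, and the call counter exists only to show the effect.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive function really runs


def slow_correction(word):
    """Stand-in for the notebook's `correction`; fixes one known typo."""
    global calls
    calls += 1
    return {"trta": "tarta"}.get(word, word)


@lru_cache(maxsize=None)
def cached_correction(word):
    # Each distinct word reaches slow_correction at most once.
    return slow_correction(word)


def spell_check_cached(line):
    if not isinstance(line, str) or not line:
        return line
    return " ".join(cached_correction(word) for word in line.split(" "))


lines = ["trta manzana 2º", "trta limon 2º", "trta manzana 3º"]
corrected = [spell_check_cached(line) for line in lines]
print(corrected)
print(calls)  # number of distinct words, not total words
```

The nine words in the three toy lines contain five distinct tokens, so the expensive function runs five times instead of nine; on real data the saving depends on how repetitive the vocabulary is.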

Let's now merge the normalized names back into the original data and check how effective this cleaning was:

In [13]:
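# Note: desc_normalized is taken from df_descriptions_normalized (regex clean-up only);
# the spell-checked normalized_names computed above is not used here.
to_merge = pd.DataFrame(dict(description = df_descriptions_unique, desc_normalized = df_descriptions_normalized))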

In [None]:
df_with_normalized_descriptions = pd.merge(df, to_merge, on = 'description').sort_values(by='order_date')

In [15]:
df_with_normalized_descriptions.sample(5)

Unnamed: 0,product_id,description,order_date,section,store,units_ordered,desc_normalized
24681148,923.0,POSTRE TARTALETA MANZANA Y CARAMELO,30/8/2010 0:00:00,0,GoUP,0,postre tartaleta manzana y caramelo
23401008,3439.0,LENGUAS GATO NEGRAS,9/2/2016 0:00:00,0,GeUP,0,lenguas gato negras
11759646,463.0,MANZANA 3º,17/2/2014 0:00:00,0,AaUP,100,manzana 3º
5019151,409.0,SELVA NEGRA 1º,24/1/2016 0:00:00,0,VeUp,100,selva negra 1º
20145633,4422.0,POSTRE TARTALETA DE CHOCOLATE Y PLATANO,15/12/2014 0:00:00,0,BmUP,400,postre tartaleta de chocolate y platano


In [16]:
unique_descriptions_raw = len(df['description'].unique())
unique_descriptions_normalized = len(df_with_normalized_descriptions['desc_normalized'].unique())
print('The product descriptions were cleaned from {} unique names to {}.'.format(unique_descriptions_raw,unique_descriptions_normalized))

The product descriptions were cleaned from 48517 unique names to 42309.


In [17]:
# Saving the file to the intermediate folder
output_path_df_with_normalized_descriptions = exit_path + 'data_with_normalized_names.csv'
df_with_normalized_descriptions.to_csv(output_path_df_with_normalized_descriptions, index = False, sep = ';' )


### 2.2 Identifying the product descriptions that the client wants us to predict

It is time to create the file that will be manually reviewed.

- First, we will keep only the normalized description and product id columns
- Then, we will remove duplicates
- Then, we will create a target-name column that indicates whether the normalized description matches an entry of the list provided by the client
- Finally, we will take into account the earlier analysis, which concluded that:

|List name|Proposed id match|
|---------|-----------------|
|croissant|100, 101, 102|
|croissant petit|103|
|tarta mousse 3 chocolates|9999|
|tarta de manzana 2º|462|
|palmeras de trufa|182|
|tarta opera|414|
|postre fresas y mascarpone|4511|
|milhojas frambuesa 2º|459|
|torteles|112|
|baguette|115|



In [20]:
df_normalized_id_desc_unique = df_with_normalized_descriptions[["desc_normalized","product_id"]].drop_duplicates()

In [21]:
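def find_match(line, options=list_of_products, threshold=80):
    "Return the best fuzzy match of `line` among `options`, or 'not-found'."
    if not isinstance(line, str):
        return 'not-found'
    # extractOne returns a (match, score) tuple with a score between 0 and 100
    highest = process.extractOne(line, options)
    if highest is not None and highest[1] > threshold:
        return highest[0]
    return 'not-found'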
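
The matching threshold deserves a manual check, because a permissive scorer can give a high score to a description that shares only a few tokens with a list entry (some of the sampled `target_names_B` values further down look unrelated to their descriptions). As a rough, standard-library illustration of threshold-based matching, the sketch below uses `difflib.SequenceMatcher`. This is not fuzzywuzzy's scorer, so its scores (0 to 1) are not comparable with those of `process.extractOne`, and `best_match` is a hypothetical helper.

```python
from difflib import SequenceMatcher

# A shortened, illustrative version of the client's list.
products = ["croissant", "croissant petit", "tarta opera", "baguette"]


def best_match(description, options, threshold=0.8):
    """Return the most similar option, or 'not-found' below the threshold."""
    scored = [
        (SequenceMatcher(None, description, option).ratio(), option)
        for option in options
    ]
    score, option = max(scored)
    return option if score > threshold else "not-found"


print(best_match("croissants", products))
print(best_match("pan molde grande blanco", products))
```

A whole-string ratio like this one penalizes long descriptions with extra words, which is the opposite trade-off to a partial or token-based scorer; which behaviour is preferable depends on how much free text the descriptions carry.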
        

In [23]:
df_normalized_id_desc_unique["target_names_B"] = df_normalized_id_desc_unique["desc_normalized"].apply(lambda line: find_match(line))

In [24]:
# Input from Illan:

In [25]:
dict_of_products_matches = {
    100: 'croissant',
    101: 'croissant',
    102: 'croissant',
    103: 'croissant petit',
    9999: 'tarta mousse 3 chocolates',  # id used almost only for custom orders; creating a new id for this product is suggested
    462: 'tarta de manzana 2º',
    182: 'palmera de chocolate',  # palmeras: 140
    414: 'tarta opera',  # also under 9999, mostly for custom orders; if included, creating a new id for this product is suggested
    4511: 'postre fresas y mascarpone',
    459: 'milhojas frambuesa 2º',
    112: 'tortel',
    115: 'baguette',
}

In [26]:
def target_names_a(product_id, dict_of_products_matches= dict_of_products_matches):
    if not(product_id is None) and not(math.isnan(product_id)) and int(product_id)  in dict_of_products_matches:
        return dict_of_products_matches[int(product_id)]
    else:
        return 'not-found'
    

In [27]:
df_normalized_id_desc_unique['target_names_A']=df_normalized_id_desc_unique["product_id"].apply(lambda line: target_names_a(line))

In [28]:
df_normalized_id_desc_unique.sample(10)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A
30065942,postre milhojas de mango con culis de frutas,450.0,tarta de manzana 2º,not-found
28087015,milhoja frambuesa del cuarto,460.0,not-found,not-found
29859486,hojaldre rectangulares para tartare para el 20...,137.0,not-found,not-found
29556551,pan molde grande blanco,2007.0,not-found,not-found
28073326,turron de marron,2707.0,palmera de chocolatetarta opera,not-found
29680752,postres mus chocolate blanco salsa de guindas,450.0,tarta de manzana 2º,not-found
30484156,tarta chessecake 3º escrito en la tarta felici...,4401.0,not-found,not-found
29984654,tarta de yema del 1º felicidades i aki,9999.0,not-found,tarta mousse 3 chocolates
30511016,selva negra 3º con viruta por encima como al d...,411.0,not-found,not-found
30293347,raciones changurro al horno aa3790,9999.0,not-found,tarta mousse 3 chocolates


In [29]:
output_path_df_normalized_id_desc_unique = exit_path + 'normalized_names_with_target.csv'
df_normalized_id_desc_unique.to_csv(output_path_df_normalized_id_desc_unique, index = False, sep = ';' )
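
To focus the manual review, it may help to separate the rows where the two automatic labels agree from the rows where they conflict. A small sketch on made-up rows follows; the column `needs_review` is hypothetical, and the rows are illustrative rather than a real extract.

```python
import pandas as pd

# Illustrative rows with both automatic labels already computed.
sample = pd.DataFrame({
    "desc_normalized": [
        "croissant",
        "turron de marron",
        "tarta de yema del 1º",
        "pan molde grande blanco",
    ],
    "target_names_B": ["croissant", "tarta opera", "not-found", "not-found"],
    "target_names_A": ["croissant", "not-found", "tarta mousse 3 chocolates", "not-found"],
})

# Rows where the id-based label (A) and the fuzzy label (B) disagree.
sample["needs_review"] = sample["target_names_A"] != sample["target_names_B"]

print(sample[sample["needs_review"]])
```

Rows where both labels are `not-found` drop out of the review, which is usually what is wanted for a list of only ten target products.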

### 2.3 Review Product by Product

In [41]:
# Before cleaning, let's fill the NA descriptions
df_normalized_id_desc_unique['desc_normalized'] = df_normalized_id_desc_unique['desc_normalized'].fillna('no-description')
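
The per-product cells that follow all apply the same pattern: keep the rows whose description contains every required keyword, optionally exclude some, and label the result. A hypothetical helper, `filter_by_keywords`, could express that pattern once. This is a sketch on toy data, not the code that produced the outputs below.

```python
import pandas as pd


def filter_by_keywords(df, include, exclude=None, label=None, column="desc_normalized"):
    """Keep rows matching every regex in `include` and not matching `exclude`."""
    mask = pd.Series(True, index=df.index)
    for pattern in include:
        mask &= df[column].str.contains(pattern, regex=True, na=False)
    if exclude:
        mask &= ~df[column].str.contains(exclude, regex=True, na=False)
    result = df[mask].copy()
    if label is not None:
        result["target_names_C"] = label
    return result


toy = pd.DataFrame({"desc_normalized": [
    "tarta mousse tres chocolates del 3º",
    "mousse 3 chocolates 2º",
    "mousse de limon",
    "croissant petit futbol",
]})

mousse = filter_by_keywords(
    toy, include=["mousse", "tres|3", "chocolate"], label="mousse tres chocolates"
)
print(mousse)
```

Because each keyword is a regular expression, alternatives such as `'tres|3'` work exactly as in the cells below.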

#### 2.3.1 Cleaning: milhojas de frambuesa 2º

In [71]:
milhojas = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('milhojas')]
milhojas_frambuesa = milhojas[milhojas['desc_normalized'].str.contains('frambuesa')].copy()
milhojas_frambuesa['target_names_C'] = 'milhojas frambuesa'
milhojas_frambuesa.sample(10)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
29811122,postres variados mus limon mus tres chocolate ...,450.0,milhojas frambuesa 2º,not-found,milhojas frambuesa
30316219,tarta milhojas frambuesa 4 raciones 1º,9999.0,not-found,tarta mousse 3 chocolates,milhojas frambuesa
30546279,encargo tarta milhojas y frambuesa del 1 feliz...,9999.0,tarta mousse 3 chocolates,tarta mousse 3 chocolates,milhojas frambuesa
28635277,milhojas frambuesa 2º con un cartel feliz cump...,459.0,not-found,milhojas frambuesa 2º,milhojas frambuesa
29804843,tarta milhojas frambuesa 3º con cartel feliz y...,414.0,not-found,tarta opera,milhojas frambuesa
29687239,tarta milhojas frambuesa del 4,9999.0,tarta de manzana 2º,tarta mousse 3 chocolates,milhojas frambuesa
28240661,tarta milhojas de frambuesa del 8,9999.0,tarta de manzana 2º,tarta mousse 3 chocolates,milhojas frambuesa
30230196,milhojas frambuesa 2º con dibujo enviado a f b...,459.0,not-found,milhojas frambuesa 2º,milhojas frambuesa
29666695,milhojas de frambuesa 40r feliz cumplea os joh...,9999.0,tarta de manzana 2º,tarta mousse 3 chocolates,milhojas frambuesa
30538721,encargo milhojas de frambuesa de 16 raciones,9999.0,tarta de manzana 2º,tarta mousse 3 chocolates,milhojas frambuesa


#### 2.3.2 Cleaning: croissant petit

In [102]:
croissant = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('croissant')].copy()
croissant_petite = croissant[croissant['desc_normalized'].str.contains('petit')].copy()
croissant_petite['target_names_C'] = 'croissant'
croissant_petite.sample(10)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
30064662,croissant petit latas a las 10 horas,103.0,croissant,croissant petit,croissant
29721736,croissant petit futbol,103.0,croissant petit,croissant petit,croissant
29831196,croissant petit latas cocer a las 14h,103.0,croissant,croissant petit,croissant
30086894,croissant petit cocidas a las 14 00 horas,103.0,croissant,croissant petit,croissant
30049812,croissant petit latas cocer a las 11 h,103.0,croissant,croissant petit,croissant
26638349,croissant petit futbol,103.0,croissant petit,croissant petit,croissant
28243089,petit croissants cocidos latas mandarlas en el...,9999.0,croissant,tarta mousse 3 chocolates,croissant
29741510,croissant petit cocer a las 2,103.0,croissant,croissant petit,croissant
30092863,croissant petit lata cocer 13 h,103.0,croissant,croissant petit,croissant
29985752,petit croissants alargados,102.0,croissant,croissant,croissant


#### 2.3.3 Cleaning: croissant

In [90]:
croissant = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('croissant')].copy()
croissant_simple = croissant[~croissant['desc_normalized'].str.contains('petit|tira|masa')].copy()
croissant_simple['target_names_C'] = 'croissant'
croissant_simple.sample(10)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
3401665,croissant,100.0,croissant,croissant,croissant
30087983,croissants alrgados grandes piezas,103.0,croissant,croissant petit,croissant
30183273,croissant paris chocolate latas,105.0,croissant,not-found,croissant
29467029,croissants alargados grandes piezas,9999.0,croissant,tarta mousse 3 chocolates,croissant
29709110,croissant alargados grandes,9999.0,croissant,tarta mousse 3 chocolates,croissant
29620180,croissant alargados grandes piezas,103.0,croissant,croissant petit,croissant
28573453,croissant frances cocido en el coche del pan,102.0,croissant,croissant,croissant
28530116,croissant paris alargados,5001.0,croissant,not-found,croissant
30175917,croissant paris alargados futbol latas,5001.0,croissant,not-found,croissant
30103825,croissant alargados grande piezas,101.0,croissant,croissant,croissant


#### 2.3.4 Cleaning: mousse tres chocolates

In [109]:
mousse = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('mousse')].copy()
mousse_tres = mousse[mousse['desc_normalized'].str.contains('tres|3')].copy()
mousse_tres_chocolates = mousse_tres[mousse_tres['desc_normalized'].str.contains('chocolate')].copy()

mousse_tres_chocolates['target_names_C'] = 'mousse tres chocolates'
mousse_tres_chocolates.sample(10)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
27875100,mousse 3 chocolates 2º escribir sobre la tarta...,451.0,not-found,not-found,mousse tres chocolates
30098265,tarta mousse tres chocolates del 3º escrito en...,9999.0,not-found,tarta mousse 3 chocolates,mousse tres chocolates
28585214,mousse 3 chocolates 2ºescrito sobre la tarta f...,451.0,not-found,not-found,mousse tres chocolates
28602579,mousse 3 chocolates 3º felicidades esther,453.0,not-found,not-found,mousse tres chocolates
30532918,mousse 3 chocolates 2º con cartel de felicidad...,451.0,not-found,not-found,mousse tres chocolates
29763269,tarta mousse tres chocolates del 3 escrito fel...,9999.0,tarta mousse 3 chocolates,tarta mousse 3 chocolates,mousse tres chocolates
27901049,mousse de tres chocolates 16 raciones,9999.0,tarta de manzana 2º,tarta mousse 3 chocolates,mousse tres chocolates
29737677,mousse 3 chocolates 2º con el dibujo de mickey...,451.0,not-found,not-found,mousse tres chocolates
29480295,postre mousse tres chocolates con macarron de ...,13999.0,tarta mousse 3 chocolates,not-found,mousse tres chocolates
27905501,postre mousse tres chocolates sin chocolatina ...,45.0,tarta mousse 3 chocolates,not-found,mousse tres chocolates


#### 2.3.5 Cleaning: tarta de manzana 2

In [105]:
manzana = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('manzana')]
manzana_tarta = manzana[manzana['desc_normalized'].str.contains('tarta')].copy()
manzana_tarta_dos = manzana_tarta[manzana_tarta['desc_normalized'].str.contains('dos|2')].copy()
manzana_tarta_dos['target_names_C'] = 'tarta de manzana'
manzana_tarta_dos.sample(10)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
29761257,tarta manzana caramelo 2º,436.0,not-found,not-found,tarta de manzana
27981891,encargo tarta manzana del 2,9999.0,tarta de manzana 2º,tarta mousse 3 chocolates,tarta de manzana
29573966,encargotarta de caramelo y manzana 2,9999.0,tarta de manzana 2º,tarta mousse 3 chocolates,tarta de manzana
30135213,tarta manzana caramelo 16r 2 pisos,414.0,tarta de manzana 2º,tarta opera,tarta de manzana
29462819,tarta cremoso caramelo y manzana tartero 3º y 2º,414.0,not-found,tarta opera,tarta de manzana
30134681,tarta caramelo y manzana 2º inga 31,4490.0,not-found,not-found,tarta de manzana
28596404,encargo tarta de manzana y nuezes del 2,9999.0,tarta mousse 3 chocolates,tarta mousse 3 chocolates,tarta de manzana
29710676,tarta caramelo y manzana 2º felicidades elena,4490.0,not-found,not-found,tarta de manzana
30143538,tarta caramelo y manzana 2º felicidades,4490.0,not-found,not-found,tarta de manzana
29919050,tarta manzana b a 2º (hora 5),6017.0,not-found,not-found,tarta de manzana


#### 2.3.6 Cleaning: palmera de chocolate 

In [128]:

palmera = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('palmera')]
palmera_chocolate = palmera[palmera['desc_normalized'].str.contains('chocolate|trufa')].copy()
palmera_chocolate['target_names_C'] = 'palmera chocolate'
palmera_chocolate.sample(10)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
29756573,palmeras de trufa hay 50 de encargo,182.0,tarta de manzana 2º,palmera de chocolate,palmera chocolate
30378703,palmeras de trufa 2 unidades con furgon del pa...,182.0,tarta de manzana 2º,palmera de chocolate,palmera chocolate
28608653,palmera de trufa,9999.0,palmera de chocolatetarta opera,tarta mousse 3 chocolates,palmera chocolate
28311438,palmeras de trufa encargo web 4 unidades manda...,182.0,tarta de manzana 2º,palmera de chocolate,palmera chocolate
30295123,palmeras de trufa encargo 1º hora,182.0,not-found,palmera de chocolate,palmera chocolate
28636384,palmeras de trufa mandar con el furgon del pan,182.0,tarta de manzana 2º,palmera de chocolate,palmera chocolate
29680306,palmeras de trufa necesito 1 con el furg n del...,182.0,tarta de manzana 2º,palmera de chocolate,palmera chocolate
17188873,palmeras de trufa,182.0,palmera de chocolatetarta opera,palmera de chocolate,palmera chocolate
30546609,encargo palmeras de chocolate,9999.0,tarta de manzana 2º,tarta mousse 3 chocolates,palmera chocolate
28371789,palmeras de trufa ( 1 unid de encargo),182.0,tarta de manzana 2º,palmera de chocolate,palmera chocolate


#### 2.3.7 Cleaning: tarta ópera 

In [129]:
opera = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('opera')]
opera_tarta = opera[opera['desc_normalized'].str.contains('tarta')].copy()
opera_tarta['target_names_C'] = 'tarta opera'

opera_tarta.sample(10)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
30543868,tarta opera de 6 raciones escrito sobre la tar...,9999.0,tarta mousse 3 chocolates,tarta mousse 3 chocolates,tarta opera
4939482,encargotarta opera del 3 escribir sobre la tar...,9999.0,tarta mousse 3 chocolates,tarta mousse 3 chocolates,tarta opera
30017610,tartas especial opera del 2º con cartel que po...,14998.0,not-found,not-found,tarta opera
29721037,tarta opera del tercero,9999.0,not-found,tarta mousse 3 chocolates,tarta opera
30016980,tarta opera del 6º con dibujo del cliente que ...,9999.0,not-found,tarta mousse 3 chocolates,tarta opera
30386109,opera 3º escrito en la tarta por muchos a os mas,427.0,not-found,not-found,tarta opera
30171248,tarta opera del 5,9999.0,palmera de chocolatetarta opera,tarta mousse 3 chocolates,tarta opera
29763131,opera 2º escrito sobre la tarta en mayusculas ...,428.0,not-found,not-found,tarta opera
30151120,tartas opera del 3,9999.0,palmera de chocolatetarta opera,tarta mousse 3 chocolates,tarta opera
28657951,opera 2º escrito en la tarta felices 30,428.0,not-found,not-found,tarta opera


#### 2.3.8 Cleaning: postre de fresas y mascarpone

In [130]:
postre = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('postre')]
postre_fresa = postre[postre['desc_normalized'].str.contains('fresa')].copy()
postre_fresa_mascarpone = postre_fresa[postre_fresa['desc_normalized'].str.contains('mascarpone')].copy()

postre_fresa_mascarpone['target_names_C'] = 'postre de fresas y mascarpone'

postre_fresa_mascarpone.sample(10)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
29774044,postre de fresas y mascarpone,9999.0,postre fresas y mascarpone,tarta mousse 3 chocolates,postre de fresas y mascarpone
28299212,encargo postre fresa y mascarpone encargo web,9999.0,postre fresas y mascarpone,tarta mousse 3 chocolates,postre de fresas y mascarpone
15825390,postre fresas y mascarpone,9999.0,postre fresas y mascarpone,tarta mousse 3 chocolates,postre de fresas y mascarpone
28379759,encargo 1 postre de cada de breton milhojas li...,9999.0,tarta de manzana 2º,tarta mousse 3 chocolates,postre de fresas y mascarpone
30445637,encargo postre eclair fresa y mascarpone encar...,9999.0,postre fresas y mascarpone,tarta mousse 3 chocolates,postre de fresas y mascarpone
28442327,postre fresas y mascarpone para foto,4511.0,postre fresas y mascarpone,postre fresas y mascarpone,postre de fresas y mascarpone
30236527,postre eclair fresas y mascarpone en furgon pan,938.0,postre fresas y mascarpone,not-found,postre de fresas y mascarpone
15812216,postre fresas y mascarpone,4511.0,postre fresas y mascarpone,postre fresas y mascarpone,postre de fresas y mascarpone
30462179,postres tartaleta de fresa y mascarpone,9999.0,tarta de manzana 2º,tarta mousse 3 chocolates,postre de fresas y mascarpone
30462464,postre fresas y mascarpone encargo web,4511.0,postre fresas y mascarpone,postre fresas y mascarpone,postre de fresas y mascarpone


#### 2.3.9 Cleaning: tortel

In [132]:
tortel = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('tortel')].copy()
tortel['target_names_C'] = 'tortel'

# Sample at most 10 rows so the call does not fail when fewer rows match
tortel.sample(min(10, len(tortel)))

### 2.4 Merge the final list with the original dataset, filter by the products on the list, and save the file
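
This step is not implemented in the notebook yet. One way it could look, assuming the reviewed list ends up as a table of `desc_normalized` values with a final label (called `target_names_C` here, following the cells above), is an inner merge followed by a save. The toy frames and the commented-out `to_csv` call are illustrative only.

```python
import pandas as pd

# Toy stand-in for the transactions with normalized descriptions.
transactions = pd.DataFrame({
    "product_id": [100.0, 2007.0, 103.0],
    "desc_normalized": ["croissant", "pan molde grande blanco", "croissant petit futbol"],
    "units_ordered": [300, 50, 100],
})

# Toy stand-in for the manually reviewed list of target descriptions.
reviewed = pd.DataFrame({
    "desc_normalized": ["croissant", "croissant petit futbol"],
    "target_names_C": ["croissant", "croissant petit"],
})

# The inner merge keeps only transactions whose description is on the reviewed list.
filtered = transactions.merge(reviewed, on="desc_normalized", how="inner")
print(filtered)

# In the notebook this could then be saved with the names defined at the top:
# filtered.to_csv(exit_path + filtered_file_name, index=False, sep=sep)
```

Merging on the description alone assumes that each normalized description maps to a single target product; if that does not hold, `product_id` would need to be part of the merge key.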


### 2.5 Manually review the list