# FILTERING THE DATASET:

Along this script we will get from an initial list of products provided by our client, to a final list (as per the names and ids present within the real data), which will be used to filter our initial data in order to get a smaller, more manageable file.

This process will be divided in two main steps:

- Check the names in our list with the descriptions present in our data, analyze them and select a final list

- Use this list to filter our data and store the resulting information in a more small and convenient file

## CREATING THE LIST OF PRODUCTS FOR THE ANALYSIS:

After rearranging the data in a more convenient manner and doing some introductory analysis of the data, we now want to get down to work with our data.

A list has been given to us of the 10 products that our clients found as more relevant to their business.

What we want now is to check whether the names on the list correspond to certain uniques ids, or, as seen in the previous scripts, some conflict of unicity will arise between the id of our products and their descriptions.

So, we are going to check our dataframe and select from it the ids and descriptions of our products that match the indications given in our clients list. With the lists (in reality, two dictionaries) of the ids and descriptions that match every product given to us, we will decide which are the more appropriate.

Perhaps some guidance from our client would be needed at this stage.

### 1. Read dataframe

In [211]:
# Importing packages:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
from collections import Counter
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import math


%matplotlib inline
pd.options.display.max_columns = None

In [223]:
# Defining the search path of the file, the name and the separator:

file_path = "../../data/01_raw/"
file_name = "b2-transactions_sample.csv"#"b2-transactions.csv"
exit_path = "../../data/02_intermediate/"

filtered_file_name="c1-filtered_transactions.csv"

sep=";"

In [213]:
# We create the list of products provided by the client
list_of_products=['croissant',
                  'croissant petit',
                  'tarta mousse 3 chocolates',
                  'tarta de manzana 2º',
                  'palmera de chocolate'
                  'tarta opera',
                  'postre fresas y mascarpone',
                  'milhojas frambuesa 2º',
                  'tortel',
                  'baguette']

In [214]:
# We import the dataframe:
df=pd.read_csv(file_path+file_name, sep=sep)

In [215]:
df.sample(5)

Unnamed: 0,product_id,description,order_date,section,store,units_ordered
13151621,295.0,TARTA QUICHE ESPINACAS,22/8/2017 0:00:00,0,LiUP,0
3954872,1563.0,TUBO ACET.200 gr ALMOHADILLA MIGÑON,24/5/2010 0:00:00,0,CzUP,0
21703657,429.0,YEMA TOSTADA 1º,29/9/2008 0:00:00,0,MoUP,100
19132777,100.0,CROISANTS,30/6/2016 0:00:00,0,GrUP,0
26393662,709.0,CAPUCHINAS,6/3/2010 0:00:00,0,ViUP,0


### 2. Normalizing and aggregating description names

Unfortunately, there is no convention for the description and one id could 

1. Normalize descriptions as much as possible using:
    - Regex expressions 
    - Basic NLP for spell-checking.
2. Create a normalization file with the following structure:
    - Unique Product_id and normalized description
    - Flag to indicate if the product is part of the given list, or not.  
3. Finally review the list manually. 

### 2.1 Normalizing description names 

In [216]:
df_descriptions_unique = pd.Series(df['description'].unique())
# Most of the descriptions are in uppercase, however others are in lower:

df_descriptions_normalized = df_descriptions_unique.str.lower()
# We also notice that there are spacing issues at the begining, end of the description and between words:

df_descriptions_normalized=df_descriptions_normalized.str.strip()
df_descriptions_normalized=df_descriptions_normalized.str.replace(r'[^0-9a-zA-Zº()ª]+', ' ') #replace non alfanumeric with spaces
df_descriptions_normalized=df_descriptions_normalized.str.replace(r' +', ' ') # Remove multi-spacing

In [217]:
pd.DataFrame(dict(desc_original = df_descriptions_unique, desc_normalized = df_descriptions_normalized)).sample(10)

Unnamed: 0,desc_normalized,desc_original
43736,selva negra 4º con cartel de felicidades alex,SELVA NEGRA 4º CON CARTEL DE FELICIDADES ALEX
31704,mousse 3 chocolates 2ºfelicidades manuel ignacio,MOUSSE 3 CHOCOLATES 2ºFELICIDADES MANUEL IGNACIO
9122,encargo4latastorteles coccidas,Encargo4LATASTORTELES COCCIDAS
22893,mini pizzas verduras,MINI PIZZAS VERDURAS
40858,mus tres chocolates del 2,MUS TRES CHOCOLATES DEL 2
36396,patatas panaderas 1kilo,PATATAS PANADERAS 1KILO
4003,cestitos almendras garrapi,CESTITOS ALMENDRAS GARRAPIÑ
17601,cajas bombon licor uva y guinda,CAJAS BOMBON LICOR UVA Y GUINDA
21124,bandeja de mi cuit 10r,BANDEJA DE MI CUIT 10R
23179,helados de vainilla 28 r,HELADOS DE VAINILLA 28 R/


Now lets gets get our hands dirty and apply some maths to calculate string distnace and finish cleaning all those messy product descriptions... This is what we are going to do:

1. Create a dataset with pastry products by parsing the bakery catalogues, and other pastry websites. (this was done manually, by converting the pdf catalogues to txt using an external web. THe resulting file is named productos.txt)

2. Following the indications from: https://medium.com/@hdezfloresmiguelangel/implementando-un-corrector-ortogr%C3%A1fico-en-python-utilizando-la-distancia-de-levenshtein-498ec0dd1105 create an spell-checker based on the products.txt dataset and the Levenshtein distance


In [218]:
def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('../../data/01_additional_data/productos.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

In [219]:
correction("trta")

'tarta'

In [220]:
correction("café")

'cafe'

Fantastic! the it seems to work. Lets now apply it to our dataset:

In [221]:
def spell_check (line):
    "Given a sentence, returns spell-checks word by word"
    if type(line) == str and len(line) > 0:
        new = []
        line = line.split(" ")
        for word in line:
            if type(word) == str:
                word = correction(word)
            new.append(word)
        return " ".join(new)
            
    else:
        return line

Caution! the following line of code may take some time to process:

In [222]:
# Applying spell_check to all the dataset
df_descriptions_normalized = df_descriptions_normalized.apply(lambda line: spell_check(line))

KeyboardInterrupt: 

Lets now merge the normalized names back to the original file, and check how effective was this cleaning:

In [None]:
to_merge = pd.DataFrame(dict(description = df_descriptions_unique, desc_normalized = df_descriptions_normalized))
df_with_normalized_descriptions = pd.merge(df, to_merge, how='left', on = 'description').sort_values(by='order_date')
df_with_normalized_descriptions.sample(5)

In [None]:
# Checking effectiveness of the data cleaning:
unique_descriptions_raw = len(df['description'].unique())
unique_descriptions_normalized = len(df_with_normalized_descriptions['desc_normalized'].unique())
print('The product descritions were cleaned from {} unique names to {}.'.format(unique_descriptions_raw,unique_descriptions_normalized))

In [None]:
# Saving the file to the intermiady folder
output_path_df_with_normalized_descriptions = exit_path + 'data_with_normalized_names.csv'
df_with_normalized_descriptions.to_csv(output_path_df_with_normalized_descriptions, index = False, sep = ';' )

### 2.2 Identifying product descritions that the client wants us to predict

It is time to create the file that will be manually reviewed.

- First, we will take only the normalized description, and product id columns
- Then, we will remove duplicates
- Then, we will create the column "target-list-product" to identify if the normalized description matches with the list provided with the client or not
- Finally, we will take into account the other analysis where it was concluded tha:

|List name |Proposed id match |
|---------|-----------------|
|croissant|100,101,102|
|croissant petit|103|
|tarta mousse 3 chocolates|9999| 
|tarta de manzana 2º|462| 
|palmeras de trufa|182| 
|tarta opera|414| 
|postre fresas y mascarpone|4511| 
|milhojas frambuesa 2º|112| 
|torteles|414|
|baguette|115| 



In [20]:
df_normalized_id_desc_unique = df_with_normalized_descriptions[["desc_normalized","product_id"]].drop_duplicates()

In [21]:
def find_match (line, options = list_of_products):
    if not(line is None) and type(line)== str:
        highest = process.extractOne(line,list_of_products)
        if not(highest is None) and highest[1] >80:
            return highest[0]
        else:
            return 'not-found'
    else:
        return 'not-found'
        

In [23]:
df_normalized_id_desc_unique["target_names_B"] = df_normalized_id_desc_unique["desc_normalized"].apply(lambda line: find_match(line))

In [24]:
# Input from Illan:

In [25]:
dict_of_products_matches={100: 'croissant', 
                  101: 'croissant',
                  102: 'croissant',
                  103: 'croissant petit',
                  9999: 'tarta mousse 3 chocolates', # almost only for order, creating a new id for this product is suggested
                  462: 'tarta de manzana 2º',
                  182: 'palmera de chocolate', # palmeras: 140
                  414: 'tarta opera', # 9999, for order, mostly. If included, creating a new id for this product is suggested
                  4511:'postre fresas y mascarpone',
                  459: 'milhojas frambuesa 2º',
                  112: 'tortel',
                  115: 'baguette'}

In [26]:
def target_names_a(product_id, dict_of_products_matches= dict_of_products_matches):
    if not(product_id is None) and not(math.isnan(product_id)) and int(product_id)  in dict_of_products_matches:
        return dict_of_products_matches[int(product_id)]
    else:
        return 'not-found'
    

In [27]:
df_normalized_id_desc_unique['target_names_A']=df_normalized_id_desc_unique["product_id"].apply(lambda line: target_names_a(line))

In [28]:
df_normalized_id_desc_unique.sample(10)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A
30065942,postre milhojas de mango con culis de frutas,450.0,tarta de manzana 2º,not-found
28087015,milhoja frambuesa del cuarto,460.0,not-found,not-found
29859486,hojaldre rectangulares para tartare para el 20...,137.0,not-found,not-found
29556551,pan molde grande blanco,2007.0,not-found,not-found
28073326,turron de marron,2707.0,palmera de chocolatetarta opera,not-found
29680752,postres mus chocolate blanco salsa de guindas,450.0,tarta de manzana 2º,not-found
30484156,tarta chessecake 3º escrito en la tarta felici...,4401.0,not-found,not-found
29984654,tarta de yema del 1º felicidades i aki,9999.0,not-found,tarta mousse 3 chocolates
30511016,selva negra 3º con viruta por encima como al d...,411.0,not-found,not-found
30293347,raciones changurro al horno aa3790,9999.0,not-found,tarta mousse 3 chocolates


In [29]:
output_path_df_normalized_id_desc_unique = exit_path + 'normalized_names_with_target.csv'
df_normalized_id_desc_unique.to_csv(output_path_df_normalized_id_desc_unique, index = False, sep = ';' )

### 2.3 Review Product by Product

In [142]:
#Before cleaning, lets fill the na descriptions
df_normalized_id_desc_unique['desc_normalized'] = df_normalized_id_desc_unique['desc_normalized'].fillna('no-description')

#And create an empty list to add the dataframes aggregation
list_of_dfs = []

#### 2.3.1 Cleaning: milhojas de frambuesa 2º

In [165]:
milhojas = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('milhojas')]
milhojas_frambuesa = milhojas[milhojas['desc_normalized'].str.contains('frambuesa')].copy()
milhojas_frambuesa['target_names_C'] = 'milhojas frambuesa'
list_of_dfs.append(milhojas_frambuesa)
milhojas_frambuesa.sample(5)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
30353063,tarta milhojas de frambuesa de 10rac,9999.0,tarta de manzana 2º,tarta mousse 3 chocolates,milhojas frambuesa
30454416,tarta milhojas de frambuesa del 5 20raciones,9999.0,tarta mousse 3 chocolates,tarta mousse 3 chocolates,milhojas frambuesa
30194564,encargotarta milhojas de frambuesas del 2 escr...,9999.0,tarta de manzana 2º,tarta mousse 3 chocolates,milhojas frambuesa
30152182,milhojas frambuesa 2º con cartel feliz cumplea...,459.0,not-found,milhojas frambuesa 2º,milhojas frambuesa
28639052,tarta milhojas de frambuesa de 20raciones,9999.0,tarta mousse 3 chocolates,tarta mousse 3 chocolates,milhojas frambuesa


#### 2.3.2 Cleaning: croissant petite

In [164]:
croissant = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('croissant')].copy()
croissant_petite = croissant[croissant['desc_normalized'].str.contains('petit')].copy()
croissant_petite['target_names_C'] = 'croissant'
list_of_dfs.append(croissant_petite)
croissant_petite.sample(5)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
30094170,latas petit croissant cocer 14 h,103.0,croissant,croissant petit,croissant
30064662,croissant petit latas a las 10 horas,103.0,croissant,croissant petit,croissant
30025167,latas croissant petit cocidas a las 11 00 horas,103.0,croissant,croissant petit,croissant
30084196,croissant petit latas cocer alas 14 h,103.0,croissant,croissant petit,croissant
29831196,croissant petit latas cocer a las 14h,103.0,croissant,croissant petit,croissant


#### 2.3.3 Cleaning: croissant

In [163]:
croissant = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('croissant')].copy()
croissant_simple = croissant[~croissant['desc_normalized'].str.contains('petit|tira|masa')].copy()
croissant_simple['target_names_C'] = 'croissant'
list_of_dfs.append(croissant_simple)
croissant_simple.sample(5)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
30081959,croissants alargados grandes piezas,103.0,croissant,croissant petit,croissant
30175917,croissant paris alargados futbol latas,5001.0,croissant,not-found,croissant
28570693,croissant frances de estas latas necesitaria 3...,102.0,croissant petit,croissant,croissant
30147739,croissant parisien cocido lata,9999.0,croissant,tarta mousse 3 chocolates,croissant
29735103,croissant grandes alargados piezas,101.0,croissant,croissant,croissant


#### 2.3.4 Cleaning: mousse tres chocolates

In [162]:
mousse = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('mousse')].copy()
mousse_tres = mousse[mousse['desc_normalized'].str.contains('tres|3')].copy()
mousse_tres_chocolates = mousse_tres[mousse_tres['desc_normalized'].str.contains('chocolate')].copy()
mousse_tres_chocolates['target_names_C'] = 'mousse tres chocolates'
list_of_dfs.append(mousse_tres_chocolates)
mousse_tres_chocolates.sample(5)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
28484046,mousse 3 chocolates 1ºplasmar encima de la tar...,452.0,not-found,not-found,mousse tres chocolates
28644475,encargo tarta mousse 3 chocolates 2º con carte...,9999.0,not-found,tarta mousse 3 chocolates,mousse tres chocolates
11250314,mousse 3 chocolates 3º,453.0,not-found,not-found,mousse tres chocolates
28032215,mousse 3 chocolates 1º con cartel felicidades ...,452.0,not-found,not-found,mousse tres chocolates
29765155,mousse 3 chocolates 2º felicidades m jose usama,451.0,not-found,not-found,mousse tres chocolates


#### 2.3.5 Cleaning: tarta de manzana 2

In [160]:
manzana = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('manzana')]
manzana_tarta = manzana[manzana['desc_normalized'].str.contains('tarta')].copy()
manzana_tarta_dos = manzana_tarta[manzana_tarta['desc_normalized'].str.contains('dos|2')].copy()
manzana_tarta_dos['target_names_C'] = 'tarta de manzana'
list_of_dfs.append(manzana_tarta_dos)
manzana_tarta_dos.sample(5)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
29121177,tarta cremoso caramelo y manzana 2º,4490.0,not-found,not-found,tarta de manzana
29710176,tarta caramelo y manzana 2ºfelicidades alicia ...,4490.0,not-found,not-found,tarta de manzana
28422479,encargo tarta de manzana y almendras del 2,9999.0,tarta mousse 3 chocolates,tarta mousse 3 chocolates,tarta de manzana
27030135,tarta caramelo y manzana 2º,4490.0,not-found,not-found,tarta de manzana
29671509,tarta manzana y cremoso del 2º,9999.0,not-found,tarta mousse 3 chocolates,tarta de manzana


#### 2.3.6 Cleaning: palmera de chocolate 

In [159]:
palmera = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('palmera')]
palmera_chocolate = palmera[palmera['desc_normalized'].str.contains('chocolate|trufa')].copy()
palmera_chocolate['target_names_C'] = 'palmera chocolate'

list_of_dfs.append(palmera_chocolate)
palmera_chocolate.sample(5)


Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
30517694,palmeras de chocolate piezas encargo mandar si...,9999.0,tarta de manzana 2º,tarta mousse 3 chocolates,palmera chocolate
29680306,palmeras de trufa necesito 1 con el furg n del...,182.0,tarta de manzana 2º,palmera de chocolate,palmera chocolate
30378703,palmeras de trufa 2 unidades con furgon del pa...,182.0,tarta de manzana 2º,palmera de chocolate,palmera chocolate
28371789,palmeras de trufa ( 1 unid de encargo),182.0,tarta de manzana 2º,palmera de chocolate,palmera chocolate
28608653,palmera de trufa,9999.0,palmera de chocolatetarta opera,tarta mousse 3 chocolates,palmera chocolate


#### 2.3.7 Cleaning: tarta ópera 

In [158]:
opera = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('opera')]
opera_tarta = opera[opera['desc_normalized'].str.contains('tarta')].copy()
opera_tarta['target_names_C'] = 'tarta opera'

list_of_dfs.append(opera_tarta)
opera_tarta.sample(5)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
30179925,tarta opera de 10 rac,9999.0,not-found,tarta mousse 3 chocolates,tarta opera
26445101,tarta opera 2º escrito en un cartel felicidade...,9999.0,not-found,tarta mousse 3 chocolates,tarta opera
26607577,tarta opera 3º,403.0,not-found,not-found,tarta opera
30011640,tarta opera del 2º escrito felicidades,9999.0,not-found,tarta mousse 3 chocolates,tarta opera
30152152,tarta opera del 8,9999.0,palmera de chocolatetarta opera,tarta mousse 3 chocolates,tarta opera


#### 2.3.9 Cleaning: postre de fresas y mascarpone

In [157]:
postre = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('postre')]
postre_fresa = postre[postre['desc_normalized'].str.contains('fresa')].copy()
postre_fresa_mascarpone = postre_fresa[postre_fresa['desc_normalized'].str.contains('mascarpone')].copy()

postre_fresa_mascarpone['target_names_C'] = 'postre de fresas y mascarpone'
list_of_dfs.append(postre_fresa_mascarpone)
postre_fresa_mascarpone.sample(5)


Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
15779652,postre fresas y mascarpone,450.0,postre fresas y mascarpone,not-found,postre de fresas y mascarpone
28629269,postre tartaleta fresas y mascarpone,4511.0,postre fresas y mascarpone,postre fresas y mascarpone,postre de fresas y mascarpone
28632163,postre mascarpone fresa,450.0,postre fresas y mascarpone,not-found,postre de fresas y mascarpone
29774240,postre de fresas mascarpone,9999.0,postre fresas y mascarpone,tarta mousse 3 chocolates,postre de fresas y mascarpone
28390847,postre fresa y mascarpone,9999.0,postre fresas y mascarpone,tarta mousse 3 chocolates,postre de fresas y mascarpone


#### 2.3.9 Cleaning: tortel

In [156]:
tortel = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('tortel')].copy()
tortel['target_names_C'] = 'tortel'
list_of_dfs.append(tortel)
tortel.sample(5)


Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
30113222,mini torteles piezas crudas,198.0,tortel,not-found,tortel
29689152,piezas mini torteles crudos,2999.0,tortel,not-found,tortel
28293125,tortelini carbonara,9999.0,tortel,tarta mousse 3 chocolates,tortel
29625010,mini torteles,103.0,tortel,croissant petit,tortel
30119513,minitorteles grudos,9999.0,tortel,tarta mousse 3 chocolates,tortel


#### 2.3.10 Cleaning: tortel

In [166]:
baguette = df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('baguette|baguete|baguet')].copy()
baguette['target_names_C'] = 'baguette'
list_of_dfs.append(baguette)
baguette.sample(5)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
18692442,baguette mallorca,115.0,baguette,baguette,baguette
29992163,piezas de baguet multicereales,9999.0,tarta de manzana 2º,tarta mousse 3 chocolates,baguette
2133812,baguett mallorca,115.0,not-found,baguette,baguette
28244770,baguette mallorca cocidas,115.0,baguette,baguette,baguette
30038356,baguettes multicereales,9999.0,baguette,tarta mousse 3 chocolates,baguette


In [167]:
list_of_products_df = pd.concat(list_of_dfs)
list_of_products_df.sample(10)

Unnamed: 0,desc_normalized,product_id,target_names_B,target_names_A,target_names_C
29648537,milhojas frambuesa 2ºcon cartel felicidades ju...,459.0,not-found,milhojas frambuesa 2º,milhojas frambuesa
30457000,opera 2º escrito en la tarta felices 40 cristo,428.0,not-found,not-found,tarta opera
28533110,milhojas frambuesa 2º con cartel feliz cumplea...,459.0,not-found,milhojas frambuesa 2º,milhojas frambuesa
28546795,postres milhojas de frambuesas,9999.0,tarta de manzana 2º,tarta mousse 3 chocolates,milhojas frambuesa
30152878,mousse 3 chocolates 3º escrito en la tarta fel...,453.0,not-found,not-found,mousse tres chocolates
29750031,mousse 3 chocolates 2º escrito en tarta feliz ...,451.0,not-found,not-found,mousse tres chocolates
28448684,mousse 3 chocolates 2º con cartel felicidades ...,451.0,not-found,not-found,mousse tres chocolates
27501036,mousse 3 chocolates 3º cartel felicidades marina,453.0,not-found,not-found,mousse tres chocolates
30429506,mousse 3 chocolates 3º con cartel felicidades ...,453.0,not-found,not-found,mousse tres chocolates
28456207,mousse 3 chocolates 1º escrito sobre la tarta ...,452.0,not-found,not-found,mousse tres chocolates


### 2.4 Merge the final list with the original dataset.


In order to merge, we need to merge back description_normalized to description, and then description to description

In [194]:
df_with_normalized_descriptions_and_list_names = pd.merge(df_with_normalized_descriptions, list_of_products_df[['desc_normalized','target_names_C']],how='left',on = 'desc_normalized')


In [195]:
df_with_normalized_descriptions_and_list_names.sample(10)

Unnamed: 0,product_id,description,order_date,section,store,units_ordered,desc_normalized,target_names_C
20923716,420.0,TIRA DE HOJALDRE Y FRUTAS,23/8/2014 0:00:00,0,PoUP,100,tira de hojaldre y frutas,
28031668,3352.0,TORTELLINI BOLOGNESA,29/3/2008 0:00:00,0,LiUP,0,tortellini bolognesa,tortel
13571246,1278.0,CAJA BOMBONES 1 KG.,19/12/2010 0:00:00,0,BmUP,0,caja bombones 1 kg,
20234177,483.0,TARTA CAOBA 2º,23/2/2010 0:00:00,0,GrUP,0,tarta caoba 2º,
11951643,4511.0,POSTRE FRESAS Y MASCARPONE,18/1/2017 0:00:00,0,ZiUO,800,postre fresas y mascarpone,postre de fresas y mascarpone
27656901,452.0,MOUSSE 3 CHOCOLATES 1º,29/1/2019 0:00:00,0,ViUP,0,mousse 3 chocolates 1º,mousse tres chocolates
18412742,9999.0,TARTA MOUSSE TRES CHOCOLATES 4--16 RAC.,22/1/2010 0:00:00,0,AeUP,0,tarta mousse tres chocolates 4 16 rac,mousse tres chocolates
8989405,436.0,CHOCOLATE 2º,15/7/2015 0:00:00,0,GeUP,0,chocolate 2º,
2137659,366.0,GAZPACHO,10/5/2012 0:00:00,0,ZiUO,0,gazpacho,
36738950,245.0,Sandwiches Surtidos,7/4/2017 0:00:00,0,RzUP,0,sandwiches surtidos,


In [196]:
df.shape

(30550252, 6)

In [197]:
df_with_normalized_descriptions.shape

(30550252, 7)

In [198]:
df_with_normalized_descriptions_and_list_names.shape

(39893084, 8)

### 2.3 Manually review the list

In [None]:
 Filter by products on the list, and save the file.