# FILTERING THE DATASET:

Along this script we will get from an initial list of products provided by our client, to a final list (as per the names and ids present within the real data), which will be used to filter our initial data in order to get a smaller, more manageable file.

This process will be divided in two main steps:

- Check the names in our list with the descriptions present in our data, analyze them and select a final list

- Use this list to filter our data and store the resulting information in a more small and convenient file

## CREATING THE LIST OF PRODUCTS FOR THE ANALYSIS:

After rearranging the data in a more convenient manner and doing some introductory analysis of the data, we now want to get down to work with our data.

A list has been given to us of the 10 products that our clients found as more relevant to their business.

What we want now is to check whether the names on the list correspond to certain uniques ids, or, as seen in the previous scripts, some conflict of unicity will arise between the id of our products and their descriptions.

So, we are going to check our dataframe and select from it the ids and descriptions of our products that match the indications given in our clients list. With the lists (in reality, two dictionaries) of the ids and descriptions that match every product given to us, we will decide which are the more appropriate.

Perhaps some guidance from our client would be needed at this stage.

### 1. Read dataframe

In [48]:
# Importing packages:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
from collections import Counter
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import math


%matplotlib inline
pd.options.display.max_columns = None

In [49]:
# Defining the search path of the file, the name and the separator:

file_path = "../../data/01_raw/"
file_name = "b2-transactions_sample.csv" #"b2-transactions.csv" 
exit_path = "../../data/02_intermediate/"

filtered_file_name="c1-filtered_transactions.csv"

sep=";"

In [50]:
# We create the list of products provided by the client
list_of_products=['croissant',
                  'croissant petit',
                  'tarta mousse 3 chocolates',
                  'tarta de manzana 2º',
                  'palmera de chocolate'
                  'tarta opera',
                  'postre fresas y mascarpone',
                  'milhojas frambuesa 2º',
                  'tortel',
                  'baguette']

In [51]:
# We import the dataframe:
df=pd.read_csv(file_path+file_name, sep=sep)

In [52]:
df.sample(5)

Unnamed: 0,product_id,description,order_date,section,store,units_ordered
34894,501.0,SABLE NARANJA,11/10/2014 0:00:00,0,GoUP,0
32347,115.0,BAGUETT MALLORCA,19/5/2008 0:00:00,0,MsUP,0
22577,462.0,MANZANA 2º,1/7/2011 0:00:00,0,SeUP,100
39726,446.0,POSTRE TARTALETA LIMON,26/10/2018 0:00:00,0,CzUP,0
1858,8840.0,BOLSA BESAMELAS PEQUEÑAS 8 UD,22/4/2015 0:00:00,0,GeUP,0


### 2. Normalizing and aggregating description names

Unfortunately, there is no convention for the description and one id could 

1. Normalize descriptions as much as possible using:
    - Regex expressions 
    - Basic NLP for spell-checking.
2. Create a normalization file with the following structure:
    - Unique Product_id and normalized description
    - Flag to indicate if the product is part of the given list, or not.  
3. Finally review the list manually. 

### 2.1 Normalizing description names 

In [53]:
# Setting Null descriptions to 'no-description'
df['description'].fillna('no-description', inplace = True)

# Unique product descriptions
df_descriptions_unique = pd.Series(df['description'].unique())

# Most of the descriptions are in uppercase, however others are in lower:
df_descriptions_normalized = df_descriptions_unique.str.lower()

#replace non alfanumeric with space
df_descriptions_normalized=df_descriptions_normalized.str.replace(r'[^0-9a-zA-Zº()ª]+', ' ') 

# We also notice that there are spacing issues at the begining, end of the description and between words:
df_descriptions_normalized=df_descriptions_normalized.str.strip()

# Remove multi-spacing
df_descriptions_normalized=df_descriptions_normalized.str.replace(r' +', ' ') 

In [54]:
pd.DataFrame(dict(desc_original = df_descriptions_unique, desc_normalized = df_descriptions_normalized)).sample(10)

Unnamed: 0,desc_normalized,desc_original
2791,encargo ensaimada de nata 6 raciones,Encargo ENSAIMADA DE NATA 6 RACIONES
2164,viruta de chocolate del 5 20rac,"VIRUTA DE CHOCOLATE DEL -5-, 20RAC"
4918,milhojas frambuesa 2º cartel felicidades shadi,"MILHOJAS FRAMBUESA 2º cartel "" Felicidades S..."
2669,litros de crema verduras,LITROS DE CREMA VERDURAS
6036,suizos piezas cocer a las 10 h,SUIZOS PIEZAS COCER A LAS 10 H
2248,pechuga miso,PECHUGA MISO
5608,pollo en pepitoria,POLLO EN PEPITORIA
1464,herradura,HERRADURA
2976,encargotarta de limon 10 raciones,EncargoTARTA DE LIMON 10 RACIONES
3880,encargo tarta selva negra 2º con cartel felici...,"Encargo TARTA SELVA NEGRA 2º CON CARTEL "" FELI..."


Now lets gets get our hands dirty and apply some maths to calculate string distnace and finish cleaning all those messy product descriptions... This is what we are going to do:

1. Create a dataset with pastry products by parsing the bakery catalogues, and other pastry websites. (this was done manually, by converting the pdf catalogues to txt using an external web. THe resulting file is named productos.txt)

2. Following the indications from: https://medium.com/@hdezfloresmiguelangel/implementando-un-corrector-ortogr%C3%A1fico-en-python-utilizando-la-distancia-de-levenshtein-498ec0dd1105 create an spell-checker based on the products.txt dataset and the Levenshtein distance


In [55]:
def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('../../data/01_additional_data/productos.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

In [56]:
correction("trta")

'tarta'

In [57]:
correction("café")

'cafe'

Fantastic! the it seems to work. Lets now apply it to our dataset:

In [58]:
def spell_check (line):
    "Given a sentence, returns spell-checks word by word"
    if type(line) == str and len(line) > 0:
        new = []
        line = line.split(" ")
        for word in line:
            if type(word) == str:
                word = correction(word)
            new.append(word)
        return " ".join(new)
            
    else:
        return line

Caution! the following line of code may take some time to process:

In [12]:
# Applying spell_check to all the dataset
df_descriptions_normalized = df_descriptions_normalized.apply(lambda line: spell_check(line))

Lets now merge the normalized names back to the original file, and check how effective was this cleaning:

In [59]:
to_merge = pd.DataFrame(dict(description = df_descriptions_unique, desc_normalized = df_descriptions_normalized))

df_with_normalized_descriptions = pd.merge(df, to_merge, how='left', on = 'description').sort_values(by='order_date')
df_with_normalized_descriptions.sample(5)

Unnamed: 0,product_id,description,order_date,section,store,units_ordered,desc_normalized
99009,231.0,PINCHOS SALCHICHA,27/8/2009 0:00:00,0,SeUP,50,pinchos salchicha
49064,350.0,ENSALADA DE ARROZ CON GAMBAS,30/12/2010 0:00:00,0,ViUP,0,ensalada de arroz con gambas
5817,3389.0,ALBONDIGAS EN SALSA,23/12/2014 0:00:00,0,PoUP,200,albondigas en salsa
56949,116.0,MEDIAS NOCHES,28/4/2012 0:00:00,0,SeUP,200,medias noches
98000,6008.0,NAPOLITANAS CREMA /PIEZA/5HORA,17/11/2008 0:00:00,0,ViUP,0,napolitanas crema pieza 5hora


In [63]:
#Control merge size:
if (df.shape[0] == df_with_normalized_descriptions.shape[0] ): 
    print("All OK. 'df' has the same size as 'df_with_normalized_descriptions' ")
else:
    print("ERROR. 'df' has NOT the same size as 'df_with_normalized_descriptions' ")


All OK. 'df' has the same size as 'df_with_normalized_descriptions' 


In [64]:
# Checking effectiveness of the data cleaning:
unique_descriptions_raw = len(df['description'].unique())
unique_descriptions_normalized = len(df_with_normalized_descriptions['desc_normalized'].unique())
print('The product descritions were cleaned from {} unique names to {}.'.format(unique_descriptions_raw,unique_descriptions_normalized))

The product descritions were cleaned from 6078 unique names to 5711.


Not super effective...

In [15]:
# Saving the file to the intermiady folder
output_path_df_with_normalized_descriptions = exit_path + 'data_with_normalized_names.csv'
df_with_normalized_descriptions.to_csv(output_path_df_with_normalized_descriptions, index = False, sep = ';' )

### 2.2 Identifying product descriptions that the client wants us to predict

It is time to create the file that will be manually reviewed.

- First, we compare the normalized descriptions with the list of products provided with the client, and suggest matches using the library fuzzywuzzy
- Second, we will use the results from the other analysis.
- third, we will manually evaluate if the results are good

#### 2.2.1 Using the library fuzzywuzzy to compare the product normalized descriptions with the list of products provided by the client and suggest a match, or alternatively - "match-not-found"

In [73]:
df_normalized_desc_unique = pd.DataFrame(df_with_normalized_descriptions["desc_normalized"].unique(), columns = ['desc_normalized'])

In [87]:
def find_match (line, options = list_of_products):
    "Returns product match if the the calculated difference between strings is greater than 80, 'match-not-found' otherwise"
    if not(line is None) and type(line)== str:
        highest = process.extractOne(line,list_of_products)
        if not(highest is None) and highest[1] >80:
            return highest[0]
        else:
            return 'match-not-found'
    else:
        return 'match-not-found'

# Applying matching function to all product normalized descriptions
df_normalized_desc_unique["target_names_fuzzywuuzy"] = df_normalized_id_desc_unique["desc_normalized"].apply(lambda line: find_match(line))

Lets now evaluate how effectively did we match the normalized descriptions with the list that the client provided us:

In [95]:
# Lets review the effectiveness filtering by 'mousse '. The expected result is that all 'mousse 3 chocolates' match
df_normalized_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('mousse')].head(10)

NameError: name 'df_normalized_desc_unique' is not defined

As we can see... its not actually very good, lets try something different.

#### 2.2.1 Using the results from the other analysis

Lets now use the results from the manual analysis to see how efective the measure was:

In [92]:
# Since this matching is performed at id level, 
# lets create a new dataset with unique product_id, descriptions, and evaluate it:
df_normalized_id_desc_unique = df_with_normalized_descriptions[["product_id",'desc_normalized']].drop_duplicates()

In [93]:
dict_of_products_matches={100: 'croissant', 
                  101: 'croissant',
                  102: 'croissant',
                  103: 'croissant petit',
                  9999: 'tarta mousse 3 chocolates', # almost only for order, creating a new id for this product is suggested
                  462: 'tarta de manzana 2º',
                  182: 'palmera de chocolate', # palmeras: 140
                  414: 'tarta opera', # 9999, for order, mostly. If included, creating a new id for this product is suggested
                  4511:'postre fresas y mascarpone',
                  459: 'milhojas frambuesa 2º',
                  112: 'tortel',
                  115: 'baguette'}

In [94]:
def target_names_a(product_id, dict_of_products_matches= dict_of_products_matches):
    'Returns match if the product_id is found within the given dict or, otherise "match-not-found"'
    if not(product_id is None) and not(math.isnan(product_id)) and int(product_id)  in dict_of_products_matches:
        return dict_of_products_matches[int(product_id)]
    else:
        return 'match-not-found'
    
df_normalized_id_desc_unique['target_names_manual_analysis']=df_normalized_id_desc_unique["product_id"].apply(lambda line: target_names_a(line))

Lets now check how effective this was:

In [99]:
# Lets review the effectiveness filtering by 'mousse '. The expected result is that all 'mousse 3 chocolates' match
df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('mousse')].head(15)

Unnamed: 0,product_id,desc_normalized,target_names_Manual_Analysis
80911,312.0,mousse de perigot,match-not-found
31329,309.0,mousse de oca,match-not-found
58329,452.0,mousse 3 chocolates 1º,match-not-found
65217,728.0,petit four mousse praline,match-not-found
79915,451.0,mousse 3 chocolates 2º,match-not-found
36392,453.0,mousse 3 chocolates 3º,match-not-found
67331,451.0,mousse 3 chocolates 2º cartel 70,match-not-found
21134,311.0,mousse de salmon,match-not-found
79516,450.0,postre mousse tres chocolates,match-not-found
34290,45.0,postre mousse tres chocolates,match-not-found


Again, not very good, since most of the 'mousse 3 chocolates' are unmatched.

It is clear that we need an better way to match the results. Lets try doing keywords filtering product by product.

### 2.3 Review Product by Product

In [123]:
#First, lets create again a dataframe with unique descriptions
unique_normalized_decriptions = pd.DataFrame(df_with_normalized_descriptions['desc_normalized'].unique(), columns = ['desc_normalized'])

#And a empty list to add all the unitary analysis. It will be use to concatenate results.
list_of_dfs = []

#### 2.3.1 Matching: milhojas de frambuesa 2º

In [124]:
milhojas = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('milhojas')]
milhojas_frambuesa = milhojas[milhojas['desc_normalized'].str.contains('frambuesa')].copy()
milhojas_frambuesa['target_names_C'] = 'milhojas frambuesa'
list_of_dfs.append(milhojas_frambuesa)
milhojas_frambuesa.sample(5)

Unnamed: 0,desc_normalized,target_names_C
1185,encargo tarta milhojas de frambuesa del 1,milhojas frambuesa
4444,postre milhojas de frambuesas salsa (caceria a...,milhojas frambuesa
5603,milhojas frambuesa 2º con cartel felicidades e...,milhojas frambuesa
4391,fuente tena postres surtidos de ecler chocolat...,milhojas frambuesa
2709,postre caceria don jose milhojas frambuesa sal...,milhojas frambuesa


#### 2.3.2 Matching: croissant petite

In [125]:
croissant = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('croissant')].copy()
croissant_petite = croissant[croissant['desc_normalized'].str.contains('petit')].copy()
croissant_petite['target_names_prod_by_prod'] = 'croissant'
list_of_dfs.append(croissant_petite)
croissant_petite.head()

Unnamed: 0,desc_normalized,target_names_prod_by_prod
432,croissant petit,croissant
1322,petit croissant latas,croissant
1380,croissant petit cocer a las 2 horas,croissant
1547,croissant petit latas cocer a las 14 h,croissant
1814,croissant cereales petit,croissant


#### 2.3.3 Matching: croissant

In [126]:
croissant = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('croissant')].copy()
croissant_simple = croissant[~croissant['desc_normalized'].str.contains('petit|tira|masa')].copy()
croissant_simple['target_names_prod_by_prod'] = 'croissant'
list_of_dfs.append(croissant_simple)
croissant_simple.head()

Unnamed: 0,desc_normalized,target_names_prod_by_prod
146,croissant chocolate,croissant
212,croissant integral,croissant
355,croissant frances,croissant
382,croissant alargado grande piezas,croissant
429,croissant alargado grande,croissant


#### 2.3.4 Matching: mousse tres chocolates

In [127]:
mousse = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('mousse')].copy()
mousse_tres = mousse[mousse['desc_normalized'].str.contains('tres|3')].copy()
mousse_tres_chocolates = mousse_tres[mousse_tres['desc_normalized'].str.contains('chocolate')].copy()
mousse_tres_chocolates['target_names_prod_by_prod'] = 'mousse tres chocolates'
list_of_dfs.append(mousse_tres_chocolates)
mousse_tres_chocolates.head()

Unnamed: 0,desc_normalized,target_names_prod_by_prod
107,mousse 3 chocolates 1º,mousse tres chocolates
351,mousse 3 chocolates 2º,mousse tres chocolates
368,mousse 3 chocolates 3º,mousse tres chocolates
438,mousse 3 chocolates 2º cartel 70,mousse tres chocolates
533,postre mousse tres chocolates,mousse tres chocolates


#### 2.3.5 Matching: tarta de manzana 2

In [128]:
manzana = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('manzana')]
manzana_tarta = manzana[manzana['desc_normalized'].str.contains('tarta')].copy()
manzana_tarta_dos = manzana_tarta[manzana_tarta['desc_normalized'].str.contains('dos|2')].copy()
manzana_tarta_dos['target_names_prod_by_prod'] = 'tarta de manzana'
list_of_dfs.append(manzana_tarta_dos)
manzana_tarta_dos.head()

Unnamed: 0,desc_normalized,target_names_prod_by_prod
870,tarta cremoso caramelo y manzana 2º,tarta de manzana
1145,tarta caramelo y manzana 2º,tarta de manzana
2083,tarta de manzana 2º,tarta de manzana
2197,tarta manzana b a 2º (hora 5),tarta de manzana
3238,encargotarta del 2 de avellana y manzana,tarta de manzana


#### 2.3.6 Matching: palmera de chocolate 

In [129]:
palmera = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('palmera')]
palmera_chocolate = palmera[palmera['desc_normalized'].str.contains('chocolate|trufa')].copy()
palmera_chocolate['target_names_prod_by_prod'] = 'palmera chocolate'

list_of_dfs.append(palmera_chocolate)
palmera_chocolate.head()

Unnamed: 0,desc_normalized,target_names_prod_by_prod
237,palmeras de trufa,palmera chocolate
2746,palmeras chocolate,palmera chocolate
3180,palmeras de trufa (alberto encargos),palmera chocolate


#### 2.3.7 Matching: tarta ópera 

In [130]:
opera = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('opera')]
opera_tarta = opera[opera['desc_normalized'].str.contains('tarta')].copy()
opera_tarta['target_names_prod_by_prod'] = 'tarta opera'

list_of_dfs.append(opera_tarta)
opera_tarta.head()

Unnamed: 0,desc_normalized,target_names_prod_by_prod
292,tarta opera 2º,tarta opera
689,tarta opera del 3º escrito vuela entrenado nad...,tarta opera
702,tarta opera 3º escrito en la tarta felicidades...,tarta opera
744,opera del 6 escrito en tarta felicidades sr he...,tarta opera
1736,tarta opera del 6º escrito feliz cumplea os,tarta opera


#### 2.3.9 Matching: postre de fresas y mascarpone

In [131]:
postre = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('postre')]
postre_fresa = postre[postre['desc_normalized'].str.contains('fresa')].copy()
postre_fresa_mascarpone = postre_fresa[postre_fresa['desc_normalized'].str.contains('mascarpone')].copy()

postre_fresa_mascarpone['target_names_prod_by_prod'] = 'postre de fresas y mascarpone'
list_of_dfs.append(postre_fresa_mascarpone)
postre_fresa_mascarpone.head()


Unnamed: 0,desc_normalized,target_names_prod_by_prod
374,postre tartaleta fresas y mascarpone,postre de fresas y mascarpone
510,postre fresas y mascarpone,postre de fresas y mascarpone
600,postre eclair fresas y mascarpone,postre de fresas y mascarpone


#### 2.3.9 Matching: tortel

In [132]:
tortel = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('tortel')].copy()
tortel['target_names_prod_by_prod'] = 'tortel'
list_of_dfs.append(tortel)
tortel.head(5)


Unnamed: 0,desc_normalized,target_names_prod_by_prod
254,torteles,tortel
806,mini torteles piezas hora 5,tortel
1668,tortellini bolognesa,tortel
1674,minitortel en crudo,tortel
1849,tortellini carbonara,tortel


#### 2.3.10 Matching: baguette

In [133]:
baguette = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('baguette|baguete|baguet')].copy()
baguette['target_names_prod_by_prod'] = 'baguette'
list_of_dfs.append(baguette)
baguette.head(5)

Unnamed: 0,desc_normalized,target_names_prod_by_prod
147,baguette mallorca,baguette
189,baguet piezas 5 horas,baguette
271,baguet,baguette
516,baguett mallorca,baguette
776,baguet (5 ma ana),baguette


### Lets now concatenate results, merge them back to the  full list of normalized descriptions and evaluate its effectiveness

In [136]:
# Lets concatenate the results:
list_of_products_df = pd.concat(list_of_dfs, sort=False)

In [140]:
df_desc_normalezed_vs_prod_by_prod = pd.merge(df_with_normalized_descriptions, list_of_products_df[['desc_normalized','target_names_prod_by_prod']],how='left',on = 'desc_normalized')


In [141]:
# Merging test:

In [145]:
#Control merge size:
if (df_with_normalized_descriptions.shape[0] == aux.shape[0] ): 
    print("All OK. 'df_with_normalized_descriptions' has the same size as 'df_with_normalized_descriptions' ")
else:
    print("ERROR. 'df' has NOT the same size as 'df_with_normalized_descriptions' ")


All OK. 'df_with_normalized_descriptions' has the same size as 'df_with_normalized_descriptions' 


Lets now check how effective it was:

In [150]:
# Lets look at the first 10 description names
df_desc_normalezed_vs_prod_by_prod[['desc_normalized','target_names_prod_by_prod']].head(10)

Unnamed: 0,desc_normalized,target_names_prod_by_prod
0,plum cake damas,
1,pum cake fresa y pistacho,
2,bombon marc cava,
3,zumos frutas acidos,
4,caramelos rellenos c 200 gr,
5,tejas,
6,trufas paris,
7,galleguitas,
8,caja bombon fruta 1 4 kg,
9,naranjines,


In [151]:
# Lets now review the effectiveness filtering by 'mousse '. The expected result is that all 'mousse 3 chocolates' match
df_desc_normalezed_vs_prod_by_prod.loc[df_desc_normalezed_vs_prod_by_prod['desc_normalized'].str.contains('mousse'),['desc_normalized','target_names_prod_by_prod'] ].head(15)

Unnamed: 0,desc_normalized,target_names_prod_by_prod
67,mousse de perigot,
68,mousse de oca,
122,mousse 3 chocolates 1º,mousse tres chocolates
393,petit four mousse praline,
482,mousse 3 chocolates 2º,mousse tres chocolates
509,mousse 3 chocolates 2º,mousse tres chocolates
518,mousse 3 chocolates 3º,mousse tres chocolates
608,mousse 3 chocolates 2º,mousse tres chocolates
629,mousse 3 chocolates 1º,mousse tres chocolates
642,mousse 3 chocolates 1º,mousse tres chocolates


This looks much better! lets now check that the data integrity has not been compromised

### 2.5 Test that data has not been corrputed

To test the integrity of the data, the original dataset should be the same as the last dataset without that we added, in other words, without the columns with the normalized descriptuons, and the target names:

In [164]:
# First, lets check the size of both dataframes:
print("Original dataset shape: {}".format(df.shape))
print("Resulting dataset shape: {}".format(df_desc_normalezed_vs_prod_by_prod.shape))

Original dataset shape: (100000, 6)
Resulting dataset shape: (100000, 8)


The shape looks good, we were expecting the resulting dataset to have to columns more. Lets now evauate if they are actually the same dataset if we remove the added columns:

In [180]:
# Selecting original columnsd from the resulting df
df_result = df_desc_normalezed_vs_prod_by_prod.loc[:, df.columns]

In [177]:
(df_desc_normalezed_vs_prod_by_prod.loc[:, df.columns]).sort_values(by = ['order_date','store','description']).head()

Unnamed: 0,product_id,description,order_date,section,store,units_ordered
7,177.0,GALLEGUITAS,1/1/2008 0:00:00,0,AeUP,0
26,4491.0,POSTRE CREMOSO DE CARAMELO Y MANZANA,1/1/2008 0:00:00,0,AeUP,0
30,1458.0,B/100 GR. BOLAS (D),1/1/2008 0:00:00,0,AnUP,0
29,3413.0,CAJAS ROCA FILET,1/1/2008 0:00:00,0,AnUP,0
16,6001.0,GALLEGUITAS/HOTEL/PIEZA/5 HORAS,1/1/2008 0:00:00,0,AnUP,0


In [176]:
df.sort_values(by = ['order_date','store','description']).head()

Unnamed: 0,product_id,description,order_date,section,store,units_ordered
7928,177.0,GALLEGUITAS,1/1/2008 0:00:00,0,AeUP,0
63907,4491.0,POSTRE CREMOSO DE CARAMELO Y MANZANA,1/1/2008 0:00:00,0,AeUP,0
33312,1458.0,B/100 GR. BOLAS (D),1/1/2008 0:00:00,0,AnUP,0
37598,3413.0,CAJAS ROCA FILET,1/1/2008 0:00:00,0,AnUP,0
84662,6001.0,GALLEGUITAS/HOTEL/PIEZA/5 HORAS,1/1/2008 0:00:00,0,AnUP,0


In [166]:
# Selecting the original columns from the dataset where the names have been normalized:
df.equals(df_desc_normalezed_vs_prod_by_prod.loc[:, df.columns])

False

### 2.6 Filter dataset to only include the products from the list provided by the client, and save to csv

In [43]:
 Filter by products on the list, and save the file.

SyntaxError: invalid syntax (<ipython-input-43-01e1e2be50d5>, line 1)