# FILTERING THE DATASET:

In this script we go from the initial list of products provided by our client to a final list (based on the names and ids actually present in the data), which we then use to filter the initial data into a smaller, more manageable file.

This process is divided into two main steps:

- Check the names in our list against the descriptions present in our data, analyze them and select a final list

- Use this list to filter our data and store the resulting information in a smaller, more convenient file

## CREATING THE LIST OF PRODUCTS FOR THE ANALYSIS:

After rearranging the data in a more convenient manner and doing some introductory analysis, we can now get down to work with it.

Our client has given us a list of the 10 products they consider most relevant to their business.

We now want to check whether the names on the list correspond to unique ids or whether, as seen in the previous scripts, conflicts of uniqueness arise between the product ids and their descriptions.

To do so, we check our dataframe and select the ids and descriptions that match the entries in our client's list. With the lists (in reality, two dictionaries) of ids and descriptions matching each product, we will decide which are the most appropriate.

Some guidance from our client may be needed at this stage.

### 1. Read dataframe

In [1]:
# Importing packages:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
from collections import Counter
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import math


%matplotlib inline
pd.options.display.max_columns = None



In [2]:
# Defining the search path of the file, the name and the separator:

file_path = "../../data/01_raw/"
file_name = 'b2-transactions_sample.csv' #"b2-transactions.csv" 
exit_path = "../../data/02_intermediate/"

filtered_file_name="c1-filtered_transactions.csv"

sep=";"

In [3]:
# We create the list of products provided by the client
list_of_products=['croissant',
                  'croissant petit',
                  'tarta mousse 3 chocolates',
                  'tarta de manzana 2º',
                  'palmera de chocolate',
                  'tarta opera',
                  'postre fresas y mascarpone',
                  'milhojas frambuesa 2º',
                  'tortel',
                  'baguette']
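
Python silently concatenates adjacent string literals, so a missing comma in a list like this one produces one merged entry instead of raising an error. The following is a minimal sanity check, assuming the client list is expected to hold exactly 10 distinct names; `EXPECTED_PRODUCTS` and `check_product_list` are illustrative names, not part of the original notebook.

```python
EXPECTED_PRODUCTS = 10  # assumption: the client provided exactly 10 products


def check_product_list(products, expected=EXPECTED_PRODUCTS):
    """Return a list of problems found in the product list (empty if none)."""
    problems = []
    if len(products) != expected:
        problems.append(f"expected {expected} products, found {len(products)}")
    if len(set(products)) != len(products):
        problems.append("duplicated product names")
    return problems


# Adjacent literals without a comma are merged into a single string:
broken = ['palmera de chocolate'
          'tarta opera',
          'tortel']
print(broken)
print(check_product_list(broken, expected=3))
```

Running `check_product_list(list_of_products)` right after defining the list would flag this kind of typo before any matching is attempted.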

In [4]:
# We import the dataframe:
df=pd.read_csv(file_path+file_name, sep=sep)

In [5]:
df.sample(5)

Unnamed: 0,product_id,description,order_date,section,store,units_ordered
175,619.0,BOCADITOS DE NATA,13/5/2016 0:00:00,0,MsUP,0
817,1022.0,TOMATE CONFITADO,7/3/2008 0:00:00,0,CzUP,0
874,463.0,MANZANA 3º,4/3/2017 0:00:00,0,BmUP,300
493,452.0,MOUSSE 3 CHOCOLATES 1º,5/4/2015 0:00:00,0,BmUP,200
298,196.0,MINI BESAMELAS,17/3/2010 0:00:00,0,PoUP,100


### 2. Normalizing and aggregating description names

Unfortunately, there is no naming convention for the descriptions, and one id could appear with several different descriptions. To deal with this, we will:

1. Normalize descriptions as much as possible using:
    - Regex expressions 
    - Basic NLP for spell-checking.
2. Create a normalization file with the following structure:
    - Unique Product_id and normalized description
    - Flag to indicate if the product is part of the given list, or not.  
3. Finally review the list manually. 

### 2.1 Normalizing description names 

In [6]:
# Setting null descriptions to 'no-description'
# (assigning back avoids relying on an in-place fill of a column selection)
df['description'] = df['description'].fillna('no-description')

# Unique product descriptions
df_descriptions_unique = pd.Series(df['description'].unique())

# Most of the descriptions are in uppercase, however others are in lower:
df_descriptions_normalized = df_descriptions_unique.str.lower()

# Replace non-alphanumeric characters with a space
# (regex=True is explicit because newer pandas versions treat the pattern as a literal string by default)
df_descriptions_normalized = df_descriptions_normalized.str.replace(r'[^0-9a-zA-Zº()ª:-]+', ' ', regex=True)

# We also notice spacing issues at the beginning and end of the descriptions and between words:
df_descriptions_normalized = df_descriptions_normalized.str.strip()

# Remove multi-spacing, multi '-' and multi ':'
df_descriptions_normalized = df_descriptions_normalized.str.replace(r' +', ' ', regex=True)
df_descriptions_normalized = df_descriptions_normalized.str.replace(r'-+', ' ', regex=True)
df_descriptions_normalized = df_descriptions_normalized.str.replace(r':+', ' ', regex=True)

In [7]:
pd.DataFrame(dict(desc_original = df_descriptions_unique, desc_normalized = df_descriptions_normalized)).sample(10)

Unnamed: 0,desc_normalized,desc_original
414,galaxito peq limon,GALAXITO PEQ. LIMON
367,tableta choco 1 2 kg vainilla,TABLETA CHOCO. 1/2 KG.VAINILLA
149,bombon frambuesa licor,BOMBON FRAMBUESA LICOR
410,tira de albaricoque,TIRA DE ALBARICOQUE
106,trancha de plum cake limon,TRANCHA DE PLUM CAKE LIMON
363,palmeritas bolsa 90 grs,PALMERITAS BOLSA 90 GRS.
382,selva negra 2ºplasmar chocolatina en tama o de...,SELVA NEGRA 2ºPLASMAR CHOCOLATINA EN TAMAÑO DE...
265,macarron de chocolate,MACARRON DE CHOCOLATE
92,panetone frutas 1 kg,PANETONE FRUTAS 1 KG.
133,b 100 gr celofan lenguas blancas,B/ 100 GR. CELOFAN LENGUAS BLANCAS
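
The normalization steps above can be sketched as a single function on plain strings. This is only an illustration using the standard `re` module rather than pandas. The `extra_chars` parameter is a hypothetical addition showing how characters such as `ñ` could be kept (the original pattern replaces them with a space, as in `tama o` above). The sketch also collapses repeated spaces last, a small reordering of the original steps.

```python
import re


def normalize_description(text, extra_chars=""):
    """Lowercase a description and replace punctuation with single spaces.

    `extra_chars` is a hypothetical option (not in the original notebook)
    listing additional characters to keep, e.g. "ñ".
    """
    text = text.lower()
    # Keep digits, ASCII letters, º, ª, parentheses, ':' and '-';
    # every other run of characters becomes a single space.
    keep = "0-9a-zº()ª:" + re.escape(extra_chars) + "-"
    text = re.sub("[^" + keep + "]+", " ", text)
    # Treat runs of '-' and ':' as separators as well.
    text = re.sub(r"[-:]+", " ", text)
    # Collapse repeated spaces last, so separators cannot reintroduce them.
    return re.sub(r" +", " ", text).strip()


print(normalize_description("TABLETA CHOCO. 1/2 KG.VAINILLA"))
print(normalize_description("TORRIJAS PEQUEÑAS"))
print(normalize_description("TORRIJAS PEQUEÑAS", extra_chars="ñ"))
```

Keeping `ñ` only helps if the dictionary used by the spell-checker also contains words spelled with it.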


Now let's get our hands dirty and apply some maths to calculate string distance and finish cleaning all those messy product descriptions. This is what we are going to do:

1. Create a dataset of pastry products by parsing the bakery catalogues and other pastry websites. (This was done manually, by converting the PDF catalogues to txt with an external web tool. The resulting file is named productos.txt.)

2. Following the indications from https://medium.com/@hdezfloresmiguelangel/implementando-un-corrector-ortogr%C3%A1fico-en-python-utilizando-la-distancia-de-levenshtein-498ec0dd1105, create a spell-checker based on the productos.txt dataset and the Levenshtein distance.
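
For reference, the Levenshtein distance between two words is the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other. The function below is a minimal dynamic-programming sketch of that definition; it is not the code used by the spell-checker in this notebook, which generates candidate words by applying edits instead of measuring distances directly.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                       # delete char_a
                           cur[j - 1] + 1,                    # insert char_b
                           prev[j - 1] + (char_a != char_b)))  # substitute or keep
        prev = cur
    return prev[-1]


print(levenshtein("trta", "tarta"))       # 1
print(levenshtein("mariana", "manzana"))  # 2
```

Words within distance 1 or 2 of a dictionary word are the ones the spell-checker can reach, which is why a misspelling like `trta` is fixed but also why an unknown but valid word can be pulled towards a known one.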


In [8]:
def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('../../data/01_additional_data/productos.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

In [9]:
correction("trta")

'tarta'

In [10]:
correction("café")

'cafe'

Fantastic, it seems to work. Let's now apply it to our dataset:

In [11]:
def spell_check (line):
    "Given a sentence, returns spell-checks word by word"
    if type(line) == str and len(line) > 0:
        new = []
        line = line.split(" ")
        for word in line:
            if type(word) == str:
                word = correction(word)
            new.append(word)
        return " ".join(new)
            
    else:
        return line

In [12]:
# CAUTION! The following cell may take a long time to process (5 hours):

# spell-check word by word the dataset:
df_descriptions_normalized = df_descriptions_normalized.apply(lambda line: spell_check(line))
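
Since the same words appear in many descriptions, one way to shorten this step might be to correct each distinct word only once and reuse the result. The sketch below illustrates the idea with `functools.lru_cache`. The `correction` function here is a toy stand-in for the one defined earlier, and `TOY_FIXES`, `cached_correction` and `spell_check_cached` are hypothetical names introduced for the example.

```python
from collections import Counter
from functools import lru_cache

# Toy stand-in for the notebook's `correction`; the real one searches edit candidates.
TOY_FIXES = {"trta": "tarta", "chocolat": "chocolate"}  # hypothetical data
calls = Counter()


def correction(word):
    calls[word] += 1  # count how often the expensive function actually runs
    return TOY_FIXES.get(word, word)


@lru_cache(maxsize=None)
def cached_correction(word):
    """Memoized wrapper: each distinct word is corrected only once."""
    return correction(word)


def spell_check_cached(line):
    """Spell-check a sentence word by word, reusing earlier corrections."""
    if not isinstance(line, str) or not line:
        return line
    return " ".join(cached_correction(word) for word in line.split(" "))


lines = ["trta de chocolat", "trta de manzana", "mousse de chocolat"]
print([spell_check_cached(line) for line in lines])
print(calls["trta"], calls["de"])  # each distinct word reached `correction` once
```

How much time this saves depends on how often words repeat across the unique descriptions, which has not been measured here.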

Let's now merge the normalized names back into the original data and check how effective the cleaning was:

In [13]:
to_merge = pd.DataFrame(dict(description = df_descriptions_unique, desc_normalized = df_descriptions_normalized))

df_with_normalized_descriptions = pd.merge(df, to_merge, how='left', on = 'description').sort_values(by='order_date')
df_with_normalized_descriptions.sample(5)

Unnamed: 0,product_id,description,order_date,section,store,units_ordered,desc_normalized
207,887.0,TARRINA MACEDONIA FRUTAS,4/11/2014 0:00:00,0,EnUP,0,tarrina macedonia frutas
26,209.0,EMP. HOJALDRE JAMON YORK,30/6/2008 0:00:00,0,BmUP,0,el hojaldre jamon york
619,720.0,ECLAIR CHOCOLATE,5/12/2013 0:00:00,0,RzUP,0,eclair chocolate
685,160.0,TORRIJAS PEQUEÑAS,12/4/2017 0:00:00,0,CzUP,7500,torrijas parque a
107,506.0,PALADARES CAFE,8/6/2008 0:00:00,0,AnUP,200,paladares cafe


In [14]:
#Control merge size:
if (df.shape[0] == df_with_normalized_descriptions.shape[0] ): 
    test0 = "OK - 'df' has the same size as 'df_with_normalized_descriptions' "
else:
    test0 = "ERROR - 'df' has NOT the same size as 'df_with_normalized_descriptions' "
print(test0)

OK - 'df' has the same size as 'df_with_normalized_descriptions' 


In [15]:
# Checking effectiveness of the data cleaning:
unique_descriptions_raw = len(df['description'].unique())
unique_descriptions_normalized = len(df_with_normalized_descriptions['desc_normalized'].unique())
print('The product descriptions were cleaned from {} unique names to {}.'.format(unique_descriptions_raw,unique_descriptions_normalized))

The product descriptions were cleaned from 578 unique names to 569.


Not very effective: only 9 of the 578 unique names were merged.

In [16]:
# Saving the file to the intermediate folder
output_path_df_with_normalized_descriptions = exit_path + 'data_with_normalized_names.csv'
df_with_normalized_descriptions.to_csv(output_path_df_with_normalized_descriptions, index = False, sep = ';' )

### 2.2 Identifying product descriptions that the client wants us to predict

It is time to create the file that will be manually reviewed.

- First, we compare the normalized descriptions with the list of products provided by the client and suggest matches using the library fuzzywuzzy.
- Second, we use the results from the other analysis.
- Third, we manually evaluate whether the results are good.

#### 2.2.1 Using the library fuzzywuzzy to compare the product normalized descriptions with the list of products provided by the client and suggest a match, or alternatively - "match-not-found"

In [17]:
df_normalized_desc_unique = pd.DataFrame(df_with_normalized_descriptions["desc_normalized"].unique(), columns = ['desc_normalized'])

In [18]:
def find_match(line, options=list_of_products):
    "Returns the best-matching product if its fuzzywuzzy similarity score is greater than 80, 'match-not-found' otherwise"
    if isinstance(line, str):
        # Compare against the `options` argument rather than the global list
        highest = process.extractOne(line, options)
        if highest is not None and highest[1] > 80:
            return highest[0]
    return 'match-not-found'

# Applying matching function to all product normalized descriptions
df_normalized_desc_unique["target_names_fuzzywuuzy"] = df_normalized_desc_unique["desc_normalized"].apply(lambda line: find_match(line))

Let's now evaluate how effectively we matched the normalized descriptions with the list that the client provided:

In [19]:
# Lets review the effectiveness filtering by 'mousse '. The expected result is that all 'mousse 3 chocolates' match
df_normalized_desc_unique[df_normalized_desc_unique['desc_normalized'].str.contains('mousse')].head(10)

Unnamed: 0,desc_normalized,target_names_fuzzywuuzy
22,postres mousse tres chocolates,match-not-found
50,mousse 3 chocolates 2,tarta mousse 3 chocolates
60,postres mousse chocolate en vasito,match-not-found
115,mousse 3 chocolates 3,tarta mousse 3 chocolates
168,mousse de salmon,tarta mousse 3 chocolates
174,pasteles de mousse de pistacho,tarta de manzana 2º
218,mousse chocolate blanco,match-not-found
270,mousse de hora opcion,match-not-found
291,tarta mousse tres chocolates del 2,tarta mousse 3 chocolates
326,postres mousse frutas bosque,match-not-found


As we can see, it's not actually very good. Let's try something different.

#### 2.2.2 Using the results from the other analysis

Let's now use the results from the manual analysis to see how effective that approach was:

In [20]:
# Since this matching is performed at id level, 
# lets create a new dataset with unique product_id, descriptions, and evaluate it:
df_normalized_id_desc_unique = df_with_normalized_descriptions[["product_id",'desc_normalized']].drop_duplicates()

In [21]:
dict_of_products_matches={100: 'croissant', 
                  101: 'croissant',
                  102: 'croissant',
                  103: 'croissant petit',
                  9999: 'tarta mousse 3 chocolates', # almost only for order, creating a new id for this product is suggested
                  462: 'tarta de manzana 2º',
                  182: 'palmera de chocolate', # palmeras: 140
                  414: 'tarta opera', # 9999, for order, mostly. If included, creating a new id for this product is suggested
                  4511:'postre fresas y mascarpone',
                  459: 'milhojas frambuesa 2º',
                  112: 'tortel',
                  115: 'baguette'}

In [22]:
def target_names_a(product_id, dict_of_products_matches= dict_of_products_matches):
    'Returns the match if the product_id is found within the given dict, otherwise "match-not-found"'
    if not(product_id is None) and not(math.isnan(product_id)) and int(product_id)  in dict_of_products_matches:
        return dict_of_products_matches[int(product_id)]
    else:
        return 'match-not-found'
    
df_normalized_id_desc_unique['target_names_manual_analysis']=df_normalized_id_desc_unique["product_id"].apply(lambda line: target_names_a(line))

Let's now check how effective this was:

In [23]:
# Lets review the effectiveness filtering by 'mousse '. The expected result is that all 'mousse 3 chocolates' match
df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('mousse')].head(15)

Unnamed: 0,product_id,desc_normalized,target_names_manual_analysis
56,450.0,postres mousse tres chocolates,match-not-found
773,451.0,mousse 3 chocolates 2,match-not-found
323,877.0,postres mousse chocolate en vasito,match-not-found
171,453.0,mousse 3 chocolates 3,match-not-found
463,311.0,mousse de salmon,match-not-found
416,754.0,pasteles de mousse de pistacho,match-not-found
142,45.0,postres mousse tres chocolates,match-not-found
689,618.0,mousse chocolate blanco,match-not-found
356,310.0,mousse de hora opcion,match-not-found
747,9999.0,tarta mousse tres chocolates del 2,tarta mousse 3 chocolates


Again, not very good, since most of the 'mousse 3 chocolates' descriptions are unmatched.

It is clear that we need a better way to match the results. Let's try keyword filtering, product by product.

### 2.3 Review Product by Product

In [24]:
# First, let's create again a dataframe with unique descriptions
unique_normalized_decriptions = pd.DataFrame(df_with_normalized_descriptions['desc_normalized'].unique(), columns = ['desc_normalized'])

# And an empty list to collect the per-product results. It will be used to concatenate them.
list_of_dfs = []

#### 2.3.1 Matching: milhojas de frambuesa 2º

In [25]:
milhojas = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('milhojas')]
milhojas_frambuesa = milhojas[milhojas['desc_normalized'].str.contains('frambuesa')].copy()
milhojas_frambuesa['target_names_prod_by_prod'] = 'milhojas frambuesa'
list_of_dfs.append(milhojas_frambuesa)
milhojas_frambuesa.sample(5)

Unnamed: 0,desc_normalized,target_names_prod_by_prod
192,milhojas frambuesa 3,milhojas frambuesa
83,milhojas de frambuesa 3 a 12rac,milhojas frambuesa
177,postres milhojas frambuesa,milhojas frambuesa
312,milhojas y frambuesa 1,milhojas frambuesa
121,milhojas frambuesa 2,milhojas frambuesa


#### 2.3.2 Matching: croissant petite

In [26]:
croissant = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('croissant')].copy()
croissant_petite = croissant[croissant['desc_normalized'].str.contains('petit')].copy()
croissant_petite['target_names_prod_by_prod'] = 'croissant petit'
list_of_dfs.append(croissant_petite)
croissant_petite.head()

Unnamed: 0,desc_normalized,target_names_prod_by_prod
58,croissant petit,croissant petit
207,petit a croissant pastas,croissant petit


#### 2.3.3 Matching: croissant

In [27]:
croissant = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('croissant')].copy()
croissant_simple = croissant[~croissant['desc_normalized'].str.contains('petit|tira|masa')].copy()
croissant_simple['target_names_prod_by_prod'] = 'croissant'
list_of_dfs.append(croissant_simple)
croissant_simple.head()

Unnamed: 0,desc_normalized,target_names_prod_by_prod
100,croissant,croissant
166,croissant alargado grande piezas,croissant
213,croissant sobrasada px cocido,croissant
228,croissant integral,croissant
332,croissant racion,croissant


#### 2.3.4 Matching: mousse tres chocolates

In [28]:
mousse = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('mousse')].copy()
mousse_tres = mousse[mousse['desc_normalized'].str.contains('tres|3')].copy()
mousse_tres_chocolates = mousse_tres[mousse_tres['desc_normalized'].str.contains('chocolate')].copy()
mousse_tres_chocolates['target_names_prod_by_prod'] = 'mousse tres chocolates'
list_of_dfs.append(mousse_tres_chocolates)
mousse_tres_chocolates.head()

Unnamed: 0,desc_normalized,target_names_prod_by_prod
22,postres mousse tres chocolates,mousse tres chocolates
50,mousse 3 chocolates 2,mousse tres chocolates
60,postres mousse chocolate en vasito,mousse tres chocolates
115,mousse 3 chocolates 3,mousse tres chocolates
291,tarta mousse tres chocolates del 2,mousse tres chocolates
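
Note that 'postres mousse chocolate en vasito' is matched here only because 'postres' contains the substring 'tres'. One possible refinement, shown as a sketch with the standard `re` module, is to require 'tres' or '3' to appear as a whole word by using word boundaries (`\b`). The pattern name `TRES` and the sample list are illustrative.

```python
import re

# Whole-word match for "tres" or "3"; \b marks a word boundary.
TRES = re.compile(r"\b(?:tres|3)\b")

samples = [
    "postres mousse tres chocolates",
    "mousse 3 chocolates 2",
    "postres mousse chocolate en vasito",
]

for description in samples:
    print(description, "->", bool(TRES.search(description)))
```

Because `str.contains` in pandas interprets its pattern as a regular expression by default, the same pattern string could in principle replace `'tres|3'` in the filter above; the matches would then need to be reviewed again.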


#### 2.3.5 Matching: tarta de manzana 2

In [29]:
manzana = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('manzana')]
manzana_tarta = manzana[manzana['desc_normalized'].str.contains('tarta')].copy()
manzana_tarta_dos = manzana_tarta[manzana_tarta['desc_normalized'].str.contains('dos|2')].copy()
manzana_tarta_dos['target_names_prod_by_prod'] = 'tarta de manzana'
list_of_dfs.append(manzana_tarta_dos)
manzana_tarta_dos.head()

Unnamed: 0,desc_normalized,target_names_prod_by_prod
515,tarta caramelo y manzana 2,tarta de manzana


#### 2.3.6 Matching: palmera de chocolate 

In [30]:
palmera = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('palmera')]
palmera_chocolate = palmera[palmera['desc_normalized'].str.contains('chocolate|trufa')].copy()
palmera_chocolate['target_names_prod_by_prod'] = 'palmera chocolate'

list_of_dfs.append(palmera_chocolate)
palmera_chocolate.head()

Unnamed: 0,desc_normalized,target_names_prod_by_prod
133,palmera de trufa,palmera chocolate


#### 2.3.7 Matching: tarta ópera 

In [31]:
opera = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('opera')]
opera_tarta = opera[opera['desc_normalized'].str.contains('tarta')].copy()
opera_tarta['target_names_prod_by_prod'] = 'tarta opera'

list_of_dfs.append(opera_tarta)
opera_tarta.head()

Unnamed: 0,desc_normalized,target_names_prod_by_prod
49,tarta chocolate opera 1,tarta opera


#### 2.3.8 Matching: postre de fresas y mascarpone

In [32]:
postre = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('postre')]
postre_fresa = postre[postre['desc_normalized'].str.contains('fresa')].copy()
postre_fresa_mascarpone = postre_fresa[postre_fresa['desc_normalized'].str.contains('mascarpone')].copy()

postre_fresa_mascarpone['target_names_prod_by_prod'] = 'postre de fresas y mascarpone'
list_of_dfs.append(postre_fresa_mascarpone)
postre_fresa_mascarpone.head()


Unnamed: 0,desc_normalized,target_names_prod_by_prod
410,postres eclair fresas y mascarpone,postre de fresas y mascarpone


#### 2.3.9 Matching: tortel

In [33]:
tortel = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('tortel')].copy()
tortel['target_names_prod_by_prod'] = 'tortel'
list_of_dfs.append(tortel)
tortel.head(5)


Unnamed: 0,desc_normalized,target_names_prod_by_prod
39,mini torteles cruda,tortel
208,torteles,tortel
495,mini torteles piezas hora 5,tortel


#### 2.3.10 Matching: baguette

In [34]:
baguette = unique_normalized_decriptions[unique_normalized_decriptions['desc_normalized'].str.contains('baguette|baguete|baguet')].copy()
baguette['target_names_prod_by_prod'] = 'baguette'
list_of_dfs.append(baguette)
baguette.head(5)

Unnamed: 0,desc_normalized,target_names_prod_by_prod
51,baguette mallorca,baguette
265,baguet piezas 5 hora,baguette
279,baguett mallorca,baguette
513,baguet,baguette
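
The per-product filters above all follow the same pattern: keywords that must appear and keywords that must not. As a sketch only, the same idea could be expressed as a rule table applied by one function. `RULES` and `match_product` are hypothetical names, the table covers just a few of the products, and the target labels are illustrative.

```python
# Each target lists keywords that must all be present ("all")
# and keywords that must be absent ("none"). Rules are tried in order.
RULES = {
    "milhojas frambuesa": {"all": ["milhojas", "frambuesa"], "none": []},
    "croissant petit": {"all": ["croissant", "petit"], "none": []},
    "croissant": {"all": ["croissant"], "none": ["petit", "tira", "masa"]},
    "tortel": {"all": ["tortel"], "none": []},
}


def match_product(description, rules=RULES):
    """Return the first target whose rule matches `description`, else None."""
    for target, rule in rules.items():
        has_all = all(keyword in description for keyword in rule["all"])
        has_none = not any(keyword in description for keyword in rule["none"])
        if has_all and has_none:
            return target
    return None


for description in ["croissant integral", "petit a croissant pastas",
                    "mini torteles cruda", "mousse de salmon"]:
    print(description, "->", match_product(description))
```

A side effect of this design is that every description receives at most one label, so the resulting lookup cannot contain the same description twice.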


### 2.4 Concatenating the results, merging them back into the full list of normalized descriptions and evaluating effectiveness

In [35]:
# Lets concatenate the results:
list_of_products_df = pd.concat(list_of_dfs, sort=False)

In [36]:
list_of_products_df[list_of_products_df['desc_normalized'].duplicated()]

Unnamed: 0,desc_normalized,target_names_prod_by_prod


In [37]:
df_desc_normalezed_vs_prod_by_prod = pd.merge(df_with_normalized_descriptions, list_of_products_df[['desc_normalized','target_names_prod_by_prod']],how='left',on = 'desc_normalized')

In [38]:
# Merging test:

In [39]:
#Control merge size:
if (df_with_normalized_descriptions.shape[0] == df_desc_normalezed_vs_prod_by_prod.shape[0] ): 
    test1 = "OK - 'df_with_normalized_descriptions' has the same size as 'df_with_normalized_descriptions' "
else:
    test1 = "ERROR - 'df_with_normalized_descriptions' has NOT the same size as 'df_desc_normalezed_vs_prod_by_prod' "

print(test1)

OK - 'df_with_normalized_descriptions' has the same size as 'df_with_normalized_descriptions' 


NOTE:

During the first executions of the code with the full transactions file, this check was failing: 'df' had fewer rows than 'df_with_normalized_descriptions'. The reason was not easy to identify, but after some digging we found that some normalized descriptions named two different products, which was not the case in the raw descriptions (before the spell-checker). Such a description can match more than one keyword filter, which would duplicate its rows in a left merge.

For example:
- Normalized prod description: 'tarta mousse 3 chocolates de 20 raciones con escrito sobre la tarta manzana y mini felicidades'
- Raw description: 'TARTA MOUSSE 3 CHOCOLATES DE 20 RACIONES CON ESCRITO SOBRE LA TARTA:  MARIANA Y DANI FELICIDADES'

Basically, the spell-corrector was solving some problems (normalizing 'trata' and 'taaarta' to 'tarta') but adding a new one: it maps words it does not know, which may be perfectly correct, to words it does know, such as 'MARIANA' to 'manzana'. Of course this is a weakness; however, from the manual inspections performed, it does not seem to happen often.

We solved it by adding to the bakery products dataset:
- A list of the most common male and female Spanish names, to avoid confusing names with products

Sources of the datasets:
- Spanish names: https://www.ine.es/dyngs/INEbase/es/operacion.htm?c=Estadistica_C&cid=1254736177009&menu=resultados&secc=1254736195454&idp=1254734710990


We also added some names that we found to the Excel file. With more time, the right thing to do would be to polish the dataset by adding not just the Mallorca catalogue and the names but also a book in Spanish. It might also be interesting to apply NLP to identify names in the product descriptions and add them to the products dataset.
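
A row-count mismatch like this one can also be caught at merge time. As a minimal sketch on made-up data, the `validate` argument of `pd.merge` raises a `MergeError` when the lookup table contains the same key more than once. The frames `transactions` and `lookup` below are hypothetical and only mimic the structure used in this notebook.

```python
import pandas as pd

transactions = pd.DataFrame({
    "desc_normalized": ["tarta a", "tarta b", "tarta a"],
    "units_ordered": [1, 2, 3],
})
# Hypothetical lookup in which "tarta a" was matched by two different filters.
lookup = pd.DataFrame({
    "desc_normalized": ["tarta a", "tarta a", "tarta b"],
    "target": ["mousse tres chocolates", "tarta de manzana", "tortel"],
})

try:
    # many_to_one: every key must be unique on the right-hand side.
    pd.merge(transactions, lookup, how="left", on="desc_normalized",
             validate="many_to_one")
    merge_error = None
except pd.errors.MergeError as exc:
    merge_error = str(exc)
print(merge_error)

# Once duplicated keys are resolved (here simply keeping the first match),
# the merge preserves the number of rows.
deduplicated = lookup.drop_duplicates(subset="desc_normalized", keep="first")
merged = pd.merge(transactions, deduplicated, how="left", on="desc_normalized",
                  validate="many_to_one")
print(len(merged) == len(transactions))
```

Keeping the first match is only a placeholder; which label a duplicated description should receive is a decision that needs manual review.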

Let's now check how effective it was:

In [40]:
# Let's look at the first 10 description names
df_desc_normalezed_vs_prod_by_prod[['desc_normalized','target_names_prod_by_prod']].head(10)

Unnamed: 0,desc_normalized,target_names_prod_by_prod
0,empanadas mariana de carne,
1,inglesitos,
2,empanadas hojaldre y bonito 6 a,
3,paladares cafe,
4,eclair blanco,
5,barrita de frambuesa,
6,barqueta de tostas artesanas,
7,tarta quiche espinacas,
8,fondue chocolate blanco,
9,moda pan espinacas g,


In [41]:
# Let's now review the effectiveness by filtering on 'mousse'. The expected result is that all 'mousse 3 chocolates' match
df_desc_normalezed_vs_prod_by_prod.loc[df_desc_normalezed_vs_prod_by_prod['desc_normalized'].str.contains('mousse'),['desc_normalized','target_names_prod_by_prod'] ].head(15)

Unnamed: 0,desc_normalized,target_names_prod_by_prod
22,postres mousse tres chocolates,mousse tres chocolates
56,mousse 3 chocolates 2,mousse tres chocolates
69,postres mousse chocolate en vasito,mousse tres chocolates
117,mousse 3 chocolates 2,mousse tres chocolates
129,mousse 3 chocolates 3,mousse tres chocolates
171,postres mousse tres chocolates,mousse tres chocolates
198,mousse de salmon,
205,pasteles de mousse de pistacho,
234,mousse 3 chocolates 2,mousse tres chocolates
238,postres mousse tres chocolates,mousse tres chocolates


This looks much better! Let's now check that the data integrity has not been compromised.

### 2.5 Test that data has not been corrupted

To test the integrity of the data, the original dataset should be the same as the final dataset without the columns that we added, in other words, without the normalized descriptions and the target names:

In [42]:
# First, lets check the size of both dataframes:
print("Original dataset shape: {}".format(df.shape))
print("Resulting dataset shape: {}".format(df_desc_normalezed_vs_prod_by_prod.shape))

Original dataset shape: (1000, 6)
Resulting dataset shape: (1000, 8)


The shape looks good: we were expecting the resulting dataset to have two more columns. Let's now evaluate whether they are actually the same dataset once we remove the added columns:

In [43]:
# Selecting the original columns from the resulting df
df_result = df_desc_normalezed_vs_prod_by_prod.loc[:, df.columns]

In [44]:
# Now, let's compare it with the original dataset, sorting both in the same way:
df_result_sorted = df_result.sort_values(by = ['order_date','store','description','product_id', 'units_ordered']).reset_index(drop=True)
df_original_sorted = df.sort_values(by = ['order_date','store','description',  'product_id', 'units_ordered']).reset_index(drop=True)

In [45]:
df_result_sorted.head()

Unnamed: 0,product_id,description,order_date,section,store,units_ordered
0,291.0,EMPANADA MEDIANA DE CARNE,1/1/2008 0:00:00,0,GoUP,0
1,175.0,INGLESITOS,1/1/2016 0:00:00,0,JPUP,100
2,202.0,Empanada Hojaldre y Bonito 6 rac.,1/10/2017 0:00:00,0,AaUP,0
3,506.0,PALADARES CAFE,1/11/2009 0:00:00,0,AaUP,0
4,721.0,ECLAIR BLANCO,1/11/2017 0:00:00,0,BmUP,0


In [46]:
df_original_sorted.head()

Unnamed: 0,product_id,description,order_date,section,store,units_ordered
0,291.0,EMPANADA MEDIANA DE CARNE,1/1/2008 0:00:00,0,GoUP,0
1,175.0,INGLESITOS,1/1/2016 0:00:00,0,JPUP,100
2,202.0,Empanada Hojaldre y Bonito 6 rac.,1/10/2017 0:00:00,0,AaUP,0
3,506.0,PALADARES CAFE,1/11/2009 0:00:00,0,AaUP,0
4,721.0,ECLAIR BLANCO,1/11/2017 0:00:00,0,BmUP,0


In [47]:
# Now that they have the same columns and are sorted with the same criteria, let's evaluate whether they are the same:
comparison_result = df_result_sorted.equals(df_original_sorted)

if comparison_result:
    test2 = 'OK - The original dataset is the similar to the resulting dataset'
else:
    test2 = 'ERROR - The original dataset is NOT the same as the resulting dataset'

print(test2)

OK - The original dataset is the similar to the resulting dataset
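
When a comparison like this fails, `equals` only returns `False` without saying where the frames differ. As a sketch, `pd.testing.assert_frame_equal` can be used instead to obtain a description of the first difference. The helper name `same_rows` and the small frames below are hypothetical.

```python
import pandas as pd


def same_rows(left, right, sort_cols):
    """Return True if both frames hold the same rows, ignoring row order."""
    left_sorted = left.sort_values(by=sort_cols).reset_index(drop=True)
    right_sorted = right.sort_values(by=sort_cols).reset_index(drop=True)
    try:
        pd.testing.assert_frame_equal(left_sorted, right_sorted)
        return True
    except AssertionError as exc:
        print(exc)  # describes the first column or values that differ
        return False


original = pd.DataFrame({"product_id": [291.0, 175.0, 202.0],
                         "units_ordered": [0, 100, 0]})
shuffled = original.iloc[::-1]            # same rows, different order
changed = original.copy()
changed.loc[0, "units_ordered"] = 5       # one value altered

cols = ["product_id", "units_ordered"]
print(same_rows(original, shuffled, cols))
print(same_rows(original, changed, cols))
```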


### 2.6 Filter dataset to only include the products from the list provided by the client, and save to csv

In [57]:
df_target_products = df_desc_normalezed_vs_prod_by_prod[~df_desc_normalezed_vs_prod_by_prod['target_names_prod_by_prod'].isnull()]
df_other_products = df_desc_normalezed_vs_prod_by_prod[df_desc_normalezed_vs_prod_by_prod['target_names_prod_by_prod'].isnull()]

In [58]:
df_target_products_file_name = exit_path + 'filtered_transactions_not_clean.csv' 
df_target_products.to_csv(df_target_products_file_name, index = False, sep = ';' )
df_target_products.head()

Unnamed: 0,product_id,description,order_date,section,store,units_ordered,desc_normalized,target_names_prod_by_prod
22,450.0,POSTRE MOUSSE TRES CHOCOLATES,10/10/2010 0:00:00,0,GrUP,0,postres mousse tres chocolates,mousse tres chocolates
40,2999.0,MINI TORTELES CRUDOS,10/8/2008 0:00:00,0,MoUP,1100,mini torteles cruda,tortel
53,4401.0,TARTA CHOCOLATE PERA 1º,11/2/2008 0:00:00,0,PoUP,0,tarta chocolate opera 1,tarta opera
56,451.0,MOUSSE 3 CHOCOLATES 2º,11/2/2014 0:00:00,0,GrUP,0,mousse 3 chocolates 2,mousse tres chocolates
57,115.0,BAGUETTE MALLORCA,11/3/2017 0:00:00,0,AeUP,0,baguette mallorca,baguette


In [59]:
df_unfiltered_products_file_name = exit_path + 'unfiltered_transactions.csv' 
df_other_products.to_csv(df_unfiltered_products_file_name, index = False, sep = ';' )
df_other_products.head()

# ERROR CONTROL

In [50]:
print(test0)
print(test1)
print(test2)

OK - 'df' has the same size as 'df_with_normalized_descriptions' 
OK - 'df_with_normalized_descriptions' has the same size as 'df_with_normalized_descriptions' 
OK - The original dataset is the similar to the resulting dataset
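
Printed messages are easy to overlook when the notebook is run unattended. A possible complement, sketched here with made-up messages, is to collect the test strings and stop with an error if any of them starts with 'ERROR'. `failed_tests` and `messages` are hypothetical names.

```python
def failed_tests(messages):
    """Return the messages that report an error."""
    return [message for message in messages if message.startswith("ERROR")]


# Made-up messages; in the notebook this would be [test0, test1, test2].
messages = ["OK - sizes match", "ERROR - datasets differ"]

failures = failed_tests(messages)
print(failures)

# To halt execution when something failed, one could add:
# if failures:
#     raise RuntimeError("Data checks failed: " + "; ".join(failures))
```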
