# FILTERING THE DATASET:

In this notebook we go from the initial list of products provided by our client to a final list (expressed with the names and ids actually present in the data), which we then use to filter the initial data into a smaller, more manageable file.

This process is divided into two main steps:

- Check the names in our list against the descriptions present in our data, analyze them and select a final list

- Use this list to filter our data and store the result in a smaller, more convenient file

## CREATING THE LIST OF PRODUCTS FOR THE ANALYSIS:

After rearranging the data and doing some introductory analysis in the previous notebooks, we now get down to work with it.

The client has given us a list of the 10 products they consider most relevant to their business.

We now want to check whether each name on the list corresponds to a unique id or whether, as seen in the previous scripts, there are uniqueness conflicts between product ids and descriptions.

So we are going to go through our dataframe and select the ids and descriptions that match the products on the client's list. With the lists (in reality, two dictionaries) of ids and descriptions that match each product, we will decide which are the most appropriate.

Perhaps some guidance from our client would be needed at this stage.

### 1. Read dataframe

In [1]:
# Importing packages:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
from collections import Counter
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import math
import seaborn as sns


%matplotlib inline
pd.options.display.max_columns = None



In [219]:
# Defining the search path of the file, the name and the separator:

file_path = "../../data/01_raw/"
file_name = "b2-transactions.csv" # 'prueba.csv' 
exit_path = "../../data/02_intermediate/"

filtered_file_name="c1-filtered_transactions.csv"

sep=";"

In [13]:
# We create the list of products provided by the client
list_of_products=['croissant',
                  'croissant petit',
                  'tarta mousse 3 chocolates',
                  'tarta de manzana 2º',
                  'palmera de chocolate',
                  'tarta opera',
                  'postre fresas y mascarpone',
                  'milhojas frambuesa 2º',
                  'tortel',
                  'baguette']

In [224]:
# We import the dataframe:
df=pd.read_csv(file_path+file_name, sep=sep, nrows=100000)

In [225]:
df.sample(5)

Unnamed: 0,product_id,description,order_date,section,store,units_ordered
64053,118.0,BARRITAS DE PAN,26/8/2012 0:00:00,0,BmUP,200
61217,212.0,Empanadilla de espinacas,29/8/2018 0:00:00,0,BmUP,0
11912,9999.0,CAJA DE MINISANWICH DE 24 UND,2/7/2012 0:00:00,0,BmUP,0
32542,9999.0,SANWICH POLLO,23/7/2012 0:00:00,0,BmUP,0
46793,551.0,CAJA PALMERAS,9/4/2012 0:00:00,0,BmUP,0


### 2. Normalizing and aggregating description names

Unfortunately, there is no convention for the descriptions, and one id could be associated with several different descriptions. The plan is to:

1. Normalize descriptions as much as possible using:
    - Regular expressions
    - Basic NLP for spell-checking.
2. Create a normalization file with the following structure:
    - Unique product_id and normalized description
    - Flag to indicate whether the product is part of the given list or not.
3. Finally, review the list manually.

### 2.1 Normalizing description names 

In [226]:
# Setting null descriptions to 'no-description'
df['description'] = df['description'].fillna('no-description')

# Unique product descriptions
df_descriptions_unique = pd.Series(df['description'].unique())

# Most of the descriptions are in uppercase, but some are in lowercase:
df_descriptions_normalized = df_descriptions_unique.str.lower()

# Replace characters outside the allowed set with a space
# (regex=True must be passed explicitly in pandas >= 2.0)
df_descriptions_normalized = df_descriptions_normalized.str.replace(r'[^0-9a-zA-Zº()ª:-]+', ' ', regex=True)

# Replace runs of '-' and ':' with a space
df_descriptions_normalized = df_descriptions_normalized.str.replace(r'-+', ' ', regex=True)
df_descriptions_normalized = df_descriptions_normalized.str.replace(r':+', ' ', regex=True)

# Collapse repeated spaces, then strip the leading and trailing ones
# (done last, because the replacements above can introduce new spaces):
df_descriptions_normalized = df_descriptions_normalized.str.replace(r' +', ' ', regex=True)
df_descriptions_normalized = df_descriptions_normalized.str.strip()

In [227]:
pd.DataFrame(dict(desc_original = df_descriptions_unique, desc_normalized = df_descriptions_normalized)).sample(10)

Unnamed: 0,desc_original,desc_normalized
173,BOLSA DE PALITOS HOJADRE OREGANO,bolsa de palitos hojadre oregano
3060,EncargoTARTA SAN MARCOS DEL 3,encargotarta san marcos del 3
3362,Sandwich cangrejo,sandwich cangrejo
4125,"TARTA DE MANZANA 1º CON CARTEL, FELICIDADES AL...",tarta de manzana 1º con cartel felicidades alicia
3595,TACOS DE SALMON CON GAMBAS Y TRIGUEROS,tacos de salmon con gambas y trigueros
2706,"Tarta infantil rectangular del 1º con cartel ""...",tarta infantil rectangular del 1º con cartel f...
2184,EncargoTARTAMILHOJAYFRAMBUESA DEL 1,encargotartamilhojayframbuesa del 1
1125,B/250gr GAJOS,b 250gr gajos
1960,"EncargoSANDWHIS ROAST BEEF,ROQUEFORT,SALMON,VE...",encargosandwhis roast beef roquefort salmon ve...
3623,TARRO CRISTAL CEREZAS 450 GRS,tarro cristal cerezas 450 grs
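
The normalization steps can also be tried on a handful of strings without loading the dataframe. The sketch below is a plain-Python approximation of the pandas chain above; `normalize_description` is a hypothetical helper name that does not exist in the notebook, and the second sample string is made up for illustration.

```python
import re

def normalize_description(text):
    """Lowercase, replace disallowed characters, and collapse spaces."""
    text = text.lower()
    # Anything outside the allowed character set becomes a space
    text = re.sub(r'[^0-9a-zº()ª:-]+', ' ', text)
    # Runs of '-' and ':' become a space as well
    text = re.sub(r'[-:]+', ' ', text)
    # Collapse repeated spaces and trim both ends
    text = re.sub(r' +', ' ', text)
    return text.strip()

for raw in ["B/250gr GAJOS", "  TARTA:  SAN MARCOS -- 2º "]:
    print(repr(raw), '->', repr(normalize_description(raw)))
```

Note that accented letters are outside the allowed set, so they are replaced by spaces here, just as in the pandas version.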


Now let's get our hands dirty and apply some maths to calculate string distances and finish cleaning all those messy product descriptions. This is what we are going to do:

1. Create a dataset of pastry products by parsing the bakery catalogues and other pastry websites. (This was done manually, by converting the PDF catalogues to txt with an external website. The resulting file is named productos.txt.)

2. Following the indications from: https://medium.com/@hdezfloresmiguelangel/implementando-un-corrector-ortogr%C3%A1fico-en-python-utilizando-la-distancia-de-levenshtein-498ec0dd1105 create a spell-checker based on the productos.txt dataset and the Levenshtein distance
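
For reference, the Levenshtein distance between two words is the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other. The function below is a minimal dynamic-programming sketch of that definition; the name `levenshtein` is ours. It is not the code used by the spell-checker in this notebook, which generates candidate edits instead and also treats a transposition as a single edit.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of `a` and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("trta", "tarta"))
print(levenshtein("mariana", "manzana"))
```

A word at distance 1 or 2 from a known word is what the spell-checker will consider a candidate correction.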


In [228]:
def words(text): return re.findall(r'\w+', text.lower())

# Word frequencies from the bakery vocabulary file
with open('../../data/01_additional_data/productos.txt') as f:
    WORDS = Counter(words(f.read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

In [229]:
correction("trta")

'tarta'

In [230]:
correction("café")

'cafe'

Fantastic, it seems to work! Let's now apply it to our dataset:

In [231]:
def spell_check(line):
    "Given a sentence, returns it spell-checked word by word"
    if isinstance(line, str) and len(line) > 0:
        # Every item produced by split() is a string, so each word is corrected
        return " ".join(correction(word) for word in line.split(" "))
    else:
        return line

In [None]:
# CAUTION! The following cell may take a long time to process (about 5 hours):

# Spell-check the dataset word by word:
df_descriptions_normalized = df_descriptions_normalized.apply(spell_check)
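
Since the same words appear in many descriptions, one possible way to shorten this step is to cache the correction of each word so that it is computed only once. The sketch below illustrates the idea on a toy vocabulary; `TOY_WORDS`, `cached_correction` and `spell_check_cached` are hypothetical names, and the corrector is simplified to a single edit. We have not measured the effect on the real dataset.

```python
from collections import Counter
from functools import lru_cache

# Toy vocabulary with made-up frequencies (the notebook builds WORDS from productos.txt)
TOY_WORDS = Counter({"tarta": 50, "de": 100, "manzana": 20, "chocolate": 30})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    "All strings that are one edit away from `word`."
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

@lru_cache(maxsize=None)
def cached_correction(word):
    "Most frequent known word within one edit; computed once per distinct word."
    if word in TOY_WORDS:
        return word
    candidates = {w for w in edits1(word) if w in TOY_WORDS}
    if not candidates:
        return word
    return max(candidates, key=TOY_WORDS.get)

def spell_check_cached(line):
    "Spell-check a sentence word by word, reusing cached corrections."
    return " ".join(cached_correction(word) for word in line.split(" "))

print(spell_check_cached("trta de manzna"))
print(cached_correction.cache_info())
```

In the notebook, the same decorator could be applied to `correction`, since its result depends only on the word and on the fixed `WORDS` counter.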

Let's now merge the normalized names back into the original file and check how effective this cleaning was:

In [None]:
to_merge = pd.DataFrame(dict(description = df_descriptions_unique, desc_normalized = df_descriptions_normalized))

df_with_normalized_descriptions_transactions = pd.merge(df, to_merge, how='left', on = 'description').sort_values(by='order_date')
df_with_normalized_descriptions_transactions.sample(5)

In [None]:
#Control merge size:
if (df.shape[0] == df_with_normalized_descriptions_transactions.shape[0] ): 
    test0 = "OK - 'df' has the same size as 'df_with_normalized_descriptions_transactions' "
else:
    test0 = "ERROR - 'df' has NOT the same size as 'df_with_normalized_descriptions_transactions' "
print(test0)

In [None]:
# Checking effectiveness of the data cleaning:
unique_descriptions_raw = len(df['description'].unique())
unique_descriptions_normalized = len(df_with_normalized_descriptions_transactions['desc_normalized'].unique())
print('The product descriptions were cleaned from {} unique names to {}.'.format(unique_descriptions_raw,unique_descriptions_normalized))

Not super effective...

In [None]:
# Saving the file to the intermediate folder
output_path_df_with_normalized_descriptions_transactions = exit_path + 'data_with_normalized_names.csv'
df_with_normalized_descriptions_transactions.to_csv(output_path_df_with_normalized_descriptions_transactions, index = False, sep = ';' )

### 2.2 Identifying product descriptions that the client wants us to predict

It is time to create the file that will be manually reviewed.

- First, we compare the normalized descriptions with the list of products provided by the client and suggest matches using the fuzzywuzzy library
- Second, we use the results from the earlier manual analysis
- Third, we manually evaluate whether the results are good

#### 2.2.1 Using the library fuzzywuzzy to compare the product normalized descriptions with the list of products provided by the client and suggest a match, or alternatively - "match-not-found"

In [None]:
df_normalized_desc_unique = pd.DataFrame(df_with_normalized_descriptions_transactions["desc_normalized"].unique(), columns = ['desc_normalized'])

In [None]:
def find_match(line, options=list_of_products):
    "Returns the best-matching product if its fuzzywuzzy score is greater than 80, 'match-not-found' otherwise"
    if isinstance(line, str):
        # extractOne returns a (choice, score) tuple, or None if there is nothing to match
        highest = process.extractOne(line, options)
        if highest is not None and highest[1] > 80:
            return highest[0]
    return 'match-not-found'

# Applying matching function to all product normalized descriptions
df_normalized_desc_unique["target_names_fuzzywuuzy"] = df_normalized_desc_unique["desc_normalized"].apply(lambda line: find_match(line))

Let's now evaluate how effectively we matched the normalized descriptions against the list that the client provided:

In [None]:
# Let's review the effectiveness by filtering on 'mousse'. The expected result is that all 'mousse 3 chocolates' match
df_normalized_desc_unique[df_normalized_desc_unique['desc_normalized'].str.contains('mousse')].head(10)

As we can see, it's not actually very good. Let's try something different.

#### 2.2.2 Using the results from the other analysis

Let's now use the results from the manual analysis to see how effective that approach was:

In [None]:
# Since this matching is performed at id level,
# let's create a new dataset with unique product_id and descriptions, and evaluate it:
df_normalized_id_desc_unique = df_with_normalized_descriptions_transactions[["product_id",'desc_normalized']].drop_duplicates()

In [None]:
dict_of_products_matches={100: 'croissant', 
                  101: 'croissant',
                  102: 'croissant',
                  103: 'croissant petit',
                  9999: 'tarta mousse 3 chocolates', # almost only for order, creating a new id for this product is suggested
                  462: 'tarta de manzana 2º',
                  182: 'palmera de chocolate', # palmeras: 140
                  414: 'tarta opera', # 9999, for order, mostly. If included, creating a new id for this product is suggested
                  4511:'postre fresas y mascarpone',
                  459: 'milhojas frambuesa 2º',
                  112: 'tortel',
                  115: 'baguette'}

In [None]:
def target_names_a(product_id, dict_of_products_matches=dict_of_products_matches):
    'Returns the match if the product_id is found within the given dict, otherwise "match-not-found"'
    if product_id is not None and not math.isnan(product_id) and int(product_id) in dict_of_products_matches:
        return dict_of_products_matches[int(product_id)]
    else:
        return 'match-not-found'

df_normalized_id_desc_unique['target_names_manual_analysis']=df_normalized_id_desc_unique["product_id"].apply(lambda line: target_names_a(line))

Let's now check how effective this was:

In [None]:
# Let's review the effectiveness by filtering on 'mousse'. The expected result is that all 'mousse 3 chocolates' match
df_normalized_id_desc_unique[df_normalized_id_desc_unique['desc_normalized'].str.contains('mousse')].head(15)

Again, not very good, since most of the 'mousse 3 chocolates' descriptions are unmatched.

It is clear that we need a better way to match the results. Let's try keyword filtering, product by product.
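
The idea behind keyword filtering is that a description is kept only if it contains every required keyword and none of the excluded ones. The sketch below shows that rule on plain strings; `matches_keywords` is a hypothetical helper that the notebook does not define, and the patterns are the ones used later for 'tarta mousse tres chocolates'. The per-product functions in section 2.3 apply the same logic with chained `str.contains` calls.

```python
import re

def matches_keywords(description, include, exclude=()):
    """True if every `include` pattern is found and no `exclude` pattern is."""
    has_all = all(re.search(pattern, description) for pattern in include)
    has_none = not any(re.search(pattern, description) for pattern in exclude)
    return has_all and has_none

# Patterns taken from the mousse filter defined later in the notebook
mousse_include = ['tarta', 'mousse|mus', 'tres|3', 'chocolate']
mousse_exclude = ['mini']

for description in ['tarta mousse 3 chocolates',
                    'mini tarta mousse 3 chocolates',
                    'tarta de manzana 2º']:
    print(description, '->', matches_keywords(description, mousse_include, mousse_exclude))
```

Keeping the include and exclude patterns as data makes it easier to review them with the client, at the cost of hiding the order in which each keyword was added.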

In [None]:
# TO DELETE!
output_path_df_with_normalized_descriptions_transactions = exit_path + 'data_with_normalized_names.csv'
df_with_normalized_descriptions_transactions = pd.read_csv(output_path_df_with_normalized_descriptions_transactions, sep = ';' )


### 2.3 Review Product by Product

In [None]:
# First, let's create again a dataframe with unique descriptions
unique_normalized_decriptions = df_with_normalized_descriptions_transactions[['product_id','desc_normalized']].drop_duplicates()

# And an empty list to collect each per-product result. It will be used to concatenate the results.
list_of_dfs = []

In [None]:
# Functions that we will use:

def plot_count_per_id(df):
    "Bar plot of the number of rows per product_id"
    transactions_by_id = df.groupby("product_id")['desc_normalized'].count()
    transactions_by_id.plot.bar()

#### 2.3.1 Matching: milhojas de frambuesa 2º

In [None]:
def filter_milhojas (df):
    'Filters the product descriptions of given dataset by "milhojas" and "frambuesa 2"'
    milhojas = df[df['desc_normalized'].str.contains('milhojas')]
    milhojas_frambuesa = milhojas[milhojas['desc_normalized'].str.contains('frambuesa 2')].copy()
    return milhojas_frambuesa


In [None]:
# Filter the transactions dataset
milhojas_frambuesa_transacciones=filter_milhojas(df_with_normalized_descriptions_transactions)
milhojas_frambuesa_transacciones.head()

In [None]:
# Plot the distribution of 'product_id'
plot_count_per_id(milhojas_frambuesa_transacciones)

Let's explore the name distribution of ids 414, 459 and 9999:

In [None]:
#Names distribution for product_id = 459:
df_with_normalized_descriptions_transactions[df_with_normalized_descriptions_transactions['product_id']==459.0]['desc_normalized'].value_counts().head()

In [None]:
#Names distribution for product_id = 414:
df_with_normalized_descriptions_transactions[df_with_normalized_descriptions_transactions['product_id']==414.0]['desc_normalized'].value_counts().head()

In [None]:
#Names distribution for product_id = 9999:
df_with_normalized_descriptions_transactions[(df_with_normalized_descriptions_transactions['product_id']==9999.0) & (df_with_normalized_descriptions_transactions['desc_normalized'].str.contains('milhoja'))]['desc_normalized'].value_counts().head()

It seems there is a strong correlation with id 459; however, 459 also includes other types of 'milhojas'. In addition, 'milhojas de frambuesa 2º' also appears under id 414, which seems to be an id shared by several 'tartas', and under id 9999, which is the id used for custom orders.

For these reasons, we decide to filter milhojas based on the description (filtering the dataframe that contains only the distinct 'product_id' and 'desc_normalized' values):

In [None]:
#Save unique product_id and product_description:
milhojas_frambuesa = filter_milhojas(unique_normalized_decriptions)
milhojas_frambuesa['target_names_prod_by_prod'] = 'milhojas frambuesa'
list_of_dfs.append(milhojas_frambuesa)
milhojas_frambuesa.sample(4)

#### 2.3.2 Matching: croissant petit

From the analysis performed in notebook "x01-transactions_to_partial_results-yy.ipynb", it was concluded that 'croissant petit' had a strong correlation with id number '103'. However, before committing to filtering by that id, let's plot the number of transaction lines per id that satisfy the filter in the following function:

In [None]:
def filter_croissant_petit (df):
    croissant = df[df['desc_normalized'].str.contains('croissant')].copy()
    croissant_petite = croissant[croissant['desc_normalized'].str.contains('petit')].copy()
    return croissant_petite

In [None]:
# Plotting the results:
transactions_croissant_petite = filter_croissant_petit(df_with_normalized_descriptions_transactions)
transactions_croissant_petite.head()

In [None]:
plot_count_per_id(transactions_croissant_petite)

It is clear that id '103' represents the 'croissant petit'; in fact, after reviewing the data with the client, he suggested taking only 103.

In [None]:
# Saving id=103 as croissant petit
croissant_petit = unique_normalized_decriptions[unique_normalized_decriptions['product_id']==103.0].copy()
croissant_petit['target_names_prod_by_prod'] = 'croissant petit'
list_of_dfs.append(croissant_petit)
croissant_petit.head()

#### 2.3.3 Matching: croissant

From the analysis performed in notebook "x01-transactions_to_partial_results-yy.ipynb", it was concluded that 'croissant simple' had a strong correlation with id numbers '100' and '101'. However, before committing to filtering by those ids, let's plot the number of transaction lines per id that satisfy the filter in the following function:

In [None]:
def filter_croissant_simple (df):
    croissant = df[df['desc_normalized'].str.contains('croissant')].copy()
    croissant_simple = croissant[~croissant['desc_normalized'].str.contains('petit|tira|masa')].copy()
    return croissant_simple

In [None]:
transactions_croissant_simple = filter_croissant_simple(df_with_normalized_descriptions_transactions)
transactions_croissant_simple.head()

In [None]:
plot_count_per_id(transactions_croissant_simple)

Interesting, the correlation seems to exist for several ids. Let's explore a bit more and print the most common descriptions for each id to see if we find any pattern:

In [None]:
for i in transactions_croissant_simple['product_id'].unique():
    to_print = df_with_normalized_descriptions_transactions[df_with_normalized_descriptions_transactions['product_id']==i]['desc_normalized'].value_counts().head()
    print("***Plotting id: {} ".format(i))
    print("-")

    print(to_print)
    print("")


It seems that they have different types of croissant. After checking with the client, he suggested taking only the descriptions under ids 100.0 and 101.0 that contain the word 'croissant':

In [None]:
croissant_simple = unique_normalized_decriptions[(unique_normalized_decriptions['product_id']==100.0)|(unique_normalized_decriptions['product_id']==101.0)].copy()
croissant_simple = croissant_simple[croissant_simple['desc_normalized'].str.contains('croissant')]
croissant_simple = croissant_simple[~croissant_simple['desc_normalized'].str.contains('petit|tira|masa')].copy()

croissant_simple['target_names_prod_by_prod'] = 'croissant simple'
list_of_dfs.append(croissant_simple)
croissant_simple.head()

#### 2.3.4 Matching: tarta mousse tres chocolates

From the analysis performed in notebook "x01-transactions_to_partial_results-yy.ipynb", it was concluded that 'mousse tres chocolates' had no correlation with a particular product_id, therefore the analysis will be based on the description:

In [None]:
def filter_mousse_tres_chocolates (df):
    tarta = df[df['desc_normalized'].str.contains('tarta')].copy()
    mousse = tarta[tarta['desc_normalized'].str.contains('mousse|mus')].copy()
    mousse_tres = mousse[mousse['desc_normalized'].str.contains('tres|3')].copy()
    mousse_tres = mousse_tres[~mousse_tres['desc_normalized'].str.contains('mini')].copy()
    mousse_tres_chocolates = mousse_tres[mousse_tres['desc_normalized'].str.contains('chocolate')].copy()
    return mousse_tres_chocolates

In [None]:
transactions_mousse_tres_chocolates = filter_mousse_tres_chocolates(df_with_normalized_descriptions_transactions)
transactions_mousse_tres_chocolates.head()

In [None]:
plot_count_per_id(transactions_mousse_tres_chocolates)

Let's explore it further by looking at the full distribution of names:

In [None]:
for i in transactions_mousse_tres_chocolates['product_id'].unique():
    to_print = df_with_normalized_descriptions_transactions[df_with_normalized_descriptions_transactions['product_id']==i]['desc_normalized'].value_counts().head()
    print("***Plotting id: {} ".format(i))
    print("-")

    print(to_print)
    print("")


It seems that 'tarta mousse tres chocolates' is all over the place, so the easiest option is to filter by product description:

In [None]:
mousse_tres_chocolates = filter_mousse_tres_chocolates(unique_normalized_decriptions)
mousse_tres_chocolates['target_names_prod_by_prod'] = 'mousse tres chocolates'
list_of_dfs.append(mousse_tres_chocolates)
mousse_tres_chocolates.head()

#### 2.3.5 Matching: tarta de manzana 2

From the analysis performed in notebook "x01-transactions_to_partial_results-yy.ipynb", it was concluded that 'tarta de manzana 2' had no correlation with a particular product_id, therefore the analysis will be based on the description:

In [None]:
def filter_tarta_manzana_2 (df):
    manzana_tarta = df[df['desc_normalized'].str.contains('manzana')]
    #manzana_tarta = manzana[manzana['desc_normalized'].str.contains('tarta')].copy() #Removed because we saw it had better fit
    manzana_tarta = manzana_tarta[~manzana_tarta['desc_normalized'].str.contains('caramelo')].copy()
    manzana_tarta_dos = manzana_tarta[manzana_tarta['desc_normalized'].str.contains('dos|2')].copy()

    return manzana_tarta_dos

In [None]:
transactions_manzana_tarta_dos=filter_tarta_manzana_2(df_with_normalized_descriptions_transactions)

In [None]:
plot_count_per_id(transactions_manzana_tarta_dos)

Looks like we found a winner!

In [None]:
manzana_tarta_dos = filter_tarta_manzana_2(unique_normalized_decriptions)
manzana_tarta_dos['target_names_prod_by_prod'] = 'tarta de manzana'
list_of_dfs.append(manzana_tarta_dos)
manzana_tarta_dos.head()

#### 2.3.6 Matching: palmera de chocolate 

In [None]:
def filter_palmera_chocolate (df):
    palmera = df[df['desc_normalized'].str.contains('palmera')]
    palmera_chocolate = palmera[palmera['desc_normalized'].str.contains('chocolate|trufa')].copy() # Added 'trufa' after reviewing results
    return palmera_chocolate

In [None]:
transactions_palmera_chocolate = filter_palmera_chocolate(df_with_normalized_descriptions_transactions)
transactions_palmera_chocolate.head()

In [None]:
plot_count_per_id(transactions_palmera_chocolate)

Again, we have a winner!

In [None]:
palmera_chocolate = filter_palmera_chocolate(unique_normalized_decriptions)
palmera_chocolate['target_names_prod_by_prod'] = 'palmera chocolate'

list_of_dfs.append(palmera_chocolate)
palmera_chocolate.head()

#### 2.3.7 Matching: tarta ópera 

In [None]:
def filter_tarta_opera(df):
    opera = df[df['desc_normalized'].str.contains('opera')]
    opera_tarta = opera[opera['desc_normalized'].str.contains('tarta')].copy()
    return opera_tarta

In [None]:
transactions_tarta_opera = filter_tarta_opera(df_with_normalized_descriptions_transactions)

In [None]:
plot_count_per_id(transactions_tarta_opera)

In [None]:
for i in transactions_tarta_opera['product_id'].unique():
    to_print = df_with_normalized_descriptions_transactions[df_with_normalized_descriptions_transactions['product_id']==i]['desc_normalized'].value_counts().head()
    print("***Plotting id: {} ".format(i))
    print("-")

    print(to_print)
    print("")


Again, all over the place, so we decided to use the description filter:

In [None]:
tarta_opera = filter_tarta_opera(unique_normalized_decriptions)
tarta_opera['target_names_prod_by_prod'] = 'tarta opera'

list_of_dfs.append(tarta_opera)
tarta_opera.head()

#### 2.3.8 Matching: postre de fresas y mascarpone

In [None]:
def filter_postre_fresas_mascarpone(df):
    postre = df[df['desc_normalized'].str.contains('postre')]
    postre_fresa = postre[postre['desc_normalized'].str.contains('fresa')].copy()
    # Exclusions requested by the client. The mask is built from the dataframe being
    # filtered, so that the mask and the dataframe share the same index.
    postre_fresa = postre_fresa[~postre_fresa['desc_normalized'].str.contains('eclair|tartaleta')].copy()

    postre_fresa_mascarpone = postre_fresa[postre_fresa['desc_normalized'].str.contains('mascarpone')].copy()
    return postre_fresa_mascarpone

In [None]:
transactions_postre_fresas_mascarpone = filter_postre_fresas_mascarpone(df_with_normalized_descriptions_transactions)

In [None]:
plot_count_per_id(transactions_postre_fresas_mascarpone)

In [None]:
for i in transactions_postre_fresas_mascarpone['product_id'].unique():
    to_print = df_with_normalized_descriptions_transactions[df_with_normalized_descriptions_transactions['product_id']==i]['desc_normalized'].value_counts().head()
    print("***Plotting id: {} ".format(i))
    print("-")

    print(to_print)
    print("")


Seems that the filter is working :)

In [None]:
postre_fresas_mascarpone = filter_postre_fresas_mascarpone(unique_normalized_decriptions)

postre_fresas_mascarpone['target_names_prod_by_prod'] = 'postre de fresas y mascarpone'
list_of_dfs.append(postre_fresas_mascarpone)
postre_fresas_mascarpone.head()


#### 2.3.9 Matching: tortel

In [None]:
def filter_tortel (df):
    tortel = df[df['desc_normalized'].str.contains('tortel')].copy()
    tortel = tortel[~tortel['desc_normalized'].str.contains('tortellini|mini')].copy()

    return tortel

In [None]:
transactions_tortel = filter_tortel(df_with_normalized_descriptions_transactions)

In [None]:
plot_count_per_id(transactions_tortel)

In [None]:
for i in transactions_tortel['product_id'].unique():
    to_print = df_with_normalized_descriptions_transactions[df_with_normalized_descriptions_transactions['product_id']==i]['desc_normalized'].value_counts().head()
    print("***Plotting id: {} ".format(i))
    print("-")

    print(to_print)
    print("")


In [None]:
tortel = filter_tortel(unique_normalized_decriptions)

tortel['target_names_prod_by_prod'] = 'tortel'
list_of_dfs.append(tortel)
tortel.head(5)

#### 2.3.10 Matching: baguette

In [None]:
def filter_baguette (df):
    baguette = df[df['desc_normalized'].str.contains('baguette|baguete|baguet')].copy()
    return baguette

In [None]:
transaction_baguette = filter_baguette(df_with_normalized_descriptions_transactions)

In [None]:
plot_count_per_id(transaction_baguette)

In [None]:
for i in transaction_baguette['product_id'].unique():
    to_print = df_with_normalized_descriptions_transactions[df_with_normalized_descriptions_transactions['product_id']==i]['desc_normalized'].value_counts().head()
    print("***Plotting id: {} ".format(i))
    print("-")

    print(to_print)
    print("")


Per client indications, we take id = 115.0 with descriptions containing 'baguette':

In [None]:
baguette = unique_normalized_decriptions[unique_normalized_decriptions['product_id']==115.0].copy()
baguette = baguette[baguette['desc_normalized'].str.contains('baguette|baguete|baguet')].copy()

baguette['target_names_prod_by_prod'] = 'baguette'
list_of_dfs.append(baguette)
baguette.head(5)

### 2.4 Concatenate the results, merge them back into the full list of normalized descriptions and evaluate their effectiveness

In [None]:
# Let's concatenate the results:
list_of_products_df = pd.concat(list_of_dfs, sort=False)

In [None]:
df_desc_normalezed_vs_prod_by_prod = pd.merge(df_with_normalized_descriptions_transactions, list_of_products_df,how='left',on=['desc_normalized','product_id'])

## Merging test:

In [None]:
# Control merge size:
if df_with_normalized_descriptions_transactions.shape[0] == df_desc_normalezed_vs_prod_by_prod.shape[0]:
    test1 = "OK - 'df_with_normalized_descriptions_transactions' has the same size as 'df_desc_normalezed_vs_prod_by_prod' "
else:
    test1 = "ERROR - 'df_with_normalized_descriptions_transactions' has NOT the same size as 'df_desc_normalezed_vs_prod_by_prod' "

print(test1)
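
A left merge keeps the row count only if each key appears at most once on the right-hand side. If the same (product_id, desc_normalized) pair is captured by two product filters, every matching transaction is duplicated. The sketch below shows this with plain tuples; the rows are made up for illustration, and `duplicated_keys` and `left_merge_row_count` are hypothetical helpers.

```python
from collections import Counter

def duplicated_keys(rows):
    """Return the (product_id, desc_normalized) keys that appear more than once."""
    counts = Counter((product_id, desc) for product_id, desc, _target in rows)
    return sorted(key for key, n in counts.items() if n > 1)

def left_merge_row_count(left_keys, rows):
    """Number of rows a left merge on the key would produce."""
    counts = Counter((product_id, desc) for product_id, desc, _target in rows)
    # Unmatched left rows are kept once; matched rows are repeated per right-hand match
    return sum(max(counts[key], 1) for key in left_keys)

# Made-up description that satisfies both the mousse and the manzana filters
ambiguous = 'tarta mousse 3 chocolates con manzana 2'
matches = [
    (9999, ambiguous, 'mousse tres chocolates'),
    (9999, ambiguous, 'tarta de manzana'),
    (103, 'croissant petit', 'croissant petit'),
]
transactions = [(9999, ambiguous), (103, 'croissant petit'), (118, 'barritas de pan')]

print(duplicated_keys(matches))
print(len(transactions), '->', left_merge_row_count(transactions, matches))
```

With pandas, a comparable check could be done with `list_of_products_df.duplicated(subset=['product_id', 'desc_normalized'])` before merging.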

*NOTE:*

During the first executions of the code with the full transactions file, this check was failing: 'df' had fewer rows than 'df_with_normalized_descriptions_transactions'. The reason was not easy to identify; after some digging we found that some normalized descriptions named two different products, although this was not the case for the raw description (before the spell-checker):

For example:
- Normalized prod description: 'tarta mousse 3 chocolates de 20 raciones con escrito sobre la tarta manzana y mini felicidades'
- Raw description: 'TARTA MOUSSE 3 CHOCOLATES DE 20 RACIONES CON ESCRITO SOBRE LA TARTA:  MARIANA Y DANI FELICIDADES'

Basically, the spell-corrector was solving some problems (normalizing 'trata' and 'taaarta' to 'tarta') but adding a new one: it maps words it doesn't know, which may well be correct words, to words it does know, such as 'MARIANA' to 'manzana'. Of course this is a weakness; however, from the manual inspections that were performed, it doesn't seem to happen often.

We solved it by adding to the bakery products dataset:
- A list of the most common male and female Spanish names, so that names are not confused with products

Sources of the datasets:
- Spanish names: https://www.ine.es/dyngs/INEbase/es/operacion.htm?c=Estadistica_C&cid=1254736177009&menu=resultados&secc=1254736195454&idp=1254734710990
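
The effect of adding names to the vocabulary can be reproduced on a toy example. The sketch below reimplements the corrector in compact form with a two-word vocabulary whose frequencies are made up; `vocab` and this two-argument `correction` are illustrative and differ from the notebook's `WORDS` and `correction`.

```python
from collections import Counter

LETTERS = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    "All strings that are one edit away from `word`."
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correction(word, vocab):
    "Most frequent known word within two edits of `word`, or `word` itself."
    def known(words):
        return {w for w in words if w in vocab}
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=lambda w: vocab[w])

vocab = Counter({'tarta': 50, 'manzana': 20})  # toy frequencies
before = correction('mariana', vocab)   # the name is unknown, so it is "corrected"
vocab.update(['mariana'])               # add the name to the vocabulary
after = correction('mariana', vocab)    # now the name is known and left untouched
print(before, after)  # prints "manzana mariana"
```

Because known words are always returned unchanged, every name added to the vocabulary is protected from this kind of miscorrection.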


Also, in this case we added to the Excel file some names that we found ourselves. The right thing to do, had we had more time, would be to polish the dataset by adding not just the Mallorca catalogue and the names, but also a book in Spanish. Perhaps it would also be interesting to apply NLP to identify names in the product descriptions and add them to the products dataset.

### 2.5 Test that the data has not been corrupted

To test the integrity of the data, the original dataset should be identical to the final dataset once the columns that we added are removed, in other words, without the columns holding the normalized descriptions and the target names:

In [None]:
# First, let's check the size of both dataframes:
print("Original dataset shape: {}".format(df.shape))
print("Resulting dataset shape: {}".format(df_desc_normalezed_vs_prod_by_prod.shape))

The shape looks good: we were expecting the resulting dataset to have two more columns. Let's now evaluate whether they are actually the same dataset once we remove the added columns:

In [None]:
# Selecting the original columns from the resulting df
df_result = df_desc_normalezed_vs_prod_by_prod.loc[:, df.columns]

In [None]:
# Now, let's compare it with the original dataset, sorting both in the same way:
sort_columns = ['order_date', 'store', 'description', 'product_id', 'units_ordered']
df_result_sorted = df_result.sort_values(by=sort_columns).reset_index(drop=True)
df_original_sorted = df.sort_values(by=sort_columns).reset_index(drop=True)

In [None]:
df_result_sorted.head()

In [None]:
df_original_sorted.head()

In [None]:
# Now that they have the same columns and are sorted using the same criteria, let's evaluate whether they are the same:
comparison_result = df_result_sorted.equals(df_original_sorted)

if comparison_result:
    test2 = 'OK - Original dataset is identical to the resulting dataset'
else:
    test2 = 'ERROR - Original dataset does NOT match the resulting dataset'

print(test2)
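
The same integrity check can be expressed without sorting, by comparing the rows as multisets restricted to the original columns. The sketch below works on lists of dictionaries with made-up rows; `same_rows` is a hypothetical helper, offered as an alternative way to think about the test rather than as a replacement for the pandas comparison above.

```python
from collections import Counter

def same_rows(original, result, columns):
    """Compare two lists of dict rows as multisets, looking only at `columns`."""
    def project(rows):
        return Counter(tuple(row[c] for c in columns) for row in rows)
    return project(original) == project(result)

columns = ['product_id', 'description', 'units_ordered']
original = [
    {'product_id': 118, 'description': 'BARRITAS DE PAN', 'units_ordered': 200},
    {'product_id': 551, 'description': 'CAJA PALMERAS', 'units_ordered': 0},
]
# Same rows in a different order, with an extra column added
result = [
    {'product_id': 551, 'description': 'CAJA PALMERAS', 'units_ordered': 0,
     'desc_normalized': 'caja palmeras'},
    {'product_id': 118, 'description': 'BARRITAS DE PAN', 'units_ordered': 200,
     'desc_normalized': 'barritas de pan'},
]

print(same_rows(original, result, columns))
print(same_rows(original, result + [result[0]], columns))
```

A duplicated row, such as one introduced by a merge on non-unique keys, makes the comparison fail even though every individual row is still valid.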

### 2.6 Filter dataset to only include the products from the list provided by the client, and save to csv

In [None]:
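# .copy() so that the columns added below are not written to a view of the merged dataframe
df_target_products = df_desc_normalezed_vs_prod_by_prod[~df_desc_normalezed_vs_prod_by_prod['target_names_prod_by_prod'].isnull()].copy()
# Rows that matched none of the target products; they are saved separately further down
df_other_products = df_desc_normalezed_vs_prod_by_prod[df_desc_normalezed_vs_prod_by_prod['target_names_prod_by_prod'].isnull()]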

In [None]:
df_target_products.head()

In [None]:
from datetime import datetime as dttm

# Parse the order date (the format string assumes the time part is always 0:00:00)
df_target_products['date'] = df_target_products['order_date'].apply(lambda x: dttm.strptime(x, '%d/%m/%Y 0:00:00'))

# Keep only the integer part of the units (anything after a decimal comma is dropped);
# astype(str) makes this work whether the column was read as text or as numbers
df_target_products['units_ordered'] = df_target_products['units_ordered'].astype(str).str.split(",").str[0].astype('int64')

df_target_products = df_target_products.drop('order_date', axis=1)
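
The two conversions can be checked on single values before running them on the whole dataframe. In the sketch below, `parse_order_date` and `parse_units` are hypothetical helpers, and the value '200,00' is an assumed example of a decimal-comma quantity; the sample rows shown earlier only contain plain integers.

```python
from datetime import datetime

def parse_order_date(text):
    """Parse dates such as '26/8/2012 0:00:00' (day/month/year, no zero padding needed)."""
    return datetime.strptime(text, '%d/%m/%Y %H:%M:%S')

def parse_units(value):
    """Return the integer part of a quantity written with a decimal comma."""
    return int(str(value).split(',')[0])

print(parse_order_date('26/8/2012 0:00:00'))
print(parse_units('200,00'), parse_units(0))
```

Unlike the format string used in the cell above, `%H:%M:%S` also accepts rows whose time part is not midnight.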

In [None]:
# Check the type of the parsed dates (first row, since label 33 may not survive the filtering)
type(df_target_products['date'].iloc[0])

In [None]:
df_target_products_file_name = exit_path + 'filtered_transactions_not_clean.csv' 
df_target_products.to_csv(df_target_products_file_name, index = False, sep = ';' )
# df_target_products.head()

In [None]:
unfiltered_products_file_name = exit_path + 'unfiltered_transactions.csv' 
df_other_products.to_csv(unfiltered_products_file_name, index = False, sep = ';' )
df_other_products.head()

# ERROR CONTROL

In [None]:
print(test0)
print(test1)
print(test2)

TODO: THE DATASET STILL NEEDS TO BE CLEANED