# CREATING THE LIST OF PRODUCTS FOR THE ANALYSIS:

After rearranging the data into a more convenient layout and running some introductory analysis, we can now get down to work with it.

Our clients have given us a list of the 10 products they found most relevant to their business.

We now want to check whether each name on the list corresponds to a single unique id or whether, as seen in the previous scripts, uniqueness conflicts arise between the ids of our products and their descriptions.

To do so, we will go through our dataframe and select the ids and descriptions of the products that match the names in our clients' list.

We will do this in two phases. In the first, we create a dataframe with only some of the rows of our csv.

In the second, we load the whole csv chunk by chunk and extract the same information.

Finally, with the lists (in reality, two dictionaries) of the ids and descriptions that match each product given to us, we will decide which ids are the most appropriate.

Perhaps some guidance from our client would be needed at this stage.
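
The uniqueness conflict mentioned above can be made concrete with a small sketch. It assumes the column names `product_id` and `description` used later in this notebook; the toy rows and the names `toy`, `descriptions_per_id` and `ids_per_description` are invented for illustration and are not taken from the csv.

```python
import pandas as pd

# Toy data (invented for illustration): id 100 has two descriptions,
# and 'CROISSANT' appears under two different ids.
toy = pd.DataFrame({
    'product_id': [100, 100, 101, 115],
    'description': ['CROISSANT', 'CROISSANT MANTEQUILLA', 'CROISSANT', 'BAGUETTE'],
})

# Number of distinct descriptions per id, and of distinct ids per description:
descriptions_per_id = toy.groupby('product_id')['description'].nunique()
ids_per_description = toy.groupby('description')['product_id'].nunique()

# Anything with a count above 1 is a uniqueness conflict:
conflicting_ids = descriptions_per_id[descriptions_per_id > 1].index.tolist()
conflicting_descriptions = ids_per_description[ids_per_description > 1].index.tolist()

print(conflicting_ids, conflicting_descriptions)
```

On real data, both lists would probably need a manual review, since some conflicts may be harmless spelling variants.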

## 1. LOADING ONLY PART OF THE DATAFRAME:

In [None]:
# Importing packages:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

In [None]:
# Defining the search path of the file, the name and the separator:

file_path = "../../data/01_raw/"
file_name = "b2-transactions.csv"

sep=";"

In [None]:
# We create the list of selected products; with it we will try to obtain the ids of the indicated products:

list_of_products=['croissant',
                  'croissant petit',
                  'tarta mousse 3 chocolates',
                  'tarta de manzana 2º',
                  'palmera', 
                  'tarta opera',
                  'postre fresas y mascarpone',
                  'milhojas frambuesa 2º',
                  'tortel',
                  'baguette']

In [None]:
# We import the dataframe:

df=pd.read_csv(file_path+file_name, nrows=1000000, sep=sep)

In [None]:
# According to what we saw in the previous notebook, we have to do some cleaning:

df.dropna(how='any', inplace=True)
df=df.drop('Unnamed: 0', axis=1)

In [None]:
# Most of the descriptions are in uppercase, we first reduce everything to lowercase:

df['description_lower']=df['description'].str.lower()

In [None]:
# Now we build the dictionaries of the ids and descriptions matched by each product of the list:

# What you see here is the final version of both the list of selected products and the way they are looked up in the
# dataframe. We started by searching anywhere within the description string with ".str.contains", but ".str.startswith"
# turned out to be better suited for the job. The values in the list of products were also adjusted in order to capture
# more names each time:

rel_prod_list_ids=dict()
rel_prod_list_descrip=dict()

for product in list_of_products:
    rel_prod_list_ids[product]=df[df['description_lower'].str.startswith(product)]['product_id'].unique()
    rel_prod_list_descrip[product]=df[df['description_lower'].str.startswith(product)]['description_lower'].unique()
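
The difference between the two string methods mentioned in the comment above can be seen in a minimal sketch. The toy descriptions and the names `toy`, `by_contains` and `by_startswith` are invented for illustration; only `Series.str.contains` and `Series.str.startswith` are actual pandas methods.

```python
import pandas as pd

# Toy descriptions (invented for illustration):
toy = pd.DataFrame({
    'product_id': [1, 2, 3],
    'description_lower': ['palmera', 'palmera chocolate', 'mini palmera'],
})

# contains matches the name anywhere in the description...
by_contains = toy[toy['description_lower'].str.contains('palmera', regex=False)]

# ...while startswith keeps only the descriptions that begin with it:
by_startswith = toy[toy['description_lower'].str.startswith('palmera')]

print(by_contains['product_id'].tolist())
print(by_startswith['product_id'].tolist())
```

In this toy case `contains` also picks up 'mini palmera', which may or may not be the product the client means, so `startswith` gives the narrower selection.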

In [None]:
# For minor, additional checks:

df[df['product_id']==107].groupby('description').first()

In [None]:
# This is the dictionary with the word and its associated codes:

rel_prod_list_ids

In [None]:
# The same, but with the descriptions (just for checks):

rel_prod_list_descrip[list_of_products[5]]

In [None]:
# For some additional checks:

df[df['product_id']==450]['description'].unique()

## 2. LOADING THE WHOLE DATAFRAME IN CHUNKS AND GETTING ALL THE RESULTS:

In [None]:
# We read the whole csv in chunks:

reader=pd.read_csv(file_path+file_name, sep=sep, chunksize=2000000)

# Two lists are created to accumulate, for each product, the ids and descriptions found across all the chunks:

NL1=[[] for _ in list_of_products]
NL2=[[] for _ in list_of_products]

# Getting the chunks and proceeding:

for chunk in reader:

    # Same cleaning as in the first part:
    chunk=chunk.dropna(how='any')
    chunk=chunk.drop('Unnamed: 0', axis=1)

    chunk['description_lower']=chunk['description'].str.lower()

    for i, product in enumerate(list_of_products):

        # Rows of this chunk whose description starts with the product name:
        mask=chunk['description_lower'].str.startswith(product)

        # Only the values not already found in a previous chunk are added:
        for new_id in chunk.loc[mask, 'product_id'].unique():
            if new_id not in NL1[i]:
                NL1[i].append(new_id)

        for new_descrip in chunk.loc[mask, 'description_lower'].unique():
            if new_descrip not in NL2[i]:
                NL2[i].append(new_descrip)
    

In [None]:
rel_prod_list_ids=dict(zip(list_of_products,NL1))
rel_prod_list_descrip=dict(zip(list_of_products,NL2))

In [None]:
rel_prod_list_ids
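
As an alternative way to accumulate results across chunks, one set per product can be used, so that duplicates never need to be checked by hand. This is only a sketch under stated assumptions: the toy csv, `toy_products` and `ids_found` are invented for illustration, and the tiny `chunksize` is chosen just to force several chunks.

```python
import io
import pandas as pd

# Toy csv (invented for illustration), with the same separator as above:
toy_csv = io.StringIO(
    "product_id;description\n"
    "100;CROISSANT\n"
    "115;BAGUETTE\n"
    "100;CROISSANT\n"
    "103;CROISSANT PETIT\n"
    "116;BAGUETTE RUSTICA\n"
)
toy_products = ['croissant', 'baguette']

# One set per product: adding an id that is already there has no effect.
ids_found = {product: set() for product in toy_products}

for chunk in pd.read_csv(toy_csv, sep=';', chunksize=2):
    lower = chunk['description'].str.lower()
    for product in toy_products:
        matched = chunk.loc[lower.str.startswith(product), 'product_id']
        ids_found[product].update(matched.tolist())

# Sorted lists are easier to read than sets:
ids_found = {product: sorted(ids) for product, ids in ids_found.items()}
print(ids_found)
```

Note that 'croissant' also collects the id of 'CROISSANT PETIT', because that description starts with the same word; this is the kind of overlap that has to be resolved when building the final list.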

## 3. FINAL LIST:

In [None]:
# In the end, we reach the following conclusions:

dict_of_products={'croissant': 100, # serious doubts: if it is not 100 alone, it should possibly be 100+101+102
                  'croissant petit': 103,
                  'tarta mousse 3 chocolates': 9999, # 9999 is used almost only for orders; creating a new id for this product is suggested
                  'tarta de manzana 2º': 462,
                  'palmeras de trufa': 182, # palmeras: 140
                  'tarta opera': 414, # also 9999, which is used mostly for orders. If included, creating a new id for this product is suggested
                  'postre fresas y mascarpone': 4511,
                  'milhojas frambuesa 2º': 459,
                  'torteles': 112,
                  'baguette': 115}

In [None]:
dict_of_products

## 4. CHANGING THE ID OF A PRODUCT:

We now face the problem that one of our products uses a code (9999) that also describes orders, as seen in previous scripts.

To avoid complexity in the code to come, we change the id of this product to a new one, first making sure that the new id is not currently in use.
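
A minimal sketch of that prior check follows. The helper `id_in_use` and the toy rows (including the 'pedido' description) are hypothetical and invented for illustration; they are not part of the original notebook. Since `df` holds only the first 1,000,000 rows of the csv, a complete check would presumably have to run over every chunk as well.

```python
import pandas as pd

# Toy data (invented for illustration):
toy = pd.DataFrame({
    'product_id': [9999, 9999, 414],
    'description_lower': ['tarta mousse 3 chocolates', 'pedido', 'tarta opera'],
})

def id_in_use(frame, candidate_id):
    """Return True if candidate_id already appears in the product_id column."""
    return bool((frame['product_id'] == candidate_id).any())

new_id = 10002
free_before = not id_in_use(toy, new_id)

# Reassign only if the new id is free:
if free_before:
    mask = toy['description_lower'].str.startswith('tarta mousse 3 chocolates')
    toy.loc[mask, 'product_id'] = new_id

print(toy['product_id'].tolist())
```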

In [None]:
df.loc[df['description_lower'].str.startswith('tarta mousse 3 chocolates'), 'product_id']=10002

In [None]:
df[df['product_id']==10002]

In [None]:
# We also apply this small change to dict_of_products:

dict_of_products['tarta mousse 3 chocolates']=10002 # New id created

In [None]:
dict_of_products