# References:

Blog post - https://medium.com/codex/tensorflow-deep-learning-recommenders-on-retail-dataset-ce0c50aff5fa

Source Dataset Citation - 
Olist, and André Sionek. (2018). Brazilian E-Commerce Public Dataset by Olist [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/195341

In this data preparation notebook, I load and prepare several datasets from the Olist e-commerce platform. First, I extract the relevant columns from the orders and customers datasets and merge them on the customer ID. Next, I load the order items and order reviews datasets, counting how many units of each product appear in each order and averaging the review score per order. I then load the products dataset and translate the product category names into English. After merging the product information with the order data, I number the products within each category to build a readable SKU-style product ID. Finally, I rearrange the columns and export the cleaned dataset to a CSV file.

# Load Orders Data

In [1]:
import pandas as pd

# Read the orders dataset into a Pandas DataFrame
orders = pd.read_csv(
    '../data/Olist/olist_orders_dataset.csv'
)

# Select relevant columns: order_id, customer_id, order_purchase_timestamp
orders = orders.loc[:, ['order_id', 'customer_id', 'order_purchase_timestamp']]

# Display the first few rows of the DataFrame
orders.head()

Unnamed: 0,order_id,customer_id,order_purchase_timestamp
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,2017-10-02 10:56:33
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,2018-07-24 20:41:37
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,2018-08-08 08:38:49
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,2017-11-18 19:28:06
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,2018-02-13 21:18:39
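
As an alternative to reading every column and then selecting three, the column selection can be pushed into the reader itself. The sketch below is a minimal illustration on an in-memory sample, not the notebook's own code: the sample rows, the `order_status` values, and the names `sample_csv`, `wanted`, and `orders_sample` are assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the orders file;
# the IDs and order_status values are illustrative only.
sample_csv = io.StringIO(
    "order_id,customer_id,order_status,order_purchase_timestamp\n"
    "o1,c1,delivered,2017-10-02 10:56:33\n"
    "o2,c2,delivered,2018-07-24 20:41:37\n"
    "o3,c3,shipped,2018-08-08 08:38:49\n"
)

wanted = ["order_id", "customer_id", "order_purchase_timestamp"]

# usecols makes the parser skip every column that is not listed
orders_sample = pd.read_csv(sample_csv, usecols=wanted)

print(orders_sample.columns.tolist())
```

On a file with many unused columns this keeps memory use lower, and the result has the same three columns as the `.loc` selection.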


# Load Customers Data

In [2]:
# Read the customers dataset into a Pandas DataFrame
customers = pd.read_csv(
    '../data/Olist/olist_customers_dataset.csv'
)

# Select relevant columns: customer_id, customer_unique_id, customer_city
customers = customers.loc[:, ['customer_id', 'customer_unique_id', 'customer_city']]

# Display the first few rows of the DataFrame
customers.head()

Unnamed: 0,customer_id,customer_unique_id,customer_city
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,franca
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,sao bernardo do campo
2,4e7b3e00288586ebd08712fdd0374a03,060e732b5b29e8181a18229c7b0b2b5e,sao paulo
3,b2b6027bc5c5109e529d4dc6358b12c3,259dac757896d24d7702b9acbbff3f3c,mogi das cruzes
4,4f2d8ab171c80ec8364f7c12e35b23ad,345ecd01c38d18a9036ed96c73b8d066,campinas


# Merge Orders and Customers Data

In [3]:
# Merge orders and customers DataFrames on 'customer_id' using a left join
merged_df = pd.merge(orders, customers, on='customer_id', how='left')

# Drop the 'customer_id' column from the merged DataFrame
merged_df = merged_df.drop(columns=['customer_id'])

# Display the first few rows of the merged DataFrame
merged_df.head()

Unnamed: 0,order_id,order_purchase_timestamp,customer_unique_id,customer_city
0,e481f51cbdc54678b7cc49136f2d6af7,2017-10-02 10:56:33,7c396fd4830fd04220f754e42b4e5bff,sao paulo
1,53cdb2fc8bc7dce0b6741e2150273451,2018-07-24 20:41:37,af07308b275d755c9edb36a90c618231,barreiras
2,47770eb9100c2d0c44946d9cf07ec65d,2018-08-08 08:38:49,3a653a41f6f9fc3d2a113cf8398680e8,vianopolis
3,949d5b44dbf5de918fe9c16f97b45f8a,2017-11-18 19:28:06,7c142cf63193a1473d2e66489a9ae977,sao goncalo do amarante
4,ad21c59c0840e6cb83a9ceb5573f8159,2018-02-13 21:18:39,72632f0f9dd73dfee390c9b22eb56dd6,santo andre


In [4]:
# Spot-check a customer who placed more than one order
merged_df[merged_df['customer_unique_id']=='00172711b30d52eea8b313a7f2cced02']

Unnamed: 0,order_id,order_purchase_timestamp,customer_unique_id,customer_city
39737,c306eca42d32507b970739b5b6a5a33a,2018-08-13 09:14:07,00172711b30d52eea8b313a7f2cced02,jequie
69921,bb874c45df1a3c97842d52f31efee99a,2018-07-28 00:23:49,00172711b30d52eea8b313a7f2cced02,jequie
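
The lookup above shows one `customer_unique_id` attached to two orders. A minimal sketch of how such a join could be checked, assuming each order carries its own `customer_id` while `customer_unique_id` identifies the person; the toy frames and every ID in them are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical IDs): orders o1 and o2 belong to the same person u1
orders_toy = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c3"],
})
customers_toy = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_unique_id": ["u1", "u1", "u2"],
})

# validate raises pandas.errors.MergeError if customer_id repeats on either side;
# indicator adds a _merge column that flags rows without a match
checked = pd.merge(
    orders_toy,
    customers_toy,
    on="customer_id",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Orders that found no customer record
unmatched = int((checked["_merge"] != "both").sum())

# Number of distinct orders per person
orders_per_user = checked.groupby("customer_unique_id")["order_id"].nunique()

print(unmatched)
print(orders_per_user.to_dict())
```

If the one-to-one assumption does not hold for the real files, `validate="many_to_one"` is the looser check that still guards against the join multiplying order rows.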


# Load and Group Order Items Data

In [5]:
# Read the order items dataset into a Pandas DataFrame
order_items = pd.read_csv(
    '../data/Olist/olist_order_items_dataset.csv'
)

# Select relevant columns: order_id, product_id, price
order_items = order_items.loc[:, ['order_id', 'product_id', 'price']]

# Group by order_id, product_id, and price, counting the quantity of each item
order_items = order_items.groupby(["order_id", "product_id", "price"]).size().reset_index(name="quantity")

# Display the first few rows of the grouped DataFrame
order_items.head()

Unnamed: 0,order_id,product_id,price,quantity
0,00010242fe8c5a6d1ba2dd792cb16214,4244733e06e7ecb4970a6e2683c13e61,58.9,1
1,00018f77f2f0320c557190d7a144bdd3,e5f2d52b802189ee658865ca93d83a8f,239.9,1
2,000229ec398224ef6ca0657da4fc703e,c777355d18b72b67abbeef9df44fd0fd,199.0,1
3,00024acbcdf0a6daa1e931b038114c75,7634da152a4610f1595efa32f14722fc,12.99,1
4,00042b26cf59d7ce69dfabb4e55b4fd9,ac6c3623068f30de03045865e4e10089,199.9,1


In [6]:
# Spot-check an order that contains several units of the same product
order_items[order_items['order_id']=='fffb9224b6fc7c43ebb0904318b10b5f']

Unnamed: 0,order_id,product_id,price,quantity
102418,fffb9224b6fc7c43ebb0904318b10b5f,43423cdffde7fda63d0414ed38c11a73,55.0,4
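
The quantity column comes from counting identical rows. A small sketch on toy data, assuming the raw file holds one row per unit sold; the IDs and prices are hypothetical.

```python
import pandas as pd

# Toy order items (hypothetical IDs): order o1 contains two units of p1
items_toy = pd.DataFrame({
    "order_id": ["o1", "o1", "o1", "o2"],
    "product_id": ["p1", "p1", "p2", "p1"],
    "price": [55.0, 55.0, 12.5, 55.0],
})

# size() counts the rows in each (order, product, price) group
quantities = (
    items_toy.groupby(["order_id", "product_id", "price"])
    .size()
    .reset_index(name="quantity")
)

print(quantities)
```

Because `price` is part of the grouping key, the same product sold at two different prices within one order would end up as two separate rows.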


# Load and Aggregate Order Reviews Data

In [7]:
# Read the order reviews dataset into a Pandas DataFrame
order_reviews = pd.read_csv(
    '../data/Olist/olist_order_reviews_dataset.csv'
)

# Select relevant columns: order_id, review_score
order_reviews = order_reviews.loc[:, ['order_id', 'review_score']]

# Group by order_id and calculate the mean review_score for each order
order_reviews = order_reviews.groupby('order_id')['review_score'].mean().reset_index()

# Display the first few rows of the aggregated DataFrame
order_reviews.head()

Unnamed: 0,order_id,review_score
0,00010242fe8c5a6d1ba2dd792cb16214,5.0
1,00018f77f2f0320c557190d7a144bdd3,4.0
2,000229ec398224ef6ca0657da4fc703e,5.0
3,00024acbcdf0a6daa1e931b038114c75,4.0
4,00042b26cf59d7ce69dfabb4e55b4fd9,5.0


In [8]:
# Spot-check the aggregated review score of a single order
order_reviews[order_reviews['order_id']=='0035246a40f520710769010f752e7507']

Unnamed: 0,order_id,review_score
84,0035246a40f520710769010f752e7507,5.0
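
Averaging matters only for orders that have more than one review. A minimal sketch with hypothetical scores shows what the aggregation does in that case.

```python
import pandas as pd

# Toy reviews (hypothetical): order o1 was reviewed twice
reviews_toy = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "review_score": [5, 2, 4],
})

# One row per order, holding the mean of its review scores
mean_scores = (
    reviews_toy.groupby("order_id")["review_score"]
    .mean()
    .reset_index()
)

print(mean_scores)
```

The mean turns an integer rating into a float, so an order reviewed twice can carry a fractional score such as 3.5.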


# Load Products Data

In [9]:
# Read the products dataset into a Pandas DataFrame
products = pd.read_csv(
    '../data/Olist/olist_products_dataset.csv'
)

# Select relevant columns: product_id, product_category_name
products = products.loc[:, ['product_id', 'product_category_name']]

# Display the first few rows of the DataFrame
products.head()

Unnamed: 0,product_id,product_category_name
0,1e9e8ef04dbcff4541ed26657ea517e5,perfumaria
1,3aa071139cb16b67ca9e5dea641aaa2f,artes
2,96bd76ec8810374ed1b65e291975717f,esporte_lazer
3,cef67bcfe19066a932b7673e239eb23d,bebes
4,9dc1a7de274444849c219cff195d0b71,utilidades_domesticas


In [10]:
# Spot-check a product whose category name is missing
products[products['product_id']=='a1804276d9941ac0733cfd409f5206eb']

Unnamed: 0,product_id,product_category_name
23158,a1804276d9941ac0733cfd409f5206eb,


# Load Product Category Names Data

In [11]:
# Read the product category names translation dataset into a Pandas DataFrame
products_category_name = pd.read_csv(
    '../data/Olist/product_category_name_translation.csv'
)

# Display the first few rows of the DataFrame
products_category_name.head()

Unnamed: 0,product_category_name,product_category_name_english
0,beleza_saude,health_beauty
1,informatica_acessorios,computers_accessories
2,automotivo,auto
3,cama_mesa_banho,bed_bath_table
4,moveis_decoracao,furniture_decor


# Merge Products with Category Names

In [12]:
# Merge products and product category names DataFrames on 'product_category_name' using a left join
merged_products = pd.merge(products, products_category_name, on='product_category_name', how='left')

# Drop the 'product_category_name' column from the merged DataFrame
merged_products = merged_products.drop(columns=['product_category_name'])

# Display the first few rows of the merged DataFrame
merged_products.head()

Unnamed: 0,product_id,product_category_name_english
0,1e9e8ef04dbcff4541ed26657ea517e5,perfumery
1,3aa071139cb16b67ca9e5dea641aaa2f,art
2,96bd76ec8810374ed1b65e291975717f,sports_leisure
3,cef67bcfe19066a932b7673e239eb23d,baby
4,9dc1a7de274444849c219cff195d0b71,housewares
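
A left join keeps every product, so products with a missing category, or a category absent from the translation table, come out with an empty English name. The sketch below illustrates this on toy data; `categoria_nova` is a made-up category, and `UNKNOWN` is the placeholder this notebook applies in its later cleaning step.

```python
import pandas as pd

# Toy products: p2 has no category, p3 has a category with no translation
products_toy = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["perfumaria", None, "categoria_nova"],
})
translation_toy = pd.DataFrame({
    "product_category_name": ["perfumaria"],
    "product_category_name_english": ["perfumery"],
})

# how="left" keeps all three products, matched or not
translated = pd.merge(
    products_toy, translation_toy, on="product_category_name", how="left"
)

# Rows that received no English name
missing = translated["product_category_name_english"].isna()

# Replace the gaps with an explicit placeholder
translated["product_category_name_english"] = (
    translated["product_category_name_english"].fillna("UNKNOWN")
)

print(int(missing.sum()))
print(translated["product_category_name_english"].tolist())
```

Counting `missing` before filling shows how many products will end up in the placeholder category.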


# Data Preparation: Merging, Cleaning, and Transforming

In [13]:
# Merge merged_df with order_items DataFrame on 'order_id' using a left join
df = pd.merge(merged_df, order_items, on='order_id', how='left')

# Merge the resulting DataFrame with order_reviews DataFrame on 'order_id' using a left join
df = pd.merge(df, order_reviews, on='order_id', how='left')

# Merge the resulting DataFrame with merged_products DataFrame on 'product_id' using a left join
df = pd.merge(df, merged_products, on='product_id', how='left')

# Drop rows with missing values in columns 'product_id', 'price', and 'quantity'
df.dropna(subset=['product_id', 'price', 'quantity'], inplace=True)

# Fill missing values in 'review_score' column with 0
df['review_score'] = df['review_score'].fillna(0)

# Fill missing values in 'product_category_name_english' column with 'UNKNOWN'
df['product_category_name_english'] = df['product_category_name_english'].fillna('UNKNOWN')

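# Convert 'order_purchase_timestamp' to datetime, then to whole seconds since the Unix epoch
# (subtracting the epoch avoids relying on the platform's default int size or the datetime resolution)
df['timestamp'] = (pd.to_datetime(df['order_purchase_timestamp']) - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)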

# Drop the 'order_purchase_timestamp' column from the DataFrame
df.drop('order_purchase_timestamp', axis=1, inplace=True)

# Rename columns 'customer_unique_id' to 'user_id' and 'product_category_name_english' to 'product_category'
df.rename(columns={'customer_unique_id': 'user_id', 'product_category_name_english': 'product_category'}, inplace=True)

# Display the first few rows of the DataFrame
df.head()

Unnamed: 0,order_id,user_id,customer_city,product_id,price,quantity,review_score,product_category,timestamp
0,e481f51cbdc54678b7cc49136f2d6af7,7c396fd4830fd04220f754e42b4e5bff,sao paulo,87285b34884572647811a353c7ac498a,29.99,1.0,4.0,housewares,1506941793
1,53cdb2fc8bc7dce0b6741e2150273451,af07308b275d755c9edb36a90c618231,barreiras,595fac2a385ac33a80bd5114aec74eb8,118.7,1.0,4.0,perfumery,1532464897
2,47770eb9100c2d0c44946d9cf07ec65d,3a653a41f6f9fc3d2a113cf8398680e8,vianopolis,aa4383b373c6aca5d8797843e5594415,159.9,1.0,5.0,auto,1533717529
3,949d5b44dbf5de918fe9c16f97b45f8a,7c142cf63193a1473d2e66489a9ae977,sao goncalo do amarante,d0b61bfb1de832b15ba9d266ca96e5b0,45.0,1.0,5.0,pet_shop,1511033286
4,ad21c59c0840e6cb83a9ceb5573f8159,72632f0f9dd73dfee390c9b22eb56dd6,santo andre,65266b2da20d04dbe00c5c2d3bb7859e,19.9,1.0,5.0,stationery,1518556719
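
The `timestamp` column above holds Unix seconds. A minimal sketch of one way to do that conversion and reverse it, using the first two purchase times from the table; the variable names are illustrative. The timestamps carry no time zone, so the arithmetic treats them as if they were UTC.

```python
import pandas as pd

# The first two purchase timestamps shown in the table above
stamps = pd.Series(["2017-10-02 10:56:33", "2018-07-24 20:41:37"])
parsed = pd.to_datetime(stamps)

# Subtract the epoch, then floor-divide by one second to get integer seconds
epoch_seconds = (parsed - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Convert back to datetimes to confirm nothing was lost
restored = pd.to_datetime(epoch_seconds, unit="s")

print(epoch_seconds.tolist())
print(restored.tolist())
```

The resulting integers match the `timestamp` values in the first two output rows, and the round trip recovers the original datetimes.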


# Generate SKU for Product Categories

In [14]:
# Selecting a subset of columns: product_id, product_category and dropping duplicate rows
products_subset = df.loc[:, ['product_id', 'product_category']].drop_duplicates()

# Generating SKU by numbering products from 0 within each category, in order of first appearance
products_subset['SKU'] = products_subset.groupby('product_category').cumcount()

# Display the first few rows of the DataFrame with generated SKU
products_subset.head()

Unnamed: 0,product_id,product_category,SKU
0,87285b34884572647811a353c7ac498a,housewares,0
1,595fac2a385ac33a80bd5114aec74eb8,perfumery,0
2,aa4383b373c6aca5d8797843e5594415,auto,0
3,d0b61bfb1de832b15ba9d266ca96e5b0,pet_shop,0
4,65266b2da20d04dbe00c5c2d3bb7859e,stationery,0
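
Every row in the preview has SKU 0 because each one is the first product seen in its category. A toy sketch with a repeated category makes the numbering visible; the product IDs are hypothetical, and the `label` column is added here only to show how the number combines with the category name.

```python
import pandas as pd

# Toy product/category pairs (hypothetical IDs), already de-duplicated
pairs_toy = pd.DataFrame({
    "product_id": ["p1", "p2", "p3", "p4"],
    "product_category": ["auto", "baby", "auto", "auto"],
})

# cumcount numbers the rows of each category 0, 1, 2, ... in row order
pairs_toy["SKU"] = pairs_toy.groupby("product_category").cumcount()

# Readable label built from the category and its running number
pairs_toy["label"] = (
    pairs_toy["product_category"] + " SKU " + pairs_toy["SKU"].astype(str)
)

print(pairs_toy)
```

The numbers depend on row order, so re-sorting the input before this step would assign different SKUs to the same products.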


# Merge DataFrames and Create Final DataFrame

In [15]:
# Merge df with products_subset DataFrame on 'product_id' and 'product_category' using a left join
master_df = pd.merge(df, products_subset, on=['product_id', 'product_category'], how='left')

# Combine product category and SKU into a readable label that replaces the original hashed product_id
master_df['product_id'] = master_df['product_category'] + ' SKU ' + master_df['SKU'].astype(str)

# Drop the 'SKU' column from the DataFrame
master_df.drop('SKU', axis=1, inplace=True)

# Define the desired column order
column_order = ['order_id', 'timestamp', 'user_id', 'customer_city', 'product_category', 'product_id', 'quantity', 'price', 'review_score']

# Rearrange columns in the DataFrame based on the defined order
master_df = master_df[column_order]

# Display the first few rows of the master DataFrame
master_df.head()

Unnamed: 0,order_id,timestamp,user_id,customer_city,product_category,product_id,quantity,price,review_score
0,e481f51cbdc54678b7cc49136f2d6af7,1506941793,7c396fd4830fd04220f754e42b4e5bff,sao paulo,housewares,housewares SKU 0,1.0,29.99,4.0
1,53cdb2fc8bc7dce0b6741e2150273451,1532464897,af07308b275d755c9edb36a90c618231,barreiras,perfumery,perfumery SKU 0,1.0,118.7,4.0
2,47770eb9100c2d0c44946d9cf07ec65d,1533717529,3a653a41f6f9fc3d2a113cf8398680e8,vianopolis,auto,auto SKU 0,1.0,159.9,5.0
3,949d5b44dbf5de918fe9c16f97b45f8a,1511033286,7c142cf63193a1473d2e66489a9ae977,sao goncalo do amarante,pet_shop,pet_shop SKU 0,1.0,45.0,5.0
4,ad21c59c0840e6cb83a9ceb5573f8159,1518556719,72632f0f9dd73dfee390c9b22eb56dd6,santo andre,stationery,stationery SKU 0,1.0,19.9,5.0
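
Since the readable label replaces the original product hash, it can be worth confirming that the two identify the same products. The sketch below assumes a copy of the original hash was kept in a hypothetical `original_id` column, which this notebook does not do; the data is made up.

```python
import pandas as pd

# Toy rows holding both the original hash (hypothetical original_id column)
# and the readable label; p1 was purchased twice
mapping_toy = pd.DataFrame({
    "original_id": ["p1", "p2", "p3", "p1"],
    "product_id": ["auto SKU 0", "baby SKU 0", "auto SKU 1", "auto SKU 0"],
})

# Each original hash should map to exactly one label, and vice versa
labels_per_original = mapping_toy.groupby("original_id")["product_id"].nunique()
originals_per_label = mapping_toy.groupby("product_id")["original_id"].nunique()

is_one_to_one = bool(
    (labels_per_original == 1).all() and (originals_per_label == 1).all()
)

print(is_one_to_one)
```

A `False` result would mean that two products share a label or that one product received two labels.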


# Write Final DataFrame to CSV

In [16]:
# Export the master DataFrame to a CSV file without including the index
master_df.to_csv('../data/clean_olist_data.csv', index=False)
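
A quick round trip can confirm that the export reads back cleanly. This sketch writes a toy frame to an in-memory buffer rather than to disk; the frame is a made-up subset of the final columns.

```python
import io

import pandas as pd

# Toy stand-in for a few columns of the final DataFrame
final_toy = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "timestamp": [1506941793, 1532464897],
    "quantity": [1.0, 4.0],
})

# Write without the index, as in the export above
buffer = io.StringIO()
final_toy.to_csv(buffer, index=False)

# Rewind the buffer and read it back
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.columns.tolist())
print(reloaded.dtypes.astype(str).to_dict())
```

With `index=False` the reloaded frame has no extra `Unnamed: 0` column, and the numeric columns are parsed back as numbers.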

By combining several Olist source tables and applying the cleaning and transformation steps above, I have prepared a single dataset that serves as a foundation for synthesizing data and building rank-based recommendation models. It brings together customer, order, product, and review information, covering some of the key aspects of e-commerce operations data.