## <a id='toc1_1_'></a>[Data Combining & Data Export (2 of 2)](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Data Combining & Data Export (2 of 2)](#toc1_1_)    
  - [I. Data Combining](#toc1_2_)    
    - [I.1. Data overview](#toc1_2_1_)    
    - [I.2. Ensure matching data types on key columns](#toc1_2_2_)    
    - [I.3. Merge dataframes](#toc1_2_3_)    
  - [II. Data Export](#toc1_3_)    


In [1]:
# import libraries
import pandas as pd
import numpy as np
import os

In [2]:
# create a path to the directory
path = r'C:\Users\Ansgar.S\Uyen\OneDrive\Documents\Data Immersion\Achievement IV - Python Fundamentals for Data Analysts\02-2023 Instacart Basket Analysis'

# import the 'orders_products_combined.pkl' dataset
df_ords_combined = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_combined.pkl'))

# import the 'products_cleaned.pkl' dataset
df_prods = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'products_cleaned.pkl'))

## <a id='toc1_2_'></a>[I. Data Combining](#toc0_)

### <a id='toc1_2_1_'></a>[I.1. Data overview](#toc0_)

In [4]:
# check the dimensions of both dataframes
print('Number of rows and columns of df_ords_combined:')
df_ords_combined.shape

Number of rows and columns of df_ords_combined:


(32434489, 11)

In [5]:
print('Number of rows and columns of df_prods:')
df_prods.shape

Number of rows and columns of df_prods:


(49672, 5)

In [6]:
# check the output of both dataframes
print('Output of df_ords_combined:')
df_ords_combined.head(5)

Output of df_ords_combined:


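| | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | is_new_customer | product_id | add_to_cart_order | reordered | _merge |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 196 | 1 | 0 | both |
| 1 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 14084 | 2 | 0 | both |
| 2 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 12427 | 3 | 0 | both |
| 3 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 26088 | 4 | 0 | both |
| 4 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 26405 | 5 | 0 | both |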


In [7]:
print('Output of df_prods:')
df_prods.head(5)

Output of df_prods:


| | product_id | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 |
| 1 | 2 | All-Seasons Salt | 104 | 13 | 9.3 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 | 4.5 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 | 10.5 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 | 4.3 |


### <a id='toc1_2_2_'></a>[I.2. Ensure matching data types on key columns](#toc0_)

In [8]:
# check the data types in both dataframes
print('Data types in df_ords_combined:') 
df_ords_combined.dtypes

Data types in df_ords_combined:


order_id                    object
user_id                     object
order_number                 int64
orders_day_of_week           int64
order_hour_of_day            int64
days_since_prior_order     float64
is_new_customer               bool
product_id                  object
add_to_cart_order            int64
reordered                    int64
_merge                    category
dtype: object

In [9]:
print('Data types in df_prods:') 
df_prods.dtypes

Data types in df_prods:


product_id        object
product_name      object
aisle_id          object
department_id     object
prices           float64
dtype: object
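
Both `product_id` columns are already `object`, so no conversion is needed here. For cases where the key dtypes differ, the following is a minimal sketch of how they could be aligned before merging. The toy frames and the helper `align_key_dtype` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for df_prods and df_ords_combined (hypothetical values).
prods = pd.DataFrame({"product_id": ["1", "2", "3"], "prices": [5.8, 9.3, 4.5]})
ords = pd.DataFrame({"order_id": ["10", "11"], "product_id": [1, 3]})  # integer key


def align_key_dtype(left, right, key):
    """Return copies of both frames with `key` cast to str if the dtypes differ."""
    if left[key].dtype != right[key].dtype:
        left = left.assign(**{key: left[key].astype(str)})
        right = right.assign(**{key: right[key].astype(str)})
    return left, right


prods_aligned, ords_aligned = align_key_dtype(prods, ords, "product_id")

# After alignment the key columns share a dtype and can be merged safely.
same_dtype = prods_aligned["product_id"].dtype == ords_aligned["product_id"].dtype
merged = prods_aligned.merge(ords_aligned, on="product_id", how="inner")
print(same_dtype, merged.shape)
```

Casting to `str` mirrors the choice already made in this project, where identifier columns are stored as `object` rather than as numbers.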

### <a id='toc1_2_3_'></a>[I.3. Merge dataframes](#toc0_)

In [10]:
# merge df_prods and df_ords_combined on product_id; the indicator column is named 'exists' because df_ords_combined already contains a '_merge' column from an earlier merge
df_merged = df_prods.merge(df_ords_combined, on = 'product_id', how = 'inner', indicator = 'exists')
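
The merge above can be reproduced on a small scale. The sketch below uses hypothetical toy data to show why the indicator gets a custom name when the right-hand frame still carries a `_merge` column, and adds the optional `validate='one_to_many'` argument (not used in the original cell) as one way to confirm that `product_id` is unique on the products side.

```python
import pandas as pd

# Toy products table: one row per product_id (hypothetical values).
prods = pd.DataFrame({
    "product_id": ["1", "2"],
    "product_name": ["Cookies", "Salt"],
})

# Toy orders table that still carries a '_merge' flag from an earlier merge.
ords = pd.DataFrame({
    "order_id": ["10", "10", "11"],
    "product_id": ["1", "2", "1"],
    "_merge": ["both", "both", "both"],
})

# A custom indicator name avoids a clash with the existing '_merge' column.
# validate='one_to_many' raises pandas.errors.MergeError if product_id is
# duplicated in the left (products) frame.
merged = prods.merge(
    ords,
    on="product_id",
    how="inner",
    indicator="exists",
    validate="one_to_many",
)
print(merged)
```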

In [11]:
# check the dimensions and output of the merged dataframe
print('Number of rows and columns of df_merged:')
df_merged.shape

Number of rows and columns of df_merged:


(32404859, 16)

In [12]:
print('Output of df_merged:')
df_merged.head(5)

Output of df_merged:


| | product_id | product_name | aisle_id | department_id | prices | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | is_new_customer | add_to_cart_order | reordered | _merge | exists |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 3139998 | 138 | 28 | 6 | 11 | 3.0 | False | 5 | 0 | both | both |
| 1 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 1977647 | 138 | 30 | 6 | 17 | 20.0 | False | 1 | 1 | both | both |
| 2 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 389851 | 709 | 2 | 0 | 21 | 6.0 | False | 20 | 0 | both | both |
| 3 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 652770 | 764 | 1 | 3 | 13 | NaN | True | 10 | 0 | both | both |
| 4 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 1813452 | 764 | 3 | 4 | 17 | 9.0 | False | 11 | 1 | both | both |


In [13]:
# check the values in column 'exists'
print('Check the values in column exists:')
df_merged['exists'].value_counts()

Check the values in column exists:


both          32404859
left_only            0
right_only           0
Name: exists, dtype: int64
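
Because the merge uses `how = 'inner'`, only matched rows are kept, so the flag can only ever be `both`. The row counts shown earlier (32,434,489 rows in `df_ords_combined` versus 32,404,859 after the merge) indicate that 29,630 order rows found no match in `df_prods`. One possible way to count unmatched rows on each side is an outer merge, sketched here on hypothetical toy data rather than on the full dataset.

```python
import pandas as pd

# Toy data (hypothetical): products 2 and 3 are never ordered,
# and product 9 is ordered but missing from the products table.
prods = pd.DataFrame({
    "product_id": ["1", "2", "3"],
    "product_name": ["Cookies", "Salt", "Tea"],
})
ords = pd.DataFrame({
    "order_id": ["10", "11", "12"],
    "product_id": ["1", "1", "9"],
})

# An outer merge keeps unmatched rows from both sides, so the indicator
# column can take all three values.
check = prods.merge(ords, on="product_id", how="outer", indicator="exists")
counts = check["exists"].value_counts()
print(counts)
```

On the real data, an outer merge of this size may need considerably more memory than the inner merge, so it is probably best run once as a diagnostic.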

In [14]:
# clean up the merged dataframe by deleting columns '_merge' and 'exists'
df_merged = df_merged.drop(columns = ['_merge', 'exists'])
print('Output of df_merged:')
df_merged.head(5)

Output of df_merged:


| | product_id | product_name | aisle_id | department_id | prices | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | is_new_customer | add_to_cart_order | reordered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 3139998 | 138 | 28 | 6 | 11 | 3.0 | False | 5 | 0 |
| 1 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 1977647 | 138 | 30 | 6 | 17 | 20.0 | False | 1 | 1 |
| 2 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 389851 | 709 | 2 | 0 | 21 | 6.0 | False | 20 | 0 |
| 3 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 652770 | 764 | 1 | 3 | 13 | NaN | True | 10 | 0 |
| 4 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 1813452 | 764 | 3 | 4 | 17 | 9.0 | False | 11 | 1 |


## <a id='toc1_3_'></a>[II. Data Export](#toc0_)

In [15]:
# export df_merged in .pkl format
df_merged.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged.pkl'))
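
After an export like the one above, a quick round-trip check can confirm that the file reloads with the same contents. This is a minimal sketch on a hypothetical toy frame written to a temporary directory; the real export uses the project `path` instead.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for df_merged (hypothetical values).
df = pd.DataFrame({"product_id": ["1", "2"], "prices": [5.8, 9.3]})

# Write the pickle to a temporary directory, then read it straight back.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "orders_products_merged.pkl")
    df.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# DataFrame.equals compares values, dtypes, and shape.
round_trip_ok = reloaded.equals(df)
print(round_trip_ok, reloaded.shape)
```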