# <a id='toc1_'></a>[Data Combining & Data Export (1 of 2)](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Data Combining & Data Export (1 of 2)](#toc1_)    
  - [I. Data Combining](#toc1_1_)    
    - [I.1. Data overview](#toc1_1_1_)    
    - [I.2. Ensure matching data types on key columns](#toc1_1_2_)    
    - [I.3. Merge dataframes](#toc1_1_3_)    
  - [II. Data Export](#toc1_2_)    


In [3]:
# import libraries
import pandas as pd
import numpy as np
import os

In [4]:
# create a path to the directory
path = r'C:\Users\Ansgar.S\Uyen\OneDrive\Documents\Data Immersion\Achievement IV - Python Fundamentals for Data Analysts\02-2023 Instacart Basket Analysis'

# import the 'orders_cleaned.pkl' dataset
df_ords = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_cleaned.pkl'))

# import the 'order_products__prior.csv' dataset
df_ords_prior = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'order_products__prior.csv'))
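
The notebook later converts `order_id` and `product_id` in `df_ords_prior` to strings. One alternative is to request that type at import time through the `dtype` argument of `pd.read_csv`. The following is a minimal sketch, assuming a small in-memory CSV (`sample_csv`, a hypothetical stand-in for the real file) rather than the full dataset:

```python
import io
import pandas as pd

# hypothetical in-memory stand-in for 'order_products__prior.csv'
sample_csv = io.StringIO(
    "order_id,product_id,add_to_cart_order,reordered\n"
    "2,33120,1,1\n"
    "2,28985,2,1\n"
)

# read the identifier columns as strings; the other columns keep their inferred types
df_sample = pd.read_csv(sample_csv, dtype={'order_id': 'str', 'product_id': 'str'})

print(df_sample.dtypes)
```

Whether this is preferable to converting after import depends on the workflow; it mainly avoids holding an integer copy and a string copy of the same columns at once.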

## <a id='toc1_1_'></a>[I. Data Combining](#toc0_)

### <a id='toc1_1_1_'></a>[I.1. Data overview](#toc0_)

In [5]:
# check the dimensions of both dataframes
print('Number of rows and columns of df_ords:')
df_ords.shape

Number of rows and columns of df_ords:


(3421083, 7)

In [6]:
print('Number of rows and columns of df_ords_prior:') 
df_ords_prior.shape

Number of rows and columns of df_ords_prior:


(32434489, 4)

In [7]:
# preview the first five rows of both dataframes
print('Output of df_ords:') 
df_ords.head(5)

Output of df_ords:


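| | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | is_new_customer |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 1 |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 0 |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | 0 |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | 0 |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | 0 |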


In [8]:
print('Output of df_ords_prior:')
df_ords_prior.head(5)

Output of df_ords_prior:


| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |


### <a id='toc1_1_2_'></a>[I.2. Ensure matching data types on key columns](#toc0_)

In [9]:
# check the data types in both dataframes
print('Data types in df_ords:') 
df_ords.dtypes

Data types in df_ords:


order_id                   object
user_id                    object
order_number                int64
orders_day_of_week          int64
order_hour_of_day           int64
days_since_prior_order    float64
is_new_customer             int64
dtype: object

In [10]:
# change the data type of column 'is_new_customer' to boolean
df_ords['is_new_customer'] = df_ords['is_new_customer'].astype('bool')

print('Data type of column is_new_customer:')
df_ords['is_new_customer'].dtype

Data type of column is_new_customer:


dtype('bool')
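
Casting an integer flag to boolean maps every non-zero value to `True`, so it can be worth confirming first that the column holds only 0 and 1. The following is a minimal sketch on a hypothetical sample series (`flags`), not on the real `df_ords` column:

```python
import pandas as pd

# hypothetical sample of an integer flag column
flags = pd.Series([1, 0, 0, 1], name='is_new_customer')

# confirm the column contains nothing but 0 and 1 before casting
only_zero_one = flags.isin([0, 1]).all()

# cast to boolean: 1 becomes True, 0 becomes False
flags_bool = flags.astype('bool')

print(only_zero_one, flags_bool.dtype)
```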

In [11]:
print('Data types in df_ords_prior:') 
df_ords_prior.dtypes

Data types in df_ords_prior:


order_id             int64
product_id           int64
add_to_cart_order    int64
reordered            int64
dtype: object

In [None]:
# change the data type of columns 'order_id' and 'product_id' in df_ords_prior to string so the merge key matches df_ords
df_ords_prior = df_ords_prior.astype({'order_id': 'str', 'product_id': 'str'})

print('Data type of columns in df_ords_prior:')
df_ords_prior.dtypes
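
The reason for the conversion is that the key column must have the same type on both sides of the merge. The following is a minimal sketch, assuming two hypothetical miniature dataframes (`left`, `right`) in place of `df_ords` and `df_ords_prior`, that compares the key dtypes before and after the conversion:

```python
import pandas as pd

# hypothetical miniature versions of the two dataframes
left = pd.DataFrame({'order_id': ['1', '2'], 'user_id': ['10', '11']})
right = pd.DataFrame({'order_id': [1, 2, 2], 'product_id': [100, 200, 300]})

# the key holds strings on one side and integers on the other
keys_match_before = left['order_id'].dtype == right['order_id'].dtype

# convert the integer key to string, mirroring the df_ords_prior conversion
right = right.astype({'order_id': 'str'})
keys_match_after = left['order_id'].dtype == right['order_id'].dtype

# with matching key types the merge can pair the rows
merged_sample = left.merge(right, on='order_id')

print(keys_match_before, keys_match_after)
```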

### <a id='toc1_1_3_'></a>[I.3. Merge dataframes](#toc0_)

In [None]:
# merge df_ords and df_ords_prior on order_id, adding an indicator column ('_merge')
# note: with how='inner' only matched rows are kept, so '_merge' is 'both' for every row
df_merged_large = df_ords.merge(df_ords_prior, on='order_id', how='inner', indicator=True)
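
If each `order_id` is expected to appear once in `df_ords` and many times in `df_ords_prior` (an assumption about the data, not something this notebook verifies), the `validate` argument of `merge` can check that relationship during the merge. The following is a minimal sketch on hypothetical miniature dataframes (`orders`, `products`):

```python
import pandas as pd
from pandas.errors import MergeError

# hypothetical miniature versions of the two dataframes
orders = pd.DataFrame({'order_id': ['1', '2'], 'user_id': ['10', '11']})
products = pd.DataFrame({'order_id': ['1', '1', '2'],
                         'product_id': ['100', '200', '300']})

# validate='one_to_many' checks that order_id is unique in the left dataframe
merged = orders.merge(products, on='order_id', how='inner', validate='one_to_many')

# a duplicated key on the left side makes the same call raise MergeError
orders_dup = pd.concat([orders, orders.iloc[[0]]], ignore_index=True)
try:
    orders_dup.merge(products, on='order_id', how='inner', validate='one_to_many')
    duplicate_detected = False
except MergeError:
    duplicate_detected = True

print(len(merged), duplicate_detected)
```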

In [None]:
# check the dimensions and output of the merged dataframe
print('Number of rows and columns of df_merged_large:')
df_merged_large.shape

In [None]:
print('Output of df_merged_large:')
df_merged_large.head(5)

In [None]:
# check the values in column '_merge'
print('Check the values in column _merge:')
df_merged_large['_merge'].value_counts()
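
Because the merge above uses `how = 'inner'`, unmatched rows are dropped before `_merge` is populated, so this check can only report `both`. To see how many keys fail to match, an outer join is one option. The following is a minimal sketch on hypothetical miniature dataframes (`orders`, `products`), comparing the two join types:

```python
import pandas as pd

# hypothetical miniature dataframes with partially overlapping keys
orders = pd.DataFrame({'order_id': ['1', '2', '3']})
products = pd.DataFrame({'order_id': ['2', '3', '3', '4'],
                         'product_id': ['a', 'b', 'c', 'd']})

# inner join: only matched keys survive, so the indicator is always 'both'
inner = orders.merge(products, on='order_id', how='inner', indicator=True)

# outer join: unmatched keys are kept and flagged 'left_only' or 'right_only'
outer = orders.merge(products, on='order_id', how='outer', indicator=True)

print(inner['_merge'].value_counts())
print(outer['_merge'].value_counts())
```

On the full dataset an outer join would be larger than the inner one, so it may be more practical to run it only as a diagnostic step.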

## <a id='toc1_2_'></a>[II. Data Export](#toc0_)

In [None]:
# export df_merged_large in .pkl format
df_merged_large.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_combined.pkl'))
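
After exporting, a quick way to confirm the file was written correctly is to read it back and compare it with the dataframe in memory. The following is a minimal sketch, assuming a small hypothetical dataframe (`df_small`) and a temporary directory instead of the project path:

```python
import os
import tempfile
import pandas as pd

# hypothetical small dataframe standing in for df_merged_large
df_small = pd.DataFrame({'order_id': ['1', '2'], 'reordered': [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical file name inside a throwaway directory
    out_path = os.path.join(tmp_dir, 'orders_products_sample.pkl')
    df_small.to_pickle(out_path)
    df_reloaded = pd.read_pickle(out_path)

# equals() compares shape, values, and dtypes
round_trip_ok = df_small.equals(df_reloaded)

print(round_trip_ok)
```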