# Importing Orders and Products and Merging

# Exercise 4.6 – Task: Orders–Products–Products Merge
# Instacart Basket Analysis

This notebook performs the following steps:
1. Import the combined orders–prior dataset from the Pickle file (`orders_products_combined.pkl`)
2. Confirm the shape of the imported dataframe
3. Import the cleaned products dataset (`products_checked.csv`)
4. Merge product information into the combined orders–products data using `product_id`
5. Validate the merge using the `_merge` indicator column
6. Export the final merged dataframe (`ords_prods_merge`) in an efficient format for later analysis

In [40]:
#Import libraries
import pandas as pd 
import numpy as np
import os           

In [41]:
#Set the base path to the Instacart project folder
path = r'/Users/jessduong/Documents/CF/Achievement 4_Python/12-2025 Instacart Basket Analysis'

In [42]:
#Import the combined orders + prior dataset from the Pickle file created in Exercise 4.6a
#read_pickle is faster than re-parsing a CSV and preserves the dtypes and index exactly as exported
df_ords_prods_combined = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_combined.pkl'))

In [43]:
#Check the shape of the imported dataframe to confirm it matches the exported version
df_ords_prods_combined.shape

(32434489, 12)

In [44]:
#Display the first few rows to visually confirm that the dataframe loaded correctly
df_ords_prods_combined.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_previous_order,first_order_flag,product_id,add_to_cart_order,reordered,_merge
0,2539329,1,prior,1,2,8,0.0,True,196,1,0,both
1,2539329,1,prior,1,2,8,0.0,True,14084,2,0,both
2,2539329,1,prior,1,2,8,0.0,True,12427,3,0,both
3,2539329,1,prior,1,2,8,0.0,True,26088,4,0,both
4,2539329,1,prior,1,2,8,0.0,True,26405,5,0,both


# Validation of orders_products_combined import

The `df_ords_prods_combined` dataframe was imported from `orders_products_combined.pkl`.  
The imported dataframe contains 32,434,489 rows and 12 columns, including the `_merge` diagnostic column generated during the merge in Exercise 4.6a. This matches the shape of the dataframe exported in that exercise, which indicates that the Pickle export and import preserved all rows and columns.
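
The shape comparison above is done by eye. As a minimal sketch, the same round-trip check can be made programmatic. The toy dataframe, the temporary file name, and the variable names below are illustrative assumptions, not part of the project files:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the combined dataframe (hypothetical values)
df_toy = pd.DataFrame({
    'order_id': [2539329, 2539329, 2398795],
    'product_id': [196, 14084, 196],
    'first_order_flag': [True, True, False],
})

# Export to Pickle and read it back, as the notebook does across exercises
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'toy_combined.pkl')
    df_toy.to_pickle(pkl_path)
    df_reloaded = pd.read_pickle(pkl_path)

# Shape check (what the notebook does) plus a stricter content check
same_shape = df_reloaded.shape == df_toy.shape
same_content = df_reloaded.equals(df_toy)
print(same_shape, same_content)
```

A matching shape only shows that no rows or columns were lost; `DataFrame.equals` additionally compares values and dtypes.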

In [45]:
#Remove the diagnostic merge flag column before merging with products data
df_ords_prods_combined = df_ords_prods_combined.drop(columns=['_merge'])

# Cleaning the combined dataframe before further merges

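The `_merge` column was only used to validate the previous join between 
the orders and prior-orders datasets. Now that the merge has been confirmed, 
the column is removed to keep the dataframe clean. Removing it is also necessary 
for the next merge with the products data, because `indicator=True` creates a new 
`_merge` column and pandas will not overwrite an existing column of that name.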
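
One practical reason to drop the old flag first is that pandas refuses to create an indicator column whose name is already taken. A minimal sketch with toy frames (the `left`/`right` names and values are illustrative only):

```python
import pandas as pd

# 'left' still carries a '_merge' column from an earlier join
left = pd.DataFrame({'product_id': [1, 2], '_merge': ['both', 'both']})
right = pd.DataFrame({'product_id': [1, 2], 'product_name': ['A', 'B']})

# Merging with indicator=True while '_merge' already exists raises ValueError
try:
    left.merge(right, on='product_id', how='left', indicator=True)
    collision_error = None
except ValueError as err:
    collision_error = str(err)

# Dropping the stale flag first lets the new indicator be created
cleaned = left.drop(columns=['_merge'])
merged = cleaned.merge(right, on='product_id', how='left', indicator=True)
print(collision_error)
print(merged['_merge'].tolist())
```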

In [46]:
#Import the cleaned, wrangled, and deduplicated products dataset
#This file was created in previous exercises and stored in the Prepared Data folder
df_prods_checked = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_checked.csv'))

In [47]:
#Check the shape of the cleaned products dataframe
df_prods_checked.shape

(49672, 5)

In [48]:
#Preview the first few rows to confirm that 'product_id' and other product attributes are present
df_prods_checked.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


# Choice of join type

For this step I merge the `orders_products_combined` dataframe (every ordered product line) 
with the cleaned `products_checked` table (product attributes) on `product_id`.

I use a **left join** from `orders_products_combined` to `products_checked` because:

- Each row in `orders_products_combined` represents a real Instacart order line and should be retained.
- The products table adds descriptive metadata (e.g., product name, aisle, department) but should not control which orders are kept.
- If a product was removed during earlier cleaning steps, the corresponding order line remains in the data, with missing values only for the product attributes.

Using an inner join here would drop order lines whenever the product does not appear in the cleaned products table, which would artificially reduce the transaction data and risk biasing the analysis.
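
To make the difference concrete, here is a small sketch comparing the two join types on toy data. The frames and the product ID `999` are made up for illustration:

```python
import pandas as pd

# Three order lines; product 999 has no entry in the products table
order_lines = pd.DataFrame({
    'order_id': [1, 1, 2],
    'product_id': [196, 999, 196],
})
products = pd.DataFrame({
    'product_id': [196],
    'product_name': ['Soda'],
})

left_joined = order_lines.merge(products, on='product_id', how='left')
inner_joined = order_lines.merge(products, on='product_id', how='inner')

# The left join keeps every order line; the inner join drops the unmatched one
print(len(order_lines), len(left_joined), len(inner_joined))

# In the left join, the unmatched line has a missing product_name
missing_names = int(left_joined['product_name'].isna().sum())
print(missing_names)
```

With the left join, the order line for product `999` survives with a missing `product_name`; with the inner join it disappears from the data entirely.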

In [49]:
#Merge the combined orders + prior dataframe with the cleaned products dataframe
#Left join ensures that all order–product combinations are kept
ords_prods_merge = df_ords_prods_combined.merge(
    df_prods_checked,
    on='product_id',   # key column present in both dataframes
    how='left',        # keep all rows from df_ords_prods_combined
    indicator=True,    # add '_merge' column to track join status
)

In [50]:
#Inspect the first few rows of the final merged dataframe
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_previous_order,first_order_flag,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge
0,2539329,1,prior,1,2,8,0.0,True,196,1,0,Soda,77.0,7.0,9.0,both
1,2539329,1,prior,1,2,8,0.0,True,14084,2,0,Organic Unsweetened Vanilla Almond Milk,91.0,16.0,12.5,both
2,2539329,1,prior,1,2,8,0.0,True,12427,3,0,Original Beef Jerky,23.0,19.0,4.4,both
3,2539329,1,prior,1,2,8,0.0,True,26088,4,0,Aged White Cheddar Popcorn,23.0,19.0,4.7,both
4,2539329,1,prior,1,2,8,0.0,True,26405,5,0,XL Pick-A-Size Paper Towel Rolls,54.0,17.0,1.0,both


In [51]:
#Review the shape of the merged dataframe (row count matches df_ords_prods_combined only if product_id is unique in the products table)
ords_prods_merge.shape

(32435059, 16)

In [52]:
#Check how many rows matched between the orders_products_combined and products_checked dataframes
ords_prods_merge['_merge'].value_counts()

_merge
both          32404859
left_only        30200
right_only           0
Name: count, dtype: int64

# Merge flag validation for products join

The `_merge` indicator column shows how rows were combined during the join:

- `both`: 32,404,859 rows  
- `left_only`: 30,200 rows  
- `right_only`: 0 rows  

About 99.9% of rows are labeled `both`, confirming that nearly all ordered `product_id` values exist in both 
the orders–products dataset and the cleaned products table. The 30,200 `left_only` rows (about 0.09% of the total) 
are order lines whose `product_id` has no corresponding record in the cleaned products table 
(e.g., products that may have been filtered out during earlier data quality steps).
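
A possible follow-up, sketched here on toy data, is to list which `product_id` values account for the `left_only` rows and what share of the merged data they represent. The frames and values below are hypothetical:

```python
import pandas as pd

# Toy order lines; products 999 never appears in the products table
order_lines = pd.DataFrame({
    'order_id': [1, 1, 2, 3],
    'product_id': [196, 999, 999, 500],
})
products = pd.DataFrame({
    'product_id': [196, 500],
    'product_name': ['Soda', 'Tea'],
})

merged = order_lines.merge(products, on='product_id', how='left', indicator=True)

# Isolate the unmatched order lines and count them per product_id
unmatched = merged[merged['_merge'] == 'left_only']
unmatched_counts = unmatched['product_id'].value_counts()
unmatched_share = len(unmatched) / len(merged)

print(unmatched_counts)
print(unmatched_share)
```

Knowing which product IDs are affected makes it easier to decide later whether those order lines should be kept, imputed, or excluded.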

There are no `right_only` rows. This is expected: a left join never keeps rows that exist only in the 
right-hand (products) table, so products that do not appear in any order are excluded.

The merged dataframe (`ords_prods_merge`) has 32,435,059 rows, which is 570 more than the 32,434,489 rows in the original 
`orders_products_combined` dataframe. A left join returns one output row for every matching right-hand row, so this increase 
means that some `product_id` values appear more than once in the products table and the affected order lines were repeated. This suggests the products table still contains repeated `product_id` values despite the earlier deduplication. For the purposes of this project, we proceed with the current merged dataset but note this as a data quality consideration for future work.
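
As a sketch of how this could be investigated in future work, the example below shows duplicate keys inflating a left join, how the `validate='m:1'` argument of `merge` flags the problem, and one possible remedy. The toy frames are made up, and keeping the first row per `product_id` is an assumption for illustration, not the project's chosen rule:

```python
import pandas as pd

order_lines = pd.DataFrame({'order_id': [1, 2, 3], 'product_id': [196, 196, 500]})

# product_id 196 appears twice with different attributes (hypothetical)
products = pd.DataFrame({
    'product_id': [196, 196, 500],
    'product_name': ['Soda', 'Soda (alt listing)', 'Tea'],
})

# 1. List every products row whose key is repeated
dup_keys = products[products.duplicated(subset='product_id', keep=False)]

# 2. A plain left join repeats each order line once per matching products row
inflated = order_lines.merge(products, on='product_id', how='left')
extra_rows = len(inflated) - len(order_lines)

# 3. validate='m:1' raises MergeError when right-hand keys are not unique
try:
    order_lines.merge(products, on='product_id', how='left', validate='m:1')
    validate_error = None
except pd.errors.MergeError as err:
    validate_error = str(err)

# 4. One option: keep a single row per product_id before merging
products_unique = products.drop_duplicates(subset='product_id', keep='first')
clean = order_lines.merge(products_unique, on='product_id', how='left', validate='m:1')

print(len(dup_keys), extra_rows, len(clean))
```

Which duplicate row to keep is a business decision; if the repeated rows differ in price or department, they should be reviewed rather than dropped blindly.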

In [53]:
#Safely remove the merge validation column if it exists
ords_prods_merge = ords_prods_merge.drop(columns=['_merge'], errors='ignore')

In [54]:
ords_prods_merge.columns

Index(['order_id', 'user_id', 'eval_set', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_previous_order', 'first_order_flag',
       'product_id', 'add_to_cart_order', 'reordered', 'product_name',
       'aisle_id', 'department_id', 'prices'],
      dtype='object')

# Cleaning the final merged dataframe

The `_merge` column was created only to validate the join results in the previous step. 
After reviewing the `_merge` value counts, I dropped this diagnostic column from `ords_prods_merge`. 
The final dataframe therefore contains only the order and product fields needed for analysis 
and is ready for export.

In [55]:
#Export the final merged dataset in Pickle format for efficient storage and loading
ords_prods_merge.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge.pkl'))

# Summary of Exercise 4.6 Task

- Imported the combined orders–prior dataset from `orders_products_combined.pkl`.
- Verified the import by checking the shape and first rows of the dataframe.
- Imported the cleaned and deduplicated products dataset from `products_checked.csv`.
- Merged product metadata into the orders–products data using a left join on `product_id`.
- Validated the merge with the `_merge` flag (32,404,859 matched rows and 30,200 unmatched rows).
- Removed the diagnostic `_merge` column so the final dataset only contains business-relevant fields.
- Exported the final analysis-ready dataframe as `ords_prods_merge.pkl` (primary working file).