## 4.6.2 Combining & Exporting Data

### This script contains the following points:

#### 1. Import the data sets into Jupyter

#### 2. Check the dimensions of the imported dataframes

#### 3. Determine a suitable way to combine the orders_products_combined dataframe with the products data set

#### 4. Confirm the results of the merge using the merge flag

#### 5. Export the newly created dataframe as ords_prods_merge in a suitable format

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

### 1. Import the data sets into Jupyter

In [13]:
# Tell Python to remember the file path
path = r'C:\Users\gre60\Desktop\30-05-2024 Instacart Basket Analysis'

In [10]:
path_2 = r'/Users/gre604/Desktop/30-05-2024 Instacart Basket Analysis'

In [11]:
# Import the orders_products_combined pickle file
orders_products_combined = pd.read_pickle(os.path.join(path_2, 'Data', 'Prepared Data', 'orders_products_combined.pkl'))

In [12]:
# Import products_checked dataset
df_prods = pd.read_csv(os.path.join(path_2, 'Data', 'Prepared Data', 'products_checked.csv'))

### 2. Check the dimensions of the imported dataframes

In [13]:
# Check the output for orders_products_combined
orders_products_combined.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,_merge
0,2539329,1,1,2,8,,yes,196,1,0,both
1,2539329,1,1,2,8,,yes,14084,2,0,both
2,2539329,1,1,2,8,,yes,12427,3,0,both
3,2539329,1,1,2,8,,yes,26088,4,0,both
4,2539329,1,1,2,8,,yes,26405,5,0,both


In [14]:
# Check the dimensions for orders_products_combined
orders_products_combined.shape

(32434489, 11)

In [15]:
# Check output for prods
df_prods.head()

Unnamed: 0.1,Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,0,1,Chocolate Sandwich Cookies,61,19,5.8
1,1,2,All-Seasons Salt,104,13,9.3
2,2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,4,5,Green Chile Anytime Sauce,5,13,4.3


In [16]:
# Check dimensions for prods
df_prods.shape

(49672, 6)

In [17]:
# Drop the redundant 'Unnamed: 0' column, which only repeats the row index
df_prods = df_prods.drop(['Unnamed: 0'], axis=1)
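An `Unnamed: 0` column typically appears when a dataframe is written to CSV together with its index and then read back without declaring that index. The following is a minimal sketch on a made-up two-row table (not the project's data) showing that passing `index_col=0` to `read_csv` is one way to avoid the extra column at import time:

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the products table
toy = pd.DataFrame({'product_id': [1, 2], 'prices': [5.8, 9.3]})

# Writing with the default index=True stores the index as an unnamed first column
csv_text = toy.to_csv()

# Reading it back without options produces an 'Unnamed: 0' column
naive = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index avoids the extra column
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(naive.columns.tolist())
print(clean.columns.tolist())
```

Writing the CSV with `index=False` in the first place would have the same effect.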

In [18]:
# Drop the _merge flag left over from the previous merge so the indicator can be used again
orders_products_combined = orders_products_combined.drop(['_merge'], axis=1)

In [19]:
# Check output for orders_products_combined after dropping column
orders_products_combined.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered
0,2539329,1,1,2,8,,yes,196,1,0
1,2539329,1,1,2,8,,yes,14084,2,0
2,2539329,1,1,2,8,,yes,12427,3,0
3,2539329,1,1,2,8,,yes,26088,4,0
4,2539329,1,1,2,8,,yes,26405,5,0


In [20]:
# Check output for prods after dropping column
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


### 3. Determine a suitable way to combine the orders_products_combined dataframe with the products data set

#### The most suitable way to combine the two dataframes is an inner join on the shared product_id column.

In [21]:
# Merge the two dataframes on product_id (merge defaults to an inner join) and flag each row's source
ords_prods_merge = df_prods.merge(orders_products_combined, on = 'product_id', indicator = True)
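When `how` is not given, `merge` performs an inner join, so only `product_id` values present in both dataframes survive. The sketch below illustrates this on small made-up tables. The names `prods` and `orders` and their values are hypothetical. The `validate='one_to_many'` argument is an optional guard that is not part of the original notebook. It raises an error if `product_id` is repeated in the left-hand table.

```python
import pandas as pd

# Hypothetical toy tables: product 3 is never ordered, product 99 is not in the catalogue
prods = pd.DataFrame({'product_id': [1, 2, 3],
                      'product_name': ['A', 'B', 'C']})
orders = pd.DataFrame({'order_id': [10, 11, 12, 13],
                       'product_id': [1, 1, 2, 99]})

# Inner join on the shared key; validate checks that each product_id is unique in prods
merged = prods.merge(orders, on='product_id', how='inner',
                     indicator=True, validate='one_to_many')

print(merged)
print(merged.shape)
```

Only the rows for products 1 and 2 remain. The unmatched product 3 and the unmatched order for product 99 are both dropped.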

In [22]:
# Check the results
ords_prods_merge.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,add_to_cart_order,reordered,_merge
0,1,Chocolate Sandwich Cookies,61,19,5.8,3139998,138,28,6,11,3.0,no,5,0,both
1,1,Chocolate Sandwich Cookies,61,19,5.8,1977647,138,30,6,17,20.0,no,1,1,both
2,1,Chocolate Sandwich Cookies,61,19,5.8,389851,709,2,0,21,6.0,no,20,0,both
3,1,Chocolate Sandwich Cookies,61,19,5.8,652770,764,1,3,13,,yes,10,0,both
4,1,Chocolate Sandwich Cookies,61,19,5.8,1813452,764,3,4,17,9.0,no,11,1,both


In [23]:
# Check the dimensions for merged dataset
ords_prods_merge.shape

(32404859, 15)

### 4. Confirm the results of the merge using the merge flag

In [24]:
# Count the rows by merge flag
ords_prods_merge['_merge'].value_counts()

both          32404859
right_only           0
left_only            0
Name: _merge, dtype: int64

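#### The merged dataframe has 32,404,859 rows, and every row is flagged "both", meaning it contains information from both input data sets. This is expected, because an inner join keeps only rows whose product_id appears in both dataframes. The merged dataframe has 29,630 fewer rows than orders_products_combined (32,434,489), so some order rows had no matching product_id in the products data set.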
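Because an inner join keeps only matched rows, its merge flag can only ever read `both`. One way to see how many rows fail to match on each side is to run the same merge as an outer join and count the flag values. This is a minimal sketch on made-up tables, with hypothetical names and values. On the full data set an outer join may need considerably more memory.

```python
import pandas as pd

# Hypothetical toy tables: product 3 is never ordered, product 99 is not in the catalogue
prods = pd.DataFrame({'product_id': [1, 2, 3],
                      'product_name': ['A', 'B', 'C']})
orders = pd.DataFrame({'order_id': [10, 11, 12, 13],
                       'product_id': [1, 1, 2, 99]})

# Outer join keeps unmatched rows from both sides and labels them in _merge
outer = prods.merge(orders, on='product_id', how='outer', indicator=True)

# left_only = products without orders, right_only = orders without a product record
counts = outer['_merge'].value_counts()
print(counts)
```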

### 5. Export the newly created dataframe as ords_prods_merge in a suitable format

In [25]:
# Export data to pkl
ords_prods_merge.to_pickle(os.path.join(path_2, 'Data','Prepared Data', 'ords_prods_merge.pkl'))
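Pickle is a reasonable choice for a dataframe of this size because it preserves dtypes and the index without a text conversion step. A quick way to gain confidence in an export is to read the file back and compare it with the original. The sketch below does this for a made-up table in a temporary directory. The file name `toy_merge.pkl` is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-row stand-in for the merged dataframe
toy = pd.DataFrame({'product_id': [1, 2], 'prices': [5.8, 9.3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_merge.pkl')
    toy.to_pickle(out_path)              # export
    restored = pd.read_pickle(out_path)  # re-import for comparison

# True when values, dtypes and index all match
print(restored.equals(toy))
```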