# Combining and Exporting Data

#### This Script Contains the Following Points:
#### 1. Importing Datasets to Jupyter
#### 2. Check the Dimensions of Datasets
#### 3. Merge Datasets
#### 4. Check Merge Using Merge Flag
#### 5. Exporting Data

### 1. Importing Datasets to Jupyter

In [7]:
#importing libraries
import pandas as pd
import numpy as np
import os

In [8]:
#importing data
path = r'/Users/kimkmiz/Documents/Instacart Basket Analysis 2024'

In [9]:
# Import dataset orders_products_combined.pkl
df_ords_prods_combined = pd.read_pickle(os.path.join(path, '02 Data', 'IC24 Prepared Data', 'orders_products_combined.pkl'))

In [10]:
# Import dataset products_clean.csv
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'IC24 Prepared Data', 'products_clean.csv'), index_col = False)
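The `Unnamed: 0` column that shows up in `df_prods` further down typically comes from a CSV that was written with its index included. The sketch below reproduces that behaviour on a small made-up frame (the `toy` data and the in-memory buffers are assumptions for illustration, not the Instacart files) and shows that writing with `index=False` avoids the extra column.

```python
import io
import pandas as pd

# Hypothetical toy data standing in for products_clean.csv
toy = pd.DataFrame({"product_id": [1, 2], "prices": [5.8, 9.3]})

# Case 1: to_csv writes the index by default, so the file gains an unnamed first column
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index_col = pd.read_csv(buf, index_col=False)

# Case 2: writing with index=False keeps the file free of that column
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
clean = pd.read_csv(buf, index_col=False)

print(with_index_col.columns.tolist())
print(clean.columns.tolist())
```

If the extra column is already in the file, it can still be dropped after import, which is what this notebook does in section 2.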

### 2. Check the Dimensions of Datasets

In [12]:
# Check the output
df_ords_prods_combined.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge
0,2539329,1,1,2,8,,196,1,0,both
1,2539329,1,1,2,8,,14084,2,0,both
2,2539329,1,1,2,8,,12427,3,0,both
3,2539329,1,1,2,8,,26088,4,0,both
4,2539329,1,1,2,8,,26405,5,0,both


In [13]:
df_ords_prods_combined.shape

(32434489, 10)

In [14]:
# Check the output
df_prods.head()

Unnamed: 0.1,Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,0,1,Chocolate Sandwich Cookies,61,19,5.8
1,1,2,All-Seasons Salt,104,13,9.3
2,2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,4,5,Green Chile Anytime Sauce,5,13,4.3


In [15]:
df_prods.shape

(49672, 6)

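**Observations on Dimensions**
- df_prods is both narrower (6 columns vs. 10) and shorter (49,672 rows vs. 32,434,489 rows) than df_ords_prods_combined
- the 'Unnamed: 0' column in df_prods appears to be a leftover index from an earlier export and is not needed
- the '_merge' column in df_ords_prods_combined is left over from the previous merge and is not needed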

In [17]:
df_prods.head()

Unnamed: 0.1,Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,0,1,Chocolate Sandwich Cookies,61,19,5.8
1,1,2,All-Seasons Salt,104,13,9.3
2,2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,4,5,Green Chile Anytime Sauce,5,13,4.3


In [18]:
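# The cell that dropped the 'Unnamed: 0' column from df_prods with drop() was accidentally deleted.
# Note: the merged output in section 3 still lists 'Unnamed: 0', so the drop may not have been applied to the df_prods used in the merge.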
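Since the original cell is lost, the following is only a guess at what such a drop could look like, shown on a made-up frame (`toy_prods` is hypothetical). Two details matter: the column label is case- and space-sensitive (`'Unnamed: 0'`, not `'unnamed:0'`), and `drop` returns a new DataFrame, so the result has to be assigned back.

```python
import pandas as pd

# Hypothetical stand-in for df_prods with the unwanted column
toy_prods = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "product_id": [1, 2],
    "product_name": ["Chocolate Sandwich Cookies", "All-Seasons Salt"],
})

# drop() does not modify the frame in place by default; reassign the result
toy_prods = toy_prods.drop(columns=["Unnamed: 0"])

remaining_columns = toy_prods.columns.tolist()
print(remaining_columns)
```

Running `head()` right after the drop is a quick way to confirm that the column is really gone before merging.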

In [19]:
#drop _merge column from df_ords_prods_combined
df_ords_prods_combined = df_ords_prods_combined.drop(['_merge'], axis=1)

In [20]:
#check output
df_ords_prods_combined.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered
0,2539329,1,1,2,8,,196,1,0
1,2539329,1,1,2,8,,14084,2,0
2,2539329,1,1,2,8,,12427,3,0
3,2539329,1,1,2,8,,26088,4,0
4,2539329,1,1,2,8,,26405,5,0


### 3. Merge Datasets

In [22]:
# Merge the updated dataframes
df_ords_prods_merge = df_prods.merge(df_ords_prods_combined, on = 'product_id', indicator = True)

In [23]:
# Check the output
df_ords_prods_merge.head()

Unnamed: 0.1,Unnamed: 0,product_id,product_name,aisle_id,department_id,prices,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,add_to_cart_order,reordered,_merge
0,0,1,Chocolate Sandwich Cookies,61,19,5.8,3139998,138,28,6,11,3.0,5,0,both
1,0,1,Chocolate Sandwich Cookies,61,19,5.8,1977647,138,30,6,17,20.0,1,1,both
2,0,1,Chocolate Sandwich Cookies,61,19,5.8,389851,709,2,0,21,6.0,20,0,both
3,0,1,Chocolate Sandwich Cookies,61,19,5.8,652770,764,1,3,13,,10,0,both
4,0,1,Chocolate Sandwich Cookies,61,19,5.8,1813452,764,3,4,17,9.0,11,1,both


### 4. Check Merge Using Merge Flag

In [25]:
df_ords_prods_merge['_merge'].value_counts()

_merge
both          32404859
left_only            0
right_only           0
Name: count, dtype: int64

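**Observations**
- merged using an inner join (the default for merge, since no how argument was passed)
- the merged dataset has 32,404,859 rows, all flagged 'both'; an inner join keeps only matching rows, so 'left_only' and 'right_only' are always 0
- the merged dataset has 29,630 fewer rows than df_ords_prods_combined (32,434,489), which suggests some order rows had no matching product_id in df_prods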
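Because an inner join discards unmatched rows before the flag is counted, the merge flag only becomes informative with an outer join. A minimal sketch on made-up data (the `toy_prods` and `toy_orders` frames are assumptions for illustration) shows the difference; `validate="one_to_many"` is an optional extra check that raises an error if `product_id` is not unique on the left side.

```python
import pandas as pd

# Hypothetical toy frames: product 3 is never ordered, product 99 is not in the catalogue
toy_prods = pd.DataFrame({"product_id": [1, 2, 3], "prices": [5.8, 9.3, 4.5]})
toy_orders = pd.DataFrame({"order_id": [10, 10, 11, 12], "product_id": [1, 2, 2, 99]})

# Inner join (the default): only matching rows survive, so every flag is 'both'
inner = toy_prods.merge(toy_orders, on="product_id", indicator=True)

# Outer join: unmatched rows are kept and flagged 'left_only' or 'right_only'
outer = toy_prods.merge(
    toy_orders,
    on="product_id",
    how="outer",
    indicator=True,
    validate="one_to_many",  # each product_id should appear once in toy_prods
)

inner_counts = inner["_merge"].value_counts()
outer_counts = outer["_merge"].value_counts()
print(inner_counts)
print(outer_counts)
```

On the full data, an outer join of this kind would show how many rows fall under `right_only`, at the cost of building a larger intermediate DataFrame.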

In [41]:
df_ords_prods_merge.shape

(32404859, 15)

In [45]:
df_ords_prods_merge.describe()

Unnamed: 0.1,Unnamed: 0,product_id,aisle_id,department_id,prices,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,add_to_cart_order,reordered
count,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,30328760.0,32404860.0,32404860.0
mean,25600.37,25598.66,71.19612,9.919792,7.79018,1710745.0,102937.2,17.1423,2.738867,13.42515,11.10408,8.352547,0.5895873
std,14085.55,14084.0,38.21139,6.281485,4.242125,987298.8,59466.1,17.53532,2.090077,4.24638,8.779064,7.127071,0.4919087
min,0.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
25%,13544.0,13544.0,31.0,4.0,4.2,855947.0,51422.0,5.0,1.0,10.0,5.0,3.0,0.0
50%,25303.0,25302.0,83.0,9.0,7.4,1711049.0,102616.0,11.0,3.0,13.0,8.0,6.0,1.0
75%,37951.0,37947.0,107.0,16.0,11.3,2565499.0,154389.0,24.0,5.0,16.0,15.0,11.0,1.0
max,49692.0,49688.0,134.0,21.0,25.0,3421083.0,206209.0,99.0,6.0,23.0,30.0,145.0,1.0


### 5. Exporting Data

In [26]:
# Export data to pkl
df_ords_prods_merge.to_pickle(os.path.join(path, '02 Data','IC24 Prepared Data', 'ords_prods_merge.pkl'))
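One way to sanity-check an export is to read the file back and compare it with the frame in memory. The sketch below does this on a small made-up frame in a temporary directory (`toy` and `toy_merge.pkl` are hypothetical names, not the project files); on the full 32-million-row dataset a cheaper check, such as comparing `shape`, may be preferable.

```python
import os
import tempfile
import pandas as pd

# Hypothetical toy frame standing in for df_ords_prods_merge
toy = pd.DataFrame({"product_id": [1, 2], "prices": [5.8, 9.3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_merge.pkl")
    toy.to_pickle(out_path)              # export
    reloaded = pd.read_pickle(out_path)  # re-import for verification

# equals() checks shape, dtypes and values
same = reloaded.equals(toy)
print(reloaded.shape, same)
```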