# 4.6.2. IC_Combining & Exporting Data

#### Context:
- We want to combine our dataframes: **`df_ords` and `df_prods`**
- `df_ords` and `df_prods` dataframes don’t contain a common column. 
- To solve this problem, we’ll first combine our `df_ords` dataframe with a new dataframe called `orders_products_prior`.
   - Merge these using 'order_id'.
   - This new dataframe contains a “product_id” column, which our `df_prods` dataframe also has.
   - By adding this column to our `df_ords` dataframe, we’ll have created a common column between `df_ords` and `df_prods`, paving the way for a successful merge.
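
The two-step idea described above can be sketched on toy data. This is only an illustration: the values are invented, and the small dataframes stand in for the real `df_ords`, `orders_products_prior`, and `df_prods`.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (all values are invented for illustration).
df_ords = pd.DataFrame({'order_id': [1, 2], 'user_id': [10, 11]})
orders_products_prior = pd.DataFrame({'order_id': [1, 1, 2],
                                      'product_id': [100, 101, 100]})
df_prods = pd.DataFrame({'product_id': [100, 101],
                         'product_name': ['Soda', 'Chips']})

# Step 1: bring product_id into the orders data via the shared order_id column.
orders_products_combined = df_ords.merge(orders_products_prior, on='order_id')

# Step 2: product_id is now a common column, so the products data can be merged in.
ords_prods_merge = orders_products_combined.merge(df_prods, on='product_id')

print(ords_prods_merge)
```

Each order row is repeated once per product it contains, which is why the combined Instacart dataframe is so much longer than the orders data alone.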
   
#### Notebook Part 1:
- merging our prepared Instacart orders data with the new orders_products_prior dataframe
    - Merge "df_ords_prior" with "df_ords", using 'order_id'.
    - Using the indicator argument to check whether there was a full match between the two dataframes.
    - Checking the results of the merge using the "value_counts()" function - full match.
- Exporting the merged file in pickle format as “orders_products_combined.pkl”.

#### Notebook Part 2:
- In a new notebook, importing the “orders_products_combined.pkl” dataframe from the pickle file.
- Importing our wrangled, cleaned, and deduped products data set stored in our “Prepared Data” folder from the previous step.
- Checking the shape of the imported dataframes.
- Determining a suitable way to combine the orders_products_combined dataframe with our products data set. 
    - Before merging, we will drop the unnecessary columns from both dataframes.
- Confirming the results of the merge using the merge flag.
- Exporting the newly created dataframe as ords_prods_merge in a suitable format (taking into consideration the size).

### This script contains the following points:

#### 0. Importing Libraries
#### 1. Loading and Checking the Data
#### 2. Merging Dataframes
#### 3. Exporting Merged Dataframe as a Pickle



## 0. Importing Libraries

In [1]:
# Import libraries: pandas, NumPy and os.

import pandas as pd
import numpy as np
import os

## 1. Loading and Checking the Data

General pattern for importing data files using the `os.path.join()` function (the folder and file names here are placeholders):

path = r'/folderpath_to main project folder/'

df = pd.read_csv(os.path.join(path,'folderpath','name.csv'), index_col = False)


In [2]:
# folder path to my main project folder is now stored within variable 'path'

path = r'/Users/pau/06-05-2024 Instacart Basket Analysis'

#### Importing the “orders_products_combined.pkl” data set into my Jupyter notebook using the os library as df_ords_prods_combined

Importing pickle files follows a syntax similar to that of their “.csv” counterpart. The only differences are:
- the function used (`read_pickle()`)
- the lack of an `index_col` argument, since pickle-format files already include the index.
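
A minimal sketch of that difference, using a throwaway dataframe and a temporary folder rather than the project files (the file names `demo.csv` and `demo.pkl` are made up for the example):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'product_id': [1, 2], 'prices': [5.8, 9.3]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'demo.csv')
    pkl_path = os.path.join(tmp, 'demo.pkl')

    df.to_csv(csv_path)      # writes the index as an unnamed first column
    df.to_pickle(pkl_path)   # stores the index as part of the dataframe object

    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

print(list(from_csv.columns))  # includes an extra 'Unnamed: 0' column
print(list(from_pkl.columns))  # same columns as the original dataframe
```

This is a likely source of the `Unnamed: 0` columns that get dropped later in this notebook: an index written to CSV comes back as a regular column unless it is handled on export or import.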

In [3]:
# Import the “orders_products_combined.pkl” data from the “Prepared Data” folder as df_ords_prods_combined 


df_ords_prods_combined = pd.read_pickle(os.path.join(path,'02 Data','Prepared Data','orders_products_combined.pkl'))

#### Checking the dimensions of the imported dataframe and if the data is correctly loaded

In [4]:
# Checking "orders_products_combined.pkl" data is correctly loaded

print(df_ords_prods_combined.head()) # to ensure nothing looks off about the imported dataframe.
print(df_ords_prods_combined.info()) # to check the column names, dtypes, and memory usage.
print(df_ords_prods_combined.shape) # to confirm the total size of the imported dataframe and get a feel for how to proceed.

   Unnamed: 0.1  Unnamed: 0  order_id  user_id  order_number  \
0             0           0   2539329        1             1   
1             0           0   2539329        1             1   
2             0           0   2539329        1             1   
3             0           0   2539329        1             1   
4             0           0   2539329        1             1   

   orders_day_of_week  order_hour_of_day  days_since_prior_order  \
0                   2                  8                     NaN   
1                   2                  8                     NaN   
2                   2                  8                     NaN   
3                   2                  8                     NaN   
4                   2                  8                     NaN   

   is_first_order  product_id  add_to_cart_order  reordered _merge  
0               1         196                  1          0   both  
1               1       14084                  2          0   both  

After checking, we can confirm that the imported dataframe has the same shape as the one we exported: (32434489, 13)

#### Importing our wrangled, cleaned, and deduped products data set stored in our “Prepared Data” folder from the previous step.

In [5]:
# Import the “products_checked.csv” data from the “Prepared Data” folder as df_prods

df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_checked.csv'))

In [6]:
# Checking "products_checked.csv" data is correctly loaded

print(df_prods.head())
print(df_prods.info())
df_prods.shape

   Unnamed: 0  product_id                                       product_name  \
0           0           1                         Chocolate Sandwich Cookies   
1           1           2                                   All-Seasons Salt   
2           2           3               Robust Golden Unsweetened Oolong Tea   
3           3           4  Smart Ones Classic Favorites Mini Rigatoni Wit...   
4           4           5                          Green Chile Anytime Sauce   

   aisle_id  department_id  prices  
0        61             19     5.8  
1       104             13     9.3  
2        94              7     4.5  
3        38              1    10.5  
4         5             13     4.3  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49672 entries, 0 to 49671
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     49672 non-null  int64  
 1   product_id     49672 non-null  int64  
 2   product_

(49672, 6)

## 2. Merging Dataframes

#### Determining a suitable way to combine the orders_products_combined dataframe with our products data set

- Before merging, we will drop the unnecessary columns from both dataframes

In [7]:
# Drop the unnecessary columns (old index columns and the previous merge flag) from df_ords_prods_combined

df_ords_prods_combined = df_ords_prods_combined.drop(['Unnamed: 0.1', 'Unnamed: 0', '_merge'], axis=1)


In [8]:
# Drop the unnecessary column (old index column) from df_prods

df_prods = df_prods.drop(['Unnamed: 0'], axis=1)
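
As a side note, re-running a `drop()` cell after the columns are already gone raises a `KeyError`. One way around that, shown here on a toy dataframe and not used in this notebook, is the `errors='ignore'` argument:

```python
import pandas as pd

# Toy dataframe with a leftover index column (values are illustrative).
df = pd.DataFrame({'Unnamed: 0': [0, 1], 'product_id': [1, 2], 'prices': [5.8, 9.3]})

# errors='ignore' skips labels that are not present, so the cell can be re-run safely.
df = df.drop(columns=['Unnamed: 0', '_merge'], errors='ignore')
df = df.drop(columns=['Unnamed: 0', '_merge'], errors='ignore')  # second run: no error

print(list(df.columns))
```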

In [9]:
# Check the results of dropping unnecessary columns

print(df_ords_prods_combined.head())
print(df_ords_prods_combined.info())


   order_id  user_id  order_number  orders_day_of_week  order_hour_of_day  \
0   2539329        1             1                   2                  8   
1   2539329        1             1                   2                  8   
2   2539329        1             1                   2                  8   
3   2539329        1             1                   2                  8   
4   2539329        1             1                   2                  8   

   days_since_prior_order  is_first_order  product_id  add_to_cart_order  \
0                     NaN               1         196                  1   
1                     NaN               1       14084                  2   
2                     NaN               1       12427                  3   
3                     NaN               1       26088                  4   
4                     NaN               1       26405                  5   

   reordered  
0          0  
1          0  
2          0  
3          0  
4    

In [10]:
df_ords_prods_combined.shape

(32434489, 10)

In [11]:
# Check the results of dropping unnecessary columns

print(df_prods.head())
print(df_prods.info())

   product_id                                       product_name  aisle_id  \
0           1                         Chocolate Sandwich Cookies        61   
1           2                                   All-Seasons Salt       104   
2           3               Robust Golden Unsweetened Oolong Tea        94   
3           4  Smart Ones Classic Favorites Mini Rigatoni Wit...        38   
4           5                          Green Chile Anytime Sauce         5   

   department_id  prices  
0             19     5.8  
1             13     9.3  
2              7     4.5  
3              1    10.5  
4             13     4.3  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49672 entries, 0 to 49671
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product_id     49672 non-null  int64  
 1   product_name   49672 non-null  object 
 2   aisle_id       49672 non-null  int64  
 3   department_id  49672 non-null  int64

In [12]:
df_prods.shape

(49672, 5)

In [13]:
# Merge "df_ords_prods_combined" with "df_prods" using their common column "product_id" (default inner join: only matching rows are kept)

df_ords_prods_merge = df_prods.merge(df_ords_prods_combined, on = 'product_id', indicator = True)
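
An optional safeguard for a merge like this one is the `validate` argument of `merge()`, which checks the expected relationship between the keys. The sketch below uses toy data and assumes each `product_id` should appear only once in the products dataframe; it is not part of the original notebook.

```python
import pandas as pd

prods = pd.DataFrame({'product_id': [1, 2], 'product_name': ['A', 'B']})
orders = pd.DataFrame({'order_id': [10, 11, 12], 'product_id': [1, 1, 2]})

# validate='one_to_many' asserts that product_id is unique in the left dataframe.
merged = prods.merge(orders, on='product_id', validate='one_to_many')

# A duplicated product_id on the left side makes the same call raise MergeError.
prods_dup = pd.concat([prods, prods.iloc[[0]]], ignore_index=True)
try:
    prods_dup.merge(orders, on='product_id', validate='one_to_many')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(len(merged), duplicate_detected)
```

If the products data still contained duplicate IDs, an unvalidated merge would silently multiply the matching order rows instead of failing.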

In [14]:
# Confirming the results of the merge

df_ords_prods_merge.head()

| | product_id | product_name | aisle_id | department_id | prices | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | is_first_order | add_to_cart_order | reordered | _merge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 3139998 | 138 | 28 | 6 | 11 | 3.0 | 0 | 5 | 0 | both |
| 1 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 1977647 | 138 | 30 | 6 | 17 | 20.0 | 0 | 1 | 1 | both |
| 2 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 389851 | 709 | 2 | 0 | 21 | 6.0 | 0 | 20 | 0 | both |
| 3 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 652770 | 764 | 1 | 3 | 13 | NaN | 1 | 10 | 0 | both |
| 4 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 1813452 | 764 | 3 | 4 | 17 | 9.0 | 0 | 11 | 1 | both |


In [15]:
# Checking the results of the merged data

print(df_ords_prods_merge.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32404859 entries, 0 to 32404858
Data columns (total 15 columns):
 #   Column                  Dtype   
---  ------                  -----   
 0   product_id              int64   
 1   product_name            object  
 2   aisle_id                int64   
 3   department_id           int64   
 4   prices                  float64 
 5   order_id                int64   
 6   user_id                 int64   
 7   order_number            int64   
 8   orders_day_of_week      int64   
 9   order_hour_of_day       int64   
 10  days_since_prior_order  float64 
 11  is_first_order          int64   
 12  add_to_cart_order       int64   
 13  reordered               int64   
 14  _merge                  category
dtypes: category(1), float64(2), int64(11), object(1)
memory usage: 3.4+ GB
None


In [16]:
# Checking the results of the merged data

df_ords_prods_merge.shape

(32404859, 15)

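#### Checking the row and column counts:

- df_ords_prods_combined
    - (32434489, 13)
    - (32434489, 10) after dropping the unnecessary columns

- df_prods
    - (49672, 6)
    - (49672, 5) after dropping the unnecessary columns

- df_ords_prods_merge
    - (32404859, 15)

The merged dataframe has 29,630 fewer rows than df_ords_prods_combined (32434489 - 32404859). Because `merge()` defaults to an inner join, this suggests that some rows in the orders data have a product_id with no match in df_prods. The 15 columns are the 10 + 5 input columns, minus the shared product_id, plus the `_merge` flag.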


In [17]:
# Checking the results of the merge using the "value_counts()" function
# Note: with the default inner join only matched rows are kept, so every row is flagged "both"

df_ords_prods_merge['_merge'].value_counts()

_merge
both          32404859
left_only            0
right_only           0
Name: count, dtype: int64
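
To see why the flag above cannot reveal unmatched rows on its own, here is a small sketch on toy data comparing the default inner join with `how='outer'`. The dataframes and values are invented; running an outer merge on the full Instacart data is an assumption about how one might investigate, not something this notebook does.

```python
import pandas as pd

prods = pd.DataFrame({'product_id': [1, 2, 3], 'product_name': ['A', 'B', 'C']})
orders = pd.DataFrame({'order_id': [10, 11, 12], 'product_id': [1, 2, 99]})

# Default inner join: unmatched rows are discarded before the flag is counted.
inner = prods.merge(orders, on='product_id', indicator=True)

# Outer join: unmatched rows from either side are kept and flagged.
outer = prods.merge(orders, on='product_id', how='outer', indicator=True)

inner_counts = inner['_merge'].value_counts()
outer_counts = outer['_merge'].value_counts()

print(inner_counts)
print(outer_counts)
```

In the outer result, `left_only` marks products that were never ordered and `right_only` marks order rows whose product is missing from the products data.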

## 3. Exporting Merged Dataframe as a Pickle

In [18]:
# Perform a final check of the dataframe before exporting

print(df_ords_prods_merge.head())
print(df_ords_prods_merge.info())
print(df_ords_prods_merge.shape)

   product_id                product_name  aisle_id  department_id  prices  \
0           1  Chocolate Sandwich Cookies        61             19     5.8   
1           1  Chocolate Sandwich Cookies        61             19     5.8   
2           1  Chocolate Sandwich Cookies        61             19     5.8   
3           1  Chocolate Sandwich Cookies        61             19     5.8   
4           1  Chocolate Sandwich Cookies        61             19     5.8   

   order_id  user_id  order_number  orders_day_of_week  order_hour_of_day  \
0   3139998      138            28                   6                 11   
1   1977647      138            30                   6                 17   
2    389851      709             2                   0                 21   
3    652770      764             1                   3                 13   
4   1813452      764             3                   4                 17   

   days_since_prior_order  is_first_order  add_to_cart_order  reorde

In [19]:
# Export the merged dataframe to Pickle as "ords_prods_merge.pkl"

df_ords_prods_merge.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge.pkl'))
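
Since the merged dataframe takes several gigabytes in memory, one optional way to shrink it before exporting is to downcast the integer columns. This is a sketch on toy data and is not part of the original notebook; whether it is worthwhile depends on the value ranges in the real columns.

```python
import pandas as pd

# Toy dataframe with a few of the integer columns seen above (values are illustrative).
df = pd.DataFrame({
    'order_id': [2539329, 3139998, 1977647],
    'orders_day_of_week': [2, 6, 6],
    'order_hour_of_day': [8, 11, 17],
})
before = df.memory_usage(deep=True).sum()

# Downcast each integer column to the smallest integer dtype that holds its values.
for col in df.columns:
    if pd.api.types.is_integer_dtype(df[col]):
        df[col] = pd.to_numeric(df[col], downcast='integer')

after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(before, after)
```

Pickle preserves these smaller dtypes on export, whereas a CSV round trip would read the columns back with the default integer type.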