# 4.6.1. IC_Combining & Exporting Data

#### Context:
- We want to combine our dataframes: **`df_ords` and `df_prods`**
- `df_ords` and `df_prods` dataframes don’t contain a common column. 
- To solve this problem, we’ll combine our `df_ords` dataframe with a new dataframe called `orders_products_prior`.
   - Merge these using 'order_id'. 
   - This new dataframe contains a “product_id” column—the same as our `df_prods` dataframe.
   - By adding this column to our `df_ords` dataframe, we’ll have created a common column between `df_ords` and `df_prods`, paving the way for a successful merge.
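
The bridging idea above can be sketched on toy data. The three small frames below are invented stand-ins for `df_ords`, `orders_products_prior`, and `df_prods`. Only the column names `order_id` and `product_id` come from this notebook; every other name and value is an assumption for illustration.

```python
import pandas as pd

# Toy stand-ins (all values are invented for illustration only)
ords = pd.DataFrame({"order_id": [1, 2], "user_id": [10, 11]})
bridge = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [100, 101, 100]})
prods = pd.DataFrame({"product_id": [100, 101], "product_name": ["milk", "bread"]})

# Step 1: the orders and the bridge table share 'order_id'
ords_bridge = ords.merge(bridge, on="order_id")

# Step 2: the result now carries 'product_id', so it can be merged with the products
combined = ords_bridge.merge(prods, on="product_id")

print(combined)
```

Neither `ords` nor `prods` shares a column with the other; the bridge table is what makes the second merge possible.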
   
#### Notebook Part 1:
- Merging our prepared Instacart orders data with the new orders_products_prior dataframe.
    - Merge "df_ords_prior" with "df_ords", using 'order_id'.
    - Use the indicator argument to check whether there was a full match between the two dataframes.
    - Check the results of the merge using the "value_counts()" function (the inner join reports a full match).
- Exporting the merged file in pickle format as “orders_products_combined.pkl”.

#### Notebook Part 2:
- In a new notebook, importing the “orders_products_combined.pkl” dataframe from the pickle file.
- Importing our wrangled, cleaned, and deduplicated products data set stored in our “Prepared Data” folder from the previous step.
- Checking the shape of the imported dataframes.
- Determining a suitable way to combine the orders_products_combined dataframe with our products data set. 
    - Before merging, we will drop the unnecessary columns from both dataframes.
- Confirming the results of the merge using the merge flag.
- Exporting the newly created dataframe as ords_prods_merge in a suitable format (taking into consideration the size).
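
A minimal sketch of the Part 2 steps on toy data, assuming that the "unnecessary columns" are the leftover `Unnamed: ...` index columns and the old `_merge` flag. The frames and the `product_name` column are invented for illustration. Dropping `_merge` first matters because pandas raises an error if `indicator=True` would create a column name that already exists.

```python
import pandas as pd

# Toy stand-in for the combined orders/products data (values are invented)
combined = pd.DataFrame({
    "Unnamed: 0": [0, 0, 1],
    "order_id": [1, 1, 2],
    "product_id": [100, 101, 100],
    "_merge": ["both", "both", "both"],
})
# Toy stand-in for the cleaned products data
prods = pd.DataFrame({"product_id": [100, 101], "product_name": ["milk", "bread"]})

# Drop leftover index columns and the old merge flag before merging again
cols_to_drop = [c for c in combined.columns if c.startswith("Unnamed") or c == "_merge"]
combined = combined.drop(columns=cols_to_drop)

# Merge on the shared key and keep a fresh merge flag for checking
ords_prods_merge = combined.merge(prods, on="product_id", how="outer", indicator=True)
print(ords_prods_merge["_merge"].value_counts())
```

The outer join is used here only so that the flag can reveal unmatched rows; which join type suits the final dataframe is a separate decision.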


### This script contains the following points:

#### 0. Importing Libraries
#### 1. Loading and Checking the Data
#### 2. Merging Dataframes
#### 3. Exporting Merged Dataframe as Pickle



## 0. Importing Libraries

In [17]:
# Import libraries: pandas, NumPy and os.

import pandas as pd
import numpy as np
import os

## 1. Loading and Checking the Data

Importing data files using the os.path.join() function. General pattern (the paths and file names below are placeholders):

path = r'/folderpath_to main project folder/'

df = pd.read_csv(os.path.join(path,'folderpath','name.csv'), index_col = False)


In [18]:
# folder path to my main project folder is now stored within variable 'path'

path = r'/Users/pau/06-05-2024 Instacart Basket Analysis'

#### Importing the “orders_products_prior.csv” data set into my Jupyter notebook using the os library as df_ords_prior

In [19]:
# Import the “orders_products_prior.csv” data set from the “Original Data” folder as df_ords_prior 

df_ords_prior = pd.read_csv(os.path.join(path,'02 Data','Original Data','orders_products_prior.csv'), index_col = False)

#### Importing the “orders_checked.csv” data set (the most up-to-date version) into my Jupyter notebook using the os library as df_ords

In [20]:
# Import the “orders_checked.csv” data set from your “Prepared Data” folder as df_ords

df_ords = pd.read_csv(os.path.join(path,'02 Data', 'Prepared Data', 'orders_checked.csv'), index_col = False)

#### Checking the dimensions of the imported dataframe and if the data is correctly loaded

In [21]:
# Checking "orders_products_prior.csv" data is correctly loaded

print(df_ords_prior.head())  # first rows: check that nothing looks off in the imported dataframe
print(df_ords_prior.info())  # dtypes and memory usage; info() prints directly and returns None, hence the "None" line below
print(df_ords_prior.shape)  # (rows, columns): confirms the total size of the imported dataframe

   order_id  product_id  add_to_cart_order  reordered
0         2       33120                  1          1
1         2       28985                  2          1
2         2        9327                  3          0
3         2       45918                  4          1
4         2       30035                  5          0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434489 entries, 0 to 32434488
Data columns (total 4 columns):
 #   Column             Dtype
---  ------             -----
 0   order_id           int64
 1   product_id         int64
 2   add_to_cart_order  int64
 3   reordered          int64
dtypes: int64(4)
memory usage: 989.8 MB
None
(32434489, 4)


In [22]:
# Checking "orders_checked.csv" data is correctly loaded

print(df_ords.head())
print(df_ords.info())
print(df_ords.shape)

   Unnamed: 0.1  Unnamed: 0  order_id  user_id  order_number  \
0             0           0   2539329        1             1   
1             1           1   2398795        1             2   
2             2           2    473747        1             3   
3             3           3   2254736        1             4   
4             4           4    431534        1             5   

   orders_day_of_week  order_hour_of_day  days_since_prior_order  \
0                   2                  8                     NaN   
1                   3                  7                    15.0   
2                   3                 12                    21.0   
3                   4                  7                    29.0   
4                   4                 15                    28.0   

   is_first_order  
0               1  
1               0  
2               0  
3               0  
4               0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data c

- The new dataframe, df_ords_prior, is quite large because it contains every order of every user, together with exactly what they ordered.
- df_ords_prior and df_ords have different shapes but share one column: “order_id.”
    - In theory, the “order_id” column should match fully, so we shouldn’t need to specify a join type. The default join type is "inner", so the resulting dataframe will only contain observations included in both input data sets.
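
The "in theory" assumption can be tested before merging by comparing the key sets directly. This is a sketch on invented toy frames, not the notebook's data; `ords` and `prior` stand in for `df_ords` and `df_ords_prior`.

```python
import pandas as pd

# Toy stand-ins (values are invented); order 3 has no product rows
ords = pd.DataFrame({"order_id": [1, 2, 3]})
prior = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [100, 101, 100]})

ords_keys = set(ords["order_id"])
prior_keys = set(prior["order_id"])

# Keys that an inner join would silently drop from each side
only_in_ords = ords_keys - prior_keys
only_in_prior = prior_keys - ords_keys

print(f"only in ords: {len(only_in_ords)}, only in prior: {len(only_in_prior)}")
```

If either set is non-empty, an inner join will discard those rows without any warning.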

## 2. Merging Dataframes

- Merge "df_ords_prior" with "df_ords"
- Use the indicator argument to check whether there was a full match between the two dataframes.

In [23]:
# Merge "df_ords_prior" with "df_ords"

df_merged_large = df_ords.merge(df_ords_prior, on = 'order_id', indicator = True)

In [24]:
# Confirming the results of the merge

df_merged_large.head()

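| | Unnamed: 0.1 | Unnamed: 0 | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | is_first_order | product_id | add_to_cart_order | reordered | _merge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 1 | 196 | 1 | 0 | both |
| 1 | 0 | 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 1 | 14084 | 2 | 0 | both |
| 2 | 0 | 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 1 | 12427 | 3 | 0 | both |
| 3 | 0 | 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 1 | 26088 | 4 | 0 | both |
| 4 | 0 | 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 1 | 26405 | 5 | 0 | both |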


In [27]:
# Checking the results of the merge using the "value_counts()" function

df_merged_large['_merge'].value_counts()

_merge
both          32434489
left_only            0
right_only           0
Name: count, dtype: int64

- The value_counts() function tallies each distinct value in the “_merge” column, letting us see at a glance how many rows are flagged “both,” “left_only,” or “right_only.”

#### Notes:
- Let’s see what the merge flag frequency shows you here.
    - The “_merge” column contains only entries with a value of “both,” leading you to think that your key column, “order_id,” exists completely in both dataframes.
    - However, this conclusion is *wrong*.


- What pandas does here is attach the product rows from `df_ords_prior` to each matching “order_id” in the `df_ords` dataframe, which is why the resulting dataframe has 32,434,489 rows (the same total count as the `df_ords_prior` dataframe).
- But does this mean that you have a full match? The answer is no. How you interpret the merge flag depends on the way you chose to merge the dataframes.

    - In this case, you chose the default option of *inner join*. This means that the resulting table only contains observations found in both dataframes, and unmatched rows are dropped. As such, the merge flag here can only show entries that have a value of “both.” How, then, can you check whether you really have a full match?
    - Repeat the merge with the argument `how = 'outer'` (output not shown in this notebook). Merging like this keeps all the observations from both dataframes and shows you the real merge rate:
        - With `how = 'outer'`, the merge flags show that there is not a full match between the two input dataframes.
               

#### You should always double-check your merge rates using an outer join as well, especially when you’re exploring new data and performing test merges.
- For your Instacart project, you’ll only be working with data sets that have a full merge rate, so you won’t need to worry about this or apply any changes to the merge you just completed (using `how = 'inner'`).
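
The difference between the two merge flags can be reproduced on toy data. The frames below are invented stand-ins for `df_ords` and `df_ords_prior`; order 3 deliberately has no product rows.

```python
import pandas as pd

# Toy stand-ins (values are invented); order 3 exists only on the left
ords = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 11, 12]})
prior = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [100, 101, 100]})

# Default inner join: unmatched rows are dropped, so the flag can only be 'both'
inner = ords.merge(prior, on="order_id", indicator=True)

# Outer join: every row from both sides is kept, so mismatches become visible
outer = ords.merge(prior, on="order_id", how="outer", indicator=True)

inner_counts = inner["_merge"].value_counts()
outer_counts = outer["_merge"].value_counts()

print(inner_counts)
print(outer_counts)
```

In this toy case the inner join reports only `both`, while the outer join also flags order 3 as `left_only`. That is the kind of mismatch the inner-join flag cannot show.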

In [28]:
df_merged_large.shape

(32434489, 13)

We should always compare the shapes of the input dataframes with the shape of the result:

- df_ords_prior: (32434489, 4)
- df_ords: (3421083, 9)

#### df_merged_large: (32434489, 13). Exactly the same number of rows as df_ords_prior; 13 columns = 9 + 4, minus 1 for the shared order_id, plus 1 for _merge.
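
The shape check can also be written as code. The sketch below uses invented toy frames and the `validate` argument of `merge`, which makes pandas raise an error if `order_id` is not unique on the left side. Whether the real `df_ords` meets that condition is an assumption here.

```python
import pandas as pd

# Toy stand-ins (values are invented)
ords = pd.DataFrame({"order_id": [1, 2], "user_id": [10, 11]})
prior = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [100, 101, 100]})

# validate='one_to_many' checks that each order_id appears once in `ords`
merged = ords.merge(prior, on="order_id", validate="one_to_many", indicator=True)

# Columns: both inputs, with the shared key counted once, plus the '_merge' flag
expected_cols = ords.shape[1] + prior.shape[1] - 1 + 1

# Rows: one per row of `prior`, provided every prior order_id exists in `ords`
expected_rows = len(prior)

print(merged.shape, (expected_rows, expected_cols))
```

The same arithmetic applied to the real shapes gives 9 + 4 - 1 + 1 = 13 columns, matching the output above.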

## 3. Exporting Merged Dataframe as Pickle

- A **pickle**, or “.pkl,”
    - is Python’s binary serialization format, which pandas can use to store a dataframe on your computer.
    - While it serves a similar purpose to a “.csv” file, a pickle is not plain text and can only be opened using Python.
    - Importing a pickle into your Jupyter notebook follows the same procedure as importing a “.csv” file (with `pd.read_pickle()` in place of `pd.read_csv()`) and produces the same dataframe.
    
- The biggest difference when it comes to importing and exporting “.csv” files and “.pkl” files is efficiency.
    - Your `df_merged_large` dataframe, for instance, would likely take around two minutes to export as a pickle, while it could take upwards of ten minutes to export as a “.csv” file.
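
A rough way to compare the two formats yourself is to time both exports on a small frame. This is a sketch on invented toy data written to a temporary folder; the timings depend on your machine and will not match the figures quoted above.

```python
import os
import tempfile
import time

import pandas as pd

# Toy frame (values are invented); NaN mimics the first-order gaps in the real data
df = pd.DataFrame({
    "order_id": range(100_000),
    "days_since_prior_order": [float("nan"), 1.5] * 50_000,
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "toy.pkl")
    csv_path = os.path.join(tmp, "toy.csv")

    start = time.perf_counter()
    df.to_pickle(pkl_path)
    pkl_seconds = time.perf_counter() - start

    start = time.perf_counter()
    df.to_csv(csv_path, index=False)  # index=False avoids an extra 'Unnamed: 0' column on re-import
    csv_seconds = time.perf_counter() - start

    # Reading the pickle back restores the same values and dtypes
    restored = pd.read_pickle(pkl_path)

print(f"pickle: {pkl_seconds:.3f}s, csv: {csv_seconds:.3f}s")
print(restored.equals(df))
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.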

In [29]:
# Perform a final check of the dataframe before exporting

print(df_merged_large.head())
print(df_merged_large.info())
print(df_merged_large.shape)

   Unnamed: 0.1  Unnamed: 0  order_id  user_id  order_number  \
0             0           0   2539329        1             1   
1             0           0   2539329        1             1   
2             0           0   2539329        1             1   
3             0           0   2539329        1             1   
4             0           0   2539329        1             1   

   orders_day_of_week  order_hour_of_day  days_since_prior_order  \
0                   2                  8                     NaN   
1                   2                  8                     NaN   
2                   2                  8                     NaN   
3                   2                  8                     NaN   
4                   2                  8                     NaN   

   is_first_order  product_id  add_to_cart_order  reordered _merge  
0               1         196                  1          0   both  
1               1       14084                  2          0   both  

In [30]:
# Export df_merged_large to pickle as 'orders_products_combined.pkl'

df_merged_large.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_combined.pkl'))