## Contents

01. Checking shape of df_ords and df_ords_prior datasets
02. Merging df_ords_prior and df_ords datasets
03. Data Wrangling and Consistency Checks
04. Exporting as a pickle file 


In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
# Importing the df_ords and df_ords_prior dataframes

path = r'/Users/lindazhang/Instacart Basket Analysis'
df_ords_prior = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'order_products_prior.csv'), index_col = False)
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_cleaned.csv'), index_col = False)

## 01. Checking shape of df_ords and df_ords_prior datasets

In [3]:
# checking the df_ords dataframe
df_ords.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,order_id,customer_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,0,0,2539329,1,1,2,8,
1,1,1,2398795,1,2,3,7,15.0
2,2,2,473747,1,3,3,12,21.0
3,3,3,2254736,1,4,4,7,29.0
4,4,4,431534,1,5,4,15,28.0


In [4]:
# Selecting the columns I want from the df_ords dataframe

vars_list = ['order_id', 'customer_id', 'order_number', 'orders_day_of_week', 'order_hour_of_day', 'days_since_prior_order']
df_ords = pd.read_csv(os.path.join(path, '02 Data','Prepared Data','orders_cleaned.csv'), usecols = vars_list)


In [5]:
# Checking the df_ords dataframe again

df_ords.head()

Unnamed: 0,order_id,customer_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
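
Columns named `Unnamed: 0` usually appear when a dataframe is written to CSV with its index and then read back without an index column being named. The sketch below reproduces that effect and shows how `usecols` avoids it. The three-row frame and the in-memory buffer are made up for illustration and are not the Instacart files.

```python
import io
import pandas as pd

# Hypothetical toy data standing in for orders_cleaned.csv
df = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 1, 2]})

# Write with the default index=True, as a plain df.to_csv(...) call would
buffer = io.StringIO()
df.to_csv(buffer)

# Reading it back turns the saved index into an extra 'Unnamed: 0' column
buffer.seek(0)
df_roundtrip = pd.read_csv(buffer)
print(df_roundtrip.columns.tolist())

# usecols loads only the named columns, so the stray column is never read
buffer.seek(0)
df_selected = pd.read_csv(buffer, usecols=["order_id", "customer_id"])
print(df_selected.columns.tolist())
```

Writing the file with `index=False` in the first place would avoid the extra column at the source.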


In [6]:
# checking the number of rows and columns in df_ords

df_ords.shape

(3421083, 6)

In [7]:
# checking what the df_ords_prior dataframe looks like

df_ords_prior.head(20)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0
5,2,17794,6,1
6,2,40141,7,1
7,2,1819,8,1
8,2,43668,9,0
9,3,33754,1,1


In [8]:
# checking for missing values

df_ords_prior.isnull().sum()

order_id             0
product_id           0
add_to_cart_order    0
reordered            0
dtype: int64

In [9]:
# checking for duplicates

df_ords_prior_dups = df_ords_prior[df_ords_prior.duplicated()]
df_ords_prior_dups

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
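
An empty result means no row is an exact copy of an earlier row. As a small illustration of what `duplicated()` flags, here is a sketch on a made-up frame (the values are hypothetical) that also shows how to get the count as a single number:

```python
import pandas as pd

# Hypothetical toy frame containing one repeated row
toy = pd.DataFrame({"order_id": [2, 2, 3], "product_id": [33120, 33120, 33754]})

# duplicated() flags each repeat after the first occurrence (keep='first' is the default)
dup_mask = toy.duplicated()
n_dups = int(dup_mask.sum())
print(n_dups)

# drop_duplicates() would remove the flagged rows if any were found
toy_clean = toy.drop_duplicates()
print(toy_clean.shape)
```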


In [10]:
# checking the shape of the df_ords_prior dataframe

df_ords_prior.shape

(32434489, 4)

## 02. Merging df_ords_prior and df_ords datasets

In [11]:
# Inner merge (the pandas default) of df_ords and df_ords_prior on the key order_id; indicator = True adds a _merge column.

df_merged_large = df_ords.merge(df_ords_prior, on = 'order_id', indicator = True)
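
Each `order_id` is expected to appear once in the orders table and many times in the order-products table. Assuming that holds, the `validate` argument of `merge` can check it at merge time. The sketch below uses made-up frames (`orders`, `items` and their values are hypothetical) rather than the real data:

```python
import pandas as pd

# Hypothetical stand-ins for df_ords (one row per order) and df_ords_prior (one row per product)
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [100, 100, 200]})
items = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [196, 14084, 12427]})

# validate="one_to_many" checks that order_id is unique in the left frame
merged = orders.merge(items, on="order_id", how="inner", validate="one_to_many")
print(merged.shape)

# A repeated order_id on the left side makes the same call raise MergeError
orders_bad = pd.concat([orders, orders.iloc[[0]]], ignore_index=True)
try:
    orders_bad.merge(items, on="order_id", validate="one_to_many")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```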


In [12]:
df_merged_large

Unnamed: 0,order_id,customer_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge
0,2539329,1,1,2,8,,196,1,0,both
1,2539329,1,1,2,8,,14084,2,0,both
2,2539329,1,1,2,8,,12427,3,0,both
3,2539329,1,1,2,8,,26088,4,0,both
4,2539329,1,1,2,8,,26405,5,0,both
...,...,...,...,...,...,...,...,...,...,...
32434484,2977660,206209,13,1,12,7.0,14197,5,1,both
32434485,2977660,206209,13,1,12,7.0,38730,6,0,both
32434486,2977660,206209,13,1,12,7.0,31477,7,0,both
32434487,2977660,206209,13,1,12,7.0,6567,8,0,both


In [13]:
# An inner join keeps only order_ids found in both df_ords and df_ords_prior,
# so every row is flagged "both" and the "left_only" and "right_only" counts are 0.

df_merged_large['_merge'].value_counts()

both          32434489
left_only            0
right_only           0
Name: _merge, dtype: int64

Had we used an outer merge instead, the 206,209 rows of df_ords with no matching order_id in df_ords_prior would have been kept and flagged as left_only.
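
To illustrate the outer-merge case, here is a minimal sketch on made-up frames (`orders`, `items` and their values are hypothetical). Orders with no matching product rows survive the merge, are flagged `left_only`, and get missing values in the right-hand columns:

```python
import pandas as pd

# Hypothetical stand-ins: orders 3 and 4 have no rows in the items table
orders = pd.DataFrame({"order_id": [1, 2, 3, 4]})
items = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [196, 14084, 12427]})

# how="outer" keeps unmatched rows from both sides; indicator=True labels them
outer = orders.merge(items, on="order_id", how="outer", indicator=True)
counts = outer["_merge"].value_counts()
print(counts)

# Unmatched orders carry NaN in the columns that come from the items table
n_unmatched = int(outer["product_id"].isna().sum())
print(n_unmatched)
```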

## 03. Data Wrangling and Consistency Checks

In [14]:
# renaming column

df_merged_large.rename(columns = {'add_to_cart_order' : 'order_added_to_cart'}, inplace = True)

In [15]:
# checking for missing values

df_merged_large.isnull().sum()

order_id                        0
customer_id                     0
order_number                    0
orders_day_of_week              0
order_hour_of_day               0
days_since_prior_order    2078068
product_id                      0
order_added_to_cart             0
reordered                       0
_merge                          0
dtype: int64
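
The missing values in days_since_prior_order plausibly belong to each customer's first order, where no prior order exists, although this notebook does not verify that. One way to check is sketched below on a made-up frame (the values are hypothetical); the same comparison could be run on df_merged_large.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: order_number 1 has no prior order, so the gap is missing
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_number": [1, 1, 2, 1, 2],
    "days_since_prior_order": [np.nan, np.nan, 15.0, np.nan, 7.0],
})

missing = toy["days_since_prior_order"].isnull()
first_order = toy["order_number"] == 1

# True only if every row with a missing gap is a first order
all_missing_are_first = bool(first_order[missing].all())
print(all_missing_are_first)

# Cross-tabulating the two flags shows any rows that break the pattern
print(pd.crosstab(missing, first_order))
```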

In [17]:
# Check for duplicates

df_merged_large_dups = df_merged_large[df_merged_large.duplicated()]
df_merged_large_dups

Unnamed: 0,order_id,customer_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,order_added_to_cart,reordered,_merge


## 04. Exporting as a pickle file 

A pickle (“.pkl”) file stores a Python object, here a pandas dataframe, in Python’s binary serialization format. Like a “.csv” file it saves data to disk, but it is meant to be written and read back with Python.


CSV Files

Advantages:
- Can be opened in many tools and programs (Excel, SAS, R)
- Can be limited to certain columns or rows on import
- Compress well when zipped

Disadvantages:
- Take more time to import and export when datasets are large
- Can pick up stray index columns when exported and reimported

PKL Files

Advantages:
- Can be imported and exported quickly
- Preserve the dataframe's index, column dtypes, and values, so the reimported dataframe matches the one that was exported
- Compress well when zipped

Disadvantages:
- Are only accessible to Python users
- Can’t be limited to certain columns or rows on import
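
The difference in what survives a round trip can be seen in a small sketch. The toy frame and the temporary file names below are assumptions made for illustration. The `_merge` column is stored as a category here, as in the merged dataframe above.

```python
import os
import tempfile
import pandas as pd

# Hypothetical toy frame with a categorical column, like _merge above
toy = pd.DataFrame({
    "order_id": [1, 2],
    "_merge": pd.Categorical(["both", "both"]),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    pkl_path = os.path.join(tmp, "toy.pkl")

    # Write the same frame in both formats, then read each back
    toy.to_csv(csv_path, index=False)
    toy.to_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

# CSV stores plain text, so the category dtype is lost on reimport
print(from_csv["_merge"].dtype)

# The pickle restores the column with its original dtype
print(from_pkl["_merge"].dtype)
same_after_pickle = from_pkl.equals(toy)
print(same_after_pickle)
```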


In [18]:
# Export data to pkl

df_merged_large.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_combined.pkl'))