# <a id='toc1_'></a>[Data Wrangling & Data Subsetting](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Data Wrangling & Data Subsetting](#toc1_)    
  - [I. Data Wrangling](#toc1_1_)    
    - [I.1. Dataframe df_ords](#toc1_1_1_)    
      - [I.1.1. Drop columns](#toc1_1_1_1_)    
      - [I.1.2. Rename columns](#toc1_1_1_2_)    
      - [I.1.3. Data transposing](#toc1_1_1_3_)    
      - [I.1.4. Change data types](#toc1_1_1_4_)    
    - [I.2. Dataframe df_prods](#toc1_1_2_)    
      - [I.2.1. Change data types](#toc1_1_2_1_)    
  - [II. Data Subsetting](#toc1_2_)    
    - [II.1. Create a subset containing only the required information for breakfast items](#toc1_2_1_)    
    - [II.2. Create a subset for the following departments: alcohol, deli, beverages, and meat/seafood](#toc1_2_2_)    
    - [II.3. Create a subset of user with user_id '1' in df_ords](#toc1_2_3_)    
  - [III. Data Export](#toc1_3_)    


In [169]:
# import libraries
import numpy as np
import pandas as pd
import os

In [170]:
# create a path to the directory
path = r'C:\Users\Ansgar.S\Uyen\OneDrive\Documents\Data Immersion\Achievement IV - Python Fundamentals for Data Analysts\02-2023 Instacart Basket Analysis'

# import the 'orders.csv' dataset
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'orders.csv'))

# import the 'products.csv' dataset
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'))

# import the 'departments.csv' dataset
df_dep = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'departments.csv'))

## <a id='toc1_1_'></a>[I. Data Wrangling](#toc0_)

### <a id='toc1_1_1_'></a>[I.1. Dataframe df_ords](#toc0_)

#### <a id='toc1_1_1_1_'></a>[I.1.1. Drop columns](#toc0_)

In [171]:
# current columns in df_ords
df_ords.head(3)

|   | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |


In [172]:
# drop eval set column from df_ords
df_ords = df_ords.drop(columns = ['eval_set'])

print('Columns in df_ords:') 
df_ords.columns

Columns in df_ords:


Index(['order_id', 'user_id', 'order_number', 'order_dow', 'order_hour_of_day',
       'days_since_prior_order'],
      dtype='object')

#### <a id='toc1_1_1_2_'></a>[I.1.2. Rename columns](#toc0_)

In [173]:
# rename columns
df_ords.rename(columns = {'order_dow':'orders_day_of_week'}, inplace = True)

print('New column names in df_ords:')
df_ords.columns

New column names in df_ords:


Index(['order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='object')
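
Both `drop()` and `rename()` return a new dataframe by default, so the two steps above can also be chained without `inplace`. A minimal sketch on a made-up frame (`df_demo` and its values are hypothetical; only the column names are taken from orders.csv):

```python
import pandas as pd

# hypothetical three-row stand-in for orders.csv (values are made up)
df_demo = pd.DataFrame({
    'order_id': [10, 11, 12],
    'eval_set': ['prior', 'prior', 'prior'],
    'order_dow': [2, 3, 3],
})

# drop() and rename() each return a new dataframe, so they can be chained
df_clean = (
    df_demo
    .drop(columns=['eval_set'])
    .rename(columns={'order_dow': 'orders_day_of_week'})
)

print(list(df_clean.columns))
# the original frame is untouched because nothing was done in place
print(list(df_demo.columns))
```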

In [174]:
# descriptive statistics on df_ords
df_ords.describe()

|   | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|
| count | 3421083.0 | 3421083.0 | 3421083.0 | 3421083.0 | 3421083.0 | 3214874.0 |
| mean | 1710542.0 | 102978.2 | 17.15486 | 2.776219 | 13.45202 | 11.11484 |
| std | 987581.7 | 59533.72 | 17.73316 | 2.046829 | 4.226088 | 9.206737 |
| min | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 855271.5 | 51394.0 | 5.0 | 1.0 | 10.0 | 4.0 |
| 50% | 1710542.0 | 102689.0 | 11.0 | 3.0 | 13.0 | 7.0 |
| 75% | 2565812.0 | 154385.0 | 23.0 | 5.0 | 16.0 | 15.0 |
| max | 3421083.0 | 206209.0 | 100.0 | 6.0 | 23.0 | 30.0 |


In [175]:
# print current column names df_ords
print('Current column names in df_ords:')
df_ords.columns

Current column names in df_ords:


Index(['order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='object')

#### <a id='toc1_1_1_3_'></a>[I.1.3. Data transposing](#toc0_)

In [176]:
# transpose the data in df_dep
df_dep_t = df_dep.T

print('Transposed df_dep:')
df_dep_t.head(3)

Transposed df_dep:


|   | 0 |
|---|---|
| department_id | department |
| 1 | frozen |
| 2 | other |


In [177]:
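# reset the index
# note: reset_index() returns a new dataframe; the result is not assigned here,
# so df_dep_t keeps 'department_id' and the ids as its index (the next cells rely on that)
df_dep_t.reset_index()

print('Transposed df_dep after the reset_index() call:')
df_dep_t.head(3)

Transposed df_dep after the reset_index() call: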


|   | 0 |
|---|---|
| department_id | department |
| 1 | frozen |
| 2 | other |
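
The output is identical to the previous cell because `reset_index()` does not modify the dataframe it is called on unless the result is assigned (or `inplace=True` is passed). A small sketch of the difference, using a hypothetical miniature frame `df_t`:

```python
import pandas as pd

# hypothetical miniature of the transposed departments frame
df_t = pd.DataFrame({0: ['department', 'frozen', 'other']},
                    index=['department_id', '1', '2'])

# the return value is discarded, so df_t is unchanged
df_t.reset_index()
cols_before = list(df_t.columns)

# assigning the result keeps the change: the old index becomes a column named 'index'
df_t_reset = df_t.reset_index()
cols_after = list(df_t_reset.columns)

print(cols_before)
print(cols_after)
```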


In [178]:
# take the first row of df_dep_t as the new header
new_header = df_dep_t.iloc[0]

print('New header:')
new_header

New header:


0    department
Name: department_id, dtype: object

In [179]:
# create new dataframe that starts on the second row
df_dep_t_new = df_dep_t[1:]

print('Transposed df_dep that starts on the second row:')
df_dep_t_new.head(3)

Transposed df_dep that starts on the second row:


|   | 0 |
|---|---|
| 1 | frozen |
| 2 | other |
| 3 | bakery |


In [180]:
# name the new dataframe columns
df_dep_t_new.columns = new_header

print('New, transposed df_dep:')
df_dep_t_new.head(3)

New, transposed df_dep:


| department_id | department |
|---|---|
| 1 | frozen |
| 2 | other |
| 3 | bakery |
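
The transposition steps above (transpose, take the first row as header, slice it off, assign the header) can probably be condensed by setting the index before transposing. The sketch below compares both routes on a hypothetical three-department miniature of departments.csv; `df_wide`, `df_steps` and `df_compact` are illustrative names, not part of the notebook:

```python
import pandas as pd

# hypothetical miniature of departments.csv in its wide layout
df_wide = pd.DataFrame({
    'department_id': ['department'],
    '1': ['frozen'],
    '2': ['other'],
    '3': ['bakery'],
})

# step by step, as in the cells above
t = df_wide.T
header = t.iloc[0]
df_steps = t[1:].copy()
df_steps.columns = header

# compact alternative: make 'department_id' the index first, then transpose
df_compact = df_wide.set_index('department_id').T

print(df_compact)
```

Both routes end with the ids as the index and a single `department` column.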


#### <a id='toc1_1_1_4_'></a>[I.1.4. Change data types](#toc0_)

In [181]:
# print current data types in df_ords
print('Current data types in df_ords:')
df_ords.dtypes

Current data types in df_ords:


order_id                    int64
user_id                     int64
order_number                int64
orders_day_of_week          int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object

In [182]:
# change the data type of column 'order_id' to string
df_ords['order_id'] = df_ords['order_id'].astype('str')

print('Data type of column order_id:')
df_ords['order_id'].dtype

Data type of column order_id:


dtype('O')

In [183]:
# change the data type of column 'user_id' to string
df_ords['user_id'] = df_ords['user_id'].astype('str')

print('Data type of column user_id:')
df_ords['user_id'].dtype

Data type of column user_id:


dtype('O')

### <a id='toc1_1_2_'></a>[I.2. Dataframe df_prods](#toc0_)

#### <a id='toc1_1_2_1_'></a>[I.2.1. Change data types](#toc0_)

In [184]:
# print current data types in df_prods
print('Current data types in df_prods:')
df_prods.dtypes

Current data types in df_prods:


product_id         int64
product_name      object
aisle_id           int64
department_id      int64
prices           float64
dtype: object

In [185]:
# change the data type of column 'product_id', 'aisle_id' and 'department_id' to string
df_prods = df_prods.astype({'product_id': 'str', 'aisle_id': 'str', 'department_id': 'str'})

print('Data type of columns in df_prods:')
df_prods.dtypes

Data type of columns in df_prods:


product_id        object
product_name      object
aisle_id          object
department_id     object
prices           float64
dtype: object
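
One practical effect of storing identifiers as strings is that `describe()` no longer computes statistics on them. A minimal sketch, assuming a made-up two-row frame (`df_demo` and its values are hypothetical):

```python
import pandas as pd

# hypothetical two-row stand-in for products.csv (values are made up)
df_demo = pd.DataFrame({
    'product_id': [1, 2],
    'department_id': [14, 5],
    'prices': [5.8, 9.3],
})

# cast the identifier columns to strings in one call
df_demo = df_demo.astype({'product_id': 'str', 'department_id': 'str'})

# describe() summarises numeric columns only by default,
# so the identifier columns no longer appear
stats_columns = list(df_demo.describe().columns)
print(stats_columns)
```

Any later comparison against these columns has to use strings as well, for example `'14'` rather than `14`.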

## <a id='toc1_2_'></a>[II. Data Subsetting](#toc0_)

Find the frequency of each value in the "order_hour_of_day" column to determine the busiest hour for placing orders.

In [186]:
# get the most frequent value in column 'order_hour_of_day'
print('Busiest hour for placing orders:')
df_ords['order_hour_of_day'].mode()

Busiest hour for placing orders:


0    10
Name: order_hour_of_day, dtype: int64

In [187]:
# get the frequency of values in column 'order_hour_of_day'
print('Frequency of values in column order_hour_of_day:')
df_ords['order_hour_of_day'].value_counts(dropna = False)

Frequency of values in column order_hour_of_day:


10    288418
11    284728
15    283639
14    283042
13    277999
12    272841
16    272553
9     257812
17    228795
18    182912
8     178201
19    140569
20    104292
7      91868
21     78109
22     61468
23     40043
6      30529
0      22758
1      12398
5       9569
2       7539
4       5527
3       5474
Name: order_hour_of_day, dtype: int64
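
`mode()` and `value_counts()` answer the same question in different shapes. The sketch below shows how they relate on a hypothetical handful of order hours (the values in `hours` are made up):

```python
import pandas as pd

# hypothetical hours of day for a handful of orders
hours = pd.Series([10, 11, 10, 15, 10, 11], name='order_hour_of_day')

# counts per value, sorted by frequency with the highest first
counts = hours.value_counts()

# label of the largest count
busiest_from_counts = counts.idxmax()

# mode() returns a Series because ties would yield several values
busiest_from_mode = hours.mode().iloc[0]

# relative frequencies instead of absolute counts
shares = hours.value_counts(normalize=True)

print(busiest_from_counts, busiest_from_mode)
```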

Use a data dictionary to determine what a value of 4 means in the "department_id" column of the df_prods dataframe.

In [188]:
# transform df_dep_t_new to a data dictionary
data_dict = df_dep_t_new.to_dict('index') 

print('Data dictionary:')
data_dict

Data dictionary:


{'1': {'department': 'frozen'},
 '2': {'department': 'other'},
 '3': {'department': 'bakery'},
 '4': {'department': 'produce'},
 '5': {'department': 'alcohol'},
 '6': {'department': 'international'},
 '7': {'department': 'beverages'},
 '8': {'department': 'pets'},
 '9': {'department': 'dry goods pasta'},
 '10': {'department': 'bulk'},
 '11': {'department': 'personal care'},
 '12': {'department': 'meat seafood'},
 '13': {'department': 'pantry'},
 '14': {'department': 'breakfast'},
 '15': {'department': 'canned goods'},
 '16': {'department': 'dairy eggs'},
 '17': {'department': 'household'},
 '18': {'department': 'babies'},
 '19': {'department': 'snacks'},
 '20': {'department': 'deli'},
 '21': {'department': 'missing'}}

In [189]:
# determine the meaning behind a value of 4 in the column 'department_id' in df_prods
print('Meaning of value of 4 in department_id in df_prods:')
data_dict.get('4')

Meaning of value of 4 in department_id in df_prods:


{'department': 'produce'}
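
The lookup returns a nested dictionary because `to_dict('index')` builds one inner dictionary per row. If only the department name is needed, a flat mapping from the single column may be simpler. A sketch on a hypothetical two-row miniature of `df_dep_t_new`:

```python
import pandas as pd

# hypothetical miniature of df_dep_t_new
df_demo = pd.DataFrame({'department': ['frozen', 'produce']}, index=['1', '4'])

# one inner dictionary per row, as in the cell above
nested = df_demo.to_dict(orient='index')

# flat mapping: id -> department name
flat = df_demo['department'].to_dict()

# the keys are strings, so the integer 4 finds nothing
print(flat.get('4'))
print(flat.get(4))
```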

### <a id='toc1_2_1_'></a>[II.1. Create a subset containing only the required information for breakfast items](#toc0_)

In [190]:
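# create a subset for breakfast items in df_prods
# note: department_id was cast to str in I.2.1, so the filter uses the string '14';
# the integer 14 matches no rows, which is why the output recorded below is empty
df_breakfast = df_prods.loc[df_prods['department_id'].isin(['14'])]

print('A subset for breakfast items in df_prods:')
df_breakfast.head(5)

A subset for breakfast items in df_prods: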


|   | product_id | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|
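
An empty result like the one above is what `isin()` returns when the lookup values have a different type from the column. A minimal sketch of the mismatch, using a hypothetical frame whose product names are made up:

```python
import pandas as pd

# hypothetical products with department_id stored as strings
df_demo = pd.DataFrame({
    'product_name': ['granola', 'oatmeal', 'lager'],
    'department_id': ['14', '14', '5'],
})

# integer lookup value against a string column: no match
rows_int = df_demo[df_demo['department_id'].isin([14])]

# string lookup value: matches the two breakfast rows
rows_str = df_demo[df_demo['department_id'].isin(['14'])]

print(len(rows_int), len(rows_str))
```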


### <a id='toc1_2_2_'></a>[II.2. Create a subset for the following departments: alcohol, deli, beverages, and meat/seafood](#toc0_)

In [191]:
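# create a subset for alcohol, deli, beverages, meat/seafood items in df_prods
# note: department_id was cast to str in I.2.1, so the filter uses string ids; the integer
# ids [5, 20, 7, 12] match no rows, which is why the outputs recorded below are empty
df_party = df_prods.loc[df_prods['department_id'].isin(['5', '20', '7', '12'])].sort_values(by = 'department_id', ascending = False)

print('A subset for alcohol, deli, beverages and meat/seafood items in df_prods:')
df_party.head(5)

A subset for alcohol, deli, beverages and meat/seafood items in df_prods: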


|   | product_id | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|


How many rows does the last dataframe you created have?

In [192]:
#count the rows in df_party
print('Number of rows and columns in df_party:')
df_party.shape

Number of rows and columns in df_party:


(0, 5)
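
A shape of `(0, 5)` means no rows matched the filter. Separately, once `department_id` is stored as text, `sort_values` orders it character by character rather than numerically. The sketch below, on a hypothetical four-row frame, shows the difference and one way to sort numerically with the `key` argument:

```python
import pandas as pd

# hypothetical frame with department_id stored as strings
df_demo = pd.DataFrame({'department_id': ['5', '20', '7', '12']})

# string sort compares character by character
as_text = df_demo.sort_values(by='department_id', ascending=False)

# key= converts the values to integers for the comparison only
as_number = df_demo.sort_values(
    by='department_id', ascending=False, key=lambda col: col.astype(int)
)

print(as_text['department_id'].tolist())
print(as_number['department_id'].tolist())
```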

### <a id='toc1_2_3_'></a>[II.3. Create a subset of user with user_id '1' in df_ords](#toc0_)

In [193]:
# create a subset of user with user_id '1' in df_ords
df_user_1 = df_ords[df_ords['user_id'] == '1']

print('Subset of user with user_id of value of 1 in df_ords:')
df_user_1.head(5)

Subset of user with user_id of value of 1 in df_ords:


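|   | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 |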


Basic stats on user with user_id '1'

In [194]:
# get the basic stats by describing the subset df_user_1
print('Describe the basic stats on user with user_id of value of 1:')
df_user_1.describe()

Describe the basic stats on user with user_id of value of 1:


|   | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|
| count | 11.0 | 11.0 | 11.0 | 10.0 |
| mean | 6.0 | 2.636364 | 10.090909 | 19.0 |
| std | 3.316625 | 1.286291 | 3.477198 | 9.030811 |
| min | 1.0 | 1.0 | 7.0 | 0.0 |
| 25% | 3.5 | 1.5 | 7.5 | 14.25 |
| 50% | 6.0 | 3.0 | 8.0 | 19.5 |
| 75% | 8.5 | 4.0 | 13.0 | 26.25 |
| max | 11.0 | 4.0 | 16.0 | 30.0 |
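
The count for "days_since_prior_order" is one lower than for the other columns because a user's first order has no prior order, and `describe()` skips missing values. The string columns `order_id` and `user_id` are left out for the same reason as in I.1.4. A sketch on hypothetical data (`df_demo` and its values are made up):

```python
import numpy as np
import pandas as pd

# hypothetical orders for two users; a first order has no prior order
df_demo = pd.DataFrame({
    'user_id': ['1', '1', '1', '2'],
    'order_number': [1, 2, 3, 1],
    'days_since_prior_order': [np.nan, 15.0, 21.0, np.nan],
})

# boolean mask; .copy() makes the subset independent of df_demo
df_one = df_demo[df_demo['user_id'] == '1'].copy()

# count ignores the missing value in the first order
stats = df_one.describe()
print(stats.loc['count'])
```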


## <a id='toc1_3_'></a>[III. Data Export](#toc0_)

In [195]:
# export dataframe df_prods in .pkl format to preserve data types
df_prods.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'products_wrangled.pkl'))

In [196]:
# export dataframe df_ords in .pkl format to preserve data types
df_ords.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_wrangled.pkl'))

In [197]:
# export dataframe df_dep_t_new in .csv format
df_dep_t_new.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'departments_wrangled.csv'), index_label='department_id')
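
The reason for choosing pickle over CSV for the dataframes with string identifiers can be checked with a round trip. The sketch below writes a hypothetical frame to a temporary directory (file names and values are made up) and reads it back both ways:

```python
import os
import tempfile

import pandas as pd

# hypothetical frame with identifiers stored as strings
df_demo = pd.DataFrame({'product_id': ['1', '2'], 'prices': [5.8, 9.3]})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, 'demo.pkl')
    csv_path = os.path.join(tmp, 'demo.csv')

    df_demo.to_pickle(pkl_path)
    df_demo.to_csv(csv_path, index=False)

    # pickle restores the values as they were stored
    from_pkl = pd.read_pickle(pkl_path)
    # read_csv infers the types again, so the ids come back as integers
    from_csv = pd.read_csv(csv_path)
    # stating the type on import keeps the ids as strings
    from_csv_str = pd.read_csv(csv_path, dtype={'product_id': str})

print(from_pkl['product_id'].tolist())
print(from_csv['product_id'].tolist())
```

When a CSV export is re-imported, as with departments_wrangled.csv, passing `dtype` to `read_csv` is one way to keep identifier columns as strings.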