# Contents List

01. Import libraries
02. Import and narrow data
03. Define and execute user-derived function
04. If-statements with loc()
05. Execute loc() on whole df
06. If-statements with for-loops
07. Task

# 01. Import libraries

In [1]:
# import libraries
import pandas as pd
import numpy as np
import os

# 02. Import and narrow data

In [2]:
# create shortcut for data imports
path = r'C:\Users\jacym\Desktop\Career Foundry projects\04-2023 Instacart basket analysis'

In [3]:
# import merged data
df_orders_products_merged = pd.read_pickle(os.path.join(path, '02 Data','Prepared data', 'orders_products_merged.pkl'))

In [4]:
# narrow dataframe to first million rows to prevent memory errors while working with user-derived functions
df = df_orders_products_merged[:1000000]

In [5]:
# check work
df.shape

(1000000, 14)

# 03. Define and execute user-derived function

In [6]:
# define new price_label function

def price_label(row):

  if row['prices'] <= 5:
    return 'Low-range product'
  elif (row['prices'] > 5) and (row['prices'] <= 15):
    return 'Mid-range product'
  elif row['prices'] > 15:
    return 'High-range product'
  else:
    # reached only when the price is missing (NaN), since every comparison above is then False
    return 'Not enough data'

In [7]:
# execute function -- apply price_label to each row (axis=1; axis=0 would apply it to each column) and store the result in a new column called price_range
df['price_range'] = df.apply(price_label, axis=1)
# the red message below is a SettingWithCopyWarning, not an error

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)
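
The warning above appears because `df` was created by slicing `df_orders_products_merged`, so pandas cannot tell whether the new column is meant to reach the original dataframe. A minimal sketch, assuming a small made-up frame (`full` and `subset` are hypothetical names, not part of this notebook), of how taking an explicit `.copy()` when narrowing would likely avoid the message:

```python
import pandas as pd

# Hypothetical stand-in for the merged orders/products data
full = pd.DataFrame({'prices': [3.0, 9.0, 14.8, 20.0]})

# Take an explicit copy of the first rows so that later column assignments
# act on an independent dataframe rather than on a slice of `full`
subset = full[:3].copy()

subset['price_range'] = ['Low-range product', 'Mid-range product', 'Mid-range product']

# The original frame is left untouched
print(list(full.columns))
print(subset)
```

In this notebook the equivalent change would be in cell 4, at the cost of holding a second copy of the first million rows in memory.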


In [8]:
# run value_counts function to see how labels are distributed. Note there are no high-range items in this subset of the data
df['price_range'].value_counts()

Mid-range product    756450
Low-range product    243550
Name: price_range, dtype: int64

In [9]:
# confirm the value count by looking for the max price. If it's over 15, there's an error in your code
df['prices'].max()

14.8

# 04. If-statements with loc()
As the warning message above suggests, the loc indexer (built into pandas rather than user-derived) is a better way to handle these types of operations. Here, the if-then logic is implied: the condition selects the rows, and the assignment supplies the value for them.

In [10]:
# first set conditions, each in its own cell
# if price is more than 15, then assign 'High-range product' in a new column called price_range_loc
df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'


In [11]:
# if price is more than 5 and at most 15, then assign 'Mid-range product' (wrap each condition -- <= 15 and > 5 -- in parentheses and join them with &)
df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [12]:
# if price is 5 or less, then assign 'Low-range product'
df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [13]:
# check value_counts -- same result as above, and loc is faster (the warning in cell 10 comes from df being a slice, not from loc itself)
df['price_range_loc'].value_counts()

Mid-range product    756450
Low-range product    243550
Name: price_range_loc, dtype: int64
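
The three separate loc assignments can also be expressed in one step. The sketch below uses `np.select`, which is a swapped-in alternative and not the method used in this notebook; the `toy` frame and the `price_range_select` column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample prices, including one missing value
toy = pd.DataFrame({'prices': [2.5, 5.0, 9.0, 15.0, 22.0, np.nan]})

# One Boolean condition per label, in the same order as the labels
conditions = [
    toy['prices'] <= 5,
    (toy['prices'] > 5) & (toy['prices'] <= 15),
    toy['prices'] > 15,
]
labels = ['Low-range product', 'Mid-range product', 'High-range product']

# np.select takes the label of the first condition that is True for each row;
# rows matching no condition (here, the missing price) receive the default
toy['price_range_select'] = np.select(conditions, labels, default='Not enough data')
print(toy)
```

Unlike the loc version, which leaves unmatched rows empty, this form makes the fallback label explicit.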

# 05. Execute loc() on whole df
Since loc is faster than applying a user-derived function row by row, we can use it on the complete dataframe.
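
The speed difference can be checked on a small scale. This is a rough sketch on synthetic prices (the `toy` frame, the column names `by_apply` and `by_loc`, and the simplified `price_label` are assumptions for illustration); actual timings depend on the machine and the pandas version, so no expected figures are given.

```python
import time

import numpy as np
import pandas as pd

# Synthetic prices, not the Instacart data
rng = np.random.default_rng(0)
toy = pd.DataFrame({'prices': rng.uniform(1, 25, size=20_000).round(1)})


def price_label(row):
    # Simplified version of the row-wise function from section 03
    if row['prices'] <= 5:
        return 'Low-range product'
    elif row['prices'] <= 15:
        return 'Mid-range product'
    else:
        return 'High-range product'


# Time the row-by-row apply
start = time.perf_counter()
toy['by_apply'] = toy.apply(price_label, axis=1)
apply_seconds = time.perf_counter() - start

# Time the three column-wise loc assignments
start = time.perf_counter()
toy.loc[toy['prices'] > 15, 'by_loc'] = 'High-range product'
toy.loc[(toy['prices'] <= 15) & (toy['prices'] > 5), 'by_loc'] = 'Mid-range product'
toy.loc[toy['prices'] <= 5, 'by_loc'] = 'Low-range product'
loc_seconds = time.perf_counter() - start

print(f'apply: {apply_seconds:.3f}s, loc: {loc_seconds:.3f}s')
```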

In [14]:
# if price is more than 15, then assign 'High-range product' in a new column called price_range_loc
df_orders_products_merged.loc[df_orders_products_merged['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [15]:
# if price is more than 5 and at most 15, then assign 'Mid-range product' (wrap each condition -- <= 15 and > 5 -- in parentheses and join them with &)
df_orders_products_merged.loc[(df_orders_products_merged['prices'] <= 15) & (df_orders_products_merged['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [16]:
# if price is 5 or less, then assign 'Low-range product'
df_orders_products_merged.loc[df_orders_products_merged['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [17]:
# check value counts on full data frame. Now we have some high-range entries.
df_orders_products_merged['price_range_loc'].value_counts()

Mid-range product     21860860
Low-range product     10131448
High-range product      412551
Name: price_range_loc, dtype: int64

In [18]:
df_orders_products_merged.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,Mid-range product
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Mid-range product
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,Mid-range product
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,Mid-range product
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,Mid-range product


# 06. If-statements with for-loops
This is faster than a user-derived function because you're only looping through the values of one column, rather than passing every full row to a function.

Goal: Create a new column showing how busy each order day is

In [19]:
# run frequency check
df_orders_products_merged['order_day_of_week'].value_counts()

0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: order_day_of_week, dtype: int64

In [26]:
# create an empty list, then loop through the day-of-week column and append a label depending on which condition is met

result = []

for value in df_orders_products_merged['order_day_of_week']:
  if value == 0:
    result.append('Busiest day')
  elif value == 4:
    result.append('Least busy')
  else:
    result.append('Regularly busy')

In [21]:
# view list
result

['Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Least busy',
 ...

In [27]:
# insert list in df by creating new column and filling it with your result list
df_orders_products_merged['busiest_day'] = result

In [28]:
# run frequency check of newly created column to see orders by busyness category (you can cross-check this with the frequency check from cell 19)
df_orders_products_merged['busiest_day'].value_counts(dropna = False)

Regularly busy    22416875
Busiest day        6204182
Least busy         3783802
Name: busiest_day, dtype: int64
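
The same labels can be produced without an explicit loop. The sketch below uses `Series.map` with a dictionary, which is an alternative to the for-loop above rather than the approach taken in this notebook; the sample `days` values are made up.

```python
import pandas as pd

# Hypothetical sample of day-of-week codes
days = pd.Series([0, 1, 4, 6, 0, 3], name='order_day_of_week')

# Only the special cases need an entry; every other day becomes NaN after map
labels = {0: 'Busiest day', 4: 'Least busy'}

# Fill the unmapped days with the default label
busiest_day = days.map(labels).fillna('Regularly busy')
print(busiest_day.value_counts())
```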

# 07. Task

Step 2: Create busiest_days column

In [31]:
# create new list 'result2' for the new classification of how busy each day is
# use 'or' as seen below so that one condition matches more than one value
result2 = []

for value in df_orders_products_merged['order_day_of_week']:
  if value == 0 or value == 1:
    result2.append('Busy day')
  elif value == 4 or value == 3:
    result2.append('Slow day')
  else:
    result2.append('Regular day')

In [32]:
# create new column and fill with result2
df_orders_products_merged['busiest_days'] = result2
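
Chaining `or` works, but a membership test reads more easily as the number of values grows. A small sketch under assumed names (`busy_days`, `slow_days` and the sample `days` series are hypothetical):

```python
import pandas as pd

# One code for each day of the week
days = pd.Series([0, 1, 2, 3, 4, 5, 6], name='order_day_of_week')

busy_days = [0, 1]
slow_days = [3, 4]

result2 = []
for value in days:
    # 'in' checks the value against every entry of the list
    if value in busy_days:
        result2.append('Busy day')
    elif value in slow_days:
        result2.append('Slow day')
    else:
        result2.append('Regular day')

# Series.isin performs the same membership test on the whole column at once
is_busy = days.isin(busy_days)
print(result2)
print(is_busy.tolist())
```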

Step 3: Check frequency of new column

In [34]:
# run frequency check of new column to see how many orders were logged in each category
df_orders_products_merged['busiest_days'].value_counts()

Regular day    12916111
Busy day       11864412
Slow day        7624336
Name: busiest_days, dtype: int64

Observations: The three regular days of the week (Monday, Thursday and Friday) account for the most orders in total, followed closely by the two busiest days (Saturday and Sunday). Because the regular group spans three days and the busy group only two, each busy day still sees more orders than each regular day. The slow days (Tuesday and Wednesday) account for considerably fewer orders.
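
The group totals above cover different numbers of days (three regular days, two busy days, two slow days), so a per-day average gives a fairer comparison. A short sketch using the counts from the frequency check; the dictionary names are assumptions for illustration:

```python
# Totals copied from the busiest_days frequency check
totals = {'Regular day': 12916111, 'Busy day': 11864412, 'Slow day': 7624336}

# Number of weekdays assigned to each category in the for-loop
days_in_group = {'Regular day': 3, 'Busy day': 2, 'Slow day': 2}

# Average number of orders per weekday within each category
per_day = {label: totals[label] / days_in_group[label] for label in totals}

for label, average in per_day.items():
    print(f'{label}: {average:,.0f} orders per day')
```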

Step 4: Create busiest_period_of_day column

In [37]:
# check frequency of hour of day column to see the slowest and busiest periods
df_orders_products_merged['order_hour_of_day'].value_counts()

10    2761760
11    2736140
14    2689136
15    2662144
13    2660954
12    2618532
16    2535202
9     2454203
17    2087654
8     1718118
18    1636502
19    1258305
20     976156
7      891054
21     795637
22     634225
23     402316
6      290493
0      218769
1      115700
5       87961
2       69375
4       53242
3       51281
Name: order_hour_of_day, dtype: int64

In [38]:
# Create for-loop designating hours 9-16 as most orders, 0-6 as fewest orders, and everything else as average
hours = []

for value in df_orders_products_merged['order_hour_of_day']:
  if value >= 9 and value <= 16:
    hours.append('Most orders')
  elif value >= 0 and value <= 6:
    hours.append('Fewest orders')
  else:
    hours.append('Average orders')

In [39]:
# create new column with hours results
df_orders_products_merged['busiest_period_of_day'] = hours

Step 5: Check frequency of new column

In [40]:
df_orders_products_merged['busiest_period_of_day'].value_counts()

Most orders       21118071
Average orders    10399967
Fewest orders       886821
Name: busiest_period_of_day, dtype: int64
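
The hour ranges can also be labelled without a loop. The sketch below uses `Series.between`, which includes both endpoints by default; it is an alternative to the for-loop above, and the sample `hours_sample` values and the `period` name are made up.

```python
import pandas as pd

# Hypothetical sample of order hours covering each range boundary
hours_sample = pd.Series([0, 6, 7, 8, 9, 16, 17, 23], name='order_hour_of_day')

# Start with the default label, then overwrite the two special ranges
period = pd.Series('Average orders', index=hours_sample.index)
period.loc[hours_sample.between(9, 16)] = 'Most orders'
period.loc[hours_sample.between(0, 6)] = 'Fewest orders'

print(period.tolist())
```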

Step 7: Export data with derivations as pkl

In [41]:
# export the dataframe with the new derived columns to the same folder the data was imported from
df_orders_products_merged.to_pickle(os.path.join(path, '02 Data','Prepared data', 'orders_products_merged_expanded.pkl'))
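
To check that an export like this round-trips cleanly, the file can be read back and compared with the dataframe in memory. A minimal sketch on a made-up frame written to a temporary folder (the `toy` frame and the file name `toy_expanded.pkl` are hypothetical):

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the expanded dataframe
toy = pd.DataFrame({
    'order_day_of_week': [0, 4],
    'busiest_day': ['Busiest day', 'Least busy'],
})

# Write to a temporary folder so nothing is left behind on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_expanded.pkl')
    toy.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# equals() compares values, column order and dtypes
print(reloaded.equals(toy))
```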