# 4.7 Deriving New Variables

### Contents
01 Import Libraries and Data

02 Create Subset of Data

03 Create and Apply User-defined Function to Subset

04 Use loc() Function on Subset

05 Use loc() Function on Full Dataframe

06 Create For Loop

07 Task 4.7 Exercises

08 Export ords_prods_merged to Pickle File

### 01 Import Libraries and Data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [2]:
# Create path variable for main project folder
path = r'D:\JupyterProjects\06-2022 Instacart Basket Analysis'

In [3]:
# Use path variable to import orders_products_merged.pkl as ords_prods_merged
ords_prods_merged = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged.pkl'))

In [4]:
ords_prods_merged.shape

(32404859, 15)

### 02 Create Subset of Data

In [5]:
# Use a subset of 1,000,000 rows
df = ords_prods_merged[:1000000]

In [6]:
df.shape

(1000000, 15)
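Slicing `ords_prods_merged` can return a view rather than an independent dataframe, which is why pandas warns when a column is later assigned to `df`. A minimal sketch of one common way to avoid this, using a small made-up dataframe (`toy` and `toy_subset` are illustrative names, not part of the project data):

```python
import pandas as pd

# Hypothetical stand-in for ords_prods_merged
toy = pd.DataFrame({'prices': [2.5, 7.0, 12.3, 4.1]})

# .copy() makes the subset an independent dataframe instead of a slice
toy_subset = toy[:2].copy()

# Adding a column now affects only the subset
toy_subset['price_range'] = 'placeholder'

print(toy.columns.tolist())         # columns of the original
print(toy_subset.columns.tolist())  # columns of the subset
```

The trade-off is memory: the copy duplicates the selected rows, which is usually acceptable for a test subset.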

### 03 Create and Apply User-defined Function to Subset

In [7]:
# Create user-defined function price_label
def price_label(row):

  if row['prices'] <= 5:
    return 'Low-range product'
  elif (row['prices'] > 5) and (row['prices'] <= 15):
    return 'Mid-range product'
  elif row['prices'] > 15:
    return 'High-range product'
  else:
    return 'Not enough data'

In [8]:
# Apply user-defined function
df['price_range'] = df.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)


In [9]:
# Check frequencies for values in new column
df.value_counts(['price_range'])

price_range      
Mid-range product    756450
Low-range product    243550
dtype: int64

In [10]:
# Find the maximum value in the prices column (explains why the subset has no high-range products)
df['prices'].max()

14.8
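Because `apply` with `axis=1` calls `price_label` once per row, it can be slow on large dataframes. As a sketch of a vectorized alternative using the same thresholds, `pd.cut` can bin the `prices` column directly. The toy data and the name `toy` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical prices chosen to sit on and around the thresholds
toy = pd.DataFrame({'prices': [1.0, 5.0, 5.1, 15.0, 15.1, np.nan]})

# Bins are right-inclusive by default: (-inf, 5], (5, 15], (15, inf)
toy['price_range'] = pd.cut(
    toy['prices'],
    bins=[-np.inf, 5, 15, np.inf],
    labels=['Low-range product', 'Mid-range product', 'High-range product'],
)

print(toy)
```

Unlike `price_label`, which returns 'Not enough data' for a missing price, `pd.cut` leaves missing prices as NaN, and the new column is categorical rather than plain strings.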

### 04 Use loc() Function on Subset

In [11]:
# Label rows with prices above 15 as high-range
df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'


In [12]:
# Label rows with prices above 5 and up to 15 as mid-range
df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [13]:
# Label rows with prices of 5 or less as low-range
df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [14]:
# Check frequencies for values in price_range_loc
df.value_counts(['price_range_loc'])

price_range_loc  
Mid-range product    756450
Low-range product    243550
dtype: int64

### 05 Use loc() Function on Full Dataframe

In [15]:
# Use loc on the full ords_prods_merged dataframe
ords_prods_merged.loc[ords_prods_merged['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [16]:
ords_prods_merged.loc[(ords_prods_merged['prices'] <= 15) & (ords_prods_merged['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [17]:
ords_prods_merged.loc[ords_prods_merged['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [18]:
# Count products by price range
ords_prods_merged['price_range_loc'].value_counts(dropna = False)

Mid-range product     21860860
Low-range product     10126321
High-range product      417678
Name: price_range_loc, dtype: int64
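The three `loc` assignments can also be written as a single step. The sketch below uses `np.select` with the same conditions on a small made-up dataframe; `toy`, `conditions`, and `labels` are illustrative names, and the 'Not enough data' default is borrowed from `price_label` as an assumption about how missing prices should be handled:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ords_prods_merged
toy = pd.DataFrame({'prices': [3.0, 9.5, 20.0, np.nan]})

# Same three conditions as the loc statements
conditions = [
    toy['prices'] > 15,
    (toy['prices'] > 5) & (toy['prices'] <= 15),
    toy['prices'] <= 5,
]
labels = ['High-range product', 'Mid-range product', 'Low-range product']

# Rows matching no condition (for example, missing prices) get the default
toy['price_range_loc'] = np.select(conditions, labels, default='Not enough data')

print(toy)
```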

### 06 Create For Loop

In [19]:
# Count rows (ordered products) by day of week
ords_prods_merged['orders_day_of_week'].value_counts(dropna = False)


0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: orders_day_of_week, dtype: int64

In [20]:
# Create result list
result = []

for value in ords_prods_merged["orders_day_of_week"]:
  if value == 0:
    result.append("Busiest day")
  elif value == 4:
    result.append("Least busy")
  else:
    result.append("Regularly busy")

In [21]:
# View length of result list
len(result)

32404859

In [23]:
# Create new column and set it equal to result
ords_prods_merged['busiest_day'] = result

In [24]:
# View frequency results
ords_prods_merged['busiest_day'].value_counts(dropna = False)

Regularly busy    22416875
Busiest day        6204182
Least busy         3783802
Name: busiest_day, dtype: int64
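Looping over 32 million values in Python works but is slow. As a hedged alternative, the same labels can be produced without an explicit loop by mapping the two special days and filling everything else. The dataframe `toy` and the dictionary `day_labels` below are illustrative assumptions:

```python
import pandas as pd

# Hypothetical stand-in for ords_prods_merged
toy = pd.DataFrame({'orders_day_of_week': [0, 4, 2, 6, 0]})

# Only the special days need an entry; unmapped days become NaN
day_labels = {0: 'Busiest day', 4: 'Least busy'}

# Fill the unmapped days with the default label
toy['busiest_day'] = (
    toy['orders_day_of_week'].map(day_labels).fillna('Regularly busy')
)

print(toy)
```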

### 07 Task 4.7 Exercises

Suppose your clients have changed their minds about the labels you created in your “busiest_day” column. Now, they want “Busiest day” to become “Busiest days” (plural). This label should correspond with the two busiest days of the week as opposed to the single busiest day. At the same time, they’d also like to know the two slowest days. Create a new column for this using a suitable method.

In [25]:
# Create new result variable
result2 = []

for value in ords_prods_merged['orders_day_of_week']:
  if value in [0, 1]:
    result2.append("Busiest days")
  elif value in [3, 4]:
    result2.append("Slowest days")
  else:
    result2.append("Regularly busy")

In [26]:
# Create new column for busiest days
ords_prods_merged['busiest_days'] = result2

In [27]:
# View frequency results
ords_prods_merged['busiest_days'].value_counts(dropna = False)

Regularly busy    12916111
Busiest days      11864412
Slowest days       7624336
Name: busiest_days, dtype: int64
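The days `[0, 1]` and `[3, 4]` were read off the frequency table by hand. A sketch of how they could be derived from the counts instead, so the labels stay correct if the data changes, is shown below. The toy data and the names `counts`, `busiest`, and `slowest` are assumptions, and the approach presumes no ties at the cutoffs:

```python
import numpy as np
import pandas as pd

# Hypothetical data: day 0 and 1 are busiest, day 3 and 4 are slowest
toy = pd.DataFrame({
    'orders_day_of_week': [0]*5 + [1]*4 + [2]*3 + [3]*1 + [4]*2 + [5]*3 + [6]*3
})

# value_counts sorts from most to least frequent
counts = toy['orders_day_of_week'].value_counts()
busiest = counts.index[:2]   # two most frequent days
slowest = counts.index[-2:]  # two least frequent days

toy['busiest_days'] = np.select(
    [toy['orders_day_of_week'].isin(busiest),
     toy['orders_day_of_week'].isin(slowest)],
    ['Busiest days', 'Slowest days'],
    default='Regularly busy',
)

print(toy['busiest_days'].value_counts(dropna=False))
```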

Check the values of this new column for accuracy. Note any observations in markdown format.

In [28]:
# Check sum of busiest days
6204182+5660230

11864412

In [29]:
# Check sum of slowest days
3840534+3783802

7624336

In [30]:
# Check sum of regularly busy days
4213830+4205791+4496490

12916111

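The sums match the frequencies in cell 27, and the three labels together account for all 32,404,859 rows, so no row was left unlabeled.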
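The manual additions can also be replaced by a cross-tabulation, which shows at a glance which label each day received. This is a minimal sketch on made-up rows; `toy`, `check`, and `labels_per_day` are illustrative names:

```python
import pandas as pd

# Hypothetical rows with the labels already assigned
toy = pd.DataFrame({
    'orders_day_of_week': [0, 1, 2, 3, 4, 5, 6, 0],
    'busiest_days': ['Busiest days', 'Busiest days', 'Regularly busy',
                     'Slowest days', 'Slowest days', 'Regularly busy',
                     'Regularly busy', 'Busiest days'],
})

# Rows are days, columns are labels, cells are row counts
check = pd.crosstab(toy['orders_day_of_week'], toy['busiest_days'])

# Each day should have a nonzero count under exactly one label
labels_per_day = (check > 0).sum(axis=1)

print(check)
print(labels_per_day)
```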

When too many users make Instacart orders at the same time, the app freezes. The senior technical officer at Instacart wants you to identify the busiest hours of the day. Rather than by hour, they want periods of time labeled “Most orders,” “Average orders,” and “Fewest orders.” Create a new column containing these labels called “busiest_period_of_day.”

In [31]:
# View column names
ords_prods_merged.columns

Index(['order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_prior_order', 'new_customer',
       'product_id', 'add_to_cart_order', 'reordered', 'product_name',
       'aisle_id', 'department_id', 'prices', '_merge', 'price_range_loc',
       'busiest_day', 'busiest_days'],
      dtype='object')

In [32]:
# View frequency for order_hour_of_day
ords_prods_merged['order_hour_of_day'].value_counts(dropna = False)

10    2761760
11    2736140
14    2689136
15    2662144
13    2660954
12    2618532
16    2535202
9     2454203
17    2087654
8     1718118
18    1636502
19    1258305
20     976156
7      891054
21     795637
22     634225
23     402316
6      290493
0      218769
1      115700
5       87961
2       69375
4       53242
3       51281
Name: order_hour_of_day, dtype: int64

In [33]:
# Create new list variable
busyhours = []

for value in ords_prods_merged['order_hour_of_day']:
  if value == 10:
    busyhours.append('Most orders')
  elif value == 3:
    busyhours.append('Fewest orders')
  else:
    busyhours.append('Average orders')

In [34]:
# Create new column for busiest period of day
ords_prods_merged['busiest_period_of_day'] = busyhours

Print the frequency for this new column.

In [35]:
# View frequency results
ords_prods_merged['busiest_period_of_day'].value_counts(dropna = False)

Average orders    29591818
Most orders        2761760
Fewest orders        51281
Name: busiest_period_of_day, dtype: int64
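The loop above labels only the single busiest hour and the single quietest hour. If the labels should instead cover periods of several hours, one possible sketch groups the hours into three equal-sized bands by row count using `pd.qcut`. The three-way split, the toy data, and the names `hour_counts` and `hour_labels` are assumptions rather than requirements from the task:

```python
import pandas as pd

# Hypothetical data: six distinct hours with distinct row counts
toy = pd.DataFrame({
    'order_hour_of_day': [10]*6 + [11]*5 + [14]*4 + [20]*3 + [2]*2 + [3]*1
})

# Number of rows per hour
hour_counts = toy['order_hour_of_day'].value_counts()

# Split the hours into three bands, from lowest to highest count
hour_labels = pd.qcut(
    hour_counts, q=3,
    labels=['Fewest orders', 'Average orders', 'Most orders'],
)

# Look up each row's hour in the hour-to-label series
toy['busiest_period_of_day'] = (
    toy['order_hour_of_day'].map(hour_labels).astype(str)
)

print(toy['busiest_period_of_day'].value_counts(dropna=False))
```

`pd.qcut` can fail when many hours share the same count, so this approach suits data like the frequency table above, where every hour has a distinct total.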

### 08 Export ords_prods_merged to Pickle File

In [36]:
# Export dataframe with the new columns to the main project folder
ords_prods_merged.to_pickle(os.path.join(path, 'ords_prods_merged.pkl'))
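After exporting, it can be worth confirming that the file reads back unchanged. The sketch below does a round trip on a small made-up dataframe in a temporary folder; `toy`, `file_path`, and `reloaded` are illustrative names, and the temporary folder stands in for the project path:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for ords_prods_merged
toy = pd.DataFrame({
    'prices': [2.5, 7.0],
    'price_range_loc': ['Low-range product', 'Mid-range product'],
})

# Write to a temporary folder, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'toy.pkl')
    toy.to_pickle(file_path)
    reloaded = pd.read_pickle(file_path)

# equals() compares values, dtypes, and shape
print(reloaded.equals(toy))
```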