# 4.7. Deriving new variables
#

# List of contents:
## 1. Import libraries
## 2. Import dataset
## 3. Create a subset of 'ords_prods_merge' dataframe
## 4. Create a flag column (price_range) for subset df using if-statement with user-defined functions
## 5. Create a flag column (price_range_loc) for subset df using if-statement with loc() function
## 6. Create new variables for ords_prods_merge:
### 6.1. 'price_range_loc' using if-statement with loc() function
### 6.2. 'busiest_day' using if-statements with for-loops
### 6.3. 'busiest_days' column (modification of 'busiest_day')
### 6.4. 'busiest_period_of_day' column
## 7. Export dataframe as 'orders_products_new_variables.pkl'
#

## 1. Import libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

## 2. Import dataset

In [2]:
# Create path variable
path = r'C:\Users\marta\OneDrive\Documents\2023-09-18 Instacart Basket Analysis'

In [3]:
# Import orders_products_merged dataset
ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged.pkl'))

In [4]:
# Check the dimensions
ords_prods_merge.shape

(32399732, 14)

## 3. Create a subset of 'ords_prods_merge' dataframe

In [5]:
# Create a subset of 1 million rows
df = ords_prods_merge[:1000000]

In [6]:
# Check the dimensions
df.shape

(1000000, 14)

## 4. Create a flag column (price_range) for subset df using if-statement with user-defined functions

In [7]:
# Define the function 'price_label' to add a flag for price ranges
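def price_label(row):
    # Return a price-range label for a single row based on its 'prices' value
    if row['prices'] <= 5:
        return 'Low-range product'
    elif (row['prices'] > 5) and (row['prices'] <= 15):
        return 'Mid-range product'
    elif row['prices'] > 15:
        return 'High-range product'
    else:
        # Reached only when 'prices' is missing (NaN fails every comparison)
        return 'Not enough data'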

In [8]:
# Apply 'price_label' function on a subset dataframe
df['price_range'] = df.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)
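The warning above appears because `df` was created by slicing `ords_prods_merge`, so pandas cannot tell whether the assignment is meant to affect the original dataframe. A minimal sketch of one common way to avoid it, taking an explicit `.copy()` of the slice. The dataframe `toy` and its values are made up for illustration and stand in for `ords_prods_merge`:

```python
import pandas as pd

# Hypothetical stand-in for ords_prods_merge (values are made up)
toy = pd.DataFrame({'prices': [2.0, 9.0, 14.8, 20.0]})

# An explicit copy makes the subset an independent dataframe,
# so adding a column to it is unambiguous
subset = toy[:3].copy()
subset.loc[subset['prices'] <= 5, 'price_range'] = 'Low-range product'

# The original dataframe does not gain the new column
print(subset)
print(toy.columns.tolist())
```

In this notebook the equivalent change would be `df = ords_prods_merge[:1000000].copy()`, at the cost of holding a second copy of those rows in memory.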


In [9]:
# Check frequency of values within 'price_range' column
df['price_range'].value_counts()

price_range
Mid-range product    756450
Low-range product    243550
Name: count, dtype: int64

In [10]:
# Check max price value in the subset df
df['prices'].max()

14.8

## 5. Create a flag column (price_range_loc) for subset df using if-statement with loc() function

In [11]:
# Set first condition
df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'


In [12]:
# Set second condition
df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product' 

In [13]:
# Set third condition
df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [14]:
# Check frequency of values within 'price_range_loc' column
df['price_range_loc'].value_counts()

price_range_loc
Mid-range product    756450
Low-range product    243550
Name: count, dtype: int64
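The three `loc` assignments can also be written as a single step. The sketch below uses `np.select`, which is not what the notebook does, with the same thresholds on a small made-up dataframe (`toy` and the column name `price_range_select` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample prices, chosen to hit each boundary (values are made up)
toy = pd.DataFrame({'prices': [1.5, 5.0, 5.1, 15.0, 15.1, np.nan]})

# Same thresholds as the loc() conditions above
conditions = [
    toy['prices'] <= 5,
    (toy['prices'] > 5) & (toy['prices'] <= 15),
    toy['prices'] > 15,
]
labels = ['Low-range product', 'Mid-range product', 'High-range product']

# Rows that match no condition (e.g. missing prices) receive the default label
toy['price_range_select'] = np.select(conditions, labels, default='Not enough data')

print(toy)
```

Unlike the `loc` approach, which leaves unmatched rows as NaN, this version labels them explicitly through `default`.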

## 6. Create new variables for ords_prods_merge:

### 6.1. 'price_range_loc' using if-statement with loc() function 

In [15]:
# Set first condition
ords_prods_merge.loc[ords_prods_merge['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [16]:
# Set second condition
ords_prods_merge.loc[(ords_prods_merge['prices'] <= 15) & (ords_prods_merge['prices'] > 5), 'price_range_loc'] = 'Mid-range product' 

In [17]:
# Set third condition
ords_prods_merge.loc[ords_prods_merge['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [18]:
# Check frequency of values within 'price_range_loc' column
ords_prods_merge['price_range_loc'].value_counts(dropna = False)

price_range_loc
Mid-range product     21860860
Low-range product     10126321
High-range product      412551
Name: count, dtype: int64

### 6.2. 'busiest_day' using if-statements with for-loops 

In [19]:
# Print the frequency of 'order_day_of_week'
ords_prods_merge['order_day_of_week'].value_counts(dropna = False)

order_day_of_week
0    6203329
1    5659298
6    4495887
2    4213105
5    4205076
3    3839865
4    3783172
Name: count, dtype: int64

In [20]:
# Create 'result' list to assign a busyness status for each day
result = []

for value in ords_prods_merge['order_day_of_week']:
  if value == 0:
    result.append('Busiest day')
  elif value == 4:
    result.append('Least busy')
  else:
    result.append('Regularly busy')

In [21]:
# Add 'result' list to ords_prods_merge as 'busiest_day' column
ords_prods_merge['busiest_day'] = result

In [22]:
# Check frequency of values within 'busiest_day' column
ords_prods_merge['busiest_day'].value_counts(dropna = False)

busiest_day
Regularly busy    22413231
Busiest day        6203329
Least busy         3783172
Name: count, dtype: int64
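Looping over 32 million rows in Python works but can be slow. As a hedged alternative to the for-loop, the same labels can be produced with `Series.map` and `fillna`. The dataframe `toy` below is a made-up example, not the real data:

```python
import pandas as pd

# Hypothetical sample of order days (values are made up)
toy = pd.DataFrame({'order_day_of_week': [0, 1, 4, 6, 0, 3]})

# Only the special days are listed; every other day falls through to NaN
day_labels = {0: 'Busiest day', 4: 'Least busy'}

# map() looks each day up in the dictionary, fillna() supplies the 'else' branch
toy['busiest_day'] = toy['order_day_of_week'].map(day_labels).fillna('Regularly busy')

print(toy)
```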

### 6.3. 'busiest_days' column (modification of 'busiest_day')

In [23]:
# Create 'result_new' list to assign new busyness status for each day
result_new = []

for value in ords_prods_merge['order_day_of_week']:
  if value == 0 or value == 1:
    result_new.append('Busiest day')
  elif value == 4 or value == 3:
    result_new.append('Least busy')
  else:
    result_new.append('Regularly busy')

In [24]:
# Create 'busiest_days' column within ords_prods_merge
ords_prods_merge['busiest_days'] = result_new

In [25]:
# Check the output
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_day,busiest_days
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Least busy
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Least busy
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Least busy
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Least busy


In [26]:
# Check frequency of values within 'busiest_days' column
ords_prods_merge['busiest_days'].value_counts(dropna = False)

busiest_days
Regularly busy    12914068
Busiest day       11862627
Least busy         7623037
Name: count, dtype: int64

The counts for 'Busiest day' and 'Least busy' increased because each of these categories now covers two days instead of one.
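The two busiest and two least busy days were read off the frequency table by hand. A minimal sketch of how they could instead be derived from the data, assuming the counts have no ties at the cut-off. The dataframe `toy` and its per-day counts are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical number of orders per day of week (made up, no ties)
orders_per_day = {0: 7, 1: 6, 2: 4, 3: 2, 4: 1, 5: 3, 6: 5}
toy = pd.DataFrame({
    'order_day_of_week': np.repeat(list(orders_per_day.keys()),
                                   list(orders_per_day.values()))
})

# Pick the two most and two least frequent days from the frequency table
day_counts = toy['order_day_of_week'].value_counts()
busiest = day_counts.nlargest(2).index
slowest = day_counts.nsmallest(2).index

# Label each row according to the group its day belongs to
toy['busiest_days'] = np.select(
    [toy['order_day_of_week'].isin(busiest), toy['order_day_of_week'].isin(slowest)],
    ['Busiest day', 'Least busy'],
    default='Regularly busy',
)

print(toy['busiest_days'].value_counts())
```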

### 6.4. 'busiest_period_of_day' column

In [27]:
# Check frequency of values within 'order_hour_of_day' column
ords_prods_merge['order_hour_of_day'].value_counts(dropna = False)

order_hour_of_day
10    2761333
11    2735694
14    2688728
15    2661718
13    2660570
12    2618104
16    2534744
9     2453842
17    2087273
8     1717863
18    1636226
19    1258076
20     976000
7      890923
21     795528
22     634159
23     402272
6      290450
0      218742
1      115683
5       87944
2       69360
4       53232
3       51268
Name: count, dtype: int64

#### The client asked for hours to be divided into three categories ('Most orders', 'Average orders' and 'Fewest orders'), so we assign 8 of the 24 hours to each category, ranked by order count.

In [28]:
# Create 'busy_time' list to assign a busyness status for each hour
busy_time = []

for value in ords_prods_merge['order_hour_of_day']:
  if value in [10, 11, 14, 15, 13, 12, 16, 9]:
    busy_time.append('Most orders')
  elif value in [17, 8, 18, 19, 20, 7, 21, 22]:
    busy_time.append('Average orders')
  else:
    busy_time.append('Fewest orders')

In [30]:
# Create 'busiest_period_of_day' column within ords_prods_merge
ords_prods_merge['busiest_period_of_day'] = busy_time

In [31]:
# Check the output
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_day,busiest_days,busiest_period_of_day
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy,Average orders
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Least busy,Average orders
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Least busy,Most orders
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Least busy,Average orders
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Least busy,Most orders


In [32]:
# Frequency of values within 'busiest_period_of_day' column
ords_prods_merge['busiest_period_of_day'].value_counts(dropna = False)

busiest_period_of_day
Most orders       21114733
Average orders     9996048
Fewest orders      1288951
Name: count, dtype: int64

The three counts sum to 32,399,732, which matches the number of rows in 'ords_prods_merge', so every row received a label.
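The hour groups and the totals above can be cross-checked directly from the 'order_hour_of_day' frequency table. The sketch below copies the per-hour counts from that table, ranks the hours, assigns 8 to each category, and sums the counts per category. The variable names are illustrative only:

```python
import pandas as pd

# Order counts per hour, copied from the frequency table in section 6.4
hour_counts = pd.Series({
    10: 2761333, 11: 2735694, 14: 2688728, 15: 2661718, 13: 2660570,
    12: 2618104, 16: 2534744, 9: 2453842, 17: 2087273, 8: 1717863,
    18: 1636226, 19: 1258076, 20: 976000, 7: 890923, 21: 795528,
    22: 634159, 23: 402272, 6: 290450, 0: 218742, 1: 115683,
    5: 87944, 2: 69360, 4: 53232, 3: 51268,
})

# Rank hours from most to fewest orders and give 8 hours to each category
ranked_hours = hour_counts.sort_values(ascending=False).index.tolist()
categories = ['Most orders'] * 8 + ['Average orders'] * 8 + ['Fewest orders'] * 8
hour_to_label = dict(zip(ranked_hours, categories))

# Sum the per-hour counts within each category
totals = hour_counts.groupby(hour_to_label).sum()

print(totals)
print(totals.sum())
```

On the full dataframe, the same mapping could be applied with `ords_prods_merge['order_hour_of_day'].map(hour_to_label)` instead of a for-loop.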

## 7. Export dataframe as 'orders_products_new_variables.pkl'

In [33]:
# Export dataframe in pkl format
ords_prods_merge.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_new_variables.pkl'))