# 4.7 DERIVING NEW VARIABLES
**Reminder**

By now you know that your final project deliverable is a report, so recording the results you get from this Task as you go will save you a lot of rework in Exercise 10. Complete this Task, then decide whether any of the insights from your submission belong in the **report**. This way you’ll secure yourself a quick and easy final submission!

In this Exercise, you learned how to create new columns based on existing ones in your dataframe and created some flags and summary columns along the way. For the task, you’ll continue to practice creating new columns in your **ords_prods_merged** dataframe. You can work in the same Jupyter notebook you used while following along in the Exercise.

**SCRIPT CONTENTS:**

      A) Importing Libraries & Files
      
      B) Deriving New Variables (Exercise)
          i) If-Statements with User-Defined Functions
          ii) If-Statements with the loc() Function
          iii) If-Statements with For-Loops
          
      C) Task 4.7
          
      D) Exporting Updated Merged Files

### A) IMPORTING LIBRARIES & FILES

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [2]:
# Document File Location
path = r'C:\Users\G\12-2022 Instacart Basket Analysis'

In [3]:
# Import the merged dataframe from the pickle file

df_merge_large = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged.pkl'))

### B) DERIVING NEW VARIABLES (EXERCISES)

#### i) IF-Statements with User-Defined Functions

One thing to note here is that using user-defined functions on a large dataframe can lead to memory issues or trouble with processing power. And your Instacart dataframe is, indeed, large. To avoid any potential issues, let’s just work with a subset of the dataframe for now—the first one million rows.

##### **_SAMPLE OF USER-DEFINED FUNCTION_**

In [4]:
# Define function

def add_num(x, y):
    total = x + y  # avoid the name "sum", which would shadow Python's built-in sum()
    return total

In [5]:
# Run function

add_num(9, 11)

20

In [6]:
# To create this subset, execute the following code:

df = df_merge_large[:1000000]

df

Unnamed: 0,order_id,customer_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,new_customer,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0,both
1,2398795,1,2,3,7,15.0,False,196,1,1,Soda,77,7,9.0,both
2,473747,1,3,3,12,21.0,False,196,1,1,Soda,77,7,9.0,both
3,2254736,1,4,4,7,29.0,False,196,1,1,Soda,77,7,9.0,both
4,431534,1,5,4,15,28.0,False,196,1,1,Soda,77,7,9.0,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999995,2836489,164627,14,0,15,6.0,False,30489,2,0,Original Hummus,67,20,7.5,both
999996,1843600,164632,5,1,19,9.0,False,30489,2,0,Original Hummus,67,20,7.5,both
999997,733106,164632,9,2,22,10.0,False,30489,3,1,Original Hummus,67,20,7.5,both
999998,1650124,164632,17,6,17,13.0,False,30489,1,1,Original Hummus,67,20,7.5,both


In [7]:
# Define a function

def price_label(row):

  if row['prices'] <= 5:
    return 'Low-range product'
  elif (row['prices'] > 5) and (row['prices'] <= 15):
    return 'Mid-range product'
  elif row['prices'] > 15:
    return 'High-range product'
  else:
    return 'Not enough data'

In [8]:
# Apply the function

df['price_range'] = df.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)
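
The warning above appears because `df` was created by slicing `df_merge_large`, so pandas cannot tell whether the new column is meant for the original dataframe or only for the subset. One common way to avoid it is to take an explicit copy of the slice. The sketch below assumes a tiny stand-in dataframe (`df_full`, `df_subset`, and their values are made up for illustration) rather than the real Instacart data.

```python
import pandas as pd

# Hypothetical stand-in for the large merged dataframe
df_full = pd.DataFrame({'prices': [3.0, 9.0, 14.8, 20.0]})

# An explicit copy makes the subset an independent dataframe,
# so adding a column to it is unambiguous
df_subset = df_full[:3].copy()

def price_label(row):
    if row['prices'] <= 5:
        return 'Low-range product'
    elif row['prices'] <= 15:
        return 'Mid-range product'
    elif row['prices'] > 15:
        return 'High-range product'
    return 'Not enough data'

df_subset['price_range'] = df_subset.apply(price_label, axis=1)
print(df_subset)
```

The new column exists only in `df_subset`; `df_full` is left unchanged.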


In [9]:
df['price_range'].value_counts(dropna = False)

Mid-range product    756450
Low-range product    243550
Name: price_range, dtype: int64

In [10]:
# Check max price value

df['prices'].max()

14.8

#### ii) IF-Statements with the loc() Function

_As you saw in the warning message above, pandas suggests using loc() to accomplish the same thing. Let’s see how using loc() would change the workflow. By using loc(), you can apply the conditional logic of an if-statement to a dataframe without explicitly writing an if-else construct._

In [11]:
df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'


In [12]:
df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product' 

In [13]:
df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [14]:
df['price_range_loc'].value_counts(dropna = False)

Mid-range product    756450
Low-range product    243550
Name: price_range_loc, dtype: int64

_Now that you’ve seen how much faster loc() works, you can repeat the process on your entire dataframe rather than the subset. The only difference in the code is the use of **df_merge_large** instead of df:_

In [15]:
df_merge_large.loc[df_merge_large['prices'] > 15, 'price_range'] = 'High-range product'

In [16]:
df_merge_large.loc[(df_merge_large['prices'] <= 15) & (df_merge_large['prices'] > 5), 'price_range'] = 'Mid-range product'

In [17]:
df_merge_large.loc[df_merge_large['prices'] <= 5, 'price_range'] = 'Low-range product'

In [18]:
df_merge_large['price_range'].value_counts(dropna = False)

Mid-range product     21860860
Low-range product     10126321
High-range product      417678
Name: price_range, dtype: int64
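
The three loc() statements above can also be written as a single step. The following is a minimal sketch, assuming NumPy's `np.select` and a small made-up dataframe (`df_demo` and its prices are hypothetical). It is an alternative technique, not the method used in the Exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, including a missing value
df_demo = pd.DataFrame({'prices': [2.5, 5.0, 9.0, 15.0, 20.0, np.nan]})

# Same thresholds as the loc() statements above
conditions = [
    df_demo['prices'] > 15,
    (df_demo['prices'] > 5) & (df_demo['prices'] <= 15),
    df_demo['prices'] <= 5,
]
labels = ['High-range product', 'Mid-range product', 'Low-range product']

# Rows matching no condition (such as missing prices) get the default label
df_demo['price_range'] = np.select(conditions, labels, default='Not enough data')
print(df_demo['price_range'].value_counts(dropna=False))
```

Note one difference: with loc(), rows that match none of the conditions stay as NaN, whereas here they receive the explicit default label.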

#### iii) IF-Statements with For-Loops

##### **_SAMPLE OF SIMPLE LOOP_**

In [19]:
print("My age is 30")

My age is 30


In [20]:
print("My age is 31")

My age is 31


In [21]:
print("My age is 32")

My age is 32


In [22]:
print("My age is 33")

My age is 33


In [23]:
for x in range(30,45):
    print("My age is %d" % (x))

My age is 30
My age is 31
My age is 32
My age is 33
My age is 34
My age is 35
My age is 36
My age is 37
My age is 38
My age is 39
My age is 40
My age is 41
My age is 42
My age is 43
My age is 44


_Now, let’s take a look at how you could use a for-loop on your Instacart dataframe. To do so, you’ll create a new column in your **df_merge_large** dataframe that summarizes how busy each day of the week is. **This would be valuable information for stakeholders as it could give them insight into what products are being bought on the busiest and slowest days. They could use this information to tailor ads on specific days.**_

In [24]:
# Print frequency of order_day_of_week column 

df_merge_large['order_day_of_week'].value_counts(dropna = False)

0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: order_day_of_week, dtype: int64

**LEGEND:** _Values of orders_dow (in original dataframe) and order_day_of_week (in clean/merged dataframe)_
     
      0 = Saturday            
      1 = Sunday               
      2 = Monday               
      3 = Tuesday
      4 = Wednesday
      5 = Thursday
      6 = Friday

**OBSERVATION:** 0 means Saturday. This value has the highest frequency, which means that Saturday is the busiest day. Meanwhile, the 4 value has the lowest frequency. A value of 4, here, refers to Wednesday, meaning that Wednesday is the slowest day when it comes to Instacart app orders.

_You want to use this information to create a new column, “busiest_day,” that will contain one of three different values: __“Busiest day,” “Least busy,” or “Regularly busy.”__ This can be done by way of a for-loop. The loop will run through every row in the **“order_day_of_week”** column, compare its value with what you know are the busiest and slowest days, and assign it the corresponding string value._

In [25]:
# Create an empty list:

result = []

In [26]:
# If-Statement with For-Loop

for value in df_merge_large["order_day_of_week"]:
  if value == 0:
    result.append("Busiest day")
  elif value == 4:
    result.append("Least busy")
  else:
    result.append("Regularly busy")

In [27]:
# Display the list of labels produced by the loop

result

['Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Busiest day',
 'Regularly busy',
 ...

In [28]:
# Add the new column busiest_day to the df_merge_large DataFrame

df_merge_large['busiest_day'] = result

In [29]:
# Verify Busiest Day Output

df_merge_large['busiest_day'].value_counts(dropna = False)

Regularly busy    22416875
Busiest day        6204182
Least busy         3783802
Name: busiest_day, dtype: int64

In [30]:
# Check the length of the result list

len(result)

32404859

In [31]:
# Check that the number of DataFrame rows matches the list length

df_merge_large.shape

(32404859, 17)
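
A for-loop over more than 32 million rows works, but it is slow. As a possible alternative, the sketch below uses pandas' `Series.map` with a dictionary to assign the same labels without an explicit loop. It assumes a small made-up sample (`df_demo` and `day_labels` are hypothetical names) and is not the method used in the Exercise.

```python
import pandas as pd

# Hypothetical sample of order_day_of_week values
df_demo = pd.DataFrame({'order_day_of_week': [0, 4, 2, 6, 0, 3]})

# Days with a special label; every other day falls back to 'Regularly busy'
day_labels = {0: 'Busiest day', 4: 'Least busy'}

# map() returns NaN for days missing from the dictionary,
# and fillna() replaces those with the fallback label
df_demo['busiest_day'] = (
    df_demo['order_day_of_week'].map(day_labels).fillna('Regularly busy')
)
print(df_demo)
```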

### C) TASK

**Directions:**

**1. If you haven’t done so already, complete the instructions in the Exercise for creating the** _“price_label” and “busiest_day”_ **columns.**

**2. Suppose your clients have changed their minds about the labels you created in your** _“busiest_day”_ **column. Now, they want** _“Busiest day” to become “Busiest days” (plural)_**. This label should correspond with the** _two_ **busiest days of the week as opposed to the single busiest day. At the same time, they’d also like to know the** _two_ **slowest days. Create a new column for this using a suitable method.**

In [32]:
# Count orders per day of the week to identify the busiest and slowest days

labeling_counts = df_merge_large['order_day_of_week'].value_counts(dropna = False)

labeling_counts

0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: order_day_of_week, dtype: int64

**LEGEND:** _Values of orders_dow (in original dataframe) and order_day_of_week (in clean/merged dataframe)_
     
      0 = Saturday            
      1 = Sunday               
      2 = Monday               
      3 = Tuesday
      4 = Wednesday
      5 = Thursday
      6 = Friday

In [33]:
# Inspect the index of the frequency counts (day numbers ordered from busiest to slowest)

labeling_counts.index

Int64Index([0, 1, 6, 2, 5, 3, 4], dtype='int64')

In [34]:
# The first two values are the busiest days of the week and the last two are the slowest days.
# Store the ordered day numbers in a new list

day_number = list(labeling_counts.index)

day_number

[0, 1, 6, 2, 5, 3, 4]

In [35]:
# Create an empty list:

results = []

In [36]:
# If-Statement with For-Loops

for value in df_merge_large["order_day_of_week"]:
  if value in day_number[:2]:
    results.append("Busiest days")
  elif value in day_number[-2:]:
    results.append("Slowest days")
  else:
    results.append("Regularly busy")

In [37]:
# Check If-Statement results length

len(results)

32404859

In [38]:
# Verify that the length of the order_day_of_week column matches the len() output above

df_merge_large['order_day_of_week'].shape

(32404859,)

In [39]:
# Create a new column in df_merge_large DataFrame

df_merge_large['busiest days'] = results

results

['Regularly busy',
 'Slowest days',
 'Slowest days',
 'Slowest days',
 'Slowest days',
 'Regularly busy',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Slowest days',
 'Busiest days',
 'Regularly busy',
 'Regularly busy',
 'Busiest days',
 'Busiest days',
 'Regularly busy',
 'Regularly busy',
 'Slowest days',
 'Slowest days',
 'Slowest days',
 'Slowest days',
 'Slowest days',
 'Slowest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Regularly busy',
 'Regularly busy',
 'Busiest days',
 'Regularly busy',
 'Regularly busy',
 'Busiest days',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Slowest days',
 'Regularly busy',
 'Slowest days',
 'Busiest days',
 'Busiest days',
 'Regularly busy',
 'Slowest days',
 'Slowest days',
 'Regularly busy',
 'Regularly busy',
 'Busiest days',
 'Busiest days',
 'Regularly busy',
 'Busiest days',
 'Busiest days',
 'Slowest days',
 'Regularly busy',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 ...

**3. Check the values of this new column for accuracy. Note any observations in markdown format.**

In [40]:
df_merge_large['busiest days'].value_counts(dropna = False)

Regularly busy    12916111
Busiest days      11864412
Slowest days       7624336
Name: busiest days, dtype: int64
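
One way to check the new column for accuracy is to add up the day-of-week counts for the two busiest and two slowest days and compare the totals with the frequencies above. The sketch below does this with the counts copied from the earlier `value_counts()` output; the variable names (`day_counts`, `expected`) are made up for illustration.

```python
import pandas as pd

# Day-of-week counts copied from the value_counts() output above
day_counts = pd.Series(
    {0: 6204182, 1: 5660230, 6: 4496490, 2: 4213830,
     5: 4205791, 3: 3840534, 4: 3783802}
).sort_values(ascending=False)

busiest = day_counts.index[:2]    # two most frequent days
slowest = day_counts.index[-2:]   # two least frequent days

expected = {
    'Busiest days': int(day_counts.loc[busiest].sum()),
    'Slowest days': int(day_counts.loc[slowest].sum()),
}
# Whatever is left over belongs to the three remaining days
expected['Regularly busy'] = int(day_counts.sum()) - sum(expected.values())
print(expected)
```

**OBSERVATION:** The totals match the frequencies of the new column: 11,864,412 rows for the two busiest days (0 and 1), 7,624,336 for the two slowest days (3 and 4), and 12,916,111 for the remaining days. Together they account for all 32,404,859 rows.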

**4. When too many users make Instacart orders at the same time, the app freezes. The senior technical officer at Instacart wants you to identify the busiest hours of the day. Rather than by hour, they want periods of time labeled “Most orders,” “Average orders,” and “Fewest orders.” Create a new column containing these labels called “busiest_period_of_day.”**

In [41]:
# Check DataFrame Head (column names)

df_merge_large.head()

Unnamed: 0,order_id,customer_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,new_customer,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range,busiest_day,busiest days
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy
1,2398795,1,2,3,7,15.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days
2,473747,1,3,3,12,21.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days
3,2254736,1,4,4,7,29.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days
4,431534,1,5,4,15,28.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days


In [42]:
# Create dictionary and Check Output

dict_order_counts = df_merge_large['order_hour_of_day'].value_counts(dropna = False).to_dict()

dict_order_counts

{10: 2761760,
 11: 2736140,
 14: 2689136,
 15: 2662144,
 13: 2660954,
 12: 2618532,
 16: 2535202,
 9: 2454203,
 17: 2087654,
 8: 1718118,
 18: 1636502,
 19: 1258305,
 20: 976156,
 7: 891054,
 21: 795637,
 22: 634225,
 23: 402316,
 6: 290493,
 0: 218769,
 1: 115700,
 5: 87961,
 2: 69375,
 4: 53242,
 3: 51281}

In [43]:
# Create groupings of order frequency by percentile category

quantile_list = list(df_merge_large['order_hour_of_day'].value_counts(dropna = False).quantile([0.75, 0.5, 0.25]))

quantile_list

[2556034.5, 1117230.5, 272562.0]

In [44]:
# Create empty list

results_labeling_hour = []

In [45]:
# If-Statement with For-Loop
# Look up each hour's order count directly in the dictionary instead of scanning every key

for value in df_merge_large['order_hour_of_day']:
    val = dict_order_counts[value]
    if val < quantile_list[2]:
        results_labeling_hour.append('Fewest orders')
    elif val < quantile_list[1]:
        results_labeling_hour.append('Average orders')
    else:
        results_labeling_hour.append('Most orders')

In [46]:
# Define a function that counts each label, then check the list

def label_result(x):
    most_orders = x.count('Most orders')
    average_orders = x.count('Average orders')
    fewest_orders = x.count('Fewest orders')
    
    print(f'Most orders: {most_orders}')
    print(f'Average orders: {average_orders}')
    print(f'Fewest orders: {fewest_orders}')
    print(f'Total: {most_orders + average_orders + fewest_orders}')


label_result(results_labeling_hour)

Most orders: 27818650
Average orders: 3989881
Fewest orders: 596328
Total: 32404859
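
Because there are only 24 distinct hours, the label for each hour can be worked out once and then looked up for every row. The sketch below illustrates this idea with the hourly counts copied from `dict_order_counts` above and the same thresholds (25th and 50th percentiles). The names `hour_counts`, `label_hour`, `hour_labels`, and the sample `hours` are hypothetical.

```python
import pandas as pd

# Hourly order counts copied from dict_order_counts above
hour_counts = pd.Series({
    10: 2761760, 11: 2736140, 14: 2689136, 15: 2662144, 13: 2660954,
    12: 2618532, 16: 2535202, 9: 2454203, 17: 2087654, 8: 1718118,
    18: 1636502, 19: 1258305, 20: 976156, 7: 891054, 21: 795637,
    22: 634225, 23: 402316, 6: 290493, 0: 218769, 1: 115700,
    5: 87961, 2: 69375, 4: 53242, 3: 51281,
})

# Thresholds: 25th and 50th percentiles of the hourly counts
q25, q50 = hour_counts.quantile([0.25, 0.5]).tolist()

def label_hour(count):
    if count < q25:
        return 'Fewest orders'
    elif count < q50:
        return 'Average orders'
    return 'Most orders'

# One label per hour (24 entries), computed once
hour_labels = hour_counts.map(label_hour)

# Hypothetical sample of order_hour_of_day values
hours = pd.Series([8, 7, 12, 3, 23])
print(hours.map(hour_labels).tolist())

# Rows per label, obtained by summing the hourly counts within each label
totals = hour_counts.groupby(hour_labels).sum()
print(totals)
```

Summing the hourly counts per label gives the same totals as the list check above, which is a quick way to confirm the thresholds were applied as intended.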


In [47]:
# Check df_merge_large['order_hour_of_day'] column dimension

df_merge_large['order_hour_of_day'].shape

(32404859,)

In [48]:
# Add the labels as a new column in df_merge_large

df_merge_large['busiest_period_of_day'] = results_labeling_hour

**5. Print the frequency for this new column.**

In [49]:
df_merge_large['busiest_period_of_day'].value_counts(dropna = False)

Most orders       27818650
Average orders     3989881
Fewest orders       596328
Name: busiest_period_of_day, dtype: int64

In [50]:
df_merge_large.head()

Unnamed: 0,order_id,customer_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,new_customer,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range,busiest_day,busiest days,busiest_period_of_day
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy,Most orders
1,2398795,1,2,3,7,15.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Average orders
2,473747,1,3,3,12,21.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Most orders
3,2254736,1,4,4,7,29.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days,Average orders
4,431534,1,5,4,15,28.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days,Most orders


**6. Ensure your notebook is clean and structured and that your code is well commented.**

### D) EXPORTING UPDATED MERGED FILES


**7. Export your dataframe as a pickle file (since you added new columns) and store it correctly in your “Prepared Data” folder.**

In [51]:
# Export to PICKLE
df_merge_large.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_merged_update.pkl'))
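
To confirm that an export worked, you can read the pickle file back and compare it with the dataframe in memory. The sketch below assumes a small made-up dataframe and a temporary folder in place of your “Prepared Data” folder (`df_demo`, `df_check`, and `out_path` are hypothetical names).

```python
import os
import tempfile

import pandas as pd

# Hypothetical small dataframe standing in for df_merge_large
df_demo = pd.DataFrame({
    'order_hour_of_day': [8, 7],
    'busiest_period_of_day': ['Most orders', 'Average orders'],
})

# A temporary folder stands in for the project's "Prepared Data" folder
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'orders_products_merged_update.pkl')
    df_demo.to_pickle(out_path)

    # Read the file back while the folder still exists
    df_check = pd.read_pickle(out_path)

# True if the columns, dtypes, and values survived the round trip
print(df_check.equals(df_demo))
```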

**8. Save your notebook and submit it to your tutor for review.**