## Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import os

## Importing Data

In [2]:
path = r'/Users/nataliawijaya/Documents/Bootcamp/Instacart Basket Analysis/02 Data'

In [3]:
path

'/Users/nataliawijaya/Documents/Bootcamp/Instacart Basket Analysis/02 Data'

In [4]:
# Importing orders_products_merged.pkl file
df_ords_prods_merged = pd.read_pickle(os.path.join(path, 'Prepared Data', 'orders_products_merged.pkl'))

In [5]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both


## Deriving New Variables

1. If-Statements with User-Defined Functions
2. If-Statements with the loc() Function
3. If-Statements with For-Loops

### If-Statements with User-Defined Functions

The basic structure of a user-defined function is as follows:
1. A definition of the name and arguments the function will take
2. What the function is meant to do

Applying a user-defined function row by row to a large dataframe is slow and can lead to memory issues. To avoid any potential problems, let’s work with a subset of the dataframe for now: the first one million rows.

In [7]:
df_ords_prods_merged.shape

(32404859, 14)

In [8]:
df.shape

(1000000, 14)
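The cell that creates `df` is not shown in this export. The following is a minimal sketch of one way to take such a subset, not necessarily what was run here. The toy dataframe, its values, and the names `df_full`, `df_subset`, and `n_rows` are hypothetical stand-ins. Taking an explicit `.copy()` makes the subset independent of the original dataframe, which is one way to avoid the "copy of a slice" warning when new columns are added later.

```python
import pandas as pd

# Hypothetical stand-in for df_ords_prods_merged (values are made up)
df_full = pd.DataFrame({
    'product_id': range(10),
    'prices': [1.0, 9.0, 20.0, 4.5, 12.3, 5.0, 15.0, 7.7, 3.3, 14.8],
})

n_rows = 4  # the notebook uses the first 1,000,000 rows

# Take the first n_rows rows and copy them so the subset owns its data
df_subset = df_full.iloc[:n_rows].copy()

# Adding a column to the copy leaves the original untouched
df_subset['flag'] = 'ok'

print(df_subset.shape)
```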

You’ll apply the following criteria:
1. If the item’s price is lower than or equal to 5 dollars, it will be labeled a low-range product.
2. If the item’s price is above 5 dollars but lower than or equal to 15 dollars, it will be labeled a middle-range product.
3. If the item’s price is above 15 dollars, it will be labeled a high-range product.

In [9]:
# Creating a user-defined function
# The row argument receives one row of the dataframe at a time when the function is applied with axis = 1
def price_label(row):
    if row['prices'] <= 5:
        return 'Low-range product'
    elif (row['prices'] > 5) and (row['prices'] <= 15):
        return 'Middle-range product'
    elif row['prices'] > 15:
        return 'High-range product'
    else:
        # Reached only when the price is missing (NaN), since every comparison with NaN is False
        return 'Not enough data'

In [10]:
# Applying the function
df['price_range'] = df.apply(price_label, axis = 1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis = 1)


Here, axis = 1 tells pandas to pass each row to the function, so price_label is called once for every row in the dataframe. Conversely, axis = 0 (the default) would pass each column to the function instead.
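The difference between the two axis values can be seen on a small toy dataframe. This is only an illustrative sketch; the dataframe `toy` and its columns `a` and `b` are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=1: the function receives one row at a time (a Series indexed by column name)
row_sums = toy.apply(lambda row: row['a'] + row['b'], axis=1)

# axis=0: the function receives one column at a time (a Series indexed by row label)
col_sums = toy.apply(lambda col: col.sum(), axis=0)

print(row_sums.tolist())   # one value per row
print(col_sums.to_dict())  # one value per column
```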

In [11]:
df['price_range'].value_counts(dropna = False)

price_range
Middle-range product    756450
Low-range product       243550
Name: count, dtype: int64

In [12]:
df['prices'].max()

14.8

No price in the subset is higher than 15 (the maximum is 14.8), so no row is labeled a high-range product.

### If-Statements with the loc() Function


1. Using loc, you can apply the conditional logic of an if-statement to a dataframe without explicitly writing an if-else construct. Note that loc is an indexer used with square brackets rather than a function called with parentheses.
2. loc runs much faster: the condition is evaluated on the whole column at once and produces a boolean mask, whereas apply with axis = 1 calls your user-defined function once for every row (remember axis = 1?).
3. Because of this, loc scales to much larger dataframes. If you’d tried to do the same thing with your user-defined function on the full dataframe, it would have been very slow and you might have received a memory error, which is much less likely with loc.

In [14]:
df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [15]:
df.loc[(df['prices'] > 5) & (df['prices'] <= 15), 'price_range_loc'] = 'Middle-range product'

In [16]:
df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [17]:
df['price_range_loc'].value_counts(dropna = False)

price_range_loc
Middle-range product    756450
Low-range product       243550
Name: count, dtype: int64
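As an aside, the three loc assignments can also be written as a single binning step. The sketch below uses `pd.cut`, which is not used anywhere in this notebook; the toy prices and the column name `price_range_cut` are hypothetical. With the default `right=True`, each bin includes its upper edge, which matches the "lower than or equal to" criteria above. Missing prices stay missing, and the result is a categorical column rather than plain strings.

```python
import numpy as np
import pandas as pd

# Toy prices chosen to sit on and around the boundaries 5 and 15
toy = pd.DataFrame({'prices': [1.0, 5.0, 5.1, 15.0, 15.1, np.nan]})

# Bins are (-inf, 5], (5, 15], (15, inf)
toy['price_range_cut'] = pd.cut(
    toy['prices'],
    bins=[-np.inf, 5, 15, np.inf],
    labels=['Low-range product', 'Middle-range product', 'High-range product'],
)

print(toy)
```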

In [18]:
df.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range,price_range_loc
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,Middle-range product,Middle-range product
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Middle-range product
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Middle-range product
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Middle-range product
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Middle-range product


#### Try repeating the process again on your entire dataframe as opposed to the subset

In [19]:
df_ords_prods_merged.loc[df_ords_prods_merged['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [20]:
df_ords_prods_merged.loc[(df_ords_prods_merged['prices'] > 5) & (df_ords_prods_merged['prices'] <= 15), 'price_range_loc'] = 'Middle-range product'

In [21]:
df_ords_prods_merged.loc[df_ords_prods_merged['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [22]:
df_ords_prods_merged['price_range_loc'].value_counts(dropna = False)

price_range_loc
Middle-range product    21860860
Low-range product       10126321
High-range product        417678
Name: count, dtype: int64

In [23]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,Middle-range product
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Middle-range product
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,Middle-range product
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,Middle-range product
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,Middle-range product


### If-Statements with For-Loops

In [24]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,Middle-range product
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Middle-range product
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,Middle-range product
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,Middle-range product
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,Middle-range product


In [25]:
df_ords_prods_merged.shape

(32404859, 15)

In [26]:
df_ords_prods_merged['orders_day_of_week'].value_counts(dropna = False)

orders_day_of_week
0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: count, dtype: int64

Project brief:
- 0: Saturday
- 1: Sunday
- 2: Monday
- 3: Tuesday
- 4: Wednesday
- 5: Thursday
- 6: Friday

Saturday is the busiest day and Wednesday is the least busy day.
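Rather than reading the busiest and least busy day off the frequency table by eye, they can also be looked up in code. This is a small sketch on made-up data; the Series `days` and the variable names are hypothetical, while the `day_names` mapping is taken from the project brief above.

```python
import pandas as pd

# Toy day-of-week values (made up); 0 appears most often, 4 least often
days = pd.Series([0, 0, 0, 1, 1, 2, 4, 0, 1, 2], name='orders_day_of_week')

# Day codes as given in the project brief
day_names = {0: 'Saturday', 1: 'Sunday', 2: 'Monday', 3: 'Tuesday',
             4: 'Wednesday', 5: 'Thursday', 6: 'Friday'}

counts = days.value_counts()

# idxmax / idxmin return the index label (the day code) with the highest / lowest count
busiest = counts.idxmax()
slowest = counts.idxmin()

print(day_names[busiest], day_names[slowest])
```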

In [27]:
result = []

for value in df_ords_prods_merged['orders_day_of_week']:
    if value == 0:
        result.append('Busiest day')
    elif value == 4:
        result.append('Least busy')
    else:
        result.append('Regularly busy')
    

In [28]:
# Previewing the first five labels instead of printing the whole list, which has one entry per row
result[:5]

['Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Least busy']

In [29]:
# Create a new column within the df_ords_prods_merged dataframe and set it equal to result
df_ords_prods_merged['busiest_day'] = result

In [30]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_day
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,Middle-range product,Regularly busy
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Regularly busy
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Regularly busy
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Least busy
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Least busy


In [31]:
df_ords_prods_merged['busiest_day'].value_counts(dropna = False)

busiest_day
Regularly busy    22416875
Busiest day        6204182
Least busy         3783802
Name: count, dtype: int64
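The same labels can be produced without an explicit Python loop. The sketch below is an alternative, not the method used in this notebook: it maps the two special day codes with `Series.map` and fills every other day with the default label. The toy dataframe is hypothetical; its first five values mirror the first five rows shown above.

```python
import pandas as pd

toy = pd.DataFrame({'orders_day_of_week': [2, 3, 3, 4, 4, 0]})

# Only the special days need an entry; unmapped days become NaN
special_days = {0: 'Busiest day', 4: 'Least busy'}

toy['busiest_day'] = (
    toy['orders_day_of_week']
    .map(special_days)          # NaN for every day not in the dictionary
    .fillna('Regularly busy')   # default label for the remaining days
)

print(toy)
```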

## Exercise

#### Suppose your clients have changed their minds about the labels you created in your “busiest_day” column. Now, they want “Busiest day” to become “Busiest days” (plural). This label should correspond with the two busiest days of the week as opposed to the single busiest day. At the same time, they’d also like to know the two slowest days. Create a new column for this using a suitable method.

In [32]:
df_ords_prods_merged['orders_day_of_week'].value_counts(dropna = False)

orders_day_of_week
0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: count, dtype: int64

- 2 busiest days: 0 (Saturday) and 1 (Sunday)
- 2 least busy days: 3 (Tuesday) and 4 (Wednesday)

In [35]:
# The first step is to create an empty list, "result2", and then create the loop
result2 = []

for value in df_ords_prods_merged['orders_day_of_week']:
    if value == 0 or value == 1:
        result2.append('Busiest days')
    elif value == 3 or value == 4:
        result2.append('Least busy days')
    else:
        result2.append('Regularly busy')

In [36]:
df_ords_prods_merged['busiest_days'] = result2

In [37]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_day,busiest_days
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,Middle-range product,Regularly busy,Regularly busy
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Regularly busy,Least busy days
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Regularly busy,Least busy days
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Least busy,Least busy days
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Least busy,Least busy days


In [38]:
df_ords_prods_merged['busiest_days'].value_counts(dropna = False)

busiest_days
Regularly busy     12916111
Busiest days       11864412
Least busy days     7624336
Name: count, dtype: int64
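For comparison, here is a hedged sketch of how the two busiest and two slowest days could be derived from the counts and applied with `isin` and `np.select` instead of a loop. The counts are copied from the frequency table above; the toy dataframe and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Order counts per day, copied from the value_counts() output above
day_counts = pd.Series({0: 6204182, 1: 5660230, 6: 4496490, 2: 4213830,
                        5: 4205791, 3: 3840534, 4: 3783802}).sort_values(ascending=False)

busiest_two = day_counts.index[:2]    # two highest counts
slowest_two = day_counts.index[-2:]   # two lowest counts

toy = pd.DataFrame({'orders_day_of_week': [2, 3, 3, 4, 4, 0]})

# np.select picks the label of the first condition that is True, else the default
toy['busiest_days'] = np.select(
    [toy['orders_day_of_week'].isin(busiest_two),
     toy['orders_day_of_week'].isin(slowest_two)],
    ['Busiest days', 'Least busy days'],
    default='Regularly busy',
)

print(toy)
```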

#### When too many users make Instacart orders at the same time, the app freezes. The senior technical officer at Instacart wants you to identify the busiest hours of the day. Rather than by hour, they want periods of time labeled “Most orders,” “Average orders,” and “Fewest orders.” Create a new column containing these labels called “busiest_period_of_day.”

In [39]:
df_ords_prods_merged['order_hour_of_day'].value_counts(dropna = False)

order_hour_of_day
10    2761760
11    2736140
14    2689136
15    2662144
13    2660954
12    2618532
16    2535202
9     2454203
17    2087654
8     1718118
18    1636502
19    1258305
20     976156
7      891054
21     795637
22     634225
23     402316
6      290493
0      218769
1      115700
5       87961
2       69375
4       53242
3       51281
Name: count, dtype: int64

In [40]:
df_ords_prods_merged['order_hour_of_day'].describe()

count    3.240486e+07
mean     1.342515e+01
std      4.246380e+00
min      0.000000e+00
25%      1.000000e+01
50%      1.300000e+01
75%      1.600000e+01
max      2.300000e+01
Name: order_hour_of_day, dtype: float64

To categorize the hours into three groups (“Most orders,” “Average orders,” and “Fewest orders”), I will rank the 24 hours of the day by order count and split them into 3 groups of 8 (24/3 = 8). Hours are given in 24-hour notation.
- The 8 busiest hours (most orders): 10, 11, 14, 15, 13, 12, 16, 9
- The 8 middle hours (average orders): 17, 8, 18, 19, 20, 7, 21, 22
- The 8 slowest hours (fewest orders): 23, 6, 0, 1, 5, 2, 4, 3
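The three groups can also be derived from the counts in code rather than typed by hand. This is a sketch under the same "rank, then split into groups of 8" assumption; the counts are copied from the frequency table above, while `toy_hours` and the variable names are hypothetical.

```python
import pandas as pd

# Order counts per hour, copied from the value_counts() output above
hour_counts = pd.Series({
    10: 2761760, 11: 2736140, 14: 2689136, 15: 2662144, 13: 2660954, 12: 2618532,
    16: 2535202, 9: 2454203, 17: 2087654, 8: 1718118, 18: 1636502, 19: 1258305,
    20: 976156, 7: 891054, 21: 795637, 22: 634225, 23: 402316, 6: 290493,
    0: 218769, 1: 115700, 5: 87961, 2: 69375, 4: 53242, 3: 51281,
}).sort_values(ascending=False)

# Hours ranked from busiest to slowest, then split into three groups of 8
ranked_hours = hour_counts.index.tolist()
most_hours = ranked_hours[:8]
average_hours = ranked_hours[8:16]
fewest_hours = ranked_hours[16:]

# Lookup table from hour to label
label_by_hour = {
    **{hour: 'Most orders' for hour in most_hours},
    **{hour: 'Average orders' for hour in average_hours},
    **{hour: 'Fewest orders' for hour in fewest_hours},
}

# Toy hours mirroring the first five rows of the dataframe
toy_hours = pd.Series([8, 7, 12, 7, 15])
toy_labels = toy_hours.map(label_by_hour)

print(toy_labels.tolist())
```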

In [41]:
# The first step is to create an empty list, "hours", and then create the loop

hours = []


for value in df_ords_prods_merged['order_hour_of_day']:
    if value in [10, 11, 14, 15, 13, 12, 16, 9]:
        hours.append('Most orders')
    elif value in [23, 6, 0, 1, 5, 2, 4, 3]:
        hours.append('Fewest orders')
    else:
        hours.append('Average orders')

In [42]:
df_ords_prods_merged['busiest_period_of_day'] = hours

In [43]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_day,busiest_days,busiest_period_of_day
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,Middle-range product,Regularly busy,Regularly busy,Average orders
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Regularly busy,Least busy days,Average orders
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Regularly busy,Least busy days,Most orders
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Least busy,Least busy days,Average orders
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Least busy,Least busy days,Most orders


In [44]:
# Check the output of the 'order_hour_of_day' and the 'busiest_period_of_day' column for accuracy

df_ords_prods_merged[['order_hour_of_day', 'busiest_period_of_day']].head(10)

Unnamed: 0,order_hour_of_day,busiest_period_of_day
0,8,Average orders
1,7,Average orders
2,12,Most orders
3,7,Average orders
4,15,Most orders
5,7,Average orders
6,9,Most orders
7,14,Most orders
8,16,Most orders
9,8,Average orders


#### Print the frequency for this new column.

In [45]:
df_ords_prods_merged['busiest_period_of_day'].value_counts(dropna = False)

busiest_period_of_day
Most orders       21118071
Average orders     9997651
Fewest orders      1289137
Name: count, dtype: int64

#### Export your dataframe as a pickle file (since you added new columns) and store it correctly in your “Prepared Data” folder

In [46]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_day,busiest_days,busiest_period_of_day
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,Middle-range product,Regularly busy,Regularly busy,Average orders
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Regularly busy,Least busy days,Average orders
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Regularly busy,Least busy days,Most orders
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Least busy,Least busy days,Average orders
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,Middle-range product,Least busy,Least busy days,Most orders


In [47]:
# Exporting df_ords_prods_merged dataframe as “orders_products_merged_updated.pkl” in “Prepared Data” folder
df_ords_prods_merged.to_pickle(os.path.join(path, 'Prepared Data', 'orders_products_merged_updated.pkl'))
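A quick way to confirm that an export worked is to read the file back and compare it with the dataframe in memory. The sketch below does this on a small made-up dataframe written to a temporary folder; the folder and the toy data are assumptions for illustration, and on the full dataset this check means loading the file a second time.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for df_ords_prods_merged (values are made up)
toy = pd.DataFrame({
    'order_hour_of_day': [8, 7, 12],
    'busiest_period_of_day': ['Average orders', 'Average orders', 'Most orders'],
})

# Write to a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'orders_products_merged_updated.pkl')
    toy.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# equals() checks shape, values, and dtypes
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```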