# 4.7. IC_Deriving New Variables

#### Context:
- In this Exercise, you learned how to create new columns based on existing ones in your dataframe and created some flags and summary columns along the way. For the task, you’ll continue to practice creating new columns in your ords_prods_merge dataframe. You can work in the same Jupyter notebook you used while following along in the Exercise.
#### Directions
- 1. Create the “price_label” and “busiest_day” columns.
- 2. Suppose your clients have changed their minds about the labels you created in your “busiest_day” column. Now, they want “Busiest day” to become “Busiest days” (plural). This label should correspond with the two busiest days of the week as opposed to the single busiest day. At the same time, they’d also like to know the two slowest days. Create a new column for this using a suitable method.
- 3. Check the values of this new column for accuracy. Note any observations in markdown format.
- 4. When too many users make Instacart orders at the same time, the app freezes. The senior technical officer at Instacart wants you to identify the busiest hours of the day. Rather than by hour, they want periods of time labeled “Most orders,” “Average orders,” and “Fewest orders.” Create a new column containing these labels called “busiest_period_of_day.”
- 5. Print the frequency for this new column.
- 6. Ensure your notebook is clean and structured and that your code is well commented.
- 7. Export your dataframe as a pickle file (since you added new columns) and store it correctly in your “Prepared Data” folder.


### This script contains the following points:

#### 0. Importing Libraries
#### 1. Loading and Checking the Data
#### 2. If-Statements with User-Defined Functions
#### 3. If-Statements with the loc() Function
#### 4. If-Statements with For-Loops
#### 5. Updating the "busiest_day" column
#### 6. Create a new "busiest_period_of_day" variable
#### 7. Exporting the Dataframe as a Pickle


## 0. Importing Libraries

In [1]:
# Import libraries: pandas, NumPy and os.

import pandas as pd
import numpy as np
import os

## 1. Loading and Checking the Data

Importing data files using the os.path.join() function. The general pattern is:

path = r'/folderpath_to main project folder/'

df = pd.read_csv(os.path.join(path,'folderpath','name.csv'), index_col = False)
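As a small illustration of the pattern above, the sketch below shows what `os.path.join()` actually produces. The folder and file names (`projects`, `instacart`, `Original Data`, `orders.csv`) are hypothetical and only serve the example.

```python
import os

# Hypothetical project root, used only for illustration.
project_root = os.path.join('projects', 'instacart')

# Each argument becomes one path component; the separator that fits
# the operating system is inserted between them.
csv_path = os.path.join(project_root, '02 Data', 'Original Data', 'orders.csv')
print(csv_path)
```

Building paths this way means the same notebook should work on Windows, macOS and Linux without editing slashes by hand.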


In [2]:
# Define the path to the main project folder and store it in the variable 'path'

path = r'/Users/pau/06-05-2024 Instacart Basket Analysis'

#### Importing the “ords_prods_merge.pkl” data set into my Jupyter notebook using the os library as df_ords_prods_merge 

In [3]:
# Import the “ords_prods_merge.pkl” data from the “Prepared Data” folder as df_ords_prods_merge 
df_ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge.pkl'))

#### Checking the dimensions of the imported dataframe and if the data is correctly loaded

In [4]:
# Check that the "ords_prods_merge.pkl" data is correctly loaded

print(df_ords_prods_merge.head()) # to ensure nothing looks off about the imported dataframe.
df_ords_prods_merge.info() # info() prints its summary itself; wrapping it in print() only adds a trailing "None".
print(df_ords_prods_merge.shape) # to confirm the total size of the imported dataframe and get a feel for the data.

   product_id                product_name  aisle_id  department_id  prices  \
0           1  Chocolate Sandwich Cookies        61             19     5.8   
1           1  Chocolate Sandwich Cookies        61             19     5.8   
2           1  Chocolate Sandwich Cookies        61             19     5.8   
3           1  Chocolate Sandwich Cookies        61             19     5.8   
4           1  Chocolate Sandwich Cookies        61             19     5.8   

   order_id  user_id  order_number  orders_day_of_week  order_hour_of_day  \
0   3139998      138            28                   6                 11   
1   1977647      138            30                   6                 17   
2    389851      709             2                   0                 21   
3    652770      764             1                   3                 13   
4   1813452      764             3                   4                 17   

   days_since_prior_order  is_first_order  add_to_cart_order  reorde

## 2. If-Statements with User-Defined Functions

Creating a flag that sorts products in our ords_prods_merge dataframe according to price.
- Products within different ranges will be given different flags, which are stored within a new column.
- I need to write a user-defined function to create and assign these flags.


NOTE:
Applying user-defined functions row by row to a large dataframe can be slow and can lead to memory issues. And our Instacart dataframe is, indeed, large.
To avoid any potential issues, we are going to work with a subset of the dataframe for now: the first one million rows.


In [5]:
# Create a subset of the "df_ords_prods_merge" dataframe that contains only the first one million rows
# The slice [:1000000] has nothing before the colon, so it runs from the first row up to, but not including, position 1000000

df = df_ords_prods_merge[:1000000]

In [6]:
# Check the results of creating the subset
print(df.head())
df.info()
df.shape

   product_id                product_name  aisle_id  department_id  prices  \
0           1  Chocolate Sandwich Cookies        61             19     5.8   
1           1  Chocolate Sandwich Cookies        61             19     5.8   
2           1  Chocolate Sandwich Cookies        61             19     5.8   
3           1  Chocolate Sandwich Cookies        61             19     5.8   
4           1  Chocolate Sandwich Cookies        61             19     5.8   

   order_id  user_id  order_number  orders_day_of_week  order_hour_of_day  \
0   3139998      138            28                   6                 11   
1   1977647      138            30                   6                 17   
2    389851      709             2                   0                 21   
3    652770      764             1                   3                 13   
4   1813452      764             3                   4                 17   

   days_since_prior_order  is_first_order  add_to_cart_order  reorde

(1000000, 15)

### Creating the “price_label” column.

#### User-defined function:
#### Order when working with user-defined functions: define first, apply second.


- DEFINING: As you’re creating this function yourself, you need to start by *defining* it using the `def` syntax at the beginning of the code.
- FUNCTION NAME: Following this is the name you want to give your new function: `price_label`.
- ARGUMENT: In the parentheses is `row`, which is a standard argument telling the function to look at each row within the dataframe.
- COLON: Finally, everything’s finished off with a colon.
    - The colon separates the head, where you provide the name and argument(s) for your function, from the body, which is what the function will actually do.
    
If-statement:

- Here, you’re using if-statements very literally by way of `if-else` constructs.
- As explained earlier, if-statements tell a function to determine “if” something is true and, if so, to perform some operation.
- In the code below, you can see the line: `if row['prices'] <= 5:`.
    - This translates to “if the value in the ‘prices’ column within the given row is less than or equal to 5.”
- Right after this line comes the operation to be performed: `return 'Low-range product'`,
    - which translates to “return the string ‘Low-range product.’”
- The colon at the end of the first line
    - translates to “then,” making the entire statement read: “If the value in the ‘prices’ column within the given row is less than or equal to 5, then return the string ‘Low-range product.’”
- Note how the second line is indented further than the first line. This is crucial! Python uses the indentation to decide which lines belong to the `if` branch.

In [7]:
# Create a user-defined function "price_label"

def price_label(row):
    if row['prices'] <= 5: 
        return 'Low-range product'
    elif (row['prices'] > 5) and (row['prices'] <= 15):
        return 'Mid-range product'
    elif row['prices'] > 15:
        return 'High-range product'
    else:
        return 'Not enough data'

Now that you’ve defined it, you need to use it:
- In the code below, rather than calling the `df` dataframe by itself, the syntax `df['price_range']` has been used.
    - This creates a new column within the `df` dataframe called “price_range” and designates it as the location for your labels.
- On the right side of the equals sign comes the code that runs your new function:
    - `df.apply(price_label, axis=1)`, which tells pandas to `apply` the `price_label` function on `axis=1`.
    - With `axis=1`, the function receives one *row* at a time, so this code essentially applies the function to every row within the dataframe.
    - Conversely, `axis=0` would pass one *column* at a time to the function.
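To see the define-then-apply pattern end to end, here is a minimal sketch on a tiny made-up dataframe. The `toy` frame and its product names are assumptions for illustration; the function body mirrors the `price_label` function from this notebook. The last row has a missing price to show when the final `else` branch is reached.

```python
import numpy as np
import pandas as pd

def price_label(row):
    # Same thresholds as in the notebook: <= 5, (5, 15], > 15.
    if row['prices'] <= 5:
        return 'Low-range product'
    elif (row['prices'] > 5) and (row['prices'] <= 15):
        return 'Mid-range product'
    elif row['prices'] > 15:
        return 'High-range product'
    else:
        # Reached when every comparison is False, e.g. for a missing price (NaN).
        return 'Not enough data'

# Hypothetical toy data, not taken from the Instacart files.
toy = pd.DataFrame({
    'product_name': ['a', 'b', 'c', 'd', 'e', 'f'],
    'prices': [3.0, 5.0, 9.9, 15.0, 20.0, np.nan],
})

# axis=1 hands the function one row at a time.
toy['price_range'] = toy.apply(price_label, axis=1)
print(toy)
```

Boundary values matter here: a price of exactly 5 is labeled low-range and a price of exactly 15 is labeled mid-range.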

In [8]:
# Apply the new "price_label" function to the dataframe and store the result in a new "price_range" column

df['price_range'] = df.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)


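Warning message:
- This is pandas' SettingWithCopyWarning. In this case, it’s informing you that `df` was created as a slice of `df_ords_prods_merge`, so pandas can't be sure whether the new column is being written to the slice or to a temporary copy.
- It does give you a possible alternative to try: the loc() method.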
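One common way to avoid this warning, sketched here on made-up data rather than as the approach this notebook takes, is to make the subset an explicit copy with `.copy()` before adding columns to it. The names `full` and `subset` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the large dataframe.
full = pd.DataFrame({'prices': [2.0, 7.5, 18.0, 4.0]})

# .copy() makes the subset an independent dataframe,
# so adding a column to it is unambiguous.
subset = full[:3].copy()
subset['price_range'] = 'placeholder'

print(subset)
print(full.columns.tolist())
```

The original dataframe is left untouched; only the copied subset gains the new column.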

Running the value_counts() function to check the values in our new column "price_range":


In [9]:
# Checking the "price_range" function application to the "df" dataframe

df['price_range'].value_counts(dropna = False)

price_range
Mid-range product     652638
Low-range product     338018
High-range product      9344
Name: count, dtype: int64

Using the max() function to check the price of the most expensive product within the subset:

In [10]:
# Use the "max()" function to find the highest price in "df"

df['prices'].max()

24.5

- You've created a user-defined function and used it to sort the data into different categories.
- Regarding the warning message we received, which suggested the `loc()` method:
    - Using the `loc()` method is a preferable way of handling these types of operations.

## 3. If-Statements with the loc() Function

 - pandas has suggested using loc() to accomplish the same thing.
 - Using loc()
     - apply the conditional logic of an if-statement to a dataframe without explicitly creating an if-else construct.
- loc() function
    - selects rows and columns of the dataframe it’s called on, by label or by a True/False condition. Strictly speaking, `loc` is an indexer used with square brackets (`df.loc[...]`) rather than a function called with parentheses.
    - a comparison operator (smaller than, larger than, equal to, etc.) is placed inside the brackets to create a condition.

**if** = `df.loc[df['prices'] > 15,`

- the `loc()` function is being called on the `df` dataframe.
- And within the brackets,
    - the values in the “prices” column of the `df` dataframe are being compared to a value, 15, using the `>` operator. In normal language, you could say, “if the values in the ‘prices’ column of the `df` dataframe are greater than 15.”

**then** = `'price_range_loc'] = 'High-range product'`

- After the comma comes the implied “then.”
- Here, a new column called “price_range_loc” is being set equal to the string “High-range product.” This is the same as the label you created in your user-defined function.

- **The comma is key!**
    - **It’s what separates the “if” from the “then.”**

In [11]:
# Use "loc" to assign the "High-range product" label to rows with prices above 15

df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'


 **if** = `df.loc[(df['prices'] <= 15) & (df['prices'] > 5),`
 
 **then** = `'price_range_loc'] = 'Mid-range product'`

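- Two conditions are combined by the **&** sign in the middle, which means both must be true for a row to be selected.
- Additionally, the two conditions have been placed inside parentheses. This is required because `&` binds more tightly than `<=` and `>` in Python.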
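The combined condition can be tried out on a few boundary prices. This is a minimal sketch on a made-up `toy` dataframe; the intermediate variable `mid` is introduced only for readability and is not part of the notebook.

```python
import pandas as pd

# Hypothetical prices chosen to sit on and around the boundaries 5 and 15.
toy = pd.DataFrame({'prices': [3.0, 5.0, 5.1, 15.0, 15.1]})

# Each comparison is wrapped in parentheses before being combined with &.
mid = (toy['prices'] > 5) & (toy['prices'] <= 15)

# Rows where the mask is True get the label; the rest stay NaN for now.
toy.loc[mid, 'price_range_loc'] = 'Mid-range product'
print(toy)
```

Rows that match no condition yet are left as NaN, which is why the notebook assigns all three labels before checking the counts with `value_counts(dropna = False)`.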

In [12]:

df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product' 

In [13]:
df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [14]:
# Use the "value_counts" function to check the new "price_range_loc" column

df['price_range_loc'].value_counts(dropna = False)

price_range_loc
Mid-range product     652638
Low-range product     338018
High-range product      9344
Name: count, dtype: int64

#### Repeating the process on our entire dataframe: using df_ords_prods_merge instead of the df subset

In [15]:
# Apply the "loc()" function above to the entire "df_ords_prods_merge" dataframe

df_ords_prods_merge.loc[df_ords_prods_merge['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [16]:
df_ords_prods_merge.loc[(df_ords_prods_merge['prices'] <= 15) & (df_ords_prods_merge['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [17]:
df_ords_prods_merge.loc[df_ords_prods_merge['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [18]:
# Use the "value_counts" function to check the results

df_ords_prods_merge['price_range_loc'].value_counts(dropna = False)

price_range_loc
Mid-range product     21860860
Low-range product     10126321
High-range product      417678
Name: count, dtype: int64

- Thanks to `loc()`, we can now label the entire dataframe rather than just a subset.
- If we’d tried to do the same thing with our user-defined function, it would have been much slower and we might have received a memory error, but not so with `loc()`.
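As an aside, pandas also offers `pd.cut()` for this kind of binning. The sketch below is an alternative technique, not the method used in this notebook; the column name `price_range_cut` and the `toy` data are assumptions for illustration. Bins are right-closed by default, which matches the `<= 5` and `<= 15` boundaries used above.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, including the boundary values 5 and 15.
toy = pd.DataFrame({'prices': [3.0, 5.0, 9.9, 15.0, 20.0]})

# Intervals: (-inf, 5], (5, 15], (15, inf)
bins = [-np.inf, 5, 15, np.inf]
labels = ['Low-range product', 'Mid-range product', 'High-range product']

toy['price_range_cut'] = pd.cut(toy['prices'], bins=bins, labels=labels)
print(toy)
```

The result is a categorical column rather than plain strings, which tends to use less memory on large dataframes.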

## 4. If-Statements with For-Loops

- **For-loops**
    - are loops for running the same block of code multiple times.
    - They’re used to perform the same function on multiple elements, for instance, by running through an entire dataframe and performing a function on each row within that dataframe.

### Creating the “busiest_day” column.

We’ll create a new column in our `ords_prods_merge` dataframe that summarizes how busy each day of the week is.
   - This is valuable information for stakeholders as it gives them insight into what products are being bought on the busiest and slowest days.
   - They could use this information to tailor ads on specific days.

- To start, we need to know on which day most orders take place.
    - We can find this out by printing the frequency of the “orders_day_of_week” column.
    - Printing the frequency of a column will quickly inform us which values appear most often within that column.

In [19]:
# Check the frequency of the “orders_day_of_week” column

df_ords_prods_merge['orders_day_of_week'].value_counts(dropna = False)

orders_day_of_week
0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: count, dtype: int64

#### orders_day_of_week:
  - value 0 means Saturday. -> the busiest day
  - value 1 means Sunday. 
  - value 6 means Friday. 
  - value 2 means Monday. 
  - value 4 has the fewest rows. -> the least busy day

- We use this information to create a new column, “busiest_day,” that will contain one of three different values: “Busiest day,” “Least busy,” and “Regularly busy.” 
- We use a for-loop to create this new column.
 

In [20]:
# Use a "for-loop" to classify how busy each day of the week is as either "Busiest day", "Least busy", or "Regularly busy"

result = []
for value in df_ords_prods_merge["orders_day_of_week"]:
    if value == 0:
        result.append("Busiest day")
    elif value == 4:
        result.append("Least busy")
    else:
        result.append("Regularly busy")

In [21]:
# Check the results

result

['Regularly busy',
 'Regularly busy',
 'Busiest day',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Busiest day',
 'Busiest day',
 'Busiest day',
 'Busiest day',
 'Busiest day',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Busiest day',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Busiest day',
 'Regularly busy',
 'Busiest day',
 'Least busy',
 'Busiest day',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Bus

- The `result` list contains an entry for every row within the dataframe.
- Only by combining it with our ords_prods_merge dataframe can it be used effectively. 
- To do so, we are going to create a new column within our ords_prods_merge dataframe and set it equal to result:

In [22]:
# Create a new column in the "df_ords_prods_merge" dataframe from the "result" list

df_ords_prods_merge['busiest_day'] = result

- By adding the values in result to a new column in our dataframe, we can use our new labels more effectively, for instance, by showing at a glance how many sales occur on each different type of day.

- Print the frequency of this new column and cross-check it with the frequency of “orders_day_of_week” printed before. 

In [23]:
# Use the "value_counts" function to check the new "busiest_day" column

df_ords_prods_merge['busiest_day'].value_counts(dropna = False)

busiest_day
Regularly busy    22416875
Busiest day        6204182
Least busy         3783802
Name: count, dtype: int64
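The for-loop above builds the labels one row at a time. For comparison, here is a minimal sketch of a vectorized alternative using `Series.map()` with a dictionary. This is a different technique from the one used in this notebook, and the `toy` dataframe and `day_labels` dictionary are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample of day-of-week values.
toy = pd.DataFrame({'orders_day_of_week': [6, 0, 3, 4, 1, 0]})

# Only the special days are listed; every other day falls through to the default.
day_labels = {0: 'Busiest day', 4: 'Least busy'}

# map() returns NaN for values missing from the dictionary,
# and fillna() turns those into the default label.
toy['busiest_day'] = toy['orders_day_of_week'].map(day_labels).fillna('Regularly busy')
print(toy)
```

This keeps the day-to-label mapping in one place, which makes it easy to adjust if the busiest or slowest day changes.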

# _________________________________________________________


 # Task 4.7. 
 

## 5. Updating the "busiest_day" column

### Updating the “Busiest day” label to “Busiest days” (plural).
- This label should correspond with the two busiest days of the week as opposed to the single busiest day. 
- At the same time, label the two slowest days in the same new column.
- Check the values of this new column for accuracy.

In [24]:
# Use a "for-loop" to create a new column for "busiest_days" where the "Busiest days" = the two busiest days of the week

# The two slowest days of the week = "Slowest days"

busiest_days = []
for value in df_ords_prods_merge["orders_day_of_week"]:
    if value == 0 or value == 1:
        busiest_days.append("Busiest days")
    elif value == 3 or value == 4:
        busiest_days.append("Slowest days")
    else:
        busiest_days.append("Regular days")

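#### Another option:
We could also have used the loc() function to create the labels, using the same label text as the for-loop above:

- Create the 'Busiest days' label:

df_ords_prods_merge.loc[df_ords_prods_merge['orders_day_of_week'].isin([0, 1]), 'busiest_days'] = 'Busiest days'

- Create the 'Slowest days' label:

df_ords_prods_merge.loc[df_ords_prods_merge['orders_day_of_week'].isin([3, 4]), 'busiest_days'] = 'Slowest days'

- Create the 'Regular days' label for all remaining days (otherwise those rows would be left as NaN):

df_ords_prods_merge.loc[df_ords_prods_merge['orders_day_of_week'].isin([2, 5, 6]), 'busiest_days'] = 'Regular days'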

In [25]:
# Check the results

busiest_days

['Regular days',
 'Regular days',
 'Busiest days',
 'Slowest days',
 'Slowest days',
 'Busiest days',
 'Regular days',
 'Slowest days',
 'Busiest days',
 'Busiest days',
 'Regular days',
 'Slowest days',
 'Slowest days',
 'Regular days',
 'Slowest days',
 'Regular days',
 'Regular days',
 'Regular days',
 'Busiest days',
 'Busiest days',
 'Regular days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Slowest days',
 'Regular days',
 'Busiest days',
 'Busiest days',
 'Regular days',
 'Regular days',
 'Slowest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Slowest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Slowest days',
 'Regular days',
 'Regular days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Slowest days',
 'Regular days',
 'Busiest days',
 'Regular days',
 'Busiest days',
 'Busiest days',
 'Regular days',
 'Busiest days

In [26]:
# Create a new column in "df_ords_prods_merge" for the "busiest_days" results

df_ords_prods_merge['busiest_days'] = busiest_days

In [27]:
# Use the "value_counts" function to check the new "busiest_days" column

df_ords_prods_merge['busiest_days'].value_counts(dropna = False)

busiest_days
Regular days    12916111
Busiest days    11864412
Slowest days     7624336
Name: count, dtype: int64

In [28]:
# Compare the new column with the original "orders_day_of_week" column

df_ords_prods_merge['orders_day_of_week'].value_counts(dropna = False)

orders_day_of_week
0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: count, dtype: int64

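#### The counts all add up correctly
- Busiest days: 6,204,182 (day 0) + 5,660,230 (day 1) = 11,864,412
- Slowest days: 3,840,534 (day 3) + 3,783,802 (day 4) = 7,624,336
- Regular days: 4,496,490 (day 6) + 4,213,830 (day 2) + 4,205,791 (day 5) = 12,916,111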
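Besides comparing the two frequency tables by eye, the check can be made in a single table with `pd.crosstab()`. The sketch below uses a small made-up dataframe; `toy`, `check` and `labels_per_day` are hypothetical names, and the labels are assigned with `loc` only to keep the example short.

```python
import pandas as pd

# Hypothetical sample that contains every day of the week at least once.
toy = pd.DataFrame({'orders_day_of_week': [0, 1, 2, 3, 4, 5, 6, 0, 3]})

toy['busiest_days'] = 'Regular days'
toy.loc[toy['orders_day_of_week'].isin([0, 1]), 'busiest_days'] = 'Busiest days'
toy.loc[toy['orders_day_of_week'].isin([3, 4]), 'busiest_days'] = 'Slowest days'

# Rows are day numbers, columns are labels, cells are row counts.
check = pd.crosstab(toy['orders_day_of_week'], toy['busiest_days'])
print(check)

# Each day should have rows under exactly one label.
labels_per_day = (check > 0).sum(axis=1)
print(labels_per_day)
```

If any day showed counts under more than one label, or under an unexpected label, the labeling logic would need another look.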

## 6. Create a new "busiest_period_of_day" variable

The new variable needs to identify the busiest hours of the day classified into "Most orders", "Average orders", and "Fewest orders"

In [29]:
# Start by checking the frequency of the "order_hour_of_day" column

df_ords_prods_merge['order_hour_of_day'].value_counts()

order_hour_of_day
10    2761760
11    2736140
14    2689136
15    2662144
13    2660954
12    2618532
16    2535202
9     2454203
17    2087654
8     1718118
18    1636502
19    1258305
20     976156
7      891054
21     795637
22     634225
23     402316
6      290493
0      218769
1      115700
5       87961
2       69375
4       53242
3       51281
Name: count, dtype: int64

In [30]:
# The top 8 hours = "Most orders", the middle 8 hours = "Average orders", the bottom 8 hours = "Fewest orders"

# Use a "for-loop" to classify the hours of the day accordingly

hours = []
for value in df_ords_prods_merge["order_hour_of_day"]:
    if value in [10, 11, 14, 15, 13, 12, 16, 9]:
        hours.append("Most orders")
    elif value in [23, 6, 0, 1, 5, 2, 4, 3]:
        hours.append("Fewest orders")
    else:
        hours.append("Average orders")

In [31]:
# Check the results

hours

['Most orders',
 'Average orders',
 'Average orders',
 'Most orders',
 'Average orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Average orders',
 'Fewest orders',
 'Average orders',
 'Fewest orders',
 'Fewest orders',
 'Fewest orders',
 'Fewest orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Average orders',
 'Average orders',
 'Average orders',
 'Average orders',
 'Average orders',
 'Average orders',
 'Fewest orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Average orders',
 'Most orders',
 'Most ord

In [32]:
# Create a new column in "df_ords_prods_merge" for the "busiest_period_of_day" variable

df_ords_prods_merge['busiest_period_of_day'] = hours

In [33]:
# Use the "value_counts" function to check the new "busiest_period_of_day" column

df_ords_prods_merge['busiest_period_of_day'].value_counts(dropna = False)

busiest_period_of_day
Most orders       21118071
Average orders     9997651
Fewest orders      1289137
Name: count, dtype: int64

#### Another option:
#### We could also have used the loc() function to create the labels

- Group the top 8 hours as 'Most orders', the next 8 hours as 'Average orders', and the last 8 hours as 'Fewest orders'.
- Use the loc() function, as it takes less time to execute than the for-loop.

- Create the 'Most orders' label:

df_ords_prods_merge.loc[df_ords_prods_merge['order_hour_of_day'].isin([10, 11, 14, 15, 13, 12, 16, 9]), 'busiest_period_of_day'] = 'Most orders'

- Create the 'Average orders' label:

df_ords_prods_merge.loc[df_ords_prods_merge['order_hour_of_day'].isin([17, 8, 18, 19, 20, 7, 21, 22]), 'busiest_period_of_day'] = 'Average orders'

- Create the 'Fewest orders' label:

df_ords_prods_merge.loc[df_ords_prods_merge['order_hour_of_day'].isin([23, 6, 0, 1, 5, 2, 4, 3]), 'busiest_period_of_day'] = 'Fewest orders'
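The hour lists above were typed in by hand from the frequency table. As a possible refinement, they could be derived from `value_counts()`, which sorts from most to least frequent by default. The sketch below uses a tiny made-up dataframe with six distinct hours and a group size of 2; `toy`, `counts`, `group_size`, `most` and `fewest` are hypothetical names, and the real data would use a group size of 8.

```python
import pandas as pd

# Hypothetical data: hour 10 appears 6 times, 11 five times, and so on.
toy = pd.DataFrame({
    'order_hour_of_day': [10] * 6 + [11] * 5 + [9] * 4 + [20] * 3 + [7] * 2 + [3] * 1
})

# value_counts() returns the hours ordered from most to least frequent.
counts = toy['order_hour_of_day'].value_counts()

group_size = 2  # assumption for this toy example; 8 for the 24 hours in the notebook
most = counts.index[:group_size]
fewest = counts.index[-group_size:]

# Start with the middle label, then overwrite the top and bottom groups.
toy['busiest_period_of_day'] = 'Average orders'
toy.loc[toy['order_hour_of_day'].isin(most), 'busiest_period_of_day'] = 'Most orders'
toy.loc[toy['order_hour_of_day'].isin(fewest), 'busiest_period_of_day'] = 'Fewest orders'

print(toy['busiest_period_of_day'].value_counts(dropna=False))
```

Deriving the groups this way avoids transcription mistakes, although hours with tied counts would need a deliberate tie-breaking rule.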


## 7. Exporting the Dataframe as a Pickle

In [34]:
# Perform a final check of the dataframe before exporting

print(df_ords_prods_merge.head())
df_ords_prods_merge.info() # info() prints its summary itself
print(df_ords_prods_merge.shape)

   product_id                product_name  aisle_id  department_id  prices  \
0           1  Chocolate Sandwich Cookies        61             19     5.8   
1           1  Chocolate Sandwich Cookies        61             19     5.8   
2           1  Chocolate Sandwich Cookies        61             19     5.8   
3           1  Chocolate Sandwich Cookies        61             19     5.8   
4           1  Chocolate Sandwich Cookies        61             19     5.8   

   order_id  user_id  order_number  orders_day_of_week  order_hour_of_day  \
0   3139998      138            28                   6                 11   
1   1977647      138            30                   6                 17   
2    389851      709             2                   0                 21   
3    652770      764             1                   3                 13   
4   1813452      764             3                   4                 17   

   days_since_prior_order  is_first_order  add_to_cart_order  reorde

In [35]:
# Export the "df_ords_prods_merge" dataframe with the new variables as "ords_prods_merge_new_var.pkl"

df_ords_prods_merge.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge_new_var.pkl'))
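After exporting, it can be reassuring to read the file back and confirm that it matches what was saved. The sketch below does this for a tiny made-up dataframe in a temporary folder; the file name `toy_new_var.pkl` and the variable names are assumptions for illustration, not part of the project structure.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row dataframe standing in for the full export.
toy = pd.DataFrame({
    'order_hour_of_day': [10, 3],
    'busiest_period_of_day': ['Most orders', 'Fewest orders'],
})

# Write to a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_new_var.pkl')
    toy.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# equals() compares values, column names and dtypes.
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```

For the real export, the same idea applies by reading `ords_prods_merge_new_var.pkl` back and comparing its shape and columns with the dataframe in memory.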