# Deriving new variables

### Content List

#### 01. Importing data
#### 02. Creating a subset of the dataframe
#### 03. User-defined function
##### 03.1 Creating a new variable with a user defined function for price ranges
#### 04. Loc function
##### 04.1 Creating the same variable with a loc function 
#### 05. Loc function, now applicable to the whole dataframe
#### 06. For Loop statement (days of the week)
##### 06.1 New classification for days of the week
##### 06.2 New classification for days of the week, grouping days
#### 07. For loop statement (hours of the day)
#### 08. Export dataframe to pickle

In [1]:
#import libraries
import pandas as pd
import numpy as np
import os

### 01. Importing data

In [2]:
path= r'C:\Users\isobr\OneDrive\02122022Instacart Basket Analysis'

In [3]:
path

'C:\\Users\\isobr\\OneDrive\\02122022Instacart Basket Analysis'

In [4]:
df_ords_prods_merged = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared data', 'orders_products_merged.pkl'))

In [5]:
df_ords_prods_merged.head()

Unnamed: 0,Unnamed: 0_x,product_id,product_name,aisle_id,department_id,prices,Unnamed: 0.1,Unnamed: 0_y,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,add_to_cart_order,reordered,_merge
0,0,1,Chocolate Sandwich Cookies,61,19,5.8,1987,1987,3139998,138,28,6,11,3.0,5,0,both
1,0,1,Chocolate Sandwich Cookies,61,19,5.8,1989,1989,1977647,138,30,6,17,20.0,1,1,both
2,0,1,Chocolate Sandwich Cookies,61,19,5.8,11433,11433,389851,709,2,0,21,6.0,20,0,both
3,0,1,Chocolate Sandwich Cookies,61,19,5.8,12198,12198,652770,764,1,3,13,7.0,10,0,both
4,0,1,Chocolate Sandwich Cookies,61,19,5.8,12200,12200,1813452,764,3,4,17,9.0,11,1,both


### 02. Creating a subset of the dataframe

In [6]:
#creating a subset df using only the first million rows
df = df_ords_prods_merged[:1000000]

In [7]:
df.shape

(1000000, 17)

### 03. User-defined function

##### 03.1 Creating a new variable with a user defined function for price ranges

In [11]:
#defining a new user-defined function - price_label - using an if-elif-else statement
#note: prices strictly between 4 and 5 match none of the first three conditions,
#so they fall through to 'Not enough data'
def price_label(row):

  if row['prices'] <= 4:
    return 'Low-range product'
  elif (row['prices'] >= 5) and (row['prices'] <= 15):
    return 'Mid-range product'
  elif row['prices'] > 15:
    return 'High range'
  else:
    return 'Not enough data'

In [12]:
#applying the function while creating a new column 'price_range'
df['price_range'] = df.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)
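
The warning above appears because `df` is a slice of `df_ords_prods_merged`, so pandas cannot tell whether the new column is meant to affect the original as well. A minimal sketch of one common way to avoid it, assuming an explicit `.copy()` of the subset is acceptable; the dataframe `df_full` and the column `price_flag` are hypothetical stand-ins, not part of the analysis above:

```python
import pandas as pd

# Hypothetical stand-in for df_ords_prods_merged
df_full = pd.DataFrame({'prices': [3.0, 5.8, 12.5, 20.0]})

# .copy() makes the subset an independent dataframe instead of a slice of df_full
df_sub = df_full[:3].copy()

# Adding a column now changes only df_sub and does not trigger the warning
df_sub['price_flag'] = df_sub['prices'] > 5

print(df_sub)
print('price_flag' in df_full.columns)
```

The trade-off is memory: the copy duplicates the selected rows, which is usually fine for a one-million-row test subset.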


In [13]:
# Checking the value counts of price range
df.value_counts('price_range')

price_range
Mid-range product    655329
Low-range product    259885
Not enough data       75442
High range             9344
dtype: int64

In [14]:
#checking the maximum price value
df['prices'].max()

24.5
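
The 'Not enough data' rows in the value counts above are prices that fall strictly between 4 and 5, which no branch of `price_label` covers. A possible gap-free variant is sketched below, assuming cut-offs of 5 and 15 (the ones used in the `.loc` section); `price_label_fixed` and the `sample` dataframe are hypothetical names used only for illustration:

```python
import pandas as pd

def price_label_fixed(row):
    """Hypothetical variant of price_label with no gap between thresholds."""
    price = row['prices']
    if pd.isna(price):
        # Only genuinely missing prices are reported as missing
        return 'Not enough data'
    if price <= 5:
        return 'Low-range product'
    if price <= 15:
        return 'Mid-range product'
    return 'High-range product'

# Small made-up sample, including a price inside the former gap (4.5) and a missing value
sample = pd.DataFrame({'prices': [3.0, 4.5, 5.0, 5.8, 15.0, 24.5, None]})
sample['price_range'] = sample.apply(price_label_fixed, axis=1)

print(sample)
```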

### 04. Loc function

##### 04.1 Creating the same variable with a loc function

In [16]:
#creating conditions - condition 1:
df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [13]:
#condition 2:
df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product' 

In [14]:
#condition 3:
df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [15]:
#counting values to check the number of values within each condition
df.value_counts('price_range_loc')

price_range_loc
Mid-range product     652638
Low-range product     338018
High-range product      9344
dtype: int64
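
The three `.loc` assignments above can also be written as a single step. The sketch below uses `numpy.select`, which is a swapped-in technique rather than the approach taken in this notebook; the `sample` dataframe is made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up prices covering each boundary value
sample = pd.DataFrame({'prices': [3.0, 5.0, 5.8, 15.0, 24.5]})

# Same three conditions as the .loc statements, in the same order
conditions = [
    sample['prices'] > 15,
    (sample['prices'] > 5) & (sample['prices'] <= 15),
    sample['prices'] <= 5,
]
labels = ['High-range product', 'Mid-range product', 'Low-range product']

# Rows matching no condition (for example missing prices) get the default label
sample['price_range_loc'] = np.select(conditions, labels, default='Not enough data')

print(sample)
```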

### 05. Loc function, now applicable to the whole dataframe

In [17]:
#creating conditions - condition 1:
df_ords_prods_merged.loc[df_ords_prods_merged['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [18]:
#condition 2:
df_ords_prods_merged.loc[(df_ords_prods_merged['prices'] <= 15) & (df_ords_prods_merged['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [19]:
#condition 3:
df_ords_prods_merged.loc[df_ords_prods_merged['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [20]:
df_ords_prods_merged.value_counts('price_range_loc')

price_range_loc
Mid-range product     21860860
Low-range product     10126321
High-range product      417678
dtype: int64

In [21]:
df_ords_prods_merged['prices'].max()

99999.0

### 06. For Loop statement (days of the week)

##### 06.1 New classification for days of the week

In [22]:
#checking which are the busiest days of the week 
df_ords_prods_merged['orders_day_of_week'].value_counts(dropna=False)

0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: orders_day_of_week, dtype: int64

The results show that the busiest day is Saturday (0), while Wednesday (4) has the lowest frequency.

Sunday (1) is the second busiest day, and Tuesday (3) is the second least busy day, after Wednesday.
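
The ranking described above can also be read off programmatically rather than by eye. A small sketch, assuming the counts are copied by hand from the `value_counts` output; `day_counts`, `busiest`, `least_busy`, and `ranked` are hypothetical names:

```python
import pandas as pd

# Counts copied from the value_counts output above (index = orders_day_of_week)
day_counts = pd.Series({
    0: 6204182, 1: 5660230, 6: 4496490, 2: 4213830,
    5: 4205791, 3: 3840534, 4: 3783802,
})

busiest = day_counts.idxmax()      # day code with the most rows
least_busy = day_counts.idxmin()   # day code with the fewest rows

# Day codes ordered from busiest to least busy
ranked = day_counts.sort_values(ascending=False).index.tolist()

print(busiest, least_busy, ranked)
```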

In [23]:
#for-loop statement to create a new column with a classification of days of week from busiest to least busy
#first create an empty list to hold the results of the loop that follows
result = []

for value in df_ords_prods_merged['orders_day_of_week']:
  if value == 0:
    result.append('Busiest day')
  elif value == 4:
    result.append('Least busy')
  else:
    result.append('Regularly busy')

In [24]:
# checking the result
result

['Regularly busy',
 'Regularly busy',
 'Busiest day',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Busiest day',
 'Busiest day',
 'Busiest day',
 'Busiest day',
 'Busiest day',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Busiest day',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Busiest day',
 'Regularly busy',
 'Busiest day',
 'Least busy',
 'Busiest day',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 ...]

In [25]:
# create a new column in the df_ords_prods_merged dataframe with the result
df_ords_prods_merged['busiest_day']=result

In [26]:
#checking frequency
df_ords_prods_merged['busiest_day'].value_counts(dropna=False)

Regularly busy    22416875
Busiest day        6204182
Least busy         3783802
Name: busiest_day, dtype: int64
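
Looping over more than 32 million rows in Python is slow. One possible alternative, sketched here on a made-up series, is to map the day codes to labels with `Series.map` and fill everything else with the default label. This is an illustration under that assumption, not the method used in this notebook:

```python
import pandas as pd

# Made-up day codes standing in for df_ords_prods_merged['orders_day_of_week']
days = pd.Series([6, 6, 0, 3, 4, 1])

# Original approach: a for-loop that appends to a list
result = []
for value in days:
    if value == 0:
        result.append('Busiest day')
    elif value == 4:
        result.append('Least busy')
    else:
        result.append('Regularly busy')

# Alternative: map the two special codes, then fill the rest
labels = days.map({0: 'Busiest day', 4: 'Least busy'}).fillna('Regularly busy')

# Both approaches give the same labels
print(labels.tolist() == result)
```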

##### 06.2 New classification for days of the week, grouping days

In [15]:
#Creating a for-loop statement
result = []

for value in df_ords_prods_merged['orders_day_of_week']:
  if value <= 1:
    result.append('Busiest days')
  elif value >= 3 and value <= 4:
    result.append('Least busy')
  else:
    result.append('Regularly busy')

In [16]:
#checking the result
result

['Regularly busy',
 'Regularly busy',
 'Busiest days',
 'Least busy',
 'Least busy',
 'Busiest days',
 'Regularly busy',
 'Least busy',
 'Busiest days',
 'Busiest days',
 'Regularly busy',
 'Least busy',
 'Least busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Busiest days',
 'Busiest days',
 'Regularly busy',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Least busy',
 'Regularly busy',
 'Busiest days',
 'Busiest days',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Least busy',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Busiest days',
 'Busiest days',
 'Busiest days',
 'Least busy',
 'Regularly busy',
 'Busiest days',
 'Regularly busy',
 'Busiest days',
 'Busiest days',
 'Regularly busy',
 ...]

In [17]:
#creating a new column which equals to result
df_ords_prods_merged['busier_days']=result

In [18]:
df_ords_prods_merged['busier_days'].value_counts(dropna=False)

Regularly busy    12916111
Busiest days      11864412
Least busy         7624336
Name: busier_days, dtype: int64

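The 'Least busy' group (days 3 and 4) holds far fewer rows (7,624,336) than the 'Busiest days' group (days 0 and 1; 11,864,412), although each group covers two days. 'Regularly busy' is the largest group mainly because it spans three days (2, 5, and 6).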

In [19]:
#checking some statistics on the busier days column
df_ords_prods_merged['busier_days'].describe()

count           32404859
unique                 3
top       Regularly busy
freq            12916111
Name: busier_days, dtype: object
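
The grouped classification can also be expressed with `.loc` and `isin`, in the style of section 04. The sketch below runs on a made-up dataframe with one row per day code and is only an illustration of that alternative:

```python
import pandas as pd

# One made-up row per day code
sample = pd.DataFrame({'orders_day_of_week': [0, 1, 2, 3, 4, 5, 6]})

# Start with the default label, then overwrite the two special groups
sample['busier_days'] = 'Regularly busy'
sample.loc[sample['orders_day_of_week'].isin([0, 1]), 'busier_days'] = 'Busiest days'
sample.loc[sample['orders_day_of_week'].isin([3, 4]), 'busier_days'] = 'Least busy'

print(sample)
```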

### 07. For loop statement (hours of the day)

In [36]:
# Checking the most and least busy hours of the day
df_ords_prods_merged['order_hour_of_day'].value_counts(dropna=False)

10    2761760
11    2736140
14    2689136
15    2662144
13    2660954
12    2618532
16    2535202
9     2454203
17    2087654
8     1718118
18    1636502
19    1258305
20     976156
7      891054
21     795637
22     634225
23     402316
6      290493
0      218769
1      115700
5       87961
2       69375
4       53242
3       51281
Name: order_hour_of_day, dtype: int64

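The busiest hours are between 9 and 16. Hours 17 and 18, as well as 8 and 19, are moderately busy, and the remaining hours are the least busy. The classification below uses slightly different cut-offs: hours 9 to 17 are labelled 'Most orders', hours 0 to 6 'Fewest orders', and all other hours 'Average orders'.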

In [38]:
#Creating a for-loop statement
result = []

for value in df_ords_prods_merged['order_hour_of_day']:
  if value >= 0 and value <= 6:
    result.append('Fewest orders')
  elif value >= 9 and value <= 17:
    result.append('Most orders')
  else:
    result.append('Average orders')

In [39]:
result

['Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Average orders',
 'Fewest orders',
 'Average orders',
 'Fewest orders',
 'Fewest orders',
 'Average orders',
 'Fewest orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Average orders',
 'Average orders',
 'Average orders',
 'Average orders',
 'Average orders',
 'Average orders',
 'Fewest orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Average orders',
 'Most orders',
 'Most orders',
 ...]

In [40]:
#creating a new column which equals to result
df_ords_prods_merged['busiest_period_of_day']=result

In [41]:
df_ords_prods_merged['busiest_period_of_day'].value_counts(dropna=False)

Most orders       23205725
Average orders     8312313
Fewest orders       886821
Name: busiest_period_of_day, dtype: int64

In [42]:
df_ords_prods_merged['busiest_period_of_day'].describe()

count        32404859
unique              3
top       Most orders
freq         23205725
Name: busiest_period_of_day, dtype: object
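
Because `order_hour_of_day` has only 24 distinct values, the labels can be computed once per hour and then looked up for every row. A minimal sketch of that idea, using the same cut-offs as the loop above; `period_label`, `hour_to_label`, and the `hours` series are hypothetical names and data:

```python
import pandas as pd

def period_label(hour):
    """Same cut-offs as the for-loop: 0-6 fewest, 9-17 most, otherwise average."""
    if 0 <= hour <= 6:
        return 'Fewest orders'
    elif 9 <= hour <= 17:
        return 'Most orders'
    else:
        return 'Average orders'

# Build the lookup once for the 24 possible hours
hour_to_label = {hour: period_label(hour) for hour in range(24)}

# Made-up hours standing in for df_ords_prods_merged['order_hour_of_day']
hours = pd.Series([10, 7, 3, 17, 18, 0, 23, 9, 6, 8])
labels = hours.map(hour_to_label)

print(labels.tolist())
```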

### 08. Export dataframe to pickle

In [43]:
#Exporting data to new pickle
df_ords_prods_merged.to_pickle(os.path.join(path, '02 Data','Prepared data', 'orders_prods_new.pkl'))
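
After exporting, it can be worth confirming that the pickle reads back unchanged. The sketch below does this on a small made-up dataframe written to a temporary folder; the file name `orders_prods_sample.pkl` and the variable names are hypothetical, and the same check could be applied to the real export path:

```python
import os
import tempfile

import pandas as pd

# Made-up dataframe standing in for df_ords_prods_merged
sample = pd.DataFrame({
    'order_id': [3139998, 1977647, 389851],
    'busiest_day': ['Regularly busy', 'Regularly busy', 'Busiest day'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'orders_prods_sample.pkl')
    sample.to_pickle(out_path)            # export
    reloaded = pd.read_pickle(out_path)   # re-import

# True when values, dtypes, and shape all match
round_trip_ok = sample.equals(reloaded)
print(round_trip_ok, reloaded.shape)
```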