# 4.4 DATA WRANGLING & SUBSETTING
****

**SCRIPT CONTENTS:**

1. Importing Libraries & Files
2. Data Wrangling Procedures
3. Data Dictionary
4. Data Subsetting Procedures
5. Exporting Dataframes: **orders_wrangled.csv & departments_wrangled.csv**

#### 1. IMPORTING LIBRARIES & FILES
** **

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [2]:
# Document File Location
path = r'C:\Users\G\12-2022 Instacart Basket Analysis'

In [3]:
# Import Orders file
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'orders.csv'), index_col = False)

In [4]:
# Import Products file
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)

In [5]:
# Import Departments file
df_dep = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'departments.csv'), index_col = False)

#### 2. DATA WRANGLING PROCEDURES
** **

**Q2. Find another identifier variable in the df_ords dataframe that doesn’t need to be included in your analysis as a numeric variable and change it to a suitable format.**

In [6]:
# Verifying data types
df_ords.dtypes

order_id                    int64
user_id                     int64
eval_set                   object
order_number                int64
order_dow                   int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object

_Note: **eval_set** is the only non-numeric column; **order_id** and **user_id** are identifiers stored as integers._

In [7]:
# Update order_id data type
df_ords['order_id'] = df_ords['order_id'].astype('str')

In [8]:
# Verify order_id data type
df_ords['order_id'].dtype

dtype('O')
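
The effect of casting an identifier to a string can be checked on a small example. The sketch below uses a made-up dataframe (`toy_ords` and its values are illustrative, not taken from orders.csv) and shows that the identifier no longer appears in the numeric summary once it is stored as text.

```python
import pandas as pd

# Toy stand-in for the orders data (values are made up for illustration)
toy_ords = pd.DataFrame({
    'order_id': [101, 102, 103],
    'order_number': [1, 2, 3],
})

# Cast the identifier to string so it is no longer treated as a quantity
toy_ords['order_id'] = toy_ords['order_id'].astype('str')

# describe() summarises numeric columns by default, so order_id drops out
summary_cols = list(toy_ords.describe().columns)
print(summary_cols)
```

Only `order_number` remains in the summary, which is the behaviour wanted for an identifier column.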

In [9]:
# Preview Orders without the eval_set Column (result is not saved)
df_ords.drop(columns = ['eval_set'])

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0


In [10]:
# New Orders dataframe w/o eval_set Column
df_ords2 = df_ords.drop(columns = ['eval_set'])

In [11]:
# Check New Orders DataFrame Output
df_ords2.head()

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
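
Because `drop()` returns a new dataframe by default, the original stays intact unless the result is assigned back to it. A minimal sketch on made-up rows (the `toy_` names are hypothetical):

```python
import pandas as pd

# Made-up rows standing in for the orders data
toy_ords = pd.DataFrame({
    'order_id': ['101', '102'],
    'eval_set': ['prior', 'prior'],
    'order_number': [1, 2],
})

# drop() returns a new dataframe; toy_ords itself is left untouched
toy_ords2 = toy_ords.drop(columns=['eval_set'])

print(list(toy_ords.columns))   # the original still contains eval_set
print(list(toy_ords2.columns))  # the new dataframe does not
```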


** **
**Q3. Look for a variable in your df_ords dataframe with an unintuitive name and change its name without overwriting the data frame.**

In [12]:
# Identifying Column Names
df_ords.columns

Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='object')

In [13]:
# Rename Orders Column user_id
df_ords2.rename(columns = {'user_id' : 'customer_id'}, inplace = True)

In [14]:
# Check New Orders DataFrame Column Name Update
df_ords2.head()

Unnamed: 0,order_id,customer_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


In [15]:
# Rename Orders Column order_dow
df_ords2.rename(columns = {'order_dow' : 'order_day_of_week'}, inplace = True)

In [16]:
# Check order_dow Column Name Update
df_ords2.head()

Unnamed: 0,order_id,customer_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
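
As an alternative to two `inplace = True` calls, both renames could be done in a single `rename()` call that returns a new dataframe. This is only a sketch on made-up data; `toy_ords` and `toy_renamed` are hypothetical names.

```python
import pandas as pd

# Made-up rows using the original column names
toy_ords = pd.DataFrame({
    'user_id': [1, 1],
    'order_dow': [2, 3],
})

# One rename() call can map several columns; without inplace=True
# it returns a renamed copy and leaves toy_ords unchanged
toy_renamed = toy_ords.rename(columns={
    'user_id': 'customer_id',
    'order_dow': 'order_day_of_week',
})

print(list(toy_ords.columns))
print(list(toy_renamed.columns))
```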


** **
**Q4. Your client wants to know what the busiest hour is for placing orders. Find the frequency of the corresponding variable and share your findings.**

In [17]:
# View Statistical Description of order_hour_of_day
df_ords2['order_hour_of_day'].describe()

count    3.421083e+06
mean     1.345202e+01
std      4.226088e+00
min      0.000000e+00
25%      1.000000e+01
50%      1.300000e+01
75%      1.600000e+01
max      2.300000e+01
Name: order_hour_of_day, dtype: float64

_**Note:** describe() reports its results as floats; **order_hour_of_day** itself should hold whole-number hours (0 to 23)._

In [18]:
# Ensure order_hour_of_day is stored as an integer
df_ords2['order_hour_of_day'] = df_ords2['order_hour_of_day'].astype('int')

In [19]:
# Total Orders per Hour of the Day
df_ords2['order_hour_of_day'].value_counts(dropna = False)

10    288418
11    284728
15    283639
14    283042
13    277999
12    272841
16    272553
9     257812
17    228795
18    182912
8     178201
19    140569
20    104292
7      91868
21     78109
22     61468
23     40043
6      30529
0      22758
1      12398
5       9569
2       7539
4       5527
3       5474
Name: order_hour_of_day, dtype: int64

_**ANSWER: Hour 10 (10:00 to 10:59) is the busiest** hour of the day, with 288,418 orders._
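
The busiest hour can also be pulled out programmatically instead of being read off the table. The sketch below uses a short made-up series (`hours` is illustrative, not the real column) and combines `value_counts()` with `idxmax()`.

```python
import pandas as pd

# Made-up order hours for illustration
hours = pd.Series([10, 10, 11, 9, 10, 11, 15], name='order_hour_of_day')

# Frequency of each hour, sorted from most to least common
counts = hours.value_counts()

# Label (hour) with the highest count, and that count
busiest_hour = counts.idxmax()
busiest_count = counts.max()

# Same frequencies in chronological order, handy for plotting
by_hour = counts.sort_index()

print(busiest_hour, busiest_count)
```

Applied to the real data, the same two calls on `df_ords2['order_hour_of_day']` should return the hour and count reported in the answer above.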

#### 3. DATA DICTIONARY
** **



**Q5. Determine the meaning behind a value of 4 in the "department_id" column within the df_prods dataframe using a data dictionary.**

In [20]:
# Identifying Column Names
df_dep.head()

Unnamed: 0,department_id,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
0,department,frozen,other,bakery,produce,alcohol,international,beverages,pets,dry goods pasta,...,meat seafood,pantry,breakfast,canned goods,dairy eggs,household,babies,snacks,deli,missing


In [21]:
# Transposing df_dep
df_dep.T

Unnamed: 0,0
department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta


In [22]:
# Transposed Version of Departments Dataframe
df_dep_t = df_dep.T

In [23]:
df_dep_t

Unnamed: 0,0
department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta


In [24]:
# Preview Department Transposed Dataframe with an index column added (result is not saved)
df_dep_t.reset_index()

Unnamed: 0,index,0
0,department_id,department
1,1,frozen
2,2,other
3,3,bakery
4,4,produce
5,5,alcohol
6,6,international
7,7,beverages
8,8,pets
9,9,dry goods pasta


In [25]:
# Take the first row of df_dep_t for the header
new_header = df_dep_t.iloc[0]

In [26]:
new_header

0    department
Name: department_id, dtype: object

In [27]:
# Drop the first row (the old header) from the Transposed Departments Dataframe
df_dep_t_new = df_dep_t[1:]

In [28]:
df_dep_t_new

Unnamed: 0,0
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta
10,bulk


In [29]:
# Set the header row as the df header
df_dep_t_new.columns = new_header 

In [30]:
df_dep_t_new

department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta
10,bulk


In [31]:
# Data Dictionary
data_dict = df_dep_t_new.to_dict('index')

In [32]:
# Checking Data Dictionary Output
data_dict

{'1': {'department': 'frozen'},
 '2': {'department': 'other'},
 '3': {'department': 'bakery'},
 '4': {'department': 'produce'},
 '5': {'department': 'alcohol'},
 '6': {'department': 'international'},
 '7': {'department': 'beverages'},
 '8': {'department': 'pets'},
 '9': {'department': 'dry goods pasta'},
 '10': {'department': 'bulk'},
 '11': {'department': 'personal care'},
 '12': {'department': 'meat seafood'},
 '13': {'department': 'pantry'},
 '14': {'department': 'breakfast'},
 '15': {'department': 'canned goods'},
 '16': {'department': 'dairy eggs'},
 '17': {'department': 'household'},
 '18': {'department': 'babies'},
 '19': {'department': 'snacks'},
 '20': {'department': 'deli'},
 '21': {'department': 'missing'}}

**note:** _department id_ - **4 = produce**

In [33]:
# Checking Products Column Names
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [34]:
# Using Data Dictionary to reference code in Departments File (department_id column)
print(data_dict.get('4'))

{'department': 'produce'}
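
The transpose, header, and dictionary steps can be condensed, and the dictionary flattened so that an id maps directly to a department name. The sketch below is one possible shortcut, not the method used above; `toy_dep` mimics the wide layout of departments.csv with only four made-up columns, and `dept_lookup` is a hypothetical name.

```python
import pandas as pd

# Wide layout like departments.csv: ids as column headers, one row of names
toy_dep = pd.DataFrame(
    [['department', 'frozen', 'other', 'bakery', 'produce']],
    columns=['department_id', '1', '2', '3', '4'],
)

# Use the label column as the index, then transpose so ids become rows
dep_long = toy_dep.set_index('department_id').T

# A Series converts to a flat {id: name} dictionary
flat_dict = dep_long['department'].to_dict()

# Integer keys match the integer department_id values in the products data
dept_lookup = {int(key): name for key, name in flat_dict.items()}

print(dept_lookup[4])
```

With integer keys, a lookup such as `dept_lookup[4]` needs no string conversion and returns the name itself rather than a nested dictionary.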


#### 4. DATA SUBSETTING PROCEDURES
** **

**Q6. The sales team in your client’s organization wants to know more about breakfast item sales. Create a subset containing only the required information.**

In [35]:
# Breakfast: Subset of Products Dataframe
df_bfst = df_prods.loc[df_prods['department_id']==14]

In [36]:
df_bfst

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
27,28,Wheat Chex Cereal,121,14,10.1
33,34,,121,14,12.2
67,68,"Pancake Mix, Buttermilk",130,14,13.7
89,90,Smorz Cereal,121,14,3.9
210,211,Gluten Free Organic Cereal Coconut Maple Vanilla,130,14,3.6
...,...,...,...,...,...
49330,49326,Cereal Variety Fun Pack,121,14,9.1
49395,49391,Light and Fluffy Buttermilk Pancake Mix,130,14,2.0
49547,49543,Chocolate Cheerios Cereal,121,14,10.8
49637,49633,Shake 'N Pour Buttermilk Pancake Mix,130,14,14.2


** **
**Q7. They’d also like to see details about customers who might be throwing dinner parties. Your task is to find all observations from the entire dataframe that include items from the following departments: alcohol, deli, beverages, and meat/seafood. You’ll need to present this subset to your client.**

In [37]:
# Alcohol Department: Subset of Products Dataframe
df_alchl = df_prods.loc[df_prods['department_id']==5]

In [38]:
df_alchl

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
51,52,Mirabelle Brut Rose,134,5,14.4
118,119,Chardonnay Paso Robles,62,5,5.5
149,150,Brut Rosé,134,5,12.9
233,234,Tennessee Whiskey,124,5,3.1
248,249,"Pinot Grigio, California, 2010",62,5,2.7
...,...,...,...,...,...
49548,49544,Cabernet Sauvignon Wine,28,5,9.8
49566,49562,Blanc De Noirs Sparkling Wine,134,5,5.7
49610,49606,Organic Natural Red,28,5,6.8
49665,49661,Porto,134,5,8.2


In [39]:
# Deli Department: Subset of Products Dataframe
df_deli = df_prods.loc[df_prods['department_id']==20]

In [40]:
df_deli

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
48,49,Vegetarian Grain Meat Sausages Italian - 4 CT,14,20,10.1
61,62,Premium Deli Oven Roasted Turkey Breast,96,20,14.6
73,74,Artisan Chick'n & Apple Sausage,14,20,3.0
84,85,Soppressata Piccante,96,20,8.9
108,109,Grape Leaf Hummus Wrap,13,20,7.1
...,...,...,...,...,...
49558,49554,Roasted Garlic Hommus,67,20,14.8
49564,49560,Selects Natural Slow Roasted Chicken Breast,96,20,14.5
49585,49581,Pinto Bean and Cheese Pupusa,13,20,10.5
49609,49605,Classic Hummus Family Size,67,20,3.5


In [41]:
# Beverages Department: Subset of Products Dataframe
df_bvg = df_prods.loc[df_prods['department_id']==7]

In [42]:
df_bvg

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
6,7,Pure Coconut Water With Orange,98,7,4.4
9,10,Sparkling Orange Juice & Prickly Pear Beverage,115,7,8.4
10,11,Peach Mango Juice,31,7,2.8
19,20,Pomegranate Cranberry & Aloe Vera Enrich Drink,98,7,6.0
...,...,...,...,...,...
49659,49655,Apple Cider,98,7,10.7
49676,49672,Cafe Mocha K-Cup Packs,26,7,6.5
49679,49675,Cinnamon Dolce Keurig Brewed K Cups,26,7,14.0
49680,49676,Ultra Red Energy Drink,64,7,14.5


In [43]:
# Meat/Seafood Department: Subset of Products Dataframe
df_mt_sfd = df_prods.loc[df_prods['department_id']==12]

In [44]:
df_mt_sfd

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
16,17,Rendered Duck Fat,35,12,17.1
22,23,Organic Turkey Burgers,49,12,8.2
34,35,Italian Herb Porcini Mushrooms Chicken Sausage,106,12,15.1
39,40,Beef Hot Links Beef Smoked Sausage With Chile ...,106,12,22.5
83,84,Lamb Shank,7,12,24.3
...,...,...,...,...,...
49425,49421,"Salame, Peppered",106,12,12.2
49440,49436,Imitation Crab Flakes,15,12,23.5
49509,49505,Hot Italian Sausage,106,12,15.8
49655,49651,Beef Brisket,122,12,20.7


In [45]:
# DINNER Products - Combined Subset for selected departments (Alcohol, Deli, Beverages & Meat/Seafood)
df_dinner = df_prods.loc[df_prods['department_id'].isin([5,20,7,12])]

In [46]:
df_dinner

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
6,7,Pure Coconut Water With Orange,98,7,4.4
9,10,Sparkling Orange Juice & Prickly Pear Beverage,115,7,8.4
10,11,Peach Mango Juice,31,7,2.8
16,17,Rendered Duck Fat,35,12,17.1
...,...,...,...,...,...
49676,49672,Cafe Mocha K-Cup Packs,26,7,6.5
49679,49675,Cinnamon Dolce Keurig Brewed K Cups,26,7,14.0
49680,49676,Ultra Red Energy Drink,64,7,14.5
49686,49682,California Limeade,98,7,4.3


** **
**Q8. It’s important that you keep track of total counts in your dataframes. How many rows does the last dataframe you created have?**

**Alcohol**      : 1056 rows

**Deli**         : 1322 rows 

**Beverages**    : 4365 rows 

**Meat/Seafood** : 907 rows 

**Dinner**       : 7650 rows
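
Row counts like these can be obtained with `len()` or `shape`, and cross-checked by confirming that the per-department counts add up to the combined subset. A minimal sketch on made-up department ids (`toy_prods` is hypothetical):

```python
import pandas as pd

# Made-up products; only department_id matters for the count
toy_prods = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'department_id': [5, 20, 7, 12, 14, 7, 5, 19],
})

# Combined subset for alcohol (5), deli (20), beverages (7), meat/seafood (12)
toy_dinner = toy_prods.loc[toy_prods['department_id'].isin([5, 20, 7, 12])]

# Total rows in the subset
total_rows = len(toy_dinner)

# Rows per department within the subset
per_dept = toy_dinner['department_id'].value_counts()

print(total_rows)
print(per_dept.sum() == total_rows)
```

The same check holds for the figures above: 1056 + 1322 + 4365 + 907 = 7650.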

** **
**Q9. Someone from the data engineers team in Instacart thinks they’ve spotted something strange about the customer with a "user_id" of “1.” Extract all the information you can about this user.**

In [47]:
# Preview Orders Dataframe before extracting User 1 (customer_id = 1)
df_ords2.head()

Unnamed: 0,order_id,customer_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


In [48]:
# User1 Subset (filter on the renamed customer_id column of the same dataframe)
df_user1 = df_ords2.loc[df_ords2['customer_id']==1]

In [49]:
df_user1

Unnamed: 0,order_id,customer_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
5,3367565,1,6,2,7,19.0
6,550135,1,7,1,9,20.0
7,3108588,1,8,1,14,14.0
8,2295261,1,9,1,16,0.0
9,2550362,1,10,4,8,30.0


** **
**Q10. You also need to provide some details about this user’s behavior. What basic stats can you provide based on the information you have?**

In [50]:
# User1 Statistical Information
df_user1.describe()

Unnamed: 0,customer_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
count,11.0,11.0,11.0,11.0,10.0
mean,1.0,6.0,2.636364,10.090909,19.0
std,0.0,3.316625,1.286291,3.477198,9.030811
min,1.0,1.0,1.0,7.0,0.0
25%,1.0,3.5,1.5,7.5,14.25
50%,1.0,6.0,3.0,8.0,19.5
75%,1.0,8.5,4.0,13.0,26.25
max,1.0,11.0,4.0,16.0,30.0


** **
#### 5. EXPORTING DATAFRAMES

In [51]:
# Export Wrangled Orders Dataframe
df_ords2.to_csv(os.path.join(path, '02 Data','Prepared Data','orders_wrangled.csv'), index = False)

In [52]:
# Export Wrangled Departments Dataframe (keep the index, which holds the department ids)
df_dep_t_new.to_csv(os.path.join(path, '02 Data','Prepared Data','departments_wrangled.csv'), index_label = 'department_id')
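
When the identifiers live in the index, as they do in the transposed departments dataframe, the `index` argument of `to_csv()` decides whether they reach the file. The sketch below writes a made-up dataframe to a temporary folder both ways and reads it back; the file names and `toy_dep` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up departments with the ids stored in the index
toy_dep = pd.DataFrame(
    {'department': ['frozen', 'other', 'bakery']},
    index=['1', '2', '3'],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    kept_path = os.path.join(tmp_dir, 'dep_with_ids.csv')
    lost_path = os.path.join(tmp_dir, 'dep_without_ids.csv')

    # Index written out under an explicit column name
    toy_dep.to_csv(kept_path, index_label='department_id')

    # index=False omits the index, so the ids are not saved
    toy_dep.to_csv(lost_path, index=False)

    kept = pd.read_csv(kept_path)
    lost = pd.read_csv(lost_path)

print(list(kept.columns))
print(list(lost.columns))
```

Reading an exported file back in like this is a quick way to confirm that it contains every column expected.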