# Data Exploration

## Objectives
- Data loading
- Initial inspection
- Identify potential issues

## Datasets
- instacart_orders.csv
- products.csv
- order_products.csv
- aisles.csv
- departments.csv

In [21]:
# Import libraries
import pandas as pd

In [22]:
# Reading the data to variables
df_instacart_orders = pd.read_csv('../data/raw/instacart_orders.csv', sep=';')
df_products = pd.read_csv('../data/raw/products.csv', sep=';')
df_order_products = pd.read_csv('../data/raw/order_products.csv', sep=';')
df_aisles = pd.read_csv('../data/raw/aisles.csv', sep=';')
df_departments = pd.read_csv('../data/raw/departments.csv', sep=';')

In [None]:
# Checking dataframes information
print("=== INSTACART ORDERS INFO ===")
df_instacart_orders.info(show_counts=True)
print("=== INSTACART ORDERS FIRST 5 ROWS ===")
print(df_instacart_orders.head())
print("=== COUNT UNIQUE VALUES ===")
print(df_instacart_orders.nunique())

#### Instacart orders dataframe
Total of 478 967 rows;<br>
Columns: 0. 'order_id' | 1. 'user_id' | 2. 'order_number' | 3. 'order_dow' | 4. 'order_hour_of_day' | 5. 'days_since_prior_order'
- "days_since_prior_order" has 28 819 missing values. It represents a number of days, so it should hold integers, but plain int64 cannot store NaN: either use the nullable Int64 dtype or fill the missing values before converting;
- all the other columns have zero NaN values and are of type int64.
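
A minimal sketch of the conversion noted above, assuming the nullable `Int64` dtype is an acceptable target. The `sample_orders` frame and its values are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for df_instacart_orders (values are invented for illustration).
sample_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "user_id": [10, 10, 11, 12],
    "days_since_prior_order": [float("nan"), 7.0, float("nan"), 30.0],
})

# Count the missing values before converting.
n_missing_before = int(sample_orders["days_since_prior_order"].isna().sum())

# "Int64" (capital I) is the pandas nullable integer dtype:
# whole-number floats become integers and NaN becomes <NA>.
sample_orders["days_since_prior_order"] = (
    sample_orders["days_since_prior_order"].astype("Int64")
)

# The missing entries survive the conversion.
n_missing_after = int(sample_orders["days_since_prior_order"].isna().sum())
print(sample_orders.dtypes)
```

Casting to plain `int64` instead would raise an error while the column still contains missing values, which is why the sketch uses the nullable dtype.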

In [None]:
# Checking dataframes information
print("=== PRODUCTS INFO ===")
df_products.info()
print("=== PRODUCTS FIRST 5 ROWS ===")
print(df_products.head())
print("=== COUNT UNIQUE VALUES ===")
print(df_products.nunique())

#### Products dataframe 
Total of 49 694 rows;<br>
Columns: 0. 'product_id' | 1. 'product_name' | 2. 'aisle_id' | 3. 'department_id':
- "product_name" has 1 258 missing values, and its string content should be converted to lowercase so that names can be compared reliably;
- all the other columns have zero NaN values and are of type int64.
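
The lowercase normalization could look like the sketch below. The `sample_products` frame is a made-up stand-in for `df_products`, and the duplicate check is an illustrative extra step rather than something the notebook already performs.

```python
import pandas as pd

# Toy stand-in for df_products (values are invented for illustration).
sample_products = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "product_name": [
        "Chocolate Sandwich Cookies",
        None,
        "chocolate sandwich cookies",
        "Green Tea",
    ],
})

# .str.lower() leaves missing names missing instead of raising.
sample_products["product_name"] = sample_products["product_name"].str.lower()

# Missing names are still counted as missing after the transformation.
n_missing_names = int(sample_products["product_name"].isna().sum())

# With a single casing, names that differed only by case now show up as duplicates.
non_missing = sample_products["product_name"].dropna()
n_duplicate_names = int(non_missing.duplicated().sum())
print(n_missing_names, n_duplicate_names)
```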

In [None]:
# Checking dataframes information
print("=== ORDERS PRODUCTS INFO ===")
df_order_products.info(show_counts=True)
print("=== ORDERS PRODUCTS FIRST 5 ROWS ===")
print(df_order_products.head())
print("=== COUNT UNIQUE VALUES ===")
print(df_order_products.nunique())


#### Order_products dataframe 
Total of 4 545 007 rows;<br>
Columns: 0. 'order_id' | 1. 'product_id' | 2. 'add_to_cart_order' | 3. 'reordered':
- "add_to_cart_order" has 836 missing values and should hold integers; plain int64 cannot store NaN, so either use the nullable Int64 dtype or handle the missing values before converting;
- 'reordered' should be changed to type bool;
- all the other columns have zero NaN values and are of type int64;
- we can sort the dataframe by 'order_id' so that the rows belonging to the same order are contiguous, as we only have 478 952 orders.
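
The three clean-up steps listed above can be sketched as follows. `sample_order_products` is a made-up stand-in for `df_order_products`, and the stable sort is an assumption chosen so that rows keep their original relative order within each order.

```python
import pandas as pd

# Toy stand-in for df_order_products (values are invented for illustration).
sample_order_products = pd.DataFrame({
    "order_id": [3, 1, 3, 1],
    "product_id": [100, 200, 300, 400],
    "add_to_cart_order": [2.0, 1.0, float("nan"), 2.0],
    "reordered": [0, 1, 1, 0],
})

# Nullable integer dtype: keeps the missing cart positions as <NA>.
sample_order_products["add_to_cart_order"] = (
    sample_order_products["add_to_cart_order"].astype("Int64")
)

# 0/1 flags become False/True.
sample_order_products["reordered"] = sample_order_products["reordered"].astype(bool)

# Group the rows of each order together; a stable sort preserves
# the original row order inside every order_id.
sample_order_products = (
    sample_order_products
    .sort_values("order_id", kind="stable")
    .reset_index(drop=True)
)
print(sample_order_products)
```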

In [26]:
# Checking dataframes information
print("=== AISLES INFO ===")
df_aisles.info()
print("=== AISLES FIRST 5 ROWS ===")
print(df_aisles.head())
print("=== COUNT UNIQUE VALUES ===")
print(df_aisles.nunique())

=== AISLES INFO ===
<class 'pandas.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   aisle_id  134 non-null    int64
 1   aisle     134 non-null    str  
dtypes: int64(1), str(1)
memory usage: 2.2 KB
=== AISLES FIRST 5 ROWS ===
   aisle_id                       aisle
0         1       prepared soups salads
1         2           specialty cheeses
2         3         energy granola bars
3         4               instant foods
4         5  marinades meat preparation
=== COUNT UNIQUE VALUES ===
aisle_id    134
aisle       134
dtype: int64


#### Aisles dataframe
Total of 134 rows;<br>
Columns: 0. 'aisle_id' | 1. 'aisle':<br>
- no NaN values
- int64 for 'aisle_id'
- str for 'aisle'

In [28]:
# Checking dataframes information
print("=== DEPARTMENTS INFO ===")
df_departments.info()
print("=== DEPARTMENTS FIRST 5 ROWS ===")
print(df_departments.head())
print("=== COUNT UNIQUE VALUES ===")
print(df_departments.nunique())

=== DEPARTMENTS INFO ===
<class 'pandas.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   department_id  21 non-null     int64
 1   department     21 non-null     str  
dtypes: int64(1), str(1)
memory usage: 468.0 bytes
=== DEPARTMENTS FIRST 5 ROWS ===
   department_id department
0              1     frozen
1              2      other
2              3     bakery
3              4    produce
4              5    alcohol
=== COUNT UNIQUE VALUES ===
department_id    21
department       21
dtype: int64


#### Departments dataframe
Total of 21 rows;<br>
Columns: 0. 'department_id' | 1. 'department':<br>
- no NaN values
- int64 for 'department_id'
- str for 'department'
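
The observations for the two lookup tables can also be expressed as programmatic checks. The sketch below rebuilds the five department rows printed above as `sample_departments` (a hypothetical name); the same checks would apply to the aisles table.

```python
import pandas as pd

# First five rows of the departments table, copied from the printed output above.
sample_departments = pd.DataFrame({
    "department_id": [1, 2, 3, 4, 5],
    "department": ["frozen", "other", "bakery", "produce", "alcohol"],
})

# True if any cell in the table is missing.
has_missing = bool(sample_departments.isna().any().any())

# A lookup table should have one row per id and one id per name.
ids_unique = sample_departments["department_id"].is_unique
names_unique = sample_departments["department"].is_unique

print(has_missing, ids_unique, names_unique)
```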