# Grupo Bimbo Inventory Demand

## Exploratory data analysis

### Introduction

We are going to run an exploratory data analysis of the **Grupo Bimbo** datasets. First, download all the datasets from Kaggle at https://www.kaggle.com/c/5260/download-all and place the CSV files under `data/csv`.

At this point, your directory should look like the following:

```
.
└── data
    └── csv
        ├── cliente_tabla.csv
        ├── producto_tabla.csv
        ├── sample_submission.csv
        ├── test.csv
        ├── town_state.csv
        └── train.csv 
```

### Import files

Second, let's import all the files.

In [1]:
# Import libraries
import pandas as pd
import os

In [2]:
# Access all files in data directory
os.listdir('../data/csv')

['producto_tabla.csv',
 'cliente_tabla.csv',
 'test.csv',
 'town_state.csv',
 'train.csv',
 'sample_submission.csv']

In [8]:
for f in os.listdir('../data/csv'):
    print(f.ljust(30) + str(round(os.path.getsize('../data/csv/' + f) / 1000000, 2)) + 'MB')

producto_tabla.csv            0.11MB
cliente_tabla.csv             21.25MB
test.csv                      251.11MB
town_state.csv                0.03MB
train.csv                     3199.36MB
sample_submission.csv         68.88MB
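
Since `train.csv` is about 3.2 GB on disk, it can be worth peeking at the column names and a few rows before loading a whole file. The following is a minimal sketch, not part of the original notebook: `peek` is a hypothetical helper, and the in-memory `SAMPLE_CSV` stands in for the real file, using the first `train` rows shown later in this notebook.

```python
import io

import pandas as pd

# Stand-in for '../data/csv/train.csv' (assumption: same header as the real file),
# built from the head() output shown later in this notebook.
SAMPLE_CSV = """Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil
3,1110,7,3301,15766,1212,3,25.14,0,0.0,3
3,1110,7,3301,15766,1216,4,33.52,0,0.0,4
3,1110,7,3301,15766,1238,4,39.32,0,0.0,4
3,1110,7,3301,15766,1240,4,33.52,0,0.0,4
3,1110,7,3301,15766,1242,3,22.92,0,0.0,3
"""


def peek(source, n_rows=3):
    """Read only the first n_rows data rows instead of the whole file."""
    return pd.read_csv(source, nrows=n_rows)


preview = peek(io.StringIO(SAMPLE_CSV), n_rows=3)
print(preview.columns.tolist())
print(preview.shape)
```

With the real data, the same helper would be called with a path such as `'../data/csv/train.csv'` instead of the `StringIO` object.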


In [3]:
# Load each CSV into a single dict, keyed by file name without the '.csv' extension
data_dict = {}
for file in os.listdir('../data/csv'):
    data_dict[file[:-4]] = pd.read_csv('../data/csv/{}'.format(file))

In [4]:
data_dict.keys()

dict_keys(['producto_tabla', 'cliente_tabla', 'test', 'town_state', 'train', 'sample_submission'])

In [10]:
data_dict['producto_tabla'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2592 entries, 0 to 2591
Data columns (total 2 columns):
Producto_ID       2592 non-null int64
NombreProducto    2592 non-null object
dtypes: int64(1), object(1)
memory usage: 40.6+ KB


In [11]:
data_dict['cliente_tabla'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 935362 entries, 0 to 935361
Data columns (total 2 columns):
Cliente_ID       935362 non-null int64
NombreCliente    935362 non-null object
dtypes: int64(1), object(1)
memory usage: 14.3+ MB


In [14]:
data_dict['test'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6999251 entries, 0 to 6999250
Data columns (total 7 columns):
id             int64
Semana         int64
Agencia_ID     int64
Canal_ID       int64
Ruta_SAK       int64
Cliente_ID     int64
Producto_ID    int64
dtypes: int64(7)
memory usage: 373.8 MB


In [15]:
data_dict['town_state'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 790 entries, 0 to 789
Data columns (total 3 columns):
Agencia_ID    790 non-null int64
Town          790 non-null object
State         790 non-null object
dtypes: int64(1), object(2)
memory usage: 18.6+ KB


In [16]:
data_dict['train'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74180464 entries, 0 to 74180463
Data columns (total 11 columns):
Semana               int64
Agencia_ID           int64
Canal_ID             int64
Ruta_SAK             int64
Cliente_ID           int64
Producto_ID          int64
Venta_uni_hoy        int64
Venta_hoy            float64
Dev_uni_proxima      int64
Dev_proxima          float64
Demanda_uni_equil    int64
dtypes: float64(2), int64(9)
memory usage: 6.1 GB
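
The `train` table takes 6.1 GB in memory because every column is stored as `int64` or `float64`. One possible way to shrink it is to downcast each numeric column to the smallest dtype that holds its values. The sketch below is an illustration only: `downcast_numeric` is a hypothetical helper, and it runs on the five `train` rows shown in this notebook rather than on the full table, so the dtypes it picks reflect only those rows.

```python
import io

import pandas as pd

# Five rows of train, copied from the head() output in this notebook.
SAMPLE_CSV = """Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil
3,1110,7,3301,15766,1212,3,25.14,0,0.0,3
3,1110,7,3301,15766,1216,4,33.52,0,0.0,4
3,1110,7,3301,15766,1238,4,39.32,0,0.0,4
3,1110,7,3301,15766,1240,4,33.52,0,0.0,4
3,1110,7,3301,15766,1242,3,22.92,0,0.0,3
"""


def downcast_numeric(df):
    """Return a copy whose numeric columns use the smallest dtype that fits their values."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float32 keeps roughly 7 significant digits, so values may change slightly
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


train_sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
small = downcast_numeric(train_sample)

before = train_sample.memory_usage(index=False).sum()
after = small.memory_usage(index=False).sum()
print(small.dtypes)
print(before, "->", after, "bytes")
```

Downcasting after loading still needs the full table in memory once. If the value ranges are known, the chosen dtypes could instead be passed to `pd.read_csv` through its `dtype` argument.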


In [17]:
data_dict['train'].head()

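|   | Semana | Agencia_ID | Canal_ID | Ruta_SAK | Cliente_ID | Producto_ID | Venta_uni_hoy | Venta_hoy | Dev_uni_proxima | Dev_proxima | Demanda_uni_equil |
|---|--------|------------|----------|----------|------------|-------------|---------------|-----------|-----------------|-------------|-------------------|
| 0 | 3 | 1110 | 7 | 3301 | 15766 | 1212 | 3 | 25.14 | 0 | 0.0 | 3 |
| 1 | 3 | 1110 | 7 | 3301 | 15766 | 1216 | 4 | 33.52 | 0 | 0.0 | 4 |
| 2 | 3 | 1110 | 7 | 3301 | 15766 | 1238 | 4 | 39.32 | 0 | 0.0 | 4 |
| 3 | 3 | 1110 | 7 | 3301 | 15766 | 1240 | 4 | 33.52 | 0 | 0.0 | 4 |
| 4 | 3 | 1110 | 7 | 3301 | 15766 | 1242 | 3 | 22.92 | 0 | 0.0 | 3 |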
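
In the rows above, `Demanda_uni_equil` equals `Venta_uni_hoy` minus `Dev_uni_proxima`. A quick sanity check a reader might run is whether demand is always units sold minus units returned, floored at zero. That relationship is an assumption here, not something this notebook states, and the sketch below only tests it on the five rows shown.

```python
import io

import pandas as pd

# Five rows of train, copied from the head() output above.
SAMPLE_CSV = """Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil
3,1110,7,3301,15766,1212,3,25.14,0,0.0,3
3,1110,7,3301,15766,1216,4,33.52,0,0.0,4
3,1110,7,3301,15766,1238,4,39.32,0,0.0,4
3,1110,7,3301,15766,1240,4,33.52,0,0.0,4
3,1110,7,3301,15766,1242,3,22.92,0,0.0,3
"""

rows = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Assumed relationship (to be verified on the full data): demand = sales - returns, never below 0.
expected = (rows["Venta_uni_hoy"] - rows["Dev_uni_proxima"]).clip(lower=0)
matches = expected.eq(rows["Demanda_uni_equil"])

print(matches.all())
print(rows.loc[~matches])  # any rows that break the assumed relationship
```

On the full table, the same three lines could be run against `data_dict['train']` to see how many rows, if any, deviate.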


### Run an exploratory analysis

Third, we run an exploratory analysis of the datasets listed below, using [pandas-profiling](https://github.com/pandas-profiling/pandas-profiling) to write one HTML report per dataset to a `report` folder.

In [7]:
dataset_list = ['producto_tabla',
                'cliente_tabla',
                'town_state',
                'train']

In [9]:
from pandas_profiling import ProfileReport
from pathlib import Path

# Build one profiling report per dataset and save it as HTML
for file in dataset_list:
    df = data_dict[file]
    profile = ProfileReport(df, title=file)
    profile.to_file(Path("../data/report/{}.html".format(file)))

Note: this cell takes too long to run on the full data.
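
Profiling all 74,180,464 rows of `train` can be slow. One option is to profile a reproducible random sample instead. The sketch below is an assumption-laden illustration rather than the notebook's method: `sample_for_profiling` and `write_profile` are hypothetical helpers, the 100,000-row default is arbitrary, and `minimal=True` is an option in recent pandas-profiling releases that turns off the more expensive computations.

```python
from pathlib import Path

import pandas as pd


def sample_for_profiling(df, max_rows=100_000, seed=0):
    """Return df unchanged if it is small, else a reproducible random sample of max_rows rows."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def write_profile(df, name, report_dir):
    """Write <report_dir>/<name>.html, creating the folder first if it is missing."""
    # Imported here so the sampling helper works without pandas-profiling installed.
    from pandas_profiling import ProfileReport

    report_dir = Path(report_dir)
    report_dir.mkdir(parents=True, exist_ok=True)
    profile = ProfileReport(df, title=name, minimal=True)
    profile.to_file(report_dir / "{}.html".format(name))


# Toy frame standing in for data_dict['train'] (made-up values, for illustration only)
toy = pd.DataFrame({"Semana": range(1000)})
subset = sample_for_profiling(toy, max_rows=100)
print(len(subset))

# With the real data, the call would look like:
# write_profile(sample_for_profiling(data_dict['train']), 'train', '../data/report')
```

A sample trades exactness for speed: counts and rare categories in the report describe the sample, not the full table.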

Now, your directory should look like the following:

```
.
└── data
    ├── csv
    │   ├── cliente_tabla.csv
    │   ├── producto_tabla.csv
    │   ├── sample_submission.csv
    │   ├── test.csv
    │   ├── town_state.csv
    │   └── train.csv
    └── report
        ├── cliente_tabla.html
        ├── producto_tabla.html
        ├── town_state.html
        └── train.html
```

### Code in bimbo/data.py

We need a single shared version of our data-loading code so that the whole team can work from it.

Within the `bimbo/data.py` file, we implement two methods:
- `get_data()`: returns all the data as a dictionary in which each key maps to one DataFrame
- `get_matching_table()`: returns a DataFrame with the following columns: `customer_id`, `customer_unique_id`, `order_id`, `product_id`, `seller_id`. It only returns data for orders that are `delivered`.

Make sure you can import and inspect the data from a notebook by running:

```python
from bimbo.data import Bimbo
bimbo = Bimbo()
data = bimbo.get_data()
matching_table = bimbo.get_matching_table()
```
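
For reference, `get_data()` could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual `bimbo/data.py`: the `csv_dir` constructor argument and its default are hypothetical, `get_matching_table()` is left out, and the demo writes a tiny made-up `town_state.csv` to a temporary folder instead of reading the real files.

```python
import tempfile
from pathlib import Path

import pandas as pd


class Bimbo:
    """Sketch of a shared data loader (csv_dir is an assumed, hypothetical parameter)."""

    def __init__(self, csv_dir="../data/csv"):
        self.csv_dir = Path(csv_dir)

    def get_data(self):
        """Return {file name without '.csv': DataFrame} for every CSV file in csv_dir."""
        return {
            path.stem: pd.read_csv(path)
            for path in sorted(self.csv_dir.glob("*.csv"))
        }


# Demo on a throwaway folder holding one tiny, made-up file
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "town_state.csv").write_text("Agencia_ID,Town,State\n1110,Town A,State A\n")
    data = Bimbo(csv_dir=tmp).get_data()

print(list(data.keys()))
print(data["town_state"].shape)
```

Using `path.stem` rather than slicing off the last four characters keeps the keys correct even if a file has a different extension length. The relative default path depends on the directory the notebook runs from, so a real implementation might resolve it from the module's own location instead.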