# Data Preparation

In [9]:
# "magic commands" to enable autoreload of your imported packages
%load_ext autoreload
%autoreload 2

Our goal is to load all 8 csv files as `pandas.DataFrame` objects into a single dict named `data`, where each key is a short name derived from the csv file name and each value is the DataFrame created from that csv file
```python
data = { 
    'sellers': DataFrame1,
    'orders': DataFrame2,
    ...
    }
```

### 1 Create the variable `csv_path`, which stores as a string the path to your csv folder

- When calling `pd.read_csv(csv_path)`, `csv_path` can be absolute or relative
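- A relative path often starts with `.` or `..` (a bare `folder_name/file.csv` is relative too), while an absolute path starts with `/` on Linux and macOS, or with a drive letter such as `C:\` on Windows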
- A relative path is always computed with respect to your current working directory
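
As a quick check of the points above, the sketch below classifies two paths with `os.path.isabs` and resolves a relative one against the current working directory. The folder names are hypothetical and only serve as an illustration.

```python
import os

# Hypothetical folder names, used only to illustrate the difference
relative_path = os.path.join("..", "data", "csv")
absolute_path = os.path.abspath(relative_path)  # resolved against os.getcwd()

print(os.path.isabs(relative_path))  # False
print(os.path.isabs(absolute_path))  # True

# abspath joins the current working directory with the relative path, then normalizes it
expected = os.path.normpath(os.path.join(os.getcwd(), relative_path))
print(absolute_path == expected)     # True
```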

In [None]:
# Check your current working directory using `os.getcwd()` below 
import os
os.getcwd()

☝️ The current working directory is the absolute path _from which this notebook is being executed_

Create a relative `csv_path`.
Try using [`os.path.join`](https://docs.python.org/3/library/os.path.html), which builds the path with the separator of the current operating system (`/` on Linux and macOS, as in `../folder_name`; `\` on Windows, as in `..\folder_name`) and is therefore more robust than a hard-coded string
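
A minimal sketch of how `os.path.join` assembles a path from its parts. The folder names below are placeholders, not the actual location of your csv files, so adapt them to your own project layout.

```python
import os

# Hypothetical folder names: adapt them to where your csv files actually live
example_path = os.path.join("..", "some_folder", "csv")

# os.sep is "/" on Linux/macOS and "\\" on Windows
print(example_path == os.sep.join(["..", "some_folder", "csv"]))  # True
```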

In [None]:
# Your code here


In [None]:
# Test your code below
import pandas as pd
pd.read_csv(os.path.join(csv_path, 'olist_sellers_dataset.csv')).head()

### 2 Create the list `file_names` containing all csv file names in the csv directory

- It should look like this `file_names = ['olist_sellers_dataset.csv', ....]`
- You can use `os.listdir()`
- Make sure it only lists csv files!
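
One possible way to filter a directory listing, shown on a throwaway folder rather than on the real csv directory. The file names are made up for the demonstration; `sorted` is only there to make the order predictable, since `os.listdir()` returns entries in arbitrary order.

```python
import os
import tempfile

# Build a throwaway folder with mixed file types (hypothetical names)
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["a_dataset.csv", "b_dataset.csv", "notes.txt", ".DS_Store"]:
        open(os.path.join(demo_dir, name), "w").close()

    # Keep only entries whose name ends with ".csv"; sort for a stable order
    csv_files = sorted(f for f in os.listdir(demo_dir) if f.endswith(".csv"))

print(csv_files)  # ['a_dataset.csv', 'b_dataset.csv']
```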

In [None]:
# Your code here

### 3 Create the list of dict keys `key_names`
Start from `file_names` and, for each file name:
- Remove its suffix `_dataset.csv` when it exists
- Otherwise, remove its suffix `.csv` when it exists
- Remove its prefix `olist_` when it exists

<details>
    <summary>Hint</summary>

Strings are sequences you can slice with `[ ]`
</details>
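
The slicing idea from the hint could be sketched as follows. `clean_name` is a hypothetical helper, not part of the exercise's expected solution, and `some_table.csv` is a made-up file name added to show the plain `.csv` case.

```python
def clean_name(file_name):
    """Strip the 'olist_' prefix and the '_dataset.csv' or '.csv' suffix."""
    name = file_name
    # Check the longer suffix first: '_dataset.csv' also ends with '.csv'
    for suffix in ("_dataset.csv", ".csv"):
        if name.endswith(suffix):
            name = name[:-len(suffix)]
            break
    prefix = "olist_"
    if name.startswith(prefix):
        name = name[len(prefix):]
    return name

sample_files = [
    "olist_sellers_dataset.csv",
    "olist_orders_dataset.csv",
    "some_table.csv",  # made-up name, no prefix and no '_dataset'
]
print([clean_name(f) for f in sample_files])
# ['sellers', 'orders', 'some_table']
```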

In [13]:
# Your code

### 4 Construct the dictionary `data`

```python
data = { 
    'sellers': DataFrame1,
    'orders': DataFrame2,
    'order_items': DataFrame3,
    ...
    }
```

<details>
    <summary>Hint</summary>

The `zip()` function is very useful for iterating over two lists in parallel
```python
for (x, y) in zip(['a','b','c'], [1,2,3]):
    print(x, y)

# prints:
# a 1
# b 2
# c 3
```
</details>
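
To illustrate how `zip()` can pair keys with DataFrames, here is a small sketch on toy data. The names `keys`, `frames` and `toy_data` are placeholders, and the two tiny DataFrames stand in for the ones you would load with `pd.read_csv`.

```python
import pandas as pd

# Toy stand-ins for the real keys and DataFrames
keys = ["sellers", "orders"]
frames = [pd.DataFrame({"id": [1, 2]}), pd.DataFrame({"id": [3]})]

# Pair each key with the DataFrame at the same position
toy_data = {key: df for key, df in zip(keys, frames)}

print(list(toy_data))            # ['sellers', 'orders']
print(toy_data["orders"].shape)  # (1, 1)
```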

In [2]:
data = {}

### 5 Implement the method `get_data()` in `olist/data.py`

It should return the dictionary `data` when called as shown below

```python
from olist.data import Olist
Olist().get_data()
```
- Take time to understand what happens when calling `Olist().get_data()`
- Your method `get_data()` needs to be callable from various places (e.g. your terminal, this notebook, another notebook located elsewhere, etc.)
- You can't use a relative path this time, as the current working directory `os.getcwd()` depends on where you run the code from
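
One common way to get a path that does not depend on the current working directory is to anchor it on the location of the module file itself. The sketch below assumes a hypothetical layout where the csv files live in `<project_root>/data/csv` next to the `olist/` package; your actual layout may differ. `csv_dir_from` is an illustrative helper, not part of the provided code.

```python
import os

def csv_dir_from(module_file):
    """Return <project_root>/data/csv, given the path of a module in <project_root>/olist/."""
    module_dir = os.path.dirname(os.path.abspath(module_file))  # .../project/olist
    project_root = os.path.dirname(module_dir)                  # .../project
    return os.path.join(project_root, "data", "csv")            # assumed layout

# Inside olist/data.py you would pass the module's own __file__;
# here a made-up absolute path is used instead.
fs_root = os.path.abspath(os.sep)
demo_module = os.path.join(fs_root, "home", "me", "project", "olist", "data.py")
print(csv_dir_from(demo_module))
```

Because the result is derived from the module's location, it stays the same whether the code is called from a terminal, this notebook, or a notebook elsewhere.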

In [None]:
# Test your code below
from olist.data import Olist
Olist().get_data()['sellers'].head()

❓ This piece of code needs to work from anywhere on your machine, not only in this notebook.
- Open a new terminal
- Go to your home folder with `cd`
- Launch an `ipython` session
- Test the two lines of code above

🏁 If this works, congrats! Don't forget to commit and push this notebook as well as `olist/data.py`