# Data Preparation

In [1]:
# "magic commands" to enable autoreload of your imported packages
%load_ext autoreload
%autoreload 2

Our goal is to load all 9 csv files, each as a `pandas.DataFrame`, into a single dict named `data`, where each key is a short name derived from the csv file name and each value is the DataFrame created from that file:
```python
data = { 
    'sellers': DataFrame1,
    'orders': DataFrame2,
    ...
    }
```

### 1. Create the variable `csv_path`, which stores the path to your csv folder as a string

- When calling `pd.read_csv(csv_path)`, `csv_path` can be absolute or relative
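- A relative path often starts with `.` or `..` but does not have to (e.g. `data/csv`), while an absolute path starts with `/` on Linux and macOS, or with a drive letter such as `C:\` on Windows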
- A relative path is always computed with respect to your current working directory
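
To make the three points above concrete, here is a minimal sketch using only the standard library. The folder names `data` and `csv` are placeholders and do not need to exist for the path functions to work.

```python
import os

# A relative path: it is interpreted from the current working directory.
relative_path = os.path.join("data", "csv")  # placeholder folder names

# os.path.abspath resolves it against os.getcwd() and normalizes the result.
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))  # False
print(os.path.isabs(absolute_path))  # True
print(absolute_path)                 # depends on where you run this code
```

The last line prints a different result in every folder you run it from, which is exactly why relative paths depend on the current working directory.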

In [2]:
# Check your current working directory using `os.getcwd()` below 
import os
os.getcwd()

'/Users/kenzaelhoussaini/code/kelhoussaini/data-challenges/04-Decision-Science/01-Project-Setup/02-Data-Preparation'

☝️ The current working directory is the absolute path _from which this notebook is being executed_

Create a relative `csv_path`.
Try using [`os.path.join`](https://docs.python.org/3/library/os.path.html), which inserts the separator of the operating system for you. The same code then works with both Linux syntax (e.g. `../folder_name`) and Windows syntax (e.g. `..\folder_name`), and is therefore more robust.
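
As a small sketch of that idea: the path below goes two levels up from the notebook folder and then into `data/csv`, which matches the folder layout shown in the outputs of this notebook. `os.pardir` and `os.sep` are standard library constants.

```python
import os

# os.pardir is ".." and os.sep is the separator of the current OS
# ("/" on Linux and macOS, "\\" on Windows).
csv_path = os.path.join(os.pardir, os.pardir, "data", "csv")
print(csv_path)

# The same path assembled by hand with the OS separator, for comparison.
by_hand = os.sep.join([os.pardir, os.pardir, "data", "csv"])
print(csv_path == by_hand)  # True
```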

In [4]:
# Go up two levels from the notebook folder to reach the project root
pathh = os.path.dirname(os.path.dirname(os.getcwd()))
pathh

'/Users/kenzaelhoussaini/code/kelhoussaini/data-challenges/04-Decision-Science'

In [5]:
# YOUR CODE HERE
csv_path = os.path.join(pathh, "data", "csv")
csv_path

'/Users/kenzaelhoussaini/code/kelhoussaini/data-challenges/04-Decision-Science/data/csv'

In [6]:
# Test your code below
import pandas as pd
pd.read_csv(os.path.join(csv_path, 'olist_sellers_dataset.csv')).head()

| Unnamed: 0 | seller_id | seller_zip_code_prefix | seller_city | seller_state |
|---|---|---|---|---|
| 0 | 3442f8959a84dea7ee197c632cb2df15 | 13023 | campinas | SP |
| 1 | d1b65fc7debc3361ea86b5f14c68d2e2 | 13844 | mogi guacu | SP |
| 2 | ce3ad9de960102d0677a81f5d0bb7b2d | 20031 | rio de janeiro | RJ |
| 3 | c0f3eea2e14555b6faeea3dd58c1b1c3 | 4195 | sao paulo | SP |
| 4 | 51a04a8a6bdcb23deccc82b0b80742cf | 12914 | braganca paulista | SP |


### 2. Create the list `file_names` containing all csv file names in the csv directory

- It should look like this `file_names = ['olist_sellers_dataset.csv', ....]`
- You can use `os.listdir()`
- Make sure it only lists csv files!
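
One way to filter is to test the end of each name rather than search for `.csv` anywhere inside it. The sketch below works on a hand-written list instead of a real folder; the name `sales.csv.bak` is made up to show the difference between the two tests.

```python
# Hand-written listing for illustration; 'sales.csv.bak' is a made-up name.
listing = ["olist_sellers_dataset.csv", ".DS_Store", ".gitkeep", "sales.csv.bak"]

# Substring test: also keeps names that merely contain ".csv".
contains_csv = [name for name in listing if ".csv" in name]

# Suffix test: keeps only names that really end with ".csv".
ends_with_csv = [name for name in listing if name.endswith(".csv")]

print(contains_csv)   # ['olist_sellers_dataset.csv', 'sales.csv.bak']
print(ends_with_csv)  # ['olist_sellers_dataset.csv']
```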

In [7]:
#csv_path  ---> '/Users/kenzaelhoussaini/code/kelhoussaini/data-challenges/04-Decision-Science/data/csv'

In [8]:
os.listdir(csv_path)

['olist_sellers_dataset.csv',
 '.DS_Store',
 'product_category_name_translation.csv',
 'olist_orders_dataset.csv',
 '.gitkeep',
 'olist_order_items_dataset.csv',
 'olist_customers_dataset.csv',
 'olist_geolocation_dataset.csv',
 'olist_order_payments_dataset.csv',
 'olist_order_reviews_dataset.csv',
 'olist_products_dataset.csv']

In [9]:
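# YOUR CODE HERE
# Keep only the files whose name ends with ".csv" (skips .DS_Store, .gitkeep, ...)
file_names = []
for file_name in os.listdir(csv_path):
    if file_name.endswith(".csv"):
        file_names.append(file_name)
file_names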

['olist_sellers_dataset.csv',
 'product_category_name_translation.csv',
 'olist_orders_dataset.csv',
 'olist_order_items_dataset.csv',
 'olist_customers_dataset.csv',
 'olist_geolocation_dataset.csv',
 'olist_order_payments_dataset.csv',
 'olist_order_reviews_dataset.csv',
 'olist_products_dataset.csv']

### 3. Create the list of dict keys `key_names`
Start from `file_names` and, for each name:
- Remove the suffix ".csv" when it exists
- Remove the suffix "_dataset.csv" when it exists
- Remove the prefix "olist_" when it exists

<details>
    <summary>Hint</summary>

- `.replace()`
    
- Strings (`str`) are sequences you can slice with `[ ]`
</details>
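
The slicing hint can be turned into small helpers that only touch the very start or end of a name. This is a sketch, not the expected solution: `strip_suffix`, `strip_prefix` and `to_key` are hypothetical names. Python 3.9 added `str.removeprefix` and `str.removesuffix` for the same purpose, but the test output further down shows this notebook running on Python 3.8.6, hence the slices.

```python
def strip_suffix(text, suffix):
    """Return text without suffix if text ends with it, else text unchanged."""
    return text[: -len(suffix)] if suffix and text.endswith(suffix) else text


def strip_prefix(text, prefix):
    """Return text without prefix if text starts with it, else text unchanged."""
    return text[len(prefix):] if text.startswith(prefix) else text


def to_key(file_name):
    # Order matters: try the longer suffix "_dataset.csv" before ".csv".
    key = strip_suffix(file_name, "_dataset.csv")
    key = strip_suffix(key, ".csv")
    return strip_prefix(key, "olist_")


print(to_key("olist_sellers_dataset.csv"))              # sellers
print(to_key("product_category_name_translation.csv"))  # product_category_name_translation
```

Unlike `.replace()`, these helpers leave a name untouched if the pattern appears in the middle of it.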

In [10]:
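# YOUR CODE HERE
# Note: .replace() removes the pattern wherever it appears, which is fine for these file names
key_names = [name.replace(".csv", "").replace("_dataset", "").replace("olist_", "") for name in file_names]
key_names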

['sellers',
 'product_category_name_translation',
 'orders',
 'order_items',
 'customers',
 'geolocation',
 'order_payments',
 'order_reviews',
 'products']

### 4. Construct the dictionary `data`

```python
data = { 
    'sellers': DataFrame1,
    'orders': DataFrame2,
    'order_items': DataFrame3,
    ...
    }
```

<details>
    <summary>Hint</summary>

The `zip()` function is very useful to iterate over two lists in parallel
```python
for (x, y) in zip(['a','b','c'], [1,2,3]):
    print(x,y)

# prints "a 1", then "b 2", then "c 3" (one pair per line)
```
</details>
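
Since `zip()` yields pairs, passing it to `dict()` builds a mapping in one step. The short sketch below uses two of the file names from this notebook, and also shows a pitfall: `zip()` stops at the shortest input, so lists of different lengths lose items without any error.

```python
keys = ["sellers", "orders"]
files = ["olist_sellers_dataset.csv", "olist_orders_dataset.csv"]

# Each (key, file) pair becomes one entry of the dict.
mapping = dict(zip(keys, files))
print(mapping)
# {'sellers': 'olist_sellers_dataset.csv', 'orders': 'olist_orders_dataset.csv'}

# zip stops at the shortest input: 'c' has no partner and is dropped.
pairs = list(zip(["a", "b", "c"], [1, 2]))
print(pairs)  # [('a', 1), ('b', 2)]
```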

In [11]:
# Map each short key to its csv file name
data_keys = {}
for key, file_name in zip(key_names, file_names):
    data_keys[key] = file_name
data_keys

{'sellers': 'olist_sellers_dataset.csv',
 'product_category_name_translation': 'product_category_name_translation.csv',
 'orders': 'olist_orders_dataset.csv',
 'order_items': 'olist_order_items_dataset.csv',
 'customers': 'olist_customers_dataset.csv',
 'geolocation': 'olist_geolocation_dataset.csv',
 'order_payments': 'olist_order_payments_dataset.csv',
 'order_reviews': 'olist_order_reviews_dataset.csv',
 'products': 'olist_products_dataset.csv'}

In [12]:
# Load each csv file into a DataFrame stored under its short key
data = {}
for key, file_name in data_keys.items():
    data[key] = pd.read_csv(os.path.join(csv_path, file_name))

### 5. Implement the method `get_data()` in `olist/data.py`

When called as shown below, it should return the dictionary `data`:

```python
from olist.data import Olist
Olist().get_data()
```
- Take time to understand what happens when `Olist().get_data()` is called
- Your method `get_data()` needs to be callable from various places (e.g. your terminal, this notebook, another notebook located elsewhere, etc.)
- You can't rely on a path relative to the current working directory this time, because `os.getcwd()` depends on where you run the code from
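
One common way to get a location that does not depend on the working directory is to start from the module's own `__file__`. The sketch below is one possible shape for `olist/data.py`, not the reference solution. It assumes the `olist` package sits directly in the project root, next to `data/csv`, as the paths printed in this notebook suggest. `project_csv_path` is a hypothetical helper name.

```python
import os


def project_csv_path(module_file):
    """Return '<project root>/data/csv' for a module at '<project root>/olist/data.py'."""
    package_dir = os.path.dirname(os.path.abspath(module_file))  # .../olist
    project_root = os.path.dirname(package_dir)                   # one level above olist
    return os.path.join(project_root, "data", "csv")


class Olist:
    def get_data(self):
        """Return a dict mapping short names (e.g. 'sellers') to DataFrames."""
        import pandas as pd  # imported here so the path helper needs no third-party package

        # __file__ is the path of this module, wherever the caller is located.
        csv_path = project_csv_path(__file__)
        file_names = sorted(f for f in os.listdir(csv_path) if f.endswith(".csv"))
        key_names = [
            f.replace(".csv", "").replace("_dataset", "").replace("olist_", "")
            for f in file_names
        ]
        return {
            key: pd.read_csv(os.path.join(csv_path, f))
            for key, f in zip(key_names, file_names)
        }


# Path check with a made-up module location; no file is read here.
example = project_csv_path(os.path.join("project", "olist", "data.py"))
print(example.endswith(os.path.join("project", "data", "csv")))  # True
```

Because the path is derived from the module file rather than from `os.getcwd()`, the same call should resolve to the same folder from a terminal, this notebook, or any other notebook, as long as the folder layout matches the assumption above.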

In [13]:
pathh = os.path.dirname(os.path.dirname(os.getcwd()))
pathh

'/Users/kenzaelhoussaini/code/kelhoussaini/data-challenges/04-Decision-Science'

In [14]:
import sys; sys.path

['/Users/kenzaelhoussaini/code/kelhoussaini/data-challenges/04-Decision-Science/01-Project-Setup/02-Data-Preparation',
 '/Users/kenzaelhoussaini/code/kelhoussaini/data-challenges/04-Decision-Science',
 '/Users/kenzaelhoussaini/code/kelhoussaini/data-challenges/04-Decision-Science/01-Project-Setup/02-Data-Preparation',
 '/Users/kenzaelhoussaini/.pyenv/versions/3.8.6/lib/python38.zip',
 '/Users/kenzaelhoussaini/.pyenv/versions/3.8.6/lib/python3.8',
 '/Users/kenzaelhoussaini/.pyenv/versions/3.8.6/lib/python3.8/lib-dynload',
 '',
 '/Users/kenzaelhoussaini/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages',
 '/Users/kenzaelhoussaini/code/kelhoussaini/mlproject',
 '/Users/kenzaelhoussaini/code/kelhoussaini/TFM_TrainAtScale',
 '/Users/kenzaelhoussaini/code/kelhoussaini/TFM_PredictInProd',
 '/Users/kenzaelhoussaini/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/IPython/extensions',
 '/Users/kenzaelhoussaini/.ipython']

In [15]:
sys.path.append(pathh)

### Test your code

In [210]:
from nbresult import ChallengeResult
from olist.data import Olist
data = Olist().get_data()
result = ChallengeResult('get_data',
    keys_len=len(data),
    keys=sorted(list(data.keys())),
    columns=sorted(list(data['sellers'].columns))
    )
result.write()
print(result.check())

platform darwin -- Python 3.8.6, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /Users/kenzaelhoussaini/.pyenv/versions/3.8.6/bin/python3
cachedir: .pytest_cache
rootdir: /Users/kenzaelhoussaini/code/kelhoussaini/data-challenges/04-Decision-Science/01-Project-Setup/02-Data-Preparation
plugins: dash-1.20.0, anyio-3.2.1
collecting ... collected 3 items

tests/test_get_data.py::TestGetData::test_columns PASSED                 [ 33%]
tests/test_get_data.py::TestGetData::test_keys PASSED                    [ 66%]
tests/test_get_data.py::TestGetData::test_len PASSED                     [100%]



💯 You can commit your code:

git add tests/get_data.pickle

git commit -m 'Completed get_data step'

git push origin master


In [1]:
!pwd

/Users/kenzaelhoussaini/code/kelhoussaini/data-challenges/04-Decision-Science/01-Project-Setup/02-Data-Preparation


In [6]:
from olist.data import Olist
Olist().__dict__

{}

In [5]:
Olist().get_data().keys()

/Users/kenzaelhoussaini/code/kelhoussaini/data-challenges/04-Decision-Science/01-Project-Setup/02-Data-Preparation
/Users/kenzaelhoussaini/code/kelhoussaini/data-challenges/04-Decision-Science
/Users/kenzaelhoussaini/code/kelhoussaini/data-challenges/04-Decision-Science/data/csv
olist_sellers_dataset.csv
.DS_Store
product_category_name_translation.csv
olist_orders_dataset.csv
.gitkeep
olist_order_items_dataset.csv
olist_customers_dataset.csv
olist_geolocation_dataset.csv
olist_order_payments_dataset.csv
olist_order_reviews_dataset.csv
olist_products_dataset.csv


dict_keys(['sellers', 'product_category_name_translation', 'orders', 'order_items', 'customers', 'geolocation', 'order_payments', 'order_reviews', 'products'])

In [5]:
from olist.data import Olist
Olist().get_data()['sellers'].head()


| Unnamed: 0 | seller_id | seller_zip_code_prefix | seller_city | seller_state |
|---|---|---|---|---|
| 0 | 3442f8959a84dea7ee197c632cb2df15 | 13023 | campinas | SP |
| 1 | d1b65fc7debc3361ea86b5f14c68d2e2 | 13844 | mogi guacu | SP |
| 2 | ce3ad9de960102d0677a81f5d0bb7b2d | 20031 | rio de janeiro | RJ |
| 3 | c0f3eea2e14555b6faeea3dd58c1b1c3 | 4195 | sao paulo | SP |
| 4 | 51a04a8a6bdcb23deccc82b0b80742cf | 12914 | braganca paulista | SP |


❓ This piece of code needs to work from anywhere on your machine, not only in this notebook.
- Open a new terminal
- Go to your home folder with `cd`
- Launch an `ipython` session
- Test the two lines of code above

🏁 If this works, congrats! Don't forget to commit and push this notebook as well as `olist/data.py`.