# Orders

✏️ **Exercise**

Today, we will investigate the **orders** and their associated review scores.

For that purpose, we will create one single data table containing **all our orders with some engineered statistics for them as additional columns.**

Our goal is to create the following DataFrame, which will come in very handy for our modeling phase:

  - `order_id` (_str) the id of the order_
  - `wait_time` (_float) the number of days between order_date and delivered_date_
  - `expected_wait_time` (_float) the number of days between order_date and estimated_delivery_date_
  - `delay_vs_expected` (_float) if the actual delivery date is later than the estimated delivery date, the number of days between the two dates, otherwise 0_
  - `order_status` (_str) the status of the order_
  - `dim_is_five_star` (_int) 1 if the order received a five_star, 0 otherwise_
  - `dim_is_one_star` (_int) 1 if the order received a one_star, 0 otherwise_
  - `review_score` (_int) from 1 to 5_
  - `number_of_products` (_int) number of products that the order contains_
  - `number_of_sellers` (_int) number of sellers involved in the order_
  - `price` (_float) total price of the order paid by customer_
  - `freight_value` (_float) value of the freight paid by customer_
  - (Optional) `distance_seller_customer` (_float) the distance in km between customer and seller_
  
We also want to filter out "non-delivered" orders, unless explicitly specified otherwise.
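
One possible way to express the "unless explicitly specified" rule is a boolean flag that defaults to delivered orders only. The sketch below is an assumption, not the reference solution: the helper name `filter_orders`, the flag name `is_delivered` and the toy rows are all hypothetical.

```python
import pandas as pd

# Toy stand-in for the orders table (illustrative values, not Olist data)
orders = pd.DataFrame({
    "order_id": ["a", "b", "c"],
    "order_status": ["delivered", "canceled", "delivered"],
})


def filter_orders(orders, is_delivered=True):
    """Keep only delivered orders by default; return every order otherwise."""
    if is_delivered:
        return orders[orders["order_status"] == "delivered"].copy()
    return orders.copy()


delivered_only = filter_orders(orders)
all_orders = filter_orders(orders, is_delivered=False)
```

Returning a copy in both branches keeps later column assignments from modifying the original table.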

❓ **Your challenge**: 

- Implement each feature as a separate method within the `Order` class available at `olist/order.py`
- Then, create a method `get_training_data()` that returns the complete DataFrame.

Suggested methodology:
- Use the notebook below to write and test your code step-by-step first
- Then copy the code into `order.py` once you are certain of your code logic
- Focus on the data manipulation logic now; we will analyse the dataset visually in the next challenges

<details>
    <summary>🔥 Notebook best practice (must read) </summary>

From now on, exploratory notebooks are going to get pretty long, and we strongly advise you to follow these notebook principles:
- Code your logic so that your Notebook can always be run from top to bottom without crashing (Cell --> Run All)
- Name your variables carefully 
- Use dummy names such as `tmp` or `_` for intermediary steps when you know you won't need them for long
- Clear your code and merge cells when relevant to minimize Notebook size (`Shift-M`)
- Hide your cell output if you don't need to see it anymore (double click on the red `Out[]:` section to the left of your cell).
- Make heavy use of the Jupyter nbextensions `Collapsible Headings` and `Table of Contents` (call a TA if you can't find them)
- Use the following shortcuts 
    - `a` to insert a cell above
    - `b` to insert a cell below
    - `dd` to delete a cell
    - `esc` and `arrows` to move between cells
    - `Shift-Enter` to execute cell and move focus to the next one
    - `Shift + Tab` when your cursor is between the brackets of a method, e.g. `groupby()`, to get the docs! Repeat a few times to open them permanently

</details>





In [0]:
# Auto-reload imported modules every time a Jupyter cell is executed (handy for olist/order.py updates)
%load_ext autoreload
%autoreload 2

In [0]:
# Import usual modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [0]:
# Import olist data
from olist.data import Olist
olist=Olist()
data=olist.get_data()
matching_table = olist.get_matching_table()

## Code `order.py`

In [0]:
orders = data['orders'].copy() # good practice to be sure not to modify your `data` variable

### get_wait_time
Return a dataframe with [order_id, wait_time, expected_wait_time, delay_vs_expected, order_status]

Hints:
- Don't forget to convert dates from "string" type to "pandas.datetime" using [`pandas.to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)
- Take time to understand what python [`datetime`](https://docs.python.org/3/library/datetime.html) objects are 
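
As a minimal sketch of these hints, the example below converts two timestamp strings (taken from the first row of the orders table) and turns their difference into a float number of days. The variable names are illustrative only.

```python
import pandas as pd

# One-row toy table using the same column names as the orders table
orders = pd.DataFrame({
    "order_purchase_timestamp": ["2017-10-02 10:56:33"],
    "order_delivered_customer_date": ["2017-10-10 21:25:13"],
})

# Strings become datetime64 values, so subtracting them yields timedeltas
purchase = pd.to_datetime(orders["order_purchase_timestamp"])
delivered = pd.to_datetime(orders["order_delivered_customer_date"])
delta = delivered - purchase

# Dividing by a one-day Timedelta gives a float number of days
wait_time = delta / pd.Timedelta(days=1)

# The .dt.days accessor keeps only the whole-day component
whole_days = delta.dt.days
```

Here `wait_time` is roughly 8.44 days while `whole_days` is 8, which is why the division is usually the more informative feature.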

In [0]:
# We give you the pseudo-code below for this first method:

# Inspect orders dataframe
# Filter dataframe on delivered orders
# handle datetime
# compute wait time
# compute expected wait time
# compute delay vs expected - carefully handle "negative" delays
# check new dataframe and copy code carefully to `olist/order.py`

In [0]:
orders = data['orders'].copy()
orders

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00
...,...,...,...,...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,39bd1228ee8140590ac3aca26f2dfe00,delivered,2017-03-09 09:54:05,2017-03-09 09:54:05,2017-03-10 11:18:03,2017-03-17 15:08:01,2017-03-28 00:00:00
99437,63943bddc261676b46f01ca7ac2f7bd8,1fca14ff2861355f6e5f14306ff977a7,delivered,2018-02-06 12:58:58,2018-02-06 13:10:37,2018-02-07 23:22:42,2018-02-28 17:37:56,2018-03-02 00:00:00
99438,83c1379a015df1e13d02aae0204711ab,1aa71eb042121263aafbe80c1b562c9c,delivered,2017-08-27 14:46:43,2017-08-27 15:04:16,2017-08-28 20:52:26,2017-09-21 11:24:17,2017-09-27 00:00:00
99439,11c177c8e97725db2631073c19f07b62,b331b74b18dc79bcdf6532d51e1637c1,delivered,2018-01-08 21:28:27,2018-01-08 21:36:21,2018-01-12 15:35:03,2018-01-25 23:32:54,2018-02-15 00:00:00


In [0]:
orders.isna().sum()

order_id                            0
customer_id                         0
order_status                        0
order_purchase_timestamp            0
order_approved_at                 160
order_delivered_carrier_date     1783
order_delivered_customer_date    2965
order_estimated_delivery_date       0
dtype: int64

In [0]:
# handle datetime
orders['order_delivered_customer_date'] = pd.to_datetime(orders['order_delivered_customer_date'])
orders['order_estimated_delivery_date'] = pd.to_datetime(orders['order_estimated_delivery_date'])
orders['order_purchase_timestamp'] = pd.to_datetime(orders['order_purchase_timestamp'])

In [0]:
orders['order_delivered_customer_date'] - orders['order_purchase_timestamp']

0        8 days 10:28:40
1       13 days 18:46:08
2        9 days 09:27:40
3       13 days 05:00:36
4        2 days 20:58:23
              ...       
99436    8 days 05:13:56
99437   22 days 04:38:58
99438   24 days 20:37:34
99439   17 days 02:04:27
99440    7 days 16:11:00
Length: 99441, dtype: timedelta64[ns]

In [0]:
# Compute just the number of days in each time_delta 
import datetime
one_day_delta = datetime.timedelta(days=1) # a "timedelta" object of 1 day
one_day_delta = np.timedelta64(24, 'h') # a "timedelta64" object of 1 day (use the one you prefer)

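# Compute wait time, delay vs expected and expected wait time, all in (fractional) days
# Note: at this stage delay_vs_expected is negative when the delivery is late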
orders.loc[:,'wait_time'] = \
    (orders['order_delivered_customer_date'] - orders['order_purchase_timestamp']) / one_day_delta

orders.loc[:,'delay_vs_expected'] = \
    (orders['order_estimated_delivery_date'] - orders['order_delivered_customer_date']) / one_day_delta

orders.loc[:,'expected_wait_time'] = \
    (orders['order_estimated_delivery_date'] - orders['order_purchase_timestamp']) / one_day_delta

In [0]:
# Alternative: the `.dt.days` accessor keeps only whole days and drops the fraction, so it is less precise
(orders['order_delivered_customer_date'] - orders['order_purchase_timestamp']).dt.days

0         8.0
1        13.0
2         9.0
3        13.0
4         2.0
         ... 
99436     8.0
99437    22.0
99438    24.0
99439    17.0
99440     7.0
Length: 99441, dtype: float64

In [0]:
# We could use pandas' built-in .clip method: flip the sign so late deliveries are positive, then floor at 0
# orders.loc[:,'delay_vs_expected'] = (-orders['delay_vs_expected']).clip(lower=0)

# Or write a custom function and apply it to the column (x < 0 means the order arrived late)
def handle_delay(x):
    if x < 0:
        return abs(x)
    else:
        return 0

orders.loc[:,'delay_vs_expected'] = orders['delay_vs_expected'].apply(handle_delay)
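
For comparison, here is a vectorized sketch of the same feature on toy dates. It assumes the delay is computed as delivered minus estimated, so that late deliveries are positive from the start; the dates are made up for illustration.

```python
import pandas as pd

# Toy dates: the first order arrives 5 days early, the second 2 days late
delivered = pd.to_datetime(pd.Series(["2018-01-05", "2018-01-12"]))
estimated = pd.to_datetime(pd.Series(["2018-01-10", "2018-01-10"]))

# Positive when the delivery is late, negative when it is early
days_late = (delivered - estimated) / pd.Timedelta(days=1)

# Early deliveries (negative values) are floored at 0
delay_vs_expected = days_late.clip(lower=0)
```

One difference worth knowing: `clip` leaves missing values as `NaN`, whereas `handle_delay` returns 0 for them because `NaN < 0` is false.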

In [0]:
orders[['order_id', 'wait_time', 'expected_wait_time', 'delay_vs_expected']]

Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected
0,e481f51cbdc54678b7cc49136f2d6af7,8.436574,15.544063,0.0
1,53cdb2fc8bc7dce0b6741e2150273451,13.782037,19.137766,0.0
2,47770eb9100c2d0c44946d9cf07ec65d,9.394213,26.639711,0.0
3,949d5b44dbf5de918fe9c16f97b45f8a,13.208750,26.188819,0.0
4,ad21c59c0840e6cb83a9ceb5573f8159,2.873877,12.112049,0.0
...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,8.218009,18.587442,0.0
99437,63943bddc261676b46f01ca7ac2f7bd8,22.193727,23.459051,0.0
99438,83c1379a015df1e13d02aae0204711ab,24.859421,30.384225,0.0
99439,11c177c8e97725db2631073c19f07b62,17.086424,37.105243,0.0


In [0]:
# Check your new dataframe and commit your code to olist/order.py when it works. 
# Test it below
from olist.order import Order
Order().get_wait_time()

Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected,order_status
0,e481f51cbdc54678b7cc49136f2d6af7,8.436574,15.544063,0.0,delivered
1,53cdb2fc8bc7dce0b6741e2150273451,13.782037,19.137766,0.0,delivered
2,47770eb9100c2d0c44946d9cf07ec65d,9.394213,26.639711,0.0,delivered
3,949d5b44dbf5de918fe9c16f97b45f8a,13.208750,26.188819,0.0,delivered
4,ad21c59c0840e6cb83a9ceb5573f8159,2.873877,12.112049,0.0,delivered
...,...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,8.218009,18.587442,0.0,delivered
99437,63943bddc261676b46f01ca7ac2f7bd8,22.193727,23.459051,0.0,delivered
99438,83c1379a015df1e13d02aae0204711ab,24.859421,30.384225,0.0,delivered
99439,11c177c8e97725db2631073c19f07b62,17.086424,37.105243,0.0,delivered


### get_review_score
     Returns a DataFrame with:
        order_id, dim_is_five_star, dim_is_one_star, review_score

In [0]:
reviews = data['order_reviews'].copy()
reviews

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53
...,...,...,...,...,...,...,...
99995,f3897127253a9592a73be9bdfdf4ed7a,22ec9f0669f784db00fa86d035cf8602,5,,,2017-12-09 00:00:00,2017-12-11 20:06:42
99996,b3de70c89b1510c4cd3d0649fd302472,55d4004744368f5571d1f590031933e4,5,,"Excelente mochila, entrega super rápida. Super...",2018-03-22 00:00:00,2018-03-23 09:10:43
99997,1adeb9d84d72fe4e337617733eb85149,7725825d039fc1f0ceb7635e3f7d9206,4,,,2018-07-01 00:00:00,2018-07-02 12:59:13
99998,be360f18f5df1e0541061c87021e6d93,f8bd3f2000c28c5342fedeb5e50f2e75,1,,Solicitei a compra de uma capa de retrovisor c...,2017-12-15 00:00:00,2017-12-16 01:29:43


In [0]:
# Fill in the functions below; the next cell applies them element-wise to the `review_score` Series
# to create the two requested columns

def dim_five_star(d):
    # $CHALLENGIFY_BEGIN
    if d == 5:
        return 1
    else:
        return 0
    # $CHALLENGIFY_END


def dim_one_star(d):
    # $CHALLENGIFY_BEGIN
    if d == 1:
        return 1
    else:
        return 0
    # $CHALLENGIFY_END

In [0]:
reviews["dim_is_five_star"] = reviews["review_score"].map(dim_five_star) # --> Series([0, 1, 1, 1, 1, ...])

reviews["dim_is_one_star"] = reviews["review_score"].map(dim_one_star) # --> Series([0, 0, 0, 0, 0, ...])
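
The same two columns can also be built without a Python-level function by comparing the whole Series at once. This is only an alternative sketch on toy scores, not a replacement for the functions the tests below expect.

```python
import pandas as pd

# Toy review scores
review_score = pd.Series([4, 5, 5, 1])

# A boolean comparison cast to int gives the 0/1 indicator directly
dim_is_five_star = (review_score == 5).astype(int)
dim_is_one_star = (review_score == 1).astype(int)
```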

In [0]:
reviews[["order_id", "dim_is_five_star", "dim_is_one_star", "review_score"]]

Unnamed: 0,order_id,dim_is_five_star,dim_is_one_star,review_score
0,73fc7af87114b39712e6da79b0a377eb,0,0,4
1,a548910a1c6147796b98fdf73dbeba33,1,0,5
2,f9e4b658b201a9f2ecdecbb34bed034b,1,0,5
3,658677c97b385a9be170737859d3511b,1,0,5
4,8e6bfb81e283fa7e4f11123a3fb894f1,1,0,5
...,...,...,...,...
99995,22ec9f0669f784db00fa86d035cf8602,1,0,5
99996,55d4004744368f5571d1f590031933e4,1,0,5
99997,7725825d039fc1f0ceb7635e3f7d9206,0,0,4
99998,f8bd3f2000c28c5342fedeb5e50f2e75,0,1,1


In [0]:
# Check your new dataframe and commit your code to olist/order.py when it works. 
# Test it below
from olist.order import Order
Order().get_review_score()

Unnamed: 0,order_id,dim_is_five_star,dim_is_one_star,review_score
0,73fc7af87114b39712e6da79b0a377eb,0,0,4
1,a548910a1c6147796b98fdf73dbeba33,1,0,5
2,f9e4b658b201a9f2ecdecbb34bed034b,1,0,5
3,658677c97b385a9be170737859d3511b,1,0,5
4,8e6bfb81e283fa7e4f11123a3fb894f1,1,0,5
...,...,...,...,...
99995,22ec9f0669f784db00fa86d035cf8602,1,0,5
99996,55d4004744368f5571d1f590031933e4,1,0,5
99997,7725825d039fc1f0ceb7635e3f7d9206,0,0,4
99998,f8bd3f2000c28c5342fedeb5e50f2e75,0,1,1


### Check your code

In [0]:
from nbresult import ChallengeResult

result = ChallengeResult('reviews',
    dim_five_star=dim_five_star(5),
    dim_not_five_star=dim_five_star(3),
    dim_one_star=dim_one_star(1),
    dim_not_one_star=dim_one_star(2)
)
result.write()
print(result.check())

platform darwin -- Python 3.8.6, pytest-6.2.1, py-1.10.0, pluggy-0.13.1 -- /Users/krokrob/.pyenv/versions/3.8.6/envs/lewagon386/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/krokrob/code/lewagon/data-solutions/04-Decision-Science/02-Statistical-Inference/01-Orders
plugins: anyio-2.0.2
collecting ... collected 4 items

tests/test_reviews.py::TestReviews::test_dim_five_star PASSED            [ 25%]
tests/test_reviews.py::TestReviews::test_dim_not_five_star PASSED        [ 50%]
tests/test_reviews.py::TestReviews::test_dim_not_one_star PASSED         [ 75%]
tests/test_reviews.py::TestReviews::test_dim_one_star PASSED             [100%]



💯 You can commit your code:

git add tests/reviews.pickle

git commit -m 'Completed reviews step'

git push origin master


### get_number_products:
     Returns a DataFrame with:
        order_id, number_of_products (total number of products per order)

In [0]:
data["order_items"].groupby("order_id").count()\
.rename(columns={"order_item_id": "number_of_products"})\
.sort_values("number_of_products")[['number_of_products']]

In [0]:
# Check your new dataframe and commit your code to olist/order.py when it works. 
Order().get_number_products()

### get_number_sellers:
     Returns a DataFrame with:
        order_id, number_of_sellers (total number of unique sellers per order)

<details>
    <summary>Hint</summary>

`pd.Series.nunique()`
</details>
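
To see why `nunique()` is the right tool here, the toy sketch below contrasts it with `count()` on an order whose three items come from only two sellers. The rows are invented for illustration.

```python
import pandas as pd

# Toy order items: order "a" has 3 items from 2 distinct sellers
order_items = pd.DataFrame({
    "order_id": ["a", "a", "a", "b"],
    "seller_id": ["s1", "s1", "s2", "s3"],
})

by_order = order_items.groupby("order_id")["seller_id"]
n_rows = by_order.count()       # counts every item row
n_sellers = by_order.nunique()  # counts distinct sellers only
```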

In [0]:
sellers = \
    data['order_items']\
    .groupby('order_id')['seller_id'].nunique().reset_index()

sellers.columns = ['order_id', 'number_of_sellers']
sellers.sort_values('number_of_sellers')

In [0]:
# Check your new dataframe and commit your code to olist/order.py when it works. 
Order().get_number_sellers()

### get_price_and_freight
     Returns a DataFrame with:
        order_id, price, freight_value

<details>
    <summary>Hint</summary>

`DataFrameGroupBy.agg()` (the `.agg()` method of a groupby object) lets you apply one aggregation per column
</details>
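
As a sketch of per-column aggregation, the example below uses pandas named aggregation (available since pandas 0.25) on toy rows. The numbers are invented and the variable name `totals` is illustrative.

```python
import pandas as pd

# Toy order items with per-item price and freight
order_items = pd.DataFrame({
    "order_id": ["a", "a", "b"],
    "price": [10.0, 20.0, 5.0],
    "freight_value": [1.5, 2.5, 1.0],
})

# One (column, aggregation) pair per output column
totals = (
    order_items.groupby("order_id")
    .agg(price=("price", "sum"), freight_value=("freight_value", "sum"))
    .reset_index()
)
```

The dictionary form `agg({'price': 'sum', 'freight_value': 'sum'})` gives the same result; named aggregation mainly helps when the output column names should differ from the input ones.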

In [0]:
price_freight = \
    data['order_items']\
    .groupby('order_id',
             as_index=False).agg({'price': 'sum',
                                  'freight_value': 'sum'})
price_freight

In [0]:
# Check your new dataframe and commit your code to olist/order.py when it works. 
Order().get_price_and_freight()

### get_distance_seller_customer (OPTIONAL - Try only after finishing today's challenges - Skip to next section)
[order_id, distance_seller_customer] (the distance in km between customer and seller)

💡Have a look at the `haversine_distance` formula we coded for you in the `olist.utils` module
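
For intuition, here is a standalone sketch of the haversine formula. It is not the code from `olist.utils`; the function name `haversine_km`, the Earth radius of 6371 km and the (lon, lat, lon, lat) argument order are assumptions chosen to mirror how `haversine_distance` is called further down.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance in km between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    # Haversine of the central angle between the two points
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


# One degree of latitude along a meridian is roughly 111 km
one_degree_lat = haversine_km(0.0, 0.0, 0.0, 1.0)
```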

In [0]:
# Select sellers and customers 
sellers = data['sellers']
sellers.head(2)

In [0]:
customers = data['customers']
customers.head(2)

In [0]:
# Select geo dataset
geo = data['geolocation'].sort_values(by='geolocation_zip_code_prefix')
geo.head(5)

In [0]:
# Warning: Since one zipcode can map to multiple [lat, lng], we take the first one
geo = geo.groupby('geolocation_zip_code_prefix', as_index=False).first()

In [0]:
# merge geo_location for sellers
sellers_mask_columns = ['seller_id', 'seller_zip_code_prefix', 'seller_city', 
                        'seller_state', 'geolocation_lat', 'geolocation_lng']
sellers_geo = sellers.merge(geo,
                            how='left',
                            left_on='seller_zip_code_prefix',
                            right_on='geolocation_zip_code_prefix')[sellers_mask_columns]
sellers_geo.head(2)

In [0]:
#merge geo_location for customers
customers_mask_columns = ['customer_id', 'customer_zip_code_prefix', 'customer_city', 
                          'customer_state', 'geolocation_lat', 'geolocation_lng']
customers_geo = customers.merge(geo,
                            how='left',
                            left_on='customer_zip_code_prefix',
                            right_on='geolocation_zip_code_prefix')[customers_mask_columns]
customers_geo.head(2)

In [0]:
# Inspect the matching table before merging
matching_table.head(2)

In [0]:
# use the matching table to merge sellers 
matching_geo = matching_table.merge(sellers_geo, on='seller_id')
matching_geo.head(2)

In [0]:
matching_geo = matching_geo.merge(customers_geo, on='customer_id', suffixes=('_seller', '_customer'))

In [0]:
#check that shape is correct
matching_geo.shape

In [0]:
# Any missing values?
matching_geo.info()

_Some seller and customer geolocation values are null because of our left outer joins. Let's remove those rows with `dropna()`._

In [0]:
# Drop rows with missing values
matching_geo = matching_geo.dropna()
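
Before dropping rows, it can help to count how many zip codes failed to match. The sketch below uses the `indicator=True` option of `merge` on toy tables; the rows are invented, and restricting `dropna` to the coordinate columns is a design choice of this example rather than what the cell above does.

```python
import pandas as pd

# Toy sellers: the second zip code has no entry in the geolocation table
sellers = pd.DataFrame({
    "seller_id": ["s1", "s2"],
    "seller_zip_code_prefix": [1000, 9999],
})
geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1000],
    "geolocation_lat": [-23.5],
    "geolocation_lng": [-46.6],
})

merged = sellers.merge(
    geo,
    how="left",
    left_on="seller_zip_code_prefix",
    right_on="geolocation_zip_code_prefix",
    indicator=True,  # adds a `_merge` column: 'both' or 'left_only'
)

# Rows that found no match on the right-hand side
n_unmatched = int((merged["_merge"] == "left_only").sum())

# Drop only the rows that lack coordinates
cleaned = merged.dropna(subset=["geolocation_lat", "geolocation_lng"])
```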

In [0]:
# Add the distance between seller and customers using our utils function
from olist.utils import haversine_distance
print(haversine_distance.__doc__)

In [0]:
matching_geo["distance_seller_customer"] = matching_geo.apply(
    lambda row: haversine_distance(
        row["geolocation_lng_seller"],
        row["geolocation_lat_seller"],
        row["geolocation_lng_customer"],
        row["geolocation_lat_customer"],
    ),
    axis=1,
)

In [0]:
matching_geo.info()

In [0]:
# Check that the distance is roughly accurate by sampling 3 cases (really, check it on Google Maps!)
matching_geo.sample(3)[['seller_city', 
                        'customer_city', 
                        'distance_seller_customer']]

In [0]:
# A quick distribution plot for the road!
sns.histplot(matching_geo['distance_seller_customer'], kde=True)  # `distplot` is deprecated in recent seaborn versions

In [0]:
# Check summary statistics (the 50% row is the median distance)
matching_geo['distance_seller_customer'].describe()

In [0]:
# Since an order can have multiple sellers, return the average distance per order
mean_order_distance = matching_geo.groupby("order_id", as_index=False).agg(
    {"distance_seller_customer": "mean"}
)
mean_order_distance

In [0]:
mean_order_distance.describe()

In [0]:
# Check your new dataframe and commit your code to olist/order.py when it works. 
Order().get_distance_seller_customer()

# Test your newly coded module

❓ Time to code `get_training_data`, making use of your previously coded methods.
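
One possible shape for this method is to merge the per-feature DataFrames on `order_id` one after another. The sketch below uses toy frames in place of the real method outputs; the inner joins and the final `dropna()` are assumptions of this example, not the confirmed reference implementation.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the outputs of the feature methods
wait_time = pd.DataFrame({"order_id": ["a", "b"], "wait_time": [8.4, 13.8]})
review = pd.DataFrame({"order_id": ["a", "b"], "review_score": [4, 5]})
price = pd.DataFrame({"order_id": ["a"], "price": [29.99]})

features = [wait_time, review, price]

# Chain inner joins on order_id: only orders present in every frame survive
training = reduce(
    lambda left, right: left.merge(right, on="order_id"),
    features,
)

# Remove any row that still has a missing feature
training = training.dropna()
```

With inner joins, order "b" disappears because it has no price row, so it is worth checking the final shape against the number of delivered orders.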

In [0]:
from olist.order import Order
from nbresult import ChallengeResult
data = Order().get_training_data()
result = ChallengeResult('training',
    shape=data.shape,
    columns=sorted(list(data.columns))
)
result.write()
print(result.check())

platform darwin -- Python 3.8.6, pytest-6.2.1, py-1.10.0, pluggy-0.13.1 -- /Users/krokrob/.pyenv/versions/3.8.6/envs/lewagon386/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/krokrob/code/lewagon/data-solutions/04-Decision-Science/02-Statistical-Inference/01-Orders
plugins: anyio-2.0.2
collecting ... collected 2 items

tests/test_training.py::TestTraining::test_training_data_columns PASSED  [ 50%]
tests/test_training.py::TestTraining::test_training_data_shape PASSED    [100%]



💯 You can commit your code:

git add tests/training.pickle

git commit -m 'Completed training step'

git push origin master


🏁 Congratulations! Commit and push your notebook before starting the next challenge.