# Seller( ) Class 

**Notebook Objective**

This notebook will be used to test and implement methods as part of a custom **Seller( )** class based on data provided by the Brazilian e-commerce platform **Olist**. 

Our final method `get_training_data()` will create a single DataFrame with **all unique sellers as index and all properties of these sellers as columns**, which should make it easier to build models and perform analysis.

In [1]:
# Auto reload imported modules every time a cell is executed
%load_ext autoreload
%autoreload 2

In [2]:
# Import usual modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import paths

In [3]:
# Import Olist data and Order() class
from olist.data import Olist
from olist.order import Order
olist = Olist()
data = olist.get_data()
matching_table = olist.get_matching_table()

## Method Implementation for Seller( ) Class

Let's implement methods to help prepare seller-related data for statistical modeling and analysis.

### get_seller_features( )

Here we'll implement a method that returns a DataFrame with **`seller_id`, `seller_city`** and **`seller_state`**.

In [4]:
# Make copy and inspect sellers data
sellers = data['sellers'].copy()
sellers.drop('seller_zip_code_prefix', axis=1, inplace=True)
sellers.drop_duplicates(inplace=True)
sellers.head()

Unnamed: 0,seller_id,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,braganca paulista,SP
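
The cell above could be wrapped into a small function along the following lines. This is only a sketch: the standalone function signature and the `toy_sellers` table (with made-up values) are assumptions for illustration, not the actual `Seller` implementation.

```python
import pandas as pd

def get_seller_features(sellers: pd.DataFrame) -> pd.DataFrame:
    """Return one row per seller with seller_id, seller_city and seller_state."""
    return (
        sellers[["seller_id", "seller_city", "seller_state"]]
        .drop_duplicates()
        .reset_index(drop=True)
    )

# Toy stand-in for data['sellers'] (values are made up for illustration)
toy_sellers = pd.DataFrame({
    "seller_id": ["a", "b", "b"],
    "seller_zip_code_prefix": [13023, 13844, 13844],
    "seller_city": ["campinas", "mogi guacu", "mogi guacu"],
    "seller_state": ["SP", "SP", "SP"],
})

toy_seller_features = get_seller_features(toy_sellers)
print(toy_seller_features)
```

Selecting the three columns explicitly, rather than dropping the zip code in place, keeps the input table untouched.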


### get_seller_delay_wait_time( )

Here we'll implement a method that returns a DataFrame with **`seller_id`, `delay_to_carrier`** and **`seller_wait_time`**.

In [5]:
# Get data from only those orders that have been 'delivered'
order_items = data['order_items'].copy()
orders = data['orders'].query("order_status=='delivered'").copy()

ship = order_items.merge(orders, on='order_id')

# Handle datetime conversion (plain column assignment so the dtype actually changes)
date_columns = ['shipping_limit_date', 'order_delivered_carrier_date',
                'order_delivered_customer_date', 'order_purchase_timestamp']
for col in date_columns:
    ship[col] = pd.to_datetime(ship[col])

In [6]:
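# Compute delay and wait_time
def delay_to_logistic_partner(d):
    # Mean of (shipping limit - carrier handover) in days; negative means late on average
    days = np.mean(
            (d.shipping_limit_date - d.order_delivered_carrier_date)/np.timedelta64(24, 'h')
            )
    # Report lateness as a positive number of days, 0 if on time or early on average
    if days < 0:
        return abs(days)
    else:
        return 0

def order_wait_time(d):
    # Mean number of days between purchase and delivery to the customer
    days = np.mean(
            (d.order_delivered_customer_date - d.order_purchase_timestamp)/np.timedelta64(24, 'h')
            )
    return days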

In [7]:
delay = ship.groupby('seller_id')\
        .apply(delay_to_logistic_partner)\
        .reset_index()

delay.columns = ['seller_id', 'delay_to_carrier']
delay.head()

Unnamed: 0,seller_id,delay_to_carrier
0,0015a82c2db000af6aaaf3ae2ecb0532,0.0
1,001cca7ae9ae17fb1caed9dfb1094831,0.0
2,002100f778ceb8431b7a1020ff7ab48f,0.0
3,003554e2dce176b5555353e4f3555ac8,0.0
4,004c9cd9d87a3c30c522c48c4fc07416,0.0


In [8]:
wait = ship.groupby('seller_id')\
           .apply(order_wait_time)\
           .reset_index()
           
wait.columns = ['seller_id', 'seller_wait_time']
wait.head()

Unnamed: 0,seller_id,seller_wait_time
0,0015a82c2db000af6aaaf3ae2ecb0532,10.793885
1,001cca7ae9ae17fb1caed9dfb1094831,13.096632
2,002100f778ceb8431b7a1020ff7ab48f,16.192371
3,003554e2dce176b5555353e4f3555ac8,4.646806
4,004c9cd9d87a3c30c522c48c4fc07416,14.430364


In [9]:
df = delay.merge(wait, on = 'seller_id')

df.head()

Unnamed: 0,seller_id,delay_to_carrier,seller_wait_time
0,0015a82c2db000af6aaaf3ae2ecb0532,0.0,10.793885
1,001cca7ae9ae17fb1caed9dfb1094831,0.0,13.096632
2,002100f778ceb8431b7a1020ff7ab48f,0.0,16.192371
3,003554e2dce176b5555353e4f3555ac8,0.0,4.646806
4,004c9cd9d87a3c30c522c48c4fc07416,0.0,14.430364
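
The same two features can also be computed without `groupby().apply()`, by building per-row day counts first and aggregating afterwards. The sketch below runs on a made-up `toy_ship` table; the intermediate column names `days_late` and `wait_days` are assumptions introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for the `ship` table (dates are made up)
toy_ship = pd.DataFrame({
    "seller_id": ["a", "a", "b"],
    "shipping_limit_date": pd.to_datetime(["2018-01-05", "2018-01-10", "2018-02-01"]),
    "order_delivered_carrier_date": pd.to_datetime(["2018-01-04", "2018-01-09", "2018-02-03"]),
    "order_purchase_timestamp": pd.to_datetime(["2018-01-01", "2018-01-06", "2018-01-28"]),
    "order_delivered_customer_date": pd.to_datetime(["2018-01-11", "2018-01-12", "2018-02-07"]),
})

one_day = pd.Timedelta(days=1)

# Per-row durations in days (positive days_late means handed over after the limit)
toy_ship["days_late"] = (
    toy_ship["order_delivered_carrier_date"] - toy_ship["shipping_limit_date"]
) / one_day
toy_ship["wait_days"] = (
    toy_ship["order_delivered_customer_date"] - toy_ship["order_purchase_timestamp"]
) / one_day

# Average per seller
toy_delay_wait = toy_ship.groupby("seller_id", as_index=False).agg(
    delay_to_carrier=("days_late", "mean"),
    seller_wait_time=("wait_days", "mean"),
)

# Same rule as delay_to_logistic_partner: a seller that is early on average gets 0
toy_delay_wait["delay_to_carrier"] = toy_delay_wait["delay_to_carrier"].clip(lower=0)
print(toy_delay_wait)
```

Because the mean is taken before clipping, early shipments still offset late ones for a given seller, exactly as in the `apply`-based version.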


### get_active_dates( )

Here we'll implement a method that returns a DataFrame with **`seller_id`, `date_first_sale`, `date_last_sale`** and **`months_on_olist`**.

In [10]:
# Create two new columns in view of aggregating
orders['date_first_sale'] = pd.to_datetime(orders['order_approved_at'])
orders['date_last_sale'] = orders['date_first_sale']

df = orders.merge(matching_table[['seller_id', 'order_id']], on="order_id")\
           .groupby('seller_id')\
           .agg({
            "date_first_sale": "min",
            "date_last_sale": "max"
        })

# Approximate months as days / 30.44 (average month length); newer pandas versions
# reject np.timedelta64(1, 'M') because a month is not a fixed duration
df['months_on_olist'] = ((df['date_last_sale'] - df['date_first_sale']) / np.timedelta64(1, 'D') / 30.44).round()

df.head()

seller_id,date_first_sale,date_last_sale,months_on_olist
0015a82c2db000af6aaaf3ae2ecb0532,2017-09-27 22:24:16,2017-10-18 23:56:20,1.0
001cca7ae9ae17fb1caed9dfb1094831,2017-02-04 19:15:39,2018-07-12 21:50:17,17.0
002100f778ceb8431b7a1020ff7ab48f,2017-09-14 01:10:15,2018-04-12 13:11:45,7.0
003554e2dce176b5555353e4f3555ac8,2017-12-15 07:11:03,2017-12-15 07:11:03,0.0
004c9cd9d87a3c30c522c48c4fc07416,2017-01-28 02:32:27,2018-05-05 10:15:17,15.0
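
As a small self-contained illustration of the first/last sale logic, the sketch below uses pandas named aggregation on a made-up `toy_orders` table. The 30.44 days-per-month constant is an assumed average month length, so the month count is an approximation rather than a calendar-exact value.

```python
import pandas as pd

# Toy stand-in for orders already joined to their seller (timestamps are made up)
toy_orders = pd.DataFrame({
    "seller_id": ["a", "a", "b"],
    "order_approved_at": [
        "2017-01-01 10:00:00",
        "2017-04-02 10:00:00",
        "2018-03-05 08:30:00",
    ],
})
toy_orders["order_approved_at"] = pd.to_datetime(toy_orders["order_approved_at"])

# First and last approved order per seller
toy_active = toy_orders.groupby("seller_id").agg(
    date_first_sale=("order_approved_at", "min"),
    date_last_sale=("order_approved_at", "max"),
)

# Assumption: an average month lasts 30.44 days
avg_days_per_month = 30.44
active_days = (toy_active["date_last_sale"] - toy_active["date_first_sale"]).dt.days
toy_active["months_on_olist"] = (active_days / avg_days_per_month).round()
print(toy_active)
```

A seller with a single order gets `months_on_olist` equal to 0, since the first and last sale coincide.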


### get_review_score( )

Here we'll implement a method that returns a DataFrame with **`seller_id`, `share_of_five_stars`, `share_of_one_stars`, `seller_review_score`** and **`review_cost_per_seller`**.

In [11]:
order = Order()
order_reviews = order.get_review_score().copy()

# Attach each order's review to its seller (one row per order/seller pair)
matching_table2 = matching_table[['order_id', 'seller_id']].drop_duplicates().copy()
df2 = matching_table2.merge(order_reviews, on='order_id')

# Seller-level aggregation (commented out: df2 is used at order level in later cells)
# df2 = df2.groupby('seller_id',
#                   as_index=False).agg({'dim_is_one_star': 'mean',
#                                        'dim_is_five_star': 'mean',
#                                        'review_score': 'mean'})
# df2.columns = ['seller_id', 'share_of_one_stars', 'share_of_five_stars', 'seller_review_score']
# df2.head()

In [12]:
df2.head()

Unnamed: 0,order_id,seller_id,dim_is_five_star,dim_is_one_star,review_score
0,e481f51cbdc54678b7cc49136f2d6af7,3504c0cb71d7fa48d967e0e4c94d59d9,0,0,4
1,53cdb2fc8bc7dce0b6741e2150273451,289cdb325fb7e7f891c38608bf9e0962,0,0,4
2,47770eb9100c2d0c44946d9cf07ec65d,4869f7a5dfa277a7dca6462dcf3b52b2,1,0,5
3,949d5b44dbf5de918fe9c16f97b45f8a,66922902710d126a0e7d26b0e3805106,1,0,5
4,ad21c59c0840e6cb83a9ceb5573f8159,2c9e548be18521d1c43cde1c582c6de8,1,0,5
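
To go from the order-level table to one row per seller, the review columns can be averaged per `seller_id`. The sketch below does this on a made-up `toy_reviews` table using named aggregation, so the output columns get their final names directly. `review_cost_per_seller` is deliberately left out because its definition does not appear in this notebook.

```python
import pandas as pd

# Toy stand-in for the order-level table (values are made up)
toy_reviews = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "seller_id": ["a", "a", "a", "b"],
    "dim_is_five_star": [1, 0, 1, 0],
    "dim_is_one_star": [0, 1, 0, 0],
    "review_score": [5, 1, 5, 4],
})

# The mean of a 0/1 flag is the share of orders carrying that flag
toy_review_scores = toy_reviews.groupby("seller_id", as_index=False).agg(
    share_of_five_stars=("dim_is_five_star", "mean"),
    share_of_one_stars=("dim_is_one_star", "mean"),
    seller_review_score=("review_score", "mean"),
)
print(toy_review_scores)
```

Naming the outputs inside `agg` avoids relying on column order when renaming afterwards.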


In [13]:
#df2.to_csv('/Users/atat/Downloads/My Tableau Repository/Datasources/2021.3/OLIST DATA - TABLEAU/csv/orders_sellers_ratings.csv')

### get_quantity( )

Here we'll implement a method that returns a DataFrame with **`seller_id`, `n_orders`, `quantity`** and **`quantity_per_order`**.

In [14]:
matching_table3 = matching_table.copy()

n_orders = matching_table3.groupby('seller_id')['order_id']\
            .nunique()\
            .reset_index()
n_orders.columns = ['seller_id', 'n_orders']

quantity = matching_table3.groupby('seller_id', as_index=False).agg({'order_id': 'count'})
quantity.columns = ['seller_id', 'quantity']
        
result = n_orders.merge(quantity, on='seller_id')
result['quantity_per_order'] = result['quantity'] / result['n_orders']

result.head()

Unnamed: 0,seller_id,n_orders,quantity,quantity_per_order
0,0015a82c2db000af6aaaf3ae2ecb0532,3,3,1.0
1,001cca7ae9ae17fb1caed9dfb1094831,200,239,1.195
2,001e6ad469a905060d959994f1b41e4f,1,1,1.0
3,002100f778ceb8431b7a1020ff7ab48f,51,56,1.098039
4,003554e2dce176b5555353e4f3555ac8,1,1,1.0
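
The two group-bys and the merge above can be collapsed into a single aggregation. The sketch below shows this on a made-up `toy_matching` table, assuming (as the cell above does) that the matching table holds one row per item sold, so counting rows gives the quantity.

```python
import pandas as pd

# Toy stand-in for the matching table: one row per item sold (values are made up)
toy_matching = pd.DataFrame({
    "seller_id": ["a", "a", "a", "b"],
    "order_id": ["o1", "o1", "o2", "o3"],
})

toy_quantity = toy_matching.groupby("seller_id", as_index=False).agg(
    n_orders=("order_id", "nunique"),  # distinct orders per seller
    quantity=("order_id", "count"),    # rows, i.e. items, per seller
)
toy_quantity["quantity_per_order"] = toy_quantity["quantity"] / toy_quantity["n_orders"]
print(toy_quantity)
```

Here seller `a` sold three items across two orders, giving a `quantity_per_order` of 1.5.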


### get_sales( )

Here we'll implement a method that returns a DataFrame with **`seller_id`** and **`sales`**.

In [15]:
df3 = data['order_items'][['seller_id', 'price']]\
            .groupby('seller_id')\
            .sum()\
            .rename(columns={'price': 'sales'})

df3.head()

seller_id,sales
0015a82c2db000af6aaaf3ae2ecb0532,2685.0
001cca7ae9ae17fb1caed9dfb1094831,25080.03
001e6ad469a905060d959994f1b41e4f,250.0
002100f778ceb8431b7a1020ff7ab48f,1234.5
003554e2dce176b5555353e4f3555ac8,120.0


In [18]:
ratings = df2.merge(result, on='seller_id')
ratings.head()

Unnamed: 0,order_id,seller_id,dim_is_five_star,dim_is_one_star,review_score,n_orders,quantity,quantity_per_order
0,e481f51cbdc54678b7cc49136f2d6af7,3504c0cb71d7fa48d967e0e4c94d59d9,0,0,4,53,53,1.0
1,8736140c61ea584cb4250074756d8f3b,3504c0cb71d7fa48d967e0e4c94d59d9,1,0,5,53,53,1.0
2,a0151737f2f0c6c0a5fd69d45f66ceea,3504c0cb71d7fa48d967e0e4c94d59d9,0,0,4,53,53,1.0
3,a3bf941183211246f0d42ad757cba127,3504c0cb71d7fa48d967e0e4c94d59d9,0,0,4,53,53,1.0
4,1462290799412b71be32dd880eaf4e1b,3504c0cb71d7fa48d967e0e4c94d59d9,0,0,4,53,53,1.0


In [19]:
ratings = ratings.merge(df3, on='seller_id')
ratings.head()

Unnamed: 0,order_id,seller_id,dim_is_five_star,dim_is_one_star,review_score,n_orders,quantity,quantity_per_order,sales
0,e481f51cbdc54678b7cc49136f2d6af7,3504c0cb71d7fa48d967e0e4c94d59d9,0,0,4,53,53,1.0,2349.94
1,8736140c61ea584cb4250074756d8f3b,3504c0cb71d7fa48d967e0e4c94d59d9,1,0,5,53,53,1.0,2349.94
2,a0151737f2f0c6c0a5fd69d45f66ceea,3504c0cb71d7fa48d967e0e4c94d59d9,0,0,4,53,53,1.0,2349.94
3,a3bf941183211246f0d42ad757cba127,3504c0cb71d7fa48d967e0e4c94d59d9,0,0,4,53,53,1.0,2349.94
4,1462290799412b71be32dd880eaf4e1b,3504c0cb71d7fa48d967e0e4c94d59d9,0,0,4,53,53,1.0,2349.94
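
Merging seller-level columns onto an order-level table repeats each seller's values on every one of its orders, as the output above shows. A minimal sketch of that pattern on made-up tables follows; the `validate="many_to_one"` argument is an optional safeguard added here, not something the cells above use, and it makes pandas raise an error if the seller-level table unexpectedly has duplicate `seller_id` values.

```python
import pandas as pd

# Toy order-level and seller-level tables (values are made up)
toy_order_level = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "seller_id": ["a", "a", "b"],
    "review_score": [4, 5, 3],
})
toy_seller_level = pd.DataFrame({
    "seller_id": ["a", "b"],
    "sales": [150.0, 80.0],
})

# Each order row picks up its seller's total sales
toy_ratings = toy_order_level.merge(
    toy_seller_level,
    on="seller_id",
    how="inner",
    validate="many_to_one",
)
print(toy_ratings)
```

Since `sales` is a seller total, summing it over order rows would count it once per order, so it should only be read per seller.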


In [23]:
ratings.to_csv('/Users/atat/Downloads/My Tableau Repository/Datasources/2021.3/OLIST DATA - TABLEAU/csv/orders_sellers_ratings.csv')

## get_training_data( )

When `get_training_data()` is called on an instance of **Seller( )**, it will use the methods defined above to create a DataFrame with all unique sellers as index and the following columns: **`seller_id`, `seller_city`, `seller_state`, `delay_to_carrier`, `seller_wait_time`, `share_of_five_stars`, `share_of_one_stars`, `seller_review_score`, `n_orders`, `quantity`, `quantity_per_order`, `date_first_sale`, `date_last_sale`, `sales`** and **`review_cost_per_seller`**.

In [None]:
from olist.seller import Seller 
Seller().get_training_data().head()
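
The code of `olist.seller` is not shown in this notebook, so the following is only a sketch of one way the per-method DataFrames could be combined: successive merges on `seller_id` with `functools.reduce`. The three toy tables and their values are made up, and the real `get_training_data()` may be implemented differently.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the outputs of the individual methods (values are made up)
toy_features = pd.DataFrame({"seller_id": ["a", "b"], "seller_state": ["SP", "RJ"]})
toy_n_orders = pd.DataFrame({"seller_id": ["a", "b"], "n_orders": [2, 1]})
toy_sales = pd.DataFrame({"seller_id": ["a", "b"], "sales": [150.0, 80.0]})

frames = [toy_features, toy_n_orders, toy_sales]

# Merge the tables one after another on seller_id (inner join by default)
toy_training = reduce(
    lambda left, right: left.merge(right, on="seller_id"),
    frames,
)
print(toy_training)
```

With inner joins, a seller missing from any one table is dropped from the result; passing `how="left"` to `merge` would keep every seller from the first table instead.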