# Sellers Data Analysis

Our analysis of the orders dataset revealed that `wait_time` was one of the more significant factors in explaining `review_score`, but the low R-squared suggested that other factors lie outside the orders dataset.

Here we'll look at the sellers dataset to uncover which elements are associated with `review_score`.

If poor reviews are linked to particular sellers, perhaps we can identify the sellers who repeatedly perform poorly and try to understand why. 

We can then use this to respond to the CEO's request to improve Olist's profit margins.

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from olist.data import Olist
from olist.order import Order

## Inspect Features

In [10]:
from olist.seller import Seller
seller = Seller()
sellers = seller.get_training_data()
sellers.head()

Unnamed: 0,seller_id,seller_city,seller_state,delay_to_carrier,seller_wait_time,date_first_sale,date_last_sale,months_on_olist,share_of_one_stars,share_of_five_stars,seller_review_score,review_cost_per_seller,n_orders,quantity,quantity_per_order,sales
0,3442f8959a84dea7ee197c632cb2df15,campinas,SP,0.0,13.018588,2017-05-05 16:25:11,2017-08-30 12:50:19,4.0,0.333333,0.333333,3.0,140,3,3,1.0,218.7
1,d1b65fc7debc3361ea86b5f14c68d2e2,mogi guacu,SP,0.0,9.065716,2017-03-29 02:10:34,2018-06-06 20:15:21,14.0,0.05,0.725,4.55,240,40,41,1.025,11703.07
2,ce3ad9de960102d0677a81f5d0bb7b2d,rio de janeiro,RJ,0.0,4.042292,2018-07-30 12:44:49,2018-07-30 12:44:49,0.0,0.0,1.0,5.0,0,1,1,1.0,158.0
3,c0f3eea2e14555b6faeea3dd58c1b1c3,sao paulo,SP,0.0,5.667187,2018-08-03 00:44:08,2018-08-03 00:44:08,0.0,0.0,1.0,5.0,0,1,1,1.0,79.99
4,51a04a8a6bdcb23deccc82b0b80742cf,braganca paulista,SP,3.353727,35.314861,2017-11-14 12:15:25,2017-11-14 12:15:25,0.0,1.0,0.0,1.0,100,1,1,1.0,167.99


In [11]:
sellers.columns

Index(['seller_id', 'seller_city', 'seller_state', 'delay_to_carrier',
       'seller_wait_time', 'date_first_sale', 'date_last_sale',
       'months_on_olist', 'share_of_one_stars', 'share_of_five_stars',
       'seller_review_score', 'review_cost_per_seller', 'n_orders', 'quantity',
       'quantity_per_order', 'sales'],
      dtype='object')

In [12]:
# Get summary stats for each column
sellers.describe()

Unnamed: 0,delay_to_carrier,seller_wait_time,months_on_olist,share_of_one_stars,share_of_five_stars,seller_review_score,review_cost_per_seller,n_orders,quantity,quantity_per_order,sales
count,2970.0,2970.0,2970.0,2970.0,2970.0,2970.0,2970.0,2970.0,2970.0,2970.0,2970.0
mean,0.402786,12.160414,6.019529,0.12457,0.59213,4.08688,562.131313,33.617508,38.085185,1.16215,4566.515906
std,2.391687,7.103208,5.99424,0.19187,0.279057,0.810166,1941.000427,107.133714,122.417269,0.443348,14185.211617
min,0.0,1.214178,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,6.5
25%,0.0,8.289263,1.0,0.0,0.478261,3.818424,0.0,2.0,3.0,1.0,239.8
50%,0.0,11.120969,4.0,0.063856,0.6,4.2,100.0,7.0,8.0,1.0,893.5
75%,0.0,14.240673,10.0,0.166667,0.75,4.625,380.0,23.0,26.0,1.152009,3586.0225
max,45.434039,189.86316,23.0,1.0,1.0,5.0,40890.0,1854.0,2039.0,15.0,229472.63


In [13]:
sellers.describe().columns

Index(['delay_to_carrier', 'seller_wait_time', 'months_on_olist',
       'share_of_one_stars', 'share_of_five_stars', 'seller_review_score',
       'review_cost_per_seller', 'n_orders', 'quantity', 'quantity_per_order',
       'sales'],
      dtype='object')

## Distribution plots

Let's look at the distributions of these features in relevant groupings

### Distribution for `delay_to_carrier`, `seller_wait_time`

`delay_to_carrier` represents the average delay (in days) by a seller of getting an order to the carrier and is calculated for each seller. We can see this as representing a delay in an order attributable to the seller.

`seller_wait_time` represents the average `wait_time` (in days), from the customer placing an order online to getting delivery of the order, and is calculated per seller. 
(NOTE: this differs from the `wait_time` feature in the Orders dataset in that it has been calculated to be the average `wait_time` *for each seller*). 
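As a rough illustration of how a per-seller average such as `seller_wait_time` can be derived from order-level data, the sketch below groups a small made-up table by seller. The `orders` frame and its values are hypothetical; the actual computation lives in `olist.seller` and may differ.

```python
import pandas as pd

# Hypothetical order-level data: one row per order (values are made up)
orders = pd.DataFrame({
    "seller_id": ["a", "a", "b", "b", "b"],
    "wait_time": [10.0, 14.0, 5.0, 7.0, 9.0],  # days from purchase to delivery
})

# Average wait time per seller, analogous to `seller_wait_time`
seller_wait_time = orders.groupby("seller_id")["wait_time"].mean()
print(seller_wait_time)
```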



In [None]:
sns.set_style('darkgrid')
fig, (ax1, ax2) = plt.subplots(1,2)
plt.figure(figsize=(15,6)).suptitle('Distribution of Delivery-related Variables', fontsize=20)
plt.close(1)

# 'delay_to_carrier' histogram
ax1 = plt.subplot(121)
ax1 = sns.histplot(sellers.delay_to_carrier)
plt.xlabel('delay_to_carrier (days)')

# 'seller_wait_time' histogram
ax2 = plt.subplot(122)
ax2 = sns.histplot(sellers.seller_wait_time)
plt.xlabel('seller_wait_time (days)')

plt.show(); 

In [None]:
sellers[['delay_to_carrier','seller_wait_time']].describe()

**Interpretation of Results**

- From these distribution plots and summary stats, we can see that delays in the order caused by the seller are quite rare, and so `delay_to_carrier` may not have enough data points to provide a complete enough explanation for low review scores.

- While the plot for `seller_wait_time` is heavily right-skewed, this appears to be the effect of only a few outliers. The mean is still close to the median, and most sellers are clustered tightly around the center. The coefficient of variation (standard deviation / mean) is below one (7.10 / 12.16 = 0.58), which also indicates relatively low dispersion.
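The coefficient of variation quoted above can be reproduced from the summary statistics. This is a minimal sketch that hard-codes the mean and standard deviation from the `sellers.describe()` output rather than loading the data; the variable names are illustrative.

```python
# Summary statistics for `seller_wait_time`, copied from `sellers.describe()`
mean_wait = 12.160414
std_wait = 7.103208

# Coefficient of variation: spread relative to the mean
cv = std_wait / mean_wait
print(round(cv, 2))  # 0.58
```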

In [None]:
# How many sellers contribute to delays?
num_sellers_delay = len((sellers[sellers['delay_to_carrier'] > 0]))
num_sellers_delay

In [None]:
#What percentage of sellers contribute to delays?
round(num_sellers_delay / len(sellers), 2) * 100

In [None]:
# What is the average review score for sellers that have delayed shipment to carriers?
sellers[sellers['delay_to_carrier'] > 0].seller_review_score.mean()


In [None]:
#How does that compare to the review score of all sellers? 
sellers.seller_review_score.mean()

Sellers who have delayed shipments to carriers have an average review score that is 0.78 lower (4.09 - 3.31) than the average across all sellers. Let's keep `delay_to_carrier` as a feature to include in the OLS regression since it appears to have an impact on customer reviews.

There also appear to be many new sellers on the Olist platform. We'll include `months_on_olist` as a feature in the OLS regression to see its impact on customer reviews.

### Distributions for `n_orders`, `quantity`, `quantity_per_order`, `sales`, `months_on_olist`

These variables indicate the various quantities related to the order  

- `n_orders`: the total number of orders a seller has had while on the platform
- `quantity`: the total number of individual items sold by a seller
- `quantity_per_order`: the average number of items per order for a seller, derived from `n_orders` and `quantity`
- `sales`: the total sales (in BRL) each seller has earned
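To make the derivation of `quantity_per_order` concrete, here is a small sketch using two sellers from the `sellers.head()` output. It assumes the feature is simply `quantity / n_orders`, which matches the values shown there (for example 41 / 40 = 1.025); the `sample` frame is illustrative.

```python
import pandas as pd

# Two sellers taken from the `sellers.head()` output
sample = pd.DataFrame({"n_orders": [3, 40], "quantity": [3, 41]})

# Average number of items per order
sample["quantity_per_order"] = sample["quantity"] / sample["n_orders"]
print(sample)
```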

Here we can see a strongly right-skewed distribution of orders transacted by sellers. In fact, the median `n_orders` is 7, and 75% of the sellers have had 23 or fewer orders on the platform.

In [None]:
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4,1)
plt.figure(figsize=(15,6)).suptitle('Distribution for Order-size related Variables', fontsize=20)
plt.close(1)

#'n_orders'
ax1 = plt.subplot(411)
ax1 = sns.histplot(sellers.n_orders)

#'quantity'
ax2 = plt.subplot(412)
ax2 = sns.histplot(sellers.quantity)

#'quantity_per_order'
ax3 = plt.subplot(413)
ax3 = sns.histplot(sellers.quantity_per_order)

#'sales'
ax4 = plt.subplot(414)
ax4 = sns.histplot(sellers.sales)

# Apply the layout once all four subplots exist
plt.tight_layout()
plt.show();

In [None]:
sellers[['n_orders', 'quantity', 'quantity_per_order', 'sales']].describe()

The outliers for these variables are quite extreme, making the plots difficult to read. We'll re-plot them with an adjusted view that excludes outliers.


In [None]:
# Outlier calculations (based on 1.5*IQR)

features = ['n_orders', 'quantity', 'quantity_per_order', 'sales']

max_xlim = []
q3_xlim = []

for f in features:
    q3 = np.quantile(sellers[f], .75, axis=0) 
    q1 = np.quantile(sellers[f], .25, axis=0) 
    iqr = q3-q1 
    outlier_x = 1.5*iqr
    max_xlim.append(outlier_x)
    q3_xlim.append(q3)

max_xlim
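As a worked example of the 1.5*IQR rule, the sketch below computes the upper cutoff for `n_orders` by hand, using the quartiles from the `sellers.describe()` output (Q1 = 2, Q3 = 23). The variable names are illustrative.

```python
# Quartiles for `n_orders`, copied from `sellers.describe()`
q1, q3 = 2.0, 23.0

iqr = q3 - q1                 # interquartile range
upper_fence = q3 + 1.5 * iqr  # values above this are treated as outliers
print(iqr, upper_fence)       # 21.0 54.5
```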

In [None]:
# Re-plotting features with an adjusted view to exclude outliers
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2)
plt.figure(figsize=(15,6)).suptitle('Distribution for Order-size related Variables (Adjusted for Outliers)', fontsize=20)
plt.close(1)

#'n_orders'
ax1 = plt.subplot(221)
ax1 = sns.histplot(sellers.n_orders)
ax1.set_xlim(0, q3_xlim[0] + max_xlim[0])

#'quantity'
ax2 = plt.subplot(222)
ax2 = sns.histplot(sellers.quantity)
ax2.set_xlim(0, q3_xlim[1] + max_xlim[1])

#'quantity_per_order'
ax3 = plt.subplot(223)
ax3 = sns.histplot(sellers.quantity_per_order)
ax3.set_xlim(1, q3_xlim[2] + max_xlim[2])

#'sales'
ax4 = plt.subplot(224)
ax4 = sns.histplot(sellers.sales)
ax4.set_xlabel('sales (BRL)')
ax4.set_xlim(0, q3_xlim[3] + max_xlim[3])


plt.show();

**Interpretation of Results**

- Since both `n_orders` and `quantity` speak to the volume of product that sellers have sold, it's not surprising that their distribution is similar. However, `n_orders` can be more helpful as a gauge for experience that a seller has in processing orders. 

- With a median `n_orders` of only **7**, the platform appears to have many sellers with a low number of transactions, which could be a helpful feature in explaining low review scores.
 
- The distribution of `quantity_per_order` indicates that most orders deal with a single item. If each order had multiple items, then we might consider that mix-ups in orders could be a factor contributing to low review scores, but here we see a low occurrence of multiple-item orders.  

- `sales` follows a similar distribution to these other variables: a strong right skew, with the bulk on the low end. Half of all sellers have generated less than **893.50 BRL** (~170 USD).

Another feature that might speak more directly to the experience of the seller is `months_on_olist`, which tells how long a seller has been on the platform.

In [None]:
# 'months_on_olist' histogram
plt.figure(figsize=(15,6))
sns.histplot(sellers.months_on_olist)
plt.show()           

In [None]:
# Summary stats 
sellers[['months_on_olist']].describe()

- Olist is a fairly new platform: the longest-tenured sellers have been using it for only **23 months**, and half of all sellers have been on the platform for **4 months or less**.






### Distribution of `seller_review_score`, `share_of_one_stars`, `share_of_five_stars`

These variables indicate how sellers have been performing based on customer feedback.   

- `seller_review_score`: the average score of a seller's reviews 
- `share_of_one_stars`: the proportion of a seller's reviews with only one star 
- `share_of_five_stars`: the proportion of a seller's reviews with five stars


In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1,3)
plt.close(1)
plt.figure(figsize=(15,8)).suptitle('Distribution of Review-related Variables', fontsize=20)

#'seller_review_score'
ax1 = plt.subplot(131)
ax1 = sns.histplot(sellers.seller_review_score)

#'share_of_one_stars'
ax2 = plt.subplot(132)
ax2 = sns.histplot(sellers.share_of_one_stars)

#'share_of_five_stars'
ax3 = plt.subplot(133)
ax3 = sns.histplot(sellers.share_of_five_stars)
plt.show();

In [None]:
sellers[['seller_review_score', 'share_of_one_stars', 'share_of_five_stars']].describe()

**Interpretation of plots**

With these distribution plots, we can see what proportion of sellers have been performing badly with low review scores. This will help us identify and set thresholds for sellers who might be candidates for removal from the platform, or who may be in need of more assistance in improving quality.    

- From our histogram and statistical summary, we can see that the average `seller_review_score` is **close to 4** (mean=4.087), while those receiving an average score of **3 or less** already fall into the bottom 25% of seller performance (Q1=3.82). 

- Examining `share_of_one_stars`, we see that **75% of all sellers** have only a small portion of their reviews (at most a sixth, Q3=0.167) coming in as one-star ratings. For the remaining 25% of sellers who are getting a higher proportion of one-star reviews, we can scrutinize their performance further.

- With `share_of_five_stars` we see a W-shaped distribution with three peaks. On the far right, we have a large number of 'super-performers' who have only ever gotten five-star reviews. Then we have two smaller peaks occurring a) at the center, where only half of a seller's reviews were a 'superb' five stars, and b) at the far left, where a substantial number of sellers have never gotten the highest review score.

For a fairer assessment that takes into consideration the total number of reviews each seller has received, we could explore assigning weights to each 1-star (or 5-star review) that are proportional to the total number of reviews received (i.e. we could control that a seller with a hundred 5-star reviews but a single 1-star review, would still rank higher than a seller with only ten reviews but which are all 5-stars).        
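One possible way to implement such a weighting is a shrunk (Bayesian-style) average that pulls each seller's mean toward the overall mean, with the pull fading as reviews accumulate. This is only a sketch of one candidate technique, not a method used elsewhere in this notebook; `shrunk_score`, `prior_mean` and `prior_weight` are hypothetical names, and the prior weight of 10 is an arbitrary choice.

```python
def shrunk_score(ratings, prior_mean=4.09, prior_weight=10):
    """Average rating pulled toward `prior_mean`; the pull fades as reviews accumulate."""
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + len(ratings))

seller_a = [5] * 100 + [1]  # a hundred 5-star reviews and a single 1-star review
seller_b = [5] * 10         # only ten reviews, all 5 stars

score_a = shrunk_score(seller_a)
score_b = shrunk_score(seller_b)
print(round(score_a, 2), round(score_b, 2))
```

With a plain average, seller B would rank above seller A; with the shrunk score the order flips, because B's ten reviews carry less evidence than A's 101.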

## Correlations

Let's check out the correlations between different features in the Sellers dataset

In [None]:
plt.figure(figsize=(15,6))
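# numeric_only=True (pandas >= 1.5) skips the non-numeric id, city, state and date columns
sns.heatmap(sellers.corr(numeric_only=True), cmap='coolwarm');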

In [None]:
sellers.corr(numeric_only=True)['seller_review_score']

As in the Orders dataset, the features that capture delivery time have the largest correlation with `seller_review_score` among the features not derived from it: `delay_to_carrier` (r=-0.33) and `seller_wait_time` (r=-0.42).

And, since `seller_wait_time` includes the time a seller ships to the carrier, it makes sense that it is also correlated with `delay_to_carrier`.

In [None]:
# correlation between 'seller_wait_time' and 'delay_to_carrier'
sellers['seller_wait_time'].corr(sellers['delay_to_carrier'])

## Multivariate Regression

We'll now run a multivariate regression model with select features from the Sellers dataset to better understand their impact on `seller_review_score`. 

To make sure we'll be able to compare the coefficients for features in different units, we'll first do some feature scaling. Since many of the features have outliers, we'll use standardization rather than normalization, because standardization does not scale the features based on the range of values.
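To illustrate the difference between the two scaling approaches, the sketch below applies both to a small made-up array containing one extreme value. The data is purely illustrative and `ddof=1` is used to match the default of pandas' `.std()`.

```python
import numpy as np

# Made-up feature values with one extreme outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standardization (z-score): centre on the mean, scale by the standard deviation
z = (x - x.mean()) / x.std(ddof=1)

# Min-max normalization: scale by the range, so the outlier alone sets the scale
minmax = (x - x.min()) / (x.max() - x.min())

print(np.round(z, 2))
print(np.round(minmax, 2))
```

Min-max scaling pins the outlier at 1 and compresses the remaining values close to 0, whereas z-scores are not bounded by the range and keep a mean of 0 and a standard deviation of 1.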

In [None]:
# Apply feature scaling 

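def standardize(df, features): 
    """Return a copy of df with the given features rescaled to z-scores."""
    df_standardized = df.copy()
    for f in features:
        mu = df[f].mean()
        sigma = df[f].std()
        # Vectorized z-score: (value - mean) / standard deviation
        df_standardized[f] = (df[f] - mu) / sigma
    return df_standardized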


Select features and run a multivariate linear regression model to determine how much each feature impacts `seller_review_score`.

In [None]:
features = ['delay_to_carrier', 'seller_wait_time', 'months_on_olist', 'n_orders', 'quantity', 'quantity_per_order', 'sales']
sellers_standardized = standardize(sellers, features)
model = smf.ols(formula=f"seller_review_score ~ {'+ '.join(features)}", data=sellers_standardized).fit()

In [None]:
model.summary()

**Interpretation of Results**

- This model has a low R-squared (~0.20), suggesting that the currently selected features explain only about 20% of the variance in the review scores. Also, the lower value for Adj. R-squared (0.199) suggests that we may have extra features in our model that contribute more noise than information about our target variable.
- The F-test p-value (`Prob (F-statistic)` in the summary) shows that the model is statistically significant overall, i.e. that it performs better than a model without any features.
- The coefficients for the features `months_on_olist`, `n_orders`, `quantity`, and `sales` are not statistically significant, as their p-values are not less than our chosen alpha of 0.05.
- Additionally, zero falls within the 95% confidence interval for these features. Since a zero coefficient would indicate that the feature has no impact on the target variable (our null hypothesis), we can't reject the null hypothesis, and the estimated coefficients for these features are not reliable.
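The relationship between R-squared and adjusted R-squared can be checked by hand. The sketch below uses the number of sellers (2970) and the seven features in the model, together with the rounded R-squared of 0.20, so the result differs slightly from the value in the model summary.

```python
n = 2970   # number of sellers (from `sellers.describe()`)
p = 7      # number of features in the model
r2 = 0.20  # approximate (rounded) R-squared of the model

# Adjusted R-squared penalizes R-squared for the number of features
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))  # 0.198
```

With this many observations the penalty for seven features is small, so the two values stay close together.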

Now let's remove the non-reliable features and run another regression model.

In [None]:
# Revised feature selection
features = ['delay_to_carrier', 'seller_wait_time', 'quantity_per_order']
sellers_standardized = standardize(sellers, features)
revised_model = smf.ols(formula=f"seller_review_score ~ {'+ '.join(features)}", data=sellers_standardized).fit()

In [None]:
revised_model.summary()

**Interpretation of Results**

- With this reduced set of features, our model is still only able to explain about 20% of the variance in the seller review scores (R2=0.20). 


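In [None]:
revised_model.params

In [None]:
# Plot the coefficients (excluding the intercept), sorted by value
revised_model.params[1:].sort_values().plot(kind='barh');

In [None]:
# Root mean squared error of the residuals
rmse = np.sqrt((revised_model.resid ** 2).mean())
rmse

In [None]:
revised_model.rsquared

In [None]:
# Compare the distributions of actual and predicted review scores
sns.histplot(sellers['seller_review_score'], kde=True, stat='density', discrete=True)
sns.histplot(revised_model.predict(sellers_standardized[features]), kde=True, stat='density', discrete=True);


In [None]:
sns.histplot(revised_model.resid, kde=True, stat='density', discrete=True);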

In [None]:
plt.figure(figsize=(15,6))
sns.heatmap(sellers.corr(numeric_only=True), cmap='coolwarm');

In [None]:
plt.figure(figsize=(15,6))
sns.heatmap(sellers.corr(method='spearman', numeric_only=True), cmap='coolwarm');

# Spearman's rank correlation is suited to cases where one of the variables is ordinal (categorical and ordered).
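To show what the Spearman coefficient measures, the sketch below computes it as a Pearson correlation on ranks for a small made-up dataset with a monotonic but non-linear relationship. The data and variable names are illustrative only.

```python
import pandas as pd

# Made-up data with a monotonic but non-linear relationship
df = pd.DataFrame({"x": range(1, 11)})
df["y"] = df["x"] ** 3

pearson = df["x"].corr(df["y"])                 # linear association
spearman = df["x"].rank().corr(df["y"].rank())  # Pearson on ranks = Spearman

print(round(pearson, 3), round(spearman, 3))
```

Because the ordering of `x` and `y` is identical, the rank-based coefficient is 1, while the Pearson coefficient is lower because the relationship is not a straight line.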

## Logistic Regression

The poor performance of our linear regression model may be partly explained by the fact that our target variable (`seller_review_score`) is built from a categorical variable (a five-star rating system).

Since logistic regressions are better able to capture the relationship between features and a categorical target variable, let's try that with our Sellers data.