# Predicting Customer Satisfaction - Ecommerce Data
Olist has released a dataset of 100k orders made between 2016 and 2018. Let's create a model to predict the score a customer will give an order.

# 1. Exploratory Data Analysis
Several **exploratory data analyses (EDAs)** have already been published by other users and are publicly available among the dataset's kernels, so we will skip most of the EDA and jump straight into the problem. We recommend the following EDAs:
* [E-Commerce Exploratory Analysis](https://www.kaggle.com/jsaguiar/e-commerce-exploratory-analysis) by [Aguiar](https://www.kaggle.com/jsaguiar)
* [Data Cleaning, Viz and Stat Analysis on e-com](https://www.kaggle.com/goldendime/data-cleaning-viz-and-stat-analysis-on-e-com) by [Azim Salikhov](https://www.kaggle.com/goldendime)

Those analyses help us understand what is happening in the data. Once we are comfortable with it and confident of its value, we can start working on bigger problems.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt

sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})

orders = pd.read_csv('../input/olist_public_dataset_v2.csv')

# converting to datetime
orders['order_purchase_timestamp'] = pd.to_datetime(orders.order_purchase_timestamp)
orders['order_aproved_at'] = pd.to_datetime(orders.order_aproved_at).dt.date  
orders['order_estimated_delivery_date'] = pd.to_datetime(orders.order_estimated_delivery_date).dt.date  
orders['order_delivered_customer_date'] = pd.to_datetime(orders.order_delivered_customer_date).dt.date  

# get translations for category names
translation = pd.read_csv('../input/product_category_name_translation.csv')
orders = orders.merge(translation, on='product_category_name').drop('product_category_name', axis=1)

orders.head(3)
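
One thing worth checking in the cell above: `merge` defaults to an inner join, so any order whose category is missing or has no translation is silently dropped. The sketch below uses small made-up frames (not the Olist files) to show the difference between the default and `how='left'`; the category values are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'product_category_name': ['beleza_saude', None, 'categoria_sem_traducao'],
})
toy_translation = pd.DataFrame({
    'product_category_name': ['beleza_saude'],
    'product_category_name_english': ['health_beauty'],
})

# Default inner join: rows without a matching translation disappear
inner = toy_orders.merge(toy_translation, on='product_category_name')

# Left join: every order is kept, untranslated categories become NaN
left = toy_orders.merge(toy_translation, on='product_category_name', how='left')

print(len(inner), len(left))
print(left['product_category_name_english'].isna().sum())
```

Whether dropping those rows is acceptable depends on how many there are, so comparing `len(orders)` before and after the merge is a cheap sanity check.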

# 2. Defining the Problem
Let's say your manager asked you: 

**"What score are we likely to get from customers?"**

Our problem is to find a way to estimate, based on data about the product and order, what will be the customer review score.

# 3. The hypothesis
Our main hypothesis is that the product and how the order was fulfilled might influence the customer review score. Keep in mind that each feature we create is a new hypothesis we are testing.

# 4. Designing an Experiment
To answer that question we must collect data from each order up to the delivery phase. With that data, we can build a model that estimates the score the customer will give at the review phase.

![frame the problem](https://i.imgur.com/MTLzY55.png)

####  How would you frame this problem? 
If you would try a different approach, please leave a comment or write a kernel!


# a. Drop columns
Some columns contain information about the review given by the customer (review_coment_message, review_creation_date, etc.), but we don't want to use them. Our experiment assumes we have no information about the review, because we need to predict the score before the customer writes it. Other columns are simply uninformative for predicting customer satisfaction.

In [None]:
orders = orders[['order_status', 'order_products_value',
                 'order_freight_value', 'order_items_qty', 'order_sellers_qty',
                 'order_purchase_timestamp', 'order_aproved_at', 'order_estimated_delivery_date', 
                 'order_delivered_customer_date', 'customer_state', 
                 'product_category_name_english', 'product_name_lenght', 'product_description_lenght', 
                 'product_photos_qty', 'review_score']]

# b. Splitting the Dataset
It is important that we split our data at the very beginning of our analysis. Doing that after might introduce some unwanted bias. 

> To split correctly, let's first see how classes are distributed over the full dataset.

In [None]:
# Proportion of each class in the full dataset (the split should preserve it)
orders['review_score'].value_counts() / len(orders['review_score'])

## Simple split
Let's first try a simple random split and see whether the proportions are preserved.

In [None]:
from sklearn.model_selection import train_test_split

# split
train_set, test_set = train_test_split(orders, test_size=0.2, random_state=42)

In [None]:
test_set['review_score'].value_counts() / len(test_set['review_score'])

We see there is some difference between the proportion of each class compared to the original dataset.

## Stratified Split
Now let's do a stratified shuffle split and compare with the full dataset again.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

# Stratified Split
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(orders, orders['review_score']):
    # split() yields positional indices, so select rows with iloc rather than loc
    strat_train_set = orders.iloc[train_index]
    strat_test_set = orders.iloc[test_index]

In [None]:
strat_train_set['review_score'].value_counts() / len(strat_train_set['review_score'])

By doing a stratified split we keep the same proportion between classes. This split better represents the original data and will possibly reduce bias.
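
To make the comparison explicit, the three sets of proportions can be put side by side. The following is a minimal sketch on synthetic scores (the class probabilities are invented, not the Olist ones), and it uses `train_test_split` with its `stratify` argument instead of `StratifiedShuffleSplit` to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced scores (invented proportions, for illustration only)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'x': rng.normal(size=5000),
    'review_score': rng.choice([1, 2, 3, 4, 5], size=5000,
                               p=[0.10, 0.05, 0.10, 0.20, 0.55]),
})

def proportions(df):
    # Share of each score, ordered by score
    return df['review_score'].value_counts(normalize=True).sort_index()

_, random_test = train_test_split(toy, test_size=0.2, random_state=42)
_, strat_test = train_test_split(toy, test_size=0.2, random_state=42,
                                 stratify=toy['review_score'])

comparison = pd.DataFrame({
    'overall': proportions(toy),
    'random': proportions(random_test),
    'stratified': proportions(strat_test),
})
# Absolute deviation of each test set from the full-data proportions
comparison['random_abs_err'] = (comparison['random'] - comparison['overall']).abs()
comparison['strat_abs_err'] = (comparison['stratified'] - comparison['overall']).abs()
print(comparison)
```

The `strat_abs_err` column should stay close to zero, since stratification fixes the per-class counts up to rounding.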

# c. Separate Labels From Features
We don't want to apply any transformation to the labels (review_score). To avoid that, we create a separate series with the labels and drop the target column from the features dataset.

In [None]:
orders_features = strat_train_set.drop('review_score', axis=1)
orders_labels = strat_train_set['review_score'].copy()

# d. Feature Engineering
Looking at the original data, few columns are correlated with the target.

In [None]:
# numeric_only=True skips text/date columns (required on recent pandas versions)
corr_matrix = strat_train_set.corr(numeric_only=True)
corr_matrix['review_score'].sort_values(ascending=False)

It's clear that we have to create more informative features to model this problem.

## Feature Hypotheses

#### Working Days Estimated Delivery Time
Gets the number of working days between order approval and the estimated delivery date. A customer might be dissatisfied if told that the estimated delivery time is long.

#### Working Days Actual Delivery Time
Gets the days between order approval and delivered customer date. A customer might be more satisfied if he gets the product faster.

#### Working Days Delivery Time Delta
The difference between the actual and estimated delivery times. If negative, the order was delivered early; if positive, it was delivered late. A customer might be more satisfied if the order arrives sooner than expected, or unhappy if it arrives after the deadline.

#### Is Late
Binary variable indicating if the order was delivered after the estimated date.

#### Average Product Value
Cheaper products might have lower quality, leaving customers unhappy.

#### Total Order Value
If a customer spends more, he might expect better order fulfilment.

#### Order Freight Ratio
If a customer pays more for freight, he might expect a better service.

#### Purchase Day of Week
Does the day of the week affect how happy customers are?
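
The delivery-time hypotheses above can be sketched on a couple of made-up orders. This sketch assumes NumPy's `np.busday_count` as a stand-in for the workalendar calendar, so it counts Monday to Friday only and ignores Brazilian holidays; the dates are invented.

```python
import numpy as np
import pandas as pd

# Toy orders, invented for illustration
toy = pd.DataFrame({
    'order_aproved_at': pd.to_datetime(['2018-01-01', '2018-01-01']),
    'order_estimated_delivery_date': pd.to_datetime(['2018-01-15', '2018-01-15']),
    'order_delivered_customer_date': pd.to_datetime(['2018-01-10', '2018-01-18']),
})

def busdays(start, end):
    # Monday-Friday days in [start, end); no holiday calendar is applied here
    return np.busday_count(start.values.astype('datetime64[D]'),
                           end.values.astype('datetime64[D]'))

toy['wd_estimated_delivery_time'] = busdays(toy.order_aproved_at,
                                            toy.order_estimated_delivery_date)
toy['wd_actual_delivery_time'] = busdays(toy.order_aproved_at,
                                         toy.order_delivered_customer_date)

# Negative delta: delivered early. Positive delta: delivered late.
toy['wd_delivery_time_delta'] = (toy.wd_actual_delivery_time
                                 - toy.wd_estimated_delivery_time)
toy['is_late'] = toy.order_delivered_customer_date > toy.order_estimated_delivery_date

print(toy[['wd_estimated_delivery_time', 'wd_actual_delivery_time',
           'wd_delivery_time_delta', 'is_late']])
```

The first toy order arrives three working days early and the second three working days late, so the delta and the `is_late` flag agree on both rows.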

In [None]:
#plt.figure(figsize=(20,5))
#sns.heatmap(corr_matrix)
#plt.show()

In [None]:
# To take the Brazilian calendar and holidays into account
!pip install workalendar
from workalendar.america import Brazil
cal = Brazil()

## Creating a Custom Transformer for FeatEng
We need to guarantee that we apply exactly the same transformations to new/unseen data. To do that, we will create custom transformers based on scikit-learn's BaseEstimator.

This first custom transformer will do the feature engineering that we just described earlier.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class AttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass    
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        df = X.copy()
        
        # Calculate the estimated delivery time and actual delivery time in working days.
        # This lets us exclude holidays that could influence delivery times.
        # If the order_delivered_customer_date is null, it returns 0.
        df['wd_estimated_delivery_time'] = df.apply(lambda x: cal.get_working_days_delta(x.order_aproved_at, 
                                                                                      x.order_estimated_delivery_date), axis=1)
        df['wd_actual_delivery_time'] = df.apply(lambda x: cal.get_working_days_delta(x.order_aproved_at, 
                                                                                   x.order_delivered_customer_date), axis=1)

        # Calculate the time between the actual and estimated delivery date. If negative was delivered early, if positive was delivered late.
        df['wd_delivery_time_delta'] = df.wd_actual_delivery_time - df.wd_estimated_delivery_time


        # Flag orders that were delivered after the estimated delivery date.
        df['is_late'] = df.order_delivered_customer_date > df.order_estimated_delivery_date
        
        # Calculate the average product value.
        df['average_product_value'] = df.order_products_value / df.order_items_qty

        # Calculate the total order value
        df['total_order_value'] = df.order_products_value + df.order_freight_value
        
        # Calculate the order freight ratio.
        df['order_freight_ratio'] = df.order_freight_value / df.order_products_value
        
        # Get the day of the week of the purchase (Monday=0, Sunday=6).
        df['purchase_dayofweek'] = df.order_purchase_timestamp.dt.dayofweek
                       
        # With that we can remove the timestamps from the dataset
        cols2drop = ['order_purchase_timestamp', 'order_aproved_at', 'order_estimated_delivery_date', 
                     'order_delivered_customer_date']
        df.drop(cols2drop, axis=1, inplace=True)
        
        return df
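
The point of wrapping the feature engineering in a transformer is that the very same code runs on training data and on data seen later. As a minimal sketch, here is a trimmed-down transformer (`FreightRatioAdder` is a hypothetical name, not part of this kernel) applied to two invented frames.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FreightRatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical, trimmed-down transformer that adds a single ratio feature."""

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit is a no-op
        return self

    def transform(self, X, y=None):
        df = X.copy()  # work on a copy so the caller's frame stays untouched
        df['order_freight_ratio'] = df.order_freight_value / df.order_products_value
        return df

# Invented values, for illustration only
train = pd.DataFrame({'order_products_value': [100.0, 50.0],
                      'order_freight_value': [10.0, 25.0]})
unseen = pd.DataFrame({'order_products_value': [80.0],
                       'order_freight_value': [20.0]})

adder = FreightRatioAdder()
train_out = adder.fit_transform(train)
unseen_out = adder.transform(unseen)

# Both outputs share the same columns, and the input frame is not modified
print(list(train_out.columns) == list(unseen_out.columns))
print('order_freight_ratio' in train.columns)
```

Copying the input inside `transform` matters here: without it, calling the transformer would silently add columns to the original training frame.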

### New Features - Working Days
Analysing the dataframe, we see that the new features were successfully created.

In [None]:
# Executing the estimator we just created
attr_adder = AttributesAdder()
feat_eng = attr_adder.transform(strat_train_set)
feat_eng.head(3)

### Correlation
What is the correlation of the features we have just created with the review score?

In [None]:
# numeric_only=True skips text columns (required on recent pandas versions)
corr_matrix = feat_eng.corr(numeric_only=True)
corr_matrix['review_score'].sort_values(ascending=False)

Looks ok: none of the correlations is strong, but it is clear that a customer tends to give a lower score when the order arrives after the estimated date.

## Any missing values?
Let's see if there are any missing values.

In [None]:
feat_eng.info()

Great! No missing values after this transformation!
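
`info()` reports non-null counts, which have to be compared by eye. A more direct check is to count missing values per column, sketched here on a made-up frame with one gap (the values are invented).

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value, invented for illustration
toy = pd.DataFrame({
    'wd_actual_delivery_time': [7, np.nan, 13],
    'product_photos_qty': [1, 2, 3],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Names of the columns that still contain missing values
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_gaps)
```

An empty list means there is nothing to impute before scaling and modelling.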

# e. Data Viz
Now let's visually explore this dataset a little bit!

# f. Dealing with Categorical and Numerical Attributes
The way we handle categorical data is very different from the transformations needed for numerical features. We will create a transformer to select only categorical or numerical features for processing.

In [None]:
# selecting the numerical and text attributes
cat_attribs = ['order_status', 'customer_state', 'product_category_name_english']
num_attribs = orders_features.drop(cat_attribs, axis=1).columns

In [None]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return X[self.attribute_names]

## Numerical Attributes
Creating pipelines to handle unseen data

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# for now we won't work with categorical data; we plan to add it in a future release
num_pipeline = Pipeline([('selector', DataFrameSelector(num_attribs)),
                         ('attribs_adder', AttributesAdder()),
                         ('std_scaller', StandardScaler())
                        ])

In [None]:
# let's see what the resulting data looks like
orders_features_prepared = num_pipeline.fit_transform(orders_features)
orders_features_prepared
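
When the categorical attributes are added, one possible approach (an alternative to the selector-based pipeline above, not what this kernel does) is scikit-learn's `ColumnTransformer`, which routes columns to different transformers by name. The sketch below runs on an invented toy frame.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented for illustration
toy = pd.DataFrame({
    'order_products_value': [100.0, 50.0, 80.0, 20.0],
    'order_freight_value': [10.0, 25.0, 20.0, 5.0],
    'customer_state': ['SP', 'RJ', 'SP', 'MG'],
})

num_cols = ['order_products_value', 'order_freight_value']
cat_cols = ['customer_state']

preprocess = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
])

prepared = preprocess.fit_transform(toy)

# A state that was not present during fit still transforms without error
unseen = pd.DataFrame({'order_products_value': [60.0],
                       'order_freight_value': [12.0],
                       'customer_state': ['BA']})
unseen_prepared = preprocess.transform(unseen)

print(prepared.shape, unseen_prepared.shape)
```

The output has two scaled numeric columns plus one column per state seen during fitting.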

# g. Selecting a Model
Start simple.

## Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(orders_features_prepared, orders_labels)

In [None]:
some_data = orders_features.iloc[:8]
some_labels = orders_labels.iloc[:8]
some_data_prepared = num_pipeline.transform(some_data)

In [None]:
print('Predicted: {} \n Labels: {}'.format(list(lin_reg.predict(some_data_prepared)), list(some_labels.values)))

Looks like we are not even close to predicting the right values. Let's see what the root mean squared error is.

In [None]:
from sklearn.metrics import mean_squared_error

predictions = lin_reg.predict(orders_features_prepared)
lin_mse = mean_squared_error(orders_labels, predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

A typical prediction error of about 1.25 is not at all satisfying when we are trying to predict values that range from 1 to 5.  So let's try a different model.
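
A useful yardstick for an RMSE is the error of a model that always predicts the mean label: its RMSE equals the (population) standard deviation of the labels. The sketch below checks this on made-up scores, not the Olist labels.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up labels, invented for illustration
toy_labels = np.array([5, 5, 4, 1, 5, 3, 5, 2, 4, 5])

# Constant prediction: always answer with the mean score
baseline_pred = np.full(len(toy_labels), toy_labels.mean(), dtype=float)
baseline_rmse = np.sqrt(mean_squared_error(toy_labels, baseline_pred))

print(baseline_rmse)
print(toy_labels.std())  # population standard deviation, the same value
```

Running the same comparison with `orders_labels.std()` would show how much the linear model improves on this trivial baseline.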

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(orders_features_prepared, orders_labels)

predictions = forest_reg.predict(orders_features_prepared)
forest_mse = mean_squared_error(orders_labels, predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

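Much better! We got a typical error of 0.53 with Random Forest. Keep in mind that this error is measured on the same data the model was trained on, so it is probably optimistic; cross-validation (see the next steps) will give a fairer estimate. Let's see some examples of predictions.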

In [None]:
print('Predicted: {} \n Labels: {}'.format(list(forest_reg.predict(some_data_prepared)), list(some_labels.values)))
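
Because the forest was scored on its own training data, the number above can hide overfitting. A minimal sketch of the gap, on synthetic data (invented features and target, with `n_estimators` and `cv` chosen arbitrarily to keep it fast):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = X[:, 0] + rng.normal(scale=1.0, size=300)  # noisy target

forest = RandomForestRegressor(n_estimators=20, random_state=42)

# Error measured on the data the model was fitted on
forest.fit(X, y)
train_rmse = np.sqrt(mean_squared_error(y, forest.predict(X)))

# Error measured on held-out folds (scikit-learn returns negated MSE)
scores = cross_val_score(forest, X, y, scoring='neg_mean_squared_error', cv=5)
cv_rmse = np.sqrt(-scores)

print(train_rmse)
print(cv_rmse.mean(), cv_rmse.std())
```

On noisy data like this the cross-validated RMSE comes out clearly higher than the training RMSE, which is the kind of gap to look for with the real features.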

# Next steps

1. Cross validation
2. Grid search
3. Full pipeline - transform and predict data
4. Validation on test set
5. Constructing a conclusion
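
As a rough sketch of how steps 2 and 3 could fit together, a grid search can wrap a full pipeline so that scaling and the model are tuned and applied as one object. Everything below is an assumption for illustration: the data is synthetic, and the parameter grid is arbitrary rather than a recommendation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

full_pipeline = Pipeline([
    ('std_scaler', StandardScaler()),
    ('forest', RandomForestRegressor(n_estimators=20, random_state=42)),
])

# Step-name prefixes ("forest__") route each parameter to the right pipeline step
param_grid = {
    'forest__max_depth': [2, 5, None],
    'forest__max_features': [2, 4],
}

search = GridSearchCV(full_pipeline, param_grid,
                      scoring='neg_mean_squared_error', cv=3)
search.fit(X, y)

print(search.best_params_)
print(np.sqrt(-search.best_score_))  # cross-validated RMSE of the best setting
sample_predictions = search.predict(X[:3])  # refitted best pipeline, end to end
```

After fitting, `search` behaves like the best pipeline refitted on all the data, so the same object can later be evaluated once on the held-out test set.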