# Customer Satisfaction Prediction - Brazilian e-Commerce Public Dataset
## 1. Business Problem:-
### 1.1 Description 

This is a Brazilian e-commerce public dataset of orders made at Olist Store. The dataset has information on about 100k orders placed from 2016 to 2018 at multiple marketplaces in Brazil. Its features allow viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers. A geolocation dataset that relates Brazilian zip codes to lat/lng coordinates has also been released.

This dataset was generously provided by Olist, the largest department store in Brazilian marketplaces. Olist connects small businesses from all over Brazil to channels without hassle and with a single contract. Those merchants are able to sell their products through the Olist Store and ship them directly to the customers using Olist logistics partners. See more on the website: www.olist.com

After a customer purchases a product from Olist Store, a seller gets notified to fulfill that order. Once the customer receives the product, or the estimated delivery date is due, the customer gets a satisfaction survey by email where they can rate the purchase experience and write some comments.

CREDITS:- Kaggle
    
### 1.2 Problem Statement 
Predict customer satisfaction with a purchase made on the Olist e-commerce site.

### 1.3 Sources/Useful Links

1. Source:- https://www.kaggle.com/olistbr/brazilian-ecommerce
2. Data Description:- https://www.kaggle.com/andresionek/understanding-the-olist-ecommerce-dataset
3. Discussion:- https://www.kaggle.com/olistbr/brazilian-ecommerce/discussion/66466
4. Data Analysis:- https://www.kaggle.com/duygut/brazilian-e-commerce-data-analysis
5. Existing Approach:- https://www.kaggle.com/andresionek/predicting-customer-satisfaction
6. Ensemble:- https://pdfs.semanticscholar.org/449e/7116d7e2cff37b4d3b1357a23953231b4709.pdf
7. Sentiment:- https://www.kaggle.com/thiagopanini/e-commerce-sentiment-analysis-eda-viz-nlp

#### 1.3.1 Real world/Business Objectives and Constraints 
1. No strict latency concerns.
2. Interpretability is important.

## 2. Machine Learning Problem 
### 2.1 Data 
#### 2.1.1 Data Overview 

Source:- https://www.kaggle.com/olistbr/brazilian-ecommerce

The data is divided into multiple datasets for better understanding and organization. Please refer to the following data schema when working with it:
<img src="https://i.imgur.com/HRhd2Y0.png" />

#### 2.1.2 Data Description
The **olist_orders_dataset** has the order data for each purchase and connects to the other tables through order_id and customer_id.
The **olist_order_reviews_dataset** has the labelled review data for each order in the orders table, with scores in [1,2,3,4,5], where 5 is the highest and 1 is the lowest.
We will treat review scores greater than 3 as positive and scores less than or equal to 3 as negative.
The tables will be joined accordingly to get the data needed for the analysis, feature selection and model training.
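
The join and labelling rule described above can be sketched on a toy example. The two small tables below are made up for illustration; only the column names `order_id`, `customer_id` and `review_score` follow the dataset.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (values are made up).
orders = pd.DataFrame({
    "order_id": ["a1", "b2", "c3", "d4"],
    "customer_id": ["u1", "u2", "u3", "u4"],
})
reviews = pd.DataFrame({
    "order_id": ["a1", "b2", "c3", "d4"],
    "review_score": [5, 3, 1, 4],
})

# Join the reviews to the orders on the shared key.
toy = reviews.merge(orders, on="order_id")

# Scores above 3 become positive (1); scores of 3 or below become negative (0).
toy["label"] = (toy["review_score"] > 3).astype(int)
print(toy[["order_id", "review_score", "label"]])
```

A score of exactly 3 falls on the negative side, which matches the rule stated above.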

### 2.2 Mapping the real world problem to an ML problem 
#### 2.2.1 Type of Machine Learning Problem
It is a binary classification problem: for a given purchase order, we need to predict whether it will get a positive or a negative review from the customer.

#### 2.2.2 Performance Metric 
Metric(s): 
* F1-score (the harmonic mean of precision and recall)
* Binary Confusion Matrix
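
As a small illustration of how the two metrics relate, the sketch below computes a binary confusion matrix with scikit-learn and derives the F1-score from its cells. The label vectors are made up for the example.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up ground truth and predictions (1 = positive review, 0 = negative review).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

# The library function should agree with the manual computation.
score = f1_score(y_true, y_pred)
print(cm)
print(f1_manual, score)
```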

### 2.3 Train and Test Construction

We build the train and test sets with a stratified random split of the data. Either a 70:30 or an 80:20 ratio would work, since we have enough data points; the code below uses 80:20.
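
A minimal sketch of what stratification buys us, using synthetic data with a made-up class imbalance: the positive rate is preserved in both parts of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary target (proportions are made up).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 770 + [0] * 230)

# stratify=y_toy keeps the class proportions the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=25
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```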

In [1]:
import warnings
warnings.filterwarnings("ignore")
import re
import nltk
nltk.download('stopwords')
nltk.download('rslp')
from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer
from tqdm import tqdm
import shutil
import os
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib
matplotlib.use(u'nbAgg')
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pickle
import random
from scipy.sparse import hstack
from sklearn.metrics import log_loss,accuracy_score, confusion_matrix, f1_score
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.models import load_model

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/krgsharma17/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package rslp to /home/krgsharma17/nltk_data...
[nltk_data]   Package rslp is already up-to-date!


In [2]:
# loading the data tables
customer_data = pd.read_csv('olist_customers_dataset.csv')
geolocation_data = pd.read_csv('olist_geolocation_dataset.csv')
order_items_dataset = pd.read_csv('olist_order_items_dataset.csv')
order_payments_dataset = pd.read_csv('olist_order_payments_dataset.csv')
order_reviews_dataset = pd.read_csv('olist_order_reviews_dataset.csv')
order_dataset = pd.read_csv('olist_orders_dataset.csv')
order_products_dataset = pd.read_csv('olist_products_dataset.csv')
order_sellers_dataset = pd.read_csv('olist_sellers_dataset.csv')
product_translation_dataset = pd.read_csv('product_category_name_translation.csv')

order_reviews_dataset = order_reviews_dataset[['order_id','review_score', 'review_comment_message']]
order_review_data = order_reviews_dataset.merge(order_dataset,on='order_id')
order_products_dataset_english = pd.merge(order_products_dataset,product_translation_dataset,on='product_category_name'
                                          ,how='left')
order_products_dataset_english = order_products_dataset_english.drop(labels='product_category_name',axis=1)
order_product_item_dataset = pd.merge(order_items_dataset,order_products_dataset_english,on='product_id')
ordered_product_reviews = pd.merge(order_product_item_dataset,order_review_data,on='order_id')
ordered_product_reviews_payments = pd.merge(ordered_product_reviews,order_payments_dataset,on='order_id')
df_final = pd.merge(ordered_product_reviews_payments,customer_data,on='customer_id')

# number of order-item rows per product (stored as sellers_count; this counts rows, not distinct sellers)
product_seller_count = (order_product_item_dataset.groupby('product_id')['seller_id'].count()
                        .rename('sellers_count').reset_index())

# number of items in each order
order_items_count = (order_product_item_dataset.groupby('order_id')['product_id'].count()
                     .rename('products_count').reset_index())

df_final = pd.merge(df_final,product_seller_count,on='product_id')
df_final = pd.merge(df_final,order_items_count,on='order_id')

# df = pd.read_csv('olist_final.csv')

# separating the target variable
y = df_final['review_score']
X = df_final.drop(labels='review_score',axis=1)

# train test 80:20 split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=25)
print("Train data: ",X_train.shape,y_train.shape)
print("Test data: ",X_test.shape,y_test.shape)

Train data:  (94652, 33) (94652,)
Test data:  (23663, 33) (23663,)


In [3]:
def process_texts(texts):
    
    processed_text = []
    
    portuguese_stopwords = set(stopwords.words('portuguese')) # Portuguese stopwords, as a set for fast lookup
    stemmer = RSLPStemmer() # Portuguese stemmer (only needed if the stemming line below is enabled)
    
    links = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+' # matches hyperlinks
    dates = r'([0-2][0-9]|(3)[0-1])(\/|\.)(((0)[0-9])|((1)[0-2]))(\/|\.)\d{2,4}' # matches dates such as 31/12/2018
    currency = r'[R]{0,1}\$[ ]{0,}\d+(,|\.)\d+' # matches currency amounts such as R$ 10,50
    
    for text in texts:
        text = re.sub(r'[\n\r]', ' ', text) # replace line breaks with spaces
        text = re.sub(links, ' URL ', text) # replace hyperlinks with a URL token
        text = re.sub(dates, ' ', text) # remove dates
        text = re.sub(currency, ' dinheiro ', text) # replace currency amounts with a "dinheiro" (money) token
        text = re.sub(r'[0-9]+', ' numero ', text) # replace remaining digits with a "numero" (number) token
        text = re.sub('([nN][ãÃaA][oO]|[ñÑ]| [nN] )', ' negação ', text) # replace negation words with a "negação" token
        text = re.sub(r'\W', ' ', text) # replace non-word characters (punctuation etc.) with spaces
        text = re.sub(r'\s+', ' ', text) # collapse repeated whitespace
        text = re.sub(r'[ \t]+$', '', text) # strip trailing spaces and tabs
        text = ' '.join(e for e in text.split() if e.lower() not in portuguese_stopwords) # remove stopwords
#         text = ' '.join(stemmer.stem(e.lower()) for e in text.split()) # stemming the words
        processed_text.append(text.lower().strip())
        
    return processed_text
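
To see what this kind of normalisation does to a single review, here is a simplified, standard-library-only sketch. The function name `normalize_review` and the sample sentence are hypothetical. Unlike `process_texts`, this sketch skips stopword removal (to avoid the NLTK download) and only replaces "não"/"nao" when it appears as a whole word.

```python
import re

URL_RE = re.compile(r"https?://\S+")
NUM_RE = re.compile(r"[0-9]+")
# Whole-word negation only; this is a simplification of the pattern used above.
NEG_RE = re.compile(r"\b(não|nao)\b", flags=re.IGNORECASE)

def normalize_review(text):
    """Replace URLs, numbers and negations with tokens, then strip punctuation."""
    text = re.sub(r"[\n\r]", " ", text)      # line breaks -> spaces
    text = URL_RE.sub(" URL ", text)         # hyperlinks -> URL token
    text = NUM_RE.sub(" numero ", text)      # digits -> number token
    text = NEG_RE.sub(" negação ", text)     # negation -> negation token
    text = re.sub(r"\W", " ", text)          # punctuation -> spaces
    text = re.sub(r"\s+", " ", text)         # collapse whitespace
    return text.lower().strip()

example = normalize_review("Não recebi o produto!\nPedido 123, veja https://exemplo.com/x")
print(example)
```

Because the text is lower-cased at the end, the `URL` token ends up as `url`, which is also what happens in `process_texts`.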

In [4]:
def tfidfWord2Vector(text,glove_words,tfidf_words,tf_values):
    w2vmodel = Word2Vec.load("word2vec.model")
    # compute the tf-idf weighted average word2vec vector for each review.
    tfidf_w2v_vectors = [] # the tf-idf weighted w2v vector for each sentence/review is stored in this list
    for sentence in tqdm(text): # for each review/sentence
        vector = np.zeros(300) # start from a zero vector; the word vectors are 300-dimensional
        tf_idf_weight = 0 # running sum of the tf-idf weights of the words with a valid vector
        for word in sentence.split(): # for each word in a review/sentence
            if (word in glove_words) and (word in tfidf_words):
                vec = w2vmodel.wv[word] # embeddings[word] 
                # multiply the idf value (tf_values[word]) by the term frequency (sentence.count(word)/len(sentence.split()));
                # note that str.count also counts substring matches, so the term frequency is approximate
                tf_idf = tf_values[word]*(sentence.count(word)/len(sentence.split())) # tf-idf value for this word
                vector += (vec * tf_idf) # calculating tfidf weighted w2v
                tf_idf_weight += tf_idf
        if tf_idf_weight != 0:
            vector /= tf_idf_weight
        tfidf_w2v_vectors.append(vector)
    tfidf_w2v_vectors = np.asarray(tfidf_w2v_vectors)
    
    return tfidf_w2v_vectors
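
The weighting scheme can be checked by hand on a tiny example. The sketch below uses made-up 2-dimensional embeddings and idf values instead of the saved Word2Vec model and TF-IDF vectorizer, and the helper name `weighted_sentence_vector` is hypothetical. It counts whole tokens rather than substrings, which is a small deviation from the function above.

```python
import numpy as np

# Made-up 2-d word vectors and idf values, standing in for the saved models.
embeddings = {"bom": np.array([1.0, 0.0]), "produto": np.array([0.0, 1.0])}
idf = {"bom": 2.0, "produto": 1.0}

def weighted_sentence_vector(sentence, embeddings, idf, dim=2):
    """Average the word vectors of a sentence, weighting each by tf * idf."""
    tokens = sentence.split()
    vector = np.zeros(dim)
    total_weight = 0.0
    for word in tokens:
        if word in embeddings and word in idf:
            tf = tokens.count(word) / len(tokens)  # term frequency over tokens
            weight = tf * idf[word]
            vector += weight * embeddings[word]
            total_weight += weight
    if total_weight != 0:
        vector /= total_weight
    return vector

vec = weighted_sentence_vector("produto bom bom", embeddings, idf)
unknown = weighted_sentence_vector("palavra desconhecida", embeddings, idf)
print(vec, unknown)
```

A sentence with no known words keeps the all-zero vector, which is the same fallback as in `tfidfWord2Vector`.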

In [5]:
def pre_process(X):
    
    datetime_cols = ['order_purchase_timestamp','order_approved_at','order_delivered_customer_date',
                     'order_estimated_delivery_date','order_delivered_carrier_date']
    for col in datetime_cols:
        X[col] = pd.to_datetime(X[col]).dt.normalize() # keep only the date part, as a datetime64 column so .dt.days works below
    # calculating estimated delivery time
    X['estimated_delivery_time'] = (X['order_estimated_delivery_date'] - X['order_approved_at']).dt.days

    # calculating actual delivery time
    X['actual_delivery_time'] = (X['order_delivered_customer_date'] - X['order_approved_at']).dt.days

    # calculating diff_in_delivery_time
    X['diff_in_delivery_time'] = X['estimated_delivery_time'] - X['actual_delivery_time']

    # flagging on-time deliveries (1 if delivered before the estimated date, otherwise 0)
    X['on_time_delivery'] = X['order_delivered_customer_date'] < X['order_estimated_delivery_date']
    X['on_time_delivery'] = X['on_time_delivery'].astype('int')

    # calculating mean product value
    X['avg_product_value'] = X['price'].astype(float)/X['products_count'].astype(float)

    # finding total order cost
    X['total_order_cost'] = X['price'].astype(float) + X['freight_value'].astype(float)

    # calculating order freight ratio
    X['order_freight_ratio'] = X['freight_value'].astype(float)/X['price'].astype(float)

    # finding the day of week on which order was made
    X['purchase_dayofweek'] = pd.to_datetime(X['order_purchase_timestamp']).dt.dayofweek

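    # adding is_reviewed: 1 if a review comment was written, otherwise 0.
    # missing comments arrive as NaN (or as the string 'nan' if the column was cast to str), so check for both
    X['is_reviewed'] = (X['review_comment_message'].notna() & (X['review_comment_message'] != 'nan')).astype('int')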
    
    drop_columns = ['order_id', 'order_item_id', 'product_id', 'seller_id','shipping_limit_date','customer_id',
                    'order_purchase_timestamp', 'order_approved_at', 'order_delivered_customer_date', 'customer_state',
                    'order_estimated_delivery_date','customer_unique_id', 'customer_city','customer_zip_code_prefix',
                    'order_delivered_carrier_date']
    X = X.drop(columns=drop_columns)
    
    num_feat = ['price', 'freight_value', 'product_name_lenght','product_description_lenght', 'product_photos_qty',
           'product_weight_g','product_length_cm', 'product_height_cm', 'product_width_cm','sellers_count', 
           'products_count', 'payment_sequential','payment_installments', 'payment_value','on_time_delivery', 
           'estimated_delivery_time','actual_delivery_time', 'diff_in_delivery_time','avg_product_value', 'purchase_dayofweek',
           'total_order_cost', 'order_freight_ratio','is_reviewed']

    # categorical features
    cat_feat = ['review_comment_message','product_category_name_english','order_status', 'payment_type']
    
    # handling missing values 
    imputer = pickle.load(open('MedianImputer.pkl','rb'))
    X[num_feat] = imputer.transform(X[num_feat])
    
    # processing text and categorical features
    X['review_comment_message'] = X['review_comment_message'].fillna('no_review')
    X['review_comment_message'] = process_texts(X['review_comment_message'])
    X['review_comment_message'] = X['review_comment_message'].replace({'no_review':'nao_reveja'}) 
    
    vectorizer_os = pickle.load(open('vectorizer_order_status.pkl','rb'))
    vectorizer_pc = pickle.load(open('vectorizer_product_category.pkl','rb'))
    payment_types = pickle.load(open('payment_types.pkl','rb'))
    
    X['payment_type'] = X['payment_type'].fillna(1)
    X['order_status'] = X['order_status'].fillna(vectorizer_os.get_feature_names()[0])
    X['product_category_name_english'] = X['product_category_name_english'].fillna(vectorizer_pc.get_feature_names()[0])
    
    X_os = vectorizer_os.transform(X['order_status'])
    X_pc = vectorizer_pc.transform(X['product_category_name_english'])
    X['payment_type'] = X['payment_type'].replace(payment_types)
    
    w2vmodel = Word2Vec.load("word2vec.model")
    tfidf_review_comments = pickle.load(open('tfidf_review_comments.pkl','rb'))
    # build a dictionary with each word as the key and its idf as the value
    tf_values = dict(zip(tfidf_review_comments.get_feature_names(), list(tfidf_review_comments.idf_)))
    tfidf_words = set(tfidf_review_comments.get_feature_names())
    glove_words = set(w2vmodel.wv.vocab.keys()) # word2vec vocabulary, as a set for fast membership tests
    
    w2v_review_comments = tfidfWord2Vector(X['review_comment_message'].values,glove_words,tfidf_words,tf_values)
    
    word_index = pickle.load(open('word_index.pkl','rb'))
#     embedding_matrix = pickle.load(open('embedding_matrix.pkl','rb'))
    encoded_text = []
    for y in X['review_comment_message']:
        encoded_text.append([word_index[w] if w in word_index else 0 for w in y.split()])
    
    # pad documents to a max length of 122 words, as the 95th percentile of review length is 122
    max_length = 122
    padded_text = pad_sequences(encoded_text, maxlen=max_length, padding='post')
    
    X = X.drop(labels=['review_comment_message','product_category_name_english','order_status'],axis=1)
    
    # scaling numerical features with the saved per-feature normalizers
    for i in num_feat:
        normalizer = pickle.load(open(i+'.pkl','rb'))
        X[i] = normalizer.transform(X[i].values.reshape(1,-1))[0]
        
    # merging our encoded categorical features with rest of the data 
    X_merge = hstack((X, X_pc, X_os, w2v_review_comments))
    X_other = hstack((X, X_pc, X_os))
    
    return X_merge, X_other, padded_text, X_os, X_pc, X
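
The delivery-time features at the start of `pre_process` are easy to verify on a toy frame. The two orders below are made up. Dates are truncated to midnight with `.dt.normalize()` in this sketch so that the columns stay in a datetime dtype and `.dt.days` is available on the differences.

```python
import pandas as pd

# Two made-up orders: the first arrives early, the second arrives late.
toy = pd.DataFrame({
    "order_approved_at": ["2018-01-01 10:00:00", "2018-01-05 09:30:00"],
    "order_delivered_customer_date": ["2018-01-08 15:00:00", "2018-01-25 12:00:00"],
    "order_estimated_delivery_date": ["2018-01-15 00:00:00", "2018-01-20 00:00:00"],
})
for col in list(toy.columns):
    toy[col] = pd.to_datetime(toy[col]).dt.normalize()

# Days promised, days actually taken, and the margin between the two.
toy["estimated_delivery_time"] = (toy["order_estimated_delivery_date"] - toy["order_approved_at"]).dt.days
toy["actual_delivery_time"] = (toy["order_delivered_customer_date"] - toy["order_approved_at"]).dt.days
toy["diff_in_delivery_time"] = toy["estimated_delivery_time"] - toy["actual_delivery_time"]

# 1 if the order arrived before the estimated date, otherwise 0.
toy["on_time_delivery"] = (
    toy["order_delivered_customer_date"] < toy["order_estimated_delivery_date"]
).astype(int)
print(toy[["estimated_delivery_time", "actual_delivery_time",
           "diff_in_delivery_time", "on_time_delivery"]])
```

A negative `diff_in_delivery_time` means the order arrived after the estimate.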

In [6]:
from keras import backend as K
# https://datascience.stackexchange.com/questions/45165/how-to-get-accuracy-f1-precision-and-recall-for-a-keras-model

def f1(y_true, y_pred):
    def recall(y_true, y_pred):
        """Recall metric.

        Only computes a batch-wise average of recall.

        Computes the recall, a metric for multi-label classification of
        how many relevant items are selected.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

    def precision(y_true, y_pred):
        """Precision metric.

        Only computes a batch-wise average of precision.

        Computes the precision, a metric for multi-label classification of
        how many selected items are relevant.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision
    precision = precision(y_true, y_pred)
    recall = recall(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
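
For intuition, the same arithmetic can be reproduced in NumPy and compared with scikit-learn on a single batch. The function name `f1_numpy`, the `eps` value and the sample arrays are assumptions made for this sketch. Keep in mind that the Keras metric is evaluated per batch, so its average over batches will generally differ from the F1-score computed over the whole dataset.

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_numpy(y_true, y_prob, eps=1e-7):
    """NumPy version of the batch-wise F1 formula: round, count, then combine."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    true_positives = np.sum(np.round(np.clip(y_true * y_prob, 0, 1)))
    predicted_positives = np.sum(np.round(np.clip(y_prob, 0, 1)))
    possible_positives = np.sum(np.round(np.clip(y_true, 0, 1)))
    precision = true_positives / (predicted_positives + eps)
    recall = true_positives / (possible_positives + eps)
    return 2 * precision * recall / (precision + recall + eps)

# Made-up labels and predicted probabilities for one batch.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.8, 0.1]

batch_f1 = f1_numpy(y_true, y_prob)
reference = f1_score(y_true, [int(p > 0.5) for p in y_prob])
print(batch_f1, reference)
```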

Using TensorFlow backend.


In [57]:
def func1(X):
    X_merge, X_other, review_padded, X_os, X_pc, x = pre_process(X)
    
    # reshaping for input to DL models
    X_merge_inp = X_merge.toarray().reshape(X_merge.shape[0],X_merge.shape[1],1)
    X_other_inp = X_other.toarray().reshape(X_other.shape[0],X_other.shape[1],1)

    # ML models
#     lr = pickle.load(open('models/logistic.pkl','rb'))
#     svm = pickle.load(open('models/svm.pkl','rb'))
#     dt = pickle.load(open('models/decision_tree.pkl','rb'))
#     rf = pickle.load(open('models/random_forest.pkl','rb'))
#     xgb = pickle.load(open('models/xgb.pkl','rb'))
#     lgb = pickle.load(open('models/lgbm.pkl','rb'))
#     voting = pickle.load(open('models/best_voting.pkl','rb'))
#     best_model = pickle.load(open('models/best_stacking.pkl','rb'))
    
    # DL models
    nn_model = load_model('models/NN_model.h5',custom_objects={'f1':f1}) # input is X_merge
    cnn_model1 = load_model('models/cnn_model1.h5',custom_objects={'f1':f1})# input is X_merge_inp
    cnn_model2 = load_model('models/cnn_model2.h5',custom_objects={'f1':f1}) # input is review_padded, X_other_inp
    lstm_model1 = load_model('models/lstm_model1.h5',custom_objects={'f1':f1}) # input is review_padded, X_os, X_pc, X 
    lstm_model2 = load_model('models/lstm_model2.h5',custom_objects={'f1':f1}) # input is review_padded, X_other_inp
    
    # stacked NN models
#     nn_soft_stacking = pickle.load(open('models/NN_soft_stacking.pkl','rb'))
    nn_hard_stacking = pickle.load(open('models/NN_hard_stacking.pkl','rb'))
    
    # IF using the NN stacking
    y_pred1 = nn_model.predict(X_merge.toarray())
    y_pred2 = cnn_model1.predict(X_merge_inp)
    y_pred3 = cnn_model2.predict([review_padded, X_other_inp])
    y_pred4 = lstm_model1.predict([review_padded, X_os.toarray(), X_pc.toarray(), x])
    y_pred5 = lstm_model2.predict([review_padded, X_other_inp])
    
#     y_pred = nn_soft_stacking.predict(np.stack((y_pred1[:,0],y_pred2[:,0],y_pred3[:,0],y_pred4[:,0],y_pred5[:,0]),axis=-1))
    y_pred = nn_hard_stacking.predict(np.stack((np.greater(y_pred1,0.5).astype(int)[:,0],
                                                np.greater(y_pred2,0.5).astype(int)[:,0],
                                                np.greater(y_pred3,0.5).astype(int)[:,0],
                                                np.greater(y_pred4,0.5).astype(int)[:,0],
                                                np.greater(y_pred5,0.5).astype(int)[:,0]),axis=-1))
# #     y_predicted = lr.predict(X_merge)
#     y_predicted = best_model.predict(X_merge)
    
    return y_pred
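
The hard-stacking step can be sketched without the saved models. In the example below the five base-model outputs are simulated as noisy copies of the labels, and a `LogisticRegression` stands in for the pickled meta-model; both are assumptions, since the notebook does not show what `NN_hard_stacking.pkl` contains.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

# Hypothetical base-model probabilities of shape (n, 1): noisy copies of the labels.
base_probs = [
    np.clip(y + rng.normal(0, 0.4, size=n), 0, 1).reshape(-1, 1)
    for _ in range(5)
]

# Hard votes: threshold each model at 0.5 and stack into one column per model.
hard_votes = np.stack(
    [np.greater(p, 0.5).astype(int)[:, 0] for p in base_probs], axis=-1
)

# Stand-in meta-model trained on the votes (fitted and scored on the same toy data).
meta = LogisticRegression().fit(hard_votes, y)
stacked_pred = meta.predict(hard_votes)
print(hard_votes.shape, (stacked_pred == y).mean())
```

Soft stacking would pass the raw probabilities instead of the thresholded votes, as in the commented-out `nn_soft_stacking` line above.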

In [58]:
def func2(X,y):
    y = y.apply(lambda x:1 if x>3 else 0)
    
    y_pred = func1(X)
    
    score = f1_score(y,y_pred) # compute the F1-score once and reuse it
    print(score)
    
    return score

In [59]:
%%time
func1(X_test.iloc[:2])

100%|██████████| 2/2 [00:00<00:00, 5637.51it/s]


CPU times: user 7.84 s, sys: 4.07 s, total: 11.9 s
Wall time: 4.54 s


array([1, 1])

In [60]:
%%time
func2(X_test,y_test)

100%|██████████| 23663/23663 [00:02<00:00, 10622.86it/s]


0.9196365029835223
CPU times: user 4min 59s, sys: 5min 4s, total: 10min 4s
Wall time: 1min 22s


0.9196365029835223

In [26]:
%%time
func2(X_test,y_test)

100%|██████████| 23663/23663 [00:02<00:00, 9812.35it/s] 


0.9259279328095367
CPU times: user 9.57 s, sys: 564 ms, total: 10.1 s
Wall time: 8.06 s


0.9259279328095367

In [11]:
%%time
func2(X_test,y_test)

100%|██████████| 23663/23663 [00:02<00:00, 9697.78it/s] 


0.9257074510017126


0.9257074510017126