# Lito - Case

<img align="left" width="80" height="200" src="https://img.shields.io/badge/python-v3.6-blue.svg">
<br>

## Table of contents

1. [Introduction](#Introduction)
2. [Problem Statement](#Problem-Statement)
3. [Import Data](#Import-Data)
4. [Data Cleaning](#Data-Cleaning)
5. [Data Exploration](#Data-Exploration)


In [1]:
import itertools
import math
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Notebook settings
%matplotlib inline
%autosave 30

# Display every row and column when printing dataframes
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Plot styling
sns.set(color_codes=True)
plt.style.use('seaborn')

# Silence library warnings to keep the output readable
warnings.filterwarnings("ignore")

Autosaving every 30 seconds


In [2]:
file = 'orders-6-months-clean.csv'
df = pd.read_csv(file)
df.head()
df.tail()
print('(number of observations, number of features) =', df.shape)

Unnamed: 0,Origin,Sequence,Creation Date,Client Document,State,City,Neighborhood,Carrier,Delivery Deadline,Status,Utmi,Payment System Name,Installments,ID_SKU,Category Ids Sku,SKU Value,SKU Selling Price,SKU Total Price,Shipping List Price,Shipping Value,Total Value,Discounts Totals
0,B2C Channel,614743,2018-02-28 02:01:51Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,126755,/3/42/,199,199,199,9.3,0.0,438.0,0.0
1,B2C Channel,614743,2018-02-28 02:01:51Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,116363,/3/36/,239,239,239,9.3,0.0,438.0,0.0
2,B2C Channel,614746,2018-02-28 02:09:26Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,126755,/3/42/,199,199,199,9.3,0.0,438.0,0.0
3,B2C Channel,614746,2018-02-28 02:09:26Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,116363,/3/36/,239,239,239,9.3,0.0,438.0,0.0
4,B2C Channel,614749,2018-02-28 02:26:18Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,126755,/3/42/,199,199,199,9.3,0.0,438.0,0.0


Unnamed: 0,Origin,Sequence,Creation Date,Client Document,State,City,Neighborhood,Carrier,Delivery Deadline,Status,Utmi,Payment System Name,Installments,ID_SKU,Category Ids Sku,SKU Value,SKU Selling Price,SKU Total Price,Shipping List Price,Shipping Value,Total Value,Discounts Totals
18908,B2C Channel,658033,2018-08-30 19:05:09Z,10619214767,RJ,Duque de Caxias,Centro,Retirada em Loja (188537b),6bd,Ready for shipping,cpc - google - - brand136193,/1/11/,25,0,25,0,0,0,-50.0,,,
18909,B2C Channel,658033,2018-08-30 19:05:09Z,10619214767,RJ,Duque de Caxias,Centro,Retirada em Loja (188537b),6bd,Ready for shipping,cpc - google - - brand120269,/1/11/,25,0,25,0,0,0,-50.0,,,
18910,B2C Channel,658042,2018-08-30 19:50:37Z,8570203748,RJ,Teresópolis,Várzea,Transportadora,8bd,Payment pending,cpc - google - - brand,Mastercard,4,131834,/3/35/,199,199,199,26.58,26.58,225.58,0.0
18911,B2C Channel,658045,2018-08-30 19:55:28Z,15506641745,RJ,Rio de Janeiro,Botafogo,Retirada em Loja (18fd71a),6bd,Payment pending,- - -,Mastercard,5,131837,/3/35/,199,199,199,0.0,0.0,358.0,0.0
18912,B2C Channel,658045,2018-08-30 19:55:28Z,15506641745,RJ,Rio de Janeiro,Botafogo,Retirada em Loja (18fd71a),6bd,Payment pending,- - -,Mastercard,5,131434,/3/,159,159,159,0.0,0.0,358.0,0.0


(number of observations, number of features) = (18913, 22)


### Keeping a copy of the original dataframe `df`

In [3]:
raw = df.copy()

Taking a look at the data:

In [4]:
print('(number of observations, number of features) =', df.shape)

(number of observations, number of features) = (18913, 22)


Tidying up the names of the columns:

In [5]:
print(list(df.columns))

['Origin', 'Sequence', 'Creation Date', 'Client Document', 'State', 'City', 'Neighborhood', 'Carrier', 'Delivery Deadline', 'Status', 'Utmi', 'Payment System Name', 'Installments', 'ID_SKU', 'Category Ids Sku', 'SKU Value', 'SKU Selling Price', 'SKU Total Price', 'Shipping List Price', 'Shipping Value', 'Total Value', 'Discounts Totals']


In [7]:
df.columns = [col.lower().replace('-','_').replace(' ','_') for col in df.columns]

In [8]:
df.head()

Unnamed: 0,origin,sequence,creation_date,client_document,state,city,neighborhood,carrier,delivery_deadline,status,utmi,payment_system_name,installments,id_sku,category_ids_sku,sku_value,sku_selling_price,sku_total_price,shipping_list_price,shipping_value,total_value,discounts_totals
0,B2C Channel,614743,2018-02-28 02:01:51Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,126755,/3/42/,199,199,199,9.3,0.0,438.0,0.0
1,B2C Channel,614743,2018-02-28 02:01:51Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,116363,/3/36/,239,239,239,9.3,0.0,438.0,0.0
2,B2C Channel,614746,2018-02-28 02:09:26Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,126755,/3/42/,199,199,199,9.3,0.0,438.0,0.0
3,B2C Channel,614746,2018-02-28 02:09:26Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,116363,/3/36/,239,239,239,9.3,0.0,438.0,0.0
4,B2C Channel,614749,2018-02-28 02:26:18Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,126755,/3/42/,199,199,199,9.3,0.0,438.0,0.0


## Checking memory usage
[[go back to the top]](#Table-of-contents)

Checking memory usage is useful when `pandas` is dealing with very large datasets. This dataset is not particularly large, but I will optimize it anyway and reuse the same code in the second part of the challenge.

Below, passing `memory_usage='deep'` to the method `info()` makes `pandas` inspect the contents of `object` columns, which gives a more [accurate](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) estimate than the default.

In [9]:
df.info(memory_usage='deep') 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18913 entries, 0 to 18912
Data columns (total 22 columns):
origin                 18913 non-null object
sequence               18911 non-null object
creation_date          18910 non-null object
client_document        18906 non-null object
state                  18907 non-null object
city                   18903 non-null object
neighborhood           18902 non-null object
carrier                18902 non-null object
delivery_deadline      18899 non-null object
status                 18899 non-null object
utmi                   18899 non-null object
payment_system_name    18899 non-null object
installments           18385 non-null object
id_sku                 18385 non-null object
category_ids_sku       18116 non-null object
sku_value              18116 non-null object
sku_selling_price      18116 non-null object
sku_total_price        18116 non-null object
shipping_list_price    18115 non-null float64
shipping_value         18113 non-nu
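
To illustrate what `deep` changes, here is a minimal sketch on a toy dataframe. The data is made up for illustration and is not taken from the orders file. For `object` columns the shallow figure only counts the pointers, while the deep figure also counts the Python strings they point to.

```python
import pandas as pd

# Toy frame (hypothetical data): one numeric column, one string column
toy = pd.DataFrame({
    "shipping_value": [9.3, 0.0, 26.58],
    "city": pd.Series(
        ["Rio de Janeiro", "Duque de Caxias", "Teresopolis"], dtype="object"
    ),
})

shallow = toy.memory_usage(deep=False)  # object columns: pointer size only
deep = toy.memory_usage(deep=True)      # object columns: includes the strings

# Side-by-side comparison in bytes, one row per column
print(pd.concat([shallow, deep], axis=1, keys=["shallow", "deep"]))
```

The numeric column reports the same size either way; only the `object` column grows under `deep=True`.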

### Reducing memory usage

Though the dataframe's memory usage is relatively low, it can be reduced. Following this [guide](https://www.dataquest.io/blog/pandas-big-data/), let us first define a helper function that reports the total memory usage. It assumes the input to be a `DataFrame`.

In [10]:
def compute_memory(x):
    # Sum the memory usage of all columns, converted from bytes to MB
    usage_mb = x.memory_usage(deep=True).sum() / (1024 ** 2)
    return "Total usage is {} MB".format(usage_mb)

### Downcasting to float32

In [12]:
df_floats = df.select_dtypes(include=['float'])
converted_float = df_floats.apply(pd.to_numeric, downcast='float')
compare_floats = pd.concat([df_floats.dtypes, converted_float.dtypes],axis=1)
compare_floats.columns = ['before','after']
compare_floats.head()

Unnamed: 0,before,after
shipping_list_price,float64,float32
shipping_value,float64,float32
total_value,float64,float32
discounts_totals,float64,float32
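
As a small illustration of the trade-off behind this step, the sketch below downcasts a toy series (hypothetical values, chosen to resemble the prices above). It halves the storage per value, at the cost of float32 precision, which is roughly 7 significant digits.

```python
import numpy as np
import pandas as pd

prices = pd.Series([9.3, 26.58, 225.58, np.nan])  # float64 by default
small = pd.to_numeric(prices, downcast="float")    # smallest float dtype that fits

print(prices.dtype, "->", small.dtype)
print(prices.memory_usage(index=False), "->",
      small.memory_usage(index=False), "bytes")

# The values are close but not bit-identical after downcasting
max_err = (prices - small.astype("float64")).abs().max()
print("largest absolute difference:", max_err)
```

For monetary amounts of this size the difference is far below one cent, but it is worth checking before downcasting columns that will be summed over many rows.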


### Checking the memory gain:

In [13]:
optimized_df = df.copy()
optimized_df[converted_float.columns] = converted_float
df = optimized_df.copy()
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18913 entries, 0 to 18912
Data columns (total 22 columns):
origin                 18913 non-null object
sequence               18911 non-null object
creation_date          18910 non-null object
client_document        18906 non-null object
state                  18907 non-null object
city                   18903 non-null object
neighborhood           18902 non-null object
carrier                18902 non-null object
delivery_deadline      18899 non-null object
status                 18899 non-null object
utmi                   18899 non-null object
payment_system_name    18899 non-null object
installments           18385 non-null object
id_sku                 18385 non-null object
category_ids_sku       18116 non-null object
sku_value              18116 non-null object
sku_selling_price      18116 non-null object
sku_total_price        18116 non-null object
shipping_list_price    18115 non-null float32
shipping_value         18113 non-nu

We see that there are two types of data: `float` and `object`.

In [15]:
df_obj = df.select_dtypes(include=['object']).copy()
df_obj.dtypes

origin                 object
sequence               object
creation_date          object
client_document        object
state                  object
city                   object
neighborhood           object
carrier                object
delivery_deadline      object
status                 object
utmi                   object
payment_system_name    object
installments           object
id_sku                 object
category_ids_sku       object
sku_value              object
sku_selling_price      object
sku_total_price        object
dtype: object

In [16]:
converted_obj = pd.DataFrame()
converted_obj.head()

In [17]:
compare_obj = pd.concat([df_obj.dtypes,converted_obj.dtypes],axis=1)
compare_obj.columns = ['before','after']
compare_obj.apply(pd.Series.value_counts)

Unnamed: 0,before,after
object,18,


In [18]:
optimized_df[converted_obj.columns] = converted_obj
compute_memory(optimized_df)

'Total usage is 22.045705795288086 MB'
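
Note that `converted_obj` is empty above, so the `object` columns are left as they are. A common way to shrink them is to store low-cardinality columns as `category`. The sketch below shows that idea on toy data. The function name `objects_to_category`, the `max_unique_ratio` parameter, and its 0.5 default are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

def objects_to_category(frame, max_unique_ratio=0.5):
    """Return a copy where low-cardinality object columns become category."""
    out = frame.copy()
    if len(out) == 0:
        return out
    for col in out.select_dtypes(include=["object"]).columns:
        # Convert only when fewer than max_unique_ratio of the values are distinct
        if out[col].nunique() / len(out) < max_unique_ratio:
            out[col] = out[col].astype("category")
    return out

# Toy data (hypothetical): 'state' repeats a lot, 'client_document' does not
toy = pd.DataFrame({
    "state": pd.Series(
        ["RJ", "RJ", "SP", "RJ", "MG", "RJ", "RJ", "SP"], dtype="object"
    ),
    "client_document": pd.Series(list("abcdefgh"), dtype="object"),
})

slim = objects_to_category(toy)
print(slim.dtypes)
```

Columns with mostly unique values, such as identifiers, are skipped because a categorical would store every value plus the integer codes.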

In [19]:
df = optimized_df.copy()
df.to_csv('optimized_df.csv')

## Data Cleaning
[[go back to the top]](#Table-of-contents)

### Some of the columns are not relevant, so I will drop them.

In [23]:
df.head(2)

Unnamed: 0,origin,sequence,creation_date,client_document,state,city,neighborhood,carrier,delivery_deadline,status,utmi,payment_system_name,installments,id_sku,category_ids_sku,sku_value,sku_selling_price,sku_total_price,shipping_list_price,shipping_value,total_value,discounts_totals
0,B2C Channel,614743,2018-02-28 02:01:51Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,126755,/3/42/,199,199,199,9.3,0.0,438.0,0.0
1,B2C Channel,614743,2018-02-28 02:01:51Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,116363,/3/36/,239,239,239,9.3,0.0,438.0,0.0


In [33]:
for col in df.columns:
    print(df[col].value_counts().head(3),'\n')

B2C Channel                                                18739
Marketplace                                                  159
Vtex.Commerce.Oms.Helper.ReadWriteCsv+CsvRowB2C Channel        4
Name: origin, dtype: int64 

624290    24
653493    23
645136    21
Name: sequence, dtype: int64 

2018-04-17 05:46:22Z    24
2018-08-07 23:02:38Z    23
2018-07-12 22:19:55Z    21
Name: creation_date, dtype: int64 

11153171724    76
9057971674     57
4333596173     54
Name: client_document, dtype: int64 

RJ    11521
SP     1710
MG     1239
Name: state, dtype: int64 

Rio de Janeiro    7435
São Paulo          812
Nova Iguaçu        683
Name: city, dtype: int64 

Centro             2213
Tijuca              730
Barra da Tijuca     467
Name: neighborhood, dtype: int64 

Transportadora    5620
Sedex             4715
Ciclo Verde       2974
Name: carrier, dtype: int64 

4bd    4072
2bd    3251
6bd    2135
Name: delivery_deadline, dtype: int64 

Invoiced              12449
Canceled               5937
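
One possible way to turn the inspection above into a shortlist of columns to drop is to measure how much of each column is taken up by its most frequent value. This is only a sketch on toy data. The helper `dominant_share` and the 0.8 cut-off are hypothetical choices, and which columns are actually irrelevant remains a judgment call.

```python
import pandas as pd

def dominant_share(frame):
    """Share of rows taken by the most frequent value in each column."""
    return frame.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )

# Toy data (hypothetical): 'origin' is almost constant, 'state' is not
toy = pd.DataFrame({
    "origin": ["B2C Channel"] * 9 + ["Marketplace"],
    "state": ["RJ"] * 5 + ["SP"] * 3 + ["MG"] * 2,
})

shares = dominant_share(toy)
candidates = shares[shares > 0.8].index.tolist()  # near-constant columns
print(shares)
print("candidates to drop:", candidates)
```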

In [38]:
df.city.value_counts()

Rio de Janeiro                 7435
São Paulo                       812
Nova Iguaçu                     683
Niterói                         609
Duque de Caxias                 525
Belém                           490
Aracaju                         484
Brasília                        351
Belo Horizonte                  295
Fortaleza                       260
São Gonçalo                     258
Juiz de Fora                    254
Salvador                        224
São João de Meriti              221
Recife                          190
Vitória                         173
Campos dos Goytacazes           150
Vila Velha                      131
Macaé                           119
RIO DE JANEIRO                  104
Teresina                         97
Goiânia                          95
Armação dos Búzios               88
Curitiba                         86
Mesquita                         85
Santos                           81
Porto Alegre                     81
Maricá

In [40]:
import unidecode

# Transliterate accented characters to plain ASCII (e.g. "ã" -> "a").
# Note: astype(str) also turns missing values into the string 'nan'.
df['city'] = df['city'].astype(str).map(unidecode.unidecode)
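
The city counts above also list `RIO DE JANEIRO` separately from `Rio de Janeiro`. The sketch below shows one way to merge such variants while keeping missing values missing. It swaps in the standard-library `unicodedata` module instead of `unidecode`, and the helper name `strip_accents` and the toy values are assumptions for illustration.

```python
import unicodedata
import pandas as pd

def strip_accents(text):
    """Drop combining marks after NFKD decomposition, e.g. 'São' -> 'Sao'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Toy data (hypothetical), including a missing value
cities = pd.Series(
    ["São Paulo", "RIO DE JANEIRO", "Rio de Janeiro", None], dtype="object"
)

# Only touch real strings, so missing values are not turned into 'nan'
normalized = cities.map(
    lambda c: strip_accents(c).upper() if isinstance(c, str) else c
)
print(normalized.value_counts(dropna=False))
```

Upper-casing is used here only as a comparison key; a display-friendly spelling could be mapped back afterwards.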

In [25]:
print(df.columns.tolist())

['origin', 'sequence', 'creation_date', 'client_document', 'state', 'city', 'neighborhood', 'carrier', 'delivery_deadline', 'status', 'utmi', 'payment_system_name', 'installments', 'id_sku', 'category_ids_sku', 'sku_value', 'sku_selling_price', 'sku_total_price', 'shipping_list_price', 'shipping_value', 'total_value', 'discounts_totals']


In [41]:
df.head()

Unnamed: 0,origin,sequence,creation_date,client_document,state,city,neighborhood,carrier,delivery_deadline,status,utmi,payment_system_name,installments,id_sku,category_ids_sku,sku_value,sku_selling_price,sku_total_price,shipping_list_price,shipping_value,total_value,discounts_totals
0,B2C Channel,614743,2018-02-28 02:01:51Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,126755,/3/42/,199,199,199,9.3,0.0,438.0,0.0
1,B2C Channel,614743,2018-02-28 02:01:51Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,116363,/3/36/,239,239,239,9.3,0.0,438.0,0.0
2,B2C Channel,614746,2018-02-28 02:09:26Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,126755,/3/42/,199,199,199,9.3,0.0,438.0,0.0
3,B2C Channel,614746,2018-02-28 02:09:26Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,116363,/3/36/,239,239,239,9.3,0.0,438.0,0.0
4,B2C Channel,614749,2018-02-28 02:26:18Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,126755,/3/42/,199,199,199,9.3,0.0,438.0,0.0


In [42]:
df.dtypes

origin                  object
sequence                object
creation_date           object
client_document         object
state                   object
city                    object
neighborhood            object
carrier                 object
delivery_deadline       object
status                  object
utmi                    object
payment_system_name     object
installments            object
id_sku                  object
category_ids_sku        object
sku_value               object
sku_selling_price       object
sku_total_price         object
shipping_list_price    float32
shipping_value         float32
total_value            float32
discounts_totals       float32
dtype: object
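
Several columns that look numeric (for example `installments` or the SKU prices) are stored as `object`, which usually means some rows hold non-numeric text. A minimal sketch of coercing such columns follows, on toy data. The toy values, the new column name `delivery_deadline_days`, and the reading of the `bd` suffix as business days are assumptions.

```python
import pandas as pd

# Toy data (hypothetical): a stray text value and a missing value
toy = pd.DataFrame({
    "installments": pd.Series(["5", "4", "Mastercard", None], dtype="object"),
    "delivery_deadline": pd.Series(["12bd", "6bd", "8bd", None], dtype="object"),
})

# Values that cannot be parsed become NaN instead of raising an error
toy["installments"] = pd.to_numeric(toy["installments"], errors="coerce")

# Assumption: 'bd' is a unit suffix, so keep only the leading digits
days = toy["delivery_deadline"].str.extract(r"^(\d+)bd$", expand=False)
toy["delivery_deadline_days"] = pd.to_numeric(days, errors="coerce")

print(toy.dtypes)
print(toy)
```

Counting the NaN values introduced by the coercion is a quick way to see how many rows were malformed.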

In [47]:
df.isnull().any()
df.shape
df.dropna(inplace=True)
df.shape

origin                 False
sequence                True
creation_date           True
client_document         True
state                   True
city                   False
neighborhood            True
carrier                 True
delivery_deadline       True
status                  True
utmi                    True
payment_system_name     True
installments            True
id_sku                  True
category_ids_sku        True
sku_value               True
sku_selling_price       True
sku_total_price         True
shipping_list_price     True
shipping_value          True
total_value             True
discounts_totals        True
dtype: bool

(18913, 22)

(18113, 22)
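
`dropna()` with no arguments removes a row as soon as any column is missing. As a sketch of a gentler alternative, the example below counts the gaps per column and then drops rows only when selected key columns are missing. The toy data and the choice of key columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): each of rows 1-3 has exactly one gap
toy = pd.DataFrame({
    "sequence": ["614743", "614746", None, "658045"],
    "total_value": [438.0, np.nan, 225.58, 358.0],
    "utmi": ["- - -", "cpc - google", "cpc - google", None],
})

missing_per_column = toy.isnull().sum()  # number of gaps in each column
all_complete = toy.dropna()              # row dropped if any column is missing
key_complete = toy.dropna(subset=["sequence", "total_value"])  # key columns only

print(missing_per_column)
print(len(toy), len(all_complete), len(key_complete))
```

Here the row whose only gap is in `utmi` survives the second filter, which keeps more data when that column is not needed for the analysis.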

In [48]:
df.isnull().any()

origin                 False
sequence               False
creation_date          False
client_document        False
state                  False
city                   False
neighborhood           False
carrier                False
delivery_deadline      False
status                 False
utmi                   False
payment_system_name    False
installments           False
id_sku                 False
category_ids_sku       False
sku_value              False
sku_selling_price      False
sku_total_price        False
shipping_list_price    False
shipping_value         False
total_value            False
discounts_totals       False
dtype: bool

In [49]:
df.head()

Unnamed: 0,origin,sequence,creation_date,client_document,state,city,neighborhood,carrier,delivery_deadline,status,utmi,payment_system_name,installments,id_sku,category_ids_sku,sku_value,sku_selling_price,sku_total_price,shipping_list_price,shipping_value,total_value,discounts_totals
0,B2C Channel,614743,2018-02-28 02:01:51Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,126755,/3/42/,199,199,199,9.3,0.0,438.0,0.0
1,B2C Channel,614743,2018-02-28 02:01:51Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,116363,/3/36/,239,239,239,9.3,0.0,438.0,0.0
2,B2C Channel,614746,2018-02-28 02:09:26Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,126755,/3/42/,199,199,199,9.3,0.0,438.0,0.0
3,B2C Channel,614746,2018-02-28 02:09:26Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,116363,/3/36/,239,239,239,9.3,0.0,438.0,0.0
4,B2C Channel,614749,2018-02-28 02:26:18Z,XYA87A9X7YX,RJ,Duque de Caxias,Parque Duque,Pac,12bd,Canceled,- - -,Visa,5,126755,/3/42/,199,199,199,9.3,0.0,438.0,0.0
