## Olist E-Commerce 

This is a public Brazilian e-commerce dataset of orders placed on Olist Store. It contains information on about 100,000 orders placed between 2016 and 2018 across several marketplaces in Brazil. Its features allow an order to be viewed from multiple dimensions: order status, price, payment and freight performance, customer location, product attributes and, finally, reviews written by customers. Also included is a geolocation dataset that relates Brazilian zip codes to lat/lng coordinates.

### Libraries to use

In [75]:

import pandas as pd 
import glob
from scipy import stats

### Functions

In [76]:
def data_summary(DataFrame: pd.DataFrame) -> pd.DataFrame:
    '''
    Return a summary of a DataFrame.
    It prints the shape of the DataFrame and returns, for each column,
    its name, its data type, the number of null values,
    the number of unique values, an example value (taken from the first row)
    and the entropy of the column in bits (the average level of "information",
    "surprise", or "uncertainty" inherent in the variable's possible outcomes).
    '''
    print(f"Dataset Shape: {DataFrame.shape}")
    summary = pd.DataFrame(DataFrame.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Column'] = summary['index']
    summary = summary[['Column', 'dtypes']]
    summary['# Missing'] = DataFrame.isnull().sum().values
    summary['# Unique'] = DataFrame.nunique().values
    # Positional indexing, so this also works when the index has no label 0.
    summary['Example'] = DataFrame.iloc[0].values

    # Base-2 entropy of the relative frequencies of each column's values (value_counts skips nulls).
    for name in summary['Column'].value_counts().index:
        summary.loc[summary['Column'] == name, 'Entropy'] = round(stats.entropy(DataFrame[name].value_counts(normalize=True), base=2), 2)

    return summary
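
To make the entropy column easier to read, here is a minimal sketch of the same base-2 entropy computed by hand on toy lists. The helper `shannon_entropy_bits` and the sample values are illustrative assumptions and are not part of the notebook, which relies on `scipy.stats.entropy`.

```python
from collections import Counter
from math import log2

def shannon_entropy_bits(values):
    """Shannon entropy (base 2) of the relative frequencies of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    # Sum p * log2(1/p) over the distinct values, with p = count / total.
    return sum((c / total) * log2(total / c) for c in counts.values())

# A constant column carries no information.
constant = shannon_entropy_bits(["SP"] * 10)
# Four equally frequent values need 2 bits.
uniform4 = shannon_entropy_bits(["a", "b", "c", "d"] * 5)
# A column of n distinct values (an identifier) reaches the maximum, log2(n).
identifier = shannon_entropy_bits(range(842))

print(round(constant, 2), round(uniform4, 2), round(identifier, 2))
```

A column with 842 distinct values gives log2(842), about 9.72, which is the value reported for 'mql_id' in the closed deals summary. High entropy in an identifier therefore reflects uniqueness rather than analytical value.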
    


### Data intake

In [10]:
# All files are imported using the glob library.
# Specify the file pattern and pass it as a parameter to the glob function.
csv_files = glob.glob('Datasets/*.csv')

# Check that the files have been correctly identified.
print(csv_files)

# List that will hold the DataFrame created from each .csv file.
list_data = []

# Loop over the file names found by glob and read each one; the result is the list of DataFrames.
# Note: with decimal=',' and thousands='.', a dot is treated as a thousands separator,
# so a value written as 12.50 would be read as 1250.
for filename in csv_files:
    if filename.endswith('olist_geolocation_dataset.csv'):
        # Geolocation dataset: default separators, so latitude and longitude stay as floats.
        data = pd.read_csv(filename, encoding='UTF-8')
    else:
        data = pd.read_csv(filename, decimal=',', thousands='.', encoding='UTF-8')
    list_data.append(data)


#list_data

['Datasets\\olist_closed_deals_dataset.csv', 'Datasets\\olist_customers_dataset.csv', 'Datasets\\olist_geolocation_dataset.csv', 'Datasets\\olist_marketing_qualified_leads_dataset.csv', 'Datasets\\olist_orders_dataset.csv', 'Datasets\\olist_order_items_dataset.csv', 'Datasets\\olist_order_payments_dataset.csv', 'Datasets\\olist_order_reviews_dataset.csv', 'Datasets\\olist_products_dataset.csv', 'Datasets\\olist_sellers_dataset.csv', 'Datasets\\product_category_name_translation.csv']
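
The `decimal` and `thousands` arguments passed to `read_csv` change how numeric text is interpreted. The sketch below illustrates the effect on a hypothetical two-row file (the contents of `csv_text` are an assumption for illustration, not taken from the Olist files).

```python
import io
import pandas as pd

# Hypothetical two-row file whose numbers use a dot as the decimal separator.
csv_text = "order_id,price\nA,58.90\nB,239.90\n"

# Default separators: the dot is read as the decimal point.
default_read = pd.read_csv(io.StringIO(csv_text))

# Brazilian-style separators: the dot is treated as a thousands separator and dropped.
localized_read = pd.read_csv(io.StringIO(csv_text), decimal=',', thousands='.')

print(default_read['price'].tolist())
print(localized_read['price'].tolist())
```

If a monetary column comes out as integers after loading, it may have been read this way; whether it should be rescaled or re-read with the default separators depends on how the raw file is written.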


#### Closed deals      

In [11]:
# Display the DataFrame to confirm it was loaded successfully
list_data[0].head()
# Make a working copy as a checkpoint
closed_deals=list_data[0].copy()
# Verify the copy
closed_deals.head()

Unnamed: 0,mql_id,seller_id,sdr_id,sr_id,won_date,business_segment,lead_type,lead_behaviour_profile,has_company,has_gtin,average_stock,business_type,declared_product_catalog_size,declared_monthly_revenue
0,5420aad7fec3549a85876ba1c529bd84,2c43fb513632d29b3b58df74816f1b06,a8387c01a09e99ce014107505b92388c,4ef15afb4b2723d8f3d81e51ec7afefe,2018-02-26 19:58:54,pet,online_medium,cat,,,,reseller,,0
1,a555fb36b9368110ede0f043dfc3b9a0,bbb7d7893a450660432ea6652310ebb7,09285259593c61296eef10c734121d5b,d3d1e91a157ea7f90548eef82f1955e3,2018-05-08 20:17:59,car_accessories,industry,eagle,,,,reseller,,0
2,327174d3648a2d047e8940d7d15204ca,612170e34b97004b3ba37eae81836b4c,b90f87164b5f8c2cfa5c8572834dbe3f,6565aa9ce3178a5caf6171827af3a9ba,2018-06-05 17:27:23,home_appliances,online_big,cat,,,,reseller,,0
3,f5fee8f7da74f4887f5bcae2bafb6dd6,21e1781e36faf92725dde4730a88ca0f,56bf83c4bb35763a51c2baab501b4c67,d3d1e91a157ea7f90548eef82f1955e3,2018-01-17 13:51:03,food_drink,online_small,,,,,reseller,,0
4,ffe640179b554e295c167a2f6be528e0,ed8cb7b190ceb6067227478e48cf8dde,4b339f9567d060bcea4f5136b9f5949e,d3d1e91a157ea7f90548eef82f1955e3,2018-07-03 20:17:45,home_appliances,industry,wolf,,,,manufacturer,,0


In [105]:
#Summary of the data
data_summary(closed_deals)

Dataset Shape: (842, 14)


Unnamed: 0,Column,dtypes,# Missing,# Unique,Example,Entropy
0,mql_id,object,0,842,5420aad7fec3549a85876ba1c529bd84,9.72
1,seller_id,object,0,842,2c43fb513632d29b3b58df74816f1b06,9.72
2,sdr_id,object,0,32,a8387c01a09e99ce014107505b92388c,3.99
3,sr_id,object,0,22,4ef15afb4b2723d8f3d81e51ec7afefe,3.92
4,won_date,datetime64[ns],0,824,2018-02-26 19:58:54,9.66
5,business_segment,object,1,33,pet,4.34
6,lead_type,object,6,8,online_medium,2.43
7,lead_behaviour_profile,object,177,9,cat,1.63
8,has_company,object,779,2,,0.4
9,has_gtin,object,778,2,,0.63


This dataset lists the deals closed with new suppliers (sellers). The columns 'has_company', 'has_gtin', 'average_stock' and 'declared_product_catalog_size' have a large number of missing values. The entropy of 'has_company' and 'has_gtin' is low, so they add little information and can be discarded. In contrast, 'average_stock' and 'declared_product_catalog_size', despite their many missing values, have an entropy that indicates their data provide information worth taking into account.



In [78]:
# Are there duplicated rows?
closed_deals[closed_deals.duplicated(keep=False)]

Unnamed: 0,mql_id,seller_id,sdr_id,sr_id,won_date,business_segment,lead_type,lead_behaviour_profile,has_company,has_gtin,average_stock,business_type,declared_product_catalog_size,declared_monthly_revenue


No duplicate data observed

In [39]:
# Convert the date column to datetime
closed_deals['won_date']=pd.to_datetime(closed_deals['won_date'])

#### Customers      

In [12]:
# Display the DataFrame to confirm it was loaded successfully
list_data[1].head()
# Make a working copy as a checkpoint
custumers=list_data[1].copy()
# Verify the copy
custumers.head()

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP
2,4e7b3e00288586ebd08712fdd0374a03,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP
3,b2b6027bc5c5109e529d4dc6358b12c3,259dac757896d24d7702b9acbbff3f3c,8775,mogi das cruzes,SP
4,4f2d8ab171c80ec8364f7c12e35b23ad,345ecd01c38d18a9036ed96c73b8d066,13056,campinas,SP


In [79]:
data_summary(custumers)

Dataset Shape: (99441, 5)


Unnamed: 0,Column,dtypes,# Missing,# Unique,Example,Entropy
0,customer_id,object,0,99441,06b8999e2fba1a1fbc88172c00ba8bc7,16.6
1,customer_unique_id,object,0,96096,861eff4711a542e4b93843c6dd7febb0,16.53
2,customer_zip_code_prefix,int64,0,14994,14409,13.14
3,customer_city,object,0,4119,franca,8.17
4,customer_state,object,0,27,SP,3.08


This dataset holds the customers' information. 'customer_unique_id' has fewer unique values (96,096) than there are rows (99,441), so some customers appear more than once. The lower-entropy columns ('customer_city' and 'customer_state') are kept, since they allow analysis at the geographic level.
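
One way to inspect the repeated 'customer_unique_id' values is to count how many 'customer_id' rows each one has. The following is a minimal sketch on toy data; the frame `toy_customers` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy data (an assumption): three customer_id rows, two of which share a customer_unique_id.
toy_customers = pd.DataFrame({
    'customer_id': ['c1', 'c2', 'c3'],
    'customer_unique_id': ['u1', 'u1', 'u2'],
})

# Number of distinct customer_id values per customer_unique_id.
rows_per_person = toy_customers.groupby('customer_unique_id')['customer_id'].nunique()

# Unique customers that appear more than once.
repeated = rows_per_person[rows_per_person > 1]
print(repeated)
```

Applied to the real table, the length of `repeated` would give the number of customers that appear under more than one 'customer_id'.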

In [103]:
# Are there duplicated rows?
custumers[custumers.duplicated(keep=False)]

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state


No duplicate data observed

#### Geolocation      

In [13]:
# Display the DataFrame to confirm it was loaded successfully
list_data[2].head()
# Make a working copy as a checkpoint
geolocation=list_data[2].copy()
# Verify the copy
geolocation.head()

Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
0,1037,-23.545621,-46.639292,sao paulo,SP
1,1046,-23.546081,-46.64482,sao paulo,SP
2,1046,-23.546129,-46.642951,sao paulo,SP
3,1041,-23.544392,-46.639499,sao paulo,SP
4,1035,-23.541578,-46.641607,sao paulo,SP


In [81]:
data_summary(geolocation)

Dataset Shape: (1000163, 5)


Unnamed: 0,Column,dtypes,# Missing,# Unique,Example,Entropy
0,geolocation_zip_code_prefix,int64,0,19015,1037,13.32
1,geolocation_lat,float64,0,717360,-23.545621,19.11
2,geolocation_lng,float64,0,717613,-46.639292,19.11
3,geolocation_city,object,0,8011,sao paulo,8.73
4,geolocation_state,object,0,27,SP,3.13


This dataset provides the geographic coordinates (latitude and longitude) of the zip code prefixes that appear in the other datasets.

In [82]:
# Are there duplicated rows?
geolocation[geolocation.duplicated(keep=False)]

Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
0,1037,-23.545621,-46.639292,sao paulo,SP
1,1046,-23.546081,-46.644820,sao paulo,SP
2,1046,-23.546129,-46.642951,sao paulo,SP
6,1047,-23.546273,-46.641225,sao paulo,SP
7,1013,-23.546923,-46.634264,sao paulo,SP
...,...,...,...,...,...
1000153,99970,-28.343273,-51.873734,ciriaco,RS
1000154,99950,-28.070493,-52.011342,tapejara,RS
1000159,99900,-27.877125,-52.224882,getulio vargas,RS
1000160,99950,-28.071855,-52.014716,tapejara,RS


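The rows returned here are exact duplicates: duplicated() without a subset only flags rows that are identical in every column, including latitude and longitude. Other rows share a zip code prefix but differ slightly in their coordinates; those are distinct locations very close to each other.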
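
The difference between exact duplicates and rows that merely share a zip code prefix can be sketched on toy data as follows. The frame `toy_geo` is an assumption for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy data (an assumption): rows 0 and 1 are identical, row 2 shares only the prefix.
toy_geo = pd.DataFrame({
    'geolocation_zip_code_prefix': [1046, 1046, 1046, 1037],
    'geolocation_lat': [-23.546081, -23.546081, -23.546129, -23.545621],
    'geolocation_lng': [-46.644820, -46.644820, -46.642951, -46.639292],
})

# Rows identical in every column (rows 0 and 1).
exact_dupes = toy_geo[toy_geo.duplicated(keep=False)]

# Rows that share a zip code prefix, whatever their coordinates (rows 0, 1 and 2).
same_prefix = toy_geo[toy_geo.duplicated(subset=['geolocation_zip_code_prefix'], keep=False)]

# Keep one copy of each exact duplicate.
deduplicated = toy_geo.drop_duplicates()

print(len(exact_dupes), len(same_prefix), len(deduplicated))
```

Dropping exact duplicates keeps every distinct coordinate pair, so nearby but different locations within a prefix are preserved.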

#### Marketing Qualified Leads      

In [15]:
# Display the DataFrame to confirm it was loaded successfully
list_data[3].head()
# Make a working copy as a checkpoint
marketing_qualified_leads=list_data[3].copy()
# Verify the copy
marketing_qualified_leads.head()

Unnamed: 0,mql_id,first_contact_date,landing_page_id,origin
0,dac32acd4db4c29c230538b72f8dd87d,2018-02-01,88740e65d5d6b056e0cda098e1ea6313,social
1,8c18d1de7f67e60dbd64e3c07d7e9d5d,2017-10-20,007f9098284a86ee80ddeb25d53e0af8,paid_search
2,b4bc852d233dfefc5131f593b538befa,2018-03-22,a7982125ff7aa3b2054c6e44f9d28522,organic_search
3,6be030b81c75970747525b843c1ef4f8,2018-01-22,d45d558f0daeecf3cccdffe3c59684aa,email
4,5420aad7fec3549a85876ba1c529bd84,2018-02-21,b48ec5f3b04e9068441002a19df93c6c,organic_search


In [83]:
data_summary(marketing_qualified_leads)

Dataset Shape: (8000, 4)


Unnamed: 0,Column,dtypes,# Missing,# Unique,Example,Entropy
0,mql_id,object,0,8000,dac32acd4db4c29c230538b72f8dd87d,12.97
1,first_contact_date,datetime64[ns],0,336,2018-02-01 00:00:00,7.95
2,landing_page_id,object,0,495,88740e65d5d6b056e0cda098e1ea6313,6.11
3,origin,object,60,10,social,2.74


In [104]:
marketing_qualified_leads.origin.unique()

array(['social', 'paid_search', 'organic_search', 'email', 'unknown',
       'referral', 'direct_traffic', 'display', nan, 'other_publicities',
       'other'], dtype=object)

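This dataset lists the marketing qualified leads (MQLs): people who contacted Olist, with the date of first contact and the landing page used. Matching 'mql_id' against the closed deals dataset shows how many of them went on to purchase the services. The 'origin' column indicates the channel through which each lead came to know about Olist.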
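
As a minimal sketch of how leads could be matched to closed deals, the example below joins two toy frames on 'mql_id' and computes the share of converted leads per origin. The frames `toy_mql` and `toy_closed` and their values are assumptions for illustration.

```python
import pandas as pd

# Toy data (an assumption): four leads, three of which closed a deal.
toy_mql = pd.DataFrame({
    'mql_id': ['m1', 'm2', 'm3', 'm4'],
    'origin': ['social', 'social', 'email', 'email'],
})
toy_closed = pd.DataFrame({
    'mql_id': ['m1', 'm3', 'm4'],
    'seller_id': ['s1', 's2', 's3'],
})

# A left join keeps every lead; seller_id is null for leads that did not close.
funnel = toy_mql.merge(toy_closed, on='mql_id', how='left')
funnel['converted'] = funnel['seller_id'].notna()

# Share of leads per origin that became sellers.
conversion_by_origin = funnel.groupby('origin')['converted'].mean()
print(conversion_by_origin)
```

On the real tables, the same join would relate the 8,000 leads to the 842 closed deals.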

In [85]:
# Are there duplicated rows?
marketing_qualified_leads[marketing_qualified_leads.duplicated(keep=False)]

Unnamed: 0,mql_id,first_contact_date,landing_page_id,origin


No duplicate data observed

In [43]:
# Convert the date column to datetime
marketing_qualified_leads['first_contact_date']=pd.to_datetime(marketing_qualified_leads['first_contact_date'])

#### Orders

In [46]:
# Display the DataFrame to confirm it was loaded successfully
list_data[4].head()
# Make a working copy as a checkpoint
orders=list_data[4].copy()
# Verify the copy
orders.head()


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00


In [86]:
data_summary(orders)

Dataset Shape: (99441, 8)


Unnamed: 0,Column,dtypes,# Missing,# Unique,Example,Entropy
0,order_id,object,0,99441,e481f51cbdc54678b7cc49136f2d6af7,16.6
1,customer_id,object,0,99441,9ef432eb6251297304e76186b10a928d,16.6
2,order_status,object,0,8,delivered,0.26
3,order_purchase_timestamp,datetime64[ns],0,98875,2017-10-02 10:56:33,16.59
4,order_approved_at,datetime64[ns],160,90733,2017-10-02 11:07:15,16.41
5,order_delivered_carrier_date,datetime64[ns],1783,81018,2017-10-04 19:55:00,16.15
6,order_delivered_customer_date,datetime64[ns],2965,95664,2017-10-10 21:25:13,16.54
7,order_estimated_delivery_date,datetime64[ns],0,459,2017-10-18 00:00:00,8.47


This dataset records the purchases made by customers, with the date and time at which each order was placed, approved, picked up by the carrier and delivered, as well as the estimated delivery date.

In [87]:
# Are there duplicated rows?
orders[orders.duplicated(keep=False)]

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date


No duplicate data observed

In [52]:
#Transform date type columns 
orders.columns
orders_dates=[ 'order_purchase_timestamp', 'order_approved_at', 'order_delivered_carrier_date','order_delivered_customer_date', 'order_estimated_delivery_date']
orders[orders_dates]=orders[orders_dates].apply(pd.to_datetime)

#### Delivery time

A brief calculation estimates the percentage by which the mean delivery time would be reduced if the pick-up time of the packages (from purchase until hand-over to the carrier) were eliminated.

Mean delivery time

In [108]:
carrier_time=(orders.order_delivered_carrier_date - orders.order_purchase_timestamp).mean()
delivered_time=(orders.order_delivered_customer_date - orders.order_purchase_timestamp).mean()

In [118]:
reduction_time= 100-(delivered_time-carrier_time)/delivered_time*100
print('The percentage of time reduced is:', round(reduction_time,2),'%')

The percentage of time reduced is: 25.75 %
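
The calculation can be checked by hand on a small example. The sketch below applies the same expression to two toy orders; the frame `toy_orders` and its dates are assumptions for illustration.

```python
import pandas as pd

# Toy data (an assumption): two orders with purchase, carrier hand-over and delivery dates.
toy_orders = pd.DataFrame({
    'order_purchase_timestamp': pd.to_datetime(['2018-01-01', '2018-01-01']),
    'order_delivered_carrier_date': pd.to_datetime(['2018-01-03', '2018-01-02']),
    'order_delivered_customer_date': pd.to_datetime(['2018-01-09', '2018-01-05']),
})

# Mean time from purchase to carrier hand-over: (2 days + 1 day) / 2 = 1.5 days.
carrier_time = (toy_orders.order_delivered_carrier_date - toy_orders.order_purchase_timestamp).mean()
# Mean time from purchase to delivery: (8 days + 4 days) / 2 = 6 days.
delivered_time = (toy_orders.order_delivered_customer_date - toy_orders.order_purchase_timestamp).mean()

# Same expression as in the notebook; it simplifies to carrier_time / delivered_time * 100.
reduction_time = 100 - (delivered_time - carrier_time) / delivered_time * 100
print(round(reduction_time, 2))
```

Note that `mean()` skips missing values, so on the real data the two means are taken over different sets of orders: 'order_delivered_carrier_date' has 1,783 missing values and 'order_delivered_customer_date' has 2,965.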


#### Order Items     

In [21]:
# Display the DataFrame to confirm it was loaded successfully
list_data[5].head()
# Make a working copy as a checkpoint
order_items=list_data[5].copy()
# Verify the copy
order_items.head()


Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,5890,1329
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,23990,1993
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,19900,1787
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,1299,1279
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,19990,1814


In [88]:
data_summary(order_items)

Dataset Shape: (112650, 7)


Unnamed: 0,Column,dtypes,# Missing,# Unique,Example,Entropy
0,order_id,object,0,98666,00010242fe8c5a6d1ba2dd792cb16214,16.49
1,order_item_id,int64,0,21,1,0.72
2,product_id,object,0,32951,4244733e06e7ecb4970a6e2683c13e61,13.63
3,seller_id,object,0,3095,48436dade18ac8b2bce089ec2a041202,9.48
4,shipping_limit_date,datetime64[ns],0,93318,2017-09-19 09:45:35,16.38
5,price,int64,0,5968,5890,9.58
6,freight_value,int64,0,6999,1329,10.51


This dataset records the transactions between buyer and seller: each row is one item of an order, with its product, seller, price and freight value. It serves as the master table. The column 'order_item_id' has low entropy; since it is an identifier that numbers the items within an order, further analysis is needed to decide whether to keep or remove it.

In [89]:
# Are there duplicated rows?
order_items[order_items.duplicated(keep=False)]

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value


No duplicate data observed

In [56]:
# Convert the date column to datetime
order_items['shipping_limit_date']=pd.to_datetime(order_items['shipping_limit_date'])

In [63]:
order_items.sort_values('shipping_limit_date',  ascending=False)

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
85730,c2bb89b5c1dd978d507284be78a04cb2,2,87b92e06b320e803d334ac23966c80b1,7a241947449cc45dbfda4f9d0798d9d0,2020-04-09 22:35:08,9999,6144
85729,c2bb89b5c1dd978d507284be78a04cb2,1,87b92e06b320e803d334ac23966c80b1,7a241947449cc45dbfda4f9d0798d9d0,2020-04-09 22:35:08,9999,6144
8643,13bdf405f961a6deec817d817f5c6624,1,96ea060e41bdecc64e2de00b97068975,7a241947449cc45dbfda4f9d0798d9d0,2020-02-05 03:30:51,6999,1466
68516,9c94a4ea2f7876660fa6f1b59b69c8e6,1,282b126b2354516c5f400154398f616d,7a241947449cc45dbfda4f9d0798d9d0,2020-02-03 20:23:22,7599,1470
26104,3b61aab5de69abc1731138bd104a777f,1,6aa063e063f2ab982b471e58afe06d72,610f72e407cdd7caaa2f8167b0163fd8,2018-09-18 21:10:15,99999,2477
...,...,...,...,...,...,...,...
90368,cd3b8574c82b42fc8129f6d502690c3e,1,e2a1d45a73dc7f5a7f9236b043431b89,b499c00f28f4b7069ff6550af8c1348a,2016-10-08 10:34:01,2999,1096
84391,bfbd0f9bdef84302105ad712db648a6c,3,5a6b04657a4c5ee34285d1e4619a96b4,ecccfa2bb93b34a3bf033cc5d1dcdc69,2016-09-19 23:11:33,4499,283
84389,bfbd0f9bdef84302105ad712db648a6c,1,5a6b04657a4c5ee34285d1e4619a96b4,ecccfa2bb93b34a3bf033cc5d1dcdc69,2016-09-19 23:11:33,4499,283
84390,bfbd0f9bdef84302105ad712db648a6c,2,5a6b04657a4c5ee34285d1e4619a96b4,ecccfa2bb93b34a3bf033cc5d1dcdc69,2016-09-19 23:11:33,4499,283


Some dates are inconsistent with the rest of the data (shipping limit dates in 2020, while the orders span 2016 to 2018); these are possible outliers.
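
A simple way to isolate such rows is to filter on a cutoff date. The sketch below uses toy data and a hypothetical cutoff of 2019-01-01; both are assumptions for illustration, and the appropriate threshold for the real data would need to be decided during the outlier analysis.

```python
import pandas as pd

# Toy data (an assumption): three items, one with a shipping limit date in 2020.
toy_items = pd.DataFrame({
    'order_id': ['o1', 'o2', 'o3'],
    'shipping_limit_date': pd.to_datetime([
        '2017-09-19 09:45:35',
        '2018-08-15 10:10:18',
        '2020-04-09 22:35:08',
    ]),
})

# Hypothetical cutoff: the orders are described as spanning 2016 to 2018.
cutoff = pd.Timestamp('2019-01-01')

# Rows whose shipping limit date falls on or after the cutoff.
suspicious = toy_items[toy_items['shipping_limit_date'] >= cutoff]
print(suspicious)
```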

#### Order Payments      

In [22]:
# Display the DataFrame to confirm it was loaded successfully
list_data[6].head()
# Make a working copy as a checkpoint
order_payments=list_data[6].copy()
# Verify the copy
order_payments.head()

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
0,b81ef226f3fe1789b1e8b2acac839d17,1,credit_card,8,9933
1,a9810da82917af2d9aefd1278f1dcfa0,1,credit_card,1,2439
2,25e8ea4e93396b6fa0d3dd708e76c1bd,1,credit_card,1,6571
3,ba78997921bbcdc1373bb41e913ab953,1,credit_card,8,10778
4,42fdf880ba16b47b59251dd489d4441a,1,credit_card,2,12845


In [90]:
data_summary(order_payments)

Dataset Shape: (103886, 5)


Unnamed: 0,Column,dtypes,# Missing,# Unique,Example,Entropy
0,order_id,object,0,99440,b81ef226f3fe1789b1e8b2acac839d17,16.56
1,payment_sequential,int64,0,29,1,0.34
2,payment_type,object,0,5,credit_card,1.1
3,payment_installments,int64,0,24,8,2.44
4,payment_value,int64,0,29077,9933,13.88


The dataset provides information on the payment methods used in the transactions. The column 'payment_sequential' has low entropy; further analysis is needed to decide whether to keep or remove it.
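
To help decide what to do with 'payment_sequential', the payments can be aggregated per order. The sketch below assumes that 'payment_sequential' numbers the payments within an order (an assumption, suggested by 'order_id' having fewer unique values than rows); the frame `toy_payments` is illustrative.

```python
import pandas as pd

# Toy data (an assumption): order o1 is paid with two payments, order o2 with one.
toy_payments = pd.DataFrame({
    'order_id': ['o1', 'o1', 'o2'],
    'payment_sequential': [1, 2, 1],
    'payment_type': ['credit_card', 'voucher', 'credit_card'],
    'payment_value': [5000, 1500, 2439],
})

# One row per order: number of payments and total amount paid.
per_order = toy_payments.groupby('order_id').agg(
    n_payments=('payment_sequential', 'max'),
    total_value=('payment_value', 'sum'),
)
print(per_order)
```

If the column behaves this way on the real data, it is needed to reconstruct order totals even though its entropy is low.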

In [91]:
# Are there duplicated rows?
order_payments[order_payments.duplicated(keep=False)]

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value


No duplicate data observed

#### Order Reviews     

In [23]:
# Display the DataFrame to confirm it was loaded successfully
list_data[7].head()
# Make a working copy as a checkpoint
order_reviews=list_data[7].copy()
# Verify the copy
order_reviews.head()

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53


In [92]:
data_summary(order_reviews)

Dataset Shape: (99224, 7)


Unnamed: 0,Column,dtypes,# Missing,# Unique,Example,Entropy
0,review_id,object,0,98410,7bc2406110b926393aa56f80a40eba40,16.58
1,order_id,object,0,98673,73fc7af87114b39712e6da79b0a377eb,16.59
2,review_score,int64,0,5,4,1.73
3,review_comment_title,object,87656,4527,,9.92
4,review_comment_message,object,58247,36159,,14.8
5,review_creation_date,datetime64[ns],0,636,2018-01-18 00:00:00,8.77
6,review_answer_timestamp,datetime64[ns],0,98248,2018-01-18 21:46:59,16.58


The dataset contains the reviews written by the buyers. The columns 'review_comment_title' and 'review_comment_message' have a high percentage of null values (87,656 and 58,247 of 99,224 rows); however, the information they do contain is useful for customer sentiment analysis. The null values can be replaced with 'No comments'.
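
The replacement of null comments can be sketched as follows. The frame `toy_reviews` and its values are assumptions for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy data (an assumption): one review without text and one with a title and a message.
toy_reviews = pd.DataFrame({
    'review_score': [4, 5],
    'review_comment_title': [None, 'Recomendo'],
    'review_comment_message': [None, 'Recebi bem antes do prazo estipulado.'],
})

# Fill only the two text columns, leaving the other columns untouched.
text_columns = ['review_comment_title', 'review_comment_message']
toy_reviews[text_columns] = toy_reviews[text_columns].fillna('No comments')
print(toy_reviews)
```

Restricting `fillna` to the text columns avoids writing a string into numeric or date columns by accident.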

In [64]:
#Transform date type columns 
order_reviews.columns

Index(['review_id', 'order_id', 'review_score', 'review_comment_title',
       'review_comment_message', 'review_creation_date',
       'review_answer_timestamp'],
      dtype='object')

In [119]:
order_reviews[['review_creation_date','review_answer_timestamp']]=order_reviews[['review_creation_date','review_answer_timestamp']].apply(pd.to_datetime)

In [120]:
order_reviews[order_reviews.duplicated(keep=False)]

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp


No duplicate data observed

#### Products      

In [25]:
# Display the DataFrame to confirm it was loaded successfully
list_data[8].head()
# Make a working copy as a checkpoint
products=list_data[8].copy()
# Verify the copy
products.head()

Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm
0,1e9e8ef04dbcff4541ed26657ea517e5,perfumaria,40.0,287.0,1.0,225.0,16.0,10.0,14.0
1,3aa071139cb16b67ca9e5dea641aaa2f,artes,44.0,276.0,1.0,1000.0,30.0,18.0,20.0
2,96bd76ec8810374ed1b65e291975717f,esporte_lazer,46.0,250.0,1.0,154.0,18.0,9.0,15.0
3,cef67bcfe19066a932b7673e239eb23d,bebes,27.0,261.0,1.0,371.0,26.0,4.0,26.0
4,9dc1a7de274444849c219cff195d0b71,utilidades_domesticas,37.0,402.0,4.0,625.0,20.0,17.0,13.0


In [93]:
data_summary(products)

Dataset Shape: (32951, 9)


Unnamed: 0,Column,dtypes,# Missing,# Unique,Example,Entropy
0,product_id,object,0,32951,1e9e8ef04dbcff4541ed26657ea517e5,15.01
1,product_category_name,object,610,73,perfumaria,4.8
2,product_name_lenght,float64,610,66,40.0,5.09
3,product_description_lenght,float64,610,2960,287.0,10.72
4,product_photos_qty,float64,610,19,1.0,2.16
5,product_weight_g,float64,2,2204,225.0,7.56
6,product_length_cm,float64,2,99,16.0,5.06
7,product_height_cm,float64,2,102,10.0,5.15
8,product_width_cm,float64,2,95,14.0,4.85


The dataset contains the product information. The columns 'product_category_name', 'product_name_lenght', 'product_description_lenght' and 'product_photos_qty' all have the same number of missing values (610), and the weight and dimension columns have 2 each. This repetition suggests that the values are missing in the same rows, so a deeper analysis of the missing data is needed to look for patterns.
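
One way to check whether the missing values fall in the same rows is to count, per row, how many of the four descriptive columns are null. The sketch below uses toy data; the frame `toy_products` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy data (an assumption): products p2 and p3 lack all four descriptive fields.
toy_products = pd.DataFrame({
    'product_id': ['p1', 'p2', 'p3'],
    'product_category_name': ['perfumaria', None, None],
    'product_name_lenght': [40.0, None, None],
    'product_description_lenght': [287.0, None, None],
    'product_photos_qty': [1.0, None, None],
})

descriptive = ['product_category_name', 'product_name_lenght',
               'product_description_lenght', 'product_photos_qty']

# Number of missing descriptive fields in each row.
missing_count = toy_products[descriptive].isna().sum(axis=1)

# Rows missing all four fields versus rows missing at least one.
all_missing = (missing_count == len(descriptive)).sum()
some_missing = (missing_count > 0).sum()
print(all_missing, some_missing)
```

If the two counts are equal on the real data, the missing values occur together in the same rows and can be handled as a single group.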

In [94]:
# Are there duplicated rows?
products[products.duplicated(keep=False)]

Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm


No duplicate data observed

#### Sellers      

In [26]:
# Display the DataFrame to confirm it was loaded successfully
list_data[9].head()
# Make a working copy as a checkpoint
sellers=list_data[9].copy()
# Verify the copy
sellers.head()

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista,SP


In [95]:
data_summary(sellers)

Dataset Shape: (3095, 4)


Unnamed: 0,Column,dtypes,# Missing,# Unique,Example,Entropy
0,seller_id,object,0,3095,3442f8959a84dea7ee197c632cb2df15,11.6
1,seller_zip_code_prefix,int64,0,2246,13023,10.9
2,seller_city,object,0,611,campinas,6.89
3,seller_state,object,0,23,SP,2.15


The dataset contains information on the sellers. No issues were found.

In [96]:
# Are there duplicated rows?
sellers[sellers.duplicated(keep=False)]

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state


No duplicate data observed

#### Product Category Name Translation

In [27]:
# Display the DataFrame to confirm it was loaded successfully
list_data[10].head()
# Make a working copy as a checkpoint
product_category_name_translation=list_data[10].copy()
# Verify the copy
product_category_name_translation.head()

Unnamed: 0,product_category_name,product_category_name_english
0,beleza_saude,health_beauty
1,informatica_acessorios,computers_accessories
2,automotivo,auto
3,cama_mesa_banho,bed_bath_table
4,moveis_decoracao,furniture_decor


In [97]:
data_summary(product_category_name_translation)

Dataset Shape: (71, 2)


Unnamed: 0,Column,dtypes,# Missing,# Unique,Example,Entropy
0,product_category_name,object,0,71,beleza_saude,6.15
1,product_category_name_english,object,0,71,health_beauty,6.15


This dataset provides the translation of the product category names from Portuguese to English. No issues were found. The number of rows (71) matches the number of unique values in both columns, so there are no duplicates.
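
A minimal sketch of how the translation could be attached to the products is shown below. The frames `toy_products` and `toy_translation` are assumptions for illustration, and 'categoria_sem_traducao' is a made-up category used to show an unmatched row.

```python
import pandas as pd

# Toy data (an assumption): three products, one with a category missing from the translation table.
toy_products = pd.DataFrame({
    'product_id': ['p1', 'p2', 'p3'],
    'product_category_name': ['beleza_saude', 'automotivo', 'categoria_sem_traducao'],
})
toy_translation = pd.DataFrame({
    'product_category_name': ['beleza_saude', 'automotivo'],
    'product_category_name_english': ['health_beauty', 'auto'],
})

# A left join keeps every product; the English name is null when no translation exists.
translated = toy_products.merge(toy_translation, on='product_category_name', how='left')

# Products whose category has no translation.
untranslated = translated[translated['product_category_name_english'].isna()]
print(translated)
```

This check may be worth running on the real data, since the products table reports 73 unique categories while the translation table has 71 rows.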

After a preliminary review of the datasets, it is concluded that all tables are useful for the KPIs presented, but further analysis is required to handle outliers, null data and duplicates.