![mercado-libre](https://ml-challenge.mercadolibre.com/static/images/logo-mercado-libre_en.png)

# MeLi Data Challenge 2021
This notebook is my attempt at the 2021 Mercado Libre Data Challenge.

## The Challenge
Build a model to forecast item inventory days based on Mercado Libre historical data.

## Repository
This notebook is hosted in this repository: [github.com/matheusccouto/meli-data-challenge-2021](https://github.com/matheusccouto/meli-data-challenge-2021)

Check out the different branches to see all the approaches tested.

## Load Data

### Download data
Download challenge data from [ml-challenge.mercadolibre.com/downloads](https://ml-challenge.mercadolibre.com/downloads).

In [1]:
import os

import requests

# Folder where I will place data.
DATA_DIR = "data"

# URL to download data.
DATA_URL_LIST = [
    r"https://meli-data-challenge.s3.amazonaws.com/2021/test_data.csv",
    r"https://meli-data-challenge.s3.amazonaws.com/2021/train_data.parquet",
    r"https://meli-data-challenge.s3.amazonaws.com/2021/items_static_metadata_full.jl",
    r"https://meli-data-challenge.s3.amazonaws.com/2021/sample_submission.csv.gz",
]


def download(url, path, ignore_if_exists=True):
    """Download a file from the web and return its local path."""
    # Skip the download if the file is already there.
    if ignore_if_exists and os.path.exists(path):
        return path
    # Request file and fail loudly on HTTP errors (e.g. 403 or 404),
    # so an error page is never saved as if it were data.
    response = requests.get(url, allow_redirects=True)
    response.raise_for_status()
    # Make sure the target folder exists ("." when path has no folder part).
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    # Save file.
    with open(path, mode="wb") as file:
        file.write(response.content)
    return path


for data_url in DATA_URL_LIST:
    # Create path for the file.
    data_name = os.path.basename(data_url)
    data_path = os.path.join(DATA_DIR, data_name)
    # Download file.
    print(f"Downloading {data_url} to {data_path}")
    download(url=data_url, path=data_path, ignore_if_exists=True)
    # Make sure the download was successful.
    assert os.path.exists(data_path)
    # Rough sanity check that the file is not damaged (larger than 1 MB).
    assert os.path.getsize(data_path) > 1e6

Downloading https://meli-data-challenge.s3.amazonaws.com/2021/test_data.csv to data/test_data.csv
Downloading https://meli-data-challenge.s3.amazonaws.com/2021/train_data.parquet to data/train_data.parquet
Downloading https://meli-data-challenge.s3.amazonaws.com/2021/items_static_metadata_full.jl to data/items_static_metadata_full.jl
Downloading https://meli-data-challenge.s3.amazonaws.com/2021/sample_submission.csv.gz to data/sample_submission.csv.gz


### Train Set
This dataset comprises two months of daily sales data for a subset of Mercado Libre SKUs (stock keeping units). Each row corresponds to a particular date-SKU combination. Besides SKU and date, the following fields are available for each row:

|Attributes|Description|
|---|---|
|sold_quantity|Number of units of the corresponding SKU that were sold on that particular date.|
|current_price|Point-in-time correct listing price.|
|currency|Currency in which the price is expressed.|
|listing_type|Type of listing the SKU had for that particular date. Possible values are classic or premium and they relate to the exposure the items receive and the fee charged to the seller as a sales commission.|
|shipping_logistic_type|Type of shipping method the SKU offered, for that particular date. Possible values are fulfillment, cross_docking and drop_off.|
|shipping_payment|Whether the shipping for the offered SKU at that particular date was free or paid, from the buyer's perspective.|
|minutes_active|Number of minutes the SKU was available for purchase on that particular date.|

In [2]:
import pandas as pd

train_set = pd.read_parquet(os.path.join("data", "train_data.parquet"))
print(f"shape = {train_set.shape}")
train_set.head()

shape = (37660279, 9)


| |sku|date|sold_quantity|current_price|currency|listing_type|shipping_logistic_type|shipping_payment|minutes_active|
|---|---|---|---|---|---|---|---|---|---|
|0|464801|2021-02-01|0|156.78|REA|classic|fulfillment|free_shipping|1440.0|
|1|464801|2021-02-02|0|156.78|REA|classic|fulfillment|free_shipping|1440.0|
|2|464801|2021-02-03|0|156.78|REA|classic|fulfillment|free_shipping|1440.0|
|3|464801|2021-02-04|0|156.78|REA|classic|fulfillment|free_shipping|1440.0|
|4|464801|2021-02-05|1|156.78|REA|classic|fulfillment|free_shipping|1440.0|
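
Since each row is meant to be a unique date-SKU combination, a quick sanity check is to look for repeated pairs. The sketch below uses a small hypothetical frame (`toy_train`) in place of the full `train_set`; the same call should work on the real data, assuming the `sku` and `date` column names shown in the output above.

```python
import pandas as pd

# Hypothetical toy data standing in for train_set (same key columns).
toy_train = pd.DataFrame(
    {
        "sku": [464801, 464801, 645793, 645793],
        "date": ["2021-02-01", "2021-02-02", "2021-02-01", "2021-02-01"],
        "sold_quantity": [0, 1, 2, 2],
    }
)

# Flag every row whose (sku, date) pair appears more than once.
duplicated_mask = toy_train.duplicated(subset=["sku", "date"], keep=False)
n_duplicated = int(duplicated_mask.sum())
print(f"rows with a repeated sku-date pair = {n_duplicated}")
```

In the toy frame the last two rows share the same pair, so both are flagged. On the real data, a count of zero would confirm the one-row-per-date-SKU assumption.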


### Test Set
For testing, the file test_data.csv is provided. It contains only two columns:

|Attribute|Description|
|---|---|
|SKU|indicates the SKU for which you have to make your prediction|
|target_stock|inventory level (i.e. the number of units of the corresponding SKU) for which you have to provide your estimate of inventory days.|

In [3]:
test_set = pd.read_csv(os.path.join("data", "test_data.csv"))
print(f"shape = {test_set.shape}")
test_set.head()

shape = (551472, 2)


| |sku|target_stock|
|---|---|---|
|0|464801|3|
|1|645793|4|
|2|99516|8|
|3|538100|8|
|4|557191|10|
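
To make the target concrete: one naive way to turn `target_stock` into an estimate of inventory days is to divide it by the SKU's average daily sales. This is only an illustrative baseline on hypothetical toy data (`toy_sales`, `toy_test`), not the approach used in this notebook. The 30-day cap (`MAX_DAYS`) is an assumption taken from the 30 columns of the sample submission.

```python
import numpy as np
import pandas as pd

MAX_DAYS = 30  # assumption: forecasting horizon, from the sample submission width

# Hypothetical daily sales for two SKUs (four days each).
toy_sales = pd.DataFrame(
    {
        "sku": [1, 1, 1, 1, 2, 2, 2, 2],
        "sold_quantity": [1, 0, 2, 1, 0, 0, 0, 0],
    }
)
# Hypothetical test rows in the same layout as test_data.csv.
toy_test = pd.DataFrame({"sku": [1, 2], "target_stock": [3, 4]})

# Average units sold per day for each SKU.
mean_daily_sales = (
    toy_sales.groupby("sku")["sold_quantity"].mean().rename("mean_daily_sales")
)
estimate = toy_test.join(mean_daily_sales, on="sku")

# SKUs with no sales get NaN as rate to avoid dividing by zero;
# they are then assigned the maximum horizon.
rate = estimate["mean_daily_sales"].replace(0, np.nan)
estimate["inventory_days"] = (
    (estimate["target_stock"] / rate).clip(upper=MAX_DAYS).fillna(MAX_DAYS)
)
print(estimate)
```

SKU 1 sells one unit per day on average, so a stock of 3 lasts about 3 days; SKU 2 never sells, so it is capped at the horizon.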


### Items Data
In the file items_static_metadata_full.jl there is some extra data related to the SKUs' characteristics. The file is in JSON Lines format: each line is a dictionary containing the metadata for a specific SKU. The following fields are available:

|Attribute|Description|
|---|---|
|SKU|stock-keeping-unit. This is a unique identifier for each distinct, physical inventory unit.|
|item_id|unique identifier of the listing the SKU belongs to. The same listing can be associated with more than one SKU, for example, if different variations of the same item are offered in the listing.|
|item_domain_id|listing's domain id. A domain is a kind of listings clustering within MercadoLibre.|
|item_title|the listing's title in the marketplace.|
|site_id|the MercadoLibre's site the listing belongs to. The labels MLB, MLA and MLM refer to Brazil, Argentina and Mexico respectively.|
|product_id|listing product id. Field might be null for some listings.|
|product_family_id|listing product family id. Field might be null for some listings.|

In [4]:
items_data = pd.read_json(os.path.join("data", "items_static_metadata_full.jl"), lines=True)
print(f"shape = {items_data.shape}")
items_data.head()

shape = (660916, 7)


| |item_domain_id|item_id|item_title|site_id|sku|product_id|product_family_id|
|---|---|---|---|---|---|---|---|
|0|MLB-SNEAKERS|492155|Tênis Masculino Olympikus Cyber Barato Promoçao|MLB|0| |MLB15832732|
|1|MLB-SURFBOARD_RACKS|300279|Suporte Rack Prancha Parede C/ Regulagem Horiz...|MLB|1| | |
|2|MLM-NECKLACES|69847|5 Collares Plateados Dama Gargantilla Choker -...|MLM|2| | |
|3|MLM-RINGS|298603|Lindo Anillo De Bella Crepusculo Twilight Prom...|MLM|3| | |
|4|MLB-WEBCAMS|345949|Webcam Com Microfone Hd 720p Knup Youtube Pc V...|MLB|4| | |
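
Because the metadata is keyed by `sku`, it can be attached to the other tables with a left join. The sketch below parses a tiny hypothetical JSON Lines string (`toy_jsonl`, with only three of the fields) and merges it into a toy test frame; the names `toy_items`, `toy_test` and `enriched` are illustrative only.

```python
import io

import pandas as pd

# Hypothetical excerpt in JSON Lines format: one dictionary per line.
toy_jsonl = (
    '{"sku": 0, "item_id": 492155, "site_id": "MLB"}\n'
    '{"sku": 1, "item_id": 300279, "site_id": "MLB"}\n'
    '{"sku": 2, "item_id": 69847, "site_id": "MLM"}\n'
)
toy_items = pd.read_json(io.StringIO(toy_jsonl), lines=True)

# Hypothetical test rows in the same layout as test_data.csv.
toy_test = pd.DataFrame({"sku": [2, 0], "target_stock": [8, 3]})

# Left join keeps every test row; validate raises an error if a SKU
# appears more than once in the metadata.
enriched = toy_test.merge(toy_items, on="sku", how="left", validate="many_to_one")
print(enriched)
```

A left join is used so that test rows are never dropped, even if a SKU were missing from the metadata.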


### Sample Submission
A sample submission file is provided to show the expected format of a submission.

In [5]:
sample_submission = pd.read_csv(os.path.join("data", "sample_submission.csv.gz"), header=None)
print(f"shape = {sample_submission.shape}")
sample_submission.head()

shape = (551472, 30)


| |0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0.052|0.006|0.044|0.001|0.061|0.043|0.061|0.035|0.002|0.057|0.004|0.005|0.013|0.048|0.031|0.039|0.019|0.014|0.031|0.063|0.025|0.032|0.043|0.069|0.011|0.058|0.051|0.01|0.004|0.068|
|1|0.026|0.052|0.008|0.036|0.027|0.029|0.057|0.046|0.005|0.009|0.042|0.052|0.039|0.057|0.029|0.051|0.058|0.033|0.015|0.053|0.013|0.024|0.036|0.033|0.021|0.03|0.023|0.02|0.024|0.05|
|2|0.067|0.008|0.043|0.02|0.012|0.067|0.01|0.06|0.02|0.061|0.059|0.009|0.025|0.07|0.019|0.004|0.005|0.066|0.017|0.007|0.033|0.014|0.016|0.017|0.04|0.059|0.04|0.014|0.066|0.052|
|3|0.017|0.045|0.027|0.045|0.036|0.025|0.068|0.067|0.002|0.015|0.04|0.044|0.002|0.029|0.02|0.001|0.023|0.037|0.031|0.043|0.06|0.053|0.027|0.021|0.05|0.045|0.06|0.063|0.004|0.003|
|4|0.011|0.038|0.02|0.0|0.067|0.023|0.006|0.021|0.058|0.023|0.006|0.054|0.039|0.013|0.061|0.055|0.04|0.031|0.037|0.034|0.002|0.027|0.062|0.045|0.044|0.032|0.048|0.035|0.026|0.043|
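
The sample file has one row per test SKU (551472) and 30 columns, and the first row shown sums to 1. This suggests (an assumption, not stated in the text above) that each row is a probability distribution over 30 days. Under that assumption, a minimal uniform baseline in the same layout could be sketched as follows; `n_skus` is a hypothetical small value standing in for `len(test_set)`.

```python
import io

import numpy as np
import pandas as pd

N_DAYS = 30  # number of columns in the sample submission
n_skus = 5   # hypothetical; in the notebook this would be len(test_set)

# Every day gets the same probability, so each row sums to 1.
uniform = np.full((n_skus, N_DAYS), 1.0 / N_DAYS)
submission = pd.DataFrame(uniform)

# Write without header or index, matching how the sample file is read
# above (header=None). An in-memory buffer is used here; for a real file
# one could pass a path such as "submission.csv.gz" with compression="gzip".
buffer = io.StringIO()
submission.to_csv(buffer, header=False, index=False)

# Read it back the same way the sample submission is loaded.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()), header=None)
print(f"shape = {round_trip.shape}")
```

Reading the buffer back with `header=None` mirrors the loading cell above, which makes it easy to compare shapes against `sample_submission`.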
