<a href="https://colab.research.google.com/github/matheusccouto/meli-data-challenge-2021/blob/main/0_meli_data_challenge_2021_data_wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![mercado-libre](https://ml-challenge.mercadolibre.com/static/images/logo-mercado-libre_en.png)

# MeLi Data Challenge 2021
# Data Wrangling
In this notebook I load and clean the data from the 2021 Mercado Libre data challenge.

## The Challenge
Build a model to forecast item inventory days based on Mercado Libre historical data.

## The Task
The task is to predict how long it will take for the inventory of a certain item to be sold completely. In inventory management theory this concept is known as inventory days.

In the evaluation set I am given each item's target stock, and I have to predict the number of days it will take to run out. Possible values range from 1 to 30. Rather than a point estimate, the challenge expects a score for each of the possible outcomes.

To put it simply, you need to answer the following question:

**'What are the odds that the target stock will be sold out in one day?', 'What about in two days?' and so on until day 30.**
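
To make the expected output concrete, here is a minimal sketch of one prediction row: 30 non-negative scores, one per day. It assumes the scores are treated as probabilities that sum to one, which this notebook does not state explicitly. The helper name `to_probabilities` and the bell-shaped example are illustrative only.

```python
import numpy as np

N_DAYS = 30  # possible outcomes: stock runs out on day 1, 2, ..., 30


def to_probabilities(scores):
    """Normalize 30 non-negative scores so they sum to one (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float)
    if scores.shape != (N_DAYS,):
        raise ValueError(f"expected {N_DAYS} scores, got shape {scores.shape}")
    if (scores < 0).any() or scores.sum() == 0:
        raise ValueError("scores must be non-negative and not all zero")
    return scores / scores.sum()


# Baseline with no information: every day is equally likely.
uniform = to_probabilities(np.ones(N_DAYS))

# A belief centred on day 10, spread over the neighbouring days.
days = np.arange(1, N_DAYS + 1)
peaked = to_probabilities(np.exp(-0.5 * ((days - 10) / 3.0) ** 2))
```

The uniform row is a natural baseline to compare any model against.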

## Repository
This notebook is hosted in this repository: [github.com/matheusccouto/meli-data-challenge-2021](https://github.com/matheusccouto/meli-data-challenge-2021)

Check out the different branches to see all the approaches tested.

## Load Data

### Download data
Download challenge data from [ml-challenge.mercadolibre.com/downloads](https://ml-challenge.mercadolibre.com/downloads).

In [1]:
import os

import requests

# Folder where I will place data.
DATA_DIR = "data"

# URL to download data.
DATA_URL_LIST = [
    r"https://meli-data-challenge.s3.amazonaws.com/2021/test_data.csv",
    r"https://meli-data-challenge.s3.amazonaws.com/2021/train_data.parquet",
    r"https://meli-data-challenge.s3.amazonaws.com/2021/items_static_metadata_full.jl",
    r"https://meli-data-challenge.s3.amazonaws.com/2021/sample_submission.csv.gz",
]


def download(url, path, ignore_if_exists=True):
    """Download files from the web."""
    # Check if it should be skipped.
    if ignore_if_exists and os.path.exists(path):
        return path  # return path to the user.
    # Request file and fail loudly on HTTP errors instead of saving an error page.
    response = requests.get(url, allow_redirects=True)
    response.raise_for_status()
    # Make sure the target folder exists.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # Save file.
    with open(path, mode="wb") as file:
        file.write(response.content)
    return path  # return path to the user.


for data_url in DATA_URL_LIST:
    # Create path for the file.
    data_name = os.path.basename(data_url)
    data_path = os.path.join(DATA_DIR, data_name)
    # Download file.
    print(f"Downloading {data_url} to {data_path}")
    download(url=data_url, path=data_path, ignore_if_exists=True)
    # Make sure download was successful.
    assert os.path.exists(data_path)
    # And make sure the file is not damaged (at least 1MB)
    assert os.path.getsize(data_path) > 1e6

Downloading https://meli-data-challenge.s3.amazonaws.com/2021/test_data.csv to data/test_data.csv
Downloading https://meli-data-challenge.s3.amazonaws.com/2021/train_data.parquet to data/train_data.parquet
Downloading https://meli-data-challenge.s3.amazonaws.com/2021/items_static_metadata_full.jl to data/items_static_metadata_full.jl
Downloading https://meli-data-challenge.s3.amazonaws.com/2021/sample_submission.csv.gz to data/sample_submission.csv.gz


### Train Set
This dataset comprises two months of daily sales data for a subset of Mercado Libre SKUs (stock keeping units). Each row corresponds to a particular date-SKU combination. Besides SKU and date, the following fields are available for each row:

|Attributes|Description|
|---|---|
|sold_quantity|Number of units of the corresponding SKU that were sold on that particular date.|
|current_price|Point in time correct listing price.|
|currency|Currency in which the price is expressed.|
|listing_type|Type of listing the SKU had for that particular date. Possible values are classic or premium and they relate to the exposure the items receive and the fee charged to the seller as a sales commission.|
|shipping_logistic_type|Type of shipping method the SKU offered, for that particular date. Possible values are fulfillment, cross_docking and drop_off.|
|shipping_payment|Whether the shipping for the offered SKU at that particular date was free or paid, from the buyer's perspective.|
|minutes_active|Number of minutes the SKU was available for purchase on that particular date.|

In [2]:
import pandas as pd

train_set = pd.read_parquet(os.path.join("data", "train_data.parquet"))
print(f"shape = {train_set.shape}")
train_set.head()

shape = (37660279, 9)


| |sku|date|sold_quantity|current_price|currency|listing_type|shipping_logistic_type|shipping_payment|minutes_active|
|---|---|---|---|---|---|---|---|---|---|
|0|464801|2021-02-01|0|156.78|REA|classic|fulfillment|free_shipping|1440.0|
|1|464801|2021-02-02|0|156.78|REA|classic|fulfillment|free_shipping|1440.0|
|2|464801|2021-02-03|0|156.78|REA|classic|fulfillment|free_shipping|1440.0|
|3|464801|2021-02-04|0|156.78|REA|classic|fulfillment|free_shipping|1440.0|
|4|464801|2021-02-05|1|156.78|REA|classic|fulfillment|free_shipping|1440.0|


### Test Set
For testing, the file test_data.csv is provided. It contains only two columns:

|Attribute|Description|
|---|---|
|SKU|indicates the SKU for which you have to make your prediction|
|target_stock|inventory level (i.e. the number of units of the corresponding SKU) for which you have to provide your estimation of inventory days.|

In [3]:
test_set = pd.read_csv(os.path.join("data", "test_data.csv"))
print(f"shape = {test_set.shape}")
test_set.head()

shape = (551472, 2)


| |sku|target_stock|
|---|---|---|
|0|464801|3|
|1|645793|4|
|2|99516|8|
|3|538100|8|
|4|557191|10|


### Items Data
The file items_static_metadata_full.jl holds extra data about the SKUs' characteristics. It is read as JSON Lines: each line is a dictionary with the metadata for a specific SKU. The following fields are available:

|Attribute|Description|
|---|---|
|SKU|stock-keeping-unit. This is a unique identifier for each distinct, physical inventory unit.|
|item_id|unique identifier of the listing the SKU belongs to. The same listing can be associated with more than one SKU, for example, if different variations of the same item are offered in the listing.|
|item_domain_id|listing's domain id. A domain is a kind of listings clustering within MercadoLibre.|
|item_title|the listing's title in the marketplace.|
|site_id|the MercadoLibre's site the listing belongs to. The labels MLB, MLA and MLM refer to Brazil, Argentina and Mexico respectively.|
|product_id|listing product id. Field might be null for some listings.|
|product_family_id|listing product family id. Field might be null for some listings.|

In [4]:
items_data = pd.read_json(os.path.join("data", "items_static_metadata_full.jl"), lines=True)
print(f"shape = {items_data.shape}")
items_data.head()

shape = (660916, 7)


| |item_domain_id|item_id|item_title|site_id|sku|product_id|product_family_id|
|---|---|---|---|---|---|---|---|
|0|MLB-SNEAKERS|492155|Tênis Masculino Olympikus Cyber Barato Promoçao|MLB|0| |MLB15832732|
|1|MLB-SURFBOARD_RACKS|300279|Suporte Rack Prancha Parede C/ Regulagem Horiz...|MLB|1| | |
|2|MLM-NECKLACES|69847|5 Collares Plateados Dama Gargantilla Choker -...|MLM|2| | |
|3|MLM-RINGS|298603|Lindo Anillo De Bella Crepusculo Twilight Prom...|MLM|3| | |
|4|MLB-WEBCAMS|345949|Webcam Com Microfone Hd 720p Knup Youtube Pc V...|MLB|4| | |


### Sample Submission
A sample submission file is provided to show the expected format of a submission.

In [5]:
sample_submission = pd.read_csv(os.path.join("data", "sample_submission.csv.gz"), header=None)
print(f"shape = {sample_submission.shape}")
sample_submission.head()

shape = (551472, 30)


| |0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0.052|0.006|0.044|0.001|0.061|0.043|0.061|0.035|0.002|0.057|0.004|0.005|0.013|0.048|0.031|0.039|0.019|0.014|0.031|0.063|0.025|0.032|0.043|0.069|0.011|0.058|0.051|0.01|0.004|0.068|
|1|0.026|0.052|0.008|0.036|0.027|0.029|0.057|0.046|0.005|0.009|0.042|0.052|0.039|0.057|0.029|0.051|0.058|0.033|0.015|0.053|0.013|0.024|0.036|0.033|0.021|0.03|0.023|0.02|0.024|0.05|
|2|0.067|0.008|0.043|0.02|0.012|0.067|0.01|0.06|0.02|0.061|0.059|0.009|0.025|0.07|0.019|0.004|0.005|0.066|0.017|0.007|0.033|0.014|0.016|0.017|0.04|0.059|0.04|0.014|0.066|0.052|
|3|0.017|0.045|0.027|0.045|0.036|0.025|0.068|0.067|0.002|0.015|0.04|0.044|0.002|0.029|0.02|0.001|0.023|0.037|0.031|0.043|0.06|0.053|0.027|0.021|0.05|0.045|0.06|0.063|0.004|0.003|
|4|0.011|0.038|0.02|0.0|0.067|0.023|0.006|0.021|0.058|0.023|0.006|0.054|0.039|0.013|0.061|0.055|0.04|0.031|0.037|0.034|0.002|0.027|0.062|0.045|0.044|0.032|0.048|0.035|0.026|0.043|


## Data Wrangling
In this section I will make sure the data is ready to be analyzed.

### Train Set

#### Dtypes
Make sure the data is in the appropriate format.

In [6]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37660279 entries, 0 to 37660278
Data columns (total 9 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   sku                     int64  
 1   date                    object 
 2   sold_quantity           int64  
 3   current_price           float64
 4   currency                object 
 5   listing_type            object 
 6   shipping_logistic_type  object 
 7   shipping_payment        object 
 8   minutes_active          float64
dtypes: float64(2), int64(2), object(5)
memory usage: 2.5+ GB


The `date` column is stored as object and should be parsed as a datetime.

In [7]:
train_set["date"] = pd.to_datetime(train_set["date"])
print(f"Dates go from {train_set['date'].min().strftime('%Y-%m-%d')} to {train_set['date'].max().strftime('%Y-%m-%d')}")

Dates go from 2021-02-01 to 2021-03-31


Other columns can also be optimized. Since this dataset is huge and I am working very near the memory limit, it is a good idea to optimize as much as possible.

In [8]:
train_set[["sku", "sold_quantity", "minutes_active", "current_price"]].max()

sku                  660915.0
sold_quantity          6951.0
minutes_active         1440.0
current_price     999999999.0
dtype: float64
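
As an alternative to picking integer dtypes by hand from the maxima above, pandas can choose the smallest signed integer dtype that holds the data. This is only a sketch on a toy frame whose values mirror the maxima printed above. It is not what this notebook does, and it is limited to integer columns.

```python
import pandas as pd

# Toy frame with the same maxima as the real columns.
frame = pd.DataFrame({"sku": [464801, 660915], "sold_quantity": [0, 6951]})

# downcast="integer" picks the smallest signed integer dtype that fits every value.
downcast = pd.DataFrame(
    {col: pd.to_numeric(frame[col], downcast="integer") for col in frame.columns}
)

print(downcast.dtypes)
```

Choosing the dtypes explicitly, as done in the next cell, has the advantage that the result does not depend on which values happen to be in the data.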

In [9]:
train_set = train_set.astype({
    "sku": "int32",
    "sold_quantity": "int16",
    "minutes_active": "int16",
    "current_price": "float32"
    })

# Check max values again to make sure it didn't clip any value.
train_set[["sku", "sold_quantity", "minutes_active", "current_price"]].max()

sku               6.609150e+05
sold_quantity     6.951000e+03
minutes_active    1.440000e+03
current_price     1.000000e+09
dtype: float64

The largest `current_price` went from 999999999 to 1000000000. float32 keeps only about seven significant decimal digits, so the cast rounded this value up by a single unit. That is not a problem here and I can live with it.

I will round to two decimal places since it is a monetary value, and for an online listing the third decimal place is not relevant.

In [10]:
train_set["current_price"] = train_set["current_price"].round(2)
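
The float32 behaviour can be checked in isolation. The sketch below uses only NumPy scalars, not the real dataset. It shows why the largest price moved by one unit and why a two-decimal price such as 156.78 is still stored as a nearby approximation after rounding.

```python
import numpy as np

# float32 has a 24-bit significand, so near 1e9 consecutive values are 64 apart.
gap = float(np.spacing(np.float32(1e9)))

# 999,999,999 is not representable, so the cast lands on the nearest float32 value.
largest_price = float(np.float32(999_999_999))

# Two-decimal prices are stored approximately as well.
price = float(np.float32(156.78))
error = abs(price - 156.78)

print(gap, largest_price, price, error)
```

The approximation error on ordinary prices is far below one cent, so it should not matter for modelling. It only shows up as extra digits when the values are printed.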

I'll also see the possible values from all other object columns to see if there's something to be done.

In [11]:
for col in train_set.dtypes[train_set.dtypes == "object"].index:
    print(f"{col} classes are: {', '.join(train_set[col].unique())}")

currency classes are: REA, MEX, DOL, ARG
listing_type classes are: classic, premium
shipping_logistic_type classes are: fulfillment, cross_docking, drop_off
shipping_payment classes are: free_shipping, paid_shipping


#### Encoding
Object columns need to be encoded before they can be used in modelling. Columns with only two classes become a single binary column. Columns with three or more classes get one column per class.

For the multi-class columns I won't drop one of the dummies for now. Keeping all of them may be bad for some algorithms, but for others dropping one is worse. I will keep all columns for now and decide what to do with them during modelling.

In [12]:
train_set = pd.get_dummies(train_set, columns=["listing_type", "shipping_payment"], drop_first=True)
train_set = pd.get_dummies(train_set, columns=["currency", "shipping_logistic_type"], drop_first=False)
train_set.head()

| |sku|date|sold_quantity|current_price|minutes_active|listing_type_premium|shipping_payment_paid_shipping|currency_ARG|currency_DOL|currency_MEX|currency_REA|shipping_logistic_type_cross_docking|shipping_logistic_type_drop_off|shipping_logistic_type_fulfillment|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|464801|2021-02-01|0|156.779999|1440|0|0|0|0|0|1|0|0|1|
|1|464801|2021-02-02|0|156.779999|1440|0|0|0|0|0|1|0|0|1|
|2|464801|2021-02-03|0|156.779999|1440|0|0|0|0|0|1|0|0|1|
|3|464801|2021-02-04|0|156.779999|1440|0|0|0|0|0|1|0|0|1|
|4|464801|2021-02-05|1|156.779999|1440|0|0|0|0|0|1|0|0|1|
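
The effect of `drop_first` is easier to see on a toy frame. The sketch below uses made-up rows with two of the real columns. A two-class column collapses into one indicator, while a multi-class column keeps one indicator per class.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "listing_type": ["classic", "premium", "classic"],
        "currency": ["REA", "MEX", "REA"],
    }
)

# Two classes: drop_first=True leaves a single column (classic is the implicit baseline).
encoded = pd.get_dummies(toy, columns=["listing_type"], drop_first=True)

# Three or more classes in the real data: keep one column per class for now.
encoded = pd.get_dummies(encoded, columns=["currency"], drop_first=False)

print(list(encoded.columns))
```

Depending on the pandas version, the indicator columns come out as `uint8` (as in this notebook) or as `bool`.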


Let's see the memory usage now.

In [13]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37660279 entries, 0 to 37660278
Data columns (total 14 columns):
 #   Column                                Dtype         
---  ------                                -----         
 0   sku                                   int32         
 1   date                                  datetime64[ns]
 2   sold_quantity                         int16         
 3   current_price                         float32       
 4   minutes_active                        int16         
 5   listing_type_premium                  uint8         
 6   shipping_payment_paid_shipping        uint8         
 7   currency_ARG                          uint8         
 8   currency_DOL                          uint8         
 9   currency_MEX                          uint8         
 10  currency_REA                          uint8         
 11  shipping_logistic_type_cross_docking  uint8         
 12  shipping_logistic_type_drop_off       uint8         
 13  shipping_l

Memory usage is much better now.
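
To quantify the gain from a single cast, `Series.memory_usage` can be compared before and after. This is a sketch on a synthetic column of 1,000 values, not on the real dataset.

```python
import numpy as np
import pandas as pd

quantities = pd.Series(np.arange(1000), dtype="int64")

# index=False counts only the values, not the index.
before = quantities.memory_usage(index=False)
after = quantities.astype("int16").memory_usage(index=False)

print(f"int64: {before} bytes, int16: {after} bytes")
```

Going from 8 bytes to 2 bytes per value is a fourfold reduction for that column, which adds up quickly over tens of millions of rows.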

#### NaN
Check for NaNs in the dataset.

In [14]:
print(f"There are {train_set.isna().sum().sum()} NaNs in the dataset.")

There are 0 NaNs in the dataset.


### Test Set

#### Dtypes
Make sure the data is in the appropriate format.

In [15]:
test_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 551472 entries, 0 to 551471
Data columns (total 2 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   sku           551472 non-null  int64
 1   target_stock  551472 non-null  int64
dtypes: int64(2)
memory usage: 8.4 MB


It shows that there are no NaNs.

I will assign smaller dtypes for better memory usage. First I need to see the biggest values to check which dtypes apply.

In [16]:
test_set.max()

sku             660914
target_stock     32710
dtype: int64

In [17]:
test_set = test_set.astype({"sku": "uint32", "target_stock": "uint16"})
test_set.max()  # Print max values to make sure values were not clipped.

sku             660914
target_stock     32710
dtype: int64
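
The choice of `uint32` and `uint16` can be double-checked against the limits NumPy reports for each type. The helper below, `smallest_unsigned_dtype`, is a hypothetical name for this sketch and not part of any library.

```python
import numpy as np


def smallest_unsigned_dtype(max_value):
    """Return the name of the smallest unsigned integer dtype that can hold max_value."""
    for int_type in (np.uint8, np.uint16, np.uint32, np.uint64):
        if 0 <= max_value <= np.iinfo(int_type).max:
            return np.dtype(int_type).name
    raise ValueError(f"{max_value} does not fit in an unsigned 64-bit integer")


# Maxima printed above for the test set.
print(smallest_unsigned_dtype(660914))  # sku
print(smallest_unsigned_dtype(32710))   # target_stock
```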

The predictions cover the 30 days after the training period, so each SKU needs one row per date in that window. I explode the dataset accordingly.

In [18]:
start = train_set["date"].max() + pd.Timedelta(days=1)
end = start + pd.Timedelta(days=29)
date_range = list(pd.date_range(start, end, freq='D'))

test_set["date"] = test_set["sku"].apply(lambda row: date_range)
test_set = test_set.explode("date")

test_set.head()

| |sku|target_stock|date|
|---|---|---|---|
|0|464801|3|2021-04-01|
|0|464801|3|2021-04-02|
|0|464801|3|2021-04-03|
|0|464801|3|2021-04-04|
|0|464801|3|2021-04-05|
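
The same SKU-by-date expansion can also be written as a cross join, which avoids building a Python list per row. This is an alternative sketch on a two-row toy frame, assuming pandas 1.2 or newer for `how="cross"`. It is not the approach used in the cell above.

```python
import pandas as pd

skus = pd.DataFrame({"sku": [464801, 645793], "target_stock": [3, 4]})

# The 30 days that follow the last training date (2021-03-31).
start = pd.Timestamp("2021-04-01")
dates = pd.DataFrame({"date": pd.date_range(start, periods=30, freq="D")})

# Every SKU row is paired with every date.
exploded = skus.merge(dates, how="cross")

print(exploded.shape)
```

Unlike `explode`, the cross join produces a fresh 0..n-1 index rather than repeating the original row labels.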


In [19]:
test_set.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16544160 entries, 0 to 551471
Data columns (total 3 columns):
 #   Column        Dtype         
---  ------        -----         
 0   sku           uint32        
 1   target_stock  uint16        
 2   date          datetime64[ns]
dtypes: datetime64[ns](1), uint16(1), uint32(1)
memory usage: 347.1 MB


### Items Data

#### Dtypes
Make sure the data is in the appropriate format.

In [20]:
items_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660916 entries, 0 to 660915
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   item_domain_id     660913 non-null  object
 1   item_id            660916 non-null  int64 
 2   item_title         660916 non-null  object
 3   site_id            660916 non-null  object
 4   sku                660916 non-null  int64 
 5   product_id         29668 non-null   object
 6   product_family_id  83184 non-null   object
dtypes: int64(2), object(5)
memory usage: 35.3+ MB


Let's check max values to see which types to apply.

In [21]:
items_data.max()

item_id                           517895
item_title    Útiles Set Basico Todomoda
site_id                              MLM
sku                               660915
dtype: object

Apply the new dtypes and verify that the transformation didn't clip any value.

In [22]:
items_data = items_data.astype({
    "item_id": "int32",
    "sku": "int32",
})
items_data.max()

item_id                           517895
item_title    Útiles Set Basico Todomoda
site_id                              MLM
sku                               660915
dtype: object

`sku` should be this dataset's index.

In [23]:
items_data = items_data.set_index("sku")
items_data.head()

|sku|item_domain_id|item_id|item_title|site_id|product_id|product_family_id|
|---|---|---|---|---|---|---|
|0|MLB-SNEAKERS|492155|Tênis Masculino Olympikus Cyber Barato Promoçao|MLB| |MLB15832732|
|1|MLB-SURFBOARD_RACKS|300279|Suporte Rack Prancha Parede C/ Regulagem Horiz...|MLB| | |
|2|MLM-NECKLACES|69847|5 Collares Plateados Dama Gargantilla Choker -...|MLM| | |
|3|MLM-RINGS|298603|Lindo Anillo De Bella Crepusculo Twilight Prom...|MLM| | |
|4|MLB-WEBCAMS|345949|Webcam Com Microfone Hd 720p Knup Youtube Pc V...|MLB| | |


#### Encoding
Let's check whether any of these columns can be encoded.

In [24]:
for col in items_data.select_dtypes("object").columns:
    print(f"{col} has {len(items_data[col].unique())} classes")

item_domain_id has 8409 classes
item_title has 478175 classes
site_id has 3 classes
product_id has 15864 classes
product_family_id has 29601 classes


It seems only `site_id` has few enough classes to be one-hot encoded. The remaining columns cannot be encoded for now.

For the same reason as before, I won't drop one of the encoded categories. I will leave this decision for later.

In [25]:
items_data = pd.get_dummies(items_data, columns=["site_id"], drop_first=False)
items_data.head()

|sku|item_domain_id|item_id|item_title|product_id|product_family_id|site_id_MLA|site_id_MLB|site_id_MLM|
|---|---|---|---|---|---|---|---|---|
|0|MLB-SNEAKERS|492155|Tênis Masculino Olympikus Cyber Barato Promoçao| |MLB15832732|0|1|0|
|1|MLB-SURFBOARD_RACKS|300279|Suporte Rack Prancha Parede C/ Regulagem Horiz...| | |0|1|0|
|2|MLM-NECKLACES|69847|5 Collares Plateados Dama Gargantilla Choker -...| | |0|0|1|
|3|MLM-RINGS|298603|Lindo Anillo De Bella Crepusculo Twilight Prom...| | |0|0|1|
|4|MLB-WEBCAMS|345949|Webcam Com Microfone Hd 720p Knup Youtube Pc V...| | |0|1|0|


#### NaNs
The columns `product_id` and `product_family_id` are mostly NaNs. I will deal with them later.

The `item_domain_id` column has only a few missing rows. I will deal with them manually.

In [26]:
items_data[items_data['item_domain_id'].isna()]

|sku|item_domain_id|item_id|item_title|product_id|product_family_id|site_id_MLA|site_id_MLB|site_id_MLM|
|---|---|---|---|---|---|---|---|---|
|454273| |245417|Fone De Ouvido Multilaser Tws Ph326 Airbud Sem...| |MLB16117735|0|1|0|
|459892| |283198|Headset Gamer Multilaser P2 Ph123 Preto E Verde| |MLB16158968|0|1|0|
|553503| |114028|Prancha De Cabelo Taiff Elegance Frizz Cerâmic...| |MLB17821818|0|1|0|


In [27]:
# Iterate over the SKUs that have missing data on item_domain_id.
for sku in items_data[items_data["item_domain_id"].isna()].index:
    print(f"sku = {sku}")
    print(f"item_title = {items_data.loc[sku, 'item_title']}")

    # Keep only the rows from the same product_family_id.
    same_family = items_data[items_data["product_family_id"] == items_data.loc[sku, "product_family_id"]]

    # If no row in the family has an item_domain_id, print a message and move on.
    if same_family["item_domain_id"].isna().all():
        print("No other products on the same family to compare to.\n")

    # Otherwise, fill the missing value with the first one found in the family.
    else:
        item_domain_id = same_family["item_domain_id"].dropna().iloc[0]
        print(f"item_domain_id = {item_domain_id}\n")
        items_data.loc[sku, "item_domain_id"] = item_domain_id

sku = 454273
item_title = Fone De Ouvido Multilaser Tws Ph326 Airbud Sem Fio Microfone
item_domain_id = MLB-HEADPHONES

sku = 459892
item_title = Headset Gamer Multilaser P2 Ph123 Preto E Verde
No other products on the same family to compare to.

sku = 553503
item_title = Prancha De Cabelo Taiff Elegance Frizz Cerâmica 230ºc Bivolt
item_domain_id = MLB-HAIR_STRAIGHTENERS



The *Headset Gamer Multilaser P2 Ph123 Preto E Verde* has no other product in its family, so it must be filled manually. Since it is a headset, I'll fill it with *MLB-HEADPHONES*.

In [28]:
items_data.loc[459892, "item_domain_id"] = "MLB-HEADPHONES"
print(f"Amount of missing data = {items_data['item_domain_id'].isna().sum()}")

Amount of missing data = 0
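
The family-based fill above can also be expressed without a loop. The sketch below is an alternative on a made-up four-row frame (the family ids `F1` to `F3` are placeholders). It borrows the first non-null `item_domain_id` within each `product_family_id` and leaves a value missing when the family has nothing to offer.

```python
import pandas as pd

items = pd.DataFrame(
    {
        "product_family_id": ["F1", "F1", "F2", "F3"],
        "item_domain_id": [None, "MLB-HEADPHONES", None, "MLB-RINGS"],
    },
    index=pd.Index([10, 11, 12, 13], name="sku"),
)

# First non-null domain per family, broadcast back to every row of that family.
family_domain = items.groupby("product_family_id")["item_domain_id"].transform("first")

# Only the missing entries are replaced; existing values are kept.
filled = items["item_domain_id"].fillna(family_domain)

print(filled)
```

SKU 12 stays missing because it is alone in its family, which mirrors the case that had to be filled by hand above.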


### Sample Submission

#### Dtypes
Make sure the data is in the appropriate format.

In [29]:
sample_submission.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 551472 entries, 0 to 551471
Data columns (total 30 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   0       551472 non-null  float64
 1   1       551472 non-null  float64
 2   2       551472 non-null  float64
 3   3       551472 non-null  float64
 4   4       551472 non-null  float64
 5   5       551472 non-null  float64
 6   6       551472 non-null  float64
 7   7       551472 non-null  float64
 8   8       551472 non-null  float64
 9   9       551472 non-null  float64
 10  10      551472 non-null  float64
 11  11      551472 non-null  float64
 12  12      551472 non-null  float64
 13  13      551472 non-null  float64
 14  14      551472 non-null  float64
 15  15      551472 non-null  float64
 16  16      551472 non-null  float64
 17  17      551472 non-null  float64
 18  18      551472 non-null  float64
 19  19      551472 non-null  float64
 20  20      551472 non-null  float64
 21  21      55

Since only four digits of precision are required, we can use a lighter dtype.

In [30]:
sample_submission = sample_submission.astype("float32")
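
A quick check that float32 is enough for four decimal digits. This is a sketch on a synthetic grid of values between 0 and 1, assuming the submission values stay in that range.

```python
import numpy as np

# Every four-decimal value between 0 and 1.
probabilities = np.linspace(0.0, 1.0, 10_001)
as_float32 = probabilities.astype(np.float32)

# Largest absolute error introduced by the cast.
back_to_float64 = as_float32.astype(np.float64)
max_error = float(np.abs(back_to_float64 - probabilities).max())

# Rounding back to four decimals recovers the same values.
same_after_rounding = bool(
    np.array_equal(np.round(back_to_float64, 4), np.round(probabilities, 4))
)

print(max_error, same_after_rounding)
```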

## Save Data

### Google Drive
I will store processed data on Google Drive.

In [31]:
from google.colab import drive

drive.mount("/gdrive")
base_dir = os.path.join("/gdrive", "My Drive", "Code", "meli-data-challenge-2021")
os.chdir(base_dir)
os.makedirs(os.path.join("data", "0-data-wrangling"), exist_ok=True)
os.chdir(os.path.join(base_dir, "data", "0-data-wrangling"))

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


### Train Set

In [32]:
train_set.to_parquet("train_data.parquet")
train_set.dtypes.apply(lambda x: x.name).to_json("train_data_dtypes.json")

### Test Set

In [33]:
test_set.to_csv("test_data.csv", index=False)
test_set.dtypes.apply(lambda x: x.name).to_json("test_data_dtypes.json")
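
CSV does not store dtypes, which is why a JSON file with the dtype names is saved next to it. Below is a sketch of how the pair could be read back. The helper `load_csv_with_dtypes` is a hypothetical name, and the example round-trips a two-row toy frame through a temporary folder rather than Google Drive.

```python
import json
import os
import tempfile

import pandas as pd


def load_csv_with_dtypes(csv_path, dtypes_path):
    """Read a CSV and restore the dtypes recorded in a companion JSON file."""
    with open(dtypes_path) as file:
        dtypes = json.load(file)
    # read_csv cannot take datetime dtypes directly, so those go through parse_dates.
    date_cols = [col for col, name in dtypes.items() if name.startswith("datetime64")]
    other = {col: name for col, name in dtypes.items() if col not in date_cols}
    return pd.read_csv(csv_path, dtype=other, parse_dates=date_cols)


frame = pd.DataFrame(
    {
        "sku": pd.Series([464801, 645793], dtype="uint32"),
        "target_stock": pd.Series([3, 4], dtype="uint16"),
        "date": pd.to_datetime(["2021-04-01", "2021-04-02"]),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "test_data.csv")
    dtypes_path = os.path.join(tmp, "test_data_dtypes.json")
    # Same saving pattern as in the cell above.
    frame.to_csv(csv_path, index=False)
    frame.dtypes.apply(lambda x: x.name).to_json(dtypes_path)
    restored = load_csv_with_dtypes(csv_path, dtypes_path)

print(restored.dtypes)
```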

### Items Data

In [34]:
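# `sku` is now the index and orient="records" does not write the index,
# so move it back to a column before saving.
items_data.reset_index().to_json("items_static_metadata_full.jl", orient="records", lines=True)
items_data.reset_index().dtypes.apply(lambda x: x.name).to_json("items_static_metadata_full_dtypes.json")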

### Sample Submission

In [35]:
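# The original file has no header or index column, so keep the same layout.
sample_submission.to_csv("sample_submission.csv.gz", compression="gzip", header=False, index=False)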
sample_submission.dtypes.apply(lambda x: x.name).to_json("sample_submission_dtypes.json")