#### Original Columns

* ITEM_ID: unique id of each published item (obfuscated).
* SHP_WEIGHT: package weight reported by the carrier.
* SHP_LENGTH: package length reported by the carrier.
* SHP_WIDTH: package width reported by the carrier.
* SHP_HEIGHT: package height reported by the carrier.
* ATTRIBUTES: attributes such as brand and model, among others, in JSON Lines format.
* CATALOG_PRODUCT_ID: catalog id (obfuscated).
* CONDITION: sale condition (new or used).
* DOMAIN_ID: id of the category the listing belongs to.
* PRICE: price in reais (BRL).
* SELLER_ID: seller id (obfuscated).
* STATUS: listing status (active, closed, paused, etc.).
* TITLE: listing title.


#### Current Columns

* ITEM_ID
* SHP_WEIGHT
* SHP_LENGTH
* SHP_WIDTH
* SHP_HEIGHT
* PRICE
* STATUS
* TITLE
* LEN_ATR: number of attributes
* DT_CAT_PROD: product catalog id (revised)
* DT_CONDITION: sale condition (revised)
* DT_DOMAIN: listing category (revised)
* DT_SELLER: seller id (revised)
* DT_BRAND: product brand (revised)
* DT_MODEL: product model (revised)
* EXCEDIDO: whether the product exceeds the carrier's limit
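
LEN_ATR is described as the number of attributes, but the step that derived it from ATTRIBUTES is not part of this notebook. The following is a minimal sketch of one way it could be computed, assuming each line of the JSON Lines string holds one JSON object describing a single attribute. The function name `count_attributes` and the sample field names (`id`, `value`) are hypothetical.

```python
import json

def count_attributes(attributes_jsonl):
    """Count the non-empty JSON lines in an ATTRIBUTES-style string."""
    if not attributes_jsonl:
        return 0  # missing or empty field: no attributes
    count = 0
    for line in attributes_jsonl.splitlines():
        line = line.strip()
        if not line:
            continue
        json.loads(line)  # raises ValueError if the line is not valid JSON
        count += 1
    return count

# Hypothetical sample: the real field names inside ATTRIBUTES are not shown here.
sample = '{"id": "BRAND", "value": "BRAVOX"}\n{"id": "MODEL", "value": "B3X60X"}'
n_attributes = count_attributes(sample)
print(n_attributes)
```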

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

import random
random.seed(0)

In [2]:
DATASET = 'meli_dataset_a.csv'

In [3]:
df_raw = pd.read_csv(DATASET, low_memory=False)
df = df_raw.copy()

In [4]:
print(df.shape)
print(df.dtypes)
display(df.sample(5))

(296291, 16)
ITEM_ID          object
SHP_WEIGHT      float64
SHP_LENGTH      float64
SHP_WIDTH       float64
SHP_HEIGHT      float64
PRICE           float64
STATUS           object
TITLE            object
LEN_ATR           int64
DT_CAT_PROD      object
DT_CONDITION     object
DT_DOMAIN        object
DT_SELLER        object
DT_BRAND         object
DT_MODEL         object
EXCEDIDO          int64
dtype: object


Unnamed: 0,ITEM_ID,SHP_WEIGHT,SHP_LENGTH,SHP_WIDTH,SHP_HEIGHT,PRICE,STATUS,TITLE,LEN_ATR,DT_CAT_PROD,DT_CONDITION,DT_DOMAIN,DT_SELLER,DT_BRAND,DT_MODEL,EXCEDIDO
147376,YEK9TYQZM5,750.0,30.0,20.0,5.0,,under_review,Chapinha Nano Titanium Babyliss Profissional B...,0,H53U1H7Q5G,SIN_DATOS,SIN_DATOS,OTROS,SIN_DATOS,SIN_DATOS,0
263359,ERF672H5I3,673.0,45.0,26.0,20.0,174.9,active,Alicate Corte Diagonal + Corte Frontal Força D...,7,H53U1H7Q5G,new,OTROS,OTROS,WORKER,OTROS,0
63184,RN4Z2Q8WKZ,100.0,25.0,11.0,5.0,,under_review,Kit Carregador Tomada + Cabo Usb V8 1000ma Kai...,0,H53U1H7Q5G,SIN_DATOS,SIN_DATOS,Y0DXBLS7S0,SIN_DATOS,SIN_DATOS,0
234877,T9S7T6GJOK,754.0,35.0,22.0,15.0,,under_review,Tênis Vans Old Skool Na Caixa Preço Revenda,0,H53U1H7Q5G,SIN_DATOS,SIN_DATOS,OTROS,SIN_DATOS,SIN_DATOS,0
292726,RHEYU9IK4E,2473.0,28.0,23.0,18.0,149.78,active,Alto Falante Bravox Kit Fácil X 6 B3x60x + 6x9...,9,H53U1H7Q5G,new,MLB-AUTOMOTIVE_SPEAKERS,OTROS,BRAVOX,OTROS,0


### The goal is to build a per-shipment score that reflects the likelihood of exceeding the carrier's limit, based on the listing title.
#### Oversampling Exceeded Shipments

In [5]:
df_excedido = df[ df.EXCEDIDO==1 ].sample(50000, replace= True)

In [6]:
df_ok = df[ df.EXCEDIDO==0 ].sample(150000)

In [7]:
df_balance = pd.concat([df_excedido, df_ok]).sample(200000).reset_index(drop=True)

In [8]:
df_balance.shape

(200000, 16)

In [9]:
X_train = df_balance.loc[:160000,'TITLE']
y_train = df_balance.loc[:160000,'EXCEDIDO']
X_test = df_balance.loc[160001:,'TITLE']
y_test = df_balance.loc[160001:,'EXCEDIDO']

In [10]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(160001,)
(160001,)
(39999,)
(39999,)
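
Because the EXCEDIDO==1 rows are sampled with replacement before the positional split, copies of the same title can end up in both the training and the test slice. One alternative, shown here only as a sketch and not as the method used in this notebook, is to hold out the test rows first and oversample the positives inside the training portion only. The helper `split_then_oversample` and the toy rows are hypothetical.

```python
import random

def split_then_oversample(rows, label_key, test_fraction, n_positive, seed=0):
    """Hold out a test set first, then oversample positives in the training part."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    positives = [r for r in train if r[label_key] == 1]
    negatives = [r for r in train if r[label_key] == 0]
    if not positives:
        raise ValueError("no positive rows left in the training portion")

    # Sample positives with replacement, as df.sample(..., replace=True) does.
    oversampled = [rng.choice(positives) for _ in range(n_positive)]
    balanced = negatives + oversampled
    rng.shuffle(balanced)
    return balanced, test

# Toy data: one row in ten is flagged as exceeded.
rows = [{"TITLE": f"item {i}", "EXCEDIDO": 1 if i % 10 == 0 else 0} for i in range(100)]
train_rows, test_rows = split_then_oversample(
    rows, "EXCEDIDO", test_fraction=0.2, n_positive=30
)
```

With this ordering, no title in the test rows can also appear in the training rows, so the validation metrics are computed on titles the model has not seen.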


## Deep Model

### Get Vocabulary

In [11]:
tokenizer_obj = Tokenizer()
total_reviews = df.TITLE
tokenizer_obj.fit_on_texts(total_reviews)

In [12]:
max_length = max([len(s.split()) for s in total_reviews])

In [13]:
print(max_length)

23


In [14]:
vocab_size = len(tokenizer_obj.word_index) + 1

In [15]:
print(vocab_size)

96641


In [16]:
X_train_tokens = tokenizer_obj.texts_to_sequences(X_train)
X_test_tokens = tokenizer_obj.texts_to_sequences(X_test)

In [17]:
X_train_pad = pad_sequences(X_train_tokens, maxlen= max_length, padding='post')
X_test_pad = pad_sequences(X_test_tokens, maxlen= max_length, padding='post')
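
To make the tokenization and padding steps concrete, here is a simplified pure-Python stand-in. It is not the Keras implementation (the real `Tokenizer` also strips punctuation, among other things), and the helper names `build_word_index`, `to_sequence`, and `pad` are hypothetical. It shows how titles become integer sequences, why `vocab_size` is the index size plus one (index 0 is reserved for padding), and how `'post'` and `'pre'` padding differ.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

def pad(seq, maxlen, padding="post"):
    """Pad with zeros up to maxlen, keeping the last maxlen tokens if too long."""
    seq = seq[-maxlen:]
    fill = [0] * (maxlen - len(seq))
    return seq + fill if padding == "post" else fill + seq

titles = ["Monitor Samsung Led", "Monitor Led Curved Full Hd"]
word_index = build_word_index(titles)
vocab_size_toy = len(word_index) + 1  # +1 for the padding index 0

seq = to_sequence(titles[0], word_index)
post_padded = pad(seq, maxlen=5, padding="post")  # zeros after the tokens
pre_padded = pad(seq, maxlen=5, padding="pre")    # zeros before the tokens
print(post_padded, pre_padded)
```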

In [18]:
# Build Model
from keras.models import Sequential
from keras.layers import Dense, Embedding, GRU

Using TensorFlow backend.


In [19]:
EMBEDDING_DIM = 100
print('Build model...')

Build model...


In [20]:
model = Sequential()
model.add(Embedding(vocab_size, EMBEDDING_DIM, input_length=max_length))
model.add(GRU(units=48, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


In [21]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])

In [22]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 23, 100)           9664100   
_________________________________________________________________
gru_1 (GRU)                  (None, 48)                21456     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 49        
Total params: 9,685,605
Trainable params: 9,685,605
Non-trainable params: 0
_________________________________________________________________
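
The parameter counts in the summary can be checked by hand. The sketch below assumes the classic GRU formulation with three gates, each holding an input kernel, a recurrent kernel, and a single bias vector; that assumption reproduces the 21,456 reported for `gru_1`.

```python
vocab_size = 96641      # printed by the notebook
embedding_dim = 100     # EMBEDDING_DIM
gru_units = 48

# One embedding vector per vocabulary index.
embedding_params = vocab_size * embedding_dim

# Three gates, each with an input kernel, a recurrent kernel, and one bias vector.
gru_params = 3 * (gru_units * (embedding_dim + gru_units) + gru_units)

# One output unit: 48 weights plus 1 bias.
dense_params = gru_units * 1 + 1

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```

Almost all of the model's capacity sits in the embedding table, which is a direct consequence of the large vocabulary.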


### Train the Model. Pretrained model weights are provided, so you can skip ahead to the Import Weights cell.

In [23]:
model.fit(X_train_pad, y_train, batch_size=128, epochs=5, validation_data=(X_test_pad, y_test), verbose=2)

Instructions for updating:
Use tf.cast instead.
Train on 160001 samples, validate on 39999 samples
Epoch 1/5
 - 155s - loss: 0.3122 - acc: 0.8719 - val_loss: 0.2032 - val_acc: 0.9222
Epoch 2/5
 - 157s - loss: 0.1555 - acc: 0.9408 - val_loss: 0.1591 - val_acc: 0.9399
Epoch 3/5
 - 165s - loss: 0.1047 - acc: 0.9599 - val_loss: 0.1374 - val_acc: 0.9518
Epoch 4/5
 - 162s - loss: 0.0789 - acc: 0.9706 - val_loss: 0.1346 - val_acc: 0.9549
Epoch 5/5
 - 162s - loss: 0.0649 - acc: 0.9757 - val_loss: 0.1393 - val_acc: 0.9557


<keras.callbacks.History at 0x156cf970860>
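
In the log, `val_loss` reaches its lowest value at epoch 4 and rises slightly at epoch 5 while the training loss keeps falling, which may be an early sign of overfitting. A small sketch that picks the best epoch from the logged values (copied from the output above) follows. Keras also offers an `EarlyStopping` callback that could automate this, although the notebook does not use it.

```python
# Validation metrics copied from the training log, one entry per epoch.
val_loss = [0.2032, 0.1591, 0.1374, 0.1346, 0.1393]
val_acc = [0.9222, 0.9399, 0.9518, 0.9549, 0.9557]

# Epochs are 1-based in the log, list indices are 0-based.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch, val_loss[best_epoch - 1], val_acc[best_epoch - 1])
```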

In [24]:
#model.save('meli_title_nn_200k.h5')

#### Import weights and make predictions

In [25]:
#model.load_weights('meli_title_nn_200k.h5')

In [26]:
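test_txt = df.TITLE
test_txt_tokens = tokenizer_obj.texts_to_sequences(test_txt)
# Note: no `padding` argument is passed here, so pad_sequences uses its default ('pre'),
# whereas the training inputs were padded with padding='post'.
test_txt_tokens_pad = pad_sequences(test_txt_tokens, maxlen=max_length)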

In [27]:
title_score = model.predict(x=test_txt_tokens_pad)

In [28]:
title_score

array([[3.2600760e-04],
       [4.5561194e-03],
       [4.5561194e-03],
       ...,
       [3.8499478e-01],
       [9.7332494e-03],
       [6.7802185e-01]], dtype=float32)

#### Add the score to the df

In [29]:
df['SCORE'] = title_score

In [30]:
df[df.EXCEDIDO==1][['TITLE','EXCEDIDO','SCORE']].sample(10)

Unnamed: 0,TITLE,EXCEDIDO,SCORE
93002,Fogão Cooktop 5 Bocas Preto Askoi - 1 Ano De ...,1,0.945979
23802,Roçadeira À Gasolina 1 Hp 26 Cc 2 Tempos Tbc26...,1,0.999443
206236,Tapete De Atividades Gymini Move Play - Tiny Love,1,0.984532
174806,Monitor Samsung Led 27'' Full Hd Curved + Not...,1,0.992743
260425,Banqueta Junko Marrom Caputino Mercado De Pont...,1,0.999735
31769,Arara Roupas 100% Aço C/ Sapateira Cabideiro R...,1,0.999132
274465,Friso Lateral Etios Hatch 13 A 19 Vermelho Fur...,1,0.522903
126732,Jogo 3 Prat. Mdf Kit Organiz. Parede Invis. 60...,1,0.924078
189260,Monitores Mackie Cr3 Caixas Som,1,0.907548
45948,Taco Sinuca / Bilhar Profissional Cruz De Malt...,1,0.980295


In [31]:
pd.crosstab(df.EXCEDIDO, df.SCORE<0.5)

EXCEDIDO,SCORE < 0.5 is False,SCORE < 0.5 is True
0,19520,263460
1,12704,607


#### Most shipments flagged as exceeded have a score > 0.5

In [32]:
recall = (df[df.EXCEDIDO==1].SCORE > 0.5).sum() / df.EXCEDIDO.sum() 
recall

0.9543986176846218
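
Recall alone does not show how many non-exceeded shipments also score above the threshold. The same crosstab yields precision and accuracy as well; the sketch below recomputes all three from the four counts. These figures cover the full `df`, which includes rows that were also sampled for training, so they are likely optimistic.

```python
# Counts read from the crosstab (its columns indicate whether SCORE < 0.5).
tp = 12704    # EXCEDIDO == 1 and score not below 0.5
fn = 607      # EXCEDIDO == 1 and score below 0.5
fp = 19520    # EXCEDIDO == 0 and score not below 0.5
tn = 263460   # EXCEDIDO == 0 and score below 0.5

recall = tp / (tp + fn)                      # share of exceeded shipments caught
precision = tp / (tp + fp)                   # share of flagged shipments truly exceeded
accuracy = (tp + tn) / (tp + fn + fp + tn)   # share of all rows classified correctly

print(f"recall={recall:.4f} precision={precision:.4f} accuracy={accuracy:.4f}")
```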

In [33]:
# Resulting dataset
print(df.shape)
print(df.dtypes)
display(df.sample(5))

(296291, 17)
ITEM_ID          object
SHP_WEIGHT      float64
SHP_LENGTH      float64
SHP_WIDTH       float64
SHP_HEIGHT      float64
PRICE           float64
STATUS           object
TITLE            object
LEN_ATR           int64
DT_CAT_PROD      object
DT_CONDITION     object
DT_DOMAIN        object
DT_SELLER        object
DT_BRAND         object
DT_MODEL         object
EXCEDIDO          int64
SCORE           float32
dtype: object


Unnamed: 0,ITEM_ID,SHP_WEIGHT,SHP_LENGTH,SHP_WIDTH,SHP_HEIGHT,PRICE,STATUS,TITLE,LEN_ATR,DT_CAT_PROD,DT_CONDITION,DT_DOMAIN,DT_SELLER,DT_BRAND,DT_MODEL,EXCEDIDO,SCORE
275991,VC8J8F1DA0,131.0,16.0,11.0,11.0,,under_review,Anycast Adaptador Hdmi Chromecast Ezcast Wecas...,0,H53U1H7Q5G,SIN_DATOS,SIN_DATOS,OTROS,SIN_DATOS,SIN_DATOS,0,0.000423
200105,PF25P0XN2O,60.0,26.5,10.8,5.0,45.0,active,Dremel Ez476 Disco De Corte P/ Plástico Ez Loc...,13,H53U1H7Q5G,new,MLB-CUT_OFF_WHEELS,XYDSNCU3UV,DREMEL,OTROS,0,0.000555
89950,S4R3EF5I4W,1740.0,16.0,11.0,4.0,,under_review,Kit 10 Lâmpada 16w Branco Quente Led 4u Milho ...,0,H53U1H7Q5G,SIN_DATOS,SIN_DATOS,GA9FI6X2KH,SIN_DATOS,SIN_DATOS,0,0.292742
184942,TP62F8P7C5,1640.0,33.0,25.0,7.0,48.0,active,Kit 2 Bucha Barra Estabilizador Dianteiro Peug...,10,H53U1H7Q5G,new,MLB-SUSPENSION_CONTROL_ARM_BUSHINGS,EYBX2QNZ29,OTROS,SIN_DATOS,0,0.003401
146320,Y7N6ZGWAGZ,1970.0,45.0,22.0,20.0,159.99,paused,Maleta Maquiagem Profissional A87 C/ Kit 12 Pi...,8,H53U1H7Q5G,new,MLB-MAKEUP_TRAIN_CASES,XJC5VFREEM,OTROS,OTROS,0,0.000445


In [34]:
# Export the revised dataset
cols=['ITEM_ID', 'SHP_WEIGHT', 'SHP_LENGTH', 'SHP_WIDTH', 'SHP_HEIGHT', 'PRICE', 'STATUS', 'TITLE', 'LEN_ATR', 
      'DT_CAT_PROD', 'DT_CONDITION', 'DT_DOMAIN', 'DT_SELLER', 'DT_BRAND', 'DT_MODEL', 'SCORE', 'EXCEDIDO']

In [35]:
df[cols].to_csv('./meli_dataset_b.csv', index=False)