We import all the libraries needed in the notebook.

In [1]:
import matplotlib.pyplot as plt
import numpy
import pandas
import seaborn
import sys
from scipy import stats

We read the Mercado Libre dataset. The TITLE, DOMAIN_ID and ATTRIBUTES columns are read as strings; ATTRIBUTES is parsed in a later step.

In [2]:

dataset = pandas.read_csv('meli_dataset_20190426.csv', converters={'ATTRIBUTES':str,'DOMAIN_ID': str, 'TITLE': str})
row0 = dataset.shape[0]
row0

499948

In [3]:
dataset.head()

Unnamed: 0,ITEM_ID,SHP_WEIGHT,SHP_LENGTH,SHP_WIDTH,SHP_HEIGHT,ATTRIBUTES,CATALOG_PRODUCT_ID,CONDITION,DOMAIN_ID,PRICE,SELLER_ID,STATUS,TITLE
0,M1CQ76ZT5W,,,,,,H53U1H7Q5G,,,,,404,
1,SN7ISIGQ9J,235.0,25.0,25.0,10.0,"[{'id': 'BRAND', 'name': 'Marca', 'value_id': ...",H53U1H7Q5G,new,MLB-SKIN_CARE_SUPPLIES,68.0,QF4OJMYQ9Q,active,Ácido Hidroquinona 20% 30g + Sabonete Pré Pe...
2,JGEV50GW2U,1757.0,23.0,17.0,16.0,"[{'id': 'ACCESSORIES_INCLUDED', 'name': 'Acess...",YRBDJR6T7Y,new,MLB-NEBULIZERS,145.9,WEE71CZC2Q,active,Inalador E Nebulizador Infantil Nebdog Superfl...
3,JGEV50GW2U,1748.0,23.0,17.0,16.0,"[{'id': 'ACCESSORIES_INCLUDED', 'name': 'Acess...",YRBDJR6T7Y,new,MLB-NEBULIZERS,145.9,WEE71CZC2Q,active,Inalador E Nebulizador Infantil Nebdog Superfl...
4,JGEV50GW2U,,,,,"[{'id': 'ACCESSORIES_INCLUDED', 'name': 'Acess...",YRBDJR6T7Y,new,MLB-NEBULIZERS,145.9,WEE71CZC2Q,active,Inalador E Nebulizador Infantil Nebdog Superfl...


**We remove the rows whose status is `404`. The `STATUS` column is only useful for this cleaning step, so it can be dropped from the dataset afterwards.**

In [4]:
indices = dataset[ dataset['STATUS'] == '404' ].index
dataset.drop(indices , inplace=True)

In [5]:
row1 = dataset.shape[0]
row0-row1

78361

In [6]:
round(100 * (row0 - row1) / row0, 2)

15.67

We see that 78361 rows were removed. This is about 15.7 percent of the original dataset of almost 500 thousand rows.

**We remove the rows with NaN values in the columns prefixed with `SHP_`. These are the columns that represent either the weight or the dimensions of an item.**

In [7]:
indices = dataset[dataset['SHP_WEIGHT'].isna() | dataset['SHP_LENGTH'].isna() | 
                  dataset['SHP_WIDTH'].isna() | dataset['SHP_HEIGHT'].isna() ].index
dataset.drop(indices , inplace=True)
row2 = dataset.shape[0]

In [8]:
print(f'{row2} rows remain. {row1 - row2} rows were removed. This is {100 * (row1 - row2) / row1:.1f} percent.')

296325 rows remain. 125262 rows were removed. This is 29.7 percent.


**We group by item id and compute the median of the weight and dimensions. This leaves a single row for each item_id.**


In [9]:
listColumns = list(dataset.columns)
columns_dic={}
for item in listColumns:
    if item.startswith('SHP_'):
        columns_dic[item] = 'median'
    else:
        columns_dic[item] = 'first'
dataset = dataset.groupby('ITEM_ID').agg(columns_dic)
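
As a small illustration of the aggregation rule above, the sketch below applies median-for-`SHP_`-columns and first-for-the-rest to a toy frame. The toy values are made up for the example, loosely modeled on the rows shown earlier, and the names `toy`, `agg_rules` and `per_item` are hypothetical.

```python
import pandas

# Toy data (hypothetical values): item "A" has two shipping measurements.
toy = pandas.DataFrame({
    'ITEM_ID': ['A', 'A', 'B'],
    'SHP_WEIGHT': [1757.0, 1748.0, 235.0],
    'PRICE': [145.9, 145.9, 68.0],
})

# Median for shipping columns, first value for everything else.
agg_rules = {col: ('median' if col.startswith('SHP_') else 'first')
             for col in toy.columns if col != 'ITEM_ID'}

# One row per ITEM_ID; the grouping key becomes the index.
per_item = toy.groupby('ITEM_ID').agg(agg_rules)
print(per_item)
```

Leaving the grouping key out of the rules is a design choice of this sketch: it avoids carrying `ITEM_ID` both as the index and as a regular column.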


In [10]:
dataset.describe()

Unnamed: 0,SHP_WEIGHT,SHP_LENGTH,SHP_WIDTH,SHP_HEIGHT,PRICE
count,236443.0,236443.0,236443.0,236443.0,210659.0
mean,1818.96622,31.398151,21.251572,11.503536,1958.341
std,3199.67595,18.46979,11.157975,8.23487,663823.2
min,1.0,0.0,0.0,0.0,0.1
25%,250.0,20.0,13.0,5.0,47.9
50%,650.0,25.0,20.0,10.0,99.99
75%,1883.75,36.0,25.0,16.0,179.9
max,50000.0,288.2,115.0,105.0,303248700.0


**4-Parse the attributes column and extract into their own columns those attributes whose `id` is `BRAND` or `MODEL`. These attributes represent the brand or model that the item's seller entered in the listing. [Optional] There is no need to limit yourself to these two attributes; you can try keeping the N most frequent attributes.**



Parsing the dataframe takes approximately 10 minutes.

In [11]:
import ast

def parse_attributes(row):
    """Build the text feature: brand, model, domain and title joined by spaces."""
    # Rows without attributes fall back to domain and title only
    if row['ATTRIBUTES'] == '':
        return row['DOMAIN_ID'] + ' ' + row['TITLE']
    # ATTRIBUTES holds the repr of a list of dicts, so literal_eval can parse it
    data_dict = ast.literal_eval(row['ATTRIBUTES'])
    #print(data_dict)
    data_df = pandas.DataFrame.from_dict(data_dict)
    if 'id' not in data_df.columns:
        return row['DOMAIN_ID'] + ' ' + row['TITLE']
    brand_value_name = data_df[data_df['id']=='BRAND']['value_name'].values
    model_value_name = data_df[data_df['id']=='MODEL']['value_name'].values

    # Keep a value only when exactly one match exists and it is not None
    if brand_value_name.size == 1:
        brand_value_name = brand_value_name[0]
        if brand_value_name is None:
            brand_value_name = ''
    else:
        brand_value_name = ''
    if model_value_name.size == 1:
        model_value_name = model_value_name[0]
        if model_value_name is None:
            model_value_name = ''
    else:
        model_value_name = ''
    return brand_value_name + ' ' + model_value_name + ' ' + row['DOMAIN_ID'] + ' ' + row['TITLE']

#dataset.sample(50).apply(parse_attributes, axis=1)
dataset['X'] = dataset.apply(parse_attributes, axis=1)
#dataset2 = dataset.apply(parse_attributes, axis=1)
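
Building a `DataFrame` for every row is probably a large part of the running time. A lighter sketch, assuming each `ATTRIBUTES` entry is a stringified list of dicts with `id` and `value_name` keys as in the rows shown above, walks the parsed list directly. The helper name `extract_brand_model` and the sample string are hypothetical, and the helper only approximates the rule used in the cell above.

```python
import ast

def extract_brand_model(attributes):
    """Return (brand, model) from a stringified list of attribute dicts.

    Hypothetical helper: a value is kept only when exactly one attribute
    with that id exists and its value_name is not None.
    """
    if attributes == '':
        return '', ''
    parsed = ast.literal_eval(attributes)
    found = {'BRAND': [], 'MODEL': []}
    for attr in parsed:
        attr_id = attr.get('id')
        if attr_id in found:
            found[attr_id].append(attr.get('value_name'))
    brand, model = (
        values[0] if len(values) == 1 and values[0] is not None else ''
        for values in (found['BRAND'], found['MODEL'])
    )
    return brand, model

# Made-up sample in the same shape as the ATTRIBUTES column.
sample = "[{'id': 'BRAND', 'value_name': 'zefal'}, {'id': 'MODEL', 'value_name': None}]"
print(extract_brand_model(sample))
```

The returned pair could then be concatenated with `DOMAIN_ID` and `TITLE` in the same way as in `parse_attributes`.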

In [12]:
dataset.head()

ITEM_ID (index),ITEM_ID,SHP_WEIGHT,SHP_LENGTH,SHP_WIDTH,SHP_HEIGHT,ATTRIBUTES,CATALOG_PRODUCT_ID,CONDITION,DOMAIN_ID,PRICE,SELLER_ID,STATUS,TITLE,X
A002DG7EAZ,A002DG7EAZ,812.0,36.0,32.0,12.0,,H53U1H7Q5G,,,,U2T0EY02XB,under_review,Apresentador Multimídia Wireless Logitech R400...,Apresentador Multimídia Wireless Logitech R40...
A00SG33UIH,A00SG33UIH,2320.0,16.0,11.0,4.0,"[{'id': 'AUTHOR', 'name': 'Autor', 'value_id':...",H53U1H7Q5G,new,MLB-BOOKS,149.99,BCZWFNME44,active,Apostila Trt-sp 2018 - Analista Jud. Área Apo...,MLB-BOOKS Apostila Trt-sp 2018 - Analista Ju...
A00VIC9XL7,A00VIC9XL7,213.0,16.0,13.0,10.0,"[{'id': 'BRAND', 'name': 'Marca', 'value_id': ...",H53U1H7Q5G,new,MLB-BABY_GROOMING_KITS,329.0,T2JY69NPBA,active,Wetstop 3 Alarme Miccional Xixi Na Cama Enurese,Wet Stop 3+ MLB-BABY_GROOMING_KITS Wetstop 3 A...
A00VM7MP9F,A00VM7MP9F,175.0,25.0,20.0,15.0,"[{'id': 'ITEM_CONDITION', 'name': 'Condição do...",H53U1H7Q5G,new,MLB-BICYCLE_BOTTLE_CAGES,45.0,YA6XOOJU39,active,Suporte De Garrafa Zefal Wiiz Para Bicicleta,zefal wiiz MLB-BICYCLE_BOTTLE_CAGES Suporte De...
A00W1VSE3K,A00W1VSE3K,82.0,30.0,15.0,5.0,"[{'id': 'ALARM', 'name': 'Com alarme', 'value_...",H53U1H7Q5G,new,MLB-PEDOMETERS,31.98,DCLDPQAY43,active,Relógio Marcador De Passos Distancia E Caloria...,BLUELANS Led MLB-PEDOMETERS Relógio Marcador D...


**We vectorize column X using TfidfVectorizer**
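
To make the weighting concrete, here is a pure-Python sketch of TF-IDF with a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document. This corresponds to scikit-learn's `smooth_idf=True` and `norm='l2'` settings. It is an illustration of the formula, not the library implementation, and the function name `tfidf` and the two toy documents are assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed idf and L2 row normalization.

    docs: list of token lists. Returns one {token: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Toy documents (made up): "suporte" appears in both, so it gets a lower weight.
docs = [['suporte', 'garrafa', 'bicicleta'], ['suporte', 'celular']]
rows = tfidf(docs)
for row in rows:
    print(row)
```

Tokens that appear in fewer documents receive a larger weight, which is why rare brand or model names can be informative features.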

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import string 

In [14]:
def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns a list of the cleaned text
    """
    # Drop punctuation characters and rebuild the string
    nopunc = ''.join(char for char in mess if char not in string.punctuation)

    # Load the stopword list once per call instead of once per word
    stop_words = set(stopwords.words('portuguese'))

    # Keep only the words that are not stopwords
    return [word for word in nopunc.split() if word.lower() not in stop_words]

In [15]:
vectorizer = TfidfVectorizer(analyzer=text_process)

In [16]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/pbrizuela/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

The fit takes approximately 5 minutes.

In [17]:
vectorizer.fit(dataset['X'])

TfidfVectorizer(analyzer=<function text_process at 0x7f2d23a1fbf8>,
        binary=False, decode_error='strict', dtype=<class 'numpy.float64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), norm='l2',
        preprocessor=None, smooth_idf=True, stop_words=None,
        strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [18]:
len(vectorizer.vocabulary_)

180452

In [19]:
#print(vectorizer.vocabulary_.keys())

In [20]:
# encode document
vector = vectorizer.transform(dataset['X'])

In [21]:
#vector[vector>0.9]

In [22]:
labels=(dataset['SHP_LENGTH']>70) | (dataset['SHP_WIDTH'] >70) | (dataset['SHP_HEIGHT'] >70)

In [23]:
len(labels)

236443

In [24]:
numpy.sum(labels==True)

10434

In [25]:
numpy.sum(labels==False)

226009

**1-Split the dataset into train/test (80-20). Remember the train_test_split utility from scikit-learn. Use the `random_state` and `stratify` parameters and explain their purpose.**

In [26]:
from sklearn.model_selection import train_test_split
data_train, data_test, label_train, label_test = train_test_split(vector, labels, test_size = 0.2, random_state=0, stratify=labels)

In [27]:
len(label_train)

189154

In [28]:
numpy.sum(label_train==True)

8347

In [29]:
numpy.sum(label_train==False)

180807

In [30]:
len(label_test)

47289

In [31]:
numpy.sum(label_test==True)

2087

In [32]:
numpy.sum(label_test==False)

45202

The random_state parameter seeds the random number generator so that the result of the split is reproducible.
The stratify parameter makes the train and test splits keep the same class distribution as the full dataset.
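
Both effects can be checked on a small synthetic example. The sketch below uses made-up labels with 10% positives (an assumption for illustration, not the notebook's data) and verifies that the positive rate is preserved and that repeating the call with the same seed gives the same split.

```python
import numpy
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 10 positives out of 100 (made-up data).
y = numpy.array([True] * 10 + [False] * 90)
X = numpy.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# stratify keeps the 10% positive rate in both parts.
print(y_tr.sum(), len(y_tr))
print(y_te.sum(), len(y_te))

# The same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
print(numpy.array_equal(X_te, X_te2))
```

Without `stratify`, a rare class can end up over- or under-represented in the test set purely by chance, which makes metrics such as recall noisier.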

**2-Train and evaluate at least 3 new models (Suggestions: SVM, RandomForest, GradientBoostingClassifier, etc.) Mandatory: try a neural network. It can be from scikit-learn or from any other library you prefer, such as keras, pytorch, etc. Along with the metrics, a brief description of how each model works must be delivered.**

We create a data frame to store the best metric obtained for each model.

In [33]:
import time
from sklearn.model_selection import GridSearchCV

results = pandas.DataFrame(columns=('clf', 'best_acc'))
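
The cells in this notebook add rows with `DataFrame.append`, which was removed in pandas 2.0. On a recent pandas, one possible equivalent is to collect plain dicts in a list and build the frame at the end. In the sketch below the scores are rounded copies of the cross-validation scores reported further down in this notebook, plain strings stand in for the fitted estimators, and the names `rows` and `results_sketch` are hypothetical.

```python
import pandas

# Rounded cross-validation scores reported later in this notebook;
# strings replace the fitted estimator objects for brevity.
rows = []
rows.append({'clf': 'MultinomialNB', 'best_acc': 0.0192})
rows.append({'clf': 'Perceptron', 'best_acc': 0.4670})
rows.append({'clf': 'LinearSVC', 'best_acc': 0.5531})
rows.append({'clf': 'RandomForestClassifier', 'best_acc': 0.4610})

# Build the frame once, after all models have been evaluated.
results_sketch = pandas.DataFrame(rows, columns=['clf', 'best_acc'])

# Same lookup as the "best classifier so far" printout.
best = results_sketch.loc[results_sketch['best_acc'].idxmax(), 'clf']
print(best)
```

Note that the column is called `best_acc` but, since the grid searches use `scoring='f1'`, it holds F1 scores.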

We apply Grid Search to find the best model for NB.

In [34]:
from sklearn.naive_bayes import MultinomialNB
start_time = time.time()
nb_param = {
    'alpha': [1.0],
}

nb = MultinomialNB()
nb_clf = GridSearchCV(nb, nb_param, scoring='f1', cv=5, iid=False, n_jobs=-1)
nb_clf.fit(data_train, label_train)

best_nb_clf = nb_clf.best_estimator_
print('Best NB scoring: ', nb_clf.best_score_)
print(best_nb_clf)
results = results.append({'clf': best_nb_clf , 'best_acc': nb_clf.best_score_}, ignore_index=True)
print(f'Seconds: {time.time() - start_time}')



Best NB scoring:  0.01918120986748998
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Seconds: 2.6763265132904053


In [35]:
from sklearn.metrics import classification_report,confusion_matrix
y_pred = best_nb_clf.predict(data_test)
print(classification_report(label_test,y_pred))
print(confusion_matrix(label_test,y_pred))

              precision    recall  f1-score   support

       False       0.96      1.00      0.98     45202
        True       0.87      0.02      0.04      2087

   micro avg       0.96      0.96      0.96     47289
   macro avg       0.91      0.51      0.51     47289
weighted avg       0.95      0.96      0.94     47289

[[45195     7]
 [ 2039    48]]
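
The gap between the high accuracy and the very low F1 of the positive class follows directly from the confusion matrix. The short sketch below recomputes the metrics of the `True` class from the four counts printed above; the variable names are chosen for the example.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 45195, 7
fn, tp = 2039, 48

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the real positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.2f} precision={precision:.2f} '
      f'recall={recall:.2f} f1={f1:.2f}')
# accuracy=0.96 precision=0.87 recall=0.02 f1=0.04
```

With roughly 4% positives, a model that almost always predicts `False` still reaches 96% accuracy, which is why F1 is used as the selection metric.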


We apply Grid Search to find the best model for Perceptron.

In [36]:
from sklearn.linear_model import Perceptron
start_time = time.time()
perceptron_param = {
    'alpha': [1.0],
}

In [37]:
clf_perceptron = Perceptron(tol=1e-3, random_state=0)

In [38]:
perceptron = Perceptron(tol=1e-3, random_state=0)
perceptron_clf = GridSearchCV(perceptron, perceptron_param, scoring='f1', cv=5, iid=False, n_jobs=-1)
perceptron_clf.fit(data_train, label_train)

best_perceptron_clf = perceptron_clf.best_estimator_
print('Best Perceptron scoring: ', perceptron_clf.best_score_)
print(best_perceptron_clf)
results = results.append({'clf': best_perceptron_clf , 'best_acc': perceptron_clf.best_score_}, ignore_index=True)
print(f'Seconds: {time.time() - start_time}')

Best Perceptron scoring:  0.46704596501424633
Perceptron(alpha=1.0, class_weight=None, early_stopping=False, eta0=1.0,
      fit_intercept=True, max_iter=None, n_iter=None, n_iter_no_change=5,
      n_jobs=None, penalty=None, random_state=0, shuffle=True, tol=0.001,
      validation_fraction=0.1, verbose=0, warm_start=False)
Seconds: 2.4606375694274902


In [39]:
y_pred = best_perceptron_clf.predict(data_test)
print(classification_report(label_test,y_pred))
print(confusion_matrix(label_test,y_pred))

              precision    recall  f1-score   support

       False       0.98      0.98      0.98     45202
        True       0.48      0.47      0.48      2087

   micro avg       0.95      0.95      0.95     47289
   macro avg       0.73      0.73      0.73     47289
weighted avg       0.95      0.95      0.95     47289

[[44108  1094]
 [ 1096   991]]


We apply Grid Search to find the best model for SVM. We use a linear SVM (LinearSVC) and try different class_weight values, since in our case the classes are imbalanced.

In [40]:
#SVM

from sklearn.svm import LinearSVC

start_time = time.time()
svm_param = {
    'class_weight': [{True:1},{True:2},{True:3},{True:4},{True:5}],
}

svm = LinearSVC()
svm_clf = GridSearchCV(svm, svm_param, scoring='f1', cv=5, iid=False, n_jobs=-1)
svm_clf.fit(data_train, label_train)

best_svm_clf = svm_clf.best_estimator_
print('Best SVM scoring: ', svm_clf.best_score_)
print(best_svm_clf)
results = results.append({'clf': best_svm_clf , 'best_acc': svm_clf.best_score_}, ignore_index=True)

print('The best classifier so far is: ')
print(results.loc[results['best_acc'].idxmax()]['clf'])
print(f'Seconds: {time.time() - start_time}')



Best SVM scoring:  0.5530504104544234
LinearSVC(C=1.0, class_weight={True: 3}, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
The best classifier so far is: 
LinearSVC(C=1.0, class_weight={True: 3}, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
Seconds: 25.507173538208008


In [41]:
from sklearn.metrics import classification_report,confusion_matrix
y_pred = best_svm_clf.predict(data_test)
print(classification_report(label_test,y_pred))
print(confusion_matrix(label_test,y_pred))

              precision    recall  f1-score   support

       False       0.98      0.98      0.98     45202
        True       0.59      0.55      0.57      2087

   micro avg       0.96      0.96      0.96     47289
   macro avg       0.78      0.77      0.77     47289
weighted avg       0.96      0.96      0.96     47289

[[44387   815]
 [  935  1152]]


We use word2vec to reduce dimensionality (TBC)

!pip install gensim

from gensim.sklearn_api import W2VTransformer

sentences=[]
for index, row in dataset.iterrows():
    tokenized= []
    for word in row['X'].split(' '):
        word = word.split('.')[0]
        word = word.lower()
        tokenized.append(word)
    sentences.append(tokenized)

sentences

model = W2VTransformer(size=100, min_count=1, seed=1)

wordvecs = model.fit(sentences)

wordvecs.transform(['alarme','wetstop'])

We apply Grid Search to find the best model for Random Forest.

In [42]:
# random forest
import time
from sklearn.ensemble import RandomForestClassifier

start_time = time.time()
rfc_param = {
    'n_estimators': [10,15],
    'criterion': ['gini'],
#    'max_depth': [1, 10, 100, 1000],
#    'min_samples_split': [2, 5, 10, 100],
#    'min_samples_leaf': [1, 2, 5, 10, 100]
}


In [43]:
rfc = RandomForestClassifier(random_state=0, n_jobs=-1)
rfc_clf = GridSearchCV(rfc, rfc_param, scoring='f1', cv=5, iid=False, n_jobs=-1)
rfc_clf.fit(data_train, label_train)

best_rfc_clf = rfc_clf.best_estimator_
print('Best Random Forest scoring: ', rfc_clf.best_score_)
print(best_rfc_clf)
results = results.append({'clf': best_rfc_clf , 'best_acc': rfc_clf.best_score_}, ignore_index=True)

print('The best classifier so far is: ')
print(results.loc[results['best_acc'].idxmax()]['clf'])
print(f'Seconds: {time.time() - start_time}')



Best Random Forest scoring:  0.4610420361745356
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=15, n_jobs=-1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)
The best classifier so far is: 
LinearSVC(C=1.0, class_weight={True: 3}, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
Seconds: 2281.7930777072906


In [44]:
y_pred = best_rfc_clf.predict(data_test)
print(classification_report(label_test,y_pred))
print(confusion_matrix(label_test,y_pred))

              precision    recall  f1-score   support

       False       0.97      0.99      0.98     45202
        True       0.74      0.35      0.47      2087

   micro avg       0.97      0.97      0.97     47289
   macro avg       0.86      0.67      0.73     47289
weighted avg       0.96      0.97      0.96     47289

[[44951   251]
 [ 1364   723]]


We apply Grid Search to find the best model for Multi Layer Perceptron.

In [None]:
# neural network
from sklearn.neural_network import MLPClassifier

# Reset the timer so the elapsed time below covers only this search
start_time = time.time()
mlp_param = {
    'hidden_layer_sizes': [(10, 10, 10),(20, 20, 20),(30, 30, 30)],
    'activation': ['relu']
}

mlp = MLPClassifier(activation='relu', solver='adam', alpha=1e-5, hidden_layer_sizes=(30, 30, 30), random_state=1)
mlp_clf = GridSearchCV(mlp, mlp_param, scoring='f1', cv=5, iid=False, n_jobs=-1)
mlp_clf.fit(data_train, label_train)

best_mlp_clf = mlp_clf.best_estimator_
print('Best MLP scoring: ', mlp_clf.best_score_)
print(best_mlp_clf)
results = results.append({'clf': best_mlp_clf , 'best_acc': mlp_clf.best_score_}, ignore_index=True)

print('The best classifier so far is: ')
print(results.loc[results['best_acc'].idxmax()]['clf'])
print(f'Seconds: {time.time() - start_time}')

In [None]:
y_pred = best_mlp_clf.predict(data_test)
print(classification_report(label_test,y_pred))
print(confusion_matrix(label_test,y_pred))

**3-Tune hyper-parameters for these new models. For the evaluations, use the k-fold cross-validation technique (see cross-validation) and explain the results.**
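
In k-fold cross-validation the training data is split into k parts; each part is used once for validation while the other k-1 parts are used for training, and the k scores are averaged. When `GridSearchCV` receives an integer `cv` and a classifier, it uses stratified folds. The sketch below shows what such folds look like on made-up labels with 10% positives (synthetic data, not the notebook's).

```python
import numpy
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels: 10 positives out of 100 (made-up data).
y = numpy.array([True] * 10 + [False] * 90)
X = numpy.zeros((100, 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positives = []
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold holds 1/5 of the data and keeps the class ratio.
    fold_positives.append(int(y[val_idx].sum()))
    print(len(train_idx), len(val_idx), fold_positives[-1])
```

Every sample is used for validation exactly once, so the averaged score depends less on one particular split than a single hold-out evaluation does.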


In [None]:
data_train, data_test, label_train, label_test = train_test_split(vector, labels, test_size = 0.2, random_state=0, stratify=labels)

**4- Choose the best scikit-learn model trained so far according to f1-score and implement a function `predict_with_threshold(model, X, threshold)` such that if the probability given by `model.predict_proba(X)` is greater than `threshold` the class is positive (non-machinable), and negative otherwise. Note that predict_with_threshold(model, X, 0.5) should be equivalent to `model.predict(X)`. Then evaluate the same model for the thresholds 0.3, 0.4, 0.5, 0.6, 0.7. Report metrics and interpret the results.**

In [None]:
def predict_with_threshold(model, X, threshold):
    return (model.predict_proba(X)[:,1] > threshold)
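
The behaviour of the function can be checked without training anything. The sketch below uses a hypothetical stand-in class, `FixedProbaModel`, that returns fixed made-up probabilities, and confirms that a threshold of 0.5 matches `predict` and that higher thresholds yield fewer positives.

```python
import numpy

def predict_with_threshold(model, X, threshold):
    return model.predict_proba(X)[:, 1] > threshold

class FixedProbaModel:
    """Hypothetical stand-in exposing the predict_proba/predict interface."""

    def __init__(self, positive_proba):
        self.positive_proba = numpy.asarray(positive_proba, dtype=float)

    def predict_proba(self, X):
        # X is only used as an index into the fixed probabilities.
        p = self.positive_proba[X]
        return numpy.column_stack([1 - p, p])

    def predict(self, X):
        # Most probable class, as most probabilistic classifiers do.
        return self.predict_proba(X).argmax(axis=1).astype(bool)

model = FixedProbaModel([0.05, 0.35, 0.45, 0.65, 0.95])
X = numpy.arange(5)
counts = []
for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
    counts.append(int(predict_with_threshold(model, X, threshold).sum()))
    print(threshold, counts[-1])
```

Lowering the threshold trades precision for recall: more items are flagged as positive, so more true positives are caught at the cost of more false alarms.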


LinearSVC does not provide the predict_proba method. To obtain probabilities we can use
CalibratedClassifierCV

In [None]:
from sklearn.calibration import CalibratedClassifierCV
cclf = CalibratedClassifierCV(base_estimator=LinearSVC(penalty='l2', dual=False), cv=5)
cclf.fit(data_train, label_train)
res = cclf.predict_proba(data_test)[:, 1]
# array with the calibrated probability of the positive class (column 1)

In [None]:
from sklearn.metrics import classification_report

model = cclf
y_true = label_test
for threshold in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]: 
    y_pred = predict_with_threshold(model, data_test, threshold)
    print(threshold)
    print(classification_report(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))
    

For the best model according to the F1 score, we do have the predict_proba function available.


In [None]:
model = cclf
y_true = label_test
for threshold in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]: 
    y_pred = predict_with_threshold(model, data_test, threshold)
    print(threshold)
    print(classification_report(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))