# Ensemble Model

This notebook prepares the model ensemble used in the last submission. Its **NDCG** on the public **leaderboard** was **0.31293**, which placed second.

### Summary

The ensemble takes the predictions of the 2 trained matrix factorization (MF) models (notebook 02-AlternatingLeastSquaresModel). Up to the top 50 predictions are kept, averaging the scores of the two models.
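
As a minimal sketch of the averaging step, assume each model returns a list of `(item_id, score)` pairs per session. The helper name `average_mf_scores` and the toy ids and scores are made up for illustration. An item missing from one model contributes 0 from that model:

```python
def average_mf_scores(scores_a, scores_b):
    """Average two lists of (item_id, score) pairs; a missing item counts as 0."""
    a, b = dict(scores_a), dict(scores_b)
    return {i: 0.5 * a.get(i, 0.0) + 0.5 * b.get(i, 0.0) for i in a.keys() | b.keys()}

# Hypothetical predictions of the two MF models for one session.
model_1 = [(101, 0.8), (102, 0.4)]
model_2 = [(101, 0.6), (103, 0.2)]
avg = average_mf_scores(model_1, model_2)
```

Item 101 appears in both lists and keeps a high average, while items 102 and 103 are each halved because only one model proposed them.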

The second part scores the products visited in the session, weighting each visit by its position. This simple approach gets good results on its own and complements the MF models, especially for visits to items that were not used in training (the cold-start limitation). The scores are normalized to the range 0 to 1 to put them on the same scale as the MF predictions.
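
The view-scoring step can be sketched as below. The weight `1 / log10(pos + 1)` and the 0.05 cutoff are taken from the evaluation cells of this notebook. The function name `score_views` and the toy session are assumptions for illustration:

```python
import math

def score_views(views, min_score=0.05):
    """Score viewed items: most recent view first, weight 1 / log10(pos + 1)."""
    scores = {}
    for pos, item in enumerate(reversed(views), 1):
        scores[item] = scores.get(item, 0.0) + 1.0 / math.log10(pos + 1)
    total = sum(scores.values())
    if total:
        # Dividing by the total puts every score between 0 and 1.
        scores = {i: s / total for i, s in scores.items()}
    # Drop items whose share of the total is too small.
    return {i: s for i, s in scores.items() if s > min_score}

# Hypothetical session: item 7 was viewed first and again last, item 8 in between.
view_scores = score_views([7, 8, 7])
```

Here item 7 outranks item 8 because it is both the most recent view and viewed twice.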

The two sets of predictions are combined with a weighted sum of scores, giving more weight to the MF predictions. Finally, if any of the top 3 predictions has a high score, only items from the domains of those high-scoring predictions are kept in the recommendation.
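
A simplified sketch of the combination and domain filter follows. It uses the weights (1.0 and 1.5) and the score threshold (1) from the evaluation cells, but leaves out the blacklist and the special handling of items unseen by the MF models. The function name `combine` and the toy data are hypothetical:

```python
W_VIEWS, W_MF = 1.0, 1.5  # same values as W0 and WI in the evaluation cells

def combine(mf_scores, view_scores, item_to_domain, threshold=1.0, top_k=3):
    """Weighted sum of both scores, then optional filtering by top domains."""
    items = mf_scores.keys() | view_scores.keys()
    total = {i: W_MF * mf_scores.get(i, 0.0) + W_VIEWS * view_scores.get(i, 0.0)
             for i in items}
    ranked = sorted(total, key=total.get, reverse=True)
    # Domains of the top-k items whose combined score reaches the threshold.
    domains = {item_to_domain[i] for i in ranked[:top_k] if total[i] >= threshold}
    if domains:
        ranked = [i for i in ranked if item_to_domain[i] in domains]
    return ranked

# Toy inputs: item 1 is strong in both sources and belongs to domain "A".
mf = {1: 0.6, 2: 0.5, 3: 0.1}
views = {1: 0.7, 4: 0.3}
domain_of = {1: "A", 2: "B", 3: "A", 4: "C"}
ranking = combine(mf, views, domain_of)
```

Only item 1 reaches the threshold, so the ranking is restricted to its domain and items 2 and 4 are dropped in favour of item 3.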

### Results

The **NDCG** on test was **0.3180** and on validation **0.3186**. Taking the domain of the first recommended product gives an accuracy of **0.42** against the domain of the purchased product.



In [1]:
import pandas as pd
import numpy as np
import json
from collections import Counter, defaultdict
import heapq
import pickle

Reads the product domains from the catalog (used in the evaluation).

In [2]:
ITEM_TO_DOMAIN = {}
with open("./data/item_data.jl", "rt") as fd:
    for line in fd:
        data = json.loads(line)
        ITEM_TO_DOMAIN[data["item_id"]] = data["domain_id"]

In [3]:
ITEM_TO_COUNTRY = {}
with open("./data/item_data.jl", "rt") as fd:
    for line in fd:
        data = json.loads(line)
        # The third character of category_id is used as the country code ("B" / "M").
        ITEM_TO_COUNTRY[data["item_id"]] = data["category_id"][2]

In [4]:
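# Ideal DCG: the bought item (gain 12) in first place, followed by 9 same-domain items (gain 1 each).
IDCG = np.sum([(1 if i != 1 else 12) / np.log2(1 + i) for i in range(1, 11)])

def dcg(rec, y_item_id, n=10):
    """DCG of the top-n recommendations against the bought item."""
    y_domain = ITEM_TO_DOMAIN[y_item_id]
    # Only items in the bought item's domain contribute: 12 for the exact item, 1 otherwise.
    return np.sum([(1 if yhat_item_id != y_item_id else 12) / np.log2(1 + i)
                   for i, yhat_item_id in enumerate(rec[:n], 1)
                   if ITEM_TO_DOMAIN[yhat_item_id] == y_domain])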
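
The metric computed by this cell can be written out as follows. This is a transcription of the code, not an official definition. The symbols are chosen here for illustration: $\hat{y}_i$ is the item recommended at position $i$, $y$ the bought item, $d(\cdot)$ the domain of an item and $N$ the number of evaluated sessions.

$$
\mathrm{DCG} = \sum_{i=1}^{10} \frac{g_i}{\log_2(1+i)}, \qquad
g_i = \begin{cases} 12 & \text{if } \hat{y}_i = y \\ 1 & \text{if } \hat{y}_i \neq y \text{ and } d(\hat{y}_i) = d(y) \\ 0 & \text{otherwise} \end{cases}
$$

$$
\mathrm{IDCG} = \frac{12}{\log_2 2} + \sum_{i=2}^{10} \frac{1}{\log_2(1+i)}, \qquad
\mathrm{NDCG} = \frac{1}{N} \sum_{s=1}^{N} \frac{\mathrm{DCG}_s}{\mathrm{IDCG}}
$$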
    

Products without a price or domain are excluded from the recommendations. Note that the cell below only checks for a missing `domain_id`.

In [5]:
BLACK_LIST = set()
with open("./data/item_data.jl", "rt") as fd:
    for line in fd:
        data = json.loads(line)
        if data["domain_id"] is None:
            BLACK_LIST.add(data["item_id"])
       

In [6]:
len(BLACK_LIST)

851

## Read the recommendations from the matrix factorization models

In [14]:
with open("data/models/implicit_test_reco_scores_model1.pkl", "rb") as fd:
    test_mf_vars = pickle.load(fd)
    test_mf_scores = test_mf_vars["test_reco_scores"]
    val_mf_scores = test_mf_vars["val_reco_scores"]

In [15]:
with open("data/models/implicit_test_reco_scores_model2.pkl", "rb") as fd:
    test_mf_vars = pickle.load(fd)
    test_mf_scores_2 =  test_mf_vars["test_reco_scores"]
    val_mf_scores_2 = test_mf_vars["val_reco_scores"]


In [16]:
with open("data/models/implicit_matrix_variables.pkl", "rb") as fd:
    ITEM_TO_IDX = pickle.load(fd)["ITEM_TO_IDX"]

## Popularity-based recommendations

Used as filler (cold start).

We take the most viewed products overall and the most viewed products per domain.

In [26]:
most_viewed_items = Counter()
most_viewed_by_domain = {}

line_idx = 0
with open("./data/train_dataset.jl", "rt") as fd:
    for line in fd:
        line_idx += 1
        data = json.loads(line)
        view = [event["event_info"] for event in data["user_history"] if event["event_type"] == "view"]
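        # Count on most_viewed_items directly: views_counter is only defined (as an alias) in a later cell.
        most_viewed_items.update(view)

        # Per-domain popularity counts each item at most once per session.
        for item_id in set(view):
            domain = ITEM_TO_DOMAIN[item_id]
            if domain not in most_viewed_by_domain:
                most_viewed_by_domain[domain] = Counter()
            most_viewed_by_domain[domain][item_id] += 1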
            

Most viewed by country

In [27]:
most_viewed_items_br =[item_id for item_id, _ in
    Counter({item_id: count for item_id, count\
             in most_viewed_items.items() if ITEM_TO_COUNTRY[item_id] == "B" }).most_common(10)]

most_viewed_items_mx  =[item_id for item_id, _ in
    Counter({item_id: count for item_id, count\
             in most_viewed_items.items() if ITEM_TO_COUNTRY[item_id] == "M" }).most_common(10)]


# Keep the raw counts under views_counter before most_viewed_items is replaced by a top-10 list.
views_counter = most_viewed_items

most_viewed_items = [item for item, _ in most_viewed_items.most_common(10)]

for domain, counter in most_viewed_by_domain.items():
    most_viewed_by_domain[domain] = [item for item, _ in counter.most_common(10)]

Function to fill in recommendations:

In [19]:
def fill_rec(rec, fill, n=10):
    """Append items from fill (skipping ones already in rec) until rec has n items."""
    assert len(fill) >= n
    fill_index = 0
    while len(rec) < n:
        if fill[fill_index] not in rec:
            rec.append(fill[fill_index])
        fill_index += 1
    return rec
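
A small usage example, with made-up item ids, shows how the function behaves. The definition is repeated only so that the snippet runs on its own:

```python
def fill_rec(rec, fill, n=10):
    # Same logic as the cell above, repeated so the example is self-contained.
    assert len(fill) >= n
    fill_index = 0
    while len(rec) < n:
        if fill[fill_index] not in rec:
            rec.append(fill[fill_index])
        fill_index += 1
    return rec

# Item 3 is already recommended, so it is skipped while filling.
filled = fill_rec([5, 3], fill=[1, 3, 2, 4], n=4)
```

The original order of `rec` is preserved and the fillers are appended in the order of `fill`. The assertion is enough when `fill` has no repeated items. A `fill` list with duplicates could run out of new items before `rec` reaches `n`.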

# Model ensemble

### Evaluation on test

In [28]:
W0 = 1.0 # weight for the view-based predictions
WI = 1.5 # weight for the MF predictions

n_recs = 0
y_test = []
model_sum_dcg = 0.0
tp_domain = 0
with open("./data/train_dataset-test_split.jl", "rt") as fd:
    for line in fd:
        data = json.loads(line)
        item_bought = data["item_bought"]
        items_views = [event["event_info"] for event in data["user_history"] if event["event_type"] == "view"]
        y_test.append(item_bought)
        
        # average the predictions of the MF models
        model_rec_scores = {i:s for i, s in test_mf_scores[n_recs]}
        model_rec_scores_2 =  {i:s for i, s in test_mf_scores_2[n_recs]}
        model_rec_scores = {i: (model_rec_scores.get(i, 0) *0.5 +\
                                model_rec_scores_2.get(i, 0) * 0.5)
                            for i in (model_rec_scores.keys() | model_rec_scores_2.keys())}
        
        # Rank the viewed items: most recent first, each view weighted by its position
        items_pv_count = {}
        items_views = items_views[::-1]
        for pos, item_view in enumerate(items_views, 1):
            items_pv_count[item_view] = items_pv_count.get(item_view, 0) + 1 / np.log10(pos + 1)

        rec_scores = {}
        # view-based scores
        for item_view, pv_count in items_pv_count.items():
            # assign a score to each viewed item
            rec_scores[item_view] = rec_scores.get(item_view, 0) + pv_count

        # normalize the view-based scores so that they sum to 1
        sum_scores = sum([s for s in rec_scores.values()])
        if sum_scores:
            c = ITEM_TO_COUNTRY[items_views[0]]
            rec_scores = {i: s / sum_scores for i, s in rec_scores.items()}
        # drop low-scoring recommendations
        rec_scores = {i: s for i, s in rec_scores.items() if s > 0.05}

        # Weighted sum of both scores. Note that the conditional applies to the whole sum:
        # an item_id that was not used to train the MF models gets a flat score of 0.2.
        rec_scores = {i: (model_rec_scores.get(i, 0) * WI + rec_scores.get(i, 0) * W0) if i in ITEM_TO_IDX else 0.2
             for i in (rec_scores.keys() | model_rec_scores.keys()) if i not in BLACK_LIST}

        # sort by score
        rec = [item for item, score in heapq.nlargest(50, rec_scores.items(), key=lambda item: item[1])]

        # take the domains of the top 3 products that reach the score threshold
        domains = {ITEM_TO_DOMAIN[r] for r in rec[:3] if rec_scores[r] >= 1}
        rec_fill = [r for r in rec if rec_scores[r]]

        # if any domains were selected, keep only recommendations from those domains
        if domains:
            rec = [r for r in rec if ITEM_TO_DOMAIN[r] in domains]
            rec_fill = [r for r in rec_fill if r not in rec]

        rec = rec[:10]
        

        # fill in when there are fewer than 10 recommendations
        if len(rec) < 10:
            if len(rec):
                # fill with the most viewed items of the selected domains
                fill_scores = {r: views_counter[r] for domain_i in domains for r in most_viewed_by_domain.get(domain_i, [])}
                fill = [item for item, score in heapq.nlargest(10, fill_scores.items(), key=lambda item: item[1])]
                if len(fill) < 10:
                    # if that is not enough, add the items discarded by the domain selection
                    fill += rec_fill
            else:
                fill = most_viewed_items
            rec = fill_rec(rec, fill)
        assert len(rec) == 10

        # evaluation
        model_sum_dcg += dcg(rec, item_bought)
        
        rec_dom = ITEM_TO_DOMAIN[rec[0]]
        # exact domain match (`in` on two strings would be a substring test)
        tp_domain += ITEM_TO_DOMAIN[item_bought] == rec_dom
        
        n_recs += 1

In [29]:
print(f"NDCG: {model_sum_dcg / (IDCG * n_recs): .4f} ({n_recs} recommendations)")

NDCG:  0.3180 (20000 recommendations)


In [31]:
print(f"Accuracy (domain): {tp_domain / n_recs: .4f} ({n_recs} recommendations)")

Accuracy (domain):  0.4180 (20000 recommendations)


### Evaluation on validation

In [32]:
W0 = 1.0 
WI = 1.5

n_recs = 0
y_val = []
model_sum_dcg = 0.0
tp_domain = 0
with open("./data/train_dataset-val_split.jl", "rt") as fd:
    for line in fd:
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            # skip lines that are not valid JSON
            continue
        item_bought = data["item_bought"]
        items_views = [event["event_info"] for event in data["user_history"] if event["event_type"] == "view"]
        y_val.append(item_bought)
        
        # average the predictions of the MF models
        model_rec_scores = {i:s for i, s in val_mf_scores[n_recs]}
        model_rec_scores_2 =  {i:s for i, s in val_mf_scores_2[n_recs]}
        model_rec_scores = {i: (model_rec_scores.get(i, 0) *0.5 +\
                                model_rec_scores_2.get(i, 0) * 0.5)
                            for i in (model_rec_scores.keys() | model_rec_scores_2.keys())}
        
        # Rank the viewed items: most recent first, each view weighted by its position
        items_pv_count = {}
        items_views = items_views[::-1]
        for pos, item_view in enumerate(items_views, 1):
            items_pv_count[item_view] = items_pv_count.get(item_view, 0) + 1 / np.log10(pos + 1)

        rec_scores = {}
        # view-based scores
        for item_view, pv_count in items_pv_count.items():
            # assign a score to each viewed item
            rec_scores[item_view] = rec_scores.get(item_view, 0) + pv_count

        # normalize the view-based scores so that they sum to 1
        sum_scores = sum([s for s in rec_scores.values()])
        if sum_scores:
            c = ITEM_TO_COUNTRY[items_views[0]]
            rec_scores = {i: s / sum_scores for i, s in rec_scores.items()}

        # drop low-scoring recommendations
        rec_scores = {i: s for i, s in rec_scores.items() if s > 0.05}

        # Weighted sum of both scores. Note that the conditional applies to the whole sum:
        # an item_id that was not used to train the MF models gets a flat score of 0.2.
        rec_scores = {i: (model_rec_scores.get(i, 0) * WI + rec_scores.get(i, 0) * W0) if i in ITEM_TO_IDX else 0.2
             for i in (rec_scores.keys() | model_rec_scores.keys()) if i not in BLACK_LIST}

        # sort by score
        rec = [item for item, score in heapq.nlargest(50, rec_scores.items(), key=lambda item: item[1])]

        # take the domains of the top 3 products that reach the score threshold
        domains = {ITEM_TO_DOMAIN[r] for r in rec[:3] if rec_scores[r] >= 1}
        rec_fill = [r for r in rec if rec_scores[r]]

        # if any domains were selected, keep only recommendations from those domains
        if domains:
            rec = [r for r in rec if ITEM_TO_DOMAIN[r] in domains]
            rec_fill = [r for r in rec_fill if r not in rec]

        rec = rec[:10]
        

        # fill in when there are fewer than 10 recommendations
        if len(rec) < 10:
            if len(rec):
                # fill with the most viewed items of the selected domains
                fill_scores = {r: views_counter[r] for domain_i in domains for r in most_viewed_by_domain.get(domain_i, [])}
                fill = [item for item, score in heapq.nlargest(10, fill_scores.items(), key=lambda item: item[1])]
                if len(fill) < 10:
                    # if that is not enough, add the items discarded by the domain selection
                    fill += rec_fill
            else:
                fill = most_viewed_items
            rec = fill_rec(rec, fill)
        assert len(rec) == 10

        # evaluation
        model_sum_dcg += dcg(rec, item_bought)
        
        rec_dom = ITEM_TO_DOMAIN[rec[0]]
        # exact domain match (`in` on two strings would be a substring test)
        tp_domain += ITEM_TO_DOMAIN[item_bought] == rec_dom
        
        n_recs += 1

In [33]:
print(f"NDCG: {model_sum_dcg / (IDCG * n_recs): .4f} ({n_recs} recommendations)")

NDCG:  0.3186 (19998 recommendations)


In [34]:
print(f"Accuracy (domain): {tp_domain / n_recs: .4f} ({n_recs} recommendations)")

Accuracy (domain):  0.4176 (19998 recommendations)


And that is how the famous Efecto Bolo is achieved.