# Modelo de filtrado colaborativo items to items

### Summary
En este notebook se exploró la estrategia de filtrado colaborativo items to items. Esta estrategia simple, tiene la ventaja de requerir poca capacidad de procesamiento, consiguiendo una buena relación entre performance y resultados, aunque quizá lo logre captar patrones más complejos.

Se forman las siguientes asociaciones de productos a partir de las sesiones:
* Asociación entre productos por visitas en una misma sesión
* Asociación entre productos visitados y productos comprados.
* Asociación entre términos de búsquedas y productos visitados.
* Asociación entre términos de búsquedas y productos comprados.

También se scorean los productos visitados con una función que toma en cuenta las posiciones de los page views del ítem en la sesión. Tomando la posición 1 a la más reciente a la la compra, el score se define como:

\begin{equation*}
score(item_i) = \sum_{pos}  \frac{1}{log_{10}(pos + 1)}
\end{equation*}

Solamente con esta estrategia de score de productos visitados, y utilizando como relleno los productos más vistos del dominio de la primera recomendación, se logra un **ndcg 0.2639** en test.

Finalmente se pesan todos los scores de asociación consiguiendo un **ndcg 0.2889** en test y **0.2876** en submission.

In [1]:
import pandas as pd
import numpy as np
import json
from collections import Counter, defaultdict
import heapq

Lee del catatálogo los dominios de los productos (se usa en la evalaución)

In [2]:
catalog = {}
with open("./data/item_data.jl", "rt") as fd:
    for line in fd:
        data = json.loads(line)
        item_id = data.pop("item_id")
        data.pop("title")
        catalog[item_id] = data
        # obtiene el país
        catalog[item_id]["country"] = data["category_id"][2]

In [3]:
ITEM_TO_DOMAIN = {}
for item_id, item_info in catalog.items():
    ITEM_TO_DOMAIN[item_id] = item_info["domain_id"]

In [4]:
IDCG = np.sum([(1 if i != 1 else 12) / np.log2(1 + i) for i in range(1, 11)])

def dcg(rec, y_item_id, n=10):
    y_domain = ITEM_TO_DOMAIN[y_item_id]
    
    return np.sum([(1 if yhat_item_id != y_item_id else 12) / np.log2(1 + i)\
                   for i, yhat_item_id in enumerate(rec[:n], 1)\
                  if (ITEM_TO_DOMAIN[yhat_item_id] == y_domain)])
    

Los productos sin precio o dominio se excluyen de las recomendaciones

In [5]:
BLACK_LIST = set()
for item_id, item_info in catalog.items():
    if (item_info["domain_id"] == None) or (item_info["price"] == None):
        BLACK_LIST.add(item_id) 

In [6]:
len(BLACK_LIST)

854

## Recomendaciones Baseline
Se usa como relleno (cold-start)

In [7]:
most_viewed_items = Counter()
most_sessions_items = Counter()
most_bought_items = Counter()

most_viewed_by_domain = {}

line_idx = 0
with open("./data/train_dataset-train_split.jl", "rt") as fd:
    for line in fd:
        line_idx += 1
        data = json.loads(line)
        view = [event["event_info"] for event in data["user_history"] if event["event_type"] == "view"]
        most_viewed_items.update(view)
        most_sessions_items.update(set(view))
        most_bought_items[data["item_bought"]] += 1
        
        for item_id in set(view):
            domain = ITEM_TO_DOMAIN[item_id]
            if not domain in most_viewed_by_domain:
                most_viewed_by_domain[domain] = Counter()
            most_viewed_by_domain[domain][item_id] +=1

In [8]:
most_viewed_items_br =[item_id for item_id, _ in
    Counter({item_id: count for item_id, count\
             in most_viewed_items.items() if catalog[item_id]["country"] == "B" }).most_common(10)]

most_viewed_items_mx  =[item_id for item_id, _ in
    Counter({item_id: count for item_id, count\
             in most_viewed_items.items() if catalog[item_id]["country"] == "M" }).most_common(10)]

most_viewed_items = [item for item, _ in most_viewed_items.most_common(10)]

In [9]:
most_sessions_items_br =[item_id for item_id, _ in
    Counter({item_id: count for item_id, count\
             in most_sessions_items.items() if catalog[item_id]["country"] == "B" }).most_common(10)]

most_sessions_items_mx  =[item_id for item_id, _ in
    Counter({item_id: count for item_id, count\
             in most_sessions_items.items() if catalog[item_id]["country"] == "M" }).most_common(10)]

most_sessions_items = [item for item, _ in most_sessions_items.most_common(10)]

In [10]:
most_bought_items_br =[item_id for item_id, _ in
    Counter({item_id: count for item_id, count\
             in most_bought_items.items() if catalog[item_id]["country"] == "B" }).most_common(10)]

most_bought_items_mx  =[item_id for item_id, _ in
    Counter({item_id: count for item_id, count\
             in most_bought_items.items() if catalog[item_id]["country"] == "M" }).most_common(10)]

most_bought_items = [item for item, _ in most_bought_items.most_common(10)]

In [11]:
for domain, counter in most_viewed_by_domain.items():
    most_viewed_by_domain[domain] = [item for item, _ in counter.most_common(10)]

In [12]:
viewed_count = 0
n_recs = 0

pop_view_sum_dcg = 0
pop_session_sum_dcg = 0
pop_bought_sum_dcg = 0


country_pop_view_sum_dcg = 0
country_pop_session_sum_dcg = 0
country_pop_bought_sum_dcg = 0

with open("./data/train_dataset-test_split.jl", "rt") as fd:
    for line in fd:
        n_recs += 1 

        data = json.loads(line)
        item_bought = data["item_bought"]
        
        pop_view_sum_dcg += dcg(most_viewed_items, item_bought)
        pop_session_sum_dcg += dcg(most_sessions_items, item_bought)
        pop_bought_sum_dcg += dcg(most_bought_items, item_bought)
        
        if catalog[item_bought]["country"] == "B":
            country_pop_view_sum_dcg += dcg(most_viewed_items_br, item_bought)
            country_pop_session_sum_dcg += dcg(most_sessions_items_br, item_bought)
            country_pop_bought_sum_dcg += dcg(most_bought_items_br, item_bought)
        elif catalog[item_bought]["country"] == "M":
            country_pop_view_sum_dcg += dcg(most_viewed_items_mx, item_bought)
            country_pop_session_sum_dcg += dcg(most_sessions_items_mx, item_bought)
            country_pop_bought_sum_dcg += dcg(most_bought_items_mx, item_bought)
        else:
            raise("Error")


In [13]:
print(f"NDCG (top visitas) : {pop_view_sum_dcg / (IDCG * n_recs): .4f}")
print(f"NDCG (top sesiones): {pop_session_sum_dcg / (IDCG * n_recs): .4f}")
print(f"NDCG (top compras) : {pop_bought_sum_dcg / (IDCG * n_recs): .4f}")

NDCG (top visitas) :  0.0106
NDCG (top sesiones):  0.0106
NDCG (top compras) :  0.0083


In [14]:
print(f"NDCG (top visitas por país) : {country_pop_view_sum_dcg / (IDCG * n_recs): .4f}")
print(f"NDCG (top sesiones por país): {country_pop_session_sum_dcg / (IDCG * n_recs): .4f}")
print(f"NDCG (top compras por país) : {country_pop_bought_sum_dcg / (IDCG * n_recs): .4f}")

NDCG (top visitas por país) :  0.0149
NDCG (top sesiones por país):  0.0147
NDCG (top compras por país) :  0.0118


In [15]:
def fill_rec(rec, fill, n=10 ):
    assert len(fill) >= n
    fill_index = 0
    while len(rec) < n:
        if fill[fill_index] not in rec:
            rec.append(fill[fill_index] )
        fill_index += 1
    return rec

## Score sobre productos visitados

Del análisis descriptivo se obtuvo que el 30% de los productos comprados fueron vistos en la sesión.

Se analizan estrategias para rankear los productos visitados en la recomendación.

* La mejor configuración fue contar las visitas pesando con la inversa del logaritmo de la posición, así más reciente mejor.

In [52]:
n_recs = 0
sum_dcg = 0
model_sum_dcg = 0

with open("./data/train_dataset-test_split.jl", "rt") as fd:
    for line in fd:
        n_recs += 1
        # lee registro
        data = json.loads(line)
        item_bought = data["item_bought"]
        items_views = [event["event_info"] for event in data["user_history"] if event["event_type"] == "view"]
        
        # Ranking de items visitados
        items_pv_count = {}
        items_views = items_views[::-1]            
        for pos, item_view in enumerate(items_views, 1):
            items_pv_count[item_view] = items_pv_count.get(item_view,0) + 1 / np.log10(pos + 1)
        
        rec_scores = defaultdict(dict)
        # scores por visitas
        for item_view, pv_count in items_pv_count.items():
            # Asigna un score por item visitado
            rec_scores[item_view] = rec_scores.get(item_view, 0) + pv_count 
        
        
        # equivalent to counter.most_common(10)
        # https://stackoverflow.com/questions/29240807/python-collections-counter-most-common-complexity
        rec = [item for item, _ in heapq.nlargest(10, rec_scores.items(), key=lambda item: item[1])]

        # rellena en caso de no tener recomendaciones
        if len(rec) < 10:
            if len(rec):
                country = catalog[rec[0]]["country"]
                domain = ITEM_TO_DOMAIN[rec[0]]
                fill = most_viewed_by_domain.get(domain, []) +\
                        (most_viewed_items_br if country == "B" else most_viewed_items_mx)
            else: 
                fill = most_viewed_items
            rec = fill_rec(rec, fill)
            
        # evaluación
        model_sum_dcg += dcg(rec, item_bought)
        
        if (n_recs % 1000) == 0:
            print(f"NDCG: {model_sum_dcg / (IDCG * n_recs): .4f} ({n_recs} recomendaciones)")
 

NDCG:  0.2816 (1000 recomendaciones)
NDCG:  0.2791 (2000 recomendaciones)
NDCG:  0.2790 (3000 recomendaciones)
NDCG:  0.2762 (4000 recomendaciones)
NDCG:  0.2711 (5000 recomendaciones)
NDCG:  0.2692 (6000 recomendaciones)
NDCG:  0.2685 (7000 recomendaciones)
NDCG:  0.2667 (8000 recomendaciones)
NDCG:  0.2647 (9000 recomendaciones)
NDCG:  0.2640 (10000 recomendaciones)
NDCG:  0.2645 (11000 recomendaciones)
NDCG:  0.2632 (12000 recomendaciones)
NDCG:  0.2633 (13000 recomendaciones)
NDCG:  0.2643 (14000 recomendaciones)
NDCG:  0.2643 (15000 recomendaciones)
NDCG:  0.2640 (16000 recomendaciones)
NDCG:  0.2646 (17000 recomendaciones)
NDCG:  0.2641 (18000 recomendaciones)
NDCG:  0.2642 (19000 recomendaciones)
NDCG:  0.2639 (20000 recomendaciones)


In [53]:
print(f"NDCG (i2i) : {model_sum_dcg / (IDCG * n_recs): .4f}")

NDCG (i2i) :  0.2639


## Asociaciones entre productos visitados y comprados

Mejor configuración encontrada:

* Tomar sesiones en el count de matriz
* Normalizar count por el total de ventas del ítem vendido. Se aplica un smoothing la cantidad vendidos (suma 100 al denominador)
* En la predicción se tiene en cuenta la cantidad de visitas y el orden de visita (más cercano mayor peso)
* Se prioriza los visitados en la recomendación

In [12]:
LOAD = True

In [13]:
if LOAD == False:
    user_idx = 0
    view2bought = {}

    counter_sessions = Counter()
    counter_bought = Counter()

    with open("./data/train_dataset-train_split.jl", "rt") as fd:
        for line in fd:
            data = json.loads(line)
            
            item_bought = data["item_bought"]
            country = catalog[item_bought]["country"]
            items_views = set([event["event_info"] for event in data["user_history"]\
                   if event["event_type"] == "view" and\
                   # exclude association in different countries
                   country == catalog[event["event_info"]]["country"]])

            counter_bought[item_bought] += 1

            items_pv_count = Counter(items_views)
            for item_view, count in items_pv_count.items():

                counter_sessions[item_view] += 1
                if item_view in view2bought:
                    view2bought[item_view][item_bought] += 1
                else:
                    counter = Counter()
                    counter[item_bought] += 1
                    view2bought[item_view] = counter

In [14]:
if LOAD == False:
    rows = []
    for item_i, counter in view2bought.items():
        for item_j, ses_count in counter.most_common():
            rows.append((item_i, item_j, counter_sessions[item_i], counter_bought[item_j], ses_count))

    df_bought_count = pd.DataFrame(rows, columns=("item_i", "item_j", "count_i", "count_j", "count_ij"))

    df_bought_count['item_i'] = df_bought_count.item_i.astype(np.uint32)
    df_bought_count['item_j'] = df_bought_count.item_j.astype(np.uint32)

    df_bought_count['count_i'] = df_bought_count.count_i.astype(np.uint16)
    df_bought_count['count_j'] = df_bought_count.count_j.astype(np.uint16)
    df_bought_count['count_ij'] = df_bought_count.count_ij.astype(np.uint16)

    del rows

    df_bought_count.to_pickle("./data/models/df_items_assosiation_bought_count.pkl")
else:
    df_bought_count = pd.read_pickle("./data/models/df_items_assosiation_bought_count.pkl")

In [15]:
df_bought_count.head()

Unnamed: 0,item_i,item_j,count_i,count_j,count_ij
0,1381888,1330214,1,5,1
1,361733,361733,200,200,72
2,361733,1514607,200,20,6
3,361733,353783,200,57,5
4,361733,1680124,200,25,3


In [16]:
LOAD = True

In [17]:
if LOAD == False:
    import itertools
    
    min_freq = 2
    counter_sessions = Counter()
    with open("./data/train_dataset-train_split.jl", "rt") as fd:
        for line in fd:
            data = json.loads(line)
            item_bought = data["item_bought"]
            country = catalog[item_bought]["country"]
            items_views = set([event["event_info"] for event in data["user_history"]\
                   if event["event_type"] == "view" and\
                   # exclude association in different countries
                   country == catalog[event["event_info"]]["country"]])

            counter_sessions.update(items_views)
    item_set = set([k for k, v in counter_sessions.items() if  v >= min_freq])
    
    pair_items_counter = Counter()
    counter_sessions = Counter()
    with open("./data/train_dataset-train_split.jl", "rt") as fd:
        for line in fd:
            data = json.loads(line)
            items_views = set([event["event_info"] for event in data["user_history"] if event["event_type"] == "view" and event["event_info"] in item_set])

            if len(items_views) > 1: 
                pair_items_counter.update(itertools.permutations(items_views, 2))
                counter_sessions.update(items_views)

In [18]:
if LOAD == False:
    rows = []
    for (item_i, item_j), ses_count in pair_items_counter.items():
        rows.append((item_i, item_j, counter_sessions[item_i], counter_sessions[item_j], ses_count))

    df_association_views = pd.DataFrame(rows, columns=("item_i", "item_j", "count_i", "count_j", "count_ij"))

    df_association_views['item_i'] = df_association_views.item_i.astype(np.uint32)
    df_association_views['item_j'] = df_association_views.item_j.astype(np.uint32)

    df_association_views['count_i'] = df_association_views.count_i.astype(np.uint16)
    df_association_views['count_j'] = df_association_views.count_j.astype(np.uint16)
    df_association_views['count_ij'] = df_association_views.count_ij.astype(np.uint16)

    del rows

    df_association_views.to_pickle("./data/models/df_items_assosiation_views_count.pkl")
else:
    df_association_views = pd.read_pickle("./data/models/df_items_assosiation_views_count.pkl")

In [19]:
df_association_views.head()

Unnamed: 0,item_i,item_j,count_i,count_j,count_ij
0,361733,503045,185,7,1
1,361733,1831689,185,1,1
2,361733,1174410,185,2,1
3,361733,1059724,185,2,1
4,361733,431884,185,3,1


In [20]:
df_bought_count.head()

Unnamed: 0,item_i,item_j,count_i,count_j,count_ij
0,1381888,1330214,1,5,1
1,361733,361733,200,200,72
2,361733,1514607,200,20,6
3,361733,353783,200,57,5
4,361733,1680124,200,25,3


In [21]:
item2item = defaultdict(dict)
for row in df_association_views.itertuples():
    item2item[row.item_i][row.item_j] = row.count_ij / (10 + row.count_i)

In [22]:
from collections import defaultdict
view2bought = defaultdict(dict)
for row in df_bought_count.itertuples():
    view2bought[row.item_i][row.item_j] = row.count_ij  / (100 + row.count_j)

In [23]:
del df_association_views
del df_bought_count

In [40]:
n_recs = 0
sum_dcg = 0
model_sum_dcg = 0

W0 = 1
W1 = 1
W3 = 3

with open("./data/train_dataset-test_split.jl", "rt") as fd:
    for line in fd:
        n_recs += 1
        # lee registro
        data = json.loads(line)
        item_bought = data["item_bought"]
        items_views = [event["event_info"] for event in data["user_history"] if event["event_type"] == "view"]

        # Ranking de items visitados
        items_pv_count = {}
        items_views = items_views[::-1]            
        for pos, item_view in enumerate(items_views, 1):
            items_pv_count[item_view] = items_pv_count.get(item_view,0) + 1 / np.log10(pos + 1) 
        
            
        # recomendaciones por modelo
        rec_scores = {}
        for item_view, pv_count in items_pv_count.items():
            
            # Asigna un score por item visitado
            rec_scores[item_view] = rec_scores.get(item_view, 0) + np.log10(pv_count + 1) * W0 
            
            # Asigna scores por asociaciones de visitas
            if item_view in item2item:
                item_scores = item2item[item_view]
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * W1 \
                                   for key in item_scores.keys()})
                
            # Asigna scores por de compras
            if item_view in view2bought:
                item_scores = view2bought[item_view]
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * pv_count * W3 \
                                   for key in item_scores.keys()})
                
        # exclude items from black List
        rec_scores = {k: v for k, v in rec_scores.items() if k not in BLACK_LIST}
        
        # equivalent to counter.most_common(10)
        # https://stackoverflow.com/questions/29240807/python-collections-counter-most-common-complexity
        rec = [item for item, _ in heapq.nlargest(10, rec_scores.items(), key=lambda item: item[1])]

        # rellena en caso de no tener recomendaciones
        if len(rec) < 10:
            if len(rec):
                domain = ITEM_TO_DOMAIN[rec[0]]
                fill = most_viewed_by_domain.get(domain, []) + most_viewed_items
            else:
                fill = most_viewed_items

            rec = fill_rec(rec, fill)
            
        # evaluación
        model_sum_dcg += dcg(rec, item_bought)
        
        if (n_recs % 1000) == 0:
            print(f"NDCG: {model_sum_dcg / (IDCG * n_recs): .4f} ({n_recs} recomendaciones)")
        

NDCG:  0.3066 (1000 recomendaciones)
NDCG:  0.3016 (2000 recomendaciones)
NDCG:  0.3012 (3000 recomendaciones)
NDCG:  0.2990 (4000 recomendaciones)
NDCG:  0.2935 (5000 recomendaciones)
NDCG:  0.2917 (6000 recomendaciones)
NDCG:  0.2917 (7000 recomendaciones)
NDCG:  0.2893 (8000 recomendaciones)
NDCG:  0.2884 (9000 recomendaciones)
NDCG:  0.2874 (10000 recomendaciones)
NDCG:  0.2873 (11000 recomendaciones)
NDCG:  0.2855 (12000 recomendaciones)
NDCG:  0.2853 (13000 recomendaciones)
NDCG:  0.2867 (14000 recomendaciones)
NDCG:  0.2864 (15000 recomendaciones)
NDCG:  0.2862 (16000 recomendaciones)
NDCG:  0.2866 (17000 recomendaciones)
NDCG:  0.2857 (18000 recomendaciones)
NDCG:  0.2856 (19000 recomendaciones)
NDCG:  0.2851 (20000 recomendaciones)


In [90]:
print(f"NDCG (i2i) : {model_sum_dcg / (IDCG * n_recs): .4f}")

NDCG (i2i) :  0.2851


## Asociaciones con search

In [41]:
LOAD = True

In [42]:
# Contador de sesiones por busqueda y tokens
# También cuenta el df de token tomando como documentos los textos de búsquedas
if LOAD == False:
    counter_searchs = Counter()
    token_sessions = Counter()
    counter_bought = Counter()
    with open("./data/train_dataset-train_split.jl", "rt") as fd:
        for line in fd:
            data = json.loads(line)
            searchs = set([event["event_info"] for event in data["user_history"] if event["event_type"] == "search"])
            item_bought = data["item_bought"]
            counter_searchs.update(searchs)
            counter_bought[item_bought] +=1

    token_df = Counter()
    for text in counter_searchs.keys():
        token_df.update(set(text.split()))

    token_sessions = Counter()
    for text, ses_count in counter_searchs.items():
        for token in set(text.split()):
            token_sessions[token] += ses_count

In [43]:
# Contador de asociaciones de tokens de búsquedas con productos comprados
if LOAD == False:
    token2bought = {}
    with open("./data/train_dataset-train_split.jl", "rt") as fd:
        for line in fd:
            data = json.loads(line)
            searchs = set([event["event_info"] for event in data["user_history"] if event["event_type"] == "search"])
            item_bought = data["item_bought"]
            for text in searchs:
                for token in text.split():
                    if token in token2bought:
                        token2bought[token][item_bought] += 1
                    else:
                        counter = Counter()
                        counter[item_bought] += 1
                        token2bought[token] = counter

In [44]:
# Contador de asocaciones de busquedas con productos comprados
if LOAD == False:
    searchset = set([search for search, c in counter_searchs.items() if c > 1])
    searchbought = {}
    with open("./data/train_dataset-train_split.jl", "rt") as fd:
        for line in fd:
            data = json.loads(line)
            searchs = set([event["event_info"] for event in data["user_history"] if event["event_type"] == "search" and event["event_info"] in searchset])
            item_bought = data["item_bought"]
            for search in searchs:
                if search in searchbought:
                    searchbought[search][item_bought] += 1
                else:
                    counter = Counter()
                    counter[item_bought] += 1
                    searchbought[search] = counter

In [45]:
# Contador de asociación entre busqueda y visitas

# se toma visitas posteriores a la búsqueda y previas de siguiente búsqueda
if LOAD == False:
    search2views = {}
    with open("./data/train_dataset-train_split.jl", "rt") as fd:
        for line in fd:
            data = json.loads(line)
            
            events = data["user_history"] 
            last_search = None
            search_itemset = {}
            for event in events:
                if event["event_type"] == "search":
                    last_search = event["event_info"]
                elif last_search:
                    if last_search in search_itemset:
                        search_itemset[last_search].add(event["event_info"])
                    else:
                        search_itemset[last_search] = set([event["event_info"]])

            for search, itemset in search_itemset.items():
                if search in search2views:
                    search2views[search].update(itemset)
                else:
                    search2views[search] = Counter(itemset)
                    


In [46]:
if LOAD == False:
    # contador de productos vistos luego de búsquedas
    counter_views_insearch = Counter()
    for counter in search2views.values():
        counter_views_insearch.update(dict(counter.items()))
        
    # Contador de asociación entre tokens y visitas
    token2views = {}
    for search, item_counter in search2views.items():
        for token in search.split():
            if token in token2views:
                token2views[token].update(dict(item_counter.items()))
            else:
                token2views[token] = Counter(dict(item_counter.items()))

In [47]:
# Dataset "Texto Búsqueda" -> "Productos comprados"
if LOAD == False:
    rows = []
    for search, items_count  in searchbought.items():
        for item, ses_count, in items_count.items():
            rows.append((search, item, counter_searchs[search], counter_bought[item], ses_count))

    df_search_bought = pd.DataFrame(rows, columns=("search", "item", "count_s", "count_i", "count_si"))

    df_search_bought['item'] = df_search_bought.item.astype(np.uint32)

    df_search_bought['count_i'] = df_search_bought.count_i.astype(np.uint16)
    df_search_bought['count_s'] = df_search_bought.count_s.astype(np.uint16)
    df_search_bought['count_si'] = df_search_bought.count_si.astype(np.uint16)

    del rows

    df_search_bought.to_pickle("./data/models/df_assosiation_search_text_bought.pkl")
else:
    df_search_bought = pd.read_pickle("./data/models/df_assosiation_search_text_bought.pkl")

In [48]:
df_search_bought.head()

Unnamed: 0,search,item,count_s,count_i,count_si
0,MAQUINA CORTAR CABELO BARBA,1330214,195,5,1
1,MAQUINA CORTAR CABELO BARBA,1086760,195,1,1
2,MAQUINA CORTAR CABELO BARBA,1514059,195,12,1
3,MAQUINA CORTAR CABELO BARBA,1761121,195,37,5
4,MAQUINA CORTAR CABELO BARBA,636487,195,41,1


In [49]:
# Dataset "Tokens búsquedas" -> "Productos comprados"
if LOAD == False:
    rows = []
    for token, items_count  in token2bought.items():
        for item, ses_count, in items_count.items():
            rows.append((token, item, token_sessions[token], counter_bought[item], ses_count))

    df_search_token_bought = pd.DataFrame(rows, columns=("token", "item", "count_t", "count_i", "count_ti"))

    df_search_token_bought['item'] = df_search_token_bought.item.astype(np.uint32)

    df_search_token_bought['count_i'] = df_search_token_bought.count_i.astype(np.uint16)
    df_search_token_bought['count_t'] = df_search_token_bought.count_t.astype(np.uint16)
    df_search_token_bought['count_ti'] = df_search_token_bought.count_ti.astype(np.uint16)

    del rows

    df_search_token_bought.to_pickle("./data/models/df_assosiation_search_token_bought.pkl")
else:
    df_search_token_bought = pd.read_pickle("./data/models/df_assosiation_search_token_bought.pkl")

In [50]:
df_search_token_bought.head()

Unnamed: 0,token,item,count_t,count_i,count_ti
0,MOCHILA,1330214,5637,5,4
1,MOCHILA,1324879,5637,30,1
2,MOCHILA,528769,5637,25,1
3,MOCHILA,804644,5637,23,1
4,MOCHILA,1423562,5637,9,1


In [51]:
# Dataset "Texto Búsqueda" -> "Productos visitados"
if LOAD == False:
    rows = []
    for search, items_count  in search2views.items():
        for item, ses_count, in items_count.items():
            rows.append((search, item, counter_searchs[search], counter_views_insearch[item], ses_count))

    df_search_views = pd.DataFrame(rows, columns=("search", "item", "count_s", "count_i", "count_si"))

    df_search_views['item'] = df_search_views.item.astype(np.uint32)

    df_search_views['count_i'] = df_search_views.count_i.astype(np.uint16)
    df_search_views['count_s'] = df_search_views.count_s.astype(np.uint16)
    df_search_views['count_si'] = df_search_views.count_si.astype(np.uint16)

    del rows

    df_search_views.to_pickle("./data/models/df_assosiation_search_text_views.pkl")
else:
    df_search_views = pd.read_pickle("./mei_challenger/data/models/df_assosiation_search_text_views.pkl")

In [52]:
df_search_views.head()

Unnamed: 0,search,item,count_s,count_i,count_si
0,MAQUINA CORTAR CABELO BARBA,1381888,195,1,1
1,MAQUINA CORTAR CABELO BARBA,361733,195,214,54
2,MAQUINA CORTAR CABELO BARBA,1831689,195,2,1
3,MAQUINA CORTAR CABELO BARBA,1174410,195,2,1
4,MAQUINA CORTAR CABELO BARBA,1059724,195,2,1


In [53]:
# Dataset "Tokens búsquedas" -> "Productos Visitados"
if LOAD == False:
    rows = []
    for token, items_count  in token2views.items():
        for item, ses_count, in items_count.items():
            rows.append((token, item, token_sessions[token], counter_views_insearch[item], ses_count))

    df_search_token_views = pd.DataFrame(rows, columns=("token", "item", "count_t", "count_i", "count_ti"))

    df_search_token_views['item'] = df_search_token_views.item.astype(np.uint32)

    df_search_token_views['count_i'] = df_search_token_views.count_i.astype(np.uint16)
    df_search_token_views['count_t'] = df_search_token_views.count_t.astype(np.uint16)
    df_search_token_views['count_ti'] = df_search_token_views.count_ti.astype(np.uint16)

    del rows

    df_search_token_views.to_pickle("./data/models/df_assosiation_search_token_views.pkl")
else:
    df_search_token_views = pd.read_pickle("./data/models/df_assosiation_search_token_views.pkl")

In [54]:
df_search_token_views.head()

Unnamed: 0,token,item,count_t,count_i,count_ti
0,MAQUINA,1381888,11668,1,1
1,MAQUINA,361733,11668,214,142
2,MAQUINA,1831689,11668,2,1
3,MAQUINA,1174410,11668,2,1
4,MAQUINA,1059724,11668,2,2


In [55]:
from collections import defaultdict
token2bought = defaultdict(dict)
for row in df_search_token_bought.itertuples():
    token2bought[row.token][row.item] = (row.count_ti )

In [56]:
from collections import defaultdict
token2view = defaultdict(dict)
for row in df_search_token_views.itertuples():
    token2view[row.token][row.item] = (row.count_ti )

In [57]:
from collections import defaultdict
text2bought = defaultdict(dict)
for row in df_search_bought.itertuples():
    text2bought[row.search][row.item] = (row.count_si ) 

In [58]:
from collections import defaultdict
text2view = defaultdict(dict)
for row in df_search_views.itertuples():
    text2view[row.search][row.item] = (row.count_si ) 

In [59]:
del df_search_views
del df_search_token_bought
del df_search_token_views
del df_search_bought

In [60]:
# se queda con el top para optimizar consultas
n = 128
for text in text2view.keys():
    text2view[text] = {item: score for item, score in heapq.nlargest(n, text2view[text].items(), key=lambda item: item[1])}
    
for token in token2view.keys():
    token2view[token] = {item: score for item, score in heapq.nlargest(n, token2view[token].items(), key=lambda item: item[1])}
    
n = 256
for text in text2bought.keys():
    text2bought[text] = {item: score for item, score in heapq.nlargest(n, text2bought[text].items(), key=lambda item: item[1])}
    
for token in token2bought.keys():
    token2bought[token] = {item: score for item, score in heapq.nlargest(n, token2bought[token].items(), key=lambda item: item[1])}

In [61]:
counter_searchs = Counter()
with open("./data/train_dataset-train_split.jl", "rt") as fd:
    for line in fd:
        data = json.loads(line)
        searchs = set([event["event_info"] for event in data["user_history"] if event["event_type"] == "search"])
        counter_searchs.update(searchs)

token_df = Counter()
for text in counter_searchs.keys():
    token_df.update(set(text.split()))
del counter_searchs

In [69]:
n_recs = 0
sum_dcg = 0
model_sum_dcg = 0

Wb = 1
Wv = 1
with open("./data/train_dataset-test_split.jl", "rt") as fd:
    for line in fd:
        n_recs += 1
        # lee registro
        data = json.loads(line)
        item_bought = data["item_bought"]
        searchs = [event["event_info"] for event in data["user_history"] if event["event_type"] == "search"]
        

        # Ranking de palabras de búsqueda
        tf_counter = {}
        text_counter = {}
        searchs = searchs[::-1]            
        for pos, text in enumerate(searchs, 1):
            text_counter[text] = tf_counter.get(text, 0) + 1 / np.log10(pos + 1)
            for token in text.split():
                tf_counter[token] = tf_counter.get(token, 0) + 1 / np.log10(pos + 1)
            
        # recomendaciones por modelo
        rec_scores = {}
        
        for text, tf in text_counter.items():
            if text in text2bought:
                item_scores = text2bought[text] 
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * tf * Wb\
                                  for key in item_scores.keys()})
        if text in  text2view:
            for text, tf in text_counter.items():
                item_scores = text2view[text] 
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * tf * Wv\
                                  for key in item_scores.keys()})
            
        for token, tf in tf_counter.items():
            if len(token) < 2:
                continue
            if token in token2bought:
                item_scores = token2bought[token] 
                tfidf =  tf  / token_df[token]
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * tfidf * Wb\
                                  for key in item_scores.keys()})
            if token in token2view:
                item_scores = token2view[token] 
                tfidf =  tf  / token_df[token]
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * tfidf * Wv\
                                  for key in item_scores.keys()})
        if rec_scores:
            country_rec = catalog[max(rec_scores, key=rec_scores.get)]["country"]
            rec_scores = {k: v for k,v in rec_scores.items() if catalog[k]["country"] == country_rec}
         
        
        # equivalent to counter.most_common(10)
        # https://stackoverflow.com/questions/29240807/python-collections-counter-most-common-complexity
        rec = [item for item, _ in heapq.nlargest(10, rec_scores.items(), key=lambda item: item[1])]

        # rellena en caso de no tener recomendaciones
        if len(rec) < 10:
            if len(rec):
                domain = ITEM_TO_DOMAIN[rec[0]]
                fill = most_viewed_by_domain.get(domain, []) + most_viewed_items
            else:
                fill = most_viewed_items

            rec = fill_rec(rec, fill)
            
        # evaluación
        model_sum_dcg += dcg(rec, item_bought)
        
        if (n_recs % 1000) == 0:
            print(f"NDCG: {model_sum_dcg / (IDCG * n_recs): .4f} ({n_recs} recomendaciones)")

NDCG:  0.1314 (1000 recomendaciones)
NDCG:  0.1303 (2000 recomendaciones)
NDCG:  0.1280 (3000 recomendaciones)
NDCG:  0.1269 (4000 recomendaciones)
NDCG:  0.1251 (5000 recomendaciones)
NDCG:  0.1229 (6000 recomendaciones)
NDCG:  0.1234 (7000 recomendaciones)
NDCG:  0.1216 (8000 recomendaciones)
NDCG:  0.1216 (9000 recomendaciones)
NDCG:  0.1216 (10000 recomendaciones)
NDCG:  0.1198 (11000 recomendaciones)
NDCG:  0.1201 (12000 recomendaciones)
NDCG:  0.1194 (13000 recomendaciones)
NDCG:  0.1198 (14000 recomendaciones)
NDCG:  0.1201 (15000 recomendaciones)
NDCG:  0.1200 (16000 recomendaciones)
NDCG:  0.1204 (17000 recomendaciones)
NDCG:  0.1199 (18000 recomendaciones)
NDCG:  0.1200 (19000 recomendaciones)
NDCG:  0.1197 (20000 recomendaciones)


In [92]:
print(f"NDCG (i2i) : {model_sum_dcg / (IDCG * n_recs): .4f}")

NDCG (i2i) :  0.1198


# Modelo Final: Se toman todas las asociaciones

Se ensamblan todas las asociaciones usando pesos en los scores.

In [73]:
n_recs = 0
sum_dcg = 0
model_sum_dcg = 0

W0 = 100 # ítems visitados en sesión
W1 = 100 # asociación con visitas
W3 = 300 # asociación con compras
W4 = 0.001 # asociación con búsquedas

fills_count = 0
with open("./data/train_dataset-test_split.jl", "rt") as fd:
    for line in fd:
        n_recs += 1
        # lee registro
        data = json.loads(line)
        item_bought = data["item_bought"]
        items_views = [event["event_info"] for event in data["user_history"] if event["event_type"] == "view"]
        searchs = [event["event_info"] for event in data["user_history"] if event["event_type"] == "search"]

        # Ranking de items visitados
        items_pv_count = {}
        items_views = items_views[::-1]            
        for pos, item_view in enumerate(items_views, 1):
            items_pv_count[item_view] = items_pv_count.get(item_view,0) + 1 / np.log10(pos + 1) 
        
            
        # recomendaciones por modelo
        rec_scores = {}
        for item_view, pv_count in items_pv_count.items():
            
            # Asigna un score por item visitado
            rec_scores[item_view] = rec_scores.get(item_view, 0) + np.log10(pv_count + 1) * W0 
            
            # Asigna scores por asociaciones de visitas
            if item_view in item2item:
                item_scores = item2item[item_view]
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * W1 \
                                   for key in item_scores.keys()})
                
            # Asigna scores por de compras
            if item_view in view2bought:
                item_scores = view2bought[item_view]
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * pv_count * W3 \
                                   for key in item_scores.keys()})
                
        
        # Ranking de palabras de búsqueda
        tf_counter = {}
        text_counter = {}
        searchs = searchs[::-1]            
        for pos, text in enumerate(searchs, 1):
            text_counter[text] = tf_counter.get(text, 0) + 1 / np.log10(pos + 1)
            for token in text.split():
                tf_counter[token] = tf_counter.get(token, 0) + 1 / np.log10(pos + 1)

        for text, tf in text_counter.items():
            if text in text2bought:
                item_scores = text2bought[text] 
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * tf * W4\
                                  for key in item_scores.keys()})
        if text in  text2view:
            for text, tf in text_counter.items():
                item_scores = text2view[text] 
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * tf * W4\
                                  for key in item_scores.keys()})

        for token, tf in tf_counter.items():
            if len(token) < 2:
                continue
            if token in token2bought:
                item_scores = token2bought[token] 
                tfidf =  tf  / token_df[token]
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * tfidf * W4\
                                  for key in item_scores.keys()})
            if token in token2view:
                item_scores = token2view[token] 
                tfidf =  tf  / token_df[token]
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * tfidf * W4\
                                  for key in item_scores.keys()})
        if rec_scores:
            country_rec = catalog[max(rec_scores, key=rec_scores.get)]["country"]
            rec_scores = {k: v for k,v in rec_scores.items() if catalog[k]["country"] == country_rec}
            # exclude items from black List
            rec_scores = {k: v for k, v in rec_scores.items() if k not in BLACK_LIST}
        
        # equivalent to counter.most_common(10)
        # https://stackoverflow.com/questions/29240807/python-collections-counter-most-common-complexity
        rec = [item for item, _ in heapq.nlargest(10, rec_scores.items(), key=lambda item: item[1])]

        
         # rellena en caso de no tener recomendaciones suficientes
        if len(rec) < 10:
            if len(rec):
                domain = ITEM_TO_DOMAIN[rec[0]]
                fill = most_viewed_by_domain.get(domain, []) +\
                        (most_viewed_items_br if country_rec == "B" else most_viewed_items_mx)
            else: 
                fill = most_viewed_items
            rec = fill_rec(rec, fill)
            
        # evaluación
        model_sum_dcg += dcg(rec, item_bought)
        
        if (n_recs % 1000) == 0:
            print(f"NDCG: {model_sum_dcg / (IDCG * n_recs): .4f} ({n_recs} recomendaciones)")

NDCG:  0.3094 (1000 recomendaciones)
NDCG:  0.3065 (2000 recomendaciones)
NDCG:  0.3057 (3000 recomendaciones)
NDCG:  0.3041 (4000 recomendaciones)
NDCG:  0.2982 (5000 recomendaciones)
NDCG:  0.2962 (6000 recomendaciones)
NDCG:  0.2962 (7000 recomendaciones)
NDCG:  0.2938 (8000 recomendaciones)
NDCG:  0.2926 (9000 recomendaciones)
NDCG:  0.2917 (10000 recomendaciones)
NDCG:  0.2914 (11000 recomendaciones)
NDCG:  0.2894 (12000 recomendaciones)
NDCG:  0.2892 (13000 recomendaciones)
NDCG:  0.2906 (14000 recomendaciones)
NDCG:  0.2904 (15000 recomendaciones)
NDCG:  0.2903 (16000 recomendaciones)
NDCG:  0.2905 (17000 recomendaciones)
NDCG:  0.2896 (18000 recomendaciones)
NDCG:  0.2895 (19000 recomendaciones)
NDCG:  0.2889 (20000 recomendaciones)


In [74]:
print(f"NDCG (i2i) : {model_sum_dcg / (IDCG * n_recs): .4f}")

NDCG (i2i) :  0.2889


Evalualción de validación:

In [101]:
n_recs = 0
sum_dcg = 0
model_sum_dcg = 0

W0 = 100
W1 = 100
W3 = 300
W4 = 0.001

with open("./data/train_dataset-val_split.jl", "rt") as fd:
    for line in fd:
        n_recs += 1
        # lee registro
        try:
            data = json.loads(line)
        except:
            continue
        item_bought = data["item_bought"]
        items_views = [event["event_info"] for event in data["user_history"] if event["event_type"] == "view"]
        searchs = [event["event_info"] for event in data["user_history"] if event["event_type"] == "search"]

        # Ranking de items visitados
        items_pv_count = {}
        items_views = items_views[::-1]            
        for pos, item_view in enumerate(items_views, 1):
            items_pv_count[item_view] = items_pv_count.get(item_view,0) + 1 / np.log10(pos + 1) 
        
            
        # recomendaciones por modelo
        rec_scores = {}
        for item_view, pv_count in items_pv_count.items():
            
            # Asigna un score por item visitado
            rec_scores[item_view] = rec_scores.get(item_view, 0) + np.log10(pv_count + 1) * W0 
            
            # Asigna scores por asociaciones de visitas
            if item_view in item2item:
                item_scores = item2item[item_view]
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * W1 \
                                   for key in item_scores.keys()})
                
            # Asigna scores por de compras
            if item_view in view2bought:
                item_scores = view2bought[item_view]
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * pv_count * W3 \
                                   for key in item_scores.keys()})
                
        
        # Ranking de palabras de búsqueda
        tf_counter = {}
        text_counter = {}
        searchs = searchs[::-1]            
        for pos, text in enumerate(searchs, 1):
            text_counter[text] = tf_counter.get(text, 0) + 1 / np.log10(pos + 1)
            for token in text.split():
                tf_counter[token] = tf_counter.get(token, 0) + 1 / np.log10(pos + 1)
        
        for text, tf in text_counter.items():
            if text in text2bought:
                item_scores = text2bought[text] 
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * tf * W4\
                                  for key in item_scores.keys()})
        if text in  text2view:
            for text, tf in text_counter.items():
                item_scores = text2view[text] 
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * tf * W4\
                                  for key in item_scores.keys()})
            
        for token, tf in tf_counter.items():
            if len(token) < 2:
                continue
            if token in token2bought:
                item_scores = token2bought[token] 
                tfidf =  tf  / token_df[token]
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * tfidf * W4\
                                  for key in item_scores.keys()})
            if token in token2view:
                item_scores = token2view[token] 
                tfidf =  tf  / token_df[token]
                rec_scores.update({key: rec_scores.get(key, 0) + item_scores.get(key, 0) * tfidf * W4\
                                  for key in item_scores.keys()})
                
        # equivalent to counter.most_common(10)
        # https://stackoverflow.com/questions/29240807/python-collections-counter-most-common-complexity
        rec = [item for item, _ in heapq.nlargest(10, rec_scores.items(), key=lambda item: item[1])]

        # rellena en caso de no tener recomendaciones suficientes
        if len(rec) < 10:
            if len(rec):
                domain = ITEM_TO_DOMAIN[rec[0]]
                fill = most_viewed_by_domain.get(domain, []) + most_viewed_items
            else:
                fill = most_viewed_items

            rec = fill_rec(rec, fill)
            
        # evaluación
        model_sum_dcg += dcg(rec, item_bought)
        
        if (n_recs % 1000) == 0:
            print(f"NDCG: {model_sum_dcg / (IDCG * n_recs): .4f} ({n_recs} recomendaciones)")

NDCG:  0.2815 (1000 recomendaciones)
NDCG:  0.2805 (2000 recomendaciones)
NDCG:  0.2816 (3000 recomendaciones)
NDCG:  0.2828 (4000 recomendaciones)
NDCG:  0.2856 (5000 recomendaciones)
NDCG:  0.2873 (6000 recomendaciones)
NDCG:  0.2890 (7000 recomendaciones)
NDCG:  0.2866 (8000 recomendaciones)
NDCG:  0.2867 (9000 recomendaciones)
NDCG:  0.2872 (10000 recomendaciones)
NDCG:  0.2882 (11000 recomendaciones)
NDCG:  0.2892 (12000 recomendaciones)
NDCG:  0.2889 (13000 recomendaciones)
NDCG:  0.2880 (14000 recomendaciones)
NDCG:  0.2882 (15000 recomendaciones)
NDCG:  0.2881 (16000 recomendaciones)
NDCG:  0.2885 (17000 recomendaciones)
NDCG:  0.2877 (18000 recomendaciones)
NDCG:  0.2885 (19000 recomendaciones)


In [102]:
print(f"NDCG (i2i) : {model_sum_dcg / (IDCG * n_recs): .4f}")

NDCG (i2i) :  0.2889
