# Recommender System Prototype

This notebook contains the prototype of the first version of the recommender system.

The idea is to list all merchants where a given user has made a purchase and to build a merchant similarity matrix from those lists.

That way, given the identifier of a particular user, we obtain a pool of merchants (and ideally their products) from which to choose recommendations.



We propose a series of development stages for the analytical part of the project. We foresee the following tasks:

1. Build a matrix of customers (rows) against stores or store clusters (columns) that shows which customers bought at which merchants and how many times they did so in a comparable time window.

2. Calculate a measure of similarity and/or concordance between each pair of columns, which quantifies how alike two merchants (or two merchant clusters) are with respect to the customers who buy from them.

3. For each merchant (or merchant cluster), rank the other merchants from highest to lowest according to the similarity and/or concordance calculated in the previous step.

4. When a customer buys at a certain merchant (or merchant cluster), identify which other merchants are most similar to it and base the purchase recommendation on that.

5. If each merchant (or merchant cluster) has information about the products each customer bought, quantify the similarity and/or concordance between products as well, so that the recommendation covers specific products and not only merchants.

Variables to take into account: the time, the type of payment (TDC, TD), and the amount of the transaction.

All of the above can be done for different geographical areas and/or different times of the year in order to make more effective recommendations. For example, run steps 1 to 5 for each country and for each quarter of the year.
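Steps 1 to 4 can be sketched on a toy example. The transactions, the merchant labels `A`, `B`, `C`, and the helper `rank_similar` are hypothetical and only illustrate the idea; cosine similarity is one possible choice of similarity measure, assumed here for concreteness.

```python
import numpy as np
import pandas as pd

# Hypothetical toy transactions, one row per purchase.
tx = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3", "u3", "u3"],
    "item": ["A", "B", "A", "B", "B", "C", "C"],
})

# Step 1: customers (rows) x merchants (columns); cells hold purchase counts.
counts = pd.crosstab(tx["user"], tx["item"])

# Step 2: cosine similarity between every pair of merchant columns.
unit = counts / np.sqrt((counts ** 2).sum(axis=0))  # scale each column to unit length
sim = unit.T @ unit                                  # merchant x merchant matrix

# Step 3: for one merchant, rank the other merchants from most to least similar.
def rank_similar(merchant, top_n=5):
    return sim[merchant].drop(merchant).sort_values(ascending=False).head(top_n)

# Step 4: a customer who bought at "A" is pointed to the top-ranked merchants.
ranking_for_a = rank_similar("A")
print(ranking_for_a)
```

In this toy data, `A` and `B` share two customers while `A` and `C` share none, so `B` ranks first for `A`.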


# Item-based Collaborative Filtering





### Initial recommendation system algorithm

With the gathered data we started testing a basic recommender system algorithm in which we compute the similarity between merchants based on the payers they have in common. In other words, we want to estimate how likely a payer is to make a transaction at a certain type of merchant based on their previous transactions.
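For reference, the similarity computed by the code cells in this section can be written as the cosine similarity between the purchase-count vectors of two merchants. The symbols are notation introduced here for illustration, not names from the notebook: $n_{u,i}$ is the number of purchases by payer $u$ at merchant $i$, and $U$ is the set of payers.

```latex
\mathrm{sim}(i, j) =
  \frac{\sum_{u \in U} n_{u,i}\, n_{u,j}}
       {\sqrt{\sum_{u \in U} n_{u,i}^{2}}\;\sqrt{\sum_{u \in U} n_{u,j}^{2}}}
```

Only payers who bought at both merchants contribute to the numerator, so two merchants with no payers in common have similarity 0. The implementation additionally assigns 0 to pairs with fewer than two payers in common.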


Because this algorithm is iterative, we tested it on a sample of 1 million transactions. Computing it takes around 15 minutes. With the full dataset, the estimated run time can reach 1.5 hours, because the number of merchants grows from ~270 to ~570, and with it the number of merchant pairs to compare.
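The run time is driven by the number of merchant pairs, since the nested loop visits every unordered pair once. A small sketch of that arithmetic, using the approximate merchant counts quoted above (the function name `pair_count` is hypothetical):

```python
def pair_count(n_merchants: int) -> int:
    """Number of unordered merchant pairs a pairwise comparison has to visit."""
    return n_merchants * (n_merchants - 1) // 2

small = pair_count(270)   # merchants in the 1-million-row sample (approximate)
full = pair_count(570)    # merchants in the full dataset (approximate)
growth = full / small     # how many times more pairs the full dataset needs

print(small, full, round(growth, 2))
```

Roughly doubling the number of merchants therefore multiplies the number of comparisons by more than four, before accounting for the larger joins on each pair.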


For this feature, we expect to move forward by testing libraries such as Surprise or the TensorFlow Recommenders package.


In [None]:
import pandas as pd
import numpy as np
import time

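user_id = "transaction_payer_id"
merchant_id = "merchant_id"
# Load a 1-million-row sample and keep only the payer and merchant columns.
bd = pd.read_csv('bdfn.csv',header=0,nrows=1000000)
bd = bd[[user_id,merchant_id]]
# Rename to the generic user/item names used in the rest of the notebook.
# Assigning the result avoids modifying a slice of the original frame in place.
bd = bd.rename(columns = {user_id:'user',merchant_id:'item'})
bd = bd.dropna()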

In [None]:
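# Count purchases per (merchant, user) pair.
bd2 = bd.reset_index().groupby(by=["item","user"],as_index=False).count().rename(columns={"index":"counts"})
# Replace the raw user identifiers with compact integer ids.
users2 = pd.DataFrame(bd2["user"].unique()).reset_index()
bd2 = pd.merge(bd2,users2,how="inner",left_on="user",right_on=0).drop(columns=["user",0]).rename(columns={"index": "user"})
# Users who bought at more than one merchant; only they contribute to pair similarities.
users0 = bd2.groupby(by=["user"],as_index=False).count()
users0 = users0[users0["item"] > 1]
items = bd2["item"].unique()
# Cast to float first so the normalized values can be stored in the same column.
bd2["counts"] = bd2["counts"].astype(float)
# Scale each merchant's counts to unit L2 norm, so that the dot product of two
# merchants' counts is their cosine similarity.
for item in items:
    a = bd2[["counts"]][bd2["item"]==item]
    bd2.loc[bd2["item"]==item,["counts"]] = a/np.sqrt(np.square(a).sum())
bd20 = pd.merge(bd2,users0,how="inner",left_on="user",right_on="user",suffixes=("","_y")).drop(columns=["item_y","counts_y"])
bd20 = bd20.sort_values(by=["item","user"])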
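
The per-merchant normalization in the cell above could also be written without an explicit loop. This is a minimal sketch on hypothetical toy counts, assuming the same `item`, `user`, `counts` layout; the `normalized` column and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical purchase counts in an (item, user, counts) layout.
counts = pd.DataFrame({
    "item":   ["A", "A", "B", "B", "B"],
    "user":   [0, 1, 0, 1, 2],
    "counts": [3.0, 4.0, 1.0, 2.0, 2.0],
})

# Sum of squared counts per merchant, broadcast back onto every row of that merchant.
squared_sum = (counts["counts"] ** 2).groupby(counts["item"]).transform("sum")
counts["normalized"] = counts["counts"] / np.sqrt(squared_sum)

# After scaling, every merchant's vector of counts has unit length.
check = (counts["normalized"] ** 2).groupby(counts["item"]).sum()
print(check)
```

A single grouped operation like this avoids re-filtering the whole table once per merchant.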

In [None]:
start_time = time.time()
r = len(items)
print(r,"different merchant ids\n")
def coseno_simil(df):
    # Counts are already L2-normalized per merchant, so the dot product over the
    # shared users is the cosine similarity. Pairs with fewer than two shared
    # users are scored 0.
    if df.shape[0] > 1:
        return (df["counts_x"]*df["counts_y"]).sum()
    return 0
# Collect one row per merchant pair; DataFrame.append was removed in pandas 2.0.
rows = []
for i in np.arange(len(items)-1):
    # Report progress as the share of merchant pairs already visited.
    if i>0 and i%5 == 0:
        proce = 100*(r*(r-1) - (r-i)*(r-1-i))/(r*(r-1))
        print("{p:2.2f}% processed...\n".format(p=proce))
    a = bd20.loc[bd20["item"]==items[i],["user","counts"]].set_index("user")
    for j in np.arange(i+1,len(items)):
        b = bd20.loc[bd20["item"]==items[j],["user","counts"]].set_index("user")
        # Inner join keeps only the users who bought at both merchants.
        df = a.join(b, lsuffix='_x', rsuffix='_y',how="inner")
        rows.append([items[i],items[j],coseno_simil(df)])
similarities2 = pd.DataFrame(rows,columns=["item1","item2","similarity"])
print("--- %s seconds ---" % (time.time() - start_time))

252 different merchant ids

3.94% processed...

7.79% processed...

11.57% processed...

15.27% processed...

18.89% processed...

22.43% processed...

25.90% processed...

29.28% processed...

32.58% processed...

35.81% processed...

38.96% processed...

42.02% processed...

45.01% processed...

47.92% processed...

50.75% processed...

53.50% processed...

56.17% processed...

58.76% processed...

61.28% processed...

63.71% processed...

66.07% processed...

68.35% processed...

70.54% processed...

72.66% processed...

74.70% processed...

76.66% processed...

78.54% processed...

80.35% processed...

82.07% processed...

83.71% processed...

85.28% processed...

86.76% processed...

88.17% processed...

89.50% processed...

90.75% processed...

91.92% processed...

93.01% processed...

94.02% processed...

94.95% processed...

95.81% processed...

96.58% processed...

97.28% processed...

97.89% processed...

98.43% processed...

98.89% processed...

99.27% processed...

99.57% processed...

In [None]:
similarities2.sort_values(by="similarity",ascending=False).head(10)

              item1         item2  similarity
29422  CO0000000935  CO0000001364    0.246778
1740   CO0000000598  CO0000000024    0.150012
3677   CO0000000951  CO0000000966    0.130571
510    CO0000000017  CO0000000024    0.128157
2156   CO0000000932  CO0000000935    0.108099
327    CO0000000923  CO0000001159    0.081944
505    CO0000000017  CO0000000598    0.078998
3678   CO0000000951  CO0000001487    0.077829
16263  CO0000000958  CO0000001575    0.060325
273    CO0000000923  CO0000000391    0.057593


The following is an example of the recommendation system using a randomly selected buyer from the data set.

In [None]:
# Pick one buyer at random: draw a uniform number per user and keep the smallest.
usuarios = pd.DataFrame(bd["user"].unique()).rename(columns={0:"user"})
usuarios["random"] = np.random.uniform(0,1,len(usuarios))
minimo = min(usuarios["random"])
usuario = usuarios["user"][usuarios["random"]==minimo]
print("The selected buyer has the following identifier:\n")
usuario

The selected buyer has the following identifier:



85704    F*,E[CZ!];\/IPUWU#ZS;_Q&5ED*F2LTBEXN+YOJ[4J
Name: user, dtype: object

In [None]:
# Count this buyer's purchases per merchant.
out = bd.merge(usuario,how="inner",left_on="user",right_on="user").groupby(by="item",as_index=False).count()
out = out.rename(columns={"item":"merchant_id","user":"No.Compras"})
print("This customer has bought at\n")
print(out)
# Attach the stored similarities for every merchant the buyer has used.
out = pd.merge(out,similarities2,how="inner",left_on="merchant_id",right_on="item1")
out = out[out["similarity"] > 0].copy()
# Weight each similarity by the number of purchases at the source merchant.
out["score"] = out["similarity"]*out["No.Compras"]
out = out[["item2","score"]].groupby(by="item2",as_index=False).sum()
# sort_values returns a new frame, so the result has to be assigned.
out = out.sort_values(by="score",ascending=False)
print("\nThe recommendation for this customer is the following:\n")
out["item2"].head(5)

This customer has bought at

    merchant_id  No.Compras
0  CO0000000349           3
1  CO0000000598           2

The recommendation for this customer is the following:



0    CO0000000024
1    CO0000000028
2    CO0000000046
3    CO0000000096
4    CO0000000124
Name: item2, dtype: object
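
The scoring rule used in the cell above can be summarized as follows. The notation is introduced here for illustration: $M_u$ is the set of merchants where buyer $u$ has purchased, $n_{u,i}$ is the number of purchases at merchant $i$ (the `No.Compras` column), and $\mathrm{sim}(i, j)$ is the stored similarity between merchants $i$ and $j$.

```latex
\mathrm{score}(u, j) = \sum_{i \in M_u} n_{u,i}\,\mathrm{sim}(i, j)
```

Candidate merchants $j$ are meant to be ranked by descending score, with the top five shown as the recommendation. Dividing every score by a constant that depends only on the buyer, such as their total number of purchases, leaves this ranking unchanged.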

In [None]:
user_id = "transaction_payer_id"
merchant_id = "merchant_id"
# Same preparation as before, applied to a second transactions file.
bdze = pd.read_csv('ptp28_col.csv',header=0,nrows=1000000)
bdze = bdze[[user_id,merchant_id]]
bdze = bdze.rename(columns = {user_id:'user',merchant_id:'item'})
bdze = bdze.dropna()

In [None]:
bd2 = bdze.reset_index().groupby(by=["item","user"],as_index=False).count().rename(columns={"index":"counts"})
users2 = pd.DataFrame(bd2["user"].unique()).reset_index()
bd2 = pd.merge(bd2,users2,how="inner",left_on="user",right_on=0).drop(columns=["user",0]).rename(columns={"index": "user"})
users0 = bd2.groupby(by=["user"],as_index=False).count()
users0 = users0[users0["item"] > 1]
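items = bd2["item"].unique()
# Cast to float first so the normalized values can be stored in the same column.
bd2["counts"] = bd2["counts"].astype(float)
# Scale each merchant's counts to unit L2 norm.
for item in items:
    a = bd2[["counts"]][bd2["item"]==item]
    bd2.loc[bd2["item"]==item,["counts"]] = a/np.sqrt(np.square(a).sum())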
bd20 = pd.merge(bd2,users0,how="inner",left_on="user",right_on="user",suffixes=("","_y")).drop(columns=["item_y","counts_y"])
bd20 = bd20.sort_values(by=["item","user"])

In [None]:
l = len(items)
print(l,"different merchant ids\n")
def coseno_simil(df):
    if df.shape[0] > 1:
        return((df["counts_x"]*df["counts_y"]).sum())
    else:
        return(0)
# Collect one row per merchant pair; DataFrame.append was removed in pandas 2.0.
rows = []
for i in np.arange(len(items)-1):
    proce = 100*(l*(l-1) - (l-i)*(l-1-i))/(l*(l-1))
    print("{p:2.2f}% processed...\n".format(p=proce))
    a = bd20[["user","counts"]][bd20["item"]==items[i]]
    for j in np.arange(i+1,len(items)):
        b = bd20[["user","counts"]][bd20["item"]==items[j]]
        # Inner merge keeps only the users who bought at both merchants.
        df = pd.merge(a,b,how="inner",left_on="user",right_on="user")
        rows.append([items[i],items[j],coseno_simil(df)])
similarities2 = pd.DataFrame(rows,columns=["item1","item2","similarity"])

In [None]:
similarities2.sort_values(by="similarity",ascending=False).head(50)

              item1         item2  similarity
29422  CO0000000935  CO0000001364    0.246778
1740   CO0000000598  CO0000000024    0.150012
3677   CO0000000951  CO0000000966    0.130571
510    CO0000000017  CO0000000024    0.128157
2156   CO0000000932  CO0000000935    0.108099
327    CO0000000923  CO0000001159    0.081944
505    CO0000000017  CO0000000598    0.078998
3678   CO0000000951  CO0000001487    0.077829
16263  CO0000000958  CO0000001575    0.060325
273    CO0000000923  CO0000000391    0.057593


In [None]:
!pip install pyarrow
similarities2.info()
similarities2.to_csv('similarities.csv', header=True, index=False)
similarities2.to_feather('similarities.feather')


Collecting pyarrow
  Downloading pyarrow-2.0.0-cp38-cp38-win_amd64.whl (10.7 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-2.0.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159895 entries, 0 to 159894
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   item1       159895 non-null  object 
 1   item2       159895 non-null  object 
 2   similarity  159895 non-null  float64
dtypes: float64(1), object(2)
memory usage: 3.7+ MB


The following is an example of the recommendation system using a buyer selected at random from the data set.

In [None]:
# Pick one buyer at random: draw a uniform number per user and keep the smallest.
usuarios = pd.DataFrame(bd["user"].unique()).rename(columns={0:"user"})
usuarios["random"] = np.random.uniform(0,1,len(usuarios))
minimo = min(usuarios["random"])
usuario = usuarios["user"][usuarios["random"]==minimo]
print("The selected buyer has the following identifier:\n")
usuario

The selected buyer has the following identifier:



1492639    K8DGU31V25
Name: user, dtype: object

In [None]:
out = bd.merge(usuario,how="inner",left_on="user",right_on="user").groupby(by="item",as_index=False).count()
out = out.rename(columns={"item":"merchant_id","user":"No.Compras"})
print("This customer has bought at\n")
print(out)
# Each pair is stored once, so look up the buyer's merchants on both sides of the pair.
out2 = pd.merge(out,similarities2,how="inner",left_on="merchant_id",right_on="item1").drop(columns="item1").rename(columns={"item2":"item"})
out3 = pd.merge(out,similarities2,how="inner",left_on="merchant_id",right_on="item2").drop(columns="item2").rename(columns={"item1":"item"})
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
out2 = pd.concat([out2,out3],ignore_index=True)
# Drop candidate merchants where the buyer has already purchased.
out2 = pd.merge(out2,out,how="left",right_on="merchant_id",left_on="item",indicator=True)
out2 = out2[out2['_merge']=='left_only'].copy()
out2["score"] = out2["similarity"]*out2["No.Compras_x"]
out2 = out2[["item","score"]].groupby(by="item",as_index=False).sum().sort_values(by="score",ascending=False)
print("\nThe recommendations for this customer, in decreasing order of importance, are the following:\n")
out2["item"].head(5)

This customer has bought at

                                     merchant_id  No.Compras
0               65-Seguros y fondos de pensiones           1
1            84-Administración pública y defensa           1
2  86-Actividades de atención de la salud humana           3

The recommendations for this customer, in decreasing order of importance, are the following:



41    79-Agencias de viajes y operadores turísticos
27                  59-Actividades cinematográficas
26                        58-Actividades de edición
32                         64-Servicios financieros
29                            61-Telecomunicaciones
Name: item, dtype: object

In [None]:
# Pick one buyer at random: draw a uniform number per user and keep the smallest.
usuarios = pd.DataFrame(bd["user"].unique()).rename(columns={0:"user"})
usuarios["random"] = np.random.uniform(0,1,len(usuarios))
minimo = min(usuarios["random"])
usuario = usuarios["user"][usuarios["random"]==minimo]
print("The selected buyer has the following identifier:\n")
usuario

The selected buyer has the following identifier:



2632666    MSK81CKQTK
Name: user, dtype: object

In [None]:
out = bd.merge(usuario,how="inner",left_on="user",right_on="user").groupby(by="item",as_index=False).count()
out = out.rename(columns={"item":"merchant_id","user":"No.Compras"})
print("This customer has bought at\n")
print(out)
# Each pair is stored once, so look up the buyer's merchants on both sides of the pair.
out2 = pd.merge(out,similarities2,how="inner",left_on="merchant_id",right_on="item1").drop(columns="item1").rename(columns={"item2":"item"})
out3 = pd.merge(out,similarities2,how="inner",left_on="merchant_id",right_on="item2").drop(columns="item2").rename(columns={"item1":"item"})
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
out2 = pd.concat([out2,out3],ignore_index=True)
# Drop candidate merchants where the buyer has already purchased.
out2 = pd.merge(out2,out,how="left",right_on="merchant_id",left_on="item",indicator=True)
out2 = out2[out2['_merge']=='left_only'].copy()
# Same weighting as before, rescaled by a per-buyer constant.
out2["score"] = out2["similarity"]*out2["No.Compras_x"]/sum(out2["No.Compras_x"])
out2 = out2[["item","score"]].groupby(by="item",as_index=False).sum().sort_values(by="score",ascending=False)
print("\nThe recommendations for this customer, in decreasing order of importance, are the following:\n")
out2["item"].head(5)

This customer has bought at

    merchant_id  No.Compras
0  CO0000000706           1

The recommendations for this customer, in decreasing order of importance, are the following:



1      CO0000000017
172    CO0000000825
149    CO0000000748
160    CO0000000789
123    CO0000000598
Name: item, dtype: object

# Comments and remarks


This prototype recommender system is memory-based: it takes the transactions we already know about a particular user and offers similar merchants.

This is a starting point; we are going to improve the model over the next weeks.

One caveat: this may not be the most useful approach because, as we saw during the EDA, the majority of users are one-time users. They have no history, or there is very little information about them at the moment of making the recommendation.
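The share of users with little or no history can be measured directly from the transactions table. This is a minimal sketch on hypothetical toy data, assuming the `user` and `item` column names used earlier in the notebook; the variable names are illustrative and the numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical transactions, one row per purchase.
tx = pd.DataFrame({
    "user": ["u1", "u2", "u2", "u3", "u4", "u4", "u4"],
    "item": ["A", "A", "B", "C", "B", "B", "B"],
})

purchases_per_user = tx.groupby("user").size()
merchants_per_user = tx.groupby("user")["item"].nunique()

# Fraction of users with a single purchase, and with a single distinct merchant.
single_purchase_share = (purchases_per_user == 1).mean()
single_merchant_share = (merchants_per_user == 1).mean()

print(single_purchase_share, single_merchant_share)
```

Users who bought at only one merchant add nothing to the merchant-pair similarities, and their own recommendations rest on a single data point.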



## Different approaches considered:

There are other ways to implement a solution to this problem. So far we have considered just one.

* **Collaborative Filtering**

* **Memory Based Algorithm**

* **Model Based Filtering**

The last one is what we aim for: a comprehensive model that relies on a few variables that can be captured during the first interactions with a potential or loyal user. In this model we take the most critical attributes of the user and build a function that is cheap to evaluate, such as a polynomial, and that outputs a classification pool or a range of options for recommendation.