# H&M Product Recommendations 2

In this notebook we will create a 2 neo4j graphs using 2019_autumn_transactions_train.csv and 2020_autumn_transactions_train.csv, and then use a similarity algorithm to create personalized recommendations for each customer.

In [2]:
# Importing libraries

import pandas as pd
import auxiliars

In [2]:
# Neo4j driver conf

uri = 'uri' # example 'bolt://localhost:7687'
user = 'user'
password = 'password'

### Importing the data and creating the first graph

In [5]:
transactions_2019 = pd.read_csv('data/2019_autumn_transactions_train.csv')
transactions_2019

Unnamed: 0,customer_id,t_dat,article_id,price,sales_channel_id
0,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,2019-09-03,757957001,0.022017,2
1,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,2019-09-03,805986001,0.033881,2
2,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,2019-09-03,785464001,0.042356,2
3,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,2019-09-03,794572001,0.016932,2
4,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,2019-09-03,763863002,0.008458,2
...,...,...,...,...,...
308866,d8bfae75ec21959c1abbcd141b5d19111fe355eb48729b...,2019-11-22,820428001,0.010831,1
308867,d8bfae75ec21959c1abbcd141b5d19111fe355eb48729b...,2019-11-22,791522004,0.021729,1
308868,d8bfae75ec21959c1abbcd141b5d19111fe355eb48729b...,2019-11-22,765739001,0.016254,1
308869,d8bfae75ec21959c1abbcd141b5d19111fe355eb48729b...,2019-11-22,551044045,0.021678,1


In [13]:
# Lets see how many customers and articles are in the dataset

customers_2019 = transactions_2019.customer_id.unique()
articles_2019 = transactions_2019.article_id.unique()

print('There are ', len(customers_2019), ' customers in the dataset')
print('There are ', len(articles_2019), ' articles in the dataset')

There are  87519  customers in the dataset
There are  265  articles in the dataset


**Creating the graph database**

In order to create the neo4j graph database, we will perform the following steps:
- Add customers as nodes with the *Customer* label.
- Add articles as nodes with the *Article* label.
- Add a relationship labeled as *Bought* for every transaction in the dataset. For example, customer 1 bought article 5. 

In [7]:
#Create driver
neo4j = auxiliars.graphDriver(uri=uri, user=user, password=password)

# Add customers
for i in customers_2019:
    neo4j.create_customer_node(i)

# Add articles
for i in articles_2019:
    neo4j.create_article_node(str(i)) # Giving input as string to save the id in str format

# Add relationships
for index, row in transactions_2019.iterrows():
    neo4j.create_transaction(row.customer_id, row.article_id, row.price, row.t_dat)
    
# Close driver
neo4j.close()

### Getting customer similarity

Using the neo4j node similarity algorithm, we will get the similarity between each node (customer) pair in the dataset.

In [9]:
# Create driver
neo4j = auxiliars.graphDriver(uri=uri, user=user, password=password)

# Get similarity df
similarity_df = neo4j.get_customer_similarity()

similarity_df

Unnamed: 0,customer_1_id,customer_2_id,similarity
0,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,3aa5e6555480b566b23669072f51284fe681c1afa3f321...,0.061033
1,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,578e704b2c2fc32d93261b4be9d4797301518f653a9c17...,0.055276
2,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,9facc7e5247374694a2cffe333d373abcb00545564326c...,0.052402
3,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,0aa2639de115b950b6fb73e632c4895bdea9129445e320...,0.051948
4,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,6892f3043c0e821c05e70d30a480bca56f4dc60062e9ea...,0.049550
...,...,...,...
55065,d8bfae75ec21959c1abbcd141b5d19111fe355eb48729b...,736bca97b545cb156292d590ef0c84ac3440192eeda6e3...,0.039474
55066,d8bfae75ec21959c1abbcd141b5d19111fe355eb48729b...,cae43013ac28d76c74c98fd77caaea2f937da1c2f743c7...,0.039474
55067,d8bfae75ec21959c1abbcd141b5d19111fe355eb48729b...,d702491f7d800b4fafe5b4bf0b4ac9fb284fbbb733618c...,0.038961
55068,d8bfae75ec21959c1abbcd141b5d19111fe355eb48729b...,02b635822ad5378d0f8faba38dd0d0f3a6ec9d43b2564e...,0.037500


In [10]:
# Lets save the dataframe in csv

similarity_df.to_csv('data/similarity.csv', index=False)

### Creating the second graph and getting recommendations

First, we will create other graph with the data of autumn 2020, and we will make recommendations based on the similarity score we got in the previous section, and the articles each customer bought in autumn 2020.

In order to asign recommendations to customer X, we will asign the articles that a similar customer bought but the customer X didn't, and we will order these recomended articles attending to the similarity between the customers. 

First, we will create the second graph.

In [20]:
transactions_2020 = pd.read_csv('data/2020_autumn_transactions_train.csv')
transactions_2020

Unnamed: 0,customer_id,article_id,t_dat,price,sales_channel_id
0,8587b6abee36ea6659a20ff123243e79b7fef9779f4234...,751471001,2020-09-01,0.033881,2
1,8587b6abee36ea6659a20ff123243e79b7fef9779f4234...,909014001,2020-09-01,0.088966,2
2,8587b6abee36ea6659a20ff123243e79b7fef9779f4234...,873279001,2020-09-09,0.042356,2
3,8587b6abee36ea6659a20ff123243e79b7fef9779f4234...,872537001,2020-09-09,0.084729,2
4,e64e2798bc55c242e8fea2dcb72af1684112bf82c473e4...,751471001,2020-09-01,0.033881,2
...,...,...,...,...,...
7196,4ebaab0fab59c10a4aebc458de70477499a356716e606e...,673677022,2020-09-22,0.025407,2
7197,b762834e8edffbc5756535208cb708ef18aba6fedba2c7...,762143001,2020-09-03,0.013542,2
7198,173afba067e1c1fd20c404c6da639b99e9277b3a45a748...,828991003,2020-09-04,0.033881,2
7199,8b121faa353eb41a3cc7e98c3d0ff68432335536c800bf...,828991003,2020-09-13,0.033881,2


In [21]:
# Lets see how many customers and articles are in the dataset

customers_2020 = transactions_2020.customer_id.unique()
articles_2020 = transactions_2020.article_id.unique()

print('There are ', len(customers_2020), ' customers in the dataset')
print('There are ', len(articles_2020), ' articles in the dataset')

There are  1967  customers in the dataset
There are  265  articles in the dataset


In [16]:
#Create driver
neo4j = auxiliars.graphDriver(uri=uri, user=user, password=password)

# Neo4j community edition only allows to have one database, so we will clear the previous database to create this new one.
neo4j.clear_database()

# Add customers
for i in customers_2020:
    neo4j.create_customer_node(i)

# Add articles
for i in articles_2020:
    neo4j.create_article_node(str(i)) # Giving input as string to save the id in str format

# Add relationships
for index, row in transactions_2020.iterrows():
    neo4j.create_transaction(row.customer_id, row.article_id, row.price, row.t_dat)
    
# Close driver
neo4j.close()

**Getting recommendations for each customer**

In [3]:
similarity_df = pd.read_csv('data/similarity.csv')
similarity_df

Unnamed: 0,customer_1_id,customer_2_id,similarity
0,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,3aa5e6555480b566b23669072f51284fe681c1afa3f321...,0.061033
1,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,578e704b2c2fc32d93261b4be9d4797301518f653a9c17...,0.055276
2,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,9facc7e5247374694a2cffe333d373abcb00545564326c...,0.052402
3,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,0aa2639de115b950b6fb73e632c4895bdea9129445e320...,0.051948
4,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,6892f3043c0e821c05e70d30a480bca56f4dc60062e9ea...,0.049550
...,...,...,...
55065,d8bfae75ec21959c1abbcd141b5d19111fe355eb48729b...,736bca97b545cb156292d590ef0c84ac3440192eeda6e3...,0.039474
55066,d8bfae75ec21959c1abbcd141b5d19111fe355eb48729b...,cae43013ac28d76c74c98fd77caaea2f937da1c2f743c7...,0.039474
55067,d8bfae75ec21959c1abbcd141b5d19111fe355eb48729b...,d702491f7d800b4fafe5b4bf0b4ac9fb284fbbb733618c...,0.038961
55068,d8bfae75ec21959c1abbcd141b5d19111fe355eb48729b...,02b635822ad5378d0f8faba38dd0d0f3a6ec9d43b2564e...,0.037500


In [33]:
def get_recommendations(cust_id, n, driver, similarity_df, customers_2020):

    if (cust_id not in list(similarity_df['customer_1_id'])) and (cust_id not in list(similarity_df['customer_2_id'])):
        return "Customer not in similarity dataframe"    

    similar_customers1 = similarity_df[similarity_df['customer_1_id'].str.contains(cust_id)].drop('customer_1_id', axis=1).rename(columns={'customer_2_id':'customer_id'})
    similar_customers2 = similarity_df[similarity_df['customer_2_id'].str.contains(cust_id)].drop('customer_2_id', axis=1).rename(columns={'customer_1_id':'customer_id'})

    similar_customers = pd.concat([similar_customers1, similar_customers2], axis=0).drop_duplicates().sort_values(by=['similarity'], ascending=False)
    
    if cust_id not in customers_2020:
        articles_bought = []
    else:
        articles_bought = driver.get_articles_bought_by(cust_id)

    recomended_articles = set(articles_bought)
    for i in similar_customers.customer_id:
        if i in customers_2020:
            aritcles_bought_by_i = driver.get_articles_bought_by(i)
            recomended_articles.update(aritcles_bought_by_i)
        if len(recomended_articles) > len(articles_bought) + n:
            break

    recomended_articles = [art for art in recomended_articles if art not in articles_bought]
    return recomended_articles[0:n]

# Set max number of recommendations
n = 5

# Customer id to get the recommendations
cust_id = 'be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee985513d9e8e53c6d91b'

# Create driver
neo4j = auxiliars.graphDriver(uri=uri, user=user, password=password)

# get recommendations
get_recommendations(cust_id=cust_id, n=n, driver=neo4j, similarity_df=similarity_df, customers_2020=customers_2020)


['898694002', '884319003', '919273002', '923758001', '828912004']