# Abstract
The algorithm computes recommendations for a user from three sources: the user's answers to a set of questions, the event histories of this user and of other users, and the similarity between events based on their descriptions. The answers act as a hard filter on the events that can be recommended, whereas the recommendation based on event histories is computed with an artificial neural network (ANN). The similarity of descriptions is computed as the cosine similarity between embeddings of the description texts. Combining the user's answers, the booking trends across users and the event similarities yields the recommendation. We use dummy data to showcase the algorithm.
![Sequence diagram](Sequenzdiagramm.jpeg)

In [1]:
import numpy as np
import pandas as pd

In [31]:
# Read data
data = pd.read_csv('Datensatz_events.csv')

In [32]:
# Show data
# The Keyword columns are reserved for future matching of event descriptions with another algorithm
data.head()

Unnamed: 0.1,Unnamed: 0,ID,Name,Ort,Beschreibung,Indoor,Preis,Familie,Paar,Abenteuer,...,Bewertung,AnzahlBewertung,Keyword1,Keyword2,Keyword3,Keyword4,Keyword5,Keyword6,Keyword7,Keyword8
0,0,1,Felix Nussbaum Haus,Osnabrück,"Das Felix-Nussbaum-Haus, erbaut nach einem Ent...",1,0,1,1,0,...,4.3,267,,,,,,,,
1,1,2,Stadtführung Zeitseeing,Osnabrück,"Egal ob zu Fuß, auf zwei oder auch vier Rädern...",0,1,1,1,0,...,4.8,4,,,,,,,,
2,2,3,Nachwächter Rundgang,Osnabrück,Als im Jahre 1913 der letzte Nachtwächter sein...,0,1,1,1,1,...,4.0,9,,,,,,,,
3,3,4,Museum Industriekultur,Osnabrück,Das Museum Industriekultur ist ein Museum in d...,1,1,1,1,0,...,4.5,445,,,,,,,,
4,4,5,Alpaka Wanderung,Osnabrück,Angefangen hat es im Herbst 2003 mit einem Fer...,0,1,1,1,1,...,5.0,13,,,,,,,,


## Filter

In [4]:
# Example user answers; in deployment these values come from the user
val0 = 0  # Indoor
val1 = 1  # Preis (price)
val21 = 1  # Familie (family)
val22 = 1  # Paar (couple)
val31 = 1  # Abenteuer (adventure)
val32 = 0  # Erholung (relaxation)

# boolean masks built from the user's answers keep only the relevant events
data_f = data[data['Indoor ']==val0]
data_f = data_f[data_f['Preis']==val1]
data_f = data_f[data_f['Familie']==val21]
data_f = data_f[data_f['Paar']==val22]
data_f = data_f[data_f['Abenteuer']==val31]
data_f = data_f[data_f['Erholung']==val32]

# output of the filter
data_f.head()

Unnamed: 0,ID,Name,Ort,Beschreibung,Indoor,Preis,Familie,Paar,Abenteuer,Erholung,Bewertung,AnzahlBewertung,Keyword1,Keyword2,Keyword3,Keyword4,Keyword5,Keyword6,Keyword7,Keyword8
2,3,Nachwächter Rundgang,Osnabrück,Als im Jahre 1913 der letzte Nachtwächter sein...,0,1,1,1,1,0,4.0,9,,,,,,,,
27,28,Varusschlacht,Osnabrück,Von der Ausgrabung zum Museum\nMehr als 2000 J...,0,1,1,1,1,0,4.3,1147,,,,,,,,
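
The chained masks above can also be driven by a dictionary of answers. The following is a minimal sketch under assumptions: `filter_events` is a hypothetical helper, and the toy frame is made up for illustration and is not taken from `Datensatz_events.csv`. The column names, including the trailing space in `'Indoor '`, are copied from the cell above.

```python
import pandas as pd


def filter_events(events, answers):
    """Keep only rows whose columns equal the given answers (hypothetical helper)."""
    mask = pd.Series(True, index=events.index)
    for column, value in answers.items():
        mask &= events[column] == value
    return events[mask]


# Toy data for illustration only; not taken from Datensatz_events.csv
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Indoor ': [0, 1, 0],
    'Preis': [1, 1, 0],
})
toy_filtered = filter_events(toy, {'Indoor ': 0, 'Preis': 1})
print(toy_filtered['Name'].tolist())
```

With a dictionary, a question the user skipped can simply be left out, so it does not restrict the result.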


# Old tries
We believe these methods might perform better than softmax regression when only a small amount of data is available.

## Try 1) Classify with all histories
- When the user has no history, the algorithm cannot be used (its output has no effect on the app).
- Each user has a click-history vector. The current user receives a recommendation based on the recommendations that other users with a similar history vector have received.
- This requires a distance measure between vectors of different lengths (each history vector has its own length), which is hard to define.
- Alternatively, we can use principal component analysis to reduce the dimensionality.
- However, when the longest and the shortest history vector differ greatly in length, a lot of information is lost.
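
One possible way around the length problem, which is not the method used in this notebook, is to treat each history as a set of event IDs and compare sets with the Jaccard similarity, which is defined for sets of any size. The sketch below makes that assumption; the function names and the example histories are hypothetical.

```python
def jaccard_similarity(history_a, history_b):
    """Jaccard similarity between two click histories given as iterables of event IDs."""
    set_a, set_b = set(history_a), set(history_b)
    if not set_a and not set_b:
        return 0.0  # assumption: two empty histories carry no evidence of similarity
    return len(set_a & set_b) / len(set_a | set_b)


def most_similar_user(current_history, other_histories):
    """Return the key of the user whose history is closest to the current one."""
    return max(other_histories,
               key=lambda user: jaccard_similarity(current_history, other_histories[user]))


# Hypothetical histories of different lengths
others = {'u1': [0, 3, 27], 'u2': [4, 7, 8, 28, 12], 'u3': [25]}
best = most_similar_user([0, 3], others)
print(best)
```

This ignores the order and the multiplicity of clicks, which may or may not be acceptable depending on the application.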

## Try 2) Classify individual user
- Take only history of current user
- For every event in the user's history, we compute the similarity to all events in the database and output the events with the highest similarity first.
- $recommend = \operatorname{argmax}_j \frac{1}{|I|} \sum_{i \in I} Similarity(H_i, A_j)$, where $I$ is the index set of the user history, $H_i$ is the $i$-th event in the user history and $A_j$ is the $j$-th event among all events.
- For simplicity, similarity is defined as the number of features on which two events agree, since all features used here are binary.
- In principle, one could also use, e.g., natural language processing to compute the similarity between the descriptions of two events.
- The algorithm does not learn from new user data; it only determines similarities between events based on their features.
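
The averaging formula above can also be written in vectorised NumPy. This is a sketch under assumptions: `average_similarity` is a hypothetical name, and the feature matrix is a toy example rather than the event data.

```python
import numpy as np


def average_similarity(features, history_idx):
    """Mean number of matching binary features between each event and the history.

    features: (n_events, n_features) array of 0/1 values
    history_idx: indices of the events in the user history
    """
    history = features[history_idx]                      # (n_history, n_features)
    # matches[i, j] = number of features on which history event i and event j agree
    matches = (history[:, None, :] == features[None, :, :]).sum(axis=2)
    return matches.mean(axis=0)                          # average over the history


# Toy feature matrix with 4 events and 3 binary features (illustrative values)
features = np.array([[1, 0, 1],
                     [1, 1, 1],
                     [0, 0, 0],
                     [1, 0, 0]])
scores = average_similarity(features, [0, 3])
ranking = np.argsort(-scores, kind='stable')             # best match first
print(scores, ranking)
```

Events that are already in the history naturally score high, so a deployment would probably exclude them from the ranking.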

In [5]:
def similarity_binary(x,y):
    '''
    measure the similarity between two events based on binary features
    :param x: pandas Series, an event in the user history
    :param y: pandas Series, an event from all possible events
    :return: int, similarity score
    '''
 
    # take only features with binary values
    x = x.filter(['Indoor ', 'Preis', 'Familie','Paar', 'Abenteuer', 'Erholung']).to_numpy()
    y = y.filter(['Indoor ', 'Preis', 'Familie','Paar', 'Abenteuer', 'Erholung']).to_numpy()
    return(np.sum(x == y))


# # The method is now deprecated. Use recommend() instead.
# def recommend_best_event(data, events_ID, history_ID):
#     """
#     compute avg. similarity and recommend a maximally similar event
#     :param data: pandas DataFrame, unfiltered data
#     :param events_ID: numpy array, array of indices of all events
#     :param history_ID: numpy array, array of indices of events in user history
#     :return: list, indices of best matching events
#     """
#     maxsim = 0  # init with a very low similarity
#     argmax_index = []  # no event found at the start
#     for j in events_ID:  # iterate over all events
#         # compute avg. similarity between j and events in history
#         sumsim = 0
#         for i in history_ID:
#             sumsim += similarity_binary(data.iloc[i], data.iloc[j])
#         sumsim = sumsim / len(history_ID)
#         # take argmax
#         if sumsim == maxsim:  # multiple maxima
#             argmax_index.append(j)
#         elif sumsim > maxsim:
#             maxsim = sumsim
#             argmax_index.clear()  # all maxima found until now are not maxima anymore
#             argmax_index.append(j)
    
#     # the algorithm only return the best matching event, but it can be easily modified
#     # to return all events listed based on its similarity score
#     return(argmax_index)


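def recommend(data, events_ID, history_ID):
    '''
    rank events by their average similarity to the events in the user history
    :param data: pandas DataFrame, unfiltered data
    :param events_ID: iterable of int, positional indices of all candidate events
    :param history_ID: iterable of int, positional indices of events in user history
    :return: dict, event index -> average similarity, sorted in descending order
    '''
    score_dict = {}
    for j in events_ID:  # iterate over all events
        # compute avg. similarity between j and events in history
        sumsim = 0
        for i in history_ID:
            sumsim += similarity_binary(data.iloc[i], data.iloc[j])
        sumsim = sumsim / len(history_ID)
        score_dict[j] = sumsim
    # return dictionary sorted by similarity score, best match first
    return({k: score_dict[k] for k in sorted(score_dict, key=score_dict.get, reverse=True)})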

Use case: the user has already booked the events with indices (0, 2, 10) and wants a recommendation.

Note that the output is trivial since our similarity measure is based exactly on binary features used for filtering. We would need a more refined similarity measure to get better results.

In [6]:
recommended = recommend(data, range(30), [0,2,10])

# combine the filter with the user behaviour to get an individual recommendation for the user
trash_index = np.delete(range(30), data_f.index)  # indices of events that are filtered out
for key in trash_index:
    del recommended[key]
recommended

{2: 4.0, 27: 4.0}

## Try 3) Softmax regression
- Use a neural network to compute matching probabilities once enough history data from multiple users has been gathered. A single network is thus trained on data from multiple users.
- Input: n-dimensional vector for n events, where each element represents the number of bookings of that event
- Output: n-dimensional vector on the probability simplex, giving a probability for each event
![Image of NN](Beschreibung_neuronales_Netz.jpeg)
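
To illustrate why the output lies on the probability simplex, here is a small NumPy sketch of the softmax function. The scores are illustrative and do not come from the network.

```python
import numpy as np


def softmax(logits):
    """Map a vector of real-valued scores to a probability vector."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


logits = np.array([2.0, 1.0, 0.1])      # illustrative scores for three events
probs = softmax(logits)
print(probs, probs.sum())
```

All entries are positive and sum to one, so they can be read as one probability per event.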

In [7]:
import keras
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


Generate random dummy data to train and test the algorithm

In [8]:
# import csv
# import random

# with open('dataset.csv', mode='w') as datafile:
#     datafile_writer = csv.writer(datafile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

#     row1 = ["Persnr"]
#     for x in range(1,31):
#         row1.append("Input"+str(x))
#     for x in range(1,31):
#         row1.append("Output"+str(x))

#     datafile_writer.writerow(row1)

#     list2 = []
#     for x in range(1,1000):
#         del list2[:]
#         list2.append(x)
#         for i in range (1,31):
#             ri = random.randint(0, 1)
#             list2.append(ri)
#             ro = random.randint(1,30)
#         for x in range(1,31):
#             if ro == x:
#                 list2.append(1)
#             else:
#                 list2.append(0)
#         datafile_writer.writerow(list2)

# N = 30 # num. events
# history_data = np.genfromtxt('dataset.csv', delimiter=',')
# X,Y = history_data[1:,:N], history_data[1:,N+1:]


We manually construct history data to demonstrate the idea. We assume a situation in which many people who went to the Felix-Nussbaum-Haus and the Museum Industriekultur also went to Varusschlacht. The example data therefore focuses on cultural events and serves only to showcase the algorithm.

In [52]:
n_samples = 1000
N = 30

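X = np.random.randint(0,2,(n_samples,N))
Y = np.eye(N)[np.random.choice(N, n_samples)]  # one-hot encoding
# inject artificial patterns; clear the random label first so every row stays one-hot
X[0:200, (0,3)] = 1  # Felix Nussbaum Haus and Museum Industriekultur ...
Y[0:200] = 0
Y[0:200, 27] = 1     # ... lead to Varusschlacht
X[300:400, (4,7,8)] = 1
Y[300:400] = 0
Y[300:400, 28] = 1
X[400:500, 25] = 1
Y[400:500] = 0
Y[400:500, 26] = 1
X_train, X_test, y_train, y_test = train_test_split(X,Y, train_size=0.7)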

Train and test a network

In [54]:
# architecture
# the number of units and layers was chosen arbitrarily for now and needs fine-tuning
# based on the data
model = keras.Sequential([
    keras.layers.Dense(units=5, activation='sigmoid'),  # ReLU and tanh output 0 at x=0, which is bad here
    keras.layers.Dense(units=5, activation='sigmoid'),
    keras.layers.Dense(units=10, activation='sigmoid'),
    keras.layers.Dense(units=N, activation='softmax')  # output vector is a prob. simplex
])
model.compile(optimizer='adadelta', 
              loss='categorical_crossentropy')  # one-hot output labels

# training
history = model.fit(X_train, y_train, epochs=50)

# testing
model.evaluate(X_test, y_test)

# save model and load for next use
model.save('NN_recommend')

Epoch 1/50
...
Epoch 50/50
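
Since `compile` above is given no metrics, `model.evaluate` reports only the loss. For a recommender, a top-k hit rate may be easier to interpret. The following NumPy sketch shows one way to compute it from predicted probabilities; `top_k_accuracy` is a hypothetical helper, and the probabilities are illustrative values, not output of the trained network.

```python
import numpy as np


def top_k_accuracy(probabilities, one_hot_labels, k=3):
    """Fraction of samples whose true event is among the k most probable events."""
    true_idx = one_hot_labels.argmax(axis=1)
    top_k = np.argsort(-probabilities, axis=1)[:, :k]
    hits = (top_k == true_idx[:, None]).any(axis=1)
    return hits.mean()


# Illustrative predictions for 3 users and 4 events (not output of the trained model)
probs = np.array([[0.1, 0.6, 0.2, 0.1],
                  [0.4, 0.3, 0.2, 0.1],
                  [0.3, 0.1, 0.4, 0.2]])
labels = np.eye(4)[[1, 3, 0]]
acc_top2 = top_k_accuracy(probs, labels, k=2)
print(acc_top2)
```

With the notebook's variables, this could presumably be called as `top_k_accuracy(model.predict(X_test), y_test)`.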


Give a recommendation based on the trained network

In [55]:
X_new =  np.array([np.random.randint(0, high=2, size=N)])  # history of a user
# this user also went to Felix-Nussbaum-Haus and Museum Industriekultur beforehand
X_new[0,(0,3)] = 1

model = keras.models.load_model('NN_recommend')  # load the saved model
predictions = model.predict(X_new)[0]
print('output from NN')
print(predictions)  # probability for each event
# one sees due to artificial bias in the data that Varusschlacht gets a high probability

predictions[trash_index] = 0  # combine filter results with predictions from NN
normalizer = np.sum(predictions)
predictions = predictions / normalizer  # normalize to 1
print('output from NN combined with answers from filter questions')
print(predictions)
print('sum of all output should be one: {}'.  # small numerical error is allowed
      format(np.sum(model.predict(X_new)))) 

output from NN
[0.01555743 0.02774332 0.02575275 0.01387231 0.0303464  0.02076266
 0.02079718 0.02022288 0.005834   0.03171593 0.01474826 0.02904988
 0.01797499 0.01893104 0.00905938 0.01903398 0.02118823 0.02117291
 0.01113563 0.02336001 0.0262093  0.0149738  0.01295154 0.04317614
 0.01543717 0.02292109 0.10795573 0.23132737 0.10674228 0.02004637]
output from NN combined with answers from filter questions
[0.         0.         0.10017402 0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.899826   0.         0.        ]
sum of all output should be one: 0.9999999403953552
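
The masking and renormalisation step can be wrapped in a small helper that also guards against the case in which no allowed event has any probability mass. This is a sketch; `mask_and_renormalize` is a hypothetical name, and the zero-sum behaviour is an assumption. The two input values are the probabilities of events 2 and 27 printed above.

```python
import numpy as np


def mask_and_renormalize(probabilities, allowed_idx):
    """Zero out events that failed the filter and rescale the rest to sum to one."""
    masked = np.zeros_like(probabilities)
    masked[allowed_idx] = probabilities[allowed_idx]
    total = masked.sum()
    if total == 0:
        return masked  # assumption: return all zeros when no allowed event has mass
    return masked / total


# The two surviving probabilities printed above (events 2 and 27)
p = np.zeros(30)
p[2], p[27] = 0.02575275, 0.23132737
result = mask_and_renormalize(p, [2, 27])
print(result[2], result[27])
```

The result is roughly 0.10 for event 2 and 0.90 for event 27, in line with the cell output.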


## Text similarities
- Here we use a simple TF-IDF text representation and cosine similarity to compute the similarity between the descriptions of events.
- Ideally, we combine the result with the history vector to improve the recommendation. This might be done by multiplying the probability of every event by the similarity score of that event with the other events in the prediction vector.
- In our running example, only two events have non-zero probabilities, so multiplying by the similarity score would not make any difference.
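
One possible reading of the combination idea is sketched below: each event's probability is weighted by its mean description similarity to the events in the user's history, and the result is renormalised. This is an assumption about how the combination could work, not the notebook's implementation; `rerank_with_similarity` and all values are hypothetical.

```python
import numpy as np


def rerank_with_similarity(probabilities, similarity, history_idx):
    """Weight NN probabilities by mean description similarity to the user's history."""
    weights = similarity[history_idx].mean(axis=0)   # one weight per event
    combined = probabilities * weights
    total = combined.sum()
    return combined / total if total > 0 else combined


# Illustrative values for 3 events; event 0 is the only event in the history
probs = np.array([0.2, 0.5, 0.3])
sim = np.array([[1.0, 0.1, 0.4],
                [0.1, 1.0, 0.0],
                [0.4, 0.0, 1.0]])
combined = rerank_with_similarity(probs, sim, [0])
print(combined)
```

In this toy case, event 2 overtakes event 1 because its description is closer to the history, even though the network gave it a lower probability.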

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [56]:
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1)  # 1- to 3-gram TF-IDF features; min_df=1 keeps every term
ling_data = data['Beschreibung'].values.astype('U')  # descriptions as unicode strings
tfidf_matrix = tf.fit_transform(ling_data)

# TF-IDF rows are L2-normalised by default, so the linear kernel equals the cosine similarity
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

- Visualize the similarities between events based solely on their descriptions. The matrix is symmetric, with perfect similarity (1) on the main diagonal.
- For example, the similarity between Hase Kanu (5) and Alfsee Wasserski (18) is relatively high.
- The Municipal Theater (25) is very dissimilar to all other events because it has no description.
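
Besides the heat map, the most similar pairs can be read directly from the matrix. The sketch below uses a hypothetical helper `most_similar_pairs` and an illustrative 4x4 matrix, not the similarities computed from the event descriptions.

```python
import numpy as np


def most_similar_pairs(similarity, top_n=3):
    """Return the top_n (i, j, score) pairs with i < j, ignoring the diagonal."""
    rows, cols = np.triu_indices_from(similarity, k=1)   # upper triangle, no diagonal
    scores = similarity[rows, cols]
    order = np.argsort(-scores, kind='stable')[:top_n]
    return [(int(rows[o]), int(cols[o]), float(scores[o])) for o in order]


# Illustrative 4x4 similarity matrix (not computed from the event descriptions)
sim = np.array([[1.0, 0.2, 0.0, 0.5],
                [0.2, 1.0, 0.1, 0.0],
                [0.0, 0.1, 1.0, 0.3],
                [0.5, 0.0, 0.3, 1.0]])
pairs = most_similar_pairs(sim, top_n=2)
print(pairs)
```

Only the upper triangle is inspected because the matrix is symmetric and the diagonal is trivially 1.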


In [22]:
%matplotlib notebook
import matplotlib.pyplot as plt

In [25]:
plt.imshow(cosine_similarities, cmap='Reds', vmin=0, vmax=0.1)
plt.colorbar()
plt.show()
