## Building a Event Recommendation Engine using a dataset made by VIT students themselves

We will be using an original dataset made by the students of VIT. This dataset contains 18000 ratings across 2000 events for 119 users. 

We are going to build a recommendation engine which will suggest events for a user which he hasn't attended yet based on the events which he has already rated. We will be using k-nearest neighbour algorithm which we will implement from scratch.

In [1]:
import pandas as pd

event file contains information like event id, title, domains of event and ratings file contains data like user id, event id, rating and timestamp in which each line after header row represents one rating of one event by one user.

In [2]:
event_file = "events.csv"
event_data = pd.read_csv(event_file, usecols = [0, 1])
event_data.head()

Unnamed: 0,eventId,title
0,1,Replica Reaction
1,2,Domain Lake
2,3,Division Understanding
3,4,Java Room
4,5,Little Last


In [3]:
ratings_file = "ratings.csv"
ratings_info = pd.read_csv(ratings_file, usecols = [0, 1, 2])
ratings_info.head()

Unnamed: 0,userId,eventId,rating
0,1,476,2.5
1,1,1707,3.0
2,1,1912,3.0
3,1,873,2.0
4,1,1538,4.0


In [4]:
event_info = pd.merge(event_data, ratings_info, left_on = 'eventId', right_on = 'eventId')
event_info.head()

Unnamed: 0,eventId,title,userId,rating
0,1,Replica Reaction,15,1.0
1,1,Replica Reaction,20,4.5
2,1,Replica Reaction,30,4.0
3,1,Replica Reaction,48,4.0
4,1,Replica Reaction,66,5.0


In [5]:
event_info = pd.DataFrame.sort_values(event_info, ['userId', 'eventId'], ascending = [0, 1])
event_info.head()

Unnamed: 0,eventId,title,userId,rating
26,3,Division Understanding,119,4.0
62,7,Parameter Running,119,1.0
143,16,Domain Getting,119,4.0
152,17,Division Lake,119,5.0
251,28,Little Dive,119,4.0


Let us see the number of users and number of events in our dataset

In [6]:
num_users = max(event_info.userId)
num_events = max(event_info.eventId)
print(num_users)
print(num_events)

119
2000


how many events were rated by each user and the number of users rated each event

In [7]:
event_per_user = event_info.userId.value_counts()
event_per_user.head()

15     1700
73     1610
30     1011
23      726
102     678
Name: userId, dtype: int64

In [8]:
users_per_event = event_info.title.value_counts()
users_per_event.head()

Recommend Study        9
Parameter Speed        9
Parameter Professor    9
Java Exhaled           9
Freedom Kaggle         9
Name: title, dtype: int64

Function to find top N favourite events of a user

In [9]:
def fav_event(current_user, N):
    # get rows corresponding to current user and then sort by rating in descending order
    # pick top N rows of the dataframe
    fav_event = pd.DataFrame.sort_values(event_info[event_info.userId == current_user], ['rating'], ascending = [0]) [:N]
    # return list of titles
    return list(fav_event.title)

print(fav_event(5, 3))
    
    

['Square Profession', 'Data Calls', 'Job Calls']


Lets build recommendation engine now

- We will use a neighbour based collaborative filtering model. 
- The idea is to use k-nearest neighbour algorithm to find neighbours of a user
-  We will use their ratings to predict ratings of a event not already rated by a current user.

We will represent events attended by a user in a vector - the vector will have values for all the events in our dataset.
If a user hasn't rated a event, it would be represented as NaN.

In [10]:
user_event_rating_matrix = pd.pivot_table(event_info, values = 'rating', index=['userId'], columns=['eventId'])
user_event_rating_matrix

eventId,1,2,3,4,5,6,7,8,9,10,...,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,1.0,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,5.0,,,,,,,,...,5.0,,,,,,,,,
5,,,,,,,,5.0,,,...,,,,,3.5,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,,,,,,,,,,,...,,,,,,,,,,
116,,,,,,,,,,,...,,,,,,,,,,
117,,,,,,,,,,,...,,,,,,,,,,
118,,,,,,,,,,,...,,,,,,,4.0,,,


Now, we will find the similarity between 2 users by using correlation 

In [11]:
from scipy.spatial.distance import correlation
import numpy as np
def similarity(user1, user2):
    # normalizing user1 rating i.e mean rating of user1 for any event
    # nanmean will return mean of an array after ignore NaN values 
    user1 = np.array(user1) - np.nanmean(user1) 
    user2 = np.array(user2) - np.nanmean(user2)
    
    # finding the similarity between 2 users
    # finding subset of event rated by both the users
    common_event_ids = [i for i in range(len(user1)) if user1[i] > 0 and user2[i] > 0]
    if(len(common_event_ids) == 0):
        return 0
    else:
        user1 = np.array([user1[i] for i in common_event_ids])
        user2 = np.array([user2[i] for i in common_event_ids])
        return correlation(user1, user2)

 We will now use the similarity function to find the nearest neighbour of a current user

In [12]:
# nearest_neighbour_ratings function will find the k nearest neighbours of the current user and
# then use their ratings to predict the current users ratings for other unrated event 

def nearest_neighbour_ratings(current_user, K):
    # Creating an empty matrix whose row index is userId and the value
    # will be the similarity of that user to the current user
    similarity_matrix = pd.DataFrame(index = user_event_rating_matrix.index, 
                                    columns = ['similarity'])
    for i in user_event_rating_matrix.index:
        # finding the similarity between user i and the current user and add it to the similarity matrix
        similarity_matrix.loc[i] = similarity(user_event_rating_matrix.loc[current_user],
                                             user_event_rating_matrix.loc[i])
    # Sorting the similarity matrix in descending order
    similarity_matrix = pd.DataFrame.sort_values(similarity_matrix,
                                                ['similarity'], ascending= [0])
    # now we will pick the top k nearest neighbour
    nearest_neighbours = similarity_matrix[:K]

    neighbour_event_ratings = user_event_rating_matrix.loc[nearest_neighbours.index]

    # This is empty dataframe placeholder for predicting the rating of current user using neighbour events ratings
    predicted_event_rating = pd.DataFrame(index = user_event_rating_matrix.columns, columns = ['rating'])

    # Iterating all events for a current user
    for i in user_event_rating_matrix.columns:
        # by default, make predicted rating as the average rating of the current user
        predicted_rating = np.nanmean(user_event_rating_matrix.loc[current_user])

        for j in neighbour_event_ratings.index:
            # if user j has rated the ith event
            if(user_event_rating_matrix.loc[j,i] > 0):
                predicted_rating += ((user_event_rating_matrix.loc[j,i] -np.nanmean(user_event_rating_matrix.loc[j])) *
                                                    nearest_neighbours.loc[j, 'similarity']) / nearest_neighbours['similarity'].sum()

        predicted_event_rating.loc[i, 'rating'] = predicted_rating

    return predicted_event_rating

Predicting top N recommendations for a current user

In [13]:
def top_n_recommendations(current_user, N):
    predicted_event_rating = nearest_neighbour_ratings(current_user, 10)
    events_already_attended = list(user_event_rating_matrix.loc[current_user]
                                  .loc[user_event_rating_matrix.loc[current_user] > 0].index)
    
    predicted_event_rating = predicted_event_rating.drop(events_already_attended)
    
    top_n_recommendations = pd.DataFrame.sort_values(predicted_event_rating, ['rating'], ascending=[0])[:N]
    
    top_n_recommendation_titles = event_data.loc[event_data.eventId.isin(top_n_recommendations.index)]

    return list(top_n_recommendation_titles.title)

finding out the recommendations for a user

In [14]:
current_user = int(input("user id"))
print("User's favorite attended events are : ", fav_event(current_user, 5),
      "\nUser's top recommendations are: ", top_n_recommendations(current_user, 3))

user id20


  dist = 1.0 - uv / np.sqrt(uu * vv)


User's favorite attended events are :  ['Regular Control', 'Compute Speed', 'Virus Room', 'Art Coding', 'Job Study'] 
User's top recommendations are:  ['Decoding Running', 'Domain Story', 'Lesson Speed']


## Conclusion
We have built a event recommendation engine using k-nearest neighbour algorithm implemented from scratch. 