# DATA 612 Final Project - Developing a Recommender System

By Mike Silva

## Introduction

The goal of this project is to develope a recommender system.  I intend to develop a Flask app to give access to the retem.  If time permits I will integrate user session data into the recomender system.  For this project I will be using data scrapped from BoardGameGeek.com (BGG).

### About the BGG Dataset
The BoardGameGeek dataset was collected by myself by scrapping data from the API that forms the backend of [BoardGameGeek's website](https://boardgamegeek.com/). Data scrapping in ongoing but this particular data set has over 1.9 million ratings (implicit and explicit) for about 88,000 games by 219,000 users. I have previously exported the ratings from the SQLite database, then exported the data into a CSV for processing.

In [1]:
import pandas as pd
import networkx as nx
import sqlite3
import html
import json
import pickle

# I will be saving object for use with the recommender system
def save_object(obj, filename):
    with open(filename, 'wb') as output:
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)

## Data Wrangling

I will begin by reading in the rating data that I have scrapped.

In [2]:
df = pd.read_csv("bgg_ratings.csv")

This data is a mix of explicit and implicit ratings.  We will split the rating data into both categories.

In [3]:
is_an_explicit_rating = df["rating"] > 0
explicit_rating = df[is_an_explicit_rating]
implicit_rating = df[~is_an_explicit_rating]

I will be implementing a recommender system biased toward the popularity of the game.  I will be reducing the ratings into a binary variable: it was liked or not liked.  Since some users are tougher reviewers than others I will be adjusting for the user bias.

In [4]:
global_mean = explicit_rating["rating"].mean()
user_bias = explicit_rating[["user_id", "rating"]].groupby("user_id").mean().reset_index().rename(columns={"rating": "avg_rating"})
user_bias["global_mean"] = global_mean
user_bias["user_bias"] = user_bias["global_mean"] - user_bias["avg_rating"]
user_bias

Unnamed: 0,user_id,avg_rating,global_mean,user_bias
0,2,10.000000,7.609077,-2.390923
1,4,7.306250,7.609077,0.302827
2,5,5.090909,7.609077,2.518168
3,6,6.862821,7.609077,0.746256
4,7,8.426667,7.609077,-0.817590
...,...,...,...,...
160100,529207,9.000000,7.609077,-1.390923
160101,529208,8.000000,7.609077,-0.390923
160102,529209,8.000000,7.609077,-0.390923
160103,529210,8.000000,7.609077,-0.390923


Now that I have the user biases I will merge them in with the explicit ratings and adjust the ratings by the user bias. 

In [5]:
explicit_rating = pd.merge(explicit_rating, user_bias)
explicit_rating["adj_rating"] = explicit_rating["rating"] + explicit_rating["user_bias"]
explicit_rating

Unnamed: 0,id,item_id,user_id,rating,rating_tstamp,avg_rating,global_mean,user_bias,adj_rating
0,1,3,987,9.0,2016-07-29 08:22:03,7.724503,7.609077,-0.115427,8.884573
1,1703,93,987,8.0,2018-01-12 08:31:00,7.724503,7.609077,-0.115427,7.884573
2,3998,200,987,7.0,2013-01-13 14:38:14,7.724503,7.609077,-0.115427,6.884573
3,8626,571,987,7.5,2012-12-27 15:20:34,7.724503,7.609077,-0.115427,7.384573
4,21737,1607,987,9.1,2012-12-27 14:43:14,7.724503,7.609077,-0.115427,8.984573
...,...,...,...,...,...,...,...,...,...
1013321,1936282,88841,359261,10.0,2019-06-24 05:36:08,10.000000,7.609077,-2.390923,7.609077
1013322,1936283,88841,207630,10.0,2020-04-07 15:28:37,10.000000,7.609077,-2.390923,7.609077
1013323,1936311,40504,457329,9.0,2016-10-28 11:56:33,9.000000,7.609077,-1.390923,7.609077
1013324,1936313,40504,64354,9.0,2019-01-01 12:24:54,9.000000,7.609077,-1.390923,7.609077


Now that thep as ratings have been adjusted I need to identify which ratings to keep as evidence that the user liked the game.  I will look at the distribution of the adjusted ratings.

In [6]:
explicit_rating["adj_rating"].describe()

count    1.013326e+06
mean     7.609077e+00
std      1.370001e+00
min     -1.380923e+00
25%      7.009077e+00
50%      7.609077e+00
75%      8.359077e+00
max      1.655106e+01
Name: adj_rating, dtype: float64

I will say any game that has an adjusted rating equal to or above the global mean are "liked" by the user.

In [7]:
is_liked = explicit_rating["adj_rating"] >= global_mean
explicit_likes = explicit_rating[is_liked][["item_id", "user_id"]].reset_index(drop=True)
explicit_likes

Unnamed: 0,item_id,user_id
0,3,987
1,93,987
2,1607,987
3,1671,987
4,2281,987
...,...,...
607578,88841,359261
607579,88841,207630
607580,40504,457329
607581,40504,64354


Now I will combine the implicit likes with the explicit likes into one dataset.

In [8]:
implicit_likes = implicit_rating[["item_id", "user_id"]].reset_index(drop=True)
likes = pd.concat([explicit_likes, implicit_likes]).sort_values(["item_id", "user_id"]).reset_index(drop=True)

There is 1.5 million likes in the dataset.  I'm going to determine the count of likes for each game and identify the top 25 most popular games.

In [9]:
n_likes = dict()
n_likes_df = likes.groupby(['item_id']).count().reset_index().rename(columns={'user_id':'likes'}).sort_values('likes', ascending  = False).reset_index(drop=True)
current_likes = 0
ranking = 0
most_popular = list()

for index, row in n_likes_df.iterrows():
    if row['likes'] != current_likes:
        ranking += 1
        current_likes = row['likes']
    n_likes[row['item_id']] = {'likes': row['likes'], 'ranking':ranking}
    if ranking < 25:
        most_popular.append(row['item_id'])
most_popular

[21403,
 53216,
 46151,
 49883,
 47502,
 35140,
 50530,
 42326,
 82563,
 21175,
 34113,
 61330,
 7507,
 16761,
 35373,
 43261,
 14674,
 15673,
 4029,
 47753,
 54736,
 41029,
 32621,
 39560,
 39879]

## Web of Likes

I want to combine content filtering with the ratings in a meaningful way.  I will be using a graph based approach to connect these concepts.  The recommender system will then use the graph to for the recommendations.  I will be using NetworkX to build the graph.  In this stage I will connect the users with the games.  I will also connect the games identified as popular to that concept. 

In [10]:
G = nx.Graph()
G.add_node("popular")

nodes = set("popular")
items = list()

for index, row in df.iterrows():
    item = "game_" + str(row["item_id"])
    user = "user_" + str(row["user_id"])
    if item not in nodes:
        G.add_node(item)
        nodes.add(item)
        items.append(row["item_id"])
    if user not in nodes:
        G.add_node(user)
        nodes.add(user)
    G.add_edge(item, user)
    # Check to see if this should be tagged as popular
    if row["item_id"] in most_popular:
        G.add_edge(item, "popular")

### Augment by Attributes

I will pull attributes from the scrapped data.  I will connect the game to the categories (allowing me to weigh similar types of games higher), by the year it was released, and if the game is identified as an integrating with another game (i.e. an expansion of a base game).

In [11]:
attributes = dict()
DATABASE = "../../boardgamegeek/games.db"
conn = sqlite3.connect(DATABASE)
cur = conn.cursor()
category_counts = dict()

for item_id in items:
    node_id = "game_" + str(item_id)
    cur.execute("SELECT * FROM item WHERE id = " + str(item_id))
    c = cur.fetchone()
    # Initialize the attributes for this game
    attributes_row = {"name": c[1], "likes": n_likes[item_id]["likes"], "ranking": n_likes[item_id]["ranking"], "category": list(), "mechanic": list()}
    # Parse the XML file for more attributes
    xml_file = "../../boardgamegeek/scrapped/" + str(c[2]) + ".xml"
    xml = open(xml_file, "r",  encoding="utf8")
    for line in xml.readlines():
        if "<description>" in line:
            description = html.unescape(line.split("</name>")[0].split(">")[1])
            attributes_row["description"] = description
        if "</boardgamemechanic>" in line:
            boardgamemechanic = html.unescape(line.split("</boardgamemechanic>")[0].split(">")[1])
            attributes_row["mechanic"].append(boardgamemechanic)
            mechanic_node = "mechanic_" + boardgamemechanic.lower().replace(" ", "_")
            if mechanic_node not in nodes:
                nodes.add(mechanic_node)
                G.add_node(mechanic_node)
            G.add_edge(node_id, mechanic_node)
        if "<yearpublished>" in line:
            try:
                yearpublished = html.unescape(line.split("</yearpublished>")[0].split(">")[1])
                year_node = "year_" + yearpublished
                if year_node not in nodes:
                    nodes.add(year_node)
                    G.add_node(year_node)
                G.add_edge(node_id, year_node)
                attributes_row["year"] = int(yearpublished)
            except:
                continue
        if "<minplayers>" in line:
            try:
                minplayers = html.unescape(line.split("</minplayers>")[0].split(">")[1])
                attributes_row["minplayers"] = int(minplayers)
            except:
                continue
        if "<maxplayers>" in line:
            try:
                minplayers = html.unescape(line.split("</maxplayers>")[0].split(">")[1])
                attributes_row["maxplayers"] = int(minplayers)
            except:
                continue
        if "<age>" in line:
            try:
                age = html.unescape(line.split("</age>")[0].split(">")[1])
                attributes_row["age"] = int(age)
            except:
                continue
        if "<thumbnail>" in line:
            thumbnail = html.unescape(line.split("</thumbnail>")[0].split(">")[1])
            attributes_row["thumbnail"] = thumbnail
        if "<image>" in line:
            image = html.unescape(line.split("</image>")[0].split(">")[1])
            attributes_row["image"] = image
        if "</boardgamecategory>" in line:
            boardgamecategory = html.unescape(line.split("</boardgamecategory>")[0].split(">")[1])
            category_counts[boardgamecategory] = category_counts.get(boardgamecategory, 0) + 1
            attributes_row["category"].append(boardgamecategory)
            category_node = "category_" + boardgamecategory.lower().replace(" ", "_")
            if category_node not in nodes:
                nodes.add(category_node)
                G.add_node(category_node)
            G.add_edge(node_id, category_node)
        if "</boardgameintegration>" in line:
            objectid = line.split(">")[0].split("objectid=")[1].split('"')[1]
            cur.execute("SELECT id FROM item WHERE bgg_id = " + str(objectid))
            try:
                c2 = cur.fetchone()
                node_id2 = "game_" + c2[0]
                itegration_id = "integrates_with_" + node_id2
                if itegration_id not in nodes:
                    nodes.add(itegration_id)
                    G.add_node(itegration_id)
                G.add_edge(node_id, itegration_id)
                G.add_edge(node_id2, itegration_id)
            except:
                continue
            
    attributes[item_id] = attributes_row

## Saving My Work

Now that the data has been wrangled, I will save my work.  I will be pickling the data.

In [12]:
save_object(attributes, "attributes.pickle")
save_object(most_popular, "most_popular.pickle")
save_object(category_counts, "categories.pickle")
nx.write_gpickle(G, "G.pickle")

 ## Next Steps

I will be using the saved data in the Flask app which deploys the recommender system.  It will use something like this:

In [13]:
def get_recommendations(G, game_id, n_recommendations, return_weights = False):
    root_node = "game_" + str(game_id)
    counts = dict()
    for n in G.neighbors(root_node):
        if n == "popular":
            weight = 1
        elif "category" in n or "mechanic" in n:
            weight = 2
        elif "integrates_with_"  in n:
            weight = 20
        elif "year" in n:
            weight = 1
        elif "user" in n:
            weight = 1
        else:
            weight = 0
        for node in G.neighbors(n):
            node_id = int(node.replace("game_", ""))
            if node != root_node:
                counts[node_id] = counts.get(node_id, 0) + weight
    
    recommendations = [u[1] for u in sorted(((value, key) for (key,value) in counts.items()), reverse=True)][0:n_recommendations]
    
    if return_weights:
        recommendations = [(u, counts[u]) for u in recommendations]
    
    return(recommendations)

game_id = most_popular[5]
print("Recommendations for " + attributes[game_id]['name'])
for game, weight in get_recommendations(G, game_id, 10, True):
    print(attributes[game]['name'] + ":" + str(weight))

Recommendations for Ticket to Ride: Europe
Ticket to Ride Map Collection: Volume 3 – The Heart of Africa:15
Ticket to Ride: Märklin:14
Ticket to Ride: Mystery Train Expansion:14
Ticket to Ride Map Collection: Volume 6 – France & Old West:13
Ticket to Ride Map Collection: Volume 5 – United Kingdom & Pennsylvania:13
Ticket to Ride: Germany:12
Thurn and Taxis:12
Ticket to Ride Map Collection: Volume 4 – Nederland:12
Ticket to Ride Map Collection: Volume 1 – Team Asia & Legendary Asia:12
Ticket to Ride: Orient Express:11
