<b>Final Project Submission</b>

Please fill out:
* Student name: Nadine Amersi-Belton
* Student pace: part time
* Scheduled project review date/time: 02/11/2020 at 3:30pm GMT
* Instructor name: Victor Geislinger
* Blog post URL: https://nadinezab.medium.com/visualising-embeddings-with-t-sne-b54bf6b635f?sk=4606e6721bb5406e09943e4221f104dc


# Video Games Recommendation Engine

## Introduction

### Problem Statement

As a (hypothetical) junior data scientist at Steam, my role involves drawing insights from data and creating data-driven tools to boost sales and revenue. 

Firstly, I have been tasked with **providing useful recommendations for the marketing team**, who are currently reviewing their advertising contracts with various publishers.

As a second component, I have been asked to **build a user recommendation engine** using collaborative filtering methods to incentivise purchases. Steam currently uses a content-based model to generate recommendations and is keen to see which method performs better.

Finally, I will be looking at **item-to-item recommendations**, providing similar items which can be listed on a game's page.

### Business Value

Recommendation engines are a key component in boosting sales and revenue. Consumers are shown targeted games based on their unique tastes and preferences. Browsing relies on the consumer having the time and willingness to search for the right item, whereas displaying 5 or 10 relevant games leads to quicker sales and potential impulse purchases.

A well-performing recommendation engine therefore has significant business value for Steam.

### Data

The data can be downloaded from https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data. 

There are two relevant files for the purposes of our project: `Version 1: User and Item Data` and `Version 2: Item metadata`.

*References*

* Wang-Cheng Kang, Julian McAuley. Self-attentive sequential recommendation. ICDM, 2018.
* Mengting Wan, Julian McAuley. Item recommendation on monotonic behavior chains. RecSys, 2018.
* Apurva Pathak, Kshitiz Gupta, Julian McAuley. Generating and personalizing bundle recommendations on Steam. SIGIR, 2017.

### Methodology

We will adopt the OSEMiN data science workflow, which involves:

* Obtain (import the data)
* Scrub (conduct preprocessing)
* Explore (answer descriptive questions using EDA)
* Model (build our recommendation models)
* iNterpret (comment on our model and findings)


## Notebooks

Separate Jupyter Notebooks were used for the various parts of this project. Please find a summary below to assist in navigating them.

<a href = 'https://github.com/nadinezab/video-game-recs/blob/master/1-Data-Loading.ipynb'>1-Data-Loading</a>

In this Notebook, we retrieved the data and saved it as a JSON file. Whilst the data was said to be in JSON format, when running it through a JSON linter, we noticed that it was not proper JSON due to the use of single quotes instead of double quotes.

To remedy this, we explored options such as replacing the single quotes with double quotes. However, this led to issues where the name of a game included an apostrophe, e.g. `Assassin's Creed`.

A workable solution was to parse the data with the `ast.literal_eval()` function, which safely evaluates Python literals and therefore handles single-quoted strings.
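
As a minimal sketch of this approach, the snippet below parses single-quoted records with `ast.literal_eval()` and re-serialises them as valid JSON. The helper name `parse_lines`, the one-record-per-line layout and the sample record are assumptions made for illustration, not the exact code from the Notebook.

```python
import ast
import json


def parse_lines(lines):
    """Parse Python-literal records (assumed one per line) into dictionaries."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # literal_eval only evaluates literals (dicts, lists, strings, numbers),
        # so it copes with single quotes and embedded apostrophes
        records.append(ast.literal_eval(line))
    return records


# Hypothetical sample record; the field names are illustrative assumptions
sample = [
    "{'user_id': 'u1', 'items': [{'item_id': '10', 'item_name': \"Assassin's Creed\"}]}",
]

records = parse_lines(sample)

# json.dumps writes double-quoted, valid JSON that a linter will accept
valid_json = json.dumps(records)
```

A naive replacement of single quotes with double quotes would break the apostrophe in the game name, whereas here the name survives the round trip unchanged.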

<a href = 'https://github.com/nadinezab/video-game-recs/blob/master/2-Data-Preprocessing.ipynb'>2-Data-Preprocessing</a>

In this Notebook, we conducted initial preprocessing of the two JSON files we obtained in the previous Notebook. 

We created the relevant Pandas DataFrames and saved them as CSV files for subsequent exploration and modelling. In particular, we extracted user ids and game ids to create a user-item interactions DataFrame, with each row representing a particular user-item relationship (namely, an owned item).
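
The flattening step can be sketched as follows. To keep the example dependency-free, it uses plain Python and the `csv` module rather than Pandas, so it illustrates the idea and is not the Notebook's own code. The function name `flatten_interactions` and the record layout (`user_id`, `items`, `item_id`) are assumptions.

```python
import csv
import io


def flatten_interactions(users):
    """Return one {'user_id', 'item_id'} row per owned item."""
    rows = []
    for user in users:
        # Users with an empty library contribute no rows
        for item in user.get("items", []):
            rows.append({"user_id": user["user_id"], "item_id": item["item_id"]})
    return rows


# Hypothetical parsed records
users = [
    {"user_id": "u1", "items": [{"item_id": "10"}, {"item_id": "20"}]},
    {"user_id": "u2", "items": [{"item_id": "20"}]},
    {"user_id": "u3", "items": []},
]

rows = flatten_interactions(users)

# Write the interactions as CSV text (an in-memory buffer stands in for a file)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["user_id", "item_id"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Each row in `rows` corresponds to one owned game, which is the shape needed later to build an interactions matrix.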

<a href = 'https://github.com/nadinezab/video-game-recs/blob/master/3-Data-Exploration.ipynb'>3-Data-Exploration</a>

In this Notebook, we explored the data to gain insights. We provided our stakeholders with business recommendations based on the data.

We explored the following topics:
* game release dates - what time is most popular?
* game library size - how many games do most users own?
* game genres - which are the most popular genres?
* game publishers - who are the top publishers?

<a href = 'https://github.com/nadinezab/video-game-recs/blob/master/4-Modelling.ipynb'>4-Modelling</a>

In this Notebook, we undertook the modelling tasks.

We started with additional preprocessing to create sparse train and test matrices, before implementing a user recommendation model using LightFM.
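
The preprocessing idea can be sketched in plain Python: map ids to consecutive indices, split the interactions at random, and express each split as `(row, column, value)` triples, which is the coordinate form a sparse matrix is typically built from. The helper names, the 20% test fraction and the seed are assumptions for illustration; the Notebook's own code may differ.

```python
import random


def build_index(values):
    """Map each distinct id to a consecutive integer index."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}


def split_interactions(pairs, test_fraction=0.2, seed=42):
    """Shuffle (user, item) pairs and split them into train and test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def to_coordinates(pairs, user_index, item_index):
    """Convert id pairs to (row, col, value) triples; 1 marks an owned game."""
    return [(user_index[u], item_index[i], 1) for u, i in pairs]


# Hypothetical interactions
pairs = [("u1", "10"), ("u1", "20"), ("u2", "20"), ("u2", "30"), ("u3", "10")]

# Build the indices from all interactions so train and test share one shape
user_index = build_index(u for u, _ in pairs)
item_index = build_index(i for _, i in pairs)

train_pairs, test_pairs = split_interactions(pairs)
train = to_coordinates(train_pairs, user_index, item_index)
test = to_coordinates(test_pairs, user_index, item_index)
```

Building the indices before splitting keeps both matrices the same shape, so a user or game that only appears in the test split still has a valid row or column.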

We also explored and visualized the embedding space using t-SNE.

Finally, we built an item-to-item recommendation engine.

## Results and Recommendations

### Marketing recommendations

We investigated when games were released and show below the number of games released per month.

![image.png](attachment:image.png)

We saw that October, November and December have the highest number of game releases. 
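
A count of this kind can be sketched with the standard library. The function name `releases_per_month`, the `YYYY-MM-DD` date format and the sample dates are assumptions made for illustration; entries that do not parse are skipped.

```python
from collections import Counter
from datetime import datetime


def releases_per_month(release_dates, date_format="%Y-%m-%d"):
    """Count releases per calendar month (1-12), skipping unparseable dates."""
    counts = Counter()
    for raw in release_dates:
        try:
            month = datetime.strptime(raw, date_format).month
        except (TypeError, ValueError):
            continue  # missing or malformed release date
        counts[month] += 1
    return counts


# Hypothetical release dates, including two unusable entries
dates = ["2017-10-05", "2016-11-20", "2017-11-02", "2015-03-14", "Coming soon", None]
counts = releases_per_month(dates)
busiest_month, n_releases = counts.most_common(1)[0]
```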

**Recommendation:**

We would recommend ensuring advertisement deals are priced at a premium during this period.


We investigated how many games each user owns and show the distribution below.

![image.png](attachment:image.png)

We saw that the average number of games owned is 58.

**Recommendation:** 
Run a focussed campaign targeting users who own fewer than the average of 58 games. These users are more likely to find appealing games that they do not yet own.

Finally, we looked at the price of games and show below the distribution of this feature.

![image.png](attachment:image.png)

We saw that 75% of games are below $10. 

**Recommendation**:

Focus on sales volume and on game bundles, which increase the value of a single transaction. It would be worth investigating bundles further to see which games to group together to maximise sales. A delicate balance is needed: the user should not have been willing to buy all the games individually (at a higher total cost), but should still consider the bundle worthwhile.

### User recommendations

We created a function which takes in a user id and returns `k` recommended items.
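
The logic of such a function can be sketched as below. This is a plain-Python illustration of scoring with latent vectors and does not call LightFM; the function name `recommend_for_user`, the toy embeddings and the game ids are all assumptions. Each candidate game is scored by the dot product of the user and item vectors, games already owned are excluded, and the top `k` are returned.

```python
def recommend_for_user(user_id, user_vectors, item_vectors, owned, k=5):
    """Return the k highest-scoring items that the user does not already own."""
    user_vec = user_vectors[user_id]
    already_owned = owned.get(user_id, set())
    # Score each candidate by the dot product of user and item vectors
    scores = {
        item_id: sum(u * v for u, v in zip(user_vec, item_vec))
        for item_id, item_vec in item_vectors.items()
        if item_id not in already_owned
    }
    # Highest score first
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical two-dimensional embeddings
user_vectors = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
item_vectors = {
    "strategy_a": [0.9, 0.1],
    "strategy_b": [0.8, 0.2],
    "shooter_a": [0.1, 0.9],
    "shooter_b": [0.2, 0.7],
}
owned = {"u1": {"strategy_a"}, "u2": set()}

top_u1 = recommend_for_user("u1", user_vectors, item_vectors, owned, k=2)
top_u2 = recommend_for_user("u2", user_vectors, item_vectors, owned, k=2)
```

Excluding owned games matters in practice, since otherwise the highest-scoring items tend to be ones the user has already bought.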

Once integrated with the Steam interface, this can be used directly to display recommended games on a user's homepage. We would need to conduct A/B testing to see if it outperforms Steam's existing content-based recommender. Based on domain knowledge, the recommendations appear sensible.

Sample output:

![Screenshot%202020-10-30%20at%2011.46.19.png](attachment:Screenshot%202020-10-30%20at%2011.46.19.png)

![Screenshot%202020-10-30%20at%2011.46.40.png](attachment:Screenshot%202020-10-30%20at%2011.46.40.png)

### Item recommendations

We created a function which takes in a game id and returns the `k` most similar games, using cosine similarity between game embeddings as the similarity measure.
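
A minimal sketch of this lookup is shown below, assuming each game has an embedding vector. The names `cosine_similarity` and `similar_items`, and the toy vectors, are illustrative assumptions and not the Notebook's own code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_items(item_id, item_vectors, k=5):
    """Return the k items whose vectors are most similar to the given item."""
    target = item_vectors[item_id]
    scores = {
        other: cosine_similarity(target, vec)
        for other, vec in item_vectors.items()
        if other != item_id  # a game should not recommend itself
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical two-dimensional game embeddings
item_vectors = {
    "game_a": [1.0, 0.1],
    "game_b": [0.9, 0.2],
    "game_c": [0.1, 1.0],
    "game_d": [-1.0, 0.0],
}

closest = similar_items("game_a", item_vectors, k=2)
```

Cosine similarity compares direction only, not magnitude, so a game with a large embedding norm does not automatically dominate the results.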

Once integrated with the Steam interface, this can be used directly to display similar games on a given game's page. We would need to conduct A/B testing to see if it outperforms the existing similar games recommender. Based on domain knowledge, the recommendations appear sensible.

Sample output:

![item-output.png](attachment:item-output.png)

### Embedding Space

We explored the embedding space of games and used t-SNE to visualise it in 2 dimensions.

![image.png](attachment:image.png)

We saw that games we expected to be similar may not be. For example, RollerCoaster Tycoon World sits apart from the other RollerCoaster Tycoon games.

![image.png](attachment:image.png)

These types of insights help us understand games better and ensure that the right games are recommended. The visualisation supports the `item-to-item` recommendations. Further work would be to explore where genres/tags sit and, in particular, to see if any pairs of tags stand out. We also note the pocket of games in the top middle, which appear to be different from the others.

## Future Work

The first phase of building a recommendation engine using collaborative filtering is complete. Here are the proposed next steps:
* Further tweak model hyperparameters and run over a larger number of epochs
* Integrate with Steam's API to generate recommendations for existing users
* Conduct thorough A/B testing to compare the collaborative filtering model with Steam's current recommendation engine
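
One way such an A/B comparison could be evaluated is a two-proportion z-test on a conversion metric. This is a sketch under assumptions: the function name, the choice of conversion rate as the metric, and the sample counts are hypothetical, and other tests or metrics may suit the real experiment better.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF via the error function
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical counts: A is the current recommender, B the collaborative filtering model
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
```

A small p-value would suggest the difference in conversion rate is unlikely to be due to chance alone, although sample size and test duration should be fixed before the experiment starts.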

We could still look to tweak our model by undertaking the following:
* Accounting for game genre/tags and building a hybrid model
* Accounting for hours played
* Accounting for thumbs up/down ratings

From a data exploration point of view, we would also be interested in side projects involving:
* Exploring user reviews
* Further investigating the embedding space