## Create complete game info file for recommender app, including SVD features

### Springboard Capstone 2 project: building a recommendation engine
### John Burt


### Purpose of this notebook:

Generate a game data file for my Item Search by Nearest Neighbors (ISNN) recommender model. This model uses a "game coordinate space" to find games similar to the ones a user supplies as "liked games". To build the coordinate space, I take the ALS-filled game x user ratings matrix generated in another notebook and apply PCA along the user dimension, which reduces each game's ratings to a small set of features that serve as its coordinates.

The output is a data file containing game metadata and SVD feature-space coordinates. This file will be deployed with the recommender app.


#### The method:

- Read the game metadata file and the ALS-filled ratings matrix.
- Compute SVD (using sklearn PCA) along the user axis to create game coordinates.
- Combine game metadata and game coordinates into one dataframe and save it in HDF5 format.
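
The reduction step above can be illustrated on a toy matrix. This is a minimal sketch, not the notebook's actual pipeline: the random `toy_ratings` matrix and its dimensions are made up for illustration. It only shows that `PCA(whiten=True)` returns one row of coordinates per game, with each feature scaled to unit variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# hypothetical stand-in for the ALS-filled matrix: 50 games x 200 users
toy_ratings = rng.uniform(1, 10, size=(50, 200))

# reduce the user axis to a handful of features per game
n_dims = 5
toy_features = PCA(n_components=n_dims, whiten=True).fit_transform(toy_ratings)

# one row per game, one column per feature
feature_shape = toy_features.shape

# with whiten=True, each feature has unit sample variance on the fitted data
feature_std = toy_features.std(axis=0, ddof=1)

print(feature_shape)
print(feature_std)
```

Because whitening puts every feature on the same scale, no single component dominates the Euclidean distances used later in the neighbor search.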




## Load data



In [1]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

%matplotlib inline
import pandas as pd
pd.options.display.max_columns = 100
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np
from datetime import datetime

datadir = './data/'

n_SVD_dims = 85

# load game info file
games = pd.read_csv(datadir+'bgg_game_info.csv')

# load the ALS filled item x user rating matrix
# ratings = pd.read_csv(datadir+'mx_items_filled_minr=10.csv')

# inputfile = 'bgg_game_mx_filled.h5' # bad
inputfile = 'bgg_game_mx_filled_v2.h5' # good

# outputfile = "bgg_game_data_big.h5" # bad
outputfile = "bgg_game_data_big_v2.h5" # good

# ratings = pd.read_hdf(datadir+'bgg_game_mx_filled.h5', 'mx')
ratings = pd.read_hdf(datadir+inputfile, 'mx')

# move gameID out of the index and back into a regular column
ratings.reset_index(inplace=True)

## Reduce user axis using SVD

The output is a matrix of features that can be used as "game coordinates" in a nearest-neighbor search.
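
To show how such coordinates might be used, here is a minimal sketch of a nearest-neighbor lookup with scikit-learn's `NearestNeighbors`. The random `coords`, the `liked` row indices, and the choice to query with the mean of the liked games' coordinates are all illustrative assumptions; the actual ISNN model may combine liked games differently.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)

# hypothetical coordinates: 100 games x 8 features
coords = rng.normal(size=(100, 8))

# build the search index over all games
nn = NearestNeighbors(n_neighbors=10).fit(coords)

# assumed query strategy: average the coordinates of the liked games
liked = [3, 17, 42]
query = coords[liked].mean(axis=0, keepdims=True)

# find the closest games, then drop the ones the user already listed
dist, idx = nn.kneighbors(query)
recommended = [int(i) for i in idx[0] if int(i) not in liked][:5]

print(recommended)
```

Requesting more neighbors than needed (10 here) leaves room to filter out the liked games and still return five suggestions.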


In [None]:
from sklearn.decomposition import TruncatedSVD, PCA

# first, select only games that are present in both the game info and ratings datasets
gameids = list(set(games['id']).intersection(set(ratings['gameID'])))

# index both frames by game id and select the intersecting ids,
#  so that rows are in the same order
sel_games = games.set_index('id').loc[gameids]
sel_ratings = ratings.set_index('gameID').loc[gameids]

# next, do SVD in n_SVD_dims dims, for rec model features
# features = TruncatedSVD(n_components=n_SVD_dims).fit_transform(sel_ratings.values)
# Note: I now use PCA, which centers the data before performing SVD;
#  whiten=True also scales each output feature to unit variance
features = PCA(n_components=n_SVD_dims, whiten=True).fit_transform(sel_ratings.values)
feature_cols = ['f_%d'%(i) for i in range(n_SVD_dims)]

out_games = pd.concat([sel_games, 
           pd.DataFrame(features, index=sel_games.index, columns=feature_cols)],
          axis=1).sort_index().reset_index()
 
# out_games.to_csv(
#     datadir+"bgg_game_data.csv", index=False, encoding="utf-8")
out_games.to_hdf(
    datadir+outputfile, key='gamedata', index=False, encoding="utf-8")

out_games.head()