# Recommendation Engines

(This is for the Applied Data Science Group November/December 2017 session.)

This notebook tries to build a recommendation engine, of the kind an e-commerce site would use to recommend other items to you.  Matt Borthwick scraped the data from user reviews at boardgamegeek.com.  This is an initial run-through to check the quality of the data: I'll check that the dataset seems sane and look at the shape of the distributions.

## Possible questions:

I did a similar brainstorming exercise (without looking at the data) to what we did in the first week:

### Exploratory questions

- What is the most popular game?
  - Which has the highest average rating?
  - Which has the most reviews?
  - Same for lowest rating, fewest reviews

- What is the most divisive game?
  (Greatest spread in review scores)

- Data quality: NA, None, NaN
  - Number of reviews per user?
  - Number of reviews per game?
  - Check scale of review scores
  - Check distributions of scores

### Analysis/Modelling questions

- Recommend new games based on similarities with others' interests.
  Build a clustering algorithm based on scores in games.
  - Assign each user a vector in Ngame-dim space.
  - Find users with similar vectors, based on dot-product.  (K-means or some other clustering?)
  - Remove games that are already reviewed, or with negative scores.
  - Recommend the remaining game with the highest score.

- User analysis:
  Are there multiple audiences here? "Hardcore" vs "casual", to use the gamer terms.
  - How many 1-review users are there? What games do they try out?
  - What games do users with multiple reviews enjoy?

- Scoring: How will we score/test our recommendations?
  - Some sort of cross-validation where we keep a game's scores back,
    and try to predict how reviewers will score it, based on their other reviews?
  - Is a naive test/train split worthwhile/valid?
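
One way to set up the scoring idea above is to hide a random fraction of the known ratings and measure the root-mean-square error on just those entries.  This is a minimal sketch on a made-up ratings table, not the split used later in the notebook; `holdout_split`, `rmse_on_holdout` and the game-mean baseline are illustrative choices.

```python
import numpy as np
import pandas as pd

def holdout_split(ratings, frac=0.2, seed=0):
    """Hide a random fraction of the observed ratings (NaN = not rated)."""
    rng = np.random.default_rng(seed)
    observed = ratings.notna().values
    hide = observed & (rng.random(ratings.shape) < frac)
    train = ratings.mask(hide)      # hidden entries become NaN
    test = ratings.where(hide)      # only the hidden entries survive
    return train, test

def rmse_on_holdout(pred, test):
    """RMSE over the held-out entries only (NaN entries are ignored)."""
    err = (pred - test).values
    return np.sqrt(np.nanmean(err ** 2))

# toy data: rows are users, columns are games (hypothetical values)
toy = pd.DataFrame(
    [[8, 7, np.nan, 6], [7, np.nan, 5, 6], [9, 8, 4, np.nan], [np.nan, 7, 5, 7]],
    index=['u1', 'u2', 'u3', 'u4'], columns=['g1', 'g2', 'g3', 'g4'], dtype=float)

train, test = holdout_split(toy, frac=0.3, seed=1)

# baseline: predict every rating with that game's mean in the training data,
# falling back to the global training mean if a game has no ratings left
game_means = train.mean(axis=0).fillna(np.nanmean(train.values))
pred = pd.DataFrame(np.tile(game_means.values, (len(toy), 1)),
                    index=toy.index, columns=toy.columns)
print(rmse_on_holdout(pred, test))
```

Any model that cannot beat the game-mean baseline on the hidden entries is probably not learning anything about individual tastes.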

Handling sparsity:
- Use a global function to estimate missing values.  Treat them as the average user.
- Use TF-IDF?  Not just most frequent, but ratio of frequency to number of users.

Latent factor analysis:
- Collaborative filter.
- Decompose the matrix into 2 matrices: user features vs game features.
- If S_{i,j} is the matrix element for user i's score of game j, then
  decompose S=UW, where U is N_{user} x N_{hidden}, and W is N_{hidden} x N_{game}.  (This is similar to training word-vectors in natural language processing.)
- Train on data.

- Content-based filtering  (genre tastes)
- User-based filter  (users similar)
- Item-item collaborative filtering  (game similarities)

Try: training on "elite" users to define the clusters?

Do text analysis on game titles for similarities? (Useful for marketing a game)

Associative rule mining

Pattern exploration: Which games did people rate at all?

In [None]:
#Simple measures?

How to impute missing data?  Average score? (Laplace smoothing from spam?)
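
A possible answer to the imputation question, in the spirit of Laplace smoothing: shrink each game's mean towards the global mean, as if every game had `k` extra ratings at the global mean.  This is only a sketch; the function name `smoothed_game_means` and the pseudo-count `k` are assumptions made here, and the numbers are invented.

```python
import numpy as np
import pandas as pd

def smoothed_game_means(ratings, k=5.0):
    """Per-game mean shrunk towards the global mean.

    Acts as if every game had k extra ratings equal to the global mean,
    so games with few ratings stay close to the global mean.
    """
    global_mean = np.nanmean(ratings.values)
    n = ratings.notna().sum(axis=0)          # number of ratings per game
    total = ratings.sum(axis=0)              # NaN entries are skipped
    return (total + k * global_mean) / (n + k)

# made-up example: one well-reviewed game, one with a single glowing review
toy = pd.DataFrame({'popular': [8.0, 7.0, 9.0, 8.0],
                    'obscure': [10.0, np.nan, np.nan, np.nan]})
means = smoothed_game_means(toy, k=2.0)

# fill each missing entry with the smoothed mean of its game
filled = toy.fillna(means)
print(means)
```

The single 10 for the obscure game gets pulled well below 10, which is the behaviour one wants before trusting it as an imputed value.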



I aim to try k-means clustering, and the latent factor analysis approach.
I will also try some straightforward collaborative filtering with similarities based on user/game vectors.  I also thought about trying to analyze what games are preferred by reviewers with only a few reviews (n<5) vs many reviews (n>50).

Need to think through the optimization criteria, and include appropriate regularization to avoid overfitting.  Maybe use mean-square-error on game scores for games the user has actually reviewed?
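
One common way to write such a criterion, using the S=UW decomposition from the brainstorm above.  The symbols here are notation introduced for this sketch: Ω is the set of (user, game) pairs that actually have a rating, and λ is a regularization weight that would need tuning.

$$
L(U, W) = \sum_{(i,j)\in\Omega} \Big( S_{i,j} - \sum_{k=1}^{N_{hidden}} U_{i,k} W_{k,j} \Big)^2 + \lambda \left( \lVert U \rVert_F^2 + \lVert W \rVert_F^2 \right)
$$

The first term is the squared error on reviewed games only; the second penalizes large entries in U and W to discourage overfitting.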

In [1]:
#standard library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

#makes larger plots

#save graphics as pdf too (for less revolting exported plots)
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('png', 'pdf')

In [2]:
#read in the data.  (13MB or so)
#(N.B. I put Matt's header on its own line, which is skipped, and added the UserID)
#initial playing data
#df=pd.read_csv('data/boardgame-ratings.csv',skiprows=1)
#frequent users

df=pd.read_csv('data/boardgame-frequent-users.csv',skiprows=1)

#full matrix (2E5 users, 400 games)
#df=pd.read_csv('data/boardgame-users.csv',skiprows=1)
df.columns=('userID','gameID','rating')

In [3]:
#Matt made a csv file of ids and names  Load into dataframe, put into dict.
name_df=pd.read_csv('data/boardgame-titles.csv',index_col=0)
name_dict=name_df.to_dict()

## Exploratory Analysis

I'm going to do a few things:
- check for NaN/missing values.
- check the scores look right
- check the numbers of reviews, and games.
- match up the names with the unique gameIDs (I'll find some missing entries here)
- plot the number of reviews/user and reviews/game.
- check for duplicates

In [4]:
#test for NaN
nan_array=np.isnan(df.values)
print('Number of NaN',np.sum(nan_array))
#check scale of review scores.
print('Min/max scores',df['rating'].min(),df['rating'].max())

Number of NaN 0
Min/max scores 1.0 10.0


In [5]:
#How many users, how many games?
#Find the unique entries in each list
users=df['userID'].unique()
games=df['gameID'].unique()

In [6]:
print('Number of unique users is:',len(users))
print('Number of unique games is:',len(games))
print('Total number of reviews is:',len(df))

Number of unique users is: 2473
Number of unique games is: 402
Total number of reviews is: 528871


In [7]:
#fraction of the user x game matrix that is filled (the density); sparsity is 1 minus this
density = len(df)/(len(users)*len(games))
print(density)

0.5319852416043519


In [8]:
#check for duplicates
dup=df.duplicated()
df_dup=df[dup]
print('Number of duplicates: ',np.sum(dup))

Number of duplicates:  0


### Number of reviews/user and reviews/game

In [9]:
avg_num_reviews=len(df)/len(users)
print(avg_num_reviews)

213.85806712494946


So on average, each user reviews about 214 games.  Let's try to build a histogram of users with a given number of reviews.  (and then the same with games)

In [11]:
#However, this version took a few seconds.
user_review_counts=df.groupby(['userID']).count()
#note that there really are users with ids going from 1 to 1000, it's not a screwup.


In [10]:
#find the counts of reviews for each game.
game_review_counts=df.groupby(['gameID']).count()
#make a list matching up gameIDs and names.  Use that list as a new index
new_index=[name_dict['title'][ind] for ind in game_review_counts.index]

game_review_counts.index=new_index

In [12]:
#plt.figure(figsize=(12,9))
plt.figure()
plt.hist(user_review_counts.iloc[:,0].values,log=True)
plt.xlabel('Number of Reviews')
plt.ylabel('Number of Users')
plt.title('Reviewer distribution: Number of reviews per user')
plt.show()

<matplotlib.figure.Figure at 0x7efbc5ff7240>

So this is a really long-tailed distribution.  It might be nice to look at this histogram on a log-x scale.  
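
A sketch of how the log-x version could be built: use logarithmically spaced bin edges, so each bin covers the same width on a log axis.  The review counts below are randomly generated stand-ins for the real per-user counts, and `log_binned_counts` is a name made up for this example.

```python
import numpy as np

def log_binned_counts(n_reviews, nbins=20):
    """Histogram review counts using logarithmically spaced bin edges."""
    n_reviews = np.asarray(n_reviews, dtype=float)
    lo, hi = n_reviews.min(), n_reviews.max()
    edges = np.logspace(np.log10(lo), np.log10(hi), nbins + 1)
    # pin the end points so rounding cannot push the extremes out of range
    edges[0], edges[-1] = lo, hi
    hist, edges = np.histogram(n_reviews, bins=edges)
    return hist, edges

# made-up, long-tailed review counts (all at least 1)
rng = np.random.default_rng(0)
fake_counts = np.floor(rng.pareto(1.5, size=1000) * 50 + 1)
hist, edges = log_binned_counts(fake_counts, nbins=10)
print(hist.sum(), len(edges))
```

With matplotlib, the same `edges` can be passed as `bins=edges` to `plt.hist`, followed by `plt.xscale('log')`.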

In [13]:
plt.figure(figsize=(12,9))
game_review_counts.iloc[:,1].plot(kind='bar')
plt.ylabel('Number of reviews')
plt.title('Number of reviews per game')
plt.show()

<matplotlib.figure.Figure at 0x7efbc4bc9e48>

In [None]:
#make a histogram of number of games with a given number of reviews



Let's also try to look at the distributions of scores.  I'll try to make a box-plot.
That will let me check the distributions in an easy manner.
I'll pivot the data frame to make rows users, columns be games, with the entries given by the score. 

## Boxplots and Transforming the data

Rearranging the data to use the gameIDs as columns would make sense for recommendation.
For this data set, with 402 games, that should be no problem.  (What is best to do with thousands of entries is another question.)
This would also make it easier to look at histograms on a per-game basis.
I'm nigh certain pandas has a reshape function to do exactly this.  Pivot maybe?
(http://pandas.pydata.org/pandas-docs/stable/reshaping.html)

In [14]:
#make a small dataframe for debugging purposes
#df_small=df.iloc[0:1000]
#make a dense dataframe
df_pivot=df.pivot(index='userID',columns='gameID',values='rating')
df_pivot=df_pivot.rename(columns=name_dict['title'])
df_pivot.head()

gameID  Samurai  Acquire  Elfenland  Bohnanza   Ra  Catan  RoboRally  \
userID                                                                 
83          NaN      7.0        NaN       NaN  8.0    8.0        8.0   
119         7.0      7.0        NaN       8.0  8.0    7.0        NaN   
144         NaN      NaN        NaN       7.0  NaN    6.0        6.0   
156         7.5      6.5        NaN       7.0  8.0    4.0        7.0   
186         7.0      NaN        NaN       6.0  8.0    7.0        NaN   

gameID  Can't Stop  Tigris & Euphrates  Liar's Dice        ...          \
userID                                                     ...           
83             NaN                 8.0          7.0        ...           
119            6.0                 7.4          7.0        ...           
144            7.0                 NaN          NaN        ...           
156            NaN                 8.0          NaN        ...           
186            8.0                 8.0          NaN

In [51]:
#df_pivot.to_csv('data/boardgame-ratings-pivot.gz',compression='gzip')
#?df.boxplot

In [15]:
plt.figure()
#keep the returned axes in its own name, so game_review_counts is not overwritten
ax=df_pivot.boxplot(rot=90,grid=False)
plt.title('Score distributions by title')
plt.ylabel('Rating')
plt.show()

<matplotlib.figure.Figure at 0x7efbc4e35dd8>

These mostly look positive.  Not any radically skewed distributions, like all 1 or all 10.  

(I'll imitate a plot I saw the more experienced folk do at the first finance-data meetup)
Try a correlation map based on columns to see how close the score distributions are.
I think this intuitively corresponds to: How much are the score distributions in one game similar to another?
Running across the rows would yield something analogous for users (but would take an age, since that is a 1E5 x 1E5 matrix).
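
For reference, a small sketch of what the column-wise correlation does with missing ratings.  `DataFrame.corr` uses, for each pair of columns, only the rows where both are present, and its `min_periods` argument blanks out pairs with too few users in common.  The table is invented for illustration.

```python
import numpy as np
import pandas as pd

# toy ratings: rows are users, columns are games (made-up numbers)
toy = pd.DataFrame({'A': [8.0, 6.0, 7.0, np.nan, 9.0],
                    'B': [7.0, 5.0, 6.0, 4.0, 8.0],
                    'C': [np.nan, np.nan, 3.0, 9.0, np.nan]})

# game-game correlations, NaN where fewer than 3 users rated both games
game_corr = toy.corr(min_periods=3)

# user-user correlations: the same call on the transposed table
user_corr = toy.T.corr(min_periods=2)

# number of users in common for every pair of games
rated = toy.notna().astype(int)
n_common = rated.T.dot(rated)

print(game_corr)
print(n_common)
```

Checking `n_common` alongside the correlations shows which entries rest on only a handful of shared reviewers.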


In [24]:
corr_mat=df_pivot.corr()

In [42]:
plt.figure(figsize=(10,10))
plt.imshow(corr_mat)
plt.colorbar()
plt.show()

<matplotlib.figure.Figure at 0x7efbc42e5f60>

As for building a dataset for recommendation engines, is the low correlation worrisome?  A high correlation would imply that everyone likes the same games, in which case there is no space for a skillful recommendation.  These are correlations averaged over all users, rather than user-wise correlations.

The low correlation might also be an artifact of lots of reviewers with only a single review.  Those entries will have little correlation with anyone else, and may artificially lower the correlations?  I also tried keeping only reviewers with more than a few scores - it did nothing to change the overall picture.

In [37]:
#def reduced_corr(df)
Nrow,Ncol=df_pivot.shape

rcorr = np.zeros((Ncol,Ncol))
Ncorr = np.zeros((Ncol,Ncol))
Nreviews = np.zeros(Ncol)
mu  = df_pivot.mean(axis=1)
mu_sort=mu.sort_values()
med = df_pivot.median(axis=1)
med_sort=med.sort_values()
std = df_pivot.std(axis=1)

#compute scaled dataframe
scaled = (df_pivot.subtract(mu,axis='index').div(std,axis='index')).values
#scaled = ((df_pivot-5.5)/10).values


In [38]:
scaled

array([[        nan, -0.31096383,         nan, ...,         nan,         nan,
                nan],
       [ 0.14001116,  0.14001116,         nan, ...,         nan,         nan,
                nan],
       [        nan,         nan,         nan, ...,  0.6698287 ,         nan,
         1.04578647],
       ..., 
       [ 2.13349807, -2.40634737,         nan, ...,         nan,         nan,
                nan],
       [-1.13239989, -0.42418073, -1.13239989, ...,         nan,         nan,
        -0.07007115],
       [ 1.42969969,  1.06068742,         nan, ...,         nan,         nan,
                nan]])

In [None]:
Nmax=Ncol
#compute correlations between entries, only where both games have been rated.
for i in range(Nmax):
    mski = ~np.isnan(scaled[:,i])
    for j in range(i,Nmax):
        mskj = ~np.isnan(scaled[:,j])
        msk_tot = mski & mskj
        x = scaled[msk_tot,i]
        y = scaled[msk_tot,j]
        Ncommon=np.sum(msk_tot)
        c= np.dot(x,y)/(Ncommon-1)
        rcorr[i,j]=c
        rcorr[j,i]=c
        Ncorr[i,j]=Ncommon
        Ncorr[j,i]=Ncommon

# #check that the correlation is measuring something like the dot-product between the distributions.
# x0=df_pivot.iloc[:,0].values
# x1=df_pivot.iloc[:,1].values

# x0_mu=np.nanmean(x0)
# x1_mu=np.nanmean(x1)


In [78]:
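#find the fraction of reviews within each integer score bin
def game_hist(df):
    """Return a (score x game) table: fraction of ratings at each rounded score 1..10."""
    rounded=df.round()
    #DataFrame.append was removed in pandas 2.0, so build the rows directly
    rows=[(rounded==i).sum(axis=0).values for i in range(1,11)]
    df_counts=pd.DataFrame(np.stack(rows),index=np.arange(1,11),columns=df.columns)
    #normalize each game's column to sum to 1
    df_counts=df_counts/df_counts.values.sum(axis=0)
    return df_counts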

#compute histograms for dot products.
def raw_hist(df,Nbins=10):
    """Fraction of each column's entries falling in Nbins equal bins on [-1,1)."""
    imin=-1
    imax=1
    edges=np.linspace(imin,imax,Nbins+1)
    #parentheses are needed here: & binds more tightly than the comparisons
    rows=[((df>=edges[i]) & (df<edges[i+1])).sum(axis=0).values for i in range(Nbins)]
    #index each row by the left edge of its bin
    df_counts=pd.DataFrame(np.stack(rows),index=edges[:-1],columns=df.columns)
    df_counts=df_counts/df_counts.values.sum(axis=0)
    return df_counts

def sort_dataframe_columns(df,method='mean'):
    """sort_dataframe_columns(df,method='mean')
    Sort a dataframe's columns based on the columns' mean, median,
    or standard deviation.
    Method can be "mean", "median", or "std"
    """
    if (method=='mean'):
        sort_series=df.mean(axis=0)
    elif (method=='median'):
        sort_series = df.median(axis=0)
    elif (method=='std'):
        sort_series = df.std(axis=0)
    else:
        raise ValueError('Method must be "mean", "median", or "std".')
    #order the columns by the chosen statistic (not by the global med_sort)
    sort_series=sort_series.sort_values()
    df=df.loc[:,sort_series.index]
    return df
                      


In [76]:
df_counts=game_hist(df_pivot)
df_counts_sorted=sort_dataframe_columns(df_counts)




In [77]:
?med.sort_values()

In [59]:
med_loc = df_counts.median(axis=0)
med_sort=med_loc.sort_values()
# #sort by median
df_counts.loc[:,med_sort.index]


   1960: The Making of the President  6 nimmt!  7 Wonders  7 Wonders: Cities  \
1                           0.004936  0.004685   0.003646           0.000000   
2                           0.004936  0.006024   0.002735           0.004155   
3                           0.006910  0.012048   0.008204           0.004155   
4                           0.029615  0.034137   0.021878           0.008310   
5                           0.052320  0.081660   0.037375           0.033241   

   7 Wonders: Leaders  A Feast for Odin  A Game of Thrones (first edition)  \
1            0.001931          0.005714                           0.008637   
2            0.003861          0.000000                           0.017274   
3            0.004826          0.009143                           0.031670   
4            0.019305          0.011429                           0.051823   
5            0.046332          0.020571                           0.081574   

   A Game of Thrones: The Board Game (Second Editi

Is there much similarity in the score distributions?  Not really a useful question to ask.
But the histogram plot is useful.

In [116]:
?plt.imshow()

In [20]:
plt.figure(figsize=(15,10))
plt.imshow(np.log(df_counts),aspect='auto')
plt.colorbar()
plt.ylabel('Score')
plt.xlabel('Games')
plt.show()

<matplotlib.figure.Figure at 0x7efbb67f4c18>

  


The above image is a plot of the score densities for all of the games.  I'm just trying to get a sense of what the score distributions look like.
Most of the games are scored within 6-8.

It might be interesting to identify games by their variance?  Which games is there a consensus on, and which games are divisive?
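
A minimal sketch of that idea: rank games by the standard deviation of their ratings, skipping games with too few ratings for the spread to mean much.  The table and the `min_reviews` threshold are made up for illustration.

```python
import numpy as np
import pandas as pd

def spread_by_game(ratings, min_reviews=3):
    """Standard deviation of ratings per game, most divisive first."""
    n = ratings.notna().sum(axis=0)
    spread = ratings.std(axis=0)
    # ignore games with too few ratings for a meaningful spread
    return spread[n >= min_reviews].sort_values(ascending=False)

# rows are users, columns are games (hypothetical values)
toy = pd.DataFrame({'consensus': [7.0, 7.5, 7.0, 7.5],
                    'divisive': [1.0, 10.0, 2.0, 9.0],
                    'rare': [3.0, np.nan, np.nan, 9.0]})
ranking = spread_by_game(toy)
print(ranking)
```

On the pivoted ratings table the same call would put the most divisive titles at the top and the consensus titles at the bottom.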

In [44]:
plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
plt.imshow(corr_mat)
plt.colorbar()
plt.title('Pandas Correlation')
plt.subplot(1,3,2)
plt.imshow(rcorr)
plt.colorbar()
plt.title('"Corrected" Correlation')
plt.subplot(1,3,3)
plt.imshow(Ncorr)
plt.colorbar()
plt.title('Number of Common Reviews')
plt.show()

<matplotlib.figure.Figure at 0x7efbbd902c50>

In [21]:
i=5
j=90
plt.scatter(df_pivot.iloc[:,i],df_pivot.iloc[:,j])

<matplotlib.figure.Figure at 0x7efbb67c1ef0>

<matplotlib.collections.PathCollection at 0x7efbb672cc18>

The average pair-wise correlations between users' opinions of a game are really low.
Maybe a cluster analysis might yield something non-trivial?
This says on average, it's hard to guess what a person will think (in terms of variations from the mean).

Looking over the numbers, the correlations seem to be based on 1000 ratings in common for any pair of games.

In [56]:
##make logical array for actual reviews.
#ntot=np.sum(df_pivot>0,axis=1)
##only keep those with more than 20 reviews.
#keep_msk=ntot>20
#df_pivot2=df_pivot[keep_msk]
#len(df_pivot2)

111

## Conclusions regarding state of data

- The number of reviews per user is skewed towards new folks (not unreasonable, given how few people can stick at something as time intensive as playing and reviewing board games).

- Looking at the box-plots, the scores seem fairly high, which tallies with what Matt said about picking popular games.  There doesn't seem to be anything obviously wrong with the distributions (e.g. all zeros, or all 10s).

- I think for analysis, it would be beneficial to reshape the dataframe/array, but that is probably best left to the participants, as is removing any data with few reviews.  I used "pivot" to transform the gameID column, into a new set of columns, while keeping the reviewers as rows.  This will make building feature vectors straightforward.

- I tried building up some histograms via looping, and it was indeed quite slow.  In contrast, the arcane, built-in functions (groupby) are super fast.  The smaller dataset should allow accessibility to new people, while the full dataset is quite manageable if you find the right set of functions.  I haven't done any actual machine-learning with this yet, so maybe I'll eat my words about "manageable".

- The correlation plots seem to show a small, positive correlation.  Is this even a sensible measure?  It's something like the overlap between the shapes of the ratings distributions.

## Splitting into training/test

I'm going to manually force a training/test split.  I'm going to randomly select 10% of users, and 10% of games.  To my mind any measure of similarity should be able to detect generic tastes, and be able to predict how well a  
Our goal is to recommend games people will like.  We can do this by holding back 

In [24]:
Ngames=len(games)
Nusers=len(users)

seed = 27128
#call the seeding function (assigning to it would silently replace it and seed nothing)
np.random.seed(seed)

#make a list of uniform random numbers (times appropriate lengths)
game_ix=np.random.random(size=Ngames)<0.1
user_ix=np.random.random(size=Nusers)<0.1

#keep test examples: new games on old users, and new users on old games.
df_game_test=df_pivot.iloc[~user_ix,game_ix].copy()
df_user_test=df_pivot.iloc[user_ix,~game_ix].copy()

#keep only the non-testing examples
df_train = df_pivot.iloc[~user_ix,~game_ix]
#We are then free to try predicting "new users" scores (given a way of decomposing new users)

The "game" test asks: for feature vectors trained on a set of users, can we get the correct ratings on held-out games those users rated?
The "user" test asks: given some "new" user, can we get the correct scores?

In [25]:
df_train.shape

(2234, 361)

In [29]:
df_counts=game_hist(df_train)
#df0=df_train.groupby(   (unfinished line, commented out so the cell runs)
plt.figure(figsize=(12,5))
plt.imshow(df_counts/np.sum(df_counts),aspect='auto')
plt.show()

<matplotlib.figure.Figure at 0x7efbb6411e10>

In [30]:
?df_train.sort

Object `df_train.sort` not found.


## Similarity vectors.

Let's now try to make vectors for each person and game, and measure the distance between people.
(I think this really needs to be centered.  Otherwise, there seems to be a really high left-over correlation between users' opinions.)
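
To illustrate why centering matters, here is a sketch of a user-user cosine similarity computed after subtracting each user's own mean rating.  It is an assumption-laden toy, not the notebook's exact scaling: missing ratings are set to zero after centering (i.e. "average for this user"), and `centered_cosine` is a name invented here.

```python
import numpy as np

def centered_cosine(ratings):
    """User-user cosine similarity after subtracting each user's mean.

    ratings: 2D array, rows are users, NaN marks a missing rating.
    Missing entries are set to 0 after centering, i.e. treated as
    'average for this user'.
    """
    mu = np.nanmean(ratings, axis=1, keepdims=True)
    centered = np.nan_to_num(ratings - mu, nan=0.0)
    norms = np.linalg.norm(centered, axis=1)
    norms[norms == 0] = 1.0     # avoid 0/0 for users with no spread
    unit = centered / norms[:, None]
    return unit @ unit.T

# three made-up users: the first two agree, the third has opposite tastes
toy = np.array([[9.0, 8.0, 2.0, np.nan],
                [7.0, 6.0, 1.0, 5.0],
                [2.0, 3.0, 9.0, np.nan]])
sim = centered_cosine(toy)
print(np.round(sim, 2))
```

Without the centering step all three users would look similar, because every raw rating vector points into the same all-positive corner of the space.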

In [405]:
df0=df_train.values
mu = df_train.mean(axis=1)
sd = df_train.std(axis=1)
#because broadcasting only goes so far.
#need to use direct ops to scale
scaled = df_train.sub(mu,axis=0).div(sd,axis=0)

#rescale so ratings 1..10 map to -0.45..0.45 (this overwrites the per-user scaling above)
scaled = (df_train-5.5)/10

#df2=(df2-mu)#/df_train.std(axis=1)
#df2 = df2.values

In [407]:
nan_msk=np.isnan(scaled)
scaled[nan_msk]=0

In [408]:
scaled.columns

Index(['Samurai', 'Acquire', 'Elfenland', 'Bohnanza', 'Ra', 'Catan',
       'RoboRally', 'Can't Stop', 'Tigris & Euphrates', 'Liar's Dice',
       ...
       'Viticulture Essential Edition', 'Star Wars: Rebellion',
       'Sushi Go Party!', 'Great Western Trail', 'Santorini',
       'Clank!: A Deck-Building Adventure', 'Kingdomino',
       'Mansions of Madness: Second Edition', 'Arkham Horror: The Card Game',
       'Mechs vs. Minions'],
      dtype='object', name='gameID', length=361)

In [324]:
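#cosine similarity between users: dot products divided by both vector norms
sim=np.matmul(scaled.values,scaled.values.T)
norms=np.sqrt(np.diag(sim))
df3=sim/np.outer(norms,norms)
#zero the diagonal (self-similarity) to try to see small changes
np.fill_diagonal(df3,0)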



In [318]:
df3
plt.figure(figsize=(8,8))
plt.imshow(df3)
plt.colorbar()
plt.show()

<matplotlib.figure.Figure at 0x7efbbdd1df98>

## Clustering

In [55]:
##Try some clustering to identify populations.
from sklearn.cluster import MiniBatchKMeans, KMeans

In [409]:
# #Initial attempt at clustering with raw review scores
# df_new=df_train.copy()
# #convert dataframe into 
# df_mat=df_new.values
# nan_msk=np.isnan(df_mat)
# df_mat[nan_msk]=0

#New attempt at clustering, using the scaled review scores (kept as a dataframe)
df_mat=scaled.copy()

In [410]:
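#Try KMeans (MiniBatchKMeans was recommended, and is an alternative for larger data)
Nclasses=20
#n_jobs was removed from KMeans in scikit-learn 1.0, so it is dropped here
km=KMeans(n_clusters=Nclasses)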
km.fit(df_mat.values)
ypred=km.predict(df_mat.values)
df_mat['Class']=ypred

In [411]:
#now to try visualizing
plt.figure(figsize=(15,10))
Ncol=4
Nrows=Nclasses//Ncol+1
for i in range(Nclasses):
    plt.subplot(Nrows,Ncol,i+1)
    #determine which rows are in a given class.
    msk=df_mat['Class']==i
    #then plot the scores for each class (from unscaled data)
    d0=game_hist(df_train[msk])
    plt.imshow(d0,aspect='auto')

plt.show()

<matplotlib.figure.Figure at 0x7efbbc052b00>

In [339]:
?KMeans

So this is once again complete trash.

How else to visualize these?  Try plotting the means within a given cluster.  
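
A sketch of the cluster-mean idea: group the ratings table by cluster label and take the mean rating per game.  The table and the labels below are invented; in the notebook the labels would come from the fitted k-means model.

```python
import numpy as np
import pandas as pd

def cluster_profiles(ratings, labels):
    """Mean rating of every game within each cluster (NaN ratings skipped)."""
    labels = pd.Series(labels, index=ratings.index, name='cluster')
    return ratings.groupby(labels).mean()

# rows are users, columns are games (hypothetical values)
toy = pd.DataFrame({'heavy_euro': [9.0, 8.0, 3.0, np.nan],
                    'party_game': [4.0, np.nan, 9.0, 8.0]},
                   index=['u1', 'u2', 'u3', 'u4'])
labels = [0, 0, 1, 1]          # hypothetical cluster assignments
profiles = cluster_profiles(toy, labels)

# the difference from the overall game means highlights what each cluster prefers
print(profiles - toy.mean(axis=0))
```

Each row of `profiles` can then be plotted as a line or a heat-map row, which is easier to read than twenty separate score histograms.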

In [347]:
cent=km.cluster_centers_
inert=km.inertia_

inert


363146.62347590784

In [None]:
## Principal Components Analysis

Let's try a PCA on this.  

In [453]:
from sklearn.decomposition import PCA

df_new=scaled.copy()
#convert dataframe into a numpy array
df_mat=df_new.values
nan_msk=np.isnan(df_mat)
df_mat[nan_msk]=0

pca = PCA(n_components=10)
pca.fit(df_mat)


PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [454]:
#let's look at the amount of explained variance.
plt.figure()
plt.plot(pca.explained_variance_,'-x')
plt.axis([0,20,0,.6])
plt.show()

<matplotlib.figure.Figure at 0x7efbc4e1ca90>

In [447]:
?pca

In [450]:
#try visualizing the components
pca_comp = pca.components_

plt.figure()
for i in range(5):
    plt.plot(pca_comp[i,:])

plt.show()


<matplotlib.figure.Figure at 0x7efbc4e5bef0>

In [451]:
#plot the spread (standard deviation) of each user's scaled ratings.
plt.hist(np.std(df_mat,axis=1))
plt.show()

<matplotlib.figure.Figure at 0x7efbc4e549b0>

As a first pass, let's try 4 principal components.  I'd like to visualize these components.
Each component is a vector with 361 entries, one per game.  These components could be said to correspond to taste profiles: how much did each group like a given game?
Then decompose each user into a superposition of the groups.
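
To make the "superposition" concrete, here is a sketch of PCA done by hand with an SVD in NumPy, rather than with scikit-learn.  The data is random noise standing in for the filled user x game matrix, and `pca_project` is a name made up for this example.

```python
import numpy as np

def pca_project(X, n_components):
    """Project rows of X onto the leading principal components (via SVD)."""
    mean = X.mean(axis=0)
    Xc = X - mean                      # PCA works on column-centered data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]     # shape (n_components, n_features)
    weights = Xc @ components.T        # each user's weight on each component
    return weights, components, mean

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))           # stand-in for the user x game matrix
weights, components, mean = pca_project(X, n_components=6)

# with all components kept, the superposition reproduces the data exactly
X_back = weights @ components + mean
print(np.allclose(X, X_back))
```

Note the centering step: scikit-learn's `pca.transform` also subtracts the fitted column means before projecting, whereas a plain dot product with `components_` does not, so the two give weights that differ by a constant offset per component.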

In [455]:
df_decomp=np.dot(df_mat,pca_comp.T)

In [456]:
df_decomp

array([[-2.18892319, -0.02312199,  0.14230487, ..., -0.21633113,  0.06926489,
        -0.33280267],
       [-1.33620348, -0.23943944,  0.74373358, ...,  0.07377366,  0.10761448,
         0.00955454],
       [-1.89127898, -0.09527716,  0.22349309, ...,  0.0526667 ,  0.10148451,
         0.02895422],
       ..., 
       [-1.79521595,  0.96701436,  0.33837945, ...,  0.17229478,  0.08216916,
         0.07340455],
       [-3.04566306, -0.29459339,  0.75142233, ..., -0.32349976,  0.27358695,
         0.13498643],
       [-1.7444841 ,  0.70695794,  0.63127698, ..., -0.18157476, -0.2584679 ,
        -0.06526969]])

In [469]:
plt.figure(figsize=(20,4))
# plt.imshow(df_decomp.T[0:10],aspect='auto')
# plt.colorbar()
for i in range(5):
    plt.plot(df_decomp[:,i],label=str(i))
plt.legend()            
plt.show()

<matplotlib.figure.Figure at 0x7efb7d1c6d30>

In [357]:
cov

array([[ 11.49272035,   2.39836865,   2.43987682, ...,  -1.39935203,
         -1.02021922,  -1.17207476],
       [  2.39836865,  11.57305121,   1.89213868, ...,  -1.61709379,
         -1.3275407 ,  -1.0037423 ],
       [  2.43987682,   1.89213868,  11.36776863, ...,  -0.75502413,
         -1.2557539 ,  -1.14314742],
       ..., 
       [ -1.39935203,  -1.61709379,  -0.75502413, ...,  12.31648846,
          4.88844605,   3.05726093],
       [ -1.02021922,  -1.3275407 ,  -1.2557539 , ...,   4.88844605,
         10.99727437,   3.03394189],
       [ -1.17207476,  -1.0037423 ,  -1.14314742, ...,   3.05726093,
          3.03394189,  12.08365687]])

## Vector embeddings

I'd like to reprise an approach borrowed from text-mining for training word vectors.  We'd like to extract information on the similarities of both users and games.  As suggested, this would use a matrix factorization method.  We'd train the embeddings on some subset of the matrix.  New games are treated on the assumption that similar ratings.  
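
A minimal sketch of what such a factorization could look like, assuming ratings that have already been centered and using plain full-batch gradient descent on the observed entries only.  Everything here (the function names, the learning rate `lr`, the regularization weight `lam`, the toy matrix) is an illustrative assumption, not a tuned or tested recipe for this dataset.

```python
import numpy as np

def factorize(S, n_hidden=2, lam=0.1, lr=0.01, n_epochs=200, seed=0):
    """Fit S ~ U @ W on the observed (non-NaN) entries by gradient descent."""
    rng = np.random.default_rng(seed)
    n_user, n_game = S.shape
    U = 0.1 * rng.normal(size=(n_user, n_hidden))   # user embeddings
    W = 0.1 * rng.normal(size=(n_hidden, n_game))   # game embeddings
    observed = ~np.isnan(S)
    S0 = np.where(observed, S, 0.0)
    for _ in range(n_epochs):
        # residual on observed entries only; missing entries contribute 0
        R = np.where(observed, S0 - U @ W, 0.0)
        grad_U = -R @ W.T + lam * U
        grad_W = -U.T @ R + lam * W
        U -= lr * grad_U
        W -= lr * grad_W
    return U, W

def observed_rmse(S, U, W):
    """RMSE of the reconstruction on the observed entries."""
    observed = ~np.isnan(S)
    return np.sqrt(np.mean((S - U @ W)[observed] ** 2))

# made-up centered ratings: two groups of users with opposite tastes
S = np.array([[2.0, 2.0, -2.0, np.nan],
              [2.0, np.nan, -2.0, -2.0],
              [-2.0, -2.0, np.nan, 2.0],
              [np.nan, -2.0, 2.0, 2.0]])
U0, W0 = factorize(S, n_epochs=0)      # untrained starting point
U, W = factorize(S)
print(observed_rmse(S, U0, W0), observed_rmse(S, U, W))
print(np.round(U @ W, 1))              # includes predictions for the NaN entries
```

The rows of `U` and the columns of `W` play the role of the user and game embeddings; similar users or games should end up with similar vectors.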