# Recommendation Engines

(This is for the Applied Data Science Group November/December 2017 session.)

This notebook tries to build a recommendation engine, which an e-commerce sites would use to recommend other items to you.  Matt Borthwick scraped the data from user reviews at boardgamegeek.com.  This is an initial runthrough to check the quality of the data, and try to play with the distributions.  I'll try to check that the dataset seems sane, check the shape of the distributions.

## Possible questions:

I did a similar brainstorming exercise (without looking at the data) to what we did in the first week:

### Exploratory questions

- What is the most popular game?
  - Which has the highest average rating?
  -  Which has the most reviews?
  -  Same for lowest, least reviews

 - What is the most divisive game?
  (Greatest spread in review scores)

- Data quality: NA, None, NAN
   Number of reviews per user?
   Number of reviews per game?
   Check scale of review scores
 - Check distributions of scores

## Analysis/Modelling questions

- Recommend new games based on similarities with others interests.

   Build clustering algorithm based on scores in games.
   - Assign each user a vector in Ngame-dim space.
   - Find users with similar vectors, based on dot-product.  (K-means or some other clustering)?
   - Remove games that are already reviewed, or with negative scores.
   - Recommend remaining game with highest score.

- User analysis:
   Are there multiple audiences here? "Hardcore" vs "casual" to use the gamer terms.
   - How many 1-review users are there? What games do they try out?
   - What games do users with multiple reviews enjoy? 

- Scoring: How will we score/test our recommendations?
    - Some sort of cross-validation where we keep a game's scores back, 
    and try to predict how reviewers will score it, based on their other reviews?

    - Is a naive test/train split worthwhile/valid?

Handling sparsity:
         - Use global function to estimate missing values.  Treat them as the average user.
         - use TF-IDF?  Not just most frequent, but ratio of frequency to number of users

Latent factor analysis:
       - Collaborative filter
       -decompose matrix into 2 matrices.  user features vs game features.
       - if S_{i,j} is matrix element for user i's score of game j, then
       decompose S=UW, where U is N_{user} x N_{hidden}, and W = N_{hidden} x N_{game}.  (This is similar to training word-vectors in natural language processing)
       - Train on data.

Content-based filtering  (genre tases)
user-based filter  (users similar)
item-item collaborative filtering.  (game similarities)

Try: training on "elite" users to define the clusters?

Do text analysis on game titles for similarities? (Useful for marketing a game)

Associative rule mining

Pattern exploration: Which games did people rate at all?

In [None]:
#Simple measures?

How to impute missing data?  Average score? (Laplace smoothing from spam?)




I aim to try k-means clustering, and the latent factor analysis approach.
I will also try some straightforward collaborative filtering with similarities based on user/game vectors.  I also thought about trying to analyze what games are preferred by reviewers with only a few reviews (n<5) vs many reviews (n>50).

Need to think though the optimization criteria, and include appropriate regularization to avoid overfitting.  Maybe use mean-square-error on game scores for games the user has actually reviewed?

In [1]:
%load_ext autoreload

In [2]:
#standard library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

#save graphics as pdf too (for less revolting exported plots)
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('png', 'pdf')

In [21]:
#read in the data.  (13MB or so)
#(N.B. I put Matt's header on it's own line, which is skipped, and added the UserID)
#initial playing data
#df=pd.read_csv('data/boardgame-ratings.csv',skiprows=1)
#frequent users
df=pd.read_csv('data/boardgame-frequent-users.csv',skiprows=1)

#full matrix (2E5 users, 400 games)
#df=pd.read_csv('data/boardgame-users.csv',skiprows=1)
df.columns=('userID','gameID','rating')

In [19]:
#Matt made a csv file of ids and names  Load into dataframe, put into dict.
name_df=pd.read_csv('data/boardgame-titles.csv',index_col=0)
name_dict=name_df.to_dict()

## Exploratory Analysis

I'm going to do a few things:
- check for NaN/missing values.
- check the scores look right
- check the numbers of reviews, and games.
- match up the names with the unique gameIDs (I'll find some missing entries here)
- plot the number of reviews/user and reviews/game.
- check for duplicates

In [8]:
#test for NaN
nan_array=np.isnan(df.values)
print('Number of NaN',np.sum(nan_array))
#check scale of review scores.
print('Max/min scores',df['rating'].min(),df['rating'].max())

Number of NaN 0
Max/min scores 1.0 10.0


In [22]:
#How many users, how many games?
#Find the unique entries in each list
users=df['userID'].unique()
games=df['gameID'].unique()

In [14]:
print('Number of unique users is:',len(users))
print('Number of unique games is:',len(games))
print('Total number of reviews is:',len(df))

Number of unique users is: 154655
Number of unique games is: 27
Total number of reviews is: 834415


In [15]:
degree_of_sparsity = len(df)/(len(users)*len(games))
print(degree_of_sparsity)

0.19982709423723294


In [16]:
#check for duplicates
dup=df.duplicated()
df_dup=df[dup]
print('Number of duplicates: ',np.sum(dup))

Number of duplicates:  0


### Number of reviews/user and reviews/game

Plot some histograms of reviews per user, and reviews per game. 

In [23]:
avg_num_reviews=len(df)/len(users)
print(avg_num_reviews)

213.85806712494946


So on average, each user reviews 213 games.  Let's try to build a histogram of users with a given number of reviews.  (and then the same with games)

In [25]:
#However, this version took a few seconds.
user_review_counts=df.groupby(['userID']).count()
#note that there really are users with ids going from 1 to 1000, its not a screwup.

In [26]:
#find the counts of reviews for each game.
game_review_counts=df.groupby(['gameID']).count()
#make a list matching up gameIDs and names.  Use that list as a new index
new_index=[]
i=0
for ind in game_review_counts.index:
    i+=1
    new_index.append(name_dict['title'][ind])

game_review_counts.index=new_index

In [27]:
#plt.figure(figsize=(12,9))
plt.figure()
plt.hist(user_review_counts.iloc[:,0].values,log=True)
plt.xlabel('Number of Reviews')
plt.ylabel('Number of Users')
plt.title('Reviewer distribution: Number of reviews per user')
plt.show()

KeyboardInterrupt: 

So this is a really long-tailed distribution.  It might be nice to look at this histogram on a log-x scale.  

In [14]:
plt.figure(figsize=(12,9))
game_review_counts.iloc[:,1].plot('bar')
plt.ylabel('Number of reviews')
plt.title('Number of reviews per game')
plt.show()

KeyboardInterrupt: 

In [None]:
#make a histogram of number of games with a given number of reviews



Lets also try to look at the distributions of scores.  I'll try to make a box-plot.
That will let me check the distributions in an easy manner.
I'll pivot the data frame to make rows users, columns be games, with the entries given by the score. 

## Boxplots and Transforming the data

Rearranging the data to use the gameIDs as columns would make sense for recommendation.
For this data set, with 27 dim that's should be no problem. (Another question on what is best to do with thousands of entries).
This would also make it easier to look at histograms on a per-game basis.
I'm nigh certain pandas has a reshape function to do exactly this.  Pivot maybe?
(http://pandas.pydata.org/pandas-docs/stable/reshaping.html)

In [29]:
#make a small dataframe for debugging purposes
#df_small=df.iloc[0:1000]
#make a dense dataframe
df_pivot=df.pivot(index='userID',columns='gameID',values='rating')
df_pivot=df_pivot.rename(columns=name_dict['title'])
#df_pivot.head()

In [51]:
#df_pivot.to_csv('data/boardgame-ratings-pivot.gz',compression='gzip')
#?df.boxplot

In [30]:
plt.figure()
game_review_counts=df_pivot.boxplot(rot=90,grid=False)
plt.title('Score distributions by title')
plt.ylabel('Rating')
plt.show()

KeyboardInterrupt: 

<matplotlib.figure.Figure at 0x7f553b54bcc0>

KeyboardInterrupt: 

These mostly look positive.  Not any radically skewed distributions, like all 1 or all 10.  

(I'll imitate a plot I saw the more experienced folk do at the first finance-data meetup)
Try a correlation map based on columns to see how close the score distributions are.
I think this intuitively corresponds to: How much are the score distributions in one game similar to another?
Running across the rows would yield something analogous for users (but would take an age, since that is a 1E5 x 1E5 matrix).


In [24]:
corr_mat=df_pivot.corr()

In [42]:
plt.figure(figsize=(10,10))
plt.imshow(corr_mat)
plt.colorbar()
plt.show()

<matplotlib.figure.Figure at 0x7efbc42e5f60>

As for building a dataset for recommendations engines, the low correlation is worrisome?  A high correlation implies that everyone likes the same games, in which case there is no space for a skillful recommendation.  Thse are average correlations, rather than user-wise correlations.

The low correlation might also be an artifact of lots of reviewers with only a single review. Those entries will have little correlation with anyone else, and may artificially lower the scores?  I also tried keeping only reviews with more than a few scores - it did nothing to change the overall picture.

In [14]:
#def reduced_corr(df)
Nrow,Ncol=df_pivot.shape

rcorr = np.zeros((Ncol,Ncol))
Ncorr = np.zeros((Ncol,Ncol))
Nreviews = np.zeros(Ncol)
mu  = df_pivot.mean(axis=1)
mu_sort=mu.sort_values()
med = df_pivot.median(axis=1)
med_sort=med.sort_values()
std = df_pivot.std(axis=1)

#compute scaled dataframe
scaled = (df_pivot.subtract(mu,axis='index').div(std,axis='index')).values
#scaled = ((df_pivot-5.5)/10).values


In [16]:
Nmax=Ncol
#compute correlations between entries, only where both games have been rated.
for i in range(Nmax):
    mski = ~np.isnan(scaled[:,i])
    for j in range(i,Nmax):
        mskj = ~np.isnan(scaled[:,j])
        msk_tot = mski & mskj
        x = scaled[msk_tot,i]
        y = scaled[msk_tot,j]
        Ncommon=np.sum(msk_tot)
        c= np.dot(x,y)/(Ncommon-1)
        rcorr[i,j]=c
        rcorr[j,i]=c
        Ncorr[i,j]=Ncommon
        Ncorr[j,i]=Ncommon

# #check that the correlation is measuring something like the dot-product between the distributions.
# x0=df_pivot.iloc[:,0].values
# x1=df_pivot.iloc[:,1].values

# x0_mu=np.nanmean(x0)
# x1_mu=np.nanmean(x1)


In [31]:
#find number of reviews within each integer bin size
def game_hist(df):
    df_counts=pd.DataFrame()
    for i in range(1,11):
        Ntot=(df.round()==i).sum(axis=0).astype(int)
        df_counts=df_counts.append(Ntot,ignore_index=True)
    df_counts.index=np.arange(1,11)
    df_counts=df_counts/np.sum(df_counts)
    return df_counts

#compute histograms for dot products. 
def raw_hist(df,Nbins=10):
    df_counts=pd.DataFrame()
    imin=-1
    imax=1
    dx = (imax-imin)/Nbins
    for i in range(Nbins):
        i0 = i*dx
        i1 = (i+1)*dx
        Ntot=(df>i0 & df<i1).sum(axis=0).astype(int)
        df_counts=df_counts.append(Ntot,ignore_index=True)
    df_counts.index=np.linspace(imin,imax,Nbins)
    df_counts=df_counts/np.sum(df_counts)
    return df_counts

def sort_dataframe_columns(df,method='mean'):
    """sort_dataframe_columns(df,method='mean')
    Sort a dataframe's columns based on the columns mean, median,
    variance, or inter-quartile range.
    Method can be "mean", "median", or "std"
    """
    if (method=='mean'):
        sort_series=df.mean(axis=0)
    elif (method=='median'):
        sort_series = df.median(axis=0)
    elif (method=='std'):
        sort_series = df.std(axis=0)
    else:
        print('Method not allowed.')
    sort_series.sort_values(inplace=True)
    df=df.loc[:,sort_series.index]
    return df
                      


In [32]:
df_counts=game_hist(df_pivot)
df_counts_sorted=sort_dataframe_columns(df_counts,method='median')

Is there much similarity in the score distributions?  Not really useful question to ask.
But the histogram plot is useful.

In [116]:
?plt.imshow()

In [19]:
plt.figure(figsize=(15,10))
plt.imshow(np.log(df_counts_sorted),aspect='auto')
plt.colorbar()
plt.ylabel('Score')
plt.xlabel('Games')
plt.show()

<matplotlib.figure.Figure at 0x7efbb71cadd8>

  


The above image is a plot of the score densities for all of the games.  I'm just trying to get a sense of what the score distributions look like.
Most of the games are scored within 6-8.

It might be interesting to identify games by their variance?  Which games is there a consensus on, and which games are divisive?

In [44]:
plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
plt.imshow(corr_mat)
plt.colorbar()
plt.title('Pandas Correlation')
plt.subplot(1,3,2)
plt.imshow(rcorr)
plt.colorbar()
plt.title('"Corrected" Correlation')
plt.subplot(1,3,3)
plt.imshow(Ncorr)
plt.colorbar()
plt.title('Fraction of Common Reviews')
plt.show()

<matplotlib.figure.Figure at 0x7efbbd902c50>

In [21]:
i=5
j=90
plt.scatter(df_pivot.iloc[:,i],df_pivot.iloc[:,j])

<matplotlib.collections.PathCollection at 0x7efbb672cc18>

<matplotlib.figure.Figure at 0x7efbb67c1ef0>

The average pair-wise correlations between user's opinions of a game are really low.
Maybe a cluster analysis might yield something non-trivial?
This says on average, it's hard to guess what a person will think (in terms of variations from the mean).

Looking over the numbers, the correlations seem to be based on 1000 ratings in common for any pair of games.

In [56]:
##make logical array for actual reviews.
#ntot=np.sum(df_pivot>0,axis=1)
##only keep those with more than 6 review.
#keep_msk=ntot>20
#df_pivot2=df_pivot[keep_msk]
#len(df_pivot2)

111

## Conclusions regarding state of data

- The number of reviews per user is skewed towards new folks (not unreasonable, given how few people can stick at something as time intensive as playing and reviewing board games).

- Looking at the box-plots, the scores seem fairly high, which tallies with what Matt said about picking popular games.  There doesn't seem anything obviously wrong with the distributions (all zero, or all 10s).

- I think for analysis, it would be beneficial to reshape the dataframe/array, but that is probably best left to the participants, as is removing any data with few reviews.  I used "pivot" to transform the gameID column, into a new set of columns, while keeping the reviewers as rows.  This will make building feature vectors straightforward.

- I tried building up some histograms via looping, and it was indeed quite slow.  In contrast, the arcane, built-in functions (groupby) are super fast.  The smaller dataset should allow accessibility to new people, while the full dataset is quite manageable if you find the right set of functions.  I haven't done any actual machine-learning with this yet, so maybe I'll eat my words about "manageable".

- The correlation plots seem to show a small, positive correlation.  Is this even a sensible measure?  Its something like the overlap between the shapes of the ratings distributions.

## Splitting into training/test

I'm going to manually force a training/test split.  I'm going to randomly select 10% of users, and 10% of games.  To my mind any measure of similarity shuld be able to detect generic tastes, and be able to predict how well a  
Our goal is to recommend games people will like.  We can do this by holding back 

In [33]:
Ngames=len(games)
Nusers=len(users)

np.random.seed(seed=27128)
r=np.random.random()
print(r)
#make a list of uniform random numbers (times appropriate lengths)
game_ix=np.random.random(size=Ngames)<0.1
user_ix=np.random.random(size=Nusers)<0.1

#keep testing examples from to test new users on old games, and new games on old users.
df_game_test=df_pivot.iloc[~user_ix,game_ix].copy()
df_user_test=df_pivot.iloc[user_ix,game_ix].copy()

#keep only the non-testing examples
df_train = df_pivot.iloc[~user_ix,~game_ix]
#We are then free to try predicting "new users" scores (given a way of decomposing new users)

0.5406178678304046


The "game" test is for feature vectors trained on a set of users, can we get the correct rating on games they rated.
Second, given some "new" user, can we get the correct scores.  

In [22]:
df_train.shape

(2215, 355)

In [26]:
df_counts=game_hist(df_train)
df_counts_sorted=sort_dataframe_columns(df_counts,method='median')
plt.figure(figsize=(12,5))
plt.imshow(df_counts_sorted,aspect='auto')
plt.show()

<matplotlib.figure.Figure at 0x7efbb73b8978>

## Similarity vectors.

Let's now try to make vectors for each person and game, and measure the distance between people?
(I think this realy needs to be centered.  Otherwise, there seems to be a really high left over correlation between users opinions?

In [52]:
df_train_sorted=sort_dataframe_columns(df_train,method='median')
df0=df_train_sorted.values

## OG scaling
# mu = df_train_sorted.mean(axis=0)
# sd = df_train_sorted.std(axis=0)
# #because broadcasting only goes so far.
# #need to use direct ops to scale
# scaled = df_train_sorted.sub(mu,axis=1).div(sd,axis=1)
# #scaled = (df_train_sorted-mu).sub(mu,axis=0)/sd

mu = df_train_sorted.mean(axis=1)
sd = df_train_sorted.std(axis=1)
#because broadcasting only goes so far.
#need to use direct ops to scale
scaled = df_train_sorted.sub(mu,axis=0)#.div(sd,axis=0)
#scaled = (df_train_sorted-mu).sub(mu,axis=0)/sd


#scale features to be on -1,1
#scaled = (df_train-5.5)/10

In [53]:
nan_msk=np.isnan(scaled)
scaled[nan_msk]=0

In [226]:
scaled.shape

(2215, 355)

In [54]:
#find distances between users
df3=np.matmul(scaled,scaled.T)
df3_d=np.diag(df3)
#subtract off the diagonal
df3=df3-np.diag(df3_d)
df3_scaled=df3/df3_d
#try to see small changes
one_msk=df3_scaled==1
df3_scaled[one_msk]=0

  


In [38]:
#try to find the number of vectors with high/low similarity.
Nbin=40
def mat_hist(df,Nbin=100,log_flag=False):
    """mat_hist(df,Nbin)
    Return histogram for square-matrix.  
    Finds number of entries of presumed symmetric matrix between a given number of values. 
    Attempted to scale out number.
    """
    Nuser=len(df[:,0])
    xmax = df.max()
    xmin = df.min()
    x = np.linspace(xmin,xmax,Nbin+1)
    dot_hist=np.zeros(Nbin)
    for i in range(Nbin):
        m1 = df>x[i]
        m2 = df<x[i+1]
        dot_hist[i]=np.sum(m1 &m2 )
    xc = (x[:-1]+x[1:])/2
    return xc,dot_hist    

nmsk = df3<0
pmsk = df3>0
omsk = df3==0
xpos,dot_pos=mat_hist(np.log(np.abs(df3*pmsk)+1E-16))
xneg,dot_neg=mat_hist(np.log(np.abs(df3*nmsk)+1E-16))

In [151]:
print('Number of identical vectors: {}'.format(np.sum(omsk)))

Number of identical vectors: 0


In [157]:
#Check number of users with positive vectors
np.sum(df3<-100,axis=0)

array([11,  0,  4, ...,  0,  2, 35])

In [39]:
plt.semilogy(xpos,dot_pos,'-x',label='Pos')
plt.semilogy(xneg,dot_neg,'-x',label='Neg')
plt.legend()
plt.xlabel('Log of Dot-Products of un-scaled user score vectors') 
plt.show()

<matplotlib.figure.Figure at 0x7f55339419e8>

In [251]:
# plt.imshow(df3>0.5)
#plot's whether users have above a certain overlap.  
plt.imshow((np.log(np.abs(df3))))
plt.colorbar()

<matplotlib.figure.Figure at 0x7efbbc3fdcf8>

<matplotlib.colorbar.Colorbar at 0x7efbc408f4a8>

  This is separate from the ipykernel package so we can avoid doing imports until


In [55]:
#Pick out k-nearest neighbours for given user
Nuser,Ngame = df_train.shape
userID = df_train.index
kn = 1

#grab a row from dot-product matrix
k_ind = np.zeros([Nuser,kn])
k_guess = np.zeros([Nuser,Ngame])
for i in range(Nuser):
    ind = np.arange(Nuser)
    #sort and grab top k results.  Also need indices - use index
    s=pd.Series(df3[i],index=userID)
    #sort series by values
    s.sort_values(ascending=False,inplace=True)
    #extract user IDs for first kn values
    k_ind=s[0:kn].index.values
    #compose vector by adding up vectors, weighted by their overlap.
    # k_vec = np.zeros(Ngame)
    # ci_tot=0
    #for k in range(kn):
        # ci = np.abs(s.iloc[k])
        # k_vec+=ci*scaled.loc[k_ind[k]]
        # ci_tot+=ci
    #some simple vectorization    
    ci = np.abs(s.iloc[:kn])
    k_vec=np.sum(scaled.loc[k_ind].mul(ci,axis=0))/sum(ci)
    k_guess[i]=k_vec
#List off games, predicted scores.

In [56]:
k_guess

array([[ 0.        ,  0.        ,  0.        , ...,  3.56833977,  3.56833977,
         3.56833977],
       [ 0.        ,  0.        ,  0.        , ..., -1.57383966,  3.42616034,
         0.        ],
       [ 0.        ,  0.        , -6.12396907, ...,  0.87603093,  0.        ,
         2.87603093],
       ..., 
       [-5.28599222,  0.        , -5.28599222, ...,  0.        ,  3.71400778,
         3.71400778],
       [-4.38910506, -1.38910506, -1.38910506, ..., -1.38910506,  4.61089494,
         0.        ],
       [-4.06091371, -4.06091371, -4.06091371, ...,  1.93908629,  0.        ,
         3.93908629]])

In [58]:
k_df=pd.DataFrame(k_guess,index=df_train.index,columns=df_train.columns)
#k_df = k_df.mul(sd,axis=1).add(mu,axis=1)
k_df = k_df.add(mu,axis=0)


In [64]:
k_df.loc[388].sort_values(ascending=False)

gameID
Pathfinder Adventure Card Game: Rise of the Runelords – Base Set    12.727254
Coup                                                                12.727254
Tiny Epic Galaxies                                                  12.727254
Star Wars: Rebellion                                                12.727254
Food Chain Magnate                                                  11.727254
Mascarade                                                           11.727254
Evolution                                                           11.727254
Abyss                                                               11.727254
Mombasa                                                             11.727254
Mechs vs. Minions                                                   11.727254
Viticulture Essential Edition                                       11.727254
Through the Ages: A New Story of Civilization                       10.727254
Eldritch Horror                                          

In [66]:
#Compare predicted score to actual scores.
score=np.sum(np.abs(k_guess-scaled)*(df_train>0),axis=1)/np.sum(df_train>0,axis=1)
score

userID
83        1.237355
119       1.597028
156       1.228417
186       1.367723
225       1.458960
238       1.477584
272       2.231510
319       1.316953
387       0.984268
388       1.098570
430       1.279173
437       1.667063
467       1.164113
598       1.674492
614       1.346083
630       1.092664
632       1.377932
839       1.315425
1004      1.364606
1155      1.751986
1590      1.439000
1845      1.695086
1896      1.423402
1920      1.544009
1929      1.857700
1932      2.123560
1964      1.231595
1980      1.560174
2044      1.268116
2055      1.454789
            ...   
190132    1.536417
190196    1.834604
190258    1.597544
190519    1.444429
191022    1.334256
191116    1.746285
191258    2.149396
191320    1.402582
191331    1.179133
191390    1.413790
191433    1.622437
191459    1.578782
191507    1.369584
191586    1.925355
191597    1.574644
191803    1.525329
192043    2.925074
192053    1.467869
192057    1.673241
192098    1.305254
192151    1.752194
19218

In [40]:
score.shape

(2215, 355)

In [42]:
print(k_guess[0:10,0:4])
print(scaled.iloc[0:10,0:4])

[[ 0.20074014  0.97100402  1.22347953  0.36878477]
 [-0.35494666 -0.7753645  -0.36365546 -0.26116641]
 [ 0.03439543  0.00466181  0.24265421 -0.01878128]
 [ 0.39381427  0.89230497  1.47305605  0.28953728]
 [ 0.37400216  0.90905516  1.28070215  0.2800053 ]
 [-0.02246707  0.183208    0.56460927 -0.11301381]
 [-0.25179017 -0.35019928 -0.28860244  0.18611182]
 [-0.27031361 -0.37359635 -0.55445633 -0.1714355 ]
 [ 0.68766844  0.96185614  1.44841871  0.46218136]
 [ 0.34456765  0.79496545  1.36362955  0.28946679]]
gameID  The Game of Life  Battleship  Monopoly  Checkers
userID                                                  
83              0.000000    0.000000  0.000000  0.000000
119             0.000000    0.000000 -0.527622  0.000000
156             0.000000   -0.678567 -1.156601 -0.479645
186             0.450655    0.797441  0.730334  0.265650
225             0.000000    0.000000  0.000000  0.000000
238             0.450655    0.000000  1.359312  0.000000
272             0.000000    0.000

In [50]:
(np.sum(k_guess-scaled)/Nuser).sort_values()

gameID
Endeavor                                        -0.957917
Agricola: All Creatures Big and Small           -0.865689
The Manhattan Project                           -0.840160
Saint Petersburg                                -0.832226
Castles of Mad King Ludwig                      -0.810621
Pandemic Legacy: Season 1                       -0.807642
Eclipse                                         -0.806866
Ticket to Ride: Märklin                         -0.797809
Suburbia                                        -0.797703
The Castles of Burgundy                         -0.782096
Ticket to Ride                                  -0.772272
Ingenious                                       -0.742266
Star Wars: Imperial Assault                     -0.733103
Dominant Species                                -0.732940
Libertalia                                      -0.725686
Ticket to Ride: Nordic Countries                -0.722842
The Pillars of the Earth                        -0.708014
Power G

In [216]:
#Retrieve those vectors, and average their results together.


157336

In [207]:
k_ind[0:4]

array([[    83, 100174, 170579, 132203, 157336, 112355,  87830, 162938, 153971,
        142253],
       [   119, 135070, 147732,  13146, 100309, 110460,    272,  94828, 121369,
        130258],
       [100174,    156, 157336, 132203,  27761,  83255,  98155, 122722,  60520,
        145718],
       [100174, 170579, 132203,    186, 112355,  88176,   7709,  87830, 157336,
        129611]])

In [187]:
?s.sort_values()

In [183]:
A = np.array([[1,12,9],[1,2,3]])

In [184]:
B=np.sort(A,axis=0)
print(B)

[[ 1  2  3]
 [ 1 12  9]]


# Clustering

THis tries to cluster users together based on just their scores.  I think this is doomed to failure.
There is mostly a blob with games scored at 8.  Maybe with the expanded information, this would go somewhere.


In [94]:
##Try some clustering to identify populations.
from sklearn.cluster import MiniBatchKMeans, KMeans

In [97]:
# #Initial attempt at clustering with raw review scores
# df_new=df_train.copy()
# #convert dataframe into 
# df_mat=df_new.values
# nan_msk=np.isnan(df_mat)
# df_mat[nan_msk]=0

#New attempt at clustering with raw review scores
#convert dataframe into 
df_mat=scaled.copy()

In [120]:
#Try minibatch Kmeans (as recommended)
def fit_kmeans(df,Nclasses):
    km=KMeans(n_clusters=Nclasses,n_jobs=3)
    km.fit(df.values)
    ypred=km.predict(df.values)
    df['Class']=ypred
    return df, km

In [99]:
#now to try visualizing
plt.figure(figsize=(15,10))
Ncol=4
Nrows=Nclasses//Ncol+1
for i in range(Nclasses):
    plt.subplot(Nrows,Ncol,i+1)
    #determine which rows are in a given class.
    msk=df_mat['Class']==i
    #then plot the scores for each class (from unscaled data)
    d0=game_hist(df_train[msk])
    plt.imshow(d0,aspect='auto')

plt.show()

<matplotlib.figure.Figure at 0x7efbb6fa9a58>

In [339]:
?KMeans

So this is once again complete trash.

How else to visualize these?  Try plotting the means within a given cluster.  

In [347]:
cent=km.cluster_centers_
inert=km.inertia_



363146.62347590784

In [None]:
Idea: Try scaling data to map CDF to linear function? 

## Principle Components Analysis

Let's try a PCA on this, and do some dimensionality reduction.

In [57]:
from sklearn.decomposition import PCA

df_new=df_train.copy()
#convert dataframe into 
df_mat=df_new.values
#impute NaN as zero, after scaling/centering.
nan_msk=np.isnan(df_mat)
df_mat[nan_msk]=0

pca = PCA(n_components=100)
pca.fit(df_mat)


PCA(copy=True, iterated_power='auto', n_components=100, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [58]:
#let's look at the amount of explained variance.  There's typically an elbow around 10
plt.figure()
plt.plot(pca.explained_variance_,'-x')
plt.show()

<matplotlib.figure.Figure at 0x7efbc4327160>

In [447]:
?pca

In [118]:
#try visualizing the components
pca_comp = pca.components_

plt.figure()
for i in range(5):
    plt.plot(pca_comp[i,:],label=str(i))
    
plt.legend()
plt.show()

<matplotlib.figure.Figure at 0x7efbbc491ba8>

In [221]:
#plot the minimum rating given out by each user.  
plt.hist(np.mean(df_mat,axis=1))
plt.show()

<matplotlib.figure.Figure at 0x7efb85d055c0>

As a first pass, let's try 100 principle components.  I'd like to visualize these components.
This has selected out 361 features.  These features could be said to correspond to taste profiles? - how much did each class like a given game.
Decompose each user into a superposition of each group.

In [60]:
#decompose the users into their components.
df_decomp=np.dot(df_mat,pca_comp[:10,:].T)

In [63]:
df_decomp.shape

(2215, 10)

In [64]:
plt.figure(figsize=(20,4))
for i in range(20):
    plt.plot(df_decomp[i,:],'-x',label=str(i))
plt.axis([0,40,-10,10])    
plt.legend()            
plt.show()

<matplotlib.figure.Figure at 0x7efbbc302048>

In [None]:
#now try k-means on this reduced dataset.  

#Vector embeddings

I'd like to reprise an approach borrowed from text-mining for training word vectors.  We'd like to extract information both on the similarities of
users, and games.  As suggested, a matrix factorization method.  We'd train the embeddings on some subset of the matrix.  New games are treated on the assumption that similar ratings.  Look up alternating-least-squares.

(Try Batch SGD for fitting?)

In [61]:
from sklearn.decomposition import NMF

df_trainv=df_train.values
nan_msk=np.isnan(df_trainv)
df_trainv[nan_msk]=0

In [62]:
#use Scikit-Learn's Non-Negative Matrix Factorization. (good for small cases, might need to do own minibatch SGD code)
nmf_model=NMF(n_components=10,init='nndsvd',
    random_state=2043,
    alpha=2,
    l1_ratio=0.5,
    verbose=False,
    tol=0.001,max_iter=1000,shuffle=True)
#note that tol is the tolerance on the relative percentage change, not the tolerance on the loss. 
W=nmf_model.fit_transform(df_trainv)
H=nmf_model.components_

In [125]:
?NMF

In [63]:
print('Reconstruction Error:',nmf_model.reconstruction_err_)

#round to half-integer scores.
WH=np.floor(2*np.dot(W,H))/2
r,c=WH.shape
WH_diff=df_trainv-WH
#checking the rough reconstruction error, seems to check out
print('my error',np.sqrt(np.sum(WH_diff**2)))

print('mean absolute deviation:',np.mean(abs(WH_diff)))

Reconstruction Error: 2562.72992851
my error 2570.92401614
mean absolute deviation: 2.41186810343


In [None]:
#Now try finding similarity of vectors.
#

#K-nearest neighbours recommender

Assign each user a vector.
Find 100 nearest users (with non-zero distance) (to avoid replicated users with identical tastes)

In [304]:
#compute distances between pairs of users.
df_mat=scaled.copy().values
df_small=df_mat[0:100,:]
dist=np.dot(df_small,df_small.T)
dist_d=np.diag(dist)
dist=dist/dist_d
##try to see small changes
# one_msk=dist==1
# dist[one_msk]=0

In [186]:
np.sum(dist==0)

0

In [201]:
plt.figure(figsize=(8,8))
plt.imshow(dist[dist<1])
plt.colorbar()
plt.show()

TypeError: Invalid dimensions for image data

<matplotlib.figure.Figure at 0x7efb96b0b278>

TypeError: Invalid dimensions for image data