# Recommendation Engines

(This is for the Applied Data Science Group November/December 2017 session.)

This notebook tries to build a recommendation engine, of the kind an e-commerce site would use to recommend other items to you.  Matt Borthwick scraped the data from user reviews at boardgamegeek.com.  This is an initial run-through to check the quality of the data and to play with the distributions: I'll check that the dataset seems sane, and look at the shape of the distributions.

## Possible questions:

I did a similar brainstorming exercise (without looking at the data) to what we did in the first week:

### Exploratory questions

- What is the most popular game?
  - Which has the highest average rating?
  - Which has the most reviews?
  - Same for lowest rating, fewest reviews.

- What is the most divisive game?
  (Greatest spread in review scores)

- Data quality: NA, None, NaN
  - Number of reviews per user?
  - Number of reviews per game?
  - Check scale of review scores
- Check distributions of scores
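
As a rough illustration of how the exploratory questions above could be answered in one pass, here is a minimal sketch on a made-up toy table. The column names follow the `userID`/`gameID`/`rating` layout used later in this notebook; the values are invented, and taking "most divisive" to mean "largest standard deviation" is only one possible reading.

```python
import pandas as pd

# Toy stand-in for the (userID, gameID, rating) table; all values are made up.
toy = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3, 3, 4],
    "gameID": [13, 822, 13, 822, 13, 30549, 30549],
    "rating": [8.0, 6.0, 2.0, 7.0, 9.0, 8.0, 9.0],
})

# Per-game summary: average rating, number of reviews, spread of scores.
summary = toy.groupby("gameID")["rating"].agg(["mean", "count", "std"])

highest_rated = summary["mean"].idxmax()   # highest average rating
most_reviewed = summary["count"].idxmax()  # most reviews
most_divisive = summary["std"].idxmax()    # greatest spread in scores

print(summary)
print(highest_rated, most_reviewed, most_divisive)
```

Swapping `idxmax` for `idxmin` gives the "lowest" versions of the same questions.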


### Analysis/Modelling questions

- Recommend new games based on similarities with others' interests.

  Build a clustering algorithm based on scores in games.
  - Assign each user a vector in Ngame-dim space.
  - Find users with similar vectors, based on dot-product.  (K-means or some other clustering?)
  - Remove games that are already reviewed, or with negative scores.
  - Recommend the remaining game with the highest score.

- User analysis:
  Are there multiple audiences here? "Hardcore" vs "casual", to use the gamer terms.
  - How many 1-review users are there? What games do they try out?
  - What games do users with multiple reviews enjoy?

- Scoring: How will we score/test our recommendations?
  - Some sort of cross-validation where we keep a game's scores back,
    and try to predict how reviewers will score it, based on their other reviews?
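
The "similar vectors via dot-product" idea above can be sketched on a toy matrix. This is only one possible reading of the plan, not the method used later in this notebook: the `recommend` helper, the user names, and the choices to centre each user's ratings on their own mean and to use cosine similarity are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy user x game ratings (NaN = not rated); names and values are made up.
ratings = pd.DataFrame(
    {"Catan": [9.0, 8.0, 2.0, np.nan],
     "Pandemic": [8.0, np.nan, 3.0, 9.0],
     "Dixit": [np.nan, 9.0, 8.0, np.nan],
     "Titan": [3.0, 2.0, np.nan, np.nan]},
    index=["ann", "bob", "cat", "dan"],
)

def recommend(ratings, user, n_neighbours=2):
    """Hypothetical helper: suggest the unrated game that similar users scored highest."""
    # Centre each user's ratings on their own mean, then treat missing as 0.
    centred = ratings.sub(ratings.mean(axis=1), axis=0).fillna(0.0)
    norms = np.linalg.norm(centred.values, axis=1)
    norms[norms == 0] = 1.0  # avoid dividing by zero for users with flat ratings
    unit = centred.div(pd.Series(norms, index=centred.index), axis=0)
    # Cosine similarity = dot product of unit-length user vectors.
    sims = unit.dot(unit.loc[user]).drop(user)
    neighbours = sims[sims > 0].nlargest(n_neighbours).index
    # Average the neighbours' scores on games this user has not rated yet.
    unrated = ratings.loc[user].isna()
    candidates = ratings.loc[neighbours, unrated].mean(axis=0).dropna()
    return candidates.idxmax() if len(candidates) else None

print(recommend(ratings, "ann"))
```

Here "ann" has tastes close to "bob" (both like Catan and dislike Titan), so the game she has not rated that "bob" scored highly gets suggested.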

In [1]:
#standard library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

#makes larger plots
plt.rcParams['figure.figsize']=(10,6)

#save graphics as pdf too (for less revolting exported plots)
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('png', 'pdf')

In [8]:
#read in the data.  (13MB or so)
#(N.B. I put Matt's header on its own line, which is skipped, and added the UserID)
#df=pd.read_csv('data/boardgame-ratings.csv',skiprows=1)
df=pd.read_csv('data/boardgame-users.csv',skiprows=1)
df.columns=('userID','gameID','rating')

In [9]:
name_df=pd.read_csv('data/boardgame-titles.csv')

   boardgamegeek.com game ID                    title
0                      13004  The Downfall of Pompeii
1                      66188                   Fresco
2                        503       Through the Desert
3                      66690     Dominion: Prosperity
4                        150                 PitchCar

In [12]:
name_df.columns=('gameID','title')
name_df.head()

   gameID                    title
0   13004  The Downfall of Pompeii
1   66188                   Fresco
2     503       Through the Desert
3   66690     Dominion: Prosperity
4     150                 PitchCar

In [4]:
?pd.read_csv

## Exploratory Analysis

I'm going to do a few things:
- check for NaN/missing values.
- check the scores look right
- check the numbers of reviews, and games.
- match up the names with the unique gameIDs (I'll find some missing entries here)
- plot the number of reviews/user and reviews/game.
- check for duplicates
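
As a hedged sketch of what these checks can look like with pandas built-ins, here is a toy example. The table is invented, and the 1 to 10 rating scale is an assumption; note that the minimum rating of 1.4013e-45 reported below in this notebook would be flagged by such a range check.

```python
import numpy as np
import pandas as pd

# Toy table with one missing rating, one repeated row and one out-of-scale score.
toy = pd.DataFrame({
    "userID": [1, 1, 2, 2, 2, 3],
    "gameID": [13, 822, 13, 13, 822, 13],
    "rating": [8.0, 6.0, 7.0, 7.0, np.nan, 1.4e-45],
})

# Missing values anywhere in the table.
n_missing = int(toy.isna().sum().sum())

# Fully identical rows, and repeated (user, game) pairs regardless of score.
n_exact_dups = int(toy.duplicated().sum())
n_pair_dups = int(toy.duplicated(subset=["userID", "gameID"]).sum())

# Ratings that are present but fall outside the assumed 1-10 scale.
in_range = toy["rating"].between(1, 10)
n_out_of_range = int((~in_range & toy["rating"].notna()).sum())

print(n_missing, n_exact_dups, n_pair_dups, n_out_of_range)
```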

In [13]:
#test for NaN
nan_array=np.isnan(df.values)
print('Number of NaN',np.sum(nan_array))
#check scale of review scores.
print('Min/max scores',df['rating'].min(),df['rating'].max())

Number of NaN 0
Min/max scores 1.4013e-45 10.0


In [14]:
#How many users, how many games?
#Find the unique entries in each list
users=df['userID'].unique()
games=df['gameID'].unique()

In [15]:
print('Number of unique users is:',len(users))
print('Number of unique games is:',len(games))
print('Total number of reviews is:',len(df))

Number of unique users is: 193504
Number of unique games is: 402
Total number of reviews is: 5148624


In [17]:
#check for duplicates
dup=df.duplicated()
df_dup=df[dup]
print('Number of duplicates: ',np.sum(dup))

TypeError: duplicated() got an unexpected keyword argument 'axis'

In [5]:
#make a dict to convert game labels to names (provided by Matt)
#  Looks like we are missing the keys for 33154, 197376.  
#I made up some keys for plotting/naming purposes.
name_dict={11:"Bohnanza",
68448:"7 Wonders",
39856:"Dixit",
40692:"Small World",
31260:"Agricola",
148228:"Splendor",
13:"Catan",
178900:"Codenames",
34635:"Stone Age",
28143:"Race for the Galaxy",
129622:"Love Letter",
14996:"Ticket to Ride: Europe",
3076:"Puerto Rico",
30549:"Pandemic",
65244:"Forbidden Island",
478:"Citadels",
15987:"Arkham Horror",
110327:"Lords of Waterdeep",
36218:"Dominion",
2651:"Power Grid",
9209:"Ticket to Ride",
103:"Titan",
822:"Carcassonne",
2163:"Space Hulk",
1927:"Munchkin",
70323:"King of Tokyo",
33154:"Wasabi",
197376:"Charterstone"}

#check name dictionary is working.
#I put in fake names earlier to find the missing ones
for i,num in enumerate(games):
    print(i,num,name_dict[num])


0 14996 Ticket to Ride: Europe
1 68448 7 Wonders
2 13 Catan
3 31260 Agricola
4 178900 Codenames
5 9209 Ticket to Ride
6 30549 Pandemic
7 129622 Love Letter
8 36218 Dominion
9 3076 Puerto Rico
10 2651 Power Grid
11 110327 Lords of Waterdeep
12 822 Carcassonne
13 478 Citadels
14 39856 Dixit
15 103 Titan
16 148228 Splendor
17 40692 Small World
18 11 Bohnanza
19 28143 Race for the Galaxy
20 34635 Stone Age
21 1927 Munchkin
22 15987 Arkham Horror
23 70323 King of Tokyo
24 33154 Wasabi
25 2163 Space Hulk
26 197376 Charterstone


In [7]:
#Number of reviews for each of the two games with missing names
msk1=df['gameID']==33154
msk2=df['gameID']==197376
print(np.sum(msk1),np.sum(msk2))

3699 91


### Number of reviews/user and reviews/game

In [8]:
avg_num_reviews=len(df)/len(users)
print(avg_num_reviews)

5.395331544405289


So on average, each user reviews 5 games.  Let's try to build a histogram of users with a given number of reviews.  (and then the same with games)

In [6]:
##others mentioned issues with looping. let's also try a straightforward approach.
##the following commented out code took a minute or two - untenable for only 1000 users!
##make em
#df_user=pd.DataFrame(columns=['Ngames'],index=users)
# for user in users[0:4]:
#     print(user)
#     print('Ngames=',np.sum(df['userID']==user))

#However, this version took a few seconds.
user_review_counts=df.groupby(['userID']).count()
#note that there really are users with ids going from 1 to 1000, it's not a screwup.

#On reflection, reshaping would be super helpful on this data set.
#Use 27 columns, one for each game, with each entry holding that user's score.

In [7]:
#plt.figure(figsize=(12,9))
plt.figure()
plt.hist(user_review_counts.iloc[:,0].values,bins=np.arange(1,29))  #integer bins covering 1..27 reviews
plt.xlabel('Number of Reviews')
plt.ylabel('Number of Users')
plt.title('Reviewer distribution: Number of reviews per user')
plt.show()

<matplotlib.figure.Figure at 0x7f795f62d4a8>

Lots of single-game reviewers, and then a long tail.  Relatively few people with more than 20 reviews.  Lots of scope to recommend new games to people.  Might have to trim out the 1-game reviewers when building the engine?
Also spikes at 8 and 14 games?  Those might be interesting to look at too.
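
To put exact numbers on the histogram (for instance, how many single-review users there are, or how tall the spikes at 8 and 14 are), one option is to count the counts. A minimal sketch on invented data, assuming the same `userID` column name:

```python
import pandas as pd

# Toy review table: one row per (user, game) review; values are made up.
toy = pd.DataFrame({
    "userID": [1, 1, 1, 2, 3, 3, 4, 5, 5, 5],
    "gameID": [13, 822, 11, 13, 13, 822, 11, 13, 822, 11],
})

# Number of reviews written by each user.
reviews_per_user = toy["userID"].value_counts()

# Number of users who wrote exactly k reviews, for each k.
users_per_count = reviews_per_user.value_counts().sort_index()

# Fraction of users with a single review.
share_single = (reviews_per_user == 1).mean()

print(users_per_count)
print(share_single)
```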

In [8]:
#find the counts of reviews for each game.
game_review_counts=df.groupby(['gameID']).count()
#make a list matching up gameIDs and names.  Use that list as a new index
new_index=[]
for ind in game_review_counts.index:
    new_index.append(name_dict[ind])

game_review_counts.index=new_index

In [9]:
plt.figure()
game_review_counts['rating'].plot(kind='bar')
plt.ylabel('Number of reviews')
plt.title('Number of reviews per game')
plt.show()

<matplotlib.figure.Figure at 0x7f795cef9588>

So, only a few games have fewer than 10000 reviews.  Charterstone (what I've previously called "NOTAGAME2: NOT HARDER") has only 91 reviews, far fewer than everything else.
From the point of view of the exercise, it's probably worth keeping that game, to let people consider the value in keeping that data point.
Otherwise, this looks fairly balanced, and includes a few games with relatively few reviews.  Even 91 seems like a lot, to be honest.

Let's also look at the distributions of scores.  I'll try to make a box-plot, which will let me check the distributions easily.
I'll pivot the data frame so that rows are users and columns are games, with the entries given by the score.

## Boxplots and Transforming the data

Rearranging the data to use the gameIDs as columns would make sense for recommendation.
For this data set, with 27 dimensions, that should be no problem.  (What is best to do with thousands of entries is another question.)
This would also make it easier to look at histograms on a per-game basis.
I'm nigh certain pandas has a reshape function to do exactly this.  Pivot maybe?
(http://pandas.pydata.org/pandas-docs/stable/reshaping.html)
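
For reference, a small sketch of the long-to-wide reshape on invented data. `DataFrame.pivot` needs each (user, game) pair to appear at most once; `pivot_table` is shown as one possible fallback when pairs repeat, with averaging chosen arbitrarily for the example.

```python
import pandas as pd

# Long format: one row per review (toy values).
long_df = pd.DataFrame({
    "userID": [1, 1, 2, 3, 3],
    "gameID": [13, 822, 13, 822, 30549],
    "rating": [8.0, 6.0, 7.0, 9.0, 5.0],
})

# Wide format: one row per user, one column per game, NaN where unrated.
wide = long_df.pivot(index="userID", columns="gameID", values="rating")
print(wide)

# If a (user, game) pair is repeated, pivot raises an error;
# pivot_table aggregates the repeats instead (here: their mean).
extra = long_df.iloc[[0]].assign(rating=4.0)
dup_df = pd.concat([long_df, extra], ignore_index=True)
wide_mean = dup_df.pivot_table(index="userID", columns="gameID",
                               values="rating", aggfunc="mean")
print(wide_mean)
```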

In [363]:
#make a small dataframe for debugging purposes
#df_small=df.iloc[0:1000]
df_pivot=df.pivot(index='userID',columns='gameID',values='rating')
df_pivot=df_pivot.rename(columns=name_dict)
#df_pivot.head()

In [19]:
#df_pivot.to_csv('data/boardgame-ratings-pivot.csv')
#?df.boxplot

In [33]:
plt.figure()
ax=df_pivot.boxplot(rot=90,grid=False)  #keep the axes; don't overwrite game_review_counts
plt.title('Score distributions by title')
plt.ylabel('Rating')
plt.show()

<matplotlib.figure.Figure at 0x7f795d609b00>

These mostly look positive.  Not any radically skewed distributions, like all 1 or all 10.  Of the games I've played (Pandemic, Catan), the high ratings seem about plausible.  The overall ratings skew high, as one might expect for popular games.  There's a smattering of 10s from fanatics, and 1s from people who hated the games.   

(I'll imitate a plot I saw the more experienced folk do at the first finance-data meetup)
Try a correlation map based on columns to see how close the score distributions are.
I think this intuitively corresponds to: How much are the score distributions in one game similar to another?
Running across the rows would yield something analogous for users (but would take an age, since that is a 1E5 x 1E5 matrix).


In [54]:
corr_mat=df_pivot.corr()

plt.figure()
plt.imshow(corr_mat)
plt.show()

<matplotlib.figure.Figure at 0x7f7956552668>

In [34]:
#Barfs out a huge matrix of the actual numbers
#print(corr_mat)
?df_pivot.corr

So what does that tell us?  Very little?  The one game with an obvious signal (and negative correlations)
is the one with 91 reviews.
The games with high correlations are "Ticket to Ride" and "Ticket to Ride: Europe", which might be sequels, or expansions?  Otherwise, the other correlations all hover around the 0.1-0.3 range.

As for building a dataset for recommendation engines, is the low correlation worrisome?  Not necessarily: a high correlation would imply that everyone likes the same games, in which case there is no space for a skillful recommendation.

The low correlation might also be an artifact of lots of reviewers with only a single review.  Those entries will have little correlation with anyone else, and may artificially lower the scores?  I also tried keeping only reviewers with more than a few scores - it did nothing to change the overall picture.
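
One way to see why dropping single-review users barely matters: pandas' `DataFrame.corr` excludes missing values pair by pair, so a user with only one rating never enters any off-diagonal entry. The sketch below checks this on synthetic data; the "taste" model, the 60% missing rate and the `min_periods=30` threshold are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_users = 500

# Synthetic scores: games A and B share a common "taste" factor, C does not.
taste = rng.normal(size=n_users)
wide = pd.DataFrame({
    "A": taste + rng.normal(scale=0.5, size=n_users),
    "B": taste + rng.normal(scale=0.5, size=n_users),
    "C": rng.normal(size=n_users),
})

# Knock out 60% of the entries at random to mimic sparse reviews.
wide = wide.mask(rng.random(wide.shape) < 0.6)

# Keep only users with at least two ratings.
active = wide[wide.notna().sum(axis=1) >= 2]

# min_periods sets the fewest shared raters a pair needs before it is reported.
corr_all = wide.corr(min_periods=30)
corr_active = active.corr(min_periods=30)

print(corr_all.round(2))
print(np.allclose(corr_all, corr_active, equal_nan=True))
```

Raising the cut-off to three or more ratings per user would change which pairs contribute, so this equivalence only holds for the one-review filter.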

In [94]:
#def reduced_corr(df)
Nrow,Ncol=df_pivot.shape

rcorr = np.zeros((Ncol,Ncol))
Ncorr = np.zeros((Ncol,Ncol))
Nreviews = np.zeros(Ncol)
mu  = df_pivot.mean(axis=0)
std = df_pivot.std(axis=0)

#compute scaled dataframe (z-scores using each game's overall mean and std)
scaled = ((df_pivot-mu)/std).values
#scaled = ((df_pivot-5.5)/10).values

Nmax=Ncol
#compute correlations between entries, only where both games have been rated.
#N.B. mu/std come from all of a game's ratings, not just the common raters,
#so this is only an approximation to the pairwise correlation.
for i in range(Nmax):
    mski = ~np.isnan(scaled[:,i])
    for j in range(i,Nmax):
        mskj = ~np.isnan(scaled[:,j])
        msk_tot = mski & mskj
        x = scaled[msk_tot,i]
        y = scaled[msk_tot,j]
        Ncommon=np.sum(msk_tot)
        c= np.dot(x,y)/(Ncommon-1)
        rcorr[i,j]=c
        rcorr[j,i]=c
        #store the number of common raters symmetrically
        Ncorr[i,j]=Ncommon
        Ncorr[j,i]=Ncommon

# #check that the correlation is measuring something like the dot-product between the distributions.
# x0=df_pivot.iloc[:,0].values
# x1=df_pivot.iloc[:,1].values

# x0_mu=np.nanmean(x0)
# x1_mu=np.nanmean(x1)


In [410]:
#find fraction of reviews within each integer bin (scores rounded to 1..10)
def game_hist(df):
    #count, per game, the ratings that round to each integer score
    counts={i:(df.round()==i).sum(axis=0) for i in range(1,11)}
    #rows are scores 1..10, columns are games
    df_counts=pd.DataFrame(counts).T
    #normalize so each game's column sums to 1
    df_counts=df_counts/df_counts.sum(axis=0)
    return df_counts


In [395]:
df_counts=game_hist(df_pivot)

Is there much similarity in the score distributions?  Not a really useful question to ask.
But the histogram plot is useful.

In [411]:
plt.imshow(df_counts)
plt.show()

<matplotlib.figure.Figure at 0x7f7920454c88>

In [412]:
plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
plt.imshow(corr_mat)
plt.colorbar()
plt.title('Pandas Correlation')
plt.subplot(1,3,2)
plt.imshow(rcorr)
plt.colorbar()
plt.title('"Corrected" Correlation')
plt.subplot(1,3,3)
plt.imshow(Ncorr)
plt.colorbar()
plt.title('Number of Common Reviews')
plt.show()

<matplotlib.figure.Figure at 0x7f792041dfd0>

In [106]:
i=5
j=10
plt.scatter(df_pivot.iloc[:,i],df_pivot.iloc[:,j])

<matplotlib.collections.PathCollection at 0x7f795463f400>

<matplotlib.figure.Figure at 0x7f79547d92b0>

The average pair-wise correlations between users' opinions of different games are really low.
Maybe a cluster analysis might yield something non-trivial?
This says that, on average, it's hard to guess what a person will think of one game from their rating of another (in terms of variations from the mean).

Looking over the numbers, the correlations seem to be based on 1000 ratings in common for any pair of games.

In [56]:
##make logical array for actual reviews.
#ntot=np.sum(df_pivot>0,axis=1)
##only keep those with more than 20 reviews.
#keep_msk=ntot>20
#df_pivot2=df_pivot[keep_msk]
#len(df_pivot2)

111

## Conclusions regarding state of data

The only thing that I really think needs fixing is the missing names for 33154 and 197376.

The data does suggest some interesting questions on its own on the population of boardgamers.
I suspect there is plenty of information that can be extracted here.  
There is also a matter of having some data cleaning/munging to do on the full data set.
Nothing looks too weird in the data.

- The number of reviews per user is skewed towards new folks (not unreasonable, given how few people can stick at something as time intensive as playing and reviewing board games).

- Looking at the box-plots, the scores seem fairly high, which tallies with what Matt said about picking popular games.  There doesn't seem anything obviously wrong with the distributions (all zero, or all 10s).

- I think for analysis, it would be beneficial to reshape the dataframe/array, but that is probably best left to the participants, as is removing any data with few reviews.  I used "pivot" to transform the gameID column, into a new set of columns, while keeping the reviewers as rows.  This will make building feature vectors straightforward.

- I tried building up some histograms via looping, and it was indeed quite slow.  In contrast, the arcane, built-in functions (groupby) are super fast.  The smaller dataset should allow accessibility to new people, while the full dataset is quite manageable if you find the right set of functions.  I haven't done any actual machine-learning with this yet, so maybe I'll eat my words about "manageable".

- The correlation plots seem to show a small, positive correlation.  Is this even a sensible measure?  It's something like the overlap between the shapes of the ratings distributions.

## Splitting into training/test

I'm going to manually force a training/test split.  I'm going to randomly select 10% of users.
Our goal is to recommend games people will like.  I'll try to test the predictions by holding back both some "new" users and some "new" games.
I thought about keeping back some game data, but that's fairly sparse, and I'm not sure about the distribution (how representative any one game is).
I think I'll try it anyway: select 2 games as holdouts.  I'm going to do that by hand (a game I've heard of, and a game that's new to me).

Some questions: do we have enough data for this to make sense?  I am mostly concerned by the number of games.
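
To make "testing the predictions" concrete, here is a hedged sketch of scoring a held-out set of users against a very simple baseline: predict every rating as that game's mean in the training set and report the RMSE. The synthetic data, the seed and the baseline itself are assumptions for illustration; any real recommender would need to beat this number to be worth using.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_users, n_games = 200, 6

# Synthetic user x game ratings on a 1-10 scale, with about half missing.
wide = pd.DataFrame(
    rng.integers(1, 11, size=(n_users, n_games)).astype(float),
    columns=[f"game_{k}" for k in range(n_games)],
)
wide = wide.mask(rng.random(wide.shape) < 0.5)

# Hold out roughly 10% of users, chosen at random (seeded for repeatability).
test_users = rng.random(n_users) < 0.1
train, test = wide[~test_users], wide[test_users]

# Baseline prediction: each game's mean rating over the training users.
game_means = train.mean(axis=0)

# Root-mean-square error over the ratings the test users actually gave.
errors = test.sub(game_means, axis=1)
rmse = float(np.sqrt(np.nanmean(errors.values ** 2)))

print(len(train), len(test), round(rmse, 2))
```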

In [394]:
Ngames=len(games)
Nusers=len(users)

ngame_select=2
nuser_select=int(len(users)/10)

#hold out 2 games chosen by hand (column positions in df_pivot)
game_fixed=[12,14]  #I've at least heard of Agricola, (rules heavy "Eurogame")
game_ix=np.zeros(Ngames,dtype=bool)
game_ix[game_fixed]=True

#boolean mask from uniform random numbers: True for roughly 10% of users
user_ix=np.random.random(size=Nusers)<0.1

#keep testing examples to test new users on old games, and new games on old users.
df_user_test=df_pivot.iloc[user_ix,~game_ix]
df_game_test=df_pivot.iloc[~user_ix,game_ix]

#keep only the non-testing examples (copy, so columns can be added later)
df_train = df_pivot.iloc[~user_ix,~game_ix].copy()

In [396]:
df_counts=game_hist(df_train)
plt.imshow(df_counts/np.sum(df_counts))
plt.show()

<matplotlib.figure.Figure at 0x7f79204ba208>

## Clustering

In [175]:
##Try some clustering to identify populations.
from sklearn.cluster import MiniBatchKMeans

In [397]:
#convert dataframe into a numpy array, with missing ratings set to 0
#(fillna returns a copy, so df_train itself keeps its NaNs)
df_mat=df_train.fillna(0).values

In [428]:
#Try minibatch Kmeans (as recommended)
Nclasses=4
km=MiniBatchKMeans(n_clusters=Nclasses)
km.fit(df_mat)
ypred=km.predict(df_mat)
df_train['Class']=ypred

In [429]:
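#now to try visualizing: one score histogram (score x game) per cluster
plt.figure(figsize=(15,6))
for i in range(Nclasses):
    plt.subplot(3,3,i+1)
    msk=df_train['Class']==i
    #drop the label column so it isn't counted as a game
    d0=game_hist(df_train[msk].drop(columns='Class'))
    plt.imshow(d0)
    plt.title('Class %d'%i)

plt.show()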

<matplotlib.figure.Figure at 0x7f79203a14a8>

In [426]:
df_train.columns[7]


0.23303188547375681

In [427]:
df_train.iloc[:,7].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x7f792039a048>

<matplotlib.figure.Figure at 0x7f79208728d0>

So this naive initial clustering seems to be choosing based on the score for game 7, "Power Grid".  What's special about that game?
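
One possible explanation, which I have not checked against the real data, is that filling missing ratings with 0 makes "rated vs not rated" the largest distance in the space, so a distance-based clustering splits on whether a game was reviewed rather than on taste. The sketch below illustrates that effect on synthetic data. `tiny_kmeans` is a hypothetical, plain Lloyd-style k-means with farthest-point initialisation written for this example; it is not scikit-learn's implementation.

```python
import numpy as np

def tiny_kmeans(X, k, n_iter=10):
    """Illustrative k-means: farthest-point start, then plain Lloyd updates."""
    centres = [X[0]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centre.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d)])
    centres = np.array(centres, dtype=float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres

rng = np.random.default_rng(1)
n_users = 300

# Three synthetic games, all scored 6..9 when rated.
X = rng.integers(6, 10, size=(n_users, 3)).astype(float)

# About half the users never rated game 1; missing is stored as 0, as in df_mat.
unrated = rng.random(n_users) < 0.5
X[unrated, 1] = 0.0

labels, centres = tiny_kmeans(X, k=2)

# The game whose cluster centres differ most is the one driving the split.
spread = centres.max(axis=0) - centres.min(axis=0)
splitting_game = int(np.argmax(spread))
print(spread.round(2), splitting_game)
```

If the same thing is happening here, comparing `km.cluster_centers_` column by column, or clustering on mean-centred ratings instead of zero-filled ones, would be natural next checks.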
