# Implementing SVD To Build a Subreddit Recommendation Engine

We're going to be implementing an SVD model to build a subreddit recommendation engine.

In this notebook, I'll walk through building the engine itself, using Reddit comment data that was collected earlier.

The source for this code is here: https://beckernick.github.io/matrix-factorization-recommender/

In [1]:
# First let's load some packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import sqlite3

In [2]:
# and now we'll load in the comment data

conn = sqlite3.connect("./reddit_rec_data.sqlite")
data = pd.read_sql("SELECT * FROM comment_data", con=conn)
data.rename(columns={'variable': 'subreddit', 'value': 'comments'}, inplace=True)

In [3]:
data.head()

Unnamed: 0,index,user,subreddit,comments
0,0,DCbean,r/10cloverfieldlane,1.0
1,1,fakedeepusername,r/1200isplenty,4.0
2,2,Shyguy380,r/13ReasonsWhy,1.0
3,3,jamjax12,r/13ReasonsWhy,5.0
4,4,Death215,r/2007scape,2.0


In [4]:
# now we need to pivot the data so that each row is a user and each column is a subreddit.
# let's start with just a sample of the data

sample = data.iloc[0:50000, :]
R_df = sample.pivot(index='user', columns='subreddit', values='comments').fillna(0)

In [6]:
R_df.head()

user,r/0xProject,r/100ballshack,r/100yearsago,r/1022,r/10cloverfieldlane,r/10mm,r/1200isjerky,r/1200isplenty,r/12Monkeys,r/13ReasonsWhy,...,r/zyzz,u/Cazazkq,u/Nintendo_America,u/OfficialValKilmer,u/Python422,u/Shitty_Watercolour,u/SmilsumKcuf,u/_BindersFullOfWomen_,u/maximumcrisis,u/washingtonpost
--ManBearPig--,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-Agathia-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-Big_Bad-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-Chrown-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-Claive-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
# Now we need to "de-mean" the ratings, that is, subtract each user's mean number of comments
# from that user's comment counts

R = R_df.values  # as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
R_means = np.mean(R, axis = 1)
R_demeaned = R - R_means.reshape(-1,1)

In [11]:
R_demeaned

array([[-0.00325648, -0.00325648, -0.00325648, ..., -0.00325648,
        -0.00325648, -0.00325648],
       [-0.00143285, -0.00143285, -0.00143285, ..., -0.00143285,
        -0.00143285, -0.00143285],
       [-0.00325648, -0.00325648, -0.00325648, ..., -0.00325648,
        -0.00325648, -0.00325648],
       ..., 
       [-0.00325648, -0.00325648, -0.00325648, ..., -0.00325648,
        -0.00325648, -0.00325648],
       [-0.00325648, -0.00325648, -0.00325648, ..., -0.00325648,
        -0.00325648, -0.00325648],
       [-0.00156311, -0.00156311, -0.00156311, ..., -0.00156311,
        -0.00156311, -0.00156311]])
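To make the de-meaning step concrete, here is a tiny sketch on a made-up 2x3 matrix. The numbers and the `R_toy` names are invented for illustration and are not taken from the Reddit data. After subtracting each row's mean, every row sums to zero.

```python
import numpy as np

# Hypothetical toy matrix: 2 users x 3 subreddits (made-up comment counts)
R_toy = np.array([[1.0, 0.0, 5.0],
                  [0.0, 0.0, 3.0]])

# Mean number of comments per user (one value per row)
row_means = R_toy.mean(axis=1)

# reshape(-1, 1) turns the means into a column so they broadcast across each row
R_toy_demeaned = R_toy - row_means.reshape(-1, 1)

print(row_means)
print(R_toy_demeaned)
```

Because the real matrix is mostly zeros, each user's mean is tiny, which is why the output above shows small negative values almost everywhere.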

# SVD

Now that the data is de-meaned, we can fit the SVD model.

In [12]:
from scipy.sparse.linalg import svds

U, sigma, Vt = svds(R_demeaned, k=50)

In [13]:
# svds returns sigma as a 1-D array of singular values, so we need to convert it to a matrix with sigma as the diagonal

sigma = np.diag(sigma)

In [18]:
# now, to get back to the predicted matrix, we take the dot product of all three matrices, and then add back
# the means
R_pred = np.dot(np.dot(U, sigma), Vt) + R_means.reshape(-1,1)
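
As a sanity check on the decompose-then-recompose idea, here is a minimal sketch on a made-up matrix whose rank is exactly 2. The matrix `A` is an assumption for illustration only. With `k=2`, the three factors should rebuild it almost exactly; on the real data, `k=50` is far below the true rank, so the reconstruction is only an approximation.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical 5x4 matrix built from two outer products, so its rank is exactly 2
A = (np.outer([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 0.0, 1.0, 0.0])
     + np.outer([1.0, 0.0, 1.0, 0.0, 1.0], [0.0, 2.0, 0.0, 1.0]))

# k must be smaller than min(A.shape); here k=2 captures the whole matrix
U, s, Vt = svds(A, k=2)

# Same recomposition as in the notebook: U * diag(sigma) * Vt
A_rebuilt = np.dot(np.dot(U, np.diag(s)), Vt)

print(U.shape, s.shape, Vt.shape)
print(np.abs(A - A_rebuilt).max())
```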

In [21]:
preds_df = pd.DataFrame(R_pred, columns = R_df.columns)

In [26]:
# we can evaluate the performance of the model by taking the RMSE between the original and reconstructed dataframes
from sklearn.metrics import mean_squared_error

RMSE = mean_squared_error(R_df, preds_df)**0.5
print(RMSE)

0.147133831456


What we should do now is split the data into training and test sets, and then look for the number of latent features k that minimizes RMSE on the test set. We'll come back to that.
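One way that search could look, as a rough sketch rather than the approach this notebook ends up using: hide a random fraction of the nonzero entries, fit the SVD on what is left, and score only the hidden entries. The function `holdout_rmse`, the 20% hold-out fraction, the candidate k values, and the random toy matrix are all assumptions made for illustration.

```python
import numpy as np
from scipy.sparse.linalg import svds

def holdout_rmse(R, k, test_frac=0.2, seed=0):
    """Hide a fraction of the nonzero entries, fit a rank-k SVD, and score the hidden ones."""
    rng = np.random.default_rng(seed)
    R = np.asarray(R, dtype=float)

    # Pick which observed (nonzero) entries to hold out
    rows, cols = np.nonzero(R)
    n_test = max(1, int(test_frac * len(rows)))
    picked = rng.choice(len(rows), size=n_test, replace=False)
    test_r, test_c = rows[picked], cols[picked]

    # Training matrix: held-out entries are zeroed, i.e. treated as "no comments"
    R_train = R.copy()
    R_train[test_r, test_c] = 0.0

    # Same pipeline as above: de-mean, factorize, recompose, add the means back
    means = R_train.mean(axis=1, keepdims=True)
    U, s, Vt = svds(R_train - means, k=k)
    R_pred = np.dot(np.dot(U, np.diag(s)), Vt) + means

    err = R_pred[test_r, test_c] - R[test_r, test_c]
    return float(np.sqrt(np.mean(err ** 2)))

# Made-up 40x25 count matrix standing in for the real user/subreddit matrix
R_toy = np.random.default_rng(42).poisson(0.5, size=(40, 25)).astype(float)

scores = {k: holdout_rmse(R_toy, k) for k in (2, 5, 10)}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

On the real data this would be called with `R_df.values`. One caveat with this sketch is that a zero in the training matrix means both "never commented" and "held out", so the held-out RMSE will tend to look pessimistic.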

For now, let's write up code that serves us recommendations based on the recomposed matrix we just built.

In [56]:
# DCbean is going to be my test person

data[data['user']=='DCbean']

Unnamed: 0,index,user,subreddit,comments
0,0,DCbean,r/10cloverfieldlane,1.0
263,263,DCbean,r/AskReddit,1.0
692,692,DCbean,r/Breath_of_the_Wild,1.0
860,860,DCbean,r/CrappyDesign,1.0
1671,1671,DCbean,r/MBMBAM,2.0
1825,1825,DCbean,r/Mommit,1.0
2368,2368,DCbean,r/Showerthoughts,1.0
3826,3826,DCbean,r/funny,10.0
4600,4600,DCbean,r/maximumfun,2.0
5242,5242,DCbean,r/pics,2.0


In [72]:
def recommend_subreddits(predictions_df, username, original_ratings_df, unpivoted_df, num_recommendations=10):
    # we want to get the index of the row of the new DF that corresponds to this user
    user_row = list(original_ratings_df.index).index(username)
    
    sorted_user_predictions = predictions_df.iloc[user_row].sort_values(ascending=False)
    
    # let's check out which subreddits have already been commented on
    commented = unpivoted_df[unpivoted_df['user']==username].subreddit.values
    sorted_user_predictions = sorted_user_predictions.reset_index()
    sorted_user_predictions.columns = ['subreddit', 'predicted_comments']
    
    # only want to get recs for subreddits the user hasn't already commented on
    recs = sorted_user_predictions[~sorted_user_predictions['subreddit'].isin(commented)]
    
    recs_limited = recs.iloc[0:num_recommendations, :]
    return recs_limited
    

In [73]:
recommend_subreddits(predictions_df=preds_df, username='DCbean', original_ratings_df=R_df, unpivoted_df=sample)

Unnamed: 0,subreddit,predicted_comments
1,r/videos,3.113572
2,r/gaming,2.022728
3,r/gifs,1.875062
5,r/todayilearned,1.579901
7,r/CringeAnarchy,0.756726
8,r/aww,0.654116
10,r/mildlyinteresting,0.578071
11,r/WTF,0.482566
12,r/hearthstone,0.379388
13,r/pokemongo,0.356606
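The function above looks up the user's row by position because `preds_df` has no user index. A possible alternative, sketched here on made-up data, is to give the predictions the same user index as the ratings and select by label instead. The name `recommend_by_label` and the toy numbers are hypothetical, and on the real data this would assume `preds_df` was built with `index=R_df.index`.

```python
import pandas as pd

def recommend_by_label(preds, ratings, username, n=10):
    """Return the top-n predicted subreddits the user has not commented on yet."""
    # Subreddits where the user already has at least one comment
    seen = ratings.columns[ratings.loc[username] > 0]
    # Drop those, then keep the n highest predictions
    return preds.loc[username].drop(seen).sort_values(ascending=False).head(n)

subs = ['r/aww', 'r/funny', 'r/gifs', 'r/pics']
users = ['alice', 'bob']

# Made-up comment counts and made-up predictions, for illustration only
ratings = pd.DataFrame([[1, 0, 0, 2], [0, 3, 0, 0]], index=users, columns=subs)
preds = pd.DataFrame([[0.9, 0.4, 0.7, 1.5], [0.2, 2.5, 0.1, 0.3]], index=users, columns=subs)

recs = recommend_by_label(preds, ratings, 'alice', n=2)
print(recs)
```

Selecting by label avoids the `list(...).index(username)` scan and no longer needs the unpivoted dataframe, at the cost of requiring the two frames to share an index.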


It seems like we may have a bit of a "Harry Potter effect" problem, with the really popular subreddits dominating the recommendations. I wonder if we could account for this by log-transforming the comment counts or normalizing them in some other way.
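As a first stab at the log-transform idea, here is a minimal sketch on made-up data. The `toy` frame and the `log_comments` column are assumptions for illustration, and I haven't checked whether this actually improves the recommendations. `np.log1p` computes log(1 + x), so zero comments stay at zero while very large counts get compressed.

```python
import numpy as np
import pandas as pd

# Made-up long-format data with the same columns as the notebook's `data`
toy = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b'],
    'subreddit': ['r/funny', 'r/MBMBAM', 'r/funny', 'r/pics'],
    'comments': [10.0, 2.0, 100.0, 1.0],
})

# Compress heavy commenters: 100 comments becomes about 4.6 instead of 100
toy['log_comments'] = np.log1p(toy['comments'])

# Pivot exactly as before, but on the transformed values
R_log = toy.pivot(index='user', columns='subreddit', values='log_comments').fillna(0)
print(R_log)
```

The rest of the pipeline (de-meaning, `svds`, recomposition) would then run on `R_log` instead of `R_df`, with predictions coming out on the log scale.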