<a href="https://colab.research.google.com/github/joedockrill/jester-collab-filtering/blob/master/JesterModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# fastai collab model for jester

hello children. today we're going to build a joke recommendation engine.

we're going to use the jester dataset from http://eigentaste.berkeley.edu/dataset/ - dataset 1 contains 100 jokes and ~4 million ratings from ~73,400 users, which should be plenty to play with.

we're going to train a collab filtering model on the data and use it to power a joke recommendation engine, which will use the ratings of previous users to match the current user with jokes they're likely to rate highly.

we will do this by putting jokes in front of the user and asking them to rate each one, using the ratings they give us each time to make more accurate predictions about how they are likely to rate the remaining jokes. as we keep putting new jokes in front of them, the jokes should become less random and more "up their street".
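
a minimal sketch of the selection step in that loop, assuming we already have a predicted rating for every joke. `next_joke`, `predicted` and `seen` are made-up names for illustration, not part of this notebook; the real predictions would come from the model trained further down.

```python
def next_joke(predicted, seen):
    """Pick the unseen joke with the highest predicted rating.

    predicted: dict of joke_id -> predicted rating (hypothetical input)
    seen:      set of joke_ids the user has already rated
    """
    unseen = {joke: score for joke, score in predicted.items() if joke not in seen}
    if not unseen:
        return None  # the user has rated everything
    return max(unseen, key=unseen.get)


# toy predictions for three jokes; the user has already rated joke 2
predicted = {1: 2.0, 2: 5.0, 3: 4.0}
print(next_joke(predicted, seen={2}))
```

after each new rating you would refresh `predicted`, add the joke to `seen`, and ask again.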

In [None]:
import pandas as pd
from fastai.collab import *
from fastai.tabular import *

import shutil
from google.colab import drive

drive.mount('/content/drive')
DRIVE_DIR = "/content/drive/My Drive/fastai-v3/jester/"

# data cleaning

the jokes come as 100 separate html files and the ratings are in 3 different files, so let's fix that.

ratings files 1 and 2 both contain users who have rated >= 36 jokes, file 3 is users who have rated 15 to 35.

the ratings files are N x 101, one row per user, the first col is the number of jokes that user has rated, the other 100 are the joke ratings. ratings go from -10 up to 10. 99 means null / not rated.
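
to make the row layout concrete, here's a small sketch that splits one row into its count and its actual ratings. `parse_row` and `NOT_RATED` are hypothetical names, and the example row is made up and shortened to 4 jokes instead of 100.

```python
NOT_RATED = 99  # sentinel the ratings files use for "no rating"


def parse_row(row):
    """Split one ratings row into (n_rated, {joke_id: rating})."""
    n_rated, ratings = int(row[0]), row[1:]
    # joke ids are 1-based, matching the joke_1..joke_100 column names
    rated = {joke_id: r for joke_id, r in enumerate(ratings, start=1)
             if r != NOT_RATED}
    return n_rated, rated


# made-up row: the user rated jokes 2 and 4 only
n_rated, rated = parse_row([2, 99, -7.77, 99, 6.7])
print(n_rated, rated)
```

the first column should always equal the number of non-99 values in the rest of the row, which makes a cheap sanity check when loading the files.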

the sub-matrix containing cols [5, 7, 8, 13, 15, 16, 17, 18, 19, 20] is dense: these are jokes which almost all users have rated. this is useful for cold-starting our recommendation engine once we have built a model, because showing a new user jokes hardly anyone has rated won't tell us much about them. (luckily this dataset is pretty dense anyway)
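
one way to check which jokes are densely rated is to measure, per joke, the fraction of users who rated it. this sketch runs on a made-up 4 x 3 matrix in the same layout (`toy` and `coverage` are illustrative names); on the real data the same lines would run over the full ratings dataframe.

```python
import pandas as pd

# made-up matrix: 4 users x 3 jokes, 99 = not rated
toy = pd.DataFrame({
    "n_rated": [3, 2, 2, 1],
    "joke_1":  [1.5, 99.0, -3.0, 99.0],
    "joke_2":  [4.0, 2.5, 7.0, -1.0],
    "joke_3":  [-8.0, 0.5, 99.0, 99.0],
})

ratings_only = toy.drop(columns="n_rated")
ratings_only = ratings_only.where(ratings_only != 99)  # 99 -> NaN

# fraction of users who rated each joke, densest first
coverage = ratings_only.notna().mean().sort_values(ascending=False)
print(coverage)
```

jokes with coverage close to 1.0 are the ones nearly everyone has rated, so they are the natural candidates to show a brand new user first.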

another cold-start idea might be to try starting with some of the more contentious jokes. we'll see...
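
one plausible way to score "contentious" is the standard deviation of a joke's ratings: a joke people either love or hate has a wide spread. that choice of measure is an assumption, not something the dataset defines, and the ratings below are made up (with the 99s already dropped).

```python
from statistics import stdev

# made-up ratings per joke id, unrated entries already removed
ratings_by_joke = {
    1: [9.0, -9.0, 8.0, -8.0],  # splits the room
    2: [3.0, 2.5, 3.5],         # everyone roughly agrees
}

# sample standard deviation of each joke's ratings
spread = {joke: stdev(ratings) for joke, ratings in ratings_by_joke.items()}
most_contentious = max(spread, key=spread.get)
print(most_contentious, spread)
```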

In [None]:
!wget -q http://eigentaste.berkeley.edu/dataset/jester_dataset_1_joke_texts.zip
!wget -q http://eigentaste.berkeley.edu/dataset/jester_dataset_1_1.zip
# !wget -q http://eigentaste.berkeley.edu/dataset/jester_dataset_1_2.zip
# !wget -q http://eigentaste.berkeley.edu/dataset/jester_dataset_1_3.zip

In [None]:
!unzip -q jester_dataset_1_joke_texts.zip 'jokes/*'
!unzip -q jester_dataset_1_1.zip
# !unzip -q jester_dataset_1_2.zip
# !unzip -q jester_dataset_1_3.zip

# Clean ratings

First we concat them together. Next we need to flatten them out because we have a dense matrix which is n_users x n_jokes and fastai wants user,thing,rating rows.

There are ~~a little over 4 million ratings so flattening it does take around 2 mins to run~~ a lot less now since I'm only using 1 file of 3.

We keep both versions because we're going to use the matrix later.

In [None]:
cols = ["n_rated"]
for i in range(1, 101): cols.append("joke_" + str(i))

# frames = []
# for i in range (1,4):
#   frames.append(pd.read_excel("jester-data-{}.xls".format(i), 
#                               header=None, names=cols))

# df_ratings = pd.concat(frames); frames = None
# df_ratings.reset_index(drop=True, inplace=True)

df_ratings = pd.read_excel("jester-data-1.xls", header=None, names=cols)
df_ratings.to_csv("ratings-matrix.csv", index=False)

In [None]:
df_ratings.tail()

Unnamed: 0,n_rated,joke_1,joke_2,joke_3,joke_4,joke_5,joke_6,joke_7,joke_8,joke_9,joke_10,joke_11,joke_12,joke_13,joke_14,joke_15,joke_16,joke_17,joke_18,joke_19,joke_20,joke_21,joke_22,joke_23,joke_24,joke_25,joke_26,joke_27,joke_28,joke_29,joke_30,joke_31,joke_32,joke_33,joke_34,joke_35,joke_36,joke_37,joke_38,joke_39,...,joke_61,joke_62,joke_63,joke_64,joke_65,joke_66,joke_67,joke_68,joke_69,joke_70,joke_71,joke_72,joke_73,joke_74,joke_75,joke_76,joke_77,joke_78,joke_79,joke_80,joke_81,joke_82,joke_83,joke_84,joke_85,joke_86,joke_87,joke_88,joke_89,joke_90,joke_91,joke_92,joke_93,joke_94,joke_95,joke_96,joke_97,joke_98,joke_99,joke_100
24978,100,0.44,7.43,9.08,2.33,3.2,6.75,-8.79,-0.53,-8.74,7.23,-0.53,5.63,-7.14,-4.08,-3.5,-8.2,-3.98,-9.22,-0.15,-6.46,5.63,-0.92,-2.91,-4.17,2.82,3.4,8.64,6.84,6.8,-0.87,7.38,-3.5,8.88,7.43,5.39,2.23,-0.68,3.4,-0.58,...,8.59,3.45,0.87,9.27,-4.66,5.73,-0.49,8.35,1.94,5.0,-9.66,8.98,8.98,-9.81,9.13,9.08,9.08,3.98,0.73,9.03,8.98,9.22,8.93,9.13,9.27,-1.99,-9.95,-9.9,9.13,8.83,8.83,-1.21,9.22,-6.7,8.45,9.03,6.55,8.69,8.79,7.43
24979,91,9.13,-8.16,8.59,9.08,0.87,-8.93,-3.5,5.78,-8.11,4.9,8.88,-8.69,-7.48,-8.83,-1.75,6.6,3.54,1.5,7.67,-0.44,9.22,8.74,9.03,9.08,8.93,3.74,3.2,-9.17,-8.98,8.79,-7.67,-3.06,9.13,8.4,-0.63,-7.18,0.58,8.88,9.27,...,2.77,8.11,-7.96,8.93,-0.87,-5.87,8.88,-1.12,-8.74,8.74,99.0,99.0,99.0,99.0,99.0,4.9,99.0,99.0,99.0,99.0,-0.29,0.92,-0.78,0.15,-0.1,0.0,-0.19,-0.87,-1.36,-0.58,-1.17,-5.73,-1.46,0.24,9.22,-8.2,-7.23,-8.59,9.13,8.45
24980,39,99.0,99.0,99.0,99.0,-7.77,99.0,6.7,-6.75,99.0,99.0,99.0,99.0,-6.46,-1.65,-6.8,-6.41,-6.99,7.23,6.75,-6.99,6.55,99.0,99.0,99.0,99.0,0.49,-0.53,-6.94,-0.49,99.0,6.46,-0.53,99.0,99.0,-7.86,-0.34,99.0,-6.94,99.0,...,0.49,-0.24,99.0,99.0,-3.11,-6.65,99.0,-0.58,6.31,99.0,99.0,-7.86,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
24981,37,99.0,99.0,99.0,99.0,-9.71,99.0,4.56,-8.3,99.0,99.0,99.0,99.0,-9.47,99.0,3.45,-0.92,-4.51,-4.13,-5.73,-9.51,2.82,99.0,99.0,99.0,99.0,-0.49,2.91,2.62,8.3,99.0,3.06,5.44,99.0,99.0,-0.68,2.04,99.0,99.0,1.55,...,-8.83,-0.78,99.0,99.0,4.51,-2.48,99.0,1.26,5.78,99.0,99.0,99.0,99.0,99.0,-4.56,99.0,99.0,99.0,99.0,3.16,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
24982,72,2.43,2.67,-3.98,4.27,-2.28,7.33,2.33,4.56,6.75,4.61,-3.16,7.38,-8.2,9.08,-8.83,-7.77,5.49,1.36,-9.32,7.04,7.28,3.2,-0.05,-1.26,6.94,5.49,1.21,5.0,7.38,2.33,3.35,6.17,-4.81,3.79,6.26,8.54,5.29,1.12,0.83,...,6.17,-0.29,0.83,4.22,4.27,7.38,6.21,7.48,5.15,3.2,6.26,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,7.23,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0


In [None]:
def flatten_ratings(df_dense):
  rows = []
  
  for index, row in df_dense.iterrows():
    for i in range(1, 101):
      rating = row["joke_{}".format(i)]
      if(rating != 99):
        rows.append({"user_id":index, "joke_id":i, "rating":rating})

  df = pd.DataFrame(rows)
  return df

df_flattened = flatten_ratings(df_ratings)
df_flattened.to_csv("ratings-flattened.csv", index=False)
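
the nested loop above is easy to follow but slow on a big matrix. as an alternative sketch (not what this notebook runs), pandas `melt` can do the same reshaping in one pass. `flatten_ratings_melt` is a hypothetical name and the toy frame is made up; it should give the same user_id, joke_id, rating columns, though it hasn't been timed against the original here.

```python
import pandas as pd


def flatten_ratings_melt(df_dense):
    """Vectorised sketch of flatten_ratings: one row per (user, joke) rating."""
    joke_cols = [c for c in df_dense.columns if c.startswith("joke_")]
    long = (df_dense[joke_cols]
            .rename_axis("user_id")   # the row index is the user id
            .reset_index()
            .melt(id_vars=["user_id"], var_name="joke_id", value_name="rating"))
    long = long[long["rating"] != 99].copy()            # drop "not rated"
    long["joke_id"] = long["joke_id"].str[len("joke_"):].astype(int)
    return long.sort_values(["user_id", "joke_id"]).reset_index(drop=True)


# made-up 2 users x 2 jokes matrix in the same layout
toy = pd.DataFrame({"n_rated": [2, 1],
                    "joke_1": [1.5, 99.0],
                    "joke_2": [-3.0, 4.0]})
flat = flatten_ratings_melt(toy)
print(flat)
```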

In [None]:
df_flattened.tail()

Unnamed: 0,user_id,joke_id,rating
1810450,24982,68,7.48
1810451,24982,69,5.15
1810452,24982,70,3.2
1810453,24982,71,6.26
1810454,24982,87,7.23


In [None]:
!rm *.zip
!rm *.xls

# Clean jokes

the joke files are html documents and as such contain \<P\> tags for formatting the jokes so they display nicely. i've left them in because i intend to display them with a jupyter html widget anyway. you may wish to remove them.
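
if you do want them gone, a small regex is enough. this is a sketch, assuming the only markup left in the joke text is `<P>` / `</P>` tags; `strip_p_tags` is a made-up helper and the sample string is not from the dataset.

```python
import re


def strip_p_tags(joke_html):
    """Replace <P> / </P> tags (any case) with newlines and tidy whitespace."""
    text = re.sub(r"</?p\s*>", "\n", joke_html, flags=re.IGNORECASE)
    text = re.sub(r"\n\s*\n+", "\n", text)  # collapse runs of blank lines
    return text.strip()


print(strip_p_tags("Q. made-up setup? <P>\nA. made-up punchline."))
```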

In [None]:
import re

# (?:...) makes the begin/end comment markers non-capturing groups,
# so the joke text is the only captured group
ptn = r"(?:<!--begin of joke -->\n)([\w\W]*)(?:<!--end of joke -->)"
p = re.compile(ptn, re.IGNORECASE | re.MULTILINE)

jokes = []

for i in range(1, 101):
  with open("jokes/init{}.html".format(i), mode="r") as fs:
    html = fs.read()
    m = p.search(html)
    assert m is not None, "i fail to find anything funny here. #dadjokes"
    jokes.append({"joke_num":i, "joke":m.group(1)})
    
df_jokes = pd.DataFrame(jokes)
df_jokes.to_csv("jokes.csv")

In [None]:
df_jokes.head()

Unnamed: 0,joke_num,joke
0,1,"A man visits the doctor. The doctor says ""I ha..."
1,2,This couple had an excellent relationship goin...
2,3,Q. What's 200 feet long and has 4 teeth? <P>\n...
3,4,Q. What's the difference between a man and a t...
4,5,Q.\tWhat's O. J. Simpson's Internet address? <...


In [None]:
!rm -r jokes

# Build a collab filtering model


In [None]:
# training with no val set due to a current issue with predictions afterwards.  
data = CollabDataBunch.from_df(df_flattened, seed=123, valid_pct=0)
learn = collab_learner(data, n_factors=50, y_range=[-10.5, 10.5], wd=1e-1)

In [None]:
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(3, 5e-3)

epoch,train_loss,valid_loss,time
0,20.716496,#na#,02:42
1,19.083994,#na#,02:43
2,17.782064,#na#,02:43


In [None]:
learn.save("stage1")

In [None]:
learn.export(DRIVE_DIR + "jester.pkl")
shutil.copyfile("jokes.csv", DRIVE_DIR + "jokes.csv")
shutil.copyfile("ratings-flattened.csv", DRIVE_DIR + "ratings-flattened.csv")
shutil.copyfile("ratings-matrix.csv", DRIVE_DIR + "ratings-matrix.csv")

'/content/drive/My Drive/fastai-v3/jester/ratings-matrix.csv'

# Test predictions


In [None]:
learn.export("model.pkl")
learn = load_learner(".", "model.pkl")  # fastai v1 signature is load_learner(path, file)

In [None]:
import functools
import time

def timed(func):
  """Time the function and print the run time"""
  @functools.wraps(func)
  def wrapper(*args, **kwargs):
    start = time.perf_counter()
    retval = func(*args, **kwargs)
    end = time.perf_counter(); rt = end - start
    print("@timed:", func.__name__, "took", "{:.3f}".format(rt), "secs")
    return retval
  return wrapper

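def pred_string_batch(user):
  """Predict all 100 jokes for one user in a single batch; returns the preds as a "/"-separated string"""
  rows = [{"user_id":user, "joke_id":joke, "rating":0.} for joke in range(1,101)]
  df = pd.DataFrame(rows); s = ""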

  learn.data = CollabDataBunch.from_df(df, test=df, no_check=True)
  preds,_ = learn.get_preds(DatasetType.Test)
  for pred in preds:
    s += str(pred.item()) + "/"
    
  return s

def pred_string_1by1(user):
  """Same output as pred_string_batch, but calls learn.predict once per joke"""
  rows = [{"user_id":user, "joke_id":joke, "rating":0.} for joke in range(1,101)]
  df = pd.DataFrame(rows); s = ""

  for index, row in df.iterrows():
    _,_,pred = learn.predict(df.loc[index])
    s += str(pred.item()) + "/"
    
  return s

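@timed
def test(func, times, user_id=0):
  """Call func `times` times and count how often each prediction string occurs.

  With user_id=0 each call uses a different user (the loop index);
  otherwise every call uses the given user_id."""
  preds = {}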

  for i in range(0,times):
    if(user_id == 0): s = func(i)
    else:             s = func(user_id)

    if(s in preds): preds[s] += 1
    else:           preds[s] = 1
    if((i+1) % 100 == 0): print("Done", i+1)

  return preds


In [None]:
preds = test(pred_string_batch, 100)
print(len(preds), "unique sets of preds (different users)")
preds = test(pred_string_batch, 100, user_id = 50)
print(len(preds), "unique sets of preds (same user)")

In [None]:
preds = test(pred_string_1by1, 10)
print(len(preds), "unique sets of preds (different users)")
preds = test(pred_string_1by1, 10, user_id=50)
print(len(preds), "unique sets of preds (same user)")
