# Content-based and collaborative filtering
### Author: Harris Dupre
### DATA 643 Spring 2020

### Introduction

In this project we use the "Jester" dataset (http://eigentaste.berkeley.edu/dataset/) to demonstrate two recommendation techniques: content-based filtering, using a TF-IDF word vectorizer on the joke texts, and collaborative filtering, using a matrix of user ratings of jokes.

### Load libraries, import data

In [24]:
import os
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt


df_jokes = pd.read_csv("/Users/harris/ds_masters/DATA643/Project2/joke_texts.csv")
df_ratings = pd.read_csv("/Users/harris/ds_masters/DATA643/Project2/joke_ratings.csv")


df_jokes

Unnamed: 0,joke_id,joke_text
0,1,Q. What's O. J. Simpson's web address? A. Slas...
1,2,How many feminists does it take to screw in a ...
2,3,Q. Did you hear about the dyslexic devil worsh...
3,4,They asked the Japanese visitor if they have e...
4,5,Q: What did the blind person say when given so...
...,...,...
134,135,"A blonde, brunette, and a red head are all lin..."
135,136,America: 8:00 - Welcome to work! 12:00 - Lunch...
136,137,It was the day of the big sale. Rumors of the ...
137,138,"Recently a teacher, a garbage collector, and a..."


In [25]:
df_ratings

Unnamed: 0,id,user_id,joke_id,Rating
0,31030_110,31030,110,2.750
1,16144_109,16144,109,5.094
2,23098_6,23098,6,-6.438
3,14273_86,14273,86,4.406
4,18419_134,18419,134,9.375
...,...,...,...,...
1092054,9517_132,9517,132,3.156
1092055,27767_118,27767,118,-1.594
1092056,10580_81,10580,81,2.000
1092057,31007_119,31007,119,8.906


### Content-based filtering

We will use TfidfVectorizer to weight the importance of the words and short phrases found in the joke_text column. With a matrix of these weights we can compute the cosine similarity between every pair of jokes and recommend the most "similar" jokes to the user.
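
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF followed by cosine similarity. It assumes the smoothed idf form and L2 normalization that scikit-learn applies by default. The helper names (`tfidf_vectors`, `cosine`) and the toy documents are hypothetical and are not part of the notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # raw term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * v[t] for t, w in u.items() if t in v)

# toy, already-tokenized "jokes" (hypothetical data, not from the Jester set)
docs = [
    ["lawyer", "plumber", "difference"],
    ["lawyer", "bmw", "rolex"],
    ["chicken", "road", "cross"],
]
vecs = tfidf_vectors(docs)
sim_lawyers = cosine(vecs[0], vecs[1])    # the two documents share "lawyer"
sim_unrelated = cosine(vecs[0], vecs[2])  # the two documents share no terms
print(sim_lawyers, sim_unrelated)
```

Because every vector has unit length, a plain dot product already equals the cosine similarity, which is why a linear kernel can stand in for a cosine similarity function on TF-IDF output.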

In [26]:
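# create a TfidfVectorizer using the 'word' analyzer to extract unigrams, bigrams, and trigrams
# (min_df=1 is the default and keeps every term; newer scikit-learn releases reject an integer min_df of 0)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
# create a matrix of each n-gram and its tf-idf score relative to each joke
tfidf_matrix = tf.fit_transform(df_jokes['joke_text'])
# tf-idf rows are L2-normalized by default, so the linear kernel (dot product) equals the cosine similarity
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
# for each joke, store the most similar other jokes as (score, joke_id) pairs
results = {}
for idx, row in df_jokes.iterrows():
    # indices of the 99 highest scores in descending order (the joke itself normally comes first)
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], df_jokes['joke_id'][i]) for i in similar_indices]
    # drop the first entry, which is the joke itself
    results[row['joke_id']] = similar_items[1:]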
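
The ranking step in the loop above relies on a negative-step slice that is easy to misread. The sketch below reproduces it on a single row of made-up similarity scores, using `sorted` as a pure-Python stand-in for `argsort()`. The scores and variable names are hypothetical.

```python
# similarity scores of item 2 against all five items (hypothetical values);
# position 2 holds the item's similarity with itself
scores = [0.10, 0.05, 1.00, 0.30, 0.00]

# indices sorted by ascending score, the pure-Python analogue of argsort()
order = sorted(range(len(scores)), key=lambda i: scores[i])

# [:-4:-1] walks backwards from the end and stops before the 4th-from-last
# element, so it yields the 3 highest-scoring indices, best first
top3 = order[:-4:-1]

# the best match is the item itself, so it is dropped
neighbors = top3[1:]
print(top3, neighbors)

# the same arithmetic for a [:-100:-1] slice over 138 items:
# 99 indices are kept, which leaves 98 after dropping the item itself
n_kept = len(list(range(138))[:-100:-1])
print(n_kept)
```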



#### Content-based recommender

In [27]:
# look up the text of a joke by its joke_id
def item(joke_id):
    return df_jokes.loc[df_jokes['joke_id'] == joke_id, 'joke_text'].tolist()[0].split(' - ')[0]

# print the num jokes most similar to the selected joke
def recommend(item_id, num):
    print("Recommending " + str(num) + " jokes similar to " + item(item_id) + "...")
    print("-------")
    recs = results[item_id][:num]
    for rec in recs:
        print("Recommended: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")
        print("--------")

# 5 jokes most similar to joke_id #20
recommend(20, 5)

Recommending 5 jokes similar to Q: What's the difference between a lawyer and a plumber? A: A plumber works to unclog the system....
-------
Recommended: Q: What's the difference between the government and the Mafia? A: One of them is organized. (score:0.04199457862314348)
--------
Recommended: A lawyer opened the door of his BMW, when suddenly a car came along and hit the door, ripping it off completely. When the police arrived at the scene, the lawyer was complaining bitterly about the damage to his precious BMW. "Officer, look what they've done to my Beeeeemer!" he whined. "You lawyers are so materialistic, you make me sick!" retorted the officer. "You're so worried about your stupid BMW that you didn't even notice your left arm was ripped off!" "Oh my gaaaad..." replied the lawyer, finally noticing the bloody left shoulder where his arm once was. "Where's my Rolex?!" (score:0.04106215242864088)
--------
Recommended: Q: What is the difference between George Washington, Richard Nixon

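At first glance, the phrase "What's the difference" appears in many of these jokes. Because English stop words are removed before vectorizing, the match is most likely driven by the word "difference" rather than by the full trigram.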
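
To see which n-grams can actually act as features for such a joke, the sketch below imitates the vectorizer's analysis step in plain Python. It lowercases the text, keeps tokens of two or more word characters, drops stop words, and then builds 1- to 3-grams. The `STOP_WORDS` set is a small illustrative subset, assumed to be contained in scikit-learn's built-in English list, and `analyze` is a hypothetical helper.

```python
import re

# illustrative subset of English stop words (assumption: all four appear in
# scikit-learn's built-in 'english' list)
STOP_WORDS = {"what", "the", "between", "and"}

def analyze(text, max_n=3):
    """Lowercase, keep tokens of 2+ word characters, drop stop words, build 1..max_n-grams."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower())
              if t not in STOP_WORDS]
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

features = analyze("Q: What's the difference between a lawyer and a plumber?")
print(features)
```

Under these assumptions only "difference", "lawyer", "plumber" and their combinations survive, so two "What's the difference" jokes overlap on the single term "difference".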

### User-user collaborative filtering

We will now use the user ratings to recommend items. Each joke is represented by the vector of ratings it received from all users, and jokes whose rating vectors are close to one another are treated as similar.



In [29]:
# pivot the ratings table so that rows are jokes and columns are users, filling any NaN
# (where a user did not rate a particular joke) with 0; note that 0 is also a valid,
# neutral rating on Jester's -10 to 10 scale
jokes_users = df_ratings.pivot(index='joke_id', columns='user_id', values='Rating').fillna(0)

jokes_users

user_id,1,2,3,4,5,6,7,8,9,10,...,40854,40855,40856,40857,40858,40859,40860,40861,40862,40863
joke_id
1,0.219,-9.688,0.000,6.906,-0.031,0.000,6.219,8.250,-5.750,-7.156,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
2,-9.281,9.938,0.000,0.000,0.000,-2.344,-7.438,9.000,0.281,0.000,...,9.844,0.000,-5.719,0.094,-0.125,-1.750,0.000,-4.594,-4.438,-7.906
3,0.000,9.531,-7.219,-5.906,0.000,0.000,0.000,8.875,0.781,0.000,...,0.000,-3.719,-8.156,0.000,0.000,-0.094,-7.375,-4.312,1.531,0.000
4,-6.781,9.938,-2.031,0.000,7.500,-0.969,-3.438,0.000,0.000,-5.500,...,0.000,-3.531,0.000,0.094,0.000,0.000,9.688,3.000,0.000,-7.594
5,0.875,0.406,-9.938,0.000,-7.219,0.000,0.531,9.375,0.000,-6.000,...,0.000,0.000,0.000,-2.656,0.000,8.219,0.000,3.938,-9.156,-6.375
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
135,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,8.375
136,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,8.938
137,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,8.281
138,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,3.062,0.000,0.000,0.000,0.000,0.000,0.000,0.000
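
Two common choices for the missing entries of such a matrix are 0 and the global mean rating. The following sketch builds a tiny joke-by-user matrix from rating triples and contrasts the two. The data and the `pivot` helper are hypothetical and only illustrate the reshaping, not the pandas implementation.

```python
# (user_id, joke_id, rating) triples; hypothetical values on a -10..10 scale
ratings = [(1, 1, 6.0), (1, 2, -4.0), (2, 1, 8.0), (3, 2, 2.0)]

def pivot(ratings, fill):
    """Build {joke_id: [rating per user]} with `fill` where a user gave no rating."""
    users = sorted({u for u, _, _ in ratings})
    jokes = sorted({j for _, j, _ in ratings})
    known = {(j, u): r for u, j, r in ratings}
    return {j: [known.get((j, u), fill) for u in users] for j in jokes}

# missing ratings become 0, which is indistinguishable from a neutral rating
zero_filled = pivot(ratings, fill=0.0)

# missing ratings become the average of all observed ratings
global_mean = sum(r for _, _, r in ratings) / len(ratings)
mean_filled = pivot(ratings, fill=global_mean)

print(zero_filled)
print(mean_filled)
```

Filling with 0 keeps the matrix sparse, which suits a sparse representation, while mean filling avoids treating "not rated" as a neutral opinion but makes the matrix dense.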


In [30]:
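# graphically represented: histogram of how many jokes each user actually rated
# (assumes a plot was intended; non-zero entries are counted because missing ratings were filled with 0)
plt.hist((jokes_users != 0).sum().values, bins=50)
plt.show()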

In [None]:
# convert into a sparse matrix for more efficient calculation as most of the data is zeroes.
mat_jokes_users = csr_matrix(jokes_users.values)

# create a nearest neighbors model using cosine similarity and the brute-force algorithm (computing the distance
# between all pairs of points in the dataset)
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20)
model_knn.fit(mat_jokes_users)
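
Conceptually, a brute-force nearest-neighbors search with the cosine metric compares the query row against every other row and keeps the closest ones. The sketch below does this in plain Python on a small hypothetical matrix. The function names are illustrative, and the handling of all-zero rows is an assumption rather than scikit-learn's documented behavior.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros (an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def nearest_rows(matrix, query_row, k):
    """Return the k row positions closest to matrix[query_row], excluding the row itself."""
    dists = [(cosine_distance(matrix[query_row], row), i)
             for i, row in enumerate(matrix) if i != query_row]
    return [i for _, i in sorted(dists)[:k]]

# rows are jokes, columns are users (hypothetical ratings, 0 = not rated)
matrix = [
    [5.0, 4.0, 0.0, -3.0],
    [4.0, 5.0, 0.0, -2.0],   # rated much like row 0
    [-5.0, -4.0, 1.0, 3.0],  # rated opposite to row 0
    [0.0, 0.0, 6.0, 0.0],    # rated only by a user who skipped row 0
]
print(nearest_rows(matrix, 0, 2))
```

Cosine distance depends on the direction of the rating vectors rather than their length, so a joke rated by many users and one rated by few can still be close if the ratings they share point the same way.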



#### User-user collaborative recommender

In [31]:
# recommender takes joke id, the sparse matrix, the knn model, and number of recommendations desired
def recommender(joke_id, data, model, n_recommendations):
    model.fit(data)
    print('Joke Selected: ', df_jokes.loc[df_jokes['joke_id'] == joke_id, 'joke_text'])
    # rows of the matrix are 0-based positions in the pivot's joke_id index, so translate the id first
    row = jokes_users.index.get_loc(joke_id)
    # ask for one extra neighbor because the closest match is normally the selected joke itself
    distances, indices = model.kneighbors(data[row], n_neighbors=n_recommendations + 1)
    # translate row positions back to joke ids, skip the selected joke, and keep distance order
    neighbor_ids = [j for j in jokes_users.index[indices.flatten()] if j != joke_id][:n_recommendations]
    print(df_jokes.set_index('joke_id').loc[neighbor_ids, 'joke_text'])

recommender(65, mat_jokes_users, model_knn, 5)
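
A sparse matrix built from a DataFrame's values carries no labels, so its rows are addressed by 0-based position while jokes are identified by joke_id. A minimal sketch of keeping the two aligned is shown below. The ids and helper names are hypothetical, and the ids are assumed to be contiguous from 1.

```python
# joke ids in the order they appear as rows of the matrix (hypothetical ids,
# assumed contiguous from 1)
joke_ids = [1, 2, 3, 4, 5]

# map each joke id to its 0-based row position
row_of = {joke_id: row for row, joke_id in enumerate(joke_ids)}

def neighbors_to_ids(neighbor_rows):
    """Translate row positions returned by a neighbor search back into joke ids."""
    return [joke_ids[r] for r in neighbor_rows]

query_row = row_of[3]             # joke 3 is stored in row 2, not row 3
found = neighbors_to_ids([4, 0])  # rows 4 and 0 correspond to jokes 5 and 1
print(query_row, found)
```

Indexing the matrix directly with a joke id would query the following joke instead, so translating in both directions keeps the selected joke and its recommendations consistent.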


### Comparing the two approaches

Content-based filtering lets us base recommendations on item content alone. No user ratings are needed to place a new item, so it can be recommended as soon as its text is available.

Collaborative filtering, by contrast, relies only on ratings, so it can be applied to items whose content is too sparse to compute a meaningful similarity.

In this situation, content-based filtering seemed to find a similar "type" of joke. Jokes don't have genres so much as they have similar setups or subjects. In this project's example, we found other "what's the difference" jokes and one other lawyer joke. Since a lawyer joke will almost always contain the word "lawyer", term matching is a fairly reliable way of finding and recommending jokes on the same subject.

But this has little to do with the quality of a joke. A person might find one lawyer joke funny but dislike others. Collaborative filtering appears better suited to capturing a shared sense of humor, since it recommends jokes based on the ratings users actually recorded. It is difficult to quantify what makes two jokes similar in terms of how funny different users found them, and it could be argued that finding one joke funnier than another is practically random.