# DATA 612 Project 2 - Joke Recommender System Part II

By Mike Silva

## Introduction

This notebook continues building a recommender system that suggests jokes users will find funny. Serving jokes that match a user's taste should keep users engaged on the website longer.

### About the Jester Dataset

For this project I will be using the [Jester dataset](http://eigentaste.berkeley.edu/dataset/). It was created by Ken Goldberg at UC Berkeley (Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001).

The data files are distributed in .zip format; once unzipped, they are in Excel (.xls) format. The ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null", meaning "not rated"). Each row is a user. The first column gives the number of jokes rated by that user. The next 100 columns give the ratings for jokes 1 to 100. I will only be using the first data set, which contains users who have rated 36 or more jokes.
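To make the layout concrete, here is a minimal sketch using a made-up stand-in for the spreadsheet (3 users and 4 jokes instead of 100; the values are illustrative, not taken from the Jester files). It separates the count column from the ratings and turns the 99 sentinel into missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the spreadsheet layout (made-up values).
# Column 0 is the number of jokes rated; 99 marks "not rated".
raw = pd.DataFrame([
    [3, -7.82, 99.0, 8.79, -9.66],
    [4, 4.08, -0.29, 6.36, 4.37],
    [2, 99.0, 99.0, 9.03, 9.27],
])

rated_count = raw[0]                                 # jokes rated per user
ratings = raw.drop(columns=0).replace(99, np.nan)    # one column per joke

# The count column should agree with the number of non-null ratings per row.
assert (ratings.notna().sum(axis=1) == rated_count).all()
print(ratings)
```

Because valid ratings never leave the range -10 to +10, the value 99 cannot collide with a real rating.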

In addition to the ratings, we will be using the actual joke content.

### data612

This notebook relies on the module I created for this class.  You can see the [data612 module here](https://github.com/mikeasilva/CUNY-SPS/blob/master/DATA612/data612.py).

## Content-Based Filtering

In this section I will develop a content-based filter. For this we will need the joke content. I have not previously downloaded this data, so I will need to acquire it.

In [1]:
import os
import requests
import zipfile
import pandas as pd
import nltk
from shutil import rmtree
from sklearn.metrics.pairwise import cosine_similarity
import data612

# STEP 1 - DOWNLOAD THE DATA SET
if not os.path.exists("jester_dataset_1_joke_texts.zip"):
    # We need to download it
    response = requests.get("http://eigentaste.berkeley.edu/dataset/jester_dataset_1_joke_texts.zip")
    if response.status_code == 200:
        with open("jester_dataset_1_joke_texts.zip", "wb") as f:
            f.write(response.content)
# This was done in Part I, but we include it here just in case
if not os.path.exists("jester_dataset_1_1.zip"):
    # We need to download it
    response = requests.get("http://eigentaste.berkeley.edu/dataset/jester_dataset_1_1.zip")
    if response.status_code == 200:
        with open("jester_dataset_1_1.zip", "wb") as f:
            f.write(response.content)
# STEP 2 - EXTRACT ALL FILES
if not os.path.exists("jokes"):
    with zipfile.ZipFile("jester_dataset_1_joke_texts.zip","r") as z:
        z.extractall()
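    # CLEANUP - remove the __MACOSX metadata folder that the archive unpacks;
    # ignore_errors avoids a failure if the folder is not there
    rmtree("__MACOSX", ignore_errors=True)
# Again, this was done in Part I, but we include it here just in case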
if not os.path.exists("jester-data-1.xls"):
    with zipfile.ZipFile("jester_dataset_1_1.zip","r") as z:
        z.extract("jester-data-1.xls")

Now that we have the joke content we will need to read in the data and process it to create the filters.  First we need to get some counts so we can build our term matrix:

In [2]:
tokens_and_jokes = dict()
token_counts = dict()

# Requires the NLTK stop-word corpus; run nltk.download('stopwords') once if missing
stop_words = set(nltk.corpus.stopwords.words('english'))

for n in range(1, 101):
    joke_text = data612.read_joke(n)
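    for token in joke_text.split():  # alternative: nltk.word_tokenize(joke_text)
        # NOTE: the stop-word check runs before lowercasing, so capitalized
        # stop words (e.g. "Your") pass the filter and are counted as "your"
        if token not in stop_words:
            token = token.lower()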

            # Create a token count
            token_counts[token] = token_counts.get(token, 0) + 1
            # Create a token joke count
            key = (token, n)
            tokens_and_jokes[key] = tokens_and_jokes.get(key, 0) + 1
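
The counting logic can be tried on a tiny example without the Jester files. The sketch below is an illustration under stated assumptions: the joke texts and the stop-word set are made up (the notebook uses NLTK's English list), and it uses `collections.Counter` instead of plain dictionaries. It also lowercases each token before the stop-word check, which is a deliberate variation on the cell above so that "The" and "the" are treated alike.

```python
from collections import Counter

# Hypothetical mini stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "to", "did", "why"}

# Made-up texts keyed by joke number (not from the Jester data).
jokes = {
    1: "Why did the chicken cross the road",
    2: "The chicken wanted to cross",
}

token_counts = Counter()      # token -> count over all jokes
tokens_and_jokes = Counter()  # (token, joke number) -> count within that joke

for n, text in jokes.items():
    for token in text.split():
        token = token.lower()        # normalise case first ...
        if token in STOP_WORDS:      # ... so "The" and "the" are both dropped
            continue
        token_counts[token] += 1
        tokens_and_jokes[(token, n)] += 1

print(token_counts)
print(tokens_and_jokes)
```

The `(token, joke)` keys are what the term matrix is later built from; the overall `token_counts` are only a by-product.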

Now that we have the counts we can begin forming matrices.

### Simple Term Matrix

For this first run we will not lemmatize the tokens, and stop-word removal is only as thorough as the case-sensitive check in the loop above. We are just going to build a term frequency matrix where each row is a joke and each column is a term.

In [3]:
simple_data = list()

for t_and_j in tokens_and_jokes.keys():
    row = {
        "token": t_and_j[0],
        "joke": t_and_j[1],
        "count": tokens_and_jokes[t_and_j]
    }
    simple_data.append(row)
    
simple_data = pd.DataFrame(simple_data)

simple_data = simple_data.pivot_table(index='joke', columns='token', values='count', fill_value=0)
simple_data

| joke | `!` | `!'` | `"a` | `"actually` | `"agh,` | `"ah,` | `"amal."` | `"an` | `"and` | `"anybody` | ... | `you're` | `you,` | `you.` | `you."` | `you.you` | `you?` | `you?"` | `young` | `younger` | `your` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 97 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 99 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
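
The reshaping step from per-pair counts to a joke-by-token matrix is easier to see at small scale. This is a minimal sketch with made-up counts in the same long layout as `simple_data` before the pivot; the variable names `long_counts` and `term_matrix` are illustrative only.

```python
import pandas as pd

# Made-up (token, joke) counts in the same "long" layout as simple_data.
long_counts = pd.DataFrame([
    {"token": "chicken", "joke": 1, "count": 1},
    {"token": "road", "joke": 1, "count": 2},
    {"token": "chicken", "joke": 2, "count": 1},
    {"token": "wanted", "joke": 2, "count": 1},
])

# One row per joke, one column per token; absent pairs become 0.
term_matrix = long_counts.pivot_table(
    index="joke", columns="token", values="count", fill_value=0
)
print(term_matrix)
```

Each `(token, joke)` pair appears only once in the long table, so the default aggregation in `pivot_table` leaves the counts unchanged; `fill_value=0` supplies the zeros for tokens a joke does not contain.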


In [4]:
# Joke-to-joke similarity: each row of simple_data is one joke's term counts
cosine_similarity(simple_data)

array([[1.        , 0.        , 0.1069045 , ..., 0.        , 0.12      ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.03015113,
        0.03553345],
       [0.1069045 , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.12      , 0.03015113, 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.03553345, 0.        , ..., 0.        , 0.        ,
        1.        ]])
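
Each entry of the matrix above is the cosine of the angle between two jokes' term-count vectors, cos(a, b) = a·b / (‖a‖ ‖b‖), so jokes with no tokens in common score 0 and every joke scores 1 against itself. The sketch below reproduces that calculation with NumPy on a made-up 3 x 4 count matrix and then looks up each joke's nearest neighbour. The helper `cosine_similarity_matrix` is a hypothetical name for illustration, and it assumes no joke has an all-zero row.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (no all-zero rows)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms              # scale every row to unit length
    return unit @ unit.T          # dot products of unit vectors = cosines

# Made-up term counts: 3 jokes x 4 tokens.
term_matrix = np.array([
    [1, 2, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 3],
])

sim = cosine_similarity_matrix(term_matrix)

# Most similar *other* joke for each joke: hide the diagonal, then argmax.
masked = sim.copy()
np.fill_diagonal(masked, -np.inf)
nearest = masked.argmax(axis=1)

print(np.round(sim, 3))
print(nearest)
```

Jokes 0 and 1 share one token, so their similarity is 1/√10 ≈ 0.316; joke 2 shares nothing with the others, so its row is 0 off the diagonal and its "nearest" joke is an arbitrary tie.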

In [5]:
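# Read the ratings; 99 marks "not rated", so load it as missing
df = pd.read_excel("jester-data-1.xls", header=None, na_values=99)
# Column 0 holds the number of jokes each user rated, not a rating, so drop it
df = df.drop([0], axis=1)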

In [6]:
df = data612.rescale_jester_ratings(df).fillna(0)
df

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,91,92,93,94,95,96,97,98,99,100
0,-4.0,4.0,-5.0,-4.0,-4.0,-4.0,-5.0,2.0,-4.0,-2.0,...,1.0,0.0,0.0,0.0,0.0,0.0,-3.0,0.0,0.0,0.0
1,2.0,0.0,3.0,2.0,-1.0,-5.0,0.0,-3.0,4.0,5.0,...,1.0,-2.0,0.0,4.0,0.0,-1.0,2.0,0.0,-2.0,1.0
2,0.0,0.0,0.0,0.0,5.0,5.0,5.0,5.0,0.0,0.0,...,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,4.0,0.0,0.0,1.0,4.0,-1.0,3.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4.0,2.0,-2.0,-3.0,1.0,1.0,4.0,2.0,0.0,3.0,...,3.0,3.0,2.0,3.0,3.0,1.0,2.0,3.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24978,0.0,4.0,5.0,1.0,2.0,3.0,-4.0,0.0,-4.0,4.0,...,4.0,-1.0,5.0,-3.0,4.0,5.0,3.0,4.0,4.0,4.0
24979,5.0,-4.0,4.0,5.0,0.0,-4.0,-2.0,3.0,-4.0,2.0,...,-1.0,-3.0,-1.0,0.0,5.0,-4.0,-4.0,-4.0,5.0,4.0
24980,0.0,0.0,0.0,0.0,-4.0,0.0,3.0,-3.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24981,0.0,0.0,0.0,0.0,-5.0,0.0,2.0,-4.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
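
Because `.fillna(0)` is applied to the rescaled ratings, a 0 in this table can mean either a neutral rating or "not rated", and any later average or similarity treats the two the same way. The sketch below uses made-up ratings (3 users, 2 jokes) to show how much that can move a per-joke mean; it is an illustration of the effect, not a statement about the Jester numbers.

```python
import numpy as np
import pandas as pd

# Made-up ratings: 3 users x 2 jokes, NaN = not rated.
ratings = pd.DataFrame({1: [4.0, np.nan, np.nan], 2: [2.0, -2.0, 3.0]})

mean_rated_only = ratings.mean(axis=0)              # NaN is skipped by default
mean_zero_filled = ratings.fillna(0).mean(axis=0)   # unrated counted as 0

print(mean_rated_only)
print(mean_zero_filled)
```

Joke 1 was rated by a single user, so its mean drops from 4.0 to about 1.33 once the two missing ratings are counted as zeros, while joke 2, rated by everyone, is unaffected.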


In [7]:
# User-to-user similarity: each row of df is one user's ratings
cosine_similarity(df)

array([[ 1.        , -0.35028599, -0.16171663, ..., -0.048795  ,
         0.16336267,  0.06847182],
       [-0.35028599,  1.        ,  0.15509786, ...,  0.01165378,
         0.07111801,  0.04866718],
       [-0.16171663,  0.15509786,  1.        , ...,  0.04982138,
        -0.07050169,  0.45375985],
       ...,
       [-0.048795  ,  0.01165378,  0.04982138, ...,  1.        ,
         0.32631806,  0.02525859],
       [ 0.16336267,  0.07111801, -0.07050169, ...,  0.32631806,
         1.        ,  0.19695965],
       [ 0.06847182,  0.04866718,  0.45375985, ...,  0.02525859,
         0.19695965,  1.        ]])

In [8]:
# Mean rating per joke (unrated entries were filled with 0 above);
# lift the display limits so every joke is printed
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df.mean(axis=0))

1      0.297082
2      0.073530
3      0.105432
4     -0.450546
5      0.201417
6      0.650763
7     -0.213705
8     -0.310931
9     -0.175519
10     0.532162
11     0.791418
12     0.650803
13    -0.880239
14     0.627146
15    -0.854781
16    -1.557339
17    -0.540928
18    -0.313974
19     0.083617
20    -0.462394
21     1.045471
22     0.368330
23     0.045151
24    -0.538086
25     0.160269
26     0.614098
27     1.587800
28     0.733699
29     1.480927
30    -0.154865
31     1.080295
32     1.573710
33    -0.459593
34     0.370972
35     1.499059
36     1.651483
37    -0.461434
38     0.597206
39     0.496858
40     0.448385
41    -0.120402
42     0.969699
43    -0.324461
44    -0.685186
45     0.471721
46     0.706400
47     0.694552
48     0.904055
49     1.381780
50     1.827363
51    -0.282272
52    -0.053356
53     1.466797
54     1.351439
55     0.199296
56     0.872593
57    -0.639235
58    -1.203258
59    -0.206540
60    -0.117320
61     1.223792
62     1.485050
63     0