<a href="https://colab.research.google.com/github/leukschrauber/Assignments/blob/main/assignment_7_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment

*by Fabian Leuk (csba6437/12215478)*

The following assignment consists again of a theoretical part (learning portfolio) and a practical part (assignment). The goal is to train a neural model for a recommendation system.

Find any data set that can be used for a recommender system and try to train and validate a neural network for it.

For this purpose I ask you to download a data set from the given lists and to use it for your program application. 

https://gist.github.com/entaroadun/1653794
https://github.com/caserec/Datasets-for-Recommender-Systems
https://grouplens.org/datasets/movielens/
https://eigentaste.berkeley.edu/dataset/

## Data Preparation

I have chosen the Jester-Dataset, which comprises ratings of around 25.000 users for 100 jokes. The dataset can be downloaded [here](https://eigentaste.berkeley.edu/dataset/jester_dataset_1_1.zip), the corresponding jokes can be downloaded [here](https://eigentaste.berkeley.edu/dataset/jester_dataset_1_joke_texts.zip). 

* Data files are in .zip format, when unzipped, they are in Excel (.xls) format
* Ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null" = "not rated").
* One row per user
* The first column gives the number of jokes rated by that user. The next 100 columns give the ratings for jokes 01 - 100.

I processed the data by extracting the joke texts from the HTML-Files using Regex-Matching and BeautifulSoup. Furthermore, I converted the joke ratings into long format as demanded by fastAI and scaled the ratings from being between -10 and 10 to be between 0 and 5.

In [125]:
!pip install bs4
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
import os
import re
from bs4 import BeautifulSoup
from fastai.collab import *
from fastai.tabular.all import *
set_seed(42)

joke_text_files = '/content/gdrive/My Drive/SE_Digital_Organizations/jokes'
joke_ratings_file = '/content/gdrive/My Drive/SE_Digital_Organizations/jester-data-1.xls'

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Mounted at /content/gdrive


In [126]:
# Function to extract joke text from html file and remove remaining HTML tags
def extract_joke_from_file(file_path):
    with open(file_path, "r") as file:
        html = file.read()

    pattern = r"<!--begin of joke -->(.*?)<!--end of joke -->"
    matches = re.findall(pattern, html, re.DOTALL)

    cleaned_jokes = []
    for joke in matches:
        soup = BeautifulSoup(joke, "html.parser")
        cleaned_joke = soup.get_text()
        cleaned_jokes.append(cleaned_joke)

    return cleaned_jokes

# Iterate over files in directory, sort by joke number and add to dataframe
html_files = [file_name for file_name in os.listdir(joke_text_files) if file_name.endswith(".html")]
html_files.sort(key=lambda x: int(re.search(r"\d+", x).group()))
jokes_list = []
for file_name in html_files:
    file_path = os.path.join(joke_text_files, file_name)
    jokes = extract_joke_from_file(file_path)

    if jokes:
        jokes_list.extend(jokes)

joke_text_df = pd.DataFrame({"joke_text": jokes_list})
joke_text_df = joke_text_df.reset_index().rename(columns={'index': 'joke_id'})
joke_text_df.head(5)

Unnamed: 0,joke_id,joke_text
0,0,"\nA man visits the doctor. The doctor says ""I have bad news for you.You have\ncancer and Alzheimer's disease"". \nThe man replies ""Well,thank God I don't have cancer!""\n"
1,1,"\nThis couple had an excellent relationship going until one day he came home\nfrom work to find his girlfriend packing. He asked her why she was leaving him\nand she told him that she had heard awful things about him. \n\n""What could they possibly have said to make you move out?"" \n\n""They told me that you were a pedophile."" \n\nHe replied, ""That's an awfully big word for a ten year old."" \n"
2,2,\nQ. What's 200 feet long and has 4 teeth? \n\nA. The front row at a Willie Nelson Concert.\n
3,3,\nQ. What's the difference between a man and a toilet? \n\nA. A toilet doesn't follow you around after you use it.\n
4,4,"\nQ.\tWhat's O. J. Simpson's Internet address? \nA.\tSlash, slash, backslash, slash, slash, escape.\n"


In [127]:
# drop the rating count, make index the column user_id and scale ratings to be between zero and 5

joke_ratings = pd.read_excel(joke_ratings_file, header=None)
joke_ratings.columns=[str(i) for i in range(101)]

joke_ratings = joke_ratings.drop("0", axis=1)
joke_ratings = joke_ratings.reset_index().rename(columns={'index': 'user_id'})

def transform_column(column):
    if column.name != "user_id":
        return column.apply(lambda x: (x + 10) / 4)
    return column

joke_ratings = joke_ratings.apply(transform_column, axis=0)

print(joke_ratings.head(5))

   user_id       1        2        3        4       5       6       7       8  \
0        0   0.545   4.6975   0.0850   0.4600  0.6200  0.3750  0.0375  3.5425   
1        1   3.520   2.4275   4.0900   3.5925  1.9050  0.0850  2.3175  1.1650   
2        2  27.250  27.2500  27.2500  27.2500  4.7575  4.8175  4.7575  4.8175   
3        3  27.250   4.5875  27.2500  27.2500  2.9500  4.5400  1.7950  4.0525   
4        4   4.625   3.6525   1.4575   1.1525  2.8400  2.9000  4.2600  3.6525   

        9  ...       91       92       93       94       95       96       97  \
0   0.255  ...   3.2050  27.2500  27.2500  27.2500  27.2500  27.2500   1.0925   
1   4.720  ...   3.2050   1.2625   2.4275   4.4650   2.4525   1.9650   3.2650   
2  27.250  ...  27.2500  27.2500  27.2500   4.7700  27.2500  27.2500  27.2500   
3  27.250  ...  27.2500  27.2500  27.2500   2.6325  27.2500  27.2500  27.2500   
4   2.390  ...   3.7975   3.8950   3.5675   3.7975   3.9325   2.8875   3.2775   

        98     99      100

In [129]:
# convert from wide to long format
melted_joke_ratings = pd.melt(joke_ratings, id_vars='user_id', value_vars=joke_ratings.columns[1:], var_name='number', value_name='rating')
melted_joke_ratings = melted_joke_ratings[melted_joke_ratings['rating'] != 27.25]
melted_joke_ratings = melted_joke_ratings.copy()
melted_joke_ratings.head(5)

Unnamed: 0,user_id,number,rating
0,0,1,0.545
1,1,1,3.52
4,4,1,4.625
5,5,1,0.9575
7,7,1,4.21


In [130]:
dls = CollabDataLoaders.from_df(melted_joke_ratings, item_name='number', user_name="user_id", bs=128)

## Training

For training, I used fastAI's collab_learner API. I trained a ordinary collaborative filtering model and a neural network with the following hyperparameters:

*   batch size: 128
*   latent factors: determined by fastAI => 50
*   epochs: 10
*   learning rate: 5e-4
*   weight decay: 0.1

The hyperparameter choice turned out to be quite tricky, but finally I found these parameters which were decreasing the training and validation losses consistently and provided reasonable results overall.

Although training losses were smaller for the neural network, the ordinary collaborative filtering model achieved lower losses on the validation data set.

Because training time was around 30 minutes, I saved the models to my Google Drive for later retrieval.


### Using Collab Learner

In [None]:
ordinary_learn = collab_learner(dls, y_range=(0.0, 5.5))
ordinary_learn.fit_one_cycle(10, 5e-4, wd=0.1)

epoch,train_loss,valid_loss,time
0,1.473742,1.456311,01:14
1,1.176519,1.171204,01:12
2,1.126006,1.12678,01:11
3,1.074324,1.097894,01:11
4,1.055906,1.074965,01:12
5,1.029559,1.058813,01:15
6,0.971448,1.049513,01:11
7,0.944129,1.044559,01:11
8,0.910618,1.042669,01:11
9,0.925971,1.042407,01:11


### Using Neural Network

In [None]:
nn_learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5))
nn_learn.fit_one_cycle(10, 5e-4, wd=0.1)

epoch,train_loss,valid_loss,time
0,1.232545,1.210195,02:04
1,1.153208,1.147422,02:04
2,1.109861,1.10582,02:04
3,1.099963,1.088661,02:04
4,1.054629,1.078682,02:04
5,1.006162,1.068304,02:05


epoch,train_loss,valid_loss,time
0,1.232545,1.210195,02:04
1,1.153208,1.147422,02:04
2,1.109861,1.10582,02:04
3,1.099963,1.088661,02:04
4,1.054629,1.078682,02:04
5,1.006162,1.068304,02:05
6,0.975808,1.060126,02:04
7,0.901285,1.05852,02:05
8,0.832088,1.070963,02:04
9,0.813799,1.07695,02:04


In [None]:
nn_learn.export('/content/gdrive/My Drive/SE_Digital_Organizations/neural_network/model.pkl')
ordinary_learn.export('/content/gdrive/My Drive/SE_Digital_Organizations/ordinary_learner/model.pkl')

## Interpretation

Let's see what are the best jokes by finding the jokes with the highest bias term.

In [131]:
loaded = load_learner('/content/gdrive/My Drive/SE_Digital_Organizations/ordinary_learner/model.pkl')

In [132]:
loaded.model

EmbeddingDotBias(
  (u_weight): Embedding(24984, 50)
  (i_weight): Embedding(101, 50)
  (u_bias): Embedding(24984, 1)
  (i_bias): Embedding(101, 1)
)

In [133]:
best_joke_ids = loaded.model.i_bias.weight.squeeze().argsort(descending=True)[:5]
print(best_joke_ids)
filtered_df = joke_text_df[joke_text_df['joke_id'].isin(best_joke_ids.tolist())]
for index, row in filtered_df.iterrows():
    print(row['joke_id'])
    print(row['joke_text'])

tensor([47, 89, 31, 21, 27])
21

A duck walks into a pharmacy and asks for a condom. The pharmacist says
"Would you like me to stick that on your bill?"
The duck says: 
"What kind of duck do you think I am!"

27

A mechanical, electrical and a software engineer from Microsoft were
driving through the desert when the car broke down. The mechanical
engineer said "It seems to be a problem with the fuel injection system,
why don't we pop the hood and I'll take a look at it." To which the
electrical engineer replied, "No I think it's just a loose ground wire,
I'll get out and take a look." Then, the Microsoft engineer jumps in.
"No, no, no. If we just close up all the windows, get out, wait a few
minutes, get back in, and then reopen the windows everything will work
fine."

31

A man arrives at the gates of heaven. St. Peter asks, "Religion?" 
The man says, "Methodist." St. Peter looks down his list, and says, 
"Go to room 24, but be very quiet as you pass room 8." 

Another man arrives at 

## Recommending jokes

I proceed to recommend joke for users that are part of the dataset in the following steps:



1.   Calculate the cosine similarity between the embedding of a particular user and the embeddings of all other users => ~25.000 Similarities
2.   For each joke, take the sumproduct of the similarities and the respective ratings for a user => SUMPRODUCT(~25.000 Similarities, 100 joke ratings) => 1 summated rating
3.   Divide the previous SUMPRODUCT by the sum of all similarities to get a weighted average rating.
4.   Identify the top 10 rated jokes for a user
5.   If a user did not rate one of these jokes, recommend it to him.



In [134]:
def recommend_joke_to_user(user_id):
  user_factors = loaded.model.u_weight.weight
  distances = nn.CosineSimilarity(dim=1)(user_factors[user_id], user_factors)
  distances = distances[:-1]
  joke_ratings_without_user = joke_ratings.drop(["user_id"], axis=1)
  joke_ratings_without_user = torch.from_numpy(joke_ratings_without_user.values)
  replaced_tensor = torch.where(joke_ratings_without_user > 5.1, torch.tensor(0), joke_ratings_without_user)
  ratings = replaced_tensor * distances[:, np.newaxis]
  summated_ratings = torch.sum(ratings, dim=0)
  weighted_ratings = summated_ratings/distances.sum()
  values, joke_indices = torch.topk(weighted_ratings, 10)

  recommendation_given = False;

  for joke_index in joke_indices.tolist():
      user_may_have_rated = melted_joke_ratings[(melted_joke_ratings["user_id"] == user_id) & (melted_joke_ratings["number"] == str(joke_index))]
      if(len(user_may_have_rated) > 0):
        print("User " + str(user_id) + " already rated joke with the id " + str(joke_index))
        continue;
      else:
        print("Recommend joke with id " + str(joke_index) + " to user " + str(user_id))
        joke_df = joke_text_df[(joke_text_df["joke_id"] == joke_index)]
        for index, row in joke_df.iterrows():
          print(row['joke_id'])
          print(row['joke_text'])
          recommendation_given = True
          break;

  if(not recommendation_given):
    print("Top jokes for user " + str(user_id) + " already rated.")

In [135]:
for i in range(100):
  recommend_joke_to_user(i)

User 0 already rated joke with the id 49
User 0 already rated joke with the id 26
User 0 already rated joke with the id 35
User 0 already rated joke with the id 31
User 0 already rated joke with the id 28
User 0 already rated joke with the id 34
User 0 already rated joke with the id 65
User 0 already rated joke with the id 52
User 0 already rated joke with the id 48
User 0 already rated joke with the id 61
Top jokes for user 0 already rated.
User 1 already rated joke with the id 49
User 1 already rated joke with the id 35
User 1 already rated joke with the id 26
User 1 already rated joke with the id 31
User 1 already rated joke with the id 34
User 1 already rated joke with the id 68
User 1 already rated joke with the id 52
User 1 already rated joke with the id 67
User 1 already rated joke with the id 28
User 1 already rated joke with the id 48
Top jokes for user 1 already rated.
User 2 already rated joke with the id 49
User 2 already rated joke with the id 35
User 2 already rated joke 