<a href="https://colab.research.google.com/github/leukschrauber/LearningPortfolio/blob/main/learn_portfolio_7_2/learn_portfolio_7_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Learning Portfolio
*by Fabian Leuk (csba6437/12215478)*

## Session 7: Recommender Systems

In this week's assignment, I have chosen the Jester-Dataset, which comprises ratings of around 25.000 users for 100 jokes. The dataset can be downloaded [here](https://eigentaste.berkeley.edu/dataset/jester_dataset_1_1.zip), the corresponding jokes can be downloaded [here](https://eigentaste.berkeley.edu/dataset/jester_dataset_1_joke_texts.zip) in order to build a Recommender System. 

I have extended the knowledge gained from the presented materials by actually calculating recommendations for the users included in the dataset. As the calculation was not given in the fastAI course, as well as in the other videos we were supposed to watch, I will elaborate on the calculation in this Learning Portfolio.

### Key Learnings




*   HTML Content can be easily processed with Python using regex matching and BeautifulSoup
*   Data can be transformed from wide format to long format using pd.melt()
*   Negative rating data needs a custom loss function when training with fastAI
*   Hyperparameter tuning was challenging to do this time. Most often, the defaults of fastAI worked best.
*   A neural net could net outperform the classic matrix recommender system during validation
*   export() and load_learner() can be used to save and retrieve trained models
*   **How to use cosine similarity, broadcasting and vector multiplication to actually make recommendations for a user.**
*   A user should not be recommended items that he already saw.
*   Adding a new user to the system requires new training, so the user embedding for the user can be generated.
*  torch.topk can be used to find the highest k numbers inside a pytorch tensor






### Application: Giving recommendations based on a trained model


#### Data preparation

In [1]:
!pip install bs4
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
import os
import re
from bs4 import BeautifulSoup
from fastai.collab import *
from fastai.tabular.all import *
set_seed(42)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... [?25l[?25hdone
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1257 sha256=5fbc9f3773db57e0fa2a2deaf255eae5075bd482aa87441ef9219a721c7c8ee0
  Stored in directory: /root/.cache/pip/wheels/25/42/45/b773edc52acb16cd2db4cf1a0b47117e2f69bb4eb300ed0e70
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1
Mounted at /content/gdrive


We start by extracting the actual jokes from the jokes-html files. While this not necessary for our results, it certainly is more "fun" while interpreting the results.

The texts were given in HTML-Files, so we use regex matching and Beautiful Soup to remove all of the HTML-Tags and give us the raw jokes out of the HTML.

Wrap that inside a pandas Dataframe assigning a 0-based index as an id and we can work with it later.

In [4]:
joke_text_files = '/content/gdrive/My Drive/SE_Digital_Organizations/jokes'
joke_ratings_file = '/content/gdrive/My Drive/SE_Digital_Organizations/jester-data-1.xls'

# Function to extract joke text from html file and remove remaining HTML tags
def extract_joke_from_file(file_path):
    with open(file_path, "r") as file:
        html = file.read()

    pattern = r"<!--begin of joke -->(.*?)<!--end of joke -->"
    matches = re.findall(pattern, html, re.DOTALL)

    cleaned_jokes = []
    for joke in matches:
        soup = BeautifulSoup(joke, "html.parser")
        cleaned_joke = soup.get_text()
        cleaned_jokes.append(cleaned_joke)

    return cleaned_jokes

# Iterate over files in directory, sort by joke number and add to dataframe
html_files = [file_name for file_name in os.listdir(joke_text_files) if file_name.endswith(".html")]
html_files.sort(key=lambda x: int(re.search(r"\d+", x).group()))
jokes_list = []
for file_name in html_files:
    file_path = os.path.join(joke_text_files, file_name)
    jokes = extract_joke_from_file(file_path)

    if jokes:
        jokes_list.extend(jokes)

joke_text_df = pd.DataFrame({"joke_text": jokes_list})
joke_text_df = joke_text_df.reset_index().rename(columns={'index': 'joke_id'})
joke_text_df.head(5)

Unnamed: 0,joke_id,joke_text
0,0,"\nA man visits the doctor. The doctor says ""I have bad news for you.You have\ncancer and Alzheimer's disease"". \nThe man replies ""Well,thank God I don't have cancer!""\n"
1,1,"\nThis couple had an excellent relationship going until one day he came home\nfrom work to find his girlfriend packing. He asked her why she was leaving him\nand she told him that she had heard awful things about him. \n\n""What could they possibly have said to make you move out?"" \n\n""They told me that you were a pedophile."" \n\nHe replied, ""That's an awfully big word for a ten year old."" \n"
2,2,\nQ. What's 200 feet long and has 4 teeth? \n\nA. The front row at a Willie Nelson Concert.\n
3,3,\nQ. What's the difference between a man and a toilet? \n\nA. A toilet doesn't follow you around after you use it.\n
4,4,"\nQ.\tWhat's O. J. Simpson's Internet address? \nA.\tSlash, slash, backslash, slash, slash, escape.\n"


Next, we extract the ratings from the xls-file. The first column just counts how many jokes have been rated by the respective user. As we do not need this information and can also easily compute it ourselves given the rest of the data, we drop the column.

Again, we use the index as a user_id. Furthermore, we transform the ratings to be between 0 and 5.

The transformation was necessary to train our model, because the standard loss function did not consider that there were negative ratings. Another solution would be to provide a custom loss function.

In the end, we obtain the ratings for around 25000 users in wide format.

In [9]:
# drop the rating count, make index the column user_id and scale ratings to be between zero and 5

joke_ratings = pd.read_excel(joke_ratings_file, header=None)
joke_ratings.columns=[str(i) for i in range(101)]

joke_ratings = joke_ratings.drop("0", axis=1)
joke_ratings = joke_ratings.reset_index().rename(columns={'index': 'user_id'})

def transform_column(column):
    if column.name != "user_id":
        return column.apply(lambda x: (x + 10) / 4)
    return column

joke_ratings = joke_ratings.apply(transform_column, axis=0)

print(joke_ratings.head(5))

   user_id       1        2        3        4       5       6       7       8  \
0        0   0.545   4.6975   0.0850   0.4600  0.6200  0.3750  0.0375  3.5425   
1        1   3.520   2.4275   4.0900   3.5925  1.9050  0.0850  2.3175  1.1650   
2        2  27.250  27.2500  27.2500  27.2500  4.7575  4.8175  4.7575  4.8175   
3        3  27.250   4.5875  27.2500  27.2500  2.9500  4.5400  1.7950  4.0525   
4        4   4.625   3.6525   1.4575   1.1525  2.8400  2.9000  4.2600  3.6525   

        9  ...       91       92       93       94       95       96       97  \
0   0.255  ...   3.2050  27.2500  27.2500  27.2500  27.2500  27.2500   1.0925   
1   4.720  ...   3.2050   1.2625   2.4275   4.4650   2.4525   1.9650   3.2650   
2  27.250  ...  27.2500  27.2500  27.2500   4.7700  27.2500  27.2500  27.2500   
3  27.250  ...  27.2500  27.2500  27.2500   2.6325  27.2500  27.2500  27.2500   
4   2.390  ...   3.7975   3.8950   3.5675   3.7975   3.9325   2.8875   3.2775   

        98     99      100

We already have a trained model on this dataset, so let's use it.

In [16]:
loaded = load_learner('/content/gdrive/My Drive/SE_Digital_Organizations/ordinary_learner/model.pkl')

#### Making recommendations

Now, we will start to build recommendations for our model. Our model contains an embedding for each user and an embedding for each movie. Each embedding is a vector of 50 latent variables, which the model determined for use while training.

Additionally, each user and each item has a bias, which is a constant our model determined as well while training.

In [17]:
loaded.model

EmbeddingDotBias(
  (u_weight): Embedding(24984, 50)
  (i_weight): Embedding(101, 50)
  (u_bias): Embedding(24984, 1)
  (i_bias): Embedding(101, 1)
)

For recommending a joke to a user, we could simply average the ratings for the joke. We could also just recommend the joke with the highest bias, but then we would not have taken into account the user's preferences, which can deviate from the mean.

We need the following approach: We want to find out which users are similar to the user we want to make recommendations for and we want to weigh their opinion larger than opinions of users that are very different from our user.

As each user has an embedding, we can just compare the embedding of our user to the embeddings of other users. We can do so by calculating cosine similarity between these embeddings.

In [47]:
user_factors = loaded.model.u_weight.weight
similarities = nn.CosineSimilarity(dim=1)(user_factors[666], user_factors)[:-1]
similarities.shape

torch.Size([24983])

Now for this user, we need to calculate a weighted average by taking into account the calculated cosine similarities:

$\frac{\sum_{i=1}^n cosinesimilarity_i rating_{ij}}{\sum_{i=1}^n cosinesimilarity_i}$

Given a number of users n and a joke j, we are building the sum of all products between the cosinesimilarity for the user i and the rating for joke j of user i. In the denominator, we are just adding up all the cosine similarities.

Before we can start, we need to take into account a peculiarity in our dataset. If a user has not rated a joke, the dataset contained a 99. Meaning that after our transformation, it contains a 27.25. We will take care of it by replacing it with our neutral rating: 2.5.

Let's start by calculating all the products for the counter in our fraction for user 666 and joke 1:

In [54]:
joke_ratings_without_user = joke_ratings.drop(["user_id"], axis=1)
joke_rating_for_joke_42 = torch.from_numpy(joke_ratings_without_user["1"].values)
joke_rating_for_joke_42 = torch.where(joke_rating_for_joke_42 > 5.1, torch.tensor(2.5), joke_rating_for_joke_42)
product = similarities * joke_rating_for_joke_42
product

tensor([-0.0036, -1.0723,  0.7182,  ...,  0.4684, -0.9358,  0.2970],
       dtype=torch.float64, grad_fn=<MulBackward0>)

Sum that up, so that will give us the sumproduct for our counter:

In [55]:
sumproduct = torch.sum(product, dim=0)
sumproduct

tensor(18815.1288, dtype=torch.float64, grad_fn=<SumBackward1>)

Divide the sumproduct by the sum of similarities to give us our weighted average:

In [56]:
predicted_rating = sumproduct/similarities.sum()
predicted_rating

tensor(2.6388, dtype=torch.float64, grad_fn=<DivBackward0>)

Now that we predicted a rating for a single joke, we can do so for the whole set of jokes for our user 666.

In [57]:
joke_ratings_without_user = torch.from_numpy(joke_ratings_without_user.values)
replaced_tensor = torch.where(joke_ratings_without_user > 5.1, torch.tensor(2.5), joke_ratings_without_user)
ratings = replaced_tensor * similarities[:, np.newaxis]
summated_ratings = torch.sum(ratings, dim=0)
weighted_ratings = summated_ratings/similarities.sum()
weighted_ratings.shape

torch.Size([100])

Now we can select the top 10 joke ids for our user.

In [60]:
values, joke_indices = torch.topk(weighted_ratings, 10)
joke_indices

tensor([49, 35, 31, 26, 34, 61, 28, 52, 48, 67])

And we can look up what these jokes are in our joke dataset:

In [61]:
for joke_index in joke_indices.tolist():
    print("Recommend joke with id " + str(joke_index) + " to user " + str(666))
    joke_df = joke_text_df[(joke_text_df["joke_id"] == joke_index)]
    for index, row in joke_df.iterrows():
      print(row['joke_id'])
      print(row['joke_text'])

Recommend joke with id 49 to user 666
49

A guy goes into confession and says to the priest, "Father, I'm 80 years
old, widower, with 11 grandchildren. Last night I met two beautiful flight
attendants. They took me home and I made love to both of them. Twice."

The priest said: "Well, my son, when was the last time you were in
confession?"
 "Never Father, I'm Jewish."
 "So then, why are you telling me?"
 "I'm telling everybody."

Recommend joke with id 35 to user 666
35

A guy walks into a bar, orders a beer and says to the bartender,
"Hey, I got this great Polish Joke..." 

"Before you go telling that joke you better know that I'm Polish, both
bouncers are Polish and so are most of my customers"

"Okay" says the customer,"I'll tell it very slowly." 

Recommend joke with id 31 to user 666
31

A man arrives at the gates of heaven. St. Peter asks, "Religion?" 
The man says, "Methodist." St. Peter looks down his list, and says, 
"Go to room 24, but be very quiet as you pass room 8." 

Ano

We can wrap all of what we learned in a function to make it possible to predict good jokes for every user. Also, we should take into account whether a user has already seen a joke and not recommend it to him in case he has.

In [70]:
def recommend_joke_to_user(user_id):
  user_factors = loaded.model.u_weight.weight
  distances = nn.CosineSimilarity(dim=1)(user_factors[user_id], user_factors)
  distances = distances[:-1]
  joke_ratings_without_user = joke_ratings.drop(["user_id"], axis=1)
  joke_ratings_without_user = torch.from_numpy(joke_ratings_without_user.values)
  replaced_tensor = torch.where(joke_ratings_without_user > 5.1, torch.tensor(2.5), joke_ratings_without_user)
  ratings = replaced_tensor * distances[:, np.newaxis]
  summated_ratings = torch.sum(ratings, dim=0)
  weighted_ratings = summated_ratings/distances.sum()
  values, joke_indices = torch.topk(weighted_ratings, 10)

  recommendation_given = False;

  for joke_index in joke_indices.tolist():
      user_may_have_rated = joke_ratings[(joke_ratings["user_id"] == user_id)]
      if(user_may_have_rated[str(joke_index)].values[0] != 27.25):
        print("User " + str(user_id) + " already rated joke with the id " + str(joke_index))
        continue;
      else:
        print("Recommend joke with id " + str(joke_index) + " to user " + str(user_id))
        joke_df = joke_text_df[(joke_text_df["joke_id"] == joke_index)]
        for index, row in joke_df.iterrows():
          print(row['joke_id'])
          print(row['joke_text'])
          recommendation_given = True
          break;

  if(not recommendation_given):
    print("Top jokes for user " + str(user_id) + " already rated.")

In [73]:
for i in range(1111,1122):
  recommend_joke_to_user(i)

User 1111 already rated joke with the id 49
User 1111 already rated joke with the id 35
User 1111 already rated joke with the id 31
User 1111 already rated joke with the id 26
User 1111 already rated joke with the id 34
User 1111 already rated joke with the id 52
User 1111 already rated joke with the id 68
User 1111 already rated joke with the id 28
User 1111 already rated joke with the id 61
Recommend joke with id 67 to user 1111
67

A man piloting a hot air balloon discovers he has wandered off course and
is hopelessly lost. He descends to a lower altitude and locates a man
down on the ground. He lowers the balloon further and shouts "Excuse me,
can you tell me where I am?"

The man below says: "Yes, you're in a hot air balloon, about 30 feet
above this field."

"You must work in Information Technology," says the balloonist.

"Yes I do," replies the man. "And how did you know that?"

"Well," says the balloonist, "what you told me is technically correct,
but of no use to anyone."

The