<a href="https://colab.research.google.com/github/gulce0/IE-423/blob/main/Task8_Gulce.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Initialize

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# joke metadata
dfJk = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/jokes/JokeText.csv')

# user ratings for each joke
dfJkRts = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/jokes/UserRatings1.csv')

## Build Recommendations

### 1. Content Based Filtering

#### Prepare data

In [None]:
dfJk.head()

Unnamed: 0,JokeId,JokeText
0,0,"A man visits the doctor. The doctor says ""I ha..."
1,1,This couple had an excellent relationship goin...
2,2,Q. What's 200 feet long and has 4 teeth? \n\nA...
3,3,Q. What's the difference between a man and a t...
4,4,Q.\tWhat's O. J. Simpson's Internet address? \...


In [None]:
dfJk.shape

(100, 2)

In [None]:
# Remove duplicates
dfJk.drop_duplicates(subset ='JokeText', keep = 'first', inplace = True)
dfJk.shape

(100, 2)

#### *Build Model*

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Prepare the TF-IDF matrix for the jokes
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(dfJk['JokeText'])
print(tfidf_matrix.shape)

(100, 1378)


We use scikit-learn's TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer to turn each joke into a numerical vector. Each component weights one term: it grows with how often the term occurs in the joke and shrinks with how many jokes contain that term. The printed shape is (number of jokes, number of distinct terms): 100 jokes and 1378 terms once English stop words are removed.
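
To make the weighting concrete, here is a minimal from-scratch sketch. It assumes a plain whitespace tokenizer and no stop-word removal, and it follows the smoothed idf and L2 row normalization that scikit-learn documents as its defaults. The helper name tfidf_vectors and the toy jokes are hypothetical, not part of the notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, rows) where each row holds L2-normalised TF-IDF weights."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows

# Toy corpus (made up for illustration)
toy_jokes = ["doctor tells man joke", "man walks into bar", "parrot walks into bar"]
vocab, rows = tfidf_vectors(toy_jokes)
print(len(rows), len(vocab))  # 3 8
```

In the first toy joke, "doctor" (found in one document) receives a larger weight than "man" (found in two), which is the effect the inverse document frequency is meant to produce.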

In [None]:
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

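Cosine similarity is the cosine of the angle between two vectors. In general it ranges from -1 to 1, but TF-IDF weights are non-negative, so the scores here fall between 0 and 1. TfidfVectorizer L2-normalizes each row by default, so the plain dot product computed by linear_kernel already equals the cosine similarity.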
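
The equivalence between a dot product of unit-length vectors and cosine similarity can be checked on a small example. This is a sketch with hypothetical helper names (dot, cosine, normalise) and made-up vectors.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between u and v."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def normalise(u):
    """Scale u to unit (L2) length."""
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]

a, b = [3.0, 0.0, 4.0], [1.0, 2.0, 2.0]
a_unit, b_unit = normalise(a), normalise(b)

# Both expressions give the same value (11/15 for these vectors).
print(cosine(a, b), dot(a_unit, b_unit))
```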

In [None]:
# Function to get joke recommendations based on content
def get_content_based_recommendations(joke_id, cosine_sim=cosine_sim):
    # Find the row position of the joke. JokeId is 0-based in this dataset
    # (see dfJk.head()), so look the ID up instead of subtracting 1.
    idx = dfJk['JokeId'].tolist().index(joke_id)

    # Get the pairwise similarity scores of all jokes with the selected joke
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the jokes based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (the joke itself) and keep the next 10
    sim_scores = sim_scores[1:11]

    # Get the joke indices
    joke_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar jokes
    return dfJk.iloc[joke_indices]


This function returns content-based recommendations for a given joke ID. It reads that joke's row of the cosine similarity matrix, sorts the other jokes by score, and returns the ten with the highest scores.

In [None]:
# Example: Get content-based recommendations for joke with ID 0
content_recommendations = get_content_based_recommendations(0)
print(content_recommendations)

    JokeId                                           JokeText
86      86  A man, recently completing a routine physical ...
67      67  A man piloting a hot air balloon discovers he ...
87      87  A Czechoslovakian man felt his eyesight was gr...
75      75  There once was a man and a woman that both  go...
31      31  A man arrives at the gates of heaven. St. Pete...
38      38  What is the difference between men and women:\...
55      55  A man and Cindy Crawford get stranded on a des...
80      80  An Asian man goes into a New York CityBank to ...
32      32  What do you call an American in the finals of ...
3        3  Q. What's the difference between a man and a t...


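This calls the get_content_based_recommendations function and prints the ten jokes whose content is most similar to the query. The results are the nearest neighbours of the joke at row 0 of the similarity matrix, joke 0 ("A man visits the doctor."), since JokeId is 0-based in this dataset. That is why most of the recommended jokes also feature "a man".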
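
The function above drops the first sorted entry on the assumption that it is the query joke itself. That assumption can fail when another joke ties with a score of 1.0, for example a duplicate text. A more defensive sketch filters by index instead; top_k_similar and the toy scores are hypothetical.

```python
def top_k_similar(sim_row, query_idx, k=10):
    """Indices of the k largest entries in sim_row, skipping query_idx itself."""
    candidates = [(i, s) for i, s in enumerate(sim_row) if i != query_idx]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in candidates[:k]]

# Row 3 of a made-up similarity matrix: item 0 ties with the query at 1.0.
toy_row = [1.0, 0.2, 0.0, 1.0, 0.7]
print(top_k_similar(toy_row, query_idx=3, k=2))  # [0, 4]
```

With the slice-based approach, the tie would have kept the query (index 3) in the results and dropped index 0.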

*The model above uses only the joke text. This dataset has no joke name or title column, so as a stand-in we combine the joke ID with the joke text into a single field and rebuild the TF-IDF matrix from that combined field.*

In [None]:

# Fill NaN values in the joketext column with empty strings
dfJk['JokeText'] = dfJk['JokeText'].fillna('')

# Combine the joke id and joketext into a new 'combined' column
dfJk['combined'] = dfJk['JokeId'].astype(str) + " " + dfJk['JokeText']

In [None]:
# Prepare the TF-IDF matrix for the combined text
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(dfJk['combined'])
print(tfidf_matrix.shape)


(100, 1444)


In [None]:
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
# Function to get joke recommendations based on content
def get_content_based_recommendations(joke_id, cosine_sim=cosine_sim):
    # Find the row position of the joke. JokeId is 0-based in this dataset
    # (see dfJk.head()), so look the ID up instead of subtracting 1.
    idx = dfJk['JokeId'].tolist().index(joke_id)

    # Get the pairwise similarity scores of all jokes with the selected joke
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the jokes based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (the joke itself) and keep the next 10
    sim_scores = sim_scores[1:11]

    # Get the joke indices
    joke_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar jokes
    return dfJk.iloc[joke_indices]

In [None]:
# Example: Get content-based recommendations for joke with ID 0
content_recommendations = get_content_based_recommendations(0)
print(content_recommendations)


    JokeId                                           JokeText  \
86      86  A man, recently completing a routine physical ...   
67      67  A man piloting a hot air balloon discovers he ...   
87      87  A Czechoslovakian man felt his eyesight was gr...   
75      75  There once was a man and a woman that both  go...   
31      31  A man arrives at the gates of heaven. St. Pete...   
38      38  What is the difference between men and women:\...   
55      55  A man and Cindy Crawford get stranded on a des...   
80      80  An Asian man goes into a New York CityBank to ...   
32      32  What do you call an American in the finals of ...   
3        3  Q. What's the difference between a man and a t...   

                                             combined  
86  86 A man, recently completing a routine physic...  
67  67 A man piloting a hot air balloon discovers ...  
87  87 A Czechoslovakian man felt his eyesight was...  
75  75 There once was a man and a woman that both ...  
31  

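With this change the similarity reflects both the joke ID and the joke text. An ID carries no information about a joke's content, though: at best it adds a token that no other joke shares, and at worst it coincides with a number in another joke's text and creates a spurious match. In the output above, the ten recommendations are the same as those from the text-only model. A field with shared vocabulary, such as a title or a category, would be needed to change the ranking in a meaningful way.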
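
A small sketch of why a per-document token does not bring two documents closer: a term that occurs in only one document contributes zero to the dot product between two different documents. The helper shared_term_dot, the toy jokes, and the IDs "17" and "42" are hypothetical, and the sketch assumes those ID strings do not occur anywhere else in the text.

```python
from collections import Counter

def shared_term_dot(doc_a, doc_b):
    """Dot product of the raw term-count vectors of two token lists."""
    counts_a, counts_b = Counter(doc_a), Counter(doc_b)
    shared = counts_a.keys() & counts_b.keys()
    return sum(counts_a[t] * counts_b[t] for t in shared)

joke_a = "a man walks into a bar".split()
joke_b = "a man visits the doctor".split()

plain = shared_term_dot(joke_a, joke_b)
with_ids = shared_term_dot(["17"] + joke_a, ["42"] + joke_b)
print(plain, with_ids)  # 3 3
```

After L2 normalization the extra token slightly shrinks every other weight in its row, so scores can shift a little, but no new overlap between jokes is created.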

### 2. Bundling Recommendation

In [None]:
dfJkRts.head(10)

Unnamed: 0,JokeId,User1,User2,User3,User4,User5,User6,User7,User8,User9,...,User36701,User36702,User36703,User36704,User36705,User36706,User36707,User36708,User36709,User36710
0,0,5.1,-8.79,-3.5,7.14,-8.79,9.22,-4.03,3.11,-3.64,...,,,,,,,,,2.91,
1,1,4.9,-0.87,-2.91,-3.88,-0.58,9.37,-1.55,0.92,-3.35,...,,,,-5.63,,-6.07,,-1.6,-4.56,
2,2,1.75,1.99,-2.18,-3.06,-0.58,-3.93,-3.64,7.52,-6.46,...,,,,,,4.08,,,8.98,
3,3,-4.17,-4.61,-0.1,0.05,8.98,9.27,-6.99,0.49,-3.4,...,,,,,,,,,,
4,4,5.15,5.39,7.52,6.26,7.67,3.45,5.44,-0.58,1.26,...,2.28,-0.49,5.1,-0.29,-3.54,-1.36,7.48,-5.78,0.73,2.62
5,5,1.75,-0.78,1.26,6.65,8.25,-8.11,-6.75,2.14,0.34,...,,-3.4,-0.92,-4.27,,-2.57,9.32,7.96,-9.13,3.3
6,6,4.76,1.6,-5.39,-7.52,4.08,4.42,-0.15,-0.24,-3.01,...,-9.95,-4.42,0.97,-3.54,6.36,3.01,3.74,5.19,-9.42,0.53
7,7,3.3,1.07,1.5,7.28,2.52,2.72,-5.87,8.06,-6.65,...,4.32,-1.07,0.49,-2.14,2.57,-5.73,-2.33,2.67,8.69,-2.62
8,8,-2.57,-8.69,-8.4,-5.15,-9.66,9.08,-3.54,2.82,-3.4,...,,,,,,,,,,
9,9,-1.41,-4.66,4.37,-7.14,2.48,9.13,-5.19,7.52,1.36,...,-8.4,-6.26,-1.17,0.44,7.52,8.59,8.88,6.07,8.35,3.06


In [None]:
from sklearn.cluster import KMeans

# Perform KMeans clustering
num_clusters = 5  # You can change the number of clusters
# n_init is set explicitly: it keeps the 10 restarts used in this run and
# avoids the FutureWarning about the default changing in newer scikit-learn
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
kmeans.fit(tfidf_matrix)

# Assign jokes to clusters
dfJk['Cluster'] = kmeans.labels_


In [None]:
# Function to get jokes in the same cluster
def get_bundle_recommendations(joke_id):
  # Get the cluster of the joke
  cluster_id = dfJk.loc[dfJk['JokeId'] == joke_id, 'Cluster'].values[0]

  # Get jokes in the same cluster
  cluster_jokes = dfJk[dfJk['Cluster'] == cluster_id]

  # Return the jokes in the same cluster
  return cluster_jokes
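
The bundling idea above comes down to two steps: cluster the items, then return the items that share a cluster with the query. The following is a minimal sketch of both steps using plain Lloyd's k-means on toy 2-D points. The names kmeans_labels and bundle_for, the points, and the starting centroids are all hypothetical; scikit-learn's KMeans adds smarter initialization and multiple restarts.

```python
def kmeans_labels(points, start_centroids, n_iter=10):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = [list(c) for c in start_centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        for i, point in enumerate(points):
            distances = [sum((p - q) ** 2 for p, q in zip(point, c)) for c in centroids]
            labels[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def bundle_for(item, labels):
    """Other items that share a cluster with `item`."""
    return [i for i, lab in enumerate(labels) if lab == labels[item] and i != item]

toy_points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels = kmeans_labels(toy_points, start_centroids=[[0.0, 0.0], [5.0, 5.0]])
print(labels, bundle_for(2, labels))  # [0, 0, 1, 1] [3]
```

Unlike get_bundle_recommendations, this sketch leaves the query item out of its own bundle, which is usually what a recommendation list wants.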

In [None]:
# Example: Get bundle recommendations for joke with ID 1
bundle_recommendations = get_bundle_recommendations(1)
print(bundle_recommendations)

    JokeId                                           JokeText  \
0        0  A man visits the doctor. The doctor says "I ha...   
1        1  This couple had an excellent relationship goin...   
2        2  Q. What's 200 feet long and has 4 teeth? \n\nA...   
4        4  Q.\tWhat's O. J. Simpson's Internet address? \...   
5        5  Bill & Hillary are on a trip back to Arkansas....   
7        7  Q. Did you hear about the dyslexic devil worsh...   
10      10  Q. What do a hurricane, a tornado, and a redne...   
12      12  They asked the Japanese visitor if they have e...   
14      14  Q:  What did the blind person say when given s...   
15      15  Q. What is orange and sounds like a parrot?  \...   
18      18  Q: If a person who speaks three languages is c...   
22      22  Q: What is the Australian word for a boomerang...   
23      23  What do you get when you run over a parakeet w...   
24      24  Two kindergarten girls were talking outside: o...   
31      31  A man arrives

### 3. Collaborative Filtering

In [None]:
import numpy as np
from scipy.sparse.linalg import svds

In [None]:
# Display the first few rows of each DataFrame to understand the structure
print("Jokes DataFrame:")
print(dfJk.head())

print("\nUser Ratings DataFrame:")
print(dfJkRts.head())

Jokes DataFrame:
   JokeId                                           JokeText  \
0       0  A man visits the doctor. The doctor says "I ha...   
1       1  This couple had an excellent relationship goin...   
2       2  Q. What's 200 feet long and has 4 teeth? \n\nA...   
3       3  Q. What's the difference between a man and a t...   
4       4  Q.\tWhat's O. J. Simpson's Internet address? \...   

                                            combined  Cluster  
0  0 A man visits the doctor. The doctor says "I ...        1  
1  1 This couple had an excellent relationship go...        1  
2  2 Q. What's 200 feet long and has 4 teeth? \n\...        1  
3  3 Q. What's the difference between a man and a...        3  
4  4 Q.\tWhat's O. J. Simpson's Internet address?...        1  

User Ratings DataFrame:
   JokeId  User1  User2  User3  User4  User5  User6  User7  User8  User9  ...  \
0       0   5.10  -8.79  -3.50   7.14  -8.79   9.22  -4.03   3.11  -3.64  ...   
1       1   4.90  -0.87  -2

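Next, prepare the data for the Surprise library.

The ratings table is in wide format, with one row per joke and one column per user. Surprise expects long format, with one (user, item, rating) row per rating, so we reshape the table and drop the cells that have no rating.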
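
As an illustration of the reshaping, here is a dependency-free sketch on a made-up two-joke, two-user table. The variable names and the toy values are hypothetical; the notebook itself uses pandas melt for the same purpose.

```python
# Wide format: one row per joke, one column per user (None = not rated).
wide = [
    {"JokeId": 0, "User1": 5.1, "User2": None},
    {"JokeId": 1, "User1": 4.9, "User2": -0.87},
]

long_rows = []
for row in wide:
    for column, rating in row.items():
        if column == "JokeId" or rating is None:
            continue  # skip the id column and unrated cells
        user_id = int(column[len("User"):])  # "User2" -> 2
        long_rows.append((user_id, row["JokeId"], rating))

# Long format: one (user_id, JokeId, rating) tuple per rating.
long_rows.sort()
print(long_rows)  # [(1, 0, 5.1), (1, 1, 4.9), (2, 1, -0.87)]
```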

In [None]:
# Transform the user rating data from wide to long format
dfJkRts_long = dfJkRts.melt(id_vars=['JokeId'], var_name='user_id', value_name='rating')

# Convert the user_id to a consistent format (raw string avoids an invalid-escape warning)
dfJkRts_long['user_id'] = dfJkRts_long['user_id'].str.extract(r'(\d+)').astype(int)

# Drop unrated (NaN) entries; negative ratings are valid and are kept
dfJkRts_long.dropna(subset=['rating'], inplace=True)

# Display the transformed DataFrame
print("\nTransformed User Ratings DataFrame:")
print(dfJkRts_long.head())




Transformed User Ratings DataFrame:
   JokeId  user_id  rating
0       0        1    5.10
1       1        1    4.90
2       2        1    1.75
3       3        1   -4.17
4       4        1    5.15


In [None]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp310-cp310-linux_x86_64.whl size=2357272 sha256=2a92c75c6d50f4d2435253710d29ea54a81deba07d6d89fa3c5ec5889671b102
  Stored in directory: /root/.cache/pip/wheels/4b/3f/df/6acbf0a40397d9bf3ff97f582cc22fb9ce66adde75bc71fd54
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Succe

In [None]:
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split

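# Define the rating scale and load the data into Surprise dataset.
# Ratings here run from about -10 to 10 (see the sample rows above). Surprise
# clips predictions to rating_scale, so the scale must cover that range.
reader = Reader(rating_scale=(-10, 10))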
data = Dataset.load_from_df(dfJkRts_long[['user_id', 'JokeId', 'rating']], reader)

# Split the data into training and test sets
trainset, testset = train_test_split(data, test_size=0.25)


In [None]:
from surprise import Dataset, Reader, NMF, accuracy
# Define the NMF model
model = NMF(n_factors=15, n_epochs=50, random_state=10)

# Fit the model to the training data
model.fit(trainset)

# Test the model on the test data
predictions = model.test(testset)

# Evaluate the accuracy of the model
rmse = accuracy.rmse(predictions)
print(f'RMSE: {rmse}')

RMSE: 5.3062
RMSE: 5.306195738927242
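
For reference, RMSE is the square root of the mean squared difference between the true and the estimated ratings, so it is expressed in the same units as the ratings. A minimal sketch, with a hypothetical rmse helper and made-up (true, estimate) pairs:

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    pairs = list(pairs)
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

toy = [(5.0, 3.0), (-4.0, 0.0), (1.0, 1.0)]
print(rmse(toy))  # sqrt((4 + 16 + 0) / 3), about 2.58
```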


#### Build Model

In [None]:
from surprise import SVD, accuracy

# Define the SVD model
model = SVD()

# Fit the model to the training data
model.fit(trainset)

# Test the model on the test data
predictions = model.test(testset)

# Evaluate the accuracy of the model
rmse = accuracy.rmse(predictions)
print(f'RMSE: {rmse}')

RMSE: 4.8715
RMSE: 4.871519129782608
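
As I understand the prediction rule documented for Surprise's SVD, an estimate is the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, and the result is then clipped to the Reader's rating scale. The sketch below reproduces that rule with made-up numbers. The function predict_rating and all values are hypothetical, not taken from the fitted model.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   rating_scale=(-10.0, 10.0)):
    """Biased matrix-factorization estimate, clipped to rating_scale."""
    raw = global_mean + user_bias + item_bias + sum(
        p * q for p, q in zip(user_factors, item_factors))
    low, high = rating_scale
    return max(low, min(high, raw))

args = (0.5, -1.0, 0.25, [0.5, -2.0], [1.0, 1.5])
print(predict_rating(*args))                           # -2.75
print(predict_rating(*args, rating_scale=(1.0, 5.0)))  # 1.0 (clipped)
```

The second call shows why the rating scale should cover the range of the observed ratings: a scale that is too narrow forces every estimate into that band, whatever the model learned.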


In [None]:
# Tune hyperparameters

from surprise.model_selection import GridSearchCV

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=2)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

4.922540325104562
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}


In [None]:
# Function to get collaborative filtering recommendations
def get_collaborative_recommendations(user_id, dfJokes, model, top_n=10):
    # Predict ratings for all jokes for the given user
    joke_ids = dfJokes['JokeId'].unique()
    predictions = [model.predict(user_id, joke_id) for joke_id in joke_ids]

    # Sort predictions by estimated rating
    predictions.sort(key=lambda x: x.est, reverse=True)

    # Get the top N joke ids
    top_jokes = [pred.iid for pred in predictions[:top_n]]

    # Return the top N recommended jokes (rows keep dfJokes order, not rank order)
    return dfJokes[dfJokes['JokeId'].isin(top_jokes)]

# Example: Get collaborative filtering recommendations for user with ID 1
collab_recommendations = get_collaborative_recommendations(1, dfJk, model)
print(collab_recommendations)

    JokeId                                           JokeText  \
0        0  A man visits the doctor. The doctor says "I ha...   
4        4  Q.\tWhat's O. J. Simpson's Internet address? \...   
26      26  Clinton returns from a vacation in Arkansas an...   
31      31  A man arrives at the gates of heaven. St. Pete...   
34      34  An explorer in the deepest Amazon suddenly fin...   
35      35  A guy walks into a bar, orders a beer and says...   
42      42  Arnold Swartzeneger and Sylvester Stallone are...   
47      47  The graduate with a Science degree asks, "Why ...   
53      53  The Pope dies and, naturally, goes to heaven. ...   
63      63  What is the rallying cry of the International ...   

                                             combined  Cluster  
0   0 A man visits the doctor. The doctor says "I ...        1  
4   4 Q.\tWhat's O. J. Simpson's Internet address?...        1  
26  26 Clinton returns from a vacation in Arkansas...        0  
31  31 A man arrives at 

#### Predict

Let's first see which jokes user #3 has already rated.

In [None]:
dfJkRts_long[dfJkRts_long['user_id'] == 3]

Unnamed: 0,JokeId,user_id,rating
200,0,3,-3.50
201,1,3,-2.91
202,2,3,-2.18
203,3,3,-0.10
204,4,3,7.52
...,...,...,...
295,95,3,3.98
296,96,3,-6.46
297,97,3,-6.89
298,98,3,-2.33
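
Knowing which jokes a user has already rated is typically used to keep those jokes out of the recommendations. A minimal sketch of that filtering step, assuming the estimates are available as a dictionary; recommend_unseen and the toy values are hypothetical.

```python
def recommend_unseen(estimates, already_rated, top_n=3):
    """Rank joke ids by estimated rating, skipping ids the user already rated.

    estimates: {joke_id: predicted rating}; already_rated: set of joke ids.
    """
    unseen = [(jid, est) for jid, est in estimates.items() if jid not in already_rated]
    unseen.sort(key=lambda pair: pair[1], reverse=True)
    return [jid for jid, _ in unseen[:top_n]]

toy_estimates = {0: 4.1, 1: -2.0, 2: 6.3, 3: 5.0, 4: 0.5}
print(recommend_unseen(toy_estimates, already_rated={2, 4}, top_n=2))  # [3, 0]
```

The returned list is in rank order, so the best unseen joke comes first.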


In [None]:
model.predict(1, 80)

Prediction(uid=1, iid=80, r_ui=None, est=4.253064712998293, details={'was_impossible': False})