<a href="https://colab.research.google.com/github/melikesifa/assignments/blob/main/task_8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Initialize

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Load Data

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Joke metadata
dfJk = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/JokeText.csv')

# User ratings for each joke
dfJkRtg1 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/UserRatings1.csv', index_col= 'JokeId')
dfJkRtg2 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/UserRatings2.csv', index_col= 'JokeId')

In [4]:
dfJkRtg = pd.merge(dfJkRtg1, dfJkRtg2, left_index = True, right_index = True)

In [5]:
dfJk.head()

Unnamed: 0,JokeId,JokeText
0,0,"A man visits the doctor. The doctor says ""I ha..."
1,1,This couple had an excellent relationship goin...
2,2,Q. What's 200 feet long and has 4 teeth? \n\nA...
3,3,Q. What's the difference between a man and a t...
4,4,Q.\tWhat's O. J. Simpson's Internet address? \...


In [6]:
dfJkRtg.shape

(100, 73421)

In [7]:
dfJkRtg.head()

JokeId,User1,User2,User3,User4,User5,User6,User7,User8,User9,User10,...,User73412,User73413,User73414,User73415,User73416,User73417,User73418,User73419,User73420,User73421
0,5.1,-8.79,-3.5,7.14,-8.79,9.22,-4.03,3.11,-3.64,-7.67,...,,,,,,,,,,
1,4.9,-0.87,-2.91,-3.88,-0.58,9.37,-1.55,0.92,-3.35,-5.15,...,,,,,,,,,,
2,1.75,1.99,-2.18,-3.06,-0.58,-3.93,-3.64,7.52,-6.46,-3.25,...,,,,,,,,,,
3,-4.17,-4.61,-0.1,0.05,8.98,9.27,-6.99,0.49,-3.4,-1.65,...,,,,,,,,,,
4,5.15,5.39,7.52,6.26,7.67,3.45,5.44,-0.58,1.26,4.03,...,3.64,4.32,6.99,-9.66,-8.4,-0.63,9.51,-7.67,-1.6,8.3


The ratings contain NaN values. Let's look at how they are distributed before deciding how to handle them.

In [8]:
column_nan_percentage = dfJkRtg.isna().mean() * 100


column_nan_distribution = column_nan_percentage.describe()

column_nan_distribution

Unnamed: 0,0
count,73421.0
mean,43.66244
std,29.015686
min,0.0
25%,25.0
50%,48.0
75%,71.0
max,85.0


In [9]:
row_nan_percentage = dfJkRtg.isna().mean(axis=1) * 100


row_nan_distribution = row_nan_percentage.describe()

row_nan_distribution

Unnamed: 0,0
count,100.0
mean,43.66244
std,24.981459
min,0.010896
25%,24.248512
50%,46.740715
75%,70.115839
max,74.796039


In [10]:
# Forward-fill down each user column (fillna(method='ffill') is deprecated in recent pandas)
dfJkRtg = dfJkRtg.ffill()

In [27]:

cols_num_null = dfJkRtg.columns[dfJkRtg.isnull().any()].tolist()
dfJkRtg = dfJkRtg.drop(cols_num_null, axis=1)

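The two cells above forward-fill the NaN values down each user column and then drop every user column that still contains a NaN (users with no rating for the first joke). The check below confirms that no missing values remain.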

In [32]:
missing_values = dfJkRtg.isna().sum()

#total number of missing values
total_missing = missing_values.sum()

if total_missing > 0:
    print("Missing values in each column:")
    print(missing_values[missing_values > 0])
else:
    print(0)

0
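
To make the effect of the forward fill concrete, here is a minimal sketch on a toy ratings matrix. The `toy` frame and its values are made up for illustration and are not taken from the joke data.

```python
import numpy as np
import pandas as pd

# Toy ratings matrix: rows are jokes, columns are users (hypothetical values).
toy = pd.DataFrame(
    {"UserA": [1.0, np.nan, 3.0], "UserB": [np.nan, 2.0, np.nan]},
    index=pd.Index([0, 1, 2], name="JokeId"),
)

# Forward fill copies each user's previous rating down into unrated jokes.
filled = toy.ffill()

# A user who did not rate the first joke still has a leading NaN ...
still_missing = filled.columns[filled.isna().any()].tolist()

# ... so dropping those columns removes that user entirely.
kept = filled.drop(columns=still_missing)

print(filled)
print("Dropped users:", still_missing)
print("Remaining users:", kept.columns.tolist())
```

Note that the filled values are copies of neighbouring ratings rather than ratings the user actually gave, which is worth keeping in mind when judging the models trained later.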


## Build Recommendations

### 1. Content Based Filtering

The idea is to measure how similar two jokes are from the terms used in their text, while ignoring commonly used words, and then to recommend the jokes whose text is most similar. To do this, **TF-IDF vectorization** turns each joke into a weighted term vector.
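
As a rough illustration of the idea, the sketch below vectorizes three short made-up texts and compares them with cosine similarity. The `toy_jokes` list is hypothetical; only the scikit-learn calls mirror the ones used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up texts: the first two share terms, the third does not.
toy_jokes = [
    "A man visits the doctor about his knee.",
    "The doctor tells a man his knee is fine.",
    "Why did the chicken cross the road?",
]

# Turn each text into a TF-IDF vector, ignoring common English words.
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_tfidf = toy_vectorizer.fit_transform(toy_jokes)

# Pairwise cosine similarity: entry [i, j] compares text i with text j.
toy_sim = cosine_similarity(toy_tfidf)

print(toy_sim.round(2))
```

Texts that share informative terms (here the first two) should score higher with each other than with the unrelated third text.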

#### Build Model

In [13]:
# Generate a matrix of the terms that show up in each joke

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(dfJk['JokeText'])
tfidf_matrix.shape

(100, 3774)

In [14]:
# Calculate cosine similarity between each pair of jokes as a function of the similarity of the common terms

from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_sim.shape

(100, 100)

#### Predict

In [15]:
# Prepare recommendation function keyed by joke text (build code from scratch and then package as function for ease of understanding)

titles = dfJk['JokeText']
indices = pd.Series(dfJk.index, index=dfJk['JokeText'])

def get_recommendations(joke_text):
    # Look up the row position of the joke from its text
    idx = indices[joke_text]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the joke itself) and keep the next 30
    sim_scores = sim_scores[1:31]
    joke_indices = [i[0] for i in sim_scores]
    return titles.iloc[joke_indices]

In [16]:
def get_joke_recommendations(joke_id, top_n=3):
    # Find the index of the joke that matches the given joke_id
    idx = dfJk.index[dfJk['JokeId'] == joke_id].tolist()[0]

    # Calculate cosine similarity for the selected joke
    cosine_similarities = cosine_similarity(tfidf_matrix[idx], tfidf_matrix).flatten()

    # Get the indices of the top_n most similar jokes
    # argsort is ascending, so the last index is the joke itself and is excluded
    similar_indices = cosine_similarities.argsort()[-top_n-1:-1][::-1]

    # Return the top_n similar jokes
    return dfJk.loc[similar_indices]


In [17]:
# Get recommendations for the joke with JokeId 0
recommended_jokes = get_joke_recommendations(0, top_n=2)
print(recommended_jokes)


    JokeId                                           JokeText
86      86  A man, recently completing a routine physical ...
67      67  A man piloting a hot air balloon discovers he ...



### 2. Collaborative Filtering

#### Prepare data

In [33]:
dfJkRtgV2 = dfJkRtg.transpose()

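# Note: after the transpose the first row is User1, so this drops User1's ratings;
# the row later labelled UserId 1 therefore holds the ratings shown for User2 above
dfJkRtgV2 = dfJkRtgV2.drop(dfJkRtgV2.index[0])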

dfJkRtgV2.reset_index(drop=True, inplace=True)
dfJkRtgV2.index += 1

dfJkRtgV2.columns = [f'{i-1}' for i in range(1, dfJkRtgV2.shape[1] + 1)]

# Create the 'UserId' column *before* trying to select it
dfJkRtgV2['UserId'] = range(1, len(dfJkRtgV2) + 1)

# Now you can reorder the columns
dfJkRtgV2 = dfJkRtgV2[['UserId'] + dfJkRtgV2.columns[:-1].tolist()]

# Extract the numeric ids from the dfJkRtg column names ('User1', 'User2', etc.)
user_ids_in_dfJkRtgV2 = [int(col.split('User')[1]) for col in dfJkRtg.columns if 'User' in col]

# Keep as many ids as there are rows in dfJkRtgV2
user_ids_in_dfJkRtgV2 = user_ids_in_dfJkRtgV2[:len(dfJkRtgV2)]

# Update the 'UserId' column in dfJkRtgV2
dfJkRtgV2['UserId'] = user_ids_in_dfJkRtgV2

# Display the resulting DataFrame
dfJkRtgV2

Unnamed: 0,UserId,0,1,2,3,4,5,6,7,8,...,90,91,92,93,94,95,96,97,98,99
1,1,-8.79,-0.87,1.99,-4.61,5.39,-0.78,1.60,1.07,-8.69,...,3.59,1.21,2.86,-0.05,-1.75,-1.02,-0.97,4.13,-1.84,2.96
2,2,-3.50,-2.91,-2.18,-0.10,7.52,1.26,-5.39,1.50,-8.40,...,1.84,-4.03,-1.41,1.65,-3.79,3.98,-6.46,-6.89,-2.33,-7.38
3,3,7.14,-3.88,-3.06,0.05,6.26,6.65,-7.52,7.28,-5.15,...,-4.47,6.36,4.71,-5.19,6.26,3.93,-2.57,1.07,2.33,-0.34
4,4,-8.79,-0.58,-0.58,8.98,7.67,8.25,4.08,2.52,-9.66,...,-0.29,9.37,8.30,9.13,-3.45,9.13,9.17,9.17,9.08,8.98
5,5,9.22,9.37,-3.93,9.27,3.45,-8.11,4.42,2.72,9.08,...,0.73,-1.12,2.28,3.79,3.74,1.94,1.99,3.45,9.17,-1.55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14111,14111,3.30,5.19,8.06,0.39,-7.14,7.57,8.45,3.06,0.29,...,-2.48,0.73,7.77,8.35,-9.22,5.97,8.01,-1.17,-6.50,0.97
14112,14112,1.50,-3.16,-2.04,-1.60,0.78,2.82,4.76,1.65,1.80,...,4.90,4.03,3.69,4.42,0.97,4.95,3.74,2.09,4.37,3.69
14113,14113,5.19,7.62,4.95,8.16,-3.01,5.34,2.62,5.00,-8.64,...,-4.13,-4.32,1.70,-1.89,-8.16,6.21,-3.83,8.64,3.79,4.03
14114,14114,-3.50,3.83,-3.88,0.63,-0.10,-1.02,-8.40,-1.99,0.24,...,7.38,-3.88,7.43,6.75,4.61,-4.03,-2.67,4.56,0.00,4.17


#### Build Model

In [34]:
# Prepare data into Surprise library format

!pip3 install scikit-surprise #or !conda install -c conda-forge scikit-surprise
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import train_test_split

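# Note: the ratings shown above run from about -10 to 10. With rating_scale=(0, 5),
# Surprise clips every prediction to [0, 5]; rating_scale=(-10, 10) would match the data.
reader = Reader(rating_scale=(0,5))
# Melt the DataFrame to have three columns: UserId, JokeId, Rating
# (JokeId values come from the column names, so they are strings such as '88')
df = dfJkRtgV2.melt(id_vars=['UserId'], var_name='JokeId', value_name='Rating')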

# Load the data into Surprise format
X = Dataset.load_from_df(df[['UserId', 'JokeId', 'Rating']], reader)
X_train, X_test = train_test_split(X, test_size=.25)


Successfully installed scikit-surprise-1.1.4
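
For readers unfamiliar with `melt`, here is a minimal sketch of the wide-to-long reshaping on a two-user, two-joke frame. The `wide` frame is a made-up stand-in for `dfJkRtgV2`.

```python
import pandas as pd

# Wide format: one row per user, one column per joke (column names are strings).
wide = pd.DataFrame({"UserId": [1, 2], "0": [5.1, -8.79], "1": [4.9, -0.87]})

# Long format: one row per (user, joke) pair, as Surprise expects.
long = wide.melt(id_vars=["UserId"], var_name="JokeId", value_name="Rating")

print(long)
print(long["JokeId"].map(type).unique())
```

The `JokeId` column in the long frame holds the former column names, so its values are strings rather than integers.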


In [35]:
# Define SVD model

from surprise import SVD

mdlSvdMvsRtg = SVD()

In [36]:
# Fit SVD model

mdlSvdMvsRtg.fit(X_train)
test_pred = mdlSvdMvsRtg.test(X_test)

In [37]:
# Evaluate SVD accuracy

from surprise import accuracy

accuracy.rmse(test_pred)

RMSE: 4.5059


4.505918096170216
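
An RMSE this large on ratings that span roughly -10 to 10 is consistent with predictions being clipped to the (0, 5) rating scale given to the `Reader`. The sketch below uses synthetic ratings, not the joke data, to show how clipping alone inflates the error even for otherwise perfect predictions; the `rmse` helper is defined here for illustration.

```python
import numpy as np

def rmse(pred, truth):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

# Synthetic "true" ratings spread uniformly over [-10, 10].
rng = np.random.default_rng(0)
true_ratings = rng.uniform(-10, 10, size=10_000)

# A hypothetical perfect predictor, before and after clipping to [0, 5].
perfect = true_ratings.copy()
clipped = np.clip(perfect, 0, 5)

rmse_perfect = rmse(perfect, true_ratings)
rmse_clipped = rmse(clipped, true_ratings)

print("RMSE without clipping:", rmse_perfect)
print("RMSE with clipping to [0, 5]:", rmse_clipped)
```

The real ratings are not uniformly distributed, so the numbers here are not directly comparable with the notebook's RMSE; the point is only that a mismatched rating scale adds a large error floor.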

In [40]:
# Tune hyperparameters

from surprise.model_selection import GridSearchCV

param_grid = {'n_epochs': [5, 10, 15], 'lr_all': [0.002],
              'reg_all': [0.4]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=2)

gs.fit(X)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

4.604684651883255
{'n_epochs': 15, 'lr_all': 0.002, 'reg_all': 0.4}


In [41]:
# Cross-validate

from surprise.model_selection import cross_validate

cross_validate(mdlSvdMvsRtg, X, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    4.4895  4.5012  4.4964  4.4914  4.4979  4.4953  0.0043  
MAE (testset)     3.5601  3.5718  3.5650  3.5621  3.5694  3.5657  0.0044  
Fit time          29.24   30.40   30.51   31.08   30.59   30.36   0.61    
Test time         3.29    4.58    3.21    4.69    4.91    4.14    0.73    


{'test_rmse': array([4.48949541, 4.50122239, 4.49637401, 4.49143345, 4.49791083]),
 'test_mae': array([3.56011781, 3.5718494 , 3.56503266, 3.56207681, 3.56938935]),
 'fit_time': (29.242921829223633,
  30.396188020706177,
  30.506810665130615,
  31.07712173461914,
  30.59161925315857),
 'test_time': (3.2890491485595703,
  4.577349662780762,
  3.2106285095214844,
  4.685553550720215,
  4.914057016372681)}

Let us now use the trained model to arrive at predictions.

#### Predict

Let's first see which jokes user 1 has already rated.

In [43]:
dfJkRtgV2[dfJkRtgV2['UserId'] == 1]

Unnamed: 0,UserId,0,1,2,3,4,5,6,7,8,...,90,91,92,93,94,95,96,97,98,99
1,1,-8.79,-0.87,1.99,-4.61,5.39,-0.78,1.6,1.07,-8.69,...,3.59,1.21,2.86,-0.05,-1.75,-1.02,-0.97,4.13,-1.84,2.96


In [45]:
mdlSvdMvsRtg.predict(1, 88)

Prediction(uid=1, iid=88, r_ui=None, est=0.8233812255932014, details={'was_impossible': False})

In [46]:
mdlSvdMvsRtg.predict(1, 3)

Prediction(uid=1, iid=3, r_ui=None, est=0.8233812255932014, details={'was_impossible': False})

In [47]:
mdlSvdMvsRtg.predict(1, 53)

Prediction(uid=1, iid=53, r_ui=None, est=0.8233812255932014, details={'was_impossible': False})

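The model returns the same estimate (about 0.82) for every joke. The likely cause is an id type mismatch: after `melt`, the `JokeId` values are the former column names, which are strings such as `'88'`, while `predict` is called here with the integers `88`, `3`, and `53`. Surprise then treats each joke as unknown and falls back to the global mean plus the bias of user 1, which is identical for all three calls. Passing the id as a string, for example `mdlSvdMvsRtg.predict(1, '88')`, should give joke-specific estimates. Separately, `rating_scale=(0, 5)` clips all estimates to [0, 5] although the ratings run from about -10 to 10, which also helps explain the high RMSE.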
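
Identical estimates like these are what a biased matrix-factorization model produces when it does not recognise the item id. The sketch below is a simplified stand-in for that kind of estimate, not Surprise's actual code: the `estimate` helper, the dictionaries, and all numbers are hypothetical. Item ids are stored as strings, as they would be after melting string column names.

```python
import numpy as np

# Hypothetical learned parameters; item ids are strings, user ids are ints.
global_mean = 0.5
user_bias = {1: 0.3}
item_bias = {"88": -1.2, "3": 0.8}
user_factors = {1: np.array([0.5, -0.2])}
item_factors = {"88": np.array([1.0, 0.4]), "3": np.array([-0.3, 0.9])}

def estimate(uid, iid):
    """Global mean + known biases + factor interaction (if both ids are known)."""
    est = global_mean
    if uid in user_bias:
        est += user_bias[uid]
    if iid in item_bias:
        est += item_bias[iid]
    if uid in user_factors and iid in item_factors:
        est += float(user_factors[uid] @ item_factors[iid])
    return est

# Integer item ids are not found, so only the mean and user bias contribute.
print(estimate(1, 88), estimate(1, 3))

# String item ids match the stored keys, so the estimates differ per item.
print(estimate(1, "88"), estimate(1, "3"))
```

If the same mismatch applies in this notebook, calling `predict` with string joke ids would be the first thing to try.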