$$ ITI \space AI-Pro: \space Intake \space 44 $$
$$ Recommender \space Systems $$
$$ Lab \space no. \space 2 $$

# `01` Import Necessary Libraries

## `i` Default Libraries

In [11]:
!pip install scikit-surprise --user

DEPRECATION: Loading egg at c:\python311\lib\site-packages\vboxapi-1.0-py3.11.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330




In [69]:
import numpy as np
import pandas as pd
from surprise.reader import Reader
from surprise.dataset import Dataset
from surprise.model_selection import train_test_split
from surprise.prediction_algorithms.knns import KNNWithMeans

## `ii` Additional Libraries
Add imports for additional libraries you used throughout the notebook

In [70]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

----------------------------

# `02` Load Data

 The dataset will have the following columns :
   - song_id (String) : Unique identified for the song
   - user_id (String) : Unique identifier for the user
   - song_genre (Integer) : An integer representing a genre for the song, value is between 1 and 5, indicating that there are 5 unique genres. Each song can only have 1 genre
   - artist_id (String) : Unique identifier for the author of the song
   - n_listen (Integer) : The number of times this user has heard the song (0 -> 15)
   - publish_year (Integer) : The year of song publishing

In [71]:
data = pd.read_csv("Data/songs_data.csv")
data.head()

Unnamed: 0,song_id,artist_id,song_genre,user_id,n_listen,publish_year
0,537,368,4,2066,13,2002
1,921,107,1,1179,5,2006
2,352,188,1,1468,11,2013
3,853,370,4,460,9,2020
4,479,408,2,1125,3,2020


--------------------------

# `03` Content-based Filtering

Practice for content-based filtering on dummy data

## `i` Feature Engineering/Selection
Construct the item vector representation matrix from the `data` above

In [72]:
encoder = OneHotEncoder(sparse=False)
encoded_features = encoder.fit_transform(data[['song_genre']])
encoded_data = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(['song_genre']))
data_encoded = pd.concat([data, encoded_data], axis=1)

#data_encoded = pd.get_dummies(data, columns=['song_genre'])
# scaler = MinMaxScaler()

# data_encoded[['n_listen', 'publish_year']] = scaler.fit_transform(data_encoded[['n_listen', 'publish_year']])
item_features = ['song_id', 'artist_id', 'n_listen', 'publish_year'] + list(data_encoded.columns[data_encoded.columns.str.startswith('song_genre_')])
item_vectors = data_encoded[item_features].set_index('song_id')
item_vectors.head()




Unnamed: 0_level_0,artist_id,n_listen,publish_year,song_genre_1,song_genre_2,song_genre_3,song_genre_4,song_genre_5
song_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
537,368,13,2002,0.0,0.0,0.0,1.0,0.0
921,107,5,2006,1.0,0.0,0.0,0.0,0.0
352,188,11,2013,1.0,0.0,0.0,0.0,0.0
853,370,9,2020,0.0,0.0,0.0,1.0,0.0
479,408,3,2020,0.0,1.0,0.0,0.0,0.0


## `ii` Utility Matrix
Construct utility matrix for the loaded dataframe `data`

In [73]:
utility_matrix = data.pivot_table(index='user_id', columns='song_id', values='n_listen', fill_value=0)
utility_matrix.head()

song_id,1,2,3,4,5,6,7,8,9,10,...,991,992,993,994,995,996,997,998,999,1000
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
2,15.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,8.0,0.0,0.0,0.0,0.0,11.0,0.0,6.0
3,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,11.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,13.0


## `iii` Item-Item Similarity Matrix

Construct item-item (Cosine/Adjusted Cosine) similarity matrix.

In [74]:
def adjusted_cosine_sim(vec_a, vec_b):
    """
    Returns the adjusted cosine similarity score between two vectors.

            Parameters:
                vec_a (pandas.Series): Vector A
                vec_b (pandas.Series): Vector B

            Returns:
                sim_score (float): Similarity score between vectors vec_a and vec_b
    """

    adj_a = vec_a - np.mean(vec_a)
    adj_b = vec_b - np.mean(vec_b)
    sim_score = np.dot(adj_a, adj_b) / (np.linalg.norm(adj_a)*np.linalg.norm(adj_b))
    # sim_score = cosine_sim(adj_a, adj_b)

    return sim_score

In [40]:
# sim_mat = pd.DataFrame(index=item_vectors.index, columns=item_vectors.index)
# for i in item_vectors.index:
#     for j in item_vectors.index:
#         if i != j:
#             sim_mat.loc[i, j] = adjusted_cosine_sim(item_vectors.loc[i], item_vectors.loc[j])
#         else:
#             sim_mat.loc[i, j] = 1

# sim_df = pd.DataFrame(sim_mat, index=item_vectors.index, columns=item_vectors.index)
# sim_df=sim_df.astype(float)
# sim_df.head()

In [8]:
def apply_similarity_matrix(item_vectors,similarity_func):
  item_item_sim=np.zeros((len(item_vectors),len(item_vectors)))
  for i,(_,vec_a) in enumerate(item_vectors.iterrows()):
    for j,(_,vec_b) in enumerate(item_vectors.iterrows()):
      sim_score=similarity_func(vec_a,vec_b)
      item_item_sim[i,j]=sim_score
  return item_item_sim
sim_df=apply_similarity_matrix(item_vectors,adjusted_cosine_sim)

In [76]:
#both crashes the kernel , I also tried vectroized and batch version , still not working

## `iv` Top-K Candidate Generation

Selet top-K (a k of your choice) similar items for each item (a user of your choice) rated from the similarity matrix above.

In [88]:
def get_top_k_similar_items(similarity_matrix, item_id, k):
    """
    Returns the top k similar items to the given item_id.

            Parameters:
                similarity_matrix (pandas.DataFrame): Item-item similarity matrix
                item_id (int): Item ID
                k (int): Number of similar items to return

            Returns:
                top_k_similar_items (list): List of top k similar items
    """

    top_k_similar_items = similarity_matrix.loc[item_id].sort_values(ascending=False).head(k).index.tolist()

    return top_k_similar_items
  

## `v` Candidate Filtering

Filter out items (your user) has rated from the candidates above.

In [89]:
def get_user_unlistened_songs(user_id, utility_matrix):
    """
    Returns the songs that the user has not listened to.

            Parameters:
                user_id (int): User ID
                utility_matrix (pandas.DataFrame): Utility matrix

            Returns:
                unlistened_songs (list): List of songs that the user has not listened to
    """

    unlistened_songs = utility_matrix.columns[utility_matrix.loc[user_id] == 0].tolist()

    return unlistened_songs

## `vi` Candidate Rating Prediction

Calculate the predicted rating for each of the candidate items.

In [90]:
def predict_rating(user_id, item_id, utility_matrix, sim_df, k):
    """
    Returns the predicted rating for the given user and item.

            Parameters:
                user_id (int): User ID
                item_id (int): Item ID
                utility_matrix (pandas.DataFrame): Utility matrix
                sim_df (pandas.DataFrame): Item-item similarity matrix
                k (int): Number of similar items to consider

            Returns:
                predicted_rating (float): Predicted rating for the given user and item
    """

    unlistened_songs = get_user_unlistened_songs(user_id, utility_matrix)
    sim_items = get_top_k_similar_items(sim_df, item_id, k)
    rating_numerator = 0
    rating_denominator = 0

    for song in unlistened_songs:
        if song != item_id:
            if song in sim_items:
                rating_numerator += utility_matrix.loc[user_id, song] * sim_df.loc[item_id, song]
                rating_denominator += sim_df.loc[item_id, song]

    if rating_denominator != 0:
        predicted_rating = rating_numerator / rating_denominator
    else:
        predicted_rating = 0

    return predicted_rating

--------------------------

# `04` KNN Item-based Colaborative Filtering

Practice for Using Scikit Surprise Library

## `i` Data Loading

Load `songsDataset.csv` file into a dataframe

In [54]:
df = pd.read_csv('Data/songsDataset.csv')
df.head()

Unnamed: 0,userID,songID,rating
0,0,90409,5
1,4,91266,1
2,5,8063,2
3,5,24427,4
4,5,105433,4


## `ii` Prepare Data

Procedures to Follow:
- Instantiate the Reader Object (see, [Documentation](https://surprise.readthedocs.io/en/stable/reader.html))
- Load the Data into `surprise.dataset.Dataset` (see, [Documentation](https://surprise.readthedocs.io/en/stable/dataset.html))
- Build the full (i.e. without folds) `surprise.Trainset` (see, [Documentation](https://surprise.readthedocs.io/en/stable/trainset.html#:~:text=It%20is%20used%20by%20the%20fit()%20method%20of%20every%20prediction%20algorithm.%20You%20should%20not%20try%20to%20build%20such%20an%20object%20on%20your%20own%20but%20rather%20use%20the%20Dataset.folds()%20method%20or%20the%20DatasetAutoFolds.build_full_trainset()%20method.))

In [13]:
reader = Reader(rating_scale=(0,5))

In [30]:
data =Dataset.load_from_df(df, reader)
trainset = data.build_full_trainset()
train, test = train_test_split(data, 0.2, random_state=1234)

## `iii` Initialize the `KNNWithMeans` Model

**Note**: `KNNWithMeans` uses the normalized ratings instead of the raw ones. (See [Documentation](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans))

**Hint**: Use $k=10$ and configure `sim_options` to be:
- item_based
- pearson

In [17]:
knn_model =KNNWithMeans(k=10, sim_options={'name': 'pearson', 'user_based': False})

## `iv` Fit the Model on Data

In [31]:
knn_model.fit(train)

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1b474430690>

## `v` Calculate Predicted Rating $\hat{r}$ for User $199988$

**Hine**: you can use `.predict()` method of the model (see [Documentaion](https://surprise.readthedocs.io/en/stable/getting_started.html?highlight=.predict#train-on-a-whole-trainset-and-the-predict-method:~:text=pred%20%3D%20algo.predict(uid%2C%20iid%2C%20r_ui%3D4%2C%20verbose%3DTrue)))

In [63]:
unique_song_ids = df['songID']
song_predictions = pd.DataFrame({'songID': unique_song_ids})
song_predictions.head()

Unnamed: 0,songID
0,90409
1,91266
2,8063
3,24427
4,105433


Unnamed: 0,songID
0,90409
1,91266
2,8063
3,24427
4,105433


In [67]:
song_predictions['predicted_rating'] = song_predictions['songID'].apply(lambda x: knn_model.predict('199988', x).est)
song_predictions.set_index('songID', inplace=True)
song_predictions.head()

Unnamed: 0_level_0,predicted_rating
songID,Unnamed: 1_level_1
90409,3.450795
91266,3.450795
8063,3.450795
24427,3.450795
105433,3.450795


In [None]:
song_predictions['predicted_rating'] =None
song_predictions.head()

Unnamed: 0_level_0,predicted_rating
songID,Unnamed: 1_level_1
90409,4.808493
91266,4.70561
8063,4.2398
24427,4.549136
105433,4.872347


## `vi` Recommend Top 10 Songs

In [68]:
song_predictions_sorted = song_predictions.sort_values('predicted_rating', ascending=False)
song_predictions_sorted.head(10)

Unnamed: 0_level_0,predicted_rating
songID,Unnamed: 1_level_1
90409,3.450795
17029,3.450795
55622,3.450795
25182,3.450795
74640,3.450795
90409,3.450795
119103,3.450795
72309,3.450795
3785,3.450795
2263,3.450795


Unnamed: 0_level_0,predicted_rating
songID,Unnamed: 1_level_1
60888,5.0
122065,5.0
132189,5.0
71582,5.0
52611,5.0
62954,5.0
112023,5.0
40712,5.0
126757,4.983563
92881,4.941095


----------------------------------------------

$$ Wish \space you \space all \space the \space best \space ♡ $$
$$ Abdelrahman \space Eid $$