In [2]:
import pandas as pd
import numpy as np

### Exploring the data
We are only interested in three files in the movielens folder: u.data, u.user, and u.item. Although these files are not in CSV format, the code required to load them into a Pandas DataFrame is almost identical.

Let's start with `u.user`:

In [3]:
# Load the u.user file into a dataframe
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('../../data/ml-100k/u.user', sep='|', names=u_cols, encoding='latin-1')
users.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


Next, let's take a look at the `u.item` file, which gives us information about the movies that have been rated by our users:

In [4]:
# Load the u.item file into a dataframe
i_cols = ['movie_id', 'title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure', 'Animation',
          'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
          'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
movies = pd.read_csv('../../data/ml-100k/u.item', sep='|', names=i_cols, encoding='latin-1')
movies.head()

Unnamed: 0,movie_id,title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


We see that this file gives us information regarding the movie's title, release date, IMDb URL, and its genre(s). Since we are focused on building only collaborative filters, we do not require any of this information, apart from the movie title and its corresponding ID.

In [5]:
# Remove all information except Movie ID and title
movies = movies[['movie_id', 'title']]

Lastly, let's import the `u.data` file into our notebook. This is arguably the most important file as it contains all the ratings that every user has given to a movie. It is from this file that we will construct our ratings matrix:

In [6]:
# Load the u.data file into a dataframe
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('../../data/ml-100k/u.data', sep='\t', names=r_cols, encoding='latin-1')
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [7]:
# Drop the timestamp column
ratings = ratings.drop('timestamp', axis=1)

In [12]:
for data in [users, movies, ratings]:
    print(data.head(2))
    print('\n')

   user_id  age sex  occupation zip_code
0        1   24   M  technician    85711
1        2   53   F       other    94043


   movie_id             title
0         1  Toy Story (1995)
1         2  GoldenEye (1995)


   user_id  movie_id  rating
0      196       242       3
1      186       302       3




### Training and test data
The ratings DataFrame contains user `ratings` for movies that range from 1 to 5. Therefore, we can model this problem as an instance of supervised learning where we need to predict the rating, given a user and a movie. Although the ratings can take on only five discrete values, we will model this as a regression problem.

We will split our dataset using a slightly hacky way: we will assume that the user_id field is the target variable (or y) and that our ratings DataFrame consists of the predictor variables (or X). We will then pass these two variables into scikit-learn's train_test_split function and stratify it along y. This ensures that the proportion of each class is the same in both the training and testing datasets:

In [13]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Assign X as the original ratings dataframe and y as the user_id column of ratings.
X = ratings.copy()
y = ratings['user_id']

# Split into training and test datasets, stratified along user_id
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,
stratify=y, random_state=42)

### Evaluation
We will be using the `RMSE` to assess our modeling performance.

In [54]:
# Import the mean_squared_error function
from sklearn.metrics import mean_squared_error

# Function that computes the root mean squared error (or RMSE)
def rmse(y_true, y_pred):
    assert not np.isnan(y_true).any(), "y_true contains NaN values"
    assert not np.isnan(y_pred).any(), "y_pred contains NaN values"

    return np.sqrt(mean_squared_error(y_true, y_pred))

### Baseline Collaborative filter

In [15]:
# Define the baseline model to always return 3.
def baseline(user_id, movie_id):
    return 3.0

In [16]:
# Function to compute the RMSE score obtained on the testing set by a model
def score(cf_model):
    # Construct a list of user-movie tuples from the testing dataset
    id_pairs = zip(X_test['user_id'], X_test['movie_id'])

    # Predict the rating for every user-movie tuple
    y_pred = np.array([cf_model(user, movie) for (user, movie) in id_pairs])

    # Extract the actual ratings given by the users in the test data
    y_true = np.array(X_test['rating'])
    
    # Return the final RMSE score
    return rmse(y_true, y_pred)

In [17]:
score(baseline)

1.2488234462885457

### User-based collaborative filtering
Let's first build a ratings matrix where each row represents a user and each column represents a movie. Therefore, the value in the $i^{th}$ row and $j^{th}$ column will denote the rating given by user i to movie j.

In [18]:
# Build the ratings matrix using pivot_table function
r_matrix = X_train.pivot_table(values='rating', index='user_id', columns='movie_id')
r_matrix.head()

movie_id,1,2,3,4,5,6,7,8,9,10,...,1671,1672,1673,1674,1676,1677,1679,1680,1681,1682
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,3.0,4.0,,3.0,5.0,4.0,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,


**Mean**

Let's first build one of the simplest collaborative filters possible. This simply takes in user_id and movie_id and outputs the mean rating for the movie by all the users who have rated it. No distinction is made between the users. In other words, the rating of each user is assigned equal weight.

It is possible that some movies are available only in the test set and not the training set (and consequentially, not in our ratings matrix). In such cases, we will just default to a rating of 3.0, like the baseline model.

In [25]:
# User Based Collaborative Filter using Mean Ratings
def cf_user_mean(user_id, movie_id):
    # Check if movie_id exists in r_matrix
    if movie_id in r_matrix:
        # Compute the mean of all the ratings given to the movie
        mean_rating = r_matrix[movie_id].mean()
    else:
        # Default to a rating of 3.0 in the absence of any information
        mean_rating = 3.0
    return mean_rating

# Compute RMSE for the Mean model
score(cf_user_mean)

1.0300824802393536

We see that the score obtained for this model is lower and therefore better than the baseline.

**Weighted mean**

In the previous model, we assigned equal weights to all the users. However, it makes intuitive sense to give more preference to those users whose ratings are similar to the user in question than the other users whose ratings are not.

Therefore, let's alter our previous model by introducing a weight coefficient. This coefficient will be one of the similarity metrics that we computed in the previous chapter.

Mathematically, it is represented as follows:
$$r_{u,m}=\frac{\sum_{u',u' \neq u} sim(u,u').r_{u',m}}{\sum_{u',u' \neq u} |sim(u,u')|}$$
In this formula, $r_{u,m}$ represents the rating given by user u to movie m.

In [26]:
# Create a dummy ratings matrix with all null values imputed to 0
r_matrix_dummy = r_matrix.copy().fillna(0)

# Import cosine_score
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity matrix using the dummy ratings matrix
cosine_sim = cosine_similarity(r_matrix_dummy, r_matrix_dummy)

# Convert into pandas dataframe
cosine_sim = pd.DataFrame(cosine_sim, index=r_matrix.index, columns=r_matrix.index)
cosine_sim.head(10)

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.108361,0.046638,0.029577,0.245753,0.335853,0.344724,0.191582,0.057149,0.251979,...,0.257073,0.069412,0.231643,0.108093,0.176842,0.104799,0.232472,0.051528,0.129555,0.256333
2,0.108361,1.0,0.057613,0.130237,0.054918,0.190552,0.079399,0.076146,0.167992,0.147376,...,0.136993,0.252887,0.255454,0.285193,0.232751,0.149088,0.102807,0.062386,0.109143,0.107686
3,0.046638,0.057613,1.0,0.139805,0.0,0.032485,0.043869,0.080968,0.022263,0.059925,...,0.027402,0.0,0.17506,0.010343,0.105635,0.019052,0.127099,0.023917,0.060392,0.0
4,0.029577,0.130237,0.139805,1.0,0.0,0.04519,0.088586,0.199526,0.135013,0.026919,...,0.055392,0.049773,0.076549,0.139382,0.113886,0.0,0.130343,0.077357,0.15789,0.063911
5,0.245753,0.054918,0.0,0.0,1.0,0.176443,0.28186,0.132205,0.03879,0.1342,...,0.183969,0.019305,0.073714,0.041807,0.081088,0.029743,0.188392,0.068342,0.055557,0.207259
6,0.335853,0.190552,0.032485,0.04519,0.176443,1.0,0.394725,0.143385,0.125126,0.372679,...,0.328643,0.070809,0.135806,0.17167,0.125446,0.086464,0.230566,0.095478,0.197307,0.185268
7,0.344724,0.079399,0.043869,0.088586,0.28186,0.394725,1.0,0.215861,0.121224,0.378723,...,0.339853,0.110866,0.096055,0.10469,0.126108,0.075012,0.270071,0.020036,0.236086,0.266571
8,0.191582,0.076146,0.080968,0.199526,0.132205,0.143385,0.215861,1.0,0.116173,0.169088,...,0.150048,0.064242,0.118297,0.053969,0.168057,0.095736,0.164157,0.076269,0.089871,0.210995
9,0.057149,0.167992,0.022263,0.135013,0.03879,0.125126,0.121224,0.116173,1.0,0.152694,...,0.082819,0.0644,0.127051,0.069251,0.095673,0.0,0.131458,0.106763,0.089297,0.089583
10,0.251979,0.147376,0.059925,0.026919,0.1342,0.372679,0.378723,0.169088,0.152694,1.0,...,0.279849,0.087828,0.131888,0.111841,0.094423,0.080883,0.255758,0.063461,0.169309,0.181031


With the user cosine similarity matrix in hand, we are now in a position to efficiently calculate the weighted mean scores for this model. However, implementing this model in code is a little more nuanced than its simpler mean counterpart. This is because we need to only consider those cosine similarity scores that have a corresponding, non-null rating. In other words, we need to avoid all users that have not rated movie m.

In [64]:
# User Based Collaborative Filter using Weighted Mean Ratings
def cf_user_wmean(user_id, movie_id):
    # Check if movie_id exists in r_matrix
    if movie_id in r_matrix:
        # Get the similarity scores for the user in question with every other user
        sim_scores = cosine_sim[user_id]

        # Get the user ratings for the movie in question
        m_ratings = r_matrix[movie_id]

        # Extract the indices containing NaN in the m_ratings series
        idx = m_ratings[m_ratings.isnull()].index

        # Drop the NaN values from the m_ratings Series
        m_ratings = m_ratings.dropna()

        # Drop the corresponding cosine scores from the sim_scores series
        sim_scores = sim_scores.drop(idx)
        
        # Compute the final weighted mean
        wmean_rating = np.dot(sim_scores, m_ratings)/ (sim_scores.sum() + 0.000001)
    else:
        # Default to a rating of 3.0 in the absence of any information
        wmean_rating = 3.0
    return wmean_rating

score(cf_user_wmean)

1.0236623800413516

Since we are dealing with positive ratings, the cosine similarity score will always be positive. Therefore, we do not need to explicitly add in a modulus function while computing the normalizing factor.

### User demographics
Finally, let's take a look at filters that leverage user demographic information. The basic intuition behind these filter is that users of the same demographic tend to have similar tastes. Therefore, their effectiveness depends on the assumption that women, or teenagers, or people from the same area will share the same taste in movies.

Unlike the previous models, these filters do not take into account the ratings given by all users to a particular movie. Instead, they only look at those users that fit a certain demographic.

Let's now build a gender demographic filter. All this filter does is identify the gender of a user, compute the (weighted) mean rating of a movie by that particular gender, and return that as the predicted value.

In [66]:
# Merge the original users dataframe with the training set
merged_df = pd.merge(X_train, users)
merged_df.head()

Unnamed: 0,user_id,movie_id,rating,age,sex,occupation,zip_code
0,862,177,4,25,M,executive,13820
1,70,193,4,27,M,engineer,60067
2,666,527,4,44,M,administrator,61820
3,535,168,5,45,F,educator,80302
4,603,1240,5,21,M,programmer,47905


In [76]:
# Compute the mean rating of every movie by gender
gender_mean = merged_df[['movie_id', 'sex', 'rating']].groupby(['movie_id', 'sex'])['rating'].mean()
gender_mean.head()

movie_id  sex
1         F      3.797872
          M      3.888446
2         F      3.285714
          M      3.202703
3         F      2.916667
Name: rating, dtype: float64

In [74]:
# Set the index of the users dataframe to the user_id
users = users.set_index('user_id')
# Gender Based Collaborative Filter using Mean Ratings
def cf_gender(user_id, movie_id):
    # Check if movie_id exists in r_matrix (or training set)
    if movie_id in r_matrix:
        # Identify the gender of the user
        gender = users.loc[user_id]['sex']
        # Check if the gender has rated the movie
        if gender in gender_mean[movie_id]:
            # Compute the mean rating given by that gender to the movie
            gender_rating = gender_mean[movie_id][gender]
        else:
            gender_rating = 3.0
    else:
        # Default to a rating of 3.0 in the absence of any information
        gender_rating = 3.0
    return gender_rating

score(cf_gender)

1.0392906999935203

We see that this model actually performs worse than the standard mean ratings collaborative filter. This indicates that a user's gender isn't the strongest indicator of their taste in movies.

Let's try building one more demographic filter, but this time using both gender and occupation:

In [75]:
#Compute the mean rating by gender and occupation
gen_occ_mean = merged_df[['sex', 'rating', 'movie_id', 'occupation']].pivot_table(values='rating', index='movie_id',
                                                                                  columns=['occupation', 'sex'], aggfunc='mean')
gen_occ_mean.head()

occupation,administrator,administrator,artist,artist,doctor,educator,educator,engineer,engineer,entertainment,...,salesman,salesman,scientist,scientist,student,student,technician,technician,writer,writer
sex,F,M,F,M,M,F,M,F,M,F,...,F,M,F,M,F,M,F,M,F,M
movie_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,3.9375,3.75,5.0,3.4,3.666667,3.25,3.884615,4.0,4.083333,4.0,...,,4.0,3.5,4.0,4.043478,3.796296,4.0,3.75,4.0,3.0
2,3.0,3.666667,,,,4.0,3.5,,3.066667,,...,,,,3.0,2.666667,3.277778,,2.714286,,2.333333
3,3.5,4.0,,,,,2.0,,3.777778,,...,,,,,3.0,3.391304,,4.25,,1.0
4,3.666667,3.6,,4.666667,3.0,2.5,3.8,4.0,3.65,,...,4.0,4.0,,3.4,3.25,3.777778,,3.333333,4.25,3.25
5,4.0,2.333333,,,,4.0,2.333333,,3.5,,...,,,,4.0,4.333333,3.111111,,3.333333,4.0,2.0


In [77]:
# Gender and Occupation Based Collaborative Filter using Mean Ratings
def cf_gen_occ(user_id, movie_id):
    # Check if movie_id exists in gen_occ_mean
    if movie_id in gen_occ_mean.index:
        # Identify the user
        user = users.loc[user_id]
        # Identify the gender and occupation
        gender = user['sex']
        occ = user['occupation']
        # Check if the occupation has rated the movie
        if occ in gen_occ_mean.loc[movie_id]:
            # Check if the gender has rated the movie
            if gender in gen_occ_mean.loc[movie_id][occ]:
                # Extract the required rating
                rating = gen_occ_mean.loc[movie_id][occ][gender]
                # Default to 3.0 if the rating is null
                if np.isnan(rating):
                    rating = 3.0
                return rating
    
    # Return the default rating
    return 3.0

score(cf_gen_occ)

1.1419651376788005

We see that this model performs the worst out of all the filters we've built so far, beating only the baseline. This strongly suggests that tinkering with user demographic data may not be the best way to go forward with the data that we are currently using.

### Item-based collaborative filtering
Item-based collaborative filtering is essentially user-based collaborative filtering where the users now play the role that items played, and vice versa. In item-based collaborative filtering, we compute the pairwise similarity of every item in the inventory. Then, given `user_id` and `movie_id`, we compute the weighted mean of the ratings given by the user to all the items they have rated. The basic idea behind this model is that a particular user is likely to rate two items that are similar to each other similarly.

In [80]:
# Build the ratings matrix using pivot_table function
r_matrix_i = X_train.pivot_table(values='rating', index='movie_id', columns='user_id')
r_matrix_i.head()

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,4.0,,,4.0,4.0,,,,4.0,...,,,4.0,,4.0,,,,,
2,3.0,,,,3.0,,,,,,...,4.0,,,,,,,,,5.0
3,4.0,,,,,,,,,,...,,,4.0,,,,,,,
4,,,,,,,5.0,,,4.0,...,5.0,,,,,,2.0,,,
5,3.0,,,,,,,,,,...,,,,,,,,,,


In [88]:
# Create a dummy ratings matrix with all null values imputed to 0
r_matrix_i_dummy = r_matrix_i.copy().fillna(0)

# Compute the cosine similarity matrix using the dummy ratings matrix
cosine_sim_i = cosine_similarity(r_matrix_i_dummy, r_matrix_i_dummy)

# Convert into pandas dataframe
cosine_sim_i = pd.DataFrame(cosine_sim_i, index=r_matrix_i.index, columns=r_matrix_i.index)
cosine_sim_i.head(5)

movie_id,1,2,3,4,5,6,7,8,9,10,...,1671,1672,1673,1674,1676,1677,1679,1680,1681,1682
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.260375,0.28411,0.339919,0.188551,0.075488,0.493766,0.346421,0.408303,0.196823,...,0.0,0.03838,0.040708,0.0,0.0,0.0,0.0,0.0,0.054278,0.0
2,0.260375,1.0,0.18335,0.362014,0.256462,0.098676,0.286996,0.271497,0.186905,0.099162,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.095783
3,0.28411,0.18335,1.0,0.261785,0.164305,0.063693,0.296699,0.175637,0.225768,0.124924,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.107006
4,0.339919,0.362014,0.261785,1.0,0.192404,0.049803,0.357379,0.367472,0.337266,0.190223,...,0.0,0.046614,0.0,0.0,0.10987,0.0,0.0,0.0,0.065922,0.087896
5,0.188551,0.256462,0.164305,0.192404,1.0,0.060136,0.276375,0.18241,0.261563,0.045282,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [102]:
# Item Based Collaborative Filter using Weighted Mean Ratings
def cf_item_wmean(user_id, movie_id):
    # Check if movie_id exists in r_matrix_i
    if movie_id in r_matrix_i.columns:
        # Get the similarity scores for the movie in question with every other movie
        sim_scores = cosine_sim_i[movie_id]

        # Get the movie ratings for the user in question
        m_ratings = r_matrix_i[user_id]

        # Extract the indices containing NaN in the m_ratings series
        idx = m_ratings[m_ratings.isnull()].index

        # Drop the NaN values from the m_ratings Series
        m_ratings = m_ratings.dropna()

        # Drop the corresponding cosine scores from the sim_scores series
        sim_scores = sim_scores.drop(idx)
        
        # Compute the final weighted mean
        wmean_rating = np.dot(sim_scores, m_ratings)/ (sim_scores.sum() + 0.000001)
    else:
        # Default to a rating of 3.0 in the absence of any information
        wmean_rating = 3.0
    return wmean_rating

score(cf_item_wmean)

1.0265550736985705