## Using SVD for a Recommender System
SVD is a popular matrix factorization algorithm that can be used for recommender systems.
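For intuition, the SVD variant in Surprise (the library used below) estimates a rating as the global mean plus a user bias, an item bias, and the dot product of a user factor vector and an item factor vector. The sketch below applies that rule to made-up numbers. The function name `predict_rating` and all parameter values are illustrative assumptions, not values learned from this dataset.

```python
import numpy as np

def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   rating_scale=(1, 5)):
    """Estimate a rating as mu + b_u + b_i + q_i . p_u, clipped to the rating scale."""
    estimate = global_mean + user_bias + item_bias + np.dot(item_factors, user_factors)
    low, high = rating_scale
    return float(np.clip(estimate, low, high))

# Made-up parameters for one user and one book (two latent factors).
mu = 3.8                      # global mean rating
b_u, b_i = 0.2, -0.1          # user and item biases
p_u = np.array([0.5, -0.2])   # user factors
q_i = np.array([0.4, 0.1])    # item factors

print(predict_rating(mu, b_u, b_i, p_u, q_i))
```

Training amounts to choosing the biases and factor vectors that make these estimates match the observed ratings as closely as possible.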

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Read the Dataset

In [3]:
ratings = pd.read_csv('./goodbooks/ratings.csv')
metadata = pd.read_csv('./goodbooks/books.csv')

In [7]:
print(ratings.shape)
ratings.sample(5)

(981756, 3)


| | book_id | user_id | rating |
|---|---|---|---|
| 804129 | 8115 | 4806 | 4 |
| 643478 | 6464 | 53079 | 4 |
| 147429 | 1475 | 31559 | 1 |
| 63067 | 631 | 29312 | 3 |
| 877762 | 8883 | 38424 | 3 |


In [10]:
ratings['rating'].value_counts()

4    357366
5    292961
3    248623
2     63231
1     19575
Name: rating, dtype: int64

The ratings file lists the ratings that users gave to books. Values range from 1 to 5, and most ratings are 3 or higher.

In [8]:
print(metadata.shape)
metadata.sample(5)

(10000, 23)


| | id | book_id | best_book_id | work_id | books_count | isbn | isbn13 | authors | original_publication_year | original_title | ... | ratings_count | work_ratings_count | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5810 | 5811 | 15745371 | 15745371 | 21434921 | 18 | 1250012899 | 9781250000000.0 | C.C. Hunter | 2014.0 | | ... | 22968 | 24381 | 1506 | 186 | 615 | 2881 | 7074 | 13625 | https://images.gr-assets.com/books/1345739181m... | https://images.gr-assets.com/books/1345739181s... |
| 4359 | 4360 | 31106 | 31106 | 2484087 | 39 | 670038571 | 9780670000000.0 | Jane Green | 2007.0 | Second Chance | ... | 22795 | 23625 | 1121 | 577 | 2375 | 8762 | 7839 | 4072 | https://s.gr-assets.com/assets/nophoto/book/11... | https://s.gr-assets.com/assets/nophoto/book/50... |
| 2461 | 2462 | 88815 | 88815 | 725380 | 86 | 151013047 | 9780151000000.0 | Mohsin Hamid | 2007.0 | The Reluctant Fundamentalist | ... | 33879 | 40953 | 4713 | 768 | 3367 | 12197 | 16905 | 7716 | https://s.gr-assets.com/assets/nophoto/book/11... | https://s.gr-assets.com/assets/nophoto/book/50... |
| 1553 | 1554 | 59264 | 59264 | 3736193 | 70 | 786818611 | 9780787000000.0 | Jonathan Stroud | 2006.0 | Ptolemy's Gate | ... | 63838 | 68833 | 1783 | 959 | 1972 | 9826 | 22917 | 33159 | https://s.gr-assets.com/assets/nophoto/book/11... | https://s.gr-assets.com/assets/nophoto/book/50... |
| 8050 | 8051 | 6282334 | 6282334 | 6466309 | 19 | 1400066212 | 9781400000000.0 | Tracy Kidder | 2000.0 | Strength in What Remains | ... | 11833 | 12735 | 1649 | 130 | 455 | 2619 | 5469 | 4062 | https://s.gr-assets.com/assets/nophoto/book/11... | https://s.gr-assets.com/assets/nophoto/book/50... |


In [105]:
pd.merge(ratings, metadata, left_on='book_id', right_on='id')[['title','rating']]

| | title | rating |
|---|---|---|
| 0 | The Hunger Games (The Hunger Games, #1) | 5 |
| 1 | The Hunger Games (The Hunger Games, #1) | 3 |
| 2 | The Hunger Games (The Hunger Games, #1) | 5 |
| 3 | The Hunger Games (The Hunger Games, #1) | 4 |
| 4 | The Hunger Games (The Hunger Games, #1) | 4 |
| ... | ... | ... |
| 981751 | The First World War | 5 |
| 981752 | The First World War | 4 |
| 981753 | The First World War | 5 |
| 981754 | The First World War | 5 |


In [17]:
metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         10000 non-null  int64  
 1   book_id                    10000 non-null  int64  
 2   best_book_id               10000 non-null  int64  
 3   work_id                    10000 non-null  int64  
 4   books_count                10000 non-null  int64  
 5   isbn                       9300 non-null   object 
 6   isbn13                     9415 non-null   float64
 7   authors                    10000 non-null  object 
 8   original_publication_year  9979 non-null   float64
 9   original_title             9415 non-null   object 
 10  title                      10000 non-null  object 
 11  language_code              8916 non-null   object 
 12  average_rating             10000 non-null  float64
 13  ratings_count              10000 non-null  int6

The books file contains the metadata for each of the 10,000 books.

### Create a Surprise Dataset
The Surprise dataset will require 3 things:
1. user IDs
2. item IDs
3. corresponding ratings

In [9]:
from surprise import Dataset, Reader

In [11]:
reader = Reader(rating_scale=(1,5))
data = Dataset.load_from_df(ratings[['user_id','book_id','rating']], reader)

In [12]:
from surprise import SVD
from surprise.model_selection import cross_validate

In [14]:
# cross validate an SVD model with cv=5 (the output below reports 5 splits)
svd = SVD(verbose=True, n_epochs=10)
cross_validate(svd, data, measures=['RMSE','MAE'], cv=5, verbose=True)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Evaluating RMSE, MAE of algorithm SVD on 5 split(s

{'test_rmse': array([0.85079766, 0.85185394, 0.85064099, 0.85099545, 0.84927677]),
 'test_mae': array([0.66982056, 0.67046455, 0.66923777, 0.6700349 , 0.66825207]),
 'fit_time': (25.653196573257446,
  26.722018241882324,
  26.904203176498413,
  27.460997104644775,
  27.5600266456604),
 'test_time': (1.9844048023223877,
  2.082953929901123,
  2.301028251647949,
  2.0499963760375977,
  2.229996919631958)}

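- Across the five folds, the mean RMSE is about 0.85 and the mean MAE about 0.67 on the 1-to-5 rating scale, so a typical prediction is off by well under one star. Next, we will build a Surprise Trainset from the full data and fit the SVD model on it.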
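To make the two metrics concrete, here is a minimal sketch of how RMSE and MAE are computed, followed by the per-fold scores from the output above averaged into single numbers. The helper names and the toy ratings are assumptions for illustration; the fold scores are copied from the cross-validation output.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: large misses are penalized more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in stars."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Toy example (made-up ratings, not taken from the dataset).
toy_actual = [4, 5, 3, 2]
toy_predicted = [3.5, 4.0, 3.0, 4.0]
print(rmse(toy_actual, toy_predicted), mae(toy_actual, toy_predicted))

# Per-fold scores copied from the cross_validate output above.
fold_rmse = [0.85079766, 0.85185394, 0.85064099, 0.85099545, 0.84927677]
fold_mae = [0.66982056, 0.67046455, 0.66923777, 0.6700349, 0.66825207]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
mean_mae = sum(fold_mae) / len(fold_mae)
print(round(mean_rmse, 4), round(mean_mae, 4))
```

RMSE is never smaller than MAE on the same predictions, which is consistent with the 0.85 versus 0.67 gap seen here.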

In [15]:
# convert data to trainset and fit to model
trainset = data.build_full_trainset()
svd.fit(trainset)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x210a6d2d908>

## Generating Rating Predictions
With the trained SVD model, we will use it to predict the rating a user would assign to a book given the user ID (UID) and item ID (IID).

In [16]:
svd.predict(uid=10, iid=100)

Prediction(uid=10, iid=100, r_ui=None, est=4.076954355955856, details={'was_impossible': False})

The model predicts that user 10 would give roughly a 4-star rating (4.08) to the book with IID 100. A predicted rating this high can justify recommending the book to that user.

## Generating Book Recommendations
Using the rating prediction model, we can then generate book recommendations.

In [18]:
import difflib

In [124]:
def get_book_id(title, metadata):
    """
    Gets the book ID for a book title based on the closest match in the metadata
    """
    
    existing_titles = list(metadata['title'].values)
    
    # using difflib to return similar titles
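    closest_titles = difflib.get_close_matches(title, existing_titles)
    
    # get_close_matches returns an empty list when nothing is similar enough
    if not closest_titles:
        raise ValueError(f'No book title close to {title!r} was found')
    
    return metadata[metadata['title'] == closest_titles[0]]['id'].values[0]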

def get_book_info(book_id, metadata):
    """
    Returns some basic information about a book using the book id
    """
    details = ['id','isbn','authors','title','original_title']
    book_info = metadata[metadata['id'] == book_id][details]
    
    return book_info #.to_dict(orient='records')

def predict_review(user_id, title, model, metadata):
    """
    Predicts the review (on a scale of 1 to 5) that a user would assign to a specific book
    """
    
    book_id = get_book_id(title, metadata)
    review_prediction = model.predict(uid=user_id, iid=book_id)
    
    return review_prediction.est

def generate_recommendation(user_id, model, metadata, thresh=4, top=3):
    """
    Generates a book recommendation for a user based on a rating threshold. Only books with a predicted rating at
    or above the threshold will be recommended
    """
    
    existing_titles = list(metadata['title'].values)
    
    # get list of books already rated by the user
    rated_books = ratings[ratings['user_id']==user_id][['book_id','rating']]
    rated_titles = pd.merge(rated_books, metadata, left_on='book_id', right_on='id')[['title','rating']]
    print(rated_titles)
    
    # add randomness to the recommendations
    np.random.shuffle(existing_titles)
    
    # initialize the counter for number of recommendations
    recommended_count = 0
    recommended_books = pd.DataFrame(columns=['id','isbn','authors','title','original_title'])
    title_count = 0
    
    for title in existing_titles:
        title_count += 1
        if title_count%100 == 0:
            print(title_count, recommended_count)
        
        # stop searching once maximum number of recommendations has been reached
        if recommended_count == top:
            break
        
        # get the predicted rating for the book
        rating = predict_review(user_id, title, model, metadata)
        if rating >= thresh:
            # show comment if the book has already been rated by the user
            if rated_titles['title'].isin([title]).sum()>0:
                print(f'I was going to recommend {title} but you have already rated this title!')
            else:
                book_id = get_book_id(title, metadata)
                # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
                recommended_books = pd.concat([recommended_books, get_book_info(book_id, metadata)])
                recommended_count += 1
    
    return recommended_books
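The title lookup above relies on `difflib.get_close_matches`, which returns candidates ordered from most to least similar and returns an empty list when no candidate reaches the similarity cutoff (0.6 by default). The sketch below shows both cases on a few toy titles. The helper name `find_closest_title` and the toy list are assumptions for illustration.

```python
import difflib

def find_closest_title(query, titles, cutoff=0.6):
    """Return the title most similar to query, or None when nothing clears the cutoff."""
    matches = difflib.get_close_matches(query, titles, n=1, cutoff=cutoff)
    return matches[0] if matches else None

toy_titles = ['Twilight', 'The Great Gatsby', 'To Kill a Mockingbird']

# A small typo still resolves to the intended title.
print(find_closest_title('Twiligt', toy_titles))

# A query that resembles nothing yields None instead of raising IndexError.
print(find_closest_title('zzzzqqqq', toy_titles))
```

Because goodbooks titles often carry a series suffix in parentheses, a short query can fall below the default cutoff even when it is a prefix of the full title, so handling the no-match case explicitly is worth the extra lines.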

In [126]:
generate_recommendation(1, svd, metadata, 4, 3)

                                               title  rating
0                            The Forty Rules of Love       4
1  Brunelleschi's Dome: How a Renaissance Genius ...       3
2  Born on a Blue Day: Inside the Extraordinary M...       4


| | id | isbn | authors | title | original_title |
|---|---|---|---|---|---|
| 1327 | 1328 | 1439177724 | Terry Hayes | I Am Pilgrim (Pilgrim, #1) | I Am Pilgrim |
| 2038 | 2039 | 670175919 | Robert McCloskey | Blueberries for Sal | Blueberries for Sal |
| 8965 | 8966 | 749935081 | Susan Elizabeth Phillips | Ain't She Sweet | Ain't She Sweet? |
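`generate_recommendation` shuffles the catalogue and returns the first few books whose predicted rating clears the threshold, so results vary between runs. An alternative, sketched below under stated assumptions, scores every unrated book and keeps the highest predictions. The function `top_n_unrated`, the `predict_fn` callable, and the toy scores are hypothetical; with the fitted model, `predict_fn` could be something like `lambda uid, iid: svd.predict(uid, iid).est`.

```python
import heapq

def top_n_unrated(user_id, item_ids, rated_item_ids, predict_fn, n=3, thresh=4.0):
    """Rank unrated items by predicted rating and return the n best above thresh."""
    rated = set(rated_item_ids)
    # Lazily score only the items the user has not rated yet.
    scored = ((predict_fn(user_id, item_id), item_id)
              for item_id in item_ids if item_id not in rated)
    qualifying = (pair for pair in scored if pair[0] >= thresh)
    # nlargest keeps just n candidates in memory while scanning the catalogue.
    return [(item_id, est) for est, item_id in heapq.nlargest(n, qualifying)]

# Toy stand-in for a trained model (made-up predicted ratings per book id).
toy_scores = {101: 4.6, 102: 3.2, 103: 4.9, 104: 4.1, 105: 4.8}

def toy_predict(user_id, item_id):
    return toy_scores[item_id]

# The user has already rated book 103, so it is excluded from the ranking.
print(top_n_unrated(1, toy_scores, [103], toy_predict, n=2))
```

This trades the variety of the shuffled approach for deterministic, best-first results, and it calls the model once per unrated book rather than stopping early.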


## Visualizing the Book Factors
We can visualize the similarity between books using the book factor matrix (svd.qi). A dimensionality reduction technique called t-SNE (t-distributed Stochastic Neighbor Embedding) maps each book's 100-dimensional factor vector to a two-dimensional point.
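Similarity can also be checked directly in the original factor space, before any projection. The sketch below ranks rows of a factor matrix by cosine similarity to a chosen row. It uses a small made-up matrix in place of `svd.qi`, and the function name `most_similar_items` is an assumption for illustration.

```python
import numpy as np

def most_similar_items(item_factors, item_index, k=3):
    """Return the indices of the k rows most cosine-similar to row item_index."""
    norms = np.linalg.norm(item_factors, axis=1, keepdims=True)
    unit = item_factors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = unit @ unit[item_index]
    sims[item_index] = -np.inf  # never report the item as its own neighbor
    return np.argsort(-sims)[:k].tolist()

# Toy factor matrix: four items, two latent factors (made-up values).
toy_factors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

print(most_similar_items(toy_factors, 0, k=2))
```

The returned indices refer to rows of the factor matrix, so with a Surprise model they would be inner item ids rather than book ids.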

In [127]:
from sklearn.manifold import TSNE

In [151]:
svd.qi.shape

(10000, 100)

In [128]:
tsne = TSNE(n_components=2, n_iter=500, verbose=3, random_state=42)
books_embedding = tsne.fit_transform(svd.qi)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 10000 samples in 0.266s...
[t-SNE] Computed neighbors for 10000 samples in 28.352s...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[t-SNE] Computed conditional probabilities for sample 2000 / 10000
[t-SNE] Computed conditional probabilities for sample 3000 / 10000
[t-SNE] Computed conditional probabilities for sample 4000 / 10000
[t-SNE] Computed conditional probabilities for sample 5000 / 10000
[t-SNE] Computed conditional probabilities for sample 6000 / 10000
[t-SNE] Computed conditional probabilities for sample 7000 / 10000
[t-SNE] Computed conditional probabilities for sample 8000 / 10000
[t-SNE] Computed conditional probabilities for sample 9000 / 10000
[t-SNE] Computed conditional probabilities for sample 10000 / 10000
[t-SNE] Mean sigma: 0.233294
[t-SNE] Computed conditional probabilities in 0.475s
[t-SNE] Iteration 50: error = 95.4130249, gradient norm = 0.1047006 (50 iterations in 5.421s)
[t-SNE] I

In [131]:
books_embedding[:5]

array([[-2.2913356, -1.117074 ],
       [ 1.1640812, -4.400837 ],
       [-3.1550276, -4.067922 ],
       [-4.2195   ,  1.8293823],
       [-4.779417 ,  0.89934  ]], dtype=float32)

In [133]:
projection = pd.DataFrame(columns=['x', 'y'], data=books_embedding)
projection['title'] = metadata['original_title']
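The assignment above pairs row i of the embedding with row i of `metadata`. Surprise stores item factors by inner item id, assigned in the order items first appear in the ratings, so the pairing holds only when the ratings introduce books in the same order as `metadata`. A more defensive sketch joins on the raw book id instead. The function `build_projection` and the toy data are assumptions for illustration; with Surprise, the raw ids could be obtained as `[trainset.to_raw_iid(i) for i in range(trainset.n_items)]`.

```python
import numpy as np
import pandas as pd

def build_projection(embedding, raw_item_ids, metadata):
    """Attach titles to 2-D points by joining on the raw book id, not on row position."""
    points = pd.DataFrame(embedding, columns=['x', 'y'])
    points['id'] = list(raw_item_ids)
    # A left join keeps one output row per embedding row, in the original order.
    return points.merge(metadata[['id', 'title']], on='id', how='left')

# Toy stand-ins (made-up values, not taken from the dataset).
toy_embedding = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
toy_raw_ids = [3, 1, 2]  # row 0 of the embedding belongs to book id 3
toy_metadata = pd.DataFrame({'id': [1, 2, 3], 'title': ['A', 'B', 'C']})

toy_projection = build_projection(toy_embedding, toy_raw_ids, toy_metadata)
print(toy_projection)
```

In the toy data the factor rows are deliberately out of order, and the join still attaches the right title to each point.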

In [134]:
projection.head()

| | x | y | title |
|---|---|---|---|
| 0 | -2.291336 | -1.117074 | The Hunger Games |
| 1 | 1.164081 | -4.400837 | Harry Potter and the Philosopher's Stone |
| 2 | -3.155028 | -4.067922 | Twilight |
| 3 | -4.2195 | 1.829382 | To Kill a Mockingbird |
| 4 | -4.779417 | 0.89934 | The Great Gatsby |


In [137]:
import plotly.express as px

In [156]:
fig = px.scatter(projection, x='x', y='y')
fig.show()

- Some books are popular across a wide range of audiences; a plausible reading is that these sit near the center of the scatterplot.
- Other books belong to specific genres such as mystery, romance, or fantasy and appeal to narrower audiences; these would tend to sit further from the center.

We can look at some of the book titles to see where they sit.

In [148]:
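def plot_books(titles):
    """
    Plots the t-SNE coordinates of the given book titles
    """
    books = []
    for title in titles:
        # skip missing titles (NaN) and the literal string 'null'
        if pd.notna(title) and title != 'null':
            # book ids start at 1, so id - 1 is the row position in projection
            books.append(get_book_id(title, metadata)-1)
        
    book_vector_df = projection.iloc[books]
    
    fig = px.scatter(book_vector_df, x='x', y='y', text='title')
    fig.show()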

In [150]:
books = list(metadata['title'].sample(20))
plot_books(books)