# Exercises in Recommender systems

This notebook contains exercises in Recommender systems

## Exercise 1

Using the "Coursera Courses Dataset 2021" available at kaggle ([https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021](https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021)) or on moodle, to do the following:

1. Create a Content-based filtering recommender system based on the Course Descriptions.
2. Create a Content-based filtering recommender system based on the Skills.

Using the "Book Recommendation Dataset" available at kaggle ([https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset)) or on moodle, to do the following:

3. Load in the `Ratings.csv` file (on moodle, it is called `Books_Ratings.csv`). Group by `User-ID` and sort by `Book-Rating` in descending order to get the users who rated most books. Filter the rating data to only contain the 200 users that rated most books.
4. Create a Collaborative filtering recommender system based on the user ratings from 3 together with the `Books.csv` dataset.

### 1. Create a Content-based filtering recommender system based on the Course Descriptions.

In [1]:
import kagglehub
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
# Download latest version
path = kagglehub.dataset_download("khusheekapoor/coursera-courses-dataset-2021")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\Bobby\.cache\kagglehub\datasets\khusheekapoor\coursera-courses-dataset-2021\versions\1


In [None]:
df = pd.read_csv("C:/Users/Bruger/.cache/kagglehub/datasets/khusheekapoor/coursera-courses-dataset-2021/versions/1/Coursera.csv")

In [5]:
df

Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,�cole Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you�ll learn how to effectively...,Data Analysis select (sql) database manageme...
...,...,...,...,...,...,...,...
3517,"Capstone: Retrieving, Processing, and Visualiz...",University of Michigan,Beginner,4.6,https://www.coursera.org/learn/python-data-vis...,"In the capstone, students will build a series ...",Databases syntax analysis web Data Visuali...
3518,Patrick Henry: Forgotten Founder,University of Virginia,Intermediate,4.9,https://www.coursera.org/learn/henry,"�Give me liberty, or give me death:� Rememberi...",retirement Causality career history of the ...
3519,Business intelligence and data analytics: Gene...,Macquarie University,Advanced,4.6,https://www.coursera.org/learn/business-intell...,�Megatrends� heavily influence today�s organis...,analytics tableau software Business Intellig...
3520,Rigid Body Dynamics,Korea Advanced Institute of Science and Techno...,Beginner,4.6,https://www.coursera.org/learn/rigid-body-dyna...,"This course teaches dynamics, one of the basic...",Angular Mechanical Design fluid mechanics F...


In [6]:
def feature_fiting(df, feature):
    df = df.copy()
    #RWe first replace missing values with an empty string
    df[feature] = df[feature].fillna('')
    
    #Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
    tfidf = TfidfVectorizer(stop_words='english')
    
    #Construct the required TF-IDF matrix by fitting and transforming the data
    tfidf_matrix = tfidf.fit_transform(df[feature])
    
    return tfidf_matrix

In [7]:
tfidf_matrix = feature_fiting(df, "Course Description")

In [8]:
%%time 
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

CPU times: total: 328 ms
Wall time: 303 ms


In [9]:
cosine_sim

array([[1.00000000e+00, 3.12366523e-02, 1.97603991e-02, ...,
        3.17538002e-02, 3.33859933e-02, 1.96231367e-02],
       [3.12366523e-02, 1.00000000e+00, 8.58915185e-03, ...,
        3.13671991e-02, 4.88239107e-03, 4.56033552e-02],
       [1.97603991e-02, 8.58915185e-03, 1.00000000e+00, ...,
        3.45669421e-03, 1.65197252e-02, 6.37237740e-03],
       ...,
       [3.17538002e-02, 3.13671991e-02, 3.45669421e-03, ...,
        1.00000000e+00, 5.07544593e-04, 6.72367274e-03],
       [3.33859933e-02, 4.88239107e-03, 1.65197252e-02, ...,
        5.07544593e-04, 1.00000000e+00, 1.14068789e-03],
       [1.96231367e-02, 4.56033552e-02, 6.37237740e-03, ...,
        6.72367274e-03, 1.14068789e-03, 1.00000000e+00]])

In [10]:
cosine_sim.shape

(3522, 3522)

In [None]:
def get_recommendations(_indicies, feature_return, cosine_sim=cosine_sim):
    indices = pd.Series(df.index, index=df[feature_return]).drop_duplicates()
  
    idx = indices[_indicies]

    # Get the pairwise similarity scores of all courses with that description
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar courses
    sim_scores = sim_scores[1:11]

    # Get the course indices
    course_indices = [i[0] for i in sim_scores]

    return df[feature_return].iloc[course_indices]

In [12]:
get_recommendations(df.iloc[0]["Course Description"], "Course Description")

1481    What you�ll achieve:   In this project-centere...
1629    WRITE YOUR FIRST NOVEL  If you�ve ever had the...
3481    Do you have a desire to write a novel, write a...
2186    Everything comes together in the Capstone: sto...
3445    Do you need to write more easily and effective...
3384    This course aims to improve your Business Engl...
2894    In 2-hours long project-based course, you will...
614     Acquiring good academic research and writing s...
2732    Want your workplace writing to make a positive...
104     Writing well is one of the most important skil...
Name: Course Description, dtype: object

In [13]:
get_recommendations(df.iloc[1]["Course Description"], "Course Description")

3311    By the end of this guided project, you will be...
3232    By the end of this 2.5 project, you will be fl...
1636    By the end of this project, you will be fluent...
954     By the end of this project, you will be fluent...
10      By the end of this guided project, you will be...
2147    By the end of this guided project, you will be...
1915    By the end of this project, you will be fluent...
3400    By the end of this 2 hour-long guided project,...
422     This guided project was developed to engage an...
969     By the end of this guided project, you will be...
Name: Course Description, dtype: object

### 2. Create a Content-based filtering recommender system based on the Skills.

In [14]:
tfidf_matrix = feature_fiting(df, "Skills")

In [15]:
%%time 
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

CPU times: total: 109 ms
Wall time: 119 ms


In [16]:
get_recommendations(df.iloc[0]["Skills"], "Skills")

1481    bible  film  film studies  Cinematography  Wri...
1629    art  Interior Design  Fiction Writing  languag...
3481    Fiction Writing  film  Writing  determination ...
2186    Peer Review  project  public speaking  write-o...
3445    Writing  Office Administration  business admin...
3384    Communication  Business Writing  email  Busine...
2894    google apps script  project  Planning  email  ...
614     language  ordered pair  Proofreading  essay wr...
2732    grammar  email writing  Note Taking  Writing  ...
104     Business Writing  email writing  Writing  engl...
Name: Skills, dtype: object

In [17]:
get_recommendations(df.iloc[1]["Skills"], "Skills")

3311    Mapping  Project Management  presentation  Pro...
3232    project  modeling  persona (user experience)  ...
1636    Product Development  Planning  Strategy  busin...
954     cost benefit analysis  Mapping  project  Produ...
10      project  modeling  Project Management  agile m...
2147    agile management  modeling  software  Product ...
1915    Integral  project  Product Development  market...
3400    modeling  Planning  analysis  market (economic...
422     Project Management  Mapping  Leadership and Ma...
969     Mapping  business case  project  personal adve...
Name: Skills, dtype: object

### 3. Load in the Ratings.csv file (on moodle, it is called Books_Ratings.csv). Group by User-ID and sort by Book-Rating in descending order to get the users who rated most books. Filter the rating data to only contain the 200 users that rated most books.

In [18]:
path = kagglehub.dataset_download("arashnic/book-recommendation-dataset")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\Bobby\.cache\kagglehub\datasets\arashnic\book-recommendation-dataset\versions\3


In [None]:
df_rating = pd.read_csv("C:/Users/Bruger/.cache/kagglehub/datasets/arashnic/book-recommendation-dataset/versions/3/Ratings.csv")
df_rating

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6
...,...,...,...
1149775,276704,1563526298,9
1149776,276706,0679447156,0
1149777,276709,0515107662,10
1149778,276721,0590442449,10


In [21]:
user_unique = df_rating["User-ID"].value_counts().head(200).index

df_rating_groupby = df_rating[df_rating["User-ID"].isin(user_unique)]

df_rating_sortby = df_rating_groupby.sort_values("Book-Rating", ascending=False)

In [22]:
df_rating_sortby

Unnamed: 0,User-ID,ISBN,Book-Rating
250439,56959,0894717561,10
290911,69355,0590150123,10
1040387,248718,1892881268,10
104626,23902,0679774386,10
1040386,248718,1892123029,10
...,...,...,...
1147612,275970,3829021860,0
1147613,275970,4770019572,0
1147614,275970,896086097,0
138523,31315,0821740636,0


In [23]:
len(df_rating_sortby["User-ID"].unique())

200

In [None]:
df_rating_sortby["User-ID"].value_counts().count()

np.int64(200)

### 4. Create a Collaborative filtering recommender system based on the user ratings from 3 together with the Books.csv dataset.

In [None]:
df_book = pd.read_csv("C:/Users/Bruger/.cache/kagglehub/datasets/arashnic/book-recommendation-dataset/versions/3/Books.csv")
df_book

  df_book = pd.read_csv("C:/Users/Bruger/.cache/kagglehub/datasets/arashnic/book-recommendation-dataset/versions/3/Books.csv")


Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...
...,...,...,...,...,...,...,...,...
271355,0440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm),http://images.amazon.com/images/P/0440400988.0...,http://images.amazon.com/images/P/0440400988.0...,http://images.amazon.com/images/P/0440400988.0...
271356,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...
271357,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...
271358,0192126040,Republic (World's Classics),Plato,1996,Oxford University Press,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...


In [None]:
matching_books = df_book[df_book["ISBN"].isin(df_rating_sortby["ISBN"])]
matching_books

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
5,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...
6,0425176428,What If?: The World's Foremost Military Histor...,Robert Cowley,2000,Berkley Publishing Group,http://images.amazon.com/images/P/0425176428.0...,http://images.amazon.com/images/P/0425176428.0...,http://images.amazon.com/images/P/0425176428.0...
7,0671870432,PLEADING GUILTY,Scott Turow,1993,Audioworks,http://images.amazon.com/images/P/0671870432.0...,http://images.amazon.com/images/P/0671870432.0...,http://images.amazon.com/images/P/0671870432.0...
...,...,...,...,...,...,...,...,...
271348,0231128444,Slow Food(The Case For Taste),Carlo Petrini,2003,Columbia University Press,http://images.amazon.com/images/P/0231128444.0...,http://images.amazon.com/images/P/0231128444.0...,http://images.amazon.com/images/P/0231128444.0...
271349,0520242335,Strong Democracy : Participatory Politics for ...,Benjamin R. Barber,2004,University of California Press,http://images.amazon.com/images/P/0520242335.0...,http://images.amazon.com/images/P/0520242335.0...,http://images.amazon.com/images/P/0520242335.0...
271350,0762412119,"Burpee Gardening Cyclopedia: A Concise, Up to ...",Allan Armitage,2002,Running Press Book Publishers,http://images.amazon.com/images/P/0762412119.0...,http://images.amazon.com/images/P/0762412119.0...,http://images.amazon.com/images/P/0762412119.0...
271351,1582380805,Tropical Rainforests: 230 Species in Full Colo...,"Allen M., Ph.D. Young",2001,Golden Guides from St. Martin's Press,http://images.amazon.com/images/P/1582380805.0...,http://images.amazon.com/images/P/1582380805.0...,http://images.amazon.com/images/P/1582380805.0...


In [None]:
matching_books["ISBN"].value_counts()

ISBN
0226751260    1
0140375376    1
0451208692    1
1865083429    1
1864486341    1
             ..
0671870432    1
0425176428    1
0399135782    1
0374157065    1
0002005018    1
Name: count, Length: 127296, dtype: int64

In [None]:
df_rating_sortby

Unnamed: 0,User-ID,ISBN,Book-Rating
250439,56959,0894717561,10
290911,69355,0590150123,10
1040387,248718,1892881268,10
104626,23902,0679774386,10
1040386,248718,1892123029,10
...,...,...,...
1147612,275970,3829021860,0
1147613,275970,4770019572,0
1147614,275970,896086097,0
138523,31315,0821740636,0


In [None]:
book_ranking = df_rating_sortby.groupby("ISBN").agg({"Book-Rating": "mean"}).sort_values("Book-Rating", ascending=False)

In [None]:
book_ranking

Unnamed: 0_level_0,Book-Rating
ISBN,Unnamed: 1_level_1
0373258704,10.0
0001953877,10.0
0373261268,10.0
0373263023,10.0
0373273096,10.0
...,...
N0553212583>>,0.0
0001848445,0.0
000184251X,0.0
0001841572,0.0


In [None]:
user_book_df = df_rating_sortby.pivot_table(index=["User-ID"], columns=["ISBN"], values="Book-Rating")
user_book_df

ISBN,9022906116,*0515128325,0 7336 1053 6,0000000000,00000000000,0000000051,0000001481,0000913154,0001046438,000104687X,...,TBR0385495641,THEALLTRUETRA,THECATASTROPH,THEFLYINGACE,X000000000,"YOUTELLEM,AND",ZR903CX0003,"\0432534220\""""","\2842053052\""""",b00005wz75
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3363,,,,,,,,,,,...,,,,,,,,,,
6251,,,,,,,,,,,...,,,,,,,,,,
6575,,,,,,,,,,,...,,,,,,,,,,
7346,,,,,,,,,,,...,,,,,,,,,,
11601,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271284,,,,,,,,,,,...,,,,,,,,,,
274061,,,,,,,,,,,...,,,,,,,,,,
274308,,,,,,,,,,,...,,,,,,,,,,
275970,,,,,,,,,,,...,,,,,,,,,,


In [None]:
random_user = np.array(user_book_df.sample(random_state = 50).index)[0]
random_user

np.int64(170518)

In [None]:
random_user_df = user_book_df[user_book_df.index == random_user]
random_user_df

ISBN,9022906116,*0515128325,0 7336 1053 6,0000000000,00000000000,0000000051,0000001481,0000913154,0001046438,000104687X,...,TBR0385495641,THEALLTRUETRA,THECATASTROPH,THEFLYINGACE,X000000000,"YOUTELLEM,AND",ZR903CX0003,"\0432534220\""""","\2842053052\""""",b00005wz75
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
170518,,,,,,,,,,,...,,,,,,,,,,


In [None]:
random_user_books_read = random_user_df.columns[random_user_df.notna().any()].tolist()
random_user_books_read

['000617454X',
 '0006496296',
 '0020413904',
 '0020967705',
 '0023340118',
 '0023888873',
 '002542730X',
 '003057627X',
 '0030981913',
 '0044457820',
 '0060173289',
 '0060179341',
 '006017935X',
 '006018552X',
 '0060803967',
 '0060809833',
 '0060809868',
 '006090917X',
 '0060922532',
 '0060930004',
 '0060934638',
 '0060977744',
 '0060984031',
 '0061000035',
 '0061000175',
 '0061001791',
 '0061004693',
 '006101351X',
 '0061020060',
 '0061030996',
 '0061042366',
 '0061042943',
 '0061054933',
 '0061057096',
 '0061060976',
 '0061081841',
 '006108199X',
 '0061082163',
 '0061083852',
 '0061084883',
 '0061090891',
 '0061091618',
 '0061091790',
 '0061092045',
 '0061092088',
 '0061092908',
 '0061093343',
 '0061095540',
 '0061097101',
 '006109921X',
 '0061099279',
 '0061099686',
 '006131045X',
 '0070150281',
 '0070510210',
 '0099218313',
 '0133030091',
 '013365768X',
 '0135971209',
 '0140077022',
 '0140130209',
 '0140157352',
 '0140236589',
 '0140243437',
 '0140263098',
 '014028009X',
 '01404412

In [None]:
len(random_user_books_read)

850

In [None]:
books_read_df = user_book_df[random_user_books_read]

In [None]:
books_read_df.head(9)

ISBN,000617454X,0006496296,0020413904,0020967705,0023340118,0023888873,002542730X,003057627X,0030981913,0044457820,...,1566471192,1573225541,1582341028,1582431345,1853265608,1878379089,1883642523,376045817,425034925,903180113
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3363,,,,,,,0.0,,,,...,,,,,,,,,,
6251,,,,,,,,,,,...,,,,,,,,,,
6575,,,,,,,,,,,...,,,,,,,,,,
7346,,,,,,,,,,,...,,,,,,,,,,
11601,,,,,,,,,,,...,,,,,,,,,,
11676,,,,,,,6.0,,,,...,,,,,0.0,,,,,
12538,,,,,,,10.0,,,,...,,,,,,,,,,
13552,,,,,,,0.0,,,,...,,,,,,,,,,
15408,,,,,,,,,,,...,,,,,,,,,,


In [None]:
user_books_count = books_read_df.T.notnull().sum()

In [None]:
user_books_count

User-ID
3363      16
6251      39
6575      33
7346      32
11601     51
          ..
271284    51
274061    21
274308    56
275970    11
278418    60
Length: 200, dtype: int64

In [None]:
user_books_count = user_books_count.reset_index()
user_books_count.columns = ["User-ID", "book_count"]
user_books_count

Unnamed: 0,User-ID,book_count
0,3363,16
1,6251,39
2,6575,33
3,7346,32
4,11601,51
...,...,...
195,271284,51
196,274061,21
197,274308,56
198,275970,11


In [None]:
user_books_count.sort_values("book_count", ascending=False) # number one will be the same user that you are, so take the second highest user

Unnamed: 0,User-ID,book_count
112,170518,850
5,11676,251
48,76352,182
25,35859,140
67,102967,121
...,...,...
120,175886,2
37,56271,1
43,63714,0
133,193560,0


In [None]:
user_books_count

Unnamed: 0,User-ID,book_count
0,3363,16
1,6251,39
2,6575,33
3,7346,32
4,11601,51
...,...,...
195,271284,51
196,274061,21
197,274308,56
198,275970,11


In [None]:
user_same_books = user_books_count[user_books_count["book_count"] > (len(random_user_books_read)*10)/100]["User-ID"]
user_same_books

5       11676
10      16795
25      35859
34      52584
36      55492
48      76352
52      78783
67     102967
102    153662
112    170518
137    198711
160    230522
162    232131
194    269566
Name: User-ID, dtype: int64

In [None]:
final_df = books_read_df[books_read_df.index.isin(user_same_books)]
final_df

ISBN,000617454X,0006496296,0020413904,0020967705,0023340118,0023888873,002542730X,003057627X,0030981913,0044457820,...,1566471192,1573225541,1582341028,1582431345,1853265608,1878379089,1883642523,376045817,425034925,903180113
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
11676,,,,,,,6.0,,,,...,,,,,0.0,,,,,
16795,,,,,,,0.0,,,,...,,,,,,,,,,
35859,,,,,,,,,,,...,,,,,,,,,,
52584,,,,,,,10.0,,,,...,,,,,,,,,,
55492,,,,,,,,,,,...,,,,,,,,,,
76352,,,,,,,,,,,...,,,,,,,,,,
78783,,,,,,,,,,,...,,,,,,,,,,
102967,,,,,,,,,,,...,,,,,,,,,,
153662,,,,,,,,,,,...,,,,,,,,,,
170518,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
corr_df = final_df.T.corr()
corr_df

User-ID,11676,16795,35859,52584,55492,76352,78783,102967,153662,170518,198711,230522,232131,269566
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
11676,1.0,0.107932,0.053418,0.17856,0.032286,-0.037574,-0.125419,-0.100328,-0.027182,0.006024,,-0.162184,0.01237,0.075157
16795,0.107932,1.0,-0.032722,0.215649,,,0.045363,0.006504,0.022522,-0.091768,,-0.042887,0.195354,-0.166579
35859,0.053418,-0.032722,1.0,-0.093012,,-0.054355,-0.14786,0.11319,-0.253007,-0.035963,,-0.125674,0.03425,
52584,0.17856,0.215649,-0.093012,1.0,0.773676,-0.027792,-0.059501,-0.05122,-0.146214,0.4068,,0.370084,0.15601,0.354221
55492,0.032286,,,0.773676,1.0,-0.016667,,-0.032928,-0.091396,1.0,,0.647458,,-0.026316
76352,-0.037574,,-0.054355,-0.027792,-0.016667,1.0,0.369174,-0.018909,,-0.013096,,-0.036971,-0.064134,-0.018182
78783,-0.125419,0.045363,-0.14786,-0.059501,,0.369174,1.0,0.397378,-0.107801,-0.038206,,,-0.163692,-0.04344
102967,-0.100328,0.006504,0.11319,-0.05122,-0.032928,-0.018909,0.397378,1.0,-0.096402,-0.034921,,0.470126,-0.123976,-0.055493
153662,-0.027182,0.022522,-0.253007,-0.146214,-0.091396,,-0.107801,-0.096402,1.0,-0.094668,,-0.315123,0.228494,
170518,0.006024,-0.091768,-0.035963,0.4068,1.0,-0.013096,-0.038206,-0.034921,-0.094668,1.0,,0.525824,-0.049646,-0.035227


In [None]:
user_corr = corr_df[random_user].reset_index()
user_corr = user_corr.rename(columns={random_user: 'correlation'})
user_corr = user_corr.sort_values(by="correlation", ascending=False)
user_corr = user_corr.loc[user_corr["User-ID"] != random_user]
user_corr = user_corr.reset_index(drop=True)
user_corr

Unnamed: 0,User-ID,correlation
0,55492,1.0
1,230522,0.525824
2,52584,0.4068
3,11676,0.006024
4,76352,-0.013096
5,102967,-0.034921
6,269566,-0.035227
7,35859,-0.035963
8,78783,-0.038206
9,232131,-0.049646


In [None]:
top_users_ratings = user_corr.merge(df_rating_sortby[["User-ID", "ISBN", "Book-Rating"]], how="inner")
top_users_ratings

Unnamed: 0,User-ID,correlation,ISBN,Book-Rating
0,55492,1.0,0020442203,10
1,55492,1.0,0061015725,10
2,55492,1.0,0064407667,10
3,55492,1.0,0140501711,10
4,55492,1.0,0307157857,10
...,...,...,...,...
55698,198711,,8511839102,0
55699,198711,,9307166813,0
55700,198711,,9590624067,0
55701,198711,,9631172937,0


In [None]:
top_users_ratings["weighted_rating"] = top_users_ratings["correlation"] * top_users_ratings["Book-Rating"]
top_users_ratings

Unnamed: 0,User-ID,correlation,ISBN,Book-Rating,weighted_rating
0,55492,1.0,0020442203,10,10.0
1,55492,1.0,0061015725,10,10.0
2,55492,1.0,0064407667,10,10.0
3,55492,1.0,0140501711,10,10.0
4,55492,1.0,0307157857,10,10.0
...,...,...,...,...,...
55698,198711,,8511839102,0,
55699,198711,,9307166813,0,
55700,198711,,9590624067,0,
55701,198711,,9631172937,0,


In [None]:
recommendation_df = top_users_ratings.groupby("ISBN").agg({"weighted_rating": "mean"}).sort_values(by = "weighted_rating", ascending = False)
recommendation_df = recommendation_df.reset_index()
recommendation_df

Unnamed: 0,ISBN,weighted_rating
0,0310904102,10.0
1,0385311400,10.0
2,1557482969,10.0
3,0698301978,10.0
4,0679905278,10.0
...,...,...
41782,8467003995,
41783,8511839102,
41784,9590624067,
41785,9631172937,


In [None]:
recommendation_df["ISBN"][0]

'0310904102'

In [None]:
books_to_be_recommended = recommendation_df.merge(matching_books[["ISBN", "Book-Title", "Book-Author", "Year-Of-Publication", "Publisher"]], left_on="ISBN", right_on="ISBN")
books_to_be_recommended = books_to_be_recommended.head()
books_to_be_recommended

Unnamed: 0,ISBN,weighted_rating,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,310904102,10.0,Comparative Study Bible,Not Applicable (Na ),1985,Zondervan
1,385311400,10.0,Drums of Autumn,DIANA GABALDON,1996,Delacorte Press
2,1557482969,10.0,Pollyanna,Eleanor H. Porters,1993,Barbour Pub Inc
3,698301978,10.0,"\I Can't\"" Said the Ant""",Polly Cameron,1961,Putnam Pub Group (L)
4,679905278,10.0,"Oh, the Places You'll Go!",Seuss,1990,Random House Books for Young Readers


In [None]:
def user_based_recommender(input_user, rate_ratio=0.70, num_recommendations=5):
    user_book_df = df_rating_sortby.pivot_table(index=["User-ID"], columns=["ISBN"], values="Book-Rating")
    user_df = user_book_df[user_book_df.index == input_user]
    input_user_books_read = user_df.columns[user_book_df.notna().any()].tolist()

    # User rating of the movies the input user have rated
    books_read_df = user_book_df[input_user_books_read]

    # Counting how many movies other users have rated that the input user have also rated
    user_books_count = books_read_df.T.notnull().sum()
    
    user_books_count = user_books_count.reset_index()
    user_books_count.columns = ["User-ID", "book_count"]
    
    # Selecting similar users over based on a rating similarity count ratio threshold
    user_same_books = user_books_count[user_books_count["book_count"] > (len(input_user_books_read)*rate_ratio/100)]["User-ID"]
    
    # Creating a correlation matrix based on ratings
    final_df = books_read_df[books_read_df.index.isin(user_same_books)]
    corr_df = final_df.T.corr()

    # Created top correlated users
    user_corr = corr_df[input_user].reset_index()
    user_corr = user_corr.rename(columns={input_user: 'correlation'})
    user_corr = user_corr.sort_values(by="correlation", ascending=False)
    user_corr = user_corr.loc[user_corr["User-ID"] != input_user]
    user_corr = user_corr.reset_index(drop=True)

    #print("same books: ", user_books_count)
    #print("Core: ", corr_df)

    # Creating correlated weighting of rating
    top_users_ratings = user_corr.merge(df_rating_sortby[["User-ID", "ISBN", "Book-Rating"]], how="inner")
    top_users_ratings["weighted_rating"] = top_users_ratings["correlation"] * top_users_ratings["Book-Rating"]

    # Creating a recommendation dataframe
    recommendation_df = top_users_ratings.groupby("ISBN").agg({"weighted_rating": "mean"}).sort_values(by = "weighted_rating", ascending = False)
    recommendation_df = recommendation_df.reset_index()

    # Creating the final recommendations
    books_to_be_recommended = recommendation_df.merge(matching_books[["ISBN", "Book-Title", "Book-Author", "Year-Of-Publication", "Publisher"]], left_on="ISBN", right_on="ISBN")
    books_to_be_recommended = books_to_be_recommended.head(num_recommendations)

    return books_to_be_recommended

In [None]:
user_based_recommender(225087, 0.7)

Unnamed: 0,ISBN,weighted_rating,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,1564581772,4.890254,DK Handbooks: Horses,Elwyn Hartley Edwards,2000,Dorling Kindersley Publishing
1,157145165X,4.890254,Encyclopedia of the Horse,Elizabeth Peplow,2001,Thunder Bay Press
2,0140360352,4.890254,Gentle Ben,Walt Morey,1992,Puffin Books
3,0451524667,4.890254,Animal Farm,George Orwell,1990,New Amer Library Classics
4,0816750343,4.890254,How to Draw Endangered Animals (How to Draw),Molly Walsh,1999,Troll Communications


In [None]:
user_based_recommender(225087, 0.1)

Unnamed: 0,ISBN,weighted_rating,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,1564581772,4.890254,DK Handbooks: Horses,Elwyn Hartley Edwards,2000,Dorling Kindersley Publishing
1,816750343,4.890254,How to Draw Endangered Animals (How to Draw),Molly Walsh,1999,Troll Communications
2,140360352,4.890254,Gentle Ben,Walt Morey,1992,Puffin Books
3,793800722,4.890254,Just Say Good Dog!,Linda Goodman,1993,TFH Publications
4,801487722,4.890254,Fish Behavior in the Aquarium and in the Wild,Stephan Reebs,2001,Cornell University Press


In [None]:
user_based_recommender(110973, 0.9)

Unnamed: 0,ISBN,weighted_rating,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,067168289X,5.248027,IF THERE BE THORNS (Dollanger Saga (Paperback)),V.C. Andrews,1989,Pocket
1,0345296575,5.248027,Lord Foul Bane,Stephen R Donaldson,1981,Ballantine Books
2,0316365629,5.248027,Toot &amp; Puddle: You Are My Sunshine,Holly Hobbie,1999,"Little, Brown"
3,0836249011,5.248027,Cc Little Red Riding Hood (Children's Classics...,Jennifer Greenway,1992,Andrews McMeel Publishing
4,0785305092,5.248027,Campbell's No Time to Cook,Not Applicable (Na ),1993,Publications Intl
