Item-based collaborative filtering on ratings data, filtered to contain only books with at least 25 ratings. Since we will compare book rating vectors, each vector should contain enough values for the comparison to be meaningful.

# Setup

In [None]:
!wget https://raw.githubusercontent.com/katarinagresova/MLprojects/main/BookRecommendations/data/preprocessed_books.csv
!wget https://raw.githubusercontent.com/katarinagresova/MLprojects/main/BookRecommendations/data/preprocessed_users.csv
!wget https://raw.githubusercontent.com/katarinagresova/MLprojects/main/BookRecommendations/data/preprocessed_ratings.csv

--2021-11-08 17:55:36--  https://raw.githubusercontent.com/katarinagresova/MLprojects/main/BookRecommendations/data/preprocessed_books.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23601906 (23M) [text/plain]
Saving to: ‘preprocessed_books.csv’


2021-11-08 17:55:36 (139 MB/s) - ‘preprocessed_books.csv’ saved [23601906/23601906]

--2021-11-08 17:55:36--  https://raw.githubusercontent.com/katarinagresova/MLprojects/main/BookRecommendations/data/preprocessed_users.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1840909 (1.

In [None]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

In [None]:
books = pd.read_csv('preprocessed_books.csv')
users = pd.read_csv('preprocessed_users.csv')
ratings = pd.read_csv('preprocessed_ratings.csv')

In [None]:
ratings

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0.000000
1,276726,0155061224,0.000000
2,276727,0446520802,0.000000
3,276729,052165615X,0.000000
4,276729,0521795028,0.000000
...,...,...,...
1031005,276704,0876044011,0.578179
1031006,276704,1563526298,2.578179
1031007,276706,0679447156,0.000000
1031008,276709,0515107662,0.000000


# Data preprocessing

- create new column combining book title and book author - this will be used to identify books
- filter ratings to use only books with at least 25 ratings
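
The filtering step can be sketched on toy data as follows. The helper name `filter_by_min_ratings` and the toy frame are assumptions made for illustration; the notebook itself does the same thing with `groupby(...).count()` and `isin`. This sketch counts rows per key, which matches counting ratings only when no rating is missing.

```python
import pandas as pd

def filter_by_min_ratings(ratings, min_ratings, key='ISBN'):
    """Keep only rows whose `key` value occurs at least `min_ratings` times.

    Hypothetical helper, not part of the original notebook.
    """
    # Map every row to the number of rows sharing its key value.
    counts = ratings[key].map(ratings[key].value_counts())
    return ratings[counts >= min_ratings].reset_index(drop=True)

# Toy data: book A has 3 ratings, B has 2, C has 1.
toy = pd.DataFrame({
    'User-ID': [1, 2, 3, 1, 2, 3],
    'ISBN': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Book-Rating': [1.0, -0.5, 0.0, 2.0, 0.5, -1.0],
})

filtered = filter_by_min_ratings(toy, min_ratings=2)
print(filtered)
```

With `min_ratings=2`, only the rows for books `A` and `B` survive; in the notebook the threshold is 25.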

In [None]:
books['Title'] = books.apply(lambda x: x['Book-Title'].lower() + ' | ' + x['Book-Author'].lower() , axis=1)

In [None]:
books

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Title
0,0195153448,classical mythology,mark p. o. morford,2002,Oxford University Press,classical mythology | mark p. o. morford
1,0002005018,clara callan,richard bruce wright,2001,HarperFlamingo Canada,clara callan | richard bruce wright
2,0060973129,decision in normandy,carlo d'este,1991,HarperPerennial,decision in normandy | carlo d'este
3,0374157065,flu: the story of the great influenza pandemic...,gina bari kolata,1999,Farrar Straus Giroux,flu: the story of the great influenza pandemic...
4,0393045218,the mummies of urumchi,e. j. w. barber,1999,W. W. Norton &amp; Company,the mummies of urumchi | e. j. w. barber
...,...,...,...,...,...,...
270942,0440400988,there's a bat in bunk five,paula danziger,1988,Random House Childrens Pub (Mm),there's a bat in bunk five | paula danziger
270943,0525447644,from one to one hundred,teri sloat,1991,Dutton Books,from one to one hundred | teri sloat
270944,006008667X,lily dale : the true story of the town that ta...,christine wicker,2004,HarperSanFrancisco,lily dale : the true story of the town that ta...
270945,0192126040,republic (world's classics),plato,1996,Oxford University Press,republic (world's classics) | plato


In [None]:
books['Title'].value_counts()

little women | louisa may alcott                    23
wuthering heights | emily bronte                    22
adventures of huckleberry finn | mark twain         20
pride and prejudice | jane austen                   19
the secret garden | frances hodgson burnett         17
                                                    ..
drawing: a contemporary approach | claudia betti     1
my foolish heart | james pendergrast                 1
avenging angel (point crime s.) | david belbin       1
die kunst der unordnung. | luciano decrescenzo       1
the bond of power | joseph chilton pearce            1
Name: Title, Length: 248207, dtype: int64

Note that many books appear under multiple ISBNs with the same title and the same author. We will treat those editions as one book, so the user-item table uses the `Title` column instead of the `ISBN` column.
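
A small sketch of why the combined key matters, using made-up ISBNs and ratings (all values below are illustrative assumptions, not taken from the dataset): two editions of the same book end up under a single `Title`, so their ratings are pooled.

```python
import pandas as pd

# Two editions of one book (different ISBNs) plus one unrelated book.
books_toy = pd.DataFrame({
    'ISBN': ['111', '222', '333'],
    'Book-Title': ['Little Women', 'Little Women', 'Emma'],
    'Book-Author': ['Louisa May Alcott', 'Louisa May Alcott', 'Jane Austen'],
})
# Same key construction as in the notebook, written with vectorized string ops.
books_toy['Title'] = (books_toy['Book-Title'].str.lower() + ' | '
                      + books_toy['Book-Author'].str.lower())

ratings_toy = pd.DataFrame({
    'User-ID': [1, 2, 3],
    'ISBN': ['111', '222', '333'],
    'Book-Rating': [1.0, 2.0, -1.0],
})

# Attach the title key to each rating and count ratings per title.
merged = ratings_toy.merge(books_toy[['ISBN', 'Title']], on='ISBN')
per_title = merged.groupby('Title')['Book-Rating'].count()
print(per_title)
```

Keyed by ISBN, each edition would have a single rating; keyed by `Title`, the two editions contribute two ratings to the same book.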

In [None]:
ratings['ISBN'].unique().shape

(269745,)

In [None]:
num_ratings_by_book = ratings.groupby('ISBN')["Book-Rating"].count()
ratings[ratings['ISBN'].isin(num_ratings_by_book[num_ratings_by_book < 25].index)].value_counts()

User-ID  ISBN        Book-Rating
278854   0553578596  -5.046604      1
95359    0399142819   4.390684      1
         0396087213  -0.609316      1
         039562892X  -2.484316      1
         0394832922   3.390684      1
                                   ..
189334   0452268737   5.639425      1
         0452268621   5.639425      1
         0452268370  -3.360575      1
         0452267765  -4.360575      1
2        0195153448   0.000000      1
Length: 684540, dtype: int64

In [None]:
# keep only books with at least 25 ratings
ratings = ratings[ratings['ISBN'].isin(num_ratings_by_book[num_ratings_by_book >= 25].index)]
ratings = ratings.reset_index(drop=True)

In [None]:
ratings

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0.000000
1,276727,0446520802,0.000000
2,276744,038550120X,0.000000
3,276746,0425115801,-0.180370
4,276746,0449006522,-1.155024
...,...,...,...
346465,276704,0446605409,-2.582535
346466,276704,0743211383,3.867899
346467,276704,080410526X,-2.301666
346468,276706,0679447156,0.000000


In [None]:
ratings['ISBN'].unique().shape

(5463,)

# User-item matrix

In [None]:
# attach the combined title key to every rating (inner join on ISBN)
ratings = pd.merge(ratings, books[['ISBN', 'Title']], on='ISBN')

In [None]:
ratings['Title'].unique()

array(['flesh tones: a novel | m. j. rose',
       'the notebook | nicholas sparks', 'a painted house | john grisham',
       ..., 'slightly scandalous (get connected romances) | mary balogh',
       'years | lavyrle spencer', 'something wonderful | judith mcnaught'],
      dtype=object)

In [None]:
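# user x title matrix; if a user rated several editions of one title, the ratings are averaged (pivot_table's default aggfunc is the mean)
sparse = ratings.pivot_table(columns='Title', values='Book-Rating', index='User-ID')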

In [None]:
# pairwise Pearson correlation between book columns, computed over the users who rated both books
corr = sparse.corr()

In [None]:
corr

Title,""" lamb to the slaughter and other stories (penguin 60s s.) | roald dahl","""o"" is for outlaw | sue grafton","""surely you're joking, mr. feynman!"": adventures of a curious character | richard p. feynman",'salem's lot | stephen king,10 lb. penalty | dick francis,"14,000 things to be happy about | barbara ann kipfer",16 lighthouse road | debbie macomber,1984 | george orwell,1st to die: a novel | james patterson,2010: odyssey two | arthur c. clarke,204 rosewood lane | debbie macomber,2061: odyssey three | arthur c. clarke,24 hours | greg iles,253 | geoff ryman,2nd chance | james patterson,3001: the final odyssey | arthur c. clarke,311 pelican court | debbie macomber,3rd degree | james patterson,4 blondes | candace bushnell,50 simple things you can do to save the earth | earthworks group,52 deck series: 52 ways to celebrate friendship | lynn gordon,7b | stella cameron,84 charing cross road | helene hanff,9-11 | noam chomsky,a 2nd helping of chicken soup for the soul (chicken soup for the soul series (paper)) | jack canfield,a 3rd serving of chicken soup for the soul (chicken soup for the soul series (paper)) | jack canfield,a 4th course of chicken soup for the soul: 101 more stories to open the heart and rekindle the spirit | jack canfield,a 5th portion of chicken soup for the soul : 101 stories to open the heart and rekindle the spirit | jack canfield,a beautiful mind: the life of mathematical genius and nobel laureate john nash | sylvia nasar,a bend in the road | nicholas sparks,a book without covers | john andrew storey,a breach of promise (william monk novels (paperback)) | anne perry,a brief history of time : the updated and expanded tenth anniversary edition | stephen hawking,a calculated risk | katherine neville,a canticle for leibowitz (bantam spectra book) | walter m. miller jr.,a caress of twilight (meredith gentry novels (hardcover)) | laurell k. hamilton,a caress of twilight (meredith gentry novels (paperback)) | laurell k. 
hamilton,a case of need | michael crichton,a certain justice (adam dalgliesh mysteries (hardcover)) | p. d. james,a certain justice (adam dalgliesh mysteries (paperback)) | p. d. james,...,worst fears realized | stuart woods,worst fears | fay weldon,worth any price | lisa kleypas,wouldn't take nothing for my journey now | maya angelou,writ of execution | perri o'shaughnessy,writing down the bones | natalie goldberg,written on the body | jeanette winterson,wuthering heights (penguin classics) | emily bronte,wuthering heights (penguin popular classics) | emily bronte,wuthering heights (signet classic) | emily bronte,wuthering heights (wordsworth classics) | emily bronte,wuthering heights | emily bronte,xanth 13: isle of view | piers anthony,xanth 14: question quest | piers anthony,xanth 15: the color of her panties | piers anthony,xenocide (ender wiggins saga (paperback)) | orson scott card,year in provence | peter mayle,year of wonders | geraldine brooks,year of wonders: a novel of the plague | geraldine brooks,years | lavyrle spencer,yesterday | fern michaels,"yesterday, i cried : celebrating the lessons of living and loving | iyanla vanzant",you belong to me and other true cases (ann rule's crime files: vol. 2) | ann rule,you belong to me | johanna lindsey,you belong to me | mary higgins clark,you can heal your life/101 | louise l. hay,"you can't scare me! (goosebumps, no 15) | r. l. stine",you just don't understand | deborah tannen,you shall know our velocity | dave eggers,you're only old once! : a book for obsolete children | dr seuss,young wives | olivia goldsmith,your oasis on flame lake (ballantine reader's circle) | lorna landvik,yukon ho! | bill watterson,z for zachariah | robert c. o'brien,zen and the art of motorcycle maintenance: an inquiry into values | robert pirsig,zlata's diary: a child's life in sarajevo | zlata filipovic,zodiac: the eco-thriller | neal stephenson,zombies of the gene pool | sharyn mccrumb,zoya | danielle steel,zwã?â¶lf. 
| nick mcdonell
""" lamb to the slaughter and other stories (penguin 60s s.) | roald dahl",1.000000,-0.343682,,,1.000000,,,-1.000000,0.595034,,,,,-1.0,0.857165,,,,-0.505843,,,,-1.000000,,,,,,1.000000,1.000000,,,,,,,,-0.965592,,-1.000000,...,,,,1.000000,,,,-0.114533,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"""o"" is for outlaw | sue grafton",-0.343682,1.000000,-1.000000,,0.193714,-0.484389,0.274386,0.271387,0.279546,-0.586031,-0.347390,1.000000,-0.237598,1.0,0.547894,-0.408847,-0.573078,0.769125,-0.507919,-0.190983,-0.844879,0.913624,0.366996,1.0,0.512593,0.603631,0.6195,0.543598,-0.449270,-0.221837,0.996695,-0.842789,1.000000,0.567077,,,-0.655651,-0.311328,-1.0,-0.308230,...,-0.377237,-1.0,-0.758676,-0.771883,0.570404,-0.888338,1.0,-1.000000,-1.0,-1.0,,-0.007332,-0.540556,0.242319,0.984964,-0.328139,-1.0,-0.421593,1.000000,0.028104,0.564928,,0.214734,0.073608,0.194416,-1.0,0.086888,,,1.0,0.715810,0.501862,0.333431,-0.682920,-0.224741,,-0.325098,-0.987181,0.056862,
"""surely you're joking, mr. feynman!"": adventures of a curious character | richard p. feynman",,-1.000000,1.000000,,,1.000000,,-0.802411,0.480622,0.975115,,,0.975850,-1.0,0.999986,,,1.000000,0.649119,-1.000000,,,0.140380,-1.0,-1.000000,,,,-0.342178,0.999466,0.436576,,0.999737,,,,,0.986944,,0.999959,...,,,,,-1.000000,-1.000000,,,,,,,,,,1.000000,,0.876379,,,,,,,1.000000,,1.000000,,,,1.000000,,-1.000000,-1.000000,-0.455882,,-0.542674,-1.000000,-1.000000,
'salem's lot | stephen king,,,,1.000000,-0.995161,,,1.000000,-0.850344,0.966699,,1.000000,1.000000,,0.449961,1.000000,,,1.000000,,,,,,,,,,,,,,,,,,,0.488899,,,...,,,,,-1.000000,,,,,,,,,,,,,,,,,,,,0.463043,,-1.000000,,,,,,,,1.000000,,,,-0.888249,
10 lb. penalty | dick francis,1.000000,0.193714,,-0.995161,1.000000,,1.000000,,0.402725,1.000000,,0.435808,-1.000000,,-0.122793,,1.000000,,1.000000,-1.000000,,,,,0.267856,1.000000,1.0000,1.000000,,0.884616,,1.000000,,,,,,1.000000,,1.000000,...,1.000000,,,1.000000,,,,,,,,1.000000,,,,,,-1.000000,,,,,,,-0.297793,,1.000000,,,,0.840584,,-1.000000,,-0.981749,,,1.000000,1.000000,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zlata's diary: a child's life in sarajevo | zlata filipovic,,,,,,,,,0.696101,-1.000000,1.000000,,,,0.559262,,,,,,,,,,0.815284,,,-1.000000,,-1.000000,,,,,,,,0.983162,,,...,,,,,,,,,,,-1.0,-1.000000,,,,,,,1.000000,,,,,,1.000000,,,,,1.0,,,,,-0.861340,1.0,,,,
zodiac: the eco-thriller | neal stephenson,,-0.325098,-0.542674,,,,1.000000,0.247717,0.438353,-1.000000,,,-0.197175,,0.999993,-1.000000,,,,,,,-1.000000,-1.0,,,,,,1.000000,,,1.000000,,,,0.071677,1.000000,,1.000000,...,,,,,1.000000,,-1.0,,,,,-1.000000,-0.843336,-1.000000,,-1.000000,,-0.872935,,,-0.999764,,,,0.783237,,,,,,,,-1.000000,,0.343066,,1.000000,,1.000000,
zombies of the gene pool | sharyn mccrumb,,-0.987181,-1.000000,,1.000000,1.000000,,,-0.942041,-1.000000,,,1.000000,,0.361856,,,,,-1.000000,,,1.000000,,-1.000000,,,,1.000000,,,1.000000,,,,,0.996445,-1.000000,,1.000000,...,,,,,,,,,,,,-0.755149,,,,,,-0.636984,,,,,,,-0.521063,,-1.000000,,,,-1.000000,,,,1.000000,,,1.000000,,
zoya | danielle steel,,0.056862,-1.000000,-0.888249,1.000000,1.000000,-0.740594,-0.653677,0.423802,0.155051,-0.301463,-0.931468,0.081101,,-0.015480,-0.560235,-0.957179,0.961262,-0.215296,-0.773932,,,-0.745881,1.0,-0.202246,-0.435653,-1.0000,1.000000,-0.953867,-0.192744,0.999891,-0.571697,,,,,,-0.338020,,0.158920,...,-0.981265,,1.000000,,0.999675,1.000000,,,,,,-0.416785,,,-1.000000,-0.684829,,0.217903,0.998979,1.000000,-0.965353,,0.844997,,0.975706,,0.998446,,,1.0,0.973916,-1.000000,0.653120,-0.065975,1.000000,,1.000000,,1.000000,
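
Many entries in the matrix above are exactly 1.0 or -1.0. Pearson correlation always yields such values when two books share only two raters, so these entries say little about real similarity. One possible mitigation, not used in this notebook, is the `min_periods` argument of `DataFrame.corr`, which leaves a pair as NaN unless enough users rated both books. The toy data and the threshold of 3 below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy user x book matrix: book_a and book_b share four raters,
# book_a and book_c share only two (users 12 and 13).
toy = pd.DataFrame({
    'book_a': [1.0, 2.0, 3.0, 4.0, np.nan],
    'book_b': [2.0, 1.0, 4.0, 3.0, np.nan],
    'book_c': [np.nan, np.nan, 5.0, 1.0, 2.0],
}, index=pd.Index([10, 11, 12, 13, 14], name='User-ID'))

# Default behaviour: any overlap of two users produces a correlation.
plain = toy.corr()

# Require at least three common raters; weaker pairs become NaN.
strict = toy.corr(min_periods=3)

print(plain.round(3))
print(strict.round(3))
```

In `plain`, the `book_a`/`book_c` entry is -1.0 from two shared raters; in `strict` it is NaN, while the `book_a`/`book_b` entry (four shared raters) is unchanged.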


In [None]:
myTopBook = '"surely you\'re joking, mr. feynman!": adventures of a curious character | richard p. feynman'
similar = corr[myTopBook].dropna()
similar

Title
"o" is for outlaw | sue grafton                                                                -1.000000
"surely you're joking, mr. feynman!": adventures of a curious character | richard p. feynman    1.000000
14,000 things to be happy about | barbara ann kipfer                                            1.000000
1984 | george orwell                                                                           -0.802411
1st to die: a novel | james patterson                                                           0.480622
                                                                                                  ...   
z for zachariah | robert c. o'brien                                                            -1.000000
zen and the art of motorcycle maintenance: an inquiry into values | robert pirsig              -0.455882
zodiac: the eco-thriller | neal stephenson                                                     -0.542674
zombies of the gene pool | sharyn mccrumb        

In [None]:
mean_ratings = ratings.groupby('Title')['Book-Rating'].mean()
mean_ratings

Title
" lamb to the slaughter and other stories (penguin 60s s.) | roald dahl                         0.463628
"o" is for outlaw | sue grafton                                                                -0.197531
"surely you're joking, mr. feynman!": adventures of a curious character | richard p. feynman    0.003429
'salem's lot | stephen king                                                                    -0.288384
10 lb. penalty | dick francis                                                                  -0.182577
                                                                                                  ...   
zlata's diary: a child's life in sarajevo | zlata filipovic                                     0.082010
zodiac: the eco-thriller | neal stephenson                                                     -0.931777
zombies of the gene pool | sharyn mccrumb                                                       0.298484
zoya | danielle steel                            

In [None]:
mean_ratings = ratings.groupby('Title')['Book-Rating'].mean()
# keep only the books that have a defined correlation with the selected book
mean_ratings = mean_ratings[mean_ratings.index.isin(similar.index)]
mean_ratings

Title
"o" is for outlaw | sue grafton                                                                -0.197531
"surely you're joking, mr. feynman!": adventures of a curious character | richard p. feynman    0.003429
14,000 things to be happy about | barbara ann kipfer                                            0.193463
1984 | george orwell                                                                           -0.300207
1st to die: a novel | james patterson                                                          -0.400696
                                                                                                  ...   
z for zachariah | robert c. o'brien                                                             0.015145
zen and the art of motorcycle maintenance: an inquiry into values | robert pirsig              -0.078613
zodiac: the eco-thriller | neal stephenson                                                     -0.931777
zombies of the gene pool | sharyn mccrumb        

In [None]:
rec = similar.multiply(mean_ratings).sort_values(ascending=False)
rec

Title
johnny got his gun | dalton trumbo                                                          1.106901
conversations with god : an uncommon dialogue (book 1) | neale donald walsch                0.840312
chasing cezanne | peter mayle                                                               0.804888
the crow road | iain banks                                                                  0.771862
in this mountain | jan karon                                                                0.771379
                                                                                              ...   
narrative of the life of frederick douglass (dover thrift editions) | frederick douglass   -0.833157
the demon-haunted world: science as a candle in the dark | carl sagan                      -0.851177
anil's ghost (vintage international) | michael ondaatje                                    -0.941267
fierce invalids home from hot climates | tom robbins                                 
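
The scoring used above (correlation with the selected book multiplied by each book's mean rating) can be wrapped in a small function. This is a sketch under stated assumptions: the name `recommend`, the `top_n` parameter and the toy inputs are hypothetical, and dropping the selected book from its own recommendations is an addition that the notebook cells do not perform.

```python
import pandas as pd

def recommend(corr, mean_ratings, title, top_n=5):
    """Rank books by correlation with `title` times their mean rating.

    Hypothetical helper; the query book itself is excluded from the result.
    """
    # Books with a defined correlation to the query book, minus the book itself.
    similar = corr[title].dropna().drop(labels=title, errors='ignore')
    # Align mean ratings to the same books before multiplying.
    scores = similar.multiply(mean_ratings.reindex(similar.index))
    return scores.sort_values(ascending=False).head(top_n)

titles = ['a', 'b', 'c']
corr_toy = pd.DataFrame(
    [[1.0, 0.5, float('nan')],
     [0.5, 1.0, -1.0],
     [float('nan'), -1.0, 1.0]],
    index=titles, columns=titles)
mean_toy = pd.Series({'a': 2.0, 'b': 1.0, 'c': -0.5})

rec_toy = recommend(corr_toy, mean_toy, 'b')
print(rec_toy)
```

In the toy example, book `c` gets a positive score because a negative correlation is multiplied by a negative mean rating. The same effect is possible with the product score in the notebook, so whether such books should rank highly is a design choice worth checking.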

In [None]:
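import os

# the CSV files were downloaded into the working directory, so make sure 'data/' exists before writing
os.makedirs('data', exist_ok=True)
corr.to_csv('data/IBCF_model.gz', compression='gzip')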

In [None]:
mean_ratings = ratings.groupby('Title')['Book-Rating'].mean()
mean_ratings.to_csv('data/IBCF_ratings.csv')
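
A sketch of how the saved correlation matrix could be read back later. It uses a toy matrix and a temporary directory instead of the real `data/IBCF_model.gz`, so the path and contents here are assumptions; the key point is `index_col=0`, which restores the titles as the index.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the real correlation matrix.
corr_toy = pd.DataFrame([[1.0, 0.5], [0.5, 1.0]],
                        index=pd.Index(['a', 'b'], name='Title'),
                        columns=['a', 'b'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'IBCF_model.gz')
    # Write as a gzip-compressed CSV, as the notebook does.
    corr_toy.to_csv(path, compression='gzip')
    # Read it back; the first column holds the row labels.
    loaded = pd.read_csv(path, compression='gzip', index_col=0)

print(loaded)
```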