# Collaborative Book Recommendation Using Unsupervised Nearest Neighbors

NearestNeighbors implements unsupervised nearest neighbors learning. It acts as a uniform interface to three different nearest neighbors algorithms: BallTree, KDTree, and a brute-force algorithm based on routines in sklearn.metrics.pairwise.
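
As a minimal sketch of that uniform interface, the snippet below fits the same estimator with each of the three `algorithm` values and checks that they report the same neighbor distances. The random toy data and the names `X`, `queries`, and `nn_toy` are assumptions made for illustration; they are not part of this notebook's dataset.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data (assumption): 200 random points in 3 dimensions, plus 5 query points.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
queries = rng.random((5, 3))

results = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn_toy = NearestNeighbors(n_neighbors=3, algorithm=algorithm)
    nn_toy.fit(X)
    # Distances to the 3 nearest neighbors of every query point.
    distances, indices = nn_toy.kneighbors(queries)
    results[algorithm] = distances

# The backends differ in how they search, not in what they find.
print(np.allclose(results["brute"], results["kd_tree"]))
print(np.allclose(results["brute"], results["ball_tree"]))
```

Only the `algorithm` argument changes between runs; the `fit` and `kneighbors` calls stay the same.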


The most naive neighbor search implementation involves the brute-force computation of distances between all pairs of points in the dataset: for N samples in D dimensions, this approach scales as O[D N^2].

Efficient brute-force neighbors searches can be very competitive for small data samples. However, as the number of samples N grows, the brute-force approach quickly becomes infeasible.
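
The quadratic cost is easy to see in a hand-written version. The function below is a sketch of a naive all-pairs search in NumPy, not the routine scikit-learn uses internally; `brute_force_neighbors` is a hypothetical name and the four points are made up.

```python
import numpy as np

def brute_force_neighbors(X, k):
    """Return the indices of the k nearest neighbors of every row of X."""
    # Pairwise differences have shape (N, N, D): N^2 pairs, D operations per pair.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Exclude each point from its own neighbor list.
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]

# Four toy points in 2 dimensions (assumption).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
nearest = brute_force_neighbors(X, k=1)
print(nearest)
```

The intermediate `diff` array alone holds N * N * D values, which is why this approach stops being practical as N grows.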


## References

- https://scikit-learn.org/stable/modules/neighbors.html#unsupervised-nearest-neighbors
- https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbor-algorithms
- https://www.kaggle.com/sankha1998/collaborative-book-recommendation-system/data
- https://www.kaggle.com/ruchi798/book-crossing-starter-notebook-and-eda


## TODO

- merge with the EDA notebook
- add visualizations (from various notebooks)
- add detailed data preprocessing
- get a solid understanding of the pandas and NumPy APIs
- discuss implicit ratings

In [1]:
import numpy as np
import pandas as pd 

In [2]:
# define the column names with a list
#Users
u_cols = ['user_id', 'location', 'age']
users = pd.read_csv('../data/book_x/BX-Users.csv', sep=';', names=u_cols, encoding='latin-1',low_memory=False)

#Books
b_cols = ['ISBN', 'title' ,'author','year', 'publisher', 'img_s', 'img_m', 'img_l']
books = pd.read_csv('../data/book_x/BX_Books.csv', sep=';', names=b_cols, encoding='latin-1',low_memory=False)

#Ratings
r_cols = ['user_id', 'ISBN', 'rating']
ratings = pd.read_csv('../data/book_x/BX-Book-Ratings.csv', sep=';', names=r_cols, encoding='latin-1',low_memory=False)

In [3]:
users.head()

Unnamed: 0,user_id,location,age
0,User-ID,Location,Age
1,1,"nyc, new york, usa",
2,2,"stockton, california, usa",18
3,3,"moscow, yukon territory, russia",
4,4,"porto, v.n.gaia, portugal",17
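
The `head()` output above shows the file's own header line (`User-ID`, `Location`, `Age`) as row 0, because `names=` was passed without telling pandas that the file already has a header. One possible fix, sketched on an in-memory stand-in for the CSV (the `csv_text` string is an assumption, not the real file), is to add `header=0` so the original header line is replaced by the given names. Applying this to the notebook would change the shapes and dtypes shown in later cells.

```python
import io
import pandas as pd

# Toy stand-in (assumption) for the first lines of BX-Users.csv.
csv_text = (
    "User-ID;Location;Age\n"
    "1;nyc, new york, usa;\n"
    "2;stockton, california, usa;18\n"
)
u_cols = ['user_id', 'location', 'age']

# names= alone: the header line is read as the first data row.
raw = pd.read_csv(io.StringIO(csv_text), sep=';', names=u_cols)

# names= with header=0: the header line is replaced by the given names.
clean = pd.read_csv(io.StringIO(csv_text), sep=';', names=u_cols, header=0)

print(raw.shape, clean.shape)
print(clean.dtypes)
```

With the header row gone, numeric columns can be parsed as numbers instead of falling back to `object`.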


In [4]:
# drop the columns that will not be useful (the image URLs)
books = books[['ISBN', 'title', 'author', 'year', 'publisher']]  # feature selection
books.head()

Unnamed: 0,ISBN,title,author,year,publisher
0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
1,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
2,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
3,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
4,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux


In [5]:
ratings.head()

Unnamed: 0,user_id,ISBN,rating
0,User-ID,ISBN,Book-Rating
1,276725,034545104X,0
2,276726,0155061224,5
3,276727,0446520802,0
4,276729,052165615X,3


In [6]:
print("books dataframe shape", books.shape)
print("users dataframe shape", users.shape)
print("ratings dataframe shape", ratings.shape)

books dataframe shape (271380, 5)
users dataframe shape (278859, 3)
ratings dataframe shape (1149781, 3)


In [7]:
# number of unique users who submitted ratings
ratings['user_id'].value_counts().shape

(105284,)

In [8]:
# TODO: plot a histogram of the number of ratings made by each user
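
One way this histogram could be computed is sketched below on a toy frame. The data and the bin edges are assumptions chosen for illustration; only the `user_id` column name comes from the notebook.

```python
import numpy as np
import pandas as pd

# Toy ratings (assumption); the real frame has the same 'user_id' column.
toy_ratings = pd.DataFrame(
    {'user_id': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'd']}
)

# Number of ratings made by each user.
ratings_per_user = toy_ratings['user_id'].value_counts()

# Bin the per-user counts: [1, 2), [2, 3) and [3, 5].
counts, edges = np.histogram(ratings_per_user.to_numpy(), bins=[1, 2, 3, 5])
print(counts)
```

If matplotlib is installed, the same Series can be drawn directly with `ratings_per_user.plot.hist()`.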

# Data Preprocessing

### TODO
- replace implicit ratings with the mean value
- do more detailed preprocessing

In [9]:
# Keep only users with more than 30 ratings.
# Brute-force search cost grows as n^2 (polynomially), so keeping many more
# users is not wise unless a different algorithm is used.

selected_ratings = ratings['user_id'].value_counts() > 30
selected_ratings[selected_ratings].shape

(5132,)

In [10]:
selected_rating_index = selected_ratings[selected_ratings].index

In [11]:
selected_rating_index

Index(['11676', '198711', '153662', '98391', '35859', '212898', '278418',
       '76352', '110973', '235105',
       ...
       '173657', '140842', '225414', '229681', '34377', '220620', '170158',
       '165392', '223892', '245782'],
      dtype='object', length=5132)

In [12]:
# filter the ratings down to the selected users
ratings = ratings[ratings['user_id'].isin(selected_rating_index)]
print(ratings.shape)

(834558, 3)
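
The three cells above (count, threshold, filter) can be seen end to end on a toy frame. This is a sketch of the same `value_counts` / `isin` pattern; the data and the threshold `min_ratings = 1` are assumptions, whereas the notebook uses 30.

```python
import pandas as pd

# Toy ratings (assumption).
toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u1', 'u2', 'u3', 'u3'],
    'rating':  [5, 0, 3, 4, 0, 2],
})

min_ratings = 1  # hypothetical threshold; the notebook uses 30

# Ratings per user, then the ids of users above the threshold.
counts = toy['user_id'].value_counts()
active_users = counts[counts > min_ratings].index

# Keep only the rows that belong to those users.
filtered = toy[toy['user_id'].isin(active_users)]
print(filtered)
```

Here `u2` has a single rating and is dropped, leaving the five rows of `u1` and `u3`.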


In [13]:
# merge the ratings and books dataframes

ratings_with_books = ratings.merge(books,on = 'ISBN')
print(ratings_with_books.shape)
# print(ratings_with_books.head())

(761799, 7)
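
The row count drops from 834558 to 761799 because `merge` performs an inner join by default, so ratings whose ISBN has no match in `books` are discarded. The sketch below reproduces that effect on toy frames (made-up ISBNs and titles) and uses `indicator=True` to list the unmatched rows.

```python
import pandas as pd

# Toy frames (assumption): ISBN '999' has no entry in the books table.
toy_ratings = pd.DataFrame({'ISBN': ['111', '222', '999'], 'rating': [5, 0, 3]})
toy_books = pd.DataFrame({'ISBN': ['111', '222'], 'title': ['Book A', 'Book B']})

# Default inner join: the unmatched rating disappears.
inner = toy_ratings.merge(toy_books, on='ISBN')

# Left join with an indicator column to see which ratings found no book.
checked = toy_ratings.merge(toy_books, on='ISBN', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']

print(len(toy_ratings), len(inner))
print(unmatched['ISBN'].tolist())
```

Running the same check on the real frames would show which ISBNs in the ratings file are missing from the books file.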


In [14]:
# TODO: understand the pandas parts
# total number of ratings per book
number_rating = ratings_with_books.groupby('title')['rating'].count().reset_index()
# rename the column to reflect its new meaning
number_rating.rename(columns={'rating':'number of rating'}, inplace = True)
number_rating.head()

Unnamed: 0,title,number of rating
0,A Light in the Storm: The Civil War Diary of ...,4
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,"Ask Lily (Young Women of Faith: Lily Series, ...",1
4,Beyond IBM: Leadership Marketing and Finance ...,1


In [15]:
final_ratings = ratings_with_books.merge(number_rating,on = 'title')
final_ratings.head()
print(final_ratings.shape)

(761799, 8)
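
The count-then-merge steps above can also be written as a single `groupby(...).transform('count')`, which returns one value per original row. This is an alternative sketch on toy data, not the approach the notebook uses.

```python
import pandas as pd

# Toy merged frame (assumption).
toy = pd.DataFrame({
    'title':  ['Book A', 'Book A', 'Book B'],
    'rating': [5, 0, 3],
})

# For every row, the number of ratings its title received.
toy['number of rating'] = toy.groupby('title')['rating'].transform('count')
print(toy)
```

The result has the same shape as the input, so no second merge is needed.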


In [16]:
# keep only the books that have at least 10 ratings
final_ratings = final_ratings[final_ratings['number of rating'] >= 10]
print(final_ratings.shape)
final_ratings.head()

(373643, 8)


Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,number of rating
0,276847,446364193,0,Along Came a Spider (Alex Cross Novels),James Patterson,1993,Warner Books,213
1,278418,446364193,0,Along Came a Spider (Alex Cross Novels),James Patterson,1993,Warner Books,213
2,5483,446364193,0,Along Came a Spider (Alex Cross Novels),James Patterson,1993,Warner Books,213
3,7346,446364193,0,Along Came a Spider (Alex Cross Novels),James Patterson,1993,Warner Books,213
4,8362,446364193,0,Along Came a Spider (Alex Cross Novels),James Patterson,1993,Warner Books,213


In [17]:
# drop duplicate (user, title) records
# reassigning avoids modifying a filtered slice in place
final_ratings = final_ratings.drop_duplicates(['user_id', 'title'])
final_ratings.shape

(369425, 8)

In [18]:
# final_ratings
final_ratings.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 369425 entries, 0 to 711210
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   user_id           369425 non-null  object
 1   ISBN              369425 non-null  object
 2   rating            369425 non-null  object
 3   title             369425 non-null  object
 4   author            369425 non-null  object
 5   year              369425 non-null  object
 6   publisher         369424 non-null  object
 7   number of rating  369425 non-null  int64 
dtypes: int64(1), object(7)
memory usage: 25.4+ MB


In [19]:
# build the pivot table: one row per title, one column per user
# pivot_table aggregates the rating values (mean by default)
final_ratings['rating'] = final_ratings['rating'].astype(int)
# final_ratings.info(verbose=True)
book_pivot = final_ratings.pivot_table(columns='user_id', index='title', values='rating') 

In [20]:
final_ratings.head()

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,number of rating
0,276847,446364193,0,Along Came a Spider (Alex Cross Novels),James Patterson,1993,Warner Books,213
1,278418,446364193,0,Along Came a Spider (Alex Cross Novels),James Patterson,1993,Warner Books,213
2,5483,446364193,0,Along Came a Spider (Alex Cross Novels),James Patterson,1993,Warner Books,213
3,7346,446364193,0,Along Came a Spider (Alex Cross Novels),James Patterson,1993,Warner Books,213
4,8362,446364193,0,Along Came a Spider (Alex Cross Novels),James Patterson,1993,Warner Books,213


In [28]:
book_pivot.shape

(12942, 5059)

In [30]:
book_pivot.fillna(0,inplace=True)
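
To make the pivot and fill steps concrete, here is a sketch on a four-row toy frame (made-up users, titles, and ratings). It uses the same `pivot_table` arguments as the cell above.

```python
import pandas as pd

# Toy ratings (assumption).
toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3'],
    'title':   ['Book A', 'Book B', 'Book A', 'Book B'],
    'rating':  [8, 0, 5, 10],
})

# Rows are titles, columns are users; missing pairs become NaN.
toy_pivot = toy.pivot_table(columns='user_id', index='title', values='rating')

# Replace NaN ("user never rated this book") with 0.
toy_pivot = toy_pivot.fillna(0)
print(toy_pivot)
```

After the fill, the explicit 0 that `u1` gave `Book B` looks exactly like the missing `u2` / `Book B` pair. If, as is usually documented for this dataset, a rating of 0 marks an implicit interaction, this is one reason the TODO about implicit ratings matters.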

In [23]:
# final_ratings = final_ratings[final_ratings['title'] == 'Animal Farm']
# final_ratings

# The Core Part

Can the kNN results be visualized?

## TODO

This part is important: all the code in this section needs to be understood in detail.

In [24]:
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

book_sparse = csr_matrix(book_pivot)
nn = NearestNeighbors(algorithm='brute')
nn.fit(book_sparse)

NearestNeighbors(algorithm='brute')
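
With no `metric` argument, the model above uses the default Minkowski metric with `p=2`, i.e. Euclidean distance, so books rated by many users can end up far from everything else. The sketch below compares that default with `metric='cosine'` on a toy matrix. Cosine is a swapped-in alternative for illustration, not what this notebook uses, and the matrix values are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy book-by-user matrix (assumption): rows are books, columns are users.
toy_matrix = np.array([
    [10.0, 8.0, 0.0, 0.0],   # book 0
    [1.0, 0.8, 0.0, 0.0],    # book 1: same direction as book 0, much smaller values
    [9.0, 8.0, 3.0, 0.0],    # book 2: close to book 0 in absolute values
])
toy_sparse = csr_matrix(toy_matrix)

euclid = NearestNeighbors(n_neighbors=2, algorithm='brute').fit(toy_sparse)
cosine = NearestNeighbors(n_neighbors=2, algorithm='brute', metric='cosine').fit(toy_sparse)

# Two nearest neighbors of book 0 under each metric (book 0 itself is one of them).
_, idx_euclid = euclid.kneighbors(toy_sparse[0])
_, idx_cosine = cosine.kneighbors(toy_sparse[0])

print("euclidean:", idx_euclid[0])
print("cosine:   ", idx_cosine[0])
```

Euclidean distance pairs book 0 with book 2, while cosine distance pairs it with book 1, whose ratings point in the same direction at a smaller scale.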

## Recommendation Function

In [25]:
import pprint
# pp = pprint.PrettyPrinter(indent=4)
# pp.pprint(book_pivot.index.tolist())

In [26]:
def recommendation(book_name):
    # TODO: the next two statements need to be understood very well
    # position of the book's row in the pivot table index
    book_id = np.where(book_pivot.index == book_name)[0][0]
    print("book_id: ", book_id)
    # print(book_pivot.iloc[book_id,:].values.reshape(1,-1))
    # query with the book's rating vector, reshaped to a single-row 2D array
    distances, suggestions = nn.kneighbors(book_pivot.iloc[book_id,:].values.reshape(1,-1))

    print("distances: ", distances)
    print("suggestions: ", suggestions)

    # suggestions has shape (1, n_neighbors); its only row holds the neighbor positions
    print("the suggestions are ",book_name,"are : ")
    print(book_pivot.index[suggestions[0]])
    print()
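
A variant that returns the titles instead of printing them, and leaves out the query book itself, could look like the sketch below. `recommend_titles`, `toy_pivot`, and `toy_nn` are hypothetical names, and the 4x4 rating table is made up so that the block runs on its own.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy title-by-user table (assumption) standing in for book_pivot.
toy_pivot = pd.DataFrame(
    [[10, 8, 0, 0], [9, 8, 1, 0], [0, 0, 9, 7], [0, 1, 8, 7]],
    index=pd.Index(['Book A', 'Book B', 'Book C', 'Book D'], name='title'),
    columns=['u1', 'u2', 'u3', 'u4'],
    dtype=float,
)
toy_nn = NearestNeighbors(algorithm='brute').fit(toy_pivot.values)

def recommend_titles(book_name, n=2):
    """Return up to n titles closest to book_name, excluding the book itself."""
    if book_name not in toy_pivot.index:
        return []  # unknown title: nothing to recommend
    book_id = toy_pivot.index.get_loc(book_name)
    query = toy_pivot.iloc[book_id, :].values.reshape(1, -1)
    # Ask for one extra neighbor because the query book is its own nearest neighbor.
    _, suggestions = toy_nn.kneighbors(query, n_neighbors=n + 1)
    return [toy_pivot.index[i] for i in suggestions[0] if i != book_id][:n]

print(recommend_titles('Book A', n=1))
print(recommend_titles('Missing title'))
```

Returning a list makes the result easier to test and reuse, and the membership check avoids the `IndexError` that an unknown title would otherwise trigger in the `np.where(...)[0][0]` lookup.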

In [27]:
# recommendation('White Teeth: A Novel')
# recommendation('Pleading Guilty')
# recommendation('American Gods')

recommendation('Animal Farm')
recommendation('The Fellowship of the Ring (The Lord of the Rings, Part 1)')
recommendation("Harry Potter and the Sorcerer's Stone (Book 1)")
recommendation("Harry Potter and the Chamber of Secrets (Book 2)")
recommendation("The Da Vinci Code")

# print(recommendation("Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback))"))

book_id:  821
distances:  [[ 0.         67.9337913  67.97058187 68.19824045 68.22756041]]
suggestions:  [[  821 11009  8516 11755 12331]]
the suggestions are  Animal Farm are : 
Index(['Animal Farm', 'The Republic of Love',
       'Spontaneous Healing : How to Discover and Enhance Your Body's Natural Ability to Maintain and Heal Itself',
       'The crow road', 'Vintage Stuff'],
      dtype='object', name='title')

book_id:  9854
distances:  [[ 0.         86.         87.69264507 91.47677301 95.35722311]]
suggestions:  [[ 9854 11515 11023 11022  2284]]
the suggestions are  The Fellowship of the Ring (The Lord of the Rings, Part 1) are : 
Index(['The Fellowship of the Ring (The Lord of the Rings, Part 1)',
       'The Two Towers (The Lord of the Rings, Part 2)',
       'The Return of the King (The Lord of the Rings, Part 3)',
       'The Return of the King (The Lord of The Rings, Part 3)',
       'DIANA HER TRUE STORY COMMEMORATIVE EDITION'],
      dtype='object', name='title')

book_id: