# Assignment 5 - Recommender system

30.12.2024, Krzysztof Czarnowus

This system recommends anime - Japanese animated series and movies.

Dataset source: https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database

### Kaggle description

**Context**

This data set contains information on user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating and this data set is a compilation of those ratings.

**Content**

Anime.csv

    anime_id - myanimelist.net's unique id identifying an anime.
    name - full name of anime.
    genre - comma separated list of genres for this anime.
    type - movie, TV, OVA, etc.
    episodes - how many episodes in this show. (1 if movie).
    rating - average rating out of 10 for this anime.
    members - number of community members that are in this anime's "group".

Rating.csv

    user_id - non identifiable randomly generated user id.
    anime_id - the anime that this user has rated.
    rating - rating out of 10 this user has assigned (-1 if the user watched it but didn't assign a rating).


In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
from sklearn.neighbors import NearestNeighbors

In [3]:
anime = pd.read_csv('anime.csv')
rating = pd.read_csv('rating.csv')

We want to use **collaborative filtering**, which needs a lot of memory because it works on a full user-item matrix. To keep memory use manageable, let's focus on popular anime: titles with at least 1000 actual ratings. Popularity is also a useful signal when recommending!
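To make the memory argument concrete, here is a rough back-of-the-envelope estimate. It assumes a dense matrix with one 8-byte float per cell and uses the user and anime counts reported in this notebook (the Kaggle description and the counts printed further down); the helper name `dense_matrix_gb` is only illustrative.

```python
# Approximate size of a dense float64 user-item matrix.
# Assumption: 8 bytes per cell (float64), no index overhead.
BYTES_PER_CELL = 8


def dense_matrix_gb(n_users: int, n_items: int) -> float:
    """Size of an n_users x n_items float64 matrix in GB (10**9 bytes)."""
    return n_users * n_items * BYTES_PER_CELL / 1e9


# All users x all anime, as listed in the Kaggle description.
full = dense_matrix_gb(73_516, 12_294)
# Users x anime that remain after the filtering steps in this notebook.
filtered = dense_matrix_gb(69_323, 1_462)

print(f"full dataset: {full:.2f} GB")
print(f"after filter: {filtered:.2f} GB")
```

Under these assumptions the filtered matrix is roughly nine times smaller than the unfiltered one, which is what makes a dense pivot table feasible on an ordinary machine.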

In [4]:
valid_anime_ids = rating[rating['rating'] != -1].groupby('anime_id').size()

#choosing animes with at least 1000 ratings
valid_anime_ids = valid_anime_ids[valid_anime_ids >= 1000].index

#deleting rest of them from the dataset
anime = anime[anime['anime_id'].isin(valid_anime_ids)]

#deleting them also from "ratings" dataset
rating = rating[rating['anime_id'].isin(valid_anime_ids)]

In [5]:
print("Amount of records: ", len(anime))
anime.head()

Amount of records:  1462


,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama',"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [6]:
print("Amount of records: ", len(rating))
rating.head()

Amount of records:  6315714


,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1


It appears that some users didn't rate the anime they watched (rating -1). Let's remove those records. No popularity information is lost, because popularity is still available in the "members" column of the anime dataframe.

In [7]:
rating = rating[rating['rating'] != -1]
print("Amount of records: ", len(rating))

Amount of records:  5192794


It's also important to make sure that no user has rated the same anime twice.
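Duplicates matter here because `DataFrame.pivot`, used later to build the user-item matrix, raises a `ValueError` when the same index/column pair occurs more than once. A minimal sketch on made-up data shows how the duplicates can be counted before they are dropped:

```python
import pandas as pd

# Toy ratings table (made-up values); user 1 rated anime 20 twice.
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2],
    "anime_id": [20, 20, 24, 20],
    "rating":   [8, 9, 7, 6],
})

# Rows that repeat an earlier (user_id, anime_id) pair.
n_dupes = int(toy.duplicated(subset=["user_id", "anime_id"]).sum())

# keep="first" retains the earliest row of each pair.
deduped = toy.drop_duplicates(subset=["user_id", "anime_id"], keep="first")

print("duplicate rows:", n_dupes)
print(deduped)
```

Keeping the first occurrence is an arbitrary choice; averaging the repeated ratings would be another reasonable option.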

In [8]:
rating = rating.drop_duplicates(subset=['user_id', 'anime_id'], keep='first')
print("Amount of records: ", len(rating))
print("Amount of users: ", rating['user_id'].nunique())

Amount of records:  5192787
Amount of users:  69323


There are about 5.2 million ratings from roughly 69,000 users left - that is probably enough to build a good recommendation system.

### Time to create a new user - myself!

In [9]:
#creating unique user_id
my_id = rating['user_id'].max()+1
print("New user id is: ", my_id)

#function to add anime with desired name
def add_rate(name, ratio):
    global rating #to change values in rating df
    #finding anime_id
    if name in anime['name'].values:
        anime_id = anime.loc[anime['name'] == name, 'anime_id'].iloc[0]
        row = {'user_id': my_id, 'anime_id': anime_id, 'rating': ratio}
        rating = pd.concat([rating, pd.DataFrame([row])], ignore_index=True)
        print("Added anime ", name, " with rating ", ratio, " to dataset")
    else:
        print("Anime named ", name,  " doesn't exist in this dataset")

add_rate("Neon Genesis Evangelion", 8)
add_rate("Neon Genesis Evangelion: The End of Evangelion", 8)
add_rate("Cowboy Bebop", 10)
add_rate("Shingeki no Kyojin", 4)
add_rate("Samurai Champloo", 7)
add_rate("Death Note", 5)
add_rate("Nana", 8)
add_rate("Ghost in the Shell", 8)
add_rate("Naruto", 3)
add_rate("Trigun", 5)
add_rate("Black Lagoon", 7)
add_rate("Akira", 5)
add_rate("Paprika", 9)
add_rate("Hunter x Hunter (2011)", 4)
add_rate("Kimi no Na wa.", 7)
add_rate("FLCL", 7)
add_rate("Tengen Toppa Gurren Lagann", 7)
add_rate("Hajime no Ippo", 7)
add_rate("Sen to Chihiro no Kamikakushi", 8)
add_rate("Mononoke Hime", 7)
add_rate("One Punch Man", 5)
add_rate("Hotaru no Haka", 5)

New user id is:  73517
Added anime  Neon Genesis Evangelion  with rating  8  to dataset
Added anime  Neon Genesis Evangelion: The End of Evangelion  with rating  8  to dataset
Added anime  Cowboy Bebop  with rating  10  to dataset
Added anime  Shingeki no Kyojin  with rating  4  to dataset
Added anime  Samurai Champloo  with rating  7  to dataset
Added anime  Death Note  with rating  5  to dataset
Added anime  Nana  with rating  8  to dataset
Added anime  Ghost in the Shell  with rating  8  to dataset
Added anime  Naruto  with rating  3  to dataset
Added anime  Trigun  with rating  5  to dataset
Added anime  Black Lagoon  with rating  7  to dataset
Added anime  Akira  with rating  5  to dataset
Added anime  Paprika  with rating  9  to dataset
Added anime  Hunter x Hunter (2011)  with rating  4  to dataset
Added anime  Kimi no Na wa.  with rating  7  to dataset
Added anime  FLCL  with rating  7  to dataset
Added anime  Tengen Toppa Gurren Lagann  with rating  7  to dataset
Added anime  

### Let's recommend some animes!

The first approach uses k-NN (k-nearest neighbors). It finds the users whose ratings are most similar to mine and recommends anime based on what they rated highly.
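As an illustration of what "similar users" means, the sketch below runs a cosine-distance neighbor search on a tiny, made-up rating matrix. The four users and their ratings are hypothetical; only the `NearestNeighbors` settings match the ones used in this notebook.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up ratings of four users for three anime (0 = not rated).
toy_ratings = np.array([
    [10, 8, 0],   # user A
    [5, 4, 1],    # user B: almost the same taste as A, on a lower scale
    [0, 0, 9],    # user C: no overlap with A
    [9, 0, 7],    # user D: partial overlap with A
], dtype=float)

knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(toy_ratings)

# Query with user A's own vector and ask for all four users, nearest first.
distances, indices = knn.kneighbors(toy_ratings[[0]], n_neighbors=4)

print("order of users:", indices.flatten().tolist())
print("cosine distances:", distances.flatten().round(3).tolist())
```

Cosine distance compares the direction of the rating vectors rather than their length, so user B ends up very close to A even though B's scores are lower, while user C, who shares no rated anime with A, is at distance 1.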

In [10]:
#creating user-item matrix
user_item_matrix = rating.pivot(index=['user_id'], columns=['anime_id'], values='rating').fillna(0)

In [17]:
user_item_matrix

user_id \ anime_id,1,5,6,7,15,16,18,19,20,22,...,32281,32282,32379,32438,32542,32729,32828,32935,32998,34240
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,8.0,0.0,6.0,0.0,6.0,0.0,6.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73513,9.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
73514,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
73515,10.0,10.0,10.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
73516,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
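Most cells in the matrix above are zeros, i.e. missing ratings. As an alternative that this notebook does not use, the same data could be held in a SciPy sparse matrix, which stores only the observed ratings. The sketch below uses made-up data; the variable names are illustrative.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy ratings in the same long format as rating.csv (made-up values).
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "anime_id": [20, 24, 20, 79, 79],
    "rating":   [8, 6, 9, 7, 10],
})

# Map the original ids to consecutive row/column positions.
user_codes, user_ids = pd.factorize(toy["user_id"])
anime_codes, anime_ids = pd.factorize(toy["anime_id"])

# Only the observed ratings are stored; missing cells are implicit zeros.
sparse_matrix = csr_matrix(
    (toy["rating"].to_numpy(dtype=float), (user_codes, anime_codes)),
    shape=(len(user_ids), len(anime_ids)),
)

# Brute-force cosine search accepts the sparse matrix directly.
knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(sparse_matrix)
distances, indices = knn.kneighbors(sparse_matrix[0], n_neighbors=2)

print(sparse_matrix.shape, "stored values:", sparse_matrix.nnz)
print("neighbors of user", user_ids[0], "->", user_ids[indices.flatten()].tolist())
```

With a sparse representation, memory grows with the number of ratings rather than with users times anime, so the popularity filter would be needed less for memory reasons.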


In [11]:
#creating K-NN model that uses ratings as vectors
model = NearestNeighbors(metric='cosine', algorithm='brute')
model.fit(user_item_matrix.values)

In [12]:
#building my rating vector (one row of the user-item matrix)
my_vector = user_item_matrix.loc[my_id].values.reshape(1, -1)

#finding the 200 closest neighbors among ~70 000 users in the database
#(the new user is part of the fitted data, so it is returned as its own nearest neighbor)
distances, indices = model.kneighbors(my_vector, n_neighbors=200)

#converting row positions back to user_ids
similar_user_ids = user_item_matrix.index[indices.flatten()]

In [19]:
print("Similar users IDs: ")
print(similar_user_ids.to_list())

Similar users IDs: 
[73517, 13952, 1785, 62308, 5038, 10540, 56298, 49054, 7556, 60891, 23233, 14921, 71903, 69172, 34491, 69748, 26349, 66335, 21029, 33796, 60106, 10602, 52999, 33620, 25383, 42196, 4824, 72553, 52364, 65955, 34946, 60246, 71622, 58264, 2536, 66853, 37132, 66859, 62344, 57932, 60665, 3929, 2646, 7945, 66056, 61121, 51280, 66658, 28011, 16506, 6300, 5273, 49604, 61750, 6830, 49991, 71109, 13265, 27187, 29774, 28625, 26958, 11394, 53084, 34579, 68114, 41294, 46413, 17070, 6790, 33417, 49552, 69745, 45544, 24070, 5256, 15265, 33006, 40306, 30378, 72625, 9764, 35381, 10271, 10546, 21335, 54091, 49767, 66408, 27752, 11032, 41256, 35537, 8780, 41115, 51589, 56416, 49067, 37152, 59211, 21671, 10205, 65915, 26819, 38273, 29446, 11705, 4202, 64890, 42612, 32430, 39641, 27005, 35418, 12562, 23557, 14801, 29352, 53350, 37304, 58581, 35291, 47100, 10620, 49805, 44196, 54952, 72381, 45706, 19563, 28740, 23693, 30785, 59055, 15966, 57453, 72736, 37271, 2775, 5110, 6025, 68157, 3531
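The first ID in the list above is the new user itself, because the query vector is also a row of the fitted matrix. If exactly `k` other users are wanted, one option is to request `k + 1` neighbors and drop the query user by ID. The sketch below shows this on a made-up matrix; `query_id` is a hypothetical stand-in for `my_id`.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy user-item matrix (made-up values); index = user_id, columns = anime_id.
toy_matrix = pd.DataFrame(
    [[8, 6, 0], [9, 5, 1], [0, 0, 10], [7, 0, 3]],
    index=[101, 102, 103, 104],
    columns=[20, 24, 79],
    dtype=float,
)
query_id = 101   # hypothetical id standing in for my_id
k = 2            # number of *other* users wanted

knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy_matrix.values)

# Ask for one extra neighbor, because the query user is part of the fitted data.
query = toy_matrix.loc[[query_id]].values
_, idx = knn.kneighbors(query, n_neighbors=k + 1)

# Drop the query user by id rather than by position, then keep k neighbors.
candidates = toy_matrix.index[idx.flatten()]
neighbor_ids = [int(uid) for uid in candidates if uid != query_id][:k]

print(neighbor_ids)
```

Filtering by ID instead of discarding the first result is the safer choice: if another user happened to have an identical rating vector, the query user would not be guaranteed to come first.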

In [13]:
#extracting only ratings of the closest neighbors
recommended_anime = user_item_matrix.loc[similar_user_ids]
#masking missing ratings (stored as 0) as NaN so they are ignored below
recommended_anime = recommended_anime[recommended_anime > 0]
#counting number of ratings of closest neighbors for each anime
rating_counts = recommended_anime.count(axis=0)
#calculating mean value of neighbors' ratings
recommended_anime = recommended_anime.mean(axis=0)
#zeroing the score of animes rated by fewer than 10 neighbors
recommended_anime[rating_counts < 10] = 0
#removing animes that I have already rated
rated = user_item_matrix.loc[my_id]
recommended_anime = recommended_anime[~recommended_anime.index.isin(rated[rated > 0].index)]
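The masking step matters because the zeros in the user-item matrix mean "not rated", not "rated 0". A small sketch with made-up neighbor ratings shows how the mean changes once zeros are turned into NaN, and how the minimum-count rule works (a threshold of 2 is used here instead of 10 so the toy data stays small):

```python
import pandas as pd

# Ratings of three neighbors for two anime (made-up values, 0 = not rated).
neighbors = pd.DataFrame(
    {"anime_a": [9.0, 0.0, 0.0], "anime_b": [7.0, 8.0, 6.0]},
    index=["n1", "n2", "n3"],
)

naive_mean = neighbors.mean(axis=0)    # zeros are treated as real ratings
masked = neighbors[neighbors > 0]      # zeros become NaN
masked_mean = masked.mean(axis=0)      # NaN values are skipped by default
counts = masked.count(axis=0)          # number of real ratings per anime

# Keep the mean only where enough neighbors rated the anime, otherwise use 0.
min_count = 2
scores = masked_mean.where(counts >= min_count, 0)

print(pd.DataFrame({"naive": naive_mean, "masked": masked_mean,
                    "count": counts, "score": scores}))
```

Without masking, `anime_a` would be dragged down to a mean of 3.0 by two missing ratings. With masking its mean is 9.0, but it is based on a single neighbor, which is exactly the case the minimum-count rule is meant to suppress.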

In [14]:
#converting Series into Dataframe
recommended_anime = recommended_anime.to_frame(name='accuracy')
recommended_anime['id'] = recommended_anime.index

#merging original anime df with recommendation df
recommended_anime = recommended_anime.merge(anime[['anime_id', 'name']], left_on='id', right_on='anime_id', how='left')
recommended_anime = recommended_anime.drop(columns=['anime_id'])

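### Effect of filtering and sorting - recommendation

Note that the `accuracy` column holds the mean rating that the neighbors gave to each anime (on the 1-10 scale); it is not a probability.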

In [15]:
print("Final dataframe with accuracy as probability:")
recommended_anime.sort_values(by='accuracy', ascending=False).head(10)

Final dataframe with accuracy as probability:


,accuracy,id,name
5,9.133333,19,Monster
639,9.130435,5114,Fullmetal Alchemist: Brotherhood
197,9.045455,457,Mushishi
255,9.0,601,Nekojiru-sou
832,8.941176,9253,Steins;Gate
14,8.857143,44,Rurouni Kenshin: Meiji Kenkaku Romantan - Tsui...
470,8.806452,2251,Baccano!
998,8.8,12355,Ookami Kodomo no Ame to Yuki
154,8.785714,339,Serial Experiments Lain
50,8.615385,97,Last Exile


In [20]:
print("Basing on your ratings, we may recommend you some of the great animes:")

top10 = recommended_anime.sort_values(by='accuracy', ascending=False).head(10)['name'].tolist()

for x in top10:
    print("\t", x)

Basing on your ratings, we may recommend you some of the great animes:
	 Monster
	 Fullmetal Alchemist: Brotherhood
	 Mushishi
	 Nekojiru-sou
	 Steins;Gate
	 Rurouni Kenshin: Meiji Kenkaku Romantan - Tsuioku-hen
	 Baccano!
	 Ookami Kodomo no Ame to Yuki
	 Serial Experiments Lain
	 Last Exile


### Summary

1. We created a simple recommender system that uses **user-based collaborative filtering**.
2. It uses the k-NN algorithm with cosine distance to find the 200 nearest neighbors of the target user based on their ratings.
3. It then keeps only anime rated by at least 10 of those neighbors, calculates the mean of their ratings, and recommends the titles with the highest mean.
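The three steps above can be condensed into one function. This is a minimal sketch on made-up data, not the exact notebook code: the function name `recommend` and its parameters are illustrative, it excludes the target user from the neighbor set, and it drops anime with too few neighbor ratings instead of setting their score to 0.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def recommend(matrix: pd.DataFrame, user_id, k: int, min_count: int, top_n: int) -> pd.Series:
    """Mean neighbor rating per unseen item, highest first (0 in matrix = not rated)."""
    knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(matrix.values)
    # k + 1 because the user itself is in the fitted data and is dropped below.
    _, idx = knn.kneighbors(matrix.loc[[user_id]].values, n_neighbors=k + 1)
    neighbor_ids = [u for u in matrix.index[idx.flatten()] if u != user_id][:k]

    neighbor_ratings = matrix.loc[neighbor_ids]
    observed = neighbor_ratings[neighbor_ratings > 0]      # unrated cells -> NaN
    scores = observed.mean(axis=0)
    scores = scores[observed.count(axis=0) >= min_count]   # enough neighbor ratings
    seen = matrix.columns[(matrix.loc[user_id] > 0).to_numpy()]
    return scores.drop(seen, errors="ignore").sort_values(ascending=False).head(top_n)


# Made-up toy data: rows are users, columns are anime ids.
toy = pd.DataFrame(
    [[10, 8, 0, 0], [9, 7, 8, 0], [8, 9, 9, 6], [0, 0, 2, 10]],
    index=["me", "u1", "u2", "u3"],
    columns=[1, 5, 6, 7],
    dtype=float,
)
top = recommend(toy, "me", k=2, min_count=2, top_n=3)
print(top)
```

In the toy data, the two nearest neighbors of `me` are `u1` and `u2`. Anime 6 is the only title that both of them rated and that `me` has not seen, so it is the only recommendation returned.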