# ANIME RECOMMENDER SYSTEM

This is an ***Anime Recommendation System*** implemented with several machine learning techniques. It includes different types of systems: **Popularity Based (through Average Ratings)**, **Popularity Based (through Ranking/Score)**, **Content Based**, and **Collaborative-Filtering Based**.

The ***'reviews.csv'*** file is no longer available in this project so some of the systems will not be able to function as they did before, including **Popularity Based (through Average Ratings) System** and **Collaborative-Filtering Based System**. However, the code for these recommender systems would remain effective had the ***'reviews.csv'*** file been present, so they have been left in this notebook for reference.

The dataset that was available for the anime, i.e. ***'animes.csv'***, has scarce and scattered information regarding user ratings: not many users have rated more than one kind of anime, so there is little correlation between users based on the kind of anime they have rated. This makes **Collaborative Filtering** ineffective for this dataset.

However, the algorithm would remain more or less the same, so, when working with a different dataset, the **Collaborative-Filtering Based System** can be implemented, and subsequently, a **Hybrid Recommender System** can be implemented as well.

In [None]:
import numpy as np
import pandas as pd
import pickle

In [None]:
animes = pd.read_csv('animes.csv')

In [None]:
animes.head(1)

In [None]:
print(animes.shape)

In [None]:
animes.isnull().sum() # to check for any null values

In [None]:
animes.duplicated().sum() # to check duplicates

In [None]:
animes.drop_duplicates(inplace = True)
animes.duplicated().sum()

In [None]:
# EXTRACT THE RELEASE DATE FROM THE AIRED COLUMN IN ANIMES CSV

# the aired column holds the release date and the final date, separated by "to";
# keep only the release date part, i.e. the text before " to "
# (splitting on " to " with spaces avoids cutting words that contain "to", e.g. "October";
# the .str accessor also leaves missing values as NaN instead of raising an error)

animes['aired'] = animes['aired'].str.split(' to ').str[0]
animes.rename(columns = {'aired':'release_date'}, inplace = True)
animes.head(2)

In [None]:
# dropping NULL values from the animes dataset

animes.dropna(axis = 0, how = 'any', inplace = True)

In [None]:
animes.isnull().sum()

In [None]:
animes.shape

In [None]:
# renaming the uid aka anime id in animes.csv to anime_uid so we can merge
# animes and reviews

animes.rename(columns = {'uid':'anime_uid'}, inplace = True)

In [None]:
animes = animes[['anime_uid', 'title', 'synopsis', 'genre', 'release_date', 'episodes', 'members', 
                 'popularity', 'ranked', 'score', 'img_url']]
animes

## POPULARITY BASED RECOMMENDATION SYSTEM (USING AVERAGE RATINGS)

This type of recommendation system needs the ***'reviews.csv'*** file that is no longer available in this project. However, the code would remain applicable if the ***'reviews.csv'*** file had been present.
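
Since ***'reviews.csv'*** is not available, the idea can only be illustrated on made-up data. The sketch below is not the original notebook code: the tiny `reviews` table, its `score` column, and the threshold `min_ratings` are assumptions chosen for the example. It averages the scores per anime and keeps only the titles with enough ratings, so that a couple of enthusiastic reviews cannot put a little-rated title on top.

```python
import pandas as pd

# Hypothetical stand-in for reviews.csv; the values are invented for illustration
reviews = pd.DataFrame({
    "anime_uid": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "score":     [9, 8, 10, 10, 10, 7, 8, 6, 7],
})

min_ratings = 3  # assumed threshold; tune it for a real dataset

# Number of ratings and average rating per anime
stats = (
    reviews.groupby("anime_uid")["score"]
    .agg(num_ratings="count", avg_rating="mean")
    .reset_index()
)

# Keep anime with enough ratings, best average first
popular_by_rating = (
    stats[stats["num_ratings"] >= min_ratings]
    .sort_values("avg_rating", ascending=False)
    .reset_index(drop=True)
)

print(popular_by_rating)
```

In this toy table, anime 2 has the highest average but only two ratings, so the threshold removes it before ranking.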

## POPULARITY BASED RECOMMENDER SYSTEM (USING THE SCORE COLUMN IN THE ANIMES.CSV)


In [None]:
popular_df = animes[animes['members'] >= 1000000].sort_values('popularity', ascending = True)
popular_df.head(10)

In [None]:
popular_df = popular_df[['anime_uid', 'title', 'synopsis', 'genre', 'release_date',
           'episodes', 'score', 'img_url']][0:10]
popular_df

In [None]:
anime_details = animes.copy() # independent copy, so later changes to animes do not alter it

In [None]:
pickle.dump(anime_details, open('anime_details.pkl', 'wb'))

## CONTENT BASED RECOMMENDATION SYSTEM

In [49]:
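# This fetches the synopsis of the first anime ('Haikyuu' in this case)

# work on a copy with a fresh 0..n-1 index so that the original synopsis text
# stays intact and row labels match row positions
animes_cbrs = animes.copy().reset_index(drop = True)

animes_cbrs['synopsis'][0]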

"Following their participation at the Inter-High, the Karasuno High School volleyball team attempts to refocus their efforts, aiming to conquer the Spring tournament instead.  \r\n \r\nWhen they receive an invitation from long-standing rival Nekoma High, Karasuno agrees to take part in a large training camp alongside many notable volleyball teams in Tokyo and even some national level players. By playing with some of the toughest teams in Japan, they hope not only to sharpen their skills, but also come up with new attacks that would strengthen them. Moreover, Hinata and Kageyama attempt to devise a more powerful weapon, one that could possibly break the sturdiest of blocks.  \r\n \r\nFacing what may be their last chance at victory before the senior players graduate, the members of Karasuno's volleyball team must learn to settle their differences and train harder than ever if they hope to overcome formidable opponents old and new—including their archrival Aoba Jousai and its world-class 

In [None]:
animes_cbrs['synopsis'] = animes_cbrs['synopsis'].apply(lambda x:x.split())

In [None]:
animes_cbrs.head()

In [None]:
import ast

In [None]:
def convert(obj):
    # the genre column stores each list as a string; parse it into a real Python list
    return list(ast.literal_eval(obj))

In [None]:
animes_cbrs['genre'] = animes_cbrs['genre'].apply(convert)

In [None]:
animes_cbrs.head()

In [None]:
# remove any spaces within a genre name, say Slice of Life to SliceofLife, so that
# each genre stays a single token in the recommendation system

animes_cbrs['genre'] = animes_cbrs['genre'].apply(lambda x:[i.replace(" ", "") for i in x])

In [None]:
animes_cbrs['tags'] = animes_cbrs['synopsis'] + animes_cbrs['genre']

In [None]:
animes_cbrs.head()

In [None]:
# converting the list of tags to a string

animes_cbrs['tags'] = animes_cbrs['tags'].apply(lambda x: " ".join(x))

In [None]:
# all the tags in the same string 

animes_cbrs['tags'][0]

In [None]:
# convert the string into lowercase

animes_cbrs['tags'] = animes_cbrs['tags'].apply(lambda x:x.lower())

In [None]:
animes_cbrs.head()

In [None]:
animes_cbrs['tags'][1]

### STEMMING

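We will apply stemming to the list of words, as the same word appears in multiple variations (for example, different tenses or plural forms) that would otherwise be counted as separate words.

nltk is a widely used natural language processing library. Install it using **'pip install nltk'**.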

In [None]:
import nltk
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))
    
    return " ".join(y)

In [None]:
animes_cbrs['tags'] = animes_cbrs['tags'].apply(stem)

## TEXT VECTORIZATION USING BAG OF WORDS TECHNIQUE

Combine all the words (tags in this case), word1 + word2 + ... + wordn, into one large string.

In that large string, calculate the frequency of each word and extract, say, the 5000 words with the highest frequency.

Then, for each anime's tags, count how many times each of those 5000 words occurs (the count is 0 if the word is not present in that anime's tags).

This will look somewhat like this:

<pre><code>
            Word1     Word2     Word3    .....     Word5000
 Anime1       5         3         0                   1
 Anime2       2         0         1                   3
 Anime3       1         1         1                   1
 .
 .
 .
 AnimeN       5         2         5                   5
</code></pre>

*(with shape (N, 5000), where N is the number of anime)*

That one row of 5, 3, 0, ... 1 is a vector: the anime has been converted into a vector in a 5000-dimensional space, and the same holds for every other anime.

Now, when someone says they like a particular anime, we will fetch the 5 vectors closest to that anime's vector.

We have taken 5000 words when we could have taken more. However, as the number of words increases, so does the dimensionality of the data, which makes distances less meaningful and computation more expensive, so it is better to use the smallest vocabulary that still performs well.

We will not be considering stop words, i.e. words that are used for sentence formation but contribute little to the actual meaning of the sentence, such as are, and, or, to, from, etc.
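
The counting described above can be sketched by hand on a few made-up tag strings. This is only an illustration of the idea, not how `CountVectorizer` is implemented; the sample tags, the tiny stop-word list, and the helper name `bag_of_words` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical, already lowercased tag strings (not taken from animes.csv)
tags = [
    "volleyball team train for the tournament",
    "notebook and detective and notebook",
    "team of pirates search for treasure",
]

# Tiny illustrative stop-word list; scikit-learn's 'english' list is much longer
STOP_WORDS = {"for", "the", "and", "of"}


def bag_of_words(docs, max_features):
    """Return (vocabulary, vectors) using the max_features most frequent words."""
    tokenised = [[w for w in doc.split() if w not in STOP_WORDS] for doc in docs]
    # Frequency of every word across all documents combined
    totals = Counter(w for words in tokenised for w in words)
    # Keep the most frequent words; sort alphabetically for a stable column order
    vocabulary = sorted(w for w, _ in totals.most_common(max_features))
    vectors = []
    for words in tokenised:
        counts = Counter(words)
        # One count per vocabulary word; 0 when the word is absent
        vectors.append([counts[w] for w in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(tags, max_features=4)
print(vocabulary)
for row in vectors:
    print(row)
```

Each printed row is one document expressed as a vector with `max_features` dimensions, which is the same structure as the table above with 5000 replaced by 4.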

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 5000, stop_words = 'english')

In [None]:
# by default, CountVectorizer returns a SciPy sparse matrix because most entries are 0,
# so we convert it to a dense numpy array as we need it

vectors = cv.fit_transform(animes_cbrs['tags']).toarray()

In [None]:
vectors[0] # mostly zeros, since a single anime's tags contain only a few of the 5000 words

### We will be calculating the Cosine Similarity of one vector with all the other vectors and repeat it for all the vectors (each anime with each anime)

When dealing with high-dimensional data, Euclidean distance should be avoided because it becomes less reliable as the number of dimensions grows. We will therefore use the cosine distance between the vectors, i.e. a measure based on the angle (θ) between them, instead.

The smaller the angle, the smaller the distance and the more similar the two vectors (anime) are. Cosine distance is defined as 1 minus cosine similarity, so one goes down as the other goes up.
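
A small worked example may make the relationship concrete. It is a sketch on made-up count vectors using only the standard library; the helper names `cosine_sim` and `euclidean` are assumptions for this illustration and are not part of scikit-learn.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


short_doc = [1, 2, 0, 1]      # word counts of a short synopsis
long_doc = [10, 20, 0, 10]    # same word proportions, ten times longer
other_doc = [0, 1, 3, 0]      # mostly different words

sim_same = cosine_sim(short_doc, long_doc)
sim_other = cosine_sim(short_doc, other_doc)

print("cosine similarity, same proportions:", round(sim_same, 3))
print("cosine similarity, different words: ", round(sim_other, 3))
print("cosine distance (1 - similarity):   ", round(1 - sim_other, 3))
print("euclidean, same proportions:", round(euclidean(short_doc, long_doc), 3))
print("euclidean, different words: ", round(euclidean(short_doc, other_doc), 3))
```

Here Euclidean distance ranks the unrelated document as closer simply because it is of similar length, while cosine similarity ignores length and compares only direction. This shows one practical reason to prefer the angle for word counts; it does not by itself demonstrate the high-dimensional effect mentioned above.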

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
similarity = cosine_similarity(vectors)

This method calculates the similarity between every pair of vectors. Each anime is most similar to itself, so the diagonal of the resulting matrix is 1 everywhere. Because the count vectors contain no negative values, the cosine similarity here ranges from 0 to 1.
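
To see these properties on something small, the sketch below builds the pairwise matrix with plain NumPy for four made-up count vectors and then picks the nearest neighbours of one row. The data (`toy_vectors`) and the helper name `top_k_similar` are assumptions for illustration; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

# Made-up count vectors: 4 items, 5 vocabulary words
toy_vectors = np.array([
    [2, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 3, 1, 0],
    [0, 1, 2, 1, 0],
], dtype=float)

# Normalise each row to unit length; the dot products are then cosine similarities
unit = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)
toy_similarity = unit @ unit.T


def top_k_similar(sim, row, k):
    """Positions of the k rows most similar to `row`, excluding `row` itself."""
    order = np.argsort(sim[row])[::-1]   # most similar first
    return [int(i) for i in order if i != row][:k]


print(np.round(toy_similarity, 2))
print(top_k_similar(toy_similarity, row=0, k=2))
```

Filtering out the query row by position, rather than assuming it is always the first entry after sorting, keeps the result correct even when another item has an identical vector.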

In [None]:
similarity[0] # this is the similarity score of the first anime with every anime

In [None]:
similarity[1] # this is the similarity score of the second anime with every anime

In [None]:
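def recommend(anime):
    
    # row position (not index label) of the requested title: `similarity` is indexed
    # by position, and labels can differ from positions once rows have been dropped
    index = np.flatnonzero(animes_cbrs['title'].values == anime)[0]
    distances = sorted(list(enumerate(similarity[index])), reverse = True, key = lambda x: x[1])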
    
    data = []
    for i in distances[1:6]:
        item = []
        #print(animes_cbrs.iloc[i[0]].title)
        temp_df = animes_cbrs[animes_cbrs['title'] == animes_cbrs.iloc[i[0]].title]
        item.extend(list(temp_df['title'].values))
        
        item.extend(list(anime_details[anime_details['title'].isin(temp_df['title'])]['synopsis'].values))
        
        item.extend(list(temp_df['genre'].values))
        item.extend(list(temp_df['release_date'].values))
        item.extend(list(temp_df['episodes'].values))
        item.extend(list(temp_df['score'].values))
        item.extend(list(temp_df['img_url'].values))
        
        data.append(item)
        
    return data

In [None]:
recommend('Death Note')

In [None]:
pickle.dump(animes_cbrs, open('animes_cbrs.pkl', 'wb'))
pickle.dump(similarity, open('similarity.pkl', 'wb'))

## COLLABORATIVE-FILTERING BASED RECOMMENDATION SYSTEM

We merge the ***'animes.csv'*** and the ***'reviews.csv'*** to have a dataframe that will be somewhat as follows:

<pre><code>
           User 1     User 2    .....     User n
 Anime 1      5           3                  10
 Anime 2      1           0                  3
 Anime 3      5           6                  7
 .
 .
 .
 Anime n      3           9                  6
</code></pre>

To make the recommender system more accurate, it is better to set a threshold for the number of ratings a user has given and to consider only the users who pass it, since users with many ratings provide more reliable signals.

You can also consider only those anime that have been rated by at least ***'N'*** users, so that little-known anime with too few ratings to judge reliably are not recommended.

We will be using the merged dataset **(animes.csv + reviews.csv)**, referred to as **anime_w_ratings**.

We will check how many anime each user has rated, then return ***True*** for the users that have rated at least ***'N'*** anime (here N = 50), and **False** for those who haven't. We will temporarily store this in a variable x.

<pre><code>
x = anime_w_ratings.groupby('uid').count()['score_y'] >= 50
valid_users = x[x].index
</code></pre>

These will be all the user-ids that have rated at least 50 anime.

Assuming there are no missing scores, the same set of users can also be obtained with value_counts:

<pre><code>
counts = anime_w_ratings['uid'].value_counts()
valid_users = counts[counts >= 50].index
</code></pre>

Fetch only those rows from the dataframe in which the users are valid users:
<pre><code>
filtered_rating = anime_w_ratings[anime_w_ratings['uid'].isin(valid_users)]
</code></pre>

Fetch all the anime that have at least 50 ratings.

<pre><code>
y = filtered_rating.groupby('title').count()['score_y'] >= 50
valid_animes = y[y].index 
</code></pre>

These are all the anime that have at least 50 ratings.

Fetch only those rows from the dataframe in which the anime is a valid anime:

<pre><code>
animes_cfrs = filtered_rating[filtered_rating['title'].isin(valid_animes)]
</code></pre>

Create a spreadsheet-style pivot table as a DataFrame with the anime titles as rows, the user-ids as columns, and the scores given by the users as values.

<pre><code>
pt = animes_cfrs.pivot_table(index = 'title', columns = 'uid', values = 'score_y')
</code></pre>

Replace all the NULL values with 0 instead.

<pre><code>
pt.fillna(0, inplace = True)
</code></pre>

Compute the cosine similarity of each anime with all the other anime.

<pre><code>
from sklearn.metrics.pairwise import cosine_similarity

similarity_scores = cosine_similarity(pt)
</code></pre>

RECOMMENDER FUNCTION:

<pre><code>
def recommend_cfrs(anime_name):
    # fetch the row position of the anime in the pivot table
    index = np.where(pt.index == anime_name)[0][0]
    # sort by similarity (highest first) and skip the first entry, the anime itself
    similar_items = sorted(list(enumerate(similarity_scores[index])), key = lambda x:x[1], reverse = True)[1:6]

    for i in similar_items:
        print(pt.index[i[0]])
</code></pre>

Check whether the recommender system is working by calling the function and giving the name of the anime you want recommendations for as the parameter:

<pre><code>
recommend_cfrs('Death Note')
</code></pre>
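
Because ***'reviews.csv'*** is missing, the steps above cannot be run here, but they can be exercised on a small made-up table. The sketch below is an illustration under stated assumptions: the toy `anime_w_ratings` frame, its values, and the lowered threshold `MIN_RATINGS = 2` are invented for the example, and the helper `recommend_cfrs_demo` is hypothetical. The column names `uid`, `title`, and `score_y` follow the snippets above, and cosine similarity is computed with NumPy instead of scikit-learn to keep the example small.

```python
import numpy as np
import pandas as pd

# Made-up merged ratings table (stand-in for animes.csv + reviews.csv)
anime_w_ratings = pd.DataFrame({
    "uid":     [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
    "title":   ["A", "B", "C", "A", "B", "C", "A", "B", "D", "D"],
    "score_y": [9, 8, 2, 10, 9, 1, 8, 9, 3, 5],
})

MIN_RATINGS = 2  # assumed threshold, far lower than the 50 used above

# Keep users with at least MIN_RATINGS ratings
user_counts = anime_w_ratings["uid"].value_counts()
valid_users = user_counts[user_counts >= MIN_RATINGS].index
filtered = anime_w_ratings[anime_w_ratings["uid"].isin(valid_users)]

# Keep anime with at least MIN_RATINGS ratings among those users
anime_counts = filtered["title"].value_counts()
valid_animes = anime_counts[anime_counts >= MIN_RATINGS].index
filtered = filtered[filtered["title"].isin(valid_animes)]

# Anime x user matrix; unrated cells become 0
pt = filtered.pivot_table(index="title", columns="uid", values="score_y").fillna(0)

# Cosine similarity between anime rows (unit-length rows, then dot products)
matrix = pt.to_numpy(dtype=float)
unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
similarity_scores = unit @ unit.T


def recommend_cfrs_demo(anime_name, k=2):
    """Titles of the k anime whose rating rows are most similar to anime_name."""
    index = pt.index.get_loc(anime_name)
    order = np.argsort(similarity_scores[index])[::-1]
    return [pt.index[i] for i in order if i != index][:k]


print(pt)
print(recommend_cfrs_demo("A", k=1))
```

In the toy data, user 4 and title "D" fall below the threshold and are filtered out, which mirrors how sparse real ratings can leave very little to compare.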

It seems that there aren't any users who have rated more than one kind of anime. Because the data we are working with is so sparse and scattered, the collaborative filtering method does not work well as a recommendation system for this dataset. The methodology itself stays the same, so it should work on a dataset with denser ratings.

In [None]:
popular_df

In [None]:
pickle.dump(popular_df, open('popular.pkl', 'wb'))