# __``Recommender System``__

<img src='a_img.png' width=800 height=800>

# **Content Based Filtering**

<hr>

A recommender system seeks to predict a user's preferences and filter items accordingly. Recommender systems are used in a variety of areas, including movies, music, news, books, research articles, search queries, social tags, and products in general.

Like many machine learning techniques, a recommender system makes predictions based on users' historical behavior. Specifically, it predicts a user's preference for a set of items based on past experience. The two most popular approaches to building a recommender system are content-based filtering and collaborative filtering.



Recommender systems produce a list of recommendations in one of two ways:

### 1. **Collaborative filtering**

Collaborative filtering approaches build a model from a user's past behavior (i.e. items purchased or searched for by the user) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may be interested in.

Unlike content-based filtering, collaborative filtering needs nothing except users' historical preferences for a set of items. Because it relies on historical data, its core assumption is that users who agreed in the past tend to agree in the future.
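
The idea that "users who agreed in the past tend to agree in the future" can be sketched in a few lines. This is only a minimal illustration: the rating matrix is made up, `0` is assumed to mean "not rated", and treating unrated entries as zeros when computing similarity is a simplification.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user-item ratings (rows: users, columns: items); 0 = not rated
ratings = np.array([
    [5, 4, 0, 1],   # user 0 has not rated item 2
    [4, 5, 2, 1],
    [1, 1, 5, 4],
    [5, 5, 1, 0],
], dtype=float)

target_user, target_item = 0, 2

# Similarity between every pair of users, based on their rating vectors
user_sim = cosine_similarity(ratings)

# Other users who actually rated the target item
others = [u for u in range(len(ratings))
          if u != target_user and ratings[u, target_item] > 0]

# Similarity-weighted average of their ratings
weights = user_sim[target_user, others]
predicted = np.dot(weights, ratings[others, target_item]) / weights.sum()
print(round(predicted, 2))
```

Users 1 and 3 rate items much like user 0, so their low ratings for item 2 dominate the prediction.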

### 2. **Content-based filtering**

Content-based filtering approaches use a set of discrete characteristics of an item to recommend additional items with similar properties. These methods rely entirely on a description of the item and a profile of the user's preferences, and they recommend items similar to those the user preferred in the past.

A content-based approach requires a good amount of information about the items' own features, rather than users' interactions and feedback. For example, these can be movie attributes such as genre, year, director, and actors, or the textual content of articles, which can be extracted by applying Natural Language Processing.

<hr>

### How to make content-based filtering:

__1. Calculate similarity among the items:__

-    Cosine-Based Similarity

<img src = 'b_img.png'>

-    Correlation-Based Similarity
-    Adjusted Cosine Similarity
-    Jaccard similarity (1 - Jaccard distance)

__2. Calculation of Prediction:__

-    Weighted Sum
-    Regression
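
The two steps above can be sketched with Jaccard similarity and a weighted sum. The feature sets and the user's ratings below are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical item features (genre and singer) as sets
features = {
    'A': {'pop', 'andi'},
    'D': {'pop', 'budi'},
    'E': {'keroncong', 'budi'},
}

def jaccard_similarity(a, b):
    """Size of the intersection over size of the union (1 - Jaccard distance)."""
    return len(a & b) / len(a | b)

# Hypothetical ratings the user already gave
user_ratings = {'A': 5.0, 'E': 2.0}
target = 'D'

# Step 1: similarity between the target item and each rated item
sims = {item: jaccard_similarity(features[target], features[item])
        for item in user_ratings}

# Step 2: weighted sum of the user's ratings, normalized by the total similarity
numerator = sum(sims[item] * user_ratings[item] for item in user_ratings)
denominator = sum(abs(s) for s in sims.values())
prediction = numerator / denominator
print(prediction)  # 3.5
```

Item D shares one of three features with both A and E, so the two ratings are weighted equally and the prediction is their average.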

<hr>

# Understanding Content Based Filtering using __``Simple Dataset``__

Recommendations are made based on the features of the items the user likes.

### __Import Libraries__

In [2]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### __Create DataFrame__

In [3]:
df = pd.DataFrame([
    {'title': 'A', 'genre': 'Pop', 'penyanyi': 'Andi'},
    {'title': 'B', 'genre': 'Keroncong', 'penyanyi': 'Andi'},
    {'title': 'C', 'genre': 'Dangdut', 'penyanyi': 'Andi'},
    {'title': 'D', 'genre': 'Pop', 'penyanyi': 'Budi'},
    {'title': 'E', 'genre': 'Keroncong', 'penyanyi': 'Budi'},
    {'title': 'F', 'genre': 'Dangdut', 'penyanyi': 'Budi'},
    {'title': 'G', 'genre': 'Pop', 'penyanyi': 'Caca'},
    {'title': 'H', 'genre': 'Keroncong', 'penyanyi': 'Caca'},
    {'title': 'I', 'genre': 'Dangdut', 'penyanyi': 'Caca'},
    {'title': 'J', 'genre': 'Pop', 'penyanyi': 'Caca'},
])

df

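| | title | genre | penyanyi |
|---|---|---|---|
| 0 | A | Pop | Andi |
| 1 | B | Keroncong | Andi |
| 2 | C | Dangdut | Andi |
| 3 | D | Pop | Budi |
| 4 | E | Keroncong | Budi |
| 5 | F | Dangdut | Budi |
| 6 | G | Pop | Caca |
| 7 | H | Keroncong | Caca |
| 8 | I | Dangdut | Caca |
| 9 | J | Pop | Caca |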


**CountVectorizer in Python**

To use textual data for predictive modeling, the text must first be split into individual words (tokens); this process is called tokenization. The tokens then need to be encoded as integers or floating-point values for use as inputs to machine learning algorithms. This process is called feature extraction (or vectorization).

Scikit-learn's CountVectorizer converts a collection of text documents into a matrix of term/token counts. It can also pre-process the text before generating the vector representation, which makes it a highly flexible feature representation module for text.

<img src='c_img.png'>

[source](https://www.educative.io/edpresso/countvectorizer-in-python)
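
The two stages described above, tokenization and vectorization, can be inspected separately. This is a small sketch on two made-up strings; `cv_demo` and `docs` are illustrative names, not part of the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Pop music', 'pop and rock music']  # toy documents

cv_demo = CountVectorizer()

# Stage 1: tokenization (lowercasing and splitting into words)
analyzer = cv_demo.build_analyzer()
tokens = analyzer(docs[1])
print(tokens)  # ['pop', 'and', 'rock', 'music']

# Stage 2: vectorization (one column per vocabulary term, one row per document)
counts = cv_demo.fit_transform(docs)
vocab = sorted(cv_demo.vocabulary_)   # column order is alphabetical
print(vocab)
print(counts.toarray())
```

Each row of the resulting matrix counts how often each vocabulary term appears in the corresponding document.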

In [4]:
cv = CountVectorizer()

# learn the vocabulary by tokenizing each genre string
cv.fit(df['genre'])

cv.vocabulary_

{'pop': 2, 'keroncong': 1, 'dangdut': 0}

In [5]:
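mgenre = cv.fit_transform(df['genre'])
# note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use cv.get_feature_names_out() instead
cv.get_feature_names()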

['dangdut', 'keroncong', 'pop']

In [7]:
type(mgenre)

scipy.sparse.csr.csr_matrix

In [8]:
# view the result
mgenre.toarray()

array([[0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1]], dtype=int64)

In [9]:
mgenre.shape

(10, 3)

In [11]:
# convert to a dense 2D matrix
mgenre_matrix = mgenre.todense()

# display as a DataFrame
# (on scikit-learn >= 1.2 use cv.get_feature_names_out() instead)
df_matrix = pd.DataFrame(mgenre_matrix, columns = cv.get_feature_names())
df_matrix

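| | dangdut | keroncong | pop |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 |
| 5 | 1 | 0 | 0 |
| 6 | 0 | 0 | 1 |
| 7 | 0 | 1 | 0 |
| 8 | 1 | 0 | 0 |
| 9 | 0 | 0 | 1 |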


### __Cosine Similarity__

Cosine similarity is a metric that measures how similar two documents are, irrespective of their size. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space. This is advantageous because two similar documents may be far apart in Euclidean distance (due to a difference in document length) and still point in nearly the same direction. The smaller the angle, the higher the cosine similarity.

<img src = 'b_img.png'>
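
The size-independence described above can be checked by computing the formula by hand. A minimal sketch with made-up count vectors; the helper `cosine` is defined here for illustration and is compared against scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

short_doc = np.array([1, 2, 0])   # toy term counts
long_doc = 10 * short_doc         # same proportions, ten times longer
other_doc = np.array([0, 1, 3])   # different mix of terms

same_direction = cosine(short_doc, long_doc)
euclidean_gap = np.linalg.norm(short_doc - long_doc)

print(same_direction)   # cosine similarity of the two "same topic" documents
print(euclidean_gap)    # yet they are far apart in Euclidean distance
print(cosine(short_doc, other_doc))
```

The manual result should agree with `cosine_similarity([short_doc], [other_doc])[0, 0]` from scikit-learn.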

#### __Example__

![image](https://www.machinelearningplus.com/wp-content/uploads/2018/10/the_three_documents-865x610.png?ezimgfmt=ng:webp/ngcb1)

[source](https://www.machinelearningplus.com/nlp/cosine-similarity/)

### __Example Application of Cosine Similarity__

In [12]:
# there are three documents
doc_WM = 'Today, we learn how to build dinamic website and mobile apps'
doc_DM = 'Today, we learn how to promote mobile apps to segmented market'
doc_DS = 'Today, we learn hot to build recommender system'

documents = [doc_WM, doc_DM, doc_DS]

In [15]:
count_vectorizer = CountVectorizer(stop_words='english')
sparse_matrix = count_vectorizer.fit_transform(documents)

In [16]:
# display as a DataFrame
doc_matrix = sparse_matrix.todense()
df_coba = pd.DataFrame(doc_matrix, 
                       columns=count_vectorizer.get_feature_names(),
                      index = ['doc_WM', 'doc_DM', 'doc_DS'])
df_coba

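| | apps | build | dinamic | hot | learn | market | mobile | promote | recommender | segmented | today | website |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| doc_WM | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| doc_DM | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
| doc_DS | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |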


In [17]:
cosine_similarity(df_coba)

array([[1.        , 0.57142857, 0.50709255],
       [0.57142857, 1.        , 0.3380617 ],
       [0.50709255, 0.3380617 , 1.        ]])
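
These values can be verified by hand from the count table. doc_WM and doc_DM each contain 7 terms, doc_DS contains 5, and all counts are 0 or 1, so each dot product is simply the number of shared terms:

$$
\cos(\text{doc\_WM}, \text{doc\_DM}) = \frac{4}{\sqrt{7}\,\sqrt{7}} = \frac{4}{7} \approx 0.5714
$$

$$
\cos(\text{doc\_WM}, \text{doc\_DS}) = \frac{3}{\sqrt{7}\,\sqrt{5}} = \frac{3}{\sqrt{35}} \approx 0.5071
$$

$$
\cos(\text{doc\_DM}, \text{doc\_DS}) = \frac{2}{\sqrt{7}\,\sqrt{5}} = \frac{2}{\sqrt{35}} \approx 0.3381
$$

The shared terms are apps, learn, mobile, today (WM and DM); build, learn, today (WM and DS); and learn, today (DM and DS).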

### __Continuing the Music Recommender Code__

In [21]:
cosScore = cosine_similarity(mgenre)
cosScore

array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 1.],
       [0., 1., 0., 0., 1., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0., 0., 1., 0., 0., 1.],
       [0., 1., 0., 0., 1., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0., 0., 1., 0., 0., 1.],
       [0., 1., 0., 0., 1., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0., 0., 1., 0., 0., 1.]])

### __Music Recommendation__

**Enumerate() in Python**

When iterating, we often also need to keep a count of the iterations. Python makes this easy with the built-in function enumerate(), which adds a counter to an iterable and returns it as an enumerate object. This object can be used directly in for loops or converted into a list of tuples with list().
[source](https://www.geeksforgeeks.org/enumerate-in-python/)
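
A short sketch of why enumerate() is useful here: it keeps each similarity score paired with its row index, so the index survives sorting. The `scores` list is a made-up stand-in for one row of the similarity matrix.

```python
scores = [0.0, 1.0, 0.0]  # toy similarity scores for items 0, 1, 2

# Pair each score with its position
pairs = list(enumerate(scores))
print(pairs)   # [(0, 0.0), (1, 1.0), (2, 0.0)]

# Sort by score (second element), highest first; the index travels with it
ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
print(ranked)  # [(1, 1.0), (0, 0.0), (2, 0.0)]
```

Because Python's sort is stable, items with equal scores keep their original relative order.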

In [22]:
df

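| | title | genre | penyanyi |
|---|---|---|---|
| 0 | A | Pop | Andi |
| 1 | B | Keroncong | Andi |
| 2 | C | Dangdut | Andi |
| 3 | D | Pop | Budi |
| 4 | E | Keroncong | Budi |
| 5 | F | Dangdut | Budi |
| 6 | G | Pop | Caca |
| 7 | H | Keroncong | Caca |
| 8 | I | Dangdut | Caca |
| 9 | J | Pop | Caca |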


In [23]:
# the song the user listened to most recently
last_played = int(input('last played? ')) # index

music_recom = list(enumerate(cosScore[last_played]))
music_recom

last played?  1


[(0, 0.0),
 (1, 1.0),
 (2, 0.0),
 (3, 0.0),
 (4, 1.0),
 (5, 0.0),
 (6, 0.0),
 (7, 1.0),
 (8, 0.0),
 (9, 0.0)]

In [24]:
your_recom = sorted(music_recom, key = lambda x: x[1], reverse=True)
your_recom[:5]

[(1, 1.0), (4, 1.0), (7, 1.0), (0, 0.0), (2, 0.0)]

In [25]:
# display the five recommended songs
for i in your_recom[:5]:
    print(df.iloc[i[0]])

title               B
genre       Keroncong
penyanyi         Andi
Name: 1, dtype: object
title               E
genre       Keroncong
penyanyi         Budi
Name: 4, dtype: object
title               H
genre       Keroncong
penyanyi         Caca
Name: 7, dtype: object
title          A
genre        Pop
penyanyi    Andi
Name: 0, dtype: object
title             C
genre       Dangdut
penyanyi       Andi
Name: 2, dtype: object
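
Note that the list above includes song B itself (the one just played) and two songs with a score of 0. One possible refinement, sketched below, drops the last-played song and anything with zero similarity. The function name `recommend_songs` and the `songs` DataFrame are illustrative, rebuilt here so the block runs on its own.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same toy catalogue as above, rebuilt so this block is self-contained
songs = pd.DataFrame({
    'title': list('ABCDEFGHIJ'),
    'genre': ['Pop', 'Keroncong', 'Dangdut'] * 3 + ['Pop'],
    'penyanyi': ['Andi'] * 3 + ['Budi'] * 3 + ['Caca'] * 4,
})
scores = cosine_similarity(CountVectorizer().fit_transform(songs['genre']))

def recommend_songs(last_played, top_n=5):
    """Return up to top_n similar songs, excluding the one just played."""
    ranked = pd.Series(scores[last_played], index=songs.index)
    ranked = ranked.drop(last_played)   # do not recommend the same song
    ranked = ranked[ranked > 0]         # drop songs with no similarity at all
    top = ranked.sort_values(ascending=False).head(top_n)
    return songs.loc[top.index].assign(score=top.values)

result = recommend_songs(1)
print(result)
```

For index 1 (song B) this leaves only the other two Keroncong songs, E and H.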


<hr>

# __``Anime Recommendation``__

### __Open Anime Dataset__

In [34]:
df_anime = pd.read_csv('anime.csv')
df_anime.shape

(12294, 7)

In [35]:
df_anime.isna().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [36]:
df_anime = df_anime.iloc[:850] # the dataset is too large, so keep only the first 850 rows
df_anime.isna().sum()

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

In [37]:
df_anime['type'].value_counts()

TV         469
Movie      184
OVA        109
Special     76
ONA          9
Music        3
Name: type, dtype: int64

In [38]:
df_anime['genre'].value_counts()

Adventure, Comedy, Mystery, Police, Shounen                       17
Comedy, School, Slice of Life                                     15
Comedy, Drama, Shounen, Sports                                    10
Comedy, School, Shounen, Sports                                   10
Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen       9
                                                                  ..
Action, Drama, Fantasy, Magic, Mystery, Psychological, Shounen     1
Drama, School, Shoujo Ai, Slice of Life                            1
Action, Adventure, Comedy, Fantasy, Romance                        1
Horror, Mystery, Psychological, Supernatural, Thriller             1
Comedy, Parody, Shounen, Supernatural                              1
Name: genre, Length: 517, dtype: int64

In [39]:
len(df_anime)

850

### __Create Recommender System__

In [42]:
cvr = CountVectorizer(
    tokenizer = lambda x: x.split(', ') # split the genre column on ', ' only, so multi-word genres stay intact
)

mgenre_a = cvr.fit_transform(df_anime['genre'])

# note: on scikit-learn >= 1.2 use cvr.get_feature_names_out() instead
print(len(cvr.get_feature_names()))
print(cvr.get_feature_names())

40
['action', 'adventure', 'cars', 'comedy', 'dementia', 'demons', 'drama', 'ecchi', 'fantasy', 'game', 'harem', 'historical', 'horror', 'josei', 'kids', 'magic', 'martial arts', 'mecha', 'military', 'music', 'mystery', 'parody', 'police', 'psychological', 'romance', 'samurai', 'school', 'sci-fi', 'seinen', 'shoujo', 'shoujo ai', 'shounen', 'shounen ai', 'slice of life', 'space', 'sports', 'super power', 'supernatural', 'thriller', 'vampire']
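
As an alternative to a custom tokenizer, pandas can build the same kind of one-column-per-genre matrix directly with `Series.str.get_dummies`. This is a sketch on three made-up genre strings, not the anime dataset; unlike CountVectorizer, it keeps the original capitalization.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy genre strings in the same ', '-separated format as the dataset
genres = pd.Series([
    'Action, Adventure, Shounen',
    'Action, Comedy',
    'Slice of Life, Comedy',
])

# One 0/1 column per distinct genre; multi-word genres stay intact
genre_flags = genres.str.get_dummies(sep=', ')
print(genre_flags)

# The resulting matrix can be passed to cosine_similarity as before
sim = cosine_similarity(genre_flags)
print(sim.round(3))
```

The column labels make it easy to check which genres drive a given similarity score.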


In [43]:
mgenre_a.toarray()

array([[0, 0, 0, ..., 1, 0, 0],
       [1, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [44]:
cosScore_anime = cosine_similarity(mgenre_a)
cosScore_anime

array([[1.        , 0.18898224, 0.        , ..., 0.4472136 , 0.        ,
        0.40824829],
       [0.18898224, 1.        , 0.28571429, ..., 0.16903085, 0.        ,
        0.3086067 ],
       [0.        , 0.28571429, 1.        , ..., 0.3380617 , 0.26726124,
        0.15430335],
       ...,
       [0.4472136 , 0.16903085, 0.3380617 , ..., 1.        , 0.        ,
        0.18257419],
       [0.        , 0.        , 0.26726124, ..., 0.        , 1.        ,
        0.28867513],
       [0.40824829, 0.3086067 , 0.15430335, ..., 0.18257419, 0.28867513,
        1.        ]])

### __Anime Recommender__

In [77]:
animeSuka = input('Which anime do you like? ')
indexSuka = df_anime[df_anime['name'] == animeSuka].index[0]
indexSuka

Which anime do you like?  One Piece


74

In [78]:
anime_recom = list(enumerate(cosScore_anime[indexSuka]))

In [79]:
# 1. Manual ranking
anime_recom_sortir = sorted(anime_recom, key=lambda x:x[1], reverse=True)
# anime_recom_sortir

In [80]:
# 2. Keep only items whose cosine similarity score is above 0.7 (70%)
anime_recom_bgt = list(filter(lambda x: x[1] > 0.7, anime_recom))
anime_recom_bgt_sorted = sorted(anime_recom_bgt, key=lambda x:x[1], reverse=True)

In [81]:
# display anime recommendations based on 'GENRE' similarity to the one you like
# [1:10] skips the first entry, which is expected to be the liked anime itself (score 1.0)
for i in anime_recom_bgt_sorted[1:10]:
    print(df_anime.iloc[i[0]]['name'])

One Piece: Episode of Merry - Mou Hitori no Nakama no Monogatari
One Piece: Episode of Nami - Koukaishi no Namida to Nakama no Kizuna
One Piece Film: Strong World
One Piece Film: Z
One Piece Film: Gold
One Piece Film: Strong World Episode 0
One Piece: Episode of Luffy - Hand Island no Bouken
Dragon Ball Z
Dragon Ball Kai (2014)
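
The steps above can be wrapped in a reusable function. The sketch below makes two changes that are assumptions on my part, not part of the original notebook: it returns an empty list when the title is not found (instead of failing on `.index[0]`), and it excludes the liked title by index rather than by slicing off the first result, which matters when another title has identical genres. The catalogue and the names `recommend` and `split_genres` are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical miniature catalogue with a default 0..n-1 index
catalog = pd.DataFrame({
    'name': ['Show A', 'Show A Movie', 'Show B', 'Show C'],
    'genre': ['Action, Adventure, Comedy', 'Action, Adventure, Comedy',
              'Action, Adventure', 'Romance, School'],
})

def split_genres(text):
    """Split on ', ' so multi-word genres stay intact."""
    return text.split(', ')

vectorizer = CountVectorizer(tokenizer=split_genres)
similarity = cosine_similarity(vectorizer.fit_transform(catalog['genre']))

def recommend(name, threshold=0.7, top_n=10):
    """Names of titles whose genre similarity to `name` exceeds `threshold`."""
    matches = catalog.index[catalog['name'] == name]
    if len(matches) == 0:
        return []                      # unknown title: nothing to recommend
    liked = matches[0]
    scored = [(i, s) for i, s in enumerate(similarity[liked])
              if i != liked and s > threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [catalog.loc[i, 'name'] for i, _ in scored[:top_n]]

print(recommend('Show A'))
print(recommend('Not In Catalog'))
```

Scikit-learn may warn that `token_pattern` is unused when a custom tokenizer is given; the warning is harmless here.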


In [82]:
df_anime[df_anime['name'] == 'One Piece']

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 74 | 21 | One Piece | Action, Adventure, Comedy, Drama, Fantasy, Sho... | TV | Unknown | 8.58 | 504862 |


In [83]:
df_anime[df_anime['name'] == 'Dragon Ball Z']

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 206 | 813 | Dragon Ball Z | Action, Adventure, Comedy, Fantasy, Martial Ar... | TV | 291 | 8.32 | 375662 |


<hr>

## **In-Class Exercise**
#### 1. Create Recommender System based on 'type' feature
#### 2. Create Recommender System based on 'genre' & 'type' feature

## **Take Home Exercise**
#### 3. Create Recommender System based on 'rating' & 'type' feature
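
As a starting hint for the exercises that combine two features, one possible approach is to vectorize each column separately and stack the resulting matrices side by side. This is a sketch on made-up rows, not a reference solution. A numeric feature such as 'rating' would need its own treatment (for example scaling), which is not shown here.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy rows with the same column names as the anime dataset
toy = pd.DataFrame({
    'genre': ['Action, Comedy', 'Action, Comedy', 'Drama'],
    'type': ['TV', 'Movie', 'TV'],
})

# Vectorize each column on its own
genre_counts = CountVectorizer(tokenizer=lambda s: s.split(', ')).fit_transform(toy['genre'])
type_counts = CountVectorizer().fit_transform(toy['type'])

# Place the two sparse matrices side by side: 3 genre columns + 2 type columns
combined = hstack([genre_counts, type_counts]).tocsr()

sim = cosine_similarity(combined)
print(combined.shape)
print(sim.round(3))
```

Rows 0 and 1 share both genres but differ in type, so their similarity drops from 1.0 (genre only) to 2/3 once 'type' is included.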

<hr>

# **Reference**:
- Carlos Pinela, "Recommender Systems — User-Based and Item-Based Collaborative Filtering", https://medium.com/@cfpinela/recommender-systems-user-based-and-item-based-collaborative-filtering-5d5f375a127f
- Rakesh4real, "User-Based and Item-Based Collaborative Filtering — Part 5", https://medium.com/fnplus/user-based-and-item-based-collaborative-filtering-b73d9b2badba
- Muffaddal Qutbuddin, "Comprehensive Guide on Item Based Collaborative Filtering", https://towardsdatascience.com/comprehensive-guide-on-item-based-recommendation-systems-d67e40e2b75d
- Aishwarya.27, "Python | Implementation of Movie Recommender System", https://www.geeksforgeeks.org/python-implementation-of-movie-recommender-system/
- Shuyu Luo, "Introduction to Recommender System", https://towardsdatascience.com/intro-to-recommender-system-collaborative-filtering-64a238194a26
- Kevin Liao, "Prototyping a Recommender System Step by Step Part 1: KNN Item-Based Collaborative Filtering", https://towardsdatascience.com/prototyping-a-recommender-system-step-by-step-part-1-knn-item-based-collaborative-filtering-637969614ea
- Selva Prabhakaran, "Cosine Similarity – Understanding the math and how it works (with python codes)", https://www.machinelearningplus.com/nlp/cosine-similarity/