To open this notebook in Google Colab and start coding, click on the Colab icon below.

<table style="border:2px solid orange" align="left">
    <td style="border:2px solid orange">
        <a target="_blank" href="https://colab.research.google.com/github/neuefische/ds-meetups/blob/main/03_Recommender_Introduction/01_content_based.ipynb">
        <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
    </td>
</table>

# Content-based recommender system

Recommender systems are algorithms that suggest items of interest to a user. If you have ever used the internet, you have almost certainly encountered one: Netflix, Amazon and Spotify all use recommender systems to improve user satisfaction and to influence buying behavior. Based on your previous activity, the system suggests the next video, product or song. The most common types are content-based models, collaborative filtering, and hybrid models that combine the two.

Collecting enough user data to build a proper collaborative model is usually an obstacle at the beginning, as little is known about the users' behavior. This is referred to as the cold start problem in recommender systems. In this notebook you will learn how to build a content-based recommender system. A content-based model has the advantage that no user data is required, so the model can be scaled to a large number of users independently of their preferences, which sidesteps the cold start problem. In a content-based recommender system, filtering is based on the item features only.
<br/><br/>
<p align="center">
    <img src="images/book_text.png" alt="drawing" width="400"/>
</p>
<br/><br/>
An example: if you have a book with the features "aquarium" and "fish", you would get recommendations for books with the same features, that is, books about aquarium fish, without using any ratings or user data.


Let's start with building your own content-based recommender system!

Import packages:

In [None]:
#requirements
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity

## Cosine similarity 

To get meaningful recommendations, we need to measure how similar two items are. Cosine similarity is a very popular metric for this in content-based models. Mathematically, it is the dot product of the normalized vectors, which equals the cosine of the angle between two vectors in an n-dimensional space (n = number of features). In general the cosine ranges from -1 to 1, but for feature vectors with non-negative entries, such as the ones used in this notebook, the values lie between 0 and 1. A value of 0 indicates no similarity, whereas 1 means that both vectors point in the same direction, which for binary features means the items have identical features.

In a two-dimensional space, if the angle between two data points (seen as vectors from the origin) is 45°, the cosine is about 0.71, while it is 0 for an angle of 90°. In a data frame each feature represents a dimension. This means every observation (row) can be plotted in an n-dimensional space based on its feature values. Let's have a look at two examples:

### Example 1: 2-dimensional space 

Let's assume we have two fish with different colors. One fish is solid blue, the other is solid orange. Plotted in a 2-dimensional space, where one axis represents the color blue and the other the color orange, the blue fish has the coordinates (1,0) and the orange fish (0,1). A fish with both colors lies in between, at (1,1). To calculate the cosine similarity score between two fish, we simply compute the cosine of the angle between their coordinate vectors.

<br/><br/>
<p align="center">
    <img src="images/fish_text.png" alt="drawing" width="300"/>
</p>
<br/><br/>
The vectors of the blue-orange fish and the solid orange fish enclose a 45° angle and therefore have a cosine similarity of about 0.71. The same holds when we compare the solid blue fish and the blue-orange fish. The solid blue and the solid orange fish enclose an angle of 90° and hence have a cosine of 0, so no similarity at all with respect to the feature color.
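
The angles above can be checked numerically. The following is a minimal sketch using NumPy; the helper name `cosine` and the variable names are chosen here for illustration and do not appear elsewhere in the notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

blue = [1, 0]         # solid blue fish
orange = [0, 1]       # solid orange fish
blue_orange = [1, 1]  # fish with both colors

print(round(cosine(blue, blue_orange), 2))    # 45° angle
print(round(cosine(orange, blue_orange), 2))  # 45° angle
print(round(cosine(blue, orange), 2))         # 90° angle
```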

### Example 2: 4-dimensional space

Let's create an example data frame:

In [None]:
#create example data frame
d = {'green': [1, 1, 0, 1, 1], 'blue': [1, 0, 0, 1, 1], 'yellow': [0, 0, 1, 1, 0], 'black': [1, 0, 0, 1, 1]}
df_example2 = pd.DataFrame(data = d, index= ["fish1", "fish2", "fish3", "fish4", "fish5"])
df_example2

For this example let's think of 5 different fish. Each fish has a combination of the colors green, blue, yellow and black. Those colors are the features of the fish and at the same time the dimensions of the data. Each feature is represented by a value of 0 or 1, where 0 means the fish doesn't have this color and 1 means it does. In our example fish1 has the colors green, blue and black. If we assign the colors to dimensions in the order (green, blue, yellow, black), we get the coordinates (1,1,0,1). The following equation allows us to calculate the cosine similarity for multi-dimensional data:

$$
  CosSim(x,y) = \frac{\sum_{i}x_{i}y_{i}}{\sqrt{\sum_{i}x_{i}^2}\sqrt{\sum_{i}y_{i}^2}}
$$

If we calculate the cosine similarity on our data frame, we get:

In [None]:
#calculate cosine similarity
cosine_sim_sk = pd.DataFrame(cosine_similarity(df_example2).round(2), columns=["fish1", "fish2", "fish3", "fish4", "fish5"], index= ["fish1", "fish2", "fish3", "fish4", "fish5"])
cosine_sim_sk

How to interpret this?

You can see that fish1 is of course totally similar to itself, indicated by a value of 1. fish1 is less similar to fish2 (only one shared color) than to fish4 (three shared colors), has no similarity to fish3, and is totally similar to fish5, which has exactly the same colors.
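
To see where these numbers come from, the formula can be applied by hand. The sketch below implements the equation directly with NumPy; the function name `cos_sim` and the dictionary `fish` are illustrative names, and the vectors are copied from the example data frame in the order (green, blue, yellow, black).

```python
import numpy as np

def cos_sim(x, y):
    """Cosine similarity: sum(x_i * y_i) / (sqrt(sum(x_i^2)) * sqrt(sum(y_i^2)))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    numerator = np.sum(x * y)
    denominator = np.sqrt(np.sum(x ** 2)) * np.sqrt(np.sum(y ** 2))
    return float(numerator / denominator)

# Color vectors in the order (green, blue, yellow, black)
fish = {
    "fish1": [1, 1, 0, 1],
    "fish2": [1, 0, 0, 0],
    "fish3": [0, 0, 1, 0],
    "fish4": [1, 1, 1, 1],
    "fish5": [1, 1, 0, 1],
}

# Compare fish1 with every other fish
for other in ["fish2", "fish3", "fish4", "fish5"]:
    print(other, round(cos_sim(fish["fish1"], fish[other]), 2))
```

For fish1 and fish4, for instance, the numerator is 3 (three shared colors) and the denominator is √3 · √4, which gives roughly 0.87.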

## Now, let's start to build our first recommender system :)

Reading in the data:

In [None]:
#read data
df = pd.read_csv("https://raw.githubusercontent.com/neuefische/ds-meetups/main/03_Recommender_Introduction/fish_1.csv")
df = df.reset_index(drop=True)  # reset_index returns a new data frame, so assign the result
df.head()

To calculate the cosine similarity of the fish in our data frame, we need to transform the features into binary vectors. To do this we use the get_dummies function from pandas.

In [None]:
# create dummy vectors
df_vectorized = pd.get_dummies(df.iloc[:, 1:])
df_vectorized.head()

You can see that this increases the number of features from 24 to 114.
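
The growth in the number of columns is easier to follow on a tiny example. The data frame below is hypothetical (its columns and values are made up for illustration and are not taken from `fish_1.csv`); it only shows that `pd.get_dummies` creates one binary column per distinct value of each categorical column.

```python
import pandas as pd

# Hypothetical toy data; these columns are not taken from fish_1.csv
toy = pd.DataFrame({
    "color": ["blue", "orange", "blue"],
    "temperament": ["peaceful", "peaceful", "aggressive"],
})

# One new column per distinct value: 2 colors + 2 temperaments = 4 columns
toy_vectorized = pd.get_dummies(toy)
print(toy_vectorized.columns.tolist())
print(toy_vectorized.shape)
```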

Let's calculate the cosine similarity:

In [None]:
#calculate cosine similarity
cosine_sim_sk = pd.DataFrame(cosine_similarity(df_vectorized))

And finally the function to get recommendations!

Now that we have the cosine similarities, we can simply sort them with respect to a specific item. The function below takes the name of a fish, looks up its similarity scores to all other fish, sorts them and returns the 10 most similar fish. Just pass in the name of a fish from the data frame and you will get the 10 most similar fish recommended.

In [None]:
# Build index with fish names
names = df['name']
indices = pd.Series(df.index, index=df['name'])

# Function that gets fish recommendations based on the cosine similarity
def fish_recommendations(name):
    # get the index of the fish we put into the function
    idx = indices[name]
    # collect all cosine similarities to that fish as (index, score) pairs
    sim_scores = list(enumerate(cosine_sim_sk[idx]))
    # sort the list starting with the highest similarity
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # keep positions 1 to 10 (position 0 is expected to be the fish itself)
    sim_scores = sim_scores[1:11]
    # get the indices of those 10 fish
    fish_indices = [i[0] for i in sim_scores]
    # return the names of those 10 fish
    return names.iloc[fish_indices]

Now, use this function to get recommendations :)

In [None]:
#get recommendations
fish_recommendations('Blue Neon - Paracheirodon simulans')
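
Skipping position 0 assumes that the queried fish always sorts first. If another fish has exactly the same features, both score 1 and the queried fish can end up among its own recommendations. A possible variant, sketched here on hypothetical toy data (`toy_names`, `toy_features` and `recommend` are illustrative names, not part of the notebook), removes the queried item by its index before ranking:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for the vectorized fish features;
# items A and B have identical features
toy_names = pd.Series(["A", "B", "C", "D"])
toy_features = pd.DataFrame([[1, 1, 0], [1, 1, 0], [1, 0, 0], [0, 0, 1]])

toy_sim = pd.DataFrame(cosine_similarity(toy_features))

def recommend(name, top_n=2):
    # index of the queried item
    idx = toy_names[toy_names == name].index[0]
    # drop the queried item by index, then keep the top_n highest scores
    scores = toy_sim[idx].drop(idx).nlargest(top_n)
    return toy_names.loc[scores.index].tolist()

print(recommend("A"))
print(recommend("B"))
```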

## Text related content-based recommender

When you think of online shops or movie databases, many items have descriptions in text form. Content-based recommender systems are often built on these descriptions, as they contain useful information about the item. Below you will find an example of how to build a text-based recommender system.

First we have to create a text block we can work with. Let's join our features into one text per fish.

In [None]:
#set column types to string for whole data frame
df = df.astype(str)
columns = df.columns
#merge features to text block
df['text'] = df[columns].agg(' '.join, axis=1)


Let's have a look at the text.

In [None]:
#check first element
df.text[0]

## Text vectorization


Different tools are available to vectorize texts. Depending on your problem, you can use NLP methods like CountVectorizer, TfidfVectorizer or word2vec. In this notebook we will use the TfidfVectorizer. Documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). TF-IDF, or Term Frequency-Inverse Document Frequency, is a statistical measure of how significant a word is for a document within a collection of documents. Every word in a document gets a TF-IDF value, which is the product of the term frequency (TF) and the inverse document frequency (IDF).<br/><br/>

Let's have a look at an example with 3 sentences where each sentence represents a document:

1. Recommender are cool
2. Recommender are helpful
3. Recommender use your data

In these 3 sentences the term "recommender" occurs 3 times, the term "are" 2 times and "use" 1 time.<br/><br/>


Let's calculate the TF (term-frequency) for those sentences.

1. [0.33, 0.33, 0.33] 
2. [0.33, 0.33, 0.33] 
3. [0.25, 0.25, 0.25, 0.25]

Sentences 1 and 2 each contain 3 unique words, so the frequency of each word is 1/3. Sentence 3 consists of 4 unique words with a corresponding frequency of 1/4.<br/><br/>
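
The term frequencies can be reproduced in a few lines of plain Python. This is a minimal sketch that splits on whitespace and lowercases the words; the function name `term_frequencies` is chosen here for illustration.

```python
from collections import Counter

documents = [
    "Recommender are cool",
    "Recommender are helpful",
    "Recommender use your data",
]

def term_frequencies(document):
    """Relative frequency of every word in a single document."""
    words = document.lower().split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}

for doc in documents:
    print({word: round(tf, 2) for word, tf in term_frequencies(doc).items()})
```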

Now, let's have a look at the IDF (inverse document frequency).<br/><br/>

The IDF is the logarithm of the ratio of the total number of documents to the number of documents containing the term. Using the base-10 logarithm:

* The word "recommender" appears in all 3 documents and therefore has an IDF of log(3/3) = 0
* The word "are" appears in two documents, with a corresponding IDF of log(3/2) = 0.176
* The words "cool", "helpful", etc. appear in only one document, leading to an IDF of log(3/1) = 0.477<br/><br/>

The IDF is a measure of how much information a word provides and, as you can see, the rarer the word, the higher its IDF. This means rare words are more meaningful for capturing the context of a document.<br/><br/>
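
The IDF values listed above can be checked with the standard library. The sketch below uses the base-10 logarithm to match the numbers in the text; `idf` and `tokenized` are illustrative names.

```python
import math

documents = [
    "Recommender are cool",
    "Recommender are helpful",
    "Recommender use your data",
]

# One set of lowercase words per document
tokenized = [set(doc.lower().split()) for doc in documents]

def idf(term):
    """log10(total number of documents / number of documents containing the term)."""
    n_containing = sum(term in words for words in tokenized)
    return math.log10(len(tokenized) / n_containing)

for term in ["recommender", "are", "cool"]:
    print(term, round(idf(term), 3))
```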


The TF-IDF of a word is then the product of its TF and IDF values. The higher the score, the more relevant the word is for that particular document and, with regard to recommender systems, the more useful it is for distinguishing items.
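
Putting both parts together for the three example sentences gives the scores below. This is a hand-rolled sketch of the textbook definition (the function name `tf_idf` is illustrative). Note that scikit-learn's `TfidfVectorizer` by default uses raw counts as term frequency, a smoothed natural-logarithm IDF and L2 normalization, so its numbers differ from this calculation even though the idea is the same.

```python
import math
from collections import Counter

documents = [
    "Recommender are cool",
    "Recommender are helpful",
    "Recommender use your data",
]
tokenized = [doc.lower().split() for doc in documents]

def tf_idf(doc_index):
    """TF-IDF score of every word in one document (textbook definition, log base 10)."""
    words = tokenized[doc_index]
    scores = {}
    for word, count in Counter(words).items():
        tf = count / len(words)
        n_containing = sum(word in other for other in tokenized)
        idf = math.log10(len(tokenized) / n_containing)
        scores[word] = tf * idf
    return scores

print({word: round(score, 3) for word, score in tf_idf(0).items()})
```

In the first sentence "recommender" scores 0 because it occurs in every document, while "cool" gets the highest score because it occurs nowhere else.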





Now that we've learned how TF-IDF works, let's vectorize our data frame:

In [None]:
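# vectorize text column
# min_df=1 (the default) keeps every term; recent scikit-learn versions reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1)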
tfidf_matrix = tf.fit_transform(df['text'])

Now that we have a vectorized matrix of our data, we can calculate the cosine similarity on it just as above.

In [None]:
# calculate cosine similarity
# linear_kernel computes dot products; TfidfVectorizer L2-normalizes each row by default,
# so the result equals the cosine similarity and is faster to compute on large data sets
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
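
That `linear_kernel` and `cosine_similarity` agree on TF-IDF output can be verified on a small example. The sketch below uses the three example sentences from above as stand-in texts rather than the fish data and relies on the default `norm='l2'` setting of `TfidfVectorizer`; `texts` and `toy_matrix` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

texts = [
    "Recommender are cool",
    "Recommender are helpful",
    "Recommender use your data",
]

# Rows of the TF-IDF matrix have unit length because norm='l2' is the default
toy_matrix = TfidfVectorizer().fit_transform(texts)
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1)).ravel())

# With unit-length rows the plain dot product is already the cosine similarity
same = np.allclose(linear_kernel(toy_matrix, toy_matrix), cosine_similarity(toy_matrix))
print(row_norms.round(6), same)
```

If the vectorizer were created with `norm=None`, the two functions would generally give different results, and `cosine_similarity` would be the one to use.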

And use the cosine similarity matrix to get our top 10 recommendations.

In [None]:
# Build index with fish names
names = df['name']
indices = pd.Series(df.index, index=df['name'])

# Function that gets fish recommendations based on the cosine similarity
def fish_recommendations(name):
    # get the index of the fish we put into the function
    idx = indices[name]
    # collect all cosine similarities to that fish as (index, score) pairs
    sim_scores = list(enumerate(cosine_sim[idx]))
    # sort the list starting with the highest similarity
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # keep positions 1 to 10 (position 0 is expected to be the fish itself)
    sim_scores = sim_scores[1:11]
    # get the indices of those 10 fish
    fish_indices = [i[0] for i in sim_scores]
    # return the names of those 10 fish
    return names.iloc[fish_indices]

In [None]:
#get recommendations
fish_recommendations('Blue Neon - Paracheirodon simulans')

Input:

Blue neon             |  
:-------------------------:|
<img src="https://blog.tetra.net/de-de/wp-content/uploads/2019/05/Platinum-Green-Neon-Fish-Paracheirodon-simulans_shutterstock_630636488_bearb-1200x800.jpg" alt="drawing" width="100"/> |

Pictures of Top 5 recommendations:

Red Neon            |  Red Neon XL           |  Neon Tetra          |  Red-gold Neon         |  Dwarf tetra          
:-------------------------:|:-------------------------: |:-------------------------: |:-------------------------: |:-------------------------:
<img src="https://www.garnelio.de/media/image/a3/36/17/Paracheirodon-axelrodi-roter-neon2VaNjbtTw3umpb.jpg" alt="drawing" width="100"/>  |  <img src="https://www.garnelio.de/media/image/a3/36/17/Paracheirodon-axelrodi-roter-neon2VaNjbtTw3umpb.jpg" alt="drawing" width="100"/>   |  <img src="https://www.aquarium-ratgeber.com/wp-content/uploads/2021/08/neonsalmler-paracheirodon-innesi.jpg" alt="drawing" width="100"/>  |  <img src="https://cdn02.plentymarkets.com/idwditcg5ajj/item/images/85046/full/Paracheirodon-axelrodi-GOLD.jpg" alt="drawing" width="100"/>  |  <img src="https://www.zierfische-kotterba.de/wp-content/uploads/nanostomus-marginatus.jpg" alt="drawing" width="100"/> 

Inspecting the outcome, you can see that the fish most similar to the blue neon is the red neon, a close relative. If you want, you can now play around with different fish names or use this notebook as a template for your own recommender :)