# Building your own recommender system

* This recommender system uses cosine similarity to give users up to five books that are most likely to belong to the **genre** they want

* To present the results in a dataframe, import `pandas`
* To use cosine similarity, import `cosine_similarity` from `sklearn.metrics.pairwise`
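
As a reminder of what cosine similarity measures, here is a minimal, dependency-free sketch. The helper name `cosine_sim` is made up for illustration and is not part of `sklearn`; it only assumes two non-zero vectors of equal length.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_sim([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_sim([1, 0], [0, 1])            # perpendicular vectors
```

Vectors that point in the same direction score close to 1 and perpendicular vectors score 0, which is why a high value is read as "similar".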

In [1]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

#### Read the data

In [2]:
PATH = 'google_books_1299.csv' # This is the file which contains all the data we need to use

# Defines a function 'get_data' that reads the data, lowercases the titles
# and drops all rows with missing values
def get_data(path_to_data):

    data = pd.read_csv(path_to_data, index_col=0) # reads the data from the CSV file into the dataframe 'data'
    # index_col=0 uses the first column of the file as the index ('title' stays a regular column)
    
    data["title"] = data["title"].str.lower() # converts every character of every entry in the 'title' column to lowercase
    data = data.dropna() # drops every row that contains a missing value
    data = data.reset_index(drop=True) # renumbers the index from 0 to len(data) - 1 after rows were dropped
    return data

In [None]:
# Applies the function 'get_data' on the csv file

In [3]:
data = get_data(PATH)
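
To see what the loading step does to a table, here is a small sketch that runs the same three operations on an in-memory CSV. The sample rows, the `rating` column and the helper name `load_books` are invented for illustration and do not come from `google_books_1299.csv`.

```python
import io
import pandas as pd

# Hypothetical sample: the second book has no genre, so dropna() removes it
SAMPLE_CSV = """,title,rating,generes
0,Dune,4.5,Fiction
1,Missing Genre,3.9,
2,Cooking Basics,4.1,Cooking
"""

def load_books(source):
    books = pd.read_csv(source, index_col=0)       # first column becomes the index
    books["title"] = books["title"].str.lower()    # lowercase the titles
    books = books.dropna()                         # drop rows with any missing value
    return books.reset_index(drop=True)            # renumber rows from 0

sample = load_books(io.StringIO(SAMPLE_CSV))
```

After loading, `sample` has two rows with a contiguous index, which matters later because the recommender looks books up by position.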

# Create a content-based recommender system using cosine similarity

In [4]:
# This recommender system uses the simple_preprocess function from gensim
# So we import it first

from gensim.utils import simple_preprocess
data['generes'] = data['generes'].astype('string') # 1. Converts every value in the column 'generes' to the string datatype
data['preprocessed_genre'] = data['generes'].apply(
    lambda genre: simple_preprocess(genre.replace('&', '').replace('amp', '').replace(',', '').replace('none', ''), min_len=3) if type(genre) is str else '')

# 2. Creates a new column 'preprocessed_genre' that holds the cleaned, tokenized values from 'generes'

# Removes ',', '&', 'amp' and 'none' from every entry, then tokenizes it with simple_preprocess
# min_len=3 means each token must be at least 3 characters long
# If an entry is not a string (for example, a missing value), it becomes ''
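
The cleaning step can be tried without installing gensim. The sketch below is a rough stand-in for `simple_preprocess`, not gensim's implementation: it assumes lowercasing, splitting on non-letter characters, and a length filter are enough to show the idea. The function name `rough_preprocess` and the sample strings are hypothetical.

```python
import re

def rough_preprocess(genre, min_len=3, max_len=15):
    # Same chain of replacements as in the notebook cell
    cleaned = genre.replace('&', '').replace('amp', '').replace(',', '').replace('none', '')
    # Approximate tokenization: lowercase, keep runs of letters only
    tokens = re.findall(r"[^\W\d]+", cleaned.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith('_')]

typical = rough_preprocess("Business &amp; Economics, Finance")
side_effect = rough_preprocess("Camping")
```

One thing to watch: `str.replace('amp', '')` removes the letters "amp" anywhere in the text, so a genre such as "Camping" would be reduced to "cing". Replacing the whole entity `'&amp;'` instead would avoid that, assuming that is how the ampersand appears in the data.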

* For the next step, we import `CountVectorizer` and `TfidfVectorizer`
* We also import `cosine_similarity` from `sklearn`, and `spatial` from `scipy`

In [5]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy import spatial

* Before calculating the cosine similarity, we load a `SentenceTransformer` model

* We choose to use the SentenceTransformer model `all-MiniLM-L6-v2`.
* This model maps sentences and paragraphs to a 384-dimensional dense vector space, and can be used for tasks like clustering or semantic search. <https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2>

In [6]:
# This framework provides an easy method to compute dense vector representations for sentences.
# The models are based on transformer networks like BERT.
# Text is embedded in a vector space such that similar texts are close together and can be found efficiently using cosine similarity.
from sentence_transformers import SentenceTransformer 
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') 

* The recommender contains 2 functions:
    * The first function keeps the books whose `cosine similarity` with the requested genre is above 0.7, and presents **up to 5 results**, ranked by the **highest** rating
    * The second function shows users the set of all the **genres** of books in the dataset, so that users can **type in** the genre they want
        * Then, the second function applies the first function
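
As the first function is written, the selection has two stages: books are filtered by a similarity threshold of 0.7, and the books that pass are sorted by the value stored for them (the rating, according to the code comments). A minimal sketch of that logic on plain tuples, with made-up book ids and scores and a hypothetical helper name `top_books`:

```python
def top_books(candidates, threshold=0.7, k=5):
    """candidates: iterable of (book_id, similarity, rating) tuples."""
    # Stage 1: keep only books that are similar enough to the requested genre
    qualified = [(book_id, rating) for book_id, sim, rating in candidates if sim > threshold]
    # Stage 2: rank the survivors by rating, highest first, and keep at most k
    qualified.sort(key=lambda pair: pair[1], reverse=True)
    return [book_id for book_id, _ in qualified[:k]]

picks = top_books([
    ("a", 0.90, 3.5),
    ("b", 0.75, 4.8),
    ("c", 0.40, 5.0),   # best rating, but not similar enough
    ("d", 0.71, 4.1),
])
```

Book "c" is dropped despite its rating because it fails the threshold, so similarity acts as a gate and the rating decides the order.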

In [7]:
# Function 1:
# This part is built to sort the books and recommend the books to users
# Users cannot see this part

def recommender(genres, data): # Creates a function called 'recommender' 
    book_dic = {} # Creates an empty dictionary
    
    # Encode the requested genre once, outside the loop, because it is the same for every row
    genres_embeddings = model.encode(genres) # maps the genre text to a 1-D embedding vector (this is not a UTF-8 encoding)
    
    # We want to apply this function row by row in the dataframe, so:
    
    for index, value in data.iterrows(): # .iterrows() iterates over the DataFrame rows as (index, Series) pairs
        value_embeddings = model.encode(', '.join([str(elem) for elem in value.iloc[13]])) # embeds the genre tokens stored in column 13 of the row
        cosine = 1 - spatial.distance.cosine(genres_embeddings, value_embeddings) # cosine similarity = 1 - cosine distance between 'genres_embeddings' and 'value_embeddings'
        if cosine > 0.7: # When the cosine similarity is large (> 0.7)
            # Store the book's rating (column 2 of the row) as a float, keyed by the row index.
            # Each index appears only once in the loop, so no duplicate check is needed.
            book_dic[index] = float(value.iloc[2])
    
    # Then, there can be three conditions when users type the genre they want into this recommender:
    
    # 1st condition: when at least 5 books qualify for the search
    if book_dic and len(book_dic) >= 5:
        book_dic = dict(sorted(book_dic.items(), key=lambda x: x[1], reverse=True)) # sorts the qualifying books from the highest to the lowest stored value (the rating)
        book_indexs = list(book_dic.keys())[:5] # picks the top 5 results (the highest-rated books among those above the similarity threshold)
        
        # Use a small for loop to create a list of dictionaries which contains the 'ISBN's and 'title's of the recommended books
        # (DataFrame.append was removed in pandas 2.0, so the rows are collected first and the dataframe is built once)
        rows = []
        for book_index in book_indexs:
            book_ISBN = data['ISBN'].iloc[book_index] # looks up the 'ISBN' by the book index
            book_title = data['title'].iloc[book_index] # looks up the 'title' by the book index
            rows.append({'ISBN' : book_ISBN , 'title' : book_title}) # appends each recommended book to the list
        recommendation = pd.DataFrame(rows, columns=['ISBN','title']) # presents the results by showing their 'ISBN's and 'title's in a dataframe
        return recommendation
    
    # 2nd condition: when fewer than 5 books qualify for the search
    elif book_dic:
        # Gives an explanation to the users
        # The rest of the code is the same as in the 1st condition, except that all qualifying books are returned
        print("We can only give you fewer than 5 recommendations") 
        book_dic = dict(sorted(book_dic.items(), key=lambda x: x[1], reverse=True))
        book_indexs = list(book_dic.keys())
        rows = []
        for book_index in book_indexs:
            book_ISBN = data['ISBN'].iloc[book_index]
            book_title = data['title'].iloc[book_index]
            rows.append({'ISBN' : book_ISBN , 'title' : book_title})
        recommendation = pd.DataFrame(rows, columns=['ISBN','title'])
        return recommendation
        
    # 3rd condition: when there are no results
    else:
        # Gives an explanation to the users
        recommendation = "Sorry, we do not have any recommendation for you."
        return recommendation

# Function 2:
# This part is built to present users with all the genres
# So that users can select which genre they want

# This is the function that users actually interact with

def Content_based_recommender(data): # Creates another function called 'Content_based_recommender' 

    data = data[data['preprocessed_genre'].notna()] # drops the rows with missing values in the column 'preprocessed_genre'
    genres_set = set() # collects every unique token from the column 'preprocessed_genre' into the set 'genres_set'
    for i in data['preprocessed_genre']: # iterates over every entry (a list of tokens) in this column
        for genre in i:
            genres_set.add(genre) # every unique token is treated as a genre and added to the set
    
    # Now the users are shown the set which contains all the genres in the dataframe
    # They are asked to type in the genre they want
    print(f"What type of genre do you like? \n\nYou can choose from the following:\n\n{genres_set}")
    genres = input().lower() # Lowercase the user's input
    
    # Applies the previous function 'recommender' to find the recommended books from the dataframe
    # The result is stored as 'recommendations'
    recommendations = recommender(genres, data)

    # Shows the recommendations
    return recommendations
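
Function 1 calls `model.encode` once per row. `SentenceTransformer.encode` also accepts a list of sentences and returns one embedding per sentence, so the similarities could instead be computed in one batch. Below is a minimal sketch of that batch step with NumPy. It uses made-up 2-dimensional vectors in place of real 384-dimensional embeddings, the helper name `batch_cosine` is hypothetical, and it assumes no vector is all zeros.

```python
import numpy as np

def batch_cosine(query, matrix):
    """Cosine similarity between one query vector and every row of a matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Product of the row norms and the query norm gives the denominators
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return matrix @ query / norms

# Stand-ins for the genre embedding and three book embeddings
query_vec = [1.0, 0.0]
book_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

sims = batch_cosine(query_vec, book_vecs)
above_threshold = np.flatnonzero(sims > 0.7)  # row positions that pass the 0.7 cut
```

With real embeddings, `book_vecs` would be the array returned by encoding all genre strings in a single call, and `above_threshold` would replace the per-row `if cosine > 0.7` check.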

In [None]:
# Now we can run our recommender!

In [None]:
Content_based_recommender(data)