This notebook is the first step towards building my first ever recommendation system. The data fed into the system is an extended version of the original <a href="https://github.com/zygmuntz/goodbooks-10k">goodbooks-10k</a> dataset made by <a href="https://github.com/zygmuntz">@Zygmunt Zając</a>. The extended version, made by user <a href="https://github.com/malcolmosh">malcolmosh</a>, adds fields such as book descriptions and genres (more info about the additional fields <a href="https://github.com/malcolmosh/goodbooks-10k-extended/blob/master/README.md">here</a>). Big shoutout to them both!<br>

The main idea of the system is to return recommendations based on the user's query. That involves text preprocessing, creating text embeddings, and computing vector similarity scores, which the retrieval step then uses to pick the top-n recommendations.<br>
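
To make the planned pipeline a bit more concrete, here is a minimal sketch using only the standard library. It is not the final system: a bag-of-words count vector stands in for real text embeddings, and the function names (`tokenize`, `embed`, `cosine_similarity`, `recommend`) as well as the toy descriptions are assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for an embedding: a bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query, descriptions, top_n=2):
    """Return the top_n titles whose descriptions are most similar to the query."""
    query_vec = embed(query)
    scored = [
        (cosine_similarity(query_vec, embed(text)), title)
        for title, text in descriptions.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [title for score, title in scored[:top_n]]


# Made-up descriptions, for illustration only.
toy_descriptions = {
    "Book A": "a young wizard attends a school of magic",
    "Book B": "a detective solves a murder in london",
    "Book C": "a wizard and a dwarf go on a quest for treasure",
}
print(recommend("wizard magic school", toy_descriptions, top_n=2))
```

In the real system the `embed` step would be replaced by a proper embedding model, but the rest of the flow (embed the query, score every book, keep the top n) should stay roughly the same.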

But before all of that, let's start with exploratory data analysis (EDA).

In [30]:
import pandas as pd
from ast import literal_eval
from collections import Counter
import matplotlib.pyplot as plt

In [31]:
books_df = pd.read_csv('https://raw.githubusercontent.com/malcolmosh/goodbooks-10k/master/books_enriched.csv', index_col=[0], converters={"genres": literal_eval})

Let's remove the columns that won't be needed. Also, for the purpose of this exercise, we keep only books whose `language_code` is `eng`.

In [32]:
books_df = books_df.drop(['best_book_id', 'goodreads_book_id', 'isbn', 'isbn13', 'small_image_url', 'work_id', 'work_ratings_count', 'work_text_reviews_count'], axis=1)
books_df = books_df[books_df["language_code"] == "eng"]

In [33]:
books_df.isna().sum()

index                          0
authors                        0
average_rating                 0
book_id                        0
books_count                    0
description                   52
genres                         0
image_url                      0
language_code                  0
original_publication_year     20
original_title               562
pages                         69
publishDate                    8
ratings_1                      0
ratings_2                      0
ratings_3                      0
ratings_4                      0
ratings_5                      0
ratings_count                  0
title                          0
authors_2                      0
dtype: int64
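
Raw counts are easier to judge next to the size of the dataset, so here is a small sketch that turns them into percentages. It assumes a toy frame (`toy_df`, with made-up values) instead of the real `books_df`, so the numbers below are not from the dataset.

```python
import pandas as pd

# Toy frame standing in for books_df; the values are made up.
toy_df = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "description": ["text", None, "text", None],
        "pages": [100.0, 250.0, None, 80.0],
    }
)

# isna() marks missing cells; mean() over booleans gives the missing fraction.
missing_share = toy_df.isna().mean().mul(100).round(1)

# Keep only the columns that actually have gaps, largest share first.
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

The same three lines should work on `books_df` unchanged.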

Let's investigate the columns with missing values.

In [34]:
def print_n(data, n):
    """Return the first n rows; the notebook displays them as the cell output."""
    return data.head(n)

In [35]:
print_n(books_df["title"], 10)

0              The Hunger Games (The Hunger Games, #1)
1    Harry Potter and the Sorcerer's Stone (Harry P...
2                              Twilight (Twilight, #1)
3                                To Kill a Mockingbird
4                                     The Great Gatsby
5                               The Fault in Our Stars
6                                           The Hobbit
7                               The Catcher in the Rye
8                Angels & Demons  (Robert Langdon, #1)
9                                  Pride and Prejudice
Name: title, dtype: object

In [36]:
print_n(books_df["original_title"], 10)

0                            The Hunger Games
1    Harry Potter and the Philosopher's Stone
2                                    Twilight
3                       To Kill a Mockingbird
4                            The Great Gatsby
5                      The Fault in Our Stars
6          The Hobbit or There and Back Again
7                      The Catcher in the Rye
8                            Angels & Demons 
9                         Pride and Prejudice
Name: original_title, dtype: object

Given that the titles in the `original_title` column are friendlier to use, we'll have to get rid of the rows where they are missing.

Books without a description are of no use to us either, since the descriptions are what the recommendations will be based on.

In [37]:
print_n(books_df[books_df["description"].isna()], 5)

Unnamed: 0,index,authors,average_rating,book_id,books_count,description,genres,image_url,language_code,original_publication_year,...,pages,publishDate,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,ratings_count,title,authors_2
427,427,['Anonymous'],4.43,464,1449,,"[religion, classics, nonfiction, christian, hi...",https://images.gr-assets.com/books/1313518530m...,eng,1611.0,...,1590.0,01/15/08,10011,6802,11712,14338,128731,159457,Holy Bible: King James Version,['Anonymous']
911,911,['Jules Verne'],3.84,972,1363,,"[classics, science-fiction, fiction, fantasy]",https://s.gr-assets.com/assets/nophoto/book/11...,eng,1864.0,...,240.0,April 25th 2006,1238,6342,31442,42106,29972,89410,Journey to the Center of the Earth (Extraordin...,['Jules Verne']
1044,1044,"['Aesop', 'Laura Harris', 'Laura Gibbs']",4.05,1120,942,,"[classics, fiction, fantasy, philosophy]",https://s.gr-assets.com/assets/nophoto/book/11...,eng,-560.0,...,306.0,April 10th 2003,773,3717,22587,34885,37000,88508,Aesop's Fables,"['Aesop', 'Laura Harris', 'Laura Gibbs']"
1244,1244,"['Allen Ginsberg', 'William Carlos Williams']",4.14,1329,47,,"[poetry, classics, fiction]",https://images.gr-assets.com/books/1327870926m...,eng,1956.0,...,56.0,01/01/01,1544,3203,12223,24316,34043,71968,Howl and Other Poems,"['Allen Ginsberg', 'William Carlos Williams']"
1252,1252,"['Anonymous', 'Joseph Smith Jr.']",4.37,1338,343,,"[religion, nonfiction, spirituality, history, ...",https://images.gr-assets.com/books/1327389004m...,eng,1830.0,...,531.0,10/28/13,6989,2468,2246,1749,52708,63530,The Book of Mormon: Another Testament of Jesus...,"['Anonymous', 'Joseph Smith Jr.']"


We will examine possible time trends, so let's make sure that the `original_publication_year` column has the right type (integer). That requires dropping the rows where the year is missing first.

The remaining columns with missing values are not that important for our task, so we ignore them.

In [38]:
books_df = books_df[books_df["original_title"].notna() & books_df["description"].notna() & books_df["original_publication_year"].notna()]
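
The filter above could also be written with `dropna(subset=...)`. The sketch below checks, on a made-up toy frame (`toy_df` is an assumption, not the real data), that both forms keep the same rows.

```python
import pandas as pd

# Toy frame standing in for books_df; the values are made up.
toy_df = pd.DataFrame(
    {
        "original_title": ["A", None, "C", "D"],
        "description": ["text", "text", None, "text"],
        "original_publication_year": [1999.0, 2005.0, 2010.0, None],
        "pages": [None, 120.0, 300.0, 45.0],
    }
)

required = ["original_title", "description", "original_publication_year"]

# Boolean-mask version, mirroring the cell above.
by_mask = toy_df[
    toy_df["original_title"].notna()
    & toy_df["description"].notna()
    & toy_df["original_publication_year"].notna()
]

# dropna(subset=...) drops a row if any of the listed columns is missing;
# gaps in other columns (here: pages) are left alone.
by_dropna = toy_df.dropna(subset=required)

print(by_mask.equals(by_dropna))
```

Which one to use is mostly a matter of taste; `dropna` is shorter when the list of required columns grows.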

In [39]:
#books_df.isna().sum()

In [40]:
# Rows with a missing year were already dropped above, so a plain cast is enough.
books_df["original_publication_year"] = books_df["original_publication_year"].astype(int)
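
Why the year column is not an integer to begin with: a single missing value forces pandas to store the whole column as float. The sketch below shows this on a made-up series (`years` is an assumption), together with pandas' nullable `Int64` dtype as an alternative to dropping rows.

```python
import pandas as pd

# Made-up years; one is missing, as in the raw column.
years = pd.Series([2008.0, -560.0, None, 1956.0])
print(years.dtype)  # float64, because NaN cannot live in a plain int column

# Option 1 (what the notebook does): drop the gaps first, then cast to int.
as_int = years.dropna().astype(int)

# Option 2: pandas' nullable integer dtype keeps the missing entry as <NA>.
as_nullable = years.astype("Int64")

print(as_int.tolist())
print(as_nullable.isna().sum())
```

Here dropping is fine, because books without a publication year are not needed for the trend analysis anyway.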

In [45]:
#print_n(books_df, 5)
#books_df.to_excel("preprocessed_books_df.xlsx")

## Per author analysis

In [42]:
def clean_authors(x):
    """Convert a stringified list, e.g. "['A', 'B']", into the string 'A, B'."""
    authors = literal_eval(x)
    authors = [a.strip("[] '") for a in authors]
    return ', '.join(authors)

In [80]:
books_df['authors'] = books_df['authors'].apply(clean_authors)
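
A quick check of what the helper does to a single value, using one of the author strings seen earlier. Note that the column is overwritten in place, so running the cell a second time would pass already-cleaned names to `literal_eval`, which fails on them. `clean_authors_safe` below is a hypothetical variant (not part of the notebook) that tolerates this.

```python
from ast import literal_eval


def clean_authors(x):
    # Same logic as the notebook's helper, repeated so the example runs alone.
    authors = literal_eval(x)
    authors = [a.strip("[] '") for a in authors]
    return ", ".join(authors)


def clean_authors_safe(x):
    # Hypothetical variant: leave values alone if they are not a list literal,
    # so re-running the cell on already-cleaned strings does not raise.
    try:
        parsed = literal_eval(x)
    except (ValueError, SyntaxError):
        return x
    if not isinstance(parsed, list):
        return x
    return ", ".join(str(a).strip() for a in parsed)


raw = "['Aesop', 'Laura Harris', 'Laura Gibbs']"
cleaned = clean_authors(raw)
print(cleaned)
print(clean_authors_safe(cleaned))
```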