# Imports

In [50]:
import pandas as pd
import numpy as np

---

# Setting up the dataframes

We are using two datasets:
- __main_books_df__: a small dataset providing extra data for content-based filtering, e.g. descriptions and genres;
- __book_list_large_df__: a larger one providing a link to a ratings dataset used for the collaborative filtering part, plus images for the GUI.

Methodology: We keep only the books that appear in both datasets, matched by ISBN. This way, we can later restrict the large reviews dataset to reviews of the books we have already filtered here.

In [51]:
#Main dataset with descriptions
main_books_df = pd.read_csv("source_datasets/abdallah_books_with_descriptions.csv")

#Large book dataset (over 270k entries)
book_list_large_df = pd.read_csv("source_datasets/book_dataset_large/books_large.csv",low_memory=False)

In [52]:
book_list_large_df.head(3)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0060973129.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0060973129.01.LZZZZZZZ.jpg


In [53]:
main_books_df.head(3)

Unnamed: 0,isbn13,isbn10,title,subtitle,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count
0,9780002005883,2005883,Gilead,,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCPgAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api,"A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers...",2004.0,3.85,247.0,361.0
1,9780002261982,2261987,Spider's Web,A Novel,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GPgAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api,"A new 'Christie for Christmas' -- a full-length novel adapted from her acclaimed play by Charles Osborne Following BLACK COFFEE and THE UNEXPECTED GUEST comes the final Agatha Christie play novelisation, bringing her superb storytelling to a new legio...",2000.0,3.83,241.0,5164.0
2,9780006163831,6163831,The One Tree,,Stephen R. Donaldson,American fiction,http://books.google.com/books/content?id=OmQawwEACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api,Volume Two of Stephen Donaldson's acclaimed second trilogy featuing the compelling anti-hero Thomas Covenant.,1982.0,3.97,479.0,172.0


# Setting up the book dataset

First, we keep the entries of the larger dataset whose ISBN also appears in the smaller one.

In [54]:
same_title_books_df = book_list_large_df[book_list_large_df['ISBN'].isin(main_books_df['isbn10'])]

In [55]:
same_title_books_df.shape

(2447, 8)
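
The ISBN match above relies on both key columns holding values of the same type and format. As a minimal sketch on toy data (not the real files), the snippet below shows how `isin` can silently miss matches when one column holds zero-padded strings and the other integers, and how normalizing both sides avoids that. The helper `normalize_isbn10` and the toy frames are hypothetical, and whether the real CSVs are affected is an assumption to check, not a finding.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are illustrative only)
large = pd.DataFrame({"ISBN": ["0195153448", "0002005018", "0060973129"]})
small = pd.DataFrame({"isbn10": [2005018, 60973129, 1234567890]})


def normalize_isbn10(series):
    """Hypothetical helper: cast to string, trim, and left-pad to 10 characters."""
    return series.astype(str).str.strip().str.zfill(10)


# Naive comparison: strings vs. integers never compare equal
naive_matches = int(large["ISBN"].isin(small["isbn10"]).sum())

# Normalized comparison: both sides are 10-character strings
large_keys = normalize_isbn10(large["ISBN"])
small_keys = normalize_isbn10(small["isbn10"])
normalized_matches = int(large_keys.isin(small_keys).sum())

print("naive:", naive_matches, "normalized:", normalized_matches)
```

Passing `dtype={"ISBN": str}` (and the equivalent for `isbn10`) to `pd.read_csv` is another way to keep leading zeros from the start.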

Merge the additional columns (`categories` and `description`) from main_books_df and drop the duplicate key column created by the merge.

In [56]:
# Merge additional columns 
same_title_books_df = same_title_books_df.merge(main_books_df[['isbn10', 'categories', 'description']], 
                                                how='left', left_on='ISBN', right_on='isbn10')

# Drop the duplicate
same_title_books_df.drop(columns=['isbn10'], inplace=True)

A new merged dataset is formed

In [57]:
same_title_books_df.head(1)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,categories,description
0,440234743,The Testament,John Grisham,1999,Dell,http://images.amazon.com/images/P/0440234743.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0440234743.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0440234743.01.LZZZZZZZ.jpg,Fiction,"A suicidal billionaire, a burnt-out Washington litigator, and a woman who has forsaken technology to work in the wilds of Brazil are all brought together by an astounding mystery of the testament"
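
A left merge can add rows when the right-hand key is not unique, and can leave rows without metadata when a key is missing. The sketch below, on toy data, uses the `validate` and `indicator` parameters of `DataFrame.merge` to surface both cases. The toy frames and their values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the filtered large dataset and the metadata dataset
left = pd.DataFrame({"ISBN": ["A", "B", "C"],
                     "Book-Title": ["t1", "t2", "t3"]})
right = pd.DataFrame({"isbn10": ["A", "B", "D"],
                      "categories": ["Fiction", "Poetry", "Drama"]})

merged = left.merge(
    right,
    how="left",
    left_on="ISBN",
    right_on="isbn10",
    validate="many_to_one",  # raises MergeError if isbn10 is duplicated on the right
    indicator=True,          # adds a "_merge" column describing each row's origin
)

# Rows of the left frame that found no metadata on the right
unmatched = int((merged["_merge"] == "left_only").sum())

# Drop the helper columns once the checks are done
merged = merged.drop(columns=["isbn10", "_merge"])
print(merged.shape, "unmatched:", unmatched)
```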


The last step is to reorder the columns.

In [58]:
# Reordering columns
columns_order = ['ISBN', 'Book-Title', 'Book-Author', 'Publisher', 'Year-Of-Publication', 
                  'categories', 'description', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L']

same_title_books_df = same_title_books_df[columns_order]

print("Shape: " + str(same_title_books_df.shape))

Shape: (2447, 10)


### The base merged dataset we will be using:

Merging the datasets lets us use metadata such as categories and descriptions while keeping the ISBN link to the ratings dataset that comes with the larger books dataset.

In [59]:
same_title_books_df.head(3)

Unnamed: 0,ISBN,Book-Title,Book-Author,Publisher,Year-Of-Publication,categories,description,Image-URL-S,Image-URL-M,Image-URL-L
0,440234743,The Testament,John Grisham,Dell,1999,Fiction,"A suicidal billionaire, a burnt-out Washington litigator, and a woman who has forsaken technology to work in the wilds of Brazil are all brought together by an astounding mystery of the testament",http://images.amazon.com/images/P/0440234743.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0440234743.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0440234743.01.LZZZZZZZ.jpg
1,553582909,Icebound,Dean R. Koontz,Bantam Books,2000,Fiction,"A secret Arctic experiment turns into a frozen nightmare when a team of scientists, stranded on a drifting iceberg with a massive explosive charge, battles the elements for survival, only to discover that one of them is a murderer. Reissue.",http://images.amazon.com/images/P/0553582909.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0553582909.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0553582909.01.LZZZZZZZ.jpg
2,842342702,Left Behind: A Novel of the Earth's Last Days (Left Behind #1),Tim Lahaye,Tyndale House Publishers,2000,Fiction,"The first book in the author's successful ""last days"" series follows a 747 pilot as he tries to recover from the effects of ""The Rapture."" Reprint.",http://images.amazon.com/images/P/0842342702.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0842342702.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0842342702.01.LZZZZZZZ.jpg


In [60]:
# same_title_books_df.to_csv("./filtered_datasets/books_merge_1.csv", index=False)
# same_title_books_df = pd.read_csv("./filtered_datasets/books_merge_1.csv")

---

## Error corrections in the datasets

#### 1. Some books in the initial, smaller dataset are exactly the same __BUT__ they are different editions

In [61]:
# Select the rows (not columns) whose title contains the search text
matching_rows = same_title_books_df[same_title_books_df['Book-Title'].str.contains('The Two Towers')]
matching_rows[["Book-Title","Image-URL-M"]].head(2)

Unnamed: 0,Book-Title,Image-URL-M
28,"The Two Towers (The Lord of the Rings, Part 2)",http://images.amazon.com/images/P/0345339711.01.MZZZZZZZ.jpg
430,"The Two Towers (The Lord of the Rings, Part 2)",http://images.amazon.com/images/P/0618002235.01.MZZZZZZZ.jpg


Notice: they have the same title but different cover images, since each edition has its own ISBN.

__Solution:__ We chose to remove the duplicates, as they represent the same book and tell the same story.

In [62]:
same_title_books_df_unique = same_title_books_df.drop_duplicates(subset=['Book-Title'])
same_title_books_df_unique.to_csv("./filtered_datasets/books_merge_1.csv", index=False)
print("old :" + str(same_title_books_df.shape))
print("new :" + str(same_title_books_df_unique.shape))

old :(2447, 10)
new :(2426, 10)


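Observation: the dataset took a negligible hit, with only 21 of 2447 entries dropped. Note that the deduplicated result is stored in `same_title_books_df_unique`, while the cells that follow keep operating on `same_title_books_df`.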
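
Dropping duplicates on `Book-Title` alone treats any two rows with an identical title string as the same book, and misses titles that differ only in case or spacing. A possible refinement, sketched here on toy data, compares a normalized title together with the author. The column names match the notebook, but the toy rows and the normalization rules are assumptions.

```python
import pandas as pd

books = pd.DataFrame({
    "Book-Title": ["Same Story", "same story ", "Shared Title", "Shared Title"],
    "Book-Author": ["Author A", "Author A", "Author B", "Author C"],
})

# Build normalized comparison keys without modifying the displayed columns
keys = pd.DataFrame({
    "title_key": books["Book-Title"].str.strip().str.lower(),
    "author_key": books["Book-Author"].str.strip().str.lower(),
})

# Keep the first row of each (title, author) pair
deduped = books[~keys.duplicated(subset=["title_key", "author_key"])]

# For comparison: the exact-title approach used in the notebook
title_only = books.drop_duplicates(subset=["Book-Title"])

print(list(deduped.index), list(title_only.index))
```

In this toy frame the title-only approach keeps both spellings of "Same Story" and drops one of the two different books called "Shared Title", while the normalized pair does the opposite.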

#### 2. Some entries have really short or no descriptions

In [63]:
# Boolean mask: True where the description is missing
no_description_count = same_title_books_df['description'].isna()

print("Number of entries with no description:", no_description_count.sum())

Number of entries with no description: 78


We can see that a total of 78 entries have no description.

In [64]:
same_title_books_df[no_description_count].head(5)

Unnamed: 0,ISBN,Book-Title,Book-Author,Publisher,Year-Of-Publication,categories,description,Image-URL-S,Image-URL-M,Image-URL-L
137,380699176,"100 Great Fantasy Short, Short Stories",Isaac Asimov,Harper Mass Market Paperbacks (Mm),1987,Fiction,,http://images.amazon.com/images/P/0380699176.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0380699176.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0380699176.01.LZZZZZZZ.jpg
212,806509023,Existentialism and Human Emotions (A Philosophical Library Book),Jean-Paul Sartre,Citadel Trade,1984,Philosophy,,http://images.amazon.com/images/P/0806509023.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0806509023.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0806509023.01.LZZZZZZZ.jpg
247,345384563,"A History of God: The 4,000-Year Quest of Judaism, Christianity and Islam",Karen Armstrong,Ballantine Books,1994,God (Christianity),,http://images.amazon.com/images/P/0345384563.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0345384563.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0345384563.01.LZZZZZZZ.jpg
280,553212419,Sherlock Holmes : The Complete Novels and Stories (Bantam Classic) Volume I,"Arthur Conan, Sir Doyle",Bantam,1986,Detective and mystery stories,,http://images.amazon.com/images/P/0553212419.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0553212419.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0553212419.01.LZZZZZZZ.jpg
602,394735307,The Tale of Genji,Murasaki Shikibu,Alfred A. Knopf,1978,,,http://images.amazon.com/images/P/0394735307.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0394735307.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0394735307.01.LZZZZZZZ.jpg


__Solution:__ Remove titles with no description.

In [65]:
same_title_books_df = same_title_books_df.dropna(subset=['description'])
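
`dropna` only removes entries where the description is actually missing (NaN or None). If a dataset also contains empty or whitespace-only strings, those survive this step. The sketch below shows one way to treat them as missing too; the toy frame is an assumption, and the notebook's data may not contain such values.

```python
import pandas as pd

df = pd.DataFrame({
    "Book-Title": ["A", "B", "C", "D"],
    "description": ["A real description.", None, "", "   "],
})

# True where the description is NaN/None
is_missing = df["description"].isna()

# True where the description is an empty or whitespace-only string
is_blank = df["description"].str.strip().eq("")

# Keep only rows that have some actual text
with_text = df[~(is_missing | is_blank)]
print(list(with_text["Book-Title"]))
```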

Some entries have very short descriptions that might throw off the predictions.

In [66]:
word_threshold = 8

short_descriptions = same_title_books_df['description'].apply(lambda x: len(str(x).split()) < word_threshold)

short_descriptions_df = same_title_books_df[short_descriptions]

pd.set_option('display.max_colwidth', 255)
print(f"Number of entries with fewer than {word_threshold} words in description:", short_descriptions.sum())
print()
short_descriptions_df[['Book-Title','description']]

Number of entries with fewer than 8 words in description: 37



Unnamed: 0,Book-Title,description
53,"Breath, Eyes, Memory",Oprah's Book Club.
126,"Arrows of the Queen ( The Heralds of Valdemar, Book 1)",Eventyrroman.
206,Triumph of the Darksword (Darksword Trilogy),Science fiction.
236,The Crack-Up,(Autobiographical).
251,Strata,Fantasy-roman.
317,Time And Again,Romance.
357,More Than Complete Hitchhiker's Guide: Complete &amp; Unabridged,Five stories by Douglas Adams.
434,"That Was Then, This Is Now",HINTON/THAT WAS THEN THIS IS NOW
616,The Eight,National bestseller.
624,Pride and Prejudice (Penguin Popular Classics),First published in 1813.


So we filter out these entries

In [67]:
same_title_books_df = same_title_books_df[~short_descriptions]

# Display the number of remaining entries
print("Number of remaining entries:", len(same_title_books_df))

Number of remaining entries: 2332
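
The cleaning steps in this section (duplicate titles, missing descriptions, short descriptions) can also be chained so that each step works on the result of the previous one. The following is a minimal sketch on toy data; the function name `clean_books` and the toy rows are hypothetical, and the counts it produces say nothing about the real dataset.

```python
import pandas as pd


def clean_books(df, word_threshold=8):
    """Hypothetical helper chaining the three cleaning steps in order."""
    # 1. Keep the first edition of each title
    deduped = df.drop_duplicates(subset=["Book-Title"])
    # 2. Drop entries without a description
    described = deduped.dropna(subset=["description"])
    # 3. Drop entries whose description has fewer than word_threshold words
    word_counts = described["description"].str.split().str.len()
    return described[word_counts >= word_threshold].reset_index(drop=True)


toy = pd.DataFrame({
    "Book-Title": ["A", "A", "B", "C", "D"],
    "description": [
        "one two three four five six seven eight nine",
        "a different edition with a long enough description here",
        None,
        "Science fiction.",
        "this description has exactly eight words in it",
    ],
})

cleaned = clean_books(toy)
print(list(cleaned["Book-Title"]))
```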


In [69]:
same_title_books_df.to_csv("./filtered_datasets/final/final_books_content.csv", index=False)