In [1]:
import pandas as pd
import bz2


pd.set_option('display.max_rows', 100)

In [2]:
full_books = pd.read_csv("full_df.csv.bz2", compression='bz2')

In [3]:
full_books.head(3)

Unnamed: 0,book_id,isbn,author_id,authors,title,description,publisher,genres,avg_rating,ratings_count,num_pages,pub_year,language_code,similar_books,url,cover_image
0,1333909,743509986.0,626222,Anita Diamant,Good Harbor,"Anita Diamant's international bestseller ""The ...",Simon & Schuster Audio,"['to-read', 'fiction', 'currently-reading', 'c...",3.23,10,0,2001.0,,"['8709549', '17074050', '28937', '158816', '22...",https://www.goodreads.com/book/show/1333909.Go...,https://s.gr-assets.com/assets/nophoto/book/11...
1,7327624,,10333,Barbara Hambly,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",Omnibus book club edition containing the Ladie...,"Nelson Doubleday, Inc.","['to-read', 'fantasy', 'fiction', 'owned', 'ha...",4.03,140,600,1987.0,eng,"['19997', '828466', '1569323', '425389', '1176...",https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...
2,6066819,743294297.0,9212,Jennifer Weiner,Best Friends Forever,Addie Downs and Valerie Adler were eight when ...,Atria Books,"['to-read', 'chick-lit', 'currently-reading', ...",3.49,51184,368,2009.0,eng,"['6604176', '6054190', '2285777', '82641', '75...",https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...


In [4]:
full_books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115445 entries, 0 to 1115444
Data columns (total 16 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   book_id        1115445 non-null  int64  
 1   isbn           760981 non-null   object 
 2   author_id      1115445 non-null  int64  
 3   authors        1115444 non-null  object 
 4   title          1115442 non-null  object 
 5   description    1115435 non-null  object 
 6   publisher      863332 non-null   object 
 7   genres         1115445 non-null  object 
 8   avg_rating     1115445 non-null  float64
 9   ratings_count  1115445 non-null  int64  
 10  num_pages      1115445 non-null  int64  
 11  pub_year       896977 non-null   float64
 12  language_code  522531 non-null   object 
 13  similar_books  1115445 non-null  object 
 14  url            1115445 non-null  object 
 15  cover_image    1115445 non-null  object 
dtypes: float64(2), int64(4), object(10)
memory usage: 136.

# PART 1: Check for Duplicates

* Since it is possible for books to have the same title, we will look for books that have the same title AND author
* We will also check for books that have the same description
* There are also some hidden duplicates in the form of box sets and book collections

### a) There are 245,814 rows that duplicate an earlier title/author pair. We will drop them, keeping the row with the highest ratings count in each group:

In [5]:
full_books.duplicated(subset=['title','authors']).value_counts()

False    869631
True     245814
dtype: int64

In [6]:
books = full_books.sort_values(by='ratings_count', ascending=False).drop_duplicates(subset=['title','authors'], keep='first', ignore_index=True)
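
To see why the sort comes first, here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from the dataset). `drop_duplicates(keep='first')` keeps whichever row appears first, so sorting by `ratings_count` in descending order makes that row the most-rated edition.

```python
import pandas as pd

# Hypothetical toy data: two editions of the same title/author pair
toy = pd.DataFrame({
    'title': ['The Hobbit', 'The Hobbit', 'Ethics'],
    'authors': ['J.R.R. Tolkien', 'J.R.R. Tolkien', 'Aristotle'],
    'ratings_count': [514, 17097, 978],
})

# Sort so the most-rated edition comes first, then keep the first row of each pair
deduped = (
    toy.sort_values(by='ratings_count', ascending=False)
       .drop_duplicates(subset=['title', 'authors'], keep='first', ignore_index=True)
)
print(deduped)
```

Without the sort, the surviving edition would simply be the one that happens to appear first in the file.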

### b) Of the remaining books, there are 37,827 duplicates based on the descriptions. Let's take a look at some of these descriptions:

In [7]:
books.duplicated(subset=['description']).value_counts()

False    831804
True      37827
dtype: int64

In [8]:
dd = books['description'].value_counts().to_frame().query('description > 1').reset_index()
dd.head(10)

Unnamed: 0,index,description
0,This book was converted from its physical edit...,322
1,This is a pre-1923 historical reproduction tha...,203
2,"Many of the earliest books, particularly those...",100
3,This work has been selected by scholars as bei...,72
4,A Simon & Schuster eBook. Simon & Schuster has...,71
5,The Ensign of The Church of Jesus Christ of La...,64
6,Boyds Mills Press publishes a wide range of hi...,51
7,"The story focuses on Kenichi, an average 16-ye...",45
8,<>,45
9,A Simon & Schuster eBook,44


In [9]:
dd.tail(10)

Unnamed: 0,index,description
30358,When Chechen rebels took Moscow theatergoers h...,2
30359,Based on the Emmy Award-winning YouTube series...,2
30360,"Written by Tom Jackman, the local investigativ...",2
30361,An enthralling literary debut that evokes one ...,2
30362,Mat Phai (Di Tim Nhung Co Hoi Tiem An Trong Cu...,2
30363,"The Clone Wars are over, but for those with re...",2
30364,***Content Warning - This is a new adult ficti...,2
30365,In a bleak future where government systems are...,2
30366,Olivia Greyson is the proud owner of The Ginge...,2
30367,Lily Hayes is a beautiful 3rd grade teacher wh...,2
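
The `dd` table built in `In [8]` depends on how `value_counts()` names its result, and that naming changed in pandas 2.0 (the counts are now called `count`). A version-independent sketch, shown on hypothetical toy data, filters the counts first and then names the two columns explicitly so that they match the output above:

```python
import pandas as pd

# Hypothetical toy descriptions: two repeated texts and one unique text
toy = pd.DataFrame({
    'description': ['same blurb', 'same blurb', 'unique blurb', '<>', '<>', '<>'],
})

counts = toy['description'].value_counts()   # sorted by count, descending
dd_toy = (
    counts[counts > 1]                       # keep descriptions that occur more than once
    .rename_axis('index')                    # column holding the description text
    .reset_index(name='description')         # column holding the count
)
print(dd_toy)
```

With this layout, `dd_toy.iloc[i, 0]` is still the description text, so lookups of the form used in the cells below keep working.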


### c) Reading through the descriptions above, we can see that many of these duplicates are audiobooks or republications of older books. Many of these descriptions are publisher boilerplate that says nothing useful about the book itself:

In [10]:
books[books['description'] == dd.iloc[0,0]].head(3)

Unnamed: 0,book_id,isbn,author_id,authors,title,description,publisher,genres,avg_rating,ratings_count,num_pages,pub_year,language_code,similar_books,url,cover_image
26830,18680050,,3073,John Buchan,The Thirty-Nine Steps,This book was converted from its physical edit...,,"['to-read', 'fiction', 'classics', 'mystery', ...",3.6,2651,0,,eng,"['406575', '46429', '102066', '327008', '94969...",https://www.goodreads.com/book/show/18680050-t...,https://images.gr-assets.com/books/1382113363m...
41834,18625474,,2223232,Lydia Maria Gurney,Things Mother Used to Make A Collection of Old...,This book was converted from its physical edit...,,"['to-read', 'currently-reading', 'cookbooks', ...",3.56,1544,0,,,"['16135072', '13072231', '9497653', '11068380'...",https://www.goodreads.com/book/show/18625474-t...,https://s.gr-assets.com/assets/nophoto/book/11...
59688,5488559,,2192,Aristotle,Ethics,This book was converted from its physical edit...,Public Domain Books,"['to-read', 'currently-reading', 'favorites', ...",3.91,978,0,2005.0,eng,"['1354', '60080', '332138', '130119', '25709',...",https://www.goodreads.com/book/show/5488559-et...,https://images.gr-assets.com/books/1396128601m...


In [11]:
books[books['description'] == dd.iloc[3,0]].head(5)

Unnamed: 0,book_id,isbn,author_id,authors,title,description,publisher,genres,avg_rating,ratings_count,num_pages,pub_year,language_code,similar_books,url,cover_image
55170,926803,1600969607.0,4633123,H. Rider Haggard,Montezuma's Daughter,This work has been selected by scholars as bei...,Boomer Books,"['to-read', 'adventure', 'classics', 'fiction'...",4.08,1086,440,2008.0,eng,"['2769141', '572690', '533415', '6069859', '46...",https://www.goodreads.com/book/show/926803.Mon...,https://s.gr-assets.com/assets/nophoto/book/11...
86560,12899857,,4190443,Mary H. Foster,Asgard Stories: Tales from Norse Mythology,This work has been selected by scholars as bei...,,"['to-read', 'currently-reading', 'mythology', ...",3.8,591,99,2001.0,,"['717128', '8122211', '677065', '18868228', '4...",https://www.goodreads.com/book/show/12899857-a...,https://images.gr-assets.com/books/1421445946m...
94406,194349,192123157.0,2448,Arthur Conan Doyle,His Last Bow: Some Reminiscences of Sherlock H...,This work has been selected by scholars as bei...,"Oxford University Press, USA","['to-read', 'mystery', 'currently-reading', 'c...",4.3,524,304,1993.0,,"['1043897', '184441', '26347202', '948244', '1...",https://www.goodreads.com/book/show/194349.His...,https://s.gr-assets.com/assets/nophoto/book/11...
97898,25167837,,45712,Katherine Mansfield,The Garden Party,This work has been selected by scholars as bei...,,"['to-read', 'short-stories', 'classics', 'fict...",3.61,499,0,,eng,"['1856297', '12365551', '1749624', '71605', '9...",https://www.goodreads.com/book/show/25167837-t...,https://images.gr-assets.com/books/1488110223m...
185015,32503238,,91660,James Willard Schultz,"Rising Wolf, the White Blackfoot: Hugh Monroe'...",This work has been selected by scholars as bei...,,"['currently-reading', 'to-read', 'native-ameri...",4.24,196,0,,,[],https://www.goodreads.com/book/show/32503238-r...,https://s.gr-assets.com/assets/nophoto/book/11...


### d) Some books are missing their description (it is just `<>`). We can use the isbnlib Python library to try to retrieve the missing descriptions, or we can just remove them

In [12]:
books[books['description'] == dd.iloc[8,0]].head()

Unnamed: 0,book_id,isbn,author_id,authors,title,description,publisher,genres,avg_rating,ratings_count,num_pages,pub_year,language_code,similar_books,url,cover_image
22680,1251054,318047101,9494,H.P. Lovecraft,The Colour Out of Space and others,<>,Necronomicon Press,"['to-read', 'horror', 'favorites', 'fiction', ...",4.19,3216,0,1982.0,eng,"['10533450', '10562172', '275779', '9458934', ...",https://www.goodreads.com/book/show/1251054.Th...,https://images.gr-assets.com/books/1330132634m...
116739,289557,2718117524,248419,Pierre Georges Castex,"""Le Rouge Et Le Noir"" De Stendhal",<>,CDU SEDES,"['to-read', 'currently-reading', 'french', 'cl...",3.74,388,187,1995.0,eng,"['3137202', '220362', '504763', '88140', '8381...",https://www.goodreads.com/book/show/289557._Le...,https://images.gr-assets.com/books/1331355465m...
131739,151629,2070369609,5548,Simone de Beauvoir,La Femme rompue,<>,Folio,"['to-read', 'currently-reading', 'fiction', 'f...",3.98,326,252,1972.0,,"['2799237', '88332', '57787', '1559221', '1520...",https://www.goodreads.com/book/show/151629.La_...,https://images.gr-assets.com/books/1277113227m...
165850,135161,2070401871,78176,Andrei Makine,Le Testament français,<>,Folio,"['currently-reading', 'to-read', 'fiction', 'r...",3.83,231,0,1997.0,,"['138612', '1225688', '816971', '780991', '157...",https://www.goodreads.com/book/show/135161.Le_...,https://images.gr-assets.com/books/1356806680m...
200105,1692280,2070385981,457158,Rejean Ducharme,Le nez qui voque,<>,,"['to-read', 'quebecois', 'currently-reading', ...",4.0,173,336,,,"['1177365', '1907739', '422170', '6041667', '1...",https://www.goodreads.com/book/show/1692280.Le...,https://s.gr-assets.com/assets/nophoto/book/11...
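
Descriptions such as `<>` are not `NaN`, so `isna()` will not flag them. Below is a minimal sketch for marking such placeholders as missing so that they can later be filled or dropped. The `PLACEHOLDERS` set and the toy frame are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical set of strings that should count as "no description"
PLACEHOLDERS = {'<>', ''}

toy = pd.DataFrame({
    'title': ['La Femme rompue', 'The Hobbit', 'Blank'],
    'description': ['<>', 'Bilbo Baggins is a reasonably typical hobbit', '   '],
})

# Compare on the stripped text so whitespace-only descriptions are caught too
stripped = toy['description'].str.strip()
toy['description'] = toy['description'].mask(stripped.isin(PLACEHOLDERS))

print(toy['description'].isna().sum())
```

Once the placeholders are real missing values, the usual `isna()` and `dropna()` calls treat them like any other gap.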


### e) Some of the duplicated descriptions are actually the same book, but with slightly different titles:

In [13]:
books[books['description'] == dd.iloc[49,0]].head()

Unnamed: 0,book_id,isbn,author_id,authors,title,description,publisher,genres,avg_rating,ratings_count,num_pages,pub_year,language_code,similar_books,url,cover_image
4848,13512170,,656983,J.R.R. Tolkien,The Hobbit,Bilbo Baggins is a reasonably typical hobbit: ...,Houghton Mifflin Harcourt,"['books-i-own', 'classic', 'adventure', 'young...",4.25,17097,322,2012.0,eng,"['25300956', '104091', '44687', '64216', '6201...",https://www.goodreads.com/book/show/13512170-t...,https://images.gr-assets.com/books/1390804912m...
52998,6637795,788737279.0,656983,J.R.R. Tolkien,"The Hobbit, Prequel to the Lord of the Rings T...",Bilbo Baggins is a reasonably typical hobbit: ...,,"['books-i-own', 'classic', 'adventure', 'young...",4.25,1141,0,,,"['25300956', '104091', '44687', '64216', '6201...",https://www.goodreads.com/book/show/6637795-th...,https://images.gr-assets.com/books/1332981616m...
84299,6472585,,656983,J.R.R. Tolkien,The Hobbit: or There and Back Again,Bilbo Baggins is a reasonably typical hobbit: ...,Easton Press,"['books-i-own', 'classic', 'adventure', 'young...",4.25,614,317,1984.0,eng,"['25300956', '104091', '44687', '64216', '6201...",https://www.goodreads.com/book/show/6472585-th...,https://s.gr-assets.com/assets/nophoto/book/11...
95741,2194861,739410741.0,656983,J.R.R. Tolkien,The Hobbit or There and Back Again,Bilbo Baggins is a reasonably typical hobbit: ...,Houghton Mifflin Co.,"['books-i-own', 'classic', 'adventure', 'young...",4.25,514,256,1997.0,eng,"['25300956', '104091', '44687', '64216', '6201...",https://www.goodreads.com/book/show/2194861.Th...,https://s.gr-assets.com/assets/nophoto/book/11...
144295,111782,395520215.0,656983,J.R.R. Tolkien,"The Hobbit; or, There and Back Again",Bilbo Baggins is a reasonably typical hobbit: ...,,"['books-i-own', 'classic', 'adventure', 'young...",4.25,284,0,1989.0,,"['25300956', '104091', '44687', '64216', '6201...",https://www.goodreads.com/book/show/111782.The...,https://s.gr-assets.com/assets/nophoto/book/11...


In [14]:
books[books['description'] == dd.iloc[30366,0]]

Unnamed: 0,book_id,isbn,author_id,authors,title,description,publisher,genres,avg_rating,ratings_count,num_pages,pub_year,language_code,similar_books,url,cover_image
114726,15810129,425260690.0,4111360,Virginia Lowell,"One Dead Cookie (Cookie Cutter Shop Mystery, #4)",Olivia Greyson is the proud owner of The Ginge...,Berkley,"['to-read', 'mystery', 'cozy-mystery', 'curren...",3.95,398,304,2013.0,eng,"['16000229', '15810126', '15810844', '15741975...",https://www.goodreads.com/book/show/15810129-o...,https://s.gr-assets.com/assets/nophoto/book/11...
418400,18933004,,4111360,Virginia Lowell,One Dead Cookie,Olivia Greyson is the proud owner of The Ginge...,,"['to-read', 'mystery', 'cozy-mystery', 'curren...",3.95,47,0,,,"['16000229', '15810126', '15810844', '15741975...",https://www.goodreads.com/book/show/18933004-o...,https://s.gr-assets.com/assets/nophoto/book/11...
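
One way to catch such near-identical titles is to compare a normalized key instead of the raw title. The sketch below is an assumption, not part of the original notebook: the hypothetical `title_key` helper drops a trailing series note in parentheses, then lowercases the title and strips punctuation.

```python
import re
import pandas as pd

def title_key(title):
    """Hypothetical helper: crude normalization for comparing titles."""
    title = re.sub(r'\s*\([^)]*\)\s*$', '', title)   # drop a trailing "(Series, #n)" note
    title = re.sub(r'[^\w\s]', '', title.lower())     # lowercase and drop punctuation
    return ' '.join(title.split())                    # collapse repeated whitespace

toy = pd.DataFrame({
    'authors': ['Virginia Lowell', 'Virginia Lowell', 'J.R.R. Tolkien', 'J.R.R. Tolkien'],
    'title': [
        'One Dead Cookie (Cookie Cutter Shop Mystery, #4)',
        'One Dead Cookie',
        'The Hobbit: or There and Back Again',
        'The Hobbit or There and Back Again',
    ],
})
toy['title_key'] = toy['title'].map(title_key)

# True marks a row whose normalized title and author repeat an earlier row
print(toy.duplicated(subset=['title_key', 'authors']).tolist())
```

A key like this would still treat "The Hobbit" and "The Hobbit or There and Back Again" as different titles, which is why comparing descriptions remains useful.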


### f) There are also magazines and different volumes of the same manga:

In [16]:
books[books['description'] == dd.iloc[7,0]].head(2)

Unnamed: 0,book_id,isbn,author_id,authors,title,description,publisher,genres,avg_rating,ratings_count,num_pages,pub_year,language_code,similar_books,url,cover_image
162800,15854588,,6421711,Syun Matsuena,History's Strongest Disciple Kenichi Volume 3,"The story focuses on Kenichi, an average 16-ye...",,"['manga', 'to-read', 'martial-arts', 'manga-re...",4.37,238,180,,,"['13612727', '3808017', '4937615', '1435389']",https://www.goodreads.com/book/show/15854588-h...,https://images.gr-assets.com/books/1346184681m...
171699,15854586,,6421711,Syun Matsuena,History's Strongest Disciple Kenichi Volume 5,"The story focuses on Kenichi, an average 16-ye...",,"['manga', 'to-read', 'martial-arts', 'manga-re...",4.34,219,180,,,"['13612727', '3808017', '6535071', '17157752']",https://www.goodreads.com/book/show/15854586-h...,https://images.gr-assets.com/books/1346184656m...


### g) There are some 'duplicates' hidden within our data due to things like box sets and collections:

In [17]:
mask = books['title'].str.contains(pat='Harry Potter Collection', regex=True, case=True, na=False)
books[mask]

Unnamed: 0,book_id,isbn,author_id,authors,title,description,publisher,genres,avg_rating,ratings_count,num_pages,pub_year,language_code,similar_books,url,cover_image
3299,10,439827604,1077326,J.K. Rowling,"Harry Potter Collection (Harry Potter, #1-6)","Six years of magic, adventure, and mystery mak...",Scholastic,"['to-read', 'favorites', 'fantasy', 'currently...",4.73,25245,3342,2005.0,eng,"['690926', '2223324', '13350', '510712', '1187...",https://www.goodreads.com/book/show/10.Harry_P...,https://images.gr-assets.com/books/1328867351m...
42529,7,439887453,1077326,J.K. Rowling,"The Harry Potter Collection (Harry Potter, #1-6)","Six years of magic, adventure, and mystery mak...",Scholastic,"['to-read', 'favorites', 'fantasy', 'currently...",4.73,1512,0,2006.0,eng,"['690926', '2223324', '13350', '510712', '1187...",https://www.goodreads.com/book/show/7.The_Harr...,https://images.gr-assets.com/books/1328866031m...
42795,1668764,747594562,1077326,J.K. Rowling,The Complete Harry Potter Collection Box Set (...,A fabulous opportunity to own all seven Harry ...,Bloomsbury,"['favorites', 'currently-reading', 'fantasy', ...",4.74,1500,4000,2007.0,eng,"['6443349', '65113', '7619', '11387535', '4924...",https://www.goodreads.com/book/show/1668764.Th...,https://images.gr-assets.com/books/1381586062m...
641060,28794609,3200307951,1077326,J.K. Rowling,J.K. Rowling Harry Potter Collection (Harry Po...,J.K. Rowling Harry Potter Collection 7 Books B...,,"['favorites', 'currently-reading', 'fantasy', ...",4.74,17,0,,eng,"['6443349', '65113', '7619', '11387535', '4924...",https://www.goodreads.com/book/show/28794609-j...,https://s.gr-assets.com/assets/nophoto/book/11...


### h) Drop books that have no useful description:
* Some duplicated descriptions do not actually describe the book, so we will drop the books that carry them:

In [18]:
# Indices from dd
dd_indexes = [0,1,2,3,4,5,6,7,8,9,19,11,12,13,15,16,17,20,25,28,32,33,41]

# Empty list
books_to_drop = []

# Loop through the indices from dd and collect the index of every book with that description
for i in dd_indexes:
    drop = list(books[books['description'] == dd.iloc[i,0]].index)
    books_to_drop.append(drop)
    
# Our for-loop gave us a list of lists, so we will just use list comprehension to get a single list    
books_to_drop = [item for sublist in books_to_drop for item in sublist]

# Drop books
books = books.drop(books_to_drop).reset_index(drop=True)

In [19]:
books.shape

(868342, 16)
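
The loop in `In [18]` can also be written without building a list of lists. As a sketch on hypothetical toy data (assuming, as in the output shown earlier, that the first column of `dd` holds the description text), `isin` selects every row whose description is one of the chosen boilerplate texts:

```python
import pandas as pd

# Hypothetical stand-ins for `books` and `dd`
toy_books = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'description': ['boilerplate one', 'real blurb', 'boilerplate two',
                    'boilerplate one', 'boilerplate two'],
})
toy_dd = pd.DataFrame({
    'index': ['boilerplate one', 'boilerplate two'],
    'description': [2, 2],
})

toy_dd_indexes = [0, 1]                               # positions in toy_dd to drop
bad_descriptions = toy_dd.iloc[toy_dd_indexes, 0]     # description text at those positions

# Keep only the books whose description is not in the boilerplate list
toy_books = toy_books[~toy_books['description'].isin(bad_descriptions)].reset_index(drop=True)
print(toy_books)
```

This makes a single pass over the description column instead of one comparison per selected description.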

### i) Drop the remaining duplicate descriptions. As with the book titles, we will keep the book with the highest ratings count:

In [20]:
books = books.sort_values(by='ratings_count', ascending=False).drop_duplicates(subset=['description'], keep='first', ignore_index=True)

In [21]:
books.shape

(831781, 16)

### j) Drop books that are box sets or collections:

In [22]:
key_words = 'Boxset|boxset|boxed set|Boxed Set|Book Collection'
mask = books['title'].str.contains(pat=key_words, regex=True, case=True, na=False)
books = books[~mask].reset_index(drop=True)

In [23]:
books.shape

(830978, 16)
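
The pattern in `In [22]` is case-sensitive and lists spelling variants by hand, so a title that only contains "Box Set" (with a space) would not match. Below is a hedged sketch of a broader, case-insensitive pattern on hypothetical toy titles. The regex is an assumption and would need checking against false positives before use on the full data.

```python
import pandas as pd

# Hypothetical toy titles, including a missing value
toy_titles = pd.Series([
    'The Complete Harry Potter Collection Box Set',
    'Divergent Series Boxed Set',
    'Some Trilogy BOXSET',
    'The Hobbit',
    None,
])

# Assumed pattern: "box set", "boxset", "boxed set" or "book collection", in any casing
pattern = r'box(?:ed)?\s?set|book collection'
mask = toy_titles.str.contains(pat=pattern, regex=True, case=False, na=False)

# Titles that survive the filter
print(toy_titles[~mask].tolist())
```

Using `case=False` removes the need to spell out each capitalization, and `na=False` keeps rows with a missing title from being dropped by this step.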

# PART 2: Missing Values

In [24]:
books_for_rec = books[['book_id','isbn','authors','title','description','genres']]
books_for_rec

Unnamed: 0,book_id,isbn,authors,title,description,genres
0,2767052,0439023483,Suzanne Collins,"The Hunger Games (The Hunger Games, #1)",Winning will make you famous.\nLosing means ce...,"['favorites', 'currently-reading', 'to-read', ..."
1,3,0439554934,J.K. Rowling,Harry Potter and the Sorcerer's Stone (Harry P...,Harry Potter's life is miserable. His parents ...,"['to-read', 'favorites', 'fantasy', 'young-adu..."
2,2657,0061120081,Harper Lee,To Kill a Mockingbird,The unforgettable novel of a childhood in a sl...,"['to-read', 'favorites', 'classics', 'classic'..."
3,4671,0743273567,F. Scott Fitzgerald,The Great Gatsby,"THE GREAT GATSBY, F. Scott Fitzgerald's third ...","['to-read', 'classics', 'favorites', 'fiction'..."
4,11870085,0525478817,John Green,The Fault in Our Stars,"There is an alternate cover edition .\n""I fel...","['to-read', 'favorites', 'young-adult', 'ficti..."
...,...,...,...,...,...,...
830973,1531074,0811841723,Susan Verlander,"Goodnight, Country","Dinner bells ring, screen doors swing. Bats sq...","['rhyme', 'picture-books', 'books-sounds', 'be..."
830974,20637510,146368942X,Kip Manley,"Wake up.. (City of Roses, #1)",City of Roses is a serialized epic very firmly...,"['to-read', 'on-hold', 'fantasy', 'one-of-thes..."
830975,20827002,,Lizzie Stark,Pocket Guide to American Freeform,"Pocket Guide to American Freeformis a 24,000-w...","['to-read', 'non-fiction', 'books-to-get', 'ra..."
830976,4247549,0826498884,Scott T. Brown,A Guide to Writing Academic Essays in Religiou...,One of the greatest challenges for instructors...,"['to-read', 'on-my-shelves']"
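
Because `books_for_rec` is a column subset of `books`, later in-place changes to it can trigger pandas' `SettingWithCopyWarning`. A minimal sketch on hypothetical toy data: taking an explicit `.copy()` makes the subset an independent frame, so edits to it are unambiguous and leave the original untouched.

```python
import pandas as pd

# Hypothetical stand-in for `books`
toy = pd.DataFrame({
    'book_id': [1, 2],
    'title': [None, 'Good Harbor'],
    'num_pages': [120, 0],
})

# Explicit copy: toy_rec owns its data and is not tied to toy
toy_rec = toy[['book_id', 'title']].copy()
toy_rec.at[0, 'title'] = 'Made'

print(toy_rec)
print(toy['title'].isna().sum())   # the original frame still has its missing title
```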


### a) Missing book titles

In [37]:
books_for_rec[books_for_rec['title'].isna()]

Unnamed: 0,book_id,isbn,authors,title,description,genres
659074,7807037,189738856X,Cara Benson,,"Poetry. ""In the magical dictionary of (MADE), ...","['to-read', 'poetry', 'spring-2010', 'favorite..."
774235,2433394,0440428475,E.L. Konigsburg,,Ben has always been content to be brilliant at...,"['to-read', 'fiction', 'ya', 'young-adult', 'c..."


In [64]:
books_for_rec.at[774235,'title']='Boku to Joji'
books_for_rec.at[659074,'title']='Made'

### b) Missing book description

In [38]:
books_for_rec[books_for_rec['description'].isna()]

Unnamed: 0,book_id,isbn,authors,title,description,genres
293813,21802898,,J. Lea,Beyond All Boundaries,,"['to-read', 'new-adult', 'kindle', 'romance', ..."


In [39]:
import isbnlib

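# Look up a candidate ISBN for the title (the return value is not stored here)
isbnlib.isbn_from_words('Beyond All Boundaries')
# Fill the one missing description; assigning the result back avoids a chained in-place modification
books_for_rec['description'] = books_for_rec['description'].fillna(isbnlib.desc('9781950639021'))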

### c) Missing author

In [66]:
books_for_rec[books_for_rec['authors'].isna()]

Unnamed: 0,book_id,isbn,authors,title,description,genres
711347,711979,9643510735,,تهران شهر بی آسمان,A collection of short stories set in Tehran.,"['to-read', 'روايات', 'currently-reading', 'ow..."


In [70]:
# Reassign rather than using inplace=True on a slice, which raises SettingWithCopyWarning
books_for_rec = books_for_rec.dropna(subset=['authors'])


In [71]:
books_for_rec.isna().sum()

book_id             0
isbn           241601
authors             0
title               0
description         0
genres              0
dtype: int64

### Many ISBNs are missing, but we won't need them for our recommendation system, so we will leave them missing for now

In [78]:
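from langdetect import detect

def detect_lang(text):
    """Return the detected language code for text, or 'unknown' if detection fails."""
    try:
        return detect(text)
    except Exception:  # e.g. empty or non-text descriptions
        return 'unknown'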

In [None]:
books_for_rec['lang'] = books_for_rec['description'].apply(detect_lang)
books_for_rec

In [74]:
#books_for_rec.to_csv('cleaned_books.csv.bz2', index=False, compression='bz2')