In [1]:
import numpy as np
import pandas as pd

In [2]:
books = pd.read_csv("books_cleaned.csv")

In [16]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5693 entries, 0 to 5692
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   isbn13              5693 non-null   int64  
 1   isbn10              5693 non-null   object 
 2   title               5693 non-null   object 
 3   authors             5646 non-null   object 
 4   categories          5662 non-null   object 
 5   thumbnail           5506 non-null   object 
 6   description         5693 non-null   object 
 7   published_year      5693 non-null   float64
 8   average_rating      5693 non-null   float64
 9   num_pages           5693 non-null   float64
 10  ratings_count       5693 non-null   float64
 11  title_and_subtitle  5693 non-null   object 
 12  tagged_description  5693 non-null   object 
dtypes: float64(4), int64(1), object(8)
memory usage: 578.3+ KB


### Exploring the `categories` Column

I will create a new column that contains two (2) unique values: `Fiction` and `Nonfiction`.

The top 20 most popular categories are:

In [23]:
popular_categories = books["categories"].value_counts().reset_index()
print(f"There are {popular_categories.shape[0]} unique book categories")

print("\nThe 20 most popular categories are:")
popular_categories[:20]

There are 499 unique book categories

The 20 most popular categories are:


Unnamed: 0,categories,count
0,Fiction,2269
1,Juvenile Fiction,456
2,Biography & Autobiography,348
3,History,226
4,Literary Criticism,138
5,Philosophy,129
6,Religion,126
7,Comics & Graphic Novels,125
8,Drama,99
9,Juvenile Nonfiction,81


Now, examine all categories that contain the word `fiction` or `nonfiction`:

In [35]:
# Rows whose category contains "fiction" (case-insensitive); this substring also matches "nonfiction"
fictional_books = books[books["categories"].str.lower().str.contains("fiction", na=False)]
print(f"The total number of fictional books recorded is: {fictional_books.shape[0]}.")
print("\nThese categories are:")
fictional_categories = fictional_books.categories.unique()
print(fictional_categories)

The total number of fictional books recorded is: 2902.

These categories are:
['Fiction' 'Fantasy fiction' 'English fiction' 'Science fiction'
 'American fiction' 'Juvenile Fiction' 'Fantasy fiction, English'
 'FICTION' 'Juvenile Nonfiction' 'Young Adult Fiction'
 'Australian fiction' 'Domestic fiction' 'Classical fiction'
 'Historical fiction' 'Experimental fiction, American'
 'Experimental fiction' 'Diary fiction' 'Adventure fiction'
 'Political fiction' 'Occult fiction' 'Arabic fiction' 'Czech fiction'
 'Alternative histories (Fiction)' 'Boarding school-fiction'
 'Fantasy fiction, American' 'Humorous fiction' 'JUVENILE FICTION'
 'European fiction' 'Science fiction, American' 'Christian fiction'
 'Indic fiction (English)']


In [39]:
# Rows whose category contains "nonfiction" (case-insensitive)
non_fictional_books = books[books["categories"].str.lower().str.contains("nonfiction", na=False)]
print(f"The total number of nonfictional books recorded is: {non_fictional_books.shape[0]}.")
print("\nThese categories are:")
non_fictional_categories = non_fictional_books.categories.unique()
print(non_fictional_categories)

The total number of nonfictional books recorded is: 81.

These categories are:
['Juvenile Nonfiction']
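Because "nonfiction" contains the substring "fiction", the first filter also picks up `Juvenile Nonfiction` (it appears in the list of fictional categories printed above), so the 2902 figure includes the 81 nonfiction rows. The snippet below is a minimal sketch of one way to separate the two, using a regex with a negative lookbehind. It runs on a small made-up Series rather than the real data, and the variable names are illustrative only.

```python
import pandas as pd

# Toy data standing in for books["categories"]; not the real dataset.
categories = pd.Series(
    ["Fiction", "Juvenile Nonfiction", "Science fiction", "History", None, "FICTION"]
)

lowered = categories.str.lower()

# Plain substring match: "nonfiction" also contains "fiction".
contains_fiction = lowered.str.contains("fiction", na=False)

# Negative lookbehind: match "fiction" only when it is not preceded by "non".
fiction_only = lowered.str.contains(r"(?<!non)fiction", regex=True, na=False)
nonfiction_only = lowered.str.contains("nonfiction", na=False)

print(categories[contains_fiction].tolist())
print(categories[fiction_only].tolist())
print(categories[nonfiction_only].tolist())
```

With this approach the two groups no longer overlap, so their counts can be reported separately.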


Some of the most popular categories do not contain the word `fiction` or `nonfiction`, but it is clear whether they are fiction or nonfiction. Creating the new column:

In [41]:
# names of the popular categories:
popular_categories.categories[:20].unique()

array(['Fiction', 'Juvenile Fiction', 'Biography & Autobiography',
       'History', 'Literary Criticism', 'Philosophy', 'Religion',
       'Comics & Graphic Novels', 'Drama', 'Juvenile Nonfiction',
       'Science', 'Poetry', 'Business & Economics',
       'Literary Collections', 'Social Science', 'Performing Arts',
       'Body, Mind & Spirit', 'Art', 'Travel', 'Cooking'], dtype=object)

In [42]:
fictional_category_mapping = {fictional_category:"Fiction" for fictional_category in fictional_categories}

non_fictional_category_mapping = {non_fictional_category:"Nonfiction" for non_fictional_category in non_fictional_categories}

other_category_mapping = {
 'Biography & Autobiography': "Nonfiction",
 'History': "Nonfiction",
 'Literary Criticism': "Nonfiction",
 'Philosophy': "Nonfiction",
 'Religion': "Nonfiction",
 'Comics & Graphic Novels': "Fiction",
 'Drama': "Fiction",
 'Science': "Nonfiction",
 'Poetry': "Fiction"}

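# Combine all mappings; later dictionaries override earlier ones on duplicate keys,
# so 'Juvenile Nonfiction' ends up mapped to "Nonfiction"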
category_mapping = {**fictional_category_mapping, **non_fictional_category_mapping, **other_category_mapping}
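
The order of the dictionaries in the merge matters here, because `Juvenile Nonfiction` appears in both the fictional and the nonfictional lists. When dictionaries are unpacked with `**`, a later dictionary overrides an earlier one on duplicate keys. The sketch below illustrates this with shortened, hypothetical versions of the three mappings (the names `fictional`, `non_fictional`, and `other` are placeholders).

```python
# Toy versions of the three mappings; only a few keys are shown.
fictional = {"Fiction": "Fiction", "Juvenile Nonfiction": "Fiction"}
non_fictional = {"Juvenile Nonfiction": "Nonfiction"}
other = {"History": "Nonfiction"}

# Same order as in the notebook: the nonfiction mapping comes after the fiction one.
merged = {**fictional, **non_fictional, **other}
print(merged["Juvenile Nonfiction"])  # the later dictionary wins

# Swapping the order would leave the incorrect "Fiction" label in place.
reversed_merge = {**non_fictional, **fictional, **other}
print(reversed_merge["Juvenile Nonfiction"])
```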

In [51]:
books["simple_categories"] = books["categories"].map(category_mapping)

In [52]:
books.head(2)

Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,title_and_subtitle,tagged_description,simple_categories
0,9780002005883,2005883,Gilead,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0,Gilead,9780002005883 A NOVEL THAT READERS and critics...,Fiction
1,9780002261982,2261987,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0,Spider's Web: A Novel,9780002261982 A new 'Christie for Christmas' -...,


Books whose category is not in `category_mapping` are assigned `NaN`.
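
This follows from how `Series.map` treats a dictionary: values that are not keys of the dictionary become `NaN`. A minimal sketch on made-up values (the Series and the shortened `mapping` are assumptions for illustration, not the notebook's data):

```python
import pandas as pd

# Toy categories; the second value is deliberately absent from the mapping.
categories = pd.Series(
    ["Fiction", "Detective and mystery stories", "Philosophy", None]
)
mapping = {"Fiction": "Fiction", "Philosophy": "Nonfiction"}

simple = categories.map(mapping)

# Unmapped and missing categories both come out as NaN.
print(simple.isna().tolist())
print(simple.dropna().tolist())
```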

Examining the books whose `simple_categories` value is not `NaN`:

In [53]:
books[~(books["simple_categories"].isna())]

Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,title_and_subtitle,tagged_description,simple_categories
0,9780002005883,0002005883,Gilead,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0,Gilead,9780002005883 A NOVEL THAT READERS and critics...,Fiction
2,9780006178736,0006178731,Rage of angels,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,29532.0,Rage of angels,"9780006178736 A memorable, mesmerizing heroine...",Fiction
8,9780006482079,0006482074,Warhost of Vastmark,Janny Wurts,Fiction,http://books.google.com/books/content?id=uOL0f...,"Tricked once more by his wily half-brother, Ly...",1995.0,4.03,522.0,2966.0,Warhost of Vastmark,9780006482079 Tricked once more by his wily ha...,Fiction
11,9780006483908,0006483909,Jimmy the Hand,Raymond E. Feist;S. M. Stirling,Fantasy fiction,http://books.google.com/books/content?id=hV4-o...,"Jimmy the Hand, boy thief of Krondor, lived in...",2003.0,3.95,368.0,5579.0,Jimmy the Hand,"9780006483908 Jimmy the Hand, boy thief of Kro...",Fiction
15,9780006496878,0006496873,Mystical Paths,Susan Howatch,English fiction,http://books.google.com/books/content?id=by4yt...,1968 finds Nicholas Darrow wrestling with pers...,1996.0,4.23,576.0,1023.0,Mystical Paths,9780006496878 1968 finds Nicholas Darrow wrest...,Fiction
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5686,9788171565641,8171565646,Aspects of the Novel,E. M. Forster,English fiction,http://books.google.com/books/content?id=qWU9P...,"Forster's lively, informed originality and wit...",2004.0,3.83,141.0,10.0,Aspects of the Novel,"9788171565641 Forster's lively, informed origi...",Fiction
5687,9788172235222,8172235224,Mistaken Identity,Nayantara Sahgal,Indic fiction (English),http://books.google.com/books/content?id=q-tKP...,On A Train Journey Home To North India After L...,2003.0,2.93,324.0,0.0,Mistaken Identity,9788172235222 On A Train Journey Home To North...,Fiction
5690,9788185300535,8185300534,I Am that,Sri Nisargadatta Maharaj;Sudhakar S. Dikshit,Philosophy,http://books.google.com/books/content?id=Fv_JP...,This collection of the timeless teachings of o...,1999.0,4.51,531.0,104.0,I Am that: Talks with Sri Nisargadatta Maharaj,9788185300535 This collection of the timeless ...,Nonfiction
5691,9789027712059,9027712050,The Berlin Phenomenology,Georg Wilhelm Friedrich Hegel,History,http://books.google.com/books/content?id=Vy7Sk...,Since the three volume edition ofHegel's Philo...,1981.0,0.00,210.0,0.0,The Berlin Phenomenology,9789027712059 Since the three volume edition o...,Nonfiction


There were `4213` books labeled as either `Fiction` or `Nonfiction`.
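
A count like this can be read directly from the new column instead of from the displayed DataFrame. The sketch below shows one possible way, on a small made-up Series standing in for `books["simple_categories"]`; the numbers it prints relate to the toy data only.

```python
import pandas as pd

# Toy stand-in for books["simple_categories"]; None marks unmapped books.
simple_categories = pd.Series(["Fiction", None, "Nonfiction", "Fiction", None])

# Number of labeled (non-NaN) rows.
labeled = int(simple_categories.notna().sum())

# Breakdown per label; value_counts ignores NaN by default.
per_label = simple_categories.value_counts()

print(labeled)
print(per_label.to_dict())
```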