In [6]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

### CREATION OF A CSV FILE WITH THE GENRE OF EACH BOOK

Objective: identify, for each book, the genre that readers assign to it most often, and write that genre in a dedicated column.

Goodreads determines a book's genre by crowd-sourcing user shelves. If enough users shelve a book as "science," for example, that genre is assigned to the book. This isn't a perfect system: users sometimes shelve a book as "science" when it is actually "science fiction," and so on.
For more details: https://help.goodreads.com/s/article/How-can-I-set-my-book-s-genres

The genre data set comes from the work of Mengting Wan and Julian McAuley:
• Mengting Wan, Julian McAuley, "Item Recommendation on Monotonic Behavior Chains", in RecSys'18.
• Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, "Fine-Grained Spoiler Detection from Large-Scale Review Corpora", in ACL'19.

All files can be found at this address:
https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/byGenre/


### Step 1: Create a dataframe with the genre scores of every book: df_Genre

In [2]:
import gzip
import json

In [3]:
# Load a gzip-compressed JSON Lines file into a list of dicts (one per book)
def load_data(file_name, head=None):
    data = []
    with gzip.open(file_name) as fin:
        for count, line in enumerate(fin):
            # Stop once `head` records have been read (None reads the whole file)
            if head is not None and count >= head:
                break
            data.append(json.loads(line))
    return data
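
The same loading pattern can be tried on a small file before running it on the full 2.3-million-record archive. The sketch below is only an illustration: the sample records are made up, and the helper name `read_json_lines` and its `limit` parameter are hypothetical, not part of the original notebook.

```python
import gzip
import json
import os
import tempfile
from itertools import islice


def read_json_lines(path, limit=None):
    """Parse at most `limit` records from a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as fin:
        # islice(fin, None) iterates over the whole file
        return [json.loads(line) for line in islice(fin, limit)]


# Made-up records with the same shape as the genre file (assumption)
sample = [
    {"book_id": "1", "genres": {"fiction": 3, "poetry": 1}},
    {"book_id": "2", "genres": {}},
    {"book_id": "3", "genres": {"romance": 7}},
]

# Write the sample as one JSON document per line, gzip-compressed
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as fout:
    for record in sample:
        fout.write(json.dumps(record) + "\n")

first_two = read_json_lines(tmp_path, limit=2)
all_records = read_json_lines(tmp_path)
print(len(first_two), len(all_records))
```

Reading only the first few records this way is a cheap check of the file structure before committing to a full load.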

In [4]:
# Load the gz file with the function above
books = load_data(r"C:\Users\gunon\Documents\bootcamp-main\3-projects\ML_Book_Valuations\goodreads_book_genres_initial.json.gz")

In [7]:
# Flatten the nested JSON records in books into a dataframe called df_Genre
df_Genre = pd.json_normalize(books)

# Rename the book identifier column (needed for the next steps)
df_Genre.rename(columns={'book_id': 'bookID'}, inplace=True)
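
To see what the flattening does, here is a minimal sketch on three made-up records (the values are illustrative, not taken from the data set). Each key of the nested `genres` dict becomes its own `genres.<name>` column, and a genre that a book does not have shows up as NaN.

```python
import pandas as pd

# Made-up records shaped like the genre file (assumption)
records = [
    {"book_id": "10", "genres": {"fiction": 219, "poetry": 5}},
    {"book_id": "11", "genres": {"poetry": 14}},
    {"book_id": "12", "genres": {}},  # a book with no genre at all
]

# Nested keys are joined to their parent key with a dot
df = pd.json_normalize(records)
df = df.rename(columns={"book_id": "bookID"})

genre_cols = [c for c in df.columns if c.startswith("genres.")]
print(sorted(genre_cols))
print(df[genre_cols].isna().sum(axis=1).tolist())
```

The book with an empty `genres` dict ends up with NaN in every genre column, which is why the missing values have to be handled before picking a main genre.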

In [8]:
# Information on the dataframe
df_Genre.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2360655 entries, 0 to 2360654
Data columns (total 11 columns):
 #   Column                                         Dtype  
---  ------                                         -----  
 0   bookID                                         object 
 1   genres.history, historical fiction, biography  float64
 2   genres.fiction                                 float64
 3   genres.fantasy, paranormal                     float64
 4   genres.mystery, thriller, crime                float64
 5   genres.poetry                                  float64
 6   genres.romance                                 float64
 7   genres.non-fiction                             float64
 8   genres.children                                float64
 9   genres.young-adult                             float64
 10  genres.comics, graphic                         float64
dtypes: float64(10), object(1)
memory usage: 198.1+ MB


In [11]:
# A detailed look at the dataframe
df_Genre

Unnamed: 0,bookID,"genres.history, historical fiction, biography",genres.fiction,"genres.fantasy, paranormal","genres.mystery, thriller, crime",genres.poetry,genres.romance,genres.non-fiction,genres.children,genres.young-adult,"genres.comics, graphic"
0,5333265,1.0,,,,,,,,,
1,1333909,5.0,219.0,,,,,,,,
2,7327624,,8.0,31.0,1.0,1.0,,,,,
3,6066819,,555.0,,10.0,,23.0,,,,
4,287140,,,,,,,3.0,,,
...,...,...,...,...,...,...,...,...,...,...,...
2360650,3084038,7.0,,,,,,5.0,,,
2360651,26168430,,1.0,,4.0,,,,1.0,,
2360652,2342551,,,,,14.0,,1.0,7.0,1.0,
2360653,22017381,,,,2.0,,13.0,,,,


In [12]:
# Look up the data for a specific book: bookID '1' (Harry Potter)
df_Genre[df_Genre['bookID']=='1']

Unnamed: 0,bookID,"genres.history, historical fiction, biography",genres.fiction,"genres.fantasy, paranormal","genres.mystery, thriller, crime",genres.poetry,genres.romance,genres.non-fiction,genres.children,genres.young-adult,"genres.comics, graphic"
861044,1,,11308.0,42143.0,467.0,,340.0,,6907.0,14393.0,


### Step 2: Create a column that holds each book's main genre

Each book has a score for each of the 10 literary genres. The genre with the highest score is recorded in a dedicated column. Books without any genre receive the code genres.missing.
A new dataframe is created: df_Genre_Bright
A new column is added: book_genre
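
The rule can be sketched on a toy dataframe with two genre columns (the values are made up for illustration). It uses `idxmax` to pick the highest-scoring column and treats a row whose scores are all zero as having no genre; when two genres tie, `idxmax` keeps the first column in order.

```python
import pandas as pd

# Toy data (assumption): "b" has no genre, "c" has a tie
toy = pd.DataFrame({
    "bookID": ["a", "b", "c"],
    "genres.fiction": [219.0, None, 4.0],
    "genres.poetry": [5.0, None, 4.0],
})

genre_cols = [c for c in toy.columns if c.startswith("genres.")]
scores = toy[genre_cols].fillna(0)

# Name of the column holding the highest score in each row
toy["book_genre"] = scores.idxmax(axis=1)

# A row whose best score is 0 has no genre at all
toy.loc[scores.max(axis=1) == 0, "book_genre"] = "genres.missing"

print(toy["book_genre"].tolist())
# ['genres.fiction', 'genres.missing', 'genres.fiction']
```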

In [21]:
# BUILD df_Genre_Bright FROM df_Genre

# Genre score columns (every column except the book identifier)
genre_cols = [c for c in df_Genre.columns if c.startswith('genres.')]

# Replace NaN with 0 in the genre dataframe
df_Genre_Bright = df_Genre.fillna(0)

# Add a new column 'book_genre' holding the name of the highest-scoring genre column
# (on a tie, the first column in order is kept)
df_Genre_Bright['book_genre'] = df_Genre_Bright[genre_cols].idxmax(axis=1)

# Books whose scores are all 0 have no genre: label them 'genres.missing'
no_genre = df_Genre_Bright[genre_cols].max(axis=1) == 0
df_Genre_Bright.loc[no_genre, 'book_genre'] = 'genres.missing'

# Display the updated dataframe
df_Genre_Bright


Unnamed: 0,bookID,"genres.history, historical fiction, biography",genres.fiction,"genres.fantasy, paranormal","genres.mystery, thriller, crime",genres.poetry,genres.romance,genres.non-fiction,genres.children,genres.young-adult,"genres.comics, graphic",book_genre
0,5333265,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"genres.history, historical fiction, biography"
1,1333909,5.0,219.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,genres.fiction
2,7327624,0.0,8.0,31.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,"genres.fantasy, paranormal"
3,6066819,0.0,555.0,0.0,10.0,0.0,23.0,0.0,0.0,0.0,0.0,genres.fiction
4,287140,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,genres.non-fiction
...,...,...,...,...,...,...,...,...,...,...,...,...
2360650,3084038,7.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,"genres.history, historical fiction, biography"
2360651,26168430,0.0,1.0,0.0,4.0,0.0,0.0,0.0,1.0,0.0,0.0,"genres.mystery, thriller, crime"
2360652,2342551,0.0,0.0,0.0,0.0,14.0,0.0,1.0,7.0,1.0,0.0,genres.poetry
2360653,22017381,0.0,0.0,0.0,2.0,0.0,13.0,0.0,0.0,0.0,0.0,genres.romance


In [26]:
df_Genre_Bright.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2360655 entries, 0 to 2360654
Data columns (total 12 columns):
 #   Column                                         Dtype  
---  ------                                         -----  
 0   bookID                                         object 
 1   genres.history, historical fiction, biography  float64
 2   genres.fiction                                 float64
 3   genres.fantasy, paranormal                     float64
 4   genres.mystery, thriller, crime                float64
 5   genres.poetry                                  float64
 6   genres.romance                                 float64
 7   genres.non-fiction                             float64
 8   genres.children                                float64
 9   genres.young-adult                             float64
 10  genres.comics, graphic                         float64
 11  book_genre                                     object 
dtypes: float64(10), object(2)
memory usage: 21

In [27]:
# Save the final genre dataframe (df_Genre_Bright) to a CSV file for reuse in the following ML project
df_Genre_Bright.to_csv(r"C:\Users\gunon\Documents\bootcamp-main\3-projects\ML_Book_Valuations\Genre.csv", index=False)
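
One point to watch when the CSV is read back: `bookID` is stored as text in df_Genre (dtype `object` above), but `pd.read_csv` infers numeric columns by default. The sketch below uses a temporary file and a made-up two-row frame (both assumptions) to show how passing `dtype` keeps the identifiers as strings, so that lookups such as `== '1'` still match.

```python
import os
import tempfile

import pandas as pd

# Made-up two-row frame standing in for df_Genre_Bright (assumption)
df = pd.DataFrame({
    "bookID": ["1", "5333265"],
    "book_genre": ["genres.fantasy, paranormal", "genres.missing"],
})

# Hypothetical temporary path, not the project path
path = os.path.join(tempfile.mkdtemp(), "Genre_sample.csv")
df.to_csv(path, index=False)

# Default read: bookID is inferred as an integer column
default = pd.read_csv(path)

# Explicit dtype: bookID stays a string, as in the original dataframe
as_text = pd.read_csv(path, dtype={"bookID": str})

print(default["bookID"].dtype, as_text["bookID"].tolist())
```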