# Romance Market Analysis - Data Exploration and Cleaning

**Author**: Lucie Dou  
**Date**: January 2025  
**Goal**: Load, explore, and clean Goodreads data in preparation for a romance market analysis

---

## Contents
1. Loading raw data
2. Initial Exploration
3. Romance filtering
4. Cleaning and converting types
5. Final statistics
6. Backing up clean data

In [2]:
# Importing libraries
import pandas as pd
import numpy as np

## 1. Loading raw data

We load the Goodreads dataset from Kaggle (11,000+ books).

**Loading parameters**:
- Encoding: `latin-1` (European accented characters)
- Separator: semicolon `;`
- Malformed lines: skipped rather than raising an error

In [3]:
# Loading CSV with adapted parameters
df = pd.read_csv("../data/raw/Goodreads_books_with_genres.csv", 
                 encoding='latin-1',  # Accented characters
                 sep=';',             # Semicolon separator
                 on_bad_lines='skip') # Ignore malformed lines

# Dimensions overview
print(f"Dataset loaded: {df.shape[0]:,} books × {df.shape[1]} columns\n")

# Show the first few rows
df.head()


Dataset loaded: 11,127 books × 13 columns



| | Book Id | Title | Author | average_rating | isbn | isbn13 | language_code | num_pages | ratings_count | text_reviews_count | publication_date | publisher | genres |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Harry Potter and the Half-Blood Prince (Harry ... | J,K, Rowling/Mary GrandPré | 457 | 0439785960 | 9,78044E+12 | eng | 652 | 2095690 | 27591 | 9/16/2006 | Scholastic Inc, | Fantasy;Young Adult;Fiction;Fantasy,Magic;Chil... |
| 1 | 2 | Harry Potter and the Order of the Phoenix (Har... | J,K, Rowling/Mary GrandPré | 449 | 0439358078 | 9,78044E+12 | eng | 870 | 2153167 | 29221 | 9/1/2004 | Scholastic Inc, | Fantasy;Young Adult;Fiction;Fantasy,Magic;Chil... |
| 2 | 4 | Harry Potter and the Chamber of Secrets (Harry... | J,K, Rowling | 442 | 0439554896 | 9,78044E+12 | eng | 352 | 6333 | 244 | 11/1/2003 | Scholastic | Fantasy;Fiction;Young Adult;Fantasy,Magic;Chil... |
| 3 | 5 | Harry Potter and the Prisoner of Azkaban (Harr... | J,K, Rowling/Mary GrandPré | 456 | 043965548X | 9,78044E+12 | eng | 435 | 2339585 | 36325 | 5/1/2004 | Scholastic Inc, | Fantasy;Fiction;Young Adult;Fantasy,Magic;Chil... |
| 4 | 8 | Harry Potter Boxed Set Books 1-5 (Harry Potte... | J,K, Rowling/Mary GrandPré | 478 | 0439682584 | 9,78044E+12 | eng | 2690 | 41428 | 164 | 9/13/2004 | Scholastic | Fantasy;Young Adult;Fiction;Fantasy,Magic;Adve... |
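Because `on_bad_lines='skip'` drops malformed rows silently, it can be useful to know how many rows were lost. The sketch below shows one possible way to count them, using a callable for `on_bad_lines` (this requires `engine="python"` and pandas 1.4 or later). The in-memory sample and the names `skipped` and `record_bad_line` are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw file: the second data row has one field too many.
sample = io.StringIO(
    "Book Id;Title;genres\n"
    "1;Book A;Romance\n"
    "2;Book B;Fantasy;unexpected extra field\n"
    "3;Book C;Classics\n"
)

skipped = []  # collects every malformed row as a list of fields


def record_bad_line(fields):
    """Remember the malformed row and return None so that pandas drops it."""
    skipped.append(fields)
    return None


df_sample = pd.read_csv(
    sample,
    sep=";",
    engine="python",               # callables are only supported by the python engine
    on_bad_lines=record_bad_line,
)

print(f"Rows kept: {len(df_sample)}, rows skipped: {len(skipped)}")
```

The same callable could be passed when reading the real file, at the cost of the slower python parsing engine.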



## 2. Initial Exploration

Let's look at the data structure before cleaning.

In [4]:
# Information about the columns
print("=== Information about the columns ===\n")
df.info()  # prints the summary itself and returns None, so no print() wrapper

print("\n=== TYPES OF DATA ===\n")
print(df.dtypes)

print("\n=== MISSING VALUES ===\n")
print(df.isnull().sum())

=== Information about the columns ===

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11127 entries, 0 to 11126
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Book Id             11127 non-null  int64 
 1   Title               11127 non-null  object
 2   Author              11127 non-null  object
 3   average_rating      11127 non-null  object
 4   isbn                11127 non-null  object
 5   isbn13              11127 non-null  object
 6   language_code       11127 non-null  object
 7   num_pages           11127 non-null  int64 
 8   ratings_count       11127 non-null  int64 
 9   text_reviews_count  11127 non-null  int64 
 10  publication_date    11127 non-null  object
 11  publisher           11127 non-null  object
 12  genres              11030 non-null  object
dtypes: int64(4), object(9)
memory usage: 1.1+ MB

=== TYPES OF DATA ===

Book Id                int64
Title                 object
...

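**Key observations**:
- 11,127 books in total
- 13 columns available
- The `genres` column has 97 missing values (11,030 non-null out of 11,127)
- `average_rating` and `publication_date` are stored as `object` and will need type conversion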
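The missing-value counts above are easier to judge as a share of the dataset (97 of 11,127 is about 0.9%). A minimal sketch of such a summary follows, on a small made-up frame; `sample` and `missing_report` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical miniature dataset with two missing genres
sample = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C", "Book D"],
    "genres": ["Romance;Fiction", None, "Fantasy", None],
})

# Count and percentage of missing values per column
missing_report = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "percent": (sample.isnull().mean() * 100).round(1),
})

# Show only the columns that actually have gaps
print(missing_report[missing_report["missing"] > 0])
```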

## 3. Romance filtering

We keep only the books whose `genres` field contains "Romance". Rows with a missing genre are excluded.

In [5]:
# Filtering (case-insensitive, ignores null values)
df_romance = df[df['genres'].str.contains('Romance', case=False, na=False)].copy()

print(f"Identified romances: {len(df_romance):,} books")
print(f"Representing: {len(df_romance)/len(df)*100:.1f}% of dataset")

# Preview
df_romance.head()

Identified romances: 1,566 books
Representing: 14.1% of dataset


| | Book Id | Title | Author | average_rating | isbn | isbn13 | language_code | num_pages | ratings_count | text_reviews_count | publication_date | publisher | genres |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 33 | 57 | A Changeling for All Seasons (Changeling Seaso... | Angela Knight/Sahara Kelly/Judy Mays/Marteeka ... | 376 | 1595962808 | 9,7816E+12 | eng | 304 | 167 | 4 | 11/1/2005 | Changeling Press | Romance;Fantasy,Paranormal;Anthologies;Adult F... |
| 35 | 59 | The Changeling Sea | Patricia A, McKillip | 406 | 141312629 | 9,78014E+12 | eng | 137 | 4454 | 302 | 4/14/2003 | Firebird | Fantasy;Young Adult;Romance;Fiction;Fantasy,Ma... |
| 38 | 66 | The Changeling (Daughters of England #15) | Philippa Carr | 398 | 449146979 | 9,78045E+12 | eng | 369 | 345 | 12 | 8/28/1990 | Ivy Books | Historical,Historical Fiction;Romance;Fiction;... |
| 89 | 151 | Anna Karenina | Leo Tolstoy/Richard Pevear/Larissa Volokhonsky | 405 | 143035002 | 9,78014E+12 | eng | 838 | 16643 | 1851 | 5/31/2004 | Penguin Classics | Classics;Fiction;Romance;Cultural,Russia;Histo... |
| 90 | 152 | Anna Karenina | Leo Tolstoy/David Magarshack/Priscilla Meyer | 405 | 451528611 | 9,78045E+12 | eng | 960 | 109420 | 5696 | 11/5/2002 | Signet | Classics;Fiction;Romance;Cultural,Russia;Histo... |
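A plain substring match also accepts any label that merely contains the letters "romance" (for instance, a label such as "Necromancers" would pass). Whether such labels occur in this dataset is not verified here. The sketch below compares the substring filter with a word-boundary variant on made-up genre strings; the sample values and variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical genre strings in the same "a;b,c;d" layout as the dataset
genres = pd.Series([
    "Classics;Fiction;Romance;Cultural,Russia",
    "Historical,Historical Romance;Fiction",
    "Fantasy;Necromancers",  # contains the letters "romance" inside another word
    None,                    # missing genre
])

# Filter used in the notebook: plain case-insensitive substring match
substring_mask = genres.str.contains("Romance", case=False, na=False)

# Stricter variant: "romance" must appear as a whole word
word_mask = genres.str.contains(r"\bromance\b", case=False, na=False)

print(pd.DataFrame({"genres": genres, "substring": substring_mask, "whole_word": word_mask}))
```

Comparing `substring_mask.sum()` with `word_mask.sum()` on the real column would show whether the distinction matters for the 1,566 books kept.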


## 4. Cleaning and converting types

### Handling missing values

In [6]:
# Check for missing values after cleaning
print("=== MISSING VALUES AFTER CLEANING ===\n")
missing = df_romance.isnull().sum()
print(missing[missing > 0] if missing.sum() > 0 else "No missing values!")

=== MISSING VALUES AFTER CLEANING ===

No missing values!
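The type-conversion cells themselves do not appear in this export. As a rough sketch of what such a step might look like, the example below assumes that `average_rating` is stored as text with a decimal comma (e.g. `"4,57"`) and that `publication_date` uses the `month/day/year` layout seen in the preview. Both are assumptions about the raw file, and the sample values are invented.

```python
import pandas as pd

# Hypothetical raw values, including one bad rating and one impossible date
sample = pd.DataFrame({
    "average_rating": ["4,57", "3,98", "not a rating"],
    "publication_date": ["9/16/2006", "11/1/2003", "11/31/2000"],
})

# Decimal comma -> decimal point, then convert; unparseable values become NaN
sample["average_rating"] = pd.to_numeric(
    sample["average_rating"].str.replace(",", ".", regex=False),
    errors="coerce",
)

# Parse month/day/year dates; invalid dates become NaT
sample["publication_date"] = pd.to_datetime(
    sample["publication_date"], format="%m/%d/%Y", errors="coerce"
)

print(sample.dtypes)
print(sample.isnull().sum())
```

With `errors="coerce"`, conversion failures surface as missing values, which is why a missing-value check after conversion is worthwhile.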


## 5. Final statistics

Dataset cleaned and ready for analysis!

In [12]:
print("=== STATISTICS OF CLEANED ROMANCES ===\n")
print(f"Number of books: {len(df_romance):,}")
print(f"Average rating: {df_romance['average_rating'].mean():.2f}/5")
print(f"Minimum rating: {df_romance['average_rating'].min()}")
print(f"Maximum rating: {df_romance['average_rating'].max()}")
print(f"Average number of ratings: {df_romance['ratings_count'].mean():,.0f}")
print(f"Average number of pages: {df_romance['num_pages'].mean():.0f}")

=== STATISTICS OF CLEANED ROMANCES ===

Number of books: 1,566
Average rating: 3.90/5
Minimum rating: 2.4
Maximum rating: 4.55
Average number of ratings: 29,295
Average number of pages: 353


**Results**:
- 1,566 usable romances
- Ratings range from 2.4 to 4.55, consistent with the 0-5 scale
- No out-of-range ratings, which suggests a good-quality dataset
- High engagement (about 29k ratings per book on average)
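The consistency claim above rests on the minimum and maximum alone. A slightly more explicit check is sketched below: a hard range test against the 0-5 scale, plus the common 1.5 × IQR rule to list unusual values worth a second look. The sample ratings are made up, and the IQR threshold is a conventional choice rather than something taken from the notebook.

```python
import pandas as pd

# Hypothetical ratings; 6.2 is deliberately outside the 0-5 scale
ratings = pd.Series([3.9, 4.1, 2.4, 4.55, 3.7, 6.2], name="average_rating")

# Hard validity check: every rating must lie on the 0-5 scale
in_range = ratings.between(0, 5)
print(f"Out-of-range ratings: {(~in_range).sum()}")

# Softer check: values beyond 1.5 * IQR from the quartiles
q1, q3 = ratings.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = ratings[(ratings < q1 - 1.5 * iqr) | (ratings > q3 + 1.5 * iqr)]
print(iqr_outliers)
```

Values flagged by the IQR rule are candidates for review, not necessarily errors: a legitimately low-rated book can be flagged while remaining valid data.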

## 6. Backing up clean data

The data is now ready for subgenre analysis.

In [10]:
# Save to the processed/ folder
output_path = "../data/processed/romance_books_clean.csv"
df_romance.to_csv(output_path, index=False, sep=';', encoding='utf-8')

print(f"Data saved to: {output_path}")
print(f"{len(df_romance)} romances ready for analysis!")

Data saved to: ../data/processed/romance_books_clean.csv
1566 romances ready for analysis!
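Since the file is read as `latin-1` and written as `utf-8`, the next notebook has to load it with the same separator and the new encoding. A minimal round-trip check is sketched below on a made-up frame written to a temporary directory; the sample rows and the temporary path are illustrative assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned rows, including an accented character
sample = pd.DataFrame({
    "Title": ["Anna Karenina", "Émile"],
    "average_rating": [4.5, 3.75],
    "num_pages": [838, 512],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "romance_books_clean.csv"
    # Same writer settings as the notebook
    sample.to_csv(path, index=False, sep=";", encoding="utf-8")
    # Reload with matching separator and encoding
    reloaded = pd.read_csv(path, sep=";", encoding="utf-8")

# Raises an AssertionError if the reloaded data differs from what was saved
pd.testing.assert_frame_equal(sample, reloaded)
print("Round trip OK")
```

Note that columns converted to datetime would come back as text unless `parse_dates` is passed to `read_csv`.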


## Next steps

**Next notebook**: `02_subgenres_analysis.ipynb`
- Identifying romance subgenres
- Comparative analysis (Historical, Contemporary, Paranormal, etc.)
- Statistics by subgenre