Loading the final data
Creating 3-4 professional charts
Saving to visuals/

# Analysis of romance subgenres

**Author** : Lucie Dou  
**Date** : January 2025  
**Goal** : Identify and analyze the different subgenres of romance to understand their specific characteristics.

---

## Contents
1. Loading cleaned data
2. Exploration of existing genres
3. Identifying romance subgenres
4. Creation of indicators by subgenre
5. Comparative analyses
6. Key insights
7. Saving enriched data

In [2]:
# Import libraries
import pandas as pd
import numpy as np
from collections import Counter

## 1. Loading cleaned data

We load the data prepared in the previous notebook.

In [4]:
# Loading cleaned-up romances
df_romance = pd.read_csv("../data/processed/romance_books_clean.csv", 
                         encoding='utf-8', 
                         sep=';')

print(f"Dataset loaded: {len(df_romance):,} romances\n")

# Quick check
print("=== DATA OVERVIEW ===")
print(df_romance.head())

Dataset loaded: 1,566 romances

=== DATA OVERVIEW ===
   Book Id                                              Title  \
0       57  A Changeling for All Seasons (Changeling Seaso...   
1       59                                 The Changeling Sea   
2       66         The Changeling (Daughters of England  #15)   
3      151                                      Anna Karenina   
4      152                                      Anna Karenina   

                                              Author average_rating  \
0  Angela Knight/Sahara Kelly/Judy Mays/Marteeka ...           3,76   
1                               Patricia A, McKillip           4,06   
2                                      Philippa Carr           3,98   
3     Leo Tolstoy/Richard Pevear/Larissa Volokhonsky           4,05   
4       Leo Tolstoy/David Magarshack/Priscilla Meyer           4,05   

         isbn       isbn13 language_code  num_pages  ratings_count  \
0  1595962808   9,7816E+12           eng        304       

In [5]:
# Loading cleaned-up romances
df_romance = pd.read_csv("../data/processed/romance_books_clean.csv", 
                         encoding='utf-8', 
                         sep=';')

print(f"Dataset loaded: {len(df_romance):,} romances\n")

# Safety check and cleaning
print("=== CHECKING THE TYPE OF average_rating ===")
print(f"Current type: {df_romance['average_rating'].dtype}\n")

# If it's in text (object) format, clean and convert it
if df_romance['average_rating'].dtype == 'object':
    print(" Column stored as text, conversion needed...")
    
    # Replace decimal commas with periods
    df_romance['average_rating'] = df_romance['average_rating'].str.replace(',', '.')
    
    # Convert to numeric (errors='coerce' sets NaN if impossible)
    df_romance['average_rating'] = pd.to_numeric(df_romance['average_rating'], errors='coerce')
    
    # Remove rows whose rating could not be converted (NaN)
    before = len(df_romance)
    df_romance = df_romance.dropna(subset=['average_rating'])
    after = len(df_romance)
    
    print(" Conversion successful")
    if before > after:
        print(f" {before - after} row(s) removed (invalid values)")
else:
    print(" Already in numeric format")

# Check that the ratings are consistent (between 0 and 5)
print("\n=== QUICK STATISTICS ===")
print(f"Min rating: {df_romance['average_rating'].min()}")
print(f"Max rating: {df_romance['average_rating'].max()}")
print(f"Mean rating: {df_romance['average_rating'].mean():.2f}")

# Filter out-of-range ratings (outside the 0-5 range)
df_romance = df_romance[
    (df_romance['average_rating'] >= 0) & 
    (df_romance['average_rating'] <= 5)
]

print(f"\n Final dataset: {len(df_romance):,} valid romances")

Dataset loaded: 1,566 romances

=== CHECKING THE TYPE OF average_rating ===
Current type: object

 Column stored as text, conversion needed...
 Conversion successful

=== QUICK STATISTICS ===
Min rating: 2.4
Max rating: 4.55
Mean rating: 3.90

 Final dataset: 1,566 valid romances
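
As an alternative to replacing commas after loading, `pd.read_csv` accepts a `decimal` parameter that parses decimal commas at read time. The sketch below uses a tiny in-memory sample (hypothetical rows, not the real file) to illustrate the idea. It assumes every value in the column is a well-formed number; a column containing invalid entries would still load as text and need the `pd.to_numeric(..., errors='coerce')` step shown above.

```python
import io

import pandas as pd

# Hypothetical two-row sample mimicking the file layout (';' separator, decimal commas)
sample_csv = (
    "Title;average_rating;num_pages\n"
    "Book A;3,76;304\n"
    "Book B;4,06;217\n"
)

# decimal=',' tells the parser to read "3,76" as the float 3.76
df_sample = pd.read_csv(io.StringIO(sample_csv), sep=';', decimal=',')

print(df_sample.dtypes)
print(df_sample)
```

With this option the rating column arrives as a float, so the type check and string replacement become a safety net instead of a required step.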


## 2. Exploration of existing genres

Before filtering, let's look at all the genres present in our dataset to identify the relevant subgenres.

In [6]:
# Extract all individual genres
all_genres = []

for genres_str in df_romance['genres'].dropna():
    # Split on both ';' and ','
    genres_list = genres_str.replace(';', ',').split(',')
    # Strip surrounding whitespace and drop empty entries
    genres_list = [g.strip() for g in genres_list if g.strip()]
    all_genres.extend(genres_list)

# Count the occurrences
genre_counts = Counter(all_genres)

print("=== TOP 20 MOST FREQUENT GENRES ===\n")
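for genre, count in genre_counts.most_common(20):
    # Tag occurrences relative to the number of books; this can exceed 100%
    # when the same tag is counted more than once for a book
    percentage = count / len(df_romance) * 100
    print(f"{genre:35} : {count:4} ({percentage:5.1f}%)")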

=== TOP 20 MOST FREQUENT GENRES ===

Romance                             : 2302 (147.0%)
Fiction                             : 1427 ( 91.1%)
Fantasy                             : 1099 ( 70.2%)
Historical                          :  985 ( 62.9%)
Literature                          :  691 ( 44.1%)
Adult                               :  523 ( 33.4%)
Contemporary                        :  522 ( 33.3%)
Historical Fiction                  :  499 ( 31.9%)
Young Adult                         :  489 ( 31.2%)
Classics                            :  482 ( 30.8%)
Womens Fiction                      :  436 ( 27.8%)
Novels                              :  408 ( 26.1%)
Chick Lit                           :  370 ( 23.6%)
Mystery                             :  350 ( 22.3%)
Paranormal                          :  338 ( 21.6%)
Cultural                            :  336 ( 21.5%)
Sequential Art                      :  335 ( 21.4%)
European Literature                 :  333 ( 21.3%)
Humor                      
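
The 147.0% share for "Romance" shows that the tag is counted more than once for some books, since there are more occurrences than rows. A minimal sketch of a per-book count follows. It deduplicates the tags of each book with a set before counting, so no share can exceed 100%. The demo DataFrame and the helper `split_genres` are hypothetical and only illustrate the approach.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample: the first book carries the "Romance" tag twice
df_demo = pd.DataFrame({
    'genres': ['Romance,Romance;Fantasy', 'Romance,Historical', None]
})


def split_genres(genres_str):
    """Return the set of distinct, stripped tags in one genres string."""
    return {g.strip() for g in genres_str.replace(';', ',').split(',') if g.strip()}


# Count each tag at most once per book
book_counts = Counter()
for genres_str in df_demo['genres'].dropna():
    book_counts.update(split_genres(genres_str))

# Share of books (with a genres value) that carry each tag
n_books = int(df_demo['genres'].notna().sum())
shares = {genre: count / n_books * 100 for genre, count in book_counts.items()}

for genre, count in book_counts.most_common():
    print(f"{genre:35} : {count:4} ({shares[genre]:5.1f}%)")
```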

## 3. Identifying romance subgenres

We identify the most represented specific subgenres of romance.

In [7]:
# Filter only subgenres containing "Romance"
print("\n=== IDENTIFIED ROMANCE SUBGENRES ===\n")

romance_subgenres = [(g, c) for g, c in genre_counts.items() if 'Romance' in g]
romance_subgenres.sort(key=lambda x: x[1], reverse=True)

for genre, count in romance_subgenres:
    percentage = count / len(df_romance) * 100
    print(f"{genre:40} : {count:4} ({percentage:5.1f}%)")


=== IDENTIFIED ROMANCE SUBGENRES ===

Romance                                  : 2302 (147.0%)
Contemporary Romance                     :  231 ( 14.8%)
Historical Romance                       :  214 ( 13.7%)
Paranormal Romance                       :  136 (  8.7%)
Category Romance                         :   90 (  5.7%)
Regency Romance                          :   38 (  2.4%)
M M Romance                              :   37 (  2.4%)
M F Romance                              :   17 (  1.1%)
Fantasy Romance                          :   16 (  1.0%)
Erotic Romance                           :   15 (  1.0%)
Christian Romance                        :   13 (  0.8%)
Western Romance                          :   10 (  0.6%)
Gothic Romance                           :    9 (  0.6%)
Medieval Romance                         :    9 (  0.6%)
Planetary Romance                        :    6 (  0.4%)
Western Historical Romance               :    5 (  0.3%)
Harlequin Romance                        :    4 (

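**Main subgenres selected** :
- Contemporary Romance
- Paranormal Romance
- Historical Romance
- Erotic Romance
- Romantic Suspense
- Fantasy Romance

Only Contemporary, Historical, and Paranormal Romance exceed 100 occurrences. Erotic Romance (15) and Fantasy Romance (16) are small samples, and Romantic Suspense is missing from the listing above because its name does not contain the string "Romance". These 6 subgenres will be analyzed in detail.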
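
The filter `'Romance' in g` cannot match "Romantic Suspense". One way to catch both spellings is a case-insensitive regular expression, sketched below on a small hypothetical `Counter` (the counts are made up for the demo and are not taken from the dataset).

```python
import re
from collections import Counter

# Hypothetical tag counts, for illustration only
demo_counts = Counter({
    'Romance': 10,
    'Historical Romance': 4,
    'Romantic Suspense': 3,
    'Fantasy': 5,
    'Roman': 1,
})

# Match the whole words "romance" or "romantic", ignoring case
pattern = re.compile(r'\broman(?:ce|tic)\b', re.IGNORECASE)

matches = sorted(
    ((genre, count) for genre, count in demo_counts.items() if pattern.search(genre)),
    key=lambda item: item[1],
    reverse=True,
)

for genre, count in matches:
    print(f"{genre:40} : {count:4}")
```

The word boundaries keep unrelated tags such as "Roman" out of the result.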

## 4. Creation of indicators by subgenre

We create a boolean column for each main subgenre.

In [8]:
# Define the subgenres to be analyzed
subgenres = {
    'Contemporary': 'Contemporary Romance',
    'Historical': 'Historical Romance',
    'Paranormal': 'Paranormal Romance',
    'Erotic': 'Erotic Romance',
    'Suspense': 'Romantic Suspense',
    'Fantasy': 'Fantasy Romance'
}

# Create indicator columns
for key, genre_name in subgenres.items():
    col_name = f'is_{key.lower()}'
    df_romance[col_name] = df_romance['genres'].str.contains(genre_name, case=False, na=False)
    count = df_romance[col_name].sum()
    print(f"{genre_name:25} : {count:4} books ({count/len(df_romance)*100:5.1f}%)")


Contemporary Romance      :  230 books ( 14.7%)
Historical Romance        :  163 books ( 10.4%)
Paranormal Romance        :  136 books (  8.7%)
Erotic Romance            :   15 books (  1.0%)
Romantic Suspense         :   89 books (  5.7%)
Fantasy Romance           :   16 books (  1.0%)


**Note** : A book can belong to several subgenres (e.g., "Paranormal Historical Romance").
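
Because `str.contains` matches substrings, a tag such as "Western Historical Romance" also sets the Historical Romance indicator. If exact tag matching is preferred, the genres string can be split first, as in the sketch below. The helper `has_tag` and the demo rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows: the first one only carries a longer tag that contains the target
df_demo = pd.DataFrame({
    'genres': [
        'Romance,Western Historical Romance',
        'Romance;Historical Romance',
        'Fantasy',
    ]
})


def has_tag(genres_str, tag):
    """True if `tag` is one of the individual tags in `genres_str` (case-insensitive)."""
    if not isinstance(genres_str, str):
        return False
    tags = {g.strip().lower() for g in genres_str.replace(';', ',').split(',')}
    return tag.lower() in tags


# Substring matching, as used in the indicator columns
substring_match = df_demo['genres'].str.contains('Historical Romance', case=False, na=False)

# Exact tag matching
exact_match = df_demo['genres'].apply(lambda s: has_tag(s, 'Historical Romance'))

print(pd.DataFrame({'substring': substring_match, 'exact': exact_match}))
```

Which behaviour is right depends on the analysis: substring matching groups narrower variants under the broader subgenre, while exact matching keeps them separate.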

## 5. Comparative analyses

Let's compare the subgenres based on different criteria.

### 5.1 Overall statistics by subgenre

In [9]:
# Prepare the results
results = []

for key, genre_name in subgenres.items():
    mask = df_romance[f'is_{key.lower()}']
    count = mask.sum()
    
    if count > 0:  # Skip empty subgenres (their means would be NaN)
        avg_rating = df_romance[mask]['average_rating'].mean()
        avg_engagement = df_romance[mask]['ratings_count'].mean()
        avg_pages = df_romance[mask]['num_pages'].mean()
        
        results.append({
            'Subgenre': genre_name,
            'Number': count,
            'Percentage': f"{count/len(df_romance)*100:.1f}%",
            'Average rating': f"{avg_rating:.2f}",
            'Average engagement': f"{avg_engagement:.0f}",
            'Average Pages': f"{avg_pages:.0f}"
        })

# Create and display the table
df_results = pd.DataFrame(results)
df_results = df_results.sort_values('Number', ascending=False)

print("=== COMPARISON OF SUBGENRES ===\n")
print(df_results.to_string(index=False))

=== COMPARISON OF SUBGENRES ===

            Subgenre  Number Percentage Average rating Average engagement Average Pages
Contemporary Romance     230      14.7%           3.79              31531           356
  Historical Romance     163      10.4%           3.93              22674           426
  Paranormal Romance     136       8.7%           3.97              56390           377
   Romantic Suspense      89       5.7%           3.86               4105           354
     Fantasy Romance      16       1.0%           3.98               6062           500
      Erotic Romance      15       1.0%           3.85               3662           304


### 5.2 Ranking by quality (average rating)

In [12]:
# Sort by average rating (convert to float for sorting)
df_results_quality = df_results.copy()
df_results_quality['Note_num'] = df_results_quality['Average rating'].astype(float)
df_results_quality = df_results_quality.sort_values('Note_num', ascending=False)

print("\n=== RANKING BY AVERAGE RATING ===\n")
for i, (_, row) in enumerate(df_results_quality.iterrows(), 1):
    print(f"{i:02d}. {row['Subgenre']:25} : {row['Average rating']}/5")


=== RANKING BY AVERAGE RATING ===

01. Fantasy Romance           : 3.98/5
02. Paranormal Romance        : 3.97/5
03. Historical Romance        : 3.93/5
04. Romantic Suspense         : 3.86/5
05. Erotic Romance            : 3.85/5
06. Contemporary Romance      : 3.79/5


### 5.3 Ranking by engagement (number of ratings)

In [13]:
# Sort by engagement
df_results_engagement = df_results.copy()
df_results_engagement['Engagement_num'] = df_results_engagement['Average engagement'].str.replace(',', '').astype(float)
df_results_engagement = df_results_engagement.sort_values('Engagement_num', ascending=False)

print("\n=== RANKING BY ENGAGEMENT ===\n")
for i, (_, row) in enumerate(df_results_engagement.iterrows(), 1):
    print(f"{i:02d}. {row['Subgenre']:25} : {row['Average engagement']} average engagement")


=== RANKING BY ENGAGEMENT ===

01. Paranormal Romance        : 56390 average engagement
02. Contemporary Romance      : 31531 average engagement
03. Historical Romance        : 22674 average engagement
04. Fantasy Romance           : 6062 average engagement
05. Romantic Suspense         : 4105 average engagement
06. Erotic Romance            : 3662 average engagement
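
Both rankings convert formatted strings back to numbers before sorting. An alternative, sketched below on hypothetical data, keeps the summary table numeric and applies formatting only when printing, so sorting needs no conversion step. The column names mirror the ones used above; the values are invented for the demo.

```python
import pandas as pd

# Hypothetical books with two indicator columns
df_demo = pd.DataFrame({
    'average_rating': [3.8, 4.0, 3.9, 4.2],
    'ratings_count': [100, 300, 50, 150],
    'is_contemporary': [True, True, False, False],
    'is_fantasy': [False, False, True, True],
})

labels = {
    'is_contemporary': 'Contemporary Romance',
    'is_fantasy': 'Fantasy Romance',
}

rows = []
for col, label in labels.items():
    subset = df_demo[df_demo[col]]
    rows.append({
        'Subgenre': label,
        'Number': len(subset),
        'Average rating': subset['average_rating'].mean(),       # kept as float
        'Average engagement': subset['ratings_count'].mean(),    # kept as float
    })

summary = pd.DataFrame(rows)

# Sorting works directly on the numeric columns
by_rating = summary.sort_values('Average rating', ascending=False)
by_engagement = summary.sort_values('Average engagement', ascending=False)

# Formatting is applied only at display time
print(by_rating.to_string(index=False, float_format='{:.2f}'.format))
```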


## 6. Key Insights

Summary of main findings:

In [15]:
# Calculate some insights automatically
best_rated = df_results_quality.iloc[0]
most_engaging = df_results_engagement.iloc[0]
most_popular = df_results.iloc[0]

print("=== KEY INSIGHTS ===\n")
print(f"Most published subgenre : {most_popular['Subgenre']} ({most_popular['Number']} books)")
print(f"Best average rating : {best_rated['Subgenre']} ({best_rated['Average rating']}/5)")
print(f"Highest engagement : {most_engaging['Subgenre']} ({most_engaging['Average engagement']} ratings on average)")

# Compare Historical vs Contemporary
hist_rating = float(df_results[df_results['Subgenre'] == 'Historical Romance']['Average rating'].values[0])
contemp_rating = float(df_results[df_results['Subgenre'] == 'Contemporary Romance']['Average rating'].values[0])

if hist_rating > contemp_rating:
    diff = hist_rating - contemp_rating
    print(f"Historical Romance ({hist_rating:.2f}) is rated higher than Contemporary ({contemp_rating:.2f}) by {diff:.2f} points")
else:
    diff = contemp_rating - hist_rating
    print(f"Contemporary Romance ({contemp_rating:.2f}) is rated higher than Historical ({hist_rating:.2f}) by {diff:.2f} points")

=== KEY INSIGHTS ===

Most published subgenre : Contemporary Romance (230 books)
Best average rating : Fantasy Romance (3.98/5)
Highest engagement : Paranormal Romance (56390 ratings on average)
Historical Romance (3.93) is rated higher than Contemporary (3.79) by 0.14 points


## 7. Saving enriched data

The data with subgenre indicators is ready for visualization.
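
A minimal sketch of the save step, assuming the enriched table is written with the same separator and encoding as the input file. The output file name `romance_books_subgenres.csv` is hypothetical, and a temporary directory stands in for the project's data folder so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical enriched data with one indicator column
df_demo = pd.DataFrame({
    'Title': ['Book A', 'Book B'],
    'average_rating': [3.76, 4.06],
    'is_fantasy': [False, True],
})

# Temporary directory used here in place of the real data folder
output_dir = Path(tempfile.mkdtemp())
output_path = output_dir / 'romance_books_subgenres.csv'  # hypothetical file name

# Same conventions as the input file: ';' separator, UTF-8, no index column
df_demo.to_csv(output_path, sep=';', index=False, encoding='utf-8')

# Reload to confirm the file round-trips
df_check = pd.read_csv(output_path, sep=';', encoding='utf-8')
print(f"Saved {len(df_check):,} rows and {df_check.shape[1]} columns to {output_path.name}")
```

Note that this writes ratings with a decimal point, not a decimal comma, so the file can be reloaded without the conversion step.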


## Next steps

**Next Notebook** : `03_visualizations.ipynb`
- Distribution graphs
- Visual comparisons of quality vs. popularity
- Distribution of ratings by subgenre