# Analysis of romance subgenres

**Author**: Lucie Dou  
**Date**: January 2025  
**Goal**: Identify and analyze the different subgenres of romance to understand their specific characteristics.

---

## Content
1. Loading cleaned data
2. Exploration of existing genres
3. Creation of indicators by subgenre
4. Comparative analyses
5. Saving enriched data

In [4]:
# Import libraries
import pandas as pd
import numpy as np
from collections import Counter

## 1. Loading cleaned data

We load the data prepared in the previous notebook.

In [24]:
DATA_PATH = "../data/processed/romance_books_clean.csv"

df = pd.read_csv(
    DATA_PATH,
    sep=";",
    encoding="latin-1",
    on_bad_lines="skip"
)

print(f"Dataset loaded: {len(df):,} books")

Dataset loaded: 1,566 books
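
With `on_bad_lines="skip"`, malformed rows are dropped silently, so the printed total does not say whether any rows were lost. As a rough sketch (the toy CSV text and the `record_bad_line` helper are hypothetical, not part of this notebook), pandas also accepts a callable for `on_bad_lines` when the Python parsing engine is used, which makes it possible to count the skipped rows:

```python
import io

import pandas as pd

# Toy semicolon-separated text (hypothetical); the second data row has one field too many.
raw = "title;average_rating\nBook A;3,9\nBook B;4,1;extra\nBook C;4,0\n"

skipped = []

def record_bad_line(fields):
    """Remember the malformed row, then return None so pandas skips it."""
    skipped.append(fields)
    return None

# A callable for on_bad_lines is only supported with engine="python".
df_toy = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    engine="python",
    on_bad_lines=record_bad_line,
)

print(f"Rows kept: {len(df_toy)}, rows skipped: {len(skipped)}")
```

The Python engine is slower than the default C engine, so this is better suited to a one-off check than to routine loading.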


In [26]:
if df["average_rating"].dtype == "object":
    df["average_rating"] = (
        df["average_rating"]
        .str.replace(",", ".", regex=False)
        .astype(float)
    )

df = df.dropna(subset=["average_rating"])
df = df[(df["average_rating"] >= 0) & (df["average_rating"] <= 5)]

print(f"Valid ratings: {len(df):,}")


Valid ratings: 1,566
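
The cell above relies on `.astype(float)`, which raises an error if a value cannot be parsed. A more tolerant variant, shown here as a sketch on made-up values rather than on the real dataset, uses `pd.to_numeric(..., errors="coerce")` so that unparseable entries become `NaN` and are filtered out together with out-of-range ratings:

```python
import pandas as pd

# Toy ratings as they might appear in a semicolon-separated export
# (hypothetical values, not taken from the dataset).
raw = pd.Series(["3,79", "4.10", "n/a", "7,5", None])

# Normalize the decimal separator, then coerce anything unparseable to NaN.
ratings = pd.to_numeric(raw.str.replace(",", ".", regex=False), errors="coerce")

# Keep only values on the 0-5 scale; NaN never satisfies between(), so it is dropped too.
valid = ratings[ratings.between(0, 5)]

print(valid.tolist())
```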


## 2. Exploration of existing genres

Before filtering, let's look at all the genres present in the dataset and count how often each one appears, so that we can identify the most represented romance subgenres.

In [32]:
all_genres = []

for g in df["genres"].dropna():
    all_genres.extend(
        [x.strip() for x in g.replace(";", ",").split(",") if x.strip()]
    )

genre_counts = Counter(all_genres)

print("\nTop 20 genres:")
for genre, count in genre_counts.most_common(20):
    print(f"{genre:35} {count:5}")


Top 20 genres:
Romance                              2302
Fiction                              1427
Fantasy                              1099
Historical                            985
Literature                            691
Adult                                 523
Contemporary                          522
Historical Fiction                    499
Young Adult                           489
Classics                              482
Womens Fiction                        436
Novels                                408
Chick Lit                             370
Mystery                               350
Paranormal                            338
Cultural                              336
Sequential Art                        335
European Literature                   333
Humor                                 308
Adult Fiction                         304
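
Note that "Romance" is counted 2,302 times although the dataset contains 1,566 books. Because the loop counts every token, this means some genre strings list the same genre more than once. The sketch below, on hypothetical genre strings, contrasts counting every occurrence with counting each genre at most once per book; `split_genres` is an illustrative helper, not a function from this notebook:

```python
from collections import Counter

import pandas as pd

# Toy genre strings (hypothetical); the second book lists "Romance" twice.
genres = pd.Series([
    "Romance, Fiction",
    "Romance; Historical, Romance",
    None,
])

def split_genres(cell: str) -> list[str]:
    """Split a genre string on commas or semicolons and trim whitespace."""
    return [x.strip() for x in cell.replace(";", ",").split(",") if x.strip()]

token_counts = Counter()  # every occurrence, as in the cell above
book_counts = Counter()   # at most one occurrence per book

for cell in genres.dropna():
    tokens = split_genres(cell)
    token_counts.update(tokens)
    book_counts.update(set(tokens))

print(token_counts["Romance"], book_counts["Romance"])
```

Per-book counts are the ones that can be read as "number of books tagged with this genre".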


## 3. Creation of indicators by subgenre

We create a boolean column for each main subgenre.

In [33]:
subgenres = {
    "Contemporary": "Contemporary Romance",
    "Historical": "Historical Romance",
    "Paranormal": "Paranormal Romance",
    "Erotic": "Erotic Romance",
    "Suspense": "Romantic Suspense",
    "Fantasy": "Fantasy Romance"
}

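# Create one boolean indicator per subgenre
# (case-insensitive, literal substring match on the genre string)
for key, label in subgenres.items():
    df[f"is_{key.lower()}"] = df["genres"].str.contains(
        label, case=False, na=False, regex=False
    )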

print("\nSubgenre distribution:")
for key in subgenres:
    col = f"is_{key.lower()}"
    print(f"{subgenres[key]:25} {df[col].sum():5}")



Subgenre distribution:
Contemporary Romance        230
Historical Romance          163
Paranormal Romance          136
Erotic Romance               15
Romantic Suspense            89
Fantasy Romance              16


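These 6 subgenres will be analyzed in detail.

**Note**: A book can belong to several subgenres when its genre list contains more than one of these labels (e.g., both "Paranormal Romance" and "Historical Romance"). Because the indicators rely on substring matching, a combined label such as "Paranormal Historical Romance" is flagged as Historical Romance only.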
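
To make the matching behavior concrete, here is a small sketch on invented genre strings (the three toy rows are assumptions for illustration). It applies the same kind of case-insensitive substring test as the indicator columns and shows which rows end up in more than one subgenre:

```python
import pandas as pd

# Toy genre strings (hypothetical).
genres = pd.Series([
    "Paranormal Romance, Historical Romance",  # two separate labels
    "Paranormal Historical Romance",           # one combined label
    "Contemporary Romance",
])

# Literal, case-insensitive substring tests.
is_paranormal = genres.str.contains(
    "Paranormal Romance", case=False, na=False, regex=False
)
is_historical = genres.str.contains(
    "Historical Romance", case=False, na=False, regex=False
)

# Books flagged in both subgenres.
in_both = is_paranormal & is_historical

print(pd.DataFrame({
    "genres": genres,
    "is_paranormal": is_paranormal,
    "is_historical": is_historical,
    "in_both": in_both,
}))
```

Only the first row is flagged in both subgenres; the combined label in the second row does not contain the exact phrase "Paranormal Romance".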

## 4. Comparative analyses

Let's compare the subgenres based on different criteria.

In [34]:
results = []

for key, label in subgenres.items():
    mask = df[f"is_{key.lower()}"]
    if mask.sum() == 0:
        continue

    results.append({
        "Subgenre": label,
        "Books": mask.sum(),
        "Avg_rating": df.loc[mask, "average_rating"].mean(),
        "Avg_engagement": df.loc[mask, "ratings_count"].mean(),
        "Avg_pages": df.loc[mask, "num_pages"].mean()
    })

df_results = (
    pd.DataFrame(results)
    .sort_values("Books", ascending=False)
)

print("\n=== Subgenre comparison ===")
print(df_results.round(2).to_string(index=False))

# Rankings
print("\n=== Ranking by rating ===")
print(
    df_results
    .sort_values("Avg_rating", ascending=False)[["Subgenre", "Avg_rating"]]
    .round(2)
    .to_string(index=False)
)

print("\n=== Ranking by engagement ===")
print(
    df_results
    .sort_values("Avg_engagement", ascending=False)[["Subgenre", "Avg_engagement"]]
    .round(0)
    .to_string(index=False)
)


=== Subgenre comparison ===
            Subgenre  Books  Avg_rating  Avg_engagement  Avg_pages
Contemporary Romance    230        3.79        31531.03     356.05
  Historical Romance    163        3.93        22673.88     425.53
  Paranormal Romance    136        3.97        56389.82     377.14
   Romantic Suspense     89        3.86         4105.30     354.03
     Fantasy Romance     16        3.98         6062.31     499.62
      Erotic Romance     15        3.85         3661.73     304.07

=== Ranking by rating ===
            Subgenre  Avg_rating
     Fantasy Romance        3.98
  Paranormal Romance        3.97
  Historical Romance        3.93
   Romantic Suspense        3.86
      Erotic Romance        3.85
Contemporary Romance        3.79

=== Ranking by engagement ===
            Subgenre  Avg_engagement
  Paranormal Romance         56390.0
Contemporary Romance         31531.0
  Historical Romance         22674.0
     Fantasy Romance          6062.0
   Romantic Suspense          4105.0
      Erotic Romance          3662.0

The comparative analysis highlights clear structural differences between romance subgenres in terms of volume, reader appreciation, and engagement.

Contemporary Romance is the most represented subgenre in the dataset (230 books). It also generates strong engagement, with the second-highest average number of ratings per book (about 31,500), indicating a large and active readership. However, it records the lowest average rating (3.79), suggesting that high visibility and mass consumption may come with more heterogeneous reader satisfaction.

In contrast, Historical Romance shows a more balanced profile. With fewer titles, it achieves a higher average rating and longer books on average, pointing toward a readership that values depth, immersion, and narrative development over sheer volume.

Paranormal Romance stands out as the most engaging subgenre by a wide margin. While not the most published, it combines very high engagement with one of the best average ratings. This suggests a particularly loyal and invested audience, likely driven by strong series dynamics and genre-specific expectations.

Fantasy Romance, although marginal in terms of number of books (16), achieves the highest average rating (3.98). This indicates strong appreciation among readers, although with so few titles the average is sensitive to individual books. Its relatively low engagement suggests a niche audience rather than a mass market.

Finally, Romantic Suspense and Erotic Romance appear as more specialized subgenres. Both show lower engagement and mid-range ratings, which may reflect more polarized reader expectations or smaller, less active readerships.

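Overall, the results suggest a trade-off between market size and reader satisfaction: the largest subgenre, Contemporary Romance, combines high engagement with the lowest rating, while niche subgenres such as Fantasy Romance achieve stronger appreciation from a smaller, more targeted audience. Paranormal Romance is the exception, pairing the highest engagement with a near-top rating. The rating gaps are small (3.79 to 3.98), so these patterns are best read as tendencies rather than firm conclusions.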
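
One caveat when reading `Avg_engagement`: a mean can be pulled up by a handful of heavily rated titles. Whether that happens in this dataset is not checked here. The sketch below uses invented numbers and a single hypothetical `subgenre` column (the notebook itself uses one boolean column per subgenre) to show how the median can be reported next to the mean as a robustness check:

```python
import pandas as pd

# Toy data (hypothetical): subgenre "A" contains one very popular title.
toy = pd.DataFrame({
    "subgenre": ["A", "A", "A", "B", "B", "B"],
    "ratings_count": [100, 200, 90_000, 4_000, 5_000, 6_000],
})

# Count, mean and median of ratings_count per subgenre.
summary = (
    toy.groupby("subgenre")["ratings_count"]
    .agg(["count", "mean", "median"])
)

print(summary)
```

In this toy example the mean ranks "A" far above "B", while the median ranks them the other way round, which is the kind of discrepancy worth checking before comparing engagement across subgenres.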

## 5. Saving enriched data

The data with subgenre indicators is ready for visualization.

In [35]:

OUTPUT_PATH = "../data/processed/romance_with_subgenres.csv"

df.to_csv(
    OUTPUT_PATH,
    sep=";",
    encoding="latin-1",
    index=False
)

print(f"\nFile saved → {OUTPUT_PATH}")


File saved → ../data/processed/romance_with_subgenres.csv
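
Before moving on, it can be worth confirming that the file reads back as expected with the same separator and encoding. The following is a minimal sketch on a toy frame written to a temporary directory (the column values are invented), not a check performed in this notebook:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame (hypothetical) with an accented title and a boolean indicator.
toy = pd.DataFrame({
    "title": ["Café Society", "Plain Title"],
    "is_historical": [True, False],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "roundtrip.csv"
    # Same separator and encoding as the save cell above.
    toy.to_csv(path, sep=";", encoding="latin-1", index=False)
    reloaded = pd.read_csv(path, sep=";", encoding="latin-1")

# The indicator column should come back as a boolean column.
print(reloaded.dtypes)
print(reloaded["title"].tolist())
```

Keep in mind that `latin-1` cannot represent every character; a title containing characters outside that encoding would make `to_csv` fail with an encoding error, in which case `utf-8` is the usual alternative.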


## Next steps

**Next notebook**: `03_visualizations.ipynb`
- Distribution graphs
- Visual comparisons of quality vs. popularity
- Distribution of ratings by subgenre