# **Identifying Franchise Potential in Manga Source Material**

**A Retrospective Classification Analysis of Community Metrics**

Author: Kenneth Young

Date: December 2025

**Platform:** [MyAnimeList Dataset (Kaggle)](https://www.kaggle.com/datasets/andreuvallhernndez/myanimelist)

In [11]:
# Imports
import pandas as pd
import numpy as np
import ast  # Library for parsing strings like "['Action', 'Adventure']" into actual lists
import re   # Regex for string manipulation
import os
import sys

# 1. Introduction

## 1.1 Problem Statement
**Objective**: The anime industry is a multi-billion-dollar market heavily reliant on adapting existing source material (Manga, Light Novels). However, only a small fraction of published works receive adaptations. This project aims to build a machine learning classification model that identifies the statistical profile of manga that go on to sustain a multimedia franchise.

**The Challenge (Data Causality)**: We are analyzing a static dataset (collected in 2023). This introduces a "Causality Dilemma":

*    **Ideal Scenario**: We would use a manga's historical metrics, recorded before any adaptation existed, to predict whether it is adapted in a later year.

*    **Actual Scenario**: We only have current metrics. A manga's current popularity (`members`) is often inflated because it has already received an anime adaptation (Reverse Causality).

**Analysis Strategy**: To mitigate this "Post-Adaptation Bias," we will engineer features that focus on intrinsic properties (Density, Ratios) rather than raw totals. We define the problem not as "Forecasting," but as "Franchise Characterization": Can we distinguish the signal of a commercially viable property from the noise of the general market, even in the presence of retrospective bias?
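One way to read "intrinsic properties" is as ratios that normalize engagement by audience size or by series length. The sketch below is illustrative only: the toy rows are invented, and `favorites_per_member` is a hypothetical feature name, while `members_per_volume` is the example named in the roadmap. Both are computed from the `members`, `favorites`, and `volumes` columns.

```python
import numpy as np
import pandas as pd

# Toy rows (invented values) using column names from the manga dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "members": [200_000, 5_000, 1_200],
    "favorites": [20_000, 100, 0],
    "volumes": [40.0, 5.0, np.nan],  # volumes can be missing, e.g. for ongoing series
})

# Share of readers who marked the work as a favorite (intensity, not raw size)
toy["favorites_per_member"] = toy["favorites"] / toy["members"].replace(0, np.nan)

# Audience size relative to the amount of published material
toy["members_per_volume"] = toy["members"] / toy["volumes"].replace(0, np.nan)

print(toy[["title", "favorites_per_member", "members_per_volume"]])
```

Replacing zeros with `NaN` before dividing avoids infinite ratios. Rows with missing `volumes` stay `NaN` and would need an explicit imputation or exclusion decision later.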

## 1.2 Data Source & Scope

**Source**: The dataset is a snapshot of MyAnimeList.net (MAL), the world's largest active anime and manga community. It was scraped using the Jikan API and hosted on Kaggle by Andreu Vall Hernàndez.

**URL**: https://www.kaggle.com/datasets/andreuvallhernndez/myanimelist

**Dataset Composition**: The analysis utilizes two distinct files:

1.    `manga.csv` **(The Features)**: Contains ~65,000 entries of source material, including metadata (Authors, Genres), status (Publishing/Finished), and community metrics (Score, Members).

2.    `anime.csv` **(The Reference)**: Contains ~25,000 entries of animated works. This file is used primarily as a "Lookup Table" to determine if a manga has been adapted.

**Scope of Analysis**: We restrict our analysis to Mainstream Original Source Material. To ensure we are modeling commercial viability for the general market, we filter out:

*    *Doujinshi* (Fan-made comics)

*    *Manhwa/Manhua* (Korean/Chinese comics, unless explicitly adapted)

*    *One-shots* (Single chapters)

*    *Adult Content* (Hentai/Erotica): These works operate in a distinct market with different production incentives and are excluded from this analysis.
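A minimal sketch of how these exclusions might be applied with pandas. The `type` labels (`doujinshi`, `manhwa`, `manhua`, `one_shot`) and the genre strings are assumptions about how the dataset encodes them, and the toy rows are invented; inspecting `manga_df['type'].unique()` would confirm the real values. The "unless explicitly adapted" exception for Manhwa/Manhua is not handled here.

```python
import pandas as pd

# Invented rows; `genres` is kept as a stringified list, as in the raw CSV
toy = pd.DataFrame({
    "title": ["Series A", "Fan Comic", "Webtoon B", "Short C", "Adult D", "Novel E"],
    "type": ["manga", "doujinshi", "manhwa", "one_shot", "manga", "light_novel"],
    "genres": ["['Action']", "['Comedy']", "['Fantasy']", "['Drama']",
               "['Hentai']", "['Fantasy', 'Romance']"],
})

EXCLUDED_TYPES = ["doujinshi", "manhwa", "manhua", "one_shot"]  # assumed labels
ADULT_PATTERN = r"Hentai|Erotica"                               # assumed genre names

# Substring check on the stringified genre list; missing genres count as non-adult
is_adult = toy["genres"].str.contains(ADULT_PATTERN, na=False)
is_excluded_type = toy["type"].isin(EXCLUDED_TYPES)

in_scope = toy[~is_excluded_type & ~is_adult]
print(in_scope["title"].tolist())
```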

## 1.3 Target Variable Definition

**Variable Name**: `is_adapted` (Binary: 0 or 1)

**Definition**: A manga is considered "Adapted" (1) if it shares a verified intellectual property link with an entry in the Anime database.

**Challenge**: There is no shared ID column between the two datasets. We cannot simply join on ID. Instead, we must perform Entity Resolution based on titles. A match is defined as:

*    **Exact Title Match** (e.g., One Piece ↔ One Piece)

*    **Cross-Language Match** (e.g., Shingeki no Kyojin ↔ Attack on Titan)

*    **Synonym Match** (e.g., DanMachi ↔ Is It Wrong to Try to Pick Up Girls in a Dungeon?)
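The three match types can be covered by one comparison if every known name of a work is collected into a set and two works are linked when their sets overlap. The sketch below is one possible reading of that idea, not the notebook's actual implementation: `normalize` and `bag_of_names` are hypothetical helpers, and the example values are illustrative. It uses the `title`, `title_english`, and `title_synonyms` fields that appear in both files.

```python
import ast
import re

def normalize(name):
    """Lowercase and drop everything except letters and digits (Unicode-aware)."""
    return re.sub(r"[\W_]+", "", str(name).lower())

def bag_of_names(title, title_english=None, title_synonyms="[]"):
    """Collect every known name of a work into one set of normalized strings."""
    names = [title, title_english]
    try:
        parsed = ast.literal_eval(title_synonyms)  # "['a', 'b']" -> ['a', 'b']
    except (ValueError, SyntaxError, TypeError):
        parsed = []  # malformed or missing synonym cell: ignore it
    if isinstance(parsed, (list, tuple)):
        names.extend(parsed)
    # Keep only real strings that are non-empty after normalization
    return {normalize(n) for n in names if isinstance(n, str) and normalize(n)}

manga_names = bag_of_names("Shingeki no Kyojin", "Attack on Titan", "[]")
anime_names = bag_of_names("Attack on Titan!", None, "['AoT']")
other_names = bag_of_names("One Piece", "One Piece", "[]")

is_match = bool(manga_names & anime_names)    # shared name: "attackontitan"
is_other_match = bool(manga_names & other_names)
print(is_match, is_other_match)
```

Exact intersection is deliberately strict. An anime entry such as `Shingeki no Kyojin Season 3 Part 2` would not overlap with the manga's names, so sequel suffixes would need to be stripped or handled by a fuzzy-matching step.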

## 1.4 Key Features & Data Dictionary
While the manga dataset contains 30 columns, our analysis focuses on four primary signal categories. Administrative metadata (IDs, URLs) and visual data (Image links) will be discarded during preprocessing.

| Category | Key Features | Business Hypothesis |
| :--- | :--- | :--- |
| **Community Reception** | `score`, `members`, `favorites` | High engagement indicates a pre-existing fanbase that reduces production risk. |
| **Content Structure** | `volumes`, `type` (Manga/LN) | Sufficient source material ("inventory") is required for a standard 12-episode season. |
| **Thematic Fit** | `genres`, `themes`, `demographics` | Certain genres (e.g., *Isekai*, *Shonen*) are historically over-represented in adaptations. |
| **Production Context** | `serializations` (Magazine) | High-tier magazines (e.g., *Shonen Jump*) act as "Kingmakers" for adaptations. |
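The list-valued columns in the table (`genres`, `themes`, `demographics`, `serializations`) arrive from the CSV as strings such as `"['Seinen', 'Shounen']"`. A minimal sketch of turning one of them into indicator features follows; the toy rows are invented, and `parse_list` and the `demo_` prefix are hypothetical names chosen for illustration.

```python
import ast
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "demographics": ["['Seinen']", "['Seinen', 'Shounen']", "[]"],
})

def parse_list(cell):
    """Turn "['x', 'y']" into ['x', 'y']; treat missing or malformed cells as empty."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return []
    return value if isinstance(value, list) else []

parsed = toy["demographics"].apply(parse_list)

# One indicator column per label: join each list with "|" and split into dummies
indicators = parsed.str.join("|").str.get_dummies(sep="|").add_prefix("demo_")
features = pd.concat([toy[["title"]], indicators], axis=1)
print(features)
```

This assumes no label contains the `|` separator. For high-cardinality columns such as magazines, grouping rare labels before encoding would keep the feature count manageable.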

## 1.5 Methodology Roadmap

1.    **Entity Resolution**: Constructing the `is_adapted` ground truth via fuzzy title matching and set operations on each work's "Bag of Names".

2.    **EDA & Audit**: Verifying the "Recency Bias" (dropping new manga) and checking for multicollinearity.

3.    **Feature Engineering**: Engineering ratio-based features (e.g., `members_per_volume`) and parsing stringified categorical lists.

4.    **Modeling Strategy**:

        *    *Baseline*: Regularized Logistic Regression (L1/L2) to establish a performance floor and assess feature linearity.

        *    *Candidate Models*: Tree-Based Ensembles (Random Forest, Gradient Boosting) to capture non-linear interactions and handle outliers.

5.    **Robust Evaluation**:

        *    *Stratified Cross-Validation*: To ensure stable estimates given the severe class imbalance (~95% non-adapted).

        *    *Metric Optimization*: Prioritizing precision, recall, and F1-score over accuracy, since a model that always predicts "not adapted" would already reach ~95% accuracy.
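The evaluation step of the roadmap can be sketched with scikit-learn as follows. This is a sketch on synthetic data, not the project's results: `make_classification` stands in for the real feature matrix, and `class_weight="balanced"` is an assumption rather than a choice stated in this notebook. It contrasts a majority-class baseline with an L2-regularized logistic regression under stratified 5-fold cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the real features: roughly 95% negatives
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           weights=[0.95, 0.05], random_state=42)

# Stratification keeps the class ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    "accuracy": "accuracy",
    "f1": make_scorer(f1_score, zero_division=0),  # F1 = 0 when nothing is predicted positive
}

models = {
    "always_non_adapted": DummyClassifier(strategy="most_frequent"),
    "logistic_l2": LogisticRegression(class_weight="balanced", max_iter=1000),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {m: float(np.mean(scores[f"test_{m}"])) for m in scoring}
    print(name, results[name])
```

The baseline scores high on accuracy while its F1 is zero, which is the reason accuracy alone is a misleading target at this level of imbalance.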

## 1.6 Environment Setup & Data Ingestion

We utilize the Kaggle API to programmatically download the dataset. We implement a standard directory structure (`01_raw`, `02_interim`, `03_processed`) to ensure the analysis is reproducible and raw data remains immutable.

### 1.6.1 Kaggle Setup

In [14]:
# Environment Detection & Kaggle Setup
try:
    from google.colab import drive
    IN_COLAB = True
    PROJECT_ROOT = '.'  # Colab uses a flat working directory
    
    print("Running in Google Colab. Setting up environment...")
    
    # Install Kaggle API
    %pip install -q kaggle
    
    # Mount Drive (to access kaggle.json)
    drive.mount('/content/drive')
    
    # Credentials Setup
    !mkdir -p ~/.kaggle
    !cp /content/drive/MyDrive/KaggleCredentials/kaggle.json ~/.kaggle/kaggle.json
    !chmod 600 ~/.kaggle/kaggle.json
    
    print("Colab setup complete.")

except ImportError:
    IN_COLAB = False
    print("Running locally.")
    
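    # Automatic project root detection: walk up until a directory containing .git is found
    current_dir = os.path.abspath(os.getcwd())
    while True:
        # os.path.exists avoids listing every directory and also handles .git files (worktrees)
        if os.path.exists(os.path.join(current_dir, '.git')):
            PROJECT_ROOT = current_dir
            break
        parent = os.path.dirname(current_dir)
        if parent == current_dir:  # Reached the filesystem root without finding .git
            PROJECT_ROOT = os.getcwd()  # Fallback: use the current working directory
            break
        current_dir = parent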
        
    print("Ensure 'kaggle.json' is in your local ~/.kaggle/ directory.")

Running locally.
Ensure 'kaggle.json' is in your local ~/.kaggle/ directory.


In [16]:
# Import the Kaggle API client (requires kaggle.json credentials) and set up directories
import kaggle

# Directory Structure
if IN_COLAB:
    raw_dir = '.'
    interim_dir = '.'
    processed_dir = '.'
else:
    raw_dir = os.path.join(PROJECT_ROOT, 'data', '01_raw')
    interim_dir = os.path.join(PROJECT_ROOT, 'data', '02_interim')
    processed_dir = os.path.join(PROJECT_ROOT, 'data', '03_processed')
    
    for d in [raw_dir, interim_dir, processed_dir]:
        os.makedirs(d, exist_ok=True)

print(f"Project Root: {PROJECT_ROOT}")
print(f"Data Directory: {raw_dir}")

Project Root: c:\Users\Nebi\Documents\Projects\MyAnimeList_Manga_Analysis
Data Directory: c:\Users\Nebi\Documents\Projects\MyAnimeList_Manga_Analysis\data\01_raw


In [17]:
# Download Data
dataset = 'andreuvallhernndez/myanimelist'
anime_path = os.path.join(raw_dir, 'anime.csv')
manga_path = os.path.join(raw_dir, 'manga.csv')

# Only download if files are missing
if not os.path.exists(anime_path) or not os.path.exists(manga_path):
    print(f"\nFiles not found. Downloading {dataset}...")
    kaggle.api.dataset_download_files(dataset, path=raw_dir, unzip=True)
    print("Download complete.")
else:
    print(f"\nData already exists at {raw_dir}. Skipping download.")


Data already exists at c:\Users\Nebi\Documents\Projects\MyAnimeList_Manga_Analysis\data\01_raw. Skipping download.


### 1.6.2 Load Data

In [18]:
# Load Data
anime_df = pd.read_csv(anime_path, low_memory=False)
manga_df = pd.read_csv(manga_path, low_memory=False)

print(f"\nLoaded Anime Data: {anime_df.shape}")
print(f"Loaded Manga Data: {manga_df.shape}")

print("\n--- Preview: Anime Data ---")
display(anime_df.head(3))
print("\n--- Preview: Manga Data ---")
display(manga_df.head(3))


Loaded Anime Data: (24985, 39)
Loaded Manga Data: (64833, 30)

--- Preview: Anime Data ---


Unnamed: 0,anime_id,title,type,score,scored_by,status,episodes,start_date,end_date,source,...,producers,licensors,synopsis,background,main_picture,url,trailer_url,title_english,title_japanese,title_synonyms
0,5114,Fullmetal Alchemist: Brotherhood,tv,9.1,2037075,finished_airing,64.0,2009-04-05,2010-07-04,manga,...,"['Aniplex', 'Square Enix', 'Mainichi Broadcast...","['Funimation', 'Aniplex of America']",After a horrific alchemy experiment goes wrong...,,https://cdn.myanimelist.net/images/anime/1208/...,https://myanimelist.net/anime/5114/Fullmetal_A...,https://www.youtube.com/watch?v=--IcmZkvL0Q,Fullmetal Alchemist: Brotherhood,鋼の錬金術師 FULLMETAL ALCHEMIST,['Hagane no Renkinjutsushi: Fullmetal Alchemis...
1,11061,Hunter x Hunter (2011),tv,9.04,1671587,finished_airing,148.0,2011-10-02,2014-09-24,manga,...,"['VAP', 'Nippon Television Network', 'Shueisha']",['VIZ Media'],Hunters devote themselves to accomplishing haz...,,https://cdn.myanimelist.net/images/anime/1337/...,https://myanimelist.net/anime/11061/Hunter_x_H...,https://www.youtube.com/watch?v=D9iTQRB4XRk,Hunter x Hunter,HUNTER×HUNTER（ハンター×ハンター）,['HxH (2011)']
2,38524,Shingeki no Kyojin Season 3 Part 2,tv,9.05,1491491,finished_airing,10.0,2019-04-29,2019-07-01,manga,...,"['Production I.G', 'Dentsu', 'Mainichi Broadca...",['Funimation'],Seeking to restore humanity's diminishing hope...,Shingeki no Kyojin adapts content from volumes...,https://cdn.myanimelist.net/images/anime/1517/...,https://myanimelist.net/anime/38524/Shingeki_n...,https://www.youtube.com/watch?v=hKHepjfj5Tw,Attack on Titan Season 3 Part 2,進撃の巨人 Season3 Part.2,[]



--- Preview: Manga Data ---


Unnamed: 0,manga_id,title,type,score,scored_by,status,volumes,chapters,start_date,end_date,...,demographics,authors,serializations,synopsis,background,main_picture,url,title_english,title_japanese,title_synonyms
0,2,Berserk,manga,9.47,319696,currently_publishing,,,1989-08-25,,...,['Seinen'],"[{'id': 1868, 'first_name': 'Kentarou', 'last_...",['Young Animal'],"Guts, a former mercenary now known as the ""Bla...",Berserk won the Award for Excellence at the si...,https://cdn.myanimelist.net/images/manga/1/157...,https://myanimelist.net/manga/2/Berserk,Berserk,ベルセルク,['Berserk: The Prototype']
1,13,One Piece,manga,9.22,355375,currently_publishing,,,1997-07-22,,...,['Shounen'],"[{'id': 1881, 'first_name': 'Eiichiro', 'last_...",['Shounen Jump (Weekly)'],"Gol D. Roger, a man referred to as the ""King o...",One Piece is the highest selling manga series ...,https://cdn.myanimelist.net/images/manga/2/253...,https://myanimelist.net/manga/13/One_Piece,One Piece,ONE PIECE,[]
2,1706,JoJo no Kimyou na Bouken Part 7: Steel Ball Run,manga,9.3,151433,finished,24.0,96.0,2004-01-19,2011-04-19,...,"['Seinen', 'Shounen']","[{'id': 2619, 'first_name': 'Hirohiko', 'last_...",['Ultra Jump'],"In the American Old West, the world's greatest...",JoJo no Kimyou na Bouken Part 7: Steel Ball Ru...,https://cdn.myanimelist.net/images/manga/3/179...,https://myanimelist.net/manga/1706/JoJo_no_Kim...,,ジョジョの奇妙な冒険 Part7 STEEL BALL RUN,"[""JoJo's Bizarre Adventure Part 7: Steel Ball ..."


# 2