# Data Collection

## Overview

### Source(s)
- The dataset is collected **only** from [MyAnimeList](https://myanimelist.net) (MAL), through its official [API](https://api.myanimelist.net/v2/anime/season).
- To collect data through the MAL API, we must first register our application and obtain an **authorized** Client ID ([click here for more information](https://help.myanimelist.net/hc/en-us/articles/900003108823-API)).
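
As a minimal sketch of what an authenticated request looks like: the collection code later in this notebook sends the Client ID in an `X-MAL-CLIENT-ID` header. The helper name `build_season_request` and the environment variable `MAL_CLIENT_ID` are illustrative assumptions, not part of the MAL API; reading the ID from the environment is simply one way to keep it out of the notebook.

```python
import os

BASE_URL = 'https://api.myanimelist.net/v2/anime/season'


def build_season_request(year, season, client_id=None, limit=100, offset=0):
    """Return (url, headers, params) for one seasonal request (hypothetical helper)."""
    # Assumption: the Client ID is stored in an environment variable named MAL_CLIENT_ID
    if client_id is None:
        client_id = os.environ.get('MAL_CLIENT_ID', '')
    url = f"{BASE_URL}/{year}/{season}"
    headers = {'X-MAL-CLIENT-ID': client_id}
    params = {'limit': limit, 'offset': offset}
    return url, headers, params


# 'your-client-id' is a placeholder, not a real credential
url, headers, params = build_season_request(1995, 'winter', client_id='your-client-id')
print(url)
```

The returned values can be passed straight to `requests.get(url, headers=headers, params=params)`.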

### Legality
- It is **legal** to collect data from MAL using their official API ([click here for more information](https://myanimelist.net/membership/terms_of_use)).
- The data may be used for **personal**, **academic** or **non-commercial** analysis or research, provided the results are not monetized ([click here for more information](https://myanimelist.net/apiconfig/references/api/v2)).

### Reliability
- MAL has a large, active community of anime fans worldwide, which helps keep its data **regularly updated** and **reflective of popular** opinion.
- User reviews, scores, and lists are **crowdsourced** from a diverse group of viewers, providing a broad perspective.
- MAL is **well-maintained**: new anime, reviews, and scores are added regularly by both users and MAL staff.
- Data from MAL is presented in a **consistent** and **structured** format via its API, making it **suitable** for analysis.

$\rightarrow$ Data is **useful** and **highly reliable**.

## Import

In [1]:
import requests  # For retrieving data from web APIs
import pandas as pd  # For handling and analyzing tabular data
import time  # For adding delays between requests
from datetime import datetime  # For working with dates and times

## Definitions

In [2]:
# MAL API Client ID
CLIENT_ID = 'a6a65b1404bd8c16be09628276f0a329'

# Base URL for the MAL API (get anime by season)
BASE_URL = 'https://api.myanimelist.net/v2/anime/season'

# Seasons (start from winter)
SEASONS = ['winter', 'spring', 'summer', 'fall']

# Genre IDs to include (see the comments below)
INCLUDE_GENRE_IDS = [1, 2, 4, 7, 8, 10, 22, 24, 30, 36, 37, 38]

# Genre IDs to exclude (see the comments below)
EXCLUDE_GENRE_IDS = [12, 49]

### Comments
- List of included genre IDs (12 genres):
    - 1: **Action**
    - 2: **Adventure**
    - 4: **Comedy**
    - 7: **Mystery**
    - 8: **Drama**
    - 10: **Fantasy**
    - 22: **Romance**
    - 24: **Sci-Fi**
    - 30: **Sports**
    - 36: **Slice of Life**
    - 37: **Horror**
    - 38: **Supernatural**
- List of excluded genre IDs (NSFW):
    - 12: *Hentai*
    - 49: *Erotica*

## Collect data
In this project, we **focus only** on anime that meet the following criteria:
- Were broadcast between 1995 (when anime began to gain popularity) and the present (2024).
- Belong to one or more genres listed in ```INCLUDE_GENRE_IDS``` and do not belong to any genre in ```EXCLUDE_GENRE_IDS```.
- Are either TV series (```tv```) or movies (```movie```).
- Have either completed broadcasting (```finished_airing```) or are still ongoing (```currently_airing```).
- Have a reasonable level of popularity, measured by the number of users who rated them (```num_scoring_users```) and the number of users who added them to their lists (```num_list_users```).
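
The criteria above (apart from the year range, which is handled by iterating over seasons) can be sketched as a single predicate on one API record. The function name `meets_criteria` and the constant names for the thresholds are illustrative; the threshold values of 10000 list users and 2000 scoring users are the ones used by the collection code in this notebook.

```python
INCLUDE_GENRE_IDS = {1, 2, 4, 7, 8, 10, 22, 24, 30, 36, 37, 38}
EXCLUDE_GENRE_IDS = {12, 49}
MIN_LIST_USERS = 10000     # minimum num_list_users
MIN_SCORING_USERS = 2000   # minimum num_scoring_users


def meets_criteria(node):
    """Return True if one anime record ('node' dict) passes every filter."""
    genre_ids = {genre['id'] for genre in node.get('genres', [])}
    return (
        genre_ids.isdisjoint(EXCLUDE_GENRE_IDS)            # no excluded genre
        and not genre_ids.isdisjoint(INCLUDE_GENRE_IDS)    # at least one included genre
        and node.get('media_type') in ('tv', 'movie')
        and node.get('status') in ('finished_airing', 'currently_airing')
        and node.get('num_list_users', 0) >= MIN_LIST_USERS
        and node.get('num_scoring_users', 0) >= MIN_SCORING_USERS
    )


# A trimmed-down record in the shape returned by the API
sample = {
    'genres': [{'id': 2, 'name': 'Adventure'}, {'id': 27, 'name': 'Shounen'}],
    'media_type': 'tv',
    'status': 'finished_airing',
    'num_list_users': 16666,
    'num_scoring_users': 5597,
}
print(meets_criteria(sample))
```

Using sets makes the two genre checks explicit: one intersection must be empty, the other must not be.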

In [3]:
def get_anime_data(start_year, end_year):
    """
    Fetch TV and movie anime from MyAnimeList via their API, season by season
    from start_year through end_year, keeping only entries that match the
    included genres and excluding sensitive content.
    
    Args:
        start_year (int): The first year to fetch (inclusive).
        end_year (int): The last year to fetch (inclusive).
    
    Returns:
        list: Raw API entries (each a dict with a 'node' key) that pass the filters.
    """
    headers = {
        'X-MAL-CLIENT-ID': CLIENT_ID
    }

    all_anime = []

    for year in range(start_year, end_year + 1):
        for season in SEASONS:
            print(f"Fetching anime for {year} {season.capitalize()}...")
            url = f"{BASE_URL}/{year}/{season}"
            params = {
                'limit': 100,  # Max limit per request
                'fields': (
                    'id,title,alternative_titles,genres,media_type,mean,num_scoring_users,'
                    'num_list_users,studios,status,start_season,rating'
                )
            }

            offset = 0

            while True:
                params['offset'] = offset
                response = requests.get(url, headers=headers, params=params)

                if response.status_code != 200:
                    print(f"Error: {response.status_code} - {response.text}")
                    break

                data = response.json()

                # Filter anime based on conditions
                anime_list = [
                    anime for anime in data.get('data', [])
                    if not any(genre['id'] in EXCLUDE_GENRE_IDS for genre in anime['node'].get('genres', [])) # Excluded genres
                    and any(genre['id'] in INCLUDE_GENRE_IDS for genre in anime['node'].get('genres', [])) # Included genres
                    and anime['node'].get('media_type') in ['tv', 'movie']  # Media type filter
                    and anime['node'].get('status') in ['finished_airing', 'currently_airing']  # Status condition
                    and anime['node'].get('num_list_users', 0) >= 10000  # Has at least 10000 members
                    and anime['node'].get('num_scoring_users', 0) >= 2000  # Has at least 2000 scoring users
                ]

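                # Note: an empty filtered page does not mean the season is exhausted,
                # so pagination is decided only by the API response below
                all_anime.extend(anime_list)

                # Print progress
                print(f"Fetched {len(anime_list)} anime for {year} {season.capitalize()} (Offset: {offset})...")

                # Check pagination: the link to the next page is nested
                # under the 'paging' key of the response
                if 'next' not in data.get('paging', {}):
                    break

                # Increment offset for the next page
                offset += params['limit']

                # Avoid rate limiting
                time.sleep(2)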

    return all_anime
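
The offset-based paging used in `get_anime_data` can be isolated and checked without touching the network. The sketch below is only illustrative: `iter_pages` and `fake_fetch` are hypothetical names, and it assumes the response carries the link to the next page under `paging` → `next`.

```python
def iter_pages(fetch_page, limit=100):
    """Yield each page's 'data' list until the response has no next page."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        yield page.get('data', [])
        # Assumption: the next-page link lives under page['paging']['next']
        if 'next' not in page.get('paging', {}):
            break
        offset += limit


# Stand-in for the network call: 250 fake records served in pages of `limit`
def fake_fetch(offset, limit, total=250):
    data = [{'node': {'id': i}} for i in range(offset, min(offset + limit, total))]
    paging = {'next': 'more'} if offset + limit < total else {}
    return {'data': data, 'paging': paging}


page_sizes = [len(page) for page in iter_pages(fake_fetch)]
print(page_sizes)  # [100, 100, 50]
```

Passing the fetch function in as an argument keeps the loop testable; in the notebook it would wrap the `requests.get` call.
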

In [4]:
# Collect data
anime_list = get_anime_data(start_year=1995, end_year=2024)
print(f"Total anime fetched: {len(anime_list)}")

Fetching anime for 1995 Winter...
Fetched 20 anime for 1995 Winter (Offset: 0)...
Fetching anime for 1995 Spring...
Fetched 22 anime for 1995 Spring (Offset: 0)...
Fetching anime for 1995 Summer...
Fetched 26 anime for 1995 Summer (Offset: 0)...
Fetching anime for 1995 Fall...
Fetched 24 anime for 1995 Fall (Offset: 0)...
Fetching anime for 1996 Winter...
Fetched 19 anime for 1996 Winter (Offset: 0)...
Fetching anime for 1996 Spring...
Fetched 20 anime for 1996 Spring (Offset: 0)...
Fetching anime for 1996 Summer...
Fetched 20 anime for 1996 Summer (Offset: 0)...
Fetching anime for 1996 Fall...
Fetched 20 anime for 1996 Fall (Offset: 0)...
Fetching anime for 1997 Winter...
Fetched 21 anime for 1997 Winter (Offset: 0)...
Fetching anime for 1997 Spring...
Fetched 22 anime for 1997 Spring (Offset: 0)...
Fetching anime for 1997 Summer...
Fetched 24 anime for 1997 Summer (Offset: 0)...
Fetching anime for 1997 Fall...
Fetched 21 anime for 1997 Fall (Offset: 0)...
Fetching anime for 1998 Wint

In [5]:
# Check dataset
anime_list[:1]

[{'node': {'id': 3907,
   'title': 'Ginga Sengoku Gunyuuden Rai',
   'main_picture': {'medium': 'https://cdn.myanimelist.net/images/anime/6/30829.jpg',
    'large': 'https://cdn.myanimelist.net/images/anime/6/30829l.jpg'},
   'alternative_titles': {'synonyms': ['Rai: Galactic Civil War Chronicle',
     'Thunder Jet',
     'Ginga Sengoku Gun Yuuden Rai'],
    'en': 'Galaxy Warring State Chronicle Rai',
    'ja': '銀河戦国群雄伝ライ'},
   'genres': [{'id': 2, 'name': 'Adventure'},
    {'id': 22, 'name': 'Romance'},
    {'id': 24, 'name': 'Sci-Fi'},
    {'id': 27, 'name': 'Shounen'},
    {'id': 29, 'name': 'Space'}],
   'media_type': 'tv',
   'mean': 7.89,
   'num_scoring_users': 5597,
   'num_list_users': 16666,
   'studios': [{'id': 154, 'name': 'E&G Films'}],
   'status': 'finished_airing',
   'start_season': {'year': 1994, 'season': 'spring'},
   'rating': 'pg_13'}}]
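
The record above is the first one collected when starting from 1995, yet its `start_season` is Spring 1994. This suggests that the seasonal endpoint also lists shows that started earlier and were still airing, in which case a long-running series could be collected under several seasons. A minimal sketch for dropping such repeats by `id` (the helper name `deduplicate_by_id` is hypothetical):

```python
def deduplicate_by_id(anime_list):
    """Keep the first occurrence of each anime ID, preserving order."""
    seen = set()
    unique = []
    for anime in anime_list:
        anime_id = anime['node']['id']
        if anime_id not in seen:
            seen.add(anime_id)
            unique.append(anime)
    return unique


records = [
    {'node': {'id': 3907, 'title': 'Ginga Sengoku Gunyuuden Rai'}},
    {'node': {'id': 1, 'title': 'Example A'}},  # made-up entry for illustration
    {'node': {'id': 3907, 'title': 'Ginga Sengoku Gunyuuden Rai'}},
]
unique_records = deduplicate_by_id(records)
print([r['node']['id'] for r in unique_records])  # [3907, 1]
```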

## Save data to CSV file 

In [6]:
def save_anime_data_to_csv(anime_list, file_name):
    """
    Save anime data to a CSV file.
    
    Args:
        anime_list (list): List of anime data.
        file_name (str): Name of the output CSV file.
    """
    data = []
    for anime in anime_list:
        alternative_titles = anime['node'].get('alternative_titles', {})
        english_title = alternative_titles.get('en')
        data.append({
            'ID': anime['node']['id'],
            'Title': anime['node']['title'],
            'Alternative Title (en)': english_title if english_title else 'N/A',
            'Media Type': anime['node'].get('media_type', 'Unknown'),
            'Status': anime['node'].get('status', 'Unknown'),
            'Premiered Season': f"{anime['node']['start_season']['season'].capitalize()} {anime['node']['start_season']['year']}"
            if anime['node'].get('start_season') else 'Unknown',
            'Genres': ', '.join([genre['name'] for genre in anime['node'].get('genres', [])]),
            'User Score': anime['node'].get('mean', 'Unknown'),
            'Number of Ratings': anime['node'].get('num_scoring_users', 0),
            'Number of Members': anime['node'].get('num_list_users', 0),
            'Studios': ', '.join([studio['name'] for studio in anime['node'].get('studios', [])]),
            'Rating': anime['node'].get('rating', 'Unknown'),
        })

    df = pd.DataFrame(data)
    df.to_csv(file_name, index=False)
    
    print("Data saved!")

In [7]:
current_date = datetime.now().strftime('%d%m%Y') # Date stamp (DDMMYYYY) used as the file version
save_anime_data_to_csv(anime_list, f'./data/anime_data_{current_date}.csv')

Data saved!
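
Note that `DataFrame.to_csv` does not create missing directories, so the `./data/` folder has to exist before the cell above runs. A small sketch of building the dated path and creating the folder first; `dated_csv_path` is a hypothetical helper, and a temporary directory stands in for the project folder:

```python
from datetime import date
from pathlib import Path
import tempfile


def dated_csv_path(base_dir, today):
    """Create base_dir if needed and return the dated CSV path inside it."""
    out_dir = Path(base_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Same DDMMYYYY stamp as used for the file version in this notebook
    return out_dir / f"anime_data_{today.strftime('%d%m%Y')}.csv"


with tempfile.TemporaryDirectory() as tmp:
    path = dated_csv_path(Path(tmp) / 'data', date(2024, 1, 5))
    dir_exists = path.parent.is_dir()
    print(path.name)  # anime_data_05012024.csv
```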
