# Data Crawling

## Overview

This ipynb file describes the code and process used to crawl and process anime data from the MyAnimeList (MAL) API:
- Uses the requests library to send GET requests to the MAL API endpoint.
- Handles authentication via a CLIENT_ID in the request headers.
- Fetches data for the four predefined seasons (SEASONS) of each year and filters it by genre: keeps only anime that have at least one of the specified genres (INCLUDE_GENRE_IDS) and drops anime with sensitive or inappropriate genres (EXCLUDE_GENRE_IDS).
- Keeps only anime whose format is "tv" (series) or "movie".
- Keeps only anime that have finished airing or are currently airing.
- Keeps only anime that have at least 10,000 members and 5,000 ratings.
- Handles paginated API responses with the offset parameter to fetch all available results for each season, stopping when there are no more pages.
- Pauses between paginated requests to avoid rate limiting and prints a message for unsuccessful API responses (non-200 status codes).
- Saves the processed data to a CSV file for use in the next step, Data Exploration.

## Import

In [1]:
import requests # For retrieving data from web APIs.
import pandas as pd # For handling and analyzing tabular data
import time # For adding delays or handling time-sensitive tasks
from datetime import datetime # For working with date and time

## Definitions

In [2]:
# MAL API Client ID
CLIENT_ID = 'a6a65b1404bd8c16be09628276f0a329'

# Base URL for the MAL API (get anime by season)
BASE_URL = 'https://api.myanimelist.net/v2/anime/season'

# Seasons (start from Winter)
SEASONS = ['winter', 'spring', 'summer', 'fall']

# Genre IDs to include (check the comments below)
INCLUDE_GENRE_IDS = [1, 2, 4, 8, 10, 7, 22, 24, 36, 30]

# Genre IDs to exclude (check the comments below)
EXCLUDE_GENRE_IDS = [12] 

### Comments
- List of included genre IDs:
    - 1: **Action**
    - 2: **Adventure**
    - 4: **Comedy**
    - 7: **Mystery**
    - 8: **Drama**
    - 10: **Fantasy**
    - 22: **Romance**
    - 24: **Sci-Fi**
    - 30: **Sports**
    - 36: **Slice of Life**
- List of excluded genre IDs:
    - 12: *Hentai* (NSFW)
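
The genre lists above feed the filter applied during crawling. As a minimal sketch, the same conditions can be written as a standalone predicate. The `passes_filters` helper and the sample node are hypothetical names used only for illustration, and the thresholds mirror those stated in the overview.

```python
# Illustrative sketch: the helper name and the sample data are assumptions,
# not part of the notebook.
INCLUDE_GENRE_IDS = [1, 2, 4, 8, 10, 7, 22, 24, 36, 30]
EXCLUDE_GENRE_IDS = [12]


def passes_filters(node, min_members=10000, min_ratings=5000):
    """Return True if an anime node satisfies all of the crawl conditions."""
    genre_ids = {genre['id'] for genre in node.get('genres', [])}
    return (
        not genre_ids & set(EXCLUDE_GENRE_IDS)        # no excluded genre
        and bool(genre_ids & set(INCLUDE_GENRE_IDS))  # at least one included genre
        and node.get('media_type') in ('tv', 'movie')
        and node.get('status') in ('finished_airing', 'currently_airing')
        and node.get('num_list_users', 0) >= min_members
        and node.get('num_scoring_users', 0) >= min_ratings
    )


# Made-up record shaped like the 'node' object requested in the crawl.
sample_node = {
    'genres': [{'id': 1, 'name': 'Action'}, {'id': 24, 'name': 'Sci-Fi'}],
    'media_type': 'tv',
    'status': 'finished_airing',
    'num_list_users': 25000,
    'num_scoring_users': 12000,
}
print(passes_filters(sample_node))  # True
```

Keeping the conditions in one function makes it possible to check them against hand-written records before running a full crawl.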

## Crawling Data

In [3]:
def get_anime_data(start_year, end_year):
    """
    Fetch a list of TV and Movie anime from MyAnimeList via their API for every season
    from start_year through end_year (inclusive), matching specific genres and excluding
    sensitive content.
    
    Args:
        start_year (int): The starting year for fetching anime.
        end_year (int): The ending year for fetching anime.
    
    Returns:
        list: A list of anime data with detailed information.
    """
    headers = {
        'X-MAL-CLIENT-ID': CLIENT_ID
    }

    all_anime = []

    for year in range(start_year, end_year + 1):
        for season in SEASONS:
            print(f"Fetching anime for {year} {season.capitalize()}...")
            url = f"{BASE_URL}/{year}/{season}"
            params = {
                'limit': 100,  # Max limit per request
                'fields': (
                    'id,title,alternative_titles,genres,media_type,mean,num_scoring_users,'
                    'num_list_users,studios,status,start_season,rating'
                )
            }

            offset = 0

            while True:
                params['offset'] = offset
                response = requests.get(url, headers=headers, params=params)

                if response.status_code != 200:
                    print(f"Error: {response.status_code} - {response.text}")
                    break

                data = response.json()

                # Filter anime based on conditions
                anime_list = [
                    anime for anime in data.get('data', [])
                    if not any(genre['id'] in EXCLUDE_GENRE_IDS for genre in anime['node'].get('genres', [])) # Excluded genres
                    and any(genre['id'] in INCLUDE_GENRE_IDS for genre in anime['node'].get('genres', [])) # Included genres
                    and anime['node'].get('media_type') in ['tv', 'movie']  # Media type filter
                    and anime['node'].get('status') in ['finished_airing', 'currently_airing']  # Status condition
                    and anime['node'].get('num_list_users', 0) >= 10000  # Has at least 10,000 members
                    and anime['node'].get('num_scoring_users', 0) >= 5000  # Has at least 5,000 scoring users
                ]

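                # Stop on an empty page. Checking the raw page rather than the filtered
                # list keeps pagination going when a page has no matching anime.
                if not data.get('data'):
                    break

                all_anime.extend(anime_list)

                # Print progress
                print(f"Fetched {len(anime_list)} anime for {year} {season.capitalize()} (Offset: {offset})...")

                # Check pagination: the MAL API nests the next-page link under 'paging'
                if 'next' not in data.get('paging', {}):
                    break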

                # Increment offset for the next page
                offset += params['limit']

                # Avoid rate limiting
                time.sleep(2)

    return all_anime
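
The offset-based pagination can be exercised without calling the API. The sketch below is an illustration under stated assumptions: `fetch_page` is a hypothetical stand-in for `requests.get(...).json()`, and the response shape (a `data` list plus a `paging` object that carries a `next` link while more pages remain) is assumed to match what the MAL API returns.

```python
# Illustrative sketch: collect_all_pages, fake_fetch_page, and the fake records
# are hypothetical stand-ins for the real API call.
def collect_all_pages(fetch_page, limit=100):
    """Collect items page by page until the response carries no 'next' link."""
    items = []
    offset = 0
    while True:
        data = fetch_page(offset=offset, limit=limit)
        page = data.get('data', [])
        if not page:
            break  # empty page: nothing left to read
        items.extend(page)
        if 'next' not in data.get('paging', {}):
            break  # last page reached
        offset += limit
    return items


def fake_fetch_page(offset, limit):
    """Serve 250 fake records in slices, mimicking an offset-paginated endpoint."""
    records = list(range(250))
    page = records[offset:offset + limit]
    paging = {'next': 'more'} if offset + limit < len(records) else {}
    return {'data': page, 'paging': paging}


print(len(collect_all_pages(fake_fetch_page)))  # 250
```

Passing the fetch function as an argument keeps the loop testable offline; a real crawl would supply a function that wraps the HTTP request.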

In [4]:
anime_list = get_anime_data(start_year=1995, end_year=2024)
# Check data list
print(f"Total anime fetched: {len(anime_list)}")

Fetching anime for 1995 Winter...
Fetched 19 anime for 1995 Winter (Offset: 0)...
Fetching anime for 1995 Spring...
Fetched 21 anime for 1995 Spring (Offset: 0)...
Fetching anime for 1995 Summer...
Fetched 23 anime for 1995 Summer (Offset: 0)...
Fetching anime for 1995 Fall...
Fetched 23 anime for 1995 Fall (Offset: 0)...
Fetching anime for 1996 Winter...
Fetched 18 anime for 1996 Winter (Offset: 0)...
Fetching anime for 1996 Spring...
Fetched 20 anime for 1996 Spring (Offset: 0)...
Fetching anime for 1996 Summer...
Fetched 20 anime for 1996 Summer (Offset: 0)...
Fetching anime for 1996 Fall...
Fetched 19 anime for 1996 Fall (Offset: 0)...
Fetching anime for 1997 Winter...
Fetched 19 anime for 1997 Winter (Offset: 0)...
Fetching anime for 1997 Spring...
Fetched 19 anime for 1997 Spring (Offset: 0)...
Fetching anime for 1997 Summer...
Fetched 22 anime for 1997 Summer (Offset: 0)...
Fetching anime for 1997 Fall...
Fetched 19 anime for 1997 Fall (Offset: 0)...
Fetching anime for 1998 Wint

## Saving Data to a CSV File

In [5]:
def save_anime_data_to_csv(anime_list, file_name):
    """
    Save anime data to a CSV file.
    
    Args:
        anime_list (list): List of anime data.
        file_name (str): Name of the output CSV file.
    """
    data = []
    for anime in anime_list:
        alternative_titles = anime['node'].get('alternative_titles', {})
        english_title = alternative_titles.get('en')
        data.append({
            'ID': anime['node']['id'],
            'Title': anime['node']['title'],
            'Alternative Title (en)': english_title if english_title else 'N/A',
            'Media Type': anime['node'].get('media_type', 'Unknown'),
            'Status': anime['node'].get('status', 'Unknown'),
            'Premiered Season': f"{anime['node']['start_season']['season'].capitalize()} {anime['node']['start_season']['year']}"
            if anime['node'].get('start_season') else 'Unknown',
            'Genres': ', '.join([genre['name'] for genre in anime['node'].get('genres', [])]),
            'User Score': anime['node'].get('mean', 'Unknown'),
            'Number of Ratings': anime['node'].get('num_scoring_users', 0),
            'Number of Members': anime['node'].get('num_list_users', 0),
            'Studios': ', '.join([studio['name'] for studio in anime['node'].get('studios', [])]),
            'Rating': anime['node'].get('rating', 'Unknown'),
        })

    df = pd.DataFrame(data)
    df.to_csv(file_name, index=False)
    
    print("Data saved!")
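
To make the row layout concrete, here is a minimal sketch of how one API record maps to one CSV row. The `flatten_node` helper and the sample record are hypothetical, the record is made up, and only a few of the columns are shown.

```python
# Illustrative sketch: flatten_node and sample_node are assumptions for
# demonstration; the sample record is not real MAL data.
def flatten_node(node):
    """Map one anime node to a flat dict with a subset of the CSV columns."""
    season = node.get('start_season')
    return {
        'ID': node['id'],
        'Title': node['title'],
        # Empty or missing English titles fall back to 'N/A'
        'Alternative Title (en)': node.get('alternative_titles', {}).get('en') or 'N/A',
        'Premiered Season': (
            f"{season['season'].capitalize()} {season['year']}" if season else 'Unknown'
        ),
        'Genres': ', '.join(genre['name'] for genre in node.get('genres', [])),
    }


sample_node = {
    'id': 1,
    'title': 'Example Title',
    'alternative_titles': {'en': ''},
    'start_season': {'year': 1998, 'season': 'spring'},
    'genres': [{'id': 1, 'name': 'Action'}, {'id': 24, 'name': 'Sci-Fi'}],
}
row = flatten_node(sample_node)
print(row['Premiered Season'])        # Spring 1998
print(row['Alternative Title (en)'])  # N/A
```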

In [6]:
current_date = datetime.now().strftime('%d%m%Y') # Date stamp (DDMMYYYY) used as the file version
save_anime_data_to_csv(anime_list, f'./data/anime_data_{current_date}.csv')

Data saved!
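
`DataFrame.to_csv` does not create missing directories, so the save step fails if `./data` does not exist yet. A minimal sketch of one way to guard against that follows. The `build_output_path` helper is hypothetical, and the demo writes into a temporary directory so it leaves nothing behind.

```python
import tempfile
from datetime import datetime
from pathlib import Path


# Illustrative sketch: build_output_path is an assumed helper, not part of the notebook.
def build_output_path(base_dir, when=None):
    """Create base_dir if needed and return a date-stamped CSV path inside it."""
    when = when or datetime.now()
    directory = Path(base_dir)
    directory.mkdir(parents=True, exist_ok=True)
    return directory / f"anime_data_{when.strftime('%d%m%Y')}.csv"


with tempfile.TemporaryDirectory() as tmp:
    path = build_output_path(Path(tmp) / 'data', datetime(2024, 12, 1))
    print(path.name)             # anime_data_01122024.csv
    print(path.parent.is_dir())  # True
```

In the notebook, the returned path could be passed to `save_anime_data_to_csv` in place of the hand-built file name.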
