# COGS 108 - Data Checkpoint

## Authors

- Jordan Chen: Writing - original draft, Writing - review & editing
- Koji Nakazawa: Conceptualization, Methodology, Software
- Andrew Hoang: Background Research, Visualization
- Amandine Isidro: Data curation, Experimental investigation
- Audrey La Guardia: Analysis, Project Administration

## Research Question

The goal of this project is to investigate whether the winner of Crunchyroll's Anime of the Year award can be predicted from measurable popularity, engagement, and production-related variables, in hopes of discovering relationships between fan engagement and media production within the anime industry. Specifically, we aim to build a predictive model that uses factors such as user ratings (data from Crunchyroll and MyAnimeList), social-media engagement (sentiment analysis on YouTube data), hype indicators (trailer views and manga popularity), and production characteristics (studio and budget) to predict a binary target variable (win/not win). How do factors like rating, engagement, hype, production, and release time affect an anime's chance of winning Crunchyroll's Anime of the Year? How accurately can we predict the next winner?


## Background and Prior Work

The Crunchyroll Anime Awards is an annual awards ceremony organized by Crunchyroll, one of the world's largest anime streaming platforms. It recognizes the hard work of animators, producers, and other contributors, covering both fan favorites and critically acclaimed works.<a id="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) The ceremony was first announced in December 2016, with the winners presented in January 2017. The awards feature many categories, including Anime of the Year, best opening and ending songs, best voice actor in several languages, best animation, global impact, and more. The process begins with a panel of industry experts selecting nominees for each category, followed by a fan voting stage. For certain categories, expert judges may also evaluate submissions to ensure technical merit is considered. Finally, the winners are announced early in the year, typically between February and March.

The market for anime (Japanese animated TV shows) has seen major growth worldwide.<a id="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Anime has gained popularity not only through its immersive storylines but also through its distinctive art styles and unique screen effects. It is a broad form of media spanning genres such as romance and action, yet fans come together to enjoy all types of anime. Naturally, the amount of discussion surrounding it has increased as well. As anime became more widely recognized, the most popular series came to be celebrated through annual anime awards.

A recent academic paper by Jesús Armenta-Segura and Grigori Sidorov that was published on PubMed Central explored how freely available internet data such as plot descriptions and images can be used to predict anime popularity before large financial investments are made.<a id="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) The researchers proposed a multimodal neural network that combines GPT-2 for text and ResNet-50 for image embeddings, achieving a mean squared error of 0.012 and an R² of 0.142. Their findings demonstrated that pre-production features like story summaries and character visuals are moderately correlated with an anime’s eventual popularity. 

1. <a id="cite_note-1"></a> [^](#cite_ref-1) Wikipedia. (2025). Crunchyroll Anime Awards. https://en.wikipedia.org/wiki/Crunchyroll_Anime_Awards

2. <a id="cite_note-2"></a> [^](#cite_ref-2) Armenta-Segura, J., and  (2023). Anime Success Prediction Based on Synopsis Using Traditional Classifiers. https://rcs.cic.ipn.mx/2023_152_9/Anime%20Success%20Prediction%20Based%20on%20Synopsis%20Using%20Traditional%20Classifiers.pdf

3. <a id="cite_note-3"></a> [^](#cite_ref-3) Armenta-Segura, Jesús, and Grigori Sidorov. (2025). *Anime popularity prediction before huge investments: a multimodal approach using deep learning.* *PeerJ Computer Science,* 11, e2715. https://doi.org/10.7717/peerj-cs.2715

## Hypothesis


We aim to predict the next Crunchyroll Anime of the Year winner by building a data-driven model using variables such as fan ratings, production studio identifiers, and pre-adaptation manga performance.

To create a meaningful analysis, we will operationalize these factors as follows. Fan ratings will be measured through aggregated scores from platforms such as MyAnimeList and AniList. Production studio identifiers will be obtained by recording the name of the studio, the production budget, and the season in which the anime was released. Lastly, pre-adaptation manga performance will be represented by the popularity of the original manga, measured using circulation numbers prior to the anime’s release.

Our rationale is that pre-adaptation manga performance is a strong indicator of an anime's baseline performance. However, production quality depends on the studio and budget of the project. If the production quality is good, the anime's performance should be enhanced; if it is poor, the performance will be hindered. For example, Blue Lock was a highly anticipated anime based on its manga's popularity, but because of a tight budget the animation quality was lackluster, which hindered the anime's popularity despite the high anticipation.

We therefore hypothesize that the likelihood of an anime winning Anime of the Year will be most strongly associated with the interaction between high pre-adaptation manga popularity and favorable production conditions (e.g., strong studio track record and adequate budget), which together produce higher fan ratings and stronger overall audience response.
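
One way to make this hypothesis concrete is a logistic model with an interaction term. The form below is only a sketch in our own notation, not a fitted or final model: $M_i$ stands for the pre-adaptation manga popularity of anime $i$, $Q_i$ for a summary of its production conditions, and $\beta_3$ for the interaction the hypothesis refers to.

$$
\log \frac{P(\text{win}_i = 1)}{1 - P(\text{win}_i = 1)} = \beta_0 + \beta_1 M_i + \beta_2 Q_i + \beta_3 \, M_i Q_i
$$

Under this sketch, the hypothesis corresponds to a positive $\beta_3$: manga popularity raises the odds of winning more when production conditions are favorable.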

## Data

### Data overview


In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [2]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# raw data files for this project, hosted on Google Drive
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://drive.google.com/file/d/1ZdjB8Ui_8ojiJUEiUoNE_uwFmuqBAmW0/view?usp=drive_link', 'filename':'anime-data.csv'},
    { 'url': 'https://drive.google.com/file/d/1DRinWBLHC5ySrWnZVAGa5Ldx-yArqQsC/view?usp=sharing', 'filename':'anime-box-office.csv'},
    { 'url': 'https://drive.google.com/file/d/1DdAZciV95MxsLxAyanVFO3oUhOc1zbpC/view?usp=drive_link', 'filename':'best-selling-manga.csv'}, 
    { 'url': 'https://drive.google.com/file/d/1CLg_D114adl9siM4eUH-Aby_8c3PCNgw/view?usp=sharing', 'filename':'airing-time.csv'}
    
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Successfully downloaded: anime-data.csv
Successfully downloaded: anime-box-office.csv
Successfully downloaded: best-selling-manga.csv
Successfully downloaded: airing-time.csv

# 1. Public rating and Anime details (Award Winning, # of EPS, Production Cost) 





**Data Description:** The important metrics for each respective dataset are listed below.

*Anime Dataset:*
- title_english: Name of the anime 
- episodes: Number of episodes
- favorites: Number of people who favorited the anime 
- score: Viewer assigned anime rating out of 10 

*Box Office Dataset:*
- anime_name: Name of the anime 
- genres: Genre of the show (Action, Fantasy, Slice of Life, etc.)
- total_views: Total number of views for the anime 
- production_cost: Cost of production per episode in dollars 

Two main datasets will be merged into one: the Anime Dataset and the Box Office Dataset. They will be merged by matching anime names across the two. One major concern is that the same anime can be spelled differently in each dataset. For example, one dataset may list "Devilman Crybaby" while the other lists "Devilman: Crybaby". A merge on the raw names would treat these as two different anime rather than one. To counteract this, we will normalize the anime names by removing punctuation, numbers, and words like Season, Part, The, etc. Additionally, because the data contains entries like "attack on titan" and "attack on titan final season", we will keep the base entry "attack on titan" and remove other entries built on the same base name, so that we have one entry per anime.
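
The normalization and base-title steps described above can be sketched as follows. This is a minimal illustration rather than the cleaning code used in this notebook: `normalize_title` and `keep_base_titles` are hypothetical helper names, and stop words are dropped here as whole words, whereas the notebook's `normalize` function uses `str.replace`, which also removes a stop word when it appears inside a longer word.

```python
import string

# Assumption: this stop-word set mirrors the list used in the notebook's cleaning cell.
STOP_WORDS = {"season", "part", "movie", "ova", "the", "final"}
_STRIP = str.maketrans("", "", string.punctuation + string.digits)


def normalize_title(title):
    """Lowercase, strip punctuation and digits, and drop whole stop words."""
    if not isinstance(title, str):
        return ""
    cleaned = title.lower().translate(_STRIP)
    tokens = [tok for tok in cleaned.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


def keep_base_titles(titles):
    """Keep a title only if no shorter kept title is a word-level prefix of it."""
    kept = []
    for title in sorted(titles, key=len):
        if not any(title == base or title.startswith(base + " ") for base in kept):
            kept.append(title)
    return kept


print(normalize_title("Devilman: Crybaby"))
print(normalize_title("Attack on Titan: Final Season Part 2"))
print(keep_base_titles(["attack on titan", "attack on titan junior high", "death note"]))
```

Matching on whole tokens keeps a title such as "Another" intact, at the cost of missing stop words that are glued to other characters.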

We also had to consider that some anime appear in only one of the datasets. To address this, we merge the two datasets and keep only the entries that appear in both. However, this could delete one of the eight Crunchyroll Anime of the Year winners ("Yuri on Ice", "Made in Abyss", "Devilman Crybaby", "Demon Slayer", "Jujutsu Kaisen", "Attack on Titan", "Cyberpunk Edgerunners", "Solo Leveling"), so we forcibly keep all Crunchyroll award-winning anime even if their rows contain NaN values. We then collect the missing data ourselves and enter it manually. Keeping these anime is important because there are only 8 winners, so losing even a single one would be detrimental.
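
The keep-winners rule above can be audited with the `indicator` option of `pandas.merge`, which records whether each row found a match. The sketch below uses a toy subset: the titles and values are copied from outputs shown in this notebook, but which rows match here is purely illustrative, and the boolean `cr_award_winner` column is a simplification of the string labels used in the actual cleaning code.

```python
import pandas as pd

# Toy frames (assumption: trimmed by hand so that some rows have no match)
anime = pd.DataFrame({
    "anime_norm_title": ["attack on titan", "yuri on ice", "k", "x"],
    "score": [8.56, 7.90, 7.42, 7.38],
    "cr_award_winner": [True, True, False, False],
})
box_office = pd.DataFrame({
    "box_office_norm_title": ["attack on titan", "k", "death note"],
    "production_cost": [200000, 100000, 100000],
})

# Left merge keeps every anime row; `_merge` marks matched vs unmatched rows
merged = anime.merge(
    box_office,
    left_on="anime_norm_title",
    right_on="box_office_norm_title",
    how="left",
    indicator=True,
)

# Keep rows found in both datasets, plus every Crunchyroll winner
matched = merged["_merge"] == "both"
kept = merged[matched | merged["cr_award_winner"]].reset_index(drop=True)

# Winners without a match are the rows that need manual data entry
needs_manual_entry = kept.loc[kept["_merge"] == "left_only", "anime_norm_title"].tolist()
print(needs_manual_entry)
```

Listing the unmatched winners explicitly makes it easier to see which rows were filled in by hand.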

As for the data itself, the main concern is that the favorites column in the Anime Dataset only tells us how many people favorited the anime, not what share of its viewers did so. A higher favorites count therefore does not necessarily indicate that more people enjoyed the anime; it may simply have had more viewers and thus more favorites. Although the Box Office Dataset has a total_views column, people can and do watch an anime multiple times, so it is not a good measure of how many distinct people watched it.
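
One possible way to address this concern is to divide favorites by an audience-size column. The sketch below assumes the `members` column of the Anime Dataset (currently dropped in our cleaning code) counts users who added the anime to their list; we have not verified that definition. The titles and numbers are hypothetical and chosen only to show the effect.

```python
import pandas as pd

# Hypothetical values, not taken from any dataset
toy = pd.DataFrame({
    "title": ["show_a", "show_b"],
    "favorites": [50_000, 10_000],
    "members": [2_000_000, 100_000],
})

# Share of the audience that favorited the show
toy["favorite_rate"] = toy["favorites"] / toy["members"]

print(toy.sort_values("favorite_rate", ascending=False))
```

Here `show_a` has more favorites in absolute terms, but `show_b` is favorited by a larger share of its audience.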

The most concerning information missing from these two datasets is the year each anime started and finished airing. An anime that performed incredibly well with the general public but finished airing before the Crunchyroll Anime Awards began in 2017 could never have won the award. It will therefore be important to find another dataset that contains this information so that we can group the anime by year.

*Our merged dataset will be named anime_details with the following columns:*
- anime_norm_title: Anime title after normalization
- episodes: Number of episodes
- favorites: Number of people who favorited the anime
- score: Viewer-assigned anime rating out of 10
- genres: Genre of the show (Action, Fantasy, Slice of Life, etc.)
- total_views: Total number of views for the anime per episode
- production_cost: Cost of production per episode in dollars
- award: Binary value indicating whether the anime is considered award winning in general
- cr_award_winner: Binary value indicating whether the anime won the Crunchyroll Anime of the Year award

In [3]:
# Load, clean, and normalize the anime and box office datasets
import pandas as pd
import numpy as np
import string

def load_from_drive(file_id):
    url = f"https://drive.google.com/uc?id={file_id}&export=download"
    return pd.read_csv(url)

# Words removed from titles during normalization (extend as needed)
STOP_WORDS = ["season", "part", "movie", "ova", "the", "final"]

def normalize(s):
    """Return a lowercase title with punctuation, digits, and stop words removed."""
    if pd.isna(s):
        return ""
    s = s.lower()
    # remove punctuation and digits
    remove_chars = string.punctuation + string.digits
    s = s.translate(str.maketrans("", "", remove_chars))
    # remove stop words; str.replace matches substrings, so a stop word
    # is also removed when it occurs inside a longer word
    for word in STOP_WORDS:
        s = s.replace(word, "")
    # collapse extra spaces
    s = " ".join(s.split())
    return s

# Load the data and drop unneeded columns
anime = load_from_drive("1ZdjB8Ui_8ojiJUEiUoNE_uwFmuqBAmW0")
box_office = load_from_drive("1DRinWBLHC5ySrWnZVAGa5Ldx-yArqQsC")
anime = anime.drop(["status", "rating", "rank", "members", "popularity", "themes", "scored_by", "demographics"], axis=1)
box_office = box_office.drop(["anime_id", "number_of_episodes", "rating", "type", "total_box_office"], axis=1)

# Make the anime names lower case
anime["title_english"] = anime["title_english"].str.lower()
box_office["anime_name"] = box_office["anime_name"].str.lower()

# Turned genres column into awards column 
# Awards column only shows if the anime is Award Winning or not 
anime["award"] = np.where(
    anime["genres"].str.contains("Award Winning", na=False),
    "Award Winning",
    "Not Award Winning"
)
anime = anime.drop("genres", axis=1)

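anime["anime_norm_title"] = anime["title_english"].apply(normalize)
box_office["box_office_norm_title"] = box_office["anime_name"].apply(normalize)
# Keep the first row per normalized title; entries that normalize to the same
# string (e.g. a numbered sequel and the original) collapse into one row
anime = anime.drop_duplicates(subset="anime_norm_title").reset_index(drop=True)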

# simple keyword list (lowercase, punctuation-removed)
cr_winners_keywords = [
    "yuri on ice",
    "made in abyss",
    "devilman crybaby",
    "demon slayer",
    "jujutsu kaisen",
    "attack on titan",
    "cyberpunk edgerunners",
    "solo leveling"
]

# Label Crunchyroll winners with a simple substring match.
# Caveat: any title that contains a winner's keyword is labeled as a winner too.
anime["cr_award_winner"] = anime["anime_norm_title"].apply(
    lambda t: any(keyword in t for keyword in cr_winners_keywords)
)

anime["cr_award_winner"] = anime["cr_award_winner"].map({
    True: "Crunchyroll Award Winner",
    False: "Not Crunchyroll Winner"
})

# Drop entries listed with 0 to 3 episodes
anime = anime[~anime["episodes"].isin([0, 1, 2, 3])]

# Drop rows with NaN ONLY if they are NOT Crunchyroll Award Winners
anime = anime[
    (anime["cr_award_winner"] == "Crunchyroll Award Winner") |
    (anime.notna().all(axis=1))
].reset_index(drop=True)
box_office= box_office.dropna().reset_index(drop=True)

# Ensure no NaN
anime["anime_norm_title"] = anime["anime_norm_title"].fillna("").astype(str)

# Sort by length so base titles come first
anime = anime.sort_values(
    by="anime_norm_title",
    key=lambda x: x.str.len()
).reset_index(drop=True)

# Keep only base titles
seen_bases = []
keep_idx = []

for idx, title in anime["anime_norm_title"].items():
    # Keep if no previously kept title is a prefix of this title
    if not any(title.startswith(base + " ") or title == base for base in seen_bases):
        keep_idx.append(idx)
        seen_bases.append(title)

anime = anime.loc[keep_idx].reset_index(drop=True)

display(anime)
display(anime[anime["cr_award_winner"] == "Crunchyroll Award Winner"])
print("Total number of Award Winning anime:", (anime["award"] == "Award Winning").sum())
print("Total number of CrunchyRoll Award Winning anime:", (anime["cr_award_winner"] == "Crunchyroll Award Winner").sum())
print()
print("Box Office Dataset: ")
display(box_office)

Unnamed: 0,title_english,episodes,favorites,score,award,anime_norm_title,cr_award_winner
0,x,24.0,772,7.38,Not Award Winning,x,Not Crunchyroll Winner
1,k,13.0,6863,7.42,Not Award Winning,k,Not Crunchyroll Winner
2,18if,13.0,109,6.11,Not Award Winning,if,Not Crunchyroll Winner
3,id-0,12.0,22,6.50,Not Award Winning,id,Not Crunchyroll Winner
4,mm!,12.0,662,7.01,Not Award Winning,mm,Not Crunchyroll Winner
...,...,...,...,...,...,...,...
3079,my isekai life: i gained a second character cl...,12.0,1416,6.33,Not Award Winning,my isekai life i gained a second character cla...,Not Crunchyroll Winner
3080,the strongest tank's labyrinth raids -a tank w...,12.0,139,6.12,Not Award Winning,strongest tanks labyrinth raids a tank with a ...,Not Crunchyroll Winner
3081,"my instant death ability is so overpowered, no...",12.0,633,6.37,Not Award Winning,my instant death ability is so overpowered no ...,Not Crunchyroll Winner
3082,bogus skill : about that time i became able to...,12.0,267,5.76,Not Award Winning,bogus skill about that time i became able to e...,Not Crunchyroll Winner


Unnamed: 0,title_english,episodes,favorites,score,award,anime_norm_title,cr_award_winner
671,yuri!!! on ice,12.0,21184,7.9,Award Winning,yuri on ice,Crunchyroll Award Winner
957,solo leveling,12.0,18746,8.25,Not Award Winning,solo leveling,Crunchyroll Award Winner
968,made in abyss,13.0,46530,8.63,Not Award Winning,made in abyss,Crunchyroll Award Winner
1045,jujutsu kaisen,24.0,93814,8.54,Award Winning,jujutsu kaisen,Crunchyroll Award Winner
1146,attack on titan,25.0,181378,8.56,Award Winning,attack on titan,Crunchyroll Award Winner
1289,devilman: crybaby,10.0,25388,7.74,Not Award Winning,devilman crybaby,Crunchyroll Award Winner
1950,cyberpunk: edgerunners 2,10.0,141,,Not Award Winning,cyberpunk edgerunners,Crunchyroll Award Winner
2597,demon slayer: kimetsu no yaiba,26.0,93059,8.43,Award Winning,demon slayer kimetsu no yaiba,Crunchyroll Award Winner


Total number of Award Winning anime: 51
Total number of CrunchyRoll Award Winning anime: 8

Box Office Dataset: 


Unnamed: 0,anime_name,genres,total_views,production_cost,box_office_norm_title
0,attack on titan,"Action, Drama, Fantasy, Mystery",778095,200000,attack on titan
1,demon slayer,"Action, Adventure, Drama, Fantasy, Supernatural",735876,250000,demon slayer
2,death note,"Mystery, Psychological, Supernatural, Thriller",708493,100000,death note
3,jujutsu kaisen,"Action, Drama, Supernatural",677899,250000,jujutsu kaisen
4,my hero academia,"Action, Adventure, Comedy",672551,200000,my hero academia
...,...,...,...,...,...
995,knights & magic,"Action, Fantasy, Mecha",52265,150000,knights magic
996,isekai yakkyoku,Fantasy,52175,250000,isekai yakkyoku
997,tensei kenja no isekai life: daini no shokugyo...,"Action, Adventure, Comedy, Fantasy",52116,300000,tensei kenja no isekai life daini no shokugyo ...
998,ryuu to sobakasu no hime,"Drama, Music, Mystery, Sci-Fi",52055,300000,ryuu to sobakasu no hime


In [4]:
# Merge box office onto anime; keep all anime rows
anime_details = pd.merge(
    anime,
    box_office,
    left_on="anime_norm_title",
    right_on="box_office_norm_title",
    how="left"
)

# Drop helper columns
anime_details = anime_details.drop(["title_english", "box_office_norm_title", "anime_name"], axis=1)

# Remove duplicate anime names
# Ensure no NaN in the merged frame's title column
anime_details["anime_norm_title"] = anime_details["anime_norm_title"].fillna("").astype(str)

# Sort by length so base titles come first
anime_details = anime_details.sort_values(
    by="anime_norm_title",
    key=lambda x: x.str.len()
).reset_index(drop=True)

# Keep only base titles
seen_bases = []
keep_idx = []

for idx, title in anime_details["anime_norm_title"].items():
    # Keep if no previously kept title is a prefix of this title
    if not any(title.startswith(base + " ") or title == base for base in seen_bases):
        keep_idx.append(idx)
        seen_bases.append(title)

anime_details = anime_details.loc[keep_idx].reset_index(drop=True)

# Bring the anime name to the very front 
# Get all columns
cols = anime_details.columns.tolist()

# Reorder the data
cols.remove("anime_norm_title")
cols.remove("award")
cols.remove("cr_award_winner")
new_order = ["anime_norm_title"] + cols + ["award", "cr_award_winner"]
anime_details = anime_details[new_order]

# Drop rows with NaN ONLY if they are NOT Crunchyroll Award Winners
anime_details = anime_details[
    (anime_details["cr_award_winner"] == "Crunchyroll Award Winner") |
    (anime_details.notna().all(axis=1))
].reset_index(drop=True)

# Display the result
display(anime_details)
display(anime_details[anime_details["cr_award_winner"] == "Crunchyroll Award Winner"])
print("Total number of Award Winning anime:", (anime_details["award"] == "Award Winning").sum())
print("Total number of CrunchyRoll Award Winning anime:", (anime_details["cr_award_winner"] == "Crunchyroll Award Winner").sum())

Unnamed: 0,anime_norm_title,episodes,favorites,score,genres,total_views,production_cost,award,cr_award_winner
0,k,13.0,6863,7.42,"Action, Mystery, Supernatural",139327.0,100000.0,Not Award Winning,Not Crunchyroll Winner
1,no,11.0,7480,7.55,"Action, Drama, Psychological, Sci-Fi",85420.0,150000.0,Not Award Winning,Not Crunchyroll Winner
2,kon,13.0,24951,7.86,"Comedy, Music, Slice of Life",71879.0,100000.0,Not Award Winning,Not Crunchyroll Winner
3,days,12.0,8438,7.82,"Action, Drama, Psychological",154842.0,300000.0,Not Award Winning,Not Crunchyroll Winner
4,nana,47.0,33206,8.56,"Drama, Music, Romance, Slice of Life",141706.0,250000.0,Not Award Winning,Not Crunchyroll Winner
...,...,...,...,...,...,...,...,...,...
213,monogatari series second,26.0,23971,8.76,"Comedy, Drama, Mystery, Psychological, Romance...",140177.0,300000.0,Not Award Winning,Not Crunchyroll Winner
214,love live school idol project,13.0,8159,7.40,"Music, Slice of Life",95914.0,300000.0,Not Award Winning,Not Crunchyroll Winner
215,demon slayer kimetsu no yaiba,26.0,93059,8.43,,,,Award Winning,Crunchyroll Award Winner
216,panty stocking with garterbelt,13.0,6359,7.74,"Action, Comedy, Ecchi, Supernatural",79834.0,150000.0,Not Award Winning,Not Crunchyroll Winner


Unnamed: 0,anime_norm_title,episodes,favorites,score,genres,total_views,production_cost,award,cr_award_winner
87,yuri on ice,12.0,21184,7.9,,,,Award Winning,Crunchyroll Award Winner
130,solo leveling,12.0,18746,8.25,,,,Not Award Winning,Crunchyroll Award Winner
131,made in abyss,13.0,46530,8.63,"Adventure, Drama, Fantasy, Horror, Mystery, Sc...",302337.0,200000.0,Not Award Winning,Crunchyroll Award Winner
161,jujutsu kaisen,24.0,93814,8.54,"Action, Drama, Supernatural",677899.0,250000.0,Award Winning,Crunchyroll Award Winner
174,attack on titan,25.0,181378,8.56,"Action, Drama, Fantasy, Mystery",778095.0,200000.0,Award Winning,Crunchyroll Award Winner
181,devilman crybaby,10.0,25388,7.74,"Action, Drama, Horror, Psychological, Supernat...",266483.0,200000.0,Not Award Winning,Crunchyroll Award Winner
204,cyberpunk edgerunners,10.0,141,,"Action, Drama, Psychological, Sci-Fi",209422.0,100000.0,Not Award Winning,Crunchyroll Award Winner
215,demon slayer kimetsu no yaiba,26.0,93059,8.43,,,,Award Winning,Crunchyroll Award Winner


Total number of Award Winning anime: 15
Total number of CrunchyRoll Award Winning anime: 8


In [5]:
# Missing Data Analysis
print("Number of rows with NaN:", anime_details.isna().any(axis=1).sum())

# Fill in missing data for Crunchyroll winners with manually collected values,
# keyed by normalized title instead of hard-coded row labels
manual_values = {
    "yuri on ice": {"genres": "Sports, Comedy", "total_views": 285000, "production_cost": 200000},
    "solo leveling": {"genres": "Action, Fantasy, Adventure", "total_views": 500000, "production_cost": 300000},
    "demon slayer kimetsu no yaiba": {"genres": "Action, Fantasy, Supernatural, Historical", "total_views": 800000, "production_cost": 150000},
    "cyberpunk edgerunners": {"score": 8.7},
}
for title, values in manual_values.items():
    is_title = anime_details["anime_norm_title"] == title
    for column, value in values.items():
        anime_details.loc[is_title, column] = anime_details.loc[is_title, column].fillna(value)

display(anime_details[anime_details["cr_award_winner"] == "Crunchyroll Award Winner"])
print("Number of rows with NaN:", anime_details.isna().any(axis=1).sum())

Number of rows with NaN: 4


Unnamed: 0,anime_norm_title,episodes,favorites,score,genres,total_views,production_cost,award,cr_award_winner
87,yuri on ice,12.0,21184,7.9,"Sports, Comedy",285000.0,200000.0,Award Winning,Crunchyroll Award Winner
130,solo leveling,12.0,18746,8.25,"Action, Fantasy, Adventure",500000.0,300000.0,Not Award Winning,Crunchyroll Award Winner
131,made in abyss,13.0,46530,8.63,"Adventure, Drama, Fantasy, Horror, Mystery, Sc...",302337.0,200000.0,Not Award Winning,Crunchyroll Award Winner
161,jujutsu kaisen,24.0,93814,8.54,"Action, Drama, Supernatural",677899.0,250000.0,Award Winning,Crunchyroll Award Winner
174,attack on titan,25.0,181378,8.56,"Action, Drama, Fantasy, Mystery",778095.0,200000.0,Award Winning,Crunchyroll Award Winner
181,devilman crybaby,10.0,25388,7.74,"Action, Drama, Horror, Psychological, Supernat...",266483.0,200000.0,Not Award Winning,Crunchyroll Award Winner
204,cyberpunk edgerunners,10.0,141,8.7,"Action, Drama, Psychological, Sci-Fi",209422.0,100000.0,Not Award Winning,Crunchyroll Award Winner
215,demon slayer kimetsu no yaiba,26.0,93059,8.43,"Action, Fantasy, Supernatural, Historical",800000.0,150000.0,Award Winning,Crunchyroll Award Winner


Number of rows with NaN: 0


# 2. Manga Sales / Pre Anime Adaptation Source Popularity


**Data Description:** The important metrics for the dataset are listed below.

*Manga Dataset:*
- manga series: Name of the manga
- author(s): Name of the author(s)
- publisher: Name of the publishing company
- no. of collected volumes: Number of collected volumes of the manga
- approximate sales: Approximate sales in millions of copies
- average sales per volume: Average sales per collected volume, in millions of copies

We will also merge the Manga Dataset with the Anime Dataset, so the same name-matching issue we faced when merging the Anime Dataset and the Box Office Dataset will occur. We plan to once again normalize the names, forcibly keep all Crunchyroll award winners, and fill in missing data manually when needed. It is also important to note that Cyberpunk: Edgerunners was not adapted from a manga, so we will not have manga data for one of the Crunchyroll award winners.

The most meaningful piece of data in the merged dataset will be the average sales per volume (in millions). Simply looking at total approximate sales can be misleading: higher total sales may occur simply because a series has more volumes available, not necessarily because the manga itself is more popular. By focusing on average sales per volume, we can better compare the true popularity of different manga titles. 
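
As a quick worked check of this metric, the snippet below recomputes average sales per volume for three series, using the sales and volume counts shown in the manga table in this notebook. The helper name `average_sales_per_volume` is our own; the dataset already ships this column, so the sketch only illustrates how it relates to the other two.

```python
# (series, approximate sales in millions, number of collected volumes)
rows = [
    ("one piece", 516.6, 104),
    ("golgo 13", 300.0, 207),
    ("dragon ball", 260.0, 42),
]


def average_sales_per_volume(total_sales_millions, volumes):
    """Average sales per collected volume, in millions, rounded to 2 decimals."""
    return round(total_sales_millions / volumes, 2)


for series, sales, volumes in rows:
    print(series, average_sales_per_volume(sales, volumes))
```

Golgo 13 has higher total sales than Dragon Ball but a much lower average per volume, because its sales are spread over 207 volumes instead of 42.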

We plan to use this metric to examine whether higher sales correlate with higher viewer scores and, ultimately, whether this influences an anime’s likelihood of winning a Crunchyroll award. We also hope to identify whether the author or publisher of the manga shows any correlation with its success. If such correlations exist, they may indicate that the original creator or publishing company plays a role in shaping the anime’s eventual performance. Our hypothesis is that the author will not significantly impact the manga’s commercial success, whereas the publisher will. This is primarily because authors typically do not control marketing or distribution; larger publishing companies, however, tend to have stronger marketing budgets and infrastructure, which can substantially boost a manga’s visibility and sales.
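
The publisher comparison could start from a simple grouped summary, sketched below. The five rows are the first five entries of the manga table shown in this notebook, so they are not representative of the full dataset; the column name `avg_sales_per_volume` is shortened here for readability, and the output only demonstrates the mechanics, not a result.

```python
import pandas as pd

# First five rows of the manga table (average sales per volume in millions)
sample = pd.DataFrame({
    "manga series": ["one piece", "golgo 13", "case closed / detective conan", "dragon ball", "doraemon"],
    "publisher": ["Shueisha", "Shogakukan", "Shogakukan", "Shueisha", "Shogakukan"],
    "avg_sales_per_volume": [4.97, 1.45, 2.65, 6.19, 4.71],
})

# Count, mean, and median of per-volume sales for each publisher
by_publisher = (
    sample.groupby("publisher")["avg_sales_per_volume"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)

print(by_publisher.round(2))
```

Reporting the count next to the mean and median helps flag publishers with too few titles to support any conclusion.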

In [6]:
# Load and clean the manga sales dataset

# Load data
manga = load_from_drive("1DdAZciV95MxsLxAyanVFO3oUhOc1zbpC")

# Drop unused Columns / reformat data
manga = manga.drop(["Demographic", "Serialized"], axis=1)
manga.columns = manga.columns.str.lower()
manga["manga series"] = manga["manga series"].str.lower()
manga = manga.dropna().reset_index(drop=True)
manga["manga_norm_title"] = manga["manga series"].apply(normalize)

display(manga)

Unnamed: 0,manga series,author(s),publisher,no. of collected volumes,approximate sales in million(s),average sales per volume in million(s),manga_norm_title
0,one piece,Eiichiro Oda,Shueisha,104,516.6,4.97,one piece
1,golgo 13,"Takao Saito, Saito Production",Shogakukan,207,300.0,1.45,golgo
2,case closed / detective conan,Gosho Aoyama,Shogakukan,102,270.0,2.65,case closed detective conan
3,dragon ball,Akira Toriyama,Shueisha,42,260.0,6.19,dragon ball
4,doraemon,Fujiko F. Fujio,Shogakukan,45,250.0,4.71,doraemon
...,...,...,...,...,...,...,...
182,sukeban deka,Shinji Wada,Hakusensha,22,20.0,0.90,sukeban deka
183,swan,Kyoko Ariyoshi,Shueisha,21,20.0,0.95,swan
184,the tale of genji,Waki Yamato,Kodansha,13,20.0,1.53,tale of genji
185,tokyo daigaku monogatari,Tatsuya Egawa,Shogakukan,34,20.0,0.58,tokyo daigaku monogatari


In [7]:
# manga merge into anime; keep all anime rows
manga_anime = pd.merge(
    anime,
    manga,
    left_on="anime_norm_title",
    right_on="manga_norm_title",
    how="left"
)

# Drop unused columns 
manga_anime = manga_anime.drop(["episodes", "favorites", "manga_norm_title", "manga series", "title_english"], axis=1)

# Drop rows with NaN ONLY if they are NOT Crunchyroll Award Winners
manga_anime = manga_anime[
    (manga_anime["cr_award_winner"] == "Crunchyroll Award Winner") |
    (manga_anime.notna().all(axis=1))
].reset_index(drop=True)

manga_anime = manga_anime.drop(74)

# Reorder the data 
front_cols = ["anime_norm_title"]
end_cols = ["award", "cr_award_winner"]
middle_cols = [col for col in manga_anime.columns 
               if col not in front_cols + end_cols]
manga_anime = manga_anime[front_cols + middle_cols + end_cols]

display(manga_anime)
display(manga_anime[manga_anime["cr_award_winner"] == "Crunchyroll Award Winner"])
print("Total number of Award Winning anime:", (manga_anime["award"] == "Award Winning").sum())
print("Total number of CrunchyRoll Award Winning anime:", (manga_anime["cr_award_winner"] == "Crunchyroll Award Winner").sum())

Unnamed: 0,anime_norm_title,score,author(s),publisher,no. of collected volumes,approximate sales in million(s),average sales per volume in million(s),award,cr_award_winner
0,nana,8.56,Ai Yazawa,Shueisha,21.0,50.0,2.38,Not Award Winning,Not Crunchyroll Winner
1,fable,8.13,Katsuhisa Minami,Kodansha,26.0,20.0,1.53,Not Award Winning,Not Crunchyroll Winner
2,gantz,6.97,Hiroya Oku,Shueisha,37.0,24.0,0.64,Not Award Winning,Not Crunchyroll Winner
3,golgo,7.51,"Takao Saito, Saito Production",Shogakukan,207.0,300.0,1.45,Not Award Winning,Not Crunchyroll Winner
4,toriko,7.51,Mitsutoshi Shimabukuro,Shueisha,43.0,25.0,0.58,Not Award Winning,Not Crunchyroll Winner
...,...,...,...,...,...,...,...,...,...
82,food wars shokugeki no soma,8.12,"Yūto Tsukuda, Shun Saeki",Shueisha,36.0,20.0,0.55,Not Award Winning,Not Crunchyroll Winner
83,tsubasa reservoir chronicle,7.51,Clamp,Kodansha,28.0,21.0,0.71,Not Award Winning,Not Crunchyroll Winner
84,dragon quest adventure of dai,7.72,"Riku Sanjo, Koji Inada",Shueisha,37.0,50.0,1.35,Not Award Winning,Not Crunchyroll Winner
85,demon slayer kimetsu no yaiba,8.43,Koyoharu Gotouge,Shueisha,23.0,150.0,6.52,Award Winning,Crunchyroll Award Winner


Unnamed: 0,anime_norm_title,score,author(s),publisher,no. of collected volumes,approximate sales in million(s),average sales per volume in million(s),award,cr_award_winner
31,yuri on ice,7.9,,,,,,Award Winning,Crunchyroll Award Winner
44,solo leveling,8.25,,,,,,Not Award Winning,Crunchyroll Award Winner
46,made in abyss,8.63,,,,,,Not Award Winning,Crunchyroll Award Winner
50,jujutsu kaisen,8.54,Gege Akutami,Shueisha,22.0,70.0,3.18,Award Winning,Crunchyroll Award Winner
55,attack on titan,8.56,Hajime Isayama,Kodansha,34.0,110.0,3.23,Award Winning,Crunchyroll Award Winner
60,devilman crybaby,7.74,,,,,,Not Award Winning,Crunchyroll Award Winner
85,demon slayer kimetsu no yaiba,8.43,Koyoharu Gotouge,Shueisha,23.0,150.0,6.52,Award Winning,Crunchyroll Award Winner


Total number of Award Winning anime: 6
Total number of CrunchyRoll Award Winning anime: 7


In [8]:
# Missing Data Analysis
print("Number of rows with NaN:", manga_anime.isna().any(axis=1).sum())

# Fill in missing data 
manga_anime.loc[31] = manga_anime.loc[31].fillna({
    'author(s)': 'Mitsurou Kubo',
    'publisher': 'Gentosha',
    'no. of collected volumes': 6,
    'approximate sales in million(s)': 23
})

manga_anime.loc[44] = manga_anime.loc[44].fillna({
    'author(s)': 'Chu-Gong',
    'publisher': 'D&C Media / Yen Press',
    'no. of collected volumes': 14.2,
    'approximate sales in million(s)': 78
})

manga_anime.loc[46] = manga_anime.loc[46].fillna({
    'author(s)': 'Akihito Tsukushi',
    'publisher': 'Takeshobo',
    'no. of collected volumes': 14,
    'approximate sales in million(s)': 60
})

manga_anime.loc[60] = manga_anime.loc[60].fillna({
    'author(s)': 'Go Nagai',
    'publisher': 'Kodansha',
    'no. of collected volumes': 5,
    'approximate sales in million(s)': 50
})

manga_anime['average sales per volume in million(s)'] = manga_anime.apply(
    lambda row: (row['approximate sales in million(s)'] / row['no. of collected volumes'])
    if pd.notnull(row['approximate sales in million(s)']) and row['no. of collected volumes'] > 0 else None,
    axis=1
)

display(manga_anime[manga_anime["cr_award_winner"] == "Crunchyroll Award Winner"])
print("Number of rows with NaN:", manga_anime.isna().any(axis=1).sum())

Number of rows with NaN: 4


Unnamed: 0,anime_norm_title,score,author(s),publisher,no. of collected volumes,approximate sales in million(s),average sales per volume in million(s),award,cr_award_winner
31,yuri on ice,7.9,Mitsurou Kubo,Gentosha,6.0,23.0,3.833333,Award Winning,Crunchyroll Award Winner
44,solo leveling,8.25,Chu-Gong,D&C Media / Yen Press,14.2,78.0,5.492958,Not Award Winning,Crunchyroll Award Winner
46,made in abyss,8.63,Akihito Tsukushi,Takeshobo,14.0,60.0,4.285714,Not Award Winning,Crunchyroll Award Winner
50,jujutsu kaisen,8.54,Gege Akutami,Shueisha,22.0,70.0,3.181818,Award Winning,Crunchyroll Award Winner
55,attack on titan,8.56,Hajime Isayama,Kodansha,34.0,110.0,3.235294,Award Winning,Crunchyroll Award Winner
60,devilman crybaby,7.74,Go Nagai,Kodansha,5.0,50.0,10.0,Not Award Winning,Crunchyroll Award Winner
85,demon slayer kimetsu no yaiba,8.43,Koyoharu Gotouge,Shueisha,23.0,150.0,6.521739,Award Winning,Crunchyroll Award Winner


Number of rows with NaN: 0


# 3. Web-Scraped Data from YouTube for Anime Sentiment Analysis

The sentiment data come from sentiment analysis of scraped YouTube video comments related to each anime. Each comment receives a polarity score between -1 and 1, where 1 represents very positive sentiment, -1 represents very negative sentiment, and 0 represents neutral. Because we scraped multiple comments per anime, we take the average sentiment of all comments related to each anime. Note that we were unable to scrape any videos for Jujutsu Kaisen and will be dropping this data.

One concern we will have to address going forward is that the scraper only looks at comments from the first video returned by a YouTube search. We should revise our code so that it takes comments from multiple videos, giving a sentiment value that reflects the anime rather than a single video.
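If comments from several videos are pooled, one busy video can still dominate the average. A possible way around this, sketched below with made-up data, is to average within each video first and then across videos. The `video_id` column is an assumption (the scraper would need to store it alongside each comment), and the titles and scores are purely illustrative.

```python
import pandas as pd

# Toy comment-level data; titles, video IDs, and scores are made up.
comments = pd.DataFrame({
    "anime": ["a", "a", "a", "a", "b", "b"],
    "video_id": ["v1", "v1", "v1", "v2", "v3", "v4"],
    "sentiment": [0.9, 0.9, 0.9, -0.3, 0.2, 0.4],
})

# Pooled mean: every comment counts equally, so video v1 dominates anime "a".
pooled = comments.groupby("anime")["sentiment"].mean()

# Two-stage mean: average within each video, then across videos per anime.
per_video = comments.groupby(["anime", "video_id"])["sentiment"].mean()
balanced = per_video.groupby(level="anime").mean()

print(pd.DataFrame({"pooled": pooled, "balanced": balanced}))
```

For anime "a" the pooled mean is 0.6, while the two-stage mean is 0.3, because the single negative video now carries the same weight as the video with three positive comments.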

Additionally, some of the scraped comments are unrelated to the anime. We should increase the number of comments we scrape to reduce the effect these unrelated comments have on our sentiment value.
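Besides scraping more comments, a stricter relevance filter may help. Substring matching on title words treats "one" as a mention of "on" and "nice" as a mention of "ice". The sketch below matches whole words only and ignores very short title words. The function name `is_relevant` and the `min_word_len` threshold are hypothetical choices, not part of the notebook.

```python
import re

def is_relevant(comment, anime_title, min_word_len=3):
    """Return True if the comment contains a title word as a whole word."""
    # Skip very short title words ("on", "no") that match almost anything.
    keywords = [w for w in anime_title.lower().split() if len(w) >= min_word_len]
    if not keywords:
        # Fall back to the full title when every word is short (e.g. "k").
        keywords = [anime_title.lower()]
    # \b anchors each keyword at word boundaries; re.escape guards odd characters.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in keywords) + r")\b"
    return re.search(pattern, comment.lower()) is not None

print(is_relevant("Yuri on Ice is great", "yuri on ice"))
print(is_relevant("one of my favorites, so nice", "yuri on ice"))
```

The first call returns `True` and the second returns `False`, whereas plain substring matching would accept both comments.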

In [9]:
!pip install youtube-comment-downloader
!pip install textblob
!pip install pandas
!pip install requests
!pip install beautifulsoup4
!pip install tqdm

from youtube_comment_downloader import YoutubeCommentDownloader
from textblob import TextBlob
import pandas as pd
import requests
import re
import time
from tqdm import tqdm  # progress bar

def search_youtube(anime_name, max_videos=5):
    """Return a list of video IDs quickly, with timeout and error handling."""
    query = anime_name + " anime trailer"
    url = f"https://www.youtube.com/results?search_query={query.replace(' ', '+')}"
    
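    try:
        # Time out rather than hang on a slow response (10 s is an arbitrary choice).
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        text = response.text
        video_ids = []
        for match in re.finditer(r"watch\?v=(\S{11})", text):
            video_id = match.group(1)
            # The results page can repeat an ID; keep each video only once.
            if video_id not in video_ids:
                video_ids.append(video_id)
            if len(video_ids) >= max_videos:
                break
        return video_ids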
    except Exception as e:
        print(f"Failed to get videos for {anime_name}: {e}")
        return []

def get_relevant_comments(video_id, anime_title, max_comments=100):
    """Get comments that mention a title word, scanning at most 2 * max_comments per video."""
    downloader = YoutubeCommentDownloader()
    comments = []
    keywords = set(anime_title.lower().split())
    
    try:
        for i, c in enumerate(downloader.get_comments(video_id)):
            if i >= max_comments * 2:  # safety limit, to avoid long loops
                break
            comment_text = c["text"].lower()
            if any(word in comment_text for word in keywords):
                comments.append(c["text"])
            if len(comments) >= max_comments:
                break
    except Exception as e:
        print(f"Failed to get comments for video {video_id}: {e}")
    
    return comments

def analyze_sentiment(text):
    return TextBlob(text).sentiment.polarity

all_data = []
data = anime_details  # your dataframe with anime titles

for idx, row in tqdm(data.iterrows(), total=len(data), desc="Processing Anime"):
    title = row["anime_norm_title"]
    
    video_ids = search_youtube(title, max_videos=10) 
    if not video_ids:
        continue
    
    for video_id in video_ids:
        comments = get_relevant_comments(video_id, title, max_comments=50)
        if not comments:
            continue
        
        for c in comments:
            all_data.append({
                "anime": title,
                "comment": c,
                "sentiment": analyze_sentiment(c)
            })
        
        time.sleep(1) 

df_comments = pd.DataFrame(all_data)
df_comments.to_csv("anime_youtube_comments.csv", index=False)
print("Done!")

# Compute average sentiment per anime
df_sentiment = df_comments.groupby("anime")["sentiment"].mean().reset_index()
df_sentiment = df_sentiment.rename(columns={"sentiment": "avg_sentiment"})

# Display and save
display(df_sentiment)
df_sentiment.to_csv("anime_avg_sentiment.csv", index=False)

Defaulting to user installation because normal site-packages is not writeable


Processing Anime: 100%|██████████| 218/218 [2:18:12<00:00, 38.04s/it]  

Done!





Unnamed: 0,anime,avg_sentiment
0,absolute duo,0.187405
1,accel world,0.107922
2,afro samurai,0.175773
3,aharensan wa hakarenai,0.116016
4,akame ga kill,0.179877
...,...,...
200,world trigger,0.107624
201,xxxholic,0.311942
202,your lie in april,0.120805
203,yuri on ice,0.096507


In [10]:
# Merge sentiment data with anime details data
df_sentiment = pd.read_csv("anime_avg_sentiment.csv")

sentiment_anime = pd.merge(
    anime_details,
    df_sentiment,
    left_on="anime_norm_title",
    right_on="anime",
    how="left"
)

# Drop unused columns 
sentiment_anime = sentiment_anime.drop(["anime", "episodes", "favorites", "genres", "total_views"], axis=1)

# Drop rows with NaN ONLY if they are NOT Crunchyroll Award Winners
sentiment_anime = sentiment_anime[
    (sentiment_anime["cr_award_winner"] == "Crunchyroll Award Winner") |
    (sentiment_anime.notna().all(axis=1))
].reset_index(drop=True)

sentiment_anime = sentiment_anime.drop(124)

# Reorder the data 
front_cols = ["anime_norm_title"]
end_cols = ["award", "cr_award_winner"]
middle_cols = [col for col in sentiment_anime.columns 
               if col not in front_cols + end_cols]
sentiment_anime = sentiment_anime[front_cols + middle_cols + end_cols]

display(sentiment_anime)
display(sentiment_anime[sentiment_anime["cr_award_winner"] == "Crunchyroll Award Winner"])
print("Total number of Award Winning anime:", (sentiment_anime["award"] == "Award Winning").sum())
print("Total number of CrunchyRoll Award Winning anime:", (sentiment_anime["cr_award_winner"] == "Crunchyroll Award Winner").sum())
print()
print("Summary of sentiment value: ")
sentiment_anime["avg_sentiment"].describe()

Unnamed: 0,anime_norm_title,score,production_cost,avg_sentiment,award,cr_award_winner
0,k,7.42,100000.0,0.109916,Not Award Winning,Not Crunchyroll Winner
1,no,7.55,150000.0,0.026979,Not Award Winning,Not Crunchyroll Winner
2,kon,7.86,100000.0,0.255336,Not Award Winning,Not Crunchyroll Winner
3,days,7.82,300000.0,0.277170,Not Award Winning,Not Crunchyroll Winner
4,nana,8.56,250000.0,0.150131,Not Award Winning,Not Crunchyroll Winner
...,...,...,...,...,...,...
200,monogatari series second,8.76,300000.0,0.086005,Not Award Winning,Not Crunchyroll Winner
201,love live school idol project,7.40,300000.0,0.239505,Not Award Winning,Not Crunchyroll Winner
202,demon slayer kimetsu no yaiba,8.43,150000.0,0.104276,Award Winning,Crunchyroll Award Winner
203,panty stocking with garterbelt,7.74,150000.0,0.069628,Not Award Winning,Not Crunchyroll Winner


Unnamed: 0,anime_norm_title,score,production_cost,avg_sentiment,award,cr_award_winner
77,yuri on ice,7.9,200000.0,0.096507,Award Winning,Crunchyroll Award Winner
118,solo leveling,8.25,300000.0,-0.047457,Not Award Winning,Crunchyroll Award Winner
119,made in abyss,8.63,200000.0,0.12074,Not Award Winning,Crunchyroll Award Winner
149,jujutsu kaisen,8.54,250000.0,0.0,Award Winning,Crunchyroll Award Winner
162,attack on titan,8.56,200000.0,0.083384,Award Winning,Crunchyroll Award Winner
169,devilman crybaby,7.74,200000.0,0.187447,Not Award Winning,Crunchyroll Award Winner
191,cyberpunk edgerunners,8.7,100000.0,0.129144,Not Award Winning,Crunchyroll Award Winner
202,demon slayer kimetsu no yaiba,8.43,150000.0,0.104276,Award Winning,Crunchyroll Award Winner


Total number of Award Winning anime: 14
Total number of CrunchyRoll Award Winning anime: 8

Summary of sentiment value: 


count    204.000000
mean       0.133796
std        0.115632
min       -0.233683
25%        0.069420
50%        0.130753
75%        0.188181
max        0.580952
Name: avg_sentiment, dtype: float64

In [11]:
# Missing Data Analysis
print("Number of rows with NaN:", sentiment_anime.isna().any(axis=1).sum())

Number of rows with NaN: 0


# 4. Years of Airing and Studio of Production

*anime_airing:*
- studios: the studio that produced the anime
- start_year: The start of the anime airing
- end_year: The end of the anime airing
- years_aired_str: A list of years the anime aired in as a string

This dataset includes information such as the studio and the years of airing for each anime, which are essential to our question but do not appear in the other datasets. Note that if an anime is currently ongoing, its end year will be 2025.

There are no major concerns regarding this dataset other than merging without a normalized anime name, and this concern has already been addressed in the other dataset merges.
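A quick way to see which titles fail to match after normalization is pandas' `indicator=True` option on `merge`, which adds a `_merge` column marking each row as `both` or `left_only`. The sketch below uses toy frames (`anime_toy` and `airing_toy` are stand-ins for the notebook's `anime` and `airing`).

```python
import pandas as pd

# Toy stand-ins for the notebook's `anime` and `airing` frames.
anime_toy = pd.DataFrame({
    "anime_norm_title": ["cowboy bebop", "trigun", "devilman crybaby"],
})
airing_toy = pd.DataFrame({
    "title english": ["cowboy bebop", "trigun"],
    "studios": ["Sunrise", "Madhouse"],
})

# indicator=True adds a `_merge` column describing where each row was found.
merged = pd.merge(
    anime_toy,
    airing_toy,
    left_on="anime_norm_title",
    right_on="title english",
    how="left",
    indicator=True,
)

# Titles present in the anime frame with no match in the airing frame.
unmatched = merged.loc[merged["_merge"] == "left_only", "anime_norm_title"].tolist()
print(unmatched)
```

Listing the unmatched titles before dropping rows makes it easier to tell normalization mismatches apart from anime that are truly absent from the airing data.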

In [12]:
airing = load_from_drive("1CLg_D114adl9siM4eUH-Aby_8c3PCNgw")

# Drop unused Columns / reformat data
airing = airing.drop(["Title", "Title Japanese", "Source", "Episodes", "Status", "Airing", "Score", "Rank", "Members", "Favorites", "Synopsis", "Themes"], axis=1)
airing.columns = airing.columns.str.lower()
airing["title english"] = airing["title english"].str.lower()
airing = airing.dropna().reset_index(drop=True)
airing["title english"] = airing["title english"].apply(normalize)

airing = airing[airing["type"] == "TV"]
# Create a new column with all years the anime aired
airing['aired from'] = pd.to_datetime(airing['aired from'])
airing['aired to'] = pd.to_datetime(airing['aired to'])
airing['start_year'] = airing['aired from'].dt.year
airing['end_year'] = airing['aired to'].dt.year
airing['years_aired'] = airing.apply(
    lambda row: list(range(row['start_year'], row['end_year'] + 1)),
    axis=1
)
airing['years_aired_str'] = airing['years_aired'].apply(lambda x: ','.join(map(str, x)))

# Drop unused columns
airing = airing.drop(["type", "aired from", "aired to", "years_aired"], axis = 1)

display(airing)

Unnamed: 0,title english,popularity,studios,genres,start_year,end_year,years_aired_str
0,cowboy bebop,43,Sunrise,"Action, Award Winning, Sci-Fi",1998,1999,19981999
1,trigun,259,Madhouse,"Action, Adventure, Sci-Fi",1998,1998,1998
2,witch hunter robin,1925,Sunrise,"Action, Drama, Mystery, Supernatural",2002,2002,2002
3,beet vandel buster,5567,Toei Animation,"Action, Adventure, Fantasy",2004,2005,20042005
4,honey and clover,933,J.C.Staff,"Comedy, Drama, Romance",2005,2005,2005
...,...,...,...,...,...,...,...
5133,tonbo,8862,OLM,"Drama, Sports",2024,2025,20242025
5144,negative positive angler,3133,Nut,Slice of Life,2024,2024,2024
5148,future folktales,14396,"Toei Animation, Manga Productions",Adventure,2024,2025,20242025
5149,lockdown zone lv x,8060,"Imagica Infos, Imageworks Studio","Horror, Mystery",2024,2024,2024


In [13]:
# Merge airing with anime dataset 
anime_airing = pd.merge(
    anime,
    airing,
    left_on="anime_norm_title",
    right_on="title english",
    how="left"
)

# Drop unused columns 
anime_airing = anime_airing.drop(["title english", "title_english", "popularity"], axis=1)

# Drop rows with NaN ONLY if they are NOT Crunchyroll Award Winners
anime_airing = anime_airing[
    (anime_airing["cr_award_winner"] == "Crunchyroll Award Winner") |
    (anime_airing.notna().all(axis=1))
].reset_index(drop=True)

# Reorder the data 
front_cols = ["anime_norm_title"]
end_cols = ["award", "cr_award_winner"]
middle_cols = [col for col in anime_airing.columns 
               if col not in front_cols + end_cols]
anime_airing = anime_airing[front_cols + middle_cols + end_cols]


# Keep only base titles
anime_airing = anime_airing.sort_values(
    by="anime_norm_title",
    key=lambda x: x.str.len()
).reset_index(drop=True)
seen_bases = []
keep_idx = []
for idx, title in anime_airing["anime_norm_title"].items():
    # Keep if no previously kept title is a prefix of this title
    if not any(title.startswith(base + " ") or title == base for base in seen_bases):
        keep_idx.append(idx)
        seen_bases.append(title)
anime_airing = anime_airing.loc[keep_idx].reset_index(drop=True)


display(anime_airing)
display(anime_airing[anime_airing["cr_award_winner"] == "Crunchyroll Award Winner"])
print("Total number of Award Winning anime:", (anime_airing["award"] == "Award Winning").sum())
print("Total number of CrunchyRoll Award Winning anime:", (anime_airing["cr_award_winner"] == "Crunchyroll Award Winner").sum())

Unnamed: 0,anime_norm_title,episodes,favorites,score,studios,genres,start_year,end_year,years_aired_str,award,cr_award_winner
0,x,24.0,772,7.38,Madhouse,"Action, Drama, Fantasy, Mystery",2001.0,2002.0,20012002,Not Award Winning,Not Crunchyroll Winner
1,k,13.0,6863,7.42,GoHands,"Action, Mystery",2012.0,2012.0,2012,Not Award Winning,Not Crunchyroll Winner
2,if,13.0,109,6.11,Gonzo,"Mystery, Supernatural",2017.0,2017.0,2017,Not Award Winning,Not Crunchyroll Winner
3,id,12.0,22,6.50,SANZIGEN,Sci-Fi,2017.0,2017.0,2017,Not Award Winning,Not Crunchyroll Winner
4,mm,12.0,662,7.01,Xebec,"Comedy, Ecchi",2010.0,2010.0,2010,Not Award Winning,Not Crunchyroll Winner
...,...,...,...,...,...,...,...,...,...,...,...
2352,failure frame i became strongest and annihilat...,12.0,803,6.48,Seven Arcs,"Action, Adventure, Fantasy",2024.0,2024.0,2024,Not Award Winning,Not Crunchyroll Winner
2353,i was reincarnated as th prince so i can take ...,12.0,1078,7.42,Tsumugi Akita Animation Lab,"Adventure, Fantasy",2024.0,2024.0,2024,Not Award Winning,Not Crunchyroll Winner
2354,my isekai life i gained a second character cla...,12.0,1416,6.33,Revoroot,"Action, Adventure, Fantasy",2022.0,2022.0,2022,Not Award Winning,Not Crunchyroll Winner
2355,strongest tanks labyrinth raids a tank with a ...,12.0,139,6.12,Studio Polon,"Action, Adventure, Fantasy",2024.0,2024.0,2024,Not Award Winning,Not Crunchyroll Winner


Unnamed: 0,anime_norm_title,episodes,favorites,score,studios,genres,start_year,end_year,years_aired_str,award,cr_award_winner
489,yuri on ice,12.0,21184,7.9,MAPPA,"Award Winning, Sports",2016.0,2016.0,2016.0,Award Winning,Crunchyroll Award Winner
693,solo leveling,12.0,18746,8.25,A-1 Pictures,"Action, Adventure, Fantasy",2024.0,2024.0,2024.0,Not Award Winning,Crunchyroll Award Winner
699,made in abyss,13.0,46530,8.63,Kinema Citrus,"Adventure, Drama, Fantasy, Mystery, Sci-Fi",2017.0,2017.0,2017.0,Not Award Winning,Crunchyroll Award Winner
900,jujutsu kaisen,24.0,93814,8.54,MAPPA,"Action, Supernatural",2023.0,2023.0,2023.0,Award Winning,Crunchyroll Award Winner
993,attack on titan,25.0,181378,8.56,MAPPA,"Action, Drama, Suspense",2022.0,2022.0,2022.0,Award Winning,Crunchyroll Award Winner
1100,devilman crybaby,10.0,25388,7.74,,,,,,Not Award Winning,Crunchyroll Award Winner
1457,cyberpunk edgerunners,10.0,141,,,,,,,Not Award Winning,Crunchyroll Award Winner
1973,demon slayer kimetsu no yaiba,26.0,93059,8.43,ufotable,"Action, Award Winning, Supernatural",2019.0,2019.0,2019.0,Award Winning,Crunchyroll Award Winner


Total number of Award Winning anime: 46
Total number of CrunchyRoll Award Winning anime: 8


In [14]:
# Missing Data Analysis
print("Number of rows with NaN:", anime_airing.isna().any(axis=1).sum())

# Fill in missing Data 
anime_airing.loc[1100] = anime_airing.loc[1100].fillna({
    'studios': 'Science SARU',
    'genres': 'Action, Supernatural',
    'start_year': 2018,
    'end_year': 2018,
    'years_aired_str': '2018',
    'score': 7.74
})

anime_airing.loc[1457] = anime_airing.loc[1457].fillna({
    'studios': 'Studio Trigger',
    'genres': 'Action, Sci-Fi',
    'start_year': 2022,
    'end_year': 2022,
    'years_aired_str': '2022',
    'score': 8.9
})

display(anime_airing[anime_airing["cr_award_winner"] == "Crunchyroll Award Winner"])
print("Number of rows with NaN:", anime_airing.isna().any(axis=1).sum())

Number of rows with NaN: 2


Unnamed: 0,anime_norm_title,episodes,favorites,score,studios,genres,start_year,end_year,years_aired_str,award,cr_award_winner
489,yuri on ice,12.0,21184,7.9,MAPPA,"Award Winning, Sports",2016.0,2016.0,2016,Award Winning,Crunchyroll Award Winner
693,solo leveling,12.0,18746,8.25,A-1 Pictures,"Action, Adventure, Fantasy",2024.0,2024.0,2024,Not Award Winning,Crunchyroll Award Winner
699,made in abyss,13.0,46530,8.63,Kinema Citrus,"Adventure, Drama, Fantasy, Mystery, Sci-Fi",2017.0,2017.0,2017,Not Award Winning,Crunchyroll Award Winner
900,jujutsu kaisen,24.0,93814,8.54,MAPPA,"Action, Supernatural",2023.0,2023.0,2023,Award Winning,Crunchyroll Award Winner
993,attack on titan,25.0,181378,8.56,MAPPA,"Action, Drama, Suspense",2022.0,2022.0,2022,Award Winning,Crunchyroll Award Winner
1100,devilman crybaby,10.0,25388,7.74,Science SARU,"Action, Supernatural",2018.0,2018.0,2018,Not Award Winning,Crunchyroll Award Winner
1457,cyberpunk edgerunners,10.0,141,8.9,Studio Trigger,"Action, Sci-Fi",2022.0,2022.0,2022,Not Award Winning,Crunchyroll Award Winner
1973,demon slayer kimetsu no yaiba,26.0,93059,8.43,ufotable,"Action, Award Winning, Supernatural",2019.0,2019.0,2019,Award Winning,Crunchyroll Award Winner


Number of rows with NaN: 0


In [16]:
from pathlib import Path

# Ensure processed directory exists
processed_path = Path('data/02-processed')
processed_path.mkdir(parents=True, exist_ok=True)

# Save dataframes using the Path object
anime_airing.to_csv(processed_path / "anime_airing.csv", index=False)
sentiment_anime.to_csv(processed_path / "sentiment_anime.csv", index=False)
anime_details.to_csv(processed_path / "anime_details.csv", index=False)
manga_anime.to_csv(processed_path / "manga_anime.csv", index=False)

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> The social media posts we draw data from are all publicly available. Because their authors allow anyone to see and discuss these posts, we assume they also consent to their data being used. We will not take data from private accounts or direct messages between people. However, people could still feel violated because they were not told that their posts would be used as data. More personal data could also get swept up in the process of collecting it.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
> Most of our data, such as anime ratings, come from viewers who voluntarily rate an anime out of 10. We have to consider the possibility that only viewers with a strong opinion, either positive or negative, will voluntarily rate an anime.
 - [ ] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
> Our sentiment analysis records only public opinions. This could leave out many people who do not publicly voice their opinion but still vote for the Anime of the Year award, which could affect the accuracy of our data. Some comments about an anime may also be based on memes or written by people who have not actually watched it, so we cannot account for how serious a post is. Some people also might not directly mention the name of the anime they are referring to in their posts.
 - [ ] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [ ] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
> Because most of our data is based on public sentiment, much of it might not be reproducible. Posts can be deleted or edited, which may also imply that the user no longer consents to those words being used as data. However, other sources of data, such as sales of merchandise, can be replicated.
### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
> The models will be based purely on public reception, the resources, and companies involved in the creation of anime shows. In no way should we group fans and creators based on traits such as gender or ethnicity, even if we are trying to evaluate genres that people like. For example, it could be said that a romance anime would only get popular if the voters were majority female. It can create a close-minded and stereotypical view of these groups. We should instead focus on factors that anime fans as a whole will prefer, such as production quality.
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [ ] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
> Merchants and retailers could abuse this data to hoard stock of merchandise such as figurines, manga volumes, and DVDs, creating scarcity in markets. That scarcity could in turn lead resellers to mark prices up, which could further affect the reproducibility of our data.
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
> A potential issue that we have is that companies could use our data models to gain an unfair advantage against competitors in the industry. They could use this data and its implications to manipulate the odds towards themselves getting the award rather than letting the voting take its course naturally. This would not be fair towards other studios who are doing genuine hard work to have a fighting chance of winning. Similarly, the data could be weaponized to sabotage votes in ways such as using bot accounts to increase negative feedback on websites such as MyAnimeList.

## Team Expectations 

Our team will communicate primarily through Discord for daily updates and GitHub for version control. We will meet weekly via Discord or Zoom, and all members are expected to respond to messages within 24 hours.

We agree to maintain a respectful, clear, and polite tone, encouraging participation from everyone. Decisions will be made through majority agreement, but role leads may decide on smaller or time-sensitive matters when needed.

Tasks will be divided based on interest and strengths, with rotating leads for data wrangling, EDA/visualization, modeling, and writing/editing. Everyone will contribute fairly to each aspect of the project, and progress will be tracked through a shared Google Sheet or GitHub issues.

If conflicts arise, we will address them respectfully and directly, prioritizing understanding over blame. Persistent issues will be escalated to the TA or instructor if needed. Each teammate is responsible for communicating challenges early, contributing equally, and maintaining academic integrity throughout the project.

## Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 10/8 | 6 pm | Read & think about COGS 108 expectations; brainstorm topics/questions | Decide on group communication (Discord); finalize topic; discuss hypothesis and project goals |
| 10/15 | 6 pm | Conduct background research on Crunchyroll Awards and relevant data sources | Discuss key variables (popularity, engagement, production), potential datasets (MAL, Reddit/Twitter APIs, etc.); start writing Project Proposal |
| 10/29 | 6 pm | Finalize project proposal draft, identify datasets | Discuss wrangling strategies, ethical concerns, and assign roles such as data wrangling, modeling, visualization, writing |
| 11/5 | 6 pm | Import and begin cleaning datasets; start basic EDA | Review data wrangling and cleaning; discuss findings and plan improvements for Checkpoint #1 (Data) |
| 11/12 | 6 pm | Finalize data cleaning; conduct full EDA with visualizations | Review and edit EDA; finalize Checkpoint #2: EDA submission; develop analysis and modeling plan |
| 11/24 | 6 pm | Complete final model, generate predictions/insights | Integrate results into paper; review visualizations; finalize writeup and prepare short group video |
| 12/3 | 6 pm | Final edits to report and video | Finalize and revise project / submit |