# COGS 108 - Project Proposal

## Authors

- Jordan Chen: Writing - original draft, Writing - review & editing
- Koji Nakazawa: Conceptualization, Methodology, Software
- Andrew Hoang: Background Research, Visualization
- Amandine Isidro: Data curation, Experimental investigation
- Audrey La Guardia: Analysis, Project Administration

## Research Question

The goal of this project is to investigate whether the winner of Crunchyroll's Anime of the Year award can be predicted from measurable popularity, engagement, and production-related variables, in the hope of uncovering relationships between fan engagement and media production within the anime industry. Specifically, we aim to build a predictive model that uses user ratings (from Crunchyroll and MyAnimeList), social media engagement (sentiment analysis of Twitter and Reddit data), hype indicators (trailer views, manga popularity, and merchandise sales), and production characteristics (studio reputation, budget, and seasonal release timing) to predict a binary target variable (win/not win). How do factors like ratings, engagement, hype, production, and release timing affect an anime's chance of winning Crunchyroll's Anime of the Year? How accurately can we predict the next winner?
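
Since Anime of the Year goes to a single title per ceremony, one way to make "How accurately can we predict the next winner?" measurable is to pick the highest-scored candidate in each year and count how often that pick matches the actual winner. The sketch below is illustrative only: the function names and the placeholder years, titles, and scores are assumptions made for the example, not project data or a committed design.

```python
def predict_winner_per_year(scored):
    """Return {year: title}, choosing the highest-scored candidate in each year.

    `scored` is an iterable of (year, title, score) tuples, where `score`
    is any model output in which higher means "more likely to win".
    """
    best = {}
    for year, title, score in scored:
        if year not in best or score > best[year][1]:
            best[year] = (title, score)
    return {year: title for year, (title, _) in best.items()}


def top1_accuracy(predicted, actual):
    """Fraction of years in `actual` whose winner was predicted correctly."""
    if not actual:
        raise ValueError("actual must contain at least one year")
    hits = sum(1 for year, winner in actual.items() if predicted.get(year) == winner)
    return hits / len(actual)


# Placeholder values only (hypothetical years, titles, and scores).
example_scores = [
    (2001, "Show A", 0.70), (2001, "Show B", 0.20),
    (2002, "Show C", 0.40), (2002, "Show D", 0.55),
]
example_actual = {2001: "Show A", 2002: "Show C"}
example_predicted = predict_winner_per_year(example_scores)
print(top1_accuracy(example_predicted, example_actual))  # 0.5
```

Scoring per year rather than per row reflects that the binary target is heavily imbalanced: nearly every anime is a non-winner, so plain row-level accuracy would look high even for a model that never predicts a win.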

## Background and Prior Work

The Japanese language borrowed the English word *animation* as the loanword *animēshon*, which was later shortened to *anime* and borrowed back into English as *anime*.<a id="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Whereas *animation* originally means the technique of showing successive drawings, or pictures of puppets in different positions, to create the illusion of movement, *anime* in English specifically describes animation in the Japanese style, which is often characterized by large eyes, a sharp chin, a pointy nose, and colorful hair.

The Crunchyroll Anime Awards is an annual awards ceremony organized by Crunchyroll, one of the world's largest anime streaming platforms. It recognizes the hard work of animators, producers, and other contributors, covering both fan favorites and critically acclaimed works.<a id="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) The ceremony was first announced in December 2016, with the winners presented in January 2017. The awards span many categories, including Anime of the Year, best opening or ending song, best voice actor in several languages, best animation, and global impact. The process begins with a panel of industry experts selecting nominees for each category, followed by a fan voting stage. For certain categories, expert judges may also evaluate submissions to ensure technical merit is considered. Finally, the winners are announced early in the year, typically between February and March.

The SC1015 “AniFame” project looks at whether an anime’s success can be predicted from MyAnimeList (MAL) data. Its authors built a full pipeline that scraped MAL, cleaned the data, did EDA, engineered features, and then trained models for both rating prediction and a simple “success” classification.<a id="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) They tried models like Ridge Regression and Random Forest, and documented which features mattered most (things like genre, studio, episode count, and popularity scores). This is directly useful for us because it shows that public engagement and metadata can predict popularity-type outcomes. We plan to build on that idea by going beyond MAL: adding social media engagement and sentiment (Twitter/X and Reddit), hype signals (trailer views, Google Trends, merch proxies), and production details, and shifting the goal to predicting whether an anime wins Crunchyroll’s Anime of the Year.

1. <a id="cite_note-1"></a> [^](#cite_ref-1) Oxford University Press. (n.d.). *Anime, n.³.* In *Oxford English Dictionary.* https://doi.org/10.1093/OED/4346804268

2. <a id="cite_note-2"></a> [^](#cite_ref-2) Wikipedia. (2025). Crunchyroll Anime Awards. https://en.wikipedia.org/wiki/Crunchyroll_Anime_Awards

3. <a id="cite_note-3"></a> [^](#cite_ref-3) Hua, J. (2024). SC1015‑Project: Predict the success of an anime using data science and machine learning (regression + classification). GitHub. https://github.com/ztjhz/SC1015-Project


## Hypothesis


We aim to predict the next Crunchyroll Anime of the Year winner by building a data-driven model using variables such as fan ratings, production studio identifiers, and pre-adaptation manga performance.

To make these variables analyzable, we will operationalize each of them. Fan ratings will be measured through aggregated scores from platforms such as MyAnimeList and AniList. Production studio identifiers will consist of the studio's name, the production budget, and the season in which the anime is released. Lastly, pre-adaptation manga performance will be represented by merchandise popularity and the popularity of the original manga. Merchandise popularity will be represented by sales rankings and availability data from major retailers. Manga popularity will be measured using circulation numbers prior to the anime’s release.
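
Aggregating fan ratings across platforms requires putting their scores on a common scale first. The sketch below shows one simple option: rescale each platform's score to the 0-1 range and average whatever is available. The function name, the platform keys, and the score ranges in `ASSUMED_SCALES` are assumptions for illustration and should be checked against each platform's actual data before use.

```python
# Assumed score ranges per platform (hypothetical; verify against real data).
ASSUMED_SCALES = {"mal": (1.0, 10.0), "anilist": (0.0, 100.0)}


def combined_rating(scores, scales=ASSUMED_SCALES):
    """Average the available platform scores after rescaling each to 0-1.

    scores: {platform: score or None}; platforms with None are skipped.
    Returns None when no platform has a score.
    """
    rescaled = []
    for platform, score in scores.items():
        if score is None:
            continue
        low, high = scales[platform]
        rescaled.append((score - low) / (high - low))
    if not rescaled:
        return None
    return sum(rescaled) / len(rescaled)


print(combined_rating({"mal": 5.5, "anilist": 50.0}))  # 0.5
```

An unweighted average treats every platform equally; weighting by the number of ratings on each platform would be an alternative if that information is available.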

Our rationale is that pre-adaptation manga performance will be a strong indicator of the anime's baseline performance. However, the studio, budget, and production timeline will affect the animation quality. Good animation quality should enhance the anime's performance, while poor animation quality should hinder it. For example, Blue Lock was a highly anticipated anime because of its manga's popularity, but a tight production timeline and budget left the animation quality lackluster, which directly hindered the anime's popularity despite the high anticipation.

We therefore hypothesize that the likelihood of an anime winning Anime of the Year will be most strongly associated with the interaction between high pre-adaptation manga popularity and favorable production conditions (e.g., strong studio track record, adequate budget, and optimal release timing), which together produce higher fan ratings and stronger overall audience response.
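
One possible way to write this hypothesis as a model, offered only as a sketch, is a logistic regression with an interaction term. The symbols are our own illustrative notation: $M_i$ stands for the pre-adaptation manga popularity of anime $i$, and $C_i$ for a composite measure of its production conditions. Neither the logistic form nor the composite is fixed by this proposal.

$$
\log \frac{P(\text{win}_i = 1)}{1 - P(\text{win}_i = 1)} = \beta_0 + \beta_1 M_i + \beta_2 C_i + \beta_3 \, (M_i \times C_i)
$$

Under this formulation, the hypothesis corresponds to a positive interaction coefficient, $\beta_3 > 0$: manga popularity raises the odds of winning more when production conditions are favorable than when they are not.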

## Data

The ideal dataset to answer this question would include every anime that is in consideration for Anime of the Year, with one observation per anime. Each observation would need to record whether that anime was awarded Crunchyroll's Anime of the Year, its genre, release date, total viewership, ratings, online consensus, studio ratings, and manga popularity. This data would ideally be provided by the specific studio and platform that releases the anime, as they would have access to the most accurate, up-to-date statistics on these values. The data would be stored in a large CSV file with one row per anime and one column per feature.

The real datasets available to us include the Wikipedia page for the Crunchyroll Anime Award for Anime of the Year, sourced from https://en.wikipedia.org/wiki/Crunchyroll_Anime_Award_for_Anime_of_the_Year. This dataset provides the title of every past winner from 2016 to 2024, the year the award was received, and the animation studio for each anime mentioned. While it lacks the remaining features from our ideal dataset that are necessary to answer our question, it can be used in tandem with a more robust dataset that is less specific to Anime of the Year. For this reason, we will be using a Kaggle dataset of the top ten thousand anime, sourced from https://www.kaggle.com/datasets/wiltheman/anime-data-set-for-ml/data, which will provide the remaining features for anime that did and did not win, including popularity, ratings, and episode count, along with other viewership metrics. Unlike the ideal dataset, which would include every anime in consideration for the award, this dataset covers only the top ten thousand. However, because it includes all of the award-winning anime, it lays a solid foundation for distinguishing award-winning anime from non-winners, and we expect the effect of this size limit to be negligible.
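
Using the two sources together means marking which rows of the Kaggle catalog correspond to the winners listed on Wikipedia, and the same title may be spelled or punctuated differently in each source. Below is a minimal sketch of that labeling step, assuming each catalog row is a dictionary with a `title` field. The field name, the function names, and the placeholder titles are hypothetical; the actual column names in the Kaggle file may differ.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents, and replace punctuation runs with one space."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[\W_]+", " ", text.lower())
    return text.strip()


def label_winners(catalog_rows, winner_titles, title_key="title"):
    """Add a binary `won` field to each row; also report unmatched winners.

    Returns (labeled_rows, unmatched), where `unmatched` lists the normalized
    winner titles that were not found in the catalog and need manual review.
    """
    winners = {normalize_title(t) for t in winner_titles}
    labeled = []
    matched = set()
    for row in catalog_rows:
        key = normalize_title(row[title_key])
        new_row = dict(row)  # copy so the input rows are left unchanged
        new_row["won"] = int(key in winners)
        if new_row["won"]:
            matched.add(key)
        labeled.append(new_row)
    return labeled, sorted(winners - matched)


# Placeholder titles only (not real catalog entries).
rows, missing = label_winners(
    [{"title": "Show A!"}, {"title": "show  b"}],
    ["Show A", "Show Z"],
)
print([r["won"] for r in rows], missing)  # [1, 0] ['show z']
```

Returning the unmatched winners makes it easy to verify the claim that every past winner appears in the catalog, and to catch titles that only match after manual correction (for example, a sequel season listed under a different name).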

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> The social media data we collect will come only from publicly available posts. Because these users allow others to see and discuss their posts, we assume that they also consent to their posts being used as data. We will not collect data from private accounts or direct messages. However, people could still feel violated because they were not told that their posts would be used as data. Other personal data could also get swept up in the process of collecting posts.

 - [ ] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 - [ ] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
> Our sentiment analysis records only public opinions. This leaves out the many people who vote for the Anime of the Year award but do not voice their opinions publicly, which could affect the accuracy of our data. Some comments about an anime may be memes or may come from people who have not actually watched it, so we also cannot account for the seriousness of posts. In addition, some people may not directly mention the name of the anime they are referring to in their posts.
 - [ ] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [ ] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
> Because most of our data is based on public sentiment, much of it may not be reproducible. Posts can be deleted or edited, which also suggests that the user may not consent to those words being used as data again. However, other sources of data, such as merchandise sales, can be replicated.
### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
> The models will be based purely on public reception and on the resources and companies involved in creating anime shows. In no way should we group fans and creators based on traits such as gender or ethnicity, even if we are trying to evaluate the genres that people like. For example, it could be said that a romance anime would only become popular if the voters were majority female. Such assumptions can create a close-minded and stereotypical view of these groups. We should instead focus on factors that anime fans as a whole will prefer, such as production quality.
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [ ] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
> Merchants and retailers could abuse this data to hoard stock of merchandise such as figurines, manga volumes, and DVDs, creating scarcity in the market. That scarcity could also lead resellers to mark prices up, which could further affect the reproducibility of our data.
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
> A potential issue is that companies could use our data and models to gain an unfair advantage over competitors in the industry. They could use the findings to tilt the odds of winning the award in their own favor rather than letting the voting take its course naturally. This would not be fair to other studios that are doing genuine hard work to have a fighting chance of winning. Similarly, the data could be weaponized to sabotage votes, for example by using bot accounts to increase negative feedback on websites such as MyAnimeList.

## Team Expectations 

Our team will communicate primarily through Discord for daily updates and GitHub for version control. We will meet weekly via Discord or Zoom, and all members are expected to respond to messages within 24 hours.

We agree to maintain a respectful, clear, and polite tone, encouraging participation from everyone. Decisions will be made through majority agreement, but role leads may decide on smaller or time-sensitive matters when needed.

Tasks will be divided based on interest and strengths, with rotating leads for data wrangling, EDA/visualization, modeling, and writing/editing. Everyone will contribute fairly to each aspect of the project, and progress will be tracked through a shared Google Sheet or GitHub issues.

If conflicts arise, we will address them respectfully and directly, prioritizing understanding over blame. Persistent issues will be escalated to the TA or instructor if needed. Each teammate is responsible for communicating challenges early, contributing equally, and maintaining academic integrity throughout the project.

## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 10/8  |  6 pm | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Decide on group communication (Discord), finalize topic; discuss hypothesis and project goals | 
| 10/15  |  6 pm |  Conduct background research on Crunchyroll Awards and relevant data sources | Discuss key variables (popularity, engagement, production), potential datasets (MAL, Reddit/Twitter APIs, etc.); start writing Project Proposal | 
| 10/29  | 6 pm  | Finalize project proposal draft, identify datasets | Discuss wrangling strategies, ethical concerns, and assign roles such as data wrangling, modeling, visualization, writing|
| 11/5  | 6 pm  | Import and begin cleaning datasets; start basic EDA | Review data wrangling and cleaning; discuss findings and plan improvements for Checkpoint #1 (Data)  |
| 11/12  | 6 pm  | Finalize data cleaning; conduct full EDA with visualizations | Review and edit EDA; finalize Checkpoint #2: EDA submission; develop analysis and modeling plan |
| 11/24  | 6 pm  | Complete final model, generate predictions/insights |  Integrate results into paper; review visualizations; finalize writeup and prepare short group video |
| 12/3  | 6 pm  | Final edits to report and video | Finalize and revise project / submit |