### Data Exploration of the Anime Dataset

In this notebook, we explore the [Anime Dataset 2023](https://www.kaggle.com/datasets/dbdmobile/myanimelist-dataset?select=users-score-2023.csv) to gain insights and prepare for training a recommendation model. The key objectives are:

- Understand the structure and content of the dataset.
- Experiment with different analyses to identify patterns or trends.
- Assess whether other models can be applied to this dataset.
  
The final goal is to use these insights to train a recommendation model, which will be done in the [Model Notebook](./model.ipynb).


In [1]:
# download the data from the kaggle api
!python download.py

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

# Dataset Description

This dataset provides extensive information on anime series and is sourced from the popular website MyAnimeList. It includes a wide variety of features related to anime titles, such as rankings, user ratings, genres, and other metadata that are essential for exploring trends and analyzing patterns in anime consumption.

For more details and access to the dataset, please visit the following link: [MyAnimeList Dataset on Kaggle](https://www.kaggle.com/datasets/dbdmobile/myanimelist-dataset).

Please consider supporting the original author by acknowledging their work and adhering to any usage guidelines outlined in the dataset's licensing agreement.


## Content

### anime-dataset-2023.csv

- **anime_id**: Unique ID for each anime.
- **Name**: The original title of the anime, romanized (e.g., "Cowboy Bebop: Tengoku no Tobira").
- **English name**: The English name of the anime.
- **Other name**: Native name or title of the anime (can be in Japanese, Chinese, or Korean).
- **Score**: The score or rating given to the anime.
- **Genres**: The genres of the anime, separated by commas.
- **Synopsis**: A brief description or summary of the anime's plot.
- **Type**: The type of the anime (e.g., TV series, movie, OVA, etc.).
- **Episodes**: The number of episodes in the anime.
- **Aired**: The dates when the anime was aired.
- **Premiered**: The season and year when the anime premiered.
- **Status**: The status of the anime (e.g., Finished Airing, Currently Airing, etc.).
- **Producers**: The production companies or producers of the anime.
- **Licensors**: The licensors of the anime (e.g., streaming platforms).
- **Studios**: The animation studios that worked on the anime.
- **Source**: The source material of the anime (e.g., manga, light novel, original).
- **Duration**: The duration of each episode.
- **Rating**: The age rating of the anime.
- **Rank**: The rank of the anime by score (distinct from the popularity rank below).
- **Popularity**: The popularity rank of the anime.
- **Favorites**: The number of times the anime was marked as a favorite by users.
- **Scored By**: The number of users who scored the anime.
- **Members**: The number of members who have added the anime to their list on the platform.
- **Image URL**: The URL of the anime's image or poster.

### users-details-2023.csv

- **Mal ID**: Unique ID for each user.
- **Username**: The username of the user.
- **Gender**: The gender of the user.
- **Birthday**: The birthday of the user (in ISO format).
- **Location**: The location or country of the user.
- **Joined**: The date when the user joined the platform (in ISO format).
- **Days Watched**: The total number of days the user has spent watching anime.
- **Mean Score**: The average score given by the user to the anime they have watched.
- **Watching**: The number of anime currently being watched by the user.
- **Completed**: The number of anime completed by the user.
- **On Hold**: The number of anime on hold by the user.
- **Dropped**: The number of anime dropped by the user.
- **Plan to Watch**: The number of anime the user plans to watch in the future.
- **Total Entries**: The total number of anime entries in the user's list.
- **Rewatched**: The number of anime rewatched by the user.
- **Episodes Watched**: The total number of episodes watched by the user.

### users-score-2023.csv

- **user_id**: Unique ID for each user.
- **Username**: The username of the user.
- **anime_id**: Unique ID for each anime.
- **Anime Title**: The title of the anime.
- **rating**: The rating given by the user to the anime.

### General questions

1) What do the demographics of the anime community look like?
2) How is gender distributed in the dataset?
3) Which genres are the most popular?
4) Does the number of episodes influence whether an anime is dropped?
5) Which country has the highest user engagement?
6) How do anime preferences differ by location? Are there specific genres or titles that are more popular in certain countries or regions?

In [3]:
df_anime  = pd.read_csv("./data/anime-dataset-2023.csv")
df_anime.head(3)

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Studios,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,"Apr 3, 1998 to Apr 24, 1999",...,Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,"Sep 1, 2001",...,Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978,https://cdn.myanimelist.net/images/anime/1439/...
2,6,Trigun,Trigun,トライガン,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,"Apr 1, 1998 to Sep 30, 1998",...,Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,https://cdn.myanimelist.net/images/anime/7/203...
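
Because `Genres` stores several genres in one comma-separated string, counting genre popularity first needs one row per (anime, genre) pair. The following is a minimal sketch, assuming a toy frame that copies three columns from the rows shown above rather than the full `df_anime`. The names `toy_anime`, `genres_long`, `genre_counts`, and `genre_members` are illustrative and not part of the dataset.

```python
import pandas as pd

# Toy frame mirroring the three rows shown above (not the full dataset).
toy_anime = pd.DataFrame(
    {
        "anime_id": [1, 5, 6],
        "Genres": [
            "Action, Award Winning, Sci-Fi",
            "Action, Sci-Fi",
            "Action, Adventure, Sci-Fi",
        ],
        "Members": [1771505, 360978, 727252],
    }
)

# Split the comma-separated string into a list, then give each genre its own row.
genres_long = toy_anime.assign(
    Genre=toy_anime["Genres"].str.split(", ")
).explode("Genre")

# One notion of popularity: number of titles per genre.
genre_counts = genres_long["Genre"].value_counts()

# Another notion: total list members across all titles in a genre.
genre_members = (
    genres_long.groupby("Genre")["Members"].sum().sort_values(ascending=False)
)

print(genre_counts)
print(genre_members)
```

The same steps should carry over to `df_anime` directly. Whether "popular" means the number of titles or the number of members is a judgment call, which is why the sketch computes both.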


In [5]:
df_user_details = pd.read_csv("./data/users-details-2023.csv")
df_user_details.head(3)

Unnamed: 0,Mal ID,Username,Gender,Birthday,Location,Joined,Days Watched,Mean Score,Watching,Completed,On Hold,Dropped,Plan to Watch,Total Entries,Rewatched,Episodes Watched
0,1,Xinil,Male,1985-03-04T00:00:00+00:00,California,2004-11-05T00:00:00+00:00,142.3,7.37,1.0,233.0,8.0,93.0,64.0,399.0,60.0,8458.0
1,3,Aokaado,Male,,"Oslo, Norway",2004-11-11T00:00:00+00:00,68.6,7.34,23.0,137.0,99.0,44.0,40.0,343.0,15.0,4072.0
2,4,Crystal,Female,,"Melbourne, Australia",2004-11-13T00:00:00+00:00,212.8,6.68,16.0,636.0,303.0,0.0,45.0,1000.0,10.0,12781.0
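
The output above already shows missing values in `Birthday`, so any demographic summary has to decide how to treat gaps. The following is a rough sketch of a gender share and an age-at-joining calculation. It assumes a toy frame whose first three rows copy the output above; the fourth row is invented to show a missing `Gender`. The names `toy_users`, `gender_share`, and `age_at_join` are illustrative only.

```python
import pandas as pd

# First three rows mirror the output above; the fourth row is hypothetical.
toy_users = pd.DataFrame(
    {
        "Mal ID": [1, 3, 4, 7],
        "Gender": ["Male", "Male", "Female", None],
        "Birthday": ["1985-03-04T00:00:00+00:00", None, None, None],
        "Joined": [
            "2004-11-05T00:00:00+00:00",
            "2004-11-11T00:00:00+00:00",
            "2004-11-13T00:00:00+00:00",
            "2004-11-20T00:00:00+00:00",
        ],
    }
)

# Share of each gender, keeping missing values visible as "Unknown".
gender_share = toy_users["Gender"].fillna("Unknown").value_counts(normalize=True)

# Parse the ISO timestamps; missing birthdays become NaT.
birthday = pd.to_datetime(toy_users["Birthday"], utc=True)
joined = pd.to_datetime(toy_users["Joined"], utc=True)

# Approximate age in years at the time the user joined (NaN where unknown).
age_at_join = (joined - birthday).dt.days / 365.25

print(gender_share)
print(age_at_join)
```

On the real data, `gender_share.plot(kind="bar")` would give a quick visual check. Reporting the "Unknown" share alongside the others guards against reading too much into a column that may be mostly empty.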


In [6]:
df_user_score = pd.read_csv("./data/users-score-2023.csv")
df_user_score.head(3)

Unnamed: 0,user_id,Username,anime_id,Anime Title,rating
0,1,Xinil,21,One Piece,9
1,1,Xinil,48,.hack//Sign,7
2,1,Xinil,320,A Kite,5


### Recommender-specific questions

1) How many ratings has each user given, and how is that count distributed?
2) What average rating does each user give?
3) Which genres do users rate the highest on average?
4) Are there any biases in the ratings? For example, do users tend to give higher ratings to more popular or well-known anime?
5) How sparse is the user-item matrix?
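
The per-user questions and the sparsity question can be sketched with a few group-by operations. The example below assumes a toy score table: the first three rows copy the `users-score-2023.csv` output above, and the remaining rows are invented for illustration. Sparsity is taken here as one minus the share of observed (user, anime) pairs among all possible pairs, which is one common definition. The names `toy_scores`, `per_user`, `density`, and `sparsity` are illustrative only.

```python
import pandas as pd

# First three rows mirror the output above; the rest are hypothetical.
toy_scores = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 3],
        "anime_id": [21, 48, 320, 21, 48, 21],
        "rating": [9, 7, 5, 8, 6, 10],
    }
)

# Number of ratings and mean rating for each user.
per_user = toy_scores.groupby("user_id")["rating"].agg(
    n_ratings="count", mean_rating="mean"
)

# Size of the full user-item matrix.
n_users = toy_scores["user_id"].nunique()
n_items = toy_scores["anime_id"].nunique()

# Observed cells, guarding against duplicate (user, anime) pairs.
n_observed = len(toy_scores.drop_duplicates(["user_id", "anime_id"]))

density = n_observed / (n_users * n_items)
sparsity = 1.0 - density

print(per_user)
print(f"users={n_users}, items={n_items}, sparsity={sparsity:.3f}")
```

Applied to `df_user_score`, `per_user["n_ratings"]` is the series whose histogram answers the first question, and `per_user["mean_rating"]` answers the second. Computing sparsity from counts avoids materializing the full matrix, which may not fit in memory for a large score file.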