# Data and Analysis Plan: Anime Recommendation System
## Team - 67

- Marcos Equiza Gasco (equizagasco.m@northeastern.edu)
- Kelsey Nihezagirwe (nihezagirwe.k@northeastern.edu)
- Olivia Mintz (mintz.o@northeastern.edu)
- Nathan Brito (brito.n@northeastern.edu)


## Project Goal:
Our project will use lists of anime titles from the streaming platform [Crunchyroll](https://crunchyroll.com/) to recommend a new anime to users who input some anime that they already enjoy. Additionally, it will have a second feature that recommends a title based on some features (genre, year, duration, etc.) that the user inputs.

## Data 

### Overview 
We will use a [list of all anime titles available on Crunchyroll](https://www.crunchyroll.com/videos/popular). A dataset found on [Kaggle](https://www.kaggle.com/datasets/victorsoeiro/crunchyroll-animes-and-movies?select=titles.csv) compiles the information for every title.

<img src="Crunchyroll.png" width=500>

The data set we obtained from Kaggle contains the following details:
- id
- title
- type
- description
- release year
- age certification
- runtime
- genres
- production countries
- seasons
- imdb id
- imdb_score
- imdb_votes
- tmdb_popularity
- tmdb_score

Throughout this project we plan on using and analyzing the following columns (features) for our model:
- title
- release_year 
- age_certification
- genres
- seasons
- avg_score (computed score from imdb_score and tmdb_score)

### Pipeline Overview

Because our data has already been compiled into a data set, little pre-processing is needed. We will load all of the raw data into a data frame `df_anime`. As mentioned above, a new column, `avg_score`, will be added to this raw data frame; it is the average of the IMDb and TMDb scores.

After this is computed, a new data frame, `df_anime_feat`, will be compiled with only the features needed for our model (listed above); all other columns are dropped. All rows with missing data will be discarded. Although this step significantly reduces the size of the data from 1081 to 639 rows (checked with `df_anime_feat.shape` before and after discarding rows), it is necessary for our model to function. Additionally, to make the classification of genres easier, a separate `bool` column will be made for each specific genre, named after that genre, indicating whether a title is associated with that genre or not. The process will be the following:
- Clean up the `genres` column: it is a `str` that represents a list of the genres.
    - Make a `genres_final` list that contains all the unique genres in the data set.
- Create a new column for each genre in `genres_final`, and indicate with a `bool` whether the given anime is categorized as that genre or not.

To make each feature comparable, another data frame will be made, `df_anime_feat_std`, which will contain standardized values for the quantitative features (this will not include `title` or `age_certification`). This will help make our model more effective down the line.
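
The standardization step has not been implemented yet, so the following is only a minimal sketch of one way it could look. It uses a small made-up data frame (`df_toy`, a hypothetical stand-in for `df_anime_feat`) and assumes a plain z-score (subtract the column mean, divide by the column standard deviation) is what we want.

```python
import pandas as pd

# toy stand-in for df_anime_feat (values are made up for illustration)
df_toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'age_certification': ['TV-14', 'TV-MA', 'TV-14', 'TV-PG'],
    'release_year': [2005, 2012, 2018, 2021],
    'seasons': [1.0, 2.0, 4.0, 1.0],
    'avg_score': [6.5, 7.2, 8.1, 7.8],
})

# quantitative columns to standardize; title and age_certification are left as-is
num_cols = ['release_year', 'seasons', 'avg_score']

df_toy_std = df_toy.copy()
# z-score: subtract the column mean, then divide by the column standard deviation
df_toy_std[num_cols] = (df_toy[num_cols] - df_toy[num_cols].mean()) / df_toy[num_cols].std()

df_toy_std
```

After this step each quantitative column has mean 0 and standard deviation 1, so no single feature (for example `release_year`, with values in the thousands) dominates a distance computation.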



In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import numpy as np

In [None]:
# read the csv into a data frame of anime titles
df_anime = pd.read_csv('titles.csv', index_col=None)
df_anime.head()

In [None]:
# make new column with average score
df_anime['avg_score'] = (df_anime['imdb_score'] + df_anime['tmdb_score']) / 2

df_anime.head()

In [None]:
# list of features of interest for model
feat_list = ['title', 'release_year', 'age_certification', 'genres', 'seasons', 'avg_score']

# make new data frame with only features we are interested in
df_anime_feat = df_anime[feat_list].copy()

# inspect size
df_anime_feat.shape

In [None]:
df_anime_feat.head()

In [None]:
# discard any rows missing data
df_anime_feat.dropna(axis=0, inplace=True)

# inspect size
df_anime_feat.shape

In [None]:
df_anime_feat.head()

In [None]:
# the 'genres' column stores each list as a str, e.g. "['action', 'scifi']",
# so parse every str into an actual list of genres
import ast

genres = df_anime_feat.loc[:, 'genres']
genres_clean = [ast.literal_eval(anime) for anime in genres]

# display
genres_clean[0:5]

In [None]:
# create a list with all the unique genres in the data set
genres_final = []
for anime in genres_clean:
    for genre in anime:
        if genre not in genres_final:
            genres_final.append(genre)
            
# display
genres_final

In [None]:
# update the data frame with the clean genres column
df_anime_feat['genres'] = genres_clean
df_anime_feat.head()

In [None]:
# make new columns for each genre and update with bool value
for genre in genres_final:
    df_anime_feat[genre] = (genre in df_anime_feat['genres'])
    
df_anime_feat

### Issue

In the cell above, we tried to make an individual column for each genre that holds `True` if the genre is associated with the anime title. We could not figure out what we were doing wrong, as every value came out `False`. The cell below replicates the expected behavior for a single title, and we will try to resolve this issue as we progress in the project.

The next step was to standardize the data frame, but since we were unable to make the genre columns successfully, we are not able to do so yet. We will complete this step once we are able to make the appropriate columns.

In [None]:
genres_final[1] in df_anime_feat['genres'][0]

#'scifi' in df_anime_feat.loc[0, 'genres']
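
One likely explanation, offered as a sketch and not as a final fix: `genre in df_anime_feat['genres']` applies Python's `in` to the whole Series, which checks the Series' index labels instead of each row's list. It therefore produces a single `False` that is copied to every row. The example below uses a made-up data frame (`df_toy` and `genres_toy` are hypothetical names) and tests membership row by row with `apply`.

```python
import pandas as pd

# toy stand-in for df_anime_feat after 'genres' has been turned into lists
df_toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': [['action', 'scifi'], ['comedy'], ['action', 'comedy', 'drama']],
})

# unique genres in order of first appearance (same idea as genres_final)
genres_toy = []
for genre_list in df_toy['genres']:
    for genre in genre_list:
        if genre not in genres_toy:
            genres_toy.append(genre)

# check each row's own list, instead of applying `in` to the whole Series
for genre in genres_toy:
    df_toy[genre] = df_toy['genres'].apply(lambda row_genres, g=genre: g in row_genres)

df_toy
```

The single-title check in the cell above works because it indexes one row first, so `in` is applied to a list and not to the Series.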

### Visualizations (sanity check / data exploration)

The first graph is an interactive visualization that shows all of the information our model needs for a specific anime title. Each dot represents an anime title, positioned by the year it was released and the average score it has received, and colored by its age certification category. When the user hovers over any dot, they can see that information along with the anime's title, the genres associated with it, and the number of seasons it has. Although the points are too clustered to analyze how `age_certification` is spread throughout our dataset, we can see that most of the anime were released within the last decade. Furthermore, we can see that, on average, these anime are scored above 7. It seems as though more recent anime are well-liked among viewers. It also appears that the majority of the anime observed in the dataset have an age certification of TV-14.


The second visualization is a histogram that compares the scores given to the anime titles in the data set by two different websites, IMDb and TMDb. Although a computed `avg_score` is used in our model and analysis, this graph shows how the two websites compare to each other. It would probably be beneficial to have scores from more websites to compute a better `avg_score`, but we thought this comparison would be interesting to visualize.

In [None]:
feat0 = 'release_year'
feat1 = 'avg_score'

sns.set(font_scale=1.2)

px.scatter(data_frame=df_anime_feat, x=feat0, y=feat1, color='age_certification', hover_data=['title', 'genres', 'seasons'])

In [None]:
# get the column with IMDb scores
imdb_score = df_anime.loc[:, "imdb_score"]

# get the column with TMDb scores
tmdb_score = df_anime.loc[:, "tmdb_score"]

sns.set(font_scale = 1.2)
# 11 edges give 10 unit-wide bins that line up with the integer ticks
bins = np.linspace(0, 10, 11)

# make a histogram
plt.hist(imdb_score, color = "r", alpha = 0.9, label = "IMDb", bins = bins)
plt.hist(tmdb_score, color = "g", alpha = 0.5, label = "TMDb", bins = bins)
plt.xlabel("rating")
plt.ylabel("count")
plt.xticks(range(11))
plt.title("Histogram of Anime Ratings on IMDb vs TMDb")
plt.legend()


### Analysis Plan
Machine Learning Tools: *scikit-learn* (k-nearest neighbors and decision tree).

The user inputs a list of titles they’ve watched and enjoyed, and based on the features of these shows (`release_year`, `age_certification`, etc.), the model predicts a list of anime that the user might like. We think that **k-NN** is the best fit because we can pass in the features of the user's anime and get back the titles that are most similar to what the user has watched. We assume that the more similar an anime is to the list the user has inputted, the better the recommendation. This approach will likely also work for the second feature we want to include, which recommends anime based on a list of features the user inputs: the titles that most resemble that list of features would be the recommendations.
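
To make the idea concrete, here is a minimal sketch under several assumptions: the features are already standardized, genres are already indicator columns, and a user's taste is summarized as the mean feature vector of the titles they list. The data frame `df_toy`, the helper `recommend`, and all values are made up for illustration; only `NearestNeighbors` comes from scikit-learn.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# toy, already-standardized feature table (made-up values)
df_toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'release_year': [-1.2, -0.4, 0.3, 0.6, 0.7],
    'seasons': [0.5, -0.5, 1.5, -0.7, -0.8],
    'avg_score': [-1.0, 0.2, 0.9, 0.1, -0.2],
    'action': [1, 0, 1, 0, 0],
    'comedy': [0, 1, 0, 1, 1],
})
feat_cols = ['release_year', 'seasons', 'avg_score', 'action', 'comedy']

def recommend(liked_titles, df, feat_cols, k=2):
    """Return the k titles closest to the mean feature vector of liked_titles."""
    liked = df['title'].isin(liked_titles)
    # summarize the user's taste as the mean of the liked titles' features
    query = df.loc[liked, feat_cols].mean().to_numpy().reshape(1, -1)
    # only search among titles the user has not already listed
    candidates = df.loc[~liked].reset_index(drop=True)
    knn = NearestNeighbors(n_neighbors=min(k, len(candidates)))
    knn.fit(candidates[feat_cols].to_numpy())
    _, idx = knn.kneighbors(query)
    return candidates.loc[idx[0], 'title'].tolist()

recs = recommend(['D'], df_toy, feat_cols, k=2)
print(recs)
```

For the feature-based recommender, the same model could be queried with a vector built directly from the features the user enters, in place of the mean of liked titles.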

A **decision tree** can also be used to learn more about how the genres are distributed, since every anime has at least 3 or 4 genres associated with it. Using a decision tree can help us understand how genres are clustered, and make better predictions for titles by "optimizing" the combinations of genres.
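
We have not yet settled on what the tree should predict, so the following is only one possible reading, sketched on made-up data: fit a shallow `DecisionTreeRegressor` on genre indicator columns with `avg_score` as the target (an assumption on our part), then print the learned splits to see which genre combinations separate the titles. The names `df_toy` and `genre_cols` are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# toy genre indicators with a made-up avg_score target
df_toy = pd.DataFrame({
    'action': [1, 1, 0, 0, 1, 0],
    'comedy': [0, 1, 1, 0, 0, 1],
    'drama': [1, 0, 0, 1, 1, 0],
    'avg_score': [8.0, 7.0, 6.0, 7.5, 8.2, 6.1],
})
genre_cols = ['action', 'comedy', 'drama']

# a shallow tree keeps the genre splits easy to read
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(df_toy[genre_cols].to_numpy(), df_toy['avg_score'].to_numpy())

# text view of the splits, labeled with the genre names
rules = export_text(tree, feature_names=genre_cols)
print(rules)
```

Each branch in the printed rules corresponds to a combination of genres being present or absent, which is the kind of grouping described above.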

**Relevant assumptions**: 
For k-nearest neighbors, we assume that titles sharing genres are related to each other, and that a single title can represent multiple genres.

**Use cases**: 
We can build new title recommenders based on year of release, age certification, seasons, genres or a combination of these categories.
