# Phase 2 - Cleaning and Initial Analysis of Anime data
Michelle Yang & Rachel Zhang

## Data collection/description
--------------------------------------------------------------------
All datasets were downloaded from Kaggle. The Kaggle users who created them collected the data from the official MyAnimeList (MAL) API and the unofficial Jikan API.

The datasets contain information that is current as of roughly one month or three months ago. They were created out of user interest and were not funded by any organization.

Observations are for each anime and each user on MAL. Attributes for anime data include the anime ID as assigned by MAL, title, genre, aired year, popularity, and ranking. Attributes for user data include the user ID (assignment varies by dataset: one is creator assigned, the others are MAL assigned), their reviews of various anime, their scores for anime, and what they have on their anime lists. Missing information is filled in with data from older datasets.

The dataset creators scraped user information from the website through the APIs, probably without the users being aware of it. The scraped information is publicly provided by the users.

### Raw source data
--------------------------------------------------------------------
#### Compiled individual datasets: 

https://drive.google.com/drive/folders/1I5uVgBwEKWqfPn5RqCgdo9Cxfn2i6Th9?usp=sharing

#### Individual datasets links:

https://www.kaggle.com/qvinhdo/myanimelist?select=mal_db.dump:
- MAL_anime_sept20.csv 
- user_watches_sept20.csv 
- usersID_sept20.csv
    
https://www.kaggle.com/marlesson/myanimelist-dataset-animes-profiles-reviews?select=reviews.csv

- animes_marlesson_may20.csv
- profiles_marlesson_may20.csv
- reviews_marlesson_may20.csv

## Research Questions
1. What are the genre preferences for each gender? Do they play into stereotypes?
2. Which anime are most favorited, and why?
3. In what time period was anime most popular?
4. What are the most popular anime that aired each year?
5. Which genres are released most often in each season?
6. Which anime are the top rated? Why are they top rated?
7. Which anime are most users currently watching, putting on hold, dropping, or planning to watch? Why might these be the most dropped or most watched?
8. What is the age range of users watching different genres and specific anime?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import ast

## Importing the main anime CSV 

In [None]:
anime_data = pd.read_csv("animes_marlesson_jan20.csv")
print("Number of Columns in Original Data: " + str(len(anime_data.columns)))
print("Number of Observations in Original Data: " + str(len(anime_data)))
anime_data.head()

## Cleaning the Data by Deleting and Renaming Columns

We deleted columns such as img_url and link because they are not relevant to our analyses.

In [None]:
anime_data = anime_data.drop(columns = ['img_url', 'link'])
anime_data = anime_data.rename(columns = {'score':'rating'}) 
print("Number of Columns After Cleaning Data: " + str(len(anime_data.columns)))
print("Number of Observations After Cleaning Data: " + str(len(anime_data)))
anime_data.head()

## Analysis of Number of Anime Aired Per Year

In this section, we analyze the number of anime aired per year.

The dataset does not contain a column with only the year, so we first extracted the initial airing year from each observation in the column "aired":

In [None]:
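# Function that converts aired to first aired year
# (assumes every entry contains a comma followed by a space and the year)
def extract_year(dataframe):
    aired_years = []
    for dates in dataframe['aired']:
        start = dates.index(",") + 2
        year = dates[start:start+4]  # gets the first year
        aired_years.append(year)
    return aired_years  # without this, the function returned None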

We then appended a new column onto the anime_data dataset that contains just the year in which the anime was originally aired: 

In [None]:
# Adding a new column to represent the aired year
# Cases: month, year ; Not available ; just 1 year ; 20xx to 20xx
aired_years = []
for dates in anime_data['aired']:
    if dates == "Not available":
        aired_years.append(np.nan)  # real missing value, not the string "NaN"
    elif len(dates) > 4 and dates[0].isalpha():
        start = dates.index(",") + 2
        year = dates[start:start+4]  # gets the first year
        aired_years.append(int(year))
    else:
        aired_years.append(dates[0:4])  # all four digits of the leading year
# Convert everything to numbers; anything unparseable becomes NaN
anime_data['aired_year'] = pd.to_numeric(aired_years, errors="coerce")
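As an alternative to looping over each case, the first four-digit number in each "aired" string can be pulled out with a regular expression. The sketch below is only illustrative: the sample strings are assumed formats based on the cases listed in the comment above, not values checked against the real file.

```python
import pandas as pd

# Toy stand-in for anime_data['aired']; these sample strings are assumptions.
aired = pd.Series([
    "Apr 3, 1998 to Apr 24, 1999",
    "Oct, 2005",
    "2012",
    "2001 to 2003",
    "Not available",
])

# Take the first run of four digits; rows without one become missing.
first_year = aired.str.extract(r"(\d{4})", expand=False)

# Nullable integer dtype keeps years as whole numbers alongside missing values.
aired_year = pd.to_numeric(first_year, errors="coerce").astype("Int64")
print(aired_year)
```

This approach would treat any string without a four-digit number as missing, which may or may not be the desired behavior for unusual entries.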

In [None]:
anime_copied = anime_data.copy()
print("Below are the first 5 rows of the dataset with the new column 'aired_year' ")
anime_copied.head()

A future direction is to plot how many anime aired per year as a histogram.

In [None]:
#WILL HAVE IMPLEMENTATION IN THE FUTURE
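One possible starting point for that plot is sketched below. It uses a small made-up table in place of anime_data, so the counts are not real results; only the column name 'aired_year' comes from this notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for anime_data; the years here are invented for illustration.
toy = pd.DataFrame({"aired_year": [1998, 2005, 2005, 2012, 2012, 2012, None]})

# Count titles per year, skipping missing years, in chronological order.
per_year = toy["aired_year"].dropna().astype(int).value_counts().sort_index()

# One bar per year.
fig, ax = plt.subplots()
ax.bar(per_year.index, per_year.values)
ax.set_xlabel("First aired year")
ax.set_ylabel("Number of anime")
ax.set_title("Anime Aired Per Year")
```

Counting with value_counts and drawing bars gives one bar per year; plt.hist with wider bins would be an alternative if decade-level trends are more useful.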

## Analysis of Correlation Between Popularity and Ranking

Popularity reflects how many users on MyAnimeList have added a specific anime to any list under their account (e.g., "want to watch", "watching", "dropped", "finished"). Because people often base their watching decisions on word of mouth or online recommendations, greater exposure to a title may lead more people to add it to their lists, which indicates high popularity. In addition, the titles that get passed around tend to be ones that were well received. Consequently, we predict that popularity is somewhat positively correlated with an anime's ranking, such that a low popularity number (more popular) corresponds to a low ranking number (better ranked). This section examines that relationship:

Scatterplot of Popularity and Rank:

In [None]:
plt.scatter(anime_data['popularity'], anime_data['ranked'], alpha = 0.1)
plt.xlabel("Popularity")
plt.ylabel("Ranked")
plt.title("Anime Popularity v. Rank")

In [None]:
pop_rank_correlation = anime_data['popularity'].corr(anime_data['ranked'])
print("Correlation between Anime Popularity and Rank: {:.2f}".format(pop_rank_correlation))
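Because popularity and ranking both appear to be ordinal positions rather than measurements, a rank-based (Spearman) correlation may be worth reporting next to the default Pearson value. The sketch below computes it by ranking each column and then correlating the ranks. The toy numbers are invented, and whether Spearman is the better choice for this data is an assumption to check.

```python
import pandas as pd

# Toy stand-in for anime_data; values are invented for illustration.
toy = pd.DataFrame({
    "popularity": [1, 2, 3, 4, 5, 6],
    "ranked": [2.0, 1.0, 4.0, None, 3.0, 5.0],
})

# Keep only rows where both values are present, then rank each column.
complete = toy.dropna(subset=["popularity", "ranked"])
pop_ranks = complete["popularity"].rank()
rank_ranks = complete["ranked"].rank()

# Pearson correlation of the ranks is the Spearman correlation.
spearman = pop_ranks.corr(rank_ranks)
print("Spearman correlation: {:.2f}".format(spearman))
```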

## Analysis of The Number of Anime That Each Genre Has 

Many anime in the dataset are tagged with more than one genre. Below, we analyze which genres are most common.

In [None]:
anime_data_copy = anime_data.copy()
tags = anime_data_copy['genre'][0] 
tags = ast.literal_eval(tags)

First, we found all the unique anime genres below: 

In [None]:
genre_list = []
for anime_tags in anime_data_copy['genre']: 
    anime_tags = ast.literal_eval(anime_tags)
    for i in range(len(anime_tags)):
        if anime_tags[i] not in genre_list:
            genre_list.append(anime_tags[i])
print("List of anime genres: " + str(genre_list))

### Creating Genre Counting Dataframe

Next, we created a dataframe with the genre and count columns. We instantiated the counts to 0 for each genre. 

In [None]:
genre_count = pd.DataFrame(columns = ['genre', 'count'])
genre_count['genre'] = genre_list
genre_count['count'] = [0] * len(genre_list)
genre_count.head()

In [None]:
for anime_tags in anime_data_copy['genre']: 
    anime_tags = ast.literal_eval(anime_tags)
    for genre in anime_tags:
        i = genre_list.index(genre)
        genre_count.loc[i, 'count'] += 1  # .loc avoids chained-assignment problems
print("First few rows of new dataframe: ")
print(genre_count.head())

### Sorting Genre Counts from Most Counts to Least Counts

In [None]:
genre_count.sort_values(by = ['count'], ascending = False)
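The same tally can likely be produced in a few lines by parsing each genre string into a list, expanding the lists into one row per genre, and counting. This is a sketch on a toy table; it assumes the 'genre' column holds list-like strings, as the ast.literal_eval calls above suggest.

```python
import ast
import pandas as pd

# Toy stand-in for anime_data; genre strings are invented examples.
toy = pd.DataFrame({
    "genre": [
        "['Action', 'Comedy']",
        "['Comedy']",
        "['Drama', 'Action', 'Comedy']",
    ]
})

genre_counts = (
    toy["genre"]
    .apply(ast.literal_eval)  # string -> Python list
    .explode()                # one row per (anime, genre) pair
    .value_counts()           # sorted from most to least common
)
print(genre_counts)
```

Since value_counts already sorts in descending order, no separate sorting step is needed.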

## Merging User Profiles CSV with User Reviews CSV

In [None]:
users = pd.read_csv("profiles.csv")
users.head()

After importing the file, we cleaned the data of columns that we don't need: 

In [None]:
users = users.drop(columns = ['birthday', 'link'])

Then, we imported the reviews.csv: 

In [None]:
reviews = pd.read_csv("reviews.csv")
reviews.head()

Finally, we merged the user reviews with the user profiles:

In [None]:
user_reviews = pd.merge(users, reviews, on = "profile")
user_reviews.head()
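By default, pd.merge performs an inner join, so profiles without reviews and reviews without a matching profile are silently dropped. A quick way to see how many rows fall into each group is sketched below on toy tables. Only the 'profile' and 'anime_uid' column names come from this notebook; the other columns and all values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the profiles and reviews tables; values are invented.
users = pd.DataFrame({
    "profile": ["ann", "bob", "cat"],
    "gender": ["Female", "Male", None],
})
reviews = pd.DataFrame({
    "profile": ["ann", "ann", "bob", "dan"],
    "anime_uid": [1, 2, 1, 3],
    "score": [9, 7, 8, 6],
})

# Outer join with an indicator column showing where each row came from.
# validate raises an error if a profile appears more than once in users.
merged = pd.merge(
    users, reviews, on="profile",
    how="outer", validate="one_to_many", indicator=True,
)
print(merged["_merge"].value_counts())

# Rows present in both tables match what the default inner join keeps.
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
```

If the real profiles file contains duplicate profile names, the validate check would raise an error, which would itself be useful to know before merging.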

Next, we looked at how many anime received reviews:

In [None]:
print("Number of Animes that Received Reviews: " + str(len(user_reviews['anime_uid'].unique())))

In [None]:
#import all the datasets
anime_recent = pd.read_csv("MAL_anime_sept20.csv")
user_watch = pd.read_csv("user_watches_sept20.csv")
userID = pd.read_csv("usersID_sept20.csv")

In [None]:
userID = userID.drop(columns=["join_date", "last_scraped_date"])
userID.head()

## Data limitations
1. Some user data, such as birthdate and gender, are missing or incorrect because they are self-reported.
2. Because we are merging and using two different datasets, one three months more recent than the other (but missing information on ratings, for example), the more recent observations may be missing information in some columns.
3. The popularity data does not show which subset of "Anime List" an anime was added to.
4. Some anime don't have an air date, and some have only the year and not the month.
5. When plotting the correlation, many points cluster together, which makes the graph very difficult to interpret.

## Questions for reviewers
--------------------------------------------------------------------
1. Are we allowed to keep two separate datasets even after merging datasets? For example, one has anime information and the other has solely user information, and we want to perform separate analyses on them (they can't be merged).
2. How do we use an API? 
3. How many research questions can we explore?