## Web Scraping MyAnimeList.net


Web scraping myanimelist.net for an up-to-date anime list to be used by the animerecommender app.

In [1]:
import urllib
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import re
import numpy as np

In [None]:
url = "https://myanimelist.net/topanime.php?limit=150"

In [None]:
html = requests.get(url)
soup = BeautifulSoup(html.content, 'html.parser', from_encoding="utf-8")

In [None]:
results_ = soup.find_all(class_= "ranking-list")
len(results_)

In [None]:
results_[7]

For the classifier we need:
1. The Anime name
2. The Genres
3. The ratings/score
4. The number of scoring users: the number of members that have scored this anime
5. Number of episodes
6. Number of members: the number of members that have added this anime
7. Type: i.e. TV, Movie, OVA

Most of this information can be obtained from the anime's own page, linked via href="https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood" in the class "hoverinfo_trigger fl-l ml12 mr8", but a couple of fields can be retrieved directly from the class "information di-ib mt4".

In [None]:
results_[7].find(class_="information di-ib mt4")

In [None]:
results_[7].find(class_="information di-ib mt4").text.strip()

We can extract four pieces of information:
1. The type: here it is TV
2. The number of episodes: 64
3. The dates
4. The number of members
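The four fields listed above can also be pulled out in a single pass. The following is a minimal sketch, assuming the stripped text has three lines of the form shown in `sample`; the sample string, the member count in it, and the helper name `parse_information` are illustrative and not taken from the site.

```python
import re

# Hypothetical sample of the stripped "information" text; the layout and
# the member count are assumptions made for illustration only.
sample = "TV (64 eps)\n        Apr 2009 - Jul 2010\n        1,234,567 members"

# Matches e.g. "TV (64 eps)" or "Movie (1 eps)"; the episode count may be "?"
INFO_LINE = re.compile(r"^(?P<type>[^(]+?)\s*\((?P<eps>[^)\s]+)\s+eps?\)$")

def parse_information(text):
    """Split the information text into type, episodes, dates and members."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if len(lines) != 3:
        raise ValueError("expected 3 lines, got {}".format(len(lines)))
    type_line, dates, members_line = lines

    match = INFO_LINE.match(type_line)
    if match is None:
        raise ValueError("unrecognised type line: {!r}".format(type_line))

    eps_text = match.group("eps")
    # keep None when the episode count is not a plain number (e.g. "?")
    episodes = int(eps_text) if eps_text.isdigit() else None
    # "1,234,567 members" -> 1234567
    members = int(members_line.split()[0].replace(",", ""))

    return {
        "type": match.group("type"),
        "episodes": episodes,
        "dates": dates,
        "members": members,
    }

info = parse_information(sample)
print(info)
```

The cells below do the same thing step by step with `splitlines` and `re.split`, which is easier to inspect interactively.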

In [None]:
Type, Dates, members = results_[7].find(class_="information di-ib mt4").text.strip().splitlines()

In [None]:
float("".join(members.split()[0].split(",")))

Extract the number of episodes

In [None]:
Type

In [None]:
[Type_, eps, n] = [", ".join(x.split()) for x in re.split(r'[()]',Type)]

In [None]:
eps = float(eps.split(",")[0])

In [None]:
results_[7].find(class_="hoverinfo_trigger fl-l fs14 fw-b")["href"]

In [None]:
url_= results_[7].find(class_="hoverinfo_trigger fl-l fs14 fw-b")["href"]
html_ = requests.get(url_)
soup_ = BeautifulSoup(html_.content, 'html.parser', from_encoding="utf-8")

In [None]:
soup_

In [None]:
# scrolling through this long output we see the title is in class="h1-title"
soup_.find(class_="h1-title").text.strip()

Let's see what we have: the title, the number of members, the type, the number of episodes and the dates. We still need to get the genres, the score/rating and the number of scoring members. I think we will find all of these in the borderClass.

In [None]:
soup_.find(class_="borderClass")

In [None]:
# list the genres from itemprop
soup_.find(class_="borderClass").find_all("span", itemprop="genre")

In [None]:
genres = [genre.text.strip() for genre in soup_.find(class_="borderClass").find_all("span", itemprop="genre")]
genres

In [None]:
# the score
float(soup_.find(class_="borderClass").find_all("span", itemprop="ratingValue")[0].text.strip())

In [None]:
# number of scoring members: the number of members that have scored this anime
float(soup_.find(class_="borderClass").find_all("span", itemprop="ratingCount")[0].text.strip())

### Putting it all together

Now that we have all the elements, let's define a function to extract the info

In [2]:
from time import sleep

def parse_MAl(url):
    html = requests.get(url)
    soup = BeautifulSoup(html.content, 'html.parser', from_encoding="utf-8")
    results = soup.find_all(class_="ranking-list")

    columns = ["name", "type", "episodes", "members", "score_members",
               "rating", "genre", "dates", "url"]
    rows = []  # one dict per anime; DataFrame.append was removed in pandas 2.0
    for result in results:
        url_ = result.find(class_="hoverinfo_trigger fl-l fs14 fw-b")["href"]
        html_ = requests.get(url_)
        soup_ = BeautifulSoup(html_.content, 'html.parser', from_encoding="utf-8")

        # reset the per-anime fields so a failed lookup gives None instead of
        # a NameError or a stale value from the previous anime
        name = genres = score = score_members = None

        try:
            name = soup_.find(class_="h1-title").text.strip()
        except AttributeError:
            print(soup_)  # title not found: dump the page for inspection

        Type, Dates, members = result.find(class_="information di-ib mt4").text.strip().splitlines()
        try:
            members = float("".join(members.split()[0].split(",")))
        except (ValueError, IndexError):
            pass  # keep the raw text if it is not a number

        [Type_, eps, n] = [", ".join(x.split()) for x in re.split(r'[()]', Type)]

        try:
            eps = float(eps.split(",")[0])
        except ValueError:
            pass  # keep the raw text if the episode count is not a number

        try:
            genres = [genre.text.strip() for genre in soup_.find(class_="borderClass").find_all("span", itemprop="genre")]
        except AttributeError:
            pass

        try:
            score = float(soup_.find(class_="borderClass").find_all("span", itemprop="ratingValue")[0].text.strip())
        except (AttributeError, IndexError, ValueError):
            pass
        try:
            score_members = float(soup_.find(class_="borderClass").find_all("span", itemprop="ratingCount")[0].text.strip())
        except (AttributeError, IndexError, ValueError):
            pass

        rows.append({
            "name": name,
            "type": Type_,
            "episodes": eps,
            "members": members,
            "score_members": score_members,
            "rating": score,
            "genre": genres,
            "dates": Dates.strip(),
            "url": url_
        })

        sleep(10)  # pause for 10 seconds between each result call
    return pd.DataFrame(rows, columns=columns)

In [None]:
url = "https://myanimelist.net/topanime.php?limit=150"

parse_MAl(url)

Success! This is awesome. Now to extract all anime. To do so, look at the URL: url = "https://myanimelist.net/topanime.php?limit=0". The "limit" value can be used to scroll through the list, 50 entries at a time. On the website the list runs up to limit=16750, so that is the upper bound we will use.
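As a quick check of this paging scheme, the sketch below lists the URLs produced by stepping "limit" in increments of 50. The helper name `ranking_page_urls` and its `page_size` parameter are illustrative and not part of the notebook; no request is made.

```python
URL_TEMPLATE = "https://myanimelist.net/topanime.php?limit={}"

def ranking_page_urls(total=16750, start=0, page_size=50):
    """Return one ranking-page URL per offset from `start` up to (not including) `total`."""
    return [URL_TEMPLATE.format(offset) for offset in range(start, total, page_size)]

urls = ranking_page_urls(total=150)
for page_url in urls:
    print(page_url)
```

With the default bound of 16750 this gives 335 pages, which is worth keeping in mind given the pauses between requests.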

In [3]:
import os
from tqdm import tqdm
from time import sleep

def webscrape_MAl(anime_limit=16750, start=0):
    url_template = "https://myanimelist.net/topanime.php?limit={}"
    columns = ["name", "type", "episodes", "members", "score_members",
               "rating", "genre", "dates", "url"]
    frames = []  # one DataFrame per ranking page
    for limit in tqdm(range(start, anime_limit, 50)): # iterate in steps of 50
        url = url_template.format(limit)
        df_temp = parse_MAl(url)
        save_mal_temp(df_temp, limit)
        frames.append(df_temp)
        sleep(60) # wait for 60 seconds before the next call
    # DataFrame.append was removed in pandas 2.0, so combine with pd.concat
    df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=columns)
    # save to disk
    df.to_csv('MAL.csv')
    return df

def save_mal_temp(df, limit):
    os.makedirs("temp", exist_ok=True) # to_csv does not create missing directories
    csvTemp = "temp/MAL_start_{}.csv".format(limit)
    df.to_csv(csvTemp)
    print("Number of missing names, for limit {} = {}".format(limit, df["name"].isnull().sum()))

In [4]:
#test with a smaller number
anime = webscrape_MAl(anime_limit=400)

  0%|          | 0/8 [08:11<?, ?it/s]


KeyboardInterrupt: 

In [None]:
#anime = pd.read_csv("My_Anime_List_uncleaned.csv")
anime.head()

In [None]:
anime.shape

In [None]:
anime.isnull().sum()

In [None]:
anime.tail()

Cool. Now, to extract all the data, I'll add the functions to a script that I can conveniently run whenever I need to update.
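One possible shape for such a script is sketched below. It is only an outline under assumptions: the `--limit` and `--start` option names are made up here, and the scraping function is passed in as the `scrape` argument so the sketch stays self-contained; in the real script this would be `webscrape_MAl`.

```python
import argparse

def build_parser():
    """Command-line options mirroring the arguments of webscrape_MAl."""
    parser = argparse.ArgumentParser(
        description="Scrape the MyAnimeList top-anime ranking.")
    parser.add_argument("--limit", type=int, default=16750,
                        help="upper bound for the 'limit' offset (default: 16750)")
    parser.add_argument("--start", type=int, default=0,
                        help="offset to start from, e.g. to resume a run (default: 0)")
    return parser

def main(argv=None, scrape=None):
    """Parse the options and hand them to `scrape` (hypothetical hook)."""
    args = build_parser().parse_args(argv)
    if scrape is None:
        # dry run: only report what would be scraped
        print("Would scrape offsets {} to {}".format(args.start, args.limit))
    else:
        scrape(anime_limit=args.limit, start=args.start)
    return args

# Example: a dry run equivalent to the smaller test above
args = main(["--limit", "400"])
```

Exposing `--start` makes it possible to pick up from the last saved temp file after an interrupted run instead of starting over.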