
Disclaimer: The following code for scraping MyAnimeList (MAL) was written on Dec 30th, 2020. It is not guaranteed to work if MAL's page structure has changed since then. I will make an effort to update the code as often as possible. That said, I have found that the same approach lets me scrape most websites I want.

### Scraping Static HTML: Using MAL Top Animes as An Example

#### Import libraries

- BeautifulSoup: for scraping
- requests: request the HTML of a page (BeautifulSoup then parses it)
- re: regular expression for string manipulation
- pandas: convert data scraped into csv files

In [1]:
from bs4 import BeautifulSoup 
import requests
import re
import pandas as pd

#### Helper Function to Parse One Anime Row

Looking at the HTML of https://myanimelist.net/topanime.php (in Chrome, right click, select Inspect, and open the Elements tab to see the HTML), each anime is a tr (table row) of the table. Within each row, the name of the anime is wrapped in an element with class anime_ranking_h3, the related information in class information, and the score in class score. These can be scraped rather simply with BeautifulSoup's select() function, which takes a CSS selector. Then the text can be cleaned.

We can further get a show's start year and end year from the related information section. Here I used a regular expression to match the 4-digit years for the start and end.

In [30]:
def getOneRow(targetrow):
    animeTitle=targetrow.select("h3.anime_ranking_h3")[0].text
    animeInformation=targetrow.select("div.information")[0].text.replace("\n","|").replace("  ","")
    animeScore=targetrow.select("td.score")[0].text.replace("\n", "")
    sections=animeInformation.split("|") # split by |; sections[2] is the airing date range
    # get all 4-digit years in the date section (empty list if that section is missing)
    years=re.findall('[0-9]{4}', sections[2]) if len(sections)>2 else []
    start="NA"
    end="NA"

    if len(years)>0: # check the regex matches, not the split sections
        start=years[0]
        if len(years)>1:
            end=years[1]
    return animeTitle, animeInformation, animeScore, start, end

# tablerow[0]
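
The year extraction can be tried on its own, without touching the network. The following is a minimal sketch, assuming the information string has the shape seen in the scraped output later in this post (for example `|TV (12 eps)|Jul 2008 - Sep 2008|320,922 members|`). The helper name `extract_years` is made up for illustration and is not part of the code above.

```python
import re

def extract_years(information):
    """Return (start, end) years as strings, or "NA" when a year is missing."""
    sections = information.split("|")
    # The string starts with "|", so sections[0] is empty and sections[2] holds the dates.
    dates = sections[2] if len(sections) > 2 else ""
    years = re.findall(r"[0-9]{4}", dates)  # every 4-digit run in the date section
    start = years[0] if len(years) > 0 else "NA"
    end = years[1] if len(years) > 1 else "NA"
    return start, end

print(extract_years("|TV (12 eps)|Jul 2008 - Sep 2008|320,922 members|"))  # ('2008', '2008')
print(extract_years("|TV (26 eps)|Oct 2015 - Apr 2016|2,427 members|"))    # ('2015', '2016')
print(extract_years("|no dates here|"))  # made-up input with no date section: ('NA', 'NA')
```

Falling back to "NA" instead of raising keeps one odd row from stopping a long scrape.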

#### Function to Get a Specified Number of Anime on The Top Anime List

Pass the url into the requests.get() function to get the entire page, then make a soup out of it with BeautifulSoup. With the soup ready, we can find the table corresponding to the top anime list and select all its rows. For each row, get the desired data with the getOneRow() helper function. Because each page of the top anime list only has 50 anime, requesting more than 50 requires a loop to scrape the pages after the first one. Those pages are addressed with a limit query parameter that gives the row offset (50, 100, 150, ...).

In [31]:
def getTopAnime(limit):
    topanimedict=[] # a list of dicts, one per anime; easy to convert to JSON or csv
    url = "https://myanimelist.net/topanime.php" # first page (ranks 1-50)
    soup = BeautifulSoup(requests.get(url).text, 'lxml') # make soup of html
    toptable = soup.select("table")[0] # get table corresponding to the top anime table
    tablerow=toptable.select("tr.ranking-list") # get all rows in the table
    for row in tablerow: # get data for each row
        anime, info, score, st, ed=getOneRow(row)
        tempdict={"anime": anime,"start": st, "end":ed,  "score": score, "information": info}
        topanimedict.append(tempdict)

    if limit>50: # get page 2, 3, 4 etc after the first one
        pages=(limit+49)//50 # pages needed, rounded up (e.g. limit=120 needs 3 pages)
        for i in range(1,pages):
            url = "https://myanimelist.net/topanime.php?limit="+str(50*i) # limit is the row offset
            print(url)
            soup = BeautifulSoup(requests.get(url).text, 'lxml')
            toptable = soup.select("table")[0]
            tablerow=toptable.select("tr.ranking-list")
            for row in tablerow:
                anime, info, score, st, ed=getOneRow(row)
                tempdict={"anime": anime,"start": st, "end":ed,  "score": score, "information": info}
                topanimedict.append(tempdict)

    topanimedf=pd.DataFrame(topanimedict).head(limit) # keep at most `limit` rows
    return topanimedf
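
The paging arithmetic is easy to get wrong by one page, so it can help to check it separately from the scraping. Here is a small sketch, assuming 50 rows per page and the `limit` offset parameter seen in the urls above. `page_urls` is a hypothetical helper, not something MAL or the function above provides.

```python
BASE_URL = "https://myanimelist.net/topanime.php"
PAGE_SIZE = 50  # rows per page, as described above

def page_urls(limit):
    """List the urls needed to cover the first `limit` ranks."""
    pages = max(1, -(-limit // PAGE_SIZE))  # ceiling division, at least one page
    urls = [BASE_URL]  # the first page takes no offset
    for i in range(1, pages):
        urls.append(BASE_URL + "?limit=" + str(PAGE_SIZE * i))
    return urls

print(page_urls(50))          # only the first page
print(page_urls(120)[-1])     # https://myanimelist.net/topanime.php?limit=100
print(len(page_urls(3000)))   # 60
```

Building the list of urls first also makes it simple to add a pause between requests or to resume a scrape that stopped partway.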


#### Convert Data

With the help of a list of dictionaries and the pandas library, it is really easy to convert what we scraped into a csv. The file is saved to the current working directory (for a notebook, this is usually the folder the notebook is in).

In [71]:
df=getTopAnime(3000)
df.to_csv('MALtop3000.csv', index=False)

https://myanimelist.net/topanime.php?limit=50
https://myanimelist.net/topanime.php?limit=100
https://myanimelist.net/topanime.php?limit=150
https://myanimelist.net/topanime.php?limit=200
https://myanimelist.net/topanime.php?limit=250
https://myanimelist.net/topanime.php?limit=300
https://myanimelist.net/topanime.php?limit=350
https://myanimelist.net/topanime.php?limit=400
https://myanimelist.net/topanime.php?limit=450
https://myanimelist.net/topanime.php?limit=500
https://myanimelist.net/topanime.php?limit=550
https://myanimelist.net/topanime.php?limit=600
https://myanimelist.net/topanime.php?limit=650
https://myanimelist.net/topanime.php?limit=700
https://myanimelist.net/topanime.php?limit=750
https://myanimelist.net/topanime.php?limit=800
https://myanimelist.net/topanime.php?limit=850
https://myanimelist.net/topanime.php?limit=900
https://myanimelist.net/topanime.php?limit=950
https://myanimelist.net/topanime.php?limit=1000
https://myanimelist.net/topanime.php?limit=1050
https://myan

Take a look at the scraped data. It looks pretty neat to me. The index is the ranking minus 1 (so index 2999 is rank 3000).

In [72]:
df.tail()

Unnamed: 0,anime,end,information,score,start
2995,Sekirei,2008,"|TV (12 eps)|Jul 2008 - Sep 2008|320,922 members|",7.14,2008
2996,Shin Atashin'chi,2016,"|TV (26 eps)|Oct 2015 - Apr 2016|2,427 members|",7.14,2015
2997,Tantei Opera Milky Holmes Movie: Gyakushuu no ...,2016,"|Movie (1 eps)|Feb 2016 - Feb 2016|3,417 members|",7.14,2016
2998,Tenchi Muyou! Manatsu no Eve,1997,"|Movie (1 eps)|Aug 1997 - Aug 1997|13,514 memb...",7.14,1997
2999,Tengen Toppa Gurren Lagann: Parallel Works,2008,"|Music (8 eps)|Jun 2008 - Sep 2008|29,743 memb...",7.14,2008


### Scraping Dynamic HTML: Using MAL User List as An Example

With the code here, you will be able to scrape any user's MAL anime list. Here I used my own anime list as an example (https://myanimelist.net/animelist/iasnobmatsu; FYI, I highly, highly recommend Attack on Titan, Haikyu, and Hoseki no Kuni).

Dynamic HTML differs from static HTML. Static HTML is rendered directly from an HTML source file (imagine writing an HTML file by hand; that file is what we scrape). Dynamic HTML, on the other hand, is not rendered from HTML source files but generated by JavaScript (or jQuery, React, or whatever framework). Unlike static HTML, it is not there the moment a url is opened; it needs some time to render after the document is ready.

#### Helper Function to Get One Row of MAL User List

Similar to the getOneRow() function, this function parses specific data for one anime. This step is the same regardless of whether the HTML is static or dynamic.


In [22]:
def getOneRowMAL(targetrow):
    animeTitle=targetrow.select("td.title")[0].select("a.link.sort")[0].text
    animeType=targetrow.select("td.type")[0].text.strip()
    animeScore=targetrow.select("td.score")[0].text.strip()
    animeProgress=targetrow.select("td.progress")[0].text.replace("\n", "").replace("  ","")
    return animeTitle, animeType,animeScore, animeProgress

getOneRowMAL(rows[27])

('Haikyuu!!', 'TV', '7', ' 25 ')

#### Additional Libraries for Dynamic HTML

For scraping dynamic HTML, we need selenium and time. 

In [None]:
from selenium import webdriver
import time

#### Get Dynamic MAL User List Data

To scrape dynamic data, we need the url of the webpage. We also need a web browser driver. Here I use the Chrome driver (download it from https://chromedriver.chromium.org/ or install it through Homebrew etc.). I stored it in my Downloads folder, and the function needs the path to the driver.

With the url of the webpage and the path to the browser driver ready, we use selenium to create a driver and use it, instead of requests, to get the url.

Then it is important to delay the rest of the function for some time. Here I used 0.2 seconds, but the right value depends on how fast the page loads on a specific device under specific internet conditions. This delay allows the dynamic HTML to render, so we scrape the desired content instead of the initial script used to generate the HTML (which we cannot parse).
Then we follow similar steps to scrape each row of data from the user anime list using BeautifulSoup.

In [34]:
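def getMAL(url, driverPath):
    MALdict=[] # a list of dicts, one per anime
    # Selenium 3 style call; Selenium 4 expects webdriver.Chrome(service=Service(driverPath)),
    # with Service imported from selenium.webdriver.chrome.service
    driver = webdriver.Chrome(driverPath)
    driver.get(url)

    time.sleep(0.2) # may change; gives the JavaScript time to render the list
    soup=BeautifulSoup(driver.page_source, 'lxml')
    driver.quit() # close the browser once the rendered HTML is captured
    toptable = soup.select("table")[0]
    rows=toptable.select("tbody.list-item")
    for row in rows:
        ti,ty,sc,pr=getOneRowMAL(row)
        MALdict.append({"anime":ti,"type":ty, "score":sc,"progress":pr})
    return pd.DataFrame(MALdict)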
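
A fixed `time.sleep(0.2)` is a guess: too short and the table is not rendered yet, too long and time is wasted. One alternative is to poll until the rows show up or a timeout passes. The sketch below is a generic polling helper using only the standard library. `wait_until` and its parameters are my own illustration names, and `page_is_ready` stands in for a real check such as looking for `tbody.list-item` rows in `driver.page_source`. Selenium also ships its own `WebDriverWait` class for this purpose, which is likely the better choice in real code.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.2):
    """Call condition() every `interval` seconds until it returns a truthy
    value or `timeout` seconds have passed. Returns the last result."""
    deadline = time.monotonic() + timeout
    result = condition()
    while not result and time.monotonic() < deadline:
        time.sleep(interval)
        result = condition()
    return result

# Stand-in for a page that finishes rendering on the third check.
checks = {"count": 0}

def page_is_ready():
    checks["count"] += 1
    return checks["count"] >= 3

print(wait_until(page_is_ready, timeout=2.0, interval=0.01))  # True
print(checks["count"])  # 3
```

If the condition never becomes true, the helper returns a falsy value after the timeout, so the caller can decide whether to retry or give up.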



#### Convert Data

Here we use the function above to get dynamic HTML data from my MAL list (you can replace it with any user's MAL list). The data is again saved to a CSV file.

In [28]:
url='https://myanimelist.net/animelist/iasnobmatsu'
driverp="/Users/ziqianxu/Downloads/chromedriver"
df2=getMAL(url,driverp)
df2.to_csv('iasnobmatsuMAL.csv', index=False)
df2.head()

Unnamed: 0,anime,progress,score,type
0,JoJo no Kimyou na Bouken Part 3: Stardust Crus...,- / 24,8,TV
1,One Piece,- / -,8,TV
2,Shingeki no Kyojin: The Final Season,- / 16,10,TV
3,Akagami no Shirayuki-hime,12,5,TV
4,Bleach,366,7,TV
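
The progress column comes back as raw text such as `- / 24`, `- / -`, or a bare `12`, which is awkward to analyze. Below is one possible cleanup. It is a sketch that assumes those are the only shapes (they are the ones visible in the output above) and that a bare number is the watched episode count with the total left unknown. `parse_progress` is a hypothetical helper, not part of the code above.

```python
def parse_progress(text):
    """Split a progress cell into (watched, total); None marks a missing value."""
    def to_int(part):
        part = part.strip()
        return int(part) if part.isdigit() else None  # "-" or "" becomes None

    parts = text.split("/")
    watched = to_int(parts[0])
    total = to_int(parts[1]) if len(parts) > 1 else None
    return watched, total

for cell in [" 25 ", "- / 24", "- / -", "366"]:
    print(repr(cell), "->", parse_progress(cell))
```

With the dataframe from above, something like `df2["progress"].apply(parse_progress)` would apply the same cleanup to every row.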
