# Introduction

Our goal is to build a search engine over the "Top Anime Series" list published by MyAnimeList (https://myanimelist.net). No dataset is provided, so we build our own.

In [1]:
from bs4 import BeautifulSoup
from tqdm import tqdm
import pandas as pd
import numpy as np
import time
import datetime
import requests
import codecs
import lxml
import csv
import re
import os

## 1. Data Collection

We start from the list of animes to include in the corpus of documents the search engine will work on. In particular, we focus on the list of the top animes ever: https://myanimelist.net/topanime.php. The list is long and split over many pages. The first thing we do is retrieve the urls (and the names) of the animes listed in the first 400 pages (each page has 50 animes, so we end up with 20000 unique anime urls).
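The page arithmetic above can be checked with a minimal sketch. It assumes that each page of the list is reached through the `limit` query parameter, which counts the animes to skip. The names `BASE_URL`, `ANIME_PER_PAGE` and `ranking_page_urls` are introduced here for illustration only.

```python
# Hypothetical helper: builds the url of every page of the top-anime list.
BASE_URL = "https://myanimelist.net/topanime.php"
ANIME_PER_PAGE = 50


def ranking_page_urls(num_pages):
    """Return the url of each list page, in order (page 0 starts at limit=0)."""
    return [f"{BASE_URL}?limit={page * ANIME_PER_PAGE}" for page in range(num_pages)]


page_urls = ranking_page_urls(400)
print(len(page_urls))                   # number of pages to visit
print(page_urls[0])                     # first page
print(page_urls[-1])                    # last page
print(len(page_urls) * ANIME_PER_PAGE)  # number of anime urls we expect
```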

### 1.1 Get the list of animes

Here we extract the *urls* and the *names* of the animes in the list. We first work out the necessary steps on a single anime in the list and then proceed by iteration.

After inspecting the HTML code of the site, we saw that all the information we need about a single anime is stored in a `tr` block inside the single `table` that lists all the top animes of the site. Both the name and the url of an anime come from the `a` tag inside its row: the name is the text of the tag, and the url is its `href` attribute.

Knowing these HTML details, we can use the `BeautifulSoup` library to do the web scraping.
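Before touching the real site, the idea can be tried on a tiny hand-written page. The sketch below is only an illustration: `SAMPLE_HTML` is a hypothetical, heavily simplified stand-in for a list page (the urls and titles are made up), and `extract_entries` is a name introduced here. It keeps only the `a` tags that carry an `id` attribute, the same filter used in the scraping cell.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one page of the list.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a id="#area1" href="https://example.org/anime/1/Example_Anime_A">Example Anime A</a></td>
  </tr>
  <tr>
    <td><a href="https://example.org/other">x</a></td>
  </tr>
</table>
"""


def extract_entries(html):
    """Return (name, url) pairs for the anchors that look like anime entries."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for row in soup.find_all("tr"):
        for link in row.find_all("a"):
            name = link.get_text(strip=True)
            # keep only anchors that have an id and a non-trivial text
            if isinstance(link.get("id"), str) and len(name) > 1:
                entries.append((name, link.get("href")))
    return entries


print(extract_entries(SAMPLE_HTML))
```

The second row has no `id` on its anchor, so it is skipped.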

In [None]:
# EXECUTE ONLY ONCE
# IF THE FILE links.txt EXISTS, DO NOT EXECUTE THIS CELL

# REMARK: the execution can take a few minutes

# open links.txt for writing; the with-block closes it even if a request fails
with open("links.txt", "w") as links_text:
    # go through the list page by page and scrape the urls we need
    for page in tqdm(range(0, 400)):
        url = 'https://myanimelist.net/topanime.php?limit=' + str(page * 50)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for tag in soup.find_all('tr'):
            links = tag.find_all('a')
            for link in links:
                # keep only the anchors that have an id and a non-trivial first child
                if isinstance(link.get('id'), str) and len(link.contents[0]) > 1:
                    data = link.get('href')
                    # write the scraped url followed by '\n', one url per row
                    links_text.write(data)
                    links_text.write("\n")

In [None]:
# Count the non-empty lines in the .txt file
line_count = 0
with open("links.txt", "r") as file:
    for line in file:
        if line != "\n":
            line_count += 1

print('There are {} lines in total in this file.'.format(line_count))

### 1.2 Crawl animes

We proceed to:
- download the html corresponding to each of the collected urls;
- save each html in a file;
- organize the entire set of downloaded html pages into folders. Each folder contains the htmls of the animes in page 1, page 2, ... of the list of animes.

To do so we rely extensively on the `os` library to create directories, change the working directory, and so on.
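The folder layout can be sketched on its own, without any download. The helper `article_path` below is a hypothetical name introduced for illustration; it assumes 50 animes per folder and the `page_<k>.html` / `article_<i>.html` naming used in this notebook, and it builds full paths instead of changing the working directory.

```python
import os
import tempfile


def article_path(base_dir, index, per_page=50):
    """Return the path of the index-th article, creating its page folder if needed."""
    page_dir = os.path.join(base_dir, f"page_{index // per_page}.html")
    # exist_ok=True makes the call safe when the folder is already there
    os.makedirs(page_dir, exist_ok=True)
    return os.path.join(page_dir, f"article_{index}.html")


# try the helper in a throwaway directory instead of the real tree
with tempfile.TemporaryDirectory() as demo_root:
    demo_paths = [article_path(demo_root, i) for i in (0, 49, 50, 120)]
    demo_relative = [os.path.relpath(p, demo_root) for p in demo_paths]
    demo_folders = sorted(os.listdir(demo_root))

print(demo_relative)
print(demo_folders)
```

Building the full path keeps the working directory unchanged, so the cell can be interrupted and restarted without ending up inside a nested folder.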

In [None]:
# EXECUTE ONLY ONCE
# IF THE DIRECTORY TREE ALREADY EXISTS, DO NOT EXECUTE THIS CELL

# REMARK: the execution can take a long time (>25 hours)
# REMARK: the site blocks most page requests when they arrive too frequently,
# so a time delay is inserted between page requests


with open("links.txt", "r") as file:
    lines = file.read().split('\n')
# returns current working directory
base = os.getcwd()
# links.txt ends with '\n', so the last element of lines is empty and is dropped
scrapped_urls = lines[0:-1]
for i in range(len(scrapped_urls)):
    if i % 50 == 0:
        # create a new folder every 50 animes
        # remark: the newly created folders are numbered from 0
        page_identifier = i // 50
        # subdirectory
        directory = f"page_{page_identifier}.html"
        # path of the new folder inside the base directory
        path = os.path.join(base, directory)
        # make directory
        os.makedirs(path)
        # checkpoint
        # print("Directory '%s' created" %directory)
        # change directory so that the next 50 articles are saved here
        os.chdir(path)

    # to avoid the issue with high frequency site-connections
    time.sleep(5)

    # get urls
    URL = scrapped_urls[i]
    page = requests.get(URL)

    # parsing
    soup_data = BeautifulSoup(page.content, "html.parser")

    # saving (utf-8 is the encoding used later to read the file back)
    with open(f"article_{i}.html", "w", encoding="utf-8") as file:
        file.write(str(soup_data))

    # checkpoint
    # print(f"Article {i} successfully written!")

# go back to the starting directory once all the pages are saved
os.chdir(base)
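A fixed delay helps, but a blocked request can still slip through and be saved as if it were a valid page. A minimal sketch of a retry wrapper follows. It assumes that a blocked request shows up as a response whose `status_code` is not 200, which we did not verify for this site. `fetch_with_retry` and `FakeResponse` are names introduced here; the `fetch` argument stands for any function that downloads a url, for example `lambda u: requests.get(u, timeout=30)`.

```python
import time


def fetch_with_retry(url, fetch, retries=3, delay=5.0, sleep=time.sleep):
    """Call fetch(url) until it returns a 200 response or the retries run out."""
    for attempt in range(1, retries + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        # wait a little longer after each failed attempt
        sleep(delay * attempt)
    return None


class FakeResponse:
    """Stand-in for a real response, used only to try the helper offline."""

    def __init__(self, status_code):
        self.status_code = status_code


# first answer simulates a blocked request, the second one a successful download
answers = iter([FakeResponse(429), FakeResponse(200)])
waits = []
result = fetch_with_retry("https://example.org", lambda u: next(answers), sleep=waits.append)
print(result.status_code, waits)
```

Passing `sleep` as a parameter lets the demo record the waiting times instead of actually pausing.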
    

### 1.3 Parse downloaded pages

At this point we have all the html documents of the animes of interest and we can start to extract the following information for each anime:
- Anime Name (to save as `animeTitle`): String
- Anime Type (to save as `animeType`): String
- Number of episodes (to save as `animeNumEpisode`): Integer
- Release and End Dates of anime (to save as `releaseDate` and `endDate`): Convert both release and end date into datetime format.
- Number of members (to save as `animeNumMembers`): Integer
- Score (to save as `animeScore`): Float
- Users (to save as `animeUsers`): Integer
- Rank (to save as `animeRank`): Integer
- Popularity (to save as `animePopularity`): Integer
- Synopsis (to save as `animeDescription`): String
- Related Anime (to save as `animeRelated`): Extract all the related animes, but only keep unique values and those that have a hyperlink associated to them. List of strings.
- Characters (to save as `animeCharacters`): List of strings.
- Voices (to save as `animeVoices`): List of strings.
- Staff (to save as `animeStaff`): Include the staff name and their responsibility/task in a list of lists.
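The date conversion in the list above can be sketched with `strptime`. This is only an illustration, not the parsing used in `parse_function`: it assumes the aired text looks like `Apr 5, 2009 to Jul 4, 2010`, and that the month abbreviations match the current locale (`%b` is locale dependent). The name `parse_aired` is introduced here, and unknown dates are returned as `None`.

```python
import datetime


def parse_aired(text):
    """Split an aired string into (release, end); unreadable dates become None."""
    def to_date(part):
        try:
            return datetime.datetime.strptime(part.strip(), "%b %d, %Y")
        except ValueError:
            return None

    # partition leaves the end part empty when there is no ' to ' in the text
    start, _, end = text.partition(" to ")
    return to_date(start), to_date(end)


print(parse_aired("Apr 5, 2009 to Jul 4, 2010"))
print(parse_aired("Oct 15, 2011 to ?"))
```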

In [10]:
animeTitle = []
animeType = []
animeNumEpisode = []
animeNumMembers = []
animeScore = []
animeUsers = []
animeRank = []
animePopularity = []
animeDescription = []
animeRelated = []
animeCharacters = []
animeVoices = []
animeStaff = []

# Needed for the dates part
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun','Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def parse_function(path):
    """
    Function that extracts the information of an anime.
    Input: path (a string with the position of the anime page in the folder tree)
    Output: a list with all the information mentioned above
    """
    # take article_i.html from the directory tree we previously created
    file = codecs.open(path, "r", "utf-8")
    soup = BeautifulSoup(file, 'html.parser')

    # TITLES
    # a similar code can be used for the other informations
    animeTitle.append(soup.find_all('strong')[0].contents[0])
    
    divs = soup.find_all("div", class_="spaceit_pad")
    
    # DATES
    # default to "?" so that the return statement works even without an 'Aired:' row
    releaseDate = endDate = "?"
    for div in divs:
        if div.find("span", string='Aired:') is not None:
            # the text looks like 'Apr 5, 2009 to Jul 4, 2010' (or a single date)
            aired = div.contents[2].strip().split(" to ")
            start = aired[0].split()
            # rstrip(',') drops only the comma, so two-digit days are kept whole
            releaseDate = datetime.datetime(int(start[2]), months.index(start[0]) + 1, int(start[1].rstrip(',')))
            end = aired[1].split() if len(aired) == 2 else []
            if len(end) == 3:
                endDate = datetime.datetime(int(end[2]), months.index(end[0]) + 1, int(end[1].rstrip(',')))
            else:
                # no end date, or an end date that is not a full 'Mon D, YYYY'
                endDate = "?"

    for div in divs:
        spans = div.find_all("span")
        for span in spans:
            
            # TYPES
            if span.contents[0] == 'Type:':
                animeType.append(div.find_all('a')[0].contents[0])
           
            # NUMBER OF EPISODES
            if span.contents[0] == 'Episodes:':
                animeNumEpisode.append(int(div.contents[2]))

    # similar to what was done before
    divs = soup.find_all("div", {"class": "stats-block po-r clearfix"})
    for div in divs:
        # MEMBERS
        # center of the html page
        members = div.find_all("span", {"class": "numbers members"})
        animeNumMembers.append(int(members[0].contents[1].contents[0].replace(',', '')))
        
        # SCORE
        # center of the html page
        rating=soup.find(name="div",attrs={"class":"fl-l score"})
        animeScore.append(float(rating.text.strip()))
    
        # USERS
        # center of the html page
        users = div.find_all("div", {"class": "fl-l score"})
        # here we drop the trailing ' users' (6 characters),
        # that is why there is the [:-6] part
        # we also remove the comma used as thousands separator
        animeUsers.append(int(users[0]['data-user'][:-6].replace(',', '')))
        
        # RANK
        # center of the html page
        rank = div.find_all("span", {"class": "numbers ranked"})
        animeRank.append(int(rank[0].contents[1].contents[0][1:]))
    
        # POPULARITY
        # center of the html page
        popularity = div.find_all("span", {"class": "numbers popularity"})
        animePopularity.append(int(popularity[0].contents[1].contents[0][1:]))
    
    
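    # DESCRIPTION
    # center of the html page
    # note: this assignment creates a local string named animeDescription,
    # so the module-level list animeDescription is left untouched
    animeDescription = soup.find_all("p", itemprop = "description")[0].text.strip().replace('\n', '').replace('  ', '')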
     
    
    # RELATED 
    # center of the html page
    x = []
    y = []
    related = soup.find_all("table", {"class": "anime_detail_related_anime"})
    for tr in related:
        td = tr.find_all("td")
        for i in range(0, len(td), 2):
            x.append(td[i].contents[0])
            t = td[i+1].find_all("a")
            y.append(t[0].contents[0])
        animeRelated.append('\n'.join([f'{x} {y}' for x, y in dict(zip(x, y)).items()]).split('\n'))
    
    
    # CHARACTERS
    # center of the html page (bottom)
    characters = soup.find_all("div", {"class": "detail-characters-list clearfix"})
    chars = characters[0].find_all("h3", {"class": "h3_characters_voice_actors"})
    x = []
    for i in chars:
        x.append(i.contents[0].contents[0])
    animeCharacters.append(x)
    
    
    # VOICES
    # center of the html page (bottom)
    voices = characters[0].find_all("td", {"class": "va-t ar pl4 pr4"})
    y = []
    for i in voices:
        y.append(i.contents[1].contents[0])
    animeVoices.append(y)

    
    # STAFF
    # center of the html page (bottom)
    if(len(soup.find_all("div", attrs={"class" : "detail-characters-list clearfix"})) == 2):
        staff = soup.find_all("div", {"class": "detail-characters-list clearfix"})
        staff = staff[1].find_all("td")
        x = []
        y = []
        for i in range(1, len(staff), 2):
            x.append(staff[i].contents[1].contents[0])
            y.append(staff[i].find_all("small")[0].contents[0])
        animeStaff.append([list(i) for i in list(zip(x,y))])
        
    return [animeTitle, animeType, animeNumEpisode, releaseDate, endDate, animeNumMembers,
               animeScore, animeUsers, animeRank, animePopularity, animeDescription, animeRelated, 
               animeCharacters, animeVoices, animeStaff]



In [11]:
# create a for loop to update the path, in order to call the function several times
# and get all the articles
# It would also be a good idea to put this function in a python file and then import it,
# so that the code is cleaner
path = 'html_pages/page_0.html/article_0.html'
print(parse_function(path))

[['Fullmetal Alchemist: Brotherhood'], ['TV'], [64], datetime.datetime(2009, 4, 5, 0, 0), datetime.datetime(2010, 7, 4, 0, 0), [2675751], [9.16], [1622384], [1], [3], 'After a horrific alchemy experiment goes wrong in the Elric household, brothers Edward and Alphonse are left in a catastrophic new reality. Ignoring the alchemical principle banning human transmutation, the boys attempted to bring their recently deceased mother back to life. Instead, they suffered brutal personal loss: Alphonse\'s body disintegrated while Edward lost a leg and then sacrificed an arm to keep Alphonse\'s soul in the physical realm by binding it to a hulking suit of armor.\rThe brothers are rescued by their neighbor Pinako Rockbell and her granddaughter Winry. Known as a bio-mechanical engineering prodigy, Winry creates prosthetic limbs for Edward by utilizing "automail," a tough, versatile metal used in robots and combat armor. After years of training, the Elric brothers set off on a quest to restore their
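The loop mentioned in the cell above could start from a list of all the saved pages. The sketch below is one possible way to build it, assuming the `page_<k>.html/article_<i>.html` layout created in section 1.2; `article_paths` is a name introduced here. The demo runs on a throwaway directory with empty files.

```python
import os
import re
import tempfile


def article_paths(root):
    """Return every article_<i>.html found under root, ordered by i."""
    found = []
    for folder, _, files in os.walk(root):
        for name in files:
            match = re.fullmatch(r"article_(\d+)\.html", name)
            if match:
                # keep the number so that article_10 is sorted after article_2
                found.append((int(match.group(1)), os.path.join(folder, name)))
    return [path for _, path in sorted(found)]


with tempfile.TemporaryDirectory() as demo_root:
    for folder, name in [("page_0.html", "article_10.html"),
                         ("page_0.html", "article_2.html"),
                         ("page_1.html", "article_50.html"),
                         ("page_1.html", "notes.txt")]:
        os.makedirs(os.path.join(demo_root, folder), exist_ok=True)
        with open(os.path.join(demo_root, folder, name), "w") as handle:
            handle.write("")
    ordered_names = [os.path.basename(p) for p in article_paths(demo_root)]

print(ordered_names)
```

With the real tree, the loop would then be `for p in article_paths('html_pages'): parse_function(p)`.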

In [24]:
# problem with the dates in the tsv file:
# releaseDate and endDate are local variables of parse_function, so they are not visible here
# if you pass something like str(releaseDate) you get the following error
# NameError: name 'releaseDate' is not defined
# for now the two names are written as placeholder strings
with open('anime_'+str(0)+'.tsv', 'wt') as file:
    file.write(str(animeTitle)+'\t' + str(animeType)+'\t' + str(animeNumEpisode)+'\t' + 'releaseDate' +'\t' + 
               'endDate'+'\t' + str(animeNumMembers)+'\t' + str(animeScore)+'\t' + str(animeUsers)
              +'\t' + str(animeRank)+'\t' + str(animePopularity)+'\t' + str(animeDescription)+'\t' + 
               str(animeRelated)+'\t' + str(animeCharacters)+'\t' + str(animeVoices)+'\t' + str(animeStaff))
# no explicit close is needed: the with-block closes the file
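One possible way around the dates problem is to write the row with the `csv` module and to convert the datetime values to text first. The sketch below shows only the first five columns and writes to an in-memory buffer. `COLUMNS` and `write_tsv` are names introduced here, and the ISO date format is an assumption, not a requirement of the task.

```python
import csv
import datetime
import io

# Only the first five fields, to keep the example short.
COLUMNS = ["animeTitle", "animeType", "animeNumEpisode", "releaseDate", "endDate"]


def write_tsv(handle, row):
    """Write a header line and one anime row, tab separated."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    # datetime values are written as YYYY-MM-DD, everything else as it is
    writer.writerow([value.date().isoformat() if isinstance(value, datetime.datetime) else value
                     for value in row])


buffer = io.StringIO()
write_tsv(buffer, ["Fullmetal Alchemist: Brotherhood", "TV", 64,
                   datetime.datetime(2009, 4, 5), datetime.datetime(2010, 7, 4)])
tsv_lines = buffer.getvalue().splitlines()
print(tsv_lines)
```

The same function works with a real file opened with `open('anime_0.tsv', 'w', newline='', encoding='utf-8')`.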