# Introduction

**DISCLAIMER**: Some parts of the following code were inspired by last year's work on https://www.goodreads.com, for example https://github.com/GiorgiaSalvatori/ADM-HW3/blob/main/main.ipynb.

Our goal is to build a search engine over the "Top Anime Series" list on MyAnimeList (https://myanimelist.net). No dataset is provided, so we create our own.

In [2]:
from bs4 import BeautifulSoup
from tqdm import tqdm 
import pandas as pd
import numpy as np
import requests
import codecs
import csv
import os 

import time  # needed for the delay between page requests in section 1.2

#import re
#import lxml
#import datetime
#from datetime import datetime

## 1. Data Collection

We start from the list of animes to include in the corpus of documents that the search engine will work on. In particular, we focus on the list of top animes: https://myanimelist.net/topanime.php. The list is long and split across many pages. The first step is to retrieve the urls (and the names) of the animes listed in the first 400 pages; each page lists 50 animes, so we end up with at most 20000 unique anime urls.
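As a small illustration of the pagination arithmetic, the sketch below builds the address of every list page. It assumes that the `limit` query parameter is the offset of the first anime shown on a page; the names `list_page_urls`, `PAGE_SIZE` and `NUM_PAGES` are introduced here only for the example.

```python
BASE_URL = "https://myanimelist.net/topanime.php"
PAGE_SIZE = 50    # animes shown on one list page
NUM_PAGES = 400   # number of list pages we plan to visit


def list_page_urls(num_pages=NUM_PAGES, page_size=PAGE_SIZE):
    """Return the url of every list page, in order."""
    # assumption: `limit` is the offset of the first anime on the page
    return [f"{BASE_URL}?limit={page * page_size}" for page in range(num_pages)]


urls = list_page_urls()
print(urls[0])                 # first page, offset 0
print(urls[-1])                # last page, offset 19950
print(len(urls) * PAGE_SIZE)   # upper bound on the number of anime urls
```

The last value is only an upper bound: the final pages of the list may hold fewer than 50 entries.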

### 1.1 Get the list of animes

Here we extract the *urls* and the *names* of the animes in the list. We first work out the necessary steps on a single anime in the list and then repeat them for all the others.

After inspecting the HTML code of the site, we saw that all the information we need about a single anime is stored in `tr` blocks inside a single `table` that contains the list of all the top animes on the site. Both the name and the url of an anime come from the `a` tags inside each row: the name is the text of the tag, and the url is its `href` attribute.

Knowing these HTML details, we can use the `BeautifulSoup` library to do the web scraping.
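The selection rule can be tried offline on a tiny hand-written fragment. The sketch below deliberately uses the standard library's `html.parser` instead of `BeautifulSoup`, only so that it runs without extra dependencies; the HTML fragment, the example.org urls and the class name `AnchorCollector` are made up for illustration. It keeps the same filter used on the real pages: an `a` tag is retained when it carries an `id` and its text is longer than one character.

```python
from html.parser import HTMLParser

# made-up fragment that mimics the structure described above
SAMPLE = """
<table>
  <tr><td><a id="area1" href="https://example.org/anime/1/First">First</a></td></tr>
  <tr><td><a href="https://example.org/more">x</a></td></tr>
  <tr><td><a id="area2" href="https://example.org/anime/2/Second">Second</a></td></tr>
</table>
"""


class AnchorCollector(HTMLParser):
    """Collect (name, url) pairs from <a> tags that carry an id attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None  # href of the qualifying <a> that is currently open

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "id" in attributes:
            self._href = attributes.get("href")

    def handle_data(self, data):
        text = data.strip()
        if self._href is not None and len(text) > 1:
            self.pairs.append((text, self._href))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


collector = AnchorCollector()
collector.feed(SAMPLE)
pairs = collector.pairs
for name, url in pairs:
    print(name, url)
```

The second link is skipped because it has no `id`, which is how navigation links are told apart from title links in this sketch.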

In [None]:
# EXECUTE ONLY ONCE
# IF THE FILE links.txt EXISTS THEN DO NOT EXECUTE THIS CELL

# REMARK: the execution can take a few minutes

# open an empty .txt file to store the urls we need
# (the with statement closes the file even if a request fails)
with open("links.txt", "w") as links_text:
    # go through the list page by page and scrape the urls we need
    for page in tqdm(range(0, 400)):
        url = 'https://myanimelist.net/topanime.php?limit=' + str(page * 50)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for tag in soup.find_all('tr'):
            links = tag.find_all('a')
            for link in links:
                # keep only the links that carry an id and a text longer than one character
                if isinstance(link.get('id'), str) and len(link.contents[0]) > 1:
                    data = link.get('href')
                    # write each scraped url on its own line of the .txt file
                    links_text.write(data)
                    links_text.write("\n")

In [None]:
# EXECUTE IF AND ONLY IF THE links.txt FILE HAS BEEN CREATED

# Count the non-empty lines in the .txt file
line_count = 0
with open("links.txt", "r") as file:
    for line in file:
        if line != "\n":
            line_count += 1

print('There are {} lines in total in this file.'.format(line_count))
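Since the corpus is meant to hold unique urls, it can also be worth checking for repeated lines. The following is a minimal sketch on an in-memory sample; the helper name `summarize_links` and the example.org urls are hypothetical, and the real input would be the lines of `links.txt`.

```python
from collections import Counter


def summarize_links(lines):
    """Return (number of urls, number of distinct urls, sorted duplicated urls)."""
    urls = [line.strip() for line in lines if line.strip()]
    counts = Counter(urls)
    duplicates = sorted(url for url, n in counts.items() if n > 1)
    return len(urls), len(counts), duplicates


# made-up sample standing in for the content of links.txt
sample = [
    "https://example.org/anime/1\n",
    "https://example.org/anime/2\n",
    "\n",
    "https://example.org/anime/1\n",
]
total, unique, duplicates = summarize_links(sample)
print(total, unique, duplicates)
```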

### 1.2 Crawl animes

We proceed to:
- download the html corresponding to each of the collected urls;
- save its html in a file;
- organize the entire set of downloaded html pages into folders. Each folder will contain the htmls of the animes in page 1, page 2, ... of the list of animes.

To do so, we make extensive use of the `os` library to create directories, change the working directory, and build paths.
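The folder layout boils down to one integer division: the anime with index `i` belongs to list page `i // 50`. The sketch below shows that mapping with `pathlib` in a temporary directory, as an alternative to changing the working directory with `os`; the helper name `article_path` is an assumption of the example, while the folder and file names follow the pattern used in this notebook.

```python
import tempfile
from pathlib import Path

PAGE_SIZE = 50  # animes per list page


def article_path(root, index, page_size=PAGE_SIZE):
    """Path of the html file for the anime at position `index` in links.txt."""
    page = index // page_size
    return Path(root) / f"page_{page}.html" / f"article_{index}.html"


with tempfile.TemporaryDirectory() as root:
    for index in (0, 49, 50, 19099):
        path = article_path(root, index)
        # create the page folder on demand, without failing if it already exists
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("<html></html>", encoding="utf-8")
    # the folders also end in .html, so keep regular files only
    created = sorted(
        p.relative_to(root).as_posix()
        for p in Path(root).rglob("*.html")
        if p.is_file()
    )

for name in created:
    print(name)
```

Building every path from a fixed root means an interrupted run can be resumed without worrying about which directory is currently active.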

In [None]:
# EXECUTE ONLY ONCE
# IF THE DIRECTORY TREE ALREADY EXISTS THEN DO NOT EXECUTE THIS CELL

# REMARK: the execution can take quite some time (>25 hours)
# REMARK: there is an issue with high frequency site-connections that blocks most of the page requests 
# a time delay between page requests has been included to solve that issue


file = open("links.txt", "r")
lines = file.read().split('\n')
file.close()
# returns current working directory
base = os.getcwd()  
# initialize the number of the first directory to be created
t = 0
# we use the previously created list of lines to get the urls we need
scrapped_urls = lines[0:-1]
for i in range(len(scrapped_urls)):
    if(i%50==0):
        # create a new folder every 50 animes
        # remark: the newly created page folders are numbered from 0
        page_identifier = t
        # subdirectory 
        directory = f"page_{page_identifier}.html"
        # parent directory: html_pages is the root folder that section 1.3 reads from
        parent_dir = os.path.join(base, "html_pages")
        # path
        path = os.path.join(parent_dir, directory)
        # make directory
        os.makedirs(path)
        # check
        # print("Directory '%s' created" %directory)
        # change directory 
        os.chdir(path)
        t += 1

    # to avoid the issue with high frequency site-connections  
    time.sleep(5)   

    # get urls
    URL = scrapped_urls[i]
    page = requests.get(URL)
    
    # parsing
    soup_data = BeautifulSoup(page.content, "html.parser")
    
    # saving (utf-8 matches the encoding used when the files are read back)
    with open(f"article_{i}.html", "w", encoding="utf-8") as file:
        file.write(str(soup_data))
        
    # check
    # print(f"Article {i} has been created")
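The remark about blocked requests suggests waiting between requests and retrying when a page is refused. The sketch below illustrates one way to do that, under stated assumptions: `fetch` is a hypothetical callable returning a `(status_code, body)` pair and stands in for a real HTTP call, the waiting time grows linearly with the attempt number, and a fake server is used so the example runs offline without actually sleeping.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=5.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or the retries run out.

    `fetch` is any callable returning (status_code, body); it is an
    assumption of this sketch, not part of a real library.
    """
    for attempt in range(1, retries + 1):
        # pause before every request; the pause grows after each failed attempt
        sleep(delay * attempt)
        status, body = fetch(url)
        if status == 200:
            return body
    return None


# offline demonstration: a fake server that refuses the first two requests
calls = []


def fake_fetch(url):
    calls.append(url)
    if len(calls) >= 3:
        return 200, "<html>ok</html>"
    return 429, ""


waits = []  # record the requested pauses instead of really sleeping
body = fetch_with_retry(fake_fetch, "https://example.org/anime/1", sleep=waits.append)
print(body)
print(waits)
```

Passing the sleep function as a parameter keeps the logic testable; in a real crawl the default `time.sleep` would be used.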
    

### 1.3 Parse downloaded pages

At this point we have the html documents of all the animes of interest, and we can start to extract the following information for each anime:
- Anime Name (to save as `animeTitle`): String
- Anime Type (to save as `animeType`): String
- Number of episodes (to save as `animeNumEpisode`): Integer
- Release and End Dates of anime (to save as `releaseDate` and `endDate`): Convert both release and end date into datetime format.
- Number of members (to save as `animeNumMembers`): Integer
- Score (to save as `animeScore`): Float
- Users (to save as `animeUsers`): Integer
- Rank (to save as `animeRank`): Integer
- Popularity (to save as `animePopularity`): Integer
- Synopsis (to save as `animeDescription`): String
- Related Anime (to save as `animeRelated`): Extract all the related animes, but only keep unique values and those that have a hyperlink associated to them. List of strings.
- Characters (to save as `animeCharacters`): List of strings.
- Voices (to save as `animeVoices`): List of strings
- Staff (to save as `animeStaff`): Include the staff name and their responsibility/task in a list of lists.
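The date conversion is the least obvious item in the list, so here is a minimal sketch of it using only the standard library. It assumes that the "Aired" text has the form `Apr 5, 2009 to Jul 4, 2010`, with a single date or a `?` when the end is unknown; the helper name `parse_aired` is hypothetical, and `%b` expects English month abbreviations, which holds under the default locale.

```python
from datetime import datetime


def parse_aired(text):
    """Return (release, end) as datetime objects, with None where unknown."""

    def parse_one(value):
        try:
            return datetime.strptime(value, "%b %d, %Y")
        except ValueError:
            # unknown or partial dates such as '?' are reported as missing
            return None

    parts = [part.strip() for part in text.strip().split(" to ")]
    release = parse_one(parts[0])
    end = parse_one(parts[1]) if len(parts) > 1 else None
    return release, end


finished = parse_aired("  Apr 5, 2009 to Jul 4, 2010  ")
ongoing = parse_aired("Oct 4, 2006 to ?")
single = parse_aired("Jan 10, 2020")
print(finished)
print(ongoing)
print(single)
```

Returning `None` for a missing date keeps exactly one release value and one end value per anime, so the two lists stay aligned with the other fields.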

In [3]:
# EXECUTE ONLY ONCE (WITH A SUBSET OF PAGES)

# REMARK: the execution with 10 pages should take a couple of minutes, and 
# the execution with 100 pages should take more or less 10 minutes

# REMARK: page_382 in the html_pages directory has 22 articles
# this means that when the outer for loop is executed with j == 382
# the inner for loop should span only the values i == 0,1,2,...,20,21

animeTitle = []
animeType = []
animeNumEpisode = []
releaseDate = []
endDate = []
animeNumMembers = []
animeScore = []
animeUsers = []
animeRank = []
animePopularity = []
animeDescription = []
animeRelated = []
animeCharacters = []
animeVoices = []
animeStaff = []


for j in range(382):
    for i in range(50):
        # path depends on the directory tree that was previously created
        path = "html_pages/"+ "page_"+str(j)+".html/article_" + str(50*j+i) + ".html"

        # the with statement closes each html file once it has been parsed
        with codecs.open(path, "r", "utf-8") as file:
            soup = BeautifulSoup(file, 'html.parser')

        #TITLE
        try:
            animeTitle.append(soup.find_all('strong')[0].contents[0])
        except: 
            animeTitle.append('NA')
        
        divs = soup.find_all("div", {"class": "spaceit_pad"})
        for div in divs:
            spans = div.find_all("span")
            for span in spans:
                # label of the information row, e.g. 'Type:' or 'Aired:'
                label = span.get_text(strip=True)

                # TYPES
                if label == 'Type:':
                    try:
                        animeType.append(div.find_all('a')[0].contents[0])
                    except:
                        animeType.append('NA')

                # NUMBER OF EPISODES
                if label == 'Episodes:':
                    try: 
                        animeNumEpisode.append(int(div.contents[2]))
                    except:
                        animeNumEpisode.append(0)

                # DATES
                # assumption: the text after the label looks like 'Apr 5, 2009 to Jul 4, 2010',
                # or holds a single date when there is no end date
                if label == 'Aired:':
                    try:
                        aired = str(div.contents[2]).strip().split(' to ')
                        release = pd.to_datetime(aired[0]).to_pydatetime().strftime('%m/%d/%Y')
                        if len(aired) > 1:
                            end = pd.to_datetime(aired[1]).to_pydatetime().strftime('%m/%d/%Y')
                        else:
                            end = '-'
                    except:
                        release, end = 'NA', 'NA'
                    # append exactly one value per list so the two stay aligned
                    releaseDate.append(release)
                    endDate.append(end)


        divs = soup.find_all("div", {"class": "stats-block po-r clearfix"})
        for div in divs:

            # MEMBERS
            try:
                members = div.find_all("span", {"class": "numbers members"})
                animeNumMembers.append(int(members[0].contents[1].contents[0].replace(',', '')))
            except: 
                animeNumMembers.append(0)

            # USERS
            users = div.find_all("div", {"class": "fl-l score"})
            # the [:-6] slice drops the last 6 characters of the 'data-user' attribute
            # (the trailing word 'users' and the space before it)
            # we also remove the comma used as thousands separator
            try:
                animeUsers.append(int(users[0]['data-user'][:-6].replace(',', '')))
            except:
                animeUsers.append(0)

            # SCORE
            rating=soup.find(name="div",attrs={"class":"fl-l score"})
            try:        
                animeScore.append(float(rating.text.strip()))
            except:
                animeScore.append(None)

            # RANK
            try:
                rank = div.find_all("span", {"class": "numbers ranked"})
                animeRank.append(int(rank[0].contents[1].contents[0][1:]))
            except:
                animeRank.append(None)

            # POPULARITY
            try:   
                popularity = div.find_all("span", {"class": "numbers popularity"})
                animePopularity.append(int(popularity[0].contents[1].contents[0][1:]))
            except:
                animePopularity.append(None)

        # DESCRIPTION
        try:
            description = soup.find_all("p", {"itemprop": "description"})
            for br in description[0].find_all("br"):
                br.replace_with("\n")
            animeDescription.append(description[0].contents)
        except: 
            animeDescription.append('NA')
        
        
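        # RELATED 
        try:
            related = soup.find_all("table", {"class": "anime_detail_related_anime"})
            relation_types = []
            relation_names = []
            for table in related:
                cells = table.find_all("td")
                # cells alternate between the relation type and the related titles
                for k in range(0, len(cells), 2):
                    relation_types.append(cells[k].contents[0])
                    try:
                        # keep only titles that have a hyperlink
                        relation_names.append(cells[k+1].find_all("a")[0].contents[0])
                    except:
                        relation_names.append('NA')
            # append once per anime, even when no related table exists,
            # so that this list stays aligned with the others
            pairs = dict(zip(relation_types, relation_names))
            animeRelated.append([f'{kind} {name}' for kind, name in pairs.items()])
        except: 
            animeRelated.append('NA')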

        # CHARACTERS
        try:
            characters = soup.find_all("div", {"class": "detail-characters-list clearfix"})
            chars = characters[0].find_all("h3", {"class": "h3_characters_voice_actors"})
            x = []
            for i in chars:
                x.append(i.contents[0].contents[0])
            animeCharacters.append(x)
        except:
            animeCharacters.append("NA")

        # VOICES
        try:
            voices = characters[0].find_all("td", {"class": "va-t ar pl4 pr4"})
            y = []
            for i in voices:
                y.append(i.contents[1].contents[0])
            animeVoices.append(y)
        except:
            animeVoices.append("NA")

        # STAFF
        try:
            staff = soup.find_all("div", {"class": "detail-characters-list clearfix"})
            staff = staff[1].find_all("td")
            x = []
            y = []
            for i in range(1, len(staff), 2):
                x.append(staff[i].contents[1].contents[0])
                y.append(staff[i].find_all("small")[0].contents[0])
            animeStaff.append([list(i) for i in list(zip(x,y))])
        except:
            animeStaff.append("NA")



In [4]:
dataset = pd.DataFrame(
    [animeTitle, animeType, animeNumEpisode, animeNumMembers, 
     animeScore, animeUsers, animeRank, animePopularity, animeDescription, animeRelated, 
     animeCharacters, animeVoices, animeStaff], 
    index=['Title', 'Type', 'Episodes', 'Members', 'Score', 
           'Users', 'Rank', 'Popularity', 'Description', 'Related', 'Characters', 'Voices', 'Staff']).T

In [5]:
#example
dataset.head()

| | Title | Type | Episodes | Members | Score | Users | Rank | Popularity | Description | Related | Characters | Voices | Staff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Fullmetal Alchemist: Brotherhood | TV | 64 | 2675751 | 9.16 | 1622384 | 1 | 3 | [After a horrific alchemy experiment goes wron... | [Adaptation: Fullmetal Alchemist, Alternative ... | [Elric, Edward, Elric, Alphonse, Mustang, Roy,... | [Park, Romi, Kugimiya, Rie, Miki, Shinichiro, ... | [[Cook, Justin, Producer], [Yonai, Noritomo, P... |
| 1 | Gintama° | TV | 51 | 483807 | 9.09 | 169476 | 2 | 337 | [Gintoki, Shinpachi, and Kagura return as the ... | [Adaptation: Gintama, Prequel: Gintama Movie 2... | [Sakata, Gintoki, Kagura, Shimura, Shinpachi, ... | [Sugita, Tomokazu, Kugimiya, Rie, Sakaguchi, D... | [[Fujita, Youichi, Director, Storyboard, Plann... |
| 2 | Shingeki no Kyojin Season 3 Part 2 | TV | 10 | 1596039 | 9.09 | 1087519 | 3 | 33 | [Seeking to restore humanity's diminishing hop... | [Adaptation: Shingeki no Kyojin, Prequel: Shin... | [Levi, Yeager, Eren, Ackerman, Mikasa, Arlert,... | [Kamiya, Hiroshi, Kaji, Yuki, Ishikawa, Yui, I... | [[Yabuta, Shuuhei, Producer], [Wada, Jouji, Pr... |
| 3 | Steins;Gate | TV | 24 | 2090910 | 9.09 | 1109700 | 4 | 11 | [The self-proclaimed mad scientist Rintarou Ok... | [Adaptation: Steins;Gate, Alternative setting:... | [Okabe, Rintarou, Makise, Kurisu, Shiina, Mayu... | [Miyano, Mamoru, Imai, Asami, Hanazawa, Kana, ... | [[Iwasa, Gaku, Producer], [Yasuda, Takeshi, Pr... |
| 4 | Fruits Basket: The Final | TV | 13 | 275214 | 9.07 | 113310 | 5 | 651 | [Hundreds of years ago, the Chinese Zodiac spi... | [Adaptation: Fruits Basket, Prequel: Fruits Ba... | [Souma, Kyou, Honda, Tooru, Souma, Yuki, Souma... | [Uchida, Yuuma, Iwami, Manaka, Shimazaki, Nobu... | [[Ibata, Yoshihide, Director], [Aketagawa, Jin... |


In [8]:
# EXECUTE ONLY ONCE
# IF THE DIRECTORY TREE ALREADY EXISTS THEN DO NOT EXECUTE THIS CELL

# REMARK: the execution should take a few seconds

# create a directory tree for .tsv files
os.mkdir('tsv_files')
for j in range(382):
    os.mkdir('tsv_files/'+'files_'+str(j)+ '.tsv/')

In [10]:
# EXECUTE ONLY ONCE WITH THE SAME SUBSET OF PAGES USED IN THE PREVIOUS CELLS
# IF THE .tsv FILES ALREADY EXIST THEN DO NOT EXECUTE THIS CELL

# REMARK: the execution should take a few seconds

# create a .tsv file for each anime and put it in the corresponding directory 
for j in range(382):
    for i in range(50):
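        # newline='' lets the csv module control line endings; utf-8 keeps non-ASCII titles intact
        with open('tsv_files/'+'files_'+str(j)+ '.tsv/'+'anime_'+str(50*j+i)+'.tsv', 'w', newline='', encoding='utf-8') as file:
            tsv_writer = csv.writer(file, delimiter='\t')
            # header
            tsv_writer.writerow(list(dataset.columns))
            # values of the corresponding row of the dataset
            tsv_writer.writerow(list(dataset.iloc[i+50*j]))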
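One detail to keep in mind when these files are read back: the `csv` module stores every value as text, so numbers and lists come back as strings. The sketch below writes a single made-up record to a temporary `.tsv` file and reads it again; the record and its three columns are invented for the example, and using `ast.literal_eval` to rebuild a list is one possible choice rather than something the notebook prescribes.

```python
import ast
import csv
import tempfile
from pathlib import Path

# made-up record with a subset of the real columns
header = ["Title", "Episodes", "Characters"]
row = ["Example Anime", 12, ["Doe, Jane", "Roe, Richard"]]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "anime_0.tsv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        writer.writerow(row)
    with open(path, newline="", encoding="utf-8") as f:
        # the first line is used as the header, the second is the record
        record = next(csv.DictReader(f, delimiter="\t"))

episodes = int(record["Episodes"])                       # text back to integer
characters = ast.literal_eval(record["Characters"])      # text back to list
print(record["Title"], episodes, characters)
```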