# Introduction

**DISCLAIMER**: Some parts of the following code was inspired by looking at the work that was done last year about https://www.goodreads.com, for example by https://github.com/GiorgiaSalvatori/ADM-HW3/blob/main/main.ipynb. Also the following post was useful https://towardsdatascience.com/how-to-use-selenium-to-web-scrape-with-example-80f9b23a843a.

Our goal is to build a search engine over the "Top Anime Series" from the list of MyAnimeList. There is no provided dataset, so we create our own.

In [1]:
# libraries to install
# !pip install selenium

In [2]:
from bs4 import BeautifulSoup
from tqdm import tqdm # useful for progress bars
from selenium import webdriver
from datetime import datetime
import pandas as pd
import numpy as np
import requests
import codecs
import lxml
import time
import re
import os 

## 1. Data Collection

We start from the list of animes to include in the corpus of documents the search engine will work on. In particular, we focus on the top animes ever list: https://myanimelist.net/topanime.php.  The list is long and splitted in many pages. The first thing we will do is to retrieve the urls (and the names) of the animes listed in the first 400 pages (each page has 50 animes so you will end up with 20000 unique anime urls).

### 1.1 Get the list of animes

Here we will extract the *urls* and the *names* of the animes in the list. At first we can have an idea of the necessary steps to extract the informations we want by working on a single anime in the list and then proceed by iteration. 

After inspecting the HTML code of the site, we saw that the all the informations we need from a single anime are stored in  `tr` blocks inside a single `table` that contains the list of all the top animes in the site. To get the  name of an anime in the list we should work on `a` tags, whereas to get the url we need to work on `td` tags (leveraging the property `href`). 

Knowing these HTML details we can use the `selenium` library to do the web-scrapping.

In [3]:
from selenium.webdriver.chrome.service import Service

In [4]:
s = Service('/Users/dany/Desktop/adm-hw3/chromedriver')

In [5]:
# selenium with Chrome
driver = webdriver.Chrome(service=s)

In [6]:
# create a dataframe with links of each anime
df = pd.DataFrame(columns = ['Href'])

In [7]:
# go page by page and and store links in a list
anime_list = []

for page in tqdm(range(0, 400)):
    url = 'https://myanimelist.net/topanime.php?limit=' + str(page * 50)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for tag in soup.find_all('tr'):
        links = tag.find_all('a')
        for link in links:        
            if type(link.get('id')) == str and len(link.contents[0]) > 1:
                anime_list.append((link.contents[0], link.get('href')) )

100%|██████████| 400/400 [07:12<00:00,  1.08s/it]


In [8]:
# total number of animes
print(len(anime_list))

19125


In [84]:
# check for duplicates: ok no duplicates
df['Href'].nunique()

19124

In [81]:
# assign list to dataframe
df['Href'] = [item[1] for item in anime_list]

The following code shows the first informations we have acquired.

In [85]:
df['Href'].head()

0    https://myanimelist.net/anime/5114/Fullmetal_A...
1         https://myanimelist.net/anime/28977/Gintama°
2    https://myanimelist.net/anime/38524/Shingeki_n...
3       https://myanimelist.net/anime/9253/Steins_Gate
4    https://myanimelist.net/anime/42938/Fruits_Bas...
Name: Href, dtype: object

We save the dataframe into a *csv* file without header and comma separator. This is equivalent to a *txt* file, with not only the urls, but also the names of the animes that may be of help in some data processing stages. 

In [86]:
df.to_csv('urls.csv', sep = ' ', header=False)

We could also create a dictionary as this is useful in some circumnstances.

In [87]:
#keys
name = []   
#values
url = []    

for item in anime_list:
    name.append(item[0])
    url.append(item[1])
    
D = dict(zip(name, url))

# display first 5 urls
print(list(D.values())[0:5])

['https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood', 'https://myanimelist.net/anime/28977/Gintama°', 'https://myanimelist.net/anime/38524/Shingeki_no_Kyojin_Season_3_Part_2', 'https://myanimelist.net/anime/9253/Steins_Gate', 'https://myanimelist.net/anime/42938/Fruits_Basket__The_Final']


## 1.2 Crawl animes

We procede to:
- download the html corresponding to each of the collected urls;
- save its html in a file;
- organize the entire set of downloaded html pages into folders. Each folder will contain the htmls of the animes in page 1, page 2, ... of the list of animes.

To do so we extensively use the `os` library to create directories, changing paths, etc...

## 1.3 Parse downloaded pages

At this point we have all the html documents about the animes of interest and we can start to extract the animes informations:
- Anime Name (to save as `animeTitle`): String
-Anime Type (to save as `animeType`): String
-Number of episode (to save as `animeNumEpisode`): Integer
-Release and End Dates of anime (to save as `releaseDate` and `endDate`): Convert both release and end date into datetime format.
-Number of members (to save as `animeNumMembers`): Integer
-Score (to save as `animeScore`): Float
-Users (to save as `animeUsers`): Integer
-Rank (to save as `animeRank`): Integer
-Popularity (to save as `animePopularity`): Integer
-Synopsis (to save as `animeDescription`): String
-Related Anime (to save as `animeRelated`): Extract all the related animes, but only keep unique values and those that have a hyperlink associated to them. List of strings.
-Characters (to save as `animeCharacters`): List of strings.
-Voices (to save as `animeVoices`): List of strings
-Staff (to save as `animeStaff`): Include the staff name and their responsibility/task in a list of lists.

In [91]:
animeTitle = []
animeType = []
animeNumEpisode = []
releaseDate = []
endDate = []
animeNumMembers = []
animeScore = []
animeUsers = []
animeRank = []
animePopularity = []
animeDescription = []
animeRelated = []
animeCharacters = []
animeVoices = []
animeStaff = []


# I still don't know how to integrate directory paths
# The following code should work only with files
# For the sake of simplicity the following code works for the first 5 files
# article_0, ...., article_5
# Attention: There are some bugs 
for i in range(5):
    # anime titles as test case above
    # a similar code can be used for the other informations
    path = "article_" + str(i) + ".html"
    file = codecs.open(path, "r", "utf-8")
    soup = BeautifulSoup(file, 'html.parser')
    animeTitle.append(soup.find_all('strong')[0].contents[0])
    
    # left of the html page
    # e.g. for article_0 under Alternative Titles we have English: Fullmetal Alchemist: Brotherhood 
    # that corresponds to 
    # <div class="spaceit_pad"> Fullmetal Alchemist: Brotherhood <span class="dark_text">English:</span></div>
    divs = soup.find_all("div", {"class": "spaceit_pad"})
    for div in divs:
        spans = div.find_all("span")
        for span in spans:
            # TYPES
            if span.contents[0] == 'Type:':
                animeType.append(div.find_all('a')[0].contents[0])
            # NUMBER OF EPISODES
            if span.contents[0] == 'Episodes:':
                animeNumEpisode.append(int(div.contents[2]))
            # DATES
            if span.contents[0] == 'Aired:':
                if len(div.contents[2]) > 21:
                    releaseDate.append(pd.to_datetime(div.contents[2][1:16]))
                    endDate.append(pd.to_datetime(div.contents[2][18:-3]))
                else:
                    releaseDate.append(pd.to_datetime(div.contents[2][4:-4]))
                    endDate.append('-')
            #if span.contents[0] == 'Aired:':
                #dates = divs[i].contents[0].strip().split("to")
                #releaseDate = datetime.strptime(dates[0].strip(), '%b %d, %Y').date()
                #endDate = datetime.strptime(dates[1].strip(), '%b %d, %Y').date()
                    
    # center of the html page
    # similar to what was done before
    divs = soup.find_all("div", {"class": "stats-block po-r clearfix"})
    for div in divs:
        
        # MEMBERS
        members = div.find_all("span", {"class": "numbers members"})
        animeNumMembers.append(int(members[0].contents[1].contents[0].replace(',', '')))
        
        
        # SCORE
        rating=soup.find(name="div",attrs={"class":"fl-l score"})
        animeScore.append(float(rating.text.strip()))
    
        # USERS
        users = div.find_all("div", {"class": "fl-l score"})
        # here we we eliminate the word 'user '   
        # that is why there is the [:-6] part
        # we also replace the comma divisor
        animeUsers.append(int(users[0]['data-user'][:-6].replace(',', '')))
        
        
        # RANK
        rank = div.find_all("span", {"class": "numbers ranked"})
        animeRank.append(int(rank[0].contents[1].contents[0][1:]))
       
    
        # POPULARITY
        popularity = div.find_all("span", {"class": "numbers popularity"})
        animePopularity.append(int(popularity[0].contents[1].contents[0][1:]))
    
    
    
    # DESCRIPTION
    # center of the html page
    animeDescription = soup.find_all("p", itemprop = "description")[0].text.strip().replace('\n', '').replace('  ', '')
     
    
    # RELATED 
    # center of the html page
    x = []
    y = []
    related = soup.find_all("table", {"class": "anime_detail_related_anime"})
    for tr in related:
        td = tr.find_all("td")
        for i in range(0, len(td), 2):
            x.append(td[i].contents[0])
            t = td[i+1].find_all("a")
            y.append(t[0].contents[0])
        animeRelated.append('\n'.join([f'{x} {y}' for x, y in dict(zip(x, y)).items()]).split('\n'))
    
    
    # CHARACTERS
    # center of the html page (bottom
    characters = soup.find_all("div", {"class": "detail-characters-list clearfix"})
    chars = characters[0].find_all("h3", {"class": "h3_characters_voice_actors"})
    x = []
    for i in chars:
        x.append(i.contents[0].contents[0])
    animeCharacters.append(x)
    
    
    # VOICES
    # center of the html page (bottom)
    voices = characters[0].find_all("td", {"class": "va-t ar pl4 pr4"})
    y = []
    for i in voices:
        y.append(i.contents[1].contents[0])
    animeVoices.append(y)

    
    # STAFF
    # center of the html page (bottom)
    staff = soup.find_all("div", {"class": "detail-characters-list clearfix"})
    staff = staff[1].find_all("td")
    x = []
    y = []
    for i in range(1, len(staff), 2):
        x.append(staff[i].contents[1].contents[0])
        y.append(staff[i].find_all("small")[0].contents[0])
    animeStaff.append([list(i) for i in list(zip(x,y))])  


In [92]:
dataset = pd.DataFrame(
    [animeTitle, animeType, animeNumEpisode, releaseDate, endDate, animeNumMembers, 
     animeScore, animeUsers, animeRank, animePopularity, animeDescription, animeRelated, 
     animeCharacters, animeVoices, animeStaff], 
    index=['Title', 'Type', 'Episodes','Release date', 'End date', 'Members', 'Score', 
           'Users', 'Rank', 'Popularity', 'Description', 'Related', 'Characters', 'Voices', 'Staff']).T
dataset.head()

Unnamed: 0,Title,Type,Episodes,Release date,End date,Members,Score,Users,Rank,Popularity,Description,Related,Characters,Voices,Staff
0,Fullmetal Alchemist: Brotherhood,TV,64,2009-04-05,2010-07-04,2675751,9.16,1622384,1,3,H,"[Adaptation: Fullmetal Alchemist, Alternative ...","[Elric, Edward, Elric, Alphonse, Mustang, Roy,...","[Park, Romi, Kugimiya, Rie, Miki, Shinichiro, ...","[[Cook, Justin, Producer], [Yonai, Noritomo, P..."
1,Gintama°,TV,51,2015-04-08,2016-03-30,483807,9.09,169476,2,337,u,"[Adaptation: Gintama, Prequel: Gintama Movie 2...","[Sakata, Gintoki, Kagura, Shimura, Shinpachi, ...","[Sugita, Tomokazu, Kugimiya, Rie, Sakaguchi, D...","[[Fujita, Youichi, Director, Storyboard, Plann..."
2,Shingeki no Kyojin Season 3 Part 2,TV,10,2019-04-29,2019-07-01,1596039,9.09,1087519,3,33,n,"[Adaptation: Shingeki no Kyojin, Prequel: Shin...","[Levi, Yeager, Eren, Ackerman, Mikasa, Arlert,...","[Kamiya, Hiroshi, Kaji, Yuki, Ishikawa, Yui, I...","[[Yabuta, Shuuhei, Producer], [Wada, Jouji, Pr..."
3,Steins;Gate,TV,24,2011-04-06,2011-09-14,2090910,9.09,1109700,4,11,d,"[Adaptation: Steins;Gate, Alternative setting:...","[Okabe, Rintarou, Makise, Kurisu, Shiina, Mayu...","[Miyano, Mamoru, Imai, Asami, Hanazawa, Kana, ...","[[Iwasa, Gaku, Producer], [Yasuda, Takeshi, Pr..."
4,Fruits Basket: The Final,TV,13,2021-04-06,2021-06-29,275214,9.07,113310,5,651,r,"[Adaptation: Fruits Basket, Prequel: Fruits Ba...","[Souma, Kyou, Honda, Tooru, Souma, Yuki, Souma...","[Uchida, Yuuma, Iwami, Manaka, Shimazaki, Nobu...","[[Ibata, Yoshihide, Director], [Aketagawa, Jin..."


In [93]:
# for each row create a tsv file
for i in range(5):
    with open('anime_'+str(i)+'.tsv', 'wt') as file:
        tsv_writer = csv.writer(file, delimiter='\t')
        tsv_writer.writerow(x for x in df.iloc[i]) #the value under each columns