This notebook analyzes the anime listed on the Top Anime page of MyAnimeList.

# Importing useful packages

In [82]:
import requests
from bs4 import BeautifulSoup
from collections import defaultdict

Here we send a request to fetch the content of a specific webpage: the main page of the Top Anime list.

In [5]:
TopAnime = requests.get('https://myanimelist.net/topanime.php')

Now we check if we have successfully received the content of the desired webpage or not. 

In [7]:
TopAnime

<Response [200]>

The response status code is '200', which means that we have successfully received the desired page.
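The same check can be done in code rather than by eye. The sketch below relies on `Response.raise_for_status` and `Response.ok` from `requests`; the helper name `body_or_raise` is hypothetical, and the hand-built `Response` object is only an offline stand-in for a real reply.

```python
import requests


def body_or_raise(response):
    """Return the response body, raising requests.HTTPError on 4xx/5xx codes."""
    response.raise_for_status()
    return response.content


# Offline stand-in for a failed request (assumption: a 404 reply).
demo = requests.Response()
demo.status_code = 404
demo.url = 'https://example.com/missing'

try:
    body_or_raise(demo)
    outcome = 'ok'
except requests.HTTPError:
    outcome = 'http error'

print(demo.ok, outcome)
```

With a real request, `body_or_raise(requests.get(url, timeout=10))` would stop the notebook early instead of parsing an error page.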

Then we take a brief look at the content of this page. 

In [8]:
#TopAnime.content

b'\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n<html lang="en">\n<head>\n    \n<link rel="preconnect" href="//fonts.gstatic.com/" crossorigin="anonymous" />\n<link rel="preconnect" href="//fonts.googleapis.com/" crossorigin="anonymous" />\n<link rel="preconnect" href="//tags-cdn.deployads.com/" crossorigin="anonymous" />\n<link rel="preconnect" href="//www.googletagservices.com/" crossorigin="anonymous" />\n<link rel="preconnect" href="//www.googletagmanager.com/" crossorigin="anonymous"/>\n<link rel="preconnect" href="//apis.google.com/" crossorigin="anonymous"/>\n<link rel="preconnect" href="//pixel-sync.sitescout.com/" crossorigin="anonymous"/>\n<link rel="preconnect" href="//pixel.tapad.com/" crossorigin="anonymous"/>\n<link rel="preconnect" href="//c.deployads.com/" crossorigin="anonymous"/>\n<link rel="preconnect" href="//tpc.googlesyndication.com/" crossorigin="anonymous"/>\n<link rel="preconn

As we can see in the output, we have the HTML code of the webpage. Now we should parse this HTML to extract the URL associated with each anime. To do so, we will use the **BeautifulSoup** library, which is designed to parse HTML. 

In [11]:
TopAnimeSoup = BeautifulSoup(TopAnime.content, 'html.parser')

Now we look at the parsed HTML of the webpage in a nicely indented format using BeautifulSoup. 

In [12]:
#print(TopAnimeSoup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en">
 <head>
  <link crossorigin="anonymous" href="//fonts.gstatic.com/" rel="preconnect"/>
  <link crossorigin="anonymous" href="//fonts.googleapis.com/" rel="preconnect"/>
  <link crossorigin="anonymous" href="//tags-cdn.deployads.com/" rel="preconnect"/>
  <link crossorigin="anonymous" href="//www.googletagservices.com/" rel="preconnect"/>
  <link crossorigin="anonymous" href="//www.googletagmanager.com/" rel="preconnect"/>
  <link crossorigin="anonymous" href="//apis.google.com/" rel="preconnect"/>
  <link crossorigin="anonymous" href="//pixel-sync.sitescout.com/" rel="preconnect"/>
  <link crossorigin="anonymous" href="//pixel.tapad.com/" rel="preconnect"/>
  <link crossorigin="anonymous" href="//c.deployads.com/" rel="preconnect"/>
  <link crossorigin="anonymous" href="//tpc.googlesyndication.com/" rel="preconnect"/>
  <link crossorigin="anonym

# Extracting the URLs associated with each anime. 

### Steps to extract the URL for just one anime. 

Here we go through the steps required to extract the URL and the name of just the first anime in the list. We can then get the information for the rest of the list by iterating over the remaining rows. 

After checking the HTML code, we saw that the information for each anime is stored in a table whose class is "top-ranking-table", with one tr of this table per anime. 

Let's take a look at how many tables we have in the webpage. 

In [17]:
len(list(TopAnimeSoup.find_all('table')))

1

Since the webpage contains only one table, every tr in the webpage belongs to this table. 

Here we will take a look at how many tr tags we have in the webpage. 

In [77]:
len(list(TopAnimeSoup.find_all('tr')))

51

We know that each page lists 50 animes, so why do we have 51 tr tags here? <br/>
Because the first row holds the table's column names, and the remaining rows store the information for each anime. 

So, to get the rows that contain anime information, we go through all the tr tags in the webpage except the first one, which contains the column names of the table. 

In [68]:
Rows = list(TopAnimeSoup.find_all('tr'))[1:]
len(Rows)

50
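Slicing every tr on the page works here because the page has a single table. If a page had more than one table, the rows could be scoped to the ranking table first. A minimal sketch on made-up HTML (the sample markup and variable names are illustrative; only the class name "top-ranking-table" comes from the page described above):

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page: one ranking table plus an unrelated table.
SAMPLE_HTML = """
<table class="top-ranking-table">
  <tr><td>Rank</td><td>Title</td></tr>
  <tr><td>1</td><td>First title</td></tr>
  <tr><td>2</td><td>Second title</td></tr>
</table>
<table class="other"><tr><td>unrelated</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Restrict the search to the ranking table, then drop its header row.
table = soup.find('table', class_='top-ranking-table')
all_rows = table.find_all('tr')
data_rows = all_rows[1:]

print(len(soup.find_all('tr')), len(all_rows), len(data_rows))
```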

As we can see above, we now have all the rows that contain anime information. Each page lists 50 animes. 

Next we need the name and the URL of each anime. We found that this information is in the 'a' tag whose class is "hoverinfo_trigger fl-l ml12 mr8", located in the second 'td' tag of each 'tr'.  

The content of this tag for the first anime is shown below. 

We get all the 'td' tags of the first data row, which is the second 'tr' on the page because the first 'tr' contains the column names. 

In [105]:
Tds = Rows[0].find_all('td')

Then we will go to the second 'td' tag's information. The first one contains just the ranking number. 

In [106]:
SecondTD = Tds[1]

Then inside this 'td' tag we look for the 'a' tag whose class is "hoverinfo_trigger fl-l ml12 mr8".

In [107]:
# Find the <a> tag whose class attribute is exactly this string
TagA = SecondTD.find('a', class_="hoverinfo_trigger fl-l ml12 mr8")
TagA

<a class="hoverinfo_trigger fl-l ml12 mr8" href="https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood" id="#area5114" rel="#info5114">
<img alt="Anime: Fullmetal Alchemist: Brotherhood" border="0" class="lazyload" data-src="https://cdn.myanimelist.net/r/50x70/images/anime/1223/96541.jpg?s=faffcb677a5eacd17bf761edd78bfb3f" data-srcset="https://cdn.myanimelist.net/r/50x70/images/anime/1223/96541.jpg?s=faffcb677a5eacd17bf761edd78bfb3f 1x, https://cdn.myanimelist.net/r/100x140/images/anime/1223/96541.jpg?s=0c3b98cf4905422c00981025cd20d271 2x" height="70" width="50">
</img></a>
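BeautifulSoup offers more than one way to match a tag by its class. The sketch below, on a made-up table cell, compares three of them: the exact class string, a single class name, and a CSS selector via `select_one`. The sample markup and variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up cell that mimics the structure described above.
SAMPLE_CELL = """
<td>
  <div class="wrapper"><span>decoration</span></div>
  <a class="hoverinfo_trigger fl-l ml12 mr8" href="https://example.com/anime/1/Sample">
    <img alt="Anime: Sample"/>
  </a>
</td>
"""

cell = BeautifulSoup(SAMPLE_CELL, 'html.parser').find('td')

# 1. Exact value of the class attribute (order of the class names matters).
by_exact = cell.find('a', class_='hoverinfo_trigger fl-l ml12 mr8')
# 2. Any one of the tag's classes.
by_single = cell.find('a', class_='hoverinfo_trigger')
# 3. CSS selector requiring two of the classes, in any order.
by_css = cell.select_one('a.hoverinfo_trigger.fl-l')

print(by_exact['href'])
print(by_exact is by_single, by_single is by_css)
```

The single-class and CSS forms tend to survive small changes in the page's styling classes better than the exact string does.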

The URL of an anime is the value of the 'href' attribute of this tag. Let's take a look at it. 

In [108]:
URL = TagA['href']
URL

'https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood'

To extract the name of the anime we have two options: 1. Split the 'href' value and take its last segment. 2. Read the 'alt' attribute of the 'img' tag inside the 'a' tag. Here we go with the second approach. 

In [112]:
Image = TagA.find('img')
AnimeName = Image['alt'].replace('Anime: ', '')
AnimeName

'Fullmetal Alchemist: Brotherhood'
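For comparison, the first option (splitting the 'href') might look like the sketch below. The helper name `name_from_url` is hypothetical. Note that the URL slug replaces punctuation with underscores, so the original title cannot be fully recovered this way, which is one reason to prefer the 'alt' attribute.

```python
from urllib.parse import unquote, urlsplit


def name_from_url(url):
    """Approximate an anime name from the last path segment of its URL."""
    slug = urlsplit(url).path.rstrip('/').split('/')[-1]
    # Underscores stand in for spaces and for dropped punctuation alike.
    return unquote(slug).replace('_', ' ')


url = 'https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood'
approx_name = name_from_url(url)
print(repr(approx_name))
```

Here the colon in "Fullmetal Alchemist: Brotherhood" is lost and a double space appears in its place.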

## Going through all animes in the first page. 

Now we want to get the name and URL of every anime in this specific page. 

In [113]:
MyDict = defaultdict(str)
for Row in Rows: 
    TDs = Row.find_all('td')
    TagA = TDs[1].find('a', class_="hoverinfo_trigger fl-l ml12 mr8")
    AnimeName, URL = TagA.find('img')['alt'].replace('Anime: ', ''), TagA['href']
    MyDict[AnimeName] = URL

Now we will check the information for five animes. 

In [115]:
for anime in list(MyDict.keys())[:5]:
    print('Name of anime: ' + anime+'\nURL: ' + MyDict[anime], end = '\n\n')

Name of anime: Fullmetal Alchemist: Brotherhood
URL: https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood

Name of anime: Gintama°
URL: https://myanimelist.net/anime/28977/Gintama°

Name of anime: Shingeki no Kyojin Season 3 Part 2
URL: https://myanimelist.net/anime/38524/Shingeki_no_Kyojin_Season_3_Part_2

Name of anime: Steins;Gate
URL: https://myanimelist.net/anime/9253/Steins_Gate

Name of anime: Fruits Basket: The Final
URL: https://myanimelist.net/anime/42938/Fruits_Basket__The_Final



Here we check that the dictionary contains the URLs of all 50 animes. 

In [118]:
print('Number of animes ', len(MyDict))

Number of animes  50


# Extracting animes' information from 400 pages. 

To make the implementation more readable, we write a function that receives the URL of the webpage we want to scrape and returns the animes' names and their associated URLs in that webpage. 

In [164]:
def GetAnimeInfo(webpage):
    Request = requests.get(webpage)
    TopAnimeSoup = BeautifulSoup(Request.content, 'html.parser')
    Rows = TopAnimeSoup.find_all('tr')
    MyDict = defaultdict(str)
    for Row in Rows[1:]: 
        TDs = Row.find_all('td')
        TagA = TDs[1].find('a', class_="hoverinfo_trigger fl-l ml12 mr8")
        AnimeName, URL = TagA.find('img')['alt'].replace('Anime: ', ''), TagA['href']
        MyDict[AnimeName] = URL
    return MyDict
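One possible refinement is to separate parsing from downloading, so the parsing logic can be tried on a small HTML string without any network access. The sketch below is an illustration under that assumption, not a replacement for the function above: `parse_anime_rows` and the sample markup are hypothetical, and rows that lack the expected link or image are skipped rather than raising an error.

```python
from bs4 import BeautifulSoup


def parse_anime_rows(html):
    """Map anime names to URLs for every data row found in the given HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for row in soup.find_all('tr')[1:]:  # skip the header row
        cells = row.find_all('td')
        if len(cells) < 2:
            continue
        link = cells[1].find('a', href=True)
        image = link.find('img', alt=True) if link else None
        if image is None:
            continue  # row does not have the expected structure
        name = image['alt'].replace('Anime: ', '', 1)
        results[name] = link['href']
    return results


# Made-up page fragment: a header row, two valid rows and one malformed row.
SAMPLE_PAGE = """
<table>
<tr><td>Rank</td><td>Title</td></tr>
<tr><td>1</td><td><a href="https://example.com/anime/1/Alpha"><img alt="Anime: Alpha"/></a></td></tr>
<tr><td>2</td><td>no link here</td></tr>
<tr><td>3</td><td><a href="https://example.com/anime/3/Beta"><img alt="Anime: Beta"/></a></td></tr>
</table>
"""

parsed = parse_anime_rows(SAMPLE_PAGE)
print(parsed)
```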

Each call to the function takes the URL of the webpage that we want to scrape. 

After checking the URLs of the next pages, we noticed that they follow a pattern. <br/>
For example, the second webpage's URL is 'https://myanimelist.net/topanime.php?limit=50', and the only difference between this URL and the main page URL (which is 'https://myanimelist.net/topanime.php') is the '?limit=50' string at the end. <br/><br/>
* So we can use this pattern to build the URLs of the following pages: page i (counting from 0) appends '?limit=' followed by 50*i. 
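The pattern can be captured in a small helper. This is a minimal sketch: the names `BASE_URL`, `PAGE_SIZE` and `page_url` are hypothetical, and it assumes the 'limit' offset keeps growing in steps of 50 on later pages, as observed for the second page.

```python
from urllib.parse import urlencode

BASE_URL = 'https://myanimelist.net/topanime.php'
PAGE_SIZE = 50  # animes listed per page


def page_url(page_index):
    """Return the URL of the page with the given zero-based index."""
    if page_index == 0:
        return BASE_URL
    # The 'limit' parameter is the offset of the first anime on the page.
    return BASE_URL + '?' + urlencode({'limit': PAGE_SIZE * page_index})


urls = [page_url(i) for i in range(3)]
for url in urls:
    print(url)
```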

In [167]:
MainPageURL = 'https://myanimelist.net/topanime.php'
TheWholeResults = defaultdict(str)
for i in range(5):
    if i == 0:
        Output = GetAnimeInfo(MainPageURL)
    else:
        Output = GetAnimeInfo(MainPageURL+'?limit='+str(50*i))
    TheWholeResults.update(Output)
    

In [168]:
len(TheWholeResults)

250
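When the loop is extended to many more pages, it may help to pause between requests and to record pages that fail instead of stopping at the first error. The sketch below illustrates that idea under stated assumptions: `collect_pages`, `fetch_page` and `delay_seconds` are hypothetical names, and `fake_fetch` only simulates a scraper so the example runs offline. In this notebook, `GetAnimeInfo` could be passed as `fetch_page`.

```python
import time


def collect_pages(page_urls, fetch_page, delay_seconds=1.0):
    """Merge the dictionaries returned by fetch_page for each URL.

    Returns the merged dictionary and a list of (url, error) pairs
    for the pages that could not be processed.
    """
    combined = {}
    failed = []
    for index, url in enumerate(page_urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        try:
            combined.update(fetch_page(url))
        except Exception as error:  # broad on purpose: any failure is logged
            failed.append((url, repr(error)))
    return combined, failed


def fake_fetch(url):
    """Offline stand-in for a real scraper; fails on one page."""
    if url.endswith('limit=50'):
        raise RuntimeError('simulated failure')
    return {'anime from ' + url: url}


demo_urls = ['page', 'page?limit=50', 'page?limit=100']
results, failures = collect_pages(demo_urls, fake_fetch, delay_seconds=0)
print(len(results), len(failures))
```

Because the results are keyed by anime name, two entries with the same name would overwrite each other, so comparing the final count with the expected 50 per page remains a useful sanity check.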
