This notebook contains the code used to collect and analyze the animes listed on the Top Anime page of MyAnimeList.

# Importing useful packages

In [1]:
import requests
from bs4 import BeautifulSoup
from collections import defaultdict
import os
from pathlib import Path
import re
import multiprocessing as mp
from math import ceil
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

First we send a request for the content of the main page of the Top Anime ranking.

In [3]:
TopAnime = requests.get('https://myanimelist.net/topanime.php')

Now we check if we have successfully received the content of the desired webpage or not. 

In [4]:
TopAnime

<Response [200]>

Since the status code of the response is '200', we have successfully received the desired page.

Then we take a brief look at the content of this page. 

In [100]:
#TopAnime.content

As we can see in the output, we have the HTML code of the webpage. Now we should parse this HTML to extract the URL associated with each anime. To do so, we will use the **BeautifulSoup** library, which is designed to parse HTML.

In [7]:
TopAnimeSoup = BeautifulSoup(TopAnime.content, 'html.parser')

Now we look at the HTML code of the webpage in a nicely indented format using BeautifulSoup.

In [101]:
#print(TopAnimeSoup.prettify())

# 1.1. Get the list of animes

### Steps to extract the url for just one anime. 

Here we will go through the required steps to extract the url and the name of just the first anime in the list. We can get the information of the rest of these animes by just iteration. 

After checking the HTML code, we saw that the information related to each anime is stored in a table whose class is "top-ranking-table", with one tr (row) per anime.

Let's take a look at how many tables we have in the webpage. 

In [9]:
len(list(TopAnimeSoup.find_all('table')))

1

Since there is only one table in the webpage, every tr in the webpage belongs to this table.

Here we will take a look at how many tr tags we have in the webpage. 

In [111]:
#print(list(TopAnimeSoup.find_all('tr')))
#len(list(TopAnimeSoup.find_all('tr')))

We know that each page lists 50 animes, so why do we have 51 tr tags here? <br/>
Because the first row holds the names of the table's columns and the remaining rows store the information related to each anime.

So, to get the rows that contain the information of each anime, we go through all the tr tags in the webpage except the first one.

In [12]:
Rows = list(TopAnimeSoup.find_all('tr'))[1:]
len(Rows)

50

As we can see above, we now have all the rows that contain the information related to each anime. Each page lists 50 animes.

Now we should get the name and the URL of each anime. We found out that this information is in the 'a' tag whose class is "hoverinfo_trigger fl-l ml12 mr8", located in the second 'td' tag of each 'tr'.

The steps below extract this tag for the first anime.

We get all the 'td' tags of the first row in Rows, which is the second 'tr' of the page because the first 'tr' contains the column names.

In [14]:
Tds = Rows[0].find_all('td')

Then we will go to the second 'td' tag's information. The first one contains just the ranking number. 

In [15]:
SecondTD = Tds[1]

Then inside this 'td' tag we look for the 'a' tag whose class is "hoverinfo_trigger fl-l ml12 mr8".

In [107]:
# Look up the link by tag name and CSS class; class_ matches any one of the tag's classes.
TagA = SecondTD.find('a', class_="hoverinfo_trigger")
TagA

<a class="hoverinfo_trigger fl-l ml12 mr8" href="https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood" id="#area5114" rel="#info5114">
<img alt="Anime: Fullmetal Alchemist: Brotherhood" border="0" class="lazyload" data-src="https://cdn.myanimelist.net/r/50x70/images/anime/1223/96541.jpg?s=faffcb677a5eacd17bf761edd78bfb3f" data-srcset="https://cdn.myanimelist.net/r/50x70/images/anime/1223/96541.jpg?s=faffcb677a5eacd17bf761edd78bfb3f 1x, https://cdn.myanimelist.net/r/100x140/images/anime/1223/96541.jpg?s=0c3b98cf4905422c00981025cd20d271 2x" height="70" width="50">
</img></a>

The url of an anime is the value of 'href' property of this tag. Let's take a look at it. 

In [108]:
URL = TagA['href']
URL

'https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood'

To extract the name of the anime we have two options: 1. split the 'href' and take its last part, or 2. take the value of the 'alt' property of the 'img' tag inside the 'a' tag. Here we will go for the second approach.

In [112]:
Image = TagA.find('img')
AnimeName = Image['alt'].replace('Anime: ', '')
AnimeName

'Fullmetal Alchemist: Brotherhood'
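
For comparison, the first option (deriving the name from the 'href') can be sketched as below. This is only an illustration using the URL shown in the output above; the variable names are hypothetical. It suggests why the 'alt' text is the more faithful source: the URL form loses the colon and leaves a double space.

```python
# URL copied from the output above.
url = 'https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood'

# Option 1: take the last path segment and turn underscores into spaces.
name_from_url = url.rsplit('/', 1)[-1].replace('_', ' ')

# repr() makes the double space visible.
print(repr(name_from_url))
```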

### Going through all the animes on the first page. 

Now we want to get the name and the URL of every anime on this specific page. 

In [113]:
MyDict = defaultdict(str)
for Row in Rows: 
    TDs = Row.find_all('td')
    # Look up the link by tag name and CSS class.
    TagA = TDs[1].find('a', class_="hoverinfo_trigger")
    AnimeName, URL = TagA.find('img')['alt'].replace('Anime: ', ''), TagA['href']
    MyDict[AnimeName] = URL

Now we will check the information for five animes. 

In [115]:
for anime in list(MyDict.keys())[:5]:
    print('Name of anime: ' + anime+'\nURL: ' + MyDict[anime], end = '\n\n')

Name of anime: Fullmetal Alchemist: Brotherhood
URL: https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood

Name of anime: Gintama°
URL: https://myanimelist.net/anime/28977/Gintama°

Name of anime: Shingeki no Kyojin Season 3 Part 2
URL: https://myanimelist.net/anime/38524/Shingeki_no_Kyojin_Season_3_Part_2

Name of anime: Steins;Gate
URL: https://myanimelist.net/anime/9253/Steins_Gate

Name of anime: Fruits Basket: The Final
URL: https://myanimelist.net/anime/42938/Fruits_Basket__The_Final



Here we want to check that we have the URLs of all 50 animes in the dictionary. 

In [118]:
print('Number of animes ', len(MyDict))

Number of animes  50


# 1.2. Crawl animes

To make the implementation more readable, we write a function that receives the URL of the webpage we want to scrape and appends the anime URLs found there to the URLs.txt file. 

In [149]:
def GetAnimeInfo(webpage):
    # Download one ranking page and append the URL of every anime on it to URLs.txt.
    Request = requests.get(webpage)
    TopAnimeSoup = BeautifulSoup(Request.content, 'html.parser')
    Rows = TopAnimeSoup.find_all('tr')
    with open('URLs.txt', mode = 'a') as File:
        for Row in Rows[1:]:  # The first row holds the column names.
            TDs = Row.find_all('td')
            TagA = TDs[1].find('a', class_="hoverinfo_trigger")
            File.write(TagA['href'] + '\n')

Now, for each page, we pass the function the URL of the webpage that we want to scrape. 

After checking the URLs of the next pages, we noticed a pattern. <br/>
For example, the URL of the 2nd webpage is 'https://myanimelist.net/topanime.php?limit=50', and the only difference from the main page URL (which is 'https://myanimelist.net/topanime.php') is the '?limit=50' string at the end; each subsequent page increases this value by 50. <br/><br/>
* So we can use this pattern to find the URLs of the next pages. 
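
The pattern described above can be sketched as a small helper that builds the page URLs without sending any request. The function name `ranking_page_urls` and its parameters are hypothetical and only serve the illustration.

```python
def ranking_page_urls(num_pages, page_size=50):
    """Build the URLs of the first `num_pages` ranking pages."""
    base = 'https://myanimelist.net/topanime.php'
    urls = []
    for page in range(num_pages):
        # The first page has no query string; page k (0-based) starts at rank 50*k.
        urls.append(base if page == 0 else base + '?limit=' + str(page_size * page))
    return urls

first_three = ranking_page_urls(3)
for url in first_three:
    print(url)
```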

In [148]:
MainPageURL = 'https://myanimelist.net/topanime.php'
for i in range(400):
    if i == 0:
        GetAnimeInfo(MainPageURL)
    else:
        GetAnimeInfo(MainPageURL+'?limit='+str(50*i))
    

Now we want to download the HTML file of each anime. <br/> 
To do this, we write a function that receives a number of URLs, downloads their HTML files and stores them in a folder.<br/>
These URLs are given as a list of lists: each inner list corresponds to one ranking page of the website and contains the URLs of the animes listed on that page. 

In [161]:
# This function downloads the HTML of every anime listed on some specific ranking pages. 

def GetHTML(URLS, GroupIndex, NumPageGroup, Start):
    for index, ListURLs in enumerate(URLS):
        if index < Start:
            continue
        FolderName = 'page' + str(GroupIndex*NumPageGroup + index +1)
        # e.g. /home/mehrdad/ADM-HW3/HTMLS/page1
        FolderPath = os.path.join(os.getcwd(),'HTMLS', FolderName)
        # Create the folder (and the parent HTMLS folder) if it does not exist yet.
        os.makedirs(FolderPath, exist_ok = True)
        for i, url in enumerate(ListURLs):
            AnimePage = requests.get(url)
            AnimePageSoup = BeautifulSoup(AnimePage.content, 'html.parser')
            # e.g. /home/mehrdad/ADM-HW3/HTMLS/page1/anime_0.html
            with open(os.path.join(FolderPath, 'anime_' + str(i) + '.html'), "w") as file:
                file.write(str(AnimePageSoup))

The arguments of this function are:
* URLS: A list of lists. Each row corresponds to a ranking page and contains the URLs listed on that page. 
* GroupIndex: The index of the group of pages being processed. Each group is handled by one worker process, and the index is used to compute the global page number. 
* NumPageGroup: The number of pages assigned to each group. 
* Start: The index, within the group, of the first page to download; pages before it are skipped. 
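
As a small worked example of how these arguments combine, the folder name used inside GetHTML can be reproduced on its own. The helper name `folder_name` is hypothetical; the formula is the one from the function above.

```python
def folder_name(group_index, pages_per_group, index_in_group):
    # Global page number = pages handled by earlier groups + position in this group (1-based).
    return 'page' + str(group_index * pages_per_group + index_in_group + 1)

print(folder_name(0, 96, 0))   # first page of the first group
print(folder_name(1, 96, 0))   # first page of the second group
print(folder_name(3, 96, 94))  # 95th page of the fourth group
```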

To speed up downloading the HTML files, we parallelize this process across the processors available in the system. 

So first we check how many processors we have in the system:

In [50]:
print("The number of processors:", mp.cpu_count())

The number of processors: 4


So we should distribute our URLs over these processors. First we read the URLs back from the file in which we stored them. 

In [51]:
AllURLs = []
with open('URLs.txt') as f:
    for row in f:
        AllURLs.append(row.strip())

Now we want to check how many URLs we have:

In [52]:
print('The total number of URLs is:', len(AllURLs))

The total number of URLs is: 19128


* **Note:** The HTML files of the animes that appear on the same webpage must be stored in a single folder, so all the animes of a webpage should be given to a single processor. 

* **Note:** We collected the URLs in order, so each consecutive block of 50 URLs corresponds to one specific webpage. 

Now we group the URLs in blocks of 50, so that each block corresponds to the animes listed on the same webpage. 

In [53]:
# ceil avoids an empty trailing page when the number of URLs is a multiple of 50.
EachPageURLs = [AllURLs[i*50:(i+1)*50] for i in range(ceil(len(AllURLs)/50))]

Here we take a look at how many pages we have: 

In [54]:
print("Number of webpages in the website:", len(EachPageURLs))

Number of webpages in the website: 383


If this number of webpages is correct, then the division 'number of all URLs'/50, rounded up, should produce the same result. 

In [55]:
print('The result of division:', len(AllURLs)/50)

The result of division: 382.56


The last webpage does not contain exactly 50 animes, which is why the division is not an integer; rounding 382.56 up gives 383, so everything is consistent so far. 

Now we check how many pages should be given to each of the processors. 

In [56]:
NumOfPage = ceil(ceil(len(AllURLs)/50)/mp.cpu_count())
print('Number of pages to be given to each processor: ', NumOfPage)

Number of pages to be given to each processor:  96


So we will give 96 pages to each of the processors in the system. 

Now we put the pages that should be given to a specific processor into groups. 

In [57]:
GroupPage = [EachPageURLs[i*NumOfPage:(i+1)*NumOfPage] for i in range(mp.cpu_count())]

Let's check the number of pages in each group:

In [58]:
print(list(map(len , GroupPage)))

[96, 96, 96, 95]


It can be seen the first 3 processors will be given 96 pages and the last one 95 pages.
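
The two grouping steps above (URLs into pages, pages into per-processor groups) can be checked in isolation with placeholder data. This is a minimal sketch: the `chunk` helper and the placeholder URLs are assumptions made for the illustration, and only the counts (19128 URLs, 50 per page, 4 processors) come from the notebook.

```python
from math import ceil

def chunk(items, size):
    """Split a list into consecutive slices of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

urls = ['url_%d' % i for i in range(19128)]   # placeholders, not real URLs
pages = chunk(urls, 50)                       # one inner list per ranking page
pages_per_worker = ceil(len(pages) / 4)       # 4 processors, as reported above
groups = chunk(pages, pages_per_worker)       # one group of pages per processor
group_sizes = [len(g) for g in groups]

print(len(pages), len(pages[-1]), pages_per_worker, group_sizes)
```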

### Downloading HTMLs in parallel

As we saw before, we have 4 processors in the system and we want to distribute the work among them.<br/>
We give a subset of the pages to each of these processors, which then downloads the animes of those pages. 

In [196]:
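pool = mp.Pool(mp.cpu_count())
# Start = 5 skips the first five pages of every group.
results = [pool.apply_async(GetHTML, args = (GroupPage[i],i, NumOfPage, 5)) for i in range(mp.cpu_count())]
pool.close()
# Wait for all the downloads to finish before moving on.
pool.join()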
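
When tasks are submitted with apply_async, an exception raised inside a worker only becomes visible when `.get()` is called on the returned result object. The sketch below illustrates collecting the results; it uses `multiprocessing.pool.ThreadPool`, which offers the same `apply_async` interface with threads, purely so the example runs anywhere. The function `process_group` is a hypothetical stand-in for GetHTML.

```python
from multiprocessing.pool import ThreadPool

def process_group(group, group_index):
    # Stand-in for GetHTML: just report how many pages the group holds.
    return group_index, len(group)

groups = [[1, 2, 3], [4, 5], [6]]

with ThreadPool(len(groups)) as pool:
    pending = [pool.apply_async(process_group, args=(g, i)) for i, g in enumerate(groups)]
    # .get() blocks until the task finishes and re-raises any exception from the worker.
    finished = [p.get() for p in pending]

print(finished)
```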

# 1.3 Parse downloaded pages

In this part we should produce a .tsv file containing the desired information for each of the pages that we downloaded in the previous section. 

To do this, we first check how to get the information for just one anime and then extend the idea to the other animes. 

The information that we should extract for each anime is: 

1- **Anime Name** (to save as animeTitle): String <br/>
2- **Anime Type** (to save as animeType): String<br/>
3- **Number of episode** (to save as animeNumEpisode): Integer<br/>
4- **Release and End Dates of anime** (to save as releaseDate and endDate): Convert both release and end date into datetime format.<br/>
5- **Number of members** (to save as animeNumMembers): Integer<br/>
6- **Score** (to save as animeScore): Float<br/>
7- **Users** (to save as animeUsers): Integer<br/>
8- **Rank** (to save as animeRank): Integer<br/>
9- **Popularity** (to save as animePopularity): Integer<br/>
10- **Synopsis** (to save as animeDescription): String<br/>
11- **Related Anime** (to save as animeRelated): Extract all the related animes, but only keep unique values and those that have a hyperlink associated to them. List of strings.<br/>
12- **Characters** (to save as animeCharacters): List of strings.<br/>
13- **Voices** (to save as animeVoices): List of strings<br/>
14- **Staff** (to save as animeStaff): Include the staff name and their responsibility/task in a list of lists.<br/>

For each piece of information that we need, we write a function, which makes it clearer how each field is extracted. 

### 1.3.1 Function AnimeName

We know that the name of the anime can be extracted from the title of the webpage. So we send the html code of the webpage to this function, and we will return the name of the anime. 

In [59]:
def GetAnimeName(webpage):
    return webpage.title.getText().strip().replace(' - MyAnimeList.net', '')

### 1.3.2 Function GetAnimeType

This information can be extracted from the values stored in a specific 'div' tag. To do this, we send that specific tag to this function and receive the type of the anime. 

In [60]:
def GetAnimeType(Tag):
    Temp = Tag.getText().split()
    return Temp[-1] if len(Temp) >1 else ""

### 1.3.3 Function Episode

This information can be extracted from the values stored in a specific 'div' tag. To do this, we send that specific tag to this function and receive the number of episodes of the anime. 

In [61]:
def GetNumOfEpisode(Tag):
    Temp = Tag.getText().strip().split()[-1]
    return int(Temp) if Temp.isdigit() else ''

### 1.3.4 Function DateTime

This information can be extracted from the values stored in a specific 'div' tag. To do this, we send that specific tag to this function and receive the release date and the end date of the anime. <br/><br/>
* Some of the animes have only a release date, so the end date may be empty. 

In [63]:
def GetDates(Tag):
    # Drop the label, then split on the word 'to' surrounded by whitespace.
    Text = Tag.getText().replace('Aired:', '').strip()
    Parts = [p.strip() for p in re.split(r'\s+to\s+', Text)]
    Release = Parts[0]
    End = Parts[1] if len(Parts) == 2 else ''
    return Release, End
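
The field list above asks for the dates in datetime format, while GetDates returns plain strings. A minimal sketch of a possible conversion follows. It assumes the strings look like 'Apr 5, 2009', 'Apr 2009' or '2009'; that set of formats is an assumption, not something verified against every page, and the helper name `parse_mal_date` is hypothetical.

```python
from datetime import datetime

def parse_mal_date(text):
    """Return a datetime for the given date string, or None if no assumed format fits."""
    text = text.strip()
    for fmt in ('%b %d, %Y', '%b %Y', '%Y'):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # Try the next, less specific format.
    return None

print(parse_mal_date('Apr 5, 2009'))
print(parse_mal_date('2009'))
print(parse_mal_date('?'))
```

Returning None for unparseable values keeps the missing end dates distinguishable from real ones.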

### 1.3.5 Function Members

This information can be extracted from the values stored in a specific 'div' tag. To do this, we send that specific tag to this function and receive the number of members of the anime.

In [64]:
def GetMembers(Tag):
    return int(Tag.getText().replace('\n', '').split()[1].replace(',', ''))

### 1.3.6 Function ScoreAndUsers

This information can be extracted from the values stored in a specific 'div' tag. To do this, we send that specific tag to this function and receive the score of the anime and the number of users who scored it.

In [65]:
def GetScoreAndUsers(Tag):
    # Inside a character class '|' is a literal character, so the classes list only digits and the separator.
    Users = re.findall(r'[0-9,]+ users', str(Tag))
    UsersValue = int(Users[0].split()[0].replace(',', '')) if len(Users) else ''
    Score = re.findall(r'Score:[0-9.]+', Tag.getText().replace("\n", ''))
    ScoreValue = float(Score[0].split(":")[1]) if len(Score) else ''
    return ScoreValue, UsersValue
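
The regular expressions used for this field can be tried on a plain string. The sample text below is made up for the illustration and is only assumed to resemble the text of the score block. The last two lines show a detail worth knowing when writing such patterns: inside a character class, '|' is a literal character and not an alternation.

```python
import re

sample = 'Score:9.15 (scored by 1,724,018 users)'  # made-up text, assumed format

score = float(re.search(r'Score:\s*([0-9.]+)', sample).group(1))
users = int(re.search(r'([0-9,]+) users', sample).group(1).replace(',', ''))
print(score, users)

# '|' inside [...] matches a literal pipe character.
pipe_in_class = re.findall(r'[0-9|,]+', '12|34')
pipe_removed = re.findall(r'[0-9,]+', '12|34')
print(pipe_in_class, pipe_removed)
```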

### 1.3.7 Function Rank

This information can be extracted from the values stored in a specific 'div' tag. To do this, we send that specific tag to this function and receive the rank of the anime.

In [66]:
def GetRank(Tag):
    Temp = re.findall(r'#[0-9,]+', str(Tag))
    # Drop the leading '#' and any thousands separators before converting.
    return int(Temp[0][1:].replace(',', '')) if len(Temp) else ''

### 1.3.8 Function Popularity

This information can be extracted from the values stored in a specific 'div' tag. To do this, we send that specific tag to this function and receive the popularity of the anime.

In [67]:
def GetPopularity(Tag):
    Temp = re.findall(r'#[0-9,]+', str(Tag))
    # Drop the leading '#' and any thousands separators before converting.
    return int(Temp[0][1:].replace(',', '')) if len(Temp) else ''

### 1.3.9 Function Synopsis

In order to get the description of the anime, there is a specific tag which can be easily identified by its 'itemprop' property. So we will give the whole html to this function and get back the description. 

In [68]:
def GetSynopsis(webpage):
    Temp = webpage.find('p', {'itemprop': "description"})
    if not Temp:
        return ''
    Temp = Temp.getText().replace('\n', '')
    return Temp

### 1.3.10 Function Related Anime

To get the related animes, there is a table whose 'class' is 'anime_detail_related_anime'. To reach this table, we give the whole HTML of the page to this function. 

In [69]:
def GetRelatedAnime(webpage):
    Related = webpage.find('table', {'class': "anime_detail_related_anime"})
    if not Related:
        return ""
    Related = Related.find_all('a')
    UniqueRelated = set(i.getText() for i in Related)
    return list(UniqueRelated)

### 1.3.11 Function Characters

We can find this information in the 'a' tags contained in a div whose class is 'detail-characters-list clearfix'. The first div with this class holds the table of the characters and their voice actors. 

In [70]:
def GetCharacters(webpage):
    Characters = []
    Tags = webpage.find_all('div', {'class': "detail-characters-list clearfix"})
    for tag in Tags:
        if str(tag).count('character') >1:
            # We can get the name of the character in the 'href' value in 'a' property
            AllTagA = tag.find_all('a')
            # Filter the hrefs to characters
            CharactersHrefs = list(set([i['href'] for i in AllTagA if 'character' in i['href']]))
            # Filter the names of the characters
            Characters = [i.split('/')[-1].replace('_', ' ') for i in CharactersHrefs]
    return Characters

### 1.3.12 Function Voices 

With the same approach as previous, now we just extract the voices. 

In [71]:
def GetVoices(webpage):
    Voices = []
    Tags = webpage.find_all('div', {'class': "detail-characters-list clearfix"})
    for tag in Tags:
        if str(tag).count('character') > 1:
            # We can get the name of the person in charge for the voice in the 'href' value in 'a' property
            AllTagA = tag.find_all('a')
            # Filter the hrefs for voices
            VoicesHrefs = list(set([i['href'] for i in AllTagA if 'people' in i['href']]))
            # Filter the names of the people
            Voices = [i.split('/')[-1].replace('_', ' ') for i in VoicesHrefs]
    return Voices

### 1.3.13 Function Staff

In this function we look for the other div whose 'class' is 'detail-characters-list clearfix'. The name of each staff member is taken from the 'href' of the 'a' tag in a row, and their duties are found in the 'small' tag of the same row. 

In [97]:
def GetStaff(webpage):
    StaffDuty = []
    Tags = webpage.find_all('div', {'class': "detail-characters-list clearfix"})
    for tag in Tags:
        if str(tag).count('character') == 1:
            Staff =  tag.find_all('tr')
            for Row in Staff:
                NewStaff = [Row.find('a')['href'].split('/')[-1].replace('_', ' ')] # Extracting the name of the staff
                Duties = Row.find('small').getText().split(',') # Getting the duties of the staff
                NewStaff += [Duty.strip() for Duty in Duties]
                StaffDuty.append(NewStaff)
    return StaffDuty

### 1.3.14 Function Write to TSV

With the help of this function, we write the information extracted from a page to its corresponding .tsv file. 

In [73]:
def WriteToTSV(AnimeInfo, Path):
    # One header line followed by one data line, both tab-separated.
    Values = []
    for Key in Keys:
        Value = str(AnimeInfo[Key])
        # Empty lists are stored as empty fields.
        Values.append('' if Value == '[]' else Value)
    with open(Path, mode = 'w') as File:
        File.write("\t".join(Keys) + '\n')
        File.write("\t".join(Values))
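
A field that itself contains a tab would shift the columns of a hand-built .tsv line. As an alternative sketch (not the approach used in this notebook), the standard `csv` module can write tab-separated rows and quote such fields; the helper name `write_tsv` and the sample record are hypothetical.

```python
import csv
import io

def write_tsv(handle, keys, record):
    """Write a header row and one data row, tab-separated, quoting fields when needed."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    writer.writerow(keys)
    writer.writerow([record.get(k, '') for k in keys])

buffer = io.StringIO()  # stands in for an open file
keys = ['animeTitle', 'animeDescription']
write_tsv(buffer, keys, {'animeTitle': 'Steins;Gate',
                         'animeDescription': 'Line one\twith a tab'})

# Reading back with csv.reader restores the original field, tab included.
rows = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter='\t'))
print(rows)
```

Files written this way should also be read with `csv.reader`, since a plain `split('\t')` does not undo the quoting.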

### 1.3.15 Function Extract

With the help of this function we want to extract the desired information from all the .html files that we have. 

Here we used some key words so we can easily get the needed information which are in some tags that share the same class. 

In [74]:
Info = ['Type:', 'Episodes:', 'Aired:', 'Members:','Score:','Ranked:', 'Popularity:']

Here we define the default data structure, a dictionary, that stores the information of each page. 

In [75]:
Keys = ['animeTitle', 'animeType', 'animeNumEpisode', 'releaseDate', 'endDate', 'animeNumMembers'
        , 'animeScore', 'animeUsers', 'animeRank', 'animePopularity', 'animeDescription', 'animeRelated',
        'animeCharacters', 'animeVoices', 'animeStaff']
MyAnimeInfo = {i:"" for i in Keys} 

Here we go through each HTML file that we have just downloaded, extract the information that we need from it, and store that information in a .tsv file. 

In [78]:
def FunctionExtract(ListNumPage, ProcesserNum):
    # ProcesserNum identifies the worker; every anime gets its own fresh dictionary.
    Path = '/home/mehrdad/ADM-HW3/HTMLS/page'
    for i in ListNumPage:
        # The last page (383) holds only 28 animes.
        for j in range((50 if i!= 383 else 28)):
            NewAnime = MyAnimeInfo.copy()

            with open(Path + str(i) + '/anime_' + str(j) +'.html', mode = 'r') as File:
                Webpage = BeautifulSoup(File.read(), 'html.parser')

            for div in Webpage.find_all('div', {'class':"spaceit_pad"}):
                Text = str(div)
                if Info[0] in Text:
                    NewAnime['animeType'] = GetAnimeType(div)
                elif Info[1] in Text:
                    NewAnime['animeNumEpisode'] = GetNumOfEpisode(div)
                elif Info[2] in Text:
                    NewAnime['releaseDate'], NewAnime['endDate'] = GetDates(div)
                elif Info[3] in Text:
                    NewAnime['animeNumMembers'] = GetMembers(div)
                elif Info[4] in Text:
                    NewAnime['animeScore'], NewAnime['animeUsers'] = GetScoreAndUsers(div)
                elif Info[5] in Text:
                    NewAnime['animeRank'] = GetRank(div)
                elif Info[6] in Text:
                    NewAnime['animePopularity'] = GetPopularity(div)

            NewAnime['animeTitle'] = GetAnimeName(Webpage)
            NewAnime['animeDescription'] = GetSynopsis(Webpage)
            NewAnime['animeRelated'] = GetRelatedAnime(Webpage)
            NewAnime['animeCharacters'] = GetCharacters(Webpage)
            NewAnime['animeVoices'] = GetVoices(Webpage)
            NewAnime['animeStaff'] = GetStaff(Webpage)

            TSVPath = Path + str(i) + '/anime_' + str(j) +'.tsv'
            WriteToTSV(NewAnime, TSVPath)
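
The label matching in this function (find the keyword, then call the matching extractor) can also be written as a lookup table that maps each label to a target key and a handler. The sketch below works on plain strings rather than parsed HTML; the handlers, the `HANDLERS` table and the sample blocks are simplified, hypothetical stand-ins and not the notebook's functions.

```python
def parse_type(text):
    return text.split()[-1]

def parse_episodes(text):
    last = text.split()[-1]
    return int(last) if last.isdigit() else ''

# label -> (key in the record, function that extracts the value)
HANDLERS = {
    'Type:': ('animeType', parse_type),
    'Episodes:': ('animeNumEpisode', parse_episodes),
}

def extract(blocks):
    record = {}
    for text in blocks:
        for label, (key, handler) in HANDLERS.items():
            if label in text:
                record[key] = handler(text)
                break  # At most one label per block.
    return record

record = extract(['Type: TV', 'Episodes: 64', 'Status: Finished Airing'])
print(record)
```

Adding a new field then only requires a new entry in the table.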

### Creating the .tsv files in parallel. 

* **In order to speed up the process, we will distribute the work among the available CPUs.**

We give a subset of the pages to each CPU, which creates the .tsv files of the animes on those pages. <br/>
Here we group the page numbers that should be given to each processor. 

In [79]:
RangePage = list(range(1, len(EachPageURLs) + 1))
PageNums = [RangePage[i * NumOfPage:(i+1) * NumOfPage] for i in range(mp.cpu_count())]

Call the function once per CPU, giving each one a subset of the .html files. 

In [99]:
pool = mp.Pool(mp.cpu_count())
results = [pool.apply_async(FunctionExtract, args = (PageNums[i],i)) for i in range(mp.cpu_count())]
pool.close()
# Wait until every .tsv file has been written.
pool.join()

# 2. Search Engine


Here we are asked to build a search engine that, given a query, returns the documents that are similar to that query. 

### 2.0 Pre-processing the information

Here we pre-process the information of each anime and store the result in another file named anime_(i)_synopsisPrep.csv, where i is the index of the anime within its page. 

The pre-processing has 5 stages, and we write a function for each stage. 

### 2.0.1 Tokenization 

Given a string, we will return the words that are in the given string. 

In [102]:
def Tokenization(Sentence):
    return nltk.word_tokenize(Sentence)

### 2.0.2 Lowercasing

Given a list of strings, we will return the same strings but in lower case. 

In [103]:
def Lowercasing(Words):
    return [w.lower() for w in Words]

### 2.0.3 StopWordsRemoval

Here we define a set that contains all the stopwords of the English language; a set makes the membership test fast. 

In [104]:
StopWords = set(stopwords.words('english'))

Given a list of strings, we remove the stopwords from it. 

In [105]:
def StopWordsRemoval(Words):
    return [w for w in Words if w not in StopWords]
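
NLTK's English stopword list is written in lowercase, which is why lowercasing has to run before this step in the main function later on. The effect can be seen with a tiny hand-written stopword set; the set and the tokens below are stand-ins for the illustration, not NLTK's data.

```python
stop_words = {'the', 'a', 'of'}  # tiny stand-in for the NLTK list

def remove_stop_words(words):
    return [w for w in words if w not in stop_words]

tokens = ['The', 'Alchemist', 'of', 'Steel']

# Without lowercasing, 'The' does not match 'the' and survives.
without_lowercasing = remove_stop_words(tokens)
with_lowercasing = remove_stop_words([w.lower() for w in tokens])

print(without_lowercasing)
print(with_lowercasing)
```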

### 2.0.4 PunctuationsRemoval

In this function, given a list of strings, we remove the punctuation tokens by keeping only the purely alphabetic strings; tokens that contain digits or other symbols are dropped as well. 

In [106]:
def PunctuationsRemoval(Words):
    return [w for w in Words if w.isalpha()]
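
A quick check of what `str.isalpha` keeps may help when interpreting the pre-processed output. The token list is made up for the illustration.

```python
tokens = ['steins', ';', 'gate', '2010', "n't", 'time-travel', 'okabe']

# isalpha() is True only when every character of the token is a letter.
kept = [w for w in tokens if w.isalpha()]
print(kept)
```

Numbers, contractions and hyphenated words are filtered out together with the punctuation.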

### 2.0.5 Stemming

In this function, given a list of strings, we return the stem of each string in the list. <br/>
Here we use the 'PorterStemmer' algorithm to do the stemming. 

In [107]:
def Stemming(Words):
    Stemmer = PorterStemmer()  # Create the stemmer once instead of once per word.
    return [Stemmer.stem(w) for w in Words]

* Here we are asked to work with the 'Synopsis' of each anime, so we decided to pre-process only the synopsis and not all the information that we extracted. 

* **Note:** To do the pre-processing, we distribute the work among the CPUs available in the system. Each CPU is given a subset of the pages and pre-processes the synopses of the animes on those pages. 

### 2.0.6 Main Function 

In [108]:
def MainFunction(Pages):
    Path = '/home/mehrdad/ADM-HW3/HTMLS/page'
    for page in Pages:
        for anime in range((50 if page != 383 else 28)):

            # For each anime we want to extract the synopsis (column 10 of the data line).
            with open(Path + str(page) + '/anime_' + str(anime) +'.tsv', mode = 'r') as File:
                Data = File.read().split('\n')[1].split('\t')[10]

            Tokens = Tokenization(Data)
            Lowercase = Lowercasing(Tokens)
            WithoutStop = StopWordsRemoval(Lowercase)
            WithoutPuncs = PunctuationsRemoval(WithoutStop)
            Stems = Stemming(WithoutPuncs)

            with open(Path + str(page) + '/anime_' + str(anime) +'_synopsisPrep.csv', mode = 'w') as PreProcessed:
                PreProcessed.write(",".join(Stems))

Group the pages to be given to the CPUs. 

In [109]:
RangePage = list(range(1, len(EachPageURLs) + 1))
PageNums = [RangePage[i * NumOfPage:(i+1) * NumOfPage] for i in range(mp.cpu_count())]

Pre-process the synopses of all the animes. 

In [110]:
pool = mp.Pool(mp.cpu_count())
results = [pool.apply_async(MainFunction, args = (PageNums[i],)) for i in range(mp.cpu_count())]
pool.close()
# Wait until every synopsis has been pre-processed.
pool.join()

In [None]:
i, j = 1, 0
Path = '/home/mehrdad/ADM-HW3/HTMLS/page'
File = open(Path + str(i) + '/anime_' + str(j) +'.html', mode = 'r')
Webpage = BeautifulSoup(File.read(), 'html.parser')
GetStaff(Webpage)