# Scraping THEM Anime for Anime Distributors

## Project Motivation
Due to COVID-19, Anime Club will have to meet remotely. Unfortunately, our Premium membership through Crunchyroll's Outreach program does not allow us to screen via platforms like Zoom or Discord. I wanted to get an idea of which anime distributors I should contact directly, and going through each THEM Anime review by hand would be tedious. I had been planning to do a web scraping project for a couple of months, so I figured, "Why not automate it?"

## Step \#1: Examine the review list page structure
The [page with the list of reviews](http://themanime.org/reviewlist.php) puts the links to individual reviews in `<a>` elements that sit inside `<li>` elements, which in turn live within `<ul>` elements.

In [2]:
import requests
import urllib.request
import re

rev_url = "http://themanime.org/reviewlist.php"
response = requests.get(rev_url)

In [3]:
response

<Response [200]>

In [4]:
response.text



In [5]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())  # call the method; without the parentheses, print() shows the bound-method repr

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/scripts/style.css" rel="stylesheet" type="text/css"/>
<link href="/rss.xml" rel="alternate" title="THEM Anime Reviews - Feed" type="application/rss+xml"/>
<link href="/themanime_search.xml" rel="search" title="THEM Anime Reviews" type="application/opensearchdescription+xml"/>
<link href="/favicon-196x196.png" rel="icon" type="image/png"/>
<link href="/favicon-96x96.png" rel="icon" type="image/png"/>
<link href="/favicon-32x32.png" rel="icon" type="image/png"/>
<link href="/favicon-16x16.png" rel="icon" type="image/png"/>
<link href="/apple-touch-icon-57x57.png" rel="apple-touch-icon"/>
<link href="/apple-touch-icon-114x114.png" rel="apple-touch-icon"/>
<link href="/apple-touch-icon-72x72.png" rel="apple-touch-icon"/>
<link href="/apple-touch-icon-144

Upon further inspection, every review link has an `href` that starts with `viewreview.php`.
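
The nesting and the `href` prefix described so far can also be expressed as a single CSS selector. The snippet below is a minimal sketch on a made-up miniature of the list page; `SAMPLE_HTML` and the `sample_*` names are hypothetical, and the real page contains far more markup than this.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the review list page; the real markup is much larger.
SAMPLE_HTML = """
<ul>
  <li><a href="viewreview.php?id=685">.hack//Legend of the Twilight</a></li>
  <li><a href="viewreview.php?id=1">.hack//SIGN</a></li>
  <li><a href="index.php">Home</a></li>
</ul>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "ul li a" follows the <ul> / <li> / <a> nesting; [href^="..."] keeps only
# the links whose href begins with the review script name.
sample_links = sample_soup.select('ul li a[href^="viewreview.php"]')
sample_titles = [a.get_text() for a in sample_links]
print(sample_titles)
```

The selector alone does not skip "Second Opinion" or sidebar links, so it is a starting point rather than a replacement for the filters used in Step 2.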

## Step \#2: Get List of Review Links

In [6]:
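# Filter on the link text: skip links whose text is "Second Opinion".
def not_second_opinion(elem_txt):
    return str(elem_txt).lower() != "second opinion"

# Filter on the class attribute: skip links whose class name contains "sidebar".
def not_from_sidebar(class_name):
    return "sidebar" not in str(class_name).lower()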
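
To see what these two predicates accept, here is a small sanity check. The functions are repeated so the snippet runs on its own. The class name `sidebarlink` is a hypothetical example, and the `None` case assumes that BeautifulSoup passes `None` to the function when a link has no class attribute.

```python
# Repeated from the cell above so this snippet is self-contained.
def not_second_opinion(elem_txt):
    return str(elem_txt).lower() != "second opinion"

def not_from_sidebar(class_name):
    return "sidebar" not in str(class_name).lower()

# The comparisons are case-insensitive because both functions lowercase first.
checks = {
    "title link": not_second_opinion(".hack//SIGN"),
    "second opinion link": not_second_opinion("Second Opinion"),
    "sidebar class": not_from_sidebar("sidebarlink"),  # hypothetical class name
    "no class (None)": not_from_sidebar(None),         # str(None) is "None"
}
for name, kept in checks.items():
    print(f"{name}: {'kept' if kept else 'filtered out'}")
```

Wrapping the argument in `str()` is what keeps the `None` case from raising an error.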

In [7]:
# find_all is the current spelling of findAll; the dot is escaped so it matches a literal "."
review_links = soup.find_all(href=re.compile(r"viewreview\.php"),
                             string=not_second_opinion,
                             class_=not_from_sidebar)

In [8]:
len(review_links)

1884

In [9]:
review_links

[<a href="viewreview.php?id=685">.hack//Legend of the Twilight</a>,
 <a href="viewreview.php?id=1">.hack//SIGN</a>,
 <a href="viewreview.php?id=1634">11 Eyes</a>,
 <a href="viewreview.php?id=1962">18if</a>,
 <a href="viewreview.php?id=3">3x3 Eyes</a>,
 <a href="viewreview.php?id=1121">3x3 Eyes 1&amp;2</a>,
 <a href="viewreview.php?id=2">3x3 Eyes 2</a>,
 <a href="viewreview.php?id=1442">5 Centimeters Per Second</a>,
 <a href="viewreview.php?id=20">6 Angels</a>,
 <a href="viewreview.php?id=9">8 Man After</a>,
 <a href="viewreview.php?id=8">801 TTS Airbats</a>,
 <a href="viewreview.php?id=1934">91 Days</a>,
 <a href="viewreview.php?id=1249">A Channel</a>,
 <a href="viewreview.php?id=1977">A Good Librarian Is Like A Good Shepherd</a>,
 <a href="viewreview.php?id=2034">A Silent Voice - The Movie</a>,
 <a href="viewreview.php?id=1877">A Town Where You Live</a>,
 <a href="viewreview.php?id=1989">A.I.C.O. Incarnation</a>,
 <a href="viewreview.php?id=5">The Abashiri Family</a>,
 <a href="viewre

## Step \#3: Examine the review page structure

From each review, I want to pull the anime's:
* title
* other names
* genre
* length
* distributor
* content rating

In [10]:
review_links[0].get('href')

'viewreview.php?id=685'

In [11]:
base_url = "http://themanime.org/"
first_rev_link = base_url + review_links[0].get('href')
first_rev_link

'http://themanime.org/viewreview.php?id=685'
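
String concatenation works here because every `href` is relative to the site root. As a hedged alternative, the standard library's `urljoin` resolves an `href` against the page it came from. The variable names below are illustrative and not part of the notebook.

```python
from urllib.parse import urljoin

list_url = "http://themanime.org/reviewlist.php"
relative_href = "viewreview.php?id=685"

# urljoin resolves the href relative to the page it was found on,
# replacing the last path segment ("reviewlist.php") with the href.
absolute_url = urljoin(list_url, relative_href)
print(absolute_url)  # http://themanime.org/viewreview.php?id=685
```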

In [12]:
first_rev = requests.get(first_rev_link)
first_rev

<Response [200]>

In [13]:
first_rev.text

'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">\n<html lang="en">\n    <head>\n        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n        <link rel="stylesheet" type="text/css" href="/scripts/style.css">\n        <link href="/rss.xml" rel="alternate" type="application/rss+xml" title="THEM Anime Reviews - Feed">\n        <link rel="search" type="application/opensearchdescription+xml" title="THEM Anime Reviews" href="/themanime_search.xml">\n        <link rel="icon" type="image/png" href="/favicon-196x196.png">\n        <link rel="icon" type="image/png" href="/favicon-96x96.png">\n        <link rel="icon" type="image/png" href="/favicon-32x32.png">\n        <link rel="icon" type="image/png" href="/favicon-16x16.png">\n        <link rel="apple-touch-icon" href="/apple-touch-icon-57x57.png">\n        <link rel="apple-touch-icon" href="/apple-touch-icon-114x114.png">\n        <link rel="apple-touch-icon" href=

In [14]:
first_rev_soup = BeautifulSoup(first_rev.text, "html.parser")
print(first_rev_soup.prettify())  # call the method; without the parentheses, print() shows the bound-method repr

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/scripts/style.css" rel="stylesheet" type="text/css"/>
<link href="/rss.xml" rel="alternate" title="THEM Anime Reviews - Feed" type="application/rss+xml"/>
<link href="/themanime_search.xml" rel="search" title="THEM Anime Reviews" type="application/opensearchdescription+xml"/>
<link href="/favicon-196x196.png" rel="icon" type="image/png"/>
<link href="/favicon-96x96.png" rel="icon" type="image/png"/>
<link href="/favicon-32x32.png" rel="icon" type="image/png"/>
<link href="/favicon-16x16.png" rel="icon" type="image/png"/>
<link href="/apple-touch-icon-57x57.png" rel="apple-touch-icon"/>
<link href="/apple-touch-icon-114x114.png" rel="apple-touch-icon"/>
<link href="/apple-touch-icon-72x72.png" rel="apple-touch-icon"/>
<link href="/apple-touch-icon-144

In [15]:
first_rev_soup.find_all('h1')  # Title!

[<h1>.hack//Legend of the Twilight</h1>]

In [34]:
info_summary = first_rev_soup.find_all(class_='review')
info_summary

[<table class="review">
 <tr><td align="center" class="info"><img alt="[.hack//LEGEND box art]" border="1" height="280" src="/images/reviews/hacklegendbox.jpg" width="200"/></td></tr>
 <tr><td class="info2"><b class="info">AKA:</b> .hack//黄昏の腕輪伝説 (hack//Tasogare no Udewa Densetsu), .hack//Legend of the Twilight Bracelet, .hack//LEGEND, .hack//DUSK</td></tr>
 <tr><td class="info"><b class="info">Genre:</b> Sci-fi with some comedy and fantasy elements</td></tr>
 <tr><td class="info2"><b class="info">Length:</b> Television series, 12 episodes, 23 minutes each</td></tr>
 <tr><td class="info"><b class="info">Distributor:</b> Currently licensed by <a class="info" href="http://www.funimation.com">FUNimation</a>, available streaming on Hulu..</td></tr>
 <tr><td class="info2"><b class="info">Content Rating:</b> 13+ (mild violence, adult themes)</td></tr>
 <tr><td class="info"><b class="info">Related Series:</b> all .hack series</td></tr>
 <tr><td class="info2"><b class="info">Also Recommended:<

In [47]:
# Cell 0 is the box art; cells 1-5 hold AKA, genre, length, distributor, and content rating
summary_fields = info_summary[0].find_all('td')[1:6]
aka = summary_fields[0]
genre = summary_fields[1]
length = summary_fields[2]
distributor = summary_fields[3]
rating = summary_fields[4]

In [55]:
# Other names, genre, length, distributor, rating
print(aka.text)
print(genre.text)
print(length.text)
print(distributor.text)
print(rating.text)

AKA: .hack//黄昏の腕輪伝説 (hack//Tasogare no Udewa Densetsu), .hack//Legend of the Twilight Bracelet, .hack//LEGEND, .hack//DUSK
Genre: Sci-fi with some comedy and fantasy elements
Length: Television series, 12 episodes, 23 minutes each
Distributor: Currently licensed by FUNimation, available streaming on Hulu..
Content Rating: 13+ (mild violence, adult themes)
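
The positional slice relies on the rows always appearing in the same order, and the printed text still carries the label prefix. One possible alternative, sketched below on a trimmed copy of the table shown above, keys each value by its bold label. `parse_info_table` and `SAMPLE_HTML` are hypothetical names, and this has not been checked against every review page.

```python
from bs4 import BeautifulSoup

# Trimmed sample rows copied from the review table output above.
SAMPLE_HTML = """
<table class="review">
 <tr><td class="info"><b class="info">Genre:</b> Sci-fi with some comedy and fantasy elements</td></tr>
 <tr><td class="info2"><b class="info">Length:</b> Television series, 12 episodes, 23 minutes each</td></tr>
 <tr><td class="info"><b class="info">Distributor:</b> Currently licensed by <a class="info" href="http://www.funimation.com">FUNimation</a>, available streaming on Hulu..</td></tr>
 <tr><td class="info2"><b class="info">Content Rating:</b> 13+ (mild violence, adult themes)</td></tr>
</table>
"""

def parse_info_table(table):
    """Hypothetical helper: map each bold label (e.g. 'Genre') to the rest of its cell."""
    fields = {}
    for cell in table.find_all("td"):
        label_tag = cell.find("b")
        if label_tag is None:
            continue  # cells without a bold label (such as the box art) are skipped
        label_text = label_tag.get_text()
        # Everything after the label text is the value for that field.
        _, _, value = cell.get_text().partition(label_text)
        fields[label_text.strip().rstrip(":")] = value.strip()
    return fields

sample_table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table", class_="review")
info = parse_info_table(sample_table)
print(info["Distributor"])
```

With a dictionary, a row that is missing on some page shows up as a missing key instead of silently shifting every later field by one position.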
