# <center>Web Scraping Into SQL Database Demo

<img src = 'anime.png'>

## The goal

Scrape the site https://myanimelist.net/ in order to compile a SQL database of various anime data.

In this demo, we will grab the id, name, number of episodes, and a couple recommended shows for each of the top anime.

## The process

Import libraries.

In [1]:
import requests
from bs4 import BeautifulSoup
import time
import random
import math
import sqlite3
import pandas as pd
import numpy as np

For this demo, we will look just at some of the most popular anime found here https://myanimelist.net/topanime.php?type=bypopularity.

### Scraping links

Make initial request to the URL and parse wth BS.

In [2]:
url = 'https://myanimelist.net/topanime.php?type=bypopularity'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

Find the links to the individual pages for each of the top anime shows.

In [3]:
urls = []
for i in soup.find_all('div', class_="di-ib clearfix"):
    urls.append(i.find('a', href=True)['href'])
urls[:10]

['https://myanimelist.net/anime/16498/Shingeki_no_Kyojin',
 'https://myanimelist.net/anime/1535/Death_Note',
 'https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood',
 'https://myanimelist.net/anime/30276/One_Punch_Man',
 'https://myanimelist.net/anime/11757/Sword_Art_Online',
 'https://myanimelist.net/anime/31964/Boku_no_Hero_Academia',
 'https://myanimelist.net/anime/22319/Tokyo_Ghoul',
 'https://myanimelist.net/anime/20/Naruto',
 'https://myanimelist.net/anime/38000/Kimetsu_no_Yaiba',
 'https://myanimelist.net/anime/11061/Hunter_x_Hunter_2011']

### Scraping in a loop

Create a function that will grab the id, name, and number of episodes for a given anime URL. Also, it will grab the top two recommendations, as well as the id and URL for those recommendations.

In [4]:
def get_name_episodes(url, rec=False):
    response = requests.get(url, headers=headers)
    print(response.status_code, url)
    soup = BeautifulSoup(response.content, 'html.parser')
    main_id = url.split('/')[4]
    episodes = soup.find('span', text='Episodes:').next_element.strip()
    name = soup.find('h1').text.strip()
    print(name)
    rec_urls=[]
    rec_ids=[]
    if rec:
        response = requests.get(url+'/userrecs')
        soup = BeautifulSoup(response.content, 'html.parser')
        recs = soup.find_all('div', style='margin-bottom: 2px;', limit=2)
        for r in recs:
            rec_url = r.find('a', href=True)['href']
            rec_urls.append(rec_url)
            rec_id = rec_url.split('/')[4]
            rec_ids.append(rec_id)
    return main_id, episodes, name, rec_urls, rec_ids

In order to make our scraping activity look normal, we can pass in a header containing a 'User-agent' key. This basically tells the website we are just a normal web browser. <br>
Also, we want to have short, random pauses in between requests. This will mimic human behaviour and help to prevent the risk of being flagged as a bot.

In [5]:
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
sleep_min = 0
sleep_max = 1

The main loop. Now we want to go through each of the URLs, run our scraping function, then run the scraping function on the recommendations as well. <br> <br>
We want to store all the information we gather in lists along the way. <br> <br>
Some of the recommendations may be shows we already scraped. By using a set() of ids we have already scraped, we can avoid scraping the same show page multiple times.

In [6]:
now = time.time()
episode_counts = []
names = []
ids = []
ids_set = set()
recs_main = []
recs1 = []
recs2 = []
sleep_min = 0
sleep_max = 1
for url in urls[:5]:
    main_id, episodes, name, rec_urls, rec_ids = get_name_episodes(url, True)
    
    if main_id not in ids_set:
        ids.append(main_id)
        episode_counts.append(episodes)
        names.append(name)
        recs_main.append(main_id)
        recs1.append(rec_ids[0])
        recs2.append(rec_ids[1])
    
    for u,i in zip(rec_urls, rec_ids):
        if i not in ids_set:
            time.sleep(random.uniform(sleep_min, sleep_max))
            rec_id, episodes, name, _, _ = get_name_episodes(u)
            
            ids.append(rec_id)
            episode_counts.append(episodes)
            names.append(name)
    time.sleep(random.uniform(sleep_min, sleep_max))
print('Took', time.time()-now, 'seconds.')

200 https://myanimelist.net/anime/16498/Shingeki_no_Kyojin
Shingeki no Kyojin
200 https://myanimelist.net/anime/28623/Koutetsujou_no_Kabaneri
Koutetsujou no Kabaneri
200 https://myanimelist.net/anime/26243/Owari_no_Seraph
Owari no Seraph
200 https://myanimelist.net/anime/1535/Death_Note
Death Note
200 https://myanimelist.net/anime/1575/Code_Geass__Hangyaku_no_Lelouch
Code Geass: Hangyaku no Lelouch
200 https://myanimelist.net/anime/19/Monster
Monster
200 https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood
Fullmetal Alchemist: Brotherhood
200 https://myanimelist.net/anime/11061/Hunter_x_Hunter_2011
Hunter x Hunter (2011)
200 https://myanimelist.net/anime/16498/Shingeki_no_Kyojin
Shingeki no Kyojin
200 https://myanimelist.net/anime/30276/One_Punch_Man
One Punch Man
200 https://myanimelist.net/anime/32182/Mob_Psycho_100
Mob Psycho 100
200 https://myanimelist.net/anime/31964/Boku_no_Hero_Academia
Boku no Hero Academia
200 https://myanimelist.net/anime/11757/Sword_Art_Online

### Loading into SQL database

Now that we have the data, we can set up a SQL database.

Initialize and connect to the database.

In [9]:
conn = sqlite3.connect('anime2.db')
cur = conn.cursor()

Create anime table containing the id, name, and number of episodes.

In [10]:
cur.execute('''
            CREATE TABLE anime(
            id INT PRIMARY KEY,
            name TEXT,
            episodes INT
            )
            ''')

<sqlite3.Cursor at 0x2056811f730>

Insert the anime data into the anime table.

In [11]:
for i,name,eps in zip(ids, names, episode_counts):
    cur.execute('''
                INSERT OR REPLACE INTO anime
                VALUES(?, ?, ?)
                ''',(i,name,eps))

Inspect the anime table by loading it into a DataFrame.

In [12]:
cur.execute('''
            SELECT * FROM anime''')
x = cur.fetchall()
anime_df = pd.DataFrame(x)
anime_df.columns = [i[0] for i in cur.description]
anime_df

Unnamed: 0,id,name,episodes
0,28623,Koutetsujou no Kabaneri,Episodes:
1,26243,Owari no Seraph,Episodes:
2,1535,Death Note,Episodes:
3,1575,Code Geass: Hangyaku no Lelouch,Episodes:
4,19,Monster,Episodes:
5,5114,Fullmetal Alchemist: Brotherhood,Episodes:
6,11061,Hunter x Hunter (2011),Episodes:
7,16498,Shingeki no Kyojin,Episodes:
8,30276,One Punch Man,Episodes:
9,32182,Mob Psycho 100,Episodes:


Create recommendations table containing the main show id, the first recommendation id, and the second recommendation id.

In [13]:
cur.execute('''
            CREATE TABLE recs(
            id INT,
            first_rec_id INT,
            second_rec_id INT
            )
            ''')

<sqlite3.Cursor at 0x2056811f730>

Insert the recommendation data into the recommendations table.

In [14]:
for i,r1,r2 in zip(recs_main,recs1,recs2):
    cur.execute('''
                INSERT INTO recs
                VALUES(?, ?, ?)
                ''',(i,r1,r2))

Inspect the recommendations table by loading it into a DataFrame.

In [15]:
cur.execute('''
            SELECT * FROM recs''')
x = cur.fetchall()
recs_df = pd.DataFrame(x)
recs_df.columns = [i[0] for i in cur.description]
recs_df

Unnamed: 0,id,first_rec_id,second_rec_id
0,16498,28623,26243
1,1535,1575,19
2,5114,11061,16498
3,30276,32182,31964
4,11757,17265,11759


Commit the changes to the database.

In [16]:
conn.commit()

### SQL command practice

Create a DataFrame containing each of the most popular shows and the two recommendations.

In [17]:
cur.execute('''
            SELECT a1.name as "if you like", a2.name as "try", a3.name as "or" FROM anime a1
            JOIN recs
            ON a1.id=recs.id
            JOIN anime a2
            ON recs.first_rec_id=a2.id
            JOIN anime a3
            ON recs.second_rec_id=a3.id
            GROUP BY a1.name
            ''')
x= cur.fetchall()
all_df = pd.DataFrame(x)
all_df.columns = [i[0] for i in cur.description]
all_df

Unnamed: 0,if you like,try,or
0,Death Note,Code Geass: Hangyaku no Lelouch,Monster
1,Fullmetal Alchemist: Brotherhood,Hunter x Hunter (2011),Shingeki no Kyojin
2,One Punch Man,Mob Psycho 100,Boku no Hero Academia
3,Shingeki no Kyojin,Koutetsujou no Kabaneri,Owari no Seraph
4,Sword Art Online,Log Horizon,Accel World
