# Web Scraping 

## 👉 Ethical Scraping Guidelines:

- __Terms of Use :__ Respect the site's _Terms of Use_ and _robots.txt_.
- __Minimize Burden :__ Minimize the burden on website owners. Add a time delay (>1 second) between requests, and scrape at times of day when web traffic is likely to be low.
- __Personal Information :__ Don't scrape personal information.
- __Notify :__ Notify the website owners where possible. They may just send you the data 😁

For more information on how to read __robots.txt__ 👉 https://www.seerinteractive.com/insights/how-to-read-robots-txt
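
As a rough illustration of the first two guidelines, the sketch below uses Python's standard-library `urllib.robotparser` to test URLs against a set of robots.txt rules. The rules, the `my-scraper` user agent name and the example.com URLs are all made up so the example runs offline; against a real site you would call `set_url(...)` and `read()` on the parser instead of `parse(...)`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so nothing is downloaded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "my-scraper"  # placeholder name for our scraper


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether our user agent may fetch each URL under these rules.
allowed = parser.can_fetch(USER_AGENT, "https://example.com/catalogue/page-1.html")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data.html")

# Use the site's Crawl-delay if it declares one, otherwise fall back to 1 second.
delay = parser.crawl_delay(USER_AGENT) or 1

print(allowed, blocked, delay)
```

The resulting `delay` can then be passed to `time.sleep(...)` between requests.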

In this notebook we will learn how to scrape websites responsibly using BeautifulSoup. For more information, see the BeautifulSoup documentation: https://beautiful-soup-4.readthedocs.io/en/latest/

In [1]:
# Imports
import requests
from bs4 import BeautifulSoup

### Books to Scrape : https://books.toscrape.com/
A fictional bookstore that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well.

In [2]:
# Request the HTML script from the URL
url = 'https://books.toscrape.com/'
response = requests.get(url)

In [3]:
# Check the response status
response.status_code

200

In [4]:
# Get the HTML script 😂
html_script = response.text
html_script



In [5]:
# Try the print function
print(html_script)

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    All products | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
        <meta name="created" content="24th Jun 2016 09:29" />
        <meta name="description" content="" />
        <meta name="viewport" content="width=device-width" />
        <meta name="robots" content="NOARCHIVE,NOCACHE" />

        <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
        <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->

        
            <link rel="shortcut icon" href="static/oscar/favicon.

In [6]:
# Let us parse the script using BeautifulSoup
bs_books_pg = BeautifulSoup(html_script, 'html.parser')
bs_books_pg

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

In [7]:
# What is the difference?
print(type(html_script))

print(type(bs_books_pg))

<class 'str'>
<class 'bs4.BeautifulSoup'>
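
To make the difference a little more concrete, here is a small offline sketch using a trimmed copy of the `<title>` element from the page. The string only supports text operations, while the parsed object lets us navigate to a tag directly. The variable names are chosen for this example only.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the downloaded page.
html = "<html><head><title>\n    All products | Books to Scrape - Sandbox\n</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# A plain string has no notion of tags, so there is no .select() on it.
string_can_select = hasattr(html, "select")

# The parsed document exposes tags as attributes and supports CSS selectors.
soup_can_select = hasattr(soup, "select")
page_title = soup.title.get_text(strip=True)

print(string_can_select, soup_can_select, page_title)
```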


In [8]:
# Let us now check which HTML tags are available for one of the books on the page
bs_books_pg.select('li.col-xs-6')[0]


<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">Â£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>
</li>
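
The `Â£` in the price above is most likely a decoding artifact rather than part of the page. The page declares UTF-8 in its `<meta>` tag, but `requests` falls back to ISO-8859-1 for `text/*` responses whose `Content-Type` header carries no charset. The sketch below reproduces the effect offline; the remedies in the comments are suggestions that assume this is indeed the cause.

```python
# The pound sign is two bytes in UTF-8; reading them as ISO-8859-1 yields two characters.
price_bytes = "£51.77".encode("utf-8")

garbled = price_bytes.decode("iso-8859-1")
correct = price_bytes.decode("utf-8")

print(garbled)
print(correct)

# Possible remedies in the notebook (pick one):
#   response.encoding = "utf-8"                      # set before reading response.text
#   BeautifulSoup(response.content, "html.parser")   # pass bytes and let bs4 detect the encoding
```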

In [None]:
# Try to get the following info from one book: Book name, Rating, Price, In Stock
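
One possible way to approach this, sketched offline against a trimmed copy of the `<li>` element shown above. The `parse_book` helper and the dictionary keys are names chosen for this example, not something the site or BeautifulSoup prescribes.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first book's markup (price written with the pound sign decoded correctly).
BOOK_HTML = """
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<p class="star-rating Three"></p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
    In stock
</p>
</div>
</article>
</li>
"""


def parse_book(book_tag):
    """Extract name, rating, price and stock status from one book element."""
    # The link text is truncated ("A Light in the ..."), so read the title attribute.
    name = book_tag.select_one("h3 a")["title"]
    # The rating is the second class on the star-rating paragraph, e.g. "Three".
    rating_classes = book_tag.select_one("p.star-rating")["class"]
    rating = [c for c in rating_classes if c != "star-rating"][0]
    price = book_tag.select_one("p.price_color").get_text(strip=True)
    availability = book_tag.select_one("p.availability").get_text(strip=True)
    return {
        "name": name,
        "rating": rating,
        "price": price,
        "in_stock": availability == "In stock",
    }


book = BeautifulSoup(BOOK_HTML, "html.parser").select_one("li.col-xs-6")
info = parse_book(book)
print(info)
```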


In [None]:
# Now let us get the info for all the books on the page
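
A minimal sketch of one way to do this: loop over every `article.product_pod` element and collect one dictionary per book. It runs offline on a tiny stand-in page. The second book, the `parse_books` helper and the `RATING_WORDS` mapping are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the listing page. The first entry mirrors the book shown
# earlier; the second is a made-up placeholder so the loop has two items.
PAGE_HTML = """
<ol class="row">
  <li><article class="product_pod">
    <p class="star-rating Three"></p>
    <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
    <p class="price_color">£51.77</p>
    <p class="instock availability"> In stock </p>
  </article></li>
  <li><article class="product_pod">
    <p class="star-rating One"></p>
    <h3><a href="#" title="Placeholder Book">Placeholder Book</a></h3>
    <p class="price_color">£10.00</p>
    <p class="instock availability"> In stock </p>
  </article></li>
</ol>
"""

# Assumption: ratings are encoded as the class words One..Five.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_books(soup):
    """Return one dict per <article class="product_pod"> found in the soup."""
    books = []
    for pod in soup.select("article.product_pod"):
        classes = pod.select_one("p.star-rating")["class"]
        rating = next((RATING_WORDS[c] for c in classes if c in RATING_WORDS), None)
        price_text = pod.select_one("p.price_color").get_text(strip=True)
        books.append({
            "name": pod.select_one("h3 a")["title"],
            "rating": rating,
            # Strip the currency sign (and a stray "Â" if the page was mis-decoded).
            "price": float(price_text.lstrip("Â£")),
            "in_stock": "In stock" in pod.select_one("p.availability").get_text(),
        })
    return books


books = parse_books(BeautifulSoup(PAGE_HTML, "html.parser"))
print(books)
```

In the notebook, the same function could be called on `bs_books_pg` instead of the stand-in page.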


In [None]:
# Now let us get the info for all the books listed on all the pages 😅
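
One way to walk through the pages is to follow the "next" link until it disappears, pausing between requests as the guidelines suggest. The sketch below assumes the pagination link can be found with the selector `li.next a`, which is worth confirming in the browser's Inspect view. `fetch_page` and `iter_pages` are hypothetical helper names.

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_page(url):
    """Download a page and return its raw bytes (bs4 detects the encoding)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return response.content


def iter_pages(start_url, fetch=fetch_page, delay=1.0, max_pages=None):
    """Yield (url, soup) for each page, following the 'next' link."""
    url, pages_seen = start_url, 0
    while url:
        soup = BeautifulSoup(fetch(url), "html.parser")
        yield url, soup
        pages_seen += 1
        # Assumption: the pagination link lives at <li class="next"><a href="...">.
        next_link = soup.select_one("li.next a")
        if next_link is None or (max_pages is not None and pages_seen >= max_pages):
            break
        # The href is relative, so resolve it against the current page's URL.
        url = urljoin(url, next_link["href"])
        time.sleep(delay)  # be polite between requests


# Example usage (performs real requests, so it is left commented out):
# for url, soup in iter_pages("https://books.toscrape.com/", max_pages=3):
#     print(url, len(soup.select("article.product_pod")))
```

Passing `fetch` as a parameter keeps the crawling logic testable without touching the network, and `max_pages` gives a safety limit while experimenting.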


In [None]:
# What else can we do? 🤔


### Bahrain Airport Arrivals : https://www.bahrainairport.bh/flight-arrivals
⚠️ First, check if we are allowed to scrape the page using __robots.txt__ :
https://www.bahrainairport.bh/robots.txt

In [None]:
bahrain_airport_arrival_url = 'https://www.bahrainairport.bh/flight-arrivals'
# Request the HTML script

# Check the response status code


In [None]:
# View the script


In [None]:
# Let us use BeautifulSoup to get access to the Tags


In [None]:
# Inspect the first flight and look at what info we can gather. Refer to the website's HTML code (Inspect View)
# to make sure that all flights are in the same format


In [None]:
# Gather all the info needed, then put it in a DataFrame
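
Once each flight has been turned into a dictionary, building the table is a one-liner with pandas. The records below are made-up placeholders, and the keys (`flight`, `origin`, `scheduled`, `status`) are hypothetical: the real column names depend on what the arrivals page actually exposes.

```python
import pandas as pd

# Made-up placeholder records standing in for the scraped flights.
flights = [
    {"flight": "XX100", "origin": "City A", "scheduled": "10:15", "status": "Landed"},
    {"flight": "YY200", "origin": "City B", "scheduled": "11:40", "status": "Delayed"},
]

# A list of dicts becomes one row per dict, one column per key.
flights_df = pd.DataFrame(flights)
print(flights_df)
```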


## __Web Scraping Exercise__

### Wikipedia Tables:
Take your pick of any of the following tables:
- __List of stadiums by capacity :__ https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity
- __List of megaprojects :__ https://en.wikipedia.org/wiki/List_of_megaprojects#
- __List of Bahraini records in athletics :__ https://en.wikipedia.org/wiki/List_of_Bahraini_records_in_athletics#
- __List of tallest buildings :__ https://en.wikipedia.org/wiki/List_of_tallest_buildings
- __List of best-selling comic series :__ https://en.wikipedia.org/wiki/List_of_best-selling_comic_series 

In [157]:
# Request the HTML script

# Check the response status code

In [159]:
# View the script


In [161]:
# Let us use BeautifulSoup to get access to the Tags


In [163]:
# Inspect the Tags and try to get the data
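
As a starting point, the sketch below pulls the header and rows out of a table with the `wikitable` class, which Wikipedia uses for its data tables. It runs offline on a tiny made-up table; `parse_wikitable` is a hypothetical helper, and real tables may need extra handling for merged cells (`rowspan` / `colspan`).

```python
from bs4 import BeautifulSoup

# Tiny made-up table in the same shape as a Wikipedia data table.
TABLE_HTML = """
<table class="wikitable sortable">
<tr><th>Name</th><th>Capacity</th></tr>
<tr><td>Stadium A</td><td>100,000<sup>[1]</sup></td></tr>
<tr><td>Stadium B</td><td>90,000</td></tr>
</table>
"""


def parse_wikitable(table):
    """Return (header, rows) as lists of cell strings."""
    rows = []
    for tr in table.select("tr"):
        # Drop footnote markers such as [1] before reading the cell text.
        for sup in tr.select("sup"):
            sup.decompose()
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    header, *body = rows
    return header, body


soup = BeautifulSoup(TABLE_HTML, "html.parser")
header, body = parse_wikitable(soup.select_one("table.wikitable"))
print(header)
print(body)
```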

