# Web Crawling and Web Scraping

Sometimes a website offers an API but no well-written package in the language you prefer (e.g. only Java but no Python libraries). Even worse, there may be no public API at all, and we have to design a scraper to retrieve the relevant information we want. In such cases, we can build our own wrapper functions.

Web crawling and web scraping are two related techniques used to extract information from websites.

Web crawling, also known as web indexing or web spidering, is the process of automatically exploring and indexing web pages on the internet. Web crawlers, also called spiders, bots, or robots, navigate through websites, follow links, and index the content of the pages they encounter. Search engines like Google and Bing use web crawlers to build their indexes of web pages, which enables users to find information easily.
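The link-following idea can be sketched in a few lines. This is a minimal illustration, not how a production crawler works: the `PAGES` dictionary is a made-up in-memory "website" standing in for real HTTP downloads, and `LinkCollector` and `crawl` are hypothetical helper names.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website": URL -> HTML. A real crawler would
# download each page over HTTP instead of reading from this dict.
PAGES = {
    "https://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "https://example.com/a.html": '<a href="/b.html">B</a> <a href="/">Home</a>',
    "https://example.com/b.html": '<a href="/c.html">C</a>',
    "https://example.com/c.html": "<p>No links here.</p>",
}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, pages):
    """Breadth-first crawl: visit each reachable page exactly once."""
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # page not available, skip it
            continue
        visited.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


order = crawl("https://example.com/", PAGES)
print(order)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.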

Web scraping, on the other hand, is the process of extracting specific data from web pages. Web scraping involves analyzing the HTML structure of a webpage, identifying the relevant information, and extracting it into a structured format such as a CSV or JSON file. Web scraping can be used to extract product information, pricing data, news articles, and more.

Web crawling and web scraping can be done manually, but it's often more efficient to use specialized software tools. Python is a popular language for web crawling and web scraping, and there are many libraries available, including BeautifulSoup, Scrapy, and Selenium.

However, it's important to note that web scraping can raise legal and ethical concerns, particularly if done without permission or in violation of website terms of service. Web scraping can also put a strain on website servers, potentially causing them to crash or become unavailable. As such, it's important to use web scraping responsibly and within legal and ethical boundaries.
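One simple way to reduce the load on a server is to pause between requests. The sketch below assumes a hypothetical helper `fetch_politely` and uses a stand-in `fake_fetch` function so that it runs without network access; the one-second default delay is an arbitrary choice, not a recommended standard.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results[url] = fetch(url)
    return results


# Stand-in fetch function so the sketch runs without network access.
def fake_fetch(url):
    return f"<html>content of {url}</html>"


pages = fetch_politely(
    ["https://example.com/1", "https://example.com/2"],
    fetch=fake_fetch,
    delay_seconds=0.1,
)
print(len(pages))
```

With the `requests` package, the stand-in could be replaced by something like `fetch=lambda u: requests.get(u, timeout=10).text`.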

##### Preliminary examples

Examples from <a href="https://www.w3schools.com/html/tryit.asp?filename=tryhtml_basic_document" target="_blank">w3schools</a>.

```html
<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>

<p>My 1st paragraph.</p>
<p>My 2nd paragraph.</p>
<p>My 3rd paragraph.</p>

</body>
</html>
```

Save this code to disk as `sample.html` (or any other name) inside a `data` folder, since the code below reads `data/sample.html`. We will use a great library called ___`Beautiful Soup`___ to read its contents from Python. You may also need to install `lxml`, a parser for HTML and XML.

    pip install beautifulsoup4 lxml

In [15]:
pip install beautifulsoup4 lxml selenium


In [1]:
# Import the libraries used in this notebook (run the install cell above first if needed)
from bs4 import BeautifulSoup as Soup
from selenium import webdriver
import re
import time
import sqlite3

In [2]:
with open("data/sample.html", "r") as sample:
    sample_contents = sample.read()

The raw string does not show the structure of the HTML clearly. This is where BeautifulSoup comes in, and it is really handy!

In [3]:
sample_contents

'<!DOCTYPE html>\n<html>\n<body>\n\n<h1>My First Heading</h1>\n\n<p>My 1st paragraph.</p>\n<p>My 2nd paragraph.</p>\n<p>My 3rd paragraph.</p>\n\n</body>\n</html>'

In [4]:
type(sample_contents)

str

In [5]:
sample_soup = Soup(sample_contents, 'lxml')

In [6]:
type(sample_soup)

bs4.BeautifulSoup

By printing it with `prettify()`, we can see the same contents as shown above, with proper indentation.

In [7]:
print(sample_soup.prettify())

<!DOCTYPE html>
<html>
 <body>
  <h1>
   My First Heading
  </h1>
  <p>
   My 1st paragraph.
  </p>
  <p>
   My 2nd paragraph.
  </p>
  <p>
   My 3rd paragraph.
  </p>
 </body>
</html>



Get the contents of interest: all the `p`'s

_`p` means paragraph in HTML. Check more tag definitions on [w3schools.com](https://www.w3schools.com/tags/default.asp)._

In [8]:
p_tags = sample_soup.find_all("p")

In [9]:
p_tags

[<p>My 1st paragraph.</p>, <p>My 2nd paragraph.</p>, <p>My 3rd paragraph.</p>]

In [10]:
p_tags[1]

<p>My 2nd paragraph.</p>

In [11]:
type(p_tags[0])

bs4.element.Tag

For each `p` tag, we extract its text.

In [12]:
for p in p_tags:
    print(p.text)

My 1st paragraph.
My 2nd paragraph.
My 3rd paragraph.
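As mentioned in the introduction, scraped data usually ends up in a structured format such as JSON or CSV. The sketch below assumes the paragraph texts have already been collected (as `[p.text for p in p_tags]` would give); the field names `position` and `text` are illustrative choices.

```python
import csv
import io
import json

# Texts as they would come out of `[p.text for p in p_tags]` above.
paragraphs = ["My 1st paragraph.", "My 2nd paragraph.", "My 3rd paragraph."]

# One record per paragraph; the field names are illustrative.
records = [{"position": i, "text": text} for i, text in enumerate(paragraphs, start=1)]

# JSON: a single string that could be written to a .json file.
json_text = json.dumps(records, indent=2)

# CSV: written to an in-memory buffer here; pass a file opened with
# newline="" instead to save it to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["position", "text"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

print(json_text)
print(csv_text)
```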


---

#### A real example

Let's use a real website for illustration. For example, suppose we are interested in the Danish Parliament's webpage for handling citizen proposals, [borgerforslag](https://www.borgerforslag.dk).

To view the source, that is, the real structure of a web page, you can use the ___`developer tools`___ in your browser.

Recall that [`requests`](http://docs.python-requests.org/) is a convenient package for sending HTTP requests.

In [13]:
import requests

In [23]:
url = "https://www.expedia.ie/Flights-Search?leg1=from%3ABillund%20%28BLL%29%2Cto%3AM%C3%A1laga%20%28AGP%29%2Cdeparture%3A17%2F4%2F2024TANYT&mode=search&options=carrier%3A%2Ccabinclass%3A%2Cmaxhops%3A1%2Cnopenalty%3AN&pageId=0&passengers=adults%3A1%2Cchildren%3A0%2Cinfantinlap%3AN&trip=oneway"

r = requests.get(url)
r.status_code

200
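A long search URL like the one above is easier to read once its query string is decoded. The sketch below uses only the standard library and a shortened copy of that URL; which parameters the site actually requires is not something this example can tell you.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Shortened copy of the search URL used above.
url = (
    "https://www.expedia.ie/Flights-Search"
    "?leg1=from%3ABillund%20%28BLL%29%2Cto%3AM%C3%A1laga%20%28AGP%29"
    "%2Cdeparture%3A17%2F4%2F2024TANYT"
    "&mode=search"
    "&passengers=adults%3A1%2Cchildren%3A0%2Cinfantinlap%3AN"
    "&trip=oneway"
)

# Decode the query string into a dict of parameter -> list of values.
params = parse_qs(urlsplit(url).query)
for name, values in params.items():
    print(name, "=", values[0])

# Build a query string the other way round from readable values.
query = urlencode({"mode": "search", "trip": "oneway"})
print(query)
```

When using `requests`, the same encoding is done for you if you pass a dictionary through the `params` argument of `requests.get`.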

In [24]:
r.content



Convert it to a soup object

In [25]:
soup = Soup(r.content, 'html.parser')

Find the corresponding tags. Note that `class_` has a trailing underscore `_`, because `class` is a reserved word in Python.

In [26]:
soup

<!DOCTYPE html>
/*# sourceMappingURL=expedia.227e80fb55358a894550a991faa2850b.css.map*/
</style><style>@font-face {font-display: optional;font-family: "Centra No2";font-style: normal;font-weight: 300;src:url("https://a.travel-assets.com/egds/fonts/CentraNo2/CentraNo2-Light.woff2") format("woff2");unicode-range: U+000-0FF;}@font-face {font-display: optional;font-family: "Centra No2";font-style: normal;font-weight: 400;src:url("https://a.travel-assets.com/egds/fonts/CentraNo2/CentraNo2-Book.woff2") format("woff2");unicode-range: U+000-0FF;}@font-face {font-display: optional;font-family: "Centra No2";font-style: normal;font-weight: 500;src:url("https://a.travel-assets.com/egds/fonts/CentraNo2/CentraNo2-Medium.woff2") format("woff2");unicode-range: U+000-0FF;}@font-face {font-display: optional;font-family: "Centra No2";font-style: normal;font-weight: 700;src:url("https://a.travel-assets.com/egds/fonts/CentraNo2/CentraNo2-Bold.woff2") format("woff2");unicode-range: U+000-0FF;}:root {font-fa

In [27]:
# Find flight schedule and price data elements
# (check the actual tag and class names with the browser's developer tools)
schedule_elements = soup.find_all("div", class_="flight-schedule")
price_elements = soup.find_all("span", class_="flight-price")

In [28]:
# Extract and print flight schedule information
# (the "departure-time" class name is an assumption; verify it on the actual page)
for schedule_element in schedule_elements:
    flight_number = schedule_element.find("span", class_="flight-number").text
    departure_time = schedule_element.find("span", class_="departure-time").text
    arrival_time = schedule_element.find("span", class_="arrival-time").text
    print(f"Flight {flight_number}: Departure {departure_time} - Arrival {arrival_time}")

In [29]:
# Extract and print flight price information
for index, price_element in enumerate(price_elements):
    price = price_element.text
    print(f"Price for Flight {index+1}: {price}")
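The loops above assume that every element contains all the expected tags. `find()` returns `None` when nothing matches, so calling `.text` on the result would then raise an `AttributeError`. If the loops print nothing at all, the class names may simply not exist in the HTML that `requests` received. The sketch below shows one defensive pattern on a small inline snippet; the markup, the class names, and the helper `text_or_none` are made up for illustration and are not taken from any real flight-search page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the tag and class names are made up for illustration
# and are not taken from any real flight-search page.
html = """
<div class="flight-schedule">
  <span class="flight-number">AB123</span>
  <span class="departure-time">08:15</span>
  <span class="arrival-time">11:40</span>
</div>
<div class="flight-schedule">
  <span class="flight-number">CD456</span>
  <span class="departure-time">13:05</span>
</div>
"""

demo_soup = BeautifulSoup(html, "html.parser")


def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first match, or None if there is none."""
    found = parent.find(tag, class_=css_class)  # class_ avoids the keyword `class`
    return found.get_text(strip=True) if found is not None else None


flights = []
for block in demo_soup.find_all("div", class_="flight-schedule"):
    flights.append(
        {
            "flight_number": text_or_none(block, "span", "flight-number"),
            "departure": text_or_none(block, "span", "departure-time"),
            "arrival": text_or_none(block, "span", "arrival-time"),
        }
    )

print(flights)
```

The second block has no arrival time, so its `arrival` value is `None` instead of the loop crashing.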


In [27]:
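# `summary_tag` is assumed to be a Tag located earlier on a borgerforslag.dk
# proposal page; its definition is not shown in this section.
borger_content = summary_tag.contents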

In [28]:
borger_content

['\n',
 <div id="SkTIHRKeh"><div class="_3dLODA" data-reactroot="" data-readaloud-ancestor="true"><div class="no-print _1NXxBn vFact_DoNotReadAloud" style="transform:translateY(-100%)"><div class="_28H2Ci"><div class="cc552X"><span>Udskiftning af dans i idrætundervisningen med alternative aktiviteter</span></div><a class="dFsu8t vFact_DoNotReadAloud _2CdOjN _202cZS _3CrCss" href="/se-og-stoet-forslag/cpr/?Id=FT-14316" role="button" style="min-width:180px"><span>Støt</span></a></div></div><div class="_3l86Vg" data-readaloud="true"><div><span>Startdato</span><strong>17. marts 2023</strong></div><div><span>Slutdato</span><strong>13. september 2023</strong></div><div><span>Antal støtter</span><strong>11</strong><div class="ssLhPH"><div class="_2KXwjO" style="width:0.7416198487095663%"></div></div></div></div><section data-readaloud="true"><span class="_3jrW-w">ID: <!-- -->FT-14316</span></section><button class="_3iicqI" disabled=""><svg height="28px" viewbox="0 0 20 20" width="28px" xmlns=

In [29]:
print(borger_content[2].text)

_components.push({"name":"ProposalEditor","props":{"proposalCreationViewModel":{"proposal":{"id":15727,"title":"Udskiftning af dans i idrætundervisningen med alternative aktiviteter","proposalContent":"Jeg foreslår, at dans fjernes fra den obligatoriske idrætundervisning i skolerne i Danmark og erstattes med alternative fysiske aktiviteter. Mange elever oplever udfordringer med dans, da det kan være en intimiderende og ubehagelig oplevelse for nogle elever. Desuden kan dansen have en kønsstereotyp påvirkning, der kan føre til kønsdiskrimination.\n\nVed at erstatte dansen med alternative fysiske aktiviteter vil eleverne stadig have mulighed for at deltage i sjove og udfordrende aktiviteter, der vil forbedre deres fysiske og mentale sundhed. Der er mange alternative aktiviteter, som kan udvikle elevernes styrke, smidighed, koordination og samarbejdsevner, som f.eks. yoga, fitness, svømning, eller andre idrætsgrene.\n\nJeg tror, at fjernelse af dans fra idrætundervisningen vil give elever

With this in mind, you can scrape almost any webpage of interest. Other formats such as <a href="http://www.json.org/" target="_blank">JSON</a> and <a href="https://www.w3.org/XML/" target="_blank">XML</a> are handled in much the same way, with a few differences.

***But keep in mind that you should act politely, with proper permission! To find out whether specific paths/contents are allowed to be scraped, you can check the site's ___`robots.txt`___. For example, <a href="https://www.google.com/robots.txt" target="_blank">here's</a> the permission information set by Google.***
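The `robots.txt` check can also be done from Python with the standard library's `urllib.robotparser`. The rules and the user agent name `my-scraper` below are made up for illustration; they are parsed from a string so the sketch runs offline.

```python
from urllib.robotparser import RobotFileParser

# Example rules, made up for illustration. For a real site you would call
# parser.set_url("https://<site>/robots.txt") followed by parser.read().
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed_public = parser.can_fetch("my-scraper", "https://example.com/articles/1")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")
print(allowed_public, allowed_private)
```

Note that `robots.txt` only expresses the site owner's wishes for automated access; the terms of service may impose further restrictions.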

---

Note that the examples we are using here are relatively simple. There are cases where we cannot reach paginated or scroll-loaded content with `requests` alone. In those cases, [Selenium](http://selenium-python.readthedocs.io/) will save our lives by ___simulating browsers___!
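As a rough sketch of the scrolling case, the hypothetical helper below keeps scrolling until the page height stops growing and then returns the rendered HTML, which can be passed to BeautifulSoup as before. It only relies on the WebDriver's `execute_script` method and `page_source` attribute; the pause length and the maximum number of rounds are arbitrary assumptions that depend on the site.

```python
import time


def scroll_to_bottom(driver, pause_seconds=1.0, max_rounds=10):
    """Scroll until the page height stops growing, then return the HTML."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new was loaded
            break
        last_height = new_height
    return driver.page_source


# Typical usage (requires a browser and its driver to be installed):
#
#     from selenium import webdriver
#     driver = webdriver.Chrome()
#     driver.get("https://example.com")
#     html = scroll_to_bottom(driver)
#     driver.quit()
```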

Some more tutorials/tools:

- https://scrapy.org/ (for building a crawler)
- https://www.dataquest.io/blog/web-scraping-tutorial-python/
- https://www.quora.com/Python-programming-language-1/How-is-BeautifulSoup-different-from-Scrapy

---

Return to [overview](../00_overview.ipynb)