<div class="frontmatter text-center">
<h1>Introduction to Data Science and Programming</h1>
<h2>Lecture 10: Python Crash Course - Web scraping</h2>
<h3>IT University of Copenhagen, Fall 2023</h3>
<h3>Instructor: Anastassia Vybornova</h3>
</div>

Today you will learn:

* What is HTML?
* `requests` library for sending HTTP requests
* `beautifulsoup4` library for parsing the HTML in HTTP responses

# HTML: HyperText Markup Language

* standard language for creating & structuring website content
* not a programming, but a **markup** language - describes the layout and formatting of a document (just like [markdown](https://www.markdownguide.org)!)

Basic idea:
- HTML consists of elements
- each element is delimited by a start tag and an end tag, written in angle brackets `<>`
- start tags can contain attributes 

```html
<title>MyHomepage</title>
```

* the tag is "title"
* the start tag is `<title>`
* the end tag is `</title>`
* everything between the start and end tag ("MyHomepage") is the text content of that tag

```html
<a href="https://itu.dk">MyUniversity</a>
```

* the tag is "a" (anchor tag: for hyperlinks)
* the start tag is `<a href="...">` with the **attribute** `href`
* the end tag is `</a>`
* everything between the start and end tag ("MyUniversity") is the text content of that tag
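
To see the three parts of an element (start tag with attributes, text content, end tag) one by one, here is a minimal sketch using `html.parser` from Python's standard library. The class name `ElementPrinter` and its `events` list are made up for this illustration; it is only meant to show the structure, not how you would normally read HTML.

```python
from html.parser import HTMLParser


class ElementPrinter(HTMLParser):
    """Hypothetical helper: records every start tag, text, and end tag it sees."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("href", "https://itu.dk")]
        self.events.append(("start tag", tag, dict(attrs)))

    def handle_data(self, data):
        self.events.append(("text content", data))

    def handle_endtag(self, tag):
        self.events.append(("end tag", tag))


parser = ElementPrinter()
parser.feed('<a href="https://itu.dk">MyUniversity</a>')
parser.close()

for event in parser.events:
    print(event)
```

The printed events should mirror the bullet points above: one start tag `a` with the attribute `href`, the text content, and one end tag `a`.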

> shown in sublime text editor: how to create an html document

# Try it out yourself: My first HTML file

* Copy-paste the HTML-formatted text below into an empty text document
* Fill the gaps "INSERT..." with the text of your choice
* Save the file with an `.html` extension
* Open the file - it should automatically open in a browser; does it look like you expected?

```html
<title>INSERT TITLE HERE</title>
<h1>INSERT HEADER 1 HERE</h1>
<p>INSERT A PARAGRAPH OF TEXT HERE</p> 
<h2>INSERT HEADER 2 HERE</h2>
<p>INSERT ANOTHER PARAGRAPH OF TEXT HERE</p>
<a href="INSERT A LINK HERE">INSERT SOME TEXT HERE</a>
```

# Source code of websites

Now that we have an intuition of how HTML works, let's look at the HTML "source code" of some websites!

The exact way to do this depends on the browser you're using.

For most browsers (Firefox, Microsoft Edge, Safari, Google Chrome): 

1. Open website
2. Right-click somewhere on the page
3. Click `view page source` (or similar)

> shown in Safari: source code of [Michael's homepage](http://michael.szell.net)

# Try it out yourself

* Go to [booking.com](http://booking.com) & open the source code of the website
* Can you find the link for "Unpacked: Travel articles" (at the very bottom of the page) **in the source code**?

<p style="text-align:left;">
    <img src="booking.png" alt="booking.com" width="90%">
</p>

# HTTP: HyperText Transfer Protocol

Protocol used to access data on the Web. Main (SIMPLIFIED) idea:
* **Client** sends **HTTP request** to the server
* **Server** replies with an **HTTP response**

We can do this in a web browser... but we can also do it with Python!

> shown: clientrequest - serverresponse in browser

# `requests` library

(A "library" is roughly a "collection of packages", but the two terms are often used interchangeably.)

`requests` is a Python package for sending HTTP requests and receiving HTTP responses.

We will see 2 examples:
* a [website](https://raw.githubusercontent.com/anastassiavybornova/pythoncrashcourse/main/exercises/exercise10/quote.txt) containing only text (no HTML formatting)
* a [website](http://michael.szell.net/) containing HTML

> websites shown


**Sending an HTTP request with Python, example 1 (pure text)**

In [None]:
# first, import the library
import requests

In [None]:
# the function requests.get(url) sends a request
# and returns a response
my_url = "https://raw.githubusercontent.com/anastassiavybornova/pythoncrashcourse/main/exercises/exercise10/quote.txt"
requests.get(my_url)

In [None]:
# save response to a variable to explore it
my_response = requests.get(my_url)

In [None]:
# my_response is of type Response
type(my_response)

In [None]:
# ... and has several attributes, for example:
print(".encoding attribute:", my_response.encoding)
print(".url attribute:", my_response.url)
print(".text attribute:", my_response.text)

In [None]:
# save the text attribute of the response to a variable:
my_text = my_response.text
print(type(my_text))
print(my_text)

**Sending an HTTP request with Python, example 2 (HTML content)**

In [None]:
# the function requests.get(url) sends a request
# and returns a response
my_url = "http://michael.szell.net/"
requests.get(my_url)

In [None]:
# save response to a variable to explore it
my_response = requests.get(my_url)

In [None]:
# my_response is of type Response
type(my_response)

In [None]:
# ... and has several attributes, for example:
print(".encoding attribute:", my_response.encoding)
print(".url attribute:", my_response.url)

In [None]:
# ... but most importantly, the .content attribute contains the
# body of the response (here: the HTML of the page), as bytes
my_response.content
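
The difference between `.content` and `.text` is the difference between raw bytes and a decoded string. The sketch below imitates this without any network access; `raw_bytes` is a made-up stand-in for what a server might send, and we assume the encoding is UTF-8.

```python
# made-up stand-in for my_response.content: the raw bytes sent by a server
raw_bytes = "<title>Smørrebrød</title>".encode("utf-8")
print(type(raw_bytes))
print(raw_bytes)

# roughly what my_response.text gives you: the bytes decoded into a string
# (here we assume UTF-8; requests uses the encoding in .encoding when it is set)
decoded_text = raw_bytes.decode("utf-8")
print(type(decoded_text))
print(decoded_text)
```

Note how the letter "ø" is readable in the decoded string but shows up as escaped byte values in the raw bytes.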

# Try it out yourself

1. Use `requests.get()` and `.text` to save the text from link 1 below to a variable, `my_text`.
2. Use `requests.get()` and `.content` to save the content from link 2 below to a variable, `my_content`.

Link 1: https://raw.githubusercontent.com/anastassiavybornova/pythoncrashcourse/main/exercises/exercise10/quote2.txt

Link 2: https://www.youtube.com/watch?v=JBdxhFzTQ4s

In [None]:
# save text from link1 into a variable, my_text
my_response = requests.get("https://raw.githubusercontent.com/anastassiavybornova/pythoncrashcourse/main/exercises/exercise10/quote2.txt")
my_text = my_response.text
print(my_text)

In [None]:
# save content from link2 into a variable, my_content
my_response = requests.get("https://www.youtube.com/watch?v=JBdxhFzTQ4s")
my_content = my_response.content
my_content

# Is there a Python package to help me read HTML?

Yes, there is: `beautifulsoup4`!

We will use `beautifulsoup4` to "parse" (read) the content of our HTTP response and to extract information from it.

In [None]:
# import beautifulsoup4
import bs4

In [None]:
# bs4.BeautifulSoup() creates a "soup" object from the content of an HTTP response;
# "html.parser" tells it to use Python's built-in HTML parser
my_response = requests.get("http://michael.szell.net/")
my_content = my_response.content
my_soup = bs4.BeautifulSoup(my_content, "html.parser")

In [None]:
# what is this "my_soup" object?
print(type(my_soup))
print(my_soup)

**Methods for the "BeautifulSoup" object**
* `.find(tag)` returns the FIRST matching tag (or `None` if there is no such tag)
* `.find_all(tag)` returns ALL matching tags
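
To compare the two methods without downloading anything, here is a small sketch on a made-up HTML snippet (`tiny_html` and its contents are invented for this example). We pass `"html.parser"` to tell BeautifulSoup explicitly which parser to use.

```python
import bs4

# made-up HTML snippet with two links
tiny_html = """
<h1>My links</h1>
<a href="https://itu.dk">MyUniversity</a>
<a href="https://www.wikipedia.org">Wikipedia</a>
"""
tiny_soup = bs4.BeautifulSoup(tiny_html, "html.parser")

first_link = tiny_soup.find("a")      # a single Tag object
all_links = tiny_soup.find_all("a")   # a list-like collection of Tag objects
no_table = tiny_soup.find("table")    # None, because there is no <table> tag

print(first_link)
print(len(all_links))
print(no_table)
```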

In [None]:
# this will find the FIRST tag "a" (i.e. the first link)
my_soup.find("a")

In [None]:
# this will find ALL tags "a" (i.e. all links)
my_soup.find_all("a")

In [None]:
# .find(tag) returns a TAG object:
link_found = my_soup.find("a")
print(link_found)
print(type(link_found))

**Methods for the "Tag" object**
* `.get(attribute)` returns the specified attribute of the tag

In [None]:
# to access the attribute "href",
# which contains the actual URL,
# use the method .get(attribute):
print(link_found)
link_found.get("href")

In [None]:
# this gives us the URL as a string:
my_hyperlink = link_found.get("href")
print(my_hyperlink)
print(type(my_hyperlink))
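
One detail worth knowing: `.get()` returns `None` when a tag lacks the requested attribute, which matters as soon as you loop over many tags. A minimal sketch on a made-up snippet (all variable names are invented for this example); it also uses the `.text` attribute of a tag, which holds the text between the start and end tag:

```python
import bs4

# made-up snippet: the second <a> tag has no href attribute
tiny_html = '<a href="https://itu.dk">MyUniversity</a><a name="top">Top</a>'
tiny_soup = bs4.BeautifulSoup(tiny_html, "html.parser")

for tag in tiny_soup.find_all("a"):
    # .get() returns the attribute value, or None if the attribute is missing
    print(tag.text, "->", tag.get("href"))
```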

# Summing up: How do you find the first link on a website with Python?

1. `requests.get()` to send an HTTP request
2. `bs4.BeautifulSoup()` to parse the `.content` attribute of the HTTP response
3. `.find("a")` to find the first `<a>` tag (on the BeautifulSoup object)
4. `.get("href")` to get the URL (on the Tag object)

In [None]:
my_url = "https://www.wikipedia.org"
my_response = requests.get(my_url)
my_soup = bs4.BeautifulSoup(my_response.content, "html.parser")
my_tag = my_soup.find("a")
my_hyperlink = my_tag.get("href")
print(my_hyperlink)

# Summing up: How do you find all links on a website with Python?

1. `requests.get()` to send an HTTP request
2. `bs4.BeautifulSoup()` to parse the `.content` attribute of the HTTP response
3. `.find_all("a")` to find all `<a>` tags
4. `.get("href")` (on each tag that **has** an href attribute) to get the URL

In [None]:
my_url = "https://itu.dk"
my_response = requests.get(my_url)
my_soup = bs4.BeautifulSoup(my_response.content, "html.parser")
my_tags = my_soup.find_all("a")
print(len(my_tags))
my_tags

In [None]:
# we have 282 tags! but some of them don't have an href, so let's get rid of them:
my_tags_href = [tag for tag in my_tags if tag.get("href")]
print(len(my_tags_href))
my_tags_href

In [None]:
# let's now get all the hyperlinks
my_hrefs = [tag.get("href") for tag in my_tags_href]
my_hrefs

In [None]:
# let's now only keep the links to *external* websites
# (simple heuristic: full web addresses start with "http")
my_hrefs_external = [href for href in my_hrefs if href.startswith("http")]
print(len(my_hrefs_external))
my_hrefs_external
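
Filtering on the text `http` is only a rough heuristic: a page can also link to itself with a full `http...` address. A somewhat more careful sketch, assuming that "external" means "hosted on a different domain", compares host names using `urllib.parse` from the standard library. The list `example_hrefs` and the function `is_external` are made up for this illustration.

```python
from urllib.parse import urljoin, urlparse

base_url = "https://itu.dk"

# made-up hrefs, similar to what my_hrefs might contain
example_hrefs = [
    "/research",                      # relative link (same website)
    "https://itu.dk/education",       # absolute link (same website)
    "https://www.wikipedia.org",      # absolute link (other website)
    "mailto:someone@example.com",     # not a web page at all
]


def is_external(href, base_url):
    """Return True if href points to a web page on a different host than base_url."""
    full_url = urljoin(base_url, href)  # turns relative links into absolute ones
    parsed = urlparse(full_url)
    if parsed.scheme not in ("http", "https"):
        return False  # e.g. mailto: links
    return parsed.netloc != urlparse(base_url).netloc


external = [href for href in example_hrefs if is_external(href, base_url)]
print(external)
```

Note that this simple comparison treats `www.itu.dk` and `itu.dk` as two different hosts, so it is still only an approximation.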

# What is "web scraping"?

Web scraping means downloading and processing data from the internet in an automated fashion.

We just learned how to do this in Python, using 2 packages:
* `requests` to **download** website content in HTML
* `beautifulsoup4` to **read and process** HTML
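
The two steps can be wrapped into small reusable functions. This is only a sketch: the function names `links_from_html` and `links_from_url` are made up, and the `timeout` value of 10 seconds is an arbitrary choice.

```python
import bs4
import requests


def links_from_html(html):
    """Return the href of every <a> tag in html that has one."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    return [tag.get("href") for tag in soup.find_all("a") if tag.get("href")]


def links_from_url(url):
    """Download the page at url and return all its hyperlinks."""
    # stop waiting if the server does not answer within 10 seconds
    response = requests.get(url, timeout=10)
    # raise an error for unsuccessful responses, e.g. "404 Not Found"
    response.raise_for_status()
    return links_from_html(response.content)


# works without internet access on a made-up snippet:
print(links_from_html('<a href="https://itu.dk">ITU</a> <a>no link here</a>'))
```

With an internet connection, `links_from_url("https://itu.dk")` would download the ITU homepage and return its links in the same way.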