# Session 7: Web Scraping 2, HTML and parsing

*Hjalte Fejerskov Boas*

## Recap

Recall the different steps in web scraping:
1. Mapping (session 6):
    - We learned how to use the structure of the URL to go through all the webpages we want to scrape
2. Downloading (session 6):
    - We learned how to download the HTML code of webpages
    - We learned how to use the network panel to download data directly from the webpage's server
3. Parsing (this session)

In this session we will learn how to parse the downloaded HTML into meaningful, structured data.

## Required readings

- [Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/)

- [A Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/)

# Overview of Session 7

1. What is HTML?
    - How does the tree structure work?
2. How can we find our way in the HTML code? I.e. find the data we need (parse the HTML)
    - Regex
    - CSS selectors
    - BeautifulSoup
        - Today we will mainly spend time on BeautifulSoup

## Introduction to HTML

### Recall from previous session

How a human sees a webpage             |  How a computer sees a webpage (**HTML**)
:-------------------------:|:-------------------------:
![](https://drive.google.com/thumbnail?id=1cbrC303j-gQnXbXyTEQBPT2xH7kgz6Cy&sz=w1000)  |  ![](https://drive.google.com/thumbnail?id=1VFlfDcJHCzbtmkpr4kvXzGecrDE7KmLY&sz=w1000)

## [What is HTML?](https://www.w3schools.com/html/html_intro.asp)  

HTML (HyperText Markup Language) is the standard markup language for creating webpages

### HTML elements and tags

HTML consists of different elements: These elements tell your browser what to display and how to display it

An HTML element consists of a start tag, the element content and an end tag.
- The tag defines the content: for example, the tag ```<h1>``` defines the content as "a large heading"
- Example: 
```html 
<h1> My first heading </h1>
```

In the browser, the HTML above will show up like this: <h1> My first heading </h1>

### Important tags

Here are some examples of often used tags:
```html 
<h1> Defines a large heading </h1>
<p> Defines a paragraph </p>    
<div> Defines a section </div>
<a> Defines a link </a> 
<table> Defines a table </table> 
```

### Attributes to the HTML elements
Each element can have some [attributes](https://www.w3schools.com/html/html_attributes.asp)

- They are specified in the tags
- Example: 
```html 
<div class="myclass"> My first section </div>
```

### Important attributes
Here are some examples of often used attributes:
- class: Specifies a class for an HTML element (multiple elements can share the same class)
- id: Specifies a *unique* id for an HTML element
- href: Specifies the link's destination/URL (used in combination with the ```<a>``` tag)
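
To make tags and attributes concrete, here is a minimal sketch that reads them from a small hand-written HTML string with BeautifulSoup. The HTML snippet and the id `intro` are made up for illustration, and the sketch assumes Python's built-in `html.parser` instead of `lxml` so that it runs without extra installs.

```python
from bs4 import BeautifulSoup

# A small hand-written HTML snippet (made up for illustration)
html = '<div class="myclass" id="intro"><a href="https://www.dr.dk">DR</a></div>'

# 'html.parser' ships with Python; the rest of the session uses 'lxml'
soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div')
link = soup.find('a')

print(div.name)      # The tag name
print(div['class'])  # class comes back as a list, because an element can have several classes
print(div['id'])     # id comes back as a single string
print(link['href'])  # The link's destination
print(link.text)     # The element content
```

Attributes are read with square brackets, much like keys in a Python dictionary.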

### HTML is like a tree

An element is also called a node

A node can have more nodes inside it. The nodes inside are then called *children*

- Example: 
```html 
<div> 
    <p> My first paragraph </p>
</div>
```
In this example, ```<p>``` is the child, and ```<div>``` is the parent.
- You may come across expressions like *children*, *siblings*, *parents*, *descendants*
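
The family vocabulary maps directly onto BeautifulSoup. The following is a minimal sketch on a made-up HTML snippet (the id `parent` is an assumption for illustration), again using Python's built-in `html.parser`:

```python
from bs4 import BeautifulSoup

# Made-up HTML: one <div> with three children
html = """
<div id="parent">
    <h1>My first heading</h1>
    <p>My first paragraph</p>
    <p>My second paragraph</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div')
first_p = soup.find('p')

# Parent: the node that contains the first <p>
parent_name = first_p.parent.name

# Children: the tags directly inside the <div>
child_tags = [child.name for child in div.find_all(recursive=False)]

# Siblings: nodes that share the same parent
next_sibling_text = first_p.find_next_sibling('p').text
previous_sibling_name = first_p.find_previous_sibling().name

# Descendants: every tag below the <div>, at any depth
descendant_count = len(div.find_all())

print(parent_name, child_tags, next_sibling_text, previous_sibling_name, descendant_count)
```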

### Here is an example of an HTML tree (can you see the similarity with a family tree?) 
<img src="http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png"/>

# Video 7.1: Navigating the HTML tree, intro

## How do we find our way around the HTML tree?

The HTML contains the information that we are interested in!
- But how do we locate it?

### Three ways of finding the information you want:
1. Regex: Exploiting string patterns in HTML using regular expressions
2. CSS-selectors: Specifying paths in the tree using CSS-selectors
3. ```BeautifulSoup```: A Python package that makes it easy to navigate the HTML tree

### 1. Regex
**What is regex?**

Regex is used to define a search pattern in text

Suppose we want to search for all links in an HTML tree:
- We can then define a search pattern in regex that searches for "www." for example
- Using regex we will then find all the places in the HTML where it says "www."

Note: Regex only works on text/strings. So we need to convert our HTML tree into one large string before we can use regex on HTML

More about regex in session 8!
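
As a small taste, here is a sketch that pulls link destinations out of an HTML string with Python's built-in `re` module. The HTML string is made up for illustration, and the pattern assumes that every `href` value is wrapped in double quotes, which real webpages do not guarantee.

```python
import re

# The HTML as one large string (made up here; with requests it could be response.text)
html = '<a href="https://www.dr.dk/nyheder">News</a> <a href="https://www.dr.dk/sporten">Sport</a>'

# Capture everything between the quotes of each href attribute
links = re.findall(r'href="([^"]*)"', html)
print(links)
```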

### 2. [CSS Selectors ](https://en.wikipedia.org/wiki/CSS)
A CSS selector is used to select HTML elements ([How can you use a CSS selector?](https://www.scrapingbee.com/blog/python-web-scraping-beautiful-soup/))
- At first it will seem very similar to the BeautifulSoup way of selecting elements (which you will learn in a minute)
    - However, a CSS selector is useful when you cannot rely on *class* and *id* attributes (for example in very messily written HTML)

It is a neat way to define a path to a unique element or to multiple similar elements in the HTML tree

You can download a CSS Selector as a Google Chrome extension that will do the work for you: [SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### BeautifulSoup has a built-in CSS selector:

Just use the method `.select`

In [2]:
url = 'https://www.dr.dk/nyheder/udland'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml') #Make the BeautifulSoup object (soup): Take the HTML content as input and choose your parser (lxml)

In [3]:
# The CSS selector ".dre-title-text" selects all titles on the DR international news page
soup.select('.dre-title-text')[0].text #Selecting first title

'Skovbrand i Californien er tre gange så stor som Bornholm'
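
The selector `.dre-title-text` relies on a class. When no useful class is available, a selector can instead describe a path through the tree. Here is a minimal sketch on made-up HTML (the id `main` and the links are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up HTML without any class attributes
html = """
<div id="main">
    <ul>
        <li><a href="/a">First</a></li>
        <li><a href="/b">Second</a></li>
    </ul>
    <p><a href="/c">Third</a></p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Descendant path: every <a> inside an <li> inside the element with id "main"
list_links = [a['href'] for a in soup.select('#main li a')]

# Child combinator: only <a> elements that are direct children of a <p>
paragraph_links = [a['href'] for a in soup.select('p > a')]

# Attribute selector: <a> elements whose href starts with "/b"
b_links = [a.text for a in soup.select('a[href^="/b"]')]

# select_one returns only the first match
first = soup.select_one('li a').text

print(list_links, paragraph_links, b_links, first)
```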

### 3. Parsing HTML with BeautifulSoup
A third way to navigate the HTML tree is BeautifulSoup

It exploits the structure of tags and attributes

It allows you to:
- Search for elements by tag name and/or by attribute.
- Iterate through them, go up, sideways or down the tree.
- Handle standard tasks such as extracting raw text from HTML.
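
A minimal sketch of these operations on made-up HTML (the class names `teaser`, `footer` and `sidebar` are assumptions for illustration). Note that `find` returns `None` when nothing matches, which is worth checking for before reading attributes or text.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two teasers and a footer
html = """
<div class="teaser"><a href="/one">Story one</a></div>
<div class="teaser"><a href="/two">Story two</a></div>
<div class="footer"><p>  Contact us  </p></div>
"""
soup = BeautifulSoup(html, 'html.parser')

teasers = soup.find_all('div', class_='teaser')  # A list of all matches
first_teaser = soup.find('div', class_='teaser')  # The first match only
missing = soup.find('div', class_='sidebar')      # No match, so this is None

# Go down the tree from each teaser to its link and read the href attribute
hrefs = [teaser.a['href'] for teaser in teasers]

# Extract the raw text, stripped of surrounding whitespace
footer_text = soup.find('div', class_='footer').get_text(strip=True)

print(len(teasers), first_teaser.a.text, missing, hrefs, footer_text)
```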

# Video 7.2: Parsing the HTML with BeautifulSoup

## Learning by doing: Creating a dataset from www.dr.dk/nyheder/udland

### Let's put together some of the stuff we have learned so far
1. **Mapping:** In this exercise we will collect some URLs from webpages with news articles and save them into a list
2. **Downloading:** Then we will download the HTML content of the webpages
3. **Parsing:** Finally, we will collect the relevant information from each article

## 1. MAPPING

#### First, we investigate the site trying to understand its structure

We do this by opening up the Chrome Developer Tools on the webpage:
1. Right-click anywhere on the webpage
2. Click "Inspect"
3. Choose the panel "Elements"

You can now see the HTML of the webpage and the tree structure.

First, we want to understand where the articles are located in the HTML: 
- When you right-click an element on the page and click "Inspect", the "Elements" panel jumps to that element's place in the HTML tree
- So to find the location of the articles in the HTML, just right-click on one of them

#### Get the webpage content and make the BeautifulSoup object:

In [4]:
# Define our URL
url = 'https://www.dr.dk/nyheder/udland' 

# Connects to site
response = requests.get(url)

# Parse data with BeautifulSoup
soup = BeautifulSoup(response.content,'lxml')

#### Find the articles to scrape:

[`find_all`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) finds all elements in the HTML that have the tag ```<div>``` and the class attribute 'dre-teaser-content' 

In [5]:
# Identify articles to scrape by inspecting site
articles = soup.find_all('div', class_ = 'dre-teaser-content') #(class_ is used because class is reserved in Python)

In [6]:
articles

[<div class="dre-teaser-content"><div class="dre-article-teaser__text-box dre-article-teaser__text-box--transparent dre-article-teaser__text-box--xxs-4 dre-article-teaser__text-box--xs-4 dre-article-teaser__text-box--sm-4 dre-article-teaser__text-box--md-4 dre-article-teaser__text-box--lg-8 dre-article-teaser__text-box--xl-8"><div class="dre-article-teaser-meta-label"><div aria-hidden="true" class="dre-teaser-meta"><span class="dre-label-text dre-label-text--xxs-x-small"><span class="dre-label-text__text"><span class="dre-teaser-meta__part dre-teaser-meta__part--primary"><span class="dre-teaser-meta-label dre-teaser-meta-label--primary">Udland</span></span><span class="dre-teaser-meta__part"><span class="dre-teaser-meta-label">I dag kl. 07:08</span></span></span></span></div></div><a aria-label='Skovbrand i Californien er tre gange så stor som Bornholm, I dag klokken 07:08, fra sektionen "Udland"' class="dre-teaser-title dre-teaser-title--margin-top dre-teaser-title--xxs-small dre-teas

#### Now we want the links to all the articles:
First, I show how to find the link for *one* article, and afterwards I show how to loop through all article links

You can use [`find`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find) to find the *first* element. In the code below it is the first element that has the tag ```<a>```.

You can use `['href']` to select the attribute. Here we are interested in the content of the href attribute.

In [7]:
# First find the "link" tag in the HTML
article_link = articles[0].find('a') #(We are only taking the first article)
# Then locate the URL in the href attribute
article_url = article_link['href']
print(article_url)

/nyheder/udland/skovbrand-i-californien-er-tre-gange-saa-stor-som-bornholm


In [8]:
# Another way to find the tag is by writing `.a` instead of `.find('a')`:
article_link = articles[0].a
article_url = article_link['href']
print(article_url)

/nyheder/udland/skovbrand-i-californien-er-tre-gange-saa-stor-som-bornholm


#### We create a list of URLs that we want to scrape:

In [9]:
# Create an empty list
list_of_article_urls = []

# Loop over the articles and append each article's URL to the list above
for article in articles:
    list_of_article_urls.append(article.find('a')['href'])

In [10]:
list_of_article_urls

['/nyheder/udland/skovbrand-i-californien-er-tre-gange-saa-stor-som-bornholm',
 '/nyheder/udland/se-billederne-southport-protester-har-naaet-et-nyt-niveau',
 '/nyheder/udland/premierminister-starmer-fordoemmer-uroligheder-som-har-spredt-sig-til-flere-engelske',
 '/nyheder/udland/tre-russere-ville-oenske-vesten-ikke-havde-hjulpet-dem-ud-af-putins-greb-de-er-blevet',
 '/nyheder/udland/amerikansk-forsvarsminister-dropper-aftaler-med-911-gerningsmaend',
 '/nyheder/udland/loekke-om-hjaelp-til-33-aarige-dansker-i-russisk-haer-vi-staar-klar-men-kan-naeppe',
 '/nyheder/udland/sjaelden-indroemmelse-fra-kreml-tre-udvekslede-fanger-var-russiske-spioner',
 '/nyheder/seneste/kamala-harris-er-nu-officielt-demokraternes-praesidentkandidat-har-faaet-nok-stemmer',
 '/nyheder/udland/engelske-byer-og-moskeer-skruer-op-sikkerheden-i-weekenden-frygter-nye-optoejer',
 '/nyheder/udland/nyt-studie-norske-vikinger-var-mere-voldelige-end-danske',
 '/nyheder/udland/fangeudveksling-er-en-sejr-biden-og-harris-det-

#### Some of the links are not to articles 

So we write this code to only keep the article links:

In [11]:
list_of_article_urls_final = []
for link in list_of_article_urls:
    if '/nyheder/udland' in link: #All article URLs have this string in them, so we restrict on it being in the URL
        list_of_article_urls_final.append(link)
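
The scraped links are relative, so they need the base URL before they can be downloaded. A minimal sketch of filtering, de-duplicating and completing the links in one pass, using `urljoin` from the standard library. The names `scraped_links` and `article_urls` are hypothetical, the links are copied from the output above, and the duplicate entry is added here only for illustration.

```python
from urllib.parse import urljoin

base_url = 'https://www.dr.dk'

# A few relative links from the output above, plus one duplicate (added for illustration)
scraped_links = [
    '/nyheder/udland/nyt-studie-norske-vikinger-var-mere-voldelige-end-danske',
    '/nyheder/seneste/kamala-harris-er-nu-officielt-demokraternes-praesidentkandidat-har-faaet-nok-stemmer',
    '/nyheder/udland/amerikansk-forsvarsminister-dropper-aftaler-med-911-gerningsmaend',
    '/nyheder/udland/nyt-studie-norske-vikinger-var-mere-voldelige-end-danske',
]

article_urls = []
for link in scraped_links:
    full_url = urljoin(base_url, link)  # Turn the relative link into an absolute URL
    # Keep links from the 'udland' section only, and skip URLs we already have
    if link.startswith('/nyheder/udland/') and full_url not in article_urls:
        article_urls.append(full_url)

print(article_urls)
```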

## 2. DOWNLOADING + 3. PARSING

#### Now we are ready to scrape each webpage from the URL list:
First, I will show you the procedure for *one* link, and then I will show you how to scrape the first 10 articles

In [12]:
# Create empty lists for the information we want to extract from every article
title_list = []
lead_list = []
time_list = []

# Build the full URL of the first article in the URL list we created before
url = 'https://www.dr.dk' + list_of_article_urls_final[0] #The scraped links are relative, so we need to add the base URL
response = requests.get(url)
soup = BeautifulSoup(response.content,'lxml')

In [13]:
# Find title
temp = soup.find_all('h1')
temp = temp[1]
temp = temp.text.strip() #Use strip() to get rid of trailing and leading spaces
title_list.append(temp)

In [14]:
# Find lead
temp = soup.find('p', class_='dre-article-title__summary')
temp = temp.text.strip()
lead_list.append(temp)

In [15]:
# Find time posted
temp = soup.find('time', class_='dre-byline__date')
temp = temp['datetime']
time_list.append(temp)

#### Combine all of the code above in a loop to scrape the first 10 articles:

In [16]:
# We want to extract title, lead and time posted from the articles

# Create empty lists for the information we want to extract from every article
title_list = []
lead_list = []
time_list = []

for i in range(10): # Replace 10 with len(list_of_article_urls_final) to scrape all articles
    
    # This time we scrape for each news article in the url list we created before
    url = 'https://www.dr.dk' + list_of_article_urls_final[i] #The scraped links are relative, so we need to add the base url
    response = requests.get(url)
    soup = BeautifulSoup(response.content,'lxml')
    
    # Append title to list
    temp = soup.find_all('h1')
    temp = temp[1]
    temp = temp.text.strip()
    title_list.append(temp)
    
    # Append lead to list
    temp = soup.find('p', class_='dre-article-title__summary')
    temp = temp.text.strip()
    lead_list.append(temp)

    # Append time posted to list
    temp = soup.find('time', class_='dre-byline__date')
    temp = temp['datetime']
    time_list.append(temp)
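
The loop above stops with an error if an article lacks one of the elements, because `find` then returns `None`. A minimal sketch of a more forgiving version is shown here: the parsing is wrapped in a function that returns `None` for missing fields. The helpers `text_or_none` and `parse_article` are hypothetical names, not part of BeautifulSoup, and the sample HTML is made up to mimic the structure used above.

```python
from bs4 import BeautifulSoup


def text_or_none(element):
    """Return the stripped text of an element, or None if the element was not found."""
    return element.text.strip() if element is not None else None


def parse_article(html):
    """Extract title, lead and time posted from the HTML of one article page."""
    soup = BeautifulSoup(html, 'html.parser')

    # As in the loop above, the second <h1> is taken to be the article title
    headings = soup.find_all('h1')
    title = text_or_none(headings[1]) if len(headings) > 1 else None

    lead = text_or_none(soup.find('p', class_='dre-article-title__summary'))

    time_tag = soup.find('time', class_='dre-byline__date')
    time_posted = time_tag.get('datetime') if time_tag is not None else None

    return {'title': title, 'lead': lead, 'time': time_posted}


# Made-up HTML where the lead paragraph is missing
sample_html = """
<h1>Site header</h1>
<h1> An example title </h1>
<time class="dre-byline__date" datetime="2024-08-04T05:08:00+00:00">4 Aug</time>
"""
row = parse_article(sample_html)
print(row)
```

Inside the scraping loop, one could call `parse_article(response.content)` for each URL and append the returned dictionary to a list. Such a list of dictionaries can be passed directly to `pd.DataFrame`.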

In [17]:
title_list

['Skovbrand i Californien er tre gange så stor som Bornholm',
 'Se billederne: Southport-protester har nået et nyt niveau',
 'Premierminister Starmer fordømmer uroligheder, som har spredt sig til flere engelske byer',
 "Tre russere ville ønske, at Vesten ikke havde hjulpet dem ud af Putins greb: 'De er blevet politisk kastreret'",
 'Amerikansk forsvarsminister dropper aftaler med 9/11-gerningsmænd',
 'Løkke om hjælp til 33-årige dansker i russisk hær: Vi står klar - men kan næppe gøre meget',
 'Sjælden indrømmelse fra Kreml: Tre udvekslede fanger var russiske spioner',
 'Engelske byer og moskeer skruer op for sikkerheden i weekenden: Frygter nye optøjer',
 'Nyt studie: Norske vikinger var mere voldelige end danske',
 "Fangeudveksling er en sejr for Biden og Harris: 'Det vil nok blive brugt i valgkampen'"]

In [23]:
lead_list

["Joe Biden har været nødt til erkende, at faklen skal gives videre, lyder det fra Philip Khokhar, DR's USA-korrespondent.",
 'Forud for Netanyahus tale havde demonstranter vist deres utilfredshed med den amerikanske støtte til Israel.',
 'I årtier har de to lande skændes om, hvor kebabretten stammer fra.',
 'Ungdommens kontinent er i oprør. Generation Z har fået nok i flere lande og kræver forandring.',
 'Især den oprindelige befolkning, māorierne, har været udsat for overgreb.',
 "På Kamala Harris' første vælgermøde så vi en klassisk valgkampstale, siger retoriker.",
 'Netanyahus besøg i USA har givet anledning til at diskutere Kamala Harris’ syn på Israel.',
 'Harris har holdt sit første vælgermøde i svingstaten Wisconsin, hvor hun talte om alt fra Trump til abort.',
 'Israels Netanyahu afviser, at der er tale om ulovlige besættelser. Han siger, han vil præsentere "sandheden om vores retfærdige krig" i Kongressen i USA, hvor han skal tale i dag.',
 'Ifølge DR’s USA-korrespondent er 

In [18]:
time_list

['2024-08-04T05:08:00+00:00',
 '2024-08-04T05:03:00+00:00',
 '2024-08-03T20:21:00+00:00',
 '2024-08-03T16:39:00+00:00',
 '2024-08-03T04:43:00+00:00',
 '2024-08-02T18:25:00+00:00',
 '2024-08-02T17:32:00+00:00',
 '2024-08-02T16:44:00+00:00',
 '2024-08-02T15:30:00+00:00',
 '2024-08-02T14:47:00+00:00']

#### Lastly, we put our collected information into a dataframe:

In [19]:
import pandas as pd
df = pd.DataFrame({'title':title_list, 'lead':lead_list, 'time':time_list})
df

|   | title | lead | time |
|---|:------|:-----|:-----|
| 0 | Skovbrand i Californien er tre gange så stor s... | Park Fire har slugt hundredvis af bygninger og... | 2024-08-04T05:08:00+00:00 |
| 1 | Se billederne: Southport-protester har nået et... | En 17-årig dreng dræbte i mandags tre piger ti... | 2024-08-04T05:03:00+00:00 |
| 2 | Premierminister Starmer fordømmer uroligheder,... | Der meldes om uroligheder i blandt andet Liver... | 2024-08-03T20:21:00+00:00 |
| 3 | Tre russere ville ønske, at Vesten ikke havde ... | Tre russiske politiske fanger er kommet til Ty... | 2024-08-03T16:39:00+00:00 |
| 4 | Amerikansk forsvarsminister dropper aftaler me... | Retssagen mod den formodede hovedmand bag terr... | 2024-08-03T04:43:00+00:00 |
| 5 | Løkke om hjælp til 33-årige dansker i russisk ... | Udenrigsministeren understreger, at bistand fr... | 2024-08-02T18:25:00+00:00 |
| 6 | Sjælden indrømmelse fra Kreml: Tre udvekslede ... | To spioners børn vidste ikke, at de var russer... | 2024-08-02T17:32:00+00:00 |
| 7 | Engelske byer og moskeer skruer op for sikkerh... | Borgmesteren i Southport, hvor knivangrebet ba... | 2024-08-02T16:44:00+00:00 |
| 8 | Nyt studie: Norske vikinger var mere voldelige... | Skeletter fra vikinger fundet i Norge har fler... | 2024-08-02T15:30:00+00:00 |
| 9 | Fangeudveksling er en sejr for Biden og Harris... | Joe Biden tog selv imod de frigivne fanger, da... | 2024-08-02T14:47:00+00:00 |
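
The `time` column still holds plain strings. A minimal sketch of turning such ISO 8601 strings into real datetime objects with the standard library, so that they can be sorted and subtracted. The two timestamps are copied from `time_list` above; the variable names `parsed`, `newest` and `age` are made up for illustration. For a whole dataframe column, pandas offers `pd.to_datetime` for the same job.

```python
from datetime import datetime

# Two timestamps copied from time_list above
time_strings = ['2024-08-04T05:08:00+00:00', '2024-08-02T14:47:00+00:00']

# fromisoformat understands the 'YYYY-MM-DDTHH:MM:SS+00:00' format used here
parsed = [datetime.fromisoformat(t) for t in time_strings]

newest = max(parsed)        # Datetime objects can be compared directly
age = parsed[0] - parsed[1]  # Subtracting two datetimes gives a timedelta

print(newest)
print(age)
```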


#### One more thing:
What if we also want the body text of an article?

In [20]:
url = 'https://www.dr.dk/nyheder/udland/gazprom-strammer-ifoelge-tyskland-skruen-uden-grund' 
response = requests.get(url)
soup = BeautifulSoup(response.content,'lxml')

In [21]:
# We locate the body of the article:
body = soup.find('div', class_ = 'dre-article-body')
body



The body consists of both text sections and figures. We want it all.

But text sections and figures have different tags, so we cannot find them all by searching for a single tag with `find_all`.

Instead we can use [`.children`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children). It iterates over all direct children of the element `body`:

In [22]:
body_text = []
for child in body.children:
    body_text.append(child.text)

In [23]:
body_text

['Gazprom halverer gasleverancerne til Europa via Nord Stream 1. Årsagen er ifølge selskabet vedligehold af en gasturbine. Den daglige gasforsyning via gasledningen vil fra onsdag morgen blive reduceret til 33 millioner kubikmeter, oplyser Gazprom.Det svarer til cirka 20 procent af den maksimale kapacitet, og det fremgår ikke, hvor længe den yderligt reducerede forsyning af gas vil stå på.',
 '',
 'Den tyske regering anser den forklaringen om vedligeholdelse for at være opfundet til lejligheden.- Ifølge vores oplysninger er der ingen teknisk grund til en reduktion i leverancerne, siger en talskvinde for Finansministeriet og minister Robert Habeck til Frankfurter Allgemeine Zeitung.Tyskerne får 25 procent af deres energi fra gas, hvor en overvejende del er kommet fra Rusland.Gasprisen stiger med 10 procentDet er anden gang indenfor en uge, at Gazprom reducerer leverancen af gas under påskud af reperation af gasturbiner. Da Gazprom efter ti dages vedligehold i sidste uge genåbnede for ga

Note: We have used `.text` to get the text of each child element. The figure elements do not contain any text, so they show up as empty strings.

We can use `.join()` to join all the strings in the list. Just join it on an empty string:

In [24]:
''.join(body_text)

'Gazprom halverer gasleverancerne til Europa via Nord Stream 1. Årsagen er ifølge selskabet vedligehold af en gasturbine. Den daglige gasforsyning via gasledningen vil fra onsdag morgen blive reduceret til 33 millioner kubikmeter, oplyser Gazprom.Det svarer til cirka 20 procent af den maksimale kapacitet, og det fremgår ikke, hvor længe den yderligt reducerede forsyning af gas vil stå på.Den tyske regering anser den forklaringen om vedligeholdelse for at være opfundet til lejligheden.- Ifølge vores oplysninger er der ingen teknisk grund til en reduktion i leverancerne, siger en talskvinde for Finansministeriet og minister Robert Habeck til Frankfurter Allgemeine Zeitung.Tyskerne får 25 procent af deres energi fra gas, hvor en overvejende del er kommet fra Rusland.Gasprisen stiger med 10 procentDet er anden gang indenfor en uge, at Gazprom reducerer leverancen af gas under påskud af reperation af gasturbiner. Da Gazprom efter ti dages vedligehold i sidste uge genåbnede for gasforsyninge
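
Joining on an empty string glues sentences together, as in "Gazprom.Det svarer" in the output above. A minimal sketch of one way to keep the text readable: separate the strings inside each child with a space, skip the empty children, and join the rest with blank lines. The HTML is made up to mimic an article body, and the names `paragraphs` and `article_text` are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up article body: two text sections and one figure without text
html = """
<div class="dre-article-body">
<div><p>First sentence.</p><p>Second sentence.</p></div>
<figure><img src="photo.jpg"/></figure>
<div><p>Third sentence.</p></div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
body = soup.find('div', class_='dre-article-body')

paragraphs = []
# find_all(recursive=False) returns only the tags that are direct children of body
for child in body.find_all(recursive=False):
    # Put a space between the strings inside the child and strip extra whitespace
    text = child.get_text(separator=' ', strip=True)
    if text:  # Skip children without text, such as figures
        paragraphs.append(text)

# Separate the sections with a blank line instead of gluing them together
article_text = '\n\n'.join(paragraphs)
print(article_text)
```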