# Session 6: Web Scraping 1
*Hjalte Fejerskov Boas*

# About me

- PhD student in Economics at UCPH
    - At the moment: working on a project about combatting tax havens
- Master's and bachelor's degrees from UCPH
- Took this course in 2017
- Personal website: https://www.hjalteboas.com/

## Recap
In session 5 you briefly touched upon extracting data from the internet
- You heard about HTTP requests: The computer's way to communicate with the website and underlying server
- You learned about HTML: The language behind a website
- You worked with APIs: A way to retrieve structured data from websites
    - You learned how to send requests to an API and receive the desired data in return
    - You learned how to deal with JSON, the format in which the data is usually sent

In the next three sessions we will build on your newly acquired skills

You will learn how to extract data from websites when no API is available

It will open up a whole new world of data possibilities!

## Required readings

- [Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/)

- [A Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/)

- Shiab, Nael. 2015. [On the Ethics of Web Scraping and Data Journalism](http://gijn.org/2015/08/12/on-the-ethics-of-web-scraping-and-data-journalism/). Global Investigative Journalism Network.

## Overview of Session 6

Today, we will learn about interacting with websites and extracting their unstructured data (web scraping). In particular, we will cover:
1. Basics of web scraping:
    - What is web scraping?
    - How is a webpage built? How does a computer "see" a webpage?
    - Ethical considerations
2. Mapping the URLs of the webpages you want to scrape:
    - How can we systematically go through all the webpages and scrape their content?
        - You will learn to exploit the pattern in the URLs
3. Connecting to a webpage:
    - How do we extract the HTML-string behind the webpage?
4. Good practices of web scraping
    - Limit the rate of your calls to the webpage
    - Logging
    - Handle exceptions
5. Exploiting the webpage's own data requests to extract data easily
    - Use the network panel in Chrome Developer Tools to locate the data

## What is web scraping?

Web scraping is the practice of extracting information from websites in an automated and structured way

- The internet is the biggest source of information/data you can find! [90% of the data on the internet has been created since 2016](https://blog.microfocus.com/how-much-data-is-created-on-the-internet-each-day/)
- Web scraping unlocks this new world of data possibilities
- The biggest limitation is your own imagination!

#### One note:
- Learning how to web scrape may be frustrating in the beginning
    - It is a new way of thinking about data
    - It requires technological knowledge
- Keep working with it!
    - I promise you; you will master it
    - And the internet will never be the same again

## How a human sees a webpage vs. how a computer sees it

### [www.jobnet.dk](https://job.jobnet.dk/CV/FindWork?Offset=0&SortValue=BestMatch)

How a human sees a webpage             |  How a computer sees a webpage
:-------------------------:|:-------------------------:
![](https://drive.google.com/uc?exportview&id=1cbrC303j-gQnXbXyTEQBPT2xH7kgz6Cy)  |  ![](https://drive.google.com/uc?export=view&id=1VFlfDcJHCzbtmkpr4kvXzGecrDE7KmLY)

## Ethical Considerations (rules of thumb)
1. Fair use: Take only the stuff you need
2. Be careful with copyrighted material
3. If a regular user can’t access it, we shouldn’t try to get it: [That is considered hacking](https://www.dr.dk/nyheder/penge/gjorde-opmaerksom-paa-cpr-hul-nu-bliver-han-politianmeldt-hacking)
4. If you monetize the data, be careful not to be in direct competition with the party you are taking the data from
5. LinkedIn case: [Scraping data on LinkedIn is legal](https://gizmodo.com/linkedin-scraping-data-legal-court-case-1848811335)
6. Don't hit the server too fast: Flooding it with requests is essentially a denial-of-service (DoS) attack; [again considered hacking](https://www.dr.dk/nyheder/indland/folketingets-hjemmeside-ramt-af-hacker-angreb)

<img src="https://github.com/snorreralund/images/raw/master/Sk%C3%A6rmbillede%202017-08-03%2014.46.32.png"/>

## The Web Scraping Recipe

Three (main) steps in scraping:
1. **MAPPING (this session)**: Find URLs of the webpages containing the information you want.
2. **DOWNLOADING (this session)**: Download the HTML of the webpages.
3. **PARSING (session 7)**: Extract the information from the HTML. 

### What browser to use?
- Lectures and exercises are based solely on Chrome --> I recommend that you use Chrome as well.
- All browsers (Chrome, Firefox, Safari, Edge etc.) can be used to investigate the webpages like you will learn to do here
    - The practice might differ a bit

In [1]:
import requests
import time
import tqdm
import pandas as pd
import os
import json

# Video 6.1: Mapping URLs and downloading webpage content

## 1. Mapping: How do we find the relevant URLs

### Navigating websites to collect links

How can you automate the navigation of links?

### Building URLs using a recognizable pattern
A nice trick is to understand how URLs are constructed to communicate with a server

This will allow us to navigate the page:

* / is like folders on your computer.
* ? marks the start of a query with parameters
* = assigns a value to a parameter: e.g. page=1000, offset=100 or showNumber=20
* & separates different parameters.
* \+ encodes a space in the query string (this is URL encoding, not HTML)
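
To see these building blocks in practice, here is a minimal sketch that splits a URL into its parts with Python's standard library. The `q` parameter is a made-up example and not necessarily one that jobindex.dk uses.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# Example URL; the parameter `q` is hypothetical and only used for illustration
url = 'https://www.jobindex.dk/jobsoegning?page=2&q=data+scientist'

parts = urlsplit(url)
print(parts.netloc)  # The server: www.jobindex.dk
print(parts.path)    # The "folder" on the server: /jobsoegning

# Everything after ? is the query: parameters separated by &, values assigned with =
params = parse_qs(parts.query)
print(params)        # The + has been decoded to a space

# Going the other way: build a query string from a dictionary
query = urlencode({'page': 3, 'q': 'data scientist'})
print(query)         # The space has been encoded as +
```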

Let's look at how [www.jobindex.dk](https://www.jobindex.dk/jobsoegning) does it:
- We simply click around and take note of how the address line changes

#### We want to create the URLs for the first 5 pages
Is there a pattern in the URL that we can exploit?

In [2]:
links = []
for page in range(1,6,1):
    url = f'https://www.jobindex.dk/jobsoegning?page={page}'
    links.append(url)

In [3]:
links

['https://www.jobindex.dk/jobsoegning?page=1',
 'https://www.jobindex.dk/jobsoegning?page=2',
 'https://www.jobindex.dk/jobsoegning?page=3',
 'https://www.jobindex.dk/jobsoegning?page=4',
 'https://www.jobindex.dk/jobsoegning?page=5']

## 2. Connect to the webpage and *download* its content

### Here is how you connect to and download the HTML of a webpage

In [4]:
import requests
response = requests.get('https://www.jobindex.dk/jobsoegning?page=1')

In [5]:
response.text



#### Remember to tell who you are
- Write your name and email in the header of the request you send to the webpage
- Then the managers of the webpage will know you are not a malicious actor

In [6]:
response = requests.get('https://www.jobindex.dk/jobsoegning?page=1', headers={'name':'Hjalte Fejerskov Boas','email':'hfb@econ.ku.dk'})
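
If you want to check what the server would receive, one option is to build the request without sending it. This is only a sketch: the name and email below are placeholders, and `requests.Request(...).prepare()` is used here purely for inspection.

```python
import requests

url = 'https://www.jobindex.dk/jobsoegning?page=1'
headers = {'name': 'Your Name', 'email': 'your@email.com'}  # Placeholders: insert your own details

# Build the request without sending it, so we can inspect it first
prepared = requests.Request('GET', url, headers=headers).prepare()

print(prepared.method, prepared.url)
print(dict(prepared.headers))  # The headers that would accompany the request
```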

### We now want to download the content of all 5 links we made earlier

In [7]:
list_htmls = []
for url in links:
    response = requests.get(url)
    html = response.text
    list_htmls.append(html)

In [8]:
list_htmls



#### It is good practice to limit the rate of your calls to the website
You can do that with the function `time.sleep()`

In [9]:
list_htmls = []
for url in tqdm.tqdm(links): #Track the time left before completing the loop
    response = requests.get(url)
    html = response.text
    list_htmls.append(html)
    time.sleep(0.5) #Sleep for 0.5 seconds

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.11s/it]


# Video 6.2: Logging and handling exceptions

## Logging
Logging your web scraping activity is crucial to ensuring and demonstrating data quality. It helps you:
- to document the data you extract
- to understand the reasons behind any unexpected stops of your web scraper

### Minimum essentials to log 
- *Time* of the scrape
- The *status code* of the request response
    - If successful, the status code is normally *200*
    - A common error is *404*: "*Page not found*"
- The *length* of the output
    - To indicate whether there may be a mistake
    - In our case the output is usually an HTML-string
- The *path* to the output file
    - What did we actually scrape?

#### Here is a simple logging function that you can use:

In [10]:
# Define the log function to gather the log information
def log(response,logfile,output_path=os.getcwd()):
    # If the log file does not exist, create it and write the headers for the log variables
    if not os.path.isfile(logfile):
        with open(logfile,'w') as f:
            header = ['timestamp','status_code','length','output_file']
            f.write(';'.join(header) + "\n") #Write the headers and jump to new line

    # Gather log information
    status_code = response.status_code #Status code from the request result
    timestamp = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) #Local time
    length = len(response.text) #Length of the HTML-string

    # Open the log file and append the gathered log information
    with open(logfile,'a') as f:
        f.write(f'{timestamp};{status_code};{length};{output_path}' + "\n") #Append the information and jump to new line

#### Apply to web scrape:

In [11]:
list_htmls = []
logfile = 'log.csv'
for url in tqdm.tqdm(links):
    response = requests.get(url)
    html = response.text
    list_htmls.append(html)
    time.sleep(0.5)
    log(response,logfile)

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.15s/it]
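
Once a scrape has finished, the log can be read back to spot failed requests. The sketch below first writes a small example log in the same semicolon-separated format (the two rows are made up for illustration) and then counts the status codes.

```python
import csv
import os
import tempfile
from collections import Counter

# Write a tiny example log in the same format as the log function above
# (the rows are made up for illustration)
logfile = os.path.join(tempfile.mkdtemp(), 'example_log.csv')
with open(logfile, 'w') as f:
    f.write('timestamp;status_code;length;output_file\n')
    f.write('2022-07-30 10:00:00;200;153000;.\n')
    f.write('2022-07-30 10:00:01;404;512;.\n')

# Read the log back; every row becomes a dictionary keyed by the header names
with open(logfile, newline='') as f:
    rows = list(csv.DictReader(f, delimiter=';'))

# Count how often each status code occurred and pick out the unsuccessful requests
status_counts = Counter(row['status_code'] for row in rows)
failed = [row for row in rows if row['status_code'] != '200']

print(status_counts)
print(failed)
```

A very short `length` combined with a status code other than 200 is usually a sign that the page was not downloaded correctly.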


## Handling exceptions
When you web scrape you will encounter unexpected errors or crashes.
Some common errors may be:
- The URL does not exist
- The connection to the internet stops

How can we mitigate such problems?
- The `try`/`except` block in Python can help us
    - If the code in the `try` block raises an error, Python executes the `except` block instead

In [12]:
list_htmls = []
for url in tqdm.tqdm(links):
    try:
        response = requests.get(url)
    except Exception as e:
        print(url) #Print url
        print(e) #Print error
        with open("list_htmls", "w") as l: #Save the list_htmls as a json file to retrieve at another time
            json.dump(list_htmls, l)
        continue #Continue to next iteration of the loop
    html = response.text
    list_htmls.append(html)
    time.sleep(0.5) #Sleep for 0.5 seconds

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.10s/it]
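
Note that `requests.get()` does not raise an error when a page answers with, for example, status code 404. A possible extension, sketched below, is to call `response.raise_for_status()` and to retry a few times before giving up. The function name `fetch` and its parameters `retries`, `pause` and `get` are my own choices for this illustration.

```python
import time
import requests

def fetch(url, retries=3, pause=0.5, get=requests.get):
    """Return the HTML of `url`, or None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            response = get(url, timeout=10)  # Give up on a single attempt after 10 seconds
            response.raise_for_status()      # Turn 4xx/5xx status codes into exceptions
            return response.text
        except requests.exceptions.RequestException as e:
            print(f'Attempt {attempt} failed for {url}: {e}')
            time.sleep(pause)                # Wait a bit before trying again
    return None

# Usage (not run here, since it requires an internet connection):
# html = fetch('https://www.jobindex.dk/jobsoegning?page=1')
```

The `get` argument only exists so the function can be tried out with a stand-in for `requests.get`; in normal use you leave it at its default.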


# Video 6.3: The network panel

## Background
Many webpages are built dynamically 
- Each time you open up the webpage, it sends some **requests** to the server to retrieve the data that you see on the webpage

We can find these requests and use them!
- We can then send them directly to the server
- And then extract the data *before* it is written into the HTML
    - This is often preferable: The data usually comes in a structured JSON format
    - Makes it easy to use the data right away

## The network panel in Chrome Developer Tools
Use the **network panel** in the Chrome Developer Tools
- The network panel monitors all uploads and downloads to and from the webpage
- I.e. also requests about data we are interested in

### The network panel: [www.boligsiden.dk](https://www.boligsiden.dk/tilsalg)
<img src="https://drive.google.com/uc?export=view&id=1vGk2b1jxH1LU642-YWo1QgR3Q5vz2rQe">

### Which request is the one?
1. We want to locate an XHR ([XMLHttpRequest](https://en.wikipedia.org/wiki/XMLHttpRequest)). The XHRs transfer data between the server and the webpage 
    - Pick "*Fetch/XHR*"
2. Which XHR carries the information about the properties?
    - We need to look through them all --> You can get a preview of the JSON file in "*Preview*"
3. When we have found the right XHR, we need to find the request URL
    - Go to "*Headers*". Here you see the request URL
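
Once you have found a request URL, it can help to unpack its query string into a dictionary and see which parameters you can change. The sketch below uses a shortened version of the boligsiden request URL; that changing `page` steps through the results is an assumption you should verify in the network panel.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Shortened version of the request URL (only some of the parameters are kept)
request_url = ('https://api.prod.bs-aws-stage.com/search/cases'
               '?addressTypes=villa%2Ccondo&per_page=50&page=1&sortBy=timeOnMarket')

# Split the URL and turn the query string into a dictionary
parts = urlsplit(request_url)
params = dict(parse_qsl(parts.query))
print(params)

# Build the URLs for pages 1-3 by changing only the `page` parameter
urls = []
for page in range(1, 4):
    params['page'] = str(page)
    urls.append(urlunsplit(parts._replace(query=urlencode(params))))

print(urls)
```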

### The steps:

#### Send the request to the server and get the data

In [13]:
response = requests.get('https://api.prod.bs-aws-stage.com/search/cases?addressTypes=villa%2Ccondo%2Cterraced+house%2Choliday+house%2Ccooperative%2Cfarm%2Chobby+farm%2Cfull+year+plot%2Cvilla+apartment%2Choliday+plot&per_page=50&page=1&highlighted=true&sortAscending=true&sortBy=timeOnMarket')

#### Convert the data to a JSON format in Python

In [14]:
result = response.json()

#### The JSON file consists of two different key-value pairs ("cases" and "links"). 
#### We are only interested in the information about properties which are stored in the "cases" key-value pair

In [15]:
result_properties = result['cases']

#### Now we can easily convert the JSON file to a pandas DataFrame

In [16]:
data = pd.DataFrame(result_properties)

In [17]:
data

Unnamed: 0,_links,address,addressType,caseID,caseUrl,coordinates,daysOnMarket,defaultImage,descriptionBody,descriptionTitle,...,realtor,slug,status,totalClickCount,totalFavourites,weightedArea,yearBuilt,basementArea,nextOpenHouse,cooperative
0,{'self': {'href': '/cases/1634c6f4-6592-4e66-b...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,1634c6f4-6592-4e66-b1ef-e3540150f5f0,https://www.danbolig.dk?propertyid=0140000590&...,"{'lat': 55.75399, 'lon': 11.708402, 'type': 'E...",1,"{'imageSources': [{'size': {'height': 80, 'wid...",I Udby ligger denne flotte og fuldstændig nyre...,Nyrenoveret villa i skønne omgivelser i Udby,...,{'_links': {'self': {'href': '/realtors/ea7520...,udbyvej-35-4300-holbaek-03161776__35_______,open,390,3,92.8,1910.0,,,
1,{'self': {'href': '/cases/0e2339cf-2efa-4aaa-b...,{'_links': {'self': {'href': '/addresses/ef6bf...,terraced house,0e2339cf-2efa-4aaa-bc4d-6e54ab394909,https://www.danbolig.dk?propertyid=0140000526&...,"{'lat': 55.712513, 'lon': 11.763541, 'type': '...",1,"{'imageSources': [{'size': {'height': 80, 'wid...",Bo i Holbæks nye parklignende område tæt på fj...,Arkitekttegnet rækkehus klos op ad fredet natur,...,{'_links': {'self': {'href': '/realtors/ea7520...,wegeners-have-21-4300-holbaek-03162181__21_______,open,529,3,,2021.0,,,
2,{'self': {'href': '/cases/2a5eaae5-193b-4e66-9...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,2a5eaae5-193b-4e66-980c-6831714d08db,http://www.nybolig.dk/maegler/pages/property-p...,"{'lat': 55.162, 'lon': 11.977034, 'type': 'EPS...",1,"{'imageSources': [{'size': {'height': 80, 'wid...",Denne bolig i Tappernøje passer perfekt til de...,Mulighedsrig ejendom med fremragende garage i ...,...,{'_links': {'self': {'href': '/realtors/cb08f4...,sneserevej-19-4733-tappernoeje-03701546__19___...,open,136,3,205.25,1946.0,90.0,,
3,{'self': {'href': '/cases/09172d2a-964e-42a2-a...,{'_links': {'self': {'href': '/addresses/0a3f5...,terraced house,09172d2a-964e-42a2-ac45-83b391a46f0f,https://www.husmadsen.dk/sag.aspx?mgl=2676&sag...,"{'lat': 55.75249, 'lon': 11.961139, 'type': 'E...",2,"{'imageSources': [{'size': {'height': 80, 'wid...",Stort rækkehus på Grønnevej.Et af de eftertrag...,Attraktiv beliggenhed,...,{'_links': {'self': {'href': '/realtors/516593...,groennevej-16-4050-skibby-02500429__16_______,open,233,0,120.0,1971.0,,,
4,{'self': {'href': '/cases/38d73064-1712-45ab-8...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,38d73064-1712-45ab-8a62-f3465efd80ce,http://www.nybolig.dk/maegler/pages/property-p...,"{'lat': 55.25264, 'lon': 11.76687, 'type': 'EP...",2,"{'imageSources': [{'size': {'height': 80, 'wid...",I et roligt kvarter i Holsted nord for Næstved...,Vedligeholdt etplansvilla i Holsted / Næstved ...,...,{'_links': {'self': {'href': '/realtors/cb08f4...,sofiedalsvej-9-4700-naestved-03701557___9_______,open,263,1,107.55,1966.0,,"{'date': '2022-07-31T11:00:00Z', 'duration': 3...",
5,{'self': {'href': '/cases/b2dbda4d-f35f-4417-9...,{'_links': {'self': {'href': '/addresses/0a3f5...,condo,b2dbda4d-f35f-4417-9c63-f9187092dd6d,https://www.danbolig.dk?propertyid=0350000250&...,"{'lat': 55.839928, 'lon': 12.430592, 'type': '...",2,"{'imageSources': [{'size': {'height': 80, 'wid...",Nu kan du komme til at bo lige i midten af bye...,Herskabslejlighed i centrum af byen!,...,{'_links': {'self': {'href': '/realtors/3c9ae3...,hovedgaden-45-1-th-3460-birkeroed-02300271__45...,open,375,4,146.0,1926.0,,"{'date': '2022-07-31T11:00:00Z', 'duration': 3...",
6,{'self': {'href': '/cases/7529a18d-9701-450b-b...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,7529a18d-9701-450b-b614-abf3c7f58ca8,https://www.fynskeboliger.dk/sag.aspx?mgl=2288...,"{'lat': 55.20126, 'lon': 10.775188, 'type': 'E...",2,"{'imageSources': [{'size': {'height': 80, 'wid...",VORMARK GL. SKOLE - IDEEL FOR DEN PLADSKRÆVEND...,,...,{'_links': {'self': {'href': '/realtors/817f0a...,revsoerevej-23-5874-hesselager-04790382__23___...,open,200,4,246.05,1840.0,,,
7,{'self': {'href': '/cases/b1549121-0d0b-4daf-9...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,b1549121-0d0b-4daf-99eb-71a068e769a1,https://www.danbolig.dk?propertyid=1900000158&...,"{'lat': 55.562973, 'lon': 12.257923, 'type': '...",3,"{'imageSources': [{'size': {'height': 80, 'wid...",På Vårgyvelvej 43 i Karlslunde ligger denne vi...,123 m2 etplansvilla i røde sten og med betonta...,...,{'_links': {'self': {'href': '/realtors/1fdea5...,vaargyvelvej-43-2690-karlslunde-02539415__43__...,open,404,3,121.65,1979.0,,,
8,{'self': {'href': '/cases/0dd4feb9-a73e-49de-b...,{'_links': {'self': {'href': '/addresses/0a3f5...,terraced house,0dd4feb9-a73e-49de-b0af-8594b6dd50c2,http://www.estate-maeglerne.dk/maegler/pages/p...,"{'lat': 55.793953, 'lon': 12.488909, 'type': '...",3,"{'imageSources': [{'size': {'height': 80, 'wid...",På en attraktiv adresse i Brede udbyder vi nu ...,,...,{'_links': {'self': {'href': '/realtors/22d38c...,fyrrevang-22-2830-virum-01730262__22_______,open,480,7,110.8,1944.0,32.0,"{'date': '2022-07-31T10:00:00Z', 'duration': 2...",
9,{'self': {'href': '/cases/0f7a12fc-2bbc-4646-a...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,0f7a12fc-2bbc-4646-a063-342e52d3f250,https://www.danbolig.dk?propertyid=0140000390&...,"{'lat': 55.54366, 'lon': 11.710705, 'type': 'E...",3,"{'imageSources': [{'size': {'height': 80, 'wid...","Her får I et skønt byhus, der ligger centralt ...",Dejligt byhus med skøn have i Store Merløse,...,{'_links': {'self': {'href': '/realtors/ea7520...,bygaden-10-4370-store-merloese-03160221__10___...,open,291,4,177.85,1912.0,18.0,,
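
Several columns above, such as `coordinates`, still contain nested dictionaries. One way to flatten them is `pd.json_normalize`. The sketch below uses two small hand-written records that only mimic the structure of the real data (the `type` field of `coordinates` is left out).

```python
import pandas as pd

# Two small records mimicking the nested structure of the real data
records = [
    {'addressType': 'villa', 'coordinates': {'lat': 55.75399, 'lon': 11.708402}},
    {'addressType': 'terraced house', 'coordinates': {'lat': 55.712513, 'lon': 11.763541}},
]

# Nested keys become separate columns named with a dot, e.g. coordinates.lat
flat = pd.json_normalize(records)
print(flat.columns.tolist())
print(flat)
```

With the real data the same call would be `pd.json_normalize(result_properties)`.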


## And we have the data!