# Obtaining, parsing and structuring static HTML websites

In this notebook we will learn how to scrape basic static (i.e. non-interactive) HTML websites. We will
- obtain the raw HTML content using the `requests` module
- convert the raw HTML into a format that is easier to search, or parse, using the `BeautifulSoup` module
- learn how to identify the elements of interest in the raw HTML using the browser's inspect functionality and the CSS SelectorGadget
- construct a table, or dataframe, with the popular data analysis library `pandas` and store the output locally in a standard spreadsheet format

1. Open the Anaconda Prompt and install the module `requests`

In [6]:
import requests

In [4]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [1]:
seed = 'https://www.uni-potsdam.de/de/'

2. What data type is the object `seed`? How can you check?

In [2]:
type(seed)

str

3. Are crawlers allowed to access this URL? Hint: Check the site's `robots.txt`.
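
A minimal sketch of how such a check could be automated with the standard library's `urllib.robotparser`. The rules and the `example.org` URLs below are made up for illustration; they are not the university's actual `robots.txt`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download.
sample_robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

# can_fetch(user_agent, url) returns True if the rules allow crawling the URL.
allowed_home = parser.can_fetch("*", "https://www.example.org/de/")
allowed_private = parser.can_fetch("*", "https://www.example.org/private/page.html")
print(allowed_home, allowed_private)
```

For a live site you would call `parser.set_url(...)` with the address of the site's `robots.txt`, followed by `parser.read()`, instead of `parser.parse(...)`.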

4. Was the request successful? How can you check the status? Hint: Check the available methods by using Jupyter's auto-complete functionality, i.e. type a dot at the end of the object you're investigating followed by <kbd>Tab</kbd>

In [7]:
html = requests.get(seed)

5. Which attribute could be most informative with respect to the actual content? How many characters long is the raw HTML file?

In [8]:
html.status_code

200
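
Besides reading `status_code` directly, a response object offers the boolean `ok` attribute. The sketch below illustrates it offline: the two response objects are built by hand as stand-ins, and `describe_status` is a hypothetical helper, not part of `requests`.

```python
import requests


def describe_status(response):
    """Return a short summary of whether the request succeeded."""
    if response.ok:  # True for status codes below 400
        return f"success ({response.status_code})"
    return f"failure ({response.status_code})"


# Stand-in responses built by hand so the sketch runs without a network;
# in the notebook you would pass the object returned by requests.get().
fake_ok = requests.Response()
fake_ok.status_code = 200
fake_missing = requests.Response()
fake_missing.status_code = 404

print(describe_status(fake_ok))
print(describe_status(fake_missing))
print(requests.codes.ok)  # named constant for the status code 200
```

If you prefer an exception over a manual check, `html.raise_for_status()` raises `requests.HTTPError` for 4xx and 5xx responses.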

6. Display the first 518 characters of the text of the `html` object.

In [9]:
html.content

b'<!DOCTYPE html><html dir="ltr" lang="de-DE"><head><meta charset="utf-8"><!-- benaja - web solutions (www.benaja-websolutions.com) Markus Meier, Roland Brandt und Tobias Gaertner GbR This website is powered by TYPO3 - inspiring people to share! TYPO3 is a free open source Content Management Framework initially created by Kasper Skaarhoj and licensed under GNU/GPL. TYPO3 is copyright 1998-2021 of Kasper Skaarhoj. Extensions are copyright of their respective owners. Information and contribution at https://typo3.org/ --><meta http-equiv="x-ua-compatible" content="IE=edge"/><meta name="generator" content="TYPO3 CMS"/><meta name="description" content="Wo Wissen w\xc3\xa4chst: Die Universit\xc3\xa4t Potsdam punktet mit einer besonderen Vielfalt an Studienm\xc3\xb6glichkeiten und einem ausgepr\xc3\xa4gten interdisziplin\xc3\xa4ren Forschungsprofil. "/><meta name="viewport" content="width=device-width, initial-scale=1.0"/><meta name="author" content="Sabine Schwarz"/><meta name="keywords" con

In [10]:
html.text[:518]

'<!DOCTYPE html><html dir="ltr" lang="de-DE"><head><meta charset="utf-8"><!-- benaja - web solutions (www.benaja-websolutions.com) Markus Meier, Roland Brandt und Tobias Gaertner GbR This website is powered by TYPO3 - inspiring people to share! TYPO3 is a free open source Content Management Framework initially created by Kasper Skaarhoj and licensed under GNU/GPL. TYPO3 is copyright 1998-2021 of Kasper Skaarhoj. Extensions are copyright of their respective owners. Information and contribution at https://typo3.org/'
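
The two outputs above differ in type: `content` holds raw bytes (note the `b'...'` prefix and escapes such as `\xc3\xa4`), while `text` holds the decoded string. The following small sketch, using a short excerpt copied from the output above, shows the relationship and why byte length and character length can differ.

```python
# Excerpt of the raw bytes as they appear in html.content
raw_bytes = b"Wo Wissen w\xc3\xa4chst"

# Decoding with UTF-8 turns the two bytes \xc3\xa4 into the single character 'ä'
decoded = raw_bytes.decode("utf-8")

print(decoded)
print(len(raw_bytes), len(decoded))  # number of bytes vs. number of characters
```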

7. Display the meta information returned with the HTTP response, e.g. the date. Note that you can specify the `User-Agent` header that the server receives, so that the server can return a response (website representation) optimised for it, e.g. desktop vs. mobile. If it is not specified, `requests` sends its own default headers, which identify the client as a Python script (e.g. `python-requests/<version>`). Browsers, by contrast, send headers that can reveal details such as your operating system and preferred language, and the server always sees your IP address.

In [11]:
html.headers

{'Date': 'Thu, 15 Apr 2021 12:29:57 GMT', 'Server': 'Apache/2.4.29 (Ubuntu)', 'Vary': 'Accept-Encoding', 'Last-Modified': 'Thu, 15 Apr 2021 12:25:33 GMT', 'Accept-Ranges': 'bytes', 'Content-Length': '11841', 'Cache-Control': 'max-age=0', 'Expires': 'Thu, 15 Apr 2021 12:29:57 GMT', 'X-UA-Compatible': 'IE=edge', 'X-Content-Type-Options': 'nosniff', 'Content-Encoding': 'gzip', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=utf-8'}
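
To illustrate the `User-Agent` remark, here is a hedged sketch that builds a request with a custom header without sending it. The identification string is a made-up placeholder.

```python
import requests

seed = "https://www.uni-potsdam.de/de/"

# Hypothetical identification string; replace it with your own details.
custom_headers = {"User-Agent": "my-course-scraper/0.1 (contact: me@example.org)"}

# Preparing a request builds it without sending anything over the network.
prepared = requests.Request("GET", seed, headers=custom_headers).prepare()
print(prepared.headers["User-Agent"])

# The default value that requests would send if no User-Agent is given.
print(requests.utils.default_user_agent())
```

To actually send the request with this header, pass the same dictionary to `requests.get(seed, headers=custom_headers)`.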

The cell below saves the HTML object's text attribute in HTML format locally.

In [14]:
with open('Uni_Potsdam.html', 'w', encoding='utf-8') as f:
    f.write(html.text)

8. Install the module `BeautifulSoup` via `pip install beautifulsoup4`

In [15]:
from bs4 import BeautifulSoup

In [17]:
pip install beautifulsoup4




In [16]:
soup = BeautifulSoup(html.text, "html.parser")

9. Search the BeautifulSoup object `soup` for all hyperlinks. Hint: In an HTML document, all elements that link to another URL are marked with the `a` tag and follow the structure `<a href="..." ...>text</a>`. Hint: Use `soup`'s method `find_all()`, where the input argument is the elements' tag name. What object type is the output? Can you iterate over it? How many hyperlink elements are contained in the HTML file?

In [18]:
len(soup.find_all('a'))

275
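
The behaviour of `find_all()` can be explored on a tiny, made-up HTML snippet. This is only a sketch; `sample_html` and its links are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two hyperlinks and one paragraph
sample_html = """
<html><body>
  <a href="/de/studium/">Studium</a>
  <a href="/de/forschung/"> Forschung </a>
  <p>No link here</p>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
links = sample_soup.find_all("a")

print(type(links).__name__)  # the result is a list-like ResultSet
print(len(links))

# A ResultSet can be iterated like a list
for link in links:
    print(link.get("href"), link.text.strip())
```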

10. Convert the result of `find_all()` into a "plain" Python list containing the elements' **text** attributes by iterating over it. Hint: Instantiate an empty `list` object, write a for-loop and `append` each element's text to the list. You may also remove any unwanted whitespace by using the string method `strip`.

In [21]:
empty_list = []
for link in soup.find_all('a'):
    empty_list.append(link.text.strip())

#### Pro tip
Instead of explicitly writing a for-loop to extract specific values from a collection, you can combine Python's built-in `map` function with a `lambda` expression in a one-liner.

In [22]:
results_list = list(map(lambda x: x.text.strip(), soup.find_all('a')))
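
A list comprehension is a common alternative to `map` with `lambda`. The sketch below compares both on a made-up two-link snippet and should produce identical lists.

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for the downloaded page
sample_soup = BeautifulSoup(
    '<a href="/a"> alle Artikel </a><a href="/b">Kontakt</a>', "html.parser"
)

# Variant 1: map with a lambda expression
with_map = list(map(lambda x: x.text.strip(), sample_soup.find_all("a")))

# Variant 2: list comprehension
with_comprehension = [a.text.strip() for a in sample_soup.find_all("a")]

print(with_comprehension)
print(with_map == with_comprehension)
```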

11. Identify the element whose text attribute is equal to "alle Artikel". Return the element's position (`index`) within the list.

In [23]:
results_list.index('alle Artikel')

220

12. Obtain the value of this element's `href` attribute. It should be a URL pointing to the page where the news of Universität Potsdam is collected.

In [None]:
all_news_index = results_list.index('alle Artikel')  # position found in step 11
new_seed = soup.find_all('a')[all_news_index].get('href')

13. Write a function which takes a string (e.g. a URL) as input and returns a readily parseable `BeautifulSoup` object.

In [None]:
# def URL_to_BS(url):
    

    
#     return soup
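
One possible shape for such a function is sketched below. It is not the only solution; the names `html_to_soup` and `url_to_soup` and the 10-second timeout are assumptions made for this example. Splitting parsing from downloading keeps the parsing part testable without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def html_to_soup(html_text):
    """Parse a string of HTML into a BeautifulSoup object."""
    return BeautifulSoup(html_text, "html.parser")


def url_to_soup(url, timeout=10):
    """Download the page at `url` and return it as a BeautifulSoup object."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return html_to_soup(response.text)


# Offline demonstration of the parsing step only
demo_soup = html_to_soup("<html><head><title>Demo</title></head></html>")
print(demo_soup.title.text)
```

With a network connection, `url_to_soup(seed)` would return the parsed start page.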

14. Open the `new_seed` URL in your browser and enable the CSS SelectorGadget. Highlight the box containing the first article. The other, similar boxes should be highlighted as well. Copy the identified CSS selector and search the `news_soup` object (the parsed `new_seed` page), this time for elements matching the CSS selector you found (use `.select()` instead of `find_all()`). Store the subset of elements in a list. You can achieve all of this in one line of code. How many items does this list contain?

15. Split the list's elements into their hyperlinks (`href`) and text attributes' values.
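
Steps 14 and 15 can be sketched on invented markup as follows. The selector `div.news-box a` and the HTML are assumptions for illustration only; the real selector has to come from SelectorGadget.

```python
from bs4 import BeautifulSoup

# Made-up markup: two article boxes and one unrelated footer link
sample_html = """
<div class="news-box"><a href="/news/1">First article</a></div>
<div class="news-box"><a href="/news/2"> Second article </a></div>
<div class="footer"><a href="/imprint">Imprint</a></div>
"""
sample_news_soup = BeautifulSoup(sample_html, "html.parser")

# .select() takes a CSS selector and returns the matching elements
article_links = sample_news_soup.select("div.news-box a")

# Split the elements into hyperlinks and link texts
hrefs = [a.get("href") for a in article_links]
titles = [a.text.strip() for a in article_links]

print(len(article_links))
print(hrefs)
print(titles)
```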

## Pagination
You have probably realised that the articles presented on the first news page are not the University of Potsdam's entire collection. Your goal is to retrieve a complete collection of all articles that are available on the university's website, and you can easily apply your new knowledge in a repetitive manner.

16. Figure out how many pages containing article content there are in total. You can do it manually, e.g. by inspecting the URL as you proceed through the collection in your browser, or programmatically by writing a `while` loop that continues until some condition, such as the status returned by your request, is violated. Make sure to include a short pause (1 second) so that you do not overload the server, which in some cases could lead to a temporary ban of your device.

In [None]:
# Long code block
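
The loop described in step 16 could look roughly like the sketch below. To keep it runnable offline, the HTTP call is replaced by `fake_fetch_status`, a stand-in that pretends three pages exist; `count_pages` and its parameters are hypothetical names, and the real page URL pattern has to be read off the browser's address bar.

```python
import time


def count_pages(fetch_status, first_page=1, pause_seconds=1.0, max_pages=1000):
    """Count consecutive pages for which fetch_status(page) returns 200."""
    page = first_page
    # max_pages is a safety limit against an endless loop
    while page - first_page < max_pages and fetch_status(page) == 200:
        page += 1
        time.sleep(pause_seconds)  # be polite: pause between requests
    return page - first_page


def fake_fetch_status(page):
    """Stand-in for a real request: pages 1 to 3 exist, all others do not."""
    return 200 if page <= 3 else 404


# pause_seconds=0 only because no real server is contacted here
n_pages = count_pages(fake_fetch_status, pause_seconds=0)
print(n_pages)
```

In the notebook, `fetch_status` would be a small function that requests the page with the given number and returns `status_code`. Note that some sites answer with status 200 even for pages beyond the last one, in which case the stop condition has to inspect the page content instead.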

17. Read in the JSON file you stored in the previous step. Split the list of hyperlinks into 4 evenly sized chunks and iterate over each chunk. For each hyperlink, obtain the HTML, parse it and identify the publication date, the contact, the contact's email address, the image's hyperlink/reference and the length of the main text body. Note that some, or even all, of these elements may not be available. Define an appropriate data type for each field and append the fields **as a dictionary** to a list in each iteration.
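
The chunking and the dictionary-per-article structure can be sketched without any downloads. Both helper functions and all field names except `publication_date` are hypothetical; missing elements are represented by `None`.

```python
def split_into_chunks(items, n_chunks=4):
    """Split items into n_chunks contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def make_record(url, publication_date=None, contact=None, email=None,
                image_url=None, text_length=0):
    """Bundle the fields of one article; None marks an unavailable element."""
    return {
        "url": url,
        "publication_date": publication_date,
        "contact": contact,
        "email": email,
        "image_url": image_url,
        "text_length": text_length,
    }


links = [f"/news/{i}" for i in range(10)]  # made-up hyperlinks
chunks = split_into_chunks(links)
print([len(chunk) for chunk in chunks])

# In the notebook, each record would be filled from the parsed article page
records = [make_record(url) for chunk in chunks for url in chunk]
print(records[0])
```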

## Asynchronous HTTP requests

18. Install the libraries `aiohttp` and `tqdm`. The module `asyncio` is part of Python's standard library and does not need to be installed.

In [None]:
import asyncio
import aiohttp
import bs4
import tqdm
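
The general pattern behind asynchronous requests is to start many downloads concurrently while capping how many run at once. The sketch below shows that pattern with `asyncio` only: `fetch_stub` is a stand-in that sleeps instead of contacting a server, and the URLs are made up. In the notebook, the body of `fetch_stub` would be replaced by an `aiohttp` request.

```python
import asyncio


async def fetch_stub(url, delay=0.01):
    """Stand-in for an HTTP request: waits briefly, then returns fake HTML."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def fetch_all(urls, max_concurrent=5):
    """Fetch all URLs concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(url):
        async with semaphore:
            return await fetch_stub(url)

    # gather returns the results in the same order as the input URLs
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


urls = [f"https://www.example.org/news/{i}" for i in range(10)]
pages = asyncio.run(fetch_all(urls))
print(len(pages))
```

Inside a Jupyter notebook an event loop is already running, so there you would write `pages = await fetch_all(urls)` instead of calling `asyncio.run(...)`.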

19. Find the missing link that appears in `articles_links_r` but not in `results_list` using a list comprehension.
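
A sketch of the comparison on made-up data. The two sample lists below merely stand in for `articles_links_r` and `results_list`.

```python
# Made-up stand-ins for articles_links_r and results_list
sample_articles_links_r = ["/news/1", "/news/2", "/news/3"]
sample_results_list = ["/news/1", "/news/3"]

# Converting to a set first makes each membership test fast
seen = set(sample_results_list)
missing_links = [link for link in sample_articles_links_r if link not in seen]
print(missing_links)
```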

20. Install the `pandas` library.

In [None]:
import pandas as pd

21. Convert the `publication_date` into a `pandas` `datetime` object and plot a time series of published articles on a daily basis. Bonus: Aggregate the time series into monthly frequency. In which month-year were most articles published?
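
The conversion and aggregation can be tried out on a handful of invented rows first. The column `title` and all values are made up; only the column name `publication_date` comes from the exercise.

```python
import pandas as pd

# Made-up articles with publication dates stored as strings
articles = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "publication_date": ["2021-03-30", "2021-04-01", "2021-04-01", "2021-04-15"],
})

# Convert the strings into pandas datetime values
articles["publication_date"] = pd.to_datetime(articles["publication_date"])

# Daily time series: number of articles per calendar day (0 on days without articles)
daily_counts = articles.set_index("publication_date")["title"].resample("D").count()

# Monthly aggregation and the month with the most articles
monthly_counts = articles.groupby(articles["publication_date"].dt.to_period("M")).size()
busiest_month = monthly_counts.idxmax()

print(daily_counts.head())
print(busiest_month)
```

Once `matplotlib` is installed, `daily_counts.plot()` draws the time series.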

22. Install the library `matplotlib`.

In [None]:
import matplotlib
import matplotlib.pyplot as plt

23. Install the libraries `cufflinks` and `plotly`.

In [None]:
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

24. Install the `chart-studio` library.

In [None]:
import chart_studio
import chart_studio.plotly as py
import plotly.graph_objs as go

25. Log in to [Plotly Chart Studio](https://chart-studio.plotly.com/Auth/login/#/) and obtain your `Username` and `API key`. Store them as two variables, `Username` and `api_key`, one per line, in a .py file, e.g. named "plotly_config.py".
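
The configuration file could look like the sketch below. The variable names match the ones used in the following cell; the values are placeholders, not real credentials.

```python
# plotly_config.py
# Placeholder values: replace them with the credentials from your own account.
Username = "your-username"
api_key = "your-api-key"
```

Keep this file out of shared folders and version control, since it contains your personal API key.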

In [None]:
import plotly_config

chart_studio.tools.set_credentials_file(username=plotly_config.Username, api_key=plotly_config.api_key)