# A Very Quick Introduction to Web Scraping
_Copyright 2020 Andre M. Maier_

## 1. Import some necessary libraries

In [1]:
import requests
from bs4 import BeautifulSoup

## 2. Use requests to load a web page

In [2]:
req = requests.get("https://raw.githubusercontent.com/profqubit/webscraping/main/example1.html", verify=True)
print(req) # 200 -> ok

<Response [200]>
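
Printing the response only shows the status code. As a minimal sketch of a slightly more defensive download step, the hypothetical helper `fetch_html()` below (its name and the 10-second timeout are assumptions, not part of the original notebook) lets requests raise an exception for any 4xx/5xx status instead of silently returning an error page.

```python
import requests


def fetch_html(url, timeout=10):
    """Download `url` and return its body as text.

    Hypothetical helper for illustration: raises a subclass of
    requests.exceptions.RequestException on network problems,
    malformed URLs, or HTTP error codes.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turns 4xx/5xx responses into an HTTPError
    return response.text


# Possible usage (requires network access):
# html = fetch_html("https://raw.githubusercontent.com/profqubit/webscraping/main/example1.html")
```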


## 3. Build a BeautifulSoup object

In [3]:
content = req.content
soup = BeautifulSoup(content, "html.parser")  # name the parser explicitly to avoid a parser-guessing warning
# print(soup.prettify())

## 4. Use BeautifulSoup to extract information

BeautifulSoup maps the hierarchical structure of an HTML document onto a tree of objects. You can navigate this tree by tag name and through attributes such as .parent, .children, .next_sibling, .previous_sibling, and many more. Note that accessing a tag by name (e.g. soup.p) always returns its first occurrence on the page.<br>
The following examples will illustrate the principle.<br>
For more information visit https://www.crummy.com/software/BeautifulSoup/bs4/doc/<br>

In [4]:
print(soup.title)
print(soup.h1)
print("---------------------------")
print(soup.title.parent)

<title>Example 1</title>
<h1>The Black Cat</h1>
---------------------------
<head>
<title>Example 1</title>
</head>
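
The sibling and child attributes mentioned above can be sketched in the same way. The snippet below uses a small made-up HTML string (not the example page) and assumes the built-in `html.parser`. It is written without whitespace between tags; on a real page, `.next_sibling` often returns a `'\n'` string rather than the next tag.

```python
from bs4 import BeautifulSoup

# Made-up document, deliberately without whitespace between the tags.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Heading</h1><p>First</p><p>Second</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                      # first <p> on the page
parent_name = first_p.parent.name     # name of the enclosing tag
previous_tag = first_p.previous_sibling
next_tag = first_p.next_sibling

# .children is an iterator over the direct children of a tag.
child_names = [child.name for child in soup.body.children]

print(parent_name)    # body
print(previous_tag)   # <h1>Heading</h1>
print(next_tag)       # <p>Second</p>
print(child_names)    # ['h1', 'p', 'p']
```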


The .contents attribute returns everything between an opening and closing tag as a list of objects. This includes whitespace between tags, which shows up as '\n' strings.

In [5]:
elements = soup.head.contents
print(elements)
title = elements[1]
print(title.contents)

['\n', <title>Example 1</title>, '\n']
['Example 1']
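
Because the whitespace strings shift the list positions, indexing into `.contents` by number can be fragile. One possible way around this, sketched here on a made-up `<head>` fragment and assuming `html.parser`, is to keep only the entries that are `Tag` objects.

```python
from bs4 import BeautifulSoup, Tag

# Made-up fragment with line breaks between the tags.
html = '<head>\n<title>Example 1</title>\n<meta charset="utf-8"/>\n</head>'
soup = BeautifulSoup(html, "html.parser")

all_nodes = soup.head.contents
# Keep real tags only; the '\n' entries are NavigableString objects.
tags_only = [node for node in all_nodes if isinstance(node, Tag)]
tag_names = [tag.name for tag in tags_only]

print(len(all_nodes))  # 5
print(tag_names)       # ['title', 'meta']
```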


Use the .string attribute to directly extract the text between an opening and closing tag. Note that this only works for tags whose content boils down to a single string; otherwise .string is None.

In [6]:
print(soup.title.string)
first_paragraph = soup.p.string
print("The word \"the\" occurs", first_paragraph.count("the"), "times in the first paragraph.")

Example 1
The word "the" occurs 12 times in the first paragraph.
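
The limitation of `.string` is easy to reproduce. The following sketch uses two made-up paragraphs (assuming `html.parser`): one with plain text and one that mixes text with a nested tag. For the second one `.string` is `None`, so calling a string method such as `count()` on it would raise an `AttributeError`.

```python
from bs4 import BeautifulSoup

html = "<p>Plain text only</p><p>Text with <b>bold</b> markup</p>"
soup = BeautifulSoup(html, "html.parser")

simple, mixed = soup.find_all("p")

print(simple.string)  # Plain text only
print(mixed.string)   # None, because the tag has several children
```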


If you are only interested in the human-readable text in a document or between tags, you can also use get_text(). In this case there is no difference, because the first paragraph contains only plain text.

In [7]:
first_paragraph = soup.p.get_text()
print("The word \"the\" occurs", first_paragraph.count("the"), "times in the first paragraph.")

The word "the" occurs 12 times in the first paragraph.
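
The difference becomes visible as soon as a tag contains nested markup. As a small sketch on made-up HTML (assuming `html.parser`), `get_text()` concatenates all strings inside the tag, and its `separator` and `strip` parameters help to tidy up the result.

```python
from bs4 import BeautifulSoup

html = (
    "<p>Text with <b>bold</b> markup</p>"
    "<ul><li> red </li><li> green </li></ul>"
)
soup = BeautifulSoup(html, "html.parser")

# All strings inside the paragraph, joined without a separator.
paragraph_text = soup.p.get_text()

# Strip each string and join the pieces with a custom separator.
list_text = soup.ul.get_text(separator=", ", strip=True)

print(paragraph_text)  # Text with bold markup
print(list_text)       # red, green
```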


If you want to find and extract all occurrences of a specific HTML tag, you can use find_all().

In [8]:
paragraphs = soup.find_all('p')  # avoid shadowing the built-in name "list"
print("There are", len(paragraphs), "paragraphs on the web page.")
last_paragraph = paragraphs[-1]  # a negative index counts from the end
print("The text in the last paragraph is as follows: \n", last_paragraph.string)

There are 32 paragraphs on the web page.
The text in the last paragraph is as follows: 
 Of my own thoughts it is folly to speak. Swooning, I staggered to the opposite wall. For one instant the party upon the stairs remained motionless, through extremity of terror and of awe. In the next, a dozen stout arms were toiling at the wall. It fell bodily. The corpse, already greatly decayed and clotted with gore, stood erect before the eyes of the spectators. Upon its head, with red extended mouth and solitary eye of fire, sat the hideous beast whose craft had seduced me into murder, and whose informing voice had consigned me to the hangman. I had walled the monster up within the tomb! 
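
`find_all()` accepts more than a single tag name. The sketch below, again on a made-up snippet and assuming `html.parser`, passes a list of tag names and uses the `limit` argument to stop after a given number of matches.

```python
from bs4 import BeautifulSoup

html = "<h1>Title</h1><p>one</p><h2>Sub</h2><p>two</p><p>three</p>"
soup = BeautifulSoup(html, "html.parser")

# A list matches any of the given tag names, in document order.
heading_names = [tag.name for tag in soup.find_all(["h1", "h2"])]

# limit stops the search after the given number of results.
first_two = [p.string for p in soup.find_all("p", limit=2)]

print(heading_names)  # ['h1', 'h2']
print(first_two)      # ['one', 'two']
```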


find_all() also accepts a compiled regular expression, which it matches against the tag names.

In [9]:
import re
regex = re.compile('^t')
tags_starting_with_t = soup.find_all(regex)
print(tags_starting_with_t)

[<title>Example 1</title>]
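
On the example page only `<title>` starts with a "t". On a page that contains a table, the same pattern would match more tags. The following sketch uses made-up HTML (assuming `html.parser`) and adds a second, hypothetical pattern that selects heading tags.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head><body>"
    "<h1>A</h1><h2>B</h2>"
    "<table><tr><td>x</td></tr></table>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Every tag whose name starts with "t".
t_tags = [tag.name for tag in soup.find_all(re.compile("^t"))]

# Only the heading tags h1 to h6 ("html" and "head" do not match).
heading_tags = [tag.name for tag in soup.find_all(re.compile(r"^h[1-6]$"))]

print(t_tags)        # ['title', 'table', 'tr', 'td']
print(heading_tags)  # ['h1', 'h2']
```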


## 5. Exercise

* Write a Python program that loads Example #2 available at https://raw.githubusercontent.com/profqubit/webscraping/main/example2.html
* Extract the following information:
 - A list that contains all food categories mentioned on the page
 - A list that contains all fruits, vegetables, nuts, and seeds mentioned on the page
 - A list that contains only the seeds mentioned on the page
 - The number of different fruits mentioned on the page
 - A list that contains all URLs referred to by the hyperlinks on the page