# Access Contents of HTML Page

- Author:      Johannes Maucher
- Last update: 2018-10-21

This notebook demonstrates how to parse an HTML document and access specific elements of the parse tree.
[Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#) is a Python package for parsing HTML. Download and install version 4 by typing:

> `pip install beautifulsoup4`

into the command shell. Once it is installed, it can be imported with:

In [1]:
from bs4 import BeautifulSoup

To access arbitrary resources by URL, the Python module [urllib](https://docs.python.org/3/library/urllib.html) is also needed. It is part of the standard library, so no separate installation is required. Import the function _urlopen()_ from its submodule `urllib.request`:

In [2]:
from urllib.request import urlopen

If these two modules are available, the HTML parse tree of the page at the specified URL can be generated as follows.

In [3]:
#url="http://www.zeit.de"
url="http://www.spiegel.de"
#url="http://www.sueddeutsche.de"
html=urlopen(url).read()
soup=BeautifulSoup(html,"html.parser")
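
Some servers answer slowly or reject requests that carry no browser-like `User-Agent` header. The following is a minimal sketch of a more defensive download, assuming that situation applies. The helper names `build_request` and `fetch_html` and the user-agent string are illustrative and not part of the original notebook.

```python
from urllib.request import Request, urlopen

# Hypothetical user-agent string; any descriptive value can be used.
USER_AGENT = "Mozilla/5.0 (compatible; html-parsing-notebook)"

def build_request(url):
    """Create a request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})

def fetch_html(url, timeout=10):
    """Download the page at url and return its raw bytes."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Usage (requires network access):
# html = fetch_html("http://www.spiegel.de")
# soup = BeautifulSoup(html, "html.parser")
```

The `timeout` argument keeps the cell from hanging indefinitely if the server does not respond.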

Now, for example, the title of the page can be accessed with:

In [4]:
titleTag = soup.html.head.title
print("Title of page:  ",titleTag.string)

Title of page:   DER SPIEGEL | Online-Nachrichten


## Get all links in the page
All links in the page can be retrieved with the following code (only the first 20 links are printed):

In [5]:
hreflinks=[]
Alllinks=soup.findAll('a') #The <a> tag defines a hyperlink, which is used to link from one page to another.
for l in Alllinks:
    if l.has_attr('href'):
        hreflinks.append(l)
print("Number of links in this page: ",len(hreflinks))
for l in hreflinks[:20]:
    print(l['href'])

Number of links in this page:  561
#Inhalt
https://www.spiegel.de/nutzungsbedingungen
https://www.spiegel.de/
https://abo.spiegel.de/?b=SPOHNAVABO&requestAccessToken=true&sara_icid=disp_upd_9h6L5hu8K1AAnttzYATx3hvk7taDkP&targetUrl=https%3A%2F%2Fwww.spiegel.de%2Ffuermich%2F
https://gruppenkonto.spiegel.de/authenticate?requestAccessToken=true&targetUrl=https%3A%2F%2Fwww.spiegel.de%2Ffuermich%2F
https://www.spiegel.de/fuermich/
https://www.spiegel.de/
https://www.spiegel.de/schlagzeilen/
https://www.spiegel.de/plus/
https://www.spiegel.de/audio/
https://www.spiegel.de/fuermich/
https://www.spiegel.de/schlagzeilen/
https://www.spiegel.de/plus/
https://www.spiegel.de/thema/coronavirus/
https://www.spiegel.de/politik/deutschland/
https://www.spiegel.de/politik/ausland/
https://www.spiegel.de/panorama/
https://www.spiegel.de/sport/
https://www.spiegel.de/wirtschaft/
https://www.spiegel.de/netzwelt/
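
The output above contains an in-page anchor (`#Inhalt`) and several repeated links. If only distinct external targets are of interest, the list can be filtered as in the following sketch. The helper name `unique_page_links` and the sample list are illustrative; the sample values are taken from the output above.

```python
def unique_page_links(hrefs):
    """Drop in-page anchors (starting with '#') and repeated links, keeping order."""
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    return list(dict.fromkeys(h for h in hrefs if not h.startswith("#")))

sample = [
    "#Inhalt",
    "https://www.spiegel.de/nutzungsbedingungen",
    "https://www.spiegel.de/",
    "https://www.spiegel.de/fuermich/",
    "https://www.spiegel.de/",
]
print(unique_page_links(sample))
```

Applied to the notebook's data, the call would be `unique_page_links(l['href'] for l in hreflinks)`.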


## Get all news titles
Get the titles of all news items currently listed on the page loaded above (here [www.spiegel.de](http://www.spiegel.de)):

In [6]:
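# Collect the title and link of every headline (<h2> element) that wraps a link
AllTitles=soup.findAll('h2')
alltitles=[]
alltitleLinks=[]
for l in AllTitles:
    try:
        title = l.find('a')['title']
        link = l.find('a')['href']
    except (TypeError, KeyError):
        # TypeError: the <h2> contains no <a>; KeyError: the <a> lacks 'title' or 'href'
        continue
    print('-'*40)
    print(title)
    print(link)
    alltitles.append(title)
    alltitleLinks.append(link)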

----------------------------------------
Childbirth in the Pandemic: How COVID-19 Is Indirectly Killing Mothers and Babies
https://www.spiegel.de/international/tomorrow/childbirth-in-the-pandemic-how-covid-19-is-indirectly-killing-mothers-and-babies-a-adf7c1f1-441b-4aa3-87bd-86e9f3345b0b
----------------------------------------
Trotz harter Vorwürfe: Staatsbank verlängerte Wirecard noch im Herbst 2019 die Kreditlinie
https://www.spiegel.de/wirtschaft/unternehmen/wirecard-staatsbank-verlaengerte-wirecard-noch-im-herbst-2019-die-kreditlinie-a-b8971bfc-7f45-4787-922b-3b649cb1cdba
----------------------------------------
Energiekonzern: BP rechnet mit Ende des Ölzeitalters
https://www.spiegel.de/wirtschaft/unternehmen/bp-oelfoerderung-steht-laut-energiekonzern-vor-dem-ende-a-655257b8-7e33-4a29-8c3e-ed37569219fa
----------------------------------------
Stolberg in Nordrhein-Westfalen: Ermittler prüfen Zusammenhang von Messerangriff mit Kommunalwahl
https://www.spiegel.de/politik/deutschland
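
As an alternative to skipping unsuitable headlines with exception handling, Beautiful Soup's `select()` method accepts CSS selectors, so the attribute requirements can be stated up front. The sketch below runs on a small, made-up HTML fragment (`sample_html`) that only mimics the headline markup assumed above; real pages may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment that mimics the headline markup assumed above.
sample_html = """
<h2><a href="/politik/a-1" title="First headline">First</a></h2>
<h2><a href="/sport/a-2">No title attribute</a></h2>
<h2>No link at all</h2>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
# Select only <a> tags inside <h2> that carry both a title and an href attribute.
headlines = [(a["title"], a["href"]) for a in sample_soup.select("h2 a[title][href]")]
print(headlines)
```

Note that this selects every matching `<a>` below an `<h2>`, whereas the loop above only looks at the first `<a>` in each headline.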

## Get all images of the page

Get the URLs of all images currently displayed on the page loaded above (here [www.spiegel.de](http://www.spiegel.de)):

In [7]:
imglinks=[]
AllImgs=soup.findAll('img')
for l in AllImgs:
    if l.has_attr('src'):
        imglinks.append(l)

for l in imglinks[:10]:
    print(l['src'])

data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
https://cdn.prod.www.spiegel.de/images/f0706c1e-c6fc-4a27-b5b6-dca6f542114a_w948_r2.11_fpx52_fpy65.jpg
https://cdn.prod.www.spiegel.de/images/f0706c1e-c6fc-4a27-b5b6-dca6f542114a_w920_r1.77_fpx52_fpy65.jpg
https://cdn.prod.www.spiegel.de/images/883748c5-c973-4882-aa5f-c9504b4f783d_w117_r1.33_fpx66.54_fpy49.98.jpg
https://cdn.prod.www.spiegel.de/images/883748c5-c973-4882-aa5f-c9504b4f783d_w488_r1.77_fpx66.54_fpy49.98.jpg
https://cdn.prod.www.spiegel.de/images/e809de3e-525b-494a-8199-8640f2b1648d_w56_r1_fpx55.01_fpy45.png
https://cdn.prod.www.spiegel.de/images/e809de3e-525b-494a-8199-8640f2b1648d_w56_r1_fpx55.01_fpy45.png
https://cdn.prod.www.spiegel.de/images/c2e5f84a-1b35-4b13-aca8-8e7fc29e002f_w488_r1.77_fpx49.54_fpy50.jpg
https://cdn.prod.www.spiegel.de/images/c2e5f84a-1b35-4b13-aca8-8e7fc29e002f_w488_r1.77_fpx49.54_fpy50.jpg
https://cdn.prod.www.spiegel.de/images/35ea4af8-9b3e-4055-8e41-cfe808b8c767_w117_r1.3
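
The first entry in the output above is not a downloadable address but a `data:` URI, i.e. a tiny image embedded directly in the page. A possible way to keep only sources that can actually be fetched over the web is sketched below. The helper name `downloadable_image_urls` is illustrative, and the sketch assumes that relative `src` values (which have no scheme) should be dropped as well.

```python
from urllib.parse import urlparse

def downloadable_image_urls(srcs):
    """Keep only image sources that use the http or https scheme."""
    return [s for s in srcs if urlparse(s).scheme in ("http", "https")]

sample_srcs = [
    "data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==",
    "https://cdn.prod.www.spiegel.de/images/f0706c1e-c6fc-4a27-b5b6-dca6f542114a_w948_r2.11_fpx52_fpy65.jpg",
]
print(downloadable_image_urls(sample_srcs))
```

With the notebook's data, the call would be `downloadable_image_urls(l['src'] for l in imglinks)`.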

## Get the entire text of a news article

In [8]:
IDX=0
suburl=alltitleLinks[IDX]
try:
    html=urlopen(suburl).read() # works if the link is an absolute URL
except ValueError: # urlopen raises ValueError ("unknown url type") for relative links
    html=urlopen(url+suburl).read() # works if the link is a path relative to the site root
soup=BeautifulSoup(html,"html.parser")
AllP=soup.findAll('p')
for p in AllP:
    print(p.get_text())


SPIEGEL+-Zugang wird gerade auf einem anderen Gerät genutzt

SPIEGEL+ kann nur auf einem Gerät zur selben Zeit genutzt werden.

Klicken Sie auf den Button, spielen wir den Hinweis auf dem anderen Gerät aus und Sie können SPIEGEL+ weiter nutzen.


Midwife Emily Owino cuts the umbilical cord of a newborn girl with a razor blade in Kenya.
Brian Inganga/ AP
For our Global Societies project, reporters around the world will be writing about societal problems, sustainability and development in Asia, Africa, Latin America and Europe. The series will include features, analyses, photo essays, videos and podcasts looking behind the curtain of globalization. The project is generously funded by the Bill & Melinda Gates Foundation.
It's shortly before midnight on May 29 when Brian Inganga gets the call. "You have to hurry," the midwife tells him. The 31-year-old photographer immediately jumps on a motorcycle taxi and heads to Kibera, a slum in Nairobi, the capital city of Kenya. Inganga is anxious,
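
The fallback in the cell above joins the site address and a relative link by plain string concatenation. The standard library function `urllib.parse.urljoin` covers both cases in one call: it resolves relative links against a base address and returns absolute links unchanged. A small sketch, with example paths chosen purely for illustration:

```python
from urllib.parse import urljoin

base = "http://www.spiegel.de"

# A relative link is resolved against the base address.
relative_result = urljoin(base, "/politik/deutschland/")
# An absolute link is returned unchanged.
absolute_result = urljoin(base, "https://www.spiegel.de/sport/")

print(relative_result)
print(absolute_result)
```

In the notebook, `urlopen(urljoin(url, suburl))` could then replace the `try`/`except` construction.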

## Questions and Remarks
1. This notebook demonstrates how raw text can be crawled from news sites. What is the drawback of this method?
2. Execute the entire notebook for the other news sites listed in the URL cell as well (`www.zeit.de` and `www.sueddeutsche.de`).
3. What do you observe? How can the problem be solved?