# Access Contents of HTML Page

- Author:      Johannes Maucher
- Last update: 2018-10-21

This notebook demonstrates how to parse an HTML document and access specific elements of the parse tree.
[Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#) is a Python package for parsing HTML. Download and install version 4 by typing:

> `pip install beautifulsoup4`

into the command shell. Once installed, it can be imported with:

In [1]:
from bs4 import BeautifulSoup

To access arbitrary resources by URL, the Python module [urllib](https://docs.python.org/3/library/urllib.html) is also needed. It is part of the Python 3 standard library, so no separate installation is required. Import the function _urlopen()_ from its `urllib.request` submodule:

In [2]:
from urllib.request import urlopen

If both modules are available, the HTML parse tree of the page at the specified URL can be generated as follows.

In [3]:
#url="http://www.zeit.de"
url="http://www.spiegel.de"
#url="http://www.sueddeutsche.de"
html=urlopen(url).read()
soup=BeautifulSoup(html,"html.parser")
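
The call above uses `urlopen()` without an explicit timeout and with the default `Python-urllib` user agent. The following is a minimal sketch of a helper that sets both explicitly. The names `build_request` and `fetch_html`, the user-agent string and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
from urllib.request import Request, urlopen

# Hypothetical user-agent string, chosen only for illustration.
USER_AGENT = "Mozilla/5.0 (compatible; notebook-example)"

def build_request(url):
    """Create a request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})

def fetch_html(url, timeout=10):
    """Download the page at url and return its raw bytes."""
    # The context manager closes the connection after reading.
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Usage (requires network access):
# html = fetch_html("http://www.spiegel.de")
```

The returned bytes can be passed to `BeautifulSoup(html, "html.parser")` exactly as in the cell above.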

Now, for example, the title of the page can be accessed by:

In [4]:
titleTag = soup.html.head.title
print("Title of page:  ",titleTag.string)

Title of page:   DER SPIEGEL | Online-Nachrichten
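
Navigation through the parse tree can also be tried offline. The sketch below parses a small, made-up HTML string (an assumption for illustration, not content from the crawled page) and shows that the explicit path `html.head.title` and the shortcut `title` reach the same tag.

```python
from bs4 import BeautifulSoup

# Tiny hypothetical page, used instead of a live download.
demo_html = """
<html>
  <head><title>Example page</title></head>
  <body><p>First paragraph</p><p>Second paragraph</p></body>
</html>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Explicit path through the tree, as in the cell above.
title_tag = demo_soup.html.head.title
# Attribute access returns the first matching descendant, so this is the same tag.
same_tag = demo_soup.title

print(title_tag.string)          # text inside <title>
print(title_tag is same_tag)
print(demo_soup.body.p.string)   # attribute access reaches only the first <p>
```

Because attribute access returns only the first match, collecting all tags of one type requires `find_all()`, as used in the following sections.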


## Get all links in the page
All links in the page can be retrieved with the following code (only the first 20 links are printed):

In [5]:
hreflinks=[]
Alllinks=soup.find_all('a') # The <a> tag defines a hyperlink, which is used to link from one page to another.
for l in Alllinks:
    if l.has_attr('href'):
        hreflinks.append(l)
print("Number of links in this page: ",len(hreflinks))
for l in hreflinks[:20]:
    print(l['href'])

Number of links in this page:  588
#Inhalt
https://www.spiegel.de/
https://abo.spiegel.de/?b=SPOHNAVABO&requestAccessToken=true&sara_icid=disp_upd_9h6L5hu8K1AAnttzYATx3hvk7taDkP&targetUrl=https%3A%2F%2Fwww.spiegel.de%2Ffuermich%2F
https://gruppenkonto.spiegel.de/authenticate?requestAccessToken=true&targetUrl=https%3A%2F%2Fwww.spiegel.de%2Ffuermich%2F
https://www.spiegel.de/fuermich/
https://www.spiegel.de/
https://www.spiegel.de/schlagzeilen/
https://www.spiegel.de/spiegel/
https://www.spiegel.de/audio/
https://www.spiegel.de/fuermich/
https://www.spiegel.de/schlagzeilen/
https://www.spiegel.de/plus/
https://www.spiegel.de/magazine
https://www.spiegel.de/thema/coronavirus/
https://www.spiegel.de/thema/klimawandel/
https://www.spiegel.de/politik/deutschland/
https://www.spiegel.de/ausland/
https://www.spiegel.de/panorama/
https://www.spiegel.de/sport/
https://www.spiegel.de/wirtschaft/
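
The output above contains a page-internal anchor (`#Inhalt`) and several repeated URLs. A minimal sketch of how such a list could be cleaned is given below. It works on plain strings taken from the output above. With the notebook's `hreflinks`, which holds tag objects, the `href` values would have to be extracted first.

```python
# A few entries taken from the output above.
hrefs = [
    "#Inhalt",
    "https://www.spiegel.de/",
    "https://www.spiegel.de/fuermich/",
    "https://www.spiegel.de/",
    "https://www.spiegel.de/schlagzeilen/",
    "https://www.spiegel.de/fuermich/",
]

# Skip links that only point to an anchor inside the current page.
page_links = [h for h in hrefs if not h.startswith("#")]

# dict.fromkeys keeps the first occurrence of each key and preserves order.
unique_hrefs = list(dict.fromkeys(page_links))

print(len(hrefs), "->", len(unique_hrefs))
for h in unique_hrefs:
    print(h)
```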


## Get all news titles
Get the titles of all news items currently listed on [www.spiegel.de](http://www.spiegel.de), the page loaded above:

In [6]:
AllTitles=soup.find_all('h2')
alltitles=[]
alltitleLinks=[]
for l in AllTitles:
    try:
        # l.find('a') is None if the headline contains no link (TypeError);
        # a missing 'title' or 'href' attribute raises KeyError
        title = l.find('a')['title']
        link = l.find('a')['href']
        print('-'*40)
        print(title)
        print(link)
        alltitles.append(title)
        alltitleLinks.append(link)
    except (TypeError, KeyError):
        # skip headlines without a linked title
        pass

----------------------------------------
Tax Havens in Europa: "Finance Ministers Often Couldn't See Through Them"
https://www.spiegel.de/international/world/tax-havens-in-europa-finance-ministers-often-couldn-t-see-through-them-a-d06f7761-eb3a-4a65-94a5-900912f47dc9
----------------------------------------
Baden-Württemberg: Seniorinnen und Senioren können Führerschein gegen ÖPNV-Ticket tauschen
https://www.spiegel.de/auto/baden-wuerttemberg-seniorinnen-und-senioren-koennen-fuehrerschein-gegen-oepnv-ticket-tauschen-a-f4c2e16f-c32d-469a-bd32-18f879cec571
----------------------------------------
Scheidungsauktion in New York": Ausverkauf einer Ehe
https://www.spiegel.de/kultur/auktion-kunstsammlung-von-harry-und-linda-macklowe-a-073c7501-2380-482d-a322-78e257a18eb1
----------------------------------------
Verhaltensökonomin zur Pandemie: »Eine Impfpflicht wäre ein Vertrauensbruch«
https://www.spiegel.de/gesundheit/diagnose/corona-interview-mit-katrin-schmelz-eine-impfpflicht-waere-ein-v
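
The same extraction can be written without exception handling by asking Beautiful Soup only for `<a>` tags that carry both attributes. The sketch below uses made-up markup (an assumption for illustration) instead of the live page. Note one difference from the cell above: it takes the first `<a>` that has both `href` and `title`, not simply the first `<a>` in the headline.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics three kinds of headlines.
demo_html = """
<h2><a href="/news/1" title="First headline">First</a></h2>
<h2>Headline without link</h2>
<h2><a href="/news/2">Link without title attribute</a></h2>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

titles, links = [], []
for h2 in demo_soup.find_all("h2"):
    # Match only <a> tags that carry both attributes; find() returns None otherwise.
    a = h2.find("a", href=True, title=True)
    if a is None:
        continue
    titles.append(a["title"])
    links.append(a["href"])

print(titles)
print(links)
```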

## Get all images of the page

Get the URLs of all images currently displayed on [www.spiegel.de](http://www.spiegel.de):

In [7]:
imglinks=[]
AllImgs=soup.find_all('img')
for l in AllImgs:
    if l.has_attr('src'):
        imglinks.append(l)

for l in imglinks[:10]:
    print(l['src'])

data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
https://cdn.prod.www.spiegel.de/images/a4262e76-bd5f-40f5-8cf3-b8da59ddc739_w872_r1.778_fpx48_fpy45.jpg
https://cdn.prod.www.spiegel.de/images/706de5a3-10a3-4213-87c5-14b32fbeabeb_w117_r1.33_fpx54_fpy51.jpg
https://cdn.prod.www.spiegel.de/images/706de5a3-10a3-4213-87c5-14b32fbeabeb_w488_r1.778_fpx54_fpy51.jpg
https://cdn.prod.www.spiegel.de/images/58bf8c1e-e932-4142-9ba8-13be8d6a6267_w117_r1.33_fpx47.97_fpy49.98.jpg
https://cdn.prod.www.spiegel.de/images/58bf8c1e-e932-4142-9ba8-13be8d6a6267_w117_r1.33_fpx47.97_fpy49.98.jpg
https://cdn.prod.www.spiegel.de/images/58bf8c1e-e932-4142-9ba8-13be8d6a6267_w488_r1.778_fpx47.97_fpy49.98.jpg
https://cdn.prod.www.spiegel.de/images/58bf8c1e-e932-4142-9ba8-13be8d6a6267_w488_r1.778_fpx47.97_fpy49.98.jpg
https://cdn.prod.www.spiegel.de/images/966809db-288f-4387-92ce-cd69a90ebdaf_w488_r1.778_fpx43.2_fpy50.jpg
https://cdn.prod.www.spiegel.de/images/966809db-288f-4387-92ce-cd69a90
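
The first entry in the output above is a `data:` URI, i.e. an image embedded directly in the page rather than a link to an image file. A minimal sketch of how such entries could be filtered out is shown below. The helper name `is_remote_image` is an illustrative assumption.

```python
from urllib.parse import urlparse

# Three entries taken from the output above.
srcs = [
    "data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==",
    "https://cdn.prod.www.spiegel.de/images/a4262e76-bd5f-40f5-8cf3-b8da59ddc739_w872_r1.778_fpx48_fpy45.jpg",
    "https://cdn.prod.www.spiegel.de/images/706de5a3-10a3-4213-87c5-14b32fbeabeb_w117_r1.33_fpx54_fpy51.jpg",
]

def is_remote_image(src):
    """Return True if src is an http(s) URL rather than inline data."""
    return urlparse(src).scheme in ("http", "https")

remote = [s for s in srcs if is_remote_image(s)]
for s in remote:
    print(s)
```

Relative image paths have an empty scheme and are therefore also excluded by this check. They would need to be resolved against the page URL first.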

## Get the entire text of a news article

In [8]:
IDX=0
suburl=alltitleLinks[IDX]
try:
    html=urlopen(suburl).read() # works if the article is referenced by an absolute URL
except ValueError:
    # urlopen raises ValueError for a URL without a scheme, i.e. a relative link
    html=urlopen(url+suburl).read() # works if the article is referenced by a relative path
soup=BeautifulSoup(html,"html.parser")
AllP=soup.find_all('p')
for p in AllP:
    print(p.get_text())

The Yacht harbor in Valletta, Malta: "The usual suspects"
For decades, a handful of European member states have lured large corporations to set up shop there in exchange for extremely low tax rates that have meant billions in extra earnings. Martijn Nouwen, 37, spent years conducting detective work on the practice, collecting more than 2,500 internal documents from a European Union panel that is supposed to be working to combat these tax-dumping practices. Together with its partners in the European Investigative Collaborations (EIC), DER SPIEGEL reviewed the documents and revealed how nearly 25 years of efforts by the Code of Conduct Group to stop competition for lower corporate tax rates within the EU have been in vain. In an interview, Nouwen, who is now an assistant professor at the University of Leiden, discusses the reasons, some of which are to be found in Berlin.
Tax expert Martijn Nouwen
Andreas Chudowski / DER SPIEGEL
DER SPIEGEL: Mr. Nouwen, the EU has been trying for a quart
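
The cell above decides between absolute and relative links by trial and error. As an alternative sketch, `urljoin()` from the standard library handles both cases in one call. The relative paths below are made up for illustration, and the absolute link is taken from the link list printed earlier.

```python
from urllib.parse import urljoin

base = "http://www.spiegel.de"

# Hypothetical relative links, with and without a leading slash.
relative = "/politik/deutschland/"
relative_no_slash = "sport/"
# Absolute link as it appears in the link list above.
absolute = "https://www.spiegel.de/ausland/"

print(urljoin(base, relative))           # resolved against base
print(urljoin(base, relative_no_slash))  # a separating slash is inserted
print(urljoin(base, absolute))           # absolute links are returned unchanged
```

Plain string concatenation (`url+suburl`) would glue `sport/` directly onto the host name, which `urljoin()` avoids.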

## Questions and Remarks
1. This notebook demonstrates how raw text can be crawled from news sites. What is the drawback of this method?
2. Execute the entire notebook also for `www.zeit.de` and `www.sueddeutsche.de`.
3. What do you observe? How can the problem be solved?