# Scraping WWW Exercise

## Exercise 1

Examine the front page of the BILD newspaper (www.bild.de) and create a list of all articles that can be found on that page. Each item of the list must contain:

* article title,
* main image of the article,
* url of the article.

**Hint:**

* request the content of the `www.bild.de` page and use the `"rel": "bookmark"` attribute to identify links that point to articles,
* request the content of each article to obtain its url, title, teaser and main image,
* you can use the `"og"` properties of the `<meta>` tags within an article to retrieve its title, main image and url.
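The hints can be tried out offline before touching the real site. The following is a minimal sketch, assuming BeautifulSoup is installed; the two HTML snippets, the `example.com` URLs and the variable names are made up for illustration and do not reflect the actual markup of bild.de.

```python
import bs4

# Made-up HTML standing in for a front page and an article page.
FRONT_PAGE = """
<article><a rel="bookmark" href="/news/example-1.bild.html">Example 1</a></article>
<article><a href="/about.html">About</a></article>
"""

ARTICLE_PAGE = """
<html><head>
<meta property="og:title" content="Example headline">
<meta property="og:image" content="https://images.example.com/pic.jpg">
<meta property="og:url" content="https://www.example.com/news/example-1.bild.html">
</head><body></body></html>
"""

# Links marked rel="bookmark" are treated as article links.
front = bs4.BeautifulSoup(FRONT_PAGE, 'html.parser')
bookmark_hrefs = [a['href'] for a in front.find_all('a', attrs={'rel': 'bookmark'})]

# Collect every og:* meta tag into a dict keyed by the part after "og:".
article = bs4.BeautifulSoup(ARTICLE_PAGE, 'html.parser')
og = {}
for meta in article.find_all('meta'):
    prop = meta.get('property', '')
    if prop.startswith('og:'):
        og[prop[3:]] = meta.get('content')

print(bookmark_hrefs)
print(og)
```

Gathering all `og:*` tags into one dict makes it easy to see which properties a page actually provides before deciding which ones to keep.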

### 1. Import libraries

In [1]:
import requests
import bs4

### 2. Scrape bild.de

In [2]:
url = 'http://www.bild.de'
page = requests.get(url).text

### 3. Create a BeautifulSoup object

In [4]:
page = bs4.BeautifulSoup(page, 'html.parser')

### 4. Create a list of article links

In [5]:
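from urllib.parse import urljoin

# urljoin resolves relative hrefs against the base url and leaves absolute hrefs unchanged
article_links = [urljoin(url, a['href']) for a in page.select('article a[href$="bild.html"]')]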

print(len(article_links))

117
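A front page often links to the same article more than once, and `href` values may be relative or absolute. The sketch below shows one possible way to normalise and de-duplicate such links with the standard library; the `hrefs` list and the names `BASE_URL` and `unique_links` are hypothetical and only serve the illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.bild.de'

# Hypothetical href values, as they might appear on a front page.
hrefs = [
    '/sport/example-1.bild.html',                    # relative
    'https://www.bild.de/news/example-2.bild.html',  # already absolute
    '/sport/example-1.bild.html',                    # duplicate teaser link
]

# urljoin leaves absolute URLs alone and resolves relative ones against the base.
absolute = [urljoin(BASE_URL, href) for href in hrefs]

# dict.fromkeys drops duplicates while keeping first-seen order.
unique_links = list(dict.fromkeys(absolute))

print(unique_links)
```

De-duplicating before the download step avoids requesting the same article several times.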


### 5. Scrape the article data

In [26]:
articles = []

for link in article_links:
    try:
        # download and parse the article page
        article_html = requests.get(link).text
        article_bs_tree = bs4.BeautifulSoup(article_html, 'html.parser')

        # select relevant data from the article
        title = article_bs_tree.find(name='meta', attrs={'property': 'og:title'}).get('content')
        image = article_bs_tree.find(name='meta', attrs={'property': 'og:image'}).get('content')
        # use a separate name so the front-page `url` from step 2 is not overwritten
        article_url = article_bs_tree.find(name='link', attrs={'rel': 'canonical'}).get('href')

        # store the data in a dict and add it to the list of articles
        articles.append({
            'title': title,
            'image': image,
            'url': article_url})

    except Exception:
        # skip articles that cannot be downloaded or that lack one of the tags
        continue
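The loop above drops an article entirely whenever one tag is missing, because `find` returns `None` and the chained `.get` raises. As an alternative, here is a minimal sketch of a helper that returns `None` for a missing tag so the rest of the record can still be kept; the function name `meta_content` and the sample HTML are assumptions for illustration, not part of the exercise.

```python
import bs4

def meta_content(soup, prop):
    """Return the content of <meta property=prop>, or None if the tag is absent."""
    tag = soup.find('meta', attrs={'property': prop})
    return tag.get('content') if tag is not None else None

# Made-up article page that has a title but no og:image tag.
html = '<head><meta property="og:title" content="Example headline"></head>'
soup = bs4.BeautifulSoup(html, 'html.parser')

record = {
    'title': meta_content(soup, 'og:title'),
    'image': meta_content(soup, 'og:image'),
}
print(record)
```

Whether to keep partial records or skip them, as the exercise solution does, depends on how the list will be used afterwards.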




In [25]:
print(articles[:3])
print(len(articles))

[{'title': 'Gefälschte Idiotentests: MPU-Abzocker ohne Führerschein im Ferrari', 'image': 'https://images.bild.de/65c0b2a322c3637e12ee9d6f/c3ccd824748b138956e603207b0628da,56969e6?w=1280', 'url': 'https://www.bild.de/bild-plus/regional/koeln/regional/gefaelschte-idiotentests-mpu-abzocker-ohne-fuehrerschein-im-ferrari-87056642.bild.html'}, {'title': 'Kanye West: Fans fürchten nach Nackt-Auftritt um seine Frau Bianca', 'image': 'https://images.bild.de/65c33c01bfc8c93720775fe4/8837017ff6f28c0f534cf5e35ef0511e,85451f1?w=1280', 'url': 'https://www.bild.de/unterhaltung/leute/leute/kanye-west-fans-fuerchten-nach-nackt-auftritt-um-seine-frau-bianca-87066312.bild.html'}, {'title': 'DFB-Pokal: Viertelfinale zwischen Leverkusen und Stuttgart sorgt für Quoten-Rekord!', 'image': 'https://images.bild.de/65c3573cbfc8c9372077615a/85179069876389dd1aa44754aa9a76d7,690030c8?w=1280', 'url': 'https://www.bild.de/sport/fussball/dfb-pokal/dfb-pokal-viertelfinale-zwischen-leverkusen-und-stuttgart-sorgt-fuer-q