# SOLUTIONS NOTEBOOK

In this tutorial we use the [requests](https://pypi.org/project/requests/) and [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) Python packages to extract information from three BBC articles.

Disclaimer: Always check a website's Terms of Service, and confirm with the website owner, that you are allowed to scrape it. This tutorial is only a research exercise; we are not actually scraping any article data from the BBC.

## Package imports

In [None]:
import requests
from bs4 import BeautifulSoup

## News article URL

In [None]:
url = "https://www.bbc.co.uk/news/business-65321487"

## Getting the HTML source code with requests

We start by using the requests package to get the HTML source code for the URL above.

Each call to `requests.get()` sends one request to the website.

In [None]:
page_html = requests.get(url = url)
soup = BeautifulSoup(markup = page_html.content, features = "html.parser")

You can (and should) also identify yourself through the `User-Agent` header, for transparency:

In [None]:
page_html = requests.get(url = url, headers = {'User-Agent':"Data collection for the purpose of research. For questions contact: xx@xx.com"})
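
If you plan to make several requests, one option is to set the header once on a `requests.Session` instead of repeating it in every call. The sketch below is only an illustration: it prepares a request without sending it, so it runs offline, and the contact string is the same placeholder as above.

```python
import requests

url = "https://www.bbc.co.uk/news/business-65321487"
user_agent = "Data collection for the purpose of research. For questions contact: xx@xx.com"

# Headers set on the session are merged into every request it prepares or sends
session = requests.Session()
session.headers.update({"User-Agent": user_agent})

# prepare_request() builds the request without contacting the website
prepared = session.prepare_request(requests.Request(method="GET", url=url))
print(prepared.headers["User-Agent"])
```

Calling `session.get(url)` would then send the request with this header attached.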

A status code of 200 means that the request was successful:

In [None]:
page_html
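
To act on the status code in code, rather than reading it off the cell output, `requests` responses expose `status_code`, `ok` and `raise_for_status()`. The sketch below uses hand-built `Response` objects as stand-ins for real responses (an assumption made so that it runs without network access), and `describe_status` is a hypothetical helper name.

```python
import requests


def describe_status(response):
    """Return a short label for the outcome of a request."""
    # response.ok is True when the status code is below 400
    return "success" if response.ok else "failed"


# Hand-built responses stand in for real ones so the sketch runs offline
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

print(describe_status(ok_response))
print(describe_status(missing_response))

# raise_for_status() turns 4xx and 5xx codes into an exception
try:
    missing_response.raise_for_status()
except requests.HTTPError as error:
    print("Request failed:", error)
```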

In [None]:
# bytes variable
type(page_html.content)

## Parsing HTML with BeautifulSoup

In [None]:
soup = BeautifulSoup(markup = page_html.content, features = "html.parser")

In [None]:
type(soup)

We use `print()` together with the `prettify()` method to better understand the HTML structure.

In [None]:
# prettify() adds indentation, which makes the printed HTML easier to read
print(soup.prettify())

## Let us now extract information from our "soup"

### Main Heading

`find()` returns the first matching item, or `None` if nothing matches.

In [None]:
title = soup.find(id="main-heading")

In [None]:
title.text
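
Because `find()` returns `None` when nothing matches, calling `.text` on a missing element raises an `AttributeError`. Here is a minimal sketch of guarding against that, using a small made-up HTML snippet rather than the BBC page.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration
sample_html = """
<html><body>
  <h1 id="main-heading">First heading</h1>
  <p>One</p>
  <p>Two</p>
</body></html>
"""
sample_soup = BeautifulSoup(markup=sample_html, features="html.parser")

# Only the first <p> is returned, even though two exist
first_paragraph = sample_soup.find(name="p")

# No element has this id, so find() returns None
missing = sample_soup.find(id="does-not-exist")

# Check for None before reading .text
heading = sample_soup.find(id="main-heading")
heading_text = heading.text if heading is not None else None
print(first_paragraph.text, missing, heading_text)
```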

### Body of text

Let us start by using `find()` to get the first matching item.

In [None]:
soup.find(name='div', attrs={'data-component':"text-block"})

In [None]:
first_match = soup.find(name='div', attrs={'data-component':"text-block"})
print(first_match.prettify())

In [None]:
first_match.text

`find_all()` returns a list of all matching items.

In [None]:
# Full list of matching items
all_matches = soup.find_all(name='div', attrs={'data-component':"text-block"})
all_matches

In [None]:
# List size
len(all_matches)

In [None]:
# List of text strings
[t.text for t in all_matches]

In [None]:
# All sentences together
print(" ".join([t.text for t in all_matches]))
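
As a variation, `get_text()` can strip surrounding whitespace and insert a separator between nested pieces of text, which `.text` does not do. The sketch below runs on made-up HTML that mimics the `data-component` attribute used above.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure of the article body
sample_html = """
<div data-component="text-block"><p>  First sentence. </p></div>
<div data-component="image-block"><p>Not part of the text.</p></div>
<div data-component="text-block"><p>Second <b>bold</b> sentence.</p></div>
"""
sample_soup = BeautifulSoup(markup=sample_html, features="html.parser")

sample_blocks = sample_soup.find_all(name="div", attrs={"data-component": "text-block"})

# strip=True trims each piece of text; " " is the separator between pieces
sample_paragraphs = [block.get_text(" ", strip=True) for block in sample_blocks]
sample_article_text = " ".join(sample_paragraphs)
print(sample_article_text)
```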

### Publication time

Time as text:

In [None]:
time_info = soup.find_all(name='time')
time_info

In [None]:
print(time_info[0].prettify())

In [None]:
# probably BST time
soup.find(name='time').text

Time as a datetime string (UTC):

In [None]:
soup.find(name='time').attrs

In [None]:
soup.find(name='time').attrs["datetime"]
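
The `datetime` attribute is a plain string. If you need to compare or convert times, you can parse it into a `datetime` object. The sketch below assumes the value is an ISO 8601 timestamp ending in `Z`; the timestamp is illustrative rather than copied from the article, and BST is approximated by a fixed UTC+1 offset.

```python
from datetime import datetime, timedelta, timezone

# Illustrative value in the shape of a `datetime` attribute (not copied from the article)
raw_timestamp = "2023-04-19T07:14:06.000Z"

# Older Python versions do not accept a trailing "Z", so spell out the UTC offset
published_utc = datetime.fromisoformat(raw_timestamp.replace("Z", "+00:00"))

# Fixed UTC+1 offset standing in for BST
bst = timezone(timedelta(hours=1), name="BST")
published_bst = published_utc.astimezone(bst)

print(published_utc.isoformat())
print(published_bst.strftime("%H:%M %Z"))
```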

### All links found on the webpage

In [None]:
# First link
soup.find(name="a").get("href")

In [None]:
# Printing all links found on the webpage
for link in soup.find_all(name='a'):
    print(link.get('href'))


### Topic links

In [None]:
# List of topic links
all_links = soup.find_all(name='a')
# Some <a> tags have no href, in which case get() returns None, so skip those
topic_links = [link.get('href') for link in all_links if link.get('href') and "topics" in link.get('href')]
topic_links
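
Links in a page are often relative (for example `/news/...`), and some `<a>` tags have no `href` at all. A minimal sketch of handling both cases follows; the HTML and the topic path are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.bbc.co.uk/news/business-65321487"

# Made-up HTML: a relative link, an anchor without href, and an absolute link
sample_html = """
<a href="/news/topics/example-topic">Example topic</a>
<a name="anchor-only">No href here</a>
<a href="https://www.bbc.co.uk/sport">Sport</a>
"""
sample_soup = BeautifulSoup(markup=sample_html, features="html.parser")

# href=True keeps only the <a> tags that actually have an href attribute
sample_links = sample_soup.find_all(name="a", href=True)

# urljoin() resolves relative links against the page URL
absolute_links = [urljoin(base_url, link["href"]) for link in sample_links]
sample_topic_links = [link for link in absolute_links if "/topics/" in link]
print(sample_topic_links)
```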

### Topic names list

In [None]:
# Topics all in one string without a clear separator
[i.text for i in soup.find_all(name='div', attrs={'data-component':"topic-list"})]

In [None]:
soup.find(name='div', attrs={'data-component':"topic-list"}).find_all("li")

In [None]:
[topic.text for topic in soup.find(name='div', attrs={'data-component':"topic-list"}).find_all("li")]

### Author information


In [None]:
# We only really want the author
soup.find(name='div', attrs={'data-component':"byline-block"}).text

In [None]:
by = soup.find(name='div', attrs={'data-component':"byline-block"})
print(by.prettify())

In [None]:
# These class names look auto-generated, so this rule may break if the page is rebuilt
by.find("div", attrs={"class":"ssrcss-68pt20-Text-TextContributorName e8mq1e96"}).text
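
Class names such as `ssrcss-68pt20-Text-TextContributorName e8mq1e96` look machine-generated and may change over time. One possible alternative is to match only the stable-looking part of the class name with a regular expression. This is a sketch on made-up markup, and it assumes that the `TextContributorName` fragment stays the same across pages.

```python
import re

from bs4 import BeautifulSoup

# Made-up byline markup; the names and the second class are illustrative
sample_html = """
<div data-component="byline-block">
  <div class="ssrcss-68pt20-Text-TextContributorName e8mq1e96">By Jane Doe</div>
  <div class="ssrcss-xyz-Text-Role">Business reporter</div>
</div>
"""
sample_soup = BeautifulSoup(markup=sample_html, features="html.parser")
sample_byline = sample_soup.find(name="div", attrs={"data-component": "byline-block"})

# class_ accepts a compiled regex, which is tested against each class on the tag
author_tag = sample_byline.find(name="div", class_=re.compile("TextContributorName"))
author = author_tag.text if author_tag is not None else None
print(author)
```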

## Exercises

1. Does the code above work on other articles from bbc.co.uk? 

You can give it a try with the following ones or others:

```
url = "https://www.bbc.co.uk/news/science-environment-57159056"
url = "https://www.bbc.co.uk/news/business-64261457"
```

It does work on these three webpages. That does not guarantee it will work on every other page, but if you plan to collect data from multiple similar pages you should make your scraper as general as possible.
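
One way to make the rules above more general is to wrap them in a function that tolerates missing elements. The sketch below is one possible shape, not a definitive scraper: `extract_article` is a hypothetical helper, and it is run on made-up HTML strings instead of live pages.

```python
from bs4 import BeautifulSoup


def extract_article(html):
    """Return the title and body text of an article, tolerating missing elements."""
    soup = BeautifulSoup(markup=html, features="html.parser")
    title = soup.find(id="main-heading")
    blocks = soup.find_all(name="div", attrs={"data-component": "text-block"})
    return {
        # find() returns None when the heading is absent
        "title": title.get_text(strip=True) if title is not None else None,
        # find_all() returns an empty list when nothing matches, so join() gives ""
        "text": " ".join(block.get_text(" ", strip=True) for block in blocks),
    }


# Made-up pages: one with the expected structure, one without
complete_page = (
    '<h1 id="main-heading">Sample title</h1>'
    '<div data-component="text-block"><p>Body text.</p></div>'
)
sparse_page = "<p>No heading or text blocks here.</p>"

print(extract_article(complete_page))
print(extract_article(sparse_page))
```

With real pages you would pass `page_html.content` for each URL to the function.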

2. This article `https://www.bbc.co.uk/news/science-environment-57159056` contains headlines (besides the main title). Write code to extract those headlines.

**Steps:**

a) Use requests to get the HTML source code.

b) Use BeautifulSoup to parse the HTML.

c) Build a rule that allows you to extract the information you need. 

In [None]:
url = "https://www.bbc.co.uk/news/science-environment-57159056"
page_html = requests.get(url = url)
soup = BeautifulSoup(markup = page_html.content, features = "html.parser")
soup.find_all('div', attrs={'data-component':"subheadline-block"})

In [None]:
# Headline text only
[headline.text for headline in soup.find_all('div', attrs={'data-component':"subheadline-block"})]

3. Extract the figure captions from one of the articles.

In [None]:
first_fig_caption = soup.find('figcaption')
print(first_fig_caption.prettify())

In [None]:
soup.find('figcaption').text

In [None]:
all_fig_captions = soup.find_all('figcaption')

[caption.text for caption in all_fig_captions]

4. Extract relevant metadata information from the pictures in one of the articles.

In [None]:
print(soup.find('picture').prettify())

In [None]:
# For example, we can extract a picture's alt text from its <img> child
soup.find('picture').find('img').get('alt')
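
To collect metadata for every image rather than just the first, you can loop over all `<img>` tags and use `.get()`, which returns `None` when an attribute is missing. The markup and URLs below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: one <picture> with alt text and one bare <img> without it
sample_html = """
<picture>
  <source srcset="https://example.com/photo.webp" type="image/webp">
  <img alt="A wind farm at sunset" src="https://example.com/photo.jpg" width="976">
</picture>
<img src="https://example.com/spacer.gif">
"""
sample_soup = BeautifulSoup(markup=sample_html, features="html.parser")

# .get() avoids a KeyError for images that lack an attribute
image_metadata = [
    {"alt": img.get("alt"), "src": img.get("src")}
    for img in sample_soup.find_all(name="img")
]
print(image_metadata)
```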

5. How do you interpret the following `robots.txt` files?


**Robots #1:**

```
User-agent: Google
Disallow:

User-agent: *
Disallow: /
```

**Robots #2:**

```
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /search/
Request-rate: 15/100
```

Robots 1:
- Google is allowed to crawl all website pages with no restrictions (an empty `Disallow:` blocks nothing);
- All other bots are not allowed to access any page of the website.

Robots 2:
- BadBot can't access any page;
- All other user agents can access every path except those under `/search/`. On the allowed paths they should make at most 15 requests every 100 seconds (roughly 1 request every 6.7 seconds).
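
You can check this interpretation with Python's built-in `urllib.robotparser`. The sketch below parses the two files from strings instead of downloading them; `ResearchBot` and the `example.com` URLs are made-up names used only for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_1 = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""

ROBOTS_2 = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /search/
Request-rate: 15/100
"""


def parse_robots(text):
    """Build a parser from robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


robots_1 = parse_robots(ROBOTS_1)
robots_2 = parse_robots(ROBOTS_2)

# Robots #1: Google is allowed everywhere, every other bot is blocked
print(robots_1.can_fetch("Google", "https://example.com/news"))
print(robots_1.can_fetch("ResearchBot", "https://example.com/news"))

# Robots #2: BadBot is blocked, other bots are blocked only under /search/
print(robots_2.can_fetch("BadBot", "https://example.com/news"))
print(robots_2.can_fetch("ResearchBot", "https://example.com/news"))
print(robots_2.can_fetch("ResearchBot", "https://example.com/search/results"))

# Request-rate: 15/100 means 15 requests per 100 seconds
rate = robots_2.request_rate("ResearchBot")
seconds_per_request = rate.seconds / rate.requests
print(round(seconds_per_request, 1))
```

To respect that rate between requests, you could pause with `time.sleep(seconds_per_request)`.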