## Parsing HTML with BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML or XML files. Refer to the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for more information. 

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
url = "https://www.bbc.com/pidgin/topics/c2dwqd1zr92t"

In [None]:
response = requests.get(url)
print(dir(response))   # dir allows us to check the methods available for an object

In [None]:
page_html = response.content
print(page_html)

In [None]:
print(type(page_html))   # response.content is raw bytes; without bs4, these are tedious to work with
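`response.content` holds bytes, while `response.text` holds the decoded string. The sketch below imitates both with a made-up HTML snippet (no network request is made) and suggests that BeautifulSoup accepts either form.

```python
from bs4 import BeautifulSoup

# Made-up sample standing in for a downloaded page
raw_bytes = "<p>Wetin dey happen?</p>".encode("utf-8")   # plays the role of response.content
decoded_text = raw_bytes.decode("utf-8")                 # plays the role of response.text

print(type(raw_bytes))      # bytes
print(type(decoded_text))   # str

# BeautifulSoup can be built from bytes or from a string
soup_from_bytes = BeautifulSoup(raw_bytes, "html.parser")
soup_from_text = BeautifulSoup(decoded_text, "html.parser")
print(soup_from_bytes.p.text == soup_from_text.p.text)
```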

In [None]:
page_soup = BeautifulSoup(page_html, "html.parser") 
print("type: ",type(page_soup))
print(f"page soup: {page_soup}")

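BeautifulSoup supports several parsers: Python's built-in `html.parser` (used above) and the third-party `lxml` and `html5lib`, which must be installed separately. The documentation compares them.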
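As a minimal sketch, the parser is chosen by the second argument to `BeautifulSoup`. The HTML string here is a made-up sample, and only the built-in `html.parser` is used so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up sample document
html = "<html><head><title>Tori</title></head><body><p>Hello</p></body></html>"

# The second argument names the parser; "html.parser" ships with Python
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)   # text of the <title> element
print(soup.p.text)         # text of the first <p> element
```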

In [None]:
print(dir(page_soup))  # we get a rich list of methods available for our BeautifulSoup object

**`TODO`**: Complete the `get_page_soup` function in `scraper.py`

## Locating elements

### 1. Locating elements by tag

In [11]:
all_urls = page_soup.find_all("a")   # find_all is the current name; findAll is a legacy alias

In [12]:
print(len(all_urls))

90


In [13]:
print(type(all_urls[0]))

<class 'bs4.element.Tag'>


In [None]:
print(dir(all_urls[0]))

In [15]:
all_urls[0].get("href")     # hrefs may be absolute

'https://www.bbc.co.uk'

In [16]:
all_urls[12].get("href")     # hrefs may also be relative

'/pidgin/topics/c2dwqd1zr92t'
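Because hrefs can be absolute or relative, one way to normalise them is `urllib.parse.urljoin` from the standard library. This is a sketch of one possible approach, not necessarily the one `scraper.py` expects; the base URL is assumed to be `https://www.bbc.com`.

```python
from urllib.parse import urljoin

base = "https://www.bbc.com"   # assumed site root

# A relative href is resolved against the base
relative = urljoin(base, "/pidgin/topics/c2dwqd1zr92t")

# An absolute href is returned unchanged
absolute = urljoin(base, "https://www.bbc.co.uk")

print(relative)
print(absolute)
```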

In [None]:
all_urls

### 2. Locating elements by tag & class

To obtain the class of an element on a web page, right-click the element and select `Inspect`.

![InspectElement](static/inspect_element.png)

In [None]:
headline = page_soup.find(
        "h1", attrs={"class": "bbc-13dm3d0 e1yj3cbb0"}
        )
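The sketch below shows how class matching behaves, using a made-up `h1` with hypothetical class names. Passing the full class string matches only that exact value, whereas `class_` with a single name or a CSS selector matches regardless of order. Auto-generated class names like the ones above may change when the site is redesigned, so the looser forms can be more robust.

```python
from bs4 import BeautifulSoup

# Made-up markup; "headline" and "big" are hypothetical class names
html = '<h1 class="headline big">Top tori</h1>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.find("h1", attrs={"class": "headline big"})      # exact string value of the attribute
reordered = soup.find("h1", attrs={"class": "big headline"})  # different order: no match
single = soup.find("h1", class_="headline")                   # any h1 carrying this one class
css = soup.select_one("h1.big.headline")                      # CSS selector, order-independent

print(exact.text, reordered, single.text, css.text)
```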

### How many pages of articles are there in a category?
**`TODO`**: Go to any of the category pages of BBC Pidgin. Can you retrieve the `class` of the `span` element that contains the total number of pages as text? In the picture below, for example, retrieve the `class` of the `span` that contains `100` as text.

![PageNumber](static/PageNumber.png)
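Once the `span` is located, its text still has to be converted to a number. Here is a minimal sketch on made-up markup; the class name `page-count` is a placeholder and not the real class on the BBC Pidgin site.

```python
from bs4 import BeautifulSoup

# Made-up pagination markup; "page-count" is a placeholder class name
html = '<nav><span class="current">1</span><span class="page-count">100</span></nav>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span", attrs={"class": "page-count"})
total_pages = int(span.text.strip())        # element text is a string, so convert it

page_numbers = range(1, total_pages + 1)    # 1 through total_pages inclusive
print(total_pages, len(page_numbers))
```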

## What is a valid article link?

Websites often follow a pattern for article URLs within a category, which eases our work. Check through a number of articles. Do you notice a pattern in the article URLs?

In [None]:
for link in all_urls:
    href = link.get("href")   # None when the <a> tag has no href attribute

    # Article links are relative, start with a known section, and end in a digit
    if href and href.startswith(
        ("/pidgin/tori", "/pidgin/world", "/pidgin/sport")
    ) and href[-1].isdigit():
        valid_url = "https://www.bbc.com" + href
        print(valid_url)

In [None]:
base_url = "https://www.bbc.com"
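The filtering rule can also be wrapped in a small helper so it is easy to test on its own. This is only a sketch: `is_article_href` is a hypothetical name, the prefixes are the three sections used above, and the sample hrefs include one made-up article path.

```python
from typing import Optional

BASE_URL = "https://www.bbc.com"
ARTICLE_PREFIXES = ("/pidgin/tori", "/pidgin/world", "/pidgin/sport")


def is_article_href(href: Optional[str]) -> bool:
    """Hypothetical helper: True for relative hrefs that look like article links."""
    # bool(href) guards against None (no href attribute) and empty strings
    return bool(href) and href.startswith(ARTICLE_PREFIXES) and href[-1].isdigit()


sample_hrefs = [
    "https://www.bbc.co.uk",        # absolute link to another site
    "/pidgin/topics/c2dwqd1zr92t",  # category page, not an article
    "/pidgin/sport-58110814",       # article
    "/pidgin/tori-12345678",        # made-up article path for illustration
    None,                           # <a> tag without an href
    "",
]

article_urls = [BASE_URL + href for href in sample_hrefs if is_article_href(href)]
print(article_urls)
```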

In [None]:
article = requests.get(
    # insert a valid URL here
)

**`TODO`**: Complete the `get_valid_urls` and `get_urls` functions in `scraper.py`

## Getting article text

In [None]:
response = requests.get("https://www.bbc.com/pidgin/sport-58110814")

In [None]:
page_html = response.text
page_soup = None  # TODO: replace None with a BeautifulSoup object created from `page_html`

Text on websites is frequently embedded within one or more `div` elements. Open the article we requested above in your browser and `Inspect` any paragraph from the article. What do you notice about the `div` elements? What is the value of their `class` attribute?

In [None]:
story_div = page_soup.find_all(
        "div", attrs={"class": "bbc-19j92fr e57qer20"}
        )

In [None]:
story_div

Inside `div` elements, text is usually written in `p` elements. BeautifulSoup allows us to search the `div` element we retrieved above for elements it contains. 

In [None]:
p_elements = [div.find_all("p") for div in story_div]   # one list of <p> tags per div
p_elements

In [None]:
p_elements[0]

In [None]:
p_elements[0][0]

In [None]:
p_elements[0][0].text
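To turn the nested lists into a single article string, the paragraphs can be flattened and joined. The sketch below runs on made-up markup with a placeholder class name `story`; `get_text(" ", strip=True)` is used so that text inside nested tags such as `<b>` stays separated by spaces.

```python
from bs4 import BeautifulSoup

# Made-up article markup; "story" is a placeholder class name
html = """
<div class="story"><p>First paragraph.</p><p>Second <b>bold</b> paragraph.</p></div>
<div class="story"><p>Third paragraph.</p></div>
<div class="other"><p>Not part of the story.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

story_divs = soup.find_all("div", attrs={"class": "story"})

# Flatten: every <p> inside every matching div, in document order
paragraphs = [
    p.get_text(" ", strip=True)
    for div in story_divs
    for p in div.find_all("p")
]

article_text = "\n".join(paragraphs)
print(article_text)
```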

**`TODO`**: Complete the `get_article_data` and `scrape` functions in `scraper.py`. 

**`TODO`**: Complete the `get_parser` function in `scraper.py`