## Parsing HTML with BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML or XML files. Refer to the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for more information. 

In [None]:
# pipenv shell

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
url = "https://www.bbc.com/pidgin/topics/c2dwqd1zr92t"

In [3]:
response = requests.get(url)
print(dir(response))   # dir lists the attributes and methods available on an object

['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']


In [None]:
page_html = response.content
print(page_html)

In [7]:
print(type(page_html))   # response.content is a raw bytes object, which is tedious to search by hand without bs4

<class 'bytes'>
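
`response.content` holds the raw bytes of the page, while `response.text` holds the same body decoded to a `str`. BeautifulSoup accepts either one. The sketch below uses a made-up inline snippet instead of the BBC page, so it runs without a network request; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for a downloaded page
html_bytes = "<p>Wetin dey happen?</p>".encode("utf-8")  # like response.content
html_text = html_bytes.decode("utf-8")                   # like response.text

soup_from_bytes = BeautifulSoup(html_bytes, "html.parser")
soup_from_text = BeautifulSoup(html_text, "html.parser")

# Both inputs parse to the same paragraph text
same_text = soup_from_bytes.p.get_text() == soup_from_text.p.get_text()
print(type(html_bytes), type(html_text), same_text)
```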


In [None]:
page_soup = BeautifulSoup(page_html, "html.parser") 
print("type: ",type(page_soup))
print(f"page soup: {page_soup}")

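BeautifulSoup supports several parsers: Python's built-in `html.parser`, used above, and the third-party `lxml` and `html5lib`, which must be installed separately. The documentation linked above describes how they differ.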
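
As a rough illustration of switching parsers, the sketch below tries a preferred parser and falls back to the built-in `html.parser` when the preferred one is not installed. The helper name `make_soup` and the sample markup are assumptions made for this example; they are not part of `scraper.py`.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up sample markup for illustration
SAMPLE_HTML = "<html><head><title>Sample</title></head><body><p>Hello</p></body></html>"


def make_soup(html, preferred="lxml"):
    """Parse html with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser library is not installed
        return BeautifulSoup(html, "html.parser")


soup = make_soup(SAMPLE_HTML)
title_text = soup.title.string
print(title_text)
```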

In [20]:
print(len(page_soup))  # len counts the top-level nodes of the parsed document

3


**`TODO`**: Complete the `get_page_soup` function in `scraper.py`

## Locating elements

### 1. Locating elements by tag

In [10]:
all_urls = page_soup.find_all("a")  # find_all is the current name; findAll is a legacy alias

In [11]:
print(len(all_urls))

90


In [12]:
print(type(all_urls[0]))

<class 'bs4.element.Tag'>


In [None]:
print(dir(all_urls[0]))

In [14]:
all_urls[0].get("href")     # hrefs may be absolute

'https://www.bbc.co.uk'

In [15]:
all_urls[12].get("href")     # hrefs may also be relative

'/pidgin/topics/c2dwqd1zr92t'
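
Because hrefs can be absolute, relative, or just a fragment, one way to normalise them is `urllib.parse.urljoin` from the standard library, which resolves each href against the URL of the page it came from. A minimal sketch using hrefs seen in the output above:

```python
from urllib.parse import urljoin

page_url = "https://www.bbc.com/pidgin/topics/c2dwqd1zr92t"

hrefs = [
    "https://www.bbc.co.uk",        # absolute: returned unchanged
    "/pidgin/topics/c2dwqd1zr92t",  # relative: resolved against the site root
    "#main-content",                # fragment: appended to the page URL
]

resolved = [urljoin(page_url, href) for href in hrefs]
for full_url in resolved:
    print(full_url)
```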

In [16]:
all_urls

[<a href="https://www.bbc.co.uk">Homepage</a>,
 <a href="#main-content">Waka go wetin de inside</a>,
 <a href="https://www.bbc.co.uk/accessibility/" id="orb-accessibility-help">Help to enter</a>,
 <a href="https://www.bbc.com/news">News</a>,
 <a href="https://www.bbc.com/sport">Sport</a>,
 <a href="https://www.bbc.com/weather">Weather</a>,
 <a href="https://www.bbc.co.uk/worldserviceradio">Radio</a>,
 <a href="https://www.bbc.co.uk/arts">Arts</a>,
 <a class="istats-notrack" data-alt="Plenti" href="#orb-footer">Menu<span class="orb-icon orb-icon-arrow"></span></a>,
 <a class="gs-u-pv" href="/pidgin" id="brand"><span><svg viewbox="0 0 150 23" xmlns="http://www.w3.org/2000/svg"><title>BBC News Pidgin</title><g fill="#fff"><path class="cls-1" d="M12.5 3.54h2.14v15.78H12.7L2.15 7.17v12.15H.02V3.54h1.83L12.5 15.8V3.54zM18.72 3.54h8.94v2.01h-6.68v4.81h6.46v2.03h-6.46v4.9h6.9v2.01h-9.16V3.54zM51.78 3.54h2.26l-6.39 15.85h-.49L42 6.56l-5.22 12.83h-.48L29.93 3.54h2.28l4.35 10.88 4.38-10.88h2.14l4

### 2. Locating elements by tag & class

To obtain the class of an element on a web page, right-click the element and select `Inspect`.

![InspectElement](static/inspect_element.png)

In [None]:
headline = page_soup.find(
        "h1", attrs={"class": "bbc-13dm3d0 e1yj3cbb0"}
        )
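
When an element has several classes, the way the class is passed affects what matches. The sketch below, on made-up markup with hypothetical class names, compares three options: an exact class string (order-sensitive), a single class via `class_`, and a CSS selector via `select` (order-independent).

```python
from bs4 import BeautifulSoup

# Made-up markup; the class names are hypothetical
html = """
<h1 class="headline main">Top story</h1>
<h1 class="main headline">Other story</h1>
<p class="headline">Not a heading</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string: matches only when the class attribute is written in this order
exact = soup.find("h1", attrs={"class": "headline main"})

# Single class: matches every h1 that carries "headline" among its classes
by_single_class = soup.find_all("h1", class_="headline")

# CSS selector: requires both classes, in any order
by_css = soup.select("h1.headline.main")

print(exact.get_text(), len(by_single_class), len(by_css))
```

Class names generated by a site's build tooling can change without notice, so a selector that works today may need updating later.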

### How many pages of articles are there in a category?
**`TODO`**: Go to any of the category pages of BBC Pidgin. Can you retrieve the `class` of the `span` element that contains the number of total pages as text? In the picture below, for example, retrieve the `class` of the `span` that contains `100` as text. 

![PageNumber](static/PageNumber.png)
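
Once the `span` is located, its text still has to be converted to a number. The sketch below shows that step on made-up markup; the class name `page-total` is a placeholder, not the real class used on BBC Pidgin, which you still need to look up with `Inspect`.

```python
from bs4 import BeautifulSoup

# Made-up pagination markup; "page-total" is a placeholder class name
html = '<nav><span class="page-current">1</span><span class="page-total">100</span></nav>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span", attrs={"class": "page-total"})

# Fall back to a single page when the span is missing
total_pages = int(span.get_text(strip=True)) if span else 1
print(total_pages)
```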

## What is a valid article link?

Websites often follow a pattern in the URLs of the articles in a category, which makes our work easier. Check through a number of articles. Do you notice a pattern in the article URLs?

In [17]:
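for link in all_urls:   # `link` avoids overwriting the `url` string defined earlier
    href = link.get("href")   # None when the <a> tag has no href attribute

    # Article links start with a section prefix and end with a numeric ID
    prefixes = ("/pidgin/tori", "/pidgin/world", "/pidgin/sport")
    if href and href.startswith(prefixes) and href[-1].isdigit():
        print(href)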

/pidgin/tori-58745492
/pidgin/tori-58745492
/pidgin/tori-58745492
/pidgin/tori-58738377
/pidgin/tori-58738377
/pidgin/tori-58738377
/pidgin/tori-58709437
/pidgin/tori-58709437
/pidgin/tori-58709437
/pidgin/tori-58736254
/pidgin/tori-58736254
/pidgin/tori-58736254
/pidgin/tori-58732508
/pidgin/tori-58732508
/pidgin/tori-58732508
/pidgin/tori-58731043
/pidgin/tori-58731043
/pidgin/tori-58731043
/pidgin/tori-58731038
/pidgin/tori-58731038
/pidgin/tori-58731038


Prefixing a relative href with the site's base URL gives a full article URL: `valid_url = "https://www.bbc.com" + href`
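
The output above lists each article three times, because several links on the page point to the same article. One possible way to filter and de-duplicate in a single pass is sketched below. The regular expression is an assumption based on the URLs printed above and is stricter than the `startswith` check; the helper name `valid_article_urls` is hypothetical and is not the `get_valid_urls` function from `scraper.py`.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.bbc.com"

# Assumed pattern: /pidgin/<section>-<numeric id>, as seen in the printed hrefs
ARTICLE_PATTERN = re.compile(r"^/pidgin/(tori|world|sport)-\d+$")


def valid_article_urls(hrefs):
    """Return absolute article URLs, skipping duplicates and non-article links."""
    seen = set()
    urls = []
    for href in hrefs:
        if href and ARTICLE_PATTERN.match(href) and href not in seen:
            seen.add(href)
            urls.append(urljoin(BASE_URL, href))
    return urls


sample_hrefs = [
    "/pidgin/tori-58745492",
    "/pidgin/tori-58745492",        # duplicate link to the same article
    None,                           # <a> tag without an href
    "/pidgin/topics/c2dwqd1zr92t",  # category page, not an article
    "/pidgin/tori-58738377",
]
article_urls = valid_article_urls(sample_hrefs)
print(article_urls)
```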

In [None]:
base_url = "https://www.bbc.com"

In [None]:
article = requests.get(
    # TODO: insert a valid URL here
)

**`TODO`**: Complete the `get_valid_urls` and `get_urls` functions in `scraper.py`

## Getting article text

In [None]:
response = requests.get("https://www.bbc.com/pidgin/sport-58110814")

In [None]:
page_html = response.text
page_soup = # TODO: Create a BeautifulSoup object from `page_html`

Text on websites is frequently embedded within one or more `div` elements. Open the article we requested above in your browser and `Inspect` any paragraph from the article. What do you notice about the `div` elements? What is the value of their `class` attribute?

In [None]:
story_div = page_soup.find_all(
        "div", attrs={"class": "bbc-19j92fr e57qer20"}
        )

In [None]:
story_div

Inside `div` elements, text is usually written in `p` elements. BeautifulSoup allows us to search each `div` element we retrieved above for the elements it contains.

In [None]:
p_elements = [div.find_all("p") for div in story_div]   # one list of <p> tags per div
p_elements

In [None]:
p_elements[0]

In [None]:
p_elements[0][0]

In [None]:
p_elements[0][0].text
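
The steps above can be combined to turn the nested lists of `p` elements into a single block of article text. This is a sketch on made-up markup with a placeholder class name (`story`), not the real BBC class, and it is only one possible approach to the exercise that follows.

```python
from bs4 import BeautifulSoup

# Made-up markup; "story" is a placeholder class name
html = """
<div class="story"><p>First paragraph.</p><p>Second <b>bold</b> paragraph.</p></div>
<div class="other"><p>Ignore me.</p></div>
<div class="story"><p>Third paragraph.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

story_divs = soup.find_all("div", attrs={"class": "story"})

# Flatten the paragraphs of every matching div into one list of strings.
# Passing " " as the separator keeps a space around inline tags such as <b>.
paragraphs = [
    p.get_text(" ", strip=True)
    for div in story_divs
    for p in div.find_all("p")
]

article_text = "\n".join(paragraphs)
print(article_text)
```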

**`TODO`**: Complete the `get_article_data` and `scrape` functions in `scraper.py`. 

**`TODO`**: Complete the `get_parser` function in `scraper.py`