# Part III. Extracting Data from HTML with BeautifulSoup

## The Task
In Part III, we will create a reusable function that extracts the information below from the HTML string of a single quote on [Quotes to Scrape](http://quotes.toscrape.com/).

| **Variable Name** | **Description**                                        |
| :---------------- | :----------------------------------------------------- |
| quote_text        | Text of the quote                                      |
| author            | Name of the author                                     |
| author_url        | URL of the author page, e.g. '/author/Albert-Einstein' |
| tags              | Tags assigned to the quote                             |


## Main Steps

- Import the BeautifulSoup library
- Make the soup
- Locate target elements from the soup
- Extract information from the retrieved elements
- Tidy up the retrieved outputs
- Store the results
- Create a reusable function

## The Sample Quote
We will start by extracting the four target elements from a sample quote defined as `quote` below.

In [None]:
quote = '''
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
  <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
  <span>by <small class="author">Albert Einstein</small>
    <a href="/author/Albert-Einstein">(about)</a>
  </span>
  <div class="tags">Tags:
    <a class="tag" href="/tag/change/page/1/">change</a>
    <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
    <a class="tag" href="/tag/thinking/page/1/">thinking</a>
    <a class="tag" href="/tag/world/page/1/">world</a>
  </div>
</div>
'''

## Step 1. Import BeautifulSoup library

In [None]:
from bs4 import BeautifulSoup

## Step 2. Make the soup

In [None]:
soup = BeautifulSoup(quote, "lxml")  # name the parser explicitly to avoid a parser-guessing warning
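
When no parser is named, BeautifulSoup picks one from whatever is installed, so the parsed tree can differ between machines. The sketch below names Python's built-in `html.parser` and prints the parsed tree. The names `snippet` and `demo_soup` and the shortened HTML are assumptions made for illustration; they are not part of the sample quote.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the sample quote (hypothetical, for illustration only)
snippet = '<div class="quote"><span class="text">Hello</span><small class="author">Someone</small></div>'

# Name the parser explicitly: "html.parser" ships with Python,
# while "lxml" needs a separate install
demo_soup = BeautifulSoup(snippet, "html.parser")

# prettify() returns the parsed tree as an indented string
print(demo_soup.prettify())
```

Printing the prettified soup is a quick way to confirm that the HTML was parsed into the nesting you expect before you start locating elements.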

## Step 3. Locate target elements from the soup
### Locate elements by tag names

| Method | Finding item(s) by <br>tag name, e.g., span | Output<br>type | Remarks                                                    |
| :-------------- | :-------------------------------------- | :---------- | :--------------------------------------------------------- |
| soup.find()     | soup.find('span')                       | Item        | Recommended when you are looking for a single item.          |
| soup.find_all() | soup.find_all('span')                   | List of items        | Recommended when you are looking for a set of items.         |
| soup.select()   | soup.select('span')                     | List of items        | Recommended when you are looking for items by CSS selectors. |

In [None]:
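quote_text = soup.find('span')   # first <span>, which holds the quote text
author = soup.find('small')      # first <small>, which holds the author name
author_url = soup.find('a')      # first <a>, the "(about)" link to the author page
tags = soup.find_all('a')        # every <a>, so this also picks up the "(about)" link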
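
The three methods differ mainly in what they return. The following minimal sketch, using a made-up two-span string (`html` and `demo_soup` are hypothetical names), shows the return value of each call, including what happens when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with two <span> elements and no <table>
html = '<p><span>first</span> <span>second</span></p>'
demo_soup = BeautifulSoup(html, "html.parser")

first_span = demo_soup.find('span')       # a single Tag: the first match only
all_spans = demo_soup.find_all('span')    # a list of every matching Tag
css_spans = demo_soup.select('span')      # also a list, matched with a CSS selector

missing_one = demo_soup.find('table')       # no match: find() returns None
missing_all = demo_soup.find_all('table')   # no match: find_all() returns an empty list

print(first_span.text)
print(len(all_spans), len(css_spans))
print(missing_one, len(missing_all))
```

Because `find()` returns `None` when nothing matches, calling `.text` on its result raises an `AttributeError` in that case, whereas looping over an empty `find_all()` result simply does nothing.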

### Locate elements by tag names and attributes

| Method | Finding item(s) by <br>tag name, e.g., span | Finding item(s) by tag name and attributes, <br>e.g., class="author" |
| :-------------- | :-------------------------------------- | :-------------------------------------------------- |
| soup.find()     | soup.find('span')                       | soup.find('span', {'class': 'author'})              |
| soup.find_all() | soup.find_all('span')                   | soup.find_all('span', {'class': 'author'})          |
| soup.select()   | soup.select('span')                     | soup.select('span.author')                          |

In [None]:
tags = soup.find_all('a', {'class': 'tag'})
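
To see what the attribute filter changes, the sketch below runs the unfiltered and filtered searches side by side on a reduced, made-up version of the quote markup (`html` and `demo_soup` are hypothetical names). It also shows the `class_` keyword form and the CSS form, and how a space inside a CSS selector changes its meaning.

```python
from bs4 import BeautifulSoup

# Reduced, hypothetical version of the quote markup
html = '''
<span>by <small class="author">Someone</small> <a href="/author/Someone">(about)</a></span>
<div class="tags">
  <a class="tag" href="/tag/one/">one</a>
  <a class="tag" href="/tag/two/">two</a>
</div>
'''
demo_soup = BeautifulSoup(html, "html.parser")

all_links = demo_soup.find_all('a')                     # includes the "(about)" link
tag_links = demo_soup.find_all('a', {'class': 'tag'})   # only <a class="tag">
by_keyword = demo_soup.find_all('a', class_='tag')      # same filter, keyword form
by_css = demo_soup.select('a.tag')                      # same filter, CSS form

# In CSS, a space means "descendant of":
no_space = demo_soup.select('span.author')     # <span> elements that have class "author"
with_space = demo_soup.select('span .author')  # class "author" elements inside a <span>

print(len(all_links), len(tag_links))
print(len(no_space), len(with_space))
```

In this markup the author name sits in a `<small>` nested inside a `<span>`, so only the selector with the space finds it.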

## Step 4. Extract information from the elements retrieved
### Retrieve the attribute value of an HTML element
For example, to get the value of the `href` attribute of an element `ele`:<br>
`ele.get('href')`

In [None]:
author_url = author_url.get('href')
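
The behaviour of `get()` on a missing attribute is worth knowing before reusing the code on other pages. A minimal sketch, assuming a one-link HTML string made up for illustration (`demo_soup` and `link` are hypothetical names):

```python
from bs4 import BeautifulSoup

# Hypothetical one-element document
demo_soup = BeautifulSoup('<a href="/author/Someone">(about)</a>', "html.parser")
link = demo_soup.find('a')

href_value = link.get('href')     # value of an existing attribute
title_value = link.get('title')   # missing attribute: get() returns None
all_attributes = link.attrs       # every attribute as a dictionary

try:
    link['title']                 # square-bracket access raises KeyError when missing
    bracket_failed = False
except KeyError:
    bracket_failed = True

print(href_value, title_value, all_attributes, bracket_failed)
```

Using `get()` is the more forgiving choice when some elements may lack the attribute; square brackets make a missing attribute fail loudly.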

### Retrieve the content from an HTML element
For example, to get the text content of an element `ele`:
<br>
`ele.text`

In [None]:
quote_text = quote_text.text
author = author.text

In [None]:
tags_list = []
for tag in tags:
    tags_list.append(tag.text)
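
`.text` returns all the text inside an element, including the text of nested elements and any whitespace between them. The sketch below, on a made-up tags block (`html`, `demo_soup`, and `tags_div` are hypothetical names), contrasts it with `get_text()` and shows a list comprehension that does the same job as the loop above.

```python
from bs4 import BeautifulSoup

# Hypothetical tags block with line breaks and indentation
html = '''
<div class="tags">Tags:
  <a class="tag">one</a>
  <a class="tag">two</a>
</div>
'''
demo_soup = BeautifulSoup(html, "html.parser")
tags_div = demo_soup.find('div')

raw_text = tags_div.text                           # keeps the newlines and indentation
clean_text = tags_div.get_text(' ', strip=True)    # strips each piece, joins with a space

# List comprehension equivalent to the for-loop with append()
tag_names = [tag.text for tag in demo_soup.find_all('a')]

print(repr(raw_text))
print(clean_text)
print(tag_names)
```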

## Step 5. Tidy up the retrieved outputs

1. Prepend the domain 'http://quotes.toscrape.com' to the author URL

In [None]:
author_url = 'http://quotes.toscrape.com' + author_url
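
String concatenation works here because every author link on the page is a path starting with `/`. As an alternative sketch, not the method used in this tutorial, the standard library's `urljoin` handles a few edge cases for you (`base_url` and `relative_url` are hypothetical names):

```python
from urllib.parse import urljoin

base_url = 'http://quotes.toscrape.com'
relative_url = '/author/Albert-Einstein'

# Combine the domain and the relative path
full_url = urljoin(base_url, relative_url)

# A trailing slash on the base does not produce a double slash
with_slash = urljoin(base_url + '/', relative_url)

# A link that is already absolute is returned unchanged
already_absolute = urljoin(base_url, 'http://example.com/page')

print(full_url)
```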

2. Convert the list of tags to a single text string, separating the tags with `;`. Overwrite the variable `tags` with the result.

In [None]:
tags = ';'.join(tags_list)
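
A small standalone sketch of how `join` behaves, using the four tags of the sample quote typed in by hand. The names `example_tags`, `joined`, `restored`, and `empty` are hypothetical.

```python
# The four tags of the sample quote, written out by hand
example_tags = ['change', 'deep-thoughts', 'thinking', 'world']

joined = ';'.join(example_tags)   # the separator goes between items, not after the last one
restored = joined.split(';')      # split() reverses the join
empty = ';'.join([])              # a quote without tags yields an empty string

print(joined)  # change;deep-thoughts;thinking;world
```

Joining with `;` is reversible only as long as no tag contains a `;` itself.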

## Step 6. Store the results in a dictionary

In [None]:
results_dictionary = {
  'quote_text': quote_text,
  'author': author,
  'author_url': author_url,
  'tags': tags
}

## Step 7. Create a reusable function
Don't waste your effort! Make your code reusable for other HTML strings that follow the same pattern by wrapping it in a function.

In [None]:
def get_quote(text):
    # Make the soup
    soup = BeautifulSoup(text, "lxml")
    
    # Locate target elements from the soup
    quote_text = soup.find('span')
    author = soup.find('small')
    author_url = soup.find('a')
    tags = soup.find_all('a', {'class': 'tag'})

    # Extract information from the retrieved elements
    quote_text = quote_text.text
    author = author.text
    author_url = author_url.get('href')
    tags_ls = []
    for tag in tags:
        tags_ls.append(tag.text)

    # Tidy up the retrieved outputs
    author_url = 'http://quotes.toscrape.com' + author_url
    tags = ';'.join(tags_ls)

    # Store the results
    results_dt = {
        'author': author,
        'author_url': author_url,
        'tags': tags,
        'quote_text': quote_text
    }
    
    return results_dt
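
The function above assumes that every element is present and in the same order as in the sample quote; if `soup.find()` returns `None`, the following `.text` call raises an `AttributeError`. The sketch below is one possible defensive variant, not part of the original function. The name `get_quote_safe`, the constant `BASE_URL`, the class-based lookups, the `/author/` selector, and the use of `html.parser` and `urljoin` are all assumptions made for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://quotes.toscrape.com'  # hypothetical constant


def get_quote_safe(text):
    """Hypothetical variant of get_quote that tolerates missing elements."""
    soup = BeautifulSoup(text, "html.parser")

    # Match on class as well as tag name so element order matters less
    quote_el = soup.find('span', {'class': 'text'})
    author_el = soup.find('small', {'class': 'author'})
    link_el = soup.select_one('a[href^="/author/"]')  # href starts with /author/
    tag_els = soup.find_all('a', {'class': 'tag'})

    # Fall back to None (or an empty string for tags) when something is missing
    return {
        'author': author_el.get_text(strip=True) if author_el is not None else None,
        'author_url': urljoin(BASE_URL, link_el['href']) if link_el is not None else None,
        'tags': ';'.join(t.get_text(strip=True) for t in tag_els),
        'quote_text': quote_el.get_text(strip=True) if quote_el is not None else None,
    }


# A made-up quote that lacks the author, the link, and the tags
incomplete = '<div class="quote"><span class="text">Only text here</span></div>'
print(get_quote_safe(incomplete))
```

Returning `None` for missing fields keeps a scraping loop running over irregular pages, at the cost of having to check for `None` later.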

## Test the function

In [None]:
quote_test = '''
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
  <span class="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
  <span>by <small class="author">J.K. Rowling</small>
    <a href="/author/J-K-Rowling">(about)</a>
  </span>
  <div class="tags">
    Tags:
    <a class="tag" href="/tag/abilities/page/1/">abilities</a>
    <a class="tag" href="/tag/choices/page/1/">choices</a>
  </div>
</div>
'''