# Soups

`soups` is a module in the `scrape` package that provides helper functions for working with the Beautiful Soup library.

Beautiful Soup is a Python library for extracting data from HTML and XML files. For more information, see the [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).


# Initialization

The following code imports `soups`. It adds the directory two levels above the working directory to `sys.path` and assumes that this directory contains the `scrape` package.

In [1]:
import os
import sys
CURR_DIR = os.path.dirname(os.path.abspath('..'))
print('Current dir: ' + CURR_DIR)
sys.path.append(CURR_DIR)
from scrape import soups
from bs4 import BeautifulSoup

Current dir: D:\Projects\Python\projects\scrape
Initializing scrape ...


# Working with soups

### Importing and exporting from HTML to file
The function `read` reads an HTML page from a file, and `write` writes a page to a file.

In [9]:
filename = r'testpage.html'
bpage = soups.read(filename)
# soups.write(filename_out, bpage)  # left commented out: filename_out is not defined here
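
The internals of `soups.read` and `soups.write` are not shown in this notebook. As a rough sketch of the same round trip using Beautiful Soup directly, the following assumes UTF-8 files and the built-in `html.parser`. The helper names `read_html` and `write_html` are hypothetical and are not part of `soups`.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def read_html(filename):
    """Parse an HTML file into a BeautifulSoup object (hypothetical helper)."""
    with open(filename, encoding="utf-8") as handle:
        return BeautifulSoup(handle, "html.parser")


def write_html(filename, soup):
    """Serialize a soup back to an HTML file (hypothetical helper)."""
    with open(filename, "w", encoding="utf-8") as handle:
        handle.write(str(soup))


# Round trip through a temporary directory so the sketch touches no real data.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.html")
    write_html(path, BeautifulSoup("<p id='a'>Hello</p>", "html.parser"))
    round_trip = read_html(path)

print(round_trip.p.string)
```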

### Importing and exporting from HTML to BS4
In this example we retrieve information from a small HTML text, similar to the example used in the Beautiful Soup documentation. The function `get_soup` converts an HTML page to a BS4 object, and `get_page` converts the soup back to HTML.

In [2]:
page = """<html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

soup = soups.get_soup(page)
print(type(soup))

apage = soups.get_page(soup)
print(type(apage))

<class 'bs4.BeautifulSoup'>
<class 'str'>
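
For comparison, the same two conversions can be done with Beautiful Soup alone. This is a minimal sketch, assuming the built-in `html.parser`; which parser `get_soup` uses internally is not stated here, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html_text = '<p class="title"><b>The Dormouse\'s story</b></p>'

# Text -> soup: parse with the parser that ships with Python.
example_soup = BeautifulSoup(html_text, "html.parser")

# Soup -> text: str() serializes the tree back to markup.
html_again = str(example_soup)

print(type(example_soup).__name__, type(html_again).__name__)
```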


### Retrieving strings and links from the soup
The function `get_strings` retrieves a list with all strings from `soup`.

In [3]:
alist = soups.get_strings(soup)
[astr for astr in alist]

["The Dormouse's story",
 "The Dormouse's story",
 'Once upon a time there were three little sisters; and their names were',
 'Elsie',
 ',',
 'Lacie',
 'and',
 'Tillie',
 ';\n    and they lived at the bottom of a well.',
 '...']
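
The output above resembles what Beautiful Soup's own `stripped_strings` generator produces. Whether `get_strings` is built on it is an assumption; the sketch below only shows the plain BS4 call on a shortened, illustrative snippet.

```python
from bs4 import BeautifulSoup

snippet = """<p class="story">Once upon a time
    <a href="http://example.com/elsie" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" id="link2">Lacie</a>
</p>"""

example_soup = BeautifulSoup(snippet, "html.parser")

# stripped_strings yields every text node with surrounding whitespace removed
# and skips nodes that consist of whitespace only.
strings = list(example_soup.stripped_strings)
print(strings)
```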

The function `get_hrefs` retrieves a list with all links (the values of the `href` attributes) from `soup`.

In [4]:
alist = soups.get_hrefs(soup)
[astr for astr in alist]

['http://example.com/elsie',
 'http://example.com/lacie',
 'http://example.com/tillie']
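
A comparable list can be built with plain Beautiful Soup calls. The sketch below is not the implementation of `get_hrefs`; it assumes that only `<a>` tags carrying an `href` attribute are of interest and uses an illustrative snippet.

```python
from bs4 import BeautifulSoup

snippet = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a name="top">Anchor without a link</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
"""

example_soup = BeautifulSoup(snippet, "html.parser")

# href=True restricts the search to tags that actually have an href attribute.
hrefs = [a["href"] for a in example_soup.find_all("a", href=True)]
print(hrefs)
```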

### Retrieving information from Tags and Results

The functions `get_strings` and `get_hrefs` also accept the BS4 objects `Tag` and `ResultSet`. For a `ResultSet`, the output below shows that `get_hrefs` returns one list per tag.

**ResultSet**

In [5]:
# Get strings and hrefs from a ResultSet with all <a> tags
results = soup.find_all("a")
alist = soups.get_strings(results)
print(type(results).__name__)
[astr for astr in alist]
hlist = soups.get_hrefs(results)
[hstr for hstr in hlist]

ResultSet


[['http://example.com/elsie'],
 ['http://example.com/lacie'],
 ['http://example.com/tillie']]

**Tag**

In [6]:
# Get the string and the href from the tag with id="link1"
tag = soup.find(id="link1")
print(type(tag).__name__)
alist = soups.get_strings(tag)
print(alist)
hlist = soups.get_hrefs(tag)
print(hlist)

Tag
['Elsie']
['http://example.com/elsie']
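
One way to support a soup, a `Tag`, and a `ResultSet` in a single function is to branch on the argument type. The helper `hrefs_of` below is hypothetical and only mimics the output shapes shown above (a flat list for a tag, one list per tag for a result set); it is not taken from `soups`.

```python
from bs4 import BeautifulSoup, Tag


def hrefs_of(obj):
    """Hypothetical helper: hrefs for a soup, a Tag, or an iterable of tags."""
    if isinstance(obj, Tag):  # BeautifulSoup is a subclass of Tag
        # Include the tag's own href, then the hrefs of its descendants.
        found = [obj["href"]] if obj.has_attr("href") else []
        found += [a["href"] for a in obj.find_all("a", href=True)]
        return found
    # Anything else is treated as an iterable of tags, such as a ResultSet.
    return [hrefs_of(tag) for tag in obj]


example_soup = BeautifulSoup(
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>',
    "html.parser",
)

print(hrefs_of(example_soup.find(id="link1")))
print(hrefs_of(example_soup.find_all("a")))
```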


## Extracting a table

TODO

# Editing a soup

### Splitting and merging a soup
The function `split` returns two objects: `meta`, a Beautiful Soup object that contains all HTML outside `<body>` plus an empty `<body>`, and `body`, a tag that contains the `<body>` element. Both are deep copies, so changing them leaves the original soup untouched.

In [7]:
meta, body = soups.split(soup)
print('Meta:')
print(meta)
print('\nBody:')
print(body)

Meta:
<html><head><title>The Dormouse's story</title></head>
<body></body></html>

Body:
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
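
The behaviour described above (a copy of everything outside `<body>`, a detached `<body>`, and a way to put them back together) can be sketched with plain Beautiful Soup calls. The functions `split_soup` and `merge_soup` are hypothetical stand-ins, not the `soups` implementation, and re-parsing the serialized markup is just one simple way to obtain independent copies.

```python
from bs4 import BeautifulSoup

PARSER = "html.parser"


def split_soup(soup):
    """Hypothetical stand-in for split: return (meta, body) as copies."""
    meta = BeautifulSoup(str(soup), PARSER)   # re-parsing yields an independent copy
    body = meta.body.extract()                # detach <body> from the copy
    meta.html.append(meta.new_tag("body"))    # leave an empty <body> behind
    return meta, body


def merge_soup(meta, body):
    """Hypothetical stand-in for merge: put body back into a copy of meta."""
    merged = BeautifulSoup(str(meta), PARSER)
    body_copy = BeautifulSoup(str(body), PARSER).body.extract()
    merged.body.replace_with(body_copy)       # swap the empty <body> for the real one
    return merged


original = BeautifulSoup(
    "<html><head><title>Story</title></head>"
    '<body><a id="link1">Elsie</a></body></html>',
    PARSER,
)
meta_part, body_part = split_soup(original)
body_part.find(id="link1").string = "ELSIE"   # edit the copy only
merged = merge_soup(meta_part, body_part)

print(original.a.string)
print(merged.a.string)
```

Because both helpers work on copies, the edit to `body_part` shows up in `merged` while `original` keeps its text.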


In [8]:
# Change the copies returned by split, then merge them into a new soup
meta.title.string = meta.title.string.upper()
atag = body.find(id="link1")
atag.string = atag.string.upper()

new_soup = soups.merge(meta,body)
print('Old Soup:')
print(soup)

print('\nNew Soup:')
print(new_soup)

Old Soup:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

New Soup:
<html><head><title>THE DORMOUSE'S STORY</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">ELSIE</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the b