## Processing Website Content

We can process the page content with BeautifulSoup to extract HTML tags as well as the data they contain.
* We pass the content along with the parser name `html.parser` to build the BeautifulSoup object.
* Let us prettify and print the content.

In [None]:
import requests

python_base_url = 'https://python.itversity.com'
python_url = f'{python_base_url}/mastering-python.html'
python_page = requests.get(python_url)

from bs4 import BeautifulSoup

soup = BeautifulSoup(python_page.content, 'html.parser')

In [None]:
print(soup.prettify())
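
To see what the parser produces without depending on the live site, here is a minimal sketch that parses a small inline HTML string. The `sample_html` string and its contents are made up for illustration and are not taken from the page above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
sample_html = '<html><head><title>Demo</title></head><body><p>Hello</p></body></html>'

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# prettify() returns a str with the markup re-indented by nesting level.
pretty = sample_soup.prettify()
print(pretty)

# Tags can be reached as attributes of the soup object.
print(sample_soup.title.string)
print(sample_soup.p.string)
```

The same attribute-style access should work on the `soup` object built from the downloaded page.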

* Let us extract all the `a` tags, which hold the links provided on this page.
* Here is the code snippet to get the `a` tags from the landing page.

In [None]:
for a in soup.find_all('a'):
    print(a)

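* We can use `a.string` to get only the text inside each tag. It returns `None` when a tag has more than one child, whereas `a.get_text()` returns the combined text of all its descendants.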

In [None]:
for a in soup.find_all('a'):
    print(a.string)

In [None]:
for a in soup.find_all('a'):
    print(a.get_text())
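
The difference between `.string` and `get_text()` is easiest to see on a tag with nested children. This is a small sketch on made-up markup; `sample_html` and its links are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with plain text, one with a nested tag.
sample_html = '<a href="#">Plain</a><a href="/x"><b>Bold</b> text</a>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

plain, nested = sample_soup.find_all('a')

# .string works when the tag has a single child.
print(plain.string)
# .string is None when the tag has more than one child.
print(nested.string)
# get_text() joins the text of all descendants.
print(nested.get_text())
```

When the link text may contain nested markup, `get_text()` is usually the safer choice.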

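* We can also get the URLs used as part of these `a` tags. Indexing with `a['href']` raises a `KeyError` for any `a` tag that has no `href` attribute, so it is safer to check with `a.get('href')` first.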

In [None]:
for a in soup.find_all('a'):
    print(a['href'])

In [None]:
for a in soup.find_all('a'):
    if a.get('href'):
        print(a['href'])

* We can also filter on attributes such as `class` and `id` to access specific HTML elements.

In [35]:
for a in soup.find_all('a'):
    if a.get('class'):
        print(a['class'])

['navbar-brand', 'text-wrap']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['reference', 'internal']
['dropdown-buttons']
['repository-button']
['issues-button']
['edit-button']
['full-screen-button']
['binder-button']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['reference', 'internal', 'nav-link']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['headerlink']
['headerlink']

In [39]:
soup.find('a', {'class': 'reference internal'})

<a class="reference internal" href="#">
   Mastering Python
  </a>

In [41]:
soup.find('a', class_='reference internal')

<a class="reference internal" href="#">
   Mastering Python
  </a>
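
Because `class` is a multi-valued attribute, the string passed to `class_` can match in two ways: a single class name matches any tag carrying that class, while a space-separated string matches only tags whose `class` attribute is exactly that string. The sketch below illustrates this on made-up markup that mirrors the class combinations printed earlier; the `select` call with a CSS selector is an alternative not used elsewhere in this section.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the class combinations printed earlier.
sample_html = '''
<a class="reference internal" href="#">One</a>
<a class="reference internal nav-link" href="#two">Two</a>
<a class="headerlink" href="#three">Three</a>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Exact string match on the whole class attribute: only "One".
exact = sample_soup.find_all('a', class_='reference internal')
# Single class name: any tag that carries it, so "One" and "Two".
single = sample_soup.find_all('a', class_='reference')
# CSS selector: tags carrying both classes in any order, so "One" and "Two".
css = sample_soup.select('a.reference.internal')

print([a.get_text() for a in exact])
print([a.get_text() for a in single])
print([a.get_text() for a in css])
```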

In [37]:
for a in soup.find_all('a', {'class': 'reference internal'}):
    if a.get('href'):
        print(a['href'])

#
01_overview_of_windows_os/01_overview_of_windows_os.html
04_postgres_database_operations/01_postgres_database_operations.html
05_getting_started_with_python/01_getting_started_with_python.html
06_basic_programming_constructs/01_basic_programming_constructs.html
07_pre_defined_functions/01_pre_defined_functions.html
08_user_defined_functions/01_user_defined_functions.html
09_overview_of_collections_list_and_set/01_overview_of_collections_list_and_set.html
10_overview_of_collections_dict_and_tuple/01_overview_of_collections_dict_and_tuple.html
11_manipulating_collections_using_loops/01_manipulating_collections_using_loops.html
12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
13_understanding_map_reduce_libraries/01_understanding_map_reduce_libraries.html
14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
15_overview_of_pandas_libraries/01_overview_of_pandas_libraries.html
16_web_scraping_using_beautifulsoup/01_web_scraping
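
The `href` values printed above are relative to the page they came from. One way to turn them into absolute URLs is `urljoin` from the standard library; the sketch below assumes the page URL defined at the start of this section and uses two of the paths from the output.

```python
from urllib.parse import urljoin

python_base_url = 'https://python.itversity.com'
python_url = f'{python_base_url}/mastering-python.html'

# Relative hrefs copied from the output above.
hrefs = [
    '01_overview_of_windows_os/01_overview_of_windows_os.html',
    '04_postgres_database_operations/01_postgres_database_operations.html',
]

# urljoin resolves each relative path against the URL of the page.
absolute_urls = [urljoin(python_url, href) for href in hrefs]
for url in absolute_urls:
    print(url)
```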

In [42]:
for a in soup.find_all('a'):
    if a.get('id'):
        print(a['id'])

next-link


In [43]:
soup.find('a', {'id': 'next-link'})

<a class="right-next" href="01_overview_of_windows_os/01_overview_of_windows_os.html" id="next-link" title="next page">Overview of Windows Operating System</a>

In [44]:
soup.find('a', id='next-link')

<a class="right-next" href="01_overview_of_windows_os/01_overview_of_windows_os.html" id="next-link" title="next page">Overview of Windows Operating System</a>
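
Once a tag has been located by `id`, its other attributes can be read with the same dictionary-style access. Note that `find` returns `None` when nothing matches, so a check before indexing avoids an error. The sketch below reuses the tag shown in the output above; the `prev-link` id is a hypothetical value chosen to demonstrate the no-match case.

```python
from bs4 import BeautifulSoup

# Markup copied from the output above.
sample_html = (
    '<a class="right-next" '
    'href="01_overview_of_windows_os/01_overview_of_windows_os.html" '
    'id="next-link" title="next page">Overview of Windows Operating System</a>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

next_link = sample_soup.find('a', id='next-link')
# Hypothetical id that does not exist in the markup above.
missing = sample_soup.find('a', id='prev-link')

if next_link is not None:
    print(next_link['href'])
    print(next_link['title'])
    print(next_link.get_text())

# find returns None when there is no matching tag.
print(missing)
```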