## Parsing HTML using BeautifulSoup

Now that we have created a BeautifulSoup object, let us explore the APIs and methods used to scrape content from HTML. We will work through the following examples.

* Accessing the first occurrence of `tr`
* Accessing the first `th` value using the attribute `string` or the method `get_text()`
* Accessing the first occurrence of an anchor tag
* Getting the URL from the `href` attribute of an anchor tag
* Accessing the text of an anchor tag
* Getting all anchor tags
* Getting all `td` tags
* Getting the values from all `td` tags
* Getting the values and URLs from anchor tags as a list of dicts

In [1]:
%run 03_overview_of_beautifulsoup.ipynb

Details,URL
Video Content,YouTube Channel
Reference Material,GitHub Repository


<table>
 <tbody>
  <tr>
   <th>
    Details
   </th>
   <th>
    URL
   </th>
  </tr>
  <tr>
   <td>
    Video Content
   </td>
   <td>
    <a href="https://www.youtube.com/itversityin">
     YouTube Channel
    </a>
   </td>
  </tr>
  <tr>
   <td>
    Reference Material
   </td>
   <td>
    <a href="https://www.github.com/dgadiraju/itversity-books">
     GitHub Repository
    </a>
   </td>
  </tr>
 </tbody>
</table>
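The `soup` object used below comes from the notebook loaded with `%run`. For readers who want to follow along without that notebook, here is a minimal sketch that rebuilds an equivalent object. The HTML string is reconstructed from the output shown above, and the choice of the built-in `html.parser` is an assumption; the original notebook may use a different parser.

```python
from bs4 import BeautifulSoup

# HTML reconstructed from the output above (assumed to match the original source)
html = """
<table>
<tbody>
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
</tbody>
</table>
"""

# "html.parser" ships with Python, so no extra parser needs to be installed
soup = BeautifulSoup(html, "html.parser")

# Dotted access returns the first matching descendant at each step
first_row = soup.table.tbody.tr
headers = [th.string for th in first_row.find_all("th")]
print(headers)  # ['Details', 'URL']
```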


* Accessing first occurrence of `tr`

In [6]:
soup.table

<table>
<tbody>
<tr>
<th>Details</th>
<th>URL</th>
</tr>
<tr>
<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
</tr>
<tr>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
</tr>
</tbody>
</table>

In [5]:
soup.table.tbody.tr

<tr>
<th>Details</th>
<th>URL</th>
</tr>

* Accessing the first `th` value using the attribute `string` or the method `get_text()`

In [76]:
soup.table.tbody.tr.th.string

'Details'

In [85]:
soup.table.tbody.tr.th.get_text()

'Details'

* Accessing first occurrence of anchor tag

In [8]:
soup.table.tbody.a

<a href="https://www.youtube.com/itversityin">YouTube Channel</a>

* Getting the url from `href` attribute of anchor tag

In [9]:
soup.table.tbody.a['href']

'https://www.youtube.com/itversityin'
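Indexing a tag with `['href']` raises a `KeyError` when the attribute is missing. The sketch below uses a small made-up snippet (the second anchor without `href` is an assumption added for illustration) to show `Tag.get`, which returns `None` instead.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second anchor deliberately has no href attribute
html = (
    '<p><a href="https://www.youtube.com/itversityin">YouTube Channel</a>'
    "<a>No link</a></p>"
)
demo = BeautifulSoup(html, "html.parser")

# get() behaves like dict.get(): None when the attribute is absent
urls = [a.get("href") for a in demo.find_all("a")]
print(urls)  # ['https://www.youtube.com/itversityin', None]

# Square brackets raise KeyError for the anchor without href
try:
    demo.find_all("a")[1]["href"]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```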

* Accessing the value of anchor tag.

In [10]:
soup.table.tbody.a.string

'YouTube Channel'

* Get all anchor tags

In [11]:
soup.table.tbody.find_all('a')

[<a href="https://www.youtube.com/itversityin">YouTube Channel</a>,
 <a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>]

* Get all `td` tags

In [12]:
for td in soup.find_all('td'):
    print(td)

<td>Video Content</td>
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
<td>Reference Material</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>


* Get value from all `td` tags.

In [13]:
# string returns None when a tag has more than one child, for example an anchor tag followed by a new line
for td in soup.find_all('td'):
    print(td.string)

Video Content
None
Reference Material
None
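The `None` values appear because `string` is only defined when a tag has exactly one child. The sketch below inspects one such cell on its own; the snippet is a cut-down copy of the second `td` above, with a placeholder `href`.

```python
from bs4 import BeautifulSoup

# One cell copied from the table: an anchor followed by a new line
demo = BeautifulSoup("<td><a href='#'>YouTube Channel</a>\n</td>", "html.parser")
td = demo.td

# Two children: the <a> tag and the trailing '\n' text node
print(len(td.contents))     # 2
print(td.string)            # None, because there is more than one child
print(td.a.string)          # YouTube Channel
print(repr(td.get_text()))  # 'YouTube Channel\n'
```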


In [14]:
# get_text() returns all the text inside the tag, including the text of nested tags and new lines
for td in soup.find_all('td'):
    print(td.get_text())

Video Content
YouTube Channel

Reference Material
GitHub Repository



In [15]:
# Strip the trailing new line characters
for td in soup.find_all('td'):
    print(td.get_text().rstrip('\n'))

Video Content
YouTube Channel
Reference Material
GitHub Repository
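As an alternative to calling `rstrip` on the result, `get_text` accepts a `strip=True` argument that trims whitespace from each piece of text. A minimal sketch, using a shortened copy of the table with placeholder `href` values:

```python
from bs4 import BeautifulSoup

# Shortened copy of the table; the href values are placeholders
html = (
    "<table><tr><td>Video Content</td>"
    "<td><a href='#'>YouTube Channel</a>\n</td></tr></table>"
)
demo = BeautifulSoup(html, "html.parser")

# strip=True removes leading and trailing whitespace from each text node
values = [td.get_text(strip=True) for td in demo.find_all("td")]
print(values)  # ['Video Content', 'YouTube Channel']
```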


* Get values and URLs from anchor tags as a list of dicts

itversity_details = []
for a in soup.find_all('a'):
    rec = {'description': a.string, 'url': a['href']}
    itversity_details.append(rec)

itversity_details
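The same loop can be written as a self-contained function. This is only a sketch: the name `extract_links` is hypothetical, and passing `href=True` to `find_all` (which skips anchors without an `href`) is an addition that the loop above does not have.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return a list of dicts with the text and URL of each anchor tag."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only the anchors that actually have an href attribute
    return [
        {"description": a.get_text(strip=True), "url": a["href"]}
        for a in soup.find_all("a", href=True)
    ]


html = """
<td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
</td>
<td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
</td>
"""

details = extract_links(html)
for rec in details:
    print(rec)
# {'description': 'YouTube Channel', 'url': 'https://www.youtube.com/itversityin'}
# {'description': 'GitHub Repository', 'url': 'https://www.github.com/dgadiraju/itversity-books'}
```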