# Working with Parsed Data in BeautifulSoup

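1. Parsing data: pass an HTML or XML document to the `BeautifulSoup()` constructor. The constructor converts the document to Unicode and then parses it with an HTML parser. If you do not name a parser, Beautiful Soup picks the best one installed (lxml, then html5lib, then Python's built-in `html.parser`).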
2. Getting data from a parse tree 
3. Searching and retrieving data from a parse tree 

### Searching and Retrieving Data 

The **find_all()** method 

- Searches a tag and its descendants, and returns every tag or string that matches your filter
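
As a minimal sketch of "a tag and its descendants", the snippet below calls `find_all()` once on the whole tree and once on a single tag. The inline HTML and the names `sample_html`, `sample_soup`, and `menu` are made up for illustration. The `html.parser` backend is used here only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, not the page used in this notebook.
sample_html = """
<div id="outer">
  <ul id="menu"><li>Home</li><li>About</li></ul>
  <ul id="footer"><li>Contact</li></ul>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Called on the soup object, find_all() searches the whole document.
all_items = sample_soup.find_all("li")

# Called on a single tag, it searches only that tag's descendants.
menu = sample_soup.find("ul", id="menu")
menu_items = menu.find_all("li")

print(len(all_items))                         # 3
print([li.get_text() for li in menu_items])   # ['Home', 'About']
```

Narrowing the search to a sub-tree first is often easier than writing one complicated filter for the whole document.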

## Data Parsing

In [4]:
from bs4 import BeautifulSoup
import urllib.request
import re

In [5]:
with urllib.request.urlopen('https://raw.githubusercontent.com/BigDataGal/Data-Mania-Demos/master/IoT-2018.html') as response:
    html = response.read()

In [7]:
soup = BeautifulSoup(html, 'lxml')
type(soup)

bs4.BeautifulSoup

### Parsing Your Data

In [9]:
print(soup.prettify()[0:100])

<html>
 <head>
  <title>
   IoT Articles
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    


### Getting Data from a Parse Tree

In [17]:
text_only = soup.get_text()
print(text_only[0:240])

IoT Articles

2018 Trends: Best New IoT Device Ideas for Data Scientists and Engineers
It’s almost 2018 and IoT is on the cusp of an explosive expansion. In this article, I offer you a listing of new IoT device ideas that you can use...
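
The text returned by `get_text()` keeps the whitespace of the original markup, which is why the output above contains blank lines. A small sketch of the `separator` and `strip` arguments follows. It uses a made-up inline snippet and `html.parser` rather than the notebook's page.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, used only for illustration.
sample_html = "<div><h1> Title </h1><p>First <b>bold</b> paragraph.</p><p>Second.</p></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Default: all strings are concatenated exactly as they appear.
raw_text = sample_soup.get_text()

# strip=True trims each string and drops empty ones;
# separator is placed between the remaining pieces.
clean_text = sample_soup.get_text(separator=" ", strip=True)

print(repr(raw_text))    # ' Title First bold paragraph.Second.'
print(repr(clean_text))  # 'Title First bold paragraph. Second.'
```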




## Searching & Retrieving Data from a Parse Tree

### Retrieving Tags by Filtering with Name Arguments

In [21]:
soup.find_all('li')[0:5]

[<li><strong>Big Data</strong> &amp; Data Engineering: Sensors that are embedded within IoT devices spin off machine-generated data like it’s going out of style. For IoT to function, the platform must be solidly engineered to handle big data. Be assured, that requires some serious data engineering.</li>,
 <li><strong>Machine Learning</strong> Data Science: While a lot of IoT devices are still operated according to rules-based decision criteria, the age of artificial intelligence is upon us. IoT will increasingly depend on machine learning algorithms to control device operations so that devices are able to autonomously respond to a complex set of overlapping stimuli.</li>,
 <li><strong>Blockchain</strong>-Enabled Security: Above all else, IoT networks must be secure. Blockchain technology is primed to meet the security demands that come along with building and expanding the IoT.</li>,
 <li>Enable built-in sensing to build a weather station that measures ambient temperature and humidity<

### Retrieving Tags by Filtering with Keyword Arguments

In [24]:
soup.find_all(id="link 7")

[<a class="preview" href="http://www.skyfilabs.com/iot-online-courses" id="link 7">SkyFi</a>]

In [26]:
soup.find_all('a', {'class': 'preview'})

[<a class="preview" href="http://bit.ly/LPlNDJj" id="link 1">Last month Ericsson Digital invited me</a>,
 <a class="preview" href="http://www.data-mania.com/blog/m2m-vs-iot/" id="link 2">IoT</a>,
 <a class="preview" href="bit.ly/LPlNDJj" id="link 3"><img alt="Get your new iot device ideas here" class="aligncenter size-full wp-image-3802" height="683" src="http://www.data-mania.com/blog/wp-content/uploads/2017/12/new-IoT-device-ideas.jpg" width="1024"/></a>,
 <a class="preview" href="http://mat.se/" id="link 4">Mat.se</a>,
 <a class="preview" href="http://bit.ly/LPlNDJj" id="link 5">watch the videos on this page</a>,
 <a class="preview" href="https://click.linksynergy.com/deeplink?id=*JDLXjeE*wk&amp;mid=39197&amp;murl=https%3A%2F%2Fwww.udemy.com%2Ftopic%2Finternet-of-things%2F%3Fsort%3Dhighest-rated" id="link 6">IoT courses on Udemy</a>,
 <a class="preview" href="http://www.skyfilabs.com/iot-online-courses" id="link 7">SkyFi</a>,
 <a class="preview" href="https://www.coursera.org/specia

In [28]:
soup.find_all('a', {'id': 'link 1'})

[<a class="preview" href="http://bit.ly/LPlNDJj" id="link 1">Last month Ericsson Digital invited me</a>]
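
The cells above mix two styles: a keyword argument (`id="link 7"`) and an attribute dictionary passed as the second positional argument. The sketch below compares them on a made-up snippet, and also shows the `class_` keyword, which is needed because `class` is a reserved word in Python. The sample markup and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document modelled loosely on the links above.
sample_html = """
<a class="preview" href="http://example.com/1" id="link 1">One</a>
<a class="other" href="http://example.com/2" id="link 2">Two</a>
<a class="preview big" href="http://example.com/3" id="link 3">Three</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Attribute dictionary as the second positional argument.
by_dict = sample_soup.find_all("a", {"class": "preview"})

# Same filter as a keyword argument; note the trailing underscore.
by_keyword = sample_soup.find_all("a", class_="preview")

# Any other attribute name can be used directly as a keyword.
by_id = sample_soup.find_all(id="link 1")

print([a["id"] for a in by_dict])     # ['link 1', 'link 3']
print([a["id"] for a in by_keyword])  # ['link 1', 'link 3']
print([a.get_text() for a in by_id])  # ['One']
```

A class filter matches a tag if any one of its classes matches, which is why the tag with `class="preview big"` is included.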

### Retrieving Tags by Filtering with String Arguments

In [31]:
soup.find_all('ol')[0]

<ol>
<li><strong>Big Data</strong> &amp; Data Engineering: Sensors that are embedded within IoT devices spin off machine-generated data like it’s going out of style. For IoT to function, the platform must be solidly engineered to handle big data. Be assured, that requires some serious data engineering.</li>
<li><strong>Machine Learning</strong> Data Science: While a lot of IoT devices are still operated according to rules-based decision criteria, the age of artificial intelligence is upon us. IoT will increasingly depend on machine learning algorithms to control device operations so that devices are able to autonomously respond to a complex set of overlapping stimuli.</li>
<li><strong>Blockchain</strong>-Enabled Security: Above all else, IoT networks must be secure. Blockchain technology is primed to meet the security demands that come along with building and expanding the IoT.</li>
</ol>

### Retrieving Tags by Filtering with List Objects

In [34]:
soup.find_all(['ol', 'b'])[0:2]

[<b>2018 Trends: Best New IoT Device Ideas for Data Scientists and Engineers</b>,
 <ol>
 <li><strong>Big Data</strong> &amp; Data Engineering: Sensors that are embedded within IoT devices spin off machine-generated data like it’s going out of style. For IoT to function, the platform must be solidly engineered to handle big data. Be assured, that requires some serious data engineering.</li>
 <li><strong>Machine Learning</strong> Data Science: While a lot of IoT devices are still operated according to rules-based decision criteria, the age of artificial intelligence is upon us. IoT will increasingly depend on machine learning algorithms to control device operations so that devices are able to autonomously respond to a complex set of overlapping stimuli.</li>
 <li><strong>Blockchain</strong>-Enabled Security: Above all else, IoT networks must be secure. Blockchain technology is primed to meet the security demands that come along with building and expanding the IoT.</li>
 </ol>]

### Retrieving Tags by Filtering with Regular Expressions

In [37]:
t = re.compile('t')
for tag in soup.find_all(t):
    print(tag.name)

html
title
strong
strong
strong
strong
strong
strong
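
The pattern `'t'` matches any tag name that contains a `t`, because Beautiful Soup tests the name with the pattern's `search()` method. That is why `html` and `strong` appear in the output. A minimal sketch of the difference between an unanchored and an anchored pattern follows, using a made-up document and `html.parser`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical inline document, used only for illustration.
sample_html = (
    "<html><head><title>T</title></head><body>"
    "<table><tr><td>x</td></tr></table><strong>s</strong>"
    "</body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Unanchored: matches names that contain "t" anywhere.
contains_t = [tag.name for tag in sample_soup.find_all(re.compile("t"))]

# Anchored with ^: matches only names that start with "t".
starts_with_t = [tag.name for tag in sample_soup.find_all(re.compile("^t"))]

print(contains_t)     # ['html', 'title', 'table', 'tr', 'td', 'strong']
print(starts_with_t)  # ['title', 'table', 'tr', 'td']
```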


### Retrieving Tags by Filtering with Boolean Value

In [41]:
for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
p
b
p
br
br
h1
span
strong
a
a
a
img
a
span
strong
a
h1
ol
li
strong
li
strong
li
strong
h1
a
a
a
h2
ol
li
li
li
li
li
li
h2
ol
li
li
li
li
li
a
img
h2
ol
li
li
li
li
li
h2
ol
li
li
li
li
span
strong
a
em
p


### Retrieving Weblinks by Filtering with String Object

In [46]:
for link in soup.find_all('a'):
    # link.get('href') returns None when the attribute is missing,
    # whereas link.attrs['href'] would raise a KeyError.
    print(link.get('href'))

http://bit.ly/LPlNDJj
http://www.data-mania.com/blog/m2m-vs-iot/
bit.ly/LPlNDJj
http://mat.se/
http://bit.ly/LPlNDJj
https://click.linksynergy.com/deeplink?id=*JDLXjeE*wk&mid=39197&murl=https%3A%2F%2Fwww.udemy.com%2Ftopic%2Finternet-of-things%2F%3Fsort%3Dhighest-rated
http://www.skyfilabs.com/iot-online-courses
https://www.coursera.org/specializations/iot
bit.ly/LPlNDJj
http://bit.ly/LPlNDJj
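
Not every `<a>` tag carries an `href`. The sketch below, on a made-up snippet, shows two ways of coping with that: `get()`, which yields `None` for a missing attribute, and the `href=True` filter, which skips such tags altogether. The sample markup and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, used only for illustration.
sample_html = """
<a href="http://example.com/a">With href</a>
<a name="anchor-only">No href</a>
<a href="example.com/b">Scheme-less href</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# get() never raises; a missing attribute comes back as None.
hrefs_with_get = [link.get("href") for link in sample_soup.find_all("a")]

# href=True keeps only tags that define the attribute,
# so indexing with ["href"] is safe here.
hrefs_present = [link["href"] for link in sample_soup.find_all("a", href=True)]

print(hrefs_with_get)  # ['http://example.com/a', None, 'example.com/b']
print(hrefs_present)   # ['http://example.com/a', 'example.com/b']
```

Values are returned exactly as written in the markup, so a scheme-less link such as `bit.ly/LPlNDJj` in the output above stays scheme-less.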


### Retrieving Strings by Filtering with Regular Expressions

In [49]:
soup.find_all(string=re.compile('data'))

[' & Data Engineering: Sensors that are embedded within IoT devices spin off machine-generated data like it’s going out of style. For IoT to function, the platform must be solidly engineered to handle big data. Be assured, that requires some serious data engineering.']
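
The `string` filter returns the matching text nodes rather than tags, and the pattern is case-sensitive. In the result above, the node was matched through its lowercase occurrences of "data". A small sketch on a made-up snippet follows, showing a case-insensitive pattern and how to get from a matched string back to its enclosing tag via `.parent`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical inline document, used only for illustration.
sample_html = (
    "<ul><li><b>Big Data</b> matters</li>"
    "<li>raw data here</li><li>nothing</li></ul>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Case-sensitive: "Big Data" does not match the pattern "data".
case_sensitive = sample_soup.find_all(string=re.compile("data"))

# re.IGNORECASE widens the match to "Data" as well.
case_insensitive = sample_soup.find_all(string=re.compile("data", re.IGNORECASE))

# Each result is a text node; .parent is the tag that directly contains it.
parent_names = [s.parent.name for s in case_insensitive]

print(case_sensitive)    # ['raw data here']
print(case_insensitive)  # ['Big Data', 'raw data here']
print(parent_names)      # ['b', 'li']
```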