## Chapter 6 - Data sourcing from the web

### Segment 1 - Accessing HTML elements

In [1]:
import sys
import requests
from bs4 import BeautifulSoup


In [2]:
# Download the page and parse it with Python's built-in HTML parser.
url = "https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_summarizingdata/bs704_summarizingdata7.html"
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

# Preview the first 300 characters of the indented markup.
print(soup.prettify()[:300])


<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="SoftChalk Create 9.02.10" name="generator"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <!-- This file is created by SoftChalk LessonBuilder -->
  <!-- LessonBuilder Version 9


**Tag objects**

In [3]:
soup_tag = BeautifulSoup('<h1 attr_1="heading level 1">title goes here </h1>', 'lxml')
tag = soup_tag.h1
print(tag)

<h1 attr_1="heading level 1">title goes here </h1>


In [4]:
tag.name


'h1'

In [5]:
tag.attrs

{'attr_1': 'heading level 1'}

In [6]:
tag['attribute2'] = 'foo'
tag

<h1 attr_1="heading level 1" attribute2="foo">title goes here </h1>

In [7]:
del tag['attr_1']
tag

<h1 attribute2="foo">title goes here </h1>
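
Indexing a tag with a missing attribute name raises `KeyError`. A minimal sketch of reading attributes defensively with `Tag.get()` and `Tag.has_attr()` follows; the `demo` and `heading` names are illustrative and not part of the notebook, and `html.parser` is used only so the snippet runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Rebuild the same one-tag document used in the cells above.
demo = BeautifulSoup('<h1 attr_1="heading level 1">title goes here </h1>', 'html.parser')
heading = demo.h1

# .get() returns None (or a supplied default) instead of raising KeyError.
present = heading.get('attr_1')
missing = heading.get('attribute2')
fallback = heading.get('attribute2', 'n/a')

# has_attr() tests for an attribute without reading its value.
has_first = heading.has_attr('attr_1')

print(present, missing, fallback, has_first)
```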

In [8]:
soup.title

<title>InterQuartile Range (IQR)</title>

In [9]:
soup.body.h1


<h1>InterQuartile Range (IQR)</h1>

In [10]:
soup.li

<li class="arrowprev"><a href="BS704_SummarizingData6.html" title="Go to page 6">Prev</a></li>

In [11]:
soup.a

<a href="BS704_SummarizingData_print.html" target="_blank">print all</a>
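
Dot access such as `soup.a` returns only the first matching tag, and `None` when no such tag exists. The sketch below illustrates this on a small hypothetical snippet (the markup and variable names are assumptions for illustration, not taken from the page above).

```python
from bs4 import BeautifulSoup

snippet = '<div><a href="first.html">one</a><a href="second.html">two</a></div>'
demo = BeautifulSoup(snippet, 'html.parser')

# Dot access is shorthand for find(): both return the first <a> only.
first_by_dot = demo.a
first_by_find = demo.find('a')

# A tag that does not occur in the document yields None, not an error.
missing = demo.table

# Use find_all() when every match is needed.
all_hrefs = [a['href'] for a in demo.find_all('a')]
print(first_by_dot, missing, all_hrefs)
```

Checking for `None` before chaining (for example `soup.table.tr`) avoids an `AttributeError` on pages that lack the tag.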

### Segment 2 - NavigableString Objects

In [12]:
soup2 = BeautifulSoup('<h1 attr1="foo">Future Trends</h1>', 'lxml')
tag = soup2.h1

In [13]:
tag.name

'h1'

In [14]:
tag.string

'Future Trends'

In [15]:
type(tag.string)

bs4.element.NavigableString

In [16]:
navi_str = tag.string
navi_str

'Future Trends'

In [17]:
navi_str.replace_with('NaN')

'Future Trends'

In [18]:
tag.string

'NaN'
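
`.string` works in the cells above because the `<h1>` holds a single piece of text. As a hedged illustration on a made-up paragraph (not from the notebook), a tag with mixed children returns `None` from `.string`, while `get_text()` still collects all of the text.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<p>Future <b>Trends</b></p>', 'html.parser')
paragraph = demo.p

# <p> has two children (a string and a <b> tag), so .string is ambiguous.
direct = paragraph.string

# The nested <b> has exactly one string child, so .string resolves.
nested = paragraph.b.string

# get_text() concatenates every string beneath the tag.
combined = paragraph.get_text()

print(direct, nested, combined)
```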

In [19]:
for string in soup.stripped_strings:
    print(string)

InterQuartile Range (IQR)
Summarizing Data
Descriptive Statistics
print all
Prev
Next
1
|
2
|
3
|
4
|
5
|
6
|   7
|
8
|
9
|
10
InterQuartile Range (IQR)
Outliers and Tukey Fences:
Contents
All Modules
InterQuartile Range (IQR)
When a data set has outliers or extreme values, we summarize a typical value using the
median
as opposed to the mean.  When a data set has outliers, variability is often summarized by a statistic called the
interquartile range
, which is the difference between the first and third quartiles. The first quartile, denoted Q
1
, is the value in the data set that holds 25% of the values
below
it. The third quartile, denoted Q
3
, is the value in the data set that holds 25% of the values
above
it. The quartiles can be determined following the same approach that we used to determine the median, but we now consider each half of the data set separately. The interquartile range is defined as follows:
Interquartile Range = Q
3
-Q
1
With an Even Sample Size:
For the sample (n
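
The loop above prints one string per line. A small sketch of how `stripped_strings` differs from `strings`, using a hypothetical two-paragraph snippet, may make the behaviour easier to see: whitespace-only strings are skipped and the rest are trimmed.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<div>\n  <p> first </p>\n  <p>second</p>\n</div>', 'html.parser')

# .strings yields every text node, including indentation and newlines.
raw = list(demo.strings)

# .stripped_strings drops whitespace-only nodes and trims the others.
clean = list(demo.stripped_strings)

# Joining the cleaned strings gives a compact single-line summary.
joined = ' '.join(demo.stripped_strings)

print(len(raw), clean, joined)
```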

In [20]:
first_link = soup.a
first_link

<a href="BS704_SummarizingData_print.html" target="_blank">print all</a>

In [21]:
first_link.parent

<div id="printall" role="menu"><a href="BS704_SummarizingData_print.html" target="_blank">print all</a></div>

In [22]:
first_link.string

'print all'
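
Besides `.parent`, a tag exposes its siblings and children. The following is a minimal sketch on an assumed snippet (the `menu` markup is invented for illustration); note that on real pages whitespace between tags also counts as a sibling.

```python
from bs4 import BeautifulSoup

html_doc = '<div id="menu"><a href="a.html">A</a><a href="b.html">B</a></div>'
demo = BeautifulSoup(html_doc, 'html.parser')
first = demo.a

# Walk up one level to the enclosing <div>.
parent_id = first.parent['id']

# Walk sideways: there is no whitespace between the tags here,
# so the next sibling is the second <a>.
sibling_text = first.next_sibling.string

# Walk down: list the direct children of the parent.
child_names = [child.name for child in first.parent.children]

print(parent_id, sibling_text, child_names)
```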

### Segment 3 - Data parsing

In [23]:
import re

# Fetch the sample IoT article and parse it with the lxml parser.
with requests.get('https://raw.githubusercontent.com/BigDataGal/Data-Mania-Demos/master/IoT-2018.html') as resp:
    html = resp.text
soup = BeautifulSoup(html, 'lxml')

# Preview the first 100 characters of the indented markup.
print(soup.prettify()[:100])

<html>
 <head>
  <title>
   IoT Articles
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    


**Getting data from parse tree**

In [24]:
text_only = soup.get_text()
print(text_only[:1000])

IoT Articles

2018 Trends: Best New IoT Device Ideas for Data Scientists and Engineers
It's almost 2018 and IoT is on the cusp of an explosive expansion. In this article, I offer you a listing of new IoT device ideas that you can use...


It's almost 2018 and IoT is on the cusp of an explosive expansion. In this article, I offer you a listing of new IoT device ideas that you can use to get practice in designing your first IoT applications.
Looking Back at My Coolest IoT Find in 2017
Before going into detail about best new IoT device ideas, here's the backstory. Last month Ericsson Digital invited me to tour the Ericsson Studio in Kista, Sweden. Up until that visit, IoT had been largely theoretical to me. Of course, I know the usual mumbo-jumbo about wearables and IoT-connected fitness trackers. That stuff is all well and good, but it's somewhat old hat - plus I am not sure we are really benefiting so much from those, so I'm not that impressed.

It wasn't until I got to the Ericss
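
By default `get_text()` concatenates strings exactly as they appear, which can glue adjacent elements together. A hedged sketch of the `separator` and `strip` arguments on an invented snippet:

```python
from bs4 import BeautifulSoup

html_doc = '<p class="title"><b> IoT Articles </b></p><p>Line one<br/>Line two</p>'
demo = BeautifulSoup(html_doc, 'html.parser')

# Default behaviour: strings are joined with nothing in between.
default_text = demo.get_text()

# separator inserts a delimiter between strings; strip trims each one.
tidy_text = demo.get_text(separator=' | ', strip=True)

print(repr(default_text))
print(tidy_text)
```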


**Searching and retrieving**

In [25]:
soup.find_all('li')

[<li><strong>Big Data</strong> &amp; Data Engineering: Sensors that are embedded within IoT devices spin off machine-generated data like it's going out of style. For IoT to function, the platform must be solidly engineered to handle big data. Be assured, that requires some serious data engineering.</li>,
 <li><strong>Machine Learning</strong> Data Science: While a lot of IoT devices are still operated according to rules-based decision criteria, the age of artificial intelligence is upon us. IoT will increasingly depend on machine learning algorithms to control device operations so that devices are able to autonomously respond to a complex set of overlapping stimuli.</li>,
 <li><strong>Blockchain</strong>-Enabled Security: Above all else, IoT networks must be secure. Blockchain technology is primed to meet the security demands that come along with building and expanding the IoT.</li>,
 <li>Enable built-in sensing to build a weather station that measures ambient temperature and humidity<

In [26]:
soup.find_all(id="link 7")

[<a class="preview" href="http://www.skyfilabs.com/iot-online-courses" id="link 7">SkyFi</a>]
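
`find_all()` accepts other filters besides `id`. The sketch below shows a few of them on a hypothetical set of links (the markup and `example.com` URLs are assumptions for illustration): `class_`, `limit`, and an `attrs` dictionary.

```python
from bs4 import BeautifulSoup

html_doc = '''
<a class="preview" href="http://example.com/one" id="link 1">One</a>
<a class="preview" href="http://example.com/two" id="link 2">Two</a>
<a class="other" href="http://example.com/three" id="link 3">Three</a>
'''
demo = BeautifulSoup(html_doc, 'html.parser')

# Keyword arguments filter on attribute values.
by_id = demo.find_all(id='link 2')

# class is a reserved word in Python, so bs4 uses class_ instead.
by_class = demo.find_all('a', class_='preview')

# limit stops the search after the given number of matches.
limited = demo.find_all('a', limit=1)

# attrs takes a dictionary, useful for attribute names that are not valid keywords.
by_attrs = demo.find_all(attrs={'href': 'http://example.com/three'})

print(by_id, by_class, limited, by_attrs, sep='\n')
```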

In [27]:
soup.find_all(['ol', 'b'])

[<b>2018 Trends: Best New IoT Device Ideas for Data Scientists and Engineers</b>,
 <ol>
 <li><strong>Big Data</strong> &amp; Data Engineering: Sensors that are embedded within IoT devices spin off machine-generated data like it's going out of style. For IoT to function, the platform must be solidly engineered to handle big data. Be assured, that requires some serious data engineering.</li>
 <li><strong>Machine Learning</strong> Data Science: While a lot of IoT devices are still operated according to rules-based decision criteria, the age of artificial intelligence is upon us. IoT will increasingly depend on machine learning algorithms to control device operations so that devices are able to autonomously respond to a complex set of overlapping stimuli.</li>
 <li><strong>Blockchain</strong>-Enabled Security: Above all else, IoT networks must be secure. Blockchain technology is primed to meet the security demands that come along with building and expanding the IoT.</li>
 </ol>,
 <ol>
 <li

In [28]:
t = re.compile('t')
for tag in soup.find_all(t):
    print(tag.name)

html
title
strong
strong
strong
strong
strong
strong
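
The pattern `'t'` matches `html`, `title`, and `strong` because Beautiful Soup tests a compiled pattern against each tag name with a search, so any name containing a `t` qualifies. A minimal sketch on an assumed small document, contrasting this with an anchored pattern:

```python
import re
from bs4 import BeautifulSoup

html_doc = ('<html><head><title>T</title></head>'
            '<body><strong>s</strong><table></table></body></html>')
demo = BeautifulSoup(html_doc, 'html.parser')

# Unanchored: any tag whose name contains the letter t.
contains_t = [tag.name for tag in demo.find_all(re.compile('t'))]

# Anchored with ^: only tags whose name starts with t.
starts_with_t = [tag.name for tag in demo.find_all(re.compile('^t'))]

print(contains_t)
print(starts_with_t)
```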


In [29]:
for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
p
b
p
br
br
h1
span
strong
a
a
a
img
a
span
strong
a
h1
ol
li
strong
li
strong
li
strong
h1
a
a
a
h2
ol
li
li
li
li
li
li
h2
ol
li
li
li
li
li
a
img
h2
ol
li
li
li
li
li
h2
ol
li
li
li
li
span
strong
a
em
p


**Find weblinks**

In [30]:
for link in soup.find_all('a'):
    print(link.get('href'))

http://bit.ly/LPlNDJj
http://www.data-mania.com/blog/m2m-vs-iot/
bit.ly/LPlNDJj
http://mat.se/
http://bit.ly/LPlNDJj
https://click.linksynergy.com/deeplink?id=*JDLXjeE*wk&mid=39197&murl=https%3A%2F%2Fwww.udemy.com%2Ftopic%2Finternet-of-things%2F%3Fsort%3Dhighest-rated
http://www.skyfilabs.com/iot-online-courses
https://www.coursera.org/specializations/iot
bit.ly/LPlNDJj
http://bit.ly/LPlNDJj
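
Some of the links printed above lack a scheme (`bit.ly/LPlNDJj`), and `link.get('href')` prints `None` for any anchor without an `href`. The following sketch shows one possible cleanup; the `with_scheme` helper is hypothetical, and it assumes that scheme-less entries are meant as host names rather than relative paths.

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

html_doc = ('<a href="http://mat.se/">A</a>'
            '<a href="bit.ly/LPlNDJj">B</a>'
            '<a name="anchor-only">C</a>')
demo = BeautifulSoup(html_doc, 'html.parser')

# href=True keeps only the anchors that actually carry an href attribute.
hrefs = [a['href'] for a in demo.find_all('a', href=True)]


def with_scheme(url, default='http'):
    """Hypothetical helper: prepend a scheme when the URL has none."""
    return url if urlparse(url).scheme else f'{default}://{url}'


normalized = [with_scheme(h) for h in hrefs]
print(normalized)
```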


In [31]:
soup.find_all(string=re.compile('data'))

[' & Data Engineering: Sensors that are embedded within IoT devices spin off machine-generated data like it's going out of style. For IoT to function, the platform must be solidly engineered to handle big data. Be assured, that requires some serious data engineering.']
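
The search above is case-sensitive, so a string such as `Big Data` on its own is not returned. A hedged sketch with an invented list shows how `re.IGNORECASE` widens the match, and how `.parent` recovers the tag that encloses each matched string.

```python
import re
from bs4 import BeautifulSoup

html_doc = '<ul><li><strong>Big Data</strong> and data engineering</li><li>Sensors</li></ul>'
demo = BeautifulSoup(html_doc, 'html.parser')

# Case-sensitive: only strings containing lowercase "data" match.
lower_only = demo.find_all(string=re.compile('data'))

# Case-insensitive: "Data" matches as well.
any_case = demo.find_all(string=re.compile('data', re.IGNORECASE))

# Each result is a NavigableString; .parent gives the enclosing tag.
parents = [s.parent.name for s in any_case]

print(lower_only, any_case, parents, sep='\n')
```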