# Web Scraping

Web scraping is used to extract data from publicly available websites in an automated fashion. It is useful when the website you want data from has no API, or has one that provides only limited access to the data. urllib and requests are Python modules used to make the web requests that scraping relies on.

* urllib is Python's standard-library package for handling URLs. It is used to fetch URLs (Uniform Resource Locators).
* Beautiful Soup is a library that makes it easy to scrape information from web pages.
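
As a minimal sketch of what Beautiful Soup does once a page has been downloaded, the example below parses a small inline HTML string instead of a live page. The HTML snippet and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration; a real page would be downloaded first
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Main heading</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

title_text = soup.title.get_text()    # text inside <title>
heading_text = soup.h1.get_text()     # text inside the first <h1>
paragraphs = [p.get_text() for p in soup.find_all("p")]  # text of every <p>

print(title_text)
print(heading_text)
print(paragraphs)
```

Parsing a string this way is also a convenient way to experiment with selectors without sending repeated requests to a server.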

In [7]:
from bs4 import BeautifulSoup  # parses HTML so that tags can be searched and extracted
import requests  # third-party HTTP library used to download a web page

from urllib.request import urlopen  # standard-library function that opens a URL and returns a response object
from urllib.error import HTTPError, URLError  # exceptions raised when a request fails

In [8]:
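html = urlopen('https://www.ncses.nsf.gov/about')
bs = BeautifulSoup(html.read(), 'html.parser')
# Attribute access such as bs.h1 returns only the first matching tag
print(bs.h1)
print(bs.h2)
print(bs.h3)
# find_all returns every tag whose name is in the list, in document order
print(bs.find_all(["h1", "h2"]))
print(bs.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]))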

<h1>National Center for Science and Engineering Statistics</h1>
<h2 class="blue">Our Mission</h2>
<h3 class="card-title" data-property="title"><p style="font-style:italic">
Principles and Practices for a Federal Statistical Agency</p></h3>
[<h1>National Center for Science and Engineering Statistics</h1>, <h1>About NCSES</h1>, <h2 class="blue">Our Mission</h2>, <h2 class="blue">Our Core Activities</h2>, <h2 class="blue">Our Products</h2>, <h2 class="blue">How We Support Research</h2>]
[<h1>National Center for Science and Engineering Statistics</h1>, <h1>About NCSES</h1>, <h2 class="blue">Our Mission</h2>, <h3 class="card-title" data-property="title"><p style="font-style:italic">
Principles and Practices for a Federal Statistical Agency</p></h3>, <h2 class="blue">Our Core Activities</h2>, <h2 class="blue">Our Products</h2>, <h2 class="blue">How We Support Research</h2>, <h3 class="card-title" data-property="title">Research</h3>, <h3 class="card-title" data-property="title"><div>Funding O
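
Instead of listing all six heading names, find_all also accepts a compiled regular expression that is matched against tag names. A minimal sketch on a made-up HTML string (the markup and variable names are illustrative only):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for a downloaded page
sample_html = """
<h1>Title</h1>
<h2 class="blue">Section A</h2>
<p>Body text</p>
<h3>Subsection</h3>
<h2 class="blue">Section B</h2>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Matches h1 through h6 only; the anchors keep names such as "header" out
heading_pattern = re.compile(r"^h[1-6]$")
headings = soup.find_all(heading_pattern)

# Build a simple outline of (tag name, heading text) pairs in document order
outline = [(tag.name, tag.get_text(strip=True)) for tag in headings]
for level, text in outline:
    print(level, text)
```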

In [9]:
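try:
    html = urlopen("https://www.ncses.nsf.gov/about")
except HTTPError as e:
    # The server responded, but with an error status such as 404 or 500
    print("The server returned an HTTP error:", e.code)
except URLError as e:
    # The server could not be reached at all (for example, a bad host name)
    print("The server could not be found:", e.reason)
else:
    print(html.read())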

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<title>About | NSF - National Science Foundation</title>\n\t<meta http-equiv="X-UA-Compatible" content="IE=11" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1" />\n\t<meta charset="UTF-8" />\n\t<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />\n\t<link rel="shortcut icon" href="/resources/assets/images/statistics/favicon.ico" type="image/x-icon" />\n    <link rel="preload" as="font" type="font/woff2" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" crossorigin="anonymous"/>\n    \n    <link rel="stylesheet" href="/resources/assets/css/pages/statistics/bootstrap.min.css" />\n\t<script src="/resources/assets/js/pages/statistics/statistics.concat.js"></script>\n\n\n    <link rel="preload" as="font" type="font/woff2" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" crossorigin="true"/>\n    <link rel="stylesheet" href="/resources/assets/css/pages/statistics/default-wit
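
The same try/except pattern can be wrapped in a small reusable function. The sketch below is one possible shape, not a definitive implementation: the name fetch_html and the 10-second timeout are assumptions made for illustration.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    fetch_html and its timeout default are hypothetical, for illustration only.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first
        print("HTTP error:", e.code)
    except URLError as e:
        print("Could not reach the server:", e.reason)
    return None
```

Returning None on failure lets the caller check the result once (for example, `if page is not None:`) before handing it to BeautifulSoup.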

In [14]:
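# prettify() re-indents the parsed document, one tag per line.
# Note: the output below is the War and Peace page, so bs evidently held that page
# (the one fetched in In [11]) when this cell was run.
print(bs.prettify())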

<html>
 <head>
  <style>
   .green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
  </style>
 </head>
 <body>
  <h1>
   War and Peace
  </h1>
  <h2>
   Chapter 1
  </h2>
  <div id="text">
   "
   <span class="red">
    Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
   </span>
   "
   <p>
   </p>
   It was in July, 1805, and the speaker was the well-known
   <span class="green">
    Anna
Pavlovna Scherer
   </span>
   , maid of honor and favorite of the
   <span class="green">
    Empress Marya
Fedorovna
   </span>
   . With these words she greeted
   <span 

In [15]:
tag_object = bs.title  # None here, because the parsed page has no <title> tag
print("tag object:", tag_object)

tag object: None


In [None]:
print("tag object type:", type(tag_object))
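
Because attribute access such as bs.title returns None when the tag is missing, calling a method on the result directly would raise AttributeError. A minimal sketch of a guard, using a made-up HTML string and a hypothetical helper named tag_text:

```python
from bs4 import BeautifulSoup

# Hypothetical page with an empty <head>, so there is no <title> tag
sample_html = "<html><head></head><body><h1>War and Peace</h1></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")


def tag_text(tag, default=""):
    """Return the tag's text, or a default when the tag was not found (None)."""
    return tag.get_text(strip=True) if tag is not None else default


title_text = tag_text(soup.title, default="(no title)")  # soup.title is None here
heading_text = tag_text(soup.h1)
print(title_text)
print(heading_text)
```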

In [10]:
resp = requests.get('https://www.ncses.nsf.gov/about')
print(resp.content)  # raw response body as bytes; resp.text gives the decoded string

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<title>About | NSF - National Science Foundation</title>\n\t<meta http-equiv="X-UA-Compatible" content="IE=11" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1" />\n\t<meta charset="UTF-8" />\n\t<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />\n\t<link rel="shortcut icon" href="/resources/assets/images/statistics/favicon.ico" type="image/x-icon" />\n    <link rel="preload" as="font" type="font/woff2" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" crossorigin="anonymous"/>\n    \n    <link rel="stylesheet" href="/resources/assets/css/pages/statistics/bootstrap.min.css" />\n\t<script src="/resources/assets/js/pages/statistics/statistics.concat.js"></script>\n\n\n    <link rel="preload" as="font" type="font/woff2" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" crossorigin="true"/>\n    <link rel="stylesheet" href="/resources/assets/css/pages/statistics/default-wit
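
With requests, a failed download does not raise an exception by default, so it can help to check the status explicitly and set a timeout. The following is a minimal sketch under those assumptions; the function name fetch_text and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_text(url, timeout=10):
    """Download a page with requests and return its body as a decoded string.

    fetch_text and the 10-second timeout are hypothetical choices for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text         # str, decoded using the detected encoding
```

The returned string can be passed straight to BeautifulSoup, for example `BeautifulSoup(fetch_text(url), "html.parser")`.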

In [11]:
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs2 = BeautifulSoup(html, "html.parser")

In [12]:
# Collect every <span class="green"> element; find_all is the current name for findAll
name_list = bs2.find_all('span', class_='green')
for name in name_list:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
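
Some names in the output above are split across two lines (for example, "Anna" and "Pavlovna Scherer") because the text inside the tag contains a line break. A minimal sketch of collapsing that whitespace and counting how often each name appears, using a made-up HTML fragment in place of the live page:

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure of the War and Peace page
sample_html = """
<div id="text">
  <span class="green">Anna
Pavlovna Scherer</span> greeted
  <span class="green">the prince</span>, and later
  <span class="green">the prince</span> replied.
  <span class="red">Quoted speech</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# split() with no arguments breaks on any whitespace, including newlines,
# so joining with a single space puts each name on one line
names = [" ".join(span.get_text().split())
         for span in soup.find_all("span", class_="green")]

name_counts = Counter(names)
for name, count in name_counts.most_common():
    print(name, count)
```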
