# Web Scraping

Web scraping extracts data from publicly available websites in an automated fashion. It is useful when the website you want data from has no API, or has one that provides only limited access to the data. urllib and requests are Python modules used to make the web requests that scraping relies on.

* urllib is the URL handling package in Python's standard library. It is used to fetch URLs (Uniform Resource Locators).
* Beautiful Soup is a library that makes it easy to scrape information from web pages.
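
As a small illustration of urllib's URL handling, the sketch below builds a request object for the page used later in this notebook without sending it, so it runs offline. The User-Agent value is a made-up placeholder, not something the site requires.

```python
from urllib.request import Request

# Hypothetical User-Agent string, for illustration only.
USER_AGENT = "example-scraper/0.1"

# Build (but do not send) a request for the page used in this notebook.
req = Request(
    "https://www.ncses.nsf.gov/about",
    headers={"User-Agent": USER_AGENT},
)

print(req.full_url)                  # the URL the request targets
print(req.get_method())              # GET, because no data payload was given
print(req.get_header("User-agent"))  # urllib stores header names capitalized
```

Passing `req` to `urlopen` instead of a plain URL string would send the request with that header attached.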

In [1]:
from bs4 import BeautifulSoup  # parses HTML so that tags can be searched and extracted
import requests  # third-party HTTP library used to download a web page

from urllib.request import urlopen  # standard-library function that opens a URL and returns a response object
from urllib.error import HTTPError, URLError  # raised for HTTP error statuses and for unreachable servers

In [2]:
html = urlopen('https://www.ncses.nsf.gov/about')
bs = BeautifulSoup(html.read(), 'html.parser')
# Attribute access returns only the first matching tag
print(bs.h1)
print(bs.h2)
print(bs.h3)
# find_all returns every tag whose name appears in the list
print(bs.find_all(["h1", "h2"]))
print(bs.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]))

<h1>National Center for Science and Engineering Statistics</h1>
<h2 class="blue">Our Mission</h2>
<h3 class="card-title" data-property="title"><p style="font-style:italic">
Principles and Practices for a Federal Statistical Agency</p></h3>
[<h1>National Center for Science and Engineering Statistics</h1>, <h1>About NCSES</h1>, <h2 class="blue">Our Mission</h2>, <h2 class="blue">Our Core Activities</h2>, <h2 class="blue">Our Products</h2>, <h2 class="blue">How We Support Research</h2>]
[<h1>National Center for Science and Engineering Statistics</h1>, <h1>About NCSES</h1>, <h2 class="blue">Our Mission</h2>, <h3 class="card-title" data-property="title"><p style="font-style:italic">
Principles and Practices for a Federal Statistical Agency</p></h3>, <h2 class="blue">Our Core Activities</h2>, <h2 class="blue">Our Products</h2>, <h2 class="blue">How We Support Research</h2>, <h3 class="card-title" data-property="title">Research</h3>, <h3 class="card-title" data-property="title"><div>Funding O
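
The difference between attribute access and find_all is easier to see on a small document. The sketch below uses a made-up inline HTML string (not the NCSES page) so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Small inline document, invented for illustration, so the example runs offline.
sample_html = """
<html><body>
  <h1>Main title</h1>
  <h2 class="blue">First section</h2>
  <h3>Detail</h3>
  <h2 class="blue">Second section</h2>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Attribute access returns only the first matching tag.
first_h2 = soup.h2

# find_all with a list returns every matching tag in document order.
headings = soup.find_all(["h1", "h2"])
heading_texts = [h.get_text(strip=True) for h in headings]

print(first_h2)
print(heading_texts)
```

The h3 tag is skipped because its name is not in the list passed to find_all.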

In [3]:
try:
    html = urlopen("https://www.ncses.nsf.gov/about")
except HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    print("The server returned an HTTP error:", e.code)
except URLError as e:
    print("The server could not be found:", e.reason)
else:
    print(html.read())

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<title>About | NSF - National Science Foundation</title>\n\t<meta http-equiv="X-UA-Compatible" content="IE=11" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1" />\n\t<meta charset="UTF-8" />\n\t<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />\n\t<link rel="shortcut icon" href="/resources/assets/images/statistics/favicon.ico" type="image/x-icon" />\n    <link rel="preload" as="font" type="font/woff2" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" crossorigin="anonymous"/>\n    \n    <link rel="stylesheet" href="/resources/assets/css/pages/statistics/bootstrap.min.css" />\n\t<script src="/resources/assets/js/pages/statistics/statistics.concat.js"></script>\n\n\n    <link rel="preload" as="font" type="font/woff2" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" crossorigin="true"/>\n    <link rel="stylesheet" href="/resources/assets/css/pages/statistics/default-wit
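
One way to reuse the try/except pattern above is to wrap it in a function. The sketch below is an assumption-laden example rather than a fixed recipe: the `fetch_html` name, the `opener` parameter, and the `unreachable` stand-in are all hypothetical and exist only so the failure path can be exercised without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, opener=urlopen):
    """Return the page body as bytes, or None if the request fails.

    `opener` is a hypothetical injection point for testing; it defaults
    to urlopen.
    """
    try:
        response = opener(url, timeout=10)
    except HTTPError as e:
        # HTTPError subclasses URLError, so it has to be caught first.
        print(f"The server returned HTTP status {e.code}")
        return None
    except URLError as e:
        print(f"The server could not be reached: {e.reason}")
        return None
    return response.read()


def unreachable(url, timeout=None):
    # Stand-in opener that simulates a failed DNS lookup.
    raise URLError("name resolution failed")


result = fetch_html("https://www.ncses.nsf.gov/about", opener=unreachable)
print(result)
```

Returning None on failure lets the caller decide whether to retry, skip the page, or stop.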

In [4]:
print(bs.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   About | NSF - National Science Foundation
  </title>
  <meta content="IE=11" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta charset="utf-8"/>
  <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
  <link href="/resources/assets/images/statistics/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <link as="font" crossorigin="anonymous" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" rel="preload" type="font/woff2"/>
  <link href="/resources/assets/css/pages/statistics/bootstrap.min.css" rel="stylesheet"/>
  <script src="/resources/assets/js/pages/statistics/statistics.concat.js">
  </script>
  <link as="font" crossorigin="true" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" rel="preload" type="font/woff2"/>
  <link href="/resources/assets/css/pages/statistics/default-without-menu.css?v=1.86.0" rel="stylesheet"/>
 <

In [5]:
tag_object = bs.title
print("tag object:", tag_object)

tag object: <title>About | NSF - National Science Foundation</title>


In [6]:
print("tag object type:", type(tag_object))

tag object type: <class 'bs4.element.Tag'>
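
A Tag object exposes the tag's name, text, attributes, and position in the tree. The sketch below illustrates this on a short inline string modeled on the title and heading printed earlier; the surrounding HTML skeleton is an assumption made so the example runs offline.

```python
from bs4 import BeautifulSoup

# Inline HTML modeled on the title and first h2 shown in the outputs above.
soup = BeautifulSoup(
    "<html><head><title>About | NSF - National Science Foundation</title></head>"
    '<body><h2 class="blue">Our Mission</h2></body></html>',
    "html.parser",
)

title_tag = soup.title
print(title_tag.name)         # tag name
print(title_tag.string)       # text inside the tag
print(title_tag.parent.name)  # name of the enclosing tag

h2_tag = soup.h2
print(h2_tag.attrs)           # all attributes as a dict
print(h2_tag["class"])        # class is multi-valued, so it comes back as a list
```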


In [7]:
resp = requests.get('https://www.ncses.nsf.gov/about')
print(resp.content)

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<title>About | NSF - National Science Foundation</title>\n\t<meta http-equiv="X-UA-Compatible" content="IE=11" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1" />\n\t<meta charset="UTF-8" />\n\t<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />\n\t<link rel="shortcut icon" href="/resources/assets/images/statistics/favicon.ico" type="image/x-icon" />\n    <link rel="preload" as="font" type="font/woff2" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" crossorigin="anonymous"/>\n    \n    <link rel="stylesheet" href="/resources/assets/css/pages/statistics/bootstrap.min.css" />\n\t<script src="/resources/assets/js/pages/statistics/statistics.concat.js"></script>\n\n\n    <link rel="preload" as="font" type="font/woff2" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" crossorigin="true"/>\n    <link rel="stylesheet" href="/resources/assets/css/pages/statistics/default-wit
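
The `b'...'` prefix in the output shows that `resp.content` is a bytes object. requests also offers `resp.text`, which returns the body already decoded to a string. The sketch below shows the manual equivalent on a short excerpt copied from the output above, assuming UTF-8 as the page's meta tag declares.

```python
# Short bytes sample copied from the start of the output above.
raw = (
    b'<!DOCTYPE html>\n<html lang="en">\n<head>\n'
    b'\t<title>About | NSF - National Science Foundation</title>'
)

# The page declares charset UTF-8, so decode with that codec.
text = raw.decode("utf-8")

print(type(raw).__name__, type(text).__name__)
print(text.splitlines()[0])
```

BeautifulSoup accepts either form, so decoding by hand is mainly useful when you want to inspect or save the markup as text.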

In [8]:
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs2 = BeautifulSoup(html, "html.parser")

In [9]:
# find_all is the current name; findAll is a deprecated alias
nameList = bs2.find_all('span', {'class': 'green'})
for name in nameList:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
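
The list above contains many repeats, and some names are split across lines because get_text keeps the line breaks found in the HTML. One possible follow-up, sketched below on a made-up snippet rather than the real page, is to collapse whitespace and count how often each name occurs.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Invented snippet that mimics the span structure queried above.
sample_html = """
<p>
  <span class="green">Anna
Pavlovna</span> spoke to <span class="green">the prince</span>.
  <span class="red">"Well, Prince"</span>
  <span class="green">Anna Pavlovna</span> smiled at <span class="green">the prince</span>.
</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Collapse internal whitespace so "Anna\nPavlovna" and "Anna Pavlovna" match.
names = [
    " ".join(span.get_text().split())
    for span in soup.find_all("span", class_="green")
]

counts = Counter(names)
print(counts.most_common())
```

Here `class_` is Beautiful Soup's keyword form of the `{'class': 'green'}` filter, spelled with a trailing underscore because `class` is a reserved word in Python.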
