# Web Scraping

Web scraping extracts data from publicly available websites in an automated fashion. It is useful when the website you want data from has no API, or has one that provides only limited access to the data. urllib and requests are Python modules for making the web requests that scraping relies on.

* urllib is the URL handling package in Python's standard library. It is used to fetch URLs (Uniform Resource Locators).
* Beautiful Soup is a library that makes it easy to scrape information from web pages.

In [1]:
from bs4 import BeautifulSoup  # parses HTML and helps extract data from it
import requests  # third-party HTTP library used to download a web page

from urllib.request import urlopen  # standard-library function that opens a URL and returns a response object
from urllib.error import HTTPError  # raised when the server answers with an HTTP error status
from urllib.error import URLError   # raised when the server cannot be reached

In [2]:
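html = urlopen('https://www.ncses.nsf.gov/about')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)  # first <h1> in the document
print(bs.h2)  # first <h2>
print(bs.h3)  # first <h3>
print(bs.find_all(["h1", "h2"]))  # every <h1> and <h2>, in document order
print(bs.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]))  # every heading level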

<h1>National Center for Science and Engineering Statistics</h1>
<h2>Who We Are</h2>
<h3 class="card-title" data-property="title"><p style="font-style:italic">
Principles and Practices for a Federal Statistical Agency</p></h3>
[<h1>National Center for Science and Engineering Statistics</h1>, <h1>About NCSES</h1>, <h2>Who We Are</h2>, <h2 class="blue">Our Mission</h2>, <h2 class="blue">Our Core Activities</h2>, <h2 class="blue">Our Products</h2>, <h2 class="blue">How We Support Research</h2>]
[<h1>National Center for Science and Engineering Statistics</h1>, <h1>About NCSES</h1>, <h2>Who We Are</h2>, <h2 class="blue">Our Mission</h2>, <h3 class="card-title" data-property="title"><p style="font-style:italic">
Principles and Practices for a Federal Statistical Agency</p></h3>, <h2 class="blue">Our Core Activities</h2>, <h2 class="blue">Our Products</h2>, <h2 class="blue">How We Support Research</h2>, <h3 class="card-title" data-property="title">Research</h3>, <h3 class="card-title" data-pro
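
The same calls can be tried offline on a small inline document. The following is a minimal sketch, assuming a hand-written HTML string (`sample_html`, a hypothetical name, modeled on the headings printed above) in place of the live page. It shows `find_all` with a list of tag names and with the `class_` filter.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document modeled on the headings printed above
sample_html = """
<h1>About NCSES</h1>
<h2>Who We Are</h2>
<h2 class="blue">Our Mission</h2>
<h2 class="blue">Our Products</h2>
<h3 class="card-title">Research</h3>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A list of names matches any of the listed tags, in document order
headings = [tag.get_text() for tag in soup.find_all(["h1", "h2"])]

# class_ restricts the match to tags carrying that CSS class
blue_headings = [tag.get_text() for tag in soup.find_all("h2", class_="blue")]

print(headings)
print(blue_headings)
```

Calling `get_text()` on each result drops the markup and keeps only the heading text, which is usually what a scraper wants to store.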

In [4]:
try:
    html = urlopen("https://www.ncses.nsf.gov/about")
except HTTPError as e:
    print("The server returned an HTTP error")
except URLError as e:
    print("The server could not be found!")
else:
    print(html.read())

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<title>About | NSF - National Science Foundation</title>\n\t<meta http-equiv="X-UA-Compatible" content="IE=11" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1" />\n\t<meta charset="UTF-8" />\n\t<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />\n\t<link rel="shortcut icon" href="/resources/assets/images/statistics/favicon.ico" type="image/x-icon" />\n    <link rel="preload" as="font" type="font/woff2" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" crossorigin="anonymous"/>\n    \n    <link rel="stylesheet" href="/resources/assets/css/pages/statistics/bootstrap.min.css" />\n\t<script src="/resources/assets/js/pages/statistics/statistics.concat.js"></script>\n\n\n    <link rel="preload" as="font" type="font/woff2" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" crossorigin="true"/>\n    <link rel="stylesheet" href="/resources/assets/css/pages/statistics/default-wit
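
The try/except/else pattern above can be wrapped in a reusable function. This is a sketch under stated assumptions: `fetch_page` is a hypothetical helper name, the 10-second timeout is an arbitrary choice, and the demonstration uses `file://` URLs on a temporary file so that it runs without network access.

```python
from pathlib import Path
from tempfile import TemporaryDirectory
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    fetch_page is a hypothetical helper name, not part of urllib.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as e:
        # HTTPError subclasses URLError, so it has to be caught first
        print("The server returned an HTTP error:", e.code)
    except URLError as e:
        print("The server could not be reached:", e.reason)
    return None


# Offline demonstration with file:// URLs, which urlopen also accepts
with TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_bytes(b"<h1>About NCSES</h1>")
    found = fetch_page(page.as_uri())
    missing = fetch_page((Path(tmp) / "missing.html").as_uri())

print(found)
print(missing)
```

Returning `None` on failure lets the calling code decide whether to skip the page or retry, instead of stopping the whole scrape on the first error.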

In [5]:
print(bs.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   About | NSF - National Science Foundation
  </title>
  <meta content="IE=11" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta charset="utf-8"/>
  <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
  <link href="/resources/assets/images/statistics/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <link as="font" crossorigin="anonymous" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" rel="preload" type="font/woff2"/>
  <link href="/resources/assets/css/pages/statistics/bootstrap.min.css" rel="stylesheet"/>
  <script src="/resources/assets/js/pages/statistics/statistics.concat.js">
  </script>
  <link as="font" crossorigin="true" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" rel="preload" type="font/woff2"/>
  <link href="/resources/assets/css/pages/statistics/default-without-menu.css?v=1.95.0" rel="stylesheet"/>
 <

In [6]:
tag_object = bs.title
print("tag object:", tag_object)

tag object: <title>About | NSF - National Science Foundation</title>


In [7]:
print("tag object type:", type(tag_object))

tag object type: <class 'bs4.element.Tag'>
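
A `Tag` object exposes the element's name, attributes, and text. The sketch below illustrates this on a single inline element copied from the heading output earlier in this notebook, so it needs no network access. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# One element taken from the heading output shown earlier
snippet = '<h3 class="card-title" data-property="title">Research</h3>'
tag = BeautifulSoup(snippet, "html.parser").h3

tag_name = tag.name                  # the element name
tag_classes = tag["class"]           # class is multi-valued, so this is a list
tag_prop = tag.get("data-property")  # .get returns None if the attribute is absent
tag_text = tag.get_text()            # text content without markup

print(tag_name, tag_classes, tag_prop, tag_text)
```

Using `tag.get(...)` rather than `tag[...]` is the safer choice when an attribute may be missing, because indexing raises `KeyError` in that case.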


In [8]:
resp = requests.get('https://www.ncses.nsf.gov/about')
print(resp.content)

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<title>About | NSF - National Science Foundation</title>\n\t<meta http-equiv="X-UA-Compatible" content="IE=11" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1" />\n\t<meta charset="UTF-8" />\n\t<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />\n\t<link rel="shortcut icon" href="/resources/assets/images/statistics/favicon.ico" type="image/x-icon" />\n    <link rel="preload" as="font" type="font/woff2" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" crossorigin="anonymous"/>\n    \n    <link rel="stylesheet" href="/resources/assets/css/pages/statistics/bootstrap.min.css" />\n\t<script src="/resources/assets/js/pages/statistics/statistics.concat.js"></script>\n\n\n    <link rel="preload" as="font" type="font/woff2" href="/resources/assets/fonts/fontawesome-webfont.woff2?v=4.7.0" crossorigin="true"/>\n    <link rel="stylesheet" href="/resources/assets/css/pages/statistics/default-wit

In [9]:
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs2 = BeautifulSoup(html, "html.parser")

In [10]:
# find_all is the current method name; findAll is a legacy alias
name_list = bs2.find_all('span', class_='green')
for name in name_list:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
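
The list above contains many repeats, and some names are split across two lines, which suggests the extracted text contains line breaks. A minimal sketch of cleaning and counting the names follows. It assumes a short hand-copied sample (`raw_names`, a hypothetical variable) in place of the full scraped list.

```python
from collections import Counter

# Short sample modeled on the output above; "\n" mimics the names
# that were printed across two lines
raw_names = [
    "Anna\nPavlovna Scherer",
    "the prince",
    "Anna Pavlovna",
    "the prince",
    "Anna\nPavlovna",
    "the prince",
]

# Collapse any run of whitespace (including newlines) into one space
cleaned = [" ".join(name.split()) for name in raw_names]

counts = Counter(cleaned)
most_common = counts.most_common(2)

# dict.fromkeys removes duplicates while keeping first-seen order
unique_names = list(dict.fromkeys(cleaned))

print(most_common)
print(unique_names)
```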


In [11]:
import socket

HOST = 'www.google.com'  # Server hostname or IP address
PORT = 80                # The standard port for HTTP is 80, for HTTPS it is 443

client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_address = (HOST, PORT)
client_socket.connect(server_address)

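# A minimal HTTP/1.0 request: request line, Host header, then a blank line
request_header = b'GET / HTTP/1.0\r\nHost: www.google.com\r\n\r\n'
client_socket.sendall(request_header)

# Accumulate raw bytes; calling str() on each chunk would embed b'...' reprs in the text
response = b''
while True:
    recv = client_socket.recv(1024)
    if not recv:  # an empty read means the server closed the connection
        break
    response += recv

client_socket.close()
print(response)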


b'HTTP/1.0 200 OK\r\nDate: Sun, 23 Oct 2022 00:48:07 GMT\r\nExpires: -1\r\nCache-Control: private, max-age=0\r\nContent-Type: text/html; charset=ISO-8859-1\r\nP3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."\r\nServer: gws\r\nX-XSS-Protection: 0\r\nX-Frame-Options: SAMEORIGIN\r\nSet-Cookie: 1P_JAR=2022-10-23-00; expires=Tue, 22-Nov-2022 00:48:07 GMT; path=/; domain=.google.com; Secure\r\nSet-Cookie: AEC=AakniGOZk5-xdjFSwVnCH3z49GpwhI_4UEILGi4mGg5pRfn4kKUsA-zgNgU; expires=Fri, 21-Apr-2023 00:48:07 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax\r\nSet-Cookie: NID=511=UATiK2i2tQsoKUmF1yA52bWv4llN6JpsluqqGRpY_0JHYsjtbL6gGAXAjuupPbjspRsW2sjodjzmnP5ax9LEVtUEDWjh5sNOvHQN0XxWyCH_XOt3k2Q4sCBsBV9okNc3mKJMydmxdidkyhIMrtdEH-lp_2g91ZMjeS-fX7rv9fk; expires=Mon, 24-Apr-2023 00:48:07 GMT; path=/; domain=.google.com; HttpOnly\r\nAccept-Ranges: none\r\nVary: Accept-Encoding\r\n\r\n<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><
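
A raw response like the one above consists of a status line, header lines, a blank line, and then the body. The sketch below splits those parts by hand, assuming a shortened sample (`raw_response`, hypothetical) so that it runs offline. In practice a library such as requests or urllib3 does this parsing for you.

```python
# Shortened sample in the shape of the response shown above
raw_response = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/html; charset=ISO-8859-1\r\n"
    b"Server: gws\r\n"
    b"\r\n"
    b"<!doctype html><html><head></head></html>"
)

# Headers end at the first blank line (CRLF CRLF)
head, _, body = raw_response.partition(b"\r\n\r\n")
status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")

# Status line format: version, status code, reason phrase
version, status_code, reason = status_line.split(" ", 2)

headers = {}
for line in header_lines:
    name, _, value = line.partition(":")
    headers[name.strip().lower()] = value.strip()

print(version, status_code, reason)
print(headers["content-type"])
print(body)
```

A plain dictionary keeps only the last value for a repeated header such as `Set-Cookie`, so this sketch is suitable for inspection rather than for full HTTP handling.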

In [12]:
import re

html_content = '<p>Price : 19.99$</p>'

# A raw string avoids invalid-escape warnings; "/" needs no escaping in a Python regex
m = re.match(r'<p>(.+)</p>', html_content)
if m:
    print(m.group(1))

Price : 19.99$
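
Regular expressions work for a single, predictable tag, but the greedy `.+` can capture too much once a page has more than one match. The following sketch, which assumes a hypothetical two-paragraph fragment, contrasts greedy and non-greedy matching and then extracts the numeric price.

```python
import re

# Hypothetical fragment with two paragraphs
html_content = '<p>Price : 19.99$</p><p>Stock : 3</p>'

# Greedy: .+ runs to the last </p>, swallowing the tags in between
greedy = re.findall(r'<p>(.+)</p>', html_content)

# Non-greedy: .+? stops at the first </p> after each <p>
lazy = re.findall(r'<p>(.+?)</p>', html_content)

# Pull the number that precedes the "$" sign out of the first paragraph
price_match = re.search(r'(\d+\.\d+)\$', lazy[0])
price = float(price_match.group(1)) if price_match else None

print(greedy)
print(lazy)
print(price)
```

For anything beyond simple fragments, an HTML parser such as Beautiful Soup or lxml is generally more robust than regular expressions.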


In [13]:
import urllib3
http = urllib3.PoolManager()
r = http.request('GET', 'http://www.google.com')
print(r.data)


b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\'re looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="Xxxrnh3MFOuTEBq-C-XPCA">(function(){window.google={kEI:\'UY9UY73-GKmp5NoP5_260A8\',kEXPI:\'0,1302536,56873,6058,207,4804,2316,383,246,5,5367,1123753,1197754,380736,16115,19398,9286,22430,1362,12313,17586,4998,13228,3847,10622,14762,7981,2853,2226,885,710,1277,2451,293,147,1103,840,1983,214,4100,4120,2023,2297,6342,8328,3227,2846,6,4774,28996,1851,15324,432,3,346,1244,1,5445,150,11321,2652,4,1528,2304,7039,22023,5708,7356,13659,2980,1457,15351,1435,5818,2539,4094,17,4035,3,3541

In [14]:
from lxml import html

# We reuse the response from urllib3
data_string = r.data.decode('utf-8', errors='ignore')

# We instantiate a tree object from the HTML
tree = html.fromstring(data_string)

# We run the XPath query against this HTML
# This returns a list of elements
links = tree.xpath('//a')

for link in links:
    # For each element we can easily get back the URL
    print(link.get('href'))

http://www.google.com/imghp?hl=en&tab=wi
http://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.com/intl/en/about/products?tab=wh
http://www.google.com/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.com/&ec=GAZAAQ
/advanced_search?hl=en&authuser=0
/intl/en/ads/
/services/
/intl/en/about.html
/intl/en/policies/privacy/
/intl/en/policies/terms/
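
Several of the links above are relative (for example `/preferences?hl=en`) and cannot be fetched until they are resolved against the page they came from. A minimal sketch using the standard library's `urllib.parse` follows. `base_url` and the shortened `hrefs` list are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlparse

base_url = "http://www.google.com"  # page the links were scraped from

# A few href values copied from the output above
hrefs = [
    "https://news.google.com/?tab=wn",
    "/preferences?hl=en",
    "/intl/en/about.html",
]

# urljoin leaves absolute URLs alone and resolves relative ones against the base
absolute_links = [urljoin(base_url, href) for href in hrefs]

# urlparse splits a URL into components; netloc is the host part
hosts = [urlparse(link).netloc for link in absolute_links]

for link in absolute_links:
    print(link)
```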
