# Web Scraping

> **Web scraping** is a software technique for extracting information from websites.  
It mostly involves transforming unstructured data on the web (HTML) into structured data (a database or spreadsheet).

In [1]:
from bs4 import BeautifulSoup

## Web Scraping Process

1. Download the HTML of the web page
2. Locate the information of interest
3. Locate the corresponding elements in the HTML source code and identify their common pattern
4. Instruct the scraper/extractor to return all elements that match that pattern
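
The four steps can be sketched on a small inline HTML string, so that no network access is needed. The markup, the `movie` class name, and the variable names below are made up for illustration. The built-in `html.parser` is used here instead of `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Step 1: normally the HTML comes from requests.get(url).text;
# here a hypothetical page is hard-coded instead.
html = """
<html><body>
  <h1>Top Movies</h1>
  <ul>
    <li class="movie">Coco</li>
    <li class="movie">Wonder</li>
    <li class="other">Not a movie</li>
  </ul>
</body></html>
"""

# Steps 2 and 3: the items of interest share one pattern, <li class="movie">
soup = BeautifulSoup(html, "html.parser")

# Step 4: ask the parser for every element that matches the pattern
movies = [li.get_text(strip=True) for li in soup.find_all("li", class_="movie")]
print(movies)  # ['Coco', 'Wonder']
```

The `<li class="other">` item is skipped because it does not match the pattern, which is the point of step 3.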

## Search by Tag and by Tag with Class

In [3]:
# Download the full text from a particular URL
import requests

from bs4 import BeautifulSoup

In [4]:
url = "http://www.imdb.com/chart/boxoffice/"
html = requests.get(url)

In [5]:
html.text

'\n\n\n\n<!DOCTYPE html>\n<html\nxmlns:og="http://ogp.me/ns#"\nxmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n    \n    \n    \n\n    \n    \n    \n\n    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">\n            <style>\n                body#styleguide-v2 {\n                    background: no-repeat fixed center top #000;\n                }\n            </style>\n            <script type="text/javascript">var ue_t0=window.ue_t0||+new Date();</script>\n            <script type="text/javascript">\n                var ue_mid = "A1EVAM02EL8SFB"; \n                var ue_sn = "www.imdb.com";  \n                var ue_furl = "fls-na.amazon.com";\n                var ue_sid = "000-0000000-0000000";\n                var ue_id = "0SRMK5BP6ZVB6NVGPY84";\n                (function(e){var c=e;var a=c.ue||{};a.main_scope="mainscopecsm";a.q=[];a.t0

In [6]:
# Convert the HTML source code to a BeautifulSoup object
bsObj = BeautifulSoup(html.text, 'lxml')

In [7]:
bsObj

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
<style>
                body#styleguide-v2 {
                    background: no-repeat fixed center top #000;
                }
            </style>
<script type="text/javascript">var ue_t0=window.ue_t0||+new Date();</script>
<script type="text/javascript">
                var ue_mid = "A1EVAM02EL8SFB"; 
                var ue_sn = "www.imdb.com";  
                var ue_furl = "fls-na.amazon.com";
                var ue_sid = "000-0000000-0000000";
                var ue_id = "0SRMK5BP6ZVB6NVGPY84";
                (function(e){var c=e;var a=c.ue||{};a.main_scope="mainscopecsm";a.q=[];a.t0=c.ue_t0||+new Date();a.d=g;function g(h){return +new Date()-(h?0:a.t0)}function d(h){return function(){a.q.push({n:h,a:argument

In [8]:
bsObj.find('title')

<title>IMDb Top Box Office - IMDb</title>

In [9]:
bsObj.find('title').getText()

'IMDb Top Box Office - IMDb'

In [12]:
h1List = bsObj.findAll('h1')

In [14]:
for h1 in h1List:
    print(h1)

<h1 class="imdb-pro-ad__title">The leading information resource for the entertainment industry</h1>
<h1 class="header">Top Box Office (US)</h1>


In [15]:
h4List = bsObj.findAll('h4')

In [16]:
for h4 in h4List:
    print(h4.getText())

Weekend of December 15 - 17, 2017


In [17]:
titleList = bsObj.findAll('td', {'class':'titleColumn'})
# In Jupyter, press Shift-Tab to view the documentation for any method

In [20]:
for title in titleList:
    print(title.getText())


Star Wars: Episode VIII - The Last Jedi


Ferdinand


Coco


Wonder


Justice League


Daddy's Home 2


Thor: Ragnarok


The Disaster Artist


Murder on the Orient Express


Lady Bird
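
The blank lines in the output above come from whitespace around the text inside each matched cell. A minimal sketch of the same tag-plus-class search with the whitespace trimmed is shown below. It runs on hypothetical markup shaped like the cells above rather than on the live page, and uses the built-in `html.parser` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the tag and class names mirror the search above
html = """
<table>
  <tr>
    <td class="titleColumn">
      <a href="/title/1">Coco</a>
    </td>
    <td class="otherColumn">ignored</td>
  </tr>
  <tr>
    <td class="titleColumn">
      <a href="/title/2">Wonder</a>
    </td>
    <td class="otherColumn">ignored</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# class_="titleColumn" is equivalent to passing {'class': 'titleColumn'}
cells = soup.find_all("td", class_="titleColumn")

# strip=True trims the whitespace around each piece of text
titles = [cell.get_text(strip=True) for cell in cells]
print(titles)  # ['Coco', 'Wonder']
```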



## Practical Activity

> Fetch the web page: https://www.reuters.com/finance/stocks/company-officers/GOOG.O  
Scrape the data from the first table at the above URL.  
Save it to a .csv file, with the fields of each row separated by commas.

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
url = "https://www.reuters.com/finance/stocks/company-officers/GOOG.O"
html = requests.get(url)
bsObj = BeautifulSoup(html.text, 'lxml')

In [3]:
file = open('ActivityCsv.csv', 'w')

In [4]:
trList = bsObj.findAll('tr')

for tr in trList[:1]:
    thList = tr.findAll('th')
    text=""
    for th in thList:
        print(th.getText())
        text += th.getText() + ','
    #print(text[:-1])
    file.write(text[:-1] + '\n')

Name
Age
Since
Current Position


In [5]:
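import csv

trList = bsObj.findAll('tr')

# csv.writer quotes fields that contain commas (e.g. "President, Director"),
# which a plain ','-join would split into extra columns
writer = csv.writer(file, lineterminator='\n')

for tr in trList[1:16]:
    row = [td.getText().strip() for td in tr.findAll('td')]
    for field in row:
        print(field)
    writer.writerow(row)

# Close the file so that the buffered rows are written to disk
file.close()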


Eric Schmidt
61
2015
Executive Chairman of the Board of Director
Sergey Brin
43
2015
President, Director
Lawrence Page
44
2015
Chief Executive Officer, Director
Ruth Porat
59
2015
Chief Financial Officer, Senior Vice President
Sundar Pichai
45
2017
Director, Chief Executive Officer, Google Inc.
David Drummond
54
2015
Senior Vice President - Corporate Development, Chief Legal Officer, Secretary
John Hennessy
65
2007
Lead Independent Director
Diane Greene
61
2015
Director
L. John Doerr
65
2016
Independent Director
Roger Ferguson
65
2016
Independent Director
Ann Mather
57
2005
Independent Director
Alan Mulally
71
2014
Independent Director
Paul Otellini
66
2004
Independent Director
Kavitark Shriram
60
1998
Independent Director
Shirley Tilghman
70
2005
Independent Director
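
Several positions in this table contain commas (for example `President, Director`), so how each field is written matters for the resulting CSV. The sketch below uses two rows copied from the output above and Python's standard `csv` module. It writes to an in-memory buffer, which stands in for the real file only to keep the example self-contained.

```python
import csv
import io

# Two rows copied from the output above; the last field contains a comma
rows = [
    ["Name", "Age", "Since", "Current Position"],
    ["Sergey Brin", "43", "2015", "President, Director"],
]

# In-memory buffer standing in for a file opened with newline=''
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields four fields per row again,
# because the comma-containing field was quoted on the way out
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed == rows)  # True
```

Joining the same fields with `','` by hand would produce five columns for the second row instead of four.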
