# Lesson 7: Advanced Web Scraping and Data Gathering
## Topic 1: Basics of web-scraping using `requests` and `BeautifulSoup` libraries

### Install `beautifulsoup4` and import the `requests` library

In [None]:
!pip install beautifulsoup4

In [None]:
import requests

### Exercise 1: Use `requests` to get a response from the Wikipedia home page

In [None]:
# First assign the URL of Wikipedia home page to a string 
wiki_home = "https://en.wikipedia.org/wiki/Main_Page"

In [None]:
# Use the 'get' method from requests library to get a response
response = requests.get(wiki_home)
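
A request can fail before any response arrives, for example because of a malformed URL, a dropped connection, or a slow server. The sketch below wraps `requests.get` with a timeout and catches `requests.RequestException`. The helper name `fetch` and the 10-second default are illustrative assumptions, not part of the lesson.

```python
import requests

def fetch(url, timeout=10.0):
    """Return the Response for `url`, or None if the request could not be made.

    `fetch` and the default timeout are illustrative choices, not from the lesson.
    """
    try:
        # Without a timeout, requests may wait indefinitely for a slow server
        return requests.get(url, timeout=timeout)
    except requests.RequestException as exc:
        # Base class for connection errors, timeouts, malformed URLs, etc.
        print(f"Request failed: {exc}")
        return None

# A URL without a scheme is rejected before any network traffic happens
print(fetch("not-a-url"))
```

Returning `None` keeps the notebook running after a failure. Letting the exception propagate is an equally reasonable choice when the caller should stop.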

In [5]:
# What is this 'response' object anyway?
type(response)

requests.models.Response

In [6]:
# Iterating over a Response yields the body as raw bytes, 128 bytes at a time
for r in response:
    print(r)

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Wikipedia, the free encyclo'
b'pedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":['
b'"",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May"'
b',"June","July","August","September","October","November","December"],"wgRequestId":"2905378d-a6f4-4e8c-890c-386978280edf","wgCSP'
b'Nonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitl'
b'e":"Main Page","wgCurRevisionId":1108085777,"wgRevisionId":1108085777,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":f'
b'alse,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel'
b'":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantArticleId":1

### Exercise 2: Write a small function to check the status of web request

Small helper/utility functions like this one are incredibly useful in complex projects.

Start building **a habit of writing small functions to accomplish small modular tasks**, instead of writing long scripts, which are hard to debug and track.

In [7]:
def status_check(r):
    if r.status_code==200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

In [8]:
status_check(response)

Success!


1
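
`200` is the most common success code, but every code in the 2xx range signals success. A minimal sketch of a boolean variant follows. The name `is_success` and the `SimpleNamespace` stand-ins for a real response are assumptions made for illustration. `requests` itself also provides `response.ok` (true for status codes below 400) and `response.raise_for_status()`.

```python
from types import SimpleNamespace

def is_success(r):
    """Return True for any 2xx status code (hypothetical helper, not from the lesson)."""
    return 200 <= r.status_code < 300

# Stand-in objects that mimic only the `status_code` attribute of a Response
ok_response = SimpleNamespace(status_code=200)
missing_response = SimpleNamespace(status_code=404)

print(is_success(ok_response))       # True
print(is_success(missing_response))  # False
```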

### Exercise 3: Write a small function to check the encoding of the web page

In [9]:
def encoding_check(r):
    return (r.encoding)

In [10]:
encoding_check(response)

'UTF-8'
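
`r.encoding` is derived from the HTTP headers and can be `None` when the server does not declare a character set, in which case passing it to `decode` would fail. The sketch below falls back to a default, assuming UTF-8 is acceptable for the pages being scraped. `encoding_or_default` is a hypothetical name, and the stand-in objects mimic only the `encoding` attribute.

```python
from types import SimpleNamespace

def encoding_or_default(r, default="utf-8"):
    """Return r.encoding, or `default` when no encoding was detected."""
    return r.encoding or default

# Stand-ins for responses with and without a declared character set
declared = SimpleNamespace(encoding="ISO-8859-1")
undeclared = SimpleNamespace(encoding=None)

print(encoding_or_default(declared))    # ISO-8859-1
print(encoding_or_default(undeclared))  # utf-8
```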

### Exercise 4: Write a small function to decode the contents of the `response`

In [11]:
def decode_content(r,encoding):
    return (r.content.decode(encoding))

In [12]:
contents = decode_content(response,encoding_check(response))

#### What is the type of the contents?

In [13]:
type(contents)

str

In [14]:
contents

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Wikipedia, the free encyclopedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"2905378d-a6f4-4e8c-890c-386978280edf","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1108085777,"wgRevisionId":1108085777,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantArticleId":15580374,"wgIsProbablyEditable

#### Fantastic! Finally we got a string object. Did you see how easy it was to read text from a popular webpage like Wikipedia?

### Exercise 5: Check the length of the text you got back and print some selected portions

In [15]:
len(contents)

82523

In [16]:
print(contents[:1000])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"2905378d-a6f4-4e8c-890c-386978280edf","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1108085777,"wgRevisionId":1108085777,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantArticleId":15580374,"wgIsProbablyEditable":fals

In [17]:
print(contents[15000:16000])

/September 2022">Archive</a></b></li>
<li><b><a href="https://lists.wikimedia.org/postorius/lists/daily-article-l.lists.wikimedia.org/" class="extiw" title="mail:daily-article-l">By email</a></b></li>
<li><b><a href="/wiki/Wikipedia:Featured_articles" title="Wikipedia:Featured articles">More featured articles</a></b></li>
<li><b><a href="/wiki/Wikipedia:About_Today%27s_featured_article" title="Wikipedia:About Today&#39;s featured article">About</a></b></li></ul>
</div></div>
<h2 id="mp-dyk-h2" class="mp-h2"><span class="mw-headline" id="Did_you_know_...">Did you know&#160;...</span></h2>
<div id="mp-dyk" class="mp-contains-float">
<div class="dyk-img" style="float: right; margin-left: 0.5em;">
<div class="thumbinner mp-thumb" style="background: transparent; border: none; padding: 0; max-width: 101px;">
<a href="/wiki/File:San_Remo_apartments_from_Central_Park,_NYC.jpg" class="image" title="The San Remo"><img alt="The San Remo" src="//upload.wikimedia.org/wikipedia/commons/thumb/a/a1/Sa

### Exercise 6: Use `BeautifulSoup` package to parse the raw HTML text more meaningfully and search for a particular text

In [19]:
from bs4 import BeautifulSoup

In [20]:
soup = BeautifulSoup(contents, 'html.parser')

#### What is this new `soup` object?

In [21]:
type(soup)

bs4.BeautifulSoup
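
A `BeautifulSoup` object represents the parsed document as a tree that can be navigated by tag name and searched. The sketch below shows a few common lookups on a tiny hand-written page, used here instead of the live Wikipedia HTML. The page content and the name `demo_soup` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used here instead of the live Wikipedia HTML
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h2 id="news">In the news</h2>
    <a href="/wiki/Python">Python</a>
    <a href="/wiki/HTML">HTML</a>
  </body>
</html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

print(demo_soup.title.string)          # text inside <title>
print(demo_soup.find("h2").get("id"))  # attribute of the first <h2>
print([a.get("href") for a in demo_soup.find_all("a")])  # all link targets
```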

### Exercise 7: Can we somehow read intelligible text from this `soup` object?

In [22]:
txt_dump=soup.text

In [23]:
type(txt_dump)

str

In [24]:
len(txt_dump)

8342

In [25]:
print(txt_dump[0:])





Wikipedia, the free encyclopedia












































Main Page

From Wikipedia, the free encyclopedia



Jump to navigation
Jump to search



Welcome to Wikipedia,
the free encyclopedia that anyone can edit.
6,553,743 articles in English




From today's featured article


Portland Streetcar vehicle on the Broadway Bridge

The A and B Loop is a streetcar circle route of the Portland Streetcar system in Portland, Oregon, United States. Operated by Portland Streetcar, Inc. and TriMet, it consists of two services within the Central City that travel a loop between the east and west sides of the Willamette River by crossing the Broadway Bridge (pictured) in the north and Tilikum Crossing in the south. The services connect Portland's downtown, Pearl District, Lloyd District, Central Eastside, and South Waterfront. Portland city officials considered an eastside streetcar extension upon authorizing the Central City Streetcar project in 1997. After several years of p
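
The dump above contains long runs of blank lines left over from the page layout. One way to make it easier to read, sketched here with plain string methods, is to drop the empty lines. `drop_blank_lines` and the sample string are illustrative, not part of the lesson.

```python
def drop_blank_lines(text):
    """Return `text` with empty and whitespace-only lines removed."""
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())

# A short sample that imitates the spacing of the text dump
sample = "\n\n\nMain Page\n\n  \nFrom Wikipedia, the free encyclopedia\n\n"
print(drop_blank_lines(sample))
```

The same function could be applied to `txt_dump` before printing it.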

### Exercise 8: Extract the text from the section *'From today's featured article'*

In [26]:
# First extract the start and end indices of the text we are interested in
idx1=txt_dump.find("From today's featured article")
idx2=txt_dump.find("Recently featured")

In [27]:
print(txt_dump[idx1+len("From today's featured article"):idx2])




Portland Streetcar vehicle on the Broadway Bridge

The A and B Loop is a streetcar circle route of the Portland Streetcar system in Portland, Oregon, United States. Operated by Portland Streetcar, Inc. and TriMet, it consists of two services within the Central City that travel a loop between the east and west sides of the Willamette River by crossing the Broadway Bridge (pictured) in the north and Tilikum Crossing in the south. The services connect Portland's downtown, Pearl District, Lloyd District, Central Eastside, and South Waterfront. Portland city officials considered an eastside streetcar extension upon authorizing the Central City Streetcar project in 1997. After several years of planning, the Portland Streetcar Loop Project was approved and held its groundbreaking in 2009. It opened between the Broadway Bridge and the Oregon Museum of Science and Industry on September 22, 2012. The opening of Tilikum Crossing in 2015 further extended its tracks from the museum to the South 
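
Note that `str.find` returns `-1` when the marker is absent, so the slice above would silently return the wrong text if Wikipedia renamed either heading. A sketch of a helper that makes this case explicit follows. `text_between` and the sample string are hypothetical.

```python
def text_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either marker is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    # Search for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end].strip()

sample = "Intro From today's featured article The A and B Loop is a route. Recently featured: more"
print(text_between(sample, "From today's featured article", "Recently featured"))
print(text_between(sample, "No such heading", "Recently featured"))
```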

### Exercise 9: Try to extract the important historical events that happened on today's date...

In [28]:
idx3=txt_dump.find("On this day")

In [29]:
print(txt_dump[idx3+len("On this day"):idx3+len("On this day")+1000])



September 22



Richard Wagner

1586 – Eighty Years' War: Spanish forces defeated an Anglo-Dutch army at the Battle of Zutphen.
1869 – Das Rheingold, the first of four operas in Der Ring des Nibelungen by the German composer Richard Wagner (pictured), was first performed in Munich.
1934 – One of Britain's worst mining accidents took place when an explosion at Gresford Colliery in Wales killed 266 men.
1979 – An American Vela satellite detected an unidentified flash of light near the Prince Edward Islands in the Indian Ocean, thought to be a nuclear weapons test.
1994 – The Nordhordland Bridge, crossing Salhusfjorden between Klauvaneset and Flatøy in Vestland, and Norway's second-longest bridge, officially opened.
Wilhelm Keitel  (b. 1882)Norma McCorvey  (b. 1947)Aurelio López  (d. 1992)

More anniversaries: 
September 21
September 22
September 23


Archive
By email
List of days of the year




Today's featured picture






Onésime Reclus (22 September 1837 – 30 June 1916), was a Fre

### Exercise 10: Use a more advanced BS4 technique to extract the relevant text without guessing or hard-coding where to look

In [30]:
for d in soup.find_all('div'):
    if d.get('id') == 'mp-otd':
        for i in d.find_all('ul'):
            print(i.text)

1586 – Eighty Years' War: Spanish forces defeated an Anglo-Dutch army at the Battle of Zutphen.
1869 – Das Rheingold, the first of four operas in Der Ring des Nibelungen by the German composer Richard Wagner (pictured), was first performed in Munich.
1934 – One of Britain's worst mining accidents took place when an explosion at Gresford Colliery in Wales killed 266 men.
1979 – An American Vela satellite detected an unidentified flash of light near the Prince Edward Islands in the Indian Ocean, thought to be a nuclear weapons test.
1994 – The Nordhordland Bridge, crossing Salhusfjorden between Klauvaneset and Flatøy in Vestland, and Norway's second-longest bridge, officially opened.
Wilhelm Keitel  (b. 1882)Norma McCorvey  (b. 1947)Aurelio López  (d. 1992)
September 21
September 22
September 23
Archive
By email
List of days of the year
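
The loop above visits every `<div>` and compares its `id`. BeautifulSoup can also look an element up by `id` directly or through a CSS selector. The sketch below uses a small hand-written fragment that only imitates the `mp-otd` structure. The HTML is an assumption for illustration, not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that imitates the layout of the "On this day" box
html = """
<div id="mp-tfa"><ul><li>Not wanted</li></ul></div>
<div id="mp-otd">
  <ul><li>1586 - First event</li><li>1869 - Second event</li></ul>
  <ul><li>More anniversaries</li></ul>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Option 1: find the container by id, then collect its lists
container = demo_soup.find("div", id="mp-otd")
lists_a = [ul.get_text() for ul in container.find_all("ul")]

# Option 2: a CSS selector for every <ul> inside the element with id="mp-otd"
lists_b = [ul.get_text() for ul in demo_soup.select("#mp-otd ul")]

print(lists_a == lists_b)
print(len(lists_a))
```

Both options return the same lists here. Which one reads better is largely a matter of taste.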


In [31]:
text_list = []
for d in soup.find_all('div'):
    if d.get('id') == 'mp-otd':
        for i in d.find_all('ul'):
            text_list.append(i.text)

In [None]:
len(text_list)

In [32]:
for i in text_list:
    print(i)
    print('-'*100)

1586 – Eighty Years' War: Spanish forces defeated an Anglo-Dutch army at the Battle of Zutphen.
1869 – Das Rheingold, the first of four operas in Der Ring des Nibelungen by the German composer Richard Wagner (pictured), was first performed in Munich.
1934 – One of Britain's worst mining accidents took place when an explosion at Gresford Colliery in Wales killed 266 men.
1979 – An American Vela satellite detected an unidentified flash of light near the Prince Edward Islands in the Indian Ocean, thought to be a nuclear weapons test.
1994 – The Nordhordland Bridge, crossing Salhusfjorden between Klauvaneset and Flatøy in Vestland, and Norway's second-longest bridge, officially opened.
----------------------------------------------------------------------------------------------------
Wilhelm Keitel  (b. 1882)Norma McCorvey  (b. 1947)Aurelio López  (d. 1992)
----------------------------------------------------------------------------------------------------
September 21
September 22
Septem

In [33]:
print(text_list[0])

1586 – Eighty Years' War: Spanish forces defeated an Anglo-Dutch army at the Battle of Zutphen.
1869 – Das Rheingold, the first of four operas in Der Ring des Nibelungen by the German composer Richard Wagner (pictured), was first performed in Munich.
1934 – One of Britain's worst mining accidents took place when an explosion at Gresford Colliery in Wales killed 266 men.
1979 – An American Vela satellite detected an unidentified flash of light near the Prince Edward Islands in the Indian Ocean, thought to be a nuclear weapons test.
1994 – The Nordhordland Bridge, crossing Salhusfjorden between Klauvaneset and Flatøy in Vestland, and Norway's second-longest bridge, officially opened.
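
Each line of this block follows a "YEAR – description" pattern, so it can be turned into structured data. The sketch below is one possible way to do that. `parse_events` is a hypothetical helper, and the sample reuses two lines from the output above.

```python
def parse_events(block):
    """Split an "On this day" text block into (year, description) pairs.

    Assumes each line looks like "YEAR – description", as in the output above.
    Lines that do not match are skipped.
    """
    events = []
    for line in block.splitlines():
        # Split on the en dash only; the surrounding spaces are stripped afterwards
        year, sep, description = line.partition("–")
        if sep and year.strip().isdigit():
            events.append((int(year.strip()), description.strip()))
    return events

sample = (
    "1586 – Eighty Years' War: Spanish forces defeated an Anglo-Dutch army at the Battle of Zutphen.\n"
    "1934 – One of Britain's worst mining accidents took place when an explosion at Gresford Colliery in Wales killed 266 men.\n"
)
for year, description in parse_events(sample):
    print(year, description[:40])
```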


### Functionalize this process, i.e., write a compact function to extract the "On this day" text from the Wikipedia Home Page

In [34]:
def wiki_on_this_day(url="https://en.wikipedia.org/wiki/Main_Page"):
    """
    Extracts the text corresponding to the "On this day" section on the Wikipedia Home Page.
    Accepts the Wikipedia Home Page URL as a string; a default URL is provided.
    Returns -1 if the page cannot be reached or the section is not found.
    """
    import requests
    from bs4 import BeautifulSoup

    wiki_home = str(url)
    response = requests.get(wiki_home)

    def status_check(r):
        if r.status_code == 200:
            return 1
        else:
            return -1

    status = status_check(response)
    if status == 1:
        # decode_content and encoding_check are the helpers defined in Exercises 3 and 4
        contents = decode_content(response, encoding_check(response))
    else:
        print("Sorry could not reach the web page!")
        return -1

    soup = BeautifulSoup(contents, 'html.parser')
    text_list = []

    for d in soup.find_all('div'):
        if d.get('id') == 'mp-otd':
            for i in d.find_all('ul'):
                text_list.append(i.text)

    # Guard against an empty list, e.g. if the page layout changes
    if not text_list:
        print("Sorry could not find the 'On this day' section!")
        return -1

    return text_list[0]

In [35]:
print(wiki_on_this_day())

1586 – Eighty Years' War: Spanish forces defeated an Anglo-Dutch army at the Battle of Zutphen.
1869 – Das Rheingold, the first of four operas in Der Ring des Nibelungen by the German composer Richard Wagner (pictured), was first performed in Munich.
1934 – One of Britain's worst mining accidents took place when an explosion at Gresford Colliery in Wales killed 266 men.
1979 – An American Vela satellite detected an unidentified flash of light near the Prince Edward Islands in the Indian Ocean, thought to be a nuclear weapons test.
1994 – The Nordhordland Bridge, crossing Salhusfjorden between Klauvaneset and Flatøy in Vestland, and Norway's second-longest bridge, officially opened.


#### A wrong URL produces an error message as expected

In [36]:
print(wiki_on_this_day("https://en.wikipedia.org/wiki/Main_Page1"))

Sorry could not reach the web page!
-1
