# Lesson 7: Advanced Web Scraping and Data Gathering
## Topic 1: Basics of web-scraping using `requests` and `BeautifulSoup` libraries

### Install `beautifulsoup4` and import the `requests` library

In [1]:
!pip install beautifulsoup4








In [2]:
import requests

### Exercise 1: Use `requests` to get a response from the Wikipedia home page

In [3]:
# First assign the URL of Wikipedia home page to a string 
wiki_home = "https://en.wikipedia.org/wiki/Main_Page"

In [4]:
# Use the 'get' method from requests library to get a response
response = requests.get(wiki_home)

In [5]:
# What is this 'response' object anyway
type(response)

requests.models.Response

In [6]:
# Iterating over a Response object yields the raw body in successive byte chunks
for r in response:
    print(r)

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Wikipedia, the free encyclo'
b'pedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":['
b'"",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May"'
b',"June","July","August","September","October","November","December"],"wgRequestId":"240ed5bf-6f05-4c48-9b4b-4ca09608373a","wgCSP'
b'Nonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitl'
b'e":"Main Page","wgCurRevisionId":1085170884,"wgRevisionId":1085170884,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":f'
b'alse,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel'
b'":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantArticleId":1

### Exercise 2: Write a small function to check the status of web request

Small helper/utility functions like this one are very useful in complex projects.

Start building **a habit of writing small functions to accomplish small modular tasks**, instead of writing long scripts, which are hard to debug and track.

In [7]:
def status_check(r):
    if r.status_code==200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

In [8]:
status_check(response)

Success!


1
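
The helper above treats only status code 200 as success. As a minimal sketch of a variant, assuming that any 2xx code should count as success and that a boolean is more convenient than `1`/`-1`, something like the following could be used. The name `is_success` is hypothetical, and a `SimpleNamespace` stands in for a real response so that the snippet runs without network access.

```python
from types import SimpleNamespace

def is_success(r):
    """Return True for any 2xx status code, False otherwise."""
    return 200 <= r.status_code < 300

# Stand-in objects that only carry a status_code attribute
ok_response = SimpleNamespace(status_code=200)
missing_response = SimpleNamespace(status_code=404)

print(is_success(ok_response))       # True
print(is_success(missing_response))  # False
```

A boolean return value can be used directly in an `if` statement, which avoids comparing against magic numbers later on.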

### Exercise 3: Write a small function to check the encoding of the web page

In [9]:
def encoding_check(r):
    return (r.encoding)

In [10]:
encoding_check(response)

'UTF-8'
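
The `encoding` attribute of a response can be `None` when the server does not declare one, in which case passing it on to a decoding step would fail. The sketch below shows one possible fallback chain. The function name `encoding_or_default` and the `"utf-8"` default are assumptions made for illustration; `apparent_encoding` is the attribute `requests` provides for an encoding guessed from the body. Stand-in objects are used so the snippet runs offline.

```python
from types import SimpleNamespace

def encoding_or_default(r, default="utf-8"):
    """Return r.encoding, falling back to r.apparent_encoding, then to default."""
    return r.encoding or getattr(r, "apparent_encoding", None) or default

# Stand-ins for a response that declares its encoding and one that does not
declared = SimpleNamespace(encoding="UTF-8", apparent_encoding="utf-8")
undeclared = SimpleNamespace(encoding=None, apparent_encoding=None)

print(encoding_or_default(declared))    # UTF-8
print(encoding_or_default(undeclared))  # utf-8
```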

### Exercise 4: Write a small function to decode the contents of the `response`

In [11]:
def decode_content(r,encoding):
    return (r.content.decode(encoding))

In [12]:
contents = decode_content(response,encoding_check(response))

#### What is the type of the contents?

In [13]:
type(contents)

str

#### Fantastic! Finally we got a string object. Did you see how easy it was to read text from a popular webpage like Wikipedia?
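
`decode_content` is a thin wrapper around the built-in `bytes.decode` method. The short sketch below, which uses a made-up byte string rather than a real web page, shows what that call does and how the `errors` argument changes the behavior when the encoding does not match the data.

```python
# A small byte string standing in for downloaded page content
raw = "café".encode("utf-8")

text = raw.decode("utf-8")
print(type(text).__name__, text)  # str café

# Decoding with the wrong codec raises UnicodeDecodeError by default;
# errors="replace" substitutes U+FFFD for each undecodable byte instead.
lenient = raw.decode("ascii", errors="replace")
print(lenient)
```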

### Exercise 5: Check the length of the text you got back and print some selected portions

In [14]:
len(contents)

84449

In [15]:
print(contents[:1000])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"240ed5bf-6f05-4c48-9b4b-4ca09608373a","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1085170884,"wgRevisionId":1085170884,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantArticleId":15580374,"wgIsProbablyEditable":fals

In [16]:
print(contents[15000:16000])

/" class="extiw" title="mail:daily-article-l">By email</a></b></li>
<li><b><a href="/wiki/Wikipedia:Featured_articles" title="Wikipedia:Featured articles">More featured articles</a></b></li></ul>
</div></div>
<h2 id="mp-dyk-h2" class="mp-h2"><span class="mw-headline" id="Did_you_know_...">Did you know&#160;...</span></h2>
<div id="mp-dyk">
<div class="dyk-img" style="float: right; margin-left: 0.5em;">
<div class="thumbinner mp-thumb" style="background: transparent; border: none; padding: 0; max-width: 156px;">
<a href="/wiki/File:Levitan_Evening_bells_1892.jpg" class="image" title="Evening Bells by Isaac Levitan"><img alt="Evening Bells by Isaac Levitan" src="//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Levitan_Evening_bells_1892.jpg/156px-Levitan_Evening_bells_1892.jpg" decoding="async" width="156" height="125" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Levitan_Evening_bells_1892.jpg/234px-Levitan_Evening_bells_1892.jpg 1.5x, //upload.wikimedia.org/wikipedia/c

### Exercise 6: Use the `BeautifulSoup` package to parse the raw HTML text more meaningfully and search for particular text

In [17]:
from bs4 import BeautifulSoup

In [18]:
soup = BeautifulSoup(contents, 'html.parser')

#### What is this new `soup` object?

In [19]:
type(soup)

bs4.BeautifulSoup

### Exercise 7: Can we somehow read intelligible text from this `soup` object?

In [20]:
txt_dump=soup.text

In [21]:
type(txt_dump)

str

In [22]:
len(txt_dump)

8548

In [37]:
print(txt_dump[0:])





Wikipedia, the free encyclopedia











































Main Page

From Wikipedia, the free encyclopedia



Jump to navigation
Jump to search



Welcome to Wikipedia,
the free encyclopedia that anyone can edit.
6,512,161 articles in English





From today's featured article




The Tower House, in the district of Holland Park in Kensington and Chelsea, London, is a late Victorian townhouse built between 1875 and 1881 by the architect and designer William Burges as his personal residence. Designed in the French Gothic Revival style, it echoes elements of Burges's earlier work. The house was built of red brick by the Ashby Brothers, with a distinctive cylindrical tower and conical roof. The interior was decorated by members of Burges's long-standing team of craftsmen including Thomas Nicholls and Henry Stacy Marks. The house retains most of its internal structural decoration, but much of the furniture, fittings and contents that Burges designed have been disperse
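
The long runs of blank lines in the dump above come from whitespace between HTML tags. As a sketch of one way to get a more compact dump, `get_text` can be called with a separator and `strip=True`. The HTML fragment below is made up for illustration and is not the Wikipedia page.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML fragment standing in for the downloaded page
html = """
<html><head><title>Demo page</title></head>
<body>

  <h1>Main Page</h1>

  <p>Welcome to the demo.</p>

</body></html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment and drops whitespace-only ones;
# separator controls what is placed between the remaining fragments.
compact = demo_soup.get_text(separator="\n", strip=True)
print(compact)
```

Note that stripping changes character positions, so index-based slicing on the compact text will not line up with indices computed on `soup.text`.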

### Exercise 8: Extract the text from the section *'From today's featured article'*

In [24]:
# First extract the start and end indices of the text we are interested in
idx1=txt_dump.find("From today's featured article")
idx2=txt_dump.find("Recently featured")

In [25]:
print(txt_dump[idx1+len("From today's featured article"):idx2])






The Tower House, in the district of Holland Park in Kensington and Chelsea, London, is a late Victorian townhouse built between 1875 and 1881 by the architect and designer William Burges as his personal residence. Designed in the French Gothic Revival style, it echoes elements of Burges's earlier work. The house was built of red brick by the Ashby Brothers, with a distinctive cylindrical tower and conical roof. The interior was decorated by members of Burges's long-standing team of craftsmen including Thomas Nicholls and Henry Stacy Marks. The house retains most of its internal structural decoration, but much of the furniture, fittings and contents that Burges designed have been dispersed. Many items, including the Great Bookcase, the Zodiac settle, the Golden Bed and the Red Bed, are now in institutions such as The Higgins Bedford and the Victoria and Albert Museum. The house was designated a Grade I listed building in 1949. (Full article...)
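
`str.find` returns `-1` when the marker is absent, and slicing with `-1` silently produces the wrong text instead of an error. A minimal sketch of a helper that makes this case explicit is shown below. The name `text_between` and the sample string are hypothetical.

```python
def text_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)  # search only after the start marker
    if end == -1:
        return None
    return text[start:end].strip()

# Made-up sample mimicking the layout of the text dump
sample = "Header\nFrom today's featured article\n\nSome article text.\n\nRecently featured: A, B"

print(text_between(sample, "From today's featured article", "Recently featured"))
print(text_between(sample, "No such heading", "Recently featured"))
```

With the objects defined above, the equivalent call would be `text_between(txt_dump, "From today's featured article", "Recently featured")`.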





### Exercise 9: Try to extract the important historical events that happened on today's date...

In [26]:
idx3=txt_dump.find("On this day")

In [27]:
print(txt_dump[idx3+len("On this day"):idx3+len("On this day")+1000])



June 12: Dia dos Namorados in Brazil; Independence Day in the Philippines (1898); Loving Day in the United States (1967)



Anne Frank

1798 – Following the successful French invasion of Malta, the Knights Hospitaller surrendered Malta to Napoleon, initiating two years of occupation.
1899 – The New Richmond tornado killed 117 people and injured 125 others in the Upper Midwest region of the United States.
1942 – On her thirteenth birthday, Anne Frank (pictured) began keeping a diary during the Nazi occupation of the Netherlands in World War II.
1991 – More than 150 Sri Lankan Tamil civilians were massacred by members of the military in the village of Kokkadichcholai.
Thomas Farnaby  (d. 1647)Egwale Seyon  (d. 1818)Javed Miandad  (b. 1957)

More anniversaries: 
June 11
June 12
June 13


Archive
By email
List of days of the year




Today's featured picture






The Notre-Dame fire broke out in the cathedral of Notre-Dame de Paris on 15 April 2019, causing severe damage to the building

### Exercise 10: Use a more advanced BS4 technique to extract relevant text without guessing or hard-coding where to look

In [28]:
for d in soup.find_all('div'):
    if (d.get('id')=='mp-otd'):
        for i in d.find_all('ul'):
            print(i.text)

1798 – Following the successful French invasion of Malta, the Knights Hospitaller surrendered Malta to Napoleon, initiating two years of occupation.
1899 – The New Richmond tornado killed 117 people and injured 125 others in the Upper Midwest region of the United States.
1942 – On her thirteenth birthday, Anne Frank (pictured) began keeping a diary during the Nazi occupation of the Netherlands in World War II.
1991 – More than 150 Sri Lankan Tamil civilians were massacred by members of the military in the village of Kokkadichcholai.
Thomas Farnaby  (d. 1647)Egwale Seyon  (d. 1818)Javed Miandad  (b. 1957)
June 11
June 12
June 13
Archive
By email
List of days of the year
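
The loop above visits every `div` on the page and compares its `id` by hand. Since `find` accepts attribute filters, the same element can also be looked up directly. The sketch below uses a made-up HTML fragment that only mimics the `mp-otd` structure; the real page contains far more markup and may change over time.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure used above: a div with id="mp-otd"
html = """
<div id="mp-other"><ul><li>Unrelated item</li></ul></div>
<div id="mp-otd">
  <ul><li>1798 – First event</li><li>1899 – Second event</li></ul>
  <ul><li>June 11</li><li>June 12</li></ul>
</div>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# find() accepts attribute filters, so the div can be looked up by id directly
otd = demo_soup.find("div", id="mp-otd")

# Guard against the id being absent, since find() returns None in that case
lists = otd.find_all("ul") if otd is not None else []
items = [li.get_text() for li in lists[0].find_all("li")] if lists else []
print(items)  # ['1798 – First event', '1899 – Second event']
```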


In [38]:
text_list=[]
for d in soup.find_all('div'):
    if (d.get('id')=='mp-otd'):
        for i in d.find_all('ul'):
            text_list.append(i.text)

In [39]:
len(text_list)

4

In [31]:
for i in text_list:
    print(i)
    print('-'*100)

1798 – Following the successful French invasion of Malta, the Knights Hospitaller surrendered Malta to Napoleon, initiating two years of occupation.
1899 – The New Richmond tornado killed 117 people and injured 125 others in the Upper Midwest region of the United States.
1942 – On her thirteenth birthday, Anne Frank (pictured) began keeping a diary during the Nazi occupation of the Netherlands in World War II.
1991 – More than 150 Sri Lankan Tamil civilians were massacred by members of the military in the village of Kokkadichcholai.
----------------------------------------------------------------------------------------------------
Thomas Farnaby  (d. 1647)Egwale Seyon  (d. 1818)Javed Miandad  (b. 1957)
----------------------------------------------------------------------------------------------------
June 11
June 12
June 13
----------------------------------------------------------------------------------------------------
Archive
By email
List of days of the year
-------------------

In [32]:
print(text_list[0])

1798 – Following the successful French invasion of Malta, the Knights Hospitaller surrendered Malta to Napoleon, initiating two years of occupation.
1899 – The New Richmond tornado killed 117 people and injured 125 others in the Upper Midwest region of the United States.
1942 – On her thirteenth birthday, Anne Frank (pictured) began keeping a diary during the Nazi occupation of the Netherlands in World War II.
1991 – More than 150 Sri Lankan Tamil civilians were massacred by members of the military in the village of Kokkadichcholai.
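
The extracted block is still one multi-line string. As a sketch of a possible next step, each line can be split into a year and a description. The function name `parse_events` is hypothetical, and the pattern assumes every event line has the form "year, en dash, text" seen in the output above. It tolerates any whitespace around the dash, including non-breaking spaces.

```python
import re

# Year, optional whitespace, an en dash, optional whitespace, then the description
EVENT_LINE = re.compile(r"^\s*(\d{1,4})\s*–\s*(.+)$")

def parse_events(block):
    """Split 'YEAR – description' lines into (year, description) tuples."""
    events = []
    for line in block.splitlines():
        match = EVENT_LINE.match(line)
        if match:  # lines that do not look like events are skipped
            events.append((int(match.group(1)), match.group(2).strip()))
    return events

sample = (
    "1899 – The New Richmond tornado killed 117 people and injured 125 others "
    "in the Upper Midwest region of the United States.\n"
    "1991\u00a0– More than 150 Sri Lankan Tamil civilians were massacred by "
    "members of the military in the village of Kokkadichcholai.\n"
    "June 11"
)

for year, description in parse_events(sample):
    print(year, "|", description)
```

With the objects defined above, `parse_events(text_list[0])` would be the corresponding call.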


### Functionalize this process, i.e., write a compact function to extract the "On this day" text from the Wikipedia home page

In [33]:
def wiki_on_this_day(url="https://en.wikipedia.org/wiki/Main_Page"):
    """
    Extracts the text corresponding to the "On this day" section on the Wikipedia Home Page.
    Accepts the Wikipedia Home Page URL as a string; a default URL is provided.
    Prints a message and returns -1 if the page cannot be reached.
    """
    import requests
    from bs4 import BeautifulSoup

    wiki_home = str(url)
    response = requests.get(wiki_home)

    def status_check(r):
        if r.status_code==200:
            return 1
        else:
            return -1

    status = status_check(response)
    if status==1:
        # decode_content and encoding_check are the helpers defined in Exercises 3 and 4
        contents = decode_content(response,encoding_check(response))
    else:
        print("Sorry could not reach the web page!")
        return -1

    soup = BeautifulSoup(contents, 'html.parser')
    text_list=[]

    # Collect the text of every <ul> inside the <div id="mp-otd"> block
    for d in soup.find_all('div'):
        if (d.get('id')=='mp-otd'):
            for i in d.find_all('ul'):
                text_list.append(i.text)

    # The first <ul> holds the list of historical events
    return (text_list[0])

In [34]:
print(wiki_on_this_day())

1798 – Following the successful French invasion of Malta, the Knights Hospitaller surrendered Malta to Napoleon, initiating two years of occupation.
1899 – The New Richmond tornado killed 117 people and injured 125 others in the Upper Midwest region of the United States.
1942 – On her thirteenth birthday, Anne Frank (pictured) began keeping a diary during the Nazi occupation of the Netherlands in World War II.
1991 – More than 150 Sri Lankan Tamil civilians were massacred by members of the military in the village of Kokkadichcholai.


#### A wrong URL produces an error message as expected

In [35]:
print(wiki_on_this_day("https://en.wikipedia.org/wiki/Main_Page1"))

Sorry could not reach the web page!
-1
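
The wrong URL above is detected through its status code. A request can also fail before any response arrives, for example because of a timeout, a DNS failure, or a malformed URL. The sketch below shows one way to handle both cases in a single place. The name `fetch_html` and the 10-second timeout are assumptions made for illustration, not part of the lesson's function.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turns 4xx/5xx responses into exceptions
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.text

# A malformed URL fails before any network traffic is attempted
print(fetch_html("not-a-valid-url"))
```

Returning `None` instead of `-1` keeps the failure value distinct from any string or number a successful call could return.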
