# Lesson 7: Advanced Web Scraping and Data Gathering
## Topic 1: Basics of web-scraping using `requests` and `BeautifulSoup` libraries

### Import `requests` library

In [1]:
import requests

### Exercise 1: Use `requests` to get a response from the Wikipedia home page

In [2]:
# First assign the URL of Wikipedia home page to a string 
wiki_home = "https://en.wikipedia.org/wiki/Main_Page"

In [4]:
# Use the 'get' method from requests library to get a response
response = requests.get(wiki_home)

In [65]:
# What is this 'response' object anyway
type(response)

requests.models.Response

### Exercise 2: Write a small function to check the status of web request

Small helper/utility functions like this one are incredibly useful in complex projects.

Build **a habit of writing small functions for small, modular tasks** instead of long scripts, which are hard to debug and trace.

In [66]:
def status_check(r):
    if r.status_code==200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

In [67]:
status_check(response)

Success!


1
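A possible variation, sketched under assumptions: returning a boolean instead of `1`/`-1` lets the result be used directly in an `if` statement. The name `is_success` and the `SimpleNamespace` stand-ins are hypothetical and only there so the sketch runs without a network call; any object with a `status_code` attribute, including a `requests` response, should behave the same way.

```python
from types import SimpleNamespace

def is_success(r):
    """Return True when the response carries a 2xx status code."""
    return 200 <= r.status_code < 300

# Hypothetical stand-ins for real responses (no network needed)
fake_ok = SimpleNamespace(status_code=200)
fake_missing = SimpleNamespace(status_code=404)

for r in (fake_ok, fake_missing):
    print(r.status_code, "Success!" if is_success(r) else "Failed!")
```

`requests` responses also expose an `ok` attribute, which is true when the status code is below 400, so it is slightly more permissive than the 2xx check above.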

### Exercise 3: Write a small function to check the encoding of the web page

In [68]:
def encoding_check(r):
    return (r.encoding)

In [69]:
encoding_check(response)

'UTF-8'

### Exercise 4: Write a small function to decode the contents of the `response`

In [70]:
def decode_content(r,encoding):
    return (r.content.decode(encoding))

In [71]:
contents = decode_content(response,encoding_check(response))
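One caveat worth sketching: `r.encoding` can be `None` when the server does not declare a character set, and `bytes.decode(None)` raises a `TypeError`. The helper below is a minimal sketch of a fallback; the name `decode_bytes`, the `fallback` parameter, and the choice of UTF-8 as the default are assumptions, not part of the original exercise.

```python
def decode_bytes(raw, encoding=None, fallback="utf-8"):
    """Decode raw bytes, falling back to a default when no encoding is known."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
    return raw.decode(encoding or fallback, errors="replace")

sample = "café".encode("utf-8")
print(decode_bytes(sample))           # no encoding given, fallback is used
print(decode_bytes(sample, "utf-8"))  # declared encoding is used as-is
```

With a real response this could be called as `decode_bytes(response.content, response.encoding)`.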

#### What is the type of the contents?

In [72]:
type(contents)

str

#### Fantastic! We finally have a string object. Notice how easy it was to read text from a popular web page like Wikipedia.

### Exercise 5: Check the length of the text you got back and print some selected portions

In [73]:
len(contents)

76942

In [74]:
print(contents[:1000])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":865422981,"wgRevisionId":865422981,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShor

In [75]:
print(contents[15000:16000])

y_I_Soter_Louvre_Ma849.jpg/160px-Ptolemy_I_Soter_Louvre_Ma849.jpg 2x" data-file-width="2700" data-file-height="4050" /></a><div class="thumbcaption" style="padding: 0.25em 0; word-wrap: break-word;">Bust of Ptolemy I in the <a href="/wiki/Louvre" title="Louvre">Louvre</a></div></div>
</div>
<ul><li>... that <b><a href="/wiki/Ptolemy_I_Soter" title="Ptolemy I Soter">Ptolemy I Soter</a></b> <i>(pictured)</i>, a companion of <a href="/wiki/Alexander_the_Great" title="Alexander the Great">Alexander the Great</a>, founded the <a href="/wiki/Ptolemaic_dynasty" title="Ptolemaic dynasty">Ptolemaic dynasty</a> to which <a href="/wiki/Cleopatra" title="Cleopatra">Cleopatra</a> belonged?</li>
<li>... that the novella <i><b><a href="/wiki/Aus_dem_Leben_eines_Taugenichts" title="Aus dem Leben eines Taugenichts">Aus dem Leben eines Taugenichts</a></b></i> was translated as <i>Memoirs of a Good-for-Nothing</i> several times, the first in 1866 by <a href="/wiki/Charles_Godfrey_Leland" title="Charles G

### Exercise 6: Use `BeautifulSoup` package to parse the raw HTML text more meaningfully and search for a particular text

In [76]:
from bs4 import BeautifulSoup

In [77]:
soup = BeautifulSoup(contents, 'html.parser')

#### What is this new `soup` object?

In [78]:
type(soup)

bs4.BeautifulSoup

### Exercise 7: Can we somehow read intelligible text from this `soup` object?

In [79]:
txt_dump=soup.text

In [80]:
type(txt_dump)

str

In [81]:
len(txt_dump)

15986

In [82]:
print(txt_dump[10000:11000])

 (11 lb) in mass, with short ears and tail. The rock hyrax is found across Africa and the Middle East, at elevations up to 4,200 metres (13,800 ft). It resides in habitats with rock crevices which it uses to escape from predators. Along with the other hyrax species and the manatee, these are the animals most closely related to the elephant.

Photograph: Charles J. Sharp

Recently featured: 
World War I
Soyuz TMA-14M
Embryonic stem cell


Archive
More featured pictures





Other areas of Wikipedia

Community portal – Bulletin board, projects, resources and activities covering a wide range of Wikipedia areas.
Help desk – Ask questions about using Wikipedia.
Local embassy – For Wikipedia-related communication in languages other than English.
Reference desk – Serving as virtual librarians, Wikipedia volunteers tackle your questions on a wide range of subjects.
Site news – Announcements, updates, articles and press releases on Wikipedia and the Wikimedia Foundation.
Village pump – For disc

### Exercise 8: Extract the text from the section *'From today's featured article'*

In [83]:
# First extract the start and end indices of the text we are interested in
idx1=txt_dump.find("From today's featured article")
idx2=txt_dump.find("Recently featured")

In [84]:
print(txt_dump[idx1+len("From today's featured article"):idx2])




Archie vs. Predator is a comic book series written by Alex de Campi (pictured) of Dark Horse Comics and drawn by Fernando Ruiz of Archie Comics. It features Predator, a deadly alien trophy hunter, who stalks the clean-cut teenager Archie Andrews and his high school classmates, until the survivors realize they are being hunted and fight back. A four-issue limited series was released in the US in 2015 between April and July, and a hardcover collection went on sale in November. Archie Comics proposed the idea to Dark Horse, which holds the license to comics featuring 20th Century Fox's Predator. The series received positive reviews from critics, who enjoyed the strange matchup and dark humor. The April issue was the top seller for both publishers, and garnered an average review rating of 7.9 out of 10 according to the review aggregator Comic Book Roundup. The series won a Ghastly Award for Best Limited Series. (Full article...)
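The slicing above relies on both markers being present. `str.find` returns `-1` when a marker is missing, and the slice then silently returns the wrong text. A minimal sketch of a guarded version follows; the helper name `text_between` and the sample string are hypothetical.

```python
def text_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)  # search only after the start marker
    if end == -1:
        return None
    return text[start:end].strip()

# Made-up sample text standing in for txt_dump
sample = "Intro From today's featured article\n\nSome article text.\n\nRecently featured: more"
print(text_between(sample, "From today's featured article", "Recently featured"))
```

On the real page this would be called as `text_between(txt_dump, "From today's featured article", "Recently featured")`.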





### Exercise 9: Try to extract the important historical events that happened on today's date...

In [85]:
idx3=txt_dump.find("On this day")

In [86]:
print(txt_dump[idx3+len("On this day"):idx3+len("On this day")+1000])



November 12



William Heffelfinger

1892 – William Heffelfinger (pictured) was paid $500 by the Allegheny Athletic Association, becoming the first professional American football player on record.
1912 – The bodies of Robert Falcon Scott and his companions were discovered, roughly eight months after their deaths during the ill-fated British Antarctic Expedition 1910.
1928 – At least 110 people died after the British ocean liner SS Vestris was abandoned as it sank in the western Atlantic Ocean.
1940 – World War II: Free French forces captured Gabon from Vichy France.
2011 – A blast in Iran's Shahid Modarres missile base led to the death of 17 members of the Revolutionary Guards, including Hassan Tehrani Moghaddam, a key figure in Iran's missile program.
Claude of France (b. 1547) · William Henry Barlow (d. 1902) · Naomi Wolf (b. 1962)


More anniversaries: 
November 11
November 12
November 13


Archive
By email
List of historical anniversaries




From today's featured list

Gary Able

### Exercise 10: Use a more advanced BS4 technique to extract the relevant text without guessing or hard-coding where to look

In [87]:
for d in soup.find_all('div'):
    if d.get('id') == 'mp-otd':
        for i in d.find_all('ul'):
            print(i.text)

1892 – William Heffelfinger (pictured) was paid $500 by the Allegheny Athletic Association, becoming the first professional American football player on record.
1912 – The bodies of Robert Falcon Scott and his companions were discovered, roughly eight months after their deaths during the ill-fated British Antarctic Expedition 1910.
1928 – At least 110 people died after the British ocean liner SS Vestris was abandoned as it sank in the western Atlantic Ocean.
1940 – World War II: Free French forces captured Gabon from Vichy France.
2011 – A blast in Iran's Shahid Modarres missile base led to the death of 17 members of the Revolutionary Guards, including Hassan Tehrani Moghaddam, a key figure in Iran's missile program.
November 11
November 12
November 13
Archive
By email
List of historical anniversaries
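The loop above visits every `div` and compares its `id`. As a sketch of a more direct lookup, `find()` accepts the `id` as a keyword argument and returns the first match or `None`. The HTML below is a hand-written stand-in that assumes the same `mp-otd` structure; the `mp-tfa` block and the event text are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page structure (assumption)
html = """
<div id="mp-tfa"><ul><li>Not wanted</li></ul></div>
<div id="mp-otd">
  <ul><li>1892 – First event</li><li>1912 – Second event</li></ul>
  <ul><li>Archive</li></ul>
</div>
"""
snippet = BeautifulSoup(html, "html.parser")

otd = snippet.find("div", id="mp-otd")                  # None if the div is absent
first_list = otd.find("ul") if otd is not None else None  # first <ul> inside it
events = []
if first_list is not None:
    events = [li.get_text(strip=True) for li in first_list.find_all("li")]
print(events)
```

Checking for `None` at each step keeps the code from raising an `AttributeError` if the page layout changes.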


In [109]:
text_list=[]
for d in soup.find_all('div'):
    if d.get('id') == 'mp-otd':
        for i in d.find_all('ul'):
            text_list.append(i.text)

In [110]:
len(text_list)

3

In [111]:
for i in text_list:
    print(i)
    print('-'*100)

1892 – William Heffelfinger (pictured) was paid $500 by the Allegheny Athletic Association, becoming the first professional American football player on record.
1912 – The bodies of Robert Falcon Scott and his companions were discovered, roughly eight months after their deaths during the ill-fated British Antarctic Expedition 1910.
1928 – At least 110 people died after the British ocean liner SS Vestris was abandoned as it sank in the western Atlantic Ocean.
1940 – World War II: Free French forces captured Gabon from Vichy France.
2011 – A blast in Iran's Shahid Modarres missile base led to the death of 17 members of the Revolutionary Guards, including Hassan Tehrani Moghaddam, a key figure in Iran's missile program.
----------------------------------------------------------------------------------------------------
November 11
November 12
November 13
----------------------------------------------------------------------------------------------------
Archive
By email
List of historical 

In [112]:
print(text_list[0])

1892 – William Heffelfinger (pictured) was paid $500 by the Allegheny Athletic Association, becoming the first professional American football player on record.
1912 – The bodies of Robert Falcon Scott and his companions were discovered, roughly eight months after their deaths during the ill-fated British Antarctic Expedition 1910.
1928 – At least 110 people died after the British ocean liner SS Vestris was abandoned as it sank in the western Atlantic Ocean.
1940 – World War II: Free French forces captured Gabon from Vichy France.
2011 – A blast in Iran's Shahid Modarres missile base led to the death of 17 members of the Revolutionary Guards, including Hassan Tehrani Moghaddam, a key figure in Iran's missile program.
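Since each line of `text_list[0]` follows a "year – description" pattern, the block can be split into structured records. The following is only a sketch: the function name `parse_events` is hypothetical, and it assumes the separator is an en dash surrounded by spaces, as in the output above.

```python
def parse_events(block):
    """Split a text block into (year, description) tuples, skipping other lines."""
    events = []
    for line in block.splitlines():
        year, sep, description = line.partition(" – ")
        if sep and year.strip().isdigit():
            events.append((int(year), description.strip()))
    return events

# Two lines copied from the output above, plus one line that should be skipped
sample = (
    "1928 – At least 110 people died after the British ocean liner SS Vestris "
    "was abandoned as it sank in the western Atlantic Ocean.\n"
    "1940 – World War II: Free French forces captured Gabon from Vichy France.\n"
    "More anniversaries:"
)
for year, description in parse_events(sample):
    print(year, description)
```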


### Functionalize this process, i.e., write a compact function to extract the "On this day" text from the Wikipedia Home Page

In [122]:
def wiki_on_this_day(url="https://en.wikipedia.org/wiki/Main_Page"):
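    """
    Extracts the text corresponding to the "On this day" section on the Wikipedia Home Page.
    Accepts the Wikipedia Home Page URL as a string; a default URL is provided.
    Relies on the decode_content() and encoding_check() helpers defined earlier.
    Returns -1 if the page could not be retrieved.
    """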
    import requests
    from bs4 import BeautifulSoup
    
    wiki_home = str(url)
    response = requests.get(wiki_home)
    
    def status_check(r):
        if r.status_code==200:
            return 1
        else:
            return -1
    
    status = status_check(response)
    if status==1:
        contents = decode_content(response,encoding_check(response))
    else:
        print("Sorry could not reach the web page!")
        return -1
    
    soup = BeautifulSoup(contents, 'html.parser')
    text_list=[]
    
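    for d in soup.find_all('div'):
        if d.get('id') == 'mp-otd':
            for i in d.find_all('ul'):
                text_list.append(i.text)

    # Guard against a changed page layout that leaves the list empty
    if not text_list:
        print("Sorry could not find the 'On this day' section!")
        return -1
    return text_list[0]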

In [123]:
print(wiki_on_this_day())

1892 – William Heffelfinger (pictured) was paid $500 by the Allegheny Athletic Association, becoming the first professional American football player on record.
1912 – The bodies of Robert Falcon Scott and his companions were discovered, roughly eight months after their deaths during the ill-fated British Antarctic Expedition 1910.
1928 – At least 110 people died after the British ocean liner SS Vestris was abandoned as it sank in the western Atlantic Ocean.
1940 – World War II: Free French forces captured Gabon from Vichy France.
2011 – A blast in Iran's Shahid Modarres missile base led to the death of 17 members of the Revolutionary Guards, including Hassan Tehrani Moghaddam, a key figure in Iran's missile program.


#### A wrong URL produces an error message as expected

In [124]:
print(wiki_on_this_day("https://en.wikipedia.org/wiki/Main_Page1"))

Sorry could not reach the web page!
-1
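The status check covers pages that respond with an error code, but connection failures, timeouts, and malformed URLs raise exceptions before any status code exists. A minimal sketch of handling both cases is shown below; the function name `fetch_html` and the 10-second timeout are assumptions, not part of the original function.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx statuses into exceptions
    except requests.exceptions.RequestException as err:
        print(f"Sorry, could not reach the web page: {err}")
        return None
    return response.text

# A URL without a scheme is rejected before any network traffic is sent
print(fetch_html("not-a-valid-url"))
```

Returning `None` rather than `-1` is a design choice in this sketch: it cannot be mistaken for valid text, and callers can test it with `if html is None`.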
