# Lesson 7: Advanced Web Scraping and Data Gathering
## Topic 1: Basics of web-scraping using `requests` and `BeautifulSoup` libraries

### Install `beautifulsoup4` and import the `requests` library

In [None]:
!pip install beautifulsoup4

In [None]:
import requests

### Exercise 1: Use `requests` to get a response from the Wikipedia home page

In [None]:
# First assign the URL of the Wikipedia home page to a string
wiki_home = "https://en.wikipedia.org/wiki/Main_Page"

In [None]:
# Use the 'get' method from requests library to get a response
response = requests.get(wiki_home)

In [None]:
# What is this 'response' object anyway
type(response)

In [None]:
# Iterating over a response yields the raw body in small byte chunks
for chunk in response:
    print(chunk)

### Exercise 2: Write a small function to check the status of a web request

Small helper/utility functions like this one are incredibly useful in complex projects.

Start building **a habit of writing small functions to accomplish small modular tasks**, instead of writing long scripts, which are hard to debug and track.

In [None]:
def status_check(r):
    """Print the outcome and return 1 for HTTP 200, otherwise -1."""
    if r.status_code == 200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

In [None]:
status_check(response)
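
A slightly more general variant of this helper might treat every 2xx status code as success and return a boolean. The sketch below is only an illustration: `is_success` is a hypothetical name, and `SimpleNamespace` stand-ins replace a real response so the cell runs without network access. `requests` responses also offer `response.ok` and `response.raise_for_status()` for similar checks.

```python
from types import SimpleNamespace

def is_success(r):
    """Return True if the status code of r is in the 2xx range."""
    return 200 <= r.status_code < 300

# Stand-in objects that only mimic the status_code attribute of a response
ok_response = SimpleNamespace(status_code=200)
missing_response = SimpleNamespace(status_code=404)

print(is_success(ok_response))       # True
print(is_success(missing_response))  # False
```

Returning a boolean lets the helper be used directly in an `if` statement, without comparing against 1 or -1.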

### Exercise 3: Write a small function to check the encoding of the web page

In [None]:
def encoding_check(r):
    return r.encoding

In [10]:
encoding_check(response)

'UTF-8'

### Exercise 4: Write a small function to decode the contents of the `response`

In [11]:
def decode_content(r, encoding):
    return r.content.decode(encoding)

In [12]:
contents = decode_content(response,encoding_check(response))
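
One caveat: `r.encoding` can be `None` when the server does not declare a character set, and `bytes.decode(None)` raises a `TypeError`. A minimal sketch of a more defensive decoder follows; `decode_bytes` and its `fallback` parameter are hypothetical names, and the sample bytes are made up for illustration. In practice, `response.text` from `requests` already returns the decoded body.

```python
def decode_bytes(raw, encoding=None, fallback="utf-8"):
    """Decode raw bytes, using the fallback when no encoding is known."""
    # errors="replace" substitutes undecodable bytes instead of raising
    return raw.decode(encoding or fallback, errors="replace")

sample = "Beneš".encode("utf-8")
print(decode_bytes(sample, "utf-8"))  # Beneš
print(decode_bytes(sample, None))     # Beneš
```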

#### What is the type of the contents?

In [13]:
type(contents)

str

#### Fantastic! Finally we got a string object. Did you see how easy it was to read text from a popular webpage like Wikipedia?

### Exercise 5: Check the length of the text you got back and print some selected portions

In [14]:
len(contents)

89254

In [15]:
print(contents[:1000])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"eda142fd-b4ac-499b-9443-e8d0e01d631c","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1069328725,"wgRevisionId":1069328725,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantArticleId":15580374,"wgIsProbablyEditable":fals

In [16]:
print(contents[15000:16000])

he Lord of the Rings">impression of depth</a> and adopting an <a href="/wiki/Decline_and_fall_in_Middle-earth#Fading" title="Decline and fall in Middle-earth">elegiac tone</a>. Tolkien admired the way that the poem, written by a Christian looking back at a <a href="/wiki/Paganism" title="Paganism">pagan</a> past, used symbolism without becoming <a href="/wiki/Allegory" title="Allegory">allegorical</a>. The names of races, including <a href="/wiki/Ent" title="Ent">ents</a>, <a href="/wiki/Orc" title="Orc">orcs</a>, and <a href="/wiki/Elf_(Middle-earth)" title="Elf (Middle-earth)">elves</a>, and placenames such as <a href="/wiki/Isengard" title="Isengard">Orthanc</a> and <a href="/wiki/Rohan_(Middle-earth)#Capital" title="Rohan (Middle-earth)">Meduseld</a>, derive from <i>Beowulf</i>. The <a href="/wiki/Rohan_(Middle-earth)" title="Rohan (Middle-earth)">Riders of Rohan</a> are distinctively Old English. The werebear <a href="/wiki/Beorn" title="Beorn">Beorn</a> in <i><a href="/wiki/The_H

### Exercise 6: Use the `BeautifulSoup` package to parse the raw HTML text more meaningfully and search for particular text

In [18]:
from bs4 import BeautifulSoup

In [19]:
soup = BeautifulSoup(contents, 'html.parser')

#### What is this new `soup` object?

In [20]:
type(soup)

bs4.BeautifulSoup

### Exercise 7: Can we somehow read intelligible text from this `soup` object?

In [21]:
txt_dump=soup.text

In [22]:
type(txt_dump)

str

In [23]:
len(txt_dump)

9160

In [24]:
print(txt_dump[9000:11000])

n, Inc., a non-profit organization.


Privacy policy
About Wikipedia
Disclaimers
Contact Wikipedia
Mobile view
Developers
Statistics
Cookie statement
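
The text dump contains long runs of empty lines, as the output here shows. One possible cleanup, assuming only non-empty lines matter, is to strip each line and drop the blank ones. `collapse_blank_lines` is a hypothetical helper and the sample string is made up. BeautifulSoup's `get_text()` also accepts `separator` and `strip` arguments that can reduce such whitespace at extraction time.

```python
def collapse_blank_lines(text):
    """Strip each line and drop the ones that are empty."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

sample = "Privacy policy\nAbout Wikipedia\n\n\n\nDisclaimers\n\n"
print(collapse_blank_lines(sample))
```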













### Exercise 8: Extract the text from the section *'From today's featured article'*

In [25]:
# First extract the start and end indices of the text we are interested in
idx1=txt_dump.find("From today's featured article")
idx2=txt_dump.find("Recently featured")

In [26]:
print(txt_dump[idx1+len("From today's featured article"):idx2])




J. R. R. Tolkien

J. R. R. Tolkien drew on Beowulf when creating the fictional world he called Middle-earth for The Lord of the Rings. Tolkien (pictured), a fantasy author, linguist, and philologist, took many elements from the Old English poem Beowulf, including names, monsters, and heroic-age customs and beliefs. He emulated its style, creating an impression of depth and adopting an elegiac tone. Tolkien admired the way that the poem, written by a Christian looking back at a pagan past, used symbolism without becoming allegorical. The names of races, including ents, orcs, and elves, and placenames such as Orthanc and Meduseld, derive from Beowulf. The Riders of Rohan are distinctively Old English. The werebear Beorn in The Hobbit has been likened to the hero Beowulf himself; both names mean "bear" and both characters have enormous strength. Scholars have compared some of Tolkien's monsters, including Gollum, the trolls, and the dragon Smaug, to those in the poem. (Full article...)
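
Note that `str.find` returns -1 when a marker is missing, in which case the slice above silently returns the wrong text. The steps could be wrapped in a small helper that makes this case explicit. This is a sketch under that assumption; `text_between` is a hypothetical name and the sample string is made up.

```python
def text_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    # Search for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end].strip()

sample = "Header From today's featured article Some text. Recently featured: X"
print(text_between(sample, "From today's featured article", "Recently featured"))
# Some text.
```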

### Exercise 9: Try to extract the important historical events that happened on today's date...

In [27]:
idx3=txt_dump.find("On this day")

In [28]:
print(txt_dump[idx3+len("On this day"):idx3+len("On this day")+1000])



February 25: Soviet Occupation Day in Georgia (1921); National Day in Kuwait (1961)



Edvard Beneš

628 – Khosrow II, the last great king of the Sasanian Empire, was overthrown by his son Kavad II.
1866 – Miners in Calaveras County, California, discovered a human skull that a prominent geologist claimed was proof (later disproved) that humans had existed during the Pliocene.
1948 – Fearful of civil war and Soviet intervention in recent unrest, President Edvard Beneš (pictured) ceded control of the government to the Communist Party of Czechoslovakia.
1956 – In a speech to the 20th Congress of the Communist Party, Soviet leader Nikita Khrushchev denounced the personality cult and dictatorship of his predecessor Joseph Stalin.
1992 – First Nagorno-Karabakh War: Armenian armed forces killed at least 161 ethnic Azerbaijani civilians in the Nagorno-Karabakh village of Khojaly.
Sharafkhan Bidlisi  (b. 1543)S. O. Davies  (d. 1972)Yi Han-yong  (d. 1997)

More anniversaries: 
February 24
Febr

### Exercise 10: Use a more advanced BS4 technique to extract the relevant text without guessing or hard-coding where to look

In [29]:
for d in soup.find_all('div'):
    if d.get('id') == 'mp-otd':
        for i in d.find_all('ul'):
            print(i.text)

628 – Khosrow II, the last great king of the Sasanian Empire, was overthrown by his son Kavad II.
1866 – Miners in Calaveras County, California, discovered a human skull that a prominent geologist claimed was proof (later disproved) that humans had existed during the Pliocene.
1948 – Fearful of civil war and Soviet intervention in recent unrest, President Edvard Beneš (pictured) ceded control of the government to the Communist Party of Czechoslovakia.
1956 – In a speech to the 20th Congress of the Communist Party, Soviet leader Nikita Khrushchev denounced the personality cult and dictatorship of his predecessor Joseph Stalin.
1992 – First Nagorno-Karabakh War: Armenian armed forces killed at least 161 ethnic Azerbaijani civilians in the Nagorno-Karabakh village of Khojaly.
Sharafkhan Bidlisi  (b. 1543)S. O. Davies  (d. 1972)Yi Han-yong  (d. 1997)
February 24
February 25
February 26
Archive
By email
List of days of the year
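
The loop above visits every `div` and compares its `id`. BeautifulSoup can also look up the element directly with `find`, which returns the first match or `None`. The sketch below runs on a small made-up HTML snippet that only imitates the structure of the home page, so names such as `sample_html` and the listed events are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Made-up HTML that only imitates the structure used above
sample_html = """
<div id="mp-tfa"><ul><li>Other item</li></ul></div>
<div id="mp-otd">
  <ul><li>628 - First event</li><li>1866 - Second event</li></ul>
  <ul><li>February 24</li></ul>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None if nothing matches
otd_div = sample_soup.find("div", id="mp-otd")
events = []
if otd_div is not None:
    first_list = otd_div.find("ul")
    if first_list is not None:
        # Collect the text of each list item separately
        events = [li.get_text(strip=True) for li in first_list.find_all("li")]

print(events)
```

Collecting the `li` items individually yields one string per event rather than a single block of text.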


In [30]:
text_list = []
for d in soup.find_all('div'):
    if d.get('id') == 'mp-otd':
        for i in d.find_all('ul'):
            text_list.append(i.text)

In [31]:
len(text_list)

4

In [32]:
for i in text_list:
    print(i)
    print('-'*100)

628 – Khosrow II, the last great king of the Sasanian Empire, was overthrown by his son Kavad II.
1866 – Miners in Calaveras County, California, discovered a human skull that a prominent geologist claimed was proof (later disproved) that humans had existed during the Pliocene.
1948 – Fearful of civil war and Soviet intervention in recent unrest, President Edvard Beneš (pictured) ceded control of the government to the Communist Party of Czechoslovakia.
1956 – In a speech to the 20th Congress of the Communist Party, Soviet leader Nikita Khrushchev denounced the personality cult and dictatorship of his predecessor Joseph Stalin.
1992 – First Nagorno-Karabakh War: Armenian armed forces killed at least 161 ethnic Azerbaijani civilians in the Nagorno-Karabakh village of Khojaly.
----------------------------------------------------------------------------------------------------
Sharafkhan Bidlisi  (b. 1543)S. O. Davies  (d. 1972)Yi Han-yong  (d. 1997)
----------------------------------------

In [None]:
print(text_list[0])

### Functionalize this process, i.e., write a compact function to extract the "On this day" text from the Wikipedia home page

In [33]:
def wiki_on_this_day(url="https://en.wikipedia.org/wiki/Main_Page"):
    """
    Extracts the text of the "On this day" section from the Wikipedia home page.
    Accepts the home page URL as a string; a default URL is provided.
    Relies on the decode_content and encoding_check helpers defined earlier.
    Returns -1 if the page could not be reached.
    """
    import requests
    from bs4 import BeautifulSoup
    
    wiki_home = str(url)
    response = requests.get(wiki_home)
    
    def status_check(r):
        if r.status_code == 200:
            return 1
        else:
            return -1
    
    status = status_check(response)
    if status == 1:
        contents = decode_content(response, encoding_check(response))
    else:
        print("Sorry could not reach the web page!")
        return -1
    
    soup = BeautifulSoup(contents, 'html.parser')
    text_list = []
    
    # Collect the text of every <ul> inside the div with id 'mp-otd'
    for d in soup.find_all('div'):
        if d.get('id') == 'mp-otd':
            for i in d.find_all('ul'):
                text_list.append(i.text)
    
    # The first list holds the historical events
    return text_list[0]

In [34]:
print(wiki_on_this_day())

628 – Khosrow II, the last great king of the Sasanian Empire, was overthrown by his son Kavad II.
1866 – Miners in Calaveras County, California, discovered a human skull that a prominent geologist claimed was proof (later disproved) that humans had existed during the Pliocene.
1948 – Fearful of civil war and Soviet intervention in recent unrest, President Edvard Beneš (pictured) ceded control of the government to the Communist Party of Czechoslovakia.
1956 – In a speech to the 20th Congress of the Communist Party, Soviet leader Nikita Khrushchev denounced the personality cult and dictatorship of his predecessor Joseph Stalin.
1992 – First Nagorno-Karabakh War: Armenian armed forces killed at least 161 ethnic Azerbaijani civilians in the Nagorno-Karabakh village of Khojaly.


#### A wrong URL produces an error message as expected

In [35]:
print(wiki_on_this_day("https://en.wikipedia.org/wiki/Main_Page1"))

Sorry could not reach the web page!
-1
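
Because the function returns either a string or -1, the caller ends up printing the sentinel value. One alternative, sketched here under assumptions, is to raise an exception on failure. `PageUnavailableError` and `require_success` are hypothetical names, and the 404 status code is assumed purely for illustration.

```python
class PageUnavailableError(Exception):
    """Raised when a page does not answer with HTTP 200."""

def require_success(status_code, url):
    """Return True for HTTP 200, otherwise raise PageUnavailableError."""
    if status_code != 200:
        raise PageUnavailableError(f"Could not reach {url} (status {status_code})")
    return True

try:
    # The status code 404 is an assumed value, not taken from a real request
    require_success(404, "https://en.wikipedia.org/wiki/Main_Page1")
except PageUnavailableError as err:
    print(err)
```

With this design the caller decides how to react to a failure, and a return value never needs to be checked against a magic number.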
