
In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

In [2]:
response = requests.get('https://emlii.com/78-events-across-100-years-that-completely-changed-the-world/', timeout=30)
response.raise_for_status()  # stop early if the page could not be fetched

In [3]:
soup_obj = BeautifulSoup(response.text, 'html.parser')

In [4]:
events = soup_obj.find_all(class_='article-inner-block')

In [5]:
len(events)

88

In [19]:
events[0]

<div class="article-inner-block">
<h2 class="article-subtitle">1. Queen Victoria’s Funeral (1901)</h2>
<div class="article-inner-block-image">
<p><img alt="" class="attachment-full size-full aligncenter" height="389" loading="lazy" sizes="(max-width: 640px) 100vw, 640px" src="https://emlii.com/wp-content/uploads/2020/04/1587470234-9470-5356a7d7bb38f.jpg" srcset="https://emlii.com/wp-content/uploads/2020/04/1587470234-9470-5356a7d7bb38f.jpg 640w, https://emlii.com/wp-content/uploads/2020/04/1587470234-9470-5356a7d7bb38f-300x182.jpg 300w" width="640"/></p>
<div class="horizontal-share-block">royal-portraits.blogspot.com</div>
</div>
<p> </p>
<p class="article-inner-description">Crowds line up to bid a final farewell to Queen Victoria. After 63 years on the throne, Victoria died at the age of 81 at Osborne House on The Isle of Wight. Her military state funeral was held on Saturday 2 February 1901 in St. George’s Chapel, Windsor Castle. She was the longest-reigning British monarch in histo

In [20]:
events[1]

<div class="article-inner-block">
<div class="article-inner-block-image">
<p><img alt="" class="attachment-full size-full aligncenter" height="450" loading="lazy" sizes="(max-width: 600px) 100vw, 600px" src="https://emlii.com/wp-content/uploads/2020/04/1587470231-2944-5356a7faa7b5d.jpg" srcset="https://emlii.com/wp-content/uploads/2020/04/1587470231-2944-5356a7faa7b5d.jpg 600w, https://emlii.com/wp-content/uploads/2020/04/1587470231-2944-5356a7faa7b5d-300x225.jpg 300w" width="600"/></p>
<div class="horizontal-share-block"></div>
</div>
<div class="article-source"></div>
<p class="article-inner-description">Queen Victoria’s funeral procession, Windsor, 1901</p>
</div>

In [25]:
# Define function to extract "Title", "Year" from event
def get_title_year(event):
  '''
  Extract title and year from the event instance

  Arguments:
    event: An instance of BeautifulSoup object read from events
  
  Returns:
    title, year: Title and year of the event when available, otherwise None for 
    both
  '''
  title, year = None, None
  heading = event.find('h2', class_='article-subtitle')
  if heading:
      title = heading.text[3:-7]
      year = heading.text[-5:-1]
      if not year.isdigit():
        year = None
  return title, year
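
The slices in `get_title_year` assume a single-digit ordinal such as `1. ` and a trailing `(YYYY)`. As a hedged alternative, the sketch below parses the heading text with a regular expression. `HEADING_RE` and `parse_heading` are hypothetical names that are not part of the notebook, and the pattern is only an assumption about how the headings are formatted.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Optional "N. " prefix, then the title, then an optional "(YYYY)" suffix.
HEADING_RE = re.compile(
    r'^\s*(?:\d+\.\s*)?(?P<title>.*?)\s*(?:\((?P<year>\d{4})\))?\s*\.?\s*$'
)


def parse_heading(text):
    """Return (title, year) from heading text; either value may be None."""
    match = HEADING_RE.match(text)
    if not match:
        return None, None
    title = match.group('title') or None
    return title, match.group('year')


print(parse_heading('1. Queen Victoria’s Funeral (1901)'))
print(parse_heading('12. King Edward VIII Abdicates (1936)'))
print(parse_heading('4. Abdication of the Tsar Nikolas II'))
```

Because the ordinal is matched with `\d+`, two-digit numbers such as `12.` do not leave stray characters in the title.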

In [36]:
# Define function to extract "Url" of image source from event
def get_image_url(event):
  '''
  Extract Url of image source from the event instance

  Arguments:
    event: An instance of BeautifulSoup object read from events
  
  Returns:
    url: Url of image source if available, otherwise None
  '''
  url = None
  image = event.find('img')
  if image:
      url = image['src']
  return url

In [95]:
# Define function to extract "Description" from event
def get_description(event):
  '''
  Extract description from the event instance

  Arguments:
    event: An instance of BeautifulSoup object read from events
  
  Returns:
    desc: description of the event if available, otherwise None
  '''
  desc = None
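  # An earlier attempt looked the tag up by class:
  #   desc_tag = event.find('p', class_='article-inner-description')
  # The version below takes the second-to-last child instead. In the block
  # shown for events[1], the last child is a newline string and the one
  # before it is the description <p>.
  desc = event.contents[-2].text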
  return desc

In [96]:
get_description(events[39])
#get_title_year(events[39])

'June 1958, Sweden v Brazil in the World Cup Final. Brazil were a goal down until a17 year old newcomer, Pele, equalised with a stunning goal. Brazil went on to win 5-2 and Pele became a sporting hero and is considered one of the greatest football players of all time.'
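
Indexing `event.contents[-2]` depends on each block ending with exactly one whitespace node. A more defensive variant, sketched below, tries the class-based lookup first and falls back to the last child that is a tag. `get_description_safe` and the sample HTML are hypothetical and only illustrate the idea; they have not been run against the live page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def get_description_safe(event):
    """Hypothetical variant of get_description with a fallback."""
    desc_tag = event.find('p', class_='article-inner-description')
    if desc_tag is None:
        # Fall back to the last child that is a tag, skipping text nodes.
        tag_children = [child for child in event.children if isinstance(child, Tag)]
        desc_tag = tag_children[-1] if tag_children else None
    return desc_tag.get_text(strip=True) if desc_tag else None


# Trimmed-down sample modelled on the events[1] block shown earlier.
sample_html = '''<div class="article-inner-block">
<div class="article-source"></div>
<p class="article-inner-description">Queen Victoria’s funeral procession, Windsor, 1901</p>
</div>'''
sample_event = BeautifulSoup(sample_html, 'html.parser').find(class_='article-inner-block')
print(get_description_safe(sample_event))
```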

In [97]:
# Test our functions for first 10 events
for event in events[:10]:
  title, year = get_title_year(event)
  url = get_image_url(event)
  desc = get_description(event)
  print(year, url, title)
  print(desc)

1901 https://emlii.com/wp-content/uploads/2020/04/1587470234-9470-5356a7d7bb38f.jpg Queen Victoria’s Funeral
Crowds line up to bid a final farewell to Queen Victoria. After 63 years on the throne, Victoria died at the age of 81 at Osborne House on The Isle of Wight. Her military state funeral was held on Saturday 2 February 1901 in St. George’s Chapel, Windsor Castle. She was the longest-reigning British monarch in history.
None https://emlii.com/wp-content/uploads/2020/04/1587470231-2944-5356a7faa7b5d.jpg None
Queen Victoria’s funeral procession, Windsor, 1901
1903 https://emlii.com/wp-content/uploads/2020/04/1587470238-1367-5356a98eb8337.jpg Wright Brother’s First Flight
On December 17 1903, news came through that two brothers had flown a curious air machine for more than a minute. This event marked a revolution in human transportation and was in fact one of human’s greatest achievements.
1913 https://emlii.com/wp-content/uploads/2020/04/1587470235-3021-5356aa65cfc3b.jpg Emily Daviso

In [98]:
# Loop through all events and put them in a list
events_list = []
for event in events:
    title, year = get_title_year(event)
    url = get_image_url(event)
    desc = get_description(event)
    events_list.append((title, year, desc, url))

In [99]:
df = pd.DataFrame(events_list, columns=[
                  'Title', 'Year', 'Description', 'Img_url'])
df.head(20)

|   | Title | Year | Description | Img_url |
|---|---|---|---|---|
| 0 | Queen Victoria’s Funeral | 1901.0 | Crowds line up to bid a final farewell to Quee... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 1 |  |  | Queen Victoria’s funeral procession, Windsor, ... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 2 | Wright Brother’s First Flight | 1903.0 | On December 17 1903, news came through that tw... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 3 | Emily Davison Throws Herself Under The Kings H... | 1913.0 | Suffragette Emily Davison’s Derby Day protest ... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 4 | Abdication of the Tsar Nikolas II | 1917.0 | On March 15, 1917 following the Feburary Revol... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 5 |  |  | The former tsar Nicholas II and his children s... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 6 | Irish Free State Treaty Signed | 1921.0 | In late 1921, the Irish Free State Treaty is s... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 7 | Suzanne Lenglen Breaks Wimbledon Record | 1925.0 | Suzanne Lenglen wins an unprecedented sixth si... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 8 | Start Of UK General Strike | 1926.0 | Start Of UK General Strike (1926). The General... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 9 | Charles Lindbergh Flies the Atlantic Solo | 1927.0 | Charles Lindbergh achieves the world’s first n... | https://emlii.com/wp-content/uploads/2020/04/1... |


In [100]:
df.isna().sum()

Title          12
Year           12
Description     0
Img_url         0
dtype: int64

The results above show missing values in `Title` and `Year`, so we need to revisit the web page and check whether every event follows the same structure.

## Data cleaning

Our first attempt extracted 88 blocks, although the article contains only 78 events.

We revisited the article, inspected the structure of the events again, and found the following discrepancies:
+ Some events span two `article-inner-block` divs, so they are counted as two events instead of one.
+ Some titles (class=`article-subtitle`) do not have the year at the end.

Let's address these issues one by one.



### Handle issue: title (class=`article-subtitle`) doesn't have the year at the end

In these events, the block contains two `h2` tags with different classes, and one of them includes the year.

So, if the first `h2` tag doesn't contain a year, we look at the other `h2` tag and assume that it ends with the year.

In [101]:
# Example:
# <h2 class="article-subtitle">4. Abdication of the Tsar Nikolas II</h2>
# ...
# <h2 class="article-source">Tsar Nicholas II in detention after his abdication in March 1917</h2>
# ...

# Define function to extract "Year" from the secondary heading of an event
def get_year(event, class_='article-source'):
  '''
  Extract year from the event instance

  Arguments:
    event: An instance of BeautifulSoup object read from events
    class_: class in Html. Default is 'article-source'.
  
  Returns:
    year: Year of the event when available, otherwise None
  '''
  year = None
  heading = event.find('h2', class_=class_)
  if heading:
      year = heading.text[-4:]
      if not year.isdigit():
        year = None
  return year


# Define function to extract "Title", "Year" from event
def get_title_year(event):
  '''
  Extract title and year from the event instance

  Arguments:
    event: An instance of BeautifulSoup object read from events
  
  Returns:
    title, year: Title and year of the event when available, otherwise None for 
    both
  '''
  title, year = None, None
  heading = event.find('h2', class_='article-subtitle')
  if heading:
    heading_text = heading.text.strip().strip('.')
    title = heading_text[3:-7]
    year = heading_text[-5:-1]
    if not year.isdigit():
      year = get_year(event)
      if year:
        title = heading_text[3:]
  return title, year


In [102]:
for event in events[:5]:
  title, year = get_title_year(event)
  print(year, title)

1901 Queen Victoria’s Funeral
None None
1903 Wright Brother’s First Flight
1913 Emily Davison Throws Herself Under The Kings Horse
1917 Abdication of the Tsar Nikolas II
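
`get_year` reads the last four characters of the secondary heading, so a trailing period or space would make it return `None`. A slightly more tolerant option is to search the text for a four-digit year, as in the sketch below. `YEAR_RE` and `find_year` are hypothetical names, and the 1900 to 2099 range is an assumption chosen to fit this article.

```python
import re

# Hypothetical helper; the 1900-2099 range is an assumption, not from the notebook.
YEAR_RE = re.compile(r'\b(?:19|20)\d{2}\b')


def find_year(text):
    """Return the last four-digit year found in text, or None."""
    years = YEAR_RE.findall(text)
    return years[-1] if years else None


print(find_year('Tsar Nicholas II in detention after his abdication in March 1917'))
print(find_year('Tsar Nicholas II in detention after his abdication in March 1917.'))
```

Taking the last match mirrors the original assumption that the year sits at the end of the heading.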


### Handle issue: an event spans two `article-inner-block` divs

This is why a few events show up as two entries instead of one.

In [103]:
# Loop through all events and put them in 2 lists such as:
#  - Put valid event in 1st list with a mapping or event index
#  - Put extension event in 2nd list with mapping to previous seen event
events_list = []
ext_events_list = []
for idx, event in enumerate(events):  # idx avoids shadowing the built-in id()
    title, year = get_title_year(event)
    url = get_image_url(event)
    desc = get_description(event)
    if title:
        events_list.append((idx, title, year, desc, url))
    else:
        # No title: treat this block as an extension of the block just before it
        ext_events_list.append((idx - 1, desc, url))

print('Actual events count:', len(events_list))
print('Extension events count:', len(ext_events_list))

Actual events count: 76
Extension events count: 12
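
Mapping an untitled block to the previous index works when every extension directly follows its titled event. If two untitled blocks ever appeared in a row, the second would point at the first extension rather than at an event. The sketch below shows one way to guard against that by remembering the index of the last titled block. `split_events` and its simplified `(title, desc)` input are hypothetical and stand in for the tuples built in the loop above.

```python
def split_events(rows):
    """Split (title, desc) rows into titled events and their extensions.

    Hypothetical helper: a row with a falsy title is attached to the most
    recent titled row instead of to the row immediately before it.
    """
    main, extras = [], []
    last_main_index = None
    for index, (title, desc) in enumerate(rows):
        if title:
            main.append((index, title, desc))
            last_main_index = index
        elif last_main_index is not None:
            extras.append((last_main_index, desc))
    return main, extras


# Illustrative rows, not scraped data.
rows = [('A', 'a'), (None, 'a2'), (None, 'a3'), ('B', 'b')]
main, extras = split_events(rows)
print(main)
print(extras)
```

With several extensions per event, the extras would need to be aggregated before a merge, since a left merge repeats the event row once per matching extension.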


In [104]:
df = pd.DataFrame(events_list, columns=['Index','Title','Year','Description','Url'])
ext_df = pd.DataFrame(ext_events_list, columns=['Index', 'Description (extra)', 'Url (extra)'])

In [105]:
df.head(20)

|   | Index | Title | Year | Description | Url |
|---|---|---|---|---|---|
| 0 | 0 | Queen Victoria’s Funeral | 1901 | Crowds line up to bid a final farewell to Quee... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 1 | 2 | Wright Brother’s First Flight | 1903 | On December 17 1903, news came through that tw... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 2 | 3 | Emily Davison Throws Herself Under The Kings H... | 1913 | Suffragette Emily Davison’s Derby Day protest ... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 3 | 4 | Abdication of the Tsar Nikolas II | 1917 | On March 15, 1917 following the Feburary Revol... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 4 | 6 | Irish Free State Treaty Signed | 1921 | In late 1921, the Irish Free State Treaty is s... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 5 | 7 | Suzanne Lenglen Breaks Wimbledon Record | 1925 | Suzanne Lenglen wins an unprecedented sixth si... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 6 | 8 | Start Of UK General Strike | 1926 | Start Of UK General Strike (1926). The General... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 7 | 9 | Charles Lindbergh Flies the Atlantic Solo | 1927 | Charles Lindbergh achieves the world’s first n... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 8 | 11 | Hitler Becomes German Chancellor | 1933 | In an attempt to form a stable coalition gover... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 9 | 12 | King Edward VIII Abdicates | 1936 | Edward VIII abdicates in order to marry the Am... | https://emlii.com/wp-content/uploads/2020/04/1... |


In [106]:
ext_df.head(10)

|   | Index | Description (extra) | Url (extra) |
|---|---|---|---|
| 0 | 0 | Queen Victoria’s funeral procession, Windsor, ... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 1 | 4 | The former tsar Nicholas II and his children s... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 2 | 9 | On September 27th, 1930 the American golfer Bo... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 3 | 14 | Austria opens its gates to German troops. | https://emlii.com/wp-content/uploads/2020/04/1... |
| 4 | 18 | London after the monstrous attacks that left s... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 5 | 29 | India hoists its flag for the first time on th... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 6 | 37 | The famous Manchester United Football team who... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 7 | 51 | First human footprint on the moon : Neil Armst... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 8 | 57 | A starving family in a famine-ravaged village ... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 9 | 60 | The crew members of the doomed space shuttle C... | https://emlii.com/wp-content/uploads/2020/04/1... |


In [107]:
# We can merge these 2 DataFrames based on 'Index' column
df_merged = pd.merge(left=df, right=ext_df, how='left', on='Index')
df_merged.drop('Index', axis=1, inplace=True)
df_merged.head(20)

|   | Title | Year | Description | Url | Description (extra) | Url (extra) |
|---|---|---|---|---|---|---|
| 0 | Queen Victoria’s Funeral | 1901 | Crowds line up to bid a final farewell to Quee... | https://emlii.com/wp-content/uploads/2020/04/1... | Queen Victoria’s funeral procession, Windsor, ... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 1 | Wright Brother’s First Flight | 1903 | On December 17 1903, news came through that tw... | https://emlii.com/wp-content/uploads/2020/04/1... |  |  |
| 2 | Emily Davison Throws Herself Under The Kings H... | 1913 | Suffragette Emily Davison’s Derby Day protest ... | https://emlii.com/wp-content/uploads/2020/04/1... |  |  |
| 3 | Abdication of the Tsar Nikolas II | 1917 | On March 15, 1917 following the Feburary Revol... | https://emlii.com/wp-content/uploads/2020/04/1... | The former tsar Nicholas II and his children s... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 4 | Irish Free State Treaty Signed | 1921 | In late 1921, the Irish Free State Treaty is s... | https://emlii.com/wp-content/uploads/2020/04/1... |  |  |
| 5 | Suzanne Lenglen Breaks Wimbledon Record | 1925 | Suzanne Lenglen wins an unprecedented sixth si... | https://emlii.com/wp-content/uploads/2020/04/1... |  |  |
| 6 | Start Of UK General Strike | 1926 | Start Of UK General Strike (1926). The General... | https://emlii.com/wp-content/uploads/2020/04/1... |  |  |
| 7 | Charles Lindbergh Flies the Atlantic Solo | 1927 | Charles Lindbergh achieves the world’s first n... | https://emlii.com/wp-content/uploads/2020/04/1... | On September 27th, 1930 the American golfer Bo... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 8 | Hitler Becomes German Chancellor | 1933 | In an attempt to form a stable coalition gover... | https://emlii.com/wp-content/uploads/2020/04/1... |  |  |
| 9 | King Edward VIII Abdicates | 1936 | Edward VIII abdicates in order to marry the Am... | https://emlii.com/wp-content/uploads/2020/04/1... |  |  |


In [108]:
df_merged.isna().sum()

Title                   0
Year                    0
Description             0
Url                     0
Description (extra)    64
Url (extra)            64
dtype: int64
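
The 64 missing values in the extra columns correspond to the events with no extension block (76 events minus 12 extensions). A left merge keeps every row of the left frame, but it would silently repeat a row if two extensions shared the same `Index`. The sketch below shows how the `validate` and `indicator` arguments of `pd.merge` can make that visible. The frames `main_df` and `extra_df` are small illustrative stand-ins, not the scraped data.

```python
import pandas as pd

# Small stand-in frames; the values are illustrative, not scraped data.
main_df = pd.DataFrame({'Index': [0, 2, 3], 'Title': ['A', 'B', 'C']})
extra_df = pd.DataFrame({'Index': [0], 'Description (extra)': ['more about A']})

merged = pd.merge(
    left=main_df,
    right=extra_df,
    how='left',
    on='Index',
    validate='one_to_one',  # raises pandas.errors.MergeError on duplicate keys
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Count how many events actually received an extension.
matched = int((merged['_merge'] == 'both').sum())
print(len(merged), matched)
```

If duplicates are expected, the extension rows could be grouped by `Index` first so that each event still maps to a single row.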

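The result above shows that `Title`, `Year`, `Description` and `Url` are now populated for every row. Note that this leaves 76 titled events, two fewer than the 78 in the article, and at least one merged block (the 1930 golf description attached to the 1927 Lindbergh row) looks like a separate event rather than an extension, so the untitled blocks deserve a manual check.

We can then save this merged DataFrame to a CSV file for further analysis, such as:
+ Number of events per year
+ An NLP model to classify whether an event was happy or sad
+ Extracting dates from the descriptions and looking for patterns (if any)
+ Using the image URLs to generate captions for the images

These are just a few ideas for how this dataset could be used.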
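
As a rough illustration of the first idea, the sketch below serializes a frame to CSV text and counts events per decade. It uses four title and year pairs copied from the tables above as a stand-in for `df_merged`. The names `sample`, `Decade` and `events_per_decade` are hypothetical.

```python
import pandas as pd

# Four rows copied from the output above, standing in for df_merged.
sample = pd.DataFrame({
    'Title': [
        'Queen Victoria’s Funeral',
        'Wright Brother’s First Flight',
        'Start Of UK General Strike',
        'Charles Lindbergh Flies the Atlantic Solo',
    ],
    'Year': ['1901', '1903', '1926', '1927'],
})

# With no path, to_csv returns the CSV as a string; pass a file name to write a file.
csv_text = sample.to_csv(index=False)

# The scraped years are strings, so convert before doing arithmetic.
sample['Year'] = sample['Year'].astype(int)
sample['Decade'] = (sample['Year'] // 10) * 10
events_per_decade = sample.groupby('Decade').size()
print(events_per_decade)
```

Grouping by decade rather than by year is a design choice for this sketch, since most years in the article appear only once.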