<a href="https://colab.research.google.com/github/peeush-the-developer/projects-on-data-science/blob/main/Web-Scraping/01-Historical-Events-100-years/WebScraping_Events.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

In [2]:
response = requests.get('https://emlii.com/78-events-across-100-years-that-completely-changed-the-world/', timeout=30)
response.raise_for_status()  # stop early if the page could not be fetched

In [3]:
soup_obj = BeautifulSoup(response.text, 'html.parser')

In [4]:
events = soup_obj.find_all(class_='article-inner-block')

In [5]:
len(events)

88

In [19]:
events[0]

<div class="article-inner-block">
<h2 class="article-subtitle">1. Queen Victoria’s Funeral (1901)</h2>
<div class="article-inner-block-image">
<p><img alt="" class="attachment-full size-full aligncenter" height="389" loading="lazy" sizes="(max-width: 640px) 100vw, 640px" src="https://emlii.com/wp-content/uploads/2020/04/1587470234-9470-5356a7d7bb38f.jpg" srcset="https://emlii.com/wp-content/uploads/2020/04/1587470234-9470-5356a7d7bb38f.jpg 640w, https://emlii.com/wp-content/uploads/2020/04/1587470234-9470-5356a7d7bb38f-300x182.jpg 300w" width="640"/></p>
<div class="horizontal-share-block">royal-portraits.blogspot.com</div>
</div>
<p> </p>
<p class="article-inner-description">Crowds line up to bid a final farewell to Queen Victoria. After 63 years on the throne, Victoria died at the age of 81 at Osborne House on The Isle of Wight. Her military state funeral was held on Saturday 2 February 1901 in St. George’s Chapel, Windsor Castle. She was the longest-reigning British monarch in histo

In [20]:
events[1]

<div class="article-inner-block">
<div class="article-inner-block-image">
<p><img alt="" class="attachment-full size-full aligncenter" height="450" loading="lazy" sizes="(max-width: 600px) 100vw, 600px" src="https://emlii.com/wp-content/uploads/2020/04/1587470231-2944-5356a7faa7b5d.jpg" srcset="https://emlii.com/wp-content/uploads/2020/04/1587470231-2944-5356a7faa7b5d.jpg 600w, https://emlii.com/wp-content/uploads/2020/04/1587470231-2944-5356a7faa7b5d-300x225.jpg 300w" width="600"/></p>
<div class="horizontal-share-block"></div>
</div>
<div class="article-source"></div>
<p class="article-inner-description">Queen Victoria’s funeral procession, Windsor, 1901</p>
</div>
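
The two blocks above differ: `events[0]` carries an `h2` heading with class `article-subtitle`, while `events[1]` has none. A minimal sketch of how one might count both kinds of block is shown below. It runs on a small inline HTML sample rather than the live page; the `sample_html` string and the `has_heading` helper are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mimicking the two block shapes shown above
sample_html = """
<div class="article-inner-block">
  <h2 class="article-subtitle">1. Queen Victoria’s Funeral (1901)</h2>
  <p class="article-inner-description">Crowds line up.</p>
</div>
<div class="article-inner-block">
  <p class="article-inner-description">Funeral procession, Windsor, 1901</p>
</div>
"""


def has_heading(block):
    """Return True if the block contains an h2.article-subtitle heading."""
    return block.find('h2', class_='article-subtitle') is not None


sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_events = sample_soup.find_all(class_='article-inner-block')

# Split the blocks by whether a heading is present
with_heading = [b for b in sample_events if has_heading(b)]
without_heading = [b for b in sample_events if not has_heading(b)]

print(len(with_heading), len(without_heading))
```

Applied to the real `events` list, the same split would show how many of the 88 blocks lack a heading.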

In [25]:
# Define function to extract "Title" and "Year" from an event
def get_title_year(event):
    '''
    Extract title and year from an event block.

    Arguments:
        event: A bs4 Tag for one 'article-inner-block' element from events

    Returns:
        title, year: Title and year of the event when a heading is present,
        otherwise None for both. year is also None when the sliced text is
        not numeric.
    '''
    title, year = None, None
    heading = event.find('h2', class_='article-subtitle')
    if heading:
        # Fixed-offset slicing assumes the heading looks like "N. Title (YYYY)":
        # a 3-character prefix ("1. ") and a 7-character suffix (" (1901)").
        title = heading.text[3:-7]
        year = heading.text[-5:-1]
        if not year.isdigit():
            year = None
    return title, year

In [36]:
# Define function to extract the image source URL from an event
def get_image_url(event):
    '''
    Extract the URL of the image source from an event block.

    Arguments:
        event: A bs4 Tag for one 'article-inner-block' element from events

    Returns:
        url: URL of the first image's src if available, otherwise None
    '''
    url = None
    image = event.find('img')
    if image:
        # .get() returns None instead of raising if the tag has no src
        url = image.get('src')
    return url

In [43]:
# Define function to extract "Description" from an event
def get_description(event):
    '''
    Extract the description from an event block.

    Arguments:
        event: A bs4 Tag for one 'article-inner-block' element from events

    Returns:
        desc: Description of the event if available, otherwise None
    '''
    desc = None
    desc_tag = event.find('p', class_='article-inner-description')
    if desc_tag:
        desc = desc_tag.text
    return desc

In [38]:
# Test our functions on the first 10 events
for event in events[:10]:
    title, year = get_title_year(event)
    url = get_image_url(event)
    desc = get_description(event)
    print(year, url, title)
    print(desc)

1901 https://emlii.com/wp-content/uploads/2020/04/1587470234-9470-5356a7d7bb38f.jpg Queen Victoria’s Funeral
Crowds line up to bid a final farewell to Queen Victoria. After 63 years on the throne, Victoria died at the age of 81 at Osborne House on The Isle of Wight. Her military state funeral was held on Saturday 2 February 1901 in St. George’s Chapel, Windsor Castle. She was the longest-reigning British monarch in history.
None https://emlii.com/wp-content/uploads/2020/04/1587470231-2944-5356a7faa7b5d.jpg None
Queen Victoria’s funeral procession, Windsor, 1901
1903 https://emlii.com/wp-content/uploads/2020/04/1587470238-1367-5356a98eb8337.jpg Wright Brother’s First Flight
On December 17 1903, news came through that two brothers had flown a curious air machine for more than a minute. This event marked a revolution in human transportation and was in fact one of human’s greatest achievements.
1913 https://emlii.com/wp-content/uploads/2020/04/1587470235-3021-5356aa65cfc3b.jpg Emily Daviso

In [39]:
# Loop through all events and put them in a list
events_list = []
for event in events:
    title, year = get_title_year(event)
    url = get_image_url(event)
    desc = get_description(event)
    events_list.append((title, year, desc, url))

In [41]:
df = pd.DataFrame(events_list, columns=[
                  'Title', 'Year', 'Description', 'Img_url'])
df.head(20)

|   | Title | Year | Description | Img_url |
|---|---|---|---|---|
| 0 | Queen Victoria’s Funeral | 1901.0 | Crowds line up to bid a final farewell to Quee... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 1 |  |  | Queen Victoria’s funeral procession, Windsor, ... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 2 | Wright Brother’s First Flight | 1903.0 | On December 17 1903, news came through that tw... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 3 | Emily Davison Throws Herself Under The Kings H... | 1913.0 | Suffragette Emily Davison’s Derby Day protest ... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 4 | Abdication of the Tsar Nik |  | On March 15, 1917 following the Feburary Revol... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 5 |  |  | The former tsar Nicholas II and his children s... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 6 | Irish Free State Treaty Signed | 1921.0 | In late 1921, the Irish Free State Treaty is s... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 7 | Suzanne Lenglen Breaks Wimbledon Record | 1925.0 | Suzanne Lenglen wins an unprecedented sixth si... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 8 | Start Of UK General Strike | 1926.0 | Start Of UK General Strike (1926). The General... | https://emlii.com/wp-content/uploads/2020/04/1... |
| 9 | Charles Lindbergh Flies the Atlantic Solo | 1927.0 | Charles Lindbergh achieves the world’s first n... | https://emlii.com/wp-content/uploads/2020/04/1... |


In [42]:
df.isna().sum()

Title          12
Year           14
Description     1
Img_url         0
dtype: int64
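
Because `get_title_year` returns `None` for both values when a block has no heading, these counts imply that 12 rows have neither a title nor a year and 2 more rows have a title but no year. To look at the two groups separately, one could filter the frame as sketched below. The `toy` frame and its values are hypothetical stand-ins for `df`, used so the example runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for df, with the same column names
toy = pd.DataFrame(
    [
        ('Queen Victoria’s Funeral', '1901', 'Crowds line up.', 'a.jpg'),
        (None, None, 'Funeral procession.', 'b.jpg'),
        ('Abdication of the Tsar Nik', None, 'On March 15, 1917.', 'c.jpg'),
    ],
    columns=['Title', 'Year', 'Description', 'Img_url'],
)

# Rows where no heading was found at all
no_title = toy[toy['Title'].isna()]

# Rows where a heading was found but the sliced year was not numeric
title_without_year = toy[toy['Title'].notna() & toy['Year'].isna()]

print(len(no_title), len(title_without_year))
```

Running the same two filters on `df` would list the exact rows to compare against the page source.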

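The results above show missing values: 12 in Title, 14 in Year, and 1 in Description. Rows such as index 1 come from blocks that have no `h2` heading, and row 4 has a truncated title with no year, so the fixed-offset slicing does not fit every heading. We need to revisit the web page to check whether the same structure is followed across all events.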
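
One possible way to make heading parsing less dependent on fixed character offsets is a regular expression. The sketch below assumes headings of the form "N. Title (YYYY)", where the number may have several digits and trailing whitespace may be present. That format is an assumption, not something verified against every heading on the page, and `parse_heading` is a hypothetical helper rather than part of the notebook.

```python
import re

# Assumed shape: optional "N. " prefix, a title, then an optional "(YYYY)" suffix
HEADING_RE = re.compile(
    r'^\s*(?:\d+\.\s*)?(?P<title>.*?)\s*(?:\((?P<year>\d{4})\))?\s*$'
)


def parse_heading(text):
    """Return (title, year) parsed from a heading string.

    year is None when the heading does not end with a four-digit year
    in parentheses; title is None when nothing is left after stripping.
    """
    match = HEADING_RE.match(text)
    if match is None:
        return None, None
    title = match.group('title') or None
    return title, match.group('year')


print(parse_heading('1. Queen Victoria’s Funeral (1901)'))
print(parse_heading('12. Some Event (1945) '))
print(parse_heading('Heading without a year'))
```

A heading that ends with something other than a single four-digit year, such as a range of years, would keep that suffix in the title and return `None` for the year, so such cases would still need a manual look.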