# Extraction of Recent Deaths data from Wikipedia

**Table of Contents**
1. Motivation
1. Outline of Necessary Steps
1. Relevant Documentation

## Motivation

This work was envisioned to exercise a few different skills:
- requests module, to get the information from Wikipedia
- BeautifulSoup, part of the bs4 module, to process the HTML data
- re module, to employ regular expressions (regex) to extract the relevant data
- pandas module, to create dataframes of the data of interest 

## Outline of Necessary Steps

- Import dependencies, which include requests, bs4, re, pandas, numpy, matplotlib
- Define function to build the URL to send request
- Define function to send request
- Define function to parse the HTML into a BeautifulSoup object
- Define one or more functions to get the name, age and nationality of each line in the page
- Create dataframe using the data

## Relevant Documentation

- [requests](https://requests.readthedocs.io/en/latest/)
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [re](https://docs.python.org/3/library/re.html)
- [pandas](https://pandas.pydata.org/docs/reference/index.html#api)
- [matplotlib](https://matplotlib.org/stable/api/index)

## Imports

In [12]:
import requests
import re
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from collections import Counter

## Functions

### URL builder

The first function takes a month name as a capitalized string (for example, "August") and a year as an int, and returns the URL of the Wikipedia deaths page for that month and year.

In [13]:
def build_url(month, year):
    return "https://en.wikipedia.org/wiki/Deaths_in_{}_{}".format(month, year)

Tests for the URL builder:

In [14]:
print("Deaths for August, 2016 are found in {}".format(build_url("August", 2016)))
print("Deaths for May, 2007 are found in {}".format(build_url("May", 2007)))
print("Deaths for December, 2020 are found in {}".format(build_url("December", 2020)))

Deaths for August, 2016 are found in https://en.wikipedia.org/wiki/Deaths_in_August_2016
Deaths for May, 2007 are found in https://en.wikipedia.org/wiki/Deaths_in_May_2007
Deaths for December, 2020 are found in https://en.wikipedia.org/wiki/Deaths_in_December_2020
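Because `build_url` formats whatever it is given, a misspelled or lowercase month silently yields a URL for a page that does not exist. The following is a minimal sketch of a stricter variant; the name `build_url_checked` and the `MONTHS` tuple are hypothetical and not part of the original notebook.

```python
# Month names exactly as they appear in the Wikipedia page titles.
MONTHS = ("January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December")


def build_url_checked(month, year):
    """Hypothetical stricter variant of build_url that rejects malformed input."""
    if month not in MONTHS:
        raise ValueError(f"Unknown month name: {month!r}")
    if not isinstance(year, int):
        raise TypeError("year must be an int")
    return f"https://en.wikipedia.org/wiki/Deaths_in_{month}_{year}"


print(build_url_checked("August", 2016))
```

Failing early on a bad month name is one possible design choice; the original function stays simpler by leaving validation to the caller.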


In [15]:
url = build_url("August", 2009)

print("Deaths for August, 2009 are found in {}".format(url))

Deaths for August, 2009 are found in https://en.wikipedia.org/wiki/Deaths_in_August_2009


In [16]:
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")  # name the parser explicitly so bs4 does not have to guess
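The request above assumes the page exists and the server answers promptly. As a hedged sketch, the download step could be wrapped in a small helper that sets a timeout and raises on HTTP errors such as a 404 for a month that has no page. The helper name `fetch_html` and its `session` parameter are assumptions introduced for illustration.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Return the page body for url, raising on HTTP errors (hypothetical helper)."""
    # A caller may pass its own session; otherwise a fresh one is created.
    session = session or requests.Session()
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

The returned string can then be handed to `BeautifulSoup` in the same way as `r.text`.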

Test for the regex:

In [17]:
test_string = "Jerome Anderson, 55, American basketball player"
print(re.search(r"(.+), (\d{1,3}), (.+)", test_string))

<re.Match object; span=(0, 47), match='Jerome Anderson, 55, American basketball player'>
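The match object confirms that the pattern fits, but it does not show the three captured pieces. A minimal sketch with named groups makes them explicit. The names `ENTRY_PATTERN` and `parse_entry` are hypothetical, and the name group is made lazy here (`.+?`) so that it stops at the first `, <age>, ` separator, which differs slightly from the greedy pattern used in the notebook.

```python
import re

# Same structure as the notebook's pattern, with named groups and a lazy name group.
ENTRY_PATTERN = re.compile(r"(?P<name>.+?), (?P<age>\d{1,3}), (?P<info>.+)")


def parse_entry(text):
    """Split one list entry into (name, age, info), or return None if it does not fit."""
    match = ENTRY_PATTERN.match(text)
    if match is None:
        return None
    return match.group("name"), int(match.group("age")), match.group("info")


print(parse_entry("Jerome Anderson, 55, American basketball player"))
# ('Jerome Anderson', 55, 'American basketball player')
```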


In [18]:
for i in soup.find_all("li"):
    line = re.search(r"(.+), (\d{1,3}), (.+)", i.text)
    if line and (i.text.count("^") == 0):
        print(re.split(r", (\d{1,3}), ", i.text))

['Jerome Anderson', '55', 'American basketball player (Boston Celtics) and coach.[1]']
['Corazon Aquino', '76', 'Filipino politician, first female President (1986–1992), colon cancer.[2]']
['Devendra Nath Dwivedi', '74', 'Indian politician, Governor designate of Gujarat.[3]']
['Flavia Irwin', '92', ' British painter.[4]']
['Edward D. Ives', '83', 'American folklorist and professor.[5]']
['Keith Macklin', '78', 'British journalist and broadcaster.[6]']
['George Taylor Morris', '62', 'American radio personality, throat cancer.[7]']
["Nicholas D'Antonio Salza", '93', 'Honduran Bishop of Juticalpa (1963–1977).[8]']
['Naomi Sims', '61', 'American model and author, breast cancer.[9]']
['Rana Chandra Singh', '78', 'Pakistani politician.[10]']
['Howard Smit', '98', 'American film make-up artist (The Wizard of Oz).[11]']
['Panakkad Sayeed Mohammedali Shihab Thangal', '73', 'Indian politician, cardiac arrest.[12]']
['Borka Vučić', '83', 'Serbian politician and banker, traffic collision.[13]']
...

Based on these tests, we can build a function that takes a URL, makes the request, parses the HTML into a soup, and applies the regex. It returns a list of `[name, age, info]` lists containing the deaths data for that URL.

In [20]:
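def get_deaths(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")  # name the parser explicitly so bs4 does not have to guess
    list_of_deaths = []
    for i in soup.find_all("li"):
        line = re.search(r"(.+), (\d{1,3}), (.+)", i.text)
        # Skip reference-list items, whose text contains the "^" back-link
        if line and (i.text.count("^") == 0):
            list_of_deaths.append(re.split(r", (\d{1,3}), ", i.text))
    # Filter once, after the loop: keep only entries that split cleanly into [name, age, info]
    list_of_deaths = list(filter(lambda three_items: len(three_items) == 3, list_of_deaths))
    return list_of_deaths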
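As the test output above shows, the third element of each entry still ends with a citation marker such as `[1]`, and occasionally starts with a stray space. A small sketch of how these could be cleaned up before further analysis follows; `strip_citations` is a hypothetical helper, not part of the original notebook.

```python
import re

# Matches bracketed citation numbers such as [1] or [12].
CITATION_MARKER = re.compile(r"\[\d+\]")


def strip_citations(text):
    """Remove citation markers and surrounding whitespace from a description."""
    return CITATION_MARKER.sub("", text).strip()


print(strip_citations("American basketball player (Boston Celtics) and coach.[1]"))
# American basketball player (Boston Celtics) and coach.
```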

In [21]:
get_deaths(build_url("January", 2014))

[['Peter Austin', '92', 'British brewer (Ringwood Brewery).[1]'],
 ['Bobbi Jean Baker',
  '49',
  'American transgender activist and minister, traffic collision.[2]'],
 ['John Hanson Briscoe', '79', 'American politician.[3]'],
 ['George Deas Brown', '91', 'Australian politician.[4]'],
 ['Traian T. Coșovei', '59', 'Romanian poet.[5]'],
 ['Pierre Cullaz', '78', 'French jazz guitarist and cellist.[6]'],
 ['Herman Pieter de Boer', '85', 'Dutch writer, lyricist and journalist.[7]'],
 ['Pete DeCoursey',
  '52',
  'American political journalist, pancreatic and lung cancer.[8]'],
 ['Michael Glennon',
  '69',
  'Australian Roman Catholic priest and convicted child molester.[9]'],
 ['James H. Harless', '94', 'American industrialist and philanthropist.[10]'],
 ['Higashifushimi Kunihide', '103', 'Japanese Buddhist monk.[11]'],
 ['Milan Horvat', '94', 'Croatian conductor.[12]'],
 ['Jamal al-Jamal',
  '56',
  'Palestinian diplomat, Ambassador to the Czech Republic (since 2013), injuries sustained in

The next step is to convert the list into a pandas dataframe.

In [22]:
deaths_01_2021_list = get_deaths(build_url("January", 2021))

In [23]:
def list_to_dataframe(deaths_list, month, year):
    deaths_df = pd.DataFrame(deaths_list, columns = ["Name", "Age", "Misc. Info."])
    deaths_df['Name'] = deaths_df['Name'].astype("string")
    deaths_df['Age'] = pd.to_numeric(deaths_df['Age'])
    deaths_df['Misc. Info.'] = deaths_df['Misc. Info.'].astype("string")
    deaths_df['Month'] = month
    deaths_df['Year'] = year
    return deaths_df

In [24]:
df = list_to_dataframe(deaths_01_2021_list, "January", 2021)

In [25]:
print(df.dtypes)
df['Name'] = df['Name'].astype("string")
df['Age'] = pd.to_numeric(df['Age'])
df['Misc. Info.'] = df['Misc. Info.'].astype("string")
print(df.dtypes)

Name           string[python]
Age                     int64
Misc. Info.    string[python]
Month                  object
Year                    int64
dtype: object
Name           string[python]
Age                     int64
Misc. Info.    string[python]
Month                  object
Year                    int64
dtype: object


Now, let's run some simple analyses.

In [26]:
df.describe()

| | Age | Year |
|---|---|---|
| count | 1198.0 | 1198.0 |
| mean | 78.015025 | 2021.0 |
| std | 15.217326 | 0.0 |
| min | 8.0 | 2021.0 |
| 25% | 71.0 | 2021.0 |
| 50% | 81.0 | 2021.0 |
| 75% | 89.0 | 2021.0 |
| max | 106.0 | 2021.0 |


We will now define a class that performs a simple text analysis. Each instance holds the cleaned text as a list of lowercase words, a dictionary mapping each unique word to its count in the text, and a `Counter` with the same information, which is used to add up the counts from the different strings in a column.

In [27]:
class TextAnalyzer:
    def __init__(self, text):
        # Remove punctuation, brackets, dashes and digits before splitting into words
        for ch in [".", ",", "?", "!", "(", ")", "[", "]", "–", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "-"]:
            text = text.replace(ch, "")
        self.text = text.lower().split()
        # Counter tallies every word in a single pass
        self.word_counter = Counter(self.text)
        self.word_count_dict = dict(self.word_counter)

In [28]:
analyzer = TextAnalyzer("This is a is")

In [29]:
print(analyzer.text)
print(analyzer.word_count_dict)
print(analyzer.word_counter)

['this', 'is', 'a', 'is']
{'this': 1, 'is': 2, 'a': 1}
Counter({'is': 2, 'this': 1, 'a': 1})
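An alternative to removing characters one by one is to extract the words directly with a regex. The sketch below is an assumption-laden variant, not the notebook's method: `count_words` is a hypothetical name, and it behaves slightly differently from `TextAnalyzer` because hyphenated or apostrophized words are split into parts (for example, "make-up" becomes "make" and "up").

```python
import re
from collections import Counter


def count_words(text):
    """Count lowercase words made of letters only (hypothetical helper)."""
    # [^\W\d_]+ matches runs of Unicode letters, skipping digits, punctuation and underscores.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)


print(count_words("This is a is"))
# Counter({'is': 2, 'this': 1, 'a': 1})
```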


In [30]:
def word_count_analysis(df, column_label):
    total_counter = Counter()
    for _, i in df.iterrows():
        analyzer = TextAnalyzer(i[column_label])
        total_counter = total_counter + analyzer.word_counter
    return total_counter.most_common()
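Raw counts over descriptive text tend to be dominated by function words such as "and", "of" and "the". A minimal sketch of how they might be filtered out while accumulating the totals is shown here. The `STOP_WORDS` set is a small illustrative assumption rather than a complete list, `count_words_across` is a hypothetical name, and the texts are assumed to have had punctuation removed already, as `TextAnalyzer` does.

```python
from collections import Counter

# Illustrative only; a real stop-word list would be considerably longer.
STOP_WORDS = {"and", "of", "the", "a", "from", "since"}


def count_words_across(texts, stop_words=STOP_WORDS):
    """Accumulate word counts over several texts, skipping stop words."""
    total = Counter()
    for text in texts:
        # update() adds to the running totals in place instead of building a new Counter.
        total.update(word for word in text.lower().split() if word not in stop_words)
    return total.most_common()


print(count_words_across(["American politician", "American actor and writer"]))
```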

In [31]:
word_count_analysis(df, "Misc. Info.")

[('and', 434),
 ('of', 374),
 ('american', 336),
 ('the', 231),
 ('covid', 215),
 ('politician', 191),
 ('player', 102),
 ('british', 82),
 ('olympic', 69),
 ('actor', 61),
 ('complications', 61),
 ('from', 61),
 ('cancer', 61),
 ('since', 58),
 ('footballer', 56),
 ('national', 56),
 ('french', 55),
 ('football', 53),
 ('team', 52),
 ('member', 50),
 ('english', 47),
 ('indian', 41),
 ('writer', 40),
 ('mp', 39),
 ('heart', 37),
 ('actress', 34),
 ('canadian', 31),
 ('deputy', 31),
 ('singer', 29),
 ('artist', 29),
 ('spanish', 29),
 ('minister', 29),
 ('south', 29),
 ('catholic', 29),
 ('prelate', 29),
 ('roman', 28),
 ('bishop', 28),
 ('director', 27),
 ('new', 27),
 ('state', 26),
 ('manager', 25),
 ('indonesian', 25),
 ('house', 24),
 ('italian', 24),
 ('journalist', 24),
 ('german', 24),
 ('irish', 24),
 ('russian', 24),
 ('poet', 24),
 ('officer', 24),
 ('historian', 24),
 ('coach', 23),
 ('australian', 22),
 ('a', 21),
 ('disease', 20),
 ('musician', 20),
 ('military', 20),
 ('

## Case Study

Now that we have basic tools, we can perform this analysis for a wider range of time. Let's employ these tools to get some data over a few different time periods.

In [37]:
def get_deaths_over_time_range(start_year, end_year):
    # Note: end_year is exclusive, because range(start_year, end_year) stops at end_year - 1

    if start_year < 1989 or start_year >= end_year:
        raise ValueError("Input data is invalid. Check that start_year is at least 1989 and that end_year is larger than start_year")
    
    monthly_dfs = []
    
    for month in ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]:
        for year in range(start_year, end_year):
            url = build_url(month, year)
            list_of_deaths = get_deaths(url)
            monthly_dfs.append(list_to_dataframe(list_of_deaths, month, year))
    
    # Concatenate once at the end rather than on every iteration
    final_df = pd.concat(monthly_dfs, ignore_index=True)
    
    final_df.to_csv("{}_-_{}.csv".format(str(start_year), str(end_year)))

    word_count = word_count_analysis(final_df, "Misc. Info.")
    word_df = pd.DataFrame(word_count, columns=["Word", "Count"])
    word_df.to_csv("{}_-_{}_-_Word_Count.csv".format(str(start_year), str(end_year)))
    
    return final_df, word_df
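The function above visits every January in the range first, then every February, and so on, so the combined dataframe is grouped by month rather than ordered by date. If chronological order is preferred, the loops can be swapped. A minimal sketch, using a hypothetical generator `iter_month_pages` and keeping `end_year` exclusive as in the function above:

```python
MONTHS = ("January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December")


def iter_month_pages(start_year, end_year):
    """Yield (month, year) pairs in chronological order; end_year is excluded."""
    for year in range(start_year, end_year):
        for month in MONTHS:
            yield month, year


pages = list(iter_month_pages(2020, 2022))
print(pages[0], pages[-1], len(pages))
# ('January', 2020) ('December', 2021) 24
```

Each yielded pair can be passed to `build_url` in place of the nested loops.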

In [39]:
get_deaths_over_time_range(1989, 2000)
get_deaths_over_time_range(2000, 2010)
get_deaths_over_time_range(2010, 2019)
get_deaths_over_time_range(2020, 2023)

(                                       Name  Age  \
 0                               János Aczél   95   
 1                              Lexii Alijai   21   
 2                              Chris Barker   39   
 3                               Joan Benson   94   
 4      Aleksandr Aleksandrovich Blagonravov   86   
 ...                                     ...  ...   
 31425                           Wang Fosong   89   
 31426                          Darren Watts   53   
 31427                          Karla Wilson   88   
 31428                            Cary Young   83   
 31429                             Yu Dequan   90   
 
                                              Misc. Info.     Month  Year  
 0                   Hungarian-Canadian mathematician.[1]   January  2020  
 1         American rapper, drug and alcohol overdose.[2]   January  2020  
 2      English footballer (Barnsley, Cardiff City, So...   January  2020  
 3                           American keyboard player.[4] 