### Bishop Data Collection Automation Framework

This notebook provides a basic framework for scraping bishop data from Wikipedia pages that list the bishops of a specific diocese. It handles a few simple list formats, but it may not work well with tables or other layouts. I will extend it to cover these as I run it on more pages.

Note: The final collection function takes a 'path' argument that points to the ChromeDriver executable used by the selenium webdriver; this path is generally something like '/Users/*yourname*/Downloads/chromedriver'.

### *CURRENT ISSUES*

1. List items that link to other pages and include a date in the link text (for example, the link *Greenland in the Diplomatarium Norwegicum after 1364* under Weblinks on the page https://de.wikipedia.org/wiki/Liste_der_Bisch%C3%B6fe_von_Gr%C3%B6nland) get treated like bishops.
2. *link_collector* does not work for *Ísleifur Gissurarson* on the page https://de.wikipedia.org/wiki/Liste_der_Bisch%C3%B6fe_von_Gr%C3%B6nland.
3. The *googletrans Translator* raised an error that I have not investigated yet. For now that code is commented out, and *bio_collector* simply includes a blank "English Bio" field.
4. *bio_collector* only works when the hyperlinked URL ends in the exact name from the list. For example, *Peder Jansen Lodehat* in *Aarhus* works because the hyperlinked URL is https://de.wikipedia.org/wiki/Peder_Jansen_Lodehat, but *Eskil* in *Roskilde* does not work because the hyperlinked URL is https://de.wikipedia.org/wiki/Eskil_von_Lund instead of https://de.wikipedia.org/wiki/Eskil.
5. I have not accounted for the binary *Archbishop* column yet, and I am not sure how to do that. Maybe we can check Yada's code: https://github.com/pruksmhc/JobDioceScrape/blob/master/main.py
6. I have not reached this stage yet because I wanted to test on a number of list pages first, but eventually I will write a function that loops through a list of per-diocese URLs and merges all DataFrames for a given country.
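
Issue 2 may come from percent-encoding: *link_collector* builds each name from the raw *href*, so a slug such as `%C3%8Dsleifur_Gissurarson` would never match the list text *Ísleifur Gissurarson*. This is an untested guess at the cause, not a confirmed diagnosis. A minimal sketch of decoding the slug first, where `slug_to_name` is a hypothetical helper that is not part of the notebook:

```python
from urllib.parse import unquote

def slug_to_name(href):
    """Turn a Wikipedia href into a display name (hypothetical helper)."""
    # Keep only the last path segment, e.g. '%C3%8Dsleifur_Gissurarson'
    slug = href.split('/')[-1]
    # unquote decodes percent-escapes as UTF-8, e.g. '%C3%8D' becomes 'Í'
    return unquote(slug).replace('_', ' ')
```

Decoding alone would not resolve issue 4, since *Eskil von Lund* still differs from the list text *Eskil*; that case probably needs matching on the link text rather than the URL.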

In [2]:
# Package import cell
import re
import pandas as pd
import numpy as np
import selenium
from selenium import webdriver
from bs4 import BeautifulSoup
from requests import get
from googletrans import Translator
translator = Translator()

In [3]:
# Define function to collect data from well-defined list Wikis
def list_collector(path, url):
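    # executable_path is the Selenium 3 keyword; Selenium 4 takes a Service object instead
    driver = webdriver.Chrome(executable_path=path)
    try:
        driver.get(url)
        html = driver.page_source
    finally:
        # quit() ends the whole browser session; close() only closes the current window
        driver.quit()
    soup = BeautifulSoup(html, 'html.parser')
    # Every <li> on the page is a candidate entry; list_cleaner later keeps only dated items
    bishops = soup.find_all('li')
    return bishops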

In [4]:
# Define function to collect list text from list_collector
def list_to_text(bishop_list):
    bishop_list_text = [bishop.text for bishop in bishop_list]
    return bishop_list_text

In [22]:
# Define function to create merge-able dataframe of bishops with bio links
def link_collector(bishop_list):
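    secondary_urls = {}
    for bishop in bishop_list:
        # href=True skips anchors that have no href attribute
        for a in bishop.find_all('a', href=True):
            # Red links point to pages that do not exist yet
            if 'redlink' not in a['href']:
                # The name is taken from the URL slug, e.g. '/wiki/Peder_Jansen_Lodehat'
                name = ' '.join(a['href'].split('/')[-1].split('_'))
                url = 'https://de.wikipedia.org' + a['href']
                secondary_urls[name] = url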
    bishops_with_bios = pd.DataFrame(list(secondary_urls.items()), columns=['Name', 'Bio Link'])
    return bishops_with_bios

In [6]:
# Define function to convert bishop list text to clean list of lists
def list_cleaner(bishop_list_text):
    clean_bishops = []
    for bishop in bishop_list_text:
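        # Drop the German date qualifiers "ca.", "um", and "seit" only where they start a word,
        # so that a name ending in "um" is left intact
        bishop = re.sub(r'\b(?:ca\. |um |seit )', '', bishop)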
        bishop = bishop.replace(' –', '–')
        bishop = bishop.replace('– ', '–')
        bishop = bishop.replace('- ', '–')
        bishop = bishop.replace(' -', '–')
        bishop = bishop.replace(':', '')
        bishop = bishop.replace('vakant', 'Vacant')
        bishop = re.sub(r' ?\([^)]+\)', '', bishop)
        if bishop[0:3].isdigit():
            bishop_elements = bishop.split(' ')
            years = bishop_elements[0]
            if '–' in years:
                year_elements = years.split('–')
                year_in = year_elements[0]
                year_out = year_elements[1]
            else:
                year_in = years
                year_out = ''
            name = ' '.join(bishop_elements[1:])
            clean_bishops.append([name, year_in, year_out])
        elif bishop[-3:].isdigit() or bishop.endswith('??'):
            bishop_elements = bishop.split(' ')
            years = bishop_elements[-1]
            if '–' in years:
                year_elements = years.split('–')
                year_in = year_elements[0]
                year_out = year_elements[1]
            else:
                year_in = years
                year_out = ''
            name = ' '.join(bishop_elements[:-1])
            clean_bishops.append([name, year_in, year_out])
    return clean_bishops
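
The two branches of *list_cleaner* handle entries where the years come first and entries where they come last. The same idea can be sketched with regular expressions. This is an alternative illustration, not a drop-in replacement: `parse_entry` and the pattern names are hypothetical, and the sketch does not strip qualifiers such as "ca." or parenthetical notes.

```python
import re

# A three- or four-digit year, or '??' for an unknown year
YEAR = r'(?:\d{3,4}|\?\?)'
# A start year with an optional en dash and optional end year
SPAN = rf'({YEAR})(?:–({YEAR})?)?'
LEADING = re.compile(rf'^{SPAN}\s+(.+)$')
TRAILING = re.compile(rf'^(.+?)\s+{SPAN}$')

def parse_entry(text):
    """Return [name, year_in, year_out], or None if no year span is found."""
    # Collapse spaced dashes ('1060 - 1072', '1060 – 1072') to a bare en dash
    text = re.sub(r'\s+[–-]\s*|\s*[–-]\s+', '–', text.strip())
    match = LEADING.match(text)
    if match:
        year_in, year_out, name = match.groups()
    else:
        match = TRAILING.match(text)
        if not match:
            return None
        name, year_in, year_out = match.groups()
    # A missing end year is stored as an empty string, as in list_cleaner
    return [name, year_in, year_out or '']
```

Returning `None` for items without a year span mirrors how *list_cleaner* silently skips list items that neither start nor end with digits.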

In [7]:
# Define function to convert clean list of lists to clean dataframe
def dataframer(list_of_bishop_lists, country, diocese):
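    # Naming the columns up front also keeps them present when the list is empty
    data = pd.DataFrame(list_of_bishop_lists, columns=['Name', 'From', 'To'])
    data['Country'] = country
    data['Diocese'] = diocese
    # Drop entries whose start year begins with 17, 18, 19, or 2, or with a stray dash.
    # Note that the prefix test also removes three-digit years such as 175 or 250.
    data = data[(~data['From'].str.startswith('17')) & (~data['From'].str.startswith('18'))
                & (~data['From'].str.startswith('19')) & (~data['From'].str.startswith('2'))
                & (~data['From'].str.startswith('-'))]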
    return data
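
The prefix filter in *dataframer* appears intended to drop bishops who took office in 1700 or later; that cutoff is my reading of the code, not something stated in the notebook. Under that assumption, a numeric comparison avoids also dropping early years such as 175 or 250. `before_cutoff` is a hypothetical helper:

```python
def before_cutoff(year_text, cutoff=1700):
    """Return True if an entry with this start year should be kept."""
    text = year_text.strip()
    if text.startswith('-'):
        # A leading dash means the start year is missing, e.g. '-1250'
        return False
    if not text.isdigit():
        # Unknown or non-numeric years such as '??' are kept
        return True
    return int(text) < cutoff
```

It could be applied with something like `data[data['From'].map(before_cutoff)]`, though this changes which rows survive compared with the current filter.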

In [8]:
# Define function to merge bishop dataframe with link dataframe
def url_merger(bishop_dataframe, link_dataframe):
    merged_dataframe = pd.merge(bishop_dataframe, link_dataframe, on='Name', how='left')
    merged_dataframe = merged_dataframe.fillna('')
    return merged_dataframe

In [9]:
# Define function to process available links and collect biographies in dataframe
def bio_collector(dataframe, path):
    bio_list = []
    links = [link for link in dataframe['Bio Link'] if link != '']
    if links:
        # Reuse one browser session for every page instead of launching Chrome per link
        driver = webdriver.Chrome(executable_path=path)
        try:
            for link in links:
                driver.get(link)
                soup = BeautifulSoup(driver.page_source, 'html.parser')
                name = ' '.join(link.split('/')[-1].split('_'))
                paragraphs = soup.find_all('p')
                gbio = ' '.join(paragraph.text for paragraph in paragraphs).replace('\n', ' ')
                # Translation stays disabled until the googletrans error is resolved
#                 ebio = translator.translate(gbio, src='de', dest='en')
                ebio = ''
#                 bio_list.append([name, gbio, ebio.text])
                bio_list.append([name, gbio, ebio])
        finally:
            driver.quit()
    bio_dataframe = pd.DataFrame.from_records(bio_list)
    bio_dataframe = bio_dataframe.rename({0:'Name', 1:'German Bio', 2:'English Bio'}, axis='columns')
    return bio_dataframe

In [10]:
# Define function to create final clean dataframe
def dataframe_finalizer(bishop_dataframe, bio_dataframe):
    if not bio_dataframe.empty:
        final = pd.merge(bishop_dataframe, bio_dataframe, on='Name', how='left')
        final = final.fillna('')
    else:
        # Copy so that the caller's dataframe is not modified in place
        final = bishop_dataframe.copy()
        final['German Bio'] = ''
        final['English Bio'] = ''
        final = final.fillna('')
    return final

In [11]:
# Define function to automate the collection process using above functions
def collector(path, url, country, diocese):
    bishop_list = list_collector(path, url)
    clean_bishop_list = list_cleaner(list_to_text(bishop_list))
    clean_bishop_dataframe = dataframer(clean_bishop_list, country, diocese)
    link_dataframe = link_collector(bishop_list)
    merged_dataframe = url_merger(clean_bishop_dataframe, link_dataframe)
    bio_dataframe = bio_collector(merged_dataframe, path)
    output = dataframe_finalizer(merged_dataframe, bio_dataframe)
    return output

In [12]:
denmark_aarhus = collector('/Users/orion/Downloads/chromedriver', 
                           'https://de.wikipedia.org/wiki/Liste_der_Bisch%C3%B6fe_von_Aarhus', 
                           'Denmark', 
                           'Aarhus')

In [13]:
denmark_funen = collector('/Users/orion/Downloads/chromedriver', 
                          'https://de.wikipedia.org/wiki/Liste_der_Bisch%C3%B6fe_von_F%C3%BCnen', 
                          'Denmark', 
                          'Funen')

In [14]:
denmark_roskilde = collector('/Users/orion/Downloads/chromedriver', 
                             'https://de.wikipedia.org/wiki/Liste_der_Bisch%C3%B6fe_von_Roskilde', 
                             'Denmark', 
                             'Roskilde')

In [15]:
denmark_viborg = collector('/Users/orion/Downloads/chromedriver', 
                           'https://de.wikipedia.org/wiki/Bistum_Viborg', 
                           'Denmark', 
                           'Viborg')

In [16]:
denmark_aalborg = collector('/Users/orion/Downloads/chromedriver', 
                            'https://de.wikipedia.org/wiki/Bistum_Aalborg', 
                            'Denmark', 
                            'Aalborg')

In [24]:
denmark_gronland = collector('/Users/orion/Downloads/chromedriver', 
                             'https://de.wikipedia.org/wiki/Liste_der_Bisch%C3%B6fe_von_Gr%C3%B6nland', 
                             'Denmark', 
                             'Gronland')
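
The per-diocese calls above repeat the same pattern, which is what issue 6 plans to automate. A minimal sketch of that loop follows, assuming a mapping from diocese name to URL. `collect_country` and its `collect` parameter are hypothetical; `collect` stands in for a wrapper around *collector* so the sketch can run on its own.

```python
import pandas as pd

def collect_country(sources, country, collect):
    """Run collect(url, country, diocese) per source and stack the results."""
    frames = [collect(url, country, diocese) for diocese, url in sources.items()]
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead
        return pd.DataFrame()
    # ignore_index=True renumbers the rows consecutively across dioceses
    return pd.concat(frames, ignore_index=True)
```

With the notebook's functions this might be called as `collect_country(denmark_sources, 'Denmark', lambda url, c, d: collector(path, url, c, d))`, where `denmark_sources` and `path` are placeholders.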