Goal:

Collect relevant information from the following pages:

* 1) https://en.wikipedia.org/wiki/17th_century 
* 2) https://en.wikipedia.org/wiki/Timeline_of_the_17th_century
* 3) https://en.wikipedia.org/wiki/1600s_(decade)
* 4) https://en.wikipedia.org/wiki/1600

I will start with a Wikidata-based approach:

- For each section, get all hyperlinks
- For each hyperlink:
    - Check if it is a person:
        - If so, get date of birth, date of death, place of birth and place of death
        - Store QID + information
        - Get the Wikipedia page
        - Get full text
    
    
Let us make some API calls:

In [1]:
import requests
from functions import functions
import pandas as pd

query = "https://en.wikipedia.org/w/api.php?action=parse&prop=sections&page=17th_century&format=json"
sections = requests.get(query)
sections_df = pd.json_normalize(sections.json()["parse"]["sections"])
sections_df.head()

Unnamed: 0,toclevel,level,line,number,index,fromtitle,byteoffset,anchor
0,1,2,Events,1.0,1,17th_century,7373,Events
1,2,3,1601–1650,1.1,2,17th_century,7423,1601–1650
2,2,3,1651–1700,1.2,3,17th_century,17327,1651–1700
3,1,2,Significant people,2.0,4,17th_century,23000,Significant_people
4,2,3,Musicians,2.1,5,17th_century,27283,Musicians


Let's get all links in the _Science_and_philosophy_ subsection of Significant people. Its section `index` is 9, which is the value the API's `section` parameter expects (not the `number` column).
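Hard-coding the index is brittle if the page gets edited. Below is a minimal sketch of looking the index up by anchor instead, assuming a list of section dicts shaped like `parse.sections` in the response above. The helper name `section_index` is hypothetical and not part of `functions`, and the sample rows are abridged for illustration.

```python
def section_index(sections, anchor):
    """Return the `index` of the first section whose `anchor` matches."""
    for section in sections:
        if section.get("anchor") == anchor:
            return section["index"]
    raise KeyError(f"no section with anchor {anchor!r}")


# A few rows shaped like the API response (abridged, for illustration only).
sample_sections = [
    {"index": "1", "anchor": "Events"},
    {"index": "4", "anchor": "Significant_people"},
    {"index": "9", "anchor": "Science_and_philosophy"},
]
print(section_index(sample_sections, "Science_and_philosophy"))
```

With the real data this would be called as `section_index(sections.json()["parse"]["sections"], "Science_and_philosophy")`.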


In [2]:
query = "https://en.wikipedia.org/w/api.php?action=parse&prop=links&page=17th_century&format=json&section=9"
links = requests.get(query)

links_df = pd.json_normalize(links.json()["parse"]["links"])

links_df.head()

Unnamed: 0,ns,exists,*
0,0,,Age of Enlightenment
1,0,,Antonie van Leeuwenhoek
2,0,,Athanasius Kircher
3,0,,Baruch Spinoza
4,0,,Blaise Pascal


Now that I have the linked pages, let's try to get information about them from Wikidata.


In [3]:
links_df = functions.add_ids_and_urls_to_dataframe(links_df)
links_df.head()

Unnamed: 0,ns,exists,*,page_url,page_id,wikidata_id
0,0,,Age of Enlightenment,https://en.wikipedia.org/wiki/Age of Enlighten...,30758.0,Q12539
1,0,,Antonie van Leeuwenhoek,https://en.wikipedia.org/wiki/Antonie van Leeu...,42001.0,Q43522
2,0,,Athanasius Kircher,https://en.wikipedia.org/wiki/Athanasius Kircher,93815.0,Q76738
3,0,,Baruch Spinoza,https://en.wikipedia.org/wiki/Baruch Spinoza,3408.0,Q35802
4,0,,Blaise Pascal,https://en.wikipedia.org/wiki/Blaise Pascal,4068.0,Q1290
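The body of `functions.add_ids_and_urls_to_dataframe` is not shown here. One way such a lookup could work is the MediaWiki `pageprops` query, which returns the `wikibase_item` (the Wikidata QID) for each title. The sketch below only builds the parameters and parses the response. The names `pageprops_params` and `parse_wikidata_ids` are my own assumptions, not the author's helpers.

```python
API_URL = "https://en.wikipedia.org/w/api.php"


def pageprops_params(titles):
    """Build query parameters asking for the Wikidata item of each title."""
    return {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": "|".join(titles),  # the API caps the number of titles per call
        "redirects": 1,
        "format": "json",
    }


def parse_wikidata_ids(response_json):
    """Map page title -> (page_id, wikidata_id) from a pageprops response."""
    result = {}
    pages = response_json.get("query", {}).get("pages", {})
    for page in pages.values():
        # Pages without a Wikidata item simply have no `pageprops` entry.
        qid = page.get("pageprops", {}).get("wikibase_item")
        result[page["title"]] = (page.get("pageid"), qid)
    return result


# Usage (needs network access):
# import requests
# data = requests.get(API_URL, params=pageprops_params(["Blaise Pascal"])).json()
# parse_wikidata_ids(data)
```

Keeping the request and the parsing separate makes the parsing step easy to check against a saved response.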


In [4]:
human_info = functions.return_data_about_humans(links_df["wikidata_id"])
human_info.head()

Unnamed: 0,wikidata_id,label,birth_date,birthplace,birthplace_coordinate,death_date
0,Q159592,John Napier,1550-01-01T00:00:00Z,Merchiston Tower,Point(-3.21391 55.9333),1617-04-04T00:00:00Z
1,Q46830,Robert Hooke,1635-07-18T00:00:00Z,Freshwater,Point(-1.524883333 50.682566666),1703-03-03T00:00:00Z
2,Q46830,Robert Hooke,1635-07-28T00:00:00Z,Freshwater,Point(-1.524883333 50.682566666),1703-03-03T00:00:00Z
3,Q93128,William Harvey,1578-04-11T00:00:00Z,Folkestone,Point(1.164722222 51.081388888),1657-06-03T00:00:00Z
4,Q191850,Tommaso Campanella,1568-09-14T00:00:00Z,Stilo,Point(16.466666666 38.483333333),1639-05-21T00:00:00Z
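How `functions.return_data_about_humans` works is not shown in this notebook. As a rough sketch of one possible approach, the Wikidata Query Service can filter a list of QIDs down to humans (`P31` = `Q5`) and return birth date (`P569`), death date (`P570`), birthplace (`P19`) and its coordinates (`P625`). The function names below are hypothetical. Note that a property with several values produces one result row per value.

```python
SPARQL_URL = "https://query.wikidata.org/sparql"


def humans_query(qids):
    """Build a SPARQL query keeping only the items that are humans (Q5)."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    return f"""
SELECT ?item ?itemLabel ?birth_date ?death_date ?birthplaceLabel ?coord WHERE {{
  VALUES ?item {{ {values} }}
  ?item wdt:P31 wd:Q5 .
  OPTIONAL {{ ?item wdt:P569 ?birth_date . }}
  OPTIONAL {{ ?item wdt:P570 ?death_date . }}
  OPTIONAL {{
    ?item wdt:P19 ?birthplace .
    OPTIONAL {{ ?birthplace wdt:P625 ?coord . }}
  }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
"""


def flatten_bindings(response_json):
    """Turn SPARQL JSON bindings into plain dicts, one per result row."""
    rows = []
    for binding in response_json["results"]["bindings"]:
        row = {name: cell["value"] for name, cell in binding.items()}
        # The item comes back as a full entity URI; keep only the QID.
        row["wikidata_id"] = row.pop("item").rsplit("/", 1)[-1]
        rows.append(row)
    return rows


# Usage (needs network access; the service asks for a descriptive User-Agent):
# import requests
# r = requests.get(SPARQL_URL,
#                  params={"query": humans_query(["Q1290"]), "format": "json"},
#                  headers={"User-Agent": "my-notebook/0.1 (contact address)"})
# flatten_bindings(r.json())
```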


A few of the rows are duplicated because some properties have multiple values. For example, some sources say that Robert Hooke was born on 1635-07-18, while others say 1635-07-28.
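If one row per person is needed later, the duplicates could be collapsed while keeping every conflicting value. This is only a sketch of one option (joining distinct values with a separator), not something the pipeline here does. `collapse_duplicates` is a hypothetical helper.

```python
import pandas as pd


def collapse_duplicates(df, key="wikidata_id"):
    """Keep one row per entity; distinct conflicting values are joined with ' | '."""

    def join_unique(values):
        # dict.fromkeys keeps first-seen order while dropping repeats.
        unique = list(dict.fromkeys(str(v) for v in values.dropna()))
        return " | ".join(unique)

    return df.groupby(key, sort=False).agg(join_unique).reset_index()


# Rows taken from the output above (columns abridged).
example = pd.DataFrame(
    [
        {"wikidata_id": "Q46830", "label": "Robert Hooke", "birth_date": "1635-07-18T00:00:00Z"},
        {"wikidata_id": "Q46830", "label": "Robert Hooke", "birth_date": "1635-07-28T00:00:00Z"},
        {"wikidata_id": "Q93128", "label": "William Harvey", "birth_date": "1578-04-11T00:00:00Z"},
    ]
)
print(collapse_duplicates(example))
```

Joining values keeps the disagreement visible. Picking a single value (for example with `drop_duplicates`) would be simpler but silently discards one of the dates.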

Anyway, I will add the English Wikipedia URLs and page IDs.



In [5]:
# Join on the shared Wikidata ID to attach the Wikipedia URL and page ID.
human_info = human_info.merge(links_df[["wikidata_id", "page_url", "page_id"]], on="wikidata_id")

human_info.head()

Unnamed: 0,wikidata_id,label,birth_date,birthplace,birthplace_coordinate,death_date,page_url,page_id
0,Q159592,John Napier,1550-01-01T00:00:00Z,Merchiston Tower,Point(-3.21391 55.9333),1617-04-04T00:00:00Z,https://en.wikipedia.org/wiki/John Napier,15993.0
1,Q159592,John Napier,1550-01-01T00:00:00Z,Merchiston Tower,Point(-3.213888888 55.933333333),1617-04-04T00:00:00Z,https://en.wikipedia.org/wiki/John Napier,15993.0
2,Q46830,Robert Hooke,1635-07-18T00:00:00Z,Freshwater,Point(-1.524883333 50.682566666),1703-03-03T00:00:00Z,https://en.wikipedia.org/wiki/Robert Hooke,49720.0
3,Q46830,Robert Hooke,1635-07-28T00:00:00Z,Freshwater,Point(-1.524883333 50.682566666),1703-03-03T00:00:00Z,https://en.wikipedia.org/wiki/Robert Hooke,49720.0
4,Q93128,William Harvey,1578-04-11T00:00:00Z,Folkestone,Point(1.164722222 51.081388888),1657-06-03T00:00:00Z,https://en.wikipedia.org/wiki/William Harvey,50203.0


Nice! The next step is:

- Run the pipeline and store the results for all pages and all sections

Let's start with the sections of the 17th_century page.


In [6]:
query = "https://en.wikipedia.org/w/api.php?action=parse&prop=sections&page=17th_century&format=json"
sections = requests.get(query)
sections_df = pd.json_normalize(sections.json()["parse"]["sections"])

In [7]:
page = "17th_century"
human_infos_by_section = {}

for i, row in sections_df.iterrows():
    print(i)
    n = row["index"]
    human_info = "failed"  # kept if every attempt raises
    # Try up to three times before giving up on this section.
    for _ in range(3):
        try:
            human_info = functions.get_human_info_for_section(page, n)
            break
        except Exception:
            pass

    human_infos_by_section[row["anchor"]] = human_info
    


0
1
2
3
4
5
6
7
8
9
10
11
12
13
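The retry logic in the loop above could also live in a small reusable helper. This is a sketch under my own assumptions: the name `call_with_retries`, the default of three attempts and the linear back-off are illustrative choices, not part of `functions`.

```python
import time


def call_with_retries(func, *args, attempts=3, delay=0.0, **kwargs):
    """Call func(*args, **kwargs), retrying on any Exception.

    Returns the result of the first successful call, or re-raises the
    last exception once `attempts` calls have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except Exception as error:  # deliberately broad, like the loop above
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay * (attempt + 1))  # wait a bit longer each time
    raise last_error


# Usage inside the loop would look like:
# try:
#     human_info = call_with_retries(functions.get_human_info_for_section, page, n, delay=1.0)
# except Exception:
#     human_info = "failed"
```

A short delay between attempts is worth considering when the failures come from rate limiting, since immediate retries tend to fail the same way.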


In [15]:
import os

path = "./" + page

try:
    os.mkdir(path)
except OSError:
    print("Creation of the directory %s failed" % path)
else:
    print("Successfully created the directory %s " % path)

for i in human_infos_by_section:
    print(i)
    title = path + "/" + i + ".csv"
    # Sections whose lookup failed hold the string "failed", not a DataFrame.
    if isinstance(human_infos_by_section[i], pd.DataFrame):
        human_infos_by_section[i].to_csv(title)

Successfully created the directory ./17th_century 
Events
1601–1650
1651–1700
Significant_people
Musicians
Visual_artists
Literature
Explorers
Science_and_philosophy
Inventions,_discoveries,_introductions
References
Further_reading
Focus_on_Europe
External_links
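The listing above prints every section name, whether or not a CSV was actually written for it. A variant that reports which sections were skipped could look like the sketch below. `save_sections` is a hypothetical helper. It assumes that any value with a `to_csv` method is a table worth saving and that everything else (such as the "failed" marker) should be skipped.

```python
from pathlib import Path


def save_sections(results, out_dir):
    """Write each table in `results` to <out_dir>/<name>.csv.

    Values without a `to_csv` method are skipped.
    Returns the list of skipped section names.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    skipped = []
    for name, table in results.items():
        if not hasattr(table, "to_csv"):
            skipped.append(name)
            continue
        safe_name = name.replace("/", "_")  # avoid accidental subdirectories
        table.to_csv(out_path / f"{safe_name}.csv")
    return skipped


# Usage:
# skipped = save_sections(human_infos_by_section, "./" + page)
# print("No data saved for:", skipped)
```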


Okay, good enough. The next step is to wrap all of this in a single function.