Goal:

Collect relevant information from the following pages:

* 1) https://en.wikipedia.org/wiki/17th_century 
* 2) https://en.wikipedia.org/wiki/Timeline_of_the_17th_century
* 3) https://en.wikipedia.org/wiki/1600s_(decade)
* 4) https://en.wikipedia.org/wiki/1600

I will start with a Wikidata-based approach:

- For each section, get all hyperlinks
- For each hyperlink:
    - Check if it is a person:
        - If so, get date of birth, date of death, place of birth and place of death
        - Store QID + information
        - Get the Wikipedia page
        - Get its full text
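
The person-related steps of this plan could be covered by a single SPARQL query against Wikidata. The following is only a minimal sketch: the function name `build_person_query` is hypothetical and not part of the notebook. It relies on the Wikidata properties P31 (instance of), P569 (date of birth), P570 (date of death), P19 (place of birth) and P20 (place of death), and on Q5 (human).

```python
def build_person_query(qids):
    """Return a SPARQL query for birth/death data of the given Wikidata QIDs.

    Hypothetical helper: it only builds the query text, it does not send it.
    """
    values = " ".join(f"wd:{qid}" for qid in qids)
    return f"""
SELECT ?item ?itemLabel ?birthDate ?deathDate ?birthPlaceLabel ?deathPlaceLabel WHERE {{
  VALUES ?item {{ {values} }}
  ?item wdt:P31 wd:Q5 .                          # keep humans only
  OPTIONAL {{ ?item wdt:P569 ?birthDate . }}      # date of birth
  OPTIONAL {{ ?item wdt:P570 ?deathDate . }}      # date of death
  OPTIONAL {{ ?item wdt:P19 ?birthPlace . }}      # place of birth
  OPTIONAL {{ ?item wdt:P20 ?deathPlace . }}      # place of death
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
"""


# Q935 and Q307 are used purely as example input
query = build_person_query(["Q935", "Q307"])
print(query)
```

The resulting string would typically be sent to the Wikidata Query Service (for instance with `requests.get("https://query.wikidata.org/sparql", params={"query": query, "format": "json"})`); that step is left out here so the sketch runs without network access.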
    
    
Let us make some API calls:

In [1]:
import requests


query = "https://en.wikipedia.org/w/api.php?action=parse&prop=sections&page=17th_century&format=json"
sections = requests.get(query)
sections  # display the response object to check the status code


<Response [200]>

In [5]:
import pandas as pd
sections_df = pd.json_normalize(sections.json()["parse"]["sections"])
sections_df

,toclevel,level,line,number,index,fromtitle,byteoffset,anchor
0,1,2,Events,1.0,1,17th_century,7373,Events
1,2,3,1601–1650,1.1,2,17th_century,7423,1601–1650
2,2,3,1651–1700,1.2,3,17th_century,17327,1651–1700
3,1,2,Significant people,2.0,4,17th_century,23000,Significant_people
4,2,3,Musicians,2.1,5,17th_century,27283,Musicians
5,2,3,Visual artists,2.2,6,17th_century,28208,Visual_artists
6,2,3,Literature,2.3,7,17th_century,29597,Literature
7,2,3,Explorers,2.4,8,17th_century,31260,Explorers
8,2,3,Science and philosophy,2.5,9,17th_century,31859,Science_and_philosophy
9,1,2,"Inventions, discoveries, introductions",3.0,10,17th_century,34207,"Inventions,_discoveries,_introductions"


Let's get all links for the significant people in the _Science_and_philosophy_ section. Its `index` in the table above is 9 (its `number` is 2.5), and the API's `section` parameter expects the index.
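
Rather than hard-coding the section, its index could be looked up by anchor. The sketch below assumes the response has the same `parse` / `sections` layout used above; `section_index` is a hypothetical helper, and the sample dictionary is a trimmed stand-in for `sections.json()` (the index values are written as strings here, which is an assumption about the raw JSON).

```python
def section_index(parse_json, anchor):
    """Return the `index` of the section whose `anchor` matches, or None."""
    for section in parse_json["parse"]["sections"]:
        if section["anchor"] == anchor:
            return section["index"]
    return None


# Stand-in for sections.json(), reduced to two rows of the table above
sample = {"parse": {"sections": [
    {"line": "Explorers", "index": "8", "anchor": "Explorers"},
    {"line": "Science and philosophy", "index": "9",
     "anchor": "Science_and_philosophy"},
]}}

idx = section_index(sample, "Science_and_philosophy")
print(idx)
```

The returned value can then be passed as the `section` parameter of the `action=parse` request.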


In [13]:
query = "https://en.wikipedia.org/w/api.php?action=parse&prop=links&page=17th_century&format=json&section=9"
links = requests.get(query)

links_df = pd.json_normalize(links.json()["parse"]["links"])

links_df.head()

,ns,exists,*
0,0,,Age of Enlightenment
1,0,,Antonie van Leeuwenhoek
2,0,,Athanasius Kircher
3,0,,Baruch Spinoza
4,0,,Blaise Pascal


In [58]:

# Note: the API accepts at most 50 titles per request for regular users
titles = "|".join(links_df["*"])
params = {"action": "query", "prop": "pageprops", "ppprop": "wikibase_item",
          "redirects": 1, "titles": titles, "format": "json"}
# Passing params lets requests URL-encode the titles
wdi = requests.get("https://en.wikipedia.org/w/api.php", params=params)
pages = wdi.json()["query"]["pages"]  # parse the response only once

urls_dict = {}
urls_id = {}
wikidata_ids = {}

for link in pages.values():
    hyperlink_title = link["title"]
    hyperlink_url = "https://en.wikipedia.org/wiki/" + hyperlink_title
    # Missing pages have no "pageid"; pages without a Wikidata item have no "pageprops"
    hyperlink_page_id = link.get("pageid")
    wikidata_id = link.get("pageprops", {}).get("wikibase_item")

    urls_dict[hyperlink_title] = hyperlink_url
    urls_id[hyperlink_title] = hyperlink_page_id
    wikidata_ids[hyperlink_title] = wikidata_id
    


In [59]:
link

{'pageid': 171133,
 'ns': 0,
 'title': 'John Amos Comenius',
 'pageprops': {'wikibase_item': 'Q12735'}}

In [61]:
links_df["page_url"] = links_df["*"].map(urls_dict)
links_df["page_id"] = links_df["*"].map(urls_id)
links_df["wikidata_id"] = links_df["*"].map(wikidata_ids)
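
Because the request uses `redirects=1`, the API may return a page title that differs from the link text, in which case the `.map` lookups above find no match and leave a missing value. The following sketch shows one way to bridge that gap using the `normalized` and `redirects` lists that the query response can contain (each a list of `from`/`to` pairs). `resolve_titles` is a hypothetical helper and the sample titles are invented for illustration.

```python
def resolve_titles(query_json):
    """Map each requested title to the title the API finally reported."""
    query = query_json["query"]
    mapping = {}
    # Normalization is applied first, redirects second
    for step in ("normalized", "redirects"):
        hops = {entry["from"]: entry["to"] for entry in query.get(step, [])}
        # Follow already-known titles one hop further if needed
        for original, current in list(mapping.items()):
            mapping[original] = hops.get(current, current)
        for source, target in hops.items():
            mapping.setdefault(source, target)
    return mapping


# Invented sample response, standing in for wdi.json()
sample = {"query": {
    "normalized": [{"from": "old title", "to": "Old title"}],
    "redirects": [{"from": "Old title", "to": "New title"}],
    "pages": {},
}}

mapping = resolve_titles(sample)
print(mapping)
```

With such a mapping, the link titles could first be translated, for example `links_df["*"].map(lambda t: mapping.get(t, t))`, before looking them up in `urls_dict`, `urls_id` and `wikidata_ids`.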

In [68]:
from functions import functions
functions.return_instances_of_specific_qid(links_df["wikidata_id"], "Q5")

,QID,title
0,Q307,Galileo Galilei
1,Q935,Isaac Newton
2,Q1290,Blaise Pascal
3,Q8963,Johannes Kepler
4,Q9191,René Descartes
5,Q9353,John Locke
6,Q35802,Benedictus de Spinoza
7,Q39599,Christiaan Huygens
8,Q37621,Thomas Hobbes
9,Q43393,Robert Boyle


In [71]:
functions.make_string_of_qids_for_sparql(links_df["wikidata_id"])

'wd:Q12539 wd:Q43522 wd:Q76738 wd:Q35802 wd:Q1290 wd:Q39599 wd:Q47434 wd:Q102490 wd:nan wd:Q231256 wd:Q307 wd:nan wd:nan wd:Q154959 wd:Q935 wd:Q122392 wd:Q294100 wd:nan wd:Q8963 wd:Q9353 wd:Q159592 wd:Q235853 wd:Q188663 wd:Q160187 wd:Q60095 wd:Q214816 wd:Q192315 wd:Q75655 wd:Q270141 wd:Q9191 wd:Q43393 wd:Q46830 wd:Q214078 wd:nan wd:Q37621 wd:Q191850 wd:Q93128 '
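
The string above contains several `wd:nan` entries, which come from links without a matched Wikidata ID and are not valid in a SPARQL `VALUES` clause. A possible variant that skips them is sketched below; `qids_to_sparql_values` is a hypothetical name, not the function from the `functions` module.

```python
def qids_to_sparql_values(qids):
    """Join valid QIDs as 'wd:Q1 wd:Q2', skipping NaN, None and other non-QIDs."""
    valid = [q for q in qids if isinstance(q, str) and q.startswith("Q")]
    return " ".join(f"wd:{q}" for q in valid)


# Example input mixing real-looking QIDs with missing values
values = qids_to_sparql_values(["Q12539", float("nan"), "Q43522", None])
print(values)
```

The same effect could probably be had by calling the existing helper on `links_df["wikidata_id"].dropna()`.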