Goal:

Collect relevant information on the following pages:

* 1) https://en.wikipedia.org/wiki/17th_century 
* 2) https://en.wikipedia.org/wiki/Timeline_of_the_17th_century
* 3) https://en.wikipedia.org/wiki/1600s_(decade)
* 4) https://en.wikipedia.org/wiki/1600

I will start with a Wikidata-based approach:

- For each section, get all hyperlinks
- For each hyperlink:
    - Check if it is a person:
        - If so, get date of birth, date of death, place of birth and place of death
        - Store QID + information
        - Get the Wikipedia page
        - Get full text
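
The person check in the plan above could be sketched as follows. This is a minimal sketch, not the code in `functions.py`: the helper names (`claim_values`, `is_human`, `person_record`, `fetch_entity`) and the User-Agent string are hypothetical. It assumes the standard Wikidata entity JSON layout, where "instance of" is property P31, "human" is item Q5, and P569, P570, P19 and P20 hold date of birth, date of death, place of birth and place of death.

```python
HUMAN_QID = "Q5"  # Wikidata item for "human"
PERSON_PROPERTIES = {
    "P569": "date_of_birth",
    "P570": "date_of_death",
    "P19": "place_of_birth",
    "P20": "place_of_death",
}


def claim_values(entity, prop):
    """Return the values of all statements an entity has for one property."""
    values = []
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "unknown value" and "no value" statements
        value = snak["datavalue"]["value"]
        if isinstance(value, dict) and "time" in value:
            values.append(value["time"])  # e.g. "+1635-07-28T00:00:00Z"
        elif isinstance(value, dict) and "id" in value:
            values.append(value["id"])  # QID of the linked item
        else:
            values.append(value)
    return values


def is_human(entity):
    """True if the entity is an instance of (P31) human (Q5)."""
    return HUMAN_QID in claim_values(entity, "P31")


def person_record(qid, entity):
    """Return a dict of person data, or None if the entity is not a human."""
    if not is_human(entity):
        return None
    record = {"wikidata_id": qid}
    for prop, name in PERSON_PROPERTIES.items():
        # Lists, because one property can carry several statements.
        record[name] = claim_values(entity, prop)
    return record


def fetch_entity(qid):
    """Download one entity from Wikidata (hypothetical helper, needs network)."""
    import requests  # imported here so the parsing helpers work offline

    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    headers = {"User-Agent": "notebook-sketch/0.1 (placeholder contact)"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    # Assumes qid is not a redirect; otherwise the key in "entities" differs.
    return response.json()["entities"][qid]
```

With these helpers, `person_record(qid, fetch_entity(qid))` would return `None` for links such as "Age of Enlightenment" and a dictionary of lists for people.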
    
    
Let us make some API calls:

In [1]:
import requests
from functions import functions
import pandas as pd

query = "https://en.wikipedia.org/w/api.php?action=parse&prop=sections&page=17th_century&format=json"
sections = requests.get(query)
sections_df = pd.json_normalize(sections.json()["parse"]["sections"])
sections_df.head()

Unnamed: 0,toclevel,level,line,number,index,fromtitle,byteoffset,anchor
0,1,2,Events,1.0,1,17th_century,7373,Events
1,2,3,1601–1650,1.1,2,17th_century,7423,1601–1650
2,2,3,1651–1700,1.2,3,17th_century,17327,1651–1700
3,1,2,Significant people,2.0,4,17th_century,23000,Significant_people
4,2,3,Musicians,2.1,5,17th_century,27283,Musicians


Let's get all links for the significant people in the _Science_and_philosophy_ section. The API's `section` parameter takes the value from the `index` column above, which is 9 for this section.


In [2]:
query = "https://en.wikipedia.org/w/api.php?action=parse&prop=links&page=17th_century&format=json&section=9"
links = requests.get(query)

links_df = pd.json_normalize(links.json()["parse"]["links"])

links_df.head()

Unnamed: 0,ns,exists,*
0,0,,Age of Enlightenment
1,0,,Antonie van Leeuwenhoek
2,0,,Athanasius Kircher
3,0,,Baruch Spinoza
4,0,,Blaise Pascal


Now that I have the linked pages, let's try to get information about them from Wikidata.


In [None]:
links_df = functions.add_ids_and_urls_to_dataframe(links_df)
links_df.head()
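
The body of `functions.add_ids_and_urls_to_dataframe` is not shown in this notebook. One way such a lookup could work is the MediaWiki `action=query` API with `prop=pageprops|info`, which returns the Wikidata item, page id and full URL for a batch of titles. The sketch below is an assumption about the approach, not the actual implementation; the helper names and the User-Agent string are hypothetical.

```python
API_URL = "https://en.wikipedia.org/w/api.php"


def build_params(titles):
    """Query parameters for looking up Wikidata IDs and URLs of some titles."""
    return {
        "action": "query",
        "prop": "pageprops|info",
        "ppprop": "wikibase_item",  # only return the Wikidata item id
        "inprop": "url",            # adds "fullurl" to each page
        "redirects": 1,             # follow redirects to the target page
        "titles": "|".join(titles),
        "format": "json",
    }


def parse_pages(response_json):
    """Turn one API response into a list of row dictionaries."""
    rows = []
    pages = response_json.get("query", {}).get("pages", {})
    for page in pages.values():
        if "missing" in page:
            continue  # the title has no page on this wiki
        rows.append({
            "title": page["title"],
            "page_id": page["pageid"],
            "page_url": page.get("fullurl"),
            "wikidata_id": page.get("pageprops", {}).get("wikibase_item"),
        })
    return rows


def chunks(items, size=50):
    """Yield batches; the API accepts at most 50 titles per request."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_ids_and_urls(titles):
    """Look up all titles in batches (hypothetical helper, needs network)."""
    import requests  # imported here so the parsing helpers work offline

    headers = {"User-Agent": "notebook-sketch/0.1 (placeholder contact)"}
    rows = []
    for batch in chunks(list(titles)):
        response = requests.get(
            API_URL, params=build_params(batch), headers=headers, timeout=30
        )
        response.raise_for_status()
        rows.extend(parse_pages(response.json()))
    return rows
```

A call such as `pd.DataFrame(fetch_ids_and_urls(links_df["*"]))` would give a table that can be joined back on the title. Because redirects are followed, a returned title can differ from the link text.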

In [None]:
human_info = functions.return_data_about_humans(links_df["wikidata_id"])
human_info.head()

A few of the names are duplicated. That is due to multiple values for some features. For example, some sources say that Robert Hooke was born on 1635-07-18 while others say 1635-07-28.
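
If one row per person is preferred, the duplicated rows could be collapsed while keeping every conflicting value. The sketch below is only an illustration: the function name `collapse_multiple_values` is hypothetical, and it assumes `human_info` has a `wikidata_id` column identifying each person.

```python
import pandas as pd


def collapse_multiple_values(df, key="wikidata_id"):
    """Return one row per key, joining distinct values of each column."""

    def unique_values(series):
        # Keep every distinct non-missing value, so conflicts stay visible.
        return " | ".join(sorted(series.dropna().astype(str).unique()))

    return df.groupby(key, sort=False).agg(unique_values).reset_index()
```

For the Robert Hooke case described above, the date of birth cell would then read `1635-07-18 | 1635-07-28`, while columns with a single value are left as that value (converted to a string).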

Anyway, I will add the English Wikipedia URLs and page ids.



In [None]:
human_info = human_info.merge(links_df[["wikidata_id", "page_url", "page_id"]])

human_info.head()

Nice! The next steps are:

- Run pipeline and store results for all pages and all sections

Let's start with the sections on the 17th_century page. But first, it is time to make it possible to run it all with a single function in the functions.py file.
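
As a rough idea of what such a single function has to do, the sketch below loops over the sections of a page and collects the article links of each one, using the same two `action=parse` calls as above. It is not the code in functions.py: `fetch_json` and `links_per_section` are hypothetical names, and the Wikidata lookup and the saving step are left out. The fetcher is passed in as a parameter so the loop can be tried without network access.

```python
API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Call the Wikipedia API and return the decoded JSON (needs network)."""
    import requests  # imported here so the loop below can be tested offline

    headers = {"User-Agent": "notebook-sketch/0.1 (placeholder contact)"}
    response = requests.get(
        API_URL, params={**params, "format": "json"}, headers=headers, timeout=30
    )
    response.raise_for_status()
    return response.json()


def links_per_section(page, fetch=fetch_json):
    """Map each section index of a page to its title and article links."""
    sections = fetch({"action": "parse", "prop": "sections", "page": page})
    result = {}
    for section in sections["parse"]["sections"]:
        index = str(section["index"])
        if not index.isdigit():
            continue  # skip non-numeric indexes, e.g. from transcluded sections
        parsed = fetch(
            {"action": "parse", "prop": "links", "page": page, "section": index}
        )
        result[index] = {
            "title": section["line"],
            # Namespace 0 holds articles; other namespaces are not people.
            "links": [l["*"] for l in parsed["parse"]["links"] if l["ns"] == 0],
        }
    return result
```

Each list of links could then go through the Wikidata steps shown earlier before the results are stored per section.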


In [None]:
page = "17th_century"
functions.detect_and_save_people_per_section(page)

Nice. Now I'll run it for the other pages.

In [None]:
page = "Timeline_of_the_17th_century"
functions.detect_and_save_people_per_section(page)

In [None]:
page = "1600s_(decade)"
functions.detect_and_save_people_per_section(page)

In [None]:
page = "1600"
functions.detect_and_save_people_per_section(page)