For the information that is not about people (and is likely not on Wikidata), I'll try a different strategy.


For each section of each page, I'll get the wikitext.
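
As a rough sketch of how the wikitext of one section could be fetched, the snippet below calls the MediaWiki Action API with `action=parse` and `prop=wikitext`. It is not the code behind `functions.get_bullets_on_page_section`, which is not shown here; the names `build_parse_url` and `fetch_section_wikitext` and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page, section):
    """Build the action=parse URL for the wikitext of one section."""
    params = {
        "action": "parse",
        "page": page,
        "section": section,
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",  # with version 2 the wikitext is a plain string
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_section_wikitext(page, section):
    """Download the wikitext of one section (needs network access)."""
    request = urllib.request.Request(
        build_parse_url(page, section),
        # Hypothetical User-Agent; Wikimedia asks clients to identify themselves.
        headers={"User-Agent": "wikitext-notebook-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data["parse"]["wikitext"]
```

Calling `fetch_section_wikitext("17th_century", "1")` should return the raw wikitext of the Events section as one string.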

From the wikitext, I'll extract, for each entry in a section:
    
    - the date string
    - the full wikitext of the entry
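
A minimal sketch of that extraction step, assuming each entry is a bullet of the form `* <date string>: <entry wikitext>`. That shape is an assumption about the page, and `split_bullets` is a hypothetical name, not one of the helpers in `functions`.

```python
import re

# Assumed entry shape: "* <date string>: <entry wikitext>".
# The date ends at the first colon that is followed by whitespace.
BULLET_RE = re.compile(r"^\*\s*(?P<date>.+?)\s*:\s+(?P<text>.+)$")


def split_bullets(section_wikitext):
    """Return a list of (date_string, wikitext_string) pairs, one per bullet."""
    entries = []
    for line in section_wikitext.splitlines():
        match = BULLET_RE.match(line.strip())
        if match:  # headings and bullets without a "date: text" shape are skipped
            entries.append((match.group("date"), match.group("text")))
    return entries
```

Bullets that do not follow the assumed shape are silently dropped here, so a real run should probably count the skipped lines and inspect a few of them.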

Then I'll build functions to:
    - Split the date string into a start year and an end year (if possible)
    - Get the first link of the entry
    - Get the URLs of the additional links of the entry
    - Convert the wikitext into human-readable text
    
   

In [1]:
import requests
from functions import functions
import pandas as pd

sections_df = functions.get_sections_dataframe("17th_century")
sections_df.head()

Unnamed: 0,toclevel,level,line,number,index,fromtitle,byteoffset,anchor
0,1,2,Events,1.0,1,17th_century,7373,Events
1,2,3,1601–1650,1.1,2,17th_century,7423,1601–1650
2,2,3,1651–1700,1.2,3,17th_century,17327,1651–1700
3,1,2,Significant people,2.0,4,17th_century,23000,Significant_people
4,2,3,Musicians,2.1,5,17th_century,27283,Musicians


I'll start the exploration with the first section: Events

In [2]:
page = "17th_century"
section = "1"

df = functions.get_bullets_on_page_section(page, section)
df.head()

Unnamed: 0,date_string,wikitext_string
0,[[1600]],[[Michael the Brave]] unifies the three [[Rom...
1,[[1601]],"[[Battle of Kinsale]], England defeats Irish ..."
2,[[1601]]–[[1603]],The [[Russian famine of 1601–1603]] kills per...
3,[[1602]],[[Matteo Ricci]] produces the [[Kunyu Wanguo ...
4,[[1602]],The [[Dutch East India Company]] (VOC) is est...


In [3]:
df["id"] = ["entry_" + str(a + 1) for a in df.index]

Nice, now we have the raw material to build on. We will use a Python library for wikitext to extract the information of interest. This is the library that will be used:
https://pypi.org/project/wikitextparser/#id11

In [4]:
import wikitextparser as wtp

In [5]:
df = functions.get_years_from_bullets(df)
df.head()

Unnamed: 0,date_string,wikitext_string,id,from_year,to_year
0,[[1600]],[[Michael the Brave]] unifies the three [[Rom...,entry_1,1600,
1,[[1601]],"[[Battle of Kinsale]], England defeats Irish ...",entry_2,1601,
2,[[1601]]–[[1603]],The [[Russian famine of 1601–1603]] kills per...,entry_3,1601,1603.0
3,[[1602]],[[Matteo Ricci]] produces the [[Kunyu Wanguo ...,entry_4,1602,
4,[[1602]],The [[Dutch East India Company]] (VOC) is est...,entry_5,1602,
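
For illustration, the year extraction could look roughly like the sketch below. It is a guess at the behaviour of `functions.get_years_from_bullets`, not its actual code: `parse_years` is a hypothetical name, and it assumes years are written as 3 or 4 digit numbers.

```python
import re

# Assumption: a year is any run of 3 or 4 digits in the date string.
YEAR_RE = re.compile(r"\d{3,4}")


def parse_years(date_string):
    """Return (from_year, to_year); to_year is None for a single year."""
    years = [int(year) for year in YEAR_RE.findall(date_string)]
    if not years:
        return None, None
    from_year = years[0]
    to_year = years[1] if len(years) > 1 else None
    return from_year, to_year
```

In the table above `to_year` shows up as `1603.0` because pandas stores a column with missing values as floats; a nullable integer dtype such as `"Int64"` would keep it as `1603`.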


OK, the dates are extracted. Hooray. Now let's get the snippet in plain text.

In [6]:
df = functions.get_main_text(df)
df.head()

Unnamed: 0,date_string,wikitext_string,id,from_year,to_year,main_text
0,[[1600]],[[Michael the Brave]] unifies the three [[Rom...,entry_1,1600,,Michael the Brave unifies the three Romanian ...
1,[[1601]],"[[Battle of Kinsale]], England defeats Irish ...",entry_2,1601,,"Battle of Kinsale, England defeats Irish and ..."
2,[[1601]]–[[1603]],The [[Russian famine of 1601–1603]] kills per...,entry_3,1601,1603.0,The Russian famine of 1601–1603 kills perhaps...
3,[[1602]],[[Matteo Ricci]] produces the [[Kunyu Wanguo ...,entry_4,1602,,Matteo Ricci produces the Map of the Myriad C...
4,[[1602]],The [[Dutch East India Company]] (VOC) is est...,entry_5,1602,,The Dutch East India Company (VOC) is establi...
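
The notebook relies on `wikitextparser` for this step. As a dependency-free stand-in (a swapped-in technique, not what `functions.get_main_text` does), the sketch below strips the most common markup with regular expressions. `wikitext_to_plain` is a hypothetical name, and templates such as `{{...}}` are deliberately left untouched.

```python
import re

# [[target|label]] -> label, [[target]] -> target
LINK_RE = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")
# <ref .../> and <ref ...>...</ref>
REF_RE = re.compile(r"<ref[^>]*/>|<ref[^>]*>.*?</ref>", flags=re.DOTALL)


def wikitext_to_plain(wikitext):
    """Convert simple wikitext into readable text (links, refs, quotes only)."""
    text = REF_RE.sub("", wikitext)       # drop footnotes first
    text = LINK_RE.sub(r"\1", text)       # keep the visible part of each link
    text = re.sub(r"'{2,}", "", text)     # drop bold/italic quote markup
    return re.sub(r"\s+", " ", text).strip()
```

Note that a piped link keeps its label rather than its target, which matches the Matteo Ricci row above, where the link to `Kunyu Wanguo Quantu` is rendered with its display text.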


Now let's get the main links. I will assume that the first link is the main link.

In [7]:
df = functions.get_main_info(df)
df.head()

Unnamed: 0,date_string,wikitext_string,id,from_year,to_year,main_text,main_link_text,main_link_urls
0,[[1600]],[[Michael the Brave]] unifies the three [[Rom...,entry_1,1600,,Michael the Brave unifies the three Romanian ...,Michael the Brave,https://en.wikipedia.org/wiki/Michael_the_Brave
1,[[1601]],"[[Battle of Kinsale]], England defeats Irish ...",entry_2,1601,,"Battle of Kinsale, England defeats Irish and ...",Battle of Kinsale,https://en.wikipedia.org/wiki/Battle_of_Kinsale
2,[[1601]]–[[1603]],The [[Russian famine of 1601–1603]] kills per...,entry_3,1601,1603.0,The Russian famine of 1601–1603 kills perhaps...,Russian famine of 1601–1603,https://en.wikipedia.org/wiki/Russian_famine_o...
3,[[1602]],[[Matteo Ricci]] produces the [[Kunyu Wanguo ...,entry_4,1602,,Matteo Ricci produces the Map of the Myriad C...,Matteo Ricci,https://en.wikipedia.org/wiki/Matteo_Ricci
4,[[1602]],The [[Dutch East India Company]] (VOC) is est...,entry_5,1602,,The Dutch East India Company (VOC) is establi...,Dutch East India Company,https://en.wikipedia.org/wiki/Dutch_East_India...
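
A small sketch of the "first link is the main link" assumption, using only the standard library. `first_link` and `title_to_url` are hypothetical names; the real `functions.get_main_info` may work differently, for example through `wikitextparser`.

```python
import re
import urllib.parse

# Captures the link target, ignoring any "#section" part and any "|label".
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
WIKI_BASE = "https://en.wikipedia.org/wiki/"


def title_to_url(title):
    """Build an English Wikipedia URL from a page title."""
    path = title.strip().replace(" ", "_")
    return WIKI_BASE + urllib.parse.quote(path, safe="/:(),")


def first_link(wikitext):
    """Return (title, url) of the first wikilink, or (None, None) if absent."""
    match = LINK_RE.search(wikitext)
    if match is None:
        return None, None
    title = match.group(1).strip()
    return title, title_to_url(title)
```

This sketch does not filter out `File:` or `Category:` links, so an entry that starts with an image would get the image as its main link.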


Now let's make a request to the Wikimedia API and get the page ids.

In [8]:
page_titles = df["main_link_text"].values

title_to_page_id = functions.get_wikipedia_page_ids(page_titles)

In [9]:

df["main_link_id"] = df["main_link_text"].map(title_to_page_id)
df.head()

Unnamed: 0,date_string,wikitext_string,id,from_year,to_year,main_text,main_link_text,main_link_urls,main_link_id
0,[[1600]],[[Michael the Brave]] unifies the three [[Rom...,entry_1,1600,,Michael the Brave unifies the three Romanian ...,Michael the Brave,https://en.wikipedia.org/wiki/Michael_the_Brave,2468688
1,[[1601]],"[[Battle of Kinsale]], England defeats Irish ...",entry_2,1601,,"Battle of Kinsale, England defeats Irish and ...",Battle of Kinsale,https://en.wikipedia.org/wiki/Battle_of_Kinsale,1486771
2,[[1601]]–[[1603]],The [[Russian famine of 1601–1603]] kills per...,entry_3,1601,1603.0,The Russian famine of 1601–1603 kills perhaps...,Russian famine of 1601–1603,https://en.wikipedia.org/wiki/Russian_famine_o...,39938514
3,[[1602]],[[Matteo Ricci]] produces the [[Kunyu Wanguo ...,entry_4,1602,,Matteo Ricci produces the Map of the Myriad C...,Matteo Ricci,https://en.wikipedia.org/wiki/Matteo_Ricci,7575977
4,[[1602]],The [[Dutch East India Company]] (VOC) is est...,entry_5,1602,,The Dutch East India Company (VOC) is establi...,Dutch East India Company,https://en.wikipedia.org/wiki/Dutch_East_India...,42737
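
To make the lookup concrete, here is a sketch of how the JSON returned by `action=query` could be turned into a title-to-id mapping. It assumes the request was sent with `titles=A|B|C`, `redirects=1`, `format=json` and `formatversion=2`; `map_titles_to_page_ids` is a hypothetical name and not necessarily how `functions.get_wikipedia_page_ids` is written.

```python
def map_titles_to_page_ids(titles, response):
    """Map each requested title to a page id using one action=query response.

    `response` is the decoded JSON of a request made with formatversion=2
    and the `redirects` parameter set.
    """
    query = response.get("query", {})

    # The API may rename a requested title twice:
    # requested title -> normalized title -> redirect target.
    renamed = {}
    for step in ("normalized", "redirects"):
        for item in query.get(step, []):
            renamed[item["from"]] = item["to"]

    ids_by_title = {
        page["title"]: page["pageid"]
        for page in query.get("pages", [])
        if "pageid" in page  # missing pages carry no pageid
    }

    result = {}
    for title in titles:
        resolved = title
        for _ in range(2):  # follow normalization, then redirect
            resolved = renamed.get(resolved, resolved)
        result[title] = ids_by_title.get(resolved)  # None if not found
    return result
```

Keying the result by the requested title, rather than by the title the API returns, is what lets `df["main_link_text"].map(...)` find every row even when a link is a redirect.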


Nice. Now I will get all the other hyperlinks, separated by " ; "

In [12]:
df_test = functions.get_other_link_info(df.head())
df_test.head()
        


100%|██████████| 5/5 [00:09<00:00,  1.88s/it]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["other_link_text"] = df["id"].map(other_link_texts)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["other_link_url"] = df["id"].map(other_link_urls)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["other_link_id"] = df["id"].map(other_link_ids)


Unnamed: 0,date_string,wikitext_string,id,from_year,to_year,main_text,main_link_text,main_link_urls,main_link_id,other_link_text,other_link_url,other_link_id
0,[[1600]],[[Michael the Brave]] unifies the three [[Rom...,entry_1,1600,,Michael the Brave unifies the three Romanian ...,Michael the Brave,https://en.wikipedia.org/wiki/Michael_the_Brave,2468688,Romania ; Wallachia ; Moldavia ; Principality ...,https://en.wikipedia.org/wiki/Romania ; https:...,25445 ; 46026 ; 46007 ; 6258616 ; 1924255
1,[[1601]],"[[Battle of Kinsale]], England defeats Irish ...",entry_2,1601,,"Battle of Kinsale, England defeats Irish and ...",Battle of Kinsale,https://en.wikipedia.org/wiki/Battle_of_Kinsale,1486771,,,
2,[[1601]]–[[1603]],The [[Russian famine of 1601–1603]] kills per...,entry_3,1601,1603.0,The Russian famine of 1601–1603 kills perhaps...,Russian famine of 1601–1603,https://en.wikipedia.org/wiki/Russian_famine_o...,39938514,,,
3,[[1602]],[[Matteo Ricci]] produces the [[Kunyu Wanguo ...,entry_4,1602,,Matteo Ricci produces the Map of the Myriad C...,Matteo Ricci,https://en.wikipedia.org/wiki/Matteo_Ricci,7575977,Kunyu Wanguo Quantu,https://en.wikipedia.org/wiki/Kunyu_Wanguo_Quantu,25783080
4,[[1602]],The [[Dutch East India Company]] (VOC) is est...,entry_5,1602,,The Dutch East India Company (VOC) is establi...,Dutch East India Company,https://en.wikipedia.org/wiki/Dutch_East_India...,42737,Netherlands ; Dutch Golden Age,https://en.wikipedia.org/wiki/Netherlands ; ht...,21148 ; 241517
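
The "other links" column can be sketched as every wikilink after the first one, joined with `" ; "`. This is an illustration under that assumption only; `other_link_titles` is a hypothetical name and the regular expression is a simplification of what a wikitext parser would do.

```python
import re

# Captures the link target, ignoring any "#section" part and any "|label".
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
SEPARATOR = " ; "


def other_link_titles(wikitext):
    """Return the titles of all links except the first, joined by ' ; '."""
    titles = [title.strip() for title in LINK_RE.findall(wikitext)]
    return SEPARATOR.join(titles[1:])  # the first link is the main link
```

An entry with a single link yields an empty string, which corresponds to the empty cells in rows 1 and 2 of the table above.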


Because some links point to redirects, I could not always get the page ids in the correct order.

The ids are all there, but perhaps not in the same order as the names.
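
One possible way to keep ids and names aligned is to look up each title individually in a title-to-id mapping instead of relying on the order of the API response. The sketch below assumes such a mapping already exists and is keyed by the titles exactly as they appear in `other_link_text`; `aligned_ids` is a hypothetical helper.

```python
SEPARATOR = " ; "


def aligned_ids(other_link_text, title_to_page_id):
    """Return page ids in the same order as the ' ; '-separated titles.

    Titles without a known id become an empty slot, so positions still match.
    """
    if not other_link_text:
        return ""
    titles = other_link_text.split(SEPARATOR)
    ids = [title_to_page_id.get(title) for title in titles]
    return SEPARATOR.join("" if page_id is None else str(page_id) for page_id in ids)
```

Leaving an empty slot for unresolved titles keeps the n-th id paired with the n-th name, at the cost of having to handle blanks downstream.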

All right, now we have to run the pipeline for all sections of all pages.



In [17]:
def get_bullet_info_for_section(page, section):
    """Run the full extraction pipeline for one section of one page."""

    # Raw bullets, entry ids, years, plain text, and main link of each entry
    df = get_bullets_on_page_section(page, section)
    df["id"] = ["entry_" + str(a + 1) for a in df.index]
    df = get_years_from_bullets(df)
    df = get_main_text(df)
    df = get_main_info(df)

    # Page ids of the main links, then the remaining links of each entry
    page_titles = df["main_link_text"].values
    title_to_page_id = get_wikipedia_page_ids(page_titles)
    df["main_link_id"] = df["main_link_text"].map(title_to_page_id)
    df = get_other_link_info(df)

    return df


In [18]:
functions.get_bullet_info_for_section("17th_century", "1")

  3%|▎         | 3/113 [00:06<04:05,  2.23s/it]


KeyboardInterrupt: 
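
The interrupted run above averaged a bit over 2 seconds per entry. If that time is dominated by one API request per entry (an assumption, since the body of `get_other_link_info` is not shown), collecting the titles first and querying them in batches could cut the number of requests considerably: `action=query` accepts up to 50 titles per request for regular clients. A sketch with hypothetical helper names:

```python
def batched(titles, size=50):
    """Split the unique titles into lists of at most `size` items."""
    unique = list(dict.fromkeys(titles))  # drop duplicates, keep first-seen order
    return [unique[i:i + size] for i in range(0, len(unique), size)]


def titles_parameter(batch):
    """Join one batch into the value of the API's `titles` parameter."""
    return "|".join(batch)
```

With 113 entries that each carry a handful of links, this would mean a few dozen requests at most instead of one or more per entry.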

In [None]:
detect_and_save_people_per_section("17th_century")
