## CRUD

| | SQL | RESTful API |
|:-:|:-:|:-:|
| create | `INSERT` | `POST` |
| read | `SELECT` | `GET` |
| update | `UPDATE` | `PUT` |
| delete | `DELETE` | `DELETE` |
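
To make the SQL column of the table concrete, here is a minimal sketch of the four operations. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption made only so the snippet runs without a server), and the `pages` table and its values are illustrative.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (page_id INTEGER PRIMARY KEY, title TEXT)")

# create -> INSERT
conn.execute("INSERT INTO pages (page_id, title) VALUES (?, ?)", (1, "perceptron"))

# read -> SELECT
created = conn.execute("SELECT title FROM pages WHERE page_id = ?", (1,)).fetchone()

# update -> UPDATE
conn.execute("UPDATE pages SET title = ? WHERE page_id = ?", ("Perceptron", 1))
updated = conn.execute("SELECT title FROM pages WHERE page_id = ?", (1,)).fetchone()

# delete -> DELETE
conn.execute("DELETE FROM pages WHERE page_id = ?", (1,))
remaining = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]

print(created, updated, remaining)
```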

![](http://interactive.blockdiag.com/image?compression=deflate&encoding=base64&src=eJxLyslPzk7JTExXqOZSUFAPcnV0UUgrys9VKM_MzixIBcqoK-jaKRSlFpfmlBSDmOqefsGuQSEKJfkKBfnFJelAKXVrrloAaBAXgg)

## Table Setup

Because a single page can belong to many categories, the schema needs an additional pages-categories table that links the two. It is like a single author writing multiple books, as in the diagram below:

![](http://2.bp.blogspot.com/-XqXauv9hrQ4/U3RXOjaW8qI/AAAAAAAAIiw/X7Pd2C41cAU/s1600/DatabaseRelationships.png)

## PostgreSQL Schema

To follow that same structure, we will want the following three tables:

    pages
    ====
    links
    page_categories
    pageid (primary key)
    summary
    title
    url

| page_id | title |
|:-:|:-:|
| 1 | logistic regression |
| 2 | perceptron |
| 3 | random forest |

    category_pages
    ========
    page_id (foreign key)
    timestamp
    title
    type
    category_id (foreign key)
    category_name

| category_id | category_name | page_id | title |
|:-:|:-:|:-:|:-:|
| 1 | machine learning | 1 | logistic regression |
| 2 | linear models | 2 | perceptron |
| 3 | tree models | 3 | random forest |

    categories
    ========
    category_id (primary key)
    category_name
    category_summary
    url

| category_id | category_name |
|:-:|:-:|
| 1 | machine learning |
| 2 | linear models |
| 3 | tree models |
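
The three tables above can be sketched as DDL plus a join. This is a simplified illustration, not the project's actual schema: it uses the built-in `sqlite3` module instead of PostgreSQL, keeps only the key and name columns, and adds one extra link row (perceptron also filed under machine learning) purely to show a page in two categories.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE pages (
    page_id INTEGER PRIMARY KEY,
    title   TEXT NOT NULL
);
CREATE TABLE categories (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT NOT NULL
);
-- Junction table: one row per (category, page) pair.
CREATE TABLE category_pages (
    category_id INTEGER NOT NULL REFERENCES categories (category_id),
    page_id     INTEGER NOT NULL REFERENCES pages (page_id),
    PRIMARY KEY (category_id, page_id)
);
""")

conn.executemany("INSERT INTO pages VALUES (?, ?)",
                 [(1, "logistic regression"), (2, "perceptron"), (3, "random forest")])
conn.executemany("INSERT INTO categories VALUES (?, ?)",
                 [(1, "machine learning"), (2, "linear models"), (3, "tree models")])
# (category_id, page_id) pairs; the last row is the illustrative extra link.
conn.executemany("INSERT INTO category_pages VALUES (?, ?)",
                 [(1, 1), (2, 2), (3, 3), (1, 2)])

rows = conn.execute("""
    SELECT p.title
    FROM pages p
    JOIN category_pages cp ON cp.page_id = p.page_id
    JOIN categories c ON c.category_id = cp.category_id
    WHERE c.category_name = ?
    ORDER BY p.title
""", ("machine learning",)).fetchall()
ml_titles = [row[0] for row in rows]
print(ml_titles)
```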

In [2]:
!pip install wikipedia

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz
Building wheels for collected packages: wikipedia
  Running setup.py bdist_wheel for wikipedia ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/bf/87/25/df698dd7b66a42c1c5f3bd36f8155d4518d210f5e2c128b440
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [3]:
import urllib
import json
import requests
import pandas as pd
import wikipedia
import re

### Generating Categories Table (didn't use this one)...

I was able to get all the pages and categories for Machine Learning and Business Software with the code below. However, for this project I needed more pages, spread across several other categories, because we eventually want to build a model that predicts a page's category, and that model needs a large training set covering several categories. This code only returns the direct members of a category, so I ended up not using it and developed a recursive function instead.

In [None]:
ml_pages = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&list=categorymembers&titles=Main+Page&exintro=1&explaintext=1&cmtitle=Category%3AMachine_learning&cmprop=ids%7Ctitle%7Ctype%7Ctimestamp&cmlimit=max")
ml_str_response = ml_pages.read().decode('utf-8')
ml_data = json.loads(ml_str_response)
ml_cat_df = pd.DataFrame(ml_data['query']['categorymembers'])
ml_cat_df['category_id'] = wikipedia.page('Category:Machine_learning').pageid
ml_cat_df['category'] = 'Machine learning'
ml_cat_df['category_summary'] = wikipedia.page('Category:Machine_learning').summary
ml_cat_df['url'] = wikipedia.page('Category:Machine_learning').url

bu_pages = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&list=categorymembers&titles=Main+Page&exintro=1&explaintext=1&cmtitle=Category%3ABusiness+software&cmprop=ids%7Ctitle%7Ctype%7Ctimestamp&cmlimit=max")
bu_str_response = bu_pages.read().decode('utf-8')
bu_data = json.loads(bu_str_response)
bu_cat_df = pd.DataFrame(bu_data['query']['categorymembers'])
bu_cat_df['category_id'] = wikipedia.page('Category:Business software').pageid
bu_cat_df['category'] = 'Business software'
bu_cat_df['category_summary'] = wikipedia.page('Category:Business software').summary
bu_cat_df['url'] = wikipedia.page('Category:Business software').url

categories_pages_df = pd.concat([ml_cat_df,bu_cat_df], axis=0, join='outer')

categories_pages_df = categories_pages_df.reset_index(drop=True)
categories_pages_df.sample(5)

In [None]:
categories_pages_df.reset_index(drop=True)

In [None]:
categories_pages_df.to_csv('categories_pages_dont_use.csv',index=False)

## Generating Categories Tables with Functions

The functions below are what I used to generate the categories_pages table. The end product is a recursive function that collects the pages within a category, including those in its subcategories.

I ended up using just the "summary" that the wikipedia API returns for each page, as it had enough textual content to work with. Wikipedia also has an option to return all the content of a page, but that was far too much text and would have hurt model performance later on. So I decided to go with the "summary", which still contains a substantial amount of text describing the topic of the page.

In [None]:
def generate_category(category):
    category = re.sub(r'\s','+',category)
    return category

def generate_query(category):
    query = """
            http://en.wikipedia.org/w/api.php?
            action=query&
            format=json&
            prop=extracts&
            list=categorymembers&
            titles=Main+Page&
            exintro=1&
            explaintext=1&
            cmtitle=Category%3A{}&
            cmprop=ids%7Ctitle%7Ctype%7Ctimestamp&
            cmlimit=max
            """.format(generate_category(category))
    query = re.sub(r'\s','',query)
    return query

def execute_category_query(category):
    r = requests.get(generate_query(category))
    response = r.json()
    return pd.DataFrame(response['query']['categorymembers'])

def category_to_dataframe(category):
    df = execute_category_query(category)
    category_string = 'Category:{}'.format(category)
    # Fetch the category page once and reuse it for every derived column
    category_page = wikipedia.page(category_string)
    df['category_id'] = category_page.pageid
    df['category'] = category_string
    df['category_summary'] = category_page.summary
    df['url'] = category_page.url
    return df

def remove_category(category):
    category = re.sub('Category:','',category)
    return category

#categories_to_query = category_to_dataframe(category)[category_mask]['title'].apply(remove_category).tolist()

def get_all_pages_rec(category):
    category_df = category_to_dataframe(category)
    pages_list = []
    category_mask = category_df['title'].str.contains('Category:')
    pages_df = category_df[~category_mask]
    pages_list.append(pages_df)
    categories = category_df[category_mask]['title'].str.replace('Category:','').tolist()
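    for cat in categories:
        try:
            pages_list.append(get_all_pages_rec(cat))
        except Exception:
            # Skip subcategories that fail to load rather than aborting the crawl
            pass

    pages_df = pd.concat(pages_list)
    # reset_index returns a new frame, so return its result directly
    return pages_df.reset_index(drop=True)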
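
One thing a recursive crawl like this may run into is that Wikipedia's category graph is not a strict tree: a subcategory can link back to one of its ancestors. Below is a minimal sketch of the same recursion with a visited set and a depth limit. The names `collect_pages`, `get_members` and `FAKE_GRAPH` are hypothetical, and the in-memory dictionary stands in for the API so the snippet runs offline.

```python
def collect_pages(category, get_members, max_depth=3, _seen=None):
    """Return page titles under `category`, descending into subcategories."""
    seen = _seen if _seen is not None else set()
    # Stop on categories already visited (cycles) or when the depth budget is spent.
    if category in seen or max_depth < 0:
        return []
    seen.add(category)

    pages = []
    for title in get_members(category):
        if title.startswith("Category:"):
            subcategory = title[len("Category:"):]
            pages.extend(collect_pages(subcategory, get_members, max_depth - 1, seen))
        else:
            pages.append(title)
    return pages


# Hypothetical category graph; note the cycle A -> B -> A.
FAKE_GRAPH = {
    "A": ["Page 1", "Category:B"],
    "B": ["Page 2", "Category:A", "Category:C"],
    "C": ["Page 3"],
}


def fake_members(category):
    return FAKE_GRAPH.get(category, [])


titles = collect_pages("A", fake_members)
shallow = collect_pages("A", fake_members, max_depth=0)
print(titles, shallow)
```

In the notebook, `get_members` would be backed by the `title` column returned from `execute_category_query`; the depth limit is a design choice to keep very broad categories from pulling in unrelated pages.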

In [None]:
machine_learning_cat_df = get_all_pages_rec('Machine_learning')
machine_learning_cat_df = machine_learning_cat_df.drop_duplicates()
machine_learning_cat_df.head()

In [None]:
business_cat_df = get_all_pages_rec('Business software')
business_cat_df = business_cat_df.drop_duplicates()
business_cat_df.head()

In [None]:
categories_df = pd.concat([machine_learning_cat_df,business_cat_df], axis=0, join='outer')
categories_df.reset_index(drop=True,inplace=True)

In [None]:
categories_df.to_csv('categories_extract.csv',index=False)

In [None]:
# Deduplicate on a new frame; an in-place drop on a column slice triggers SettingWithCopyWarning
category = categories_df[['category_id','category','category_summary','url']].drop_duplicates()
category.to_csv('categories.csv',index=False)

In [None]:
categories_df.drop(['ns','category_summary','url'],axis=1,inplace=True)

In [None]:
categories_df.to_csv('categories_pages.csv',index=False)

### Generating Pages Table - with Wikipedia API

To get the contents of the pages, I used the Python library for the Wikipedia API. After collecting all the data I needed, I put everything into a dataframe and saved it to CSV files to load into my Postgres database. Because the loads were taking so long and my internet connection was spotty, I ended up breaking the work into batches.

In [None]:
cat_df = pd.read_csv('categories_extract.csv')

In [None]:
# Contiguous batches of page ids, so that no range of rows is skipped
pages_id_list_1 = cat_df['pageid'].iloc[0:1000].tolist()
pages_id_list_2 = cat_df['pageid'].iloc[1000:2000].tolist()
pages_id_list_3 = cat_df['pageid'].iloc[2000:3000].tolist()
pages_id_list_4 = cat_df['pageid'].iloc[3000:4000].tolist()
pages_id_list_5 = cat_df['pageid'].iloc[4000:5000].tolist()
pages_id_list_6 = cat_df['pageid'].iloc[5000:].tolist()
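
Hand-written slice boundaries are easy to get wrong, so a small helper that derives the batches from a single batch size may be safer. This is a sketch only: `chunked` is a hypothetical name, and the integer range stands in for the real `pageid` column.

```python
def chunked(items, size):
    """Split `items` into consecutive lists of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[start:start + size] for start in range(0, len(items), size)]


# Stand-in for cat_df['pageid'].tolist()
page_ids = list(range(6500))
batches = chunked(page_ids, 1000)

print(len(batches), [len(batch) for batch in batches])
```

Because every batch starts where the previous one ended, the batches together always cover the full list exactly once.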

In [None]:
def pages_query(p_id):
    temp_dict = {}
    request = wikipedia.page(pageid=p_id)
    temp_dict['pageid'] = request.pageid
    temp_dict['title'] = request.title
    temp_dict['links'] = request.links
    temp_dict['page_categories'] = request.categories
#    temp_dict['revision_id'] = request.revision_id
#    temp_dict['images'] = request.images
#    temp_dict['content'] = request.content
    temp_dict['summary'] = request.summary
#    temp_dict['references'] = request.references
#    temp_dict['parent_id'] = request.parent_id
    temp_dict['url'] = request.url
    return temp_dict

In [None]:
###1st list

pages = []
for s in pages_id_list_1:
    try:
        pages.append(pages_query(s))
        print('adding page')
    except Exception:
        # Skip pages the wikipedia library cannot resolve (e.g. missing or disambiguation pages)
        pass
    
pages_df_1 = pd.DataFrame(pages)
pages_df_1.head()

In [None]:
pages_df_1.to_csv('pages_df_1.csv',index=False)

In [None]:
###2nd list

pages = []
for s in pages_id_list_2:
    try:
        pages.append(pages_query(s))
        print('adding page')
    except Exception:
        # Skip pages the wikipedia library cannot resolve (e.g. missing or disambiguation pages)
        pass
    
pages_df_2 = pd.DataFrame(pages)
pages_df_2.head()

In [None]:
pages_df_2.to_csv('pages_df_2.csv',index=False)

In [None]:
###3rd list

pages = []
for s in pages_id_list_3:
    try:
        pages.append(pages_query(s))
        print('adding page')
    except Exception:
        # Skip pages the wikipedia library cannot resolve (e.g. missing or disambiguation pages)
        pass
    
pages_df_3 = pd.DataFrame(pages)
pages_df_3.head()

In [None]:
pages_df_3.to_csv('pages_df_3.csv',index=False)

In [None]:
###4th list

pages = []
for s in pages_id_list_4:
    try:
        pages.append(pages_query(s))
        print('adding page')
    except Exception:
        # Skip pages the wikipedia library cannot resolve (e.g. missing or disambiguation pages)
        pass
    
pages_df_4 = pd.DataFrame(pages)
pages_df_4.head()

In [None]:
pages_df_4.to_csv('pages_df_4.csv',index=False)

In [None]:
###5th list

pages = []
for s in pages_id_list_5:
    try:
        pages.append(pages_query(s))
        print('adding page')
    except Exception:
        # Skip pages the wikipedia library cannot resolve (e.g. missing or disambiguation pages)
        pass
    
pages_df_5 = pd.DataFrame(pages)
pages_df_5.head()

In [None]:
pages_df_5.to_csv('pages_df_5.csv',index=False)

In [None]:
###6th list

pages = []
for s in pages_id_list_6:
    try:
        pages.append(pages_query(s))
        print('adding page')
    except Exception:
        # Skip pages the wikipedia library cannot resolve (e.g. missing or disambiguation pages)
        pass
    
pages_df_6 = pd.DataFrame(pages)
pages_df_6.head()

In [None]:
pages_df_6.to_csv('pages_df_6.csv',index=False)
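
The six cells above repeat the same loop, so the pattern could be factored into one function. The sketch below is an assumption about how that might look, not the code that was actually run: `fetch_batch` and `fake_fetch` are hypothetical names, and the fetcher is passed in as an argument so the example runs without network access. It also records which ids failed instead of dropping them silently.

```python
def fetch_batch(page_ids, fetch_page):
    """Fetch every id with `fetch_page`; return (rows, failed ids with error names)."""
    rows, failed = [], []
    for page_id in page_ids:
        try:
            rows.append(fetch_page(page_id))
        except Exception as exc:
            # Keep the id so the page can be retried or inspected later.
            failed.append((page_id, type(exc).__name__))
    return rows, failed


def fake_fetch(page_id):
    """Offline stand-in for pages_query: even ids fail, odd ids succeed."""
    if page_id % 2 == 0:
        raise LookupError("no such page")
    return {"pageid": page_id, "title": "Page {}".format(page_id)}


rows, failed = fetch_batch([1, 2, 3], fake_fetch)
print(rows, failed)
```

In the notebook this would be called as `fetch_batch(pages_id_list_1, pages_query)`, with `pd.DataFrame(rows)` producing the same kind of frame as `pages_df_1`.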

In [None]:
pages_df_1 = pd.read_csv('pages_df_1.csv')
pages_df_2 = pd.read_csv('pages_df_2.csv')
pages_df_3 = pd.read_csv('pages_df_3.csv')
pages_df_4 = pd.read_csv('pages_df_4.csv')
pages_df_5 = pd.read_csv('pages_df_5.csv')
pages_df_6 = pd.read_csv('pages_df_6.csv')

pages_df = pd.concat([pages_df_1,pages_df_2,pages_df_3,pages_df_4,pages_df_5,pages_df_6], axis=0, join='outer')
# Deduplicate first, then rebuild a clean 0..n-1 index
pages_df.drop_duplicates(inplace=True)
pages_df.reset_index(drop=True,inplace=True)

In [None]:
pages_df.to_csv('pages.csv',index=False)

In [None]:
cat_df = pd.read_csv('categories_pages_extract.csv')

### Querying pages with API URLs (didn't use this method)..

Rather than using the Wikipedia Python library, I tried gathering the data directly by querying the API. However, this was difficult: the pages took a very long time to load, and the results weren't being loaded into the dataframe correctly. So I decided to use the method above instead.

In [None]:
def generate_category(category):
    category = re.sub(r'\s','+',category)
    return category

def generate_page_query(page_id):
    query = """
            http://en.wikipedia.org/w/api.php?
            action=query&
            format=json&
            prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&
            list=&
            pageids={}&
            exlimit=20&
            exintro=1&
            explaintext=1&
            exsectionformat=wiki&
            inprop=url%7Cdisplaytitle
            """.format(generate_category(page_id))
    query = re.sub(r'\s','',query)
    return query

def execute_page_query(page_id):
    r = requests.get(generate_page_query(page_id))
    response = r.json()
    return pd.DataFrame.from_dict(response['query']['pages'], orient='index')


In [None]:
execute_page_query('15702071')['extract'].values

In [None]:
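#pages_id_list = categories_df['pageid'].tolist()

# sample_list is expected to hold page ids defined earlier in the session
frames = []
for s in sample_list:
    try:
        # Query the loop variable itself, not the literal string 's'
        frames.append(execute_page_query(str(s)))
    except Exception:
        pass

# Collect one frame per page id, then concatenate them into a single dataframe
df = pd.concat(frames)
df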

### Generating Pages Table - with API URLs (didn't use this method)..

I couldn't get this to work very well because the Wikipedia API limits each URL to 50 page IDs, and of those 50 it returns only 20 text extracts at a time. This didn't seem like a good solution, so I didn't go the route below.
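
Instead of pasting the long URLs by hand, the requests could be generated from the list of page ids. This is a hedged sketch: `build_page_queries` is a hypothetical helper, the 50-id and 20-extract limits are taken from the note above, and the `prop` list is shortened for readability.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_page_queries(page_ids, ids_per_request=50):
    """Return one query URL per group of at most `ids_per_request` page ids."""
    urls = []
    for start in range(0, len(page_ids), ids_per_request):
        batch = page_ids[start:start + ids_per_request]
        params = {
            "action": "query",
            "format": "json",
            "prop": "extracts|categories|info",
            # urlencode percent-encodes the '|' separators as %7C
            "pageids": "|".join(str(page_id) for page_id in batch),
            "exlimit": 20,
            "exintro": 1,
            "explaintext": 1,
        }
        urls.append(API_URL + "?" + urlencode(params))
    return urls


# 120 made-up ids give three URLs: 50 + 50 + 20
urls = build_page_queries(list(range(1, 121)))
print(len(urls))
print(urls[0][:120])
```

Since only 20 extracts come back per request, calling it with `ids_per_request=20` would keep the extracts complete at the cost of more requests.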

In [None]:
api_links = [
    'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=48290454%7C19317802%7C51112472%7C12386904%7C49316492%7C787776%7C416612%7C5767980%7C34042707%7C28255458%7C54550729%7C847558%7C4118276%7C6968451%7C3118600%7C387537%7C9583985%7C2934910%7C28650287%7C22795783%7C17114678%7C8964665%7C39182554%7C53631046%7C44439173%7C1191936%7C50646178%7C205393%7C40678189%7C50211107%7C40973765%7C35867897%7C9732182%7C31877832%7C14003441%7C19463198%7C49242352%7C20890511%7C50773876%7C30511763%7C52642349%7C45049676%7C28801798%7C43808044%7C3771060%7C53198248%7C53587467%7C233488%7C49082762%7C43385931&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle'
    'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=10747879%7C28037054%7C54033657%7C43688866%7C4144848%7C46798615%7C41370976%7C2090057%7C5721283%7C48987892%7C470314%7C22589574%7C53279262%7C33762888%7C41644056%7C42579971%7C173926%7C5008963%7C32402755%7C1041204%7C4375576%7C50336055%7C40254%7C1222578%7C313845%7C2085584%7C34061548%7C38870173%7C36126852%7C46207323%7C1299404%7C21638340%7C14923880%7C8416103%7C460689%7C43218024%7C48833041%7C12304987%7C52003586%7C1455062%7C13750669%7C213214%7C10747995%7C45390860%7C1331441%7C12155912%7C579867%7C1422176%7C41755648%7C43169442&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle'
    'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=41200806%7C24825162%7C9731945%7C21985449%7C6881120%7C34072838%7C11360852%7C2291650%7C126706%7C40946774%7C23864280%7C173332%7C42129549%7C10748030%7C44577560%7C19208664%7C7309022%7C45627703%7C48976249%7C18475546%7C52242050%7C14082194%7C48841414%7C30909817%7C30928751%7C938663%7C2854828%7C49786340%7C33998310%7C871681%7C4615464%7C12306500%7C44628821%7C47937215%7C53049812%7C48777199%7C53970843%7C53802271%7C5721403%7C45378845%7C44632031%7C53047074%7C46963137%7C523173%7C35272263%7C33890474%7C23864530%7C25050663%7C3274742%7C48827727&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle'
    'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=42936114%7C33547387%7C47527969%7C52992310%7C7578809%7C43502368%7C39945557%7C47228422%7C233497%7C37787103%7C22999791%7C19058043%7C5077439%7C47577902%7C3920550%7C960361%7C50828755%7C1514392%7C20926%7C3119546%7C48844125%7C10704974%7C47845063%7C19667111%7C1053303%7C1579244%7C33886025%7C50227596%7C48813654%7C405562%7C47509138%7C38059657%7C29288159%7C2829632%7C50222574%7C14271782%7C995455%7C43269516%7C7517319%7C3290880%7C38782554%7C54625345%7C35887507%7C926722%7C43932548%7C37697003%7C14529261%7C41929726%7C44108758%7C41732818&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle'
    'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=34845963%7C32797209%7C25957629%7C41672405%7C53108275%7C11971726%7C522230%7C38722262%7C34026570%7C53113973%7C19657756%7C35959361%7C22847264%7C1762176%7C54594603%7C24061342%7C12715119%7C44133735%7C28502793%7C5211212%7C41270069%7C1037763%7C52763828%7C31176997%7C52763867%7C40149461%7C11737376%7C52763829%7C3832584%7C24059390%7C47991509%7C44943481%7C33547228%7C28004586%7C36407925%7C19314112%7C20924581%7C12535256%7C29003796%7C3061615%7C754055%7C3985352%7C29549713%7C42320378%7C33542714%7C12932492%7C34310097%7C22532673%7C1991254%7C1718975&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle'
    'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=17106978%7C1660841%7C26480448%7C16342561%7C34900000%7C40502799%7C9133131%7C24093035%7C469578%7C5988487%7C30429756%7C27208838%7C32958985%7C43689922%7C32244195%7C4166591%7C50994297%7C36808856%7C8385046%7C39219632%7C36862865%7C49791445%7C48824910%7C33085387%7C1733999%7C2528278%7C1155952%7C42005%7C32421587%7C43673868%7C22832929%7C50421011%7C26137900%7C24104531%7C48234685%7C41053071%7C32613108%7C40158733%7C53640074%7C7106579%7C18841448%7C4032051%7C1631564%7C3543438%7C27954944%7C3927666%7C54133478%7C38935938%7C2093407%7C33065316&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle'
    'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=31975428%7C6987871%7C42363103%7C22230795%7C3691948%7C1174674%7C15516115%7C13443187%7C20391794%7C41669016%7C39946273%7C8522483%7C27976545%7C3070013%7C18487118%7C8083806%7C41667831%7C40693259%7C8343747%7C11028436%7C54135637%7C31954624%7C28544576%7C20414115%7C52641508%7C786016%7C28370294%7C18515583%7C34293559%7C49107824%7C2302514%7C47967038%7C12640130%7C24310774%7C52993539%7C2286665%7C904795%7C9597318%7C22807593%7C6603087%7C34208511%7C7674023%7C12840831%7C6456883%7C16144050%7C49680032%7C8648665%7C24747714%7C30522786%7C3445672&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle'
    'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=39238529%7C25822348%7C54083259%7C2672138%7C2988291%7C22162830%7C40863938%7C1950870%7C1758239%7C23062020%7C237494%7C13452317%7C2370618%7C51412587%7C24200863%7C1743401%7C18096221%7C4996092%7C12185719%7C16567431%7C39244415%7C3139122%7C2029470%7C53969134%7C50313336%7C40312213%7C19885252%7C2011728%7C9611373%7C42812494%7C3620743%7C34025898%7C3568755%7C2361538%7C36488797%7C8793238%7C6093560%7C48231694%7C2901621%7C8148765%7C41261061%7C30708494%7C30703633%7C30727442%7C4175227%7C1971849%7C48547307%7C13224058%7C24766750%7C54205765&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle'
    'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=47642826%7C27165706%7C32768675%7C49677616%7C2585120%7C46753689%7C22602354%7C17989579%7C17916442%7C23232660%7C51783746%7C41399235%7C34045703%7C44504550%7C36731269%7C33242335%7C40252780%7C7686145%7C29345392%7C15628716%7C638133%7C42113520%7C5588452%7C19029406%7C15465111%7C13264388%7C220633%7C40928146%7C30864276%7C25114536%7C6903579%7C3069932%7C52048578%7C28192984%7C1699107%7C45075900%7C26703974%7C52529493%7C9409598%7C53815511%7C47305095%7C5851855%7C3849008%7C41653781%7C27656596%7C41102584%7C41352219%7C31897073%7C28836625%7C50498591&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle'
    'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=11154407%7C1290733%7C28096875%7C19925726%7C49258658%7C30518856%7C30821028%7C41533727%7C34027125%7C50644682%7C19140610%7C49478757%7C33761479%7C2399787%7C4175276%7C49521169%7C38109156%7C7364194%7C34793528%7C5845752%7C206586%7C2756846%7C30890479%7C42865956%7C19547325%7C24363618%7C30233970%7C40402832%7C28847674%7C26480028%7C33439781%7C3765816%7C32845520%7C6163915%7C25279503%7C31800111%7C22350998%7C12941228%7C25296555%7C2269864%7C43024112%7C49788699%7C30797416%7C35252701%7C37789882%7C22038673%7C28005288%7C39853660%7C47113160%7C19674177&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle'
    'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=29576811%7C1706137%7C30220188%7C21137368%7C34292819%7C38739665%7C27490351%7C30303453%7C3813661%7C23785395%7C21283046%7C3964070%7C6702491%7C8733307%7C2504415%7C45686490%7C24925014%7C15459030%7C41517953%7C41517954%7C37041135%7C14541812%7C17799618%7C20727104%7C34061482%7C33185688%7C33906593%7C3234557%7C39214950%7C1693503%7C14568189%7C9466979%7C51957134%7C50515193%7C2840555%7C25745688%7C14546175%7C51567252%7C40453325%7C36292664%7C15095926%7C1745691%7C53984403%7C40727883%7C24272165%7C42460055%7C11648918%7C35152919%7C35127328%7C15309361&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle'
    'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=9622164%7C23739219%7C9186941%7C53417207%7C26651713%7C38798197%7C9761144%7C2543335&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle'
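]

# Adjacent string literals with no comma between them are concatenated by Python,
# so split the joined text back into one URL per request
api_links = ['https://' + part for part in ''.join(api_links).split('https://') if part]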

In [None]:
pages = {}
for link in api_links:
    page = urllib.request.urlopen(link)
    str_response = page.read().decode('utf-8')
    data = json.loads(str_response)
    # Merge the per-page entries; updating with data['query'] itself would
    # overwrite its single 'pages' key on every iteration
    pages.update(data['query']['pages'])

In [None]:
pages_1 = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=48290454%7C19317802%7C51112472%7C12386904%7C49316492%7C787776%7C416612%7C5767980%7C34042707%7C28255458%7C54550729%7C847558%7C4118276%7C6968451%7C3118600%7C387537%7C9583985%7C2934910%7C28650287%7C22795783%7C17114678%7C8964665%7C39182554%7C53631046%7C44439173%7C1191936%7C50646178%7C205393%7C40678189%7C50211107%7C40973765%7C35867897%7C9732182%7C31877832%7C14003441%7C19463198%7C49242352%7C20890511%7C50773876%7C30511763%7C52642349%7C45049676%7C28801798%7C43808044%7C3771060%7C53198248%7C53587467%7C233488%7C49082762%7C43385931&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle")
str_response_1 = pages_1.read().decode('utf-8')
data_1 = json.loads(str_response_1)
query_1 = data_1['query']
pages_1 = pd.DataFrame(query_1['pages']).T

pages_2 = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=10747879%7C28037054%7C54033657%7C43688866%7C4144848%7C46798615%7C41370976%7C2090057%7C5721283%7C48987892%7C470314%7C22589574%7C53279262%7C33762888%7C41644056%7C42579971%7C173926%7C5008963%7C32402755%7C1041204%7C4375576%7C50336055%7C40254%7C1222578%7C313845%7C2085584%7C34061548%7C38870173%7C36126852%7C46207323%7C1299404%7C21638340%7C14923880%7C8416103%7C460689%7C43218024%7C48833041%7C12304987%7C52003586%7C1455062%7C13750669%7C213214%7C10747995%7C45390860%7C1331441%7C12155912%7C579867%7C1422176%7C41755648%7C43169442&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle")
str_response_2 = pages_2.read().decode('utf-8')
data_2 = json.loads(str_response_2)
query_2 = data_2['query']
pages_2 = pd.DataFrame(query_2['pages']).T

pages_3 = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=41200806%7C24825162%7C9731945%7C21985449%7C6881120%7C34072838%7C11360852%7C2291650%7C126706%7C40946774%7C23864280%7C173332%7C42129549%7C10748030%7C44577560%7C19208664%7C7309022%7C45627703%7C48976249%7C18475546%7C52242050%7C14082194%7C48841414%7C30909817%7C30928751%7C938663%7C2854828%7C49786340%7C33998310%7C871681%7C4615464%7C12306500%7C44628821%7C47937215%7C53049812%7C48777199%7C53970843%7C53802271%7C5721403%7C45378845%7C44632031%7C53047074%7C46963137%7C523173%7C35272263%7C33890474%7C23864530%7C25050663%7C3274742%7C48827727&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle")
str_response_3 = pages_3.read().decode('utf-8')
data_3 = json.loads(str_response_3)
query_3 = data_3['query']
pages_3 = pd.DataFrame(query_3['pages']).T

pages_4 = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=42936114%7C33547387%7C47527969%7C52992310%7C7578809%7C43502368%7C39945557%7C47228422%7C233497%7C37787103%7C22999791%7C19058043%7C5077439%7C47577902%7C3920550%7C960361%7C50828755%7C1514392%7C20926%7C3119546%7C48844125%7C10704974%7C47845063%7C19667111%7C1053303%7C1579244%7C33886025%7C50227596%7C48813654%7C405562%7C47509138%7C38059657%7C29288159%7C2829632%7C50222574%7C14271782%7C995455%7C43269516%7C7517319%7C3290880%7C38782554%7C54625345%7C35887507%7C926722%7C43932548%7C37697003%7C14529261%7C41929726%7C44108758%7C41732818&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle")
str_response_4 = pages_4.read().decode('utf-8')
data_4 = json.loads(str_response_4)
query_4 = data_4['query']
pages_4 = pd.DataFrame(query_4['pages']).T

pages_5 = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=34845963%7C32797209%7C25957629%7C41672405%7C53108275%7C11971726%7C522230%7C38722262%7C34026570%7C53113973%7C19657756%7C35959361%7C22847264%7C1762176%7C54594603%7C24061342%7C12715119%7C44133735%7C28502793%7C5211212%7C41270069%7C1037763%7C52763828%7C31176997%7C52763867%7C40149461%7C11737376%7C52763829%7C3832584%7C24059390%7C47991509%7C44943481%7C33547228%7C28004586%7C36407925%7C19314112%7C20924581%7C12535256%7C29003796%7C3061615%7C754055%7C3985352%7C29549713%7C42320378%7C33542714%7C12932492%7C34310097%7C22532673%7C1991254%7C1718975&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle")
str_response_5 = pages_5.read().decode('utf-8')
data_5 = json.loads(str_response_5)
query_5 = data_5['query']
pages_5 = pd.DataFrame(query_5['pages']).T

pages_6 = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=17106978%7C1660841%7C26480448%7C16342561%7C34900000%7C40502799%7C9133131%7C24093035%7C469578%7C5988487%7C30429756%7C27208838%7C32958985%7C43689922%7C32244195%7C4166591%7C50994297%7C36808856%7C8385046%7C39219632%7C36862865%7C49791445%7C48824910%7C33085387%7C1733999%7C2528278%7C1155952%7C42005%7C32421587%7C43673868%7C22832929%7C50421011%7C26137900%7C24104531%7C48234685%7C41053071%7C32613108%7C40158733%7C53640074%7C7106579%7C18841448%7C4032051%7C1631564%7C3543438%7C27954944%7C3927666%7C54133478%7C38935938%7C2093407%7C33065316&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle")
str_response_6 = pages_6.read().decode('utf-8')
data_6 = json.loads(str_response_6)
query_6 = data_6['query']
pages_6 = pd.DataFrame(query_6['pages']).T

pages_7 = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=31975428%7C6987871%7C42363103%7C22230795%7C3691948%7C1174674%7C15516115%7C13443187%7C20391794%7C41669016%7C39946273%7C8522483%7C27976545%7C3070013%7C18487118%7C8083806%7C41667831%7C40693259%7C8343747%7C11028436%7C54135637%7C31954624%7C28544576%7C20414115%7C52641508%7C786016%7C28370294%7C18515583%7C34293559%7C49107824%7C2302514%7C47967038%7C12640130%7C24310774%7C52993539%7C2286665%7C904795%7C9597318%7C22807593%7C6603087%7C34208511%7C7674023%7C12840831%7C6456883%7C16144050%7C49680032%7C8648665%7C24747714%7C30522786%7C3445672&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle")
str_response_7 = pages_7.read().decode('utf-8')
data_7 = json.loads(str_response_7)
query_7 = data_7['query']
pages_7 = pd.DataFrame(query_7['pages']).T

pages_8 = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=39238529%7C25822348%7C54083259%7C2672138%7C2988291%7C22162830%7C40863938%7C1950870%7C1758239%7C23062020%7C237494%7C13452317%7C2370618%7C51412587%7C24200863%7C1743401%7C18096221%7C4996092%7C12185719%7C16567431%7C39244415%7C3139122%7C2029470%7C53969134%7C50313336%7C40312213%7C19885252%7C2011728%7C9611373%7C42812494%7C3620743%7C34025898%7C3568755%7C2361538%7C36488797%7C8793238%7C6093560%7C48231694%7C2901621%7C8148765%7C41261061%7C30708494%7C30703633%7C30727442%7C4175227%7C1971849%7C48547307%7C13224058%7C24766750%7C54205765&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle")
str_response_8 = pages_8.read().decode('utf-8')
data_8 = json.loads(str_response_8)
query_8 = data_8['query']
pages_8 = pd.DataFrame(query_8['pages']).T

pages_9 = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=47642826%7C27165706%7C32768675%7C49677616%7C2585120%7C46753689%7C22602354%7C17989579%7C17916442%7C23232660%7C51783746%7C41399235%7C34045703%7C44504550%7C36731269%7C33242335%7C40252780%7C7686145%7C29345392%7C15628716%7C638133%7C42113520%7C5588452%7C19029406%7C15465111%7C13264388%7C220633%7C40928146%7C30864276%7C25114536%7C6903579%7C3069932%7C52048578%7C28192984%7C1699107%7C45075900%7C26703974%7C52529493%7C9409598%7C53815511%7C47305095%7C5851855%7C3849008%7C41653781%7C27656596%7C41102584%7C41352219%7C31897073%7C28836625%7C50498591&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle")
str_response_9 = pages_9.read().decode('utf-8')
data_9 = json.loads(str_response_9)
query_9 = data_9['query']
pages_9 = pd.DataFrame(query_9['pages']).T

pages_10 = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=11154407%7C1290733%7C28096875%7C19925726%7C49258658%7C30518856%7C30821028%7C41533727%7C34027125%7C50644682%7C19140610%7C49478757%7C33761479%7C2399787%7C4175276%7C49521169%7C38109156%7C7364194%7C34793528%7C5845752%7C206586%7C2756846%7C30890479%7C42865956%7C19547325%7C24363618%7C30233970%7C40402832%7C28847674%7C26480028%7C33439781%7C3765816%7C32845520%7C6163915%7C25279503%7C31800111%7C22350998%7C12941228%7C25296555%7C2269864%7C43024112%7C49788699%7C30797416%7C35252701%7C37789882%7C22038673%7C28005288%7C39853660%7C47113160%7C19674177&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle")
str_response_10 = pages_10.read().decode('utf-8')
data_10 = json.loads(str_response_10)
query_10 = data_10['query']
pages_10 = pd.DataFrame(query_10['pages']).T

pages_11 = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=29576811%7C1706137%7C30220188%7C21137368%7C34292819%7C38739665%7C27490351%7C30303453%7C3813661%7C23785395%7C21283046%7C3964070%7C6702491%7C8733307%7C2504415%7C45686490%7C24925014%7C15459030%7C41517953%7C41517954%7C37041135%7C14541812%7C17799618%7C20727104%7C34061482%7C33185688%7C33906593%7C3234557%7C39214950%7C1693503%7C14568189%7C9466979%7C51957134%7C50515193%7C2840555%7C25745688%7C14546175%7C51567252%7C40453325%7C36292664%7C15095926%7C1745691%7C53984403%7C40727883%7C24272165%7C42460055%7C11648918%7C35152919%7C35127328%7C15309361&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle")
str_response_11 = pages_11.read().decode('utf-8')
data_11 = json.loads(str_response_11)
query_11 = data_11['query']
pages_11 = pd.DataFrame(query_11['pages']).T

pages_12 = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Ccategories%7Cinfo%7Cimages%7Clinks%7Crevisions%7Ccategoryinfo&list=&pageids=9622164%7C23739219%7C9186941%7C53417207%7C26651713%7C38798197%7C9761144%7C2543335&exlimit=20&exintro=1&explaintext=1&exsectionformat=wiki&inprop=url%7Cdisplaytitle")
str_response_12 = pages_12.read().decode('utf-8')
data_12 = json.loads(str_response_12)
query_12 = data_12['query']
pages_12 = pd.DataFrame(query_12['pages']).T

pages_df = pd.concat([pages_1,pages_2,pages_3,pages_4,pages_5,pages_6,pages_7,pages_8,pages_9,pages_10,pages_11,pages_12], axis=0, join='outer')
pages_df.sample(10)

In [None]:
query_1

In [None]:
d1df = pd.DataFrame.from_dict(data_1['query']['pages'], orient='index')

In [None]:
d1df.categories.values