## ORES comparison for clustered vital 1k + links dataset

**Author:** Jim Maddock

**Last Updated:** 6-14-20

**Description:** The purpose of this notebook is to compare the categories (clusters) from the EdCast 1000 taxonomy to the categories predicted by ORES.  This notebook will:

* Categorize all articles in each cluster, then look at distribution of categories in each cluster - **DONE**
* Categorize all articles in each cluster and assign a label based on the most frequent category. Look at whether the EdCast label makes sense compared to the ORES label.
* Categorize all articles.  Look at the number of clusters that appear in each ORES predicted category.

We can run these evaluations using 2 different ORES datasets:

* **ORES predicted topics:** Overlap with this dataset shows us how the EdCast model compares to ORES *predictions*.  The advantage is that we can use a much larger dataset (the entirety of Wikipedia).  The disadvantage is that these are predictions, so we can't necessarily assume that ORES represents the "correct" or ground truth answer.  - **DONE**
* **ORES training data:** Overlap with this dataset shows how the EdCast model compares to a human-crafted taxonomy.  This dataset is much smaller, but we can assume that all articles are correctly categorized.

In [1]:
import json
import pandas as pd
from oresapi import Session
import requests

import matplotlib.pyplot as plt
import matplotlib

matplotlib.style.use('ggplot')
pd.set_option("display.min_rows", 100)
pd.set_option("display.max_rows", 100)

### Notebook Setup
Import the EdCast data from a JSON file and create a pandas dataframe.  Import the ORES predictions from https://figshare.com/articles/Topics_for_each_Wikipedia_Article_across_Languages/12127434 and merge them with the EdCast dataframe.

In [4]:
FILEPATH = '/Users/klogg/research_data/wmf_knowledge_graph/wiki_5-28-20/wiki_1000_clusters_6-8-20.json'

with open(FILEPATH) as json_file:
    cluster_65 = json.load(json_file)

In [5]:
rows = []

# Flatten the cluster dictionary into one row per article.
for i, cluster in enumerate(cluster_65):
    for article in cluster_65[cluster]['items']:
        rows.append({
            'label':cluster_65[cluster]['label'],
            'cluster':cluster_65[cluster]['cluster'],
            'w':article['w'],
            'title':article['title']
        })
    print('finished cluster: {0}'.format(i))

# Build the dataframe once; DataFrame.append was removed in pandas 2.0.
df = pd.DataFrame(rows)

# Titles use underscores instead of spaces so they can be merged with the ORES topics.
df['title'] = df['title'].apply(lambda x: x.replace(' ','_'))

finished cluster: 0
finished cluster: 1
finished cluster: 2
...

In [6]:
FILEPATH = '/Users/klogg/research_data/wmf_knowledge_graph/topics/topicsForAllWikipediaPages.csv'

topics = pd.read_csv(FILEPATH,escapechar='\\')

In [7]:
topics = topics.loc[topics['wiki_db'] == 'enwiki']
topics = topics.rename(columns={'page_title':'title'})
topics = topics.sort_values(['title','probability'],ascending=False)
topics = topics.reset_index(drop=True).drop_duplicates(subset = ['title','Qid'],keep='first')

In [8]:
df = df.merge(topics, on='title', how='left')
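
The deduplication and left merge above can be illustrated on toy data. This is a minimal sketch, not the notebook's actual data: the frames `edcast` and `ores` and the second topic row for `Wheel` are made up for illustration. It keeps the highest-probability topic per article and uses the `indicator` option of `merge` to list EdCast articles that have no ORES prediction (these show up with an empty `topic` later in the notebook).

```python
import pandas as pd

# Toy stand-ins for the notebook's `df` (EdCast) and `topics` (ORES) frames.
edcast = pd.DataFrame({
    "label": ["c1", "c1", "c2"],
    "title": ["Wheel", "Spoke", "Mixed-use_development"],
})
ores = pd.DataFrame({
    "title": ["Wheel", "Wheel", "Spoke"],
    "Qid": ["Q446", "Q446", "Q845671"],
    "topic": ["STEM.STEM*", "Culture.Sports", "STEM.STEM*"],  # second Wheel row is hypothetical
    "probability": [0.99, 0.55, 0.71],
})

# Sort so the highest probability comes first, then keep one row per article.
top_topic = (
    ores.sort_values(["title", "probability"], ascending=False)
        .drop_duplicates(subset=["title", "Qid"], keep="first")
)

# A left merge keeps every EdCast article; `_merge` flags the unmatched ones.
merged = edcast.merge(top_topic, on="title", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "title"].tolist()
print(unmatched)
```

Counting the `left_only` rows on the real data would show how many EdCast articles are silently excluded from the topic statistics, since `groupby` drops rows with a missing `topic`.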

### EdCast + ORES topic prediction comparison
**Number of ORES topics per EdCast cluster:** Each EdCast cluster spans between 1.6 and 49 percent of the available ORES topics (about 17 percent on average).  This indicates that the EdCast 1000 model offers some improvement over the EdCast 65 model, but does not always produce the same categorization as ORES.

In [9]:
topic_dist = df.groupby('label')['topic'].nunique().describe().to_frame('count')
topic_dist['percent'] = topic_dist['count'].divide(df['topic'].nunique())*100
topic_dist

Unnamed: 0,count,percent
count,1000.0,1587.301587
mean,10.746,17.057143
std,5.299393,8.411735
min,1.0,1.587302
25%,7.0,11.111111
50%,10.0,15.873016
75%,14.0,22.222222
max,31.0,49.206349
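
One detail of the table above: the `count` row of `describe()` is the number of clusters (1000), not a number of topics, so dividing it by the total topic count gives a percentage (1587) that has no meaning. The sketch below reproduces the same calculation on toy data and blanks out that cell. The labels and topics are hypothetical.

```python
import pandas as pd

# Toy data: three clusters, four distinct topics overall (hypothetical values).
toy = pd.DataFrame({
    "label": ["a", "a", "a", "b", "b", "c"],
    "topic": ["T1", "T2", "T2", "T1", "T3", "T4"],
})

n_topics = toy["topic"].nunique()                       # distinct topics overall
per_cluster = toy.groupby("label")["topic"].nunique()   # distinct topics per cluster

summary = per_cluster.describe().to_frame("count")
summary["percent"] = summary["count"] / n_topics * 100

# The `count` row holds the number of clusters, so its percentage is blanked out.
summary.loc["count", "percent"] = float("nan")
print(summary)
```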


**Distribution of number of topics per edcast cluster:** When we look at the distribution of the number of articles in each ORES predicted topic:

**positive:**
* most of these clusters have a long-tail distribution, which means that even though clusters contain between 1.6 and 49 percent of topics, most of these topics each contain only a small share of the total articles in the cluster.
* a few edcast clusters belong predominantly to a single topic
    * e.g. "numb_function_theory_space" is 74% "STEM.STEM*"
    * "plant_forest_species_tree" is 72% "STEM.STEM*"
    
**negative:**
* the median "max percentage" is 44%, which means that in half of the EdCast clusters the largest topic covers less than half of the articles
* the min is ~9%, which means "tourism_travel_world_organization" is split among more than 10 categories

In [10]:
percent_df = df.groupby(['label','topic']).size().to_frame('count').reset_index()
percent_df['sum'] = percent_df.groupby('label')['count'].transform('sum')
percent_df['percent'] = (percent_df['count'].divide(percent_df['sum']))*100
percent_df = percent_df.sort_values(['label','percent'])
percent_df.groupby('label')['percent'].describe().sort_values('max')

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
tourism_travel_world_organization,21.0,4.761905,2.456495,3.030303,3.030303,3.030303,6.060606,9.090909
stimulant_lambda_equal_temperament,10.0,10.000000,0.000000,10.000000,10.000000,10.000000,10.000000,10.000000
world_new_global_globalization,25.0,4.000000,3.401507,1.818182,1.818182,1.818182,5.454545,10.909091
british_war_kingdom_unite,31.0,3.225806,3.275925,1.041667,1.041667,1.041667,4.166667,12.500000
list_egypt_comedian_launch,29.0,3.448276,3.019824,1.098901,1.098901,2.197802,4.395604,13.186813
academy_suicide_military_science,20.0,5.000000,3.849760,2.702703,2.702703,2.702703,5.405405,13.513514
east_middle_timor_meridian,16.0,6.250000,3.139951,3.448276,3.448276,6.896552,6.896552,13.793103
post_economy_note_huffpost,17.0,5.882353,3.609322,3.448276,3.448276,3.448276,10.344828,13.793103
area_metropolitan_great_region,24.0,4.166667,3.385022,1.754386,1.754386,3.508772,5.263158,14.035088
american_list_association_union,21.0,4.761905,3.541542,1.785714,1.785714,3.571429,7.142857,14.285714


**Percentage of articles per EdCast cluster that belong to the largest predicted ORES category:** i.e. the median value is 44%, which indicates that in 50% of EdCast clusters, the largest predicted ORES category contains less than 44% of the total articles in that cluster.

In [11]:
percent_df.groupby('label')['percent'].describe()['max'].describe()

count    1000.000000
mean       46.290672
std        18.831501
min         9.090909
25%        31.578947
50%        44.000000
75%        58.974359
max       100.000000
Name: max, dtype: float64
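
The largest share per cluster also points to a candidate label for the second evaluation listed in the description (labeling each cluster by its most frequent ORES category). A minimal sketch on toy data, assuming the same `label`/`topic` column layout as `percent_df`; the values are hypothetical and this is one possible approach, not the notebook's implementation:

```python
import pandas as pd

# Toy article-level data (hypothetical labels and topics).
toy = pd.DataFrame({
    "label": ["a", "a", "a", "a", "b", "b"],
    "topic": ["T1", "T1", "T1", "T2", "T1", "T3"],
})

# Articles per (cluster, topic) and each topic's share of its cluster.
counts = toy.groupby(["label", "topic"]).size().to_frame("count").reset_index()
counts["percent"] = counts["count"] / counts.groupby("label")["count"].transform("sum") * 100

# The row with the largest share in each cluster gives the majority topic.
majority = counts.loc[counts.groupby("label")["percent"].idxmax()].set_index("label")
print(majority[["topic", "percent"]])
```

When two topics tie for the largest share (cluster `b` here), `idxmax` returns the first one it encounters, so a tie-breaking rule would be needed before comparing these labels against the EdCast labels by hand.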

**Percentage of articles per EdCast cluster that belong to the median predicted ORES category:** i.e. the median value is 3.85%, which indicates that in 50% of EdCast clusters, the middle-sized predicted ORES category contains less than 3.85% of the total articles in that cluster.

In [12]:
percent_df.groupby('label')['percent'].describe()['50%'].describe()

count    1000.000000
mean        6.106182
std         7.903794
min         0.529101
25%         2.325581
50%         3.846154
75%         6.666667
max       100.000000
Name: 50%, dtype: float64

### EdCast + ORES topic prediction samples

**Number of predicted ORES categories for each EdCast cluster**:

In [13]:
df.groupby('label')['topic'].nunique().sort_values()

label
mongoose_tail_stripe_slender                             1
consonant_fricative_voice_voiceless                      2
alexander_sarcophagus_louis_poinsot                      2
artery_cerebral_coronary_syndrome                        2
neptune_ring_moon_trojan                                 2
artery_aortic_aorta_coronary                             2
syndrome_atrophy_muscular_progressive                    2
vaccine_polio_vaccination_influenza                      2
list_zarzuela_composer_e                                 2
phylum_clade_pucciniomycetes_polyphyly                   2
supernova_sn_list_remnant                                2
pulsar_nebula_taylor_binary                              2
mood_tense_future_present                                2
rov_yakov_joseph_billings                                2
matthew_luke_mark_source                                 3
lie_yong_lam_ed                                          3
vein_vena_cavum_sinus                             

**Articles and predicted topics in the "use_history_build_geophysics" EdCast cluster:**

In [32]:
df.loc[df['label'] == 'use_history_build_geophysics'][['title','topic']]

Unnamed: 0,title,topic
34769,Fair_use,History_and_Society.Politics_and_government
34770,Mixed-use_development,
34771,Placeholder_name,Culture.Linguistics
34772,Muti,STEM.Medicine_&_Health
34773,Substance_use_disorder,STEM.Medicine_&_Health
34774,Decimal_separator,STEM.STEM*
34775,Tool_use_by_animals,STEM.STEM*
34776,Textile,STEM.STEM*
34777,Second_Industrial_Revolution,History_and_Society.History
34778,Arrowhead,History_and_Society.History


**Number of Articles in each ORES predicted topic in the "cycle_bicycle_wheel_drive" EdCast cluster:**
* This is super long-tailed.  All but 3 of the 29 categories have less than 5 articles
* Most of the articles belong to STEM or Sports, which is because articles are either about bicycles or biological/chemical cycles (see the cell below this one for a more detailed view)

In [17]:
df.loc[df['label'] == 'cycle_bicycle_wheel_drive'].groupby('topic').size().sort_values()

Series([], dtype: int64)

In [139]:
df.loc[(df['label'] == 'cycle_bicycle_wheel_drive') & (df['topic'] == 'STEM.STEM*')]

Unnamed: 0,label,cluster,w,title,Qid,topic,probability,page_id,wiki_db
25781,cycle_bicycle_wheel_drive,28,0.606544,Penny-farthing,Q1554036,STEM.STEM*,0.79,183351.0,enwiki
25786,cycle_bicycle_wheel_drive,28,0.584621,Atkinson_cycle,Q384952,STEM.STEM*,0.99,519209.0,enwiki
25791,cycle_bicycle_wheel_drive,28,0.541644,Ericsson_cycle,Q384539,STEM.STEM*,0.98,826277.0,enwiki
25792,cycle_bicycle_wheel_drive,28,0.540536,Miller_cycle,Q657886,STEM.STEM*,1.0,74761.0,enwiki
25801,cycle_bicycle_wheel_drive,28,0.495984,Steering_wheel,Q679300,STEM.STEM*,0.92,772013.0,enwiki
25806,cycle_bicycle_wheel_drive,28,0.488096,Wheel,Q446,STEM.STEM*,0.99,33555.0,enwiki
25815,cycle_bicycle_wheel_drive,28,0.466507,Kleemenko_cycle,Q6420078,STEM.STEM*,1.0,19871306.0,enwiki
25816,cycle_bicycle_wheel_drive,28,0.466175,Stirling_cycle,Q910550,STEM.STEM*,1.0,247323.0,enwiki
25818,cycle_bicycle_wheel_drive,28,0.463178,Spoke,Q845671,STEM.STEM*,0.71,170836.0,enwiki
25820,cycle_bicycle_wheel_drive,28,0.454256,Glyoxylate_cycle,Q575119,STEM.STEM*,1.0,3322454.0,enwiki


**Number of Articles in each ORES predicted topic in the "state_unite_union_liberalism" EdCast cluster:**  It's possible to see why most of these articles would be related, but we're probably not producing better categories.  Might be worth validating this by hand.

In [206]:
df.loc[df['label'] == 'state_unite_union_liberalism'].groupby('topic').size().sort_values()

topic
STEM.Space                                       1
STEM.Mathematics                                 1
STEM.Physics                                     1
STEM.Libraries_&_Information                     1
Geography.Regions.Africa.Western_Africa          1
Culture.Media.Entertainment                      1
Geography.Regions.Africa.Southern_Africa         1
STEM.Computing                                   1
Culture.Internet_culture                         1
Geography.Regions.Africa.Eastern_Africa          1
Culture.Performing_arts                          2
Culture.Sports                                   2
Culture.Visual_arts.Architecture                 2
History_and_Society.Education                    2
Culture.Visual_arts.Visual_arts*                 2
Culture.Food_and_drink                           2
STEM.Engineering                                 2
History_and_Society.Transportation               3
STEM.Technology                                  3
Culture.Media.Music      

In [147]:
df.loc[(df['label'] == 'state_unite_union_liberalism') & (df['topic'] == 'History_and_Society.Society')]

Unnamed: 0,label,cluster,w,title,Qid,topic,probability,page_id,wiki_db
46041,state_unite_union_liberalism,49,0.645006,Self-determination,Q156595,History_and_Society.Society,1.00,29269.0,enwiki
46042,state_unite_union_liberalism,49,0.641665,Voting_rights_in_the_United_States,Q405566,History_and_Society.Society,1.00,667785.0,enwiki
46070,state_unite_union_liberalism,49,0.600630,Freedom_of_religion_in_the_United_States,Q2142732,History_and_Society.Society,1.00,7493150.0,enwiki
46091,state_unite_union_liberalism,49,0.581370,Mass_surveillance,Q1425056,History_and_Society.Society,1.00,331195.0,enwiki
46097,state_unite_union_liberalism,49,0.578095,Arab_Spring,Q33761,History_and_Society.Society,0.52,30655949.0,enwiki
46122,state_unite_union_liberalism,49,0.565052,Freedom_of_movement,Q1344824,History_and_Society.Society,1.00,1270497.0,enwiki
46124,state_unite_union_liberalism,49,0.563690,McCarthyism,Q207066,History_and_Society.Society,0.99,43805.0,enwiki
46155,state_unite_union_liberalism,49,0.548138,Commonwealth_of_Nations,Q7785,History_and_Society.Society,1.00,21175158.0,enwiki
46162,state_unite_union_liberalism,49,0.546457,United_Nations_Human_Rights_Council,Q205650,History_and_Society.Society,0.99,635790.0,enwiki
46169,state_unite_union_liberalism,49,0.542284,Forced_disappearance,Q1288449,History_and_Society.Society,1.00,686148.0,enwiki


**Percentages (rather than absolute values) for each ORES category in a few EdCast clusters:** This reflects the earlier distribution analysis in this notebook.

In [187]:
percent_df.loc[percent_df['label'] == 'al_ibn_islam_islamic'].sort_values('percent')

Unnamed: 0,label,topic,count,sum,percent
58,al_ibn_islam_islamic,Geography.Regions.Europe.Northern_Europe,1,614,0.162866
36,al_ibn_islam_islamic,Culture.Food_and_drink,1,614,0.162866
37,al_ibn_islam_islamic,Culture.Internet_culture,1,614,0.162866
69,al_ibn_islam_islamic,STEM.Physics,1,614,0.162866
50,al_ibn_islam_islamic,Geography.Regions.Africa.Eastern_Africa,1,614,0.162866
67,al_ibn_islam_islamic,STEM.Libraries_&_Information,1,614,0.162866
42,al_ibn_islam_islamic,Culture.Media.Music,1,614,0.162866
43,al_ibn_islam_islamic,Culture.Media.Television,1,614,0.162866
47,al_ibn_islam_islamic,Culture.Visual_arts.Visual_arts*,1,614,0.162866
45,al_ibn_islam_islamic,Culture.Sports,1,614,0.162866


In [188]:
percent_df.loc[percent_df['label'] == 'book_comic_fiction_publish'].sort_values('percent')

Unnamed: 0,label,topic,count,sum,percent
160,book_comic_fiction_publish,STEM.Technology,1,840,0.119048
143,book_comic_fiction_publish,Geography.Regions.Europe.Eastern_Europe,1,840,0.119048
138,book_comic_fiction_publish,Geography.Regions.Asia.Central_Asia,1,840,0.119048
144,book_comic_fiction_publish,Geography.Regions.Europe.Europe*,1,840,0.119048
134,book_comic_fiction_publish,Geography.Regions.Americas.Central_America,1,840,0.119048
133,book_comic_fiction_publish,Geography.Regions.Africa.Eastern_Africa,1,840,0.119048
148,book_comic_fiction_publish,Geography.Regions.Oceania,1,840,0.119048
141,book_comic_fiction_publish,Geography.Regions.Asia.Southeast_Asia,1,840,0.119048
151,book_comic_fiction_publish,History_and_Society.Politics_and_government,1,840,0.119048
154,book_comic_fiction_publish,STEM.Computing,1,840,0.119048


In [202]:
percent_df.loc[percent_df['label'] == 'plant_forest_species_tree'].sort_values('percent')

Unnamed: 0,label,topic,count,sum,percent
1779,plant_forest_species_tree,STEM.Space,1,1419,0.070472
1777,plant_forest_species_tree,STEM.Physics,1,1419,0.070472
1743,plant_forest_species_tree,Culture.Internet_culture,1,1419,0.070472
1775,plant_forest_species_tree,STEM.Libraries_&_Information,1,1419,0.070472
1745,plant_forest_species_tree,Culture.Literature,1,1419,0.070472
1773,plant_forest_species_tree,STEM.Chemistry,1,1419,0.070472
1747,plant_forest_species_tree,Culture.Media.Media*,1,1419,0.070472
1748,plant_forest_species_tree,Culture.Media.Music,1,1419,0.070472
1771,plant_forest_species_tree,History_and_Society.Transportation,1,1419,0.070472
1766,plant_forest_species_tree,History_and_Society.Business_and_economics,1,1419,0.070472


In [5]:
FILEPATH = '/Users/klogg/research_data/wmf_knowledge_graph/topics/labeled_enwiki_with_topics_metadata.json'

with open(FILEPATH) as json_file:
    ground_truth_topics = [json.loads(line) for line in json_file]

In [18]:
dt_list = []

# Keep one row per labeled article, using only its first listed topic.
for article in ground_truth_topics:
    if len(article['topics']) > 0:
        article_dict = {
            'topic':article['topics'][0],
            'title':article['title']
        }
        dt_list.append(article_dict)

dt_df = pd.DataFrame(dt_list)

In [23]:
dt_df = dt_df.merge(df, on='title', how='inner')

In [31]:
topic_dist = dt_df.groupby('label')['topic'].nunique().describe().to_frame('count')
topic_dist['percent'] = topic_dist['count'].divide(dt_df['topic'].nunique())*100
topic_dist

Unnamed: 0,count,percent
count,65.0,162.5
mean,33.323077,83.307692
std,3.553641,8.884103
min,23.0,57.5
25%,31.0,77.5
50%,33.0,82.5
75%,37.0,92.5
max,40.0,100.0


In [35]:
percent_df = dt_df.groupby(['label','topic']).size().to_frame('count').reset_index()
percent_df['sum'] = percent_df.groupby('label')['count'].transform('sum')
percent_df['percent'] = (percent_df['count'].divide(percent_df['sum']))*100
percent_df = percent_df.sort_values(['label','percent'])
percent_df.groupby('label')['percent'].describe().sort_values('max')

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
computer_image_photography_weapon,40.0,2.5,2.105449,0.046992,0.787124,2.020677,3.724154,8.599624
water_lake_environmental_soil,37.0,2.702703,3.469289,0.100301,0.401204,1.203611,3.209629,12.23671
energy_power_system_nuclear,37.0,2.702703,3.094059,0.099502,0.597015,1.492537,3.383085,13.233831
list_country_poet_egypt,38.0,2.631579,2.871447,0.125628,0.785176,1.507538,3.831658,13.819095
state_unite_union_liberalism,38.0,2.631579,4.261322,0.046404,0.197216,0.556845,2.053364,14.013921
company_new_medium_oil,40.0,2.5,3.737443,0.098232,0.368369,0.884086,2.111984,15.324165
stone_temple_architecture_circle,33.0,3.030303,4.559115,0.113122,0.339367,0.791855,2.941176,16.742081
war_military_force_world,36.0,2.777778,4.730019,0.099701,0.299103,0.747757,2.467597,19.641077
dive_air_engine_gas,36.0,2.777778,4.066206,0.107527,0.403226,1.397849,3.252688,20.860215
paper_glass_color_print,37.0,2.702703,4.128285,0.134409,0.537634,1.478495,2.956989,21.236559


In [36]:
percent_df.groupby('label')['percent'].describe()['max'].describe()

count    65.000000
mean     39.785857
std      17.003745
min       8.599624
25%      27.654321
50%      39.667897
75%      49.113592
max      81.135707
Name: max, dtype: float64

In [37]:
percent_df.groupby('label')['percent'].describe()['50%'].describe()

count    65.000000
mean      0.802699
std       0.403644
min       0.229885
25%       0.534759
50%       0.737101
75%       1.060071
max       2.020677
Name: 50%, dtype: float64

In [39]:
dt_df.groupby('label')['topic'].nunique().sort_values()

label
church_jesus_pope_john                        23
star_variable_stellar_supernova               27
script_braille_alphabet_write                 27
dialect_german_language_min                   27
al_ibn_islam_islamic                          28
cell_protein_system_receptor                  28
cuisine_food_list_milk                        29
plant_forest_species_tree                     29
french_france_battle_war                      30
russia_russian_sea_soviet                     30
religion_god_mythology_jewish                 30
cycle_bicycle_wheel_drive                     30
genetic_evolution_tourism_population          31
language_english_linguistic_consonant         31
numb_function_theory_space                    31
island_sea_gulf_list                          31
physic_quantum_theory_particle                31
film_animation_cinema_list                    31
poetry_literature_animal_poet                 31
book_comic_fiction_publish                    31
desert_oblast_

In [40]:
len(dt_list)

5511929

In [31]:
topics.loc[topics['page_id'] == 20952184]

Unnamed: 0,Qid,topic,probability,page_id,title,wiki_db


In [30]:
topics.columns

Index(['Qid', 'topic', 'probability', 'page_id', 'title', 'wiki_db'], dtype='object')