## ORES comparison for clustered vital 1k + links (`Edcast 1000`) dataset

**Author:** Jim Maddock

**Last Updated:** 6-16-20

**Description:** The purpose of this notebook is to compare the categories (clusters) from the EdCast 1000 taxonomy to the categories predicted by ORES.  This notebook will:

* Categorize all articles in each cluster, then look at distribution of categories in each cluster - **DONE**
* Categorize all articles in each cluster and assign a label based on the most frequent category. Look at whether the edcast label makes sense compared to the ORES label.
* Categorize all articles.  Look at number of clusters that appear in each ORES predicted category
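
As a rough illustration of the second and third bullets, the sketch below labels each cluster with its most frequent ORES topic and counts how many clusters appear in each topic. The `toy` dataframe and its rows are hypothetical stand-ins for the merged EdCast + ORES dataframe built later in this notebook; only the `label` and `topic` column names come from the notebook itself. Ties between equally frequent topics are broken arbitrarily here.

```python
import pandas as pd

# Hypothetical toy rows standing in for the merged EdCast + ORES dataframe.
toy = pd.DataFrame({
    'label': ['a_cluster', 'a_cluster', 'a_cluster', 'b_cluster', 'b_cluster'],
    'topic': ['STEM.Biology', 'STEM.Biology', 'Culture.Linguistics',
              'Culture.Linguistics', 'Culture.Linguistics'],
})

# Bullet 2: label each cluster with its most frequent ORES topic.
counts = toy.groupby(['label', 'topic']).size().reset_index(name='count')
dominant = (counts.sort_values('count', ascending=False)
                  .drop_duplicates('label')      # keep the top topic per cluster
                  .set_index('label')['topic'])

# Bullet 3: number of distinct clusters that appear in each ORES topic.
clusters_per_topic = toy.groupby('topic')['label'].nunique()

print(dominant)
print(clusters_per_topic)
```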

We can run these evaluations using 2 different ORES datasets:

* **ORES predicted topics:** Overlap with this dataset shows us how the edcast model compares to ORES *predictions*.  The advantage is that we can use a much larger dataset (the entirety of wikipedia).  The disadvantage is that these are predictions, so we can't necessarily assume that ORES represents the "correct" or ground truth answer.  - **DONE**
* **ORES training data:** Overlap with this dataset shows how the edcast model compares to a human-crafted taxonomy.  This dataset is much smaller, but we can assume that all articles are correctly categorized.

In [1]:
import json
import pandas as pd
from oresapi import Session
import requests

import matplotlib.pyplot as plt
import matplotlib

matplotlib.style.use('ggplot')
pd.set_option("display.min_rows", 100)
pd.set_option("display.max_rows", 100)

### Notebook Setup
Import edcast data from a json file and create a pandas dataframe.  Import ORES predictions from https://figshare.com/articles/Topics_for_each_Wikipedia_Article_across_Languages/12127434 and merge them with the edcast dataframe.

In [4]:
FILEPATH = '/Users/klogg/research_data/wmf_knowledge_graph/wiki_5-28-20/wiki_1000_clusters_6-8-20.json'

with open(FILEPATH) as json_file:
    cluster_65 = json.load(json_file)

In [5]:
rows = []

for i, cluster in enumerate(cluster_65):
    # One row per article, tagged with its cluster's label and id.
    for article in cluster_65[cluster]['items']:
        rows.append({
            'label': cluster_65[cluster]['label'],
            'cluster': cluster_65[cluster]['cluster'],
            'w': article['w'],
            'title': article['title']
        })
    print('finished cluster: {0}'.format(i))

# Build the dataframe once; DataFrame.append was removed in pandas 2.0.
df = pd.DataFrame(rows)

# Match the underscore-separated page titles used in the ORES topics file.
df['title'] = df['title'].apply(lambda x: x.replace(' ','_'))

finished cluster: 0
finished cluster: 1
finished cluster: 2
[... one line per cluster; output truncated in the source ...]

In [6]:
FILEPATH = '/Users/klogg/research_data/wmf_knowledge_graph/topics/topicsForAllWikipediaPages.csv'

topics = pd.read_csv(FILEPATH,escapechar='\\')

In [7]:
topics = topics.loc[topics['wiki_db'] == 'enwiki']
topics = topics.rename(columns={'page_title':'title'})
topics = topics.sort_values(['title','probability'],ascending=False)
topics = topics.reset_index(drop=True).drop_duplicates(subset = ['title','Qid'],keep='first')

In [8]:
df = df.merge(topics, on='title', how='left')
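
Because this is a left merge, articles without a matching ORES prediction keep a missing `topic`. A minimal sketch of how that coverage could be checked, using pandas' `indicator=True` option on toy data; the `articles` and `topics_toy` frames and the `c1`/`c2` labels are hypothetical, not the real datasets:

```python
import pandas as pd

# Hypothetical toy frames standing in for the edcast and ORES dataframes.
articles = pd.DataFrame({
    'title': ['Mongoose', 'Common_law', 'Index_of_law_articles'],
    'label': ['c1', 'c2', 'c2'],
})
topics_toy = pd.DataFrame({
    'title': ['Mongoose', 'Common_law'],
    'topic': ['STEM.STEM*', 'History_and_Society.Politics_and_government'],
})

# indicator=True adds a '_merge' column recording where each row came from.
merged = articles.merge(topics_toy, on='title', how='left', indicator=True)

# Titles with no ORES prediction, and the share of rows missing a topic.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'title'].tolist()
missing_share = merged['topic'].isna().mean()

print(unmatched, missing_share)
```

Rows with a missing topic are silently dropped by `groupby('topic')`, so a high missing share would affect the percentages computed below.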

### EdCast + ORES topic prediction comparison
**Number of ORES topics per EdCast cluster:** Each EdCast cluster contains between 1.5 and 49 percent of the available ORES topics.  This indicates that the `edcast 1000 model` offers some improvement over the `edcast 65 model` (between 46 and 87 percent), but it does not always provide the same categorization as ORES.

In [9]:
topic_dist = df.groupby('label')['topic'].nunique().describe().to_frame('count')
topic_dist['percent'] = topic_dist['count'].divide(df['topic'].nunique())*100
topic_dist

Unnamed: 0,count,percent
count,1000.0,1587.301587
mean,10.746,17.057143
std,5.299393,8.411735
min,1.0,1.587302
25%,7.0,11.111111
50%,10.0,15.873016
75%,14.0,22.222222
max,31.0,49.206349


**Distribution of number of topics per edcast cluster:** When we look at the distribution of the number of articles in each ORES predicted topic:

**positive:**
* most of these clusters have a long-tail distribution, which means that even though all clusters contain between 1.5 and 49 percent of topics, most of these topics account for less than 1 percent of the total articles in the cluster.
* a few edcast clusters belong predominantly to a single topic
    * e.g. "mongoose_tail_stripe_slender" is 100% "STEM.STEM*" (these all seem to be about mongooses)
    * "consonant_fricative_voice_voiceless" is 98% "Culture.Linguistics"
    * this is a big improvement compared to `edcast 65`, where the top clusters contain at least 29 topics (46% of topics).
    
**negative:**
* the median "max percentage" is ~44%, which means that in most edcast clusters the largest topic covers less than half of the articles (compared to 34% in the `edcast 65` dataset).
* the min is 9%, which means "tourism_travel_world_organization" is split among more than 10 large categories (compared to 10% in the `edcast 65` dataset).  No improvement here.
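
The step from "the largest topic holds 9%" to "more than 10 categories" is a pigeonhole bound: if no topic covers more than a share p of a cluster, the cluster must span at least ceil(1/p) topics. A small sketch of that arithmetic, assuming the 9.09% figure corresponds to exactly 1/11 (an inference from the rounded output, not a value stated in the notebook):

```python
import math
from fractions import Fraction

def min_topics(max_share):
    """Lower bound on the number of topics when none exceeds max_share."""
    return math.ceil(1 / max_share)

# Assumption: 9.090909% is the rounded form of 1/11.
lower_bound = min_topics(Fraction(1, 11))
print(lower_bound)
```

This is only a lower bound; the actual number of topics in a cluster can be considerably higher.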

In [10]:
percent_df = df.groupby(['label','topic']).size().to_frame('count').reset_index()
percent_df['sum'] = percent_df.groupby('label')['count'].transform('sum')
percent_df['percent'] = (percent_df['count'].divide(percent_df['sum']))*100
percent_df.sort_values(['label','percent'])
percent_df.groupby('label')['percent'].describe().sort_values('max')

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
tourism_travel_world_organization,21.0,4.761905,2.456495,3.030303,3.030303,3.030303,6.060606,9.090909
stimulant_lambda_equal_temperament,10.0,10.000000,0.000000,10.000000,10.000000,10.000000,10.000000,10.000000
world_new_global_globalization,25.0,4.000000,3.401507,1.818182,1.818182,1.818182,5.454545,10.909091
british_war_kingdom_unite,31.0,3.225806,3.275925,1.041667,1.041667,1.041667,4.166667,12.500000
list_egypt_comedian_launch,29.0,3.448276,3.019824,1.098901,1.098901,2.197802,4.395604,13.186813
academy_suicide_military_science,20.0,5.000000,3.849760,2.702703,2.702703,2.702703,5.405405,13.513514
east_middle_timor_meridian,16.0,6.250000,3.139951,3.448276,3.448276,6.896552,6.896552,13.793103
post_economy_note_huffpost,17.0,5.882353,3.609322,3.448276,3.448276,3.448276,10.344828,13.793103
area_metropolitan_great_region,24.0,4.166667,3.385022,1.754386,1.754386,3.508772,5.263158,14.035088
american_list_association_union,21.0,4.761905,3.541542,1.785714,1.785714,3.571429,7.142857,14.285714


**Percentage of articles per Edcast cluster that belong to the largest predicted ORES category:** I.e., the median value is 44%, which indicates that in 50% of EdCast clusters, the largest predicted ORES category contains less than 44% of the total articles in that cluster (compared to 34% in the `edcast 65` dataset).

In [11]:
percent_df.groupby('label')['percent'].describe()['max'].describe()

count    1000.000000
mean       46.290672
std        18.831501
min         9.090909
25%        31.578947
50%        44.000000
75%        58.974359
max       100.000000
Name: max, dtype: float64

**Percentage of articles per Edcast cluster that belong to the median predicted ORES category:** I.e., the median value is 3.8%, which indicates that in 50% of EdCast clusters, the middle-sized predicted ORES category contains less than 3.8% of the total articles in that cluster.  This is a slight decrease in performance compared to `edcast 65` (0.59%).

In [12]:
percent_df.groupby('label')['percent'].describe()['50%'].describe()

count    1000.000000
mean        6.106182
std         7.903794
min         0.529101
25%         2.325581
50%         3.846154
75%         6.666667
max       100.000000
Name: 50%, dtype: float64

### EdCast + ORES topic prediction samples

**Number of predicted ORES categories for each EdCast cluster**:

In [13]:
df.groupby('label')['topic'].nunique().sort_values()

label
mongoose_tail_stripe_slender                             1
consonant_fricative_voice_voiceless                      2
alexander_sarcophagus_louis_poinsot                      2
artery_cerebral_coronary_syndrome                        2
neptune_ring_moon_trojan                                 2
artery_aortic_aorta_coronary                             2
syndrome_atrophy_muscular_progressive                    2
vaccine_polio_vaccination_influenza                      2
list_zarzuela_composer_e                                 2
phylum_clade_pucciniomycetes_polyphyly                   2
supernova_sn_list_remnant                                2
pulsar_nebula_taylor_binary                              2
mood_tense_future_present                                2
rov_yakov_joseph_billings                                2
matthew_luke_mark_source                                 3
lie_yong_lam_ed                                          3
vein_vena_cavum_sinus                             

**Number of Articles in each ORES predicted topic in the "mongoose_tail_stripe_slender" EdCast cluster:**
* All of the articles fit into a single ORES category
* All of the articles seem to be about mongooses
    * for these *tight* clusters (e.g. only one ORES category), a unigram seems like a better cluster name than a 4-gram
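
One way the unigram idea could be sketched is to pick the token that appears in the most article titles. The `unigram_name` helper below is hypothetical and is not how the EdCast labels are actually generated; the titles are a few of those shown in the output for this cluster.

```python
from collections import Counter

# A few titles from the "mongoose_tail_stripe_slender" cluster.
titles = ['Mongoose', 'Slender_mongoose', 'Angolan_slender_mongoose',
          'White-tailed_mongoose', 'Banded_mongoose']

def unigram_name(titles):
    """Hypothetical helper: return the token found in the most titles."""
    tokens = Counter()
    for title in titles:
        # Count each token at most once per title.
        tokens.update(set(title.lower().split('_')))
    return tokens.most_common(1)[0][0]

name = unigram_name(titles)
print(name)
```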

In [36]:
df.loc[df['label'] == 'mongoose_tail_stripe_slender'][['title','topic']]

Unnamed: 0,title,topic
37935,Mongoose,STEM.STEM*
37936,Slender_mongoose,STEM.STEM*
37937,Angolan_slender_mongoose,STEM.STEM*
37938,White-tailed_mongoose,STEM.STEM*
37939,Indian_brown_mongoose,STEM.STEM*
37940,Banded_mongoose,STEM.STEM*
37941,Liberian_mongoose,STEM.STEM*
37942,Black_mongoose,STEM.STEM*
37943,Collared_mongoose,STEM.STEM*
37944,Ruddy_mongoose,STEM.STEM*


In [37]:
df.loc[df['label'] == 'mongoose_tail_stripe_slender'].groupby('topic').size().sort_values()

topic
STEM.STEM*    32
dtype: int64

**Percentages (rather than absolute values) for each ORES category in a few EdCast clusters:** This reflects the earlier distribution analysis in this notebook.

For the "state_unite_list_territory" cluster it looks like the algorithm is just clustering types of states, which doesn't seem particularly helpful.  It's not clear that "Wyoming" and "Vassal State" should be in the same cluster.

In [40]:
percent_df.loc[percent_df['label'] == 'state_unite_list_territory'].sort_values('percent')

Unnamed: 0,label,topic,count,sum,percent
9185,state_unite_list_territory,STEM.Biology,1,206,0.485437
9163,state_unite_list_territory,Culture.Media.Books,1,206,0.485437
9180,state_unite_list_territory,History_and_Society.Education,1,206,0.485437
9166,state_unite_list_territory,Geography.Regions.Africa.Western_Africa,1,206,0.485437
9167,state_unite_list_territory,Geography.Regions.Americas.Central_America,1,206,0.485437
9189,state_unite_list_territory,STEM.Mathematics,1,206,0.485437
9188,state_unite_list_territory,STEM.Earth_and_environment,1,206,0.485437
9178,state_unite_list_territory,Geography.Regions.Oceania,1,206,0.485437
9191,state_unite_list_territory,STEM.Physics,1,206,0.485437
9187,state_unite_list_territory,STEM.Computing,1,206,0.485437


In [41]:
df.loc[df['label'] == 'state_unite_list_territory'][['title','topic']]

Unnamed: 0,title,topic
22341,Failed_state,History_and_Society.Politics_and_government
22342,Constituent_state,History_and_Society.Politics_and_government
22343,Sovereign_state,History_and_Society.Politics_and_government
22344,State_formation,STEM.STEM*
22345,Eastern_United_States,Geography.Geographical
22346,United_States,Geography.Geographical
22347,States_of_Germany,Geography.Regions.Europe.Europe*
22348,United_States_territorial_acquisitions,
22349,States_and_union_territories_of_India,Geography.Regions.Asia.South_Asia
22350,State_capitalism,History_and_Society.Politics_and_government


For the "law_enforcement_corporation_theory" cluster it looks like the algorithm is just clustering types of law, which is more interesting.  For instance, we get some examples of historical figures (ORES classifies under Biography) that have to do with law.

**Follow up:** Check articles with multiple ORES categories
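
A possible starting point for this follow-up, sketched on toy data. The setup cells keep only the highest-probability topic per article, so the check would have to run on the topics table before that deduplication. The rows, probabilities, and `THRESHOLD` value below are all hypothetical; only the `title`, `topic`, and `probability` column names come from this notebook.

```python
import pandas as pd

# Hypothetical pre-deduplication topic predictions (values are made up).
raw_topics = pd.DataFrame({
    'title': ['Law_of_India', 'Law_of_India', 'Scientific_law'],
    'topic': ['Geography.Regions.Asia.South_Asia',
              'History_and_Society.Politics_and_government',
              'STEM.STEM*'],
    'probability': [0.9, 0.6, 0.8],
})

THRESHOLD = 0.5  # hypothetical cutoff for counting a prediction

# Count distinct confident topics per article and keep those with several.
confident = raw_topics.loc[raw_topics['probability'] >= THRESHOLD]
topics_per_article = confident.groupby('title')['topic'].nunique()
multi_topic = topics_per_article[topics_per_article > 1].index.tolist()

print(multi_topic)
```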

In [43]:
percent_df.loc[percent_df['label'] == 'law_enforcement_corporation_theory'].sort_values('percent')

Unnamed: 0,label,topic,count,sum,percent
5363,law_enforcement_corporation_theory,Geography.Regions.Europe.Europe*,1,126,0.793651
5354,law_enforcement_corporation_theory,Culture.Food_and_drink,1,126,0.793651
5355,law_enforcement_corporation_theory,Culture.Literature,1,126,0.793651
5370,law_enforcement_corporation_theory,History_and_Society.Military_and_warfare,1,126,0.793651
5357,law_enforcement_corporation_theory,Culture.Visual_arts.Fashion,1,126,0.793651
5358,law_enforcement_corporation_theory,Culture.Visual_arts.Visual_arts*,1,126,0.793651
5359,law_enforcement_corporation_theory,Geography.Regions.Americas.North_America,1,126,0.793651
5360,law_enforcement_corporation_theory,Geography.Regions.Asia.Asia*,1,126,0.793651
5361,law_enforcement_corporation_theory,Geography.Regions.Asia.South_Asia,1,126,0.793651
5373,law_enforcement_corporation_theory,STEM.Engineering,1,126,0.793651


In [44]:
df.loc[df['label'] == 'law_enforcement_corporation_theory'][['title','topic']]

Unnamed: 0,title,topic
58851,Common_law,History_and_Society.Politics_and_government
58852,Rule_of_law,History_and_Society.Society
58853,Private_law,History_and_Society.Politics_and_government
58854,Constitutional_law,History_and_Society.Society
58855,Index_of_law_articles,
58856,Law_of_India,Geography.Regions.Asia.South_Asia
58857,Sources_of_law,History_and_Society.Politics_and_government
58858,Scientific_law,STEM.STEM*
58859,Natural_law,History_and_Society.Society
58860,Comparative_law,History_and_Society.Politics_and_government
