In [1]:
from getters import get_taxonomy, get_group_names
import polars as pl
import altair as alt

# Load and preview a taxonomy 

The entire taxonomy is stored as a single polars dataframe. The taxonomy is nested, and each group label encodes its position in the hierarchy using "_" as a separator. A top-level group ("Level_1") is represented as a single integer cast as a string (ex: '0'). The groups at the next level within group 0 are represented as '0_1', '0_2', '0_3', etc.

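In many cases a group did not break down into all 5 levels (as described in the article). In these cases, the label of the deepest available level is repeated in the remaining columns (ex: an entity whose deepest group is '0_36_16' has that value in Level_3, Level_4 and Level_5).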
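
Because of this cascading, the number of levels an entity actually reaches can be read from its Level_5 label. The snippet below is a minimal sketch in plain Python, not part of the repository's code: `effective_depth` is a hypothetical helper, and the two rows are copied from the co-occurrence taxonomy preview in this notebook. It assumes the deepest label is always repeated down to Level_5.

```python
def effective_depth(row):
    """Number of levels a row actually reaches.

    Assumes the deepest available label is cascaded down to Level_5,
    so counting the "_"-separated parts of Level_5 gives the depth.
    """
    return len(row["Level_5"].split("_"))


# Two rows copied from the taxonomy preview in this notebook.
superlens = {
    "Level_1": "0", "Level_2": "0_36", "Level_3": "0_36_16",
    "Level_4": "0_36_16", "Level_5": "0_36_16", "Entity": "Superlens",
}
diimine = {
    "Level_1": "0", "Level_2": "0_60", "Level_3": "0_60_27",
    "Level_4": "0_60_27_2", "Level_5": "0_60_27_2", "Entity": "Diimine",
}

print(effective_depth(superlens))  # 3: Level_4 and Level_5 repeat Level_3
print(effective_depth(diimine))    # 4: Level_5 repeats Level_4
```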

The columns in the dataframe are:
    <br> - Level_1: "disciplines" - represented by a single integer (cast as string) (ex: '0').
    <br> - Level_2: "domains" - represented by up to two integers separated by _ (cast as string) (ex: '0_1')
    <br> - Level_3: "areas" - represented by up to three integers separated by _ (cast as string) (ex: '0_1_1')
    <br> - Level_4: "topics" - represented by up to four integers separated by _ (cast as string) (ex: '0_1_1_1')
    <br> - Level_5: "subtopics" - represented by up to five integers separated by _ (cast as string) (ex: '0_1_1_1_1')
    <br> - Entity: the Wikipedia entity being categorised
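
Since each label contains the labels of all its parent groups, a label can be unpacked without looking at the dataframe. The following is a rough sketch under that assumption; `ancestors` and `level_name` are hypothetical helpers introduced for illustration, and the level names are taken from the column descriptions above.

```python
# Level names as listed in the column descriptions.
LEVEL_NAMES = {1: "discipline", 2: "domain", 3: "area", 4: "topic", 5: "subtopic"}


def ancestors(label):
    """Return the label of every group on the path from Level_1 to `label`."""
    parts = label.split("_")
    return ["_".join(parts[:i]) for i in range(1, len(parts) + 1)]


def level_name(label):
    """Return the name of the level a label belongs to."""
    return LEVEL_NAMES[len(label.split("_"))]


print(ancestors("0_60_27_2"))   # ['0', '0_60', '0_60_27', '0_60_27_2']
print(level_name("0_60_27_2"))  # topic
```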

In [20]:
# preview the co-occurrence taxonomy
taxonomy = get_taxonomy("cooccurrence")
taxonomy.head(10)

| Level_1 | Level_2 | Level_3 | Level_4 | Level_5 | Entity |
| --- | --- | --- | --- | --- | --- |
| str | str | str | str | str | str |
| "0" | "0_36" | "0_36_16" | "0_36_16" | "0_36_16" | "Superlens" |
| "0" | "0_53" | "0_53_0" | "0_53_0" | "0_53_0" | "Quinoline" |
| "0" | "0_60" | "0_60_27" | "0_60_27_2" | "0_60_27_2" | "Diimine" |
| "0" | "0_26" | "0_26_12" | "0_26_12_3" | "0_26_12_3" | "Hydrotrope" |
| "0" | "0_60" | "0_60_7" | "0_60_7_4" | "0_60_7_4" | "Trifluorometha…" |
| "0" | "0_57" | "0_57_28" | "0_57_28_0" | "0_57_28_0" | "Zirconium dibo…" |
| "0" | "0_54" | "0_54_8" | "0_54_8" | "0_54_8" | "Triple point" |
| "0" | "0_5" | "0_5_13" | "0_5_13_7" | "0_5_13_7" | "Francium" |
| "0" | "0_19" | "0_19_3" | "0_19_3" | "0_19_3" | "Melt blowing" |
| "0" | "0_48" | "0_48_18" | "0_48_18" | "0_48_18" | "Hydrodeoxygena…" |


# Explore the Taxonomy
View all entities within a given taxonomy group. Because the taxonomy is a polars dataframe, you need to use polars syntax to manipulate it; see the [polars documentation on manipulating dataframes](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/modify_select.html).

In [21]:
# view all entities within a specific level 3 (area) category
list(taxonomy.filter(pl.col("Level_3")=="0_26_12")["Entity"])

['Hydrotrope',
 'Cocamidopropyl betaine',
 'Sodium gluconate',
 'Salicylanilide',
 'Amphoterism',
 'Humectant',
 'Methylene bridge',
 'Ultrahigh',
 'Partial current',
 'Kendrick mass',
 'Homologous series',
 'Organosulfate']

### Plot the distribution of the number of entities per group at different levels

This notebook uses the postprocessed co-occurrence taxonomy described in the article, so entities are relatively evenly distributed across groups. Run the same code with the imbalanced or centroids taxonomies to compare.

In [22]:
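# value_counts() returns the group label plus a count column, named "counts" here (newer polars releases name it "count")
source = taxonomy["Level_1"].value_counts().to_pandas()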
alt.Chart(source).mark_bar().encode(
    x = alt.X("Level_1", title = "Discipline"),
    y = alt.Y("counts", title = "number of entities in the Discipline"))

In [23]:
source = taxonomy.filter(pl.col("Level_1")=="13")["Level_2"].value_counts().to_pandas()
alt.Chart(source).mark_bar().encode(
    x = alt.X("Level_2", title = "Domain"),
    y = alt.Y("counts", title = "number of entities in the Domain"))
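
To compare taxonomies numerically as well as visually, the group sizes can be reduced to a few summary statistics. This is only an illustrative sketch in plain Python: `group_size_summary` is a hypothetical helper, the labels below are made-up toy data, and the coefficient of variation is just one possible measure of evenness (lower means more even), not necessarily the one used in the article.

```python
from collections import Counter
from statistics import mean, pstdev


def group_size_summary(labels):
    """Summarise how evenly entities are spread across groups.

    `labels` holds one group label per entity, e.g. the values of Level_1.
    """
    counts = list(Counter(labels).values())
    average = mean(counts)
    return {
        "n_groups": len(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": average,
        # Coefficient of variation: standard deviation relative to the mean.
        "cv": pstdev(counts) / average,
    }


# Toy labels (hypothetical): three groups with 3, 3 and 2 entities.
toy_labels = ["0", "0", "0", "1", "1", "1", "2", "2"]
summary = group_size_summary(toy_labels)
print(summary)
```

With the real data, passing something like `taxonomy["Level_1"].to_list()` should work, assuming the column is a polars Series as in the cells above.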

# Explore taxonomy group names

We provide names for the groups at a given taxonomy level using either the top 5 entities in documents in our training corpus or ChatGPT (as described in the article). Names are only provided for the co-occurrence taxonomy at the top 3 levels (disciplines, domains, areas).

<br> 
If using **ChatGPT naming** (`name_type = "chatgpt"`), we provide: 
<br> - "name": name of group (ex: 'Analytical chemistry')
<br> - "confidence": confidence score for the name, as provided by ChatGPT (out of 100)
<br> - "discard": any entities that made the naming ambiguous for ChatGPT and were discarded (ex: ['John Wiley & Sons'])
<br>
If using **entity naming** (`name_type = "entities"`), we provide the 5 most frequent entities in the group, where an entity's frequency is the number of documents in our training corpus that contain it.
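
The "confidence" field can be used to keep only the names ChatGPT was sure about. The sketch below assumes the ChatGPT names form a dictionary keyed by group label, with each value holding the three fields listed above; that shape is inferred from the outputs in this notebook rather than from the `getters` source. `confident_names` and the group '99_99' are hypothetical and exist only for illustration.

```python
def confident_names(chatgpt_names, min_confidence=90):
    """Map group label -> name, keeping entries at or above the threshold."""
    return {
        group: entry["name"]
        for group, entry in chatgpt_names.items()
        if entry["confidence"] >= min_confidence
    }


sample_names = {
    # These two entries match outputs shown in this notebook.
    "13_12": {"name": "Influenza", "confidence": 100, "discard": None},
    "13_38": {
        "name": "Parasitic diseases and treatments",
        "confidence": 100,
        "discard": None,
    },
    # Hypothetical low-confidence entry, added only to show the filter.
    "99_99": {
        "name": "Hypothetical group",
        "confidence": 40,
        "discard": ["John Wiley & Sons"],
    },
}

kept = confident_names(sample_names)
print(kept)
```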

In [37]:
#load both name types (disciplines)
chatgpt_discipline_names = get_group_names(level = 1, name_type = "chatgpt")
entity_discipline_names = get_group_names(level = 1, name_type = "entities")

### Look at the names for the discipline plotted above (discipline 13)

In [44]:
print("Naming provided by chatgpt: {}".format(chatgpt_discipline_names["13"]))
print()
print("Most frequent entities in the corpus in this category: {}".format(entity_discipline_names["13"]))

Naming provided by chatgpt: {'name': 'Wildlife and Diseases', 'confidence': 100, 'discard': None}

Most frequent entities in the corpus in this category: Parasitism, Wildlife, Zoonosis, Primate, Nematode


### Look at the names for several of the domains within discipline 13

In [46]:
chatgpt_domain_names = get_group_names(level = 2, name_type = "chatgpt")
entity_domain_names = get_group_names(level = 2, name_type = "entities")

In [47]:
#taxonomy group 13_12
print("Naming provided by chatgpt: {}".format(chatgpt_domain_names['13_12']))
print()
print("Most frequent entities in the corpus in this category: {}".format(entity_domain_names["13_12"]))

Naming provided by chatgpt: {'name': 'Influenza', 'confidence': 100, 'discard': None}

Most frequent entities in the corpus in this category: Influenza A virus subtype H1N1, Hemagglutinin, Influenza A virus subtype H5N1, Influenza A virus subtype H3N2, Avian influenza


In [48]:
#taxonomy group 13_38
print("Naming provided by chatgpt: {}".format(chatgpt_domain_names['13_38']))
print()
print("Most frequent entities in the corpus in this category: {}".format(entity_domain_names["13_38"]))

Naming provided by chatgpt: {'name': 'Parasitic diseases and treatments', 'confidence': 100, 'discard': None}

Most frequent entities in the corpus in this category: Parasitism, Helminths, Schistosomiasis, Neglected tropical diseases, Anthelmintic
