## Get valid metaclasses
* Prerequisite file: metaclasses.csv, obtained by querying the Wikidata Query Service for instances of metaclass (Q19478619).

``` sparql
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q19478619 .  # instance of: metaclass
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
```
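One possible way to produce the prerequisite CSV is to send the query to the public Wikidata Query Service endpoint and ask for CSV output. The sketch below only builds the request; the helper name `build_request`, the user-agent string, and the output file name in the commented lines are assumptions for illustration, not part of the original notebook.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Public SPARQL endpoint of the Wikidata Query Service
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q19478619 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

def build_request(query, user_agent="metaclass-filter-example/0.1"):
    # Hypothetical helper: encode the query as a URL parameter and
    # request CSV results through the Accept header.
    url = WDQS_ENDPOINT + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "text/csv", "User-Agent": user_agent})

request = build_request(QUERY)

# To actually download the results (needs network access):
# from urllib.request import urlopen
# with urlopen(request) as response, open("metaclasses.csv", "wb") as out:
#     out.write(response.read())
```

The download itself is left commented out so the cell can run offline; the resulting file would still need to be placed where the notebook expects it.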

In [1]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /home/infres/ypeng-21/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/infres/ypeng-21/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
df_meta = pd.read_csv('../raw_data/metaclasses.csv')

* Some classes are instances (P31) of metaclass (Q19478619) or second-order class (Q24017414), which is reasonable. So we also keep classes that are instances of useful metaclasses.
* Keep important metaclasses:
    * labels whose first or last word is one of the keywords type, class, style, genre, form, category, classification, or labels that are exactly "occupation", "profession", or "field of work".
* Post-processing:
    * Exclude class labels that contain a preposition other than "of". E.g. state-owned enterprise of Russia (Q30590861) -> P31 -> type of business entity in Russia (Q7860948) is not what we want, as Russia should be the object of the property 'country' (P17).
    * Exclude classes whose labels contain "property" or "properties". E.g. type of Wikidata property (Q107649491), where the type describes a property rather than a class.
    * Exclude "BFO classes".
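The rules above can be illustrated on a few labels taken from this notebook. This is a minimal sketch, not the notebook's implementation: it splits on whitespace and uses a small hand-written preposition set (an assumption) in place of NLTK part-of-speech tagging, and the name `keep_label` is hypothetical.

```python
KEYWORDS = {"type", "class", "style", "genre", "form", "category", "classification"}
EXACT = {"occupation", "profession", "field of work"}
# Assumption: a tiny preposition list stands in for POS tagging; "of" is allowed.
PREPOSITIONS = {"in", "by", "for", "on", "at", "from", "with"}

def keep_label(label):
    tokens = label.split()
    if not tokens:
        return False
    # Keep rule: exact match, or a keyword as the first or last word
    has_keyword = label in EXACT or tokens[0] in KEYWORDS or tokens[-1] in KEYWORDS
    # Exclusion rules: other prepositions, "property/properties", "BFO"
    has_prep = any(t.lower() in PREPOSITIONS for t in tokens)
    has_property = "property" in label or "properties" in label
    has_bfo = "BFO" in label
    return has_keyword and not (has_prep or has_property or has_bfo)

examples = [
    "music genre",
    "type of business entity in Russia",
    "type of Wikidata property",
    "profession",
]
results = {label: keep_label(label) for label in examples}
```

Here "music genre" and "profession" are kept, while the other two labels are dropped by the preposition rule and the property rule respectively.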

In [3]:
keywords = ["type", "class", "style", "genre", "form", "category", "classification"]
exact_match = ['occupation', 'profession', 'field of work']
def valid_metaclass(input_string):
    # A label is valid if it is one of the exact-match labels, or if its
    # first or last token is one of the keywords.
    if input_string in exact_match:
        return True
    tokens = word_tokenize(input_string)
    if not tokens:
        # Guard against empty labels, which would otherwise raise IndexError
        return False
    return tokens[0] in keywords or tokens[-1] in keywords

In [4]:
def contain_preps(input_string):
    # Check if the label contains a word tagged 'IN' (preposition) other than "of"
    tagged = pos_tag(word_tokenize(input_string))
    return any(tag == 'IN' and word.lower() != 'of' for word, tag in tagged)

In [5]:
def contain_property(input_string):
    # True if the label mentions "property" or "properties"
    return 'property' in input_string or 'properties' in input_string

def contain_BFO(input_string):
    # True if the label mentions BFO (Basic Formal Ontology)
    return 'BFO' in input_string

In [6]:
df_meta['valid_metaclass'] = df_meta['itemLabel'].apply(valid_metaclass)
df_meta['contain_preps'] = df_meta['itemLabel'].apply(contain_preps)
df_meta['contain_property'] = df_meta['itemLabel'].apply(contain_property)
df_meta['contain_BFO'] = df_meta['itemLabel'].apply(contain_BFO)

In [7]:
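# Keep valid metaclasses that trigger none of the exclusion rules.
keep = (df_meta['valid_metaclass'] & ~df_meta['contain_preps']
        & ~df_meta['contain_property'] & ~df_meta['contain_BFO'])
# .copy() so that adding the 'qid' column later does not modify a view of df_meta
df_meta_filter = df_meta.loc[keep, ['item', 'itemLabel']].copy()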

In [8]:
df_meta_filter['qid'] = df_meta_filter['item'].apply(lambda x: x.replace('http://www.wikidata.org/entity/', 'wd:'))

In [9]:
df_meta_filter.loc[:, ['qid', 'itemLabel']]

| | qid | itemLabel |
|---|---|---|
| 2 | wd:Q28640 | profession |
| 3 | wd:Q32880 | architectural style |
| 5 | wd:Q188451 | music genre |
| 6 | wd:Q190087 | data type |
| 10 | wd:Q223393 | literary genre |
| ... | ... | ... |
| 1050 | wd:Q114570820 | rocket class |
| 1059 | wd:Q115483827 | sex-specific tissue type |
| 1063 | wd:Q116123132 | lens type |
| 1067 | wd:Q116766962 | Hindustani adjective class |


In [10]:
df_meta_filter.loc[:, ['qid', 'itemLabel']].to_csv('metaclasses_filter.csv', index=False)

In [1]:
import utils

In [None]:
utils.cls_mentions # cumulative count