# Filtering candidate labels
In this notebook, we analyze the labels created in the candidate_label_creation notebook and then apply a simple rule-based filter to remove labels that are trivially unrepresentative (they apply to too few entities of a type) or indistinctive (they apply to nearly all entities of a type).

We'll look at each type of label separately.

### Pre-requisite steps to run this notebook
1. You need to run the candidate_label_creation notebook before this notebook.

In [72]:
import pandas as pd
import os
import subprocess
import matplotlib.pyplot as plt
import numpy as np
from utility import rename_cols_and_overwrite_id
from utility import run_command
from tqdm import tqdm

### Parameters
**Required**  
*work_dir*: path to work_dir that was specified in candidate_label_creation notebook. This should contain a folder called label_creation with files created by the label creation notebook which we will filter in this notebook.  
*store_dir*: path to folder containing the sqlite3.db file that we will use for our queries. We will reuse an existing file if there is one in this folder. Otherwise we will create a new one.  
*lower_bound_avl*: lower filter bound to use for AVL type labels (the bound applies to support, i.e. the fraction of entities of a type that the label applies to)  
*lower_bound_ail*: lower filter bound to use for AIL type labels  
*lower_bound_rel*: lower filter bound to use for REL type labels  
*lower_bound_ral*: lower filter bound to use for RAVL and RAIL type labels  

In [2]:
# **REQUIRED**
work_dir = "./output/wikidata-20210215-dwd"
store_dir = "./output/wikidata-20210215-dwd/temp-filtering"
lower_bound_avl = .1
lower_bound_ail = .1
lower_bound_rel = .1
lower_bound_ral = .1

### Process params / set up variables

In [3]:
# Ensure paths are absolute
work_dir = os.path.abspath(work_dir)
store_dir = os.path.abspath(store_dir)

label_creation_dir = "{}/label_creation".format(work_dir)
    
# Create directories
output_dir = "{}/candidate_filter".format(work_dir)
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
if not os.path.exists(store_dir):
    os.makedirs(store_dir)
    
# adding some environment variables we'll be using frequently
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['OUT'] = output_dir
os.environ['IN'] = label_creation_dir
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call
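
The `!kgtk ...` cells below refer to these variables as `$IN`, `$OUT`, and `$STORE`. As a minimal sketch of how such a reference resolves, `os.path.expandvars` performs the same kind of substitution the shell does. The variable name `DEMO_OUT` and its path are hypothetical placeholders, chosen so that the real `OUT` variable is left untouched.

```python
import os

# Hypothetical variable and path, used only for illustration
os.environ["DEMO_OUT"] = "/tmp/demo/candidate_filter"

# A shell command containing $DEMO_OUT/... would see the same expanded path
expanded = os.path.expandvars("$DEMO_OUT/candidate_RELs_supports.tsv")
print(expanded)
```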

Helpers for filtering labels by support and visualizing the results

In [105]:
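def filter_by_support_and_plot(support_file, filtered_out_file, filtered_file, lower_bound=.1, upper_bound=.9):
    # Split labels into those whose support lies within [lower_bound, upper_bound] and those outside it,
    # plot both support distributions, and write each group to its own TSV file.
    df = pd.read_csv(support_file, delimiter = '\t')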
    
    supports = df.loc[:,"support"]
    
    mask_filtered_out = ((supports < lower_bound) | (supports > upper_bound))
    filtered_out = supports.loc[mask_filtered_out]
    not_filtered_out = supports.loc[~mask_filtered_out]
    
    print("Number of labels filtered out: {} / {} ({:.2f}%)".format(len(filtered_out), len(supports), 100*(len(filtered_out) / len(supports))))
    
    bins = np.arange(0,1.01,.02)
    plt.hist(filtered_out, label = "filtered out", bins = bins)
    plt.hist(not_filtered_out, label = "not filtered out", bins = bins)
    
    plt.title("Distribution of Label Supports")
    plt.legend()
    plt.show()
    
    df_filtered_out = df.loc[mask_filtered_out]
    df_not_filtered_out = df.loc[~mask_filtered_out]
    
    df_filtered_out.to_csv(filtered_out_file, sep='\t', index = False)
    df_not_filtered_out.to_csv(filtered_file, sep='\t', index = False)
    
# For each type, try several lower bound values and choose the one that gives the largest number of labels
# without going over the max_labels_per_type. If none of the lower bound values stay within that limit, the
# highest lower bound value given will be chosen.
def dynamic_filter_by_support(support_file, filtered_out_file, filtered_in_file, filter_summary_file, lbs=[.1], ub=.9, max_labels_per_type=500):
    df = pd.read_csv(support_file, delimiter = '\t')
    supports = df.loc[:,"support"]
    types = df.loc[:,"node1"].unique()
    type_labels = df.groupby("node1")["type_label"].first().to_dict()
    num_labels_for_type_before_filtering = df.loc[:,"node1"].value_counts().to_dict()
    
    # If only one lb value to try, then don't need to do per-type filtering
    if len(lbs) == 1:
        lb = lbs[0]
        mask_filtered_out = ((supports < lb) | (supports > ub))
        df_filtered_out = df.loc[mask_filtered_out]
        df_not_filtered_out = df.loc[~mask_filtered_out]
        #record filter settings and num filtered labels for each type
        num_labels_for_type_after_filtering = df_not_filtered_out.loc[:,"node1"].value_counts().to_dict()
        filter_settings = []
        for t in types:
            filter_settings_for_type = [t, type_labels[t], lb]
            if t in num_labels_for_type_after_filtering:
                filter_settings_for_type.append(num_labels_for_type_after_filtering[t])
            else:
                filter_settings_for_type.append(0)
            filter_settings_for_type.append(num_labels_for_type_before_filtering[t])
            filter_settings.append(filter_settings_for_type)
        filter_settings_df = pd.DataFrame(filter_settings, columns=["type", "type_label", "lb", "# labels in bounds", "# labels before filtering"])
        
    else: 
        #way 1
#         lbs = sorted(lbs) # starting with most relaxed lb and narrowing if needed
        
#         mask_filtered_in_overall = [False for i in range(len(df))] # we are going to gradually add to this
#         filter_settings = []
        
#         for t in tqdm(types):
#             print("getting rows for type")
#             supports_t = supports.loc[df.loc[:,"node1"] == t]
#             num_labels_before_filtering = len(supports_t)
#             print("finding lb")
#             for lb in lbs:
#                 supports_t = supports_t.loc[((supports_t >= lb) & (supports_t <= ub))]
#                 if len(supports_t) <= max_labels_per_type:
#                     break
#             print("adding to mask")
#             # add filtered-in labels to overall mask
#             for i in supports_t.index:
#                 mask_filtered_in_overall[i] = True
#             print("adding to filter settings")
#             # save filter settings
#             filter_settings.append([t, type_labels[t], lb, len(supports_t), num_labels_before_filtering])
        
#         filter_settings_df = pd.DataFrame(filter_settings, columns=["type", "type_label", "lb", "# labels in bounds", "# labels before filtering"])
#         df_filtered_out = df.loc[~mask_filtered_in_overall]
#         df_not_filtered_out = df.loc[mask_filtered_in_overall]

        #way2
        lbs = sorted(lbs) # starting with most relaxed lb and narrowing if needed
        
        mask_filtered_in_overall = np.array([False for i in range(len(df))]) # we are going to gradually add to this
        df_still_considering = df
        filter_settings = []
        
        for lb in tqdm(lbs):
            supports = df_still_considering.loc[:,"support"]
            df_still_considering = df_still_considering.loc[(supports >= lb) & (supports <= ub)]
            
            label_counts_by_type = df_still_considering.loc[:,"node1"].value_counts().to_dict()
#             for i, row in df_still_considering.iterrows():
#                 t = row["node1"]
#                 if label_counts_by_type[t] <= max_labels_per_type:
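            # a type is finished once its number of in-bounds labels fits within the limit
            finished_types = {t for t, count in label_counts_by_type.items() if count <= max_labels_per_type}
            finished_mask = df_still_considering.loc[:,"node1"].isin(finished_types)
            # add filtered-in labels to overall mask (index labels double as positions because
            # df keeps the default RangeIndex from read_csv)
            mask_filtered_in_overall[df_still_considering.loc[finished_mask].index.to_numpy()] = True
            # reduce size of dataframe still being considered
            df_still_considering = df_still_considering.loc[~finished_mask]
            # save filter settings
            for t in finished_types:
                filter_settings.append([t, type_labels[t], lb, label_counts_by_type[t], num_labels_for_type_before_filtering[t]])
            if df_still_considering.empty:
                break
        
        # Types still over the limit at the highest lb keep the labels that passed it,
        # as stated in the comment above this function
        if not df_still_considering.empty:
            mask_filtered_in_overall[df_still_considering.index.to_numpy()] = True
            for t, count in df_still_considering.loc[:,"node1"].value_counts().items():
                filter_settings.append([t, type_labels[t], lbs[-1], count, num_labels_for_type_before_filtering[t]])
        
        filter_settings_df = pd.DataFrame(filter_settings, columns=["type", "type_label", "lb", "# labels in bounds", "# labels before filtering"])
        df_filtered_out = df.loc[~mask_filtered_in_overall]
        df_not_filtered_out = df.loc[mask_filtered_in_overall]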
        
    
    df_filtered_out.to_csv(filtered_out_file, sep='\t', index = False)
    df_not_filtered_out.to_csv(filtered_in_file, sep='\t', index = False)
    filter_settings_df.to_csv(filter_summary_file, sep='\t', index = False)
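
To make the per-type selection rule concrete, here is a simplified plain-Python sketch for a single type. It follows the rule stated in the comment above `dynamic_filter_by_support` (try the lower bounds from most relaxed to strictest, stop at the first one that fits the limit, otherwise fall back to the strictest). The function name `pick_lower_bound` and the support values are hypothetical, and the sketch does not reproduce the vectorized pandas code.

```python
def pick_lower_bound(supports, lbs, ub=0.9, max_labels=500):
    """Return (chosen lower bound, supports kept) for one type's labels."""
    kept = list(supports)
    chosen = None
    for lb in sorted(lbs):  # most relaxed bound first
        kept = [s for s in kept if lb <= s <= ub]
        chosen = lb
        if len(kept) <= max_labels:
            break  # this bound already fits the limit
    return chosen, kept

# Hypothetical supports for one type, with a tiny limit to keep the example readable
supports = [0.95, 0.4, 0.2, 0.05, 0.02, 0.004]
lb, kept = pick_lower_bound(supports, lbs=[0.1, 0.01, 0.001], max_labels=3)
print(lb, kept)
```

With a limit that no bound can satisfy (for example `max_labels=1` here), the loop runs to the end and the strictest bound is the one reported.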

In [88]:
df = pd.read_csv("{}/candidate_AVLs_time.year_supports.tsv".format(output_dir), delimiter = '\t')
df.loc[:,"node1"].unique()
df.loc[:,"node1"].value_counts().to_dict()
(df.loc[:,"support"].loc[df.loc[:,"node1"]=="Q5"] >=.0001).index[0]
df.groupby("node1")["type_label"].first().to_dict()
for i, row in df.iterrows():
    print(i)
    print(row["node1"])
    break


0
Q100039327


## AVL - time.year labels

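Adding the support column: the fraction of a type's entities that each label applies to (positives divided by the type's entity count)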

In [6]:
!kgtk query -i $IN/entity_counts_per_type.tsv -i $IN/candidate_AVLs_time.year.tsv \
-o $OUT/candidate_AVLs_time.year_supports.tsv \
--graph-cache $STORE \
--match 'candidate: (type)-[lab_id {type_label:t_lab, label:prop, positives:pos, property_label:p_lab}]->(val), counts_per_type: (type)-[]->(count)' \
--return 'type as node1, t_lab as type_label, prop as label, p_lab as property_label, val as node2, pos as positives, kgtk_quantity_number_float(pos)/kgtk_quantity_number(count) as support, lab_id as id' \
--order-by 'type'

In [7]:
display(pd.read_csv("{}/candidate_AVLs_time.year_supports.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,type_label,label,property_label,node2,positives,support,id
0,Q100039327,'autonomous constitutional agency'@en,P571,'inception'@en,1993,3,0.428571,Q100039327_P571_1993
1,Q100039327,'autonomous constitutional agency'@en,P571,'inception'@en,1922,1,0.142857,Q100039327_P571_1922
2,Q100039327,'autonomous constitutional agency'@en,P571,'inception'@en,1929,1,0.142857,Q100039327_P571_1929
3,Q100039327,'autonomous constitutional agency'@en,P571,'inception'@en,1931,1,0.142857,Q100039327_P571_1931
4,Q100039327,'autonomous constitutional agency'@en,P571,'inception'@en,1979,1,0.142857,Q100039327_P571_1979
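
The `support` column is `positives` divided by the number of entities of the type. A quick check against the first row above, assuming the type Q100039327 has 7 entities (a count inferred from the displayed support values, not read from `entity_counts_per_type.tsv`):

```python
positives = 3      # from the first row of the table above
entity_count = 7   # assumption: inferred from 3 / 0.428571

support = positives / entity_count
print(round(support, 6))
```

The same assumed count also reproduces the 0.142857 shown for the rows with a single positive.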


Filter labels based on support value, send to files.

In [47]:
support_file = "{}/candidate_AVLs_time.year_supports.tsv".format(output_dir)
filtered_out_file = "{}/candidate_AVLs_time.year_filtered_out.tsv".format(output_dir)
filtered_in_file = "{}/candidate_AVLs_time.year_filtered_in.tsv".format(output_dir)
filter_summary_file = "{}/filter_summary_AVL_time.year.tsv".format(output_dir)
dynamic_filter_by_support(support_file, filtered_out_file, filtered_in_file, filter_summary_file, lbs=[.1], ub=.9, max_labels_per_type=500)

In [40]:
display(pd.read_csv("{}/filter_summary_AVL_time.year.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
0,Q100039327,'autonomous constitutional agency'@en,0.1,5,5
1,Q1000415,'Henry 180'@en,0.1,2,2
2,Q10007123,'Griqua state'@en,0.1,4,4
3,Q1000809,'Buddharupa'@en,0.1,0,4
4,Q1000923,'persecution of Buddhists'@en,0.1,0,1
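
As a small illustration of the bound check with a single lower bound, the sketch below applies `lb=.1` and `ub=.9` to the five Q100039327 supports displayed earlier, plus two hypothetical out-of-band values added only to show what gets dropped.

```python
lower_bound, upper_bound = 0.1, 0.9

# Five supports from the Q100039327 rows shown earlier, then two hypothetical extremes
supports = [0.428571, 0.142857, 0.142857, 0.142857, 0.142857, 0.05, 1.0]

kept = [s for s in supports if lower_bound <= s <= upper_bound]
dropped = [s for s in supports if not (lower_bound <= s <= upper_bound)]
print(len(kept), len(dropped))
```

The five real values all stay in bounds, which is consistent with the 5 of 5 labels reported for Q100039327 in the summary above.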


## AVL - quantity labels

Adding support column

In [46]:
!kgtk query -i $IN/entity_counts_per_type.tsv -i $IN/candidate_AVLs_quantity.tsv \
-o $OUT/candidate_AVLs_quantity_supports.tsv \
--graph-cache $STORE \
--match 'candidate: (type)-[lab_id {type_label:t_lab, label:prop, property_label:p_lab, positives:pos, si_units:si, wd_units:wd}]->(val), counts_per_type: (type)-[]->(count)' \
--return 'type as node1, t_lab as type_label, prop as label, p_lab as property_label, val as node2, si as si_units, wd as wd_units, pos as positives, kgtk_quantity_number_float(pos)/kgtk_quantity_number(count) as support,  lab_id as id' \
--order-by 'type'

In [48]:
display(pd.read_csv("{}/candidate_AVLs_quantity_supports.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,type_label,label,property_label,node2,si_units,wd_units,positives,support,id
0,Q1000300,'Land Rover Series'@en,P2043,'length'@en,3345,,Q174789,1,0.5,Q1000300_P2043_+3345__Q174789
1,Q1000300,'Land Rover Series'@en,P2048,'height'@en,1867,,Q174789,1,0.5,Q1000300_P2048_+1867__Q174789
2,Q1000300,'Land Rover Series'@en,P2049,'width'@en,1549,,Q174789,1,0.5,Q1000300_P2049_+1549__Q174789
3,Q1000300,'Land Rover Series'@en,P2067,'mass'@en,1177,,Q11570,1,0.5,Q1000300_P2067_+1177__Q11570
4,Q1000415,'Henry 180'@en,P3157,'event distance'@en,45,,Q26484625,2,0.666667,Q1000415_P3157_+45__Q26484625


Filter labels based on support value, send to files.

In [54]:
support_file = "{}/candidate_AVLs_quantity_supports.tsv".format(output_dir)
filtered_out_file = "{}/candidate_AVLs_quantity_filtered_out.tsv".format(output_dir)
filtered_in_file = "{}/candidate_AVLs_quantity_filtered_in.tsv".format(output_dir)
filter_summary_file = "{}/filter_summary_AVL_quantity.tsv".format(output_dir)
dynamic_filter_by_support(support_file, filtered_out_file, filtered_in_file, filter_summary_file, lbs=[.1], ub=.9, max_labels_per_type=500)


In [55]:
display(pd.read_csv("{}/filter_summary_AVL_quantity.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
0,Q1000300,'Land Rover Series'@en,0.1,4,4
1,Q1000415,'Henry 180'@en,0.1,1,1
2,Q1000809,'Buddharupa'@en,0.1,0,5
3,Q1000858,'regions of the Faroe Islands'@en,0.1,4,4
4,Q1001059,'writ'@en,0.1,2,2


## RELs

Adding support column

In [172]:
%%time
!kgtk query -i $IN/entity_counts_per_type.tsv -i $IN/candidate_RELs.tsv \
-o $OUT/candidate_RELs_supports.tsv \
--graph-cache $STORE \
--match 'candidate: (type)-[lab_id {type_label:t_lab, label:prop, positives:pos, property_label:p_lab}]->(val), counts_per_type: (type)-[]->(count)' \
--return 'type as node1, t_lab as type_label, prop as label, p_lab as property_label, val as node2, pos as positives, kgtk_quantity_number_float(pos)/kgtk_quantity_number(count) as support, lab_id as id' \
--order-by 'type'


CPU times: user 7.61 s, sys: 1.44 s, total: 9.05 s
Wall time: 5min 32s


In [173]:
display(pd.read_csv("{}/candidate_RELs_supports.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,type_label,label,property_label,node2,node2_label,positives,support,id
0,Q1000017,'Brentidae'@en,P31,'instance of'@en,Q1000017,'Brentidae'@en,1,1.0,Q1000017_P31_Q1000017
1,Q1000017,'Brentidae'@en,P31,'instance of'@en,Q1040689,'synonym'@en,1,1.0,Q1000017_P31_Q1040689
2,Q1000017,'Brentidae'@en,P31,'instance of'@en,Q17276484,'later homonym'@en,1,1.0,Q1000017_P31_Q17276484
3,Q1000017,'Brentidae'@en,P910,'topic\'s main category'@en,Q14967583,'Category:Euphenges'@en,1,1.0,Q1000017_P910_Q14967583
4,Q100023,'metasyntactic variable'@en,P1343,'described by source'@en,Q47463912,"'RFC 3092: Etymology of \""Foo\""'@en",1,1.0,Q100023_P1343_Q47463912


Filter labels based on support value, send to files.

In [174]:
support_file = "{}/candidate_RELs_supports.tsv".format(output_dir)
filtered_out_file = "{}/candidate_RELs_filtered_out.tsv".format(output_dir)
filtered_in_file = "{}/candidate_RELs_filtered_in.tsv".format(output_dir)
filter_summary_file = "{}/filter_summary_RELs.tsv".format(output_dir)
dynamic_filter_by_support(support_file, filtered_out_file, filtered_in_file, filter_summary_file, lbs=[.1,.03,.01,.003,.001,.0003], ub=.9, max_labels_per_type=500)

100%|██████████| 6/6 [00:05<00:00,  1.08it/s]


In [175]:
df = pd.read_csv("{}/filter_summary_RELs.tsv".format(output_dir), delimiter = '\t').fillna("")
display(df.head())
display(df.loc[df.loc[:,"type"]=="Q5"])
display(df.loc[df.loc[:,"type"]=="Q7889"])
display(df.loc[df.loc[:,"type"]=="Q6256"])
display(df.loc[df.loc[:,"type"]=="Q41298"])

Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
0,Q1325650,'Malmö International Badminton Championships'@en,0.0003,12,18
1,Q20162172,'human language'@en,0.0003,4,5
2,Q47007890,'bicomplete category'@en,0.0003,15,17
3,Q18118092,'Aloe saponaria'@en,0.0003,4,6
4,Q4441564,'stack-oriented programming language'@en,0.0003,16,18


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
37589,Q5,'human'@en,0.001,460,3919121


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
38212,Q7889,'video game'@en,0.003,242,26262


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
39935,Q6256,'country'@en,0.03,438,22205


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
37436,Q41298,'magazine'@en,0.001,255,9895


## RAVLs

Adding support column

In [176]:
!kgtk query -i $IN/entity_counts_per_type.tsv -i $IN/candidate_RAVLs.tsv \
-o $OUT/candidate_RAVLs_supports.tsv \
--graph-cache $STORE \
--match 'candidate: (type1)-[lab_id {type_label:t_lab, label:prop1, property_label:p1_lab, positives:pos}]->(val), counts_per_type: (type1)-[]->(count)' \
--return 'type1 as node1, t_lab as type_label, prop1 as label, p1_lab as property_label, val as node2, pos as positives, kgtk_quantity_number_float(pos)/kgtk_quantity_number(count) as support, lab_id as id' \
--order-by 'type1'


In [177]:
display(pd.read_csv("{}/candidate_RAVLs_supports.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,type_label,label,property_label,type2,type2_label,prop2,property2_label,node2,si_units,wd_units,positives,support,id
0,Q100023,'metasyntactic variable'@en,P1343,'described by source'@en,Q1322187,'April Fools\' Day Request for ...,P1104,'number of pages'@en,14.0,,,1,1.0,Q100023_P1343_Q1322187_P1104_+14__
1,Q100023,'metasyntactic variable'@en,P1343,'described by source'@en,Q1322187,'April Fools\' Day Request for ...,P577,'publication date'@en,2001.0,,,1,1.0,Q100023_P1343_Q1322187_P577_2001__
2,Q100023,'metasyntactic variable'@en,P1343,'described by source'@en,Q212971,'Request for Comments'@en,P1104,'number of pages'@en,14.0,,,1,1.0,Q100023_P1343_Q212971_P1104_+14__
3,Q100023,'metasyntactic variable'@en,P1343,'described by source'@en,Q212971,'Request for Comments'@en,P577,'publication date'@en,2001.0,,,1,1.0,Q100023_P1343_Q212971_P577_2001__
4,Q100026,'F-16'@en,P17,'country'@en,Q3624078,'sovereign state'@en,P1081,'Human Development Index'@en,0.903,,,1,1.0,Q100026_P17_Q3624078_P1081_+0.903__


Filter labels based on support value, send to files.

In [178]:
support_file = "{}/candidate_RAVLs_supports.tsv".format(output_dir)
filtered_out_file = "{}/candidate_RAVLs_filtered_out.tsv".format(output_dir)
filtered_in_file = "{}/candidate_RAVLs_filtered_in.tsv".format(output_dir)
filter_summary_file = "{}/filter_summary_RAVLs.tsv".format(output_dir)
dynamic_filter_by_support(support_file, filtered_out_file, filtered_in_file, filter_summary_file, lbs=[.1], ub=.9, max_labels_per_type=500)

In [179]:
display(pd.read_csv("{}/filter_summary_RAVLs.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
0,Q100023,'metasyntactic variable'@en,0.1,0,4
1,Q100026,'F-16'@en,0.1,0,59
2,Q1000300,'Land Rover Series'@en,0.1,0,2
3,Q1000371,'personalization'@en,0.1,9,9
4,Q100039327,'autonomous constitutional agency'@en,0.1,17,56


## RAILs

Adding support column

In [116]:
!kgtk query -i $IN/entity_counts_per_type.tsv -i $IN/candidate_RAILs.tsv \
-o $OUT/candidate_RAILs_supports.tsv \
--graph-cache $STORE \
--match 'candidate: (type1)-[lab_id {type_label:t_lab, label:prop1, property_label:p1_lab, positives:pos}]->(val), counts_per_type: (type1)-[]->(count)' \
--return 'type1 as node1, t_lab as type_label, prop1 as label, p1_lab as property_label, val as node2, pos as positives, kgtk_quantity_number_float(pos)/kgtk_quantity_number(count) as support, lab_id as id' \
--order-by 'type1'


In [117]:
display(pd.read_csv("{}/candidate_RAILs_supports.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,type_label,label,property_label,type2,type2_label,prop2,property2_label,node2,lower_bound,upper_bound,si_units,wd_units,positives,support,id
0,Q100023,'metasyntactic variable'@en,P1343,'described by source'@en,Q1322187,'April Fools\' Day Request for ...,P1104,'number of pages'@en,,11.0,37.0,,,1,1.0,Q100023_P1343_Q1322187_P1104_11.0-37.0__
1,Q100023,'metasyntactic variable'@en,P1343,'described by source'@en,Q1322187,'April Fools\' Day Request for ...,P577,'publication date'@en,,2000.0,2010.0,,,1,1.0,Q100023_P1343_Q1322187_P577_2000-2010__
2,Q100023,'metasyntactic variable'@en,P1343,'described by source'@en,Q212971,'Request for Comments'@en,P1104,'number of pages'@en,,13.0,20.0,,,1,1.0,Q100023_P1343_Q212971_P1104_13.0-20.0__
3,Q100023,'metasyntactic variable'@en,P1343,'described by source'@en,Q212971,'Request for Comments'@en,P577,'publication date'@en,,2000.0,2010.0,,,1,1.0,Q100023_P1343_Q212971_P577_2000-2010__
4,Q100026,'F-16'@en,P17,'country'@en,Q3624078,'sovereign state'@en,P1081,'Human Development Index'@en,,0.853,0.957,,,1,1.0,Q100026_P17_Q3624078_P1081_0.853-0.957__


Filter labels based on support value, send to files.

In [118]:
support_file = "{}/candidate_RAILs_supports.tsv".format(output_dir)
filtered_out_file = "{}/candidate_RAILs_filtered_out.tsv".format(output_dir)
filtered_in_file = "{}/candidate_RAILs_filtered_in.tsv".format(output_dir)
filter_summary_file = "{}/filter_summary_RAILs.tsv".format(output_dir)
dynamic_filter_by_support(support_file, filtered_out_file, filtered_in_file, filter_summary_file, lbs=[.1], ub=.9, max_labels_per_type=500)

In [119]:
display(pd.read_csv("{}/filter_summary_RAILs.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
0,Q100023,'metasyntactic variable'@en,0.1,0,4
1,Q100026,'F-16'@en,0.1,0,58
2,Q1000300,'Land Rover Series'@en,0.1,0,2
3,Q1000371,'personalization'@en,0.1,5,5
4,Q100039327,'autonomous constitutional agency'@en,0.1,17,56


## AIL - time.year labels

Adding support column

In [120]:
!kgtk query -i $IN/entity_counts_per_type.tsv -i $IN/candidate_AILs_time.year.tsv \
-o $OUT/candidate_AILs_time.year_supports.tsv \
--graph-cache $STORE \
--match 'candidate: (type)-[lab_id {type_label:t_lab, label:prop, positives:pos, lower_bound:lb, upper_bound:ub, property_label:p_lab}]->(), counts_per_type: (type)-[]->(count)' \
--return 'type as node1, t_lab as type_label, prop as label, p_lab as property_label, "" as node2,lb as lower_bound, ub as upper_bound, pos as positives, kgtk_quantity_number_float(pos)/kgtk_quantity_number(count) as support, lab_id as id' \
--order-by 'type'

In [121]:
display(pd.read_csv("{}/candidate_AILs_time.year_supports.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,type_label,label,property_label,node2,lower_bound,upper_bound,positives,support,id
0,Q100039327,'autonomous constitutional agency'@en,P571,'inception'@en,,1990,2000,3,0.428571,Q100039327_P571_1990-2000
1,Q100039327,'autonomous constitutional agency'@en,P571,'inception'@en,,1920,1930,2,0.285714,Q100039327_P571_1920-1930
2,Q100039327,'autonomous constitutional agency'@en,P571,'inception'@en,,1930,1940,1,0.142857,Q100039327_P571_1930-1940
3,Q100039327,'autonomous constitutional agency'@en,P571,'inception'@en,,1970,1980,1,0.142857,Q100039327_P571_1970-1980
4,Q1000415,'Henry 180'@en,P585,'point in time'@en,,2010,2020,2,0.666667,Q1000415_P585_2010-2020


Filter labels based on support value, send to files.

In [122]:
support_file = "{}/candidate_AILs_time.year_supports.tsv".format(output_dir)
filtered_out_file = "{}/candidate_AILs_time.year_filtered_out.tsv".format(output_dir)
filtered_in_file = "{}/candidate_AILs_time.year_filtered_in.tsv".format(output_dir)
filter_summary_file = "{}/filter_summary_AILs_time.year.tsv".format(output_dir)
dynamic_filter_by_support(support_file, filtered_out_file, filtered_in_file, filter_summary_file, lbs=[.1,.03,.01,.003,.001,.0003], ub=.9, max_labels_per_type=100)

 67%|██████▋   | 4/6 [00:00<00:00, 19.32it/s]


In [123]:
df = pd.read_csv("{}/filter_summary_AILs_time.year.tsv".format(output_dir), delimiter = '\t').fillna("")
display(df.head())
display(df.loc[df.loc[:,"type"]=="Q5"])
display(df.loc[df.loc[:,"type"]=="Q7889"])
display(df.loc[df.loc[:,"type"]=="Q6256"])
display(df.loc[df.loc[:,"type"]=="Q41298"])

Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
0,Q4441564,'stack-oriented programming language'@en,0.0003,2,2
1,Q1929383,'sneakers'@en,0.0003,2,2
2,Q26706,'International Youth Congress of Esperanto'@en,0.0003,24,24
3,Q1048525,'golf course'@en,0.0003,32,32
4,Q7307,'kiss'@en,0.0003,1,1


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
25879,Q5,'human'@en,0.001,86,1276


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
21212,Q7889,'video game'@en,0.0003,18,51


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
11453,Q6256,'country'@en,0.0003,92,92


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
25880,Q41298,'magazine'@en,0.001,82,187


## AIL - quantity labels

Adding support column

In [125]:
!kgtk query -i $IN/entity_counts_per_type.tsv -i $IN/candidate_AILs_quantity.tsv \
-o $OUT/candidate_AILs_quantity_supports.tsv \
--graph-cache $STORE \
--match 'candidate: (type)-[lab_id {type_label:t_lab, label:prop, property_label:p_lab, positives:pos, lower_bound:lb, upper_bound:ub, si_units:si, wd_units:wd}]->(), counts_per_type: (type)-[]->(count)' \
--return 'type as node1, t_lab as type_label, prop as label, p_lab as property_label, "" as node2, lb as lower_bound, ub as upper_bound, si as si_units, wd as wd_units, pos as positives, kgtk_quantity_number_float(pos)/kgtk_quantity_number(count) as support,  lab_id as id' \
--order-by 'type'

In [126]:
display(pd.read_csv("{}/candidate_AILs_quantity_supports.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,type_label,label,property_label,node2,lower_bound,upper_bound,si_units,wd_units,positives,support,id
0,Q1000300,'Land Rover Series'@en,P2043,'length'@en,,3345.0,3345.0,,Q174789,1,0.5,Q1000300_P2043_3345.0-3345.0__Q174789
1,Q1000300,'Land Rover Series'@en,P2048,'height'@en,,1867.0,1867.0,,Q174789,1,0.5,Q1000300_P2048_1867.0-1867.0__Q174789
2,Q1000300,'Land Rover Series'@en,P2049,'width'@en,,1549.0,1549.0,,Q174789,1,0.5,Q1000300_P2049_1549.0-1549.0__Q174789
3,Q1000300,'Land Rover Series'@en,P2067,'mass'@en,,1177.0,1177.0,,Q11570,1,0.5,Q1000300_P2067_1177.0-1177.0__Q11570
4,Q1000415,'Henry 180'@en,P3157,'event distance'@en,,45.0,45.0,,Q26484625,2,0.666667,Q1000415_P3157_45.0-45.0__Q26484625


Filter labels based on support value, send to files.

In [127]:
support_file = "{}/candidate_AILs_quantity_supports.tsv".format(output_dir)
filtered_out_file = "{}/candidate_AILs_quantity_filtered_out.tsv".format(output_dir)
filtered_in_file = "{}/candidate_AILs_quantity_filtered_in.tsv".format(output_dir)
filter_summary_file = "{}/filter_summary_AILs_quantity.tsv".format(output_dir)
dynamic_filter_by_support(support_file, filtered_out_file, filtered_in_file, filter_summary_file, lbs=[.1,.03,.01,.003,.001,.0003], ub=.9, max_labels_per_type=100)

 50%|█████     | 3/6 [00:00<00:00, 27.95it/s]


In [128]:
df = pd.read_csv("{}/filter_summary_AILs_quantity.tsv".format(output_dir), delimiter = '\t').fillna("")
display(df.head())
display(df.loc[df.loc[:,"type"]=="Q5"])
display(df.loc[df.loc[:,"type"]=="Q7889"])
display(df.loc[df.loc[:,"type"]=="Q6256"])
display(df.loc[df.loc[:,"type"]=="Q41298"])

Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
0,Q35798,'executive branch'@en,0.0003,18,18
1,Q34442,'road'@en,0.0003,21,43
2,Q174713,'planet trail'@en,0.0003,5,5
3,Q30894535,'European Road Championships – Men\'s e...,0.0003,2,4
4,Q945344,'rift zone'@en,0.0003,2,2


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
11387,Q5,'human'@en,0.0003,28,336


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
4238,Q7889,'video game'@en,0.0003,9,129


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
15809,Q6256,'country'@en,0.01,99,139


Unnamed: 0,type,type_label,lb,# labels in bounds,# labels before filtering
15078,Q41298,'magazine'@en,0.0003,21,87


# Create table of entities and the labels that have passed filtering
## First filter down tables of entities and labels for each kind of label separately
We will drop the label info columns here since they are saved separately.

## AVL - time.year labels

In [131]:
%%time
!kgtk ifexists --input-file $IN/entity_AVLs_time.year.tsv --filter-file $OUT/candidate_AVLs_time.year_filtered_in.tsv \
--input-keys node2 --filter-keys id -o $OUT/entity_AVLs_time.year_filtered.tsv

CPU times: user 789 ms, sys: 272 ms, total: 1.06 s
Wall time: 50.2 s
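
`kgtk ifexists` keeps the input rows whose key matches a key in the filter file. Here it keeps entity rows whose `node2` equals the `id` of a label that passed filtering. A plain-Python sketch of that semi-join on made-up rows (the entity IDs and the second label id are hypothetical, and this is not how KGTK implements the command):

```python
# Hypothetical ids of filtered-in labels (the `id` column of the filter file)
filtered_in_ids = {"Q1477856_P577_2012"}

# Hypothetical entity rows: (node1, label, node2)
entity_rows = [
    ("Q_demo_1", "_", "Q1477856_P577_2012"),
    ("Q_demo_2", "_", "Q1477856_P577_1850"),
    ("Q_demo_3", "_", "Q1477856_P577_2012"),
]

# Keep a row only if its node2 appears among the filtered-in label ids
kept_rows = [row for row in entity_rows if row[2] in filtered_in_ids]
print(kept_rows)
```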


In [135]:
!wc -l $IN/entity_AVLs_time.year.tsv

12746968 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/label_creation/entity_AVLs_time.year.tsv


In [137]:
!wc -l $OUT/entity_AVLs_time.year_filtered.tsv

360456 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/entity_AVLs_time.year_filtered.tsv


In [180]:
%%time
!kgtk remove-columns -i $OUT/entity_AVLs_time.year_filtered.tsv \
--columns type type_label prop property_label value -o $OUT/entity_AVLs_time.year_filtered_temp.tsv \
&& mv $OUT/entity_AVLs_time.year_filtered_temp.tsv $OUT/entity_AVLs_time.year_filtered.tsv

CPU times: user 146 ms, sys: 216 ms, total: 362 ms
Wall time: 6.86 s


In [181]:
display(pd.read_csv("{}/entity_AVLs_time.year_filtered.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,label,node2,id
0,Q100004820,_,Q1477856_P577_2012,E144
1,Q100004823,_,Q1477856_P577_2012,E146
2,Q100004824,_,Q1477856_P577_2012,E148
3,Q100004829,_,Q1477856_P577_2012,E150
4,Q100004830,_,Q1477856_P577_2012,E152
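
The column-dropping step can be pictured with the standard `csv` module. The sketch below removes the same column names from a tiny in-memory TSV; the sample row is hypothetical and carries only two of the extra columns.

```python
import csv
import io

# Hypothetical TSV with a header and one data row
tsv = (
    "node1\tlabel\tnode2\tid\ttype\tprop\n"
    "Q_demo_1\t_\tQ1477856_P577_2012\tE1\tQ1477856\tP577\n"
)
# Column names passed to `--columns` in the command above
drop = {"type", "type_label", "prop", "property_label", "value"}

reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
keep = [c for c in reader.fieldnames if c not in drop]
rows = [{c: row[c] for c in keep} for row in reader]
print(keep, rows)
```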


## AVL - quantity labels

In [139]:
%%time
!kgtk ifexists --input-file $IN/entity_AVLs_quantity.tsv --filter-file $OUT/candidate_AVLs_quantity_filtered_in.tsv \
--input-keys node2 --filter-keys id -o $OUT/entity_AVLs_quantity_filtered.tsv

CPU times: user 617 ms, sys: 207 ms, total: 824 ms
Wall time: 39.3 s


In [140]:
!wc -l $IN/entity_AVLs_quantity.tsv

9704175 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/label_creation/entity_AVLs_quantity.tsv


In [141]:
!wc -l $OUT/entity_AVLs_quantity_filtered.tsv

887929 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/entity_AVLs_quantity_filtered.tsv


In [182]:
%%time
!kgtk remove-columns -i $OUT/entity_AVLs_quantity_filtered.tsv \
--columns type type_label prop property_label value si_units wd_units -o $OUT/entity_AVLs_quantity_filtered_temp.tsv \
&& mv $OUT/entity_AVLs_quantity_filtered_temp.tsv $OUT/entity_AVLs_quantity_filtered.tsv

CPU times: user 161 ms, sys: 166 ms, total: 327 ms
Wall time: 7.61 s


In [183]:
display(pd.read_csv("{}/entity_AVLs_quantity_filtered.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,label,node2,id
0,P1187,_,Q64289010_P4876_+42496__,E21
1,P1188,_,Q64289010_P4876_+58916__,E23
2,P1803,_,Q27870307_P4876_+440000__,E69
3,P1937,_,Q29642812_P4876_+103034__,E83
4,P1948,_,Q29546563_P4876_+8865__,E85


## RELs

In [143]:
%%time
!kgtk ifexists --input-file $IN/entity_RELs.tsv --filter-file $OUT/candidate_RELs_filtered_in.tsv \
--input-keys node2 --filter-keys id -o $OUT/entity_RELs_filtered.tsv

CPU times: user 10 s, sys: 2.21 s, total: 12.3 s
Wall time: 10min 50s


In [144]:
!wc -l $IN/entity_RELs.tsv

155762523 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/label_creation/entity_RELs.tsv


In [145]:
!wc -l $OUT/entity_RELs_filtered.tsv

64546468 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/entity_RELs_filtered.tsv


In [184]:
%%time
!kgtk remove-columns -i $OUT/entity_RELs_filtered.tsv \
--columns type type_label prop property_label value -o $OUT/entity_RELs_filtered_temp.tsv \
&& mv $OUT/entity_RELs_filtered_temp.tsv $OUT/entity_RELs_filtered.tsv

CPU times: user 7.77 s, sys: 1.73 s, total: 9.5 s
Wall time: 5min 59s


In [185]:
display(pd.read_csv("{}/entity_RELs_filtered.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,label,node2,id
0,P10,_,Q18610173_P1629_Q34508,E1
1,P10,_,Q18610173_P1855_Q15075950,E2
2,P10,_,Q18610173_P1855_Q4504,E3
3,P10,_,Q18610173_P1855_Q69063653,E4
4,P10,_,Q18610173_P1855_Q7378,E5


## RAVLs

In [147]:
%%time
!kgtk ifexists --input-file $IN/entity_RAVLs.tsv --filter-file $OUT/candidate_RAVLs_filtered_in.tsv \
--input-keys node2 --filter-keys id -o $OUT/entity_RAVLs_filtered.tsv

CPU times: user 1min 33s, sys: 20.5 s, total: 1min 53s
Wall time: 1h 40min 43s


In [148]:
!wc -l $IN/entity_RAVLs.tsv

1382700851 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/label_creation/entity_RAVLs.tsv


In [149]:
!wc -l $OUT/entity_RAVLs_filtered.tsv

418593915 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/entity_RAVLs_filtered.tsv


In [186]:
%%time
!kgtk remove-columns -i $OUT/entity_RAVLs_filtered.tsv \
--columns type type_label prop property_label type2 type2_label prop2 property2_label value si_units wd_units -o $OUT/entity_RAVLs_filtered_temp.tsv \
&& mv $OUT/entity_RAVLs_filtered_temp.tsv $OUT/entity_RAVLs_filtered.tsv

CPU times: user 59.2 s, sys: 11.4 s, total: 1min 10s
Wall time: 44min 51s


In [187]:
display(pd.read_csv("{}/entity_RAVLs_filtered.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,label,node2,id
0,P1003,_,Q19595382_P17_Q3624078_P2884_+230__Q25250,E103
1,P1003,_,Q19595382_P17_Q6256_P2884_+230__Q25250,E104
2,P1003,_,Q19595382_P17_Q3624078_P2997_+18__Q24564698,E105
3,P1003,_,Q19595382_P17_Q6256_P2997_+18__Q24564698,E106
4,P1003,_,Q19595382_P17_Q3624078_P3000_+18__Q24564698,E107


## RAILs

In [151]:
%%time
!kgtk ifexists --input-file $IN/entity_RAILs.tsv --filter-file $OUT/candidate_RAILs_filtered_in.tsv \
--input-keys node2 --filter-keys id -o $OUT/entity_RAILs_filtered.tsv

CPU times: user 1min 32s, sys: 19.8 s, total: 1min 52s
Wall time: 1h 40min 31s


In [153]:
!wc -l $IN/entity_RAILs.tsv

1387585809 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/label_creation/entity_RAILs.tsv


In [154]:
!wc -l $OUT/entity_RAILs_filtered.tsv

524088521 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/entity_RAILs_filtered.tsv


In [188]:
%%time
!kgtk remove-columns -i $OUT/entity_RAILs_filtered.tsv \
--columns type type_label prop property_label type2 type2_label prop2 property2_label lower_bound upper_bound si_units wd_units \
-o $OUT/entity_RAILs_filtered_temp.tsv \
&& mv $OUT/entity_RAILs_filtered_temp.tsv $OUT/entity_RAILs_filtered.tsv

CPU times: user 1min 15s, sys: 14.4 s, total: 1min 29s
Wall time: 57min 36s


In [189]:
display(pd.read_csv("{}/entity_RAILs_filtered.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,label,node2,id
0,P1000,_,Q18608871_P1855_Q5_P569_1980-1990__,E12
1,P1003,_,Q96776953_P17_Q3624078_P1081_0.768-0.847__,E95
2,P1003,_,Q96776953_P17_Q6256_P1081_0.774-0.855__,E99
3,P1003,_,Q19595382_P17_Q3624078_P1198_6.0-8.0__Q11229,E100
4,P1003,_,Q19833377_P17_Q3624078_P1198_6.0-8.0__Q11229,E101


## AIL - time.year

In [156]:
%%time
!kgtk ifexists --input-file $IN/entity_AILs_time.year.tsv --filter-file $OUT/candidate_AILs_time.year_filtered_in.tsv \
--input-keys node2 --filter-keys id -o $OUT/entity_AILs_time.year_filtered.tsv

CPU times: user 1.03 s, sys: 348 ms, total: 1.38 s
Wall time: 1min 9s


In [157]:
!wc -l $IN/entity_AILs_time.year.tsv

12701962 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/label_creation/entity_AILs_time.year.tsv


In [158]:
!wc -l $OUT/entity_AILs_time.year_filtered.tsv

12184784 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/entity_AILs_time.year_filtered.tsv


In [190]:
%%time
!kgtk remove-columns -i $OUT/entity_AILs_time.year_filtered.tsv \
--columns type type_label prop property_label lower_bound upper_bound \
-o $OUT/entity_AILs_time.year_filtered_temp.tsv \
&& mv $OUT/entity_AILs_time.year_filtered_temp.tsv $OUT/entity_AILs_time.year_filtered.tsv

CPU times: user 1.59 s, sys: 436 ms, total: 2.02 s
Wall time: 1min 8s


In [191]:
display(pd.read_csv("{}/entity_AILs_time.year_filtered.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,label,node2,id
0,P1841,_,Q55977691_P580_2010-2020,E1
1,P2847,_,Q18608871_P2669_2010-2020,E2
2,P2847,_,Q24239898_P2669_2010-2020,E3
3,P2847,_,Q30041186_P2669_2010-2020,E4
4,P2847,_,Q60457486_P2669_2010-2020,E5


## AIL - quantity

In [160]:
%%time
!kgtk ifexists --input-file $IN/entity_AILs_quantity.tsv --filter-file $OUT/candidate_AILs_quantity_filtered_in.tsv \
--input-keys node2 --filter-keys id -o $OUT/entity_AILs_quantity_filtered.tsv

CPU times: user 794 ms, sys: 241 ms, total: 1.04 s
Wall time: 50.3 s


In [161]:
!wc -l $IN/entity_AILs_quantity.tsv

9704175 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/label_creation/entity_AILs_quantity.tsv


In [162]:
!wc -l $OUT/entity_AILs_quantity_filtered.tsv

9548943 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/entity_AILs_quantity_filtered.tsv


In [192]:
%%time
!kgtk remove-columns -i $OUT/entity_AILs_quantity_filtered.tsv \
--columns type type_label prop property_label lower_bound upper_bound si_units wd_units \
-o $OUT/entity_AILs_quantity_filtered_temp.tsv \
&& mv $OUT/entity_AILs_quantity_filtered_temp.tsv $OUT/entity_AILs_quantity_filtered.tsv

CPU times: user 1.23 s, sys: 456 ms, total: 1.68 s
Wall time: 58.1 s


In [193]:
display(pd.read_csv("{}/entity_AILs_quantity_filtered.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,label,node2,id
0,P1004,_,Q19829908_P4876_7883.0-37427.0__,E1
1,P1004,_,Q24075706_P4876_18000.0-95823.0__,E2
2,P1004,_,Q27525351_P4876_37427.0-45000.0__,E3
3,P1014,_,Q27918607_P4876_38128.0-160203.0__,E4
4,P1014,_,Q43831109_P4876_53249.0-53249.0__,E5
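
The `wc -l` outputs in this section make it easy to see how aggressively each label type was filtered. The sketch below turns those line counts into retained fractions. It assumes each file has exactly one header line, so rows = lines - 1. The names `line_counts` and `retained_fraction` are illustrative only.

```python
# (input_lines, filtered_lines) as reported by the wc -l cells in this section.
line_counts = {
    "AVL quantity": (9704175, 887929),
    "REL": (155762523, 64546468),
    "RAVL": (1382700851, 418593915),
    "RAIL": (1387585809, 524088521),
    "AIL time.year": (12701962, 12184784),
    "AIL quantity": (9704175, 9548943),
}


def retained_fraction(input_lines, filtered_lines):
    # Assumption: one header line per file, so subtract it from both counts.
    return (filtered_lines - 1) / (input_lines - 1)


retention = {name: retained_fraction(*counts) for name, counts in line_counts.items()}
for name, fraction in retention.items():
    print("{}: {:.1%} of entity-label rows kept".format(name, fraction))
```

With these counts, the AVL quantity labels are filtered far more heavily than the AIL labels, most of whose entity-label rows survive.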


## Finally, combine tables of entities and their filtered labels

In [194]:
!kgtk cat -i $OUT/entity_RELs_filtered.tsv \
-i $OUT/entity_AVLs_quantity_filtered.tsv \
-i $OUT/entity_AVLs_time.year_filtered.tsv \
-i $OUT/entity_AILs_quantity_filtered.tsv \
-i $OUT/entity_AILs_time.year_filtered.tsv \
-i $OUT/entity_RAVLs_filtered.tsv \
-i $OUT/entity_RAILs_filtered.tsv \
-o $OUT/all_entity_labels_filtered.tsv \
&& kgtk add-id -i $OUT/all_entity_labels_filtered.tsv --overwrite-id -o $OUT/all_entity_labels_filtered_temp.tsv \
&& mv $OUT/all_entity_labels_filtered_temp.tsv $OUT/all_entity_labels_filtered.tsv

^C

Keyboard interrupt in cat -i /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/entity_RELs_filtered.tsv -i /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/entity_AVLs_quantity_filtered.tsv -i /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/entity_AVLs_time.year_filtered.tsv -i /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/entity_AILs_quantity_filtered.tsv -i /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/entity_AILs_time.year_filtered.tsv -i /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/entity_RAVLs_filtered.tsv -i /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/entity_RAILs_filtered.tsv -o /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/all_entity_labels_filtered.tsv.


In [167]:
!wc -l $OUT/all_entity_labels_filtered.tsv

1030211010 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/all_entity_labels_filtered.tsv
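
Conceptually, the combine step stacks the per-type tables and then reassigns the `id` column, because ids such as `E1` repeat across the individual files. A toy pandas sketch of that idea follows. The sample rows are copied from the previews above, and the `E<n>` numbering mirrors the ids visible in those previews. Treat the exact id format `kgtk add-id` produces as an assumption.

```python
import pandas as pd

# Two small per-type tables whose ids collide (both contain E1).
rels = pd.DataFrame({
    "node1": ["P10", "P10"],
    "label": ["_", "_"],
    "node2": ["Q18610173_P1629_Q34508", "Q18610173_P1855_Q15075950"],
    "id": ["E1", "E2"],
})
ails = pd.DataFrame({
    "node1": ["P1841"],
    "label": ["_"],
    "node2": ["Q55977691_P580_2010-2020"],
    "id": ["E1"],
})

# Stack the tables, then overwrite ids so they are unique across the result.
combined = pd.concat([rels, ails], ignore_index=True)
combined["id"] = ["E{}".format(i) for i in range(1, len(combined) + 1)]
print(combined)
```

A quick sanity check after combining is that the row count of the result equals the sum of the row counts of the inputs.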


### Also combine label-info files

In [169]:
!kgtk cat -i $OUT/candidate_RELs_filtered_in.tsv \
-i $OUT/candidate_AVLs_quantity_filtered_in.tsv \
-i $OUT/candidate_AVLs_time.year_filtered_in.tsv \
-i $OUT/candidate_AILs_quantity_filtered_in.tsv \
-i $OUT/candidate_AILs_time.year_filtered_in.tsv \
-i $OUT/candidate_RAVLs_filtered_in.tsv \
-i $OUT/candidate_RAILs_filtered_in.tsv \
-o $OUT/all_candidate_labels_filtered_in.tsv 

In [170]:
!wc -l $OUT/all_candidate_labels_filtered_in.tsv

9932712 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/candidate_filter/all_candidate_labels_filtered_in.tsv
