Diversity matters: Robustness of bias measurements in Wikidata

This work, accepted at ACM WebSci 2023, is by Paramita Das, Sai Keerthana Karnam, Anirban Panda, Bhanu Prakash Reddy Guda, Soumya Sarkar and Animesh Mukherjee.

Find our paper on arXiv (arXiv:2302.14027).

Dataset Preparation

We used kgtk to prepare the dataset. All the required files are in the DATASET folder.

wiki_dump=dataset_path
kgtk import-wikidata -i "$wiki_dump" --node nodefile.tsv --edge edgefile.tsv --qual qualfile.tsv

# The node file contains the labels of all entities and relations in Wikidata.
# The edge file contains all the triples in Wikidata along with their descriptions. We drop all other columns and keep only the head, relation and tail entity. This file is saved as Final_triples_QP.tsv.
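
The column reduction described above can be sketched in Python. This is a minimal illustration, not the script used in this repository. It assumes the edge file is tab-separated with a header row that names `node1`, `label` and `node2` columns (the usual KGTK edge columns); the function name `keep_triple_columns` and the sample rows are hypothetical.

```python
import csv
import io


def keep_triple_columns(edge_lines, columns=("node1", "label", "node2")):
    """Yield (head, relation, tail) tuples from a tab-separated edge file.

    Assumes the first row is a header that names the columns.
    """
    reader = csv.DictReader(edge_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield tuple(row[c] for c in columns)


# Tiny in-memory sample standing in for edgefile.tsv (illustrative rows only).
sample = io.StringIO(
    "id\tnode1\tlabel\tnode2\trank\n"
    "e1\tQ42\tP31\tQ5\tnormal\n"
    "e2\tQ42\tP21\tQ6581097\tnormal\n"
)
triples = list(keep_triple_columns(sample))
print(triples)
```

For a real file, the same function can be given an open file handle instead of the in-memory sample.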

Collection of Demography datasets

First, we extract all the humans from the entire dataset.

# P31 - instance of and Q5 - human

kgtk filter -i Final_triples_QP.tsv -p ";P31;Q5" > humans.tsv
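
To make the filter pattern concrete, here is a small Python sketch of what a `head;relation;tail` pattern such as ";P31;Q5" selects. It is an illustration of the pattern semantics only (an empty field matches anything, commas separate alternatives), not a replacement for kgtk; the helper `matches_pattern` and the sample triples are hypothetical.

```python
def matches_pattern(triple, pattern):
    """Return True if a (head, relation, tail) triple matches 'h;r;t'.

    An empty field in the pattern matches any value; several accepted
    values for one field can be separated by commas.
    """
    fields = pattern.split(";")
    if len(fields) != 3:
        raise ValueError("pattern must have three ';'-separated fields")
    for value, field in zip(triple, fields):
        allowed = [v for v in field.split(",") if v]
        if allowed and value not in allowed:
            return False
    return True


# Illustrative triples: only the first one states "instance of human".
triples = [
    ("Q42", "P31", "Q5"),
    ("Q64", "P31", "Q515"),
    ("Q42", "P21", "Q6581097"),
]
humans = [t for t in triples if matches_pattern(t, ";P31;Q5")]
print(humans)
```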

The following commands extract each demographic dataset. Let Qnode_demo be the Qnode entity of a particular demography. Labels of entities and relations that are not present in nodefile.tsv are looked up using the qwikidata Python package.

kgtk filter -i Final_triples_QP.tsv -p ";P21;Qnode_demo" > humans.tsv
kgtk join --right-file citizens.tsv --left-file ../humans.tsv > humans.tsv
kgtk join --right-file humans.tsv --left-file Final_triples_QP.tsv > All_triples.tsv
kgtk filter -i All_triples.tsv --regex --match-type fullmatch -p ';;Q[0-9].*' > rhs.tsv

# The All_triples.tsv file contains all the triples in format 
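
The last two commands can be read as "keep every triple about a selected human, then keep only those whose tail is another entity". The Python sketch below illustrates that reading. It assumes the join keeps the triples of Final_triples_QP.tsv whose head appears in humans.tsv, and it reuses the `Q[0-9].*` full-match pattern from the command; the function names and sample triples are hypothetical.

```python
import re

# Same pattern as in the kgtk command: the tail must fully match a Q-node ID.
QNODE_TAIL = re.compile(r"Q[0-9].*")


def triples_about(entities, triples):
    """Keep triples whose head is one of the given entities."""
    wanted = set(entities)
    return [t for t in triples if t[0] in wanted]


def entity_tail_only(triples):
    """Keep triples whose tail fully matches the Q-node pattern."""
    return [t for t in triples if QNODE_TAIL.fullmatch(t[2])]


# Illustrative data: one entity tail, one literal tail, one unrelated head.
all_triples = [
    ("Q42", "P106", "Q36180"),
    ("Q42", "P569", "^1952-03-11T00:00:00Z/11"),
    ("Q64", "P31", "Q515"),
]
selected_humans = ["Q42"]

about_humans = triples_about(selected_humans, all_triples)
rhs = entity_tail_only(about_humans)
print(rhs)
```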

python find_labels.py 

# This command helps to label the entities and relations present in the demography dataset.

python write_labels.py

# This command prepares the final demographic dataset.
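
As a rough picture of the labelling step, the sketch below replaces IDs in a triple with their labels and keeps the raw ID when no label is known. It is not the content of find_labels.py or write_labels.py, whose internals are not described here; the function `label_triples` and the small label map are assumptions for illustration.

```python
def label_triples(triples, labels):
    """Replace each ID with its label, keeping the raw ID when no label is known."""
    return [tuple(labels.get(x, x) for x in t) for t in triples]


# Illustrative label map; in practice it would come from nodefile.tsv,
# with missing entries looked up separately.
labels = {"Q42": "Douglas Adams", "P106": "occupation", "Q36180": "writer"}
triples = [("Q42", "P106", "Q36180"), ("Q42", "P106", "Q999999999")]

labelled = label_triples(triples, labels)
print(labelled)
```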

Formation of giant knowledge graph

To form the giant knowledge graph that represents all these demographics, we followed the method described in the paper. The final dataset can be downloaded here.

Generation of the knowledge graph embedding

To generate the embeddings we used the AmpliGraph library. The files for this are in the GENERATE_EMBEDS folder. The following commands train the model and evaluate its performance.

cd GENERATE_EMBEDS
# Then, in a Python session started from this folder:
from train_model import train_model, eval_model
train_model(model_name, path_to_dataset, num_epochs, saved_model_path)
eval_model(saved_model_path, path_to_dataset, n)
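
Link prediction models are commonly evaluated with mean reciprocal rank (MRR) and Hits@n. The sketch below shows how these two metrics are computed from a list of ranks. It assumes, without confirmation from the code, that `eval_model` reports metrics of this kind and that `n` is the Hits@n cutoff; the function names and the example ranks are hypothetical.

```python
def mrr(ranks):
    """Mean reciprocal rank: average of 1/rank over all test triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_n(ranks, n):
    """Fraction of test triples whose true entity is ranked within the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)


# Made-up ranks of the true entity for four test triples (1 is best).
ranks = [1, 2, 4, 10]
print(mrr(ranks), hits_at_n(ranks, 3))
```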

Bias measurement in KG embedding

The code for the bias measurement is in the BIAS_METRIC folder. The following commands run the bias measurement for each demography.

import json
import operator
from preprocess import preprocess
from grad_descent_for_bias import train
from find_bias import find_bias
from get_biases_for_all_professions import get_biases

# Preprocessing to get the humans,occupations and gender entities embeddings 

all_human_dataset, human2id, human_embeddings, gmf2id, gmf_embeddings = preprocess(demographic_dataset_path, model_path , entire_dataset_path)
with open(human2id_path, "w") as f:
    json.dump(human2id, f)

# Updating the embeddings using gradient descent 

model = train(seed, human2id, gmf2id, human_embeddings, gmf_embeddings, train_df, batch_size, use_gpu, learning_rate, checkpoint_path, dimension, epochs, best_loss_input, ifSave, relative_gender)
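
To give an intuition for updating an embedding by gradient descent, the sketch below takes one descent step on a human embedding so that it moves closer to a gender entity under a TransE-style distance. This is an assumption-laden toy: the scoring function used by `train` depends on the chosen model and is not stated here, and `squared_distance`, `descent_step` and all vectors are hypothetical.

```python
def squared_distance(h, r, t):
    """TransE-style loss: squared distance between h + r and t."""
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t))


def descent_step(h, r, t, learning_rate):
    """One gradient descent step on squared_distance with respect to h.

    The gradient with respect to h is 2 * (h + r - t).
    """
    return [hi - learning_rate * 2.0 * (hi + ri - ti) for hi, ri, ti in zip(h, r, t)]


# Made-up 2-dimensional embeddings for illustration.
human = [0.5, -0.2]
gender_relation = [0.1, 0.3]
gender_entity = [0.0, 0.4]

updated = descent_step(human, gender_relation, gender_entity, learning_rate=0.1)
before = squared_distance(human, gender_relation, gender_entity)
after = squared_distance(updated, gender_relation, gender_entity)
print(before, after)
```

After the step the loss is lower, meaning the updated human embedding fits the (human, gender relation, gender entity) triple better than before.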

# Finding the bias scores and sorting in decreasing order to rank the professions. 

find_bias(human2id_path, ckpt_path, professions_path, embeddings_path, dimension, path_to_save_bias_scores)
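
The ranking step mentioned above (sorting bias scores in decreasing order) can be sketched as follows. The scores and profession names are placeholders, not results from the paper, and `rank_professions` is a hypothetical helper rather than a function from this repository.

```python
from operator import itemgetter


def rank_professions(bias_scores):
    """Return (profession, score) pairs sorted by score, highest first."""
    return sorted(bias_scores.items(), key=itemgetter(1), reverse=True)


# Placeholder scores for illustration only.
bias_scores = {"profession_a": -0.12, "profession_b": 0.30, "profession_c": 0.05}
ranked = rank_professions(bias_scores)
for profession, score in ranked:
    print(profession, score)
```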

Citation

@misc{das2023diversity,
      title={Diversity matters: Robustness of bias measurements in Wikidata}, 
      author={Paramita Das and Sai Keerthana Karnam and Anirban Panda and Bhanu Prakash Reddy Guda and Soumya Sarkar and Animesh Mukherjee},
      year={2023},
      eprint={2302.14027},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
