Code for the Master Thesis "Enhancing the Microsoft Academic Knowledge Graph"

lin-ao/enhancing_the_makg

Enhancing the Microsoft Academic Knowledge Graph
Code for the Master Thesis "Enhancing the Microsoft Academic Knowledge Graph"

Entity Resolution

Required packages: pyjarowinkler
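For orientation, the sketch below shows the kind of string similarity the Jaro-Winkler measure provides, which is useful when comparing author names. It is a standalone standard-library illustration, not the pyjarowinkler API and not the matching logic of the thesis scripts; the function names and the default scaling factor of 0.1 are assumptions.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * scaling * (1.0 - score)
```

A call such as jaro_winkler("john smith", "jon smith") returns a value close to 1 for near-identical names; any threshold for treating two names as the same author would have to be chosen separately.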

Code for data preparation, disambiguation, and recreating the files (alternatively, run execute.sh):

python 00.prepare_paper_references.py
python 01.extract_paper_id_with_doi.py
python 02.extract_author_with_paper_id.py

LANG=en_US.UTF-8 LC_ALL=C sort -n -t$'\t' -k1 02.author_id_with_paper_id.txt > 02.author_id_with_paper_id_sorted.txt

python 03.extract_paper_with_author_id.py

LANG=en_US.UTF-8 LC_ALL=C sort -n -t$'\t' -k1 03.paper_id_with_author_id.txt > 03.paper_id_with_author_id_sorted.txt 

python 04.author_id_merge_paper_id.py
python 05.paper_id_merge_author_ids.py
python 06.add_to_authors_paper_id.py
python 07.add_to_authors_doi.py
python 08.add_to_authors_coauthors.py
python 09.add_to_authors_titles.py
python 10.add_to_authors_year.py
python 11.add_to_authors_journal_and_conference.py
python 12.add_to_authors_references.py

mkdir sort
split -l 5000000 -d 12.authors_with_references.txt sort/sort_file
cd sort
for file in sort_file*; do
    echo "$file"
    LANG=en_US.UTF-8 LC_ALL=C sort -t$'\t' -k3 -o "$file" "$file"
done
LANG=en_US.UTF-8 LC_ALL=C sort -t$'\t' -k3 sort_file* > ../12.authors_with_references_sorted.txt
cd ..
rm -r sort

python 13.disambiguation_data.py
python 14.recreate_files.py
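The split, sort, and merge steps in the command sequence above amount to an external sort on the third tab-separated column. The following is a minimal Python sketch of the same idea for environments without GNU sort. The function names are hypothetical, and the byte-wise comparison is an assumption meant to mirror LC_ALL=C; it is not part of the repository.

```python
import heapq
import itertools
import os
import tempfile


def sort_key(line: bytes):
    # Approximates `sort -t$'\t' -k3` under LC_ALL=C: compare the bytes from
    # the third tab-separated field to the end of the line, then the full line.
    body = line.rstrip(b"\n")
    parts = body.split(b"\t", 2)
    key = parts[2] if len(parts) > 2 else b""
    return (key, body)


def external_sort(src: str, dst: str, chunk_lines: int = 5_000_000) -> None:
    """Sort a large file by sorting fixed-size chunks and merging them."""
    with tempfile.TemporaryDirectory() as tmp:
        chunk_paths = []
        with open(src, "rb") as fin:
            while True:
                chunk = list(itertools.islice(fin, chunk_lines))
                if not chunk:
                    break
                # Make sure every line ends with a newline before merging.
                chunk = [l if l.endswith(b"\n") else l + b"\n" for l in chunk]
                chunk.sort(key=sort_key)
                path = os.path.join(tmp, f"sort_file{len(chunk_paths):04d}")
                with open(path, "wb") as fout:
                    fout.writelines(chunk)
                chunk_paths.append(path)
        files = [open(p, "rb") for p in chunk_paths]
        try:
            with open(dst, "wb") as fout:
                # heapq.merge streams the pre-sorted chunks in key order.
                fout.writelines(heapq.merge(*files, key=sort_key))
        finally:
            for f in files:
                f.close()
```

For example, external_sort("12.authors_with_references.txt", "12.authors_with_references_sorted.txt") would play the role of the shell block, keeping only one chunk in memory at a time.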

Edit the following data paths for MAG files:

  • 00.prepare_paper_references.py: path to PaperReferences.txt
  • 01.extract_paper_id_with_doi.py: path to Papers.txt
  • 02.extract_author_with_paper_id.py: path to PaperAuthorAffiliations.txt
  • 03.extract_paper_with_author_id.py: path to PaperAuthorAffiliations.txt
  • 06.add_to_authors_paper_id.py: path to Authors.txt
  • 09.add_to_authors_titles.py: path to Papers.txt
  • 10.add_to_authors_year.py: path to Papers.txt
  • 11.add_to_authors_journal_and_conference.py: path to Papers.txt
  • 14.recreate_files.py: path to PaperAuthorAffiliations.txt

Files 15-19 are used for the evaluation in the thesis.

Field of Study Classification

Required packages: NLTK, Pandas, scikit-learn, simpletransformers, spaCy, pytextrank

  • files 00 and 01 convert MAG paper abstracts from inverted indexes to full texts
  • file 02 extracts field of study labels from the MAG (all 19 low level FoS); edit the path to FieldsOfStudy.txt accordingly
  • files 03-06 require a sorted version of the PaperFieldsOfStudy.txt file, which can be created with the following commands:
mkdir sort
split -l 5000000 -d PaperFieldsOfStudy.txt sort/sort_file
cd sort
for file in sort_file*; do
    echo "$file"
    LANG=en_US.UTF-8 LC_ALL=C sort -t$'\t' -k3 -o "$file" "$file"
done
LANG=en_US.UTF-8 LC_ALL=C sort -t$'\t' -k3 sort_file* > ../SortedPaperFieldsOfStudy.txt
cd ..
rm -r sort
  • execute file 03 to generate the data set using direct labels; edit the path to the sorted PaperFieldsOfStudy.txt accordingly
  • execute both 04 files in sequence to generate the data set using indirect labels; edit the paths to FieldOfStudyChildren.txt and the sorted PaperFieldsOfStudy.txt accordingly
  • execute all three 05 files in sequence to generate the data set using journal labels
  • use file 06 to evaluate MAG labels from the data sets generated by 03 and 04; edit the file path accordingly
  • use file 07 to generate training and evaluation sets
  • use file 08 to train and evaluate the classifier, edit the model as well as hyperparameters accordingly
  • use file 09 to classify MAG papers
  • use file 10 to match extracted labels with MAG papers
  • use file 11 to extract keywords
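The abstract conversion mentioned for files 00 and 01 can be sketched as follows. This assumes each abstract is stored as a JSON object with an "IndexLength" field and an "InvertedIndex" mapping from token to a list of positions; the function name is hypothetical and the repository scripts may handle more edge cases.

```python
import json


def inverted_index_to_text(indexed_abstract: str) -> str:
    """Rebuild the running text from a JSON-encoded inverted index."""
    data = json.loads(indexed_abstract)
    # One slot per token position; positions never mentioned stay empty.
    tokens = [""] * data["IndexLength"]
    for word, positions in data["InvertedIndex"].items():
        for pos in positions:
            tokens[pos] = word
    return " ".join(t for t in tokens if t)


example = '{"IndexLength": 5, "InvertedIndex": {"graph": [1, 4], "a": [0, 3], "is": [2]}}'
print(inverted_index_to_text(example))
```

Each token is written back to every position it occupies, so repeated words are restored in their original order.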


Knowledge Graph Embeddings

Required packages: DGL-KE

  • use files 00 and 01 to generate input files for training author embeddings; add the file path to the input graph Authors.nt
  • use files 02 and 03 to generate input files for training paper embeddings; add the file paths to the input graphs Papers.nt, Journals.nt, and ConferenceSeries.nt
  • execute 04 or run the following console command to train the embeddings; edit file paths, data sets, and hyperparameters accordingly
DGLBACKEND=pytorch dglke_train --model_name TransE_l2 \
    --data_path 02.knowledge_graph_embeddings --dataset mag_author \
    --data_files 01.author_entities.dict 01.author_relations.dict 01.author_train.tsv 01.author_valid.tsv 01.author_test.tsv \
    --format udd_hrt --batch_size 1000 --neg_sample_size 1000 \
    --hidden_dim 100 --gamma 19.9 --lr 0.25 --max_step 1000000 --log_interval 100 \
    --batch_size_eval 1000 --neg_sample_size_eval 1000 -adv \
    --regularization_coef 1.00E-09 --gpu 0 1 2 3 4 5 6 7 \
    --valid --test --mix_cpu_gpu
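The input files passed via --data_files have to be derived from the N-Triples graphs. A rough sketch of such a conversion is shown below. It assumes that only triples with an IRI object are kept, that the dictionary files contain one "id<TAB>name" pair per line, and that the triple files contain integer IDs in head, relation, tail order. The layout should be checked against the DGL-KE documentation for the udd_hrt format; all function names and example IRIs are hypothetical.

```python
import re

# Matches N-Triples statements whose object is an IRI; literal objects are skipped.
TRIPLE_RE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")


def build_kge_input(nt_lines):
    """Assign consecutive integer IDs to entities and relations."""
    entities, relations, triples = {}, {}, []
    for line in nt_lines:
        m = TRIPLE_RE.match(line)
        if m is None:
            continue
        head, rel, tail = m.groups()
        h = entities.setdefault(head, len(entities))
        r = relations.setdefault(rel, len(relations))
        t = entities.setdefault(tail, len(entities))
        triples.append((h, r, t))
    return entities, relations, triples


def dict_to_tsv(mapping):
    # Assumed layout: ID first, then the entity or relation name.
    return "".join(f"{idx}\t{name}\n" for name, idx in mapping.items())


def triples_to_tsv(triples):
    return "".join(f"{h}\t{r}\t{t}\n" for h, r, t in triples)


sample = [
    "<http://example.org/a1> <http://example.org/p/memberOf> <http://example.org/aff1> .",
    '<http://example.org/a1> <http://example.org/p/name> "Jane Doe" .',
    "<http://example.org/a2> <http://example.org/p/memberOf> <http://example.org/aff1> .",
]
entities, relations, triples = build_kge_input(sample)
```

The resulting triples would still need to be split into training, validation, and test files before training.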



Statistical Analysis
Required packages: Pandas, NumPy, seaborn, matplotlib, chord

Includes files used to generate graphs and data for statistical analysis

  • file 00 is used to count entity properties, edit file paths for Authors.txt and Papers.txt
  • file 01 is used to calculate the number of papers published per year, edit file path for Papers.txt
  • file 02 is used to generate data for table 25, uses files created during entity resolution, edit file paths accordingly
  • files 04 and 05 calculate data for figures 08, 09, 10, 11, 12 and 13: create a folder named 04.field_of_study_over_time beforehand, use file 04 to split the data by individual field of study, then use file 05 to generate time data for each field of study
  • file 06 is used to generate data for table 27, edit file path for Papers.txt
  • file 07 is used to generate data for figures 04 and 05, edit file path for Papers.txt
  • file 08 is used to generate data for figure 03, uses a file generated during entity resolution, edit file paths accordingly
  • files 09 and 12 are used to generate the data for figure 15 and the figure itself; file 09 uses data generated during entity resolution and field of study classification and produces a matrix, which file 12 loads to generate the chord graph
  • file 10 is used to generate data for figure 06, uses file generated during entity resolution, edit file path for Papers.txt
  • file 11 is used to generate figure 14
  • file 13 is used to generate data for table 26
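As an illustration of the per-year count described for file 01, the sketch below tallies papers by publication year from tab-separated lines. The zero-based column index of the year field (year_column=7) is an assumption that depends on the MAG schema version in use, and the function name is hypothetical.

```python
from collections import Counter


def papers_per_year(lines, year_column=7):
    """Count rows per publication year, skipping rows without a numeric year."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > year_column and fields[year_column].isdigit():
            counts[int(fields[year_column])] += 1
    return dict(sorted(counts.items()))


# Synthetic rows with the year in the third column (index 2) for brevity.
rows = ["1\tx\t2019\n", "2\ty\t2020\n", "3\tz\t2019\n", "4\tw\t\n"]
print(papers_per_year(rows, year_column=2))
```

Passing an open file handle for Papers.txt instead of a list streams the file line by line, so the whole file never has to fit in memory.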

Knowledge Graph Creation
Includes files used to generate the MAKG as well as the ontology file

  • files 00-20 create RDF representations of existing MAG files, edit file paths accordingly
  • file 21 uses our extracted keywords for each paper, edit file path accordingly
  • file 22 uses our field of study labels for papers, edit file path accordingly
  • files 23 and 24 use our disambiguated author and PaperAuthorAffiliations files, edit file paths accordingly
  • file 25 links MAG authors (undisambiguated, though input for disambiguated authors can be created using our provided files) to their ORCIDs; requires the ORCID file generated during entity resolution, edit file path accordingly
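To give an idea of what creating an RDF representation of a MAG file involves, the sketch below turns one tab-separated author row into an N-Triples statement. The base IRI, the use of foaf:name as the property, the column index of the name, and the function names are all assumptions for illustration; the actual scripts and the MAKG ontology define their own IRIs and properties.

```python
def escape_literal(value: str) -> str:
    """Escape characters that may not appear raw in an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def author_row_to_ntriples(row: str,
                           name_column: int = 3,
                           base: str = "http://example.org/entity/",
                           prop: str = "http://xmlns.com/foaf/0.1/name") -> str:
    """Build one triple: <author IRI> <name property> "display name" ."""
    fields = row.rstrip("\n").split("\t")
    author_id, name = fields[0], fields[name_column]
    return f'<{base}{author_id}> <{prop}> "{escape_literal(name)}" .'


row = '123\t1\tjane doe\tJane "J" Doe\n'
print(author_row_to_ntriples(row))
```

Escaping backslashes first keeps the later replacements from being escaped a second time.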
