data amusement on the microsoft academic graph
.gitignore revision v1 to the plotting notebook Mar 6, 2016
batch2.sh several scripts Aug 25, 2016
batch_chi.sh several scripts Aug 25, 2016
batch_construct.sh various updates Feb 16, 2016
batch_cryptoinfo.sh several scripts Aug 25, 2016
batch_ml.sh several scripts Aug 25, 2016
batch_plot_citations.ipynb started code for OOPSLA paper filtering Feb 7, 2017
citation_join.py Merge branch 'master' of https://github.com/lexingxie/academic-graph Apr 6, 2016
citation_joins.ipynb working version of citation join Jan 27, 2016
construct_citation_table.ipynb make gather_conf take commandline input Apr 8, 2016
construct_citation_table.py fix pandans indexing bug Apr 8, 2016
csrank_03.sh batch #3 of cs ranking conferences Aug 25, 2016
csrank_04.sh leftover two conference Aug 27, 2016
export_citations.ipynb intermediate save Jan 27, 2016
export_citations.py various scripting update Apr 6, 2016
filter_citations.ipynb started code for OOPSLA paper filtering Feb 7, 2017
gather_conf_data.py make gather_conf take commandline input Apr 8, 2016
load_data.ipynb data munging scripts Jan 27, 2016
load_data.py use os.system for shell cmd Apr 6, 2016
paperdb.sql various scripting update Apr 6, 2016
plot_citations.ipynb started code for OOPSLA paper filtering Feb 7, 2017
plot_citations.py tweak post format and text for conference pages. Aug 17, 2016
plot_conf_citations.ipynb tweak post format and text for conference pages. Aug 17, 2016
plot_conf_citations.py multiple bug and style fixes to citation plotting Jul 14, 2016
prune_papers.ipynb use os.system for shell cmd Apr 6, 2016
prune_papers.py use os.system for shell cmd Apr 6, 2016
readme.md code for plotting the half-circle graph Jun 1, 2016

readme.md

This repo contains scripts to process the Microsoft Academic Graph (MAG), in order to profile the citation influence and reference heritage of a publication venue (e.g. a conference).

developer workflow to analyze a new conference/venue

  1. prep Papers.db (once for each new version of MAG data): first run prune_papers.ipynb,

then import the result into sqlite:

sqlite> create table paper_pruned(id TEXT, year INTEGER, venueid TEXT);                

sqlite> .separator ","                                                                   

sqlite> .import ./data_txt/Papers_pruned.txt paper_pruned  

or

sqlite3 Papers.db < paperdb.sql

note: 75M+ papers with unknown venues among 120M in all (Jan 2016); 73M+ papers with unknown venues among 126M in all (Apr 2016)
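The import above can also be scripted from Python with the standard `sqlite3` and `csv` modules. This is only a sketch, not code from this repo: the function names are hypothetical, it assumes Papers_pruned.txt is plain comma-separated `id,year,venueid` (matching the `.separator ","` setting), and it assumes an unknown venue shows up as an empty `venueid` field.

```python
import csv
import sqlite3

def import_pruned_papers(conn, rows):
    """Create paper_pruned (same schema as above) and insert (id, year, venueid) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS paper_pruned"
        "(id TEXT, year INTEGER, venueid TEXT)"
    )
    conn.executemany("INSERT INTO paper_pruned VALUES (?, ?, ?)", rows)
    conn.commit()

def count_unknown_venues(conn):
    """Return (papers with an empty venue id, total papers).

    ASSUMPTION: an unknown venue is stored as an empty string.
    """
    unknown, total = conn.execute(
        "SELECT SUM(venueid = ''), COUNT(*) FROM paper_pruned"
    ).fetchone()
    return unknown or 0, total

def import_pruned_file(db_path, txt_path):
    """Load a comma-separated pruned-papers file into the given sqlite database."""
    conn = sqlite3.connect(db_path)
    with open(txt_path, newline="") as f:
        import_pruned_papers(conn, csv.reader(f))
    return conn

# Usage, with the paths from the sqlite session above:
#   conn = import_pruned_file("Papers.db", "./data_txt/Papers_pruned.txt")
#   print(count_unknown_venues(conn))
```

`count_unknown_venues` is one way to re-derive the unknown-venue counts in the note for a new MAG release, under the empty-string assumption.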

  2. get its citing and cited records (~30 mins): python export_citations.py WSDM

prep-step [already done inside export_citations.py above]: get the subset of papers published at the venue:

xlx@braun:/data2/xlx/MicrosoftAcademicGraph$ grep WSDM data_txt/ConferenceSeries.txt
42C7B402 WSDM Web Search and Data Mining
xlx@braun:/data2/xlx/MicrosoftAcademicGraph$ grep 42C7B402 data_txt/Papers.txt > papers.WSDM.txt
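The two grep calls above could be written in Python roughly as follows. This is a sketch with hypothetical function names, not what export_citations.py actually does. It assumes both files are tab-separated and that ConferenceSeries.txt has the columns id, short name, full name. Unlike grep, it matches whole fields, so an ID that happens to appear inside a title is not picked up.

```python
def find_series_ids(lines, short_name):
    """Return the IDs of conference series whose short name equals short_name.

    ASSUMPTION: tab-separated lines of the form id, short name, full name.
    """
    ids = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1] == short_name:
            ids.append(fields[0])
    return ids

def filter_papers(lines, series_id):
    """Yield the paper lines that contain series_id as a whole tab-separated field."""
    for line in lines:
        if series_id in line.rstrip("\n").split("\t"):
            yield line

# Usage, with the file names from the shell session above:
#   with open("data_txt/ConferenceSeries.txt") as f:
#       (series_id,) = find_series_ids(f, "WSDM")
#   with open("data_txt/Papers.txt") as src, open("papers.WSDM.txt", "w") as dst:
#       dst.writelines(filter_papers(src, series_id))
```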

  3. do the necessary joins (can take a few hrs): python construct_citation_table.py MM
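To give an idea of what such a join looks like, here is a sketch against the `paper_pruned` table from step 1. It is not the code in construct_citation_table.py: the `refs(citing_id, cited_id)` table and the function name are hypothetical, and the real script may do its joins differently (e.g. in pandas).

```python
import sqlite3

def citations_to_venue(conn, venue_id):
    """Return one row per reference that points at a paper published in venue_id.

    Each row is (cited_id, cited_year, citing_id, citing_year, citing_venueid).
    ASSUMPTION: a table refs(citing_id, cited_id) exists next to paper_pruned.
    """
    query = """
        SELECT cited.id, cited.year, citing.id, citing.year, citing.venueid
        FROM refs
        JOIN paper_pruned AS cited  ON cited.id  = refs.cited_id
        JOIN paper_pruned AS citing ON citing.id = refs.citing_id
        WHERE cited.venueid = ?
        ORDER BY cited.id, citing.id
    """
    return conn.execute(query, (venue_id,)).fetchall()

# Usage, with a conference series ID looked up as in step 2:
#   conn = sqlite3.connect("Papers.db")
#   rows = citations_to_venue(conn, "42C7B402")
```

Swapping the roles of `cited` and `citing` in the WHERE clause gives the other direction, i.e. what the venue's papers themselves reference. On the full data, indexes on `paper_pruned(id)` and on the `refs` columns would likely matter a lot for run time.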