Calculating Gene Citation Counts using NCBI's databases

This is no longer maintained, try at your own risk. See this comment for a link to similar data at the NCBI

To get an idea of how citation counts have changed over the years, we need a list of which citations were published when. This script queries Entrez gene for citation lists for each day.

python scripts/pmids_by_date.py --startdate 1990/01/01 --enddate 2017/11/05;

Consolidate the publications from each day into one complete list

For easier processing, we'll collapse the per-day list of files into one file

rm ~/data/genbank-data/hg19/recent_pmid_year.ssv; \
for file in $(find data/pmid_by_date -name "*.ssv"); \
    do cat $file | awk '{split($1,a,"-"); print a[1], $2}' >> ~/data/genbank-data/hg19/recent_pmid_year.ssv; \
done;

Get NCBI's genes to publications table

The NCBI maintains a list of citations that mention each gene in its database. We'll download it.

wget -N -P ~/data/genbank-data/ ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz; \
gunzip ~/data/genbank-data/gene2pubmed.gz; \
mkdir ~/data/genbank-data/hg19/;

And then filter out the human genes:

cat ~/data/genbank-data/gene2pubmed | awk '{if ($1 == 9606) print;}' > ~/data/genbank-data/hg19/gene2pubmed
    cat ~/data/genbank-data/gene2pubmed  | python scripts/genes_by_species.py  - | sort -nk 2 > results/genes_by_species.tsv; \
    cp results/genes_by_species.tsv ~/Dropbox/tmp/graphic-idea-3.tsv

We'll get a list of chromosome sizes from the UCSC genome browser for plotting purposes:

wget -N -P ~/data/ucsc-data/hg19/ http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/chromInfo.txt.gz

We'll also grab a list of genomic positions from UCSC to plot the locations of the genes.

wget -N -P ~/data/ucsc-data/hg19/ http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/refGene.txt.gz

We need a mapping of gene IDs, which are just numbers, to more meaningful names and descriptions.

wget -N -P ~/data/genbank-data/ ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz; \
gunzip ~/data/genbank-data/gene_info.gz; \
mkdir ~/data/genbank-data/hg19/;

Filter for human data for faster downstream processing.

cat ~/data/genbank-data/gene_info | awk '{if ($1 == 9606) print;}' > ~/data/genbank-data/hg19/gene_info

Get gene to refseq information so that we can use it to get transcript level information. NCBI's gene database annotates only genes. Actual transcript information is stored in RefSeq.

wget -N -P ~/data/genbank-data/ ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz; \
mkdir ~/data/genbank-data/hg19/; \

Again, filter for human data.

zcat gene2refseq.gz | awk '{if ($1 == 9606) print;}' > ~/data/genbank-data/hg19/gene2refseq

All of the aggregations which count the total citations per gene as well as the citations per per gene per year are calculated using the summarize_gene_citations.py script. It could be done using command line scripts but I started with python so I'll stick with it.

# aggregate by citation counts and by year
python scripts/summarize_gene_citations.py; \
python scripts/summarize_gene_citations_all_species.py; \
cp gene_info_by_year.tsv ~/Dropbox/tmp/graphic-idea-2.tsv; \

To get a ranking of human transcripts by citation count, I used the pipeline we created for higlass's gene anntations: https://hms-dbmi.github.io/higlass-docs/gene_annotations.html#creating-the-track This is a good file to cross-reference with the the ones above to check that the values are correct.

    cp ~/data/genbank-data/hg19/geneAnnotationsExonUnions.bed ~/Dropbox/tmp/graphic-idea-1.tsv \

        awk '{if ($4 == "TP53") print}' gene_info_by_year.tsv | sort -nk 7

To calculate how many of the papers referencing human genes reference the top 100 genes we first need to obtain the top 100 genes (transcripts, in this case).

sort -nk 5  ~/data/genbank-data/hg19/geneAnnotationsExonUnions.bed | tail -n 100 | awk '{print $8}' > ~/data/genbank-data/hg19/top100_genes

Calculate the total number of papers which cite this set of genes.

python scripts/pubs_for_genes.py ~/data/genbank-data/gene2pubmed ~/data/genbank-data/hg19/top100_genes

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
notebooks		notebooks
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README		README
README.md		README.md
all_gene_counts.tsv		all_gene_counts.tsv
gene-popularity.ipynb		gene-popularity.ipynb
streamgraph.py		streamgraph.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Calculating Gene Citation Counts using NCBI's databases

Get NCBI's genes to publications table

About

Releases 3

Packages

Languages

License

pkerpedjiev/gene-citation-counts

Folders and files

Latest commit

History

Repository files navigation

Calculating Gene Citation Counts using NCBI's databases

Get NCBI's genes to publications table

About

Resources

License

Stars

Watchers

Forks

Releases 3

Packages 0

Languages

Packages