This repo contains my scripts to:
-
replicate the analysis in the article The most popular genes in the human genome
-
repeat the analysis for prokaryotes (focussing on bacteria for starters) and see what I get.
Why ?
I am a python newbie, but not new to programming per se. I am good with R, Perl and have an undiagnosed OCD of trying to write efficient shell one-liners ;-)
R was a self-taught effort and I came to realize pretty early on that when it comes to programming,
reading basics only takes you so far ... pick up a project and learn along the way !
And prokaryotes because, lets face it ... its a microbial world !
Fork the original repo for scripts and instructions
Checkout my gist for Conda installation instructions and commands
python scripts/pmids_by_date.py --startdate 1990/01/01 --enddate 2017/11/05
FileNotFoundError: [Errno 2] No such file or directory: './data/pmid_by_date/1990_01_02.ssv'
At first glance, one obvious mistake appeared to be that I was not specifying the --output-dir
parameter for the script. However that didn't solve the issue. The error just changed to now include the outdir path I provided . So the issue was not that I didn't give that option (since from the code it is obvious that it should default to ./data/pmid_by_date
).
⚡ Turns out that python can't just create paths willy nilly unless you specifically ask it to do so ! Simple Googling and ol faithful Stackoverflow helped solve the issue.
Code edits (changed commited and pushed on github),
import os
os.makedirs(os.path.dirname(filename), exist_ok=True)
Lets try again
python gene-citation-counts/scripts/pmids_by_date.py --startdate 1990/01/01 --enddate 2017/11/05
🎉 it worked !!
The script takes range of dates as input (defaults to 2014/01/01 if none provided), and then uses NCBI's eutils to download a list of PubmedID's for those dates (max records downloaded 100000). It will create a ssv
format file (with date as filename) that contains date and Pubmed article ID separated by space.
❓ A quick advanced pubmed search (webpage) revealed that the number of publications on it for the given date range is 18851774. Can we check if we got all records ?
🔖 Add help menu to the script !
Consolidate the publications from each day into one complete list
Editing the provided command to fit my needs
mkdir -p /data/consolidated
for FILE in $(find data/pmid_by_date -name "*.ssv"); do cat $FILE | awk '{split($1,a,"-"); print a[1], $2}' >> data/consolidated/recent_pmid_year.ssv; done;
The above command will generate a consolidated file that contains the Year of publication and the PubmedID
mkdir -p ./data/genbank-data
wget -N -P ./data/genbank-data/ ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz
gunzip ./data/genbank-data/gene2pubmed.gz
Most (if not all) FTP sites will have a README file explaining the directory/file structure and content. The gene2pubmed readme tell me that it is updated daily and it contains 3 tab separated columns
- tax_id
- Gene_ID
- Pubmed_ID
This is an interesting file to explore. Lets see some quick facts about it !
There are total 10695821 lines in this file (not counting the column header). But does this means that there are 10695821 publications ? A quick head
or tail
of the file will let you know thats not the case. A single publication worked on multiple genes from each species, the same species (and/or genes) were studies by multiple publications... so on and so forth.
How many unique species are listed in this document ?
awk '{print $1}' gene2pubmed| sort | uniq | wc -l
13951 (subtract 1 for column header)
Doing the same for genes and pubmedID (only changing the column in awk
) we also see that there are 6002839
unique genes and 1124097
pubmed articles in total.
Questions that come to mind:
- What the the genes that were most studied ? i.e. most publications associated with it. We know from the original publication that
TP53
is the most studied gene. But what about in bacteria ? Similarly ... - Most studied species ? Can we narrow this down to different taxa levels ?
- If we move to the consolidated file (by year), do the results above hold true between the years we restricted to ? Any trend associated with year?
- On the flip side what is the least studied species? Can we get proof of concept using NCBI's taxidToGeneNames.pl script ?
- Within species, what genes are most studied per taxa ?
- Is there a correlation between the most studied gene and the most studied taxa ?
I guess the questions will never end. Lets begin ...
According to the original instructions, all genes that are assigned to Human (tax_id 9606) were selected for further analysis. This is where my protocol will branch out from the original one. I am not really that interested in Human genes. Its the bacteria that fascinate me ! However, Bacteria is a superkingdom (tax_id 2); what this list contains is tax_id species. So now how do we deal with this ?
NCBI has the following taxids for the 5 different superkingdoms.
Tax_id | Parent tax_id | Name |
---|---|---|
2 | 131567 | Bacteria |
2157 | 131567 | Archaea |
2759 | 131567 | Eukaryote |
10239 | 1 | Viruses |
12884 | 1 | Viriods |
We need to determine the superkindgdom (and classification) of individual taxid's in the gene2pubmed file to identify and separate bacteria (possibly move on to other prokaryotes later). NCBI taxonomy is a fine resource when you have few taxa to search for. But for our purposes, this is not feasible !
I had faced this issue before and had created a handy R script to create a database using NCBI's taxonomy files ! I could try to re-write the script in python, but chose not to waste time on it **. The scripts and relevant instructions to create the database are here
** There are multiple perfectly good modules/packages that can give you NCBI taxonomy information from taxid (for e.g. ETE toolkit is an excellent package). But I wrote the script to create the database at a time when (a) there were not many packages available; (b) I was learning R and thought learning both R and sqlite won't be a bad practice; (c) having your own scripts gives you the flexibility to customize as per necessity.
R has markdown (Rmd) and python has Jupyter(lab) notebooks ! I will use Jupyter lab for my analysis.
With the database created, all we have to do is query the database and begin !
Jupyter lab notebooks
-
Reproducing results from original paper - AllTaxa.ipynb
-
Repeating analysis for bacteria - Bacteria.ipynb