header info in input files #1

Closed
jcbarret opened this issue Jun 18, 2018 · 3 comments


@jcbarret

I'm looking at files at http://het.io/disease-genes/downloads/ and am wondering if there's a key to the headers of the different input files? For example, https://raw.githubusercontent.com/dhimmel/het.io-dag-data/d8028c8820322ae4ad7642998bccc3ee7318ff16/downloads/diseases.txt has columns HC-P, HC-S, LC-P, LC-S but I'm not sure what they are. Sorry if this is obvious somewhere, but I couldn't find it after some searching.

@dhimmel
Member

dhimmel commented Jun 18, 2018

The S6 Data caption from the associated PLOS Computational Biology paper is slightly more helpful:

An extended version of Table 3 including all diseases with at least one GWAS-Catalog-extracted association. The manual pathophysiology classification is included.

The caption for Table 3 is:

Diseases. Associations were predicted for 29 diseases with at least 10 positives. For these diseases, the number of high-confidence primary (HC-P), high-confidence secondary (HC-S), low-confidence primary (LC-P), and low-confidence secondary associations (LC-S) that were extracted from the GWAS Catalog is indicated.

Hopefully that answers your question regarding diseases.txt. See the Associations Method section for more about how disease-gene associations were extracted from the GWAS Catalog and what HC-P, HC-S, LC-P, and LC-S mean.
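
For anyone reading diseases.txt programmatically, the abbreviation key from the Table 3 caption can be written down as a small lookup. This is only a minimal sketch: the tab delimiter, the `describe_columns` helper, and the example header line are assumptions for illustration and are not taken from the file itself. Only the four abbreviations and their meanings come from the caption.

```python
import csv
import io

# Abbreviation key taken from the Table 3 caption quoted above.
COLUMN_KEY = {
    "HC-P": "high-confidence primary associations",
    "HC-S": "high-confidence secondary associations",
    "LC-P": "low-confidence primary associations",
    "LC-S": "low-confidence secondary associations",
}


def describe_columns(header_line, delimiter="\t"):
    """Pair each column name in a header line with its known meaning.

    Columns that the Table 3 caption does not cover are paired with None.
    The tab delimiter is an assumption about the file layout.
    """
    columns = next(csv.reader(io.StringIO(header_line), delimiter=delimiter))
    return [(column, COLUMN_KEY.get(column)) for column in columns]


# Hypothetical header line; the real diseases.txt may have other columns.
example_header = "disease\tHC-P\tHC-S\tLC-P\tLC-S"
for column, meaning in describe_columns(example_header):
    print(column, "->", meaning or "(not covered by the Table 3 caption)")
```

To use it on the real file, pass the first line of diseases.txt to `describe_columns` and check which columns come back as `None`; those are the ones the caption does not explain.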

Note that the files available at http://het.io/disease-genes/downloads/ are from our 2015 study to predict disease-associated genes. In general, most users will be interested in Hetionet v1.0, which is available at https://neo4j.het.io (down right now, will fix) and at https://github.com/dhimmel/hetionet. This hetnet is described in our 2017 eLife study called Project Rephetio. That project has much more detailed supplementary methods, since we discussed all code and data on Thinklab while performing the project. For example, see this discussion for how we processed the GWAS Catalog to get gene-disease associations in Project Rephetio. We used a method very similar to the one in the predecessor study that created the diseases.txt file mentioned above.

@dhimmel
Member

dhimmel commented Jun 18, 2018

More generally, @jcbarret correctly points out that the table columns are not very well documented for the files at http://het.io/disease-genes/downloads/. At this point, I don't have any immediate plans to fix this, but I encourage users to post GitHub issues with any questions. At some point in the future, I'd like to revamp the het.io website and may address some of these issues then.

@dhimmel
Member

dhimmel commented Aug 5, 2019

We're moving the downloads page for the disease-genes study to GitHub from https://het.io/disease-genes/downloads/.

The README (pinned version) now shows the first two rows of each table for convenience. While the columns are still not fully documented, I will close this for now. Happy to elaborate on column meanings as requested. As I note above, most users will probably be interested in the newer Hetionet data instead.

@dhimmel dhimmel closed this as completed Aug 5, 2019