diff --git a/outline.txt b/outline.txt
index 16bb615..3834fc8 100644
--- a/outline.txt
+++ b/outline.txt
@@ -68,7 +68,7 @@ First focus on biological issues:
 Polymorphisms to phenotypes, correlation only. done
 Bias in the samples collected by openSNP? done
 Sample size to small? done
-Quality check for phenotypes?
+Quality check for phenotypes? done
 Mapping to haplotype blocks?
 Combination with the hapmap
 Manage increasing amount of data?
@@ -77,7 +77,7 @@ Manage increasing amount of data?
 from great power follows great responsibility!
 Imagine a world where all data is free...
-What privacy issues do arise?
+What privacy issues do arise? done
 Is there an effective mechanism to avoid abuse or
-genetic discrimination?
-How should governments treat the additional data?
\ No newline at end of file
+genetic discrimination? done
+How should governments treat the additional data? done
\ No newline at end of file
diff --git a/paper_draft.tex b/paper_draft.tex
index 7e0d7fd..c50d3b5 100644
--- a/paper_draft.tex
+++ b/paper_draft.tex
@@ -98,13 +98,13 @@ \section*{Author Summary}
 \section*{Introduction}
-Genome Wide Association Studies (GWAS) are a comparatively easy and cheap way to find Single Nucleotide Polymorphisms (SNPs) which can be interesting because of their medical relevance. SNPs found through GWAS can be used to find candidate genes for a closer inspection or to predict disease risks. Genome Wide Association Studies make use of statistics to compare the alleles of patients to the alleles of healthy controls. By this the method does not allow to find causal differences but mere correlations. The first GWAS was published in 2005 and compared age-related macular degeneration in contrast to a healthy control group \cite{Klein2005}. Since the beginning the number of participants in those studies is rising and over 1200 GWAS have been performed \cite{Johnson2009} and over 5000 SNPs have been linked to different diseases and traits in those studies \cite{Hindorff2009}.
%(http://www.genome.gov/page.cfm?pageid=26525384&clearquery=1#result_table)
+Genome Wide Association Studies (GWAS) are a comparatively easy and cheap way to find Single Nucleotide Polymorphisms (SNPs) which are of interest because of their medical relevance. Such SNPs of interest can be used to find candidate genes for a closer inspection or to predict disease risks or other traits. Genome Wide Association Studies make use of statistics to compare the alleles of patients to the alleles of healthy controls. By this the method does not allow to find causal differences but mere correlations. The first GWAS was published in 2005 and compared age-related macular degeneration in contrast to a healthy control group \cite{Klein2005}. Since the beginning the number of participants in those studies is rising and over 1200 GWAS have been performed \cite{Johnson2009} and over 5000 SNPs have been linked to different diseases and traits in those studies \cite{Hindorff2009}. %(http://www.genome.gov/page.cfm?pageid=26525384&clearquery=1#result_table)
-Since 2006 companies like 23andMe, deCODEme or FamilyTreeDNA offer Direct-To-Consumer (DTC) genetic testing. Those companies use DNA micro arrays to screen for around 1 million SNPs spread over the human genome. In return customers get an analysis of the results, as well as a raw file that includes the SNP-IDs and their respective allele for the customer. In 2011 23andMe alone had over 100.000 customers\footnote{http://spittoon.23andme.com/2011/06/15/23andme-2011-state-of-the-database-address/} - the company recognizes the potential to perform GWAS with that amount of data by using surveys to ask their customers about traits and diseases. With the consent of the customer those data is used for association studies. 23andMe published several articles in which they replicate known findings but also find new associations for Parkinson's Disease \cite{Eriksson2010, Do2011}.
Over 30,000 23andme-customers participated in those association studies.
+Since 2006 companies like 23andMe, deCODEme or FamilyTreeDNA offer Direct-To-Consumer (DTC) genetic testing. Those companies use DNA micro arrays to screen for around 1 million SNPs spread over the human genome. In return customers get an analysis of the results, as well as a raw file that includes the individual genotypes of the customer. In 2011 23andMe alone had over 100.000 customers\footnote{http://spittoon.23andme.com/2011/06/15/23andme-2011-state-of-the-database-address/} - the company recognizes the potential to perform GWAS with that amount of data by using surveys to ask their customers about traits and diseases. With the consent of the customer those data is used for association studies. 23andMe published several articles in which they replicate known findings but also find new associations for Parkinson's Disease \cite{Eriksson2010, Do2011}. Over 30,000 23andme-customers participated in those association studies.
-Although companies like 23andMe are willing to contribute to science it is not easy for individual scientists to access the data. This is mainly due to privacy concerns of the customers. Nevertheless there are individual customers who are willingly sharing their data. Most do so by uploading their data to their personal website or to open software repositories like \textit{GitHub}. While this is makes it possible for scientists to access the data, it requires a lot of work to keep track of all new genotyping data that is available to the public. While projects like the SNPedia try to keep track of all the files \cite{Cariaso2011}, this still does not allow to perform GWAS, as the phenotypic information is not attached to the genetic information.
Projects that attach the phenotype to the genetic information, like the Personal Genome Project, still don't allow for an easy re-use of the data, as they lack an advanced programming interface (API) or other methods by which researchers could download the data.
+Although companies like 23andMe are willing to contribute to science it is not easy for individual scientists to access the data. This is mainly due to concerns about privacy, liability and consent. Nevertheless there are individual customers who are willingly sharing their data. Most do so by uploading their data to their personal website or to open software repositories like \textit{GitHub}. While this is makes it possible for scientists to access the data, it requires a lot of work to keep track of all new genotyping data that is available to the public. While projects like the SNPedia try to keep track of all the files \cite{Cariaso2011}, this still does not allow to perform GWAS, as the phenotypic information is not attached to the genetic information. Projects that attach the phenotype to the genetic information, like the \textit{Personal Genome Project}, still don't allow for an easy re-use of the data, as they lack an advanced programming interface (API) or other methods by which researchers could download the data.
-A possible solution to this can be a community-driven platform that aggregates genetical and phenotypical information of people who are willing to share their data with the general public and have given their informed consent. We designed a survey to assess interest in such a crowd sourcing platform, in which we asked how many people would be willing to share their genetic and phenotypic information with the public. Additionally we built a platform which allows customers of DTC genetic testing to publicate of genetic and phenotypic information and gives researchers multiple ways to reuse the data.
+A possible solution to this can be a community-driven platform that aggregates genetic and phenotypic information of people who are willing to share their data with the general public and have given their informed consent. We designed a survey to assess interest in such a crowd sourcing platform, in which we asked how many people would be willing to share their genetic and phenotypic information with the public. Additionally we built a platform which allows customers of DTC genetic testing to publicate of genetic and phenotypic information and gives researchers multiple ways to reuse the data.
 % Results and Discussion can be combined.
 \section*{Results}
@@ -127,7 +127,7 @@ \subsection*{openSNP}
 The possible answers in terms of variations for a single phenotype are not limited and every user can add completely new phenotypes if the corresponding questions about this are lacking. To reduce the amount of manual data curation openSNP tries to avoid the entry of the same phenotype or variation, but with a slightly different spelling by helping users at entering data by an autocompletion-feature which lists similar entries which are already in the openSNP-database.
-On the side of getting access to the data users can download single genotyping files for specific users, get archives of multiple genotyping files grouped by phenotypic variation or can access a single download that includes all genotyping-files and all phenotypes in a comma separated table. Additionally users can access the genetic data through the Distributed Annotation System, which allows to get all data for specific chromosomes and specific positions on single chromosomes.
+On the side of getting access to the data users can download single genotyping files for specific users, get archives of multiple genotyping files grouped by phenotypic variation or can access a single download that includes all genotyping-files and all phenotypes in a comma separated table.
Additionally users can access the genetic data through the Distributed Annotation System, which allows to get all data for specific chromosomes and specific positions on single chromosomes.
 Users can discuss SNPs and phenotypes on the platform using a simple commenting-system or private messages. Between the start of openSNP on 09/27/2011 and 12/18/2011 214 people have signed up with openSNP, 79 of those have uploaded their genotyping files. Through this the openSNP database lists 69486471 SNPs which are distributed over 1938604 unique Rs-IDs. In the same timeframe all users combined have entered 675 variations which are distributed over 47 different phenotypes. See figure n for a distribution of data acquisition over time.
@@ -135,23 +135,27 @@ \subsection*{openSNP}
 A total number of 15229 documents relevant to the SNP-IDs which are listed in openSNP could be found in the databases of Mendeley, the Public Library of Science and SNPedia. Of the primary literature 25 \% are released in Open Access-journals and can be freely free of charge by every user (Figure n+2). For usability reasons SNPs are ranked by the amount of information gathered by the external services.
-The external services themselves are ranked by how easy users can access information out of these sources. The SNPedia entries are given the highest impact, as those are already manually curated, followed by open access publications out of the Public Library of Science. Lowest values are given to the Mendeley-results, as those aren't necessary freely available for every user. SNPedia is valued 2.5 times as high as a PLoS publication and 5 times as high as a Mendeley-entry.
+The external services themselves are ranked by how easy users can access information out of these sources. The SNPedia entries are given the highest impact, as those are already manually curated, followed by open access publications out of the Public Library of Science.
Lowest values are given to the Mendeley-results, as those aren't necessary freely available for every user. A entry on SNPedia is valued 2.5 times as high as a PLoS publication and 5 times as high as a Mendeley-entry.
 \section*{Discussion}
-Although prices for exome or even full genome sequencing are dropping rapidly, in comparison to this GWAS tend to stay cheaper and offer a possibility to get insights on genetic variations on a population level and allow an analysis of SNPs which are linked to different traits. Despite this benefit in terms of costs it must be pointed out that GWAS can only detect correlations of SNPs with those traits and don't allow the detection of the causes for this and needs a large enough sample which has to be statistically analysed by using sound methods.
+Although prices for exome or even full genome sequencing are dropping rapidly, in comparison to this GWAS tend to stay cheaper and offer a possibility to get insights on genetic variations on a population level and allow an analysis of SNPs which are linked to different traits. Despite this benefit in terms of costs it must be pointed out that GWAS can only detect correlations of SNPs with those traits and don't allow the detection of the causes for this and needs a large enough sample which has to be statistically analysed by using sound methods. Nevertheless, GWAS are still frequently used and new associations still can be found (we should list some recent results here -> oxytocin maybe? we could look at the twitter-timeline of @opensnporg, i publish papers i find there).
-One way to bring costs for GWAS further down is to make use of already available genotyping results and datasets. Data produced by DTC genetic testing companies is a promising source for such data, as such companies already have high numbers of customers which are willing to pay for those genotypings by themselves.
As we have found in the survey on the sharing such results many of these customers are also willing to share their results with scientists and the public to help scientific progress, although people who have taken DTC genetic testing are aware of the privacy implications that come with openly sharing those results.
+One way to bring costs for GWAS even further down is to make use of already available genotyping results and datasets. Data produced by DTC genetic testing companies is a promising source for such data, as such companies already have high numbers of customers which are willing to pay for those genotypings by themselves. As we have found in the survey on the sharing such results many of these customers are also willing to share their results with scientists and the public to help scientific progress, although people who have taken DTC genetic testing are aware of the privacy implications that come with openly sharing those results.
 With openSNP we've build a platform that can be used by customers of DTC genetic testing to share their data to easily share their genetic and phenotypic data with a wide audience as well as by scientists and interested citizens who are looking for datasets to be used in GWAS. Customers of DTC genetic testing also benefit of an easy access to primary literature on SNPs and genetic variation they carry. While we don't have collected enough data to perform full scale GWAS yet this might be possible in the future, as user numbers are rising. By crowdsourcing the acquisition of genetic and phenotypic data openSNP faces the same problems as any other open platform on the Internet, namely we have to trust users about the data they upload and enter on openSNP. Additionally the quality of the data varies, especially in terms of accuracy on the phenotypic variation.
While we try to suggest similar entries to users there are some cases where users wont follow those suggestions, so duplicates or similar phenotypes or varations in traits may arise.
-There are two possible solutions to this problem: We could only allow some (trusted) users to enter new phenotypes or we could make users enter all possible variations of a phenotype while creating a new phenotype, so that later users can't add variations that have not been available from the start on. Both methods have their own disadvantage: In either case it makes it harder for users to enter their data and by this highers the bar for participation, which ultimately could lead to less data entered. Facing this trade-off we decided to keep entering data easy, at the cost that users who want to perform GWAS with the data need to perform more quality control.
+There are two possible solutions to this problem: One could only allow some (trusted) users to enter new phenotypes or one could make users enter all possible variations of a phenotype while creating a new phenotype, so that later users can't add variations that have not been available from the start on. Both methods have a disadvantage: In either case it makes it harder for users to enter their data and by this highers the bar for participation, which ultimately could lead to less data entered. Facing this trade-off we decided to keep entering data easy, at the cost that users who want to perform GWAS with the data need to perform more quality control.
-Another thing that should kept in mind is a possible bias, as we can't rule out the possibility that only a biased subset of people buy DTC genetic testing and an even smaller subset is willing to publish the results, along with information of medical relevance.
+Another thing that should kept in mind is a possible bias in which data is available on openSNP, as we can't rule out the possibility that only a biased subset of people buy DTC genetic testing and an even smaller subset is willing to publish the results, along with information of medical relevance.
+The advent of DTC genetic testing has lead to new ethical and social issues. Much of the critique on DTC genetic testing focusses on the practice of delivering medical information without consulting a physician or genetic counselor to help patients/customers make sense of the information and help them to put the new knowledge to good use. Less attention has been given to the privacy implications that come with this cheap way of obtaining genetical information.
+By making the own genetic and medical information public it becomes clear that the issue of privacy needs to get a closer view. Our survey has shown that people are concerned about their privacy and are concerned that stakeholders like employers, insurance companies, governments or advertisers misuse this information. Policy makers start to react to those changes by introducing laws like the \textit{Genetic Information Non-Discrimination Act} in the United States or the \emph{Gendiagnostikgesetz} in Germany to minimize the impact widely available genetic information. DTC genetic testing companies on their side also try to educate their customers about the risks of releasing genetic data.
+We transparently address the problem of privacy implications that come with releasing the data twice, during registering for openSNP and during the upload of the DTC results. Users have to check that they have read and understood the disclaimer about possible side-effects that can arise by making their data public. To further improve this process and get informed consent we are working to implement the procedures which are currently developed by \textit{Consent for Research} (http://www.weconsent.us).
+
 % You may title this section "Methods" or "Models".
 % "Models" is not a valid title for PLoS ONE authors. However, PLoS ONE
 % authors may use "Analysis"
diff --git a/privacy_publications/0.pdf b/privacy_publications/0.pdf
new file mode 100644
index 0000000..9c0705e
Binary files /dev/null and b/privacy_publications/0.pdf differ
diff --git a/privacy_publications/15265160902893965.pdf b/privacy_publications/15265160902893965.pdf
new file mode 100644
index 0000000..3e76a2d
Binary files /dev/null and b/privacy_publications/15265160902893965.pdf differ
diff --git a/privacy_publications/annurev%2Egenom%2E9%2E081307%2E164319.pdf b/privacy_publications/annurev%2Egenom%2E9%2E081307%2E164319.pdf
new file mode 100644
index 0000000..595e5c9
Binary files /dev/null and b/privacy_publications/annurev%2Egenom%2E9%2E081307%2E164319.pdf differ
diff --git a/privacy_publications/annurev-med-062110-123753.pdf b/privacy_publications/annurev-med-062110-123753.pdf
new file mode 100644
index 0000000..ded1fa6
Binary files /dev/null and b/privacy_publications/annurev-med-062110-123753.pdf differ
diff --git a/privacy_publications/gm71.pdf b/privacy_publications/gm71.pdf
new file mode 100644
index 0000000..5563bc9
Binary files /dev/null and b/privacy_publications/gm71.pdf differ
diff --git a/privacy_publications/nrg3113.pdf b/privacy_publications/nrg3113.pdf
new file mode 100644
index 0000000..0127edc
Binary files /dev/null and b/privacy_publications/nrg3113.pdf differ