# A Deeper Dive into 23andMe Data 
### Alijah O'Connor
<small>Based on the R tutorial posted <a href="https://dabblingwithdata.wordpress.com/2018/07/16/analysing-your-23andme-genetic-data-in-r-part-1-importing-your-genome-into-r/">here</a> on dabblingwithdata.wordpress.com.</small>

23andMe is a company that specializes in personalized genetic reports that provide information about ancestry, health, and personal traits. Human DNA is over 99% identical from person to person; however, there are small variations in each person's genome called single nucleotide polymorphisms (SNPs). Researchers conduct observational studies that look for correlations between particular SNPs and diseases. 23andMe builds on these SNP-disease correlations by genotyping thousands of SNPs in each client's genome and determining whether particular SNPs relevant to a disease are present in that client's genome.


After a client's analysis is completed, they are given access to a dashboard summarizing what their genes suggest (expressed as likelihoods). Most people who use 23andMe, however, do not realize that the 23andMe analysis is not necessarily complete. Many published reports on SNPs are not (yet) considered by the 23andMe team in their analyses due to a lack of general consensus. Luckily, with this Jupyter Notebook, you can look into your own SNP data and draw your own conclusions.
<br><br>
*Disclaimer: draw your own conclusions at your own risk -- biology is complicated, so don't be too rash in finalizing your beliefs*

<b>Requirements:</b>
- Python 3
    - pandas
- 23andMe raw data (text file)
- gwascat_data.csv in the current working directory

<h2>Start Analysis Here:</h2><br>
<b>Download your raw 23andMe data from the 23andMe website:</b> In the settings menu, there is an option to "Browse Our Data."  On that page, find the link to download your data. You'll have to request the data, and they will send you an email when it is ready for download. (Downloads as a text file) 

<i>Global Settings and Library Imports</i>

In [1]:
import pandas as pd
import csv

pd.set_option('display.max_columns', 45) # gwascat data originally has 45 columns
pd.set_option('display.max_colwidth', 257) # 257 is the length of longest string in the dataframe 

## 1. Update path variable to your 23andMe text file
Converts your 23andMe text file into a csv file called 23andMeData.csv

In [2]:
path = 'sample_genome.txt'
csv_file = "23andMeData.csv"

with open(path, "r") as txt_23andMe:
    # Skip the first 19 lines of the text file (just header information)
    for x in range(19):
        txt_23andMe.readline()
        
    in_txt = csv.reader(txt_23andMe, delimiter = '\t')

    # newline="" keeps the csv module from writing blank rows between records on Windows
    with open(csv_file, "w", newline="") as csv_23andMe:
        out_csv = csv.writer(csv_23andMe)
        out_csv.writerows(in_txt)
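
As a hedged alternative to counting header lines, the sketch below lets pandas skip every line that begins with `#`. It assumes that the preamble and the column-header line of the export all start with `#` (consistent with the `# rsid` column name used later in this notebook). `load_23andme` is a hypothetical helper name, and the inline sample only stands in for a real file.

```python
import io
import pandas as pd

def load_23andme(source):
    """Read a 23andMe raw-data export without hard-coding the header length."""
    return pd.read_csv(
        source,
        sep="\t",
        comment="#",   # skips lines starting with '#', including the '# rsid' header line
        header=None,
        # Column names are supplied by hand; '# rsid' is kept so later cells still work
        names=["# rsid", "chromosome", "position", "genotype"],
        dtype={"# rsid": str, "chromosome": str, "genotype": str},
    )

# Made-up sample standing in for a real export (assumption: tab-separated, '#' preamble)
sample = io.StringIO(
    "# This data file generated by 23andMe\n"
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs11260603\t1\t1079198\tCG\n"
    "rs2651899\t1\t3083712\tCC\n"
    "i5000001\tMT\t100\tA\n"
)
demo = load_23andme(sample)
print(demo)
```

Passing a file path instead of the `StringIO` object should work the same way, and it avoids the intermediate csv file if you do not need it.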

## 2. Data Processing (No user action required)
Read comments for more information about the operations

In [3]:
# Generate the 23andMe and Gwas DataFrames
data_23andme = pd.read_csv(csv_file, low_memory=False)
gwas_data = pd.read_csv('gwascat_data.csv')

In [4]:
# Drop Duplicate gwascat entries based on the Study, SNP, and Disease Trait fields
#   Don't allow the study to report more than one entry on a particular SNP with regard to a disease 
#   (in other words: if multiple experiments were done for the same SNP in regard to the same disease trait 
#                    in the same study, omit the duplicate findings)
before_drop = len(gwas_data)
gwas_data = gwas_data.drop('Unnamed: 0', axis=1)  # drop() returns a new DataFrame, so reassign it
gwas_data.drop_duplicates(subset=['STUDY', 'SNPS', 'DISEASE.TRAIT'], inplace=True)
after_drop = len(gwas_data)
print(before_drop - after_drop, "entries dropped.")

15228 entries dropped.
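
To illustrate what the de-duplication step does, here is a minimal sketch on made-up rows (the study names and frequencies are invented for illustration). With `subset` given, pandas compares only those columns and keeps the first row of each duplicate group by default.

```python
import pandas as pd

# Toy catalog: the first two rows share study, SNP, and trait
toy = pd.DataFrame({
    "STUDY": ["Study A", "Study A", "Study B"],
    "SNPS": ["rs1", "rs1", "rs1"],
    "DISEASE.TRAIT": ["Migraine", "Migraine", "Migraine"],
    "RISK.ALLELE.FREQUENCY": [0.43, 0.45, 0.40],
})

# Only the subset columns decide what counts as a duplicate
deduped = toy.drop_duplicates(subset=["STUDY", "SNPS", "DISEASE.TRAIT"])
print(len(toy) - len(deduped), "entries dropped.")
print(deduped)
```

Note that the second "Study A" row is discarded even though its reported frequency differs, which is the trade-off this step accepts.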


In [5]:
# Split 23andme genotypes into the two individual alleles
data_23andme['allele1']  = data_23andme['genotype'].apply(lambda x: x[0])
data_23andme['allele2']  = data_23andme['genotype'].apply(lambda x: x[1] if (len(x) == 2) else '-')
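
The same split can be sketched with the vectorized `.str` accessor instead of `apply`. This is only an illustration on a toy column; it assumes genotypes are one or two characters long and that uncalled positions appear as `--`.

```python
import pandas as pd

# Toy genotypes: two-letter calls, a single-letter call, and a no-call
genotypes = pd.DataFrame({"genotype": ["CG", "CC", "A", "--"]})

# .str[i] returns NaN when the string is too short, so fill the gap with '-'
genotypes["allele1"] = genotypes["genotype"].str[0]
genotypes["allele2"] = genotypes["genotype"].str[1].fillna("-")
print(genotypes)
```

Unlike the `lambda` version, this form also tolerates missing genotype values, which simply stay missing in `allele1`.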

In [6]:
# Extract the Risk Allele for Each Gwas SNP
gwas_data['risk_allele'] = gwas_data['STRONGEST.SNP.RISK.ALLELE'].apply(lambda x: x[-1])
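
Taking the last character works when every entry ends in a nucleotide. As a hedged sketch, the regular expression below keeps only a trailing `A`, `C`, `G`, or `T` and leaves anything else missing. It assumes entries look like `rs11260603-C`, and that unreported alleles may appear as `?` or be absent; the sample values are made up.

```python
import pandas as pd

# Toy values for the STRONGEST.SNP.RISK.ALLELE column (illustrative only)
strongest = pd.Series(["rs11260603-C", "rs2651899-C", "rs123456-?", "rs999"])

# Capture a single nucleotide after the final hyphen; non-matches become NaN
risk_allele = strongest.str.extract(r"-([ACGT])$", expand=False)
print(risk_allele)
```

Rows with a missing risk allele can never equal `allele1` or `allele2`, so they would count as zero risk alleles downstream rather than matching by accident.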

In [7]:
# Abridge the gwas dataset down to relevant columns
gwas_data_abbr = gwas_data[['SNPS', 'INITIAL.SAMPLE.SIZE', 'LINK', 'STUDY', 'DISEASE.TRAIT', 'risk_allele',
                            'RISK.ALLELE.FREQUENCY', 'MAPPED_TRAIT', 'REPORTED.GENE.S.',
                            'MAPPED_TRAIT_URI']]

In [8]:
# Join 23andme and gwas data on rsid
joined_gwas_23 = pd.merge(data_23andme, gwas_data_abbr, left_on='# rsid', right_on='SNPS', how='inner')
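
A small sketch of what the inner join produces, using toy frames: an rsid reported by several studies yields one row per study, and an rsid absent from the catalog is dropped. `rs0000001` is a made-up identifier used only to show the drop.

```python
import pandas as pd

# Toy genome: one SNP that is in the catalog, one that is not (hypothetical rsid)
genome = pd.DataFrame({
    "# rsid": ["rs616488", "rs0000001"],
    "genotype": ["AG", "TT"],
})

# Toy catalog: the same SNP reported for two traits
catalog = pd.DataFrame({
    "SNPS": ["rs616488", "rs616488"],
    "DISEASE.TRAIT": ["Breast cancer (estrogen-receptor negative)", "Breast cancer"],
    "risk_allele": ["A", "A"],
})

joined = pd.merge(genome, catalog, left_on="# rsid", right_on="SNPS", how="inner")
print(joined)
```

This is why the same rsid can appear on consecutive rows of the results below.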

In [9]:
# Add column for the number of risk alleles you contain
def count_risk_alleles(allele1, allele2, risk_allele):
    count = 0
    if allele1 == risk_allele:
        count += 1
        
    if allele2 == risk_allele:
        count += 1

    return count

joined_gwas_23['number_of_risk_alleles'] = joined_gwas_23.apply(lambda row: count_risk_alleles(row['allele1'],
                                                                                               row['allele2'],
                                                                                               row['risk_allele']),
                                                                axis = 1) # axis = 1 so entries are accessed correctly
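
For larger tables, the same count can be sketched without a row-wise `apply` by comparing whole columns at once. The toy frame below is illustrative; the column names match the ones created earlier in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "allele1": ["C", "C", "A"],
    "allele2": ["G", "C", "G"],
    "risk_allele": ["C", "C", "T"],
})

# Each comparison yields a boolean column; casting to int and adding gives 0, 1, or 2
toy["number_of_risk_alleles"] = (
    toy["allele1"].eq(toy["risk_allele"]).astype(int)
    + toy["allele2"].eq(toy["risk_allele"]).astype(int)
)
print(toy)
```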

## 3. Analysis
<i>Note: Each of the analyses below has an option to save the results as a csv file. To do so, remove the '#' in front of the expression below the "Uncomment to save to csv" banner.</i>

### 3.1. Which rsids have at least 1 risk allele?

In [10]:
at_risk_SNPs = joined_gwas_23[joined_gwas_23.number_of_risk_alleles >= 1]

#-------------------------------------------------------------------------------------#
#-------------------- Uncomment to save to csv  --------------------------#
#-------------------------------------------------------------------------------------#
# at_risk_SNPs.to_csv("at_risk_SNPs.csv")

display(at_risk_SNPs)

Unnamed: 0,# rsid,chromosome,position,genotype,allele1,allele2,SNPS,INITIAL.SAMPLE.SIZE,LINK,STUDY,DISEASE.TRAIT,risk_allele,RISK.ALLELE.FREQUENCY,MAPPED_TRAIT,REPORTED.GENE.S.,MAPPED_TRAIT_URI,number_of_risk_alleles
0,rs11260603,1,1079198,CG,C,G,rs11260603,"2,247 European ancestry individuals",www.ncbi.nlm.nih.gov/pubmed/23382691,Loci associated with N-glycosylation of human immunoglobulin G show pleiotropy with autoimmune diseases and haematological cancers.,IgG glycosylation,C,0.217058,serum IgG glycosylation measurement,NR,http://www.ebi.ac.uk/efo/EFO_0005193,1
5,rs12045693,1,2205581,CC,C,C,rs12045693,"2,247 European ancestry individuals",www.ncbi.nlm.nih.gov/pubmed/23382691,Loci associated with N-glycosylation of human immunoglobulin G show pleiotropy with autoimmune diseases and haematological cancers.,IgG glycosylation,C,0.573033,serum IgG glycosylation measurement,NR,http://www.ebi.ac.uk/efo/EFO_0005193,2
12,rs3748816,1,2526746,AG,A,G,rs3748816,"2,871 European ancestry cases, 12,019 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/27992413,Genome-wide association study of primary sclerosing cholangitis identifies new risk loci and quantifies the genetic relationship with inflammatory bowel disease.,Primary sclerosing cholangitis,A,0.660000,sclerosing cholangitis,MMEL1,http://www.ebi.ac.uk/efo/EFO_0004268,1
14,rs4648356,1,2709164,AC,A,C,rs4648356,"9,772 European ancestry cases, 16,849 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/21833088,Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis.,Multiple sclerosis,C,,multiple sclerosis,MMEL1,http://www.ebi.ac.uk/efo/EFO_0003885,1
15,rs2651899,1,3083712,CC,C,C,rs2651899,"5,122 European ancestry cases, 18,108 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/21666692,Genome-wide association study reveals three susceptibility loci for common migraine in the general population.,Migraine,C,0.430000,migraine disorder,PRDM16,http://www.ebi.ac.uk/efo/EFO_0003821,2
17,rs111756326,1,3203275,GA,G,A,rs111756326,"5,439 European ancestry current and former smoker cases, 821 African American current and former smoker cases",www.ncbi.nlm.nih.gov/pubmed/26634245,A genome-wide association study identifies risk loci for spirometric measures among smokers of European and African ancestry.,Post bronchodilator FEV1/FVC ratio in COPD,A,0.960000,"response to bronchodilator, chronic obstructive pulmonary disease, FEV/FEC ratio",PRDM16,"http://purl.obolibrary.org/obo/GO_0097366, http://www.ebi.ac.uk/efo/EFO_0000341, http://www.ebi.ac.uk/efo/EFO_0004713",1
18,rs2483280,1,3255539,AG,A,G,rs2483280,"6,085 Korean ancestry individuals",www.ncbi.nlm.nih.gov/pubmed/25035420,Identification of three novel genetic variations associated with electrocardiographic traits (QRS duration and PR interval) in East Asians.,QRS duration,A,0.260000,QRS duration,PRDM16,http://www.ebi.ac.uk/efo/EFO_0005055,1
38,rs846111,1,6279370,CC,C,C,rs846111,"13,685 European ancestry individuals",www.ncbi.nlm.nih.gov/pubmed/19305408,Common variants at ten loci influence QT interval duration in the QTGEN Study.,QT interval,C,0.280000,QT interval,"NPHP4, CHDS, ACOT7, PLEKHG5, KLH21, RNF207",http://www.ebi.ac.uk/efo/EFO_0004682,2
39,rs9970334,1,6296238,GG,G,G,rs9970334,"19,920 British ancestry individuals from 6863 families.",www.ncbi.nlm.nih.gov/pubmed/28270201,"Exploration of haplotype research consortium imputation for genome-wide association studies in 20,032 Generation Scotland participants.",Resting heart rate,G,0.447393,resting heart rate,ICMT,http://www.ebi.ac.uk/efo/EFO_0004351,2
44,rs227163,1,7961206,CG,C,G,rs227163,"up to 14,361 European ancestry cases, up to 42,923 European ancestry controls, up to 4,873 East Asian ancestry cases, up to 17,642 East Asian ancestry controls",www.ncbi.nlm.nih.gov/pubmed/24390342,Genetics of rheumatoid arthritis contributes to biology and drug discovery.,Rheumatoid arthritis,C,0.420000,rheumatoid arthritis,TNFRSF9,http://www.ebi.ac.uk/efo/EFO_0000685,1


### 3.2. Which rsids have at least 1 risk allele AND are relatively rare (<= 0.2 frequency)?
Sorted from rarest to most common alleles by reported frequency
 - Doesn't consider SNPs without reported frequencies, frequencies=0, frequencies=NaN, or frequencies=1
 
Feel free to change the "rarity" of the frequency by changing the threshold_frequency variable
 - The lower the threshold_frequency, the more rare the allele is

In [11]:
threshold_frequency = 0.2

rare_at_risk_SNPs = (joined_gwas_23[(joined_gwas_23.number_of_risk_alleles >= 1) &
                           (joined_gwas_23['RISK.ALLELE.FREQUENCY'] <= threshold_frequency) & 
                           (joined_gwas_23['RISK.ALLELE.FREQUENCY'] > 0)]
                        .sort_values('RISK.ALLELE.FREQUENCY'))

#-------------------------------------------------------------------------------------#
#-------------------- Uncomment to save to csv  --------------------------#
#-------------------------------------------------------------------------------------#
# rare_at_risk_SNPs.to_csv("rare_at_risk_SNPs.csv")

display(rare_at_risk_SNPs)

Unnamed: 0,# rsid,chromosome,position,genotype,allele1,allele2,SNPS,INITIAL.SAMPLE.SIZE,LINK,STUDY,DISEASE.TRAIT,risk_allele,RISK.ALLELE.FREQUENCY,MAPPED_TRAIT,REPORTED.GENE.S.,MAPPED_TRAIT_URI,number_of_risk_alleles
16718,rs487836,15,58033859,GA,G,A,rs487836,"74 European ancestry cases, 824 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/25918132,Genome-Wide Association Study Identifies Novel Loci Associated With Diisocyanate-Induced Occupational Asthma.,Diisocyanate-induced asthma,G,0.001000,"response to diisocyanate, asthma","GRINL1A, ALDH1A2","http://www.ebi.ac.uk/efo/EFO_0006995, http://www.ebi.ac.uk/efo/EFO_0000270",1
15539,rs76582905,12,130466631,GA,G,A,rs76582905,"74 European ancestry cases, 824 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/25918132,Genome-Wide Association Study Identifies Novel Loci Associated With Diisocyanate-Induced Occupational Asthma.,Diisocyanate-induced asthma,G,0.001000,"response to diisocyanate, asthma",intergenic,"http://www.ebi.ac.uk/efo/EFO_0006995, http://www.ebi.ac.uk/efo/EFO_0000270",1
9388,rs144787122,7,2296552,GA,G,A,rs144787122,"215,551 European ancestry individuals, 57,332 African American individuals, 24,743 Hispanic individuals",www.ncbi.nlm.nih.gov/pubmed/30275531,"Genetics of blood lipids among ~300,000 multi-ethnic participants of the Million Veteran Program.",LDL cholesterol,A,0.003500,low density lipoprotein cholesterol measurement,SNX8,http://www.ebi.ac.uk/efo/EFO_0004611,1
1078,rs13373934,1,155448415,GA,G,A,rs13373934,"10,935 Hispanic individuals",www.ncbi.nlm.nih.gov/pubmed/27601451,Chronic Periodontitis Genome-wide Association Study in the Hispanic Community Health Study / Study of Latinos.,Chronic periodontitis (mean interproximal clinical attachment level),G,0.005000,periodontal measurement,ASH1L,http://www.ebi.ac.uk/efo/EFO_0007780,1
21007,rs17879961,22,29121087,GA,G,A,rs17879961,"76,192 European ancestry cases, 63,082 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/29059683,Association analysis identifies 65 new breast cancer risk loci.,Breast cancer,G,0.005000,breast carcinoma,CHEK2,http://www.ebi.ac.uk/efo/EFO_0000305,1
1617,rs116493700,1,211738467,GA,G,A,rs116493700,"74 European ancestry cases, 824 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/25918132,Genome-Wide Association Study Identifies Novel Loci Associated With Diisocyanate-Induced Occupational Asthma.,Diisocyanate-induced asthma,G,0.005000,"response to diisocyanate, asthma",intergenic,"http://www.ebi.ac.uk/efo/EFO_0006995, http://www.ebi.ac.uk/efo/EFO_0000270",1
20305,rs116940963,20,10271367,GA,G,A,rs116940963,"74 European ancestry cases, 824 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/25918132,Genome-Wide Association Study Identifies Novel Loci Associated With Diisocyanate-Induced Occupational Asthma.,Diisocyanate-induced asthma,G,0.007000,"response to diisocyanate, asthma",intergenic,"http://www.ebi.ac.uk/efo/EFO_0006995, http://www.ebi.ac.uk/efo/EFO_0000270",1
17927,rs117426660,16,74025074,GA,G,A,rs117426660,"2,187 European ancestry individuals",www.ncbi.nlm.nih.gov/pubmed/28441456,Genome-wide association study of facial morphology reveals novel associations with FREM1 and PARK2.,Facial morphology (factor 18),G,0.007336,facial morphology measurement,NR,http://www.ebi.ac.uk/efo/EFO_0007841,1
12590,rs10466033,10,76703006,GA,G,A,rs10466033,"16,175 European ancestry female individuals",www.ncbi.nlm.nih.gov/pubmed/22747683,Genetic variants associated with breast size also influence breast cancer risk.,Breast size,G,0.008000,breast size,"COMTD1, SPA17P1",http://www.ebi.ac.uk/efo/EFO_0004884,1
19408,rs117064827,19,10334725,AG,A,G,rs117064827,"170,763 European ancestry individuals",www.ncbi.nlm.nih.gov/pubmed/27863252,The Allelic Landscape of Human Blood Cell Trait Variation and Links to Common Complex Disease.,High light scatter reticulocyte percentage of red cells,G,0.008500,reticulocyte count,S1PR2,http://www.ebi.ac.uk/efo/EFO_0007986,1


### 3.3. Searching for Specific Disease Traits
The default search term is 'cancer'. To change it, set the search_term variable to whatever string you are interested in searching for. The match is case-insensitive.

In [12]:
search_term = 'cancer'
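# na=False treats rows with a missing disease trait as non-matches instead of raising an error
custom_SNP_search = at_risk_SNPs[at_risk_SNPs['DISEASE.TRAIT'].str.contains(search_term, case=False, na=False)]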

#-------------------------------------------------------------------------------------#
#-------------------- Uncomment to save to csv  --------------------------#
#-------------------------------------------------------------------------------------#
# custom_SNP_search.to_csv("custom_SNP_search.csv")

display(custom_SNP_search)

Unnamed: 0,# rsid,chromosome,position,genotype,allele1,allele2,SNPS,INITIAL.SAMPLE.SIZE,LINK,STUDY,DISEASE.TRAIT,risk_allele,RISK.ALLELE.FREQUENCY,MAPPED_TRAIT,REPORTED.GENE.S.,MAPPED_TRAIT_URI,number_of_risk_alleles
78,rs636291,1,10556097,GA,G,A,rs636291,"34,379 European ancestry cases, 33,164 European ancestry controls, 5,327 African ancestry cases, 5,136 African ancestry controls, 2,563 Japanese ancestry cases, 4,391 Japanese ancestry controls, 1,034 Latino cases, 1,046 Latino controls",www.ncbi.nlm.nih.gov/pubmed/25217961,"A meta-analysis of 87,040 individuals identifies 23 new susceptibility loci for prostate cancer.",Prostate cancer,A,0.1600,prostate carcinoma,PEX14,http://www.ebi.ac.uk/efo/EFO_0001663,1
79,rs616488,1,10566215,AG,A,G,rs616488,"14,135 European ancestry cases, 58,126 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/29058716,Identification of ten variants associated with risk of estrogen-receptor-negative breast cancer.,Breast cancer (estrogen-receptor negative),A,0.6700,estrogen-receptor negative breast cancer,NR,http://www.ebi.ac.uk/efo/EFO_1000650,1
80,rs616488,1,10566215,AG,A,G,rs616488,"76,192 European ancestry cases, 63,082 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/29059683,Association analysis identifies 65 new breast cancer risk loci.,Breast cancer,A,0.6700,breast carcinoma,PEX14,http://www.ebi.ac.uk/efo/EFO_0000305,1
154,rs2992756,1,18807339,CG,C,G,rs2992756,"76,192 European ancestry cases, 63,082 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/29059683,Association analysis identifies 65 new breast cancer risk loci.,Breast cancer,C,0.4900,breast carcinoma,KLHDC7A,http://www.ebi.ac.uk/efo/EFO_0000305,1
365,rs1707302,1,46600917,AG,A,G,rs1707302,"76,192 European ancestry cases, 63,082 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/29059683,Association analysis identifies 65 new breast cancer risk loci.,Breast cancer,A,0.6600,breast carcinoma,"PIK3R3, LOC101929626",http://www.ebi.ac.uk/efo/EFO_0000305,1
689,rs17426269,1,88156923,GG,G,G,rs17426269,"76,192 European ancestry cases, 63,082 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/29059683,Association analysis identifies 65 new breast cancer risk loci.,Breast cancer,G,0.1500,breast carcinoma,intergenic,http://www.ebi.ac.uk/efo/EFO_0000305,2
878,rs1230666,1,114173410,GG,G,G,rs1230666,"76,192 European ancestry cases, 63,082 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/29059683,Association analysis identifies 65 new breast cancer risk loci.,Breast cancer,G,0.8721,breast carcinoma,NR,http://www.ebi.ac.uk/efo/EFO_0000305,2
960,rs11249433,1,121280613,GA,G,A,rs11249433,"76,192 European ancestry cases, 63,082 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/29059683,Association analysis identifies 65 new breast cancer risk loci.,Breast cancer,G,0.4100,breast carcinoma,EMBP1,http://www.ebi.ac.uk/efo/EFO_0000305,1
972,rs594206,1,147020456,GA,G,A,rs594206,"64 Japanese ancestry cases, 27 Japanese ancestry controls",www.ncbi.nlm.nih.gov/pubmed/24025145,A genome-wide association study of chemotherapy-induced alopecia in breast cancer patients.,Adverse response to chemotherapy in breast cancer (alopecia) (cyclophosphamide+doxorubicin+/-5FU),A,0.7800,"response to cyclophosphamide, chemotherapy-induced alopecia, response to 5' fluorouracil, response to doxorubicin",BCL9,"http://purl.obolibrary.org/obo/GO_1902518, http://www.ebi.ac.uk/efo/EFO_0005400, http://purl.obolibrary.org/obo/GO_0036275, http://purl.obolibrary.org/obo/GO_1902520",1
975,rs11205303,1,149906413,CG,C,G,rs11205303,"76,192 European ancestry cases, 63,082 European ancestry controls",www.ncbi.nlm.nih.gov/pubmed/29059683,Association analysis identifies 65 new breast cancer risk loci.,Breast cancer,C,0.3998,breast carcinoma,NR,http://www.ebi.ac.uk/efo/EFO_0000305,1
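
To search for several terms at once, one option is to join them into a single escaped pattern. This is a minimal sketch on toy data; `search_traits` is a hypothetical helper, and `re.escape` is used so that characters such as parentheses in a term are matched literally.

```python
import re
import pandas as pd

def search_traits(df, terms):
    """Rows whose DISEASE.TRAIT contains any of the given terms (case-insensitive)."""
    pattern = "|".join(re.escape(term) for term in terms)
    # na=False treats missing traits as non-matches
    return df[df["DISEASE.TRAIT"].str.contains(pattern, case=False, na=False, regex=True)]

toy = pd.DataFrame({
    "DISEASE.TRAIT": ["Breast cancer", "Migraine", None, "Prostate cancer", "QT interval"],
})
hits = search_traits(toy, ["cancer", "migraine"])
print(hits)
```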
