##**Do identical twins actually have perfectly identical DNA?**

---

Our names are Kiana and Diana, and we've always been told that we are identical twins. But how identical are identical twins? Even though we look pretty similar, we definitely have some key differences, such as the texture of our hair, our height, and the age at which we started puberty. We've been raised the same way our whole lives, so in the scheme of nature vs. nurture we would expect nurture to be less likely to create a difference. In theory, we should have identical DNA, but we decided to test whether this is actually the case using our raw data from 23andMe.

In [None]:
import pandas as pd

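# Load each twin's raw 23andMe export. Lines starting with "#" are header
# comments and are skipped; the four tab-separated fields are named explicitly.
columns = ["rsid", "chr", "pos", "genotype"]

# Read chromosome labels as strings so that 1-22, X, Y and MT share one dtype.
df_kiana = pd.read_csv('/Users/kiana/Downloads/genKiana.txt', delimiter='\t', on_bad_lines='warn', comment="#",
                       names=columns, dtype={"chr": str})
df_diana = pd.read_csv('/Users/kiana/Downloads/genDiana.txt', delimiter='\t', on_bad_lines='warn', comment="#",
                       names=columns, dtype={"chr": str})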

##**What data are we able to get from 23andMe?**

------
23andMe collected our saliva samples and provided us with raw genotype data at 638,547 SNPs (single nucleotide polymorphisms). These SNPs are chosen from a set of locations at which people are known to have differences in their DNA. A sample of the data is shown below. Each row corresponds to one SNP, and the genotype column tells us the pair of nucleotides we have at that position (one from each copy of the chromosome). A genotype of `--` means no call could be made at that SNP.

In [None]:
!head -n 100 /Users/kiana/Downloads/genKiana_sample.txt

# This data file generated by 23andMe at: Wed Apr 29 12:13:05 2020
#
# This file contains raw genotype data, including data that is not used in 23andMe reports.
# This data has undergone a general quality review however only a subset of markers have been 
# individually validated for accuracy. As such, this data is suitable only for research, 
# educational, and informational use and not for medical or other use.
# 
# Below is a text version of your data.  Fields are TAB-separated
# Each line corresponds to a single SNP.  For each SNP, we provide its identifier 
# (an rsid or an internal id), its location on the reference human genome, and the 
# genotype call oriented with respect to the plus strand on the human reference sequence.
# We are using reference human assembly build 37 (also known as Annotation Release 104).
# Note that it is possible that data downloaded at different times may be different due to ongoing 
# improvements in our ability to call genotypes. More information ab
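
To make the file layout concrete, here is a minimal sketch that parses a tiny in-memory excerpt the same way the real files are loaded. The three data rows reuse Kiana's values from the mismatch table later in this notebook. The commented column-header line and the `dtype` choice are assumptions for illustration, not something shown in the truncated header above.

```python
import io

import pandas as pd

# Hypothetical excerpt in the 23andMe layout: "#" comment lines followed by
# tab-separated rows of rsid, chromosome, position and genotype.
sample_text = (
    "# rsid\tchromosome\tposition\tgenotype\n"  # assumed header comment line
    "rs12043779\t1\t209028440\tCC\n"
    "rs4589755\t2\t17191807\tCT\n"
    "rs10180116\t2\t229307294\tCC\n"
)

df_sample = pd.read_csv(
    io.StringIO(sample_text),
    delimiter="\t",
    comment="#",                      # skip every line that starts with "#"
    names=["rsid", "chr", "pos", "genotype"],
    dtype={"chr": str},               # keep chromosome labels as text
)
print(df_sample)
```

Because the comment lines are skipped, only the three SNP rows end up in the DataFrame.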

In [None]:
df_twins = df_kiana.merge(df_diana, how='inner', on="rsid", suffixes=("_k", "_d"))

In [None]:
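# Keep only SNPs where both twins have a genotype call ('--' marks a no-call).
# .copy() gives an independent DataFrame so adding columns later does not
# trigger pandas' SettingWithCopyWarning.
df_twins_complete = df_twins.query("genotype_k != '--' and genotype_d != '--'").copy()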

##**Results**


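We merged our two datasets on the rsid column, dropped the SNPs where either of us had a no-call (`--`), and calculated the fraction of the remaining rows in which we had the same genotype. The results indicate that our DNA is 99.99% identical at the SNPs we could compare.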

In [None]:
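# Flag each SNP as a match when both twins have the same genotype string.
# assign() returns a new DataFrame, which avoids writing into a slice of df_twins.
df_twins_complete = df_twins_complete.assign(
    match=df_twins_complete["genotype_k"] == df_twins_complete["genotype_d"]
)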

In [None]:
sum(df_twins_complete["match"]) / df_twins_complete.shape[0]

0.9999190118145564
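
The steps above can be sketched end to end on toy data. This is an illustrative version, not the exact cells used for our result. The `concordance` helper, the `snp1`...`snp5` identifiers, and the genotypes are all made up. It also sorts the two letters of each genotype before comparing, on the assumption that `CT` and `TC` should count as the same call. Our analysis above compares the raw strings directly.

```python
import pandas as pd


def concordance(df_a, df_b):
    """Return (compared rows, fraction of matching genotypes) for two twins."""
    merged = df_a.merge(df_b, how="inner", on="rsid", suffixes=("_k", "_d"))
    # Drop SNPs where either twin has a no-call.
    called = merged[(merged["genotype_k"] != "--") & (merged["genotype_d"] != "--")].copy()
    # Order the alleles alphabetically so "TC" and "CT" compare as equal.
    for col in ("genotype_k", "genotype_d"):
        called[col] = called[col].map(lambda g: "".join(sorted(g)))
    called["match"] = called["genotype_k"] == called["genotype_d"]
    return called, called["match"].mean()


# Hypothetical toy genotypes: one no-call, one reordered call, one real mismatch.
toy_k = pd.DataFrame({"rsid": ["snp1", "snp2", "snp3", "snp4", "snp5"],
                      "genotype": ["AA", "CT", "GG", "--", "AG"]})
toy_d = pd.DataFrame({"rsid": ["snp1", "snp2", "snp3", "snp4", "snp5"],
                      "genotype": ["AA", "TC", "GT", "CC", "AG"]})

called, rate = concordance(toy_k, toy_d)
print(len(called), rate)
```

Here four of the five SNPs survive the no-call filter and three of those four match, so the toy concordance is 0.75.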

##**What about the 0.01% that's different?**


The table below shows the small subset of SNPs at which we have different genotypes. These differences are most likely due to genotyping errors; nevertheless, we checked whether any of them have known clinical significance. To do so, we looked each SNP up in the Single Nucleotide Polymorphism Database (dbSNP, https://www.ncbi.nlm.nih.gov/snp/), hosted by the National Library of Medicine.

In [None]:
# Show only the SNPs where the two genotypes differ.
df_twins_complete.loc[~df_twins_complete["match"], :]

| index | rsid | chr_k | pos_k | genotype_k | chr_d | pos_d | genotype_d | match |
|---|---|---|---|---|---|---|---|---|
| 39733 | rs12043779 | 1 | 209028440 | CC | 1 | 209028440 | CT | False |
| 53972 | rs4589755 | 2 | 17191807 | CT | 2 | 17191807 | CC | False |
| 97203 | rs10180116 | 2 | 229307294 | CC | 2 | 229307294 | CT | False |
| 107597 | rs1463581 | 3 | 22106472 | CC | 3 | 22106472 | CT | False |
| 113624 | rs74783035 | 3 | 48658844 | CT | 3 | 48658844 | CC | False |
| 117962 | rs115957924 | 3 | 67613466 | CC | 3 | 67613466 | AC | False |
| 123644 | rs2713692 | 3 | 104058316 | CC | 3 | 104058316 | CT | False |
| 132391 | rs35888352 | 3 | 147528677 | GG | 3 | 147528677 | GT | False |
| 132949 | rs7617031 | 3 | 149004527 | GG | 3 | 149004527 | AG | False |
| 138598 | rs116779868 | 3 | 176260376 | AG | 3 | 176260376 | GG | False |
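
One way to look more closely at these mismatches is to count how many alleles the two genotypes share at each SNP. The sketch below does this for the ten rows shown above. The `shared_alleles` helper is a hypothetical name introduced here, and the approach is only a rough descriptive check: it cannot tell a genotyping error from a real difference.

```python
from collections import Counter

# (rsid, Kiana's genotype, Diana's genotype) copied from the rows shown above.
mismatches = [
    ("rs12043779", "CC", "CT"),
    ("rs4589755", "CT", "CC"),
    ("rs10180116", "CC", "CT"),
    ("rs1463581", "CC", "CT"),
    ("rs74783035", "CT", "CC"),
    ("rs115957924", "CC", "AC"),
    ("rs2713692", "CC", "CT"),
    ("rs35888352", "GG", "GT"),
    ("rs7617031", "GG", "AG"),
    ("rs116779868", "AG", "GG"),
]


def shared_alleles(g1, g2):
    """Number of alleles two diploid genotypes have in common (0, 1 or 2)."""
    overlap = Counter(g1) & Counter(g2)  # multiset intersection of the letters
    return sum(overlap.values())


def is_homozygous(genotype):
    """True when both alleles of a two-letter genotype are the same."""
    return genotype[0] == genotype[1]


# Tally how many alleles each mismatched pair shares.
summary = Counter(shared_alleles(k, d) for _, k, d in mismatches)
# Check whether every pair is one homozygous call against one heterozygous call.
hom_vs_het = all(is_homozygous(k) != is_homozygous(d) for _, k, d in mismatches)
print(summary, hom_vs_het)
```

In all ten rows shown, one of us is called homozygous and the other heterozygous, and the two genotypes still share one allele. We never have completely different genotypes in these rows.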


# **Conclusion**

We performed this data analysis to assess whether we are actually identical twins at the DNA level. The results showed that out of the nearly 640,000 SNPs genotyped, we differ at 50. We cannot conclude whether these differences are due to genotyping errors or are actual differences between us. We did not find significant clinical correlations linked to these SNPs in dbSNP.
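
As a quick arithmetic check, the match fraction printed earlier and the count of 50 differing SNPs together imply how many SNPs were actually compared once no-calls were dropped. This is a back-of-the-envelope sketch using only those two numbers. The variable names are ours.

```python
match_fraction = 0.9999190118145564  # output of the match-rate cell above
n_mismatches = 50                    # differing SNPs reported in this conclusion

# Fraction of compared SNPs that differ, and the implied number compared.
mismatch_fraction = 1 - match_fraction
n_compared = n_mismatches / mismatch_fraction

print(f"mismatch rate: {mismatch_fraction:.4%}")
print(f"implied SNPs compared: {n_compared:,.0f}")
```

The mismatch rate comes out at roughly 0.008%, and the implied number of SNPs compared is about 617,000, somewhat fewer than the 638,547 in the raw files because SNPs with a no-call were excluded.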

Because 23andMe uses genotyping instead of sequencing, there may be other portions of our genome that differ which we were not able to compare with this data. Regardless, our genomes are much more similar than those of ordinary siblings, parents and children, or fraternal twins, who share about 50% of their DNA. Whole genome sequencing would be necessary to draw conclusions about the entirety of our DNA.

Another factor that may make us differ is the variable expression of our DNA. For example, DNA methylation can repress transcription of a gene into RNA. 23andMe does not provide this kind of epigenetic data, so we cannot tell from these files whether epigenetic differences contribute to the differences in our appearances.