# Human de novo mutations

### Load modules

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy import stats
import statsmodels.formula.api as smf
import statsmodels.api as sm

### Load files

In [8]:
# read both tables into pandas dataframes (paths are relative to the notebook);
# despite the .tsv extension, the default comma separator parses these files

df = pd.read_csv("aau1043_dnm.tsv")
df_parent_ages = pd.read_csv("aau1043_parental_age.tsv")


In [75]:
df

Unnamed: 0,Chr,Pos,Ref,Alt,Proband_id,Phase_combined,Crossover,Sanger
0,chr1,241097646,C,T,99379,father,paternal_crossover,confirmed
1,chr10,29202943,A,G,8147,father,maternal_crossover,PCR failed
2,chr11,129441657,C,T,5410,mother,maternal_crossover,confirmed
3,chr13,96867147,A,G,46025,father,paternal_crossover,confirmed
4,chr17,50609998,C,T,144769,mother,maternal_crossover,confirmed
...,...,...,...,...,...,...,...,...
26426,chr9,137374330,C,T,54383,father,,
26427,chr9,137396508,C,T,39729,father,,
26428,chr9,137633973,C,A,17904,mother,,
26429,chr9,137889777,G,A,80108,father,,


Count the number of de novo mutations per proband: 
- The Phase_combined column records the inferred parent of origin of the de novo mutation. 
- Break the counts of de novo mutations down into those of maternal origin, those of paternal origin, and total de novo mutations (including those of unknown parental origin).
- Store these counts in a new pandas dataframe with columns: Proband_id, pat_dnm, mat_dnm, tot_dnm.

In [64]:
# first, count the rows per Proband_id: each row is one dnm, so this is the total per proband (regardless of origin)
total_proband_counts = df["Proband_id"].value_counts()

91410     122
114094    121
111288    115
8147      114
88246     113
         ... 
121087     37
62630      34
76504      34
37789      34
13990      33
Name: Proband_id, Length: 396, dtype: int64


In [73]:
# create df subsets for each dnm group
roi = df["Phase_combined"] == "father"
df_father_dnm = df.loc[roi, :]

roi = df["Phase_combined"] == "mother"
df_mother_dnm = df.loc[roi, :]


# count the dnms per proband_id within each subset
father_dnm_per_proband = df_father_dnm["Proband_id"].value_counts()
mother_dnm_per_proband = df_mother_dnm["Proband_id"].value_counts()

In [74]:

# create a new dataframe combining the three count series (pandas aligns them on the proband ID index)

new_df = pd.DataFrame({"total_dnm": total_proband_counts, "pat_dnm": father_dnm_per_proband, "mat_dnm": mother_dnm_per_proband})

new_df

Unnamed: 0,total_dnm,pat_dnm,mat_dnm
675,70,51,19
1097,39,26,12
1230,57,42,12
1481,68,53,14
1806,78,61,11
...,...,...,...
153657,49,41,8
154565,75,61,14
154621,50,39,11
154810,69,55,14
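
The prompt asks for columns named Proband_id, pat_dnm, mat_dnm, and tot_dnm, and a proband with no mutations phased to one parent would be missing from that parent's count series. The sketch below is one possible way to handle both points. It runs on a small made-up table (`toy`, hypothetical values, not the real data), and the helper name `count_dnms` is an illustration rather than part of the assignment.

```python
import pandas as pd

# Toy stand-in for the DNM table (hypothetical values, not the real data)
toy = pd.DataFrame({
    "Proband_id": [1, 1, 1, 2, 2, 3],
    "Phase_combined": ["father", "father", "mother", "father", None, "mother"],
})


def count_dnms(dnm):
    """Count paternal, maternal, and total DNMs per proband (hypothetical helper)."""
    # total: every row is one DNM, whatever its phase
    tot = dnm.groupby("Proband_id").size().rename("tot_dnm")
    # phased counts: keep only rows assigned to the given parent
    pat = dnm.loc[dnm["Phase_combined"] == "father"].groupby("Proband_id").size().rename("pat_dnm")
    mat = dnm.loc[dnm["Phase_combined"] == "mother"].groupby("Proband_id").size().rename("mat_dnm")
    # align the three series on proband ID; a missing count means zero DNMs
    counts = pd.concat([pat, mat, tot], axis=1).fillna(0).astype(int).sort_index()
    counts.index.name = "Proband_id"
    return counts.reset_index()


toy_counts = count_dnms(toy)
print(toy_counts)
```

Applied to the real table, this would be `count_dnms(df)`. Filling gaps with zero assumes that a proband absent from a phased subset simply has no mutations assigned to that parent.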


Use the pandas `merge` function to combine the data frame above with the data frame of maternal and paternal ages (`df_parent_ages`).
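
A minimal sketch of such a merge, run on toy tables with made-up values. The age column names `Father_age` and `Mother_age` are assumptions, since the columns of the parental age file are not shown in this notebook; check `df_parent_ages.columns` before adapting it.

```python
import pandas as pd

# Toy per-proband counts (hypothetical values)
toy_counts = pd.DataFrame({
    "Proband_id": [1, 2, 3],
    "pat_dnm": [2, 1, 0],
    "mat_dnm": [1, 0, 1],
    "tot_dnm": [3, 2, 1],
})

# Toy parental ages; the column names Father_age / Mother_age are assumed
toy_ages = pd.DataFrame({
    "Proband_id": [1, 2, 4],
    "Father_age": [30, 41, 35],
    "Mother_age": [28, 39, 33],
})

# Inner join keeps only probands present in both tables;
# validate raises an error if a proband ID is duplicated on either side
merged = toy_counts.merge(toy_ages, on="Proband_id", how="inner", validate="one_to_one")
print(merged)
```

In the notebook, `new_df` holds the proband ID in its index rather than in a column, so it would first need something like `new_df.rename_axis("Proband_id").reset_index()`. An inner join is only one choice; `how="left"` would keep probands that lack age information.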