# Vignette: Estimating population size from ROH via likelihood
In this vignette you will learn how to estimate population sizes from ROH in specific length bins. The model assumes a panmictic population, so think of these estimates as the size of the ancestry pool at the time depth at which the fitted ROH originated. The model is described in the Supplement of https://doi.org/10.1101/2020.06.01.126730 and uses the framework introduced in https://www.genetics.org/content/205/3/1335.

In [1]:
import os
import pandas as pd

## Set the path
Set the working directory here. All relative data paths below are resolved from it.

In [2]:
### Fill in your own path here!
path = "/project2/jnovembre/hringbauer/HAPSBURG/"  # Path to the package on the Midway cluster
#path = "/n/groups/reich/hringbauer/"
os.chdir(path)  # Set the working directory (in line with Atom default)
print(f"Set path to: {os.getcwd()}") # Show the current working directory. Should match the path set above

Set path to: /project2/jnovembre/hringbauer/HAPSBURG


# Load the data for the MLE analysis
We can analyse any set of individuals (including a set with a single individual). Load their ROH as a list of lists and pass it to the MLE object.
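As a minimal sketch of the expected input shape, assuming each inner list holds the ROH lengths of one individual. The values are invented for illustration, and the unit is assumed to be cM to match the `min_len` and `max_len` arguments used later in this vignette.

```python
# Hypothetical ROH lengths (assumed to be in cM), one inner list per individual
roh_three_inds = [
    [4.2, 5.1, 12.7],   # individual 1
    [6.0, 9.4],         # individual 2
    [8.3],              # individual 3
]

# A single individual is still wrapped in an outer list
roh_one_ind = [[4.2, 5.1, 12.7]]

n_inds = len(roh_three_inds)
n_roh = sum(len(r) for r in roh_three_inds)
print(f"{n_inds} individuals, {n_roh} ROH in total")
```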

In [3]:
from hapsburg.PackagesSupport.fit_ne import MLE_ROH_Ne, load_roh_vec

In [10]:
df1 = pd.read_csv("./Empirical/Eigenstrat/Example/combined_roh05.csv", sep='\t')
df2 = df1[df1["sum_roh>20"]<50] # Remove inbred individuals
iids = df2["iid"].values # Load list of all iids
print(f"Loaded {len(iids)} IIDs")

Loaded 8 IIDs


### Load ROH to analyze
This step requires that the individuals to analyse have been run with hapROH (see the `callROH_vignette`).
It loads all ROH lengths; in the step below you can define which lengths to actually fit.

In [11]:
roh_vec = load_roh_vec(iids=iids, base_path = "./Empirical/Eigenstrat/Example/", suffix="_roh_full.csv")
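Before fitting, it can help to check how many ROH fall into the length bin of interest. The sketch below uses a hypothetical stand-in for `roh_vec` (invented values, assumed to be in cM). Treating the lower edge as inclusive and the upper edge as exclusive is an assumption of this sketch, not a statement about how hapROH handles the bin edges.

```python
import numpy as np

# Hypothetical stand-in for roh_vec (lengths assumed to be in cM)
roh_vec_demo = [np.array([2.1, 4.5, 7.9, 25.0]), np.array([5.0, 19.9, 30.2])]
min_len, max_len = 4, 20  # same bin as used for the fit in this vignette

# Count ROH per individual inside the bin
# (inclusive lower / exclusive upper edge is an assumption of this sketch)
n_in_bin = [int(np.sum((r >= min_len) & (r < max_len))) for r in roh_vec_demo]
print(n_in_bin)
```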

# Use Maximum Likelihood to fit Ne
With `roh_vec` loaded, we can now fit Ne, using a class implemented in hapROH for that purpose. The fit produces a pandas dataframe in which `coef` gives the most likely estimate and two columns (`0.025` and `0.975`) give the lower and upper bound of the 95% confidence interval.

Note: Depending on demography (how many generations are alive, skew of reproductive success), the ratio of effective to census size is often 0.1-0.3x. Check the popgen literature for further details.

Important: By default the reported estimates are for 2Ne. You have to divide by 2 to get the estimates in Ne.

In [19]:
%%time
output = True
min_len = 4 # Min ROH length in cM to fit
max_len = 20 # Max ROH length in cM to fit

mle = MLE_ROH_Ne(start_params=1000, endog=roh_vec,
                 min_len=min_len, max_len=max_len,
                 chr_lgts=[],      # lengths of Chromosomes to fit (in cM). If len 0, use default for 1240K
                 error_model=False, output=False)
fit = mle.fit_ll_profile()
#summary = fit.summary()
mle.summary/2  # to get estimates in terms of Ne

CPU times: user 1.89 s, sys: 0 ns, total: 1.89 s
Wall time: 1.89 s


Unnamed: 0,coef,std err,0.025,0.975,n
0,6068.118992,,3146.073055,14009.332782,4.0


Ne=6070 (95% CI: 3150-14000) is a value typical of relatively large populations. At this size, little ROH is expected, even in the 4-8 cM category.
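To relate this estimate to the note on effective versus census size, here is a back-of-the-envelope sketch. It takes the point estimate from the table above and the 0.1-0.3 ratio quoted in the note. The resulting range is only a rough illustration, since the true ratio depends on the demography of the population.

```python
ne = 6068.118992                  # point estimate of Ne from the table above
two_ne = 2 * ne                   # the value reported by default (2Ne)
ratio_low, ratio_high = 0.1, 0.3  # Ne / census size, range quoted in the note

# A smaller ratio implies a larger census size for the same Ne
census_low = ne / ratio_high
census_high = ne / ratio_low
print(f"2Ne = {two_ne:.0f}")
print(f"Rough census size: {census_low:.0f} to {census_high:.0f}")
```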

# Alternative Mode for Confidence Intervals
In the method above, 95% confidence intervals are fitted via the likelihood profile: the log likelihood is calculated for a large number of 2Ne values, and the interval is bounded by the values where it drops 1.92 log likelihood units below the maximum.
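The profile construction can be sketched on a toy problem. The example below uses a plain exponential model with a single rate parameter, not the ROH length likelihood of hapROH, and all names in it are hypothetical. The threshold of 1.92 is half of 3.84, the 95% quantile of a chi-square distribution with one degree of freedom.

```python
import numpy as np

# Toy data: a plain exponential model, NOT the ROH likelihood of hapROH.
# It only illustrates how a profile likelihood interval is constructed.
rng = np.random.default_rng(seed=1)
x = rng.exponential(scale=1 / 2.0, size=20)  # true rate = 2.0

def log_lik(rate, data):
    """Log likelihood of exponential data for a given rate."""
    return len(data) * np.log(rate) - rate * np.sum(data)

# Evaluate the log likelihood on a dense grid of candidate parameter values
grid = np.linspace(0.05, 10, 20000)
ll = log_lik(grid, x)

i_max = np.argmax(ll)
mle = grid[i_max]

# Keep all grid values within 1.92 log likelihood units of the maximum
inside = grid[ll >= ll[i_max] - 1.92]
ci_low, ci_high = inside.min(), inside.max()
print(f"MLE: {mle:.2f}, 95% CI: {ci_low:.2f} - {ci_high:.2f}")
```

Because the interval follows the actual shape of the likelihood, it does not have to be symmetric around the estimate.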

A simpler and quicker way is to use the curvature of the likelihood (the so-called Fisher information matrix). This works well with a lot of data and small population sizes, because the likelihood is then approximated well by this Gaussian fit. However, when the likelihood is "flat", it is better to use the method above.

In [14]:
%%time
output = True
min_len = 4 # Min ROH length in cM to fit
max_len = 20 # Max ROH length in cM to fit

mle = MLE_ROH_Ne(start_params=1000, endog=roh_vec,
                 min_len=min_len, max_len=max_len,
                 chr_lgts=[],      # lengths of Chromosomes to fit (in cM). If len 0, use default for 1240K
                 error_model=False, output=False)
fit = mle.fit()
#summary = fit.summary()
mle.summary/2  # to get estimates in terms of Ne

Optimization terminated successfully.
         Current function value: 4.388272
         Iterations: 34
         Function evaluations: 70
CPU times: user 354 ms, sys: 42.6 ms, total: 397 ms
Wall time: 1.81 s


Unnamed: 0,coef,std err,0.025,0.975
const,6055.0,2293.178,1559.8625,10550.0


The point estimates agree, but the confidence intervals differ. The symmetric, local approximation captures the order of magnitude of the uncertainty, but it is problematic in this case.
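The difference can be made concrete with the numbers from the two summary tables above. The sketch below assumes that the curvature-based bounds are the usual estimate plus or minus 1.96 standard errors, which the reported values are consistent with up to rounding.

```python
# Numbers copied from the two summary tables above (in units of Ne)
coef_wald, se_wald = 6055.0, 2293.178
coef_prof, low_prof, high_prof = 6068.118992, 3146.073055, 14009.332782

# Curvature-based interval: estimate +/- 1.96 standard errors (assumed form)
low_wald = coef_wald - 1.96 * se_wald
high_wald = coef_wald + 1.96 * se_wald
print(f"Curvature-based CI: {low_wald:.0f} - {high_wald:.0f}")

# Distance from the estimate to each bound of the profile interval
down = coef_prof - low_prof
up = high_prof - coef_prof
print(f"Profile CI: -{down:.0f} / +{up:.0f}")
```

The profile interval extends much further upward than downward, which a symmetric interval cannot represent.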

# Further Application: Fitting IBD_X
You can also use this machinery to fit population sizes from IBD on the X chromosome. Set the chromosome length to `chr_lgts=[180.85 * 2/3,]`, and first convert the IBD_X lengths to the sex-averaged rate (2/3 * length of IBD in female map units).
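A minimal sketch of this conversion, using invented segment lengths. The variable names are hypothetical, and the converted lengths would then take the place of `roh_vec` in the fit shown earlier.

```python
import numpy as np

# Hypothetical IBD_X segment lengths on the female X map (in cM)
ibd_x_female_cm = np.array([6.0, 9.0, 15.0])

# Convert to the sex-averaged scale described above
ibd_x_sex_avg_cm = 2 / 3 * ibd_x_female_cm

# X chromosome length on the same sex-averaged scale (value from the text)
chr_lgts = [180.85 * 2 / 3]

print(ibd_x_sex_avg_cm)
print(chr_lgts)
```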