# Number of causal sites

This notebook uses the French Canadian dataset and the tree sequence inferred from the 1000 Genomes Project to examine how the number of causal sites affects simulation speed.

### Dataset:

The datasets are obtained from the following links:

- The French Canadian dataset (`simulated_genomes_chr9.tsz`) can be downloaded from https://zenodo.org/record/6839683
- The 1000 Genomes Project dataset (`1kg_chr9.trees.tsz`) can be downloaded from https://zenodo.org/record/3051855

Please put the datasets in the `data` folder before running the code.

### Note:

The code that measures the simulation time on the French Canadian dataset takes a very long time to run. We suggest that you run it overnight.

The saved computational time dataframes will be used in `figure.ipynb` to generate the computational time plots.

In [None]:
import tstrait
import tskit
import time
import numpy as np
import pandas as pd
import tszip
import msprime

In [None]:
# Warm-up run: tstrait uses numba, so the first call includes JIT compilation time.
# Run this cell before measuring the computational time.
ts = msprime.sim_ancestry(
    samples=1000,
    recombination_rate=1e-8,
    sequence_length=10_000,
    population_size=10_000,
)
ts = msprime.sim_mutations(ts, rate=1e-8)
trait_model = tstrait.trait_model(distribution="normal", mean=0, var=1)
sim_result = tstrait.sim_phenotype(ts, 1, trait_model, 0.3)

In [None]:
def compute_time_tstrait(ts, num_causal):
    """Time 10 runs of tstrait.sim_phenotype and return the durations in seconds."""
    times = []
    trait_model = tstrait.trait_model(distribution="normal", mean=0, var=1)
    for _ in range(10):
        before = time.perf_counter()
        # Only the elapsed time is recorded; the simulation result is discarded
        tstrait.sim_phenotype(ts, num_causal, trait_model, 0.3)
        duration = time.perf_counter() - before
        times.append(duration)
    return np.array(times)
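
The repeated-measurement pattern used above can be sketched independently of tstrait. The following is a minimal illustration using only the standard library; `time_repeated` and `dummy_workload` are hypothetical names introduced here and are not part of this notebook or of tstrait. It assumes that untimed warm-up calls are enough to absorb one-off costs such as JIT compilation.

```python
import time


def time_repeated(func, repeats=10, warmup=1):
    """Return a list of wall-clock durations (seconds) for repeated calls to func."""
    # Untimed warm-up calls absorb one-off costs such as JIT compilation.
    for _ in range(warmup):
        func()
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def dummy_workload():
    # Hypothetical placeholder for the call being benchmarked.
    return sum(i * i for i in range(10_000))


durations = time_repeated(dummy_workload, repeats=5)
print(f"fastest of {len(durations)} runs: {min(durations):.6f} s")
```

Keeping every individual duration, rather than only an average, makes it possible to inspect the run-to-run spread later.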

In [None]:
# French Canadian dataset
ts = tszip.decompress("data/simulated_genomes_chr9.tsz")

In [None]:
time_result = {}

In [None]:
# Note: This cell is computationally intensive. Run it only if you can leave it running overnight.
num_causal_array = [50, 100, 500, 1000, 2500, 5000, 7500, 10_000]

for num_causal in num_causal_array:
    time_result["French_Canadian_{}".format(num_causal)] = compute_time_tstrait(ts, num_causal)

In [None]:
# 1000 Genomes Project dataset
ts = tszip.decompress("data/1kg_chr9.trees.tsz")

In [None]:
num_causal_array = [50, 100, 500, 1000, 2500, 5000, 7500, 10_000]

for num_causal in num_causal_array:
    time_result["1000_Genomes_{}".format(num_causal)] = compute_time_tstrait(ts, num_causal)

In [None]:
time_df = pd.DataFrame(time_result)
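
Each column of the timing dataframe holds the 10 repeated measurements for one dataset and one number of causal sites. As a rough sketch of how such a dataframe might be summarised, the example below builds a small stand-in with the same column naming scheme. The numbers in `example_result` are made-up placeholders, not measured timings, and `example_df` and `summary` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Placeholder values only: 10 made-up "timings" per column, not real measurements.
rng = np.random.default_rng(0)
example_result = {
    "French_Canadian_50": rng.uniform(1.0, 2.0, size=10),
    "French_Canadian_100": rng.uniform(2.0, 3.0, size=10),
    "1000_Genomes_50": rng.uniform(0.1, 0.2, size=10),
    "1000_Genomes_100": rng.uniform(0.2, 0.3, size=10),
}
example_df = pd.DataFrame(example_result)

# Summary statistics across the 10 repeats, one row per original column.
summary = example_df.agg(["mean", "std", "min"]).T

# Split each column name at the last underscore into a dataset label
# and the number of causal sites.
labels = summary.index.to_series().str.rsplit("_", n=1, expand=True)
summary["dataset"] = labels[0]
summary["num_causal"] = labels[1].astype(int)

print(summary)
```

Splitting at the last underscore is needed because the dataset labels themselves contain underscores.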

In [None]:
time_df.to_csv("output/tstrait_num_causal.csv")