## Further investigations: an example with real data

Inferred tree sequences for a global human dataset of 7524 whole genomes, including a few ancient individuals such as Neanderthals and Denisovans, are publicly available on <a href="https://zenodo.org/record/5512994">Zenodo</a>. The genomes are primarily taken from the 1000 Genomes Project, the Simons Genome Diversity Project, and the Human Genome Diversity Project.

This dataset is worth exploring. The code below downloads a compressed tree sequence of chromosome 2 (the "q" arm) from the public repository and makes it available as `ts_2q`.

In [None]:
# Download the compressed tree sequence for human chromosome 2, q arm (~180Mb - may take a number of seconds)
import urllib.request
import tszip  # used below to decompress the .tsz file
url = "https://zenodo.org/record/5512994/files/hgdp_tgp_sgdp_high_cov_ancients_chr2_q.dated.trees.tsz"
with workbook.download(url) as t:
    temporary_filename, _ = urllib.request.urlretrieve(url, reporthook=t.update_to)
    print("Converting file...")
    ts_2q = workbook.convert_metadata_to_new_format( # only needed as the zenodo files currently have old-style metadata
        tszip.decompress(temporary_filename))
    urllib.request.urlcleanup() # remove temporary_filename

The remaining code in the notebook is for readers who are interested in how to investigate specific genomic regions in real datasets. It plots diversity around a gene that has previously been identified as of interest in East and Southeast Asians (the <a href="https://en.wikipedia.org/wiki/Ectodysplasin_A_receptor">EDAR gene</a>), in whom a particular SNP, rs3827760, is at much higher frequency than in the rest of the world.

In [None]:
# Have a look at the populations defined in this tree sequence
ts_2q.tables.populations

In [None]:
# For simplicity & speed of future analysis, truncate the tree sequence to a region flanking EDAR (108_894_471..108_989_220)
keep_region = [108_000_000, 110_000_000]
edar_ts = ts_2q.keep_intervals([keep_region]).trim()
print(edar_ts.num_trees, "trees and", edar_ts.num_sites, "sites in the genomic region containing the EDAR gene")
# trim() shifts positions so the kept region starts at 0, so shift the gene bounds to match
edar_gene_bounds = np.array([108_894_471, 108_989_220]) - keep_region[0]
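
Because the truncated tree sequence starts at position 0, every position read off it (including the x-axis of the plots further down) is offset from the original chromosome 2 coordinate by the start of the kept region. The following is a minimal sketch of that conversion in both directions. The helper names `to_trimmed` and `to_original` are hypothetical and are not part of `tskit`; only the coordinates come from the cell above.

```python
# Hypothetical helpers illustrating the coordinate shift introduced by trimming.
# The coordinates are the ones used in the notebook cell above.
KEEP_START, KEEP_END = 108_000_000, 110_000_000
EDAR_START, EDAR_END = 108_894_471, 108_989_220


def to_trimmed(original_pos, region_start=KEEP_START):
    """Convert a chromosome coordinate to a coordinate in the trimmed region."""
    return original_pos - region_start


def to_original(trimmed_pos, region_start=KEEP_START):
    """Convert a coordinate in the trimmed region back to a chromosome coordinate."""
    return trimmed_pos + region_start


trimmed_bounds = (to_trimmed(EDAR_START), to_trimmed(EDAR_END))
print(trimmed_bounds)  # (894471, 989220)
```

Adding `KEEP_START` back to any position reported by the truncated tree sequence recovers the original chromosome 2 coordinate.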

In [None]:
# Find the node IDs of the East Asian samples
east_asian_population_ids = np.array([
    p.id
    for p in edar_ts.populations()
    if "region" in p.metadata and p.metadata["region"] in ('EastAsia', "EAST_ASIA")
])
is_east_asian_sample = np.isin(edar_ts.tables.nodes.population, east_asian_population_ids)
east_asian_samples = np.where(is_east_asian_sample)[0]

In [None]:
# Find the site with the interesting SNP, for plotting
interesting_snp = "rs3827760"
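for s in edar_ts.sites():
    if "ID" in s.metadata and s.metadata["ID"] == interesting_snp:
        print(f"SNP id {interesting_snp} is at site {s.id}")
        interesting_site_id = s.id
        break
else:
    # Fail early with a clear message rather than a NameError when plotting
    raise ValueError(f"SNP {interesting_snp} not found in the truncated tree sequence")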

# Plot the diversity in everyone vs East Asians
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 3))
plt.subplots_adjust(wspace=0.3)

window_locations, step = np.linspace(0, edar_ts.sequence_length, num=101, retstep=True)
diversity_site = edar_ts.diversity(windows=window_locations)
eas_diversity_site = edar_ts.diversity(sample_sets=east_asian_samples, windows=window_locations)
diversity_branch = edar_ts.diversity(windows=window_locations, mode="branch")
eas_diversity_branch = edar_ts.diversity(sample_sets=east_asian_samples, windows=window_locations, mode="branch")

# Site-based diversity (default mode="site")
ax1.axvspan(edar_gene_bounds[0], edar_gene_bounds[1], color="lightgray")
ax1.stairs(diversity_site, window_locations, baseline=None, label="all samples")
ax1.stairs(eas_diversity_site, window_locations, baseline=None, label="East Asian\nsamples")
ax1.set_xlabel("Genome position (bp)")
ax1.set_ylabel("Average proportion of sites\nthat vary between pairs")
ax1.axvline(edar_ts.site(interesting_site_id).position, ls=":")

ax1.legend()

# Genealogical equivalent (mode="branch")
ax2.axvspan(edar_gene_bounds[0], edar_gene_bounds[1], color="lightgray")
ax2.stairs(diversity_branch, window_locations, baseline=None, label="all samples")
ax2.stairs(eas_diversity_branch, window_locations, baseline=None, label="East Asian\nsamples")
ax2.set_xlabel("Genome position (bp)")
ax2.set_ylabel("Genealogical equivalent\n(av. branch length between pairs)")
ax2.axvline(edar_ts.site(interesting_site_id).position, ls=":")

plt.suptitle(
    r"Genetic diversity ($\pi$) in the EDAR gene (grey), "
    f"plotted in {step/1000:.0f} Kb windows "
    f"(dotted line gives position of {interesting_snp})"
)
plt.show()
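
To make concrete what the site-based measure in the left-hand panel represents, here is a minimal pure-Python sketch of windowed nucleotide diversity: for each window, the number of differences between each pair of samples is averaged over all pairs and divided by the window length. The function name `windowed_site_diversity` and the toy genotypes are made up for illustration. This is not how `tskit` computes the statistic internally, although on simple data like this it should correspond to `diversity(windows=...)` with its default per-base normalisation.

```python
from itertools import combinations


def windowed_site_diversity(genotypes, positions, windows):
    """
    Mean pairwise differences per base pair in each window.

    genotypes: one list per site, giving the allele carried by each sample
    positions: the position of each site
    windows:   window breakpoints, e.g. [0, 10, 20] defines two windows
    """
    num_samples = len(genotypes[0])
    num_pairs = num_samples * (num_samples - 1) / 2
    result = []
    for left, right in zip(windows[:-1], windows[1:]):
        total = 0.0
        for pos, alleles in zip(positions, genotypes):
            if left <= pos < right:
                # Count the sample pairs that carry different alleles at this site
                differing = sum(a != b for a, b in combinations(alleles, 2))
                total += differing / num_pairs
        result.append(total / (right - left))
    return result


# Toy data (hypothetical): 4 sample genomes, 3 variable sites on a 20 bp sequence
positions = [2, 7, 15]
genotypes = [
    [0, 0, 0, 1],  # 3 of the 6 pairs differ
    [0, 0, 1, 1],  # 4 of the 6 pairs differ
    [0, 1, 1, 1],  # 3 of the 6 pairs differ
]
pi = windowed_site_diversity(genotypes, positions, windows=[0, 10, 20])
print(pi)
```

The first window contains two sites and so has a higher value than the second. The branch-based panel replaces "number of differing sites" with "branch length separating the pair in the local trees", which does not depend on where mutations happen to have fallen.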

Although not conclusive, the branch-length diversity appears to drop in East Asians in the region in front of the gene, possibly a sign of local selection on an East Asian-specific variant such as rs3827760. It would be interesting to examine the [genealogical nearest neighbours](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.genealogical_nearest_neighbours) of the Denisovan individual in this region. As they say, "this exercise is left for the reader".
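
As a starting point for that exercise, the sketch below illustrates the idea behind genealogical nearest neighbours on a single hand-written tree. It is a simplified pure-Python re-implementation written for this illustration, not the `tskit` algorithm: the function `gnn_single_tree`, the parent dictionary, and the node numbering are all hypothetical. For a focal node, it walks up the tree until it reaches an ancestor with at least one other descendant in the reference sets, then reports the proportion of those descendants that fall in each set.

```python
def gnn_single_tree(parent, focal, sample_sets):
    """Proportion of the focal node's nearest neighbours found in each reference set."""
    # Map each reference sample to the index of the set it belongs to
    set_index = {u: i for i, samples in enumerate(sample_sets) for u in samples}

    # Build a child lookup from the parent dictionary
    children = {}
    for child, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(child)

    def leaves_below(node):
        """All leaf nodes descending from the given node."""
        stack, leaves = [node], []
        while stack:
            v = stack.pop()
            kids = children.get(v, [])
            if kids:
                stack.extend(kids)
            else:
                leaves.append(v)
        return leaves

    ancestor = parent[focal]
    while ancestor is not None:
        counts = [0] * len(sample_sets)
        for leaf in leaves_below(ancestor):
            if leaf != focal and leaf in set_index:
                counts[set_index[leaf]] += 1
        if sum(counts) > 0:
            return [c / sum(counts) for c in counts]
        ancestor = parent[ancestor]  # no reference samples here yet: go further up
    return [0.0] * len(sample_sets)


# Toy tree ((0,1),(2,3)): internal nodes 4 and 5, root 6
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: None}

close = gnn_single_tree(parent, focal=0, sample_sets=[[1], [2, 3]])
distant = gnn_single_tree(parent, focal=0, sample_sets=[[2], [3]])
print(close, distant)
```

On the real data, the corresponding call would be `edar_ts.genealogical_nearest_neighbours(focal, sample_sets)`, with `focal` set to the node IDs of the Denisovan sample and `sample_sets` to lists of sample node IDs grouped by population or region. That method returns one row per focal node and one column per reference set, with values averaged over the local trees in the region.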