# Ag1000G phase 3 SNP data release - data download guide

**5 January 2021**

This notebook provides information about how to download data from the [MalariaGEN Anopheles gambiae 1000 Genomes project (Ag1000G) phase 3 SNP data release](https://www.malariagen.net/data/ag1000g-phase3-snp). This includes sample metadata, raw sequence reads, sequence read alignments, and single nucleotide polymorphism (SNP) calls.

If you have any questions about this guide or how to use the data, please [start a new discussion](https://github.com/malariagen/vector-public-data/discussions/new) on the malariagen/vector-public-data repo on GitHub. If you find any bugs, please [raise an issue](https://github.com/malariagen/vector-public-data/issues/new/choose).

## About this guide

This guide is written as a Jupyter notebook. Code examples that are intended to be run via a Linux command line are prefixed with an exclamation mark (!). If you are running these commands directly from a terminal, remove the exclamation mark.

Examples in this guide assume you are downloading data to a local folder within your home directory at the path "~/data/ag3/". Change this if you want to download to a different folder.

## Data hosting

Data in this release are hosted by several different services. 

Raw sequence reads in FASTQ format, sequence read alignments in BAM format, and SNP calls in VCF format are hosted by the European Nucleotide Archive (ENA). This guide provides examples of downloading data from ENA via FTP using the `wget` command line tool. There are several other options for downloading data; see the [ENA documentation on how to download data files](https://ena-docs.readthedocs.io/en/latest/retrieval/file-download.html) for more information.

Sample metadata in CSV format and SNP calls in Zarr format are hosted on Google Cloud Storage (GCS) in the `vo_agam_release` bucket, which is a multi-region bucket located in the United States. All data hosted on GCS are publicly accessible and do not require any authentication to access. This guide provides examples of downloading data from GCS to a local computer using the `gsutil` command line tool. For more information about `gsutil`, see the [gsutil tool documentation](https://cloud.google.com/storage/docs/gsutil).

## Sample sets

Data in this release are organised into 28 sample sets. Each of these sample sets corresponds to a set of mosquito specimens contributed by a collaborating study. Depending on your objectives, you may want to download data from only specific sample sets, or all sample sets. For convenience there is a tab-delimited manifest file listing all sample sets in the release. Here is a direct download link for the sample set manifest:

* https://storage.googleapis.com/vo_agam_release/v3/manifest.tsv

The sample set manifest can also be downloaded via gsutil to a directory on the local file system, e.g.:

In [1]:
!mkdir -pv ~/data/ag3/
!gsutil cp gs://vo_agam_release/v3/manifest.tsv ~/data/ag3/

Copying gs://vo_agam_release/v3/manifest.tsv...
/ [1 files][  453.0 B/  453.0 B]                                                
Operation completed over 1 objects/453.0 B.                                      


Here are the file contents:

In [2]:
!cat ~/data/ag3/manifest.tsv

sample_set	sample_count
AG1000G-AO	81
AG1000G-BF-A	181
AG1000G-BF-B	102
AG1000G-BF-C	13
AG1000G-CD	76
AG1000G-CF	73
AG1000G-CI	80
AG1000G-CM-A	303
AG1000G-CM-B	97
AG1000G-CM-C	44
AG1000G-FR	23
AG1000G-GA-A	69
AG1000G-GH	100
AG1000G-GM-A	74
AG1000G-GM-B	31
AG1000G-GM-C	174
AG1000G-GN-A	45
AG1000G-GN-B	185
AG1000G-GQ	10
AG1000G-GW	101
AG1000G-KE	86
AG1000G-ML-A	60
AG1000G-ML-B	71
AG1000G-MW	41
AG1000G-MZ	74
AG1000G-TZ	300
AG1000G-UG	290
AG1000G-X	297


The sample set identifiers all start with "AG1000G-" followed by the two-letter code of the country from which samples were collected (e.g., "AO" is Angola). Where there are multiple sample sets from the same country, these have been given alphabetical suffixes, e.g., "AG1000G-BF-A", "AG1000G-BF-B" and "AG1000G-BF-C" are three sample sets from Burkina Faso.

These country codes are just a convenience to help remember which sample sets contain which data. Please see the sample metadata for more precise location information. Please note also that sample set AG1000G-GN-B contains samples from both Guinea and Mali.
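
The naming convention above can be used to group sample sets by their code. The following is a minimal sketch using only the Python standard library and a short excerpt of the manifest shown earlier. The helper names `split_sample_set_id` and `count_by_code` are hypothetical, not part of any released tool. Because one identifier in the manifest (AG1000G-X) has a single-character code, the sketch treats the second field as an opaque label rather than validating it as a country code.

```python
import csv
import io
from collections import defaultdict

# Excerpt of manifest.tsv, copied from the listing above (tab-delimited).
MANIFEST_EXCERPT = (
    "sample_set\tsample_count\n"
    "AG1000G-AO\t81\n"
    "AG1000G-BF-A\t181\n"
    "AG1000G-BF-B\t102\n"
    "AG1000G-BF-C\t13\n"
)


def split_sample_set_id(sample_set):
    """Split e.g. 'AG1000G-BF-A' into ('BF', 'A'); the suffix is '' if absent."""
    parts = sample_set.split("-")
    if len(parts) < 2 or parts[0] != "AG1000G":
        raise ValueError(f"unexpected sample set identifier: {sample_set!r}")
    code = parts[1]
    suffix = parts[2] if len(parts) > 2 else ""
    return code, suffix


def count_by_code(manifest_text):
    """Sum sample counts over all sample sets sharing the same code."""
    counts = defaultdict(int)
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    for row in reader:
        code, _suffix = split_sample_set_id(row["sample_set"])
        counts[code] += int(row["sample_count"])
    return dict(counts)


counts = count_by_code(MANIFEST_EXCERPT)
print(counts)
```

To run this against the downloaded file, pass the contents of `~/data/ag3/manifest.tsv` instead of the excerpt.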

## Sample metadata

Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the sex of the specimen, and our call regarding the species of the specimen.

### Collection metadata

Specimen collection metadata can be downloaded from GCS. E.g., here is the download link for the sample metadata for sample set AG1000G-BF-A:

* https://storage.googleapis.com/vo_agam_release/v3/metadata/general/AG1000G-BF-A/samples.meta.csv

Sample metadata for all sample sets can also be downloaded using gsutil:

In [3]:
!mkdir -pv ~/data/ag3/metadata/
!gsutil -m rsync -r gs://vo_agam_release/v3/metadata/ ~/data/ag3/metadata/

Building synchronization state...
Starting synchronization...


Here are the first few rows of the sample metadata for sample set AG1000G-BF-A:

In [4]:
!head ~/data/ag3/metadata/general/AG1000G-BF-A/samples.meta.csv

sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call
AB0085-Cx,BF2-4,Austin Burt,Burkina Faso,Pala,2012,7,11.150,-4.235,F
AB0086-Cx,BF2-6,Austin Burt,Burkina Faso,Pala,2012,7,11.150,-4.235,F
AB0087-C,BF3-3,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
AB0088-C,BF3-5,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
AB0089-Cx,BF3-8,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
AB0090-C,BF3-10,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
AB0091-C,BF3-12,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
AB0092-C,BF3-13,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
AB0094-Cx,BF3-17,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F


The `sample_id` column gives the sample identifier used throughout all Ag1000G analyses.

The `country`, `location`, `latitude` and `longitude` columns give the location where the specimen was collected.

The `year` and `month` columns give the approximate date when the specimen was collected.

The `sex_call` column gives the sex of the specimen as determined from the sequence data.

### Species calls

We have made a call for each specimen as to which species it belongs to (*Anopheles gambiae*, *Anopheles coluzzii* or *Anopheles arabiensis*). These calls were made from the genotypes derived from the sequence data, and there are cases where the species is not easy to determine. We report species calls using two methods, principal components analysis (PCA) and ancestry informative markers (AIMs).

Species calls can be downloaded from GCS, e.g., for sample set AG1000G-BF-A:

* PCA species calls - https://storage.googleapis.com/vo_agam_release/v3/metadata/species_calls_20200422/AG1000G-BF-A/samples.species_pca.csv
* AIM species calls - https://storage.googleapis.com/vo_agam_release/v3/metadata/species_calls_20200422/AG1000G-BF-A/samples.species_aim.csv

Alternatively, if you ran the `gsutil rsync` command above to download sample metadata, then these files will already be present on your local file system.

Here are the first few rows of the AIM species calls for sample set AG1000G-BF-A:

In [5]:
!head ~/data/ag3/metadata/species_calls_20200422/AG1000G-BF-A/samples.species_aim.csv

sample_id,aim_fraction_colu,aim_fraction_arab,species_gambcolu_arabiensis,species_gambiae_coluzzii
AB0085-Cx,0.024,0.002,gamb_colu,gambiae
AB0086-Cx,0.038,0.002,gamb_colu,gambiae
AB0087-C,0.982,0.002,gamb_colu,coluzzii
AB0088-C,0.990,0.002,gamb_colu,coluzzii
AB0089-Cx,0.975,0.002,gamb_colu,coluzzii
AB0090-C,0.977,0.002,gamb_colu,coluzzii
AB0091-C,0.974,0.002,gamb_colu,coluzzii
AB0092-C,0.978,0.002,gamb_colu,coluzzii
AB0094-Cx,0.986,0.002,gamb_colu,coluzzii


The `species_gambcolu_arabiensis` column provides a call as to whether the specimen is arabiensis or not (gamb_colu).

The `species_gambiae_coluzzii` column applies to samples that are not arabiensis, and differentiates gambiae versus coluzzii.
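
The two columns can be combined into a single label per specimen. The sketch below is one possible way to do this, using rows copied from the excerpt above. The function name `species_label` is hypothetical. The excerpt only shows the value `gamb_colu` in the first column, so the code passes any other non-empty value through unchanged rather than assuming how *An. arabiensis* calls are spelled, and it treats empty fields as missing calls.

```python
import csv
import io

# Rows copied from the AIM species calls excerpt above.
SPECIES_AIM_EXCERPT = """\
sample_id,aim_fraction_colu,aim_fraction_arab,species_gambcolu_arabiensis,species_gambiae_coluzzii
AB0085-Cx,0.024,0.002,gamb_colu,gambiae
AB0086-Cx,0.038,0.002,gamb_colu,gambiae
AB0087-C,0.982,0.002,gamb_colu,coluzzii
"""


def species_label(gambcolu_arabiensis, gambiae_coluzzii):
    """Combine the two species call columns into a single label."""
    if gambcolu_arabiensis == "gamb_colu":
        # Not arabiensis: use the finer call if there is one.
        return gambiae_coluzzii or "gamb_colu"
    # Any other non-empty value is passed through as-is (assumed to mark
    # arabiensis); an empty value is reported as "unknown".
    return gambcolu_arabiensis or "unknown"


rows = list(csv.DictReader(io.StringIO(SPECIES_AIM_EXCERPT)))
labels = [
    species_label(r["species_gambcolu_arabiensis"], r["species_gambiae_coluzzii"])
    for r in rows
]
print(list(zip((r["sample_id"] for r in rows), labels)))
```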

## Raw sequence reads (FASTQ format)

The raw sequence reads used in this data release can be downloaded from ENA. Note that most samples were sequenced in multiple runs, and hence there are usually multiple ENA run accessions per sample. Most samples have 3 sequencing runs, but some have 4 and some have only 1.

To find the ENA run accessions for a given sample, first download the catalog of run accessions:

* https://storage.googleapis.com/vo_agam_release/v3/metadata/ena_runs.csv

Alternatively if you ran the `gsutil rsync` command above to download sample metadata then this file will already be present on your local file system. Inspect the file:

In [6]:
!head ~/data/ag3/metadata/ena_runs.csv

sample_id,ena_run
AR0001-C,ERR347035
AR0001-C,ERR347047
AR0001-C,ERR352136
AR0002-C,ERR328585
AR0002-C,ERR323844
AR0002-C,ERR328597
AR0004-C,ERR343648
AR0004-C,ERR343636
AR0004-C,ERR343468


For example, the sequence reads for sample AR0001-C are available from three ENA accessions: ERR347035, ERR347047 and ERR352136. To download the sequence reads, visit the ENA website and search for these accessions. E.g., links to download sequence reads for run ERR352136 are available from this web page: https://www.ebi.ac.uk/ena/browser/view/ERR352136. To download the FASTQ files for this run via `wget`:

In [None]:
!wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR352/ERR352136/ERR352136_1.fastq.gz
!wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR352/ERR352136/ERR352136_2.fastq.gz

Note that FASTQ files are relatively large, several GB per sample, so they may take a long time to download, and may require a substantial amount of disk space on your local system.
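
When downloading reads for many samples, it can help to group run accessions by sample and generate the FTP URLs programmatically. The sketch below does this for an excerpt of `ena_runs.csv`. The URL layout is inferred solely from the single `wget` example above (9-character accessions, paired files named `_1` and `_2`), so treat it as an assumption; the function names are hypothetical, and the code refuses accessions of any other length rather than guessing their layout. Check the ENA browser page for a run if in doubt.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the ena_runs.csv excerpt above.
ENA_RUNS_EXCERPT = """\
sample_id,ena_run
AR0001-C,ERR347035
AR0001-C,ERR347047
AR0001-C,ERR352136
AR0002-C,ERR328585
"""


def runs_by_sample(csv_text):
    """Map each sample identifier to its list of ENA run accessions."""
    runs = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        runs[row["sample_id"]].append(row["ena_run"])
    return dict(runs)


def fastq_urls(run):
    """Build FTP URLs for a paired-end run, following the example above."""
    if len(run) != 9:
        # Only the layout seen in this guide is handled.
        raise ValueError(f"unsupported accession format: {run!r}")
    base = f"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{run[:6]}/{run}"
    return [f"{base}/{run}_{mate}.fastq.gz" for mate in (1, 2)]


runs = runs_by_sample(ENA_RUNS_EXCERPT)
for run in runs["AR0001-C"]:
    for url in fastq_urls(run):
        print(url)
```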

## Sequence read alignments (BAM format)

Analysis-ready sequence read alignments are available in BAM format for all samples in the release and can be downloaded from ENA. A catalog file mapping sample identifiers to ENA accessions is available at this link:

* https://storage.googleapis.com/vo_agam_release/v3/metadata/ena_alignments.csv

Alternatively if you ran the `gsutil rsync` command above to download sample metadata then this file will already be present on your local file system. Here are the first few rows:

In [8]:
!head ~/data/ag3/metadata/ena_alignments.csv

sample_id,ena_analysis
AR0001-C,ERZ1695275
AR0002-C,ERZ1695276
AR0004-C,ERZ1695277
AR0006-C,ERZ1695278
AR0007-C,ERZ1695279
AR0008-C,ERZ1695280
AR0009-C,ERZ1695281
AR0010-Cx,ERZ1695282
AR0011-C,ERZ1695283


Each row in this file provides a mapping from Ag1000G sample identifiers to ENA analysis accessions. To find links for downloading the data, visit the ENA website and search for the corresponding analysis accession. E.g., the analysis-ready BAM file for sample AR0001-C can be downloaded from this web page: https://www.ebi.ac.uk/ena/browser/view/ERZ1695275. To download the BAM file via `wget`:

In [None]:
!wget ftp://ftp.sra.ebi.ac.uk/vol1/ERZ169/ERZ1695275/AR0001-C.bam

Note that BAM files are relatively large, approximately 10 GB per sample, so they may take a long time to download, and may require a substantial amount of disk space on your local system.
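
To get a feel for the disk space involved, the per-sample figure can be multiplied by the sample counts in the manifest. This is a rough back-of-envelope sketch only: it takes the approximate size of 10 gigabytes per BAM file at face value, and the function name `estimate_bam_storage_gb` is hypothetical.

```python
# Approximate BAM size per sample, as quoted in this guide.
APPROX_BAM_GB_PER_SAMPLE = 10

# Sample counts copied from the manifest for two sample sets.
SAMPLE_COUNTS = {
    "AG1000G-BF-A": 181,
    "AG1000G-BF-B": 102,
}


def estimate_bam_storage_gb(sample_counts, sample_sets):
    """Rough storage estimate (GB) for BAM files of the given sample sets."""
    n_samples = sum(sample_counts[s] for s in sample_sets)
    return n_samples * APPROX_BAM_GB_PER_SAMPLE


total_gb = estimate_bam_storage_gb(SAMPLE_COUNTS, ["AG1000G-BF-A", "AG1000G-BF-B"])
print(f"approximately {total_gb} GB")
```

For these two sample sets (283 samples) the estimate comes to roughly 2.8 TB, so it is usually worth downloading BAM files only for the samples you need.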

## SNP calls (VCF format)

### SNP genotypes

SNP calls in VCF format are available from EVA. There is one VCF file for each individual sample. A catalog file mapping sample identifiers to EVA accessions is available at this link:

* https://storage.googleapis.com/vo_agam_release/v3/metadata/eva_snp_genotypes.csv (@@TODO)

Alternatively if you ran the `gsutil rsync` command above to download sample metadata then this file will already be present on your local file system. Inspect the file:

In [None]:
!head ~/data/ag3/metadata/eva_snp_genotypes.csv

Each row in this file provides a mapping from Ag1000G sample identifiers to EVA analysis accessions. To find links for downloading the data, visit the EVA website and search for the corresponding analysis accession. E.g., the VCF file for sample @@TODO can be downloaded from this web page: @@TODO

Note that each sample has been genotyped at all genome positions (except for those where the reference sequence is 'N') and considering all possible SNP alleles. It is possible to combine VCF files for multiple samples if you need to analyse a multi-sample VCF. E.g., here are commands to download VCFs for three samples then merge them into a single multi-sample VCF:

In [None]:
!@@TODO download and merge VCFs

### Site filters

SNP calling is not always reliable, so we have created site filters that allow low quality SNPs to be excluded. The filters are provided as sites-only VCF files, with the site filter information in the `FILTER` column. These VCF files are hosted on GCS.

Because different species may have different genome accessibility issues, we have created three separate site filters:

* The "gamb_colu" site filter is designed for working only with samples that are not *An. arabiensis*.
* The "arab" site filter is designed for working only with samples that are *An. arabiensis*.
* The "gamb_colu_arab" filter is suitable for when analysing samples of any species together.

Each filter is available as a set of VCF files, one per chromosome arm. E.g., here is the direct download link for the gamb_colu_arab filters on chromosome arm 3R:

* https://storage.googleapis.com/vo_agam_release/v3/site_filters/dt_20200416/vcf/gamb_colu_arab/3R_sitefilters.vcf.gz
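
The choice between the three filters, and the corresponding download link, can be sketched as follows. Both function names are hypothetical. The URL pattern is extrapolated from the single 3R example link above, so links for other filters and chromosome arms are an assumption that should be checked before relying on them.

```python
SITE_FILTERS_VCF_BASE = (
    "https://storage.googleapis.com/vo_agam_release/v3/"
    "site_filters/dt_20200416/vcf"
)


def choose_site_filter(includes_gamb_colu, includes_arabiensis):
    """Pick the site filter name suited to the species being analysed."""
    if includes_gamb_colu and includes_arabiensis:
        return "gamb_colu_arab"
    if includes_arabiensis:
        return "arab"
    if includes_gamb_colu:
        return "gamb_colu"
    raise ValueError("no samples selected")


def site_filters_vcf_url(site_filter, arm):
    """Build a download link following the pattern of the 3R example."""
    return f"{SITE_FILTERS_VCF_BASE}/{site_filter}/{arm}_sitefilters.vcf.gz"


# Analysing all species together on chromosome arm 3R.
name = choose_site_filter(includes_gamb_colu=True, includes_arabiensis=True)
print(site_filters_vcf_url(name, "3R"))
```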

Alternatively, all site filter VCFs can be downloaded using `gsutil`, e.g.:

In [None]:
!mkdir -pv ~/data/ag3/site_filters/dt_20200416/vcf/
!gsutil -m rsync -r gs://vo_agam_release/v3/site_filters/dt_20200416/vcf/ ~/data/ag3/site_filters/dt_20200416/vcf/

@@TODO describe how to use site filters VCFs with the genotypes VCF.

## SNP calls (Zarr format)

SNP data are also available in Zarr format, which can be convenient and efficient to use for certain types of analysis. These data can be analysed directly in the cloud without downloading to the local system; see the [cloud user guide @@TODO link](@@TODO) for more information. The data can also be downloaded to your own system for local analysis if that is more convenient. Below are examples of how to download the Zarr data to your local system.

The data are organised into several Zarr hierarchies. 

### SNP sites and alleles

Data on the genomic positions (sites) and reference and alternate alleles that were genotyped can be downloaded as follows:

In [None]:
!mkdir -pv ~/data/ag3/snp_genotypes/all/sites/
!gsutil -m rsync -r \
    gs://vo_agam_release/v3/snp_genotypes/all/sites/ \
    ~/data/ag3/snp_genotypes/all/sites/

### Site filters

SNP calling is not always reliable, and we have created some site filters to allow excluding low quality SNPs. To download site filters data in Zarr format, excluding some parts of the data that you probably won't need:

In [None]:
!mkdir -pv ~/data/ag3/site_filters/
!gsutil -m rsync -r \
    -x '.*vcf.*|.*crosses_stats.*|.*[MG]Q10.*|.*[MG]Q30.*|.*[MG]Q_mean.*|.*[MG]Q_std.*|.*/lo_.*|.*/hi_.*|.*no_cov.*|.*allele_consistency.*|.*heterozygosity.*' \
    gs://vo_agam_release/v3/site_filters/ \
    ~/data/ag3/site_filters/
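
The `-x` option takes a Python regular expression, and any object whose path (relative to the source URL) matches is skipped. To preview what a pattern excludes before starting a large transfer, it can be tested with Python's `re` module. In the sketch below the first two paths follow from file and array names shown elsewhere in this guide, while the `MQ10` path is a hypothetical example of an excluded array.

```python
import re

# The exclusion pattern from the command above.
EXCLUDE = re.compile(
    r".*vcf.*|.*crosses_stats.*|.*[MG]Q10.*|.*[MG]Q30.*|.*[MG]Q_mean.*"
    r"|.*[MG]Q_std.*|.*/lo_.*|.*/hi_.*|.*no_cov.*|.*allele_consistency.*"
    r"|.*heterozygosity.*"
)


def is_excluded(relative_path):
    """Return True if the path matches the exclusion pattern."""
    # Every alternative starts with ".*", so matching from the start of
    # the path is equivalent to searching anywhere within it.
    return EXCLUDE.match(relative_path) is not None


paths = [
    "dt_20200416/vcf/gamb_colu_arab/3R_sitefilters.vcf.gz",
    "dt_20200416/gamb_colu/3R/variants/filter_pass/.zarray",
    "dt_20200416/gamb_colu/3R/variants/MQ10/.zarray",  # hypothetical path
]
for path in paths:
    print("skip" if is_excluded(path) else "copy", path)
```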

### SNP genotypes

SNP genotypes are available for each sample set separately. E.g., to download SNP genotypes in Zarr format for sample set AG1000G-BF-A, excluding some data you probably won't need:

In [None]:
!mkdir -pv ~/data/ag3/snp_genotypes/all/AG1000G-BF-A/
!gsutil -m rsync -r \
        -x '.*/calldata/(AD|GQ|MQ)/.*' \
        gs://vo_agam_release/v3/snp_genotypes/all/AG1000G-BF-A/ \
        ~/data/ag3/snp_genotypes/all/AG1000G-BF-A/

## Accessing downloaded data from Python

There is a wide variety of tools available for analysing the data from Ag1000G phase 3 once downloaded to your local system. If you would like advice on possible approaches for a given analysis, please feel free to [start a new discussion](https://github.com/malariagen/vector-public-data/discussions/new) on the malariagen/vector-public-data repo on GitHub. Please note that these data can also be analysed directly in the cloud without downloading to the local system; see the [cloud user guide @@TODO link](@@TODO) for more information.

Within the MalariaGEN vector genomics team we primarily use the Python programming language for analysing the SNP data, making use of software packages in the Scientific Python / PyData ecosystem. This section gives some simple examples illustrating how to use these tools to read downloaded data.

These examples use data from two sample sets, AG1000G-BF-A and AG1000G-BF-B. 

First, here are all the commands to download the data needed to run these examples (requires about 10 GB of local storage):

In [None]:
# download sample set manifest
!mkdir -pv ~/data/ag3/
!gsutil cp gs://vo_agam_release/v3/manifest.tsv ~/data/ag3/

# download sample metadata
!mkdir -pv ~/data/ag3/metadata/
!gsutil -m rsync -r gs://vo_agam_release/v3/metadata/ ~/data/ag3/metadata/

# download sites data
!mkdir -pv ~/data/ag3/snp_genotypes/all/sites/
!gsutil -m rsync -r \
    gs://vo_agam_release/v3/snp_genotypes/all/sites/ \
    ~/data/ag3/snp_genotypes/all/sites/

# download site filters data
!mkdir -pv ~/data/ag3/site_filters/
!gsutil -m rsync -r \
    -x '.*vcf.*|.*crosses_stats.*|.*[MG]Q10.*|.*[MG]Q30.*|.*[MG]Q_mean.*|.*[MG]Q_std.*|.*/lo_.*|.*/hi_.*|.*no_cov.*|.*allele_consistency.*|.*heterozygosity.*' \
    gs://vo_agam_release/v3/site_filters/ \
    ~/data/ag3/site_filters/

# download SNP genotype data for sample sets AG1000G-BF-A and AG1000G-BF-B
!mkdir -pv ~/data/ag3/snp_genotypes/all/AG1000G-BF-A/
!gsutil -m rsync -r \
        -x '.*/calldata/(AD|GQ|MQ)/.*' \
        gs://vo_agam_release/v3/snp_genotypes/all/AG1000G-BF-A/ \
        ~/data/ag3/snp_genotypes/all/AG1000G-BF-A/
!mkdir -pv ~/data/ag3/snp_genotypes/all/AG1000G-BF-B/
!gsutil -m rsync -r \
        -x '.*/calldata/(AD|GQ|MQ)/.*' \
        gs://vo_agam_release/v3/snp_genotypes/all/AG1000G-BF-B/ \
        ~/data/ag3/snp_genotypes/all/AG1000G-BF-B/


The following Python packages need to be installed on the local system: numpy, pandas, dask, matplotlib, zarr and scikit-allel.

In [13]:
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import dask.array as da
from dask.diagnostics import ProgressBar as progress
import zarr
import allel

Set up local path where data were downloaded:

In [14]:
ag3_path = Path("~/data/ag3").expanduser()

Examples of reading/opening various files:

In [15]:
# read sample set manifest
df_sample_sets = pd.read_csv(ag3_path / "manifest.tsv", sep="\t")
df_sample_sets

Unnamed: 0,sample_set,sample_count
0,AG1000G-AO,81
1,AG1000G-BF-A,181
2,AG1000G-BF-B,102
3,AG1000G-BF-C,13
4,AG1000G-CD,76
5,AG1000G-CF,73
6,AG1000G-CI,80
7,AG1000G-CM-A,303
8,AG1000G-CM-B,97
9,AG1000G-CM-C,44


In [16]:
# read sample metadata
sample_set = "AG1000G-BF-A"
df_samples = pd.read_csv(ag3_path / f"metadata/general/{sample_set}/samples.meta.csv")
df_samples.head()

Unnamed: 0,sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call
0,AB0085-Cx,BF2-4,Austin Burt,Burkina Faso,Pala,2012,7,11.15,-4.235,F
1,AB0086-Cx,BF2-6,Austin Burt,Burkina Faso,Pala,2012,7,11.15,-4.235,F
2,AB0087-C,BF3-3,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
3,AB0088-C,BF3-5,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
4,AB0089-Cx,BF3-8,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F


In [17]:
# inspect number of samples by collection location and year
df_samples.groupby(["country", "location", "year"]).size()

country       location        year
Burkina Faso  Bana            2012    65
              Pala            2012    59
              Souroukoudinga  2012    57
dtype: int64

In [18]:
# read AIM species calls
df_species_aim = pd.read_csv(ag3_path / f"metadata/species_calls_20200422/{sample_set}/samples.species_aim.csv")
df_species_aim.head()

Unnamed: 0,sample_id,aim_fraction_colu,aim_fraction_arab,species_gambcolu_arabiensis,species_gambiae_coluzzii
0,AB0085-Cx,0.024,0.002,gamb_colu,gambiae
1,AB0086-Cx,0.038,0.002,gamb_colu,gambiae
2,AB0087-C,0.982,0.002,gamb_colu,coluzzii
3,AB0088-C,0.99,0.002,gamb_colu,coluzzii
4,AB0089-Cx,0.975,0.002,gamb_colu,coluzzii


In [19]:
# inspect number of samples by species
df_species_aim.fillna("").groupby(["species_gambcolu_arabiensis", "species_gambiae_coluzzii"]).size()

species_gambcolu_arabiensis  species_gambiae_coluzzii
gamb_colu                    coluzzii                    82
                             gambiae                     98
                             intermediate                 1
dtype: int64

In [20]:
# read PCA species calls
df_species_pca = pd.read_csv(ag3_path / f"metadata/species_calls_20200422/{sample_set}/samples.species_pca.csv")
df_species_pca.head()

Unnamed: 0,sample_id,PC1,PC2,species_gambcolu_arabiensis,species_gambiae_coluzzii
0,AB0085-Cx,-28.289,-22.87,gamb_colu,gambiae
1,AB0086-Cx,-31.577,-22.986,gamb_colu,gambiae
2,AB0087-C,-32.06,42.962,gamb_colu,coluzzii
3,AB0088-C,-33.315,43.974,gamb_colu,coluzzii
4,AB0089-Cx,-31.606,42.225,gamb_colu,coluzzii


In [21]:
# inspect number of samples by species
df_species_pca.fillna("").groupby(["species_gambcolu_arabiensis", "species_gambiae_coluzzii"]).size()

species_gambcolu_arabiensis  species_gambiae_coluzzii
gamb_colu                    coluzzii                    82
                             gambiae                     99
dtype: int64

In [22]:
# read sites
callset_sites = zarr.open(str(ag3_path / "snp_genotypes/all/sites"), mode='r')
callset_sites

<zarr.hierarchy.Group '/' read-only>

In [23]:
# arrays are organised hierarchically
print(callset_sites.tree())

/
 ├── 2L
 │   ├── calldata
 │   └── variants
 │       ├── ALT (48525747, 3) |S1
 │       ├── POS (48525747,) int32
 │       └── REF (48525747,) |S1
 ├── 2R
 │   ├── calldata
 │   └── variants
 │       ├── ALT (60132453, 3) |S1
 │       ├── POS (60132453,) int32
 │       └── REF (60132453,) |S1
 ├── 3L
 │   ├── calldata
 │   └── variants
 │       ├── ALT (40758473, 3) |S1
 │       ├── POS (40758473,) int32
 │       └── REF (40758473,) |S1
 ├── 3R
 │   ├── calldata
 │   └── variants
 │       ├── ALT (52226568, 3) |S1
 │       ├── POS (52226568,) int32
 │       └── REF (52226568,) |S1
 ├── Mt
 │   ├── calldata
 │   └── variants
 │       ├── ALT (15363, 3) |S1
 │       ├── POS (15363,) int32
 │       └── REF (15363,) |S1
 ├── UNKN
 │   ├── calldata
 │   └── variants
 │       ├── ALT (27274988, 3) |S1
 │       ├── POS (27274988,) int32
 │       └── REF (27274988,) |S1
 ├── X
 │   ├── calldata
 │   └── variants
 │       ├── ALT (23385349, 3) |S1
 │       ├── POS (23385349,) int32
 │       └

In [24]:
# e.g., read first 10 positions from chromosome arm 3R
pos = callset_sites["3R/variants/POS"][:10]
pos

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int32)

In [25]:
# e.g., read first 10 reference alleles from chromosome arm 3R
ref = callset_sites["3R/variants/REF"][:10]
ref

array([b'C', b'C', b'T', b'C', b'T', b'A', b'C', b'G', b'T', b'T'],
      dtype='|S1')

In [26]:
# e.g., read first 10 alternate alleles from chromosome arm 3R
alt = callset_sites["3R/variants/ALT"][:10]
alt

array([[b'A', b'T', b'G'],
       [b'A', b'T', b'G'],
       [b'A', b'C', b'G'],
       [b'A', b'T', b'G'],
       [b'A', b'C', b'G'],
       [b'C', b'T', b'G'],
       [b'A', b'T', b'G'],
       [b'A', b'C', b'T'],
       [b'A', b'C', b'G'],
       [b'A', b'C', b'G']], dtype='|S1')

In [27]:
# read gamb_colu site filters
callset_filters_gamb_colu = zarr.open(str(ag3_path / "site_filters/dt_20200416/gamb_colu"), mode='r')
callset_filters_gamb_colu

<zarr.hierarchy.Group '/' read-only>

In [28]:
# arrays are organised hierarchically
print(callset_filters_gamb_colu.tree())

/
 ├── 2L
 │   └── variants
 │       ├── filter_pass (48525747,) bool
 │       ├── training_negative (48525747,) bool
 │       └── training_positive (48525747,) bool
 ├── 2R
 │   └── variants
 │       ├── filter_pass (60132453,) bool
 │       ├── training_negative (60132453,) bool
 │       └── training_positive (60132453,) bool
 ├── 3L
 │   └── variants
 │       ├── filter_pass (40758473,) bool
 │       ├── training_negative (40758473,) bool
 │       └── training_positive (40758473,) bool
 ├── 3R
 │   └── variants
 │       ├── filter_pass (52226568,) bool
 │       ├── training_negative (52226568,) bool
 │       └── training_positive (52226568,) bool
 └── X
     └── variants
         └── filter_pass (23385349,) bool


Each set of site filters provides a "filter_pass" Boolean mask for each chromosome arm, where True indicates that the site passed the filter and is accessible to high quality SNP calling.

In [29]:
# e.g., load the mask for the first 10 SNPs on chromosome arm 3R
filter_pass = callset_filters_gamb_colu["3R/variants/filter_pass"][:10]
filter_pass

array([False, False, False, False, False, False, False, False, False,
       False])
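
Because the mask has one entry per site, it can be used directly to index any array with the same first dimension. Here is a minimal sketch using small made-up arrays in place of the real data.

```python
import numpy as np

# Made-up positions and mask values, standing in for the real
# "variants/POS" and "variants/filter_pass" arrays of one chromosome arm.
pos = np.array([180, 185, 200, 236, 240], dtype="int32")
filter_pass = np.array([True, True, False, True, False])

# Boolean indexing keeps only sites where the mask is True.
pos_pass = pos[filter_pass]

# Number of sites passing the filter.
n_pass = np.count_nonzero(filter_pass)

print(pos_pass, n_pass)
```

The same indexing applies along the first dimension of a genotype array, which is how filtered genotypes can be obtained.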

In [30]:
# open SNP genotypes
callset_genotypes = zarr.open(str(ag3_path / f"snp_genotypes/all/{sample_set}"))
callset_genotypes

<zarr.hierarchy.Group '/'>

In [31]:
print(callset_genotypes.tree())

/
 ├── 2L
 │   └── calldata
 │       └── GT (48525747, 181, 2) int8
 ├── 2R
 │   └── calldata
 │       └── GT (60132453, 181, 2) int8
 ├── 3L
 │   └── calldata
 │       └── GT (40758473, 181, 2) int8
 ├── 3R
 │   └── calldata
 │       └── GT (52226568, 181, 2) int8
 ├── X
 │   └── calldata
 │       └── GT (23385349, 181, 2) int8
 └── samples (181,) |S24


For each sample set, data are grouped by chromosome arm. The "calldata/GT" array provides the actual genotypes. Genotypes are stored as a three-dimensional array, where the first dimension corresponds to genomic positions, the second dimension is samples, and the third dimension is ploidy (2). Values are coded as integers, where -1 represents a missing value, 0 represents the reference allele, and 1, 2, and 3 represent alternate alleles.

In [32]:
# e.g., load genotypes for the first 5 SNPs on chromosome arm 3R and the first 3 samples
gt = callset_genotypes["3R/calldata/GT"][:5, :3, :]
gt

array([[[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]]], dtype=int8)
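
To make the genotype encoding concrete, here is a small sketch with a made-up array of 2 sites and 3 samples, showing how missing calls, alternate allele counts and heterozygous calls can be derived with plain NumPy. The values are invented for illustration and do not come from the release.

```python
import numpy as np

# Made-up genotypes with shape (sites, samples, ploidy) = (2, 3, 2).
gt = np.array(
    [
        [[0, 0], [0, 1], [-1, -1]],
        [[1, 2], [3, 3], [0, 0]],
    ],
    dtype="int8",
)

# A call is missing if any of its alleles is coded as -1.
is_missing = np.any(gt < 0, axis=2)

# Number of alternate (non-reference) alleles in each call.
n_alt = np.sum(gt > 0, axis=2)

# A call is heterozygous if its two alleles differ and it is not missing.
is_het = (gt[..., 0] != gt[..., 1]) & ~is_missing

print(is_missing)
print(n_alt)
print(is_het)
```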

### Concatenating data from multiple sample sets

Often you may wish to work with multiple sample sets simultaneously, in which case the data need to be concatenated.

Concatenating can be done directly. E.g., concatenate sample metadata for two sample sets:

In [33]:
sample_sets = ["AG1000G-BF-A", "AG1000G-BF-B"]
df_samples = (
    pd.concat(
        [pd.read_csv(ag3_path / f"metadata/general/{sample_set}/samples.meta.csv") 
         for sample_set in sample_sets], 
        axis=0)
    .reset_index(drop=True)
)
df_samples.head()

Unnamed: 0,sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call
0,AB0085-Cx,BF2-4,Austin Burt,Burkina Faso,Pala,2012,7,11.15,-4.235,F
1,AB0086-Cx,BF2-6,Austin Burt,Burkina Faso,Pala,2012,7,11.15,-4.235,F
2,AB0087-C,BF3-3,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
3,AB0088-C,BF3-5,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
4,AB0089-Cx,BF3-8,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F


In [34]:
len(df_samples)

283

In [35]:
df_samples.groupby(["country", "location", "year"]).size()

country       location        year
Burkina Faso  Bana            2012    65
                              2014    63
              Pala            2012    59
                              2014    18
              Souroukoudinga  2012    57
                              2014    21
dtype: int64

SNP genotypes for multiple sample sets can be concatenated using [dask](https://docs.dask.org/en/latest/array.html). E.g.:

In [36]:
sample_sets = ["AG1000G-BF-A", "AG1000G-BF-B"]
callsets = [zarr.open(str(ag3_path / f"snp_genotypes/all/{sample_set}"), mode='r') 
            for sample_set in sample_sets]
gt_arrays = [da.from_array(callset["3R/calldata/GT"]) for callset in callsets]
# note that arrays are concatenated across axis 1 - the samples dimension
gt = da.concatenate(gt_arrays, axis=1)

# wrap with scikit-allel class for convenience 
gt = allel.GenotypeDaskArray(gt)
gt

Unnamed: 0,0,1,2,3,4,...,278,279,280,281,282,Unnamed: 12
0,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
1,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
2,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
...,...,...,...,...,...,...,...,...,...,...,...,...
52226565,./.,./.,./.,./.,./.,...,0/0,./.,./.,./.,./.,
52226566,./.,./.,./.,./.,./.,...,./.,./.,./.,./.,./.,
52226567,./.,./.,./.,./.,./.,...,./.,./.,./.,./.,./.,
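
The effect of concatenating along the samples dimension can be seen with small in-memory arrays. This sketch uses NumPy as a stand-in for dask (`da.concatenate` takes the same `axis` argument) and made-up data, with sample identifiers invented for illustration.

```python
import numpy as np

# Two made-up genotype arrays over the same 4 sites,
# with 3 samples and 2 samples respectively.
gt_a = np.zeros((4, 3, 2), dtype="int8")
gt_b = np.ones((4, 2, 2), dtype="int8")

# Invented sample identifiers, in the same order as each array's samples.
samples_a = ["A1", "A2", "A3"]
samples_b = ["B1", "B2"]

# Concatenate along axis 1 so that the sites dimension is unchanged
# and the samples dimension grows.
gt = np.concatenate([gt_a, gt_b], axis=1)
samples = samples_a + samples_b

print(gt.shape, samples)
```

Column `i` of the result corresponds to `samples[i]`, so sample metadata must be concatenated in the same sample set order as the genotypes.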


To load some of these data into memory, call `.compute()`, e.g.:

In [37]:
gt[:5, :3, :].compute()

Unnamed: 0,0,1,2
0,0/0,0/0,0/0
1,0/0,0/0,0/0
2,0/0,0/0,0/0
3,0/0,0/0,0/0
4,0/0,0/0,0/0


Here's a computation to count the number of segregating sites on chromosome arm 3R that also pass filters:

In [38]:
# locate pass sites
loc_pass = callset_filters_gamb_colu["3R/variants/filter_pass"][:]

# perform an allele count over genotypes
with progress():
    ac = gt.count_alleles(max_allele=3).compute()
    
# locate segregating sites
loc_seg = ac.is_segregating()

# count segregating and pass sites
np.count_nonzero(loc_pass & loc_seg)

[########################################] | 100% Completed | 18.6s


12990674
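
For readers unfamiliar with scikit-allel, the logic of this computation can be sketched in plain NumPy on a tiny made-up array. This is an illustration of the idea, not scikit-allel's implementation: the two helper functions are written here for the example, and a site is treated as segregating when more than one allele is observed among the non-missing calls.

```python
import numpy as np


def count_alleles(gt, max_allele):
    """Count occurrences of each allele 0..max_allele at every site."""
    ac = np.zeros((gt.shape[0], max_allele + 1), dtype="int32")
    for allele in range(max_allele + 1):
        # Missing calls (-1) match no allele and so are not counted.
        ac[:, allele] = np.sum(gt == allele, axis=(1, 2))
    return ac


def is_segregating(ac):
    """True where more than one distinct allele was observed."""
    return np.count_nonzero(ac, axis=1) > 1


# Made-up genotypes for 4 sites and 2 samples.
gt = np.array(
    [
        [[0, 0], [0, 0]],    # only the reference allele
        [[0, 1], [0, 0]],    # reference and one alternate allele
        [[2, 2], [-1, -1]],  # a single alternate allele, one missing call
        [[0, 3], [1, 1]],    # three different alleles
    ],
    dtype="int8",
)
# Made-up site filter mask.
loc_pass = np.array([True, False, True, True])

ac = count_alleles(gt, max_allele=3)
loc_seg = is_segregating(ac)
n_seg_pass = np.count_nonzero(loc_pass & loc_seg)
print(ac)
print(loc_seg, n_seg_pass)
```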

### Convenience functions for accessing data

When performing more complex analyses using data from multiple sample sets, it may be useful to have some convenience functions for loading data. The [ag3_local.py](ag3_local.py) module provides some convenience functions for loading locally downloaded data. If you have this module in your Python path or local directory, you can do, e.g.:

In [39]:
import ag3_local
ag3 = ag3_local.Data("~/data/ag3")

In [40]:
sample_sets = ["AG1000G-BF-A", "AG1000G-BF-B"]
seq_id = "3R"
mask = "gamb_colu"

In [41]:
df_samples = ag3.load_sample_metadata(sample_set=sample_sets)
df_samples

Unnamed: 0,sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call,sample_set,aim_fraction_colu,aim_fraction_arab,species_gambcolu_arabiensis,species_gambiae_coluzzii
0,AB0085-Cx,BF2-4,Austin Burt,Burkina Faso,Pala,2012,7,11.150,-4.235,F,AG1000G-BF-A,0.024,0.002,gamb_colu,gambiae
1,AB0086-Cx,BF2-6,Austin Burt,Burkina Faso,Pala,2012,7,11.150,-4.235,F,AG1000G-BF-A,0.038,0.002,gamb_colu,gambiae
2,AB0087-C,BF3-3,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F,AG1000G-BF-A,0.982,0.002,gamb_colu,coluzzii
3,AB0088-C,BF3-5,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F,AG1000G-BF-A,0.990,0.002,gamb_colu,coluzzii
4,AB0089-Cx,BF3-8,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F,AG1000G-BF-A,0.975,0.002,gamb_colu,coluzzii
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
278,AB0533-C,BF13-18,Austin Burt,Burkina Faso,Souroukoudinga,2014,7,11.235,-4.535,F,AG1000G-BF-B,0.021,0.002,gamb_colu,gambiae
279,AB0536-C,BF13-31,Austin Burt,Burkina Faso,Souroukoudinga,2014,7,11.235,-4.535,F,AG1000G-BF-B,0.025,0.002,gamb_colu,gambiae
280,AB0537-C,BF13-32,Austin Burt,Burkina Faso,Souroukoudinga,2014,7,11.235,-4.535,F,AG1000G-BF-B,0.029,0.002,gamb_colu,gambiae
281,AB0538-C,BF13-33,Austin Burt,Burkina Faso,Souroukoudinga,2014,7,11.235,-4.535,F,AG1000G-BF-B,0.018,0.002,gamb_colu,gambiae


In [42]:
filter_pass = ag3.load_mask(seq_id=seq_id, mask=mask)
filter_pass

array([False, False, False, ..., False, False, False])

In [43]:
pos = ag3.load_variants_array(seq_id=seq_id, field="POS", mask=mask)
pos

array([     180,      185,      236, ..., 53196502, 53196504, 53196522],
      dtype=int32)

In [44]:
gt = ag3.load_calldata_array(seq_id=seq_id, field="GT", sample_set=sample_sets, mask=mask)
gt

Unnamed: 0,0,1,2,3,4,...,278,279,280,281,282,Unnamed: 12
0,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
1,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
2,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
...,...,...,...,...,...,...,...,...,...,...,...,...
37199399,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
37199400,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
37199401,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
