# Ag3.1 data access

This notebook provides information about how to download data from the [*Ag3.1* release](intro). This includes sample metadata, raw sequence reads, sequence read alignments, and single nucleotide polymorphism (SNP) calls.

Code examples that are intended to be run via a Linux command line are prefixed with an exclamation mark (!). If you are running these commands directly from a terminal, remove the exclamation mark.

Examples in this notebook assume you are downloading data to a local folder within your home directory at the path `~/vo_agam_release/`. Change this if you want to download to a different folder on the local file system.

## Data hosting

`Ag3.1` data are hosted by several different services.

Raw sequence reads in FASTQ format and sequence read alignments in BAM format are hosted by the European Nucleotide Archive (ENA). This guide provides examples of downloading data from ENA via FTP using the `wget` command line tool. Several other download options are available; see the [ENA documentation on how to download data files](https://ena-docs.readthedocs.io/en/latest/retrieval/file-download.html) for more information.

SNP calls in VCF and Zarr formats are hosted on S3-compatible object storage at the Sanger Institute. This guide provides examples of downloading these data using `wget`.

Sample metadata in CSV format are hosted on Google Cloud Storage (GCS) in the `vo_agam_release` bucket, which is a multi-region bucket located in the United States. All data hosted on GCS are publicly accessible and do not require any authentication to access. 

This guide provides examples of:
- **Cloud access** - accessing data using the `malariagen_data` Python package
- **Downloads** - downloading data from GCS to a local computer using the `wget` and `gsutil` command line tools. For more information about `gsutil`, see the [gsutil tool documentation](https://cloud.google.com/storage/docs/gsutil).

## Data - cloud access

To make accessing these data more convenient, we've created the [malariagen_data](https://github.com/malariagen/malariagen-data-python) Python package. This is experimental so please let us know if you find any bugs or have any suggestions. See the [Ag3 API docs](api) for documentation of all functions available from this package. 

Install the `malariagen_data` Python package:

In [1]:
!pip install -q malariagen_data

Import the `malariagen_data` package:

In [2]:
import malariagen_data

`Ag3.1` data access from Google Cloud is set up with the following code:

In [3]:
ag3 = malariagen_data.Ag3()
ag3

MalariaGEN Ag3 API client

Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact data@malariagen.net. See also the Ag3 API docs.

| Field | Value |
|---|---|
| Storage URL | gs://vo_agam_release/ |
| Data releases available | 3.0 |
| Results cache | |
| Cohorts analysis | 20230223 |
| Species analysis | aim_20220528 |
| Site filters analysis | dt_20200416 |
| Software version | malariagen_data 7.3.1 |
| Client location | England, GB |


### Sample sets

In [4]:
release = "3.1"

In [5]:
df_sample_sets = ag3.sample_sets(release=release)
df_sample_sets

|   | sample_set | sample_count | release |
|---|---|---|---|
| 0 | 1177-VO-ML-LEHMANN-VMF00004 | 647 | 3.1 |


For more examples, see the [*Ag3.0* cloud data access](download) guide, replacing the release version with `3.1`.

## Data - downloads

### Sample sets

Data in this release are organised into 1 sample set. This sample set corresponds to a set of mosquito specimens contributed by a collaborating study. For convenience there is a tab-delimited manifest file listing all sample sets in the release. Here is a direct download link for the sample set manifest:

* [https://storage.googleapis.com/vo_agam_release/v3.1/manifest.tsv](https://storage.googleapis.com/vo_agam_release/v3.1/manifest.tsv)

The sample set manifest can also be downloaded via `gsutil` to a directory on the local file system, e.g.:

In [6]:
!mkdir -pv ~/vo_agam_release/v3.1/
!gsutil cp gs://vo_agam_release/v3.1/manifest.tsv ~/vo_agam_release/v3.1/

Copying gs://vo_agam_release/v3.1/manifest.tsv...
/ [1 files][   56.0 B/   56.0 B]                                                
Operation completed over 1 objects/56.0 B.                                       


Here are the file contents:

In [7]:
!cat ~/vo_agam_release/v3.1/manifest.tsv

sample_set	sample_count
1177-VO-ML-LEHMANN-VMF00004	647


For more information about these sample sets, see the section on sample sets in the [introduction to Ag1000G phase 3](intro).
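The manifest is small enough to read with the Python standard library. The sketch below parses the manifest contents shown above into a dictionary of sample counts. The function name `parse_manifest` is illustrative, and the text is embedded here so the example runs without a download; in practice you would read it from `~/vo_agam_release/v3.1/manifest.tsv`.

```python
import csv
import io


def parse_manifest(text):
    """Map each sample set name to its sample count from manifest TSV text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["sample_set"]: int(row["sample_count"]) for row in reader}


# Manifest contents as shown above, embedded so the example is self-contained.
manifest_text = "sample_set\tsample_count\n1177-VO-ML-LEHMANN-VMF00004\t647\n"

sample_counts = parse_manifest(manifest_text)
print(sample_counts)
```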

## Sample metadata

Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the sex of the specimen, and our call regarding the species of the specimen.

### Specimen collection metadata

Specimen collection metadata can be downloaded from GCS. Here is the download link for the sample metadata for sample set `1177-VO-ML-LEHMANN-VMF00004`:

* [https://storage.googleapis.com/vo_agam_release/v3.1/metadata/general/1177-VO-ML-LEHMANN-VMF00004/samples.meta.csv](https://storage.googleapis.com/vo_agam_release/v3.1/metadata/general/1177-VO-ML-LEHMANN-VMF00004/samples.meta.csv)

Sample metadata for all sample sets can also be downloaded using `gsutil`:

In [8]:
!mkdir -pv ~/vo_agam_release/v3.1/metadata/
!gsutil -m rsync -r gs://vo_agam_release/v3.1/metadata/ ~/vo_agam_release/v3.1/metadata/

Here are the first few rows of the sample metadata for sample set `1177-VO-ML-LEHMANN-VMF00004`:

In [9]:
!head ~/vo_agam_release/v3.1/metadata/general/1177-VO-ML-LEHMANN-VMF00004/samples.meta.csv

sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call
VBS00256-4651STDY7017184,GP97,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F
VBS00257-4651STDY7017185,GP98,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F
VBS00259-4651STDY7017186,GP100,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F
VBS00262-4651STDY7017187,GP103,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F
VBS00277-4651STDY7017189,GP118,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F
VBS00288-4651STDY7017191,GP129,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F
VBS00289-4651STDY7017192,GP130,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F
VBS00293-4651STDY7017193,GP134,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F
VBS00309-4651STDY7017194,GP150,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F


The `sample_id` column gives the sample identifier used throughout all Ag1000G analyses.

The `country`, `location`, `latitude` and `longitude` columns give the location where the specimen was collected.

The `year` and `month` columns give the approximate date when the specimen was collected.

The `sex_call` column gives the sex of the specimen as determined from the sequence data.
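As a minimal sketch of working with these columns, the example below counts samples per collection site and date using only the standard library. The helper name `count_by_collection` is illustrative, and the three embedded rows are copied from the output above so the example runs without the downloaded file.

```python
import csv
import io
from collections import Counter


def count_by_collection(text):
    """Count samples per (country, location, year, month) in metadata CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return Counter(
        (row["country"], row["location"], int(row["year"]), int(row["month"]))
        for row in reader
    )


# A few rows from samples.meta.csv, embedded so the example is self-contained.
metadata_text = """sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call
VBS00256-4651STDY7017184,GP97,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F
VBS00257-4651STDY7017185,GP98,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F
VBS00259-4651STDY7017186,GP100,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F
"""

collection_counts = count_by_collection(metadata_text)
for key, n in collection_counts.items():
    print(key, n)
```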

### Species calls

We have made a preliminary call for each specimen as to which species it belongs to (*Anopheles gambiae*, *Anopheles coluzzii*, *Anopheles arabiensis*) based on the genotypes of the samples. These calls were made from the sequence data, and there are cases where the species is not easy to determine. We report species calls using ancestry informative markers (AIMs). 

AIM species calls can be downloaded from GCS. For sample set `1177-VO-ML-LEHMANN-VMF00004`:

* [https://storage.googleapis.com/vo_agam_release/v3.1/metadata/species_calls_aim_20220528/1177-VO-ML-LEHMANN-VMF00004/samples.species_aim.csv](https://storage.googleapis.com/vo_agam_release/v3.1/metadata/species_calls_aim_20220528/1177-VO-ML-LEHMANN-VMF00004/samples.species_aim.csv)

Alternatively if you ran the `gsutil rsync` command above to download sample metadata then this file will already be present on your local file system.

Here are the first few rows of the AIM species calls for sample set `1177-VO-ML-LEHMANN-VMF00004`:

In [10]:
!head ~/vo_agam_release/v3.1/metadata/species_calls_aim_20220528/1177-VO-ML-LEHMANN-VMF00004/samples.species_aim.csv

sample_id,aim_species_fraction_arab,aim_species_fraction_colu,aim_species_fraction_colu_no2L,aim_species_gambcolu_arabiensis,aim_species_gambiae_coluzzii,aim_species
VBS00256-4651STDY7017184,0.002,0.858,0.973,gambcolu,coluzzii,coluzzii
VBS00257-4651STDY7017185,0.002,0.977,0.982,gambcolu,coluzzii,coluzzii
VBS00259-4651STDY7017186,0.001,0.917,0.974,gambcolu,coluzzii,coluzzii
VBS00262-4651STDY7017187,0.002,0.860,0.974,gambcolu,coluzzii,coluzzii
VBS00277-4651STDY7017189,0.002,0.924,0.984,gambcolu,coluzzii,coluzzii
VBS00288-4651STDY7017191,0.001,0.858,0.976,gambcolu,coluzzii,coluzzii
VBS00289-4651STDY7017192,0.002,0.923,0.983,gambcolu,coluzzii,coluzzii
VBS00293-4651STDY7017193,0.001,0.867,0.984,gambcolu,coluzzii,coluzzii
VBS00309-4651STDY7017194,0.001,0.908,0.974,gambcolu,coluzzii,coluzzii


The `aim_species_gambcolu_arabiensis` column provides a call as to whether the specimen is arabiensis or not (`gambcolu`).

The `aim_species_gambiae_coluzzii` column applies to samples that are not arabiensis, and differentiates gambiae versus coluzzii.
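Species calls and collection metadata share the `sample_id` column, so the two files can be joined on it. The sketch below attaches a species call to each metadata row. It assumes the `aim_species` column in the output above holds the overall call; the function names are illustrative and the embedded rows are copied from the outputs shown in this guide.

```python
import csv
import io


def species_by_sample(species_text):
    """Map sample_id to the value of the aim_species column."""
    reader = csv.DictReader(io.StringIO(species_text))
    return {row["sample_id"]: row["aim_species"] for row in reader}


def attach_species(metadata_text, species_text):
    """Return metadata rows as dicts with an added aim_species key."""
    species = species_by_sample(species_text)
    reader = csv.DictReader(io.StringIO(metadata_text))
    # Samples missing from the species file get None rather than raising.
    return [dict(row, aim_species=species.get(row["sample_id"])) for row in reader]


metadata_text = """sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call
VBS00256-4651STDY7017184,GP97,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F
VBS00257-4651STDY7017185,GP98,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F
"""

species_text = """sample_id,aim_species_fraction_arab,aim_species_fraction_colu,aim_species_fraction_colu_no2L,aim_species_gambcolu_arabiensis,aim_species_gambiae_coluzzii,aim_species
VBS00256-4651STDY7017184,0.002,0.858,0.973,gambcolu,coluzzii,coluzzii
VBS00257-4651STDY7017185,0.002,0.977,0.982,gambcolu,coluzzii,coluzzii
"""

annotated = attach_species(metadata_text, species_text)
for row in annotated:
    print(row["sample_id"], row["location"], row["aim_species"])
```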

## Sequence read alignments (BAM format)

Analysis-ready sequence read alignments are available in BAM format for all samples in the release and can be downloaded from S3-compatible storage hosted at the Wellcome Sanger Institute (WSI). A catalog file mapping sample identifiers to download URLs is included as part of the metadata. To obtain the catalog files, download the sample metadata as shown above.

For example, here are the first few rows of the catalog file for a chosen sample set, showing the `sample_id` and `alignments_bam` columns:

In [11]:
!head ~/vo_agam_release/v3.1/metadata/general/1177-VO-ML-LEHMANN-VMF00004/wgs_snp_data.csv | cut -d, -f1,2

sample_id,alignments_bam
VBS00256-4651STDY7017184,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00256-4651STDY7017184.fixmate.bam
VBS00257-4651STDY7017185,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00257-4651STDY7017185.fixmate.bam
VBS00259-4651STDY7017186,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00259-4651STDY7017186.fixmate.bam
VBS00262-4651STDY7017187,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00262-4651STDY7017187.fixmate.bam
VBS00277-4651STDY7017189,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00277-4651STDY7017189.fixmate.bam
VBS00288-4651STDY7017191,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00288-4651STDY7017191.fixmate.bam
VBS00289-4651STDY7017192,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00289-4651STDY7017192.fixmate.bam
VBS00293-4651STDY7017193,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00293-4651STDY7017193.fixmate.bam
VBS00309-4651STDY7017194,https://1177-vo-ml-le

Each row provides information about a sample, and the value of the `alignments_bam` field gives the download URL for the BAM file. To download a file locally, you can use `wget`, e.g.:


In [None]:
!wget --no-clobber https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00256-4651STDY7017184.fixmate.bam   

Note that BAM files are relatively large, approximately 10 GB per sample, so they may take a long time to download, and may require a substantial amount of disk space on your local system.
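If you need URLs for several samples, it can be easier to look them up from the catalog than to copy them by hand. The sketch below builds a lookup from sample identifier to URL for a chosen catalog column. The function name `catalog_urls` is illustrative, and the two embedded rows are copied from the catalog output above (trimmed to two columns) so the example runs without the downloaded file.

```python
import csv
import io


def catalog_urls(catalog_text, column):
    """Map sample_id to the URL stored in the given catalog column."""
    reader = csv.DictReader(io.StringIO(catalog_text))
    return {row["sample_id"]: row[column] for row in reader}


# Two rows of wgs_snp_data.csv, restricted to the columns used here.
catalog_text = """sample_id,alignments_bam
VBS00256-4651STDY7017184,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00256-4651STDY7017184.fixmate.bam
VBS00257-4651STDY7017185,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00257-4651STDY7017185.fixmate.bam
"""

bam_urls = catalog_urls(catalog_text, "alignments_bam")
print(bam_urls["VBS00257-4651STDY7017185"])
```

The same function works for other catalog columns, such as `snp_genotypes_vcf`, when given the full catalog file.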

## SNP calls (VCF format)

### SNP genotypes

SNP genotypes for individual mosquitoes in VCF format are available for download from WSI S3-compatible object storage. A VCF file is available for each individual sample. To download a VCF file for a given sample, you will need the sample identifier and the sample set to which the sample belongs. Then inspect the data catalog in the metadata.

The download link for each VCF file is given by the `snp_genotypes_vcf` field in the catalog file.

For example, here are the first few rows of the catalog file for the sample set `1177-VO-ML-LEHMANN-VMF00004`, this time showing the `sample_id` and `snp_genotypes_vcf` columns:

In [12]:
!head ~/vo_agam_release/v3.1/metadata/general/1177-VO-ML-LEHMANN-VMF00004/wgs_snp_data.csv | cut -f1,4 -d,

sample_id,snp_genotypes_vcf
VBS00256-4651STDY7017184,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00256-4651STDY7017184.vcf.gz
VBS00257-4651STDY7017185,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00257-4651STDY7017185.vcf.gz
VBS00259-4651STDY7017186,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00259-4651STDY7017186.vcf.gz
VBS00262-4651STDY7017187,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00262-4651STDY7017187.vcf.gz
VBS00277-4651STDY7017189,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00277-4651STDY7017189.vcf.gz
VBS00288-4651STDY7017191,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00288-4651STDY7017191.vcf.gz
VBS00289-4651STDY7017192,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00289-4651STDY7017192.vcf.gz
VBS00293-4651STDY7017193,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00293-4651STDY7017193.vcf.gz
VBS00309-4651STDY7017194,https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00

Each row provides information about a sample, and the value of the `snp_genotypes_vcf` field gives the download URL for the VCF file for this sample. To download a file locally, use `wget`, e.g.:

In [None]:
!wget --no-clobber https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00256-4651STDY7017184.vcf.gz
!wget --no-clobber https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00256-4651STDY7017184.vcf.gz.tbi

!wget --no-clobber https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00257-4651STDY7017185.vcf.gz
!wget --no-clobber https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/VBS00257-4651STDY7017185.vcf.gz.tbi

Note that each of these VCF files is around 3 GB, so downloading may take some time, and sufficient local storage will be needed.
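Each VCF is downloaded together with its tabix index. The sketch below derives the pair of downloads for a given VCF URL, assuming (as in the `wget` commands above) that the index lives at the same URL with `.tbi` appended. The function name `vcf_download_plan` is illustrative.

```python
import posixpath
from urllib.parse import urlparse


def vcf_download_plan(vcf_url):
    """Return (url, local_filename) pairs for a VCF and its tabix index."""
    # Assumption: the index URL is the VCF URL with ".tbi" appended.
    index_url = vcf_url + ".tbi"
    return [
        (url, posixpath.basename(urlparse(url).path))
        for url in (vcf_url, index_url)
    ]


vcf_url = (
    "https://1177-vo-ml-lehmann-vmf00004.cog.sanger.ac.uk/"
    "VBS00256-4651STDY7017184.vcf.gz"
)
plan = vcf_download_plan(vcf_url)
for url, filename in plan:
    print(filename, "<-", url)
```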

Each of these VCF files is an "all sites" VCF file, meaning that genotypes have been called at all genomic positions where the reference nucleotide is not "N", regardless of whether variation is observed in the given sample. This means that VCFs from multiple samples can be merged easily to create a multi-sample VCF, which may be required for certain analyses. For example, the code below merges VCFs for two samples for chromosome arm 3R up to 1 Mbp:

In [13]:
!bcftools merge --output-type z --regions 3R:1-1000000 --output merged.vcf.gz VBS00256-4651STDY7017184.vcf.gz VBS00257-4651STDY7017185.vcf.gz 

If you are just interested in analysing variants within a given set of samples, you might like to filter the merged VCF to remove non-variant sites and alleles, e.g., using [bcftools view](http://samtools.github.io/bcftools/bcftools.html#view):

In [14]:
!bcftools view --output-type z --output-file merged_variant.vcf.gz --min-ac 1:nonmajor --trim-alt-alleles merged.vcf.gz
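When merging more than a handful of samples, it may be convenient to assemble the `bcftools merge` command programmatically. The sketch below builds the argument list using only the options that appear in the merge command above. The function name `bcftools_merge_command` is illustrative, and the command is printed rather than run, so `bcftools` does not need to be installed to try it.

```python
import shlex


def bcftools_merge_command(vcf_paths, region, output_path):
    """Build a bcftools merge argument list for the given VCFs and region."""
    if len(vcf_paths) < 2:
        raise ValueError("merging needs at least two input VCFs")
    return [
        "bcftools", "merge",
        "--output-type", "z",   # compressed VCF output
        "--regions", region,    # requires indexed inputs (.tbi files)
        "--output", output_path,
        *vcf_paths,
    ]


command = bcftools_merge_command(
    ["VBS00256-4651STDY7017184.vcf.gz", "VBS00257-4651STDY7017185.vcf.gz"],
    region="3R:1-1000000",
    output_path="merged.vcf.gz",
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the VCF and index files are present in the working directory.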

### Site filters

SNP calling is not always reliable, and we have created some site filters to allow excluding low quality SNPs. We have created some sites-only VCF files with site filter information in the `FILTER` column. These VCF files are hosted on GCS. 

Because different species may have different genome accessibility issues, we have created three separate site filters:

* The `gamb_colu` site filter is designed for working only with samples that are not *An. arabiensis*.
* The `arab` filter is designed for working only with samples that are *An. arabiensis*.
* The `gamb_colu_arab` filter is suitable for when analysing samples of any species together.

Each filter is available as a set of VCF files, one per chromosome arm. E.g., here is the direct download link for the `gamb_colu_arab` filters on chromosome arm 3R:

* [https://storage.googleapis.com/vo_agam_release/v3/site_filters/dt_20200416/vcf/gamb_colu_arab/3R_sitefilters.vcf.gz](https://storage.googleapis.com/vo_agam_release/v3/site_filters/dt_20200416/vcf/gamb_colu_arab/3R_sitefilters.vcf.gz)

Alternatively, all site filters VCFs can be downloaded using `gsutil`, e.g.:

In [None]:
!mkdir -pv ~/vo_agam_release/v3/site_filters/dt_20200416/vcf/
!gsutil -m rsync -r \
    gs://vo_agam_release/v3/site_filters/dt_20200416/vcf/ \
    ~/vo_agam_release/v3/site_filters/dt_20200416/vcf/

**Note that the data on genomic positions (sites) are the same as for Ag1000G phase 3.**
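For downloading individual site filter files, the direct link shown earlier in this section can be turned into a template. The sketch below is a hedged example: it assumes that every filter and chromosome arm follows the same URL pattern as the `gamb_colu_arab` 3R link, and that the chromosome arms are 2R, 2L, 3R, 3L and X. The names `SITE_FILTER_URL` and `site_filter_vcf_url` are illustrative.

```python
# Assumption: all filters and arms follow the pattern of the 3R example link.
SITE_FILTER_URL = (
    "https://storage.googleapis.com/vo_agam_release/v3/site_filters/"
    "dt_20200416/vcf/{mask}/{contig}_sitefilters.vcf.gz"
)
MASKS = ("gamb_colu", "arab", "gamb_colu_arab")
CONTIGS = ("2R", "2L", "3R", "3L", "X")  # assumed set of chromosome arms


def site_filter_vcf_url(mask, contig):
    """Return the assumed download URL for one site filter VCF."""
    if mask not in MASKS:
        raise ValueError(f"unknown site filter: {mask!r}")
    if contig not in CONTIGS:
        raise ValueError(f"unknown chromosome arm: {contig!r}")
    return SITE_FILTER_URL.format(mask=mask, contig=contig)


print(site_filter_vcf_url("gamb_colu_arab", "3R"))
```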

## SNP calls (Zarr format)

SNP data are also available in Zarr format, which can be convenient and efficient to use for certain types of analysis. These data can be analysed directly in the cloud without downloading to the local system; see the [Ag3 cloud data access guide](cloud) for more information. The data can also be downloaded to your own system for local analysis if that is more convenient. Below are examples of how to download the Zarr data to your local system.

The data are organised into several Zarr hierarchies. 

### SNP sites and alleles

Data on the genomic positions (sites) and reference and alternate alleles that were genotyped can be downloaded as follows:

In [None]:
!mkdir -pv ~/vo_agam_release/v3/snp_genotypes/all/sites/
!gsutil -m rsync -r \
    gs://vo_agam_release/v3/snp_genotypes/all/sites/ \
    ~/vo_agam_release/v3/snp_genotypes/all/sites/

### Site filters

SNP calling is not always reliable, and we have created some site filters to allow excluding low quality SNPs. To download site filters data in Zarr format, excluding some parts of the data that you probably won't need:

In [None]:
!mkdir -pv ~/vo_agam_release/v3/site_filters/
!gsutil -m rsync -r \
    -x '.*vcf.*|.*crosses_stats.*|.*[MG]Q10.*|.*[MG]Q30.*|.*[MG]Q_mean.*|.*[MG]Q_std.*|.*/lo_.*|.*/hi_.*|.*no_cov.*|.*allele_consistency.*|.*heterozygosity.*' \
    gs://vo_agam_release/v3/site_filters/ \
    ~/vo_agam_release/v3/site_filters/

### SNP genotypes

SNP genotypes are available for each sample set separately. E.g., to download SNP genotypes in Zarr format for sample set `1177-VO-ML-LEHMANN-VMF00004`, excluding some data you probably won't need:

In [None]:
!mkdir -pv ~/vo_agam_release/v3.1/snp_genotypes/all/1177-VO-ML-LEHMANN-VMF00004/
!gsutil -m rsync -r \
        -x '.*/calldata/(AD|GQ|MQ)/.*' \
        gs://vo_agam_release/v3.1/snp_genotypes/all/1177-VO-ML-LEHMANN-VMF00004/ \
        ~/vo_agam_release/v3.1/snp_genotypes/all/1177-VO-ML-LEHMANN-VMF00004/
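The `-x` option takes a Python regular expression. As a rough check of which files a pattern would skip, the sketch below applies the pattern from the command above to a few relative paths. It assumes `gsutil` tests the pattern with `re.match` against each path relative to the source directory, and the example paths are hypothetical rather than listed from the bucket.

```python
import re

# Exclusion pattern copied from the gsutil command above.
EXCLUDE = re.compile(r".*/calldata/(AD|GQ|MQ)/.*")


def is_excluded(relative_path):
    """Return True if the pattern matches from the start of the path."""
    return EXCLUDE.match(relative_path) is not None


# Hypothetical object paths, relative to the sample set directory.
paths = [
    "3R/calldata/GT/0.0.0",
    "3R/calldata/AD/0.0.0",
    "3R/calldata/GQ/0.0",
    "3R/calldata/MQ/0.0",
]

kept = [p for p in paths if not is_excluded(p)]
print(kept)
```

Under these assumptions only the genotype calls (`GT`) would be copied, which is what the exclusion is intended to achieve.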

## Copy number variation (CNV) data

Data on copy number variation within the `Ag3.1` cohort are available as three separate data types:

* **HMM** -- Genome-wide inferences of copy number state within each individual mosquito in 300 bp non-overlapping windows.
* **Coverage calls** -- Genome-wide copy number variant calls, derived from the HMM outputs by analysing contiguous regions of elevated copy number state and then clustering variants across individuals based on breakpoint proximity.
* **Discordant read calls** -- Copy number variant calls at selected insecticide resistance loci, based on analysis of read alignments at CNV breakpoints.

For more information on the methods used to generate these data, see the [variant-calling methods](methods) page.

For each of these data types, data can be downloaded from Google Cloud Storage, and are available in either VCF or Zarr format.

### CNV HMM

The HMM inferences of copy number state are available in VCF, Zarr and text formats, and are organised by sample set. 

For example, the VCF file for sample set `1177-VO-ML-LEHMANN-VMF00004` can be downloaded from the following URL:

* [https://storage.googleapis.com/vo_agam_release/v3.1/cnv/1177-VO-ML-LEHMANN-VMF00004/hmm/vcf/1177-VO-ML-LEHMANN-VMF00004_cnv_hmm.vcf.gz](https://storage.googleapis.com/vo_agam_release/v3.1/cnv/1177-VO-ML-LEHMANN-VMF00004/hmm/vcf/1177-VO-ML-LEHMANN-VMF00004_cnv_hmm.vcf.gz)

VCF files for all sample sets can be downloaded via `gsutil` as follows:

In [None]:
# create a local directory to hold downloaded CNV data
!mkdir -pv ~/vo_agam_release/v3.1/cnv/

In [None]:
# download the HMM data in VCF format for all sample sets
!gsutil -m rsync -r \
    -x '.*/coverage_calls/.*|.*/discordant_read_calls/.*|.*/hmm/zarr/.*|.*/hmm/per_sample/.*' \
    gs://vo_agam_release/v3.1/cnv/ ~/vo_agam_release/v3.1/cnv/

Zarr files for all sample sets can be downloaded as follows:

In [None]:
# download HMM data in Zarr format for all sample sets
!gsutil -m rsync -r \
    -x '.*/coverage_calls/.*|.*/discordant_read_calls/.*|.*/hmm/vcf/.*|.*/hmm/per_sample/.*' \
    gs://vo_agam_release/v3.1/cnv/ ~/vo_agam_release/v3.1/cnv/

### CNV coverage calls

Coverage-based CNV calls are available in VCF and Zarr formats, and are organised by sample set. Additionally the coverage calls were performed separately for *An. arabiensis* and (*An. gambiae* + *An. coluzzii*), and so are subdivided into "arab" and "gamb_colu" datasets, depending on which species are present in a given sample set.

Note that some samples were excluded from coverage calling because of high coverage variance.

For example, the VCF file for sample set `1177-VO-ML-LEHMANN-VMF00004` and the `gamb_colu` callset can be downloaded from:

* [https://storage.googleapis.com/vo_agam_release/v3.1/cnv/1177-VO-ML-LEHMANN-VMF00004/coverage_calls/gamb_colu/vcf/1177-VO-ML-LEHMANN-VMF00004_gamb_colu_cnv_coverage_calls.vcf.gz](https://storage.googleapis.com/vo_agam_release/v3.1/cnv/1177-VO-ML-LEHMANN-VMF00004/coverage_calls/gamb_colu/vcf/1177-VO-ML-LEHMANN-VMF00004_gamb_colu_cnv_coverage_calls.vcf.gz)

VCF files for all sample sets can be downloaded with:

In [None]:
# download coverage calls in VCF format for all sample sets
!gsutil -m rsync -r \
    -x '.*/hmm/.*|.*/discordant_read_calls/.*|.*/coverage_calls/.*/zarr/.*' \
    gs://vo_agam_release/v3.1/cnv/ ~/vo_agam_release/v3.1/cnv/

Zarr files for all sample sets can be downloaded with:

In [None]:
# download coverage calls in Zarr format for all sample sets
!gsutil -m rsync -r \
    -x '.*/hmm/.*|.*/discordant_read_calls/.*|.*/coverage_calls/.*/vcf/.*' \
    gs://vo_agam_release/v3.1/cnv/ ~/vo_agam_release/v3.1/cnv/

### CNV discordant read calls

CNV calls for selected insecticide resistance loci are available in VCF and Zarr formats, and are organised by sample set. 

For example, the VCF file for sample set `1177-VO-ML-LEHMANN-VMF00004` can be downloaded from:

* [https://storage.googleapis.com/vo_agam_release/v3.1/cnv/1177-VO-ML-LEHMANN-VMF00004/discordant_read_calls/vcf/1177-VO-ML-LEHMANN-VMF00004_cnv_discordant_read_calls.vcf.gz](https://storage.googleapis.com/vo_agam_release/v3.1/cnv/1177-VO-ML-LEHMANN-VMF00004/discordant_read_calls/vcf/1177-VO-ML-LEHMANN-VMF00004_cnv_discordant_read_calls.vcf.gz)

VCF and Zarr files for all sample sets can be downloaded with:

In [None]:
# download discordant read calls for all sample sets
!gsutil -m rsync -r \
    -x '.*/hmm/.*|.*/coverage_calls/.*' \
    gs://vo_agam_release/v3.1/cnv/ ~/vo_agam_release/v3.1/cnv/
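The three direct download links given in this section share a common structure. The sketch below reproduces them from a sample set name. It is a minimal example, assuming other sample sets would follow the same naming as `1177-VO-ML-LEHMANN-VMF00004`; the function name `cnv_vcf_url` and its parameters are illustrative.

```python
BASE_URL = "https://storage.googleapis.com/vo_agam_release/v3.1/cnv"


def cnv_vcf_url(sample_set, data_type, callset=None):
    """Return the VCF URL for one CNV data type of a sample set.

    data_type is "hmm", "coverage_calls" or "discordant_read_calls";
    callset ("gamb_colu" or "arab") is only used for coverage calls.
    """
    if data_type == "hmm":
        return f"{BASE_URL}/{sample_set}/hmm/vcf/{sample_set}_cnv_hmm.vcf.gz"
    if data_type == "coverage_calls":
        if callset not in ("gamb_colu", "arab"):
            raise ValueError("coverage calls need callset 'gamb_colu' or 'arab'")
        return (
            f"{BASE_URL}/{sample_set}/coverage_calls/{callset}/vcf/"
            f"{sample_set}_{callset}_cnv_coverage_calls.vcf.gz"
        )
    if data_type == "discordant_read_calls":
        return (
            f"{BASE_URL}/{sample_set}/discordant_read_calls/vcf/"
            f"{sample_set}_cnv_discordant_read_calls.vcf.gz"
        )
    raise ValueError(f"unknown CNV data type: {data_type!r}")


sample_set = "1177-VO-ML-LEHMANN-VMF00004"
print(cnv_vcf_url(sample_set, "hmm"))
print(cnv_vcf_url(sample_set, "coverage_calls", callset="gamb_colu"))
print(cnv_vcf_url(sample_set, "discordant_read_calls"))
```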

## Haplotypes

The `Ag3.1` data resource also includes haplotype reference panels, which were obtained by [phasing](https://en.wikipedia.org/wiki/Haplotype_estimation) the SNP calls. To allow for a range of different analyses, three different haplotype reference panels were constructed, each using a different subset of samples and applying different site filters:

* `gamb_colu_arab` - This haplotype reference panel includes **all wild-caught samples**, and phases biallelic SNPs passing the "gamb_colu_arab" site filters.
* `gamb_colu` - This haplotype reference panel includes **wild-caught samples assigned as either *An. gambiae*, *An. coluzzii* or intermediate *An. gambiae/An. coluzzii*** via the AIM species calling method, and phases biallelic SNPs passing the "gamb_colu" site filters.
* `arab` - This haplotype reference panel includes **wild-caught samples assigned as *An. arabiensis*** via the AIM species calling method, and phases biallelic SNPs passing the "arab" site filters. 

Haplotype data can be downloaded in either VCF or Zarr format. See the subsections below for further details.

### Haplotype reference panels (VCF format)

These are the VCFs created by the phasing pipeline, containing all samples included in each of the phasing runs. There is one VCF per phasing analysis per chromosome arm. The URL for each file has the following structure:

* `https://storage.googleapis.com/vo_agam_release/v3.1/snp_haplotypes/panel/{analysis}/ag3.1_{analysis}_{contig}_phased.vcf.gz`

...where `{analysis}` is one of "gamb_colu_arab", "gamb_colu" or "arab"; and `{contig}` is one of "2R", "2L", "3R", "3L", "X". 

E.g., the panel VCF for the `gamb_colu_arab` analysis for chromosome arm 3L can be downloaded here:

* [https://storage.googleapis.com/vo_agam_release/v3.1/snp_haplotypes/panel/gamb_colu_arab/ag3.1_gamb_colu_arab_3L_phased.vcf.gz](https://storage.googleapis.com/vo_agam_release/v3.1/snp_haplotypes/panel/gamb_colu_arab/ag3.1_gamb_colu_arab_3L_phased.vcf.gz)

Note that these files can be large, up to ~50 GB.
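The URL structure above can be expanded into the full list of panel files. The sketch below enumerates every combination of analysis and chromosome arm named in this section; the variable names are illustrative, and the list is only built, not downloaded.

```python
from itertools import product

PANEL_URL = (
    "https://storage.googleapis.com/vo_agam_release/v3.1/snp_haplotypes/"
    "panel/{analysis}/ag3.1_{analysis}_{contig}_phased.vcf.gz"
)
ANALYSES = ("gamb_colu_arab", "gamb_colu", "arab")
CONTIGS = ("2R", "2L", "3R", "3L", "X")

# One URL per phasing analysis per chromosome arm.
panel_urls = [
    PANEL_URL.format(analysis=analysis, contig=contig)
    for analysis, contig in product(ANALYSES, CONTIGS)
]

for url in panel_urls:
    print(url)
```

Given the file sizes noted above, it may be worth selecting only the analysis and arms you need before passing any of these URLs to `wget`.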

If you'd like to download all of the panel files, you could also use `gsutil`, e.g.:

In [None]:
# create a local directory to store the data
!mkdir -pv ~/vo_agam_release/v3.1/snp_haplotypes/panel/

# copy files from cloud to local file system
!gsutil -m rsync -r \
    -x '.*/.*zarr.zip' \
    gs://vo_agam_release/v3.1/snp_haplotypes/panel/ \
    ~/vo_agam_release/v3.1/snp_haplotypes/panel/

### Sample set haplotypes (VCF format)

These VCFs are subsets of the panel VCFs, containing only samples in a given sample set. There is one VCF per sample set, per phasing analysis, per chromosome arm. The URL for each file has the following structure:

* `https://storage.googleapis.com/vo_agam_release/v3.1/snp_haplotypes/1177-VO-ML-LEHMANN-VMF00004/{analysis}/vcf/1177-VO-ML-LEHMANN-VMF00004_{analysis}_{contig}_phased.vcf.gz`

...where `{analysis}` is one of "gamb_colu_arab", "gamb_colu" or "arab"; `{contig}` is one of "2R", "2L", "3R", "3L", "X".

E.g., the VCF for sample set `1177-VO-ML-LEHMANN-VMF00004`, for the `gamb_colu` analysis, for chromosome arm 2L, can be downloaded here:

* [https://storage.googleapis.com/vo_agam_release/v3.1/snp_haplotypes/1177-VO-ML-LEHMANN-VMF00004/gamb_colu/vcf/1177-VO-ML-LEHMANN-VMF00004_gamb_colu_2L_phased.vcf.gz](https://storage.googleapis.com/vo_agam_release/v3.1/snp_haplotypes/1177-VO-ML-LEHMANN-VMF00004/gamb_colu/vcf/1177-VO-ML-LEHMANN-VMF00004_gamb_colu_2L_phased.vcf.gz)

If you'd like to download all of the VCF files for a given sample set, you could also use gsutil, e.g.:

In [None]:
# create a local directory to store the data
!mkdir -pv ~/vo_agam_release/v3.1/snp_haplotypes/1177-VO-ML-LEHMANN-VMF00004/

# copy files from cloud to local file system
!gsutil -m rsync -r \
    -x '.*/zarr/.*' \
    gs://vo_agam_release/v3.1/snp_haplotypes/1177-VO-ML-LEHMANN-VMF00004/ \
    ~/vo_agam_release/v3.1/snp_haplotypes/1177-VO-ML-LEHMANN-VMF00004/

### Sample set haplotypes (Zarr format)

These contain the haplotype data in Zarr format, with one Zarr hierarchy per sample set. The root Zarr path for a given hierarchy has the following structure:

* `gs://vo_agam_release/v3.1/snp_haplotypes/1177-VO-ML-LEHMANN-VMF00004/{analysis}/zarr`

Data can be downloaded with gsutil. E.g., download the Zarr data for sample set `1177-VO-ML-LEHMANN-VMF00004`. Note that the sites are stored in a separate hierarchy:

In [None]:
# create local directories to store the data
!mkdir -pv ~/vo_agam_release/v3/snp_haplotypes/sites
!mkdir -pv ~/vo_agam_release/v3.1/snp_haplotypes/1177-VO-ML-LEHMANN-VMF00004

# copy haplotype data from cloud to local file system
!gsutil -m rsync -r \
    -x '.*/vcf/.*' \
    gs://vo_agam_release/v3.1/snp_haplotypes/1177-VO-ML-LEHMANN-VMF00004/ \
    ~/vo_agam_release/v3.1/snp_haplotypes/1177-VO-ML-LEHMANN-VMF00004/

# copy phased sites data from cloud to local file system
!gsutil -m rsync -r \
    gs://vo_agam_release/v3/snp_haplotypes/sites/ \
    ~/vo_agam_release/v3/snp_haplotypes/sites/ 

## Feedback and suggestions

If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know. Please get in touch via the [malariagen/vector-data GitHub discussion board](https://github.com/malariagen/vector-data/discussions).