# Amin1 cloud data access

This page provides information about how to access data from the [Amin1 resource](intro) via Google Cloud. This includes sample metadata and single nucleotide polymorphism (SNP) calls.

This notebook illustrates how to read data directly from the cloud, without first downloading any data locally. It can be run from any computer, but will work best when run from a compute node within Google Cloud, because the node will be physically closer to the data and so data transfer will be faster. For example, this notebook can be run via [Google Colab](https://colab.research.google.com/), which is a free interactive computing service running in the cloud.

To launch this notebook in the cloud and run it for yourself, click the launch icon (<i class="fas fa-rocket"></i>) at the top of the page and select one of the cloud computing services available.

## Data hosting

All data required for this notebook is hosted on Google Cloud Storage (GCS). Data are hosted in the `vo_amin_release` bucket, which is a multi-region bucket located in the United States. All data hosted in GCS are publicly accessible and do not require any authentication to access. 

## Setup

Running this notebook requires some Python packages to be installed. These packages can be installed via pip or conda. E.g.:

In [1]:
!pip install -q malariagen_data

To make accessing these data more convenient, we've created the [malariagen_data Python package](https://github.com/malariagen/malariagen-data-python), which is available from PyPI. This is experimental so please let us know if you find any bugs or have any suggestions. 

Now import the packages we'll need to use here.

In [55]:
import malariagen_data
import numpy as np
import dask.array as da
from dask.diagnostics.progress import ProgressBar
import dask
dask.config.set(**{'array.slicing.split_large_chunks': False})
import allel

Data access from Google Cloud is set up with the following code:

In [3]:
amin1 = malariagen_data.Amin1("gs://vo_amin_release/")

## Sample metadata

Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection and the cohort to which each specimen was assigned.

Load sample metadata into a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe):

In [4]:
df_samples = amin1.sample_metadata()
df_samples

,sample_id,original_sample_id,sanger_sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,season,PCA_cohort,cohort,subsampled_cohort
0,VBS09378-4248STDY7308980,VBS09378,4248STDY7308980,CB-2-00264,Brandy St. Laurent,Cambodia,Preah Kleang,2016,3,13.667,104.982,Feb-Apr (late dry),A,PV,
1,VBS09382-4248STDY7308981,VBS09382,4248STDY7308981,CB-2-00258,Brandy St. Laurent,Cambodia,Preah Kleang,2016,3,13.667,104.982,Feb-Apr (late dry),A,PV,
2,VBS09397-4248STDY7308982,VBS09397,4248STDY7308982,CB-2-00384,Brandy St. Laurent,Cambodia,Preah Kleang,2016,3,13.667,104.982,Feb-Apr (late dry),A,PV,PV
3,VBS09460-4248STDY7308986,VBS09460,4248STDY7308986,CB-2-02960,Brandy St. Laurent,Cambodia,Preah Kleang,2016,6,13.667,104.982,May-Jul (early wet),A,PV,
4,VBS09466-4248STDY7308989,VBS09466,4248STDY7308989,CB-2-04070,Brandy St. Laurent,Cambodia,Preah Kleang,2016,11,13.667,104.982,Nov-Jan (early dry),A,PV,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
297,VBS16624-4248STDY7918667,VBS16624,4248STDY7918667,KV-32-01591,Brandy St. Laurent,Cambodia,Sayas,2014,6,13.548,107.025,May-Jul (early wet),C,RK2,RK2
298,VBS16625-4248STDY7918668,VBS16625,4248STDY7918668,KV-32-01499,Brandy St. Laurent,Cambodia,Sayas,2014,6,13.548,107.025,May-Jul (early wet),C,RK2,RK2
299,VBS16626-4248STDY7918669,VBS16626,4248STDY7918669,KV-32-01465,Brandy St. Laurent,Cambodia,Sayas,2014,6,13.548,107.025,May-Jul (early wet),B,RK1,RK1
300,VBS16628-4248STDY7918670,VBS16628,4248STDY7918670,KV-32-01454,Brandy St. Laurent,Cambodia,Sayas,2014,6,13.548,107.025,May-Jul (early wet),C,RK2,RK2


The `sample_id` column gives the sample identifier used throughout all analyses.

The `country`, `location`, `latitude` and `longitude` columns give the location where the specimen was collected.

The `year` and `month` columns give the approximate date when the specimen was collected.

[Pandas](https://pandas.pydata.org/) can be used to explore and query the sample metadata in various ways. For example, here is a summary of the number of samples by collection site and year:

In [5]:
df_summary = df_samples.pivot_table(
    index=["longitude", "latitude", "location"], 
    columns=["year"],
    values="sample_id", 
    aggfunc=len,
    fill_value=0
)
df_summary.style.set_caption("Number of mosquito specimens by collection site and year.")

,,year,2010,2011,2014,2015,2016
longitude,latitude,location,,,,,
102.735,12.155,Thmar Da,26,15,0,0,0
104.92,13.77,Chean Mok,0,0,66,9,0
104.982,13.667,Preah Kleang,0,0,47,9,36
106.995,13.595,Chamkar San,0,0,40,11,0
107.025,13.548,Sayas,0,0,39,4,0
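
As a further illustration of querying the metadata with pandas, the sketch below counts specimens per cohort and selects the specimens from one site and season. It uses a small hand-made DataFrame (`df_toy`, a hypothetical stand-in that borrows column names from the table above) rather than the real metadata, so the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for df_samples: same column names, made-up rows.
df_toy = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3", "S4", "S5"],
        "location": ["Preah Kleang", "Preah Kleang", "Sayas", "Sayas", "Thmar Da"],
        "year": [2016, 2016, 2014, 2014, 2010],
        "season": [
            "Feb-Apr (late dry)",
            "May-Jul (early wet)",
            "May-Jul (early wet)",
            "May-Jul (early wet)",
            "Nov-Jan (early dry)",
        ],
        "cohort": ["PV", "PV", "RK2", "RK1", "TD"],
    }
)

# Number of specimens in each cohort.
cohort_counts = df_toy.groupby("cohort")["sample_id"].count()

# Specimens collected in the early wet season at a given site.
df_wet_sayas = df_toy.query(
    "season == 'May-Jul (early wet)' and location == 'Sayas'"
)

print(cohort_counts.to_dict())
print(df_wet_sayas["sample_id"].tolist())
```

The same `groupby` and `query` calls should apply unchanged to `df_samples`, since they rely only on column names shown in the table above.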


## Reference genome

Sequence data in this study were aligned to the MINIMUS1 reference genome. This reference genome contains 678 contigs in total, but many contigs are small and not suitable for population genetic analyses. We have therefore included SNP calls only for the 40 largest contigs. The set of contigs analysed can be accessed as follows:

In [6]:
amin1.contigs

('KB663610',
 'KB663611',
 'KB663622',
 'KB663633',
 'KB663644',
 'KB663655',
 'KB663666',
 'KB663677',
 'KB663688',
 'KB663699',
 'KB663710',
 'KB663721',
 'KB663722',
 'KB663733',
 'KB663744',
 'KB663755',
 'KB663766',
 'KB663777',
 'KB663788',
 'KB663799',
 'KB663810',
 'KB663821',
 'KB663832',
 'KB663833',
 'KB663844',
 'KB663855',
 'KB663866',
 'KB663877',
 'KB663888',
 'KB663899',
 'KB663910',
 'KB663921',
 'KB663932',
 'KB663943',
 'KB663955',
 'KB664054',
 'KB664165',
 'KB664255',
 'KB664266',
 'KB664277')

For convenience, the reference genome sequence for any contig can be loaded as a [NumPy array](https://numpy.org/doc/stable/user/absolute_beginners.html), e.g.:

In [7]:
# load the reference sequence for a single contig as a numpy array
seq = amin1.genome_sequence(contig="KB663610").compute()
seq

array([b'T', b'T', b'C', ..., b'C', b'A', b'C'], dtype='|S1')

In [8]:
len(seq)

31626230

In [11]:
np.sum((seq != b'N') & (seq != b'n'))

30141520
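
The two counts above imply that roughly 95% of KB663610 (30,141,520 of 31,626,230 bases) is not "N". The sketch below shows the same kind of calculation, plus a simple base composition, on a short made-up sequence (`seq_toy` is a hypothetical stand-in for `seq`, using the same single-byte dtype).

```python
import numpy as np

# Toy stand-in for seq: a short sequence stored as single-byte strings.
seq_toy = np.array(list("ACGTNNacgtn"), dtype="S1")

# Bases that are called, i.e. neither "N" nor "n".
is_called = (seq_toy != b"N") & (seq_toy != b"n")
n_called = int(np.count_nonzero(is_called))
frac_called = n_called / seq_toy.size

# Base composition, ignoring case.
bases, counts = np.unique(np.char.upper(seq_toy), return_counts=True)
composition = dict(zip(bases.tolist(), counts.tolist()))

print(f"{n_called} of {seq_toy.size} bases called ({frac_called:.0%})")
print(composition)
```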

## SNP calls

We have called SNP genotypes in all samples at all positions in the genome where the reference allele is not "N". Data on the SNP positions, alleles, site filters and genotype calls for a given contig can be accessed as an [xarray Dataset](http://xarray.pydata.org/en/stable/user-guide/data-structures.html#dataset). E.g., access SNP calls for contig KB663610: 

In [35]:
df_snps = amin1.snp_calls(contig="KB663610", site_mask=False)
df_snps

Variable,Bytes,Chunk bytes,Shape,Chunk shape,Tasks,Chunks,Type
variant_position,87.19 MiB,1.89 MiB,"(22857027,)","(495697,)",597,58,int32
,21.80 MiB,484.08 kiB,"(22857027,)","(495697,)",597,58,uint8
,7.08 kiB,7.08 kiB,"(302,)","(302,)",1,1,|S24
variant_allele,87.19 MiB,1.42 MiB,"(22857027, 4)","(495697, 3)",980,116,|S1
variant_filter_pass,21.80 MiB,865.87 kiB,"(22857027,)","(886654,)",96,32,bool
call_genotype,12.86 GiB,27.18 MiB,"(22857027, 302, 2)","(284970, 50, 2)",3285,707,int8
,6.43 GiB,13.59 MiB,"(22857027, 302)","(284970, 50)",3285,707,int8
,25.72 GiB,54.35 MiB,"(22857027, 302)","(284970, 50)",3285,707,float32
,51.43 GiB,108.71 MiB,"(22857027, 302, 4)","(284970, 50, 4)",3285,707,int16
,12.86 GiB,27.18 MiB,"(22857027, 302, 2)","(284970, 50, 2)",3992,707,bool

The arrays within this dataset are backed by [Dask arrays](https://docs.dask.org/en/stable/array.html), and can be accessed as shown below.
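
Because these arrays are lazy, it can help to estimate how much memory a selection would need before calling `.compute()`. The following sketch reproduces the uncompressed size of the genotype array reported above, with shape (22857027, 302, 2) and type int8, and compares it with a small slice. The helper `nbytes_of` is a hypothetical name introduced here for illustration and is not part of `malariagen_data`.

```python
import numpy as np


def nbytes_of(shape, dtype):
    """Return the uncompressed size in bytes of an array with this shape and dtype."""
    n_items = 1
    for n in shape:
        n_items *= n
    return n_items * np.dtype(dtype).itemsize


# Shape and dtype of the genotype array in the dataset summary above.
size_gib = nbytes_of((22857027, 302, 2), "int8") / 2**30
print(f"full genotype array: {size_gib:.2f} GiB")

# A slice of 1,000 sites for all samples is far smaller.
slice_kib = nbytes_of((1000, 302, 2), "int8") / 2**10
print(f"first 1,000 sites: {slice_kib:.1f} KiB")
```

Slicing a Dask array before computing, as in the cells that follow, keeps the amount of data transferred and held in memory small.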

### SNP positions and alleles

In [36]:
# SNP positions (1-based)
pos = df_snps['variant_position'].data
pos

,Array,Chunk
Bytes,87.19 MiB,1.89 MiB
Shape,"(22857027,)","(495697,)"
Count,597 Tasks,58 Chunks
Type,int32,numpy.ndarray

In [37]:
# read first 10 SNP positions into a numpy array
p = pos[:10].compute()
p

array([209, 210, 212, 214, 215, 217, 219, 220, 221, 222], dtype=int32)
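
A common next step is to find the SNPs that fall within a genomic region. The sketch below does this with `np.searchsorted` on the ten positions shown above, assuming positions within a contig are sorted in ascending order. The helper `locate_region` is a hypothetical name used only for this illustration.

```python
import numpy as np

# The first ten SNP positions shown above (1-based, sorted ascending).
pos_toy = np.array(
    [209, 210, 212, 214, 215, 217, 219, 220, 221, 222], dtype=np.int32
)


def locate_region(pos, start, stop):
    """Return a slice selecting positions in the closed interval [start, stop]."""
    i_start = np.searchsorted(pos, start, side="left")
    i_stop = np.searchsorted(pos, stop, side="right")
    return slice(int(i_start), int(i_stop))


loc_region = locate_region(pos_toy, 212, 219)
print(loc_region)
print(pos_toy[loc_region])
```

The resulting slice indexes the first (variants) dimension, so it could also be used to select the matching rows of the allele, filter and genotype arrays.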

In [38]:
# SNP alleles (first column is the reference allele)
alleles = df_snps['variant_allele'].data
alleles

,Array,Chunk
Bytes,87.19 MiB,1.42 MiB
Shape,"(22857027, 4)","(495697, 3)"
Count,980 Tasks,116 Chunks
Type,|S1,numpy.ndarray

In [39]:
# read first 10 SNP alleles into a numpy array
a = alleles[:10, :].compute()
a

array([[b'A', b'C', b'G', b'T'],
       [b'T', b'C', b'A', b'G'],
       [b'T', b'C', b'A', b'G'],
       [b'T', b'C', b'A', b'G'],
       [b'A', b'C', b'G', b'T'],
       [b'T', b'C', b'A', b'G'],
       [b'T', b'C', b'A', b'G'],
       [b'T', b'C', b'A', b'G'],
       [b'T', b'C', b'A', b'G'],
       [b'A', b'C', b'G', b'T']], dtype='|S1')

### Site filters

SNP calling is not always reliable, and we have created site filters to allow excluding low quality SNPs. For each contig, a "filter_pass" Boolean mask is available, where `True` indicates that the site passed the filter and is accessible to high quality SNP calling. 

In [40]:
# site filters
filter_pass = df_snps['variant_filter_pass'].data
filter_pass

,Array,Chunk
Bytes,21.80 MiB,865.87 kiB
Shape,"(22857027,)","(886654,)"
Count,96 Tasks,32 Chunks
Type,bool,numpy.ndarray

In [41]:
# load site filters for first 10 SNPs
f = filter_pass[:10].compute()
f

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])

In [42]:
# how many sites on this contig pass filters?
n_sites = df_snps.sizes['variants']  # .sizes maps dimension names to lengths
n_pass = filter_pass.sum().compute()
print(f"{n_pass:,} out of {n_sites:,} ({n_pass/n_sites:.0%}) sites pass site filters")

22,857,027 out of 22,857,027 (100%) sites pass site filters


Note that we have chosen to genotype all samples at all sites in the genome, assuming all possible SNP alleles. Not all of these alternate alleles will actually have been observed in the samples. To determine which sites and alleles are segregating, an allele count can be performed over the samples you are interested in. See the example below. 
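
To make the idea of an allele count concrete, here is a minimal NumPy sketch on a made-up genotype array. It is not the scikit-allel implementation; the function `count_alleles` defined here is a hypothetical stand-in that only illustrates what is being counted, and it follows the convention that a site is segregating when more than one allele is observed in the chosen samples.

```python
import numpy as np

# Toy genotypes: 4 sites x 3 diploid samples; -1 marks a missing allele call.
gt_toy = np.array(
    [
        [[0, 0], [0, 0], [0, 0]],    # only the reference allele observed
        [[0, 1], [0, 0], [1, 1]],    # reference and first alternate allele
        [[2, 2], [2, 2], [-1, -1]],  # only the second alternate allele observed
        [[0, 3], [1, 0], [0, 0]],    # three different alleles observed
    ],
    dtype=np.int8,
)


def count_alleles(gt, n_alleles=4):
    """Count how often each allele index occurs at each site, ignoring missing calls."""
    n_sites = gt.shape[0]
    calls = gt.reshape(n_sites, -1)
    ac = np.zeros((n_sites, n_alleles), dtype=np.int64)
    for allele in range(n_alleles):
        ac[:, allele] = np.count_nonzero(calls == allele, axis=1)
    return ac


ac_toy = count_alleles(gt_toy)

# A site is segregating in these samples if more than one allele is observed.
is_seg = np.count_nonzero(ac_toy > 0, axis=1) > 1

print(ac_toy)
print(is_seg)
```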

### SNP genotypes

SNP genotypes for individual samples are available. Genotypes are stored as a three-dimensional array, where the first dimension corresponds to genomic positions, the second dimension is samples, and the third dimension is ploidy (2). Values are coded as integers, where -1 represents a missing value, 0 represents the reference allele, and 1, 2, and 3 represent alternate alleles.

SNP genotypes can be accessed as dask arrays as shown in the example below.

In [43]:
gt = df_snps['call_genotype'].data
gt

,Array,Chunk
Bytes,12.86 GiB,27.18 MiB
Shape,"(22857027, 302, 2)","(284970, 50, 2)"
Count,3285 Tasks,707 Chunks
Type,int8,numpy.ndarray

Note that the columns of this array (second dimension) match the rows in the sample metadata, if the same sample sets were loaded. I.e.:

In [44]:
len(df_samples) == gt.shape[1]

True

You can use this correspondence to apply further subsetting operations to the genotypes by querying the sample metadata. E.g.:

In [45]:
df_samples.cohort.unique()

array(['PV', nan, 'RK2', 'RK1', 'TD'], dtype=object)

In [49]:
# select samples from the Thmar Da cohort
loc_cohort = df_samples.eval("cohort == 'TD'").values
print(f"found {np.count_nonzero(loc_cohort)} samples")
gt_cohort = da.compress(loc_cohort, gt, axis=1)
gt_cohort

found 41 samples


,Array,Chunk
Bytes,1.75 GiB,18.48 MiB
Shape,"(22857027, 41, 2)","(284970, 34, 2)"
Count,3487 Tasks,202 Chunks
Type,int8,numpy.ndarray
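
The selection above relies on a Boolean mask over samples being applied along the second axis of the genotype array. The sketch below shows the same pattern with plain NumPy and a tiny made-up dataset. The names `df_toy` and `gt_toy` are hypothetical, and the array values are arbitrary labels rather than real genotype codes, chosen so the selected columns are easy to recognise.

```python
import numpy as np
import pandas as pd

# Toy metadata: one row per sample, in the same order as the genotype columns.
df_toy = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3", "S4"],
        "cohort": ["PV", "TD", None, "TD"],
    }
)

# Toy array: 2 sites x 4 samples x ploidy 2, filled with the values 0..15.
gt_toy = np.arange(16, dtype=np.int8).reshape(2, 4, 2)

# Boolean mask over samples; rows with a missing cohort evaluate to False.
loc_td = (df_toy["cohort"] == "TD").to_numpy()

# Keep only the selected samples along the second (samples) axis.
gt_td = np.compress(loc_td, gt_toy, axis=1)
ids_td = df_toy.loc[loc_td, "sample_id"].tolist()

print(ids_td)
print(gt_td.shape)
```

Keeping the matching `sample_id` values alongside the subsetted genotypes makes it easier to trace results back to individual specimens.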


Data can be read into memory as numpy arrays, e.g., read genotypes for the first 5 SNPs and the first 3 samples:

In [50]:
g = gt[:5, :3, :].compute()
g

array([[[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]]], dtype=int8)
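
Once genotypes are in memory, the integer coding described earlier (-1 missing, 0 reference, 1 to 3 alternate) can be summarised with ordinary NumPy operations. The following is a small sketch on a made-up array (`g_toy` is hypothetical) that flags missing, heterozygous and homozygous reference calls.

```python
import numpy as np

# Toy genotype array: 3 sites x 2 samples x ploidy 2.
g_toy = np.array(
    [
        [[0, 0], [0, 1]],
        [[-1, -1], [2, 2]],
        [[1, 3], [0, 0]],
    ],
    dtype=np.int8,
)

# A call is missing if any allele is coded as -1.
is_missing = np.any(g_toy < 0, axis=2)

# A call is heterozygous if both alleles are called and differ.
is_het = ~is_missing & (g_toy[:, :, 0] != g_toy[:, :, 1])

# A call is homozygous for the reference allele if both alleles are 0.
is_hom_ref = np.all(g_toy == 0, axis=2)

n_missing = int(is_missing.sum())
n_het_per_sample = is_het.sum(axis=0)

print(n_missing)
print(n_het_per_sample)
```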

If you want to work with the genotype calls, you may find it convenient to use [scikit-allel](http://scikit-allel.readthedocs.org/).
E.g., the code below sets up a genotype array.

In [53]:
# use the scikit-allel wrapper class for genotype calls
gt = allel.GenotypeDaskArray(df_snps['call_genotype'].data)
gt

,0,1,2,3,4,...,297,298,299,300,301
0,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0
1,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0
2,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0
...,...,...,...,...,...,...,...,...,...,...,...
22857024,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/2
22857025,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0
22857026,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0


## Example computations

Below are some examples of simple computations that can be run with these data.

### Counting sites passing filters

For each of the contigs for which SNP calling was performed, count the number of sites and the number passing site filters.

In [24]:
for contig in amin1.contigs:
    ds_snps = amin1.snp_calls(contig=contig)
    filter_pass = ds_snps['variant_filter_pass'].data
    n_sites = ds_snps.sizes['variants']  # .sizes maps dimension names to lengths
    n_pass = filter_pass.sum().compute()
    print(f"{contig}: {n_pass:,} out of {n_sites:,} ({n_pass/n_sites:.0%}) sites pass site filters")

KB663610: 22,857,027 out of 22,857,027 (100%) sites pass site filters
KB663611: 3,737,822 out of 3,737,822 (100%) sites pass site filters
KB663622: 4,388,762 out of 4,388,762 (100%) sites pass site filters
KB663633: 4,215,569 out of 4,215,569 (100%) sites pass site filters
KB663644: 4,221,032 out of 4,221,032 (100%) sites pass site filters
KB663655: 2,916,639 out of 2,916,639 (100%) sites pass site filters
KB663666: 2,501,655 out of 2,501,655 (100%) sites pass site filters
KB663677: 2,586,150 out of 2,586,150 (100%) sites pass site filters
KB663688: 2,742,023 out of 2,742,023 (100%) sites pass site filters
KB663699: 1,534,228 out of 1,534,228 (100%) sites pass site filters


KeyboardInterrupt: 

### Counting segregating sites

Count the number of segregating SNPs on a single contig that also pass site filters.

In [58]:
# choose contig
contig = "KB663610"

# access SNP data
ds_snps = amin1.snp_calls(contig=contig)
ds_snps

Variable,Bytes,Chunk bytes,Shape,Chunk shape,Tasks,Chunks,Type
variant_position,87.19 MiB,1.89 MiB,"(22857027,)","(495697,)",597,58,int32
,21.80 MiB,484.08 kiB,"(22857027,)","(495697,)",597,58,uint8
,7.08 kiB,7.08 kiB,"(302,)","(302,)",1,1,|S24
variant_allele,87.19 MiB,1.42 MiB,"(22857027, 4)","(495697, 3)",980,116,|S1
variant_filter_pass,21.80 MiB,865.87 kiB,"(22857027,)","(886654,)",96,32,bool
call_genotype,12.86 GiB,27.18 MiB,"(22857027, 302, 2)","(284970, 50, 2)",3285,707,int8
,6.43 GiB,13.59 MiB,"(22857027, 302)","(284970, 50)",3285,707,int8
,25.72 GiB,54.35 MiB,"(22857027, 302)","(284970, 50)",3285,707,float32
,51.43 GiB,108.71 MiB,"(22857027, 302, 4)","(284970, 50, 4)",3285,707,int16
,12.86 GiB,27.18 MiB,"(22857027, 302, 2)","(284970, 50, 2)",3992,707,bool

In [59]:
# locate pass sites
loc_pass = ds_snps['variant_filter_pass'].values
loc_pass

array([ True,  True,  True, ...,  True,  True,  True])

In [63]:
# wrap the genotype calls so that an allele count can be performed
gt = allel.GenotypeDaskArray(ds_snps['call_genotype'].data)
gt

,0,1,2,3,4,...,297,298,299,300,301
0,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0
1,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0
2,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0
...,...,...,...,...,...,...,...,...,...,...,...
22857024,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/2
22857025,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0
22857026,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0


In [64]:
with ProgressBar():
    ac = gt.count_alleles(max_allele=3).compute()
ac

[########################################] | 100% Completed |  1min 20.0s


,0,1,2,3
0,604,0,0,0
1,604,0,0,0
2,604,0,0,0
...,...,...,...,...
18061986,604,0,0,0
18061987,604,0,0,0
18061988,604,0,0,0


In [66]:
np.bincount(ac.allelism())

array([       0, 12723665,  4583449,   711704,    43171])

In [67]:
ac.shape

(18061989, 4)
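
The allelism histogram above already contains enough information to count segregating sites among the sites included in `ac`: index 1 holds sites with a single observed allele, and indices 2 to 4 hold sites with two or more. The short sketch below simply adds up the numbers printed above; no new data are involved.

```python
import numpy as np

# Histogram printed by np.bincount(ac.allelism()) above: the index is the
# number of distinct alleles observed at a site, the value is the number of sites.
allelism_counts = np.array([0, 12723665, 4583449, 711704, 43171])

n_sites = int(allelism_counts.sum())
n_segregating = int(allelism_counts[2:].sum())   # two or more alleles observed
n_multiallelic = int(allelism_counts[3:].sum())  # three or more alleles observed

print(
    f"{n_sites:,} sites, {n_segregating:,} segregating, "
    f"{n_multiallelic:,} with more than two alleles"
)
```

The total matches the number of rows in `ac` (18,061,989), which is a useful consistency check.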

In [62]:
# locate segregating sites
loc_seg = ac.is_segregating()

# count segregating and pass sites
n_pass_seg = np.count_nonzero(loc_pass & loc_seg)

n_pass_seg

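ValueError: operands could not be broadcast together with shapes (22857027,) (18061989,) 

This error occurs because the two masks cover different sets of sites: `loc_pass` has one entry for each of the 22,857,027 sites in `ds_snps`, whereas the allele counts shown above have only 18,061,989 rows, so they were not computed from the same dataset. Both arrays must be derived from the same `snp_calls()` result, with the same `site_mask` setting, before they can be combined.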
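
The shape mismatch reported above can be illustrated on a small scale. The sketch below uses made-up Boolean arrays (all names are hypothetical) for the case where the segregating mask was computed only over sites that pass the filter, and shows two consistent ways to obtain the count of segregating, passing sites. It is an illustration under that assumption, not a statement about how the arrays in this notebook were produced.

```python
import numpy as np

# Toy filter mask over all 6 sites on a contig.
loc_pass_toy = np.array([True, False, True, True, False, True])

# Toy segregating mask computed only over the 4 sites that pass the filter,
# for example from allele counts of a site-masked dataset.
loc_seg_masked = np.array([False, True, True, False])

# Combining the two directly would fail, because the arrays differ in length.
assert loc_pass_toy.shape != loc_seg_masked.shape
assert loc_seg_masked.size == np.count_nonzero(loc_pass_toy)

# Option 1: every site in the masked array already passes, so count directly.
n_pass_seg = int(np.count_nonzero(loc_seg_masked))

# Option 2: expand the masked result back onto the full set of sites.
loc_pass_seg_full = np.zeros(loc_pass_toy.shape, dtype=bool)
loc_pass_seg_full[loc_pass_toy] = loc_seg_masked

print(n_pass_seg)
print(loc_pass_seg_full)
```

Checking array lengths before combining masks, as in the assertions here, catches this kind of mismatch early.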

## Feedback and suggestions

If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know. Please get in touch via the [malariagen/vector-data GitHub discussion board](https://github.com/malariagen/vector-data/discussions).

## API docs

Here are the docstrings for the functions in the `malariagen_data` package that we used above.

In [26]:
help(amin1.sample_metadata)

Help on method sample_metadata in module malariagen_data.amin1:

sample_metadata() method of malariagen_data.amin1.Amin1 instance
    Access sample metadata.
    
    Returns
    -------
    df : pandas.DataFrame



In [28]:
help(amin1.genome_sequence)

Help on method genome_sequence in module malariagen_data.amin1:

genome_sequence(contig, inline_array=True, chunks='native') method of malariagen_data.amin1.Amin1 instance
    Access the reference genome sequence.
    
    Parameters
    ----------
    contig : str
        Chromosome arm, e.g., "3R".
    inline_array : bool, optional
        Passed through to dask.array.from_array().
    chunks : str, optional
        If 'auto' let dask decide chunk size. If 'native' use native zarr chunks.
        Also can be a target size, e.g., '200 MiB'.
    
    Returns
    -------
    d : dask.array.Array



In [29]:
help(amin1.snp_calls)

Help on method snp_calls in module malariagen_data.amin1:

snp_calls(contig, site_mask=False, inline_array=True, chunks='native') method of malariagen_data.amin1.Amin1 instance
    Access SNP sites, site filters and genotype calls.
    
    Parameters
    ----------
    contig : str or list
        Contig, e.g., "KB663610", or list of contigs.
    site_mask : bool
        Apply site filters.
    inline_array : bool, optional
        Passed through to dask.array.from_array().
    chunks : str, optional
        If 'auto' let dask decide chunk size. If 'native' use native zarr chunks.
        Also can be a target size, e.g., '200 MiB'.
    
    Returns
    -------
    ds : xarray.Dataset



---