# Marker investigation 

In this notebook the all of the variants in the [Pv4 data release](https://www.malariagen.net/resource/30) are used to identify regions of the core genome that are microhaplotype candidates (<200 bp length) with the following characteristics: 

- Clonal samples only (FWS > 0.95)
- Unique samples only, > 50% callable (Richard's "in_analysis_set" metadata column) 
- QC pass (Filter pass) 
- Only SNPs 
- Located in core genome 

Files used in this notebook are available through the [Pv4 data release](https://www.malariagen.net/resource/30), but are also attached to the repo


Sasha's notes 
- samples only in GSK and Price studies + anything in Pv1.0 release
- exclude samples that have unverified metadata and don't cluster in the defined subpopulations (these are removed in the "in_analysis_set" step already)
- biallelic SNPs only 

Questions 
- Do we also want to filter studies and in analysis set? - the usable study list in other notebook
- Also filtering snps by...
                     & variants['CDS']
                     & (freqs_subpops['all'][:,0] > 0.1)
                     & (freqs_subpops['all'][:,1] > 0.1)
                     & (freq_missing < 0.1))
- Change to Sasha's FWS file ? Checked and get the same samples when filtering > 0.95 so using one in pv4 release 
- I'm using region file not CDS to filter variants, is that okay 

## Setup 

In [231]:
from malariagen_data.pv4 import Pv4
import pandas as pd
import numpy as np
import allel
import dask.array as da
import collections

In [392]:
# Supress warning 
np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)   

## Load Data  

Using the Pv4 data package we can access the files that are stored on the cloud. This is set up with the following code:

In [232]:
pv4 = Pv4("gs://pv4_staging/")

Using this we can load the **sample metadata**

In [233]:
pv4_metadata = pv4.sample_metadata()

pv4_metadata.head()

Unnamed: 0,Sample,Study,Site,First-level administrative division,Country,Lat,Long,Year,ENA,All samples same individual,Population,% callable,QC pass,Exclusion reason,Is returning traveller
0,BBH-1-125,X0009-PV-ET-LO,Jimma,Ethiopia: Oromia,Ethiopia,7.683331,36.851318,2016,ERR2678989,BBH-1-125,AF,88.52,True,Analysis_set,False
1,BBH_1_132,X0009-PV-ET-LO,Jimma,Ethiopia: Oromia,Ethiopia,7.683331,36.851318,2016,ERR2678991,BBH_1_132,AF,90.2,True,Analysis_set,False
2,BBH_1_137,X0009-PV-ET-LO,Jimma,Ethiopia: Oromia,Ethiopia,7.683331,36.851318,2016,ERR2679003,BBH_1_137,AF,87.09,True,Analysis_set,False
3,BBH_1_153,X0009-PV-ET-LO,Jimma,Ethiopia: Oromia,Ethiopia,7.683331,36.851318,2016,ERR2678992,BBH_1_153,AF,90.6,True,Analysis_set,False
4,BBH_1_162,X0009-PV-ET-LO,Jimma,Ethiopia: Oromia,Ethiopia,7.683331,36.851318,2016,ERR2678993,BBH_1_162,AF,91.67,True,Analysis_set,False


We can also use the package to load the **variant data**

In [234]:
variant_dataset = pv4.variant_calls(extended=True)
variant_dataset

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 17.44 MiB 256.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type int32 numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 34.87 MiB 512.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type object numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,14.80 kiB,14.80 kiB
Shape,"(1895,)","(1895,)"
Count,1 Tasks,1 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 14.80 kiB 14.80 kiB Shape (1895,) (1895,) Count 1 Tasks 1 Chunks Type object numpy.ndarray",1895  1,

Unnamed: 0,Array,Chunk
Bytes,14.80 kiB,14.80 kiB
Shape,"(1895,)","(1895,)"
Count,1 Tasks,1 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,244.12 MiB,3.00 MiB
Shape,"(4571056, 7)","(65536, 6)"
Count,350 Tasks,140 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 244.12 MiB 3.00 MiB Shape (4571056, 7) (65536, 6) Count 350 Tasks 140 Chunks Type object numpy.ndarray",7  4571056,

Unnamed: 0,Array,Chunk
Bytes,244.12 MiB,3.00 MiB
Shape,"(4571056, 7)","(65536, 6)"
Count,350 Tasks,140 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 4.36 MiB 64.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type bool numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 4.36 MiB 64.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type bool numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 17.44 MiB 256.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type int32 numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 4.36 MiB 64.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type bool numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,16.13 GiB,8.00 MiB
Shape,"(4571056, 1895, 2)","(65536, 64, 2)"
Count,2100 Tasks,2100 Chunks
Type,int8,numpy.ndarray
"Array Chunk Bytes 16.13 GiB 8.00 MiB Shape (4571056, 1895, 2) (65536, 64, 2) Count 2100 Tasks 2100 Chunks Type int8 numpy.ndarray",2  1895  4571056,

Unnamed: 0,Array,Chunk
Bytes,16.13 GiB,8.00 MiB
Shape,"(4571056, 1895, 2)","(65536, 64, 2)"
Count,2100 Tasks,2100 Chunks
Type,int8,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,112.94 GiB,56.00 MiB
Shape,"(4571056, 1895, 7)","(65536, 64, 7)"
Count,2100 Tasks,2100 Chunks
Type,int16,numpy.ndarray
"Array Chunk Bytes 112.94 GiB 56.00 MiB Shape (4571056, 1895, 7) (65536, 64, 7) Count 2100 Tasks 2100 Chunks Type int16 numpy.ndarray",7  1895  4571056,

Unnamed: 0,Array,Chunk
Bytes,112.94 GiB,56.00 MiB
Shape,"(4571056, 1895, 7)","(65536, 64, 7)"
Count,2100 Tasks,2100 Chunks
Type,int16,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,16.13 GiB,8.00 MiB
Shape,"(4571056, 1895)","(65536, 64)"
Count,2100 Tasks,2100 Chunks
Type,int16,numpy.ndarray
"Array Chunk Bytes 16.13 GiB 8.00 MiB Shape (4571056, 1895) (65536, 64) Count 2100 Tasks 2100 Chunks Type int16 numpy.ndarray",1895  4571056,

Unnamed: 0,Array,Chunk
Bytes,16.13 GiB,8.00 MiB
Shape,"(4571056, 1895)","(65536, 64)"
Count,2100 Tasks,2100 Chunks
Type,int16,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,8.07 GiB,4.00 MiB
Shape,"(4571056, 1895)","(65536, 64)"
Count,2100 Tasks,2100 Chunks
Type,int8,numpy.ndarray
"Array Chunk Bytes 8.07 GiB 4.00 MiB Shape (4571056, 1895) (65536, 64) Count 2100 Tasks 2100 Chunks Type int8 numpy.ndarray",1895  4571056,

Unnamed: 0,Array,Chunk
Bytes,8.07 GiB,4.00 MiB
Shape,"(4571056, 1895)","(65536, 64)"
Count,2100 Tasks,2100 Chunks
Type,int8,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,64.54 GiB,32.00 MiB
Shape,"(4571056, 1895)","(65536, 64)"
Count,2100 Tasks,2100 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 64.54 GiB 32.00 MiB Shape (4571056, 1895) (65536, 64) Count 2100 Tasks 2100 Chunks Type object numpy.ndarray",1895  4571056,

Unnamed: 0,Array,Chunk
Bytes,64.54 GiB,32.00 MiB
Shape,"(4571056, 1895)","(65536, 64)"
Count,2100 Tasks,2100 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,64.54 GiB,32.00 MiB
Shape,"(4571056, 1895)","(65536, 64)"
Count,2100 Tasks,2100 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 64.54 GiB 32.00 MiB Shape (4571056, 1895) (65536, 64) Count 2100 Tasks 2100 Chunks Type object numpy.ndarray",1895  4571056,

Unnamed: 0,Array,Chunk
Bytes,64.54 GiB,32.00 MiB
Shape,"(4571056, 1895)","(65536, 64)"
Count,2100 Tasks,2100 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,96.81 GiB,48.00 MiB
Shape,"(4571056, 1895, 3)","(65536, 64, 3)"
Count,2100 Tasks,2100 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 96.81 GiB 48.00 MiB Shape (4571056, 1895, 3) (65536, 64, 3) Count 2100 Tasks 2100 Chunks Type int32 numpy.ndarray",3  1895  4571056,

Unnamed: 0,Array,Chunk
Bytes,96.81 GiB,48.00 MiB
Shape,"(4571056, 1895, 3)","(65536, 64, 3)"
Count,2100 Tasks,2100 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,104.62 MiB,1.50 MiB
Shape,"(4571056, 6)","(65536, 6)"
Count,70 Tasks,70 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 104.62 MiB 1.50 MiB Shape (4571056, 6) (65536, 6) Count 70 Tasks 70 Chunks Type int32 numpy.ndarray",6  4571056,

Unnamed: 0,Array,Chunk
Bytes,104.62 MiB,1.50 MiB
Shape,"(4571056, 6)","(65536, 6)"
Count,70 Tasks,70 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,104.62 MiB,1.50 MiB
Shape,"(4571056, 6)","(65536, 6)"
Count,70 Tasks,70 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 104.62 MiB 1.50 MiB Shape (4571056, 6) (65536, 6) Count 70 Tasks 70 Chunks Type float32 numpy.ndarray",6  4571056,

Unnamed: 0,Array,Chunk
Bytes,104.62 MiB,1.50 MiB
Shape,"(4571056, 6)","(65536, 6)"
Count,70 Tasks,70 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 17.44 MiB 256.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type int32 numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 17.44 MiB 256.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type int32 numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 4.36 MiB 64.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type bool numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 4.36 MiB 64.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type bool numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 4.36 MiB 64.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type bool numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 4.36 MiB 64.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type bool numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 4.36 MiB 64.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type bool numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 4.36 MiB 64.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type bool numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 4.36 MiB 64.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type bool numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,4.36 MiB,64.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 17.44 MiB 256.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type float32 numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 34.87 MiB 512.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type object numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 17.44 MiB 256.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type float32 numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 34.87 MiB 512.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type object numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 17.44 MiB 256.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type float32 numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 17.44 MiB 256.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type float32 numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 34.87 MiB 512.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type object numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 34.87 MiB 512.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type object numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 34.87 MiB 512.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type object numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 34.87 MiB 512.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type object numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 34.87 MiB 512.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type object numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 34.87 MiB 512.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type object numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 34.87 MiB 512.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type object numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 34.87 MiB 512.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type object numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 34.87 MiB 512.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type object numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 17.44 MiB 256.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type float32 numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 17.44 MiB 256.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type float32 numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,17.44 MiB,256.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 34.87 MiB 512.00 kiB Shape (4571056,) (65536,) Count 70 Tasks 70 Chunks Type object numpy.ndarray",4571056  1,

Unnamed: 0,Array,Chunk
Bytes,34.87 MiB,512.00 kiB
Shape,"(4571056,)","(65536,)"
Count,70 Tasks,70 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,104.62 MiB,1.50 MiB
Shape,"(4571056, 6)","(65536, 6)"
Count,70 Tasks,70 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 104.62 MiB 1.50 MiB Shape (4571056, 6) (65536, 6) Count 70 Tasks 70 Chunks Type int32 numpy.ndarray",6  4571056,

Unnamed: 0,Array,Chunk
Bytes,104.62 MiB,1.50 MiB
Shape,"(4571056, 6)","(65536, 6)"
Count,70 Tasks,70 Chunks
Type,int32,numpy.ndarray


## Subset Variants

We only want to include certain variants in this analysis. Below we filter the variant dataset to only include: 
* samples in the analysis set
* samples with FWS > 0.95
* samples with percent callable > 50% 
* variants that are SNPs 
* filter pass variants 
* biallelic snps 

We will need the [FWS values](https://www.malariagen.net/sites/default/files/Pv4_fws.txt) which are stored in a separate file within the repository. The following code loads the FWS data and adds it to the existing metadata:

In [235]:
pv4_fws = pd.read_csv('../supplementary_files/Pv4_fws.txt', sep='\t', comment='t')
pv4_metadata = pd.merge(pv4_metadata, pv4_fws, on='Sample', how='outer')

Filter variants to only include samples with **FWS > 0.95** and **percent callable > 50%**

**DO I INCLUDE ANALYSIS SET HERE, IT GIVES SAME ANSWER**

In [236]:
loc_filtered_samples = ((pv4_metadata['Fws'] > 0.95) & (pv4_metadata['% callable']>50)) #(pv4_metadata['Exclusion reason']=='Analysis_set'))
subset_metadata = pv4_metadata[loc_filtered_samples]
variant_dataset_filtered = variant_dataset.isel(samples=loc_filtered_samples)

Subset variants to only include ones which **pass filters** and are **snps**

In [237]:
filters = (variant_dataset_filtered['variant_filter_pass'].data) & (variant_dataset_filtered['variant_is_snp'].data) #& (variant_dataset_filtered['variant_CDS'].data)
variant_dataset_filtered = variant_dataset_filtered.isel(variants=filters)

Filter variants to only include **biallelic** snps - currently doing this further down instead 

In [238]:
biallelic_filter = (variant_dataset_filtered['variant_numalt']==1).data
variant_dataset_filtered = variant_dataset_filtered.isel(variants=biallelic_filter)
variant_dataset_filtered

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 3.48 MiB 123.66 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type int32 numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 6.96 MiB 247.33 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,5.42 kiB,5.42 kiB
Shape,"(694,)","(694,)"
Count,2 Tasks,1 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 5.42 kiB 5.42 kiB Shape (694,) (694,) Count 2 Tasks 1 Chunks Type object numpy.ndarray",694  1,

Unnamed: 0,Array,Chunk
Bytes,5.42 kiB,5.42 kiB
Shape,"(694,)","(694,)"
Count,2 Tasks,1 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,48.70 MiB,1.45 MiB
Shape,"(911901, 7)","(31658, 6)"
Count,514 Tasks,82 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 48.70 MiB 1.45 MiB Shape (911901, 7) (31658, 6) Count 514 Tasks 82 Chunks Type object numpy.ndarray",7  911901,

Unnamed: 0,Array,Chunk
Bytes,48.70 MiB,1.45 MiB
Shape,"(911901, 7)","(31658, 6)"
Count,514 Tasks,82 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 890.53 kiB 30.92 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type bool numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 890.53 kiB 30.92 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type bool numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 3.48 MiB 123.66 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type int32 numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 890.53 kiB 30.92 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type bool numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.18 GiB,2.35 MiB
Shape,"(911901, 694, 2)","(31658, 39, 2)"
Count,6660 Tasks,1230 Chunks
Type,int8,numpy.ndarray
"Array Chunk Bytes 1.18 GiB 2.35 MiB Shape (911901, 694, 2) (31658, 39, 2) Count 6660 Tasks 1230 Chunks Type int8 numpy.ndarray",2  694  911901,

Unnamed: 0,Array,Chunk
Bytes,1.18 GiB,2.35 MiB
Shape,"(911901, 694, 2)","(31658, 39, 2)"
Count,6660 Tasks,1230 Chunks
Type,int8,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,8.25 GiB,16.48 MiB
Shape,"(911901, 694, 7)","(31658, 39, 7)"
Count,6660 Tasks,1230 Chunks
Type,int16,numpy.ndarray
"Array Chunk Bytes 8.25 GiB 16.48 MiB Shape (911901, 694, 7) (31658, 39, 7) Count 6660 Tasks 1230 Chunks Type int16 numpy.ndarray",7  694  911901,

Unnamed: 0,Array,Chunk
Bytes,8.25 GiB,16.48 MiB
Shape,"(911901, 694, 7)","(31658, 39, 7)"
Count,6660 Tasks,1230 Chunks
Type,int16,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.18 GiB,2.35 MiB
Shape,"(911901, 694)","(31658, 39)"
Count,6660 Tasks,1230 Chunks
Type,int16,numpy.ndarray
"Array Chunk Bytes 1.18 GiB 2.35 MiB Shape (911901, 694) (31658, 39) Count 6660 Tasks 1230 Chunks Type int16 numpy.ndarray",694  911901,

Unnamed: 0,Array,Chunk
Bytes,1.18 GiB,2.35 MiB
Shape,"(911901, 694)","(31658, 39)"
Count,6660 Tasks,1230 Chunks
Type,int16,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,603.54 MiB,1.18 MiB
Shape,"(911901, 694)","(31658, 39)"
Count,6660 Tasks,1230 Chunks
Type,int8,numpy.ndarray
"Array Chunk Bytes 603.54 MiB 1.18 MiB Shape (911901, 694) (31658, 39) Count 6660 Tasks 1230 Chunks Type int8 numpy.ndarray",694  911901,

Unnamed: 0,Array,Chunk
Bytes,603.54 MiB,1.18 MiB
Shape,"(911901, 694)","(31658, 39)"
Count,6660 Tasks,1230 Chunks
Type,int8,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,4.72 GiB,9.42 MiB
Shape,"(911901, 694)","(31658, 39)"
Count,6660 Tasks,1230 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 4.72 GiB 9.42 MiB Shape (911901, 694) (31658, 39) Count 6660 Tasks 1230 Chunks Type object numpy.ndarray",694  911901,

Unnamed: 0,Array,Chunk
Bytes,4.72 GiB,9.42 MiB
Shape,"(911901, 694)","(31658, 39)"
Count,6660 Tasks,1230 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,4.72 GiB,9.42 MiB
Shape,"(911901, 694)","(31658, 39)"
Count,6660 Tasks,1230 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 4.72 GiB 9.42 MiB Shape (911901, 694) (31658, 39) Count 6660 Tasks 1230 Chunks Type object numpy.ndarray",694  911901,

Unnamed: 0,Array,Chunk
Bytes,4.72 GiB,9.42 MiB
Shape,"(911901, 694)","(31658, 39)"
Count,6660 Tasks,1230 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,7.07 GiB,14.13 MiB
Shape,"(911901, 694, 3)","(31658, 39, 3)"
Count,6660 Tasks,1230 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 7.07 GiB 14.13 MiB Shape (911901, 694, 3) (31658, 39, 3) Count 6660 Tasks 1230 Chunks Type int32 numpy.ndarray",3  694  911901,

Unnamed: 0,Array,Chunk
Bytes,7.07 GiB,14.13 MiB
Shape,"(911901, 694, 3)","(31658, 39, 3)"
Count,6660 Tasks,1230 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,20.87 MiB,741.98 kiB
Shape,"(911901, 6)","(31658, 6)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 20.87 MiB 741.98 kiB Shape (911901, 6) (31658, 6) Count 152 Tasks 41 Chunks Type int32 numpy.ndarray",6  911901,

Unnamed: 0,Array,Chunk
Bytes,20.87 MiB,741.98 kiB
Shape,"(911901, 6)","(31658, 6)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,20.87 MiB,741.98 kiB
Shape,"(911901, 6)","(31658, 6)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 20.87 MiB 741.98 kiB Shape (911901, 6) (31658, 6) Count 152 Tasks 41 Chunks Type float32 numpy.ndarray",6  911901,

Unnamed: 0,Array,Chunk
Bytes,20.87 MiB,741.98 kiB
Shape,"(911901, 6)","(31658, 6)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 3.48 MiB 123.66 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type int32 numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 3.48 MiB 123.66 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type int32 numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 890.53 kiB 30.92 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type bool numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 890.53 kiB 30.92 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type bool numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 890.53 kiB 30.92 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type bool numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 890.53 kiB 30.92 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type bool numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 890.53 kiB 30.92 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type bool numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 890.53 kiB 30.92 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type bool numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 890.53 kiB 30.92 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type bool numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,890.53 kiB,30.92 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 3.48 MiB 123.66 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type float32 numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 6.96 MiB 247.33 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 3.48 MiB 123.66 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type float32 numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 6.96 MiB 247.33 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 3.48 MiB 123.66 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type float32 numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 3.48 MiB 123.66 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type float32 numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 6.96 MiB 247.33 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 6.96 MiB 247.33 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 6.96 MiB 247.33 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 6.96 MiB 247.33 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 6.96 MiB 247.33 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 6.96 MiB 247.33 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 6.96 MiB 247.33 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 6.96 MiB 247.33 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 6.96 MiB 247.33 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 3.48 MiB 123.66 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type float32 numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 3.48 MiB 123.66 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type float32 numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,3.48 MiB,123.66 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 6.96 MiB 247.33 kiB Shape (911901,) (31658,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",911901  1,

Unnamed: 0,Array,Chunk
Bytes,6.96 MiB,247.33 kiB
Shape,"(911901,)","(31658,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,20.87 MiB,741.98 kiB
Shape,"(911901, 6)","(31658, 6)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 20.87 MiB 741.98 kiB Shape (911901, 6) (31658, 6) Count 152 Tasks 41 Chunks Type int32 numpy.ndarray",6  911901,

Unnamed: 0,Array,Chunk
Bytes,20.87 MiB,741.98 kiB
Shape,"(911901, 6)","(31658, 6)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray


In [198]:
# variant_dataset = variant_dataset.set_index(variants="variant_position", samples="sample_id")

Only include variants that have a frequency over 0.1 - **DO I INCLUDE THIS?**

Perform an allele count on the genotypes and convert to frequency

In [239]:
%%time
# allele count for all samples
gt = allel.GenotypeDaskArray(variant_dataset_filtered["call_genotype"].data)
ac_pop = gt.count_alleles()
ac_pop_freq = ac_pop.to_frequencies().compute()
ac_pop

CPU times: user 6min 21s, sys: 1min, total: 7min 21s
Wall time: 17min 43s


Unnamed: 0,0,1,Unnamed: 3
0,1376,2,
1,1380,2,
2,1378,0,
...,...,...,...
911898,1380,2,
911899,1378,8,
911900,1386,0,


Calculate the missingness frequency for each SNP

In [228]:
# freq_missing = gt.count_missing(axis=0).compute() / gt.shape[1]

In [229]:
# len(freq_missing)

Filter the variants to only include frequency over 0.1 and missingness less than 0.1 

In [240]:
pop_freq_filter = (ac_pop_freq[:,0] > 0.1) & (ac_pop_freq[:,1] > 0.1) #& (freq_missing < 0.1)
variant_dataset_filtered = variant_dataset_filtered.isel(variants=pop_freq_filter)
variant_dataset_filtered

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 106.64 kiB 4.29 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type int32 numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 213.28 kiB 8.59 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type object numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,5.42 kiB,5.42 kiB
Shape,"(694,)","(694,)"
Count,2 Tasks,1 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 5.42 kiB 5.42 kiB Shape (694,) (694,) Count 2 Tasks 1 Chunks Type object numpy.ndarray",694  1,

Unnamed: 0,Array,Chunk
Bytes,5.42 kiB,5.42 kiB
Shape,"(694,)","(694,)"
Count,2 Tasks,1 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.46 MiB,51.52 kiB
Shape,"(27300, 7)","(1099, 6)"
Count,596 Tasks,82 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 1.46 MiB 51.52 kiB Shape (27300, 7) (1099, 6) Count 596 Tasks 82 Chunks Type object numpy.ndarray",7  27300,

Unnamed: 0,Array,Chunk
Bytes,1.46 MiB,51.52 kiB
Shape,"(27300, 7)","(1099, 6)"
Count,596 Tasks,82 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 26.66 kiB 1.07 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type bool numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 26.66 kiB 1.07 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type bool numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 106.64 kiB 4.29 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type int32 numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 26.66 kiB 1.07 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type bool numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,36.14 MiB,83.71 kiB
Shape,"(27300, 694, 2)","(1099, 39, 2)"
Count,7890 Tasks,1230 Chunks
Type,int8,numpy.ndarray
"Array Chunk Bytes 36.14 MiB 83.71 kiB Shape (27300, 694, 2) (1099, 39, 2) Count 7890 Tasks 1230 Chunks Type int8 numpy.ndarray",2  694  27300,

Unnamed: 0,Array,Chunk
Bytes,36.14 MiB,83.71 kiB
Shape,"(27300, 694, 2)","(1099, 39, 2)"
Count,7890 Tasks,1230 Chunks
Type,int8,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,252.96 MiB,585.99 kiB
Shape,"(27300, 694, 7)","(1099, 39, 7)"
Count,7890 Tasks,1230 Chunks
Type,int16,numpy.ndarray
"Array Chunk Bytes 252.96 MiB 585.99 kiB Shape (27300, 694, 7) (1099, 39, 7) Count 7890 Tasks 1230 Chunks Type int16 numpy.ndarray",7  694  27300,

Unnamed: 0,Array,Chunk
Bytes,252.96 MiB,585.99 kiB
Shape,"(27300, 694, 7)","(1099, 39, 7)"
Count,7890 Tasks,1230 Chunks
Type,int16,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,36.14 MiB,83.71 kiB
Shape,"(27300, 694)","(1099, 39)"
Count,7890 Tasks,1230 Chunks
Type,int16,numpy.ndarray
"Array Chunk Bytes 36.14 MiB 83.71 kiB Shape (27300, 694) (1099, 39) Count 7890 Tasks 1230 Chunks Type int16 numpy.ndarray",694  27300,

Unnamed: 0,Array,Chunk
Bytes,36.14 MiB,83.71 kiB
Shape,"(27300, 694)","(1099, 39)"
Count,7890 Tasks,1230 Chunks
Type,int16,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,18.07 MiB,41.86 kiB
Shape,"(27300, 694)","(1099, 39)"
Count,7890 Tasks,1230 Chunks
Type,int8,numpy.ndarray
"Array Chunk Bytes 18.07 MiB 41.86 kiB Shape (27300, 694) (1099, 39) Count 7890 Tasks 1230 Chunks Type int8 numpy.ndarray",694  27300,

Unnamed: 0,Array,Chunk
Bytes,18.07 MiB,41.86 kiB
Shape,"(27300, 694)","(1099, 39)"
Count,7890 Tasks,1230 Chunks
Type,int8,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,144.55 MiB,334.85 kiB
Shape,"(27300, 694)","(1099, 39)"
Count,7890 Tasks,1230 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 144.55 MiB 334.85 kiB Shape (27300, 694) (1099, 39) Count 7890 Tasks 1230 Chunks Type object numpy.ndarray",694  27300,

Unnamed: 0,Array,Chunk
Bytes,144.55 MiB,334.85 kiB
Shape,"(27300, 694)","(1099, 39)"
Count,7890 Tasks,1230 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,144.55 MiB,334.85 kiB
Shape,"(27300, 694)","(1099, 39)"
Count,7890 Tasks,1230 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 144.55 MiB 334.85 kiB Shape (27300, 694) (1099, 39) Count 7890 Tasks 1230 Chunks Type object numpy.ndarray",694  27300,

Unnamed: 0,Array,Chunk
Bytes,144.55 MiB,334.85 kiB
Shape,"(27300, 694)","(1099, 39)"
Count,7890 Tasks,1230 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,216.82 MiB,502.28 kiB
Shape,"(27300, 694, 3)","(1099, 39, 3)"
Count,7890 Tasks,1230 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 216.82 MiB 502.28 kiB Shape (27300, 694, 3) (1099, 39, 3) Count 7890 Tasks 1230 Chunks Type int32 numpy.ndarray",3  694  27300,

Unnamed: 0,Array,Chunk
Bytes,216.82 MiB,502.28 kiB
Shape,"(27300, 694, 3)","(1099, 39, 3)"
Count,7890 Tasks,1230 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,639.84 kiB,25.76 kiB
Shape,"(27300, 6)","(1099, 6)"
Count,193 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 639.84 kiB 25.76 kiB Shape (27300, 6) (1099, 6) Count 193 Tasks 41 Chunks Type int32 numpy.ndarray",6  27300,

Unnamed: 0,Array,Chunk
Bytes,639.84 kiB,25.76 kiB
Shape,"(27300, 6)","(1099, 6)"
Count,193 Tasks,41 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,639.84 kiB,25.76 kiB
Shape,"(27300, 6)","(1099, 6)"
Count,193 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 639.84 kiB 25.76 kiB Shape (27300, 6) (1099, 6) Count 193 Tasks 41 Chunks Type float32 numpy.ndarray",6  27300,

Unnamed: 0,Array,Chunk
Bytes,639.84 kiB,25.76 kiB
Shape,"(27300, 6)","(1099, 6)"
Count,193 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 106.64 kiB 4.29 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type int32 numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 106.64 kiB 4.29 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type int32 numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 26.66 kiB 1.07 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type bool numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 26.66 kiB 1.07 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type bool numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 26.66 kiB 1.07 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type bool numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 26.66 kiB 1.07 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type bool numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 26.66 kiB 1.07 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type bool numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 26.66 kiB 1.07 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type bool numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 26.66 kiB 1.07 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type bool numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,26.66 kiB,1.07 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 106.64 kiB 4.29 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type float32 numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 213.28 kiB 8.59 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type object numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 106.64 kiB 4.29 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type float32 numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 213.28 kiB 8.59 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type object numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 106.64 kiB 4.29 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type float32 numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 106.64 kiB 4.29 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type float32 numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 213.28 kiB 8.59 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type object numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 213.28 kiB 8.59 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type object numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 213.28 kiB 8.59 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type object numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 213.28 kiB 8.59 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type object numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 213.28 kiB 8.59 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type object numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 213.28 kiB 8.59 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type object numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 213.28 kiB 8.59 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type object numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 213.28 kiB 8.59 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type object numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 213.28 kiB 8.59 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type object numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 106.64 kiB 4.29 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type float32 numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 106.64 kiB 4.29 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type float32 numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,106.64 kiB,4.29 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 213.28 kiB 8.59 kiB Shape (27300,) (1099,) Count 193 Tasks 41 Chunks Type object numpy.ndarray",27300  1,

Unnamed: 0,Array,Chunk
Bytes,213.28 kiB,8.59 kiB
Shape,"(27300,)","(1099,)"
Count,193 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,639.84 kiB,25.76 kiB
Shape,"(27300, 6)","(1099, 6)"
Count,193 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 639.84 kiB 25.76 kiB Shape (27300, 6) (1099, 6) Count 193 Tasks 41 Chunks Type int32 numpy.ndarray",6  27300,

Unnamed: 0,Array,Chunk
Bytes,639.84 kiB,25.76 kiB
Shape,"(27300, 6)","(1099, 6)"
Count,193 Tasks,41 Chunks
Type,int32,numpy.ndarray


## Load core region data 

Load [Pv4 regions](https://www.malariagen.net/sites/default/files/Pv4_regions.bed.gz) into pandas dataframe. This file details the chromosome, the start and end, and the type of the region.

In [241]:
pv4_regions = pd.read_csv('../supplementary_files/Pv4_regions.bed', sep='\t', comment='t', header=None)
header = ['chrom', 'chromStart', 'chromEnd', 'name']
pv4_regions.columns = header[:len(pv4_regions.columns)]

In [242]:
# VCF might be base 1, checking with Sasha to see if bed file needs to shift by 1
pv4_regions.loc[pv4_regions.name=='Core'] 

Unnamed: 0,chrom,chromStart,chromEnd,name
1,PvP01_01_v1,116541,677962,Core
3,PvP01_01_v1,679789,903591,Core
6,PvP01_02_v1,100155,162348,Core
8,PvP01_02_v1,164087,745643,Core
11,PvP01_03_v1,108061,630663,Core
13,PvP01_03_v1,632481,894722,Core
16,PvP01_04_v1,185114,564965,Core
18,PvP01_04_v1,566927,685685,Core
20,PvP01_04_v1,748923,967650,Core
23,PvP01_05_v1,143101,844198,Core


# Sliding window through regions 

Sasha is counting the missing as unique

Add in entropy and heterozygosity, where gt_freqs is the frequency for each of the unique alleles
ent = -np.sum(gt_freqs*np.log(gt_freqs))
het = 1.-np.sum(gt_freqs**2)

In [394]:
def filter_variants(variant_dataset, field, value): 
    filter_values = (variant_dataset[field]==value).data
    variant_dataset_filtered = variant_dataset.isel(variants=filter_values)
    return variant_dataset_filtered

def count_windowed_biallelic_alleles(pos, variant_dataset, window_length, step): 
    values = (variant_dataset['variant_numalt']==1).data.compute()
    biallelic, windows, counts = allel.windowed_statistic(pos, values, statistic=np.count_nonzero, 
                                                              size=window_length, step=step)
    return biallelic, windows, counts

def calc_unique_allele_freq_in_window(gt): # Does this need to be just for biallelic
    unique, index, counts = np.unique(gt, axis=1, return_index=True, return_counts=True)
#     gt_freqs = counts/sum(counts)
#     return gt_freqs
    return counts 

def windowed_unique_allele_freq(pos, variant_dataset, window_length, step): 
    values = allel.GenotypeDaskArray(variant_dataset["call_genotype"].data)
    unique_freq, windows, counts = allel.windowed_statistic(pos, values, statistic=calc_unique_allele_freq_in_window, 
                                                            size=window_length, step=step)
    return unique_freq, windows, counts

def calculate_stats(variant_dataset, window_length, step):
    pos = variant_dataset["variants"].data
    # Count biallelic snps 
    biallelic, windows, counts = count_windowed_biallelic_alleles(pos, variant_dataset, window_length, step)
    # Count unique alleles 
    gt_freqs, windows2, counts2 = windowed_unique_allele_freq(pos, variant_dataset, window_length, step)
    return biallelic, gt_freqs, windows

In [397]:
def evaluate_marker_options(variant_dataset, chrom, region_df, window_length=200, step=50000): 
    
    # Filter variants to chromosome and set index
    variant_dataset = filter_variants(variant_dataset, 'variant_chrom', chrom)
    variant_dataset = variant_dataset.set_index(variants="variant_position", samples="sample_id")
    
    # Find core region boundaries for chromosome 
    core_region_df = region_df.loc[(region_df.chrom == chrom) & (region_df.name == 'Core')]
    
    biallelic_counts = []
    unique_allele_frequencies = []
    window_start = []
    window_end = []
    
    # For each region 
    for index, row in core_region_df.iterrows():
        print(f'starting sliding window for region: {row.chromStart}-{row.chromEnd}')
        
        # Restrict variants to region 
        variant_dataset_region = variant_dataset.sel(variants=slice(row.chromStart, row.chromEnd))
        
        # STATS 
        biallelic, gt_freqs, windows = calculate_stats(variant_dataset_region, window_length, step)
        
        # Concatenate results 
        biallelic_counts = biallelic_counts + list(biallelic) #Is there a better way to do this?
        window_start = window_start + list(windows[:,0])
        window_end = window_end + list(windows[:,1])
        unique_allele_frequencies = unique_allele_frequencies + list(gt_freqs)
        
    return biallelic_counts, unique_allele_frequencies, window_start, window_end

**Evaluate Markers for one Chrom**

In [398]:
%%time 
biallelic_counts, unique_allele_frequencies, window_start, window_end = evaluate_marker_options(variant_dataset_filtered, 
                                                                                       'PvP01_02_v1', pv4_regions)

starting sliding window for region: 100155-162348
starting sliding window for region: 164087-745643
CPU times: user 25.3 s, sys: 3.93 s, total: 29.2 s
Wall time: 1min 13s


In [376]:
df = pd.DataFrame(data={'window_start':window_start,'window_end':window_end, 
                        'biallelic_counts':biallelic_counts, 'unique_allele_frequencies':unique_allele_frequencies})
df

Unnamed: 0,window_start,window_end,biallelic_counts,unique_allele_frequencies
0,100512,100711,2.0,"[0.18587896253602307, 0.08069164265129683, 0.0..."
1,150512,150711,,
2,170692,170891,2.0,"[0.001440922190201729, 0.8515850144092219, 0.0..."
3,220692,220891,,
4,270692,270891,1.0,"[0.001440922190201729, 0.25504322766570603, 0...."
5,320692,320891,,
6,370692,370891,,
7,420692,420891,1.0,"[0.8948126801152738, 0.10518731988472622]"
8,470692,470891,,
9,520692,520891,,


In [377]:
df = df.dropna()

In [389]:
unique_allele_count = []
entropy = []
het = []
for index, row in df.iterrows():
    gt_freqs = row.unique_allele_frequencies
    unique_allele_count.append(len(gt_freqs))
    entropy.append(-np.sum(gt_freqs*np.log(gt_freqs)))
    het.append(1.-np.sum(gt_freqs**2))
    
df['unique_allele_count'] = unique_allele_count
df['entropy'] = entropy
df['het'] = het

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == "":


In [390]:
df

Unnamed: 0,window_start,window_end,biallelic_counts,unique_allele_frequencies,unique_allele_count,entropy,het
0,100512,100711,2.0,"[0.18587896253602307, 0.08069164265129683, 0.0...",4,0.762102,0.42523
2,170692,170891,2.0,"[0.001440922190201729, 0.8515850144092219, 0.0...",6,0.517952,0.260495
4,270692,270891,1.0,"[0.001440922190201729, 0.25504322766570603, 0....",4,0.58869,0.384274
7,420692,420891,1.0,"[0.8948126801152738, 0.10518731988472622]",2,0.336333,0.188246
12,670692,670891,1.0,"[0.001440922190201729, 0.7060518731988472, 0.0...",4,0.643089,0.419234


**Evaluate Markers for all Chrom**

In [None]:
%%time 

chromosomes = np.unique(variant_dataset_filtered["variant_chrom"].data.compute())
biallelic_counts = []
window_start = []
window_end = []
chrom_list = []
for chrom in chromosomes: 
    print(f'Chromosome: {chrom}')
    biallelic, start, end = evaluate_marker_options(variant_dataset_filtered, chrom, pv4_regions)
    biallelic_counts = biallelic_counts + biallelic
    window_start = window_start + start
    window_end = window_end + end
    chrom_list = chrom_list + ([chrom]*len(start))

In [None]:
results_df = pd.DataFrame(data={'chrom':chrom_list,'window_start':window_start,'window_end':window_end,
                        'biallelic_counts':biallelic_counts})
results_df

# Playground 

In [307]:
def count_unique_alleles_in_window(gt): # Does this need to be just for biallelic
    unique, index, counts = np.unique(gt, axis=1, return_index=True, return_counts=True)
    n_unique = len(counts)
    gt_freqs = counts/n_unique
    return n_unique, gt_freqs

data = [[[0,0],[1,1],[1,1],[1,1],[0,0]],
        [[0,0],[1,1],[-1,-1],[1,1],[0,0]],
        [[0,0],[0,1],[1,0],[1,1],[0,0]],
        [[0,0],[0,1],[1,1],[1,-1],[0,0]],]

gt = allel.GenotypeDaskArray(data)

n_unique, gt_freqs = count_unique_alleles_in_window(gt)

t_nalt_in_win = np_genotypes[:,snps_to_consider]

df_win_gt = pd.Series([str(x) for x in t_nalt_in_win.tolist()])

gt_freqs = df_win_gt.value_counts(normalize=True)

**entropy**
ent = -np.sum(gt_freqs*np.log(gt_freqs))

**het**
het = 1.-np.sum(gt_freqs**2)

**allele count**
n_all = len(gt_freqs)

In [308]:
n_unique, gt_freqs

(4, array([0.5 , 0.25, 0.25, 0.25]))

In [309]:
ent = -np.sum(gt_freqs*np.log(gt_freqs))
ent

1.3862943611198906

In [310]:
het = 1.-np.sum(gt_freqs**2)
het

0.5625

# Legacy Functions 

In [None]:
def filter_variants(variant_dataset, field, value): 
    filter_values = (variant_dataset[field]==value).data
    variant_dataset_filtered = variant_dataset.isel(variants=filter_values)
    return variant_dataset_filtered

def count_windowed_biallelic_alleles(pos, variant_dataset, window_length, step): 
    values = (variant_dataset['variant_numalt']==1).data.compute()
    # USE THIS TO START FROM BOUNDARY EDGE 
#     pos = np.insert(pos,0, row.chromStart) 
#     values = np.insert(values,0, 0)
    biallelic, windows, counts = allel.windowed_statistic(pos, values, statistic=np.count_nonzero, 
                                                              size=window_length, step=step)
    return biallelic, windows, counts

def count_unique_alleles_in_window(gt): # Does this need to be just for biallelic
    unique,index = np.unique(gt, axis=1, return_index=True)
    # If just N thats different don't count as unique 
#     unique_alleles = allel.GenotypeDaskArray(unique)
#     n_missing_replicates = alleles_with_only_missing_differences(unique_alleles)
#     n_unique_alleles = len(index) #- n_missing_replicates
    return len(index)

# def alleles_with_only_missing_differences(alleles_gt): 
#     # Remove alleles that are only different because they contain missing 
#     missing_replicates = []
#     # check alleles with ./.
#     alleles_with_missing = np.unique(np.where(alleles_gt==-1)[1].compute())
    
#     for al1_index in alleles_with_missing: 
#         al1 = (alleles_gt[:,al1_index].compute())

#         for al2_index in range(alleles_gt.shape[1]): 
#             if al2_index == al1_index or al2_index in missing_replicates: 
#                 continue 
#             al2 = (alleles_gt[:,al2_index].compute())
#             true_differences = 0 
#             # Is only difference missing variants
#             for var in range(len(al2)):
#                 if list(al1[var]) != [-1,-1] and list(al2[var]) != [-1,-1]:
#                     if list(al1[var]) != list(al2[var]): 
#                         true_differences +=1    
#                         break

#             # If al1 and al2 are replicates 
#             if true_differences == 0 :
#                 missing_replicates.append(al1_index)
#                 break

#     return len(missing_replicates)

def count_windowed_unique_alleles(pos, variant_dataset, window_length, step): 
    values = allel.GenotypeDaskArray(variant_dataset["call_genotype"].data)
    n_alleles, windows, counts = allel.windowed_statistic(pos, values, statistic=count_unique_alleles_in_window, 
                                                              size=window_length, step=step)
    return n_alleles, windows, counts


def calculate_stats(variant_dataset, window_length, step):
    pos = variant_dataset["variants"].data
    
    # Count biallelic snps 
    biallelic, windows, counts = count_windowed_biallelic_alleles(pos, variant_dataset, window_length, step)
    # Count unique alleles 
    n_alleles, windows2, counts2 = count_windowed_unique_alleles(pos, variant_dataset, window_length, step)

    return biallelic, n_alleles, windows