# Marker investigation 

In this notebook, the variants in the [Pv4 data release](https://www.malariagen.net/resource/30) are used to identify regions of the core genome that are microhaplotype candidates (<200 bp in length). The notebook consists of three parts:
- Loading the data
- Subsetting the variants and samples
- Sliding a window across the core regions of the genome and calculating statistics for each window

Files used in this notebook are available through the [Pv4 data release](https://www.malariagen.net/resource/30), but are also attached to the repo.

This notebook could be used as a template to further analyse microhaplotype candidates yourself, changing the filters to subset the data and the statistics calculated for each window.
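
To make the sliding-window idea concrete, here is a minimal sketch that counts SNPs in fixed-width windows over a region. It is not the analysis performed later in this notebook: the function name `sliding_window_snp_counts`, the 100 bp step, the choice of statistic (a simple SNP count) and the toy positions are all assumptions made for illustration.

```python
import numpy as np


def sliding_window_snp_counts(positions, start, end, window_size=200, step=100):
    """Count SNPs in each half-open window [w, w + window_size) within [start, end)."""
    positions = np.sort(np.asarray(positions))
    # Start coordinate of every window that fits entirely inside the region
    window_starts = np.arange(start, end - window_size + 1, step)
    # searchsorted returns the index of the first position >= each boundary,
    # so the difference is the number of positions falling inside the window
    left = np.searchsorted(positions, window_starts, side="left")
    right = np.searchsorted(positions, window_starts + window_size, side="left")
    return window_starts, right - left


# Toy SNP positions (invented), not taken from Pv4
toy_positions = [105, 150, 190, 420, 430, 610]
starts, counts = sliding_window_snp_counts(toy_positions, start=100, end=700)

for s, c in zip(starts, counts):
    print(f"window {s}-{s + 200}: {c} SNPs")
```

Any other per-window statistic can be substituted for the count by slicing the relevant arrays with `left` and `right`.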

## Setup 

In [1]:
from malariagen_data.pv4 import Pv4
import pandas as pd
import numpy as np
import allel
import dask.array as da
import collections
import math

In [2]:
# Suppress NumPy's VisibleDeprecationWarning (np.warnings was removed in NumPy 1.24)
import warnings
warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)

## Load Data  

In this notebook we use the [malariagen_data Python package](https://github.com/malariagen/malariagen-data-python) to access the Pv4 data stored on the cloud. Further information on using this package can be found [here](https://malariagen.github.io/parasite-data/landing-page.html). 

We initialise access to the data with the line of code below:

In [3]:
pv4 = Pv4()

Using this we can load the **sample metadata**

In [4]:
pv4_metadata = pv4.sample_metadata()

pv4_metadata.head()

| | Sample | Study | Site | First-level administrative division | Country | Lat | Long | Year | ENA | All samples same individual | Population | % callable | QC pass | Exclusion reason | Is returning traveller |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BBH-1-125 | X0009-PV-ET-LO | Jimma | Ethiopia: Oromia | Ethiopia | 7.683331 | 36.851318 | 2016 | ERR2678989 | BBH-1-125 | AF | 88.52 | True | Analysis_set | False |
| 1 | BBH_1_132 | X0009-PV-ET-LO | Jimma | Ethiopia: Oromia | Ethiopia | 7.683331 | 36.851318 | 2016 | ERR2678991 | BBH_1_132 | AF | 90.2 | True | Analysis_set | False |
| 2 | BBH_1_137 | X0009-PV-ET-LO | Jimma | Ethiopia: Oromia | Ethiopia | 7.683331 | 36.851318 | 2016 | ERR2679003 | BBH_1_137 | AF | 87.09 | True | Analysis_set | False |
| 3 | BBH_1_153 | X0009-PV-ET-LO | Jimma | Ethiopia: Oromia | Ethiopia | 7.683331 | 36.851318 | 2016 | ERR2678992 | BBH_1_153 | AF | 90.6 | True | Analysis_set | False |
| 4 | BBH_1_162 | X0009-PV-ET-LO | Jimma | Ethiopia: Oromia | Ethiopia | 7.683331 | 36.851318 | 2016 | ERR2678993 | BBH_1_162 | AF | 91.67 | True | Analysis_set | False |


We can also use the package to load the **variant data**

In [5]:
variant_dataset = pv4.variant_calls(extended=True)
variant_dataset

[Output: summary of the `xarray.Dataset` returned by `pv4.variant_calls`. Every variable is a lazily loaded, chunked dask array. The arrays span 4,571,056 variants and 1,895 samples; the per-call arrays have shapes such as (4571056, 1895, 2) and range from about 8 GiB to 113 GiB each.]
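
The sizes reported in this output follow directly from each array's shape and dtype, which is why the data are accessed lazily in chunks rather than read into memory at once. As a rough check, the sketch below recomputes the size of the `int8` array of shape (4571056, 1895, 2) and of one of its (65536, 64, 2) chunks. The helper name `array_nbytes` is introduced here for illustration only.

```python
import numpy as np


def array_nbytes(shape, dtype):
    """Bytes needed to hold a dense array of the given shape and dtype."""
    return int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize


# Shape, chunk shape and dtype copied from the dataset summary
full_bytes = array_nbytes((4571056, 1895, 2), "int8")
chunk_bytes = array_nbytes((65536, 64, 2), "int8")

full_gib = full_bytes / 2**30
chunk_mib = chunk_bytes / 2**20
print(f"{full_gib:.2f} GiB in total, {chunk_mib:.2f} MiB per chunk")
```

Subsetting samples and variants before computing anything keeps the amount of data that actually has to be loaded far below these totals.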


## Subset Samples

Now that we have access to the metadata and variants, we can apply filters to subset the data.

In our analysis we only included samples which:
* Are **clonal** (FWS > 0.95)
* Have **percent callable > 50%**
* Are **unique** (we identify these as samples whose "Exclusion reason" in the metadata is `Analysis_set`)
* Were in Pv1 or in one of the studies listed below (due to data restrictions at the time of analysis)

A copy of the [FWS values](https://www.malariagen.net/sites/default/files/Pv4_fws.txt) calculated for Pv4 is stored within this repository, along with a file listing whether each sample was included in Pv1 (an older public release of vivax data).
Below we load the FWS and Pv1 information and add it to the existing metadata:

In [6]:
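pv4_fws = pd.read_csv('../supplementary_files/Pv4_fws.txt', sep='\t', comment='t')
# Left merge keeps one row per sample in the original metadata order, so masks
# built from the metadata stay aligned with the samples in the variant dataset
pv4_metadata = pd.merge(pv4_metadata, pv4_fws, on='Sample', how='left')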

In [7]:
pv1_samples = pd.read_csv('../supplementary_files/Samples_in_Pv1.tsv', sep='\t')
pv4_metadata = pv4_metadata.merge(pv1_samples, on='Sample', how='left')
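
Because the sample filter built from this table is later applied by position to the variant dataset, the merged metadata needs to keep exactly one row per sample, in the original order. The sketch below shows one way this could be checked. The toy tables and values are invented, and the `validate` argument is an optional safeguard rather than something this notebook uses.

```python
import pandas as pd

# Toy metadata and FWS tables (invented values); "S9" has no metadata row
metadata = pd.DataFrame(
    {"Sample": ["S3", "S1", "S2"], "Country": ["Ethiopia", "Thailand", "Cambodia"]}
)
fws = pd.DataFrame({"Sample": ["S1", "S2", "S3", "S9"], "Fws": [0.99, 0.80, 0.97, 0.50]})

# how="left" keeps every metadata row in its original order and ignores "S9";
# validate raises an error if a sample appears more than once on either side
merged = metadata.merge(fws, on="Sample", how="left", validate="one_to_one")

# The merged table still lines up row-for-row with the original metadata
assert len(merged) == len(metadata)
assert (merged["Sample"] == metadata["Sample"]).all()
print(merged)
```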

Next we identify samples that meet our criteria and filter the variants to only include these samples.

In [8]:
useable_studies = [
    '1128-PV-MULTI-GSK', '1154-PV-TH-PRICE', '1157-PV-MULTI-PRICE', 'X0001-PV-MULTI-HUPALO2016', 
    'X0002-PV-KH-PAROBEK2016'
]

loc_filtered_samples = (
    (pv4_metadata["Study"].isin(useable_studies) | pv4_metadata.in_pv_10)
    & (pv4_metadata["Fws"] > 0.95)
    & (pv4_metadata["% callable"] > 50)
    & (pv4_metadata["Exclusion reason"] == "Analysis_set")
)
subset_metadata = pv4_metadata[loc_filtered_samples]
variant_dataset_filtered = variant_dataset.isel(samples=loc_filtered_samples)
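
The same pattern can be tried on a small scale. The sketch below builds a boolean mask from a toy metadata table and uses it to select columns of a toy (variants, samples) array, assuming the metadata rows are in the same order as the samples axis. All values are invented, and the study/Pv1 criterion is left out for brevity. A sample with a missing FWS value fails the `> 0.95` comparison and is therefore dropped.

```python
import numpy as np
import pandas as pd

# Toy metadata (invented values); S4 has no FWS value
toy_metadata = pd.DataFrame(
    {
        "Sample": ["S1", "S2", "S3", "S4"],
        "Fws": [0.99, 0.90, 0.98, np.nan],
        "% callable": [88.5, 91.0, 45.0, 90.0],
        "Exclusion reason": ["Analysis_set"] * 4,
    }
)

# Each criterion gives one boolean per sample; & keeps samples passing all of them
sample_mask = (
    (toy_metadata["Fws"] > 0.95)
    & (toy_metadata["% callable"] > 50)
    & (toy_metadata["Exclusion reason"] == "Analysis_set")
).to_numpy()

# Toy call array with shape (variants, samples)
toy_calls = np.arange(12).reshape(3, 4)
subset_calls = toy_calls[:, sample_mask]

print(toy_metadata.loc[sample_mask, "Sample"].tolist())
print(subset_calls.shape)
```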

## Subset variants 

We also apply filters to the variants to only include ones which **pass filters** in the MalariaGEN dataset and are **coding SNPs**  

In [9]:
filters = (
    (variant_dataset_filtered["variant_filter_pass"].data)
    & (variant_dataset_filtered["variant_is_snp"].data)
    & (variant_dataset_filtered["variant_CDS"].data)
)
variant_dataset_filtered = variant_dataset_filtered.isel(variants=filters)
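
The flags are combined element-wise, so a variant is kept only when every flag is true. The sketch below uses toy NumPy arrays (invented values) in place of the dataset variables. It also illustrates that any later mask, such as one on the number of alternate alleles, has to be computed from the already-filtered arrays so that its length matches.

```python
import numpy as np

# Toy per-variant flags (invented values)
filter_pass = np.array([True, True, False, True, True])
is_snp = np.array([True, False, True, True, True])
is_cds = np.array([True, True, True, False, True])
numalt = np.array([1, 1, 1, 1, 2])
positions = np.array([100, 150, 220, 300, 410])

# First stage: keep variants that satisfy all three flags
keep = filter_pass & is_snp & is_cds
positions_kept = positions[keep]
numalt_kept = numalt[keep]

# Second stage: the mask is built from the filtered array, so lengths agree
biallelic = numalt_kept == 1
positions_final = positions_kept[biallelic]

print(int(keep.sum()), "variants after the first stage")
print(positions_final)
```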

We further filter these to only include **biallelic** SNPs:

In [10]:
biallelic_filter = (variant_dataset_filtered["variant_numalt"] == 1).data
variant_dataset_filtered = variant_dataset_filtered.isel(variants=biallelic_filter)
variant_dataset_filtered

[Output: summary of the filtered `xarray.Dataset`. After the sample and variant filters, the arrays span 440,222 variants and 615 samples.]
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 1.68 MiB 62.07 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type float32 numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 3.36 MiB 124.13 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 1.68 MiB 62.07 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type float32 numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 1.68 MiB 62.07 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type float32 numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 3.36 MiB 124.13 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 3.36 MiB 124.13 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 3.36 MiB 124.13 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 3.36 MiB 124.13 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 3.36 MiB 124.13 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 3.36 MiB 124.13 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 3.36 MiB 124.13 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 3.36 MiB 124.13 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 3.36 MiB 124.13 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 1.68 MiB 62.07 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type float32 numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray
"Array Chunk Bytes 1.68 MiB 62.07 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type float32 numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,1.68 MiB,62.07 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,float32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 3.36 MiB 124.13 kiB Shape (440222,) (15889,) Count 152 Tasks 41 Chunks Type object numpy.ndarray",440222  1,

Unnamed: 0,Array,Chunk
Bytes,3.36 MiB,124.13 kiB
Shape,"(440222,)","(15889,)"
Count,152 Tasks,41 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,10.08 MiB,372.40 kiB
Shape,"(440222, 6)","(15889, 6)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 10.08 MiB 372.40 kiB Shape (440222, 6) (15889, 6) Count 152 Tasks 41 Chunks Type int32 numpy.ndarray",6  440222,

Unnamed: 0,Array,Chunk
Bytes,10.08 MiB,372.40 kiB
Shape,"(440222, 6)","(15889, 6)"
Count,152 Tasks,41 Chunks
Type,int32,numpy.ndarray


In [11]:
np.unique(variant_dataset_filtered["variant_chrom"].data.compute())

array(['PvP01_01_v1', 'PvP01_02_v1', 'PvP01_03_v1', 'PvP01_04_v1',
       'PvP01_05_v1', 'PvP01_06_v1', 'PvP01_07_v1', 'PvP01_08_v1',
       'PvP01_09_v1', 'PvP01_10_v1', 'PvP01_11_v1', 'PvP01_12_v1',
       'PvP01_13_v1', 'PvP01_14_v1'], dtype=object)

We now filter these variants to keep only those with a **high global minor allele frequency (> 0.1) and low missingness (< 0.1)**.

To do this, we first count the alleles across the genotypes and convert the counts to frequencies.
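As a rough illustration of what the allele count and frequency conversion produce, the sketch below does the same thing by hand with NumPy on a tiny made-up genotype array (3 variants, 4 diploid samples, -1 for missing calls). The array `toy_gt` and the helper `allele_frequencies` are hypothetical names used only here; the notebook itself relies on `count_alleles()` and `to_frequencies()` from scikit-allel.

```python
import numpy as np

# Toy genotype calls: shape (variants, samples, ploidy); -1 marks a missing call
toy_gt = np.array(
    [
        [[0, 0], [0, 1], [1, 1], [-1, -1]],
        [[0, 0], [0, 0], [0, 0], [0, 1]],
        [[1, 1], [1, 1], [0, 1], [0, 0]],
    ],
    dtype=np.int8,
)


def allele_frequencies(gt, n_alleles=2):
    """Count each allele per variant (ignoring missing calls) and normalise."""
    flat = gt.reshape(gt.shape[0], -1)
    counts = np.stack([(flat == a).sum(axis=1) for a in range(n_alleles)], axis=1)
    called = counts.sum(axis=1, keepdims=True)
    # Variants with no called alleles get NaN rather than a division error
    freqs = np.where(called > 0, counts / np.maximum(called, 1), np.nan)
    return counts, freqs


toy_counts, toy_freqs = allele_frequencies(toy_gt)
print(toy_counts)
print(toy_freqs)
```

Missing calls are left out of the denominator, so each row of frequencies sums to 1 over the called alleles.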

In [12]:
%%time
# allele frequency for all samples
gt = allel.GenotypeDaskArray(variant_dataset_filtered["call_genotype"].data)
ac_pop = gt.count_alleles()
ac_pop_freq = ac_pop.to_frequencies().compute()

CPU times: user 3min 28s, sys: 37.1 s, total: 4min 5s
Wall time: 4min 46s


Next, we calculate the missingness for each SNP, that is, the proportion of samples with a missing genotype call.

In [13]:
%%time 
freq_missing = gt.count_missing(axis=1).compute() / gt.shape[1]

CPU times: user 1min 56s, sys: 20.2 s, total: 2min 16s
Wall time: 2min 21s


We can now use the allele and missingness frequencies to keep only variants with a minor allele frequency above 0.1 and missingness below 0.1.
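A minimal sketch of how such a combined mask behaves, using small hypothetical arrays (`example_freqs`, `example_missing`) in place of `ac_pop_freq` and `freq_missing`. The minor allele frequency is taken as the smaller of the first two allele frequencies, which assumes the variants are biallelic.

```python
import numpy as np

# Hypothetical per-variant frequencies of the reference and first alternate allele
example_freqs = np.array([[0.50, 0.50], [0.95, 0.05], [0.70, 0.30], [0.60, 0.40]])
# Hypothetical fraction of samples with a missing call at each variant
example_missing = np.array([0.02, 0.01, 0.25, 0.09])

# Minor allele frequency: the smaller of the two allele frequencies
maf = example_freqs[:, :2].min(axis=1)
# Keep variants that pass both thresholds
keep = (maf > 0.1) & (example_missing < 0.1)

print(keep)
print(int(keep.sum()))
```

The second variant fails on allele frequency and the third on missingness, so only two of the four toy variants pass.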

In [17]:
pop_freq_filter = (ac_pop_freq[:, :2].min(axis=1) > 0.1) & (freq_missing < 0.1)
variant_dataset_filtered = variant_dataset_filtered.isel(variants=pop_freq_filter)
variant_dataset_filtered

Output (abridged xarray Dataset summary): after filtering, 13084 variants remain across 615 samples; the genotype call array has shape (13084, 615, 2) and type int8.


## Load core region data 

When calculating statistics on microhaplotype candidates, we restrict the analysis to the core genome. The region boundaries are stored in [Pv4 regions](https://www.malariagen.net/sites/default/files/Pv4_regions.bed.gz), which we load into a pandas dataframe. This file gives the chromosome, start, end, and type of each region.

In [18]:
pv4_regions = pd.read_csv(
    "../supplementary_files/Pv4_regions.bed", sep="\t", comment="t", names=["chrom", "chromStart", "chromEnd", "name"]
)
pv4_regions.head()

Unnamed: 0,chrom,chromStart,chromEnd,name
0,PvP01_01_v1,0,116541,Sub
1,PvP01_01_v1,116541,677962,Core
2,PvP01_01_v1,677962,679789,Cen
3,PvP01_01_v1,679789,903591,Core
4,PvP01_01_v1,903591,1021664,Sub


This file uses 0-based coordinates, so below we add 1 to the start and end columns to make them consistent with the 1-based positions in our variant data.

In [19]:
pv4_regions[["chromStart", "chromEnd"]] += 1
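For reference, BED intervals are 0-based and half-open, so there are two common ways to express them in 1-based terms. The sketch below shows both on the first row of the regions table, in plain Python: shifting both ends by one (as in the cell above) keeps the interval half-open, while shifting only the start gives a closed interval. Which one is appropriate depends on how the end coordinate is used later, for example with a slice that includes its end point. The function names are hypothetical.

```python
def shift_both(start, end):
    """Return (s, e) in 1-based, half-open form: positions s..e-1 are covered."""
    return start + 1, end + 1


def to_closed(start, end):
    """Return (s, e) in 1-based, closed form: positions s..e are covered."""
    return start + 1, end


bed_start, bed_end = 0, 116541  # first row of the regions table

half_open = shift_both(bed_start, bed_end)
closed = to_closed(bed_start, bed_end)

# Both describe the same number of bases as the original BED interval
assert len(range(half_open[0], half_open[1])) == bed_end - bed_start
assert len(range(closed[0], closed[1] + 1)) == bed_end - bed_start
print(half_open, closed)
```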

Below we count the filtered variants that fall within the core regions, giving the total number of variants that will be included in our analysis.

In [20]:
total_variants = 0

for index, row in (pv4_regions.loc[pv4_regions.name == "Core"]).iterrows():

    filter_values = (variant_dataset_filtered["variant_chrom"] == row.chrom).data
    variant_dataset_chrom = variant_dataset_filtered.isel(variants=filter_values)

    test_variants = variant_dataset_chrom.set_index(
        variants="variant_position", samples="sample_id"
    )

    variant_count = test_variants.sel(
        variants=slice(row.chromStart, row.chromEnd)
    ).dims["variants"]
    total_variants += variant_count
print("Total variants to be included in analysis : ", total_variants)

Total variants to be included in analysis :  13084


## Sliding window through core regions

The core genome was then scanned in coding regions for all 200 bp windows containing more than one of the identified variants, and these windows were filtered for high diversity (global heterozygosity ≥ 0.5).

The function `evaluate_marker_options` below uses the other functions defined here to perform a sliding window through the core regions of the specified chromosome. For each window the following is calculated:
- the number of biallelic SNPs in the window
- for each unique allele (the combination of genotype calls across the variants in the window), how many samples carry that allele
- which of the unique alleles contain missing or heterozygous calls
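The idea behind the sliding window can be sketched on toy data with NumPy alone. This is a simplified stand-in, not the notebook's implementation: the positions, calls, and the helper `window_summaries` are made up, calls are coded one value per sample rather than per chromosome copy, windows start at a fixed coordinate, and windows that contain the same set of variants are not merged.

```python
import numpy as np

# Hypothetical variant positions (1-based) and calls for 5 samples
# rows = variants, columns = samples; 0 = reference, 1 = alternate, -1 = missing
positions = np.array([120, 180, 260, 900])
calls = np.array(
    [
        [0, 0, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [1, 1, 0, 0, -1],
        [0, 0, 0, 1, 1],
    ]
)


def window_summaries(positions, calls, start, stop, size=200, step=50):
    """Yield (window_start, window_end, n_variants, allele_counts) for non-empty windows."""
    for w_start in range(start, stop - size + 2, step):
        w_end = w_start + size - 1
        in_window = (positions >= w_start) & (positions <= w_end)
        if not in_window.any():
            continue
        # Unique columns are the distinct alleles seen across the window
        _, counts = np.unique(calls[in_window], axis=1, return_counts=True)
        yield w_start, w_end, int(in_window.sum()), counts.tolist()


results = list(window_summaries(positions, calls, start=1, stop=1000))
for row in results[:3]:
    print(row)
```

Each result holds the window bounds, the number of variants inside, and the number of samples carrying each distinct allele, which is the input needed for the diversity statistics.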

In [23]:
def filter_variants(variant_dataset, field, value):
    filter_values = (variant_dataset[field] == value).data
    variant_dataset_filtered = variant_dataset.isel(variants=filter_values)
    return variant_dataset_filtered


def variant_positions(positions):
    return list(positions)


def unique_allele_counts_in_window(gt):
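    # Each slice along axis 1 holds one sample's calls across the window,
    # so the unique slices are the distinct alleles observed in the window
    unique, index, counts = np.unique(gt, axis=1, return_counts=True, return_index=True)
    # Record which unique alleles contain a missing (-1) or heterozygous call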
    alleles_with_missing = []
    alleles_with_het = []
    for i in range(len(index)):
        if -1 in (gt[:, index[i]].compute()):
            alleles_with_missing.append(i)
        if True in gt[:, int(index[i])].is_het().compute():
            alleles_with_het.append(i)

    return counts, alleles_with_missing, alleles_with_het


def calculate_stats(variant_dataset, window_length, step):
    pos = variant_dataset["variants"].data

    # Find windows with variants
    n_variants, windows = allel.windowed_count(
        pos, size=window_length, step=step
    )
    index_with_variants = [i for i, var in enumerate(n_variants) if var != 0]
    window_with_variants = [list(windows[i]) for i in index_with_variants]

    # Find windows with unique variants
    positions, windows, counts = allel.windowed_statistic(
        pos, pos, statistic=variant_positions, windows=window_with_variants
    )
    unique_var, unique_var_index = np.unique(positions, return_index=True)
    unique_windows = [list(windows[i]) for i in unique_var_index]
    
    # Count occurrences of each unique allele
    values = allel.GenotypeDaskArray(variant_dataset["call_genotype"].data)
    allele_counts, windows, counts = allel.windowed_statistic(
        pos,
        values,
        statistic=unique_allele_counts_in_window,
        windows=unique_windows,
        fill=[0, None, None],
    )
    n_variants, windows = allel.windowed_count(
        pos, windows=unique_windows
    )
    return n_variants, allele_counts, windows

In [24]:
def evaluate_marker_options(
    variant_dataset, chrom, region_df, window_length=200, step=50
):

    # Filter variants to chromosome and set index
    variant_dataset = filter_variants(variant_dataset, "variant_chrom", chrom)
    variant_dataset = variant_dataset.set_index(
        variants="variant_position", samples="sample_id"
    )

    # Find core region boundaries for chromosome
    core_region_df = region_df.loc[
        (region_df.chrom == chrom) & (region_df.name == "Core")
    ]

    biallelic_counts = []
    unique_allele_counts = []
    unique_alleles_with_missing = []
    unique_alleles_with_het = []
    window_start = []
    window_end = []
    variant_counts = []

    # For each region
    for index, row in core_region_df.iterrows():
        print(f"starting sliding window for region: {row.chromStart}-{row.chromEnd}")

        # Restrict variants to region
        variant_dataset_region = variant_dataset.sel(
            variants=slice(row.chromStart, row.chromEnd)
        )

        # STATS
        n_variants, allele_counts, windows = calculate_stats(
            variant_dataset_region, window_length, step
        )

        # Concatenate results
        window_start = window_start + list(windows[:, 0])
        window_end = window_end + list(windows[:, 1])
        variant_counts = variant_counts + list(n_variants)
        unique_allele_counts = unique_allele_counts + list(
            allele_counts[:, 0]
        )
        unique_alleles_with_missing = unique_alleles_with_missing + list(
            allele_counts[:, 1]
        )
        unique_alleles_with_het = unique_alleles_with_het + list(allele_counts[:, 2])

    return (
        variant_counts,
        unique_allele_counts,
        unique_alleles_with_missing,
        unique_alleles_with_het,
        window_start,
        window_end,
    )
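
The three per-window values collected above (the counts of each unique allele, plus the indices of unique alleles that contain a missing or a heterozygous call) can be illustrated on a toy window. The sketch below is a plain-Python approximation using `collections.Counter`, not the notebook's `np.unique`-based function; the sample genotypes are made up, and the definition of heterozygous used here (both alleles called and different) is an assumption about how such calls are flagged.

```python
from collections import Counter

# Hypothetical window with two variants: one entry per sample, each entry
# holding one diploid call per variant. -1 marks a missing allele.
sample_genotypes = [
    ((0, 0), (1, 1)),
    ((0, 0), (1, 1)),
    ((0, 1), (1, 1)),    # heterozygous at the first variant
    ((0, 0), (-1, -1)),  # missing at the second variant
    ((0, 0), (1, 1)),
]

tally = Counter(sample_genotypes)
unique_genotypes = sorted(tally)  # fixed order so indices are reproducible
unique_counts = [tally[g] for g in unique_genotypes]

# Indices of unique multi-variant genotypes with at least one missing call
missing_idx = [
    i for i, g in enumerate(unique_genotypes)
    if any(-1 in call for call in g)
]

# Indices with at least one fully called heterozygous call
het_idx = [
    i for i, g in enumerate(unique_genotypes)
    if any(-1 not in call and call[0] != call[1] for call in g)
]

print(unique_counts, missing_idx, het_idx)
```

Keeping the indices alongside the counts makes it possible to exclude, or at least inspect, the unique alleles that only exist because of missing data or mixed calls before computing diversity statistics.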

# Sliding window, entropy and heterozygosity for all chromosomes

The code below applies the functions defined above to run the sliding window over each chromosome. The per-window results are then used to calculate the entropy and heterozygosity with the following formulas:

$\text{entropy} = -\sum_i \text{gt\_freqs}_i \log(\text{gt\_freqs}_i)$

$\text{heterozygosity} = 1 - \sum_i \text{gt\_freqs}_i^2$

where `gt_freqs` are the counts of each unique allele in the window divided by the total count, and $\log$ is the natural logarithm (`np.log` in the code).
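
As a small worked example of the two formulas, the sketch below evaluates them for a single hypothetical window using only the standard library. The function name `entropy_and_heterozygosity` and the counts `[6, 3, 1]` are illustrative assumptions, not values taken from the Pv4 data.

```python
import math


def entropy_and_heterozygosity(counts):
    """Return (entropy, heterozygosity) for the counts of each unique allele."""
    total = sum(counts)
    freqs = [c / total for c in counts]
    # Natural-log entropy; zero frequencies are skipped to avoid log(0)
    entropy = -sum(p * math.log(p) for p in freqs if p > 0)
    heterozygosity = 1.0 - sum(p ** 2 for p in freqs)
    return entropy, heterozygosity


# Hypothetical window: three unique alleles observed 6, 3 and 1 times
example_entropy, example_het = entropy_and_heterozygosity([6, 3, 1])
print(round(example_entropy, 4), round(example_het, 4))

# A window with a single unique allele carries no diversity
single_entropy, single_het = entropy_and_heterozygosity([10])
print(single_entropy, single_het)
```

Both statistics are zero when every sample carries the same allele, and both grow as the number of unique alleles increases and their frequencies become more even.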

In [25]:
%%time 
chromosomes = np.unique(variant_dataset_filtered["variant_chrom"].data.compute())
for chrom in chromosomes:
    print('Chromosome:',chrom)
    # Calculate window stats
    (
        variant_counts,
        unique_allele_counts,
        unique_alleles_with_missing,
        unique_alleles_with_het,
        window_start,
        window_end,
    ) = evaluate_marker_options(variant_dataset_filtered, chrom, pv4_regions)
    
    # Format data
    df = pd.DataFrame(
        data={
            "window_start": window_start,
            "window_end": window_end,
            "variant_counts": variant_counts,
            "unique_allele_counts": unique_allele_counts,
            "unique_alleles_with_missing_index": unique_alleles_with_missing,
            "unique_alleles_with_het_index": unique_alleles_with_het,
        }
    )
    
    # Calculate entropy and heterozygosity
    unique_allele_count = []
    unique_allele_freqs = []
    entropy = []
    het = []
    df_with_stats = df.copy()
    for index, row in df.iterrows():
        gt_counts = row.unique_allele_counts
        n_alleles = len(gt_counts)
        gt_freqs = gt_counts/sum(gt_counts)

        unique_allele_freqs.append(list(gt_freqs))
        unique_allele_count.append(n_alleles)
        entropy.append(-np.sum(gt_freqs * np.log(gt_freqs)))
        het.append(1.0 - np.sum(gt_freqs ** 2))
    df_with_stats["unique_allele_frequencies"] = unique_allele_freqs
    df_with_stats["unique_allele_count"] = unique_allele_count
    df_with_stats["entropy"] = entropy
    df_with_stats["het"] = het
    
    # Output to csv
    df_with_stats.to_csv(f"sliding_window_results/{chrom}_windowed_heterozygosity.csv")

Chromosome: PvP01_01_v1
starting sliding window for region: 116542-677963
starting sliding window for region: 679790-903592
Chromosome: PvP01_02_v1
starting sliding window for region: 100156-162349
starting sliding window for region: 164088-745644


Fatal read error on socket transport
protocol: <asyncio.sslproto.SSLProtocol object at 0x7f7cf519c358>
transport: <_SelectorSocketTransport fd=92 read=polling write=<idle, bufsize=0>>
Traceback (most recent call last):
  File "/Users/km22/anaconda3/lib/python3.7/asyncio/selector_events.py", line 801, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
TimeoutError: [Errno 60] Operation timed out

Chromosome: PvP01_03_v1
starting sliding window for region: 108062-630664


Fatal read error on socket transport
protocol: <asyncio.sslproto.SSLProtocol object at 0x7f7cf66d9400>
transport: <_SelectorSocketTransport fd=98 read=polling write=<idle, bufsize=0>>
Traceback (most recent call last):
  File "/Users/km22/anaconda3/lib/python3.7/asyncio/selector_events.py", line 801, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
TimeoutError: [Errno 60] Operation timed out

KeyboardInterrupt: 

In [None]:
df_with_stats

In [None]:
# Concatenate data from each chromosome into one dataframe
import os

results_directory = 'sliding_window_results'
chrom_dfs = []
for filename in os.listdir(results_directory):
    f = os.path.join(results_directory, filename)
    # Only read regular files
    if os.path.isfile(f):
        df = pd.read_csv(f, index_col=0)
        # Add chromosome name column
        df.insert(loc=0, column='chrom', value=filename.replace('_windowed_heterozygosity.csv',''))
        chrom_dfs.append(df)
# Join dataframes together (DataFrame.append was removed in pandas 2.0)
results_df = pd.concat(chrom_dfs).sort_values('chrom')