# Introduction
https://pypi.org/project/pandas-plink/

Reading the PLINK binary file format from Python requires the `pandas-plink` package.

## Install

    pip install pandas-plink
    # alternatively
    conda install -c conda-forge pandas-plink

## Usage

Reading a PLINK binary dataset requires three files:
* BED: contains the genotypes
* BIM: contains the variant information
* FAM: contains the sample information
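
The three files usually share a common prefix, so their paths can be derived from it and checked before loading. The following is a minimal sketch using only the standard library; the helper names `plink_paths` and `missing_files` are hypothetical and are not part of pandas-plink.

```python
from pathlib import Path


def plink_paths(prefix):
    """Return the BED, BIM and FAM paths that share `prefix` (hypothetical helper)."""
    prefix = str(prefix)
    return {ext: Path(prefix + "." + ext) for ext in ("bed", "bim", "fam")}


def missing_files(prefix):
    """List the extensions whose file does not exist on disk."""
    return [ext for ext, path in plink_paths(prefix).items() if not path.is_file()]


# Prefix of the tutorial dataset used in this document; adjust to your own data.
print(missing_files("tutorial/hapmap1/hapmap1"))
```

An empty list means all three files were found.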

    from pandas_plink import read_plink1_bin
    
According to `help(pandas_plink)`, `read_plink1_bin` returns a samples-by-variants matrix. Its rows and columns each carry several coordinates, which hold the metadata from the FAM and BIM files respectively.


In [1]:
from pandas_plink import read_plink1_bin
G = read_plink1_bin("tutorial/hapmap1/hapmap1.bed", 
                    "tutorial/hapmap1/hapmap1.bim",
                    "tutorial/hapmap1/hapmap1.fam")

Mapping files: 100%|██████████| 3/3 [00:00<00:00, 11.32it/s]


In [2]:
print(G)
print("Shape:", G.shape)

<xarray.DataArray 'genotype' (sample: 89, variant: 83534)>
dask.array<shape=(89, 83534), dtype=float64, chunksize=(89, 1024)>
Coordinates:
  * sample   (sample) object '1' '1' '1' '1' '1' '1' ... '1' '1' '1' '1' '1' '1'
  * variant  (variant) object '1_rs6681049' '1_rs4074137' ... '22_rs756638'
    father   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    fid      (sample) <U6 'HCB181' 'HCB182' 'HCB183' ... 'JPT268' 'JPT269'
    gender   (sample) <U1 '1' '1' '1' '1' '1' '1' ... '1' '1' '1' '1' '1' '1'
    iid      (sample) <U1 '1' '1' '1' '1' '1' '1' ... '1' '1' '1' '1' '1' '1'
    mother   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    trait    (sample) float64 1.0 1.0 2.0 1.0 1.0 1.0 ... 1.0 2.0 2.0 2.0 2.0
    a0       (variant) <U1 '1' '1' '0' '1' '1' '1' ... '1' '1' '0' '0' '1' '1'
    a1       (variant) <U1 '2' '2' '2' '2' '2' '2' ... '2' '2' '2' '2' '2' '2'
    chrom    (variant) <U2 '1' '1' '1' '1' '1' '1' ... '22' '22' '22' '22' '22'
 

### More info

According to the .log files generated by PLINK, the dataset contains 89 individuals and 83,534 SNPs in total. Reassuringly, this matches the shape of `G`.
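
The same check can be made without the .log files, because FAM and BIM are plain text files with one line per sample and one line per variant. A minimal sketch, assuming the files are not compressed; the function names are hypothetical:

```python
def count_records(path):
    """Count the non-empty lines of a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def expected_shape(fam_path, bim_path):
    """Return (n_samples, n_variants) implied by the FAM and BIM files."""
    return count_records(fam_path), count_records(bim_path)
```

For the tutorial data, `expected_shape("tutorial/hapmap1/hapmap1.fam", "tutorial/hapmap1/hapmap1.bim")` should equal `G.shape`.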

# xarray.DataArray

https://xarray.pydata.org/en/stable/generated/xarray.DataArray.html

N-dimensional array with labeled coordinates and dimensions. The API is similar to that for the pandas Series or DataFrame, but DataArray objects can have any number of dimensions, and their contents have fixed data types. 

In [10]:
# Attributes:
print("dims tuple:", G.dims)
print("coords dict-like:", G.coords)
print("name:",  G.name)
print("attrs ordered dict:", G.attrs)

dims tuple: ('sample', 'variant')
coords dict-like: Coordinates:
  * sample   (sample) object '1' '1' '1' '1' '1' '1' ... '1' '1' '1' '1' '1' '1'
  * variant  (variant) object '1_rs6681049' '1_rs4074137' ... '22_rs756638'
    father   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    fid      (sample) <U6 'HCB181' 'HCB182' 'HCB183' ... 'JPT268' 'JPT269'
    gender   (sample) <U1 '1' '1' '1' '1' '1' '1' ... '1' '1' '1' '1' '1' '1'
    iid      (sample) <U1 '1' '1' '1' '1' '1' '1' ... '1' '1' '1' '1' '1' '1'
    mother   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    trait    (sample) float64 1.0 1.0 2.0 1.0 1.0 1.0 ... 1.0 2.0 2.0 2.0 2.0
    a0       (variant) <U1 '1' '1' '0' '1' '1' '1' ... '1' '1' '0' '0' '1' '1'
    a1       (variant) <U1 '2' '2' '2' '2' '2' '2' ... '2' '2' '2' '2' '2' '2'
    chrom    (variant) <U2 '1' '1' '1' '1' '1' '1' ... '22' '22' '22' '22' '22'
    cm       (variant) float64 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 

https://xarray.pydata.org/en/stable/indexing.html

Access data using `[]` syntax, such as `array[i, j]`. The indexing page linked above includes a table summarizing the four ways to look up values along dimensions.
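
To make the four lookup styles concrete, here is a small sketch on a toy array rather than on `G`. The array, its labels (`s1`, `v3`, ...) and the variable names are made up for illustration; it assumes `numpy` and `xarray` are installed.

```python
import numpy as np
import xarray as xr

# Toy 2 x 3 genotype-like array; the labels are invented for this example.
toy = xr.DataArray(
    np.array([[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]),
    dims=("sample", "variant"),
    coords={"sample": ["s1", "s2"], "variant": ["v1", "v2", "v3"]},
)

by_position = toy[0, 2]                              # positional, by integer
by_label = toy.loc["s1", "v3"]                       # positional, by label
by_name_int = toy.isel(sample=0, variant=2)          # by dimension name, by integer
by_name_label = toy.sel(sample="s1", variant="v3")   # by dimension name, by label

# All four lookups address the same element of the array.
print(float(by_position), float(by_label), float(by_name_int), float(by_name_label))
```

The name-based forms (`isel`, `sel`) do not depend on the order of the dimensions, which is convenient when it is easy to forget whether samples are rows or columns.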

# Documentation

https://pandas-plink.readthedocs.io/en/latest/

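The matrix `G` is an `xarray.DataArray`. It provides labels for its dimensions (`sample` for rows and `variant` for columns) and additional metadata for those dimensions. Let's print the genotype values for a given sample label and a given variant. Every sample in this dataset carries the label `'1'` (see the `sample` coordinate above), so the selection returns one value for each of the 89 samples: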

In [13]:
print(G.sel( sample='1', variant='1_rs6681049' ).values)

[2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 1. 2. 1. 2. 2. 1. 1. 1. 1. 0. 1. 0. 2. 1.
 1. 2. 1. 2. 1. 2. 2. 1. 1. 2. 2. 0. 2. 2. 1. 2. 2. 2. 1. 2. 2. 1. 2. 2.
 2. 2. 1. 2. 2. 0. 1. 2. 2. 2. 1. 2. 0. 2. 0. 2. 2. 2. 2. 2. 0. 2. 2. 2.
 2. 2. 2. 1. 2. 1. 2. 2. 2. 2. 2. 2. 0. 2. 1. 2. 2.]


Let's print a summary of the genotype values:

In [5]:
print(G.values)

[[2. 2. 2. ... 2. 1. 2.]
 [2. 1. 2. ... 2. 1. 2.]
 [2. 1. 2. ... 2. 2. 1.]
 ...
 [1. 2. 2. ... 2. 1. 1.]
 [2. 2. 2. ... 2. 1. 2.]
 [2. 2. 2. ... 2. 1. 1.]]


The genotype values can be 0, 1, 2 or NaN:
* 0: homozygous for the first allele (given by the coordinate a0)
* 1: heterozygous
* 2: homozygous for the second allele (given by the coordinate a1)
* NaN: missing genotype.
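
Under the encoding listed above, each value is the number of copies of the a1 allele, so genotype counts and an allele frequency follow directly. A minimal sketch with NumPy; the function names and the example values are hypothetical:

```python
import numpy as np


def genotype_counts(genotypes):
    """Count genotypes 0, 1, 2 and missing (NaN) values in an array."""
    g = np.asarray(genotypes, dtype=float)
    counts = {value: int(np.sum(g == value)) for value in (0, 1, 2)}
    counts["missing"] = int(np.isnan(g).sum())
    return counts


def a1_frequency(genotypes):
    """Frequency of the a1 allele among non-missing genotypes.

    Each value is the number of a1 copies (0, 1 or 2), so the frequency
    is the mean of the non-missing values divided by 2.
    """
    g = np.asarray(genotypes, dtype=float)
    return float(np.nanmean(g) / 2.0)


example = [2.0, 1.0, 0.0, np.nan, 2.0]  # made-up genotypes for one variant
print(genotype_counts(example))
print(a1_frequency(example))
```

On the real data, the same functions could be applied to a single variant, for example `genotype_counts(G.sel(variant="1_rs6681049").values)`.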