Usage

Genotype

>>> import os >>> from os.path import join >>> from pandas_plink import get_data_folder >>> import shutil >>> import tempfile >>> >>> old_path = os.getcwd() >>> tmp_path = tempfile.mkdtemp() >>> os.chdir(tmp_path) >>> >>> filenames = ["chr11.bed", "chr11.bim", "chr11.fam", "chr12.bed", "chr12.bim", ... "chr12.fam"] >>> for f in filenames: ... _ = shutil.copy(join(get_data_folder(), f), ".") >>> rel_filenames = ["plink2.rel.bin", "plink2.rel.id"] >>> for f in rel_filenames: ... _ = shutil.copy(join(get_data_folder(), "rel-bin", f), ".")

It is as simple as:

>>> from pandas_plink import read_plink1_bin >>> G = read_plink1_bin("chr11.bed", "chr11.bim", "chr11.fam", verbose=False) >>> print(G) <xarray.DataArray 'genotype' (sample: 14, variant: 779)> dask.array<transpose, shape=(14, 779), dtype=float32, chunksize=(14, 779), chunktype=numpy.ndarray> Coordinates: (12/14) * sample (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014' * variant (variant) <U10 'variant0' 'variant1' ... 'variant777' 'variant778' fid (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014' iid (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014' father (sample) object '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0' mother (sample) object '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0' ... ... chrom (variant) object '11' '11' '11' '11' '11' ... '11' '11' '11' '11' snp (variant) object '316849996' '316874359' ... '345698259' cm (variant) float64 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 pos (variant) int32 157439 181802 248969 ... 28937375 28961091 29005702 a0 (variant) object 'C' 'G' 'G' 'C' 'C' 'T' ... 'A' 'C' 'A' 'A' 'T' a1 (variant) object 'T' 'C' 'C' 'T' 'T' 'A' ... 'G' 'T' 'G' 'C' 'C'

The matrix G is a special matrix: xarray.DataArray. It provides labels for its dimensions ("sample" for rows and "variant" for columns) and additional metadata for those dimensions. Lets print the genotype value of sample "B003" and variant "variant5":

>>> variant = "variant5" >>> print(G.sel(sample="B003", variant=variant).values) 0.0 >>> print(G.a0.sel(variant=variant).values) T

It shows that sample "B003" has two T alleles (a0) at the variant "variant5".

Likewise, sample "B003" has two C alleles (a1) at the variant "variant135":

>>> variant = "variant135" >>> print(G.sel(sample="B003", variant=variant).values) 2.0 >>> print(G.a1.sel(variant=variant).values) C

To see how the genotype value maps to the alleles, see below.

Now lets print a summary of the genotype values:

>>> print(G.values) [[0.00 0.00 2.00 ... 0.00 0.00 0.00] [0.00 1.00 2.00 ... 0.00 0.00 nan] [0.00 0.00 2.00 ... 0.00 0.00 0.00] ... [2.00 2.00 0.00 ... 2.00 2.00 2.00] [2.00 1.00 0.00 ... 2.00 2.00 1.00] [0.00 0.00 2.00 ... 0.00 0.00 nan]]

The genotype values can be either 0, 1, 2, or math.nan:

0 Homozygous having the first allele (given by coordinate a0)
1 Heterozygous
2 Homozygous having the second allele (given by coordinate a1)
math.nan Missing genotype

Kinship matrix

Pandas-plink supports relationship/covariance matrix encoded in PLINK and GCTA file formats since version 2.0.0.

>>> from pandas_plink import read_rel >>> K = read_rel("plink2.rel.bin") >>> print(K) <xarray.DataArray (sample_0: 10, sample_1: 10)> array([[ 0.89, 0.23, -0.19, -0.01, -0.14, 0.29, 0.27, -0.23, -0.10, -0.21], [ 0.23, 1.08, -0.45, 0.19, -0.19, 0.17, 0.41, -0.01, -0.13, -0.13], [-0.19, -0.45, 1.18, -0.04, -0.15, -0.20, -0.31, -0.04, 0.30, -0.01], [-0.01, 0.19, -0.04, 0.90, -0.07, 0.01, 0.06, -0.19, -0.09, 0.17], [-0.14, -0.19, -0.15, -0.07, 1.18, 0.09, -0.03, 0.10, 0.22, 0.17], [ 0.29, 0.17, -0.20, 0.01, 0.09, 0.96, 0.07, -0.04, -0.09, -0.23], [ 0.27, 0.41, -0.31, 0.06, -0.03, 0.07, 0.71, -0.10, -0.09, -0.06], [-0.23, -0.01, -0.04, -0.19, 0.10, -0.04, -0.10, 1.42, -0.30, -0.07], [-0.10, -0.13, 0.30, -0.09, 0.22, -0.09, -0.09, -0.30, 0.91, -0.02], [-0.21, -0.13, -0.01, 0.17, 0.17, -0.23, -0.06, -0.07, -0.02, 0.91]]) Coordinates: * sample_0 (sample_0) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753' * sample_1 (sample_1) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753' fid (sample_1) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753' iid (sample_1) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753' >>> print(K.values) [[ 0.89 0.23 -0.19 -0.01 -0.14 0.29 0.27 -0.23 -0.10 -0.21] [ 0.23 1.08 -0.45 0.19 -0.19 0.17 0.41 -0.01 -0.13 -0.13] [-0.19 -0.45 1.18 -0.04 -0.15 -0.20 -0.31 -0.04 0.30 -0.01] [-0.01 0.19 -0.04 0.90 -0.07 0.01 0.06 -0.19 -0.09 0.17] [-0.14 -0.19 -0.15 -0.07 1.18 0.09 -0.03 0.10 0.22 0.17] [ 0.29 0.17 -0.20 0.01 0.09 0.96 0.07 -0.04 -0.09 -0.23] [ 0.27 0.41 -0.31 0.06 -0.03 0.07 0.71 -0.10 -0.09 -0.06] [-0.23 -0.01 -0.04 -0.19 0.10 -0.04 -0.10 1.42 -0.30 -0.07] [-0.10 -0.13 0.30 -0.09 0.22 -0.09 -0.09 -0.30 0.91 -0.02] [-0.21 -0.13 -0.01 0.17 0.17 -0.23 -0.06 -0.07 -0.02 0.91]]

>>> os.chdir(old_path) >>> shutil.rmtree(tmp_path, ignore_errors=True)

Please, refer to the functions pandas_plink.read_rel and pandas_plink.read_grm for more details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

usage.rst

usage.rst

Usage

Genotype

Kinship matrix

Files

usage.rst

Latest commit

History

usage.rst

File metadata and controls

Usage

Genotype

Kinship matrix