# Human genetics (Gencove) 

The Human Phenotype Project collects genomic variation data on all its participants. Together with the project's deep phenotypes, the genomic data makes it possible to investigate disease progression and to explore personalized treatments. We genotype millions of positions by low-pass sequencing combined with imputation, using Gencove platform technologies. Genotype imputation is the process of statistically inferring unobserved genotypes from known haplotypes in a population. Gencove genotype imputation performs very well (accuracy > 98%) (Wasik et al. 2021).

### Data availability:
The information is stored in the following files:<br>
- `human_genetics.parquet`: sample metadata, including QC statistics, paths to PLINK variant files (raw and post-QC), and principal components (PCs).<br>
- `variants_qc.parquet`: variant QC statistics.<br>
- `relationship/relationship_ibs.txt`: IBS calculated by PLINK for pairs of participants.<br>
- `relationship/relationship_king.tsv`: King kinship coefficients for pairs of participants.<br>
- `pca/pca.parquet`: principal components calculated by PLINK.<br>
- `pca/pca_loadings.tsv`: a PLINK file containing the principal component loadings.<br>
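
As an illustration of how the King kinship file might be used, the sketch below labels participant pairs by degree of relatedness using the cut-offs conventionally applied to KING kinship coefficients. The column names (`IID1`, `IID2`, `KINSHIP`) and the inline data are assumptions made for this example; check the header of `relationship/relationship_king.tsv` before adapting it.

```python
import io

import pandas as pd

# Synthetic stand-in for relationship/relationship_king.tsv.
# Column names and values are assumed for illustration only.
king_tsv = io.StringIO(
    "IID1\tIID2\tKINSHIP\n"
    "A\tB\t0.49\n"
    "A\tC\t0.25\n"
    "B\tD\t0.10\n"
    "C\tD\t0.01\n"
)
pairs = pd.read_csv(king_tsv, sep="\t")

# Conventional KING kinship cut-offs between relatedness classes.
bins = [-float("inf"), 0.0442, 0.0884, 0.177, 0.354, float("inf")]
labels = ["unrelated", "3rd degree", "2nd degree", "1st degree", "duplicate/MZ twin"]

# Assign each pair to the class whose interval contains its coefficient.
pairs["relationship"] = pd.cut(pairs["KINSHIP"], bins=bins, labels=labels)

# Keep only pairs related at 2nd degree or closer, e.g. to exclude them from an analysis.
related = pairs[pairs["KINSHIP"] > 0.0884]
print(pairs)
```

With the real file, the `io.StringIO` object would be replaced by the path to the TSV; the threshold to filter on depends on the analysis.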

In [1]:
#| echo: false
import pandas as pd
pd.set_option("display.max_rows", 500)

In [2]:
from pheno_utils import PhenoLoader

In [3]:
dl = PhenoLoader('human_genetics')
dl

DataLoader for human_genetics with
190 fields
3 tables: ['human_genetics', 'pca', 'age_sex']

# Data dictionary

In [4]:
dl.dict

| tabular_field_name | field_string | description_string | relative_location | value_type | units | sampling_rate | item_type | array | cohorts | data_type | debut | pandas_dtype | parent_dataframe | data_coding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| collection_date | Collection date | The date of downloading Gencove results from t... | human_genetics/human_genetics.parquet | Time | Time | | Data | Single | 10k | Tabular | | datetime64[ns] | | |
| version | Gencove version | Gencove API version 1 or 2 | human_genetics/human_genetics.parquet | Categorical (single) | | | Data | Single | 10k | Tabular | | category | | |
| gencove_fastq_r1 | R1_FASTQ | Per sample FASTQ file, a text file that contai... | human_genetics/human_genetics.parquet | String | Text | | Bulk | Single | 10k | Text file | | string | | |
| gencove_fastq_r2 | R2_FASTQ | Per sample FASTQ file, a text file that contai... | human_genetics/human_genetics.parquet | String | Text | | Bulk | Single | 10k | Text file | | string | | |
| gencove_bam | BAM | Per sample Binary Alignment Map (BAM) file for... | human_genetics/human_genetics.parquet | String | Text | | Bulk | Single | 10k | Compressed binary file | | string | | |
| gencove_bai | BAM indices | Per sample indices for BAM file | human_genetics/human_genetics.parquet | String | Text | | Bulk | Single | 10k | Compressed file | | string | | |
| gencove_vcf | Imputation Variant Call Format (VCF) | Per sample VCF, a text file storing imputed ge... | human_genetics/human_genetics.parquet | String | Text | | Bulk | Single | 10k | Compressed text file | | string | | |
| genecov_qc_bases | Bases sequenced | number of total bases sequenced. from genecov ... | human_genetics/human_genetics.parquet | Integer | Count | | Data | Single | 10k | Tabular | | int | | |
| genecov_qc_bases_dedup | Deduplicated bases | number of deduplicated bases. from genecov qc ... | human_genetics/human_genetics.parquet | Integer | Count | | Data | Single | 10k | Tabular | | int | | |
| gencove_qc_bases_dedup_mapped | Deduplicated bases aligned | number of deduplicated bases that have aligned... | human_genetics/human_genetics.parquet | Integer | Count | | Data | Single | 10k | Tabular | | int | | |
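
The `item_type` column separates fields that hold values directly (`Data`) from fields that hold paths to per-sample files (`Bulk`). The sketch below shows one way to split field names on that column. It assumes the dictionary is a pandas DataFrame indexed by `tabular_field_name`, as the output above suggests, and uses a small hand-built frame with three of the rows shown rather than the real `dl.dict`.

```python
import pandas as pd

# Three rows copied from the dictionary output above, limited to a few columns.
# In the notebook, this frame would presumably be replaced by dl.dict.
data_dict = pd.DataFrame(
    {
        "field_string": ["Collection date", "BAM", "Bases sequenced"],
        "item_type": ["Data", "Bulk", "Data"],
        "pandas_dtype": ["datetime64[ns]", "string", "int"],
    },
    index=pd.Index(
        ["collection_date", "gencove_bam", "genecov_qc_bases"],
        name="tabular_field_name",
    ),
)

# Fields whose values are paths to per-sample files (FASTQ, BAM, VCF, ...).
bulk_fields = data_dict.index[data_dict["item_type"] == "Bulk"].tolist()

# Fields whose values are stored directly in the table.
tabular_fields = data_dict.index[data_dict["item_type"] == "Data"].tolist()

print("Bulk:", bulk_fields)
print("Data:", tabular_fields)
```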
