# Human genetics (Gencove) 

The Human Phenotype Project collects genomic variation data on all its participants. Combined with the project's deep phenotypes, these data make it possible to investigate the progression of disease and to explore personalized treatments. We genotype millions of positions by low-pass sequencing combined with imputation, using Gencove platform technologies. Genotype imputation is the process of statistically inferring unobserved genotypes from known haplotypes in a population. The performance of Gencove genotype imputation is very high (accuracy > 98%) (Wasik et al. 2021).
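
To make the idea of imputation concrete, here is a deliberately simplified toy sketch. It is not Gencove's method (production imputation relies on probabilistic haplotype models), and the function name `impute_nearest_haplotype` and the example data are hypothetical. It fills unobserved alleles by copying them from the reference haplotype that best matches the observed sites.

```python
def impute_nearest_haplotype(observed, reference_haplotypes):
    """Fill unobserved alleles (None) from the closest reference haplotype.

    Toy illustration only: real imputation uses probabilistic models,
    not a single best-matching haplotype.
    """
    def mismatches(hap):
        # Count disagreements at the sites that were actually observed.
        return sum(1 for o, h in zip(observed, hap) if o is not None and o != h)

    best = min(reference_haplotypes, key=mismatches)
    # Keep observed alleles; copy the rest from the best-matching haplotype.
    return [h if o is None else o for o, h in zip(observed, best)]


# Hypothetical reference panel: three haplotypes over five sites (0/1 alleles).
reference = [
    [0, 0, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1],
]
# Low-coverage sample: sites 1 and 4 were not observed.
observed = [1, None, 0, 1, None]

imputed = impute_nearest_haplotype(observed, reference)
print(imputed)
```

In this example the second reference haplotype agrees with every observed site, so its alleles are used to fill the two gaps.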

### Data availability:
The information is stored in several parquet files:
- `main.parquet`: sample metadata, including QC statistics, paths to PLINK variant files (raw and post-QC), and principal components (PCs).
- `variant_qc.parquet`: variant QC statistics.
- `relatives/plink_ibs.parquet`: IBS calculated by PLINK for pairs of participants.
- `relatives/king_kinship.parquet`: King kinship coefficients for pairs of participants.

In addition, the PLINK text file `pca/eigenvec.var` contains the principal component loadings.
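
For readers who prefer to open the parquet files directly, the following is a minimal sketch. The base directory is an assumption (the dataset root depends on your environment), and `PARQUET_FILES`, `resolve_paths`, and `load_table` are hypothetical helper names, not part of `pheno_utils`. Reading relies on `pandas.read_parquet`, which needs a parquet engine such as `pyarrow` to be installed.

```python
from pathlib import Path

# Relative paths as listed above; the keys are illustrative labels.
PARQUET_FILES = {
    "main": "main.parquet",
    "variant_qc": "variant_qc.parquet",
    "plink_ibs": "relatives/plink_ibs.parquet",
    "king_kinship": "relatives/king_kinship.parquet",
}


def resolve_paths(base_dir):
    """Map each table label to its full path under base_dir (an assumed root)."""
    base = Path(base_dir)
    return {name: base / rel for name, rel in PARQUET_FILES.items()}


def load_table(base_dir, name):
    """Read one table into a DataFrame; pandas is imported only when needed."""
    import pandas as pd

    return pd.read_parquet(resolve_paths(base_dir)[name])


# Hypothetical dataset root; replace with the actual location.
paths = resolve_paths("/path/to/human_genetics")
for name, path in paths.items():
    print(f"{name}: {path}")
```

Loading tables one at a time, for example `load_table(base_dir, "variant_qc")`, avoids reading the large kinship table unless it is actually needed.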

In [None]:
#| echo: false
import pandas as pd
pd.set_option("display.max_rows", 500)

In [None]:
from pheno_utils import PhenoLoader

In [2]:
# We load the data but:
# - Skip the king matrix (large and slow to load)
# - Skip the eigenvectors of the PCA (PLINK output, not a parquet file)

dl = PhenoLoader('human_genetics')
dl

DataLoader for human_genetics with
190 fields
3 tables: ['human_genetics', 'pca', 'age_sex']

# Data dictionary
For some datasets, we can also take a look at the data dictionary:

In [3]:
dl.dict.head()

| tabular_field_name | field_string | description_string | parent_dataframe | relative_location | value_type | units | sampling_rate | item_type | array | cohorts | data_type | debut | pandas_dtype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| collection_date | Collection date | The date of downloading Gencove results from t... | | /human_genetics/human_genetics.parquet | Time | Time | | Data | Single | 10k | Tabular | | datetime64[ns] |
| version | Gencove version | Gencove API version 1 or 2 | | /human_genetics/human_genetics.parquet | Categorical (single) | | | Data | Single | 10k | Tabular | | category |
| gencove_fastq_r1 | R1_FASTQ | Per sample FASTQ file, a text file that contai... | | /human_genetics/human_genetics.parquet | String | Text | | Bulk | Single | 10k | Text file | | string |
| gencove_fastq_r2 | R2_FASTQ | Per sample FASTQ file, a text file that contai... | | /human_genetics/human_genetics.parquet | String | Text | | Bulk | Single | 10k | Text file | | string |
| gencove_bam | BAM | Per sample Binary Alignment Map (BAM) file for... | | /human_genetics/human_genetics.parquet | String | Text | | Bulk | Single | 10k | Compressed binary file | | string |
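
The data dictionary can be filtered like any other DataFrame. The sketch below uses a small stand-in built from the five rows shown above, keeping only three columns, so that it runs without access to the dataset. With the real loader, the same selection would presumably be applied to `dl.dict`, assuming it behaves as a regular pandas DataFrame indexed by `tabular_field_name`.

```python
import pandas as pd

# Stand-in for the data dictionary, copied from the rows displayed above.
data_dict = pd.DataFrame(
    {
        "field_string": ["Collection date", "Gencove version", "R1_FASTQ", "R2_FASTQ", "BAM"],
        "item_type": ["Data", "Data", "Bulk", "Bulk", "Bulk"],
        "pandas_dtype": ["datetime64[ns]", "category", "string", "string", "string"],
    },
    index=pd.Index(
        ["collection_date", "version", "gencove_fastq_r1", "gencove_fastq_r2", "gencove_bam"],
        name="tabular_field_name",
    ),
)

# Fields whose item_type is "Bulk" hold file references (FASTQ, BAM)
# rather than tabular values.
bulk_fields = data_dict.index[data_dict["item_type"] == "Bulk"].tolist()
print(bulk_fields)
```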
