
GRAF Software Documentation

GRAF (Genetic Relationship And Fingerprinting) is a package for analyzing and visualizing genotype data from genome-wide association studies. The latest version, GRAF 2.3, includes two main features: (1) subject relationship inference (GRAF-rel); (2) subject ancestry (or population structure) inference (GRAF-pop). Both relationship and ancestry inferences are based on the genotypes of 10,000 pre-selected fingerprint SNPs extracted from the input dataset. The GRAF package includes a main C++ program, graf, that calculates the relationships and predicts subject ancestry, and two auxiliary Perl programs, PlotGraf.pl and PlotPopulations.pl, that visualize the results. Note that PlotGraf.pl and PlotPopulations.pl require the GD Graphics Library (http://search.cpan.org/~lds/GD-1.38/GD.pm) to be installed.

GRAF-rel: inferring subject relationships using genotypes

GRAF-rel analyzes the genotypes of all 10,000 fingerprinting SNPs (distributed as the file FP_SNPs.txt) and calculates the all genotype mismatch rate (AGMR) and the homozygous genotype mismatch rate (HGMR) for each pair of samples (Jin et al., 2017). AGMR is the percentage of SNPs on which the two genotypes are not identical, while HGMR is the genotype mismatch rate when only the SNPs with homozygous calls for both samples are considered.
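
As a rough illustration of the two definitions, the sketch below computes AGMR and HGMR for one pair of samples. The genotype coding (0, 1 or 2 reference alleles, None for missing), the function name mismatch_rates and the choice to skip SNPs with a missing call in either sample are assumptions made for this example; it is not the code used by graf.

```python
def mismatch_rates(geno_a, geno_b):
    """Return (AGMR, HGMR) in percent for two equal-length genotype lists.

    Each genotype is 0, 1 or 2 (number of reference alleles) or None (missing).
    """
    ag_match = ag_miss = hg_match = hg_miss = 0
    for a, b in zip(geno_a, geno_b):
        if a is None or b is None:
            continue  # assumption: SNPs with a missing call are not counted
        if a == b:
            ag_match += 1
        else:
            ag_miss += 1
        # Homozygous calls are 0 or 2; a heterozygous call (1) excludes the SNP from HGMR.
        if a != 1 and b != 1:
            if a == b:
                hg_match += 1
            else:
                hg_miss += 1
    ag_total = ag_match + ag_miss
    hg_total = hg_match + hg_miss
    agmr = 100.0 * ag_miss / ag_total if ag_total else float("nan")
    hgmr = 100.0 * hg_miss / hg_total if hg_total else float("nan")
    return agmr, hgmr


# Hypothetical genotypes for six SNPs.
sample_1 = [0, 1, 2, 2, None, 0]
sample_2 = [0, 2, 2, 0, 1, 0]
print(mismatch_rates(sample_1, sample_2))
```

In this toy pair, five SNPs have calls in both samples and four of those are homozygous in both, so the two rates are computed over different denominators.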

graf compares the genotypes of all pairs of subjects and finds and reports the closely related pairs, while PlotGraf.pl takes the file generated by graf and plots graphs to show the distributions of HGMR and AGMR values.

Input files

In most usages, graf expects as input one or more genotype datasets in PLINK format, i.e., .bed, .bim and .fam files that share a prefix in their names. However, since multiple samples can be collected from one subject and the subject-sample mapping information is not stored in datasets in PLINK, graf reads subject-sample mapping and pedigree information from the dbGaP SSM file and pedigree file. The IDs (second column, no column header) in the PLINK .fam file are read as sample IDs by graf. If subject IDs are the same as sample IDs, then no SSM file is necessary, and graf will read the pedigree information from the PLINK .fam file.

However, if any of the sample IDs are different from their corresponding subject IDs, then an SSM file should be passed to graf. The SSM file should be a tab-delimited plain text file with a sample column and a subject column (with column headers, see dbGaP submission guide). When an SSM file is provided, a pedigree file (see dbGaP submission guide) should also be provided to pass the pedigree information to graf. The pedigree file should be a tab-delimited plain text file with at least the following 5 columns (with a column header row):

  1. FamilyID
  2. SubjectID
  3. FatherID
  4. MotherID
  5. Sex (1 = male; 2 = female; 0 or NULL = unknown)

SubjectID, FatherID and MotherID are IDs of subjects, not samples.

The SSM file is a two-column, tab-delimited text file that maps sample IDs to subject IDs. The columns should have the headers Subject_ID and Sample_ID, respectively. An example SSM file is included in the GRAF distribution with the name affy_hapmap_ssm.txt.
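
For readers who want to inspect an SSM file before passing it to graf, a minimal sketch of reading the mapping is shown here. It assumes the tab-delimited layout and the Subject_ID and Sample_ID headers described above; the function name read_ssm and the IDs in the example are hypothetical.

```python
import csv
import io


def read_ssm(handle):
    """Map each sample ID to its subject ID from a tab-delimited SSM file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Sample_ID"]: row["Subject_ID"] for row in reader}


# Hypothetical file content: subject S1 has two samples, subject S2 has one.
ssm_text = "Subject_ID\tSample_ID\nS1\tS1_a\nS1\tS1_b\nS2\tS2_a\n"
sample_to_subject = read_ssm(io.StringIO(ssm_text))
print(sample_to_subject)
```

For a file on disk, the same function could be called with open("affy_hapmap_ssm.txt", newline="") in place of the in-memory text.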

If there are identical twins in the datasets, the twin information should be entered in the optional sixth column, TwinID, of the pedigree file, where the same twin ID (an integer or a string) indicates that subjects are identical twins. For example, if three subjects A, B, C are identical triplets, a unique twin ID, e.g., the integer 18, can be created for them and entered into the TwinID column for subjects A, B, C.

The sample genotypes can also be stored in datasets in GRAF format. graf uses a single .fpg file to store the sample genotypes. A .fpg file is a plain text file with three columns: the first column is the dataset ID (an integer); the second is the sample ID; and the third stores the sample genotypes as a string of hexadecimal digits. Each hexadecimal digit represents the genotypes of two fingerprinting SNPs. The first digit stores the genotypes of the first two fingerprinting SNPs; the second digit stores the genotypes of fingerprinting SNPs #3 and #4, and so on. When a hexadecimal digit is converted to a four-bit binary number, the first two bits hold the genotype of the first SNP and the last two bits hold the genotype of the second SNP, with the following code meanings:

        00: 0 reference alleles
        01: 1 reference allele
        10: 2 reference alleles
        11: missing genotype
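
The encoding can be illustrated with a short decoding sketch. It assumes that each "hexadecimal number" is a single hexadecimal digit (four bits, two SNPs), which is how the description above reads; the function name decode_fpg_genotypes and the input string are made up for the example.

```python
# Two-bit code -> number of reference alleles (None = missing genotype).
CODES = {0b00: 0, 0b01: 1, 0b10: 2, 0b11: None}


def decode_fpg_genotypes(hex_string):
    """Decode a .fpg genotype string into a list of reference-allele counts."""
    genotypes = []
    for digit in hex_string:
        value = int(digit, 16)
        genotypes.append(CODES[value >> 2])    # first two bits: first SNP of the pair
        genotypes.append(CODES[value & 0b11])  # last two bits: second SNP of the pair
    return genotypes


# "6" is 0110 in binary and "B" is 1011, so four SNPs are decoded.
print(decode_fpg_genotypes("6B"))
```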

The .fpg file can be generated using the -geno option of the graf program and reused as input to the program in a subsequent run.

Included in the distribution are two sample datasets whose file names have the prefixes affy_hapmap and perlegen_hapmap. Both sets of sample files come in byte-encoded PLINK format, meaning that each set consists of three files with the suffixes .fam, .bim and .bed.

Running graf to find closely related subjects

graf is a command-line executable that runs on 64-bit GNU/Linux systems. Brief instructions are given when the program is executed without parameters:

$ graf

Usage: graf [options]
    -plink  PLINK set root:  File root of PLINK .bed, .bim and .fam files
    -geno   fpg file:        Specify GRAF .fpg file
    -exfp   PLINK set list:  Extract fingerprinting genotypes from a list of PLINK sets (file roots) separated by commas
    -pop    output file:     Check subject populations and save results to the output file
    -out    output file:     Output file to save the results
    -appd   DS No.:          Append extracted fingerprinting genotypes to the output file.  The integer is dataset No.
                             of the first PLINK set
    -ssrs   SS-RS mapping:   Specify SS# to RS# mapping file (Two  columns: SS# and RS# without column  headers)
    -ped    pedigree file:   Specify pedigree file of subject IDs (with column headers)
    -ssm    SSM file:        Specify dbGaP subject-sample mapping file
    -maxhm  max HGMR value:  Specify maximum HGMR values for a pair of subjects to be reported by GRAF
    -xpmr   type:            Specify how expected HGMR and AGMR values are calculated for each type of relationship (default 1)
                             1: Use input dataset to calculate the expected HGMR and AGMR values
                             2: Use average HGMR and AGMR values in dbGaP database for the expected values
    -type   relation_type:   Specify relation type.  Acceptable values are 1, 2, 3, or 4 (default 3)
                             1: Find all duplicates and PO pairs
                             2: Find all duplicates, PO and FS pairs
                             3: Find all duplicates, PO, FS and second  degree relatives
                             4: Compare all the 10,000 SNPs to find all the related subjects

NOTE:
    1. Exactly one of the following two options should be selected:  -plink or -geno.
    2. When option -exfp is selected, -out must also be selected and output file should have .fpg extension.
    3. When multiple PLINK sets are used, each dataset will be assigned an integer dataset ID starting with 1.
    4. The above PLINK set starting index can be specified using option -appd.
       When -appd is selected, the out file should be an existing GRAF .fpg file.
    5. Multiple datasets can be combined into a single geno file using the -exfp and -appd options.
    6. When multiple datasets are used, the program does pairwise comparisons to find related samples both within and across datasets

Below are more detailed descriptions (with examples) of these options.

-plink

Allows the user to specify the name of the genotype dataset in PLINK .bed, .bim, .fam format. The parameter should be the file root of the PLINK set. In the example below, graf will try to find the following three files: affy_hapmap.bed, affy_hapmap.bim and affy_hapmap.fam. Example:

$ graf -plink affy_hapmap

-exfp

Extracts fingerprinting genotypes from multiple PLINK sets and saves the results to the file name specified by -out option. The datasets will be given integer dataset IDs starting from 1. The output file name should be new. Example:

$ graf -exfp affy_hapmap,perlegen_hapmap -out comb_hapmap.fpg

-exfp -appd

Extracts fingerprinting genotypes from a PLINK set and appends the results to an existing output file, with the dataset ID specified by the -appd option. Example (two steps):

$ graf -exfp affy_hapmap -out comb_hapmap2.fpg
$ graf -exfp perlegen_hapmap -out comb_hapmap2.fpg -appd 2

-geno

Allows the user to specify the name of the genotype dataset in GRAF format. Example:

$ graf -geno comb_hapmap.fpg

-ssm

Allows the user to specify the name of the subject-sample mapping file in dbGaP format. When sample IDs are different from subject IDs, a subject-sample mapping file is required. The subject-sample mapping file should list all the sample IDs in the PLINK .fam file and their corresponding subject IDs. Example:

$ graf -plink affy_hapmap -ssm affy_hapmap_ssm.txt

-ped

Allows the user to specify the pedigree file in dbGaP format. When a pedigree file is specified with the -ped option, graf ignores the pedigree information in the PLINK .fam file and reads the information from the pedigree file instead. The IDs in the pedigree file should be subject IDs. This option can take only one dataset at a time. Example:

$ graf -plink affy_hapmap -ssm affy_hapmap_ssm.txt -ped affy_hapmap_fake_pedigree.txt

-out

Allows the user to specify the name of the output file for saving the related pairs of samples detected by graf. If the output file is not specified, the output will be saved to a default file graf_rel_yyyymmdd_hhmm.txt, where yyyymmdd_hhmm is the current local time in this format. Example:

$ graf -plink affy_hapmap -out aff_hapmap_rels.txt
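
The default file name pattern can be reproduced with a standard date format string, which may help when locating output from an earlier run. This is only a sketch of the naming convention described above; the function name default_output_name is hypothetical and the code is not taken from graf.

```python
from datetime import datetime


def default_output_name(now=None):
    """Build a name of the form graf_rel_yyyymmdd_hhmm.txt from local time."""
    now = now or datetime.now()
    return now.strftime("graf_rel_%Y%m%d_%H%M.txt")


print(default_output_name())
```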

-maxhm

Sets the maximum HGMR value for related pairs reported by graf. Subject pairs with HGMR greater than this value are treated by graf as unrelated and are not saved to the output file. The default maximum HGMR is 20. Example:

$ graf -plink affy_hapmap -out aff_hapmap_rels_m_15.txt -maxhm 15

-xpmr

Allows the user to specify how the expected HGMR and AGMR values are calculated. For each pair of subjects, GRAF estimates the allele frequencies of the fingerprinting SNPs in the population from which the subjects are sampled, and then uses these allele frequencies to calculate the expected HGMR and AGMR values. Assuming all of the subjects in the input file(s) are sampled from the same population, GRAF uses the allele frequencies of all subjects in the input datasets to estimate the allele frequencies in the population. When the sample size of the input datasets is small (fewer than 100 subjects), GRAF instead uses the allele frequencies of all the subjects in the dbGaP Fingerprint Collection to estimate the population allele frequencies. The user can set -xpmr to 1 or 2 to choose between these two ways of estimating the population allele frequencies. When -xpmr 1 is combined with -geno or -exfp input that combines multiple datasets, the allele frequencies are computed as a weighted average over all the participating datasets, and the same weighted average is used for all pairwise comparisons. Example:

$ graf -plink affy_hapmap -xpmr 2
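
To make the weighted average concrete, the sketch below combines per-dataset allele frequencies. It assumes that each dataset is weighted by its number of subjects, which the text above does not state explicitly; the function name and the numbers are hypothetical.

```python
def combined_allele_frequencies(datasets):
    """Weighted average of per-SNP allele frequencies across datasets.

    datasets: list of (n_subjects, [frequency per SNP]) tuples, all with the
    same SNP order. Weighting by n_subjects is an assumption of this sketch.
    """
    total = sum(n for n, _ in datasets)
    n_snps = len(datasets[0][1])
    return [
        sum(n * freqs[i] for n, freqs in datasets) / total
        for i in range(n_snps)
    ]


# Two hypothetical datasets with 100 and 300 subjects and two SNPs each.
combined = combined_allele_frequencies([(100, [0.20, 0.50]), (300, [0.40, 0.10])])
print(combined)
```

The larger dataset pulls the combined frequency toward its own values, and the resulting list would then be used for every pairwise comparison.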

-type

Usage of graf involves a tradeoff between running time and prediction accuracy. To obtain high sensitivity, the program needs to check more SNPs, at the expense of a longer running time. The -type option allows the user to specify the relationship types for which graf should try to find all the pairs. The type should be an integer from 1 to 4, with the code meanings shown in the short description above. The greater the type value, the more SNPs graf checks, and hence the more related samples it finds and the more time it takes. The default type value is 3. Example:

$ graf -plink affy_hapmap -type 2

-ssrs

When the marker IDs in the PLINK .bim file are SS IDs, the user can use -ssrs option to specify an SS to RS mapping file so that graf can convert the SS IDs to RS IDs. Example (assuming PLINK set DsWithSs.* exists):

$ graf -plink DsWithSs -ssrs SsToRs.txt

Output files

graf requires that an input genotype file, either in PLINK format (with -plink option) or in GRAF format (with -geno option) should be specified. When -exfp option is selected, the -out option should also be selected to specify the name of the output file. The output file is the genotype dataset in GRAF format (.fpg file), as described above.

The output file contains the extracted genotypes of the fingerprinting SNPs and can be passed back to graf as an input file in a later run.

When -exfp option is not selected, graf will use the genotype information in the input genotype dataset, find the related subjects or determine population structures, and will save the results to the output file.

If any related subjects are found by GRAF-rel, the results will be saved to the output file, which is a plain text file with the following columns:

            Sample1:       ID of the first sample in each pair
            Sample2:       ID of the second sample in each pair
            Subject1:      subject ID of the first sample in each pair
            Subject2:      subject ID of the second sample in each pair
            Sex11:         gender of the first subject in each pair, 1=male; 2=female
            Sex12:         gender of the second subject in each pair, 1=male; 2=female
            HG match:      number of SNPs with matched genotypes when only homozygous SNPs are counted
            HG miss:       number of SNPs with mismatched genotypes when only homozygous SNPs are counted
            HGMR:          Homozygous Genotype Mismatch Rate (%)
            AG match:      number of SNPs with matched genotypes when all SNPs are counted
            AG miss:       number of SNPs with mismatched genotypes when all SNPs are counted
            AGMR:          All Genotype Mismatch Rate (%)
            Geno relation: relationship determined by sample genotypes. See above for code meanings
            Ped relation:  relationship derived from subject-sample mapping file and pedigree file (See Table 1 for code meanings).
            p_value:       probability that the genetic relationship is NOT the predicted type

Table 1. Pedigree relationships and the expected genetic relationships

When multiple PLINK sets are checked pairwise, the output file will have two extra columns, DS1 and DS2, showing the dataset IDs for the pair of PLINK sets.

Running PlotGraf.pl to plot closely related subjects

PlotGraf.pl is a Perl script that plots graphs to show the distributions of HGMR and AGMR values of the related pairs of subjects. It shows brief instructions when it is executed without parameters:

$ PlotGraf.pl

Usage: PlotGraf.pl <input related subject file> <output png file> <graph type> [Options]

Note:
    Valid graph types are:
        1 = HGMR histogram
        2 = AGMR histogram
        3 = HGMR + AGMR scatter plot

Options:
    -gw     graph width:  Set graph width in pixels
    -gh     graph height: Set graph height in pixels
    -xmax   max x value:  Set maximum HGMR or AGMR on x-axis of the histogram
    -ymax   max y value:  Set maximum number of pairs on y-axis of the histogram
    -dot    size:         Set dot size in pixels on the scatter plot
    -hfd    size:         Set dot size in pixels for HF (half sibling + full cousin) pairs

It takes three required parameters. The first parameter should be the name of the file that is generated by graf and contains related subject pairs. The second one is the output .png file which shows the graph. The third one is an integer representing the graph type. The options should be entered after the required parameters. Below are some examples showing how to run the script.

$ graf -plink affy_hapmap -maxhm 15 -ssm affy_hapmap_ssm.txt -ped affy_hapmap_fake_pedigree.txt -out affy_hapmap_rels_15.txt
$ PlotGraf.pl affy_hapmap_rels_15.txt affy_hapmap_hgmr.png 1
$ PlotGraf.pl affy_hapmap_rels_15.txt affy_hapmap_agmr.png 2
$ PlotGraf.pl affy_hapmap_rels_15.txt affy_hapmap_scatter.png 3

In the first step the C++ program finds related pairs and saves the results to affy_hapmap_rels_15.txt. Then PlotGraf.pl takes the results and plots histograms to show distributions of HGMR values of the related subjects, AGMR values of the duplicates, and a scatter plot to show distribution of both values.

In both histograms, the colored bars represent the different types of relationships derived from the SSM and pedigree files (see Table 1 for the meanings of the two-letter abbreviations). The cyan lines show the cutoff values suggested by GRAF to separate the different types of relationships determined by comparing the genotypes. In the scatter plot, each contour line shows the area that is predicted to contain 95% of the pairs for each relatedness type, assuming all of the 10,000 fingerprinting SNPs are genotyped for all of the subjects in a large, homogeneous, random mating population. Note that the HapMap samples were collected from human individuals from very different populations, and GRAF is more accurate when predicting relatedness for subjects from a homogeneous population.

$ graf -plink affy_hapmap -maxhm 15 -ssm affy_hapmap_fake_ssm.txt -ped affy_hapmap_fake_pedigree.txt -out affy_hapmap_fake_rels.txt
$ PlotGraf.pl affy_hapmap_fake_rels.txt affy_hapmap_hgmr_f1.png 1 -gw 1000 -gh 500
$ PlotGraf.pl affy_hapmap_fake_rels.txt affy_hapmap_agmr_f1.png 2 -xmax 60 -ymax 20
$ PlotGraf.pl affy_hapmap_fake_rels.txt affy_hapmap_scatter_f1.png 3 -dot 5

The above examples show that the graph size, axis limits and scatter plot dot size can be adjusted by the user. In the first step a fake pedigree file and a fake SSM file are used to show how GRAF finds and reports errors in the pedigree and SSM files. The HGMR histogram generated in the second step shows that some of the related pairs reported by the pedigree and SSM files don't match the genetic relatedness determined by GRAF. It also shows that the graph size can be adjusted with the options -gw and -gh. The AGMR histogram also shows the mismatches between the relationship types reported in the input files and those determined by GRAF; its axis limits are adjusted with the -xmax and -ymax options. The scatter plot shows that the dot size can be adjusted with the -dot option.

Multiple genotype datasets can be combined into one .fpg file and passed to graf for determining genetic relationships, e.g.,

$ graf -exfp affy_hapmap,perlegen_hapmap -out comb_hapmap.fpg
$ graf -geno comb_hapmap.fpg -out comb_hapmap_rels.txt -maxhm 15 -ped affy_hapmap_fake_pedigree.txt -ssm comb_hapmap_ssm.txt
$ PlotGraf.pl comb_hapmap_rels.txt comb_hapmap_hgmr.png 1
$ PlotGraf.pl comb_hapmap_rels.txt comb_hapmap_agmr.png 2
$ PlotGraf.pl comb_hapmap_rels.txt comb_hapmap_scatter.png 3

When multiple datasets are used, if there are no SSM and pedigree files, it is not required that the sample and subject IDs be unique across datasets. GRAF uses both DS# and subject/sample IDs to identify subjects or samples. In the output table, GRAF shows both the DS# and ID for each subject or sample. However, when there are SSM and pedigree files, it is required that IDs be unique across datasets. GRAF doesn't take multiple SSM or pedigree files. The user needs to combine multiple SSM or pedigree files into one, and each ID in the combined SSM or pedigree file should represent only one sample or subject. Neither the SSM file nor the pedigree file has DS# columns.

The -hfd option of PlotGraf.pl lets the user set the dot size for the half sibling + full cousin pairs (HF, see Table 1) in the scatter plot. The HF relationship is genetically more distant than full sibling but closer than second degree relatives. In the scatter plot, these pairs are predicted to fall between the FS and D2 pairs. In the rare cases when there are HF pairs, the user can use the -hfd option to highlight them by setting a different dot size for them.

GRAF-pop: inferring subject ancestry using genotypes

GRAF-pop calculates genetic distances from each subject to several reference populations and estimates subject ancestry and ancestral proportions based on these distances. Four genetic distance scores, GD1, GD2, GD3 and GD4, are used in ancestry inference in the current version of GRAF. Subjects in the input datasets are clustered using these scores and plotted on scatter plots.

GRAF-pop assumes that each subject is an admixture of three ancestries: European (E), African (F), and Asian (A), and estimates ancestral proportions Pe, Pf, Pa based on GD1 and GD2 scores using barycentric coordinates. It also assigns a population ID (PopID) to each subject using the cutoff values shown in Tables 2 and 3.
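
For readers unfamiliar with barycentric coordinates, the sketch below shows how a point in the (GD1, GD2) plane can be expressed as a mixture of three triangle vertices, one per ancestry. The vertex positions used here are placeholders chosen only to make the arithmetic easy to follow; the reference positions used by GRAF-pop are not given in this document, and the function name is hypothetical.

```python
def barycentric(p, v_e, v_f, v_a):
    """Return weights (w_e, w_f, w_a) of point p relative to three vertices.

    The weights sum to 1; a point inside the triangle has all weights >= 0.
    """
    (x, y), (x1, y1), (x2, y2), (x3, y3) = p, v_e, v_f, v_a
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w_e = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w_f = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    w_a = 1.0 - w_e - w_f
    return w_e, w_f, w_a


# Placeholder vertices (NOT the GRAF-pop reference positions).
V_E, V_F, V_A = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)
weights = barycentric((0.25, 0.25), V_E, V_F, V_A)
print([round(100 * w, 1) for w in weights])  # proportions in percent
```

A subject located exactly at one vertex gets a weight of 1 for that ancestry and 0 for the other two.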

Table 2. Grouping subjects based on the ancestry proportions

    PopID    Population             Cutoff standard
    1        European               Pe ≥ 87%
    2        African                Pf ≥ 95%
    3        East Asian             Pa ≥ 95%
    4        African American       40% ≤ Pf < 95% and Pa < 13%
    5        Hispanic1              Pf < 40% and Pe < 87% and Pa < 13% and Pf ≥ Pa
    6,7,8    (Three populations)    Pa < 95% and Pe < 87% and Pf < 13% and Pf < Pa
    9        Other                  Pa ≥ 13% and Pf ≥ 13%
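
The Table 2 cutoffs can be read as a sequence of checks, sketched below with the default cutoff values. Evaluating the rows in table order and returning the combined label "6,7,8" for the group that Table 3 later splits are choices made for this illustration; the function name pop_group is hypothetical and the code is not taken from GRAF-pop.

```python
def pop_group(pe, pf, pa):
    """Return the Table 2 PopID label for proportions Pe, Pf, Pa in percent."""
    if pe >= 87:
        return "1"      # European
    if pf >= 95:
        return "2"      # African
    if pa >= 95:
        return "3"      # East Asian
    if 40 <= pf < 95 and pa < 13:
        return "4"      # African American
    if pf < 40 and pe < 87 and pa < 13 and pf >= pa:
        return "5"      # Hispanic1
    if pa < 95 and pe < 87 and pf < 13 and pf < pa:
        return "6,7,8"  # separated further using GD1 and GD4 (Table 3)
    if pa >= 13 and pf >= 13:
        return "9"      # Other
    return None         # no row of Table 2 matched


print(pop_group(70, 20, 10))
```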

Table 3. Separating Asians and Hispanics using GD1 and GD4 scores

    PopID    Population     Cutoff standard
    7        Other Asian    GD1 > 30 × (GD4)^2 + 1.73
    8        South Asian    GD4 > 5 × (GD1 - 1.69)^2 + 0.042
    6        Hispanic2      GD4 < 0 and PopID is not 7

Input files

Same as GRAF-rel, GRAF-pop takes genotype datasets in either PLINK format (.fam, .bim, .bed) or GRAF format (.fpg). In addition, GRAF-pop can read self-reported ancestries from the input file and compare the ancestries inferred from genotypes with the self-reported ones. The input file should be a plain text file with two columns (without column header), containing subject ID and the self-reported ancestry, respectively.

Running graf to infer subject ancestry

Option -pop is used by graf to infer subject ancestry:

$ graf

...
Usage: graf [options]
    -pop    output file:     Check subject populations and save results to the output file
...

The following command determines population structures and saves results to the output file:

$ graf -plink G1000FpGeno -pop G1000_sbj_scores.txt

Running PlotPopulations.pl to plot population results

The results generated by graf can be passed to PlotPopulations.pl for further processing. The following instructions are displayed on the screen when the script is run without parameters:

$ PlotPopulations.pl

Usage: PlotPopulations.pl <input file> <output file> [Options]

Note:
    Output file should be either a .png file or a .txt file.

    If the output file is a .png file, the script will plot the results to a graph and save the graph to the file.
    If the output file is a .txt file, the script will save the calculated subject ancestry components to the file.

Options:
    Set window size in pixels
        -gw      graph width

    Set graph axis limits
        -xmin    min x value
        -xmax    max x value
        -ymin    min y value
        -ymax    max y value

    Set a rectangle area to retrieve subjects for graph of GD1 vs. GD2
        -xcmin   min x value
        -xcmax   max x value
        -ycmin   min y value
        -ycmax   max y value
        -isByd   0 or 1
                 0:  retrieve subjects whose values are within the above rectangle (default value)
                 1:  retrieve subjects whose values are beyond the above rectangle

    Set population cutoff lines
        -ecut   proportion: cutoff European proportion dividing Europeans from other population. Default 87%.
        -fcut   proportion: cutoff African proportion dividing Africans from other population. Default 95%.
                            Set it to -1 to combine African and African American populations
        -acut   proportion: cutoff East Asian proportion dividing East Asians from other populations. Default 95%.
                            Set it to -1 to combine East Asian and Other Asian populations
        -ohcut  proportion: cutoff African proportion dividing Hispanics from Other population. Default 13%.
        -fhcut  proportion: cutoff African proportion dividing Hispanics from African Americans. Default 40%.

    Select some self-reported populations (by IDs) to be highlighted on the graph
        -pops   comma separated population IDs, e.g., -pops 1,3,4 -> highlight populations #1, #3 and #4

    Select self-reported populations (by IDs) to show areas including 95% dbGaP subjects with genotypes of at least 4000 fingerprint SNPs
        -areas  comma separated dbGaP self-population IDs, e.g., -areas 1,3
            -> show areas that include 95% dbGaP subjects with self-reported populations #1 and #3
              1: European/White/Caucasian
              2: African (Ghana/Yoruba)
              3: East Asian (Chinese/Japanese)
              4: African American/Black
              5: Puerto Rican/Dominican
              6: Mexican/Latino
              7: Asian/Pacific Islander
              8: Asian Indian/Pakistani

    Select which score to show on the y-axis
        -gd4     1 or 0.  1: show GD4 on y-axis;  0: show GD2

    Set population cutoff lines
        -cutoff  1 or 0.  1: show cutoff lines;  0: hide cutoff lines

    Rotate the plot with respect to the x-axis by a certain angle
        -rotx    angle in degrees

    Set the size (diameter) of each dot that represents each subject
        -dot     pixels

    The input file with self-reported subject race information
        -spf     a file with two columns: subject and self-reported population

The script takes two required parameters, which must be the first two arguments and, unlike the optional arguments, are not preceded by flags. The first parameter should be the name of the file that is generated by the graf -pop option and contains the subject genetic distance scores. The second parameter is the output file, expected to be either a .png or a .txt file. If the output file is a .png file, the script plots the scores and saves the graph to the output file. The default graph is GD1 vs. GD2, e.g.,

$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops.png

When option -gd4 is set to 1, the script generates a graph of GD1 vs. GD4:

$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_gd4.png -gd4 1

If the output file is a .txt file, the script processes the data and saves the results to the output file in a format of a rectangular table.

$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_list.txt

In the output file, columns P_e, P_f, P_a show each subject's European, African, and East Asian proportions Pe, Pf, Pa, in percentages. The populations determined by GRAF-pop are included in the last two columns as an identifier and as the full name of the population.

When self-reported ancestries are available, the information can be passed to the script with -spf option so that the script can color-code the subjects using the self-reported ancestries, e.g.,

$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_sp.png -spf G1000SbjSuperPop.txt

The format of the input ancestry file is described above. In the graph generated by the script, the ancestries are numbered and color coded.

The cutoff lines used to partition the subjects are drawn on the graphs when option -cutoff is set, e.g.,

$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_cut.png -spf G1000SbjSuperPop.txt -cutoff 1
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_cut_gd4.png -spf G1000SbjSuperPop.txt -gd4 1 -cutoff 1

If multiple subjects appear at the same locations in the x-y plane, the user can use option -pops to bring some ancestries to the front, while moving other ancestries to the back and fading them out in the graph. The assignments of colors to populations are currently hard-coded. For example, the following command generates a graph with ancestry No. 5 (AMR, standing for Ad Mixed American) in the back and colored yellow:

$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_1234.png -spf G1000SbjSuperPop.txt -pops 1,3,2,4

The ancestry numbers following -pops should be separated by commas without spaces.

One can also use the -rotx option to rotate the graph of GD2 vs. GD1 around the x-axis by an angle specified in degrees (can be any real number). For example, the following command generates a graph showing the subjects rotated by 90°:

$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_90.png -spf G1000SbjSuperPop.txt -rotx 90

Options -gw, -xmin, -xmax, -ymin, -ymax, -dot, similar to those in PlotGraf.pl, can be used to adjust the graph size, specify axis limits, and set the dot size, e.g.,

$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_gw.png -spf G1000SbjSuperPop.txt -gw 800 -ymin 1.1 -dot 5
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_gw_gd4.png -spf G1000SbjSuperPop.txt -gw 800 -gd4 1 -ymin -0.2

One can use the option -areas to select populations to show the expected oval areas that include 95% of dbGaP subjects with at least 4000 fingerprint SNPs with genotypes, e.g.,

$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_a.png -spf G1000SbjSuperPop.txt -areas 1,4,7

The integers in the comma-delimited string represent the eight self-reported ancestry groups in dbGaP, with most common ancestry terms in each group shown below:

            1: European/White/Caucasian
            2: African (Ghana/Yoruba)
            3: East Asian (Chinese/Japanese)
            4: African American/Black
            5: Puerto Rican/Dominican
            6: Mexican/Latino
            7: Asian/Pacific Islander
            8: Asian Indian/Pakistani

GRAF-pop uses the ancestry proportions shown in Tables 2 and 3 as default cutoff values. The user can use options -ecut, -fcut, -acut, -ohcut, -ahcut, -fhcut to set the cutoff values to different numbers, e.g.,

$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_ucut.png -spf G1000SbjSuperPop.txt -cutoff 1 -fcut 85 -ahcut 80 -ohcut 15.5

When -fcut or -acut are set to negative values, the African or East Asian cutoff line is not plotted on the graph, and the script does not distinguish Africans from African Americans, or East Asians from Other Asians, e.g.,

$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_nf.png -spf G1000SbjSuperPop.txt -cutoff 1 -fcut -1

As mentioned above, when the second parameter (the output file) is a .txt file, the script saves subjects and their ancestry proportions in a rectangular table. Options -xcmin, -xcmax, -ycmin, -ycmax, -isByd can be used to specify a rectangular area and let the script retrieve subjects whose x (GD1) and y (GD2) scores are either within or beyond this area. For example, the following command saves all subjects with 1.8 < GD1 < ∞ and -∞ < GD2 < 1.2, which are all the EAS (East Asian) subjects:

$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_list_cut.txt -spf G1000SbjSuperPop.txt -xcmin 1.8 -ycmax 1.2

When option -isByd is set to 1, the script retrieves subjects whose values are beyond the rectangular area specified by options -xcmin, -xcmax, -ycmin, -ycmax. For example, the following command excludes most of the 1000 Genomes Project subjects with super populations AMR (Ad Mixed American) and SAS (South Asian):

$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_list_cutb.txt -spf G1000SbjSuperPop.txt -xcmin 1.64 -xcmax 1.8 -ycmin 1.24 -ycmax 1.36 -isByd 1
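
The selection logic of the rectangle options can be sketched as follows. The sketch assumes that an omitted bound is treated as unbounded and that the comparisons are strict, as the first example above suggests; the function name select_subjects and the scores are hypothetical, and this is not the code of PlotPopulations.pl.

```python
import math


def select_subjects(scores, xcmin=-math.inf, xcmax=math.inf,
                    ycmin=-math.inf, ycmax=math.inf, is_byd=0):
    """Return subject IDs inside (is_byd=0) or beyond (is_byd=1) the rectangle.

    scores: dict mapping subject ID -> (GD1, GD2).
    """
    selected = []
    for subject, (gd1, gd2) in scores.items():
        inside = xcmin < gd1 < xcmax and ycmin < gd2 < ycmax
        if inside != bool(is_byd):
            selected.append(subject)
    return selected


# Hypothetical GD1/GD2 scores for three subjects.
scores = {"A": (1.9, 1.1), "B": (1.7, 1.3), "C": (1.5, 1.0)}
print(select_subjects(scores, xcmin=1.8, ycmax=1.2))
print(select_subjects(scores, xcmin=1.64, xcmax=1.8, ycmin=1.24, ycmax=1.36, is_byd=1))
```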

References

Jin Y, Schäffer AA, Sherry ST, and Feolo M (2017). Quickly identifying identical and closely related subjects in large databases using genotype data. PLoS One. 12(6):e0179106.

Jin Y, Schäffer AA, Feolo M, Holmes JB, and Kattman BL (2019). GRAF-pop: A fast distance-based method to infer subject ancestry from multiple genotype datasets without principal components analysis. G3: Genes|Genomes|Genetics. Aug 8; 9(8):2447-2461.
