Skip to content

Beginner Tutorial

mtmcgowan edited this page Sep 2, 2021 · 23 revisions

This page provides an introductory tutorial on how to use GGcall to cluster and call genotypes.

1. Exporting a GS (Genome Studio) project file

First open your GS project file in Genome Studio. This file should have a .bsc extension and will be accompanied by a data folder.

Modify the columns being viewed to only include data that will be used for clustering. If you want to try using different transformations, be sure to include both the raw and the transformed/normalized data.

Click the export button under the 'Full Data Table' tab to export the data as a tabular .txt file

2. Reading in the GS export

Open Rstudio/R and load the GGcall package (If not already installed, see the GGcall main page for installation instructions) library(PolySNPclust)

Call the project.import() function to read in the data:

Project_name <- project.import(exportpath = "your_file_location.txt", raw = T, GSnorm = T)

This function has only a few parameters:

  • exportpath is the location of the GS export file. An example export is included in the PolySNPclust repository in the data folder named 'wheat_demo.txt'. This data is from a population of 241 wheat accessions that were tested with the 90k iSelect SNP chip. To keep computational time down, only a subset of markers are included in the demo data.
  • raw is a T/F indicator whether the raw intensity data was included in the GS export (default = T)
  • GSnorm is another T/F indicator whether GS transformed/normalized theta and r values were included in the GS export (default = T) Your export file must contain at least one of the data types or the import function will fail.

Depending on what columns you filtered out, your GS export may contain a lot of unnecessary variables. The import function reads only relevant columns and stores them in a data structure. The object returned is a project list that contains your GS data, a marker object for every marker, and a genotype object. The marker and genotype objects will be empty at first.

3. Filtering the markers

It is common for some markers to perform very poorly with either lots of missing data or extremely low signal variance. Therefore, it is important to check each marker and determine whether it is likely to cause issues with the clustering algorithm. This function will alter the filtering flag for each marker object in your project.

This is done with a filtering function.

`Project_name <- project.missfilter(project = Project_name, missrate = 0.2, type = 'raw')

  • project This is the name you assigned to your project
  • missrate This is the cutoff for a proportion of missing data (default: missrate = 0.2). Any marker above this threshold will not be clustered.
  • type This indicates which data will be used for clustering.
    • 'raw' Indicates that the raw or scaled raw data will be used.
    • 'GSnorm' Indicates that the GS normalized or arcsine transformed GS data will be used.

4. Clustering the data

The next step is to actually cluster the data. To do this, the R package Rmixmod is automatically used to cluster signal variables in the SNP chip data. Depending on your data, it may be better to cluster either the raw or the theta/r values (some chips greatly benefit from the normalization step to adjust for spotting variance on the chip, but sometimes this can cause more problems than improvements...). Because clustering is considerably time intensive, clustering is sped up by using parallel processing on multiple cores. After the command is entered, multiple threads are started on each core. As each core finishes clustering a marker, the R terminal will update a progress bar. If you are clustering a large number of markers (>10k chip), I recommend you use a machine with a large amount of RAM. There are plans to implement strategies to mitigate this memory requirement, but until then, more is better.

The cluster function is also easy to use:

Project_name <- project.cluster(Project_name)

This function only requires a single parameter, but has several additional options:

  • project is the project name you specified when importing your data

OPTIONAL PARAMETERS

  • marker_indices is the indices of the markers you want to cluster. By default, the function will cluster all markers that aren't flagged as bad. If you only want to cluster specific markers (Like a list MSVs that failed in GenomeStudio), you can specify the index positions on the chip.

  • data.type is a string indicator of what data to use for clustering (default: data.type = 'raw.scaled'). Here are the current options:

    • 'raw.scaled' This is the raw X,Y signal intensities scaled and centered according to the converted R intensity value. To learn more about how this is currently implemented see the scale.raw.data.R function included in the package files.
    • 'raw' This will use the unaltered X,Y intensities.
    • 'GS.norm' This will use the normalized theta and R values determined by GenomeStudio prior to importing your data.
    • 'GS.trans' This will use the normalized values, but will further transform the theta value using an arcsin function to better equalize the cluster variance at extreme values near 0 and 1. To learn how this might help improve your data, read this paper where it is described more in depth: fitTetra paper.
  • 'iteration' How many times to cluster a marker (default: iteration = 5). Increasing this will result in more stable clustering results, but will increase computational time. Unless you are concerned about your cluster results, it is recommended to use the default.

  • 'model.list' This specifies which families of Gaussian mixture model families to use from the Rmixmod package. Best left alone for now, but will allow for better optimized models in the future.

  • 'clust.num' This value is very important. It sets the maximum number of clusters to test for (default: clust.num = 8). Increasing it will allow for more complex cluster models with higher numbers of clusters, but will increase computational time.

5 Calling Your Genotypes

The final step is to use the cluster models to call your genotypes. The default method of calling markers is to output the cluster classifications as a numerical table with a row for each sample and a column for each marker. Because many markers will not be bi-allelic and the expected linear relationship between clusters cannot be easily determined, each marker is expanded into multiple columns with each column representing a single cluster coded as 0 or 1 depending on whether that sample was classified as being in that cluster. This is called indicator expansion and can be learned about from wikipedia. Genotypes will be stored in the ['genotypes'] attribute in your project. They will also be saved under a specified filename in a tab-delimited format compatible with many GWAS or GS tools developed in the ZZlab group.

Here is the function:

Project_name <- project.cluster(Project_name)

PARAMETERS

  • project Again, this is the name of your project.
  • call.type This indicates how to call markers. Currently, only one method is available cat.all, which outputs dummy variables for each cluster. Further methods will be implemented in the future which can isolate just bi-allelic markers or null alleles.
  • min.prop This determines the minimum proportion of samples that need to be in a cluster to be output as a marker. Because outliers are implicitly classified as their own cluster, Gaussian mixture models often identify clusters that only contain a very small proportion of the population. This parameter functions similar to a minor allele frequency (MAF) cutpoint by determining the smallest expected number of individuals required to consider a marker valid.