<h1><span style="color:gray">ipyrad-analysis toolkit:</span> Dimensionality reduction</h1>

The `pca` tool applies several dimensionality reduction methods (PCA, t-SNE, UMAP) to SNP data. It can also filter genotype matrices and impute missing genotypes to reduce the effects of missing data.

### Load libraries

In [2]:
# conda install ipyrad -c conda-forge -c bioconda
# conda install ipcoal -c conda-forge
# conda install scikit-learn -c conda-forge

In [3]:
import ipyrad.analysis as ipa
import toyplot

In [4]:
print(ipa.__version__)
print(toyplot.__version__)

0.9.61
0.19.0


### The input data

In [5]:
# path to the SNP database file (HDF5)
SNPS = "/tmp/oaks.snps.hdf5"

In [7]:
# download example hdf5 dataset (158Mb, takes ~2-3 minutes)
URL = "https://www.dropbox.com/s/x6a4i47xqum27fo/virentes_ref.snps.hdf5?raw=1"
ipa.download(url=URL, path=SNPS);

file already exists


### Make an IMAP dictionary (map popnames to list of samplenames)

In [111]:
IMAP = {
    "virg": ["LALC2", "TXWV2", "FLBA140", "FLSF33", "SCCU3"],
    "mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
    "gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
    "bran": ["BJSL25", "BJSB3", "BJVL19"],
    "fusi": ["MXED8", "MXGT4", "TXMD3", "TXGR3"],
    "sagr": ["CUCA4", "CUSV6", "CUVN10"],
    "oleo": ["MXSA3017", "BZBB1", "HNDA09", "CRL0030", "CRL0001"],
}
MINMAP = {
    "virg": 3,
    "mini": 3,
    "gemi": 3,
    "bran": 2,
    "fusi": 2,
    "sagr": 2,
    "oleo": 3,
}
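
Before passing these dictionaries to the tool it can help to check them for simple mistakes. The following is a minimal sketch of such a check. The helper `check_imap` is hypothetical (it is not part of ipyrad) and only tests that no sample is assigned to two populations and that no `MINMAP` value exceeds the size of its population.

```python
# Dictionaries copied from the cell above.
IMAP = {
    "virg": ["LALC2", "TXWV2", "FLBA140", "FLSF33", "SCCU3"],
    "mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
    "gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
    "bran": ["BJSL25", "BJSB3", "BJVL19"],
    "fusi": ["MXED8", "MXGT4", "TXMD3", "TXGR3"],
    "sagr": ["CUCA4", "CUSV6", "CUVN10"],
    "oleo": ["MXSA3017", "BZBB1", "HNDA09", "CRL0030", "CRL0001"],
}
MINMAP = {"virg": 3, "mini": 3, "gemi": 3, "bran": 2, "fusi": 2, "sagr": 2, "oleo": 3}


def check_imap(imap, minmap):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    seen = {}  # sample name -> population it was first seen in
    for pop, names in imap.items():
        for name in names:
            if name in seen:
                problems.append(f"{name} is in both {seen[name]} and {pop}")
            seen[name] = pop
    for pop, minimum in minmap.items():
        if pop not in imap:
            problems.append(f"{pop} is in minmap but not in imap")
        elif minimum > len(imap[pop]):
            problems.append(f"{pop} requires {minimum} samples but has {len(imap[pop])}")
    return problems


problems = check_imap(IMAP, MINMAP)
print(problems or "no problems found")
```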

### Initiate tool with filtering options

In [112]:
tool = ipa.pca(data=SNPS, minmaf=0.05, imap=IMAP, minmap=MINMAP, impute_method="sample")

Samples: 26
Sites before filtering: 1182005
Filtered (indels): 0
Filtered (bi-allel): 26249
Filtered (mincov): 142749
Filtered (minmap): 876036
Filtered (subsample invariant): 600226
Filtered (minor allele frequency): 494278
Filtered (combined): 1068892
Sites after filtering: 74034
Sites containing missing values: 61935 (83.66%)
Missing values in SNP matrix: 140810 (7.32%)
Imputation: 'sampled'; (0, 1, 2) = 61.2%, 17.0%, 21.8%
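
To make two of the filters reported above more concrete, here is a toy sketch of a `minmap`-style and a `minmaf`-style test applied to single sites. It is an illustration under stated assumptions, not ipyrad's implementation: genotypes are coded as 0, 1, or 2 copies of an allele, missing calls are represented by `None`, and the function names and the toy populations `A` and `B` are hypothetical.

```python
def passes_minmap(site, imap, minmap):
    """True if every population has at least its minimum number of called samples.

    site maps sample name -> genotype (0, 1, 2), or None when the call is missing.
    """
    for pop, names in imap.items():
        ncalled = sum(site.get(name) is not None for name in names)
        if ncalled < minmap.get(pop, 0):
            return False
    return True


def minor_allele_frequency(site):
    """Frequency of the rarer allele among called (diploid) genotypes at one site."""
    called = [g for g in site.values() if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


# Toy data: two populations and two sites.
toy_imap = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}
toy_minmap = {"A": 2, "B": 2}
site_ok = {"a1": 0, "a2": 1, "a3": None, "b1": 2, "b2": 2}
site_sparse = {"a1": 0, "a2": None, "a3": None, "b1": 0, "b2": 0}

for site in (site_ok, site_sparse):
    keep = passes_minmap(site, toy_imap, toy_minmap) and minor_allele_frequency(site) >= 0.05
    print(keep)
```

In this toy example the first site is kept, while the second is dropped because population `A` has only one called sample.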


### Run PCA
Unlinked SNPs are automatically sampled from each locus. By setting `nreplicates=N`, the subsampling procedure is repeated N times to show variation over the subsampled SNPs. The `imap` dictionary passed to `.draw()` is used to color points, and it can be overridden to color points differently from the IMAP used to initialize the tool above.

In [42]:
tool.run(nreplicates=10)
tool.draw(imap=IMAP);

Subsampling SNPs: 25092/100093
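
The subsampling of unlinked SNPs can be pictured with a small sketch. It assumes that one SNP is drawn at random from each locus and that each replicate uses a different random draw. The function `subsample_unlinked` and the toy locus labels are hypothetical and only illustrate the idea; they are not ipyrad's code.

```python
import random


def subsample_unlinked(snp_loci, seed=None):
    """Pick one SNP column at random from each locus.

    snp_loci[i] is the ID of the locus that SNP column i belongs to.
    Returns the sorted indices of the selected columns.
    """
    rng = random.Random(seed)
    by_locus = {}
    for column, locus in enumerate(snp_loci):
        by_locus.setdefault(locus, []).append(column)
    return sorted(rng.choice(columns) for columns in by_locus.values())


# Ten SNPs spread over four loci; each replicate draws a different subset.
toy_loci = [0, 0, 0, 1, 2, 2, 3, 3, 3, 3]
for replicate in range(3):
    print(subsample_unlinked(toy_loci, seed=replicate))
```

Each replicate keeps as many SNPs as there are loci, so the spread of points across replicates reflects which linked SNP happened to be chosen.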


In [43]:
# a convenience function for plotting across three axes
tool.draw_panels(0, 1, 2, imap=IMAP);

### Run TSNE
t-SNE is a manifold learning algorithm that can sometimes project data into a 2-dimensional plane better than PCA does. The distances between points in this space, however, are harder to interpret.

In [44]:
tool.run_tsne(perplexity=5, seed=333)
tool.draw(imap=IMAP);

Subsampling SNPs: 25092/100093


### Run UMAP
UMAP is similar to t-SNE, but the distances between clusters are more representative of the differences between groups. It requires an additional package; if that package is not yet installed, the tool will ask you to install it.

In [52]:
tool.run_umap(n_neighbors=13, seed=333)
tool.draw(imap=IMAP);

Subsampling SNPs: 25092/100093


### Missing data with imputation
Missing data can have large effects on dimensionality reduction methods, so it is best to (1) minimize the amount of missing data in your input data set by filtering, and (2) impute the missing values that remain. In the examples above, data are imputed using the 'sample' method, which probabilistically samples alleles based on the allele frequency in the group that a taxon is assigned to in IMAP. It is good to compare this to a case where imputation is performed without IMAP assignments, to assess the impact of the *a priori* assignments. Although this comparison is useful, assigning taxa to groups with IMAP dictionaries is expected to yield more accurate imputation.

In [61]:
# allow very little missing data
import itertools
tool = ipa.pca(
    data=SNPS, 
    imap={'samples': list(itertools.chain.from_iterable(IMAP.values()))},
    minmaf=0.05, 
    mincov=0.9, 
    impute_method="sample", 
    quiet=True,
)
tool.run(nreplicates=10, seed=123)
tool.draw(imap=IMAP);
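
The difference between imputing with and without group assignments can be illustrated with a toy sketch of frequency-based sampling at a single site. It assumes diploid genotypes coded 0, 1, or 2, missing calls represented by `None`, and two independent allele draws per imputed genotype. The function `impute_by_sampling` and the toy groups are hypothetical and are not ipyrad's implementation.

```python
import random


def impute_by_sampling(genotypes, groups, seed=None):
    """Fill missing genotypes by sampling alleles at each group's allele frequency.

    genotypes maps sample name -> 0, 1, 2, or None (missing) at one site.
    groups maps group name -> list of sample names.
    """
    rng = random.Random(seed)
    imputed = dict(genotypes)
    for names in groups.values():
        called = [genotypes[n] for n in names if genotypes.get(n) is not None]
        if not called:
            continue  # no information in this group, so leave its calls missing
        freq = sum(called) / (2 * len(called))
        for name in names:
            if genotypes.get(name) is None:
                # draw two alleles independently, each with probability `freq`
                imputed[name] = sum(rng.random() < freq for _ in range(2))
    return imputed


# Two toy groups that are fixed for different alleles, each with one missing call.
site = {"a1": 2, "a2": 2, "a3": None, "b1": 0, "b2": 0, "b3": None}
grouped = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]}
pooled = {"samples": list(site)}

print(impute_by_sampling(site, grouped, seed=1))  # a3 becomes 2 and b3 becomes 0
print(impute_by_sampling(site, pooled, seed=1))   # both are drawn at frequency 0.5
```

With group assignments the imputed genotypes match each group's own frequency, whereas pooling all samples draws both missing calls at the intermediate, site-wide frequency.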

### Statistics

In [65]:
# variance explained by each PC axis in the first replicate run
tool.variances[0].round(2)

array([0.16, 0.15, 0.06, 0.05, 0.04, 0.04, 0.03, 0.03, 0.03, 0.03, 0.03,
       0.03, 0.03, 0.03, 0.03, 0.03, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02,
       0.02, 0.02, 0.02, 0.01, 0.01, 0.  ])
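
A common follow-up is to look at the cumulative variance explained by the leading axes. The sketch below does this with the rounded values printed above, copied into a plain list so that it runs on its own. Because the inputs are rounded to two decimals, the totals are approximate.

```python
from itertools import accumulate

# Rounded proportions copied from the output above (first replicate).
variances = (
    [0.16, 0.15, 0.06, 0.05, 0.04, 0.04]
    + [0.03] * 10
    + [0.02] * 9
    + [0.01] * 2
    + [0.0]
)

# Running total of variance explained by axes 0..i.
cumulative = list(accumulate(variances))

for axis in range(4):
    print(axis, round(cumulative[axis], 2))
```

With these rounded values the first two axes together account for roughly 31% of the variance.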

In [74]:
# PC coordinates (scores) of each sample in the first replicate
tool.pcs(0)

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BJSB3 | -4.61 | 74.151 | -18.954 | -26.073 | 0.587 | 0.241 | -1.056 | -2.311 | -1.129 | 0.826 | ... | 1.739 | 1.861 | -0.803 | -2.532 | 0.292 | 45.451 | -3.763 | -6.647 | 0.511 | 5.15e-15 |
| BJSL25 | -4.897 | 70.815 | -17.631 | -22.66 | -0.514 | -0.194 | -3.377 | -1.688 | -0.42 | 2.316 | ... | 0.284 | -1.698 | 2.079 | 0.757 | 0.322 | -17.017 | 4.234 | 39.465 | -0.777 | -3.872e-15 |
| BJVL19 | -5.211 | 70.611 | -19.109 | -23.022 | 0.183 | -0.428 | -2.558 | -1.428 | -0.932 | 1.365 | ... | -0.139 | -0.828 | -0.356 | 0.9 | -0.154 | -31.057 | 0.534 | -32.138 | 0.011 | 5.343e-15 |
| BZBB1 | 59.008 | -7.891 | 3.739 | 1.181 | 9.771 | -21.487 | -5.505 | -10.85 | 4.033 | 2.517 | ... | -23.674 | -7.355 | 44.886 | -0.201 | 0.726 | 0.436 | -12.788 | -0.58 | -1.093 | 1.349e-14 |
| CRL0001 | 48.873 | -14.312 | -3.244 | -1.591 | -4.756 | 24.637 | 7.354 | -10.712 | -12.492 | -3.943 | ... | -2.002 | -0.992 | -1.08 | -1.47 | 0.523 | -0.487 | 7.434 | 0.335 | 35.415 | 1.801e-14 |
| CRL0030 | 65.379 | -10.158 | 1.379 | 0.016 | 7.581 | -12.391 | -2.594 | -8.762 | 0.429 | 1.645 | ... | -16.87 | -1.965 | -15.548 | -1.647 | 2.67 | 3.929 | 41.397 | -2.348 | -11.342 | 1.248e-14 |
| CUCA4 | 28.689 | -17.756 | -5.968 | -4.747 | -25.342 | -13.972 | -15.904 | 60.494 | -27.65 | -2.318 | ... | 3.39 | 0.035 | 1.886 | 1.354 | -1.875 | 0.528 | 0.13 | -0.005 | 0.234 | -5.325e-15 |
| CUSV6 | 25.507 | -18.772 | -5.325 | -5.905 | -13.657 | 25.755 | -3.049 | 23.135 | 60.079 | 13.485 | ... | -0.083 | 1.558 | 0.302 | -0.601 | -1.812 | 0.539 | -1.068 | -0.345 | -0.248 | 7.175e-15 |
| CUVN10 | 36.071 | -18.435 | -5.1 | -3.779 | -13.754 | 53.098 | 14.693 | -10.917 | -24.551 | -6.453 | ... | 4.404 | 0.415 | 7.293 | -0.12 | -0.499 | -0.757 | -6.956 | -0.045 | -20.352 | 1.455e-14 |
| FLAB109 | -27.347 | -23.411 | -22.541 | 6.585 | -17.245 | -6.095 | 3.553 | -13.351 | 0.285 | -0.432 | ... | -9.764 | 2.632 | -4.347 | -0.406 | -20.536 | -0.271 | 1.277 | 1.406 | -0.214 | 5.87e-15 |


### Styling plots (see toyplot documentation)
The `.draw()` function returns a canvas and axes object from toyplot which can be further modified and styled.

In [115]:
# get plot objects, several styling options to draw
canvas, axes = tool.draw(imap=IMAP, size=8, width=400);

# various axes styling options shown for x axis
axes.x.ticks.show = True
axes.x.spine.style['stroke-width'] = 1.5
axes.x.ticks.labels.style['font-size'] = '13px'
axes.x.label.style['font-size'] = "15px"
axes.x.label.offset = "22px"