In [1]:
import pyPLS

## Simulation of a few spectra

For the demo, a set of spectra is generated by simulation. **PCA** and **PLS** are then performed on these data. The function $simulation.simulateData$ (called as `pyPLS.simulateData` in the cell below) requires a minimum of 4 parameters to generate the spectra:
- **n** is the number of spectra
- **ncp** is the number of latent variables 
- **p** is the number of measured variables
- **sigma** is the peak width
- **signalToNoise** is an optional parameter (default 100) that sets the level of noise in the data. The higher the value, the lower the noise.

In [2]:
X, Z, Y = pyPLS.simulateData(50, 80, 1000, 10., signalToNoise=100.0)
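
To make the role of each parameter concrete, here is a minimal NumPy sketch of one way such spectra could be simulated: each latent variable contributes a Gaussian peak of width `sigma`, and noise is added according to the signal-to-noise ratio. The function `simulate_spectra`, the random peak positions, the uniform concentrations and the noise model are all assumptions made for illustration; this is not pyPLS's implementation, and it does not reproduce the `Y` output of `simulateData`.

```python
import numpy as np

def simulate_spectra(n, ncp, p, sigma, signal_to_noise=100.0, seed=0):
    """Hypothetical stand-in for a spectra simulator (not pyPLS's code)."""
    rng = np.random.default_rng(seed)
    axis = np.arange(p)
    # One Gaussian peak of width sigma per latent variable, at a random position.
    centres = rng.uniform(0, p, size=ncp)
    profiles = np.exp(-0.5 * ((axis[None, :] - centres[:, None]) / sigma) ** 2)
    # Assumed: random non-negative amounts of each latent variable per spectrum.
    Z = rng.uniform(0, 1, size=(n, ncp))
    signal = Z @ profiles  # shape (n, p)
    # Gaussian noise whose standard deviation is std(signal) / signal_to_noise.
    noise = rng.normal(scale=signal.std() / signal_to_noise, size=signal.shape)
    return signal + noise, Z

X_demo, Z_demo = simulate_spectra(50, 80, 1000, 10.0, signal_to_noise=100.0)
print(X_demo.shape, Z_demo.shape)
```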

## Principal Component Analysis

A principal component analysis can be performed on the X array generated by the simulation using the function

out = pyPLS.pca(X, a, scaling=0)

Parameters:
 - X: {N, P} array like, a table of N observations (rows) and P variables (columns) - The explanatory variables,
 - a: The number of components to be fitted
 - scaling: float, optional, default is 0, a decimal typically between 0.0 and 1.0 that sets the scaling; typical examples are
     - 0.0 corresponds to mean centring
     - 0.5 corresponds to Pareto scaling
     - 1.0 corresponds to unit variance scaling
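
The three scaling values above are consistent with dividing each mean-centred column by its standard deviation raised to the power `scaling`. The sketch below illustrates that convention with NumPy. `scale_columns` is a hypothetical helper, and whether pyPLS computes its scaling exactly this way is an assumption.

```python
import numpy as np

def scale_columns(X, scaling=0.0):
    """Centre each column, then divide by its standard deviation ** scaling.

    scaling=0.0 -> mean centring, 0.5 -> Pareto, 1.0 -> unit variance.
    Hypothetical helper, not part of pyPLS.
    """
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # leave constant columns unscaled
    return centred / sd ** scaling

X_small = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [6.0, 40.0]])
X_centred = scale_columns(X_small, 0.0)
X_pareto = scale_columns(X_small, 0.5)
X_uv = scale_columns(X_small, 1.0)
```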

Returns:
 - out : a pca object with ncp components
        - Attributes:
            - ncp: number of components fitted
            - T :  scores table
            - P :  loadings table
            - E :  residual table
            - R2X: Part of variance of X explained by individual components
            - R2Xcum: Total variance explained by all the components

        - Methods:
            - scores(n), loadings(n)
                - n: int, component id
                - return the scores and loadings of the nth component
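
To show how the attributes listed above relate to each other, here is a minimal PCA sketch based on the singular value decomposition of the mean-centred table. The function `pca_svd` is hypothetical and only mirrors the attribute names; it assumes mean centring (scaling 0), and pyPLS may use a different algorithm internally.

```python
import numpy as np

def pca_svd(X, a):
    """PCA of a mean-centred table by SVD (illustrative, not pyPLS's code)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :a] * s[:a]                # scores, shape (N, a)
    P = Vt[:a].T                        # loadings, shape (P, a)
    E = Xc - T @ P.T                    # residuals after a components
    R2X = s[:a] ** 2 / np.sum(s ** 2)   # variance fraction per component
    R2Xcum = R2X.sum()                  # total variance fraction explained
    return T, P, E, R2X, R2Xcum

rng = np.random.default_rng(0)
X_rand = rng.normal(size=(20, 30))
T, P, E, R2X, R2Xcum = pca_svd(X_rand, 4)
print(T.shape, P.shape, E.shape)
```

With these definitions the centred table equals `T @ P.T + E`, and `R2Xcum` equals one minus the ratio of the residual sum of squares to the total sum of squares.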

The first four principal components can be obtained using the following command:

In [5]:
PCA = pyPLS.pca(X, 4, scaling=0)

A summary of the PCA can be obtained using the summary() method:

In [7]:
PCA.summary()

Summary of input table
----------------------
Observations: 50
Variables: 1000
Missing values: 0 (0.0%)

Summary of PCA:
---------------
Number of components: 4
Total Variance explained: 55.4%
Variance explained by component:
    - Component 1 : 27.6%
    - Component 2 : 11.0%
    - Component 3 : 8.9%
    - Component 4 : 7.9%
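
As a quick check on the summary above, the total variance explained is the sum of the per-component values. The numbers below are copied from the printed output.

```python
# Per-component variance explained (%), copied from the summary output.
r2x = [27.6, 11.0, 8.9, 7.9]

# The cumulative value reported as "Total Variance explained".
r2x_cum = round(sum(r2x), 1)
print(r2x_cum)  # 55.4
```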


We can then plot the scores, with a color corresponding to a group and a label appearing when hovering over a point. The same features are available for the loadings.

In [10]:
show(PCA.plotScores(1,2, groups=Groups, labels=Labels))

## PLS

pyPLS contains an implementation of the modified PLS1 (mPLS, single y). The classic PLS is also available for comparison. Let's use the first column of the simulated Y to compare PLS and mPLS.

In [11]:
PLS = pls.pls1(Xc,Yc[:,0],4)
mPLS = pls.mpls1(Xc, Yc[:,0])
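
For readers unfamiliar with classic PLS1, the sketch below implements the textbook NIPALS algorithm for a single y with NumPy. It is not pyPLS's code and it does not cover the modified variant. `pls1_nipals` is a hypothetical function, and the sketch assumes that `Xc` and `Yc` in the cell above denote column-centred versions of `X` and `Y`, since they are not defined in the cells shown.

```python
import numpy as np

def pls1_nipals(Xc, yc, a):
    """Classic PLS1 by NIPALS on centred Xc (N, P) and yc (N,)."""
    Xk = np.array(Xc, dtype=float)
    yk = np.array(yc, dtype=float)
    n, p = Xk.shape
    W, P, T, q = np.zeros((p, a)), np.zeros((p, a)), np.zeros((n, a)), np.zeros(a)
    for k in range(a):
        w = Xk.T @ yk                 # weights: covariance of X with y
        w /= np.linalg.norm(w)
        t = Xk @ w                    # scores
        tt = t @ t
        p_k = Xk.T @ t / tt           # X loadings
        q[k] = yk @ t / tt            # y loading
        Xk -= np.outer(t, p_k)        # deflate X
        yk -= q[k] * t                # deflate y
        W[:, k], P[:, k], T[:, k] = w, p_k, t
    # Regression coefficients in terms of the original centred variables.
    B = W @ np.linalg.solve(P.T @ W, q)
    return B, T, q

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 100))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=30)
Xc_demo = X_demo - X_demo.mean(axis=0)   # column-centred X
yc_demo = y_demo - y_demo.mean()         # centred y
B, T, q = pls1_nipals(Xc_demo, yc_demo, 4)
y_hat = Xc_demo @ B                      # fitted (centred) y
```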

The Y values predicted by the PLS and mPLS models are now compared.

In [17]:
show(PLS.scatterYhat(Y=Yc[:,0]))

In [13]:
show(mPLS.scatterYhat(Y=Yc[:,0]))

From the last two plots, it seems that PLS outperforms mPLS. But this may be due to overfitting. Let's use a random Y instead. For this, we import the NumPy library directly into the workspace in order to use its random number generator functions.

In [14]:
import numpy as np
Yrandom = np.random.rand(50)
PLS = pls.pls1(Xc,Yrandom,4)
mPLS = pls.mpls1(Xc, Yrandom)

In [15]:
show(PLS.scatterYhat(Y=Yrandom))

In [16]:
show(mPLS.scatterYhat(Y=Yrandom))

These results show that the output of classic PLS cannot be trusted when the variables of X contain random components. The PLS model is overoptimistic in both cases, and mPLS provides a more reliable model in this example.
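
A common way to expose this kind of overoptimism numerically is to compare the fitted R2 with a cross-validated Q2. The sketch below does this for a textbook NIPALS PLS1 on a purely random X and a random y. The helpers `pls1_coef` and `r2_and_q2`, the use of pure noise for X, and the 5-fold split are assumptions chosen for illustration; none of this is pyPLS's code or a reproduction of the plots above.

```python
import numpy as np

def pls1_coef(Xc, yc, a):
    """Regression coefficients of classic NIPALS PLS1 on centred data."""
    Xk, yk = Xc.copy(), yc.copy()
    W, P, q = [], [], []
    for _ in range(a):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w
        tt = t @ t
        p = Xk.T @ t / tt
        c = yk @ t / tt
        Xk = Xk - np.outer(t, p)   # deflate X
        yk = yk - c * t            # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def r2_and_q2(X, y, a, folds=5, seed=0):
    """Fitted R2 on all data and cross-validated Q2 from K-fold predictions."""
    xm, ym = X.mean(axis=0), y.mean()
    ss_tot = np.sum((y - ym) ** 2)
    fitted = (X - xm) @ pls1_coef(X - xm, y - ym, a) + ym
    r2 = 1 - np.sum((y - fitted) ** 2) / ss_tot
    idx = np.random.default_rng(seed).permutation(len(y))
    press = 0.0
    for test in np.array_split(idx, folds):
        train = np.setdiff1d(idx, test)
        # Centre with training statistics only, then predict the held-out rows.
        xm_t, ym_t = X[train].mean(axis=0), y[train].mean()
        b = pls1_coef(X[train] - xm_t, y[train] - ym_t, a)
        pred = (X[test] - xm_t) @ b + ym_t
        press += np.sum((y[test] - pred) ** 2)
    return r2, 1 - press / ss_tot

rng = np.random.default_rng(1)
X_noise = rng.normal(size=(50, 1000))   # assumed: X is pure noise
y_random = rng.random(50)               # random y, unrelated to X
r2, q2 = r2_and_q2(X_noise, y_random, a=4)
print(f"R2 = {r2:.2f}, Q2 = {q2:.2f}")
```

A high R2 together with a Q2 near or below zero indicates that the model fits the training data without any predictive ability, which is the signature of overfitting.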