Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
78 lines (53 sloc) 2.84 KB
Advanced Principal Components Analysis (PCA) demo
-------------------------------------------------
The tutorial is a continuation of examples/PCA/Main.lhs
Previously, we were able to quickly determine that the food ration
in Northern Ireland is somewhat different comparing to the other three
countries. We have projected data from 17 dimensions into
one dimension. However, we could also make a projection into
two or more dimensions. So how do we determine which dimensionality
reduction does preserve the most of the information? How do we make
sure that only redundant information was removed?
For that purpose, we will calculate retained variance [1]
depending on the number of the principal components.
[1] http://www.dsc.ufcg.edu.br/~hmg/disciplinas/posgraduacao/rn-copin-2014.3/material/SignalProcPCA.pdf
Let's start with the imports and data definition.
> import Learning ( pca' )
> import qualified Numeric.LinearAlgebra as LA
> import Data.List ( scanl' )
> import Text.Printf ( printf )
> england = [375, 57, 245, 1472, 105, 54, 193, 147, 1102,
> 720, 253, 685, 488, 198, 360, 1374, 156]
> northernIreland = [135, 47, 267, 1494, 66, 41, 209, 93, 674,
> 1033, 143, 586, 355, 187, 334, 1506, 139]
> scotland = [458, 53, 242, 1462, 103, 62, 184, 122, 957,
> 566, 171, 750, 418, 220, 337, 1572, 147]
> wales = [475, 73, 227, 1582, 103, 64, 235, 160, 1137,
> 874, 265, 803, 570, 203, 365, 1256, 175]
> countries = map LA.fromList [england, northernIreland, scotland, wales]
In order to compute the eigenvectors u and eigenvalues eig of
a covariance matrix, we use function pca'. In fact, pca' was
already called under the hood in the previous tutorial.
> (u, eig) = pca' countries
Sums of the first N eigenvalues:
> cumul = drop 1 $ scanl' (+) 0 $ LA.toList eig
Sum of all eigenvalues:
> total = last cumul
> main = mapM_ (\(i, s) ->
> let retained = s / total * 100 :: Double
> msg = "%d principal component(s): Retained variance %.1f%%"
> in putStrLn $ printf msg i retained)
> $ zip [1::Int ..] cumul
1 principal component(s): Retained variance 67.4%
2 principal component(s): Retained variance 96.5%
3 principal component(s): Retained variance 100.0%
4 principal component(s): Retained variance 100.0%
...
17 principal component(s): Retained variance 100.0%
From this data we can conclude that the first two principal components
contain 96.5% of information. Therefore, we will loose 3.5% of information
after projecting into two orthogonal axes in the transformed coordinate system
obtained after PCA.
Hint: to compute compression (dimensionality reduction) and
decompression functions for specified variance to retain, use
(_compress. pcaVariance) and (_decompress. pcaVariance) functions.
You can’t perform that action at this time.