Python implementation of HDDA
README for the HDDA python toolbox

Objectives

The package provides a Python implementation of the High Dimensional Discriminant Analysis (HDDA) and High Dimensional Data Clustering (HDDC) models introduced by Bouveyron et al.

The original R package is available on CRAN: http://cran.r-project.org/web/packages/HDclassif/index.html

Only a subset of the models is implemented: those that usually provide the best results in terms of classification accuracy.

Install and requirements

Just download the package. It has been tested on Linux (Debian Stretch). SciPy and scikit-learn must be installed. For faster processing, a good linear algebra library is also preferable; OpenBLAS is a good option.

import platform
platform.python_version()

‘2.7.13’

import scipy as sp
sp.__version__

‘1.0.0’

import sklearn
sklearn.__version__

‘0.19.1’

Usage

Crabs data set

We provide an introductory example, taken from Charles Bouveyron's thesis, on the crabs data set. We also compare with the standard GMM fitted by EM, as provided by scikit-learn.

First, load packages and define some variables:

import hdda
import matplotlib.pyplot as plt
import scipy as sp
from sklearn.decomposition import PCA
from sklearn import mixture

# Parameters for HDDA
MODEL = 'M2'
C = 4 # For this example we do not fit the number of classes
th = 0.05 # The threshold for the Cattell scree test
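The threshold th drives Cattell's scree test, which selects the intrinsic dimension of each class from the eigenvalue spectrum: the dimension is the last index whose drop in the scree plot is still a significant fraction of the largest drop. A minimal sketch of that rule (a common formulation; the exact implementation in hdda.py may differ in detail):

```python
import numpy as np

def cattell_dimension(eigenvalues, th):
    """Intrinsic dimension selected by Cattell's scree test.

    Keeps the last eigenvalue gap that is at least `th` times the
    largest gap (a sketch, not necessarily hdda.py's exact rule).
    """
    gaps = -np.diff(eigenvalues)             # drops between consecutive eigenvalues
    keep = gaps >= th * gaps.max()           # gaps that are still "significant"
    return int(np.nonzero(keep)[0][-1] + 1)  # last significant gap -> dimension

# Toy spectrum: two dominant axes, then noise
ev = np.array([10.0, 5.0, 0.3, 0.2, 0.1])
print(cattell_dimension(ev, th=0.05))  # -> 2
```

A larger th keeps only the sharpest drops and therefore selects fewer dimensions.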

Load the data

data = sp.load('crabs.npz')
X = data['x']
Y = data['y']

For illustration, we can plot the projection onto the first two principal components.

plt.figure()
pca = PCA(n_components=2)
Xp = pca.fit_transform(X)
plt.scatter(Xp[:,0],Xp[:,1],c=Y,s=40)
plt.savefig('2D_true_labels.png')

2D_true_labels.png

We could select the best model using BIC or ICL:

bic, icl = [], []
for model_ in ['M1', 'M2', 'M3', 'M4', 'M5']:
    model = hdda.HDDC(C=C,th=th,model=model_)
    model.fit(X)
    bic.append(model.bic)
    icl.append(model.icl)

plt.figure()
plt.plot(bic)
plt.plot(icl)
plt.legend(("BIC", "ICL"))
plt.xticks(sp.arange(5), ('M1', 'M2', 'M3', 'M4', 'M5'))
plt.grid()
plt.savefig('bic_icl_crabs.png')

bic_icl_crabs.png
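Both criteria penalize the log-likelihood by model complexity; ICL additionally penalizes fuzzy assignments through the classification entropy, so it favors well-separated clusters. With the lower-is-better convention, ICL = BIC + 2E, where E is the entropy of the posterior probabilities. A sketch with scikit-learn's GaussianMixture (the sign convention used inside hdda.py is an assumption):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two well-separated blobs
X = np.vstack([rng.randn(100, 2) - 5, rng.randn(100, 2) + 5])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
bic = gm.bic(X)  # -2 log L + k log n (lower is better)

# Classification entropy of the soft assignments
tik = gm.predict_proba(X)
entropy = -np.sum(tik * np.log(np.maximum(tik, 1e-12)))

icl = bic + 2 * entropy  # ICL penalizes fuzzy assignments
print("BIC = %.1f, ICL = %.1f" % (bic, icl))
```

On well-separated data the entropy is near zero and the two criteria nearly coincide; they diverge when clusters overlap.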

model = hdda.HDDC(C=C,th=th,model=MODEL)
model.fit(X)
model.bic
yp=model.predict(X)

plt.figure()
plt.scatter(Xp[:,0],Xp[:,1],c=yp,s=40)
plt.savefig("2D_hdda.png")

2D_hdda.png
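Cluster labels are only defined up to a permutation, so a label-invariant score such as the adjusted Rand index is a convenient way to compare the predicted labels yp with the true labels Y. A self-contained sketch (using small stand-in label vectors, not the crabs data):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Stand-ins for the true labels Y and the HDDC prediction yp
Y  = np.array([0, 0, 1, 1, 2, 2])
yp = np.array([2, 2, 0, 0, 1, 1])  # same partition, permuted label names

print(adjusted_rand_score(Y, yp))  # -> 1.0, permutation-invariant agreement
```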

The same learning is done with the GMM from scikit-learn, and we plot the results:

clf = mixture.GaussianMixture(n_components=4, covariance_type='full')
clf.fit(X)
yp=clf.predict(X)

plt.figure()
plt.scatter(Xp[:,0],Xp[:,1],c=yp,s=40)
plt.savefig('2D_gmm.png')

2D_gmm.png

The complete file is available in script_crabs.py.

Grassland

In this example, we show how HDDC clusters pixels from a satellite image time series. It is an example of the work of Maïlys Lopes on grassland monitoring. Again, we need to load the data and set up some parameters; then we use the fit_all function to learn the best model.

import hdda
import scipy as sp
import matplotlib.pyplot as plt
from sklearn import mixture
import time as time

# Load data
data = sp.load('prairie5.npy')
x = data
n,d=x.shape
print "Number of samples: {}\n Number of variables: {}".format(n,d)
# Parameters
MODEL = ['M1','M2','M3','M4','M5','M6','M7','M8']
th = [0.05,0.1,0.2]
C = sp.arange(1,5)

# Model Selection
model = hdda.HDGMM()
tic = time.clock()
model.fit_all(x,MODEL=MODEL,C=C,th=th,VERBOSE=True)
toc = time.clock()
print "Processing time: {}".format(toc-tic)

# Plot data
bands= ['B','G','R','NIR']

for i,b in enumerate(bands):
    plt.figure()
    # Plot the samples
    for j in xrange(n):
        plt.plot(data[j,(i*17):((i+1)*17)],'k',lw=0.5)
    # Plot the means
    for j in xrange(len(model.mean)):
        plt.plot(model.mean[j][(i*17):((i+1)*17)],lw=3)
    plt.savefig('grassland_{}.png'.format(b))

grassland_B.png grassland_G.png grassland_R.png grassland_NIR.png
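The slicing (i*17):((i+1)*17) above assumes each row stores the 4 spectral bands one after another, with 17 dates per band (68 features in total). If that layout holds, a reshape makes the band extraction clearer; a sketch on fake data:

```python
import numpy as np

n_bands, n_dates = 4, 17  # assumed layout: bands stored contiguously
x = np.arange(2 * n_bands * n_dates, dtype=float).reshape(2, -1)  # 2 fake samples

cube = x.reshape(-1, n_bands, n_dates)  # (samples, bands, dates)
nir = cube[:, 3, :]                     # all NIR time series at once
print(nir.shape)  # -> (2, 17)
```

cube[:, i, :] is then equivalent to the slice x[:, i*17:(i+1)*17] used in the plotting loops.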

Let’s do the clustering with the conventional GMM from scikit-learn. The best model is selected according to the BIC, as computed by scikit-learn.

bicGmm = []
lowest_bic  = sp.infty
for c in C:
    gmm = mixture.GaussianMixture(n_components=c, covariance_type='full')
    gmm.fit(x)
    bicGmm.append(gmm.bic(x))
    if bicGmm[-1] < lowest_bic:
        lowest_bic = bicGmm[-1]
        best_gmm = gmm

print bicGmm

# Plot data
bands= ['B','G','R','NIR']

for i,b in enumerate(bands):
    plt.figure()
    # Plot the samples
    for j in xrange(n):
        plt.plot(data[j,(i*17):((i+1)*17)],'k',lw=0.5)
    # Plot the means
    for j in xrange(best_gmm.means_.shape[0]):
        plt.plot(best_gmm.means_[j][(i*17):((i+1)*17)],lw=3)
    plt.savefig('grassland_gmm_{}.png'.format(b))        

grassland_gmm_B.png grassland_gmm_G.png grassland_gmm_R.png grassland_gmm_NIR.png