
# HDDA

Python implementation of HDDA


# Objectives

The package provides a Python implementation of the High Dimensional Discriminant Analysis (HDDA) and High Dimensional Data Clustering (HDDC) models; see the publications referenced by the original `R` package, available on CRAN: http://cran.r-project.org/web/packages/HDclassif/index.html

Only a subset of the models is implemented: those that usually provide the best results in terms of classification accuracy.

# Install and requirements

Just download the package. It has been tested on Linux (Debian Stretch). SciPy and scikit-learn must be installed. Also, for faster processing, linking against a good linear algebra library is preferable; OpenBLAS is a good option.

```python
import platform
platform.python_version()
# '2.7.13'

import scipy as sp
sp.__version__
# '1.0.0'

import sklearn
sklearn.__version__
# '0.19.1'
```
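To check which linear algebra backend SciPy is actually linked against (and thus whether OpenBLAS is in use), the build configuration can be inspected:

```python
import scipy as sp

# Print the BLAS/LAPACK libraries SciPy was built against;
# an optimized backend such as OpenBLAS should show up here.
sp.__config__.show()
```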

# Usage

## Crabs data set

We provide an introductory example, taken from Charles Bouveyron's thesis, on the crabs data set. We also compare the results with the standard EM-fitted GMM provided by scikit-learn.

First, load packages and define some variables:

```python
import hdda
import matplotlib.pyplot as plt
import scipy as sp
from sklearn.decomposition import PCA
from sklearn import mixture

# Parameters for HDDA
MODEL = 'M2'
C = 4      # For this example we do not fit the number of classes
th = 0.05  # Threshold for the Cattell scree test
```

Then, load the data:

```python
data = sp.load('crabs.npz')
X = data['x']
Y = data['y']
```

For illustration, we can plot the projection onto the first two principal components.

```python
plt.figure()
pca = PCA(n_components=2)
Xp = pca.fit_transform(X)
plt.scatter(Xp[:, 0], Xp[:, 1], c=Y, s=40)
plt.savefig('2D_true_labels.png')
```

We can select the best model using the BIC or the ICL:

```python
bic, icl = [], []
for model_ in ['M1', 'M2', 'M3', 'M4', 'M5']:
    model = hdda.HDDC(C=C, th=th, model=model_)
    model.fit(X)
    bic.append(model.bic)
    icl.append(model.icl)

plt.figure()
plt.plot(bic)
plt.plot(icl)
plt.legend(("BIC", "ICL"))
plt.xticks(sp.arange(5), ('M1', 'M2', 'M3', 'M4', 'M5'))
plt.grid()
plt.savefig('bic_icl_crabs.png')
```

Then, we fit the selected model and plot the clustering results:

```python
model = hdda.HDDC(C=C, th=th, model=MODEL)
model.fit(X)
model.bic
yp = model.predict(X)

plt.figure()
plt.scatter(Xp[:, 0], Xp[:, 1], c=yp, s=40)
plt.savefig('2D_hdda.png')
```

The same learning is done with the GMM from scikit-learn, and we plot the results:

```python
clf = mixture.GaussianMixture(n_components=4, covariance_type='full')
clf.fit(X)
yp = clf.predict(X)

plt.figure()
plt.scatter(Xp[:, 0], Xp[:, 1], c=yp, s=40)
plt.savefig('2D_gmm.png')
```

The complete file is available in `script_crabs.py`.
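To compare the two clusterings beyond a visual check, the predicted labels can be scored against the true classes with the adjusted Rand index (a sketch; `adjusted_rand_score` is invariant to label permutation, so it suits unsupervised predictions like `yp` above):

```python
from sklearn.metrics import adjusted_rand_score

# Toy example: a clustering that merely renames the classes
# still gets a perfect score of 1.0.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [2, 2, 0, 0, 1, 1]  # same partition, different label names
print(adjusted_rand_score(y_true, y_pred))  # -> 1.0
```

On the crabs example, `adjusted_rand_score(Y, yp)` would give one number per method to compare HDDA and the GMM.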

## Grassland

In this example, we show how HDDC clusters pixels from a satellite image time series. Again, we load the data and set up some parameters, then use the `fit_all` function to select and learn the best model. This example comes from the work of Maïlys Lopes on grassland monitoring from satellite image time series.
```python
import hdda
import scipy as sp
import matplotlib.pyplot as plt
from sklearn import mixture
import time

x = data  # data: (n, d) array of time series samples (loading not shown)
n, d = x.shape
print("Number of samples: {}\nNumber of variables: {}".format(n, d))

# Parameters
MODEL = ['M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8']
th = [0.05, 0.1, 0.2]
C = sp.arange(1, 5)

# Model selection
model = hdda.HDGMM()
tic = time.clock()
model.fit_all(x, MODEL=MODEL, C=C, th=th, VERBOSE=True)
toc = time.clock()
print("Processing time: {}".format(toc - tic))
```
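Note that, assuming `fit_all` tries every combination of model, threshold and number of classes (which its signature suggests, though this is not confirmed by the text), the number of fitted configurations grows multiplicatively, which explains why timing the call is worthwhile:

```python
# Size of the search grid used above (assuming an exhaustive search).
MODEL = ['M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8']
th = [0.05, 0.1, 0.2]
C = range(1, 5)

n_configs = len(MODEL) * len(th) * len(C)
print(n_configs)  # 8 models x 3 thresholds x 4 class numbers = 96 fits
```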

```python
# Plot data
bands = ['B', 'G', 'R', 'NIR']

for i, b in enumerate(bands):
    plt.figure()
    # Plot the samples
    for j in range(n):
        plt.plot(x[j, (i * 17):((i + 1) * 17)], 'k', lw=0.5)
    # Plot the means
    for j in range(len(model.mean)):
        plt.plot(model.mean[j][(i * 17):((i + 1) * 17)], lw=3)
    plt.savefig('grassland_{}.png'.format(b))
```
![grassland_B](grassland_B.png)
![grassland_G](grassland_G.png)
![grassland_R](grassland_R.png)
![grassland_NIR](grassland_NIR.png)

Let’s do the clustering with the conventional GMM from scikit-learn. The best model is selected according to the BIC, as implemented in scikit-learn.

```python
bicGmm = []
lowest_bic = sp.inf
for c in C:
    gmm = mixture.GaussianMixture(n_components=c, covariance_type='full')
    gmm.fit(x)
    bicGmm.append(gmm.bic(x))
    if bicGmm[-1] < lowest_bic:
        lowest_bic = bicGmm[-1]
        best_gmm = gmm

print(bicGmm)

# Plot data
bands = ['B', 'G', 'R', 'NIR']

for i, b in enumerate(bands):
    plt.figure()
    # Plot the samples
    for j in range(n):
        plt.plot(x[j, (i * 17):((i + 1) * 17)], 'k', lw=0.5)
    # Plot the means
    for j in range(best_gmm.means_.shape[0]):
        plt.plot(best_gmm.means_[j][(i * 17):((i + 1) * 17)], lw=3)
    plt.savefig('grassland_gmm_{}.png'.format(b))
```
![grassland_gmm_B](grassland_gmm_B.png)
![grassland_gmm_G](grassland_gmm_G.png)
![grassland_gmm_R](grassland_gmm_R.png)
![grassland_gmm_NIR](grassland_gmm_NIR.png)
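Once the best GMM is retained, a quick sanity check is to count how many samples fall in each cluster; `predict` returns one label per sample, so a bincount gives the cluster sizes (a sketch on synthetic data, not the grassland series):

```python
import numpy as np
from sklearn import mixture

# Two well-separated synthetic clusters of 50 samples each.
rng = np.random.RandomState(0)
x = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])

gmm = mixture.GaussianMixture(n_components=2, random_state=0).fit(x)
sizes = np.bincount(gmm.predict(x))
print(sizes)  # cluster sizes; they sum to the number of samples
```

An empty or near-empty cluster here would suggest that the selected number of components is too large.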
