Skip to content

How To with Examples

Quantitative Life Sciences Initiative edited this page Jan 30, 2017 · 5 revisions

Clustering is a widely used unsupervised technique for identifying natural classes within a set of data. The idea is to group unlabeled data into subsets so the within-group homogeneity is relatively high and the between-group heterogeneity is also relatively high. The implication is that the groups should reflect the underlying structure of the data generator (DG). Amiri et al. (2015) developed methods for clustering via ensembling, and we implemented these methods in an R package, entitled EnsCat, available here and from CRAN. The basic technique is for low dimensional data but scales up to high dimensional data by increasing the size of the ensemble. 

We explain the clustering method below for low and high dimensional data.

1. Low dimensional data

1.2 K-modes clustering

The clustering of categorical data can be done in different ways, but the basic method is K-modes. It is a modified version of K-means that uses the cluster modes in place of the cluster means, see Huang (1998). The following code shows how to run K-modes:   

> data(zoo)
> kmodes(zoo$obs,k=7,1:7)

The data set zoo contains 101 observations of 16 dimensions. It can be downloaded from the UCI Machine Learning Repository, see Lichman (2013). The labels of the true clusters are saved in zoo$lab. The second argument in the function kmodes, k=7, is the number of clusters and the third argument is the specification of the initial modes, 1:7, as the first seven data points.  As recognized in Huang (1998), this method is sensitive to initial values leading to instability and frequent inaccuracy. To see this effect run the following code that selects the initial modes randomly.  

> kmodes(zoo$obs,k=7,sort(sample(dim(zoo$obs)[1],7)))

1.3 Model-based clustering

The clustering of data using a model-based approach is implemented in R in the "Rmixmod" library. Code for clustering the zoo data by a model-based approach is as follows.  

> library(Rmixmod)
> zoo2<-matrix(,nrow=nrow(zoo$obs),ncol=ncol(zoo$obs)
> for(ii in 1:ncol(zoo$obs)){
>  zoo2[,ii]<-as.character(zoo$obs[,ii])
> }
> xem2<-mixmodCluster(data.frame(zoo2), 2,model = mixmodMultinomialModel())
> xem2[12]

1.4 Hierarchical approach

Hierarchical clustering provides a nested sequence of clusterings that can be represented as a tree or dendrogram. This technique requires a matrix specifying the distances between data points. A hierarchical clustering using hamming distance and average linkage can be obtained by

> distham0<-hammingD((zoo$obs))
> distham<-as.dist(distham0)
> hcham<-hclust(distham,method=’average’)
> plot(hcham,label=zoo$lab,hang =-1)

The first command generates the average Hamming distance matrix for the data; the second and third commands produce and plot the hierarchical clustering as a dendrogram, respectively. The results are shown in Figure 1.1.  The numbers along the leaves are the true classes.

Figure 1.1 Dendrogram generated by hamming distance and average linkage on zoo data.

The ensembled distance matrix for low dimensional data can be obtained by: 

> disten<-Benhc(zoo$obs,En=200)
> en<-hclust(disten,method='average')
> plot(en,label=zoo$lab) 

The first command runs the clustering with ensembling on zoo$obs and produces the distance matrix. The second command generates the hierarchical clustering. hclust is a command in R from the package stats that can be used to make dendrograms. The final command plots the dendrogram; see Fig. 1.2.

Figure 1.2. Dendrogram generated by the bootstrap ensemble method on zoo data.

The same procedure can be used for any data set, Amiri et al. (2015) considered several datasets the codes are as below

 
 #########################
 ###### Run on Soy data
 #########################
 
 load("soy.RData")
 
 DisALsoy0<-hammingD(soy$obs)
 DisALsoy<-as.dist(DisALsoy0)
 hcALsoy<-hclust(DisALsoy,method='average')
 plot(hcALsoy,label=soy$lab,hang =-1)
 
 DisEnsoy<-Benhc(soy$obs,En=200)
 hsEnsoy<-hclust(DisEnsoy,method='average')
 plot(hsEnsoy,label=soy$lab,hang =-1)
 #########################
 ###### Run on Mushroom data
 #########################

 load("mush.RData")
 
 DisALmush0<-hammingD(mush$obs)
 DisALmush<-as.dist(DisALmush0)
 hcALmush<-hclust(DisALmush,method='average')
 plot(hcALmush,label=mush$lab,hang =-1)
 
 DisEnmush<-Benhc(mush$obs,En=200)
 hsEnmush<-hclust(DisEnmush,method='average')
 plot(hsEnmush,label=mush$lab,hang =-1)
 #########################
 ###### Run on Lympho data
 #########################

 load("lym.RData")
 
 DisALlym0<-hammingD(lym$obs)
 DisALlym<-as.dist(DisALlym0)
 hcALlym<-hclust(DisALlym,method='average')
 plot(hcALlym,label=lym$lab,hang =-1)
 
 DisEnlym<-Benhc(lym$obs,En=200)
 hsEnlym<-hclust(DisEnlym,method='average')
 plot(hsEnlym,label=lym$lab,hang =-1)
 #########################
 ###### Run on Cancer data
 #########################

 load("cancer.RData")
 
 DisALcancer0<-hammingD(cancer$obs)
 DisALcancer<-as.dist(DisALcancer0)
 hcALcancer<-hclust(DisALcancer,method='average')
 plot(hcALcancer,label=cancer$lab,hang =-1)
 
 DisEncancer<-Benhc(cancer$obs,En=200)
 hsEncancer<-hclust(DisEncancer,method='average')
 plot(hsEncancer,label=cancer$lab,hang =-1)

2. High dimensional data

To explain our approach for high dimensional data, we consider DNA sequence data after multiple alignment. The sequence data takes values from the set {A, T , C, G, φ}. R runs faster on numerical values than character values, and numerical values take less memory, so it is efficient to convert the data set from character to numeric values, hence it is convenient to convert the values {A,T,C,G, φ} to {1,2,3,4,NA}. Let the dataset be "dna.fasta"; fasta is a common format for genetic data. The following commands convert the data

> library("seqinr")
> x.dna0 <- read.fasta("dna.fasta")
> x.dna<-CTN(x.dna0)

To demonstrate the clustering of high dimensional data, we cluster the Alpha herpesviridae viral genome data (Pickett et al. (2012 )). According to the Virus Pathogen Database and Analysis Resource (http://www.viprbrc.org/), Herpesviridae is divided into three subfamilies (Alphaherpesviridae, Betaherpesviridae and Gammaherpesviridae). We limited our analysis to the distinct and complete genomes in Alphaherpesviridae that have known hosts. Within Alphaherpesviridae there are five genera, namely, Iltovirus (IIt), Mardivirus (Mar), Scutavirus, Simplexvirus (Sim), and Varicellovirus (Var). Since Scutavirus did not have any complete genomes, we omitted it. The remaining genera had 20, 18, 20, and 40 genomes, respectively, from different hosts, namely, human, monkey, chicken, turkey, duck, cow, bat, equidae, boar, cat, and Amazona oratrix (denoted hum, mon, chi, tur, duc, cow, bat, equ, boa, cat, and ora, respectively, in the dendrograms). These viral genomes have lengths ranging from 124784 to 178311 base pairs. To cluster categorical vectors of different lengths, the first step is to preprocess the data using a multiple alignment approach so all the vectors have the same length. There are several programs that do multiple alignment. We used MAFFT-7, see Katoh and Standley (2013). We stored the aligned data in a file called ‘alphadata’ and pulled it into R using

> data("alphadata")

The data labels are assigned using  

> xlab1<-NULL
> xlab1[1:8]<-"Var-boa"
> xlab1[9:13]<-"Var-hum"
> xlab1[14]<-"Var-cat"
> xlab1[15:32]<-"Var-equ"
> xlab1[33]<-"Var-mon"
> xlab1[34:40]<-"Var-cow"

> xlab1[41:45]<-"Sim-mon"
> xlab1[46:47]<-"Sim-mon"
> xlab1[48:58]<-"Sim-hum"
> xlab1[59]<-"Sim-bat"
> xlab1[60]<-"Sim-mon"

> xlab1[61]<-"Mar-tur"
> xlab1[62:71]<-"Mar-chi"
> xlab1[72:78]<-"Mar-duc"
> xlab1[79]<-"Ilt-ora"
> xlab1[80:98]<-"Ilt-chi"

The following commands show how to generate a dendrogram using the hamming distance  and average linkage.

> dis0<-hammingD(alphadata)
> REDIST<-as.dist(dis0)
> hc0 <- hclust(REDIST,method = "average")
> plot(hc0,label=xlab1,hang =-1)

To speed up the calculations, the following code should be run

library("foreach")
library("doParallel")

cl <- makeCluster(detectCores()-1) # create a cluster with 2 cores
registerDoParallel(cl) # register the cluster
ens = foreach(i = 1:200, 
              .combine = "cbind", 
              .packages = "EnsCat") %dopar% {
                fit1 <- enhcHi(alphadata,En=1,len=c(2,10), type=2)
                fit1
              }
stopCluster(cl) 

dis0<-hammingD(data.frame(ens))
REDIST<-as.dist(dis0)
hc0 <- hclust(REDIST,method = "average")
plot(hc0,label=xlab1,hang =-1)

The figure is presented in Figure 2.1.  The sample labels are shown on the leaves.

Figure 2.1. Dendrogram generated by the hamming distance with average linkage on alphadata.

The following commands show how to get the dendrogram using the random subspace ensemble approach.

> ens<-enhcHi(alphadata,En=100,len=c(2,10))
> dis0<-hammingD(ens)
> REDIST<-as.dist(dis0)
> hc0 <- hclust(REDIST,method = "average")
> plot(hc0,label=xlab1,hang =-1) 

The ensemble method combines clustering of random sizes, where the sizes are drawn from a discrete uniform.  Each component of the ensemble is based on a different double bootstrap sample of the variables.  In the commands above, DU(2,10) indicates the range of the number of clusters. The result is presented in Figure 2.2

Figure 2.2. Dendrogram generated by ensembling clusterings of random sizes on DU(2,10) on alphadata.

The codes of Ebola and SNP RICE are as follow:

#########################
###### Run on Ebola data
#########################

load("ebola.RData")
# Run AL
DisAL0<-hammingD(ebola$obs)
DisAL<-as.dist(DisAL0)
hcAL <- hclust(DisAL,method = "average")
plot(hcAL,label=ebola$la,hang =-1) 

# Run EN-AL
ens<-enhcHi(ebola$obs,En=200,len=c(2,10), type=2)
DisEn0<-hammingD(ens)
DisEn<-as.dist(DisEn0)
hcEN <- hclust(DisEn,method = "average")
plot(hcEN,label=ebola$la,hang =-1)
#########################
###### Run on SNP data
#########################
SNPRICEgit <- "https://github.com/jlp2duke/EnsCat/blob/master/SNPRICE.RData?raw=true"
load(url(SNPRICEgit))

### Note:the label are a:aromatic,  d:indica, 
### e: temperate japonica, r: tropical japonica, s:aus 
### Note: This data is very big data so  and running ensembling is very time-consuming

DisAL0<-hammingD(SNPRICE$obs)
DisAL<-as.dist(DisAL0)
hcAL <- hclust(DisAL,method = "average")
plot(hcAL,label=SNPRICE$lab,hang =-1) 


# Run EN-AL
ens<-enhcHi(SNPRICE$obs,En=200,len=c(2,20), type=2)
DisEn0<-hammingD(ens)
DisEn<-as.dist(DisEn0)
hcEN <- hclust(DisEn,method = "average")
plot(hcEN,label=SNPRICE$lab,hang =-1)

library(trio)
library(ape)

colorCodes <- c(e="red", d="green", s="blue", a="yellow", r="black")

ashc<-as.phylo(hcAL)
ashc$tip.label<-SNPRICE$lab

plot(ashc,tip.color=colorCodes[SNPRICE$lab], type="fan")

ashc<-as.phylo(hcEN)
ashc$tip.label<-SNPRICE$lab

plot(ashc,tip.color=colorCodes[SNPRICE$lab], type="fan")

REFERENCES

[1] Amiri, S., Clarke, J., & Clarke, B. (2015). Clustering categorical data via ensembling dissimilarity matrices. arXiv:1506.07930.
[2] Huang, Z. (1998). Extensions to the v-means Algorithm for Clustering Large Data Sets with Categorical Values, Data Mining and Knowledge Discovery, 2, 283-304.
[3] Katoh, K., and D.M. Standley (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability, Molecular biology and evolution, 30(4), 772-780. http://mafft.cbrc.jp/alignment/software/
[4] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 
[5] Pickett, B. E., Sadat, E. L., Zhang, Y., Noronha, J. M., Squires, R. B., Hunt, V., ... & Scheuermann, R. H. (2012). ViPR: an open bioinformatics database and analysis resource for virology research. Nucleic acids research, 40(D1), D593-D598. 

Clone this wiki locally
You can’t perform that action at this time.