Installing and Testing Mahout


Apache Mahout is an Apache project whose goal is to build scalable machine learning libraries; most of its algorithms run on top of Hadoop.

Before you install Mahout on your cluster, make sure of the following (quick verification commands are shown after the list):

1. You have Java and an up-to-date JDK installed on your cluster.

2. You have passwordless ssh set up between the nodes of your cluster.

3. You have Hadoop up and running on your cluster.
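
You can check each prerequisite with a few quick commands. This is only a sketch; it assumes Hadoop's bin directory is already on your PATH (otherwise use the full path to the hadoop executable) and uses one of the slave hosts listed below as the ssh target:

java -version                          # the JDK should be installed and report its version
ssh priv-social11 exit && echo "ssh OK"  # passwordless login to a slave node should succeed
hadoop version                         # Hadoop should be installed and on the PATH
hadoop fs -ls /                        # HDFS should be reachable
jps                                    # on the master, NameNode and JobTracker should be listed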

In this tutorial, we will configure and test Mahout on a Hadoop cluster with one master node and nine slave nodes.

Master Node:

129.105.126.242 priv-social10

Slave Nodes:

129.105.126.243 priv-social11
129.105.126.244 priv-social12
129.105.126.245 priv-social13
129.105.126.246 priv-social14
129.105.126.247 priv-social15
129.105.126.248 priv-social16
129.105.126.249 priv-social17
129.105.126.250 priv-social18
129.105.126.251 priv-social19

Step 1: Download the Released Mahout

Step 1.1: Download

You can download Mahout 0.8 from this link: http://mirror.symnds.com/software/Apache/mahout/0.8/mahout-distribution-0.8.tar.gz
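
For example, assuming you want to keep the archive in /tmp as in the rest of this tutorial, you can fetch it with wget:

cd /tmp
wget http://mirror.symnds.com/software/Apache/mahout/0.8/mahout-distribution-0.8.tar.gz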

Step 1.2: Installing Mahout

Unpack mahout-distribution-0.8.tar.gz in any directory you like; in this tutorial it is unpacked in /tmp. Use the following command for unpacking:

tar -xzvf mahout-distribution-0.8.tar.gz

Make sure that you own the directory you chose. Supposing your user name is user, you can take ownership of the unpacked directory with:

chown -R user:user mahout-distribution-0.8

Step 1.3: Setting Environment Variables

We need to set HADOOP_HOME and MAHOUT_HOME before testing:

export HADOOP_HOME=/tmp/hadoop-1.1.2
export PATH=$HADOOP_HOME/bin:$PATH

export MAHOUT_HOME=/tmp/mahout-distribution-0.8
export PATH=$MAHOUT_HOME/bin:$PATH
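
These exports only last for the current shell session. As a sketch (the paths are the ones used above; adjust them if you unpacked elsewhere), you can append the variables to ~/.bashrc so they survive new sessions and then verify that the mahout launcher is found:

cat >> ~/.bashrc <<'EOF'
export HADOOP_HOME=/tmp/hadoop-1.1.2
export MAHOUT_HOME=/tmp/mahout-distribution-0.8
export PATH=$HADOOP_HOME/bin:$MAHOUT_HOME/bin:$PATH
EOF
source ~/.bashrc
which mahout    # should print /tmp/mahout-distribution-0.8/bin/mahout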

Step 2: Testing Mahout

Mahout itself needs no further configuration, so we are ready for testing now.

We will run three test examples in this tutorial.

Step 2.1: Twenty Newsgroups Classification Example

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. We will use the Mahout Bayes classifier to build a model that classifies a new document into one of the 20 newsgroups.

You need to download the data file first and then run this script from the mahout-distribution-0.8 folder: ./examples/bin/classify-20newsgroups.sh

Note: this script only downloads the data from the website; it does not put the test data into HDFS, so you need to put the test data into HDFS manually.

You can download test data from this address: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz.

Then unpack this package; you will get 20news-bydate-test/ and 20news-bydate-train/. Copy these two folders to HDFS:

hadoop fs -put 20news-bydate-test 20news-bydate-train /tmp/mahout-work-download/20news-all
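
To confirm the upload, you can list the target directory (a quick check; the path is the same one used in the command above):

hadoop fs -ls /tmp/mahout-work-download/20news-all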

Then you are ready to run the script from the mahout-distribution-0.8 folder: ./examples/bin/classify-20newsgroups.sh

You should get a result similar to the following. Each row of the confusion matrix shows how the documents of one newsgroup were classified; for example, in the first row 381 of the 398 rec.motorcycles documents were classified correctly:

=======================================================
Confusion Matrix
-------------------------------------------------------
a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   <--Classified as
381 0   0   0   0   9   1   0   0   0   1   0   0   2   0   1   0   0   3   0   0    |  398  a     = rec.motorcycles
1   284 0   0   0   0   1   0   6   3   11  0   66  3   0   1   6   0   4   9   0    |  395  b     = comp.windows.x
2   0   339 2   0   3   5   1   0   0   0   0   1   1   12  1   7   0   2   0   0    |  376  c     = talk.politics.mideast
4   0   1   327 0   2   2   0   0   2   1   1   0   5   1   4   12  0   2   0   0    |  364  d     = talk.politics.guns
7   0   4   32  27  7   7   2   0   12  0   0   6   0   100 9   7   31  0   0   0    |  251  e     = talk.religion.misc
10  0   0   0   0   359 2   2   0   1   3   0   1   6   0   1   0   0   11  0   0    |  396  f     = rec.autos
0   0   0   0   0   1   383 9   1   0   0   0   0   0   0   0   0   0   3   0   0    |  397  g     = rec.sport.baseball
1   0   0   0   0   0   9   382 0   0   0   0   1   1   1   0   2   0   2   0   0    |  399  h     = rec.sport.hockey
2   0   0   0   0   4   3   0   330 4   4   0   5   12  0   0   2   0   12  7   0    |  385  i     = comp.sys.mac.hardware
0   3   0   0   0   0   1   0   0   368 0   0   10  4   1   3   2   0   2   0   0    |  394  j     = sci.space
0   0   0   0   0   3   1   0   27  2   291 0   11  25  0   0   1   0   13  18  0    |  392  k     = comp.sys.ibm.pc.hardware
8   0   1   109 0   6   11  4   1   18  0   98  1   3   11  10  27  1   1   0   0    |  310  l     = talk.politics.misc
0   11  0   0   0   3   6   0   10  6   11  0   299 13  0   2   13  0   7   8   0    |  389  m     = comp.graphics
6   0   1   0   0   4   2   0   5   2   12  0   8   321 0   4   14  0   8   6   0    |  393  n     = sci.electronics
2   0   0   0   0   0   4   1   0   3   1   0   3   1   372 6   0   2   1   2   0    |  398  o     = soc.religion.christian
4   0   0   1   0   2   3   3   0   4   2   0   7   12  6   342 1   0   9   0   0    |  396  p     = sci.med
0   1   0   1   0   1   4   0   3   0   1   0   8   4   0   2   369 0   1   1   0    |  396  q     = sci.crypt
10  0   4   10  1   5   6   2   2   6   2   0   2   1   86  15  14  152 0   1   0    |  319  r     = alt.atheism
4   0   0   0   0   9   1   1   8   1   12  0   3   6   0   2   0   0   341 2   0    |  390  s     = misc.forsale
8   5   0   0   0   1   6   0   8   5   50  0   40  2   1   0   9   0   3   256 0    |  394  t     = comp.os.ms-windows.misc
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0    |  0    u     = unknown

Step 2.2: Clustering of synthetic control data

You can download the test data set from this link: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data. You will get a data file named synthetic_control.data.

Because the Mahout synthetic control jobs hard-code the input path to testdata in HDFS, you need to create that folder in HDFS first and then put the downloaded data into it.

You can use this command to make testdata folder in HDFS: $HADOOP_HOME/bin/hadoop fs -mkdir testdata.

And you can use this command to put synthetic_control.data into HDFS: $HADOOP_HOME/bin/hadoop fs -put <PATH TO synthetic_control.data> testdata.

Then you are ready to run the test. Pick one of the following five examples and run it (a sketch for inspecting the results follows the list).

For canopy :
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job

For kmeans :
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

For fuzzykmeans :
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job

For dirichlet :
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job

For meanshift :
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job
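
Each of these jobs reads from testdata and writes its results to an output directory named output in HDFS. As a sketch for inspecting the results afterwards (the clusterdump flag names are assumptions based on Mahout 0.8; run $MAHOUT_HOME/bin/mahout clusterdump --help to confirm them on your installation):

$HADOOP_HOME/bin/hadoop fs -ls output                                  # clustering results, stored as sequence files
$MAHOUT_HOME/bin/mahout clusterdump -i output/clusters-*-final -o clusters.txt   # dump the final clusters to a local text file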

Step 2.3: Parallel Frequent Pattern Mining

Mahout has a Top K Parallel FPGrowth Implementation. It is based on this paper http://infolab.stanford.edu/~echang/recsys08-69.pdf with some optimisations in mining the data.

You can download a test data set from this website: http://fimi.ua.ac.be/data/. We will use the retail data for testing.

Download retail data from this link: http://fimi.ua.ac.be/data/retail.dat.

Then create a directory in HDFS and put the data into it with the following two commands:

$HADOOP_HOME/bin/hadoop fs -mkdir mahout-test/fpg
$HADOOP_HOME/bin/hadoop fs -put <PATH TO retail.dat> mahout-test/fpg

Then you can write a small shell script to set the parameters and run the test. Here is my script, named frequent-pattern-mining-retail-mapreduce.sh:

#!/bin/bash
# Run Mahout's parallel FPGrowth on the retail transactions.
# -i / -o : input file and output directory in HDFS
# -k 50 : mine the top 50 patterns per item; -s 2 : minimum support of 2
# -method mapreduce : run as a MapReduce job; -regex '[\ ]' : items are space-separated
nohup mahout fpg \
    -i mahout-test/fpg/retail.dat \
    -o mahout-test/fpg/retail-patterns \
    -k 50 \
    -method mapreduce \
    -regex '[\ ]' \
    -s 2 > mahout-retail-mapreduce-output 2>&1 &

Make the script executable with chmod +x frequent-pattern-mining-retail-mapreduce.sh, then simply run ./frequent-pattern-mining-retail-mapreduce.sh to start the test in the background. All console output will be stored in mahout-retail-mapreduce-output.
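
Once the job finishes, the mined patterns are written to the output directory in HDFS as sequence files. The following is only a sketch for inspecting them: the frequentpatterns sub-directory name and the seqdumper -i flag are assumptions about the Mahout 0.8 fpg output layout, so list the output directory first and run mahout seqdumper --help if anything differs on your version:

$HADOOP_HOME/bin/hadoop fs -ls mahout-test/fpg/retail-patterns
$MAHOUT_HOME/bin/mahout seqdumper -i mahout-test/fpg/retail-patterns/frequentpatterns | head -n 40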