Installing and Testing Mahout
Apache Mahout™, which runs on Hadoop, is an Apache project whose goal is to build scalable machine learning libraries.
Before you install Mahout on your cluster, you need to make sure that:
1. You have Java with a recent JDK installed on your cluster.
2. You have ssh set up on your cluster.
3. You have Hadoop running on your cluster.
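A quick way to sanity-check these three prerequisites (a minimal sketch, assuming java, ssh, and the hadoop command are already on your PATH):
java -version
ssh localhost echo ok
hadoop version
jps
On the master node, jps should list the Hadoop daemons (for Hadoop 1.x, NameNode and JobTracker) if the cluster is up.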
In this tutorial, we will configure and test Mahout on a Hadoop cluster with one master node and nine slave nodes.
Master Node:
129.105.126.242 priv-social10
Slave Nodes:
129.105.126.243 priv-social11
129.105.126.244 priv-social12
129.105.126.245 priv-social13
129.105.126.246 priv-social14
129.105.126.247 priv-social15
129.105.126.248 priv-social16
129.105.126.249 priv-social17
129.105.126.250 priv-social18
129.105.126.251 priv-social19
Step 1: Download the Mahout Release
Step 1.1: Download
You can use this link to download Mahout 0.8 (http://mirror.symnds.com/software/Apache/mahout/0.8/mahout-distribution-0.8.tar.gz)
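For example, you can fetch the tarball directly on the master node with wget (curl -O works as well):
wget http://mirror.symnds.com/software/Apache/mahout/0.8/mahout-distribution-0.8.tar.gz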
Step 1.2: Installing Mahout
Unpack mahout-distribution-0.8.tar.gz in any directory you want; I put it in /tmp. Use the following command to unpack it: tar -xzvf mahout-distribution-0.8.tar.gz.
You need to make sure that you are the owner of the directory you chose. Supposing your user name is user, you can give yourself ownership of the directory with: chown -R user:user mahout-distribution-0.8.
Step 1.3: Setting Environment Variables
We need to set HADOOP_HOME and MAHOUT_HOME before testing.
export HADOOP_HOME=/tmp/hadoop-1.1.2
export PATH=$HADOOP_HOME/bin:$PATH
export MAHOUT_HOME=/tmp/mahout-distribution-0.8
export PATH=$MAHOUT_HOME/bin:$PATH
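Adjust both paths if you installed Hadoop or Mahout somewhere other than /tmp. As a rough sanity check (the exact output is version dependent), running the Mahout driver with no arguments should print the list of valid program names:
$MAHOUT_HOME/bin/mahout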
Step 2: Testing Mahout
There is no need to configure Mahout itself, so we are ready for testing now.
We will run three test examples in this tutorial.
Step 2.1: Twenty Newsgroups Classification Example
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. We will use the Mahout Bayes classifier to create a model that classifies a new document into one of the 20 newsgroups.
You need to download the data file first and then run this script from the mahout-distribution-0.8 folder: ./examples/bin/classify-20newsgroups.sh
Note: This script only downloads the data from the website; it does not put the test data into HDFS, so you need to put the test data into HDFS manually.
You can download the test data from this address: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz.
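A minimal sketch for fetching and unpacking the data, and for creating the target directory in HDFS (the put command below expects it to exist when copying two folders at once):
wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
tar -xzvf 20news-bydate.tar.gz
$HADOOP_HOME/bin/hadoop fs -mkdir /tmp/mahout-work-download/20news-all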
After unpacking the archive you will get two folders, 20news-bydate-test and 20news-bydate-train. Copy these two folders to HDFS:
hadoop fs -put 20news-bydate-test 20news-bydate-train /tmp/mahout-work-download/20news-all
Then you are ready to run the script from the mahout-distribution-0.8 folder: ./examples/bin/classify-20newsgroups.sh.
You should get a result similar to the following:
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t u <--Classified as
381 0 0 0 0 9 1 0 0 0 1 0 0 2 0 1 0 0 3 0 0 | 398 a = rec.motorcycles
1 284 0 0 0 0 1 0 6 3 11 0 66 3 0 1 6 0 4 9 0 | 395 b = comp.windows.x
2 0 339 2 0 3 5 1 0 0 0 0 1 1 12 1 7 0 2 0 0 | 376 c = talk.politics.mideast
4 0 1 327 0 2 2 0 0 2 1 1 0 5 1 4 12 0 2 0 0 | 364 d = talk.politics.guns
7 0 4 32 27 7 7 2 0 12 0 0 6 0 100 9 7 31 0 0 0 | 251 e = talk.religion.misc
10 0 0 0 0 359 2 2 0 1 3 0 1 6 0 1 0 0 11 0 0 | 396 f = rec.autos
0 0 0 0 0 1 383 9 1 0 0 0 0 0 0 0 0 0 3 0 0 | 397 g = rec.sport.baseball
1 0 0 0 0 0 9 382 0 0 0 0 1 1 1 0 2 0 2 0 0 | 399 h = rec.sport.hockey
2 0 0 0 0 4 3 0 330 4 4 0 5 12 0 0 2 0 12 7 0 | 385 i = comp.sys.mac.hardware
0 3 0 0 0 0 1 0 0 368 0 0 10 4 1 3 2 0 2 0 0 | 394 j = sci.space
0 0 0 0 0 3 1 0 27 2 291 0 11 25 0 0 1 0 13 18 0 | 392 k = comp.sys.ibm.pc.hardware
8 0 1 109 0 6 11 4 1 18 0 98 1 3 11 10 27 1 1 0 0 | 310 l = talk.politics.misc
0 11 0 0 0 3 6 0 10 6 11 0 299 13 0 2 13 0 7 8 0 | 389 m = comp.graphics
6 0 1 0 0 4 2 0 5 2 12 0 8 321 0 4 14 0 8 6 0 | 393 n = sci.electronics
2 0 0 0 0 0 4 1 0 3 1 0 3 1 372 6 0 2 1 2 0 | 398 o = soc.religion.christian
4 0 0 1 0 2 3 3 0 4 2 0 7 12 6 342 1 0 9 0 0 | 396 p = sci.med
0 1 0 1 0 1 4 0 3 0 1 0 8 4 0 2 369 0 1 1 0 | 396 q = sci.crypt
10 0 4 10 1 5 6 2 2 6 2 0 2 1 86 15 14 152 0 1 0 | 319 r = alt.atheism
4 0 0 0 0 9 1 1 8 1 12 0 3 6 0 2 0 0 341 2 0 | 390 s = misc.forsale
8 5 0 0 0 1 6 0 8 5 50 0 40 2 1 0 9 0 3 256 0 | 394 t = comp.os.ms-windows.misc
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 u = unknown
Step 2.2: Clustering of Synthetic Control Data
You can download the test data set from this link: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data. You will get a data file named synthetic_control.data.
Because the Mahout synthetic control jobs hard-code the input path to testdata in HDFS, you need to create that folder in HDFS first and then put the downloaded data into it.
You can use this command to create the testdata folder in HDFS: $HADOOP_HOME/bin/hadoop fs -mkdir testdata.
And you can use this command to put synthetic_control.data into HDFS: $HADOOP_HOME/bin/hadoop fs -put <PATH TO synthetic_control.data> testdata.
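To confirm that the file landed where the jobs expect it (relative to your HDFS home directory), you can list the folder:
$HADOOP_HOME/bin/hadoop fs -ls testdata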
You are now ready to run a test example; pick one of the following five and run it.
For canopy :
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
For kmeans :
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
For fuzzykmeans :
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
For dirichlet :
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job
For meanshift :
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job
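Each of these jobs reads testdata and writes its results to an output directory in your HDFS home. A minimal sketch for inspecting a kmeans run; the clusterdump options shown (-i, -p, -o) are assumptions for Mahout 0.8, so check $MAHOUT_HOME/bin/mahout clusterdump --help on your installation:
$HADOOP_HOME/bin/hadoop fs -ls output
$MAHOUT_HOME/bin/mahout clusterdump -i output/<FINAL CLUSTERS DIR> -p output/clusteredPoints -o clusters.txt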
Step 2.3: Parallel Frequent Pattern Mining
Mahout has a Top-K Parallel FPGrowth implementation. It is based on this paper (http://infolab.stanford.edu/~echang/recsys08-69.pdf), with some optimizations in mining the data.
You can download a test data set from this website: http://fimi.ua.ac.be/data/. We will use the retail data for testing.
Download the retail data from this link: http://fimi.ua.ac.be/data/retail.dat.
Then create a directory in HDFS and put the data into it with the following two commands:
$HADOOP_HOME/bin/hadoop fs -mkdir mahout-test/fpg
$HADOOP_HOME/bin/hadoop fs -put <PATH TO retail.dat> mahout-test/fpg
Then you can write a small shell script to set the parameters and run the test. Here is my command script, named frequent-pattern-mining-retail-mapreduce.sh:
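#!/bin/bash
# Flag meanings for Mahout 0.8 fpg, as I understand them (check mahout fpg --help):
#   -i/-o are the input/output paths in HDFS, -k keeps the top 50 patterns per item,
#   -method mapreduce runs the job on Hadoop, -regex splits each transaction line on spaces,
#   and -s 2 is the minimum support.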
nohup mahout fpg \
-i mahout-test/fpg/retail.dat \
-o mahout-test/fpg/retail-patterns \
-k 50 \
-method mapreduce \
-regex '[\ ]' \
-s 2 > mahout-retail-mapreduce-output 2>&1 &
Then you can simply run ./frequent-pattern-mining-retail-mapreduce.sh to run the test in the background (make the script executable first with chmod +x frequent-pattern-mining-retail-mapreduce.sh). All of the driver output will be stored in mahout-retail-mapreduce-output.
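The mined patterns themselves are written as Hadoop sequence files under the -o directory. One way to peek at them is Mahout's seqdumper utility; the frequentpatterns subdirectory name and the -i option are what Mahout 0.8 produces and accepts as far as I know, so adjust to whatever hadoop fs -ls shows:
$HADOOP_HOME/bin/hadoop fs -ls mahout-test/fpg/retail-patterns
$MAHOUT_HOME/bin/mahout seqdumper -i mahout-test/fpg/retail-patterns/frequentpatterns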