
#Configuration

##Setup Hadoop

The LDBC data generator uses Apache Hadoop version 1.2.1 and we do not guarantee compatibility with newer releases. You can download Hadoop 1.2.1 from the Apache Hadoop releases archive.

To install Hadoop, untar hadoop-1.2.1.tar.gz into your /home/user folder (we will use /home/user throughout this example, but you can choose the folder that best fits your needs):

$ cd /home/user
$ tar xvfz hadoop-1.2.1.tar.gz

This will create a directory named hadoop-1.2.1 in your home folder. Hadoop can be configured to run in three different modes: Standalone, Pseudo-Distributed and Distributed modes.
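As a quick sanity check (assuming the /home/user location used in this example), you can print the version of the unpacked distribution:

$ cd /home/user/hadoop-1.2.1
$ bin/hadoop version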

###Standalone###

Standalone mode is typically used for debugging purposes. It is a single-threaded mode that runs on a single machine and uses the local filesystem instead of HDFS. By default, Hadoop is configured to run in Standalone mode, and we suggest using this mode the first time you run the data generator, to check that everything works. Standalone mode is suitable for generating datasets of up to 100K Persons and 3 years of activity. To generate larger datasets, configure Hadoop to run in Pseudo-Distributed or Distributed mode.
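If you want to verify the Standalone installation independently of the data generator, a minimal smoke test (assuming the folder layout above and the examples jar that ships with the Hadoop 1.2.1 distribution) is to run one of the bundled example jobs:

$ cd /home/user/hadoop-1.2.1
#estimates pi with 2 map tasks and 10 samples per map; runs locally in Standalone mode
$ bin/hadoop jar hadoop-examples-1.2.1.jar pi 2 10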

###Pseudo-Distributed###

If your machine is a single node with multiple cores/CPUs and you want to take advantage of them, we suggest configuring Hadoop in Pseudo-Distributed mode. This increases the speed of data generation and makes larger datasets feasible. Hadoop is configured by means of the files found in the /home/user/hadoop-1.2.1/conf folder. Only three files have to be edited: core-site.xml, hdfs-site.xml and mapred-site.xml.

core-site.xml

This file contains the core configuration of Hadoop. Your core-site.xml should look like this:

<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>

mapred-site.xml

By means of this file we configure the MapReduce process. It should look like this:

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
    <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx16000m</value>
    </property>
    <property>
        <name>mapred.task.timeout</name>
        <value>1800000</value>
    </property>
    <property>
        <name>mapred.child.ulimit</name>
        <value>100000000</value>
    </property>
    <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>16</value>
    </property>
    <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>16</value>
    </property>
</configuration>

The mapred.child.java.opts property sets the Java options of the map and reduce child processes. In this case, we are increasing the size of the Java heap, because Hadoop and the data generator are memory intensive, especially for large dataset generation. Set the value according to the specifications of your machines, and increase it if you suffer from out-of-memory problems. The mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties are the maximum number of map and reduce tasks, respectively, that Hadoop can run concurrently. By default, Hadoop sets these values to one. If you want to take advantage of the multiple cores/processors of your machine, increase these values.
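As a rough guide for choosing those two maximums (just a sketch; the right values also depend on how much memory each task needs), you can check how many cores the machine has:

#number of logical cores available on this machine
$ nproc
#alternative on Linux systems without nproc
$ grep -c ^processor /proc/cpuinfo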

hdfs-site.xml

This file is used to configure the Hadoop Distributed File System (HDFS).

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/tmp/hdfs/data</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/tmp/hdfs/name</value>
    </property>
</configuration>

The dfs.data.dir property specifies the directory where the data of the Hadoop file system is stored. This IS NOT the folder where you will find your output data when the generator runs, but the folder where HDFS stores its files in its internal format. We suggest pointing this property to a device with a large amount of free space (around 500GB if you want to generate a 1M dataset). The dfs.name.dir property specifies the directory where the Hadoop file system information and metadata are stored.

Finally, it is necessary to have ssh access to localhost in order to run Hadoop in Pseudo-Distributed mode. Furthermore, it is preferable to be able to ssh to localhost without a passphrase. To check this, type the following:

$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa 
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

If you are still not able to access localhost without a passphrase, do not worry, but you will have to type the passphrase three times every time you start or stop Hadoop :).
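The files above only declare where HDFS lives; before the first run in Pseudo-Distributed mode you still have to format the namenode and start the Hadoop daemons (a sketch assuming the /home/user/hadoop-1.2.1 installation used in this example):

$ cd /home/user/hadoop-1.2.1
#initialize the HDFS metadata under dfs.name.dir (only needed once)
$ bin/hadoop namenode -format
#start the namenode, datanode, jobtracker and tasktracker daemons
$ bin/start-all.sh
#...run the data generator...
#stop all the daemons when you are done
$ bin/stop-all.sh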

###Distributed###

For information on setting up fully-distributed, non-trivial clusters see [Cluster Setup](http://hadoop.apache.org/docs/r1.2.1/cluster_setup.html).

For more information about how to configure your Hadoop machine or cluster, please visit the official Apache Hadoop documentation.

##Configuring the run.sh script

We provide a run.sh script to ease the execution of the data generator with Hadoop. The script looks like this:

#!/bin/bash
HADOOP_HOME=/home/user/hadoop-1.2.1 #change to your hadoop folder
LDBC_SNB_DATAGEN_HOME=/home/user/ldbc_socialnet_bm/ldbc_socialnet_dbgen #change to your ldbc_socialnet_dbgen folder

export HADOOP_HOME
export LDBC_SNB_DATAGEN_HOME

#build the data generator jar
mvn clean
mvn assembly:assembly

cp $LDBC_SNB_DATAGEN_HOME/target/ldbc_socialnet_dbgen.jar $LDBC_SNB_DATAGEN_HOME/
rm $LDBC_SNB_DATAGEN_HOME/target/ldbc_socialnet_dbgen.jar

#run the data generator with the parameters specified in params.ini
$HADOOP_HOME/bin/hadoop jar $LDBC_SNB_DATAGEN_HOME/ldbc_socialnet_dbgen.jar $LDBC_SNB_DATAGEN_HOME/params.ini

#parameter generation
PARAM_GENERATION=1
if [ $PARAM_GENERATION -eq 1 ]
then
    mkdir -p substitution_parameters
    python paramgenerator/generateparams.py m0factors.txt m0friendList0.csv substitution_parameters/
    rm -f m0factors.txt
    rm -f m0friendList*
fi

The following variables are used to configure the script:

  • HADOOP_HOME: points to where Hadoop was installed. Following our example, this folder is /home/user/hadoop-1.2.1.
  • LDBC_SNB_DATAGEN_HOME: points to the LDBC data generator folder.
  • PARAM_GENERATION: indicates whether the substitution parameters for the SNB queries are generated. You should only enable it with a standard scaleFactor (e.g., SF 1). Always disable PARAM_GENERATION (set it to 0) when using the data generator with non-standard input parameters (e.g., when you set numYears instead of using scaleFactor).
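Once the variables are set, a typical invocation (assuming run.sh sits in the data generator folder and params.ini is already configured) is simply:

$ cd /home/user/ldbc_socialnet_bm/ldbc_socialnet_dbgen
$ chmod +x run.sh
$ ./run.sh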

Finally, open /home/user/hadoop-1.2.1/conf/hadoop-env.sh and set JAVA_HOME to point to your JDK folder.
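For example, the relevant line in hadoop-env.sh could look like the following (the JDK path is only an illustration; use the path of your own installation):

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64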
