- Download VirtualBox from: https://www.virtualbox.org/wiki/Downloads
- Download Ubuntu 16.04.3 (desktop version, amd64) from: https://www.ubuntu.com/download/desktop OR download directly from: http://mirror.pnl.gov/releases/xenial/ubuntu-16.04.3-desktop-amd64.iso
- After installing Ubuntu, log in to the VM and follow the instructions given in https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html . Below I give step-by-step details for each installation step.
- First we will update the system's local package repository and then install Java (the default JDK). Run the commands below in a terminal.
sudo apt-get update
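The default JDK mentioned above can then be installed as follows (the default-jdk package is an assumption; on Ubuntu 16.04 it maps to OpenJDK 8):
sudo apt-get install default-jdk -y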
- Now we will install the ssh and rsync packages by running the following commands.
sudo apt-get install ssh -y
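rsync can be installed the same way:
sudo apt-get install rsync -y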
- Now download Hadoop 2.7.4 from http://www.apache.org/dyn/closer.cgi/hadoop/common/
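As a sketch, the release tarball can also be fetched from the command line; the Apache archive URL below is an assumption, and any mirror from the link above works as well.
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.4/hadoop-2.7.4.tar.gz -P ~/Downloads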
- Change directory to Downloads or wherever you have downloaded the Hadoop tar file. In my case it is in Downloads, and all further instructions assume the Hadoop tar file is in ~/Downloads.
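A minimal sketch of extracting the archive and moving into the Hadoop directory (assuming the file is named hadoop-2.7.4.tar.gz); the remaining bin/ and sbin/ commands in this guide are run from inside this directory:
cd ~/Downloads
tar -xzf hadoop-2.7.4.tar.gz   # unpack the release
cd hadoop-2.7.4                # further commands are run from here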
- Update the JAVA_HOME variable in the etc/hadoop/hadoop-env.sh file (using gedit or any other editor) to the line shown below.
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
- Now you should be able to run Hadoop; check it by running the command below.
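Run from the Hadoop directory; as in the official single-node guide, invoking the launcher with no arguments prints its usage documentation:
bin/hadoop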
- Now we will update some configuration files for pseudo-distributed operation. First we will edit the etc/hadoop/core-site.xml file as below.
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
- Similarly, we will update the etc/hadoop/hdfs-site.xml file as below.
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
- Now we will set up passwordless ssh for Hadoop. First check whether passwordless ssh authentication is already set up; on a fresh Ubuntu installation it most likely is not. If it is not set up, follow the next step, otherwise skip it.
ssh localhost
- Run the commands below:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
- Now we will start the NameNode and DataNode, but before that we will format the HDFS file system.
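The commands for this, taken from the official single-node guide, are run from the Hadoop directory:
bin/hdfs namenode -format   # format the HDFS file system (only needed the first time)
sbin/start-dfs.sh           # start the NameNode and DataNode daemons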
- Now we can access the web interface for the NameNode at http://localhost:50070/
- Let's download one HTML page, http://hadoop.apache.org, and upload it into the HDFS file system.
wget http://hadoop.apache.org -O hadoop_home_page.html
Please note that the HDFS file system is not the same as the root file system.
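A sketch of creating a user directory in HDFS and uploading the downloaded page (the /user/$USER layout follows the single-node guide; adjust it if you prefer another path):
bin/hdfs dfs -mkdir -p /user/$USER        # create your home directory in HDFS
bin/hdfs dfs -put hadoop_home_page.html   # upload the page into /user/$USER
bin/hdfs dfs -ls                          # confirm the file is in HDFS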
- For this example we are using the hadoop-mapreduce-examples-2.7.4.jar file which comes along with Hadoop. In this example we count the total number of occurrences of the word 'https' in the given files. First we run the Hadoop job, then we copy the results from HDFS to the local file system. We can see that there are 2 occurrences of 'https' in the given file, which we can validate against the file downloaded with wget. A sketch of these commands follows below.
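The sketch below assumes the page was uploaded to HDFS as shown earlier; the 'grep' example program and the output directory name follow the single-node guide, and the final local grep check is an assumption about how the count was validated:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar grep hadoop_home_page.html output 'https'
bin/hdfs dfs -get output output                 # copy the results from HDFS to the local file system
cat output/*                                    # shows the match count for 'https'
grep -o 'https' hadoop_home_page.html | wc -l   # validate against the locally downloaded file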