diff --git a/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/index.md b/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/index.md index ccf3d06d522..da80810cc9e 100644 --- a/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/index.md +++ b/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/index.md @@ -24,21 +24,23 @@ external_resources: ## What is Hadoop? -Hadoop is an open-source Apache project that allows creation of parallel processing applications on large data sets, distributed across networked nodes. It's composed of the **Hadoop Distributed File System (HDFS™)** that handles scalability and redundancy of data across nodes, and **Hadoop YARN**: a framework for job scheduling that executes data processing tasks on all nodes. +Hadoop is an open-source Apache project that allows creation of parallel processing applications on large data sets, distributed across networked nodes. It is composed of the **Hadoop Distributed File System (HDFS™)** that handles scalability and redundancy of data across nodes, and **Hadoop YARN**, a framework for job scheduling that executes data processing tasks on all nodes. ![How to Install and Set Up a 3-Node Hadoop Cluster](hadoop-1-logo.png "How to Install and Set Up a 3-Node Hadoop Cluster") ## Before You Begin -1. Follow the [Getting Started](/docs/getting-started/) guide to create three (3) Linodes. They'll be referred to throughout this guide as **node-master**, **node1** and **node2**. It's recommended that you set the hostname of each Linode to match this naming convention. +1. Follow the [Getting Started](/docs/getting-started/) guide to create three (3) Linodes. They'll be referred to throughout this guide as **node-master**, **node1**, and **node2**. It is recommended that you set the hostname of each Linode to match this naming convention. Run the steps in this guide from the **node-master** unless otherwise specified. -2. Follow the [Securing Your Server](/docs/security/securing-your-server/) guide to harden the three servers. Create a normal user for the install, and a user called `hadoop` for any Hadoop daemons. Do **not** create SSH keys for `hadoop` users. SSH keys will be addressed in a later section. +1. [Add a Private IP Address](/docs/platform/manager/remote-access/#adding-private-ip-addresses) to each Linode so that your Cluster can communicate with an additional layer of security. -3. Install the JDK using the appropriate guide for your distribution, [Debian](/docs/development/java/install-java-on-debian/), [CentOS](/docs/development/java/install-java-on-centos/) or [Ubuntu](/docs/development/java/install-java-on-ubuntu-16-04/), or grab the latest JDK from Oracle. +1. Follow the [Securing Your Server](/docs/security/securing-your-server/) guide to harden each of the three servers. Create a normal user for the Hadoop installation, and a user called `hadoop` for the Hadoop daemons. Do **not** create SSH keys for `hadoop` users. SSH keys will be addressed in a later section. -4. The steps below use example IPs for each node. Adjust each example according to your configuration: +1. Install the JDK using the appropriate guide for your distribution, [Debian](/docs/development/java/install-java-on-debian/), [CentOS](/docs/development/java/install-java-on-centos/) or [Ubuntu](/docs/development/java/install-java-on-ubuntu-16-04/), or install the latest JDK from Oracle. + +1. The steps below use example IPs for each node. 
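+    If you are not sure which address is a node's private IP, you can look it up in the Linode Manager, or list the addresses assigned to the node itself; the private address is typically the `192.168.x.x` entry:
+
+        ip -4 addr show
+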
Adjust each example according to your configuration: - **node-master**: 192.0.2.1 - **node1**: 192.0.2.2 @@ -50,23 +52,23 @@ This guide is written for a non-root user. Commands that require elevated privil ## Architecture of a Hadoop Cluster -Before configuring the master and slave nodes, it's important to understand the different components of a Hadoop cluster. +Before configuring the master and worker nodes, it's important to understand the different components of a Hadoop cluster. -A **master node** keeps knowledge about the distributed file system, like the `inode` table on an `ext3` filesystem, and schedules resources allocation. **node-master** will handle this role in this guide, and host two daemons: +A **master node** maintains knowledge about the distributed file system, like the `inode` table on an `ext3` filesystem, and schedules resources allocation. **node-master** will handle this role in this guide, and host two daemons: -* The **NameNode**: manages the distributed file system and knows where stored data blocks inside the cluster are. -* The **ResourceManager**: manages the YARN jobs and takes care of scheduling and executing processes on slave nodes. +* The **NameNode** manages the distributed file system and knows where stored data blocks inside the cluster are. +* The **ResourceManager** manages the YARN jobs and takes care of scheduling and executing processes on worker nodes. -**Slave nodes** store the actual data and provide processing power to run the jobs. They'll be **node1** and **node2**, and will host two daemons: +**Worker nodes** store the actual data and provide processing power to run the jobs. They'll be **node1** and **node2**, and will host two daemons: -* The **DataNode** manages the actual data physically stored on the node; it's named, `NameNode`. +* The **DataNode** manages the physical data stored on the node; it's named, `NameNode`. * The **NodeManager** manages execution of tasks on the node. ## Configure the System ### Create Host File on Each Node -For each node to communicate with its names, edit the `/etc/hosts` file to add the IP address of the three servers. Don't forget to replace the sample IP with your IP: +For each node to communicate with each other by name, edit the `/etc/hosts` file to add the private IP addresses of the three servers. Don't forget to replace the sample IP with your IP: {{< file "/etc/hosts" >}} 192.0.2.1 node-master @@ -77,26 +79,32 @@ For each node to communicate with its names, edit the `/etc/hosts` file to add t ### Distribute Authentication Key-pairs for the Hadoop User -The master node will use an ssh-connection to connect to other nodes with key-pair authentication, to manage the cluster. +The master node will use an SSH connection to connect to other nodes with key-pair authentication. This will allow the master node to actively manage the cluster. -1. Login to **node-master** as the `hadoop` user, and generate an ssh-key: +1. Login to **node-master** as the `hadoop` user, and generate an SSH key: ssh-keygen -b 4096 -2. Copy the key to the other nodes. It's good practice to also copy the key to the **node-master** itself, so that you can also use it as a DataNode if needed. Type the following commands, and enter the `hadoop` user's password when asked. If you are prompted whether or not to add the key to known hosts, enter `yes`: + When generating this key, leave the password field blank so your Hadoop user can communicate unprompted. + +1. 
View the **node-master** public key and copy it to your clipboard to use with each of your worker nodes. + + less /home/hadoop/.ssh/id_rsa.pub + +1. In each Linode, make a new file `master.pub` in the `/home/hadoop/.ssh` directory. Paste your public key into this file and save your changes. + +1. Copy your key file into the authorized key store. - ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@node-master - ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@node1 - ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@node2 + cat ~/.ssh/master.pub >> ~/.ssh/authorized_keys ### Download and Unpack Hadoop Binaries -Login to **node-master** as the `hadoop` user, download the Hadoop tarball from [Hadoop project page](https://hadoop.apache.org/), and unzip it: +Log into **node-master** as the `hadoop` user, download the Hadoop tarball from [Hadoop project page](https://hadoop.apache.org/), and unzip it: cd - wget http://apache.mindstudios.com/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz - tar -xzf hadoop-2.8.1.tar.gz - mv hadoop-2.8.1 hadoop + wget http://apache.cs.utah.edu/hadoop/common/current/hadoop-3.1.2.tar.gz + tar -xzf hadoop-3.1.2.tar.gz + mv hadoop-3.1.2 hadoop ### Set Environment Variables @@ -104,21 +112,26 @@ Login to **node-master** as the `hadoop` user, download the Hadoop tarball from {{< file "/home/hadoop/.profile" shell >}} PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH - {{< /file >}} +1. Add Hadoop to your PATH for the shell. Edit `.bashrc` and add the following lines: + + {{< file "/home/hadoop/.bashrc" shell >}} +export HADOOP_HOME=/home/hadoop/hadoop +export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin +{{< /file >}} ## Configure the Master Node -Configuration will be done on **node-master** and replicated to other nodes. +Configuration will be performed on **node-master** and replicated to other nodes. ### Set JAVA_HOME -1. Get your Java installation path. If you installed open-jdk from your package manager, you can get the path with the command: +1. Find your Java installation path. This is known as `JAVA_HOME`. If you installed open-jdk from your package manager, you can find the path with the command: update-alternatives --display java - Take the value of the current link and remove the trailing `/bin/java`. For example on Debian, the link is `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java`, so `JAVA_HOME` should be `/usr/lib/jvm/java-8-openjdk-amd64/jre`. + Take the value of the *current link* and remove the trailing `/bin/java`. For example on Debian, the link is `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java`, so `JAVA_HOME` should be `/usr/lib/jvm/java-8-openjdk-amd64/jre`. If you installed java from Oracle, `JAVA_HOME` is the path where you unzipped the java archive. @@ -126,7 +139,7 @@ Configuration will be done on **node-master** and replicated to other nodes. export JAVA_HOME=${JAVA_HOME} - with your actual java installation path. For example on a Debian with open-jdk-8: + with your actual java installation path. 
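+    If you would rather derive this path from the `java` binary on your `PATH`, a quick sketch (assuming GNU coreutils, so that `readlink -f` follows the alternatives symlinks) is:
+
+        readlink -f /usr/bin/java | sed "s:/bin/java::"
+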
On a Debian 9 Linode with open-jdk-8 this will be as follows: {{< file "~/hadoop/etc/hadoop/hadoop-env.sh" shell >}} export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre @@ -137,7 +150,7 @@ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre ### Set NameNode Location -On each node update `~/hadoop/etc/hadoop/core-site.xml` you want to set the NameNode location to **node-master** on port `9000`: +Update your `~/hadoop/etc/hadoop/core-site.xml` file to set the NameNode location to **node-master** on port `9000`: {{< file "~/hadoop/etc/hadoop/core-site.xml" xml >}} @@ -154,7 +167,7 @@ On each node update `~/hadoop/etc/hadoop/core-site.xml` you want to set the Name ### Set path for HDFS -Edit `hdfs-site.conf`: +Edit `hdfs-site.conf` to resemble the following configuration: {{< file "~/hadoop/etc/hadoop/hdfs-site.xml" xml >}} @@ -177,23 +190,30 @@ Edit `hdfs-site.conf`: {{< /file >}} -The last property, `dfs.replication`, indicates how many times data is replicated in the cluster. You can set `2` to have all the data duplicated on the two nodes. Don't enter a value higher than the actual number of slave nodes. +The last property, `dfs.replication`, indicates how many times data is replicated in the cluster. You can set `2` to have all the data duplicated on the two nodes. Don't enter a value higher than the actual number of worker nodes. ### Set YARN as Job Scheduler -1. In `~/hadoop/etc/hadoop/`, rename `mapred-site.xml.template` to `mapred-site.xml`: - - cd ~/hadoop/etc/hadoop - mv mapred-site.xml.template mapred-site.xml +Edit the `mapred-site.xml` file, setting YARN as the default framework for MapReduce operations: -2. Edit the file, setting yarn as the default framework for MapReduce operations: - - {{< file "~/hadoop/etc/hadoop/mapred-site.xml" xml >}} +{{< file "~/hadoop/etc/hadoop/mapred-site.xml" xml >}} mapreduce.framework.name yarn + + yarn.app.mapreduce.am.env + HADOOP_MAPRED_HOME=$HADOOP_HOME + + + mapreduce.map.env + HADOOP_MAPRED_HOME=$HADOOP_HOME + + + mapreduce.reduce.env + HADOOP_MAPRED_HOME=$HADOOP_HOME + {{< /file >}} @@ -201,7 +221,7 @@ The last property, `dfs.replication`, indicates how many times data is replicate ### Configure YARN -Edit `yarn-site.xml`: +Edit `yarn-site.xml`, which contains the configuration options for YARN. In the `value` field for the `yarn.resourcemanager.hostname`, replace `203.0.113.0` with the public IP address of **node-master**: {{< file "~/hadoop/etc/hadoop/yarn-site.xml" xml >}} @@ -212,7 +232,7 @@ Edit `yarn-site.xml`: yarn.resourcemanager.hostname - node-master + 203.0.113.0 @@ -224,11 +244,11 @@ Edit `yarn-site.xml`: {{< /file >}} -### Configure Slaves +### Configure Workers -The file `slaves` is used by startup scripts to start required daemons on all nodes. Edit `~/hadoop/etc/hadoop/slaves` to be: +The file `workers` is used by startup scripts to start required daemons on all nodes. Edit `~/hadoop/etc/hadoop/workers` to include both of the nodes: -{{< file "~/hadoop/etc/hadoop/slaves" resource >}} +{{< file "~/hadoop/etc/hadoop/workers" resource >}} node1 node2 @@ -246,7 +266,7 @@ A YARN job is executed with two kind of resources: - An *Application Master* (AM) is responsible for monitoring the application and coordinating distributed executors in the cluster. - Some executors that are created by the AM actually run the job. For a MapReduce jobs, they'll perform map or reduce operation, in parallel. -Both are run in *containers* on slave nodes. 
Each slave node runs a *NodeManager* daemon that's responsible for container creation on the node. The whole cluster is managed by a *ResourceManager* that schedules container allocation on all the slave-nodes, depending on capacity requirements and current charge. +Both are run in *containers* on worker nodes. Each worker node runs a *NodeManager* daemon that's responsible for container creation on the node. The whole cluster is managed by a *ResourceManager* that schedules container allocation on all the worker-nodes, depending on capacity requirements and current charge. Four types of resource allocations need to be configured properly for the cluster to work. These are: @@ -310,7 +330,7 @@ For 2GB nodes, a working configuration may be: {{< /file >}} - The last property disables virtual-memory checking and can prevent containers from being allocated properly on JDK8. + The last property disables virtual-memory checking which can prevent containers from being allocated properly with JDK8 if enabled. 2. Edit `/home/hadoop/hadoop/etc/hadoop/mapred-site.xml` and add the following lines: @@ -337,25 +357,25 @@ For 2GB nodes, a working configuration may be: ## Duplicate Config Files on Each Node -1. Copy the hadoop binaries to slave nodes: +1. Copy the Hadoop binaries to worker nodes: cd /home/hadoop/ scp hadoop-*.tar.gz node1:/home/hadoop scp hadoop-*.tar.gz node2:/home/hadoop -2. Connect to **node1** via ssh. A password isn't required, thanks to the ssh keys copied above: +2. Connect to **node1** via SSH. A password isn't required, thanks to the SSH keys copied above: ssh node1 3. Unzip the binaries, rename the directory, and exit **node1** to get back on the node-master: - tar -xzf hadoop-2.8.1.tar.gz - mv hadoop-2.8.1 hadoop + tar -xzf hadoop-3.1.2.tar.gz + mv hadoop-3.1.2 hadoop exit 4. Repeat steps 2 and 3 for **node2**. -5. Copy the Hadoop configuration files to the slave nodes: +5. Copy the Hadoop configuration files to the worker nodes: for node in node1 node2; do scp ~/hadoop/etc/hadoop/* $node:/home/hadoop/hadoop/etc/hadoop/; @@ -379,20 +399,20 @@ This section will walk through starting HDFS on NameNode and DataNodes, and moni start-dfs.sh - It'll start **NameNode** and **SecondaryNameNode** on node-master, and **DataNode** on **node1** and **node2**, according to the configuration in the `slaves` config file. + This will start **NameNode** and **SecondaryNameNode** on node-master, and **DataNode** on **node1** and **node2**, according to the configuration in the `workers` config file. -2. Check that every process is running with the `jps` command on each node. You should get on **node-master** (PID will be different): +2. Check that every process is running with the `jps` command on each node. On **node-master**, you should see the following (the PID number will be different): 21922 Jps 21603 NameNode 21787 SecondaryNameNode - and on **node1** and **node2**: + And on **node1** and **node2** you should see the following: 19728 DataNode 19819 Jps -3. To stop HDFS on master and slave nodes, run the following command from **node-master**: +3. To stop HDFS on master and worker nodes, run the following command from **node-master**: stop-dfs.sh @@ -406,7 +426,7 @@ This section will walk through starting HDFS on NameNode and DataNodes, and moni hdfs dfsadmin -help -2. You can also automatically use the friendlier web user interface. Point your browser to http://node-master-IP:50070 and you'll get a user-friendly monitoring console. +2. 
You can also automatically use the friendlier web user interface. Point your browser to http://node-master-IP:9870, where node-master-IP is the IP address of your node-master, and you'll get a user-friendly monitoring console. ![Screenshot of HDFS Web UI](hadoop-3-hdfs-webui-wide.png "Screenshot of HDFS Web UI") @@ -426,8 +446,8 @@ Let's use some textbooks from the [Gutenberg project](https://www.gutenberg.org/ cd /home/hadoop wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt - wget -O holmes.txt https://www.gutenberg.org/ebooks/1661.txt.utf-8 - wget -O frankenstein.txt https://www.gutenberg.org/ebooks/84.txt.utf-8 + wget -O holmes.txt https://www.gutenberg.org/files/1661/1661-0.txt + wget -O frankenstein.txt https://www.gutenberg.org/files/84/84-0.txt 3. Put the three books through HDFS, in the `books`directory: @@ -451,7 +471,7 @@ There are many commands to manage your HDFS. For a complete list, you can look a ## Run YARN -HDFS is a distributed storage system, it doesn't provide any services for running and scheduling tasks in the cluster. This is the role of the YARN framework. The following section is about starting, monitoring, and submitting jobs to YARN. +HDFS is a distributed storage system, and doesn't provide any services for running and scheduling tasks in the cluster. This is the role of the YARN framework. The following section is about starting, monitoring, and submitting jobs to YARN. ### Start and Stop YARN @@ -477,29 +497,29 @@ HDFS is a distributed storage system, it doesn't provide any services for runnin To get all available parameters of the `yarn` command, see [Apache YARN documentation](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnCommands.html). -2. As with HDFS, YARN provides a friendlier web UI, started by default on port `8088` of the Resource Manager. Point your browser to http://node-master-IP:8088 and browse the UI: +2. As with HDFS, YARN provides a friendlier web UI, started by default on port `8088` of the Resource Manager. Point your browser to http://node-master-IP:8088, where node-master-IP is the IP address of your node-master, and browse the UI: ![Screenshot of YARN Web UI](hadoop-4-yarn-webui-wide.png "Screenshot of YARN Web UI") ### Submit MapReduce Jobs to YARN -Yarn jobs are packaged into `jar` files and submitted to YARN for execution with the command `yarn jar`. The Hadoop installation package provides sample applications that can be run to test your cluster. You'll use them to run a word count on the three books previously uploaded to HDFS. +YARN jobs are packaged into `jar` files and submitted to YARN for execution with the command `yarn jar`. The Hadoop installation package provides sample applications that can be run to test your cluster. You'll use them to run a word count on the three books previously uploaded to HDFS. -1. Submit a job with the sample jar to YARN. On **node-master**, run: +1. Submit a job with the sample `jar` to YARN. On **node-master**, run: - yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount "books/*" output + yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar wordcount "books/*" output The last argument is where the output of the job will be saved - in HDFS. 2. After the job is finished, you can get the result by querying HDFS with `hdfs dfs -ls output`. 
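+    If you want to confirm from the command line that the job has reached a final state before querying HDFS, a minimal check using the standard `yarn application` subcommand is:
+
+        yarn application -list -appStates FINISHED
+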
If the job completed successfully, the output of `hdfs dfs -ls output` will resemble:

        Found 2 items
-        -rw-r--r--   1 hadoop supergroup          0 2017-10-11 14:09 output/_SUCCESS
-        -rw-r--r--   1 hadoop supergroup     269158 2017-10-11 14:09 output/part-r-00000
+        -rw-r--r--   2 hadoop supergroup          0 2019-05-31 17:21 output/_SUCCESS
+        -rw-r--r--   2 hadoop supergroup     789726 2019-05-31 17:21 output/part-r-00000

3. Print the result with:

-        hdfs dfs -cat output/part-r-00000
+        hdfs dfs -cat output/part-r-00000 | less

 ## Next Steps
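+If you plan to experiment further with the sample jobs, note that MapReduce will not overwrite an existing output directory; remove the previous results before re-running the word count, for example:
+
+    hdfs dfs -rm -r output
+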