[Update] How to Install and Set Up a 3-Node Hadoop Cluster #2514
| ## What is Hadoop? | ||
|
|
||
| Hadoop is an open-source Apache project that allows creation of parallel processing applications on large data sets, distributed across networked nodes. It's composed of the **Hadoop Distributed File System (HDFS™)** that handles scalability and redundancy of data across nodes, and **Hadoop YARN**: a framework for job scheduling that executes data processing tasks on all nodes. | ||
| Hadoop is an open-source Apache project that allows creation of parallel processing applications on large data sets, distributed across networked nodes. It is composed of the **Hadoop Distributed File System (HDFS™)** that handles scalability and redundancy of data across nodes, and **Hadoop YARN**, a framework for job scheduling that executes data processing tasks on all nodes. |
The ":" reads oddly here because it is doing the work of a comma, so I replaced it with one. I also removed the contraction.
| Run the steps in this guide from the **node-master** unless otherwise specified. | ||
|
|
||
| 2. Follow the [Securing Your Server](/docs/security/securing-your-server/) guide to harden the three servers. Create a normal user for the install, and a user called `hadoop` for any Hadoop daemons. Do **not** create SSH keys for `hadoop` users. SSH keys will be addressed in a later section. | ||
| 1. [Add a Private IP Address](/docs/platform/manager/remote-access/#adding-private-ip-addresses) to each Linode so that your Cluster can communicate with an additional layer of security. |
Hadoop works fine over private IP addresses, and using them reduces the potential attack surface.
|
|
||
| 4. The steps below use example IPs for each node. Adjust each example according to your configuration: | ||
| 1. Install the JDK using the appropriate guide for your distribution, [Debian](/docs/development/java/install-java-on-debian/), [CentOS](/docs/development/java/install-java-on-centos/) or [Ubuntu](/docs/development/java/install-java-on-ubuntu-16-04/), or grab the latest JDK from Oracle. | ||
|
|
Moved the step for securing the server so that it comes after setting up the private IP. That way users can keep the private IP in mind when configuring their security.
| 4. The steps below use example IPs for each node. Adjust each example according to your configuration: | ||
| 1. Install the JDK using the appropriate guide for your distribution, [Debian](/docs/development/java/install-java-on-debian/), [CentOS](/docs/development/java/install-java-on-centos/) or [Ubuntu](/docs/development/java/install-java-on-ubuntu-16-04/), or grab the latest JDK from Oracle. | ||
|
|
||
| 1. The steps below use example IPs for each node. Adjust each example according to your configuration: |
Since this is more of a disclaimer than a step you need to follow, I moved it to the bottom of the list
| * The **NameNode**: manages the distributed file system and knows where stored data blocks inside the cluster are. | ||
| * The **ResourceManager**: manages the YARN jobs and takes care of scheduling and executing processes on worker nodes. | ||
| * The **NameNode** manages the distributed file system and knows where stored data blocks inside the cluster are. | ||
| * The **ResourceManager** manages the YARN jobs and takes care of scheduling and executing processes on worker nodes. |
Each of these is a full sentence that defines something, not a definition-list entry, so the colon is not needed.
| ### Distribute Authentication Key-pairs for the Hadoop User | ||
|
|
||
| The master node will use an ssh-connection to connect to other nodes with key-pair authentication, to manage the cluster. | ||
| The master node will use an ssh connection to connect to other nodes with key-pair authentication. This will allow the master node to actively manage the cluster. |
Split a run-on sentence.
| ssh-keygen -b 4096 | ||
|
|
||
| 1. View the **node-master** public key so you can copy it to each of the worker nodes. | ||
| When generating this key, leave the password field blank so your hadoop user can communicate unprompted. |
The first time I set this up, I entered a password for the key pair, which prevented me from following the guide any further.
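For anyone scripting this step, the empty password can be passed on the command line instead of being left blank at the prompt. A minimal sketch, assuming an RSA key (to match the `id_rsa.pub` filename used in the guide). The `keydir` scratch directory is hypothetical and is used here only so an existing key is not overwritten; the guide itself keeps the key in `/home/hadoop/.ssh`.

```shell
#!/bin/sh
# Sketch: generate a 4096-bit RSA key pair with an empty passphrase (-N "")
# so the hadoop user can connect to the workers without being prompted.
# keydir is a hypothetical scratch location; the guide uses /home/hadoop/.ssh.
keydir=$(mktemp -d)

if command -v ssh-keygen >/dev/null 2>&1; then
    # -q: quiet, -t rsa: key type, -b 4096: key size in bits,
    # -N "": empty passphrase, -f: path of the private key file
    ssh-keygen -q -t rsa -b 4096 -N "" -f "$keydir/id_rsa"
    echo "public key written to $keydir/id_rsa.pub"
else
    echo "ssh-keygen not found; install the OpenSSH client first" >&2
fi
```

Because the passphrase is supplied with `-N ""`, `ssh-keygen` does not stop to ask for one, which avoids the problem described in the comment above.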
| 1. View the **node-master** public key so you can copy it to each of the worker nodes. | ||
| When generating this key, leave the password field blank so your hadoop user can communicate unprompted. | ||
|
|
||
| 1. View the **node-master** public key and copy it to your clipboard to use with each of your worker nodes. |
I felt this clarification was worthwhile for people who may not fully understand how `less` works.
| less /home/hadoop/.ssh/id_rsa.pub | ||
|
|
||
| 1. In each node, make a new file `master.pub` in `/home/hadoop/.ssh`, paste in, and save this key. | ||
| 1. In each Linode, make a new file `master.pub` in the `/home/hadoop/.ssh` directory. Paste your public key into this file and save your changes. |
Rewrote the sentence with some additional clarifications and better flow
| update-alternatives --display java | ||
|
|
||
| Take the value of the current link and remove the trailing `/bin/java`. For example on Debian, the link is `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java`, so `JAVA_HOME` should be `/usr/lib/jvm/java-8-openjdk-amd64/jre`. | ||
| Take the value of the *current link* and remove the trailing `/bin/java`. For example on Debian, the link is `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java`, so `JAVA_HOME` should be `/usr/lib/jvm/java-8-openjdk-amd64/jre`. |
Italicized current link to differentiate it conceptually
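The "remove the trailing `/bin/java`" step can also be expressed with shell parameter expansion. A small sketch using the Debian path quoted in the guide; the function name `java_home_from_link` is made up for illustration.

```shell
#!/bin/sh
# Sketch: derive JAVA_HOME by stripping the trailing /bin/java from the
# "current link" path reported by `update-alternatives --display java`.
java_home_from_link() {
    # ${1%/bin/java} removes the suffix "/bin/java" if it is present
    printf '%s\n' "${1%/bin/java}"
}

# Example path taken from the guide (Debian, OpenJDK 8)
link=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
JAVA_HOME=$(java_home_from_link "$link")
echo "$JAVA_HOME"
# prints /usr/lib/jvm/java-8-openjdk-amd64/jre
```

On other distributions the link will differ, so the path should be read from the `update-alternatives` output instead of being copied from this example.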
| export JAVA_HOME=${JAVA_HOME} | ||
|
|
||
| with your actual java installation path. For example on a Debian with open-jdk-8: | ||
| with your actual java installation path. On a Debian 9 Linode with open-jdk-8 this will be as follows: |
Didn't want to use "example" again
| ### Set NameNode Location | ||
|
|
||
| On each node update `~/hadoop/etc/hadoop/core-site.xml` you want to set the NameNode location to **node-master** on port `9000`: | ||
| Update your `~/hadoop/etc/hadoop/core-site.xml` file to set the NameNode location to **node-master** on port `9000`: |
We don't need to do this on each node since it is already performed in a later step.
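For reference while reviewing, here is a sketch of what the resulting file might look like. The property name does not appear in this hunk: `fs.defaultFS` is the current Hadoop name for the default file system URI, while older configurations use the deprecated `fs.default.name`, so treat the name as an assumption. The `conf_dir` scratch directory is also hypothetical; the guide edits `~/hadoop/etc/hadoop/core-site.xml`.

```shell
#!/bin/sh
# Sketch: write a core-site.xml that points the NameNode at node-master:9000.
# conf_dir is a hypothetical scratch directory, not the guide's real path.
conf_dir=$(mktemp -d)

cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <!-- Assumed property name; older guides use fs.default.name -->
        <name>fs.defaultFS</name>
        <value>hdfs://node-master:9000</value>
    </property>
</configuration>
EOF

# Show the value that was written
grep '<value>' "$conf_dir/core-site.xml"
```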
|
|
||
|
|
||
| The last property disables virtual-memory checking and can prevent containers from being allocated properly on JDK8. | ||
| The last property disables virtual-memory checking which can prevent containers from being allocated properly with JDK8 if enabled. |
Felt this was worth clarifying
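The property itself is outside this hunk. As a sketch, and assuming it is the standard YARN setting `yarn.nodemanager.vmem-check-enabled`, the fragment would look roughly like the output below. The helper `hadoop_property` is hypothetical and only formats the XML.

```shell
#!/bin/sh
# Sketch: print a Hadoop <property> element for a given name and value.
hadoop_property() {
    printf '<property>\n    <name>%s</name>\n    <value>%s</value>\n</property>\n' "$1" "$2"
}

# Assumed property name; "false" turns the virtual-memory check off.
vmem_fragment=$(hadoop_property yarn.nodemanager.vmem-check-enabled false)
echo "$vmem_fragment"
```

Setting the value to `false` is what "disables virtual-memory checking" refers to; leaving the check enabled is the case that can interfere with container allocation.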
| wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt | ||
| wget -O holmes.txt https://www.gutenberg.org/ebooks/1661.txt.utf-8 | ||
| wget -O frankenstein.txt https://www.gutenberg.org/ebooks/84.txt.utf-8 | ||
| wget -O holmes.txt https://www.gutenberg.org/files/1661/1661-0.txt |
The `.txt.utf-8` links will not work. I used the plain-text file links instead.
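Both working links in this hunk follow the same `/files/<id>/<id>-0.txt` shape. A sketch that builds such a URL from an ebook ID; the function name `gutenberg_txt_url` is made up, and the pattern is inferred only from the two URLs above, so it is not guaranteed to hold for every book.

```shell
#!/bin/sh
# Sketch: build a Project Gutenberg plain-text URL from an ebook ID,
# following the /files/<id>/<id>-0.txt pattern seen in the diff.
gutenberg_txt_url() {
    printf 'https://www.gutenberg.org/files/%s/%s-0.txt\n' "$1" "$1"
}

gutenberg_txt_url 11      # alice.txt in the guide
gutenberg_txt_url 1661    # holmes.txt in the guide

# To download (requires network access):
#   wget -O holmes.txt "$(gutenberg_txt_url 1661)"
```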
| <property> | ||
| <name>yarn.resourcemanager.hostname</name> | ||
| <value>node-master</value> | ||
| <value>203.0.113.0</value> |
Since we're using a private IP with the hostname, we need to use the public IP here in order for the YARN site to resolve.
|
This needs to be rebased once the quick-disclosure-note and table changes are merged into develop.
* Updated to Hadoop 3.1.2
* tech edit
* Copy Edit
Updated guide for Hadoop 3.1.2
CT-472