
Conversation

@hzoppetti
Contributor

Updated guide for Hadoop 3.1.2
CT-472

hzoppetti and others added 2 commits May 31, 2019 16:25
## What is Hadoop?

Hadoop is an open-source Apache project that allows creation of parallel processing applications on large data sets, distributed across networked nodes. It's composed of the **Hadoop Distributed File System (HDFS™)** that handles scalability and redundancy of data across nodes, and **Hadoop YARN**: a framework for job scheduling that executes data processing tasks on all nodes.
Hadoop is an open-source Apache project that allows creation of parallel processing applications on large data sets, distributed across networked nodes. It is composed of the **Hadoop Distributed File System (HDFS™)** that handles scalability and redundancy of data across nodes, and **Hadoop YARN**, a framework for job scheduling that executes data processing tasks on all nodes.

The use of the colon is strange here; it is being used more like a comma. Also avoiding contractions.
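
As a rough illustration of the two layers described in that paragraph, the HDFS and YARN sides of a running cluster can be inspected separately. A minimal sketch, assuming the cluster is started and Hadoop's `bin` directory is on the hadoop user's PATH:

```sh
# Run from node-master as the hadoop user once the cluster is up.
hdfs dfsadmin -report   # HDFS layer: lists DataNodes and their storage capacity
yarn node -list         # YARN layer: lists NodeManagers available to run jobs
```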

Run the steps in this guide from the **node-master** unless otherwise specified.

2. Follow the [Securing Your Server](/docs/security/securing-your-server/) guide to harden the three servers. Create a normal user for the install, and a user called `hadoop` for any Hadoop daemons. Do **not** create SSH keys for `hadoop` users. SSH keys will be addressed in a later section.
1. [Add a Private IP Address](/docs/platform/manager/remote-access/#adding-private-ip-addresses) to each Linode so that your Cluster can communicate with an additional layer of security.

Hadoop functions fine from private IP addresses, and this decreases the potential attack surface.
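
For readers wiring this up, a sketch of how the private IPs might map to hostnames in `/etc/hosts` on each node. The addresses are placeholders, and the worker names `node1` and `node2` are illustrative rather than taken from the guide:

```sh
# /etc/hosts -- substitute each Linode's actual private IP address.
192.0.2.1    node-master
192.0.2.2    node1
192.0.2.3    node2
```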


4. The steps below use example IPs for each node. Adjust each example according to your configuration:
1. Install the JDK using the appropriate guide for your distribution, [Debian](/docs/development/java/install-java-on-debian/), [CentOS](/docs/development/java/install-java-on-centos/) or [Ubuntu](/docs/development/java/install-java-on-ubuntu-16-04/), or grab the latest JDK from Oracle.


Made it so that securing the server comes after setting up the private IP, so that a user can keep this in mind when configuring their security.
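
A quick sanity check after whichever JDK route is taken; a sketch assuming a Debian or Ubuntu node using the distribution's OpenJDK 8 packages:

```sh
# Install OpenJDK 8 and confirm which installation the system will use.
sudo apt install openjdk-8-jdk
java -version
update-alternatives --display java   # shows the path used later for JAVA_HOME
```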

4. The steps below use example IPs for each node. Adjust each example according to your configuration:
1. Install the JDK using the appropriate guide for your distribution, [Debian](/docs/development/java/install-java-on-debian/), [CentOS](/docs/development/java/install-java-on-centos/) or [Ubuntu](/docs/development/java/install-java-on-ubuntu-16-04/), or grab the latest JDK from Oracle.

1. The steps below use example IPs for each node. Adjust each example according to your configuration:

Since this is more of a disclaimer than a step you need to follow, I moved it to the bottom of the list.

* The **NameNode**: manages the distributed file system and knows where stored data blocks inside the cluster are.
* The **ResourceManager**: manages the YARN jobs and takes care of scheduling and executing processes on worker nodes.
* The **NameNode** manages the distributed file system and knows where stored data blocks inside the cluster are.
* The **ResourceManager** manages the YARN jobs and takes care of scheduling and executing processes on worker nodes.

This is a full sentence that defines something, not a definition itself
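
A quick way to see where those daemons actually run once the cluster is started: `jps` ships with the JDK and lists running Java processes by name.

```sh
# On node-master, expect NameNode and ResourceManager among the output;
# on the workers, DataNode and NodeManager.
jps
```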

### Distribute Authentication Key-pairs for the Hadoop User

The master node will use an ssh-connection to connect to other nodes with key-pair authentication, to manage the cluster.
The master node will use an ssh connection to connect to other nodes with key-pair authentication. This will allow the master node to actively manage the cluster.

run-on sentence
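
Once the key generation and distribution steps below are done, a simple check that the master can manage a worker without prompts; `node1` here is a hypothetical worker hostname:

```sh
# Run from node-master as the hadoop user; it should print the worker's
# hostname without asking for a password.
ssh hadoop@node1 hostname
```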

ssh-keygen -b 4096

1. View the **node-master** public key so you can copy it to each of the worker nodes.
When generating this key, leave the password field blank so your hadoop user can communicate unprompted.

The first time I set this up, I entered a password for the key pair, which prevented me from following the guide further.
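
For anyone scripting this, the same key can be generated without any prompts at all; a sketch using standard `ssh-keygen` flags (`-N ''` sets an empty passphrase, `-f` names the output file):

```sh
# Generates /home/hadoop/.ssh/id_rsa and id_rsa.pub with no passphrase.
ssh-keygen -t rsa -b 4096 -N '' -f /home/hadoop/.ssh/id_rsa
```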

1. View the **node-master** public key so you can copy it to each of the worker nodes.
When generating this key, leave the password field blank so your hadoop user can communicate unprompted.

1. View the **node-master** public key and copy it to your clipboard to use with each of your worker nodes.

Felt this clarification was worth it for people who may not fully understand how `less` works.

less /home/hadoop/.ssh/id_rsa.pub
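
For readers less comfortable with `less`, `cat` prints the key and returns straight to the prompt, which can make copying easier:

```sh
cat /home/hadoop/.ssh/id_rsa.pub
```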

1. In each node, make a new file `master.pub` in `/home/hadoop/.ssh`, paste in, and save this key.
1. In each Linode, make a new file `master.pub` in the `/home/hadoop/.ssh` directory. Paste your public key into this file and save your changes.

Rewrote the sentence with some additional clarifications and better flow
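
To finish the thought for anyone reading along: once `master.pub` exists on a worker, its contents still need to land in the hadoop user's `authorized_keys`. A sketch of that step, assuming the hadoop user's home is `/home/hadoop`:

```sh
# Run on each worker as the hadoop user.
cat /home/hadoop/.ssh/master.pub >> /home/hadoop/.ssh/authorized_keys
chmod 600 /home/hadoop/.ssh/authorized_keys
```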

update-alternatives --display java

Take the value of the current link and remove the trailing `/bin/java`. For example on Debian, the link is `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java`, so `JAVA_HOME` should be `/usr/lib/jvm/java-8-openjdk-amd64/jre`.
Take the value of the *current link* and remove the trailing `/bin/java`. For example on Debian, the link is `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java`, so `JAVA_HOME` should be `/usr/lib/jvm/java-8-openjdk-amd64/jre`.
@sagesyr Jun 26, 2019

Italicized *current link* to differentiate it conceptually.
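
An alternative sketch for readers who want to derive the value automatically rather than trim it by hand; it resolves the `java` symlink chain and strips the trailing `bin/java`:

```sh
# readlink -f follows /usr/bin/java -> .../jre/bin/java; two dirname calls
# then leave the JRE directory, e.g. /usr/lib/jvm/java-8-openjdk-amd64/jre.
export JAVA_HOME=$(dirname "$(dirname "$(readlink -f "$(which java)")")")
echo "$JAVA_HOME"
```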

export JAVA_HOME=${JAVA_HOME}

with your actual java installation path. For example on a Debian with open-jdk-8:
with your actual java installation path. On a Debian 9 Linode with open-jdk-8 this will be as follows:

Didn't want to use "example" again
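
Spelled out, the resulting line (most likely in `~/hadoop/etc/hadoop/hadoop-env.sh`, where Hadoop reads `JAVA_HOME` from) would look like this on that Debian 9 / open-jdk-8 setup:

```sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
```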

### Set NameNode Location

On each node update `~/hadoop/etc/hadoop/core-site.xml` you want to set the NameNode location to **node-master** on port `9000`:
Update your `~/hadoop/etc/hadoop/core-site.xml` file to set the NameNode location to **node-master** on port `9000`:

We don't need to do this on each node since it is already performed in a later step.
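
For context, a sketch of what that property block typically looks like; the exact property name used in the guide may differ, but the canonical key for Hadoop 3.x is `fs.defaultFS`:

```xml
<!-- ~/hadoop/etc/hadoop/core-site.xml -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node-master:9000</value>
    </property>
</configuration>
```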



The last property disables virtual-memory checking and can prevent containers from being allocated properly on JDK8.
The last property disables virtual-memory checking which can prevent containers from being allocated properly with JDK8 if enabled.

Felt this was worth clarifying
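
The property being discussed is YARN's virtual-memory check; a sketch of how disabling it typically appears in `yarn-site.xml` (the guide's surrounding properties are omitted here):

```xml
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
```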

wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
wget -O holmes.txt https://www.gutenberg.org/ebooks/1661.txt.utf-8
wget -O frankenstein.txt https://www.gutenberg.org/ebooks/84.txt.utf-8
wget -O holmes.txt https://www.gutenberg.org/files/1661/1661-0.txt

The utf-8 links will not work. Used the plaintext links instead.
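
Once the plaintext files are downloaded, they still need to go into HDFS before the word-count example can read them; a minimal sketch (the `books` directory name is illustrative):

```sh
# Run as the hadoop user; creates a directory in the user's HDFS home
# and copies the three books into it.
hdfs dfs -mkdir -p books
hdfs dfs -put alice.txt holmes.txt frankenstein.txt books/
```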

<property>
<name>yarn.resourcemanager.hostname</name>
<value>node-master</value>
<value>203.0.113.0</value>

Since we're using a private IP with the hostname, we need to use the public IP in order for the YARN site to resolve.
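
One way to sanity-check that choice after YARN is started; a sketch assuming the ResourceManager web UI is on its default port, 8088, and that 203.0.113.0 is the example public IP of node-master:

```sh
# Should return an HTTP response if the ResourceManager resolves and is reachable.
curl -I http://203.0.113.0:8088
```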

@andystevensname
Collaborator

This needs to be rebased once the quick-disclosure-note and table changes are merged into develop.

@Guaris merged commit 4f21223 into linode:develop Jul 22, 2019
Guaris pushed a commit that referenced this pull request Aug 19, 2019
* Updated to Hadoop 3.1.2

* tech edit

* Copy Edit