<h1 align="center"> Set Up Hadoop Locally on Your Workstation</h1>

---

### 1. Download and Install JDK

``` bash

# Mac

# If you have brew installed

brew tap caskroom/versions
brew cask install java8

# If you DO NOT have brew installed 

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew tap caskroom/versions
brew cask install java8


# ELSE (you may want to speak to your instructor if not using Mac)

# CentOS
# You may need to update the URL per latest versions

wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u144-b01/090f390dda5b47b9b721c7dfaa008135/jdk-8u144-linux-x64.rpm
yum localinstall jdk-8u144-linux-x64.rpm


# Mac alternative without brew: download the Oracle JRE and install it manually
cd ~/Downloads
curl -v -j -k -L -H "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u112-b15/jre-8u112-macosx-x64.dmg > jre-8u112-macosx-x64.dmg
hdiutil attach jre-8u112-macosx-x64.dmg
sudo installer -pkg /Volumes/Java\ 8\ Update\ 112/Java\ 8\ Update\ 112.app/Contents/Resources/JavaAppletPlugin.pkg -target /
diskutil umount /Volumes/Java\ 8\ Update\ 112 
rm jre-8u112-macosx-x64.dmg
```

### Ensure JDK 1.8 is installed
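
One way to confirm the installed version is to parse the first line of `java -version`. The sketch below is illustrative only: `java_major_version` is a helper name introduced here, and it assumes the usual `version "..."` format that Oracle and OpenJDK builds print.

```bash
# Print the major Java version found in one line of `java -version` output,
# e.g. 'java version "1.8.0_144"' gives 8, 'openjdk version "11.0.2"' gives 11.
java_major_version() {
  local v
  v=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
  case "$v" in
    "")  return 1 ;;                             # no version string found
    1.*) v=${v#1.}; printf '%s\n' "${v%%.*}" ;;  # legacy 1.x numbering
    *)   printf '%s\n' "${v%%[.-]*}" ;;          # Java 9 and later
  esac
}

# `java -version` writes to stderr, so merge it into stdout before parsing.
if major=$(java_major_version "$(java -version 2>&1 | head -n 1)"); then
  if [ "$major" = "8" ]; then
    echo "Java 8 detected"
  else
    echo "Java $major detected; this guide assumes Java 8"
  fi
else
  echo "No usable Java found; install the JDK first"
fi
```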

### 2. Enable SSH and Keyless Access

``` bash
sudo systemsetup -setremotelogin on

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

ssh localhost
```

### 3. Set JAVA_HOME

Add the following lines to your ***.bash_profile*** or ***.bashrc***:

```bash
export JAVA_HOME=$(/usr/libexec/java_home)

# If /usr/libexec/java_home does not exist, run the following once in a terminal to create it
# (this command does not belong in your profile).

sudo ln -s /System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands/java_home /usr/libexec/java_home

```
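
After reloading your profile, a quick sanity check is to confirm that `JAVA_HOME` points at a directory containing an executable `bin/java`. This is a minimal sketch; `is_valid_java_home` is a helper name made up for this example.

```bash
# Succeed when the directory passed as $1 contains an executable bin/java.
is_valid_java_home() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

if is_valid_java_home "${JAVA_HOME:-}"; then
  echo "JAVA_HOME is set to $JAVA_HOME"
else
  echo "JAVA_HOME is unset or does not contain bin/java"
fi
```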

### 4. Install Hadoop

```bash
brew install hadoop
```

Try the following command:

```
cd /usr/local/opt/hadoop/

bin/hadoop
```

This displays the usage documentation for the `hadoop` script.

### Local Operations
```
mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
cat output/*
```

# Pseudo-Distributed Operation

## Configurations

Next, configure Hadoop. Go to `/usr/local/opt/hadoop`; the configuration files live under `libexec/etc/hadoop`. Edit the following files, creating any that do not exist yet:

1. hadoop-env.sh
2. core-site.xml
3. mapred-site.xml
4. hdfs-site.xml
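
To see which of the files listed above still need to be created, a small check like the following can help. It is a sketch under assumptions: `missing_config_files` is a hypothetical helper, and the directory is the Homebrew location used elsewhere in this guide, which may differ on your machine.

```bash
# Print each expected config file that does not exist under directory $1.
missing_config_files() {
  local conf_dir="$1" f
  for f in hadoop-env.sh core-site.xml mapred-site.xml hdfs-site.xml; do
    [ -f "$conf_dir/$f" ] || echo "$f"
  done
}

# Assumed Homebrew location; adjust if your install lives elsewhere.
missing_config_files /usr/local/opt/hadoop/libexec/etc/hadoop
```

No output means all four files are already present.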

In *hadoop-env.sh*, look for:

```
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
```

Replace it with:

```
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
export JAVA_HOME=$(/usr/libexec/java_home)
```

**Use the following:**

`cd /usr/local/opt/hadoop`

Update the following files with the content shown below:

**vi libexec/etc/hadoop/core-site.xml**

```
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```

**vi libexec/etc/hadoop/hdfs-site.xml**

```
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
```

Verify keyless SSH access before starting the daemons: `ssh localhost` should log you in without asking for a password.
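
The keyless check can also be scripted. The sketch below uses the OpenSSH `BatchMode` option, which makes `ssh` fail instead of prompting; `can_ssh_without_password` is a helper name invented for this example. It assumes you have already run `ssh localhost` once interactively, so the host key is known.

```bash
# Succeed only if ssh can log in to host $1 without any interactive prompt.
# BatchMode=yes makes ssh fail rather than ask for a password or passphrase.
can_ssh_without_password() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

if can_ssh_without_password localhost; then
  echo "keyless SSH to localhost works"
else
  echo "keyless SSH failed; check ~/.ssh/authorized_keys and Remote Login"
fi
```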

### Execution

#### Format the File System:

```bash
bin/hdfs namenode -format
```

#### Start NameNode and DataNode daemon

```bash
sbin/start-dfs.sh
```

Browse the web interface for the NameNode; by default it is available at:

NameNode - http://localhost:9870/
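
If you prefer the terminal to a browser, `curl` can report whether the NameNode web interface answers. This is only a sketch: `http_status` is a hypothetical helper, and treating any 2xx or 3xx status as "up" is an assumption, not something Hadoop documents.

```bash
# Print the final HTTP status code for URL $1 (000 when nothing answers).
# -L follows redirects, -s silences progress output, -o discards the body.
http_status() {
  curl -s -L -o /dev/null --max-time 5 -w '%{http_code}' "$1" || true
}

status=$(http_status http://localhost:9870/)
case "$status" in
  2*|3*) echo "NameNode web interface is up (HTTP $status)" ;;
  *)     echo "NameNode web interface is not reachable" ;;
esac
```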

Make the HDFS directories required to execute MapReduce jobs:

```bash
# Replace <username> with your login name (the output of whoami)
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/<username>
```

Copy the input files into the distributed filesystem:

```bash
bin/hdfs dfs -mkdir input
bin/hdfs dfs -put etc/hadoop/*.xml input
```

Run some of the examples provided:

```bash
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
```


Examine the output files: Copy the output files from the distributed filesystem to the local filesystem and examine them:

```bash
bin/hdfs dfs -get output output
cat output/*

# OR view the output files directly on the distributed filesystem

bin/hdfs dfs -cat output/*
```

The job above ran with the local (default) MapReduce framework.

To run MapReduce jobs on YARN in pseudo-distributed mode, set the parameters below and also start the ResourceManager and NodeManager daemons.

**vi libexec/etc/hadoop/mapred-site.xml**

```
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
```

**vi libexec/etc/hadoop/yarn-site.xml**

```
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
```

Start the ResourceManager and NodeManager daemons:

```bash
sbin/start-yarn.sh
```

Browse the web interface for the ResourceManager; by default it is available at:

ResourceManager - http://localhost:8088/

List the running Java processes to confirm that the daemons started:

```bash
jps
```
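
With both `start-dfs.sh` and `start-yarn.sh` run, the listing should include NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager. The following sketch automates that comparison; `missing_daemons` is a helper name introduced here, not part of Hadoop.

```bash
# Read `jps` output on stdin and print each expected daemon that is absent.
missing_daemons() {
  local listing d
  listing=$(cat)
  for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # -w matches whole words, so SecondaryNameNode does not count as NameNode.
    printf '%s\n' "$listing" | grep -qw "$d" || echo "$d"
  done
}

if command -v jps >/dev/null 2>&1; then
  missing=$(jps | missing_daemons)
  if [ -z "$missing" ]; then
    echo "all expected daemons are running"
  else
    echo "not running:"
    echo "$missing"
  fi
fi
```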

You can configure Hadoop home so that you can start and stop Hadoop services from any directory; otherwise you need to be in `/usr/local/opt/hadoop` every time.
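
One possible way to do this is to put the Hadoop `bin` and `sbin` directories on your `PATH`, for example in ***.bash_profile***. The sketch below is an assumption about how you might set this up, not the only approach: `hadoop_dir` and `path_prepend` are names invented for the example, and the directory is the Homebrew location used throughout this guide.

```bash
# Hypothetical variable for this sketch; points at the directory used above.
hadoop_dir=/usr/local/opt/hadoop

# Prepend directory $1 to PATH unless it is already listed.
path_prepend() {
  case ":$PATH:" in
    *":$1:"*) ;;              # already present, nothing to do
    *) PATH="$1:$PATH" ;;
  esac
}

path_prepend "$hadoop_dir/sbin"
path_prepend "$hadoop_dir/bin"
export PATH
```

After this, scripts such as `start-dfs.sh`, `start-yarn.sh`, `stop-yarn.sh` and `stop-dfs.sh` can be invoked by name from any directory.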