A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_ → _Restart & Run All_ to restart the kernel and run all cells.

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 2. Hadoop File System

In this problem, we will explore some basic components of the Hadoop Distributed File System (HDFS).

In [None]:
import os
from nose.tools import assert_equal, assert_true

We will first set up a local Hadoop environment. Let's stop the [namenode and datanodes](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#NameNode+and+DataNodes) in case they are still running in the background from a previous run.

In [None]:
! $HADOOP_PREFIX/sbin/stop-dfs.sh
! $HADOOP_PREFIX/sbin/stop-yarn.sh

If there are any temporary files created during the previous Hadoop operation, we want to clean them up. You may see something that looks like an error:

```
rm: cannot remove ‘/tmp/hsperfdata_root’: Operation not permitted
```

It's not really an error, and it won't affect our result, so you can safely ignore it.

In [None]:
! rm -rf /tmp/*

We will simply [format the namenode](https://wiki.apache.org/hadoop/GettingStartedWithHadoop#Formatting_the_Namenode) and delete all files in our HDFS. Note that our HDFS is in an ephemeral Docker container, so all data will be lost anyway when the Docker container is shut down.

In [None]:
! echo "Y" | $HADOOP_PREFIX/bin/hdfs namenode -format 2> /dev/null

After formatting the namenode, we restart the namenode and datanodes.

In [None]:
! $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
! $HADOOP_PREFIX/sbin/start-dfs.sh
! $HADOOP_PREFIX/sbin/start-yarn.sh

Sometimes when the namenode is restarted, it enters [Safe Mode](https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode), which does not allow any changes to the file system. We do want to make changes, so we manually leave Safe Mode.

In [None]:
! $HADOOP_PREFIX/bin/hdfs dfsadmin -safemode leave
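
To confirm that Safe Mode is actually off, `hdfs dfsadmin -safemode` also accepts a `get` action. The following is only a rough sketch of how that check could be scripted from Python. The helper names `safemode_command` and `is_safemode_on` and the fallback path are hypothetical, and the parser assumes the report contains a phrase of the form `Safe mode is ON` or `Safe mode is OFF`.

```python
import os
import subprocess


def safemode_command(action="get"):
    """Build the argv for `hdfs dfsadmin -safemode <action>`."""
    # HADOOP_PREFIX is the variable used throughout this notebook;
    # the fallback path is a hypothetical placeholder.
    prefix = os.environ.get("HADOOP_PREFIX", "/usr/local/hadoop")
    return [os.path.join(prefix, "bin", "hdfs"), "dfsadmin", "-safemode", action]


def is_safemode_on(report):
    """Return True if the report text says Safe Mode is on (assumed wording)."""
    return "safe mode is on" in report.lower()


if __name__ == "__main__":
    cmd = safemode_command("get")
    print(" ".join(cmd))
    # Only call the real binary if it exists on this machine.
    if os.path.exists(cmd[0]):
        result = subprocess.run(cmd, capture_output=True, text=True)
        print("Safe Mode on?", is_safemode_on(result.stdout))
```

If the check reports that Safe Mode is still on, rerunning the `-safemode leave` cell above is the first thing to try.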

## Create a new directory /user/data_scientist in HDFS.

- In the following code cell, create a new directory in HDFS at `/user/data_scientist/wc/in`.

- As the [Introduction to Hadoop notebook](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week12/notebooks/intro2hadoop.ipynb) explains, we must use the [HDFS file system interface](https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#dfs) to move around the HDFS file system. We use `$HADOOP_PREFIX/bin/hdfs` to do this. Furthermore, `$HADOOP_PREFIX/bin/hdfs` is a Unix command, and to execute a Unix command in a Jupyter notebook we must use the `!` magic. Putting them together, your answer should start with `!$HADOOP_PREFIX/bin/hdfs`.

- Running `!$HADOOP_PREFIX/bin/hdfs` by itself will list the available commands:

```python
!$HADOOP_PREFIX/bin/hdfs
```

```
Usage: hdfs [--config confdir] [--loglevel loglevel] COMMAND
       where COMMAND is one of:
  dfs                  run a filesystem command on the file systems supported in Hadoop.
  ...
```

    where only the first entry is shown because we only need the `dfs` subcommand. The [`dfs` commands](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html) mirror many of the traditional Unix file system commands. The full listing can be obtained by entering `!$HADOOP_PREFIX/bin/hdfs dfs`:

```python
!$HADOOP_PREFIX/bin/hdfs dfs
```

```
Usage: hadoop fs [generic options]
    ...
	[-mkdir [-p] <path> ...]
    ...
```

    Here, the output is shortened to show only the relevant option.
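
Outside of the `!` magic, the same kind of command line can be assembled in plain Python. The sketch below is a hedged illustration, not part of the required answer: the helper name `hdfs_dfs` and the fallback path are hypothetical, and it only builds the argument list for `hdfs dfs -help mkdir`, which prints the help text for a single option.

```python
import os
import shlex


def hdfs_dfs(*args):
    """Return the argv list for `$HADOOP_PREFIX/bin/hdfs dfs <args>`."""
    # The fallback path is a hypothetical placeholder for machines
    # where HADOOP_PREFIX is not set.
    prefix = os.environ.get("HADOOP_PREFIX", "/usr/local/hadoop")
    return [os.path.join(prefix, "bin", "hdfs"), "dfs", *args]


# Build the command that asks for the help text of one subcommand.
help_cmd = hdfs_dfs("-help", "mkdir")
print(" ".join(shlex.quote(part) for part in help_cmd))
```

On a machine where Hadoop is installed, the list could be passed to `subprocess.run(help_cmd)`. In this notebook the `!` magic remains the expected form for your answer.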

In [None]:
# YOUR CODE HERE

In [None]:
ls_wc = ! $HADOOP_PREFIX/bin/hdfs dfs -ls wc
print('\n'.join(ls_wc))

In [None]:
assert_true('wc/in' in ls_wc.s)
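
The test above relies on the `.s` attribute of the object that IPython returns when shell output is assigned to a variable. That attribute is the captured lines joined by spaces into one string. The sketch below mimics this with a plain list so it runs without Hadoop; the two lines in `captured` are made-up placeholders, not real `hdfs dfs -ls` output.

```python
# Hypothetical stand-in for captured shell output: one string per line.
captured = ["first line of output", "entry that ends with wc/in"]

# `.s` on the real object corresponds to joining the lines with spaces.
joined = " ".join(captured)

print("wc/in" in joined)    # True: substring search across all lines
print("wc/in" in captured)  # False: list membership needs an exact element
```

This is why the assertion searches `ls_wc.s` rather than `ls_wc` itself.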

## Copy /home/data_scientist/data/iris.csv to wc/in/iris.csv in HDFS

- There's a file called `iris.csv` in the `data` directory of the **local host file system**.

```python
!ls -lah /home/data_scientist/data/iris.csv
```

```
-rw-r--r-- 1 root root 4.5K Nov  7 16:18 /home/data_scientist/data/iris.csv
```

    In the following code cell, copy this `iris.csv` file into the `wc/in` directory in **HDFS**.

- Run `!$HADOOP_PREFIX/bin/hdfs dfs` again to find which option we need to use.

```python
!$HADOOP_PREFIX/bin/hdfs dfs
```

```
Usage: hadoop fs [generic options]
    ...
	[-put [-f] [-p] [-l] <localsrc> ... <dst>]
    ...
```

In [None]:
# YOUR CODE HERE

In [None]:
ls_wc_in = ! $HADOOP_PREFIX/bin/hdfs dfs -ls wc/in
print('\n'.join(ls_wc_in))

In [None]:
assert_true('wc/in/iris.csv' in ls_wc_in.s)

We are done. Running the namenode and datanodes in the background consumes quite a bit of memory, so we should shut down the nodes at the end of the notebook. Make sure you run the assertion tests in the final code cell.

In [None]:
!$HADOOP_PREFIX/sbin/stop-dfs.sh
!$HADOOP_PREFIX/sbin/stop-yarn.sh

In [None]:
check_dfs_stopped = !$HADOOP_PREFIX/sbin/stop-dfs.sh
assert_true("no namenode to stop" in check_dfs_stopped.s)
assert_true("no datanode to stop" in check_dfs_stopped.s)
assert_true("no secondarynamenode to stop" in check_dfs_stopped.s)

check_yarn_stopped = !$HADOOP_PREFIX/sbin/stop-yarn.sh
assert_true("no resourcemanager to stop" in check_yarn_stopped.s)
assert_true("no nodemanager to stop" in check_yarn_stopped.s)
assert_true("no proxyserver to stop" in check_yarn_stopped.s)
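
The final checks all follow one pattern: each daemon should report `no <daemon> to stop`. As a rough sketch, the pattern can be written as a small helper that lists which messages are missing. The function name `missing_messages` and the sample string are hypothetical; the sample deliberately leaves out one message to show a failing case.

```python
def missing_messages(output, daemons):
    """Return the expected 'no <daemon> to stop' messages absent from output."""
    expected = ["no {} to stop".format(name) for name in daemons]
    return [message for message in expected if message not in output]


# Hypothetical space-joined output in which one daemon was still running.
sample = "no namenode to stop no datanode to stop"
print(missing_messages(sample, ["namenode", "datanode", "secondarynamenode"]))
# prints ['no secondarynamenode to stop']
```

An empty list means every daemon had already been stopped, which is the state the assertions above require.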