# Lab 1 - Introduction to (Py)Spark and (Sheffield)HPC

[COM6012 Scalable Machine Learning **2021**](https://github.com/haipinglu/ScalableML) by [Haiping Lu](http://staffwww.dcs.shef.ac.uk/people/H.Lu/) at The University of Sheffield

# NOT Ready yet. Still in Preparation

## Study schedule

* Task 1: To finish by Wednesday. **Critical**
* Task 2: To finish by Wednesday. **Critical**
* Task 3: To finish by Thursday. **Essential**
* Task 4: To finish by Thursday. **Essential**
* Task 5: To finish before the next Monday. ***Exercise***
* Task 6: To explore further. *Optional*

**Suggested reading**: 
* Chapters 2 to 4 of [PySpark tutorial](https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf) (several sections in Chapter 3 can be safely skipped)
* [Spark Quick Start](https://spark.apache.org/docs/3.0.1/quick-start.html) (Choose **Python** rather than the default *Scala*)

**<span style="color:red">Note - Please read before proceeding</span>**:
- HPC nodes are **shared** resources (**like buses/trains**) relying on considerate usage of every user. When requesting resources, if you ask for too much (e.g. 50 cores), it will take a long time to get allocated, particularly during "*rush hours*" (e.g. close to deadlines) and once allocated, it will leave much less for the others. If everybody is asking for too much, the system won't work and everyone suffers.
- We have five nodes (each with 40 cores, 768GB RAM) reserved for this module. You can specify `-P rse-com6012` (e.g. after `qrshx`) to get access. However, these nodes are not always more available than the regular ones, e.g. if all of us are using them at the same time. There are **100+** regular nodes, many of which may be idle.
- Please follow **all steps (step by step without skipping)** unless you are very confident in handling problems by yourself. 
- Please try your best to follow the **study schedule** above to finish the tasks on time. If you start early/on time, you will find your problems early, so you can make good use of the online sessions to get help from the instructors and teaching assistants, rather than panicking close to an assessment deadline. Based on our experience from the past four years, rushing towards an assessment deadline in this module is likely to make you fall, sometimes painfully.

## 1. Connect to HPC and Install Spark

You **must** first connect to the [university's VPN](https://www.sheffield.ac.uk/it-services/vpn) unless you are on the campus network, which is unlikely during the lockdown.

### 1.1 Connect to ShARC HPC via SSH

Follow the [official instructions](https://docs.hpc.shef.ac.uk/en/latest/hpc/index.html) from our university. I have already had your HPC account created (first step done). Use your university **username** such as `abc18de` and the associated password to log in. 

If you have problems logging in, email me 1) your username, 2) the connection status of your VPN, and 3) a screen capture of the error message (**never send me your password**). If you can log in but encounter an HPC-related problem, you can email `hpc@sheffield.ac.uk` for help.

Follow the [official instructions](https://docs.hpc.shef.ac.uk/en/latest/hpc/connecting.html) for [Windows](https://docs.hpc.shef.ac.uk/en/latest/hpc/connecting.html#ssh-client-software-on-windows) or [Mac OS/X and Linux](https://docs.hpc.shef.ac.uk/en/latest/hpc/connecting.html#ssh-client-software-on-mac-os-x-and-linux) to open a terminal and connect to ShARC via SSH by

`ssh -X $USER@sharc.shef.ac.uk`

You need to replace `$USER` with your username; let's assume it is `abc1de`. If successful, you should see 

`[abc1de@sharc-login1 ~]$`

`abc1de` should be your username. 

**MobaXterm tips**

- You can save the host, username (and password if your computer is secure) as a **Session** if you want to save time in future.
- Shift + Insert for copy (after text selection), not Ctrl + C. Ctrl + V works as paste (to confirm on new PC).

### 1.2 Set up the environment and install PySpark

#### Start an interactive session

Type `qrshx` for a *regular* node **or** `qrshx -P rse-com6012` for a com6012-reserved node

#### Load Java and conda

`module load apps/java/jdk1.8.0_102/binary`

`module load apps/python/conda`

#### Create a virtual environment called `myspark`

`conda create -n myspark python=3.6`

When you are asked whether to proceed, say `y`

#### Activate the environment

`source activate myspark`

You **must** see `(myspark) [abc1de@sharc-nodeXXX ~]$`, i.e. **(myspark)** in front, before proceeding. Otherwise, you did not get the proper environment. Check the above steps. 

#### Install pyspark 3.0.1 using `pip`

`pip install pyspark==3.0.1`

If you are asked whether to proceed, say `y`

#### Run spark

`pyspark`

You should see Spark version **3.0.1** displayed. Quit the PySpark shell with `Ctrl + D`.

### 1.3 Get more familiar with the HPC

**Terminal/command line**: learn the [basic use of the command line](https://github.com/mikecroucher/Intro_to_HPC/blob/gh-pages/terminal_tutorial.md) in Linux, e.g. use `pwd` to find out your **current directory**.

**Transfer files**: learn how to [transfer files to/from ShARC HPC](https://www.sheffield.ac.uk/it-services/research/hpc/using/access).

**Line ending warning**: if you are using Windows, you should be aware that [line endings differ between Windows and Linux](https://stackoverflow.com/questions/426397/do-line-endings-differ-between-windows-and-linux). If you edit a shell script (below) in Windows, make sure that you use a Unix/Linux compatible editor or do the conversion before using it on HPC.
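
If a script has already been saved with Windows line endings, one way to do the conversion is a few lines of Python. The sketch below is only an illustration, not an official tool: the function names `to_unix_line_endings` and `convert_file` are hypothetical, and the file is handled as bytes so that nothing other than the line endings changes.

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Replace Windows (CRLF) line endings with Unix (LF) ones."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> None:
    """Rewrite the file at `path` in place with Unix line endings."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(to_unix_line_endings(data))


# Example on an in-memory script that was saved with Windows line endings
windows_script = b"#!/bin/bash\r\nmodule load apps/python/conda\r\n"
print(to_unix_line_endings(windows_script))
```

Calling `convert_file` on a script before submitting it rewrites that script in place; files that already use Unix line endings are left unchanged.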

### 1.4 *Optional: Install PySpark on your own machine*  

This module focuses on the HPC terminal. Labs are in the format of Jupyter Notebooks but you should use the HPC terminal to complete the labs. ALL assessments use the HPC terminal.

Installation of PySpark on your own machine is more complicated than installing a regular Python library because it depends on Java (i.e. it is not pure Python). The four basic steps are

- Install **Java 8**, i.e. java version *1.8.xxx* (not Java 11, or 1.11) via [**Java JRE**](https://www.oracle.com/java/technologies/javase-jre8-downloads.html). Most instructions online ask you to install the *Java SDK*, which is heavy and unnecessary; the JRE is light and sufficient.
- Install Python **3.6+** (if not yet)
- Install PySpark **3.0.1** with **Hadoop 2.7**
- Set up the proper environments (see references below)

As far as I know, it is not necessary to install *Scala*.

Different OS (Windows/Linux/Mac) may have different problems. We provide some references below if you wish to try but it is *not required* and we can provide only very limited support on this task (i.e. we may not be able to solve all problems that you may encounter).

If you do want to install PySpark and run Jupyter Notebooks on your own machine, you need to complete the steps above with reference to the instructions below for your OS (Windows/Linux/Mac).

#### References (use with caution, not necessarily up to date or the best)

If you follow the steps in these references, be aware that they are not up to date so you should install the correct versions: **Java 1.8**, Python **3.6+**, PySpark **3.0.1** with **Hadoop 2.7**. *Scala* is optional.

* Windows: 1) [Install Spark on Windows (PySpark)](https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c) (with video) 2) [How to install Spark on Windows in 5 steps](https://medium.com/@dvainrub/how-to-install-apache-spark-2-x-in-your-pc-e2047246ffc3) **Note:** The following may be needed. Go to your System Environment Variables and add PYTHONPATH to it with the following value: `%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-<version>-src.zip;%PYTHONPATH%`, just check what py4j version you have in your `spark/python/lib` folder ([source](https://stackoverflow.com/questions/53161939/pyspark-error-does-not-exist-in-the-jvm-error-when-initializing-sparkcontext?noredirect=1&lq=1)).

* Linux: 1) [Install PySpark on Ubuntu](https://medium.com/@GalarnykMichael/install-spark-on-ubuntu-pyspark-231c45677de0) (with video); 2) [Installing PySpark with JAVA 8 on ubuntu 18.04](https://towardsdatascience.com/installing-pyspark-with-java-8-on-ubuntu-18-04-6a9dea915b5b)

* Mac: 1) [Install Spark on Mac (PySpark)](https://medium.com/@GalarnykMichael/install-spark-on-mac-pyspark-453f395f240b) (with video); 2) [Install Spark/PySpark on Mac](https://medium.com/@yajieli/installing-spark-pyspark-on-mac-and-fix-of-some-common-errors-355a9050f735)

#### Install PySpark on Windows

Here we provide more detailed instructions only for Windows


## 2. Run Spark

### On HPC: get a node and activate myspark

- Get a node via `qrshx` or `qrshx -P rse-com6012`.
- Activate the environment by

  `module load apps/java/jdk1.8.0_102/binary`

  `module load apps/python/conda`

  `source activate myspark`

  Alternatively, put `HPC/myspark.sh` under your root directory (see above on how to transfer files) and run

  `source myspark.sh`

  which runs the three commands in sequence. You can modify the script further to suit yourself better.


### Interactive (HPC or local machine)

#### If running a Spark notebook on your local machine
**Note**: If `import pyspark` reports an error, try `pip install findspark`, then `import findspark` and `findspark.init()`; after that, `import pyspark` should work.

~~~~
#import findspark
#findspark.init()
import pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("COM6012 Spark Intro") \
    .getOrCreate()

sc = spark.sparkContext
~~~~

#### If running Spark in a shell on either HPC or your local machine, `spark` (SparkSession) and `sc` (SparkContext) are automatically created.

Run pyspark (optionally, specify to use multiple cores)

`pyspark` or `pyspark --master local[2]` with two cores



Check your SparkSession and SparkContext objects (you will see different output if running in a shell).

~~~~
spark
~~~~

Check `sc` (SparkContext)

~~~~
sc
~~~~

~~~~
nums = sc.parallelize([1,2,3,4])
nums.map(lambda x: x*x).collect()
~~~~

~~~~
[1, 4, 9, 16]
~~~~

## 3. Log Mining with Spark - Example


This example deals with **Semi-Structured** data in a text file. 

Firstly, you need to **make sure the file is in the proper directory and change the file path if necessary**, on either HPC or local machine.

**If running on HPC, you need to transfer files there.** Here is how to [**transfer files to HPC**](https://www.sheffield.ac.uk/cics/research/hpc/using/access). Please **click** and follow the instructions unless you are already familiar with it.

### GUI-based file transfer

* [**MobaXterm**](https://mobaxterm.mobatek.net/) is recommended for **Windows**
* [**Cyberduck**](https://en.wikipedia.org/wiki/Cyberduck) or [**FileZilla**](https://en.wikipedia.org/wiki/FileZilla) is recommended for **Mac**
* **FileZilla** is recommended for **Linux (e.g., Ubuntu)**

For example, in MobaXterm (for Windows), you just need to **Drag your file or folder to the left directory pane of MobaXterm**.

~~~~
logFile=spark.read.text("Data/NASA_Aug95_100.txt")
logFile
~~~~

~~~~
DataFrame[value: string]
~~~~

~~~~
logFile.count()
~~~~

~~~~
100
~~~~

~~~~
logFile.first()
~~~~

~~~~
Row(value='in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')
~~~~

**Question**: How many accesses are from Japan?

~~~~
hostsJapan = logFile.filter(logFile.value.contains(".jp"))
~~~~

Check whether you are getting what you want.

~~~~
hostsJapan.show(5,False)
~~~~

~~~~
+--------------------------------------------------------------------------------------------------------------+
|value                                                                                                         |
+--------------------------------------------------------------------------------------------------------------+
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:17 -0400] "GET / HTTP/1.0" 200 7280                         |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:18 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 200 5866|
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:21 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0   |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:21 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0" 304 0 |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:22 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 304 0    |
+--------------------------------------------------------------------------------------------------------------+
~~~~

~~~~
hostsJapan.count()
~~~~

~~~~
11
~~~~
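
Note that `contains(".jp")` matches `.jp` anywhere in the line, so a request for a path containing `.jpg` would also be counted. In the log format shown above, the host is the first whitespace-separated field, so a stricter check looks only at that field. The plain-Python sketch below illustrates the difference on two made-up log lines (they are not taken from the data set); the helper name `is_from_japan` is hypothetical.

```python
def is_from_japan(line: str) -> bool:
    """Return True if the host (first field of the log line) ends with '.jp'."""
    fields = line.split()
    host = fields[0] if fields else ""
    return host.endswith(".jp")


# Two made-up lines for illustration (not from the NASA data set)
sample = [
    'host1.example.ac.jp - - [01/Aug/1995:00:00:17 -0400] "GET / HTTP/1.0" 200 7280',
    'host2.example.com - - [01/Aug/1995:00:00:18 -0400] "GET /images/photo.jpg HTTP/1.0" 200 5866',
]

# Substring test on the whole line versus suffix test on the host field
substring_matches = sum(".jp" in line for line in sample)
host_matches = sum(is_from_japan(line) for line in sample)
print(substring_matches, host_matches)  # prints: 2 1
```

In PySpark, the same idea could be expressed with `Column.rlike` and a pattern anchored to the first field, such as `^\S+\.jp\s`. Whether this changes the count for the lab data depends on the requests it contains.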

### Self-contained Application

To run a self-contained application, you need to **exit your shell, by `Ctrl+D` first**.

Create a file `LogMining100.py`

~~~~
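from pyspark.sql import SparkSession

# Create a local SparkSession that uses two cores
spark = SparkSession.builder \
    .master("local[2]") \
    .appName("COM6012 Spark Intro") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("WARN")  # reduce console logging to warnings and errors

# Each line of the text file becomes one row in the `value` column
logFile = spark.read.text("Data/NASA_Aug95_100.txt")
# Count the rows (accesses) whose text contains ".jp"
hostsJapan = logFile.filter(logFile.value.contains(".jp")).count()

print("\n\nHello Spark: There are %i accesses from Japan.\n\n" % (hostsJapan))

spark.stop()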
~~~~


Then run it with `spark-submit Code/LogMining100.py`. **Note: You need to exit your shell with `Ctrl+D` first.**



## 4. Big Data Log Mining with Spark 

**Data**: Download the August data in gzip format (NASA_access_log_Aug95.gz) from the [NASA HTTP server access log](http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html) and put it into your `Data` folder. `NASA_Aug95_100.txt` above is the first 100 lines of the August data.

**Question**: How many accesses are from Japan and UK respectively?

Create a file `LogMiningBig.py`

~~~~
from pyspark.sql import SparkSession

# Create a local SparkSession that uses two cores
spark = SparkSession.builder \
    .master("local[2]") \
    .appName("COM6012 Spark Intro") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("WARN")  # reduce console logging to warnings and errors

# Read the gzip file directly; cache() keeps the data for reuse by the two counts below
logFile = spark.read.text("../Data/NASA_access_log_Aug95.gz").cache()

# Count the rows (accesses) whose text contains ".jp" or ".uk"
hostsJapan = logFile.filter(logFile.value.contains(".jp")).count()
hostsUK = logFile.filter(logFile.value.contains(".uk")).count()

print("\n\nHello Spark: There are %i accesses from UK.\n" % (hostsUK))
print("Hello Spark: There are %i accesses from Japan.\n\n" % (hostsJapan))

spark.stop()
~~~~
**Spark can read a gzip file directly. You do not need to unzip it to a big file.**

**Note the use of cache() above**

### Run a program in batch mode

[How to submit batch jobs to ShARC](https://www.sheffield.ac.uk/cics/research/hpc/sharc/batch) **The more resources you request, the longer you need to queue.**

Interactive mode will be good for learning, exploring and debugging, with smaller data. For big data, it will be more convenient to use batch processing. You submit the job to the node to join a queue. Once allocated, your job will run, with output properly recorded. This is done via a shell script. **Warning: Do not create such a file under WINDOWS.**

Create a file `Lab1_SubmitBatch.sh`

~~~~
#!/bin/bash
#$ -l h_rt=2:00:00  #time needed
#$ -pe smp 2 #number of cores
#$ -l rmem=4G #amount of memory
#$ -o COM6012_Lab1.output #This is where your output and errors are logged.
#$ -j y # normal and error outputs into a single file (the file above)
#$ -M youremail@shef.ac.uk #Notify you by email, remove this line if you don't like
#$ -m ea #Email you when it finished or aborted
#$ -cwd # Run job from current directory

module load apps/java/jdk1.8.0_102/binary

module load apps/python/conda

source activate myspark

spark-submit ../Code/LogMiningBig.py
~~~~

* Get the necessary files onto ShARC.
* Start a session with the command `qrshx`.
* Under the appropriate directory (`HPC`), submit your job via the `qsub` command:

`qsub Lab1_SubmitBatch.sh`

Check the status of your queuing/running job(s) with `qstat` (jobs not shown have already finished).

Check your output file, which is **`COM6012_Lab1.output`** specified with option **`-o`** above. You can change it to a name you like.

## 5. Exercises

## 6. Additional ideas to explore

### More mining questions (completing three or more questions is considered as completion of this exercise):

#### Easier questions
* How many requests in total?
* How many requests on a particular day (e.g., 15th August)?
* How many 404 (page not found) errors in total?
* How many 404 (page not found) errors on a particular day (e.g., 15th August)?
* How many requests from a particular host (e.g., uplherc.upl.com)?
* Any other question that you are interested in.
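
Most of these questions come down to picking fields out of each log line. As a starting point, the sketch below parses one line with a regular expression in plain Python. The pattern and the names `LOG_PATTERN` and `parse_line` are assumptions written for the line format shown in Section 3; they are not part of the lab code, and lines that do not match the pattern return `None`.

```python
import re

# Assumed pattern for lines like the one shown in Section 3:
# host - - [day/month/year:time zone] "request" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '
    r'\[(?P<day>\d{2})/(?P<month>\w{3})/(?P<year>\d{4}):(?P<time>[\d:]+) (?P<zone>[+-]\d{4})\] '
    r'"(?P<request>.*)" (?P<status>\d{3}) (?P<size>\S+)$'
)


def parse_line(line: str):
    """Return a dict of fields for a log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


line = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
        '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')
fields = parse_line(line)
print(fields["host"], fields["day"], fields["month"], fields["status"])
# prints: in24.inetnebr.com 01 Aug 200
```

All fields are returned as strings, so a question about 404 errors on a given day would compare `status` with `"404"` and `day` with, say, `"15"`. The full data set may contain lines that do not fit this pattern, which is why the `None` case should be handled.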

#### More challenging questions that will become easier to answer in Session 2 (optional for Session 1)
* How many **unique** hosts on a particular day (e.g., 15th August)?
* How many **unique** hosts in total (i.e., in August 1995)?
* Which host is the most frequent visitor?
* How many different types of return codes?
* How many requests per day on average?
* How many requests per host on average?
* Any other question that you are interested in.

### The effects of caching (recommended)
* **Compare** the time taken to complete your jobs **with and without** `cache()`.
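
One simple way to make this comparison is to time the same action in both versions of the script. The helper below is a minimal sketch using only the Python standard library; the name `timed` is hypothetical, and the dummy workload only stands in for a Spark action.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Dummy workload standing in for a Spark action such as a count
result, seconds = timed(lambda: sum(range(1_000_000)))
print(f"result={result}, time={seconds:.3f}s")
```

In `LogMiningBig.py` you could wrap each count, e.g. `timed(lambda: logFile.filter(logFile.value.contains(".jp")).count())`, once with and once without `.cache()` on `logFile`. Because Spark evaluates lazily, the first action on a cached DataFrame still pays the cost of reading the file; any benefit shows up from the second action onwards.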

# Acknowledgements

Many thanks to Twin, Will, Mike, Vamsi for their kind help and all those kind contributors of open resources.

The log mining problem is adapted from [UC Berkeley cs105x L3](https://www.edx.org/course/introduction-apache-spark-uc-berkeleyx-cs105x).