# Session 1 - Introduction to PyTorch

## Objectives

* To learn and run PyTorch on your machine.

**Suggested reading**: 
* What is PyTorch from [PyTorch tutorial](https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py)

#### Assumptions: basic Python programming and [Anaconda](https://anaconda.org/) installed.

## 1. Install PyTorch

#### Install [PyTorch](https://github.com/pytorch/pytorch) via [Anaconda](https://anaconda.org/)
`conda install -c pytorch pytorch`

When you are asked whether to proceed, say `y`

#### Install [torchvision](https://github.com/pytorch/vision)
`conda install -c pytorch torchvision`

When you are asked whether to proceed, say `y`
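
Once both installs finish, it can be worth confirming that the packages are visible to the active Python environment. The sketch below uses only the standard library, so it runs whether or not the installs succeeded; the helper name `is_installed` is hypothetical and not part of PyTorch.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the current interpreter can locate `package_name`."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # PyTorch is imported as `torch`; torchvision as `torchvision`.
    for name in ("torch", "torchvision"):
        status = "found" if is_installed(name) else "NOT found"
        print(f"{name}: {status}")
```

If either package is reported as not found, check that the conda environment you installed into is the one currently activated.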

In [None]:
import os
import subprocess
def module(*args):        
    if isinstance(args[0], list):        
        args = args[0]        
    else:        
        args = list(args)        
    (output, error) = subprocess.Popen(['/usr/bin/modulecmd', 'python'] + args, stdout=subprocess.PIPE).communicate()
    exec(output)    
module('load', 'apps/java/jdk1.8.0_102/binary')    
os.environ['PYSPARK_PYTHON'] = os.environ['HOME'] + '/.conda/envs/jupyter-spark/bin/python'

### 1.2. Windows/Linux/Mac: on your own / lab machine.  

Follow the instructions for your OS (Windows/Linux/Mac) below to install Java, set up the proper paths, etc. If you already have `conda` installed (e.g., from the MLAI module, COM6509), you can instead install `pyspark 2.3.2` via

`conda install -c conda-forge pyspark=2.3.2`


* Windows: 1) With video - [Install Spark on Windows (PySpark)](https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c) 2) [How to install Spark on Windows in 5 steps](https://medium.com/@dvainrub/how-to-install-apache-spark-2-x-in-your-pc-e2047246ffc3) **Note:** The following may be needed. Go to your System Environment Variables and add `PYTHONPATH` with the following value: `%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-<version>-src.zip;%PYTHONPATH%` (Windows separates path entries with `;`). Check which py4j version you have in your `spark/python/lib` folder ([source](https://stackoverflow.com/questions/53161939/pyspark-error-does-not-exist-in-the-jvm-error-when-initializing-sparkcontext?noredirect=1&lq=1)).

* Linux: With video - [Install PySpark on Ubuntu](https://medium.com/@GalarnykMichael/install-spark-on-ubuntu-pyspark-231c45677de0)

* Mac: [Install Spark/PySpark on Mac](https://medium.com/@yajieli/installing-spark-pyspark-on-mac-and-fix-of-some-common-errors-355a9050f735) **Note: you need to use Java 8. Java 11 causes problems with this Spark version.**
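
Because the Java version matters here, a small check before launching Spark may help. The following is a minimal sketch, assuming `java -version` reports a quoted version string such as `"1.8.0_102"` (legacy numbering, meaning Java 8) or `"11.0.2"`; the function names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_java_major(version_text):
    """Extract the major Java version from `java -version` style output.

    Handles the legacy scheme ("1.8.0_102" -> 8) and the newer scheme
    ("11.0.2" -> 11). Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_text)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    if first == 1 and second is not None:
        return int(second)
    return first


def installed_java_major():
    """Run `java -version` if Java is on the PATH; otherwise return None."""
    if shutil.which("java") is None:
        return None
    # `java -version` writes its report to stderr rather than stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    print("Java major version:", installed_java_major())
```

A result of `8` matches the requirement noted above; `None` suggests Java is not on your `PATH`.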

Quit the PySpark shell with `Ctrl+D`.

## 2. Run Spark

### On HPC: connect and activate first

`qrshx`

`module load apps/java/jdk1.8.0_102/binary`

`module load apps/python/conda`

`source activate myspark`


### Interactive (HPC or local machine)

#### If running notebook of spark on your local machine
**Note:** If `import pyspark` reports an error, try `pip install findspark`, then run `import findspark` and
`findspark.init()`. After that, `import pyspark` should work.

In [1]:
#import findspark
#findspark.init()
import pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("COM6012 Spark Intro") \
    .getOrCreate()

sc = spark.sparkContext

#### If running Spark in a shell on either HPC or your local machine, `spark` (SparkSession) and `sc` (SparkContext) are created automatically.

Run pyspark (optionally, specify to use multiple cores)

`pyspark` or `pyspark --master local[2]` with two cores



Check your SparkSession and SparkContext objects (you will see different output if running in a shell).

In [2]:
spark

Create and check sc (SparkContext)

In [3]:
sc

In [4]:
nums = sc.parallelize([1,2,3,4])
nums.map(lambda x: x*x).collect()

[1, 4, 9, 16]

## 3. Log Mining with Spark - Example


This example deals with **Semi-Structured** data in a text file. 

Firstly, you need to **make sure the file is in the proper directory and change the file path if necessary**, on either HPC or local machine.

**If running on HPC, you need to transfer files there.** Here is how to [**transfer files to HPC**](https://www.sheffield.ac.uk/cics/research/hpc/using/access). Please **click** and follow the instructions unless you are already familiar with it.

### GUI-based file transfer

* [**MobaXterm**](https://mobaxterm.mobatek.net/) is recommended for **Windows**
* [**Cyberduck**](https://en.wikipedia.org/wiki/Cyberduck) or [**FileZilla**](https://en.wikipedia.org/wiki/FileZilla) is recommended for **Mac**
* **FileZilla** is recommended for **Linux (e.g., Ubuntu)**

For example, in MobaXterm (for Windows), you just need to **Drag your file or folder to the left directory pane of MobaXterm**.

In [5]:
logFile=spark.read.text("Data/NASA_Aug95_100.txt")
logFile

DataFrame[value: string]

In [6]:
logFile.count()

100

In [7]:
logFile.first()

Row(value='in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')
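
Each row is a single string, so answering questions about hosts, dates or status codes means picking the line apart. The sketch below does this in plain Python with a regular expression, assuming every line follows the layout of the row above (host, two unused fields, bracketed timestamp, quoted request, status, size). `LOG_PATTERN` and `parse_log_line` are hypothetical names.

```python
import re

# Assumed layout: host, two unused fields, [timestamp], "request", status, size.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>.*)" (?P<status>\d{3}) (?P<size>\S+)$'
)


def parse_log_line(line):
    """Split one access-log line into named fields, or return None."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Assumption: a non-numeric size (such as "-") means no bytes were sent.
    fields["size"] = int(fields["size"]) if fields["size"].isdigit() else 0
    return fields


sample = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
          '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')
record = parse_log_line(sample)
print(record["host"], record["status"], record["size"])
# prints: in24.inetnebr.com 200 1839
```

Inside Spark, a similar pattern with numbered groups could be applied to the `value` column with `pyspark.sql.functions.regexp_extract`.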

**Question**: How many accesses are from Japan?

In [8]:
hostsJapan = logFile.filter(logFile.value.contains(".jp"))

Check whether you are getting what you want.

In [9]:
hostsJapan.show(5,False)

+--------------------------------------------------------------------------------------------------------------+
|value                                                                                                         |
+--------------------------------------------------------------------------------------------------------------+
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:17 -0400] "GET / HTTP/1.0" 200 7280                         |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:18 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 200 5866|
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:21 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0   |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:21 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0" 304 0 |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:22 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 304 0    |
+--------------------------------------------------------------------------------------------------------------+

In [10]:
hostsJapan.count()

11
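
Note that `contains(".jp")` matches the substring anywhere in the line, so a request for a file such as `photo.jpg` from a non-Japanese host would also be counted. A stricter test looks only at the host field. The sketch below shows the idea in plain Python, followed by a possible Spark counterpart using `Column.rlike`; the function names and the second sample line are hypothetical.

```python
import re


def host_of(line):
    """Return the first whitespace-delimited token, assumed to be the host."""
    parts = line.split(None, 1)
    return parts[0] if parts else ""


def is_from_domain(line, suffix):
    """True if the host field ends with `suffix` (e.g. ".jp")."""
    return host_of(line).endswith(suffix)


def filter_by_host_suffix(log_df, suffix):
    """Spark counterpart (sketch): keep rows whose host field ends with `suffix`.

    `log_df` is assumed to be a DataFrame with one string column `value`,
    like `logFile` above.
    """
    pattern = r"^\S*" + re.escape(suffix) + r"\s"
    return log_df.filter(log_df.value.rlike(pattern))


lines = [
    # Taken from the output above.
    'kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:17 -0400] "GET / HTTP/1.0" 200 7280',
    # Hypothetical line: a non-.jp host requesting a .jpg file.
    'example.host.com - - [01/Aug/1995:00:00:20 -0400] "GET /images/photo.jpg HTTP/1.0" 200 100',
]
print([".jp" in line for line in lines])                  # [True, True]
print([is_from_domain(line, ".jp") for line in lines])    # [True, False]
```

With a live session, `filter_by_host_suffix(logFile, ".jp").count()` would give the stricter count to compare against the one above.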

### Self-contained Application

To run a self-contained application, you need to **exit your shell, by `Ctrl+D` first**.

Create a file `LogMining100.py`

~~~~
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("COM6012 Spark Intro") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("WARN")

logFile=spark.read.text("Data/NASA_Aug95_100.txt")
hostsJapan = logFile.filter(logFile.value.contains(".jp")).count()

print("\n\nHello Spark: There are %i accesses from Japan.\n\n" % (hostsJapan))

spark.stop()
~~~~


Then run it with `spark-submit Code/LogMining100.py`  **Note: You need to exit your shell with `Ctrl+D` first.**



## 4. Big Data Log Mining with Spark 

**Data**: Download the August data in gzip format (NASA_access_log_Aug95.gz) from [NASA HTTP server access log](http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html) and put it into your `Data` folder. `NASA_Aug95_100.txt` above is the first 100 lines of the August data.

**Question**: How many accesses are from Japan and the UK, respectively?

Create a file `LogMiningBig.py`

~~~~
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("COM6012 Spark Intro") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("WARN")

logFile=spark.read.text("../Data/NASA_access_log_Aug95.gz").cache()

hostsJapan = logFile.filter(logFile.value.contains(".jp")).count()
hostsUK = logFile.filter(logFile.value.contains(".uk")).count()

print("\n\nHello Spark: There are %i accesses from UK.\n" % (hostsUK))
print("Hello Spark: There are %i accesses from Japan.\n\n" % (hostsJapan))

spark.stop()
~~~~
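**Spark can read gzip files directly. You do not need to decompress the archive into a large text file first.**

**Note the use of `cache()` above:** `logFile` is filtered twice (once for `.jp`, once for `.uk`), so caching lets the second count reuse the data loaded by the first instead of reading the file again.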
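
Since the download stays compressed on disk, a quick way to confirm it is intact before submitting a job is to peek at its first few lines with Python's standard `gzip` module. This is a minimal sketch; `peek_gzip` is a hypothetical helper, and the `latin-1` encoding is an assumption chosen because it never fails to decode a byte.

```python
import gzip
import itertools
import os


def peek_gzip(path, n=5, encoding="latin-1"):
    """Return the first `n` lines of a gzip-compressed text file.

    Only as much of the file as needed is decompressed.
    """
    with gzip.open(path, mode="rt", encoding=encoding) as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]


if __name__ == "__main__":
    # Path taken from LogMiningBig.py above; adjust to your own layout.
    log_path = "../Data/NASA_access_log_Aug95.gz"
    if os.path.exists(log_path):
        for line in peek_gzip(log_path):
            print(line)
    else:
        print("File not found:", log_path)
```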

### Run a program in batch mode

[How to submit batch jobs to ShARC](https://www.sheffield.ac.uk/cics/research/hpc/sharc/batch) **The more resources you request, the longer you need to queue.**

Interactive mode is good for learning, exploring and debugging with smaller data. For big data, batch processing is more convenient. You submit the job to the scheduler, where it joins a queue. Once resources are allocated, your job runs and its output is recorded in a log file. This is done via a shell script.

Create a file `Lab1_SubmitBatch.sh`

~~~~
#!/bin/bash
#$ -l h_rt=2:00:00  #time needed
#$ -pe smp 2 #number of cores
#$ -l rmem=4G #amount of memory
#$ -o COM6012_Lab1.output #This is where your output and errors are logged.
#$ -j y # normal and error outputs into a single file (the file above)
#$ -M youremail@shef.ac.uk #Notify you by email, remove this line if you don't want emails
#$ -m ea #Email you when it finished or aborted
#$ -cwd # Run job from current directory

module load apps/java/jdk1.8.0_102/binary

module load apps/python/conda

source activate myspark

spark-submit ../Code/LogMiningBig.py
~~~~

* Get the necessary files onto ShARC.
* Start a session with the command `qrshx`.
* From the appropriate directory (`HPC`), submit your job via the `qsub` command:

`qsub Lab1_SubmitBatch.sh`

Check the status of your queuing/running job(s) with `qstat` (jobs not shown have already finished).

Check your output file, which is **`COM6012_Lab1.output`** specified with option **`-o`** above. You can change it to a name you like.

## 5. Exercise

### More mining questions (completing three or more questions is considered as completion of this exercise):

#### Easier questions (recommended)
* How many requests in total?
* How many requests on a particular day (e.g., 15th August)?
* How many 404 (page not found) errors in total?
* How many 404 (page not found) errors on a particular day (e.g., 15th August)?
* How many requests from a particular host (e.g., uplherc.up.com)?
* Any other question that you are interested in.
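
Most of the easier questions reduce to filtering lines by day, by status code, or both, and then counting. The sketch below shows the logic in plain Python on three lines taken from the output earlier in this session, followed by a possible Spark counterpart. All function names are hypothetical, and the status check assumes the status code is the second-to-last field with no trailing whitespace.

```python
def is_on_day(line, day):
    """True if the bracketed timestamp starts with `day`, e.g. "15/Aug/1995"."""
    return ("[" + day + ":") in line


def has_status(line, status):
    """True if the second-to-last field equals `status`."""
    fields = line.rsplit(None, 2)
    return len(fields) == 3 and fields[1] == str(status)


def count_matching(lines, day=None, status=None):
    """Count lines that satisfy every filter that was supplied."""
    count = 0
    for line in lines:
        if day is not None and not is_on_day(line, day):
            continue
        if status is not None and not has_status(line, status):
            continue
        count += 1
    return count


def spark_count(log_df, day=None, status=None):
    """Spark counterpart (sketch) for a DataFrame with a string column `value`."""
    df = log_df
    if day is not None:
        df = df.filter(df.value.contains("[" + day + ":"))
    if status is not None:
        df = df.filter(df.value.rlike(r"\s" + str(status) + r"\s\S+$"))
    return df.count()


sample = [
    'kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:17 -0400] "GET / HTTP/1.0" 200 7280',
    'kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:18 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 200 5866',
    'kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:21 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0',
]
print(count_matching(sample, day="01/Aug/1995"))   # 3
print(count_matching(sample, status=304))          # 1
```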

#### More challenging questions that will become easier to answer in Session 2 (optional for Session 1)
* How many **unique** hosts on a particular day (e.g., 15th August)?
* How many **unique** hosts in total (i.e., in August 1995)?
* Which host is the most frequent visitor?
* How many different types of return codes?
* How many requests per day on average?
* How many requests per host on average?
* Any other question that you are interested in.
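
To make clear what "unique hosts" and "most frequent visitor" mean before the Spark tools for them are introduced, here is a plain-Python sketch on three lines taken from earlier in this session. It is only an illustration of the questions on a tiny sample, not the intended Spark solution; `host_counts` is a hypothetical name.

```python
from collections import Counter


def host_counts(lines):
    """Count requests per host, taking the first token of each line as the host."""
    return Counter(line.split(None, 1)[0] for line in lines if line.strip())


sample = [
    'in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839',
    'kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:17 -0400] "GET / HTTP/1.0" 200 7280',
    'kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:18 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 200 5866',
]
counts = host_counts(sample)
print(len(counts))             # number of unique hosts: 2
print(counts.most_common(1))   # [('kgtyk4.kj.yamagata-u.ac.jp', 2)]
```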

### The effects of caching (recommended)
* **Compare** the time taken to complete your jobs **with and without** `cache()`.
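
One possible way to run this comparison is sketched below. `timed` is a small stdlib helper; `compare_cache` is a hypothetical function that assumes an existing SparkSession and a readable file path. The uncached runs are timed before `cache()` is called, because Spark may otherwise reuse cached data for any query over the same source.

```python
import time


def timed(action, repeats=2):
    """Call `action()` `repeats` times and return each call's elapsed seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return durations


def compare_cache(spark, path, needle=".jp"):
    """Sketch: time the same count before and after calling cache().

    `spark` is assumed to be an active SparkSession and `path` a text or
    gzip file that it can read.
    """
    log_df = spark.read.text(path)
    without = timed(lambda: log_df.filter(log_df.value.contains(needle)).count())
    cached = log_df.cache()  # lazily marks the DataFrame for caching
    with_cache = timed(lambda: cached.filter(cached.value.contains(needle)).count())
    return {"without cache": without, "with cache": with_cache}


if __name__ == "__main__":
    # Stand-in workload so the helper can be tried without Spark.
    print(timed(lambda: sum(range(100000)), repeats=3))
```

With `cache()`, the first timed call still pays for reading the file and filling the cache, so the benefit should show up from the second call onwards. Timings can also be affected by the operating system's own file cache.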

# Acknowledgements

Many thanks to Twin, Will, Mike, Vamsi for their kind help and all those kind contributors of open resources.

The log mining problem is adapted from [UC Berkeley cs105x L3](https://www.edx.org/course/introduction-apache-spark-uc-berkeleyx-cs105x).