# CRAB Day One 

### Table of contents

1. [CRAB in brief](#crab-in-brief)
2. [Requests and access](#requests-and-access)
3. [EOS](#eos)
4. [SWAN](#swan)
5. [PySpark on SWAN](#pyspark-on-swan)
6. [HDFS Starter](#hdfs-starter)
7. [LXPLUS](#lxplus)

### CRAB in brief


**CRAB**, short for the CMS Remote Analysis Builder, is a utility for submitting CMSSW jobs to distributed computing resources (the Grid).

By using CRAB you will be able to:
* Access CMS data and Monte Carlo samples, which are distributed to CMS-aligned centres worldwide.
* Exploit the CPU and storage resources at CMS-aligned centres.

To submit your CMSSW job to the Grid with CRAB, you must first meet these prerequisites:
1. Get a Grid certificate and register with the CMS VO.
2. Set up your certificate for LCG.
3. Test your Grid certificate.
4. Test your code locally.
5. Validate your CMSSW config file.
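
As a rough illustration of steps 1 to 3, the sketch below runs a few local sanity checks before you attempt a submission. It assumes the certificate files sit in `~/.globus` as `usercert.pem` and `userkey.pem` and that the VOMS client tools (`voms-proxy-init`, `voms-proxy-info`) are on your `PATH`. The function name `check_grid_prerequisites` is hypothetical, and the checks do not replace the official certificate tests.

```python
import shutil
from pathlib import Path


def check_grid_prerequisites(globus_dir=None):
    """Return a dict of simple local checks for Grid submission."""
    # Assumption: the certificate lives in ~/.globus as usercert.pem / userkey.pem.
    globus = Path(globus_dir) if globus_dir else Path.home() / ".globus"
    return {
        "usercert.pem present": (globus / "usercert.pem").is_file(),
        "userkey.pem present": (globus / "userkey.pem").is_file(),
        # Assumption: the VOMS client tools provide these two commands.
        "voms-proxy-init on PATH": shutil.which("voms-proxy-init") is not None,
        "voms-proxy-info on PATH": shutil.which("voms-proxy-info") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_grid_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

A `MISSING` line only tells you where to look next. A passing check does not prove that the certificate is valid or registered with the CMS VO.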

### Requests and access

1. Hadoop Cluster Access
    * To run data analysis on the Hadoop cluster, you first need to request cluster access via the [CERN Service Portal](https://cern.service-now.com/service-portal?id=service_element&name=Hadoop-Service).
    * [Getting started with Hadoop](https://hadoop-user-guide.web.cern.ch/gettingstarted_md.html)
    * [Using SPARK on Hadoop](https://hadoop-user-guide.web.cern.ch/spark/Using_Spark_on_Hadoop.html)
2. CERNBox
    * CERNBox is a place where you can store and share your work projects. You need your own CERNBox before using SWAN, because it serves as your SWAN home directory.
    * Access CERNBox [here](https://cernbox.web.cern.ch/cernbox/).
3. Office Key Request
    * It is also important that you have your own key to the office. Request the key [here.](https://cern.service-now.com/service-portal?id=service_element&name=locks-keys-app-support)
4. Bike Rental
    * Rent a bike [here.](https://cern.service-now.com/service-portal?id=service_element&name=bicycle-rental)

### EOS

EOS provides a service for storing large amounts of physics data and user files, with a focus on interactive and batch analysis (see the [CERN EOS main page](https://eos-web.web.cern.ch/eos-web/)). Note that the files, folders, and pictures you save in SWAN are stored in EOS, and you can interact with them through your CERNBox home directory.

To check your EOS quota, follow these [instructions](https://clouddocs.web.cern.ch/containers/tutorials/hdfs.html).

### SWAN

SWAN (Service for Web based ANalysis) is a platform to perform interactive data analysis in the cloud. It uses your CERNBox as the home directory and you can access CERN experiments' and user data in CERN Cloud (EOS) there. [Read more about SWAN.](https://swan.web.cern.ch/swan/) 

See examples of how to use SWAN for data analysis and visualization [here](https://swan-gallery.web.cern.ch/basic/).

### PySpark on SWAN

PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark’s features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core. [Read more.](https://spark.apache.org/docs/latest/api/python/)

Pre-requisites:
1. *CERNBox*: Request [here](https://cern.service-now.com/service-portal?id=service_element&name=CERNBox-Service)
2. *Hadoop Cluster Access*: Request [here](https://cern.service-now.com/service-portal?id=service_element&name=Hadoop-Service)

How-to:
1. Log in to SWAN directly from the [SWAN page](https://swan-k8s.cern.ch), or log in to your CERNBox and open a SWAN notebook from there.
2. Create a new project and a new file.
3. On the top panel, look for a *Star* button. It appears only if you already have Hadoop cluster access. Configure the libraries you need, then click *connect*.


Note:
- The following variables are initialized automatically when you connect to the Spark cluster:
    * sc = [SparkContext](https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/SparkContext.html#:~:text=A%20SparkContext%20represents%20the%20connection,before%20creating%20a%20new%20one.)
    * spark = [SparkSession](https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/SparkSession.html)
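
To make the role of these two variables concrete, here is a minimal sketch of a first computation on the `spark` session. The helper names `get_spark_session` and `count_even_numbers` are hypothetical. The local fallback is an assumption for trying the code outside SWAN and needs the `pyspark` package. On SWAN the session created by the connection is reused as-is.

```python
def get_spark_session():
    """Return the notebook's `spark` if it exists, else start a local session."""
    existing = globals().get("spark")
    if existing is not None:
        return existing
    # Fallback for running outside SWAN (assumption: pyspark is installed).
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .master("local[1]")
            .appName("crab-day-one")
            .getOrCreate())


def count_even_numbers(session, upper):
    """Count the even integers in [0, upper) with a DataFrame filter."""
    numbers = session.range(upper)  # DataFrame with a single `id` column
    return numbers.filter("id % 2 = 0").count()


# Usage in a SWAN notebook after clicking connect (not executed here):
# session = get_spark_session()
# print(session.version, count_even_numbers(session, 10))
```

The `sc` variable is the same object as `spark.sparkContext`, so in most notebooks you only need `spark`.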

### HDFS Starter

Learn more about CERN HDFS [here.](https://clouddocs.web.cern.ch/containers/tutorials/hdfs.html)

#### Connect via command line

1. Connect to [LXPLUS](#lxplus) using SSH.
    * Machines running 64-bit CentOS 7: @lxplus7.cern.ch
    * Machines running CentOS 8: @lxplus8.cern.ch

In [None]:
ssh [username]@lxplus7.cern.ch

2. Configure the environment as described [here](https://hadoop-user-guide.web.cern.ch/getstart/client_cvmfs.html), then obtain a Kerberos ticket with `kinit`.

In [None]:
#first time setup
source /cvmfs/sft.cern.ch/lcg/views/LCG_99/x86_64-centos7-gcc8-opt/setup.sh
#if use spark
source /cvmfs/sft.cern.ch/lcg/etc/hadoop-confext/hadoop-swan-setconf.sh analytix 3.2 spark3

In [None]:
kinit

3. Explore HDFS by listing a path. [More paths are listed in the CMS MONIT docs](https://cmsmonit-docs.web.cern.ch/MONIT/sources/).

In [None]:
hdfs dfs -ls /

#### Connect via SWAN

After connecting to the cluster, you can explore HDFS directly from SWAN.

In [None]:
!hdfs dfs -ls /
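
If you want to work with the listing in Python rather than read it by eye, the sketch below parses the text printed by `hdfs dfs -ls` into dictionaries. It assumes the usual eight-column layout (permissions, replication, owner, group, size, date, time, path). The function names `parse_hdfs_ls` and `hdfs_ls` are hypothetical. `hdfs_ls` only works where the HDFS client is configured and a Kerberos ticket is available.

```python
import subprocess


def parse_hdfs_ls(text):
    """Parse the text printed by `hdfs dfs -ls` into a list of dicts."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 7)
        # Skip the "Found N items" header and anything that is not an entry.
        if len(parts) != 8 or parts[0][0] not in "-dl":
            continue
        perms, _replication, owner, group, size, date, time, path = parts
        entries.append({
            "path": path,
            "is_dir": perms.startswith("d"),
            "owner": owner,
            "group": group,
            "size": int(size),
            "modified": f"{date} {time}",
        })
    return entries


def hdfs_ls(path="/"):
    """Run `hdfs dfs -ls` and return the parsed entries."""
    result = subprocess.run(["hdfs", "dfs", "-ls", path],
                            capture_output=True, text=True, check=True)
    return parse_hdfs_ls(result.stdout)
```

For example, `[e["path"] for e in hdfs_ls("/") if e["is_dir"]]` would give only the top-level directories.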

#### Create your own directory

You can have your own directory in HDFS. The example shows the directory for someone working under CMS.

In [None]:
!hdfs dfs -mkdir hdfs://analytix/cms/users/[username]

#### More commands

Read the full HDFS commands documentation [here](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html).

1. Check the size of a folder

In [None]:
!hdfs dfs -du -s -h hdfs://[path]

2. Copy a file or folder from HDFS to your EOS storage

In [None]:
!hdfs dfs -cp  hdfs://[path] file:${PWD}/[path]

Note that `PWD` is your current directory in the [EOS](#eos) space. You can check it by running this command on SWAN.

In [None]:
!pwd

or

In [None]:
import os
os.getcwd()
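
The copy command and the working-directory lookup can be combined in Python. The sketch below builds the `file:` destination from `os.getcwd()` and, by default, only returns the command so you can inspect it before running anything. The function names and the `dry_run` flag are hypothetical. It assumes a POSIX path, as on SWAN and LXPLUS, with no spaces in the directory name.

```python
import os
import subprocess


def local_destination(relative_path, base_dir=None):
    """Return a file: URI under base_dir (default: current working directory)."""
    base = os.path.abspath(base_dir or os.getcwd())
    return "file:" + os.path.join(base, relative_path)


def copy_from_hdfs(hdfs_uri, relative_path, dry_run=True):
    """Build the `hdfs dfs -cp` command; run it only when dry_run is False."""
    command = ["hdfs", "dfs", "-cp", hdfs_uri, local_destination(relative_path)]
    if not dry_run:
        # Needs a configured HDFS client and a valid Kerberos ticket.
        subprocess.run(command, check=True)
    return command
```

Calling `copy_from_hdfs("hdfs://[path]", "[path]")` returns the command list without touching the cluster. Pass `dry_run=False` once the source and destination look right.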

### LXPLUS

LXPLUS (Linux Public Login User Service) is the interactive logon service to Linux for all CERN users. The LXPLUS cluster consists of public machines provided by the IT Department for interactive work. [Read more here.](https://lxplusdoc.web.cern.ch/) LXPLUS can be accessed using SSH, short for [Secure Shell.](https://www.ucl.ac.uk/isd/what-ssh-and-how-do-i-use-it) The machines all have access to the [AFS](https://cern.service-now.com/service-portal?id=service_element&name=afs-service) file system for home directories, group space and some degree of scratch space, all of which are accessible through normal file system commands. In addition, the EOS and CASTOR services are available for physics and bulk data. In particular, EOS can be accessed as a filesystem through a FUSE mount point such as /eos/user.
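
Because EOS appears as an ordinary filesystem on LXPLUS, plain Python file operations work on it. The sketch below builds a user's directory under the mount point and lists it. It assumes the common layout `/eos/user/<first letter of username>/<username>`, so check the path against your own account. The function names are hypothetical.

```python
import os
from pathlib import PurePosixPath


def eos_user_home(username, mount_root="/eos/user"):
    """Return <mount_root>/<first letter>/<username> (assumed layout)."""
    if not username:
        raise ValueError("username must not be empty")
    return PurePosixPath(mount_root) / username[0] / username


def list_eos_home(username):
    """List the user's EOS directory, or return [] if it is not mounted here."""
    home = str(eos_user_home(username))
    return sorted(os.listdir(home)) if os.path.isdir(home) else []
```

On a machine without the EOS mount, `list_eos_home` returns an empty list instead of raising, which makes the snippet safe to try anywhere.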