# AIMRC Data Science Core Workshop on Using AHPCC

- Link to [Google Drive Data folder](https://drive.google.com/drive/folders/1_IlfCe9hak_ggr7v2aydl-Zg21vIep3K?usp=sharing). Please download the `workshop-hpc-data` folder.
- Link to [GitHub repository](https://github.com/pv-is-nrt/aimrc-data-science-core). Please download the repository by going to Code (green button) > Download ZIP.
- Link to [download PuTTY Installer](https://the.earth.li/~sgtatham/putty/latest/w64/putty-64bit-0.81-installer.msi) for Windows
- Link to [download Fiji](https://downloads.imagej.net/fiji/latest/fiji-linux64.zip) for Linux

## Exercise 1: Here we go!

In this exercise, you learn how to log in to the Pinnacle Desktop and how to navigate the system. This exercise is done outside of the Jupyter notebook environment.

Make sure you
1. Logged in to Pinnacle in your browser.
2. Opened a Pinnacle Desktop and Jupyter Notebook session within your browser.
3. Accessed Firefox browser within the Pinnacle Desktop session.
4. Downloaded the `workshop-hpc-data` folder from the Google Drive link above.
5. Downloaded the GitHub repository from the link above.
6. Extracted both downloaded zip files into your `Downloads` directory: right-click each downloaded zip file and click on Extract Here.
7. You are free to download/extract your files to any directory you prefer. But if you followed the instructions above, your Home/Downloads directory should look like this:
    
    ```bash
    /home/username/Downloads/
        ├── aimrc-data-science-core-main
        ├── workshop-hpc-data
        ├── aimrc-data-science-core-main.zip
        └── workshop-hpc-data-20240725T******Z-001.zip
    ```
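A quick way to confirm the layout above is a short check from a notebook cell or a Python prompt. The sketch below assumes the default `Downloads` location under your home directory; the helper name `missing_folders` is illustrative and is not part of the workshop repository.

```python
from pathlib import Path

# Folder names taken from the listing above
EXPECTED = ["aimrc-data-science-core-main", "workshop-hpc-data"]

def missing_folders(base, expected=EXPECTED):
    """Return the expected folder names that are not directories under base."""
    base = Path(base)
    return [name for name in expected if not (base / name).is_dir()]

# Assumption: files were extracted to ~/Downloads; change this if you chose another directory
downloads = Path.home() / "Downloads"
missing = missing_folders(downloads)
print("All folders found" if not missing else f"Missing under {downloads}: {missing}")
```

If you extracted the files elsewhere, pass that directory to `missing_folders` instead.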

## Exercise 2: Basic Shell Commands

In this exercise, you will learn about logging into the Pinnacle server using SSH and running basic shell commands. You can use Windows Command Prompt for SSH or you can download and install PuTTY (link above). 

Type the following in Windows Command Prompt or the macOS Terminal  
`ssh username@hpc-portal2.hpc.uark.edu`

Accept the host key warning shown on first connection and enter your UARK password.

Practice running the following commands
```bash
env
who
whoami
hostname
date
df
cd /home/username/Downloads
ls -al
```

This exercise is done outside of the Jupyter notebook environment.
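For comparison, several of the commands above have rough standard-library equivalents in Python, which becomes handy once you work inside Jupyter. This is only a sketch for orientation, not a replacement for the shell exercise; it inspects the current working directory rather than your `Downloads` folder, and the output differs from machine to machine.

```python
import getpass
import os
import shutil
import socket
from datetime import datetime

# whoami (falls back to a placeholder if the user name cannot be determined)
try:
    user = getpass.getuser()
except (KeyError, OSError):
    user = "<unknown>"
print("whoami   :", user)

# hostname and date
print("hostname :", socket.gethostname())
print("date     :", datetime.now().strftime("%c"))

# df: space on the file system that holds the current directory
usage = shutil.disk_usage(os.getcwd())
print("df       : %.1f GB used of %.1f GB" % (usage.used / 1024**3, usage.total / 1024**3))

# env: show a few environment variables (the full list can be long)
for name in ("HOME", "SHELL", "PATH"):
    print(f"{name}={os.environ.get(name, '<not set>')}")

# ls -al: count the entries in the current directory, hidden ones included
entries = sorted(os.listdir(os.getcwd()))
print(len(entries), "entries in", os.getcwd())
```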

## Exercise 3: What have I gotten myself into?

### Get basic information about the hardware

Run the following cells to get basic information about the system you are logged into using Python and from within a Jupyter environment, without needing to access the command line / shell.

In [None]:
# we import psutil that allows us to interact with the system processes
import psutil

# print the number of CPU cores available and the percent usage
print("Number of CPU cores:", psutil.cpu_count())
print("CPU usage:", psutil.cpu_percent(interval=1), "%")

# print the total and used memory on the system
memory = psutil.virtual_memory()
print("Total memory:", memory.total / (1024 * 1024 * 1024), "GB")
print("Used memory:", memory.used / (1024 * 1024 * 1024), "GB")

# psutil cannot report GPU memory: calling psutil.virtual_memory() again would
# only repeat the system RAM figures printed above.
# Use the nvidia-smi cell further down to see GPU memory.

# print the storage information on the system
disk_partitions = psutil.disk_partitions()
for d_p in disk_partitions:
    print("Device: " + d_p.device + "; "
          "Mountpoint: " + d_p.mountpoint + "; "
          "File system type: " + d_p.fstype)
disk_usage = psutil.disk_usage('/')
print("Total disk space:", disk_usage.total / (1024 * 1024 * 1024), "GB")
print("Used disk space:", disk_usage.used / (1024 * 1024 * 1024), "GB")

We can get detailed information about the Nvidia GPU by using the nvidia-smi utility. This command will not work if the node does not have a dedicated Nvidia GPU.  
Note that you can use shell commands in Jupyter Notebooks by prefixing the command with an exclamation mark (!).

In [None]:
!nvidia-smi
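To read the GPU memory figures into Python variables rather than only displaying them, one option is to call `nvidia-smi` with its query flags and parse the CSV output. The sketch below assumes `nvidia-smi` reports memory in MiB when `nounits` is requested; the function names `gpu_memory` and `parse_gpu_lines` are illustrative. It returns an empty list on nodes without a usable Nvidia GPU.

```python
import shutil
import subprocess

def parse_gpu_lines(text):
    """Parse 'name, total, used' CSV lines into (name, total_MiB, used_MiB) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        try:
            name, total, used = [field.strip() for field in line.rsplit(",", 2)]
            gpus.append((name, int(total), int(used)))
        except ValueError:
            continue  # skip lines that do not have the expected three fields
    return gpus

def gpu_memory():
    """Return one tuple per visible Nvidia GPU, or [] if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []  # utility not installed on this node
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total,memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True)
    except (subprocess.CalledProcessError, OSError):
        return []  # utility present but no usable GPU or driver
    return parse_gpu_lines(result.stdout)

for name, total, used in gpu_memory():
    print(f"{name}: {used} MiB used of {total} MiB")
```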

### Get information about the packages installed

In [None]:
# print the list of packages installed in the current environment
# !conda list # OR
!pip list
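`pip list` prints every package in the environment. If you only want to know whether the packages used in this workshop are present, a minimal sketch with the standard-library `importlib.metadata` module could look like this. The helper name `installed_versions` is hypothetical, and the package list is simply the set of packages the exercises import or install.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Packages used in the exercises of this workshop
wanted = ["psutil", "pillow", "matplotlib", "scipy", "tensorflow"]
for name, ver in installed_versions(wanted).items():
    print(f"{name:12s} {ver or 'not installed'}")
```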

## Exercise 4: Running a simple Python program in Jupyter Notebook

In this exercise, we will run a simple Python program in Jupyter Notebook. We write a program to crop images and save the cropped images in a new folder. We will use the images you already downloaded from the Google Drive link above.

In [None]:
# run if needed
!pip install pillow

In [None]:
from pathlib import Path
import PIL.Image as Image

# NOTE: replace 'prateek' with your own username in both paths
DATA_FOLDER = '/home/prateek/Downloads/workshop-hpc-data/NFFA images sampled'
TARGET_FOLDER = '/home/prateek/Downloads/workshop-hpc-data/NFFA images sampled cropped'

# Create target folder if it does not exist
Path(TARGET_FOLDER).mkdir(parents=True, exist_ok=True)

In [None]:
# get the list of all files in the source folder
files = list(Path(DATA_FOLDER).rglob('*.jpg'))
print(len(files), 'total files found')

# get the dimension of the first image
img = Image.open(files[0])
print('Dimensions of the first image:', img.size)

In [None]:
# define the cropping function
def crop_image(file, data_folder, target_folder):
    img = Image.open(file)
    img = img.crop((0, 0, 1024, 600))
    # save the cropped image in the target folder matching the subfolder structure
    save_to_path = str(file).replace(data_folder, target_folder)
    # make sure the subfolder exists
    Path(save_to_path).parent.mkdir(parents=True, exist_ok=True)
    img.save(save_to_path)
    print('saved ' + save_to_path) # comment out this line if you do not want detailed output

In [None]:
# crop all images
for file in files:
    crop_image(file, DATA_FOLDER, TARGET_FOLDER)
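`crop_image` builds the output path by replacing the source folder string with the target folder string. An alternative, sketched here with `pathlib` only, is to take the file's path relative to the source folder and re-root it under the target folder, which avoids surprises if the folder name happens to occur more than once in a path. The helper name `mirror_path` and the example file name are hypothetical.

```python
from pathlib import PurePosixPath

def mirror_path(file, data_folder, target_folder):
    """Re-root file (which lies under data_folder) beneath target_folder."""
    relative = PurePosixPath(file).relative_to(data_folder)
    return PurePosixPath(target_folder) / relative

src = "/home/username/Downloads/workshop-hpc-data/NFFA images sampled"
dst = "/home/username/Downloads/workshop-hpc-data/NFFA images sampled cropped"

# Hypothetical image inside a class subfolder of the source directory
example = src + "/some_class/example.jpg"
print(mirror_path(example, src, dst))
```

The subfolder structure is preserved, so the result still needs its parent directory created before saving, as `crop_image` already does.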

## Exercise 5: Running jobs on the Pinnacle cluster

In this exercise, we learn about submitting jobs to the Pinnacle cluster through the SSH terminal. This exercise is done outside of the Jupyter notebook environment. The batch job runs a Python script based on Exercise 4. You can see the Python file and the corresponding batch script in the `workshops/hpc` directory.

Make sure you:
- delete the `NFFA images sampled cropped` folder if it exists in the `/home/USERNAME/Downloads/workshop-hpc-data/` directory
- change into the `workshops/hpc` directory  
    `cd /home/USERNAME/Downloads/aimrc-data-science-core-main/workshops/hpc`
- activate base environment  
    `module list`  
    `module load python/3.10-anaconda`  
    `which python`  
    `source /share/apps/bin/conda-3.10.sh`  
- submit a Python script as a job    
    `sinfo` observe the output   
    `squeue` observe the output  
    Run one of the following  
    - `sbatch --partition pcon06 --constraint 'aimrc' --nodes=1 myjob.sh` OR  
    - `sbatch --partition agpu06 --nodes=1 myjob.sh` OR  
    - `sbatch --partition gpu72 --nodes=1 myjob.sh`
- If your job is taking too long to start, you may want to cancel it and use another partition/constraint  
    `squeue -u <yourusername>` will show only your jobs  
    `scancel <jobid>`
- Observe the `slurm-xxxxxx.out` file in the directory for script output (best observed through Jupyter)
- Observe the `/home/USERNAME/Downloads/workshop-hpc-data/NFFA images sampled cropped/` directory for cropped images (best observed through Pinnacle Desktop) 
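
The submission steps above can also be scripted. The sketch below only assembles the `sbatch` argument list and parses the job ID from the "Submitted batch job <id>" message that `sbatch` prints; it does not submit anything by itself. The function names are illustrative, and the partition and constraint values are the ones listed above.

```python
import re

def build_sbatch_command(script, partition, nodes=1, constraint=None):
    """Assemble the argument list for one of the sbatch invocations listed above."""
    cmd = ["sbatch", "--partition", partition, f"--nodes={nodes}"]
    if constraint:
        cmd += ["--constraint", constraint]
    return cmd + [script]

def parse_job_id(sbatch_output):
    """Extract the job ID from sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return int(match.group(1)) if match else None

cmd = build_sbatch_command("myjob.sh", "pcon06", constraint="aimrc")
print(" ".join(cmd))

# On a login node you could then run it, for example:
#   import subprocess
#   out = subprocess.run(cmd, capture_output=True, text=True).stdout
#   job_id = parse_job_id(out)   # use with: squeue -j <job_id> or scancel <job_id>
```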

## Exercise 6: Fiji

In this exercise, we learn about logging into the Pinnacle Desktop, browsing the internet, downloading and installing Fiji and using Fiji to open and process image files.

Make sure you:
- Download (link at the top) and extract Fiji
- Launch the `ImageJ` executable file
- Use one of the NFFA microscopy images to apply an image-processing operation of your choice in Fiji and save your result in your Documents folder.

## Exercise 7: A machine learning experiment

In this exercise, we write a simple machine learning program to classify images from the CIFAR10 dataset into ten categories. We will build and train a simple CNN model to classify the images. You will also learn to install any missing packages. These are installed into your home directory, so you won't have to install them every time you log in.

### NOTE: Please pick appropriate kernel 
In your Jupyter Notebook on Pinnacle, go to Kernel > Change Kernel and pick "pytorch-cuda-3.14" for this exercise.

In [None]:
# you'll probably need to install matplotlib and scipy the first time you run this notebook
!pip install matplotlib
!pip install scipy

In [None]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

# try to suppress tensorflow warnings
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

# IGNORE TENSORFLOW WARNING IN RED for now. As long as the code finishes running, it's fine.

In [None]:
# load the CIFAR-10 dataset from their website
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

Visualize some images from the dataset.

In [None]:
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

plt.figure(figsize=(6,7))

for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i])
    plt.xlabel(class_names[train_labels[i][0]])

plt.show()

Build a simple CNN model to classify the images.

In [None]:
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))
model.summary()
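
To make sense of the numbers that `model.summary()` prints, it can help to work out the layer shapes and parameter counts by hand. The sketch below redoes that arithmetic in plain Python, assuming the Keras defaults used above (3x3 kernels, `'valid'` padding, stride 1, 2x2 pooling). The helper names are illustrative and nothing here calls TensorFlow.

```python
def conv2d(shape, filters, kernel=3):
    """Output shape and parameter count of a Conv2D layer ('valid' padding, stride 1)."""
    h, w, c = shape
    params = (kernel * kernel * c + 1) * filters  # weights plus one bias per filter
    return (h - kernel + 1, w - kernel + 1, filters), params

def max_pool(shape, pool=2):
    """Output shape of MaxPooling2D; it has no trainable parameters."""
    h, w, c = shape
    return (h // pool, w // pool, c), 0

def dense(units_in, units_out):
    """Parameter count of a fully connected layer (weights plus biases)."""
    return units_in * units_out + units_out

def count_params(input_shape=(32, 32, 3)):
    """Walk through the layer stack defined above and total the parameters."""
    shape, total = input_shape, 0
    stack = [(conv2d, 32), (max_pool, 2), (conv2d, 64), (max_pool, 2), (conv2d, 64)]
    for layer, arg in stack:
        shape, params = layer(shape, arg)
        total += params
        print(f"{layer.__name__:9s} -> {shape}, {params} params")
    flat = shape[0] * shape[1] * shape[2]
    total += dense(flat, 64) + dense(64, 10)
    return flat, total

flat, total = count_params()
print("Flatten size:", flat)        # 1024
print("Total parameters:", total)   # 122570
```

Most of the parameters sit in the first Dense layer, because its input is the flattened output of the last convolution.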

Compile and train the model.

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=10, 
                    validation_data=(test_images, test_labels))
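
The last layer of the model outputs raw scores (logits) rather than probabilities, which is why the loss is created with `from_logits=True`: the softmax is applied inside the loss. The sketch below shows that calculation for a single example in plain Python. The three logits are made-up numbers for illustration, and this is the textbook formula rather than TensorFlow's actual implementation.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(label, logits):
    """Loss for one example: negative log-probability of the true class index."""
    return -math.log(softmax(logits)[label])

logits = [2.0, 1.0, 0.1]   # made-up scores for a 3-class example
probs = softmax(logits)
print("probabilities:", [round(p, 3) for p in probs])
print("loss if true class is 0:", round(sparse_categorical_crossentropy(0, logits), 3))
print("loss if true class is 2:", round(sparse_categorical_crossentropy(2, logits), 3))
```

The loss is small when the true class has the highest score and grows as the model assigns it less probability. "Sparse" refers to the labels being integer class indices, as in `train_labels`, instead of one-hot vectors.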

See how the model performed.

In [None]:
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0.5, 1])
plt.legend(loc='lower right')

test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)

print("Test accuracy =", test_acc)

### [Optional] Classification experiment on the NFFA SEM dataset

You are welcome to try to run a classification experiment on the 250 microscope images (25 images per class) that you downloaded from the Google Drive link above. You can use the same CIFAR10 classification experiment as a template. You will need to modify the code to load the images and labels from the NFFA SEM dataset. Note that the performance may be poor because we only have 250 images.

Prepare data.

In [None]:
# make sure the path to the data is still correct (replace 'prateek' with your own username)
DATA_FOLDER = '/home/prateek/Downloads/workshop-hpc-data/NFFA images sampled'

# prepare a training dataset from the images

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

train_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
    DATA_FOLDER,
    target_size=(1000, 750),  # Keras expects (height, width); must match the model's input_shape
    batch_size=32,
    shuffle=True,
    # pick an appropriate class mode for 10 classes
    class_mode='categorical')

# print the class indices
print(train_generator.class_indices) 
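
According to the Keras documentation, `flow_from_directory` treats each subdirectory of the data folder as one class and assigns class indices in alphanumeric order of the subdirectory names; with `class_mode='categorical'` the labels are one-hot vectors. The sketch below imitates that mapping with the standard library only, using hypothetical folder names created in a temporary directory. It is meant to illustrate the idea, not to reproduce Keras internals.

```python
import tempfile
from pathlib import Path

def class_indices_from_dirs(root):
    """Map each immediate subdirectory name to an index, in sorted order."""
    names = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    return {name: index for index, name in enumerate(names)}

def one_hot(index, num_classes):
    """One-hot label vector for the given class index."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

with tempfile.TemporaryDirectory() as root:
    for name in ("class_b", "class_a", "class_c"):   # hypothetical class folders
        (Path(root) / name).mkdir()
    indices = class_indices_from_dirs(root)

print(indices)                                      # {'class_a': 0, 'class_b': 1, 'class_c': 2}
print(one_hot(indices["class_b"], len(indices)))    # [0.0, 1.0, 0.0]
```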

In [None]:
# visualize some images from the first batch of training data
image_batch, label_batch = next(train_generator)

# class_indices maps class name -> index, so its keys are already in index order
nffa_class_names = list(train_generator.class_indices.keys())

plt.figure(figsize=(10, 2))
for i in range(5):
    ax = plt.subplot(1, 5, i + 1)
    plt.imshow(image_batch[i])
    plt.xlabel(nffa_class_names[label_batch[i].argmax()])

Create a new model appropriate for this dataset.

In [None]:
# create a simple 3 layer CNN model
model_2 = models.Sequential()
model_2.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(1000, 750, 3)))
model_2.add(layers.MaxPooling2D((2, 2)))
model_2.add(layers.Conv2D(64, (3, 3), activation='relu'))
model_2.add(layers.MaxPooling2D((2, 2)))
model_2.add(layers.Conv2D(64, (3, 3), activation='relu'))
model_2.add(layers.Flatten())
model_2.add(layers.Dense(64, activation='relu'))
model_2.add(layers.Dense(10, activation='softmax'))
model_2.summary()

Compile and train the model.

In [None]:
# compile the model
model_2.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

# train the model
history = model_2.fit(train_generator, epochs=5)

Evaluate the performance.

In [None]:
# generate predictions for the first batch of images
predictions = model_2.predict(image_batch)

# visualize the predicted class for the predictions calculated above
plt.figure(figsize=(10, 2))
for i in range(5):
    ax = plt.subplot(1, 5, i + 1)
    plt.imshow(image_batch[i])
    plt.xlabel(list(train_generator.class_indices.keys())[predictions[i].argmax()])