# Neocortex: Hands-on FC-MNIST Example


## Introduction

### Welcome notes

Welcome to this hands-on example of training a Fully Connected (FC) model for the MNIST dataset! In this exercise, we will explore the fascinating world of deep learning by building a neural network capable of recognizing handwritten digits. MNIST is a widely used dataset in the field of computer vision and serves as an excellent starting point for beginners.

The objective of this exercise is to guide you through the process of constructing and training a simple FC model using Python and the TensorFlow deep learning framework running on top of the Cerebras Software stack on the Neocortex system. We will break down the example step by step, ensuring that you gain a clear understanding of the underlying concepts and techniques.

By the end of this hands-on example, you will have a trained FC model on a Cerebras CS-2 machine that can accurately classify handwritten digits from the MNIST dataset. You will also gain valuable insights into the fundamentals of deep learning, including model architecture, training data preparation, loss functions, and optimization algorithms.

Whether you are new to the Neocortex system or looking to reinforce your knowledge, this exercise will provide you with a solid foundation to explore more advanced implementations using your custom model and dataset. So, let's dive in and embark on this exciting journey of training an FC-MNIST model on the Neocortex system!

### Neocortex

[Neocortex](https://www.cmu.edu/psc/aibd/neocortex/) is a highly innovative resource that targets the acceleration of AI-powered scientific discovery by vastly shortening the time required for deep learning training. It features two [Cerebras CS-2](https://www.cerebras.net/product-system/) systems and an [HPE Superdome Flex HPC server](https://buy.hpe.com/ca/en/compute/mission-critical-x86-servers/superdome-flex-servers/superdome-flex-server/hpe-superdome-flex-280-server/p/1012865453) (SDF) robustly provisioned to drive the CS-2 systems simultaneously at maximum speed and to support the complementary requirements of AI and HPDA workflows.

There are four types of applications currently supported on the system, divided into the following individual tracks:

* **Track 1**, [Cerebras modelzoo ML models](https://portal.neocortex.psc.edu/docs/supported-applications/track1.html): models already present in version R1.6.0 of the Cerebras modelzoo ML models software.
* **Track 2**, [Models similar to the Cerebras modelzoo models](https://portal.neocortex.psc.edu/docs/supported-applications/track2.html): a combination of the building blocks used by modelzoo models and/or the layers supported by Cerebras as listed in their documentation.
* **Track 3**, [General purpose SDK](https://portal.neocortex.psc.edu/docs/supported-applications/track3.html): a general purpose SDK that can be used for a variety of things. This track requires you to write low-level code, similar to writing CUDA, for implementing your research.
* **Track 4**, [WFA, WSE Field-equation API](https://portal.neocortex.psc.edu/docs/supported-applications/track4.html): for field equations, including ML inference. This API was recently used for advancing CFD simulations at unprecedented resolution and speed ([more info](https://www.cmu.edu/psc/aibd/neocortex/2023-02-netl-psc-pioneer-first-ever-computational-fluid-dynamics-simulation-on-cerebras-wse.html)).

This document serves as an example of how to train a Track 1 model, the Cerebras modelzoo FC-MNIST example, from scratch.

This document is under continuous development. If you have any recommendations for this document, please make sure to share them with the team (see the [Feedback](https://portal.neocortex.psc.edu/docs/providing-feedback.html) page).


## Setup and Requirements

To follow along with this hands-on tutorial on training an FC-MNIST model, you will need the following setup and requirements:

1. <u>A PSC account to access the Neocortex system</u>. You should have gotten an email requesting you to create/provide a valid PSC account.
2. <u>An SSH terminal client</u>.
3. <u>A web browser</u>.
4. A development environment with all of the Cerebras software stack libraries. This is already present in the Neocortex system. You don't need to download it separately.
5. The Cerebras modelzoo repository using the 1.6.0 release tag (R_1.6.0). This repository will be downloaded as part of the tutorial. You don't need to download it separately.
6. MNIST Dataset: The tutorial utilizes the MNIST dataset, which consists of a large collection of handwritten digit images. Fortunately, both TensorFlow and PyTorch provide convenient functions to automatically download and load the MNIST dataset. You don't need to download it separately.

With these requirements in place, you are all set to start this tutorial.

## Expected Steps

The training is composed of different stages. We will be performing the following tasks:

1. Define all of the helper variables and commands used across the tutorial steps, e.g. setting the paths to access the Cerebras software stack. This way we can reuse code and focus on the logic behind the steps.
2. Procure the Cerebras modelzoo repository. This repository contains the example code we will be using.
3. Navigate to the FC-MNIST code location inside the Cerebras modelzoo repository.
4. Precompile the code, using Cerebras tools to validate everything looks good code-wise.
5. Compile the code. This will generate the executable to use.
6. Train the model using the generated executable.

Here is a simple flow of the expected steps:

```
Set helper variables -> Get example code -> Change to FC-MNIST dir -> Validate -> Compile -> Train model
```

## Step 1: Set helper variables

In [1]:
# Set the folder path to the Cerebras directory
import os

account_id = "ACCOUNT_ID"  # Project allocation to use. The `projects` command shows your projects, e.g. cis123456p
username = os.environ["USER"]

# Set Cerebras-related environment variables, such as the base directory containing the development environment
cerebras_dir = "/ocean/neocortex/cerebras"
os.environ["CEREBRAS_DIR"] = cerebras_dir
os.environ['CEREBRAS_CONTAINER'] = f"{cerebras_dir}/cbcore_latest.sif"

# Set your individual code environment variables, such as the directory to be used for running the compilation
project_path = f"/ocean/projects/{account_id}/{username}"
os.environ["PROJECT"] = project_path
os.environ["YOUR_ENTRY_SCRIPT_LOCATION"] = f"{project_path}/modelzoo/modelzoo/fc_mnist/tf"
your_entry_script_location = os.environ["YOUR_ENTRY_SCRIPT_LOCATION"]
os.environ['BIND_LOCATIONS'] = f"/local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,{project_path}"

# Set Slurm-related environment variables and command arguments to use for running this example
os.environ['SLURM_GRES_ARGUMENT'] = "--gres=cs:cerebras:1"
os.environ['SLURM_ARGUMENTS'] = f"--ntasks=7 --time=0-00:15 --cpus-per-task=28 --account={account_id}"

# Define a method we will use to get some required arguments for the model training.
def set_cs_ip_addr_value():
    """
    Runs a SLURM command to retrieve the CS IP address and compute node ID to use, and sets them as environment
    variables in the system.
    """
    # Run a job while requesting a CS machine, get the assigned value for the CS_IP_ADDR environment variable.
    cs_ip_addr_output = !salloc ${SLURM_GRES_ARGUMENT} ${SLURM_ARGUMENTS} --ntasks=1 srun /bin/bash -c set -o posix | grep CS_IP_ADDR
    cs_ip_addr_output = [item for item in cs_ip_addr_output if item.startswith("CS_IP_ADDR")]
    os.environ["CS_IP_ADDR"] = cs_ip_addr_output[0].split("=")[1]
    cs_ip_addr = os.environ["CS_IP_ADDR"]
    
    # Execute sacct to figure out which compute node (the SDF partition) is assigned to drive that specific CS machine
    node_id_output = !sacct --allocations --format=NodeList,AllocTRES --state=COMPLETED --parsable2 --starttime=now-1hours --endtime=now | grep "gres/cs=1" | tail --lines 1
    print(node_id_output)
    os.environ["NODE_ID"] = node_id_output[0].split("|")[0]
    node_id = os.environ["NODE_ID"]
    
    print(f"The CS_IP_ADDR ({cs_ip_addr}) and NODE_ID ({node_id}) environment variables have been set.")
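The string handling inside `set_cs_ip_addr_value()` can be tried out without requesting a Slurm allocation. The following is a minimal sketch, not part of the Cerebras tooling: the helper names `parse_cs_ip_addr` and `parse_node_id` are hypothetical, and the `CS_IP_ADDR=<value>` line format is assumed from the way the function above splits its input. The sample values are the ones printed in the Step 6 output of this tutorial.

```python
def parse_cs_ip_addr(lines):
    """Return the value of the first `CS_IP_ADDR=<value>` line (hypothetical helper)."""
    for line in lines:
        if line.startswith("CS_IP_ADDR="):
            # Split only on the first "=" so the value is kept whole.
            return line.split("=", 1)[1].strip()
    raise ValueError("CS_IP_ADDR not found in the job output")


def parse_node_id(sacct_line):
    """Return the NodeList field of a `sacct --parsable2` line (hypothetical helper)."""
    # --parsable2 separates fields with "|"; NodeList is the first field requested.
    return sacct_line.split("|")[0].strip()


# Sample values taken from the Step 6 output; the first list mimics the job output.
job_output = ["salloc: Granted job allocation 283014", "CS_IP_ADDR=10.8.88.10"]
sacct_line = "sdf-1|billing=98,cpu=98,gres/cs=1,node=1"

print(parse_cs_ip_addr(job_output))  # 10.8.88.10
print(parse_node_id(sacct_line))     # sdf-1
```

Raising an error when the variable is missing makes a failed allocation visible immediately, instead of surfacing later as an `IndexError`.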

## Step 2: Get the example code
### Procure the Cerebras modelzoo examples repository

The [Cerebras Model Zoo GitHub repository](https://github.com/Cerebras/modelzoo/tree/R_1.6.0) is public and contains examples of common deep learning models that can be trained on Cerebras hardware.

Please clone the repository and then check out the R_1.6.0 tag (the current version running on the Neocortex system) using the following commands:

In [2]:
repository_exists = os.path.isdir(f"{project_path}/modelzoo")
                                   
if repository_exists:
    !rm -rf ${PROJECT}/modelzoo

!git clone https://github.com/Cerebras/modelzoo.git ${PROJECT}/modelzoo
%cd "$project_path/modelzoo"
!git checkout tags/R_1.6.0

!echo -e "\nOK: The reference modelzoo folder has been cloned into the ${PROJECT} directory."

!ls -lash && echo -e "\nOK: Successfully listed directories into the modelzoo folder" ${PWD}

Cloning into '/ocean/projects/ACCOUNT_ID/USERNAME/modelzoo'...
remote: Enumerating objects: 2158, done.
remote: Counting objects: 100% (308/308), done.
remote: Compressing objects: 100% (177/177), done.
remote: Total 2158 (delta 181), reused 136 (delta 131), pack-reused 1850
Receiving objects: 100% (2158/2158), 22.10 MiB | 31.88 MiB/s, done.
Resolving deltas: 100% (1211/1211), done.
/ocean/projects/ACCOUNT_ID/USERNAME/modelzoo
Note: checking out 'tags/R_1.6.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at 886a438... R_1.6.0

OK: The reference modelzoo folder has been cloned into the /ocean/projects/ACCOUNT_ID

---
Example of the expected output (final lines):

    Cloning into '/ocean/projects/ACCOUNT_ID/USERNAME/modelzoo'...
    [--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
    HEAD is now at 886a438... R_1.6.0

    OK: The reference modelzoo folder has been cloned into the /ocean/projects/ACCOUNT_ID/USERNAME directory.
---

You should have the freshly checked-out folder now and it should have a modelzoo subdirectory inside as well as some other files required for running the examples. See it pointed out in the following output:

    total 92K
    4.0K drwxr-xr-x 8 USERNAME ACCOUNT_ID 4.0K Jun  13 12:29 .git
    4.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID   94 Jun  13 12:29 .gitignore
     12K -rw-r--r-- 1 USERNAME ACCOUNT_ID  12K Jun  13 12:29 LICENSE
    4.0K drwxr-xr-x 6 USERNAME ACCOUNT_ID 4.0K Jun  13 12:29 modelzoo  # <- This is the one we will be using
    8.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 4.5K Jun  13 12:29 PYTHON-SETUP.md
     12K -rw-r--r-- 1 USERNAME ACCOUNT_ID 8.2K Jun  13 12:29 README.md
     28K -rw-r--r-- 1 USERNAME ACCOUNT_ID  27K Jun  13 12:29 RELEASE-NOTES.md
    4.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 1.2K Jun  13 12:29 requirements_pytorch_gpu.txt
    4.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 1.3K Jun  13 12:29 requirements_tensorflow_gpu.txt
    4.0K drwxr-xr-x 2 USERNAME ACCOUNT_ID 4.0K Jun  13 12:29 user_scripts
    
You can check if the contents of that directory look the same by comparing this listing with the output of the `ls -lash` command executed at the end of the cell above.
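One way to make that comparison programmatically is sketched below. This is an illustration only: the helper `missing_entries` is hypothetical, and the expected names are simply copied from the listing above. The demonstration runs on a throwaway directory so it is self-contained; in the notebook you would point it at `f"{project_path}/modelzoo"` instead.

```python
import os
import tempfile

# Top-level entries copied from the expected listing above.
EXPECTED_ENTRIES = {
    ".git", ".gitignore", "LICENSE", "modelzoo", "PYTHON-SETUP.md",
    "README.md", "RELEASE-NOTES.md", "requirements_pytorch_gpu.txt",
    "requirements_tensorflow_gpu.txt", "user_scripts",
}


def missing_entries(repo_dir, expected=EXPECTED_ENTRIES):
    """Return the expected entries absent from repo_dir, sorted (hypothetical helper)."""
    present = set(os.listdir(repo_dir))
    return sorted(set(expected) - present)


# Self-contained demonstration on a temporary directory with two of three entries.
with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, "modelzoo"))
    open(os.path.join(demo_dir, "README.md"), "w").close()
    demo_missing = missing_entries(demo_dir, expected={"modelzoo", "README.md", "LICENSE"})

print(demo_missing)  # ['LICENSE']
```

An empty list means every expected file and folder is present in the checked-out repository.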

## Step 3: Change into the FC-MNIST Example folder

We will now change directories to the `modelzoo/fc_mnist/tf` folder to run the FC-MNIST example, since that location contains the entry script and all of the code for the example.

In [3]:
%cd "$your_entry_script_location"
!ls -lash && echo -e "\nOK: Successfully listed directories in the FC-MNIST example folder"

/ocean/projects/ACCOUNT_ID/USERNAME/modelzoo/modelzoo/fc_mnist/tf
total 16K
4.0K drwxr-xr-x 3 USERNAME ACCOUNT_ID 4.0K Aug  4 08:15 .
4.0K drwxr-xr-x 5 USERNAME ACCOUNT_ID 4.0K Aug  4 08:15 ..
4.0K drwxr-xr-x 2 USERNAME ACCOUNT_ID 4.0K Aug  4 08:15 configs
 512 -rw-r--r-- 1 USERNAME ACCOUNT_ID 2.6K Aug  4 08:15 data.py
   0 -rw-r--r-- 1 USERNAME ACCOUNT_ID    0 Aug  4 08:15 __init__.py
 512 -rw-r--r-- 1 USERNAME ACCOUNT_ID 4.3K Aug  4 08:15 model.py
 512 -rw-r--r-- 1 USERNAME ACCOUNT_ID 1.2K Aug  4 08:15 prepare_data.py
 512 -rw-r--r-- 1 USERNAME ACCOUNT_ID 6.4K Aug  4 08:15 README.md
 512 -rw-r--r-- 1 USERNAME ACCOUNT_ID 4.0K Aug  4 08:15 run-appliance.py
 512 -rw-r--r-- 1 USERNAME ACCOUNT_ID 8.5K Aug  4 08:15 run.py
 512 -rw-r--r-- 1 USERNAME ACCOUNT_ID 1.5K Aug  4 08:15 utils.py

OK: Successfully listed directories in the FC-MNIST example folder


## Structure of the code

The following is the base structure that Cerebras uses for their code (the code template). If you want to run your research on their system, the suggested way to add your model and dataset is to take one of these base examples and start changing specific components.

The files for this specific TensorFlow FC-MNIST directory we just switched to should look like this:

    /ocean/projects/ACCOUNT_ID/USERNAME/modelzoo/modelzoo/fc_mnist/tf
    total 56K
    4.0K drwxr-xr-x 2 USERNAME ACCOUNT_ID 4.0K Jun  13 12:29 configs
    4.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 2.6K Jun  13 12:29 data.py
       0 -rw-r--r-- 1 USERNAME ACCOUNT_ID    0 Jun  13 12:29 __init__.py
    8.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 4.3K Jun  13 12:29 model.py
    4.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 1.2K Jun  13 12:29 prepare_data.py
    8.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 6.4K Jun  13 12:29 README.md
    4.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 4.0K Jun  13 12:29 run-appliance.py
     12K -rw-r--r-- 1 USERNAME ACCOUNT_ID 8.5K Jun  13 12:29 run.py
    4.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 1.5K Jun  13 12:29 utils.py

Let's go over the main files in this location:

* **configs/params.yaml:** YAML file containing the model configuration and the training hyperparameter settings.
* **data.py:** where the input data pipeline is called. Additional data processor modules may be defined elsewhere (e.g., in the input folder) for the data pipeline implementation.
* **model.py:** contains the model function definition. For information about the supported layers, please visit the Cerebras Documentation.
* **run.py:** contains the training/compilation/evaluation script.
* **utils.py:** contains the helper scripts.

If you would like to modify an example to integrate your code, model, or dataset, the suggested order of files to modify is: **model.py** > **data.py** > **utils.py** > **configs/params.yaml** > **run.py**

Additionally, the following diagram shows the suggested order in which the modifications should be performed when porting code. It should be read from left to right, with the arrows indicating the order of the modifications.

![diagram](https://portal.neocortex.psc.edu/static/images/code_migration/code_migration_workflow.svg)


Since the Cerebras modelzoo examples are ready to be executed as they are, we will not modify this example and can start running it right away.

The execution will happen over three different steps: validate, compile, and train.

1. **validate**: this step runs a fast verification (on CPU), running a light-weight compilation up to performing kernel library matching, helping you determine if you are using any TensorFlow layer or functionality that is unsupported by the Cerebras development stack.
<br/>The argument used for running this step is `--mode train --validate_only`.

2. **compile**: this step runs the full compilation (on CPU) through all stages of the Cerebras software stack to generate a CS system executable. When this compilation is successful, the model is guaranteed to run on the CS system. <br/>The argument used for running this step is `--mode train --compile_only`.

3. **train**: this step runs the actual training job on the CS system using the compiled executable.
<br/>The argument used for running this step is simply `--mode train` (without additional "`_only`" arguments).

For more information, please visit the [Cerebras TensorFlow Quickstart Documentation v1.6.0](https://docs.cerebras.net/en/1.6.0/getting-started/cs-tf-quickstart.html) page.
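The relationship between the three stages and the `run.py` arguments can be summarized in a few lines of Python. This is a minimal sketch under stated assumptions: `STAGE_FLAGS` and `build_run_args` are hypothetical names introduced for illustration, and only the flags mentioned in this tutorial are used.

```python
# Extra run.py flags for each stage, as described in the list above.
STAGE_FLAGS = {
    "validate": ["--validate_only"],
    "compile": ["--compile_only"],
    "train": [],
}


def build_run_args(stage, model_dir):
    """Return the run.py argument list for a stage (hypothetical helper)."""
    if stage not in STAGE_FLAGS:
        raise ValueError(f"unknown stage: {stage!r}")
    return ["python", "run.py", "--mode", "train", *STAGE_FLAGS[stage], "--model_dir", model_dir]


for stage in ("validate", "compile", "train"):
    print(" ".join(build_run_args(stage, "tutorial_model_dir")))
```

All three stages share `--mode train`; only the optional "`_only`" flag differs. The training command in Step 6 additionally passes `--cs_ip`, which this sketch leaves out.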

## Running the example

Because the development environment relies on custom Cerebras libraries that would otherwise need to be configured beforehand, Cerebras provides a pre-made container with everything preconfigured and ready to execute the Cerebras modelzoo examples.

The actual example code is then executed from that [Singularity/Apptainer](https://apptainer.org/) container, and multi-user access to the CS systems is handled by Slurm.

The actual validation, compilation and training commands can be found below.

### Step 4: Validate the code

Run the code validation using the following command:

In [4]:
model_dir_exists = os.path.isdir("tutorial_model_dir")
if model_dir_exists:
    !rm -rf tutorial_model_dir

!salloc ${SLURM_ARGUMENTS} --ntasks=1 singularity exec --bind ${BIND_LOCATIONS} --pwd ${YOUR_ENTRY_SCRIPT_LOCATION} ${CEREBRAS_CONTAINER} python run.py --mode train --validate_only --model_dir tutorial_model_dir

salloc: Granted job allocation 283011
salloc: Waiting for resource configuration
salloc: Nodes sdf-1 are ready for job
INFO:tensorflow:TF_CONFIG environment variable: {}
INFO:root:Running None on CS-2
INFO:absl:Generating dataset mnist (./tfds/mnist/3.0.1)
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to ./tfds/mnist/3.0.1...
INFO:absl:Downloading https://storage.googleapis.com/cvdf-datasets/mnist/t10k-images-idx3-ubyte.gz into tfds/downloads/cvdf-datasets_mnist_t10k-images-idx3-ubytedDnaEPiC58ZczHNOp6ks9L4_JLids_rpvUj38kJNGMc.gz.tmp.9b190f80eccb49fe933cba7532cb28f2...
[--- DOWNLOAD, "Generating train examples" AND "Generating test examples" PROGRESS OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]

---
Example of the expected output (final lines):

    [--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
    =============== Starting Cerebras Compilation ===============                   
    Stack:   0%|                      | 0/10 [00:00s, ?stages/s ]
    =============== Cerebras Compilation Completed ==============
---

The command above executed a Slurm job using `salloc` and `singularity`, both with multiple arguments.

For the Slurm arguments, the following were used:
* **--ntasks=7**: Specify the number of tasks to run. The validation and compilation commands also pass `--ntasks=1` after `${SLURM_ARGUMENTS}`.
* **--time=0-00:15**: Set a limit on the total run time of the job allocation. The format used is "days-hours:minutes", so this requests 15 minutes.
* **--cpus-per-task=28**: Request the number of CPUs to allocate per task. Here, 28 CPU cores are requested per task.
* **--account=ACCOUNT_ID**: this is the AI tutorial allocation account to which resource utilization is being charged.
* **--pty**: Execute task zero in pseudo terminal mode, meaning it will provide an interactive session. This is an `srun` option and is not part of the `SLURM_ARGUMENTS` defined in Step 1.
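To make the `--time` value concrete, the sketch below converts a limit written in the days-hours form into a Python `timedelta`. It is an illustration only: `parse_slurm_time` is a hypothetical helper, and it handles just the `days-hours[:minutes[:seconds]]` form used in this tutorial, not every format Slurm accepts.

```python
from datetime import timedelta


def parse_slurm_time(value):
    """Parse a "days-hours[:minutes[:seconds]]" time limit (hypothetical helper)."""
    days, _, clock = value.partition("-")
    if not clock:
        raise ValueError("expected a value such as '0-00:15'")
    parts = [int(part) for part in clock.split(":")]
    if len(parts) > 3:
        raise ValueError("too many ':'-separated fields")
    # Pad missing minutes and seconds with zeros.
    hours, minutes, seconds = parts + [0] * (3 - len(parts))
    return timedelta(days=int(days), hours=hours, minutes=minutes, seconds=seconds)


limit = parse_slurm_time("0-00:15")
print(limit)  # 0:15:00
```

The value `0-00:15` therefore corresponds to zero days, zero hours, and fifteen minutes.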

For the Singularity/Apptainer arguments, the following were used:
* **exec**: the "exec" mode is an alternative to running the "run" or "shell" mode, and it allows running a specific command.
* **--bind \${BIND_LOCATIONS}**: specifies the bind paths, i.e. the folders to make available to the applications inside the container in the same (or a specific) location. Here we bind the project folder in which the FC-MNIST example is located, along with the `/local1` to `/local4` `cerebras/data` folders. Other folders with input data should also be mounted as required.
* **--pwd \${YOUR_ENTRY_SCRIPT_LOCATION}**: sets the working directory to use when running the commands with singularity exec. The actual value is `/ocean/projects/ACCOUNT_ID/USERNAME/modelzoo/modelzoo/fc_mnist/tf`
* **${CEREBRAS_CONTAINER}**: this is the full path to the latest Cerebras cbcore container. The actual value is `/ocean/neocortex/cerebras/cbcore_latest.sif`.

And as for the Python run.py file, the following arguments were used:
* **--mode train**: this uses the training mode of the Cerebras software stack. The other available modes are evaluation and prediction. More information about this can be found in the [Cerebras Documentation](https://docs.cerebras.net/en/1.6.0/tensorflow-docs/running-a-model/train-eval-predict.html).
* **--validate_only**: as mentioned above, this argument runs a light-weight compilation that helps determine whether any unsupported TensorFlow layer or functionality is being used in the code.
* **--model_dir tutorial_model_dir**: this argument specifies the target directory to use for the compilation process. Using different folders for this argument allows starting from scratch when something goes wrong.
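To see how the three groups of arguments fit together, the following sketch assembles the Step 4 command line from its parts using the standard-library `shlex` module. The helper `build_validate_command` is hypothetical and only builds the string; it does not submit anything to Slurm. The values are the placeholders used throughout this tutorial.

```python
import shlex


def build_validate_command(slurm_args, bind_locations, entry_dir, container, model_dir):
    """Assemble the Step 4 command line as one shell-quoted string (hypothetical helper)."""
    argv = [
        "salloc", *shlex.split(slurm_args), "--ntasks=1",
        "singularity", "exec",
        "--bind", bind_locations,
        "--pwd", entry_dir,
        container,
        "python", "run.py", "--mode", "train", "--validate_only",
        "--model_dir", model_dir,
    ]
    return shlex.join(argv)


project = "/ocean/projects/ACCOUNT_ID/USERNAME"  # placeholder path from this tutorial
command = build_validate_command(
    slurm_args="--ntasks=7 --time=0-00:15 --cpus-per-task=28 --account=ACCOUNT_ID",
    bind_locations=(
        "/local1/cerebras/data,/local2/cerebras/data,"
        f"/local3/cerebras/data,/local4/cerebras/data,{project}"
    ),
    entry_dir=f"{project}/modelzoo/modelzoo/fc_mnist/tf",
    container="/ocean/neocortex/cerebras/cbcore_latest.sif",
    model_dir="tutorial_model_dir",
)
print(command)
```

Keeping the pieces as a list until the last moment makes it easy to swap `--validate_only` for `--compile_only` when moving on to the next step.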


### Step 5: Compile the code

Run the full compilation using the following command:

In [5]:
!salloc ${SLURM_ARGUMENTS} --ntasks=1 singularity exec --bind ${BIND_LOCATIONS} --pwd ${YOUR_ENTRY_SCRIPT_LOCATION} ${CEREBRAS_CONTAINER} python run.py --mode train --compile_only --model_dir tutorial_model_dir

salloc: Granted job allocation 283012
salloc: Waiting for resource configuration
salloc: Nodes sdf-1 are ready for job
INFO:tensorflow:TF_CONFIG environment variable: {}
INFO:root:Running None on CS-2
INFO:absl:Load dataset info from ./tfds/mnist/3.0.1
INFO:absl:Reusing dataset mnist (./tfds/mnist/3.0.1)
INFO:absl:Constructing tf.data.Dataset mnist for split None, from ./tfds/mnist/3.0.1
INFO:root:---------- Suggestions to improve input_fn performance ----------
INFO:root:[input_fn] - batch(): batch_size set to 256
INFO:root:----------------- End of input_fn suggestions -----------------
INFO:absl:Load dataset info from ./tfds/mnist/3.0.1
INFO:absl:Reusing dataset mnist (./tfds/mnist/3.0.1)
INFO:absl:Constructing tf.data.Dataset mnist for split None, from ./tfds/mnist/3.0.1
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:cerebras.stack.tools.caching_stack:Using lair flow into stack
salloc: Relinquishing job allocation 283012


---
Example of the expected output (final lines):

    [--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
    =============== Starting Cerebras Compilation ===============
    Estimating performance:  94%|█████████| 33/35 [00:50s,  3.99s/stages ]
    =============== Cerebras Compilation Completed ===============
---
   
Similar to the previous command, another job was executed using `salloc` and `singularity`, but now in compilation mode.

For the Python run.py file, the following argument changed:
* **--compile_only**: as mentioned above, this step runs the full compilation (on CPU) through all stages of the Cerebras software stack to generate a CS system executable.

### Step 6: Run the training on the CS machine

In [6]:
set_cs_ip_addr_value()
!salloc ${SLURM_GRES_ARGUMENT} ${SLURM_ARGUMENTS} --nodelist=${NODE_ID} srun /usr/bin/singularity exec --bind ${BIND_LOCATIONS} --pwd ${YOUR_ENTRY_SCRIPT_LOCATION} ${CEREBRAS_CONTAINER} python run.py --mode train --model_dir tutorial_model_dir --cs_ip ${CS_IP_ADDR}

['sdf-1|billing=98,cpu=98,gres/cs=1,node=1']
The CS_IP_ADDR (10.8.88.10) and NODE_ID (sdf-1) environment variables have been set.
salloc: Granted job allocation 283014
salloc: Waiting for resource configuration
salloc: Nodes sdf-1 are ready for job
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'chief': ['sdf-1:21893'], 'worker': ['sdf-1:21895', 'sdf-1:21897', 'sdf-1:21899', 'sdf-1:21901', 'sdf-1:21903', 'sdf-1:21905']}, 'task': {'type': 'chief', 'index': 0}}
INFO:root:Running train on CS-2
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
2023-08-04 08:24:56.489613: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
2023-08-04 08:24:56.495056: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102

INFO:tensorflow:global step 1700: loss = 0.037811279296875 (2812.92 steps/sec)
INFO:tensorflow:global step 1800: loss = 0.02142333984375 (2845.64 steps/sec)
INFO:tensorflow:global step 1900: loss = 0.0208740234375 (2877.17 steps/sec)
INFO:tensorflow:global step 2000: loss = 0.032745361328125 (2904.57 steps/sec)
INFO:tensorflow:global step 2100: loss = 0.0225830078125 (2930.5 steps/sec)
INFO:tensorflow:global step 2200: loss = 0.031097412109375 (2953.7 steps/sec)
INFO:tensorflow:global step 2300: loss = 0.02081298828125 (2977.08 steps/sec)
INFO:tensorflow:global step 2400: loss = 0.00661468505859375 (2996.56 steps/sec)
INFO:tensorflow:global step 2500: loss = 0.01525115966796875 (3015.87 steps/sec)
INFO:tensorflow:global step 2600: loss = 0.00873565673828125 (3032.47 steps/sec)
INFO:tensorflow:global step 2700: loss = 0.0101470947265625 (3050.52 steps/sec)
INFO:tensorflow:global step 2800: loss = 0.032745361328125 (3065.99 steps/sec)
INFO:tensorflow:global step 2900: loss = 0.0128402709

INFO:tensorflow:global step 11600: loss = 0.006496429443359375 (2337.17 steps/sec)
INFO:tensorflow:global step 11700: loss = 0.0002703666687011719 (2344.16 steps/sec)
INFO:tensorflow:global step 11800: loss = 0.0179290771484375 (2350.86 steps/sec)
INFO:tensorflow:global step 11900: loss = 0.003566741943359375 (2357.75 steps/sec)
INFO:tensorflow:global step 12000: loss = 0.010162353515625 (2364.39 steps/sec)
INFO:tensorflow:global step 12100: loss = 0.01299285888671875 (2370.74 steps/sec)
INFO:tensorflow:global step 12200: loss = 0.0006418228149414062 (2377.15 steps/sec)
INFO:tensorflow:global step 12300: loss = 0.0195159912109375 (2383.43 steps/sec)
INFO:tensorflow:global step 12400: loss = 0.0037212371826171875 (2389.58 steps/sec)
INFO:tensorflow:global step 12500: loss = 0.00751495361328125 (2395.76 steps/sec)
INFO:tensorflow:global step 12600: loss = 0.0036296844482421875 (2401.9 steps/sec)
INFO:tensorflow:global step 12700: loss = 0.0011892318725585938 (2407.96 steps/sec)
[--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]

INFO:tensorflow:global step 21000: loss = 0.001911163330078125 (1877.73 steps/sec)
INFO:tensorflow:global step 21100: loss = 0.004123687744140625 (1881.86 steps/sec)
INFO:tensorflow:global step 21200: loss = 0.00033092498779296875 (1886.09 steps/sec)
INFO:tensorflow:global step 21300: loss = 0.0007901191711425781 (1890.21 steps/sec)
INFO:tensorflow:global step 21400: loss = 0.0011157989501953125 (1894.36 steps/sec)
INFO:tensorflow:global step 21500: loss = 0.000583648681640625 (1898.45 steps/sec)
INFO:tensorflow:global step 21600: loss = 4.6133995056152344e-05 (1902.53 steps/sec)
INFO:tensorflow:global step 21700: loss = 0.0002880096435546875 (1906.57 steps/sec)
INFO:tensorflow:global step 21800: loss = 3.993511199951172e-05 (1910.67 steps/sec)
INFO:tensorflow:global step 21900: loss = 0.00555419921875 (1914.7 steps/sec)
INFO:tensorflow:global step 22000: loss = 0.01042938232421875 (1918.77 steps/sec)
INFO:tensorflow:global step 22100: loss = 0.0073699951171875 (1922.68 steps/sec)
[--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]

INFO:tensorflow:global step 30600: loss = 0.0004100799560546875 (2188.63 steps/sec)
INFO:tensorflow:global step 30700: loss = 0.00788116455078125 (2191.34 steps/sec)
INFO:tensorflow:global step 30800: loss = 2.384185791015625e-06 (2194.11 steps/sec)
INFO:tensorflow:global step 30900: loss = 0.0003185272216796875 (2196.81 steps/sec)
INFO:tensorflow:global step 31000: loss = 7.3909759521484375e-06 (2199.58 steps/sec)
INFO:tensorflow:global step 31100: loss = 0.026153564453125 (2202.15 steps/sec)
INFO:tensorflow:global step 31200: loss = 0.0054931640625 (2204.86 steps/sec)
INFO:tensorflow:global step 31300: loss = 0.00928497314453125 (2207.52 steps/sec)
INFO:tensorflow:global step 31400: loss = 0.0001399517059326172 (2210.24 steps/sec)
INFO:tensorflow:global step 31500: loss = 0.0077056884765625 (2212.89 steps/sec)
INFO:tensorflow:global step 31600: loss = 5.960464477539062e-07 (2215.55 steps/sec)
INFO:tensorflow:global step 31700: loss = 0.003238677978515625 (2218.19 steps/sec)

INFO:tensorflow:global step 40200: loss = 2.562999725341797e-06 (2310.73 steps/sec)
INFO:tensorflow:global step 40300: loss = 4.214048385620117e-05 (2312.7 steps/sec)
INFO:tensorflow:global step 40400: loss = 0.0001347064971923828 (2314.74 steps/sec)
INFO:tensorflow:global step 40500: loss = 0.00017952919006347656 (2316.7 steps/sec)
INFO:tensorflow:global step 40600: loss = 6.592273712158203e-05 (2318.66 steps/sec)
INFO:tensorflow:global step 40700: loss = 0.0015697479248046875 (2320.64 steps/sec)
INFO:tensorflow:global step 40800: loss = 1.9252300262451172e-05 (2322.63 steps/sec)
INFO:tensorflow:global step 40900: loss = 0.03125 (2324.57 steps/sec)
INFO:tensorflow:global step 41000: loss = 0.0307159423828125 (2326.58 steps/sec)
INFO:tensorflow:global step 41100: loss = 0.004940032958984375 (2328.48 steps/sec)
INFO:tensorflow:global step 41200: loss = 1.7285346984863281e-06 (2330.4 steps/sec)
INFO:tensorflow:global step 41300: loss = 8.83936882019043e-05 (2332.29 steps/sec)

INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 50000...
INFO:tensorflow:global step 50000: loss = 0.01044464111328125 (2317.7 steps/sec)
INFO:tensorflow:global step 50100: loss = 0.0011854171752929688 (2319.26 steps/sec)
INFO:tensorflow:global step 50200: loss = 0.0003447532653808594 (2320.84 steps/sec)
INFO:tensorflow:global step 50300: loss = 0.0009608268737792969 (2322.46 steps/sec)
INFO:tensorflow:global step 50400: loss = 4.506111145019531e-05 (2324.04 steps/sec)
INFO:tensorflow:global step 50500: loss = 0.0009326934814453125 (2325.65 steps/sec)
INFO:tensorflow:global step 50600: loss = 0.00026226043701171875 (2327.19 steps/sec)
INFO:tensorflow:global step 50700: loss = 0.006351470947265625 (2328.79 steps/sec)
INFO:tensorflow:global step 50800: loss = 0.0030918121337890625 (2330.33 steps/sec)
INFO:tensorflow:global step 50900: loss = 0.0010251998901367188 (2331.92 steps/sec)
INFO:tensorflow:global step 51000: loss = 2.09808349609375e-05 (2333.46 steps/sec)

INFO:tensorflow:global step 59800: loss = 2.980232238769531e-07 (2457.46 steps/sec)
INFO:tensorflow:global step 59900: loss = 0.00860595703125 (2458.78 steps/sec)
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 60000...
INFO:tensorflow:Saving checkpoints for 60000 into tutorial_model_dir/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 60000...
INFO:tensorflow:global step 60000: loss = 5.960464477539063e-08 (2316.0 steps/sec)
INFO:tensorflow:global step 60100: loss = 2.205371856689453e-06 (2317.24 steps/sec)
INFO:tensorflow:global step 60200: loss = 4.172325134277344e-07 (2318.53 steps/sec)
INFO:tensorflow:global step 60300: loss = 3.5762786865234375e-07 (2319.89 steps/sec)
INFO:tensorflow:global step 60400: loss = 7.152557373046875e-07 (2321.09 steps/sec)
INFO:tensorflow:global step 60500: loss = 0.00225067138671875 (2322.42 steps/sec)
INFO:tensorflow:global step 60600: loss = 0.01532745361328125 (2323.65 steps/sec)

INFO:tensorflow:global step 69500: loss = 2.0742416381835938e-05 (2430.67 steps/sec)
INFO:tensorflow:global step 69600: loss = 2.384185791015625e-07 (2431.75 steps/sec)
INFO:tensorflow:global step 69700: loss = 7.843971252441406e-05 (2432.89 steps/sec)
INFO:tensorflow:global step 69800: loss = 2.562999725341797e-06 (2433.99 steps/sec)
INFO:tensorflow:global step 69900: loss = 1.2516975402832031e-06 (2435.12 steps/sec)
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 70000...
INFO:tensorflow:Saving checkpoints for 70000 into tutorial_model_dir/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 70000...
INFO:tensorflow:global step 70000: loss = 0.0005993843078613281 (2317.81 steps/sec)
INFO:tensorflow:global step 70100: loss = 2.9206275939941406e-06 (2318.87 steps/sec)
INFO:tensorflow:global step 70200: loss = 4.750490188598633e-05 (2319.97 steps/sec)
INFO:tensorflow:global step 70300: loss = 0.00018978118896484375 (2321.14 steps/sec)

INFO:tensorflow:global step 79400: loss = 1.6093254089355469e-06 (2416.62 steps/sec)
INFO:tensorflow:global step 79500: loss = 0.0006542205810546875 (2417.59 steps/sec)
INFO:tensorflow:global step 79600: loss = 2.980232238769531e-07 (2418.55 steps/sec)
INFO:tensorflow:global step 79700: loss = 0.031982421875 (2419.53 steps/sec)
INFO:tensorflow:global step 79800: loss = 1.7881393432617188e-07 (2420.53 steps/sec)
INFO:tensorflow:global step 79900: loss = 9.953975677490234e-06 (2322.13 steps/sec)
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 80000...
INFO:tensorflow:Saving checkpoints for 80000 into tutorial_model_dir/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 80000...
INFO:tensorflow:global step 80000: loss = 0.0166168212890625 (2317.83 steps/sec)
INFO:tensorflow:global step 80100: loss = 0.0 (2318.68 steps/sec)
INFO:tensorflow:global step 80200: loss = 7.152557373046875e-07 (2319.69 steps/sec)

INFO:tensorflow:global step 89200: loss = 5.364418029785156e-07 (2402.83 steps/sec)
INFO:tensorflow:global step 89300: loss = 1.0013580322265625e-05 (2403.69 steps/sec)
INFO:tensorflow:global step 89400: loss = 6.496906280517578e-06 (2404.56 steps/sec)
INFO:tensorflow:global step 89500: loss = 0.0323486328125 (2405.42 steps/sec)
INFO:tensorflow:global step 89600: loss = 9.5367431640625e-07 (2406.27 steps/sec)
INFO:tensorflow:global step 89700: loss = 0.0002884864807128906 (2407.12 steps/sec)
INFO:tensorflow:global step 89800: loss = 0.00012540817260742188 (2408.01 steps/sec)
INFO:tensorflow:global step 89900: loss = 0.00981903076171875 (2321.38 steps/sec)
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 90000...
INFO:tensorflow:Saving checkpoints for 90000 into tutorial_model_dir/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 90000...
INFO:tensorflow:global step 90000: loss = 0.02374267578125 (2316.67 steps/sec)

INFO:tensorflow:global step 99100: loss = 6.496906280517578e-06 (2391.64 steps/sec)
INFO:tensorflow:global step 99200: loss = 1.9431114196777344e-05 (2392.43 steps/sec)
INFO:tensorflow:global step 99300: loss = 2.5391578674316406e-05 (2393.2 steps/sec)
INFO:tensorflow:global step 99400: loss = 2.6941299438476562e-05 (2394.0 steps/sec)
INFO:tensorflow:global step 99500: loss = 1.7881393432617188e-06 (2394.77 steps/sec)
INFO:tensorflow:global step 99600: loss = 2.4437904357910156e-06 (2395.55 steps/sec)
INFO:tensorflow:global step 99700: loss = 0.0016317367553710938 (2396.31 steps/sec)
INFO:tensorflow:global step 99800: loss = 0.0008220672607421875 (2397.1 steps/sec)
INFO:tensorflow:global step 99900: loss = 4.172325134277344e-06 (2397.86 steps/sec)
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 100000...
INFO:tensorflow:Saving checkpoints for 100000 into tutorial_model_dir/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 100000...

---

Example of the expected output (final lines):

    SLURM environment variables have been set successfully.
    srun: job 1234 queued and waiting for resources
    srun: job 1234 has been allocated resources
    INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'chief': ['sdf:29231'], 'worker': ['sdf:29233', 'sdf:29235', 'sdf:29237', 'sdf:29239', 'sdf:29241', 'sdf:29243']}, 'task': {'type': 'chief', 'index': 0}}
    INFO:root:Running train on CS-2

    [--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]

    INFO:tensorflow:global step 99700: loss = 4.76837158203125e-07 (2277.17 steps/sec)
    INFO:tensorflow:global step 99800: loss = 0.0 (2277.98 steps/sec)
    INFO:tensorflow:global step 99900: loss = 5.304813385009766e-06 (2278.76 steps/sec)
    INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 100000...
    INFO:tensorflow:Saving checkpoints for 100000 into tutorial_model_dir/model.ckpt.
    INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 100000...
    INFO:tensorflow:Training finished with 25600000 samples in 43.869 seconds, 583554.35 samples/second.
    INFO:tensorflow:Loss for final step: 0.06384.
    =============== Starting Cerebras Compilation ===============
    =============== Cerebras Compilation Completed ===============

---
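
The step lines and the final summary line in the output above follow a regular format, so they can be checked programmatically. The following is a minimal sketch, not part of the original example: the regular expressions are assumptions derived from the log lines shown here, and `parse_training_log` is a hypothetical helper name. Dividing the total sample count by the final step (100000) assumes training started from step 0.

```python
import re

# Assumed format, taken from the log lines shown above, e.g.:
# INFO:tensorflow:global step 99700: loss = 4.76837158203125e-07 (2277.17 steps/sec)
STEP_RE = re.compile(
    r"global step (\d+): loss = ([0-9.eE+-]+) \(([0-9.]+) steps/sec\)"
)
# Assumed format of the final summary line.
SUMMARY_RE = re.compile(
    r"Training finished with (\d+) samples in ([0-9.]+) seconds, "
    r"([0-9.]+) samples/second"
)


def parse_training_log(text):
    """Return ([(step, loss, steps_per_sec), ...], summary_tuple_or_None)."""
    steps = []
    summary = None
    for line in text.splitlines():
        match = STEP_RE.search(line)
        if match:
            steps.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
            continue
        match = SUMMARY_RE.search(line)
        if match:
            summary = (
                int(match.group(1)),
                float(match.group(2)),
                float(match.group(3)),
            )
    return steps, summary


# A few lines copied from the expected output above.
sample_log = """\
INFO:tensorflow:global step 99700: loss = 4.76837158203125e-07 (2277.17 steps/sec)
INFO:tensorflow:global step 99800: loss = 0.0 (2277.98 steps/sec)
INFO:tensorflow:global step 99900: loss = 5.304813385009766e-06 (2278.76 steps/sec)
INFO:tensorflow:Training finished with 25600000 samples in 43.869 seconds, 583554.35 samples/second.
"""

steps, summary = parse_training_log(sample_log)
total_samples, seconds, reported_rate = summary

# 25600000 samples over 100000 steps gives 256 samples per step
# (assuming the run started at step 0).
samples_per_step = total_samples // 100000
print("last logged step:", steps[-1][0])
print("samples per step:", samples_per_step)
print("recomputed samples/second:", total_samples / seconds)
```

The recomputed throughput differs slightly from the reported one because the elapsed time in the log is rounded to three decimals. The per-step loss values are noisy, so a trend is easier to see by averaging over many parsed steps than by reading individual lines.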

This third command also ran a job using `srun` and `singularity`, this time in the actual training mode, which uses the CS system.

For the Singularity/Apptainer command, the following arguments changed:
* **\${SLURM_GRES_ARGUMENT}**: the actual value is `--gres=cs:cerebras:1`, which requests a CS machine as a special resource for the Slurm job that trains the model. Without this flag, no CS machine would be allocated to the job.
* **--cs_ip ${CS_IP_ADDR}**: this value is set when a CS machine has been requested (using `--gres=cs:cerebras:1`). It points to the IP address of the CS machine mapped to the specific compute node running the training. This value is dynamic and may be changed by the system administrators as needed.

For the Python run.py file, the following argument changed:
* **--mode train**: no additional `_only` arguments were used, just the `--mode train` argument to start the training on the CS system using the compiled executable.
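
To make the relationship between these arguments concrete, the sketch below assembles a training invocation as an argument list. It is only an illustration under stated assumptions: `build_train_command` is a hypothetical helper, the container image name and IP address are placeholders, the use of the `singularity exec` subcommand is assumed, and the real command in this tutorial carries additional arguments that are omitted here. Only `--gres=cs:cerebras:1`, `--cs_ip`, and `--mode train` come from the description above.

```python
import shlex


def build_train_command(cs_ip_addr, container_image, gres="--gres=cs:cerebras:1"):
    """Assemble a simplified training command (hypothetical helper).

    The Slurm resource request and the CS IP address go together: the IP
    is only meaningful when a CS machine has been requested.
    """
    if not cs_ip_addr:
        raise ValueError("a CS IP address is required for --mode train")
    return [
        "srun", gres,                             # request the CS machine from Slurm
        "singularity", "exec", container_image,   # run inside the container
        "python", "run.py",
        "--mode", "train",                        # plain training mode
        "--cs_ip", cs_ip_addr,                    # CS machine mapped to this node
    ]


# Placeholder values, not the real Neocortex settings.
cmd = build_train_command("10.0.0.1", "cerebras_container.sif")
print(shlex.join(cmd))
```

Building the command as a list, rather than a single string, keeps each flag and its value explicit and avoids quoting problems when the values come from environment variables.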

## Closing notes

Congratulations on successfully training an FC-MNIST model on the Neocortex system! Throughout this tutorial, you have gained hands-on experience in performing all of the steps required for running a reference example from scratch on a Cerebras CS-2 machine.

This example is based on the [Cerebras Documentation](https://docs.cerebras.net/en/1.6.0/). 

Other links of interest are:

* [Neocortex System](https://www.cmu.edu/psc/aibd/neocortex/)
* [Cerebras ML Workflow](https://docs.cerebras.net/en/1.6.0/cerebras-basics/cs-ml-workflow.html)
* [Neocortex Documentation](https://portal.neocortex.psc.edu/docs/)

This material is based upon work supported by the [National Science Foundation under Grant Number 2005597](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2005597).

## Next steps (optional)

PSC's Neocortex system provides a unique opportunity to access the integrated technologies of the Cerebras CS-2 systems and the HPE Superdome Flex server.

We invite you to run your research on Neocortex. To do so, first identify the track that your project belongs to, then answer some general questions, and finally apply for access to the system through a full-fledged research grant.

You can take a look at the [previous Neocortex Call for Proposals page](https://www.cmu.edu/psc/aibd/neocortex/2023-03-cfp-spring-2023.html) for more details.