# Neocortex: Hands-on FC-MNIST Example


## Introduction

### Welcome notes

Welcome to this hands-on example of training a Fully Connected (FC) model for the MNIST dataset! In this exercise, we will explore the fascinating world of deep learning by building a neural network capable of recognizing handwritten digits. MNIST is a widely used dataset in the field of computer vision and serves as an excellent starting point for beginners.

The objective of this exercise is to guide you through the process of constructing and training a simple FC model using Python and the TensorFlow deep learning framework running on top of the Cerebras Software stack on the Neocortex system. We will break down the example step by step, ensuring that you gain a clear understanding of the underlying concepts and techniques.

By the end of this hands-on example, you will have a trained FC model on a Cerebras CS-2 machine that can accurately classify handwritten digits from the MNIST dataset. You will also gain valuable insights into the fundamentals of deep learning, including model architecture, training data preparation, loss functions, and optimization algorithms.

Whether you are new to the Neocortex system or looking to reinforce your knowledge, this exercise will provide you with a solid foundation to explore more advanced implementations using your custom model and dataset. So, let's dive in and embark on this exciting journey of training an FC-MNIST model on the Neocortex system!

### Training Description
#### Model used
A Fully Connected (FC) model: a neural network in which each neuron in one layer is connected to every neuron in the next layer, enabling complex pattern recognition and decision-making.

<img src="img/fcnn.png" width="30%" alt/>
<em>Fully Connected Neural Network</em>
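
To make the "every neuron is connected to every neuron in the next layer" idea concrete, here is a minimal sketch of a forward pass through a small fully connected network in plain Python. The layer sizes (784 inputs for a flattened 28x28 image, a hypothetical 16-unit hidden layer, 10 outputs), the random weights, and all function names are illustrative assumptions. This is not the modelzoo implementation.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: every output unit sees every input value."""
    return [
        sum(w * x for w, x in zip(unit_weights, inputs)) + b
        for unit_weights, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    # Subtract the maximum for numerical stability.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random weights (n_out rows of n_in values) and zero biases."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
w1, b1 = init_layer(rng, 784, 16)  # hypothetical hidden layer size
w2, b2 = init_layer(rng, 16, 10)   # one output per digit class

image = [rng.random() for _ in range(784)]  # stand-in for a flattened 28x28 image
probs = softmax(dense(relu(dense(image, w1, b1)), w2, b2))
print(len(probs), round(sum(probs), 6))
```

The ten output values can be read as class probabilities for the digits 0 through 9. Training adjusts the weights so that the correct digit receives the highest probability.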

#### Dataset used
The MNIST (Modified National Institute of Standards and Technology) dataset: a collection of 70,000 handwritten digits (0 through 9) widely used in image recognition tasks.

<img src="img/mnist.png" width="30%" alt/>
<em>MNIST dataset</em>

#### Training task used
* batch_size: 256
* max_steps: 100,000
* save_checkpoints_steps: 10,000
* keep_checkpoint_max: 2
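
As a quick sanity check, the arithmetic below works out what these settings imply. It is a sketch under stated assumptions: `max_steps` is read as one hundred thousand (matching the final checkpoint step in the training log later in this tutorial), the training split is assumed to be the standard 60,000 MNIST training images, and a checkpoint is assumed to be written at every multiple of `save_checkpoints_steps`.

```python
batch_size = 256
max_steps = 100_000
save_checkpoints_steps = 10_000
keep_checkpoint_max = 2

train_images = 60_000  # assumption: the standard MNIST training split

samples_seen = batch_size * max_steps
epochs = samples_seen / train_images

# Assumption: a checkpoint is written at every multiple of save_checkpoints_steps.
checkpoint_steps = list(
    range(save_checkpoints_steps, max_steps + 1, save_checkpoints_steps)
)
kept = checkpoint_steps[-keep_checkpoint_max:]

print(samples_seen)                 # 25600000
print(round(epochs, 1))             # 426.7
print(len(checkpoint_steps), kept)  # 10 [90000, 100000]
```

The 25,600,000 samples agree with the figure reported at the end of the expected training output in Step 6.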

### Neocortex

[Neocortex](https://www.cmu.edu/psc/aibd/neocortex/) is a highly innovative resource that targets the acceleration of AI-powered scientific discovery by vastly shortening the time required for deep learning training, featuring two [Cerebras CS-2](https://www.cerebras.net/product-system/) systems and an [HPE Superdome Flex HPC server](https://buy.hpe.com/ca/en/compute/mission-critical-x86-servers/superdome-flex-servers/superdome-flex-server/hpe-superdome-flex-280-server/p/1012865453) (SDF) robustly provisioned to drive the CS-2 systems simultaneously at maximum speed and support the complementary requirements of AI and HPDA workflows.

There are four types of applications currently supported on the system, divided into the following individual tracks:

* **Track 1**, [Cerebras modelzoo ML models](https://portal.neocortex.psc.edu/docs/supported-applications/track1.html): models already present in version R1.6.0 of the Cerebras modelzoo ML models software.
* **Track 2**, [Models similar to the Cerebras modelzoo models](https://portal.neocortex.psc.edu/docs/supported-applications/track2.html): a combination of the building blocks used by modelzoo models and/or the layers supported by Cerebras as listed in their documentation.
* **Track 3**, [General purpose SDK](https://portal.neocortex.psc.edu/docs/supported-applications/track3.html): a general purpose SDK that can be used for a variety of things. This track requires you to write low-level code, similar to writing CUDA, for implementing your research.
* **Track 4**, [WFA, WSE Field-equation API](https://portal.neocortex.psc.edu/docs/supported-applications/track4.html): for field equations, includes ML inference. This API was recently used for advancing CFD simulations at unprecedented resolution and speed ([more info](https://www.cmu.edu/psc/aibd/neocortex/2023-02-netl-psc-pioneer-first-ever-computational-fluid-dynamics-simulation-on-cerebras-wse.html)).

This document serves as an example of how to train a Track 1 model, the Cerebras modelzoo FC-MNIST example, from scratch.

This document is under continuous development. If you have any recommendations for this document, please make sure to share them with the team (see the [Feedback](https://portal.neocortex.psc.edu/docs/providing-feedback.html) page).


## Setup and Requirements

To follow along with this hands-on tutorial on training an FC-MNIST model, you will need the following setup and requirements:

1. <u>A PSC account to access the Neocortex system</u>. You should have gotten an email requesting you to create/provide a valid PSC account.
2. <u>An SSH terminal client</u>.
3. <u>A web browser</u>.
4. A development environment with all of the Cerebras software stack libraries. This is already present in the Neocortex system. You don't need to download it separately.
5. The Cerebras modelzoo repository using the 1.6.0 release tag (R_1.6.0). This repository will be downloaded as part of the tutorial. You don't need to download it separately.
6. MNIST Dataset: The tutorial utilizes the MNIST dataset, which consists of a large collection of handwritten digit images. Fortunately, both TensorFlow and PyTorch provide convenient functions to automatically download and load the MNIST dataset. You don't need to download it separately.

With these requirements in place, you are all set to start this tutorial.

## Expected Steps

The training is composed of different stages. We will be performing the following tasks:

1. Define all of the helper variables and commands used across the tutorial steps, e.g. the paths to access the Cerebras software stack. This way we can reuse code and focus on the logic behind the steps.
2. Procure the Cerebras modelzoo repository. This repository contains the example code we will be using.
3. Navigate to the FC-MNIST code location inside the Cerebras modelzoo repository.
4. Precompile the code, using Cerebras tools to validate everything looks good code-wise.
5. Compile the code. This will generate the executable to use.
6. Train the model using the generated executable.

Here is a simple flow of the expected steps:

```
Set helper variables -> Get example code -> Change to FC-MNIST dir -> Validate -> Compile -> Train model
```

## Step 1: Set helper variables

In [1]:
# Set the folder path to the Cerebras directory
import os
import tempfile
import subprocess

# Create a temporary directory and capture the path
tmp_dir = subprocess.check_output("mktemp -d", shell=True).decode().strip()

account_id = os.environ.get("SLURM_JOB_ACCOUNT")  # Project allocation to use. The `projects` command shows your projects, e.g. tra250009p
username = os.environ.get("SLURM_JOB_USER")

# Set Cerebras-related environment variables, such as the base directory containing the development environment
local_dir = os.environ.get("LOCAL")
cerebras_dir = f"{local_dir}/cerebras"
os.environ["CEREBRAS_DIR"] = cerebras_dir
os.environ['CEREBRAS_CONTAINER'] = f"{cerebras_dir}/cbcore_latest.sif"

# Set your individual code environment variables, such as the directory to be used for running the compilation
project_path = tmp_dir
os.environ["PROJECT"] = tmp_dir
os.environ["YOUR_ENTRY_SCRIPT_LOCATION"] = f"{project_path}/modelzoo/modelzoo/fc_mnist/tf"
your_entry_script_location = os.environ["YOUR_ENTRY_SCRIPT_LOCATION"]
os.environ['BIND_LOCATIONS'] = f"/local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,{project_path}"

# Set Slurm-related environment variables and command arguments to use for running this example
os.environ['SLURM_GRES_ARGUMENT'] = "--gres=cs:cerebras:1"
os.environ['SLURM_ARGUMENTS'] = f"--ntasks=7 --time=0-00:15 --cpus-per-task=28 --account={account_id}"

# Define a method we will use to get some required arguments for the model training.
def set_cs_ip_addr_value():
    """
    Runs a SLURM command to retrieve the CS IP address and compute node ID to use, and sets them as environment
    variables in the system.
    """
    # Run a job while requesting a CS machine, get the assigned value for the CS_IP_ADDR environment variable.
    cs_ip_addr_output = !salloc ${SLURM_GRES_ARGUMENT} ${SLURM_ARGUMENTS} --ntasks=1 srun /bin/bash -c set -o posix | grep CS_IP_ADDR
    cs_ip_addr_output = [item for item in cs_ip_addr_output if item.startswith("CS_IP_ADDR")]
    os.environ["CS_IP_ADDR"] = cs_ip_addr_output[0].split("=")[1]
    cs_ip_addr = os.environ["CS_IP_ADDR"]
    
    # Execute sacct to find which compute node (the SDF partition) is assigned to drive that specific CS machine
    node_id_output = !sacct --allocations --format=NodeList,AllocTRES --state=COMPLETED --parsable2 --starttime=now-1hours --endtime=now | grep "gres/cs=1" | tail --lines 1
    print(node_id_output)
    os.environ["NODE_ID"] = node_id_output[0].split("|")[0]
    node_id = os.environ["NODE_ID"]
    
    print(f"The CS_IP_ADDR ({cs_ip_addr}) and NODE_ID ({node_id}) environment variables have been set.")
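
The parsing logic in `set_cs_ip_addr_value()` can be tried without a Slurm allocation. The sketch below applies the same string handling to hypothetical sample output. The function names and the sample lines are assumptions for illustration, and the real `sacct` output may contain more fields.

```python
def parse_cs_ip_addr(lines):
    """Return the value of the first CS_IP_ADDR=... line."""
    for line in lines:
        if line.startswith("CS_IP_ADDR="):
            # Split once so a value containing "=" would stay intact.
            return line.split("=", 1)[1]
    raise ValueError("CS_IP_ADDR not found in the job output")


def parse_node_id(sacct_lines):
    """Return the NodeList field of the last sacct row that used a CS."""
    rows = [line for line in sacct_lines if "gres/cs=1" in line]
    if not rows:
        raise ValueError("no completed allocation with gres/cs=1 found")
    return rows[-1].split("|")[0]


# Hypothetical sample output, for illustration only.
env_dump = ["BASH=/bin/bash", "CS_IP_ADDR=10.8.88.10", "HOME=/tmp"]
sacct_dump = [
    "NodeList|AllocTRES",
    "sdf-1|billing=28,cpu=28,gres/cs=1,node=1",
]

print(parse_cs_ip_addr(env_dump))
print(parse_node_id(sacct_dump))
```

Raising an explicit error when nothing matches gives a clearer message than the `IndexError` that indexing an empty list would produce.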

## Step 2: Get the example code
### Procure the Cerebras modelzoo examples repository

The [Cerebras Model Zoo GitHub repository](https://github.com/Cerebras/modelzoo/tree/R_1.6.0) is public and contains examples of common deep learning models that can be trained on Cerebras hardware.

Please clone the repository and then check out the R_1.6.0 tag (the version currently running on the Neocortex system) using the following commands:

In [2]:
import os


# Check if the repository already exists
repository_exists = os.path.isdir(f"{project_path}/modelzoo")

if repository_exists:
    !rm -rf {project_path}/modelzoo

# Clone the repository
!git clone https://github.com/Cerebras/modelzoo.git {project_path}/modelzoo

# Change to the repository directory
os.chdir(f"{project_path}/modelzoo")

# Checkout the specific tag
!git checkout tags/R_1.6.0

# Confirm the operations
print(f"\nOK: The reference modelzoo folder has been cloned into the {project_path} directory.")

# List the contents of the directory
!ls -lash


Cloning into '/tmp/tmp.VT9bug2LBf/modelzoo'...
remote: Enumerating objects: 5129, done.
remote: Counting objects: 100% (411/411), done.
remote: Compressing objects: 100% (231/231), done.
remote: Total 5129 (delta 255), reused 189 (delta 176), pack-reused 4718 (from 2)
Receiving objects: 100% (5129/5129), 25.20 MiB | 38.74 MiB/s, done.
Resolving deltas: 100% (3284/3284), done.
Note: checking out 'tags/R_1.6.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at 886a438... R_1.6.0

OK: The reference modelzoo folder has been cloned into the /tmp/tmp.VT9bug2LBf directory.
total 72K
   0 drwxr-xr-x 5 julian pscstaff  22

---
Example of the expected output (final lines):

    Cloning into '/ocean/projects/ACCOUNT_ID/USERNAME/modelzoo'...
    [--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
    HEAD is now at 886a438... R_1.6.0

    OK: The reference modelzoo folder has been cloned into the /ocean/projects/ACCOUNT_ID/USERNAME directory.
---
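
Outside a notebook, the `!` shell escapes used in the cell above are not available. The following sketch shows one way the same clone and checkout could be expressed with `subprocess`. The function names and the destination path are hypothetical, and only standard `git` options (`clone`, `-C`, `checkout`) are used.

```python
import subprocess


def clone_commands(dest, tag="R_1.6.0",
                   url="https://github.com/Cerebras/modelzoo.git"):
    """Build the git commands used in this step as argument lists."""
    return [
        ["git", "clone", url, dest],
        ["git", "-C", dest, "checkout", f"tags/{tag}"],
    ]


def clone_modelzoo(dest):
    # check=True raises CalledProcessError if git exits with an error.
    for command in clone_commands(dest):
        subprocess.run(command, check=True)


# Print the commands instead of running them (hypothetical destination path).
for command in clone_commands("/tmp/example/modelzoo"):
    print(" ".join(command))
```

Calling `clone_modelzoo(...)` would actually run the commands, so it needs network access and a destination that does not already exist.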

You should have the freshly checked-out folder now and it should have a modelzoo subdirectory inside as well as some other files required for running the examples. See it pointed out in the following output:

    total 92K
    4.0K drwxr-xr-x 8 USERNAME ACCOUNT_ID 4.0K Jun  13 12:29 .git
    4.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID   94 Jun  13 12:29 .gitignore
     12K -rw-r--r-- 1 USERNAME ACCOUNT_ID  12K Jun  13 12:29 LICENSE
    4.0K drwxr-xr-x 6 USERNAME ACCOUNT_ID 4.0K Jun  13 12:29 modelzoo  # <- This is the one we will be using
    8.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 4.5K Jun  13 12:29 PYTHON-SETUP.md
     12K -rw-r--r-- 1 USERNAME ACCOUNT_ID 8.2K Jun  13 12:29 README.md
     28K -rw-r--r-- 1 USERNAME ACCOUNT_ID  27K Jun  13 12:29 RELEASE-NOTES.md
    4.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 1.2K Jun  13 12:29 requirements_pytorch_gpu.txt
    4.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 1.3K Jun  13 12:29 requirements_tensorflow_gpu.txt
    4.0K drwxr-xr-x 2 USERNAME ACCOUNT_ID 4.0K Jun  13 12:29 user_scripts
    
You can check whether the contents of your directory look the same by comparing them with the output of the `ls -lash` command at the end of the cell above.

## Step 3: Change into the FC-MNIST Example folder

<img src="img/step3.png" width="30%" alt/>

We will now change directories to the `modelzoo/fc_mnist/tf` folder for running the FC-MNIST example. This location holds the entry script and all of the code needed to run the example.

In [3]:
%cd "$your_entry_script_location"
!ls -lash && echo -e "\nOK: Successfully listed directories in the FC-MNIST example folder"

/tmp/tmp.VT9bug2LBf/modelzoo/modelzoo/fc_mnist/tf
total 44K
   0 drwxr-xr-x 3 julian pscstaff  165 Apr  9 12:56 .
   0 drwxr-xr-x 5 julian pscstaff   64 Apr  9 12:56 ..
   0 drwxr-xr-x 2 julian pscstaff   25 Apr  9 12:56 configs
4.0K -rw-r--r-- 1 julian pscstaff 2.6K Apr  9 12:56 data.py
   0 -rw-r--r-- 1 julian pscstaff    0 Apr  9 12:56 __init__.py
8.0K -rw-r--r-- 1 julian pscstaff 4.3K Apr  9 12:56 model.py
4.0K -rw-r--r-- 1 julian pscstaff 1.2K Apr  9 12:56 prepare_data.py
8.0K -rw-r--r-- 1 julian pscstaff 6.4K Apr  9 12:56 README.md
4.0K -rw-r--r-- 1 julian pscstaff 4.0K Apr  9 12:56 run-appliance.py
 12K -rw-r--r-- 1 julian pscstaff 8.5K Apr  9 12:56 run.py
4.0K -rw-r--r-- 1 julian pscstaff 1.5K Apr  9 12:56 utils.py

OK: Successfully listed directories in the FC-MNIST example folder


## Structure of the code

The following is the base structure that Cerebras uses for their code (the code template). If you want to run your research on their system, the suggested way to add your model and dataset is to take one of these base examples and start changing specific components.

The files for this specific TensorFlow FC-MNIST directory we just switched to should look like this:

    /ocean/projects/ACCOUNT_ID/USERNAME/modelzoo/modelzoo/fc_mnist/tf
    total 56K
    4.0K drwxr-xr-x 2 USERNAME ACCOUNT_ID 4.0K Jun  13 12:29 configs
    4.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 2.6K Jun  13 12:29 data.py
       0 -rw-r--r-- 1 USERNAME ACCOUNT_ID    0 Jun  13 12:29 __init__.py
    8.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 4.3K Jun  13 12:29 model.py
    4.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 1.2K Jun  13 12:29 prepare_data.py
    8.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 6.4K Jun  13 12:29 README.md
    4.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 4.0K Jun  13 12:29 run-appliance.py
     12K -rw-r--r-- 1 USERNAME ACCOUNT_ID 8.5K Jun  13 12:29 run.py
    4.0K -rw-r--r-- 1 USERNAME ACCOUNT_ID 1.5K Jun  13 12:29 utils.py

Let's go over the main files in this location:

* **configs/params.yaml:** YAML file containing the model configuration and the training hyperparameter settings.
* **data.py:** contains the input data pipeline. Additional data processor modules may be defined elsewhere (e.g., in the input folder) for the data pipeline implementation.
* **model.py:** contains the model function definition. For information about the layers supported, please visit the Cerebras Documentation.
* **run.py:** contains the training/compilation/evaluation script.
* **utils.py:** contains the helper scripts.

If you would like to modify an example to integrate your code, model, or dataset, the suggested order of files to modify is: **model.py** > **data.py** > **utils.py** > **configs/params.yaml** > **run.py**

Additionally, the following diagram shows the suggested order in which to perform the modifications when porting the code. Read it from left to right, following the arrows.

![](img/code_migration_workflow.png)


Since the Cerebras modelzoo examples are all ready to be executed, we will not make any modifications for this example as we are now ready to start running it. 

The execution will happen over three different steps: validate, compile, and train.

1. **validate**: this step runs a fast verification (on CPU), running a light-weight compilation up to performing kernel library matching, helping you determine if you are using any TensorFlow layer or functionality that is unsupported by the Cerebras development stack.
<br/>The argument used for running this step is `--mode train --validate_only`.

2. **compile**: this step runs the full compilation (on CPU) through all stages of the Cerebras software stack to generate a CS system executable. When this compilation is successful, the model is guaranteed to run on the CS system. <br/>The argument used for running this step is `--mode train --compile_only`.

3. **train**: this step runs the actual training job on the CS system using the compiled executable.
<br/>The argument used for running this step is simply `--mode train` (without additional "`_only`" arguments).

For more information, please visit the [Cerebras TensorFlow Quickstart Documentation v1.6.0](https://docs.cerebras.net/en/1.6.0/getting-started/cs-tf-quickstart.html) page.
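
The three stages differ only in the extra flag passed to `run.py`. The sketch below captures that mapping in a small helper. The names `STAGE_FLAGS` and `run_py_args` are hypothetical, while the flags themselves are the ones listed above.

```python
STAGE_FLAGS = {
    "validate": ["--validate_only"],
    "compile": ["--compile_only"],
    "train": [],
}


def run_py_args(stage, model_dir="tutorial_model_dir"):
    """Return the run.py argument list for one of the three stages."""
    if stage not in STAGE_FLAGS:
        raise ValueError(f"unknown stage: {stage!r}")
    return ["--mode", "train", *STAGE_FLAGS[stage], "--model_dir", model_dir]


for stage in ("validate", "compile", "train"):
    print(stage, "->", " ".join(run_py_args(stage)))
```

Note that the training command used later in this tutorial also passes `--cs_ip`, which this helper leaves out.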

## Running the example

As the development environment uses custom Cerebras libraries and it would need to be configured beforehand, Cerebras kindly offers a pre-made container with everything preconfigured and ready to execute the Cerebras modelzoo examples.

The actual example code is then executed from that [Singularity/Apptainer](https://apptainer.org/) container, and multi-user access to the CS systems is handled by Slurm.

The actual validation, compilation and training commands can be found below.

### Step 4: Validate the code

Run the code validation using the following command:

In [4]:
model_dir_exists = os.path.isdir("tutorial_model_dir")
if model_dir_exists:
    !rm -rf tutorial_model_dir

!salloc ${SLURM_ARGUMENTS} --ntasks=1 singularity exec --bind ${BIND_LOCATIONS} --pwd ${YOUR_ENTRY_SCRIPT_LOCATION} ${CEREBRAS_CONTAINER} python run.py --mode train --validate_only --model_dir tutorial_model_dir

salloc: Granted job allocation 320727
INFO:tensorflow:TF_CONFIG environment variable: {}
INFO:root:Running None on CS-2
INFO:absl:Generating dataset mnist (./tfds/mnist/3.0.1)
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to ./tfds/mnist/3.0.1...
Dl Completed...: 0 url [00:00, ? url/s]
Dl Size...: 0 MiB [00:00, ? MiB/s]
Extraction completed...: 0 file [00:00, ? file/s]
INFO:absl:Downloading https://storage.googleapis.com/cvdf-datasets/mnist/t10k-images-idx3-ubyte.gz into tfds/downloads/cvdf-datasets_mnist_t10k-images-idx3-ubytedDnaEPiC58ZczHNOp6ks9L4_JLids_rpvUj38kJNGMc.gz.tmp.70f56cb777744e08b49c75cc4c44e845...
Dl Completed...:   0%|                                  | 0/1 [00:00<?, ? url/s]
Dl Size...: 0 MiB [00:00, ? MiB/s]
Extraction completed...: 0 file [00:00, ? file/s]
INFO:absl:Downloading https://storage.googleapis.com/cvdf-datasets/mnist/t10k-labels-idx1-ubyte.gz into tfds/downloa

---
Example of the expected output (final lines):

    [--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
    =============== Starting Cerebras Compilation ===============                   
    Stack:   0%|                      | 0/10 [00:00s, ?stages/s ]
    =============== Cerebras Compilation Completed ==============
---

The command above executed a Slurm job using `salloc` and `singularity`, both with multiple arguments.

For the Slurm arguments, the following were used:
* **--ntasks=7**: specifies the number of tasks to run. This is the default set in `SLURM_ARGUMENTS`; the validation and compilation commands append `--ntasks=1`, which overrides it.
* **--time=0-00:15**: sets a limit on the total run time of the job allocation. The format used is "days-hours:minutes", so this value means 15 minutes.
* **--cpus-per-task=28**: requests 28 CPU cores per task.
* **--account=ACCOUNT_ID**: the allocation account to which resource utilization is charged (read from `SLURM_JOB_ACCOUNT` in Step 1).
* **--pty** (an `srun` option, not part of `SLURM_ARGUMENTS` in this notebook): executes task zero in pseudo-terminal mode, which provides an interactive session.
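
To double-check a `--time` value before submitting a job, a small conversion helper can be handy. This is a minimal sketch that handles only the "days-hours:minutes" form used in this tutorial. Slurm accepts other forms as well, which this hypothetical helper does not cover.

```python
def slurm_time_to_minutes(value):
    """Convert a 'days-hours:minutes' Slurm time limit to minutes."""
    days, _, clock = value.partition("-")
    hours, _, minutes = clock.partition(":")
    return (int(days) * 24 + int(hours)) * 60 + int(minutes)


print(slurm_time_to_minutes("0-00:15"))  # 15
print(slurm_time_to_minutes("1-02:30"))  # 1590
```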

For the Singularity/Apptainer arguments, the following were used:
* **exec**: the "exec" mode is an alternative to running the "run" or "shell" mode, and it allows running a specific command.
* **--bind \${BIND_LOCATIONS}**: it allows specifying the bind paths, or folders to make available to the container apps in the same (or a specific) location. The folder we are requesting to bind is the one in which the FC-MNIST example is located. Other folders with input data should also be mounted as required.
* **--pwd \${YOUR_ENTRY_SCRIPT_LOCATION}**: sets the working directory to use when running the commands with singularity exec. In this notebook the value is `${PROJECT}/modelzoo/modelzoo/fc_mnist/tf` (for example, `/ocean/projects/ACCOUNT_ID/USERNAME/modelzoo/modelzoo/fc_mnist/tf` when `PROJECT` points to your project directory).
* **\${CEREBRAS_CONTAINER}**: the full path to the latest Cerebras cbcore container, set in Step 1 to `${CEREBRAS_DIR}/cbcore_latest.sif` (for example, `/ocean/neocortex/cerebras/cbcore_latest.sif`).
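
The sketch below shows how the Singularity/Apptainer part of the command is put together from these pieces. The helper name and all paths are hypothetical placeholders. Only the `exec`, `--bind`, and `--pwd` arguments described above are used.

```python
import shlex


def singularity_exec_command(bind_locations, workdir, container, program_args):
    """Assemble a `singularity exec` command line as an argument list."""
    return [
        "singularity", "exec",
        "--bind", ",".join(bind_locations),  # comma-separated bind paths
        "--pwd", workdir,
        container,
        *program_args,
    ]


# Hypothetical paths, for illustration only.
project = "/tmp/example"
command = singularity_exec_command(
    bind_locations=["/local1/cerebras/data", project],
    workdir=f"{project}/modelzoo/modelzoo/fc_mnist/tf",
    container="/path/to/cbcore_latest.sif",
    program_args=["python", "run.py", "--mode", "train", "--validate_only",
                  "--model_dir", "tutorial_model_dir"],
)
print(shlex.join(command))
```

Building the command as a list and joining it with `shlex.join` keeps paths that contain spaces correctly quoted.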

And as for the Python run.py file, the following arguments were used:
* **--mode train**: this uses the training mode of the Cerebras software stack. Other modes available are the evaluation and prediction modes. More information about this can be found in the [Cerebras Documentation](https://docs.cerebras.net/en/1.6.0/tensorflow-docs/running-a-model/train-eval-predict.html).
* **--validate_only**: as mentioned above, this argument runs a light-weight compilation that helps determine whether any unsupported TensorFlow layer or functionality is being used in the code.
* **--model_dir tutorial_model_dir**: this argument specifies the target directory to use for the compilation process. Using different folders for this argument allows starting from scratch when something goes wrong.


### Step 5: Compile the code

Run the full compilation using the following command:

In [5]:
!salloc ${SLURM_ARGUMENTS} --ntasks=1 singularity exec --bind ${BIND_LOCATIONS} --pwd ${YOUR_ENTRY_SCRIPT_LOCATION} ${CEREBRAS_CONTAINER} python run.py --mode train --compile_only --model_dir tutorial_model_dir

salloc: Granted job allocation 320728
INFO:tensorflow:TF_CONFIG environment variable: {}
INFO:root:Running None on CS-2
INFO:absl:Load dataset info from ./tfds/mnist/3.0.1
INFO:absl:Reusing dataset mnist (./tfds/mnist/3.0.1)
INFO:absl:Constructing tf.data.Dataset mnist for split None, from ./tfds/mnist/3.0.1
INFO:root:---------- Suggestions to improve input_fn performance ----------
INFO:root:[input_fn] - batch(): batch_size set to 256
INFO:root:----------------- End of input_fn suggestions -----------------
INFO:absl:Load dataset info from ./tfds/mnist/3.0.1
INFO:absl:Reusing dataset mnist (./tfds/mnist/3.0.1)
INFO:absl:Constructing tf.data.Dataset mnist for split None, from ./tfds/mnist/3.0.1
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:cerebras.stack.tools.caching_stack:Using lair flow into stack
salloc: Relinquishing job allocation 320728
salloc: Job allocation 320728 has been revoked.


---
Example of the expected output (final lines):

    [--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
    =============== Starting Cerebras Compilation ===============
    Estimating performance:  94%|█████████| 33/35 [00:50s,  3.99s/stages ]
    =============== Cerebras Compilation Completed ===============
---
   
Similar to the previous command, another job was executed using `salloc` and `singularity`, but now in compilation mode.

For the Python run.py file, the following argument changed:
* **--compile_only**: as mentioned above, this step runs the full compilation (on CPU) through all stages of the Cerebras software stack to generate a CS system executable.

### Step 6: Run the training on the CS machine

In [17]:
# set_cs_ip_addr_value()
!salloc ${SLURM_GRES_ARGUMENT} ${SLURM_ARGUMENTS} --nodelist=sdf-1 srun /usr/bin/singularity exec --bind ${BIND_LOCATIONS} --pwd ${YOUR_ENTRY_SCRIPT_LOCATION} ${CEREBRAS_CONTAINER} python run.py --mode train --model_dir tutorial_model_dir --cs_ip 10.8.88.10

salloc: Granted job allocation 320735
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'chief': ['sdf-1:49600'], 'worker': ['sdf-1:49602', 'sdf-1:49604', 'sdf-1:49606', 'sdf-1:49608', 'sdf-1:49610', 'sdf-1:49612']}, 'task': {'type': 'chief', 'index': 0}}
INFO:root:Running train on CS-2
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
2025-04-09 13:17:20.222782: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
2025-04-09 13:17:20.257994: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2700000000 Hz
2025-04-09 13:17:20.285321: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x9532230 initialized for platform Host (this does not guarantee that XLA will be used). D

---

Example of the expected output (final lines):

    SLURM environment variables have been set successfully.
    srun: job 1234 queued and waiting for resources
    srun: job 1234 has been allocated resources
    INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'chief': ['sdf:29231'], 'worker': ['sdf:29233', 'sdf:29235', 'sdf:29237', 'sdf:29239', 'sdf:29241', 'sdf:29243']}, 'task': {'type': 'chief', 'index': 0}}
    INFO:root:Running train on CS-2

    [--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]

    INFO:tensorflow:global step 99700: loss = 4.76837158203125e-07 (2277.17 steps/sec)
    INFO:tensorflow:global step 99800: loss = 0.0 (2277.98 steps/sec)
    INFO:tensorflow:global step 99900: loss = 5.304813385009766e-06 (2278.76 steps/sec)
    INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 100000...
    INFO:tensorflow:Saving checkpoints for 100000 into tutorial_model_dir/model.ckpt.
    INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 100000...
    INFO:tensorflow:Training finished with 25600000 samples in 43.869 seconds, 583554.35 samples/second.
    INFO:tensorflow:Loss for final step: 0.06384.
    =============== Starting Cerebras Compilation ===============
    =============== Cerebras Compilation Completed ===============

---

This third command also executed a job using `srun` and `singularity`, but now using the actual training mode that utilizes the CS system.

The following arguments were added to the command:
* **\${SLURM_GRES_ARGUMENT}**: the actual value is `--gres=cs:cerebras:1`, and it requests a CS machine as a special resource for the Slurm job that trains the model. Without this flag, no CS machine is allocated to the job.
* **--cs_ip**: the IP address of the CS machine mapped to the compute node running the training. The cell above hard-codes `10.8.88.10` together with `--nodelist=sdf-1`. Because these values are dynamic and change as required by the system administrators, you can instead call `set_cs_ip_addr_value()` (defined in Step 1) to populate `CS_IP_ADDR` and `NODE_ID`, and then pass `--cs_ip ${CS_IP_ADDR}`.

For the Python run.py file, the following argument changed:
* **--mode train**: no additional `_only` arguments were used, just the `--mode train` argument to start the training on the CS system using the compiled executable.
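
Training log lines such as `global step 99900: loss = ... (2278.76 steps/sec)` can be turned into numbers for plotting or monitoring. The following sketch parses one such line with a regular expression. The pattern and function name are assumptions based on the log format shown in the expected output above, and the samples-per-second figure is simply steps per second multiplied by the batch size of 256.

```python
import re

STEP_LINE = re.compile(
    r"global step (?P<step>\d+): loss = (?P<loss>\S+) \((?P<rate>[\d.]+) steps/sec\)"
)


def parse_step_line(line, batch_size=256):
    """Extract step, loss, and throughput from one training log line."""
    match = STEP_LINE.search(line)
    if match is None:
        return None
    steps_per_sec = float(match["rate"])
    return {
        "step": int(match["step"]),
        "loss": float(match["loss"]),
        "steps_per_sec": steps_per_sec,
        "samples_per_sec": steps_per_sec * batch_size,
    }


line = ("INFO:tensorflow:global step 99900: "
        "loss = 5.304813385009766e-06 (2278.76 steps/sec)")
print(parse_step_line(line))
```

The per-line estimate (about 583,000 samples per second here) is close to, but not identical to, the overall figure reported in the final summary line of the log.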

In [21]:
# Get the ratio of utilized components (relative to the theoretical peak throughput): grep the "total_utilization" value from the plan.json files in the model directory.
!grep -o '"total_utilization":[0-9\.]*' ${YOUR_ENTRY_SCRIPT_LOCATION}/tutorial_model_dir/*/plan.json


/tmp/tmp.VT9bug2LBf/modelzoo/modelzoo/fc_mnist/tf/tutorial_model_dir/cs_0233d1734bc58d9f1ad221730921ebe2bbcc507232f5d3bb32762a5a2baf5a31/plan.json:"total_utilization":1.71281689019136
/tmp/tmp.VT9bug2LBf/modelzoo/modelzoo/fc_mnist/tf/tutorial_model_dir/cs_0863d6792f6cb5c1590f3fc48f98fd90f3eb7951cd1a5fa2049cfe56bd6d41df/plan.json:"total_utilization":1.71281689019136
/tmp/tmp.VT9bug2LBf/modelzoo/modelzoo/fc_mnist/tf/tutorial_model_dir/cs_0233d1734bc58d9f1ad221730921ebe2bbcc507232f5d3bb32762a5a2baf5a31/deltat_estimate.json:"estimated_deltat":43264
/tmp/tmp.VT9bug2LBf/modelzoo/modelzoo/fc_mnist/tf/tutorial_model_dir/cs_0863d6792f6cb5c1590f3fc48f98fd90f3eb7951cd1a5fa2049cfe56bd6d41df/deltat_estimate.json:"estimated_deltat":
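
The same lookup can be done from Python, which makes the values easier to post-process. This is a sketch that mirrors the `grep` pattern above rather than parsing the JSON structure, since the layout of `plan.json` is not described here. The function name is hypothetical.

```python
import re
from pathlib import Path

# Same idea as the grep pattern, additionally tolerating whitespace after the colon.
PATTERN = re.compile(r'"total_utilization":\s*([0-9.]+)')


def total_utilization(model_dir):
    """Map each */plan.json under model_dir to its total_utilization values."""
    results = {}
    for plan in sorted(Path(model_dir).glob("*/plan.json")):
        values = [float(v) for v in PATTERN.findall(plan.read_text())]
        if values:
            results[str(plan)] = values
    return results


# Prints an empty dict when the directory does not exist or has no matches.
print(total_utilization("tutorial_model_dir"))
```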


## Closing notes

Congratulations on successfully training an FC-MNIST model on the Neocortex system! Throughout this tutorial, you have gained hands-on experience in performing all of the steps required for running a reference example from scratch on a Cerebras CS-2 machine.

This example is based on the [Cerebras Documentation](https://docs.cerebras.net/en/1.6.0/). 

Other links of interest are:

* [Neocortex System](https://www.cmu.edu/psc/aibd/neocortex/)
* [Cerebras ML Workflow](https://docs.cerebras.net/en/1.6.0/cerebras-basics/cs-ml-workflow.html)
* [Neocortex Documentation](https://portal.neocortex.psc.edu/docs/)

This material is based upon work supported by the [National Science Foundation under Grant Number 2005597](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2005597).

## Next steps (optional):


Neocortex provides a unique opportunity to access the remarkable integrated technologies of the Cerebras CS-2 and the HPE Superdome Flex Servers available in PSC's Neocortex system. 

We invite you to run your research on Neocortex. To do so, first identify the track that your project belongs to, then answer some general questions, and finally apply for access to the system through a full-fledged research grant.

You can take a look at the [previous Neocortex Call for Proposals page](https://www.cmu.edu/psc/aibd/neocortex/2023-03-cfp-spring-2023.html) for more details.