<img src="https://fsdl.me/logo-720-dark-horizontal">

# Lab 05: Troubleshooting & Testing

### What You Will Learn

- Practices and tools for testing and linting Python code in general: `black`, `flake8`, `pre-commit`, `pytest` and `doctest`
- How to implement tests for ML training systems in particular
- What a PyTorch training step looks like under the hood and how to troubleshoot performance bottlenecks

# Setup

If you're running this notebook on Google Colab,
the cell below will run full environment setup.

It should take about three minutes to run.

In [9]:
lab_idx = 5

if "bootstrap" not in locals() or bootstrap.run:
    # path management for Python
    pythonpath, = !echo $PYTHONPATH
    if "." not in pythonpath.split(":"):
        pythonpath = ".:" + pythonpath
        %env PYTHONPATH={pythonpath}
        !echo $PYTHONPATH

    # get both Colab and local notebooks into the same state
    #!wget --quiet https://fsdl.me/gist-bootstrap -O bootstrap.py
    !wget --quiet https://gist.githubusercontent.com/mchen50/0b202810bdc771b010f018561b48e3ef/raw/17e9ee7cca20fefdd1c07f803818d843c1411bc1/gistfile1.txt -O bootstrap.py
    import bootstrap

    # change into the lab directory
    bootstrap.change_to_lab_dir(lab_idx=lab_idx)

    # allow "hot-reloading" of modules
    %load_ext autoreload
    %autoreload 2
    # needed for inline plots in some contexts
    %matplotlib inline

    bootstrap.run = False  # change to True to re-run setup

!pwd
%ls

env: PYTHONPATH=.:/env/python
.:/env/python
/content/fsdl-text-recognizer-2022-labs/lab05
notebooks/  tasks/  text_recognizer/  training/


In [12]:
from IPython.display import display, HTML, IFrame

full_width = True
frame_height = 720  # adjust for your screen

if full_width:  # if we want the notebook to take up the whole width
    # add styling to the notebook's HTML directly
    display(HTML("<style>.container { width:100% !important; }</style>"))
    display(HTML("<style>.output_result { max-width:100% !important; }</style>"))

### Follow along with a video walkthrough on YouTube:

In [None]:
IFrame(src="https://fsdl.me/2022-lab-05-video-embed", width="100%", height=frame_height)

# Linting Python and Shell Scripts

### Automatically linting with `pre-commit`

We want to keep our code clean and uniform across developers
and time.

Applying the cleanliness checks and style rules should be
as painless and automatic as possible.

For this purpose, we recommend bundling linting tools together
and enforcing them on all commits with
[`pre-commit`](https://pre-commit.com/).

In addition to running on every commit,
`pre-commit` separates the model development environment from the environments
needed for the linting tools, preventing conflicts
and simplifying maintenance and onboarding.

This cell runs `pre-commit`.

The first time it is run on a machine, it will install the environments for all tools.

In [None]:
!pre-commit run --all-files

[INFO] Initializing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Initializing environment for https://github.com/psf/black.
[INFO] Initializing environment for https://github.com/PyCQA/flake8.
[INFO] Initializing environment for https://github.com/PyCQA/flake8:flake8-bandit,flake8-bugbear,flake8-docstrings,flake8-import-order,darglint,mypy,pycodestyle,pydocstyle.
[INFO] Initializing environment for https://github.com/shellcheck-py/shellcheck-py.
[INFO] Installing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/psf/black.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/PyCQA/flake8.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few m

The output lists all the checks that are run and whether they passed.

Notice there are a number of simple version-control hygiene practices included
that aren't even specific to Python, much less to machine learning.

For example, several of the checks prevent accidental commits with private keys, large files,
leftover debugger statements, or merge conflict annotations in them.

These linting actions are configured via
([what else?](https://twitter.com/charles_irl/status/1446235836794564615?s=20&t=OOK-9NbgbJAoBrL8MkUmuA))
a YAML file:

In [None]:
!cat .pre-commit-config.yaml

repos:
  # a set of useful Python-based pre-commit hooks
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.1.0
    hooks:
      # list of definitions and supported hooks: https://pre-commit.com/hooks.html
      - id: trailing-whitespace      # removes any whitespace at the ends of lines
      - id: check-toml               # check toml syntax by loading all toml files
      - id: check-yaml               # check yaml syntax by loading all yaml files
      - id: check-json               # check-json syntax by loading all json files
      - id: check-merge-conflict     # check for files with merge conflict strings
        args: ['--assume-in-merge']  #  and run this check even when not explicitly in a merge
      - id: check-added-large-files  # check that no "large" files have been added
        args: ['--maxkb=10240']      #  where large means 10MB+, as in Hugging Face's git server
      - id: debug-statements         # check for python debug statements (import pdb, 

Most of the general cleanliness checks are from hooks built by `pre-commit`.

See the comments and links in the `.pre-commit-config.yaml` for more:

In [None]:
!cat .pre-commit-config.yaml | grep repos -A 15

repos:
  # a set of useful Python-based pre-commit hooks
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.1.0
    hooks:
      # list of definitions and supported hooks: https://pre-commit.com/hooks.html
      - id: trailing-whitespace      # removes any whitespace at the ends of lines
      - id: check-toml               # check toml syntax by loading all toml files
      - id: check-yaml               # check yaml syntax by loading all yaml files
      - id: check-json               # check-json syntax by loading all json files
      - id: check-merge-conflict     # check for files with merge conflict strings
        args: ['--assume-in-merge']  #  and run this check even when not explicitly in a merge
      - id: check-added-large-files  # check that no "large" files have been added
        args: ['--maxkb=10240']      #  where large means 10MB+, as in Hugging Face's git server
      - id: debug-statements         # check for python debug statements (import pdb, 

Let's take a look at the section of the file
that applies most of our Python style enforcement with
[`flake8`](https://flake8.pycqa.org/en/latest/):

In [None]:
!cat .pre-commit-config.yaml | grep "flake8 python" -A 10

  # flake8 python linter with all the fixins
  - repo: https://github.com/PyCQA/flake8
    rev: 3.9.2
    hooks:
      - id: flake8
        exclude: (lab01|lab02|lab03|lab04|lab06|lab07|lab08)
        additional_dependencies: [
          flake8-bandit, flake8-bugbear, flake8-docstrings,
          flake8-import-order, darglint, mypy, pycodestyle, pydocstyle]
        args: ["--config", ".flake8"]
    # additional configuration of flake8 and extensions in .flake8


The majority of the style checking behavior we want comes from the
`additional_dependencies`, which are
[plugins](https://flake8.pycqa.org/en/latest/glossary.html#term-plugin)
that extend `flake8`'s list of lints.

Notice that we have a `--config` file passed in to the `args` for the `flake8` command.

We keep the configuration information for `flake8`
separate from that for `pre-commit`
in case we want to use additional tools with `flake8`,
e.g. if some developers want to integrate it directly into their editor,
and so that if we change away from `pre-commit`
but keep `flake8` we don't have to
recreate our configuration in a different tool.

As much as possible, codebases should strive for single sources of truth
and link back to those sources of truth with documentation or comments,
as in the last line above.

Let's take a look at the contents of `.flake8`:

In [None]:
!cat .flake8

[flake8]
select = ANN,B,B9,BLK,C,D,E,F,I,S,W
  # only check selected error codes
max-complexity = 12
  # C9 - flake8 McCabe Complexity checker -- threshold
max-line-length = 120
  # E501 - flake8 -- line length too long, actually handled by black
extend-ignore =
  # E W - flake8 PEP style check
    E203,E402,E501,W503,  # whitespace, import, line length, binary operator line breaks
  # S - flake8-bandit safety check
    S101,S113,S311,S105,  # assert removed in bytecode, no request timeout, pRNG not secure, hardcoded password
  # ANN - flake8-annotations type annotation check
    ANN,ANN002,ANN003,ANN101,ANN102,ANN202,  # ignore all for now, but always ignore some
  # D1 - flake8-docstrings docstring style check
    D100,D102,D103,D104,D105,  # missing docstrings
  # D2 D4 - flake8-docstrings docstring style check
    D200,D205,D400,D401,  # whitespace issues and first line content
  # DAR - flake8-darglint docstring correctness check
    DAR103,  # mismatched or missing type in docstr

There's a lot here! We'll focus on the most important bits.

Linting tools in Python generally work by emitting error codes
with one or more letters followed by three digits.
The `select` argument picks which error codes we want to check for.
Error codes are matched by prefix,
so for example `B` matches `BTS101` and
`G1` matches `G102` and `G199` but not `ARG404`.
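
As a rough illustration of prefix matching, the sketch below uses a hypothetical helper, `matches_any_prefix`, which is not part of `flake8`. The real tool applies additional rules when `select` and `ignore` overlap, so treat this as a simplification.

```python
from typing import Iterable


def matches_any_prefix(code: str, prefixes: Iterable[str]) -> bool:
    """Return True if the error code starts with any of the given prefixes."""
    return any(code.startswith(prefix) for prefix in prefixes)


# hypothetical selections, mirroring the examples in the text
print(matches_any_prefix("BTS101", ["B"]))   # `B` is a prefix of `BTS101`
print(matches_any_prefix("G102", ["G1"]))    # `G1` is a prefix of `G102`
print(matches_any_prefix("ARG404", ["G1"]))  # `G1` appears in `ARG404`, but not as a prefix
```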

Certain codes are `ignore`d in the default `flake8` style,
which is done via the `ignore` argument,
and we can `extend` the list of `ignore`d codes with `extend-ignore`.
For example, we rely on `black` to do our formatting,
so we ignore some of `flake8`'s formatting codes.

Together, these settings define our project's particular style.

But not every file fits this style perfectly.
Most of the conventions in `black` and `flake8` come from the style-defining
[Python Enhancement Proposal 8](https://peps.python.org/pep-0008/),
which exhorts you to "know when to be inconsistent".

To allow ourselves to be inconsistent when we know we should be,
`flake8` includes `per-file-ignores`,
which let us ignore specific warnings in specific files.
This is one of the "escape valves"
that makes style enforcement tolerable.
We can also `exclude` files in the `pre-commit` config itself.

For details on selecting and ignoring,
see the [`flake8` docs](https://flake8.pycqa.org/en/latest/user/violations.html).

For definitions of the error codes from `flake8` itself,
see the [list in the docs](https://flake8.pycqa.org/en/latest/user/error-codes.html).
Individual extensions list their added error codes in their documentation,
e.g. `darglint` does so
[here](https://github.com/terrencepreilly/darglint#error-codes).

The remainder are configurations for the other `flake8` plugins that we use to define and enforce the rest of our style.

You can read more about each in their documentation:
- [`flake8-import-order`](https://github.com/PyCQA/flake8-import-order) for checking imports
- [`flake8-docstrings`](https://github.com/pycqa/flake8-docstrings) for docstring style
- [`darglint`](https://github.com/terrencepreilly/darglint) for docstring completeness
- [`flake8-annotations`](https://github.com/sco1/flake8-annotations) for type annotations

### Linting via a script and using `shellcheck`

To avoid needing to think about `pre-commit`
(was the command `pre-commit run` or `pre-commit check`?)
while developing locally,
we might put our linters into a shell script:

In [None]:
!cat tasks/lint.sh

#!/bin/bash
set -uo pipefail
set +e

FAILURE=false

# apply automatic formatting
echo "black"
pre-commit run black || FAILURE=true

# check for python code style violations, see .flake8 for details
echo "flake8"
pre-commit run flake8 || FAILURE=true

# check for shell scripting style violations and common bugs
echo "shellcheck"
pre-commit run shellcheck || FAILURE=true

# check python types
echo "mypy"
pre-commit run mypy || FAILURE=true

if [ "$FAILURE" = true ]; then
  echo "Linting failed"
  exit 1
fi
echo "Linting passed"
exit 0


These kinds of short and simple shell scripts are common in projects
of intermediate size.

They are useful for adding automation and reducing friction.

But these scripts are code,
and all code is susceptible to bugs and subject to concerns of style consistency.

We can't check these scripts with tools that lint Python code,
so we include a shell script linting tool,
[`shellcheck`](https://www.shellcheck.net/),
in our `pre-commit`.

More so than checking for correct style,
this tool checks for common bugs or surprising behaviors of shells,
which are unfortunately numerous.

In [None]:
script_filename = "tasks/lint.sh"
!pre-commit run shellcheck --files {script_filename}

shellcheck...............................................................Passed


That script has already been tested, so we don't see any errors.

Try copying over a script you've written yourself or
even from a popular repo that you like
(by adding to the notebook directory or by making a cell
with `%%writefile` at the top)
and test it by changing the `script_filename`.

You'd be surprised at the classes of subtle bugs possible in bash!

### Try "unofficial bash strict mode" for louder failures in scripts

Another way to reduce bugs is to use the suggested "unofficial bash strict mode" settings by
[@redsymbol](https://twitter.com/redsymbol),
which appear at the top of the script:

In [None]:
!head -n 3 tasks/lint.sh

#!/bin/bash
set -uo pipefail
set +e


The core idea of strict mode is to fail more loudly.
This is a desirable behavior of scripts,
like the ones we're writing,
even though it's an undesirable behavior for an interactive shell --
it would be unpleasant to be logged out every time you hit an error.

`set -u` means scripts fail if a variable's value is `u`nset,
i.e. not defined.
Otherwise bash is perfectly happy to allow you to reference undefined variables.
The result is just an empty string, which can lead to maddeningly weird behavior.

`set -o pipefail` means failures inside a pipe of commands (`|`) propagate,
rather than using the exit code of the last command.
Unix tools are perfectly happy to work on nonsense input,
like sorting error messages, instead of the filenames you meant to send.
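
To see the difference these two settings make, the sketch below runs tiny bash snippets from Python and compares exit codes. It assumes `bash` is available on the `PATH`; the helper `exit_code` and the variable name `FSDL_EXAMPLE_UNDEFINED` are made up for illustration.

```python
import subprocess


def exit_code(script: str) -> int:
    """Run a bash snippet and return its exit code, hiding its output."""
    return subprocess.run(["bash", "-c", script], capture_output=True).returncode


# without `set -u`, an undefined variable silently expands to an empty string
lenient_unset = exit_code('echo "value: $FSDL_EXAMPLE_UNDEFINED"')
# with `set -u`, referencing the undefined variable is an error
strict_unset = exit_code('set -u; echo "value: $FSDL_EXAMPLE_UNDEFINED"')

# without `pipefail`, a pipeline reports only the exit code of its last command
lenient_pipe = exit_code("false | true")
# with `pipefail`, the failure of `false` propagates to the pipeline's exit code
strict_pipe = exit_code("set -o pipefail; false | true")

print(lenient_unset, strict_unset, lenient_pipe, strict_pipe)
```

The lenient variants exit with `0` even though something went wrong, while the strict variants exit with a non-zero code that a calling script can react to.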

You can read more about these choices
[here](http://redsymbol.net/articles/unofficial-bash-strict-mode/),
including considerations for working with other non-conforming scripts in "strict mode"
and for handling resource teardown when scripts error out.

# Testing ML Codebases

## Testing Python code with `pytest`


ML codebases are Python first and foremost, so first let's get some Python tests going.

At a basic level,
we can write functions that `assert`
that our code behaves as expected in
a given scenario and include it in the same module.
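
As a minimal sketch of this pattern, the block below defines a small function and an `assert`-based test next to it. Both `count_mismatches` and `test_count_mismatches` are hypothetical examples, not code from the `text_recognizer` package.

```python
def count_mismatches(prediction: str, target: str) -> int:
    """Count positions where two equal-length strings differ (hypothetical helper)."""
    if len(prediction) != len(target):
        raise ValueError("strings must have equal length")
    return sum(p != t for p, t in zip(prediction, target))


def test_count_mismatches():
    """Check a perfect match, a single error, and an invalid input."""
    assert count_mismatches("hello", "hello") == 0
    assert count_mismatches("hellp", "hello") == 1
    try:
        count_mismatches("hi", "hello")
    except ValueError:
        pass  # the expected outcome for mismatched lengths
    else:
        raise AssertionError("expected a ValueError")


test_count_mismatches()  # runs silently when every assertion holds
```

Because the test's name starts with `test_`, a test runner can also discover and run it without the explicit call on the last line.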

In [15]:
from text_recognizer.lit_models.metrics import test_character_error_rate

test_character_error_rate??

The standard tool for testing Python code is
[`pytest`](https://docs.pytest.org/en/7.1.x/).

We can use it as a command-line tool in a variety of ways,
including to execute these kinds of tests.

If passed a filename, `pytest` will look for
any classes that start with `Test` or
any functions that start with `test_` and run them.

In [16]:
!pytest text_recognizer/lit_models/metrics.py

platform linux -- Python 3.10.12, pytest-7.1.1, pluggy-1.3.0
rootdir: /content/fsdl-text-recognizer-2022-labs, configfile: pyproject.toml
plugins: cov-3.0.0, typeguard-2.13.3, anyio-3.7.1
collected 1 item

text_recognizer/lit_models/metrics.py .                                                       [100%]

---------- coverage: platform linux, python 3.10.12-final-0 ----------
Name                                             Stmts   Miss Branch BrPart  Cover
----------------------------------------------------------------------------------
text_recognizer/__init__.py                          0      0      0      0   100%
text_recognizer/callbacks/__init__.py                4      4      0      0     0%
text_recognizer/callbacks/imtotext.py               72     72     32      0     0%
text_recognizer/callbacks/model.py                  62     62     18      0     0%
text_recognizer/cal

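After the results of the tests (pass or fail) are returned,
you'll see a "coverage" report,
generated locally by the `pytest-cov` plugin (listed as `cov-3.0.0` in the output above)
and suitable for upload to services like [`codecov`](https://about.codecov.io/).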

This coverage report tells us which files and how many lines in those files
were touched by the testing suite.

We do not actually need to provide the names of files with tests in them to `pytest`
in order for it to run our tests.

By default, `pytest` looks for any files named `test_*.py` or `*_test.py`.

It's [good practice](https://docs.pytest.org/en/7.1.x/explanation/goodpractices.html#test-discovery)
to separate these from the rest of your code
in a folder or folders named `tests`,
rather than scattering them around the repo.

In [17]:
!ls text_recognizer/tests

__pycache__  test_callback_utils.py  test_iam.py


Let's take a look at a specific example:
the tests for some of our utilities around
custom PyTorch Lightning `Callback`s.

In [18]:
from text_recognizer.tests import test_callback_utils


test_callback_utils.__doc__

'Tests for the text_recognizer.callbacks.util module.'

Notice that we can easily import this as a module!

That's another benefit of organizing tests into specialized files.

The particular utility we're testing
here is designed to prevent crashes:
it checks for a particular type of error and turns it into a warning.

In [19]:
from text_recognizer.callbacks.util import check_and_warn

check_and_warn??

Error-handling code is a common cause of bugs,
a fact discovered
[again and again across forty years of error analysis](https://twitter.com/full_stack_dl/status/1561880960886505473?s=20&t=5OZBonILaUJE9J4ah2Qn0Q),
so it's very important to test it well!

We start with a very basic test,
which does not touch anything
outside of the Python standard library,
even though this tool is intended to be used
with more complex features of third-party libraries,
like `wandb` and `tensorboard`.

In [20]:
test_callback_utils.test_check_and_warn_simple??

Here, we are just testing the core logic.
This test won't catch many bugs,
but when it does fail, something has gone seriously wrong.

These kinds of tests are important for resolving a bug:
we learn nearly as much from the tests that passed
as we did from the tests that failed.
If this test has failed, possibly along with others,
we can rule out an issue in one of the large external codebases
touched in the other tests, saving us lots of time in our troubleshooting.

The reasoning for the test is explained in the docstrings,
which are close to the code.

Your test suite should be as welcoming
as the rest of your codebase!
The people reading it, for example yourself in six months,
are likely upset and in need of some kindness.

More practically, we want to keep our time to resolve errors as short as possible,
and five minutes to write a good docstring now
can save five minutes during an outage, when minutes really matter.

That basic test is a start, but it's not enough by itself.
There's a specific error case that triggered the addition of this code.

So we test that it's handled as expected.

In [None]:
test_callback_utils.test_check_and_warn_tblogger??

That test can fail if the libraries change around our code,
i.e. if the `TensorBoardLogger` gets a `log_table` method.

We want to be careful when making assumptions
about other people's software,
especially for fast-moving libraries like Lightning.
If we test that those assumptions hold willy-nilly,
we'll end up with tests that fail because of
harmless changes in our dependencies.

Tests that require a ton of maintenance and updating
without leading to code improvements soak up
more engineering time than they save
and cause distrust in the testing suite.

We include this test because `TensorBoardLogger` getting
a `log_table` method will _also_ change the behavior of our code
in a breaking way, and we want to catch that before it breaks
a model training job.

Adding error handling can also accidentally kill the "happy path"
by raising an error incorrectly.

So we explicitly test the _absence of an error_,
not just its presence:

In [None]:
test_callback_utils.test_check_and_warn_wandblogger??

There are more tests we could build, e.g. manipulating classes and testing the behavior,
testing more classes that might be targeted by `check_and_warn`, or
asserting that warnings are raised to the command line.

But these three basic tests are likely to catch most changes that would break our code here,
and they're a lot easier to write than the others.

If this utility starts to get more usage and become a critical path for lots of features, we can always add more!
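
The pairing of a failure-path test with a happy-path test can be sketched with the standard library alone. Here `warn_if_missing` and the two logger classes are hypothetical stand-ins; they are not the repo's `check_and_warn` or any real logger class.

```python
import warnings


def warn_if_missing(obj, attribute: str) -> bool:
    """Warn and return True if obj lacks the attribute (hypothetical stand-in)."""
    if not hasattr(obj, attribute):
        warnings.warn(f"{type(obj).__name__} has no attribute {attribute!r}")
        return True
    return False


class LoggerWithTable:
    def log_table(self):
        pass


class LoggerWithoutTable:
    pass


def test_warns_when_missing():
    """The error case: a warning is emitted and the problem is reported."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        assert warn_if_missing(LoggerWithoutTable(), "log_table")
    assert len(caught) == 1


def test_silent_on_happy_path():
    """The happy path: no warning and no reported problem."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        assert not warn_if_missing(LoggerWithTable(), "log_table")
    assert caught == []


test_warns_when_missing()
test_silent_on_happy_path()
```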

## Interleaving testing and documentation with `doctest`s

One function of tests is to build user/reader confidence in code.

One function of documentation is to build user/reader knowledge of code.

These functions are related. Let's put them together:
put code in a docstring and test that code.

This feature is part of the
Python standard library via the
[`doctest` module](https://docs.python.org/3/library/doctest.html).

Here's an example from our `torch` utilities.

The `first_appearance` function can be used to
e.g. quickly look for stop tokens,
giving the length of each sequence.

In [None]:
from text_recognizer.lit_models.util import first_appearance


first_appearance??

Notice that in the "Examples" section,
there's a short block of code formatted as a
Python interpreter session,
complete with outputs.

We can copy and paste that code and
check that we get the right outputs:

In [None]:
import torch


first_appearance(torch.tensor([[1, 2, 3], [2, 3, 3], [1, 1, 1], [3, 1, 1]]), 3)

We can run the test with `pytest` by passing a command line argument,
`--doctest-modules`:

In [21]:
!pytest --doctest-modules text_recognizer/lit_models/util.py

platform linux -- Python 3.10.12, pytest-7.1.1, pluggy-1.3.0
rootdir: /content/fsdl-text-recognizer-2022-labs, configfile: pyproject.toml
plugins: cov-3.0.0, typeguard-2.13.3, anyio-3.7.1
collected 2 items

text_recognizer/lit_models/util.py ..                                                         [100%]

---------- coverage: platform linux, python 3.10.12-final-0 ----------
Name                                             Stmts   Miss Branch BrPart  Cover
----------------------------------------------------------------------------------
text_recognizer/__init__.py                          0      0      0      0   100%
text_recognizer/callbacks/__init__.py                4      4      0      0     0%
text_recognizer/callbacks/imtotext.py               72     72     32      0     0%
text_recognizer/callbacks/model.py                  62     62     18      0     0%
text_recog

With the
[right configuration](https://github.com/full-stack-deep-learning/fsdl-text-recognizer-2022/blob/627dc9dabc9070cb14bfe5bfcb1d6131eb7dc7a8/pyproject.toml#L12-L17),
running `doctest`s happens automatically
when `pytest` is invoked.
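
To make the mechanics concrete, here is a minimal sketch that uses only the standard library's `doctest` module. The function `first_index` is a hypothetical pure-Python analogue written for this example; it is not the repo's `first_appearance`.

```python
import doctest


def first_index(values, target):
    """Return the index of the first occurrence of target, or len(values) if absent.

    Examples
    --------
    >>> first_index([1, 2, 3], 3)
    2
    >>> first_index([1, 1, 1], 3)
    3
    """
    for index, value in enumerate(values):
        if value == target:
            return index
    return len(values)


# parse the interpreter-style examples out of the docstring and execute them
parser = doctest.DocTestParser()
test = parser.get_doctest(
    first_index.__doc__, {"first_index": first_index}, "first_index", None, 0
)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results)  # counts of failed and attempted examples
```

If the documented output ever drifts from the real behavior, the failed count becomes non-zero, so the documentation cannot silently go stale.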

## Basic tests for data code

ML code can be hard to test
since it involves very heavy artifacts, like models and data,
and very expensive jobs, like training.

For testing our data-handling code in the FSDL codebase,
we mostly just use `assert`s,
which throw errors when behavior differs from expectation:

In [22]:
!grep "assert" -r text_recognizer/data

text_recognizer/data/iam.py:    assert any(region is not None for region in line_regions), "Line regions cannot be None"
text_recognizer/data/iam_lines.py:        self.input_dims = metadata.DIMS  # We assert that this is correct in setup()
text_recognizer/data/iam_lines.py:        self.output_dims = metadata.OUTPUT_DIMS  # We assert that this is correct in setup()
text_recognizer/data/iam_lines.py:            assert image_width <= metadata.IMAGE_WIDTH
text_recognizer/data/iam_lines.py:            assert self.output_dims[0] >= max([len(_) for _ in labels_train]) + 2  # Add 2 for start/end tokens.
text_recognizer/data/iam_lines.py:            assert self.output_dims[0] >= max([len(_) for _ in labels_val]) + 2  # Add 2 for start/end tokens.
text_recognizer/data/iam_lines.py:            assert self.output_dims[0] >= max([len(_) for _ in labels_test]) + 2
text_recognizer/data/iam_lines.py:    assert len(crops) == len(labels)
text_recognizer/data/iam_lines.py:    assert len(crops) == len(lab

This isn't great practice,
especially as a codebase grows,
because we can't easily know when these are executed
or incorporate them into
testing automation and coverage analysis tools.

So it's preferable to collect up these assertions of simple data properties
into tests that are run like our other tests.
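
As a minimal sketch of what such a collected test could look like, the block below checks that no ID appears in more than one data split. The helper `find_leaks` and the example IDs are hypothetical and are not taken from the IAM dataset or the repo's tests.

```python
from itertools import combinations


def find_leaks(splits):
    """Map each overlapping pair of split names to the IDs they share."""
    leaks = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            leaks[(name_a, name_b)] = shared
    return leaks


def test_no_leakage():
    # hypothetical example IDs, chosen so that the splits are disjoint
    splits = {"train": ["a01", "a02", "a03"], "val": ["b01", "b02"], "test": ["c01"]}
    assert find_leaks(splits) == {}


test_no_leakage()
```

Returning the offending IDs, rather than just a boolean, makes a failure easier to diagnose.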

The test below checks whether any data is leaking
between training, validation, and testing.

In [23]:
from text_recognizer.tests.test_iam import test_iam_data_splits


test_iam_data_splits??

Notice that we were able to load the test into the notebook
because it is in a module,
and so we can run it here as well:

In [24]:
test_iam_data_splits()

Downloading raw dataset from https://s3-us-west-2.amazonaws.com/fsdl-public-assets/iam/iamdb.zip to /content/fsdl-text-recognizer-2022-labs/data/downloaded/iam/iamdb.zip...


586MB [00:44, 13.9MB/s]                           


Computing SHA-256...
Extracting IAM data


But we're checking something pretty simple here,
so the new code in each test is just a single line.

What if we wanted to test more complex properties,
like comparing rows or calculating statistics?

We'll end up writing more complex code that might itself have subtle bugs,
requiring tests for our tests and suffering from
"tester's regress".

This is the phenomenon,
named by analogy with
[experimenter's regress](https://en.wikipedia.org/wiki/Experimenter%27s_regress)
in sociology of science,
where the validity of our tests is itself up for dispute.
The dispute can only be resolved by testing the tests,
but those tests are themselves possibly invalid.

We cut this Gordian knot by using
a library or framework that is well-tested.

We recommend checking out
[`great_expectations`](https://docs.greatexpectations.io/docs/)
if you're looking for a high-quality data testing tool.

Especially with data, some tests are particularly "heavy" --
they take a long time,
and we might want to run them
on different machines
and on a different schedule
than our other tests.

For example, consider testing whether the download of a dataset succeeds and gives the right checksum.

We can't just use a cached version of the data,
since that won't actually execute the code!

This test will take
as long to run
and consume as many resources as
a full download of the data.

`pytest` allows the separation of tests
into suites with `mark`s,
which "tag" tests with names.
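
A minimal sketch of tagging a test follows, assuming `pytest` is installed. The test function is a hypothetical placeholder; the mark names `slow` and `data` are the ones this project registers.

```python
import pytest


@pytest.mark.slow
@pytest.mark.data
def test_download_checksum():
    """Placeholder for a test that would download data and verify its checksum."""
    assert True


# marks are stored on the function, where pytest's `-m` expressions can find them
print(sorted(mark.name for mark in test_download_checksum.pytestmark))
```

Marks should be registered in the project's `pytest` configuration so that typos in mark names surface as warnings rather than silently creating new marks.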

In [25]:
!pytest --markers | head -n 10

@pytest.mark.slow: marks a test as slow (deselect with '-m "not slow"']

@pytest.mark.data: marks a test as dependent on a data download (deselect with '-m "not data"')

@pytest.mark.anyio: mark the (coroutine function) test to be run asynchronously via anyio.

@pytest.mark.no_cover: disable coverage for this test.




We can choose to run tests with a given mark
or to skip tests with a given mark,
among other basic logical operations around combining and filtering marks,
with `-m`:

In [27]:
!wandb login {key} # one test requires wandb authentication

!pytest -m "not data and not slow"

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
platform linux -- Python 3.10.12, pytest-7.1.1, pluggy-1.3.0
rootdir: /content/fsdl-text-recognizer-2022-labs, configfile: pyproject.toml
plugins: cov-3.0.0, typeguard-2.13.3, anyio-3.7.1
collected 7 items

text_recognizer/lit_models/util.py ..                                                         [ 28%]
text_recognizer/tests/test_callback_utils.py ...                                              [ 71%]
text_recognizer/tests/test_iam.py ..                                                          [100%]

---------- coverage: platform linux, python 3.10.12-final-0 ----------
Name                                             Stmts   Miss Branch BrPart  Cover
------------------------------------------------------------------------------

## Testing training with memorization tests

Training is the process by which we convert inert data into executable models,
so it is dependent on both.

We decouple checking whether the script has a critical bug
from whether the data or model code is broken
by testing on some basic "fake data",
based on a utility from `torchvision`.

In [28]:
from text_recognizer.data import FakeImageData


FakeImageData.__doc__

'Fake images dataset.'

We then test on the actual data with a smaller version of the real model.

We use the Lightning `--fast_dev_run` feature,
which sets the number of training, validation, and test batches to `1`.

We use a smaller version so that this test can run in just a few minutes
on a CPU without acceleration.

That allows us to run our tests in environments without GPUs,
which saves on costs for executing tests.

Here's the script:

In [29]:
!cat training/tests/test_run_experiment.sh

#!/bin/bash
set -uo pipefail
set +e

FAILURE=false

echo "running full loop test with CNN on fake data"
python training/run_experiment.py --data_class=FakeImageData --model_class=CNN --conv_dim=2 --fc_dim=2 --loss=cross_entropy --num_workers=4 --max_epochs=1 || FAILURE=true

echo "running fast_dev_run test of real model class on real data"
python training/run_experiment.py --data_class=IAMParagraphs --model_class=ResnetTransformer --loss=transformer \
  --tf_dim 4 --tf_fc_dim 2 --tf_layers 2 --tf_nhead 2 --batch_size 2 --lr 0.0001 \
  --fast_dev_run --num_sanity_val_steps 0 \
  --num_workers 1 || FAILURE=true

if [ "$FAILURE" = true ]; then
  echo "Test for run_experiment.py failed"
  exit 1
fi
echo "Tests for run_experiment.py passed"
exit 0


In [30]:
! ./training/tests/test_run_experiment.sh

running full loop test with CNN on fake data
2023-11-07 10:48:01.213388: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-07 10:48:01.213443: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-07 10:48:01.213483: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Missing logger folder: training/logs/lightning_logs
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available:

The above tests don't actually check
whether any learning occurs;
they just check
whether training runs mechanically,
without any errors.

We also need a
["smoke test"](https://en.wikipedia.org/wiki/Smoke_testing_(software))
for learning.
For that we recommend checking whether
the model can learn the right
outputs for a single batch --
to "memorize" the outputs for
a particular input.

This memorization test won't
catch every bug or issue in training
(catching them all is notoriously difficult),
but it will flag
some of the most serious issues.
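
To make the idea concrete, here is a minimal sketch of a memorization test in plain PyTorch. It is not the lab's test: the tiny model, the random fake batch, the function name `can_memorize`, and the default thresholds are all assumptions chosen so that it runs in seconds on a CPU.

```python
import torch
from torch import nn


def can_memorize(max_epochs: int = 500, criterion: float = 0.5) -> bool:
    """Train on one fixed batch and report whether the loss fell below `criterion`."""
    torch.manual_seed(0)  # make the toy run repeatable
    xs = torch.randn(8, 16)  # a single batch of 8 fake inputs (hypothetical data)
    ys = torch.randint(0, 3, (8,))  # arbitrary labels: nothing to generalize, only memorize

    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 3))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    # every "epoch" sees the same batch, as with overfitting on a single batch
    for _ in range(max_epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(xs), ys)
        loss.backward()
        optimizer.step()

    final_loss = loss_fn(model(xs), ys).item()
    return final_loss < criterion
```

If a check like this fails on a model that ought to fit eight examples easily, that points to a serious bug, such as a loss that isn't connected to the parameters or an optimizer that isn't updating the weights.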

The script below runs a memorization test.

It takes up to two arguments:
a `MAX`imum number of `EPOCHS` to run for and
a `CRITERION` value of the loss to test against.

The test passes if the loss is lower than the `CRITERION` value
after the `MAX`imum number of `EPOCHS` has passed.

The important line in this script is the one that invokes our training script,
`training/run_experiment.py`.

The arguments to `run_experiment` have been tuned for maximum possible speed:
turning off regularization, shrinking the model,
and skipping parts of Lightning that we don't want to test.

In [31]:
!cat training/tests/test_memorize_iam.sh

#!/bin/bash
set -uo pipefail
set +e

# tests whether we can achieve a criterion loss
#  on a single batch within a certain number of epochs

FAILURE=false

# constants and CLI args set by aiming for <5 min test on commodity GPU,
#   including data download step
MAX_EPOCHS="${1:-100}"  # syntax for basic optional arguments in bash
CRITERION="${2:-1.0}"

# train on GPU if it's available
GPU=$(python -c 'import torch; print(int(torch.cuda.is_available()))')

python ./training/run_experiment.py \
  --data_class=IAMParagraphs --model_class=ResnetTransformer --loss=transformer \
  --limit_test_batches 0.0 --overfit_batches 1 --num_sanity_val_steps 0 \
  --augment_data false --tf_dropout 0.0 \
  --gpus "$GPU" --precision 16 --batch_size 16 --lr 0.0001 \
  --log_every_n_steps 25 --max_epochs "$MAX_EPOCHS"  --num_workers 2 --wandb || FAILURE=true

python -c "import json; loss = json.load(open('training/logs/wandb/latest-run/files/wandb-summary.json'))['train/loss']; assert loss < $CRITERION" ||
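
The final step of the script reads the last logged training loss from the summary file that W&B writes and compares it to the criterion. Unpacked into a small function, that check might look like the sketch below. The function name `loss_below_criterion` and the stand-in summary file are assumptions for illustration; the `train/loss` key and the summary path in the comment come from the script.

```python
import json
import tempfile
from pathlib import Path


def loss_below_criterion(summary_path, criterion, key="train/loss"):
    """Return True if the value logged under `key` is below `criterion`."""
    with open(summary_path) as f:
        summary = json.load(f)
    return summary[key] < criterion


# demo on a stand-in summary file; a real run would point at
# training/logs/wandb/latest-run/files/wandb-summary.json
with tempfile.TemporaryDirectory() as tmp:
    fake_summary = Path(tmp) / "wandb-summary.json"
    fake_summary.write_text(json.dumps({"train/loss": 0.03}))
    passed = loss_below_criterion(fake_summary, criterion=0.05)
    failed = loss_below_criterion(fake_summary, criterion=0.01)
```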

If you'd like to see what a memorization run looks like,
flip the `running_memorization` flag to `True`
and watch the results stream in to W&B.

The cell should run in about ten minutes on a commodity GPU.

In [32]:
%%time
running_memorization = False

if running_memorization:
    max_epochs = 1000
    loss_criterion = 0.05
    !./training/tests/test_memorize_iam.sh {max_epochs} {loss_criterion}

2023-11-07 10:50:51.592381: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-07 10:50:51.592442: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-07 10:50:51.592481: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[34m[1mwandb[0m: Currently logged in as: [33msaicmsaicm[0m. Use [1m`wandb login --relogin`[0m to force relogin
2023-11-07 10:50:57.743957: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-07 10:50:57.744009: E te

# Troubleshooting model speed with the PyTorch Profiler

Testing code is only half the story here:
we also need to fix the issues that our tests flag.
This is the process of troubleshooting.

In this lab,
we'll focus on troubleshooting model performance issues:
what to do when your model runs too slowly.

Troubleshooting deep neural networks for speed is challenging.

There are at least three different common approaches,
each with an increasing level of skill required:

1. Follow best practices advice from others
([this @karpathy tweet](https://t.co/7CIDWfrI0J), summarizing
[this NVIDIA talk](https://www.youtube.com/watch?v=9mS1fIYj1So&ab_channel=ArunMallya), is a popular place to start) and use existing implementations.
2. Take code that runs slowly and use empirical observations to iteratively improve it.
3. Truly understand distributed, accelerated tensor computations so you can write code correctly from scratch the first time.

For the full stack deep learning engineer,
the final level is typically out of reach,
unless you're specializing in the model performance
part of the stack in particular.

So we recommend reaching the middle level,
and this segment of the lab walks through the
tools that make this easier.

Because neural network training involves GPU acceleration,
generic Python profiling tools like
[`py-spy`](https://github.com/benfred/py-spy)
won't work, and
we'll need tools specialized for tracing and profiling DNN training.

In general, these tools are for observing what happens while your code is executing:
_tracing_ which operations were happening when and summarizing that into a _profile_ of the code.

Because they help us observe the execution in detail,
they will also help us understand just what is going on during
a PyTorch training step in greater detail.

To support profiling and tracing,
we've added a new argument to `training/run_experiment.py`, `--profile`:

In [33]:
!python training/run_experiment.py --help | grep -A 1 -e "^\s*--profile\s"

2023-11-07 11:04:46.634432: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-07 11:04:46.634491: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-07 11:04:46.634530: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
  --profile             If passed, uses the PyTorch Profiler to track computation, exported as a
                        Chrome-style trace.


As with experiment management, this relies mostly on features of PyTorch Lightning,
which themselves wrap core utilities from libraries like PyTorch and TensorBoard,
and we just add a few lines of customization:

In [34]:
!cat training/run_experiment.py | grep args.profile -A 5

    if args.profile:
        sched = torch.profiler.schedule(wait=0, warmup=3, active=4, repeat=0)
        profiler = pl.profiler.PyTorchProfiler(export_to_chrome=True, schedule=sched, dirpath=experiment_dir)
        profiler.STEP_FUNCTIONS = {"training_step"}  # only profile training
    else:
        profiler = pl.profiler.PassThroughProfiler()


For more on profiling with Lightning, see the
[Lightning tutorial](https://pytorch-lightning.readthedocs.io/en/1.6.1/advanced/profiler.html).
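
The `schedule` in the snippet above decides which training steps are actually recorded. As a rough mental model, each step falls into a repeating cycle of `wait`, then `warmup`, then `active` phases. The helper below, `profiler_phase`, is hypothetical and only mimics that cycle in plain Python. It is not the PyTorch implementation, and a wrapper like Lightning's may stop profiling after the first cycle.

```python
def profiler_phase(step, wait=0, warmup=3, active=4):
    """Phase of a repeating wait -> warmup -> active cycle for a given step."""
    cycle_length = wait + warmup + active
    position = step % cycle_length
    if position < wait:
        return "wait"  # profiler is idle
    if position < wait + warmup:
        return "warmup"  # profiler is running, but results are discarded
    return "active"  # profiler is running and results are recorded


# with the values from the snippet: three warmup steps, then four recorded steps
phases = [profiler_phase(step) for step in range(7)]
```

Under this model, the first three steps are thrown away and the next four are recorded, which would be consistent with the four `ProfilerStep*` calls in the profiler report.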

The cell below runs an epoch of training with tracing and profiling turned on
and then saves the results locally and to W&B.

In [35]:
import glob

import torch
import wandb

from text_recognizer.data.base_data_module import DEFAULT_NUM_WORKERS


# make it easier to separate these from training runs
%env WANDB_JOB_TYPE=profile

batch_size = 16
num_workers = DEFAULT_NUM_WORKERS  # change this number later and see how the results change
gpus = 1  # must be run with accelerator

%run training/run_experiment.py --wandb --profile \
  --max_epochs=1 \
  --num_sanity_val_steps=0 --limit_val_batches=0 --limit_test_batches=0 \
  --model_class=ResnetTransformer --data_class=IAMParagraphs --loss=transformer \
  --batch_size={batch_size} --num_workers={num_workers} --precision=16 --gpus={gpus}

latest_expt = wandb.run

try:  # add execution trace to logged and versioned binaries
    folder = wandb.run.dir
    trace_matcher = folder + "/*.pt.trace.json"
    trace_file = glob.glob(trace_matcher)[0]
    trace_at = wandb.Artifact(name=f"trace-{wandb.run.id}", type="trace")
    trace_at.add_file(trace_file, name="training_step.pt.trace.json")
    wandb.log_artifact(trace_at)
except IndexError:
    print("trace not found")

wandb.finish()

env: WANDB_JOB_TYPE=profile


ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33msaicmsaicm[0m. Use [1m`wandb login --relogin`[0m to force relogin


[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
INFO:pytorch_lightning.utilities.rank_zero:Using 16bit native Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True, used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.utilities.rank_zero:IAMParagraphs.setup(fit): Loading IAM paragraph regions and lines...
INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
   | Name                      | Type         

Model State Dict Disk Size: 56.07 MB


Training: 0it [00:00, ?it/s]

INFO:pytorch_lightning.profiler.profiler:FIT Profiler Report
Profile stats for: records
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          ProfilerStep*        13.08%     326.220ms        49.42%        1.233s     308.139ms       0.000us         0.00%      99.168ms      24.792ms             4  
                        [pl][profile]run_training_batch         0.42%      10.504ms        38.44%     958.832ms     239


Run history:
epoch,▁
optimizer/lr-Adam,▁▁
size/mb_disk,▁
size/nparams,▁
train/loss,▁
trainer/global_step,▁▁███

Run summary:
epoch,0.0
optimizer/lr-Adam,0.001
size/mb_disk,56.06506
size/nparams,13988756.0
train/loss,3.18367
trainer/global_step,49.0


We get out a table of statistics in the terminal,
courtesy of Lightning.

Each row lists an operation
and provides information,
described in the column headers,
about the time spent on that operation
across all the training steps we profiled.

With practice, some useful information can be read out from this table,
but it's better to start from both a less detailed view,
in the TensorBoard dashboard,
and a more detailed view,
using the Chrome Trace viewer.

## High-level statistics from the PyTorch Profiler in TensorBoard

Let's look at the profiling info in a high-level TensorBoard dashboard, conveniently hosted for us on W&B.

In [36]:
your_tensorboard_url = latest_expt.url + "/tensorboard"

print(your_tensorboard_url)

https://wandb.ai/saicmsaicm/fsdl-text-recognizer-2022-labs-lab05_training/runs/1iq1hx0r/tensorboard


If at any point you run into issues,
like the description not matching what you observe,
check out one of our example runs:

In [None]:
example_tensorboard_url = "https://wandb.ai/cfrye59/fsdl-text-recognizer-2022-training/runs/67j1qxws/tensorboard?workspace=user-cfrye59"
print(example_tensorboard_url)

Once the TensorBoard session has loaded up,
we are dropped into the Overview
(see [this screenshot](https://pytorch.org/tutorials/_static/img/profiler_overview1.png)
for an example).

In the top center, we see the **GPU Summary** for our system.

In addition to the name of our GPU,
there are a few configuration details and top-level statistics.
They are (tersely) documented
[here](https://github.com/pytorch/kineto/blob/main/tb_plugin/docs/gpu_utilization.md).

- **[Compute Capability](https://developer.nvidia.com/cuda-gpus)**:
this is effectively a coarse "version number" for your GPU hardware.
It indexes which features are available,
with more advanced features being available only at higher compute capabilities.
It does not directly index the speed or memory of the GPU.

- **GPU Utilization**: This metric represents the fraction of time an operation (a CUDA kernel) is running on the GPU. This is also reported by the `!nvidia-smi` command or in the system metrics tab in W&B. This metric will be our first target to increase.

- **[Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/)**:
for devices with compute capability of at least 7, you'll see information about how much your execution used DNN-specialized
Tensor Cores.
If you're running on an older GPU without Tensor Cores,
you should consider upgrading.
If you're running a more recent GPU but not seeing Tensor Core usage,
you should switch to reduced precision floating point numbers,
like the 16 bit floats selected by `--precision=16`,
which Tensor Cores are specialized for.

- **Est. SM Efficiency** and **Est. Occupancy** are high-level summaries of the utilization of GPU hardware
at a lower level than just whether something is running at all,
as in utilization.
Unlike utilization, reaching 100% is not generally feasible
and sometimes not desirable.
Increasing these numbers requires expertise in
CUDA programming, so we'll target utilization instead.

- **Execution Summary**: This table and pie chart indicate
how much time within a profiled step
was spent in each category.
The value for "kernel" execution here
is equal to the GPU utilization,
and we want that number to be as close to 100%
as possible.
This summary helps us know which
other operations are taking time,
like memory being copied between CPU and GPU (`memcpy`)
or `DataLoader`s executing on the CPU,
so we can decide where the bottleneck is.

At the very bottom, you'll find a
**Performance Recommendation**
tab that sometimes suggests specific methods for improving performance.

If this tab makes suggestions, you should certainly take them!

For more on using the profiler in TensorBoard,
including some of the other, more detailed views
available via the "Views" dropdown menu, see
[this PyTorch tutorial](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html?highlight=profiler).

## Going deeper with the Chrome Trace Viewer

So far, we've seen summary-level information about our training steps
in the table from Lightning and in the TensorBoard Overview.
These give aggregate statistics about the computations that occurred,
but understanding how to interpret those statistics
and use them to speed up our networks
requires understanding just what is
happening in our training step.

Fundamentally,
all computations are processes that unfold in time.

If we want to really understand our training step,
we need to display it that way:
what operations were occurring,
on both the CPU and GPU,
at each moment in time during the training step.

This information on timing is collected in the trace.
One of the best tools for viewing the trace over time
is the [Chrome Trace Viewer](https://www.chromium.org/developers/how-tos/trace-event-profiling-tool/).

Let's tour the trace we just logged
with an aim to really understanding just
what is happening when we call
`training_step`
and by extension `.forward`, `.backward`, and `optimizer.step`.

The Chrome Trace Viewer is built into W&B,
so we can view our traces in their interface.

The cell below embeds the trace inside the notebook,
but you may wish to open it separately,
with the "Open page" button or by navigating to the URL,
so that you can interact with it
as you read the description below.
Display directly on W&B is also a bit less temperamental
than display on W&B inside a notebook.

Furthermore, note that the Trace Viewer was originally built as part of the Chromium project,
so it works best in browsers in that lineage -- Chrome, Edge, and Opera.
It also can interact poorly with browser extensions (e.g. ad blockers),
so you may need to deactivate them temporarily in order to see it.

In [37]:
trace_files_url = latest_expt.url.split("/runs/")[0] + f"/artifacts/trace/trace-{latest_expt.id}/latest/files/"
trace_url = trace_files_url + "training_step.pt.trace.json"

example_trace_url = "https://wandb.ai/cfrye59/fsdl-text-recognizer-2022-training/artifacts/trace/trace-67j1qxws/latest/files/training_step.pt.trace.json"

print(trace_url)
IFrame(src=trace_url, height=frame_height * 1.5, width="100%")

https://wandb.ai/saicmsaicm/fsdl-text-recognizer-2022-labs-lab05_training/artifacts/trace/trace-1iq1hx0r/latest/files/training_step.pt.trace.json


> **Heads up!** We're about to do a tour of the
> precise details of the tracing information logged
> during the execution of the training code.
> The only way to learn how to troubleshoot model performance
> empirically is to look at the details,
> but the details depend on the precise machine being used
> -- GPU and CPU and RAM.
> That means even within Colab,
> these details change from session to session.
> So if you don't observe a phenomenon or feature
> described in the tour below, check out
> [the example trace](https://wandb.ai/cfrye59/fsdl-text-recognizer-2022-training/artifacts/trace/trace-67j1qxws/latest/files/training_step.pt.trace.json)
> on W&B while reading through the next section of the lab,
> and return to your trace once you understand the trace viewer better at the end.
> Also, these are very much bleeding-edge expert developer tools, so the UX and integrations
> can sometimes be a bit janky.

This trace reveals, in nanosecond-level detail,
what's going on inside of a `training_step`
on both the GPU and the CPU.

Time is on the horizontal axis.
Colored bars represent method calls,
and the methods called by a method are placed underneath it vertically,
a visualization known as an
[icicle chart](https://www.brendangregg.com/flamegraphs.html).
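
The layout rule is simple enough to sketch in a few lines: a call is drawn one row below whichever call was still running when it started. The function `icicle_depths` and the example call names and timings below are made up for illustration. This is not how the Trace Viewer itself is implemented.

```python
def icicle_depths(calls):
    """Assign each (name, start, end) call a row: 0 for top-level, +1 per enclosing call."""
    rows = []
    open_calls = []  # end times of the calls that enclose the current one
    # parents start no later than their children; on ties, the longer call is the parent
    for name, start, end in sorted(calls, key=lambda c: (c[1], -c[2])):
        while open_calls and open_calls[-1] <= start:
            open_calls.pop()  # that call finished before this one began
        rows.append((name, len(open_calls)))
        open_calls.append(end)
    return rows


# hypothetical timings, in arbitrary units
calls = [
    ("training_step", 0, 100),
    ("forward", 5, 60),
    ("resnet", 10, 40),
    ("transformer_decoder", 40, 58),
    ("loss", 60, 70),
]
rows = icicle_depths(calls)
```

Here `resnet` and `transformer_decoder` land on the same row, side by side underneath `forward`, which is the pattern to look for in the real trace.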

Let's orient ourselves with some gross features:
the forwards pass,
GPU kernel execution,
the backwards pass,
and the optimizer step.

### The forwards pass

Type `resnet` into the search bar in the top-right.

This will highlight the first part of the forwards passes we traced, the encoding of the images with a ResNet.

It should be in a vertical block of the trace that says `thread XYZ (python)` next to it.

You can click the arrows next to that tile to partially collapse these blocks.

Next, type in `transformerdecoder` to highlight the second part of our forwards pass.
It should be at roughly the same height.

Clear the search bar so that the trace is in color.
Zoom in on the area of the forwards pass
using the "zoom" tool in the floating toolbar,
so you can see more detail.
The zoom tool is indicated by a two-headed arrow
pointing into and out of the screen.

Switch to the "drag" tool,
represented by a four-headed arrow.
Click-and-hold to use this tool to focus
on different parts of the timeline
and click on the individual colored boxes
to see details about a particular method call.

As we go down in the icicle chart,
we move from a very abstract level in Python ("`resnet`", "`MultiheadAttention`")
to much more precise `cudnn` and `cuda` operations
("`aten::cudnn_convolution`", "`aten::native_layer_norm`").

`aten` ([no relation to the Pharaoh](https://twitter.com/charles_irl/status/1422232585724432392?s=20&t=Jr4j5ZXhV20xGwUVD1rY0Q))
is the tensor math library in PyTorch
that links to specific backends like `cudnn`.

### GPU kernel execution

Towards the bottom, you should see a section labeled "GPU".
The label appears on the far left.

Within it, you'll see one or more "`stream`s".
These are units of work on a GPU,
loosely akin to threads on the CPU.

When there are colored bars in this area,
the GPU is doing work of some kind.
The fraction of this bar that is filled in with color
is the same as the "GPU Utilization %" we've seen previously.
So the first thing to visually assess
in a trace view of PyTorch code
is what fraction of this area is filled with color.
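
That visual estimate corresponds to a simple calculation: take the time intervals during which some kernel was running, merge any overlaps, and divide by the length of the window. The sketch below does this for a plain list of `(start, duration)` pairs. The function name `gpu_busy_fraction` and the example numbers are assumptions for illustration, and the sketch does not read the profiler's trace format.

```python
def gpu_busy_fraction(kernels, window_start, window_end):
    """Fraction of [window_start, window_end) covered by at least one kernel.

    `kernels` is a list of (start, duration) pairs in the same time units.
    """
    # clip each kernel to the window and sort by start time
    spans = sorted(
        (max(start, window_start), min(start + duration, window_end))
        for start, duration in kernels
    )
    busy, covered_until = 0.0, window_start
    for begin, end in spans:
        begin = max(begin, covered_until)  # skip time we already counted
        if end > begin:
            busy += end - begin
            covered_until = end
    return busy / (window_end - window_start)


# two overlapping kernels, then a gap, then one more kernel
utilization = gpu_busy_fraction([(0, 10), (5, 10), (40, 20)], 0, 100)
```

In this made-up example, the stream is busy for 35 of 100 time units, so most of the bar would be empty.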

In CUDA, work is queued up to be
placed into streams and completed, on the GPU,
in a distributed and asynchronous manner.

The selection of which work to do
is happening on the CPU,
and that's what we were looking at above.

The CPU and the GPU have to work together to coordinate
this work.

Type `cuda` into the search bar and you'll see these coordination operations happening:
`cudaLaunchKernel`, for example, is the CPU telling the GPU what to do.

Running the same PyTorch model
with the same high level operations like `Conv2d` in different versions of PyTorch,
on different GPUs, and even on tensors of different sizes will result
in different choices of concrete kernel operation,
e.g. different matrix multiplication algorithms.

Type `sync` into the search bar and you'll see places where either work on the GPU
or work on the CPU needs to await synchronization,
e.g. copying data from the CPU to the GPU
or the CPU waiting to decide what to do next
on the basis of the contents of a tensor.

If you see a "sync" block above an area
where the stream on the GPU is empty,
you've got a performance bottleneck due to synchronization
between the CPU and GPU.

To resolve the bottleneck,
head up the icicle chart until you reach the recognizable
PyTorch modules and operations.
Find where they are called in your PyTorch module.
That's a good place to review your code to understand why the synchronization is happening
and to remove it if it's not necessary.

### The backwards pass

Type `backward` into the search bar.

This will highlight components of our backwards pass.

If you read it from left to right,
you'll see that it begins by calculating the gradient of the loss
(type `NllLoss2DBackward` into the search bar if you can't find it)
and ends by doing a `ConvolutionBackward`
for the first layer of the ResNet.
It is, indeed, backwards.

Like the forwards pass,
the backwards pass also involves the CPU
telling the GPU which kernels to run.
It's typically run in a separate
thread from the forwards pass,
so you'll see it separated out from the forwards pass
in the trace viewer.

Generally, there's no need to specifically optimize the backwards pass --
removing bottlenecks in the forwards pass results in a fast backwards pass.

One reason why is that these two passes are just
"transposes" of one another,
so they share a lot of properties,
and bottlenecks in one become bottlenecks in the other.
We can choose to optimize either one of the two.
But the forwards pass is under our direct control,
so it's easier for us to reason about.

Another reason is that the forwards pass is more likely to have bottlenecks.
The forwards pass is a dynamic process,
with each line of Python adding more to the compute graph.
Backwards passes, on the other hand, use a static compute graph,
the one just defined by the forwards pass,
so more optimizations are possible.

### The optimizer step

Type `Adam.step` into the search bar to highlight the computations of the optimizer.

As with the two passes,
we are still using the CPU
to launch kernels on the GPU.
But now the CPU is looping,
in Python, over the parameters
and applying the Adam update rules to each.

We now know enough to see that
this is not great for our GPU utilization:
there are many areas of gray
in between the colored bars
in the GPU stream in this area.

In the time it takes CUDA to multiply
thousands of numbers,
Python has not yet finished cleaning up
after its request for that multiplication.
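
To see why the loop involves so many small operations, here is a simplified sketch of one Adam step written as a plain Python loop. Scalars stand in for parameter tensors, and the function `adam_step` and its `state` dictionary are hypothetical. This is the textbook update rule, not PyTorch's actual implementation.

```python
import math


def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to each parameter in turn, in a plain Python loop."""
    state["t"] += 1
    t = state["t"]
    for i, (p, g) in enumerate(zip(params, grads)):
        # running averages of the gradient and the squared gradient
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # bias correction for the zero-initialized averages
        m_hat = state["m"][i] / (1 - beta1**t)
        v_hat = state["v"][i] / (1 - beta2**t)
        params[i] = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


params = [1.0, -2.0, 0.5]
grads = [0.1, -4.0, 2.5]
state = {"t": 0, "m": [0.0] * 3, "v": [0.0] * 3}
params = adam_step(params, grads, state)
```

When the parameters are tensors on the GPU, each arithmetic line inside the loop becomes one or more small kernel launches per parameter, and the Python overhead between launches is where the gray gaps come from.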

As of writing in August 2022,
more efficient optimizers are not a stable part of PyTorch (v1.12), but
[there is an unstable API](https://github.com/pytorch/pytorch/issues/68041)
and stable implementations outside of PyTorch.
The standard implementations are
[in NVIDIA's `apex.optimizers` library](https://nvidia.github.io/apex/optimizers.html),
not to be confused with the
[Apex Optimizers Project](https://www.apexoptimizers.com/),
which is a collection of fitness-themed cheetah NFTs.

## Take-aways for PyTorch performance bottleneck troubleshooting

Our goal here was to learn some basic principles and tools for finding
the most common bottlenecks and picking the lowest-hanging fruit in PyTorch code.


Here's an overview in terms of a "host",
generally the CPU,
and a "device", here the GPU.

- The slow-moving host operates at the level of an abstract compute graph ("convolve these weights with this input"), not actual numerical computations.
- During execution, the host's memory stores only metadata about tensors, like their types and shapes. This metadata is needed to select the concrete operations, or CUDA kernels, for the device to run.
  - Convolutions with very large filter sizes, for example, might use fast Fourier transform-based convolution algorithms, while the smaller filter sizes typical of contemporary CNNs are generally faster with Winograd-style convolution algorithms.
- The much beefier device executes actual operations, but has no control over which operations are executed. Its memory
stores information about the contents of tensors,
not just their metadata.

Towards that goal, we viewed the trace to get an understanding of
what's going on inside a PyTorch training step.

Here's what that means in terms of troubleshooting bottlenecks.

We want Python to chew its way through looking up the right CUDA kernel and telling the GPU that's what it needs next
before the previous kernel finishes.

Ideally, the CPU is actually getting far _ahead_ of execution
on the GPU.
If the CPU makes it all the way through the backwards pass before the GPU is done,
that's great!
The GPU(s) are the expensive part,
and it's easy to use multiprocessing so that
the CPU has other things to do.

This helps explain at least one common piece of advice:
the larger our batches are,
the more work the GPU has to do for the same work done by the CPU,
and so the better our utilization will be.

We operationalize our desire to never be waiting on the CPU with a simple metric:
**100% GPU utilization**, meaning a kernel is running at all times.

This is the aggregate metric reported in the systems tab on W&B or in the output of `!nvidia-smi`.

You should not buy faster GPUs until you have maxed this out! If you have 50% utilization, the fastest GPU in the world can't give you more than a 2x speedup, and it will cost more than 2x as much.
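
The 2x figure is a back-of-the-envelope bound: if the GPU is only busy for a fraction of each step, then even kernels that take zero time leave the idle fraction untouched. The sketch below spells out that arithmetic. The function name is made up, and the bound assumes the time the GPU spends waiting does not change when the GPU gets faster.

```python
def max_speedup_from_faster_gpu(utilization):
    """Upper bound on step-time speedup if GPU kernels took zero time.

    Assumes the idle fraction (1 - utilization) of each step is unchanged.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)


bound_at_half = max_speedup_from_faster_gpu(0.5)
bound_at_ninety = max_speedup_from_faster_gpu(0.9)
```

At 50% utilization the bound is 2x, while at 90% utilization a faster GPU could in principle deliver up to 10x, which is why utilization comes first.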

Here are some of the most common issues that lead to low GPU Utilization, and how to resolve them:
1. **The CPU is too weak**.
Because so much of the discussion around DNN performance is about GPUs,
it's easy when speccing out a machine to skimp on the CPUs, even though training can bottleneck on CPU operations.
_Resolution_:
Use nice CPUs, like
[threadrippers](https://www.amd.com/en/products/ryzen-threadripper).
2. **Too much Python during the `training_step`**.
Python is very slow, so if you throw in a really slow Python operation, like dynamically creating classes or iterating over a bunch of bytes, especially from disk, during the training step, you can end up waiting on an `__init__`
that takes longer than running an entire layer.
_Resolution_:
Look for low utilization areas of the trace,
check what's happening on the CPU at that time,
and carefully review the Python code being executed.
3. **Unnecessary Host/Device synchronization**.
If one of your operations depends on the values in a tensor,
like `if xs.mean() >= 0`,
you'll induce a synchronization between
the host and the device and possibly lead
to an expensive and slow copy of data.
_Resolution_:
Replace these operations as much as possible
with purely array-based calculations.
4. **Bottlenecking on the DataLoader**.
In addition to coordinating the work on the GPU,
CPUs often perform heavy data operations,
including communication over the network
and writing to/reading from disk.
These are generally done in parallel to the forwards
and backwards passes,
but if they don't finish before those passes do,
they will become the bottleneck.
_Resolution_:
Get better hardware for compute,
memory, and network.
For software solutions, the answer
is a bit more complex and application-dependent.
For generic tips, see
[this classic post by Ross Wightman](https://discuss.pytorch.org/t/how-to-prefetch-data-when-processing-with-gpu/548/19)
in the PyTorch forums.
For techniques in computer vision, see
[the FFCV library](https://github.com/libffcv/ffcv)
and for techniques in NLP, see e.g.
[Hugging Face datasets with Arrow](https://huggingface.co/docs/datasets/about_arrow)
and [Hugging Face FastTokenizers](https://huggingface.co/course/chapter6/3).
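
As an illustration of the third issue above, here is a sketch of a value-dependent Python branch and an array-based replacement for it. The two function names and the "shift by one" logic are invented for the example. The point is only the pattern: the first version needs the value of `xs.mean()` on the host before Python can continue, while the second keeps the decision inside a tensor operation.

```python
import torch


def shift_with_sync(xs: torch.Tensor) -> torch.Tensor:
    # Python needs the *value* of the mean to pick a branch,
    # so on a GPU the host must wait for the device here
    if xs.mean() >= 0:
        return xs - 1.0
    return xs + 1.0


def shift_without_sync(xs: torch.Tensor) -> torch.Tensor:
    # the choice is expressed as a tensor operation, so no Python
    # branch depends on the tensor's contents
    return torch.where(xs.mean() >= 0, xs - 1.0, xs + 1.0)
```

The trade-off is that `torch.where` computes both candidate results and then selects between them, so this rewrite pays off when the branches are cheap relative to the cost of stalling the host.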

### Further steps in making DNNs go brrrrrr

It's important to note that utilization
is just an easily measured metric
that can reveal common bottlenecks.
Having high utilization does not automatically mean
that your performance is fully optimized.

For example,
synchronization events between GPUs
are counted as kernels,
so a deadlock during distributed training
can show up as 100% utilization,
despite literally no useful work occurring.

Just switching to
double precision floats, `--precision=64`,
will generally lead to much higher utilization.
The GPU operations take longer
for roughly the same amount of CPU effort,
but the added precision brings no benefit.

In particular, it doesn't make for models
that perform better on our correctness metrics,
like loss and accuracy.

Another useful yardstick to add
to utilization is examples per second,
which incorporates how quickly the model is processing data examples
and calculating gradients.
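
A minimal way to measure this is to time a loop over batches and divide the number of examples by the elapsed time. The helper below is a sketch: `examples_per_second`, `step_fn`, and the toy batches are all hypothetical stand-ins for a real training step and data loader.

```python
import time


def examples_per_second(step_fn, batches, batch_size):
    """Run `step_fn` over `batches` and return the measured throughput."""
    start = time.perf_counter()
    n_examples = 0
    for batch in batches:
        step_fn(batch)
        n_examples += batch_size
    elapsed = time.perf_counter() - start
    return n_examples / elapsed


# toy stand-ins: ten "batches" of four numbers, and a "step" that sums them
toy_batches = [[1, 2, 3, 4]] * 10
throughput = examples_per_second(sum, toy_batches, batch_size=4)
```

Because GPU work is asynchronous, timing real training steps this way also requires waiting for the device to finish, for example with `torch.cuda.synchronize()`, before reading the clock.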

But really,
the gold star is _decrease in loss per second_.
This metric connects model design choices
and hyperparameters with purely engineering concerns,
so it disrespects abstraction barriers
and doesn't generally lead to actionable recommendations,
but it is, in the end, the real goal:
make the loss go down faster so we get better models sooner.

For PyTorch internals abstractly,
see [Ed Yang's blog post](http://blog.ezyang.com/2019/05/pytorch-internals/).

For more on performance considerations in PyTorch,
see [Horace He's blog post](https://horace.io/brrr_intro.html).

# Exercises

### 🌟 Compare `num_workers=0` with `DEFAULT_NUM_WORKERS`.

One of the most important features for making
PyTorch run quickly is the
multiprocessing mode of the `DataLoader`,
which executes batching of data in separate worker processes
from the forwards and backwards passes.

By default in PyTorch,
this feature is actually turned off,
via the `DataLoader` argument `num_workers`
having a default value of `0`,
but we set the `DEFAULT_NUM_WORKERS`
to a value based on the number of CPUs
available on the system running the code.

Re-run the profiling cell,
but set `num_workers` to `0`
to turn off multiprocessing.

Compare and contrast the two traces,
both for total runtime
(see the time axis at the top of the trace)
and for utilization.

If you're unable to run the profiles,
see the results
[here](https://wandb.ai/cfrye59/fsdl-text-recognizer-2022-training/artifacts/trace/trace-2eddoiz7/v0/files/training_step.pt.trace.json#f388e363f107e21852d5$trace-67j1qxws),
which juxtaposes two traces,
with in-process dataloading on the left and
multiprocessing dataloading on the right.

### 🌟🌟 Resolve issues with a file by fixing flake8 lints, then write a test.

The file below incorrectly implements and then incorrectly tests
a simple PyTorch utility for adding five to every entry of a tensor
and then calculating the sum.

Even worse, it does it with horrible style!

The cells below apply our linting checks
(after automatically fixing the formatting)
and run the test.

Fix all of the lints,
implement the function correctly,
and then implement some basic tests.

- [`flake8`](https://flake8.pycqa.org/en/latest/user/error-codes.html) for core style
- [`flake8-import-order`](https://github.com/PyCQA/flake8-import-order) for checking imports
- [`flake8-docstrings`](https://github.com/pycqa/flake8-docstrings) for docstring style
- [`darglint`](https://github.com/terrencepreilly/darglint) for docstring completeness
- [`flake8-annotations`](https://github.com/sco1/flake8-annotations) for type annotations

In [None]:
%%writefile training/fixme.py
import torch
from training import run_experiment
from numpy import *
import random
from pathlib import Path




def add_five_and_sum(tensor):
  # this function is not implemented right,
  #    but it's supposed to add five to all tensor entries and sum them up
  return 1

def test_add_five_and_sum():
    # and this test isn't right either! plus this isn't exactly a docstring
    all_zeros, all_ones = torch.zeros((2, 3)), torch.ones((1, 4, 72))
    all_fives = 5 * all_ones
    assert False

In [None]:
!pre-commit run black --files training/fixme.py

In [None]:
!cat training/fixme.py

In [None]:
!pre-commit run --files training/fixme.py

In [None]:
!pytest training/fixme.py