Enabled most pre-commit hooks (#1080)
- Enabled most pre-commit hooks, except the quote fixer, as that would produce a messy diff. Will wait until open PRs are at a minimum before adding that one.
- Reformatted existing files to comply with the pre-commit config. This mainly involved fixing links, adding newlines, removing trailing spaces, and removing executable bits.
ravi-mosaicml committed May 24, 2022
1 parent 8499eed commit ceebbdb
Showing 165 changed files with 408 additions and 259 deletions.
3 changes: 3 additions & 0 deletions .ci/test_lint_doctests.py
@@ -1,3 +1,6 @@
# Copyright 2022 MosaicML Composer authors
# SPDX-License-Identifier: Apache-2.0

# Pytest stub for running lint tests and doctests

# Running these checks through pytest allows us to report any errors in Junit format,
6 changes: 3 additions & 3 deletions .github/ISSUE_TEMPLATE/---bug-report.md
@@ -19,9 +19,9 @@ assignees: ''

Steps to reproduce the behavior:

1.
2.
3.

## Expected behavior

2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/---new-method.md
@@ -16,7 +16,7 @@ assignees: ''

## Attribution

<!-- Who are the authors that we should credit and/or contact for this method? -->

## [Optional] Implementation

34 changes: 28 additions & 6 deletions .pre-commit-config.yaml
@@ -49,19 +49,41 @@ repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v4.1.0
  hooks:
  - id: check-added-large-files
  - id: check-ast
  - id: check-byte-order-marker
  - id: check-builtin-literals
    args:
    - --no-allow-dict-kwargs
  - id: check-case-conflict
  - id: check-docstring-first
  - id: check-executables-have-shebangs
  - id: check-json
  - id: check-shebang-scripts-are-executable
  - id: pretty-format-json
    args:
    - --autofix
    - --no-sort-keys
    - --indent=4
  - id: check-merge-conflict
  - id: check-symlinks
  - id: check-toml
  - id: check-vcs-permalinks
  - id: check-xml
  - id: check-yaml
  - id: debug-statements
  - id: destroyed-symlinks
  # - id: double-quote-string-fixer # TODO(ravi): Enable this check later. Generates a large diff.
  - id: end-of-file-fixer
  - id: fix-byte-order-marker
  - id: mixed-line-ending
  - id: trailing-whitespace
  # - id: name-tests-test # TODO(ravi): Enable this check later. Generates a large diff.
  #   args: ['--django']
- repo: https://github.com/Lucas-C/pre-commit-hooks
  rev: v1.1.13
  hooks:
  - id: insert-license
    files: composer
    args:
    - --license-filepath
    - .ci/FILE_HEADER
2 changes: 1 addition & 1 deletion CODE_OF_CONDUCT.md
@@ -1,5 +1,5 @@
# Community Guidelines

This repository is governed by MosaicML's community guidelines and code of conduct.
For more details, including information on how to report issues affecting the community, please read the
[MosaicML Community Guidelines](https://docs.google.com/document/d/1h8S9x9bCTsA_H8ourZJy3SQVWy-6z7i28TP5rcZt8RI/edit) and the [MosaicML Code of Conduct](https://docs.google.com/document/d/1aCaMLO65qfMaqP3uDYiUsTauMvBrSKd7qgeYqz458Ew/edit).
2 changes: 1 addition & 1 deletion README.md
@@ -354,7 +354,7 @@ If you have any questions, please feel free to reach out to us on [Twitter](http
# 💫 Contributors
Composer is part of the broader Machine Learning community, and we welcome any contributions, pull requests, or issues!

To start contributing, see our [Contributing](CONTRIBUTING.md) page.

# ✍️ Citation
```
2 changes: 1 addition & 1 deletion STYLE_GUIDE.md
@@ -70,7 +70,7 @@ Here are some suggestions to deal with pyright errors:
Instead, add a check to ensure that `x is not None`:

```python
from typing import Union

def foo(x: Union[int, None]):
if x is None:
Empty file modified composer/algorithms/__init__.py 100755 → 100644
8 changes: 4 additions & 4 deletions composer/algorithms/agc/README.md
@@ -25,7 +25,7 @@ def training_loop(model, train_loader):
opt = torch.optim.Adam(model.parameters())
loss_fn = F.cross_entropy
model.train()

for epoch in range(num_epochs):
for X, y in train_loader:
opt.zero_grad()
@@ -66,15 +66,15 @@ AGC is implemented as follows:
On `Event.AFTER_TRAIN_BATCH`, for every parameter in the model that has gradients:
1. Compute the parameter's weight norm with an L2 norm (normalized across rows for MLPs, across entire filters for CNNs, and across the entire vector for biases).
2. Compute the parameter's gradient norm with an L2 norm.
3. If `grad_norm > weight_norm * clipping_threshold`, scale all the contributing gradients by `clipping_threshold * (weight_norm / grad_norm)`.
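
A hedged sketch of those three steps (hypothetical standalone helper; Composer's real entry point is `cf.apply_agc`, shown further down in this diff):

```python
import torch

def agc_sketch(model: torch.nn.Module, clipping_threshold: float = 0.01, eps: float = 1e-6):
    # Hypothetical helper: call after loss.backward(), i.e. at Event.AFTER_TRAIN_BATCH.
    for p in model.parameters():
        if p.grad is None:
            continue
        # Unitwise L2 norms: per-row for linear weights, per-filter for convs,
        # whole-vector for biases (approximated by flattening past dim 0).
        w = p.detach().flatten(1) if p.dim() > 1 else p.detach().unsqueeze(0)
        g = p.grad.detach().flatten(1) if p.dim() > 1 else p.grad.detach().unsqueeze(0)
        w_norm = w.norm(dim=1, keepdim=True).clamp_min(eps)
        g_norm = g.norm(dim=1, keepdim=True)
        max_norm = w_norm * clipping_threshold
        # Step 3: rescale offending gradients by clipping_threshold * (weight_norm / grad_norm).
        scale = torch.where(g_norm > max_norm, max_norm / g_norm.clamp_min(eps),
                            torch.ones_like(g_norm))
        p.grad.detach().mul_(scale.view([-1] + [1] * (p.dim() - 1)))
```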


## Suggested Hyperparameters

We haven't done much experimentation with AGC. However, [the original authors, Brock et al.](https://arxiv.org/abs/2102.06171)
and [Ayush Thakur](https://wandb.ai/ayush-thakur/nfnet/reports/Exploring-Adaptive-Gradient-Clipping-and-NFNets--Vmlldzo1MDc0NTQ)
have done some ablations and have some recommendations. Note that both parties use AGC with NF-ResNets, a variation
of ResNets that removes Batch Norm and includes [Weight Standardization](https://arxiv.org/abs/1903.10520)
among other modifications.

Brock et al. recommend using a `clipping threshold` of 0.01 for batch sizes between 1024 and 4096.
@@ -84,7 +84,7 @@ slightly increasing up to 0.08. They also recommend removing AGC from the last l
Thakur recommends a large `clipping threshold` for small batch sizes (at least 0.16 for batch sizes 128 and 256) and a smaller `clipping threshold` for large batch sizes.
They also found that AGC seems to work especially well for the NF-ResNet architecture. Specifically, they found that for a `clipping threshold` of 0.01 and batch size of 1024, AGC does not improve the performance of a vanilla ResNet with Batch Norm removed.

<!-- ## Technical Details
TODO(eracah): fill in this section.
-->

2 changes: 1 addition & 1 deletion composer/algorithms/agc/agc.py
@@ -28,7 +28,7 @@ def apply_agc(
.. testcode::
import composer.functional as cf
cf.apply_agc(model=model)
2 changes: 1 addition & 1 deletion composer/algorithms/agc/metadata.json
@@ -12,4 +12,4 @@
"summary": "Gradients with norms larger than a maximum norm (equal to the weight norm multiplied by a constant) are scaled by the ratio of the maximum norm to the grad norm.",
"use": "Computer vision tasks"
}
}
Empty file modified composer/algorithms/algorithm_registry.py 100755 → 100644
8 changes: 4 additions & 4 deletions composer/algorithms/alibi/README.md
@@ -16,7 +16,7 @@ ALiBi (Attention with Linear Biases) dispenses with position embeddings for toke

<!--pytest-codeblocks:importorskip(transformers)-->
```python
# Run the ALiBi algorithm directly on the model using the Composer functional API

import torch
import composer.functional as cf
@@ -40,7 +40,7 @@ def training_loop(model, train_loader):
opt = torch.optim.Adam(model.parameters())
loss_fn = F.cross_entropy
model.train()

for epoch in range(num_epochs):
for X, y in train_loader:
y_hat = model(X)
@@ -102,7 +102,7 @@ Press et al. found that learning *m* did not lead to strong extrapolation. They
Press et al. report that models trained with ALiBi maintain similar performance even when tested on sequences 5-10x longer than they were trained on. ALiBi’s extrapolation capabilities can be leveraged to train on shorter sequences. This is desirable because the number of operations required to compute self-attention and the GPU memory usage required to store the resulting representations both increase with the square of the sequence length. In one example scenario, Press et al. reported training to equal perplexity in 90% of the time and using 90% of the GPU memory of a baseline model with sinusoidal position embeddings. Our experiments show that ALiBi can reduce perplexity by 0.2-0.6, train models 1.15x faster, and use 1.2x less GPU memory compared to baseline models (see below).

> ✅ ALiBi Improves the Tradeoff Between Quality and Training Speed
>
> In our experiments, ALiBi improves the attainable tradeoffs between training speed and the final quality of the trained model.
> We recommend ALiBi for training language models.
@@ -118,7 +118,7 @@ We conducted experiments on the GPT-2 model family trained on OpenWebText on 8x
|GPT2-125M ALiBi 0.25x|23.49|-0.63|25280|1.19x|74.83|1.28x|

> ❗ Don't Set the Sequence Length Too Short
>
> We observed that performance significantly degraded for ALiBi models trained on sequence lengths ≤128, implying that very short sequences (≤128 tokens) may be irreconcilably out-of-distribution with regard to longer sequences. Considering our results together with those of Press et al. leads us to suggest that models with ALiBi should not be trained on sequences ≤256 or `train_sequence_length_scaling≤0.03125`, whichever is larger.
## Attribution
2 changes: 1 addition & 1 deletion composer/algorithms/alibi/alibi.py
@@ -287,7 +287,7 @@ def _zero_and_freeze_expand_position_embeddings(


def _register_alibi(module: torch.nn.Module, n_heads: int, max_token_length: int):
# Modified from https://github.com/ofirpress/attention_with_linear_biases/blob/5b327adc6d131e28b40ba58906b30bb469483519/fairseq/models/transformer.py#L742
slopes = torch.Tensor(_get_alibi_head_slopes(n_heads))
# In the next line, the part after the * is what constructs the diagonal matrix
# (right matrix in Figure 3 in the paper).
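
For intuition, a standalone sketch of the bias tensor this registration ultimately produces (hypothetical helper; the slope schedule assumes the number of heads is a power of two, per Press et al.):

```python
import torch

def alibi_bias_sketch(n_heads: int, max_len: int) -> torch.Tensor:
    # bias[h, i, j] = -slope[h] * |i - j|: scores between distant token
    # pairs are penalized linearly, with a fixed slope per attention head.
    start = 2.0 ** (-8.0 / n_heads)
    slopes = torch.tensor([start ** (h + 1) for h in range(n_heads)])
    pos = torch.arange(max_len)
    dist = (pos[None, :] - pos[:, None]).abs()          # (max_len, max_len)
    return -slopes[:, None, None] * dist[None, :, :]    # (n_heads, max_len, max_len)
```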
2 changes: 1 addition & 1 deletion composer/algorithms/alibi/metadata.json
@@ -12,4 +12,4 @@
"summary": "Encodes position information by biasing the query-key attention scores proportionally to each token pair\u2019s distance.",
"use": "Transformer-based NLP models"
}
}
10 changes: 5 additions & 5 deletions composer/algorithms/augmix/README.md
@@ -50,7 +50,7 @@ def augmix_image(image: Union[PillowImage, torch.Tensor]):
import torchvision.transforms as transforms
from torchvision.datasets.vision import VisionDataset

from composer.algorithms.augmix import AugmentAndMixTransform

augmix_transform = AugmentAndMixTransform(severity=3,
width=3,
@@ -105,7 +105,7 @@ The class form of AugMix runs on `Event.FIT_START` and inserts `AugmentAndMixTra
[As per Hendrycks et al. (2020)](https://arxiv.org/abs/1912.02781), we found that `width=3`, `depth=-1`, (`depth=-1` means that depth will be randomly sampled from the uniform distribution {1, 2, 3} for each data sample), `severity=3` (out of a maximum possible value of 10), and `alpha=1` (i.e., performing no mixing with the original image) worked well for different models of the ResNet family. We used `augmentation_set=all`.

> ❗ Potential CPU Bottleneck
>
> Further increasing `width` or `depth` significantly decreases throughput when training ResNet-50 on ImageNet due to bottlenecks in performing data augmentation on the CPU.
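
Plugging those suggested values into the transform from the usage example above looks roughly like this (a sketch; the argument names follow the `AugmentAndMixTransform` call shown earlier):

```python
from composer.algorithms.augmix import AugmentAndMixTransform

# severity 3 of 10, width 3, depth sampled per-example from {1, 2, 3},
# alpha 1, and the full augmentation set.
augmix_transform = AugmentAndMixTransform(severity=3,
                                          width=3,
                                          depth=-1,
                                          alpha=1.0,
                                          augmentation_set="all")
```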
## Technical Details
@@ -125,7 +125,7 @@ When omitting the custom loss function and using the AugMix augmentation scheme
However, the increased CPU load imposed by AugMix substantially reduces throughput.

> ❗ Potential CPU Bottleneck
>
> We found that using AugMix with the hyperparameters recommended by Hendrycks et al. can increase the data augmentation load on the CPU so much that it bottlenecks training.
> Depending on the hardware configuration and model, we found that those hyperparameters increase training time by 1.1x-10x.
@@ -135,7 +135,7 @@ In addition, AugMix is a regularization technique, meaning it makes training mor
Doing so can allow models to reach higher quality, but this typically requires (1) larger models with more capacity to perform this more difficult learning and (2) longer training runs to allow these models time to learn.

> 🚧 AugMix May Reduce Quality for Smaller Models and Shorter Training Runs
>
> AugMix is a regularization technique that makes training more difficult for the model.
> Because AugMix is a regularization technique, it can allow models to reach higher quality for
>
@@ -150,7 +150,7 @@ Doing so can allow models to reach higher quality, but this typically requires (
> As a general rule, composing regularization methods may lead to diminishing returns in quality improvements while increasing the risk of creating a CPU bottleneck.
> ❗ CIFAR-10C and ImageNet-C are no longer out-of-distribution
>
> [CIFAR-10C and ImageNet-C](https://github.com/hendrycks/robustness) are test sets created to evaluate the ability of models to generalize to images that are corrupted in various ways (i.e., images that are _out-of-distribution_ with respect to the standard CIFAR-10 and ImageNet training sets).
> These images were corrupted using some of the augmentation techniques in `augmentation_set=all`.
> If you use `augmentation_set=all`, these images are therefore no longer out-of-distribution.
2 changes: 1 addition & 1 deletion composer/algorithms/augmix/metadata.json
@@ -12,4 +12,4 @@
"summary": "Creates multiple random chain of augmentations for each sample, and takes a convex combination over the chains",
"use": "Computer vision tasks"
}
}
2 changes: 1 addition & 1 deletion composer/algorithms/blurpool/README.md
@@ -77,7 +77,7 @@ For max pooling, it replaces `torch.nn.MaxPool2d` instances with instances of a

🚧 Implementation Note
>
> Blurpool does not replace strided convolutions with fewer than `min_channels` input channels, which by default is set to `16`. This is a heuristic used to avoid blurpooling the network's input. Doing so is undesirable since it amounts to downsampling the input by more than the amount specified in the preprocessing pipeline.
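
A functional-API usage sketch, mirroring the `cf.apply_agc` example earlier in this diff (assuming the functional form accepts the `min_channels` knob described in the note above):

```python
import torchvision.models as models
import composer.functional as cf

model = models.resnet18()  # any CNN with strided convs / max pooling
# Skip layers with fewer than 16 input channels so the network input
# itself is not blurpooled.
cf.apply_blurpool(model, min_channels=16)
```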
## Suggested Hyperparameters

6 changes: 3 additions & 3 deletions composer/algorithms/channels_last/README.md
@@ -17,7 +17,7 @@ This is a systems-level method that does not change the math or outcome of train
### Functional Interface

```python
# Run the Channels Last algorithm directly on the model using the Composer functional API

import composer.functional as cf

@@ -27,7 +27,7 @@ def training_loop(model, train_loader):
opt = torch.optim.Adam(model.parameters())
loss_fn = F.cross_entropy
model.train()

for epoch in range(num_epochs):
for X, y in train_loader:
y_hat = model(X)
@@ -78,7 +78,7 @@ If the model weights are instead initialized in NHWC format, PyTorch will automa
We currently implement this method by casting the user’s model to channels-last format (no changes to the dataloader are necessary). When the first convolution operation receives its input activation, it will automatically convert it to NHWC format, after which the memory format will persist for the remainder of the network (or until it reaches a layer that cannot support having channels last).
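
Concretely, the cast described above is a single standard PyTorch call; a minimal sketch:

```python
import torch
import torchvision.models as models

model = models.resnet50()
# Move weight tensors to NHWC (channels-last) memory layout; activations
# follow automatically once the first conv receives channels-last input.
model = model.to(memory_format=torch.channels_last)
```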

> ❗ Overhead from Operations Incompatible with Channels Last Memory Format
>
> If a model has layers that cannot support the channels last memory format, there will be overhead due to PyTorch switching activation tensors back and forth between NCHW and NHWC memory formats. We believe this problem currently affects placing channels last on UNet.
## Attribution
2 changes: 1 addition & 1 deletion composer/algorithms/channels_last/metadata.json
@@ -12,4 +12,4 @@
"summary": "Stores activation and weight tensors in a NHWC (batch, height, width, channels) format, rather than Pytorch\u2019s default of NCHW.",
"use": "2D Convolutional Neural Networks"
}
}
2 changes: 1 addition & 1 deletion composer/algorithms/colout/README.md
@@ -81,7 +81,7 @@ trainer.fit()
### Implementation Details

ColOut currently has two implementations.
One implementation, accessed by passing `batch=False`, acts as an additional data augmentation for use in PyTorch dataloaders. It runs on the CPU and applies ColOut independently to each training example.
A second implementation, accessed by passing `batch=True`, runs immediately before the training example is provided to the model. It runs on the GPU and drops the same rows and columns for all training examples in a mini-batch.
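
A minimal sketch of the batched (`batch=True`) variant (hypothetical helper; `p_row`/`p_col` stand for the dropped fraction of rows and columns):

```python
import torch

def colout_batch_sketch(X: torch.Tensor, p_row: float = 0.15, p_col: float = 0.15) -> torch.Tensor:
    # Keep each row/column with probability 1 - p; the same rows and columns
    # are dropped for every image in the (N, C, H, W) batch.
    keep_rows = torch.rand(X.shape[-2]) > p_row
    keep_cols = torch.rand(X.shape[-1]) > p_col
    return X[..., keep_rows, :][..., keep_cols]
```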

## Suggested Hyperparameters
2 changes: 1 addition & 1 deletion composer/algorithms/colout/metadata.json
@@ -12,4 +12,4 @@
"summary": "Drops a fraction of the rows and columns of an input image to reduce the image size and add variability.",
"use": "Computer vision tasks"
}
}
4 changes: 2 additions & 2 deletions composer/algorithms/cutmix/cutmix.py
@@ -184,10 +184,10 @@ class CutMix(Algorithm):
box such that each pixel has an equal probability of being mixed.
If ``False``, defaults to the sampling used in the original
paper implementation. Default: ``False``.
input_key (str, int, or Callable): A key that indexes to the input
from the batch. Can also be a pair of get and set functions, where the getter
is assumed to be first in the pair.
target_key (str, int, or Callable): A key that indexes to the target
from the batch. Can also be a pair of get and set functions, where the getter
is assumed to be first in the pair.
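
For intuition, the original paper's box sampling that the default (`uniform_sampling=False`) follows can be sketched as (hypothetical helper, not Composer's exact code):

```python
import torch

def rand_bbox_sketch(H: int, W: int, lam: float):
    # Box area is a (1 - lam) fraction of the image; its center is uniform,
    # so boxes near the border get clipped (this is the non-uniform pixel
    # coverage that uniform_sampling=True corrects for).
    cut_ratio = (1.0 - lam) ** 0.5
    cut_h, cut_w = int(H * cut_ratio), int(W * cut_ratio)
    cy, cx = int(torch.randint(H, (1,))), int(torch.randint(W, (1,)))
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
    return y1, y2, x1, x2
```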
2 changes: 1 addition & 1 deletion composer/algorithms/cutmix/metadata.json
@@ -12,4 +12,4 @@
"summary": "Overlays a patch of a different image onto the input, and interpolates labels accordingly.",
"use": "Image classification, semantic segmentation"
}
}
2 changes: 1 addition & 1 deletion composer/algorithms/cutout/README.md
@@ -65,7 +65,7 @@ trainer.fit()

### Implementation Details

CutOut randomly selects `num_holes` square regions (which are possibly overlapping) with side length `length` and uses them to generate a binary mask for the image where the points within any hole are set to 0 and the remaining points are set to 1.
This mask is then multiplied element-wise with the image in order to set the pixel value of any pixel value within a hole to 0.

CutOut is implemented following the [original paper](https://arxiv.org/abs/1708.04552). However, our implementation currently differs in that CutOut operates on a batch of data and runs on device to avoid potential CPU bottlenecks.
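
A hedged sketch of that mask-and-multiply procedure (hypothetical helper; it treats `length` as a fraction of the shorter image side, matching the `cutout_batch` signature shown below):

```python
import torch

def cutout_batch_sketch(X: torch.Tensor, num_holes: int = 1, length: float = 0.5) -> torch.Tensor:
    # One binary mask (0 inside each square hole, 1 elsewhere) multiplied
    # into every image, so the same squares are cut out batch-wide.
    N, C, H, W = X.shape
    side = int(length * min(H, W))
    mask = torch.ones(H, W, device=X.device)
    for _ in range(num_holes):
        cy, cx = int(torch.randint(H, (1,))), int(torch.randint(W, (1,)))
        y1, y2 = max(cy - side // 2, 0), min(cy + side // 2, H)
        x1, x2 = max(cx - side // 2, 0), min(cx + side // 2, W)
        mask[y1:y2, x1:x2] = 0
    return X * mask
```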
2 changes: 1 addition & 1 deletion composer/algorithms/cutout/cutout.py
@@ -75,7 +75,7 @@ def cutout_batch(input: ImgT, num_holes: int = 1, length: float = 0.5, uniform_s


class CutOut(Algorithm):
"""`CutOut <https://arxiv.org/abs/1708.04552>`_ is a data augmentation technique
"""`CutOut <https://arxiv.org/abs/1708.04552>`_ is a data augmentation technique
that works by masking out one or more square regions of an input image.
This implementation cuts out the same square from all images in a batch.
2 changes: 1 addition & 1 deletion composer/algorithms/cutout/metadata.json
@@ -12,4 +12,4 @@
"summary": "Masks out one or more square regions of an input image.",
"use": "Computer vision tasks"
}
}
2 changes: 1 addition & 1 deletion composer/algorithms/ema/README.md 100755 → 100644
@@ -94,4 +94,4 @@ Our implementation of EMA also provides the option to use the EMA weights as the

Our implementation of EMA was inspired by [Tensorflow's Exponential Moving Average](https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage)

*This Composer implementation of this method and the accompanying documentation were produced by Cory Stephenson at MosaicML.*
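
For reference, a minimal sketch of the exponential-moving-average update this method maintains (hypothetical helper, not Composer's exact code; the smoothing value is illustrative):

```python
import copy
import torch

def ema_update_sketch(model: torch.nn.Module, ema_model: torch.nn.Module, smoothing: float = 0.99):
    # One EMA step: ema = smoothing * ema + (1 - smoothing) * weights.
    with torch.no_grad():
        for p, ema_p in zip(model.parameters(), ema_model.parameters()):
            ema_p.mul_(smoothing).add_(p, alpha=1.0 - smoothing)

model = torch.nn.Linear(4, 4)      # stand-in for the real model
ema_model = copy.deepcopy(model)   # averaged copy used for evaluation
ema_update_sketch(model, ema_model)
```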
Empty file modified composer/algorithms/ema/__init__.py 100755 → 100644
Empty file modified composer/algorithms/ema/ema.py 100755 → 100644
2 changes: 1 addition & 1 deletion composer/algorithms/ema/metadata.json 100755 → 100644
@@ -13,4 +13,4 @@
"summary": "Maintains an exponential moving average of model weights for use in evaluation.",
"use": "Generally applicable"
}
}
