Enabled most pre-commit hooks (#1080)
- Enabled most pre-commit hooks, except the quote fixer, as that would produce a messy diff. Will wait until open PRs are at a minimum before adding that one.
- Reformatted existing files to comply with the pre-commit config. This mainly involved fixing links, adding newlines, removing trailing spaces, and removing executable bits.
ravi-mosaicml committed May 24, 2022
1 parent 8499eed commit ceebbdb
Showing 165 changed files with 408 additions and 259 deletions.
3 changes: 3 additions & 0 deletions .ci/test_lint_doctests.py
@@ -1,3 +1,6 @@
# Copyright 2022 MosaicML Composer authors
# SPDX-License-Identifier: Apache-2.0

# Pytest stub for running lint tests and doctests

# Running these checks through pytest allows us to report any errors in Junit format,
6 changes: 3 additions & 3 deletions .github/ISSUE_TEMPLATE/---bug-report.md
@@ -19,9 +19,9 @@ assignees: ''

Steps to reproduce the behavior:

1.
2.
3.

## Expected behavior

2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/---new-method.md
@@ -16,7 +16,7 @@ assignees: ''

## Attribution

<!-- Who are the authors that we should credit and/or contact for this method? -->

## [Optional] Implementation

34 changes: 28 additions & 6 deletions .pre-commit-config.yaml
@@ -49,19 +49,41 @@ repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v4.1.0
  hooks:
  - id: check-added-large-files
  - id: check-ast
  - id: check-byte-order-marker
  - id: check-builtin-literals
    args:
    - --no-allow-dict-kwargs
  - id: check-case-conflict
  - id: check-docstring-first
  - id: check-executables-have-shebangs
  - id: check-json
  - id: check-shebang-scripts-are-executable
  - id: pretty-format-json
    args:
    - --autofix
    - --no-sort-keys
    - --indent=4
  - id: check-merge-conflict
  - id: check-symlinks
  - id: check-toml
  - id: check-vcs-permalinks
  - id: check-xml
  - id: check-yaml
  - id: debug-statements
  - id: destroyed-symlinks
  # - id: double-quote-string-fixer # TODO(ravi): Enable this check later. Generates a large diff.
  - id: end-of-file-fixer
  - id: fix-byte-order-marker
  - id: mixed-line-ending
  - id: trailing-whitespace
  # - id: name-tests-test # TODO(ravi): Enable this check later. Generates a large diff.
  #   args: ['--django']
- repo: https://github.com/Lucas-C/pre-commit-hooks
  rev: v1.1.13
  hooks:
  - id: insert-license
    files: composer
    args:
    - --license-filepath
    - .ci/FILE_HEADER
2 changes: 1 addition & 1 deletion CODE_OF_CONDUCT.md
@@ -1,5 +1,5 @@
# Community Guidelines

This repository is governed by MosaicML's community guidelines and code of conduct.
For more details, including information on how to report issues affecting the community, please read the
[MosaicML Community Guidelines](https://docs.google.com/document/d/1h8S9x9bCTsA_H8ourZJy3SQVWy-6z7i28TP5rcZt8RI/edit) and the [MosaicML Code of Conduct](https://docs.google.com/document/d/1aCaMLO65qfMaqP3uDYiUsTauMvBrSKd7qgeYqz458Ew/edit).
2 changes: 1 addition & 1 deletion README.md
@@ -354,7 +354,7 @@ If you have any questions, please feel free to reach out to us on [Twitter](http
# 💫 Contributors
Composer is part of the broader Machine Learning community, and we welcome any contributions, pull requests, or issues!

To start contributing, see our [Contributing](CONTRIBUTING.md) page.

# ✍️ Citation
```
2 changes: 1 addition & 1 deletion STYLE_GUIDE.md
@@ -70,7 +70,7 @@ Here are some suggestions to deal with pyright errors:
Instead, add a check to ensure that `x is not None`:

```python
from typing import Union

def foo(x: Union[int, None]):
if x is None:
Empty file modified composer/algorithms/__init__.py 100755 → 100644
8 changes: 4 additions & 4 deletions composer/algorithms/agc/README.md
@@ -25,7 +25,7 @@ def training_loop(model, train_loader):
opt = torch.optim.Adam(model.parameters())
loss_fn = F.cross_entropy
model.train()

for epoch in range(num_epochs):
for X, y in train_loader:
opt.zero_grad()
@@ -66,15 +66,15 @@ AGC is implemented as follows:
On `Event.AFTER_TRAIN_BATCH`, for every parameter in the model that has gradients:
1. Compute the parameter's weight norm with an L2 norm (normalized across rows for MLPs, across entire filters for CNNs, and across the entire vector for biases).
2. Compute the parameter's gradient norm with an L2 norm.
3. If `grad_norm > weight_norm * clipping_threshold`, scale all the contributing gradients by `clipping_threshold * (weight_norm / grad_norm)`.
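
A hedged sketch of those three steps (hypothetical standalone helper; Composer's real entry point is `cf.apply_agc`, shown further down in this diff):

```python
import torch

def agc_sketch(model: torch.nn.Module, clipping_threshold: float = 0.01, eps: float = 1e-6):
    # Hypothetical helper: call after loss.backward(), i.e. at Event.AFTER_TRAIN_BATCH.
    for p in model.parameters():
        if p.grad is None:
            continue
        # Unitwise L2 norms: per-row for linear weights, per-filter for convs,
        # whole-vector for biases (approximated by flattening past dim 0).
        w = p.detach().flatten(1) if p.dim() > 1 else p.detach().unsqueeze(0)
        g = p.grad.detach().flatten(1) if p.dim() > 1 else p.grad.detach().unsqueeze(0)
        w_norm = w.norm(dim=1, keepdim=True).clamp_min(eps)
        g_norm = g.norm(dim=1, keepdim=True)
        max_norm = w_norm * clipping_threshold
        # Step 3: rescale offending gradients by clipping_threshold * (weight_norm / grad_norm).
        scale = torch.where(g_norm > max_norm, max_norm / g_norm.clamp_min(eps),
                            torch.ones_like(g_norm))
        p.grad.detach().mul_(scale.view([-1] + [1] * (p.dim() - 1)))
```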


## Suggested Hyperparameters

We haven't done much experimentation with AGC. However, [the original authors, Brock et al.](https://arxiv.org/abs/2102.06171)
and [Ayush Thakur](https://wandb.ai/ayush-thakur/nfnet/reports/Exploring-Adaptive-Gradient-Clipping-and-NFNets--Vmlldzo1MDc0NTQ)
have done some ablations and have some recommendations. Note that both parties use AGC with NF-ResNets, a variation
of ResNets that removes Batch Norm and includes [Weight Standardization](https://arxiv.org/abs/1903.10520)
among other modifications.

Brock et al. recommend using a `clipping threshold` of 0.01 for batch sizes between 1024 and 4096.
@@ -84,7 +84,7 @@ slightly increasing up to 0.08. They also recommend removing AGC from the last l
Thakur recommends a large `clipping threshold` for small batch sizes (at least 0.16 for batch sizes 128 and 256) and a smaller `clipping threshold` for large batch sizes.
They also found that AGC seems to work especially well for the NF-ResNet architecture. Specifically, they found that for a `clipping threshold` of 0.01 and batch size of 1024, AGC does not improve the performance of a vanilla ResNet with Batch Norm removed.

<!-- ## Technical Details
TODO(eracah): fill in this section.
-->

2 changes: 1 addition & 1 deletion composer/algorithms/agc/agc.py
@@ -28,7 +28,7 @@ def apply_agc(
.. testcode::
import composer.functional as cf
cf.apply_agc(model=model)
2 changes: 1 addition & 1 deletion composer/algorithms/agc/metadata.json
@@ -12,4 +12,4 @@
"summary": "Gradients with norms larger than a maximum norm (equal to the weight norm multiplied by a constant) are scaled by the ratio of the maximum norm to the grad norm.",
"use": "Computer vision tasks"
}
}
Empty file modified composer/algorithms/algorithm_registry.py 100755 → 100644
8 changes: 4 additions & 4 deletions composer/algorithms/alibi/README.md
@@ -16,7 +16,7 @@ ALiBi (Attention with Linear Biases) dispenses with position embeddings for toke

<!--pytest-codeblocks:importorskip(transformers)-->
```python
# Run the ALiBi algorithm directly on the model using the Composer functional API

import torch
import composer.functional as cf
@@ -40,7 +40,7 @@ def training_loop(model, train_loader):
opt = torch.optim.Adam(model.parameters())
loss_fn = F.cross_entropy
model.train()

for epoch in range(num_epochs):
for X, y in train_loader:
y_hat = model(X)
@@ -102,7 +102,7 @@ Press et al. found that learning *m* did not lead to strong extrapolation. They
Press et al. report that models trained with ALiBi maintain similar performance even when tested on sequences 5-10x longer than they were trained on. ALiBi’s extrapolation capabilities can be leveraged to train on shorter sequences. This is desirable because the number of operations required to compute self-attention and the GPU memory usage required to store the resulting representations both increase with the square of the sequence length. In one example scenario, Press et al. reported training to equal perplexity in 90% of the time and using 90% of the GPU memory of a baseline model with sinusoidal position embeddings. Our experiments show that ALiBi can reduce perplexity by 0.2-0.6, train models 1.15x faster, and use 1.2x less GPU memory compared to baseline models (see below).

> ✅ ALiBi Improves the Tradeoff Between Quality and Training Speed
>
> In our experiments, ALiBi improves the attainable tradeoffs between training speed and the final quality of the trained model.
> We recommend ALiBi for training language models.
@@ -118,7 +118,7 @@ We conducted experiments on the GPT-2 model family trained on OpenWebText on 8x
|GPT2-125M ALiBi 0.25x|23.49|-0.63|25280|1.19x|74.83|1.28x|

> ❗ Don't Set the Sequence Length Too Short
>
> We observed that performance significantly degraded for ALiBi models trained on sequence lengths ≤128, implying that very short sequences (≤128 tokens) may be irreconcilably out-of-distribution with regard to longer sequences. Considering our results together with those of Press et al. leads us to suggest that models with ALiBi should not be trained on sequences ≤256 or `train_sequence_length_scaling≤0.03125`, whichever is larger.
## Attribution
2 changes: 1 addition & 1 deletion composer/algorithms/alibi/alibi.py
@@ -287,7 +287,7 @@ def _zero_and_freeze_expand_position_embeddings(


def _register_alibi(module: torch.nn.Module, n_heads: int, max_token_length: int):
# Modified from https://github.com/ofirpress/attention_with_linear_biases/blob/5b327adc6d131e28b40ba58906b30bb469483519/fairseq/models/transformer.py#L742
slopes = torch.Tensor(_get_alibi_head_slopes(n_heads))
# In the next line, the part after the * is what constructs the diagonal matrix
# (right matrix in Figure 3 in the paper).
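
For intuition, a standalone sketch of the bias tensor this registration ultimately produces (hypothetical helper; the slope schedule assumes the number of heads is a power of two, per Press et al.):

```python
import torch

def alibi_bias_sketch(n_heads: int, max_len: int) -> torch.Tensor:
    # bias[h, i, j] = -slope[h] * |i - j|: scores between distant token
    # pairs are penalized linearly, with a fixed slope per attention head.
    start = 2.0 ** (-8.0 / n_heads)
    slopes = torch.tensor([start ** (h + 1) for h in range(n_heads)])
    pos = torch.arange(max_len)
    dist = (pos[None, :] - pos[:, None]).abs()          # (max_len, max_len)
    return -slopes[:, None, None] * dist[None, :, :]    # (n_heads, max_len, max_len)
```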
2 changes: 1 addition & 1 deletion composer/algorithms/alibi/metadata.json
@@ -12,4 +12,4 @@
"summary": "Encodes position information by biasing the query-key attention scores proportionally to each token pair\u2019s distance.",
"use": "Transformer-based NLP models"
}
}
10 changes: 5 additions & 5 deletions composer/algorithms/augmix/README.md
@@ -50,7 +50,7 @@ def augmix_image(image: Union[PillowImage, torch.Tensor]):
import torchvision.transforms as transforms
from torchvision.datasets.vision import VisionDataset

from composer.algorithms.augmix import AugmentAndMixTransform

augmix_transform = AugmentAndMixTransform(severity=3,
width=3,
@@ -105,7 +105,7 @@ The class form of AugMix runs on `Event.FIT_START` and inserts `AugmentAndMixTra
[As per Hendrycks et al. (2020)](https://arxiv.org/abs/1912.02781), we found that `width=3`, `depth=-1`, (`depth=-1` means that depth will be randomly sampled from the uniform distribution {1, 2, 3} for each data sample), `severity=3` (out of a maximum possible value of 10), and `alpha=1` (i.e., performing no mixing with the original image) worked well for different models of the ResNet family. We used `augmentation_set=all`.

> ❗ Potential CPU Bottleneck
>
> Further increasing `width` or `depth` significantly decreases throughput when training ResNet-50 on ImageNet due to bottlenecks in performing data augmentation on the CPU.
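
Plugging those suggested values into the transform from the usage example above looks roughly like this (a sketch; the argument names follow the `AugmentAndMixTransform` call shown earlier):

```python
from composer.algorithms.augmix import AugmentAndMixTransform

# severity 3 of 10, width 3, depth sampled per-example from {1, 2, 3},
# alpha 1, and the full augmentation set.
augmix_transform = AugmentAndMixTransform(severity=3,
                                          width=3,
                                          depth=-1,
                                          alpha=1.0,
                                          augmentation_set="all")
```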
## Technical Details
@@ -125,7 +125,7 @@ When omitting the custom loss function and using the AugMix augmentation scheme
However, the increased CPU load imposed by AugMix substantially reduces throughput.

> ❗ Potential CPU Bottleneck
>
> We found that using AugMix with the hyperparameters recommended by Hendrycks et al. can increase the data augmentation load on the CPU so much that it bottlenecks training.
> Depending on the hardware configuration and model, we found that those hyperparameters increase training time by 1.1x-10x.
@@ -135,7 +135,7 @@ In addition, AugMix is a regularization technique, meaning it makes training mor
Doing so can allow models to reach higher quality, but this typically requires (1) larger models with more capacity to perform this more difficult learning and (2) longer training runs to allow these models time to learn.

> 🚧 AugMix May Reduce Quality for Smaller Models and Shorter Training Runs
>
> AugMix is a regularization technique that makes training more difficult for the model.
> Because AugMix is a regularization technique, it can allow models to reach higher quality for
>
@@ -150,7 +150,7 @@ Doing so can allow models to reach higher quality, but this typically requires (
> As a general rule, composing regularization methods may lead to diminishing returns in quality improvements while increasing the risk of creating a CPU bottleneck.
> ❗ CIFAR-10C and ImageNet-C are no longer out-of-distribution
>
> [CIFAR-10C and ImageNet-C](https://github.com/hendrycks/robustness) are test sets created to evaluate the ability of models to generalize to images that are corrupted in various ways (i.e., images that are _out-of-distribution_ with respect to the standard CIFAR-10 and ImageNet training sets).
> These images were corrupted using some of the augmentation techniques in `augmentation_set=all`.
> If you use `augmentation_set=all`, these images are therefore no longer out-of-distribution.
2 changes: 1 addition & 1 deletion composer/algorithms/augmix/metadata.json
@@ -12,4 +12,4 @@
"summary": "Creates multiple random chain of augmentations for each sample, and takes a convex combination over the chains",
"use": "Computer vision tasks"
}
}
2 changes: 1 addition & 1 deletion composer/algorithms/blurpool/README.md
@@ -77,7 +77,7 @@ For max pooling, it replaces `torch.nn.MaxPool2d` instances with instances of a

🚧 Implementation Note
>
> Blurpool does not replace strided convolutions with fewer than `min_channels` input channels, which by default is set to `16`. This is a heuristic used to avoid blurpooling the network's input. Doing so is undesirable since it amounts to downsampling the input by more than the amount specified in the preprocessing pipeline.
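
A functional-API usage sketch, mirroring the `cf.apply_agc` example earlier in this diff (assuming the functional form accepts the `min_channels` knob described in the note above):

```python
import torchvision.models as models
import composer.functional as cf

model = models.resnet18()  # any CNN with strided convs / max pooling
# Skip layers with fewer than 16 input channels so the network input
# itself is not blurpooled.
cf.apply_blurpool(model, min_channels=16)
```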
## Suggested Hyperparameters

6 changes: 3 additions & 3 deletions composer/algorithms/channels_last/README.md
@@ -17,7 +17,7 @@ This is a systems-level method that does not change the math or outcome of train
### Functional Interface

```python
# Run the Channels Last algorithm directly on the model using the Composer functional API

import composer.functional as cf

@@ -27,7 +27,7 @@ def training_loop(model, train_loader):
opt = torch.optim.Adam(model.parameters())
loss_fn = F.cross_entropy
model.train()

for epoch in range(num_epochs):
for X, y in train_loader:
y_hat = model(X)
@@ -78,7 +78,7 @@ If the model weights are instead initialized in NHWC format, PyTorch will automa
We currently implement this method by casting the user’s model to channels-last format (no changes to the dataloader are necessary). When the first convolution operation receives its input activation, it will automatically convert it to NHWC format, after which the memory format will persist for the remainder of the network (or until it reaches a layer that cannot support having channels last).
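
Concretely, the cast described above is a single standard PyTorch call; a minimal sketch:

```python
import torch
import torchvision.models as models

model = models.resnet50()
# Move weight tensors to NHWC (channels-last) memory layout; activations
# follow automatically once the first conv receives channels-last input.
model = model.to(memory_format=torch.channels_last)
```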

> ❗ Overhead from Operations Incompatible with Channels Last Memory Format
>
> If a model has layers that cannot support the channels last memory format, there will be overhead due to PyTorch switching activation tensors back and forth between NCHW and NHWC memory formats. We believe this problem currently affects placing channels last on UNet.
## Attribution
2 changes: 1 addition & 1 deletion composer/algorithms/channels_last/metadata.json
@@ -12,4 +12,4 @@
"summary": "Stores activation and weight tensors in a NHWC (batch, height, width, channels) format, rather than Pytorch\u2019s default of NCHW.",
"use": "2D Convolutional Neural Networks"
}
}
2 changes: 1 addition & 1 deletion composer/algorithms/colout/README.md
@@ -81,7 +81,7 @@ trainer.fit()
### Implementation Details

ColOut currently has two implementations.
One implementation, accessed by passing `batch=False`, acts as an additional data augmentation for use in PyTorch dataloaders. It runs on the CPU and applies ColOut independently to each training example.
A second implementation, accessed by passing `batch=True`, runs immediately before the training example is provided to the model. It runs on the GPU and drops the same rows and columns for all training examples in a mini-batch.
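
A minimal sketch of the batched (`batch=True`) variant (hypothetical helper; `p_row`/`p_col` stand for the dropped fraction of rows and columns):

```python
import torch

def colout_batch_sketch(X: torch.Tensor, p_row: float = 0.15, p_col: float = 0.15) -> torch.Tensor:
    # Keep each row/column with probability 1 - p; the same rows and columns
    # are dropped for every image in the (N, C, H, W) batch.
    keep_rows = torch.rand(X.shape[-2]) > p_row
    keep_cols = torch.rand(X.shape[-1]) > p_col
    return X[..., keep_rows, :][..., keep_cols]
```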

## Suggested Hyperparameters
2 changes: 1 addition & 1 deletion composer/algorithms/colout/metadata.json
@@ -12,4 +12,4 @@
"summary": "Drops a fraction of the rows and columns of an input image to reduce the image size and add variability.",
"use": "Computer vision tasks"
}
}
4 changes: 2 additions & 2 deletions composer/algorithms/cutmix/cutmix.py
@@ -184,10 +184,10 @@ class CutMix(Algorithm):
box such that each pixel has an equal probability of being mixed.
If ``False``, defaults to the sampling used in the original
paper implementation. Default: ``False``.
input_key (str, int, or Callable): A key that indexes to the input
from the batch. Can also be a pair of get and set functions, where the getter
is assumed to be first in the pair.
target_key (str, int, or Callable): A key that indexes to the target
from the batch. Can also be a pair of get and set functions, where the getter
is assumed to be first in the pair.
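
For intuition, the original paper's box sampling that the default (`uniform_sampling=False`) follows can be sketched as (hypothetical helper, not Composer's exact code):

```python
import torch

def rand_bbox_sketch(H: int, W: int, lam: float):
    # Box area is a (1 - lam) fraction of the image; its center is uniform,
    # so boxes near the border get clipped (this is the non-uniform pixel
    # coverage that uniform_sampling=True corrects for).
    cut_ratio = (1.0 - lam) ** 0.5
    cut_h, cut_w = int(H * cut_ratio), int(W * cut_ratio)
    cy, cx = int(torch.randint(H, (1,))), int(torch.randint(W, (1,)))
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
    return y1, y2, x1, x2
```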
2 changes: 1 addition & 1 deletion composer/algorithms/cutmix/metadata.json
@@ -12,4 +12,4 @@
"summary": "Overlays a patch of a different image onto the input, and interpolates labels accordingly.",
"use": "Image classification, semantic segmentation"
}
}
2 changes: 1 addition & 1 deletion composer/algorithms/cutout/README.md
@@ -65,7 +65,7 @@ trainer.fit()

### Implementation Details

CutOut randomly selects `num_holes` square regions (which are possibly overlapping) with side length `length` and uses them to generate a binary mask for the image where the points within any hole are set to 0 and the remaining points are set to 1.
This mask is then multiplied element-wise with the image in order to set the pixel value of any pixel value within a hole to 0.

CutOut is implemented following the [original paper](https://arxiv.org/abs/1708.04552). However, our implementation currently differs in that CutOut operates on a batch of data and runs on device to avoid potential CPU bottlenecks.
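
A hedged sketch of that mask-and-multiply procedure (hypothetical helper; it treats `length` as a fraction of the shorter image side, matching the `cutout_batch` signature shown below):

```python
import torch

def cutout_batch_sketch(X: torch.Tensor, num_holes: int = 1, length: float = 0.5) -> torch.Tensor:
    # One binary mask (0 inside each square hole, 1 elsewhere) multiplied
    # into every image, so the same squares are cut out batch-wide.
    N, C, H, W = X.shape
    side = int(length * min(H, W))
    mask = torch.ones(H, W, device=X.device)
    for _ in range(num_holes):
        cy, cx = int(torch.randint(H, (1,))), int(torch.randint(W, (1,)))
        y1, y2 = max(cy - side // 2, 0), min(cy + side // 2, H)
        x1, x2 = max(cx - side // 2, 0), min(cx + side // 2, W)
        mask[y1:y2, x1:x2] = 0
    return X * mask
```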
2 changes: 1 addition & 1 deletion composer/algorithms/cutout/cutout.py
@@ -75,7 +75,7 @@ def cutout_batch(input: ImgT, num_holes: int = 1, length: float = 0.5, uniform_s


class CutOut(Algorithm):
"""`CutOut <https://arxiv.org/abs/1708.04552>`_ is a data augmentation technique
"""`CutOut <https://arxiv.org/abs/1708.04552>`_ is a data augmentation technique
that works by masking out one or more square regions of an input image.
This implementation cuts out the same square from all images in a batch.
2 changes: 1 addition & 1 deletion composer/algorithms/cutout/metadata.json
@@ -12,4 +12,4 @@
"summary": "Masks out one or more square regions of an input image.",
"use": "Computer vision tasks"
}
}
2 changes: 1 addition & 1 deletion composer/algorithms/ema/README.md 100755 → 100644
@@ -94,4 +94,4 @@ Our implementation of EMA also provides the option to use the EMA weights as the

Our implementation of EMA was inspired by [Tensorflow's Exponential Moving Average](https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage)

*This Composer implementation of this method and the accompanying documentation were produced by Cory Stephenson at MosaicML.*
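
For reference, a minimal sketch of the exponential-moving-average update this method maintains (hypothetical helper, not Composer's exact code; the smoothing value is illustrative):

```python
import copy
import torch

def ema_update_sketch(model: torch.nn.Module, ema_model: torch.nn.Module, smoothing: float = 0.99):
    # One EMA step: ema = smoothing * ema + (1 - smoothing) * weights.
    with torch.no_grad():
        for p, ema_p in zip(model.parameters(), ema_model.parameters()):
            ema_p.mul_(smoothing).add_(p, alpha=1.0 - smoothing)

model = torch.nn.Linear(4, 4)      # stand-in for the real model
ema_model = copy.deepcopy(model)   # averaged copy used for evaluation
ema_update_sketch(model, ema_model)
```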
Empty file modified composer/algorithms/ema/__init__.py 100755 → 100644
Empty file modified composer/algorithms/ema/ema.py 100755 → 100644
2 changes: 1 addition & 1 deletion composer/algorithms/ema/metadata.json 100755 → 100644
@@ -13,4 +13,4 @@
"summary": "Maintains an exponential moving average of model weights for use in evaluation.",
"use": "Generally applicable"
}
}
