In [2]:
# install fastkaggle if not available
try: import fastkaggle
except ModuleNotFoundError:
    !pip install -Uq fastkaggle

from fastkaggle import *

In [2]:
# Fix for the 'area' mode issue on MPS
import os
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'
import torch
import torch.nn.functional as F

_original_interpolate = F.interpolate

def _patched_interpolate(input, size=None, scale_factor=None, mode='nearest', align_corners=None, recompute_scale_factor=None, antialias=False):
    if mode == 'area' and input.device.type == 'mps':
        # MPS does not support 'area' mode, so run it on the CPU and move the result back
        return _original_interpolate(
            input.cpu(), size, scale_factor, mode, align_corners, recompute_scale_factor, antialias
        ).to('mps')
    return _original_interpolate(input, size, scale_factor, mode, align_corners, recompute_scale_factor, antialias)

F.interpolate = _patched_interpolate



This is part 3 of the [Road to the Top](https://www.kaggle.com/code/jhoward/first-steps-road-to-the-top-part-1) series, in which I show the process I used to tackle the [Paddy Doctor](https://www.kaggle.com/competitions/paddy-disease-classification) competition, leading to four 1st place submissions. The previous notebook is available here: [part 2](https://www.kaggle.com/code/jhoward/small-models-road-to-the-top-part-2).

## Memory and gradient accumulation

First we'll repeat the steps we used last time to access the data and ensure all the latest libraries are installed, and we'll also grab the files we'll need for the test set:

In [7]:
!python -c "import torch, torchvision; print('torch', torch.__version__); print('torchvision', torchvision.__version__); print('cuda', torch.version.cuda); print('cuda_available', torch.cuda.is_available())"
!python -c "import torchvision; import torchvision.ops; print('torchvision ops ok'); from torchvision.ops import nms; print('nms import ok')"
!pip show torch torchvision | sed -n '1,120p'

torch 2.9.1+cu128
torchvision 0.24.1+cu128
cuda 12.8
cuda_available True
torchvision ops ok
nms import ok
Name: torch
Version: 2.9.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org
Author: 
Author-email: PyTorch Team <packages@pytorch.org>
License: BSD-3-Clause
Location: /home/goosman/venvs/fastai/lib/python3.14/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-cufile-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-cusparselt-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvshmem-cu12, nvidia-nvtx-cu12, setuptools, sympy, triton, typing-extensions
Required-by: fastai, torchvision
---
Name: torchvision
Version: 0.24.1
Summary: image and video datasets and models for torch deep learning
Home-page: https://github.com/pytorch/vision
Author: P

In [4]:
comp = 'paddy-disease-classification'
path = setup_comp(comp, install='fastai "timm>=0.6.2.dev0"')
from fastai.vision.all import *
set_seed(42)

tst_files = get_image_files(path/'test_images').sorted()



In this analysis our goal will be to train an ensemble of larger models with larger inputs. The challenge when training such models is generally GPU memory. Kaggle GPUs have 16280MiB of memory available at the time of writing. I like to try out my notebooks on my home PC, then upload them -- but I still need them to run OK on Kaggle (especially if it's a code competition, where this is required). My home PC has 24GiB cards, so just because it runs OK at home doesn't mean it'll run OK on Kaggle.
 
It's really helpful to be able to quickly try a few models and image sizes and find out what will run successfully. To make this quick, we can just grab a small subset of the data for running short epochs -- the memory use will still be the same, but it'll be much faster.

One easy way to do this is to simply pick a category with few files in it. Here are our options:

In [4]:
df = pd.read_csv(path/'train.csv')
df.label.value_counts()

label
normal                      1764
blast                       1738
hispa                       1594
dead_heart                  1442
tungro                      1088
brown_spot                   965
downy_mildew                 620
bacterial_leaf_blight        479
bacterial_leaf_streak        380
bacterial_panicle_blight     337
Name: count, dtype: int64

Let's use *bacterial_panicle_blight* since it's the smallest:

In [5]:
trn_path = path/'train_images'/'bacterial_panicle_blight'

Now we'll set up a `train` function which is very similar to the steps we used for training in the last notebook. But there are a few significant differences...

The first is that I'm using a `finetune` argument to pick whether we are going to run the `fine_tune()` method, or the `fit_one_cycle()` method -- the latter is faster since it doesn't do an initial fine-tuning of the head. When we fine tune in this function I also have it calculate and return the TTA predictions on the test set, since later on we'll be ensembling the TTA results of a number of models. Note also that we no longer have `seed=42` in the `ImageDataLoaders` line -- that means we'll have different training and validation sets each time we call this. That's what we'll want for ensembling, since it means that each model will use slightly different data.
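As a rough illustration of why dropping the seed matters, here is a minimal sketch of a random train/validation split. It is not fastai's implementation; `random_split`, its arguments, and the file names are made up for this example. With a fixed seed every call produces the same split, while with `seed=None` each call draws a fresh one.

```python
import random

def random_split(items, valid_pct=0.2, seed=None):
    """Shuffle item indices and hold out `valid_pct` of them for validation."""
    rng = random.Random(seed)  # seed=None means a different shuffle on each call
    idxs = list(range(len(items)))
    rng.shuffle(idxs)
    cut = int(len(items) * valid_pct)
    valid = [items[i] for i in idxs[:cut]]
    train = [items[i] for i in idxs[cut:]]
    return train, valid

files = [f"img_{i:03d}.jpg" for i in range(10)]  # hypothetical file names

# Fixed seed: the same split every time
fixed_a = random_split(files, seed=42)
fixed_b = random_split(files, seed=42)
print(fixed_a == fixed_b)

# No seed: the validation set will usually differ between calls
_, valid_1 = random_split(files)
_, valid_2 = random_split(files)
print(valid_1, valid_2)
```

Models trained on different splits make somewhat different mistakes, which is what makes averaging them worthwhile.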

The more important change is that I've added an `accum` argument to implement *gradient accumulation*. As you'll see in the code below, this does two things:

1. Divide the batch size by `accum`
1. Add the `GradientAccumulation` callback, passing in the target effective batch size (64).

In [24]:
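def train(arch, size, item=Resize(480, method='squish'), accum=1, finetune=True, epochs=12):
    # free memory left over from any previous run
    gc.collect()
    torch.cuda.empty_cache()
    # no seed here, so each call gets a different train/validation split
    dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=item,
        batch_tfms=aug_transforms(size=size, min_scale=0.75), bs=64//accum)
    # accumulate gradients until 64 items have been seen, keeping the effective batch size at 64
    cbs = GradientAccumulation(64) if accum > 1 else []
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    if finetune:
        learn.fine_tune(epochs, 0.01)
        # return the TTA predictions on the test set for later ensembling
        return learn.tta(dl=dls.test_dl(tst_files))
    else:
        learn.unfreeze()
        learn.fit_one_cycle(epochs, 0.01)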

*Gradient accumulation* refers to a very simple trick: rather than updating the model weights after every batch based on that batch's gradients, instead keep *accumulating* (adding up) the gradients for a few batches, and then update the model weights with those accumulated gradients. In fastai, the parameter you pass to `GradientAccumulation` defines how many items must be seen before the weights are updated (here 64, the effective batch size). Since we're adding up the gradients over `accum` batches, we therefore need to divide the batch size by that same number. The resulting training loop is nearly mathematically identical to using the original batch size, but the amount of memory used is the same as using a batch size `accum` times smaller!

For instance, here's a basic example of a single epoch of a training loop without gradient accumulation:

```python
for x,y in dl:
    calc_loss(coeffs, x, y).backward()
    coeffs.data.sub_(coeffs.grad * lr)
    coeffs.grad.zero_()
```

Here's the same thing, but with gradient accumulation added (assuming a target effective batch size of 64):

```python
count = 0            # track count of items seen since last weight update
for x,y in dl:
    count += len(x)  # update count based on this minibatch size
    calc_loss(coeffs, x, y).backward()
    if count>=64:    # count has reached the accumulation target, so do weight update
        coeffs.data.sub_(coeffs.grad * lr)
        coeffs.grad.zero_()
        count=0      # reset count
```
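To see why the accumulated update can match the full-batch update, here is a minimal pure-Python sketch. It assumes a one-parameter linear model with a mean squared error loss; `grad_mse`, `xs`, `ys`, and `w` are made up for illustration. Each micro-batch gradient is weighted by its share of the full batch, which is similar in spirit to the loss rescaling that fastai's callback applies.

```python
import math

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

# Made-up data for illustration
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# One full batch of 8 items
full_grad = grad_mse(w, xs, ys)

# The same data as accum=4 micro-batches of 2 items each
accum = 4
micro_bs = len(xs) // accum
acc_grad = 0.0
for i in range(0, len(xs), micro_bs):
    g = grad_mse(w, xs[i:i + micro_bs], ys[i:i + micro_bs])
    acc_grad += g * micro_bs / len(xs)  # weight by this micro-batch's share of the full batch

print(math.isclose(full_grad, acc_grad))
```

In a real network the match is only approximate. One reason is that layers such as batch normalization compute their statistics per micro-batch rather than over the full effective batch.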

The full implementation in fastai is only a few lines of code -- here's the [source code](https://github.com/fastai/fastai/blob/master/fastai/callback/training.py#L26).

To see the impact of gradient accumulation, consider a small model, `convnext_small_in22k`, trained at size 128 with `accum=1`.

Let's create a function to find out how much memory it used, and also to then clear out the memory for the next run:

In [7]:
import gc
def report_gpu():
    if torch.backends.mps.is_available():
        # MPS version
        print(f"MPS allocated: {torch.mps.current_allocated_memory() / 1024**2:.2f} MB")
        print(f"MPS driver allocated: {torch.mps.driver_allocated_memory() / 1024**2:.2f} MB")
        gc.collect()
        torch.mps.empty_cache()
    else:
        # CUDA version
        print(torch.cuda.list_gpu_processes())
        gc.collect()
        torch.cuda.empty_cache()

So with `accum=1` the GPU used around 5GB RAM. Let's try `accum=2`:

In [25]:
print('before training')
report_gpu()
train('convnext_small_in22k', 128, epochs=1, accum=2, finetune=False)
print('after training')
report_gpu()


before training
GPU:0
process      11853 uses    17542.000 MB GPU memory




epoch,train_loss,valid_loss,error_rate,time
0,2.384507,6.536691,0.808746,00:25


after training
GPU:0
process      11853 uses     8582.000 MB GPU memory


As you see, the RAM usage has now gone down to 4GB. It's not halved since there's other overhead involved (for larger models this overhead is likely to be relatively lower).

Let's try `4`:

In [9]:
print('before training')
report_gpu()
train('convnext_small_in22k', 128, epochs=1, accum=4, finetune=False)
print('after training')
report_gpu()

before training
GPU:0
process       9248 uses      506.000 MB GPU memory


epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:01


after training
GPU:0
process       9248 uses     1882.000 MB GPU memory


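The memory use is even lower!

In [17]:
# Check whether fp16 is really being used
dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=Resize(480, method='squish'),
    batch_tfms=aug_transforms(size=128, min_scale=0.75), bs=64)

learn = vision_learner(dls, 'convnext_small_in22k', metrics=error_rate)
print(next(learn.model.parameters()).dtype)  # inspect the weight dtype

learn.to_fp16()
print([type(cb).__name__ for cb in learn.cbs])
print(next(learn.model.parameters()).dtype)  # check again

torch.float32
['TrainEvalCallback', 'Recorder', 'CastToTensor', 'ProgressCallback', 'MixedPrecision']
torch.float32

Both dtype checks print `torch.float32`, which is consistent with mixed-precision training: `to_fp16()` adds the `MixedPrecision` callback (visible in the callback list), and the stored weights stay in float32 while the forward pass runs in reduced precision.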

## Checking memory use

We'll now check the memory use for each of the architectures and sizes we'll be training later, to ensure they all fit in 16GB RAM. For each of these, I tried `accum=1` first, and then doubled it any time the resulting memory use was over 16GB. As it turns out, `accum=2` was what I needed for every case.
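The doubling procedure described above can be sketched as a small helper. This is an illustration only: `used_mb` parses text in the format that `torch.cuda.list_gpu_processes()` prints in this notebook, and `measure` is a hypothetical callable that would train briefly with a given `accum` and return that report. The 16280 limit is the Kaggle figure quoted earlier, and the numbers produced by `fake_measure` are made up.

```python
import re

KAGGLE_LIMIT_MB = 16280  # figure quoted earlier in this notebook

def used_mb(report):
    """Sum the per-process memory figures in a list_gpu_processes-style report."""
    return sum(float(m) for m in re.findall(r"uses\s+([\d.]+)\s+MB", report))

def smallest_accum(measure, limit=KAGGLE_LIMIT_MB, max_accum=64):
    """Start at accum=1 and double it until the measured memory fits under `limit`."""
    accum = 1
    while accum <= max_accum:  # bs=64//accum would reach 0 beyond this point
        if used_mb(measure(accum)) <= limit:
            return accum
        accum *= 2
    raise RuntimeError("no accum value up to max_accum fits in memory")

# Hypothetical stand-in for a real training run: made-up numbers, for illustration only
def fake_measure(accum):
    return f"GPU:0\nprocess       1234 uses {4000 + 20000 / accum:10.3f} MB GPU memory"

print(smallest_accum(fake_measure))
```

In practice `measure` would wrap a call such as `train(..., epochs=1, finetune=False)` followed by reading the GPU report, which is what the cells below do by hand.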

First, `convnext_large`:

In [15]:
trn_path = path/'train_images'/'bacterial_panicle_blight'
trn_path

Path('paddy-disease-classification/train_images/bacterial_panicle_blight')

In [9]:
train('convnext_large_in22k', 224, epochs=1, accum=1, finetune=False)
report_gpu()



epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:10


GPU:0
process      11853 uses    16590.000 MB GPU memory


In [23]:
train('convnext_large_in22k', (320,240), epochs=1, accum=1, finetune=False)
report_gpu()

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:09


GPU:0
process       2875 uses    39922.000 MB GPU memory


Here's `vit_large`. This one is very close to going over the 16280MiB we've got on Kaggle!

In [25]:
train('vit_large_patch16_224', 224, epochs=1, accum=1, finetune=False)
report_gpu()

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:02


GPU:0
process       2875 uses    39588.000 MB GPU memory


Then finally our `swinv2` and `swin` models:

In [10]:
# ========== Fix for the Swin Transformer channels-last output ==========
# Patch the forward method of the TimmBody class
from fastai.vision.learner import TimmBody

_original_timm_body_forward = TimmBody.forward
def _patched_timm_body_forward(self, x):
    out = _original_timm_body_forward(self, x)
    # Detect channels-last format (N, H, W, C):
    # if the output is 4-D and its last dimension is larger than dimension 1
    # (the first non-batch dimension), treat it as channels-last
    if out.dim() == 4 and out.shape[-1] > out.shape[1]:
        out = out.permute(0, 3, 1, 2)  # (N, H, W, C) -> (N, C, H, W)
    return out

TimmBody.forward = _patched_timm_body_forward
print("✅ TimmBody.forward patched")
# ========== End of fix ==========

✅ TimmBody.forward patched
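The detection rule in the patch above is a heuristic, so it is worth seeing where it does and does not fire. The sketch below applies the same test to plain shape tuples instead of tensors; the helper names and the example shapes are illustrative assumptions, not values taken from timm.

```python
def looks_channels_last(shape):
    """Same test as the patch: 4-D and last dimension larger than dimension 1."""
    return len(shape) == 4 and shape[-1] > shape[1]

def to_channels_first(shape):
    """Shape after permute(0, 3, 1, 2), applied only when the heuristic fires."""
    if looks_channels_last(shape):
        n, h, w, c = shape
        return (n, c, h, w)
    return shape

# (N, H, W, C) feature map with many channels: detected and permuted
print(to_channels_first((8, 7, 7, 1536)))

# Already (N, C, H, W): left alone
print(to_channels_first((8, 1536, 7, 7)))

# Caveat: a channels-first tensor with fewer channels than its width
# also passes the test, so it would be permuted by mistake
print(looks_channels_last((8, 3, 16, 16)))
```

The caveat is unlikely to matter for a model body, whose output usually has far more channels than spatial positions per side, but it is a reason to treat the patch as a workaround rather than a general fix.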


In [11]:
train('swinv2_large_window12_192_22k', 192, epochs=1, accum=1, finetune=False)
report_gpu()



epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:03


GPU:0
process      11853 uses    21592.000 MB GPU memory


In [16]:
train('swin_large_patch4_window7_224', 224, epochs=1, accum=1, finetune=False)
report_gpu()

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:02


GPU:0
process      11853 uses    18462.000 MB GPU memory


## Running the models

Using the previous notebook, I tried a bunch of different architectures and preprocessing approaches on small models, and picked a few which looked good. We'll use a `dict` to list the preprocessing approaches we'll use for each architecture of interest based on that analysis:

In [26]:
res = 640,480

In [27]:
models = {
    'convnext_large_in22k': {
        (Resize(res), (320,224)),
    }, 'vit_large_patch16_224': {
        (Resize(480, method='squish'), 224),
        (Resize(res), 224),
    }, 'swinv2_large_window12_192_22k': {
        (Resize(480, method='squish'), 192),
        (Resize(res), 192),
    }, 'swin_large_patch4_window7_224': {
        (Resize(res), 224),
    }
}

We'll need to switch to using the full training set of course!

In [28]:
trn_path = path/'train_images'

Now we're ready to train all these models. Remember that each is using a different training and validation set, so the results aren't directly comparable.

We'll append each set of TTA predictions on the test set into a list called `tta_res`.

In [29]:
tta_res = []

for arch,details in models.items():
    for item,size in details:
        print('---',arch)
        print(size)
        print(item.name)
        tta_res.append(train(arch, size, item=item, accum=1)) #, epochs=1))
        gc.collect()
        torch.cuda.empty_cache()

--- convnext_large_in22k
(320, 224)
Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}



epoch,train_loss,valid_loss,error_rate,time
0,0.890889,0.546509,0.168669,00:45


epoch,train_loss,valid_loss,error_rate,time
0,0.349456,0.256535,0.073522,01:00
1,0.275063,0.213506,0.065834,01:00
2,0.282137,0.302557,0.084575,01:00
3,0.2156,0.244575,0.061989,01:00
4,0.157894,0.153115,0.037963,01:00
5,0.126237,0.281991,0.067756,01:00
6,0.085962,0.132171,0.025949,01:00
7,0.062512,0.144644,0.032677,01:00
8,0.041081,0.13439,0.025949,01:00
9,0.035297,0.125425,0.027391,01:00


--- vit_large_patch16_224
224
Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}



epoch,train_loss,valid_loss,error_rate,time
0,1.066306,0.570439,0.184527,00:42


epoch,train_loss,valid_loss,error_rate,time
0,0.415761,0.223551,0.065834,00:55
1,0.312363,0.326836,0.09851,00:55
2,0.333807,0.330522,0.09851,00:55
3,0.274093,0.250372,0.0716,00:55
4,0.202696,0.252313,0.063431,00:55
5,0.160416,0.166796,0.049976,00:55
6,0.125941,0.179168,0.041326,00:55
7,0.079664,0.085384,0.023066,00:55
8,0.055739,0.075119,0.016338,00:55
9,0.03795,0.072022,0.01826,00:55


--- vit_large_patch16_224
224
Resize -- {'size': (480, 480), 'method': 'squish', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}



epoch,train_loss,valid_loss,error_rate,time
0,1.027437,0.664254,0.184046,00:41


epoch,train_loss,valid_loss,error_rate,time
0,0.410169,0.273226,0.08073,00:54
1,0.313051,0.286849,0.078808,00:54
2,0.332749,0.314416,0.097069,00:54
3,0.270615,0.208069,0.072081,00:54
4,0.191722,0.16272,0.049015,00:54
5,0.166271,0.185558,0.050937,00:54
6,0.097467,0.120691,0.030274,00:54
7,0.076019,0.116945,0.028832,00:54
8,0.064498,0.095065,0.024507,00:54
9,0.044415,0.088693,0.023546,00:54


--- swinv2_large_window12_192_22k
192
Resize -- {'size': (480, 480), 'method': 'squish', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}



epoch,train_loss,valid_loss,error_rate,time
0,0.946039,0.52675,0.162422,00:52


epoch,train_loss,valid_loss,error_rate,time
0,0.409663,0.268924,0.0889,01:03
1,0.334137,0.199871,0.064392,01:04
2,0.330473,0.229528,0.067756,01:04
3,0.267714,0.285534,0.084094,01:04
4,0.238674,0.176885,0.05382,01:04
5,0.15952,0.131516,0.030274,01:04
6,0.113025,0.146471,0.038924,01:04
7,0.098334,0.118228,0.030274,01:04
8,0.071669,0.106274,0.022105,01:04
9,0.058576,0.091081,0.021144,01:04


--- swinv2_large_window12_192_22k
192
Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}



epoch,train_loss,valid_loss,error_rate,time
0,0.968439,0.494394,0.15185,00:53


epoch,train_loss,valid_loss,error_rate,time
0,0.420322,0.223695,0.068717,01:04
1,0.337548,0.276236,0.08025,01:04
2,0.351189,0.358843,0.10716,01:05
3,0.27535,0.268515,0.074964,01:04
4,0.208403,0.194488,0.050937,01:04
5,0.169383,0.16659,0.040365,01:05
6,0.136004,0.149654,0.038443,01:05
7,0.101726,0.129833,0.031716,01:05
8,0.062535,0.101526,0.021144,01:05
9,0.043516,0.089714,0.017299,01:05


--- swin_large_patch4_window7_224
224
Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}



epoch,train_loss,valid_loss,error_rate,time
0,1.042753,0.631651,0.186929,00:42


epoch,train_loss,valid_loss,error_rate,time
0,0.451883,0.254706,0.077367,00:52
1,0.349714,0.240354,0.069678,00:51
2,0.330041,0.299767,0.085536,00:51
3,0.258307,0.18779,0.054301,00:51
4,0.242764,0.166374,0.043729,00:51
5,0.180317,0.142722,0.045171,00:51
6,0.143198,0.13462,0.042287,00:51
7,0.109201,0.124001,0.027871,00:51
8,0.074663,0.089387,0.024027,00:51
9,0.054525,0.093258,0.022105,00:51


## Ensembling

Since this has taken quite a while to run, let's save the results, just in case something goes wrong!

In [30]:
save_pickle('tta_res.pkl', tta_res)

`Learner.tta` returns predictions and targets for each row. We just want the predictions:

In [35]:
tta_prs = first(zip(*tta_res))

Originally I just used the above predictions, but later I realised in my experiments on smaller models that `vit` was a bit better than everything else, so I decided to give those double the weight in my ensemble. I did that by simply adding them to the list a second time (we could also do this by using a weighted average):

In [36]:
tta_prs += tta_prs[1:3]

An *ensemble* simply refers to a model which is itself the result of combining a number of other models. The simplest way to do ensembling is to take the average of the predictions of each model:

In [37]:
avg_pr = torch.stack(tta_prs).mean(0)
avg_pr.shape

torch.Size([3469, 10])
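The duplication trick used above is equivalent to a weighted average. Here is a minimal sketch with made-up probabilities for two test rows and three classes. Plain Python lists stand in for the tensors, and `mean_preds` and `weighted_mean_preds` are illustrative helpers, not fastai functions.

```python
import math

def mean_preds(preds):
    """Element-wise average of a list of [rows][classes] prediction tables."""
    n = len(preds)
    return [[sum(p[r][c] for p in preds) / n for c in range(len(preds[0][0]))]
            for r in range(len(preds[0]))]

def weighted_mean_preds(preds, weights):
    """Element-wise weighted average; the weights need not sum to 1."""
    total = sum(weights)
    return [[sum(w * p[r][c] for p, w in zip(preds, weights)) / total
             for c in range(len(preds[0][0]))]
            for r in range(len(preds[0]))]

# Made-up predictions from three models
a = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
b = [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
c = [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3]]

duplicated = mean_preds([a, b, c, b])                 # model b listed twice
weighted = weighted_mean_preds([a, b, c], [1, 2, 1])  # model b given weight 2

same = all(math.isclose(x, y)
           for row_d, row_w in zip(duplicated, weighted)
           for x, y in zip(row_d, row_w))
print(same)
```

Explicit weights make it easier to try non-integer weightings, while duplicating entries keeps the code to a single line.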

That's all that's needed to create an ensemble! Finally, we copy the steps we used in the last notebook to create a submission file:

In [38]:
dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=Resize(480, method='squish'),
    batch_tfms=aug_transforms(size=224, min_scale=0.75))

In [39]:
idxs = avg_pr.argmax(dim=1)
vocab = np.array(dls.vocab)
ss = pd.read_csv(path/'sample_submission.csv')
ss['label'] = vocab[idxs]
ss.to_csv('subm.csv', index=False)

In [43]:
import pandas as pd
df = pd.read_csv('subm.csv')
df
comp

'paddy-disease-classification'

Now we can submit:

In [None]:
if not iskaggle:
    from kaggle import api
    api.competition_submit_cli('subm.csv', 'part 3 v2', comp)

That's it -- at the time of creating this analysis, that got easily to the top of the leaderboard! Here are the four submissions I entered, each of which was better than the last, and each of which was ranked #1:

<img src="https://user-images.githubusercontent.com/346999/174503966-65005151-8f28-4f8b-b3c3-212cf74014f1.png" width="400">

*Edit: Actually the one that got to the top of the leaderboard timed out when I ran it on Kaggle Notebooks, so I had to remove four of the runs from the ensemble. There's only a small difference in accuracy however.*

Going from bottom to top, here's what each one was:

1. `convnext_small` trained for 12 epochs, with TTA
1. `convnext_large` trained the same way
1. The ensemble in this notebook, with `vit` models not over-weighted
1. The ensemble in this notebook, with `vit` models over-weighted.

## Conclusion

The key takeaway I hope to get across from this series so far is that you can get great results in image recognition using very little code and a very standardised approach, and that with a rigorous process you can improve in significant steps. Our training function, including data processing and TTA, is just half a dozen lines of code, plus another 7 lines of code to ensemble the models and create a submission file!

If you found this notebook useful, please remember to click the little up-arrow at the top to upvote it, since I like to know when people have found my work useful, and it helps others find it too. If you have any questions or comments, please pop them below -- I read every comment I receive!

In [27]:
# This is what I use to push my notebook from my home PC to Kaggle

if not iskaggle:
    push_notebook('jhoward', 'scaling-up-road-to-the-top-part-3',
                  title='Scaling Up: Road to the Top, Part 3',
                  file='10-scaling-up-road-to-the-top-part-3.ipynb',
                  competition=comp, private=False, gpu=True)