<a href="https://colab.research.google.com/gist/justheuristic/60fdfe5c90c053a93c00f950a1abd0da/collaborative-training-v0-14.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><img src="https://i.imgur.com/FHMoW3N.png" width=360px><br><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Collaborative training <sup>v0.9 alpha</sup></b></center>


This notebook uses a local or Colab GPU to help train ALBERT-large collaboratively. Your instance will compute gradients and exchange them with volunteers around the world. We explain how it works at the bottom, but for now, please run all cells :)

This is a test run to root out any issues before the main event. The run will be terminated by __23:59 on 11 May (GMT+0)__. Please do not run Colab notebooks from multiple Google accounts: Google doesn't like this.

In [None]:
experiment_name = "bengali_test_1"
hivemind_version = "bengali_test_1"
collaborative_training_version = "main" 
coordinator_ip = '18.217.249.202'
coordinator_port = 31337

!echo "Installing dependencies..."
!pip install git+https://github.com/learning-at-home/hivemind.git@{hivemind_version} >> install.log 2>&1
!git clone https://github.com/mryab/collaborative-training -b {collaborative_training_version} >> install.log 2>&1
!cd collaborative-training && pip install -r requirements.txt >> install.log 2>&1 && cd ..
%cd ./collaborative-training

import torch
from runner import run_with_logging
assert torch.cuda.is_available(), "GPU device not found. If running in colab, please retry in a few minutes."
device_name = torch.cuda.get_device_name(0)
microbatch_size = 4 if 'T4' in device_name or 'P100' in device_name else 1
print(f"Running with device {device_name}, local batch size = {microbatch_size}")

import uuid
wandb_run_name = str(uuid.uuid4())

command = f"""ulimit -n 4096 && HIVEMIND_THREADS=256 python ./run_trainer.py \
 --client_mode --initial_peers {coordinator_ip}:{coordinator_port} --averaging_expiration 10 --statistics_expiration 120 \
 --batch_size_lead 200 --per_device_train_batch_size {microbatch_size} --gradient_accumulation_steps 1 \
 --logging_first_step --logging_steps 100 --run_name {wandb_run_name}  --output_dir ./outputs --overwrite_output_dir --logging_dir ./logs \
 --experiment_prefix {experiment_name} --seed 42"""
run_with_logging(command, coordinator_ip, wandb_login=True)

2021-05-07 07:39:03.799888: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
05/07/2021 07:39:07 - WARN - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: True
05/07/2021 07:39:07 - INFO - __main__ -   Training/evaluation parameters AlbertTrainingArguments(output_dir='./outputs', overwrite_output_dir=True, do_train=True, do_eval=None, do_predict=False, evaluation_strategy=<IntervalStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=4, per_device_eval_batch_size=4, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=0.00176, weight_decay=0.01, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=1000000, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=5000, logging_dir='./log

### What's up next?
* Check the training progress on public learning curves: https://wandb.ai/yhn112/Demo-run-2/runs/1cnb06p7
* Run a second GPU session with Kaggle notebooks: **TBA**
* See [this tutorial](https://github.com/learning-at-home/hivemind/tree/master/examples/albert) on how to start your own collaborative runs!


_Co-created by [leshanbog](https://github.com/leshanbog), [yhn112](https://github.com/yhn112) and [foksly](https://github.com/foksly) from [hivemind](https://github.com/learning-at-home/hivemind) (YSDA), [lhoestq](https://github.com/lhoestq), [SaulLu](https://github.com/SaulLu) and [stas00](https://github.com/stas00) from [Hugging Face](http://huggingface.co)_.


### How it works

Since peers can join and leave at any time, we can't use global [Ring All-Reduce](https://towardsdatascience.com/visual-intuition-on-ring-allreduce-for-distributed-deep-learning-d1f34b4911da) for averaging: a single missing peer can break the entire protocol. Instead, peers dynamically assemble into small groups and run all-reduce within each group. Consider an example with 9 GPUs:

<center>
<img src="https://i.imgur.com/QcD1mfG.png" width=360px><br>
The all-reduce protocol within a group can be Ring All-Reduce, but we use a simpler all-to-all algorithm known as butterfly-like all-reduce.<br>
<img src="https://i.imgur.com/ewq3vS6.png" width=380px><br>
After each successful round, participants shuffle around and find new groups:<br>
<img src="https://i.imgur.com/dexNCL3.png" width=350px>

If one of the peers fails to do its part, the failure only affects its local group, and only for a single round.


<img src="https://i.imgur.com/RBmElUi.png" width=340px>

Afterwards, peers from the failed group will find new groupmates according to the [moshpit algorithm](https://arxiv.org/abs/2103.03239).

</center>
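The group averaging described above can be illustrated with a toy simulation. This is only a sketch and not hivemind's API: `average_in_groups` is a hypothetical helper, each peer holds a single number instead of a gradient tensor, and the fixed "rows, then columns" schedule is a simplifying assumption (in the real system, groups are formed dynamically). With 9 peers arranged in a 3x3 grid, two rounds are enough for every peer to hold the global average.

```python
from statistics import fmean


def average_in_groups(values, groups, failed=frozenset()):
    """One averaging round: each intact group replaces its members' values with the group mean."""
    new_values = list(values)
    for group in groups:
        if any(peer in failed for peer in group):
            continue  # a failed peer only spoils its own group, and only for this round
        mean = fmean(values[peer] for peer in group)
        for peer in group:
            new_values[peer] = mean
    return new_values


# Hypothetical local values of 9 peers (stand-ins for gradients); the global mean is 4.0
values = [float(i) for i in range(9)]
rows = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]  # groups in round 1
cols = [[0, 3, 6], [1, 4, 7], [2, 5, 8]]  # groups after the shuffle, round 2

after_round_1 = average_in_groups(values, rows)
after_round_2 = average_in_groups(after_round_1, cols)
print(after_round_2)  # -> [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0]

# Peer 4 fails in round 1: only its group [3, 4, 5] skips averaging for that round
with_failure = average_in_groups(values, rows, failed={4})
recovered = average_in_groups(with_failure, cols)
```

In the failure case, the peers do not reach the exact average after two rounds, but the overall mean is preserved and the values keep getting closer with every further round.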


If you want to learn more and even host your own collaborative experiments, take a look at the [hivemind library](https://github.com/learning-at-home/hivemind/) or the [Moshpit-SGD paper](https://arxiv.org/pdf/2103.03239.pdf).
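For the curious, here is a rough sketch of the butterfly-like all-reduce used within a group. It reflects our understanding of the general idea rather than hivemind's actual implementation: `butterfly_all_reduce` is a hypothetical function, plain Python lists stand in for tensors, and network communication is simulated in-process. Each peer is responsible for one chunk of the vector: it collects that chunk from every groupmate, averages it, and sends the result back to everyone.

```python
def butterfly_all_reduce(vectors):
    """Simulate butterfly-like all-reduce: vectors[i] is the local vector of peer i."""
    n = len(vectors)
    length = len(vectors[0])
    # Peer i is responsible for the slice [bounds[i], bounds[i + 1]) of every vector
    bounds = [length * i // n for i in range(n + 1)]

    # Step 1: every peer sends chunk i to peer i, which averages the chunks it received
    reduced_chunks = []
    for i in range(n):
        lo, hi = bounds[i], bounds[i + 1]
        chunk = [sum(vec[j] for vec in vectors) / n for j in range(lo, hi)]
        reduced_chunks.append(chunk)

    # Step 2: every peer sends its averaged chunk to all others, who reassemble the vector
    averaged = [x for chunk in reduced_chunks for x in chunk]
    return [list(averaged) for _ in range(n)]


result = butterfly_all_reduce([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
print(result[0])  # -> [4.0, 5.0, 6.0]
```

Every peer sends and receives roughly one full vector in total, regardless of the group size, which is why this scheme works well for small groups.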


