<a href="https://colab.research.google.com/github/mryab/collaborative-training/blob/auth/colab_starter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><img src="https://i.imgur.com/FHMoW3N.png" width=360px><br><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Collaborative training <sup>v0.9 alpha</sup></b></center>


This notebook uses a local or Colab GPU to help train ALBERT-large collaboratively. Your instance will compute gradients and exchange them with volunteers around the world. We explain how it works at the bottom, but for now, please run all cells :)

To start training, you will need to log in to your Hugging Face account. Please fill in the prompts as in the example below (replace `robot-bengali` with your username):

![img](https://i.imgur.com/txuWbJi.png)

This is a test run to root out any issues before the main event. The run will be terminated by __23:59 11 May GMT+0__. Please do not run Colab notebooks from multiple Google accounts: Google doesn't like this.

In [None]:
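experiment_name = "bengali_test_1_auth"
hivemind_version = "bengali_test_1_auth"  # git ref of hivemind installed below
collaborative_training_version = "auth"  # branch of collaborative-training cloned below
syslog_host = '18.191.120.93'  # passed to run_with_logging at the end of the cell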

!echo "Installing dependencies..."
!pip install git+https://github.com/learning-at-home/hivemind.git@{hivemind_version} >> install.log 2>&1
!git clone https://github.com/mryab/collaborative-training -b {collaborative_training_version} >> install.log 2>&1
!cd collaborative-training && pip install -r requirements.txt >> install.log 2>&1 && cd ..
%cd ./collaborative-training

import shlex
from getpass import getpass
import torch
from runner import run_with_logging
assert torch.cuda.is_available(), "GPU device not found. If running in colab, please retry in a few minutes."
device_name = torch.cuda.get_device_name(0)
microbatch_size = 4 if 'T4' in device_name or 'P100' in device_name else 1
print(f"Running with device {device_name}, local batch size = {microbatch_size}")

username = shlex.quote(input('Huggingface login: '))
password = shlex.quote(getpass('Huggingface password: '))

command = f"""ulimit -n 4096 && HIVEMIND_THREADS=256 HF_USERNAME={username} HF_PASSWORD={password} python ./run_trainer.py \
 --client_mode --averaging_expiration 10 --statistics_expiration 120 \
 --batch_size_lead 200 --per_device_train_batch_size {microbatch_size} --gradient_accumulation_steps 1 \
 --logging_first_step --logging_steps 100 --run_name {username}  --output_dir ./outputs --overwrite_output_dir --logging_dir ./logs \
 --experiment_prefix {experiment_name} --seed 42"""
run_with_logging(command, syslog_host, wandb_login=True)

Installing dependencies...
/content/collaborative-training
Running with device Tesla T4, local batch size = 4
Huggingface login: robot-bengali
Huggingface password: ··········
2021-05-07 09:19:00.512895: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Downloading:   0%|          | 0.00/685 [00:00<?, ?B/s]
Downloading: 100%|██████████| 685/685 [00:00<00:00, 1.00MB/s]
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 16,
  "num_hidden_groups": 1,


### What's up next?
* Check the training progress on public learning curves: https://wandb.ai/yhn112/Demo-run-2/runs/2tqiiq40
* Run a second GPU session with Kaggle notebooks: **TBA**
* See [this tutorial](https://github.com/learning-at-home/hivemind/tree/master/examples/albert) on how to start your own collaborative runs!


_Co-created by [yhn112](https://github.com/yhn112), [leshanbog](https://github.com/leshanbog), [foksly](https://github.com/foksly) and [borzunov](https://github.com/borzunov) from [hivemind](https://github.com/learning-at-home/hivemind) (YSDA), [lhoestq](https://github.com/lhoestq), [SaulLu](https://github.com/SaulLu) and [stas00](https://github.com/stas00) from [huggingface](http://huggingface.co)_.


### How it works

Since peers can join and leave at any time, we can't use global [Ring All-Reduce](https://towardsdatascience.com/visual-intuition-on-ring-allreduce-for-distributed-deep-learning-d1f34b4911da) for averaging: a single missing peer can break the entire protocol. Instead, peers dynamically assemble into small groups and run all-reduce within each group. Consider an example with 9 GPUs:

<center>
<img src="https://i.imgur.com/QcD1mfG.png" width=360px><br>
The All-Reduce protocol within a group can be Ring-AllReduce, but we use a simpler all-to-all algorithm known as butterfly-like all-reduce.<br>
<img src="https://i.imgur.com/ewq3vS6.png" width=380px><br>
After each successful round, participants shuffle around and find new groups:<br>
<img src="https://i.imgur.com/dexNCL3.png" width=350px>

If one of the peers fails to do its part, this only affects that peer's local group, and only for a single round.


<img src="https://i.imgur.com/RBmElUi.png" width=340px>

Afterwards, peers from the failed group will find new groupmates according to the [moshpit algorithm](https://arxiv.org/abs/2103.03239).

</center>
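
The group averaging described above can be imitated in a few lines of plain Python. This is only a toy sketch under stated assumptions: the function names `butterfly_all_reduce` and `averaging_round` are hypothetical and are not part of hivemind's API, there is no networking, and the fixed row/column grouping of 9 peers stands in for the dynamic group formation that real peers perform.

```python
from statistics import fmean


def butterfly_all_reduce(vectors):
    """Average equal-length vectors in an all-to-all fashion (toy version).

    Peer i is responsible for chunk i: it gathers that chunk from every
    groupmate, averages it, and sends the averaged chunk back to everyone.
    """
    n_peers, dim = len(vectors), len(vectors[0])
    # Peer `owner` is responsible for indices bounds[owner]:bounds[owner + 1].
    bounds = [dim * i // n_peers for i in range(n_peers + 1)]
    averaged = [0.0] * dim
    for owner in range(n_peers):
        for j in range(bounds[owner], bounds[owner + 1]):
            averaged[j] = fmean(vec[j] for vec in vectors)
    # After the exchange, every peer holds the same averaged vector.
    return [list(averaged) for _ in vectors]


def averaging_round(state, groups, failed_groups=()):
    """Run all-reduce inside each group; groups in `failed_groups` skip this round."""
    new_state = dict(state)
    for index, group in enumerate(groups):
        if index in failed_groups:
            continue  # members of a failed group keep their old values
        results = butterfly_all_reduce([state[peer] for peer in group])
        for peer, vector in zip(group, results):
            new_state[peer] = vector
    return new_state


# 9 peers arranged on a 3x3 grid; peer p starts with a toy "gradient".
state = {p: [float(p), 10.0 * p, 100.0 * p] for p in range(9)}
rows = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
columns = [[0, 3, 6], [1, 4, 7], [2, 5, 8]]

state = averaging_round(state, rows)     # round 1: average within rows
state = averaging_round(state, columns)  # round 2: regroup into columns
print(state[0])  # [4.0, 40.0, 400.0], the average over all 9 peers
```

In this toy setup, two rounds are enough for every peer to reach the global average. If one group skips a round (for example `failed_groups={1}`), its members simply keep their old values for that round, and the following rounds of regrouping bring all peers back to the same average.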


If you want to learn more and even host your own collaborative experiments, take a look at the [hivemind library](https://github.com/learning-at-home/hivemind/) or the [Moshpit-SGD paper](https://arxiv.org/pdf/2103.03239.pdf).


