# Goal-oriented self-improvement for Automated Theorem Proving

## Method
We extend Minimo by introducing a goal, i.e. a set of problems the model should eventually solve. We do this by adding a loss term that pushes the model to sample conjectures from our goal set. A hyperparameter `alpha` controls the strength of this push: a higher value gives the `progress_loss` more weight and thus pushes the model more strongly towards the goal.
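
As a minimal sketch of how such a weighted objective could look, assume the two terms are simply added, with `alpha` scaling only the goal term. The exact combination used in the code base is not shown in this notebook, and `combined_loss` and its arguments are hypothetical names used for illustration.

```python
def combined_loss(train_loss: float, progress_loss: float, alpha: float) -> float:
    """Weighted sum of the usual training loss and the goal-directed term.

    Assumption: the terms are combined additively and `alpha` scales only
    the progress term, so alpha=0 recovers the baseline objective.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    return train_loss + alpha * progress_loss


# With alpha=0 the progress term is ignored entirely.
baseline = combined_loss(train_loss=2.0, progress_loss=5.0, alpha=0.0)
# With a larger alpha the progress term can dominate the objective.
goal_heavy = combined_loss(train_loss=2.0, progress_loss=5.0, alpha=1.0)
```

Under this assumption, a large `alpha` lets the goal term outweigh the training term, which is the trade-off the experiments in this notebook probe.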

In [13]:
# Let's move to the parent directory so it is easier to run the scripts
import os

if os.path.basename(os.getcwd()) == 'experiments':
    os.chdir('../')

!ls


environment	  launch    pyproject.toml	     slurm-1275910.out
experiments	  learning  README.md		     tutorial.md
FAQ.md		  LICENSE   redis_hostname_port.txt  wandb
goals		  logs	    setup.sh
install_redis.sh  outputs   slurm-1275909.out


In [None]:
# Let's start some workers so our tasks get executed in parallel.
# @franz you would need to start them on the same node
!sbatch --job-name=redis --cpus-per-task=10 --mem=50G --time=5:00:00 --wrap="./launch/start_redis.sh"

!sbatch --job-name=worker1 --cpus-per-task=4 --gres=gpu:1 --mem=50G --time=5:00:00 --wrap="./launch/start_worker.sh"
!sbatch --job-name=worker2 --cpus-per-task=4 --gres=gpu:1 --mem=50G --time=5:00:00 --wrap="./launch/start_worker.sh"
!sbatch --job-name=worker3 --cpus-per-task=4 --gres=gpu:1 --mem=50G --time=5:00:00 --wrap="./launch/start_worker.sh"

## Experiments

For now we use a very simple goal set. It consists of a single goal: proving the theorem

`a is a natural number: (0 + a) = a`

Or in Peano:

`[('a0 : nat) -> (= (+ z 'a0) 'a0)]`
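
For readers more familiar with Lean, the same statement can be written as below. This is only a cross-reference in Lean 4 syntax and is not part of the Peano goal file. In Lean, addition on `Nat` recurses on its second argument, so `0 + a = a` does not hold by unfolding alone and the core lemma `Nat.zero_add` is proved by induction.

```lean
-- The goal statement in Lean 4, closed with the core library lemma.
example (a : Nat) : 0 + a = a := Nat.zero_add a
```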

### 1. Overfit on the goal set
After implementing the goal-conditioning, let's run experiments with `alpha=0` and `alpha=1`. For `alpha=1` we expect the `progress_loss` to go down, i.e. the model overfits to the `final_goal` set, while the actual `train_loss` should not go down.

In [20]:
# start a job with alpha=1 to overfit on the goals
!sbatch --job-name=train_alpha_1 --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=1:00:00 --wrap="./launch/run_bootstrap_distributed.sh agent.max_mcts_nodes=100 agent.policy.alpha=1 agent.policy.total_iterations=2"

# start a job with alpha=0 as a baseline
!sbatch --job-name=train_alpha_0 --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=1:00:00 --wrap="./launch/run_bootstrap_distributed.sh agent.max_mcts_nodes=100 agent.policy.alpha=0 agent.policy.total_iterations=2"

Submitted batch job 1275940
Submitted batch job 1275941


Works! The `training_loss` diverges for `alpha=1`, while the `progress_loss` struggles to go below a certain threshold for `alpha=0`. As we expected, the `alpha=1` run overfits to sampling the final theorem and can't solve any conjectures in the second iteration. We need a better choice of alpha.

### 2. Try different values for alpha

We try different constant values for alpha in the hope that the training loss and progress loss both converge.
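
The sweep can also be generated programmatically. The sketch below only builds the command strings and submits nothing. `sweep_commands` is a hypothetical helper, and the resource flags are copied from the commands used in this notebook.

```python
def sweep_commands(alphas, theory="nat-add", time_limit="24:00:00"):
    """Build one sbatch command string per alpha value (nothing is submitted)."""
    commands = []
    for alpha in alphas:
        # Replace '.' and '-' so the tag matches the job naming style used here.
        tag = str(alpha).replace(".", "_").replace("-", "_")
        commands.append(
            f"sbatch --job-name=train_alpha_{tag} --cpus-per-task=4 --mem=50G "
            f"--gres=gpu:1,VRAM=12G --time={time_limit} "
            f'--wrap="python learning/bootstrap.py theory={theory} alpha={alpha}"'
        )
    return commands


# Small values are passed as strings to keep the exact spelling on the command line.
for cmd in sweep_commands([0.8, 0.6, 0.4, 0.2, "3e-4", "1e-3"]):
    print(cmd)
```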

In [3]:
# TODO this still runs on single GPU
!sbatch --job-name=train_alpha_0_8 --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=24:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=0.8"

!sbatch --job-name=train_alpha_0_6 --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=24:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=0.6"

!sbatch --job-name=train_alpha_0_4 --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=24:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=0.4"

!sbatch --job-name=train_alpha_0_2 --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=24:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=0.2"

!sbatch --job-name=train_alpha_0_3e_4 --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=24:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=3e-4"

!sbatch --job-name=train_alpha_0_1e_3 --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=24:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=1e-3"

Submitted batch job 1275099


This works as well. We also observe a negative correlation between alpha and the ratio of conjectures that the model is able to prove per iteration: the higher the alpha value, the fewer problems the model can solve. What if we let the model explore problems on its own for a while and only push it towards our goal at a later stage?

### 3. Alpha schedules

We implement fancy schedules to warm up alpha over several iterations. The options are:

`alpha_schedule = [ constant | linear | quadratic | cubic | cos ]`
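
To make the options concrete, here is a minimal sketch of what such warm-up schedules could compute. It assumes that alpha ramps from 0 to the configured maximum over the iterations and that `cos` means a half-cosine ramp. The actual implementation may differ, and `scheduled_alpha` and its parameters are hypothetical names.

```python
import math


def scheduled_alpha(max_alpha: float, iteration: int, total_iterations: int,
                    schedule: str = "constant") -> float:
    """Return alpha for a given iteration, warming up from 0 to `max_alpha`.

    Assumption: progress t runs from 0 (first iteration) to 1 (last one),
    and every non-constant schedule is a monotone ramp in t.
    """
    if schedule == "constant":
        return max_alpha
    # Fraction of training completed, clamped to [0, 1].
    t = iteration / max(total_iterations - 1, 1)
    t = min(max(t, 0.0), 1.0)
    if schedule == "linear":
        factor = t
    elif schedule == "quadratic":
        factor = t ** 2
    elif schedule == "cubic":
        factor = t ** 3
    elif schedule == "cos":
        # Half-cosine ramp: 0 at t=0, 1 at t=1, flat at both ends.
        factor = 0.5 * (1.0 - math.cos(math.pi * t))
    else:
        raise ValueError(f"unknown alpha_schedule: {schedule}")
    return max_alpha * factor


# Compare how quickly each schedule ramps up over five iterations.
for name in ["constant", "linear", "quadratic", "cubic", "cos"]:
    print(name, [round(scheduled_alpha(1.0, i, 5, name), 3) for i in range(5)])
```

Under these assumptions the cubic schedule keeps alpha close to zero for most of the run, which gives the model the longest phase of unconstrained exploration.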

#### 3.1 Warm up alpha to 1.0

As a first step we warm up alpha to 1.0. 

In [6]:
# TODO this still runs on single GPU
!sbatch --job-name=train_alpha_lin --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=10:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=1 alpha_schedule=linear"

!sbatch --job-name=train_alpha_quad --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=10:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=1 alpha_schedule=quadratic"

!sbatch --job-name=train_alpha_cubic --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=10:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=1 alpha_schedule=cubic"

!sbatch --job-name=train_alpha_cos --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=10:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=1 alpha_schedule=cos"

Submitted batch job 1274969


It is apparent that high values of alpha just overpower the actual loss. Ideally we want to keep alpha much smaller. Let's try warming it up to lower values.

#### 3.2 Warm up alpha to lower values


In [3]:
# TODO this still runs on single GPU
!sbatch --job-name=train_alpha_cubic --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=10:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=0.2 alpha_schedule=cubic"

!sbatch --job-name=train_alpha_cubic --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=10:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=0.3 alpha_schedule=cubic"

!sbatch --job-name=train_alpha_cubic --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=10:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=0.4 alpha_schedule=cubic"

Submitted batch job 1274557
Submitted batch job 1274558
Submitted batch job 1274559


### 4. Use the ratio of solved conjectures

Now comes the actually interesting part: a more principled schedule.
Let's use the ratio of solved conjectures to total sampled conjectures to directly control alpha.
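
A minimal sketch of such a ratio-based rule, assuming alpha scales linearly with the solved ratio up to the configured maximum. The precise mapping used by `alpha_schedule=ratio` is not shown in this notebook, and `ratio_alpha` and its arguments are hypothetical names.

```python
def ratio_alpha(max_alpha: float, num_solved: int, num_sampled: int) -> float:
    """Scale alpha by the fraction of sampled conjectures that were proved.

    Assumption: alpha grows linearly with the solved ratio, so the goal
    term only gains weight while the model can still prove what it samples.
    """
    if num_sampled <= 0:
        # No conjectures sampled yet: do not push towards the goal.
        return 0.0
    ratio = num_solved / num_sampled
    # Clamp in case the counts are inconsistent.
    return max_alpha * min(max(ratio, 0.0), 1.0)


# Example: 50 of 200 sampled conjectures proved, max_alpha=1.
alpha_now = ratio_alpha(max_alpha=1.0, num_solved=50, num_sampled=200)
```

Under this assumption the rule acts as a feedback loop: if pushing towards the goal lowers the solved ratio, alpha shrinks again in the next iteration.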

In [4]:
# TODO this still runs on single GPU
!sbatch --job-name=train_alpha_ratio --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=10:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=1 alpha_schedule=ratio"

!sbatch --job-name=train_alpha_ratio --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=10:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=0.2 alpha_schedule=ratio"

Submitted batch job 1274560
Submitted batch job 1274561


In [22]:
# start a job with alpha schedule ratio and max_alpha=1
!sbatch --job-name=train_ratio --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=10:00:00 --wrap="./launch/run_bootstrap_distributed.sh agent.policy.alpha=1 agent.policy.alpha_schedule=ratio goals=nat-add-hard"

# start a job with alpha schedule ratio and max_alpha=5e-3
!sbatch --job-name=train_ratio_small --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=10:00:00 --wrap="./launch/run_bootstrap_distributed.sh agent.policy.alpha=5e-3 agent.policy.alpha_schedule=ratio goals=nat-add-hard"

# start a job with alpha=0 as a baseline
# !sbatch --job-name=train_vanilla --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=10:00:00 --wrap="./launch/run_bootstrap_distributed.sh agent.policy.alpha=0 goals=nat-add-hard"

Submitted batch job 1275950
Submitted batch job 1275951


#### 4.1 Increase max iters

In [3]:
!sbatch --job-name=train_alpha_long --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=20:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=0 iterations=30"

!sbatch --job-name=train_alpha_long_ratio --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=20:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=1 alpha_schedule=ratio iterations=30"

!sbatch --job-name=train_alpha_long_cubic --cpus-per-task=4 --mem=50G --gres=gpu:1,VRAM=12G --time=20:00:00 --wrap="python learning/bootstrap.py theory=nat-add alpha=1 alpha_schedule=cubic iterations=30"

Submitted batch job 1275039
Submitted batch job 1275040
Submitted batch job 1275041

