This notebook provides an easy way to train the model and replicate the experimental results.

First, let us set up the environment. This consists of two steps:
- cloning the git repo and installing it as the `parsing_by_maxseminfo` package
- collecting the configuration files and data.
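
Once the setup cells have run, a quick check can confirm that the packages are visible to the Python interpreter. The following is a minimal sketch, not part of the original notebook; the helper name `check_environment` is hypothetical, and it only tests whether top-level modules can be located, not whether their versions match.

```python
import importlib.util


def check_environment(module_names):
    """Return {module_name: bool}, True when the top-level module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    # Top-level modules the notebook imports later on.
    required = ["parsing_by_maxseminfo", "lightning", "torch"]
    for name, found in check_environment(required).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

If a module is reported as missing right after installation, restarting the session before re-running the check is usually the first thing to try.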


In [1]:
!git clone https://github.com/junjiechen-chris/Improving-Unsupervised-Constituency-Parsing-via-Maximizing-Semantic-Information.git
!pip install lightning==2.4.0
!pip3 install torch==2.5 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install -e Improving-Unsupervised-Constituency-Parsing-via-Maximizing-Semantic-Information
!cp -r Improving-Unsupervised-Constituency-Parsing-via-Maximizing-Semantic-Information/config config

fatal: destination path 'Improving-Unsupervised-Constituency-Parsing-via-Maximizing-Semantic-Information' already exists and is not an empty directory.
Looking in indexes: https://download.pytorch.org/whl/cu118
Obtaining file:///content/Improving-Unsupervised-Constituency-Parsing-via-Maximizing-Semantic-Information
  Preparing metadata (setup.py) ... done
Installing collected packages: parsing_by_maxseminfo
  Attempting uninstall: parsing_by_maxseminfo
    Found existing installation: parsing_by_maxseminfo 0.1.0
    Uninstalling parsing_by_maxseminfo-0.1.0:
      Successfully uninstalled parsing_by_maxseminfo-0.1.0
  Running setup.py develop for parsing_by_maxseminfo
Successfully installed parsing_by_maxseminfo-0.1.0


In [2]:
!mkdir -p data
!wget https://huggingface.co/datasets/HarpySeal/Improving-Unsupervised-Constituency-Parsing-via-Maximizing-Semantic-Information/resolve/main/english.zip
!unzip -o english.zip -d data/english

--2025-05-05 05:40:55--  https://huggingface.co/datasets/HarpySeal/Improving-Unsupervised-Constituency-Parsing-via-Maximizing-Semantic-Information/resolve/main/english.zip
Resolving huggingface.co (huggingface.co)... 13.35.202.40, 13.35.202.97, 13.35.202.34, ...
Connecting to huggingface.co (huggingface.co)|13.35.202.40|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: /datasets/junjiechen-chris/Improving-Unsupervised-Constituency-Parsing-via-Maximizing-Semantic-Information/resolve/main/english.zip [following]
--2025-05-05 05:40:55--  https://huggingface.co/datasets/junjiechen-chris/Improving-Unsupervised-Constituency-Parsing-via-Maximizing-Semantic-Information/resolve/main/english.zip
Reusing existing connection to huggingface.co:443.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.hf.co/repos/5f/45/5f45f36a47dc93e2ae374281cc1f2a9c8f524a4ceae687c58a8e1d107380d718/becd9d0fd83da283c6695c5c443efc7ad7b14d4b56ffc7f7

Restart the session (runtime) here before running the cells below.

In [4]:
import sys
from parsing_by_maxseminfo import parser

# necessary to use prepackaged data
sys.modules['parser'] = parser
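
The alias above is presumably needed because the prepackaged data are pickle files, and pickle stores the module path of every class it serializes. The sketch below illustrates that mechanism in isolation, under the assumption that the data were pickled when the classes lived in a top-level module; the names `legacy_parser_demo` and `Span` are hypothetical and not taken from the repository.

```python
import pickle
import sys
import types

# Hypothetical names for illustration only; they are not part of the repository.
OLD_NAME = "legacy_parser_demo"


class Span:
    def __init__(self, start, end):
        self.start, self.end = start, end


# Pretend the class was originally defined in a top-level module called OLD_NAME.
Span.__module__ = OLD_NAME
Span.__qualname__ = "Span"

# The module now "lives" somewhere else (e.g. inside a package).
relocated = types.ModuleType("newpkg." + OLD_NAME)
relocated.Span = Span


def dump_under_old_name(obj):
    """Pickle obj while the old module name is importable, then remove the name again."""
    sys.modules[OLD_NAME] = relocated
    try:
        return pickle.dumps(obj)
    finally:
        del sys.modules[OLD_NAME]


payload = dump_under_old_name(Span(0, 3))

# Without the alias, unpickling cannot import the old module name.
try:
    pickle.loads(payload)
    loaded_without_alias = True
except ModuleNotFoundError:
    loaded_without_alias = False

# Registering the alias in sys.modules makes the old path resolvable again.
sys.modules[OLD_NAME] = relocated
restored = pickle.loads(payload)
print(loaded_without_alias, (restored.start, restored.end))
```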

In [5]:
import argparse
import os
import random
import sys

import lightning
import numpy as np
import torch
import yaml
from easydict import EasyDict as edict
from transformers import TrainingArguments


This block specifies the configuration for the experiment. We focus on the following parameters:
- training_mode: controls how the PCFG model is trained. The `rl` mode is the SemInfo-maximization training explained in the main text.
- f_config: the path to the configuration file. The block below uses the configuration for the English experiments.


In [6]:

"""
Training mode can be selected from one of the below
rl: SemInfo mean-baseline training with CRF as explained in the main text
nll: LL training as explained in the main text
a2c: Stepwise SemInfo training with CRF as explained in Appendix A1
a2c_v0: Posterior V0 training with CRF
ta2c_rules: Stepwise SemInfo training with PCFG
ta2c: Posterior V0 training with PCFG
tavg: Posterior mean-baseline training with PCFG

"""
training_mode = "rl"
assert training_mode in ["rl", "nll", "a2c", "ta2c", "ta2c_rules", "a2c_v0", "tavg"]


f_config = "config/pas-grammar/english-ew-reward-tbtok-idf/npcfg_nt60_t120_en.spacy-10k-merged-0pas-fast-6-3-rlstart0.yaml"
input_args = [
    f"-c={f_config}",
    "--max_length=40",
    f"--set_training_mode={training_mode}",
    "--set_min_span_reward=-4",
    "--unset_ptb_mode",
    # No escaped quotes here: they would become part of the directory name.
    "--ckpt_dir=./checkpoints/",
    "--set_mode_reward=log_tfidf",
    "--set_include_unary",
    # "--use_pcfg_samples",  # uncomment this option when using ta2c, ta2c_rules, or tavg
    "--langstr=english",
    "--unset_bert_mode",
]


# %%
from parsing_by_maxseminfo.utils.myargparse import get_argsndevice

# import train
args, device = get_argsndevice(input_args)

print("continue training from", args.continue_from)


Namespace(conf='config/pas-grammar/english-ew-reward-tbtok-idf/npcfg_nt60_t120_en.spacy-10k-merged-0pas-fast-6-3-rlstart0.yaml', use_tf32=False, rank=0, ngpu=1, max_pasdata_lendiff=-1, max_bandwidth=4, pas_subsample_count=-1, alignment_coefficient=-1, adversarial_coefficient=-1, batch_size=-1, debug=False, flag_use_spacy_preprocessing=False, langstr='english', use_ppl_loss=False, span_repr_mode='bge-m3', remark='none', forbid_bn=False, forbid_merged_nll=False, use_normalized_term_mlp=False, use_onesided=False, dropout=-1, force_fp32=False, max_length=40, wandb_tags=None, val_check_interval=5000, ckpt_dir='"./checkpoints/"', flag_curriculum_learning=False, flag_use_separate_nll_path=False, unset_logppl=False, unset_nll_weighing=False, set_fast_model=False, set_mode_offending_spans=False, ckpt=None, unset_renormalizing_marginals=False, set_lr=-1, unset_bert_mode=True, set_pas_suppression=False, wandb_project='Spanoverlap-PCFG', set_hit_count_threshold=-1, preprocessing_pas_subsample_coun
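
The printed configuration suggests that the `--set_*` flags overwrite entries of the `experimental` section loaded from the YAML file. How `get_argsndevice` does this internally is not shown here, so the following is only a minimal sketch of the general pattern under that assumption; the function `apply_overrides` and the flag-to-key mapping are hypothetical.

```python
import argparse
import copy


def apply_overrides(config, argv):
    """Overlay a few command-line style overrides onto a nested config dict."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--set_training_mode", default=None)
    parser.add_argument("--set_min_span_reward", type=float, default=None)
    parser.add_argument("--set_include_unary", action="store_true")
    opts = parser.parse_args(argv)

    # Work on a copy so the config loaded from file stays untouched.
    merged = copy.deepcopy(config)
    experimental = merged.setdefault("experimental", {})
    if opts.set_training_mode is not None:
        experimental["mode"] = opts.set_training_mode
    if opts.set_min_span_reward is not None:
        experimental["min_span_reward"] = opts.set_min_span_reward
    if opts.set_include_unary:
        experimental["include_unary"] = True
    return merged


base = {"experimental": {"mode": "nll"}}
merged = apply_overrides(
    base,
    ["--set_training_mode=rl", "--set_min_span_reward=-4", "--set_include_unary"],
)
print(merged)
```

Only flags that are actually passed change the config, so values from the YAML file act as defaults.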

In [7]:
args

{'device': 0,
 'save_dir': 'log',
 'data': {'train_file': 'data/english/ptb_en-full.gd_instruction.batch.gpt4omini-ew-exp-tbtok-idf/train.pickle',
  'val_file': 'data/english/ptb_en-full.gd_instruction.batch.gpt4omini-ew-exp-tbtok-idf/val.pickle',
  'test_file': 'data/english/ptb_en-full.gd_instruction.batch.gpt4omini-ew-exp-tbtok-idf/test.pickle',
  'vocab_type': 'max_size',
  'vocab_size': 10000,
  'min_freq': 2,
  'language': 'english'},
 'model': {'model_name': 'NPCFGA2C-FixedCostReward',
  'NT': 60,
  'T': 120,
  's_dim': 512,
  'use_fast_pcfg': True,
  'use_bn': False,
  'use_normalized_term_mlp': False,
  'bert_mode': 'disabled'},
 'experimental': {'alignment_coefficient': 1.0,
  'adversarial_coefficient': 0.0,
  'pas_subsample_count': 0,
  'renormalizing_marginals': False,
  'weigh_nll_loss': True,
  'suppress_pas_contrib': False,
  'flag_curriculum_learning': False,
  'mode': 'rl',
  'hit_count_threshold': 2,
  'activation_flood': 0.001,
  'mode_offending_spans': True,
  'span

In [8]:
from parsing_by_maxseminfo.parser.helper.pas_grammar_data_helper import (
    DataModuleForPASCtrlPCFGReward,
)


# %%
import importlib
from parsing_by_maxseminfo.parser.helper import pas_grammar_data_helper

derivative = args.model.model_name.split("-")[1]
dst = DataModuleForPASCtrlPCFGReward(
        hparams=args,
        langstr=args.langstr,
        use_cache=True,
        max_size=10000,
        merge_pas_data=False,
        pas_subsample=args.preprocessing_pas_subsample_count,
        flag_use_pos_unks=(
            args.experimental.flag_use_pos_unks
            if hasattr(args.experimental, "flag_use_pos_unks")
            else False
        ),
    )


word_vocab = dst.word_vocab
print("working on vocab of size", word_vocab.vocab_size)

basemodel = args.model.model_name.split("-")[0]
from parsing_by_maxseminfo.parser.lightning_wrapper.LitNPCFG import (
    LitXNPCFGFCReward,
)

print(f"launching {args.model.model_name.split('-')}")

if basemodel in ["SNPCFG", "TNPCFG", "NPCFG", "CPCFG", "SCPCFG", "SNPCFGA2C", "NPCFGA2C", "CPCFGA2C"]:
    # raise NotImplementedError("No plan for TNPCFG experiments so far")
    derivative = args.model.model_name.split("-")[1]
    print(f"launching {basemodel} {derivative}")
    model = LitXNPCFGFCReward(
            basemodel,
            args.model,
            word_vocab.vocab_size,
            args.experimental,
            args.optimizer,
            args.langstr,
        )

    # model = TNPCFGFixedCost(args.model, word_vocab.vocab_size, span_repr_mode="em", langstr = 'german').to(device)
else:
    raise NotImplementedError(f"{args.model.model_name} is not allowed")


english
loading from  /content/data/english/ptb_en-full.gd_instruction.batch.gpt4omini-ew-exp-tbtok-idf
Preparing datasets with 10000 PAS samples
working on vocab of size 10020
launching ['NPCFGA2C', 'FixedCostReward']
launching NPCFGA2C FixedCostReward
Constructing LitNPCFGFixedCost with experimental config {'alignment_coefficient': 1.0, 'adversarial_coefficient': 0.0, 'pas_subsample_count': 0, 'renormalizing_marginals': False, 'weigh_nll_loss': True, 'suppress_pas_contrib': False, 'flag_curriculum_learning': False, 'mode': 'rl', 'hit_count_threshold': 2, 'activation_flood': 0.001, 'mode_offending_spans': True, 'spancomp_loss_weight': 4.0, 'rl_warmup_steps': 5000, 'rl_start_step': 0, 'rl_initial_coeff': 0.0, 'rl_target_coeff': 1.0, 'rl_len_norm': False, 'apply_mean_baseline': True, 'maxent_initial_coeff': -0.01, 'maxent_target_coeff': -0.01, 'mode_reward': 'log_tfidf', 'min_span_reward': -4.0, 'include_unary': True, 'supervised_mode': False, 'sample_mode': 'crf'}


In [9]:

from lightning.pytorch.loggers import TensorBoardLogger


import lightning.pytorch as L
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from lightning.pytorch.callbacks.progress import TQDMProgressBar

# Setup early stopping
early_stop_callback = EarlyStopping(
    monitor="val/sentence_f1",  # Metric to monitor
    min_delta=0.002,  # Minimum change to qualify as an improvement
    patience=args.train.patience,  # Number of validation checks with no improvement before training stops
    verbose=True,
    mode="max",  # Maximize the monitored metric (sentence-level F1)
)
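
To make the roles of `min_delta` and `patience` concrete, here is a minimal sketch of the stopping rule in `max` mode. It is a simplified model of the callback, not Lightning's implementation, and the function name `early_stop_index` is hypothetical. Note that the trainer's `min_steps` setting can additionally delay the stop.

```python
def early_stop_index(scores, min_delta, patience):
    """Return the index of the validation check that triggers the stop, or None.

    A score counts as an improvement only if it beats the best score so far
    by more than min_delta. After `patience` consecutive checks without an
    improvement, training would stop.
    """
    best = float("-inf")
    wait = 0
    for i, score in enumerate(scores):
        if score - min_delta > best:
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return i
    return None


# Small gains below min_delta do not reset the counter.
print(early_stop_index([0.409, 0.476, 0.477, 0.4775, 0.478], min_delta=0.002, patience=2))
```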

In [10]:
train_dl, _ = dst.train_dataloader(
        # "english",
        # args.langstr,
        "null",
        max_len=40,
        min_len=3,
        device=device,
        pas_subsample_count=args.experimental.pas_subsample_count,
        flag_curriculum_learning=(
            args.experimental.flag_curriculum_learning
            if hasattr(args.experimental, "flag_curriculum_learning")
            else False
        ),
        add_sentence_level_span=(
            args.experimental.add_sentence_level_span
            if hasattr(args.experimental, "add_sentence_level_span")
            else False
        ),
        min_span_reward=args.experimental.min_span_reward,  # min span reward must be specified
        mode_reward=(
            args.experimental.mode_reward
            if hasattr(args.experimental, "mode_reward")
            else "none"
        ),
        supervised_mode=(
            args.experimental.supervised_mode
            if hasattr(args.experimental, "supervised_mode")
            else False
        ),
    )

val_dl, _ = dst.dev_full_dataloader(
    args.langstr,
    max_len=100000,
    min_len=2,
    device=device,
    min_span_reward=args.experimental.min_span_reward,
    mode_reward=(
        args.experimental.mode_reward
        if hasattr(args.experimental, "mode_reward")
        else "none"
    ),
)

test_dl, _ = dst.test_dataloader(
    args.langstr,
    max_len=1000000,
    min_len=2,
    device=device,

)

train loader: add_sentence_level_span: False
train loader: reward mode: log_tfidf
dev full loader: add_sentence_level_span: False
train loader: reward mode: log_tfidf
finished pruning dataset, current dataset length 1690
sampling: current dataset size:  1690
Train Iter: add_sentence_level_span False
sampling: current dataset size:  2412


In [11]:
from lightning.pytorch.callbacks.model_checkpoint import ModelCheckpoint

best_sf1_checkpoint_callback = ModelCheckpoint(
    save_top_k=4,
    monitor="val/sentence_f1",

    mode="max",
    dirpath=args.ckpt_dir,
    filename="ckpt-sf1_{val/sentence_f1:.2f}",
)
saveall_checkpoint_callback = ModelCheckpoint(
    save_top_k=-1,
    dirpath=args.ckpt_dir,
    filename="ckpt-step_{step}",
)
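
One detail worth knowing about the filename templates: `ModelCheckpoint` has an `auto_insert_metric_name` option (on by default) that writes the metric name into the filename, and Lightning's documentation suggests turning it off when metric names contain `/`, since the slash otherwise produces an extra folder. The sketch below is a simplified approximation of that template expansion, written for illustration; the function `resolve_checkpoint_name` is hypothetical and Lightning's actual logic may differ in detail.

```python
import re


def resolve_checkpoint_name(template, metrics, auto_insert_metric_name=True):
    """Simplified stand-in for how a filename template is filled with metric values."""
    # Collect the "{name" prefix of every placeholder, e.g. "{val/sentence_f1".
    groups = sorted(set(re.findall(r"(\{.*?)[:\}]", template)), key=len, reverse=True)
    for group in groups:
        name = group[1:]
        if auto_insert_metric_name:
            # "{name" -> "name={name", so the metric name appears in the filename.
            template = template.replace(group, name + "={" + name)
        # "{name" -> "{0[name]" so that str.format looks the value up in the dict.
        template = template.replace(group, "{0[" + name + "]")
    return template.format(metrics)


metrics = {"val/sentence_f1": 0.65, "step": 500}
with_name = resolve_checkpoint_name("ckpt-sf1_{val/sentence_f1:.2f}", metrics)
without_name = resolve_checkpoint_name(
    "ckpt-sf1_{val/sentence_f1:.2f}", metrics, auto_insert_metric_name=False
)
print(with_name)     # contains a "/", which the filesystem treats as a folder separator
print(without_name)
```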



In [12]:
from parsing_by_maxseminfo.parser.lightning_wrapper.scheduler import WarmupScheduler

rl_coeff_scheduler = WarmupScheduler(
    warmup_steps=(
        args.experimental.rl_warmup_steps
        if hasattr(args.experimental, "rl_warmup_steps")
        else 10000
    ),
    coeff_name="rl_coeff",
    initial_coeff=(
        args.experimental.rl_initial_coeff
        if hasattr(args.experimental, "rl_initial_coeff")
        else 0.0
    ),
    start_step=(
        args.experimental.rl_start_step
        if hasattr(args.experimental, "rl_start_step")
        else 20000
    ),
    target_coeff=(
        args.experimental.rl_target_coeff
        if hasattr(args.experimental, "rl_target_coeff")
        else 0.3
    ),
)

maxent_scheduler = WarmupScheduler(
    warmup_steps=(
        args.experimental.maxent_warmup_steps
        if hasattr(args.experimental, "maxent_warmup_steps")
        else 1
    ),
    coeff_name="maxent_coeff",
    initial_coeff=(
        args.experimental.maxent_initial_coeff
        if hasattr(args.experimental, "maxent_initial_coeff")
        else 0.5
    ),
    start_step=(
        args.experimental.maxent_start_step
        if hasattr(args.experimental, "maxent_start_step")
        else 0.0
    ),
    target_coeff=(
        args.experimental.maxent_target_coeff
        if hasattr(args.experimental, "maxent_target_coeff")
        else 0.5
    ),
)
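
The two schedulers above ramp a loss coefficient from `initial_coeff` to `target_coeff`. The exact interpolation used by `WarmupScheduler` is not shown in this notebook, so the following sketch assumes a linear ramp that begins at `start_step` and lasts `warmup_steps` steps; the function `warmup_coeff` is hypothetical.

```python
def warmup_coeff(step, start_step, warmup_steps, initial_coeff, target_coeff):
    """Linearly interpolate a coefficient (assumed schedule, for illustration)."""
    if step <= start_step:
        return initial_coeff
    # Fraction of the warmup completed, capped at 1.0 once the ramp is over.
    progress = min(1.0, (step - start_step) / max(1, warmup_steps))
    return initial_coeff + progress * (target_coeff - initial_coeff)


# Values from the printed config: rl_start_step=0, rl_warmup_steps=5000,
# rl_initial_coeff=0.0, rl_target_coeff=1.0.
for step in (0, 2500, 5000, 10000):
    print(step, warmup_coeff(step, 0, 5000, 0.0, 1.0))
```

With the printed values `maxent_initial_coeff=-0.01` and `maxent_target_coeff=-0.01`, such a schedule would be constant regardless of the warmup length.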

This block starts the training. By default, training runs for at least 3k and at most 10k steps, with validation every 500 steps. It should finish in approximately one hour on a Colab T4 GPU.

In [13]:
max_steps = 10000
min_steps = 3000
val_check_interval = 500
assert not args.debug, "debug mode is not allowed in this version"
trainer = L.Trainer(
    max_steps=max_steps,
    min_steps=min_steps,
    min_epochs=0,
    val_check_interval=val_check_interval,
    check_val_every_n_epoch=None,
    gradient_clip_val=args.train.clip,
    gradient_clip_algorithm="norm",
    callbacks=[
        early_stop_callback,
        TQDMProgressBar(refresh_rate=10),
        best_sf1_checkpoint_callback if not args.analysis_mode and not args.corr_mode else saveall_checkpoint_callback,
        rl_coeff_scheduler,
        maxent_scheduler,
    ],
    logger=[],  # if not args.debug else None,
    # devices=[args.rank],
    inference_mode=False,
    log_every_n_steps=10,
    accelerator="gpu",
    devices=1,          # Number of GPUs to use
    # strategy="ddp"      # Use Distributed Data Parallel

)
# wandb_logger.watch(model, log_graph=False, log_freq=100)
trainer.fit(
    model,
    train_dataloaders=train_dl,
    val_dataloaders=val_dl,
    ckpt_path=args.continue_from,
)



INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name  | Type                         | Params | Mode 
---------------------------------------------------------------
0 | model | NeuralPCFGFixedCostRewardA2C | 24.5 M | train
---------------------------------------------------------------
24.5 M    Trainable params
0         Non-trainable params
24.5 M    Total params
98.046    Total estimated model params size (MB)
32        Modules in train mode
0         Modules in eval mode

Using Adam optimizer


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

total samples 8.0
dst_size 8
finished pruning dataset
Constructing shuffled batches with 50 epoches


100%|██████████| 1/1 [00:03<00:00,  3.10s/it]

Train Iter: add_sentence_level_span False





Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO: Metric val/sentence_f1 improved. New best score: 0.409


total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: Metric val/sentence_f1 improved by 0.067 >= min_delta = 0.002. New best score: 0.476


total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: Metric val/sentence_f1 improved by 0.062 >= min_delta = 0.002. New best score: 0.539


total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: Metric val/sentence_f1 improved by 0.015 >= min_delta = 0.002. New best score: 0.553


total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: Metric val/sentence_f1 improved by 0.041 >= min_delta = 0.002. New best score: 0.594


total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: Metric val/sentence_f1 improved by 0.007 >= min_delta = 0.002. New best score: 0.601


total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: Metric val/sentence_f1 improved by 0.010 >= min_delta = 0.002. New best score: 0.611


total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: Metric val/sentence_f1 improved by 0.004 >= min_delta = 0.002. New best score: 0.615


total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: Metric val/sentence_f1 improved by 0.005 >= min_delta = 0.002. New best score: 0.620


total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: Metric val/sentence_f1 improved by 0.009 >= min_delta = 0.002. New best score: 0.629


total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: Metric val/sentence_f1 improved by 0.005 >= min_delta = 0.002. New best score: 0.634


total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: Metric val/sentence_f1 improved by 0.008 >= min_delta = 0.002. New best score: 0.642


total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: Metric val/sentence_f1 improved by 0.009 >= min_delta = 0.002. New best score: 0.650


total samples 1690.0
dst_size 1690


Validation: |          | 0/? [00:00<?, ?it/s]

total samples 1690.0
dst_size 1690
finished pruning dataset
Constructing shuffled batches with 50 epoches



  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:03<00:00,  3.13s/it]


Train Iter: add_sentence_level_span False


Validation: |          | 0/? [00:00<?, ?it/s]

total samples 1690.0
dst_size 1690


INFO: `Trainer.fit` stopped: `max_steps=10000` reached.


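This block evaluates the model on the test set and prints the result. Because the model object is passed directly to `trainer.test` without a `ckpt_path`, the weights in memory at the end of training are evaluated, which are not necessarily those of the best checkpoint.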

In [15]:
print(
    "Training ends. The best model: \n",
    trainer.test(model, dataloaders=test_dl),
    file=sys.stderr,
)

INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: |          | 0/? [00:00<?, ?it/s]

total samples 2412.0


Training ends. The best model: 
 [{'test/corpus_f1': 0.6357101798057556, 'test/sentence_f1': 0.6312323212623596, 'test/avg_ll': -121.39409637451172, 'test/avg_ppl': 285.7364196777344}]
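
The result reports both `test/corpus_f1` and `test/sentence_f1`. In the unsupervised parsing literature these conventionally differ in how span matches are aggregated: corpus-level F1 pools the counts over all sentences, while sentence-level F1 averages per-sentence scores. The sketch below shows that distinction on unlabeled spans; it is an illustration under those conventions, and the repository's evaluation may apply extra rules (for example, discarding trivial spans) that are not modeled here. All function names are hypothetical.

```python
def f1_score(n_match, n_pred, n_gold):
    """F1 from match counts; returns 0.0 when precision or recall is undefined."""
    if n_pred == 0 or n_gold == 0:
        return 0.0
    precision = n_match / n_pred
    recall = n_match / n_gold
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def corpus_and_sentence_f1(pred_spans, gold_spans):
    """pred_spans / gold_spans: one set of (start, end) spans per sentence."""
    total_match = total_pred = total_gold = 0
    per_sentence = []
    for pred, gold in zip(pred_spans, gold_spans):
        n_match = len(pred & gold)
        total_match += n_match
        total_pred += len(pred)
        total_gold += len(gold)
        per_sentence.append(f1_score(n_match, len(pred), len(gold)))
    corpus = f1_score(total_match, total_pred, total_gold)
    sentence = sum(per_sentence) / len(per_sentence) if per_sentence else 0.0
    return corpus, sentence


gold = [{(0, 2), (2, 4)}, {(0, 1)}]
pred = [{(0, 2), (1, 4)}, {(0, 1)}]
corpus_f1, sentence_f1 = corpus_and_sentence_f1(pred, gold)
print(corpus_f1, sentence_f1)
```

In this toy example the short, fully correct second sentence raises the sentence-level average above the pooled corpus-level score, which is why the two numbers can differ on real data.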
