# Prefix Language Modeling/Conditional Language Modeling: 
Prefix Language Modeling is a natural language processing technique in which the model generates text conditioned on a given context (the prefix). Because the model reads the prefix before it produces any output, the prefix acts as a guide that tells the model what kind of text to generate, which makes generation more controllable. In machine translation, the prefix is the source text in one language and the model generates the translation in the target language. Similarly, in summarization the prefix is the original long text and the model generates its summary. This method is particularly useful when the generated text must stay relevant to specific input data, since the output is tied to the prefix.

# Autoregressive Language Modeling:
Autoregressive Language Modeling generates a sequence of text one token at a time, with each token predicted from the tokens that precede it. The probability of the whole sequence is factored into a product of per-token conditional probabilities, which makes generation inherently sequential. Models like GPT are prime examples of autoregressive language models and are used for a wide range of tasks such as text generation, conversation, and code generation. The autoregressive formulation lets these models produce coherent, contextually relevant text over long passages, but generation is conditioned only on the token sequence itself (any prompt plus the text generated so far), not on a separately supplied input.
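The chain-rule factorization described above can be sketched with a toy model. This is only an illustration: the `BIGRAM` table and its probabilities are made up, and a bigram model conditions on just the previous token, whereas a model like GPT conditions on the full history.

```python
import math

# Hypothetical toy bigram model: P(next token | previous token).
# The probabilities are invented purely for illustration.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.7, "dog": 0.3},
    "cat": {"sleeps": 0.9, "</s>": 0.1},
    "dog": {"sleeps": 0.2, "</s>": 0.8},
    "sleeps": {"</s>": 1.0},
}

def sequence_log_prob(tokens):
    """Chain rule: log P(x_1..x_T) = sum over t of log P(x_t | x_{t-1})."""
    log_prob = 0.0
    previous = "<s>"
    for token in tokens:
        log_prob += math.log(BIGRAM[previous][token])
        previous = token
    return log_prob

def greedy_generate(max_tokens=10):
    """Generate one token at a time, always taking the most likely next token."""
    tokens, previous = [], "<s>"
    for _ in range(max_tokens):
        next_token = max(BIGRAM[previous], key=BIGRAM[previous].get)
        if next_token == "</s>":  # stop at the end-of-sequence marker
            break
        tokens.append(next_token)
        previous = next_token
    return tokens

generated = greedy_generate()
log_p = sequence_log_prob(generated + ["</s>"])
print(generated, math.exp(log_p))
```

Each generated token becomes part of the context for the next prediction, which is why generation cannot be parallelized across output positions.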

### Distinctions:
1. Prefix Language Modeling gives more control over generation by conditioning the model on a given context or prefix, which makes it suitable for tasks whose outputs must be tailored to specific inputs. Autoregressive Language Modeling generates text in a more free-form manner, predicting each token from the previously generated ones without a separately supplied input.

2. Prefix Language Modeling is particularly useful for tasks that generate text from specific conditions or contexts, such as translation, summarization, or controlled content creation. Autoregressive Language Modeling is widely used for open-ended generation, where the primary goal is coherent, contextually relevant text that continues the sequence so far.

3. Both approaches predict the likelihood of subsequent tokens in a sequence. Prefix Language Modeling explicitly incorporates an external condition or context into that prediction, whereas Autoregressive Language Modeling relies solely on the tokens generated up to the current point.
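One common way to realize this distinction inside a single decoder stack is through the attention mask: a purely autoregressive model uses a causal mask, while a prefix language model lets every position see the whole prefix and keeps only the continuation causal. The sketch below is illustrative; the function names are hypothetical, and the T5 model used later in this notebook takes a different route, with a separate encoder for the input and a causal decoder for the output.

```python
def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

def prefix_lm_mask(prefix_length, total_length):
    """Prefix positions are visible to every position; the rest stays causal."""
    return [
        [j < prefix_length or j <= i for j in range(total_length)]
        for i in range(total_length)
    ]

def show(mask):
    """Render a mask as text: 'x' = may attend, '.' = blocked."""
    return "\n".join("".join("x" if cell else "." for cell in row) for row in mask)

print(show(causal_mask(4)))
# x...
# xx..
# xxx.
# xxxx
print(show(prefix_lm_mask(prefix_length=2, total_length=4)))
# xx..
# xx..
# xxx.
# xxxx
```

The only difference is in the first rows: with a prefix of length 2, position 0 can already see position 1, so the prefix is read bidirectionally.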

In [2]:
from datasets import load_dataset
dataset = load_dataset("xsum")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

In [4]:
dataset['train']

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 204045
})

In [5]:
dir(dataset['train'])

['_TF_DATASET_REFS',
 '__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getitems__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_build_local_temp_path',
 '_check_index_is_initialized',
 '_data',
 '_estimate_nbytes',
 '_fingerprint',
 '_format_columns',
 '_format_kwargs',
 '_format_type',
 '_generate_tables_from_cache_file',
 '_generate_tables_from_shards',
 '_get_cache_file_path',
 '_get_output_signature',
 '_getitem',
 '_indexes',
 '_indices',
 '_info',
 '_map_single',
 '_new_dataset_with_indices',
 '_output_all_columns',
 '_push_parquet_shards_to_hub',
 '_save_to_disk_single',
 '_select_contiguo

In [6]:
type(dataset['train'])

datasets.arrow_dataset.Dataset

In [7]:
dataset['train'].data

MemoryMappedTable
document: string
summary: string
id: string
----
document: [["The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.
Repair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.
Trains on the west coast mainline face disruption due to damage at the Lamington Viaduct.
Many businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.
First Minister Nicola Sturgeon visited the area to inspect the damage.
The waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.
Jeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.
However, she said more preventative work could have been carried out to ensure the retaining wall did not fail.
"It is difficult but I do think there is so much pu

In [8]:
dataset['train'].features

{'document': Value(dtype='string', id=None),
 'summary': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None)}

In [9]:
from transformers import T5Tokenizer
checkpoint='t5-small'
tokenizer = T5Tokenizer.from_pretrained(checkpoint)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [10]:
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})

In [11]:
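def preprocess(examples):
    # Prepend the task prefix so T5 knows which task to perform.
    inputs = ["summarize: " + doc for doc in examples["document"]]
    # Return plain lists (no return_tensors); Dataset.map stores them as Arrow columns anyway.
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
    # text_target replaces the deprecated as_target_tokenizer() context manager.
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True, padding="max_length")
    # Note: padded label positions keep the pad token id, so they count toward the loss.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs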
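The `preprocess` function above pads every label sequence to 128 tokens with the pad token, so those padded positions are scored by the loss like real summary tokens. A common alternative is to replace pad ids in the labels with `-100`, the value PyTorch's cross-entropy loss ignores by default. The helper below is a minimal sketch on plain Python lists; `mask_padding` is a hypothetical name, and the pad id of `0` is an assumption that should be checked against `tokenizer.pad_token_id`.

```python
PAD_TOKEN_ID = 0     # assumption: T5's pad token id; verify with tokenizer.pad_token_id
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace pad ids in each label sequence so the loss skips those positions."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in sequence]
        for sequence in label_ids
    ]

# Made-up token ids: two real tokens followed by two padding positions.
example_labels = [[71, 1, 0, 0], [5, 9, 1, 0]]
masked_labels = mask_padding(example_labels)
print(masked_labels)
```

Applying this inside `preprocess` would change the reported loss and perplexity, because they would then be averaged over real summary tokens only.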

In [12]:
tokenized_dataset = dataset.map(preprocess, batched=True, remove_columns=["document", "summary", "id"])
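With `batched=True`, the mapped function receives a dictionary of columns (each a list of values) rather than one row at a time, which is why `preprocess` iterates over `examples["document"]`. The pure-Python stand-in below only illustrates that calling convention; `toy_batched_map` and `add_prefix` are hypothetical names and this is not how `datasets` implements `map`.

```python
def toy_batched_map(rows, function, batch_size=2):
    """Illustrative stand-in: call `function` on column-oriented batches of rows."""
    output_rows = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = function(batch)
        # Turn the returned columns back into rows, keeping only the new columns.
        row_count = len(next(iter(result.values())))
        for i in range(row_count):
            output_rows.append({key: values[i] for key, values in result.items()})
    return output_rows

def add_prefix(batch):
    # Same pattern as the first line of preprocess: one output per input document.
    return {"input_text": ["summarize: " + doc for doc in batch["document"]]}

rows = [
    {"document": "first article", "summary": "s1"},
    {"document": "second article", "summary": "s2"},
    {"document": "third article", "summary": "s3"},
]
mapped = toy_batched_map(rows, add_prefix, batch_size=2)
print(mapped)
```

Only the returned columns survive in this sketch, which mirrors the effect of passing `remove_columns=["document", "summary", "id"]` in the cell above.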

In [13]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 11334
    })
})

In [14]:
small_train_dataset = tokenized_dataset["train"].select(range(int(0.1 * len(tokenized_dataset["train"]))))
small_eval_dataset = tokenized_dataset["validation"].select(range(int(0.1 * len(tokenized_dataset["validation"]))))

In [15]:
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=500,
)
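The `warmup_steps=500` setting means the learning rate ramps up over the first 500 optimizer steps before decaying. The sketch below reproduces the shape of a linear warmup followed by linear decay. It rests on assumptions not stated in this notebook: a base learning rate of `5e-5` and a linear scheduler (the `Trainer` defaults in the versions I am aware of), and a total of 5101 steps taken from the training output further down.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=5101):
    """Learning rate at `step`: linear warmup to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining

# Inspect the schedule at a few points of the run.
for step in (0, 250, 500, 2800, 5101):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate peaks exactly at step 500, the first logging step.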


In [16]:
from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

In [17]:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)

In [18]:
# Train the model
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mvishwas-mishra1234[0m. Use [1m`wandb login --relogin`[0m to force relogin


| Step | Training Loss |
|------|---------------|
| 500  | 4.3777 |
| 1000 | 0.7089 |
| 1500 | 0.6926 |
| 2000 | 0.6884 |
| 2500 | 0.6876 |
| 3000 | 0.6749 |
| 3500 | 0.6813 |
| 4000 | 0.6668 |
| 4500 | 0.6719 |
| 5000 | 0.6779 |


Checkpoint destination directory ./results/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.

TrainOutput(global_step=5101, training_loss=1.0450513892443922, metrics={'train_runtime': 6560.1504, 'train_samples_per_second': 3.11, 'train_steps_per_second': 0.778, 'total_flos': 2761514117234688.0, 'train_loss': 1.0450513892443922, 'epoch': 1.0})
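The numbers in this output can be cross-checked against the earlier cells. The arithmetic below assumes a single device and no gradient accumulation, neither of which is stated explicitly in the notebook; the row count, batch size, and runtime are copied from the outputs above.

```python
import math

train_rows = 204045       # size of the full XSum training split
batch_size = 4            # per_device_train_batch_size
train_runtime = 6560.1504 # seconds, from TrainOutput

# 10% subset selected in cell 14.
subset_rows = int(0.1 * train_rows)

# One epoch over the subset, one optimizer step per batch.
steps_per_epoch = math.ceil(subset_rows / batch_size)

samples_per_second = subset_rows / train_runtime
steps_per_second = steps_per_epoch / train_runtime

print(subset_rows, steps_per_epoch)
print(round(samples_per_second, 2), round(steps_per_second, 3))
```

The step count matches `global_step=5101`, and the throughput figures agree with `train_samples_per_second` and `train_steps_per_second`.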

In [19]:
# Save the model
model.save_pretrained("model")

In [20]:
eval_results = trainer.evaluate()

In [21]:
import torch
print(f"Perplexity: {torch.exp(torch.tensor(eval_results['eval_loss']))}")

Perplexity: 1.8298892974853516
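Perplexity here is the exponential of the mean per-token cross-entropy, so it depends on which token positions enter the average. Because the labels in this notebook are padded to 128 tokens without masking, easy-to-predict padding positions are likely included, which would pull the value down. The toy calculation below uses made-up per-token losses to show the effect; it is not a re-evaluation of the model.

```python
import math

def perplexity(token_losses, keep_mask=None):
    """exp(mean loss) over the positions where keep_mask is True (all by default)."""
    if keep_mask is None:
        keep_mask = [True] * len(token_losses)
    kept = [loss for loss, keep in zip(token_losses, keep_mask) if keep]
    return math.exp(sum(kept) / len(kept))

# Made-up losses: two real summary tokens, then two padding positions
# that the model predicts almost perfectly.
losses = [2.0, 1.0, 0.01, 0.01]
is_real_token = [True, True, False, False]

ppl_with_padding = perplexity(losses)
ppl_real_only = perplexity(losses, is_real_token)
print(ppl_with_padding, ppl_real_only)
```

For a number that reflects summary quality, the average would need to exclude padded label positions.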
