<br /><h1> <font color="blue"> <b> Part III: Controls </b> </font></h1> <br />

# Politeness control

Some aspects of generation can be controlled with special tokens in the input. For instance, multi-domain models can be trained and queried with source-side domain tags (https://aclanthology.org/R17-1049).

Sennrich et al. (2016, https://aclanthology.org/N16-1005/) used such special tokens to control the politeness of the output.

We will implement this approach for English-French translation, to control the use of the pronouns "tu" and "vous", which are the informal and formal translations of "you", respectively.

We only need to partition the training data into formal and informal subsets by looking for occurrences of "vous" and "tu" on the target side. We then prepend a source-side control tag that reflects the politeness level of the target, and fine-tune the model on this tagged data.
At test time, we prepend the desired control tag to the source sentence, and the model should pick the corresponding level of politeness.


## Preparing the data

As we only rely on the "politeness control token," it is necessary to prepare distinctive polite and non-polite training samples from the corpus.

While many aspects of French grammar could be considered here, we start with a simple rule: we pick sentences that contain "tu" or "vous" (both meaning "you" in English) and label them as "non-polite" and "polite," respectively.

### Regular expressions

To extract the sentences that contain the words "tu" or "vous", we can use the following regular expressions:
```python
r'(^|\W)(vous)(\W|$)'
r'(^|\W)(tu)(\W|$)'
```
They match sentences that contain the corresponding word as a standalone token: the word must be preceded by the start of the line or a "non-word" character, and followed by a "non-word" character or the end of the line (e.g., whitespace or a hyphen).
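
As a quick sanity check, the two patterns can be tried on a few sentences. This is only a sketch: the example sentences and the names `FORMAL` and `INFORMAL` are made up for illustration and are not part of the notebook's code.

```python
import re

# Same patterns as above, compiled once with case-insensitive matching
FORMAL = re.compile(r'(^|\W)(vous)(\W|$)', re.IGNORECASE)
INFORMAL = re.compile(r'(^|\W)(tu)(\W|$)', re.IGNORECASE)

examples = [
    "Vous êtes en retard .",          # sentence-initial "Vous"
    "Peux-tu m' aider ?",             # "tu" after a hyphen
    "Il a tué le dragon .",           # "tu" inside a longer word: no match
    "J' ai un rendez-vous demain .",  # noun "rendez-vous": matched anyway
]

for sentence in examples:
    formal = bool(FORMAL.search(sentence))
    informal = bool(INFORMAL.search(sentence))
    print(f'{sentence!r}: formal={formal}, informal={informal}')
```

The last sentence shows a limitation of this heuristic: the hyphen counts as a non-word character, so the noun "rendez-vous" is detected as a formal pronoun.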

For more information on regexes, you can check out the following resources:
- https://www.regular-expressions.info/tutorial.html
- https://docs.python.org/3/library/re.html
- https://regex101.com/#python

In [None]:
import re

def is_formal(line):
    """
    Contains formal French translations of "you"
    """
    regex = r'(^|\W)(vous)(\W|$)'
    return bool(re.search(regex, line, re.IGNORECASE))

def is_informal(line):
    """
    Contains informal French translations of "you"
    """
    regex = r'(^|\W)(tu)(\W|$)'
    return bool(re.search(regex, line, re.IGNORECASE))

### Adding politeness control tags

Once we have identified a sentence pair as polite or non-polite, we can prepend the corresponding control tag to its source sentence.

In [None]:
def preprocess_formal(source_line, target_line=None, source_lang=None, target_lang=None):
    """
    Tokenizes the given line pair and prepends the <formal> source-side tag
    """
    source_line, target_line = preprocess(source_line, target_line)
    source_line = f'<formal> {source_line}'
    return source_line, target_line

def preprocess_informal(source_line, target_line=None, source_lang=None, target_lang=None):
    """
    Tokenizes the given line pair and prepends the <informal> source-side tag
    """
    source_line, target_line = preprocess(source_line, target_line)
    source_line = f'<informal> {source_line}'
    return source_line, target_line

def preprocess_formal_or_informal(source_line, target_line, source_lang=None, target_lang=None):
    """
    Preprocessing function for politeness control:
    - keep only line pairs whose target side has French formal or informal pronouns
    - prepend politeness control tags to the source side
    """
    if is_formal(target_line):
        return preprocess_formal(source_line, target_line)
    elif is_informal(target_line):
        return preprocess_informal(source_line, target_line)
    else:  # this line pair is neither formal nor informal
        # This example will be filtered out by load_dataset (uncomment below to keep it, without a control tag):
        # return preprocess(source_line, target_line)
        return None
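
To see the effect of this function without the rest of the pipeline, here is a minimal standalone sketch. It assumes a hypothetical helper `tag_pair` that replaces the notebook's BPE-based `preprocess` with plain lowercasing, and a made-up four-pair corpus. Returning `None` plays the same role as above: the pair is dropped.

```python
import re

FORMAL = r'(^|\W)(vous)(\W|$)'
INFORMAL = r'(^|\W)(tu)(\W|$)'

def tag_pair(source_line, target_line):
    """Hypothetical stand-in for preprocess_formal_or_informal (no BPE)."""
    if re.search(FORMAL, target_line, re.IGNORECASE):
        tag = '<formal>'
    elif re.search(INFORMAL, target_line, re.IGNORECASE):
        tag = '<informal>'
    else:
        return None  # neither pronoun on the target side: drop the pair
    return f'{tag} {source_line.lower()}', target_line.lower()

# Made-up examples, for illustration only
corpus = [
    ("Are you coming ?", "Venez-vous ?"),
    ("Are you coming ?", "Tu viens ?"),
    ("He is coming .", "Il vient ."),
    ("You said you would .", "Tu as dit que vous le feriez ."),
]

# Keep only the pairs that received a tag
tagged = [pair for pair in (tag_pair(s, t) for s, t in corpus) if pair is not None]
for source, target in tagged:
    print(source, '|||', target)
```

Because the formal pattern is tested first, a target that contains both "tu" and "vous" (the last pair) is tagged `<formal>`. The same holds for `preprocess_formal_or_informal` above.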

### Filtering and loading the dataset

Finally, we can filter and load the dataset by passing the `preprocess_formal_or_informal` function to `load_dataset`.
This will keep only the line pairs that contain formal or informal pronouns and preprocess the sources to add control tags.

In [None]:
# Use the same dataset as before
train_path = os.path.join(data_dir, 'train.en-fr')
valid_path = os.path.join(data_dir, 'valid.en-fr')

# But preprocess it to keep only line pairs that use tu/vous pronouns and to prepend control tags
train_data = load_dataset(
    train_path, 'en', 'fr',
    preprocess=preprocess_formal_or_informal,
)

valid_data = load_dataset(
    valid_path, 'en', 'fr',
    preprocess=preprocess_formal_or_informal,
)

## Setting up for training

Since we are introducing new vocabulary entries (the control tokens), we need to add them to the pretrained model's existing source vocabulary.

Here, we overwrite the last two entries of the vocabulary, which hold its least frequent tokens, so that we do not need to resize the vocabulary and the embeddings.

Note that the two replaced words will now be mapped to UNK.
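
The idea can be illustrated on a toy vocabulary. This is only a sketch: the list `vocab`, the `<unk>` symbol at index 0 and the helper `encode` are hypothetical, and do not reflect the actual dictionary class used by the model.

```python
# Toy vocabulary, assumed to be sorted from most to least frequent token
vocab = ['<unk>', 'the', 'you', 'bicycle', 'zymurgy', 'quokka']
size_before = len(vocab)

# Overwrite the two rarest entries with the control tokens
vocab[-2] = '<formal>'
vocab[-1] = '<informal>'

# Token -> index lookup; unknown tokens fall back to the index of '<unk>'
index = {token: i for i, token in enumerate(vocab)}
UNK = index['<unk>']

def encode(tokens):
    """Map a list of tokens to vocabulary indices."""
    return [index.get(token, UNK) for token in tokens]

# The vocabulary size is unchanged, so the embedding matrix keeps its shape
print(len(vocab) == size_before)
# The control token reuses an old slot; the replaced word 'quokka' becomes UNK
print(encode('<formal> you quokka'.split()))
```

The control tokens inherit the embeddings of the words they replace, so they start from values that are meaningless for their new role and must be learned during fine-tuning.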

In [None]:
source_dict = rnn_attn_model.source_dict
print(source_dict)

# Replace some infrequent tokens with the new control tokens (these words will now be mapped to UNK)
# This is a bit dirty, but this way we don't have to resize the pretrained model's vocabulary and embeddings
source_dict[len(source_dict) - 2] = '<formal>'
source_dict[len(source_dict) - 1] = '<informal>'

# Binarize the training and validation data with these vocabularies
binarize(train_data, source_dict, target_dict, sort=True)
binarize(valid_data, source_dict, target_dict, sort=False)

# You can see that the training source examples now start with special tokens.
print(train_data[:5])

print('train_size={}, valid_size={}, min_len={}, max_len={}, avg_len={:.1f}'.format(
    len(train_data),
    len(valid_data),
    train_data['source_len'].min(),
    train_data['source_len'].max(),
    train_data['source_len'].mean(),
))

reset_seed()

train_iterator = BatchIterator(train_data, 'en', 'fr', batch_size=512, max_len=30, shuffle=True)
valid_iterator = BatchIterator(valid_data, 'en', 'fr', batch_size=512, max_len=30, shuffle=False)

In [None]:
# Finetune the EN-FR pretrained model with the new data
new_checkpoint_path = os.path.join(model_root, 'en-fr', 'rnn-attn.pt')
rnn_attn_model.reset_optimizer()

train_model(rnn_attn_model, train_iterator, [valid_iterator], new_checkpoint_path, epochs=5)

## Inference

In [None]:
translate(rnn_attn_model, "would you lend me your bicycle ?", preprocess_formal, 'en', 'fr')

In [None]:
translate(rnn_attn_model, "would you lend me your bicycle ?", preprocess_informal, 'en', 'fr')

<br /><br /><br /><hr width=170% /><br />
<h2><font color=green><b>EXERCISE 3:</b> Politeness control in Dutch</font> </h2>



Implement politeness control for Dutch. Below, we already added code to download the data for English-Dutch and to train a new BPE. In order to complete this exercise, you should train a new model for English-Dutch, define new functions for <em>is_formal()</em> and <em>is_informal()</em>, use them in preprocessing and finetune the English-Dutch model. Try to come up with the most complete rules and demonstrate that this works by 'formalizing' and 'deformalizing' English to Dutch translations using the translate() function as done above.

<br />

If you don't speak Dutch, please indicate that in your exercise. We still expect you to complete the exercise but we'll know your knowledge of the grammar rules is limited. For a short intro to politeness control in Dutch, see here: https://blogs.transparent.com/dutch/formal-and-informal-pronouns/. Of course, you can also ask questions about this during the labs.

<br /><br /><hr width=170% /><br />






In [None]:
# Download preprocessed data for English -> Dutch
if not os.path.exists('valid.en-nl.nl'):
  !wget https://raw.githubusercontent.com/esther2000/MT-2022/main/valid.en-nl.en  # English validation set
  !wget https://raw.githubusercontent.com/esther2000/MT-2022/main/valid.en-nl.nl  # Dutch validation set
  !wget https://raw.githubusercontent.com/esther2000/MT-2022/main/test.en-nl.en  # English test set
  !wget https://raw.githubusercontent.com/esther2000/MT-2022/main/test.en-nl.nl  # Dutch test set
  !wget https://raw.githubusercontent.com/esther2000/MT-2022/main/train.en-nl.en  # English train set
  !wget https://raw.githubusercontent.com/esther2000/MT-2022/main/train.en-nl.nl  # Dutch train set

# The files are now available
! ls train*

# The format is the same as that of the previously used data
! head -5 train.en-nl.nl

In [None]:
# Train new BPE for English and Dutch
if not os.path.exists('data/bpecodes.en-nl'):
  !cat train.en-nl.en train.en-nl.nl | subword-nmt learn-bpe -o data/bpecodes.en-nl -s 8000 -v

In [None]:
# The BPE preprocessing functions should be reloaded, as they use
# 'bpe_model', which was changed to accommodate Dutch.

bpe_path = 'data/bpecodes.en-nl'  # BPE codes learned in the previous cell
with open(bpe_path) as bpe_codes:
    bpe_model = BPE(bpe_codes)

def preprocess(source_line, target_line=None, source_lang=None, target_lang=None):
    source_line = bpe_model.segment(source_line.lower())
    if target_line is not None:
        target_line = bpe_model.segment(target_line.lower())
    return source_line, target_line