# Data Augmentation
Michael's Version

We will try a number of trivial strategies for augmenting data.

## 1. Copying Transformation
To help the model learn what well-formed sentences look like in both the input and the output space, let's add one row for every unique source and target sentence in which the change is a placeholder (`NOCHANGE`) and the target simply repeats the source. We can do this for the training sentences and for the `Source` values from the dev sentences. This is essentially *lemma copying* from [Yang et al. (2022)](https://aclanthology.org/2022.sigmorphon-1.23).

In [16]:
import pandas as pd 

langs = ['bribri', 'maya', 'guarani']

def create_identity_dataframe(lang):
    train_df = pd.read_csv(f"../data/yoyodyne/{lang}-train.tsv", delimiter="\t")
    dev_df = pd.read_csv(f"../data/yoyodyne/{lang}-dev.tsv", delimiter="\t")

    # Collect every unique sentence: train sources and targets, plus dev sources.
    # A single set union avoids duplicate rows for sentences found in both splits.
    sentences = list(set(train_df['Source']) | set(train_df['Target']) | set(dev_df['Source']))
    return pd.DataFrame({'Source': sentences, 'Target': sentences, 'Change': 'NOCHANGE'})

for lang in langs:
    identity_df = create_identity_dataframe(lang)
    identity_df.to_csv(f"../data/augmented/{lang}-identity.tsv", sep="\t", index=False)
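
As a quick check of the copying idea, here is a minimal sketch on in-memory toy rows, using plain Python instead of pandas so it runs without the data files. The rows `toy_train` and `toy_dev` and the helper `identity_rows` are hypothetical names introduced only for this illustration.

```python
# Hypothetical toy rows standing in for the real train/dev TSV files.
toy_train = [
    {"Source": "A", "Target": "B", "Change": "TYPE1:VALUE1"},
    {"Source": "A", "Target": "C", "Change": "TYPE1:VALUE2"},
]
toy_dev = [
    {"Source": "A", "Target": "E", "Change": "TYPE1:VALUE1"},
    {"Source": "D", "Target": "F", "Change": "TYPE1:VALUE2"},
]


def identity_rows(train_rows, dev_rows):
    """Build one NOCHANGE row per unique sentence."""
    sentences = {row["Source"] for row in train_rows}
    sentences |= {row["Target"] for row in train_rows}
    # Only the Source side of the dev rows is used, as described above.
    sentences |= {row["Source"] for row in dev_rows}
    # Sorting is only for a stable, readable order in this toy example.
    return [
        {"Source": s, "Target": s, "Change": "NOCHANGE"}
        for s in sorted(sentences)
    ]


toy_identity = identity_rows(toy_train, toy_dev)
for row in toy_identity:
    print(row)
```

The sentence `A` appears three times in the toy input but produces a single identity row, and the dev targets `E` and `F` are never copied.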

## 2. Lateral Transformations
Our training data includes a number of groups of examples with the same source sentence and the same type of change applied. For example, Maya includes the following:

```text
Ma' táan in bin ich kooli' [PERSON:2_SI] -> Ma' táan a bin ich kooli'	
Ma' táan in bin ich kooli' [PERSON:3_SI] -> Ma' táan u bin ich kooli'	
```

(Here, the source sentence is presumably marked `PERSON:1_SI`, but we won't use that yet.)

Because the two examples share the same source sentence and change the same feature type, one target can be transformed directly into the other. The third-person to second-person transformation is therefore also valid:

```text
Ma' táan u bin ich kooli' [PERSON:2_SI] -> Ma' táan a bin ich kooli'
```

Likewise, so is the second-person to third-person transformation:

```text
Ma' táan a bin ich kooli' [PERSON:3_SI] -> Ma' táan u bin ich kooli'
```

For each group of examples that share a source sentence, we can create these **lateral transformations**. Two examples can be paired when they change exactly the same feature types, and also when one example's feature types are a superset of the other's. In the superset case, the target of the example with fewer feature types serves as the new source, since the additional features are still unmarked/null in it.
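
The pairing condition can be sketched in isolation before applying it to a whole DataFrame. The helper names `feature_types` and `is_lateral_pair` below are hypothetical and exist only for this illustration; the sketch assumes that a change string is a `;`-separated list of `TYPE:VALUE` features, as in the examples above.

```python
def feature_types(change: str) -> set:
    """Return the feature types named in a change string like 'TYPE1:V1;TYPE3:V2'."""
    return {feature.split(":")[0] for feature in change.split(";")}


def is_lateral_pair(source_change: str, target_change: str) -> bool:
    """True if the target of the first example can act as the source for the
    second example, i.e. the second changes at least the same feature types."""
    return feature_types(source_change) <= feature_types(target_change)


# Same feature type: valid in both directions.
print(is_lateral_pair("PERSON:2_SI", "PERSON:3_SI"))
print(is_lateral_pair("PERSON:3_SI", "PERSON:2_SI"))

# Superset of feature types: valid in one direction only.
print(is_lateral_pair("TYPE1:VALUE1", "TYPE1:VALUE5;TYPE3:VALUE6"))
print(is_lateral_pair("TYPE1:VALUE5;TYPE3:VALUE6", "TYPE1:VALUE1"))
```

The last call returns `False` because the first example has already set `TYPE3`, and a change that only mentions `TYPE1` gives no way to undo it.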

In [42]:
toy_df = pd.DataFrame({
    'Source': ['A', 'A', 'A', 'A'],
    'Target': ['B', 'C', 'D', 'E'],
    'Change': [
        'TYPE1:VALUE1',
        'TYPE2:VALUE3;TYPE3:VALUE4',
        'TYPE1:VALUE5;TYPE3:VALUE6',
        'TYPE2:VALUE7;TYPE3:VALUE8'
    ]
})

def lateral_augment(df: pd.DataFrame):
    # Parse a Change string into the set of feature TYPEs it modifies
    def parse_change(change_str):
        return {change.split(':')[0] for change in change_str.split(';')}

    # Work on plain lists so the caller's DataFrame is not mutated and
    # row positions do not depend on the DataFrame's index labels
    targets = df['Target'].tolist()
    changes = df['Change'].tolist()
    types = [parse_change(change) for change in changes]

    new_sources = []
    new_targets = []
    new_changes = []

    for i in range(len(df)):
        for j in range(len(df)):
            # Row j can follow row i if it changes at least the same feature types
            if i != j and types[i].issubset(types[j]):
                new_sources.append(targets[i])
                new_targets.append(targets[j])
                new_changes.append(changes[j])
    return pd.DataFrame({'Source': new_sources, 'Target': new_targets, 'Change': new_changes})

lateral_augment(toy_df)


|   | Source | Target | Change |
|---|--------|--------|--------|
| 0 | B | D | TYPE1:VALUE5;TYPE3:VALUE6 |
| 1 | C | E | TYPE2:VALUE7;TYPE3:VALUE8 |
| 2 | E | C | TYPE2:VALUE3;TYPE3:VALUE4 |


In [44]:
def create_lateral_dataframe(lang):
    train_df = pd.read_csv(f"../data/yoyodyne/{lang}-train.tsv", delimiter="\t")
    augmented_dfs = []
    for _, group in train_df.groupby('Source'):
        # Lateral pairs need at least two examples with the same source.
        # (DataFrame.size counts cells, not rows, so use len() here.)
        if len(group) < 2:
            continue

        augmented_dfs.append(lateral_augment(group.reset_index(drop=True)))

    return pd.concat(augmented_dfs, axis=0, ignore_index=True)
            
for lang in langs:
    lateral_df = create_lateral_dataframe(lang)
    lateral_df.to_csv(f"../data/augmented/{lang}-lateral.tsv", sep="\t", index=False)
