## Update text labels and generate variants
1. assign text uid;
2. update sentiment and relation labels;
3. generate multiple text variants.

### Assign unique ids for texts
Unique text ids are critical for (1) non-overlapping train/val/test splits; and (2) training batch sampling that supports a contrastive loss.

In [1]:
import pandas as pd
df = pd.read_pickle('./data/tmp/zuco_label_input_text.df')
print(df.columns)

uids, unique_texts = pd.factorize(df['input text'])
df['text uid'] = uids.tolist()
print(df.columns)

Index(['raw text', 'dataset', 'task', 'control', 'raw label', 'input text'], dtype='object')
Index(['raw text', 'dataset', 'task', 'control', 'raw label', 'input text',
       'text uid'],
      dtype='object')
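
To illustrate why the uid matters for splitting, here is a minimal sketch on toy data. The helper `split_by_uid` and its `ratios`/`seed` parameters are hypothetical and not part of the original pipeline. It splits on unique uids rather than on rows, so a text that occurs in several rows cannot end up in more than one split.

```python
import numpy as np
import pandas as pd

def split_by_uid(frame, uid_col='text uid', ratios=(0.8, 0.1, 0.1), seed=0):
    """Hypothetical helper: split rows into train/val/test by unique text uid."""
    uids = frame[uid_col].unique()
    rng = np.random.default_rng(seed)
    uids = rng.permutation(uids)            # shuffle the uids, not the rows
    n_train = int(len(uids) * ratios[0])
    n_val = int(len(uids) * ratios[1])
    parts = np.split(uids, [n_train, n_train + n_val])
    # every row follows its uid into exactly one split
    return [frame[frame[uid_col].isin(p)] for p in parts]

# Toy data with repeated texts (assumption: stands in for df['input text'])
toy = pd.DataFrame({'input text': ['a', 'b', 'a', 'c', 'b', 'd', 'e', 'a', 'f', 'g']})
toy['text uid'] = pd.factorize(toy['input text'])[0]

train, val, test = split_by_uid(toy, ratios=(0.6, 0.2, 0.2))
```

A row-level random split would not give this guarantee, because duplicated texts could land on both sides of the train/test boundary.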


### Update sentiment/relation labels
1. move them to separate columns;
2. replace them with more tractable natural language terms.

In [2]:
print(df.value_counts('raw label'))

raw label
1                                                                 140
0                                                                 137
-1                                                                123
EDUCATION                                                         120
JOB_TITLE                                                         114
                                                                 ... 
Where was he born? Philadelphia                                     1
Which army did he join?  U.S. Army                                  1
Which university did he get his degree from? Tulane University      1
Which war interrupted his sutdies? World War II                     1
Who ran an oil drilling company? his father                         1
Name: count, Length: 82, dtype: int64


### Sentiment label for task1

In [3]:
df_task1 = df[df['task']=='task1']
print(df_task1.value_counts('raw label'))

raw label
1     140
0     137
-1    123
Name: count, dtype: int64


In [4]:
def create_readable_sentiment_label(src):
    # replace the raw sentiment codes by natural language terms
    labels = ['-1', '0', '1']
    new_labels = ['negative', 'neutral', 'positive']
    assert src in labels
    tgt = new_labels[labels.index(src)]
    return tgt

# df_task1 is a slice of df; copy it first so that adding a column
# does not trigger pandas' SettingWithCopyWarning
df_task1 = df_task1.copy()
df_task1['sentiment label'] = df_task1['raw label'].apply(create_readable_sentiment_label)


### Relation label for task3

In [5]:
df_task3 = df[df['task']=='task3']
print(df_task3.value_counts('raw label'))

raw label
EDUCATION                120
JOB_TITLE                114
NATIONALITY              109
EMPLOYER                 104
WIFE                     102
POLITICAL_AFFILIATION     92
FOUNDER                   84
VISITED                   48
AWARD                     45
Name: count, dtype: int64


In [6]:
def create_readable_relation_label(src):
    # replace the original labels by LM-friendly terms
    labels = ['AWARD', 'EDUCATION', 'EMPLOYER',
              'FOUNDER', 'JOB_TITLE', 'NATIONALITY',
              'POLITICAL_AFFILIATION', 'VISITED', 'WIFE']

    new_labels = ['awarding', 'education', 'employment',
                  'foundation', 'job title', 'nationality',
                  'political affiliation', 'visit', 'marriage']
    assert src in labels
    tgt = new_labels[labels.index(src)]
    return tgt

# df_task3 is a slice of df; copy it first so that adding a column
# does not trigger pandas' SettingWithCopyWarning
df_task3 = df_task3.copy()
df_task3['relation label'] = df_task3['raw label'].apply(create_readable_relation_label)


In [7]:
df_task1_task3 = pd.concat([df_task1, df_task3], ignore_index=False)
df_class_label = pd.merge(df, df_task1_task3, how='left')
print(df_class_label.shape)
print(df_class_label.columns)

(1888, 9)
Index(['raw text', 'dataset', 'task', 'control', 'raw label', 'input text',
       'text uid', 'sentiment label', 'relation label'],
      dtype='object')
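
As an alternative to the slice, concat and merge steps, the same two columns can be sketched with dictionary lookups on the full frame. The helper `add_class_labels` and the toy rows are hypothetical; the mappings are copied from the two functions above. Unlike the `assert`-based functions, unknown labels silently become `NaN` here.

```python
import pandas as pd

# Mappings copied from create_readable_sentiment_label / create_readable_relation_label
SENTIMENT_MAP = {'-1': 'negative', '0': 'neutral', '1': 'positive'}
RELATION_MAP = {
    'AWARD': 'awarding', 'EDUCATION': 'education', 'EMPLOYER': 'employment',
    'FOUNDER': 'foundation', 'JOB_TITLE': 'job title', 'NATIONALITY': 'nationality',
    'POLITICAL_AFFILIATION': 'political affiliation', 'VISITED': 'visit',
    'WIFE': 'marriage',
}

def add_class_labels(frame):
    """Hypothetical helper: add both label columns to a copy of `frame`."""
    out = frame.copy()
    is_task1 = out['task'] == 'task1'
    is_task3 = out['task'] == 'task3'
    # Assigning a Series aligns on the index; rows of other tasks stay NaN
    out['sentiment label'] = out.loc[is_task1, 'raw label'].map(SENTIMENT_MAP)
    out['relation label'] = out.loc[is_task3, 'raw label'].map(RELATION_MAP)
    return out

# Toy rows; 'other' is a placeholder task name used only for this example
toy = pd.DataFrame({
    'task': ['task1', 'task3', 'other', 'task1'],
    'raw label': ['-1', 'WIFE', 'Where was he born? Philadelphia', '1'],
})
labeled = add_class_labels(toy)
print(labeled[['task', 'sentiment label', 'relation label']])
```

Because the frame is never sliced, the row count cannot change and no merge keys are involved.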


### Generate text variants
- Run [_gen_variants_llm_regular.py](_gen_variants_llm_regular.py) to generate 6 finely paraphrased variants using an LLM with detailed instructions.

In [8]:
import pandas as pd
df1 = pd.read_pickle('./data/tmp/zuco_label_lexical_simplification.df')
df2 = pd.read_pickle('./data/tmp/zuco_label_semantic_clarity.df')
df3 = pd.read_pickle('./data/tmp/zuco_label_syntax_simplification.df')
print(df1.shape, df1.columns)
print(df2.shape, df2.columns)
print(df3.shape, df3.columns)

(1888, 2) Index(['lexical simplification (v0)', 'lexical simplification (v1)'], dtype='object')
(1888, 2) Index(['semantic clarity (v0)', 'semantic clarity (v1)'], dtype='object')
(1888, 2) Index(['syntax simplification (v0)', 'syntax simplification (v1)'], dtype='object')


- Run [_gen_variants_llm_general.py](_gen_variants_llm_general.py) to generate 8+8 simplified/rewritten variants using an LLM with a general instruction.

In [9]:
df_simplified = pd.read_pickle('./data/tmp/zuco_simplified_text.df')
df_rewritten = pd.read_pickle('./data/tmp/zuco_rewritten_text.df')
print(df_simplified.shape, df_simplified.columns)
print(df_rewritten.shape, df_rewritten.columns)

(1888, 8) Index(['simplified text (v0)', 'simplified text (v1)', 'simplified text (v2)',
       'simplified text (v3)', 'simplified text (v4)', 'simplified text (v5)',
       'simplified text (v6)', 'simplified text (v7)'],
      dtype='object')
(1888, 8) Index(['rewritten text (v0)', 'rewritten text (v1)', 'rewritten text (v2)',
       'rewritten text (v3)', 'rewritten text (v4)', 'rewritten text (v5)',
       'rewritten text (v6)', 'rewritten text (v7)'],
      dtype='object')


- Run [_gen_variants_t5_naive.py](_gen_variants_t5_naive.py) to generate 1+1 simplified/rewritten variants using the integrated LM.

In [10]:
df_naive = pd.read_pickle('./data/tmp/zuco_label_naive.df')
print(df_naive.shape, df_naive.columns)

(1888, 8) Index(['raw text', 'dataset', 'task', 'control', 'raw label', 'input text',
       'naive rewritten', 'naive simplified'],
      dtype='object')


### Merge labels and text variants
- Our final 8 variants consist of (1) the 6 finely paraphrased LLM variants and (2) the 2 naive variants. Exception samples in (1) that are marked with `<ERROR>` are replaced by the general-simplified variants.

In [11]:
df_6var_error = pd.concat([df_class_label, df1, df2, df3], axis=1)
print(df_6var_error.shape, df_6var_error.columns)

(1888, 15) Index(['raw text', 'dataset', 'task', 'control', 'raw label', 'input text',
       'text uid', 'sentiment label', 'relation label',
       'lexical simplification (v0)', 'lexical simplification (v1)',
       'semantic clarity (v0)', 'semantic clarity (v1)',
       'syntax simplification (v0)', 'syntax simplification (v1)'],
      dtype='object')


- handle `<ERROR>` cases

In [12]:
# Rows of df_6var_error that contain "<ERROR>" in any column (kept for inspection)
error_rows = df_6var_error[df_6var_error.apply(lambda row: row.astype(str).str.contains('<ERROR>').any(), axis=1)]

# Columns holding the 6 finely paraphrased variants, which may contain "<ERROR>"
error_check_columns = ['lexical simplification (v0)', 'lexical simplification (v1)',
                       'semantic clarity (v0)', 'semantic clarity (v1)',
                       'syntax simplification (v0)', 'syntax simplification (v1)']

# Columns of df_simplified that replace the variant columns, in the same order
replace_columns_df_simplified = ['simplified text (v0)', 'simplified text (v1)',
                                 'simplified text (v2)', 'simplified text (v3)',
                                 'simplified text (v4)', 'simplified text (v5)']

df_all = df_6var_error.copy()
# Iterate through each row of df_6var_error and check for "<ERROR>" in the variant columns
for i, row in df_6var_error.iterrows():
    if row[error_check_columns].astype(str).str.contains('<ERROR>').any():
        # Replace all 6 variants of this row with the first 6 general-simplified variants
        df_all.loc[i, error_check_columns] = df_simplified.loc[i, replace_columns_df_simplified].values

print(df_all.shape, df_all.columns)
# Sanity check: no "<ERROR>" should remain after the replacement
error_rows_6var = df_all[df_all.apply(lambda row: row.astype(str).str.contains('<ERROR>').any(), axis=1)]
print(error_rows_6var.shape)

(1888, 15) Index(['raw text', 'dataset', 'task', 'control', 'raw label', 'input text',
       'text uid', 'sentiment label', 'relation label',
       'lexical simplification (v0)', 'lexical simplification (v1)',
       'semantic clarity (v0)', 'semantic clarity (v1)',
       'syntax simplification (v0)', 'syntax simplification (v1)'],
      dtype='object')
(0, 15)
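
The row-by-row loop above can also be written with a boolean mask. The following is a minimal sketch on toy data, assuming both frames share the same index; the helper `replace_error_rows` and the toy column names are hypothetical.

```python
import pandas as pd

def replace_error_rows(variants, fallback, variant_cols, fallback_cols, marker='<ERROR>'):
    """Hypothetical helper: where any variant cell contains `marker`,
    replace all variant cells of that row with the fallback cells."""
    has_error = (
        variants[variant_cols]
        .astype(str)
        .apply(lambda col: col.str.contains(marker, regex=False))
        .any(axis=1)
    )
    out = variants.copy()
    if has_error.any():
        # .to_numpy() drops the column labels so values are assigned by position
        out.loc[has_error, variant_cols] = fallback.loc[has_error, fallback_cols].to_numpy()
    return out, has_error

# Toy frames standing in for the paraphrased and general-simplified variants
variants = pd.DataFrame({'v0': ['good a', '<ERROR>', 'good c'],
                         'v1': ['fine a', 'fine b', 'fine c']})
fallback = pd.DataFrame({'s0': ['simple a0', 'simple b0', 'simple c0'],
                         's1': ['simple a1', 'simple b1', 'simple c1']})

cleaned, mask = replace_error_rows(variants, fallback, ['v0', 'v1'], ['s0', 's1'])
print(cleaned)
```

As in the loop, a single `<ERROR>` cell causes the whole row of variants to be replaced, so one sample never mixes variants from the two generation scripts.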


- add naive variants

In [13]:
df_all[['naive rewritten', 'naive simplified']] = df_naive[['naive rewritten', 'naive simplified']]
print(df_all.columns)

Index(['raw text', 'dataset', 'task', 'control', 'raw label', 'input text',
       'text uid', 'sentiment label', 'relation label',
       'lexical simplification (v0)', 'lexical simplification (v1)',
       'semantic clarity (v0)', 'semantic clarity (v1)',
       'syntax simplification (v0)', 'syntax simplification (v1)',
       'naive rewritten', 'naive simplified'],
      dtype='object')
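
The column assignment above relies on `df_all` and `df_naive` having the same rows in the same order. A minimal sketch of a check that could precede it, assuming `'input text'` is a suitable key (both frames contain that column); the helper `check_row_alignment` is hypothetical.

```python
import pandas as pd

def check_row_alignment(left, right, key='input text'):
    """Hypothetical check: both frames share an index and agree on `key` row by row."""
    if not left.index.equals(right.index):
        return False
    return bool((left[key] == right[key]).all())

# Toy frames: `same_order` matches `base`, `shuffled` has the same rows in another order
base = pd.DataFrame({'input text': ['a', 'b', 'c']})
same_order = pd.DataFrame({'input text': ['a', 'b', 'c'], 'naive simplified': ['x', 'y', 'z']})
shuffled = same_order.iloc[::-1].reset_index(drop=True)

print(check_row_alignment(base, same_order))  # rows line up
print(check_row_alignment(base, shuffled))    # same index, different row order
```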


In [14]:
pd.to_pickle(df_all, './data/tmp/zuco_label_8variants.df')