## Data Preprocessing

Fine-tuning the model on question-avoidance data requires examples of question dodging.

Data sources: 

- [Question Avoidance Study](https://github.com/YanaPalacheva/avoidance_study/tree/master), which contains question-response pairs labelled as avoidant or non-avoidant; avoidant responses are further classified as fight or flight responses.

- [TODO: add more]


To make various types of data usable for training a zero-shot [NLI (Natural Language Inference) model](https://nlpprogress.com/english/natural_language_inference.html), each premise-hypothesis text pair needs to be relabelled as contradiction, neutral, or entailment. This [StackOverflow answer provides a decent explanation](https://stackoverflow.com/questions/76213873/how-to-finetune-a-zero-shot-model-for-text-classification).
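
As a rough illustration of the relabelling described above, the sketch below turns one question-answer pair into a premise/hypothesis/label record. The hypothesis wording, the premise layout, the helper name `to_nli_example`, and the sample pair are all assumptions made up for this example, not taken from the dataset; the label ids follow the 0/1/2 convention used later in this notebook.

```python
# Label ids as used later in this notebook: 0 = entailment, 1 = neutral, 2 = contradiction.
ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Hypothetical hypothesis template (an assumption, not taken from the source data).
HYPOTHESIS = "The answer avoids the question."


def to_nli_example(question: str, answer: str, label_id: int) -> dict:
    """Build one NLI record from a question-answer pair (illustrative helper)."""
    # The premise layout is also an assumption; any consistent format would do.
    premise = f"Question: {question} Answer: {answer}"
    return {"premise": premise, "hypothesis": HYPOTHESIS, "label": label_id}


# Made-up pair, for illustration only.
example = to_nli_example(
    "Did you approve the budget?",
    "Let's talk about the weather instead.",
    0,  # entailment: the hypothesis "the answer avoids the question" holds
)
print(example["premise"])
print(example["hypothesis"], "->", ID2LABEL[example["label"]])
```

With this framing, a strongly avoidant answer entails the hypothesis, a clearly direct answer contradicts it, and borderline cases are neutral.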

In [None]:
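import pandas as pd
import os

annotated_data = {}  # maps dataset name -> DataFrame
raw_data_path = "../data/raw"
data_path = "../data"

# make sure the output directories exist before anything is written to them
os.makedirs(raw_data_path, exist_ok=True)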

#### Question Avoidance Study

In [None]:
# download the annotated CSV file from GitHub
import requests
resp = requests.get("https://raw.githubusercontent.com/YanaPalacheva/avoidance_study/master/Annotation/Avoidance_annotated.csv", timeout=30)
resp.raise_for_status()  # stop here on an HTTP error instead of saving an error page

dataset_name = "question_avoidance"
filename = f"{raw_data_path}/{dataset_name}.csv"  # "../data/raw/question_avoidance.csv"
with open(filename, "w", encoding="utf-8") as f:
    content = str(resp.content, encoding="utf-8")
    f.write(content)

In [None]:
annotated_data[dataset_name] = pd.read_csv(filename)

In [None]:
annotated_data[dataset_name].sample(3)

In [None]:
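# rows with no avoid_type_avg (NaN) are examples of non-avoidant answers
# (assign the result back: inplace fillna on a selected column may not update the DataFrame)
annotated_data[dataset_name]["avoid_type_avg"] = annotated_data[dataset_name]["avoid_type_avg"].fillna("non-avoidant")

# sanity check: avoid_rate_avg for these rows should be low (below 2)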
non_avoidant = annotated_data[dataset_name][annotated_data[dataset_name].avoid_type_avg == "non-avoidant"]
non_avoidant.avoid_rate_avg.describe()


In [None]:
# Checking that the data is correct
annotated_data[dataset_name][["avoid_type_avg"]].value_counts()

In [None]:
annotated_data[dataset_name][["avoid_rate_avg"]].describe()

#### Fitting the data for an NLI task

In [None]:
def avoid_rate_to_id(avoid_rate: float) -> int:
    """Map an average avoidance rating to an NLI label id."""
    if avoid_rate >= 2.0:
        return 0  # entailment, high avoidance
    elif avoid_rate >= 1.0:
        return 1  # neutral
    else:
        return 2  # contradiction, low avoidance
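
The thresholds above can be checked on a few boundary values. This is a minimal, self-contained sketch: the function is repeated so the snippet runs on its own, the `id2label` dict is an assumed name for the id-to-name mapping, and the sample ratings are made up rather than drawn from the dataset. Note that a NaN rating fails both comparisons and therefore falls through to contradiction.

```python
# Assumed mapping from label id to NLI label name.
id2label = {0: "entailment", 1: "neutral", 2: "contradiction"}


def avoid_rate_to_id(avoid_rate: float) -> int:
    """Same thresholds as the notebook cell above."""
    if avoid_rate >= 2.0:
        return 0  # entailment, high avoidance
    elif avoid_rate >= 1.0:
        return 1  # neutral
    else:
        return 2  # contradiction


# Made-up ratings around the two thresholds, plus a missing value.
rates = [0.0, 0.99, 1.0, 1.5, 2.0, 3.0, float("nan")]
labels = [id2label[avoid_rate_to_id(r)] for r in rates]

for rate, label in zip(rates, labels):
    print(f"{rate!s:>5} -> {label}")
```

If missing ratings should not be treated as contradiction, they would need to be filtered out or handled explicitly before labelling.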

In [None]:
# make a copy of the dataset (.copy() is needed; plain assignment would only alias the original)
processed_dataset = f"{dataset_name}_preprocessed"
annotated_data[processed_dataset] = annotated_data[dataset_name].copy()

In [None]:
# rename the text columns and derive the NLI label id from the average avoidance rating
annotated_data[processed_dataset] = annotated_data[processed_dataset].rename(columns={"text_q": "question", "text_a": "answer"})
# to store label names instead of ids, map the ids through an id2label dict (not defined in this notebook)
annotated_data[processed_dataset]["label"] = annotated_data[processed_dataset]["avoid_rate_avg"].apply(avoid_rate_to_id)

In [None]:
annotated_data[processed_dataset] = annotated_data[processed_dataset][["question", "answer", "label"]]

In [None]:
annotated_data[processed_dataset].sample(3)

#### Save the data as parquet

In [None]:
# write every dataset (raw and preprocessed) to its own parquet file
for name, df in annotated_data.items():
    parquet_filename = f"{data_path}/{name}_dataset.parquet"
    print(parquet_filename)
    df.to_parquet(parquet_filename, engine="pyarrow")