## Data Preprocessing

Fine-tuning the model on question-avoidance data requires examples of question dodging.

Data sources: 

- [Question Avoidance Study](https://github.com/YanaPalacheva/avoidance_study/tree/master), containing question-response pairs classified as avoidant or non-avoidant; avoidant responses are further classified as fight or flight.

- [TODO: add more]


To make different kinds of data usable for training a zero-shot [NLI (Natural Language Inference) model](https://nlpprogress.com/english/natural_language_inference.html), each example must be recast as a premise-hypothesis pair labelled as contradiction, neutral, or entailment. This [StackOverflow answer provides a decent explanation](https://stackoverflow.com/questions/76213873/how-to-finetune-a-zero-shot-model-for-text-classification).
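
As a rough illustration of that relabelling, the sketch below turns one question-answer pair into an NLI record. The hypothesis wording, the `to_nli_example` helper, the premise layout, and the toy strings are all assumptions made for illustration; only the 0/1/2 label ids mirror the mapping used later in this notebook.

```python
# Label ids as used further down in this notebook (0 = entailment, i.e. high avoidance).
LABEL2ID = {"entailment": 0, "neutral": 1, "contradiction": 2}

# Assumed hypothesis wording; any fixed statement about avoidance could be used instead.
HYPOTHESIS = "The answer avoids the question."


def to_nli_example(question: str, answer: str, label: str) -> dict:
    """Build one premise/hypothesis record from a question-answer pair (hypothetical helper)."""
    premise = f"Question: {question} Answer: {answer}"
    return {"premise": premise, "hypothesis": HYPOTHESIS, "label": LABEL2ID[label]}


# Toy pair, invented for this sketch (not taken from the dataset).
example = to_nli_example(
    "When will the report be published?",
    "That is an interesting question, and I am glad it was raised.",
    "entailment",
)
print(example)
```

Keeping the hypothesis fixed means that the premise carries all the variation, which is the usual setup when a zero-shot NLI model is reused as a classifier.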

In [1]:
import pandas as pd
import os

annotated_data = {}
raw_data_path = "../data/raw"
data_path = "../data/processed"
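
The cells below write into `raw_data_path` and `data_path`, so both directories have to exist beforehand. A minimal sketch of creating them up front, assuming they may be missing on a fresh checkout; `ensure_dirs` is a hypothetical helper, and the demo runs in a temporary directory rather than in `../data`.

```python
import os
import tempfile


def ensure_dirs(*paths: str) -> None:
    """Create each directory (and its parents) if it does not exist yet."""
    for path in paths:
        os.makedirs(path, exist_ok=True)


# Demonstrated in a temporary directory so nothing in the project is touched.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = os.path.join(tmp, "data", "raw")
    processed_dir = os.path.join(tmp, "data", "processed")
    ensure_dirs(raw_dir, processed_dir)
    created = os.path.isdir(raw_dir) and os.path.isdir(processed_dir)

print(created)
```

In the notebook itself the call would be `ensure_dirs(raw_data_path, data_path)`.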

#### Question Avoidance Study

In [2]:
# download the annotated csv file from github
import requests
resp = requests.get("https://raw.githubusercontent.com/YanaPalacheva/avoidance_study/master/Annotation/Avoidance_annotated.csv")
resp.raise_for_status()  # fail early if the download did not succeed

dataset_name = "question_avoidance"
filename = f"{raw_data_path}/{dataset_name}.csv"  # "../data/raw/question_avoidance.csv"
with open(filename, "w", encoding="utf-8") as f:
    # decode the response bytes as UTF-8 and write them out with the same encoding
    f.write(resp.content.decode("utf-8"))

In [3]:
annotated_data[dataset_name] = pd.read_csv(filename)

In [4]:
annotated_data[dataset_name].sample(3)

| | index | dataset | id_a | id_q | meta.pair_idx | text_q | text_a | if_q_1 | avoid_rate_1 | avoid_type_1 | if_q_2 | avoid_rate_2 | avoid_type_2 | if_q_3 | avoid_rate_3 | avoid_type_3 | avoid_rate_avg | avoid_type_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 196 | 236 | CDC | t1_chrijcr | t3_26j0zc | | I am forced to wear jean shorts for an event. ... | denim jacket with navy blue shoes and a denim ... | Q | 0.0 | | Q | 0.0 | | NQ | | | 0.0 | |
| 279 | 333 | CDC | t1_cia6m66 | t3_28en5d | | I'm a young guy with two weeks off and about $... | The first time I went to an actually nice beac... | Q | 0.0 | | Q | 0.0 | | Q | 0.0 | | 0.0 | |
| 377 | 448 | PQTC | 2013-02-06a.262.8 | 2013-02-06a.262.5 | 2013-02-06.3.0 | I thank the Minister for that answer and welco... | The hon Gentleman is talking total nonsense . ... | Q | 3.0 | Fight | NQ | | | Q | 4.0 | Fight | 3.5 | Fight |


In [5]:
# rows with no avoid_type_avg (NaN) are examples of non-avoidant answers
# (assign the result back: fillna(inplace=True) on a selected column may not update the DataFrame)
annotated_data[dataset_name]["avoid_type_avg"] = annotated_data[dataset_name]["avoid_type_avg"].fillna("non-avoidant")

# sanity check: the avoid_rate_avg of these rows should be low (below 2)
non_avoidant = annotated_data[dataset_name][annotated_data[dataset_name].avoid_type_avg == "non-avoidant"]
non_avoidant.avoid_rate_avg.describe()


count    167.000000
mean       0.232535
std        0.306095
min        0.000000
25%        0.000000
50%        0.000000
75%        0.333333
max        1.333333
Name: avoid_rate_avg, dtype: float64

In [6]:
# Checking that the data is correct
annotated_data[dataset_name][["avoid_type_avg"]].value_counts()

avoid_type_avg
Flight            204
non-avoidant      167
Fight              28
Undetermined       24
Name: count, dtype: int64

In [7]:
annotated_data[dataset_name][["avoid_rate_avg"]].describe()

| | avoid_rate_avg |
|---|---|
| count | 423.0 |
| mean | 1.700552 |
| std | 1.424706 |
| min | 0.0 |
| 25% | 0.333333 |
| 50% | 1.333333 |
| 75% | 3.0 |
| max | 4.0 |


#### Fitting the data for an NLI task

In [8]:
def avoid_rate_to_id(avoid_rate: float) -> int:
    if avoid_rate >= 2.0:
        return 0  # entailment, high avoidance
    elif avoid_rate >= 1.0:
        return 1  # neutral
    else:
        return 2  # contradiction
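
To make the thresholds concrete, the sketch below restates the mapping and applies it to a few rates, including the boundary values 1.0 and 2.0. The `ID2LABEL` dictionary is an assumption inferred from the comments in the function; it is not defined anywhere in the notebook.

```python
# Assumed id-to-name mapping, inferred from the comments in avoid_rate_to_id.
ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}


def avoid_rate_to_id(avoid_rate: float) -> int:
    """Same thresholds as the notebook cell: >= 2 entailment, [1, 2) neutral, else contradiction."""
    if avoid_rate >= 2.0:
        return 0
    if avoid_rate >= 1.0:
        return 1
    return 2


rates = [0.0, 0.333333, 1.0, 1.333333, 2.0, 3.5]
labels = [ID2LABEL[avoid_rate_to_id(rate)] for rate in rates]
for rate, label in zip(rates, labels):
    print(f"{rate:>8} -> {label}")
```

Both boundaries are inclusive on the lower side, so a rate of exactly 1.0 maps to neutral and exactly 2.0 maps to entailment.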

In [9]:
# make a copy of the dataset (.copy() is needed; plain assignment would only alias the original DataFrame)
processed_dataset = f"{dataset_name}_preprocessed"
annotated_data[processed_dataset] = annotated_data[dataset_name].copy()

In [10]:
annotated_data[processed_dataset] = annotated_data[processed_dataset].rename(columns={"text_q": "question", "text_a": "answer"})
# the label column stores integer ids; string labels would need an id2label mapping, which is not defined here
annotated_data[processed_dataset]["label"] = annotated_data[processed_dataset]["avoid_rate_avg"].apply(avoid_rate_to_id)

In [11]:
annotated_data[processed_dataset] = annotated_data[processed_dataset][["question", "answer", "label"]]

In [12]:
annotated_data[processed_dataset].sample(3)

| | question | answer | label |
|---|---|---|---|
| 54 | I'm going to launch something big at my player... | Technically this would be resolved with a grap... | 2 |
| 326 | In the course of reviewing that assessment , w... | I thank my hon Friend for that question . I do... | 0 |
| 133 | The Secretary of State will recognise that tou... | In February or March 1996 . | 0 |


#### Save the data as parquet

In [13]:
# write every DataFrame in annotated_data to its own parquet file
for name, df in annotated_data.items():
    parquet_filename = f"{data_path}/{name}_dataset.parquet"
    print(parquet_filename)
    df.to_parquet(parquet_filename, engine="pyarrow")

../data/processed/question_avoidance_dataset.parquet
../data/processed/question_avoidance_preprocessed_dataset.parquet
