# Data Extraction

A quick notebook that samples 500 response pairs from r/ChangeMyView posts in Chenhao Tan's CMV dataset and extracts the body text from each response.

It takes the file [heldout_pair_data](original_data/cmv/pair_task/heldout_pair_data.jsonlist) from Chenhao Tan's CMV dataset, introduced in his paper ["Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-faith Online Discussions"](https://chenhaot.com/papers/changemyview.html), and samples 500 random records (one JSON object per line) to test ChatGPT's capabilities.

Additionally, it converts the sampled file to a .csv file for better readability.

In [25]:
# Samples up to 500 records from 'heldout_pair_data.jsonlist'
import os

import pandas as pd

file_path = "original_data/cmv/pair_task/heldout_pair_data.jsonlist"
output_dir = "sample_data"

if not os.path.exists(file_path):
    raise FileNotFoundError(file_path)

# Each line of the .jsonlist file is one JSON record
df = pd.read_json(file_path, orient='records', lines=True)
sample = df.sample(min(500, len(df)))

# Create the output directory if needed; to_json does not create it
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, os.path.basename(file_path))

sample.to_json(output_path, orient='records', lines=True)
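
The sampling cell above draws a different subset on every run. If a repeatable sample is wanted, `DataFrame.sample` accepts a `random_state` argument. The following is a minimal sketch on made-up data; the `toy` frame and the `SEED` value are assumptions for illustration and do not appear in the original notebook.

```python
import pandas as pd

# Made-up stand-in for the real records; only the column name "op_text"
# comes from the notebook, the values are invented.
toy = pd.DataFrame({"op_text": [f"post {i}" for i in range(10)]})

SEED = 0  # hypothetical seed, not used in the original notebook

# Sampling twice with the same seed selects the same rows in the same order.
first = toy.sample(min(5, len(toy)), random_state=SEED)
second = toy.sample(min(5, len(toy)), random_state=SEED)

print(first.equals(second))
```

Fixing the seed makes it possible to regenerate the exact same 500 records later, which can help when comparing model outputs across runs.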

In [None]:
#Converts to CSV and extracts only body text from responses
import pandas as pd

file_path = "sample_data/heldout_pair_data.jsonlist"

df = pd.read_json(file_path, orient='records', lines=True)

# Extracts the body text of the first comment in a response thread
def extract_body_text(response_data):
    return response_data['comments'][0]['body']

# Applies the function to the positive and negative columns
df['positive_body'] = df['positive'].apply(extract_body_text)
df['negative_body'] = df['negative'].apply(extract_body_text)

# Keeps only the original post and the two response bodies, renamed for the CSV
final_df = df[["op_text", "positive_body", "negative_body"]].rename(
    columns={"positive_body": "positive", "negative_body": "negative"}
)

final_df.to_csv("sample_data/heldout_pair_data.csv", index=False)
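
The extraction above indexes `comments[0]` directly, so it raises an error if a response has an empty or missing `comments` list. Whether such records occur in the dataset is not something this notebook establishes. A more defensive variant might look like the sketch below; the name `extract_body_text_safe` and the sample record are hypothetical, and only the `comments` and `body` keys come from the code above.

```python
from typing import Optional


def extract_body_text_safe(response_data: dict) -> Optional[str]:
    """Return the body of the first comment, or None if it is unavailable."""
    comments = response_data.get("comments") or []
    if not comments:
        return None
    return comments[0].get("body")


# Invented record that mirrors the nesting the notebook relies on.
toy_response = {"comments": [{"body": "I disagree because of X."}, {"body": "Follow-up."}]}

print(extract_body_text_safe(toy_response))      # body of the first comment
print(extract_body_text_safe({"comments": []}))  # no comments, so None
```

Returning `None` leaves an empty cell in the CSV for such rows instead of stopping the whole conversion.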