# Feature Analysis

This notebook displays a few dataframes to give more context on the datasets I'm using.

### 1. Post GPT prediction dataset

The files [results/explain_then_predict_results.csv](results/explain_then_predict_results.csv) and [results/predict_then_explain_results.csv](results/predict_then_explain_results.csv) contain GPT's predictions of which of two responses most likely changed the OP's opinion. Each row consists of the following:

- response_1, response_2 : The two responses to the OP. Which of the two slots holds the response that changed the OP's opinion is randomized
- correct_response : either 1 or 2, depending on which response actually changed the OP's opinion
- prediction : either 1 or 2, GPT's guess at correct_response
- correct : whether prediction matches correct_response
- explanation : GPT's reasoning for choosing a particular response
- temperature : (set to 0) how much randomness there is in GPT's responses

The following is what the first few lines of the dataset look like (for just explain-then-predict):

In [16]:
import pandas as pd
df = pd.read_csv("results/explain_then_predict_results.csv")
df.head(5)

Unnamed: 0,response_1,response_2,correct_response,prediction,correct,explanation,temperature
0,"""The Kurds"" are an ethnic group, with a simila...",&gt;1) The Kurds espouse a number of western v...,1,1,True,Response 1 provides a more detailed and nuance...,0
1,Actually 2040 isn't a bad estimate. [In 2012 C...,Have you looked into googles plans for self dr...,2,2,True,Response 2 provides concrete examples and evid...,0
2,&gt; This is why I think that giving the reaso...,"In principle, I agree with you that parents do...",2,2,True,Response 2 presents a strong argument by highl...,0
3,If we can make them then it's highly probable ...,You've already addressed a bunch of points her...,2,2,True,Response 2 is more persuasive because it chall...,0
4,"If someone uses the ""backdoor"" to my bathroom ...",The arguments against surveillance are nearly ...,1,1,True,Response 1 is more persuasive because it direc...,0


Now, let's pick a random row so you can see the full responses and the explanation.

In [17]:
import textwrap
row = df.sample(1).iloc[0]

print(textwrap.fill("RESPONSE 1: " + row['response_1'], width=80))
print("")
print(textwrap.fill("RESPONSE 2: " + row['response_2'], width=80))
print("")
print(textwrap.fill("EXPLANATION: " + row['explanation'], width=80))
print("")
print("CORRECT RESPONSE:", row["correct_response"])

RESPONSE 1: What would you suggest Christians do? Not vote? Move out of the US?
To where? My family is christian. We pay taxes, work, support the economy, are
contributing members of society. By some definitions, thats "supporting the
actions of the US". I am curious what you think the alternative is?

RESPONSE 2: You're treating the United States as some unified identity whose
actions one could support or oppose.  You're adding up government policies and
cultural trends and treating them as some sort of package.  You're also adding
together different actions by different levels of government and in different
places.  Nobody supports everything that goes on in the United States.  It would
be a contradiction to do so, as different things happen in different places, or
at different levels of government.

EXPLANATION: Response 1 provides a more logical and structured argument by
breaking down the idea of supporting the United States into different
components. It addresses the complexity o

The following functions compute one-tailed p-values for a one-proportion z-test and a one-sample t-test. They are used later to check statistical significance. You can skip to the next text box (I would recommend minimizing this cell if possible).

In [18]:
from scipy.stats import norm
from scipy import stats

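def one_prop_test(p_hat, p0, n, tail='left'):
    """One-tailed one-proportion z-test. Returns (p_val, z)."""
    # Standard error of the sample proportion under the null hypothesis p = p0
    se = (p0 * (1 - p0) / n)**(1/2)
    z = (p_hat - p0) / se

    if tail == 'right':
        # H1: true proportion is greater than p0
        p_val = 1 - norm.cdf(z)
    elif tail == 'left':
        # H1: true proportion is less than p0
        p_val = norm.cdf(z)
    else:
        raise ValueError("tail must be left or right")

    return p_val, z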

def one_sample_ttest(column, mu0=0, tail='left'):
    """One-tailed one-sample t-test on a pandas Series. Returns (p_val, t)."""
    n = column.shape[0]
    sample_mean = column.mean()
    sample_std = column.std(ddof=1)  # sample standard deviation

    t = (sample_mean - mu0) / (sample_std / (n ** 0.5))

    if tail == 'right':
        # H1: true mean is greater than mu0
        p_val = 1 - stats.t.cdf(t, df=n - 1)
    elif tail == 'left':
        # H1: true mean is less than mu0
        p_val = stats.t.cdf(t, df=n - 1)
    else:
        raise ValueError("tail must be left or right")

    return p_val, t
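
For reference, the statistics computed by the two functions above can be written as follows, where $\hat{p}$ is the observed proportion, $p_0$ the proportion under the null hypothesis, $\bar{x}$ and $s$ the sample mean and sample standard deviation (with `ddof=1`), $\Phi$ the standard normal CDF, and $F_{t,\,n-1}$ the CDF of Student's t with $n-1$ degrees of freedom. This only restates the code in notation. It is not an additional method.

```latex
z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}},
\qquad
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

p_{\text{left}} = \Phi(z) \ \text{or} \ F_{t,\,n-1}(t),
\qquad
p_{\text{right}} = 1 - \Phi(z) \ \text{or} \ 1 - F_{t,\,n-1}(t)
```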
        

We now calculate the p-values for each dataset to determine statistical significance.

In [19]:
df_ETP = pd.read_csv("results/explain_then_predict_results.csv")
ETP_correct = (df_ETP['prediction'] == df_ETP['correct_response']).sum()
ETP_n = len(df_ETP)
df_PTE = pd.read_csv("results/predict_then_explain_results.csv")
PTE_correct = (df_PTE['prediction'] == df_PTE['correct_response']).sum()
PTE_n = len(df_PTE)

# Test each observed accuracy against chance (p0 = 0.5), right-tailed
print(f"Explain-Then-Predict Accuracy: {ETP_correct} / {ETP_n} = {ETP_correct/ETP_n}")
ETP_p, ETP_z = one_prop_test(ETP_correct/ETP_n, 0.5, ETP_n, 'right')
print(f"p-value for Explain-Then-Predict: {round(ETP_p,4)}. z-score for Explain-Then-Predict: {round(ETP_z,2)}\n")
print(f"Predict-Then-Explain Accuracy: {PTE_correct} / {PTE_n} = {PTE_correct/PTE_n}")
PTE_p, PTE_z = one_prop_test(PTE_correct/PTE_n, 0.5, PTE_n, 'right')
print(f"p-value for Predict-Then-Explain: {round(PTE_p,4)}. z-score for Predict-Then-Explain: {round(PTE_z,2)}")

Explain-Then-Predict Accuracy: 290 / 500 = 0.58
p-value for Explain-Then-Predict: 0.0002. z-score for Explain-Then-Predict: 3.58

Predict-Then-Explain Accuracy: 270 / 500 = 0.54
p-value for Predict-Then-Explain: 0.0368. z-score for Predict-Then-Explain: 1.79


Both accuracies are statistically significant at a threshold of p = 0.05 (one-tailed), meaning that GPT picks the correct response at a rate higher than the 50% expected from random guessing.
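
As a cross-check on the normal approximation used by the z-test, the same question can be put to an exact binomial test. The sketch below is not part of the original analysis. It assumes SciPy 1.7 or later (for `scipy.stats.binomtest`) and hard-codes the counts printed above (290 and 270 correct out of 500). The helper name `exact_right_tail_p` is made up for illustration.

```python
from scipy.stats import binomtest


def exact_right_tail_p(n_correct, n_total, p0=0.5):
    """Exact one-sided p-value for H1: accuracy > p0 (hypothetical helper)."""
    result = binomtest(int(n_correct), int(n_total), p=p0, alternative="greater")
    return result.pvalue


# Counts copied from the accuracy output above
etp_exact_p = exact_right_tail_p(290, 500)
pte_exact_p = exact_right_tail_p(270, 500)

print(f"Explain-Then-Predict exact p-value: {etp_exact_p:.4f}")
print(f"Predict-Then-Explain exact p-value: {pte_exact_p:.4f}")
```

The exact p-values may differ slightly from the z-test values because the binomial test does not rely on a continuous approximation.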

### 2. Features of user responses dataset 

These datasets, [results/predict_features.csv](results/predict_features.csv) and [results/explain_features.csv](results/explain_features.csv), contain some features we extracted from each dataset, namely the following:

- **response_1/2_word_count** : The word count of each response
- **response_1/2_valence** : The average valence score, or how pleasant a word sounds, of each text, as determined by the [NRC-VAD Lexicon](https://saifmohammad.com/WebPages/nrc-vad.html)
- **response_1/2_arousal** : The average arousal score, or how emotionally intense a word is, of each text, as determined by the [NRC-VAD Lexicon](https://saifmohammad.com/WebPages/nrc-vad.html)
- **response_1/2_dominance** : The average dominance score, or the degree of control a word conveys, of each text, as determined by the [NRC-VAD Lexicon](https://saifmohammad.com/WebPages/nrc-vad.html)
- **response_1/2_concreteness** : The average concreteness score, or how perceptible (as opposed to abstract) a word is, of each text, as determined by the concreteness ratings at [https://link.springer.com/article/10.3758/s13428-013-0403-5#Sec10](https://link.springer.com/article/10.3758/s13428-013-0403-5#Sec10)
- **response_1/2_link_count** : The number of links in each response

For valence, arousal, dominance and concreteness, word scores were normalized to a **range of 0 to 1**, and only words with a **value above 0.65** were included, so as not to dilute the averages. Additionally, the difference of the correct response minus the incorrect response was tracked for each entry (the `*_prediction_difference` columns).
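
The filtering and differencing described above could be sketched roughly as follows. This is an assumption about the procedure, not the code actually used to build the feature files: `toy_lexicon`, `mean_lexicon_score`, and `signed_difference` are hypothetical names, the tokenization is a plain lowercase word split, and the lexicon scores are made-up values already scaled to the 0 to 1 range.

```python
import re


def mean_lexicon_score(text, lexicon, threshold=0.65):
    """Average lexicon score over the words in `text` scoring above `threshold`.

    Returns None when no word qualifies (an assumed convention for this sketch).
    """
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon and lexicon[w] > threshold]
    return sum(scores) / len(scores) if scores else None


def signed_difference(value_1, value_2, correct_response):
    """Feature value of the correct response minus that of the incorrect one."""
    return value_1 - value_2 if correct_response == 1 else value_2 - value_1


# Made-up scores, for illustration only
toy_lexicon = {"love": 0.9, "happy": 0.8, "table": 0.5, "war": 0.1}

# "table" (0.5) and unknown words are ignored; only "love" and "happy" count
score = mean_lexicon_score("I love this happy table", toy_lexicon)
print(score)

# Word counts from the first two rows of the explain_features.csv preview,
# assuming correct_response is 1 and 2 for those rows as in the
# explain-then-predict preview
diff_row0 = signed_difference(161, 364, correct_response=1)
diff_row1 = signed_difference(148, 119, correct_response=2)
print(diff_row0, diff_row1)
```

With these inputs the two word-count differences come out as -203 and -29, matching the `word_count_prediction_difference` values in the preview.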


In [20]:
import pandas as pd
df2 = pd.read_csv('results/explain_features.csv')
df2.head(5)

Unnamed: 0,response_1_word_count,response_2_word_count,word_count_prediction_difference,response_1_valence,response_2_valence,valence_prediction_difference,response_1_arousal,response_2_arousal,arousal_prediction_difference,response_1_dominance,response_2_dominance,dominance_prediction_difference,response_1_concreteness,response_2_concreteness,concreteness_prediction_difference,response_1_link_count,response_2_link_count,link_count_prediction_difference
0,161,364,-203,0.28325,0.538545,-0.255295,0.94075,0.898875,0.041875,0.955833,0.842042,0.113792,0.818769,0.838077,-0.019308,0,1,-1
1,148,119,-29,0.7459,0.958333,0.212433,0.922,0.9815,0.0595,0.912667,0.7534,-0.159267,0.8628,0.823231,-0.039569,3,0,-3
2,384,495,111,0.858675,0.720577,-0.138098,0.8645,0.386167,-0.478333,0.8771,0.8905,0.0134,0.833913,0.831348,-0.002565,0,0,0
3,103,80,-23,0.699778,0.9028,0.203022,0.693667,0.118,-0.575667,0.707,0.8905,0.1835,0.87,0.810667,-0.059333,0,0,0
4,110,111,-1,0.744,0.274,0.47,0.923,0.74325,0.17975,0.88425,0.8745,0.00975,0.844,0.884889,-0.040889,0,0,0


Here we can see the average difference for the correct responses vs the incorrect responses.

In [23]:
df_diff = df2[["word_count_prediction_difference", "valence_prediction_difference", "arousal_prediction_difference", "dominance_prediction_difference", "concreteness_prediction_difference", "link_count_prediction_difference"]]
print(df_diff.mean(),"\n")


mean_link_diff = df2.loc[
    df2["link_count_prediction_difference"] != 0, 
    "link_count_prediction_difference"
].mean()
print(f"link_count_prediction_difference for texts that contain a link: {mean_link_diff}")

word_count_prediction_difference     -12.072000
valence_prediction_difference         -0.000241
arousal_prediction_difference          0.017882
dominance_prediction_difference       -0.025396
concreteness_prediction_difference     0.000396
link_count_prediction_difference       0.018000
dtype: float64 

link_count_prediction_difference for texts that contain a link: 0.0743801652892562


Let's test for statistical significance using a one-sample, one-tailed t-test against a mean of 0.

In [26]:
results = {}
for col in df_diff.columns:
    #Determine tail direction
    mean_val = df_diff[col].mean()
    tail = 'left' if mean_val < 0 else 'right'

    p_val, t = one_sample_ttest(df_diff[col], mu0=0, tail=tail)
    results[col] = {"t": t, "p_val": p_val}

# Repeat for link counts, keeping only rows whose link-count difference is nonzero
nonzero_links = df2.loc[
    df2["link_count_prediction_difference"] != 0,
    "link_count_prediction_difference"
]

mean_link_diff = nonzero_links.mean()
tail = 'left' if mean_link_diff < 0 else 'right'

p_val, t = one_sample_ttest(nonzero_links, mu0=0, tail=tail)

results["link_count_prediction_difference_nonzero"] = {
    "t": t,
    "p_val": p_val
}

results_df = pd.DataFrame(results).T
print(results_df)

                                                 t     p_val
word_count_prediction_difference         -1.301163  0.096902
valence_prediction_difference            -0.018984  0.492431
arousal_prediction_difference             0.921249  0.178683
dominance_prediction_difference          -1.510873  0.065727
concreteness_prediction_difference        0.252301  0.400456
link_count_prediction_difference          0.272977  0.392492
link_count_prediction_difference_nonzero  0.272182  0.392975
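
One caveat worth keeping in mind: the loop above picks the tail from the sign of the sample mean, so each reported p-value equals half of the corresponding two-sided p-value. The sketch below illustrates this on synthetic data (not the real feature files) using `scipy.stats.ttest_1samp`, whose default is a two-sided test. The random sample and the variable names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic "difference" column, for illustration only
rng = np.random.default_rng(0)
sample = rng.normal(loc=-0.02, scale=0.4, size=500)

# Two-sided one-sample t-test against a mean of 0
t_stat, p_two_sided = stats.ttest_1samp(sample, popmean=0)

# One-sided p-value with the tail chosen from the sign of the sample mean,
# mirroring the tail selection in the loop above
n = sample.shape[0]
if sample.mean() < 0:
    p_one_sided = stats.t.cdf(t_stat, df=n - 1)
else:
    p_one_sided = stats.t.sf(t_stat, df=n - 1)

print(f"t = {t_stat:.3f}")
print(f"two-sided p = {p_two_sided:.4f}, one-sided p = {p_one_sided:.4f}")
```

Read this way, the two-sided p-value for `dominance_prediction_difference` in the table above would be roughly 2 × 0.0657 ≈ 0.131.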


We repeat the same analysis (mean differences, then t-tests) for [**predict_then_explain**](results/predict_features.csv).

In [27]:
df2 = pd.read_csv('results/predict_features.csv')
df_diff = df2[["word_count_prediction_difference", "valence_prediction_difference", "arousal_prediction_difference", "dominance_prediction_difference", "concreteness_prediction_difference", "link_count_prediction_difference"]]
print(df_diff.mean(),"\n")


mean_link_diff = df2.loc[
    df2["link_count_prediction_difference"] != 0, 
    "link_count_prediction_difference"
].mean()
print(f"link_count_prediction_difference for texts that contain a link: {mean_link_diff}")

word_count_prediction_difference     -3.276000
valence_prediction_difference        -0.008216
arousal_prediction_difference         0.004585
dominance_prediction_difference      -0.041168
concreteness_prediction_difference    0.002549
link_count_prediction_difference      0.022000
dtype: float64 

link_count_prediction_difference for texts that contain a link: 0.09090909090909091


In [28]:
results = {}
for col in df_diff.columns:
    #Determine tail direction
    mean_val = df_diff[col].mean()
    tail = 'left' if mean_val < 0 else 'right'

    p_val, t = one_sample_ttest(df_diff[col], mu0=0, tail=tail)
    results[col] = {"t": t, "p_val": p_val}

# Repeat for link counts, keeping only rows whose link-count difference is nonzero
nonzero_links = df2.loc[
    df2["link_count_prediction_difference"] != 0,
    "link_count_prediction_difference"
]

mean_link_diff = nonzero_links.mean()
tail = 'left' if mean_link_diff < 0 else 'right'

p_val, t = one_sample_ttest(nonzero_links, mu0=0, tail=tail)

results["link_count_prediction_difference_nonzero"] = {
    "t": t,
    "p_val": p_val
}

results_df = pd.DataFrame(results).T
print(results_df)

                                                 t     p_val
word_count_prediction_difference         -0.352545  0.362289
valence_prediction_difference            -0.646406  0.259157
arousal_prediction_difference             0.236018  0.406758
dominance_prediction_difference          -2.458374  0.007148
concreteness_prediction_difference        1.629797  0.051888
link_count_prediction_difference          0.333650  0.369392
link_count_prediction_difference_nonzero  0.332718  0.369964
