In [9]:
import zipfile

import pandas as pd
import requests
import statsmodels.api as sm

## <font color="red">*Exercise 1*</font>

<font color="red">Describe 2 separate predictions relevant to your project and associated texts, which involve predicting text that has not been observed based on patterns that have. Then, in a single, short paragraph, describe a research design through which you could use textual features and the tools of classification and regression to evaluate these predictions.

## <font color="red">*Exercise 2*</font>

<font color="red">Propose a simple causal model in your data, or a different causal model in the annotated Internet Arguments Corpus (e.g., a different treatment, a different outcome), and test it using a linear or logistic regression model. If you are using social media data for your final project, we encourage you to classify or annotate a sample of that data (either computationally or with human annotators) and examine the effect of texts on replies to that text (e.g., Reddit posts on Reddit comments, Tweets on Twitter replies, YouTube video transcripts on YouTube comments or ratings). You do not need to make a graph of the causal model, but please make it clear (e.g., "X affects Y, and C affects both X and Y.").
    
<font color="red">Also consider using the [ConvoKit datasets](https://convokit.cornell.edu/documentation/datasets.html)! Anytime there is conversation, there is an opportunity to explore the effects of early parts of the conversation on later parts. We will explore this further in Week 8 on Text Generation and Conversation.
    
<font color="red">***Stretch*** (not required): Propose a more robust identification strategy using matching, difference-in-differences, regression discontinuity, or an instrumental variable. Each of these methods usually gives you a more precise identification of the causal effect than an unconditional regression. Scott Cunningham's [Causal Inference: The Mixtape](https://mixtape.scunning.com/) is a free textbook on these topics, and all of them have good YouTube video explanations.

In [5]:
# This takes about 5 minutes to execute
# The `https` link does not work, so there's no need to change it
url = "http://nldslab.soe.ucsc.edu/iac/iac_v1.1.zip"

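req = requests.get(url)
# Fail loudly on an HTTP error instead of saving an error page as the zip
req.raise_for_status()

# Save the archive under its original name (the last URL segment)
filename = url.split("/")[-1]
with open(filename, "wb") as output_file:
    output_file.write(req.content)
print("Downloaded file: " + url)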

Downloaded file: http://nldslab.soe.ucsc.edu/iac/iac_v1.1.zip


In [6]:
# Zip to DataFrame
with zipfile.ZipFile("iac_v1.1.zip") as z:
    with z.open(
        "iac_v1.1/data/fourforums/annotations/mechanical_turk/qr_averages.csv"
    ) as f:
        qr = pd.read_csv(f)

    with z.open(
        "iac_v1.1/data/fourforums/annotations/mechanical_turk/qr_meta.csv"
    ) as f:
        md = pd.read_csv(f)

In [8]:
# Creating the DataFrame that we'll use for the causal test
pairs = qr.merge(md, how="inner", on="key")
pairs = pairs[~pairs.quote_post_id.isnull() & ~pairs.response_post_id.isnull()]

triples = pairs.merge(
    pairs, left_on="response", right_on="quote", how="inner", suffixes=("_r1", "_r2")
)

# Rename and reorder columns
triples = triples.rename(
    columns={"quote_r1": "quote", "quote_r2": "response1", "response_r2": "response2"}
)
triples = triples.drop(columns=["response_r1"])
front_columns = [
    "quote",
    "response1",
    "response2",
    "attack_r1",
    "fact-feeling_r1",
    "nicenasty_r1",
    "sarcasm_r1",
    "agreement_r2",
]
triples = triples.dropna(subset=front_columns)
triples = triples[front_columns].join(triples.drop(columns=front_columns))
triples.head()

| | quote | response1 | response2 | attack_r1 | fact-feeling_r1 | nicenasty_r1 | sarcasm_r1 | agreement_r2 | key_r1 | discussion_id_x_r1 | ... | questioning-asserting_unsure_r2 | sarcasm_r2 | sarcasm_unsure_r2 | discussion_id_y_r2 | response_post_id_r2 | quote_post_id_r2 | term_r2 | task1 num annot_r2 | task2 num annot_r2 | task2 num disagree_r2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I remember looking at the classic evolutionary... | Why do you find it necessary to fit observatio... | Evolution has no goals, it is merely a beautif... | 0.333333 | 0.333333 | 0.666667 | 0.2 | -2.833333 | (731, 1) | 6032 | ... | 0.0 | 0.0 | 0.166667 | 6032 | 149673 | 149609.0 | | 6 | 5 | 2 |
| 1 | What is the fun in that? | Seriously? Well, I come here hoping for someth... | nah, I was just poking fun because I can! Pers... | -0.6 | -2.2 | 0.0 | 0.0 | -2.166667 | (697, 2) | 5205 | ... | 0.0 | 0.2 | 0.166667 | 5205 | 123129 | 122800.0 | | 6 | 5 | 2 |
| 2 | First off, the scientific method goes:\n \n 1)... | You guys know me. Always happy to correct anyo... | Ah, thanks for the correction, although there ... | 2.4 | 2.8 | 2.2 | 0.0 | -0.4 | (9, 0) | 9449 | ... | | 0.0 | 0.4 | 9449 | 247243 | 247240.0 | No terms in first 10 | 5 | 7 | 0 |
| 3 | You can ignore the obvious question. This is w... | Actually what they are really doing is ignorin... | Really, then show me how I'm wrong - without d... | -3.5 | -3.166667 | -3.166667 | 0.166667 | -1.833333 | (1077, 1) | 3467 | ... | 0.0 | 0.2 | 0.166667 | 3467 | 73783 | 73741.0 | really | 6 | 5 | 2 |
| 4 | Its really sad what these gay predator priests... | Homosexuals are attracted to adults of the sam... | Homosexuals are attracted to people of the sam... | 0.166667 | 2.166667 | -0.166667 | 0.4 | -2.666667 | (611, 0) | 4337 | ... | 0.0 | 0.0 | 0.0 | 4337 | 112012 | 112008.0 | No terms in first 10 | 6 | 7 | 4 |


In [17]:
ex_y = triples["agreement_r2"]
ex_x_cols = ["fact-feeling_unsure_r1"]
ex_x = sm.add_constant(triples[ex_x_cols])

ex_lm = sm.OLS(ex_y, ex_x).fit()
print(ex_lm.summary())

                            OLS Regression Results                            
Dep. Variable:           agreement_r2   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.012
Method:                 Least Squares   F-statistic:                     17.95
Date:                Tue, 20 Feb 2024   Prob (F-statistic):           2.43e-05
Time:                        22:54:45   Log-Likelihood:                -2572.2
No. Observations:                1340   AIC:                             5148.
Df Residuals:                    1338   BIC:                             5159.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                     -1

**Thoughts:** In the above OLS linear regression, I test whether the first
response's fact-versus-feeling annotation has a causal effect on the second
response's agreement score. Note that the regressor is `fact-feeling_unsure_r1`,
which appears to capture annotator uncertainty about the fact/feeling rating
rather than the rating itself (`fact-feeling_r1`), so the result speaks most
directly to that uncertainty measure. The relationship is statistically
significant (Prob (F-statistic) = 2.43e-05, below the .001 level). However, it
accounts for little of the variation in `agreement_r2`: the Adjusted R-squared
is 0.012, so `fact-feeling_unsure_r1` explains only about 1.2% of the variation.
Because this is an unconditional regression on observational data, the
coefficient is better read as an association than as an identified causal effect.

## <font color="red">*Exercise 3*</font>

<font color="red">Propose a measure you could generate to fill in or improve upon the simple causal model you proposed above and how you would split the data (e.g., a % of your main data, a separate-but-informative dataset). You do not have to produce the measure.
    
<font color="red">***Stretch*** (not required): Produce the measure and integrate it into your statistical analysis. This could be a great approach for your final project!

The recommendation in the example would be useful here as well. Given that
there are a multitude of pairs that do not belong to the triples collection, I
would use that split: I would test my model improvement, described below, on
the pairs that fall outside the triples and, if successful, validate it on the
triples.
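
One possible way to carve out that split is sketched below on a tiny hand-made frame. The names `toy_pairs`, `toy_triples`, `used_keys`, and `held_out_pairs` are hypothetical, and the toy rows are invented; the only thing borrowed from the notebook is the self-merge of `response` onto `quote` with the `_r1`/`_r2` suffixes. The assumption here is that a pair "belongs to the triples" if its `key` appears on either side of a triple.

```python
import pandas as pd

# Hand-made stand-in for `pairs` (hypothetical rows, not IAC data)
toy_pairs = pd.DataFrame(
    {
        "key": ["a", "b", "c", "d", "e"],
        "quote": ["q1", "p1", "q3", "q4", "p4"],
        "response": ["p1", "p2", "p3", "p4", "p5"],
    }
)

# Same self-merge as the notebook: a pair's response is the next pair's quote
toy_triples = toy_pairs.merge(
    toy_pairs, left_on="response", right_on="quote", how="inner", suffixes=("_r1", "_r2")
)

# Keys of every pair that participates in some triple, on either side
used_keys = set(toy_triples["key_r1"]) | set(toy_triples["key_r2"])

# Pairs that never appear in a triple form the separate split
held_out_pairs = toy_pairs[~toy_pairs["key"].isin(used_keys)]
print(held_out_pairs)
```

Splitting on `key` rather than on row position keeps the two sets disjoint even if the frames are later re-sorted or filtered.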

To improve the model, I would add a length covariate, `len(r1.split())` (the
word count of the first response), in an attempt to test the assumption that
feeling statements are often short and curt, whereas factual statements tend to
be more verbose. My intuition is that adding this first-response length
covariate would improve the Adjusted R-squared so that the model accounts for
more than 1.2% of the variation.

## <font color="red">*Exercise 4*</font>

<font color="red">Propose a mediation model related to the simple causal model you proposed above (ideally on the dataset you're using for your final project). If you have measures for each variable in the model, run the analysis: You can just copy the "Mediation analysis" cell above and replace with your variables. If you do not have measures, do not run the analysis, but be clear as to the effect(s) you would like to estimate and the research design you would use to test them.

## <font color="red">*Exercise 5*</font>

<font color="red">Pick one other paper on causal inference with text from the ["Papers about Causal Inference and Language" GitHub repository](https://github.com/causaltext/causal-text-papers). Write at least three sentences summarizing the paper and its logic of design in your own words.
    
<font color="red">***Stretch*** (not required): Skim a few more papers. The causal world is your textual oyster!

**Title:** A Survey of Online Hate Speech through the Causal Lens ([Link](https://arxiv.org/pdf/2109.08120.pdf))

**Summary:** Using a causal lens, the paper looks into the historical academic basis, reasoning, and
complications associated with studying online hate speech. The authors focus on describing studies
grouped into three specific causal fields: digital misbehaviors versus the physical world, harmful content
versus the individual, and the effect of interventions on individuals. Lastly, the authors discuss the
prevailing difficulties that arise where causal research methods and natural language
processing converge (confounding bias, linguistic representations, and causal transportability) and provide
suggestions to help future researchers navigate these obstacles.