# SemEval-2023 Task 8: Causal Claim Identification and PIO Frame Extraction
[SemEval-2023 Task 8](https://causalclaims.github.io/)
## Subtask 1: Causal claim identification
Given a snippet of text, the first subtask is to identify the spans that are a claim, an experience, an experience-based claim, or a question. The four categories are defined as follows:
* Claim: Communicates a causal interaction between an intervention and an outcome.
* Experience: Relates a specific outcome/symptom to an intervention or population based on someone's experience.
* Experience-based claim: A claim based on someone's experience.
* Question: Poses a question.


# Preview Data

The data files are in CSV format.


In [2]:
# !pip install pandas

In [3]:
import pandas as pd
semeval_train_df = pd.read_csv("../st1_public_data/st1_train_inc_text.csv")
semeval_test_df = pd.read_csv("../st1_public_data/st1_test_inc_text.csv")

In [4]:
semeval_train_df.head()

Unnamed: 0,post_id,subreddit_id,stage1_labels,text
0,s1jpia,t5_2s23e,"[{""crowd-entity-annotation"":{""entities"":[{""end...",De-Nial\nI wrote this a few years ago and just...
1,re7owg,t5_2s1h9,"[{""crowd-entity-annotation"":{""entities"":[{""end...",Had a seizure for the first time in ten years\...
2,r0nm2r,t5_2syer,"[{""crowd-entity-annotation"":{""entities"":[{""end...",How long does it take to drop UA through diet/...
3,qmkuzj,t5_2rtve,"[{""crowd-entity-annotation"":{""entities"":[{""end...",Im wondering what is the average age yall were...
4,sn5sqd,t5_2s3g1,"[{""crowd-entity-annotation"":{""entities"":[{""end...",Has anyone taken amitriptyline and found it to...


In [5]:
semeval_train_df.tail()

Unnamed: 0,post_id,subreddit_id,stage1_labels,text
5690,sypylg,t5_2saq9,"[{""crowd-entity-annotation"":{""entities"":[{""end...",CPR Training Concerns\nI have to do CPR traini...
5691,sivvy6,t5_2r876,"[{""crowd-entity-annotation"":{""entities"":[{""end...",Has anyone in the US started going back to wor...
5692,s6z4ik,t5_2s23e,"[{""crowd-entity-annotation"":{""entities"":[{""end...",What is your strategy when you have relapse of...
5693,rdbdw7,t5_2s1h9,"[{""crowd-entity-annotation"":{""entities"":[]}}]",Curious about other people's experience in the...
5694,ri83g1,t5_2r876,"[{""crowd-entity-annotation"":{""entities"":[{""end...",How effective are the gene modulators in treat...


In [6]:
semeval_train_df.dtypes

post_id          object
subreddit_id     object
stage1_labels    object
text             object
dtype: object

In [7]:
semeval_train_df.describe()

Unnamed: 0,post_id,subreddit_id,stage1_labels,text
count,5695,5695,5695,5695
unique,5695,9,4911,5270
top,s1jpia,t5_2s3g1,"[{""crowd-entity-annotation"":{""entities"":[]}}]",[deleted by user]\n[removed]
freq,1,724,557,363


## Columns of the dataframe
* post_id, subreddit_id: the IDs used to retrieve the Reddit post
* stage1_labels: the annotations containing the causal claim spans we need to identify
    - label: the label describing the span, one of ["per_exp", "claim", "claim_per_exp", "question"]
    - startOffset: start index of the labeled span
    - endOffset: end index of the labeled span
* text: the Reddit post text
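
As a rough illustration of how the offsets relate to the text, the sketch below parses a `stage1_labels`-style string and slices out the labeled spans. The toy text, its offsets, and the helper name `extract_spans` are made up for illustration. It also assumes that the offsets are character indices and that `endOffset` is exclusive, which the task description quoted here does not state.

```python
import json

def extract_spans(labels_str, text):
    """Return (label, span_text) pairs for one post.

    Assumes the stage1_labels layout shown in this notebook and that
    endOffset is exclusive (an assumption, not confirmed by the source).
    """
    annotations = json.loads(labels_str)
    entities = annotations[0]["crowd-entity-annotation"]["entities"]
    return [(e["label"], text[e["startOffset"]:e["endOffset"]]) for e in entities]

# Toy example (hypothetical text and offsets, not taken from the dataset)
toy_text = "Does X help? X cured my cough."
toy_labels = json.dumps([{"crowd-entity-annotation": {"entities": [
    {"startOffset": 0, "endOffset": 12, "label": "question"},
    {"startOffset": 13, "endOffset": 30, "label": "per_exp"},
]}}])

spans = extract_spans(toy_labels, toy_text)
for label, span in spans:
    print(label, "->", repr(span))
```

Printing a few real spans this way is a quick check on whether the offsets line up with sentence boundaries in the downloaded text.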


In [8]:
semeval_train_df.columns

Index(['post_id', 'subreddit_id', 'stage1_labels', 'text'], dtype='object')

In [9]:
semeval_train_df["stage1_labels"][0]

'[{"crowd-entity-annotation":{"entities":[{"endOffset":858,"label":"per_exp","startOffset":661},{"endOffset":2213,"label":"per_exp","startOffset":1861},{"endOffset":2407,"label":"per_exp","startOffset":2255},{"endOffset":3254,"label":"claim_per_exp","startOffset":2697},{"endOffset":3620,"label":"claim_per_exp","startOffset":3294},{"endOffset":3751,"label":"claim_per_exp","startOffset":3621},{"endOffset":4480,"label":"per_exp","startOffset":3752},{"endOffset":4759,"label":"question","startOffset":4482}]}}]'

In [10]:
type(semeval_train_df["stage1_labels"][0])

str

# SemEval Task 8 dataset details

Details from SemEval:
```
We are currently providing a sample dataset and will be releasing the full training data soon. Our dataset is built from Reddit posts and to respect the users' privacy we are not releasing the dataset directly. Instead, we are only providing Reddit post ids, annotations, and a script. Participants can use the script to obtain the data and merge it with the provided annotations. If users choose to delete their post, the script won't be able to get it.
```


In [11]:
semeval_train_df.loc[8]

post_id                                                     saxv8t
subreddit_id                                              t5_2tyg2
stage1_labels    [{"crowd-entity-annotation":{"entities":[{"end...
text                                  [deleted by user]\n[removed]
Name: 8, dtype: object

### Some of the Reddit posts were deleted by their users, so their text could not be retrieved

### train

---
Total number of train posts: 5695  
Number of train posts deleted by users: 678

---

In [12]:
semeval_train_del_df = semeval_train_df.loc[semeval_train_df["text"].str.contains("deleted")]

In [13]:
len(semeval_train_del_df), len(semeval_train_df)

(678, 5695)
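
Note that `str.contains("deleted")` matches any post whose text contains the word "deleted" anywhere, so it may also flag posts that merely mention the word. A stricter check is sketched below. It assumes the `text` column is the title followed by a newline and the body (as the samples suggest), and that a removed body is exactly `[deleted]` or `[removed]`. The helper name `is_deleted` and the toy posts are hypothetical, and the resulting counts may differ from the ones reported in this notebook.

```python
# Body placeholders seen in this notebook's samples (assumed, not exhaustive)
PLACEHOLDER_BODIES = {"[deleted]", "[removed]"}

def is_deleted(text):
    """True when everything after the first newline is only a placeholder."""
    _title, _sep, body = text.partition("\n")
    return body.strip() in PLACEHOLDER_BODIES

# Toy posts (hypothetical, not taken from the dataset)
toy_posts = [
    "[deleted by user]\n[removed]",
    "hm.\n[deleted]",
    "Lost my notes\nI deleted the app and my symptom log went with it.",
]

strict = [is_deleted(t) for t in toy_posts]   # placeholder bodies only
loose = ["deleted" in t for t in toy_posts]   # substring match, as in the cell above
print(strict)
print(loose)
```

On the dataframe this could be applied as `semeval_train_df["text"].apply(is_deleted)` to build a boolean mask.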

In [14]:
semeval_train_del_df.head()

Unnamed: 0,post_id,subreddit_id,stage1_labels,text
8,saxv8t,t5_2tyg2,"[{""crowd-entity-annotation"":{""entities"":[{""end...",[deleted by user]\n[removed]
21,sv4jgm,t5_2syer,"[{""crowd-entity-annotation"":{""entities"":[{""end...",[deleted by user]\n[removed]
25,sm7law,t5_2s3g1,"[{""crowd-entity-annotation"":{""entities"":[{""end...",[deleted by user]\n[removed]
27,sofbm3,t5_2s3g1,"[{""crowd-entity-annotation"":{""entities"":[{""end...",[deleted by user]\n[removed]
34,regj97,t5_2rtve,"[{""crowd-entity-annotation"":{""entities"":[{""end...",[deleted by user]\n[removed]


In [15]:
semeval_train_rm_del_df = pd.merge(semeval_train_df, semeval_train_del_df, indicator=True, how="outer").query('_merge=="left_only"').drop('_merge', axis=1)

In [16]:
semeval_train_rm_del_df

Unnamed: 0,post_id,subreddit_id,stage1_labels,text
0,s1jpia,t5_2s23e,"[{""crowd-entity-annotation"":{""entities"":[{""end...",De-Nial\nI wrote this a few years ago and just...
1,re7owg,t5_2s1h9,"[{""crowd-entity-annotation"":{""entities"":[{""end...",Had a seizure for the first time in ten years\...
2,r0nm2r,t5_2syer,"[{""crowd-entity-annotation"":{""entities"":[{""end...",How long does it take to drop UA through diet/...
3,qmkuzj,t5_2rtve,"[{""crowd-entity-annotation"":{""entities"":[{""end...",Im wondering what is the average age yall were...
4,sn5sqd,t5_2s3g1,"[{""crowd-entity-annotation"":{""entities"":[{""end...",Has anyone taken amitriptyline and found it to...
...,...,...,...,...
5689,st6mwp,t5_2s3g1,"[{""crowd-entity-annotation"":{""entities"":[{""end...",who else cant go 48 hrs without diarrhea??\nev...
5690,sypylg,t5_2saq9,"[{""crowd-entity-annotation"":{""entities"":[{""end...",CPR Training Concerns\nI have to do CPR traini...
5691,sivvy6,t5_2r876,"[{""crowd-entity-annotation"":{""entities"":[{""end...",Has anyone in the US started going back to wor...
5693,rdbdw7,t5_2s1h9,"[{""crowd-entity-annotation"":{""entities"":[]}}]",Curious about other people's experience in the...
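
The merge-based filtering used above can probably be replaced by a plain boolean mask, which avoids merging on every column and keeps the original index labels. A minimal sketch on a toy frame (the rows and variable names are hypothetical), comparing both approaches:

```python
import pandas as pd

# Toy frame (hypothetical rows) standing in for semeval_train_df
toy_df = pd.DataFrame({
    "post_id": ["a1", "b2", "c3", "d4"],
    "text": ["Title\nbody", "[deleted by user]\n[removed]", "hm.\n[deleted]", "Another\npost"],
})

deleted_mask = toy_df["text"].str.contains("deleted")

# Keep the rows that do not match; the original index labels are preserved
kept_by_mask = toy_df.loc[~deleted_mask]

# Merge-based anti-join as used in this notebook, for comparison
kept_by_merge = (
    pd.merge(toy_df, toy_df.loc[deleted_mask], indicator=True, how="outer")
    .query('_merge == "left_only"')
    .drop("_merge", axis=1)
)

print(kept_by_mask["post_id"].tolist())
print(sorted(kept_by_merge["post_id"]))
```

Both approaches keep the same posts here; the mask version is shorter and does not depend on how `merge` orders or re-indexes the result.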


### test
---
Total number of test posts: 1424  
Number of test posts deleted by users: 160  

---

In [17]:
semeval_test_del_df = semeval_test_df.loc[semeval_test_df["text"].str.contains("deleted")]

In [18]:
len(semeval_test_del_df), len(semeval_test_df)

(160, 1424)

In [19]:
semeval_test_del_df.head()

Unnamed: 0,post_id,subreddit_id,text
6,s5lfpx,t5_2r876,[deleted by user]\n[removed]
51,swux97,t5_2s3g1,"Losing weight, unable to eat, WHAT DO I DO\n[d..."
54,so7958,t5_2s3g1,Is not going for a day or two and then going a...
69,sfyi1a,t5_2tyg2,hm.\n[deleted]
106,qng86f,t5_2rtve,masks and face shields usage\n[deleted]


In [20]:
semeval_test_rm_del_df = pd.merge(semeval_test_df, semeval_test_del_df, indicator=True, how="outer").query('_merge=="left_only"').drop('_merge', axis=1)

In [21]:
semeval_test_rm_del_df

Unnamed: 0,post_id,subreddit_id,text
0,pwns5j,t5_2r876,Speeding fine\nI know this could be my anxiety...
1,rm5t18,t5_2qlaa,Food Recipe Wednesday\n**This is your** r/GERD...
2,paw773,t5_2syer,Trigger Foods & Diet\nUnderstanding medication...
3,sk2zb2,t5_2s3g1,Any quick relief remedies for the stomach cram...
4,scofpi,t5_2saq9,Canker sores\nAnyone else get them often ? And...
...,...,...,...
1419,rn4kqy,t5_2rtve,Is there someone whose job it is to help peopl...
1420,swwjjc,t5_2s3g1,Something different is happening..\nI've had I...
1421,p9c2hv,t5_2syer,First 2 weeks on allopurinol\nHi guys. Im a 26...
1422,ripncf,t5_2s1h9,Seizure After Booster\nNot saying the booster ...


## Change stage1_labels type from str to list

Use ast.literal_eval() to convert the string representation of each list into an actual Python list.

In [22]:
import ast

semeval_train_rm_del_df["stage1_labels"] = semeval_train_rm_del_df["stage1_labels"].apply(ast.literal_eval)

In [23]:
type(semeval_train_rm_del_df["stage1_labels"][0])


list
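
Since the `stage1_labels` strings look like JSON, `json.loads` is a possible alternative to `ast.literal_eval`. It also accepts JSON literals such as `true`, `false`, and `null`, which `ast.literal_eval` rejects. The sketch below uses a shortened version of the label string shown earlier; whether the full dataset contains such literals is not known from this notebook.

```python
import ast
import json

# Shortened version of the first training label string shown in this notebook
raw = '[{"crowd-entity-annotation":{"entities":[{"endOffset":858,"label":"per_exp","startOffset":661}]}}]'

parsed_json = json.loads(raw)
parsed_ast = ast.literal_eval(raw)

# Both parsers give the same Python structure for this string
print(parsed_json == parsed_ast)
print(type(parsed_json).__name__)

# JSON-only literals are where the two parsers differ
print(json.loads('{"flag": true, "value": null}'))
try:
    ast.literal_eval('{"flag": true, "value": null}')
except ValueError as err:
    print("literal_eval failed:", type(err).__name__)
```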

In [24]:
semeval_train_rm_del_df

Unnamed: 0,post_id,subreddit_id,stage1_labels,text
0,s1jpia,t5_2s23e,[{'crowd-entity-annotation': {'entities': [{'e...,De-Nial\nI wrote this a few years ago and just...
1,re7owg,t5_2s1h9,[{'crowd-entity-annotation': {'entities': [{'e...,Had a seizure for the first time in ten years\...
2,r0nm2r,t5_2syer,[{'crowd-entity-annotation': {'entities': [{'e...,How long does it take to drop UA through diet/...
3,qmkuzj,t5_2rtve,[{'crowd-entity-annotation': {'entities': [{'e...,Im wondering what is the average age yall were...
4,sn5sqd,t5_2s3g1,[{'crowd-entity-annotation': {'entities': [{'e...,Has anyone taken amitriptyline and found it to...
...,...,...,...,...
5689,st6mwp,t5_2s3g1,[{'crowd-entity-annotation': {'entities': [{'e...,who else cant go 48 hrs without diarrhea??\nev...
5690,sypylg,t5_2saq9,[{'crowd-entity-annotation': {'entities': [{'e...,CPR Training Concerns\nI have to do CPR traini...
5691,sivvy6,t5_2r876,[{'crowd-entity-annotation': {'entities': [{'e...,Has anyone in the US started going back to wor...
5693,rdbdw7,t5_2s1h9,[{'crowd-entity-annotation': {'entities': []}}],Curious about other people's experience in the...


In [25]:
semeval_test_rm_del_df

Unnamed: 0,post_id,subreddit_id,text
0,pwns5j,t5_2r876,Speeding fine\nI know this could be my anxiety...
1,rm5t18,t5_2qlaa,Food Recipe Wednesday\n**This is your** r/GERD...
2,paw773,t5_2syer,Trigger Foods & Diet\nUnderstanding medication...
3,sk2zb2,t5_2s3g1,Any quick relief remedies for the stomach cram...
4,scofpi,t5_2saq9,Canker sores\nAnyone else get them often ? And...
...,...,...,...
1419,rn4kqy,t5_2rtve,Is there someone whose job it is to help peopl...
1420,swwjjc,t5_2s3g1,Something different is happening..\nI've had I...
1421,p9c2hv,t5_2syer,First 2 weeks on allopurinol\nHi guys. Im a 26...
1422,ripncf,t5_2s1h9,Seizure After Booster\nNot saying the booster ...


## Number of posts after removing deleted posts
train: 5017  
test: 1264

# Visualize Data with labels to get more info from subtask 8.1

## Using colored HTML to visualize the different labels


In [26]:
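import html
from IPython.display import HTML as html_print

# Hex color used for each label type
LABEL_COLORS = {"claim": "#FF431B", "per_exp": "#46980D", "claim_per_exp": "#0D6598", "question": "#960B98"}

def cstr(s, color="darkgray"):
    """Wrap an HTML fragment in a span with the given default text color."""
    return '<span style="color:{}">{}</span>'.format(color, s)

# Note: this name shadows IPython's built-in display() inside the notebook
def display(df_row):
    """Return the post text as an HTML string with each labeled span colored."""
    labels = df_row["stage1_labels"][0]["crowd-entity-annotation"]["entities"]
    text = df_row["text"]
    starts = {label["startOffset"]: label["label"] for label in labels}
    ends = {label["endOffset"] for label in labels}
    parts = []
    for idx, char in enumerate(text):
        if idx in ends:  # close a span ending here before opening the next one
            parts.append("</span>")
        if idx in starts:
            parts.append('<span style="color:{}">'.format(LABEL_COLORS[starts[idx]]))
        parts.append(html.escape(char))
    if len(text) in ends:  # a span that runs to the very end of the text
        parts.append("</span>")
    return "".join(parts)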

## Color used for each label
* claim: Orange
* per_exp: Green
* claim_per_exp: Blue
* question: Purple

In [27]:
data_id = 0
html_print(cstr(display(semeval_train_rm_del_df.loc[data_id])))



In [28]:
semeval_train_rm_del_df.loc[data_id]["stage1_labels"]

[{'crowd-entity-annotation': {'entities': [{'endOffset': 858,
     'label': 'per_exp',
     'startOffset': 661},
    {'endOffset': 2213, 'label': 'per_exp', 'startOffset': 1861},
    {'endOffset': 2407, 'label': 'per_exp', 'startOffset': 2255},
    {'endOffset': 3254, 'label': 'claim_per_exp', 'startOffset': 2697},
    {'endOffset': 3620, 'label': 'claim_per_exp', 'startOffset': 3294},
    {'endOffset': 3751, 'label': 'claim_per_exp', 'startOffset': 3621},
    {'endOffset': 4480, 'label': 'per_exp', 'startOffset': 3752},
    {'endOffset': 4759, 'label': 'question', 'startOffset': 4482}]}}]
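
To get a quick overview of what a post contains, the entities can be tallied per label and their span lengths computed from the offsets. A minimal sketch using the entities printed above (copied by hand; the variable names are illustrative):

```python
from collections import Counter

# Entities of the first training post, copied from the output above
entities = [
    {"endOffset": 858, "label": "per_exp", "startOffset": 661},
    {"endOffset": 2213, "label": "per_exp", "startOffset": 1861},
    {"endOffset": 2407, "label": "per_exp", "startOffset": 2255},
    {"endOffset": 3254, "label": "claim_per_exp", "startOffset": 2697},
    {"endOffset": 3620, "label": "claim_per_exp", "startOffset": 3294},
    {"endOffset": 3751, "label": "claim_per_exp", "startOffset": 3621},
    {"endOffset": 4480, "label": "per_exp", "startOffset": 3752},
    {"endOffset": 4759, "label": "question", "startOffset": 4482},
]

# Number of spans per label in this post
label_counts = Counter(e["label"] for e in entities)

# Length of each span, computed as endOffset - startOffset
span_lengths = [(e["label"], e["endOffset"] - e["startOffset"]) for e in entities]

print(label_counts)
print(span_lengths)
```

The same tally could be run over every row of `semeval_train_rm_del_df` to see how the four labels are distributed across the training set.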