# Exploratory data analysis: feasibility study for method2 step3 (relation classification)

This analysis is run after executing the following steps in order:
1. defi_textmine_2025/method2/data/tag_entity_in_texts.py
````sh
python -m defi_textmine_2025.method2.data.tag_entity_in_texts
````
2. defi_textmine_2025/method2/data/reduce_texts.py
````sh
# test data (unlabeled)
python -m defi_textmine_2025.method2.data.reduce_texts 7 data/defi-text-mine-2025/interim/entity_bracket_tagging_dataset/test/ data/defi-text-mine-2025/interim/reduced_text_w_entity_bracket/test/

# train data (labeled)
python -m defi_textmine_2025.method2.data.reduce_texts 14 data/defi-text-mine-2025/interim/entity_bracket_tagging_dataset/train/ data/defi-text-mine-2025/interim/reduced_text_w_entity_bracket/train/
````
3. defi_textmine_2025/method2/evaluation/split_labeled_dataset.py
````sh
python -m defi_textmine_2025.method2.evaluation.split_labeled_dataset 5
````

How well can we split the classification across 2 or 3 classifiers?
- a binary classifier to identify the gender of a person
- a multiclass (single-label) classifier to identify relations that never co-occur with others
- a multilabel classifier to identify relations that often co-occur with others

**Hypothesis**: Separating the relations this way might reduce the confusion rate between relations and thus improve the macro-F1, since that metric requires a good score on every single label.
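
To make the role of macro-F1 concrete, here is a minimal pure-Python sketch of the metric for multilabel predictions. The helper names `per_label_f1` and `macro_f1` and the toy data are illustrative assumptions, not part of the project code. It shows that one rare label that is never predicted pulls the average down as much as a frequent one would.

```python
def per_label_f1(y_true, y_pred, labels):
    """Return {label: F1} for multilabel data given as lists of label sets."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never appears
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-label F1 scores."""
    scores = per_label_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Toy data (hypothetical, not taken from the dataset)
y_true = [{"IS_LOCATED_IN"}, {"IS_LOCATED_IN"}, {"IS_LOCATED_IN"}, {"HAS_COLOR"}]
y_pred = [{"IS_LOCATED_IN"}, {"IS_LOCATED_IN"}, {"IS_LOCATED_IN"}, {"IS_LOCATED_IN"}]
labels = ["IS_LOCATED_IN", "HAS_COLOR"]

scores = per_label_f1(y_true, y_pred, labels)
macro = macro_f1(y_true, y_pred, labels)
print(scores, round(macro, 3))
```

In this toy case three of four examples are predicted correctly, yet the macro-F1 stays below 0.5 because `HAS_COLOR` scores 0.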

# Imports

In [1]:
import logging
logging.basicConfig(
     level=logging.INFO, 
     format= '[%(asctime)s|%(levelname)s|%(module)s.py:%(lineno)s] %(message)s',
     datefmt='%H:%M:%S'
 )
import json
import pandas as pd
from defi_textmine_2025.data.utils import TARGET_COL, EDA_DIR, INTERIM_DIR


FOLD_NUM = 1
logging.info(f"{FOLD_NUM=}")
TASK_NAME = "hasrelation"
logging.info(f"{TASK_NAME=}")
STEP1_TASK_TARGET_COL = f"{TASK_NAME}_label"
logging.info(f"{STEP1_TASK_TARGET_COL=}")
TASK_INPUT_COL = "input_text"

USED_COLUMNS = ["text_index", "e1_id", "e2_id", "e1_type", "e2_type", TARGET_COL, TASK_INPUT_COL, STEP1_TASK_TARGET_COL]
logging.info(f"{USED_COLUMNS=}")

def load_preprocessed_data(parquet_path: str) -> pd.DataFrame:
    return pd.read_parquet(parquet_path)

def format_relations_str_to_list(labels_as_str: str) -> list[str]:
    return json.loads(
        labels_as_str.replace("{", "[").replace("}", "]").replace("'", '"')
    )  if pd.notnull(labels_as_str) else []

[10:30:12|INFO|2387107469.py:13] FOLD_NUM=1
[10:30:12|INFO|2387107469.py:15] TASK_NAME='hasrelation'
[10:30:12|INFO|2387107469.py:17] STEP1_TASK_TARGET_COL='hasrelation_label'
[10:30:12|INFO|2387107469.py:21] USED_COLUMNS=['text_index', 'e1_id', 'e2_id', 'e1_type', 'e2_type', 'relations', 'input_text', 'hasrelation_label']
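
The helper `format_relations_str_to_list` above turns the stringified label collection into a Python list by rewriting it as JSON. As a hedged alternative sketch, `ast.literal_eval` can parse the same strings without character replacement. The function name `parse_relations` is hypothetical, and the sketch assumes the column holds Python-literal strings such as `"['A', 'B']"` or `"{'A', 'B'}"`, with missing values for unlabeled pairs.

```python
import ast


def parse_relations(labels_as_str):
    """Parse a stringified list or set of labels; return [] for missing values."""
    if not isinstance(labels_as_str, str):
        # None or NaN: the entity pair carries no relation
        return []
    parsed = ast.literal_eval(labels_as_str)
    if isinstance(parsed, (set, frozenset)):
        # Sets have no stable order, so sort for reproducibility
        return sorted(parsed)
    return list(parsed)


print(parse_relations("['GENDER_MALE', 'GENDER_FEMALE']"))
print(parse_relations("{'B', 'A'}"))
print(parse_relations(None))
```

Unlike the quote-replacement approach, this would also tolerate a label containing a double quote, although no such label appears in the outputs shown here.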


# Load data

In [None]:
train_df = load_preprocessed_data(f"{INTERIM_DIR}/train-fold{FOLD_NUM}-mth2.parquet")[USED_COLUMNS]
train_df.head(2)

In [3]:
val_df = load_preprocessed_data(f"data/defi-text-mine-2025/interim/validation-fold{FOLD_NUM}-mth2.parquet")[USED_COLUMNS]
val_df.head(2)

Unnamed: 0,text_index,e1_id,e2_id,e1_type,e2_type,relations,input_text,hasrelation_label
3,2576,0,2,GATHERING,CBRN_EVENT,['HAS_CONSEQUENCE'],"Au milieu de l’{ interview }, un incendie est ...",1
6,2576,3,0,CIVILIAN,GATHERING,,"Le matin du 10 janvier 2010, { Arthur } et Jac...",0


In [4]:
test_df = load_preprocessed_data("data/defi-text-mine-2025/interim/test-preprocessed_for_mth2.parquet").drop(["reduced_text"], axis=1)
test_df.head(2)

Unnamed: 0,text_index,e1_id,e2_id,e1_type,e2_type,text,relations,input_text
0,51344,0,1,FIRE,PLACE,Un { incendie } a eu lieu hier matin au [ Port...,,Un { incendie } a eu lieu hier matin au [ Port...
1,51344,1,1,PLACE,PLACE,Un incendie a eu lieu hier matin au < Portugal...,,Un incendie a eu lieu hier matin au < Portugal >.


In [5]:
df = pd.concat([train_df, val_df], axis=0)
df.shape

(122044, 8)

# Number of examples per relation co-occurrence

In [7]:
n_examples_per_relation_coocurrences_df = df[TARGET_COL].value_counts()
n_examples_per_relation_coocurrences_df.to_csv(f"{EDA_DIR}/n_examples_per_relation_coocurrences.csv")
n_examples_per_relation_coocurrences_df[n_examples_per_relation_coocurrences_df <=5 ]

relations
['IS_PART_OF', 'CREATED']                                                                   5
['HAS_FAMILY_RELATIONSHIP', 'IS_AT_ODDS_WITH', 'IS_IN_CONTACT_WITH']                        5
['HAS_FAMILY_RELATIONSHIP', 'HAS_CONTROL_OVER', 'IS_IN_CONTACT_WITH']                       4
['HAS_FAMILY_RELATIONSHIP', 'IS_IN_CONTACT_WITH', 'IS_COOPERATING_WITH']                    4
['IS_BORN_IN', 'RESIDES_IN', 'IS_LOCATED_IN']                                               3
['HAS_CONTROL_OVER', 'IS_IN_CONTACT_WITH', 'IS_COOPERATING_WITH']                           3
['HAS_CONTROL_OVER', 'IS_BORN_IN']                                                          3
['STARTED_IN', 'OPERATES_IN', 'IS_LOCATED_IN']                                              3
['HAS_CONTROL_OVER', 'IS_COOPERATING_WITH']                                                 2
['HAS_FAMILY_RELATIONSHIP', 'IS_AT_ODDS_WITH']                                              2
['IS_AT_ODDS_WITH', 'IS_IN_CONTACT_WITH', 'IS_COOP

In [8]:
n_examples_per_relation_coocurrences_df[n_examples_per_relation_coocurrences_df <=5 ].sum() /n_examples_per_relation_coocurrences_df.sum()

0.002008032128514056
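
The ratio above is the share of labeled examples whose label combination appears at most 5 times. Note that `value_counts` drops missing values by default, so unlabeled pairs are excluded from the denominator. The same computation can be sketched without pandas as follows. The function name `rare_combination_share` and the toy data are illustrative assumptions.

```python
from collections import Counter


def rare_combination_share(label_combinations, max_count=5):
    """Share of labeled examples whose label combination occurs <= max_count times."""
    # Ignore unlabeled examples, mirroring value_counts' default dropna=True
    counts = Counter(c for c in label_combinations if c is not None)
    total = sum(counts.values())
    rare = sum(n for n in counts.values() if n <= max_count)
    return rare / total if total else 0.0


# Toy data (hypothetical): 8 frequent, 2 rare, 5 unlabeled examples
toy = ["['A']"] * 8 + ["['A', 'B']"] * 2 + [None] * 5
share = rare_combination_share(toy)
print(share)
```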

## Check some examples of single co-occurrences

In [46]:
df.query(f""" {TARGET_COL}=="['GENDER_MALE', 'GENDER_FEMALE']" """)

Unnamed: 0,text_index,e1_id,e2_id,e1_type,e2_type,relations,input_text,hasrelation_label
51128,157,7,7,CIVILIAN,CIVILIAN,"['GENDER_MALE', 'GENDER_FEMALE']",Le bilan fait état d'une < personne > gravemen...,1


In [48]:
df.query(f""" {TARGET_COL}=="['GENDER_MALE', 'IS_IN_CONTACT_WITH']" """).iloc[0]["input_text"]

'Un jour, une vidéo filmant la décapitation d’un < homme > cagoulé a été diffusée à la télévision. Il s’agissait du Général < Martin Kumba >, reconnue grâce à une montre de marque qu’< il > avait à la main.'

# Get the co-occurring labels for each label

In [9]:
label2coocurrentlabels = {}
for labels in [format_relations_str_to_list(s) for s in n_examples_per_relation_coocurrences_df.index.to_list()]:
    for label in labels:
        if label not in label2coocurrentlabels:
            label2coocurrentlabels[label] = []
        for coocurrentlabel in labels:
            if label != coocurrentlabel:
                label2coocurrentlabels[label].append(coocurrentlabel)
len(label2coocurrentlabels)

37
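
The dictionary built above stores one list entry per distinct label combination, so the same co-occurring label can appear many times. A more compact variant, sketched here with `collections.Counter`, keeps one count per co-occurring label. The function name `count_cooccurrences` and the toy combinations are assumptions for illustration. Like the cell above, it counts distinct combinations, not examples.

```python
from collections import Counter, defaultdict


def count_cooccurrences(label_combinations):
    """Map each label to a Counter of the labels it appears with."""
    cooccurrences = defaultdict(Counter)
    for labels in label_combinations:
        for label in labels:
            # Touch the key so labels that always occur alone are registered too
            _ = cooccurrences[label]
            for other in labels:
                if other != label:
                    cooccurrences[label][other] += 1
    return cooccurrences


# Toy distinct combinations (hypothetical)
combos = [["A"], ["A", "B"], ["A", "B", "C"], ["D"]]
co = count_cooccurrences(combos)
lonely = [label for label, counter in co.items() if not counter]
print({label: dict(counter) for label, counter in co.items()})
print(lonely)
```

Labels whose counter is empty are exactly the ones that never co-occur with another label.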

In [12]:
print(json.dumps(label2coocurrentlabels, indent=4))
# print(label2coocurrentlabels)

{
    "IS_LOCATED_IN": [
        "STARTED_IN",
        "HAS_CONTROL_OVER",
        "RESIDES_IN",
        "HAS_CONTROL_OVER",
        "RESIDES_IN",
        "IS_BORN_IN",
        "HAS_CONTROL_OVER",
        "IS_BORN_IN",
        "IS_BORN_IN",
        "RESIDES_IN",
        "STARTED_IN",
        "OPERATES_IN",
        "OPERATES_IN"
    ],
    "HAS_CONTROL_OVER": [
        "IS_LOCATED_IN",
        "IS_PART_OF",
        "OPERATES_IN",
        "IS_IN_CONTACT_WITH",
        "CREATED",
        "IS_AT_ODDS_WITH",
        "IS_IN_CONTACT_WITH",
        "IS_AT_ODDS_WITH",
        "RESIDES_IN",
        "IS_LOCATED_IN",
        "IS_PART_OF",
        "IS_IN_CONTACT_WITH",
        "IS_PART_OF",
        "CREATED",
        "IS_BORN_IN",
        "IS_LOCATED_IN",
        "RESIDES_IN",
        "HAS_FAMILY_RELATIONSHIP",
        "IS_IN_CONTACT_WITH",
        "IS_IN_CONTACT_WITH",
        "IS_COOPERATING_WITH",
        "IS_BORN_IN",
        "IS_COOPERATING_WITH",
        "IS_IN_CONTACT_WITH",
        "CREATED

# Analyze relation co-occurrences

In [14]:
def analyze_confusion_potentials(labels_of_interest: list[str], identical_entity: bool=False) -> None:
    n_total_examples = df.shape[0]
    n_total_examples_of_lonely_labels = 0    
    for label in labels_of_interest:
        # label_examples_df = df.query(f""" {TARGET_COL}=="['{label}']" """)
        label_examples_df = df[df[TARGET_COL].astype(str).str.contains(f"'{label}'")]
        n_label_examples = label_examples_df.shape[0]
        print("\n---------------", label, "has", n_label_examples, "examples", "---------------")
        count_per_entity_pair_df = label_examples_df.groupby(["e1_type", "e2_type"])[STEP1_TASK_TARGET_COL].count()
        print("\n( 1 ) Seen pair of entity types \n", count_per_entity_pair_df)
        for e1_type, e2_type in count_per_entity_pair_df.index:
            # print(e1_type, e2_type)
            # print(f"\n* labels in all the dataset with {(e1_type, e2_type)=} \n", df.query(f"e1_type=='{e1_type}' & e2_type=='{e2_type}'")[TARGET_COL].value_counts())
            if identical_entity:
                n_examples_per_label_w_e1_type_and_e2_type_df = df.query(f"e1_type=='{e1_type}' & e2_type=='{e2_type}' & e1_id==e2_id")[TARGET_COL].value_counts()
            else:
                n_examples_per_label_w_e1_type_and_e2_type_df = df.query(f"e1_type=='{e1_type}' & e2_type=='{e2_type}'")[TARGET_COL].value_counts()
            n_labels_w_e1_type_and_e2_type = n_examples_per_label_w_e1_type_and_e2_type_df.shape[0]
            if n_labels_w_e1_type_and_e2_type>1:                
                print(f"\n( 2 ) labels in all the dataset with {(e1_type, e2_type)=} and {identical_entity=} \n", n_examples_per_label_w_e1_type_and_e2_type_df)
            # break
        n_total_examples_of_lonely_labels += n_label_examples
        # break
    print(f"##################################\t{n_total_examples_of_lonely_labels=} i.e. a ratio of {n_total_examples_of_lonely_labels/df[TARGET_COL].dropna().shape[0]:.3f} \t##################################")

# analyze_confusion_potentials(["CREATED"], False)

## Relations that always occur alone

In [15]:
lonely_labels = [lonely_label for lonely_label in label2coocurrentlabels if len(label2coocurrentlabels[lonely_label])==0]
print(f"{len(lonely_labels)} {lonely_labels=}")

16 lonely_labels=['HAS_CATEGORY', 'HAS_CONSEQUENCE', 'HAS_QUANTITY', 'IS_OF_NATIONALITY', 'HAS_COLOR', 'IS_DEAD_ON', 'WEIGHS', 'IS_REGISTERED_AS', 'IS_BORN_ON', 'HAS_FOR_LENGTH', 'WAS_CREATED_IN', 'WAS_DISSOLVED_IN', 'HAS_FOR_WIDTH', 'HAS_FOR_HEIGHT', 'HAS_LONGITUDE', 'HAS_LATITUDE']


In [16]:
analyze_confusion_potentials(lonely_labels, False)


--------------- HAS_CATEGORY has 894 examples ---------------

( 1 ) Seen pair of entity types 
 e1_type                e2_type 
CIVILIAN               CATEGORY    823
MILITARY               CATEGORY     22
TERRORIST_OR_CRIMINAL  CATEGORY     49
Name: hasrelation_label, dtype: int64

--------------- HAS_CONSEQUENCE has 769 examples ---------------

( 1 ) Seen pair of entity types 
 e1_type      e2_type                             
ACCIDENT     ACCIDENT                                69
             CBRN_EVENT                              29
             CRIMINAL_ARREST                          1
             FIRE                                    25
             STRIKE                                   2
                                                     ..
TRAFFICKING  DRUG_OPERATION                           2
             FIRE                                     2
             NON_MILITARY_GOVERNMENT_ORGANISATION     1
             THEFT                                    2
    

## Relations occurring at least once with others

In [17]:
not_lonely_labels = [not_lonely_label for not_lonely_label in label2coocurrentlabels if len(label2coocurrentlabels[not_lonely_label])>0]
print(f"{len(not_lonely_labels)} {not_lonely_labels=}")

21 not_lonely_labels=['IS_LOCATED_IN', 'HAS_CONTROL_OVER', 'OPERATES_IN', 'IS_IN_CONTACT_WITH', 'STARTED_IN', 'IS_AT_ODDS_WITH', 'IS_PART_OF', 'GENDER_MALE', 'START_DATE', 'END_DATE', 'INITIATED', 'IS_OF_SIZE', 'GENDER_FEMALE', 'IS_COOPERATING_WITH', 'RESIDES_IN', 'HAS_FAMILY_RELATIONSHIP', 'CREATED', 'DEATHS_NUMBER', 'INJURED_NUMBER', 'DIED_IN', 'IS_BORN_IN']
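
With both lists available, the three-classifier split proposed in the introduction can be sketched as a simple routing of labels to tasks. The task names (`gender_binary`, `multiclass`, `multilabel`) and the function `assign_labels_to_tasks` are hypothetical. The sketch forces the gender labels into their own task, because in the output above they land in `not_lonely_labels`.

```python
GENDER_LABELS = {"GENDER_MALE", "GENDER_FEMALE"}


def assign_labels_to_tasks(label2cooccurring):
    """Route each label to one of three hypothetical classification tasks."""
    tasks = {"gender_binary": [], "multiclass": [], "multilabel": []}
    for label, cooccurring in label2cooccurring.items():
        if label in GENDER_LABELS:
            tasks["gender_binary"].append(label)
        elif len(cooccurring) == 0:
            # Never seen with another label: a single-label classifier suffices
            tasks["multiclass"].append(label)
        else:
            tasks["multilabel"].append(label)
    return tasks


# Toy co-occurrence map (hypothetical, same shape as label2coocurrentlabels)
toy_map = {
    "GENDER_MALE": ["GENDER_FEMALE"],
    "GENDER_FEMALE": ["GENDER_MALE"],
    "HAS_COLOR": [],
    "IS_LOCATED_IN": ["RESIDES_IN"],
    "RESIDES_IN": ["IS_LOCATED_IN"],
}
tasks = assign_labels_to_tasks(toy_map)
print(tasks)
```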


In [18]:
analyze_confusion_potentials(not_lonely_labels, False)


--------------- IS_LOCATED_IN has 9025 examples ---------------

( 1 ) Seen pair of entity types 
 e1_type                               e2_type
ACCIDENT                              PLACE       439
AGITATING_TROUBLE_MAKING              PLACE        31
BOMBING                               PLACE        69
CBRN_EVENT                            PLACE       252
CIVILIAN                              PLACE      2037
CIVIL_WAR_OUTBREAK                    PLACE        32
COUP_D_ETAT                           PLACE        29
CRIMINAL_ARREST                       PLACE       112
DEMONSTRATION                         PLACE        48
DRUG_OPERATION                        PLACE        36
ECONOMICAL_CRISIS                     PLACE        70
ELECTION                              PLACE        30
EPIDEMIC                              PLACE       117
FIRE                                  PLACE       271
GATHERING                             PLACE       297
GROUP_OF_INDIVIDUALS                  PLACE 

### Check some examples of rare co-occurrences

In [22]:
df.query(f""" {TARGET_COL}=="['DEATHS_NUMBER', 'IS_OF_SIZE']" """)[TASK_INPUT_COL].iloc[0]

'De [ nombreuses ] { personnes } ont été tuées.'

### Check whether `IS_AT_ODDS_WITH` is always a symmetric relation

This property might be useful to augment data.
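
As a sketch of what such an augmentation could look like if the relation is treated as symmetric: swap the roles of the two entities and keep the label. This assumes that `{ }` marks the first entity and `[ ]` the second in `input_text`, which is inferred from the sample rows in this notebook and is not a documented convention. The function name `swap_entity_roles` and the toy type values are hypothetical.

```python
# Translation table that exchanges the two bracket styles
_SWAP_BRACKETS = str.maketrans("{}[]", "[]{}")


def swap_entity_roles(example):
    """Return a copy of the example with e1 and e2 exchanged (label unchanged)."""
    swapped = dict(example)
    swapped["e1_id"], swapped["e2_id"] = example["e2_id"], example["e1_id"]
    swapped["e1_type"], swapped["e2_type"] = example["e2_type"], example["e1_type"]
    # Assumption: '{ }' tags e1 and '[ ]' tags e2 in the input text
    swapped["input_text"] = example["input_text"].translate(_SWAP_BRACKETS)
    return swapped


# Toy example (ids and types are made up)
example = {
    "e1_id": 3,
    "e2_id": 0,
    "e1_type": "TYPE_A",
    "e2_type": "TYPE_B",
    "relations": "['IS_AT_ODDS_WITH']",
    "input_text": "les [ forces de l'ordre ] ont dispersé les { manifestants }",
}
swapped = swap_entity_roles(example)
print(swapped["input_text"])
```

Pairs tagged with `< >` (same entity on both sides) would be left unchanged by this swap.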

In [27]:
# If symmetric, then the number of relations in each original text (text_index) should be even
df.query(f""" {TARGET_COL}=="['IS_AT_ODDS_WITH']" """).groupby("text_index")[TARGET_COL].count().sort_values()

text_index
3821      1
4936      1
41749     1
51848     1
42004     2
         ..
41186    26
4910     26
41073    32
11897    34
3696     38
Name: relations, Length: 193, dtype: int64
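
An even count per text is a necessary condition for symmetry but not a sufficient one, and the output above already shows texts with a single labeled pair. A more direct check looks for the reversed pair of each labeled pair. The following is a minimal sketch, with the hypothetical function `find_unmatched_pairs` operating on `(text_index, e1_id, e2_id)` tuples. Note that the query above only selects rows whose label set is exactly `['IS_AT_ODDS_WITH']`, so a reversed pair carrying additional labels would be missed unless those rows are included.

```python
def find_unmatched_pairs(pairs):
    """Return labeled (text_index, e1_id, e2_id) tuples whose reverse is not labeled."""
    seen = set(pairs)
    return sorted(
        (text_index, e1_id, e2_id)
        for text_index, e1_id, e2_id in seen
        if (text_index, e2_id, e1_id) not in seen
    )


# Toy data (hypothetical): text 1 is symmetric, text 2 is not
toy_pairs = [(1, 0, 2), (1, 2, 0), (2, 5, 7)]
unmatched = find_unmatched_pairs(toy_pairs)
print(unmatched)
```

With the real data, the tuples could be obtained from the `text_index`, `e1_id` and `e2_id` columns of the filtered dataframe.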

In [28]:
df.query(f""" {TARGET_COL}=="['IS_AT_ODDS_WITH']" """).query("text_index==3821")[TASK_INPUT_COL].iloc[0]

"Après le chaos de la circulation, les [ forces de l'ordre ] ont dispersé les { manifestants } avec des grenades lacrymogènes."

In [23]:
# difficult to know that mercenaries are by definition at odds with police
df.query(f""" {TARGET_COL}=="['IS_AT_ODDS_WITH']" """).query("text_index==3701")[TASK_INPUT_COL].iloc[0]

"Peu après, le chef de l'État a déclaré dans un communiqué radio que cette tentative de coup d'État avait été dirigée par le mercenaire M. [ Smith Lewan ], un ancien membre de l'armée américaine. Le chef d'État a affirmé que M. [ Smith Lewan ] et M. Basuki Achmad avaient déjà travaillé ensemble et que toutes les preuves étaient réunies. Malgré les patrouilles de la { police } pour rétablir l'ordre, des cas de vols de voitures et de vandalisme ont été recensés dans les rues de Jakarta."

In [24]:
df.query(f""" {TARGET_COL}=="['IS_AT_ODDS_WITH']" """).query("text_index==2431")[TASK_INPUT_COL].iloc[0]

"L'[ équipe ] des Forces d'Interventions Rapides, les FIR, a appréhendé un groupe de trafiquants de drogue dans leur quartier général caché en pleine forêt à une longitude de 6.944408. Lors de la descente, { Gibryl Walker }, l'{ un } des trafiquants, a pu se cacher sous un pick-up, un couteau à la main, derrière une pile de cartons dans lesquels étaient dissimulés leurs sachets de drogue. Pendant les affrontements, des échanges de tirs ont eu lieu et { Gibryl Walker } est tombé sur la pile de cartons. En tombant, { il } a accidentellement répandu sur son visage la poudre de cocaïne contenue dans les cartons. Une fois l'opération terminée, les secours ont constaté que Monsieur { Gibryl Walker } gisait sous la voiture, mort d'une overdose à cause de la drogue qu'{ il } avait inhalée malgré { lui }."

# Context of entities in gender relations

We would like to assume that:
- gender is always a binary attribute
- gender never co-occurs with another attribute / relation
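
Both assumptions can be tested directly on the parsed label lists. The sketch below flags every example that carries both gender labels or a gender label together with any other relation. The function name `find_gender_assumption_violations` and the toy input are illustrative assumptions.

```python
GENDER_LABELS = {"GENDER_MALE", "GENDER_FEMALE"}


def find_gender_assumption_violations(label_lists):
    """Return (position, labels) for examples that break either gender assumption."""
    violations = []
    for position, labels in enumerate(label_lists):
        label_set = set(labels)
        genders = label_set & GENDER_LABELS
        if not genders:
            continue  # not a gender example
        has_both_genders = len(genders) > 1
        has_other_relation = bool(label_set - GENDER_LABELS)
        if has_both_genders or has_other_relation:
            violations.append((position, sorted(label_set)))
    return violations


# Toy input (hypothetical)
toy_labels = [
    ["GENDER_MALE"],
    ["GENDER_MALE", "GENDER_FEMALE"],
    ["GENDER_MALE", "IS_IN_CONTACT_WITH"],
    ["IS_LOCATED_IN"],
    [],
]
violations = find_gender_assumption_violations(toy_labels)
print(violations)
```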

In [29]:
analyze_confusion_potentials(["GENDER_MALE", "GENDER_FEMALE"], True)


--------------- GENDER_MALE has 908 examples ---------------

( 1 ) Seen pair of entity types 
 e1_type                e2_type              
CIVILIAN               CIVILIAN                 831
MILITARY               MILITARY                  21
TERRORIST_OR_CRIMINAL  TERRORIST_OR_CRIMINAL     56
Name: hasrelation_label, dtype: int64

( 2 ) labels in all the dataset with (e1_type, e2_type)=('CIVILIAN', 'CIVILIAN') and identical_entity=True 
 relations
['GENDER_MALE']                     830
['GENDER_FEMALE']                   402
['GENDER_MALE', 'GENDER_FEMALE']      1
Name: count, dtype: int64

( 2 ) labels in all the dataset with (e1_type, e2_type)=('MILITARY', 'MILITARY') and identical_entity=True 
 relations
['GENDER_MALE']                          20
['GENDER_MALE', 'IS_IN_CONTACT_WITH']     1
Name: count, dtype: int64

( 2 ) labels in all the dataset with (e1_type, e2_type)=('TERRORIST_OR_CRIMINAL', 'TERRORIST_OR_CRIMINAL') and identical_entity=True 
 relations
['GENDER_MALE']   

## Check the 2 co-occurrences of genders

### ['GENDER_MALE', 'IS_IN_CONTACT_WITH']

How can someone be in contact with himself?

In [30]:
# the only example labeled with both 'GENDER_MALE' and 'IS_IN_CONTACT_WITH'
df.query(f""" {TARGET_COL}=="['GENDER_MALE', 'IS_IN_CONTACT_WITH']" """)[TASK_INPUT_COL].iloc[0]

'Un jour, une vidéo filmant la décapitation d’un < homme > cagoulé a été diffusée à la télévision. Il s’agissait du Général < Martin Kumba >, reconnue grâce à une montre de marque qu’< il > avait à la main.'

### ['GENDER_MALE', 'GENDER_FEMALE']

Can a person be of both genders?

In this example, it seems clear that the driver was a woman. 

What is unclear is why the owner of the car, who is male (`M. < Ali Alissone >`), is tagged as the same entity as the driver, who is explicitly said to be a woman (`la conductrice`).

In [31]:
train_df.query(f""" {TARGET_COL}=="['GENDER_MALE', 'GENDER_FEMALE']" """)[TASK_INPUT_COL].iloc[0]

"Le bilan fait état d'une < personne > gravement blessée. L'< automobiliste > circulant en direction de Nantes a percuté l’arrière d’un poids-lourd qui < le > précédait puis s’est déporté sur la voie opposée. Le capot de la voiture appartenant à M. < Ali Alissone > a été fortement endommagé dans cette collision. La présence de l'Office National des secours a permis de sortir la < conductrice > de son véhicule."

In [34]:
# examples that might contain < personne > and thus the same kind of ambiguity / error
test_df[test_df[TASK_INPUT_COL].str.contains("< personne >")].query("e1_id==e2_id & e1_type=='CIVILIAN'")[TASK_INPUT_COL].values.tolist()

["Malheureusement pour eux, une < personne >, Mme < Sherley Campbell >, a été victime d'une électrocution et les investigations menées par la police ont permis de découvrir le vol. La < victime > est actuellement à l'hôpital, sous oxygène, avec un bras amputé.",
 'D\'après une enquête approfondie, la police a découvert que la < personne > qui les a alertés n\'est autre que Madame < Anna Laura Sauvage >, la présidente de l\'association "One Hand".',
 'Le 13 novembre 2022, Monsieur < Laurent Chazal >, le président de l\'association "Aide pour tous" au Venezuela, a assisté à la campagne d\'une nouvelle secte à Caracas. D\'après une enquête approfondie, la police a découvert que la < personne > qui les avait alertés était Monsieur < Laurent Chazal >.',
 "Sur les 20 personnes à bord, une seule < personne > a survécu. Il s'agissait d'une < jeune femme > nommée < Bona Capelani >. < Elle > avait eu la chance d'être à côté d'une fenêtre. < Elle > avait été éjectée à l'extérieur du bus lors de l