# E2MoCase

In this notebook we show how to obtain the raw data from which [E2MoCase](https://arxiv.org/pdf/2409.09001) was derived. Unfortunately, we cannot openly share the raw data due to [commercial restrictions](https://www.liri.uzh.ch/en/services/swissdox.html). Therefore, to get the complete dataset, including the raw data, please follow the steps described below.

As a first step, you’ll need to have a valid license for the [SwissDox API](https://liri.linguistik.uzh.ch/wiki/langtech/swissdox/api). Once you’ve obtained it, specify the X-API-Key and X-API-Secret headers in the [config/configuration.yaml](config/configuration.yaml) file.

Next, we run the queries located in the [Queries](./Queries) folder to retrieve the news data from SwissDox. Each JSON file in this folder holds the query for one selected case, stored as a YAML string under the **yaml_query** key.

In [1]:
import os
import json
import glob
queries_path = './Queries'
yaml_queries = {}  # maps each query file path to its YAML query string
pattern = os.path.join(queries_path, '*.json')
for filepath in glob.glob(pattern):
    with open(filepath, 'r', encoding='utf-8') as f:
        data = json.load(f)
    yaml_query = data.get('yaml_query')
    if yaml_query:
        yaml_queries[filepath] = yaml_query

print('Total number of queries:', len(yaml_queries))
print()
print('First query:', list(yaml_queries.values())[0])

Total number of queries: 52

First query: query:
  dates:
    - from: 2019-01-01
      to: 2024-03-31
  content:
    - Ahmaud Arbery
result:
  format: TSV
  maxResults: 10000000
  columns:
    - id
    - pubtime
    - medium_code
    - medium_name
    - rubric
    - regional
    - doctype
    - doctype_description
    - language
    - char_count
    - dateline
    - head
    - subhead
    - content_id
    - content
version: 1.2
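
As a rough illustration of the file layout that the loading code above relies on, the sketch below writes and re-reads a query file for a made-up case. Only the `yaml_query` key is taken from the code above; the helper names, the file name, the search term, and the shortened column list are placeholders.

```python
import json
import os
import tempfile

# Hypothetical query for illustration; only its structure mirrors the one printed above.
EXAMPLE_YAML = """query:
  dates:
    - from: 2019-01-01
      to: 2024-03-31
  content:
    - Example Case
result:
  format: TSV
  maxResults: 10000000
  columns:
    - id
    - content
version: 1.2
"""

def write_query_file(folder, case_name, yaml_query):
    """Store a YAML query string under the 'yaml_query' key of a JSON file."""
    path = os.path.join(folder, case_name + '.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump({'yaml_query': yaml_query}, f, ensure_ascii=False, indent=2)
    return path

def read_query_file(path):
    """Return the YAML query string stored in a query file (None if absent)."""
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f).get('yaml_query')

with tempfile.TemporaryDirectory() as tmp:
    query_path = write_query_file(tmp, 'example_case', EXAMPLE_YAML)
    loaded_query = read_query_file(query_path)
    print(loaded_query == EXAMPLE_YAML)
```

Keeping the YAML as a plain string inside the JSON file means it can be passed on unchanged, without being parsed and re-serialized.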



Now we submit the queries to SwissDox using the functions provided in the [**sw_lib.py**](utils/sw_lib.py) script:

In [None]:
from utils.sw_lib import submit_query, get_headers

header = get_headers(path='./config/configuration.yaml')
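for name, query in yaml_queries.items():
    # `name` (the file path) is reused later to look the query up by name.
    response, yaml_string = submit_query(query, name, name, header)

    if response['result'].lower() == 'ok':
        print(f'Query {name} has been submitted successfully!')
    else:
        print(f'Submission of {name} failed:', response)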

After submitting all the queries, you can check their status and download the results either via the SwissDox platform or with the functions provided in the [utils/sw_lib.py](utils/sw_lib.py) module, as shown below.

In [None]:
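from utils.sw_lib import check_status, check_status_id, get_headers, get_query_id
from utils.util import download

save_dir = './artifacts'
os.makedirs(save_dir, exist_ok=True)  # make sure the download folder exists
header = get_headers(path='./config/configuration.yaml')
status_all = check_status(header)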

for query_name in yaml_queries.keys():
    query_id = get_query_id(status_all, query_name, case_sensitive=False)
    if query_id is None:
        print(f'Query not found for {query_name}!')
        continue

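    # Use only the file name (not the './Queries/...' path) so the archive lands directly in save_dir
    file_name_ex = os.path.basename(query_name) + ".tsv.xz"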

    json_id = check_status_id(query_id, header)[0]

    if json_id['status'] != 'finished':
        status = json_id['status']
        print(f'Query {query_name} is not finished yet (status: {status}).')
        continue
    if json_id['downloadUrl'] is None:
        status = json_id['status']
        results = json_id['actualResults']
        print(f'Status of {query_name} is {status} with {results} results')
        continue
    print(f'Downloading {query_name}')
    download(json_id['downloadUrl'], os.path.join(save_dir, file_name_ex), header) #download the file
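
The loop above simply skips queries that have not finished yet, so it may need to be re-run. A minimal polling sketch is given below, under the assumption that a status lookup returns a dict with a `status` key whose final value is `'finished'` (as in the loop above). The helper `wait_until_finished`, the `'running'` status value, and the example URL are hypothetical.

```python
import time

def wait_until_finished(fetch_status, poll_seconds=30.0, max_polls=20, sleep=time.sleep):
    """Call fetch_status() until it reports 'finished' or max_polls is reached.

    fetch_status is any zero-argument callable returning a dict with a
    'status' key (a stand-in for a real status lookup).
    Returns the last status dict, or None if the query never finished.
    """
    for attempt in range(max_polls):
        info = fetch_status()
        if info.get('status') == 'finished':
            return info
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # wait before asking again
    return None

# Demo with a fake status source that finishes on the third poll
_responses = iter([
    {'status': 'running'},
    {'status': 'running'},
    {'status': 'finished', 'downloadUrl': 'https://example.org/result.tsv.xz'},
])
result = wait_until_finished(lambda: next(_responses), poll_seconds=0, sleep=lambda s: None)
print(result['status'])
```

In the notebook, the callable could wrap the existing lookup, for example `lambda: check_status_id(query_id, header)[0]`.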


Once you’ve downloaded all the files, you’ll need to decompress them. Each archive yields a TSV file with the following columns:

- id
- pubtime
- medium_code
- medium_name
- rubric
- regional
- doctype
- doctype_description
- language
- char_count
- dateline
- head
- subhead
- content_id
- content
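
The decompression step can be done with any tool that understands the xz format. As one possible sketch using only the Python standard library (the helper `decompress_xz` and the demo file are illustrative, not part of the repository):

```python
import lzma
import os
import shutil
import tempfile

def decompress_xz(src_path, dst_path=None):
    """Decompress a .xz archive to dst_path (default: src_path without '.xz')."""
    if dst_path is None:
        dst_path = src_path[:-3] if src_path.endswith('.xz') else src_path + '.out'
    with lzma.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)  # stream the data instead of loading it all at once
    return dst_path

# Self-contained demo on a tiny synthetic archive
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, 'demo.tsv.xz')
    with lzma.open(archive, 'wt', encoding='utf-8') as f:
        f.write('id\tcontent\n1\tsome text\n')
    tsv_path = decompress_xz(archive)
    with open(tsv_path, 'r', encoding='utf-8') as f:
        demo_text = f.read()
    demo_name = os.path.basename(tsv_path)
```

Alternatively, pandas can read such archives directly via `pd.read_csv(path, sep='\t', compression='xz')`, which avoids writing the decompressed file to disk.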


The news content is located in the **content** column, which is enriched with markup tags such as <LG\> or <ParagTitle\>.
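
To give an idea of what removing such tags involves, here is a minimal regex-based sketch. It is not the logic of [utils/sw_handler.py](utils/sw_handler.py), which is what the notebook actually uses; the pattern and the helper `strip_markup` are assumptions made for illustration.

```python
import re

# Matches opening and closing tags such as <LG>, </LG> or <ParagTitle>
TAG_PATTERN = re.compile(r'</?[A-Za-z][^<>]*>')

def strip_markup(text):
    """Replace markup tags with spaces and collapse repeated whitespace."""
    without_tags = TAG_PATTERN.sub(' ', text)
    return ' '.join(without_tags.split())

sample = '<ParagTitle>Title</ParagTitle><LG>First sentence.</LG> Second sentence.'
cleaned = strip_markup(sample)
print(cleaned)
```

Replacing tags with a space rather than an empty string keeps words from adjacent elements from being glued together.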

## Preprocessing

To illustrate the preprocessing phase, assume that the query results are stored in the file [example_data.csv](./example_data.csv), a tab-separated file that contains the columns **content_id, language, head, subhead, and content**. Below, we describe the code used to clean the content by removing markup tags, translate the text into English, and split it into paragraphs.

In [2]:
import pandas as pd

# Note: the data in this example are not real news data extracted from SwissDox;
# they were synthetically generated for illustrative purposes.
df = pd.read_csv('example_data.csv', sep='\t')
print(df.head(2))


                             content_id language  \
0  01a7d373-b97a-336c-f318-43518eff51c8       en   
1  02b7d373-c97b-446c-f318-43518eff51c9       en   

                                   head  \
0            Nova Kakhovka Dam Collapse   
1  Historic Vote Ousts US House Speaker   

                                   subhead  \
0  Widespread Flooding in Southern Ukraine   
1  Kevin McCarthy Removed by Narrow Margin   

                                             content  
0  In early June 2023, the Nova Kakhovka dam in s...  
1  In a dramatic turn of events on Capitol Hill, ...  


In [4]:
df['medium_code'] = 'TEXT' # assume everything is text

In [5]:
from deep_translator import GoogleTranslator
from utils import sw_handler as text_handler
from tqdm import tqdm



for index, row in tqdm(df.iterrows(), total=len(df), desc="Pre-processing data"):
    # Skip content coming from audio or video transcriptions
    if text_handler.is_audio_video_text(row):
        print("No text: audio or video transcription")
        continue

    # Clean the content and rebuild it as a newline-separated string
    cleaned_content = "\n".join(text_handler.get_content_in_paragraphs(row))
    if len(cleaned_content) <= 1:
        continue

    language = row["language"]
    if language != 'en':
        translator = GoogleTranslator(source=language, target='en')
        df.loc[index, "translated_head"] = translator.translate(row["head"])
        # A missing subhead is not a string (pandas reads it as NaN), so translate strings only
        subhead = translator.translate(row["subhead"]) if isinstance(row["subhead"], str) else ""
        df.loc[index, "translated_subhead"] = subhead

        text = translator.translate(cleaned_content)
        # Leave translated_input unset when the translation is empty or not a string
        if not isinstance(text, str) or text == '':
            continue
        df.loc[index, "translated_input"] = text
    else:
        # English articles need no translation
        df.loc[index, "translated_head"] = row['head']
        df.loc[index, "translated_subhead"] = row['subhead']
        df.loc[index, "translated_input"] = cleaned_content
    


Pre-processing data: 100%|██████████| 10/10 [00:07<00:00,  1.33it/s]
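
Translation services commonly cap the length of a single request, which can matter for long articles. The sketch below splits a newline-separated text into chunks before translating, under that assumption. The 5000-character default is an assumed value, and `chunk_paragraphs` and `translate_long_text` are hypothetical helpers that are not part of the repository.

```python
def chunk_paragraphs(text, max_chars=5000):
    """Group newline-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept as its own chunk.
    """
    chunks, current = [], ''
    for paragraph in text.split('\n'):
        candidate = paragraph if not current else current + '\n' + paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the current chunk
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def translate_long_text(text, translate, max_chars=5000):
    """Translate text chunk by chunk with any callable mapping str to str."""
    return '\n'.join(translate(chunk) for chunk in chunk_paragraphs(text, max_chars))

# Demo with a stand-in translator (upper-casing) instead of a real service
demo_chunks = chunk_paragraphs('aaaa\nbbbb\ncc', max_chars=9)
demo_translation = translate_long_text('aaaa\nbbbb\ncc', str.upper, max_chars=9)
print(demo_chunks, demo_translation)
```

In the loop above, `translate` could be the bound method `GoogleTranslator(source=language, target='en').translate`.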


In [6]:
df.tail(5)

| | content_id | language | head | subhead | content | medium_code | translated_head | translated_subhead | translated_input |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 227d4d3d-8148-4ed6-8ec0-4455ebc2437e | fr | Réforme des Retraites en France | Grèves Massives et Contestations Sociales | La France a connu une période particulièrement... | TEXT | Pension reform in France | Massive strikes and social disputes | Pension reform in France\nMassive strikes and ... |
| 6 | ecc63b07-9653-4ac5-923d-f609575719e8 | fr | Séisme Dévastateur au Maroc | Aide Internationale et Reconstructions | Le Maroc a été frappé en septembre 2023 par un... | TEXT | Devastating earthquake in Morocco | International aid and reconstructions | Devastating earthquake in Morocco\nInternation... |
| 7 | 2a79d0fe-29a3-4cd5-9ac2-2538f47eaf0f | de | Technische Rezession in Deutschland | BIP Zwei Quartale in Folge Geschrumpft | Deutschland rutschte Ende 2022 und Anfang 2023... | TEXT | Technical recession in Germany | GDP Two quarters in a row shrunk | Technical recession in Germany\nGDP Two quarte... |
| 8 | ad700303-5da1-4226-bf4b-5e83b85154c3 | de | Ende der Atomkraft in Deutschland | Letzte Drei AKWs vom Netz Genommen | Im April 2023 wurden in Deutschland die letzte... | TEXT | End of nuclear power in Germany | Last three nuclear power plant removed from th... | End of nuclear power in Germany\nLast three nu... |
| 9 | f45e5a0c-e365-4097-b5ef-572ebc1ed304 | it | Arresto Storico in Sicilia | Matteo Messina Denaro Catturato Dopo Trent’Anni | Nel gennaio 2023, le forze dell’ordine italian... | TEXT | Historical arrest in Sicily | Matteo Messina money captured after thirty years | Historical arrest in Sicily\nMatteo Messina mo... |


After the text has been cleaned and translated, we split it into paragraphs, as shown below.

In [51]:
res_list = []
for index, row in df.iterrows():
    language = row["language"]
    paragraphs = text_handler.get_content_in_paragraphs(row)
    if not paragraphs:  # skip articles that yield no paragraphs
        continue
    for parag_id, p in enumerate(paragraphs):
        res_list.append({"content_id": row["content_id"], "P": "P" + str(parag_id), "language": language, "translated_input": p})

# Build the paragraph-level dataframe: one row per paragraph
res_df = pd.DataFrame(res_list)


In [52]:
res_df

| | content_id | P | language | translated_input |
|---|---|---|---|---|
| 0 | 01a7d373-b97a-336c-f318-43518eff51c8 | P0 | en | Nova Kakhovka Dam Collapse\nWidespread Floodin... |
| 1 | 01a7d373-b97a-336c-f318-43518eff51c8 | P1 | en | In early June 2023, the Nova Kakhovka dam in s... |
| 2 | 01a7d373-b97a-336c-f318-43518eff51c8 | P2 | en | International observers immediately raised con... |
| 3 | 01a7d373-b97a-336c-f318-43518eff51c8 | P3 | en | Both Ukrainian and Russian authorities exchang... |
| 4 | 02b7d373-c97b-446c-f318-43518eff51c9 | P0 | en | Historic Vote Ousts US House Speaker\nKevin Mc... |
| 5 | 02b7d373-c97b-446c-f318-43518eff51c9 | P1 | en | In a dramatic turn of events on Capitol Hill, ... |
| 6 | 02b7d373-c97b-446c-f318-43518eff51c9 | P2 | en | The unprecedented vote plunged Congress into t... |
| 7 | 02b7d373-c97b-446c-f318-43518eff51c9 | P3 | en | Following the vote, McCarthy briefly addressed... |
| 8 | 03c7d373-d97c-556c-f318-43518eff51ca | P0 | en | India’s Chandrayaan-3 Lunar Triumph\nHistoric ... |
| 9 | 03c7d373-d97c-556c-f318-43518eff51ca | P1 | en | In late August 2023, India achieved a signific... |


The **(content_id, P) pair** uniquely identifies a row. Therefore, to reconstruct the original dataset, simply merge the paragraph-level dataframe res_df with the derived data contained in the file [e2mocase.csv](e2mocase.csv) using this pair as the key.


In [None]:
df_derived = pd.read_csv('e2mocase.csv', sep='\t')

# res_df (not df) carries the P column needed for the join
e2mocase = pd.merge(df_derived, res_df, on=['content_id', 'P'], how='inner')
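
Since the join relies on the (content_id, P) pair being unique, it can be worth checking this before merging. The sketch below does so on toy data. The two small dataframes and the `label` column are made up for illustration and do not reflect the actual contents of e2mocase.csv.

```python
import pandas as pd

# Toy stand-ins for the paragraph-level data and the derived annotations;
# the 'label' column is a hypothetical example of a derived field.
paragraphs_df = pd.DataFrame({
    'content_id': ['a1', 'a1', 'b2'],
    'P': ['P0', 'P1', 'P0'],
    'translated_input': ['Headline', 'Body text', 'Other headline'],
})
derived_df = pd.DataFrame({
    'content_id': ['a1', 'a1', 'b2'],
    'P': ['P0', 'P1', 'P0'],
    'label': ['x', 'y', 'z'],
})

key = ['content_id', 'P']

# The key must be unique on both sides for a one-to-one join
assert not paragraphs_df.duplicated(subset=key).any()
assert not derived_df.duplicated(subset=key).any()

# validate='one_to_one' makes pandas raise an error if either side has duplicate keys
merged = pd.merge(derived_df, paragraphs_df, on=key, how='inner', validate='one_to_one')
print(len(merged), list(merged.columns))
```

With `how='inner'`, paragraphs that are missing on either side are dropped silently, so comparing `len(merged)` with the length of the derived data is a quick sanity check.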
