<a href="https://colab.research.google.com/github/lalopey/coreference/blob/main/ECB_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **<font color="darkred">ECB+ EDA</font>**

#### **Authors:**
Eduardo Peynetti

## **<font color="darkred">Setup Notebook</font>**

#### Installs

In [1]:
!pip install xmltodict

Collecting xmltodict
  Downloading xmltodict-0.12.0-py2.py3-none-any.whl (9.2 kB)
Installing collected packages: xmltodict
Successfully installed xmltodict-0.12.0


#### Imports

In [1]:
import os
import requests
import zipfile
import tarfile
import xmltodict
import textwrap
from pathlib import Path

import pandas as pd


#### Utils

In [2]:
def download_file(packet_url, base_path="", extract=False, headers=None):
  """Download packet_url into base_path and optionally extract the archive."""
  if base_path != "":
    # makedirs also creates missing parent directories and tolerates reruns.
    os.makedirs(base_path, exist_ok=True)
  packet_file = os.path.basename(packet_url)
  packet_path = os.path.join(base_path, packet_file)
  # Stream the response to disk in chunks so the archive never sits fully in memory.
  with requests.get(packet_url, stream=True, headers=headers) as r:
    r.raise_for_status()
    with open(packet_path, 'wb') as f:
      for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)

  if extract:
    if packet_file.endswith('.zip'):
      with zipfile.ZipFile(packet_path) as zfile:
        zfile.extractall(base_path)
    else:
      # Anything that is not a .zip is treated as a tar archive.
      with tarfile.open(packet_path) as tfile:
        tfile.extractall(base_path)

def unpack_ecb(base_path='datasets'):
  """Extract the inner ECB+.zip that ships inside the ECB+_LREC2014 folder."""
  ecb_dir = 'ECB+_LREC2014'
  ecb_file = 'ECB+.zip'
  with zipfile.ZipFile(os.path.join(base_path, ecb_dir, ecb_file)) as zfile:
    zfile.extractall(os.path.join(base_path, ecb_dir))


# Wrap text to 80 characters.
wrapper = textwrap.TextWrapper(width=80) 


## **<font color="darkred">Download Datasets</font>**

In [3]:
ECB_PATH = 'datasets/ECB+_LREC2014/ECB+'
LITBANK_PATH = 'datasets/litbank-master'
WIKI_PATH = 'datasets/WikiCoref'


In [4]:
download_file('http://kyoto.let.vu.nl/repo/ECB+_LREC2014.zip', base_path="datasets", extract=True)
unpack_ecb(base_path="datasets")
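
As a quick sanity check after extraction, the sketch below counts the XML files in each topic folder. It assumes the layout implied by the file path opened in the ECB+ section of this notebook (`ECB_PATH/<topic>/<name>.xml`); `count_xml_per_topic` is a hypothetical helper, not part of the corpus tooling.

```python
from pathlib import Path


def count_xml_per_topic(root):
    """Map each sub-directory of root to the number of .xml files directly inside it."""
    root = Path(root)
    counts = {}
    for topic_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[topic_dir.name] = sum(1 for _ in topic_dir.glob('*.xml'))
    return counts


# Example call (assumes the archive was extracted to this location):
# counts = count_xml_per_topic('datasets/ECB+_LREC2014/ECB+')
# print(len(counts), 'topics,', sum(counts.values()), 'documents')
```

An empty result or a missing directory error here would suggest that the download or the inner unzip step did not complete.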


## **<font color="darkred">ECB+</font>**

Corpus page: http://www.newsreader-project.eu/results/data/the-ecb-corpus/

NewsReader report NWR-2014-1 (PDF): http://www.newsreader-project.eu/files/2013/01/NWR-2014-1.pdf

In [5]:
file = os.path.join(ECB_PATH, '1', '1_1ecbplus.xml')

with open(file,'rb') as f:
    dct = xmltodict.parse(f)

print('Document keys:\n')
display(dct['Document'].keys())

print('\n Token data:\n')
display(dct['Document']['token'][0])

# Rebuild the document text by joining the token strings with single spaces.
string = ' '.join(token['#text'] for token in dct['Document']['token'])

print('\n', wrapper.fill(string))

print('\n Markables:\n')
display(dct['Document']['Markables'].keys())

print('\n Relations:\n')

dct['Document']['Relations']['CROSS_DOC_COREF'][0]

Document keys:



odict_keys(['@doc_name', '@doc_id', 'token', 'Markables', 'Relations'])


 Token data:



OrderedDict([('@t_id', '1'),
             ('@sentence', '0'),
             ('@number', '0'),
             ('#text', 'http')])


 http : / / www . accesshollywood . com / lindsay - lohan - leaves - betty - ford
- checks - into - malibu - rehab _ article _ 80744 Lindsay Lohan Leaves Betty
Ford , Checks Into Malibu Rehab First Published : June 13 , 2013 4 : 59 PM EDT
Lindsay Lohan has left the Betty Ford Center and is moving to a rehab facility
in Malibu , Calif . , Access Hollywood has confirmed . A spokesperson for The
Los Angeles Superior Court confirmed to Access that a judge signed an order
yesterday allowing the transfer to Cliffside , where she will continue with her
90 - day court - mandated rehab . Lohan ’ s attorney , Shawn Holley , spoke out
about the move . “ Lindsay is grateful for the treatment she received at the
Betty Ford Center . She has completed her course of treatment there and looks
forward to continuing her treatment and building on the foundation established
at Betty Ford , ” Holley said in a statement to Access . The actress checked
into the Betty Ford Center in May as part of a plea deal

odict_keys(['ACTION_ASPECTUAL', 'ACTION_CAUSATIVE', 'ACTION_OCCURRENCE', 'ACTION_REPORTING', 'ACTION_STATE', 'HUMAN_PART_ORG', 'HUMAN_PART_PER', 'LOC_FAC', 'NON_HUMAN_PART', 'TIME_DATE', 'TIME_DURATION', 'TIME_OF_THE_DAY'])


 Relations:



OrderedDict([('@r_id', '21683'),
             ('@note', 'ACT15736242788923354'),
             ('source', OrderedDict([('@m_id', '34')])),
             ('target', OrderedDict([('@m_id', '72')]))])
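
The parsed document is a nested mapping, so two small helpers make it easier to explore. The sketch below groups tokens by their `@sentence` value and collects the `@m_id` values on each side of a relation. It runs on a tiny hand-written sample shaped like the output above, not on real corpus data, and `as_list`, `tokens_by_sentence`, and `relation_m_ids` are hypothetical names. `as_list` is there because xmltodict returns a single dict for an element that occurs once and a list when it repeats, so `source` may be either.

```python
from collections import defaultdict


def as_list(value):
    """Wrap a lone item in a list; xmltodict only yields a list for repeated elements."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def tokens_by_sentence(tokens):
    """Group token texts by their '@sentence' attribute, preserving token order."""
    sentences = defaultdict(list)
    for token in as_list(tokens):
        sentences[int(token['@sentence'])].append(token['#text'])
    return {idx: ' '.join(words) for idx, words in sorted(sentences.items())}


def relation_m_ids(relation):
    """Return (source m_ids, target m_ids) for one relation entry."""
    sources = [m['@m_id'] for m in as_list(relation.get('source'))]
    targets = [m['@m_id'] for m in as_list(relation.get('target'))]
    return sources, targets


# Hand-written sample mimicking the parsed structure (not real corpus content).
sample_tokens = [
    {'@t_id': '1', '@sentence': '0', '@number': '0', '#text': 'Hello'},
    {'@t_id': '2', '@sentence': '0', '@number': '1', '#text': 'world'},
    {'@t_id': '3', '@sentence': '1', '@number': '0', '#text': 'Bye'},
]
sample_relation = {
    '@r_id': '1',
    'source': [{'@m_id': '10'}, {'@m_id': '11'}],
    'target': {'@m_id': '12'},
}

print(tokens_by_sentence(sample_tokens))
print(relation_m_ids(sample_relation))
```

On the notebook's data, the equivalent calls would be `tokens_by_sentence(dct['Document']['token'])` and `relation_m_ids(dct['Document']['Relations']['CROSS_DOC_COREF'][0])`.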