# Split sentences

In this notebook, we split the Cantemist training and development corpora into sentences (sentence splitting, SS). We use a custom version of the [Sentence-Splitter tool](https://github.com/PlanTL-SANIDAD/SPACCC_Sentence-Splitter) developed by [Plan-TL Sanidad](https://www.plantl.gob.es/sanidad/Paginas/sanidad.aspx). In the custom version, we modified the `SentenceSplitter.java` file so that, instead of printing the text of each split sentence, it prints the start and end character positions of each sentence.
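
To illustrate how offset output of this kind could be consumed downstream, here is a minimal sketch. It assumes, hypothetically, that each output line holds a start and an end character offset separated by whitespace, with the end offset exclusive; the exact format printed by the custom tool is not shown in this notebook. The function name `read_sentences` and the toy text are illustrative only.

```python
def read_sentences(text, offsets_str):
    """Recover sentence strings from character offsets.

    Assumption (not confirmed by the tool): one "start end" pair per line,
    with an exclusive end offset.
    """
    sentences = []
    for line in offsets_str.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        start, end = int(parts[0]), int(parts[1])
        sentences.append(text[start:end])
    return sentences


# Toy example, not taken from the Cantemist corpus
text = "Paciente estable. Sin fiebre."
offsets = "0 17\n18 29\n"
print(read_sentences(text, offsets))
```

If the tool turns out to use inclusive end offsets or a different separator, only the slicing and the `split()` call would need to change.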

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
# Java SE 1.8 required
!java -version

java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)


In [2]:
corpus_path = "../datasets/cantemist_v6/"
sub_task_path = "cantemist-coding/"

In [3]:
out_path = "./Cantemist-SSplit-text/"

In [4]:
ss_tool_path = "./SPACCC_Sentence-Splitter_Custom/"

In [5]:
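# Not only the classpath: the value ends with the main class name (SentenceSplitter),
# so it can be placed directly after "java -classpath"
ss_jar_path = ss_tool_path + "apache-opennlp-1.8.4/lib/opennlp-tools-1.8.4.jar:" + \
              ss_tool_path + "src/ SentenceSplitter"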

In [6]:
ss_model_path = ss_tool_path + "model/es-sentence-splitter-model-spaccc.bin"

## Training

We use the Cantemist train and dev1 corpora together as the training subset.

In [8]:
sub_corpus = "train-set/"
sub_corpus_path = corpus_path + sub_corpus + sub_task_path + "txt/"

In [9]:
%%time
sub_corpus_files = [sub_corpus_path + f for f in os.listdir(sub_corpus_path) if os.path.isfile(sub_corpus_path + f)]

CPU times: user 405 µs, sys: 3.88 ms, total: 4.29 ms
Wall time: 24.5 ms


In [10]:
len(sub_corpus_files)

501

In [11]:
%%time
sub_corpus = "dev-set1/"
sub_corpus_path = corpus_path + sub_corpus + sub_task_path + "txt/"
sub_corpus_files.extend([sub_corpus_path + f for f in os.listdir(sub_corpus_path) if os.path.isfile(sub_corpus_path + f)])

CPU times: user 256 µs, sys: 1.91 ms, total: 2.16 ms
Wall time: 19.5 ms


In [12]:
len(sub_corpus_files)

751
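
The file lists from several corpus directories can also be gathered in one step. The following is a sketch of one way to do it with `pathlib`; the helper name `list_files` is hypothetical, and sorting is added here only to make the order reproducible (`os.listdir` returns entries in arbitrary order).

```python
from pathlib import Path


def list_files(*dirs):
    """Return the regular files found in each directory, as Path objects.

    Files are sorted by name within each directory; subdirectories are skipped.
    """
    files = []
    for d in dirs:
        files.extend(sorted(p for p in Path(d).iterdir() if p.is_file()))
    return files
```

Called with the train and dev1 `txt/` directories, this would return a single list equivalent to the `sub_corpus_files` built above, with `Path` objects in place of strings.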

In [13]:
sub_corpus_out_path = out_path + "training/"

In [14]:
# Create dir if it does not exist
os.makedirs(sub_corpus_out_path, exist_ok=True)

In [15]:
%%time
for sub_file in sub_corpus_files:
    cmd = "java -classpath " + ss_jar_path + " " + sub_file + " " + ss_model_path + " > " + \
          sub_corpus_out_path + sub_file.split('/')[-1]
    os.system(cmd)

CPU times: user 63.9 ms, sys: 1.27 s, total: 1.34 s
Wall time: 3min 38s
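
The loop above builds a shell command by string concatenation, which would break on paths containing spaces and does not report a failing Java call. As an alternative, the following sketch uses `subprocess.run` with an argument list. The helper names `run_to_file` and `splitter_args` are hypothetical, and the classpath and the main class name must be passed as two separate arguments here, unlike the combined `ss_jar_path` string.

```python
import subprocess


def run_to_file(args, out_file):
    """Run a command given as a list of arguments and write its stdout to out_file.

    No shell is involved, so paths with spaces need no quoting.
    Raises subprocess.CalledProcessError if the command exits with a non-zero status.
    """
    with open(out_file, "wb") as out:
        subprocess.run(args, stdout=out, check=True)


def splitter_args(classpath, main_class, in_file, model_path):
    """Build the argument list matching the Java call used in the loop above."""
    return ["java", "-classpath", classpath, main_class, str(in_file), str(model_path)]
```

A loop would then call `run_to_file(splitter_args(...), out_file)` once per input file, stopping at the first file on which the splitter fails.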


## Development

We consider the dev2 Cantemist corpus as the development subset.

In [20]:
sub_corpus = "dev-set2/"
sub_corpus_path = corpus_path + sub_corpus + sub_task_path + "txt/"

In [21]:
%%time
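# Unlike the training cells, this list holds bare file names;
# the directory is prepended when the command is built
sub_corpus_files = [f for f in os.listdir(sub_corpus_path) if os.path.isfile(sub_corpus_path + f)]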

CPU times: user 0 ns, sys: 6.64 ms, total: 6.64 ms
Wall time: 27.6 ms


In [22]:
len(sub_corpus_files)

250

In [23]:
sub_corpus_out_path = out_path + "development/"

In [20]:
# Create dir if it does not exist
os.makedirs(sub_corpus_out_path, exist_ok=True)

In [21]:
%%time
for sub_file in sub_corpus_files:
    cmd = "java -classpath " + ss_jar_path + " " + sub_corpus_path + sub_file + " " + ss_model_path + " > " + \
          sub_corpus_out_path + sub_file
    os.system(cmd)

CPU times: user 2.43 ms, sys: 411 ms, total: 413 ms
Wall time: 1min 7s
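
Because `os.system` does not raise on failure and the shell redirection creates the output file even when the Java call fails, a quick check of the output directory can be useful. The sketch below lists the inputs whose output file is missing or empty; the helper name `missing_or_empty` is hypothetical, and an empty output is only a hint of failure, since an empty input text could also produce one.

```python
import os


def missing_or_empty(in_names, out_dir):
    """Return the input file names with no output file, or an empty one, in out_dir."""
    bad = []
    for name in in_names:
        out_file = os.path.join(out_dir, name)
        if not os.path.isfile(out_file) or os.path.getsize(out_file) == 0:
            bad.append(name)
    return bad
```

Since `sub_corpus_files` holds bare file names in this section, it could be passed directly together with `sub_corpus_out_path`.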
