# Split sentences

In this notebook, we perform sentence splitting (SS) on the CodiEsp training and development corpora. To do so, we use a custom version of the [Sentence-Splitter tool](https://github.com/PlanTL-SANIDAD/SPACCC_Sentence-Splitter) developed by [Plan-TL Sanidad](https://www.plantl.gob.es/sanidad/Paginas/sanidad.aspx). In the custom version, we modified the `SentenceSplitter.java` file so that, instead of printing the text of each split sentence, it prints the start and end character positions of each sentence.
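
Since the custom tool emits character positions rather than sentence text, the sentences have to be recovered from the original document afterwards. The following is a minimal sketch of that step, not the notebook's actual code. It assumes each output line holds a whitespace-separated `start end` pair with an exclusive end offset; the exact output format of the modified `SentenceSplitter.java` is not shown here. The helper names `parse_spans` and `extract_sentences` and the sample text are hypothetical.

```python
from typing import List, Tuple


def parse_spans(ss_output: str) -> List[Tuple[int, int]]:
    """Parse one (start, end) pair per line; the line format is an assumption."""
    spans = []
    for line in ss_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        spans.append((int(parts[0]), int(parts[1])))
    return spans


def extract_sentences(text: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Slice the original text with each span (end offset treated as exclusive)."""
    return [text[start:end] for start, end in spans]


# Hypothetical document and splitter output, for illustration only
text = "Paciente de 45 años. Acude a urgencias."
spans = parse_spans("0 20\n21 39\n")
print(extract_sentences(text, spans))
```

If the tool reports inclusive end offsets instead, the slice would need to be `text[start:end + 1]`.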

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
# Java SE 1.8 required
!java -version

java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)


In [3]:
corpus_path = "./codiesp_v4/"

In [4]:
out_path = "./CodiEsp-SSplit-text/"

In [5]:
ss_tool_path = "./SPACCC_Sentence-Splitter_Custom/"

In [6]:
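# Classpath (OpenNLP jar plus the tool's src/ directory) followed by the main class name.
# Note: ":" is the classpath separator on Linux/macOS; Windows uses ";".
ss_jar_path = ss_tool_path + "apache-opennlp-1.8.4/lib/opennlp-tools-1.8.4.jar:" + \
              ss_tool_path + "src/ SentenceSplitter"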

In [7]:
ss_model_path = ss_tool_path + "model/es-sentence-splitter-model-spaccc.bin"
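
Because the splitter is launched through the shell later on, a wrong path only surfaces as an empty output file. A quick check of the paths defined above can catch that early. This is a sketch under the assumption that the directory layout matches the strings in the previous cells; `missing_paths` is a hypothetical helper, not part of the tool.

```python
import os


def missing_paths(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


# Values copied from the cells above
ss_tool_path = "./SPACCC_Sentence-Splitter_Custom/"
required = [
    "./codiesp_v4/",
    ss_tool_path + "apache-opennlp-1.8.4/lib/opennlp-tools-1.8.4.jar",
    ss_tool_path + "src/",
    ss_tool_path + "model/es-sentence-splitter-model-spaccc.bin",
]

for path in missing_paths(required):
    print("Missing:", path)
```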

## Train

In [8]:
sub_corpus = "train"

In [9]:
sub_corpus_path = corpus_path + sub_corpus + "/text_files/"

In [10]:
%%time
sub_corpus_files = [f for f in os.listdir(sub_corpus_path) if os.path.isfile(sub_corpus_path + f)]

CPU times: user 3.15 ms, sys: 0 ns, total: 3.15 ms
Wall time: 2.86 ms


In [11]:
len(sub_corpus_files)

500

In [12]:
sub_corpus_out_path = out_path + sub_corpus + "/"

In [13]:
# Create dir if it does not exist
os.makedirs(sub_corpus_out_path, exist_ok=True)

In [14]:
%%time
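# Run the splitter on each file; its stdout (the sentence offsets) is redirected
# to a file with the same name in the output directory
for sub_file in sub_corpus_files:
    cmd = "java -classpath " + ss_jar_path + " " + sub_corpus_path + sub_file + " " + ss_model_path + " > " + \
          sub_corpus_out_path + sub_file
    os.system(cmd)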

CPU times: user 19.9 ms, sys: 874 ms, total: 893 ms
Wall time: 2min 19s


## Dev

In [15]:
sub_corpus = "dev"

In [16]:
sub_corpus_path = corpus_path + sub_corpus + "/text_files/"

In [17]:
%%time
sub_corpus_files = [f for f in os.listdir(sub_corpus_path) if os.path.isfile(sub_corpus_path + f)]

CPU times: user 3.36 ms, sys: 0 ns, total: 3.36 ms
Wall time: 1.79 ms


In [18]:
len(sub_corpus_files)

250

In [19]:
sub_corpus_out_path = out_path + sub_corpus + "/"

In [20]:
# Create dir if it does not exist
os.makedirs(sub_corpus_out_path, exist_ok=True)

In [21]:
%%time
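# Run the splitter on each file; its stdout (the sentence offsets) is redirected
# to a file with the same name in the output directory
for sub_file in sub_corpus_files:
    cmd = "java -classpath " + ss_jar_path + " " + sub_corpus_path + sub_file + " " + ss_model_path + " > " + \
          sub_corpus_out_path + sub_file
    os.system(cmd)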

CPU times: user 19.5 ms, sys: 399 ms, total: 419 ms
Wall time: 1min 4s
