# Sentence Splitting

A tutorial on splitting text into sentences using NLTK and Stanford's CoreNLP. It also includes an example of extracting timestamps from a document with regular expressions. 
First, download CoreNLP from https://stanfordnlp.github.io/CoreNLP/. 
Unzip the file, cd into the extracted folder from the terminal, and run: 

- java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 

This starts a CoreNLP server on your local machine. You can check that it is running by opening http://localhost:9000 in your browser. 
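
The same check can be done from Python before any text is sent to the server. The sketch below uses only the standard library; the helper name `server_is_up` is hypothetical, and it assumes the server answers a plain GET on its root URL with status 200, as the browser check suggests.

```python
from urllib.error import URLError
from urllib.request import urlopen


def server_is_up(url="http://localhost:9000", timeout=5):
    """Return True if an HTTP server answers at `url` with status 200."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


print("CoreNLP server reachable:", server_is_up())
```

If this prints `False`, the `java` command above is probably not running or is listening on a different port.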

To use the Python wrapper for CoreNLP, run the following in your terminal: 
- pip install pycorenlp 

credit: http://titipata.github.io/2016/11/09/sentence-split.html

In [1]:
# import packages 
import pandas as pd 
import numpy as np 
from os import system 
import re 
from pycorenlp import StanfordCoreNLP

In [2]:
# python wrapper for coreNLP 
nlp = StanfordCoreNLP('http://localhost:9000')

In [12]:
path = 'input.txt'

# read the whole file; the with-block closes it automatically
with open(path, 'r') as f:
    data = f.read()
data

"This is sample text for setence splitting. I don't know what else to say. It is 12:20 pm right now. "

# NLTK Method 



In [13]:
from nltk.tokenize import word_tokenize, sent_tokenize

def sent_split(documents):
    """Split text into sentences, then split each sentence into word tokens."""
    words = [word_tokenize(sent) for sent in sent_tokenize(documents)]
    return words

sent_split(data)

[['This', 'is', 'sample', 'text', 'for', 'setence', 'splitting', '.'],
 ['I', 'do', "n't", 'know', 'what', 'else', 'to', 'say', '.'],
 ['It', 'is', '12:20', 'pm', 'right', 'now', '.']]
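
For comparison, here is a rough sketch of what a purely rule-based splitter looks like. It is not part of NLTK; the function name `naive_sent_split` and the example sentences are made up for illustration. It handles simple text but breaks on abbreviations, which is one reason to prefer a trained tokenizer such as `sent_tokenize`.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


simple = "This is sample text. It is 12:20 pm right now."
print(naive_sent_split(simple))
# ['This is sample text.', 'It is 12:20 pm right now.']

tricky = "Dr. Smith arrived. He left."
print(naive_sent_split(tricky))
# ['Dr.', 'Smith arrived.', 'He left.']  <- "Dr." is wrongly treated as a sentence
```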

# CoreNLP Method

In [14]:
def sentence_split(text, properties):
    """Return a list of sentences, each a list of word tokens, from CoreNLP."""
    annotated = nlp.annotate(text, properties)
    sentences = list()
    for sentence in annotated['sentences']:
        # keep only the surface form of each token
        s = [t['word'] for t in sentence['tokens']]
        sentences.append(s)
    return sentences

splited = sentence_split(data, properties = {'annotators': 'ssplit','outputFormat': 'json'})
splited

[['This', 'is', 'sample', 'text', 'for', 'setence', 'splitting', '.'],
 ['I', 'do', "n't", 'know', 'what', 'else', 'to', 'say', '.'],
 ['It', 'is', '12:20', 'pm', 'right', 'now', '.']]

In [15]:
# join the tokens of each sentence back into a single string 
joined = [' '.join(x) for x in splited]
joined

['This is sample text for setence splitting .',
 "I do n't know what else to say .",
 'It is 12:20 pm right now .']
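
Joining with a single space leaves a space before the final period and keeps `do n't` split in two. If the original spacing matters, a small regex-based cleanup can undo the most common cases. This is a minimal sketch, not a full detokenizer; the function name `detokenize` and the list of clitics are assumptions chosen for this example.

```python
import re


def detokenize(tokens):
    """Join tokens and remove the spaces that tokenization introduced."""
    text = " ".join(tokens)
    # Re-attach clitics such as n't, 's, 're to the preceding word.
    text = re.sub(r" (n't|'s|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    # Remove the space before closing punctuation.
    text = re.sub(r" ([.,!?;:])", r"\1", text)
    return text


print(detokenize(['I', 'do', "n't", 'know', 'what', 'else', 'to', 'say', '.']))
# I don't know what else to say.
```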

# Extracting Time Stamps

In [16]:
# patterns 
# longer alternatives come first so that 'weekday' is not cut short at 'week' 
week = r'weekend|weekday|week|day|year|month'
season = r'\bspring\b|\bsummer\b|\bfall\b|\bwinter\b'
# dd/mm/yyyy or dd/mm/yy, separated by '/', '.' or '-' 
date = r'[0-9]{1,2}[/\.-][0-9]{1,2}[/\.-][0-9]{4}|[0-9]{1,2}[/\.-][0-9]{1,2}[/\.-][0-9]{2}'
date3 = r'on [0-9]{1,2}[/\.-][0-9]{1,2}'
dayname = r'monday|tuesday|wednesday|thursday|friday|saturday|sunday'
dayname_short = r'\bmon\b|\btue\b|\bwed\b|\bthurs\b|\bfri\b|\bsat\b|\bsun\b'
month = r'january|february|march|april|\bmay\b|june|july|august|september|october|november|december'
month_short = r'\bjan\b|\bfeb\b|\bmar\b|\bapr\b|\bmay\b|\bjun\b|\bjul\b|\baug\b|\bsep\b|\boct\b|\bnov\b|\bdec\b'
year = r'[0-9]{4}' # 2017 
# the dots are escaped so that they match a literal '.', as in 'p.m.' 
time = r'[0-9]{1,2}:[0-9]{1,2}|[0-9]{1,2}\s?[ap]\.?m\.?' # 14:00, 2 pm 
relative = r'\bnow\b|today|yesterday|tomorrow' # \b keeps 'know' from matching 'now' 
relative2 = ('(this|next|last|coming|past|every) '
             + '(' + '|'.join([week, dayname, dayname_short, month, month_short, season]) + ')')

# combine all patterns (the old date2 repeated the first alternative of date and was dropped) 
pattern = "|".join([relative, relative2, season, date, date3, 
                    dayname, dayname_short, month, month_short, year, time])
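
Two details of Python's `re` module matter for patterns like these: word boundaries (`\b`) and the order of alternatives. The short demonstration below uses a made-up sentence to show both effects.

```python
import re

text = "i do n't know what happens next weekday"

# Without word boundaries, "now" is found inside "know".
print(re.search(r"now", text).group())           # now
print(re.search(r"\bnow\b", text))               # None

# Alternation picks the first alternative that matches at a position,
# not the longest one, so longer alternatives should come first.
print(re.search(r"week|weekday", text).group())  # week
print(re.search(r"weekday|week", text).group())  # weekday
```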

In [18]:
# return sentences with timestamps 
count = 0
for sentence in joined:
    
    match = re.search(pattern , sentence.lower()) 

    if match:
        print(count)
        print(match)    
        print(sentence)
        print()
        count+= 1

0
<_sre.SRE_Match object; span=(6, 11), match='12:20'>
It is 12:20 pm right now .
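
`re.search` reports only the first match, so `now` in the same sentence goes unreported. To list every timestamp in a sentence, `re.finditer` can be used instead. The sketch below uses a reduced, hypothetical pattern (`demo_pattern`) so that it runs on its own; the full `pattern` from the previous cell could be compiled the same way.

```python
import re

# Reduced pattern for illustration only: relative words and hh:mm times.
demo_pattern = re.compile(
    r"\bnow\b|today|yesterday|tomorrow|[0-9]{1,2}:[0-9]{2}",
    re.IGNORECASE,  # avoids having to lowercase the sentence first
)


def find_timestamps(sentence):
    """Return (matched text, span) for every timestamp in the sentence."""
    return [(m.group(), m.span()) for m in demo_pattern.finditer(sentence)]


print(find_timestamps("It is 12:20 pm right now ."))
# [('12:20', (6, 11)), ('now', (21, 24))]
```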



In [None]:
# save the sentences that contain a timestamp to a file 

with open('output.txt', 'w') as f:
    for sentence in joined:
        match = re.search(pattern, sentence.lower())
        if match:
            f.write(sentence + '\n')