# Preprocessing


## 0. preparing data

Text preprocessing is a crucial step in natural language processing (NLP) tasks, as it transforms text into forms that computational models and machine learning algorithms can process.

The objective of this lab is to perform the following preprocessing tasks:
(1) Tokenization: splitting strings of text into smaller meaningful units, which are known as "tokens";
(2) Sentence Boundary Disambiguation/Detection (SBD): processing text at the sentence level, which entails identifying where each sentence starts and ends.

In this lab, we employ two tokenization scripts: the Penn Treebank-style tokenization script (`tokenizer.sed`) and the tokenizer from the Moses corpus preparation scripts (`tokenizer.perl`). For SBD, we use the pretrained model Splitta.
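
As a concrete illustration of what a token is, the following minimal Python sketch separates words from punctuation with a single regular expression. The function name `simple_tokenize` is only an illustrative assumption; it is far cruder than `tokenizer.sed` or `tokenizer.perl` (for instance, it would break `15-year` into three tokens) and is not how either script works internally.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (illustrative only)."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("From the AP comes this story:"))
# ['From', 'the', 'AP', 'comes', 'this', 'story', ':']
```

The colon becomes a token of its own, which is the same effect the lab scripts have on the first line of the data.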

In [1]:
!wget https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/data/en_ewt-ud-dev-alt.txt
!wget https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/data/en_ewt-ud-dev.txt    

--2021-10-01 00:08:06--  https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/data/en_ewt-ud-dev-alt.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-10-01 00:08:06 ERROR 404: Not Found.

--2021-10-01 00:08:06--  https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/data/en_ewt-ud-dev.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 126870 (124K) [text/plain]
Saving to: ‘en_ewt-ud-dev.txt’


2021-10-01 00:08:07 (1.77 MB/s) - ‘e

## 1. tokenization
### `tokenizer.sed`

First, download the Penn Treebank-style tokenization script (`tokenizer.sed`), which is available at ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenizer.sed.

Check the permissions of `tokenizer.sed` with `ls -l`.
`chmod` sets the permissions of files and directories.
`755` consists of three digits, each of which encodes the permission to read (`r`), write (`w`) and execute (`x`) the file. The first digit is the permission for the user who owns the file (you); the second digit is the permission for members of the group that owns the file; the third digit is the permission for all other users.
Each digit is a sum of the following numbers:
 - `4`: read
 - `2`: write
 - `1`: execute
 
In this case,
- "`7`" = "4+2+1", which means the owner of the file/directory can read (4), write (2) and execute (1) it
- "`5`" = "4+0+1", which means members of the owning group can read (4), cannot write (0) but can execute (1) it
- "`5`" = same as above, but this digit sets the permissions for other users

Therefore, `chmod 755 tokenizer.sed` makes `tokenizer.sed` readable and executable by everyone but writable only by its owner, which `ls -l` displays as `-rwxr-xr-x`.
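
The arithmetic above can be checked with a short Python sketch that converts a numeric mode into the `rwx` string printed by `ls -l`. The helper name `describe_mode` is hypothetical and only serves this illustration.

```python
def describe_mode(mode):
    """Render a numeric mode such as 0o755 as an 'rwxr-xr-x' style string."""
    flags = ""
    # Walk through owner, group and others: three bits each, highest first.
    for shift in (6, 3, 0):
        digit = (mode >> shift) & 0b111
        flags += "r" if digit & 4 else "-"
        flags += "w" if digit & 2 else "-"
        flags += "x" if digit & 1 else "-"
    return flags

print(describe_mode(0o755))  # rwxr-xr-x
print(describe_mode(0o644))  # rw-r--r--
```

Note the `0o` prefix: the digits of a mode are octal, so `0o755` is not the decimal number 755.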

In [1]:
!wget ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenizer.sed
!ls -l tokenizer.sed
!chmod 755 tokenizer.sed 

--2021-12-10 14:27:18--  ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenizer.sed
           => ‘tokenizer.sed.2’
Resolving ftp.cis.upenn.edu (ftp.cis.upenn.edu)... 158.130.67.137
Connecting to ftp.cis.upenn.edu (ftp.cis.upenn.edu)|158.130.67.137|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/treebank/public_html ... done.
==> SIZE tokenizer.sed ... 2204
==> PASV ... done.    ==> RETR tokenizer.sed ... done.
Length: 2204 (2.2K) (unauthoritative)


2021-12-10 14:27:19 (64.8 KB/s) - ‘tokenizer.sed.2’ saved [2204]

-rwxr-xr-x 1 kkwt kkwt 2290 Sep 21 17:42 tokenizer.sed


In [3]:
!head -n 1 tokenizer.sed

#!/bin/sed -f


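We are now able to run `tokenizer.sed` to perform tokenization on the raw text. Piping the output through `head` shows only the first ten lines; the `Broken pipe` message at the end is harmless and appears because `head` stops reading before the tokenizer has finished writing.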

In [4]:
!./tokenizer.sed en_ewt-ud-dev.txt | head

From the AP comes this story : 

President Bush on Tuesday nominated two individuals to replace retiring jurists 
on federal courts in the Washington area. Bush nominated Jennifer M. Anderson 
for a 15-year term as associate judge of the Superior Court of the District of 
Columbia , replacing Steffen W. Graae. *** Bush also nominated A. Noel Anketell 
Kramer for a 15-year term as associate judge of the District of Columbia Court 
of Appeals , replacing John Montague Steadman . 

The sheikh in wheel-chair has been attacked with a F-16-launched bomb. He could 
/bin/sed: couldn't write 65 items to stdout: Broken pipe


### `tokenizer.perl`

**TODO** complete commands for `tokenizer.perl`

First, download `scripts.tgz`, which is available at https://www.statmt.org/wmt09/scripts.tgz, and extract the Moses script `tokenizer.perl` from it.

In [5]:
!wget https://www.statmt.org/wmt09/scripts.tgz

--2021-10-01 00:08:09--  https://www.statmt.org/wmt09/scripts.tgz
Resolving www.statmt.org (www.statmt.org)... 129.215.197.184
Connecting to www.statmt.org (www.statmt.org)|129.215.197.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4927 (4.8K) [application/x-gzip]
Saving to: ‘scripts.tgz’


2021-10-01 00:08:10 (33.2 MB/s) - ‘scripts.tgz’ saved [4927/4927]



In [6]:
!tar xvfz scripts.tgz

scripts/
scripts/detokenizer.perl
scripts/wrap-xml.perl
scripts/lowercase.perl
scripts/tokenizer.perl
scripts/reuse-weights.perl
scripts/nonbreaking_prefixes/
scripts/nonbreaking_prefixes/nonbreaking_prefix.de
scripts/nonbreaking_prefixes/nonbreaking_prefix.en
scripts/nonbreaking_prefixes/nonbreaking_prefix.el


In [7]:
!ls -l scripts/tokenizer.perl

-rwxr-xr-x 1 kkwt kkwt 4207 Sep 27  2007 scripts/tokenizer.perl


We are now able to run `tokenizer.perl` to perform tokenization on the raw text.

In [8]:
!./scripts/tokenizer.perl -l en < en_ewt-ud-dev.txt | head

Tokenizer v3
Language: en
From the AP comes this story :

President Bush on Tuesday nominated two individuals to replace retiring jurists
on federal courts in the Washington area . Bush nominated Jennifer M. Anderson
for a 15-year term as associate judge of the Superior Court of the District of
Columbia , replacing Steffen W. Graae . * * * Bush also nominated A. Noel Anketell
Kramer for a 15-year term as associate judge of the District of Columbia Court
of Appeals , replacing John Montague Steadman .

The sheikh in wheel-chair has been attacked with a F-16-launched bomb . He could
Unable to flush stdout: Broken pipe
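
The two tokenizers do not produce identical output. As a rough way to see where they differ, the sketch below compares one line taken from each of the outputs above, token by token. The helper name `token_diff` is an assumption made for this example.

```python
# One corresponding line from each tokenizer's output shown above.
sed_line = "on federal courts in the Washington area. Bush nominated Jennifer M. Anderson"
moses_line = "on federal courts in the Washington area . Bush nominated Jennifer M. Anderson"

def token_diff(a, b):
    """Return the whitespace-separated tokens that occur in only one of the two lines."""
    a_tokens, b_tokens = a.split(), b.split()
    only_a = [t for t in a_tokens if t not in b_tokens]
    only_b = [t for t in b_tokens if t not in a_tokens]
    return only_a, only_b

print(token_diff(sed_line, moses_line))
# (['area.'], ['area', '.'])
```

On this line, `tokenizer.perl` separates the line-internal period after `area`, whereas `tokenizer.sed` leaves it attached; both keep the period of the initial `M.` attached.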


## 2. sentence boundary detection
### `sbd.py`

First, download `splitta-0.1.0.tar.gz` from https://files.pythonhosted.org/packages/cf/d2/9771eb65f1dc3925dbcfc7c4b2adaefa38e1549e4e4e75409df316f8c453/splitta-0.1.0.tar.gz and extract it; the archive contains the sentence splitter `sbd.py`.

In [9]:
!wget https://files.pythonhosted.org/packages/cf/d2/9771eb65f1dc3925dbcfc7c4b2adaefa38e1549e4e4e75409df316f8c453/splitta-0.1.0.tar.gz

--2021-10-01 00:08:10--  https://files.pythonhosted.org/packages/cf/d2/9771eb65f1dc3925dbcfc7c4b2adaefa38e1549e4e4e75409df316f8c453/splitta-0.1.0.tar.gz
Resolving files.pythonhosted.org (files.pythonhosted.org)... 151.101.193.63, 151.101.1.63, 151.101.65.63, ...
Connecting to files.pythonhosted.org (files.pythonhosted.org)|151.101.193.63|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6943603 (6.6M) [application/octet-stream]
Saving to: ‘splitta-0.1.0.tar.gz’


2021-10-01 00:08:12 (6.39 MB/s) - ‘splitta-0.1.0.tar.gz’ saved [6943603/6943603]



**TODO** complete commands for `sbd.py`

In [10]:
!tar xvfz splitta-0.1.0.tar.gz

splitta-0.1.0/
splitta-0.1.0/MANIFEST.in
splitta-0.1.0/PKG-INFO
splitta-0.1.0/README.rst
splitta-0.1.0/setup.cfg
splitta-0.1.0/setup.py
splitta-0.1.0/splitta/
splitta-0.1.0/splitta/__init__.py
splitta-0.1.0/splitta/model_nb/
splitta-0.1.0/splitta/model_nb/feats
splitta-0.1.0/splitta/model_nb/lower_words
splitta-0.1.0/splitta/model_nb/non_abbrs
splitta-0.1.0/splitta/model_svm/
splitta-0.1.0/splitta/model_svm/feats
splitta-0.1.0/splitta/model_svm/lower_words
splitta-0.1.0/splitta/model_svm/non_abbrs
splitta-0.1.0/splitta/model_svm/sample.txt
splitta-0.1.0/splitta/model_svm/sbd.py
splitta-0.1.0/splitta/model_svm/svm_model
splitta-0.1.0/splitta/sbd.py
splitta-0.1.0/splitta/sbd_util.py
splitta-0.1.0/splitta/word_tokenize.py
splitta-0.1.0/splitta.egg-info/
splitta-0.1.0/splitta.egg-info/dependency_links.txt
splitta-0.1.0/splitta.egg-info/PKG-INFO
splitta-0.1.0/splitta.egg-info/SOURCES.txt
splitta-0.1.0/splitta.egg-info/top_level.txt


In [2]:
cd splitta-0.1.0/

/home/kkwt/splitta-0.1.0


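Since `sbd.py` is a Python script (`.py`), it has to be run with a Python interpreter to perform SBD. Depending on the system, the interpreter is invoked as `python3` or `python2`. As the error below shows, `sbd.py` uses the Python 2 `print` statement, so it fails under `python3` and has to be run with `python2`.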

In [12]:
!python3 ./splitta/sbd.py

  File "/home/kkwt/splitta-0.1.0/splitta-0.1.0/./splitta/sbd.py", line 537
    print '[%d] [%1.4f] %s?? %s' %(frag.label, frag.pred, w1, w2)
          ^
SyntaxError: invalid syntax
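
The `SyntaxError` above is raised because `print` is a statement in Python 2 but a function in Python 3. The following sketch, which uses only the built-in `compile()`, checks whether a snippet parses under the interpreter that runs it; the function name `compiles_in_this_python` and the two sample snippets are assumptions made for this illustration.

```python
def compiles_in_this_python(source):
    """Return True if the source parses under the running interpreter."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False

py2_style = "print '[%d] %s' % (1, 'text')"   # Python 2 print statement
py3_style = "print('[%d] %s' % (1, 'text'))"  # function call, valid in Python 3

print(compiles_in_this_python(py2_style))  # False when run with python3
print(compiles_in_this_python(py3_style))  # True
```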


In [3]:
!python2 ./splitta/sbd.py --help

Usage: sbd.py [options] <text_file>

Options:
  -h, --help            show this help message and exit
  -v, --verbose         verbose output
  -t, --tokenize        write tokenized output
  -m MODEL_PATH, --model=MODEL_PATH
                        model path
  -o OUTPUT, --output=OUTPUT
                        write sentences to this file
  -x TRAIN, --train=TRAIN
                        train a new model using this labeled data file
  -c, --svm             use SVM instead of Naive Bayes for training


Now we are able to run `sbd.py` to perform SBD on the raw text (`en_ewt-ud-dev.txt`), using `model_svm` with the `-m` option.
The output shows that the text has been segmented into sentences, one per line, each ending in a period.

In [3]:
!python2 ./splitta/sbd.py -m ./splitta/model_svm ../en_ewt-ud-dev.txt | head

loading model from [./splitta/model_svm/]... done!
reading [../en_ewt-ud-dev.txt]
featurizing... done!
SVM classifying... done!
From the AP comes this story : President Bush on Tuesday nominated two individuals to replace retiring jurists on federal courts in the Washington area.
Bush nominated Jennifer M.
Anderson for a 15-year term as associate judge of the Superior Court of the District of Columbia, replacing Steffen W.
Graae.
*** Bush also nominated A.
Noel Anketell Kramer for a 15-year term as associate judge of the District of Columbia Court of Appeals, replacing John Montague Steadman.

The sheikh in wheel-chair has been attacked with a F-16-launched bomb.
He could be killed years ago and the israelians have all the reasons, since he founded and he is the spiritual leader of Hamas, but they didn't.
Today's incident proves that Sharon has lost his patience and his hope in peace.
Traceback (most recent call last):
  File "./splitta/sbd.py", line 671, in <module>
    test.segment(u

Note, however, that the period in an abbreviation is sometimes mistaken for a sentence-final period. For example, the name Jennifer M. Anderson was split across two sentences.
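
To see why initials are hard, the sketch below splits a shortened version of the sample text on every period followed by whitespace, and then applies one simple heuristic: a piece that ends in a single capital letter plus a period is merged with the piece that follows. Both function names are hypothetical, and this rule is only an illustration, not the method Splitta uses; it would also wrongly merge a genuine sentence ending such as "vitamin C."

```python
import re

def naive_split(text):
    """Split after every period that is followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]

def split_keeping_initials(text):
    """Like naive_split, but re-join pieces that end in an initial such as 'M.'."""
    merged = []
    for piece in naive_split(text):
        # If the previous piece ends in a single capital letter plus period, join them.
        if merged and re.search(r"\b[A-Z]\.$", merged[-1]):
            merged[-1] += " " + piece
        else:
            merged.append(piece)
    return merged

text = "Bush nominated Jennifer M. Anderson. Bush also nominated A. Noel Anketell Kramer."
print(len(naive_split(text)))         # 4
print(split_keeping_initials(text))
# ['Bush nominated Jennifer M. Anderson.', 'Bush also nominated A. Noel Anketell Kramer.']
```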

Accordingly, one can employ `svm_light` to solve this classification/pattern recognition problem. SVM stands for Support Vector Machine, a supervised machine learning algorithm that can be used for classification.
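
For intuition, a linear SVM classifies an example by the sign of a weighted sum of its features plus a bias. The sketch below applies that decision rule to two candidate periods. The feature names, weights and bias are invented for illustration only; they are assumptions and do not come from Splitta's model or from `svm_light`.

```python
def svm_decision(weights, bias, features):
    """Linear SVM decision rule: label is the sign of w . x + b."""
    score = bias + sum(weights.get(name, 0.0) * value
                       for name, value in features.items())
    return (1 if score >= 0 else -1), score

# Hypothetical weights: +1 means "sentence boundary", -1 means "no boundary".
weights = {
    "next_word_capitalized": 1.2,
    "prev_token_is_single_letter": -2.0,
    "prev_token_length": 0.1,
}
bias = -0.5

# Candidate period in "M. Anderson" (an initial).
initial = {"next_word_capitalized": 1, "prev_token_is_single_letter": 1, "prev_token_length": 1}
# Candidate period in "area. Bush" (a real sentence end).
sentence_end = {"next_word_capitalized": 1, "prev_token_is_single_letter": 0, "prev_token_length": 4}

print(svm_decision(weights, bias, initial)[0])       # -1
print(svm_decision(weights, bias, sentence_end)[0])  # 1
```

In a trained model the weights are learned from labeled data rather than set by hand.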

Download `svm_light.tar.gz` from http://download.joachims.org/svm_light/current/svm_light.tar.gz and extract it.

In [4]:
cd svm_light

/home/kkwt/splitta-0.1.0/svm_light


In [16]:
!wget http://download.joachims.org/svm_light/current/svm_light.tar.gz

--2021-10-01 00:08:13--  http://download.joachims.org/svm_light/current/svm_light.tar.gz
Resolving download.joachims.org (download.joachims.org)... 81.88.34.174, 81.88.42.187
Connecting to download.joachims.org (download.joachims.org)|81.88.34.174|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://osmot.cs.cornell.edu/svm_light/current/svm_light.tar.gz [following]
--2021-10-01 00:08:14--  http://osmot.cs.cornell.edu/svm_light/current/svm_light.tar.gz
Resolving osmot.cs.cornell.edu (osmot.cs.cornell.edu)... 128.253.51.182
Connecting to osmot.cs.cornell.edu (osmot.cs.cornell.edu)|128.253.51.182|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51026 (50K) [application/x-gzip]
Saving to: ‘svm_light.tar.gz’


2021-10-01 00:08:14 (356 KB/s) - ‘svm_light.tar.gz’ saved [51026/51026]



In [17]:
!tar xvfz svm_light.tar.gz

LICENSE.txt
Makefile
svm_learn.c
kernel.h
svm_learn.h
svm_learn_main.c
svm_classify.c
svm_loqo.c
svm_common.c
svm_common.h
svm_hideo.c


In the `svm_light` directory, run `make all` to compile the SVM-light programmes (`svm_learn` and `svm_classify`).

In [18]:
!make all

gcc -c  -O3                      svm_learn_main.c -o svm_learn_main.o 
gcc -c  -O3                      svm_learn.c -o svm_learn.o 
gcc -c  -O3                      svm_common.c -o svm_common.o 
svm_common.c: In function ‘read_model’:
  600 |   fscanf(modelfl,"SVM-light Version %s\n",version_buffer);
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  605 |   fscanf(modelfl,"%ld%*[^\n]\n", &model->kernel_parm.kernel_type);
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  606 |   fscanf(modelfl,"%ld%*[^\n]\n", &model->kernel_parm.poly_degree);
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  607 |   fscanf(modelfl,"%lf%*[^\n]\n", &model->kernel_parm.rbf_gamma);
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  608 |   fsca

In [19]:
!ls svm_classify

svm_classify


In [20]:
!pwd

/home/kkwt/splitta-0.1.0/splitta-0.1.0


Now we are able to perform SBD with SVM-light on the raw text. The sentences can be written to a file either with the `-o` option or by redirecting standard output with `>`.

In [5]:
cd ~/splitta-0.1.0

/home/kkwt/splitta-0.1.0


In [7]:
!python2 ./splitta/sbd.py -m ./splitta/model_svm ../en_ewt-ud-dev.txt -o ../en_ewt-ud-dev.txt.sbd_splitta

loading model from [./splitta/model_svm/]... done!
reading [../en_ewt-ud-dev.txt]
featurizing... done!
SVM classifying... done!


In [23]:
!python2 ./splitta/sbd.py -m ./splitta/model_svm ../en_ewt-ud-dev.txt > ../en_ewt-ud-dev.txt.sbd_splitta

loading model from [./splitta/model_svm/]... done!
reading [../en_ewt-ud-dev.txt]
featurizing... done!
SVM classifying... done!
