<a href="https://colab.research.google.com/github/slkh/COVIDPause/blob/master/Dataset_exploration_and_preperation_excercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This exercise gives a perspective on data preparation for a simple morphological segmentation task.

There are two main goals in this exercise: 
- Prepare and explore data
- Get familiar with unix commands

In a larger project, you will likely need to write a proper script to prepare the data. But sometimes you want to explore the data and get quick counts for different scenarios before fully committing to using it. This is what we are doing in this exercise.

--- 

The data we are using comes from a [morphological segmentation shared task](https://github.com/sigmorphon/2022SegmentationST).


_❓ TL;DR: A shared task is a competition with a defined task and data prepared specifically for that task; participants are asked to submit systems that solve the task with high accuracy._

---

## Download data
First, let's clone the repo containing the data:

In [None]:
!git clone https://github.com/sigmorphon/2022SegmentationST.git

Now let's navigate to where the data is sitting:

1. View the contents of the main directory: 

In [None]:
!ls -l 2022SegmentationST/

2. Navigate to the data folder _and_ list its contents:

In [None]:
%cd 2022SegmentationST/data/
!ls -l

---
## Explore data files
We can observe the following:
- There are files for 9 languages: `ces`, `eng`, `fra`, `hun`, `ita`, `lat`, `mon`, `rus`, and `spa`.
- For some languages we have two types of data sets: `sentence`, which is running natural language (text), and `word`, which is a list of word types.
- For each language and each type, the data is split into three standard splits: `train`, `dev`, `test`.
- All the files are in `.tsv` format, meaning that the values in one line are tab separated.
- `test` has two files: one with `gold` and one without.

For convenience, we will work with the English data, labeled `eng`.
Since we want to work with more naturalistic data, we will look at the contents of the `sentence` files:

In [None]:
!head eng.sentence.train.tsv

The command `head` shows the first 10 lines of a given file. If we look carefully at the first line, we can see that there are two sentences (note that the lines wrap when they are displayed).

Let's look more carefully:

In [None]:
# look at the very first line of the file only:
!head -1 eng.sentence.train.tsv

In [None]:
# the following command `cut` literally "cuts" the line according to a delimiter, by default a tab `\t`; `-f1` keeps the first field
!head -1 eng.sentence.train.tsv|cut -f1

In [None]:
!head -1 eng.sentence.train.tsv|cut -f2

We can see that the first column (after running `cut -f1`) is the _whitespace_ tokenized sentence, while the second one (`cut -f2`) contains the segmented words. According to the documentation of the data, `@@` marks a morpheme: any morpheme prefixed with `@@` attaches to whatever is to its left.

We now know that each `sentence` file will have the format of:

```
Whitespace tokenized sentence \t White @@space token @@ize @@ed sentence
```
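
As a rough Python counterpart to the `cut -f1` / `cut -f2` calls, the sketch below splits one line of this format into its two columns. The function name `parse_sentence_line` and the example line are made up for illustration (the line is not taken from the dataset), and the sketch assumes the line really has two tab-separated columns.

```python
def parse_sentence_line(line):
    """Split one TSV line into (raw_tokens, segmented_tokens)."""
    # Column 1 is the whitespace-tokenized sentence, column 2 its segmentation.
    raw, segmented = line.rstrip("\n").split("\t")[:2]
    return raw.split(" "), segmented.split(" ")


# Made-up line that follows the format shown above (not real data).
example = "Whitespace tokenized sentence\tWhite @@space token @@ize @@ed sentence\n"
raw_tokens, seg_tokens = parse_sentence_line(example)
print(raw_tokens)
print(seg_tokens)
```

Note that the two token lists have different lengths here, because the segments are still separate tokens at this point.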

Let's see how many sentences there are in each `eng.sentence.*.tsv` file:

In [None]:
# to view counts we run the `wc` command, to look at "line" numbers only we use the "-l" option
!wc -l eng.sentence.*.tsv

We notice that `dev` and `test` have similar sentence counts, while `train` has many more sentences. This makes sense because we usually need more data to _train_ a system.

Additionally, we see that there are two files for `test`, one of which has `gold` in the name. This means that the file has the answers (i.e., the segmentations). In shared tasks, the `test` file without answers is distributed at the very end of the competition to fairly evaluate the submitted systems, and the _answers_ are usually revealed after the shared task comes to an end. To see the difference, let's look at the first line from both files:

In [None]:
!head -1 eng.sentence.test.tsv

In [None]:
!head -1 eng.sentence.test.gold.tsv

For our purposes we will be using the following files:
```
eng.sentence.dev.tsv
eng.sentence.test.gold.tsv
eng.sentence.train.tsv
```


---

## Processing data
Let's create our own directory and copy the data files that we are going to use to the new directory.

First, let's go to the _root_ directory:

In [None]:
# to view where we are now, we run the command:
!pwd

In [None]:
# now we go two levels "up":
%cd ../../

Now we create a new **dir**ectory (aka folder) and name it `stats`. We then navigate into `stats`:

In [None]:
%mkdir stats

In [None]:
%cd stats

Let's copy the files we are going to use to our directory:

In [None]:
# note the use of the wildcard "*", which matches any sequence of characters
%cp ../2022SegmentationST/data/eng.sentence.*.tsv .

In [None]:
# let's see what files we have:
! ls -l

Since we don't need `eng.sentence.test.tsv`, we will remove it.

**WARNING: don't use the following command -in general- unless you are really sure about what you are doing, there is absolutely NO GOING BACK from this command!**

In [None]:
%rm eng.sentence.test.tsv

In [None]:
# let's check on the files again:
!ls -l

Now, for each file, let's create a file with two columns where one has the "raw" word and the other has the segmented word. There are many ways to do this, but here let's do it using simple commands to get familiar with the different core commands.

First, we take the unsegmented sentence and make it a list of words instead, let's test it on the first line only and then apply it to the entire file:

In [None]:
# the command `cat` is intended for concatenating files, but it can also be used
# to just print out files (despite what unix prescriptivists think)
!cat eng.sentence.train.tsv| head -1|cut -f1

In [None]:
# to turn a sentence into a list of words we replace each space with 
# a new line. We are using perl regular expressions (regex) to do this:
!cat eng.sentence.train.tsv| head -1|cut -f1|perl -pe 's/ /\n/g;'

In [None]:
# we now do this to the entire file by removing `head -1` from the previous 
# command. We then look at the number of "lines" which are word tokens in this case:
!cat eng.sentence.train.tsv|cut -f1|perl -pe 's/ /\n/g;'|wc -l

In [None]:
# let's store what we did into a file using "> file_name"
!cat eng.sentence.train.tsv|cut -f1|perl -pe 's/ /\n/g;' > train_raw

You may now be tempted to do the same thing to get the list of segmented words, which is very reasonable. But let's not forget that the segments in their current format are space separated, so if we try to align both word lists we will get a mismatch. So first, let's attach those segments to their words. We will do this for the first line to show an example and then apply it to the entire file:

In [None]:
# first, look at the segmented side (second column) of the first line; in the
# next step we will replace the space to the left of every `@@` with "nothing"
!cat eng.sentence.train.tsv|cut -f2|head -1

In [None]:
# the regex basically says replace every " @@" with "@@", which translates to 
# remove whitespace before "@@"
!cat eng.sentence.train.tsv|cut -f2|head -1|perl -pe 's/ @@/@@/g;'

In [None]:
# What if we want to replace "@"s with "+" sign so it is more readable and easier to debug?
!cat eng.sentence.train.tsv|cut -f2|head -1|perl -pe 's/ @@/@@/g;'|perl -pe 's/@@/+/g;'

In [None]:
# now let's compress the regex
!cat eng.sentence.train.tsv|cut -f2|head -1|perl -pe 's/ @@/+/g;'

Now that we know how to deal with segments, we can use the same command to turn a sentence into a list:

In [None]:
# here we use the capture group (\S) to make sure that the @@ is followed by a
# non-whitespace character and is not a stand-alone symbol. To keep the character
# we matched, we put it back in the replacement with "$1"
!cat eng.sentence.train.tsv|cut -f2|perl -pe 's/ @@(\S)/+$1/g; s/ /\n/g;' > train_seg
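
For comparison, here is a minimal Python sketch of the same two substitutions, using the standard `re` module. The function name `segmented_words` is hypothetical and the input string is the made-up format example, not a line from the data.

```python
import re


def segmented_words(segmented_sentence):
    """Turn 'token @@ize @@ed' into ['token+ize+ed']."""
    # Same substitution as the perl one-liner: ' @@' followed by a
    # non-whitespace character becomes '+' plus that character.
    joined = re.sub(r" @@(\S)", r"+\1", segmented_sentence)
    # Splitting on spaces plays the role of replacing spaces with newlines.
    return joined.split(" ")


words = segmented_words("White @@space token @@ize @@ed sentence")
print(words)  # ['White+space', 'token+ize+ed', 'sentence']
```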

Now, let's make sure that both word lists have the same number of lines:

In [None]:
!wc -l train_*

### Dealing with issue found in the data
Uh-oh! We see a discrepancy in the number of lines: the list of segmentations has 2 extra tokens (or, equivalently, the raw list has 2 fewer). One quick way to see what's happening is to look for the differences between the two lists (other than the segmented words); chances are that is where the issue lies:

In [None]:
# we will use the "sdiff" command; "-s" suppresses the identical entries, and 
# "egrep -v" filters out the entries that contain a segmentation ("+")
!sdiff -s train_raw train_seg|egrep -v "\+"

The main culprit seems to be two additional words on the segmentation side, `Caucasian` and `West`. Let's look for those sentences in the original files: 

In [None]:
# the option -n prints the line number
!egrep -n 'Caucasian' eng.sentence.train.tsv

We see there are two issues here: the main one is an extra word on the segmentation side, and there is also a change in capitalization.

Let's now look for the other issue. The word "West" is likely to be more frequent than "Caucasian", so first we will see how many sentences contain it and then we will try to narrow down the sentence of interest:

In [None]:
!egrep -n 'West' eng.sentence.train.tsv|wc -l

In [None]:
# We know that the word "Hammer" is present on the "raw" side, so let's look for
# a line that contains both
!egrep -n 'Hammer.*West' eng.sentence.train.tsv|wc -l

In [None]:
# Great, only one sentence
!egrep -n 'Hammer.*West' eng.sentence.train.tsv

We see that we have the exact same issue.
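
Another way to locate such sentences, sketched here in Python under the assumption that every line has two tab-separated columns, is to compare the token counts of the two sides line by line. The function name `misaligned_lines` and the sample lines are made up for illustration.

```python
import re


def misaligned_lines(lines):
    """Yield (line_number, raw_count, seg_count) where the two sides disagree."""
    for number, line in enumerate(lines, start=1):
        raw, segmented = line.rstrip("\n").split("\t")[:2]
        # Attach the segments to their words before counting.
        joined = re.sub(r" @@(\S)", r"+\1", segmented)
        raw_count = len(raw.split(" "))
        seg_count = len(joined.split(" "))
        if raw_count != seg_count:
            yield number, raw_count, seg_count


# Made-up lines: the second one has an extra word on the segmented side.
sample = [
    "the cats\tthe cat @@s\n",
    "a big dog\ta very big dog\n",
]
problems = list(misaligned_lines(sample))
print(problems)  # [(2, 3, 4)]
```

On the real file, `lines` could be the open file object; the reported line numbers are 1-indexed, like the ones printed by `egrep -n`.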

There are multiple ways to deal with this. The sensible thing to do in this case is to _ignore_ those two sentences and not include them in our data. The last thing we want to do is to go in and try to "fix" them ourselves. The common practice in shared tasks is to immediately inform the organizers of the issue; they are usually very responsive and will either fix it or let you know how to deal with it. Until the data is fixed, we usually ignore the ill-formed entries and continue with our work so we are not stuck.

Since we are working with a _copy_ of the data, we will add an indicator at the beginning of those lines to help quickly identify them.

In [None]:
# sed is very powerful, and it is worth learning if you are interested in this kind of
# scripting. Here we will be adding 3 "#" in front of the problematic lines.
!sed -i -e '1212s/^/### /' -e '6502s/^/### /' eng.sentence.train.tsv

In [None]:
# Let's quickly check the file for those lines
!egrep -n '(Hammer.*West|Caucasian)' eng.sentence.train.tsv

Now let's repeat the process of creating the lists and check the line counts again. This time we will make sure to ignore the lines that we just marked:

In [None]:
!cat eng.sentence.train.tsv|egrep -v "^### "|cut -f1|perl -pe 's/ /\n/g;' > train_raw
!cat eng.sentence.train.tsv|egrep -v "^### "|cut -f2|perl -pe 's/ @@(\S)/+$1/g; s/ /\n/g;' > train_seg
!wc -l train_*
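
The mark-then-skip idea can also be sketched in Python. The helper names `mark_lines` and `drop_marked` are hypothetical, and the sample lines are placeholders rather than real data.

```python
def mark_lines(lines, line_numbers, marker="### "):
    """Prefix the given 1-indexed lines with a marker, like sed 'Ns/^/### /'."""
    targets = set(line_numbers)
    return [marker + line if number in targets else line
            for number, line in enumerate(lines, start=1)]


def drop_marked(lines, marker="### "):
    """Skip the marked lines, like egrep -v '^### '."""
    return [line for line in lines if not line.startswith(marker)]


sample = ["first\n", "second\n", "third\n"]
marked = mark_lines(sample, [2])
kept = drop_marked(marked)
print(marked)  # ['first\n', '### second\n', 'third\n']
print(kept)    # ['first\n', 'third\n']
```

Marking instead of deleting keeps the line numbers of the copy stable, which makes it easier to report the affected lines later.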

Awesomesauce! Now let's create a file with the two lists next to each other, separated by a tab `\t`:

In [None]:
# the command `paste` does that job
!paste train_raw train_seg > train_list

In [None]:
# let's look at the first few words
!head train_list

In [None]:
# And the last few words to make sure they are properly aligned
!tail train_list

In [None]:
# triple check the number of lines for all the word lists
!wc -l train_*
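
A Python sketch of the `paste` step, assuming the two word lists are already in memory. The function name `paste_lists` is made up; unlike `paste`, this version refuses to pair lists of different lengths, which turns the line-count sanity check into part of the step itself.

```python
def paste_lists(raw_words, seg_words):
    """Pair raw and segmented words as tab-separated lines."""
    # A length mismatch means the lists are misaligned, so fail loudly.
    if len(raw_words) != len(seg_words):
        raise ValueError(
            f"length mismatch: {len(raw_words)} raw vs {len(seg_words)} segmented"
        )
    return [f"{raw}\t{seg}" for raw, seg in zip(raw_words, seg_words)]


# Made-up words for illustration only.
pairs = paste_lists(["cats", "sat"], ["cat+s", "sat"])
print(pairs)  # ['cats\tcat+s', 'sat\tsat']
```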

Amazing! Now let's do that for both `dev` and `test`, including the sanity checks:

In [None]:
# dev
!cat eng.sentence.dev.tsv|cut -f1|perl -pe 's/ /\n/g;' > dev_raw
!cat eng.sentence.dev.tsv|cut -f2|perl -pe 's/ @@(\S)/+$1/g; s/ /\n/g;' > dev_seg
!wc dev_*

In [None]:
!paste dev_raw dev_seg > dev_list

In [None]:
!head dev_list

In [None]:
!tail dev_list

In [None]:
!wc dev_*

In [None]:
# test
!cat eng.sentence.test.gold.tsv|cut -f1|perl -pe 's/ /\n/g;' > test_raw
!cat eng.sentence.test.gold.tsv|cut -f2|perl -pe 's/ @@(\S)/+$1/g; s/ /\n/g;' > test_seg
!wc test_*

In [None]:
!paste test_raw test_seg > test_list

In [None]:
!head test_list

In [None]:
!tail test_list

In [None]:
!wc test_*

In [None]:
# quick check of the current contents:
!ls -l

In [None]:
# let's check the number of lines in both the original sentence files and our new files
!wc -l eng.* *_list

These numbers make sense, and the differences are consistent between the lists and the files.

### Tokens vs Types
Since we are not using full sentences for "our task" and are just using pairs of input and output, i.e., `raw` and `seg`, we really don't need the repetition of the pairs; one example should be enough. For example, a system should not need to see `said --> say+ed` 10 times before it "learns" how to do it.

_❓ TL;DR: `tokens` refer to the individual instances of words in a text, while `types` are the unique tokens. For example, in this list: `bad,bad,bad,sad,sad,rad` there are 6 tokens and 3 types._
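
The example in the TL;DR can be checked with a couple of lines of Python, where a `set` plays the role of the list of types:

```python
tokens = "bad,bad,bad,sad,sad,rad".split(",")
types = set(tokens)  # a set keeps one copy of each distinct token

print(len(tokens), len(types))  # 6 3
```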

Now let's reduce our lists to types.

There are several ways to get the list of "unique" lines from a file. In this exercise we are going to use a combination of `sort` and `uniq`.

In [None]:
# we will start with train_list, first check the number of lines:
!wc -l train_list

In [None]:
# now we sort the file and then unique the lines, and then check the number of lines:
!sort train_list|uniq|wc

In [None]:
# let's save the types into a new file
!sort train_list|uniq > train_list.types

There is a huge difference between the number of tokens and types, which is expected since the data was extracted from running text. Now, what if we want to keep the information about how frequent each type is? Very simple:

In [None]:
# we use the option -c, we can additionally sort them by frequency and look at
# the top ten most frequent types
!sort train_list|uniq -c|sort -nr|head

#### Get a frequency list

This makes a lot of sense! Now let's save the frequency list for reference:

In [None]:
!sort train_list|uniq -c|sort -nr > train_list.types_freq
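
In Python, `collections.Counter` gives a comparable frequency list. This is only a sketch: the function name `type_frequencies` is hypothetical, the pairs are made up, and ties may come out in a different order than with `sort -nr`.

```python
from collections import Counter


def type_frequencies(pairs):
    """Count (raw, segmented) pairs and return them most frequent first."""
    return Counter(pairs).most_common()


# Made-up token-level pairs for illustration only.
pairs = [
    ("said", "say+ed"), ("the", "the"), ("said", "say+ed"),
    ("the", "the"), ("the", "the"),
]
frequencies = type_frequencies(pairs)
for (raw, seg), count in frequencies:
    print(count, raw, seg, sep="\t")
```

The keys of the counter are the types, so the same object covers both the type list and the frequency list.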

Let's create the same lists for both `dev` and `test`:

In [None]:
# dev
!sort dev_list|uniq > dev_list.types
!sort dev_list|uniq -c|sort -nr > dev_list.types_freq
# test
!sort test_list|uniq > test_list.types
!sort test_list|uniq -c|sort -nr > test_list.types_freq

In [None]:
# quickly peek at the number of lines for all the .types files
!wc *.types

---

## Create OOV splits

At this stage, the files are technically ready for training a segmenter and even evaluating it. However, in tasks such as morphological segmentation, systems must generalize well to unseen words. 

_❓ TL;DR: unseen words or tokens are those tokens that are not present in the dataset used for training a system, in our case `train`. They are also called Out-of-Vocabulary (OOV)_

Most NLP tasks deal with _natural_ language, and therefore it is normal to see some overlap between {`dev`, `test`} and `train`. What is not acceptable is training on `dev` and `test`. In other words, we should not train our system on the same datasets that we are going to evaluate it on: it is not fair, it will give very skewed and biased results, and, most importantly, the system will not be able to generalize!
In many cases where we want to test the limits of system generalization, we keep only unseen (or OOV) instances in both `dev` and `test`. This is what we will do next.

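We are going to use the `comm` command to find the lines that two files have in common, or the lines that only one of them has. `comm` expects sorted files, and since our `.types` files were generated with `sort`, we don't need to sort them again. The option `-13` hides the lines found only in the first file and the lines found in both, leaving the lines found only in the second file.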

In [None]:
# find the OOVs in dev and count them
!comm -13 train_list.types dev_list.types|wc

In [None]:
# save them to a new file
!comm -13 train_list.types dev_list.types > dev_list_oov.types

In [None]:
# do the same for test
!comm -13 train_list.types test_list.types > test_list_oov.types

Note that we do this only for `dev` and `test`, because OOV is defined with respect to `train`.

Now let's look at the overall counts of our types:

In [None]:
!wc -l *.types

For both `dev` and `test` the OOV types are around 30% of the whole set.
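
The OOV filtering and the share of OOV types can be sketched in Python with sets. The function names `oov_types` and `oov_rate` and the sample lines are made up for illustration. As with `comm`, whole lines (raw word plus segmentation) are compared.

```python
def oov_types(train_types, eval_types):
    """Return the eval types that never occur in train, sorted."""
    seen = set(train_types)
    return sorted(t for t in set(eval_types) if t not in seen)


def oov_rate(train_types, eval_types):
    """Share of distinct eval types that are out of vocabulary."""
    return len(oov_types(train_types, eval_types)) / len(set(eval_types))


# Made-up type lines in the same "raw<TAB>segmented" shape as the .types files.
train = ["cat\tcat", "cats\tcat+s"]
dev = ["cats\tcat+s", "dogs\tdog+s", "dog\tdog"]

unseen = oov_types(train, dev)
rate = oov_rate(train, dev)
print(unseen)
print(f"{rate:.0%}")
```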

Congratulations! You have now explored and prepared data for your upcoming segmentation task!

---

# Conclusion

In this exercise we learned how to explore data, quickly create lists and splits, and inspect them using core unix commands. If we were training a segmentation system, our data would technically be ready for training, development, and evaluation. However, as we work with the data more, we may discover bugs or want to tweak how we generated some lists. Therefore, it is wise to write a proper script (e.g. in python) that produces the same files that we produced using the command line. This makes it cleaner to run the pipeline as many times as we need without worrying about accidental error propagation.

## Tips
- If you want to know more about any of the commands we used just type `man` followed by the command to get the manual on the command.
- If you want to run the commands locally on your terminal, you need to remove the `!` and `%` symbols we have here.
- If this notebook is not mounted on your drive or is not local on your machine, all your progress will be lost if the runtime is disconnected.

## Bonus Exercise

Create a python script that will generate the list files, types, type frequencies, and OOV types. Compare the output with what we got here and see if there are any more issues with the data. 