### Text mining intro

This Jupyter notebook is meant to give you a feel for working with files in the context of textual analysis:

    - I have downloaded these files from WoS but what is actually there? (editor);
    - Right, actually I am just interested in a small portion of what I got (text extraction);
    - Ok, this is not what I had in mind (go back to the source) or: prepare to do some serious pre-processing.

Far-fetched examples? Not at all.

The examples are taken from Kenneth Ward Church, "Unix(tm) for Poets".

### Jane Austen: Pride and Prejudice

The Manning book "Natural Language Processing in Action" (NLPIA) comes with some example files. More often than not, these example files are prepared files. In other words, the files are already customized for the software the authors use. For example, every sentence sits on one line (terminated by a newline), instead of the text being formatted no wider than 79 characters, with each physical line (possibly only part of a sentence) ended by a newline.

The NLPIA file is: pride.txt

If we open this file in our editor of choice, we "see" the formatting. Each sentence in the book is a paragraph; that is, there are 2 newlines between sentences (here: sequences of characters terminated by a full stop). A single newline chops up a book sentence into more or less equal parts for display on the screen.

This might not seem a big deal at first glance, but there are all sorts of possible glitches right from the start. How many book sentences does P&P have? We can count the double newlines. In my editor that gives 1866 book sentences.
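
Counting in an editor can also be done in code. The snippet below is a minimal sketch, not the method used above: the sample text and the helper name `count_blocks` are made up for illustration. It shows one of the possible glitches: the number of double newlines and the number of blocks they separate need not be the same.

```python
import re

# Hypothetical sample standing in for the contents of pride.txt.
sample = (
    "It is a truth universally acknowledged, that a single man\n"
    "in possession of a good fortune, must be in want of a wife.\n"
    "\n"
    "However little known the feelings or views of such a man may be,\n"
    "this truth is well fixed.\n"
    "\n"
    "\"My dear Mr. Bennet,\" said his lady to him one day.\n"
)

def count_blocks(text):
    """Count non-empty blocks separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return len([b for b in blocks if b.strip()])

# Separators versus the blocks they separate: the two counts differ.
double_newlines = sample.count("\n\n")
print(double_newlines, count_blocks(sample))
```

Which of the two numbers is the "right" one depends on what we decide a book sentence is, so it pays to check what an editor or script is actually counting.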

Suppose our software can only deal with book sentences that sit on one line in our file. What to do? Easy peasy: just replace single newlines by a space and double newlines by a single newline. *What could possibly go wrong? Can you spot the caveat?*
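
For comparison, here is a minimal Python sketch of the same reformatting. It is one possible approach and not necessarily the one the question above has in mind: it first splits the text on blank lines and only then joins the lines inside each block. The function name `one_block_per_line` and the sample text are assumptions made for illustration.

```python
import re

def one_block_per_line(text):
    """Join hard-wrapped lines within each block; emit one block per line."""
    # Split on blank lines first, so block boundaries are known
    # before any newline inside a block is touched.
    blocks = re.split(r"\n\s*\n", text.strip())
    # str.split() with no argument splits on any whitespace run,
    # including the single newlines inside a block.
    joined = [" ".join(block.split()) for block in blocks if block.strip()]
    return "\n".join(joined) + "\n"

# Hypothetical sample with two hard-wrapped blocks.
sample = (
    "It is a truth\nuniversally acknowledged.\n"
    "\n"
    "However little known\nthe feelings.\n"
)
result = one_block_per_line(sample)
print(result, end="")
```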

In the real world out there things are different. If we download the Jane Austen book from Project Gutenberg, we can choose the format. In our case we choose the plain text UTF-8 file.

http://www.gutenberg.org/ebooks/1342

If we open this file in our editor, we see something different. The beginning of the first book sentence "It is a truth universally acknowledged ..." starts on line 167 in our file. Then the text becomes familiar until the very end where the sequence *** END OF ... signals text added by Project Gutenberg.

The NLPIA authors simply removed the Project Gutenberg sandwich around the Jane Austen text, which makes perfect sense if one is going to do textual analysis of that text.
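
Removing the sandwich by hand works fine for one book. For many files, a small helper could do the same. The sketch below is hedged: the end marker prefix `*** END OF` comes from the file described above, while the start marker prefix `*** START OF`, the function name `strip_gutenberg` and the sample text are assumptions. It only removes what lies outside the two marker lines, so any title or front matter between the start marker and the first book sentence would still need a manual look.

```python
def strip_gutenberg(text, start_marker="*** START OF", end_marker="*** END OF"):
    """Return the text between the start and end marker lines."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith(start_marker):
            start = i + 1   # body begins after the start marker line
        elif line.startswith(end_marker):
            end = i         # body stops before the end marker line
            break
    return "\n".join(lines[start:end]).strip() + "\n"

# Hypothetical miniature of a downloaded file.
sample = (
    "Some header added by the distributor\n"
    "*** START OF THE SAMPLE BOOK ***\n"
    "\n"
    "It is a truth universally acknowledged.\n"
    "\n"
    "*** END OF THE SAMPLE BOOK ***\n"
    "Some licence text\n"
)
body = strip_gutenberg(sample)
print(body, end="")
```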

#### Conclusions

1. It pays off to know the basics of a so-called programming editor:

    - Inspect what is there;
    - Make some quick necessary changes (re-formatting, deleting the Gutenberg sandwich, etc.). When mistakes are made, these editors allow for undo;
    - When the file is saved, we save plain text in UTF-8.

2. If we were to start with the Project Gutenberg file, we end up with 2 files right at the beginning of our project. We can store the PG file we got from their website as a source file (for example in a directory named "src" or "raw") and the pre-processed file in another directory, "wip".

3. If one really dives in, files in different stages tend to pile up. If that is the case, simple version control routines (we use Git) can be of tremendous help. If you have reached a certain result, you check that particular result in as a snapshot. And just as undo comes in handy, these snapshots can be used to roll back to after you have messed things up big time.

Now that we have some idea of the environment we need to do our text mining work, we can explore some basic tasks without diving too deep into programming (yet).

### The Command Line Interface (CLI)

The CLI or terminal is a bit Unix/Linux and macOS oriented; MS Windows users must use a workaround. For the remaining part of this workshop we will concentrate on working with Python programs and code via Jupyter notebooks. But having a glimpse of some of the CLI possibilities not only shows how powerful these tools are, but, more importantly, can show some important concepts of working with textual data:

    - chunking running plain text into meaningful chunks;
    - discarding parts of the text;
    - pipelines, the combination of small code snippets that, in a series of steps, produce the desired outcome.

#### Let's have a quick look at our text

First 5 lines of P&P

In [1]:
%%bash
sed 5q < data/pride.txt

﻿It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.

However little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds
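
Since the rest of the workshop uses Python, here is a rough Python counterpart of `sed 5q`. It is a sketch under assumptions: the helper name `head` and the sample text are made up, and an in-memory file stands in for `data/pride.txt`.

```python
import io
from itertools import islice

def head(lines, n=5):
    """Return the first n lines of an iterable of lines, like `sed 5q`."""
    return list(islice(lines, n))

# Hypothetical in-memory file standing in for data/pride.txt.
# With the real file one might use:
#   open("data/pride.txt", encoding="utf-8-sig")
# where the "utf-8-sig" codec drops a leading byte order mark if present.
sample = io.StringIO("line 1\nline 2\n\nline 4\nline 5\nline 6\n")
first = head(sample, 5)
for line in first:
    print(line, end="")
```

Like `sed 5q`, this stops reading after the requested number of lines, so it stays fast on large files.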


#### Count words in a text

We break up this task into three discrete steps:

1. We break up the text into a sequence of words (or tokens => tokenizing) with tr
2. We then sort all the words in our sequence with sort
3. Then we count the duplicates with uniq

Let's try to break up our text into tokens (words):

In [2]:
%%bash
tr -sc '[A-Z][a-z]' '[\012*]' < data/pride.txt | sed 5q


It
is
a
truth


Fine, works like a charm. The -c option takes the complement of the first set (everything that is not a letter), and -s squeezes each run of replacement characters into a single one. And, what is more important, with tr we manipulated newlines with ease. Usually newlines are a pain in the proverbial. They can mean the end of a line or the end of a command, depending on the context (remember that in my editor I had to enter the magic combo 'CTRL-Q CTRL-J' twice to select all "\n\n" sequences). Here tr lets us use the octal representation of the newline char: '\012*'. Perfect.
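
The same tokenizing idea can be sketched in Python with a regular expression. This is an illustration, not a drop-in replacement for the tr command: the function name `tokenize` and the sample sentence are assumptions, and unlike tr it never produces an empty first token.

```python
import re

def tokenize(text):
    """Return every maximal run of ASCII letters as one token."""
    return re.findall(r"[A-Za-z]+", text)

# Hypothetical sample; the real input would be the text of pride.txt.
sample = "It is a truth universally acknowledged, that a single man..."
tokens = tokenize(sample)
print(tokens[:4])
```

Note that, like the tr version, this pattern splits on apostrophes and hyphens, so a word such as "don't" becomes two tokens.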

Let's sort our sequence:

In [3]:
%%bash
tr -sc '[A-Z][a-z]' '[\012*]' < data/pride.txt |
sort | sed 5q


a
a
a
a


Next we count the number of occurrences of each token:

In [4]:
%%bash
tr -sc '[A-Z][a-z]' '[\012*]' < data/pride.txt |
sort |
uniq -c | sed 5q

      1 
   1905 a
     49 A
      1 abatement
      6 abhorrence

Finally we sort the counts in reverse numeric order, so that the most frequent tokens come first:


In [5]:
%%bash
tr -sc '[A-Z][a-z]' '[\012*]' < data/pride.txt |
sort |
uniq -c |
sort -nr | sed 5q

   4113 to
   4058 the
   3599 of
   3430 and
   2138 her


The important thing in the above code snippets is not the somewhat cryptic and terse commands, but the ease of gluing these commands together in a so-called "pipeline" using the Unix pipe symbol "|", where the output of one command becomes the input of the next. This allows for fast experimenting and adjusting of our snippets.
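
For readers who want to see the pipeline idea in Python, here is a minimal sketch. It uses `collections.Counter` from the standard library as a stand-in for the `sort | uniq -c | sort -nr` steps; the function name `word_counts` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def word_counts(text):
    """Tokenize on runs of ASCII letters and count each distinct token."""
    tokens = re.findall(r"[A-Za-z]+", text)   # roughly the tr step
    return Counter(tokens)                    # roughly sort | uniq -c

# Hypothetical sample instead of the full novel.
sample = "the cat and the dog and the bird"
counts = word_counts(sample)
top = counts.most_common(2)                   # roughly sort -nr | sed 2q
print(top)
```

Each step still feeds the next, only now the "pipe" is a function call or a method call instead of the "|" symbol.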

The output of the snippet in cell [4] shows that we count 1905 occurrences of the lowercase 'a' and 49 of the uppercase 'A'. With our friend tr these are easily merged. We just cast all tokens to lowercase:

In [6]:
%%bash
tr '[A-Z]' '[a-z]' < data/pride.txt |
tr -sc '[a-z]' '[\012*]' |
sort |
uniq -c | sed 5q

      1 
   1954 a
      1 abatement
      6 abhorrence
      1 abhorrent
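
The effect of folding case before counting can be sketched in Python as well. The sample sentence is made up, and the counts below belong to that sample, not to the novel.

```python
import re
from collections import Counter

# Hypothetical sample with both 'A' and 'a'.
sample = "A man and a woman. A dog."

# Case-sensitive counting keeps 'A' and 'a' apart.
cased = Counter(re.findall(r"[A-Za-z]+", sample))

# Lowercasing first merges them, like the extra tr step above.
folded = Counter(re.findall(r"[a-z]+", sample.lower()))

print(cased["A"], cased["a"], folded["a"])
```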


Right, we use the new snippet in our counting pipeline, but this time we do not reverse the order, so we get the least used words first:

In [7]:
%%bash
tr '[A-Z]' '[a-z]' < data/pride.txt |
tr -sc '[a-z]' '[\012*]' |
sort |
uniq -c |
sort -n | sed 10q

      1 
      1 abatement
      1 abhorrent
      1 abide
      1 abiding
      1 ablution
      1 abound
      1 abrupt
      1 absurdity
      1 abundant
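
A Python sketch of the "least used first" ordering follows. One assumption to flag: ties are broken alphabetically here, which is a choice made for this illustration. The sample text and the function name `least_common` are also made up.

```python
import re
from collections import Counter

def least_common(text, n):
    """Return the n least frequent lowercase tokens, ties in alphabetical order."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # Sort by count first, then by the word itself to make ties deterministic.
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return ordered[:n]

# Hypothetical sample instead of the full novel.
sample = "To be or not to be that is the question"
rare = least_common(sample, 3)
print(rare)
```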


To sum up: pipelines are a honking good idea, and together with the input '<' and output '>' redirection operators we can move forward quickly. We can use 'tr' to tokenize in different ways, according to the context. And we have different ways of sorting at our disposal:

| Example | Explanation |
| :- | :- |
| sort -d | dictionary order |
| sort -f | fold case |
| sort -n | numeric order |
| sort -nr | reverse numeric order |
| sort -u | remove duplicates |
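| sort +1 | start with field 1 (fields are numbered from 0); obsolete syntax, equivalent to sort -k 2 |
| sort +0.50 | start with char 50 of field 0 (chars are also numbered from 0); equivalent to sort -k 1.51 |
| sort +1.5 | start with char 5 of field 1; equivalent to sort -k 2.6 |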
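
Several of these sort variants have rough counterparts in Python's built-in `sorted`. The sketch below uses made-up sample data and is meant as an approximate mapping only. In particular, the command-line sort follows the locale, while Python compares code points, so the orderings can differ on real text.

```python
# Hypothetical sample data.
words = ["banana", "Apple", "cherry", "apple", "Banana"]
numbers = ["10", "9", "100"]
rows = ["3 abide", "1 to", "2 her"]

folded = sorted(words, key=str.lower)                 # roughly sort -f
numeric = sorted(numbers, key=int)                    # roughly sort -n
numeric_rev = sorted(numbers, key=int, reverse=True)  # roughly sort -nr
unique = sorted(set(words))                           # roughly sort -u
by_field = sorted(rows, key=lambda r: r.split()[1])   # roughly sort -k 2

print(folded)
print(numeric, numeric_rev)
print(unique)
print(by_field)
```

The `key` argument plays the role of the field and case options: it decides what is compared, while the items themselves are left untouched.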