<a href="https://colab.research.google.com/github/pvanhuisstede/workshops/blob/main/00_tm_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction

Learn Python in 10 days (Sams, YouTube video), or Teach Yourself Programming in 10 Years? The latter is [Peter Norvig's take on the issue](https://norvig.com/21-days.html).

Here the angle is pragmatic: when working with (a large number of) text files, what are the possibilities? What do you need to do some useful work? And, when things start to get more serious, how do you cooperate fruitfully with programmers?

One thing to keep in mind during this workshop is that, when we are looking at files with code, or clever constructs on the command line, etc., we are looking at *end products*, carefully crafted, optimized, made by programmers. Often the clean code we see is not what the maker started out with, but we do not see the start and the intermediate steps (and mistakes) that resulted in the end product.

So, for the next half hour, we will look at some CLI magic.

### Text mining intro

This Jupyter notebook is meant to give you a feel for working with files in the context of textual analysis:

- I have downloaded these files from Web of Science but what is actually there? (use your editor);
- Right, actually I am just interested in a small portion of what I got from WoS (use some text extraction);
- Ok, this is not what I had in mind (go back to the source), or: prepare to do some serious pre-processing.

Far-fetched examples? Not at all. But one does not have to set up an extended programming environment in order to get this kind of information. Something called the "Unix utilities" (working at the command line) will do nicely.

The examples below are taken from Kenneth Ward Church: Unix(tm) for Poets (https://www.cs.upc.edu/~padro/Unixforpoets.pdf)

Another good source is: Jeroen Janssens' Data Science at the Command Line (https://www.datascienceatthecommandline.com/1e/).

### Jane Austen: Pride and Prejudice

The Manning book "Natural Language Processing in Action" (NLPIA) comes with some example files. These example files are, more often than not, prepared files. In other words, the files are already customized for the software the authors use. For example, every sentence is on one line (terminated by a newline), instead of being formatted as text that is no wider than 79 characters, where each physical line (possibly only part of a sentence) ends with a newline.

The NLPIA file is: pride.txt

If we open this file in our editor of choice, we "see" the formatting. Each sentence in the book is a paragraph; that is, there are 2 newlines between sentences (here: sequences of characters terminated by a full stop). A single newline chops up a book sentence into more or less equal parts to display on the screen.

This might not seem a big deal at first glance, but there are all sorts of possible glitches right from the start. How many book sentences does P&P have? We can count the double newlines. In my editor that gives 1866 book sentences.

Suppose our software can only deal with book sentences on one line in our file. What to do? Easy peasy: just replace single newlines by a space and double newlines by a single newline. *What could possibly go wrong? Can you spot the caveat?*
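
One possible reading of the caveat is the order of the two replacements: if the single newlines are replaced first, the double newlines are gone before we get to them. The sketch below avoids this by splitting on the paragraph breaks first. The function name `unwrap_paragraphs` is made up for this illustration, and the code assumes Unix line endings (`\n`).

```python
import re

def unwrap_paragraphs(text):
    # Split on paragraph breaks (two or more newlines) first, so that they
    # are not destroyed when the single newlines are turned into spaces.
    paragraphs = re.split(r"\n{2,}", text.strip())
    # Inside a paragraph, collapse every run of whitespace to one space.
    return "\n".join(" ".join(p.split()) for p in paragraphs)

sample = (
    "It is a truth universally acknowledged, that a single man in possession\n"
    "of a good fortune, must be in want of a wife.\n"
    "\n"
    "However little known the feelings or views of such a man may be on his\n"
    "first entering a neighbourhood"
)
print(unwrap_paragraphs(sample))
```

Each book sentence now sits on a line of its own, which is the layout the software in our example expects.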

In the real world out there things are different. If we download the Jane Austen book from Project Gutenberg, we can choose the format. In our case we choose the plain text UTF-8 file.

http://www.gutenberg.org/ebooks/1342

If we open this file in our editor, we see something different. The beginning of the first book sentence "It is a truth universally acknowledged ..." starts on line 167 in our file. Then the text becomes familiar until the very end where the sequence *** END OF ... signals text added by Project Gutenberg.

The NLPIA authors simply removed the Project Gutenberg sandwich around the Jane Austen text, which makes perfect sense if one is going to do textual analysis of that text.
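
Removing the sandwich can also be done in a few lines of Python. This is a minimal sketch, not what the NLPIA authors actually did. The function name is made up, and the `*** START OF` marker is an assumption (the text above only mentions `*** END OF`); check the markers in your own download. Any front matter that follows the start marker (title page, table of contents) is left in place and still needs a look in the editor.

```python
def strip_gutenberg_sandwich(text, start_marker="*** START OF", end_marker="*** END OF"):
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith(start_marker):
            start = i + 1          # the body begins after the start marker
        elif line.startswith(end_marker):
            end = i                # the body stops before the end marker
            break
    return "\n".join(lines[start:end]).strip()

# Made-up miniature of a Project Gutenberg file, for illustration only.
sample = "\n".join([
    "Header added by Project Gutenberg",
    "*** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***",
    "",
    "It is a truth universally acknowledged, that a single man in possession",
    "of a good fortune, must be in want of a wife.",
    "",
    "*** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***",
    "Licence text added by Project Gutenberg",
])
body = strip_gutenberg_sandwich(sample)
print(body)
```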

#### Conclusions

1. It pays off to know the basics of a so-called programming editor:

    - Inspect what is there;
    - Make some quick necessary changes (re-formatting, deleting the Gutenberg sandwich, etc.). When mistakes are made, these editors allow for undo;
    - When the file is saved, we save plain text in UTF-8.

2. If we were to start with the Project Gutenberg file, we end up with 2 files right at the beginning of our project. We can store the PG file we got from their website as a source file (for example in a directory named "src" or "raw") and the pre-processed file in another directory, "wip".

3. Once one really dives in, files in different stages tend to pile up. In that case simple version control routines (we use Git) can be of tremendous help. When you have reached a certain result, you check that particular result in -- as a snapshot. And just as undo comes in handy, these snapshots can be used to roll back to after you have messed things up big time.

Now that we have some idea of the environment we need to do our text mining work, we can explore some basic tasks without diving too deep into programming (yet).

### The Command Line Interface (CLI)

The CLI or terminal is a bit Unix/Linux and macOS oriented; MS Windows users must use a workaround. For the remaining part of this workshop we will concentrate on working with Python programs and code via Jupyter notebooks. But having a glimpse of some of the CLI possibilities not only shows how powerful these tools are but, more importantly, illustrates some important concepts of working with textual data:

- chunking running plain text into meaningful chunks;
- discarding parts of the text;
- pipelines, the combination of small code snippets that, in a series of steps, produce the desired outcome.

Before we can work with the data files we prepared, we must import our GitHub repo into the Colab environment.

In [20]:
!git clone https://github.com/pvanhuisstede/workshops

fatal: destination path 'workshops' already exists and is not an empty directory.


#### Let's have a quick look at our text

First 5 lines of P&P

In [21]:
%%bash
sed 5q < ./workshops/data/pride.txt

﻿It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.

However little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds


#### Count words in a text

We break up this task into three discrete steps:

1. We break up the text into a sequence of words (or tokens => tokenizing) with tr
2. We then sort all the words in our sequence with sort
3. Then we count the duplicates with uniq

Let's try to break up our text into tokens (words):

In [None]:
%%bash
tr -sc '[A-Z][a-z]' '[\012*]' < ./workshops/data/pride.txt | sed 5q


Fine, works like a charm. And, what is more important, with tr we manipulated newlines like a breeze. Usually newlines are a pain in the proverbial. They can mean the end of a line or the end of a command, depending on the context (remember that in my editor I had to enter the magic combo 'CTRL-Q CTRL-J' twice to select all "\n\n" sequences). Here tr lets us use the octal representation of the newline char, '\012'; the '*' in '[\012*]' repeats it as often as needed. Perfect.
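
For comparison, the same tokenizing step can be sketched in Python. This is only an illustration of the idea, not a replacement for the tr command; the function name `tokenize` is made up, and a "word" is assumed to be a run of the letters A-Z and a-z, just as in the tr call.

```python
import re

def tokenize(text):
    # Every maximal run of ASCII letters is a token; everything else
    # (spaces, punctuation, newlines, digits) acts as a separator.
    return re.findall(r"[A-Za-z]+", text)

sample = "It is a truth universally acknowledged, that a single man"
print(tokenize(sample)[:5])
```

Note that a word like "don't" falls apart into "don" and "t" under this definition, which is one reason why tokenizing "according to the context" matters.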

Let's sort our sequence:

In [None]:
%%bash
tr -sc '[A-Z][a-z]' '[\012*]' < ./workshops/data/pride.txt |
sort | sed 5q


Next we count the number of occurrences of each token:

In [None]:
%%bash
tr -sc '[A-Z][a-z]' '[\012*]' < ./workshops/data/pride.txt |
sort |
uniq -c | sed 5q


Finally, we sort the counts numerically in reverse order, so that the most frequent tokens come first:

In [None]:
%%bash
tr -sc '[A-Z][a-z]' '[\012*]' < ./workshops/data/pride.txt |
sort |
uniq -c |
sort -nr | sed 10q


The important thing in the above code snippets is not the somewhat cryptic and terse commands, but the ease of gluing these commands together in a so-called "pipeline" using the Unix pipe symbol "|", where the output of one command becomes the input of the next. This allows for fast experimenting and adjusting our snippets.
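
The same three steps can be sketched in Python, where the output of one step is likewise handed to the next. Here `collections.Counter` takes over the work of `sort | uniq -c`, and `most_common` that of `sort -nr`. The function name `word_counts` and the sample sentence are made up for this illustration.

```python
import re
from collections import Counter

def word_counts(text):
    tokens = re.findall(r"[A-Za-z]+", text)   # step 1: tokenize (tr)
    return Counter(tokens)                    # steps 2 and 3: sort | uniq -c

sample = "a man and a wife and A fortune"
counts = word_counts(sample)
# most_common(n) lists the n most frequent tokens, like sort -nr | sed nq
for word, n in counts.most_common(3):
    print(n, word)
```

As in the shell version, uppercase and lowercase tokens are still counted separately at this point.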

The output of the snippet in cell [4] shows that we count 1905 occurrences of the lowercase 'a' and 49 of the uppercase 'A'. With our friend tr these are easily merged; we just cast all tokens to lowercase:

In [None]:
%%bash
tr '[A-Z]' '[a-z]' < ./workshops/data/pride.txt |
tr -sc '[a-z]' '[\012*]' |
sort |
uniq -c | sed 5q


Right, we use the new snippet in our counting pipeline, but this time we do not reverse the order, so we get the least used words first:

In [None]:
%%bash
tr '[A-Z]' '[a-z]' < ./workshops/data/pride.txt |
tr -sc '[a-z]' '[\012*]' |
sort |
uniq -c |
sort -n | sed 10q
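
A rough Python counterpart of this lowercase pipeline, given as a sketch under the same assumption that a word is a run of the letters a-z. The name `lowercase_counts` and the sample sentence are made up; ties between equally rare words are broken alphabetically here, which is a choice of this sketch.

```python
import re
from collections import Counter

def lowercase_counts(text):
    # Casting to lowercase first merges 'A' and 'a' into one token.
    return Counter(re.findall(r"[a-z]+", text.lower()))

sample = "A man and a wife and a fortune"
counts = lowercase_counts(sample)
# Ascending by count (like sort -n), then alphabetically within a count.
rarest_first = sorted(counts.items(), key=lambda item: (item[1], item[0]))
print(rarest_first[:3])
```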


To sum up: Pipelines are a honking good idea and together with the input '<' and output '>' operators we can move forwards quickly. We can use 'tr' to tokenize in different ways, according to the context. And we have different ways of sorting at our disposal:

| Example | Explanation |
| :- | :- |
| sort -d | dictionary order |
| sort -f | fold case |
| sort -n | numeric order |
| sort -nr | reverse numeric order |
| sort -u | remove duplicates |
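| sort -k 2 | start at the second field (older syntax: sort +1, which counts fields from 0) |
| sort -k 1.51 | start at the 51st character of the first field (older syntax: sort +0.50) |
| sort -k 2.6 | start at the 6th character of the second field (older syntax: sort +1.5) |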

#### Bigrams

Bigrams are pairs of words that can present us with a somewhat better, although still limited, view of the semantics of a text, because we pair 2 adjacent words instead of using a simple bag of words (BoW).

We start with a part of the code we used above to generate our list of words, all lowercase, and we write the result to a file: pride.words.

Then the new part of our code snippet: we use that file to generate a new file with all the words, BUT shifted by one place (so it starts with the second word), and output the result to a second file: pride.nextwords

We can use both files to paste the lines together and save the result into a third file: pride.bigrams

In [None]:
%%bash
tr '[A-Z]' '[a-z]' < ./workshops/data/pride.txt |
tr -sc '[a-z]' '[\012*]' > ./workshops/data/pride.words
tail -n +2 ./workshops/data/pride.words > ./workshops/data/pride.nextwords
paste ./workshops/data/pride.words ./workshops/data/pride.nextwords |
sort | uniq -c > ./workshops/data/pride.bigrams
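
The trick of pasting a word list next to a shifted copy of itself has a compact Python counterpart: `zip(words, words[1:])`. The sketch below works in memory instead of writing intermediate files; the function name `bigram_counts` and the sample text are made up for this illustration.

```python
import re
from collections import Counter

def bigram_counts(text):
    words = re.findall(r"[a-z]+", text.lower())
    # words[1:] is the list shifted by one place (our pride.nextwords);
    # zip pairs each word with its successor (our paste step).
    return Counter(zip(words, words[1:]))

sample = "of a good fortune must be in want of a wife"
print(bigram_counts(sample).most_common(1))
```

One small difference: paste also emits a final line for the last word with nothing next to it, whereas zip simply stops at the shorter list.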


Have a look at the most used bigrams with:

In [None]:
%%bash
sort -nr < ./workshops/data/pride.bigrams | sed 10q


#### Grep and friends

Another way of quickly scanning texts and selecting lines containing some words is to use the grep utility (or one of its friends: egrep, ack, etc.).

Suppose we are interested in lines that contain a reference to Mr. Darcy. We can use the following command:

In [None]:
%%bash
grep 'Darcy' ./workshops/data/pride.txt | sed 5q
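
What grep does here can be sketched in a few lines of Python. The real grep matches regular expressions; this hypothetical helper only looks for a fixed string (comparable to `grep -F`), and its `invert` flag mimics `grep -v`. The sample lines are made up.

```python
def grep(pattern, lines, invert=False):
    # Yield the lines that contain the fixed string `pattern`,
    # or the lines that do not contain it when invert is True.
    for line in lines:
        if (pattern in line) != invert:
            yield line

# Made-up lines, for illustration only.
sample = [
    "Mr. Darcy said nothing.",
    "Elizabeth laughed.",
    "Darcy bowed.",
]
print(list(grep("Darcy", sample)))
```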


But we can also use our new bigrams file to grep bigrams with Darcy:

In [None]:
%%bash
grep 'darcy' ./workshops/data/pride.bigrams | sed 5q


In the two bigram examples above we see that one has to be very interested in the use of words like "of", "to", "in" by Jane Austen in order to appreciate the information generated. Depending on the questions we have, we often want to discard certain parts or words of a text, in order to be able to concentrate on the parts we are interested in.

Let's use our friend grep to filter out some of the stopwords.

For that we make a file "stopwords" that contains the words, each on a line of its own, that we are NOT interested in.

In [None]:
%%bash
tr '[A-Z]' '[a-z]' < ./workshops/data/pride.txt |
tr -sc '[a-z]' '[\012*]' > ./workshops/data/pride.words
tail -n +2 ./workshops/data/pride.words > ./workshops/data/pride.nextwords
paste ./workshops/data/pride.words ./workshops/data/pride.nextwords |
grep -v -w -f ./workshops/data/stopwords |
sort | uniq -c > ./workshops/data/pride.bigrams
sort -nr < ./workshops/data/pride.bigrams | sed 10q
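
The effect of the stopword filter can be sketched in Python as well. Because grep looks at whole pasted lines, a bigram is dropped as soon as either of its two words is a stopword; the sketch below copies that behaviour. The stopword set, the function name `content_bigrams` and the sample text are made up for this illustration and are not the contents of our stopwords file.

```python
import re
from collections import Counter

STOPWORDS = {"of", "to", "in", "a", "the"}   # hypothetical stopword list

def content_bigrams(text, stopwords=STOPWORDS):
    words = re.findall(r"[a-z]+", text.lower())
    pairs = zip(words, words[1:])
    # Keep a pair only if neither word is a stopword (like grep -v -w -f).
    return Counter(
        (first, second)
        for first, second in pairs
        if first not in stopwords and second not in stopwords
    )

sample = "in want of a wife a good fortune a good fortune"
print(content_bigrams(sample).most_common(1))
```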


### Conclusion

Depending on the files you start to work with, you should be prepared to do some cleaning up. Depending on the sources you have and the questions you have, you further tailor the sources to work with.

The small programs we used above are so-called Unix utilities. You will have to work on a Unix, Linux or macOS machine to be able to use them. And they are much more powerful than we have shown here.

Here they were just used to give you an idea of what it is to work with texts. You have to be able to see what you have got to work with, for example by using a plain text editor.

Even with the small examples above our data subdirectory now contains 6 files. We started with the text file of Pride and Prejudice and generated 5 files. Somehow we must keep track of the intermediates and results we generate: this is where version control software comes in.

What we did, and why we did it, is stated in this file: a so-called Jupyter notebook, in which we can combine text cells with cells that contain code and show the results of running that code.

We could have done a lot more with the Jane Austen book. Analyse the sentences in which the male characters are mentioned against the ones in which the female characters play a role. Zoom in on the sentences that contain dialogue. Or we could have 