<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2023 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email author@email.address.<br />
____

# Taming Text 1

This is lesson `1` of 3 in the educational series on Taming Text. This notebook introduces the tools and history behind XPath and regular expressions in Python.

**Audience:** `Teachers` / `Learners` / `Researchers`  ...

**Use case:** `Tutorial` / `How-To` / `Reference` / `Explanation` 


**Difficulty:**  `Intermediate` / `Advanced`

**Completion time:** `90 minutes`

**Knowledge Required:** 


* Python basics (variables, flow control, functions, lists, dictionaries)


**Knowledge Recommended:**


**Learning Objectives:**


**Research Pipeline:**

Can be at many points.


## Text, revisited

### What is plain text?

Weirdly enough, this can actually be hard to define. Plain text files are files whose contents are formatted as text. That doesn't help much, but effectively: if you can look at it as text, you've got it.

As we discuss text it's important to remember that:

* word files are not text files
* file extensions are a fiction, created as a "convention" during the command line era
	* there weren't GUIs, and it was hard to tell file formats apart, or files from folders, by name alone
* this means that "double click and see what happens" is a very small part of the story
* Sometimes "proprietary" data is actually plain text hiding behind a file extension, like `.dat` etc.
	* they usually do have a proprietary structure within the text, but that can be dealt with
	* try popping these open in a plain text editor and see what happens

Another fun fact? (okay, a set of them)

* Jupyter Notebooks are actually giant json files
* all the code and formatting is saved there
* in plain text
* including any embedded images

How are the images stored? `base64` encoding, which is "a group of [binary-to-text encoding](https://en.wikipedia.org/wiki/Binary-to-text_encoding "Binary-to-text encoding") schemes that represent [binary data](https://en.wikipedia.org/wiki/Binary_data "Binary data") ... in sequences of 24 [bits](https://en.wikipedia.org/wiki/Bit "Bit") that can be represented by four 6-bit Base64 digits." (https://en.wikipedia.org/wiki/Base64)

Basically, plain and readable text transmission of binary data.
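
To make the "24 bits become four 6-bit digits" idea concrete, here is a minimal sketch using Python's standard `base64` module. The three bytes are just an illustrative stand-in for binary image data (they happen to be the bytes a JPEG file starts with), not something taken from a real notebook.

```python
import base64

# Three arbitrary bytes (24 bits) standing in for binary image data.
raw = bytes([0xFF, 0xD8, 0xFF])

# 24 bits of input become four 6-bit Base64 digits, all printable ASCII.
encoded = base64.b64encode(raw).decode("ascii")
print(encoded)  # /9j/

# Decoding recovers the original bytes exactly.
decoded = base64.b64decode(encoded)
print(decoded == raw)  # True
```

If you open a notebook with an embedded JPEG in a plain text editor, you will likely spot a long string starting with those same characters.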

Messing around with how images were transmitted is actually a pretty old thing: https://youtu.be/cLUD_NGE370?t=225 As we will discover, revolutionary tools are those that do powerful things when you think about what you work with in a different way.

Nowadays when we open up pictures in Python tools, we do so in 3D arrays (each array is a color "layer"). When we think about pictures as a collection of pixels, where the color can be described using a human readable color system, we could actually just save those numbers. In fact, many machine learning tools don't always know/care that they are working with images, they just work with the arrays of numbers and go from there. 

Whatever you think about text, think about it from a bigger perspective. 

### Opening plain text files

You've seen some of this in Python basics, where you use the `open()` function in `r` mode. You may also see `rt` mode being used for "read text" (it means the same thing as `r`). There's also `rb` for "read bytes".
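
As a rough illustration of the difference between those modes, the sketch below writes a tiny file to a temporary folder (the file name `sample.txt` and its contents are made up for this example) and reads it back both ways.

```python
import os
import tempfile

# Write a small sample file so the example is self-contained.
# newline="" stops Python from changing the line ending we write.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("café\r\n")

# "r" and "rt" are the same mode: bytes are decoded into a str,
# and by default "\r\n" is translated to "\n".
with open(path, "rt", encoding="utf-8") as f:
    as_text = f.read()

# "rb" returns the raw bytes with no decoding and no newline translation.
with open(path, "rb") as f:
    as_bytes = f.read()

print(repr(as_text))  # 'café\n'
print(as_bytes)       # b'caf\xc3\xa9\r\n'
```

The bytes view is closer to what is actually on disk: the `é` takes two bytes in UTF-8, and the Windows-style line ending is still there.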

There are several ways to open up plain text files. Key: don't just double click. 

Yes, you can use Python or other programming languages to read text. However, many programmers rely on a variety of tools for inspecting and viewing data before reading it in.

There's no single text program that's the 'best'. Most of us either use something that hasn't irritated us enough to use something else or use something that has some specialty for what we are working with. 

Examples: Sublime Text (my personal one), Atom, Notepad++, or many IDEs (e.g., PyCharm and JupyterLab) can display the contents of a text file. There are also plenty of command line tools.

### Rethinking text as data
#### teletype machines
Before there were monitors, there were typewriters and paper. Many of our characters and text techniques came from this era. The hardware of that system is still within our text, even if we don't have those physical systems anymore.

Fun facts:

* they had literal bells on them, which were rung via the bell character, `\a` (not `\b`, which is backspace)
* before monitors there was no delete key; there were techniques and characters to say either "ignore the previous character" or "trash this line"
* for example, if you messed up on a punched tape, how do you erase it? Punch all the holes to make a row of just 1s
	* https://en.wikipedia.org/wiki/Delete_character
* this is related to why Windows and Mac use different sets of characters for newlines: `\r\n` vs `\n` alone
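
These leftovers are easy to poke at from Python. The sketch below only uses built-in string escapes; the variable names are made up for illustration.

```python
# Control characters left over from the teletype era, by code point.
bell = "\a"        # BEL: rang the physical bell
backspace = "\b"   # BS: moved the carriage back one position
delete = "\x7f"    # DEL: every hole punched, i.e. seven 1 bits

print(ord(bell), ord(backspace), ord(delete))  # 7 8 127
print(format(ord(delete), "07b"))              # 1111111

# The same two lines with Windows-style and Unix-style line endings.
windows_text = "first\r\nsecond\r\n"
unix_text = "first\nsecond\n"

# splitlines() understands both conventions.
same_lines = windows_text.splitlines() == unix_text.splitlines()
print(same_lines)  # True
```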

A fun watch: https://www.youtube.com/watch?v=WqgFK9h75eg&ab_channel=Derobukal

#### data as lines

Much of the basis of our computing came from this era, and one of the easiest delimiters to work with was the line. Physical media separated elements by line, and this idea continues in much of our modern computing.

* https://en.wikipedia.org/wiki/Baudot_code#/media/File:Baudot_Tape.JPG

### We've inherited limitations from the past

Many of these tools and techniques came about when our choices for text were really limited. We had what could be typed and displayed. There's a lot of work to do and only so many characters. 

This impacts us in a variety of ways. We handle context disambiguation in our daily life with English (e.g., homonyms), so we are equipped to handle this. However, it takes time to build up the context to make these determinations.

* regular expressions have "metacharacters", and the same symbol often serves several meanings depending on the context
	* really hard to get used to, but it gets better if you keep reading the context of where the symbol is placed
* HTML entities (https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references)
	* needed because inventing a roundabout way of writing a symbol helps us get around symbol clashes
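
Here is a small sketch of both points using only the standard library. The caret `^` is one example of a regex symbol whose meaning depends on where it sits; the sample strings are invented for illustration.

```python
import html
import re

# "^" at the start of a pattern anchors the match to the start of the string.
anchored = re.findall(r"^a", "abc a")
print(anchored)  # ['a']

# "^" as the first character inside [...] negates the character class.
negated = re.findall(r"[^a]", "aba")
print(negated)  # ['b']

# "^" after a backslash is just a literal caret.
literal = re.findall(r"\^", "2^3")
print(literal)  # ['^']

# HTML entities spell out characters that would otherwise clash with markup.
unescaped = html.unescape("&lt;b&gt; Tom &amp; Jerry")
escaped = html.escape("<b> Tom & Jerry")
print(unescaped)  # <b> Tom & Jerry
print(escaped)    # &lt;b&gt; Tom &amp; Jerry
```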

### Nowadays?

* lots of text
* lots of formats
* why? lots of data out there
* https://datapraxis.github.io/sourcecaster/ 

The overall goal: be able to look at text the way the computer is seeing it. Don't just see the words, but think about the things around the words. 

## Structured and unstructured data

More data than you know is actually just structured text. We may craft data just for machines to process, caring little for the humans. We may choose to blend the machine and the human. We may have information that's just for humans. All depends on need.

There are many types of structured data with more to come. We will be looking at XML/HTML as our structured data example here, but many of the skills for thinking through structure will serve you well no matter what the structure is.

From: https://en.wikipedia.org/wiki/Serialization 

|[Data exchange](https://en.wikipedia.org/wiki/Data_exchange "Data exchange") formats|   |
|---|---|
|[Human readable](https://en.wikipedia.org/wiki/Human-readable_medium "Human-readable medium") formats|- [Atom](https://en.wikipedia.org/wiki/Atom_(standard) "Atom (standard)")<br>- [CSV](https://en.wikipedia.org/wiki/Comma-separated_values "Comma-separated values")<br>- [EDIFACT](https://en.wikipedia.org/wiki/EDIFACT "EDIFACT")<br>- [JSON](https://en.wikipedia.org/wiki/JSON "JSON") <br>    - [Web Encryption](https://en.wikipedia.org/wiki/JSON_Web_Encryption "JSON Web Encryption")<br>    - [Web Token](https://en.wikipedia.org/wiki/JSON_Web_Token "JSON Web Token")<br>    - [Web Signature](https://en.wikipedia.org/wiki/JSON_Web_Signature "JSON Web Signature")<br>- [Property list](https://en.wikipedia.org/wiki/Property_list "Property list")<br>- [RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework "Resource Description Framework")<br>- [Rebol](https://en.wikipedia.org/wiki/Rebol "Rebol")<br>- [TOML](https://en.wikipedia.org/wiki/TOML "TOML")<br>- [XML](https://en.wikipedia.org/wiki/XML "XML")<br>- [YAML](https://en.wikipedia.org/wiki/YAML "YAML")|
|[Binary](https://en.wikipedia.org/wiki/Binary_file "Binary file") formats|- [AMF](https://en.wikipedia.org/wiki/Action_Message_Format "Action Message Format")<br>- [ASN.1](https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One "Abstract Syntax Notation One") <br>    - [SMI](https://en.wikipedia.org/wiki/Structure_of_Management_Information "Structure of Management Information")<br>- [Avro](https://en.wikipedia.org/wiki/Apache_Avro "Apache Avro")<br>- [Base32](https://en.wikipedia.org/wiki/Base32 "Base32")<br>- [Base64](https://en.wikipedia.org/wiki/Base64 "Base64")<br>- [BSON](https://en.wikipedia.org/wiki/BSON "BSON") <br>    - [UBJSON](https://en.wikipedia.org/wiki/UBJSON "UBJSON")<br>- [Cap'n Proto](https://en.wikipedia.org/wiki/Cap%27n_Proto "Cap'n Proto")<br>- [CBOR](https://en.wikipedia.org/wiki/CBOR "CBOR")<br>- [FlatBuffers](https://en.wikipedia.org/wiki/FlatBuffers "FlatBuffers")<br>- [MessagePack](https://en.wikipedia.org/wiki/MessagePack "MessagePack")<br>- [Property list](https://en.wikipedia.org/wiki/Property_list "Property list")<br>- [Protocol Buffers](https://en.wikipedia.org/wiki/Protocol_Buffers "Protocol Buffers")<br>- [Thrift](https://en.wikipedia.org/wiki/Apache_Thrift "Apache Thrift")<br>- [Cyphal DSDL](https://en.wikipedia.org/wiki/Cyphal "Cyphal")<br>- [XDR](https://en.wikipedia.org/wiki/External_Data_Representation "External Data Representation")<br>- [uuencode](https://en.wikipedia.org/wiki/Uuencoding "Uuencoding")<br>- [yEnc](https://en.wikipedia.org/wiki/YEnc "YEnc")|

When we think about unstructured data, we can think about free text. However, even when something wasn't made with that kind of structure, there's usually something you can latch onto. Even a book has some kind of structure. That may depend on when and where it came from, but usually there are some expectations you can rely on.

## Literalness

Since we are often using the same set of characters to provide structure to data, we need to focus on thinking about these characters as having **literal expression** vs **metaexpression** (my stupid word for this). 

When you see these metaexpression characters, you should sort of presume that they are being used in their meta lives. That makes sense since they were chosen for a reason, as not being the most common sort of thing used. Like, we wouldn't use the letter 'e' as the main delimiter ... I hope.

When you do need to use one of them literally, think of it as an exception and think of it not as whatever, but a literal whatever. Even read it as "literal backslash" or "literal quote". 
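
A quick sketch of "literal backslash" and "literal dot" in practice, using Python string escapes and the standard `re` module. The sample strings are made up.

```python
import re

# In an ordinary string, backslash is a metacharacter: "\n" is ONE newline character.
print(len("\n"))  # 1

# Doubling it gives a literal backslash followed by the letter n.
print(len("\\n"))  # 2

# A raw string treats every backslash literally.
print(r"\n" == "\\n")  # True

# In a regex, "." means "any character"; escape it to get a literal dot.
any_char = re.findall(r"a.c", "abc a.c")
literal_dot = re.findall(r"a\.c", "abc a.c")
print(any_char)     # ['abc', 'a.c']
print(literal_dot)  # ['a.c']

# re.escape() adds the backslashes for you.
escaped_pattern = re.escape("a.c")
print(escaped_pattern)  # a\.c
```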

Read more: https://en.wikipedia.org/wiki/String_literal 

## The trick to text: match the solution to the structure

This workshop will cover XPath and regular expressions, two very common tools. XML may come and go, but tree structures, and thinking about tree structures are here to stay. Free text will always be around, and thus also regex.

Your preference should always be to match the tool to the structure the data is actually stored in: use a tool designed to "parse" it, that is, to process it with an understanding of that structure.

For example, CSVs should be parsed with a CSV reader and not just "split on commas". The pandas `read_csv` function does much more than split on commas or whatever delimiter you told it to use. Likewise, core Python has a `csv` module that handles properly quoted and escaped content.
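
To see why "split on commas" falls over, here is a minimal sketch with the standard `csv` module. The two-row dataset is invented for the example; the second field of the record contains a comma inside quotes.

```python
import csv
import io

# One header row and one record whose second field contains a comma.
data = 'name,quote\nAda,"Hello, world"\n'

# Naive splitting breaks the quoted field into two pieces.
naive = [line.split(",") for line in data.splitlines()]
print(naive[1])  # ['Ada', '"Hello', ' world"']

# csv.reader understands the quoting rules and keeps the field intact.
parsed = list(csv.reader(io.StringIO(data)))
print(parsed[1])  # ['Ada', 'Hello, world']
```

The naive version returns three fields for a two-column file, which is exactly the delimiter collision problem.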

You can read about many of the techniques here: https://en.wikipedia.org/wiki/Delimiter#Delimiter_collision

## What is XPath?

(https://en.wikipedia.org/wiki/XPath)

This language is used with XML documents. HTML is closely related to XML (XHTML is HTML written as XML), so many XML tools can be used with it as well.

The history of markup languages is long and circular. Originally used to note how certain content should appear, the names/systems/syntax have been shuffled around a lot. Many were inherited from or based on previous versions.

XML currently is used as a markup language like that but also for data transmission. As it can put data into semantically tagged chunks, it was naturally extended to holding data with structure. 

"The W3C intended XHTML 1.0 to be identical to HTML 4.01 except where limitations of XML over the more complex SGML require workarounds. Because XHTML and HTML are closely related, they are sometimes documented in parallel. In such circumstances, some authors [conflate the two names](https://en.wikipedia.org/wiki/(X)HTML "(X)HTML") as (X)HTML or X(HTML)." (https://en.wikipedia.org/wiki/HTML#SGML-based_versus_XML-based_HTML)

To sum up, we can pretty much use XML tools on HTML for web scraping. This also means that XML files can be parsed using those same tools. This becomes fantastic when working with APIs, as they often support returning data as XML. That said, parsing results in JSON is likely easier if given the choice.

Fun reading: the original specs from CERN for HTML http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html#4

Bonus, both XML and JSON are tree structures. Meaning that the logic of navigating them is very similar (even if the syntax is not).
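
A small sketch of that similarity, using only the standard library. The `library`/`book`/`title` data is made up, and note that `xml.etree.ElementTree` only supports a limited subset of XPath, so treat this as an illustration of the idea rather than full XPath.

```python
import json
import xml.etree.ElementTree as ET

# The same made-up record, once as XML and once as JSON.
xml_text = "<library><book><title>Dracula</title></book></library>"
json_text = '{"library": {"book": {"title": "Dracula"}}}'

# XML: start at the root element (<library>) and follow a path down to the title.
root = ET.fromstring(xml_text)
xml_title = root.find("book/title").text

# JSON: walk the same branches using dictionary keys.
tree = json.loads(json_text)
json_title = tree["library"]["book"]["title"]

print(xml_title, json_title)  # Dracula Dracula
```

In both cases you start at the top of the tree and name each branch on the way down; only the syntax differs.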

Double bonus: there are a few JSON querying tools that are meant to mimic XPath (https://goessner.net/articles/JsonPath/) or be very close to it. 

## Tree structures and XML in general
To properly understand the "why" of XPath, we should spend some time exploring XML and the idea of a tree structure.

### XML lingo

This is not meant to serve as a full metadata class, but as a quick reference to help XPath make more sense.


* tag
	* the structural elements that can be used to mark things up.
	* used for talking about things like the `a` tag etc. 
* element
	* an instance of a tag being used in the document
* attribute
	* gives you a key/value pair, possible attributes are also defined in the schema
	* written as `attribute="the value"`: the attribute name doesn't have quotes, while the value will have quotes around it. Well, should, but some HTML is the wild west.
* schema
	* a set of rules defining how this pocket world should work and how all the above (tags, elements, attributes) can be used
	* you can use multiple within the same document!
* namespaces
	* allows you to specify the names etc. for all the schemas being used
	* used to disambiguate the tags
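
The vocabulary above can be seen in a tiny, made-up document. This sketch uses the standard `xml.etree.ElementTree` module; the `page` root, the `ex` prefix, and the namespace URI `http://example.com/ns` are placeholders invented for the example, not a real schema.

```python
import xml.etree.ElementTree as ET

# A made-up document; the namespace URI is a placeholder, not a real schema.
doc = """
<page xmlns:ex="http://example.com/ns">
  <a href="https://constellate.org">Constellate</a>
  <a href="https://creativecommons.org">Creative Commons</a>
  <ex:a>Not a link, same tag name in another namespace</ex:a>
</page>
"""

root = ET.fromstring(doc.strip())

# One tag ("a"), two elements that use it.
links = root.findall("a")
print(len(links))  # 2

# Each element carries its attributes as key/value pairs.
print(links[0].attrib)  # {'href': 'https://constellate.org'}

# The namespaced tag is stored as {namespace-uri}name,
# so it never collides with the plain "a" tag.
ns = {"ex": "http://example.com/ns"}
namespaced_tag = root.find("ex:a", ns).tag
print(namespaced_tag)  # {http://example.com/ns}a
```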

Let's look through some examples:

https://www.w3schools.com/xml/xml_examples.asp
