### Some of last week's regex questions.

In [None]:
with open('persuasion.txt', 'r', encoding="utf-8") as f:
    persuasion = f.read()

In [None]:
import re

##### deal with punctuation and 's

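Delete selected punctuation from the text. Removing the apostrophes turns "Anne's" into "Annes", which is why the patterns that follow use `Annes?`.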

In [None]:
persuasion = re.sub(r"[.,;'’-]", "", persuasion)

In [None]:
annes_words = re.findall(r"Annes? \w+", persuasion)
annes_words

##### remove "Anne" to leave only the word that follows it


In [None]:
for anne_word in annes_words:
    # strip the leading "Anne " or "Annes " so only the following word is left
    following_word = re.sub(r"Annes? ", "", anne_word)
    print(following_word)

A _positive lookbehind assertion_ matches `\w+` only where it is immediately preceded by `Anne `, without including `Anne ` in the match:

In [None]:
annes_words = re.findall(r"(?<=Anne )\w+", persuasion)
annes_words
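
One thing worth noticing about the lookbehind: it only fires after `Anne `, so words that follow `Annes ` are skipped. In the standard `re` module a lookbehind has to be fixed-width, so `(?<=Annes? )` is rejected. A capture group is one way round this. The sketch below uses a made-up `sample` string rather than the novel so that it runs on its own.

```python
import re

# Hypothetical sample text (not from persuasion.txt), punctuation already removed
sample = "Anne Elliot smiled Annes father did not Anne walked on"

# Fixed-width lookbehind: only words preceded by exactly "Anne "
lookbehind = re.findall(r"(?<=Anne )\w+", sample)

# Capture group: findall returns just the group, so "Annes " is covered too
captured = re.findall(r"Annes? (\w+)", sample)

print(lookbehind)  # ['Elliot', 'walked']
print(captured)    # ['Elliot', 'father', 'walked']

# A variable-width lookbehind is refused by the re module
lookbehind_error = None
try:
    re.compile(r"(?<=Annes? )\w+")
except re.error as err:
    lookbehind_error = err
print("variable-width lookbehind:", lookbehind_error)
```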

##### why did we miss 3 Annes?

In [None]:
test_string = "Anne Elliot was looking on IMDB for the characters Anne Elliot or Anne, like, for example, Anne Hathaway"

##### greedy

In [None]:
how_many = re.findall(r".{0,20}Anne.{0,20}", test_string)
how_many

##### lazy

In [None]:
how_many_now = re.findall(r".{0,20}?Anne.{0,20}?", test_string)
how_many_now
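
Neither version gives a full context window for every "Anne". The greedy pattern swallows the third "Anne" inside the right-hand context of the second, and the lazy pattern finds all four but its trailing `.{0,20}?` is satisfied with zero characters. A possible alternative, sketched here on the same `test_string`, is to locate each "Anne" with `re.finditer` and slice the context out of the string, so that overlapping windows are not a problem. The window size of 20 simply mirrors the patterns above.

```python
import re

test_string = "Anne Elliot was looking on IMDB for the characters Anne Elliot or Anne, like, for example, Anne Hathaway"

greedy = re.findall(r".{0,20}Anne.{0,20}", test_string)
lazy = re.findall(r".{0,20}?Anne.{0,20}?", test_string)
print(len(greedy))  # 3 matches, although the string contains 4 Annes
print(len(lazy))    # 4 matches, but with no context after "Anne"

# Find every "Anne", then slice up to 20 characters either side of it
window = 20
contexts = []
for match in re.finditer(r"Anne", test_string):
    start = max(match.start() - window, 0)
    end = min(match.end() + window, len(test_string))
    contexts.append(test_string[start:end])

for context in contexts:
    print(repr(context))
```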

## What is markup?

The term originates with printers _marking up_ manuscripts for typesetting. They would mark passages in advance to be set in **bold**, _italics_, ```a different font```, etc., so that the type could be set efficiently.

This is one type of markup, _presentational_ markup, which records how the text should look. The other is _semantic_ markup, which records what a piece of text is or means.

But of course there are grey areas. If we write 'Hamlet' and '_Hamlet_', we mean the character and the play respectively, so here the italics carry meaning as well as appearance.

#### And why should we care?

In DH you are highly likely to come across data that is marked up in some way. You might want to extract data from the markup or you might want to convert the markup into something else.

You may also want to mark up data yourself. Markup is on a spectrum from extremely light to very dense, depending on the project and the research questions.

Some types of markup are easier to extract data from than others!
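
As a rough illustration, compare pulling play titles out of Markdown with pulling them out of XML. The sentence and the element names (`s`, `title`, `character`, `emph`) are invented for this sketch and do not follow any particular scheme.

```python
import re
import xml.etree.ElementTree as ET

# Presentational: italics are used for the title and for emphasis alike
markdown_text = "In _Hamlet_, Hamlet is _very_ indecisive."

# Semantic: hypothetical element names say what each piece of text is
xml_text = (
    "<s>In <title>Hamlet</title>, <character>Hamlet</character> "
    "is <emph>very</emph> indecisive.</s>"
)

# From Markdown we can only ask for "everything in italics"
italics = re.findall(r"_([^_]+)_", markdown_text)

# From the XML we can ask for titles specifically
titles = [element.text for element in ET.fromstring(xml_text).iter("title")]

print(italics)  # ['Hamlet', 'very']
print(titles)   # ['Hamlet']
```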

##### Markdown

There are _many_ types of markup. In this workshop we'll only look at four (or maybe three and a half):
- Markdown
- HTML
- LaTeX
- XML

<h5>HTML</h5>

<p>There are <i>many</i> types of markup. In this workshop we'll only look at four (or maybe three and a half):</p>
<ul>
    <li>Markdown</li>
    <li>HTML</li>
    <li>LaTeX</li>
    <li>XML</li>
</ul>

##### LaTeX

\begin{equation}
\text{Dice}(A, B) = \frac{2 |A \cap B|}{|A| + |B|}
\end{equation}

### The rules of XML

- There must be a single _root element_ which wraps around everything else, e.g. ```<persuasion>...</persuasion>```
- Elements must nest properly inside each other, without overlapping: ```<chapter><paragraph>...</paragraph></chapter>```
- Opening and closing tags must match, including in case: ```<Head>...</Head>```
- Attribute values must be in quotation marks: ```<character name="Anne Elliot">Miss Elliot</character>```
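
The rules above can be tried out with Python's built-in `xml.etree.ElementTree` parser, which raises a `ParseError` when a string is not well-formed. This is only a sketch: the sample strings and the helper name `is_well_formed` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# One well-formed sample, then one sample breaking each rule in turn
samples = {
    "well-formed": '<persuasion><chapter><paragraph><character name="Anne Elliot">'
                   'Miss Elliot</character></paragraph></chapter></persuasion>',
    "no single root": "<chapter>one</chapter><chapter>two</chapter>",
    "overlapping elements": "<chapter><paragraph>text</chapter></paragraph>",
    "case mismatch": "<Head>Chapter 1</head>",
    "unquoted attribute": "<character name=Anne>Miss Elliot</character>",
}


def is_well_formed(xml_string):
    """Return True if the parser accepts the string, False otherwise."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


results = {label: is_well_formed(xml) for label, xml in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'OK' if ok else 'not well-formed'}")
```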

### Markup schemes

With XML you are free to use any elements you like. This brings great flexibility.

However, it creates a couple of problems:

- If texts are marked up differently they're not usable _en masse_
- Writing your own scheme is a lot of work

A solution is to use an existing encoding scheme. There are many, but the most commonly used one in Digital Humanities is the [Text Encoding Initiative](https://www.tei-c.org).
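
To give a flavour of what this looks like, here is a small sketch that marks up the opening of _Persuasion_ with a few TEI elements (`div`, `head`, `p`, `persName`, `placeName`) and reads the names back out. The fragment is an illustration only, not a complete TEI document (it has no `teiHeader`, for instance). Note that TEI elements live in a namespace, which has to be supplied when searching.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: a real TEI file also needs a teiHeader
tei_fragment = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="chapter" n="1">
      <head>Chapter 1</head>
      <p><persName>Sir Walter Elliot</persName>, of <placeName>Kellynch Hall</placeName>,
      in <placeName>Somersetshire</placeName>, was a man who, for his own amusement,
      never took up any book but the Baronetage</p>
    </div>
  </body></text>
</TEI>"""

# Map a prefix of our choosing ("tei") to the TEI namespace
ns = {"tei": "http://www.tei-c.org/ns/1.0"}

root = ET.fromstring(tei_fragment)
people = [element.text for element in root.findall(".//tei:persName", ns)]
places = [element.text for element in root.findall(".//tei:placeName", ns)]

print(people)  # ['Sir Walter Elliot']
print(places)  # ['Kellynch Hall', 'Somersetshire']
```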
    

#### Group work

Mark up the first couple of pages of ```persuasion.xml```. 
- Decide on what you want to mark up: proper nouns, parts of speech, whatever you like
- Come up with a markup scheme: what will be an element name and what will be an attribute value?
- Periodically check that your XML is well-formed by opening or refreshing it in a browser, which will report an error if it is not
- If you have time, try to map your markup to the elements used in TEI