# Plain text and text encodings

### *Michaelmas Term 2025*

## Aims for these sessions

- enough Python so that you can recognise some features and feel some familiarity
- cover some DH basics
- a little bit of computer science to help understand some of the above

### What is Python and what can it do?

It's a free, open-source, general-purpose language. Nowadays the only version to use is Python 3 and the most up-to-date release is 3.14 (Python 2 is maintained on some systems for legacy reasons but new code should not be written in it).

Python can essentially do anything a computer can do, although it might not be the best choice for some things. It's a first choice for a lot of data science work and has become the main language for machine learning/AI.

Because of its breadth, there will always be areas of Python that are unfamiliar. If you know the fundamentals and have some practice, you'll be able to understand lots of Python code, even if the applications of it are different from what you're used to.

### How do you run Python?

If you have Python installed:
- Run a Python script (a text file with a ```.py``` extension) from the command line: ```python myscript.py```
- Use the built-in Python shell: type ```python``` at the command line and press return and you'll get a prompt. This is an exploratory environment where you can try snippets
- Use an installed shell like ```IPython```: a more fully featured version of the above
- Use an installed program like Jupyter Notebook, which runs in a browser

Online:
- Google Colab
- PythonAnywhere
- GitHub Actions
- many others, including 'playgrounds'

Before we start running some Python, save a copy of this notebook in Drive (you can ignore the warnings). You always need to do this with an imported notebook in Colab, so that any changes you make persist and you keep your own copy of the notebook.

Let's read the opening paragraph of Jane Austen's 1817 novel *Persuasion* using Python.

In [None]:
persuasion_snippet = """Sir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who,
for his own amusement, never took up any book but the Baronetage; there
he found occupation for an idle hour, and consolation in a distressed
one; there his faculties were roused into admiration and respect, by
contemplating the limited remnant of the earliest patents; there any
unwelcome sensations, arising from domestic affairs changed naturally
into pity and contempt as he turned over the almost endless creations
of the last century; and there, if every other leaf were powerless, he
could read his own history with an interest which never failed. This
was the page at which the favourite volume always opened:
"""

In [None]:
persuasion_snippet

This is plain text. You can see exactly what's going on, even the line breaks. Plain text is easy to work with. For example, we can see the line breaks are encoded as `\n`, so we can remove them if we like.

In [None]:
persuasion_snippet_no_line_breaks = persuasion_snippet.replace("\n", " ")

In [None]:
persuasion_snippet_no_line_breaks

What about reading in a Word file containing the whole of *Persuasion*? First we need to get the files we're using today to somewhere Colab can read them:

In [None]:
!wget https://raw.githubusercontent.com/jonathanblaney/2025-1-plain-text/refs/heads/main/persuasion.txt
!wget https://raw.githubusercontent.com/jonathanblaney/2025-1-plain-text/refs/heads/main/persuasion.docx
!wget https://raw.githubusercontent.com/jonathanblaney/2025-1-plain-text/refs/heads/main/emma.txt
!wget https://raw.githubusercontent.com/jonathanblaney/2025-1-plain-text/refs/heads/main/plain-text-emojis.docx
!wget https://raw.githubusercontent.com/jonathanblaney/2025-1-plain-text/refs/heads/main/plain-text-emojis-to-unzip.docx
    

In [None]:
with open('persuasion.docx', 'r') as f:
    persuasion = f.read()

When learning and using Python it's important not to be worried by errors. They're normal and you should try to see them as Python being helpful!

Read error messages from the bottom up. Python has dozens of built-in error types; this one is a `UnicodeDecodeError` and it mentions 'utf-8'. Unicode and UTF-8 are both associated with plain-text formats, and Python is saying that the Word file is not giving us that.
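
To see where a `UnicodeDecodeError` comes from without needing any files, here is a minimal sketch. The byte values in `not_text` are made up for illustration; they are not the actual contents of `persuasion.docx`.

```python
# Text becomes bytes when it is encoded, and bytes become text when decoded.
text = "Kellynch Hall"
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")
print(encoded)
print(decoded == text)

# Illustrative bytes only: 0xff can never appear in valid UTF-8,
# so decoding them as UTF-8 fails, much as reading a binary file as text does.
not_text = b"\xff\xfe\x00\x01"
try:
    not_text.decode("utf-8")
    error_name = None
except UnicodeDecodeError as err:
    error_name = type(err).__name__
    print(error_name, "-", err.reason)
```

Catching the error with `try`/`except` is only done here so the cell finishes cleanly; in exploratory work it is usually fine to let the error appear and read it.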

A ```docx``` file is not really a single file. It's a zip file containing multiple files. The same is true of ```xlsx``` files and many other formats. You can use Python to open Word and Excel files, but it's not straightforward.
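
The idea that a Word file is a zip archive can be sketched with Python's standard `zipfile` module. This example builds a tiny stand-in archive in memory so that it runs anywhere; the member names imitate the layout of a Word document, but the contents are placeholders, not real Word XML.

```python
import io
import zipfile

# Build a small stand-in archive in memory (an assumption for illustration,
# not a real .docx file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document>Sir Walter Elliot</document>")

# Reopen the same bytes for reading and list what is inside.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    member_names = archive.namelist()
    document_xml = archive.read("word/document.xml").decode("utf-8")

print(member_names)
print(document_xml)
```

With a real file you would pass its name instead of `buffer`, for example `zipfile.ZipFile('plain-text-emojis-to-unzip.docx')`, assuming the download above succeeded.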

In Python it is straightforward to work with plain text. With ```persuasion.txt``` in the same folder as this notebook, we can read in the whole of Austen's _Persuasion_ like this (this is the same code as above, just changing the file extension from `docx` to `txt`).

In [None]:
with open('persuasion.txt', 'r') as f:
    persuasion = f.read()

Now ```persuasion``` is a variable pointing at the full text of _Persuasion_. We can view the contents of a variable in Colab just by typing it in a code cell.

What usable information do we have here? What information can we get from plain text?

In [None]:
persuasion

Notice that this is fast. Working with plain text like this is very fast on a modern computer. But one long string is not very convenient for reading a novel.

We can see parts of the text using Python's _slice_ notation. The numbers are character positions within the big text string which is ```persuasion```. Try changing these to get a different slice.

In [None]:
persuasion[2000:3000]
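
Slice notation is easier to follow on a short string. The sketch below uses a made-up `sample` variable (taken from the opening words of the novel) rather than the full `persuasion` text.

```python
sample = "Sir Walter Elliot, of Kellynch Hall"

first_three = sample[0:3]   # characters at positions 0, 1 and 2
from_start = sample[:10]    # omitting the start means "from the beginning"
to_end = sample[22:]        # omitting the end means "to the end"
last_four = sample[-4:]     # negative numbers count back from the end

print(first_three)
print(from_start)
print(to_end)
print(last_four)
```

Note that the end position is not included: `sample[0:3]` gives three characters, not four.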

How many characters are there in total? We can get this with the ```len``` function:

In [None]:
len(persuasion)

Does this seem reasonable? How many characters in the average novel? How many words?

The line endings are annoying. As we did above, we can get rid of them using Python's ```replace``` method, which works on sequences of characters (strings).

In [None]:
persuasion = persuasion.replace("\n", " ")

In [None]:
persuasion[1000:2000]

What else can we do with built-in methods? Colab will help you. Type ```.``` after the variable you want to do something with (in this case ```persuasion```). A list should pop up. This will show the string methods available.

In [None]:
persuasion.

The methods available with ```persuasion``` depend on what type of thing ```persuasion``` is. We can find out with ```type```. This is a very useful function in Python because it's easy to be under a misapprehension about what you're dealing with (which leads to a ```TypeError```).

In [None]:
type(persuasion)

So ```persuasion``` is a string and we can use any of Python's built-in string methods on it. Let's try ```find```. Anne Elliot is the principal character in _Persuasion_ so let's look for _Anne_.

In [None]:
persuasion.find("Anne")

Wait, what does ```2133``` refer to? You can get help on a particular method by running the cell with a ```?``` appended:

In [None]:
persuasion.find?

This is perhaps a bit abstruse but it means that _Anne_ first occurs in our _Persuasion_ string at index 2133, counting the first character as 0. Is that useful?
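
Here is a small sketch of what ```find``` returns, again using a made-up `sample` string rather than the whole novel.

```python
sample = "Sir Walter Elliot, of Kellynch Hall"

position = sample.find("Walter")  # index of the first match, counting from 0
missing = sample.find("Anne")     # find returns -1 when there is no match

# The index can be fed straight into a slice to see the match in context.
context = sample[position:position + 13]

print(position)
print(missing)
print(context)
```

Checking for `-1` matters: a slice starting at `-1` would silently begin at the last character instead of raising an error.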

Let's try the ```count``` method. That seems like it should count the number of occurrences of something, but we can check.

In [None]:
persuasion.count?

In [None]:
persuasion.count("Anne")

We have to be careful with results like these. Does it mean that the person _Anne_ is mentioned 496 times in _Persuasion_? Maybe, but is there a character called _Annette_ or a trip to _Annecy_ in France? Is there even another character called Anne who is not Anne Elliot?

In [None]:
persuasion.count("Anne Elliot")

In [None]:
persuasion.count("Miss Elliot")
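
The caution about ```count``` can be shown on an invented sentence (it is not from the novel). ```count``` matches sequences of characters, not whole words.

```python
# Invented sample sentence, for illustration only.
sample = "Anne met Annette in Annecy; Anne's sister stayed at home."

# Every place the four characters "Anne" appear, including inside longer words.
substring_hits = sample.count("Anne")

# Adding a trailing space excludes Annette and Annecy, but also misses "Anne's".
with_space = sample.count("Anne ")

print(substring_hits)
print(with_space)
```

Neither figure is simply "right": which one you want depends on the question you are asking of the text.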

Notice that when we get the length of _Persuasion_ the syntax is different from when we count occurrences:
    
```len(persuasion)```

```persuasion.count("Anne")```

```len``` is a _function_ but ```count``` is a particular type of function, called a _method_.
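
A minimal sketch of the difference, using a short made-up `sample` string: a function takes the object as an argument, while a method is called on the object with a dot.

```python
sample = "Anne Elliot"

length = len(sample)                 # a function: the object goes inside the brackets
method_count = sample.count("l")     # a method: called on the object with a dot
same_count = str.count(sample, "l")  # the same method, reached through the str type

# len is not tied to strings: it also works on other collections, such as lists.
list_length = len(["Anne", "Elizabeth", "Mary"])

print(length, method_count, same_count, list_length)
```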

Why do you get an error if you run ```len``` on an integer, e.g. ```len(5)```? Why isn't the answer 1?

### Group work (or homework)

Keeping to the constraint of not importing any Python modules, just using built-in methods and functions, what else can you do with _Persuasion_?

- [list of string methods with examples](https://www.tutorialstonight.com/python/python-string-methods)


Here are some ideas, but feel free to do your own thing:
- divide up the text into equal chunks and look for distributions of words or phrases through the novel
- get some context around a string you find using slice notation (yes this is very crude)
- split the text on spaces using the ```split``` method. What are the pros and cons of this? 
- don't forget that you can use ```persuasion.``` and ```persuasion.method?``` to find out about other string methods; are these useful?
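
As one possible starting point for the first idea, here is a rough sketch that uses only built-in functions and methods. The `text` string is invented so the example runs on its own; the names `n_chunks`, `chunk_size` and `counts` are just illustrative choices.

```python
# Invented sample text, standing in for the novel.
text = "Anne walked. Anne spoke. Elizabeth waited. Mary wrote. Anne smiled. Mary left."
n_chunks = 3
chunk_size = len(text) // n_chunks

counts = []
for i in range(n_chunks):
    start = i * chunk_size
    # Let the final chunk run to the end so no leftover characters are dropped.
    end = len(text) if i == n_chunks - 1 else start + chunk_size
    counts.append(text[start:end].count("Anne"))

print(counts)
```

This is crude: chunks are cut by character position, so a name that straddles a boundary would be missed. Is that a problem at the scale of a whole novel?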

If you get through all of that, well done! Now try importing another Jane Austen novel, *Emma*, which is saved in the same place as `persuasion.txt` with the name `emma.txt`. Can you produce some comparative figures for the two novels?