# Monday, November 3rd, 2025

We've seen several string methods that can be used for processing text (`str`) data. For example, the `.split`, `.replace`, and `.join` methods can be used to manipulate strings. Today, we will look more into working with text data in Python.

[Project Gutenberg](https://www.gutenberg.org/) is a repository containing over 75,000 free eBooks that are in the public domain. The code cell below will download the text from *Frankenstein; Or, The Modern Prometheus* by Mary Shelley and save it as a text file `frankenstein.txt`.

**Note:** Feel free to browse through the Project Gutenberg library and select any other eBook of your choice. Just make sure to download a plain-text version of the eBook (not PDF, EPUB, HTML, or any other formats).

In [None]:
import requests

data = requests.get('https://www.gutenberg.org/cache/epub/42324/pg42324.txt')
with open('frankenstein.txt','wb') as f:
    f.write(data.content)

## Working with text files in Python

To get started, we will need to be able to load the downloaded text file into Python. The `open` function can be used in Python to open a file for reading or writing. For example, `open('frankenstein.txt', 'r')` will open the `frankenstein.txt` file for reading (signified by the argument `'r'`). Once a file is opened, we can use the `.read()` method to read the contents of the file into a string.

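This particular text file uses UTF-8 encoding. The default encoding that the `open` function uses for text data depends on the operating system and its settings, so it is not guaranteed to be UTF-8. We can include an optional argument `encoding='utf-8-sig'` when opening the file to request UTF-8 explicitly; the `-sig` variant also skips the byte order mark that some UTF-8 files begin with.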

In [None]:
f = open('frankenstein.txt', 'r', encoding='utf-8-sig')  # Opens the file for reading
text = f.read()                                         # Reads the contents into a string

print(text[2000:3000])

When opening a file in Python, the file remains open to Python until it is closed using the `.close` method. If we forget to close the file, it is possible for undesirable things to happen (for example, data written to the file may not be fully saved, or other programs may be blocked from modifying the file).

In [None]:
f.close()                                            # Closes the file

To ensure that we do not forget to close the file, we can use the `with` construct to have Python automatically close the file when we are done reading from it. To use this `with` construct, we assign a temporary variable name to an opened file. We then carry out any desired operations on this temporary variable in an indented block. Once the code exits the indented block, the file is closed.

The code below demonstrates how this works.

In [None]:
with open('frankenstein.txt', 'r', encoding='utf-8-sig') as f:  # Opens the file for reading
    text = f.read()                                            # Reads the content into a string

# The file `f` is now closed, since we have exited the `with` block.

print(text[2000:3000])
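To see what `encoding='utf-8-sig'` changes, the sketch below writes a small throwaway file that begins with a UTF-8 byte order mark, then reads it back with both `'utf-8'` and `'utf-8-sig'`. The file name `bom_demo.txt` and its contents are made up for illustration. With plain `'utf-8'`, the mark shows up as one extra character at the start of the string; with `'utf-8-sig'`, it is dropped.

```python
import os
import tempfile

# Work inside a temporary folder that is deleted automatically afterwards.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'bom_demo.txt')   # hypothetical file name

    # Write the three bytes of the UTF-8 byte order mark, followed by 'Hello'.
    with open(path, 'wb') as f:
        f.write(b'\xef\xbb\xbfHello')

    # Read the same file with the two different encodings.
    with open(path, 'r', encoding='utf-8') as f:
        plain = f.read()
    with open(path, 'r', encoding='utf-8-sig') as f:
        stripped = f.read()

print(len(plain), len(stripped))   # 6 5
print(stripped)                    # Hello
```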

## Analyzing text data

Now that we have some text to work with, what questions can we ask/answer? Suppose that we want to explore the frequency of the words that appear in *Frankenstein*.

First, it will be helpful to obtain a list of the words that appear in *Frankenstein*. As a starting point, we can use the `.split` method to separate the `text` string into a list of substrings separated by any amount of whitespace. This will (generally) give us the list of words in `text`.

In [None]:
words = text.split()
print(words[:10])

**Exercise:** Construct a dictionary `word_count_dict` whose keys are words and whose values count how many times that word appears in *Frankenstein*.

To construct this dictionary:
 - Look through every word in *Frankenstein*.
 - If the word is not in our dictionary, add it to the dictionary with a value of `1`.
   - We can test whether a dictionary `word_count_dict` contains a key `word` using the Boolean expression `word in word_count_dict`. This will be `True` if `word` is a defined key in `word_count_dict` and `False` otherwise.
 - If the word is in the dictionary, increment the value associated to that word by `1`.

In [None]:
word_count_dict = {}
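One possible way to follow the steps above is sketched below. To keep the example self-contained, it counts the words in a short made-up string `sample_text` (not the actual novel) and stores the result in a hypothetical dictionary `sample_count_dict`. The same loop could be run over the `words` list instead.

```python
sample_text = 'the monster saw the man and the man saw the monster'
sample_words = sample_text.split()

sample_count_dict = {}
for word in sample_words:
    if word not in sample_count_dict:
        sample_count_dict[word] = 1       # First time we see this word
    else:
        sample_count_dict[word] += 1      # Word already seen: increment its count

print(sample_count_dict['the'])       # 4
print(sample_count_dict['monster'])   # 2
```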

How many times does the word `'monster'` appear in the text?

We've generated a dictionary mapping words to their counts, but which word has the highest count? We can try using the `max` function to find the most frequently appearing word.

What is happening here? Is the result truly the most frequently appearing word in *Frankenstein*?

By default, taking the maximum of a dictionary returns the maximum of the keys of that dictionary. In this case, the keys are strings, so it returns the string that comes last when the strings are compared character by character (for lowercase letters, this is the last string in alphabetical order). This is illustrated in the simple example below.

In [None]:
my_dict = {'a': 5,
           'b': 3,
           'c': 1}

In [None]:
max(my_dict)

In order to find the most frequently appearing word in `word_count_dict`, we'd rather take the maximum of the values (that is, of the word counts). We can use `word_count_dict.values()` to get a "list" of the values, and then take the maximum using the `max` function.

This gives us the count for the most frequently appearing word, but it does not tell us which word has this count. This is illustrated in the simple example below.

In [None]:
max(my_dict.values())

What we'd really like to do is to find the "maximum" key, where we measure the "size" of a key by its associated value.

## Finding maximums/minimums with a custom key

When using `max` to find the maximum value of an iterable, we can optionally include an argument `key` (a function) that tells the `max` function how items are to be compared. This `key` function must be able to take in any of the items in the iterable, and return some type of data that can be sorted (like integers, floats, strings, etc.).

For example, suppose we have a list of strings and we want to find the string that is the longest. We could use `key=len` inside the `max` function to find the longest string.

In [None]:
my_list = ['Two', 'One', 'Four', 'Three']

In [None]:
max(my_list)

In [None]:
max(my_list, key=len)

We can also supply a custom key when using the `min` function to find minimums.

In [None]:
min(my_list)

In [None]:
min(my_list, key=len)

Note: If there are multiple items that are maximums or minimums, Python will return the first of these items encountered in the list. In the example above, both `'Two'` and `'One'` have the minimum length of `3`. Python returns `'Two'` since it appears before `'One'` in the list.

Let's define our own function that can be used to compare the strings in `my_list`.

In [None]:
def my_key(s):
    if s == 'One':
        return 1
    elif s == 'Two':
        return 2
    elif s == 'Three':
        return 3
    elif s == 'Four':
        return 4

In [None]:
min(my_list, key=my_key)

In [None]:
max(my_list, key=my_key)

Returning to the *Frankenstein* text, we want to find the "maximum" word, where the size of each word is given by the number of times that word appears in the text. 

**Exercise:** Write a function `word_count_key` that takes in a string `word` and returns the number of times that `word` appears in *Frankenstein*. You should use the `word_count_dict` that was defined earlier.

**Exercise:** Use the `word_count_key` function to find the most frequently appearing word in *Frankenstein*, along with the number of times that the word appears.
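As a hedged sketch of the idea behind these two exercises, the code below uses a small made-up dictionary `sample_count_dict` in place of `word_count_dict`, and a hypothetical key function `sample_count_key` that looks up the count of a word. Passing this function as `key` makes `max` compare words by their counts.

```python
# Made-up counts standing in for the real word counts.
sample_count_dict = {'and': 1, 'the': 4, 'monster': 2}

def sample_count_key(word):
    # The "size" of a word is the number of times it appears.
    return sample_count_dict[word]

most_common = max(sample_count_dict, key=sample_count_key)
print(most_common, sample_count_dict[most_common])   # the 4
```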

## Sorting

The `sorted` function can take in an iterable structure (e.g. list, dictionary, etc.) and return a sorted list of those items.

In [None]:
my_list

In [None]:
sorted(my_list)

Just like the `max` and `min` functions, we can supply a `key` input argument to change the way that the items are sorted.

In [None]:
sorted(my_list, key=len)

By default, the `sorted` function will sort in ascending order (i.e. from smallest to largest). We can switch to descending order using the optional argument `reverse=True`.

In [None]:
sorted(my_list, key=my_key)

In [None]:
sorted(my_list, key=my_key, reverse=True)

Note: When sorting a dictionary, the `sorted` function will return a list of sorted keys (and will drop the dictionary structure and associated values).

**Exercise:** Use the `sorted` function along with the `word_count_key` function to sort the keys of `word_count_dict` from most frequently appearing to least frequently appearing. Then print out the 10 most frequently appearing words along with their corresponding word counts.
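The sketch below shows one way this kind of ranking could look on a small made-up dictionary `sample_count_dict` (the names and counts here are assumptions for illustration). Sorting the dictionary returns its keys, the `key` function orders them by count, and `reverse=True` puts the largest counts first. Slicing the result keeps only the top few words.

```python
# Made-up counts standing in for the real word counts.
sample_count_dict = {'and': 1, 'the': 4, 'monster': 2, 'man': 3}

def sample_count_key(word):
    return sample_count_dict[word]

# Keys sorted from most frequent to least frequent.
ranked_words = sorted(sample_count_dict, key=sample_count_key, reverse=True)

# Print the top 3 words with their counts.
for word in ranked_words[:3]:
    print(word, sample_count_dict[word])
```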

## Pre-processing text for analysis

In the code above, we looked through the text of *Frankenstein* and counted how many times each word appears in the text. However, a single word may appear as several different keys in the `word_count_dict` dictionary, with each key being a slight variation of this word.

For example, the word strings `for` and `For` both appear as distinct keys in `word_count_dict`.

In [None]:
word_count_dict['for']

In [None]:
word_count_dict['For']

Can we modify our code to account for this? That is, can we make it so that only the string `'for'` appears in `word_count_dict`, and each appearance of `for` or `For` will be included in the word count for the string `'for'`?

We can use the `.lower` method on a string to convert all upper case letters to lowercase, as shown in the example below. Similarly, the `.upper` method will convert all lowercase letters to uppercase.

In [None]:
s = 'ThiS Is a StRiNg with UppEr and LoweR cASe letTeRs'
print(s)
print(s.lower())
print(s.upper())

**Exercise:** Modify the code above that was used to create `word_count_dict` so that all keys are lowercase and the associated word counts include all instances of the word (regardless of capitalization).
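A minimal sketch of this modification, using a made-up list `sample_words` rather than the words of the novel: each word is converted to lowercase before it is looked up or stored, so that differently capitalized versions share a single key.

```python
sample_words = ['For', 'the', 'monster', 'for', 'The', 'FOR']

sample_count_dict = {}
for word in sample_words:
    word = word.lower()                   # Normalize capitalization first
    if word not in sample_count_dict:
        sample_count_dict[word] = 1
    else:
        sample_count_dict[word] += 1

print(sample_count_dict)
```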

For another example, the word strings `'study'` and `'study,'` appear as distinct keys in `word_count_dict`. 

In [None]:
word_count_dict['study']

In [None]:
word_count_dict['study,']

We can use the `.replace` method to replace all instances of `','` with an empty string before calculating word counts.

In [None]:
s = "This is a string, it contains some punctuation. Here's some more: !?.,"
print(s)
print(s.replace(',',''))

We can expect that this will happen with several other types of punctuation, such as `'.'`, `'!'`, `'?'`, `'('`, `')'`, `"'"`, `'"'`, `'#'`, `'@'`, `'%'`, etc.

**Exercise:** Use the `.replace` method to remove punctuation marks and other non-letter characters from the text of *Frankenstein*, then re-compute `word_count_dict`.
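One way to approach this is to loop over a list of punctuation marks and call `.replace` once for each. The sketch below does this on a made-up string `sample_text`; the list `punctuation_marks` is an assumption chosen to cover this example only, and a real text would likely need more entries.

```python
sample_text = 'I study, you study; we all study! (Why? To study.)'

# Marks to strip out; extend this list as needed for a real text.
punctuation_marks = [',', ';', '!', '?', '.', '(', ')']

cleaned_text = sample_text
for mark in punctuation_marks:
    cleaned_text = cleaned_text.replace(mark, '')   # Remove every copy of this mark

# Re-compute the word counts on the cleaned text.
sample_count_dict = {}
for word in cleaned_text.split():
    if word not in sample_count_dict:
        sample_count_dict[word] = 1
    else:
        sample_count_dict[word] += 1

print(cleaned_text)
print(sample_count_dict['study'])   # 4
```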

## [Project 5 - Code breakers](https://jllottes.github.io/Projects/code_breakers/code_breakers.html)

Our next project deals with trying to break an encryption to discover the meaning of a secret message. Let's look at how the encryption process works.

### Background:  ASCII codes

Each character on a computer keyboard is assigned an [ASCII code](http://www.theasciicode.com.ar/), which is an integer in the range `0`-`127`. The ASCII code of a character can be obtained using the `ord()` function:

In [None]:
for c in "This is MTH 337":
    print("'{}'  ->  {}".format(c, ord(c)))

Conversely, the function `chr()` converts ASCII codes into characters:

In [None]:
char_list = []
for n in [104, 101, 108, 108, 111]:
    char_list.append(chr(n))      # Convert each ASCII code to a character
txt = ''.join(char_list)          # Join the characters once the loop is done
print(txt)

It will be helpful to be able to convert easily between strings of characters and lists of their corresponding ASCII codes.

**Exercise:** Write functions `str_to_ascii` and `ascii_to_str` that will convert between strings and lists of ASCII codes. We can use the `ord()` and `chr()` functions to convert any particular character or ASCII code.
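A possible sketch of these two functions is given below. It simply applies `ord()` to every character of a string, and `chr()` to every code in a list, so the two functions undo each other. This is one of several reasonable ways to write them.

```python
def str_to_ascii(s):
    # Build the list of ASCII codes, one per character of s.
    ascii_list = []
    for c in s:
        ascii_list.append(ord(c))
    return ascii_list

def ascii_to_str(ascii_list):
    # Convert each code back to a character, then join into one string.
    char_list = []
    for n in ascii_list:
        char_list.append(chr(n))
    return ''.join(char_list)

codes = str_to_ascii('hello')
print(codes)                 # [104, 101, 108, 108, 111]
print(ascii_to_str(codes))   # hello
```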

### Text encryption

In order to securely send a confidential message one usually needs to encrypt it in some way to conceal its content. Here we consider the following encryption scheme:

 - One selects a secret key, which is a sequence of characters. This key is used to both encrypt and decrypt the message.
 - Characters of the secret key and characters of the message are converted into ASCII codes. In this way the key is transformed into a sequence of integers $(k_1, k_2, \dots, k_r)$, and the message becomes another sequence of integers $(m_1, m_2, \dots, m_s)$. If $r<s$, then the secret key sequence is extended by repeating it as many times as necessary until it matches the length of the message.
 - Let $c_i$ be the remainder from the division of $m_i + k_i$ by $128$. The sequence of numbers $(c_1, c_2, \dots, c_s)$ is the encrypted message.

For example, if the message is `'Top secret!'` and the secret key is `'buffalo'` then the encrypted message is: `[54, 100, 86, 6, 84, 81, 82, 84, 90, 90, 7]`. Let's develop some code that will allow us to perform this encryption ourselves.

In [None]:
message = 'Top secret!'
key = 'buffalo'

First, let's convert both to lists of ASCII codes using the `str_to_ascii` function.

Problem: Our message has more characters than our key, so we need to duplicate our key enough times to match the length of the message.

One idea: use a `while` loop to keep duplicating `key_ascii` until it matches or exceeds the length of `message_ascii`.

Another idea: Use integer division to count how many times we need to duplicate the `key_ascii` list to match or exceed the length of `message_ascii`.

Another idea: use modular arithmetic on the index of `key_ascii`, dividing by the length of `key_ascii` to keep looping through `key_ascii` until we have enough entries.

**Exercise:** Write a function `get_padded_key_ascii` that takes in arguments `key_ascii` and `length` and returns a padded version of `key_ascii` of length `length`, obtained by repeating `key_ascii` as many times as necessary.
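The sketch below follows the modular arithmetic idea: position `i` of the padded key takes the entry of `key_ascii` at index `i % len(key_ascii)`, which cycles back to the start of the key whenever we run off its end. This is only one of the approaches listed above, and it assumes `key_ascii` is not empty.

```python
def get_padded_key_ascii(key_ascii, length):
    padded_key = []
    for i in range(length):
        # i % len(key_ascii) cycles through 0, 1, ..., len(key_ascii) - 1.
        padded_key.append(key_ascii[i % len(key_ascii)])
    return padded_key

# ASCII codes for the key 'buffalo', padded to the length of 'Top secret!'.
key_ascii = [98, 117, 102, 102, 97, 108, 111]
print(get_padded_key_ascii(key_ascii, 11))
```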

**Exercise:** Write a function `encrypt(message_ascii, key_ascii)` that returns the encrypted version of `message_ascii` using the secret key `key_ascii` (based on the code above).
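A minimal sketch of such a function is shown below. It cycles through the key using modular indexing instead of building a padded key first, adds each key code to the matching message code, and keeps the remainder after division by `128`. Running it on `'Top secret!'` with the key `'buffalo'` reproduces the encrypted list given earlier.

```python
def encrypt(message_ascii, key_ascii):
    encrypted_ascii = []
    for i in range(len(message_ascii)):
        k = key_ascii[i % len(key_ascii)]                     # Repeat the key as needed
        encrypted_ascii.append((message_ascii[i] + k) % 128)  # Remainder mod 128
    return encrypted_ascii

# Convert the example message and key to lists of ASCII codes.
message_ascii = [ord(c) for c in 'Top secret!']
key_ascii = [ord(c) for c in 'buffalo']

encrypted_ascii = encrypt(message_ascii, key_ascii)
print(encrypted_ascii)   # [54, 100, 86, 6, 84, 81, 82, 84, 90, 90, 7]
```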

In order to decrypt the message we work backwards: for each number $c_i$, we compute the remainder from the division of $c_i - k_i$ by $128$. This number is equal to $m_i$, so converting it into a character we get the $i$-th letter of the message.

**Exercise:** Write a function `decrypt(encrypted_ascii, key_ascii)` that returns the decrypted message.
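The sketch below shows one way decryption could be written, mirroring the encryption step with subtraction in place of addition. In Python, the `%` operator with a positive divisor always gives a result between `0` and `127` here, even when `c - k` is negative, so no extra adjustment is needed. The example decrypts the encrypted list from the `'Top secret!'` example using the key `'buffalo'`.

```python
def decrypt(encrypted_ascii, key_ascii):
    decrypted_ascii = []
    for i in range(len(encrypted_ascii)):
        k = key_ascii[i % len(key_ascii)]                       # Repeat the key as needed
        decrypted_ascii.append((encrypted_ascii[i] - k) % 128)  # Undo the addition mod 128
    return decrypted_ascii

encrypted_ascii = [54, 100, 86, 6, 84, 81, 82, 84, 90, 90, 7]
key_ascii = [ord(c) for c in 'buffalo']

decrypted_ascii = decrypt(encrypted_ascii, key_ascii)
decrypted_message = ''.join([chr(n) for n in decrypted_ascii])
print(decrypted_message)   # Top secret!
```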