<h1> Getting Started on a Cipher Diagnostic Tool</h1>

<p>One of the most challenging aspects of cryptanalysis (at least with classical ciphers) is the identification of the type of cipher. Is it monoalphabetic, polygraphic, homophonic, or something else entirely?

Once the correct type of cipher has been identified, the real work can start in earnest. This is easier said than done. 

Let's see if we can work toward a cipher classifier. This is going to take a lot of work, but with persistence and patience, I think we can produce something useful... but we need some data. In particular, we need to identify features of ciphers that can be used by the classifier in learning the structure of different ciphers.

There are some clear choices. The ciphertext itself and Friedman's Index of Coincidence (IoC) for the ciphertext are the most obvious starting places. Other important measures might include the Shannon entropy (though there may be some dependence between this and the IoC, so we will need to be careful) and the n-gram frequency distributions. Polygraphic and homophonic substitutions can use far fewer or far more distinct symbols than the standard English alphabet. It may be tempting to include known plaintext as well, but most cipher diagnosis... Let's start making a list.

<ul>
    <li>The Ciphertext Itself</li>
    <li>Index of Coincidence</li>
    <li>Shannon Entropy of the Ciphertext</li>
    <li>N-Gram Frequency Distributions</li>
    <li>Factors of Numbers Close to the Ciphertext Length</li>
    <li>Other Things I Have Not Thought Of Yet!</li>

</ul>

This, admittedly, is not a large number of features to consider. We will keep our minds open.
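
As a rough illustration of two of the features listed above, here is a minimal sketch of the Index of Coincidence and the Shannon entropy. The function names are my own, and the IoC here is the unnormalized form (roughly 0.067 for English text and roughly 0.038 for uniformly random letters), computed over the letters A-Z only. The entropy is computed over every symbol in the string, which is an assumption that may need revisiting for ciphers with unusual alphabets.

```python
import math
from collections import Counter


def index_of_coincidence(text: str) -> float:
    """Probability that two letters drawn without replacement from text are equal."""
    # Keep only the letters A-Z (an assumption; other alphabets need other filters).
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per symbol, of the symbols appearing in text."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())
```

Both measures summarize how uneven the symbol distribution is, which is the dependence between them mentioned earlier.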

However, we are still getting ahead of ourselves! We don't even have any ciphers to work with! We also need to identify the kinds of assumptions that will underlie our model, as that will impact the ways in which we construct our dataset.
</p>

<h2>Assumptions</h2>
Given a random ciphertext, how likely is it that it was encrypted with an Affine cipher? Vigenère? Hill?

Well, as I know of no "Complete Database of Classical Ciphers," and since we want the classifier to base its conclusion on the characteristics of a given ciphertext, we will create this dataset in such a way that ciphers are approximately uniformly distributed. That is, a given ciphertext in the dataset will have an equal chance of having been enciphered with any of the encryption algorithms implemented.

Furthermore, the likelihood of a ciphertext being encrypted with a given cipher should be independent of the length of the ciphertext. So we should be certain that ciphertext lengths are distributed in approximately the same way for each individual cipher as well.

<ol>
    <li>The distribution of ciphers used should be approximately uniform.</li>
    <li>The distribution of ciphertext lengths should be approximately uniform.</li>
</ol>
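
One way these two assumptions might be enforced when building the dataset is to draw the cipher and the length independently for every sample. The sketch below is only an illustration: the cipher names are placeholders for whatever ciphers end up being implemented, and the length bounds are assumed defaults.

```python
import random

# Hypothetical cipher labels; the real list depends on which ciphers get implemented.
CIPHER_NAMES = ["affine", "vigenere", "hill"]


def sample_design(num_samples, min_len=10, max_len=1000, seed=None):
    """Return (cipher_name, length) pairs with cipher and length drawn independently."""
    rng = random.Random(seed)
    design = []
    for _ in range(num_samples):
        cipher_name = rng.choice(CIPHER_NAMES)   # uniform over the ciphers
        length = rng.randint(min_len, max_len)   # uniform over lengths, bounds inclusive
        design.append((cipher_name, length))
    return design
```

Because the length never depends on the cipher drawn, each cipher sees approximately the same length distribution once the number of samples is large.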

This looks like a good starting place, and we can return to these assumptions later if we feel so inclined. Let's make a dataset!

<h2>Data</h2>

For the time being, we are going to focus on ciphertext that has been encrypted from English plaintext. As such, Project Gutenberg seems like a good resource. Given their enormous influence on other texts, why don't we start with the top 100 downloaded ebooks of the last 30 days?

It will probably be useful to create a regex object to identify the appropriate links on the webpage. A little poking around the source on https://www.gutenberg.org/browse/scores/top should help us identify the correct format.
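
A regex object for this might look something like the sketch below. It assumes that book links on the page have an href of the form /ebooks/ followed by a numeric ID; the pattern and the helper name are illustrative only.

```python
import re

# Assumption: a book link looks like href="/ebooks/<digits>", with nothing after the ID.
EBOOK_HREF = re.compile(r"^/ebooks/(\d+)$")


def extract_ebook_id(href):
    """Return the ebook ID as a string, or None if href is not a book link."""
    match = EBOOK_HREF.match(href or "")
    return match.group(1) if match else None
```

Under that assumption, links such as /ebooks/search/ are rejected because they do not end in a run of digits.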

There appear to be a few ways the download urls appear. Here are the ones I have found so far (note I have formatted them according to
the structure of f-strings):
<ul>
    <li>"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}-0.txt"</li>
    <li>"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}.txt"</li>
    <li>"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}.txt.utf-8"</li>
    <li>"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}-0.txt.utf-8"</li>
    <li>"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}.txt"</li>
    <li>"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}.txt.utf-8"</li>
    <li>"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}-0.txt"</li>
    <li>"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}-0.txt.utf-8"</li>
</ul>

Okay! Now let's create some functions to do the heavy lifting for us! 

First we will write a function to connect to the appropriate website, parse the HTML (using Beautiful Soup, of course), and get things organized by title and ID number.

After this, we will write a second function that constructs candidate URLs for a given book ID and attempts the download, trying a variety of file extensions and locations if a download fails, and a third function that downloads every book whose ID was retrieved.

In [172]:
# A function that connects to the Project Gutenberg website for top books
# and returns a list of (title, ID) tuples for the most downloaded books,
# with duplicate IDs removed.

def get_top_books_list():
    """
    Connects to the Project Gutenberg website for top books and returns the most downloaded book titles and their IDs.

    Returns:
        list[tuple[str, str]] | None: A list of (title, ID) tuples, or None if the request fails.
    """

    import re
    import requests
    from bs4 import BeautifulSoup

    # Connect to the Project Gutenberg website for top books.
    try:
        response = requests.get("https://www.gutenberg.org/browse/scores/top")
    except requests.exceptions.RequestException as e:
        print(e)
        return None
    
    # Parse the HTML response.
    soup = BeautifulSoup(response.text, "html.parser")

    # Find all possible book titles and ids and store them in a list
    book_titles = []
    for link in soup.find_all("a"):
        try:
            if link.get("href").startswith("/ebooks/"):
                book_titles.append((link.text, link.get("href").split("/")[2]))
        except AttributeError:
            pass
    # Remove the non-book titles at the front.
    book_titles = book_titles[4:]

    # Create a list with book titles and IDs.
    book_list = []
    for book in book_titles:
        book_list.append((book[0], book[1]))

    # Remove everything after and including 'by' from the book titles.
    # Note: this splits on the bare substring "by", so a title that contains it
    # is truncated as well (e.g. "Moby Dick" becomes "Mo").
    for i in range(len(book_list)):
        book_list[i] = (book_list[i][0].split("by")[0], book_list[i][1])

    # Remove non-alpha characters from the book titles, remove excess whitespace, convert names to lowercase,
    # and replace remaining spaces with underscores.
    for i in range(len(book_list)):
        book_list[i] = (re.sub(r"[^a-zA-Z ]+", "", book_list[i][0]).strip().lower().replace(" ", "_"), book_list[i][1])

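    # Remove books with duplicate ids, keeping the first occurrence of each.
    # (Popping from the list while looping over its indices can skip entries.)
    seen_ids = set()
    unique_books = []
    for title, book_id in book_list:
        if book_id not in seen_ids:
            seen_ids.add(book_id)
            unique_books.append((title, book_id))
    book_list = unique_books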

    return book_list


In [176]:
# Test the get_top_books_list() function.
print(get_top_books_list())

[('romeo_and_juliet', '1513'), ('mo', '2701'), ('a_room_with_a_view', '2641'), ('middlemarch', '145'), ('the_complete_works_of_william_shakespeare', '100'), ('little_women_or_meg_jo_beth_and_amy', '37106'), ('the_enchanted_april', '16389'), ('the_blue_castle_a_novel', '67979'), ('cranford', '394'), ('the_adventures_of_ferdinand_count_fathom__complete', '6761'), ('the_expedition_of_humphry_clinker', '2160'), ('the_adventures_of_roderick_random', '4085'), ('history_of_tom_jones_a_foundling', '6593'), ('twenty_years_after', '1259'), ('my_life__volume', '5197'), ('pride_and_prejudice', '1342'), ('frankenstein_or_the_modern_prometheus', '84'), ('alices_adventures_in_wonderland', '11'), ('dracula', '345'), ('the_great_gats', '64317'), ('the_picture_of_dorian_gray', '174'), ('the_photodrama', '70937'), ('a_tale_of_two_cities', '98'), ('the_wizards_cave', '70936'), ('noli_me_tangere', '20228'), ('a_modest_proposal', '1080'), ('ang_filibusterismo_karugtng_ng_noli_me_tangere', '47629'), ('the_ad

It took a little bit of work, but we got there in the end.

Next, we make a function that takes an ID number as input and retrieves a plaintext file of the book with the given ID. Let's also have it name and save the file in a subdirectory.

In [179]:
# A function to download an ebook given its ID. Since Project Gutenberg is not a fan of automation, we will be sneaky
# and wait for 1 second between requests.
def download_ebook(ebook_id: str, save_path: str = "text_files") -> None:
    """
    Downloads an ebook given its ID, and saves it to the save_path directory.
    
    Args:
        ebook_id (str): The ID of the ebook to download.
        save_path (str): The directory in which to save the file. Defaults to "text_files".
    """

    import requests
    import time

    # Create a list of possible urls for the ebook.
    urls = [f"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}-0.txt",
            f"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}.txt",
            f"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}.txt.utf-8",
            f"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}-0.txt.utf-8",
            f"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}.txt",
            f"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}.txt.utf-8",
            f"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}-0.txt",
            f"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}-0.txt.utf-8"]
    
    # Try each url until we find one that works.
    for url in urls:
        try:
            response = requests.get(url)

            # If the text response is an html file, rather than the book we are looking for, we try the next url.
            if response.text.startswith("<!DOCTYPE html>"):
                response = None
                continue
            else:
                break
        # If the request fails, we try the next url.
        except requests.exceptions.RequestException as e:
            print(e)
            response = None
            continue
        finally:
            # Wait 1 seconds between requests.
            time.sleep(1)

    # Save the ebook to the save_path directory if a download succeeded, then close the connection.
    if response is not None:
        with open(f"{save_path}/{ebook_id}.txt", "w") as f:
            f.write(response.text)
        response.close()
    

In [178]:
# Test the download_ebook() function with something that should work (Treasure Island).
download_ebook("112")

This looks promising! We will try to download them all, changing their names to something more appropriate along the way.

In [182]:
# A function to download all ebooks in the top 100 list, changing their names to their titles.
def retrieve_books(file_path: str = "text_files") -> None:
    """
    Downloads all ebooks in the top 100 list, changing their names to their titles.
    """

    # Import the os module.
    import os

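    # Get the top 100 books list; stop early if the list could not be retrieved.
    top_100_books = get_top_books_list()
    if top_100_books is None:
        print("Could not retrieve the list of top books.")
        return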


    # Download each ebook.
    for book in top_100_books:
        print(f"Downloading {book[0]}...")
        download_ebook(book[1], file_path)

        # Try to rename the ebook by the actual name. If the book already exists in the file path,
        # move on to the next book. If an error occurs while trying to rename the book because
        # the book could not be retrieved in the first place, move on to the next book.
        try:
            os.rename(f"{file_path}/{book[1]}.txt", f"{file_path}/{book[0]}.txt")
        except FileExistsError:
            print(f"{book[0]} already exists in this directory.")
            continue
        except FileNotFoundError:
            print(f"Book with ID {book[1]} could not be found.")
            continue

In [183]:
# The moment of truth has arrived (again; I've come to this point a number of times). Download all the books.
retrieve_books()

Downloading romeo_and_juliet...
Downloading mo...
Downloading a_room_with_a_view...
Downloading middlemarch...
Downloading the_complete_works_of_william_shakespeare...
Downloading little_women_or_meg_jo_beth_and_amy...
Downloading the_enchanted_april...
Downloading the_blue_castle_a_novel...
Downloading cranford...
Downloading the_adventures_of_ferdinand_count_fathom__complete...
Downloading the_expedition_of_humphry_clinker...
Downloading the_adventures_of_roderick_random...
Downloading history_of_tom_jones_a_foundling...
Downloading twenty_years_after...
Downloading my_life__volume...
Downloading pride_and_prejudice...
Downloading frankenstein_or_the_modern_prometheus...
Downloading alices_adventures_in_wonderland...
Downloading dracula...
Downloading the_great_gats...
Downloading the_picture_of_dorian_gray...
Downloading the_photodrama...
Downloading a_tale_of_two_cities...
Downloading the_wizards_cave...
Downloading noli_me_tangere...
Downloading a_modest_proposal...
Downloading an

In [184]:
# Check to see how many books were downloaded.
import os
print(len(os.listdir("text_files")))

119


<h3>Success!... And More Issues to Address</h3>
It looks as if things worked more or less according to plan (a few books are apparently unavailable in the file format I was hoping for). That said, we have some more things to deal with, like the headers and footers in each text file, and there are some books which are not even in English (we can try to incorporate non-English text later on, but for now their inclusion will undermine some things).

The style of writing (vocabulary, grammar, etc.) in these books varies remarkably, and they come in a variety of lengths. For any cipher that is neither polyalphabetic nor polygraphic, the usage of different symbols, and therefore the vernacular of the period in which a text was written, will likely alter predictions due to differences in n-gram frequencies, particularly with $n > 1$. As such, we need to be sure that each book has an equal chance of being enciphered with a given cipher. In addition, since there is no reason someone cannot encipher sentence fragments, paraphrases, shorthand, or otherwise truncated text, we should try to ensure some of the plaintext possesses these characteristics.
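
To make the n-gram point a little more concrete, here is a minimal sketch of how letter n-gram frequencies might be computed. The function name is my own, and the choice to strip everything except the letters A-Z (so that n-grams run across word boundaries) is an assumption rather than a settled design decision.

```python
from collections import Counter


def ngram_frequencies(text: str, n: int = 2) -> dict:
    """Relative frequencies of overlapping letter n-grams in text."""
    # Keep only the letters A-Z, so n-grams span word boundaries.
    letters = "".join(c for c in text.upper() if "A" <= c <= "Z")
    total = len(letters) - n + 1
    if n < 1 or total < 1:
        return {}
    counts = Counter(letters[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}
```

Comparing these distributions between two books from different periods would give a rough sense of how much the plaintext source alone shifts the statistics.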

But first, let's clean things up a bit. We will create a function to remove the boilerplate text from the top and bottom of each ebook text file.

In [198]:
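# A function to remove the boilerplate text from a Project Gutenberg book.
def remove_boilerplate(file_name: str) -> None:
    """
    Removes the boilerplate text from a single Project Gutenberg book.

    Args:
        file_name (str): The name of a file in the text_files directory.
    """

    import re

    with open(f"text_files/{file_name}", "r") as f:
        text = f.read()

    # Assumption: the header ends with a marker such as
    # "*** START OF THE PROJECT GUTENBERG EBOOK ... ***" and the footer begins
    # with the matching "*** END OF ..." marker. Files without both markers are left unchanged.
    start = re.search(r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", text, re.DOTALL)
    end = re.search(r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK", text)
    if start and end and start.end() < end.start():
        text = text[start.end():end.start()].strip()

    # Save the changes to the book.
    with open(f"text_files/{file_name}", "w") as f:
        f.write(text)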


To this end, let's create a function to randomly select a piece of text (e.g., paragraphs, sentences, fragments) from one of these books, which we will also select randomly. Let's also try to ensure it doesn't cut words into pieces.

In [47]:
# A function to randomly select a book from the text_files directory, and then randomly select a bit of text from that book,
# ensuring that the text does not cut words into pieces.
import os
import random
import re
def generate_random_text_samples(file_path: str, num_samples=1) -> list:
    """
    Generates random text samples from the books in the text_files directory. The samples are randomly selected from the books. 
    The samples are guaranteed to be complete words, and the number of samples is determined by the num_samples parameter.   

    Args:
        file_path (str): The path to the directory containing the text files.
        num_samples (int): The number of samples to generate. Defaults to 1.

    Returns:
        list: A list of strings containing the text samples.

    Raises:
        ValueError: If num_samples is not an integer greater than 0.

    """

    # Check that num_samples is an integer greater than 0.
    if not isinstance(num_samples, int) or num_samples < 1:
        raise ValueError("num_samples must be an integer greater than 0.")
    
    # Get a list of all the text files in the directory.
    text_files = os.listdir(file_path)

    # Select a random text file.
    text_file = random.choice(text_files)

    # Read the text file.
    with open(file_path + "/" + text_file, "r") as f:
        text = f.read()

    # Split the text into words.
    words = text.split()

    # Remove punctuation from the words.
    words = [re.sub(r'[^\w\s]', '', word) for word in words]

    # Remove empty strings from the words.
    words = [word for word in words if word != ""]

    # Remove numbers from the words.
    words = [word for word in words if not word.isdigit()]

    # Loop until we have the desired number of samples.
    samples = []
    while len(samples) < num_samples:
        # Select a random position in the text. (Calling words.index() on a randomly
        # chosen word would always return the first occurrence of that word.)
        index = random.randrange(len(words))

        # Get the number of words in the text.
        num_words = len(words)

        # Get the number of words in the sample.
        num_sample_words = random.randint(1, num_words)

        # Get the start and end indices of the sample, roughly centered on the chosen position.
        start_index = index - num_sample_words // 2
        end_index = start_index + num_sample_words
    
        # If the sample would start before the beginning of the text, set the start index to 0.
        if start_index < 0:
            start_index = 0

        # If the sample would end after the end of the text, set the end index to the end of the text.
        if end_index > num_words:
            end_index = num_words

        # Get the sample.
        sample = words[start_index:end_index]

        # Join the words in the sample together.
        sample = " ".join(sample)

        # Add the sample to the list of samples.
        samples.append(sample)

    # Return the samples.
    return samples

Let's test this out by trying to produce 10 text samples from our books.

In [48]:
# Generate 10 random text samples using our new function.
samples = generate_random_text_samples("text_files", 10)

In [2]:
def generate_ciphertext_plaintext_pairs(cipher: Cipher, input_file: str, output_file: str) -> None:
    """
    Reads a text file, and then generates a file containing ciphertext and plaintext pairs of random length and randomly
    selected starting positions. The ciphertext is constructed using a given cipher object.

    Args:
        cipher (Cipher): The cipher object to use for encryption.
        input_file (str): The name of the file to read.
        output_file (str): The name of the file to write.
    """

    # Read the input file.
    with open(input_file, "r") as f:
        text = f.read()

    # Generate ciphertext and plaintext pairs.
    ciphertext_plaintext_pairs = []
    for i in range(1000):
        # Generate a random length for the ciphertext and plaintext, capped at the length of the text.
        length = min(random.randint(10, 1000), len(text))

        # Generate a random starting position for the ciphertext and plaintext.
        start = random.randint(0, len(text) - length)
        
        # Generate the ciphertext and plaintext. 
        plaintext = text[start:start+length]
        ciphertext = cipher.encrypt(plaintext)

        # Add the ciphertext and plaintext to the list of pairs.
        ciphertext_plaintext_pairs.append((ciphertext, plaintext))

    # Write the ciphertext and plaintext pairs to the output file.
    with open(output_file, "w") as f:
        for pair in ciphertext_plaintext_pairs:
            f.write(pair[0] + "\n")
            f.write(pair[1] + "\n")
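
The Cipher type used in the function above is not defined in this section; the function only relies on the object having an encrypt method that takes and returns a string. As a purely hypothetical stand-in for experimenting, a simple shift cipher with that interface might look like the sketch below. The class name and its behavior (uppercasing and leaving non-letters unchanged) are my own assumptions.

```python
class CaesarCipher:
    """Hypothetical stand-in for a Cipher object: shifts the letters A-Z by a fixed amount."""

    def __init__(self, shift: int = 3) -> None:
        self.shift = shift % 26

    def encrypt(self, plaintext: str) -> str:
        result = []
        for ch in plaintext.upper():
            if "A" <= ch <= "Z":
                # Shift within the 26-letter alphabet, wrapping around at Z.
                result.append(chr((ord(ch) - ord("A") + self.shift) % 26 + ord("A")))
            else:
                # Leave spaces, digits, and punctuation unchanged.
                result.append(ch)
        return "".join(result)
```

Since the function annotation refers to Cipher by name, a class or alias with that name would have to exist before the cell defining generate_ciphertext_plaintext_pairs is run.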