<h1>Getting Started on a Cipher Diagnostic Tool</h1>

<p>One of the most challenging aspects of cryptanalysis (at least with classical ciphers) is the identification of the type of cipher. Is it monoalphabetic, polygraphic, homophonic, or something else entirely?

Once the correct type of cipher has been identified, the real work can start in earnest. This is easier said than done. 

Let's see if we can work toward a cipher classifier. This is going to take a lot of work, but with persistence and patience, I think we can produce something useful... but we need some data. In particular, we need to identify features of ciphers that the classifier can use to learn the structure of different ciphers.

There are some clear choices. The ciphertext itself and Friedman's Index of Coincidence (IoC) for a ciphertext are the most obvious starting places. Other important measures might include the Shannon entropy (though there may be some dependence between this and the IoC, so we will need to be careful) and the n-gram frequency distributions. Polygraphic and homophonic substitutions can use far fewer or far more characters than the standard English alphabet. It may be tempting to include known plaintext as well, but most cipher diagnosis has to be done without it. Let's start making a list.

<ul>
    <li>The Ciphertext Itself</li>
    <li>Index of Coincidence</li>
    <li>Shannon Entropy of the Ciphertext</li>
    <li>N-Gram Frequency Distributions</li>
    <li>Factors of Numbers Close to the Ciphertext Length</li>
    <li>Other Things I Have Not Thought Of Yet!</li>

</ul>

This, admittedly, is not a large number of features to consider. We will keep our minds open.

However, we are still getting ahead of ourselves! We don't even have any ciphers to work with! We also need to identify the kinds of assumptions that will underlie our model, as that will affect the ways in which we construct our dataset.
</p>
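
Three of the features listed above can be computed with the standard library alone. The following is a minimal sketch, not a settled design: the function names are my own, the IoC is left unnormalized (not multiplied by the alphabet size), and the entropy is measured over whatever symbols appear in the text.

```python
import math
from collections import Counter


def index_of_coincidence(text: str) -> float:
    """Friedman's Index of Coincidence over the letters of text (unnormalized)."""
    letters = [c for c in text.upper() if c.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    # Probability that two letters drawn without replacement are the same.
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per symbol, of the symbols in text."""
    n = len(text)
    if n == 0:
        return 0.0
    counts = Counter(text)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def ngram_counts(text: str, n: int) -> Counter:
    """Counts of overlapping n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

Since the IoC and the entropy are both computed from the same symbol counts, they will be correlated, which is the dependence mentioned above.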

<h2>Assumptions</h2>
Given a random ciphertext, how likely is it that it was encrypted with an Affine cipher? Vigenere? Hill? 

Well, as I know of no "Complete Database of Classical Ciphers," and since we want the classifier to base its conclusion on the characteristics of a given ciphertext, we will create this dataset in such a way that ciphers are approximately uniformly distributed. That is, a given ciphertext in the dataset will have an equal chance of having been enciphered with any of the encryption algorithms implemented.

Furthermore, the likelihood of a ciphertext being encrypted with a given cipher should be independent of the length of the ciphertext. So we should make sure that ciphertext lengths are distributed in approximately the same way for each individual cipher.

<ol>
    <li>The distribution of ciphers used should be approximately uniform.</li>
    <li>The distribution of ciphertext lengths should be approximately uniform.</li>
    <li>Encrypted text should come from human-readable plaintext.</li>
    <li>Encrypted text should come from plaintext that (at least roughly) follows common grammatical conventions.</li>
</ol>

This looks like a good starting place, and we can return to these assumptions later if we feel so inclined. Let's make a dataset!
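
The first two assumptions can be sketched as a sampling plan in which the cipher and the length are drawn independently and uniformly. The cipher names, the length bounds, and the function name below are placeholders chosen for illustration, not a fixed part of the design.

```python
import random
from collections import Counter

# Hypothetical cipher labels, used only for illustration.
CIPHER_NAMES = ["affine", "vigenere", "hill"]


def sample_plan(num_samples: int, min_length: int, max_length: int, seed: int = 0):
    """Return (cipher_name, length) pairs, drawing cipher and length independently and uniformly."""
    rng = random.Random(seed)
    return [
        (rng.choice(CIPHER_NAMES), rng.randint(min_length, max_length))
        for _ in range(num_samples)
    ]


plan = sample_plan(num_samples=3000, min_length=10, max_length=1000)

# Each cipher should appear roughly 1000 times.
print(Counter(name for name, _ in plan))
```

Because the length is drawn without looking at the cipher, the length distribution is the same for every cipher, which is what the second assumption asks for.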

<h2>Data</h2>

<p>For the time being, we are going to focus on ciphertext that has been encrypted from English plaintext. As such, Project Gutenberg seems like a good resource. Given their enormous influence on other texts, why don't we start with the top 100 downloaded ebooks of the last 30 days? 

It will probably be useful to create a regex object to identify the appropriate links on the webpage. A little poking around the source on https://www.gutenberg.org/browse/scores/top should help us identify the correct format.

The download URLs appear to come in a few forms. Here are the ones I have found so far (note I have formatted them according to the structure of f-strings):
<ul>
    <li>"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}-0.txt"</li>
    <li>"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}.txt"</li>
    <li>"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}.txt.utf-8"</li>
    <li>"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}-0.txt.utf-8"</li>
    <li>"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}.txt"</li>
    <li>"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}.txt.utf-8"</li>
    <li>"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}-0.txt"</li>
    <li>"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}-0.txt.utf-8"</li>
</ul>
</p>
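
Here is a rough sketch of the regex idea, assuming that book links on the page have the form `/ebooks/<number>`. The names `EBOOK_HREF`, `extract_ebook_id`, and `candidate_urls` are my own, and the URL list simply reproduces the eight patterns above.

```python
import re

# Matches hrefs of the form "/ebooks/<digits>" and captures the numeric ID.
EBOOK_HREF = re.compile(r"^/ebooks/(\d+)$")


def extract_ebook_id(href: str):
    """Return the numeric ebook ID from an href, or None if the href is not a book link."""
    match = EBOOK_HREF.match(href)
    return match.group(1) if match else None


def candidate_urls(ebook_id: str) -> list:
    """Build the candidate download URLs listed above for one ebook ID."""
    base = "https://www.gutenberg.org"
    stems = [
        f"{base}/files/{ebook_id}/{ebook_id}",
        f"{base}/cache/epub/{ebook_id}/pg{ebook_id}",
    ]
    suffixes = ["-0.txt", ".txt", ".txt.utf-8", "-0.txt.utf-8"]
    return [stem + suffix for stem in stems for suffix in suffixes]
```

Requiring digits after `/ebooks/` filters out navigation links that share the same prefix but do not point at a book.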

<p>Okay! Now let's create some functions to do the heavy lifting for us! 

First we will write a function to connect to the appropriate website, parse the HTML (using Beautiful Soup, of course), and get things organized by title and ID number.

After this, we will write a second function that constructs URLs for a given book ID and attempts to download the book, trying a variety of file extensions and locations if the download fails, and a third function that does this for each of the books with IDs retrieved.</p>

In [292]:
# A function that connects to the Project Gutenberg website for top books
# and returns a list of the most downloaded book titles and their IDs,
# where each entry is a (title, ID) tuple.

def get_top_books_list():
    """
    Connects to the Project Gutenberg website for top books and returns a list of the most downloaded book titles and their IDs.

    Returns:
        list[tuple[str, str]]: A list of tuples containing the cleaned book title and the ID number.
    """

    import re
    import requests
    from bs4 import BeautifulSoup

    # Connect to the Project Gutenberg website for top books.
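    try:
        response = requests.get("https://www.gutenberg.org/browse/scores/top", timeout=30)
        # Treat HTTP error codes (4xx, 5xx) as failures too.
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(e)
        # Return an empty list so that callers can still iterate over the result.
        return []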
    
    # Parse the HTML response.
    soup = BeautifulSoup(response.text, "html.parser")

    # Find all possible book titles and ids and store them in a list
    book_titles = []
    for link in soup.find_all("a"):
        try:
            if link.get("href").startswith("/ebooks/"):
                book_titles.append((link.text, link.get("href").split("/")[2]))
        except AttributeError:
            pass
    # Remove the non-book titles at the front.
    book_titles = book_titles[4:]

    # Create a list with book titles and IDs.
    book_list = []
    for book in book_titles:
        book_list.append((book[0], book[1]))

    # Remove everything after and including 'by' from the book titles. Note that this splits on the
    # first "by" anywhere in the link text, so a title containing those letters is cut short
    # (e.g. 'mo' and 'the_great_gats' in the output below).
    for i in range(len(book_list)):
        book_list[i] = (book_list[i][0].split("by")[0], book_list[i][1])

    # Remove non-alpha characters from the book titles, remove excess whitespace, convert names to lowercase,
    # and replace remaining spaces with underscores.
    for i in range(len(book_list)):
        book_list[i] = (re.sub(r"[^a-zA-Z ]+", "", book_list[i][0]).strip().lower().replace(" ", "_"), book_list[i][1])

    # Remove books with duplicate ids, keeping the first occurrence of each.
    seen_ids = set()
    unique_books = []
    for title, ebook_id in book_list:
        if ebook_id not in seen_ids:
            seen_ids.add(ebook_id)
            unique_books.append((title, ebook_id))
    book_list = unique_books

    return book_list


In [293]:
# Test the get_top_books_list() function.
print(get_top_books_list())

[('romeo_and_juliet', '1513'), ('mo', '2701'), ('a_room_with_a_view', '2641'), ('middlemarch', '145'), ('little_women_or_meg_jo_beth_and_amy', '37106'), ('the_complete_works_of_william_shakespeare', '100'), ('the_enchanted_april', '16389'), ('the_blue_castle_a_novel', '67979'), ('cranford', '394'), ('the_adventures_of_ferdinand_count_fathom__complete', '6761'), ('history_of_tom_jones_a_foundling', '6593'), ('the_expedition_of_humphry_clinker', '2160'), ('the_adventures_of_roderick_random', '4085'), ('twenty_years_after', '1259'), ('my_life__volume', '5197'), ('pride_and_prejudice', '1342'), ('frankenstein_or_the_modern_prometheus', '84'), ('the_devils_dictionary', '972'), ('alices_adventures_in_wonderland', '11'), ('constantinople_old_and_new', '70946'), ('the_yellow_wallpaper', '1952'), ('knock_threeonetwo', '70944'), ('the_great_gats', '64317'), ('a_tale_of_two_cities', '98'), ('the_picture_of_dorian_gray', '174'), ('dracula', '345'), ('adventures_of_huckleberry_finn', '76'), ('noli_
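
Entries such as 'mo' and 'the_great_gats' in the output above come from splitting on the letters "by" wherever they occur. One possible fix is sketched below. It assumes the link text looks like "Title by Author (count)", and the name `clean_title` is my own. Unlike the original code, it also collapses repeated underscores.

```python
import re


def clean_title(link_text: str) -> str:
    """Turn link text such as 'Moby Dick by Herman Melville (1234)' into 'moby_dick'."""
    # Split on the last " by " (with surrounding spaces) so that a "by" inside a word survives.
    title = link_text.rsplit(" by ", 1)[0]
    # Keep letters and spaces only, then normalize whitespace, case, and separators.
    title = re.sub(r"[^a-zA-Z ]+", "", title)
    return "_".join(title.lower().split())
```

This is still a heuristic: a link with no author whose title itself contains " by " would be cut at that point.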

<p>It took a little bit of work, but we got there in the end. 

Next, we make a function that takes an ID number as input and retrieves a plaintext file of the book with the given ID. Let's also have it name and save the file in a subdirectory.</p>

In [294]:
# A function to download an ebook given its ID. Since Project Gutenberg is not a fan of automated tools messing with their
# site, we will be way sneaky and wait for 1 second between requests.
def download_ebook(ebook_id: str, save_path: str = "text_files") -> None:
    """
    Downloads an ebook given its ID, and saves it to the save_path directory.
    
    Args:
        ebook_id (str): The ID of the ebook to download.
        save_path (str, optional): The directory in which to save the ebook. Defaults to "text_files".
    """

    import requests
    import time

    # Create a list of possible urls for the ebook.
    urls = [f"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}-0.txt",
            f"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}.txt",
            f"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}.txt.utf-8",
            f"https://www.gutenberg.org/files/{ebook_id}/{ebook_id}-0.txt.utf-8",
            f"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}.txt",
            f"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}.txt.utf-8",
            f"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}-0.txt",
            f"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}-0.txt.utf-8"]
    
    # Try each url until we find one that works.
    response = None
    for url in urls:
        try:
            response = requests.get(url, timeout=30)

            # If the request was unsuccessful, or the text response is an html file rather than
            # the book we are looking for, we try the next url.
            if not response.ok or response.text.lstrip().startswith("<!DOCTYPE html>"):
                response = None
                continue
            else:
                break
        # If the request fails, we try the next url.
        except requests.exceptions.RequestException as e:
            print(e)
            response = None
            continue
        finally:
            # Wait 1 second between requests.
            time.sleep(1)

    # Save the ebook to the save_path directory if the response is not None, then close the connection.
    if response is not None:
        with open(f"{save_path}/{ebook_id}.txt", "w", encoding='utf-8') as f:
            f.write(response.text)
        response.close()
    

In [295]:
# Test the download_ebook() function with something that should work (At The Earth's Core by Edgar Rice Burroughs).
download_ebook("123")

This looks promising! We will try to download them all, changing their names to something more appropriate along the way.

In [296]:
# A function to download all ebooks in the top 100 list, changing their names to their titles.
def retrieve_books(file_path: str = "text_files") -> None:
    """
    Downloads all ebooks in the top 100 list, changing their names to their titles.
    """

    # Import the os module.
    import os

    # Get the top 100 books list.
    top_100_books = get_top_books_list()


    # Download each ebook.
    for book in top_100_books:
        print(f"Downloading {book[0]}...")
        download_ebook(book[1], file_path)

        # Try to rename the ebook by the actual name. If the book already exists in the file path,
        # move on to the next book. If an error occurs while trying to rename the book because
        # the book could not be retrieved in the first place, move on to the next book.
        try:
            os.rename(f"{file_path}/{book[1]}.txt", f"{file_path}/{book[0]}.txt")
        except FileExistsError:
            print(f"{book[0]} already exists in this directory.")
            continue
        except FileNotFoundError:
            print(f"Book with ID {book[1]} could not be found.")
            continue

In [297]:
# The moment of truth has arrived (again; I've come to this point a number of times). Download all the books.
retrieve_books()

Downloading romeo_and_juliet...
Downloading mo...
Downloading a_room_with_a_view...
Downloading middlemarch...
Downloading little_women_or_meg_jo_beth_and_amy...
Downloading the_complete_works_of_william_shakespeare...
Downloading the_enchanted_april...
Downloading the_blue_castle_a_novel...
Downloading cranford...
Downloading the_adventures_of_ferdinand_count_fathom__complete...
Downloading history_of_tom_jones_a_foundling...
Downloading the_expedition_of_humphry_clinker...
Downloading the_adventures_of_roderick_random...
Downloading twenty_years_after...
Downloading my_life__volume...
Downloading pride_and_prejudice...
Downloading frankenstein_or_the_modern_prometheus...
Downloading the_devils_dictionary...
Downloading alices_adventures_in_wonderland...
Downloading constantinople_old_and_new...
Downloading the_yellow_wallpaper...
Downloading knock_threeonetwo...
Downloading the_great_gats...
Downloading a_tale_of_two_cities...
Downloading the_picture_of_dorian_gray...
Downloading dra

In [298]:
# Check to see how many books were downloaded.
import os

print(len(os.listdir("text_files")))

116


<h3>Success!... And More Issues to Address</h3>

<p>It looks as if things worked more or less according to plan (a few books are apparently unavailable in the file format I was hoping for). That said, we have some more things to deal with, like the headers and footers in each text file, and some books are not even in English (we can try to incorporate non-English text later on, but for now their inclusion will undermine some things).

The style of writing (vocabulary, grammar, etc.) in these books varies remarkably, and they come in a variety of lengths. For any cipher that is neither polyalphabetic nor polygraphic, the usage of different symbols, and therefore the vernacular of the period in which a text was written, will likely alter predictions due to differences in n-gram frequencies, particularly with $n > 1$. As such, we need to be sure that each book has an equal chance of being enciphered with a given cipher. In addition, since there is no reason someone cannot encipher sentence fragments, paraphrases, shorthand, or otherwise truncated text, we should try to ensure some of the plaintext possesses these characteristics.

But first, let's clean things up a bit. We will create a function to remove the boilerplate text from the top and bottom of each ebook text file, and to clean up some of the spacing and formatting issues.</p>

In [370]:
# A function to clean up the text from a Project Gutenberg book.
def clean_text(file_name: str) -> None:
    """
    Removes the boilerplate text from a single Project Gutenberg book and cleans the text up by
    fixing punctuation and removing non-ascii characters. The cleaned text is written back to the same file.

    Args:
        file_name (str): The name of the file to clean.

    Raises:
        IndexError: If the "START OF THE PROJECT GUTENBERG EBOOK" marker is not found in the file.
    """

    import re

    # Open the file and read the text.
    try:
        with open(f"text_files/{file_name}", "r", encoding='utf-8') as f:
            text = f.read()
    except UnicodeDecodeError:
        print(f"UnicodeDecodeError: {file_name} could not be processed.")
        return None

    # Remove the boilerplate text.
    text = text.split("START OF THE PROJECT GUTENBERG EBOOK")[1]
    text = text.split("END OF THE PROJECT GUTENBERG EBOOK")[0]
    
    # Remove the first and last lines of the text to get rid of the redundant title and asterisks.
    text = " ".join(text.split("\n")[1:-1])

    # Text cleaning. The only punctuation we will retain are commas and any symbols used to terminate a sentence.
    # Remove the newlines.
    text = re.sub("\n", " ", text)

    # Remove the chapter headings.
    text = re.sub("CHAPTER [A-Z]+", "", text)

    # Remove the asterisks, quotation marks, apostrophes, dashes, semicolons, colons, parentheses, brackets, and underscores.
    text = re.sub(r"[*\"'\-;:()\[\]_]", "", text)

    # Remove any remaining non-ascii characters.
    text = re.sub(r"[^\x00-\x7F]+", "", text)

    # Collapse runs of two or more spaces into a single space.
    text = re.sub(r" {2,}", " ", text)

    # Save the text back to the file.
    try:
        with open(f"text_files/{file_name}", "w", encoding='utf-8') as f:
            f.write(text)
    except UnicodeEncodeError:
        print(f"UnicodeEncodeError: {file_name} could not be processed.")
        return None

    return None
    


In [371]:
# A function to clean all the texts in the text_files directory.
def clean_all_texts(file_path: str = "text_files") -> None:
    """
    Cleans all the texts in the text_files directory. Any of the text files that cannot be cleaned
    (because the expected boilerplate markers are missing) will be deleted.

    Args:
        file_path (str, optional): The path to the directory containing the text files. Defaults to "text_files".

    Raises:
        FileNotFoundError: If the file path does not exist.
    """

    import os

    # Get a list of all the books in the text_files directory.
    books = os.listdir(file_path)

    # Remove the boilerplate text from each book.
    for book in books:
        try:
            clean_text(book)
        except IndexError:
            print(f"{book} could not be processed.")
            # If the text cannot be cleaned, delete the file.
            os.remove(f"{file_path}/{book}")
            continue
        except FileNotFoundError:
            print(f"{book} could not be found.")
            continue


In [372]:
# Clean all the texts in the text_files directory.
clean_all_texts()

the_philippines_a_century_hence.txt could not be processed.
ang_filibusterismo_karugtng_ng_noli_me_tangere.txt could not be processed.
the_tragical_history_of_doctor_faustus.txt could not be processed.
UnicodeDecodeError: .DS_Store could not be processed.
helps_to_latin_translation_at_sight.txt could not be processed.
demonology_and_devillore.txt could not be processed.
the_kama_sutra_of_vatsyayana.txt could not be processed.
noli_me_tangere.txt could not be processed.
florante_at_laura.txt could not be processed.
the_expedition_of_humphry_clinker.txt could not be processed.
the_problems_of_philosophy.txt could not be processed.
how_to_sing.txt could not be processed.
the_king_in_yellow.txt could not be processed.
mo.txt could not be processed.
the_slang_dictionary.txt could not be processed.
baron_trumps_marvellous_underground_journey.txt could not be processed.
diego_collados_grammar_of_the_japanese_language.txt could not be processed.
the_tempest.txt could not be processed.
history_

In [373]:
# Check to see how many books remain.
import os

print(len(os.listdir("text_files")))

96
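
One possible reason so many files could not be processed is that the START marker does not always read exactly "START OF THE PROJECT GUTENBERG EBOOK". Some older files use "THIS" in place of "THE". I have not verified that this explains the failures above, so treat the following as a sketch under that assumption. The names `START_MARKER`, `END_MARKER`, and `strip_boilerplate` are my own.

```python
import re

# Assumed marker variants: "*** START OF THE PROJECT GUTENBERG EBOOK ..." and the older
# "*** START OF THIS PROJECT GUTENBERG EBOOK ..."; other variants may exist.
START_MARKER = re.compile(r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*\n")
END_MARKER = re.compile(r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK")


def strip_boilerplate(text: str):
    """Return the text between the START and END markers, or None if either marker is missing."""
    start = START_MARKER.search(text)
    if start is None:
        return None
    # Only look for the END marker after the START marker.
    end = END_MARKER.search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()].strip()
```

Returning None instead of raising an IndexError lets the caller decide whether to delete the file or set it aside for inspection.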


<h3>Whew! That Was a Learning Experience!</h3>

<p>Now that we have some files to work with, we need to get back to the issue of ensuring each book is equally likely to be enciphered with a given cipher. To that end, we will create a function that will randomly select a book from among the books we have downloaded, and then it will randomly select a substring of the book. The substring will be of a random length, and it will be selected from a random location in the book. We will also have it return the index of the substring within the book.</p>
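
The excerpt step described above, including the returned index, can be sketched on an in-memory string as follows. The name `random_excerpt_with_index` and the `rng` parameter are my own additions. Unlike the function in the next cell, this sketch caps the excerpt length at the length of the text.

```python
import random


def random_excerpt_with_index(text: str, max_length: int = 100000, rng=random):
    """Return (start_index, excerpt) for a random-length excerpt at a random position in text."""
    if not text:
        return 0, ""
    # Cap the length at the size of the text so short books still yield an excerpt.
    length = rng.randint(1, min(max_length, len(text)))
    # Choose a start so that the excerpt fits entirely inside the text.
    start = rng.randint(0, len(text) - length)
    return start, text[start:start + length]
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which may help when debugging the dataset later.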

In [386]:
# A function to randomly select a book from the text_files directory and select a random excerpt from that book.
def get_random_excerpt(file_path: str = "text_files") -> str:
    """
    Randomly selects a book from the text_files directory and selects a random excerpt from that book.
    """

    import os
    import random

    # Get a list of all the books in the text_files directory.
    books = os.listdir(file_path)

    # Remove the .DS_Store file if it exists.
    if ".DS_Store" in books:
        books.remove(".DS_Store")

    # Select a random book
    book = random.choice(books) 

    # Open the book and read the text.
    try:
        with open(f"{file_path}/{book}", "r", encoding='utf-8') as f:
            text = f.read()
    except UnicodeDecodeError:
        print(f"UnicodeDecodeError: {book} could not be processed.")
        return None

    # Randomly select a length for the excerpt between 1 and 100000 characters.
    excerpt_length = random.randint(1, 100000)

    # Randomly select a starting point for the excerpt.
    excerpt_start = random.randint(0, len(text) - excerpt_length) if len(text) > excerpt_length else 0

    # Select the excerpt.
    excerpt = text[excerpt_start:excerpt_start + excerpt_length] if len(text) > excerpt_length else text

    return excerpt


In [389]:
# Test the get_random_excerpt() function a few thousand times.
for i in range(10000):
    get_random_excerpt()
print("Done.")

Done.


<h3>So far so good!</h3>

<p>A little testing with the get_random_excerpt() function has revealed that at least one issue remains. Some of the books remaining are not in English. We will have to deal with this later. For now, let's just remove them from the list of books we are working with.</p>

In [None]:
# Remove non-English texts from the text_files directory.
def remove_non_english_texts(file_path: str = "text_files") -> None:
    """
    Removes non-English texts from the text_files directory. This function uses the langdetect library to detect the
    language of the text. If the language is not English, the text is deleted.

    Args:
        file_path (str, optional): The path to the directory containing the text files. Defaults to "text_files".
    
    """

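    # Import the os library and the langdetect language detector.
    import os
    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException

    # Get a list of all the books in the text_files directory.
    books = os.listdir(file_path)

    # Remove the .DS_Store file if it exists.
    if ".DS_Store" in books:
        books.remove(".DS_Store")

    # Remove non-English texts.
    for book in books:
        try:
            # Read the book being tested (not a randomly chosen one) and detect its language
            # from a sample of its text.
            with open(f"{file_path}/{book}", "r", encoding='utf-8') as f:
                sample = f.read()[:10000]
            if detect(sample) != "en":
                os.remove(f"{file_path}/{book}")
        except UnicodeDecodeError:
            print(f"UnicodeDecodeError: {book} could not be removed.")
            continue
        except LangDetectException:
            print(f"LangDetectException: {book} could not be removed.")
            continue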

In [2]:
def generate_ciphertext_plaintext_pairs(cipher: Cipher, input_file: str, output_file: str) -> None:
    """
    Reads a text file, and then generates a file containing ciphertext and plaintext pairs of random length and randomly
    selected starting positions. The ciphertext is constructed using a given cipher object.

    Args:
        cipher (Cipher): The cipher object to use for encryption.
        input_file (str): The name of the file to read.
        output_file (str): The name of the file to write.
    """

    import random

    # Read the input file.
    with open(input_file, "r", encoding='utf-8') as f:
        text = f.read()

    # Generate ciphertext and plaintext pairs.
    ciphertext_plaintext_pairs = []
    for i in range(1000):
        # Generate a random length for the ciphertext and plaintext, never longer than the text itself.
        length = min(random.randint(10, 1000), len(text))

        # Generate a random starting position for the ciphertext and plaintext.
        start = random.randint(0, len(text) - length)
        
        # Generate the ciphertext and plaintext. 
        plaintext = text[start:start+length]
        ciphertext = cipher.encrypt(plaintext)

        # Add the ciphertext and plaintext to the list of pairs.
        ciphertext_plaintext_pairs.append((ciphertext, plaintext))

    # Write the ciphertext and plaintext pairs to the output file, ciphertext first, one per line.
    with open(output_file, "w", encoding='utf-8') as f:
        for pair in ciphertext_plaintext_pairs:
            f.write(pair[0] + "\n")
            f.write(pair[1] + "\n")
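
The `Cipher` type is not defined in this section. The only thing the function above requires of it is an `encrypt` method that takes a string and returns a string. As a purely hypothetical stand-in, here is a small Caesar cipher with that interface. The class name and constructor are my own and not part of the project.

```python
import string


class CaesarCipher:
    """Hypothetical stand-in for a Cipher object: shifts letters and leaves other characters alone."""

    def __init__(self, shift: int) -> None:
        self.shift = shift % 26

    def encrypt(self, plaintext: str) -> str:
        # Rotate each alphabet by the shift and map the original letters onto the rotated ones.
        shifted_lower = string.ascii_lowercase[self.shift:] + string.ascii_lowercase[:self.shift]
        shifted_upper = string.ascii_uppercase[self.shift:] + string.ascii_uppercase[:self.shift]
        table = str.maketrans(
            string.ascii_lowercase + string.ascii_uppercase,
            shifted_lower + shifted_upper,
        )
        return plaintext.translate(table)


cipher = CaesarCipher(shift=3)
print(cipher.encrypt("Attack at dawn."))
```

Any object with a compatible `encrypt` method could be passed in the same way, which keeps the pair-generation code independent of the individual cipher implementations.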