# [Substitution cipher](https://en.wikipedia.org/wiki/Substitution_cipher)
The cipher-text is generated by replacing each letter of the original text with the corresponding letter from the "key".
Decryption simply reverses the process.

## [Caesar cipher](https://en.wikipedia.org/wiki/Caesar_cipher)

Every letter is shifted by `3` to the left, so `D` becomes `A`, `E` becomes `B`, and so on, wrapping around at the start of the alphabet. This is the OG cipher.
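
A minimal sketch of that left shift, assuming a lowercase Latin alphabet. The name `caesar_shift_left` is illustrative and is not part of this notebook's own code:

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: lowercase Latin letters only


def caesar_shift_left(text: str, shift: int = 3) -> str:
    """Shift every lowercase letter `shift` places to the left, wrapping around."""
    result = []
    for char in text:
        if char in ALPHABET:
            index = (ALPHABET.index(char) - shift) % len(ALPHABET)
            result.append(ALPHABET[index])
        else:
            result.append(char)  # leave spaces, digits and punctuation untouched
    return "".join(result)


print(caesar_shift_left("de"))   # d -> a, e -> b
print(caesar_shift_left("abc"))  # wraps around to the end of the alphabet
```

The modulo keeps the index inside the alphabet, which is what makes `a`, `b` and `c` wrap to `x`, `y` and `z`.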

## [ROT13](https://en.wikipedia.org/wiki/Substitution_cipher)

Shift by `13`. There are 26 letters in the Latin alphabet, so a message encrypted by shifting every letter by `13` can be
decoded by encrypting it once again.
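
A short check of that property, using the `rot_13` text codec from Python's standard library instead of this notebook's own functions. The sample message is made up for the demonstration:

```python
import codecs

message = "hello world"
once = codecs.encode(message, "rot_13")  # encrypt
twice = codecs.encode(once, "rot_13")    # encrypting again decrypts

print(once)
print(twice)
```

Applying the same shift twice moves every letter by 26 positions in total, which is a full turn around the alphabet.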

# Big problem of substitution ciphers
A big problem of substitution ciphers is that they preserve pairs of letters: for example, once an attacker decodes the letter `q`, the next letter will almost surely be `u`.
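
A rough sketch of how letter pairs can be counted, assuming plain English text. The helper `count_bigrams` and the sample sentence are invented for illustration:

```python
from collections import Counter


def count_bigrams(text: str) -> Counter:
    """Count adjacent letter pairs, skipping pairs that contain a non-letter."""
    text = text.lower()
    pairs = (text[i:i + 2] for i in range(len(text) - 1))
    return Counter(pair for pair in pairs if pair.isalpha())


sample = "the queen asked a quick question quite quietly"
bigrams = count_bigrams(sample)

# Keep only the pairs that start with "q".
after_q = {pair: count for pair, count in bigrams.items() if pair[0] == "q"}
print(after_q)
```

Because a substitution cipher always maps the same letter to the same symbol, a pair such as `qu` shows up in the cipher-text as one fixed pair with the same count.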

In [1]:
from typing import *

Alphabets used

In [2]:
LATIN_LETTERS = "abcdefghijklmnopqrstuvwxyz"
FULL_LATIN_LETTERS="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
FULL_LATIN_LETTERS_W_NUMBERS="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

A special case of the substitution cipher is the Caesar cipher, in which every letter is shifted by some constant.
The function below creates the substitution table for that specific cipher.

In [3]:
def gen_substitution_table(shift:int,alphabet:str = LATIN_LETTERS):
    substitution_table = {}
    
    shifted_alphabet = alphabet[shift:] + alphabet[0:shift]
    
    for (letter,shifted_letter) in zip(alphabet, shifted_alphabet):
        substitution_table[letter]= shifted_letter    
    return substitution_table

print(gen_substitution_table(13))

{'a': 'n', 'b': 'o', 'c': 'p', 'd': 'q', 'e': 'r', 'f': 's', 'g': 't', 'h': 'u', 'i': 'v', 'j': 'w', 'k': 'x', 'l': 'y', 'm': 'z', 'n': 'a', 'o': 'b', 'p': 'c', 'q': 'd', 'r': 'e', 's': 'f', 't': 'g', 'u': 'h', 'v': 'i', 'w': 'j', 'x': 'k', 'y': 'l', 'z': 'm'}


The cipher simply replaces each letter of the message with its mapped letter from the substitution table; characters missing from the table are left unchanged.

In [4]:
def encrypt(message:str,substitution_table)->str:
    
    cipher_text:str = ""
    
    for letter in message:
        if letter in substitution_table:
            cipher_text += substitution_table[letter]
        else:
            cipher_text += letter
    return cipher_text

print(encrypt("lol 7 ",gen_substitution_table(13)))
    

yby 7 
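
Decryption can be sketched by inverting the substitution table. The helpers `invert_table` and `apply_table` and the two-letter table below are hypothetical, written to be self-contained instead of reusing the cells above:

```python
def invert_table(substitution_table: dict) -> dict:
    """Swap keys and values so cipher letters map back to plain letters."""
    return {cipher: plain for plain, cipher in substitution_table.items()}


def apply_table(message: str, table: dict) -> str:
    """Replace every mapped character and leave everything else unchanged."""
    return "".join(table.get(char, char) for char in message)


# Hypothetical fragment of a shift-13 table, just large enough for the demo.
table = {"l": "y", "o": "b"}

cipher_text = apply_table("lol 7 ", table)
plain_text = apply_table(cipher_text, invert_table(table))

print(repr(cipher_text))
print(repr(plain_text))
```

Inversion only works when no two plain letters share a cipher letter, which holds for any shifted alphabet.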


The simplest way to break a substitution cipher is to compare how often each letter appears in the cipher-text with the statistical average for the language used.

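The frequency table gives the usage of each letter in the English language, in percent. If we compare those theoretical values with the frequencies measured in a cipher-text, we can guess which cipher letter stands for which plain letter.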

In [5]:
FREQUENCY_TABLE = { 
'E' :12.7,
'T' :9.1,
'A' :8.2,
'O' :7.5,
'I' :7.0,
'N' :6.7,
'S' :6.3,
'H' :6.1,
'R' :6.0,
'D' :4.3,
'L' :4.0,
'C' :2.8,
'U' :2.8,
'M' :2.4,
'W' :2.3,
'F' :2.2,
'G' :2.0,
'Y' :2.0,
'P' :1.9,
'B' :1.5,
'V' :1.0,
'K' :0.8,
'J' :0.15,
'X' :0.15,
'Q' :0.1,
'Z' :0.07}

In [6]:
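def analyze_frequency(message:str,alphabet:str = LATIN_LETTERS)->dict:
    """
    Count appearances of each letter in message and return them as percentages.

    Percentages are relative to the full message length, so spaces, punctuation
    and any character outside the alphabet (uppercase letters included, for the
    default alphabet) lower every value.

    Args:
        message (str): target message, must not be empty
        alphabet (str, optional): letters to count. Defaults to LATIN_LETTERS.

    Returns:
        dict: letter -> frequency in percent, sorted from most to least frequent
    """
    frequency_table = {}
    for letter in alphabet:
        frequency_table[letter] = (100 * message.count(letter)) / len(message)

    # sorted() is stable, so letters with equal frequency keep their alphabet order
    return dict(sorted(frequency_table.items(), key=lambda item: item[1], reverse=True))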

print(analyze_frequency("aaabbgh"))
    

{'a': 42.857142857142854, 'b': 28.571428571428573, 'g': 14.285714285714286, 'h': 14.285714285714286, 'c': 0.0, 'd': 0.0, 'e': 0.0, 'f': 0.0, 'i': 0.0, 'j': 0.0, 'k': 0.0, 'l': 0.0, 'm': 0.0, 'n': 0.0, 'o': 0.0, 'p': 0.0, 'q': 0.0, 'r': 0.0, 's': 0.0, 't': 0.0, 'u': 0.0, 'v': 0.0, 'w': 0.0, 'x': 0.0, 'y': 0.0, 'z': 0.0}


A successful frequency attack first needs a large text, so that the letter counts average out close to their expected values.
Even then, every language, writing style, author, book and paragraph deviates from the language-wide average, so we can't simply match the frequencies found in the cipher-text against that average.
Some additional work is needed to recover the correct substitution table. The example below shows the ideal situation, where the cipher-text frequencies match the reference frequencies exactly because both come from the same text.

In [7]:
with open("lorem_ipsum.txt", "r") as f:
    lorem_ipsum_freq = analyze_frequency(f.read())
print(lorem_ipsum_freq)

{'e': 9.13779305847832, 'i': 8.204155428435934, 'u': 7.3564717389371355, 's': 7.178635999881442, 't': 6.900026675360858, 'a': 6.704407362399597, 'n': 4.757106019739767, 'l': 4.733394587865675, 'r': 4.718574942944367, 'm': 3.6960194433741367, 'o': 3.6397047926731676, 'c': 3.3966626159637214, 'd': 2.2585138860072913, 'p': 1.7338984557929993, 'v': 1.2507780313583687, 'g': 1.1381487299564303, 'b': 0.9632769198849995, 'q': 0.9306737010581226, 'f': 0.7558018909866919, 'h': 0.5127597142772459, 'x': 0.16301609413438453, 'j': 0.08891786952784611, 'k': 0.0, 'w': 0.0, 'y': 0.0, 'z': 0.0}


In [8]:
 
with open("lorem_ipsum.txt", "r") as f:
    encrypted_lorem_ipsum = encrypt(f.read(), gen_substitution_table(21))
encrypted_lorem_ipsum_freq = analyze_frequency(encrypted_lorem_ipsum)
print(encrypted_lorem_ipsum_freq)


{'z': 9.13779305847832, 'd': 8.204155428435934, 'p': 7.3564717389371355, 'n': 7.178635999881442, 'o': 6.900026675360858, 'v': 6.704407362399597, 'i': 4.757106019739767, 'g': 4.733394587865675, 'm': 4.718574942944367, 'h': 3.6960194433741367, 'j': 3.6397047926731676, 'x': 3.3966626159637214, 'y': 2.2585138860072913, 'k': 1.7338984557929993, 'q': 1.2507780313583687, 'b': 1.1381487299564303, 'w': 0.9632769198849995, 'l': 0.9306737010581226, 'a': 0.7558018909866919, 'c': 0.5127597142772459, 's': 0.16301609413438453, 'e': 0.08891786952784611, 'f': 0.0, 'r': 0.0, 't': 0.0, 'u': 0.0}
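
In this ideal case the shift can be read straight off the two printed tables: the most frequent plain letter is `e` and the most frequent cipher letter is `z`. The sketch below assumes a plain shift cipher over the lowercase Latin alphabet; `guess_shift` is a hypothetical helper, not part of the notebook:

```python
LATIN_LETTERS = "abcdefghijklmnopqrstuvwxyz"


def guess_shift(plain_top: str, cipher_top: str, alphabet: str = LATIN_LETTERS) -> int:
    """Shift that moves the top plain letter onto the top cipher letter."""
    return (alphabet.index(cipher_top) - alphabet.index(plain_top)) % len(alphabet)


# Most frequent letters, taken from the two printed frequency tables.
shift = guess_shift(plain_top="e", cipher_top="z")

# The second most frequent pair should agree if the guess is right.
second_opinion = guess_shift(plain_top="i", cipher_top="d")

print(shift, second_opinion)
```

Both pairs give `21`, the shift that was passed to `gen_substitution_table`. With real text the top letters rarely line up this cleanly, so several pairs would need to be checked.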
