# Vigenère cipher
Vigenère encryption, decryption, and a ciphertext-only attack in Python, by [@gjs990825](https://github.com/gjs990825).
See [Vigenère cipher](https://en.wikipedia.org/wiki/Vigen%C3%A8re_cipher) on Wikipedia for more information.

___
## Part 1. Encryption and decryption

### The code

In [1]:
from itertools import cycle, starmap, filterfalse
import wordninja

# constants
A = ord('A')
MAX_KEY_LENGTH = 50
MAX_KEY_CANDIDATE = 10
MAX_DUPLICATED_PART = 0.7

# frequency taken from https://en.wikipedia.org/wiki/Letter_frequency
FREQ_ENGLISH = [0.08167, 0.01492, 0.02782, 0.04253, 0.12702, 0.02228,
                0.02015, 0.06094, 0.06966, 0.00153, 0.00772, 0.04025,
                0.02406, 0.06749, 0.07507, 0.01929, 0.00095, 0.05987,
                0.06327, 0.09056, 0.02758, 0.00978, 0.0236, 0.0015,
                0.01974, 0.00074]
# expected IC (index of coincidence) for English, normalized by the alphabet size (26)
IC_ENGLISH = sum(f * f for f in FREQ_ENGLISH) * 26


def alpha_only(text) -> str:
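    """Return only the ASCII letters of text, upper-cased."""
    # isascii() filters out accented letters such as 'è', so ord(c) - A stays in 0..25
    return ''.join(c for c in text if c.isascii() and c.isalpha()).upper()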


def split(no_space_text):
    """split no_space_text with wordninja, see https://github.com/keredson/wordninja for details"""
    return ' '.join(wordninja.split(no_space_text))


class Vigenere:
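    def __init__(self, keyword: str):
        self.keyword = alpha_only(keyword)
        # an empty key would make cycle() yield nothing and silently produce empty output
        if not self.keyword:
            raise ValueError('keyword must contain at least one letter')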

    @staticmethod
    def get_cipher(p, k) -> str:
        """encrypt character p using character k as key"""
        return chr(A + ((ord(p) - A) + (ord(k) - A)) % 26)

    @staticmethod
    def get_plain(c, k) -> str:
        """decrypt character c using character k"""
        return chr(A + ((ord(c) - A) - (ord(k) - A)) % 26)

    def encrypt(self, plain_text) -> str:
        plain_text = alpha_only(plain_text)
        return ''.join(starmap(self.get_cipher, zip(plain_text, cycle(self.keyword))))

    def decrypt(self, cipher_text, split_word=False) -> str:
        cipher_text = alpha_only(cipher_text)
        plain_text = ''.join(starmap(self.get_plain, zip(cipher_text, cycle(self.keyword))))
        return split(plain_text) if split_word else plain_text

    def encrypt_file(self, in_path, out_path):
        with open(in_path, 'r') as in_file, open(out_path, 'w') as out_file:
            out_file.write(self.encrypt(in_file.read()))

    def decrypt_file(self, in_path, out_path, split_word=False):
        with open(in_path, 'r') as in_file, open(out_path, 'w') as out_file:
            plain_text = self.decrypt(in_file.read(), split_word)
            out_file.write(plain_text)
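
To make the per-character arithmetic concrete, here is a minimal standalone sketch that applies the same modular shift to the classic `ATTACKATDAWN` / `LEMON` example from the Wikipedia article. The helper name `vigenere_shift` and its `sign` parameter are illustrative assumptions and are not part of the `Vigenere` class above.

```python
from itertools import cycle


def vigenere_shift(text: str, key: str, sign: int = 1) -> str:
    """Shift each uppercase letter of text by the matching key letter.

    sign=1 encrypts, sign=-1 decrypts (hypothetical helper, for illustration only).
    """
    return ''.join(
        chr(65 + (ord(t) - 65 + sign * (ord(k) - 65)) % 26)
        for t, k in zip(text, cycle(key))  # cycle() repeats the key as often as needed
    )


cipher = vigenere_shift('ATTACKATDAWN', 'LEMON')
plain = vigenere_shift(cipher, 'LEMON', sign=-1)
print(cipher)  # LXFOPVEFRNHR
print(plain)   # ATTACKATDAWN
```

Decryption is the same shift with the sign reversed, which is why `get_cipher` and `get_plain` differ only in `+` versus `-`.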


### Load test example

In [2]:
try:
    with open('original.txt', 'r') as f:
        original_text = f.read()
except FileNotFoundError as e:
    print(e)
    original_text = ''

print(original_text)

Differential Privacy is the state-of-the-art goal for the problem of privacy-preserving data release and privacy-preserving data mining. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. In particular, when the input dataset contains a large number of attributes, existing methods incur higher computing complexity and lower information to noise ratio, which renders the published data next to useless. This proposal aims to reduce computing complexity and signal to noise ratio. The starting point is to approximate the full distribution of high-dimensional dataset with a set of low-dimensional marginal distributions via optimizing score function and reducing sensitivity, in which generation of noisy conditional distributions with differential privacy is computed in a set of low-dimensional subspaces, and then, the sample tuples from the noisy approximation distribution are used to generate and release the synthetic 

### Encryption and decryption test

In [3]:
v = Vigenere('infosec')
clean_text = alpha_only(original_text)
cipher_text = v.encrypt(original_text)
decrypted_text = v.decrypt(cipher_text, split_word=True)
cipher_text, decrypted_text

('LVKTWVGVGNODTTQIFQQMUBUJGLEVMBKHZICZGLCSPHWEYVWTTWOQSESHXENJSGAXEJGWVXQALRSXCZRQSSWGIAIDJMXIPDDJIUMEAWFKFIGFAARKVTJLAWVQALHWGJVVVIWWWAVSUVMHNRWSFXKIYUFAZCKLMCOIXMEHOFRQBRKTWGVQIJZQLCVQQSLLGXHGZAGCBVTBGJJQTMRAQGVFNCFENLNYOARRIEYWUYNIEBVWRVPRNBHYVLNYOKIVKBSHSMPANQOJKGVHRPWVQNNYHJMDCGJGWBKAGNBYQGBUTRKMPKHWVAKJMEHCETWBVSUUSOXYJLAXAIAIZGAGZVSTGVOIGNCFXQVBNGWVCBVTKZMEPEJBVITAGMSHYDTVXVWHFIGFBWBVBBZGWPGAFYVAWRZBUCKENIVRGLSTMQZQWGQUCZHARIKBRDDIZQGDOFHUQTSODXQVBNGWVCBVTHZIUBNWHARIXBNBLMUBBFDHVQFVROLIVPRKIDPFQFYFAFWBVTBGJJQTMRAQGVFNCFENLNYOKIVEVYVSWGBBKZGAFQZJBKMQVNQASVIQAFZVMUBENPMXKWAXJAEQXGNAADKVTXQGVGNHSQLMQVNSRJIFCPNBYWGVFNHAZKBLNBOLKKULSFITIGNCFSHVBNGQGQVQNHASPIYIWKXTQOZHASPAJNHZHKNSJFWRVQNQDJMXIPDWKGQUCZHWHKVNXSLSHTBBRAQGVFNCFENAHGGHEEMFFBVXJMAYVWWCUCQSLYRTRXTJSOBUJBGMUGNUDJSZQZFHASPLVXHJMDCGNCFETMHXSVXQORSSJEVMNSRJINMNXSLLGALSHZIVQPIOLEUMGXCEIEZHHWSPUKVJBUIRZBGZWQUEBZZVFGQAASKXKONYSVFGTBBWUSPAGWIUXKVTFZGAMLRLFWIDILJGAEPVRYKGVMWIJFLLGPVLVVMOMAXWGRCTQFHSWGBINOWBRWAJBLMCTZJQZEPQFRWFHKNSJF

### File operations

In [4]:
v = Vigenere('infosec')
v.encrypt_file('original.txt', 'encrypted.txt')
v.decrypt_file('encrypted.txt', 'decrypted.txt', split_word=True)

---
## Part 2. Ciphertext-only attack
The attack is based on the [index of coincidence](https://en.wikipedia.org/wiki/Index_of_coincidence) technique; see the [example section](https://en.wikipedia.org/wiki/Index_of_coincidence#Example) of that article for details.
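
As a rough illustration of why the technique works, the sketch below computes a normalized index of coincidence, 26 · Σ nᵢ(nᵢ − 1) / (N(N − 1)), where nᵢ is the count of letter i and N is the text length. It also shows that shifting every letter by the same amount leaves the value unchanged. That invariance is what lets the attack test a candidate key length: with the right length, each column of the ciphertext is English shifted by a single letter, so its IC stays near the English value, while a wrong length gives columns that look closer to random (about 1.0). The function name `normalized_ic` is an assumption for this example, and unlike the notebook's `index_of_coincidence` it raises on inputs shorter than two letters.

```python
from collections import Counter


def normalized_ic(text: str) -> float:
    """Normalized index of coincidence of the A-Z letters in text (illustrative helper)."""
    letters = [c for c in text.upper() if 'A' <= c <= 'Z']
    n = len(letters)
    if n < 2:
        raise ValueError('need at least two letters')
    counts = Counter(letters)
    # number of matching letter pairs, divided by all pairs, scaled by the alphabet size
    return 26 * sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))


print(normalized_ic('HELLO'))  # 2.6
print(normalized_ic('KHOOR'))  # 2.6 ('HELLO' shifted by 3 has the same IC)
```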

In [5]:
from collections import namedtuple


def index_of_coincidence(text) -> float:
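    """Return the normalized IC (index of coincidence) of an uppercase A-Z sequence."""
    n = len(text)
    # with fewer than two letters there are no pairs to compare; 26 is the maximum
    # value, which keeps such groups far from IC_ENGLISH
    if n <= 1:
        return 26.0
    counts = [0] * 26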
    for c in text:
        counts[ord(c) - A] += 1
    return sum(c * (c - 1) for c in counts) / (n * (n - 1) / 26)


def group_with_length(text, n) -> list[list[str]]:
    """Split text into n groups; the i-th item goes into group i % n."""
    results = [[] for _ in range(n)]
    for i, c in enumerate(text):
        results[i % n].append(c)
    return results


KeyInfo = namedtuple('KeyInfo', 'length ic'.split())
Key = namedtuple('Key', 'key ic'.split())


def guess_key_length(text) -> list[KeyInfo]:
    """Compute the average IC for every key length in [1, MAX_KEY_LENGTH) and
    return the MAX_KEY_CANDIDATE lengths whose average is closest to IC_ENGLISH."""
    key_info = []
    for length in range(1, min(MAX_KEY_LENGTH, len(text))):
        substrings = group_with_length(text, length)
        average_ic = sum(index_of_coincidence(ss) for ss in substrings) / len(substrings)
        key_info.append(KeyInfo(length, average_ic))
    # use the named constant instead of a hard-coded 10
    return sorted(key_info, key=lambda x: abs(x.ic - IC_ENGLISH))[:MAX_KEY_CANDIDATE]


def correlation_of(text) -> float:
    """correlation between the text letter frequencies and the relative letter frequencies for normal English text"""
    n = len(text)
    counts = [0 for _ in range(26)]
    for c in text:
        counts[ord(c) - A] += 1
    return sum(counts[i] / n * FREQ_ENGLISH[i] for i in range(26))


def get_single_key(text) -> str:
    """Try every letter as the key for text and return the one with the highest correlation."""
    correlations, max_idx = [], 0
    for i in range(26):
        v = Vigenere(chr(A + i))
        correlations.append(correlation_of(v.decrypt(text)))
        if correlations[i] > correlations[max_idx]:
            max_idx = i
    return chr(A + max_idx)
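
Once the key length is fixed, each column is a plain Caesar cipher, so recovering one key letter reduces to finding the shift whose decryption best matches English letter frequencies. The standalone sketch below restates that idea in a few lines. The names `shift_back`, `correlation` and `best_shift` are assumptions made for this example, and the frequency table is the same Wikipedia table used earlier in the notebook.

```python
# English letter frequencies for A..Z (Wikipedia "Letter frequency" table)
FREQ = [0.08167, 0.01492, 0.02782, 0.04253, 0.12702, 0.02228,
        0.02015, 0.06094, 0.06966, 0.00153, 0.00772, 0.04025,
        0.02406, 0.06749, 0.07507, 0.01929, 0.00095, 0.05987,
        0.06327, 0.09056, 0.02758, 0.00978, 0.0236, 0.0015,
        0.01974, 0.00074]


def shift_back(text: str, key_index: int) -> str:
    """Undo a Caesar shift of key_index positions on an uppercase A-Z string."""
    return ''.join(chr(65 + (ord(c) - 65 - key_index) % 26) for c in text)


def correlation(text: str) -> float:
    """Dot product of the text's letter frequencies with the English frequencies."""
    n = len(text)
    return sum(text.count(chr(65 + i)) / n * FREQ[i] for i in range(26))


def best_shift(column: str) -> str:
    """Return the key letter whose decryption of column correlates best with English."""
    scores = [correlation(shift_back(column, k)) for k in range(26)]
    return chr(65 + scores.index(max(scores)))


# A column made only of 'O' is best explained as 'E' (the most frequent letter) shifted by 'K'.
print(best_shift('OOOOO'))  # K
```

On short or unusual columns the best-scoring shift can be wrong, which is consistent with the occasional stray letter in the longer candidate keys printed by the notebook.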

In [6]:
def crack_virginia(cipher_text, save_to=None) -> list[Key]:
    cipher_text, keys = alpha_only(cipher_text), []

    # for each candidate key length, recover the corresponding key
    for key_info in guess_key_length(cipher_text):
        substrings = group_with_length(cipher_text, key_info.length)
        key = Key(''.join(get_single_key(ss) for ss in substrings), key_info.ic)
        print(f'Key length {key_info.length}, IC = {key_info.ic:.3f}: {key.key}')
        keys.append(key)

    # drop keys that mostly repeat a shorter key (duplicated part above MAX_DUPLICATED_PART)
    copy, keep = sorted(keys, key=lambda k: len(k.key)), []
    while len(copy) > 1:
        drop = False
        for other in copy[:-1]:
            original = len(copy[-1].key)
            processed = len(copy[-1].key.replace(other.key, ''))
            if (1 - processed / original) > MAX_DUPLICATED_PART:
                drop = True
                break
        if not drop:
            keep.insert(0, copy[-1])
        copy.pop()
    keep.extend(copy)
    # keep the survivors in their original ranking order
    keys = [k for k in keys if k in keep]

    # save decoding results
    if save_to:
        with open(save_to, 'w') as save_file:
            for key in keys:
                save_file.write(f'Decrypt using {key}:\n')
                save_file.write(Vigenere(key.key).decrypt(cipher_text, split_word=True))
                save_file.write('\n\n')
            print(f'check file {save_to} for cracking results')
    # or print them
    else:
        for key in keys:
            print(f'Decrypt using {key}:')
            print(Vigenere(key.key).decrypt(cipher_text, split_word=True), end='\n\n')
    return keys

### Breaking the example cipher

In [7]:
cipher_text = Vigenere('infosec').encrypt(original_text)
crack_virginia(cipher_text)

Key length 7, IC = 1.711: INFOSEC
Key length 28, IC = 1.713: INFOSECINFOSECINFOSECINFOSEC
Key length 14, IC = 1.713: INFOSECINFOSEC
Key length 21, IC = 1.684: INFOSECINFOSECINFOSEC
Key length 35, IC = 1.676: INFOSECINFOSECINFOSECINFOSECINFOSEC
Key length 42, IC = 1.669: INFOSECINFOSECINFOSECINFOSECINFOSECINUOSEC
Key length 49, IC = 1.624: XNFOSECINFOSECINFOSECINFOSECINFOSECINFOSECINFOSEC
Key length 30, IC = 1.137: SCOOSSJSCEOANRSNNICWJOSNODPXNB
Key length 40, IC = 1.122: SCDNPSJJCRTROASOCIZQUSSESECDRXJHIIGOTOCN
Key length 29, IC = 1.119: WDOERNOSCCRUCNJASJCOCXIODIPIN
Decrypt using Key(key='INFOSEC', ic=1.711004490531436):
DIFFERENTIAL PRIVACY IS THE STATE OF THE ART GOAL FOR THE PROBLEM OF PRIVACY PRESERVING DATA RELEASE AND PRIVACY PRESERVING DATA MINING EXISTING TECHNIQUES USING DIFFERENTIAL PRIVACY HOWEVER CANNOT EFFECTIVELY HANDLE THE PUBLICATION OF HIGH DIMENSIONAL DATA IN PARTICULAR WHEN THE INPUT DATA SET CONTAINS A LARGE NUMBER OF ATTRIBUTES EXISTING METHODS INCUR HIGHER COMPUT

[Key(key='INFOSEC', ic=1.711004490531436),
 Key(key='SCOOSSJSCEOANRSNNICWJOSNODPXNB', ic=1.1365478686233401),
 Key(key='SCDNPSJJCRTROASOCIZQUSSESECDRXJHIIGOTOCN', ic=1.121788617886179),
 Key(key='WDOERNOSCCRUCNJASJCOCXIODIPIN', ic=1.1186843807533462)]
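
The returned list is shorter than the ten candidates printed above because keys that mostly repeat a shorter key are dropped. The sketch below isolates the ratio that `crack_virginia` compares against `MAX_DUPLICATED_PART` (0.7) and applies it to three of the candidates from this run. The helper name `duplicated_part` is an assumption for illustration; the notebook computes the same expression inline.

```python
def duplicated_part(key: str, other: str) -> float:
    """Fraction of key that disappears when every copy of other is removed."""
    return 1 - len(key.replace(other, '')) / len(key)


base = 'INFOSEC'
# exact repetition of the shorter key: everything is removed
print(duplicated_part('INFOSECINFOSEC', base))  # 1.0
# length-42 candidate with one wrong letter: 35 of 42 characters are removed (about 0.83)
print(duplicated_part('INFOSEC' * 5 + 'INUOSEC', base))
# unrelated candidate: nothing is removed
print(duplicated_part('SCOOSSJSCEOANRSNNICWJOSNODPXNB', base))  # 0.0
```

The first two ratios exceed 0.7, so those candidates are discarded, while the unrelated length-30 key is kept even though its IC is poor.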

### Breaking a cipher from a file

In [8]:
def crack_virginia_from_file(in_path, out_path):
    with open(in_path, 'r') as in_file:
        crack_virginia(in_file.read(), save_to=out_path)

crack_virginia_from_file('cipher_to_break.txt', 'breaking_results.txt')

Key length 16, IC = 1.730: MAVERICKMAVERICK
Key length 40, IC = 1.731: MAVERICKMAVERICKMAVIRICKMAVERICKMAVERICK
Key length 8, IC = 1.739: MAVERICK
Key length 24, IC = 1.740: MAVERICKMAVERICKMAVERICK
Key length 48, IC = 1.745: MAVERICKMAVERICKMAVERICKMAVERICKMAVERICKMAVERICK
Key length 32, IC = 1.760: MAVERICKMAVERICKMAVERICKBAVERICK
Key length 36, IC = 1.368: MICKRMVKRAVKREGUCICERNCKRWCEREVDRAGK
Key length 20, IC = 1.345: RAVKRICZRIVERICERIVK
Key length 28, IC = 1.343: RIWTMIVKRAVERIRERIRKRICKRAVK
Key length 12, IC = 1.341: RICKRIVERIVK
check file breaking_results.txt for cracking results
