### Caesar Cipher

The Caesar cipher is named after the Roman general Gaius Julius Caesar. It is probably the simplest cipher technique, one you may have played around with as a kid without knowing its name. You substitute each letter with the letter a fixed number of positions down the alphabet, wrapping around to the start after z. For instance, when the shift is 2, a->c, b->d, c->e, and "je sais pas"->"lg ucku rcu". The cipher can be easily cracked by letter frequency analysis as long as you know what the underlying language is.
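The substitution rule can be written with modular arithmetic, so that letters near the end of the alphabet wrap around to the beginning. The snippet below is only a minimal sketch of that idea; the helper name `shift_letters` is made up for illustration and is not one of the notebook's functions.

```python
import string

def shift_letters(text, shift):
    """Shift every lower-case letter by `shift` positions, wrapping with modulo 26."""
    alphabet = string.ascii_lowercase
    shifted = []
    for char in text:
        if char in alphabet:
            # modulo 26 keeps the index inside the alphabet, so 'y' shifted by 2 becomes 'a'
            shifted.append(alphabet[(alphabet.index(char) + shift) % 26])
        else:
            # spaces, digits and punctuation are left untouched
            shifted.append(char)
    return ''.join(shifted)

print(shift_letters('je sais pas', 2))  # lg ucku rcu
```

Because the alphabet has 26 letters, the wrap has to use 26; any other constant would send two different letters to the same ciphertext letter.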

In [1]:
import re

In [2]:
global english_lower

In [3]:
#letter mapping
english_lower='abcdefghijklmnopqrstuvwxyz'

### Functions

In [4]:
#to make the ciphertext less revealing
#we can use lower case,remove spaces and punctuation via regex
#then group the characters into blocks of a fixed size
def break_into_blocks(ciphertext,bandwidth=6):

    #keep only word characters (letters,digits,underscore) in lower case
    tough_ciphertext=[x.lower() for x in re.findall(r'\w',ciphertext)]
    
    #break into blocks
    new_ciphertext=[''.join(tough_ciphertext[i:i+bandwidth]) for i in range(0,len(tough_ciphertext),bandwidth)]

    #fill up the last block with padding
    #the guard avoids an IndexError when no word character is found
    if new_ciphertext and len(new_ciphertext[-1])<bandwidth:
        new_ciphertext[-1]+='a'*(bandwidth-len(new_ciphertext[-1]))
        
    #create output
    ultimate_ciphertext=' '.join(new_ciphertext)
    
    return ultimate_ciphertext

In [5]:
#encryption
def caesar_cipher_encrypt(plaintext,shift=1):   
    
    assert shift<26 and shift>=0,"shift should be between 0 and 25"
    
    #convert text to list
    plaintext_list=list(plaintext)

    #first,map alphabets to numbers
    #next,shift number
    #finally,map numbers to alphabets
    #letters come out in lower case,other characters are left unchanged
    for i in range(len(plaintext_list)):
        if plaintext_list[i].lower() in english_lower:        
            code=english_lower.index(plaintext_list[i].lower())+shift
            #wrap around the 26-letter alphabet
            if code>25:
                code-=26
            plaintext_list[i]=english_lower[code]

    return ''.join(plaintext_list)

In [6]:
#decryption
def caesar_cipher_decrypt(ciphertext,shift=1):
    
    assert shift<26 and shift>=0,"shift should be between 0 and 25"
    
    #convert text to list
    ciphertext_list=list(ciphertext)
   
    #first,map alphabets to numbers
    #next,shift number back
    #finally,map numbers to alphabets
    for i in range(len(ciphertext_list)):
        if ciphertext_list[i].lower() in english_lower:        
            code=english_lower.index(ciphertext_list[i].lower())-shift
            #wrap around the 26-letter alphabet
            if code<0:
                code+=26
            ciphertext_list[i]=english_lower[code]

    return ''.join(ciphertext_list)
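Encryption and decryption should be exact inverses for every shift, including the letters that wrap past z. One way to convince yourself is a round-trip check. The sketch below does not call the notebook's functions; it uses hypothetical helpers `make_table` and `rotate` built on Python's `str.maketrans` and `str.translate`, which is a swapped-in technique rather than the approach used above.

```python
import string

ALPHABET = string.ascii_lowercase

def make_table(shift):
    """Build a translation table that rotates the lower-case alphabet by `shift`."""
    rotated = ALPHABET[shift % 26:] + ALPHABET[:shift % 26]
    return str.maketrans(ALPHABET, rotated)

def rotate(text, shift):
    # lower-case first, mirroring the functions above which always output lower case
    return text.lower().translate(make_table(shift))

message = 'lazy zebra, crazy quiz'
for shift in range(26):
    # shifting forward and then backward by the same amount must return the original text
    assert rotate(rotate(message, shift), -shift) == message

print(rotate(message, 24))  # jyxw xczpy, apyxw osgx
```

A test message full of a, y and z is deliberate: those are the letters where a wrong wrap-around constant shows up first.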

### Run

In [7]:
plaintext="""Once arrived at the office building, she needs to take the elevator/lift to her floor. On average, the capacity for a ThyssenKrupp is about 8 to 12 people. But hey, people working in Canary Wharf can’t get squeezed like canned sardines. Call them snobbish if you want. From my experience, more than five people is called crowded. In and out of office plus twice for lunchtime, Jane Doe has contacted 20 people in one confined space alone. When she is in the office, an open-plan office (everybody hates it) is a nightmare to keep social distancing. A normal day at work should leave at least 30 people at exposure assuming she doesn’t take up a client facing role. We are talking about meaningless conferences, awful hotdesking and tiny cubicles. Working from home doesn’t seem so bad now. At lunch time, she goes to Tesco in One Canada Square for 3£ meal deal (this is not the continent, no 3-course meal plus Tuscany wine for your average lunch). It’s easy to contact 5 people in the fridge area since everyone is looking for the perfect chicken BLT. Lining up for the queue gets at least two people exposed, the one in front of you and the one after you. Usually there is an assistant to direct you to the right counter. Say Jane Does pays at the machines, the people to her left and right are also within the radius of two meters. Alternatively she can go to Vietnamese food truck for pho and spring rolls. Either way, the queue at lunch hour can contribute at least 10 contacts. Besides, she goes to Waitrose after work to purchase ingredients for the dinner. She doesn’t eat out often because of the crazy expense of the rent and monthly zone 2 pass. Using the same logic above, she has contacted 20 people at different supermarkets every day. In addition, she goes to crossfit or yoga class after dinner because of the peer pressure from colleagues. Obesity is a sign of laziness in Canary Wharf. Eight or nine PM is a peak hour at gym so another 10 contacts can be made. 
Before a day passes, Jane Doe has 3+15+5+30+5+10+5+5+10+15+10 contacts. We take a rounding to 120 because there can be some miscellaneous contacts. For instance, her team may maintain a tradition of going to pubs for happy hour every day or she has a date night with her husband at Covent Garden every Thursday."""
shift_num=24
print(plaintext)

Once arrived at the office building, she needs to take the elevator/lift to her floor. On average, the capacity for a ThyssenKrupp is about 8 to 12 people. But hey, people working in Canary Wharf can’t get squeezed like canned sardines. Call them snobbish if you want. From my experience, more than five people is called crowded. In and out of office plus twice for lunchtime, Jane Doe has contacted 20 people in one confined space alone. When she is in the office, an open-plan office (everybody hates it) is a nightmare to keep social distancing. A normal day at work should leave at least 30 people at exposure assuming she doesn’t take up a client facing role. We are talking about meaningless conferences, awful hotdesking and tiny cubicles. Working from home doesn’t seem so bad now. At lunch time, she goes to Tesco in One Canada Square for 3£ meal deal (this is not the continent, no 3-course meal plus Tuscany wine for your average lunch). It’s easy to contact 5 people in the fridge area si

In [8]:
#encrypt
ciphertext=caesar_cipher_encrypt(plaintext,shift=shift_num)

In [9]:
#only capture words
#break into blocks
ultimate_ciphertext=break_into_blocks(ciphertext,bandwidth=6)
print(ultimate_ciphertext)

mlacyp pgtcby rrfcmd dgaczs gjbgle qfclcc bqrmry icrfcc jctyrm pjgdrr mfcpdj mmpmly tcpyec rfcayn yagrwd mpyrfw qqclip snngqy zmsr8r m12ncm njczsr fcwncm njcump iglegl aylypw ufypda ylrecr qosccx cbjgic ayllcb qypbgl cqayjj rfckql mzzgqf gdwmsu ylrdpm kkwcvn cpgcla ckmpcr fyldgt cncmnj cgqayj jcbapm ubcbgl ylbmsr mdmddg acnjsq rugacd mpjsla frgkch ylcbmc fyqaml ryarcb 20ncmn jcglml camldg lcbqny acyjml cufclq fcgqgl rfcmdd gacylm nclnjy lmddga cctcpw zmbwfy rcqgrg qylgef rkypcr miccnq magyjb gqryla gleylm pkyjby wyrump iqfmsj bjcytc yrjcyq r30ncm njcyrc vnmqsp cyqqsk gleqfc bmcqlr ryicsn yajgcl rdyagl epmjcu cypcry jigley zmsrkc ylglej cqqaml dcpcla cqyuds jfmrbc qigley lbrglw aszgaj cqumpi gledpm kfmkcb mcqlrq cckqmz yblmuy rjslaf rgkcqf cemcqr mrcqam glmlca ylybyq osypcd mp3kcy jbcyjr fgqgql mrrfca mlrglc lrlm3a mspqck cyjnjs qrsqay lwuglc dmpwms pytcpy ecjsla fgrqcy qwrmam lryar5 ncmnjc glrfcd pgbecy pcyqgl acctcp wmlcgq jmmigl edmprf cncpdc arafga iclzjr jglgle sndmpr fcoscs cecrqy

In [10]:
#decrypt
caesar_cipher_decrypt(ultimate_ciphertext,shift=shift_num)

'oncear riveda ttheof ficebu ilding shenee dstota kethee levato rliftt oherfl oorona verage thecap acityf orathy ssenkr uppisa bout8t o12peo plebut heypeo plewor kingin canary wharfc antget squeez edlike canned sardin escall themsn obbish ifyouw antfro mmyexp erienc emoret hanfiv epeopl eiscal ledcro wdedin andout ofoffi ceplus twicef orlunc htimej anedoe hascon tacted 20peop leinon econfi nedspa cealon ewhens heisin theoff iceano penpla noffic eevery bodyha tesiti sanigh tmaret okeeps ociald istanc ingano rmalda yatwor kshoul dleave atleas t30peo pleate xposur eassum ingshe doesnt takeup aclien tfacin grolew eareta lkinga boutme aningl esscon ferenc esawfu lhotde skinga ndtiny cubicl eswork ingfro mhomed oesnts eemsob adnowa tlunch timesh egoest otesco inonec anadas quaref or3mea ldealt hisisn otthec ontine ntno3c oursem ealplu stusca nywine foryou ravera gelunc hitsea sytoco ntact5 people inthef ridgea reasin ceever yoneis lookin gforth eperfe ctchic kenblt lining upfort hequeu egets

### Letter Frequency Analysis

The key to a Caesar cipher is the shift number. How can we determine the shift? Natural languages show various statistical regularities, which is also why deep learning can predict the next likely word. The same regularities let us attack the cipher with letter frequency analysis. In our case, we only consider the 26 English letters. The frequency analysis can be done on single letters or on groups of letters (N-gram model). The result is never guaranteed, though: depending on the sample size, the most frequent letter in a given ciphertext may not be the most frequent letter across English literature. Since a Caesar cipher has only 25 non-trivial shifts, you can always fall back on brute force and try every possibility.
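The brute-force route can be sketched in a few lines: undo every possible shift and keep the candidate that looks most like English. The helpers `shift_text`, `score` and `brute_force` below are hypothetical names, and the scoring rule (counting the letters e, t, a, o, i, n) is a crude assumption chosen for brevity, not a method taken from this notebook.

```python
import string

ALPHABET = string.ascii_lowercase
COMMON = set('etaoin')  # assumption: six very common English letters

def shift_text(text, shift):
    """Rotate lower-case letters by `shift` (negative values rotate backwards)."""
    return ''.join(
        ALPHABET[(ALPHABET.index(c) + shift) % 26] if c in ALPHABET else c
        for c in text
    )

def score(text):
    # crude fitness: number of characters that are very common English letters
    return sum(1 for c in text if c in COMMON)

def brute_force(ciphertext):
    """Undo every possible shift and keep the candidate with the highest score."""
    candidates = [(shift, shift_text(ciphertext, -shift)) for shift in range(26)]
    return max(candidates, key=lambda pair: score(pair[1]))

message = 'meet me at the station at ten tonight and take the note to tina'
secret = shift_text(message, 24)
best_shift, recovered = brute_force(secret)
print(best_shift, recovered)
```

On very short or unusual texts such a simple score can pick the wrong candidate, so it is worth eyeballing the top few candidates rather than trusting the maximum blindly.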

In [11]:
#english letter frequency from the book Cryptanalysis by Helen Fouché Gaines
meaker=['e','t','a','o','n','i','s','r','h','l','d','c','u','p','f','m','w','y','b','g','v','k','q','x','j','z']

#english letter frequency from the book Making,Breaking Codes by Paul Garrett
garrett=['e','t','o','i','a','n','s','r','h','l','d','u','c','m','p','y','f','g','w','b','v','k','x','j','q','z']

#french and spanish letter frequency from the book Cryptogram Solving by M. E. Ohaver
french=['e','a','i','s','t','n','r','u','l','o','d','m','p','c','v','q','g','b','f','j','h','z','x','y','k','w']
spanish=['e','a','o','s','n','i','r','l','d','u','c','t','m','p','b','h','q','g','v','y','j','f','z','x','k','w']

#english digrams from the book Cryptanalysis by Helen Fouché Gaines
digrams=['th','he','an','in','er','re']

#english trigrams from the book Cryptanalysis by Helen Fouché Gaines
trigrams=['the','ing','tha','and','ion']


In [12]:
#compute n gram frequency in ciphertext
#compare with empirical plaintext result
def compute_letter_freq(ciphertext,benchmark,N):

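    #slide a window of N characters over each whitespace-separated word
    #note that len(i)>N skips words of exactly N characters,e.g. a standalone "the" when N=3
    total=[i[j:j+N] for i in ciphertext.split() for j in range(len(i)-N+1) if len(i)>N]

    #count each n-gram and sort by frequency in descending order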
    D={}
    for i in set(total):
        D[i]=total.count(i)
    D=dict(sorted(D.items(),key=lambda x:x[1],reverse=True))
    
    #output
    print(f'{N}-gram Frequency Analysis')
    print('Most frequent in cyphertext is',list(D.keys())[0])
    print('Most frequent in plaintext is',benchmark[0])

In [13]:
#based on single letter freq analysis
#mapping e to c is shift 24
compute_letter_freq(ciphertext,meaker,1)
potential_shift=english_lower.index('c')-english_lower.index('e')
if potential_shift<0:
    potential_shift+=26
print('The shift should be',potential_shift)

1-gram Frequency Analysis
Most frequent in cyphertext is c
Most frequent in plaintext is e
The shift should be 24
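The same reasoning can be automated instead of hard-coding the most frequent letter. The function below is a minimal sketch under the assumption that the most common ciphertext letter stands for e; `guess_shift` is a hypothetical helper, and it relies on `collections.Counter` rather than the counting loop used in this notebook.

```python
from collections import Counter
import string

def guess_shift(ciphertext, expected_top='e'):
    """Guess the shift by aligning the most common ciphertext letter with `expected_top`."""
    # ignore spaces, digits and punctuation; raises IndexError if no letter is present
    letters = [c for c in ciphertext.lower() if c in string.ascii_lowercase]
    top_letter, _ = Counter(letters).most_common(1)[0]
    # modulo 26 folds negative differences back into the range 0..25
    return (ord(top_letter) - ord(expected_top)) % 26

# 'the tree sees the sea' encrypted with a shift of 24
print(guess_shift('rfc rpcc qccq rfc qcy'))  # 24
```

If the plaintext is short or atypical, e may not be its most common letter, in which case the next letters of the benchmark ordering are worth trying as `expected_top`.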


In [14]:
compute_letter_freq(ciphertext,digrams,2)
print('After decryption, it should be "',
      caesar_cipher_decrypt('fc',potential_shift),'"')
print('Whereas the most frequent digrams are',digrams)

2-gram Frequency Analysis
Most frequent in cyphertext is fc
Most frequent in plaintext is th
After decryption, it should be " he "
Whereas the most frequent digrams are ['th', 'he', 'an', 'in', 'er', 're']


In [15]:
compute_letter_freq(ciphertext,trigrams,3)
print('After decryption, it should be "',
      caesar_cipher_decrypt('gle',potential_shift),'"')
print('Whereas the most frequent trigrams are',trigrams)

3-gram Frequency Analysis
Most frequent in cyphertext is gle
Most frequent in plaintext is the
After decryption, it should be " ing "
Whereas the most frequent trigrams are ['the', 'ing', 'tha', 'and', 'ion']
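For longer texts, counting N-grams with `collections.Counter` avoids rescanning the whole list for every distinct N-gram. The sketch below is an alternative under stated assumptions, not the notebook's `compute_letter_freq`: the name `ngram_counts` is hypothetical, and unlike the function above it also counts words that are exactly N characters long.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count n-grams inside each whitespace-separated word using a sliding window."""
    counts = Counter()
    for word in text.lower().split():
        # a word of length n yields exactly one n-gram; shorter words yield none
        for start in range(len(word) - n + 1):
            counts[word[start:start + n]] += 1
    return counts

sample = 'the other brother gathers the feathers'
print(ngram_counts(sample, 3).most_common(1))  # [('the', 6)]
```

Whether standalone words such as "the" should count is a design choice; including them tends to push the top trigram of English text towards "the", which matches the benchmark list more closely.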
