Ruby scripts that solve simple substitution ciphers (example cryptoquotes) using shotgun hill-climbing and looking at tetragraphs
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
README
dictionary_builder.rb
dictionary_inspector.rb
english.dic
substitution_solver.rb

README

For command line usage see the bottom of this readme.

This archive contains a couple of ruby scripts.  The main script 
will decode most mono-alphabetic simple substitution ciphers 
by using a shotgun-hill climbing algorithm.  It works by
selecting a random key and scoring the result, then making small
adjustments to the key that improve the score (which is the
hill climbing part).  If it can't find any obvious adjustments 
that will improve the score, it will make a small 
random change to the key and then start hill climbing again.
The algorithm will do this 10 times after the score plateaus, at
which point if it can't come up with a better score, the 
program gives up (assuming that it has either reached 
a dead end or that it has the answer).  Then the program
will generate a new, completely random key and start the process
over from the beginning, (this is the shotgun part of the
algorithm).  It's a little like trying to find the highest peak
of a long polynomial expression where you can't plot the line
ahead of time.  The tactic here would be, pick a random point
on the line, start climbing the hill until you can't climb any
higher, then pick another random point on the line and start
climbing again.  If you have a curve with a small number of 
relatively uniformly distributed peaks, than this
is a moderately efficient, albeit uncertain way of
ascertaining the correct answer.

This program has no way of knowing whether or not it has hit
upon the correct answer, so it will keep on looking for better
scores until the user hits CTRL-C to end the program.  It's
up to the user to decide whether the answer the program has
come up with makes any sense or not.  The scoring algorithm
is really the heart of this program.  Basically the program
has a dictionary of tetra graph frequencies, that is, how
often do these four letters appear next to one another in the
English language.  Using this number as a guide, it produces
the score by taking the log of each of the tetra graph 
frequencies in the cipher text and adds them all together to 
produce a score.  adding up the logs of the frequencies to 
produce the score rather than adding up the scores themselves 
appears to be important in making the program able to score 
plain text higher than gibberish.  I didn't realize this the
first time I wrote this program and didn't get very good
results.  I should note that I took somebody else's code as
a guide in producing these ruby scripts.  I've modified the
strategy slightly from the original author's, and obviously
I translated it from C to Ruby.  I'm very grateful to the
original author for sharing his C code with the world.  I
would have had a hard time getting this program working
without using his program as a guide, In particular taking
the logs of the scores would not have occurred to me.

This script runs pretty slowly, owing mostly to the fact that
I've written it as a ruby script.  I've seen other implementations
of this idea written in C which run orders of magnatude faster
than this implementation does.  I plan to retranslate this code
into several languages to see how well they can run it.

At any rate, included in this archive is the following
1. the substitution solver script which will over time, 
(hopefully) recover the plain text.
2. a sample tetra graph dictionary which was generated using 
"Harry Potter and the Goblet of Fire" as it's source txt, 
(yes I own a legal copy of the book)
3. a script which will generate a tetra graph dictionary from 
an ascii text file.
4. this readme.
5. an inspector script which will show the contents of the
english.dic file in a readable format.

This program is distributed under the GNU General Public
License.  If you would like to use this source code in
your own programs, or rewrite it or whatever, just be
sure to read the GNU GPL so that you know your rights
and responsibilities.

examples of usage

to decrypt a cipher text in an ascii text file
	ruby substitution_solver.rb < cipher.txt

to decrypt a message on the command line
	echo "KDUUB SRWWHU"| ruby substitution_solver.rb

to decrypt a message by typing it in, run the script with no
arguments as in
	ruby substitution_solver.rb
then type the ciphertext, hit enter and then the key combo 
for the end of file character which is Ctrl-D on Linux & I 
think Ctrl-Z on windows.

to generate an english.dic file using another source use
	ruby dictionary_builder.rb < novel.txt

finally I'm including a script which will print out the english.dic 
file in a readable form.  It's interesting to note that the 
frequencies increase exponentially as you go down the list.  
I suspect that this is why using the logarithm to compute the score 
is a necessity as it essentially makes the result linear or at 
least closer to being linear.

example
	ruby dictionary_inspector.rb > output.txt
or
	ruby dictionary_inspector.rb | less
or just plain old
	ruby dictionary_inspector.rb