Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Ruby scripts that solve simple substitution ciphers (example cryptoquotes) using shotgun hill-climbing and looking at tetragraphs
Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Type||Name||Latest commit message||Commit time|
|Failed to load latest commit information.|
For command line usage see the bottom of this readme. This archive contains a couple of ruby scripts. The main script will decode most mono-alphabetic simple substitution ciphers by using a shotgun-hill climbing algorithm. It works by selecting a random key and scoring the result, then making small adjustments to the key that improve the score (which is the hill climbing part). If it can't find any obvious adjustments that will improve the score, it will make a small random change to the key and then start hill climbing again. The algorithm will do this 10 times after the score plateaus, at which point if it can't come up with a better score, the program gives up (assuming that it has either reached a dead end or that it has the answer). Then the program will generate a new, completely random key and start the process over from the beginning, (this is the shotgun part of the algorithm). It's a little like trying to find the highest peak of a long polynomial expression where you can't plot the line ahead of time. The tactic here would be, pick a random point on the line, start climbing the hill until you can't climb any higher, then pick another random point on the line and start climbing again. If you have a curve with a small number of relatively uniformly distributed peaks, than this is a moderately efficient, albeit uncertain way of ascertaining the correct answer. This program has no way of knowing whether or not it has hit upon the correct answer, so it will keep on looking for better scores until the user hits CTRL-C to end the program. It's up to the user to decide whether the answer the program has come up with makes any sense or not. The scoring algorithm is really the heart of this program. Basically the program has a dictionary of tetra graph frequencies, that is, how often do these four letters appear next to one another in the English language. Using this number as a guide, it produces the score by taking the log of each of the tetra graph frequencies in the cipher text and adds them all together to produce a score. adding up the logs of the frequencies to produce the score rather than adding up the scores themselves appears to be important in making the program able to score plain text higher than gibberish. I didn't realize this the first time I wrote this program and didn't get very good results. I should note that I took somebody else's code as a guide in producing these ruby scripts. I've modified the strategy slightly from the original author's, and obviously I translated it from C to Ruby. I'm very grateful to the original author for sharing his C code with the world. I would have had a hard time getting this program working without using his program as a guide, In particular taking the logs of the scores would not have occurred to me. This script runs pretty slowly, owing mostly to the fact that I've written it as a ruby script. I've seen other implementations of this idea written in C which run orders of magnatude faster than this implementation does. I plan to retranslate this code into several languages to see how well they can run it. At any rate, included in this archive is the following 1. the substitution solver script which will over time, (hopefully) recover the plain text. 2. a sample tetra graph dictionary which was generated using "Harry Potter and the Goblet of Fire" as it's source txt, (yes I own a legal copy of the book) 3. a script which will generate a tetra graph dictionary from an ascii text file. 4. this readme. 5. an inspector script which will show the contents of the english.dic file in a readable format. This program is distributed under the GNU General Public License. If you would like to use this source code in your own programs, or rewrite it or whatever, just be sure to read the GNU GPL so that you know your rights and responsibilities. examples of usage to decrypt a cipher text in an ascii text file ruby substitution_solver.rb < cipher.txt to decrypt a message on the command line echo "KDUUB SRWWHU"| ruby substitution_solver.rb to decrypt a message by typing it in, run the script with no arguments as in ruby substitution_solver.rb then type the ciphertext, hit enter and then the key combo for the end of file character which is Ctrl-D on Linux & I think Ctrl-Z on windows. to generate an english.dic file using another source use ruby dictionary_builder.rb < novel.txt finally I'm including a script which will print out the english.dic file in a readable form. It's interesting to note that the frequencies increase exponentially as you go down the list. I suspect that this is why using the logarithm to compute the score is a necessity as it essentially makes the result linear or at least closer to being linear. example ruby dictionary_inspector.rb > output.txt or ruby dictionary_inspector.rb | less or just plain old ruby dictionary_inspector.rb