Moved to http://github.com/jhannah/mutationgrid

1 parent 511af06 commit 10073feef75d99a300032972940eda35209ffb5f @jhannah committed Jul 30, 2009
Showing with 0 additions and 159 deletions.
  1. +0 −94 erlang/README.md
  2. +0 −65 erlang/mutate.erl
@@ -1,94 +0,0 @@
-Overview
-========
-
-I'm working on a project which contains a couple million nucleic acid sequences,
-each 18 bases long. Each base is 'A', 'C', 'G', or 'T'. 18 bases in a row
-makes for 69 billion possible sequences. So my actual data (2M) is very
-sparse compared to the potential list (69B).
-
-I want to eliminate duplicates out of my 2M sequences. But by "duplicate" I
-don't mean exact match only. I mean any sequence that is one mutation (a
-single base changes from one letter to another letter) or two mutations distant.
-
-Mutation is quite a problem. Mutating 2 of 18 bases gives 1,377 distinct mutants
-(18 \* 3 \* 17 \* 3 / 2, since each pair of positions would otherwise be counted
-twice), plus 54 single mutants. So to de-dupe my 2M sequences I actually have to do
-about 2.9B searches through 2M sequences (roughly 5.7 quadrillion comparisons). That's a lot, even with good indexes.
-
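-And this problem gets worse as 18 bases grows to 19 or 20 or 30: the mutant count per sequence grows quadratically, and the space of possible sequences grows exponentially.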
-
-I want a parallel processing solution so the work doesn't take so long.
-Luckily there's a 2000 node super-computer I can use. That should help.
-
-The idea is to break my 2M sequences into an arbitrary number of shards.
-To talk through a small example, let's pretend my sequences are 4 bases
-long instead of 18.
-
- AAAA..TTTT is 256 sequences
-
-We split the search space across 4 nodes:
-
- AAAA..ATTT 64 sequences
- CAAA..CTTT 64 sequences
- GAAA..GTTT 64 sequences
- TAAA..TTTT 64 sequences
-
-Now each node is responsible for doing lookups and remembering statistics
-regarding a maximum of 64 sequences each. And if our data is very sparse
-then the actual number is far less.
-
-
-
-Node methods
-============
-
-hmm... how do we bootstrap each node? text file ssh'd to the node?
-
-**mutate(Sequence, NumberOfMutations)**
-
- Starting with Sequence (e.g. 'AAAA'), iterate over every possible Mutant caused by introducing
- NumberOfMutations (e.g. '1' or '2') mutations. For each local Mutant, if Mutant is a known Sequence
- then increment that Sequence's counter. For each remote Mutant, send increment_sequence()
- to the node responsible.
-
-**increment_sequence(Sequence)**
-
- Another node has discovered that one of our Sequences is a Mutant of one of their
- sequences. Increment our counter.
-
-**sequence_is_local(Sequence)**
-
- Returns true if Sequence is in the local shard. False otherwise.
-
-**node_for_sequence(Sequence)**
-
- Returns the node responsible for Sequence.
-
-**report()**
-
- Generate a report of all statistics that have been gathered for all of our Sequences.
-
-
-Status
-======
-
-Participants:
- Freenode #omaha.dev stesla jhannah dhoss
-
-jhannah's mutate function and a bunch of test stuff:
-
- http://github.com/jhannah/sandbox/tree/master/erlang
-
-stesla's plumbing:
-
- http://github.com/stesla/mg/tree/master
-
-
-
-Jay's misc notes
-================
-
-* http://erlang.org/download/getting_started-5.4.pdf
-* You need matching ~.erlang.cookie files.
-* http://mad.printf.net/MSCC_matching_instructions/matching.html
-
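The README's arithmetic and its first-base sharding example can be checked with a small sketch. The module name `mg_sketch` and every function signature in it are hypothetical and not taken from the repository; in particular, `node_for_sequence/2` and `sequence_is_local/3` take the node list and the local node name as extra arguments, which the README's one-argument descriptions do not specify. Counting unordered position pairs gives 153 \* 9 = 1,377 distinct two-mutation variants of an 18-base sequence, which is half of the ordered product 18 \* 3 \* 17 \* 3 = 2,754.

```erlang
%% mg_sketch.erl - hypothetical helper module, not part of the original repo.
-module(mg_sketch).
-export([space_size/1, mutant_count/2]).
-export([node_for_sequence/2, sequence_is_local/3]).

%% Number of possible sequences of Length bases over a 4-letter alphabet.
space_size(Length) ->
    pow(4, Length).

%% Number of distinct sequences exactly K substitutions away from a
%% sequence of Length bases: choose K positions, 3 alternatives at each.
mutant_count(Length, K) ->
    choose(Length, K) * pow(3, K).

%% ASSUMPTION: shard on the first base, as in the AAAA..ATTT / CAAA..CTTT
%% example. Nodes is a 4-element list ordered A, C, G, T.
node_for_sequence([First | _], Nodes) ->
    lists:nth(base_index(First), Nodes).

%% True if Sequence belongs to the shard owned by Self.
sequence_is_local(Sequence, Nodes, Self) ->
    node_for_sequence(Sequence, Nodes) =:= Self.

base_index($A) -> 1;
base_index($C) -> 2;
base_index($G) -> 3;
base_index($T) -> 4.

%% Integer exponentiation for small non-negative exponents.
pow(_Base, 0) -> 1;
pow(Base, N) when N > 0 -> Base * pow(Base, N - 1).

%% Binomial coefficient via C(N,K) = C(N-1,K-1) * N / K.
choose(_N, 0) -> 1;
choose(N, K) when K > 0, K =< N -> choose(N - 1, K - 1) * N div K.
```

With this sketch, `mg_sketch:mutant_count(18, 1) + mg_sketch:mutant_count(18, 2)` gives the number of lookups per sequence, and `mg_sketch:node_for_sequence("CATT", Nodes)` picks the second element of `Nodes`. Sharding on one base only supports 4 nodes; using more leading bases would be one way to spread the work further.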
@@ -1,65 +0,0 @@
-%% mutate.erl - mutate a sequence of bases in a single place
-%% Author: ndim on Freenode (#erlang)
-%% Written with and for deafferret on freenode/#erlang.
-%% http://github.com/ndim/erlang_stuff/blob/master/src/deafferret.erl
-%% http://github.com/jhannah/sandbox/blob/master/erlang/mutate.erl
-
--module(mutate).
-
--export([start/0]).
--export([mutate/1, mutate/2]).
-
--define(DEFAULT_ALPHABET, "ACGT").
-
-
-%% BUG: If the OrigSequence contains elements not in Alphabet -> wrong result.
-
-%% FIXME: We are using ++ in two places, which mostly is not a good idea.
-
-%% NOTE: We return the results in a very strange kind of "ordering".
-
-%% NOTE: A simple test case of "AAAA" lets you miss permuted characters.
-
-
-%% Author's comments:
-%%
-%% I like the result list construction in mutate() now: Create a list
-%% of lists of strings and then flattening that to a list of strings
-%% at the end.
-%%
-%% I do not like the Prefix++[Base] expression yet, at all. I have no
-%% idea what to do about it (yet), though.
-%%
-%% I have some doubts about the Prefix++[X|Suffix] expression as
-%% well. I have no idea what to do about it (yet), though.
-
-
-%% Verbose version. Comment out the io:format stuff for a mute version.
-mutate([], _Alphabet, _Prefix, Acc) ->
- lists:append(Acc);
-mutate([Base|Suffix]=_OrigSequence, Alphabet, Prefix, Acc) ->
- io:format("mutate(~p,~p,~p,~p)~n", [_OrigSequence,Alphabet,Prefix,Acc]),
- NewMutations = [Prefix++[X|Suffix] || X<-Alphabet, X=/=Base],
- io:format(" new: ~p~n", [NewMutations]),
- mutate(Suffix, Alphabet, Prefix++[Base], [NewMutations|Acc]).
-
-mutate(OrigSequence, Alphabet) ->
- mutate(OrigSequence, Alphabet, [], []).
-
-mutate(OrigSequence) ->
- mutate(OrigSequence, ?DEFAULT_ALPHABET).
-
-
-%% Run example.
-example(OrigSequence) ->
- Mutations = mutate(OrigSequence),
- io:format("Mutations of orig sequence ~p for default alphabet ~p:~n"
- " ~p~n",
- [OrigSequence, ?DEFAULT_ALPHABET,
- Mutations]).
-
-%% Run a few examples.
-start() ->
- [ example(X) || X <- ["CATTAG", "AAAA"] ].
-
-
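The deleted module mutates a single position, while the README asks for sequences one or two mutations away. Below is a minimal sketch of one way to extend it, under stated assumptions: the module name `mutate2` and the functions `mutate1/1` and `within/2` are hypothetical, the alphabet is fixed to "ACGT", and the input is assumed to contain only those four bases. It keeps the prefix reversed and rebuilds each mutant with `lists:reverse/2`, which is one possible way around the `Prefix++[Base]` concern raised in the author's comments, not necessarily the one the author would have chosen.

```erlang
%% mutate2.erl - hypothetical companion sketch, not part of the original repo.
-module(mutate2).
-export([mutate1/1, within/2]).

-define(ALPHABET, "ACGT").

%% All sequences exactly one substitution away from Seq, in position order.
mutate1(Seq) ->
    mutate1(Seq, [], []).

mutate1([], _RevPrefix, Acc) ->
    lists:append(lists:reverse(Acc));
mutate1([Base | Suffix], RevPrefix, Acc) ->
    %% lists:reverse(RevPrefix, Tail) restores the prefix in front of Tail.
    New = [lists:reverse(RevPrefix, [X | Suffix]) || X <- ?ALPHABET, X =/= Base],
    mutate1(Suffix, [Base | RevPrefix], [New | Acc]).

%% All distinct sequences 1..N substitutions away from Seq, sorted.
within(Seq, N) when is_integer(N), N >= 0 ->
    within([Seq], N, [Seq]) -- [Seq].

within(_Frontier, 0, Seen) ->
    Seen;
within(Frontier, N, Seen) ->
    %% Mutate every sequence reached so far once more, then de-duplicate.
    Next = lists:usort(lists:flatmap(fun mutate1/1, Frontier)),
    within(Next, N - 1, lists:umerge(Next, Seen)).
```

For an 18-base input, `length(mutate2:within(Seq, 2))` should come to 54 + 1,377 = 1,431. The sketch trades memory for simplicity by materialising every mutant; a node that only needs to test membership in its shard could check each mutant as it is generated instead.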
