Skip to content
Browse files

Partial commit for 6.878 PS1 to make sure we don't lose it.

  • Loading branch information...
1 parent a09ea13 commit 123b070a2834b3de3a44d6aa179b2bb9e7ccc9ed @pwnall committed Sep 22, 2009
View
6 .project
@@ -6,12 +6,18 @@
</projects>
<buildSpec>
<buildCommand>
+ <name>org.rubypeople.rdt.core.rubybuilder</name>
+ <arguments>
+ </arguments>
+ </buildCommand>
+ <buildCommand>
<name>net.sourceforge.texlipse.builder.TexlipseBuilder</name>
<arguments>
</arguments>
</buildCommand>
</buildSpec>
<natures>
<nature>net.sourceforge.texlipse.builder.TexlipseNature</nature>
+ <nature>org.rubypeople.rdt.core.rubynature</nature>
</natures>
</projectDescription>
View
2 README.textile
@@ -18,4 +18,4 @@ h2. Academic Honesty
If you're a student in one of my classes, I expect that you won't look at my
solutions before the deadlines for the problem sets. If you use my work for
-other problem sets, I expect that you will acknowledge my contribution.
+other problem sets, I expect that you will acknowledge this source.
View
2 src/6.878/metadata.tex
@@ -0,0 +1,2 @@
+\newcommand{\PsetClassNumber}{6.878}
+\newcommand{\PsetClassTerm}{Fall 2009}
View
179 src/6.878/ps1/all.tex
@@ -0,0 +1,179 @@
+\section{Problem 1}
+
+\subsection{Part A}
+Listing \ref{problem1:code} shows my Ruby implementation of the Needleman-Wunsch
+algorithm.
+
+\lstinputlisting[language=Ruby, caption=Ruby implementation of
+Needleman-Wunsch sequence alignment,
+label=problem1:code]{6.878/ps1/code/ps1-seqalign.rb}
+
+The implementation loosely follows the provided Python implementation, with the
+major difference that it reads the scoring matrix from a file. Listing
+\ref{problem1:a_scores} shows the scoring file translated from the provided
+Python implementation.
+
+\lstinputlisting[float=hbp, language=C, caption=The scoring matrix translated
+from the Python code, label=problem1:a_scores]{6.878/ps1/code/a_scores.txt}
+
+Figure \ref{problem1:solution} shows an optimal alignment of AGGTGAT with
+AGTAA. Figures \ref{problem1:scores} and \ref{problem1:parents} show the
+alignment scores and the information needed to reconstruct the optimal
+alignment.
+
+\begin{figure}[hbp]
+\center \input{6.878/ps1/problem1_alignment.tex}
+\caption{An optimal alignment of AGGTGAT with AGTAA}
+\label{problem1:solution}
+\end{figure}
+
+\begin{figure}[hbp]
+\center \input{6.878/ps1/problem1_scores.tex}
+\caption{Alignment scores for AGGTGAT vs AGTAA}
+\label{problem1:scores}
+\end{figure}
+
+\begin{figure}[hbp]
+\center \input{6.878/ps1/problem1_parents.tex}
+\caption{Solution reconstruction data for AGGTGAT vs AGTAA}
+\label{problem1:parents}
+\end{figure}
+
+\subsection{Part B}
+The Ruby implementation does not need to be modified. It is sufficient to
+change the scoring matrix to reflect the Hamilton distance between genes.
+Listing \ref{problem1:b_scores} shows the changed score file. The resulting
+score's absolute magnitude reflects the number of base mutations (insertions,
+deletions, substitutions) between the two genes.
+
+\lstinputlisting[float=hbp, language=C, caption=The scoring matrix used to
+compute the distance between two genes,
+label=problem1:b_scores]{6.878/ps1/code/b_scores.txt}
+
+\subsection{Part C}
+Running the program as described above yields a score of -111.
+
+\subsection{Part D}
+Aligning the human versions of HoxA13 and HoxD13 yields a score of -467.
+Aligning the mouse version of the same genes yields a score of -471. I use the
+mean score.
+
+Aligning the human version of HoxD13 with the mouse version of HoxD13 yields a
+score of -154. The big difference between this and the score obtained in part B
+is slightly disheartening, and I'll definitely use the mean score. The
+difference probably suggests that the scoring matrix is not very good.
+
+Assuming that that mammalian genomes undergo a constant rate of mutations over
+time, we find that the HoxA13 and HoxD13 genes have probably diverged
+
+$$
+\frac{467 + 471}{2} \cdot \frac{2}{111 + 154} \cdot {70 \textrm{ million years}}
+\approx 248 \textrm{ million years}
+$$
+
+Note that, for simplification, the scoring matrix in listing
+\ref{problem1:b_scores} assumes that all base mutations (insertions, deletions,
+substitutions) are equally probable. This is known not to be the case.
+
+
+\section{Problem 2}
+
+The recurrence is as follows:
+
+$$
+F(i, j, k) = \max \left\{
+\begin{array}{l}
+F(i - 1, j, k) + s(i, \cdot, \cdot) \\
+F(i, j - 1, k) + s(\cdot, j, \cdot) \\
+F(i, j, k - 1) + s(\cdot, \cdot, k) \\
+F(i - 1, j - 1, k) + s(i, j, \cdot) \\
+F(i, j - 1, k - 1) + s(\cdot, j, k) \\
+F(i - 1, j, k - 1) + s(i, \cdot, k) \\
+F(i - 1, j - 1, k - 1) + s(i, j, k)
+\end{array}
+\right.
+$$
+
+The initial conditions are $F(0, 0, 0) = 0$, $F(i, j, k) = -\infty$ if $i <
+0$ or $j < 0$ or $k < 0$, and $s(i, j, k) = -\infty$ if $i \cdot j \cdot k = 0$.
+
+\section{Problem 3}
+
+\subsection{Part A}
+
+The program's output is reproduced in figure \ref{problem3:exact30}.
+
+\begin{figure}[htb]
+ \includegraphics[width=6.8in]{6.878/ps1/figs/p3_exact30.png}
+ \caption{Dotplot for all exact matching 30-mers.}
+ \label{problem3:exact30}.
+\end{figure}
+
+\subsection{Part B}
+
+The key observation for my solution is that all the required modifications are
+still exact matching problems, but the subsequence of bases that have to match
+isn't continuous. So, it is sufficient to modify the code to hash the
+subsequences that have to match. Listing \ref{problem3:code} shows the
+code changes needed to produce plots modifications i, ii, iii, and iv.
+
+\begin{lstlisting}[language=Python, caption=The scoring matrix used to
+compute the distance between two genes, label=problem3:code]
+ # length of hash key
+ kmerlen = 30
+ # stride of hash key # ADDED
+ kmerstep = 1 # ADDED
+
+ # hash table for finding hits
+ lookup = {}
+
+ # store sequence hashes in hash table
+ print "hashing seq1..."
+ for i in xrange(len(seq1) - kmerlen + 1):
+ key = seq1[i:i+kmerlen:kmerstep] # CHANGED
+ lookup.setdefault(key, []).append(i) # CHANGED
+
+
+
+ # look up hashes in hash table
+ print "hashing seq2..."
+ hits = []
+ for i in xrange(len(seq2) - kmerlen + 1):
+ key = seq2[i:i+kmerlen:kmerstep] # CHANGED
+
+ # store hits to hits list
+ for hit in lookup.get(key, []):
+ hits.append((i, hit))
+\end{lstlisting}
+
+The script above outputs the original plot. The following values for kmerlen
+and kmerstep produce the plots for modifications i-iv:
+\begin{enumerate}
+ \item [i.] kmerlen = 100, kmerstep = 1
+ \item [ii.] kmerlen = 60, kmerstep = 2
+ \item [iii.] kmerlen = 90, kmerstep = 3
+ \item [iv.] kmerlen = 120, kmerstep = 4
+\end{enumerate}
+
+\subsection {Part C}
+
+\subsection {Part D}
+
+\begin{tabular}{|l|r|r|}
+\hline
+Original & 62829 & 24.70\% \\
+\hline
+Modification i & & \\
+\hline
+Modification ii & & \\
+\hline
+Modification iii & & \\
+\hline
+Modification iv & & \\
+\hline
+Modification v & & \\
+\hline
+\end{tabular}
+\end{tabular}
+
+\subsection {Part E}
View
5 src/6.878/ps1/code/a_scores.txt
@@ -0,0 +1,5 @@
+-4 A G C T
+ A 3 -1 -2 -2
+ G -1 3 -2 -2
+ C -2 -2 3 -1
+ T -2 -2 -1 3
View
5 src/6.878/ps1/code/b_scores.txt
@@ -0,0 +1,5 @@
+-1 A G C T
+ A 0 -1 -1 -1
+ G -1 0 -1 -1
+ C -1 -1 0 -1
+ T -1 -1 -1 0
View
115 src/6.878/ps1/code/ps1-seqalign.rb
@@ -0,0 +1,115 @@
+#!/usr/bin/env ruby19
+#
+# Needleman-Wunsch sequence alignment.
+#
+# Author:: Victor Costan
+# Copyright:: none
+# License:: Public Domain
+
+# This program needs the bio gem to run. Install with the following command:
+# gem install bio
+require 'bio'
+require 'pp'
+
+# Aligns two sequences accoding to the Needleman-Wunsch algorithm.
+def align_sequences_nm(sequence1, sequence2, scores)
+ # Dynamic programming.
+ dp = Array.new(sequence1.length + 1) { Array.new(sequence2.length + 1) }
+ gap_score = scores['-']['-']
+ dp[0][0] = [0, 0, 0, '|', '|']
+ sequence1.chars.each_with_index do |c, i|
+ dp[i + 1][0] = [(i + 1) * gap_score, 1, 0, c, '-']
+ end
+ sequence2.chars.each_with_index do |c, j|
+ dp[0][j + 1] = [(j + 1) * gap_score, 0, 1, '-', c]
+ end
+ sequence1.chars.each_with_index do |base1, i|
+ sequence2.chars.each_with_index do |base2, j|
+ dp[i + 1][j + 1] = [[0, 1, '-', base2], [1, 0, '-', base1],
+ [1, 1, base1, base2]].map { |i1, j1, match1, match2|
+ [dp[i - i1 + 1][j - j1 + 1].first + scores[match1][match2],
+ i1, j1, match1, match2]
+ }.max
+ end
+ end
+
+ # Solution reconstruction.
+ i, j = *[sequence1, sequence2].map(&:length)
+ match_score = dp[i][j].first
+ align1, align2 = '', ''
+ until i == 0 && j == 0
+ score, i1, j1, base1, base2 = *dp[i][j]
+ align1 << base1; i -= i1
+ align2 << base2; j -= j1
+ end
+
+ # Return values
+ scores = dp.map { |line| line.map(&:first) }
+ words = { [1, 0] => '$\\uparrow$', [0, 1] => '$\\leftarrow$',
+ [1, 1] => '$\\nwarrow$', [0, 0] => '$\\cdot$'}
+ parents = dp.map { |line| line.map { |item| words[item[1, 2]] } }
+ { :scores => scores, :parents => parents, :match_score => match_score,
+ :aligns => [align1, align2].map(&:reverse) }
+end
+
+# Monkey-patch Bio::Sequence to read sequence data directly from files.
+class Bio::Sequence
+ def self.from_file(file_name)
+ Bio::Sequence.input File.read(file_name)
+ end
+end
+
+# Reads the scoring matrix from a file.
+def scores_from_file(file_name)
+ lines = File.read(file_name).split("\n").map { |line| line.split }
+ gap_score = lines[0][0].to_f
+ Hash[*lines[1..-1].map { |line|
+ [line.first, Hash[*lines[0][1..-1].zip(line[1..-1].map { |token|
+ token.to_i
+ }).flatten].merge('-' => gap_score)]
+ }.flatten].merge('-' => Hash.new { |key| gap_score })
+end
+
+# Produces a latex rendering of a matrix.
+def latex_matrix(matrix)
+ output = "\\begin{tabular}{|#{'r|' * matrix.first.length}}\n"
+ output << "\\hline\n"
+ output << matrix.map { |line| line.join(' & ') }.join("\\\\\n\\hline\n")
+ output << "\\\\\n"
+ output << "\\hline\n"
+ output << "\\end{tabular}\n"
+ output
+end
+
+# Produces a latex rendering of a table of data.
+def latex_table(data, top_headings, left_headings)
+ matrix = [[''] + top_headings] +
+ left_headings.zip(data).map { |heading, line| [heading] + line }
+ latex_matrix matrix
+end
+
+# main
+if $0 == __FILE__
+ unless ARGV.length == 3
+ puts "Usage: #{$0} score_file sequence_1 sequence_2"
+ exit
+ end
+ sequences = ARGV[1..-1].map { |file_name| Bio::Sequence.from_file file_name }
+ scores = scores_from_file ARGV[0]
+ dp_result = align_sequences_nm(*(sequences + [scores]))
+
+ puts "Alignment:"
+ puts "\\begin{verbatim}\n#{dp_result[:aligns].join("\n")}\n\\end{verbatim}"
+ puts "Score: #{dp_result[:match_score]}"
+
+ # Only print detailed stats for small matrices.
+ if sequences.map(&:length).max <= 10
+ puts "Score matrix: "
+ puts latex_table(dp_result[:scores], ['|'] + sequences[1].chars.to_a,
+ ['|'] + sequences[0].chars.to_a)
+
+ puts "Parents matrix: "
+ puts latex_table(dp_result[:parents], ['|'] + sequences[1].chars.to_a,
+ ['|'] + sequences[0].chars.to_a)
+ end
+end
View
BIN src/6.878/ps1/figs/p3_exact30.png
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
View
4 src/6.878/ps1/problem1_alignment.tex
@@ -0,0 +1,4 @@
+\begin{verbatim}
+A-GTGA-
+AGGTAAT
+\end{verbatim}
View
21 src/6.878/ps1/problem1_parents.tex
@@ -0,0 +1,21 @@
+\begin{tabular}{|r|r|r|r|r|r|r|}
+\hline
+ & | & A & G & T & A & A\\
+\hline
+| & $\cdot$ & $\leftarrow$ & $\leftarrow$ & $\leftarrow$ & $\leftarrow$ & $\leftarrow$\\
+\hline
+A & $\uparrow$ & $\nwarrow$ & $\leftarrow$ & $\leftarrow$ & $\nwarrow$ & $\nwarrow$\\
+\hline
+G & $\uparrow$ & $\uparrow$ & $\nwarrow$ & $\leftarrow$ & $\leftarrow$ & $\leftarrow$\\
+\hline
+G & $\uparrow$ & $\uparrow$ & $\nwarrow$ & $\nwarrow$ & $\nwarrow$ & $\nwarrow$\\
+\hline
+T & $\uparrow$ & $\uparrow$ & $\uparrow$ & $\nwarrow$ & $\nwarrow$ & $\nwarrow$\\
+\hline
+G & $\uparrow$ & $\uparrow$ & $\nwarrow$ & $\uparrow$ & $\nwarrow$ & $\nwarrow$\\
+\hline
+A & $\uparrow$ & $\nwarrow$ & $\uparrow$ & $\uparrow$ & $\nwarrow$ & $\nwarrow$\\
+\hline
+T & $\uparrow$ & $\uparrow$ & $\uparrow$ & $\nwarrow$ & $\uparrow$ & $\uparrow$\\
+\hline
+\end{tabular}
View
21 src/6.878/ps1/problem1_scores.tex
@@ -0,0 +1,21 @@
+\begin{tabular}{|r|r|r|r|r|r|r|}
+\hline
+ & | & A & G & T & A & A\\
+\hline
+| & 0 & -4.0 & -8.0 & -12.0 & -16.0 & -20.0\\
+\hline
+A & -4.0 & 3 & -1.0 & -5.0 & -9.0 & -13.0\\
+\hline
+G & -8.0 & -1.0 & 6 & 2.0 & -2.0 & -6.0\\
+\hline
+G & -12.0 & -5.0 & 2.0 & 4 & 1.0 & -3.0\\
+\hline
+T & -16.0 & -9.0 & -2.0 & 5.0 & 2 & -1.0\\
+\hline
+G & -20.0 & -13.0 & -6.0 & 1.0 & 4.0 & 1\\
+\hline
+A & -24.0 & -17.0 & -10.0 & -3.0 & 4.0 & 7.0\\
+\hline
+T & -28.0 & -21.0 & -14.0 & -7.0 & 0.0 & 3.0\\
+\hline
+\end{tabular}
View
29 src/master.tex
@@ -1,5 +1,23 @@
\documentclass{article}
+%% Pset author.
+\newcommand{\PsetAuthorName}{Victor Costan}
+\newcommand{\PsetAuthorEmail}{costan@mit.edu}
+
+%% Pset class and instance information.
+\newcommand{\PsetTitle}{Problem Set 1}
+\newcommand{\PsetDueDate}{September 14, 2009}
+\newcommand{\PsetMainFile}{6.878/ps1/all.tex}
+\newcommand{\PsetClassMetadata}{6.878/metadata.tex}
+
+%%
+%% Configuration ends here.
+%%
+
+% Stop wasting paper.
+\oddsidemargin -0in
+\textwidth 6.7in
+
%% Packages
\usepackage{epsfig}
\usepackage{endnotes}
@@ -16,17 +34,6 @@
\usepackage{fancyhdr}
\pagestyle{fancy}
-%% Pset author.
-\newcommand{\PsetAuthorName}{Victor Costan}
-\newcommand{\PsetAuthorEmail}{costan@mit.edu}
-
-%% Pset class and instance information.
-\newcommand{\PsetTitle}{Problem Set 1}
-\newcommand{\PsetDueDate}{September 14, 2009}
-\newcommand{\PsetMainFile}{6.823/ps1/all.tex}
-\newcommand{\PsetClassMetadata}{6.823/metadata.tex}
-
-
\input{\PsetClassMetadata}
\renewcommand{\leftmark}{\PsetAuthorName\space$<$\PsetAuthorEmail$>$}
\renewcommand{\rightmark}{\PsetClassNumber\space\PsetClassTerm\space\PsetTitle}

0 comments on commit 123b070

Please sign in to comment.
Something went wrong with that request. Please try again.