Option to produce PHYLIP distance matrix format #9
If we did this I think a post-processing script would make more sense than a switch. |
Alternatively, a switch would be more feasible if it simply enforced the 10 character limit; conversion could then be left to a preprocessing script. |
Many tools today (e.g. RAxML) accept the Relaxed PHYLIP format. This nominally allows IDs of up to 250 characters and uses a space as the separator rather than a hard cut at 11 characters. Would you consider that? |
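As a rough illustration of the relaxed layout described in the comment above, here is a minimal Python sketch that serialises a square distance matrix with a count line followed by one space-separated row per sample. The function name `format_relaxed_phylip`, the six-decimal formatting, and the sample names are assumptions made for this example; none of this is part of Mash.

```python
def format_relaxed_phylip(names, matrix):
    """Return a square distance matrix as relaxed-PHYLIP-style text.

    Hypothetical helper: labels may be long, but must not contain
    whitespace because the first space ends the label.
    """
    if len(names) != len(matrix):
        raise ValueError("need exactly one matrix row per name")
    lines = [str(len(names))]  # first line: number of samples
    for name, row in zip(names, matrix):
        if any(ch.isspace() for ch in name):
            raise ValueError(f"label contains whitespace: {name!r}")
        lines.append(name + " " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


# Made-up labels; the distance value is the one quoted later in this thread.
names = ["genome_with_a_long_name_A", "genome_with_a_long_name_B"]
matrix = [[0.0, 0.223052], [0.223052, 0.0]]
print(format_relaxed_phylip(names, matrix), end="")
```

Rejecting whitespace in labels, rather than silently replacing it, keeps the mapping between labels and input files unambiguous; a caller could sanitise names first if that suits the downstream tool better.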
+1 for PHYLIP distance matrix. That is the usual output format for other alignment-free distance estimators. See andi and spaced words. |
I am changing my opinion.
Thus by requiring users to create their own distance matrix, this prevents the likely error that ppl would use Mash to build phylogenies (and be overconfident in their accuracy). Thus not producing a PHYLIP distance matrix may prevent trouble in the future. |
I just looked at this ticket out of curiosity, but it wasn't too hard to code. I think there is a BioPerl way to do it too, but I was too lazy to try it out. I implemented this in There are some dependencies like
|
@kloetzl Why would you want to stop people building trees with it? Trees are good relationship diagrams. Sure it's not 'phylo'-genetic but it's still genetic. In fact the first in-person demo I saw of mash in May 2015 by @aphillippy did exactly that! PHYLIP is just a convenient 'understood' format for exchanging distance data. The current matrix format is just being imported directly into R anyway and being used to draw trees. |
Add new 'matrix' command which compares all input sequences and outputs a distance matrix. The current implementation is rough and ready but potentially faster than iterative 'mash dist'. fixes issue marbl#9.
Hi, I am using Mash on my proteins of interest to draw a dendrogram with an alignment-free method. I got the following output using file1.fa 0.223052 I have to use these outputs in the PHYLIP package to get a dendrogram. Is that right? Or do I need to convert this file to a different format? Please point me in the right direction. Thank you for your time. |
The output is just one row or column in the matrix. You have to do all the other comparisons, too and then you can create a Phylip-style distance matrix and finally the dendrogram. For more details, how to get a distance matrix out of mash see https://github.com/lskatz/mashtree and #66 . |
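To make the step described in the reply above concrete, here is a small Python sketch that collects pairwise rows into a full symmetric matrix. It assumes tab-separated input in which the first three columns are the reference name, the query name, and the distance; the function name `pairs_to_square` and the sample rows are invented for illustration.

```python
def pairs_to_square(lines):
    """Build (names, square_matrix) from tab-separated pairwise rows.

    Assumed columns: reference, query, distance (extra columns ignored).
    Raises KeyError if any pair of samples was never compared.
    """
    names = []
    dist = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ref, query, d = fields[0], fields[1], float(fields[2])
        for name in (ref, query):
            if name not in names:
                names.append(name)  # keep first-seen order
        dist[(ref, query)] = dist[(query, ref)] = d  # distances are symmetric
    matrix = [[0.0 if a == b else dist[(a, b)] for b in names] for a in names]
    return names, matrix


# Made-up rows standing in for the output of all pairwise comparisons.
rows = [
    "a.fa\tb.fa\t0.1\t0\t900/1000",
    "a.fa\tc.fa\t0.2\t0\t800/1000",
    "b.fa\tc.fa\t0.3\t0\t700/1000",
]
names, matrix = pairs_to_square(rows)
print(len(names))
for name, row in zip(names, matrix):
    print(name, *row)
```

Failing loudly on a missing pair is deliberate here: a silently zero-filled cell would look like two identical samples to a tree-building tool.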
I guess I will close this as it doesn't seem you will provide a standards-compliant distance matrix output format. The 10 character limit is legacy, most parsers support arbitrary lengths. |
You could use my fork until the PR gets merged. If you find it useful one could even patch mash in common package managers to make the new command available to everyone. |
There is a 'triangle' command in latest, which outputs (relaxed) Phylip. Not really tested with tree tools yet; have a look! |
If anyone needs it, here is a script to convert the triangle into a square.
mash triangle "$@" |
awk 'NR == 1 {n=$1}
# Strip the directory part of a path; a and n are local variables.
function basename(file, a, n) {
    n = split(file, a, "/")
    return a[n]
}
# Row i of the triangle holds the distances to samples 1..i-1.
NR > 1 {i=NR-1; names[i] = basename($1);
    for (j=2; j <= NF; j++){
        mat[i,j-1] = mat[j-1,i] = $j;
    }
    mat[i,i]=0.0;
}
END{
    print n;
    # Loop by index: "for (a in names)" visits keys in unspecified order.
    for (i=1; i<=n; i++){
        printf "%s", names[i];
        for(j=1; j<=n; j++)
            printf " %f", mat[i,j];
        printf "\n";
    }
}' |
I think this can be closed for now. @kloetzl the square format also seems like a reasonable command line option if you'd like to submit a PR. |
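For anyone trying to load this matrix format into python/pandas: EDIT: Do this instead
import numpy as np
import pandas as pd

def mash_triangle_to_square(triangle_path):
    names = []
    rows = []
    with open(triangle_path) as contents_fh:
        next(contents_fh)  # Skip record count
        for line in contents_fh:
            records = line.strip().split('\t')
            names.append(records[0])
            rows.append([float(value) for value in records[1:]])
    # squareform() expects the condensed upper triangle in row order, which
    # does not match the lower-triangle rows here, so fill the matrix directly.
    square = np.zeros((len(names), len(names)))
    for i, row in enumerate(rows):
        square[i, :len(row)] = row  # row i holds distances to samples 0..i-1
    square += square.T  # mirror the lower triangle; the diagonal stays 0
    return pd.DataFrame(square, index=names, columns=names) |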
Hi @bede I could not find an elegant way of doing it and ended up with
import numpy as np
import pandas as pd

def lower_triangle_to_full_matrix(filename):
    with open(filename) as f:
        num_samples = sum(1 for _ in f) - 1  # first line is the sample count
    distances = []
    sample_names = []
    with open(filename) as f:
        next(f)  # skip sample count line
        for line in f:
            elements = line.strip().split('\t')
            sample_names.append(elements[0])
            row = [float(e) for e in elements[1:]]
            row.extend([0.0] * (num_samples - len(row)))  # pad to full width
            distances.append(row)
    np_array = np.asarray(distances)
    index_upper = np.triu_indices(num_samples)
    np_array[index_upper] = np_array.T[index_upper]  # mirror lower into upper
    df = pd.DataFrame(np_array, columns=sample_names, index=sample_names)
    df.to_csv('output.tsv', sep='\t')  # tab-separated, matching the file name
    return df |
Thanks for sharing this Andy! I remember shelving the results of that function because I had doubts about them at the time. Thanks for the fix. |
The mash dist -t option produces a TSV distance matrix. It is not too much effort to produce a valid PHYLIP distance matrix format, preferably in ltriangle form: http://www.mothur.org/wiki/Phylip-formatted_distance_matrix
The main issue is that PHYLIP labels may be limited to 10 chars, which is a bit of a disaster for mash applications. Maybe integers 0000000 - 999999 could be used and a .map file created for using with nw_rename later?
Here is an example of more specs: http://www.uwyo.edu/dbmcd/molmark/practica/phylipinfo.doc
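The integer-label idea could be sketched roughly as follows in Python. The helper names `make_label_map` and `format_map`, the 7-digit default width, and the two-column "short label, original name" map layout are all assumptions for illustration; the exact map format expected by nw_rename should be checked against its documentation.

```python
def make_label_map(names, width=7):
    """Pair each original name with a zero-padded integer label.

    Hypothetical helper: width=7 mirrors the "0000000" style suggested
    above; strict PHYLIP would allow up to 10 characters.
    """
    if len(names) > 10 ** width:
        raise ValueError("too many names for the chosen label width")
    return [(f"{i:0{width}d}", name) for i, name in enumerate(names)]


def format_map(label_map):
    """Render the mapping as text, one 'short<TAB>original' pair per line."""
    return "".join(f"{short}\t{original}\n" for short, original in label_map)


# Made-up input paths standing in for long sketch or file names.
label_map = make_label_map(["data/run1/sample_with_long_name.fa", "other.fa"])
print(format_map(label_map), end="")
```

The short labels would go into the PHYLIP matrix, and the map text would be saved alongside it so the original names can be restored on the resulting tree.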