Option to produce PHYLIP distance matrix format #9

tseemann · 2015-10-03T07:00:06Z

The mash dist -t option produces a TSV distance matrix.

It is not too much effort to produce a valid PHYLIP distance matrix format, preferably in ltriangle form: http://www.mothur.org/wiki/Phylip-formatted_distance_matrix

The main issue is that PHYLIP labels may be limited to 10 chars, which is a bit of a disaster for mash applications. Maybe integers 0000000 - 999999 could be used and a .map file created for use with nw_rename later?
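A minimal sketch of the integer-label idea, in Python. The function names, the 7-digit width, and the two-column tab-separated map layout are all assumptions made for illustration; check the nw_rename documentation for the layout it actually expects.

```python
def make_label_map(names, width=7):
    """Map each original name to a zero-padded integer label (hypothetical scheme)."""
    if len(names) > 10 ** width:
        raise ValueError("too many names for the chosen label width")
    return {name: str(i).zfill(width) for i, name in enumerate(names)}


def write_map_file(label_map, path):
    """Write 'short_label<TAB>original_name' lines (assumed layout)."""
    with open(path, "w") as fh:
        for name, label in label_map.items():
            fh.write(f"{label}\t{name}\n")


# Example file names are made up for illustration.
labels = make_label_map(["genomes/Escherichia_coli_K12.fna", "genomes/Salmonella_LT2.fna"])
print(labels)
```

The short labels fit within the 10 character limit, and the map file lets the original names be restored on the finished tree.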

Here is an example of more specs: http://www.uwyo.edu/dbmcd/molmark/practica/phylipinfo.doc

Guidelines for PHYLIP input files for programs Neighbor and Fitch (tree-building from distance matrices)

1) File must be text-only (ASCII)!!  It must be in same directory with the program and must be called infile.  
2) 1st line has number of OTUs (taxa, pops.)
3) Next line has first OTU name (padded to at least 10 characters with spaces, if necessary)
4) Easiest is upper diagonal matrix (note that it doesn't have to be aligned).  Separators are spaces.

Example

4
LS1        0.083  0.25 0.458
LS2        0.167  0.392
LS3        0.392
LS4        

Notes: last line still pads out 10 spaces, but has no "distance" (implied zero from LS4 to itself).
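The layout described above could be produced along these lines. This is only a sketch in Python: the function name, the dict-of-pairs input, and the three-decimal formatting are assumptions, not anything mash or PHYLIP prescribes.

```python
def format_upper_triangle(names, dist):
    """Return a strict-PHYLIP upper-triangular matrix as text.

    names: list of OTU labels (at most 10 characters each)
    dist:  dict mapping (name_a, name_b) to a distance, with name_a before name_b
    """
    lines = [str(len(names))]
    for i, name in enumerate(names):
        if len(name) > 10:
            raise ValueError(f"label longer than 10 characters: {name}")
        values = [f"{dist[(name, other)]:.3f}" for other in names[i + 1:]]
        # Pad the label to 10 characters, then append space-separated distances.
        lines.append(name.ljust(10) + "".join(" " + v for v in values))
    return "\n".join(lines) + "\n"


names = ["LS1", "LS2", "LS3", "LS4"]
dist = {
    ("LS1", "LS2"): 0.083, ("LS1", "LS3"): 0.25, ("LS1", "LS4"): 0.458,
    ("LS2", "LS3"): 0.167, ("LS2", "LS4"): 0.392,
    ("LS3", "LS4"): 0.392,
}
text = format_upper_triangle(names, dist)
print(text, end="")
```

The last row contains only the padded label, matching the note about LS4.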


ondovb · 2015-10-05T15:46:43Z

If we did this, I think a post-processing script would make more sense than a switch.

ondovb · 2015-10-22T20:10:29Z

Alternatively, a switch would be more feasible if it simply enforced the 10 character limit; conversion could then be left to a preprocessing script.

tseemann · 2016-04-02T23:13:32Z

Many tools today (e.g. RAxML) accept the Relaxed PHYLIP format:
http://www.phylo.org/index.php/help/relaxed_phylip

This nominally allows up to 250 character IDs and uses a space as the separator rather than a hard 11 char cut.

Would you consider that?
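To illustrate the difference for a parser, here is a small Python sketch. Both function names are made up for this example, and real tools may handle edge cases (tabs, embedded spaces in names) differently.

```python
def split_strict(line):
    """Strict PHYLIP: the label is exactly the first 10 columns."""
    return line[:10].rstrip(), line[10:].split()


def split_relaxed(line):
    """Relaxed PHYLIP: the label runs up to the first whitespace."""
    fields = line.split()
    return fields[0], fields[1:]


long_row = "Escherichia_coli_K12_MG1655 0.167 0.392"
print(split_strict("LS2        0.167  0.392"))
print(split_strict(long_row))   # label is cut after 10 columns
print(split_relaxed(long_row))  # full label survives
```

With the strict rule, the long name is cut to "Escherichi" and the remainder is misread as data, which is the failure mode the relaxed format avoids.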

kloetzl · 2016-04-19T07:16:45Z

+1 for PHYLIP distance matrix. That is the usual output format for other alignment-free distance estimators. See andi and spaced words.

kloetzl · 2016-07-04T09:26:35Z

I am changing my opinion.

> However, due to limitations of both the k-mer approach and simple distance model, we emphasize that Mash is not explicitly designed for phylogeny reconstruction, especially for genomes with high divergence or large size differences.

Thus by requiring users to create their own distance matrix, this prevents the likely error that ppl would use Mash to build phylogenies (and be overconfident in their accuracy). Thus not producing a PHYLIP distance matrix may prevent trouble in the future.

lskatz · 2016-07-25T20:31:52Z

I just looked at this ticket out of curiosity, but it wasn't too hard to code. I think there is a bioperl way to do it too, but I was too lazy to try it out. I implemented this in mashtree, which is not yet fully validated and which produces trees (not phylogenies)! https://github.com/lskatz/mashtree. I would also chime in to encourage people not to consider this as a phylogenetic method.

There are some dependencies like File::Basename qw/basename/ and a logmsg subroutine (ie a subroutine that prints to stderr)

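use File::Basename qw/basename/;

# Minimal stand-in for the logmsg helper mentioned above: print to stderr.
sub logmsg{ print STDERR "@_\n"; }

my @fastqExt=qw(.fastq.gz .fastq .fq .fq.gz);
my @fastaExt=qw(.fasta .fna .faa .mfa .fas .fa);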

# 1. Read the mash distances
# 2. Create a phylip file
sub distancesToPhylip{
  my($distances,$outdir,$settings)=@_;

  my $phylip = "$outdir/distances.phylip";
  return $phylip if(-e $phylip);

  logmsg "Reading the distances file at $distances";
  open(MASHDIST,"<",$distances) or die "ERROR: could not open $distances for reading: $!";

  my $id="UNKNOWN"; # Default ID in case anything goes wrong
  my %m; #matrix for distances
  while(<MASHDIST>){
    chomp;
    if(/^#query\s+(.+)/){
      $id=_truncateFilename($1,$settings);
    } else {
      my @F=split(/\t/,$_);
      $F[0]=_truncateFilename($F[0],$settings);
      $m{$id}{$F[0]}=sprintf("%0.6f",$F[1]);
    }
  }
  close MASHDIST;

  # Create the phylip file.
  # Make the text first so that we can edit it a bit.
  # TODO I should probably make the matrix the bioperl way.
  logmsg "Creating the distance matrix file for fneighbor.";
  my %seenTruncName;
  my $phylipText="";
  my @genome=sort{$a cmp $b} keys(%m);
  for(my $i=0;$i<@genome;$i++){
    my $name=_truncateFilename($genome[$i],$settings);
    $phylipText.="$name  ";
    if($seenTruncName{$name}++){

    }
    for(my $j=0;$j<@genome;$j++){
      $phylipText.=$m{$genome[$i]}{$genome[$j]}."  ";
    }
    $phylipText.= "\n";
  }
  $phylipText=~s/  $//gm;

  # Make the phylip file.
  open(PHYLIP,">",$phylip) or die "ERROR: could not open $phylip for writing: $!";
  print PHYLIP "    ".scalar(@genome)."\n";
  print PHYLIP $phylipText;
  close PHYLIP;

  return $phylip;
}

# Removes fastq extension, removes directory name,
# truncates to a length, and adds right-padding.
sub _truncateFilename{
  my($file,$settings)=@_;
  my $name=basename($file,@fastqExt);
  $name=substr($name,0,$$settings{truncLength});
  $name.=" " x ($$settings{truncLength}-length($name));
  return $name;
}

tseemann · 2016-08-29T07:02:13Z

@kloetzl Why would you want to stop people building trees with it? Trees are good relationship diagrams. Sure it's not 'phylo'-genetic but it's still genetic. In fact the first in-person demo I saw of mash in May 2015 by @aphillippy did exactly that! PHYLIP is just a convenient 'understood' format for exchanging distance data. The current matrix format is just being imported directly into R anyway and being used to draw trees. </RANT>

Add new 'matrix' command which compares all input sequences and outputs a distance matrix. The current implementation is rough and ready but potentially faster than iterative 'mash dist'. fixes issue marbl#9.

Amrithasuresh · 2018-02-06T21:59:02Z

Hi,

I am using Mash on my proteins of interest to draw a dendrogram with an alignment-free method. I got the following output using mash dist -t:

file1.fa 0.223052
file2.fa 0.255107
file3.fa 0.223052
file4.fa 0.255107
file5.fa 0.243822
file6.fa 0.212171

I have to use these outputs in the PHYLIP package to get a dendrogram. Is that right? Or do I need to convert this file to a different format? Please point me in the right direction.

Thank you for your time.

kloetzl · 2018-02-07T08:11:58Z

The output is just one row or column in the matrix. You have to do all the other comparisons too; then you can create a Phylip-style distance matrix and finally the dendrogram. For more details on how to get a distance matrix out of mash, see https://github.com/lskatz/mashtree and #66.
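As a rough Python sketch of "doing all the other comparisons": once every pair has been run through plain mash dist (tab-separated rows of reference, query, distance, p-value, shared hashes), the rows can be collected into a square matrix. The function name and the sample rows below are made up for illustration.

```python
def pairwise_to_matrix(lines):
    """Collect pairwise rows (reference, query, distance, ...) into a square matrix."""
    names = {}  # dict used as an insertion-ordered set of sequence names
    dist = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ref, query, d = fields[0], fields[1], float(fields[2])
        names.setdefault(ref)
        names.setdefault(query)
        dist[(ref, query)] = dist[(query, ref)] = d
    order = list(names)
    # Diagonal is zero; a missing pair raises KeyError rather than being guessed.
    matrix = [[0.0 if a == b else dist[(a, b)] for b in order] for a in order]
    return order, matrix


# Illustrative rows only, not real mash output.
rows = [
    "a.fa\tb.fa\t0.10\t0\t500/1000",
    "a.fa\tc.fa\t0.20\t0\t300/1000",
    "b.fa\tc.fa\t0.15\t0\t400/1000",
]
order, matrix = pairwise_to_matrix(rows)
print(order)
print(matrix)
```

The resulting names and matrix can then be written out in whichever PHYLIP layout the tree tool expects.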

tseemann · 2018-03-20T06:39:59Z

I guess I will close this as it doesn't seem you will provide a standards-compliant distance matrix output format.

The 10 character limit is legacy, most parsers support arbitrary lengths.

kloetzl · 2018-03-27T10:07:40Z

> I guess I will close this as it doesn't seem you will provide a standards-compliant distance matrix output format.

You could use my fork until the PR gets merged. If you find it useful one could even patch mash in common package managers to make the new command available to everyone.

ondovb · 2018-09-22T21:03:43Z

There is a 'triangle' command in latest, which outputs (relaxed) Phylip. Not really tested with tree tools yet; have a look!

kloetzl · 2019-02-13T13:17:57Z

If anyone needs it, here is a script to convert the triangle into a square.

mash triangle "$@" |
	awk 'NR == 1 {n=$1}
		function basename(file, a, n) {
		  n = split(file, a, "/")
		  return a[n]
		}
		NR > 1 {i=NR-1; names[i] = basename($1);
		  for (j=2; j <= NF; j++){
		    mat[i,j-1] = mat[j-1,i] = $j;
		  }
		  mat[i,i]=0.0;
		}
		END{
		  print n;
		  # iterate by index so rows come out in input order
		  for (i=1; i<=n; i++){
		    printf "%s", names[i];
		    for(j=1; j<=n; j++)
		      printf "  %f", mat[i,j];
		    printf "\n";
		  }
		}'

ondovb · 2019-03-18T19:47:57Z

I think this can be closed for now. @kloetzl the square format also seems like a reasonable command line option if you'd like to submit a PR.

bede · 2019-05-21T14:43:02Z

For anyone trying to load this matrix format into python/pandas:

EDIT: Do this instead

import pandas as pd
from scipy.spatial.distance import squareform

def mash_triangle_to_square(triangle_path):
    with open(triangle_path) as contents_fh:
        next(contents_fh)  # Skip record count
        values = []
        names = []
        for line in contents_fh:
            records = line.strip().split('\t')
            names.append(records[0])
            values.extend(records[1:])
    df = pd.DataFrame(squareform(values), index=names, columns=names)
    df.replace({'': 0.}, inplace=True)
    return df.astype(float)

antunderwood · 2019-07-09T22:43:27Z

Hi @bede
#9 (comment)
I had actually just written virtually identical code, but found that although this makes a square matrix, it creates the wrong matrix, since squareform expects an upper triangle.

I could not find an elegant way of doing it and ended up with

import numpy as np
import pandas as pd

def lower_triangle_to_full_matrix(filename):
    # Count lines up front; the first line holds the sample count.
    with open(filename) as f:
        num_lines_in_file = sum(1 for line in f)
    num_samples = num_lines_in_file - 1
    distances = []
    sample_names = []

    with open(filename) as f:
        next(f) # skip sample count line
        for line in f:
            elements = line.strip().split('\t')
            sample_names.append(elements[0])
            row = [float(e) for e in elements[1:]]
            # pad with zeros so every row has one entry per sample
            row.extend([0.0] * (num_samples - len(row)))
            distances.append(row)
    np_array = np.asarray(distances)
    # mirror the lower triangle into the upper triangle
    index_upper = np.triu_indices(num_samples)
    np_array[index_upper] = np_array.T[index_upper]
    return pd.DataFrame(np_array, columns=sample_names, index=sample_names).to_csv('output.tsv')

bede · 2019-07-09T23:08:14Z

Thanks for sharing this Andy! I remember shelving the results of that function because I had doubts about them at the time. Thanks for the fix.
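For anyone wondering why the ordering matters: scipy's squareform expects the condensed vector in upper-triangle row order, while reading the triangle file row by row yields lower-triangle row order. A small Python sketch of the reordering follows; the function name is made up, and the string values stand in for distances so the ordering is visible.

```python
def lower_rows_to_condensed(values, n):
    """Reorder lower-triangle row-major values into upper-triangle (condensed) order.

    In the input, the pair (i, j) with i > j sits at index i*(i-1)//2 + j.
    """
    if len(values) != n * (n - 1) // 2:
        raise ValueError("expected n*(n-1)/2 values")
    return [values[j * (j - 1) // 2 + i] for i in range(n) for j in range(i + 1, n)]


# "d21" means the distance between samples 2 and 1, and so on.
lower = ["d10", "d20", "d21", "d30", "d31", "d32"]
print(lower_rows_to_condensed(lower, 4))
# → ['d10', 'd20', 'd30', 'd21', 'd31', 'd32']
```

The third and fourth entries swap places for four samples, so passing the file order straight to squareform silently misplaces distances once there are more than three samples.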

ondovb added the enhancement label Oct 5, 2015

tseemann mentioned this issue Jul 8, 2017

Option to output standard PHYLIP distance matrix sanger-pathogens/panito#5

Open

kloetzl mentioned this issue Oct 26, 2017

mash matrix #66

Closed

ondovb mentioned this issue Dec 15, 2017

Is there a way to sketch 1,000 files in one line #71

Closed

tseemann closed this as completed Mar 20, 2018

tseemann mentioned this issue Apr 7, 2018

Generate distance matrix rather than pairwise ParBLiSS/FastANI#5

Closed

ondovb reopened this Sep 22, 2018

ondovb mentioned this issue Sep 26, 2018

Tree #96

Open

kloetzl mentioned this issue Jan 9, 2019

Option to produce PHYLIP distance matrix format dnbaker/dashing#8

Closed

ondovb closed this as completed Mar 18, 2019

kloetzl mentioned this issue Feb 5, 2020

Add option to report output directly in "molten" format? tseemann/snp-dists#37

Closed
