Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to produce PHYLIP distance matrix format #9

Closed
tseemann opened this issue Oct 3, 2015 · 17 comments
Closed

Option to produce PHYLIP distance matrix format #9

tseemann opened this issue Oct 3, 2015 · 17 comments

Comments

@tseemann
Copy link
Contributor

tseemann commented Oct 3, 2015

The mash dist -t option produces a TSV distance matrix.

It is not too much effort to produce a valid PHYLIP distance matrix format, preferably in ltriangle form: http://www.mothur.org/wiki/Phylip-formatted_distance_matrix

The main issue is that PHYLIP labels may be limited to 10 chars which is bit of a disaster for mash applications. Maybe integers 0000000 - 999999 could be used and a .map file created for using with nw_rename later?

Here is an example of more specs: http://www.uwyo.edu/dbmcd/molmark/practica/phylipinfo.doc

Guidelines for PHYLIP input files for programs Neighborand Fitch (tree-building from distance matrices)

1) File must be text-only (ASCII)!!  It must be in same directory with the program and must be called infile.  
2) 1st line has number of OTUs (taxa, pops.)
3) Next line has first OTU name (padded to at least 10 characters with spaces, if necessary)
4) Easiest is upper diagonal matrix (note that it doesn't have to be aligned).  Separators are spaces.

Example

4
LS1        0.083  0.25 0.458
LS2        0.167  0.392
LS3        0.392
LS4        

Notes: last line still pads out 10 spaces, but has no "distance" (implied zero from LS4 to itself).  
@ondovb
Copy link
Member

ondovb commented Oct 5, 2015

If we did this I think I a post-processing script would make more sense than a switch.

@ondovb
Copy link
Member

ondovb commented Oct 22, 2015

Alternatively, a switch would be more feasible if it simply enforced the 10 character limit; conversion could then be left to a preprocessing script.

@tseemann
Copy link
Contributor Author

tseemann commented Apr 2, 2016

Many tools today (eg. RaxML) accept the Relaxed PHYLIP format:
http://www.phylo.org/index.php/help/relaxed_phylip

This nominally allows up to 250 character IDs and use a space as the separator rather than a hard 11 char cut.

Would you consider that?

@kloetzl
Copy link
Contributor

kloetzl commented Apr 19, 2016

+1 for PHYLIP distance matrix. That is the usual output format for other alignment-free distance estimators. See andi and spaced words.

@kloetzl
Copy link
Contributor

kloetzl commented Jul 4, 2016

I am changing my opinion.

However, due to limitations of both the k-mer approach and simple distance model, we emphasize that Mash is not explicitly designed for phylogeny reconstruction, especially for genomes with high divergence or large size differences.

Thus by requiring users to create their own distance matrix, this prevents the likely error that ppl would use Mash to build phylogenies (and be overconfident in their accuracy). Thus not producing a PHYLIP distance matrix may prevent trouble in the future.

@lskatz
Copy link

lskatz commented Jul 25, 2016

I just looked at this ticket out of curiosity but It wasn't too hard to code. I think there is a bioperl way to do it too but I was too lazy to try it out. I implemented this in mashtree which is not yet fully validated and which produces trees (not phylogenies)! https://github.com/lskatz/mashtree. I would also chime in to encourage people not to consider this as a phylogenetic method.

There are some dependencies like File::Basename qw/basename/ and a logmsg subroutine (ie a subroutine that prints to stderr)

my @fastqExt=qw(.fastq.gz .fastq .fq .fq.gz);
my @fastaExt=qw(.fasta .fna .faa .mfa .fas .fa);

# 1. Read the mash distances
# 2. Create a phylip file
sub distancesToPhylip{
  my($distances,$outdir,$settings)=@_;

  my $phylip = "$outdir/distances.phylip";
  return $phylip if(-e $phylip);

  logmsg "Reading the distances file at $distances";
  open(MASHDIST,"<",$distances) or die "ERROR: could not open $distances for reading: $!";

  my $id="UNKNOWN"; # Default ID in case anything goes wrong
  my %m; #matrix for distances
  while(<MASHDIST>){
    chomp;
    if(/^#query\s+(.+)/){
      $id=_truncateFilename($1,$settings);
    } else {
      my @F=split(/\t/,$_);
      $F[0]=_truncateFilename($F[0],$settings);
      $m{$id}{$F[0]}=sprintf("%0.6f",$F[1]);
    }
  }
  close MASHDIST;

  # Create the phylip file.
  # Make the text first so that we can edit it a bit.
  # TODO I should probably make the matrix the bioperl way.
  logmsg "Creating the distance matrix file for fneighbor.";
  my %seenTruncName;
  my $phylipText="";
  my @genome=sort{$a cmp $b} keys(%m);
  for(my $i=0;$i<@genome;$i++){
    my $name=_truncateFilename($genome[$i],$settings);
    $phylipText.="$name  ";
    if($seenTruncName{$name}++){

    }
    for(my $j=0;$j<@genome;$j++){
      $phylipText.=$m{$genome[$i]}{$genome[$j]}."  ";
    }
    $phylipText.= "\n";
  }
  $phylipText=~s/  $//gm;

  # Make the phylip file.
  open(PHYLIP,">",$phylip) or die "ERROR: could not open $phylip for writing: $!";
  print PHYLIP "    ".scalar(@genome)."\n";
  print PHYLIP $phylipText;
  close PHYLIP;

  return $phylip;
}

# Removes fastq extension, removes directory name,
# truncates to a length, and adds right-padding.
sub _truncateFilename{
  my($file,$settings)=@_;
  my $name=basename($file,@fastqExt);
  $name=substr($name,0,$$settings{truncLength});
  $name.=" " x ($$settings{truncLength}-length($name));
  return $name;
}


@tseemann
Copy link
Contributor Author

@kloetzl Why would you want to stop people building trees with it? Trees are good relationship diagrams. Sure it's not 'phylo'-genetic but it's still genetic. In fact the first in-person demo I saw of mash in May 2015 by @aphillippy did exactly that! PHYLIP is just a convenient 'understood' format for exchanging distance data. The current matrix format is just being imported directly into R anyway and being used to draw trees. </RANT>

kloetzl added a commit to kloetzl/Mash that referenced this issue Oct 16, 2017
Add new 'matrix' command which compares all input sequences and outputs
a distance matrix. The current implementation is rough and ready but
potentially faster than iterative 'mash dist'.

fixes issue marbl#9.
@kloetzl kloetzl mentioned this issue Oct 26, 2017
@Amrithasuresh
Copy link

Amrithasuresh commented Feb 6, 2018

Hi,

I am using Mash for my interest proteins to draw a dendrogram for an alignment free method. I got the following output using mash dist -t

file1.fa 0.223052
file2.fa 0.255107
file3.fa 0.223052
file4.fa 0.255107
file5.fa 0.243822
file6.fa 0.212171

I have to use these outputs in PHYLIP package to get a dendrogram. Is that right? or Do I need to convert this file to different format. Please point me direction.

Thank you for your time.

@kloetzl
Copy link
Contributor

kloetzl commented Feb 7, 2018

The output is just one row or column in the matrix. You have to do all the other comparisons, too and then you can create a Phylip-style distance matrix and finally the dendrogram. For more details, how to get a distance matrix out of mash see https://github.com/lskatz/mashtree and #66 .

@tseemann
Copy link
Contributor Author

I guess I will close this as it doesn't seem you will provide a standards-compliant distance matrix output format.

The 10 character limit is legacy, most parsers support arbitrary lengths.

@kloetzl
Copy link
Contributor

kloetzl commented Mar 27, 2018

I guess I will close this as it doesn't seem you will provide a standards-compliant distance matrix output format.

You could use my fork until the PR gets merged. If you find it useful one could even patch mash in common package managers to make the new command available to everyone.

@ondovb
Copy link
Member

ondovb commented Sep 22, 2018

There is a 'triangle' command in latest, which outputs (relaxed) Phylip. Not really tested with tree tools yet; have a look!

@kloetzl
Copy link
Contributor

kloetzl commented Feb 13, 2019

If anyone needs it, here is a script to convert the triangle into a square.

mash triangle "$@" |
	awk 'NR == 1 {n=$1}
		function basename(file, a, n) {
		  n = split(file, a, "/")
		  return a[n]
		}
		NR > 1 {i=NR-1; names[i] = basename($1);
		  for (j=2; j <= NF; j++){
		    mat[i,j-1] = mat[j-1,i] = $j;
		  }
		  mat[i,i]=0.0;
		}
		END{i=1;
		  print n;
		  for (a in names){
		    printf names[a];
		    for(j=1; j<=n; j++)
		      printf "  %f", mat[i,j];
		    printf "\n";
		    i++
		  }
		}'

@ondovb
Copy link
Member

ondovb commented Mar 18, 2019

I think this can be closed for now. @kloetzl the square format also seems like a reasonable command line option if you'd like to submit a PR.

@ondovb ondovb closed this as completed Mar 18, 2019
@bede
Copy link

bede commented May 21, 2019

For anyone trying to load this matrix format into python/pandas:

EDIT: Do this instead

import pandas as pd
from scipy.spatial.distance import squareform

def mash_triangle_to_square(triangle_path):
    with open(triangle_path) as contents_fh:
        next(contents_fh)  # Skip record count
        values = []
        names = []
        for line in contents_fh:
            records = line.strip().split('\t')
            names.append(records[0])
            values.extend(records[1:])
    df = pd.DataFrame(squareform(values), index=names, columns=names)
    df.replace({'': 0.}, inplace=True)
    return df.astype(float)

@antunderwood
Copy link

antunderwood commented Jul 9, 2019

Hi @bede
#9 (comment)
I had actually just wrote virtually identical code but found that although this makes a square matrix it created the wrong matrix since it expects an upper triangle

I could not find an elegant way of doing it and ended up with

import numpy as np
import pandas as pd

def lower_triangle_to_full_natrix(filename):
    num_lines_in_file = sum(1 for line in open(filename))
    distances = []
    sample_names = []

    with open(filename) as f:
        next(f) # skip sample count line
        for line in f:
            elements = line.strip().split('\t')
            sample_names.append(elements[0])
            row = [float(e) for e in elements[1:]]
            row.extend([0.0] * (num_lines_in_file-1-len(row))
            distances.append(row)
        np_array = np.asarray(distances)
        index_upper = np.triu_indices(num_lines_in_file-1)
        np_array[index_upper] = np_array.T[index_upper]
        return pd.DataFrame(np_array, columns=sample_names, index=sample_names).to_csv('output.tsv')

@bede
Copy link

bede commented Jul 9, 2019

Thanks for sharing this Andy! I remember shelving the results of that function because I had doubts about them at the time. Thanks for the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants