Re-design of Scores and similarity classes #135

Closed
florian-huber opened this issue Aug 26, 2020 · 13 comments
Labels: package quality (Make package easy to install, use, adapt.), performance



florian-huber commented Aug 26, 2020

There was already a lengthy brainstorming discussion in #59, but that got quite long. So this issue is meant to pick it up.

Although it started with the simple idea of adding an is_symmetric option, it then boiled down to the fact that the current implementation has some redundancies, both in how scores can be calculated and in how they are implemented. I believe that this can make things unnecessarily complex for a user.
After all, a user should be mostly concerned with

  • picking the desired similarity measure (e.g. CosineGreedy)
  • specifying the necessary settings (e.g. tolerance=0.05)
  • maybe: is it an all-vs-all comparison (references == queries)?

And then run the computation with a simple-to-understand command, such as one of the ways matchms currently offers:

similarity_measure = CosineGreedy(tolerance=0.05)
scores = Scores(references, queries, similarity_measure).calculate()

florian-huber commented Aug 26, 2020

I worked on a new implementation in PR #133 but kept feeling that we might be making things a bit more complex than necessary.
When calculating similarity scores, after picking and specifying the type of score, there are essentially only two options:

  • You want to compare 2 given spectrums --> single similarity score. Here the input will be (SpectrumType, SpectrumType)
  • You want to compare > 2 spectrums --> array of similarity scores. That could be one-against-many or many-against-many, but in all cases it would be fine to ask the user to enter two lists/arrays (even if one of them only contains a single element). Hence the input will be (List[SpectrumType], List[SpectrumType])

Currently we offer several ways to compute this. Scores has a sequential and a parallel option: the first runs a double for-loop using the sequential implementation of a given similarity_measure, the second uses a (hopefully more optimized) parallel implementation of similarity_measure. But both methods accept input of the form (List[SpectrumType], List[SpectrumType]) and return the desired array of scores.

I think it would be better to split consistently between single score and array of scores. And I do not see any advantage for a user in being offered an additional naive implementation when a more optimized parallel implementation is in place.
I think that the OOP suggestion(s) made by @sverhoeven and @jspaaks in #59 would make it easy to move the naive implementation into the similarity function via inheritance:

from typing import List

import numpy

from matchms.typing import SpectrumType


class BaseSimilarityFunction:
    """Similarity function base class.
    When building a custom similarity measure, inherit from this class and implement
    the desired methods.
    """
    # Set key characteristics as class attributes
    is_commutative = True

    def compute_score(self, reference: SpectrumType, query: SpectrumType) -> float:
        """Required: Method to calculate the similarity for one input pair.

        Parameters
        ----------
        reference
            Single reference spectrum.
        query
            Single query spectrum.
        """
        raise NotImplementedError


    def compute_score_matrix(self, references: List[SpectrumType], queries: List[SpectrumType],
                             is_symmetric: bool = False) -> numpy.ndarray:
        """Optional: Provide optimized method to calculate an numpy.array of similarity scores
        for given reference and query spectrums. If no method is added here, the following naive
        implementation (i.e. a double for-loop) is used.

        Parameters
        ----------
        references
            List of reference objects
        queries
            List of query objects
        is_symmetric
            Set to True when *references* and *queries* are identical (as for instance for an all-vs-all
            comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about
            2x faster.
        """
        n_rows = len(references)
        n_cols = len(queries)
        scores = numpy.empty([n_rows, n_cols], dtype="object")
        for i_ref, reference in enumerate(references[:n_rows]):
            if is_symmetric and self.is_commutative:
                for i_query, query in enumerate(queries[i_ref:n_cols], start=i_ref):
                    scores[i_ref][i_query] = self.compute_score(reference, query)
                    scores[i_query][i_ref] = scores[i_ref][i_query]
            else:
                for i_query, query in enumerate(queries[:n_cols]):
                    scores[i_ref][i_query] = self.compute_score(reference, query)
        return scores
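
To illustrate, here is a minimal sketch of what a custom similarity measure could then look like (ParentMassMatch and its mass_tolerance parameter are only made-up examples, not existing matchms classes). Only compute_score needs to be implemented; the array version comes for free through the inherited compute_score_matrix:

class ParentMassMatch(BaseSimilarityFunction):
    """Toy example: score 1.0 if the parent masses match within a given tolerance, else 0.0."""

    def __init__(self, mass_tolerance: float = 0.1):
        self.mass_tolerance = mass_tolerance

    def compute_score(self, reference: SpectrumType, query: SpectrumType) -> float:
        # Compare the "parent_mass" metadata entries of both spectrums
        mass_difference = abs(reference.get("parent_mass") - query.get("parent_mass"))
        return float(mass_difference <= self.mass_tolerance)

# The inherited naive implementation provides the array version without extra code:
# scores = ParentMassMatch(mass_tolerance=0.1).compute_score_matrix(references, queries)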


florian-huber commented Aug 26, 2020

Scores would then become a bit simpler and only have one calculate method, such as:

class Scores:
...
    def calculate(self) -> Scores:
        """
        Calculate the similarity between all reference objects vs all query objects using
        the most suitable available implementation of the given similarity_function.
        """
        if self.n_rows == self.n_cols == 1:
            self._scores = self.similarity_function.compute_score(self.references, self.queries)
        else:
            self._scores = self.similarity_function.compute_score_matrix(self.references,
                                                                         self.queries,
                                                                         is_symmetric=self.is_symmetric)
        return self
   ...

We wouldn't need to do any checks on whether the method has a parallel implementation. It will always have one, either naive or more optimized.
And should we want to add further options in the future, we could still easily add a compute_scores_distributed function or the like.
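
For context, a rough sketch of how this would look from the user side (spectrum_a, spectrum_b, references and queries are placeholder inputs):

similarity_measure = CosineGreedy(tolerance=0.05)

# One reference vs one query: calculate() would dispatch to compute_score()
single_pair = Scores([spectrum_a], [spectrum_b], similarity_measure).calculate()

# Many vs many: calculate() would dispatch to compute_score_matrix()
all_vs_all = Scores(references, queries, similarity_measure).calculate()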

@florian-huber

Started prototyping/implementing this in PR #139


sverhoeven commented Aug 27, 2020

We could drop Scores(references, queries).calculate() and make use of similarity_function.compute_score_matrix() to create the Scores object, because the implementations of Scores.calculate and BaseSimilarityFunction.compute_score_matrix are the same (signature and implementation), and a similarity function that needs to override compute_score_matrix() can still do so. The Scores class would then need a different constructor that accepts a scores numpy array; Scores would then only make the numpy array easier to iterate/query. The API could look like:

from typing import List

import numpy

from matchms.typing import QueriesType, ReferencesType, SpectrumType


class BaseSimilarityFunction:
    """Similarity function base class.
    When building a custom similarity measure, inherit from this class and implement
    the desired methods.
    """
    # Set key characteristics as class attributes
    is_commutative = True

    def pair(self, reference: SpectrumType, query: SpectrumType) -> float:
        """Required: Method to calculate the similarity for one input pair.

        Parameters
        ----------
        reference
            Single reference spectrum.
        query
            Single query spectrum.
        """
        raise NotImplementedError


    def matrix(self, references: List[SpectrumType], queries: List[SpectrumType],
               is_symmetric: bool = False) -> numpy.ndarray:
        """Method to calculate a Scores object of similarity scores
        for given reference and query spectrums. If this method is not overridden in a subclass,
        a naive implementation (i.e. a double for-loop) is used.

        Parameters
        ----------
        references
            List of reference objects
        queries
            List of query objects
        is_symmetric
            Set to True when *references* and *queries* are identical (as for instance for an all-vs-all
            comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about
            2x faster.
        """
        n_rows = len(references)
        n_cols = len(queries)
        scores = numpy.empty([n_rows, n_cols], dtype="object")
        for i_ref, reference in enumerate(references[:n_rows]):
            if is_symmetric and self.is_commutative:
                for i_query, query in enumerate(queries[i_ref:n_cols], start=i_ref):
                    scores[i_ref][i_query] = self.pair(reference, query)
                    scores[i_query][i_ref] = scores[i_ref][i_query]
            else:
                for i_query, query in enumerate(queries[:n_cols]):
                    scores[i_ref][i_query] = self.pair(reference, query)
        return Scores(references, queries, scores)

class Scores:
    def __init__(self, references: ReferencesType, queries: QueriesType, scores: numpy.ndarray):
        self.n_rows = len(references)
        self.n_cols = len(queries)
        self.references = numpy.asarray(references).reshape(self.n_rows, 1)
        self.queries = numpy.asarray(queries).reshape(1, self.n_cols)
        self._scores = scores
        self._index = 0

    def __iter__(self):
        ...

    def __next__(self):
        ...

scores = CosineGreedy(tolerance=0.05).matrix(references, queries)

If you want to keep calculate_scores(), use:

def calculate_scores(references: ReferencesType, queries: QueriesType, sim: BaseSimilarityFunction):
    return sim.matrix(references, queries)

scores = calculate_scores(references, queries, CosineGreedy(tolerance=0.05))


florian-huber commented Aug 27, 2020

I am a bit undecided here.
On the one hand, I like having one option less (Scores.calculate()).
On the other hand, if the similarity function returns a Scores instance instead of a numpy array, it would become difficult to access the underlying method without using Scores, say for developing some other functions/implementations etc.


florian-huber commented Aug 27, 2020

Second thought on it: I think I would prefer not adding Scores to the similarity functions.
That would make them less generic. For spec2vec, for instance, I would then have to introduce Scores. Technically that's no problem of course, but it feels a bit unnecessary to force using it on anybody who wants to design their own similarity function.

Maybe it would be good enough to make it Scores._calculate() to indicate that this is not the intended way to run this?
(Plus also adding to the docstring that this is called from calculate_similarities())


sverhoeven commented Aug 27, 2020

Ok, agree that having a numpy array as the return value is beneficial, but in the base class we could have two calculate methods: one that returns a numpy array and one that returns Scores. With this new setup you would need the spec2vec class to be a sub-class of the base class. That way you get the Scores-returning calculate method for free by sub-classing; you just override the pair() and matrix() methods in the spec2vec class. Or don't you want the spec2vec class to be a sub-class of BaseSimilarityFunction?

Yep, I like making Scores._calculate() private; that will reduce the public API and the confusion about which calculate to use.

@florian-huber

I agree here, I think we should make Spec2Vec a sub-class of BaseSimilarityFunction.
(which also makes one of my arguments against including Scores obsolete...)

It seems to come down to two options.

  1. We remove Scores.calculate() and let the similarity functions have methods that either return a Scores object or a numpy array. But how would that look? We would then need 4 methods, right? pair(), pair_numpy(), matrix() and matrix_numpy() (with better naming, I guess).
  2. We change Scores.calculate() to Scores._calculate() and keep the similarity function as outlined above (always returning a numpy array).

What for me still speaks for (2) is that we can keep all the logic for choosing the right function to call in Scores.

  • the similarity measure provides: .pair() -> float and .matrix() -> np.ndarray (either naive or parallel/optimized); in the future there could be additional methods such as .distributed() -> np.ndarray
  • Scores.calculate() contains the logic to pick the right implementation. Above is only the simplest version, which will only check whether the input is a single pair or contains more elements. But I imagine that could become more complicated in the future. For instance, we could at some point decide to add checks that compare id() or the like of all references and queries to decide which scores can be skipped. We could also decide to add options for only storing the most important scores or only scores above a threshold etc. (related to Scores.scores as a sparse matrix #60, Predefined top=n argument on Scores.calculate() could save memory #62)
  • calculate_scores() would remain a simple wrapper function (see the sketch below).
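
Roughly, option (2) could then look like the following sketch (the constructor signature and the internal handling of references and queries are just assumptions here):

class Scores:
    ...
    def _calculate(self) -> "Scores":
        """Pick the most suitable implementation of the given similarity function."""
        if self.n_rows == self.n_cols == 1:
            self._scores = self.similarity_function.pair(self.references[0], self.queries[0])
        else:
            self._scores = self.similarity_function.matrix(self.references,
                                                           self.queries,
                                                           is_symmetric=self.is_symmetric)
        return self
    ...

def calculate_scores(references, queries, similarity_function, is_symmetric: bool = False):
    """Simple wrapper that keeps the method-picking logic hidden inside Scores."""
    return Scores(references, queries, similarity_function, is_symmetric=is_symmetric)._calculate()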


florian-huber commented Aug 31, 2020

Mhhh, pylint does not like the _calculate() way...
pylint: protected-access / Access to a protected member _calculate of a client class
Of course that makes sense, because we call it from calculate_scores (and also from some tests, but that I would say is no real issue).


florian-huber commented Aug 31, 2020

@sverhoeven I now implemented one version to see how it goes in PR #139.

  • In general it worked OK, I found, so I could get rid of all xyzParallel methods, calculate_parallel, and Scores.calculate_parallel().
  • With pylint complaining I kept Scores.calculate(). We could of course still decide to remove it and add the Scores to the similarity function as you suggested above. In that case the "logic" of which method to call could be moved to calculate_scores.
  • A bit awkward was that, because the similarity functions now all inherit from BaseSimilarityFunction, they all need to share the same parameters. While that is of course fine for references, queries, I guess that people developing their own score might get caught by surprise that they also have to include an is_symmetric parameter in the .matrix(...) method, even though they won't use it (see the sketch below). But I could live with that...
  • Another minor issue is that the .matrix() method only appears in the documentation when it is implemented, not when it is inherited from BaseSimilarityFunction.
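
To illustrate the third point, a hypothetical custom similarity (MyCustomScore is just a made-up example) would have to keep the is_symmetric argument in its .matrix() signature even when it is not used:

class MyCustomScore(BaseSimilarityFunction):

    def pair(self, reference: SpectrumType, query: SpectrumType) -> float:
        return 1.0  # placeholder scoring logic

    def matrix(self, references: List[SpectrumType], queries: List[SpectrumType],
               is_symmetric: bool = False) -> numpy.ndarray:
        # is_symmetric is ignored here, but has to stay in the signature
        return numpy.array([[self.pair(reference, query) for query in queries]
                            for reference in references])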

@sverhoeven

2nd point: Maybe we can mark Scores.calculate() as deprecated (using something like https://pypi.org/project/deprecation/) so in the next major release we can drop it.
3rd point: Don't think that is a big problem either. When they want to implement the optional matrix() it will be more work, but pair() is the function they need to implement to make the class usable, and that one is simple.
4th point: We can maybe add 'inherited-members': True to readthedocs/conf.py:autodoc_default_options to show inherited methods (see sketch below).
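
A rough sketch of what that change in readthedocs/conf.py could look like (the 'members' entry is only an example of another existing option):

autodoc_default_options = {
    'members': True,
    'inherited-members': True,  # also document methods inherited from BaseSimilarityFunction
}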


florian-huber commented Sep 7, 2020

  • Added 'inherited-members': True to sphinx config.
  • Added deprecated decorator to Scores.calculate() (roughly as sketched below).
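
A sketch of how such a decorator could look, using the deprecation package linked above (version numbers and the details message are placeholders):

import deprecation

from matchms import __version__  # assuming the package exposes __version__


class Scores:
    ...
    @deprecation.deprecated(deprecated_in="0.6.0", removed_in="1.0.0",
                            current_version=__version__,
                            details="Use calculate_scores() instead.")
    def calculate(self) -> "Scores":
        ...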

florian-huber added the package quality label on Sep 10, 2020
@florian-huber

This has been addressed (and is now being worked on further in #153).
