Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter a VertexClustering object based on cluster sizes #507

Open
ntamas opened this issue Feb 7, 2022 · 8 comments
Open

Filter a VertexClustering object based on cluster sizes #507

ntamas opened this issue Feb 7, 2022 · 8 comments
Labels
good first issue Candidates for first-time contributors who are already familiar with C wishlist Feature request that has not been chosen for implementation yet; vote or comment to prioritize it!

Comments

@ntamas
Copy link
Member

ntamas commented Feb 7, 2022

What is the feature or improvement you would like to see?

It is a common requirement when clustering a large graph to plot only the larger communities but not the smaller ones, as illustrated by this SO question.

Maybe it would be even more flexible if we allowed the user to pass an upper and a lower size limit, or an aribtrary filtering predicate based on the sizes.

I've posted a quick sketch of a possible implementation in the SO question.

Use cases for the feature

See above; the SO question provides a possible use-case.

References

N/A

@ntamas ntamas added the good first issue Candidates for first-time contributors who are already familiar with C label Feb 7, 2022
@stale
Copy link

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 16, 2022
@szhorvat szhorvat added the wishlist Feature request that has not been chosen for implementation yet; vote or comment to prioritize it! label Apr 16, 2022
@stale stale bot removed the stale label Apr 16, 2022
@5anperez
Copy link

5anperez commented Dec 6, 2022

Hello! I would like to contribute to igraph and noticed that this is labeled as a good first issue, which I am hunting for, since it will be my very first contribution. Is this still open and if so, is everything I need to get starrted in the wiki?

@ntamas
Copy link
Member Author

ntamas commented Dec 6, 2022

Yes, the issue is still open and feel free to get started on it. There's no wiki page but the README contains some instructions for getting started with development (after checking out the repo and all its submodules); see here.

I'd suggest that you open a PR with your draft as soon as possible (and mark it as draft) and commit often so we can give you early feedback if needed.

As for this specific task, you can use my SO response as a starting point, but it needs to be made a proper method of the VertexClustering class. I'd probably add a new filter_by_size() method to the VertexClustering class.

@szhorvat
Copy link
Member

szhorvat commented Dec 6, 2022

Wouldn't it make sense to try to design some sort of generalized filtering mechanism instead of a one-trick method that only handles sizes?

What if:

  • I want to filter by total vertex weight or total edge weight inside of a partition.
  • I want to select the partition containing a specific vertex.
  • I want to select partitions that contain vertices with a specific property, stored in some attribute. For example, select partitions that have a vertex whose name starts with V, representing areas in the visual cortex within a brain network.
  • I want to filter an overlapping clustering represented using VertexCover, not VertexClustering

Generally speaking, the area where most open source projects don't do so well is having a coherent, predictable, well thought out design. If you work with a system like Mathematica often enough, this becomes painfully obvious. I would like igraph to stand out from the crowd by being better in this area.

@ntamas
Copy link
Member Author

ntamas commented Dec 6, 2022

Wouldn't it make sense to try to design some sort of generalized filtering mechanism instead of a one-trick method that only handles sizes?

I was considering this, but I don't want to overstretch the scope of this issue. It's always possible to add such a filtering method later and then rewrite filter_by_size() based on this. One advantage of adding a separate method for size (which is the most common filtering that people want to use) is that you don't actually need to construct the subgraph.

That being said, I can imagine adding a filter() method to VertexClustering that takes a Boolean predicate. The predicate would be a Python callable that takes the list of vertices in the cluster and returns True/False, but in that case it would be the job of the predicate to construct the subgraph on-the-fly if needed.

@szhorvat
Copy link
Member

szhorvat commented Dec 6, 2022

Alright, that makes sense. But it's still a useful exercise to think about how a generalization could be formulated and then implemented. Creating the subgraph seems like a heavyweight operation. Would it make sense to add a helper function to the C core which can retrieve the IDs of edges contained within a subgraph induced by some vertices? Then it will become relatively easy to compute things like:

  • Total number of edges in community
  • Total edge weight in community
  • Any statistic (e.g. variance) of edge weights within community

More generally, it might make sense for a filtering method to have access to:

  1. Vertices and their attributes
  2. Edges and their attributes

The first is trivial, as the clusters are generally understood in terms of their vertices and API provides access to these through .as_cover(). There does not seem to be a simple way for the second. Am I missing something? If not, do you agree that it makes sense to add the support function to the C core, or at least discuss these ideas further somewhere else (forum?)

@ntamas
Copy link
Member Author

ntamas commented Dec 20, 2022

Would it make sense to add a helper function to the C core which can retrieve the IDs of edges contained within a subgraph induced by some vertices?

Sounds like a good idea.

More generally, it might make sense for a filtering method to have access to:

I think that all that the filtering function needs access to is the graph object itself and the members of the current cluster being considered. The graph might not even be needed as you can always add it using functools.partial, e.g.:

def some_predicate(graph, vertices):
    # do something and return True/False here

clusters.filter(partial(some_predicate, graph))

This would allow .filter() to be implemented in the Clustering base class where we don't already have a graph.

But again, I still maintain that this is outside the scope of the current issue, which only asks for filtering based on cluster sizes, so in order not to derail the current discussion, I'd recommend opening a separate issue for a more generic filtering method.

@palash018
Copy link

Is anyone still working on this issue currently?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Candidates for first-time contributors who are already familiar with C wishlist Feature request that has not been chosen for implementation yet; vote or comment to prioritize it!
Projects
None yet
Development

No branches or pull requests

4 participants