Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data type checking - And a few other things #51

Open
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

HeftyCoder
Copy link

@HeftyCoder HeftyCoder commented Feb 3, 2024

This PR includes type checking with the Data class when connecting nodes as well as when passing data between ports. It also includes some QoL changes. Specifically...

Data and Type Checking

Due to various changes, a new folder / module named data has been introduced.

Type checking is done via a simple function defined in Data.py:

def check_valid_data(out_data_type: Type[Data], inp_data_type: Type[Data]) -> bool:
    """
    Returns true if input data can accept the output data, otherwise false
    
    None type is treated as the default Data type
    """
    
    if inp_data_type is None:
        inp_data_type = Data
    if out_data_type is None:
        out_data_type = Data
    
    return issubclass(out_data_type, inp_data_type)

Another function is defined in NodePort.py for connection validity. It uses a new enum type defined in RC.py:

Code
def check_valid_conn(out: NodeOutput, inp: NodeInput) -> Tuple[ConnValidType, str]:
    """
    Checks if a connection is valid between two node ports.

    Returns:
        A tuple with the result of the check and a detailed reason, if it exists.
    """
    
    if out.node == inp.node:
        return (ConnValidType.SAME_NODE, "Ports from the same node cannot be connected!")
    
    if out.io_pos == inp.io_pos:
        return (ConnValidType.SAME_IO, "Connections cannot be made between ports of the same pos (inp-inp) or (out-out)")
    
    if out.io_pos != PortObjPos.OUTPUT:
        return (ConnValidType.IO_MISSMATCH, f"Output io_pos should be {PortObjPos.OUTPUT} but instead is {out.io_pos}")
    
    if out.type_ != inp.type_:
        return (ConnValidType.DIFF_ALG_TYPE, "Input and output must both be either exec ports or data ports")
    
    if not check_valid_data(out.allowed_data, inp.allowed_data):
        return (ConnValidType.DATA_MISSMATCH, 
                f"When input type is defined, output type must be a (sub)class of input type\n [out={out.allowed_data}, inp={inp.allowed_data}]")
    
    return (ConnValidType.VALID, "Connection is valid!")

As shown above, ports can be given an allowed_data in the constructor, which is serialized along with the port. Finally, setting an output value asserts for the correct type like this:

out = self.outputs[index]
data_type = (out.allowed_data 
                     if out.allowed_data and issubclass(out.allowed_data, Data)
                     else Data) 
        
assert isinstance(data, data_type), f"Output value must be of type {data_type.__module__}.{data_type.__name__}"

Along with these, the PR includes a _BuiltInData type in Data.py for defining various built-in types associated with payload types of the python standard library. Types from a specific parent class can be searched using get_built_in_data_types defined in core.py. The built-in types are automatically registered when a Session is created, with an identifier of data.built_in.{class_name}. The built-int types are:

  • Core types defined in core.py, PayloadData, StringData, BytesData
  • Abstract collection types in the data.built_in.collection.abc sub-module, one-to-one related with the standard libraries abc collections
  • Implemented collection types in data.built_in.collection.data_types sub-module, namely ListData, TupleData, DictData, OrderedDictData, SetData, FrozenSetData, QueueData
  • Number types defined in data.built_in.numbers sub-module, one-to-one related with the numbers module in the standard library. These are NumberData, ComplexData, RealData, RationalData, IntegerData.

The types are made to assert that a payload's type must be a sub-type of a base class. All the classes except the numbers assert that the payload type is correct. The number classes attempt to cast the payload to a correct default type if its type is incorrect.

While making this functionality, I noticed that some QoL could be made to improve setting inputs and getting outputs. payload_to_data dict, data_to_payload dict, register_payload_to_data-register_payload_to_data_multi functions are provided in Data.py for the developer to associate payload types to a certain data type. With them, a virtual class-method named register_payload_type(cls) is defined in the Data class and called when registering the class in a Session for inheritance-based payload to data association. All the above built-in types register themselves, either when imported or using the inheritance method.

i.e

register_payload_to_data_multi(
    {
        list: ListData,
        tuple: TupleData,
        dict: DictData,
        OrderedDict: OrderedDictData,
        set: SetData,
        frozenset: FrozenSetData,
        deque: QueueData,
    }
)

Using the above, these functions have been implemented in the Node class.

Code
def set_output_payload(self, index: int, payload):
        """
        Wrapper of :code:`set_output_val()` that creates the data type from the payload type.
        
        Data and payload types must be associated with register_payload_to_data found in Data module.
        
        This function assumes the derived Data type's constructor works at least with a payload argument.
        Defaults to the base Data type if an association isn't found 
        """
        
        data_type = payload_to_data.get(type(payload))
        data = data_type(payload) if data_type else Data(payload)
        
        self.set_output_val(index, data)

def input_payload(self, index: int):
        """
        Returns the payload residing at the data input of given index.

        Do not call on exec inputs
        """
        
        data = self.input(index)
        
        return data.payload if data else None

The above functions 'skip' over the need of manually instantiating a data class, provided that dataclass works with a constructor that only needs a payload.

(At this point, I'd like to say that the input() function might be needing a rename, since it can be confused with the inputs list, despite the fact that the first returns a Data instance and the second stores the ports)

Much of the functionality has been tested in a third DataTypesBuiltIn test defined in data_objects.py

Quality of Life

Some other improvements include additional events in Node:

  • updated: invoked when the update() function exits
  • output_updated: invoked when an output is updated
  • progress_updated: invoked when setting the progress in a node

Regarding the third point, a ProgressState class has been added in RC.py for defining the progress inside a node, for example when using threads or multi-processes. It's a simple class and the associated functions are defined in the PROGRESS section of Node.

Lastly, the NodePortType classes were made into dataclasses.

These are (hopefully :D) all the important changes and additions I've made in this PR.

Allows multiple outputs to be connected to the same input
- Implemented Type Checking for Data types, not payload
- Implemented a PayloadData for checking if the payload is the correct type for this Data
- Implemented various built-in Data types; number module data types, collections module data types, string, bytes
- Made node blue prints into dataclasses
- Added an updated and an output_changed event in the Node class
- Improved Connection validity test. Returns an enum and a message
- Added a simple test in data_objects.py to check for type checking and connection validity
- Ensured that the built_in data types have data.built_in.{node_class_name} identifiers
Added a new class indicating some kind of progress. Utilized the class in Node to provide an API for setting a node's update progress (in case of threading etc inside the node)
- Implemented a way to associate payload to data types
- Added more tests
-
- Added some additional ConnValidTypes and two new functionsin Flow for connection checking
Fixed bug where out_data_type or inp_data_type were None
Flow.can_nodes_connect() now returns a ConnValidType.INPUT_TAKEN if input is connected to another output. This is needed because graph_adj_rev only handles a single inp -> output
Fixed the order in which the various ConnValidTypes are checkd in Flow.can_nodes_connect()
@leon-thomm
Copy link
Owner

leon-thomm commented Feb 10, 2024

Hey, good work! Thanks for writing this up so clearly. Some initial notes:

  • I think it should be inp_data_type for both here
        if inp_data_type is None or out_data_type is Data:
            return True
  • I don't think we need to return strings in check_connection_validity(), can_nodes_connect() and check_valid_conn(), they don't seem to contain information that's not already indicated by the enum. in a way, that's what enums are for :)
  • I think we should not introduce events updated and output_updated. Firstly, the way this is currently implemented doesn't respect the in-order of activations, which is confusing (the other events are very carefully emitted because these inconsistencies happen quickly). Secondly, this will induce a significant performance hit, that is not justified if nothing really depends on it. If someone needs it, they can easily build it into their nodes manually.
  • Regarding Node.input/Node.inputs confusion: I think one should not directly access inputs or outputs other than for testing. How about renaming inputs to _inputs, and outputs to _outputs?
  • Is there any reason the node progress cannot be implemented in nodes that need this? If not, I don't see why this should be a general thing in ryvencore. We do not provide anything for multi-threading, and it should be either full support, or no support. The rest is confusion.
  • I think we might remove the "data." from the built-in Data identifiers, in the interest of simplicity. There is no convention defined on this.
  • I'm not yet sure about the payload_to_data() thing. This again sounds like trying to generally fix data compatibility between arbitrary nodes. As I noted before, I don't think we should attempt to do this, and rather force the developer to be explicit on Data stuff. I'll think.

@HeftyCoder
Copy link
Author

HeftyCoder commented Feb 10, 2024

Hello!

I think it should be inp_data_type for both here

I completely forgot to update the post. I noticed that the if check isn't good, and changed it to this (updated in original post as well):

def check_valid_data(out_data_type: Type[Data], inp_data_type: Type[Data]) -> bool:
    """
    Returns true if input data can accept the output data, otherwise false
    
    None type is treated as the default Data type
    """
    
    if inp_data_type is None:
        inp_data_type = Data
    if out_data_type is None:
        out_data_type = Data
    
    return issubclass(out_data_type, inp_data_type)

I don't think we need to return strings in check_connection_validity(), can_nodes_connect() and check_valid_conn()

The messages were supposed to be like "compile" error messages to be honest. I'll make it return only the enum. Do you want me to include the messages as a function somewhere or remove them completely? Included a get_error_message for getting a detailed error message on validity checks.

I think we should not introduce events updated and output_updated

The updated event isn't really needed, but the output_updated event might be needed, at least in the context of Ryven. This is the only way to activate the dots animation we introduced in Ryven. ~~I can remove the updated~~Removed the updated, but please reconsider the output_updated if I'm not missing something.

Regarding Node.input/Node.inputs confusion

Sure, I can change that in this PR if you want to. Went ahead and changed this already.

Is there any reason the node progress cannot be implemented in nodes that need this?

Similar to output_updated, providing a basic progression implementation in ryvencore can help Ryven or any other solution that uses ryvencore. Furthermore, ryvencore may not provide any multi-threading utilities, but node implementations can. As an example, I created a test in my custom Ryven version where I've implemented UI for the progress utility.

progress_state.mp4

This particular example doesn't apply to the original Ryven due to many changes in the application (the node logic runs on a separate thread). However, I can definitely see it being useful in Ryven where a node has to load big data, i.e. from a CSV file.

I think we might remove the "data."

Will do.

This again sounds like trying to generally fix data compatibility between arbitrary nodes.

I actually wasn't trying to fix this. I implemented the input_payload() function to avoid None if checks when using the original input(), i.e

payload = self.input_payload(1)

# as opposed to

data = self.input(1)
if data is not None:
    payload = data.payload

The set_output_payload and register_payload_to_data were simply implemented to complement the above function. I can see some confusion coming from this, especially when someone tries to register the same payload type in his own node package, essentially overriding the original association. That could be a good thing though, depending on how you view it.

I think the input_payload should stay unless I'm missing something. About the set_output_payload, I'm 50 / 50 on this. Do think about it.

- Removed 'data.' from built-in types identifiers.
- All connection validation returns a ConnValidType instead of (ConnValidType, str). The messages were moved to a `get_error_message` function in `NodePort.py`
- Removed the added `updated` event in the nodes
- Renamed `Node.outputs` and Node.inputs` to `Node._outputs` and `Node._inputs`
- Utilized typing.TYPE_CHECKING to add type hinting to (almost?) all the files that couldn't import the type due to circular imports.
@HeftyCoder
Copy link
Author

The last pushes have the following changes:

  • Removed 'data.' from built-in types identifiers.
  • All connection validation / check functions return a ConnValidType instead of (ConnValidType, str). The messages were moved to a get_error_message function in NodePort.py
  • Removed the added Node.updated event in the nodes
  • Renamed Node.outputs and Node.inputs to Node._outputs and Node._inputs
  • Utilized typing.TYPE_CHECKING to add type hinting to (almost?) all the files that couldn't import the type due to circular imports (Node in NodePort, Session in Flow, [Flow, Session] in Node, AddOn in Session)

I'll be waiting to further discuss the output_updated, the ProgressState and the payload stuff! [I do think the first two are important and leave little room for discussion, but I could be wrong]

@leon-thomm
Copy link
Owner

Included a get_error_message for getting a detailed error message on validity checks.

might make it a classmethod of ConnValidType?

This is the only way to activate the dots animation we introduced in Ryven.

fair, but that's no reason to ruin ryvencore performance, but rather to remove the feature from Ryven. Ryven should mostly adapt to ryvencore, not the other way around.1

ryvencore may not provide any multi-threading utilities, but node implementations can

...just as they can implement progress bars? ryvencore aims to only provide the basics and a modular architecture to build everything else on top (hence the name); I feel this is a little too specific and the previous argument still stands.

As an example, I created a test in my custom Ryven version where I've implemented UI for the progress utility.
progress_state.mp4

IMO this is a perfect example of something that would add a lot of value in some sort of community-driven node feature ecosystem for Ryven, no? (which doesn't exist yet)

The set_output_payload and register_payload_to_data were simply implemented to complement the above function. I can see some confusion coming from this, especially when someone tries to register the same payload type in his own node package, essentially overriding the original association.

yes, I can spin this thought to ridiculous scenarios where the behavior of a data object essentially depends on things that neither the developer, nor the user actively control due to conflicts.

The input_payload() method makes sense to me regardless, but I think we might avoid implicit conversions as done through register_payload_to_data() and set_output_payload().

Footnotes

  1. Note this is essentially held back by ryvencore events adding a far-from-zero runtime cost. This could be fixed if there was a much more efficient (non-python?) implementation of events in ryvencore, which I would like to add eventually.

@HeftyCoder
Copy link
Author

HeftyCoder commented Feb 20, 2024

might make it a classmethod of ConnValidType?

Sure!

The input_payload() method makes sense to me regardless, but I think we might avoid implicit conversions as done through register_payload_to_data() and set_output_payload().

I'll remove the implicit conversions, I too think it was a bit too much, especially for a base version of a node system that implements its own data system.

As per the output_updated and the ProgressState, the primary concern seems to be that the Event implementation is slow. I just saw that slots are created at the event's instantiation and looped over when emitting, despite there being a chance of having no callbacks at all. Therefore, I do not think the problem is python-oriented, but rather implementation-oriented.

An easy solution can be to simply check the _slot_priorities dictionary. The way I see it implemented, it's empty when there are no callbacks. We can have something like this:

def emit(self, *args):

    if len(self._slot_priorities) == 0:
        return
        
    # notice that dicts are insertion ordered since python 3.6
    for nice, cb_set in self._slots.items():
        for cb in cb_set:
            cb(*args)

This however doesn't solve the issue of having instantiated unused sets. A better solution would simply be to instantiate the set when it's needed and reorder the dictionary:

class Event:

    def __init__(self, *args):
        self.args = args
        self._slots: Dict[int, Set] = {}
        self._slot_priorities = {}

    def sub(self, callback, nice=0):
        
        # this can be commented for indefinite number of -nice- callbacks
        assert -5 <= nice <= 10
        assert self._slot_priorities.get(callback) is None

        if self._slots.get(nice) is None:
            self._slots[nice] = set()
            self._slots = dict(sorted(self._slots.items()))
            
        self._slots[nice].add(callback)
        self._slot_priorities[callback] = nice

    def unsub(self, callback):

        nice = self._slot_priorities[callback]
        cb_set = self._slots[nice]
        cb_set.remove(callback)
        del self._slot_priorities[callback]
        
        # no need to sort when deleting
        if len(cb_set) == 0:
            del self._slots[nice]

    def emit(self, *args):

        # notice that dicts are insertion ordered since python 3.6
        for nice, cb_set in self._slots.items():
            for cb in cb_set:
                cb(*args)

We can even say that the assert -5 <= nice <= 10 isn't really needed anymore. It could be anything, but I understand if you want to limit it.

Per the Event class' API, there is no need for early instantiation of 16 sets that will cause the whole system to slowdown when emitting an event. This should provide a good speed-up in ryvencore. If you find that sorting the dictionary using standard python might impact performance, we could use sorted containers.

I'll argue again that an output_updated and a progress API should be part of a core node system; perhaps even more so for the progress and less for the output_updated. Since nodes are primarily used in conjunction with a GUI, allowing a developer to define progress states out of the box is a good solution. Otherwise, they'd have to create their own implementation or resort to printing any states in a console or a text area.

...just as they can implement progress bars?

Even if ryvencore doesn't provide a progress api out of the box, Ryven could've a progress bar implementation. That might justify leaving the progress API out of ryvencore, but still being able to access a GUI implementation in Ryven. Still, I think at least the ProgressState class could remain in ryvencore, even if the root node doesn't use it.

Could you perhaps see a base node, i.e ProgressNode that implements the progress implementation, if you think the root class must definitely not have it?

Removed `register_payload_to_data` and `set_output_payload`
@HeftyCoder
Copy link
Author

As an update to the above, here's a refined version of Event that doesn't create a new dictionary by sorting the items nor uses an external library. It utilizes insort from bisect and an extra list to keep track of the added sets and the order.

class Event:

    def __init__(self, *args):
        self.args = args
        self._slots: Dict[int, Set] = {}
        self._ordered_slot_pos = []
        self._slot_priorities = {}

    def sub(self, callback, nice=0):
        assert -5 <= nice <= 10
        assert self._slot_priorities.get(callback) is None

        cb_set = self._slots.get(nice)
        if cb_set is None:
            cb_set = self._slots[nice] = set()
            insort(self._ordered_slot_pos, nice)
            
        cb_set.add(callback)
        self._slot_priorities[callback] = nice

    def unsub(self, callback):
        nice = self._slot_priorities[callback]
        cb_set = self._slots[nice]
        cb_set.remove(callback)
        del self._slot_priorities[callback]

        if len(cb_set) == 0:
            del self._slots[nice]
            self._ordered_slot_pos.remove(nice)

    def emit(self, *args):

        # notice we're using the sorted list to run through the events
        for nice in self._ordered_slot_pos:
            for cb in self._slots[nice]:
                cb(*args)

Fixed a left-over Nodes.outputs to Nodes._outputs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants