Add support for IP Address and MAC Address data #18767

Closed
TomAugspurger opened this issue Dec 13, 2017 · 13 comments
Labels: API Design; Dtype Conversions (unexpected or buggy dtype conversions); Internals (related to non-user accessible pandas implementation)

@TomAugspurger
Contributor

TomAugspurger commented Dec 13, 2017

Hi all, this is a proposal to add a new block and type for representing IP Addresses.
There are still some details that need ironing out, but I wanted to gauge reactions to
including this in pandas before spending too much more time on it.

Here's a notebook demonstrating the basics: http://nbviewer.jupyter.org/gist/TomAugspurger/3ba2bc273edfec809b61b5030fd278b9

Abstract

Proposal to add support for storing and operating on IP Address data.
Adds a new block type for ip address data and an ip accessor to
Series and Index.

Rationale

For some communities, IP and MAC addresses are a common data format. The data
format was deemed important enough to add the ipaddress module to the standard
library (see PEP 3144). At Anaconda, we hear from customers who would use a
first-class IP address array container if it existed in pandas.

I turned to StackOverflow to gauge interest in this topic. A search for "IP" on
the pandas StackOverflow tag turns up 300 results.
Under the NumPy tag there are another 80. For comparison, I ran a few other
searches to see what interest there is in other "specialized" data types (this
is a very rough, probably incorrect, way of estimating interest):

term       results
financial  251
geo        120
ip         300
logs       590

Categorical, which is already in pandas, turned up 1,089 items.

Overall, I think there's enough interest relative to the implementation /
maintenance burden to warrant adding the support for IP Addresses. I don't
anticipate this causing any issues for the arrow transition, once ARROW-1587 is
in place. We can be careful which parts of the storage layer are implementation
details.

Specification

The proposal is to add

  1. A type and container for IPAddress and MACAddress (similar to
    CategoricalDtype and Categorical).
  2. A block for IPAddress and MACAddress (similar to CategoricalBlock).
  3. A new accessor for Series and Indexes, .ip, for operating on IP
    addresses and MAC addresses (similar to .cat).

The type and block should be generic IP address blocks, with no
distinction between IPv4 and IPv6 addresses. In our experience, it's
common to work with data from multiple sources, some of which may be
IPv4, and some of which may be IPv6. This also matches the semantics
of the default ipaddress.ip_address factory function, which returns
an IPv4Address or IPv6Address as needed. Being able to deal with
ip addresses in an IPv4 vs. IPv6 agnostic fashion is useful.
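
As a small illustration of that version-agnostic behaviour, here is a sketch that uses only the standard library's ipaddress module. The sample addresses are made up for the example and are not taken from the notebook.

```python
import ipaddress

# Mixed IPv4 / IPv6 input, as might arrive from several data sources.
raw = ["192.0.2.1", "2001:db8::1", "10.0.0.7"]

# ip_address() returns an IPv4Address or an IPv6Address depending on the input.
parsed = [ipaddress.ip_address(value) for value in raw]
versions = [addr.version for addr in parsed]

for addr in parsed:
    print(type(addr).__name__, addr)
```

The caller never has to say which version a given value is; a generic container would aim for the same property.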

Data Layout

Since IPv6 addresses are 128 bits, they do not fit into a standard NumPy uint64
space. This complicates the implementation (but, gives weight to accepting the
proposal, since doing this on your own can be tricky).

Each record will be composed of two uint64s. The first element
contains the first 64 bits, and the second element contains the second 64
bits. As a NumPy structured dtype, that's

base = np.dtype([('lo', '>u8'), ('hi', '>u8')])
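
To make the two-field layout concrete, here is a rough standard-library sketch of splitting a 128-bit address into two big-endian 64-bit halves. The helper names (split_u128, join_u128) are hypothetical, and writing the most significant half first is an assumption of this sketch rather than a settled part of the design.

```python
import ipaddress
import struct
from typing import Tuple

MASK64 = (1 << 64) - 1


def split_u128(value: int) -> Tuple[int, int]:
    """Split a 128-bit integer into (high 64 bits, low 64 bits)."""
    return value >> 64, value & MASK64


def join_u128(hi: int, lo: int) -> int:
    """Inverse of split_u128."""
    return (hi << 64) | lo


addr = ipaddress.IPv6Address("2001:db8::1")
hi, lo = split_u128(int(addr))

# Two big-endian unsigned 64-bit fields ('>u8' in NumPy terms): 16 bytes per record.
record = struct.pack(">QQ", hi, lo)

# With the high half first, the record equals the address's packed bytes.
print(record == addr.packed)
```

Because both fields are big-endian, the 16 bytes of a record compare in the same order as the addresses themselves when the high half comes first.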

This is a common format for handling IPv4 and IPv6 data:

Hybrid dual-stack IPv6/IPv4 implementations recognize a special class of
addresses, the IPv4-mapped IPv6 addresses. These addresses consist of an
80-bit prefix of zeros, the next 16 bits are one, and the remaining,
least-significant 32 bits contain the IPv4 address.

From here
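
The quoted bit layout can be checked with the standard library. This is only an illustration of the IPv4-mapped form; whether the container would store IPv4 addresses this way is not settled by the text above, and the sample address is arbitrary.

```python
import ipaddress

ipv4 = ipaddress.IPv4Address("192.0.2.1")

# 80 zero bits, then 16 one bits, then the 32-bit IPv4 address.
mapped_int = (0xFFFF << 32) | int(ipv4)
mapped = ipaddress.IPv6Address(mapped_int)

# The standard library recognises the prefix and recovers the IPv4 address.
recovered = mapped.ipv4_mapped
print(mapped, "->", recovered)
```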

Missing Data

Use the lowest possible IP address as the marker for missing values. According to RFC 2373,

The address 0:0:0:0:0:0:0:0 is called the unspecified address. It must
never be assigned to any node. It indicates the absence of an address.

See here.
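
A minimal sketch of that sentinel idea, assuming addresses are kept in integer form. The names MISSING and is_missing are hypothetical and only serve the example.

```python
import ipaddress

# Integer value of the unspecified address, used here as the missing-value marker.
MISSING = 0


def is_missing(value: int) -> bool:
    """Return True when the stored integer is the missing-value marker."""
    return value == MISSING


stored = [
    int(ipaddress.ip_address("192.0.2.1")),
    MISSING,
    int(ipaddress.ip_address("2001:db8::1")),
]
mask = [is_missing(value) for value in stored]
print(mask)
```

One consequence of this choice is that 0.0.0.0 and :: can no longer be stored as ordinary values, since both have the integer value zero.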

Methods

The new user-facing IPAddress (analogous to a Categorical) will have
a few methods for easily constructing arrays of IP addresses.

IPAddress.from_pyints(cls, values: Sequence[int]) -> 'IPAddress':
    """Construct an IPAddress array from a sequence of python integers.

    >>> IPAddress.from_pyints([10, 18446744073709551616])
    <IPAddress(['0.0.0.10', '0:0:0:1::'])>
    """

IPAddress.from_str(cls, values: Sequence[str]) -> 'IPAddress':
    """Construct an IPAddress from a sequence of strings."""

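What these constructors would have to do can be sketched with the standard library alone. The functions below are stand-ins that return plain lists of integers, not the proposed IPAddress container, and to_str is a hypothetical helper added for the round trip.

```python
import ipaddress
from typing import List, Sequence


def from_pyints(values: Sequence[int]) -> List[int]:
    """Validate Python integers as addresses and keep their integer form."""
    # ip_address() raises ValueError for negatives and for values >= 2**128.
    return [int(ipaddress.ip_address(value)) for value in values]


def from_str(values: Sequence[str]) -> List[int]:
    """Parse address strings (IPv4 or IPv6) into integers."""
    return [int(ipaddress.ip_address(value)) for value in values]


def to_str(values: Sequence[int]) -> List[str]:
    """Render integers back to text; values below 2**32 come out as IPv4."""
    return [str(ipaddress.ip_address(value)) for value in values]


ints = from_str(["0.0.0.10", "2001:db8::1"])
print(to_str(ints))
```

Note that the integer form alone cannot tell a small IPv6 address from an IPv4 one: "::1" parses to 1, which renders back as "0.0.0.1". A real container would need to decide how to handle that ambiguity.
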
The methods in the new .ip namespace should follow the standard
library's design.

Properties

  • is_multicast
  • is_private
  • is_global
  • is_unspecified
  • is_reserved
  • is_loopback
  • is_link_local
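
For reference, these are the same property names the standard library exposes on individual address objects. The sketch below evaluates them element-wise over a few arbitrary sample addresses; a vectorised accessor would presumably return one boolean array per property instead.

```python
import ipaddress

samples = ["224.0.0.1", "192.168.0.1", "8.8.8.8", "0.0.0.0", "127.0.0.1", "169.254.0.1"]
addresses = [ipaddress.ip_address(s) for s in samples]

PROPERTIES = [
    "is_multicast",
    "is_private",
    "is_global",
    "is_unspecified",
    "is_reserved",
    "is_loopback",
    "is_link_local",
]

# One list of booleans per property, aligned with `samples`.
table = {name: [getattr(addr, name) for addr in addresses] for name in PROPERTIES}

for name, column in table.items():
    print(f"{name:15s} {column}")
```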

Reference Implementation

An implementation of the types and block is available at
pandas-ip (at the moment
it's a proof of concept).

Alternatives

Adding a new block type to pandas is a major change. Downstream libraries may
have special-cased handling for pandas' extension types, so this shouldn't be
adopted without careful consideration.

Some alternatives to this that exist outside of pandas:

  1. Store ipaddress.IPv4Address or ipaddress.IPv6Address objects in
    an object dtype array. The .ip namespace could still be included
    with an extension decorator. The drawback here is the poor
    performance, as every operation would be done element-wise.
  2. A separate library that provides a container and methods. The
    downside here is that the library would need to subclass Series,
    DataFrame, and Index so that the custom blocks and types are
    interpreted correctly. Users would need to use the custom
    IPSeries, IPDataFrame, etc., which increases friction when working
    with other libraries that may expect / coerce to pandas objects.
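
Alternative 1 can be sketched as follows, assuming a pandas release that provides pandas.api.extensions.register_series_accessor. The IPAccessor class and its single property are hypothetical and only illustrate the shape of the approach.

```python
import ipaddress

import pandas as pd


@pd.api.extensions.register_series_accessor("ip")
class IPAccessor:
    """Hypothetical `.ip` accessor over an object-dtype Series of addresses."""

    def __init__(self, series: pd.Series) -> None:
        self._series = series

    @property
    def is_private(self) -> pd.Series:
        # One Python-level call per element: simple, but slow on large data.
        return self._series.map(lambda addr: addr.is_private)


ser = pd.Series([ipaddress.ip_address("192.168.0.1"), ipaddress.ip_address("8.8.8.8")])
result = ser.ip.is_private
print(result)
```

The Series holds ordinary Python objects, so every property lookup loops in Python, which is the performance drawback noted for this alternative.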

To expand a bit on the (current) downside of alternative 2, when the pandas constructors
see an "unknown" object, they fall back to object dtype and stuff the actual Python object
into whatever container is being created:

In [1]: import pandas as pd

In [2]: import pandas_ip as ip

In [3]: arr = ip.IPAddress.from_pyints([1, 2])

In [4]: arr
Out[4]: <IPAddress(['0.0.0.1', '0.0.0.2'])>

In [5]: pd.Series(arr)
Out[5]:
0    <IPAddress(['0.0.0.1', '0.0.0.2'])>
dtype: object

I'd rather not have to make a subclass of Series, just to stick an array-like thing into a Series.

If pandas could provide an interface such that objects satisfying that interface
are treated as array-like, and not a simple python object, then I'll gladly close
this issue and develop the IP-address specific functionality in another package.
That might be the best possible outcome to all this.

References

@TomAugspurger TomAugspurger added the API Design, Dtype Conversions, and Internals labels Dec 13, 2017
@jorisvandenbossche
Member

Wow, detailed proposal!

First question that comes to my mind: why does it need to be included in pandas (from a technical point of view)? Or to put it differently: what in pandas_ip is currently not working when storing the externally defined block in a pandas Series / DataFrame?

E.g. in geopandas the GeometryBlock can be stored in a Series as well, the main reason we have the subclasses GeoSeries and GeoDataFrame is to add a bunch of additional methods (but which could be solved with an accessor).
(I have to be honest: there are still some other methods we need to override to get everything working (like isna), but having yet another library with an external defined block might be an extra driver to fix those in pandas, like I fixed already a few things to get concat working with external blocks)

@jorisvandenbossche
Member

For example, I see you list concat and indexing in the notebook as things that don't work. However, if you define the correct method on your block, concatting Series objects should work, and basic indexing should work as well.

@TomAugspurger
Contributor Author

TomAugspurger commented Dec 13, 2017

what is currently in pandas_ip not working with storing the externally defined block in a pandas Series / DataFrame?

Unless I'm missing something, there isn't a good way to stuff an arbitrary "thing" into the regular Series / DataFrame constructors and have it work:

In [1]: import pandas as pd
In [2]: import pandas_ip as ip

In [3]: arr = ip.IPAddress.from_pyints([1, 2])

In [4]: arr
Out[4]: <IPAddress(['0.0.0.1', '0.0.0.2'])>

In [5]: pd.Series(arr)
Out[5]:
0    <IPAddress(['0.0.0.1', '0.0.0.2'])>
dtype: object

AFAICT, the only way to do this from outside pandas is to construct blocks directly and use fastpath:

In [8]: pd.Series(ip.IPBlock(arr, slice(0, 1)), pd.RangeIndex(2), fastpath=True)
Out[8]:
0    0.0.0.1
1    0.0.0.2
dtype: ip

So an alternative to my proposal would be to make something like In[8] possible, with a bit less of pandas internals coming through.

(edited a bug in my example).

@TomAugspurger
Contributor Author

I could imagine coming up with an interface where if an object passed to the interface satisfies it, we dispatch some of the DataFrame / Series constructor behavior to the object. In #18767 (comment), In[5] does the "wrong" thing since it casts it to NumPy. But, if the object passed in satisfies some interface so that pandas can determine

  • the length (2)
  • the type (ip, which is a NumPy or PyArrow dtype)
  • various other things I'm sure

Then pandas can (maybe) figure out the right thing to do. To be clear, I'd be more than satisfied if we can make this solution work.

@jschendel
Member

jschendel commented Dec 13, 2017

I was actually thinking about this yesterday, but in the context of Interval and IntervalIndex: there are situations where people care more about ranges of IP Addresses than a singular IP Address, either as the traditional IP Network groupings, or custom ranges of addresses. Seems like it would be nice to have good compatibility between IP Addresses and Interval/IntervalIndex for this, if not custom extensions of these, depending on how much customization there is.

Obviously a bit of work would need to be done on IP Addresses and Interval/IntervalIndex individually before trying to combine them, but it might be a good idea to consider compatibility between the two during initial design/implementation. Haven't scoped out the details of this much at all, so maybe Interval/IntervalIndex aren't appropriate for what I'm describing, but I think there should be some way of working with logical groups of IP Addresses. I'd classify this as more of a "nice to have" than something that'd need to be present in the initial implementation though.
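
To illustrate what "ranges of IP addresses" could mean in practice, here is a standard-library sketch that treats a network as a closed interval of integers. How this would map onto Interval / IntervalIndex is left open; the network and address are arbitrary samples.

```python
import ipaddress

network = ipaddress.ip_network("192.168.1.0/24")
address = ipaddress.ip_address("192.168.1.5")

# Membership test provided by the standard library.
inside = address in network

# The same network viewed as a closed integer interval [start, end].
start = int(network.network_address)
end = int(network.broadcast_address)
inside_by_range = start <= int(address) <= end

print(inside, inside_by_range, end - start + 1)
```

A custom range that is not a single network would just be a different (start, end) pair, which is why an interval-like representation looks like a natural fit.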

Additionally, the PostgreSQL docs might be useful as an additional reference/another perspective in general.

@TomAugspurger
Contributor Author

Updated the original with some information on why doing this outside pandas is (currently) difficult, but I'd be happy to work on making that smoother.

@jschendel, yes I was just reading through https://docs.python.org/3/howto/ipaddress.html#defining-networks on this. I'm not especially familiar with the network side of things, so I'm not sure what that would look like.

And good call on using Postgres for design inspiration.

@chris-b1
Contributor

I'm not opposed to having an IP type in pandas, but it does seem like it could be an interesting case to try to develop an "extension block API" around, i.e., you do something like subclass Block and ExtensionDtype and through metaclass registration or whatever, everything works!

That said, I really don't know our own internal interfaces well enough to know if this is feasible without massive refactoring or even a good idea.

@TomAugspurger
Contributor Author

TomAugspurger commented Dec 14, 2017 via email

@jorisvandenbossche
Member

Unless I'm missing something, there isn't a good way to stuff an arbitrary "thing" into the regular Series / DataFrame constructors and have it work:

Yes, that is correct. That is also something with which I have struggled in geopandas.

For the short term, you could provide functional constructors like ip.series(..) returning a Series with ip block.

BTW, the fact that it doesn't see your ip array-like as an array-like and unwraps it in a series (so getting series of length 2) feels like a bug in pandas (in is_list_like) (but not that it would do anything better otherwise in this case of course).

But, if the object passed in satisfies some interface so that pandas can determine

An alternative interface could be pandas checking for a _data attribute (or other name) that is a Block subclass instance (although you typically want to store your ip-array-like in a block, and not the block as an attribute on the array-like .., so maybe not a good idea)

I plan to experiment with defining an interface through ABCs next week.

Can you explain this in a bit more detail?

@TomAugspurger
Contributor Author

feels like a bug in pandas (in is_list_like)

I haven't (yet) implemented the methods to make that IP array an iterable.

I plan to experiment with defining an interface through ABCs next week.

Can you explain this in a bit more detail?

A class (ABC or otherwise) that contains enough information for the pandas constructors to do the right thing (the dtype, shape, a block type, etc). Right now in the Series constructor we try a whole bunch of things like checking if the array is an extension type before falling back to sticking it into an object-type numpy array (this is in `_sanitize_array`).
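
Very roughly, such an interface check might look like the sketch below. Everything in it is hypothetical (the SupportsPandasArray protocol, the IPArraySketch class, the describe function); it is not an existing pandas API, and it uses typing.Protocol (Python 3.8+) instead of an ABC purely for brevity.

```python
from typing import Any, Protocol, Tuple, runtime_checkable


@runtime_checkable
class SupportsPandasArray(Protocol):
    """Hypothetical interface: what a constructor would need to know."""

    @property
    def dtype(self) -> str: ...

    def __len__(self) -> int: ...


class IPArraySketch:
    """Stand-in for an externally defined array-like container."""

    def __init__(self, values) -> None:
        self._values = list(values)

    @property
    def dtype(self) -> str:
        return "ip"

    def __len__(self) -> int:
        return len(self._values)


def describe(obj: Any) -> Tuple[str, str, int]:
    """Decide how a constructor might treat `obj`."""
    if isinstance(obj, SupportsPandasArray):
        # The object satisfies the interface: keep it as an array.
        return ("array-like", obj.dtype, len(obj))
    # Otherwise fall back to boxing it as a single Python object.
    return ("opaque object", "object", 1)


print(describe(IPArraySketch([1, 2])))
print(describe(object()))
```

The point is only that the dispatch decision can be made from a small, documented set of attributes, without the constructor knowing anything about IP addresses.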

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 18, 2017
Adds new methods for registering custom accessors to pandas objects.

This will be helpful for implementing pandas-dev#18767
outside of pandas.

Closes pandas-dev#14781
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 8, 2018
Adds new methods for registering custom accessors to pandas objects.

This will be helpful for implementing pandas-dev#18767
outside of pandas.

Closes pandas-dev#14781
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 9, 2018
Adds new methods for registering custom accessors to pandas objects.

This will be helpful for implementing pandas-dev#18767
outside of pandas.

Closes pandas-dev#14781
TomAugspurger added a commit that referenced this issue Jan 16, 2018
* ENH: Added public accessor registrar

Adds new methods for registering custom accessors to pandas objects.

This will be helpful for implementing #18767
outside of pandas.

Closes #14781

* PEP8

* Moved to extensions

* More docs

* Fix see also

* DOC: Added whatsnew

* Move to api

* Update post review

* flake8

* Raise the underlying error instead of a RuntimeError

* str validate

* DOC: Moved to developer

* REF: Use public registrars for accessors

* cleanup

* Implemented optional caching

* Document cache

* Tests passing

* Use for plot

* Fix autodoc

* Fix the class instantiation

* Refactor again.

1. Removed optional caching
2. Refactored `Properties` to create the indexes it uses on demand
3. Moved accessor definitions to classes for clarity

* Fix API files

* Remove stale comment

* Tests pass

* DOC: some cleanup

* No need to assign doc

* Rename, shared docs

* Doc __new__

* Use UserWarning

* Update test
@TomAugspurger
Contributor Author

Closing this. It's implemented in https://cyberpandas.readthedocs.io/.

@mpenning

@TomAugspurger the title of this issue mentions mac-addresses; I see that cyberpandas groks IPs now, but is there a solution for mac addresses? If so, can you elaborate?

@TomAugspurger
Contributor Author

TomAugspurger commented Jun 20, 2018 via email
