DISCUSS: disambiguation of NA and "NA" in reprs #30415

anisotropi4 · 2019-12-22T22:55:35Z

Change to `str` `dtype` behaviour for missing elements

Following comments in the discussion about how to handle missing NA scalar values in #28778, I was asked to raise my question as this separate issue.

My rather prosaic question is this: if missing str elements are given the value NA, how would I distinguish between a missing str value and the two-character string 'NA'?

I ask because NA is a common abbreviation for 'Not Applicable', 'North America' et al., in a way that, in my experience, 'NaN' or 'Not a Number' isn't.

That is, if 'NA' were generated as the default missing str dtype value, especially if introduced as a change rather than as an opt-in, it risks becoming a developer UX issue, as I (for one) would no longer know whether 'NA' is a valid or a missing data value.

For what it's worth, the current idiomatic behaviour is that missing values are replaced by None:

   >>> import pandas as pd
   >>> array = [['No-one', 'Nadie'], ['Expects']]
   >>> df = pd.DataFrame(array, columns=['En', 'Es'])
   >>> df
           En     Es
   0   No-one  Nadie
   1  Expects   None

The types of the elements here are:

   >>> [type(i) for i in df['Es']]
   [<class 'str'>, <class 'NoneType'>]

Given this, my thought is that NA is not a suitable default replacement for missing str dtype elements; None (of type NoneType) would be preferable.

jorisvandenbossche · 2019-12-23T20:59:48Z

Thanks for opening the issue!

It is indeed true that currently, there is no distinction in the repr:

In [1]: pd.Series(["NA", pd.NA], dtype="string") 
Out[1]: 
0    NA
1    NA
dtype: string

Up to now, we had the same problem with strings like "None" or "NaN", but I agree that "NA" might be more common to have as a string.

For me, it is not so much a question of None vs pd.NA as the missing value indicator (None also has several disadvantages, which is one of the reasons to go with pd.NA), but rather a question of how to display NA in the repr to avoid such ambiguity.

There are 2 main reprs used in pandas: the plain text repr (eg in the console or when printing) and the html repr (eg in the notebook).
In the notebook, there are probably some more options to play with styling (I am not sure we can use colors in the plain IPython terminal).

As a comparison, R also uses NA, and they solve it by displaying the missing value as <NA> in the plain repr:

> data.frame(a=c("NA", NA))
     a
1   NA
2 <NA>

The tibble uses a richer display with coloring (screenshot not included).

So there might be options with which we can resolve this ambiguity between NA and "NA".

anisotropi4 · 2019-12-23T22:56:18Z

In your edit I fear you have misrepresented my position with regard to missing str dtype information, inasmuch as I see two points that need to be agreed here:

1. How do you represent a missing scalar str dtype?
2. How do you display a missing str dtype?

In your note above you do not address the first point other than to say: "For me, it is not so much a question of None vs pd.NA as missing value indicator (None also has several disadvantages, which is one of the reasons to go with a pd.NA)" but without explanation as to what these disadvantages are.

Please would you explain what the advantages of the pd.NA dtype over the NoneType dtype are for missing str data, as I am yet to be convinced that this change is necessary or helpful.

With regard to representation (point 2): as I typically work with command-line tools and see the plain-text representation as key, I would suggest that the missing scalar be represented either by the existing None or by an alternative such as <NA>, pandas.NA, pd.NA or even <None>.

TomAugspurger · 2019-12-27T16:22:05Z

You can use isna to distinguish pd.NA from the string NA.
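As a minimal sketch of that check, assuming pandas 1.0 or later (where `pd.NA` and the `"string"` dtype are available), `isna` flags only the missing value and not the two-character string:

```python
import pandas as pd

# One real two-character string and one missing value.
s = pd.Series(["NA", pd.NA], dtype="string")

# isna() is True only for the missing entry, not for the string "NA".
mask = s.isna()
print(mask.tolist())  # [False, True]

# The top-level helper gives the same answer for scalars.
print(pd.isna("NA"), pd.isna(pd.NA))  # False True
```

This separates the two in code; it does not change how they are printed.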

We’d prefer to use NA rather than None for consistent behavior across dtypes. We have more control over NA.

jorisvandenbossche · 2019-12-28T09:13:24Z

@anisotropi4 I'm sorry if I changed the intent of your issue without explicitly saying or asking that. You're correct there can be two questions, but I understood your comment on twitter to be your second item ("how to display").
There are good reasons IMO to not use None as the missing value indicator: 1) as Tom said, we want to use something consistently across data types, 2) None has a different scalar behaviour than we want for missing values (e.g. None == None is True, something we certainly don't want for the missing value in a Series), 3) None is already used for other things in Python, such as for a default or optional value in function arguments, which is a different meaning than "missing value".

Moreover, I don't think it is a problem that the string "NA" can also mean different things compared to the object pd.NA, which is not a string. The objects themselves are perfectly distinguishable. So for me, the problem only occurs in the representation.
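A small sketch of the scalar behaviour described above, assuming pandas 1.0 or later; it only illustrates the comparison semantics and is not taken from the pandas code base:

```python
import pandas as pd

# None compares equal to itself, so two missing entries would look "equal".
print(None == None)  # True

# pd.NA propagates through comparisons instead of answering True or False.
result = pd.NA == pd.NA
print(result is pd.NA)  # True

# The missing-value object and the string "NA" have different types,
# so code can always tell them apart even though they print alike.
print(isinstance("NA", str), isinstance(pd.NA, str))  # True False
```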

> You can use isna to distinguish pd.NA from the string NA.

That's what you can do in code, yes. But I think we should still think about the display, where you can't use isna to see directly what the data in the displayed Series are.

TomAugspurger · 2019-12-28T13:12:16Z

Right you covered the display stuff :) I think we should explore the color stuff at least within IPython and jupyter.

TomAugspurger · 2019-12-28T19:46:23Z

Just for fun, wrote this up

diff --git a/pandas/core/arrays/string_.py b/pandas/core/arrays/string_.py
index de254f662b..7b929adb9d 100644
--- a/pandas/core/arrays/string_.py
+++ b/pandas/core/arrays/string_.py
@@ -219,6 +219,16 @@ class StringArray(PandasArray):
         arr[mask] = -1
         return arr, -1
 
+    def _formatter(self, boxed=False):
+        def fmt(x):
+            if x is libmissing.NA:
+                return "\033[91m" + "NA" + '\033[0m'
+            elif boxed:
+                return str(x)
+            else:
+                return repr(x)
+        return fmt
+
     def __setitem__(self, key, value):
         value = extract_array(value, extract_numpy=True)
         if isinstance(value, type(self)):

This worked for the array repr, but not for Series or DataFrame.

jreback · 2019-12-28T20:08:31Z

right we likely need to ask the column for its repr of nulls (rather than hard code based on dtype)

TomAugspurger · 2019-12-30T11:51:16Z

Just FYI, this will take a bit more work to get working inside Series / DataFrame. Things like truncating / aligning columns get broken because the length of the "value" `"\033[91m" + "NA" + "\033[0m"` doesn't match its display length of 2. I don't think this should necessarily be a blocker for 1.0.
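The length mismatch can be reproduced with plain Python strings. The sketch below is illustrative only: `visible_len` and the `ANSI_SGR` pattern are hypothetical helpers, not part of pandas, and they handle only simple color (SGR) escape sequences.

```python
import re

RED, RESET = "\033[91m", "\033[0m"
colored = RED + "NA" + RESET

# len() counts the escape characters, although a terminal shows only "NA".
print(len(colored))  # 11

# Padding based on len() adds nothing, so the cell ends up misaligned
# next to 5-character neighbours.
print(colored.rjust(5) == colored)  # True

# Hypothetical helper: strip color escapes to measure the displayed width.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")

def visible_len(text):
    return len(ANSI_SGR.sub("", text))

print(visible_len(colored))  # 2

# Pad by the displayed width instead of the raw length.
padded = " " * (5 - visible_len(colored)) + colored
print(visible_len(padded))  # 5
```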

jreback · 2019-12-30T12:19:35Z

> Just FYI, this will take a bit more work to get working inside Series / DataFrame. Things like truncating / aligning columns gets broken because the length of the "value" `"\033[91m" + "NA" + "\033[0m"` doesn't match its display length of 2. I don't think this should necessarily be a blocker for 1.0.

for sure not a blocker

anisotropi4 · 2020-01-04T12:30:16Z

@jorisvandenbossche thank you for your clarification. I get now that I was rather woolly in the way that I framed the question and should've been clearer about which of the two issues this is looking at.

Then, for those of us that are still working in black-and-white, I would ask that another representation of NA rather than simply the text NA is used...

TomAugspurger · 2020-01-04T14:06:56Z

I have a branch started, but it’s quite a bit of work. Probably not happening for 1.0.

Dr-Irv · 2020-01-04T15:06:05Z

Don't use color, because that will have an effect on the color blind (not a problem for me, but we've had other comments from the color blind in the past). But here's a suggestion that I think will look nice and easily handles the issue of the length of the representation.

In [1]: chr(171)+"NA"+chr(187)
Out[1]: '«NA»'
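To illustrate why a plain-text marker sidesteps the length problem, here is a rough sketch in pure Python. The `values` list is made up for the example, and `None` merely stands in for the missing value; this is not how pandas formats a column.

```python
NA_MARKER = chr(171) + "NA" + chr(187)  # '«NA»'

# Hypothetical column: one real string "NA" and one missing entry.
values = ["NA", None]
cells = [NA_MARKER if v is None else v for v in values]

# The marker's len() equals its displayed width, so ordinary padding works.
width = max(len(c) for c in cells)
for i, cell in enumerate(cells):
    print(i, cell.rjust(width))
# 0   NA
# 1 «NA»
```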

TomAugspurger · 2020-01-06T12:32:30Z

> Don't use color, because that will have an effect on the color blind

This will need to be configurable. And we'll want an option that works without colors too.

(Screenshots of the dark and light themes are not included.)

Is this something we want to pursue for 1.0? It's already a surprisingly large diff, and I haven't written thorough tests, and I haven't implemented the option handling yet.

Dr-Irv · 2020-01-06T13:21:03Z

@TomAugspurger You also have to worry about the documentation impact: will the color show up in the docs? And even if it does, that's not good for the color blind. That's why I suggested the '«NA»' option. Visible to everybody (and probably a smaller diff?)

TomAugspurger · 2020-01-06T13:27:56Z

Yes, the color should show up in the docs.

jorisvandenbossche · 2020-01-07T12:26:34Z

@Dr-Irv is there a reason to use the special UTF characters to have '«NA»' instead of a simpler '<NA>'? Or just because you think it looks better?
(but actually, implementation wise both are probably similar regarding complexity, as in python 3 the « is just a normal string character. Being able to type it easily is maybe not that important)

--

If we want to go with a different text repr (like '«NA»'), I think it would be nice to include this in 1.0, as it impacts quite a bit the "look" of the new feature (but if we want that, maybe not a blocker for 1.0).
@TomAugspurger can you push a branch or WIP PR to have an idea of the complexity you are talking about?

Dr-Irv · 2020-01-07T13:15:15Z

> @Dr-Irv is there a reason to use the special utf to have '«NA»' instead of a simpler '<NA>'? Or just because you think it looks better?
> (but actually, implementation wise both are probably similar regarding complexity, as in python 3 the « is just a normal string character. Being able to type it easily is maybe not that important)

@jorisvandenbossche I chose it because it looked better. But we could also use <<NA>>, which would work as well. I'm strongly against using color, because it's being insensitive to the color blind (and this has hit me in the past in my 30+ year career). I also think having a character sequence that is unlikely to appear in data helps. So you might see "NA" in data, but either «NA» or <<NA>> are unlikely.

anisotropi4 mentioned this issue Dec 22, 2019

Missing values proposal: concrete steps for 1.0 #29556

Closed

13 tasks

jorisvandenbossche added the ExtensionArray, Missing-data and Output-Formatting labels Dec 23, 2019

jorisvandenbossche changed the title ~~DISCUSS: Consistent str dtype behaviour with missing value~~ DISCUSS: disambiguation of NA missing value in repr of string dtype Series/DataFrame Dec 23, 2019

jorisvandenbossche added this to the 1.0 milestone Dec 23, 2019

anisotropi4 changed the title ~~DISCUSS: disambiguation of NA missing value in repr of string dtype Series/DataFrame~~ DISCUSS: disambiguation of NA missing value with reason for change and its repr of string dtype Series/DataFrame Dec 23, 2019

TomAugspurger modified the milestones: 1.0, Contributions Welcome Dec 30, 2019

TomAugspurger changed the title ~~DISCUSS: disambiguation of NA missing value with reason for change and its repr of string dtype Series/DataFrame~~ DISCUSS: disambiguation of NA and "NA" in reprs Dec 30, 2019

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 8, 2020

Update NA repr

12aa3d3

Closes pandas-dev#30415

TomAugspurger mentioned this issue Jan 8, 2020

Update NA repr #30821

Merged

TomAugspurger closed this as completed in #30821 Jan 9, 2020

TomAugspurger added a commit that referenced this issue Jan 9, 2020

Update NA repr (#30821)

493363e

* Update NA repr Closes #30415
