DISCUSS: disambiguation of NA and "NA" in reprs #30415

Closed
anisotropi4 opened this issue Dec 22, 2019 · 17 comments · Fixed by #30821
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Output-Formatting __repr__ of pandas objects, to_string

Comments

@anisotropi4

Change to str dtype behaviour for missing elements

Following comments in the discussion about how to handle missing NA scalar values in #28778, I was asked to raise my question as this separate issue.

My rather prosaic question is: if missing str elements are given the value NA, how would I distinguish between a missing str value and the two-character string 'NA'?

I ask because NA is a common abbreviation for 'Not Applicable', 'North America' et al., in a way that, in my experience, 'NaN' or 'Not a Number' isn't.

That is, if 'NA' were generated as the default missing str dtype value, especially if introduced as a change rather than as an opt-in, it risks becoming a UX developer issue, as I (for one) would no longer know if 'NA' is a valid or a missing data value.

For what it's worth, the current idiomatic behaviour is that missing values are replaced by None:

   >>> import pandas as pd
   >>> array = [['No-one', 'Nadie'], ['Expects']]
   >>> df = pd.DataFrame(array, columns=['En', 'Es'])
   >>> df
           En     Es
   0   No-one  Nadie
   1  Expects   None

The element types here are:

   >>> [type(i) for i in df['Es']]
   [<class 'str'>, <class 'NoneType'>]

Given this, my thought is that NA is not a suitable default replacement for missing str dtype elements; None, of type NoneType, is.

@jorisvandenbossche
Member

Thanks for opening the issue!

It is indeed true that currently, there is no distinction in the repr:

In [1]: pd.Series(["NA", pd.NA], dtype="string") 
Out[1]: 
0    NA
1    NA
dtype: string

Up to now, we had the same problem with strings like "None" or "NaN", but I agree that "NA" might be more common to have as a string.

For me, it is not so much a question of None vs pd.NA as the missing value indicator (None also has several disadvantages, which is one of the reasons to go with pd.NA), but rather a question of how to display NA in the repr to avoid such ambiguity.

There are 2 main reprs used in pandas: the plain text repr (eg in the console or when printing) and the html repr (eg in the notebook).
In the notebook, there are probably some more options to play with styling (I am not sure we can use colors in the plain IPython terminal).

As a comparison, R also uses NA, and it resolves the ambiguity by displaying the missing value as <NA> in the plain repr:

> data.frame(a=c("NA", NA))
     a
1   NA
2 <NA>

The tibble package uses a richer display with coloring:

Screenshot from 2019-12-23 21-55-53

So there might be options for resolving this ambiguity between NA and "NA".

@jorisvandenbossche jorisvandenbossche added ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Output-Formatting __repr__ of pandas objects, to_string labels Dec 23, 2019
@jorisvandenbossche jorisvandenbossche changed the title DISCUSS: Consistent str dtype behaviour with missing value DISCUSS: disambiguation of NA missing value in repr of string dtype Series/DataFrame Dec 23, 2019
@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Dec 23, 2019
@anisotropi4
Author

In your edit I fear you have misrepresented my position with regard to missing str dtype information, inasmuch as I see two points that need to be agreed here:

  1. How do you represent a missing scalar str dtype?
  2. How do you display a missing str dtype?

In your note above you do not address the first point other than to say: "For me, it is not so much a question of None vs pd.NA as missing value indicator (None also has several disadvantages, which is one of the reasons to go with a pd.NA)" but without explanation as to what these disadvantages are.

Please would you explain what the advantages of pd.NA over None are for missing str data, as I am yet to be convinced that this change is necessary or helpful.

With regard to representation (point 2): as I typically work with command-line tools and see plain-text representation as key, I would suggest that a missing scalar be represented either by the existing None or by alternatives such as <NA>, pandas.NA, pd.NA or even <None>.

@anisotropi4 anisotropi4 changed the title DISCUSS: disambiguation of NA missing value in repr of string dtype Series/DataFrame DISCUSS: disambiguation of NA missing value with reason for change and its repr of string dtype Series/DataFrame Dec 23, 2019
@TomAugspurger
Contributor

You can use isna to distinguish pd.NA from the string NA.

We’d prefer to use NA rather than None for consistent behavior across dtypes. We have more control over NA.
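
As a rough illustration of the isna suggestion (a sketch only, assuming pandas 1.0 or later, where pd.NA and the "string" dtype are available), the two values can be separated in code even though they print alike:

```python
import pandas as pd

# Same data as in the repr example earlier in the thread.
s = pd.Series(["NA", pd.NA], dtype="string")

# isna() is True only for the missing value, not for the string "NA".
missing_mask = s.isna().tolist()

# A plain comparison picks out the literal two-character string instead.
# fillna(False) turns a missing comparison result into False.
is_literal_na = (s == "NA").fillna(False).tolist()

print(missing_mask)
print(is_literal_na)
```

This only helps in code; it does not change what the printed Series looks like.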

@jorisvandenbossche
Member

@anisotropi4 I'm sorry if I changed the intent of your issue without explicitly saying or asking that. You're correct that there can be two questions, but I understood your comment on Twitter to be your second item ("how to display").
There are good reasons, IMO, not to use None as the missing value indicator: 1) as Tom said, we want to use something consistently across data types; 2) None has different scalar behaviour from what we want for missing values (e.g. None == None evaluates to True, something we certainly don't want for the missing value in a Series); 3) None is already used for other things in Python, such as a default or optional value in function arguments, which has a different meaning from "missing value".

Moreover, I don't think it is a problem that the string "NA" can also mean different things compared to the object pd.NA, which is not a string. The objects themselves are perfectly distinguishable. So for me, the problem only occurs in the representation.
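
A small sketch of the scalar behaviour in point 2 and of the distinction between the objects, assuming pandas 1.0 or later (the variable names are purely illustrative):

```python
import pandas as pd

# None compares equal to itself, like any ordinary Python object.
none_equal = (None == None)

# pd.NA propagates through comparisons: the result is itself missing.
na_equal = (pd.NA == pd.NA)

# The objects themselves are easy to tell apart from the string "NA".
string_is_missing = pd.isna("NA")
na_is_missing = pd.isna(pd.NA)

print(none_equal, na_equal, string_is_missing, na_is_missing)
```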

> You can use isna to distinguish pd.NA from the string NA.

That's what you can do in code, yes. But I think we should still think about the display, where you can't use isna to see directly what the data in the displayed Series are.

@TomAugspurger
Contributor

Right you covered the display stuff :) I think we should explore the color stuff at least within IPython and jupyter.

@TomAugspurger
Contributor

Just for fun, wrote this up

diff --git a/pandas/core/arrays/string_.py b/pandas/core/arrays/string_.py
index de254f662b..7b929adb9d 100644
--- a/pandas/core/arrays/string_.py
+++ b/pandas/core/arrays/string_.py
@@ -219,6 +219,16 @@ class StringArray(PandasArray):
         arr[mask] = -1
         return arr, -1
 
+    def _formatter(self, boxed=False):
+        def fmt(x):
+            if x is libmissing.NA:
+                return "\033[91m" + "NA" + '\033[0m'
+            elif boxed:
+                return str(x)
+            else:
+                return repr(x)
+        return fmt
+
     def __setitem__(self, key, value):
         value = extract_array(value, extract_numpy=True)
         if isinstance(value, type(self)):

This worked for the array repr, but not for Series or DataFrame

Screen Shot 2019-12-28 at 1 45 05 PM

@jreback
Contributor

jreback commented Dec 28, 2019

right we likely need to ask the column for its repr of nulls (rather than hard code based on dtype)

@TomAugspurger
Contributor

Just FYI, this will take a bit more work to get working inside Series / DataFrame. Things like truncating / aligning columns get broken because the length of the value "\033[91m" + "NA" + "\033[0m" doesn't match its display length of 2. I don't think this should necessarily be a blocker for 1.0.
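
To make the length mismatch concrete, here is a minimal sketch in plain Python. The helpers `display_width` and `rjust_visible` are hypothetical names for illustration, not pandas functions, and the regular expression only covers simple color (SGR) escape sequences; wide characters are ignored.

```python
import re

# Matches simple SGR color sequences such as "\033[91m" and "\033[0m".
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")


def display_width(text):
    """Length of the text once the invisible escape sequences are removed."""
    return len(ANSI_ESCAPE.sub("", text))


def rjust_visible(text, width):
    """Right-justify based on the visible width, not on len(text)."""
    padding = max(width - display_width(text), 0)
    return " " * padding + text


colored = "\033[91m" + "NA" + "\033[0m"

raw_length = len(colored)            # counts the escape characters too
visible_length = display_width(colored)

print(raw_length, visible_length)
```

Any alignment or truncation code that relies on `len` would treat the colored value as much wider than the two characters that actually appear on screen.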

@jreback
Contributor

jreback commented Dec 30, 2019

> Just FYI, this will take a bit more work to get working inside Series / DataFrame. Things like truncating / aligning columns get broken because the length of the value "\033[91m" + "NA" + "\033[0m" doesn't match its display length of 2. I don't think this should necessarily be a blocker for 1.0.

for sure not a blocker

@TomAugspurger TomAugspurger modified the milestones: 1.0, Contributions Welcome Dec 30, 2019
@TomAugspurger TomAugspurger changed the title DISCUSS: disambiguation of NA missing value with reason for change and its repr of string dtype Series/DataFrame DISCUSS: disambiguation of NA and "NA" in reprs Dec 30, 2019
@anisotropi4
Author

@jorisvandenbossche thank you for your clarification. I get now that I was rather woolly in the way that I framed the question and should've been clearer about which of the two issues this is looking at.

Then, for those of us that are still working in black-and-white, I would ask that another representation of NA rather than simply the text NA is used...

@TomAugspurger
Contributor

I have a branch started, but it’s quite a bit of work. Probably not happening for 1.0.

@Dr-Irv
Contributor

Dr-Irv commented Jan 4, 2020

Don't use color, because that will have an effect on the color blind (not a problem for me, but we've had other comments from the color blind in the past). But here's a suggestion that I think will look nice and easily handles the issue of the length of the representation.

In [1]: chr(171)+"NA"+chr(187)
Out[1]: '«NA»'

@TomAugspurger
Contributor

TomAugspurger commented Jan 6, 2020

> Don't use color, because that will have an effect on the color blind

This will need to be configurable. And we'll want an option that works without colors too.

Dark: Screen Shot 2020-01-06 at 6 29 45 AM

Light:
Screen Shot 2020-01-06 at 6 29 29 AM

Is this something we want to pursue for 1.0? It's already a surprisingly large diff, and I haven't written thorough tests, and I haven't implemented the option handling yet.

@Dr-Irv
Contributor

Dr-Irv commented Jan 6, 2020

@TomAugspurger You also have to worry about the documentation impact: will the color show up in the docs? And even if it does, that's not good for the color blind. That's why I suggested the '«NA»' option. Visible to everybody (and probably a smaller diff?)

@TomAugspurger
Contributor

TomAugspurger commented Jan 6, 2020 via email

@jorisvandenbossche
Member

@Dr-Irv is there a reason to use the special UTF characters to have '«NA»' instead of a simpler '<NA>'? Or just because you think it looks better?
(but actually, implementation wise both are probably similar regarding complexity, as in python 3 the « is just a normal string character. Being able to type it easily is maybe not that important)

--

If we want to go with a different text repr (like '«NA»'), I think it would be nice to include this in 1.0, as it has quite an impact on the "look" of the new feature (but if we want that, maybe not a blocker for 1.0).
@TomAugspurger can you push a branch or WIP PR to have an idea of the complexity you are talking about?

@Dr-Irv
Contributor

Dr-Irv commented Jan 7, 2020

> @Dr-Irv is there a reason to use the special UTF characters to have '«NA»' instead of a simpler '<NA>'? Or just because you think it looks better?
> (but actually, implementation wise both are probably similar regarding complexity, as in python 3 the « is just a normal string character. Being able to type it easily is maybe not that important)

@jorisvandenbossche I chose it because it looked better. But we could also use <<NA>>, which would work as well. I'm strongly against using color, because it's being insensitive to the color blind (and this has hit me in the past in my 30+ year career). I also think having a character sequence that is unlikely to appear in data helps. So you might see "NA" in data, but either «NA» or <<NA>> are unlikely.
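
The color-free options discussed here can be sketched as follows. This is only an illustration under stated assumptions: `format_cell` is a hypothetical helper, not part of pandas, and pandas 1.0 or later is assumed for pd.NA.

```python
import pandas as pd


def format_cell(value, na_rep="<NA>"):
    """Hypothetical helper: render one value for a plain-text column."""
    if value is pd.NA:
        # Use a marker that is unlikely to appear as real data.
        return na_rep
    return str(value)


values = ["NA", pd.NA]

# Because the marker is ordinary text, len() matches the display width
# and right-justifying the column works without special handling.
cells = [format_cell(v) for v in values]
width = max(len(cell) for cell in cells)
lines = [cell.rjust(width) for cell in cells]

# The guillemet variant suggested above has the same width.
guillemet = format_cell(pd.NA, na_rep=chr(171) + "NA" + chr(187))

print("\n".join(lines))
print(guillemet)
```

Unlike the colored variant, both markers stay visible in monochrome terminals and in rendered documentation.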

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 8, 2020
TomAugspurger added a commit that referenced this issue Jan 9, 2020
* Update NA repr

Closes #30415