Assigning with loc using an index with string values #22500

Open
chrisroat opened this issue Aug 24, 2018 · 3 comments
Labels
Bug, Indexing (related to indexing on series/frames, not to indexes themselves)

Comments

@chrisroat

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
import pytest # Not run via pytest -- just used for exception testing

def create_df(total, index=None):
  dma = [501, 501, 501, 501, 501, 501, 502, 502, 502, 502, 502, 502]
  size = [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2]
  age = ['20-25', '30-35', '40-45', '20-25', '30-35', '40-45',
         '20-25', '30-35', '40-45', '20-25', '30-35', '40-45']
  df = pd.DataFrame()
  df['dma'] = dma
  df['size'] = size
  df['age'] = age
  df['total'] = total

  # Move the requested column(s) into the index (a list of names yields a MultiIndex).
  return df.set_index(index)

def run_test(index, value, use_df_index, expected_exception=None, expected_df=None):
  total = np.array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],)
  df = create_df(total, index)
  df_10 = create_df(10 * total, index)

  def run():
    if use_df_index:
      df.loc[df.index==value, 'total'] = df_10.loc[df_10.index==value, 'total']
    else:
      df.loc[value, 'total'] = df_10.loc[value, 'total']

  if expected_exception:
    with pytest.raises(expected_exception):
      run()
  else:
    run()
    pd.testing.assert_frame_equal(df, expected_df)


total = np.array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],)

expected_df = create_df(
  np.array([10., 10., 10., 1., 1., 1., 10., 10., 10., 1., 1., 1.],),
  'size')
run_test('size', 1, False, expected_exception=ValueError)  # A1
run_test('size', 1, True, expected_df=expected_df)         # A2 *
run_test('size', (1,), False, expected_df=expected_df)     # B1 *
run_test('size', (1,), True, expected_df=expected_df)      # B2 *

expected_df = create_df(
  np.array([10., 1., 1., 10., 1., 1., 10., 1., 1., 10., 1., 1.],),
  'age')
WRONG_DF = create_df(total, 'age')
run_test('age', '20-25', False, expected_exception=ValueError)     # A1
run_test('age', '20-25', True, expected_df=expected_df)            # A2 *
run_test('age', ('20-25',), False, expected_exception=ValueError)  # B1
run_test('age', ('20-25',), True, expected_df=WRONG_DF)            # B2

expected_df = create_df(
  np.array([10., 1., 1., 1., 1., 1., 10., 1., 1., 1., 1., 1.],),
  ['size', 'age'])
run_test(['size', 'age'], (1, '20-25'), False, expected_df=expected_df)    # B1 *
run_test(['size', 'age'], (1, '20-25'), True, expected_exception=KeyError) # B2

Problem description

When assigning via .loc, I'm running into problems when the index holds string values. The example tries to assign through loc with three kinds of index: a single int column, a single string column, and a two-column index.

The variations are labeled in the code comments as follows (an asterisk marks the cases that produce the desired result):

  • for single column indexes, use a flat value (A) or a tuple (B). Multi-column indexes only use tuple (B).
  • use df.loc[value, column] (1) vs df.loc[df.index==value, column] (2)

Expected Output

I'd like to use a single variation for all index types (but it seems no single method works). Ideally, it would be 'B2', but that does not work for a string-based index.
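As a possible workaround for the "single variation" goal, here is a minimal sketch that builds a boolean row mask from the index and assigns raw values through it. The helper `row_mask` and the small `make_frame` fixture are hypothetical names introduced for illustration, not part of pandas, and the sketch assumes a flat index never stores tuples as labels.

```python
import numpy as np
import pandas as pd


def row_mask(index, key):
    """Hypothetical helper: boolean mask of rows whose index label equals key."""
    if isinstance(index, pd.MultiIndex):
        # A MultiIndex label is a full tuple, one entry per level.
        return index.isin([tuple(key)])
    # Flat index: accept a bare value or a 1-tuple wrapping it.
    # Assumption: the flat index does not itself contain tuple labels.
    if isinstance(key, tuple) and len(key) == 1:
        key = key[0]
    return index.isin([key])


def make_frame(total, index):
    """Small stand-in for create_df with four rows."""
    df = pd.DataFrame({
        "size": [1, 1, 2, 2],
        "age": ["20-25", "30-35", "20-25", "30-35"],
        "total": total,
    })
    return df.set_index(index)


cases = [("size", (1,)), ("age", ("20-25",)), (["size", "age"], (1, "20-25"))]
results = []
for index, key in cases:
    df = make_frame(np.ones(4), index)
    df_10 = make_frame(10 * np.ones(4), index)
    mask = row_mask(df.index, key)
    # .to_numpy() drops the labels, so no alignment on duplicate index values occurs.
    df.loc[mask, "total"] = df_10.loc[mask, "total"].to_numpy()
    results.append(df["total"].tolist())
```

The same tuple-shaped key is used for all three index types. Whether this fits depends on whether positional assignment (rather than label alignment) is acceptable for the data at hand.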

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.7.2
pip: 18.0
setuptools: 40.2.0
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.11
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

gfyoung added the Indexing and API Design labels on Aug 25, 2018
@gfyoung
Member

gfyoung commented Aug 25, 2018

cc @toobaz

@shoyer
Member

shoyer commented Aug 29, 2018

I think you've identified two inconsistencies/likely bugs here.

Part of the issue is that tuples are valid members of a standard index, so df.loc['20-25', 'total'] and df.loc[('20-25',), 'total'] could potentially have different semantics. But it is indeed weird that df.loc[1, 'total'] and df.loc[(1,), 'total'] give the same result -- we shouldn't have different behavior for string vs numeric indexes.
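To illustrate the ambiguity, here is a small sketch of a flat index whose labels are themselves tuples. It assumes the `tupleize_cols=False` option of the `pd.Index` constructor, which keeps tuples as plain labels instead of building a MultiIndex. The variable names are illustrative only.

```python
import pandas as pd

# tupleize_cols=False keeps the tuples as ordinary labels instead of building a MultiIndex.
flat = pd.Index([("20-25",), ("30-35",)], tupleize_cols=False)

is_multi = isinstance(flat, pd.MultiIndex)
has_tuple_label = ("20-25",) in flat   # the 1-tuple is a label in its own right
has_string_label = "20-25" in flat     # the bare string is a different key
```

In such an index the 1-tuple and the bare string are distinct keys, which is why the two spellings cannot simply be treated as interchangeable.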

To get the right result with a MultiIndex, you need to index either like df.loc['20-25', 'total'] or df.loc[('20-25', slice(None)), 'total'] (i.e., filling out all the trailing indexer levels). Unfortunately, you can't treat a non-MultiIndex like a single level MultiIndex -- as noted above, you'll need to unpack the tuple. This is part of the larger issue of consistency between the MultiIndex and Index APIs (#3268).
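A minimal sketch of both cases follows. It assumes a two-level index with age as the first level, so that '20-25' is the leading key. The frame is a reduced stand-in for the one in the report, and a scalar is assigned to keep the example independent of label alignment.

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["20-25", "20-25", "30-35", "30-35"],
    "size": [1, 2, 1, 2],
    "total": [1.0, 1.0, 1.0, 1.0],
})

# MultiIndex: fill the trailing level with slice(None) to select every size.
multi = df.set_index(["age", "size"]).sort_index()
multi.loc[("20-25", slice(None)), "total"] = 10.0

# Flat index: unpack the 1-tuple and index with the bare label.
flat = df.set_index("age")
key = ("20-25",)
(label,) = key
flat.loc[label, "total"] = 10.0
```

Both assignments touch the two rows with age '20-25' and leave the others unchanged.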

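One option would be to avoid using a MultiIndex for indexing at all, and stick with using a boolean indexer for the rows, e.g., df.loc[(df['size'] == 1) & (df['age'] == '20-25'), 'total']. (Bracket access is needed for size, because df.size is the DataFrame's element count rather than the column.)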
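A short sketch of the boolean-indexer approach, assuming size and age stay as ordinary columns. The four-row frame is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "size": [1, 1, 2, 2],
    "age": ["20-25", "30-35", "20-25", "30-35"],
    "total": [1.0, 1.0, 1.0, 1.0],
})

# Use df["size"], not df.size: the attribute returns the number of elements.
mask = (df["size"] == 1) & (df["age"] == "20-25")
df.loc[mask, "total"] = 10.0

element_count = df.size  # 4 rows x 3 columns
```

Because the mask is aligned with the frame's own rows, the same pattern works regardless of what the index contains.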

@jreback
Contributor

jreback commented Aug 29, 2018

pls look and see if this is a duplicate issue

mroeschke added the Bug label and removed the API Design label on Jun 22, 2021