Assigning with loc using an index with string values #22500

chrisroat · 2018-08-24T15:27:32Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
import pytest # Not run via pytest -- just used for exception testing

def create_df(total, index=None):
  dma = [501, 501, 501, 501, 501, 501, 502, 502, 502, 502, 502, 502]
  size = [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2]
  age = ['20-25', '30-35', '40-45', '20-25', '30-35', '40-45',
         '20-25', '30-35', '40-45', '20-25', '30-35', '40-45']
  df = pd.DataFrame()
  df['dma'] = dma
  df['size'] = size
  df['age'] = age
  df['total'] = total

  df10 = df.copy()
  df10.total = 10 * df.total

  df.set_index(index, inplace=True)
  return df

def run_test(index, value, use_df_index, expected_exception=None, expected_df=None):
  total = np.array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],)
  df = create_df(total, index)
  df_10 = create_df(10 * total, index)

  def run():
    if use_df_index:
      df.loc[df.index==value, 'total'] = df_10.loc[df_10.index==value, 'total']
    else:
      df.loc[value, 'total'] = df_10.loc[value, 'total']

  if expected_exception:
    with pytest.raises(expected_exception):
      run()
  else:
    run()
    pd.testing.assert_frame_equal(df, expected_df)


total = np.array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],)

expected_df = create_df(
  np.array([10., 10., 10., 1., 1., 1., 10., 10., 10., 1., 1., 1.],),
  'size')
run_test('size', 1, False, expected_exception=ValueError)  # A1
run_test('size', 1, True, expected_df=expected_df)         # A2 *
run_test('size', (1,), False, expected_df=expected_df)     # B1 *
run_test('size', (1,), True, expected_df=expected_df)      # B2 *

expected_df = create_df(
  np.array([10., 1., 1., 10., 1., 1., 10., 1., 1., 10., 1., 1.],),
  'age')
WRONG_DF = create_df(total, 'age')
run_test('age', '20-25', False, expected_exception=ValueError)     # A1
run_test('age', '20-25', True, expected_df=expected_df)            # A2 *
run_test('age', ('20-25',), False, expected_exception=ValueError)  # B1
run_test('age', ('20-25',), True, expected_df=WRONG_DF)            # B2

expected_df = create_df(
  np.array([10., 1., 1., 1., 1., 1., 10., 1., 1., 1., 1., 1.],),
  ['size', 'age'])
run_test(['size', 'age'], (1, '20-25'), False, expected_df=expected_df)    # B1 *
run_test(['size', 'age'], (1, '20-25'), True, expected_exception=KeyError) # B2

Problem description

When assigning via the loc indexer, I'm running into issues when the index holds string values. The example shows various attempts at using loc to assign with different indices: a single int column, a single string column, and a two-column index.

The variations are labeled in the code comments as follows:

For single-column indexes, use a flat value (A) or a tuple (B). Multi-column indexes only use a tuple (B).
Use df.loc[value, column] (1) vs. df.loc[df.index==value, column] (2).
Calls marked with * produce the expected frame.

Expected Output

I'd like to use a single variation for all index types, but no single method seems to work. Ideally, it would be 'B2', but that leaves the frame unchanged for a string-based index and raises KeyError for the two-column index.
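One possible workaround, sketched here as an assumption rather than anything pandas provides out of the box: build a boolean row mask from the index levels, so the same tuple form can be used for an int index, a string index, and a two-column index. The names `make_frame`, `mask_for`, and `assign_total` are hypothetical; `Index.get_level_values` is an existing method on both Index and MultiIndex.

```python
import numpy as np
import pandas as pd


def make_frame(total, index):
    """Rebuild the 12-row frame from the code sample, indexed by `index`."""
    df = pd.DataFrame({
        "dma": [501] * 6 + [502] * 6,
        "size": [1, 1, 1, 2, 2, 2] * 2,
        "age": ["20-25", "30-35", "40-45"] * 4,
        "total": total,
    })
    return df.set_index(index)


def mask_for(df, value):
    """Hypothetical helper: boolean row mask for a tuple with one entry per index level."""
    if len(value) != df.index.nlevels:
        raise ValueError("value needs one entry per index level")
    mask = np.ones(len(df), dtype=bool)
    for level, wanted in enumerate(value):
        # Compare one index level at a time and combine the results.
        mask &= np.asarray(df.index.get_level_values(level) == wanted)
    return mask


def assign_total(index, value):
    """Copy `total` from the 10x frame into the 1x frame for the matching rows."""
    df = make_frame(np.ones(12), index)
    df_10 = make_frame(10 * np.ones(12), index)
    mask = mask_for(df, value)
    # .to_numpy() makes the assignment positional, so duplicate labels are not aligned.
    df.loc[mask, "total"] = df_10.loc[mask, "total"].to_numpy()
    return df["total"].tolist()
```

With this sketch, `assign_total('size', (1,))`, `assign_total('age', ('20-25',))`, and `assign_total(['size', 'age'], (1, '20-25'))` all go through the same code path.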

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.7.2
pip: 18.0
setuptools: 40.2.0
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.11
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


gfyoung · 2018-08-25T08:20:06Z

cc @toobaz

shoyer · 2018-08-29T16:44:05Z

I think you've identified two inconsistencies/likely bugs here.

Part of the issue is that tuples are valid members of a standard index, so df.loc['20-25', 'total'] and df.loc[('20-25',), 'total'] could potentially have different semantics. But it is indeed weird that df.loc[1, 'total'] and df.loc[(1,), 'total'] give the same result -- we shouldn't have different behavior for string vs numeric indexes.
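A small illustration of the first point, assuming a hand-built index that is not part of the original report: a flat Index can hold a one-element tuple and a plain string as two distinct labels, which is why the two spellings cannot simply be treated as equivalent.

```python
import pandas as pd

# tupleize_cols=False keeps this a flat Index instead of building a MultiIndex.
idx = pd.Index([("20-25",), "20-25"], tupleize_cols=False)

is_flat = not isinstance(idx, pd.MultiIndex)
has_tuple_label = ("20-25",) in idx   # the tuple itself is a label
has_string_label = "20-25" in idx     # and so is the bare string
```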

To get the right result with a MultiIndex, you need to index either like df.loc['20-25', 'total'] or df.loc[('20-25', slice(None)), 'total'] (i.e., filling out all the trailing indexer levels). Unfortunately, you can't treat a non-MultiIndex like a single level MultiIndex -- as noted above, you'll need to unpack the tuple. This is part of the larger issue of consistency between the MultiIndex and Index APIs (#3268).
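A minimal sketch of filling out the trailing levels with slice(None), assuming a frame indexed by size and age as in the report (the frame below is rebuilt for illustration and sorted first, since MultiIndex slicing generally expects a sorted index):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "size": [1, 1, 1, 2, 2, 2] * 2,
    "age": ["20-25", "30-35", "40-45"] * 4,
    "total": np.ones(12),
}).set_index(["size", "age"]).sort_index()

# Select every row whose first level ("size") is 1; slice(None) fills the "age" level.
size_one = df.loc[(1, slice(None)), "total"]

# The same key works on the left-hand side of an assignment.
df.loc[(1, slice(None)), "total"] = 10.0
```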

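One option would be to avoid using a MultiIndex for indexing at all, and stick with a boolean indexer for the rows, e.g., df.loc[(df['size'] == 1) & (df['age'] == '20-25'), 'total']. (Bracket access is needed for the size column, because the attribute df.size is the number of elements in the frame.)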
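The boolean-indexer approach could look roughly like this, assuming size and age stay ordinary columns and the frame keeps its default integer index (the frames below are rebuilt for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "size": [1, 1, 1, 2, 2, 2] * 2,
    "age": ["20-25", "30-35", "40-45"] * 4,
    "total": np.ones(12),
})
df_10 = df.assign(total=10 * df["total"])

# df["size"] is the column; the attribute df.size is the element count of the frame.
mask = (df["size"] == 1) & (df["age"] == "20-25")

# Both frames share the same unique default index, so the right-hand side aligns row by row.
df.loc[mask, "total"] = df_10.loc[mask, "total"]
```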

jreback · 2018-08-29T16:49:34Z

pls look and see if this is a duplicate issue

gfyoung added the Indexing and API Design labels Aug 25, 2018

mroeschke added Bug and removed API Design labels Jun 22, 2021
