# Referencing and indexing in Pandas

Links:
* https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-label
* https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced

In [1]:
import pandas as pd

In [2]:
d = pd.DataFrame({'x' : [1,1,3], 'y' : [4,3,1], 'a': ['dog','cat','monk']})
d

   x  y     a
0  1  4   dog
1  1  3   cat
2  3  1  monk


## Chained assignment

In [3]:
print(d.x[1]) # Fine for reading
d.x[1] = 2    # Bad way to write: chained assignment. Warns, and reaches d only when d.x happens to be a view
d.x[1] == 2

1


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


True

In [4]:
print(d['x'][1])
d['x'][1] = 3 # Same chained assignment with bracket syntax: warns, and works only because d['x'] is a view here
d['x'][1] == 3

2


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


True

In [5]:
print(d.x.iloc[1])
d.x.iloc[1] = 4 # Still chained (d.x first, then .iloc): may warn, and is not guaranteed to reach d
d.x.iloc[1] == 4

3


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


True

In [6]:
print(d.iloc[1].x) # Another chained access: fine for reading, bad for writing
d.iloc[1].x = 5    # DOES NOT WORK: d.iloc[1] is a new Series (a copy of the row), so d is untouched
d.iloc[1].x == 5

4


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


False

In [7]:
d.iloc[1,0] = 6 # Good, legal way, based on row / column numbers.
d.iloc[1,0] == 6

True

In [8]:
d.loc[1,'x'] = 7 # Good, legal way, based on row / column labels.
d.loc[1,'x'] ==7

True

In [9]:
print(d)
d.x.iloc[1:] = [0,0] # Chained again; works here, but d.iloc[1:, 0] = [0,0] is the reliable form
print(d)

   x  y     a
0  1  4   dog
1  7  3   cat
2  3  1  monk
   x  y     a
0  1  4   dog
1  0  3   cat
2  0  1  monk
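
The chained writes above go through an intermediate object, so whether they reach `d` depends on pandas internals (and, as far as I understand, with Copy-on-Write enabled they never do). A minimal sketch of the same kind of updates done in one indexing call each; `df` is just a fresh copy of the example data, not a name from the cells above:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 3], 'y': [4, 3, 1], 'a': ['dog', 'cat', 'monk']})

# One .loc call selects row and column together, so the write goes to df itself
df.loc[1, 'x'] = 7

# Same idea by position: rows 1 onwards, column 0
df.iloc[1:, 0] = [0, 0]

# Mixing a row position with a column label: translate the label to a position first
df.iloc[0, df.columns.get_loc('y')] = 40

print(df)
```

None of these lines should trigger the "copy of a slice" warning, because there is no intermediate object between `df` and the assignment.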


In [10]:
# With nested lists, a[1:] is a new outer list that still holds the same inner lists, so these chained writes all reach a
a = [[1,2,3],[4,5,6],[7,8,9]]
a[1][0] = 5
print('Assigning one value: ',a[1][0] == 5)
a[1:][0][1] = 7
print('Assigning one value after slicing:' ,a[1:][0][1] == 7)
a[1:][0][1:] = [0,0]
print('Assigning 2 values at once, after slicing: ',a[1:][0][1:] == [0,0])
a

Assigning one value:  True
Assigning one value after slicing: True
Assigning 2 values at once, after slicing:  True


[[1, 2, 3], [5, 0, 0], [7, 8, 9]]
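
Why the list version always "works" can be checked directly with identity tests. A small sketch (the name `tail` is only for illustration):

```python
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

tail = a[1:]             # slicing builds a new outer list ...
print(tail is a)         # False
print(tail[0] is a[1])   # True: ... that refers to the very same inner lists

tail[0][0] = 50          # mutating an inner list is visible through a
tail[1] = ['replaced']   # rebinding a slot of the new outer list is not
print(a)
```

So writes that end in an inner list reach `a`, while writes to the sliced outer list itself are lost, which is the same trap as with pandas, just harder to hit.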

In [11]:
# Compare to numpy
import numpy as np

a = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(a)

print(a[:][0]) # 0th row of all rows
print(a[0][:]) # All elements of 0th row (same thing)

print(a[1:][0]) # 0th row of last 2 rows [4 5 6]
print(a[0][1:]) # Last 2 elements of 0th row [2 3]

a[1:][0] = [0,0,0] # Works: a[1:] is a view, so the write reaches a
print(a)

a[0][1:] = [-1,-1] # Also works: a[0] is a view of the first row
print(a)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
[1 2 3]
[1 2 3]
[4 5 6]
[2 3]
[[1 2 3]
 [0 0 0]
 [7 8 9]]
[[ 1 -1 -1]
 [ 0  0  0]
 [ 7  8  9]]
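
The numpy writes above succeed because basic slices are views. With an integer list as the index ("fancy" indexing) the intermediate result is a copy, and the chained write is lost. A minimal sketch of the difference:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

view = a[1:]       # basic slice: a view onto the same memory
copy = a[[1, 2]]   # integer-list indexing: an independent copy

print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False

a[1:][0] = 0        # chained write through a view: a changes
a[[1, 2]][1] = -1   # chained write into a temporary copy: a does not change
a[[1, 2], 0] = 99   # a single indexing expression writes to a itself
print(a)
```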


## Bounds

In [12]:
# Pandas again: slices that run past the end of the frame
d.iloc[1:4] # When reading, the out-of-bounds part of a slice is silently ignored

   x  y     a
1  0  3   cat
2  0  1  monk


In [19]:
# If you try to read with slices from completely outside, you get an empty dataframe
d.iloc[:,5:6]

Empty DataFrame
Columns: []
Index: [0, 1, 2]


In [18]:
# When writing, they raise an error: the slice covers 2 rows, but 3 values are supplied
try:
    d.x.iloc[1:4] = [0,0,0]
except Exception:
    print('Oops')

Oops


In [17]:
# Same for single indices:
try:
    d.x.iloc[4]
except IndexError: # single positional indexer is out-of-bounds
    print('Oops!')

Oops!
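
The bounds rules can be collected in one place. A short sketch on a fresh frame (`df` is an illustrative name), catching the specific exception types rather than everything:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 3], 'y': [4, 3, 1]})

# Positional slices are clipped to the frame, like Python list slices
print(len(df.iloc[1:100]))    # 2
print(df.iloc[10:20].empty)   # True

# A single out-of-range position raises IndexError
try:
    df.iloc[10]
except IndexError as err:
    print('IndexError:', err)

# A missing row label raises KeyError with .loc
try:
    df.loc[10]
except KeyError as err:
    print('KeyError:', err)
```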


## at

In [89]:
d = pd.DataFrame({'x' : [1,1,3], 'y' : [4,3,1], 'a': ['dog','cat','monk']})
print(d)
print(d.iat[0,1])  # Scalar access by position: one row and one column, no slices or boolean masks
print(d.at[0,'y']) # Scalar access by label; less overhead than .loc/.iloc, so preferable for single cells

   x  y     a
0  1  4   dog
1  1  3   cat
2  3  1  monk
4
4
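
`at` and `iat` also work for writing a single cell. A minimal sketch on a fresh frame (`df` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 3], 'y': [4, 3, 1], 'a': ['dog', 'cat', 'monk']})

df.at[0, 'y'] = 40   # write one cell by row label and column label
df.iat[2, 0] = 30    # write one cell by row position and column position

print(df.at[0, 'y'], df.iat[2, 0])
print(df.loc[0, 'y'] == df.at[0, 'y'])  # .loc reads the same cell
```

Since each call is a single indexing operation on `df`, there is no chained assignment involved.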


# Conditional indexing

In [32]:
d = pd.DataFrame({'x' : [1,1,3], 'y' : [4,3,1], 'a': ['dog','cat','monk']})
d

   x  y     a
0  1  4   dog
1  1  3   cat
2  3  1  monk


In [42]:
# With direct indexing by a boolean mask
d.loc[d.x<3]

   x  y    a
0  1  4  dog
1  1  3  cat


In [59]:
# With a lambda function. The benefit is that it can run on a transient df
# that can't be referred to by name, as when chaining object-producing methods.
# Useful for grouping, summarizing etc.
d.loc[lambda df: df.x<3]

   x  y    a
0  1  4  dog
1  1  3  cat
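
To make the "transient df" point concrete, here is a sketch of a method chain where the frame being filtered never gets a name. The column `total` and the variable `result` are made up for the illustration:

```python
import pandas as pd

d = pd.DataFrame({'x': [1, 1, 3], 'y': [4, 3, 1], 'a': ['dog', 'cat', 'monk']})

result = (
    d.assign(total=d.x + d.y)            # intermediate frame with a new column
     .loc[lambda df: df['total'] < 5]    # df is that intermediate frame, not d
     .loc[:, ['a', 'total']]
)
print(result)
```

Without the lambda, the filter would have to refer to `d`, which has no `total` column.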


In [94]:
# Combining simple conditions with & (each clause needs its own parentheses)
d.loc[(d.x**2<9) & (d.y>3)]

   x  y    a
0  1  4  dog
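
The other element-wise operators follow the same pattern. A short sketch with `&`, `|` and `~`, plus what happens when the parentheses are left out (variable names are illustrative):

```python
import pandas as pd

d = pd.DataFrame({'x': [1, 1, 3], 'y': [4, 3, 1], 'a': ['dog', 'cat', 'monk']})

both = d.loc[(d.x < 3) & (d.y > 3)]          # and
either = d.loc[(d.x > 1) | (d.y > 3)]        # or
neither = d.loc[~((d.x > 1) | (d.y > 3))]    # not

# Without parentheses, & binds tighter than the comparisons, so Python reads
# this as the chained comparison d.x < (3 & d.y) > 3 and raises
try:
    d.loc[d.x < 3 & d.y > 3]
except ValueError as err:
    print('ValueError:', err)
```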


In [146]:
# NumPy functions work too, as long as they are vectorized (applied to the whole column at once)
d.loc[np.sqrt(d.y)<2]

   x  y     a
1  1  3   cat
2  3  1  monk


In [150]:
# So if your custom function returns an array of the same length, you can use it
f = lambda x: np.array([np.sqrt(a) for a in x])
d.loc[f(d.y)<2]

   x  y     a
1  1  3   cat
2  3  1  monk


In [152]:
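# For example, by returning a Series (masks align on labels, so its default index must match d's index)
f = lambda x: pd.Series([a[0] for a in x])
f('dog') # Works on any iterable of strings; here it gives the letters d, o, g (not displayed)
d.loc[f(d.a)=='d']

   x  y    a
0  1  4  dog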


In [139]:
# Or one can use a list comprehension that yields one boolean per row
d.loc[(d.x<3) & [x[0]=='d' for x in d.a]]

   x  y    a
0  1  4  dog


In [92]:
# map also works, but a comprehension is usually easier to read
d.loc[(d.x<3) & list(map(lambda x:x[0]=='d', d.a))]

   x  y    a
0  1  4  dog


In [156]:
# Often there is a ready-made vectorized method, such as the .str accessor for strings:
d.loc[d.a.str.startswith('d')]

   x  y    a
0  1  4  dog
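
The `.str` accessor covers most of what the comprehensions above do by hand. A few more sketches in the same spirit (the result names are illustrative):

```python
import pandas as pd

d = pd.DataFrame({'x': [1, 1, 3], 'y': [4, 3, 1], 'a': ['dog', 'cat', 'monk']})

first_is_d = d.loc[d.a.str[0] == 'd']     # .str[0] takes the first character of every string
long_names = d.loc[d.a.str.len() > 3]     # string length per row
has_o = d.loc[d.a.str.contains('o')]      # substring test per row

print(first_is_d, long_names, has_o, sep='\n\n')
```

Each expression returns a boolean Series with the same index as `d`, so it combines with other conditions through `&` and `|` without alignment surprises.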


In [120]:
# For simple reports, one can also use query.
# One benefit is human readability; another is that it can be faster on large frames.
# It also supports `in` and `not`, and unlike boolean indexing it doesn't require parentheses around clauses.
d.query('x<3 and y>3')

   x  y    a
0  1  4  dog


In [122]:
# It can't index into each string though: a[0] is not applied element by element
try:
    d.query("a[0]=='d'")
except Exception:
    print('Oops')

Oops
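
A few `query` features mentioned above, as a sketch. The local names `limit` and `wanted` are made up; the last line assumes that string methods can be called inside the expression when `engine='python'` is passed, which works in the pandas versions I have tried but is worth checking on yours:

```python
import pandas as pd

d = pd.DataFrame({'x': [1, 1, 3], 'y': [4, 3, 1], 'a': ['dog', 'cat', 'monk']})

limit = 3
wanted = ['dog', 'cat']

below = d.query('x < @limit')                  # @ refers to a local Python variable
member = d.query('a in @wanted and y > 3')     # membership test with in
negated = d.query('not (x < 3 and y > 3)')     # negation with not

# Assumption: .str methods inside the expression need the Python engine
starts_d = d.query('a.str.startswith("d")', engine='python')

print(below, member, negated, starts_d, sep='\n\n')
```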


In [95]:
# Selecting rows where value is in a list
d.loc[d.a.isin(['dog','cat'])]

   x  y    a
0  1  4  dog
1  1  3  cat


In [126]:
# Boolean indexing in a single .loc call is writable
d.loc[d.a.isin(['dog','cat']),'x'] = [0,0]
d

   x  y     a
0  0  4   dog
1  0  3   cat
2  3  1  monk


In [133]:
# Query is not writable: it returns a new frame, so the assignment never reaches d
d.query('x==0').iloc[:,0] = [7,7]
d

   x  y     a
0  0  4   dog
1  0  3   cat
2  3  1  monk
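
If the condition is easier to express as a query string, one workaround is to use `query` only to find the row labels and then write through `.loc`. A minimal sketch (the name `rows` is illustrative):

```python
import pandas as pd

d = pd.DataFrame({'x': [1, 1, 3], 'y': [4, 3, 1], 'a': ['dog', 'cat', 'monk']})

rows = d.query('x < 3 and y > 2').index   # labels of the matching rows
d.loc[rows, 'x'] = [7, 7]                 # a single .loc call writes to d itself
print(d)
```

This relies on the index labels being unique; with duplicate labels `.loc[rows, ...]` would select more rows than the query returned.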


# Masking

Unlike boolean indexing, `where` and `mask` keep the shape of the frame and blank out individual cells with `NaN` instead of dropping rows.

In [96]:
d = pd.DataFrame({'x' : [1,1,3], 'y' : [4,3,1], 'a': ['dog','cat','monk']}) # Fresh frame again
d

   x  y     a
0  1  4   dog
1  1  3   cat
2  3  1  monk


In [108]:
# Positive masking: keep the cells where the condition holds, NaN everywhere else
d.where(d==1)

     x    y    a
0  1.0  NaN  NaN
1  1.0  NaN  NaN
2  NaN  1.0  NaN


In [109]:
# Negative masking: blank out the cells where the condition holds
d.mask(d==1)

     x    y     a
0  NaN  4.0   dog
1  NaN  3.0   cat
2  3.0  NaN  monk
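
Both methods take a second argument, `other`, that supplies the replacement instead of `NaN`, which is where they become useful for conditional replacement. A short sketch on single columns (the result names are illustrative):

```python
import pandas as pd

d = pd.DataFrame({'x': [1, 1, 3], 'y': [4, 3, 1], 'a': ['dog', 'cat', 'monk']})

floor_to_zero = d.y.where(d.y > 1, other=0)           # keep y where y > 1, otherwise 0
capped = d.y.mask(d.y > 3, other=3)                   # replace y where y > 3 by 3
short_only = d.a.where(d.a.str.len() <= 3, other='?') # works for strings too

print(floor_to_zero.tolist(), capped.tolist(), short_only.tolist())
```

The results keep the index and length of the original column, so they can be assigned straight back with `d['y'] = ...`.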


# Queries

In [4]:
d

   x  y     a
0  1  4   dog
1  1  3   cat
2  3  1  monk


In [20]:
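d2 = d.query('x<3')            # A new frame, but pandas still tracks it as derived from d
d2.y = d2.y.map(lambda x: x+2) # So writing to it warns, even though d2 is updated
d2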

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


   x  y    a
0  1  6  dog
1  1  5  cat


In [16]:
d2 = d.query('x<3').copy()     # An explicit copy makes d2 independent of d: no warning
d2.y = d2.y.map(lambda x: x+2)
d2

   x  y    a
0  1  6  dog
1  1  5  cat


In [17]:
d

   x  y     a
0  1  4   dog
1  1  3   cat
2  3  1  monk


In [18]:
d2.loc[0,'x'] = 10 # Writes to the copy only
d2

    x  y    a
0  10  6  dog
1   1  5  cat


In [19]:
d

   x  y     a
0  1  4   dog
1  1  3   cat
2  3  1  monk
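
Another way to get a modified subset without touching `d`, and without the warning, is to avoid assignment altogether and build the new column with `assign`. A sketch of the same `y + 2` update:

```python
import pandas as pd

d = pd.DataFrame({'x': [1, 1, 3], 'y': [4, 3, 1], 'a': ['dog', 'cat', 'monk']})

# assign returns a new frame; the lambda receives the filtered frame
d2 = d.query('x < 3').assign(y=lambda df: df.y + 2)

print(d2)
print(d)   # unchanged
```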
