# Data Indexing and Selection

Series:
- as a dictionary
 - ``.index``
 - ``.values``
 - modify by index
- as a NumPy array
 - slicing
 - masking
 - fancy indexing
- indexers
 - ``.loc``
 - ``.iloc``

DataFrame:
- as a dictionary
 - ``.index``
 - ``.columns``
 - ``.values``
- as a NumPy array
 - ``.T``
- indexers
 - ``.loc``
 - ``.iloc``


## Data Selection in Series
A ``Series`` combines features of a NumPy array and a dictionary.

In [17]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

### ``Series`` as dictionary

In [3]:
# access data by its key
data['a']

0.25

In [4]:
'a' in data

True

In [26]:
# unlike a dictionary, a Series has keys() and items() methods but no values() method
# (values is an attribute, shown below)
data.keys(), list(data.items())

(Index(['a', 'b', 'c', 'd'], dtype='object'),
 [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)])

In [25]:
# a Series also exposes the index and values attributes
data.index, data.values

(Index(['a', 'b', 'c', 'd'], dtype='object'), array([0.25, 0.5 , 0.75, 1.  ]))

### Modifying a ``Series`` like a dictionary

In [30]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place; the user generally does not need to worry about these issues.
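
Dictionary-style assignment covers two cases: writing to an existing label and adding a new one. The following is a minimal sketch on a throwaway ``Series`` named ``s`` (a hypothetical name, used so the ``data`` object above is left untouched).

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

s['b'] = 0.55   # existing label: the stored value is overwritten
s['e'] = 1.25   # new label: the Series grows by one entry

print(len(s))
print(list(s.index))
```

The same syntax handles both cases, which is why the user rarely has to think about how the underlying array is resized.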

### ``Series`` as one-dimensional array

In [32]:
# slicing by explicit index
data['a':'c']  # the final label is included

a    0.25
b    0.50
c    0.75
dtype: float64

In [33]:
# slicing by implicit integer index
data[0:2]  # the final position is excluded

a    0.25
b    0.50
dtype: float64

In [34]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [35]:
# fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

### Indexers: loc, iloc

These slicing and indexing conventions can be a source of confusion.
For example, if your ``Series`` has an explicit integer index, an indexing operation such as ``data[1]`` will use the explicit indices, while a slicing operation like ``data[1:3]`` will use the implicit Python-style index.

Note that in current Pandas versions, ``.ix`` is deprecated.

In [40]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [36]:
# explicit index when indexing
data[1]

'a'

In [37]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special *indexer* attributes that explicitly expose certain indexing schemes. 
One guiding principle of Python code is that "explicit is better than implicit."
The explicit nature of ``loc`` and ``iloc`` makes them very useful in maintaining clean and readable code. Especially in the case of integer indexes, I recommend using them, both to make code easier to read and understand and to prevent subtle bugs caused by the mixed indexing/slicing convention.

### ``loc`` 
The ``loc`` attribute allows indexing and slicing that always reference the explicit index:

In [41]:
data.loc[1]

'a'

In [42]:
data.loc[1:3]

1    a
3    b
dtype: object

### ``iloc`` 
The ``iloc`` attribute allows indexing and slicing that always reference the implicit Python-style index:

In [43]:
data.iloc[1]

'b'

In [44]:
data.iloc[1:3]

3    b
5    c
dtype: object
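
The two indexers also differ in how they treat the end of a slice. A short sketch on the same integer-labelled data (rebuilt here as ``s``, a hypothetical name, so the block runs on its own):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[1:3]      # labels 1 through 3, endpoint included
by_position = s.iloc[1:3]  # positions 1 and 2, endpoint excluded

print(list(by_label.index))
print(list(by_position.index))
```

Label slices include their endpoint, while positional slices follow the usual Python half-open convention.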

Older Pandas releases also provided a third indexing attribute, ``ix``, a hybrid of the two that for ``Series`` objects was equivalent to standard ``[]``-based indexing.
It was deprecated in Pandas 0.20 and removed in Pandas 1.0, so use ``loc`` and ``iloc`` instead, including for the ``DataFrame`` objects discussed next.
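
Mixed access of the kind ``ix`` was used for (a row by label, a column by position) can be written with ``loc`` or ``iloc`` alone. This is a minimal sketch, assuming a small hypothetical ``DataFrame`` named ``df``. ``Index.get_loc`` and positional indexing of ``df.columns`` do the translation between labels and positions.

```python
import pandas as pd

# Hypothetical frame used only for this illustration
df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]},
                  index=['California', 'Texas', 'New York'])

# Row by label, column by position: turn the position into a label for loc
value_via_loc = df.loc['Texas', df.columns[1]]

# ...or turn the label into a position for iloc
value_via_iloc = df.iloc[df.index.get_loc('Texas'), 1]

print(value_via_loc, value_via_iloc)
```

Either form keeps the indexing scheme explicit, which is the point of preferring ``loc`` and ``iloc``.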



## Data Selection in DataFrame

In [69]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


### ``DataFrame`` as a dictionary of columns

In [50]:
data['area']  # dictionary-style access (the safer choice)

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Equivalently, we can use attribute-style access with column names that are strings:

In [47]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

This attribute-style column access actually accesses the exact same object as the dictionary-style access:

In [48]:
data.area is data['area']

True

Though this is a useful shorthand, keep in mind that it does not work for all cases!
For example, if the column names are not strings, or if the column names conflict with methods of the ``DataFrame``, this attribute-style access is not possible.
For example, the ``DataFrame`` has a ``pop()`` method, so ``data.pop`` will point to this rather than the ``"pop"`` column:

In [49]:
data.pop is data['pop']

False

In particular, you should avoid the temptation to try column assignment via attribute (i.e., use ``data['pop'] = z`` rather than ``data.pop = z``).
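
A related pitfall is assigning to an attribute name that is not yet a column: Pandas sets a plain Python attribute instead of creating the column. The sketch below uses a hypothetical two-row ``df`` and a new ``density`` name (not the ``pop`` example above) to show the difference. Pandas emits a ``UserWarning`` for this pattern, which is silenced here to keep the output clean.

```python
import warnings
import pandas as pd

# Hypothetical frame mirroring the columns used above
df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

with warnings.catch_warnings():
    warnings.simplefilter('ignore')      # hide the pandas warning for the demo
    df.density = df['pop'] / df['area']  # sets an attribute, not a column

created_by_attribute = 'density' in df.columns

df['density'] = df['pop'] / df['area']   # item assignment creates the column
created_by_item = 'density' in df.columns

print(created_by_attribute, created_by_item)
```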

In [70]:
data['density'] = data['pop'] / data['area']
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


### ``DataFrame`` as two-dimensional array

In [51]:
data.values

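array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])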

In [52]:
# transpose: swap rows and columns
data.T

Unnamed: 0,California,Texas,New York,Florida,Illinois
area,423967.0,695662.0,141297.0,170312.0,149995.0
pop,38332521.0,26448193.0,19651127.0,19552860.0,12882135.0
density,90.413926,38.01874,139.076746,114.806121,85.883763


In [53]:
data.values[0]  # a single index on the underlying array selects a row

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [54]:
data['area']  # a single index on the DataFrame selects a column

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

### ``.iloc``
Indexes the underlying array by implicit integer position, while keeping the row and column labels in the result:

In [28]:
data.iloc[:3, :2]

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127


### ``.loc`` 
Indexes using the explicit index and column names:

In [55]:
data.loc[:'Illinois', :'pop']

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


In [60]:
# combining masking and fancy indexing in loc
data.loc[data.density > 100, ['pop', 'density']]

Unnamed: 0,pop,density
New York,19651127,139.076746
Florida,19552860,114.806121


In [63]:
# iloc does not accept a boolean Series as a mask
data.iloc[data.density > 100, :2]  # raises an error (the syntax itself is valid)
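
Although ``iloc`` rejects a boolean ``Series``, it does accept a plain NumPy boolean array, so the mask can be converted first. A minimal sketch, rebuilding the frame as ``df`` (a hypothetical name) so that the block is self-contained:

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
df = pd.DataFrame({'area': area, 'pop': pop})
df['density'] = df['pop'] / df['area']

# Convert the boolean Series to a NumPy array before handing it to iloc
mask = (df['density'] > 100).to_numpy()
subset = df.iloc[mask, :2]   # masked rows, first two columns by position

print(subset)
```

When the columns can be named, ``loc`` with the boolean ``Series`` is usually the more readable choice.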

### Modifying a ``DataFrame``

In [65]:
data.iloc[0, 2] = 90
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.0
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [71]:
data.loc['Texas', 'density'] = 100
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.0
Texas,695662,26448193,100.0
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


## Additional indexing conventions

There are a couple of extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice.
First, while *indexing* refers to columns, *slicing* refers to rows:

### Indexing refers to columns

In [74]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [None]:
#data['area':'density']  # fails: a slice is looked up among the row labels, not the columns

In [79]:
data[['area','pop','density']] # this works

Unnamed: 0,area,pop,density
California,423967,38332521,90.0
Texas,695662,26448193,100.0
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


### slicing refers to rows

In [72]:
data['Florida':'Illinois']

Unnamed: 0,area,pop,density
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [None]:
#data['Florida']  # fails with a KeyError: a single label is looked up among the columns

In [77]:
data['Florida':'Florida'] # this works

Unnamed: 0,area,pop,density
Florida,170312,19552860,114.806121


Such slices can also refer to rows by number rather than by index:

In [85]:
#data[1]  # fails with a KeyError: a single key is looked up among the columns, so use a slice

In [86]:
data[1:3]

Unnamed: 0,area,pop,density
Texas,695662,26448193,100.0
New York,141297,19651127,139.076746


### masking
Direct masking operations are also interpreted row-wise rather than column-wise:

In [87]:
data[data.density > 100]

Unnamed: 0,area,pop,density
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121


These two conventions are syntactically similar to those on a NumPy array, and while these may not precisely fit the mold of the Pandas conventions, they are nevertheless quite useful in practice.
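
The ``[]`` conventions from this section can be collected in one place. The sketch below uses a small hypothetical ``df`` rather than the ``data`` object above, so that it runs on its own.

```python
import pandas as pd

# Hypothetical three-row frame for illustration
df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]},
                  index=['California', 'Texas', 'New York'])

column = df['area']                  # single label   -> a column (Series)
columns = df[['area', 'pop']]        # list of labels -> columns (DataFrame)
label_rows = df['Texas':'New York']  # label slice    -> rows, endpoint included
position_rows = df[0:2]              # integer slice  -> rows by position
masked_rows = df[df['pop'] > 20_000_000]  # boolean mask -> rows

print(list(label_rows.index))
print(list(position_rows.index))
print(list(masked_rows.index))
```

When the intent matters more than brevity, the same selections can be written with ``loc`` and ``iloc`` to make the row/column choice explicit.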
