# Learning Objective

We have talked about methods and tools to access, set, and modify values in NumPy arrays. These include indexing, slicing, masking, fancy indexing, and combinations thereof. Here we look at similar means of accessing and modifying values in Pandas `Series` and `DataFrame` objects.

# Data Selection in Series

**Question:** What are the two analogies we drew that help us understand `Series`?

1. Dictionary
2. NumPy array

## Series as dictionary

In [23]:
import numpy as np
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [3]:
data['b']

0.5

In [12]:
data['e'] = 1.25

In [13]:
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

In [14]:
data['c'] = 0.8

In [15]:
data

a    0.25
b    0.50
c    0.80
d    1.00
e    1.25
dtype: float64

In [4]:
# query if a member is part of the keys (indexes)
'a' in data

True

In [11]:
data.items()

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [None]:
data.keys()

In [None]:
list(data.items())

`Series` objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a `Series` by assigning to a new index value:

In [None]:
data['e'] = 1.25
data

## Series as one-dimensional array

In [17]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [19]:
# slicing by explicit index
data['a':'c'] # notice the end is inclusive here

a    0.25
b    0.50
c    0.75
dtype: float64

In [20]:
# slicing by implicit integer index, notice how it is exclusive at the end 
data[0:2] 

a    0.25
b    0.50
dtype: float64

In [25]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [27]:
# fancy indexing
data[['a', 'c']]

a    0.25
c    0.75
dtype: float64

## Indexers: loc, iloc

These slicing and indexing conventions can be a source of confusion. For example, if your `Series` has an explicit integer index, an indexing operation such as `data[1]` will use the explicit indices, while a slicing operation like `data[1:3]` will use the implicit Python-style index.

In [29]:
data = pd.Series([1,3,5], index=['a', 'b', 'c'])

In [33]:
# integer labels: this is the Series used in the cells that follow
data = pd.Series(['a', 'b', 'c'], index=[1,3,5])

In [34]:
data

1    a
3    b
5    c
dtype: object

In [35]:
# explicit index when indexing
data[1] 

'a'

In [38]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

In [40]:
data.iloc[1:3]

3    b
5    c
dtype: object

In [41]:
data.loc[1:3]

1    a
3    b
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides special indexer attributes that explicitly expose certain indexing schemes. These are not methods but attributes that expose a particular slicing interface to the data in the `Series`.

First, the `loc` attribute allows indexing and slicing that always reference the explicit index:

In [42]:
data.loc[1]

'a'

In [43]:
data.loc[1:3]

1    a
3    b
dtype: object

In [44]:
data.loc[5]

'c'

The `iloc` attribute allows indexing and slicing that always references the implicit Python-style index:

In [46]:
data

1    a
3    b
5    c
dtype: object

In [45]:
data.iloc[1]

'b'

In [47]:
data.iloc[1:3]

3    b
5    c
dtype: object

One guiding principle of Python code is that "explicit is better than implicit." The explicit nature of `loc` and `iloc` makes them very useful in maintaining clean and readable code, especially in the case of integer indexes.
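
To make the contrast concrete, here is a minimal sketch that rebuilds the integer-indexed `Series` from the cells above. The names `s`, `by_label`, and `by_position` are ours, chosen so the sketch does not overwrite `data`.

```python
import pandas as pd

# Same integer-labelled Series as in the cells above
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc always uses the explicit labels; a label slice includes its end point
by_label = s.loc[1:3]

# iloc always uses the implicit positions; a position slice excludes its end point
by_position = s.iloc[1:3]

print(list(by_label.index))     # labels picked by loc
print(list(by_position.index))  # labels picked by iloc
```

The same slice expression, `1:3`, selects different rows depending on the indexer, which is why spelling out `loc` or `iloc` removes the ambiguity.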

# Data Selection in DataFrame

Recall that a `DataFrame` acts in many ways like a 2d array, and in other ways like a dictionary of `Series` structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.

## DataFrame as a Dictionary

In [2]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


The individual `Series` that make up the columns of the `DataFrame` can be accessed via dictionary-style indexing of the column name:

In [None]:
data['area']

Equivalently, we can use attribute-style access with column names that are strings:
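
A minimal sketch of attribute-style access, rebuilt on a two-state subset of the table so that it runs on its own (the name `df` is ours). It also illustrates a caveat: `pop` is the name of a `DataFrame` method, so `df.pop` does not return the column.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662})
pop = pd.Series({'California': 38332521, 'Texas': 26448193})
df = pd.DataFrame({'area': area, 'pop': pop})

# Attribute-style access yields the same values as dictionary-style access
same_values = df.area.equals(df['area'])
print(same_values)

# Caveat: df.pop is the DataFrame.pop method, not the 'pop' column,
# so dictionary-style access is the safer choice for such names
print(callable(df.pop))
print(isinstance(df['pop'], pd.Series))
```

Because of clashes like this, dictionary-style access is usually the more robust habit, particularly when assigning to columns.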

Like with the `Series` objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:
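
As a minimal sketch of this syntax, the cell below adds a hypothetical `pop_millions` column (not part of the original example) to a small standalone table named `df`, so the `data` object used in this notebook is left untouched.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# Assigning to a new key adds a column, just like adding a key to a dict
df['pop_millions'] = df['pop'] / 1e6  # hypothetical column, for illustration only

print(list(df.columns))
```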

**Question:** Consider the data as a dictionary, how would you create a new column called density that is the quotient of pop and area?

## DataFrame as 2d array

As mentioned previously, we can also view the `DataFrame` as an enhanced 2d array. We can examine the raw underlying data array using the `values` attribute:

In [None]:
data.values

With this picture in mind, we can perform many array-like operations on the `DataFrame` itself:

In [None]:
# transpose: swap the rows and columns
data.T

When it comes to indexing of `DataFrame` objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array. In particular, passing a single index to an array accesses a row:

In [None]:
data.values[0]

and passing a single "index" to a `DataFrame` accesses a column:

In [None]:
data['area']

Thus for array-style indexing, we need another convention. Here Pandas again uses the `loc` and `iloc` indexers mentioned earlier. Using the `iloc` indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index), but the `DataFrame` index and column labels are maintained in the result:

In [None]:
data.iloc[:3, :2]

In [None]:
# loc uses the explicit labels; notice how the slice includes the end
data.loc[:'Florida', :'area']

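The `ix` indexer once allowed a hybrid of these two approaches, but it was deprecated in pandas 0.20 and removed in pandas 1.0, so use `loc` and `iloc` instead. The remaining examples use a `density` column, computed as population divided by area: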

In [None]:
data['density'] = data['pop'] / data['area']

In [None]:
# the loc indexer lets us combine masking and fancy indexing
data.loc[data['density'] > 100, ['density','area']]

In [None]:
data.iloc[0,2] = 90

In [None]:
data

## Additional Indexing Conventions

There are a couple of extra indexing conventions to keep in mind: indexing refers to columns, while slicing refers to rows.

In [28]:
data['Florida':'Illinois']
# How would you write this using an indexer?

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


In [29]:
data[['area', 'density']] 
# How would you write this using an indexer?

              area     density
California  423967   90.413926
Texas       695662   38.018740
New York    141297  139.076746
Florida     170312  114.806121
Illinois    149995   85.883763


Such slices can also refer to rows by number rather than by index:

In [33]:
data[1:3]
# How would you write this using an indexer?

            area       pop     density
Texas     695662  26448193   38.018740
New York  141297  19651127  139.076746


Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [59]:
data[data.density > 100]
# How would you write this using an indexer?

            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121
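
One possible set of answers to the indexer questions in this section, as a sketch rather than the only correct form. It rebuilds the table under the name `df` (ours) so that it runs on its own, and the result names are likewise ours.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
df = pd.DataFrame({'area': area, 'pop': pop})
df['density'] = df['pop'] / df['area']

# Label slice of rows: loc makes the row axis explicit (end point included)
rows_by_label = df.loc['Florida':'Illinois']

# List of column names: keep every row, pick the named columns
cols = df.loc[:, ['area', 'density']]

# Positional slice of rows: iloc makes the positions explicit (end point excluded)
rows_by_position = df.iloc[1:3]

# Boolean mask: applied to the rows
dense = df.loc[df['density'] > 100]

print(list(rows_by_label.index))
print(list(cols.columns))
print(list(rows_by_position.index))
print(list(dense.index))
```

Writing the row and column selections separately inside `loc` or `iloc` makes it clear which axis each part refers to, something the shorthand forms leave implicit.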
