# Indexing and Selecting Data
* Contact: Lachlan Deer, [econgit] @ldeer, [github/Twitter] @lachlandeer

In yesterday's class on NumPy we looked at indexing, slicing, masking and modifying values in NumPy arrays. There are analogues to these operations in the pandas library that we will look into now.

Our focus will be on the pandas DataFrame rather than the pandas Series object, because we will be dealing with DataFrames more often than not.


In [None]:
import pandas as pd

First, let's load our example DataFrame that we assembled in the previous lesson:

In [None]:
df = pd.read_csv('out_data/state_labour_statistics.csv')
df.head()

Notice that the index was set as state-year-period when we worked with the data in our previous notebook, but we lost this structure once we saved to csv: the old index levels are now ordinary columns and the rows carry a default integer index. This is not a big deal. In a future notebook we will go back to setting the index as we desire.
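As a minimal sketch of what happens to the index in a csv round trip, the example below uses a small toy DataFrame with made-up values and an in-memory buffer instead of the real file. The names `toy`, `flat` and `restored` are only for illustration.

```python
import io
import pandas as pd

# Toy data with made-up values, indexed by state and year
toy = pd.DataFrame(
    {"state": ["Alabama", "Alabama", "Alaska"],
     "year": [2009, 2010, 2009],
     "unemployment_rate": [10.0, 9.5, 7.7]}
).set_index(["state", "year"])

# Write to an in-memory csv (a stand-in for a file on disk)
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading back without index_col gives a default integer index
buffer.seek(0)
flat = pd.read_csv(buffer)

# Passing index_col rebuilds the index from the named columns
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=["state", "year"])

print(flat.index)
print(restored.index.names)
```

Calling `flat.set_index(["state", "year"])` after loading would be another way to get the same structure back.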

As a review, we can look at some of the features of our DataFrame before continuing:

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.values
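To make these four attributes concrete, here is a small sketch on a toy DataFrame with made-up values (the real `df` will of course be larger). It also uses `to_numpy()`, which the pandas documentation recommends over `.values` for getting the underlying array.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame(
    {"state": ["Alabama", "Alaska"],
     "year": [2009, 2009],
     "unemployment_rate": [10.0, 7.7]}
)

n_rows, n_cols = toy.shape        # (number of rows, number of columns)
column_names = list(toy.columns)  # column labels
row_labels = list(toy.index)      # default integer row labels

# Columns of mixed types are stored in an array of dtype object
array = toy.to_numpy()

print(n_rows, n_cols)
print(column_names)
print(row_labels)
print(array.dtype)
```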

## Selecting Columns

We can select individual columns using a square bracket notation:

In [None]:
df['unemployment_rate']

or using 'dot' notation, which accesses the column as an attribute of the DataFrame:

In [None]:
df.unemployment_rate

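We can verify that both return the same data. We compare with `equals`, which checks the values, because an identity check with `is` is not reliable: each access may create a new Series object.

In [None]:
df.unemployment_rate.equals(df['unemployment_rate'])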

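Attribute-style access will not work if the column name conflicts with an existing method or attribute of a DataFrame (for example `count` or `shape`), or if the name is not a valid Python identifier (for example, a name containing spaces). In those cases, use the square bracket notation.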
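A short sketch of the name clash, using a hypothetical column called `count` that is not part of our dataset:

```python
import pandas as pd

# Made-up data; the column name 'count' clashes with DataFrame.count
toy = pd.DataFrame({"state": ["Alabama", "Alaska"], "count": [3, 5]})

by_bracket = toy["count"]  # the column, returned as a Series
by_attribute = toy.count   # the DataFrame.count method, not the column

print(type(by_bracket).__name__)
print(callable(by_attribute))
```

Square brackets always refer to the column, which is why they are the safer default in scripts.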

We can select multiple columns too by passing a list of columns to select:

In [None]:
df[['unemployment_rate' , 'qty_unemployed']]
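The type of the result depends on whether we pass a single name or a list. The sketch below shows this on a toy DataFrame with made-up values:

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame(
    {"state": ["Alabama", "Alaska"],
     "unemployment_rate": [10.0, 7.7],
     "qty_unemployed": [210000, 27000]}
)

as_series = toy["unemployment_rate"]   # single name: a Series
as_frame = toy[["unemployment_rate"]]  # list with one name: a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```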

If we feel the need, we can also transpose a DataFrame:

In [None]:
df.T

## Selecting Rows of Data

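Since the 'square-bracket' indexing is reserved for selecting columns of data, we need another way to access individual rows of data. Pandas offers us two indexers here:
* `loc`, which selects by label
* `iloc`, which selects by integer position

Older code may also use a third indexer, `ix`, which mixed the two styles. It was deprecated in pandas 0.20 and removed in pandas 1.0, so it is not available in current versions.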

Let's see how each of these work:

`iloc` selects by integer position, just like NumPy indexing (the end of a slice is excluded):

In [None]:
df.iloc[0:3,:]

In [None]:
df.iloc[0:10:2, 0:4]
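To check the claim that `iloc` behaves like NumPy indexing, the following sketch applies the same slice to a toy DataFrame (made-up values) and to its underlying array:

```python
import numpy as np
import pandas as pd

# Toy numeric data with made-up values
toy = pd.DataFrame(
    {"year": [2007, 2008, 2009, 2010],
     "unemployment_rate": [4.5, 5.8, 9.3, 9.6],
     "qty_unemployed": [10, 20, 30, 40]}
)

arr = toy.to_numpy()

# Every second row, first two columns, in both worlds
from_iloc = toy.iloc[0:4:2, 0:2]
from_numpy = arr[0:4:2, 0:2]

same = np.array_equal(from_iloc.to_numpy(), from_numpy)
print(same)
```

The difference is that `iloc` keeps the row and column labels attached to the result.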

The `loc` syntax allows us to index the data using explicit index and column labels. Unlike `iloc`, a `loc` slice includes both endpoints:

In [None]:
df.loc[0:10, 'state':'unemployment_rate']
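The endpoint rule is easy to trip over, so here is a small sketch on a toy DataFrame with a default integer index (made-up values), where the same slice gives a different number of rows under each indexer:

```python
import pandas as pd

# Toy data with made-up values and the default index 0..4
toy = pd.DataFrame(
    {"state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
     "unemployment_rate": [10.0, 7.7, 9.9, 7.8, 11.5]}
)

by_position = toy.iloc[0:3]  # positions 0, 1, 2 (end excluded)
by_label = toy.loc[0:3]      # labels 0, 1, 2, 3 (end included)

print(len(by_position), len(by_label))
```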

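Older versions of pandas also offered a hybrid `ix` indexer, which was useful if the index was set as text but we wanted to select rows by position while referring to columns by name. Since `ix` was removed in pandas 1.0, we use `loc` or `iloc` instead. Note that for our example labels and positions coincide, because `df` has the default integer index, so `loc` is all we need:

In [None]:
df.loc[:3, 'state':'unemployment_rate']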

but if instead we have

In [None]:
unemployed = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'unemployed':unemployed, 'population':population})
data



In [None]:
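data.iloc[1:3, data.columns.get_loc('population')]

This selects the second and third rows by position while naming the column, which is what the removed `ix` indexer did in this situation. It differs from `.loc`, which expects index labels: `data.loc[1:3, 'population']` fails because `1` and `3` are not labels in the text index.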

Any of our indexing conventions allow us to modify values if we so wish:

In [None]:
data.loc['Florida', 'population'] = 20000000
data
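Assignment also works with a boolean condition in the row position. The sketch below rebuilds a three-state version of the small DataFrame above (same numbers as in the example) and updates only the rows that satisfy the condition; the threshold is arbitrary.

```python
import pandas as pd

# A smaller copy of the example data above
toy = pd.DataFrame(
    {"unemployed": [423967, 695662, 170312],
     "population": [38332521, 26448193, 19552860]},
    index=["California", "Texas", "Florida"],
)

# Select rows and column in a single loc call, then assign
small = toy["population"] < 20_000_000
toy.loc[small, "population"] = 20_000_000

# Chained indexing such as toy[small]["population"] = ... may act on a
# temporary copy, so the assignment can be lost; prefer a single loc call.
print(toy)
```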

## Boolean Masking and Fancy Indexing

We can use NumPy-style access patterns within our indexers to select subsets of data:

In [None]:
df.loc[df.state =='Alabama']

In [None]:
df.loc[df.state =='Alabama', ['unemployment_rate']]

In [None]:
df.loc[df.unemployment_rate > 10.0, ['state', 'period', 'year', 'unemployment_rate']]

In [None]:
df.loc[(df.unemployment_rate > 10.0) & (df.state == 'California'), 
           ['state', 'period', 'year', 'unemployment_rate']]
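The masks above are ordinary boolean Series, so they can be stored in variables and combined with `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `>` or `==`. A minimal sketch with made-up values:

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame(
    {"state": ["Alabama", "California", "California", "Nebraska"],
     "year": [2009, 2009, 2010, 2010],
     "unemployment_rate": [11.0, 11.5, 12.2, 4.6]}
)

high = toy["unemployment_rate"] > 10.0
in_ca = toy["state"] == "California"

both = toy.loc[high & in_ca]    # high unemployment and California
either = toy.loc[high | in_ca]  # high unemployment or California
not_ca = toy.loc[~in_ca]        # everything except California

# isin tests membership in a list of values
several = toy.loc[toy["state"].isin(["Alabama", "Nebraska"])]

print(both)
```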

### Challenge

1. Select all employment data for Nebraska when the unemployment rate is greater than 5 percent
2. Select all unemployment data in December for the Carolinas
3. Select all unemployment data for the years 2007-2010 in California

#### Solutions

In [None]:
df.loc[(df.unemployment_rate > 5) & (df.state == 'Nebraska'), 
           ['state', 'period', 'year', 'qty_employed']]

In [None]:
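# Note: this keeps every period for both Carolinas; to keep only December,
# add a condition on `period` (its coding depends on the data file)
df.loc[(df.state == 'South Carolina') | (df.state == 'North Carolina'),
          ['state', 'period', 'year', 'unemployment_rate', 'qty_unemployed']]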

In [None]:
# alternative: match every state whose name contains "Carolina"
df.loc[df.state.str.contains("Carolina"),
          ['state', 'period', 'year', 'unemployment_rate', 'qty_unemployed']]

In [None]:
df.loc[(df.state == 'California') & (df.year >= 2007) & (df.year <= 2010),
          ['state', 'period', 'year', 'unemployment_rate', 'qty_unemployed']]

In [None]:
# alternatively, use between(), which includes both endpoints by default
df.loc[(df.state == 'California') & df.year.between(2007, 2010),
          ['state', 'period', 'year', 'unemployment_rate', 'qty_unemployed']]

## The Query Syntax

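As an alternative to indexing with `loc` and `iloc`, pandas also has a `query` method, which takes the selection condition as a string. This is particularly handy when we have multi-index data: the boolean masks above are built from columns, whereas `query` can also refer to named index levels.

An example usage is: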

In [None]:
state = 'California'
df.query('2007 <= year <= 2010 and state == @state')
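To illustrate the multi-index case, here is a sketch on a toy DataFrame (made-up values) where `state` and `year` are index levels rather than columns. The variable name `target` is only for illustration; the `@` prefix tells `query` to look it up in the surrounding Python scope.

```python
import pandas as pd

# Toy data with made-up values; state and year become index levels
toy = pd.DataFrame(
    {"state": ["California", "California", "California", "Texas"],
     "year": [2006, 2008, 2010, 2008],
     "unemployment_rate": [4.9, 7.3, 12.2, 4.8]}
).set_index(["state", "year"])

target = "California"

# Named index levels can be used in the query string like columns
result = toy.query("state == @target and 2007 <= year <= 2010")

print(result)
```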