# Data Indexing and Selection
---

In the previous units we learnt about the concepts of __Series__ and __DataFrames__ and how to construct them. Let's explore the ways we can access subsets of data from them. 

---
## 1. Data Indexing and Selection - Definition:
__Data Indexing__  (also __Subset Selection__) in Pandas simply means selecting a certain sub-part of data from a Pandas Object.

In the context of a __Series__, data indexing could mean selecting one or multiple elements of a Series. In the context of __DataFrames__, data indexing could refer to selecting a subset of rows and columns, individual values, etc.

In [1]:
import pandas as pd
import datetime as dt

---
## 2. Series - Data Indexing and Selection:
Recall that __Series__ are 1-dimensional objects of indexed data.
We can index a Series object in one of the following ways:
- __Explicit vs Implicit Indexing__
- __Slicing__
- __Boolean Indexing__

---
### 2.1 Explicit vs Implicit Indexing:
The first way to select an element from a Series is by using its __index__ as a key!


- __Explicit Indexing__ - accessing a single element via the actual label (name) of its corresponding index
- __Implicit Indexing__ - accessing a single element via the integer position of its corresponding index

Syntax:
-  __Explicit Indexing__ - `series_name.loc[index_label]`
- __Implicit Indexing__ - `series_name.iloc[index_position]`


__Point to Note__: Indexing is done by using one of two possible accessors - `.loc` and `.iloc`. An easy way to distinguish them from one another is by remembering that `.iloc` stands for __integer location__. In fact, associating the __i__ in `.iloc` with __integer, index,__ or even __implicit__ will help you remember which indexing method does what in the future! 

In [2]:
s = pd.Series(["a", "b", "c", "d"], index=[1, 2, 3, 4])
s

1    a
2    b
3    c
4    d
dtype: object

In [3]:
# Explicit indexing (using actual index labels)
s.loc[1]

'a'

In [4]:
# Implicit indexing (using the position)
# Recall that Series are 0-indexed -- element at position 1 is in fact the 2nd element in the Series
s.iloc[1]

'b'

Using __unspecified indexing__ is __NOT recommended!__

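Without an accessor, Pandas has to decide on its own whether the key is a label or a position. With an integer index (as in `s` above), the key is matched against the index labels. With a non-integer index, older Pandas versions fall back to the integer position, a behaviour that has been deprecated in recent releases. Either way, omitting the accessor when indexing a Series introduces __ambiguity__.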

In [7]:
# Unspecified indexing
s[1]

'a'
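
The ambiguity can be sketched with the same Series. The variable names below are illustrative only, and the behaviour shown applies to a Series with an integer index like this one:

```python
import pandas as pd

# Same Series as above: the index labels are the integers 1..4
s = pd.Series(["a", "b", "c", "d"], index=[1, 2, 3, 4])

by_label = s.loc[4]      # label 4 -> last element
by_position = s.iloc[3]  # position 3 -> last element
plain = s[4]             # integer index: the key is treated as a label

print(by_label, by_position, plain)

# Label 0 does not exist, so plain brackets raise a KeyError
# rather than returning the element at position 0
try:
    s[0]
    missing_label_raises = False
except KeyError:
    missing_label_raises = True

print(missing_label_raises)
```

Writing `.loc` or `.iloc` makes the intent visible to the reader and does not depend on the type of the index.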

---
### 2.2 Slicing:
We learnt how to access a single element of a Series via Explicit or Implicit Indexing.
To access a subset of elements of a Series, we use a technique called __Slicing__!


__Slicing__ leverages the same accessors as Indexing - `.loc` and `.iloc`. This time, however, instead of passing a single argument (the index label or the index position of the desired element), we pass two arguments, a start and a stop, delimited by a colon (__:__).


Syntax:
- __Explicit Slicing__ - `series_name.loc[start_index_label: stop_index_label]` - value corresponding to stop_index_label is __included__ in the output
- __Implicit Slicing__ - `series_name.iloc[start_index_position: stop_index_position]` - value corresponding to stop_index_position is __excluded__ from the output

In [8]:
s = pd.Series(["a", "b", "c", "d"], index=[1, 2, 3, 4])
s

1    a
2    b
3    c
4    d
dtype: object

In [10]:
# Explicit slicing - both values at index 1 and 3 are included -- output has length 3
s.loc[1:3]

1    a
2    b
3    c
dtype: object

In [11]:
# Implicit slicing - value at index position 3 has been excluded -- output has length 2
s.iloc[1:3]

2    b
3    c
dtype: object
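
The inclusive/exclusive difference is easier to see when the labels are not integers. The following is a minimal sketch using a made-up Series with string labels. It also shows that either end of a slice may be omitted:

```python
import pandas as pd

# Hypothetical example data: string labels, integer values
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

explicit = s.loc["b":"c"]  # labels "b" through "c": stop label included
implicit = s.iloc[1:3]     # positions 1 and 2: stop position excluded
first_two = s.iloc[:2]     # omitted start -> slice begins at the first element
from_c = s.loc["c":]       # omitted stop -> slice runs through the last element

print(explicit.tolist())   # [20, 30]
print(implicit.tolist())   # [20, 30]
print(first_two.tolist())  # [10, 20]
print(from_c.tolist())     # [30, 40]
```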

---
### 2.3 Boolean Indexing (Masking):

__Boolean indexing__ is an alternative method to index a Series/DataFrame using a __boolean mask__.
A __mask__ is a vector of boolean objects - e.g. `[True, False, True]`.
When a Pandas Object (Series or DataFrame) is indexed by a boolean mask, the output is the subset of elements (or rows) whose positions correspond to True values in the mask.

In [12]:
s = pd.Series([1,2,3,4,5])
s

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [13]:
# Create a mask explicitly (manually entering True/False bools)
mask = [True, True, True, False, False]
s[mask] # note - once the mask is created, we pass it into the series using '[]'

0    1
1    2
2    3
dtype: int64

What makes this useful is that we can create the mask __implicitly__ from the __values__ of the Series.
When values of a Series are passed to a boolean expression, a mask is created from how the values evaluate in the expression.

In [14]:
# Create a mask implicitly, using a boolean expression
mask = (s <= 3)
display(mask)

0     True
1     True
2     True
3    False
4    False
dtype: bool

In [15]:
# We can use this mask as before to index the series
s[mask]

0    1
1    2
2    3
dtype: int64

So while it appears complicated at the outset, boolean indexing is simply a means to filter your dataset based on whether the values meet a condition.
(Much like a SQL 'WHERE' clause!)
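
Continuing the SQL analogy, several conditions can be combined into one mask. The sketch below assumes the element-wise operators `&` (and), `|` (or) and `~` (not). The variable names are illustrative:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Each comparison yields a boolean Series. The parentheses are required
# here because & and | bind more tightly than the comparison operators.
between = s[(s >= 2) & (s <= 4)]   # both conditions hold
extremes = s[(s < 2) | (s > 4)]    # at least one condition holds
not_small = s[~(s <= 3)]           # the condition does NOT hold

print(between.tolist())    # [2, 3, 4]
print(extremes.tolist())   # [1, 5]
print(not_small.tolist())  # [4, 5]
```

Python's `and`, `or` and `not` keywords do not work element-wise on a Series, which is why the symbolic operators are used.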

---
### 2.4 Assigning Values to a Series:
Data Indexing or Slicing is often used not just for 'selection and display purposes', but also for assigning new values to a Series. Below are some examples of how to do this:

In [7]:
s = pd.Series([1, 2, 3, 4])
display(s)

# Assign a new value to single element
s.iloc[0] = 9000
display(s)

# Assign a single value to a slice
s.loc[1:2] = 9005
display(s)

# Assign multiple values to a slice
s.iloc[2:4] = [9010, 9015]
display(s)

0    1
1    2
2    3
3    4
dtype: int64

0    9000
1       2
2       3
3       4
dtype: int64

0    9000
1    9005
2    9005
3       4
dtype: int64

0    9000
1    9005
2    9010
3    9015
dtype: int64
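
Assignment also works through a boolean mask. This is a minimal sketch (the name `capped` is made up for illustration) that overwrites only the elements meeting a condition and works on a copy so the original Series stays intact:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Work on a copy so that `s` itself is left unchanged
capped = s.copy()

# Every element greater than 2 is replaced by 2
capped.loc[capped > 2] = 2

print(s.tolist())       # [1, 2, 3, 4]
print(capped.tolist())  # [1, 2, 2, 2]
```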

---
## 3. DataFrames - Data Indexing and Selection:

Recall that __DataFrames__ are 2-dimensional objects of data, indexed by their rows and columns.

We can index a DataFrame object in one of the following ways:
- __Explicit vs Implicit Indexing__
- __Slicing__
- __Boolean Indexing__
- __Column Selection__

In [8]:
# Useful function to make dummy data
def make_df(cols, rows):
    data = {c:[str(c)+str(r) for r in rows] for c in cols}
    return pd.DataFrame(data)

In [10]:
df = make_df("abc", [1,2,3,4])
df.index = [5,6,7,8]
display(df)

    a   b   c
5  a1  b1  c1
6  a2  b2  c2
7  a3  b3  c3
8  a4  b4  c4


---
### 3.1 Explicit vs Implicit Indexing:
Explicit and Implicit Indexing for DataFrames work in exactly the same way as for Series, except that a value in a __DataFrame__ is uniquely identified by a pair of keys - its row and its column.

Syntax:
-  __Explicit Indexing__ - `dataframe_name.loc[row_label, column_label]`
- __Implicit Indexing__ - `dataframe_name.iloc[row_number, column_number]`

In [11]:
# Getting the first row and first column using explicit indexing
df.loc[5, "a"]

'a1'

In [12]:
# Getting the first row and first column using implicit indexing
df.iloc[0, 0]

'a1'
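
To make the link between the two accessors concrete, the sketch below translates labels into positions with `Index.get_loc` and checks that both routes reach the same cell. The DataFrame is rebuilt by hand so the snippet runs on its own. The variable names are illustrative:

```python
import pandas as pd

# Same content as the df used in this unit
df = pd.DataFrame(
    {
        "a": ["a1", "a2", "a3", "a4"],
        "b": ["b1", "b2", "b3", "b4"],
        "c": ["c1", "c2", "c3", "c4"],
    },
    index=[5, 6, 7, 8],
)

# Translate a row label and a column label into integer positions
row_pos = df.index.get_loc(7)      # row label 7 sits at position 2
col_pos = df.columns.get_loc("b")  # column "b" sits at position 1

by_label = df.loc[7, "b"]
by_position = df.iloc[row_pos, col_pos]

print(row_pos, col_pos)       # 2 1
print(by_label, by_position)  # b3 b3
```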

---
### 3.2 Slicing:
Again, __Slicing__ a DataFrame is very similar to Slicing a Series - we shall again use the `.iloc` and `.loc` accessors. However, as __DataFrames__ are 2-dimensional objects, we now can obtain subsets of a DataFrame of all sorts and shapes!


Syntax:
- __Explicit Slicing__ - `dataframe_name.loc[start_row_label: stop_row_label, start_column_label: stop_column_label]` 
- __Implicit Slicing__ - `dataframe_name.iloc[start_row_num: stop_row_num, start_column_num: stop_column_num]`

Depending on what output we are after, we can use a number of variations to the above syntax! Let's look into some examples!

In [15]:
df

    a   b   c
5  a1  b1  c1
6  a2  b2  c2
7  a3  b3  c3
8  a4  b4  c4


In [13]:
# Explicit Slicing -- getting a sub-dataframe
df.loc[6:7, "b":"c"] # Remember the stop index IS included

    b   c
6  b2  c2
7  b3  c3


In [16]:
# Implicit Slicing -- get the second row in full
# note - df.iloc[1] yields the same result!
df.iloc[1,:]

a    a2
b    b2
c    c2
Name: 6, dtype: object

In [17]:
# Note that as the data we're selecting is 1-dimensional, this slice returns a Series
type(df.iloc[1])

pandas.core.series.Series

In [18]:
# Explicit Slicing -- get the full 2nd and 3rd rows
# As the slice is now 2-dimensional, this returns a DataFrame
# note - df.loc[6:7, :] yields the same result
df.loc[6:7]

    a   b   c
6  a2  b2  c2
7  a3  b3  c3
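
A few more variations on the slicing syntax, as a hedged sketch: open-ended slices, a full-column slice, and lists of labels in place of a start/stop pair. The DataFrame is rebuilt by hand so the snippet is self-contained. The variable names are made up:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": ["a1", "a2", "a3", "a4"],
        "b": ["b1", "b2", "b3", "b4"],
        "c": ["c1", "c2", "c3", "c4"],
    },
    index=[5, 6, 7, 8],
)

# Implicit: first two rows, every column from position 1 onwards
top_right = df.iloc[:2, 1:]

# Explicit: all rows of a single column -> 1-dimensional, so a Series
col_a = df.loc[:, "a"]

# Explicit: lists of labels select rows/columns that are not adjacent
picked = df.loc[[5, 8], ["a", "c"]]

print(top_right)
print(col_a)
print(picked)
```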


---
### 3.3 Boolean Indexing  (Masking):

__Boolean Masking__ works on DataFrames much as it does on Series. A 1-dimensional mask, with one boolean per row, is passed to the 2-dimensional DataFrame and filters its __rows__. The result is a new DataFrame with the same columns, containing only the rows where the mask is True.

In [19]:
# Explicit Masking - it returns only the rows that correspond to True
mask = [False, True, False, True]
df[mask]

    a   b   c
6  a2  b2  c2
8  a4  b4  c4


In [21]:
# Implicit Masking -- checking for which row column 'a' has a value of 'a3'
mask = (df["a"] == "a3")
display(df[mask])

# Note: the '()' are not needed, but they help visually distinguish the mask
mask = df["a"] == "a3"
display(df[mask])

    a   b   c
7  a3  b3  c3

    a   b   c
7  a3  b3  c3
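
A row mask can also be combined with a column selection by passing both to `.loc`. The sketch below is one possible way to do this. It builds the mask with `Series.isin`, which is not covered in this unit and is used here only for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": ["a1", "a2", "a3", "a4"],
        "b": ["b1", "b2", "b3", "b4"],
        "c": ["c1", "c2", "c3", "c4"],
    },
    index=[5, 6, 7, 8],
)

# True for every row whose value in column "a" is one of the listed values
mask = df["a"].isin(["a2", "a4"])

# Rows are filtered by the mask, columns by the list of labels
subset = df.loc[mask, ["b", "c"]]

print(mask.tolist())  # [False, True, False, True]
print(subset)
```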


---
### 3.4 Column Selection:

With DataFrames, we also have the option of __column selection__. 
We can specify which columns we would like to select in the following way:

Syntax:
- __Select Multiple Columns__ - `dataframe_name[['col1', 'col2', ...]]`
- __Select a Single Column & Return a DataFrame__ - `dataframe_name[['col1']]`
- __Select a Single Column & Return a Series__ - `dataframe_name['col1']`

In [25]:
display(df["b"]) # Return a 1-column Series
display(df[["b"]]) # Return a 1-column DataFrame
display(df[["a", "b"]]) # Return a multi-column DataFrame

5    b1
6    b2
7    b3
8    b4
Name: b, dtype: object

    b
5  b1
6  b2
7  b3
8  b4


    a   b
5  a1  b1
6  a2  b2
7  a3  b3
8  a4  b4


---
## 4. Summary:
- __Data Indexing__ refers to selecting a sub-part of data from a Pandas Object. The main ways to do this are:
    - __Explicit vs Implicit Indexing__ - applicable to both DataFrames and Series
    - __Slicing__ - applicable to both DataFrames and Series
    - __Boolean Masking__ - applicable to both DataFrames and Series
    - __Column Selection__ - applicable to DataFrames only

---
## 5. Concept Check:
1. What are the different ways we can index a Series?
2. Suppose you have a Series `s = pd.Series([1,2,3,4], index=['a','b','c','d'])`:
    - Using implicit and explicit indexing, get the second element
    - Using explicit indexing, slice the Series to get the first three elements
    - Using a boolean mask, select the even numbers (`s[s % 2 == 0]`)

In [26]:
# 1. Explicit (.loc) vs implicit (.iloc) indexing, slicing, boolean masking
# 2.
s = pd.Series([1,2,3,4], index=["a","b","c","d"])
display(s.iloc[1])
display(s.loc["b"])
display(s.loc[:"c"])
mask = (s%2==0)
display(s[mask])

2

2

a    1
b    2
c    3
dtype: int64

b    2
d    4
dtype: int64