# Indexing and Slicing: Getting Items from Ordered Data Collections

## Indexing

Getting a single item from a collection is called **indexing** the collection.  To do this, you pass the item's "index" (its position) to Python's get-item operator: the square brackets.  Note: Python is a 0-indexed language, so counting starts from zero.

```python
data = [10, 20, 30]
data[0]  # "Get the first item"
```

Counting from the end of a sequence can be done using a negative index:
```python
data = [10, 20, 30]
data[-1]  # "Get the last item"
```
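
As a small sketch of how positive and negative indices relate (the variable names are illustrative, not part of the exercises): a negative index `-k` refers to the same item as index `len(data) - k`, and an index outside the valid range raises an `IndexError`.

```python
data = [10, 20, 30]

# A negative index -k points at the same item as len(data) - k.
last_by_negative = data[-1]
last_by_length = data[len(data) - 1]

# An index outside the valid range raises an IndexError.
try:
    data[3]
    out_of_range_failed = False
except IndexError:
    out_of_range_failed = True
```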

**Exercises**

**Example**: What is the first letter in the English alphabet?

In [1]:
letters = 'abcdefghijklmnopqrstuvwxyz'

In [2]:
letters[0]

'a'

What is the fifth letter in the English alphabet?

In [3]:
letters = 'abcdefghijklmnopqrstuvwxyz'

In [4]:
letters[4]

'e'

What is the third letter in the English alphabet?

In [5]:
letters = 'abcdefghijklmnopqrstuvwxyz'

In [6]:
letters[2]

'c'

What is the last letter in the English alphabet?

In [8]:
letters = 'abcdefghijklmnopqrstuvwxyz'

In [7]:
letters[-1]

'z'

What is the third-to-last letter in the English alphabet?

In [8]:
letters = 'abcdefghijklmnopqrstuvwxyz'

In [9]:
letters[-3]

'x'

What is the second day of the week?

In [10]:
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

In [11]:
days_of_week[1]

'Tuesday'

What is the fifth day of the week?

In [12]:
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

In [13]:
days_of_week[4]

'Friday'

What is the second-to-last day of the week?

In [14]:
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

In [15]:
days_of_week[-2]

'Saturday'

What is the fourth-to-last day of the week?

In [16]:
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

In [17]:
days_of_week[-4]

'Thursday'

## Slicing

Slicing gets multiple items from a collection.  In sequences, this is done by specifying the start and stop indices, separated by a colon.  The item at the start index is included; the item at the stop index is not:

```python
data = [10, 20, 30, 40, 50, 60, 70, 80]
data[0:2]  # Gets [10, 20]
data[:2]   # Also gets [10, 20]
data[2:4]  # Gets [30, 40]
data[-2:]  # Gets [70, 80]
```
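
A minimal sketch of the start/stop rules, using the same list as above (the variable names are illustrative). It shows that the stop index is excluded and that, unlike indexing, slicing past the end of a sequence does not raise an error.

```python
data = [10, 20, 30, 40, 50, 60, 70, 80]

# The stop index is excluded, so the slice has stop - start items.
middle = data[2:5]  # items at positions 2, 3, and 4
n_items = len(middle)

# Omitting both indices copies the whole sequence.
everything = data[:]

# Slicing past the end simply stops at the last item.
past_end = data[6:100]
```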



**Example**: What are the first five letters in the English alphabet?

In [18]:
letters = 'abcdefghijklmnopqrstuvwxyz'

In [19]:
letters[:5]

'abcde'

What are the second-to-sixth letters in the English alphabet?

In [20]:
letters = 'abcdefghijklmnopqrstuvwxyz'

In [21]:
letters[1:6]

'bcdef'

What are the third-to-fifth letters in the English alphabet?

In [22]:
letters = 'abcdefghijklmnopqrstuvwxyz'

In [23]:
letters[2:5]

'cde'

What are the last three letters of the English alphabet?

In [24]:
letters = 'abcdefghijklmnopqrstuvwxyz'

In [25]:
letters[-3:]

'xyz'

What are the second-to-fifth days of the week?

In [26]:
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

In [27]:
days_of_week[1:5]

['Tuesday', 'Wednesday', 'Thursday', 'Friday']

Get all the days of the week except Monday.

In [28]:
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

In [29]:
days_of_week[1:]

['Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

Get all the days of the week except the weekend days.

In [30]:
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

In [31]:
days_of_week[:-2]

['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

# Filtering Data With Logical Indexing

Sometimes you want to remove certain values from your dataset.  In NumPy, this can be done with **logical indexing**; in plain Python, it is done with an **if statement**.

## Step 1: Create a Logical NumPy Array

A single logical expression compares every value in an array at once and returns an array of True/False values.  This is broadcasting, the same mechanism used by the math operations we saw earlier:

```python
>>> import numpy as np
>>> data = np.array([1, 2, 3, 4, 5])
>>> data < 3
array([ True,  True, False, False, False])
```

**Exercises**: Make arrays of True/False values that answer the following questions about the dataset below for each element.

In [None]:
import numpy as np

list_of_values = [3, 7, 10, 2, 1, 7, np.nan, 20, -5]
data = np.array(list_of_values)

*Example*: Where are the values that are greater than zero?

In [None]:
data > 0

In [None]:
mask = np.invert(np.isnan(data))
data[mask]

In [None]:
np.nanmean(data)
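
The dataset contains one `np.nan` (missing) value, which the two cells above handle. As a minimal sketch of why that matters: any comparison with NaN evaluates to False, so the missing value never appears in masks like `data > 0`. `np.isnan` finds it explicitly, and `~` (which does the same as `np.invert` on a True/False array) flips the mask.

```python
import numpy as np

data = np.array([3, 7, 10, 2, 1, 7, np.nan, 20, -5])

# Comparisons with NaN are False, even NaN == NaN.
positive = data > 0
equal_to_itself = data == data

# np.isnan marks the missing values; ~ flips True and False.
is_missing = np.isnan(data)
valid = data[~is_missing]

# np.mean propagates NaN, while np.nanmean ignores it.
plain_mean = np.mean(data)
nan_safe_mean = np.nanmean(data)
```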

Where are the values that are less than four?

In [None]:
data < 4

Where are the values that are equal to 7?

In [None]:
data == 7

Where are the values that are greater than or equal to 7?

In [None]:
data >= 7

Where are the values that are not equal to 7?

In [None]:
data != 7

## Step 2: Filter with Logical Indexing

If an array of True/False values is used to *index* another array, and both arrays are the same size, it will return all of the values that correspond to the True values of the indexing array:

```python
>>> data = np.array([1, 2, 3, 4, 5])
>>> mask = data > 3
>>> mask
array([False, False, False,  True,  True])
>>> data[mask]
array([4, 5])
```

Both steps can also be done in a single expression.  Sometimes this can make things clearer!


```python
>>> data[data > 3]
array([4, 5])
```
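
For comparison, here is a sketch of the same filter written in plain Python, using an `if` inside a list comprehension, next to the NumPy version. The variable names are illustrative.

```python
import numpy as np

values = [1, 2, 3, 4, 5]

# Plain Python: keep an item only if the condition holds.
filtered_list = [value for value in values if value > 3]

# NumPy: build the mask and index with it in one expression.
data = np.array(values)
filtered_array = data[data > 3]
```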


**Exercises**:  Using the data below, extract only the values that correspond to each question.

In [None]:
data = np.array([3, 1, -6, 8, 20, 2, np.nan, 7, 1, np.nan, 9, 7, 7, -7])
data

*Example*: The values that are less than 0

In [None]:
data[data < 0]

The values that are greater than 3

In [None]:
data[data > 3]

The values not equal to 7

In [None]:
data[data != 7]

The values equal to 20

In [None]:
data[data == 20]

### Statistics on Filtered Data


**Exercises**: Using the following dataset, have Python calculate the answers to the questions below:

In [None]:
data = np.array([3, 1, -6, 8, 20, 2, 7, 1, 9, 7, 7, -7])
data

*Example*: How many values are greater than 4?  

In [None]:
len(data[data > 4])

How many values are equal to 7?

In [None]:
len(data[data == 7])

3

What is the mean value of the positive numbers?

In [None]:
np.mean(data[data > 0])

np.float64(6.5)

What is the mean value of the negative numbers?

In [None]:
np.mean(data[data < 0])

np.float64(-6.5)

What is the median value of the values that are greater than 5?

In [None]:
np.median(data[data > 5])

np.float64(7.5)

What proportion of the values are positive?  (hint: sum and len, or mean)

In [None]:
sum(data > 0)/len(data)

np.float64(0.8333333333333334)

In [None]:
len(data[data > 0])/len(data)

0.8333333333333334

What proportion of the values are less than or equal to 8?

In [None]:
len(data[data <= 8])/len(data)

0.8333333333333334
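
The hint above mentions `mean`. A sketch of why that works: `True` counts as 1 and `False` as 0 in arithmetic, so the sum of a mask is the number of `True` values and its mean is their proportion.

```python
import numpy as np

data = np.array([3, 1, -6, 8, 20, 2, 7, 1, 9, 7, 7, -7])

mask = data > 0

# Summing a mask counts the True values.
n_positive = np.sum(mask)

# The mean of a mask is count / length, i.e. the proportion.
proportion_positive = np.mean(mask)
```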

## Modifying Data Using Logical Indexing

Just like normal indexing, logical indexing can be used to *set* new values in an array, in addition to *getting* values from it:

| Example | Description |
| :-- | :-- |
| **`data[data > 5] = 10`** | Set all values greater than 5 to 10  |
| **`data[data > 5] = data[data > 5] * 10`** | Set the values greater than 5 to themselves times 10  |
| **`data[data > 5] *= 10`** | Multiply all the values greater than 5 by 10, setting them in-place.  |
| **`data2 = data.copy()`** | Copy an array to a new variable |
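
Assigning through a mask changes the array in place, which is why the table includes `data.copy()`. A minimal sketch, with illustrative variable names, of the difference between a second name for the same array and an independent copy:

```python
import numpy as np

data = np.array([2, 4, 6, 8])

# Plain assignment only creates a second name for the same array.
alias = data
# .copy() creates an independent array.
backup = data.copy()

# This changes data (and therefore alias), but not backup.
data[data > 5] = 10
```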

Example: Set all positive values in `data` to 100

In [None]:
data = np.arange(-4, 5)
data

array([-4, -3, -2, -1,  0,  1,  2,  3,  4])

In [None]:
data[data > 0] = 100
data

array([ -4,  -3,  -2,  -1,   0, 100, 100, 100, 100])

Set all negative values in `data` to 0

In [None]:
data = np.array([ 1, -2,  2, -1,  1,  4, -3,  1,  2, -1])
data

array([ 1, -2,  2, -1,  1,  4, -3,  1,  2, -1])

In [None]:
data[data < 0] = 0
data

array([1, 0, 2, 0, 1, 4, 0, 1, 2, 0])

Add 100 to all values in `data` less than 100.

In [None]:
data = np.array([0, 101, 2, 3, 104, 105, 6, 107, 8])
data


array([  0, 101,   2,   3, 104, 105,   6, 107,   8])

In [None]:
data[data < 100] = data[data < 100] + 100
data

array([100, 101, 102, 103, 104, 105, 106, 107, 108])

In [None]:
data[data < 100] += 100
data

array([100, 101, 102, 103, 104, 105, 106, 107, 108])

In [None]:
data += 100  # Careful: without a mask, this adds 100 to every value
data

array([100, 201, 102, 103, 204, 205, 106, 207, 108])

Make all the negative values in `data` positive.

In [None]:
data = np.array([1, -2, -3, 4, -5, 6, 7, -8, 9, 10, -11])
data

array([  1,  -2,  -3,   4,  -5,   6,   7,  -8,   9,  10, -11])

In [None]:
data[data < 0] = data[data < 0] * (-1)
data

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [None]:
data[data < 0] *= (-1)
data

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [None]:
data[data < 0] = np.abs(data[data < 0])
data

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

Challenge: Set all the values greater than 10 in `data` to random values between 0 and 10 (use the `np.random.randint()` function)

In [None]:
data = np.array([3, 15, 8, 12, 7, 19, 11, 6, 2, 25])
data

array([ 3, 15,  8, 12,  7, 19, 11,  6,  2, 25])

In [None]:
data[data > 10] = np.random.randint(0, 10, size=len(data[data > 10]))
data

array([3, 6, 8, 4, 7, 2, 5, 6, 2, 4])
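
Note that the `high` argument of `np.random.randint` is exclusive, so `np.random.randint(0, 10, ...)` draws whole numbers from 0 through 9. Below is a sketch of a variant that can also produce 10 and that stores the mask in a variable so it is only computed once. The seed value is arbitrary and only there to make reruns repeatable.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, for repeatable results

data = np.array([3, 15, 8, 12, 7, 19, 11, 6, 2, 25])

# Build the mask once and count how many values need replacing.
mask = data > 10
n_replace = mask.sum()

# high=11 is exclusive, so the draws range from 0 through 10.
data[mask] = np.random.randint(0, 11, size=n_replace)
```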

## Using Logical Indexing to Link Two Different Variables in a Dataset

Logical indexing works with any mask of True/False values, so long as the mask has the same shape as the array it is applied to.  One implication is that we can use one dataset to index another dataset:

| Syntax | Description |
| :--  | :-- |
| **`data2[data1 > 0]`** | Get the values in `data2` with indices that correspond to the positive values in `data1` |
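
A minimal sketch of the shape requirement, using small made-up arrays (not the exercise data): a mask built from one array selects from another array of the same length, while a mask of a different length raises an `IndexError`.

```python
import numpy as np

temp = np.array([35, 38, 32, 35])
group = np.array(['Control', 'Treatment', 'Treatment', 'Control'])

# Same length: the mask built from group selects from temp.
control_temp = temp[group == 'Control']

# A mask of a different length cannot be applied.
short_mask = np.array([True, False])
try:
    temp[short_mask]
    mismatch_failed = False
except IndexError:
    mismatch_failed = True
```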

Get all the "Treatment" group's temperatures

In [None]:
temp = np.array([35, 38, 32, 35, 39, 37, 36, 38, 39])
group = np.array(['Control', 'Treatment', 'Treatment', 'Control', 'Treatment', 'Control', 'Control', 'Treatment', 'Control'])

In [None]:
temp[group == 'Treatment']

array([38, 32, 39, 38])

Calculate the mean temperature of the "Control" group

In [None]:
temp = np.array([35, 38, 32, 35, 39, 37, 36, 38, 39])
group = np.array(['Control', 'Treatment', 'Treatment', 'Control', 'Treatment', 'Control', 'Control', 'Treatment', 'Control'])

In [None]:
np.mean(temp[group == 'Control'])

np.float64(36.4)

Get the group names for all the temperatures greater than 35, then calculate the proportion that are in the Control group.

In [None]:
temp = np.array([35, 38, 32, 35, 39, 37, 36, 38, 39])
group = np.array(['Control', 'Treatment', 'Treatment', 'Control', 'Treatment', 'Control', 'Control', 'Treatment', 'Control'])

In [None]:
group_gt_35 = group[temp > 35]
n_control = len(group_gt_35[group_gt_35 == 'Control'])
n_total = len(group_gt_35)
n_control / n_total

In [None]:
sum(group[temp > 35] == 'Control')/sum(temp > 35)

0.5
