# Selections and Filtering

In this notebook we'll learn how to do selections and filtering with Pandas.

- [Selections](#s)
    - [Selecting the top/bottom](#tb)
    - [Selecting columns](#c)
    - [Selecting rows](#r)
    - [Selecting rows and columns together](#rc)
- [Filtering](#f)
    - <mark>[Exercise: Selections and Filtering](#e0)</mark>
- [Multiple Conditions](#m)
    - <mark>[Exercise: Multiple Conditions](#e1)</mark>
- [Sorting](#sort)
    - <mark>[Exercise: Sorting](#e2)</mark>

Let's load in the `chickweight` dataset again.

In [None]:
import pandas as pd

In [None]:
chickweight = (
    pd.read_csv('data/chickweight.csv') 
      .rename(str.lower, axis='columns')
)
chickweight

<a id='s'></a>
## Selections

Before we attempt to figure out the best diet for the chicks, let's investigate how to select different rows/columns. There are a few different approaches!

<a id='tb'></a>
### Selecting the top/bottom

In [None]:
(
    chickweight
    .head(5)
)

In [None]:
(
    chickweight
    .tail(2)
)

In [None]:
(
    chickweight
    .head(5)
    .tail(2)
)

<a id='c'></a>
### Selecting columns (Dataframes vs. Series)

In [None]:
chickweight['weight'].head() 

Note that the output of this next command is a little bit different.

In [None]:
chickweight[['weight']].head()

There is a subtle difference at work here. 

- `chickweight[['weight']]` returns a table with one column (`pd.DataFrame`)
- `chickweight['weight']` selects the column from the table (`pd.Series`)

As we'll see, a lot of techniques that work on dataframes will also work on series objects, but not all of them!
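
To make the difference concrete, here is a small sketch. The `toy` dataframe is a hypothetical stand-in for a few rows of `chickweight`, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the chickweight data.
toy = pd.DataFrame({"weight": [42, 51, 59], "time": [0, 2, 4]})

as_series = toy["weight"]    # single brackets -> pd.Series
as_frame = toy[["weight"]]   # double brackets -> pd.DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)

# A Series carries a single name, a DataFrame carries a list of column names.
print(as_series.name)          # weight
print(list(as_frame.columns))  # ['weight']
```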

Passing a list of column names inside the outer square brackets lets us select several columns at once:

In [None]:
chickweight[['weight', 'time']].head()

<a id='rc'></a>
### Selecting rows & columns together

To select rows and columns at the same time, we can use the `.loc[]` indexer.

For selection, it typically uses the following format.

`.loc[[rows], [columns]]` 

In our case, we should specify the column names we want, e.g. `['time','chick']`

Our rows have no names of their own, only the default integer index (0, 1, 2, ...), so we can slice on those labels, e.g.

 - ` : ` selects all rows
 
 - `:50` selects the rows labelled 0 up to and including 50 (with `.loc[]`, the end of a slice is included)
 
 - `::2` selects every other row
 
Note that the difference between `.loc[]` and `.iloc[]` is tricky and easy to mix up (apologies for that). The main thing to remember is that `.loc[]` requires **square** brackets.
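
One way to see the difference between the two indexers is to slice the same rows with each. This is a sketch on a hypothetical `toy` dataframe with the default integer index: `.loc[]` slices by label and includes the end label, while `.iloc[]` slices by position and excludes the end position.

```python
import pandas as pd

# Hypothetical 10-row stand-in with the default index 0..9.
toy = pd.DataFrame({"time": range(0, 20, 2), "chick": [1] * 10})

by_label = toy.loc[2:5, ["time", "chick"]]   # labels 2, 3, 4 and 5
by_position = toy.iloc[2:5, [0, 1]]          # positions 2, 3 and 4

print(len(by_label))      # 4
print(len(by_position))   # 3
```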

In [None]:
chickweight.loc[ : , ['time', 'chick']]

In [None]:
chickweight.loc[10:15, ['time', 'chick']]

In [None]:
(
    chickweight
    .loc[10:15, ['time', 'chick']]
)

#### Lesson 1

You'll notice that whenever we use `<dataframe>.<method>` with the methods above, the output is yet again a dataframe.

This means that we can chain commands together to form the `-then->` style of programming. 

<a id='f'></a>
## Filtering

We saw previously how to check which parts of the dataframe/series met conditions

In [None]:
(
    chickweight < 3
    
).head()

In [None]:
(
    chickweight['chick'] == 1
)

In [None]:
(
    chickweight
    [ chickweight['chick'] == 1 ]
)

We can use these results to filter our dataframe

In [None]:
(
    chickweight 
    [chickweight['chick'] == 1] 
    .head()
)

In [None]:
(
    chickweight  
    [chickweight['chick'] == 1]
    [chickweight['time'] < 10] 
)

This is the correct output but we're getting a warning, and we should never ignore warnings!

This comes from the fact that the below creates a boolean series:

In [None]:
chickweight['time'] < 10

In [None]:
len(chickweight['time'] < 10)

...which is the **same length as the original dataframe:** 

In [None]:
len(chickweight)

...yet because it is the **second filter** being used, it is actually being **applied to the new dataframe**, filtered in step one: 

In [None]:
len(
    chickweight
    [chickweight['chick']==1]
)
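
The length mismatch can be reproduced on a small scale. This is a sketch using a hypothetical six-row `toy` dataframe: a mask built from the original data keeps the original length, while a mask built from the filtered data matches the filtered rows.

```python
import pandas as pd

# Hypothetical stand-in: two chicks, three measurements each.
toy = pd.DataFrame({
    "chick": [1, 1, 1, 2, 2, 2],
    "time": [0, 10, 20, 0, 10, 20],
})

step_one = toy[toy["chick"] == 1]

mask_from_original = toy["time"] < 10        # built from all 6 rows
mask_from_step_one = step_one["time"] < 10   # built from the 3 filtered rows

print(len(mask_from_original), len(step_one), len(mask_from_step_one))  # 6 3 3
```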

***Explicit is better than implicit.***

The safest approach for filtering is to use `.loc[]` with a lambda function:

In [None]:
(
    chickweight 
    .loc[lambda df: df['chick'] == 1]
    .head()
)

In [None]:
(
    chickweight
    .loc[lambda df: df['chick'] == 1]
    .loc[lambda df: df['time'] < 10]
)

#### Lesson 2

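We are using a `lambda` function here to describe how we are using the `.loc` command. The function receives the dataframe as it exists at that point in the chain, so each filter is evaluated against the rows that are still left.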

The `.loc` tells us **what** we are doing (filtering) and the function tells us **how**.
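
To check what the function actually receives, the sketch below swaps the second lambda for a named function that records the number of rows it is handed. The `toy` dataframe and the helper `time_below_10` are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Hypothetical stand-in: two chicks, three measurements each.
toy = pd.DataFrame({
    "chick": [1, 1, 1, 2, 2, 2],
    "time": [0, 10, 20, 0, 10, 20],
})

seen_lengths = []

def time_below_10(df):
    # Record how many rows the function receives, then build the mask.
    seen_lengths.append(len(df))
    return df["time"] < 10

result = (
    toy
    .loc[lambda df: df["chick"] == 1]
    .loc[time_below_10]
)

print(seen_lengths)  # row counts seen by the second filter
print(len(result))   # 1
```

The second filter only ever sees the three rows that survived the first one, so the mask and the dataframe always line up.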

In [None]:
(
    chickweight
    .loc[:, ['weight', 'time']]
    .loc[lambda df: df['weight'] < 50] 
)

<a id='e0'></a>
## <mark> Exercise: Selections and Filtering </mark>
1. Get the weight column as a

    a) Series

    b) Dataframe

    c) List



2. Select only 

    a) The first 100 rows of data

    b) Rows 50 to 100 of the data
    
    c) The chick and weight columns

    d) Rows 50 to 100 of the data, for the chick and weight columns

3. Filter the data 

    a) For when weight is less than 60

    b) For Chick number 15

    c) For when weight is less than 60 and time is equal to 2

    d) For when weight is less than 60 and time is equal to 2, but only the weight and time columns!



**BONUS**:

1. Calculate the average Chicken weight

2. Calculate the average Chicken weight at time 10

*Any other data science questions you can think of?*

<a id='m'></a>
## Multiple conditions


We're now going to look at how we can use multiple conditions within the same line. Aside from being more efficient to run, it is also useful as we won't need to worry about previous filters.

In [None]:
(
    chickweight
    .loc[lambda df: (df['chick'] == 1) & (df['time'] < 10)]
)

***Note the need for parentheses and the use of `&` rather than `and`?***

Firstly, we need **parentheses** because of operator precedence: `&` binds more tightly than comparisons like `==`, `<`, `>=` etc., so without parentheses Python would evaluate the `&` first.
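
The effect of leaving out the parentheses can be sketched on a hypothetical three-row `toy` dataframe. Without them, Python evaluates `1 & toy["time"]` first and the line becomes a chained comparison, which pandas cannot reduce to a single `True`/`False`.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the chickweight data.
toy = pd.DataFrame({"chick": [1, 1, 2], "time": [0, 12, 4]})

# With parentheses: compare first, then combine row by row.
mask = (toy["chick"] == 1) & (toy["time"] < 10)
print(mask.tolist())  # [True, False, False]

# Without parentheses: read as toy["chick"] == (1 & toy["time"]) < 10.
try:
    toy["chick"] == 1 & toy["time"] < 10
    outcome = "no error"
except ValueError:
    outcome = "ValueError"
print(outcome)  # ValueError
```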

**Now for `&` vs `and`...** Earlier we saw the use of `and`, which tests whether both expressions we have written are logically True. For example:

In [None]:
# checks if 7 is equal to 7 and if 6 is equal to 6 first
# both are True so the output is True
print(7 == 7 and 6 == 6) 

# checks if 7 is NOT equal to 7 and if 6 is equal to 6 first
# since the first is False and the second True, the output is False
print(7 != 7 and 6 == 6)

# checks if 7 is equal to 7 and if 6 is NOT equal to 6 first
# since the first is True and the second False, the output is False
print(7 == 7 and 6 != 6)

When we create filters in dataframes, we're actually creating a series of boolean values, one for each row in the dataframe. The `and` operator needs a single `True`/`False` value on each side, so Python tries to collapse the whole series into one Boolean. Pandas refuses to do that and raises a `ValueError`.

In [None]:
chickweight['chick'] == 1

In [None]:
# error: `and` cannot combine two boolean series, this raises a ValueError
(
    chickweight
    .loc[lambda df: (df['chick'] == 1) and (df['weight'] < 50)]
    .head(10)
)

Therefore we need `&`, the bitwise AND operator, which pandas applies element-wise: it combines the `True`/`False` values of the filters row by row.
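
The contrast between the two operators can be sketched without the full dataset. The `toy` dataframe below is a hypothetical stand-in; the `try`/`except` is only there so the cell keeps running after the error.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the chickweight data.
toy = pd.DataFrame({"chick": [1, 1, 2], "weight": [42, 51, 40]})

is_chick_one = toy["chick"] == 1
is_light = toy["weight"] < 50

# `and` asks each series for a single truth value, which pandas refuses to give.
try:
    combined = is_chick_one and is_light
    raised = False
except ValueError as err:
    raised = True
    print(err)

# `&` combines the two series row by row instead.
elementwise = is_chick_one & is_light
print(elementwise.tolist())  # [True, False, False]
```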

In [None]:
(
    chickweight
    .loc[lambda df: (df['chick'] == 1) & (df['weight'] < 50)]
    .head(10)
)

**The `&` (ampersand) is used to find rows where BOTH conditions are true:**

![](images/filt-and.png)

In [None]:
(
    chickweight
    .loc[lambda df: (df['chick'] == 1) | (df['weight'] < 50)]
    .tail(10)
)

**The `|` (pipe) is used to find rows where AT LEAST ONE condition is true:**

![](images/filt-or.png)

In [None]:
(
    chickweight
    .loc[lambda df: (df['chick'] == 1) ^ (df['weight'] < 50)]
    .tail(10)
)

**The `^` (hat) is used to find rows where exactly one condition is true, BUT NOT BOTH:**

![](images/filt-hat.png)
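
The three operators can be compared side by side in a small truth table. This is a sketch using two hypothetical boolean series, `left` and `right`, that cover every combination of `True` and `False`.

```python
import pandas as pd

# Every combination of True/False for two conditions.
left = pd.Series([True, True, False, False])
right = pd.Series([True, False, True, False])

truth_table = pd.DataFrame({
    "left": left,
    "right": right,
    "left & right": left & right,   # both true
    "left | right": left | right,   # at least one true
    "left ^ right": left ^ right,   # exactly one true
})
print(truth_table)
```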

<a id='e1'></a>
## <mark> Exercise: Multiple conditions </mark>

Select only the part of chickweight where:

1. **weight** is above 50 but below 100
2. **diet** is either 1 or 2 
3. **diet** is either 1 or 3, but only show the `weight` and `diet` columns

<a id='sort' ></a>
## Sorting in Pandas

Sorting is super useful, but keep in mind that the order in which you run the commands matters!
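
As a sketch of why the order matters, compare sorting before and after taking the top rows. The `toy` dataframe is a hypothetical four-row stand-in, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the chickweight data.
toy = pd.DataFrame({"chick": [1, 2, 3, 4], "weight": [80, 40, 60, 20]})

sort_then_head = toy.sort_values("weight").head(2)   # the two lightest rows overall
head_then_sort = toy.head(2).sort_values("weight")   # the first two rows, then sorted

print(sort_then_head["weight"].tolist())  # [20, 40]
print(head_then_sort["weight"].tolist())  # [40, 80]
```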

In [None]:
(
    chickweight
    .sort_values('weight')
    .head(20)
)

In [None]:
(
    chickweight
    .sort_values(by='weight', ascending=False)
    .head(20)
)

In [None]:
(
    chickweight
    .sort_values(by=['weight', 'chick'])
    .head(3)
)

<a id='e2'></a>
## <mark>Exercise: Sorting</mark>

Sort the data by weight (ascending) and by chick (descending). Look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) to see what type of inputs the `ascending` parameter takes.

Afterwards, reset the index of the resulting dataframe. Look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) to check which parameter you'll have to set to do this.