## Learning outcomes for this notebook

* Select a subset of rows from a DataFrame using the `query()` method.

## Selecting subsets of data

### Introduction

It is often useful, particularly with large datasets, to select just a subset of the data to analyse. For example, you may have a dataset of people's blood groups (A, B, AB and O) in different European countries. If you are interested only in people from Scotland, then you want to select Scottish people from the dataset and ignore everyone else.

Selecting a subset of data based on some criterion is a common and easy thing to do in pandas. We will use an example of Kenyan finches to demonstrate this.

### Kenyan finches

The file [`finches.csv`](finches.csv) contains measurements of body mass (in grams) and beak length (in mm) for 15 finches of three species, captured in mist nets in Kenya.

![](./images/03.prob-06_fmt.jpeg)

<div class="alert alert-danger">
Read in this dataset, call it `'finches'` and print it out.
</div>

### The query() method for selecting subsets of data

As you can see there are three variables: `'species'` is categorical and `'mass'` and `'beaklength'` are numerical. There are three species: the Cutthroat finch, the White-browed sparrow weaver and the Crimson-rumped waxbill. Five birds of each species were captured and measured.

Say we wanted to make a scatter plot of mass against beak length for just the five waxbills, ignoring, for the moment, the cutthroat finches and the sparrow weavers. **How do we select only the waxbills from the dataset to plot?**

This is achieved using the `query()` method like so:

```python
finches.query('species == "waxbill"')
```

which returns a DataFrame with only waxbills selected.
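
For intuition, `query()` picks out the same rows as a boolean mask would. The sketch below assumes a small made-up DataFrame called `toy`; its values and the `"sparrow weaver"` label are illustrative and are not taken from `finches.csv`.

```python
import pandas as pd

# Made-up values for illustration only; not the real finches.csv data
toy = pd.DataFrame({
    'species': ['waxbill', 'waxbill', 'cutthroat finch', 'sparrow weaver'],
    'mass': [7.5, 8.1, 16.0, 40.2],
    'beaklength': [7.0, 7.4, 9.1, 15.3],
})

# Select the waxbill rows with query() ...
waxbills = toy.query('species == "waxbill"')

# ... and select the same rows with a boolean mask
waxbills_mask = toy[toy['species'] == 'waxbill']

print(waxbills)
print(waxbills.equals(waxbills_mask))
```

Note that the original DataFrame is left unchanged; `query()` returns a new DataFrame containing only the matching rows.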

<div class="alert alert-danger">
Run this command in the code cell below.
</div>

What you should see is a DataFrame with only the waxbills selected.

Then, to plot mass against beak length for the waxbills, we simply do:

```python
plt.scatter(x='mass', y='beaklength', data=finches.query('species == "waxbill"'))
```

<div class="alert alert-danger">
Create this scatter plot and label it in the code cell below.
</div>

This is a very powerful and flexible method for selecting rows on any condition. For example, to select all finches whose mass is greater than 10 g, we would write:

```python
finches.query('mass > 10')
```

or to select Cutthroat finches with beak lengths greater than or equal to 8 mm:


```python
finches.query('species == "cutthroat finch" and beaklength >= 8')
```
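
Conditions inside the query string can be joined with `and` (all must hold) or `or` (at least one must hold). A minimal sketch on made-up data follows; the DataFrame `toy` and its values are assumptions for illustration, not the real measurements.

```python
import pandas as pd

# Made-up values for illustration only; not the real finches.csv data
toy = pd.DataFrame({
    'species': ['waxbill', 'waxbill', 'cutthroat finch',
                'cutthroat finch', 'sparrow weaver'],
    'mass': [7.5, 8.1, 16.0, 15.2, 40.2],
    'beaklength': [7.0, 7.4, 9.1, 7.8, 15.3],
})

# 'and': the row must be a cutthroat finch AND have a beak of at least 8 mm
long_beaked_cutthroats = toy.query('species == "cutthroat finch" and beaklength >= 8')

# 'or': the row is kept if it is a waxbill OR has a beak longer than 15 mm
waxbill_or_long_beak = toy.query('species == "waxbill" or beaklength > 15')

print(long_beaked_cutthroats)
print(waxbill_or_long_beak)
```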


<div class="alert alert-danger">
Try these in the code cell below.
</div>

One problem with the above queries is that the criteria to select on (i.e., 10 g or 8 mm) are hard-coded into the query string. This is fine in some cases, but you might instead want to store these criteria in variables.

For example, we might have a variable called `min_beaklength` and want to select all finches with a beak length greater than or equal to `min_beaklength`. To do this we put an `@` sign in front of the variable name in the query like so:

```python
min_beaklength = 8
finches.query('beaklength >= @min_beaklength')
```

Or, if you wanted all finches with beak lengths strictly between `min_beaklength` and `max_beaklength`, you could use the query:

```python
min_beaklength = 8
max_beaklength = 10
finches.query('@min_beaklength < beaklength < @max_beaklength')
```
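
Whether the end points are included depends on the operators used: `<` excludes them and `<=` includes them. The sketch below illustrates the difference on a made-up column of beak lengths (the DataFrame `toy` is an assumption for illustration).

```python
import pandas as pd

# Made-up beak lengths (mm) for illustration only
toy = pd.DataFrame({'beaklength': [7.0, 8.0, 9.1, 10.0, 15.3]})

min_beaklength = 8
max_beaklength = 10

# Strict inequalities exclude birds whose beak length equals either limit
strict = toy.query('@min_beaklength < beaklength < @max_beaklength')

# Using <= keeps birds whose beak length equals either limit
inclusive = toy.query('@min_beaklength <= beaklength <= @max_beaklength')

print(len(strict), len(inclusive))
```

Because the limits are ordinary Python variables, changing `min_beaklength` or `max_beaklength` changes the selection without editing the query string.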


<div class="alert alert-danger">
Try these in the code cell below.
</div>


### Plotting all the finch species together

We've plotted mass against beak length for just the waxbills. Say we also want to plot the Cutthroat finches and the sparrow weavers on the same graph so that we can compare species. To do that we need to distinguish between them by, for example, using a different colour for each species.

<div class="alert alert-danger">
Create a clearly labelled scatter plot of mass against beak length for the three species of finch in the code cell below.
</div>

To aid your audience in interpreting your graph, each species should be represented by a symbol of a different shape and colour, and the graph should include a legend. Matplotlib automatically assigns a new colour to each successive `plt.scatter()` call on the same axes, but it always uses a circle symbol. We can tell matplotlib which symbol to use with the `marker` argument of `plt.scatter()`. The full list of available markers is given [here](https://matplotlib.org/api/markers_api.html). To populate the legend we use the `label` argument. For example, the command

```python
plt.scatter(x='mass', y='beaklength', data=finches.query('species == "cutthroat finch"'), marker='s', label='Cutthroat finch')
```

will use a square symbol and add the entry `Cutthroat finch` to the legend. We also need to give the command

```python
plt.legend()
```

to tell matplotlib to print the legend in the figure.
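
Putting these pieces together for two of the species might look like the sketch below. The data are made up for illustration (the DataFrame `toy` is an assumption, not the real `finches.csv`), and the marker choices `'o'` and `'s'` are arbitrary.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only; not the real finches.csv data
toy = pd.DataFrame({
    'species': ['waxbill', 'waxbill', 'cutthroat finch', 'cutthroat finch'],
    'mass': [7.5, 8.1, 16.0, 15.2],
    'beaklength': [7.0, 7.4, 9.1, 7.8],
})

# One scatter call per species, each with its own marker and legend label
plt.scatter(x='mass', y='beaklength', data=toy.query('species == "waxbill"'),
            marker='o', label='Waxbill')
plt.scatter(x='mass', y='beaklength', data=toy.query('species == "cutthroat finch"'),
            marker='s', label='Cutthroat finch')

plt.xlabel('Mass (g)')
plt.ylabel('Beak length (mm)')

# Without this call the labels are stored but no legend is drawn
plt.legend()
```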

<div class="alert alert-danger">
Add a legend and make each species' symbol unique in your scatter plot above.
</div>

<div class="alert alert-danger">
As an additional exercise you might want to think about how to make this code more efficient by looping over a list of the species and having only one `plt.scatter()` command. **Hint:** use `finches['species'].unique()`.
</div>

## Junior and Senior Honours marks

![](./images/exam.jpeg)

The file [honours_marks.csv](honours_marks.csv) contains the anonymised 3rd and 4th year marks of almost all Biological Science students awarded an Honours degree since academic year 2009/10. 

<div class="alert alert-danger">

Calculate the percentage of students who achieved a higher mark in their 4th year than in their 3rd year. (To check you've got the code correct, your answer should be 65.2%)

<br>
<br>
Hints: First use query() to return a DataFrame of students whose 4th year marks are greater than their 3rd year marks. Then find the number of rows in this DataFrame (using len(), see Notebook 18) and divide it by the number of rows in the full DataFrame.
</div>
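
The pattern described in the hints, comparing one column with another inside `query()` and then dividing row counts, can be sketched on a toy example. The column names `year3` and `year4` and all the marks below are assumptions made up for illustration; check the real column names in `honours_marks.csv` before adapting this.

```python
import pandas as pd

# Made-up marks; the column names 'year3' and 'year4' are hypothetical
toy_marks = pd.DataFrame({
    'year3': [62, 58, 71, 65],
    'year4': [66, 55, 74, 65],
})

# Column names can be compared with each other directly in the query string
improved = toy_marks.query('year4 > year3')

# Number of matching rows as a percentage of all rows
percentage = 100 * len(improved) / len(toy_marks)
print(f'{percentage:.1f}%')
```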

## College student heights

Recall from the last notebook the living histogram of US female college students, with the data in the file [`college_students.csv`](college_students.csv).

![](images/college_students.png)

Remember that if data are normally distributed then approximately 68% of the data lie within one standard deviation of the mean and approximately 95% lie within two standard deviations of the mean.

We asked: are the heights of college students normally distributed? To find out, we did the following:
1. Calculate the mean and standard deviation of the heights.
2. From the data, find the proportion of students within one and two standard deviations of the mean.
3. If those proportions are roughly 68% and 95% respectively, then the data are likely to be normally distributed. 

We used the pandas methods `.mean()` and `.std()` to find the mean and standard deviation of the heights and then counted the number of students in the picture by hand. This is clearly inefficient. But now you have learnt how to do the counting programmatically using the `.query()` method.
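
As a rough sketch of the counting step, the example below finds the proportion of values within one standard deviation of the mean. The heights and the column name `height` are made up for illustration and are not the values in `college_students.csv`; the two-standard-deviation case follows the same pattern.

```python
import pandas as pd

# Made-up heights (inches); the column name 'height' is hypothetical
toy_heights = pd.DataFrame({'height': [60, 62, 63, 64, 64, 65, 65, 66, 67, 70]})

mean_height = toy_heights['height'].mean()
sd_height = toy_heights['height'].std()

# Limits are computed from the data rather than hard-coded into the query
lower = mean_height - sd_height
upper = mean_height + sd_height

within_one_sd = toy_heights.query('@lower <= height <= @upper')
proportion = len(within_one_sd) / len(toy_heights)
print(proportion)
```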

<div class="alert alert-danger">

* Using mean(), std(), query() and len(), (and without hard-coding numbers into query()) calculate the proportion of college students within one and two standard deviations of the mean height.
* Hence determine if student heights are normally distributed.
</div>
<br>

### References

Schluter D. (1988). The evolution of finch communities on islands and continents: Kenya vs. Galapagos. *Ecological Monographs* **58**:229-249.