# Intermediate Scientific Computing

## Workshop 3: Data Science (1)

## Previous workshop

- Looked at how to access and manipulate databases
- Learnt about database operations and how to use them

## Before the workshop

- Asked you to look into the pandas [`df.apply()` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html)
- Asked you to consider some questions about how this method is used and post your answers to Padlet

## Today's session

**Aim**: Build on data science concepts to manipulate and plot data

- What does the apply function do? Discuss your thoughts from the asynchronous activity
  - *Padlet: https://uob.padlet.org/racheltunnicliffe1/hdatbmk7jph26x32*
- Using split-apply-combine to manipulate data
  - [*Workshop3_01_SplitApplyCombine*](Workshop3_01_SplitApplyCombine.ipynb)
- Plotting two-dimensional data
  - [*Workshop3_02_2DPlotting*](Workshop3_02_2DPlotting.ipynb)


## Apply function

In the asynchronous activity you were asked to look into a pandas method called `.apply()`:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html

The activity is available within your Workshop 2 notes:

[Workshop2-3_asynchronous.ipynb](../Workshop02/Workshop2-3_asynchronous.ipynb)

## What does this function do?

As part of the asynchronous activity, we asked you to post your answers to these questions on Padlet:

1. What does the apply method do?

2. What type of input would you pass to the apply function for the first argument? 

3. Can you give an example of what could be passed to this function? If possible, try to find/think of a different example than given on the documentation page itself.

Access Padlet: https://uob.padlet.org/racheltunnicliffe1/hdatbmk7jph26x32

## Functions as names

When we call `df.apply()`, the first argument we pass is the name of a function. This function is then *applied* to the DataFrame.

For example, if we wanted to take the square root of a whole DataFrame, we could pass the NumPy `sqrt` function to the `apply` method:

```python
df.apply(np.sqrt)
```

We're calling the `apply()` method on the DataFrame, `df`, and passing `np.sqrt` as the first argument. In this case the square root function will be applied to every element in the pandas DataFrame.
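As a small runnable illustration, the sketch below builds a hypothetical DataFrame of perfect squares (the variable name `squares` and the column names `a` and `b` are made up for this example) and applies `np.sqrt` to it:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: every value is a perfect square
squares = pd.DataFrame({"a": [4.0, 16.0], "b": [9.0, 25.0]})

# Pass the function itself (no round brackets) as the first argument
roots = squares.apply(np.sqrt)
print(roots)
```

Each element of `roots` is the square root of the matching element of `squares`.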

Notice here we don't include round brackets after `np.sqrt`.

The apply method shows us that, in Python, function names (without the round brackets) can be passed around just like variable names.
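To see this outside of pandas, here is a minimal sketch in which a function is stored in a variable and handed to another function. The helper `apply_to_all` is hypothetical and written only for this illustration; it is not part of pandas or NumPy.

```python
import numpy as np


def apply_to_all(func, values):
    """Call func on each value and return the results as a list (hypothetical helper)."""
    return [func(value) for value in values]


# Without round brackets, the name refers to the function object itself
operation = np.sqrt

# The function is only called inside apply_to_all, once per value
results = apply_to_all(operation, [1.0, 4.0, 9.0])

# Any other one-argument function can be passed in the same way
magnitudes = apply_to_all(abs, [-1, 2, -3])
```

Writing `np.sqrt()` with round brackets would instead try to call the function immediately, which is not what we want when passing it on.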

This starts to tie into the idea of *functional programming*, where functions themselves can be treated like objects. These functions often have to be designed to accept arguments in a particular order and sometimes of a particular type.

The benefits of this start to become apparent when we begin designing more complex workflows, for instance where we want to run processes in parallel. Look out for this later on in the course.

## Methods: Split-apply-combine

In data analysis, one concept for managing and manipulating data is **split-apply-combine**.

This is the idea of:
- *splitting* up your problem into smaller chunks
- *applying* some operation to each of those chunks
- *combining* that data back together.
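The three steps can be sketched by hand before using any pandas shortcuts. The data below, including the column names `group` and `value`, is invented for this illustration:

```python
import pandas as pd

# Hypothetical data: a category label and a measurement for each row
data = pd.DataFrame({
    "group": ["a", "b", "a", "b"],
    "value": [1.0, 10.0, 3.0, 30.0],
})

results = {}
for label in data["group"].unique():
    chunk = data[data["group"] == label]    # split: select one chunk
    results[label] = chunk["value"].mean()  # apply: summarise that chunk

# combine: gather the per-chunk results into a single object
combined = pd.Series(results, name="value")
print(combined)
```

Writing the loop out like this is only meant to make the three stages visible; pandas offers more convenient ways of expressing the same idea.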

```python
import pandas as pd
import numpy as np
```

### pandas groupby and resample

You *may* have already met this concept if you have used the `groupby` or `resample` methods offered by the `pandas` package. It is a general concept which also comes up in database manipulation (e.g. SQL has a `GROUP BY` clause).

For `groupby`, the idea is to group data based on a category and then apply an operation to each of these groups.

If we define a DataFrame as follows, we can demonstrate this concept using the groupby method:

```python
df = pd.DataFrame(
    [("bird", 389.0), ("bird", 24.0), ("mammal", 80.2), ("mammal", np.nan), ("mammal", 58)],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "max_speed"),
)
df
```

| | class | max_speed |
|---|---|---|
| falcon | bird | 389.0 |
| parrot | bird | 24.0 |
| lion | mammal | 80.2 |
| monkey | mammal | NaN |
| leopard | mammal | 58.0 |


For this DataFrame of information on different animals, we could group by the "class" column and find the mean of the other columns:

```python
df.groupby("class").mean()
```

| class | max_speed |
|---|---|
| bird | 206.5 |
| mammal | 69.1 |


*Splitting* is done by the `groupby` step:
<pre>
       Split
         ↓         
<b>df.groupby("class")</b>.mean()
</pre>

*Applying* is done by using a function, in this case the `mean` function:
<pre>
                     Apply
                       ↓         
df.groupby("class").<b>mean()</b>
</pre>

*Combining* is done by the syntax as a whole, which understands how to stitch the results back together to create a new pandas DataFrame:
<pre>
          Combine
             ↓         
<b>df.groupby("class").mean()</b>
</pre>

We could even split these steps up if we wanted to:

```python
df_grouped = df.groupby("class") # Group - GroupBy object
df_averaged = df_grouped.mean()  # Apply and combine - DataFrame object
```
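To look inside the intermediate GroupBy object, one option is to iterate over it. The sketch below re-creates the animal DataFrame so that it runs on its own; the `group_sizes` dictionary is just an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [("bird", 389.0), ("bird", 24.0), ("mammal", 80.2), ("mammal", np.nan), ("mammal", 58)],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "max_speed"),
)

df_grouped = df.groupby("class")

# Iterating over a GroupBy object yields (group name, sub-DataFrame) pairs
group_sizes = {}
for name, group in df_grouped:
    group_sizes[name] = len(group)
    print(name, list(group.index))
```

Each `group` here is an ordinary DataFrame holding only the rows for one class, which is the "split" stage made explicit.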

### Apply

We could use a close analogue of the `apply` method discussed in the asynchronous activity to achieve the same result as above:

```python
df.groupby("class").apply(np.mean)
```

| class | max_speed |
|---|---|
| bird | 206.5 |
| mammal | 69.1 |


In our case, pandas has provided us with a shorthand to *apply* the `mean` function (or some other functions) directly. 

However, the principle of applying any function to an object in this way is generally applicable and is, again, linked to the idea of *functional programming*.
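As a sketch of applying a function that pandas does not provide as a shorthand, the example below defines a hypothetical helper, `speed_range`, and applies it to each group. Selecting the `max_speed` column first is a choice made for this illustration so that the function receives a single Series per group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [("bird", 389.0), ("bird", 24.0), ("mammal", 80.2), ("mammal", np.nan), ("mammal", 58)],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "max_speed"),
)


def speed_range(speeds):
    """Difference between the largest and smallest value (hypothetical helper)."""
    # Series.max() and Series.min() skip missing values by default
    return speeds.max() - speeds.min()


# Pass the function name, without round brackets, to apply
ranges = df.groupby("class")["max_speed"].apply(speed_range)
print(ranges)
```

The result is a Series indexed by class, with one value per group.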

## Grouping and resampling our data

Work through this notebook, which discusses the `groupby` and `resample` methods available in pandas. These are the most common ways of applying split-apply-combine and are very useful when dealing with different data sources.

[Workshop3_01_SplitApplyCombine.ipynb](Workshop3_01_SplitApplyCombine.ipynb)
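As a brief preview of `resample`, here is a minimal sketch using made-up readings taken twice a day. The timestamps, values and variable names are invented for this illustration and are not taken from the notebook.

```python
import pandas as pd

# Hypothetical readings taken every 12 hours over two days
times = pd.to_datetime([
    "2024-01-01 00:00", "2024-01-01 12:00",
    "2024-01-02 00:00", "2024-01-02 12:00",
])
readings = pd.Series([1.0, 3.0, 5.0, 7.0], index=times)

# Split the data into daily bins ("D"), apply the mean to each bin
# and combine the results into a new Series with one value per day
daily_mean = readings.resample("D").mean()
print(daily_mean)
```

Here `resample` plays the same role that `groupby` played above, except that the groups are defined by time intervals rather than by a category column.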

## 2D plotting

Work through this small example, which uses magnetic resonance images (MRI) of the brain to demonstrate how we could create plots for two-dimensional data.

[Workshop3_02_2DPlotting.ipynb](Workshop3_02_2DPlotting.ipynb)
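For orientation before opening the notebook, the sketch below shows one common way of displaying a two-dimensional array with matplotlib's `imshow`. It uses a synthetic array rather than MRI data, and the variable names, colour map and labels are all choices made for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic 2D data: a smooth bump centred in a 64 x 64 grid
x = np.linspace(-3, 3, 64)
y = np.linspace(-3, 3, 64)
xx, yy = np.meshgrid(x, y)
image = np.exp(-(xx**2 + yy**2))

fig, ax = plt.subplots()

# Draw the array as an image, one pixel per array element
im = ax.imshow(image, cmap="gray", origin="lower")
fig.colorbar(im, ax=ax, label="intensity")
ax.set_xlabel("column index")
ax.set_ylabel("row index")
plt.show()
```

The notebook may take a different approach; this is only meant to show the general shape of a 2D plotting workflow.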

## Assessment

Programming test to be completed and submitted next week.

**Deadline: Wednesday at 12pm (noon) (Week 4)**

- Details are available on the [Intermediate Scientific Computing](https://www.ole.bris.ac.uk//webapps/login/?action=default_login&new_loc=%2Fwebapps%2Fblackboard%2Fcontent%2FlistContentEditable.jsp%3Fcontent_id%3D_6558649_1%26course_id%3D_249925_1) Blackboard course - ["Assessment, submission and feedback"](https://www.ole.bris.ac.uk//webapps/login/?action=default_login&new_loc=%2Fwebapps%2Fblackboard%2Fcontent%2FlistContentEditable.jsp%3Fcontent_id%3D_6558658_1%26course_id%3D_249925_1%26mode%3Dreset) course content area.
   - "Test 1: Data Science 1"
- Assessment is provided as a zip file containing data and a Jupyter notebook template to be updated and submitted.
- Submit using the same Blackboard submission point on the "Assessment, submission and feedback" tab.

## Seminar: Data Cleaning

There are also seminar sessions this week, which will include a mini hackathon focussed on data cleaning. Check your timetable for when your seminar is scheduled and see the Blackboard course page for [Week 3](https://www.ole.bris.ac.uk/auth-saml/saml/login?apId=_183_1&redirectUrl=https%3A%2F%2Fwww.ole.bris.ac.uk%2Fwebapps%2Fblackboard%2Fcontent%2FlistContentEditable.jsp%3Fcontent_id%3D_7327443_1%26course_id%3D_249925_1) for more details.

There is no preparation needed for this seminar but do make sure to ***bring along your personal device*** to use for coding.