# Function Application and Mapping

---
What we will learn in this chapter might be some of the most important concepts and skills that we cover in this entire course. We will tie together much of what we have learned in previous chapters, and the relevance will become clear as you read and work through the exercises.

Function application and mapping simply refers to processing the entries of a `DataFrame` to better suit your needs.

For instance, suppose a data set you are analyzing contains a column describing the delay for a commercial flight, and we are interested in seeing the distribution of delay times across weekdays. We could group all the flights that occurred on the same weekday over a time period and compare the aggregated delay time for each day to the total delay time over the period. If the delays were evenly distributed, then the result might look like:

| Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
|:----------|-----------|-----------|-----------|-----------|-----------|-----------|
| 0.1429 | 0.1429 | 0.1429 | 0.1429 | 0.1429 | 0.1429 | 0.1429 |

But if one day in particular consistently had delays then it might look something like this:

| Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
|:----------|-----------|-----------|-----------|-----------|-----------|-----------|
| 0.1| 0.4 |0.1|0.1|0.1|0.1|0.1|

We will learn how to do this type of analysis and more in this chapter.
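
As a preview, the weekday comparison described above can be sketched in a few lines. The `flights` table and its `DAY_OF_WEEK` and `DELAY` values are made up for illustration; the `groupby()` method used here is introduced later in the chapter.

```python
import pandas as pd

# Hypothetical flight records: the weekday labels and delays (minutes) are made up.
flights = pd.DataFrame({
    "DAY_OF_WEEK": ["Mon", "Mon", "Tue", "Wed", "Wed", "Thu"],
    "DELAY": [30, 10, 20, 5, 15, 20],
})

# Total delay per weekday divided by the total delay over the whole period.
delay_share = flights.groupby("DAY_OF_WEEK")["DELAY"].sum() / flights["DELAY"].sum()
print(delay_share)
```

Each value is the fraction of all delay minutes that fell on that weekday, so the values sum to 1.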


# Preparing Our Environment

---

`pandas` provides useful and intuitive functionality for function application and mapping, and it is the only library we need in this chapter. Please run the code cell below to import it.

```python
import pandas as pd
```

In [1]:
import pandas as pd

# About the Data

---

We will be using the same subset of the publicly available dataset from the Centers for Medicare & Medicaid Services ([`CMS` website](https://www.cms.gov/OpenPayments/Explore-the-Data/Dataset-Downloads.html)). As a reminder, our subset of the data contains the following columns:

| Column |Description|
|:----------|-----------|
| `unique_id`| A unique identifier for a Medicare claim to CMS |
| `doctor_id` | The unique identifier of the doctor who <br/> prescribed the medicine  |
| `specialty` | The specialty of the doctor who prescribed the medicine |
| `medication` | The medication prescribed |
| `nb_beneficiaries` | The number of beneficiaries the <br/> medicine was prescribed to  |
| `spending` | The total cost of the medicine prescribed <br/>for the CMS |

We will specifically be looking at datasets that are intentionally constructed with some common issues so we can practice the data cleaning and preparation skills we will be covering in this chapter. The file that we will be using throughout this chapter is named 'spending_clean_ex.csv' and, relative to our current working directory, it is in the folder 'Data'. We will read this `.csv` file and save its contents into a `DataFrame` named `spending_df`. We also know ahead of time that this file contains the column `unique_id`, which we want to use as the index of our `DataFrame`. To do this we type:

```python
spending_df = pd.read_csv('Data/spending_clean_ex.csv', index_col='unique_id')
```

In [2]:
spending_df = pd.read_csv('Data/spending_clean_ex.csv', index_col='unique_id')

# Exercise 6.0: Importing the Honolulu Flights Data Set

For some of the exercises in this chapter we will again be working with a data set containing information about all the flights arriving at and departing from the Honolulu airport, HNL, on the island of Oahu in December 2015. This data set was introduced in Chapter 3: `DataFrame` Attributes and Arithmetic.

Please run the following code cell, which will parse the 'honolulu_flights.csv' file and build the `HNL_flights_df` `DataFrame`, before trying the exercises in this chapter related to the Honolulu flights data set.

Please recall that this data set contains the following columns:

| Column |Description|
|:----------|-----------|
| `YEAR` | The year of the flight  |
| `MONTH` |  The month of the flight |
| `DAY` |  The day of the flight |
| `DAY_OF_WEEK` |  The day of the week of the flight |
| `FLIGHT_NUMBER` |  The flight number of the flight |
| `ORIGIN_AIRPORT` |  The origin airport of the flight  |
| `DESTINATION_AIRPORT` |  The destination airport of the flight |
| `DEPARTURE_DELAY` |  The departure delay of the flight  |
| `DISTANCE` |  The distance of the flight in miles |
| `AIR_TIME` |  The flight time without taxiing in minutes |
| `ARRIVAL_DELAY` |  The arrival delay of the flight  |

In [3]:
HNL_flights_df = pd.read_csv('Data/honolulu_flights.csv')

# Global Processing Vs. Group Specific Processing

Function application falls into one of two categories:

1. **Global Processing**
2. **Group Specific Processing**

Global processing applies the same function to every entry in a `Series` or `DataFrame`, while group specific processing applies functions to the entries that belong to a certain group. We will begin by covering global processing in the following cells.

## Global Processing

To apply a function to every element in a `DataFrame` we can use the `applymap()` `DataFrame` method. `applymap()` takes one positional argument: a callable function that accepts a single value and returns a single value. The method applies that function to every single entry in the calling `DataFrame` and returns a new `DataFrame` with the processed entries. (Starting with `pandas` 2.1, `applymap()` is deprecated in favor of the equivalent `DataFrame.map()` method.)

Let us see a simple example. We will construct a `DataFrame` `df` that is $3 \times 3$, i.e. there are three rows and three columns. The entries will be consecutive multiples of 3. To each entry we will apply the anonymous function: `lambda x: x / 3` which will divide a given input by 3. The result will be a new $3 \times 3$ `DataFrame` with the same index and columns as the caller with entries that are the results of the passed function.

```python
>>> df = pd.DataFrame([[ 0,  3,  6], [ 9, 12, 15], [18, 21, 24]], columns=['a', 'b', 'c'])
>>> df.applymap(lambda x: x / 3)
     a    b    c
0  0.0  1.0  2.0
1  3.0  4.0  5.0
2  6.0  7.0  8.0
```

This type of processing is relatively rare because it only makes sense when the same function can reasonably be applied to every entry regardless of its location. For example, in the `spending_df` `DataFrame`, which mixes identifiers, text, and numeric columns, there are few functions that would be reasonable to apply globally. But when you need this functionality, the `applymap()` method is quite useful.
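
Because `applymap()` has been renamed to `DataFrame.map()` in recent `pandas` releases, a version-tolerant call can be sketched as follows. The small `df` is made up, and the `hasattr` check is just one possible way to choose between the two method names.

```python
import pandas as pd

df = pd.DataFrame([[0, 3, 6], [9, 12, 15]], columns=["a", "b", "c"])

# Newer pandas releases expose DataFrame.map; older ones only have applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
thirds = elementwise(lambda x: x / 3)

# A single column is a Series, which has its own element-wise map() method.
a_thirds = df["a"].map(lambda x: x / 3)

print(thirds)
print(a_thirds)
```

Both calls return new objects with the same shape and labels as the caller; `df` itself is left unchanged.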

## Group Specific Processing

A more common scenario is applying a function to a specific group of data, where a group is a subset of the rows that share the same value for some criterion.

The `groupby()` `DataFrame` method is used to group rows of data by the entries in one or more columns. The `groupby()` method accepts the parameter `by`, which specifies how you want to group the rows of the calling `DataFrame`. `by` can be a single column label, a list of column labels, or a callable function. The method returns a `pandas` `GroupBy` object, an object we have not seen before, which has attributes and methods that will be useful to us. In this chapter we will mostly cover the case of setting the `by` parameter of the `groupby()` method to a single column label; if you are interested, you can read more about the method [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html).

If `by` is a single label then the calling `DataFrame` will be grouped by the values in the column with the passed label, i.e. every row with the same value in the specified column will be in the same group. For example, let us group the `spending_df` `DataFrame` by the values in the `specialty` column and save the returned `GroupBy` object to a variable we will call `spending_by_specialty`. To do this we use the following code.

```python
>>> spending_by_specialty = spending_df.groupby('specialty')
```

`GroupBy` objects have a handy method called `get_group()`, which returns all the entries of a specified group as a `DataFrame`. The `get_group()` method takes a positional argument that is the name of the group to access. The returned `DataFrame` is a subset of the initial `DataFrame` used to instantiate the `GroupBy` object: it contains every row whose value in the column specified by the `by` parameter of the original `groupby()` call matches the name passed to `get_group()`.

Continuing with the example of the `spending_by_specialty` `GroupBy` object, let us see how we would retrieve the group of rows from the `spending_df` `DataFrame` whose entries in the `specialty` column all have the value 'CARDIOLOGY'. This group conveniently has the name 'CARDIOLOGY', so when we use the `get_group()` method we simply pass the value 'CARDIOLOGY'.

```python
>>> spending_by_specialty.get_group("CARDIOLOGY")
            doctor_id   specialty            medication  nb_beneficiaries  spending
unique_id                                                                          
CG916968   1952344418  CARDIOLOGY           SIMVASTATIN              85.0    767.83
CG865025   1497955603  CARDIOLOGY  ATORVASTATIN CALCIUM              82.0   2726.72
MK361461   1245206184  CARDIOLOGY   PANTOPRAZOLE SODIUM             102.0   1608.42
YS123432   1841589744  CARDIOLOGY       VENLAFAXINE HCL              25.0    265.49
```

The example above returns all the entries of the "CARDIOLOGY" group, i.e. all the rows from the original `DataFrame` whose entry in the `specialty` column is "CARDIOLOGY", organized into a new `DataFrame`.
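
Besides `get_group()`, a `GroupBy` object can be inspected in a few other ways. The following is a minimal sketch on a made-up `toy_df` standing in for `spending_df`; it uses the `ngroups` and `groups` attributes and plain iteration over the groups.

```python
import pandas as pd

# Toy stand-in for spending_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "specialty": ["CARDIOLOGY", "NEUROLOGY", "CARDIOLOGY", "NEUROLOGY", "UROLOGY"],
    "spending": [100.0, 250.0, 50.0, 75.0, 10.0],
})
by_specialty = toy_df.groupby("specialty")

print(by_specialty.ngroups)        # number of distinct groups
print(list(by_specialty.groups))   # group names, i.e. the unique specialty values

# Iterating yields (group name, sub-DataFrame) pairs.
for name, group in by_specialty:
    print(name, len(group))

cardiology = by_specialty.get_group("CARDIOLOGY")
```

Looking at the group names first is a convenient way to find out which values can be passed to `get_group()`.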

In [0]:
spending_by_specialty = spending_df.groupby('specialty')
spending_by_specialty.get_group("CARDIOLOGY")

# Exercise 6.1: Group Specific Processing

Which of the following options will correctly group data in `HNL_flights_df` by the entries in the `ORIGIN_AIRPORT` column and save the results in a `GroupBy` object, `HNL_flights_by_origin`?

A:
```python
HNL_flights_by_origin = HNL_flights_df.groupby('ORIGIN_AIRPORT')
```

B:
```python
HNL_flights_df.groupby('ORIGIN_AIRPORT', inplace=True)
```

C:
```python
HNL_flights_by_origin.groupby('ORIGIN_AIRPORT')
```

D:
```python
HNL_flights_by_origin = pd.groupby(HNL_flights_df['ORIGIN_AIRPORT'])
```

*Hint: Feel free to use the code cell below to try these commands out. For the incorrect options, make note of what is going wrong and/or what errors are being thrown.*


---


Building on the first part of this exercise, which of the following lines of code will extract the subset, or group, of data in `HNL_flights_df` that all have a common `ORIGIN_AIRPORT` value of `LAX` and save the result into the `DataFrame` `LAX_to_HNL_df`?


A:
```python
LAX_to_HNL_df = HNL_flights_by_origin.get_group(ORIGIN_AIRPORT = 'LAX')
```

B:
```python
LAX_to_HNL_df = HNL_flights_by_origin.loc[:,'LAX']
```

C:
```python
LAX_to_HNL_df = HNL_flights_by_origin['LAX']
```

D:
```python
LAX_to_HNL_df = HNL_flights_by_origin.get_group('LAX')
```

*Hint: Feel free to use the code cell below to try these commands out. For the incorrect options, make note of what is going wrong and/or what errors are being thrown.*

In [0]:
# Exercise 6.1 scratch code cell

### Split Apply Combine

Getting groups can be easily implemented using subsetting. For instance, we could have obtained the "CARDIOLOGY" group of the `spending_df` `DataFrame` by subsetting `spending_df` with the boolean `Series` returned from the operation: `spending_df.loc[:, "specialty"] == "CARDIOLOGY"`. 

```python
>>> spending_df[spending_df.loc[:, "specialty"] == "CARDIOLOGY"]
            doctor_id   specialty            medication  nb_beneficiaries  spending
unique_id                                                                          
CG916968   1952344418  CARDIOLOGY           SIMVASTATIN              85.0    767.83
CG865025   1497955603  CARDIOLOGY  ATORVASTATIN CALCIUM              82.0   2726.72
MK361461   1245206184  CARDIOLOGY   PANTOPRAZOLE SODIUM             102.0   1608.42
YS123432   1841589744  CARDIOLOGY       VENLAFAXINE HCL              25.0    265.49
```

We see in the above example that the returned `DataFrame` is exactly the same as the result we saw in the previous cell introducing `groupby()` and `get_group()`. So why use `GroupBy` objects anyway?

An ideal usage of `groupby()`, and the resulting `GroupBy` object, will apply operations to **each** group independently. Furthermore, `GroupBy` objects are intended to be applied in the context of the data processing paradigm called "split-apply-combine":

* **Split** the data into chunks defined using one or more columns
* **Apply** some operation on the chunks generated. 
* **Combine** the results of the applied operation into a new `DataFrame`

For instance, suppose we wanted to compute the total spending by `specialty` and save the result to a new `DataFrame`. The steps we would need to take are:

1. Split the data by `specialty`, i.e. `groupby('specialty')`
2. Apply the `sum()` method to the `spending` column for each group
3. Combine the results from each group into a new `DataFrame`

![](images/split_apply_combine_example.png)

So rather than manually subsetting each group and then applying the desired operation we could automate this workflow using the helpful `GroupBy` methods implemented by `pandas` to save ourselves some time and effort.
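
To make the comparison concrete, here is a minimal sketch that computes total spending per specialty twice: once by manually subsetting each group, and once with `groupby()`. The `toy_df` values are made up for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "specialty": ["CARDIOLOGY", "NEUROLOGY", "CARDIOLOGY", "NEUROLOGY"],
    "spending": [100.0, 250.0, 50.0, 75.0],
})

# Manual version: split with boolean masks, apply sum(), combine into a Series.
manual = {}
for name in toy_df["specialty"].unique():
    chunk = toy_df[toy_df["specialty"] == name]   # split
    manual[name] = chunk["spending"].sum()        # apply
manual_totals = pd.Series(manual).sort_index()    # combine

# groupby() performs the same three steps in one expression.
grouped_totals = toy_df.groupby("specialty")["spending"].sum()

print(manual_totals)
print(grouped_totals)
```

Both approaches give the same totals; the `groupby()` version simply avoids writing the loop and the bookkeeping by hand.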

#### The 3 Classes of Operations on Groups

There are 3 classes of split-apply-combine operations that can be applied to group data.


1. __Aggregations__ generate a single value for each group
  
2.  __Transformations__ convert the data and generate a group of the same size as the original group.

3.  __Filters__ retain or discard a group based on group-specific boolean computations.

![](./images/aggregate.png)

![](./images/transform.png)

![](images/filter.png)
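
The difference between the three classes is easiest to see in the shape of their results. The sketch below applies one operation of each class to a made-up `toy_df`; the column names `team` and `points` are hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "points": [1, 3, 10, 20, 30],
})
by_team = toy_df.groupby("team")["points"]

# Aggregation: one value per group.
aggregated = by_team.sum()

# Transformation: one value per original row.
transformed = by_team.transform(lambda s: s - s.mean())

# Filter: whole groups are kept or dropped; here only groups with 3+ rows survive.
filtered = toy_df.groupby("team").filter(lambda g: len(g) >= 3)

print(len(aggregated), len(transformed), len(filtered))
```

The aggregation has one entry per group, the transformation has one entry per row of `toy_df`, and the filter returns the unmodified rows of the groups that passed the test.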

#### Aggregate

__Aggregations__ aggregate the data in each group, i.e., they reduce the data to a single value. This includes, for instance, computing group sums, means, maximums, minimums, _etc_. Some of the interesting/important summary aggregation methods of `GroupBy`  objects are:

|Methods| Description|
|:----------|:----------------|
| `sum` | Computes the sum of the values in each group| 
| `mean`, `median` | Computes the mean and the median in each group| 
| `min`, `max` | Computes the min and max in each group| 
| `size` | Computes the number of rows in each group| 

When one of these methods is called on a `GroupBy` object, it is applied to each group individually and the per-group results are then combined into a new `DataFrame` (or a `Series`, in the case of `size()`).
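
A small sketch of these built-in aggregations on a made-up `toy_df`; a single column is selected from the `GroupBy` object so that each method returns a `Series` indexed by group name, and the results are collected side by side.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "specialty": ["CARDIOLOGY", "CARDIOLOGY", "CARDIOLOGY", "NEUROLOGY"],
    "spending": [10.0, 20.0, 60.0, 5.0],
})
spending_groups = toy_df.groupby("specialty")["spending"]

# Each method reduces every group to one value; the Series align on the group names.
summary = pd.DataFrame({
    "mean": spending_groups.mean(),
    "median": spending_groups.median(),
    "min": spending_groups.min(),
    "max": spending_groups.max(),
    "size": spending_groups.size(),
})
print(summary)
```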

For example, suppose we wanted to group by `specialty`, apply the `sum()` method to calculate the total `spending` and total `nb_beneficiaries`, and then combine the results into a new `DataFrame` which holds the total `spending` and `nb_beneficiaries` by `specialty`. We could achieve this by first splitting the data using the `groupby()` `DataFrame` method to obtain a new `GroupBy` object, which we will call `spending_by_specialty`. Then we could apply and combine using the `GroupBy` object's `sum()` method.

```python
>>> spending_by_specialty = spending_df.groupby('specialty')
>>> # Select the two numeric columns of interest before summing
>>> spending_by_specialty[['nb_beneficiaries', 'spending']].sum().head(n=3)
                 nb_beneficiaries  spending
specialty                                  
CARDIOLOGY                  294.0   5368.46
ENDOCRINOLOGY               105.0  76346.72
FAMILY PRACTICE             929.0  43140.84
```

We see from the above example that the `GroupBy` `sum()` method returns a `DataFrame` with an index labeling the group that the row entry corresponds to and entries telling us the total `spending` and total `nb_beneficiaries`.

In [0]:
spending_df.groupby('specialty').sum().head()


##### Aggregate Continued

As discussed in the previous cell, `pandas` implements the most common aggregation methods for us, like `sum()` and `mean()`, but sometimes our data requires custom processing. The `GroupBy` method `agg()` can be used where complex or custom aggregation logic is required. `agg()` takes a function and uses it to aggregate each group in the same way that we saw `sum()` do in the previous cell. The function passed is called on the data of each group of the calling `GroupBy` object (the group's `DataFrame`, or each of its columns as a `Series`) and must reduce that data to a single value per column.

For example, suppose we wanted to find the total spending by specialty in Canadian dollars. We can define a function called `sum_spending_CAD()` to return the sum of the spending of a group in Canadian dollars. Then we can create a new `GroupBy` object, call it `spending_by_specialty`, using a subset of the `spending_df` `DataFrame` containing only the `specialty` and `spending` columns. Lastly, we can call `agg()` on the `spending_by_specialty` `GroupBy` object and pass it the `sum_spending_CAD` function.

```python
>>> def sum_spending_CAD(x):
...     return x.sum() * 1.26  # illustrative USD-to-CAD rate matching the output below
>>> spending_by_specialty = spending_df.loc[:, ['specialty', 'spending']].groupby('specialty')
>>> spending_by_specialty.agg(sum_spending_CAD).head(n=3)
                   spending
specialty                  
CARDIOLOGY        6764.2596
ENDOCRINOLOGY    96196.8672
FAMILY PRACTICE  54357.4584
```

We see in the above example that the result is a new `DataFrame` with the unique `specialty` values as the index and values corresponding to the sum total of the spending by specialty in Canadian dollars.
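
Any function that reduces a group to a single value can be used in the same way. As a further sketch, the hypothetical helper `spending_range()` below computes the spread between the largest and smallest spending in each group of a made-up `toy_df`.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "specialty": ["CARDIOLOGY", "CARDIOLOGY", "NEUROLOGY"],
    "spending": [100.0, 40.0, 75.0],
})

def spending_range(values):
    # values holds one group's spending entries; the result must be a single number.
    return values.max() - values.min()

ranges = toy_df.groupby("specialty")["spending"].agg(spending_range)
print(ranges)
```

A group with a single row, like "NEUROLOGY" here, has a range of 0.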

To customize group specific processing even further, `agg()` can also take a dictionary of functions to aggregate with. Each key of the dictionary should be the name of a column of the group, and each value a callable function that takes a `Series` and returns a single value.

For example, suppose we wanted to create a new `DataFrame` that tells us the total `nb_beneficiaries` and the max `spending` by specialty from `spending_df`. To do this we would first `groupby()` `specialty` and then call `agg()` with the new `GroupBy` object, passing it the dictionary: `{'nb_beneficiaries' :sum,'spending' : max}`, which specifies that we want to sum the `nb_beneficiaries` column and find the max of the `spending` column.

```python 
>>> spending_by_specialty = spending_df.groupby('specialty')
>>> spending_by_specialty.agg({'nb_beneficiaries': sum,
...                            'spending': max}).head(n=3)
                 nb_beneficiaries  spending
specialty                                  
CARDIOLOGY                  294.0   2726.72
ENDOCRINOLOGY               105.0  76346.72
FAMILY PRACTICE             929.0  15640.59
```
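
A related form, sketched here on made-up data, is named aggregation: each keyword argument to `agg()` names an output column and maps it to a `(column, function)` pair. The output names `total_beneficiaries` and `max_spending` are hypothetical choices for this example.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "specialty": ["CARDIOLOGY", "CARDIOLOGY", "NEUROLOGY"],
    "nb_beneficiaries": [85.0, 82.0, 30.0],
    "spending": [767.83, 2726.72, 500.0],
})

# keyword = (input column, aggregation); the keyword becomes the output column name.
summary = toy_df.groupby("specialty").agg(
    total_beneficiaries=("nb_beneficiaries", "sum"),
    max_spending=("spending", "max"),
)
print(summary)
```

Compared with the dictionary form, this makes it explicit in the column names which aggregation produced each value.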

In [0]:
def sum_spending_CAD(x):
   return x.sum() * 1.26  # illustrative USD-to-CAD rate used in the example above
spending_by_specialty = spending_df.loc[:, ['specialty', 'spending']].groupby('specialty')
spending_by_specialty.agg(sum_spending_CAD).head(n=3)

# Exercise 6.2: Aggregate

Using the code cell below, create a new `DataFrame` named `delay_by_origin` that is indexed by the unique origin airports in `HNL_flights_df` and contains the median departure and arrival delays for groups of flights with common origin airports. 

In [4]:
# Type your solution to Exercise 6.2 here

#### Transform

__Transformations__ change the data in a way that is group-specific. As opposed to aggregations, which reduce the data into a single value, transformations modify the data but don't change the shape of the groups.

The example below computes the percent contribution of each entry to each specialty by applying a transformation that normalizes the entry's spending over the total spending in that specialty. 

![](images/transform_spending.png)


##### Transform Continued 1

Applying a transformation is done using the `transform()` `GroupBy` method. The `transform()` method takes as input a function, which it calls on each group of the `GroupBy` object. The function passed to `transform()` receives the data of one group at a time (the group's `DataFrame`, or each of its columns as a `Series`) and must return a result of the same size as the group.

For example, suppose we wanted to transform the `spending` column of the `spending_df` `DataFrame` to hold the percentage of the total spending by specialty that each row makes up. First, we define a function that takes a group's data and calculates the percentage of the group total that each entry represents. Then we create a new `GroupBy` object grouped by `specialty` from a subset of `spending_df` that only has the columns `spending` and `specialty`. Finally, we call `transform()`, passing it the name of our defined function.

```python
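>>> def my_function(x):
...     return (x / x.sum()) * 100
...
>>> # Keep only the grouping column and the numeric column to transform
>>> spending_by_specialty = spending_df.loc[:, ['specialty', 'spending']].groupby('specialty')
>>> spending_by_specialty.transform(my_function)[spending_df['specialty'] == "CARDIOLOGY"]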
            spending
unique_id           
CG916968   14.302612
CG865025   50.791475
MK361461   29.960547
YS123432    4.945366
```

We see that the result is a new `DataFrame` with an index matching that of the original `DataFrame` used to initialize the `GroupBy` object. This is different than the aggregation example because aggregation reduces the group to a single value, while transformation maintains the shape of the calling `DataFrame`.

Let us save these results into a new column in `spending_df` called `spending_pct`.

```python
>>> spending_df['spending_pct'] = spending_by_specialty.transform(my_function)
```
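
The same percentages can also be obtained without a custom function. The sketch below, on a made-up `toy_df`, passes the string `"sum"` to `transform()` so that each row receives its group's total, and then divides; the helper column name `specialty_total` is hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "specialty": ["CARDIOLOGY", "CARDIOLOGY", "NEUROLOGY", "NEUROLOGY"],
    "spending": [100.0, 300.0, 50.0, 150.0],
})
by_specialty = toy_df.groupby("specialty")["spending"]

# Passing the string "sum" broadcasts each group's total back to every row of the group.
toy_df["specialty_total"] = by_specialty.transform("sum")
toy_df["spending_pct"] = toy_df["spending"] / toy_df["specialty_total"] * 100
print(toy_df)
```

Because the transformed result has one value per original row, it can be assigned directly as a new column, and the percentages within each specialty add up to 100.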

In [0]:
def my_function(x):
  return (x   / x.sum() ) * 100

spending_by_specialty = spending_df.loc[:, ['specialty', 'spending']].groupby('specialty')

spending_df['spending_pct'] = spending_by_specialty.transform(my_function)

##### Transform Continued 2

Suppose we wanted to see the percent spending by `medication` and `specialty`. One solution to achieve this would be to group on both the `specialty` and the `medication` columns and then sum the `spending_pct` that was computed previously.

```python
>>> medication_spending_pct =  spending_df.loc[:,['specialty', 'medication', 'spending_pct']].groupby(["specialty", "medication"]).sum()
>>> medication_spending_pct.head(n=3)
                                   spending_pct
specialty  medication                        
CARDIOLOGY ATORVASTATIN CALCIUM     50.791475
           PANTOPRAZOLE SODIUM      29.960547
           SIMVASTATIN              14.302612
```

Since we are grouping on two columns, the resulting index of `medication_spending_pct` is a `MultiIndex` with two levels, `specialty` and `medication`. To turn those levels back into ordinary columns before sorting on `specialty` and `spending_pct`, we reset the index using the method `reset_index()`. `reset_index()` moves the current index levels into columns and sets the index of the calling `DataFrame` to the default index, a range of integers.

```python
>>> medication_spending_pct = spending_df.loc[:, ['specialty', 'medication', 'spending_pct']].groupby(["specialty", "medication"]).sum().reset_index()
>>> medication_spending_pct.sort_values(["specialty", "spending_pct"], ascending=[True, False]).head(n=4)
       specialty                      medication  spending_pct
0     CARDIOLOGY            ATORVASTATIN CALCIUM     50.791475
1     CARDIOLOGY             PANTOPRAZOLE SODIUM     29.960547
2     CARDIOLOGY                     SIMVASTATIN     14.302612
3     CARDIOLOGY                 VENLAFAXINE HCL      4.945366
```

The resulting `DataFrame` is sorted by `specialty` first and then `spending_pct`. Each row tells us the percent of spending for each unique medication and specialty combination.   
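
An alternative to calling `reset_index()` afterwards is to pass `as_index=False` to `groupby()`, which keeps the grouping keys as ordinary columns. A minimal sketch on made-up data:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "specialty": ["CARDIOLOGY", "CARDIOLOGY", "CARDIOLOGY", "NEUROLOGY"],
    "medication": ["SIMVASTATIN", "SIMVASTATIN", "ATORVASTATIN CALCIUM", "GABAPENTIN"],
    "spending_pct": [10.0, 30.0, 60.0, 100.0],
})

# Default: the group keys become a two-level MultiIndex.
indexed = toy_df.groupby(["specialty", "medication"]).sum()

# as_index=False keeps the keys as ordinary columns with a default integer index.
flat = toy_df.groupby(["specialty", "medication"], as_index=False).sum()

print(indexed.index.nlevels)
print(flat)
```

The `flat` result can be passed straight to `sort_values()` using the `specialty` and `spending_pct` column names.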

In [0]:
medication_spending_pct = spending_df.loc[:, ['specialty', 'medication', 'spending_pct']].groupby(["specialty", "medication"]).sum().reset_index()
medication_spending_pct.sort_values(["specialty", "spending_pct"], ascending=[True, False]).head(n=4)

# Exercise 6.3: Transform

Using the code cell below, 
* Create a new `DataFrame` named `distance_and_day_df` that is a subset of `HNL_flights_df` containing only the `DAY` and `DISTANCE` columns. 
* Group `distance_and_day_df` by the `DAY` column and save the resulting `GroupBy` object in the variable `distance_by_day`. 
* Transform the `DISTANCE` column for each flight by calculating the percentage of the day's total distance that the flight covered. Save the result in a new column of `HNL_flights_df` named `DISTANCE_PCT`. Use the function pre-defined in the cell to perform the transformation.

In [5]:
def percent_of_total(x):
  return (x   / x.sum() ) * 100

# Type your solution to Exercise 6.3 here

#### Filter

__Filtering__ a group consists of dropping or retaining groups in a way that depends on a group-specific computation that returns `True` or `False`. Groups that are retained will be left unmodified. For instance, we can filter out specialties from `spending_df` that don't have enough entries or for which the mean `spending` is below a certain threshold.

Filtering a group is done using the `GroupBy` method `filter()`. The method `filter()` takes as input a function name, which it calls on each group of the `GroupBy` object. The function must return either `True` or `False` and groups for which the function returns `False` are dropped. The resulting `DataFrame` will have entries in the same order as the original `DataFrame`.

Suppose we want to filter out the specialties that are low spending, i.e. we want to filter out the specialties for which the total spending is less than some defined threshold, let us say $\$50000$. To do this we can define a function named `filter_on_spending()`. The defined function will take a `DataFrame`, determine whether the sum total of the `spending` column in that `DataFrame` is greater than 50000, and then return `True` if it is or `False` if not.

Then, to apply the filter on `spending_df`, we first subset the `DataFrame` so that only the columns `specialty` and `spending` are remaining and then group by `specialty`. Then the `GroupBy filter()` method can be called with the `filter_on_spending()` function passed as an argument. We can save the results into a new `DataFrame` named `high_spending_df`. Then to see which specialties surpassed the $\$50000$ total spending threshold we can print the unique values in the `specialty` column of `high_spending_df`.

```python
>>> def filter_on_spending(x):
...     return x['spending'].sum() > 50000
...
>>> high_spending_df = spending_df[["specialty", 'spending']].groupby('specialty').filter(filter_on_spending)
>>> # unique() lists each remaining specialty once
>>> high_spending_df['specialty'].unique()
array(['INTERNAL MEDICINE', 'ENDOCRINOLOGY'], dtype=object)
```
We see that only two specialties passed the threshold of total spending greater than $\$50000$.
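
The other criterion mentioned earlier, dropping specialties that don't have enough entries, can be sketched the same way. The `toy_df` values and the `min_rows` threshold below are made up for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "specialty": ["CARDIOLOGY", "NEUROLOGY", "CARDIOLOGY", "UROLOGY", "CARDIOLOGY"],
    "spending": [100.0, 250.0, 50.0, 10.0, 25.0],
})

def has_enough_entries(group, min_rows=2):
    # min_rows is a hypothetical threshold chosen for this example.
    return len(group) >= min_rows

well_represented = toy_df.groupby("specialty").filter(has_enough_entries)
print(well_represented)
```

Only the "CARDIOLOGY" rows remain, in their original order and with their original index labels.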


In [0]:
def filter_on_spending(x):
     return x['spending'].sum() > 50000

high_spending_df = spending_df[["specialty", 'spending']].groupby('specialty').filter(filter_on_spending)

high_spending_df['specialty'].unique() 

# Exercise 6.4: Filter

Using the code cell below, 
* Group `HNL_flights_df` by the `DAY` column and save the resulting `GroupBy` object in the variable `hnl_flights_by_day`. 
* Filter the flights by determining if the `ARRIVAL_DELAY` of the day was net positive, i.e. if there was a positive total delay for a day keep the flights, otherwise filter them out. Save the resulting `DataFrame` into the variable `HNL_flights_delayed_days_df`. Use the function pre-defined in the cell to perform the filtering.

In [6]:
def net_postive_arrival_delay(x):
  return  x.ARRIVAL_DELAY.sum() > 0

# Type your solution to Exercise 6.4 here

#### Thinning Data and The Flexible `apply()` Method

`pandas` provides a few built-in `GroupBy` methods for thinning the data, including `nlargest()`, `nsmallest()`, and more. As an example of `nlargest()`, we can take a subset of `spending_df` containing only the `spending` and `specialty` columns, group it by `specialty`, and then obtain the 2 largest `spending` values in each specialty. The result is a new `Series`, indexed by `specialty` and `unique_id`, holding only the top 2 spenders from each unique specialty.

```python
>>> spending_by_specialty = spending_df.loc[:,['specialty', 'spending']].groupby('specialty')
>>> spending_by_specialty['spending'].nlargest(2).head(n=4)
specialty        unique_id
CARDIOLOGY       CG865025      2726.72
                 MK361461      1608.42
ENDOCRINOLOGY    FV632964     76346.72
FAMILY PRACTICE  FE564384     15640.59
Name: spending, dtype: float64
```
    
Some specialties only had a single representative which is why there are not two entries for 'ENDOCRINOLOGY'.

Though `pandas` has the more common and basic aggregation, transformation, and thinning methods implemented for us, they could not possibly cover all cases. Cases that do not fit into any one of these categories may be carried out by using the more flexible `apply()` `GroupBy` method. `apply()` takes as input a function, which it calls on each group of the calling `GroupBy` object.

For example, suppose we wanted to thin our dataset so that only 50% of the rows of each specialty are kept. To do this we can define a new function, which we will call `sample_50p`, that utilizes the `sample()` `DataFrame` method. The `sample()` method takes a parameter `frac` that specifies the fraction of the rows of the original `DataFrame` to return, chosen at random. We can then use the `apply()` method and pass it the name of our newly defined function to obtain a new `DataFrame` that is thinned at the group specific level.

```python
>>> def sample_50p(x):
...     return x.sample(frac=0.5)
...
>>> spending_by_specialty = spending_df.loc[:,['specialty', 'spending', 'medication']].groupby('specialty')
>>> spending_by_specialty.apply(sample_50p).head(n=3)
                                 specialty  spending            medication
specialty       unique_id                                                 
CARDIOLOGY      YS123432        CARDIOLOGY    265.49       VENLAFAXINE HCL
                CG865025        CARDIOLOGY   2726.72  ATORVASTATIN CALCIUM
FAMILY PRACTICE JX313970   FAMILY PRACTICE   8045.03           RISPERIDONE
```
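
Because `sample()` draws rows at random, the output above will differ from run to run. As a deterministic illustration of `apply()`, the sketch below keeps the single highest-spending row of each specialty in a made-up `toy_df`. The helper `top_n()` is hypothetical, and the columns passed to it are selected explicitly so that the function always receives the same columns.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "specialty": ["CARDIOLOGY", "NEUROLOGY", "CARDIOLOGY", "UROLOGY", "NEUROLOGY"],
    "medication": ["SIMVASTATIN", "GABAPENTIN", "ATORVASTATIN CALCIUM", "TAMSULOSIN HCL", "DONEPEZIL HCL"],
    "spending": [100.0, 250.0, 50.0, 10.0, 75.0],
})

def top_n(group, n=1):
    # group holds one specialty's rows; return its n highest-spending rows.
    return group.sort_values("spending", ascending=False).head(n)

top_spenders = toy_df.groupby("specialty")[["medication", "spending"]].apply(top_n)
print(top_spenders)
```

As in the sampling example, the group name becomes the outer level of the resulting index.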

In [0]:
def sample_50p(x):
  return x.sample(frac=0.5)
spending_by_specialty = spending_df.loc[:,['specialty', 'spending', 'medication']].groupby('specialty')
spending_by_specialty.apply(sample_50p).head(n=3)

# Summary

---

**Function Application and Mapping**

* **Global Processing**

  * To apply a function to every element in a `DataFrame` we can use the `applymap()` method; a `Series` offers the analogous `map()` method

* **Group Specific Processing**

  * The `groupby()` method is used to group the data using values on one or more columns

  * `groupby()` is often applied in the context of the data processing paradigm called "split-apply-combine"
    * **Split**: you need to split the data into chunks defined using one or more columns
    * **Apply**: apply some operation on the chunks generated. 
    * **Combine**: combine the results of the applied operation into a new `DataFrame`

  * There are 3 common classes of split-apply-combine operations that can be applied to group data.

    1. __Aggregations__ generate a single value for each group
  
    2.  __Transformations__ convert the data and generate a group of the same size as the original group.

    3.  __Filters__ retain or discard a group based on group-specific boolean computations.