<a href="https://colab.research.google.com/github/gt-cse-6040/skills_oh_week_08/blob/main/week08_session01_apply_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Apply examples ##

In [1]:
import pandas as pd
import numpy as np

Here we will create a random synthetic data set for exploration.  

The synthetic data is player performance in a game. Each record will indicate the number of attempts and successes the player had in a single instance of a game. We want to extract some information and the `apply` methods are going to help us!

In [80]:
# initialize RNG (use the one with the `seed` param if you want the same DF every time)
# rng = np.random.default_rng(seed=6040)
rng = np.random.default_rng()

# set parameters for our simulation
available_players = ['Al', 'Larry', 'Jen', 'Darby', 'Chris', 'Rich', 'Katie']
num_games = 500
min_accuracy = 0.25     # Note that `min_accuracy + accuracy_range` should be <= 1
accuracy_range = 0.5
min_attempts = 10
attempts_range = 20
max_attempts = min_attempts + attempts_range

# build the simulated data
game_players = rng.choice(available_players, num_games, replace=True)
game_attempts = rng.integers(min_attempts, max_attempts, size=num_games)
acc = min_accuracy + rng.random(size=num_games) * accuracy_range
game_success = np.floor(game_attempts*acc).astype('int64')
game_df = pd.DataFrame({'player': game_players, 
                        'attempts': game_attempts, 
                        'successes': game_success})

# take a peek at the data
game_df.head()

Unnamed: 0,player,attempts,successes
0,Katie,25,8
1,Darby,21,8
2,Katie,25,10
3,Larry,19,7
4,Jen,20,10
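As a side note on the generator used above, here is a minimal sketch (using made-up variable names such as `rng_a` and `sample`, which are not part of the notebook) of two behaviours worth keeping in mind: a seeded generator reproduces the same draws, and `integers` treats its upper bound as exclusive unless `endpoint=True` is passed.

```python
import numpy as np

# Two generators built from the same seed produce identical draws.
rng_a = np.random.default_rng(seed=6040)
rng_b = np.random.default_rng(seed=6040)
draws_a = rng_a.integers(10, 30, size=5)
draws_b = rng_b.integers(10, 30, size=5)
same_draws = np.array_equal(draws_a, draws_b)
print(same_draws)

# By default the upper bound is exclusive, so values fall in [10, 30).
sample = np.random.default_rng(seed=6040).integers(10, 30, size=10_000)
print(sample.min(), sample.max())

# With endpoint=True the upper bound is inclusive, so values fall in [10, 30].
sample_incl = np.random.default_rng(seed=6040).integers(
    10, 30, size=10_000, endpoint=True)
print(sample_incl.min(), sample_incl.max())
```

In the simulation above this means `attempts` can be at most `max_attempts - 1`.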


First, we want to know the total number of attempts and successes. How can we do this with `apply`?

Keep in mind that we want the `sum` of only the 'attempts' and 'successes' columns.

Remember that using `axis=0` (the default) will apply the given function to each *column* and `axis=1` will apply the given function to each *row*.



In [23]:
game_df[['attempts', 'successes']].apply(sum)

attempts     9721
successes    4620
dtype: int64
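To make the `axis` behaviour concrete, here is a small sketch on a hand-made three-row frame (`toy_df` is an illustrative name, not part of the game data):

```python
import pandas as pd

toy_df = pd.DataFrame({'attempts': [10, 20, 30],
                       'successes': [4, 9, 12]})

# axis=0 (the default): the function is called once per column
col_totals = toy_df.apply(sum)
print(col_totals)   # attempts 60, successes 25

# axis=1: the function is called once per row
row_totals = toy_df.apply(sum, axis=1)
print(row_totals)   # 14, 29, 42
```

Summing across rows is not meaningful for our game data, but it shows which direction each `axis` value works in.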

That's good, but what if we're going to get other similarly structured data sets and want to be able to do the same thing in the "laziest" way possible? (Remember we're writing code so lazy is good!)

In [24]:
def sum_att_succ(df):
  return df[['attempts', 'successes']].apply(sum)

# let's see if we get the same thing as above
totals_df = sum_att_succ(game_df)
totals_df

attempts     9721
successes    4620
dtype: int64

Ok. We get the same result. Now we have an easy way to get the total of the 'attempts' and 'successes' columns for any DataFrame which contains those columns! That's a great start, but we can find some more interesting results...

Next, let's see if we can get these totals *for each individual player*. The `GroupBy.apply()` method can help us here!

While `Series.apply` works on individual values and `DataFrame.apply` works on `Series` objects (each row or column is a `Series`), `GroupBy.apply` works on `DataFrame` objects. The cell below applies the `print` function to each `DataFrame`, or "group", in the `GroupBy`!

In [31]:
game_df.groupby('player').apply(print)

    player  attempts  successes
11      Al        17          8
23      Al         5          3
27      Al        33         22
28      Al        35         24
34      Al         3          1
..     ...       ...        ...
437     Al        32         16
441     Al        25         18
442     Al         4          2
443     Al        11          7
486     Al        16          5

[69 rows x 3 columns]
    player  attempts  successes
4    Chris        13          7
7    Chris        14          4
15   Chris         8          3
22   Chris        33         14
47   Chris         5          1
..     ...       ...        ...
474  Chris         8          3
477  Chris         9          6
482  Chris        37         11
491  Chris        22          9
497  Chris        12          6

[82 rows x 3 columns]
    player  attempts  successes
5    Darby         0          0
10   Darby         0          0
29   Darby        32         12
33   Darby         8          2
37   Darby         0      
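Another way to see what each flavor of `apply` hands to the function is to return the type name of the argument. This is only a sketch on a tiny hand-made frame (`toy_df` and the lambda helpers are illustrative). The numeric columns are selected after the `groupby` so that each group contains just those columns.

```python
import pandas as pd

toy_df = pd.DataFrame({'player': ['Al', 'Al', 'Jen'],
                       'attempts': [10, 20, 15],
                       'successes': [4, 9, 7]})

# Series.apply: the function receives one value at a time
value_types = toy_df['attempts'].apply(lambda v: type(v).__name__)

# DataFrame.apply: the function receives each column as a Series
column_types = toy_df[['attempts', 'successes']].apply(
    lambda col: type(col).__name__)

# GroupBy.apply: the function receives each group as a DataFrame
group_types = toy_df.groupby('player')[['attempts', 'successes']].apply(
    lambda grp: type(grp).__name__)

print(value_types)
print(column_types)
print(group_types)
```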

Since we grouped by the 'player' column, each group is a `DataFrame` made up of all rows in `game_df` having the same player. Let's see what we get if we apply our `sum_att_succ` function instead of `print`!

In [32]:
game_df.groupby('player').apply(sum_att_succ)

player,attempts,successes
Al,1297,669
Chris,1709,794
Darby,1234,592
Jen,1349,647
Katie,1479,691
Larry,1258,583
Rich,1395,644


Great! Suppose we want the player to remain a data column rather than the index. Here are two different ways to go about it: using `reset_index` or using the `as_index` parameter.

In [33]:
game_df.groupby('player').apply(sum_att_succ).reset_index()

Unnamed: 0,player,attempts,successes
0,Al,1297,669
1,Chris,1709,794
2,Darby,1234,592
3,Jen,1349,647
4,Katie,1479,691
5,Larry,1258,583
6,Rich,1395,644


In [81]:
game_df.groupby('player', as_index=False).apply(sum_att_succ)

Unnamed: 0,player,attempts,successes
0,Al,1237,570
1,Chris,1190,551
2,Darby,1468,727
3,Jen,1507,689
4,Katie,1523,702
5,Larry,1255,581
6,Rich,1456,696


## Aside on return types and `apply`##

The return type of `apply` depends on what the `func` argument returns.  
- If `func` returns a **single value** then `.apply(func)` will return a `Series`. (First example)
- If `func` returns a **Series** or **DataFrame** then `.apply(func)` will return a `DataFrame`. (Second example)
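
The two cases above can be sketched on a small hand-made frame. The names `toy_df`, `stats`, and `grouped` are illustrative only, and the numeric columns are selected up front so that each function sees only `attempts` and `successes`.

```python
import pandas as pd

toy_df = pd.DataFrame({'player': ['Al', 'Al', 'Jen'],
                       'attempts': [10, 20, 15],
                       'successes': [4, 9, 7]})
stats = toy_df[['attempts', 'successes']]

# func returns a single value per column -> the result is a Series
scalar_result = stats.apply(lambda col: col.sum())

# func returns a Series per column -> the result is a DataFrame
series_result = stats.apply(
    lambda col: pd.Series({'total': col.sum(), 'largest': col.max()}))

grouped = toy_df.groupby('player')[['attempts', 'successes']]

# func returns a single value per group -> Series indexed by player
group_scalar = grouped.apply(lambda grp: grp['attempts'].sum())

# func returns a Series per group -> DataFrame with one row per player
group_series = grouped.apply(lambda grp: grp.sum())

print(type(scalar_result).__name__, type(series_result).__name__)
print(type(group_scalar).__name__, type(group_series).__name__)
```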

## A more complex example ##

Now we have some more requirements (roughly in line with a "tough" exam exercise). For each player we want to know the following information:
- `player` - Name of player
- `attempts` - Total attempts
- `successes` - Total successes
- `accuracy` - Mean per-game accuracy (successes / attempts in each game)
- `stddev_att` - Standard deviation of attempts
- `stddev_acc` - Standard deviation of accuracy
- `best_att` - Number of attempts in the game with the most successes
- `best_succ` - Number of successes in the game with the most successes

If a player has multiple games tied for the most successes, use the one with the **fewest attempts** for `best_att` and `best_succ`.



### How to tackle these ###

In general, the pattern for solving this type of exercise is to make a helper function that will do all of these calculations and then apply it to a groupby. It may be helpful to make helper functions for your main helper as well. 

We already have one such function `sum_att_succ` which we defined above. We will define another to help with the "best game" requirements.

To populate the last two columns we have to identify a player's best game and extract the attempts and successes for that game. This helper function will do just that!

Keep in mind that within the context of a `GroupBy.apply`, our `df` will only have records for one individual player.

In [42]:
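def best_game(df):
  # most successes first; ties are broken by the fewest attempts
  df = df.sort_values(['successes', 'attempts'], ascending=[False, True])
  # the first row is now the best game; return it as a Series
  return df.iloc[0, :]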
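
As a quick check of the tie-breaking rule, here is a sketch that runs `best_game` on a hand-made frame for a single player (`toy_games` is illustrative data, not drawn from `game_df`). Two games are tied at 20 successes, so the one with fewer attempts should win.

```python
import pandas as pd

def best_game(df):
    # same logic as the helper above: most successes, then fewest attempts
    df = df.sort_values(['successes', 'attempts'], ascending=[False, True])
    return df.iloc[0, :]

toy_games = pd.DataFrame({'player': ['Al'] * 4,
                          'attempts': [30, 25, 28, 12],
                          'successes': [20, 20, 18, 9]})

best = best_game(toy_games)
print(best['attempts'], best['successes'])   # 25 20
```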

Now we can start working on our "main" helper function...

In [56]:
def main_helper(group):
  # add a column for accuracy (`assign` returns a copy, so the caller's data is left untouched)
  group = group.assign(accuracy=group['successes'] / group['attempts'])
  # get sum of attempts, successes
  stat_sums = sum_att_succ(group)
  # get mean accuracy
  mean_accuracy = group['accuracy'].mean()
  # get stddev of attempts, accuracy
  stat_stddev = group[['attempts', 'accuracy']].apply(np.std)
  # get success, attempts from best game
  stat_best = best_game(group)
  # return required info as Series
  return pd.Series({
      'player': stat_best['player'],
      'attempts': stat_sums['attempts'],
      'successes': stat_sums['successes'],
      'accuracy': round(mean_accuracy, 3),
      'stddev_att': round(stat_stddev['attempts'], 3),
      'stddev_acc': round(stat_stddev['accuracy'], 3),
      'best_att': stat_best['attempts'],
      'best_succ': stat_best['successes']
  })

game_df.groupby('player', as_index=False).apply(main_helper)

Unnamed: 0,player,attempts,successes,accuracy,stddev_att,stddev_acc,best_att,best_succ
0,Al,1297,669,0.48,10.997,0.192,35,25
1,Chris,1709,794,0.445,11.132,0.16,33,23
2,Darby,1234,592,0.406,13.422,0.199,33,22
3,Jen,1349,647,0.449,12.451,0.162,36,25
4,Katie,1479,691,0.448,11.669,0.17,37,27
5,Larry,1258,583,0.436,10.911,0.18,39,23
6,Rich,1395,644,0.453,11.003,0.157,37,23
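
Note that the `accuracy` column reported by `main_helper` is the mean of the per-game accuracies, which generally differs from total successes divided by total attempts. A minimal sketch on two made-up games (`toy_games` is illustrative data) shows the difference:

```python
import pandas as pd

toy_games = pd.DataFrame({'attempts': [10, 30],
                          'successes': [5, 24]})

# per-game accuracy: 5/10 = 0.5 and 24/30 = 0.8
per_game = toy_games['successes'] / toy_games['attempts']

# mean of the per-game accuracies (what main_helper reports): 0.65
mean_per_game = per_game.mean()

# pooled accuracy over all games: 29/40 = 0.725
pooled = toy_games['successes'].sum() / toy_games['attempts'].sum()

print(round(mean_per_game, 3), round(pooled, 3))
```

Games with more attempts carry more weight in the pooled figure, while the mean of per-game accuracies weights every game equally.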
