
# Pandas Functions for Data Analysis

1. **apply(), and groupby() with apply()**
2. **nlargest(), nsmallest(), using sum, and mean functions**
3. **boolean mask**
4. **complex analysis using all of the above**   

## The purpose of this notebook is to work through some pandas functions and concepts that are commonly used in data analysis, in a problem-solving format.

## The types of analyses that we cover here are ones that you could possibly be asked to recreate in some fashion, before the semester's end.

In [1]:
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%202/nba_stats.csv
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%202/worst_players.csv
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%202/best_players.csv
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%202/top_rebs.csv
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%202/top_mins.csv

--2024-03-03 18:42:20--  https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%202/nba_stats.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 543264 (531K) [text/plain]
Saving to: ‘nba_stats.csv’


2024-03-03 18:42:20 (59.1 MB/s) - ‘nba_stats.csv’ saved [543264/543264]

--2024-03-03 18:42:20--  https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%202/worst_players.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 219 [text/plain]
Saving to: ‘worst_players.csv’


2024-03-03 18

In [2]:
# some modules we will need
import pandas as pd
import numpy as np

We will be using some data from the National Basketball Association's (NBA) statistics API for this exercise. The data is from the 2017-2020 seasons and includes the major statistics for players.

We will import the data into a dataframe called nba_stats and take a quick look at the data.

In [3]:
# load the data file
# bring in the sample output file
nba_stats = pd.read_csv('nba_stats.csv')
# create df with only the columns we want to work with
nba_stats= nba_stats[['SEASON_ID','PLAYER_ID','PLAYER_NAME','GP','MIN','PTS','REB','PLUS_MINUS']]
nba_stats = nba_stats.rename(columns={"GP": "GAMES_PLAYED", "MIN": "MINUTES","PTS": "POINTS", "REB": "REBOUNDS"})

### Before we get started on the functions, let's take a quick look at some of the key data fields that we will be working with, and some fields whose meaning may not be easily discernible from the name.

- `PLAYER_ID` - The unique ID number for each player.
- `SEASON_ID` - The ID number for each season. The combination of PLAYER_ID and SEASON_ID gives us the primary key for the dataframe.
- `PLAYER_NAME` - The name of each player.

#### Note that there are 2,139 rows in the dataframe. That means we have 2,139 unique player-season combinations.

- `GAMES_PLAYED` through `PLUS_MINUS` columns- The individual statistics for the player for that season. Whenever we are working with one of the columns, we will define what that column means in the exercise.
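
A quick way to confirm that a pair of columns really behaves like a primary key is to check for duplicate key pairs. The sketch below uses a small made-up frame (`toy_stats` and its values are hypothetical, not taken from `nba_stats.csv`); the same two checks could be run against `nba_stats`.

```python
import pandas as pd

# Hypothetical toy data standing in for nba_stats (values are made up).
toy_stats = pd.DataFrame({
    "SEASON_ID": [22017, 22017, 22018, 22018],
    "PLAYER_ID": [1, 2, 1, 2],
    "POINTS": [100, 250, 120, 90],
})

key_cols = ["SEASON_ID", "PLAYER_ID"]

# duplicated() flags every row whose key pair repeats an earlier row's key pair.
has_duplicate_keys = toy_stats.duplicated(subset=key_cols).any()

# Equivalent check: the number of distinct key pairs equals the row count.
n_unique_keys = len(toy_stats.drop_duplicates(subset=key_cols))

print("any duplicate keys?", has_duplicate_keys)
print("keys unique?", n_unique_keys == len(toy_stats))
```

If either check fails on real data, the two columns do not identify a row on their own and a later groupby or merge may return more rows than expected.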

#### The info() and describe() functions are good to use when first looking at a dataframe.

info() gives us column information, and describe() gives us some statistical measurements of the dataframe.

In [4]:
nba_stats.info()
nba_stats.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2139 entries, 0 to 2138
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   SEASON_ID     2139 non-null   int64  
 1   PLAYER_ID     2139 non-null   int64  
 2   PLAYER_NAME   2139 non-null   object 
 3   GAMES_PLAYED  2139 non-null   int64  
 4   MINUTES       2139 non-null   float64
 5   POINTS        2139 non-null   int64  
 6   REBOUNDS      2139 non-null   int64  
 7   PLUS_MINUS    2139 non-null   int64  
dtypes: float64(1), int64(6), object(1)
memory usage: 133.8+ KB


Unnamed: 0,SEASON_ID,PLAYER_ID,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
count,2139.0,2139.0,2139.0,2139.0,2139.0,2139.0,2139.0
mean,22018.499766,974788.4,45.654511,1038.742403,474.080411,191.116877,0.0
std,1.122678,720006.6,24.546739,785.248157,451.590027,180.995174,154.498564
min,22017.0,1713.0,1.0,0.516667,0.0,0.0,-672.0
25%,22017.0,203076.5,24.0,282.795,98.5,48.0,-69.0
50%,22018.0,1626179.0,51.0,977.45,360.0,150.0,-8.0
75%,22020.0,1628470.0,66.0,1684.266667,722.0,276.0,45.5
max,22020.0,1630466.0,82.0,3027.651667,2818.0,1247.0,728.0


## The apply() function

#### `apply()` is used to apply a function to a data frame or to a series (column of the data frame).

The basic way to use the function is:

out = `dataframe`.apply(`func`)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

### Use the apply() function on a single column of the dataframe

Pass a built-in function to apply().

What is the average number of games that a player played in during any season?

In [5]:
# note the syntax of using the DOUBLE BRACKETS around the column name.
mean_value = nba_stats[['GAMES_PLAYED']].apply(np.mean)
print(mean_value)

nba_stats[['GAMES_PLAYED']].apply(np.mean)

GAMES_PLAYED    45.654511
dtype: float64


GAMES_PLAYED    45.654511
dtype: float64

#### We can also use the apply function on multiple columns or the entire dataframe, but to do so, all of the dataframe columns must be able to be operated on by the function we are applying.

#### With this data, we can apply to multiple columns that are INT and FLOAT, but not to the entire dataframe, because we also have OBJECT data types.

What is the average number of games, points scored, and rebounds for the typical player in a season?

In [6]:
nba_stats[['GAMES_PLAYED','POINTS','REBOUNDS']].apply(np.mean)

# returns value error of "could not convert string to float"
# nba_stats.apply(np.mean)

GAMES_PLAYED     45.654511
POINTS          474.080411
REBOUNDS        191.116877
dtype: float64

As you can see, the function returns a value for each column.

In other words, by default `apply()` takes one whole column of the dataframe at a time and passes that column to the function.

We can change this default setting by specifying the `axis` parameter, in which axis=0 (the default) applies by column and axis=1 applies by row. We will not demonstrate row-based apply with this dataset.
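
Since row-based apply is not demonstrated on the NBA data, here is a minimal sketch of the two `axis` settings on a made-up frame (`toy` and its values are hypothetical). With `axis=0` the function receives one column at a time; with `axis=1` it receives one row at a time.

```python
import pandas as pd

# Hypothetical toy data (values are made up).
toy = pd.DataFrame({"POINTS": [10, 20, 30], "REBOUNDS": [4, 6, 8]})

# axis=0 (the default): the lambda is called once per column.
col_means = toy.apply(lambda col: col.mean(), axis=0)

# axis=1: the lambda is called once per row.
row_totals = toy.apply(lambda row: row.sum(), axis=1)

print(col_means)
print(row_totals)
```

For simple reductions like these, calling `toy.mean()` or `toy.sum(axis=1)` directly gives the same numbers and is usually faster; `apply()` is most useful when the function has no built-in equivalent.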

### Remember the groupby() function from the last notebook.

A `groupby()` operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

The basic way to use the function is:

out = `dataframe`.groupby(by=columnname).`function`()

For example:

df.groupby(by=["b"]).sum()

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

On this dataset, an example might be:

In [7]:
nba_stats.groupby(by=["SEASON_ID"]).mean()



Unnamed: 0_level_0,PLAYER_ID,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
SEASON_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
22017,762753.0,48.346296,1099.722222,484.407407,198.238889,0.0
22018,913443.1,49.24717,1121.603774,516.175472,209.635849,0.0
22019,1070723.0,42.330813,967.996219,447.614367,179.502836,0.0
22020,1153053.0,42.692593,965.740741,448.364815,177.196296,0.0


#### On its own, `groupby()` groups the whole dataframe by the column(s) given in the "by" parameter and applies the function to every `APPLICABLE` column, meaning every column the function can operate on.

#### Note that the `groupby()` above does not include the OBJECT column of PLAYER_NAME.

#### Also note that the column `PLAYER_ID` is included, because it is data type `int64`.
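
Whether non-numeric columns are silently dropped depends on the pandas version. To my knowledge, pandas 2.0 and later raise an error for `mean()` on an object column instead of dropping it, so it is safer to be explicit. The sketch below shows two ways to do that on a made-up frame (`toy` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical toy data (values are made up).
toy = pd.DataFrame({
    "SEASON_ID": [22017, 22017, 22018],
    "PLAYER_NAME": ["A", "B", "A"],
    "POINTS": [10, 20, 30],
})

# Option 1: ask pandas to skip non-numeric columns explicitly.
means_numeric = toy.groupby("SEASON_ID").mean(numeric_only=True)

# Option 2: select the columns to aggregate before calling the function.
means_selected = toy.groupby("SEASON_ID")[["POINTS"]].mean()

print(means_numeric)
print(means_selected)
```

Both options return the same frame here. Selecting columns first (option 2) also keeps ID-like integer columns such as `PLAYER_ID` out of the result.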

However, we very seldom want to group an entire df in our analyses. Instead we want to generally:

1. Return statistical analyses for individual or multiple columns
2. Grouped by multiple dimensions (columns).

### So how do we do that? By using `groupby()` and `apply()` together.

**The syntax for a single column looks like:**

`dataframe.groupby('columnname').apply(function)`

**The syntax for multiple columns looks like:**

`dataframe.groupby(['columnname1','columnname2']).apply(function)`

Remember that using `axis=0` (the default) will apply the given function to each *column* and `axis=1` will apply the given function to each *row*.

While `Series.apply` works on individual values and `DataFrame.apply` works on `Series` objects (rows or columns are instances of `Series`), `groupby.apply` works on `DataFrame` objects. The cell below applies the `np.sum` function to each `DataFrame` or "group" in the `groupby`!

In [8]:
nba_stats.groupby('SEASON_ID').apply(np.sum, axis=0)

Unnamed: 0_level_0,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
SEASON_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
22017,11889180,411886621,Aaron BrooksAaron GordonAaron HarrisonAaron Ja...,26107,593850.000002,261580,107049,0
22018,11669540,484124825,Aaron GordonAaron HolidayAbdel NaderAl Horford...,26101,594449.999993,273573,111107,0
22019,11648051,566412615,Aaron GordonAaron HolidayAbdel NaderAdam Mokok...,22393,512070.000004,236788,94957,0
22020,11890800,622648405,Aaron GordonAaron HolidayAaron NesmithAbdel Na...,23054,521499.999998,242117,95686,0
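
To see that `groupby.apply` hands each group to the function as a `DataFrame`, a custom function can inspect the group it receives. This is a minimal sketch on made-up data; `toy`, `describe_group`, and the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data (values are made up).
toy = pd.DataFrame({
    "SEASON_ID": [22017, 22017, 22018],
    "PLAYER_NAME": ["A", "B", "A"],
    "POINTS": [10, 20, 30],
})

def describe_group(group):
    # `group` is a DataFrame holding only the rows for one SEASON_ID.
    return pd.Series({
        "n_rows": len(group),
        "total_points": group["POINTS"].sum(),
    })

# Selecting columns before apply() controls exactly what each group contains.
summary = toy.groupby("SEASON_ID")[["PLAYER_NAME", "POINTS"]].apply(describe_group)

print(summary)
```

Because the function returns a `Series` with the same labels for every group, pandas stacks the results into one row per season.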


In [9]:
nba_stats.groupby('SEASON_ID').apply(np.mean, axis=0)

  return mean(axis=axis, dtype=dtype, out=out, **kwargs)


Unnamed: 0_level_0,SEASON_ID,PLAYER_ID,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
SEASON_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
22017,22017.0,762753.0,48.346296,1099.722222,484.407407,198.238889,0.0
22018,22018.0,913443.1,49.24717,1121.603774,516.175472,209.635849,0.0
22019,22019.0,1070723.0,42.330813,967.996219,447.614367,179.502836,0.0
22020,22020.0,1153053.0,42.692593,965.740741,448.364815,177.196296,0.0


#### Note the difference in behavior between apply() alone, groupby() alone, and groupby.apply() together. This is important for students to understand!!

1. `apply()` by itself gives us the function result for the columns/rows **IT IS ABLE TO OPERATE ON**.

    As we saw above, if we try to perform a function on an incompatible column/row, it will return an error.


2. `groupby()` by itself ALSO gives us the function result for the columns/rows **IT IS ABLE TO OPERATE ON**.

    However, it will **SIMPLY NOT INCLUDE** in the result set the columns that the function cannot operate on.


3. `groupby()` `apply()` together will return the function result for the columns/rows **IT IS ABLE TO OPERATE ON**, similar to `groupby()` alone.
    
    However, the difference is with the PLAYER_NAME column: the sum() function "added" the player names by concatenating the strings together (as "+" does for strings).
    
    Using np.mean() dropped the PLAYER_NAME column from the result and gave us the "nuisance columns" warning message.

## So what is the problem with the above approach? And how can we fix it?

### The problem is that, depending on the function we are using (sum vs. mean, for example), some columns may be included in (or excluded from) the returned data frame. So we may get results that we are not expecting, and the test cases will fail as a result.

### So how do we fix that and prevent it from happening?

`1. Create a new dataframe by keeping only the columns necessary for that particular analysis.`

`2. (Optional) Set your columns to groupby as indices on the new dataframe. (This ensures that you are not grouping extraneous columns, for functions such as sum())`

`3. Perform the required groupby/apply/function on the new dataframe.`

`4. (Optional) Set the index columns to be regular columns.`


#### Below are additional steps that the (exam/homework) exercise may require.


`5. Merge the returned dataframe with the other dataframe(s) required by the analysis.`

`6. Drop the extraneous columns in the new/merged dataframe.`

`7. Rename the remaining columns, per the exercise requirements.`
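
The steps above can be sketched end to end on made-up data. Everything here is hypothetical (the `stats` and `teams` frames, their columns, and the output names); the optional step 2 is skipped, and `as_index=False` is used so that the grouping column stays a regular column (step 4).

```python
import pandas as pd

# Hypothetical toy frames (names and values are made up).
stats = pd.DataFrame({
    "PLAYER_ID": [1, 1, 2],
    "PLAYER_NAME": ["A", "A", "B"],
    "POINTS": [10, 30, 5],
    "REBOUNDS": [4, 6, 1],
})
teams = pd.DataFrame({"PLAYER_ID": [1, 2], "TEAM": ["X", "Y"], "CITY": ["P", "Q"]})

# 1. keep only the columns needed for this analysis
work = stats[["PLAYER_ID", "POINTS"]]

# 3./4. group and sum; as_index=False keeps PLAYER_ID as a regular column
totals = work.groupby("PLAYER_ID", as_index=False).sum()

# 5. merge with the other dataframe required by the analysis
merged = totals.merge(teams, on="PLAYER_ID", how="left")

# 6. drop the extraneous columns
merged = merged.drop(columns=["CITY"])

# 7. rename the remaining columns
result = merged.rename(
    columns={"PLAYER_ID": "player_id", "POINTS": "total_points", "TEAM": "team"}
)

print(result)
```

Because only the needed columns enter the groupby, the result has the same shape whether the function is a sum, a mean, or a count.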

### So how might we want to use this in a real (or testing) scenario?

Return a dataframe that summarizes the total games played, points, and rebounds for each player, over the 4 seasons.

Use your dataframe, nba_stats, as the starting point.

#### What is our strategy for solving this problem?

1. Create a dataframe with only the columns required.

2. Set the grouping columns to be indexes (optional).

3. Perform groupby and apply for the sum.

4. Set the indexes back to be regular columns.

5. Perform additional steps as required by the exercise requirements.

In [10]:
# create new dataframe with only the required columns
nba_stats_test = nba_stats[['PLAYER_NAME','GAMES_PLAYED','POINTS','REBOUNDS']]
print(' New dataframe')
print(nba_stats_test.head(5))

# (optional) set the grouping columns to be indexes
nba_stats_test = nba_stats_test.set_index(['PLAYER_NAME'])
print('\n Column as index')
print(nba_stats_test.head(5))

# perform the groupby.apply
nba_stats_test2 = nba_stats_test.groupby('PLAYER_NAME').apply(np.sum, axis=0)
print('\n Grouping')
print(nba_stats_test2.head(5))

# set the index columns back to be regular columns
nba_stats_test2.reset_index(inplace=True)
print('\n Set index back to column')
print(nba_stats_test2)

# perform whatever other steps the analysis requires

 New dataframe
     PLAYER_NAME  GAMES_PLAYED  POINTS  REBOUNDS
0   Aaron Gordon            50     618       284
1  Aaron Holiday            66     475        89
2  Aaron Nesmith            46     218       127
3    Abdel Nader            24     160        62
4    Adam Mokoka            14      15         5

 Column as index
               GAMES_PLAYED  POINTS  REBOUNDS
PLAYER_NAME                                  
Aaron Gordon             50     618       284
Aaron Holiday            66     475        89
Aaron Nesmith            46     218       127
Abdel Nader              24     160        62
Adam Mokoka              14      15         5

 Grouping
                GAMES_PLAYED  POINTS  REBOUNDS
PLAYER_NAME                                   
Aaron Brooks              32      75        17
Aaron Gordon             248    3780      1790
Aaron Harrison             9      60        24
Aaron Holiday            182    1396       312
Aaron Jackson              1       8         3

 Set index

#### So what if we did not set the column to be an index? What would happen then? Compare the results below with those above.

In [11]:
# create new dataframe with only the required columns
nba_stats_test = nba_stats[['PLAYER_NAME','GAMES_PLAYED','POINTS','REBOUNDS']]
print(' New dataframe')
print(nba_stats_test.head(5))

# (optional) set the grouping columns to be indexes
# commented out here
# nba_stats_test = nba_stats_test.set_index(['PLAYER_NAME'])
# print(nba_stats_test.head(5))

# perform the groupby.apply
nba_stats_test2 = nba_stats_test.groupby('PLAYER_NAME').apply(np.sum, axis=0)
print('\n Grouping')
print(nba_stats_test2.head(5))

# set the index columns back to be regular columns
# commented out here
# nba_stats_test2.reset_index(inplace=True)
# nba_stats_test2

# perform whatever other steps the analysis requires

 New dataframe
     PLAYER_NAME  GAMES_PLAYED  POINTS  REBOUNDS
0   Aaron Gordon            50     618       284
1  Aaron Holiday            66     475        89
2  Aaron Nesmith            46     218       127
3    Abdel Nader            24     160        62
4    Adam Mokoka            14      15         5

 Grouping
                                                     PLAYER_NAME  \
PLAYER_NAME                                                        
Aaron Brooks                                        Aaron Brooks   
Aaron Gordon    Aaron GordonAaron GordonAaron GordonAaron Gordon   
Aaron Harrison                                    Aaron Harrison   
Aaron Holiday            Aaron HolidayAaron HolidayAaron Holiday   
Aaron Jackson                                      Aaron Jackson   

                GAMES_PLAYED  POINTS  REBOUNDS  
PLAYER_NAME                                     
Aaron Brooks              32      75        17  
Aaron Gordon             248    3780      1790  
Aaron H

### Some good references

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.SeriesGroupBy.apply.html

https://datagy.io/pandas-groupby/

https://www.geeksforgeeks.org/grouping-and-aggregating-with-pandas/

https://datagy.io/pandas-exploratory-data-analysis/

https://stackabuse.com/efficient-data-manipulation-with-apply-function-in-pandas/

### Any questions up to this point?

### Now let's use some functions to do a more complex analysis, by player and season. What we are about to go over would be typical for a multi-point exercise on an exam.

**We will create a new dataframe, `nba_stats_3`, for this, using the previous dataframe, `nba_stats`. This dataframe will have `PLAYER_NAME` and `SEASON_ID` as the grouping columns, and we will take the optional step to set them as the indices.**

In [12]:
nba_stats_3 = nba_stats.set_index(['SEASON_ID','PLAYER_NAME'])
print('\n Column as index')
print(nba_stats_3.head(5))


 Column as index
                         PLAYER_ID  GAMES_PLAYED      MINUTES  POINTS  \
SEASON_ID PLAYER_NAME                                                   
22020     Aaron Gordon      203932            50  1383.780000     618   
          Aaron Holiday    1628988            66  1176.086667     475   
          Aaron Nesmith    1630174            46   668.731667     218   
          Abdel Nader      1627846            24   355.250000     160   
          Adam Mokoka      1629690            14    56.178333      15   

                         REBOUNDS  PLUS_MINUS  
SEASON_ID PLAYER_NAME                          
22020     Aaron Gordon        284          60  
          Aaron Holiday        89           3  
          Aaron Nesmith       127          -7  
          Abdel Nader          62          28  
          Adam Mokoka           5          -8  


**Requirement**:  

Return a dataframe, top_rebs, containing the player name and season for the 5 highest single-season rebound totals across the 4 seasons.
    
    Include the top 5 plus ties. In other words, if there are ties, keep all of the results, even if it results in more than 5 rows being returned.
    
    The dataframe should be sorted from most to least, with ties broken by name in alphabetical order.

Use the nba_stats_3 dataframe as the input for this.

The output dataframe should have the following columns:  `player`, `season`, `total_rebounds`.

## Pandas Functions `nlargest()` and `nsmallest()`

To meet this requirement, we will want to use the pandas function:  `nlargest()`. The function `nsmallest()` operates in the same manner, and you would want to use this if the exercise requirement was for the least/lowest number of rows.

We have explicitly stated what function to use here, but on an exam, you might see something like "the pandas function nlargest might be useful for this exercise".

The requirement on ties is satisfied by the parameter "keep", and the value of "all".

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nsmallest.html
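
The effect of the `keep` parameter is easiest to see on a tiny frame with a tie at the cutoff. This is a minimal sketch on made-up data (`toy` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical toy data with a tie for second place (values are made up).
toy = pd.DataFrame({"player": ["A", "B", "C", "D"], "REBOUNDS": [50, 30, 30, 10]})

# Default keep="first": exactly 2 rows, the tie is cut off after the first match.
top2_first = toy.nlargest(2, "REBOUNDS")

# keep="all": every row tied with the last kept value is also returned.
top2_all = toy.nlargest(2, "REBOUNDS", keep="all")

# nsmallest() works the same way from the other end.
bottom1 = toy.nsmallest(1, "REBOUNDS")

print(top2_first)
print(top2_all)
print(bottom1)
```

With `keep="all"` the result can contain more rows than requested, which is exactly what a "top N plus ties" requirement asks for.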

#### What is our strategy for solving this problem?

1. Create a dataframe with only the columns required.

2. Return/keep only the top 5 in the dataframe.

3. Set the indexes to be columns.

4. Rename the columns.

5. Sort the dataframe.

In [13]:
# create the dataframe with only the columns required
top_rebs = nba_stats_3[['REBOUNDS']]

# now return only the top 5, keeping any rows tied with the 5th value
top_rebs = top_rebs.nlargest(5, 'REBOUNDS', keep="all")

# set the indexes to be columns
top_rebs.reset_index(inplace=True)

# rename the columns
top_rebs.rename(columns={"PLAYER_NAME": "player", "SEASON_ID": "season", "REBOUNDS": "total_rebounds"}, inplace=True)

# put the columns in the order the requirement asks for
top_rebs = top_rebs[['player', 'season', 'total_rebounds']]

# sort the dataframe, then renumber the index from 0
top_rebs = top_rebs.sort_values(['total_rebounds', 'player'], ascending=[False, True]).reset_index(drop=True)

top_rebs

Unnamed: 0,player,season,total_rebounds
0,Andre Drummond,22017,1247
1,Andre Drummond,22018,1232
2,DeAndre Jordan,22017,1171
3,Rudy Gobert,22018,1041
4,Dwight Howard,22017,1012
5,Karl-Anthony Towns,22017,1012


Your solution should match the dataframe below.

In [14]:
top_rebs_soln = pd.read_csv('top_rebs.csv')
top_rebs_soln

Unnamed: 0,player,season,total_rebounds
0,Andre Drummond,22017,1247
1,Andre Drummond,22018,1232
2,DeAndre Jordan,22017,1171
3,Rudy Gobert,22018,1041
4,Dwight Howard,22017,1012
5,Karl-Anthony Towns,22017,1012


### What are your questions on this exercise?

**Requirement**:  

Return a dataframe, top_mins, containing the players with the top 10 average minutes per season, over the 4 seasons together.

    This means that we want to add up the total number of minutes the player has played and divide by the number of seasons the player played in, to get the average. Round the average to 1 decimal place, after sorting.

Include the top 10 plus ties. The final dataframe indexes should be from 0 to 9 (or higher, if there are ties).

The dataframe should be sorted from most to least, with ties broken by name in reverse alphabetical order. Minutes are in the column MINUTES (renamed from MIN when the data was loaded).

Use the nba_stats_3 dataframe as the input for this.

The output dataframe should have the following columns:  `player`, `seasons_played`, `avg_minutes`.

#### What is our strategy for solving this problem?

0. Remember that the dataframe has SEASON_ID and PLAYER_NAME as indexes (see above).

1. Create two working dataframes, one to summarize minutes, the other to count seasons played.

2. With the average minutes dataframe:

    Compute the average minutes
    
    Rename the MINUTES column
    
    Set the indexes to be columns
    

3. With the seasons played dataframe:

    Compute the number of seasons each player played in
    
    Rename the MINUTES column
    
    Set the indexes to be columns
    
    
4. Merge the two dataframes

5. Rename the columns and keep only those columns required (two steps).

6. Keep only the top 10 (plus ties).

7. Drop the index column (created from the previous step)

8. Sort the dataframe.

9. Round the average minutes to one decimal place.

In [15]:
# create the mins_df (working) dataframe with only the columns required
mins_df = nba_stats_3[['MINUTES']]

# compute the average minutes for each player
top_mins = mins_df.groupby("PLAYER_NAME").mean()
# rename the MINUTES column
top_mins.rename(columns={"MINUTES": "avg_minutes"}, inplace=True)
# set the index to be a column
top_mins.reset_index(inplace=True)

# compute how many seasons each played
num_seasons = mins_df.groupby(by=["PLAYER_NAME"]).count()
# rename the MINUTES column
num_seasons.rename(columns={"MINUTES": "seasons_played"}, inplace=True)
# set the indexes to be columns
num_seasons = num_seasons.reset_index()

# merge the two dataframes
top_mins = top_mins.merge(num_seasons, how='inner')

# rename the PLAYER_NAME column
top_mins.rename(columns={"PLAYER_NAME": "player"}, inplace=True)
# only keep the required columns
top_mins = top_mins[['player','seasons_played','avg_minutes']]

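# now only keep the 10 highest, plus ties
top_mins = top_mins.nlargest(10, 'avg_minutes', keep="all").reset_index()
# drop the old index column that reset_index() added
del top_mins['index']

# sort the dataframe, then renumber the index so it still runs from 0
# upward if the tie-break reorders any rows
top_mins.sort_values(['avg_minutes', 'player'], ascending=[False, False], inplace=True)
top_mins.reset_index(drop=True, inplace=True)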

# round the avg_minutes column to one decimal place
top_mins = top_mins.round({'avg_minutes': 1})

top_mins

Unnamed: 0,player,seasons_played,avg_minutes
0,Damian Lillard,4,2594.8
1,Klay Thompson,2,2578.3
2,Bradley Beal,4,2551.1
3,Tobias Harris,4,2499.8
4,Russell Westbrook,4,2490.3
5,DeMar DeRozan,4,2443.2
6,Nikola Jokic,4,2442.3
7,Harrison Barnes,4,2437.8
8,Andrew Wiggins,4,2436.1
9,James Harden,4,2377.2


Your solution should match the dataframe below.

In [16]:
top_mins_soln = pd.read_csv('top_mins.csv')
top_mins_soln

Unnamed: 0,player,seasons_played,avg_minutes
0,Damian Lillard,4,2594.8
1,Klay Thompson,2,2578.3
2,Bradley Beal,4,2551.1
3,Tobias Harris,4,2499.8
4,Russell Westbrook,4,2490.3
5,DeMar DeRozan,4,2443.2
6,Nikola Jokic,4,2442.3
7,Harrison Barnes,4,2437.8
8,Andrew Wiggins,4,2436.1
9,James Harden,4,2377.2
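
As an alternative to building two working dataframes and merging them, the mean and the count can be computed in a single groupby using named aggregation. This is a sketch on made-up data, not the notebook's reference solution; `toy`, `summary`, and `top2` are hypothetical names, and the toy frame has `PLAYER_NAME` as a regular column rather than an index.

```python
import pandas as pd

# Hypothetical toy data: one row per player-season (values are made up).
toy = pd.DataFrame({
    "PLAYER_NAME": ["A", "A", "B", "C", "C"],
    "MINUTES": [2000.0, 2400.0, 2500.0, 100.0, 300.0],
})

# One groupby computes both statistics; each keyword names an output column.
summary = (
    toy.groupby("PLAYER_NAME")
       .agg(seasons_played=("MINUTES", "count"), avg_minutes=("MINUTES", "mean"))
       .reset_index()
       .rename(columns={"PLAYER_NAME": "player"})
)

# Keep the top 2 plus ties, sort, renumber the index, then round.
top2 = (
    summary.nlargest(2, "avg_minutes", keep="all")
           .sort_values(["avg_minutes", "player"], ascending=[False, False])
           .reset_index(drop=True)
           .round({"avg_minutes": 1})
)

print(top2)
```

The merge-based version in the notebook makes each intermediate step visible, which helps when debugging; the single-groupby version has fewer intermediate frames to keep in sync.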


### What are your questions on this exercise?

### Now let's look at Boolean Masks.

#### What is a boolean mask?

While boolean masks are typically used with numpy arrays, they can also be applied to pandas dataframes.

We will introduce the concept here and later cover how they are used with numpy arrays, which is a bit different from how they are used with pandas.

Conceptually, they are similar in pandas and numpy, but how each implements them is different.

**In pandas, a mask is used to filter and return only the rows that meet a certain condition.**

With pandas, we can build a mask with one of the comparison operators (`<, >, >=, <=, ==`), the `isin()` function, or the `str.contains()` function for strings.

VanderPlas has an EXCELLENT introduction to masks in his book, focused on numpy. Chapter linked to here:  https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html

In [17]:
# mask to filter by comparison
minutes_mask = nba_stats['MINUTES'] >= 2000
minutes_mask

0       False
1       False
2       False
3       False
4       False
        ...  
2134    False
2135    False
2136    False
2137    False
2138    False
Name: MINUTES, Length: 2139, dtype: bool

In [18]:
# filter the dataframe using the mask
high_minutes = nba_stats[minutes_mask]
high_minutes

Unnamed: 0,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
20,22020,203952,Andrew Wiggins,71,2364.270000,1320,347,0
23,22020,1630162,Anthony Edwards,72,2314.166667,1392,336,-228
34,22020,1628389,Bam Adebayo,64,2142.616667,1197,573,24
42,22020,202711,Bojan Bogdanovic,72,2215.565000,1225,281,419
45,22020,203078,Bradley Beal,60,2146.998333,1878,283,-3
...,...,...,...,...,...,...,...,...
2124,22017,202083,Wesley Matthews,63,2131.433333,802,198,-324
2125,22017,203115,Will Barton,81,2682.901667,1268,409,123
2126,22017,1626161,Willie Cauley-Stein,73,2043.823333,932,510,-323
2129,22017,201163,Wilson Chandler,74,2346.253333,738,398,56


In [19]:
# filter the dataframe directly, without storing the mask
# in a separate variable first
high_minutes_2 = nba_stats[nba_stats['MINUTES'] >= 2000]
high_minutes_2

Unnamed: 0,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
20,22020,203952,Andrew Wiggins,71,2364.270000,1320,347,0
23,22020,1630162,Anthony Edwards,72,2314.166667,1392,336,-228
34,22020,1628389,Bam Adebayo,64,2142.616667,1197,573,24
42,22020,202711,Bojan Bogdanovic,72,2215.565000,1225,281,419
45,22020,203078,Bradley Beal,60,2146.998333,1878,283,-3
...,...,...,...,...,...,...,...,...
2124,22017,202083,Wesley Matthews,63,2131.433333,802,198,-324
2125,22017,203115,Will Barton,81,2682.901667,1268,409,123
2126,22017,1626161,Willie Cauley-Stein,73,2043.823333,932,510,-323
2129,22017,201163,Wilson Chandler,74,2346.253333,738,398,56


Note that the filtered dataframe keeps the index labels of the original dataframe. To renumber the rows from 0, use reset_index(). By default this keeps the old labels in a new column named `index`, as the output below shows.

In [20]:
# filter with the mask, then reset the index
high_minutes_idx = nba_stats[minutes_mask].reset_index()
high_minutes_idx

Unnamed: 0,index,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
0,20,22020,203952,Andrew Wiggins,71,2364.270000,1320,347,0
1,23,22020,1630162,Anthony Edwards,72,2314.166667,1392,336,-228
2,34,22020,1628389,Bam Adebayo,64,2142.616667,1197,573,24
3,42,22020,202711,Bojan Bogdanovic,72,2215.565000,1225,281,419
4,45,22020,203078,Bradley Beal,60,2146.998333,1878,283,-3
...,...,...,...,...,...,...,...,...,...
300,2124,22017,202083,Wesley Matthews,63,2131.433333,802,198,-324
301,2125,22017,203115,Will Barton,81,2682.901667,1268,409,123
302,2126,22017,1626161,Willie Cauley-Stein,73,2043.823333,932,510,-323
303,2129,22017,201163,Wilson Chandler,74,2346.253333,738,398,56
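
The extra `index` column in the output above holds the old row labels. If those labels are not needed, `reset_index(drop=True)` discards them instead. A minimal sketch on made-up data (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data (values are made up).
toy = pd.DataFrame({"player": ["A", "B", "C"], "MINUTES": [2500, 900, 2100]})

# The filtered rows keep their original labels: 0 and 2.
filtered = toy[toy["MINUTES"] >= 2000]

# Default: the old labels are moved into a new column named "index".
with_old_index = filtered.reset_index()

# drop=True: the old labels are thrown away, no extra column is added.
clean = filtered.reset_index(drop=True)

print(with_old_index)
print(clean)
```

Keeping the old labels can be handy for tracing rows back to the source frame; dropping them avoids an extra column when the result has to match a required set of columns.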


In [21]:
# mask using isin()
season_2017_mask = nba_stats['SEASON_ID'].isin([22017])
season_2017_mask

0       False
1       False
2       False
3       False
4       False
        ...  
2134     True
2135     True
2136     True
2137     True
2138     True
Name: SEASON_ID, Length: 2139, dtype: bool

In [22]:
# mask using isin(), with reset_index()
season_2017 = nba_stats[season_2017_mask].reset_index()
season_2017

Unnamed: 0,index,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
0,1599,22017,201166,Aaron Brooks,32,189.413333,75,17,-75
1,1600,22017,203932,Aaron Gordon,58,1909.078333,1022,457,-92
2,1601,22017,1626151,Aaron Harrison,9,233.251667,60,24,-72
3,1602,22017,1628935,Aaron Jackson,1,34.500000,8,3,-10
4,1603,22017,1627846,Abdel Nader,48,521.526667,146,71,-109
...,...,...,...,...,...,...,...,...,...
535,2134,22017,1628380,Zach Collins,66,1045.450000,292,221,16
536,2135,22017,203897,Zach LaVine,24,656.286667,401,94,-172
537,2136,22017,2216,Zach Randolph,59,1507.611667,857,397,-353
538,2137,22017,2585,Zaza Pachulia,69,971.746667,373,321,196


In [23]:
# mask using isin(), with reset_index()
season_2017_2 = nba_stats[nba_stats['SEASON_ID'].isin([22017])].reset_index()
season_2017_2

Unnamed: 0,index,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
0,1599,22017,201166,Aaron Brooks,32,189.413333,75,17,-75
1,1600,22017,203932,Aaron Gordon,58,1909.078333,1022,457,-92
2,1601,22017,1626151,Aaron Harrison,9,233.251667,60,24,-72
3,1602,22017,1628935,Aaron Jackson,1,34.500000,8,3,-10
4,1603,22017,1627846,Abdel Nader,48,521.526667,146,71,-109
...,...,...,...,...,...,...,...,...,...
535,2134,22017,1628380,Zach Collins,66,1045.450000,292,221,16
536,2135,22017,203897,Zach LaVine,24,656.286667,401,94,-172
537,2136,22017,2216,Zach Randolph,59,1507.611667,857,397,-353
538,2137,22017,2585,Zaza Pachulia,69,971.746667,373,321,196


#### Now let's do a multiple comparison mask.

`Return the player-season rows with 2000 or more minutes in either the 2017 or the 2018 season.`

In [24]:
# mask to filter by multiple comparison
multiple_mask = (nba_stats['MINUTES'] >= 2000) & (nba_stats['SEASON_ID'].isin([22017,22018]))
multiple_mask

0       False
1       False
2       False
3       False
4       False
        ...  
2134    False
2135    False
2136    False
2137    False
2138    False
Length: 2139, dtype: bool

In [25]:
# return the dataframe
high_minutes_idx = nba_stats[multiple_mask].reset_index()
high_minutes_idx

Unnamed: 0,index,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
0,1069,22018,203932,Aaron Gordon,78,2632.533333,1246,574,107
1,1073,22018,202329,Al-Farouq Aminu,81,2291.698333,760,610,384
2,1086,22018,203083,Andre Drummond,79,2646.890000,1370,1232,176
3,1091,22018,203952,Andrew Wiggins,73,2542.713333,1321,352,-66
4,1099,22018,203085,Austin Rivers,76,2027.726667,618,162,104
...,...,...,...,...,...,...,...,...,...
199,2124,22017,202083,Wesley Matthews,63,2131.433333,802,198,-324
200,2125,22017,203115,Will Barton,81,2682.901667,1268,409,123
201,2126,22017,1626161,Willie Cauley-Stein,73,2043.823333,932,510,-323
202,2129,22017,201163,Wilson Chandler,74,2346.253333,738,398,56

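Masks can be combined with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses, because these operators bind more tightly than `>=` or `==`. A minimal sketch on made-up data (the values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; column names mirror nba_stats, values are made up.
toy = pd.DataFrame({
    "SEASON_ID": [22017, 22017, 22018, 22019],
    "MINUTES": [2500.0, 1200.0, 2100.0, 2600.0],
})

in_seasons = toy["SEASON_ID"].isin([22017, 22018])
high_minutes = toy["MINUTES"] >= 2000

# AND: in one of the seasons and at least 2000 minutes
both = toy[in_seasons & high_minutes]
# OR: in one of the seasons, or at least 2000 minutes, or both
either = toy[in_seasons | high_minutes]
# NOT combined with AND: outside those seasons, but at least 2000 minutes
outside = toy[~in_seasons & high_minutes]

print(both.index.tolist())
print(either.index.tolist())
print(outside.index.tolist())
```

Naming the individual masks first, as done here, tends to be easier to read and debug than one long expression.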

#### Let's do a string comparison mask.

`Return all of the season stats for players named Anthony, in either their first or last names (or both).`

In [26]:
name_anthony_mask = nba_stats['PLAYER_NAME'].str.contains('Anthony')
name_anthony_mask

0       False
1       False
2       False
3       False
4       False
        ...  
2134    False
2135    False
2136    False
2137    False
2138    False
Name: PLAYER_NAME, Length: 2139, dtype: bool

In [27]:
# mask to filter by string comparison
# return all of the players named Anthony
name_anthony = nba_stats[name_anthony_mask].reset_index()
name_anthony

Unnamed: 0,index,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
0,22,22020,203076,Anthony Davis,36,1161.735,786,286,115
1,23,22020,1630162,Anthony Edwards,72,2314.166667,1392,336,-228
2,24,22020,1630264,Anthony Gill,26,218.015,80,51,-17
3,25,22020,1630237,Anthony Lamb,24,415.21,133,70,-89
4,26,22020,201229,Anthony Tolliver,11,98.973333,17,10,30
5,66,22020,2546,Carmelo Anthony,69,1690.143333,924,214,-24
6,85,22020,1630175,Cole Anthony,47,1272.843333,605,221,-322
7,113,22020,1629001,De'Anthony Melton,52,1045.213333,472,161,119
8,292,22020,1626157,Karl-Anthony Towns,50,1688.918333,1239,529,11
9,563,22019,203076,Anthony Davis,62,2131.193333,1618,577,240

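`str.contains()` matches the text anywhere in the string, which is why "De'Anthony Melton" and "Karl-Anthony Towns" appear above. The sketch below, on a made-up Series (not taken from `nba_stats`), shows a few options that are often useful: `na=False` to treat missing names as non-matches, `case=False` to ignore case, and `str.startswith()` to match only at the beginning.

```python
import pandas as pd

# Hypothetical names, including a lowercase entry and a missing value.
names = pd.Series(
    ["Anthony Davis", "Carmelo Anthony", "De'Anthony Melton", "anthony lamb", None],
    dtype=object,
)

# Substring match; na=False turns the missing value into False instead of NaN.
contains = names.str.contains("Anthony", na=False)

# Case-insensitive substring match.
contains_any_case = names.str.contains("Anthony", case=False, na=False)

# Match only names that begin with "Anthony".
starts_with = names.str.startswith("Anthony", na=False)

print(contains.tolist())
print(contains_any_case.tolist())
print(starts_with.tolist())
```

Without `na=False`, a missing name yields a missing value in the mask, and indexing a dataframe with such a mask raises an error.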

#### What if we only wanted an array of the players' names, and not their season statistics?

Use the `unique()` function, which returns an array of the distinct values in order of first appearance.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html

In [28]:
name_anthony["PLAYER_NAME"].unique()

array(['Anthony Davis', 'Anthony Edwards', 'Anthony Gill', 'Anthony Lamb',
       'Anthony Tolliver', 'Carmelo Anthony', 'Cole Anthony',
       "De'Anthony Melton", 'Karl-Anthony Towns', 'Anthony Brown'],
      dtype=object)

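Two related calls are worth knowing alongside `unique()`: `nunique()` returns only the count of distinct values, and `drop_duplicates()` returns a Series (keeping the original row labels) rather than an array. A small sketch on made-up rows:

```python
import pandas as pd

# Hypothetical toy data: the same player appears in two seasons.
toy = pd.DataFrame({
    "PLAYER_NAME": ["Anthony Davis", "Cole Anthony", "Anthony Davis"],
    "SEASON_ID": [22020, 22020, 22019],
})

# Distinct names as an array, in order of first appearance.
names_array = toy["PLAYER_NAME"].unique()

# Just the number of distinct names.
n_names = toy["PLAYER_NAME"].nunique()

# Distinct names as a Series, keeping the label of each first occurrence.
names_series = toy["PLAYER_NAME"].drop_duplicates()

print(list(names_array))
print(n_names)
print(names_series.index.tolist())
```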
### Now let's do a more complex analysis, one that might be typical for a 2- or 3-point question on an exam.

**Requirement**:

In the NBA, for a player to lead in any statistical category, he must have played in a minimum number of games. For a full season, that number is 58 games. If you are interested in a full explanation of the requirements, see the link below.

Write a function, `top_ten_scorers(df,min_games,season_id)`, that returns the top 10 scoring leaders, in points per game, for any given season.

1. Return a dataframe, `top_scorers`, containing the players with the top 10 average points per game for the given season, among players who meet the minimum number of games qualification.

2. Round the average to 1 decimal place, after sorting.

3. Include the top 10 plus ties. The final dataframe indexes should be from 0 to 9 (or higher, if there are ties).

4. The dataframe should be sorted from most to least, with ties broken by name in alphabetical order.

5. The `nba_stats` dataframe will be the input for this, along with the season to filter for and the minimum number of games to qualify.

6. The output dataframe should have the following columns:  `player`, `games`, `points`, `PPG`.

https://www.nba.com/stats/help/statminimums

#### What is our strategy for solving this problem?

1. Create a dataframe with only the rows required, using a boolean mask.

2. Create the column `PPG`.

3. Keep only the required columns in the dataframe.

4. Rename the columns.

5. Return/keep only the top 10 for PPG in the dataframe, and reset the index.

6. Drop the index column.

7. Sort the dataframe.

8. Round PPG to one decimal place.

In [29]:
def top_ten_scorers(df, min_games, season_id):
    ###
    ###YOUR CODE HERE

    # keep only the players from the requested season who meet the
    # minimum-games requirement, using a boolean mask on the input df
    mask = (df['GAMES_PLAYED'] >= min_games) & (df['SEASON_ID'].isin([season_id]))
    top_scorers = df[mask].reset_index(drop=True)

    # create the column PPG
    top_scorers['PPG'] = top_scorers['POINTS'] / top_scorers['GAMES_PLAYED']

    # rename the columns
    top_scorers = top_scorers.rename(columns={"PLAYER_NAME": "player", "GAMES_PLAYED": "games", "POINTS": "points"})
    # only keep the required columns
    top_scorers = top_scorers[['player', 'games', 'points', 'PPG']]

    # now only keep the 10 highest, plus ties; drop=True discards the old index
    top_scorers = top_scorers.nlargest(10, 'PPG', keep="all").reset_index(drop=True)

    # sort the dataframe, then renumber the rows so the index runs from 0
    top_scorers = top_scorers.sort_values(['PPG', 'player'], ascending=[False, True]).reset_index(drop=True)

    # round the PPG column to one decimal place
    top_scorers = top_scorers.round({'PPG': 1})

    return top_scorers

# test dataframe
top_scoring_players = top_ten_scorers(nba_stats,58,22018)
top_scoring_players

Unnamed: 0,player,games,points,PPG
0,James Harden,78,2818,36.1
1,Paul George,77,2159,28.0
2,Giannis Antetokounmpo,72,1994,27.7
3,Joel Embiid,64,1761,27.5
4,Stephen Curry,69,1881,27.3
5,Kawhi Leonard,60,1596,26.6
6,Devin Booker,64,1700,26.6
7,Kevin Durant,78,2027,26.0
8,Damian Lillard,80,2067,25.8
9,Kemba Walker,82,2102,25.6


Your dataframe results should match those at this link (`SEASON_ID` 22018 is the 2018-19 season, which ESPN labels as 2019):  https://www.espn.com/nba/stats/_/season/2019/seasontype/2

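The "top 10 plus ties" requirement is what `keep="all"` handles in `nlargest()`. The sketch below uses a made-up four-row dataframe (hypothetical players and values) and asks for the top 2, so the effect of a tie at the cutoff is easy to see.

```python
import pandas as pd

# Hypothetical toy data with a tie for second place.
toy = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "PPG": [30.0, 25.0, 25.0, 20.0],
})

# Default keep="first": exactly 2 rows, the tie at the cutoff is cut off.
top2_first = toy.nlargest(2, "PPG")

# keep="all": every row tied with the last kept value is included.
top2_all = toy.nlargest(2, "PPG", keep="all")

# Sort with a name tie-break, then renumber the rows from 0.
top2_sorted = (
    top2_all.sort_values(["PPG", "player"], ascending=[False, True])
    .reset_index(drop=True)
)

print(top2_first["player"].tolist())
print(len(top2_all))
print(top2_sorted["player"].tolist())
```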
### What are your questions on this exercise, and on the notebook as a whole?

### Extra Credit, for fun (will not be covered during Bootcamp live session)

**Requirement**:

In the NBA, the metric `PLUS_MINUS` provides a single number for the value of a player. The metric is defined as the difference between the number of points the player's team scores, minus the number of points the opposing team scores, during the time that the player is in the game.

A positive number means that, over the course of the season, the player's team scored that many more points than their opponents when he was on the court. Likewise for a negative number, his team scored that many fewer points.

In general, the best players have the highest `PLUS_MINUS`, and the worst players have the lowest `PLUS_MINUS`.

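As a quick illustration of the definition above, here is the arithmetic for a single player over three made-up stretches on the court. The numbers are hypothetical and are not taken from `nba_stats`.

```python
# Hypothetical stints: (points scored by the player's team,
# points scored by the opponent) while the player was on the court.
stints = [(12, 8), (20, 25), (15, 10)]

# Plus-minus is the sum of (team points - opponent points) over all stints.
plus_minus = sum(team - opp for team, opp in stints)

# (12 - 8) + (20 - 25) + (15 - 10) = 4 - 5 + 5 = 4
print(plus_minus)
```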
So let's see who the best and worst players were, during the 2020 season.

Return a dataframe, `best_players`, containing the top 10 players and their `PLUS_MINUS` value. Include the top 10 plus ties. The final dataframe indexes should be from 0 to 9 (or higher, if there are ties). The dataframe should be sorted from most to least, with ties broken by name in reverse alphabetical order.

Additionally, return a dataframe, `worst_players`, containing the bottom 10 players and their `PLUS_MINUS` value. Include the bottom 10 plus ties. The final dataframe indexes should be from 0 to 9 (or higher, if there are ties). The dataframe should be sorted from lowest value to highest value, with ties broken by name in alphabetical order.

The output dataframes should have the following columns:  `PLAYER_NAME`, `PLUS_MINUS`. There is no need to rename the columns from their original names in the source dataframe for this exercise.

Use the `nba_stats` dataframe as the input for this.

The `nsmallest()` function is the counterpart of `nlargest()`: it returns the rows with the smallest values in a column.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nsmallest.html

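A minimal sketch of `nsmallest()` with `keep="all"`, on a made-up dataframe (hypothetical names and values), asking for the bottom 2 so that a tie at the cutoff is visible:

```python
import pandas as pd

# Hypothetical toy data with a tie for second-lowest.
toy = pd.DataFrame({
    "PLAYER_NAME": ["W", "X", "Y", "Z"],
    "PLUS_MINUS": [-50, 10, -50, -80],
})

# Bottom 2 plus ties: Z (-80) and both players at -50.
bottom2 = toy.nsmallest(2, "PLUS_MINUS", keep="all")

# Lowest first, ties broken alphabetically, rows renumbered from 0.
bottom2 = (
    bottom2.sort_values(["PLUS_MINUS", "PLAYER_NAME"], ascending=[True, True])
    .reset_index(drop=True)
)

print(bottom2["PLAYER_NAME"].tolist())
```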




In [30]:
# best players here
# keep only the 2020 season players
best_players = nba_stats[nba_stats['SEASON_ID'].isin([22020])].reset_index(drop=True)

# keep only the required columns
best_players = best_players[['PLAYER_NAME','PLUS_MINUS']]

# now only keep the 10 highest, plus ties
best_players = best_players.nlargest(10, 'PLUS_MINUS', keep="all")

# sort from most to least, with ties broken by name in reverse alphabetical order,
# then renumber the rows from 0
best_players = best_players.sort_values(['PLUS_MINUS', 'PLAYER_NAME'], ascending=[False, False]).reset_index(drop=True)

best_players

Unnamed: 0,PLAYER_NAME,PLUS_MINUS
0,Rudy Gobert,728
1,Mike Conley,548
2,Royce O'Neale,471
3,Joe Ingles,454
4,Kawhi Leonard,446
5,Paul George,432
6,Bojan Bogdanovic,419
7,Giannis Antetokounmpo,409
8,Joel Embiid,405
9,Nikola Jokic,384


Your solution should match the dataframe below.

In [31]:
best_players_soln = pd.read_csv('best_players.csv')
best_players_soln

Unnamed: 0,PLAYER_NAME,PLUS_MINUS
0,Rudy Gobert,728
1,Mike Conley,548
2,Royce O'Neale,471
3,Joe Ingles,454
4,Kawhi Leonard,446
5,Paul George,432
6,Bojan Bogdanovic,419
7,Giannis Antetokounmpo,409
8,Joel Embiid,405
9,Nikola Jokic,384


In [32]:
# worst players here
# keep only the 2020 season players
worst_players = nba_stats[nba_stats['SEASON_ID'].isin([22020])].reset_index(drop=True)

# keep only the required columns
worst_players = worst_players[['PLAYER_NAME','PLUS_MINUS']]

# now only keep the 10 lowest, plus ties
worst_players = worst_players.nsmallest(10, 'PLUS_MINUS', keep="all")

# sort from lowest to highest, with ties broken by name in alphabetical order,
# then renumber the rows from 0
worst_players = worst_players.sort_values(['PLUS_MINUS', 'PLAYER_NAME'], ascending=[True, True]).reset_index(drop=True)

worst_players

Unnamed: 0,PLAYER_NAME,PLUS_MINUS
0,Theo Maledon,-621
1,Darius Bazley,-477
2,Dwayne Bacon,-443
3,Isaiah Roby,-437
4,Isaac Okoro,-408
5,Aleksej Pokusevski,-393
6,Collin Sexton,-377
7,Moses Brown,-363
8,Nikola Vucevic,-341
9,Cedi Osman,-323


Your solution should match the dataframe below.

In [33]:
worst_players_soln = pd.read_csv('worst_players.csv')
worst_players_soln

Unnamed: 0,PLAYER_NAME,PLUS_MINUS
0,Theo Maledon,-621
1,Darius Bazley,-477
2,Dwayne Bacon,-443
3,Isaiah Roby,-437
4,Isaac Okoro,-408
5,Aleksej Pokusevski,-393
6,Collin Sexton,-377
7,Moses Brown,-363
8,Nikola Vucevic,-341
9,Cedi Osman,-323
