# Pandas Functions for Data Analysis

1. **apply(), and groupby() with apply()**
2. **nlargest(), nsmallest(), using sum, and mean functions**
3. **boolean mask**
4. **complex analysis using all of the above**   

## The purpose of this notebook is to work through some pandas functions and concepts that are commonly used in data analysis, in a problem-solving format.

## The types of analyses that we cover here are ones that you could possibly be asked to recreate in some fashion, before the semester's end.

In [1]:
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%202/nba_stats.csv
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%202/worst_players.csv
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%202/best_players.csv
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%202/top_rebs.csv
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%202/top_mins.csv


In [2]:
# some modules we will need
import pandas as pd
import numpy as np

We will be using some data from the National Basketball Association's (NBA) statistics API for this exercise. The data is from the 2017-2020 seasons and includes the major statistics for players.

We will import the data into a dataframe called nba_stats and take a quick look at the data.

In [3]:
# load the data file
# bring in the sample output file
nba_stats = pd.read_csv('nba_stats.csv')
# create df with only the columns we want to work with
nba_stats= nba_stats[['SEASON_ID','PLAYER_ID','PLAYER_NAME','GP','MIN','PTS','REB','PLUS_MINUS']]
nba_stats = nba_stats.rename(columns={"GP": "GAMES_PLAYED", "MIN": "MINUTES","PTS": "POINTS", "REB": "REBOUNDS"})

### Before we get started on the functions, let's take a quick look at the key data fields that we will be working with, including some whose meaning may not be easily discernible from the name.

- `PLAYER_ID` - The unique ID number for each player.
- `SEASON_ID` - The ID number for each season. The combination of PLAYER_ID and SEASON_ID gives us the primary key for the dataframe.
- `PLAYER_NAME` - The name of each player.

#### Note that there are 2,139 rows in the dataframe. That means we have 2,139 unique player-season combinations.

- `GAMES_PLAYED` through `PLUS_MINUS` - The individual statistics for the player for that season. Whenever we work with one of these columns, we will define what it means in the exercise.

#### The info() and describe() functions are good to use when first looking at a dataframe.

info() gives us column information, and describe() gives us some statistical measurements of the dataframe.

In [4]:
nba_stats.info()
nba_stats.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2139 entries, 0 to 2138
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   SEASON_ID     2139 non-null   int64  
 1   PLAYER_ID     2139 non-null   int64  
 2   PLAYER_NAME   2139 non-null   object 
 3   GAMES_PLAYED  2139 non-null   int64  
 4   MINUTES       2139 non-null   float64
 5   POINTS        2139 non-null   int64  
 6   REBOUNDS      2139 non-null   int64  
 7   PLUS_MINUS    2139 non-null   int64  
dtypes: float64(1), int64(6), object(1)
memory usage: 133.8+ KB


Unnamed: 0,SEASON_ID,PLAYER_ID,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
count,2139.0,2139.0,2139.0,2139.0,2139.0,2139.0,2139.0
mean,22018.499766,974788.4,45.654511,1038.742403,474.080411,191.116877,0.0
std,1.122678,720006.6,24.546739,785.248157,451.590027,180.995174,154.498564
min,22017.0,1713.0,1.0,0.516667,0.0,0.0,-672.0
25%,22017.0,203076.5,24.0,282.795,98.5,48.0,-69.0
50%,22018.0,1626179.0,51.0,977.45,360.0,150.0,-8.0
75%,22020.0,1628470.0,66.0,1684.266667,722.0,276.0,45.5
max,22020.0,1630466.0,82.0,3027.651667,2818.0,1247.0,728.0


## The apply() function

#### `apply()` is used to apply a function to a data frame or to a series (column of the data frame).

The basic way to use the function is:

out = `dataframe`.apply(`func`)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

### Use the apply() function on a single column of the dataframe

Pass a built-in function to apply().

What is the average number of games that a player played in during any season?

In [5]:
# note the syntax of using the DOUBLE BRACKETS around the column name.
mean_value = nba_stats[['GAMES_PLAYED']].apply(np.mean)
print(mean_value)

nba_stats[['GAMES_PLAYED']].apply(np.mean)

GAMES_PLAYED    45.654511
dtype: float64


Unnamed: 0,0
GAMES_PLAYED,45.654511


#### We can use apply() on multiple columns or the whole DataFrame, but the function must work with all column data types — like numbers, strings, or dates — or it may cause errors.

#### With this data, we can apply to multiple columns that are INT and FLOAT, but not to the entire dataframe, because we also have **OBJECT** data types.

What is the average number of games, points scored, and rebounds for the typical player in a season?

In [6]:
nba_stats[['GAMES_PLAYED','POINTS','REBOUNDS']].apply(np.mean)

# raises an error, because PLAYER_NAME holds strings that np.mean cannot average
# nba_stats.apply(np.mean)

Unnamed: 0,0
GAMES_PLAYED,45.654511
POINTS,474.080411
REBOUNDS,191.116877


As you can see, the function returns a value for each column.

That is, by default `apply()` takes one whole column of the dataframe at a time and runs the passed function on that column.

We can change this default setting by specifying the `axis` parameter, in which axis=0 (the default) applies by column and axis=1 applies by row. We will not demonstrate row-based apply with this dataset.
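
To make the `axis` parameter concrete, here is a minimal sketch on a small, made-up dataframe (the `toy` data and its column names are assumptions for illustration and do not come from `nba_stats.csv`). It shows what the function receives under each setting.

```python
import pandas as pd

# Hypothetical toy data, not taken from nba_stats.csv
toy = pd.DataFrame({
    "games": [10, 20, 40],
    "points": [50, 240, 400],
})

# axis=0 (default): the function receives one column (a Series) at a time
col_range = toy.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row (a Series) at a time
points_per_game = toy.apply(lambda row: row["points"] / row["games"], axis=1)

print(col_range)
print(points_per_game)
```

The column-wise call returns one value per column, while the row-wise call returns one value per row.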

### Remember the groupby() function from the last notebook.

A `groupby()` operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

The basic way to use the function is:

out = `dataframe`.groupby(by=columnname).`function`()

For example:

df.groupby(by=["b"]).sum()

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
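
As a quick illustration of split, apply, and combine, the sketch below groups a small, made-up dataframe by season (the `toy` data and its column names are assumptions, not part of the NBA dataset). Selecting the column to aggregate before calling the function keeps string columns out of the calculation.

```python
import pandas as pd

# Hypothetical toy data, not taken from nba_stats.csv
toy = pd.DataFrame({
    "season": [2017, 2017, 2018, 2018],
    "player": ["A", "B", "A", "B"],
    "points": [100, 300, 200, 400],
})

# Split by season, sum the selected column, combine into a Series
points_by_season = toy.groupby("season")["points"].sum()

# Double brackets keep the result as a DataFrame instead of a Series
mean_by_season = toy.groupby("season")[["points"]].mean()

print(points_by_season)
print(mean_by_season)
```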

On this dataset, an example might be:

In [7]:
nba_stats.groupby(by=["SEASON_ID"]).mean() # This will error out due to mixed data types (e.g., PLAYER_NAME is non-numeric)

TypeError: agg function failed [how->mean,dtype->object]

In [8]:
# To fix the error, let's tell pandas to ignore non-numeric columns

nba_stats.groupby("SEASON_ID").mean(numeric_only=True)


Unnamed: 0_level_0,PLAYER_ID,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
SEASON_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
22017,762753.0,48.346296,1099.722222,484.407407,198.238889,0.0
22018,913443.1,49.24717,1121.603774,516.175472,209.635849,0.0
22019,1070723.0,42.330813,967.996219,447.614367,179.502836,0.0
22020,1153053.0,42.692593,965.740741,448.364815,177.196296,0.0


To reiterate: when you call `groupby()` with a function like `.mean()` and pass `numeric_only=True`, pandas groups the rows by the `by` column and applies the function only to the numeric columns.

- Non-numeric columns like PLAYER_NAME are excluded.

- Numeric columns such as PLAYER_ID are included because of their data type (int64), even though the average of an ID has no real meaning.

Usually, we don’t want to group the entire DataFrame. Instead, we typically want to:

1. Calculate statistics for specific columns
2. Group data by one or more columns

To do this, we use groupby() together with apply().

**The syntax for a single column looks like:**

`dataframe.groupby('columnname').apply(function)`

**The syntax for multiple columns looks like:**

`dataframe.groupby(['columnname1','columnname2']).apply(function)`

Remember:  
- Using `axis=0` (default) applies a function to each **column**.  
- Using `axis=1` applies it to each **row**.

Also:  
- `Series.apply` works on individual values.  
- `DataFrame.apply` works on rows or columns (which are `Series`).  
- `groupby.apply` works on each **group**, which is a smaller `DataFrame`.

In the example below, `np.sum` is applied to each group created by `groupby`.


In [9]:
nba_stats.groupby('SEASON_ID').apply(np.sum, axis=0)



Unnamed: 0_level_0,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
SEASON_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
22017,11889180,411886621,Aaron BrooksAaron GordonAaron HarrisonAaron Ja...,26107,593850.000002,261580,107049,0
22018,11669540,484124825,Aaron GordonAaron HolidayAbdel NaderAl Horford...,26101,594449.999993,273573,111107,0
22019,11648051,566412615,Aaron GordonAaron HolidayAbdel NaderAdam Mokok...,22393,512070.000004,236788,94957,0
22020,11890800,622648405,Aaron GordonAaron HolidayAaron NesmithAbdel Na...,23054,521499.999998,242117,95686,0


In [10]:
# This will error out if there are non-numeric columns (like PLAYER_NAME) because np.mean can't handle strings

nba_stats.groupby('SEASON_ID').apply(np.mean, axis=0)

TypeError: Could not convert ["Aaron BrooksAaron GordonAaron HarrisonAaron Jackson...Zaza PachuliaZhou Qi"] to numeric

#### Note the difference between `apply()`, `groupby()`, and `groupby().apply()`. This is important!

1. `apply()` alone runs the function on every column (or row) you pass to it.  
   If a column isn't compatible with the function, it will cause an error.

2. `groupby()` followed by an aggregation such as `.mean()` runs the function on every non-grouping column.  
   As shown above, a non-numeric column such as `PLAYER_NAME` raises a `TypeError` unless you pass `numeric_only=True`, which skips those columns.

3. `groupby().apply()` runs the function on each group.  
   Each group is a smaller dataframe, so the result depends on the function:  
   - For strings like `PLAYER_NAME`, `sum()` joins the names together (concatenates).  
   - For functions like `np.mean()`, it will error if non-numeric columns are included.


## What’s the problem with the approach above? And how do we fix it?

### Problem:
Some functions (like `sum` vs. `mean`) include or exclude columns differently. This can cause unexpected results and test failures.

### How to fix it:

1. Create a new DataFrame with only the columns needed for your analysis.  
2. (Optional) Set the group-by columns as the index to avoid grouping extra columns.  
3. Perform your `groupby`/`apply`/function on this new DataFrame.  
4. (Optional) Reset the index columns back to normal columns.

#### Extra steps for exercises or homework:

5. Merge the result with other DataFrames if needed.  
6. Drop any extra columns from the merged DataFrame.  
7. Rename columns as required by the task.
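
The steps above can be sketched end to end on made-up data. Everything in this example (the `stats` and `teams` dataframes, their values, and the `TEAM` and `CITY` columns) is hypothetical and only illustrates the order of operations.

```python
import pandas as pd

# Hypothetical data, not taken from nba_stats.csv
stats = pd.DataFrame({
    "PLAYER_NAME": ["Player A", "Player A", "Player B"],
    "POINTS": [100, 150, 80],
    "PLUS_MINUS": [5, -3, 2],
})
teams = pd.DataFrame({
    "PLAYER_NAME": ["Player A", "Player B"],
    "TEAM": ["X", "Y"],
    "CITY": ["City X", "City Y"],
})

# Steps 1-4: keep only the needed columns, group, aggregate, reset the index
totals = stats[["PLAYER_NAME", "POINTS"]].groupby("PLAYER_NAME").sum().reset_index()

# Step 5: merge with another dataframe on the shared key
merged = totals.merge(teams, on="PLAYER_NAME", how="left")

# Step 6: drop columns the task does not ask for
merged = merged.drop(columns=["CITY"])

# Step 7: rename columns to the required names
result = merged.rename(
    columns={"PLAYER_NAME": "player", "POINTS": "total_points", "TEAM": "team"}
)
print(result)
```

Selecting the columns first means the aggregation never sees `PLUS_MINUS` or any string column other than the grouping key.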

### How to use this in a real or testing scenario?

Return a dataframe that summarizes **total games played, points, and rebounds** for each player over the 4 seasons.

Use the `nba_stats` dataframe as your starting point.


#### What is our strategy for solving this problem?

1. Select only the columns we need and create a dataframe.  
2. (Optional) Set grouping columns as indexes.  
3. Use `groupby` and `apply` (or `sum`) to aggregate.  
4. Reset indexes to regular columns.  
5. Complete any extra steps required by the exercise.

In [11]:
# Step 1: Create a new dataframe with only the required columns
nba_stats_test = nba_stats[['PLAYER_NAME','GAMES_PLAYED','POINTS','REBOUNDS']]
print('New dataframe')
print(nba_stats_test.head(5))

# Step 2 (optional): Set the grouping columns as index
nba_stats_test = nba_stats_test.set_index(['PLAYER_NAME'])
print('\nColumns set as index')
print(nba_stats_test.head(5))

# Step 3: Perform groupby and apply the sum function
nba_stats_test2 = nba_stats_test.groupby('PLAYER_NAME').apply(np.sum, axis=0)
print('\nGrouped and summed')
print(nba_stats_test2.head(5))

# Step 4: Reset index to convert back to regular columns
nba_stats_test2.reset_index(inplace=True)
print('\nIndex reset to columns')
print(nba_stats_test2)

# Step 5: Perform any additional steps required by the analysis


New dataframe
     PLAYER_NAME  GAMES_PLAYED  POINTS  REBOUNDS
0   Aaron Gordon            50     618       284
1  Aaron Holiday            66     475        89
2  Aaron Nesmith            46     218       127
3    Abdel Nader            24     160        62
4    Adam Mokoka            14      15         5

Columns set as index
               GAMES_PLAYED  POINTS  REBOUNDS
PLAYER_NAME                                  
Aaron Gordon             50     618       284
Aaron Holiday            66     475        89
Aaron Nesmith            46     218       127
Abdel Nader              24     160        62
Adam Mokoka              14      15         5

Grouped and summed
                GAMES_PLAYED  POINTS  REBOUNDS
PLAYER_NAME                                   
Aaron Brooks              32      75        17
Aaron Gordon             248    3780      1790
Aaron Harrison             9      60        24
Aaron Holiday            182    1396       312
Aaron Jackson              1       8         3

### What if we don’t set the column as an index?

Compare the results below with the ones above to see the difference.

In [12]:
# Step 1: Create a new dataframe with only the required columns
nba_stats_test = nba_stats[['PLAYER_NAME','GAMES_PLAYED','POINTS','REBOUNDS']]
print('New dataframe')
print(nba_stats_test.head(5))

# Step 2 (optional): Set the grouping columns as index
# Skipped here to compare results without setting index
# nba_stats_test = nba_stats_test.set_index(['PLAYER_NAME'])
# print(nba_stats_test.head(5))

# Step 3: Perform groupby and apply the sum function
nba_stats_test2 = nba_stats_test.groupby('PLAYER_NAME').apply(np.sum, axis=0)
print('\nGrouped and summed')
print(nba_stats_test2.head(5))

# Step 4: Reset index to convert back to regular columns
# Skipped here because index was not set
# nba_stats_test2.reset_index(inplace=True)
# nba_stats_test2

# Step 5: Perform any additional steps required by the analysis


New dataframe
     PLAYER_NAME  GAMES_PLAYED  POINTS  REBOUNDS
0   Aaron Gordon            50     618       284
1  Aaron Holiday            66     475        89
2  Aaron Nesmith            46     218       127
3    Abdel Nader            24     160        62
4    Adam Mokoka            14      15         5

Grouped and summed
                                                     PLAYER_NAME  \
PLAYER_NAME                                                        
Aaron Brooks                                        Aaron Brooks   
Aaron Gordon    Aaron GordonAaron GordonAaron GordonAaron Gordon   
Aaron Harrison                                    Aaron Harrison   
Aaron Holiday            Aaron HolidayAaron HolidayAaron Holiday   
Aaron Jackson                                      Aaron Jackson   

                GAMES_PLAYED  POINTS  REBOUNDS  
PLAYER_NAME                                     
Aaron Brooks              32      75        17  
Aaron Gordon             248    3780      1790  



### Some good references

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.SeriesGroupBy.apply.html

https://datagy.io/pandas-groupby/

https://www.geeksforgeeks.org/grouping-and-aggregating-with-pandas/

https://datagy.io/pandas-exploratory-data-analysis/

https://stackabuse.com/efficient-data-manipulation-with-apply-function-in-pandas/

### Any questions up to this point?


### Next, let's do a more detailed analysis by player and season. This is similar to what you might see on an exam.

We will create a new dataframe, `nba_stats_3`, from `nba_stats`, with `SEASON_ID` and `PLAYER_NAME` set as the index. Setting the index is an optional step that prepares the data for analysis by player and season.


In [13]:
nba_stats_3 = nba_stats.set_index(['SEASON_ID','PLAYER_NAME'])
print('\n Column as index')
print(nba_stats_3.head(5))


 Column as index
                         PLAYER_ID  GAMES_PLAYED      MINUTES  POINTS  \
SEASON_ID PLAYER_NAME                                                   
22020     Aaron Gordon      203932            50  1383.780000     618   
          Aaron Holiday    1628988            66  1176.086667     475   
          Aaron Nesmith    1630174            46   668.731667     218   
          Abdel Nader      1627846            24   355.250000     160   
          Adam Mokoka      1629690            14    56.178333      15   

                         REBOUNDS  PLUS_MINUS  
SEASON_ID PLAYER_NAME                          
22020     Aaron Gordon        284          60  
          Aaron Holiday        89           3  
          Aaron Nesmith       127          -7  
          Abdel Nader          62          28  
          Adam Mokoka           5          -8  


**Requirement**:  

Return a dataframe, `top_rebs`, containing the player name and season for the top 5 rebound totals across the 4 seasons.

- Include the top 5 plus ties. In other words, if there are ties, keep all of the results, even if that returns more than 5 rows.
- Sort the dataframe from most to least rebounds, with ties broken by name in alphabetical order.

Use the nba_stats_3 dataframe as the input for this.

The output dataframe should have the following columns:  `player`, `season`, `total_rebounds`.

## Pandas Functions `nlargest()` and `nsmallest()`

To solve this, use the pandas function `nlargest()`. It helps you get the top rows with the largest values.

If you need the smallest values, use `nsmallest()` instead — it works the same way.

Sometimes, exam questions hint that `nlargest()` could be useful, without saying to use it directly.

To handle ties (rows with the same value), use the parameter `keep='all'` to include all tied rows.


https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nsmallest.html
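
To see what `keep='all'` changes, here is a minimal sketch on made-up data with a tie for second place (the `toy` dataframe and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data with a tie at 800 rebounds
toy = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "rebounds": [900, 800, 800, 700],
})

# Default keep='first': exactly 2 rows, the tie is cut off
top2_first = toy.nlargest(2, "rebounds")

# keep='all': every row tied with the last kept value is included
top2_all = toy.nlargest(2, "rebounds", keep="all")

# nsmallest() takes the same arguments
bottom1 = toy.nsmallest(1, "rebounds")

print(top2_first)
print(top2_all)
print(bottom1)
```

With the default, player C is dropped even though C has the same total as B; with `keep='all'`, three rows come back.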

#### What’s our strategy for solving this problem?

1. Select only the columns you need.  
2. Keep only the top 5 rows, plus any ties.  
3. Reset the index to turn it back into columns.  
4. Rename columns as needed.  
5. Sort the final dataframe.


In [14]:
# Step 1: Select only the columns you need
top_rebs = nba_stats_3[['REBOUNDS']]

# Step 2: Keep only the top 5 rows
top_rebs = top_rebs.nlargest(5, 'REBOUNDS', keep="all")

# Step 3: Reset the index to turn it back into columns
top_rebs.reset_index(inplace=True)

# Step 4: Rename columns as needed
top_rebs.rename(columns={"PLAYER_NAME": "player", "SEASON_ID": "season", "REBOUNDS": "total_rebounds"}, inplace=True)

# Step 5: Sort the final dataframe
top_rebs.sort_values(['total_rebounds', 'player'], ascending=[False, True], inplace=True)

top_rebs


Unnamed: 0,season,player,total_rebounds
0,22017,Andre Drummond,1247
1,22018,Andre Drummond,1232
2,22017,DeAndre Jordan,1171
3,22018,Rudy Gobert,1041
4,22017,Dwight Howard,1012
5,22017,Karl-Anthony Towns,1012


Your solution should match the dataframe below.

In [None]:
top_rebs_soln = pd.read_csv('top_rebs.csv')
top_rebs_soln

### What are your questions on this exercise?



**Requirement:**

Create a dataframe called `top_mins` that shows the **top 10 players (plus ties)** based on their **average minutes played** over the 4 seasons.

  This means we need to calculate the total minutes each player has played across all seasons, divide by the number of seasons to get the average, and then round the result to 1 decimal place (after sorting). The final dataframe index should range from 0 to 9 — or higher if there are ties.

### What to do:
- Use the `nba_stats_3` dataframe.
- For each player, calculate:
  - Total minutes played
  - Number of seasons played
  - Average minutes = total minutes / seasons played
- Round the average to **1 decimal place**.
- Sort the results:
  1. By average minutes (highest to lowest)
  2. Break ties using player names in **reverse alphabetical order**
- Return only the top 10 players **plus ties**.
- Reset the index to start from 0.
- Final columns should be: `player`, `seasons_played`, `avg_minutes`.


#### What’s our strategy for solving this problem?

0. Note: `nba_stats_3` has `SEASON_ID` and `PLAYER_NAME` set as indexes.

1. Create two dataframes:
   - One to calculate total minutes
   - One to count how many seasons each player played

2. In the average minutes dataframe:
   - Compute average minutes per player
   - Rename the `MINUTES` column
   - Reset the index

3. In the seasons played dataframe:
   - Count number of seasons per player
   - Rename the column
   - Reset the index

4. Merge the two dataframes.

5. Keep and rename only the required columns.

6. Use `nlargest()` to get the top 10 players (plus ties).

7. Reset the index.

8. Sort the dataframe and round the average minutes to 1 decimal place.

In [15]:
# Step 1: Create a working dataframe with only the minutes column
mins_df = nba_stats_3[['MINUTES']]

# Step 2: Calculate average minutes per player
top_mins = mins_df.groupby("PLAYER_NAME").mean()
# Rename the MINUTES column to avg_minutes
top_mins.rename(columns={"MINUTES": "avg_minutes"}, inplace=True)
# Reset index to turn PLAYER_NAME back into a column
top_mins.reset_index(inplace=True)

# Step 3: Count number of seasons each player played
num_seasons = mins_df.groupby(by=["PLAYER_NAME"]).count()
# Rename the MINUTES column to seasons_played
num_seasons.rename(columns={"MINUTES": "seasons_played"}, inplace=True)
# Reset index to turn PLAYER_NAME back into a column
num_seasons = num_seasons.reset_index()

# Step 4: Merge average minutes and seasons played dataframes
top_mins = top_mins.merge(num_seasons, how='inner')

# Step 5: Rename PLAYER_NAME to player
top_mins.rename(columns={"PLAYER_NAME": "player"}, inplace=True)
# Keep only the required columns
top_mins = top_mins[['player', 'seasons_played', 'avg_minutes']]

# Step 6: Get top 10 players by avg_minutes, including ties
top_mins = top_mins.nlargest(10, 'avg_minutes', keep="all")

# Step 7: Sort by avg_minutes (descending) and then player name (reverse alphabetical)
top_mins = top_mins.sort_values(['avg_minutes', 'player'], ascending=[False, False])
# Reset the index after sorting so it runs from 0 in the final order;
# drop=True discards the old index instead of adding it as a column
top_mins = top_mins.reset_index(drop=True)

# Step 8: Round avg_minutes to 1 decimal place
top_mins = top_mins.round({'avg_minutes': 1})

top_mins


Unnamed: 0,player,seasons_played,avg_minutes
0,Damian Lillard,4,2594.8
1,Klay Thompson,2,2578.3
2,Bradley Beal,4,2551.1
3,Tobias Harris,4,2499.8
4,Russell Westbrook,4,2490.3
5,DeMar DeRozan,4,2443.2
6,Nikola Jokic,4,2442.3
7,Harrison Barnes,4,2437.8
8,Andrew Wiggins,4,2436.1
9,James Harden,4,2377.2


Your solution should match the dataframe below.

In [16]:
top_mins_soln = pd.read_csv('top_mins.csv')
top_mins_soln

Unnamed: 0,player,seasons_played,avg_minutes
0,Damian Lillard,4,2594.8
1,Klay Thompson,2,2578.3
2,Bradley Beal,4,2551.1
3,Tobias Harris,4,2499.8
4,Russell Westbrook,4,2490.3
5,DeMar DeRozan,4,2443.2
6,Nikola Jokic,4,2442.3
7,Harrison Barnes,4,2437.8
8,Andrew Wiggins,4,2436.1
9,James Harden,4,2377.2
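
As a compact alternative to building two dataframes and merging them, the count and the mean can be computed in one pass with named aggregation. This is only a sketch on made-up data (the `toy` dataframe and its values are assumptions), not the required solution path.

```python
import pandas as pd

# Hypothetical toy data: one row per player-season
toy = pd.DataFrame({
    "PLAYER_NAME": ["A", "A", "B", "B", "C"],
    "MINUTES": [2000.0, 2200.0, 1500.0, 1700.0, 2100.0],
})

# Named aggregation: each keyword becomes an output column
summary = (
    toy.groupby("PLAYER_NAME")
       .agg(seasons_played=("MINUTES", "count"), avg_minutes=("MINUTES", "mean"))
       .reset_index()
       .rename(columns={"PLAYER_NAME": "player"})
)

# Top 2 plus ties, sorted by average (descending) then name (reverse alphabetical)
top = (
    summary.nlargest(2, "avg_minutes", keep="all")
           .sort_values(["avg_minutes", "player"], ascending=[False, False])
           .reset_index(drop=True)
           .round({"avg_minutes": 1})
)
print(top)
```

Players A and C tie on average minutes here, so the reverse-alphabetical tie-break places C first.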


### What are your questions on this exercise?

### Now Let's Look at Boolean Masks

#### What is a Boolean Mask?

A **boolean mask** is a way to filter data using **True** or **False** values.

While boolean masks are often used with **NumPy arrays**, they also work with **pandas DataFrames**. We'll focus on how they work with pandas for now and look at NumPy later.

**In pandas, a boolean mask helps you select only the rows that meet a certain condition.**

To create a mask, you can use:

- **Comparison operators** like `<`, `>`, `>=`, `<=`, `==`
- The `.isin()` function — to check if values are in a list
- The `.str.contains()` function — to filter text values

These tools help you easily select the rows you want to keep.


VanderPlas has an excellent introduction to masks in his book, focused on NumPy. The chapter is linked here: https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html
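
The three ways of building a mask listed above can be sketched on a small, made-up dataframe (the `toy` data, player names, and values are assumptions for illustration). Masks can also be combined with `&` (and) or `|` (or).

```python
import pandas as pd

# Hypothetical toy data, not taken from nba_stats.csv
toy = pd.DataFrame({
    "player": ["Ann Lee", "Bob Ray", "Ann Kim"],
    "minutes": [1500.0, 2100.0, 2300.0],
})

# Comparison operator: True where minutes is at least 2000
comparison_mask = toy["minutes"] >= 2000

# .isin(): True where the value appears in the given list
isin_mask = toy["player"].isin(["Ann Lee", "Bob Ray"])

# .str.contains(): True where the text contains the substring
contains_mask = toy["player"].str.contains("Ann")

# Combine masks element-wise with & and use the result to filter rows
combined = toy[comparison_mask & contains_mask]
print(combined)
```

Each mask is a Series of True/False values with the same index as the dataframe, which is why it can be passed directly inside the square brackets.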

In [17]:
# mask to filter by comparison
minutes_mask = nba_stats['MINUTES'] >= 2000
minutes_mask

Unnamed: 0,MINUTES
0,False
1,False
2,False
3,False
4,False
...,...
2134,False
2135,False
2136,False
2137,False


In [18]:
# filter the dataframe using the mask
high_minutes = nba_stats[minutes_mask]
high_minutes

Unnamed: 0,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
20,22020,203952,Andrew Wiggins,71,2364.270000,1320,347,0
23,22020,1630162,Anthony Edwards,72,2314.166667,1392,336,-228
34,22020,1628389,Bam Adebayo,64,2142.616667,1197,573,24
42,22020,202711,Bojan Bogdanovic,72,2215.565000,1225,281,419
45,22020,203078,Bradley Beal,60,2146.998333,1878,283,-3
...,...,...,...,...,...,...,...,...
2124,22017,202083,Wesley Matthews,63,2131.433333,802,198,-324
2125,22017,203115,Will Barton,81,2682.901667,1268,409,123
2126,22017,1626161,Willie Cauley-Stein,73,2043.823333,932,510,-323
2129,22017,201163,Wilson Chandler,74,2346.253333,738,398,56


In [19]:
# filter the dataframe directly, without storing the mask
# in a separate variable
high_minutes_2 = nba_stats[nba_stats['MINUTES'] >= 2000]
high_minutes_2

Unnamed: 0,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
20,22020,203952,Andrew Wiggins,71,2364.270000,1320,347,0
23,22020,1630162,Anthony Edwards,72,2314.166667,1392,336,-228
34,22020,1628389,Bam Adebayo,64,2142.616667,1197,573,24
42,22020,202711,Bojan Bogdanovic,72,2215.565000,1225,281,419
45,22020,203078,Bradley Beal,60,2146.998333,1878,283,-3
...,...,...,...,...,...,...,...,...
2124,22017,202083,Wesley Matthews,63,2131.433333,802,198,-324
2125,22017,203115,Will Barton,81,2682.901667,1268,409,123
2126,22017,1626161,Willie Cauley-Stein,73,2043.823333,932,510,-323
2129,22017,201163,Wilson Chandler,74,2346.253333,738,398,56


**Note:** When you use a boolean mask, the filtered rows keep their original index.

To reset the index (so it starts from 0), use `.reset_index()`:


In [20]:
# filter with the mask, then reset the index
high_minutes_idx = nba_stats[minutes_mask].reset_index()
high_minutes_idx

Unnamed: 0,index,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
0,20,22020,203952,Andrew Wiggins,71,2364.270000,1320,347,0
1,23,22020,1630162,Anthony Edwards,72,2314.166667,1392,336,-228
2,34,22020,1628389,Bam Adebayo,64,2142.616667,1197,573,24
3,42,22020,202711,Bojan Bogdanovic,72,2215.565000,1225,281,419
4,45,22020,203078,Bradley Beal,60,2146.998333,1878,283,-3
...,...,...,...,...,...,...,...,...,...
300,2124,22017,202083,Wesley Matthews,63,2131.433333,802,198,-324
301,2125,22017,203115,Will Barton,81,2682.901667,1268,409,123
302,2126,22017,1626161,Willie Cauley-Stein,73,2043.823333,932,510,-323
303,2129,22017,201163,Wilson Chandler,74,2346.253333,738,398,56
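
Notice that `.reset_index()` keeps the old index as a new column named `index`. If you do not need it, passing `drop=True` discards it in the same call. A minimal sketch on a hypothetical toy dataframe:

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "minutes": [2100, 1500, 2300, 900],
})

# Filtering keeps the original index labels (0 and 2 here)
filtered = toy[toy["minutes"] >= 2000]

# Default: the old index is kept as a column called 'index'
with_old = filtered.reset_index()

# drop=True: the old index is discarded
without_old = filtered.reset_index(drop=True)

print(with_old.columns.tolist())
print(without_old.columns.tolist())
```

This gives the same result as calling `.reset_index()` and then deleting the `index` column, as the exercises in this notebook do.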


In [21]:
# mask using isin()
season_2017_mask = nba_stats['SEASON_ID'].isin([22017])
season_2017_mask

Unnamed: 0,SEASON_ID
0,False
1,False
2,False
3,False
4,False
...,...
2134,True
2135,True
2136,True
2137,True


In [22]:
# mask using isin(), with reset_index()
season_2017 = nba_stats[season_2017_mask].reset_index()
season_2017

Unnamed: 0,index,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
0,1599,22017,201166,Aaron Brooks,32,189.413333,75,17,-75
1,1600,22017,203932,Aaron Gordon,58,1909.078333,1022,457,-92
2,1601,22017,1626151,Aaron Harrison,9,233.251667,60,24,-72
3,1602,22017,1628935,Aaron Jackson,1,34.500000,8,3,-10
4,1603,22017,1627846,Abdel Nader,48,521.526667,146,71,-109
...,...,...,...,...,...,...,...,...,...
535,2134,22017,1628380,Zach Collins,66,1045.450000,292,221,16
536,2135,22017,203897,Zach LaVine,24,656.286667,401,94,-172
537,2136,22017,2216,Zach Randolph,59,1507.611667,857,397,-353
538,2137,22017,2585,Zaza Pachulia,69,971.746667,373,321,196


In [23]:
# same filter with isin() and reset_index(), without storing the mask separately
season_2017_2 = nba_stats[nba_stats['SEASON_ID'].isin([22017])].reset_index()
season_2017_2

Unnamed: 0,index,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
0,1599,22017,201166,Aaron Brooks,32,189.413333,75,17,-75
1,1600,22017,203932,Aaron Gordon,58,1909.078333,1022,457,-92
2,1601,22017,1626151,Aaron Harrison,9,233.251667,60,24,-72
3,1602,22017,1628935,Aaron Jackson,1,34.500000,8,3,-10
4,1603,22017,1627846,Abdel Nader,48,521.526667,146,71,-109
...,...,...,...,...,...,...,...,...,...
535,2134,22017,1628380,Zach Collins,66,1045.450000,292,221,16
536,2135,22017,203897,Zach LaVine,24,656.286667,401,94,-172
537,2136,22017,2216,Zach Randolph,59,1507.611667,857,397,-353
538,2137,22017,2585,Zaza Pachulia,69,971.746667,373,321,196


#### Now let's do a multiple comparison mask.

`Return the players with 2000 or more minutes in either the 2017 or the 2018 season.`

In [24]:
# mask to filter by multiple comparison
multiple_mask = (nba_stats['MINUTES'] >= 2000) & (nba_stats['SEASON_ID'].isin([22017,22018]))
multiple_mask

Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
...,...
2134,False
2135,False
2136,False
2137,False


In [25]:
# return the dataframe
high_minutes_idx = nba_stats[multiple_mask].reset_index()
high_minutes_idx

Unnamed: 0,index,SEASON_ID,PLAYER_ID,PLAYER_NAME,GAMES_PLAYED,MINUTES,POINTS,REBOUNDS,PLUS_MINUS
0,1069,22018,203932,Aaron Gordon,78,2632.533333,1246,574,107
1,1073,22018,202329,Al-Farouq Aminu,81,2291.698333,760,610,384
2,1086,22018,203083,Andre Drummond,79,2646.890000,1370,1232,176
3,1091,22018,203952,Andrew Wiggins,73,2542.713333,1321,352,-66
4,1099,22018,203085,Austin Rivers,76,2027.726667,618,162,104
...,...,...,...,...,...,...,...,...,...
199,2124,22017,202083,Wesley Matthews,63,2131.433333,802,198,-324
200,2125,22017,203115,Will Barton,81,2682.901667,1268,409,123
201,2126,22017,1626161,Willie Cauley-Stein,73,2043.823333,932,510,-323
202,2129,22017,201163,Wilson Chandler,74,2346.253333,738,398,56
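
The mask above combines two conditions with `&` (AND). Masks can also be combined with `|` (OR) and inverted with `~` (NOT). The sketch below uses a hypothetical toy dataframe to show all three. When conditions are written inline, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>=` in Python.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "SEASON_ID": [22017, 22017, 22018, 22019],
    "MINUTES": [2100.0, 500.0, 2400.0, 2600.0],
})

heavy = toy["MINUTES"] >= 2000
in_seasons = toy["SEASON_ID"].isin([22017, 22018])

both = toy[heavy & in_seasons]     # AND: both conditions are True
either = toy[heavy | in_seasons]   # OR: at least one condition is True
not_heavy = toy[~heavy]            # NOT: invert the mask

# Inline version of the AND filter; note the parentheses around each comparison
both_inline = toy[(toy["MINUTES"] >= 2000) & (toy["SEASON_ID"].isin([22017, 22018]))]

print(both.index.tolist(), either.index.tolist(), not_heavy.index.tolist())
```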


#### Let's do a string comparison mask.

`Return all of the season stats for players named Anthony, in either their first or last names (or both).`

In [None]:
name_anthony_mask = nba_stats['PLAYER_NAME'].str.contains('Anthony')
name_anthony_mask

In [None]:
# mask to filter by string comparison
# return all of the players named Anthony
name_anthony = nba_stats[name_anthony_mask].reset_index()
name_anthony
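
By default, `.str.contains()` is case-sensitive. The sketch below, on a hypothetical Series of names, shows two optional parameters that are often useful: `case=False` to ignore case, and `na=False` so that missing names count as non-matches and the result is a clean True/False mask.

```python
import pandas as pd

# Hypothetical names for illustration only, including one missing value
names = pd.Series([
    "Anthony Davis", "Carmelo Anthony", "anthony lower",
    None, "Cole Anthony", "Ant Man",
])

# case=False ignores upper/lower case; na=False treats missing names as non-matches
safe_mask = names.str.contains("anthony", case=False, na=False)
print(safe_mask.tolist())

# The mask can then be used to filter as usual
matches = names[safe_mask]
print(matches.tolist())
```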

#### What if we only wanted an array of the players' names, and not their season statistics?

Use the `unique()` function, which returns an array.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html

In [None]:
name_anthony["PLAYER_NAME"].unique()
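
A small sketch of what `unique()` does, using a hypothetical Series with a repeated name. The values come back in order of first appearance, and the related `nunique()` method returns only the count.

```python
import pandas as pd

# Hypothetical names for illustration only; "Anthony Davis" appears twice
names = pd.Series(["Anthony Davis", "Cole Anthony", "Anthony Davis", "Carmelo Anthony"])

# unique() removes repeats and keeps the order of first appearance
unique_names = list(names.unique())
print(unique_names)

# nunique() returns the number of distinct values
n_unique = names.nunique()
print(n_unique)
```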

### Now let's do a more complex analysis, one that might be typical for a 2 or 3 point question on an exam.

**Requirement**:

In the NBA, for a player to lead in any statistical category, he must have played in a minimum number of games. For a full season, that number is 58 games. If you are interested in a full explanation of the requirements, see the link below.

Write a function, `top_ten_scorers(df,min_games,season_id)` that returns the top 10 scoring leaders, in points per game, for any given season.

1. Return a dataframe, `top_scorers`, containing the players with the 10 highest points-per-game averages for the given season, among players who meet the minimum number of games qualification.

2. Round the average to 1 decimal place, after sorting.

3. Include the top 10 plus ties. The final dataframe indexes should be from 0 to 9 (or higher, if there are ties).

4. The dataframe should be sorted from most to least, with ties broken by name in alphabetical order.

5. The nba_stats dataframe will be the input for this, along with the season to be filtered for and minimum number of games to qualify.

6. The output dataframe should have the following columns:  `player`, `games`, `points`, `PPG`.

https://www.nba.com/stats/help/statminimums

#### What is our strategy for solving this problem?

1. Create a new dataframe using a boolean mask to keep only the rows we need.

2. Create a new column called `PPG`.

3. Keep only the columns we need.

4. Rename the columns.

5. Keep only the top 10 rows with the highest `PPG`, and reset the index.

6. Remove the old index column.

7. Sort the dataframe.

8. Round the `PPG` column to one decimal place.
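
Step 5 is where the "top 10 plus ties" requirement is handled. As a minimal sketch on a hypothetical toy dataframe (asking for the top 2 instead of the top 10), compare the default behavior of `nlargest()` with `keep="all"`:

```python
import pandas as pd

# Hypothetical toy data for illustration only; three players are tied at 25.0
toy = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E"],
    "PPG": [30.0, 25.0, 25.0, 20.0, 25.0],
})

# Default (keep="first"): exactly 2 rows, the tie is cut off
top2_first = toy.nlargest(2, "PPG")

# keep="all": every row tied with the last qualifying value is included
top2_all = toy.nlargest(2, "PPG", keep="all")

# Sort with the tie-breaker, then renumber the rows from 0
ranked = (
    top2_all
    .sort_values(["PPG", "player"], ascending=[False, True])
    .reset_index(drop=True)
)

print(len(top2_first), len(top2_all))
print(ranked)
```

With `keep="all"` the result can have more rows than requested, which is why the final index may run past 9 in the exercise.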


In [27]:
def top_ten_scorers(df, min_games, season_id):
    ###
    ###YOUR CODE HERE

    # 1. Create a new dataframe using a boolean mask to keep only the rows we need
    # (use the df argument, not the global nba_stats, so the function works on any input)
    top_scorers = df[(df['GAMES_PLAYED'] >= min_games) & (df['SEASON_ID'].isin([season_id]))].reset_index()

    # 2. Create a new column called PPG
    top_scorers['PPG'] = top_scorers['POINTS'] / top_scorers['GAMES_PLAYED']

    # 3. Keep only the columns we need
    top_scorers = top_scorers[['PLAYER_NAME','GAMES_PLAYED','POINTS','PPG']]

    # 4. Rename the columns (rename() returns a new dataframe, so assign the result)
    top_scorers = top_scorers.rename(columns={"PLAYER_NAME": "player","GAMES_PLAYED": "games","POINTS": "points"})

    # 5. Keep only the top 10 rows with the highest PPG, and reset the index
    top_scorers = top_scorers.nlargest(10, 'PPG', keep="all").reset_index()

    # 6. Remove the old index column
    del top_scorers['index']

    # 7. Sort the dataframe
    top_scorers.sort_values(['PPG', 'player'], ascending=[False, True], inplace=True)

    # 8. Round the PPG column to one decimal place
    top_scorers = top_scorers.round({'PPG': 1})

    return top_scorers

# test dataframe
top_scoring_players = top_ten_scorers(nba_stats, 58, 22018)
top_scoring_players


Unnamed: 0,player,games,points,PPG
0,James Harden,78,2818,36.1
1,Paul George,77,2159,28.0
2,Giannis Antetokounmpo,72,1994,27.7
3,Joel Embiid,64,1761,27.5
4,Stephen Curry,69,1881,27.3
5,Kawhi Leonard,60,1596,26.6
6,Devin Booker,64,1700,26.6
7,Kevin Durant,78,2027,26.0
8,Damian Lillard,80,2067,25.8
9,Kemba Walker,82,2102,25.6


Your dataframe results should match those at this link:  https://www.espn.com/nba/stats/_/season/2019/seasontype/2

### What are your questions on this exercise, and on the notebook as a whole?

### Extra Credit, for fun (will not be covered during Bootcamp live session)

**Requirement**:

In the NBA, the metric `PLUS_MINUS` provides a single number for the value of a player. The metric is defined as the number of points the player's team scores minus the number of points the opposing team scores, during the time that the player is in the game.

A positive number means that, over the course of the season, the player's team scored that many more points than their opponents when he was on the court. Likewise for a negative number, his team scored that many fewer points.

In general, the best players have the highest `PLUS_MINUS`, and the worst players have the lowest `PLUS_MINUS`.

So let's see who the best and worst players were, during the 2020 season.

Return a dataframe, best_players, containing the top 10 players and their `PLUS_MINUS` value. Include the top 10 plus ties. The final dataframe indexes should be from 0 to 9 (or higher, if there are ties). The dataframe should be sorted from most to least, with ties broken by name in reverse alphabetical order.  

Additionally, return a dataframe, worst_players, containing the bottom 10 players and their `PLUS_MINUS` value. Include the bottom 10 plus ties. The final dataframe indexes should be from 0 to 9 (or higher, if there are ties). The dataframe should be sorted from lowest value to highest value, with ties broken by name in alphabetical order.

The output dataframes should have the following columns:  `PLAYER_NAME`, `PLUS_MINUS`. There is no need to rename the columns from their original names in the source dataframe for this exercise.

Use the nba_stats_2 dataframe as the input for this.

The `nsmallest()` function is analogous to `nlargest` for finding the smallest values.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nsmallest.html
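
As a minimal sketch on a hypothetical toy dataframe (asking for the bottom 1 instead of the bottom 10), `nsmallest()` accepts the same `keep="all"` argument as `nlargest()`:

```python
import pandas as pd

# Hypothetical toy data for illustration only; C and A are tied for the lowest value
toy = pd.DataFrame({
    "PLAYER_NAME": ["C", "B", "A", "D"],
    "PLUS_MINUS": [-50, 120, -50, 10],
})

# keep="all" includes every row tied with the last qualifying value
bottom = toy.nsmallest(1, "PLUS_MINUS", keep="all")

# Sort lowest to highest, break ties alphabetically, then renumber from 0
bottom = (
    bottom
    .sort_values(["PLUS_MINUS", "PLAYER_NAME"], ascending=[True, True])
    .reset_index(drop=True)
)

print(bottom)
```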





In [28]:
# best players here
# keep only the 2020 season players
best_players = nba_stats[nba_stats['SEASON_ID'].isin([22020])].reset_index()

# keep only the required columns
best_players = best_players[['PLAYER_NAME','PLUS_MINUS']]

# now only keep the 10 highest, plus ties
best_players = best_players.nlargest(10, 'PLUS_MINUS', keep="all").reset_index()
# drop the index column
del best_players['index']

# sort the dataframe
# ties are broken by name in reverse alphabetical order, per the requirement
best_players.sort_values(['PLUS_MINUS', 'PLAYER_NAME'], ascending=[False, False], inplace=True)

best_players

Unnamed: 0,PLAYER_NAME,PLUS_MINUS
0,Rudy Gobert,728
1,Mike Conley,548
2,Royce O'Neale,471
3,Joe Ingles,454
4,Kawhi Leonard,446
5,Paul George,432
6,Bojan Bogdanovic,419
7,Giannis Antetokounmpo,409
8,Joel Embiid,405
9,Nikola Jokic,384


Your solution should match the dataframe below.

In [30]:
best_players_soln = pd.read_csv('best_players.csv')
best_players_soln

Unnamed: 0,PLAYER_NAME,PLUS_MINUS
0,Rudy Gobert,728
1,Mike Conley,548
2,Royce O'Neale,471
3,Joe Ingles,454
4,Kawhi Leonard,446
5,Paul George,432
6,Bojan Bogdanovic,419
7,Giannis Antetokounmpo,409
8,Joel Embiid,405
9,Nikola Jokic,384


In [29]:
# worst players here
# keep only the 2020 season players
worst_players = nba_stats[nba_stats['SEASON_ID'].isin([22020])].reset_index()

# keep only the required columns
worst_players = worst_players[['PLAYER_NAME','PLUS_MINUS']]

# now only keep the 10 lowest, plus ties
worst_players = worst_players.nsmallest(10, 'PLUS_MINUS', keep="all").reset_index()
# drop the index column
del worst_players['index']

# sort the dataframe
worst_players.sort_values(['PLUS_MINUS', 'PLAYER_NAME'], ascending=[True, True],inplace=True)

worst_players

Unnamed: 0,PLAYER_NAME,PLUS_MINUS
0,Theo Maledon,-621
1,Darius Bazley,-477
2,Dwayne Bacon,-443
3,Isaiah Roby,-437
4,Isaac Okoro,-408
5,Aleksej Pokusevski,-393
6,Collin Sexton,-377
7,Moses Brown,-363
8,Nikola Vucevic,-341
9,Cedi Osman,-323


Your solution should match the dataframe below.

In [31]:
worst_players_soln = pd.read_csv('worst_players.csv')
worst_players_soln

Unnamed: 0,PLAYER_NAME,PLUS_MINUS
0,Theo Maledon,-621
1,Darius Bazley,-477
2,Dwayne Bacon,-443
3,Isaiah Roby,-437
4,Isaac Okoro,-408
5,Aleksej Pokusevski,-393
6,Collin Sexton,-377
7,Moses Brown,-363
8,Nikola Vucevic,-341
9,Cedi Osman,-323
