## Basic pandas optimizations

This chapter offers a brief introduction to working efficiently with pandas DataFrames. You'll learn the options you have for iterating over a DataFrame. Then, you'll learn how to efficiently apply functions to the data stored in one.

### Iterating with .iterrows()
In the video, we discussed that .iterrows() yields each DataFrame row as an (index, pandas Series) tuple. But what does this mean? Let's explore with a few coding exercises.

A pandas DataFrame called pit_df has been loaded into your session. In the original exercise, this DataFrame contains the stats for the Major League Baseball team named the Pittsburgh Pirates (abbreviated as 'PIT') from 2008 to 2012. In this notebook, pit_df is read from the full CSV file, so the output below covers every team in the file, from 2012 back to 1962.

In [6]:
import pandas as pd

pit_df = pd.read_csv('Baseball statistics.csv')

# Use .iterrows() to loop over pit_df and print each row. Save the first item from .iterrows() as i and the second as row.

# Iterate over pit_df and print each row
for i, row in pit_df.iterrows():
    print(row)

Team              ARI
League             NL
Year             2012
RS                734
RA                688
W                  81
OBP             0.328
SLG             0.418
BA              0.259
Playoffs            0
RankSeason        NaN
RankPlayoffs      NaN
G                 162
OOBP            0.317
OSLG            0.415
Name: 0, dtype: object
Team              ATL
League             NL
Year             2012
RS                700
RA                600
W                  94
OBP              0.32
SLG             0.389
BA              0.247
Playoffs            1
RankSeason          4
RankPlayoffs        5
G                 162
OOBP            0.306
OSLG            0.378
Name: 1, dtype: object
...

In [7]:
# Add two lines to the loop: one before print(row) to print each index variable and one after to print each row's type.

# Iterate over pit_df and print each index variable and then each row
for i,row in pit_df.iterrows():
    print(i)
    print(row)
    print(type(row))

0
Team              ARI
League             NL
Year             2012
RS                734
RA                688
W                  81
OBP             0.328
SLG             0.418
BA              0.259
Playoffs            0
RankSeason        NaN
RankPlayoffs      NaN
G                 162
OOBP            0.317
OSLG            0.415
Name: 0, dtype: object
<class 'pandas.core.series.Series'>
1
Team              ATL
League             NL
Year             2012
RS                700
RA                600
W                  94
OBP              0.32
SLG             0.389
BA              0.247
Playoffs            1
RankSeason          4
RankPlayoffs        5
G                 162
OOBP            0.306
OSLG            0.378
Name: 1, dtype: object
<class 'pandas.core.series.Series'>
...
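Each row printed above reports `dtype: object` because .iterrows() packs a whole row into one Series, and a Series holds all of its values under a single dtype. The sketch below shows the consequence on a two-row toy DataFrame. The name `toy_df` and its values are illustrative assumptions, not the session data. An integer column comes back as a float when the rest of the row is numeric.

```python
import pandas as pd

# Toy data for illustration only: one integer column and one float column
toy_df = pd.DataFrame({'W': [81, 94], 'OBP': [0.328, 0.320]})

# The column itself stores integers
print(toy_df['W'].dtype)

# .iterrows() packs each row into a single Series, so the mixed numeric
# columns are upcast to one common dtype (float64 here)
first_index, first_row = next(toy_df.iterrows())
print(first_row.dtype)
print(type(first_row['W']))
```

If the original dtypes matter inside the loop, .itertuples() is usually the safer choice, because it reads each value from its own column.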

In [8]:
# Instead of using i and row in the for statement to store the output of .iterrows(), use one variable named row_tuple.

# Add a line in the for loop to print the type of each row_tuple.

# Print the row and type of each row
for row_tuple in pit_df.iterrows():
    print(row_tuple)
    print(type(row_tuple))

(0, Team              ARI
League             NL
Year             2012
RS                734
RA                688
W                  81
OBP             0.328
SLG             0.418
BA              0.259
Playoffs            0
RankSeason        NaN
RankPlayoffs      NaN
G                 162
OOBP            0.317
OSLG            0.415
Name: 0, dtype: object)
<class 'tuple'>
(1, Team              ATL
League             NL
Year             2012
RS                700
RA                600
W                  94
OBP              0.32
SLG             0.389
BA              0.247
Playoffs            1
RankSeason          4
RankPlayoffs        5
G                 162
OOBP            0.306
OSLG            0.378
Name: 1, dtype: object)
<class 'tuple'>
...
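The output above shows that each item yielded by .iterrows() is an ordinary two-element tuple. As a small sketch on toy data (`toy_df` is a stand-in, not the session DataFrame), the two elements can be pulled out by position. The `for i, row in ...` form does the same thing implicitly through tuple unpacking.

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({'Team': ['ARI', 'ATL'], 'W': [81, 94]})

for row_tuple in toy_df.iterrows():
    index_label = row_tuple[0]   # first element: the row's index label
    row = row_tuple[1]           # second element: the row as a pandas Series
    print(index_label, row['Team'], row['W'])

# Collecting the iterator makes the structure easy to inspect
pairs = list(toy_df.iterrows())
print(type(pairs[0]), len(pairs[0]))
```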

### Run differentials with .iterrows()
You've been hired by the San Francisco Giants as an analyst—congrats! The team's owner wants you to calculate a metric called the run differential for each season from the year 2008 to 2012. This metric is calculated by subtracting the total number of runs a team allowed in a season from the team's total number of runs scored in a season. 'RS' means runs scored and 'RA' means runs allowed.

The function below calculates this metric:

def calc_run_diff(runs_scored, runs_allowed):

    run_diff = runs_scored - runs_allowed

    return run_diff

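A DataFrame has been loaded into your session as giants_df and printed into the console. Let's practice using .iterrows() to add a run differential column to this DataFrame. Note that the cell below runs the loop over pit_df, the full dataset loaded earlier, rather than over a Giants-only subset.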

In [13]:
# function
def calc_run_diff(runs_scored, runs_allowed):
    run_diff = runs_scored - runs_allowed
    return run_diff

# Create an empty list to store run differentials
run_diffs = []

# Write a for loop and collect runs allowed and runs scored for each row
for i,row in pit_df.iterrows():
    runs_scored = row['RS']
    runs_allowed = row['RA']
    
    # Use the provided function to calculate run_diff for each row
    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    # Append each run differential to the output list
    run_diffs.append(run_diff)

pit_df['RD'] = run_diffs
print(pit_df)

     Team League  Year   RS   RA    W    OBP    SLG     BA  Playoffs  \
0     ARI     NL  2012  734  688   81  0.328  0.418  0.259         0   
1     ATL     NL  2012  700  600   94  0.320  0.389  0.247         1   
2     BAL     AL  2012  712  705   93  0.311  0.417  0.247         1   
3     BOS     AL  2012  734  806   69  0.315  0.415  0.260         0   
4     CHC     NL  2012  613  759   61  0.302  0.378  0.240         0   
...   ...    ...   ...  ...  ...  ...    ...    ...    ...       ...   
1227  PHI     NL  1962  705  759   81  0.330  0.390  0.260         0   
1228  PIT     NL  1962  706  626   93  0.321  0.394  0.268         0   
1229  SFG     NL  1962  878  690  103  0.341  0.441  0.278         1   
1230  STL     NL  1962  774  664   84  0.335  0.394  0.271         0   
1231  WSA     AL  1962  599  716   60  0.308  0.373  0.250         0   

      RankSeason  RankPlayoffs    G   OOBP   OSLG   RD  
0            NaN           NaN  162  0.317  0.415   46  
...
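Because calc_run_diff only subtracts its arguments, it also works when it is handed two whole columns instead of two scalars. The sketch below compares the row-by-row loop with the whole-column call on three 2012 rows copied from the output in this chapter. The name `sample_df` is an assumption made for the example.

```python
import pandas as pd

def calc_run_diff(runs_scored, runs_allowed):
    return runs_scored - runs_allowed

# Three 2012 rows copied from the printed output (ARI, ATL, BAL)
sample_df = pd.DataFrame({
    'Team': ['ARI', 'ATL', 'BAL'],
    'RS': [734, 700, 712],
    'RA': [688, 600, 705],
})

# Row-by-row version, as in the exercise
loop_diffs = []
for i, row in sample_df.iterrows():
    loop_diffs.append(calc_run_diff(row['RS'], row['RA']))

# Whole-column version: the same function applied to two Series at once
column_diffs = calc_run_diff(sample_df['RS'], sample_df['RA'])

# Both approaches give the same run differentials
print([int(d) for d in loop_diffs] == column_diffs.tolist())
```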

### Iterating with .itertuples()
Remember, .itertuples() returns each DataFrame row as a special data type called a namedtuple. You can look up a value within a namedtuple by attribute name using dot notation (for example, row.Year). Let's practice working with namedtuples.

A pandas DataFrame has been loaded into your session called rangers_df. This DataFrame contains the stats ('Team', 'League', 'Year', 'RS', 'RA', 'W', 'G', and 'Playoffs') for the Major League Baseball team named the Texas Rangers (abbreviated as 'TEX').

In [22]:
rangers_df = pit_df[pit_df["Team"] == "TEX"]

# Loop over the DataFrame and collect each row's Index, Year and Wins (W)
for row in rangers_df.itertuples():
    i = row.Index
    year = row.Year
    wins = row.W

    # Print only the seasons in which the Rangers made the Playoffs (1 means yes; 0 means no)
    if row.Playoffs == 1:
        print(i, year, wins)

27 2012 93
57 2011 96
87 2010 90
418 1999 95
448 1998 88
504 1996 90
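Attribute access works in the loop above because every column name used there is a valid Python identifier. The sketch below uses a one-row toy DataFrame to show what .itertuples() does otherwise. The 'Win %' column and the name `toy_df` are illustrative assumptions. A column name that is not a valid identifier is replaced by a positional name, while positional indexing keeps working for every field.

```python
import pandas as pd

# Toy data for illustration only; 'Win %' is deliberately not a valid identifier
toy_df = pd.DataFrame({'Team': ['TEX'], 'Year': [2012], 'Win %': [0.57]})

row = next(toy_df.itertuples())

# Valid column names become attributes; Index holds the row label
print(row.Team, row.Year, row.Index)

# 'Win %' is renamed by position in the namedtuple's field list
print(row._fields)

# Positional access works regardless of the column name
print(row[3])
```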


### Run differentials with .itertuples()
The New York Yankees have made a trade with the San Francisco Giants for your analyst contract; you're a hot commodity! Your new boss has seen your work with the Giants and now wants you to do something similar with the Yankees data. He'd like you to calculate run differentials for the Yankees from 1962 to 2012 and find the season in which they had the best run differential.

You've remembered the function you used when working with the Giants and quickly write it down:

def calc_run_diff(runs_scored, runs_allowed):

    run_diff = runs_scored - runs_allowed

    return run_diff

Let's use .itertuples() to loop over the yankees_df DataFrame (which has been loaded into your session) and calculate run differentials.

In [23]:
# Take an explicit copy so the new column is added to an independent DataFrame
# (assigning to a filtered slice triggers pandas' SettingWithCopyWarning)
yankees_df = pit_df[pit_df["Team"] == "NYY"].copy()

run_diffs = []

# Loop over the DataFrame and calculate each row's run differential
for row in yankees_df.itertuples():

    runs_scored = row.RS
    runs_allowed = row.RA

    run_diff = calc_run_diff(runs_scored, runs_allowed)

    run_diffs.append(run_diff)

# Append new column
yankees_df['RD'] = run_diffs
print(yankees_df)

     Team League  Year   RS   RA    W    OBP    SLG     BA  Playoffs  \
18    NYY     AL  2012  804  668   95  0.337  0.453  0.265         1   
48    NYY     AL  2011  867  657   97  0.343  0.444  0.263         1   
78    NYY     AL  2010  859  693   95  0.350  0.436  0.267         1   
108   NYY     AL  2009  915  753  103  0.362  0.478  0.283         1   
138   NYY     AL  2008  789  727   89  0.342  0.427  0.271         0   
168   NYY     AL  2007  968  777   94  0.366  0.463  0.290         1   
198   NYY     AL  2006  930  767   97  0.363  0.461  0.285         1   
228   NYY     AL  2005  886  789   95  0.355  0.450  0.276         1   
259   NYY     AL  2004  897  808  101  0.353  0.458  0.268         1   
289   NYY     AL  2003  877  716  101  0.356  0.453  0.271         1   
319   NYY     AL  2002  897  697  103  0.354  0.455  0.275         1   
349   NYY     AL  2001  804  713   95  0.334  0.435  0.267         1   
...
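The exercise also asks for the season with the best run differential, which the loop above does not compute. A minimal sketch of one way to find it uses five Yankees seasons (2008 to 2012) and one Red Sox row copied from the output in this chapter. The names `sample_df`, `yankees_sample`, and `best_season` are assumptions for the example, and the result only describes these five rows, not the full 1962 to 2012 history.

```python
import pandas as pd

# Rows copied from the printed output: five NYY seasons plus BOS 2012
sample_df = pd.DataFrame({
    'Team': ['NYY', 'NYY', 'NYY', 'NYY', 'NYY', 'BOS'],
    'Year': [2012, 2011, 2010, 2009, 2008, 2012],
    'RS': [804, 867, 859, 915, 789, 734],
    'RA': [668, 657, 693, 753, 727, 806],
})

# .copy() gives an independent DataFrame, so adding a column is unambiguous
yankees_sample = sample_df[sample_df['Team'] == 'NYY'].copy()

# Same calculation as the loop in the exercise, written as a comprehension
yankees_sample['RD'] = [row.RS - row.RA for row in yankees_sample.itertuples()]

# idxmax() returns the index label of the largest value; .loc fetches that row
best_season = yankees_sample.loc[yankees_sample['RD'].idxmax()]
print(best_season['Year'], best_season['RD'])
```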


### Settle a debate with .apply()
Word has gotten to the Arizona Diamondbacks about your awesome analytics skills. They'd like for you to help settle a debate amongst the managers. One manager claims that the team has made the playoffs every year they have had a win percentage of 0.50 or greater. Another manager says this is not true.

Let's use the function below and the .apply() method to see which manager is correct.

def calc_win_perc(wins, games_played):

    win_perc = wins / games_played
    return np.round(win_perc,2)

A DataFrame named dbacks_df has been loaded into your session.

In [26]:
import numpy as np

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc,2)

# Take an explicit copy so a new column can be added without a SettingWithCopyWarning
dbacks_df = pit_df[pit_df["Team"] == "ARI"].copy()

# Display the first five rows of the DataFrame
print(dbacks_df.head())

# Create a win percentage Series
win_percs = dbacks_df.apply(lambda row: calc_win_perc(row['W'], row['G']), axis=1)
print(win_percs, '\n')

# Append a new column to dbacks_df
dbacks_df['WP'] = win_percs
print(dbacks_df, '\n')

# Display dbacks_df where WP is greater than or equal to 0.50
print(dbacks_df[dbacks_df['WP'] >= 0.50])

    Team League  Year   RS   RA   W    OBP    SLG     BA  Playoffs  \
0    ARI     NL  2012  734  688  81  0.328  0.418  0.259         0   
30   ARI     NL  2011  731  662  94  0.322  0.413  0.250         1   
60   ARI     NL  2010  713  836  65  0.325  0.416  0.250         0   
90   ARI     NL  2009  720  782  70  0.324  0.418  0.253         0   
120  ARI     NL  2008  720  706  82  0.327  0.415  0.251         0   

     RankSeason  RankPlayoffs    G   OOBP   OSLG   RD  
0           NaN           NaN  162  0.317  0.415   46  
30          5.0           4.0  162  0.316  0.409   69  
60          NaN           NaN  162  0.340  0.448 -123  
90          NaN           NaN  162  0.330  0.419  -62  
120         NaN           NaN  162  0.318  0.398   14  
0      0.50
30     0.58
60     0.40
90     0.43
120    0.51
150    0.56
180    0.47
210    0.48
241    0.31
271    0.52
301    0.60
331    0.57
361    0.52
391    0.62
421    0.40
dtype: float64 

...
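The final filter above lists the seasons with a win percentage of at least 0.50, but the managers' claim can also be checked directly. The sketch below does so on the five Diamondbacks seasons (2008 to 2012) shown in the head() output. The names `dbacks_sample`, `winning`, and `claim_holds` are assumptions for the example. In these rows, two seasons reach 0.50 without a playoff appearance, which is enough to contradict the first manager.

```python
import numpy as np
import pandas as pd

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc, 2)

# Five ARI seasons copied from the head() output in this chapter
dbacks_sample = pd.DataFrame({
    'Year': [2012, 2011, 2010, 2009, 2008],
    'W': [81, 94, 65, 70, 82],
    'G': [162, 162, 162, 162, 162],
    'Playoffs': [0, 1, 0, 0, 0],
})

# axis=1 passes each row to the lambda as a Series
dbacks_sample['WP'] = dbacks_sample.apply(
    lambda row: calc_win_perc(row['W'], row['G']), axis=1
)

# Seasons at or above 0.50, and whether all of them ended in the playoffs
winning = dbacks_sample[dbacks_sample['WP'] >= 0.50]
claim_holds = bool((winning['Playoffs'] == 1).all())

print(winning[['Year', 'WP', 'Playoffs']])
print(claim_holds)
```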


### Replacing .iloc with underlying arrays
Now that you have a better grasp of a DataFrame's internals, let's update one of your previous analyses to leverage a DataFrame's underlying arrays. You'll revisit the win percentage calculations you performed row by row with the .iloc indexer.

In [30]:
def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc,2)

win_percs_list = []

for i in range(len(pit_df)):
    row = pit_df.iloc[i]

    wins = row['W']
    games_played = row['G']

    win_perc = calc_win_perc(wins, games_played)

    win_percs_list.append(win_perc)

pit_df['WP'] = win_percs_list

In [33]:
# Use the W array and G array to calculate win percentages
win_percs_np = calc_win_perc(pit_df['W'].values, pit_df['G'].values)

# Append a new column to pit_df that stores all win percentages
pit_df['WP'] = win_percs_np

print(pit_df.head())

  Team League  Year   RS   RA   W    OBP    SLG     BA  Playoffs  RankSeason  \
0  ARI     NL  2012  734  688  81  0.328  0.418  0.259         0         NaN   
1  ATL     NL  2012  700  600  94  0.320  0.389  0.247         1         4.0   
2  BAL     AL  2012  712  705  93  0.311  0.417  0.247         1         5.0   
3  BOS     AL  2012  734  806  69  0.315  0.415  0.260         0         NaN   
4  CHC     NL  2012  613  759  61  0.302  0.378  0.240         0         NaN   

   RankPlayoffs    G   OOBP   OSLG   RD    WP  
0           NaN  162  0.317  0.415   46  0.50  
1           5.0  162  0.306  0.378  100  0.58  
2           4.0  162  0.315  0.403    7  0.57  
3           NaN  162  0.331  0.428  -72  0.43  
4           NaN  162  0.335  0.424 -146  0.38  
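To get a feel for how the two approaches above compare, the sketch below times them on a synthetic 1,000-row DataFrame. The data, the name `toy_df`, and the helpers `with_iloc` and `with_arrays` are all assumptions made for this example, and the timings will vary by machine, so no numbers are quoted here. It uses .to_numpy(), which the pandas documentation recommends over .values; for these integer columns both return a NumPy array.

```python
import timeit

import numpy as np
import pandas as pd

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc, 2)

# Synthetic stand-in for the baseball data: made-up win totals, 162 games each
toy_df = pd.DataFrame({'W': np.arange(1000) % 60 + 50, 'G': 162})

def with_iloc():
    # One .iloc lookup per row, then one function call per row
    results = []
    for i in range(len(toy_df)):
        row = toy_df.iloc[i]
        results.append(calc_win_perc(row['W'], row['G']))
    return results

def with_arrays():
    # One function call on the two underlying NumPy arrays
    return calc_win_perc(toy_df['W'].to_numpy(), toy_df['G'].to_numpy())

print('iloc loop:', timeit.timeit(with_iloc, number=3))
print('arrays   :', timeit.timeit(with_arrays, number=3))
```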


### Bringing it all together: Predict win percentage
A pandas DataFrame (pit_df) has been loaded into your session. For convenience, a dictionary describing each column within pit_df has been printed into your console. You can reference these descriptions throughout the exercise.

You'd like to attempt to predict a team's win percentage for a given season by using the team's total runs scored in a season ('RS') and total runs allowed in a season ('RA') with the function predict_win_perc, defined in the cell below.

Let's compare the approaches you've learned to calculate a predicted win percentage for each season (or row) in your DataFrame.

In [35]:
# function 
def predict_win_perc(RS, RA):
    prediction = RS ** 2 / (RS ** 2 + RA ** 2)
    return np.round(prediction, 2)

win_perc_preds_loop = []

# Use a loop and .itertuples() to collect each row's predicted win percentage
for row in pit_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)

# Apply predict_win_perc to each row of the DataFrame
win_perc_preds_apply = pit_df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)

# Calculate the win percentage predictions using NumPy arrays
win_perc_preds_np = predict_win_perc(pit_df['RS'].values, pit_df['RA'].values)
pit_df['WP_preds'] = win_perc_preds_np
print(pit_df.head())

  Team League  Year   RS   RA   W    OBP    SLG     BA  Playoffs  RankSeason  \
0  ARI     NL  2012  734  688  81  0.328  0.418  0.259         0         NaN   
1  ATL     NL  2012  700  600  94  0.320  0.389  0.247         1         4.0   
2  BAL     AL  2012  712  705  93  0.311  0.417  0.247         1         5.0   
3  BOS     AL  2012  734  806  69  0.315  0.415  0.260         0         NaN   
4  CHC     NL  2012  613  759  61  0.302  0.378  0.240         0         NaN   

   RankPlayoffs    G   OOBP   OSLG   RD    WP  WP_preds  
0           NaN  162  0.317  0.415   46  0.50      0.53  
1           5.0  162  0.306  0.378  100  0.58      0.58  
2           4.0  162  0.315  0.403    7  0.57      0.50  
3           NaN  162  0.331  0.428  -72  0.43      0.45  
4           NaN  162  0.335  0.424 -146  0.38      0.39  
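The cell above computes the predictions three ways but only stores the NumPy result. As a quick sanity check, the sketch below runs all three approaches on the five 2012 rows shown in the head() output and confirms that they agree. The name `sample_df` and the `preds_*` variables are assumptions for the example.

```python
import numpy as np
import pandas as pd

def predict_win_perc(RS, RA):
    prediction = RS ** 2 / (RS ** 2 + RA ** 2)
    return np.round(prediction, 2)

# RS and RA for ARI, ATL, BAL, BOS and CHC in 2012, copied from the output
sample_df = pd.DataFrame({
    'RS': [734, 700, 712, 734, 613],
    'RA': [688, 600, 705, 806, 759],
})

# 1. Loop with .itertuples()
preds_loop = [predict_win_perc(row.RS, row.RA) for row in sample_df.itertuples()]

# 2. Row-wise .apply()
preds_apply = sample_df.apply(
    lambda row: predict_win_perc(row['RS'], row['RA']), axis=1
)

# 3. Underlying NumPy arrays
preds_np = predict_win_perc(sample_df['RS'].to_numpy(), sample_df['RA'].to_numpy())

# All three approaches should produce the same predictions
print(np.allclose(preds_loop, preds_apply), np.allclose(preds_loop, preds_np))
```

The three results match the WP_preds column printed above, so the choice between the approaches comes down to speed and readability rather than correctness.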
