In [35]:
import pandas as pd
import numpy as np

Iterating with .iterrows()
==========================

In the video, we discussed that `.iterrows()` yields each DataFrame row as an (index, `pandas` Series) tuple. But what does this mean? Let's explore with a few coding exercises.

A `pandas` DataFrame has been loaded into your session called `pit_df`. This DataFrame contains the stats for the Major League Baseball team named the Pittsburgh Pirates (abbreviated as `'PIT'`) from the year 2008 to the year 2012. It has been printed into your console for convenience.

### Instructions 1/4

*   Use `.iterrows()` to loop over `pit_df` and print each row. Save the first item from `.iterrows()` as `i` and the second as `row`.

In [3]:
# Your data
data = {
    'Team': ['PIT', 'PIT', 'PIT', 'PIT', 'PIT'],
    'League': ['NL', 'NL', 'NL', 'NL', 'NL'],
    'Year': [2012, 2011, 2010, 2009, 2008],
    'RS': [651, 610, 587, 636, 735],
    'RA': [674, 712, 866, 768, 884],
    'W': [79, 72, 57, 62, 67],
    'G': [162, 162, 162, 161, 162],
    'Playoffs': [0, 0, 0, 0, 0]
}

# Create DataFrame
pit_df = pd.DataFrame(data)

In [4]:
# Iterate over pit_df and print each row
for i,row in pit_df.iterrows():
    print(row)

Team         PIT
League        NL
Year        2012
RS           651
RA           674
W             79
G            162
Playoffs       0
Name: 0, dtype: object
Team         PIT
League        NL
Year        2011
RS           610
RA           712
W             72
G            162
Playoffs       0
Name: 1, dtype: object
Team         PIT
League        NL
Year        2010
RS           587
RA           866
W             57
G            162
Playoffs       0
Name: 2, dtype: object
Team         PIT
League        NL
Year        2009
RS           636
RA           768
W             62
G            161
Playoffs       0
Name: 3, dtype: object
Team         PIT
League        NL
Year        2008
RS           735
RA           884
W             67
G            162
Playoffs       0
Name: 4, dtype: object


### Instructions 2/4

*   Add **two** lines to the loop: one _before_ `print(row)` to print each index variable and one _after_ to print each row's type.

In [5]:
# Iterate over pit_df and print each index variable, row, and row type
for i,row in pit_df.iterrows():
    print(i)
    print(row)
    print(type(row))

0
Team         PIT
League        NL
Year        2012
RS           651
RA           674
W             79
G            162
Playoffs       0
Name: 0, dtype: object
<class 'pandas.core.series.Series'>
1
Team         PIT
League        NL
Year        2011
RS           610
RA           712
W             72
G            162
Playoffs       0
Name: 1, dtype: object
<class 'pandas.core.series.Series'>
2
Team         PIT
League        NL
Year        2010
RS           587
RA           866
W             57
G            162
Playoffs       0
Name: 2, dtype: object
<class 'pandas.core.series.Series'>
3
Team         PIT
League        NL
Year        2009
RS           636
RA           768
W             62
G            161
Playoffs       0
Name: 3, dtype: object
<class 'pandas.core.series.Series'>
4
Team         PIT
League        NL
Year        2008
RS           735
RA           884
W             67
G            162
Playoffs       0
Name: 4, dtype: object
<class 'pandas.core.series.Series'>


### Instructions 3/4
    
*   Instead of using `i` and `row` in the for statement to store the output of `.iterrows()`, use **one** variable named `row_tuple`.

In [6]:
# Use one variable instead of two to store the result of .iterrows()
for row_tuple in pit_df.iterrows():
    print(row_tuple)

(0, Team         PIT
League        NL
Year        2012
RS           651
RA           674
W             79
G            162
Playoffs       0
Name: 0, dtype: object)
(1, Team         PIT
League        NL
Year        2011
RS           610
RA           712
W             72
G            162
Playoffs       0
Name: 1, dtype: object)
(2, Team         PIT
League        NL
Year        2010
RS           587
RA           866
W             57
G            162
Playoffs       0
Name: 2, dtype: object)
(3, Team         PIT
League        NL
Year        2009
RS           636
RA           768
W             62
G            161
Playoffs       0
Name: 3, dtype: object)
(4, Team         PIT
League        NL
Year        2008
RS           735
RA           884
W             67
G            162
Playoffs       0
Name: 4, dtype: object)


### Instructions 4/4
    
*   Add a line in the for loop to print the type of each `row_tuple`.

In [7]:
# Print the row and type of each row
for row_tuple in pit_df.iterrows():
    print(row_tuple)
    print(type(row_tuple))

(0, Team         PIT
League        NL
Year        2012
RS           651
RA           674
W             79
G            162
Playoffs       0
Name: 0, dtype: object)
<class 'tuple'>
(1, Team         PIT
League        NL
Year        2011
RS           610
RA           712
W             72
G            162
Playoffs       0
Name: 1, dtype: object)
<class 'tuple'>
(2, Team         PIT
League        NL
Year        2010
RS           587
RA           866
W             57
G            162
Playoffs       0
Name: 2, dtype: object)
<class 'tuple'>
(3, Team         PIT
League        NL
Year        2009
RS           636
RA           768
W             62
G            161
Playoffs       0
Name: 3, dtype: object)
<class 'tuple'>
(4, Team         PIT
League        NL
Year        2008
RS           735
RA           884
W             67
G            162
Playoffs       0
Name: 4, dtype: object)
<class 'tuple'>


Nice work! Since `.iterrows()` yields each DataFrame row as an (index, `pandas` Series) tuple, you can either unpack this tuple and use the index and row values separately (as you did with `for i,row in pit_df.iterrows()`), or keep each result in tuple form (as you did with `for row_tuple in pit_df.iterrows()`).  
  
If using `i,row`, you can access things from the row using square brackets (i.e., `row['Team']`). If using `row_tuple`, you would have to specify which element of the tuple you'd like to access before grabbing the team name (i.e., `row_tuple[1]['Team']`).  
  
With either approach, using `.iterrows()` will still be substantially faster than using `.iloc` as you saw in the video.
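
The two access patterns described above can be sketched on a small stand-in DataFrame. The name `demo_df` and its two rows are assumptions made for illustration; only the column names come from the exercise.

```python
import pandas as pd

# Hypothetical two-row stand-in for pit_df (illustration only)
demo_df = pd.DataFrame({'Team': ['PIT', 'PIT'],
                        'Year': [2012, 2011],
                        'W': [79, 72]})

# Unpacked form: the index and the row Series land in separate variables
teams_unpacked = [row['Team'] for i, row in demo_df.iterrows()]

# Tuple form: element 0 is the index, element 1 is the row Series
indexes = [row_tuple[0] for row_tuple in demo_df.iterrows()]
teams_from_tuple = [row_tuple[1]['Team'] for row_tuple in demo_df.iterrows()]

print(teams_unpacked)
print(indexes)
print(teams_from_tuple)
```

Both forms read the same values; unpacking simply saves the extra `[1]` lookup on every access.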

Run differentials with .iterrows()
==================================

You've been hired by the San Francisco Giants as an analyst. Congrats! The team's owner wants you to calculate a metric called the _run differential_ for each season from the year 2008 to 2012. This metric is calculated by subtracting the total number of runs a team allowed in a season from the team's total number of runs scored in that season. `'RS'` means runs scored and `'RA'` means runs allowed.

The below function calculates this metric:

In [8]:
def calc_run_diff(runs_scored, runs_allowed):

    run_diff = runs_scored - runs_allowed

    return run_diff


A DataFrame has been loaded into your session as `giants_df` and printed into the console. Let's practice using `.iterrows()` to add a _run differential_ column to this DataFrame.

### Instructions 1/4

*   Create an empty list called `run_diffs` that will be used to store the _run differentials_ you will calculate.

In [9]:
giants_dict = {'G': {0: 162, 1: 162, 2: 162, 3: 162, 4: 162},
 'League': {0: 'NL', 1: 'NL', 2: 'NL', 3: 'NL', 4: 'NL'},
 'Playoffs': {0: 1, 1: 0, 2: 1, 3: 0, 4: 0},
 'RA': {0: 649, 1: 578, 2: 583, 3: 611, 4: 759},
 'RS': {0: 718, 1: 570, 2: 697, 3: 657, 4: 640},
 'Team': {0: 'SFG', 1: 'SFG', 2: 'SFG', 3: 'SFG', 4: 'SFG'},
 'W': {0: 94, 1: 86, 2: 92, 3: 88, 4: 72},
 'Year': {0: 2012, 1: 2011, 2: 2010, 3: 2009, 4: 2008}}

giants_df = pd.DataFrame(giants_dict)

In [10]:
# Create an empty list to store run differentials
run_diffs = []

### Instructions 2/4

*   Write a for loop that uses `.iterrows()` to loop over `giants_df` and collects each row's runs scored and runs allowed.

In [11]:
# Create an empty list to store run differentials
run_diffs = []

# Write a for loop and collect runs allowed and runs scored for each row
for i,row in giants_df.iterrows():
    runs_scored = row['RS']
    runs_allowed = row['RA']

### Instructions 3/4

*   Add a line to the for loop that uses the provided function to calculate each row's _run differential_.

In [12]:
# Create an empty list to store run differentials
run_diffs = []

# Write a for loop and collect runs allowed and runs scored for each row
for i,row in giants_df.iterrows():
    runs_scored = row['RS']
    runs_allowed = row['RA']
    
    # Use the provided function to calculate run_diff for each row
    run_diff = calc_run_diff(runs_scored, runs_allowed)

### Instructions 4/4

*   Add a line to the loop that appends each row's _run differential_ to the `run_diffs` list.

In [13]:
# Create an empty list to store run differentials
run_diffs = []

# Write a for loop and collect runs allowed and runs scored for each row
for i,row in giants_df.iterrows():
    runs_scored = row['RS']
    runs_allowed = row['RA']
    
    # Use the provided function to calculate run_diff for each row
    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    # Append each run differential to the output list
    run_diffs.append(run_diff)

giants_df['RD'] = run_diffs
print(giants_df)

     G League  Playoffs   RA   RS Team   W  Year   RD
0  162     NL         1  649  718  SFG  94  2012   69
1  162     NL         0  578  570  SFG  86  2011   -8
2  162     NL         1  583  697  SFG  92  2010  114
3  162     NL         0  611  657  SFG  88  2009   46
4  162     NL         0  759  640  SFG  72  2008 -119


Great job! Take a look at the `giants_df` DataFrame with the new run differential column (`'RD'`) you created (it has been printed in the console).  
  
The `'Playoffs'` column tells you if a team made the playoffs for a given season. A `1` means that the team made the playoffs in that season and a `0` means the team did not make the playoffs in that season.  
  
Did you notice that in the seasons with the highest run differentials the Giants made the playoffs? In fact, in both of these seasons (2010 and 2012), the San Francisco Giants not only made the playoffs but also won the World Series! Cool!
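
As a cross-check on the loop, the same values can be obtained by subtracting the two columns directly. This is only a sketch on three of the Giants seasons shown above (2012, 2011, and 2010); the name `mini_df` is an assumption, and the column subtraction is offered as a sanity check rather than as part of the exercise.

```python
import pandas as pd

def calc_run_diff(runs_scored, runs_allowed):
    return runs_scored - runs_allowed

# Hypothetical subset of giants_df: the 2012, 2011, and 2010 seasons
mini_df = pd.DataFrame({'RS': [718, 570, 697],
                        'RA': [649, 578, 583]})

# Loop-based calculation, as in the exercise
loop_diffs = []
for i, row in mini_df.iterrows():
    loop_diffs.append(int(calc_run_diff(row['RS'], row['RA'])))

# Column-wise subtraction for comparison
column_diffs = (mini_df['RS'] - mini_df['RA']).tolist()

print(loop_diffs)
print(loop_diffs == column_diffs)
```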

Iterating with .itertuples()
============================

Remember, `.itertuples()` returns each DataFrame row as a special data type called a **namedtuple**. You can look up an attribute within a namedtuple with a special syntax. Let's practice working with namedtuples.
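
To make the "special syntax" concrete, here is a minimal sketch that builds a namedtuple by hand and then compares it with what `.itertuples()` produces. The names `Season` and `demo_df` are hypothetical and used only for illustration.

```python
from collections import namedtuple

import pandas as pd

# A plain namedtuple: fields can be read by name (dot syntax) or by position
Season = namedtuple('Season', ['Year', 'W'])
season = Season(Year=2012, W=93)
print(season.Year, season[1])

# .itertuples() yields the same kind of object, with the index as the first field
demo_df = pd.DataFrame({'Year': [2012, 2011], 'W': [93, 96]})
first_row = next(demo_df.itertuples())
print(first_row._fields)
print(first_row.Index, first_row.Year, first_row.W)
```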

A `pandas` DataFrame called `rangers_df` has been loaded into your session. This DataFrame contains the stats (`'Team'`, `'League'`, `'Year'`, `'RS'`, `'RA'`, `'W'`, `'G'`, and `'Playoffs'`) for the Major League Baseball team named the Texas Rangers (abbreviated as `'TEX'`).

### Instructions 1/3
*   Use `.itertuples()` to loop over `rangers_df` and print each row.

In [14]:
# Variables

rangers_dict = {'G': {0: 162,
       1: 162,
       2: 162,
       3: 162,
       4: 162,
       5: 162,
       6: 162,
       7: 162,
       8: 162,
       9: 162,
       10: 162,
       11: 162,
       12: 162,
       13: 162,
       14: 162,
       15: 162,
       16: 163,
       17: 162,
       18: 162,
       19: 162,
       20: 162,
       21: 162,
       22: 161,
       23: 162,
       24: 162,
       25: 161,
       26: 161,
       27: 163,
       28: 162,
       29: 163,
       30: 162,
       31: 162,
       32: 162,
       33: 162,
       34: 162,
       35: 161,
       36: 162},
 'League': {0: 'AL',
            1: 'AL',
            2: 'AL',
            3: 'AL',
            4: 'AL',
            5: 'AL',
            6: 'AL',
            7: 'AL',
            8: 'AL',
            9: 'AL',
            10: 'AL',
            11: 'AL',
            12: 'AL',
            13: 'AL',
            14: 'AL',
            15: 'AL',
            16: 'AL',
            17: 'AL',
            18: 'AL',
            19: 'AL',
            20: 'AL',
            21: 'AL',
            22: 'AL',
            23: 'AL',
            24: 'AL',
            25: 'AL',
            26: 'AL',
            27: 'AL',
            28: 'AL',
            29: 'AL',
            30: 'AL',
            31: 'AL',
            32: 'AL',
            33: 'AL',
            34: 'AL',
            35: 'AL',
            36: 'AL'},
 'Playoffs': {0: 1,
              1: 1,
              2: 1,
              3: 0,
              4: 0,
              5: 0,
              6: 0,
              7: 0,
              8: 0,
              9: 0,
              10: 0,
              11: 0,
              12: 0,
              13: 1,
              14: 1,
              15: 0,
              16: 1,
              17: 0,
              18: 0,
              19: 0,
              20: 0,
              21: 0,
              22: 0,
              23: 0,
              24: 0,
              25: 0,
              26: 0,
              27: 0,
              28: 0,
              29: 0,
              30: 0,
              31: 0,
              32: 0,
              33: 0,
              34: 0,
              35: 0,
              36: 0},
 'RA': {0: 707,
        1: 677,
        2: 687,
        3: 740,
        4: 967,
        5: 844,
        6: 784,
        7: 858,
        8: 794,
        9: 969,
        10: 882,
        11: 968,
        12: 974,
        13: 859,
        14: 871,
        15: 823,
        16: 799,
        17: 751,
        18: 753,
        19: 814,
        20: 696,
        21: 714,
        22: 735,
        23: 849,
        24: 743,
        25: 785,
        26: 714,
        27: 609,
        28: 749,
        29: 752,
        30: 698,
        31: 632,
        32: 657,
        33: 652,
        34: 733,
        35: 698,
        36: 844},
 'RS': {0: 808,
        1: 855,
        2: 787,
        3: 784,
        4: 901,
        5: 816,
        6: 835,
        7: 865,
        8: 860,
        9: 826,
        10: 843,
        11: 890,
        12: 848,
        13: 945,
        14: 940,
        15: 807,
        16: 928,
        17: 835,
        18: 682,
        19: 829,
        20: 676,
        21: 695,
        22: 637,
        23: 823,
        24: 771,
        25: 617,
        26: 656,
        27: 639,
        28: 590,
        29: 756,
        30: 750,
        31: 692,
        32: 767,
        33: 616,
        34: 714,
        35: 690,
        36: 619},
 'Team': {0: 'TEX',
          1: 'TEX',
          2: 'TEX',
          3: 'TEX',
          4: 'TEX',
          5: 'TEX',
          6: 'TEX',
          7: 'TEX',
          8: 'TEX',
          9: 'TEX',
          10: 'TEX',
          11: 'TEX',
          12: 'TEX',
          13: 'TEX',
          14: 'TEX',
          15: 'TEX',
          16: 'TEX',
          17: 'TEX',
          18: 'TEX',
          19: 'TEX',
          20: 'TEX',
          21: 'TEX',
          22: 'TEX',
          23: 'TEX',
          24: 'TEX',
          25: 'TEX',
          26: 'TEX',
          27: 'TEX',
          28: 'TEX',
          29: 'TEX',
          30: 'TEX',
          31: 'TEX',
          32: 'TEX',
          33: 'TEX',
          34: 'TEX',
          35: 'TEX',
          36: 'TEX'},
 'W': {0: 93,
       1: 96,
       2: 90,
       3: 87,
       4: 79,
       5: 75,
       6: 80,
       7: 79,
       8: 89,
       9: 71,
       10: 72,
       11: 73,
       12: 71,
       13: 95,
       14: 88,
       15: 77,
       16: 90,
       17: 86,
       18: 77,
       19: 85,
       20: 83,
       21: 83,
       22: 70,
       23: 75,
       24: 87,
       25: 62,
       26: 69,
       27: 77,
       28: 64,
       29: 76,
       30: 83,
       31: 87,
       32: 94,
       33: 76,
       34: 79,
       35: 83,
       36: 57},
 'Year': {0: 2012,
          1: 2011,
          2: 2010,
          3: 2009,
          4: 2008,
          5: 2007,
          6: 2006,
          7: 2005,
          8: 2004,
          9: 2003,
          10: 2002,
          11: 2001,
          12: 2000,
          13: 1999,
          14: 1998,
          15: 1997,
          16: 1996,
          17: 1993,
          18: 1992,
          19: 1991,
          20: 1990,
          21: 1989,
          22: 1988,
          23: 1987,
          24: 1986,
          25: 1985,
          26: 1984,
          27: 1983,
          28: 1982,
          29: 1980,
          30: 1979,
          31: 1978,
          32: 1977,
          33: 1976,
          34: 1975,
          35: 1974,
          36: 1973}}

rangers_df = pd.DataFrame(rangers_dict)

In [15]:
# Loop over the DataFrame and print each row
for row_namedtuple in rangers_df.itertuples():
  print(row_namedtuple)

Pandas(Index=0, G=162, League='AL', Playoffs=1, RA=707, RS=808, Team='TEX', W=93, Year=2012)
Pandas(Index=1, G=162, League='AL', Playoffs=1, RA=677, RS=855, Team='TEX', W=96, Year=2011)
Pandas(Index=2, G=162, League='AL', Playoffs=1, RA=687, RS=787, Team='TEX', W=90, Year=2010)
Pandas(Index=3, G=162, League='AL', Playoffs=0, RA=740, RS=784, Team='TEX', W=87, Year=2009)
Pandas(Index=4, G=162, League='AL', Playoffs=0, RA=967, RS=901, Team='TEX', W=79, Year=2008)
Pandas(Index=5, G=162, League='AL', Playoffs=0, RA=844, RS=816, Team='TEX', W=75, Year=2007)
Pandas(Index=6, G=162, League='AL', Playoffs=0, RA=784, RS=835, Team='TEX', W=80, Year=2006)
Pandas(Index=7, G=162, League='AL', Playoffs=0, RA=858, RS=865, Team='TEX', W=79, Year=2005)
Pandas(Index=8, G=162, League='AL', Playoffs=0, RA=794, RS=860, Team='TEX', W=89, Year=2004)
Pandas(Index=9, G=162, League='AL', Playoffs=0, RA=969, RS=826, Team='TEX', W=71, Year=2003)
Pandas(Index=10, G=162, League='AL', Playoffs=0, RA=882, RS=843, Team=

### Instructions 2/3
    
*   Loop over `rangers_df` with `.itertuples()` and save each row's `Index`, `Year`, and Wins (`W`) attribute as `i`, `year`, and `wins`.

In [16]:
# Loop over the DataFrame and print each row's Index, Year and Wins (W)
for row in rangers_df.itertuples():
  i = row.Index
  year = row.Year
  wins = row.W
  print(i, year, wins)

0 2012 93
1 2011 96
2 2010 90
3 2009 87
4 2008 79
5 2007 75
6 2006 80
7 2005 79
8 2004 89
9 2003 71
10 2002 72
11 2001 73
12 2000 71
13 1999 95
14 1998 88
15 1997 77
16 1996 90
17 1993 86
18 1992 77
19 1991 85
20 1990 83
21 1989 83
22 1988 70
23 1987 75
24 1986 87
25 1985 62
26 1984 69
27 1983 77
28 1982 64
29 1980 76
30 1979 83
31 1978 87
32 1977 94
33 1976 76
34 1975 79
35 1974 83
36 1973 57


### Instructions 3/3    
*   Now, loop over `rangers_df` and print these values **only for those rows** where the Rangers made the playoffs.

In [17]:
# Loop over the DataFrame and print each row's Index, Year and Wins (W)
for row in rangers_df.itertuples():
  i = row.Index
  year = row.Year
  wins = row.W
  
  # Check if rangers made Playoffs (1 means yes; 0 means no)
  if row.Playoffs == 1:
    print(i, year, wins)

0 2012 93
1 2011 96
2 2010 90
13 1999 95
14 1998 88
16 1996 90


Awesome! You're getting the hang of using `.itertuples()`. Remember, you need to use the _dot_ syntax for referencing an attribute in a **namedtuple**.  
  
You can create a new variable using a row's dot reference (as you did when storing `row.Index` as the variable `i`). Or you can use the row's dot reference directly to perform calculations and checks. Notice that you did not have to save `row.Playoffs` to a new variable in your check statement (you were able to use `row.Playoffs` directly in your check).  
  
Did you notice the pattern in the Texas Rangers' playoff appearances? There are only six appearances, and they fall into two distinct clusters (one from 1996 to 1999 and one from 2010 to 2012).
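
Because a dot reference can be used directly in a check, the playoff filter also fits in a single comprehension. The sketch below uses three Rangers seasons taken from the data above (2012, 2009, and 1999); `demo_df` is a hypothetical stand-in for `rangers_df`.

```python
import pandas as pd

# Hypothetical three-row stand-in for rangers_df
demo_df = pd.DataFrame({'Year': [2012, 2009, 1999],
                        'W': [93, 87, 95],
                        'Playoffs': [1, 0, 1]})

# row.Playoffs is used directly in the condition, without an intermediate variable
playoff_years = [row.Year for row in demo_df.itertuples() if row.Playoffs == 1]
print(playoff_years)
```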

Run differentials with .itertuples()
====================================

The New York Yankees have made a trade with the San Francisco Giants for your analyst contract. You're a hot commodity! Your new boss has seen your work with the Giants and now wants you to do something similar with the Yankees data. He'd like you to calculate _run differentials_ for the Yankees from the year 1962 to the year 2012 and find the season in which they had the best _run differential_.

You've remembered the function you used when working with the Giants and quickly write it down:

    def calc_run_diff(runs_scored, runs_allowed):
    
        run_diff = runs_scored - runs_allowed
    
        return run_diff
    

Let's use `.itertuples()` to loop over the `yankees_df` DataFrame (which has been loaded into your session) and calculate _run differentials_.

### Instructions 1/4

*   Use `.itertuples()` to loop over `yankees_df` and grab each row's runs scored and runs allowed values.

In [18]:
# Variables

yankees_dict = {'G': {0: 162,
       1: 162,
       2: 162,
       3: 162,
       4: 162,
       5: 162,
       6: 162,
       7: 162,
       8: 162,
       9: 163,
       10: 161,
       11: 161,
       12: 161,
       13: 162,
       14: 162,
       15: 162,
       16: 162,
       17: 162,
       18: 162,
       19: 162,
       20: 162,
       21: 161,
       22: 161,
       23: 162,
       24: 162,
       25: 161,
       26: 162,
       27: 162,
       28: 162,
       29: 162,
       30: 160,
       31: 163,
       32: 162,
       33: 159,
       34: 160,
       35: 162,
       36: 162,
       37: 162,
       38: 163,
       39: 162,
       40: 164,
       41: 163,
       42: 160,
       43: 162,
       44: 164,
       45: 161,
       46: 162},
 'League': {0: 'AL',
            1: 'AL',
            2: 'AL',
            3: 'AL',
            4: 'AL',
            5: 'AL',
            6: 'AL',
            7: 'AL',
            8: 'AL',
            9: 'AL',
            10: 'AL',
            11: 'AL',
            12: 'AL',
            13: 'AL',
            14: 'AL',
            15: 'AL',
            16: 'AL',
            17: 'AL',
            18: 'AL',
            19: 'AL',
            20: 'AL',
            21: 'AL',
            22: 'AL',
            23: 'AL',
            24: 'AL',
            25: 'AL',
            26: 'AL',
            27: 'AL',
            28: 'AL',
            29: 'AL',
            30: 'AL',
            31: 'AL',
            32: 'AL',
            33: 'AL',
            34: 'AL',
            35: 'AL',
            36: 'AL',
            37: 'AL',
            38: 'AL',
            39: 'AL',
            40: 'AL',
            41: 'AL',
            42: 'AL',
            43: 'AL',
            44: 'AL',
            45: 'AL',
            46: 'AL'},
 'Playoffs': {0: 1,
              1: 1,
              2: 1,
              3: 1,
              4: 0,
              5: 1,
              6: 1,
              7: 1,
              8: 1,
              9: 1,
              10: 1,
              11: 1,
              12: 1,
              13: 1,
              14: 1,
              15: 1,
              16: 1,
              17: 0,
              18: 0,
              19: 0,
              20: 0,
              21: 0,
              22: 0,
              23: 0,
              24: 0,
              25: 0,
              26: 0,
              27: 0,
              28: 0,
              29: 1,
              30: 0,
              31: 1,
              32: 1,
              33: 1,
              34: 0,
              35: 0,
              36: 0,
              37: 0,
              38: 0,
              39: 0,
              40: 0,
              41: 0,
              42: 0,
              43: 0,
              44: 1,
              45: 1,
              46: 1},
 'RA': {0: 668,
        1: 657,
        2: 693,
        3: 753,
        4: 727,
        5: 777,
        6: 767,
        7: 789,
        8: 808,
        9: 716,
        10: 697,
        11: 713,
        12: 814,
        13: 731,
        14: 656,
        15: 688,
        16: 787,
        17: 761,
        18: 746,
        19: 777,
        20: 749,
        21: 792,
        22: 748,
        23: 758,
        24: 738,
        25: 660,
        26: 679,
        27: 703,
        28: 716,
        29: 662,
        30: 672,
        31: 582,
        32: 651,
        33: 575,
        34: 588,
        35: 623,
        36: 610,
        37: 641,
        38: 612,
        39: 587,
        40: 531,
        41: 621,
        42: 612,
        43: 604,
        44: 577,
        45: 547,
        46: 680},
 'RS': {0: 804,
        1: 867,
        2: 859,
        3: 915,
        4: 789,
        5: 968,
        6: 930,
        7: 886,
        8: 897,
        9: 877,
        10: 897,
        11: 804,
        12: 871,
        13: 900,
        14: 965,
        15: 891,
        16: 871,
        17: 821,
        18: 733,
        19: 674,
        20: 603,
        21: 698,
        22: 772,
        23: 788,
        24: 797,
        25: 839,
        26: 758,
        27: 770,
        28: 709,
        29: 820,
        30: 734,
        31: 735,
        32: 831,
        33: 730,
        34: 681,
        35: 671,
        36: 641,
        37: 648,
        38: 680,
        39: 562,
        40: 536,
        41: 522,
        42: 611,
        43: 611,
        44: 730,
        45: 714,
        46: 817},
 'Team': {0: 'NYY',
          1: 'NYY',
          2: 'NYY',
          3: 'NYY',
          4: 'NYY',
          5: 'NYY',
          6: 'NYY',
          7: 'NYY',
          8: 'NYY',
          9: 'NYY',
          10: 'NYY',
          11: 'NYY',
          12: 'NYY',
          13: 'NYY',
          14: 'NYY',
          15: 'NYY',
          16: 'NYY',
          17: 'NYY',
          18: 'NYY',
          19: 'NYY',
          20: 'NYY',
          21: 'NYY',
          22: 'NYY',
          23: 'NYY',
          24: 'NYY',
          25: 'NYY',
          26: 'NYY',
          27: 'NYY',
          28: 'NYY',
          29: 'NYY',
          30: 'NYY',
          31: 'NYY',
          32: 'NYY',
          33: 'NYY',
          34: 'NYY',
          35: 'NYY',
          36: 'NYY',
          37: 'NYY',
          38: 'NYY',
          39: 'NYY',
          40: 'NYY',
          41: 'NYY',
          42: 'NYY',
          43: 'NYY',
          44: 'NYY',
          45: 'NYY',
          46: 'NYY'},
 'W': {0: 95,
       1: 97,
       2: 95,
       3: 103,
       4: 89,
       5: 94,
       6: 97,
       7: 95,
       8: 101,
       9: 101,
       10: 103,
       11: 95,
       12: 87,
       13: 98,
       14: 114,
       15: 96,
       16: 92,
       17: 88,
       18: 76,
       19: 71,
       20: 67,
       21: 74,
       22: 85,
       23: 89,
       24: 90,
       25: 97,
       26: 87,
       27: 91,
       28: 79,
       29: 103,
       30: 89,
       31: 100,
       32: 100,
       33: 97,
       34: 83,
       35: 89,
       36: 80,
       37: 81,
       38: 93,
       39: 80,
       40: 83,
       41: 72,
       42: 70,
       43: 77,
       44: 99,
       45: 104,
       46: 96},
 'Year': {0: 2012,
          1: 2011,
          2: 2010,
          3: 2009,
          4: 2008,
          5: 2007,
          6: 2006,
          7: 2005,
          8: 2004,
          9: 2003,
          10: 2002,
          11: 2001,
          12: 2000,
          13: 1999,
          14: 1998,
          15: 1997,
          16: 1996,
          17: 1993,
          18: 1992,
          19: 1991,
          20: 1990,
          21: 1989,
          22: 1988,
          23: 1987,
          24: 1986,
          25: 1985,
          26: 1984,
          27: 1983,
          28: 1982,
          29: 1980,
          30: 1979,
          31: 1978,
          32: 1977,
          33: 1976,
          34: 1975,
          35: 1974,
          36: 1973,
          37: 1971,
          38: 1970,
          39: 1969,
          40: 1968,
          41: 1967,
          42: 1966,
          43: 1965,
          44: 1964,
          45: 1963,
          46: 1962}}

yankees_df = pd.DataFrame(yankees_dict)

In [19]:
run_diffs = []

# Loop over the DataFrame and calculate each row's run differential
for row in yankees_df.itertuples():
    
    runs_scored = row.RS
    runs_allowed = row.RA

### Instructions 2/4

*   Now, calculate each row's _run differential_ using `calc_run_diff()`. Be sure to append each row's _run differential_ to `run_diffs`.

In [20]:
run_diffs = []

# Loop over the DataFrame and calculate each row's run differential
for row in yankees_df.itertuples():
    
    runs_scored = row.RS
    runs_allowed = row.RA

    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    run_diffs.append(run_diff)

### Instructions 3/4

*   Append a new column called `'RD'` to the `yankees_df` DataFrame that contains the _run differentials_ you calculated.

In [21]:
run_diffs = []

# Loop over the DataFrame and calculate each row's run differential
for row in yankees_df.itertuples():
    
    runs_scored = row.RS
    runs_allowed = row.RA

    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    run_diffs.append(run_diff)

# Append new column
yankees_df['RD'] = run_diffs
print(yankees_df)

      G League  Playoffs   RA   RS Team    W  Year   RD
0   162     AL         1  668  804  NYY   95  2012  136
1   162     AL         1  657  867  NYY   97  2011  210
2   162     AL         1  693  859  NYY   95  2010  166
3   162     AL         1  753  915  NYY  103  2009  162
4   162     AL         0  727  789  NYY   89  2008   62
5   162     AL         1  777  968  NYY   94  2007  191
6   162     AL         1  767  930  NYY   97  2006  163
7   162     AL         1  789  886  NYY   95  2005   97
8   162     AL         1  808  897  NYY  101  2004   89
9   163     AL         1  716  877  NYY  101  2003  161
10  161     AL         1  697  897  NYY  103  2002  200
11  161     AL         1  713  804  NYY   95  2001   91
12  161     AL         1  814  871  NYY   87  2000   57
13  162     AL         1  731  900  NYY   98  1999  169
14  162     AL         1  656  965  NYY  114  1998  309
15  162     AL         1  688  891  NYY   96  1997  203
16  162     AL         1  787  871  NYY   92  19

### Instructions 4/4

#### Question

*   In what year within your DataFrame did the New York Yankees have the highest _run differential_?

**You'll need to rerun the code that creates the `'RD'` column if you'd like to analyze the DataFrame with code rather than looking at the console output.**

#### Possible answers

*   In **2011** (with a _Run Differential_ of **210**)

*   In **1998** (with a _Run Differential_ of **309**)

*   In **1962** (with a _Run Differential_ of **503**)

*   In **1985** (with a _Run Differential_ of **315**)



In [22]:
# Find the row with max RD

yankees_df_maxRD = yankees_df[yankees_df['RD'] == yankees_df['RD'].max()]
print(yankees_df_maxRD)

      G League  Playoffs   RA   RS Team    W  Year   RD
14  162     AL         1  656  965  NYY  114  1998  309


Great job! You used `.itertuples()` to help the Yankees calculate _run differentials_. Remember, using `.itertuples()` is just like using `.iterrows()` except it tends to be faster. You also have to use a _dot_ reference when looking up attributes with `.itertuples()`.  
  
You found that the Yankees' highest _run differential_ was in 1998. Did you know they actually hold the record for the highest _run differential_ in an MLB season (411 in the year 1939, when they scored 967 runs and allowed 556)? Wow!
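
The sketch below shows the two iteration styles side by side on three Yankees seasons from the data above (2012, 2011, and 1998) and confirms they give the same run differentials. It also checks the arithmetic for the 1939 figure quoted above. The name `demo_df` is an assumption; no timing is measured here.

```python
import pandas as pd

def calc_run_diff(runs_scored, runs_allowed):
    return runs_scored - runs_allowed

# Hypothetical three-row stand-in for yankees_df
demo_df = pd.DataFrame({'Year': [2012, 2011, 1998],
                        'RS': [804, 867, 965],
                        'RA': [668, 657, 656]})

# .iterrows(): square-bracket lookups on a Series
diffs_iterrows = [int(calc_run_diff(row['RS'], row['RA']))
                  for i, row in demo_df.iterrows()]

# .itertuples(): dot lookups on a namedtuple
diffs_itertuples = [int(calc_run_diff(row.RS, row.RA))
                    for row in demo_df.itertuples()]

print(diffs_iterrows == diffs_itertuples)

# The 1939 season mentioned above: 967 runs scored, 556 allowed
print(calc_run_diff(967, 556))
```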

Analyzing baseball stats with .apply()
======================================

The Tampa Bay Rays want you to analyze their data.

They'd like the following metrics:

*   The sum of each column in the data
*   The total number of runs scored in the team's games each year (`'RS'` + `'RA'` for each year)
*   The `'Playoffs'` column in text format rather than using `1`'s and `0`'s

The below function can be used to convert the `'Playoffs'` column to text:

In [27]:
def text_playoffs(num_playoffs): 
    if num_playoffs == 1:
        return 'Yes'
    else:
        return 'No'

Use `.apply()` to get these metrics. A DataFrame (`rays_df`) has been loaded and printed to the console. This DataFrame is indexed on the `'Year'` column.

### Instructions 1/3

*   Apply `sum()` to each **column** of `rays_df` to collect the sum of each column. Be sure to specify the correct `axis`.



In [23]:
# Variables

rays_dict = {'Playoffs': {2008: 1, 2009: 0, 2010: 1, 2011: 1, 2012: 0},
 'RA': {2008: 671, 2009: 754, 2010: 649, 2011: 614, 2012: 577},
 'RS': {2008: 774, 2009: 803, 2010: 802, 2011: 707, 2012: 697},
 'W': {2008: 97, 2009: 84, 2010: 96, 2011: 91, 2012: 90}}

rays_df = pd.DataFrame(rays_dict)

In [24]:
# Gather sum of all columns
stat_totals = rays_df.apply(sum, axis=0)
print(stat_totals)

Playoffs       3
RA          3265
RS          3783
W            458
dtype: int64


### Instructions 2/3
    
*   Apply `sum()` to each **row** of `rays_df`, only looking at the `'RS'` and `'RA'` columns, and specify the correct `axis`.



In [25]:
# Gather total runs scored in all games per year
total_runs_scored = rays_df[['RS', 'RA']].apply(sum, axis=1)
print(total_runs_scored)

2008    1445
2009    1557
2010    1451
2011    1321
2012    1274
dtype: int64


### Instructions 3/3
    
*   Use `.apply()` and a `lambda` function to apply `text_playoffs()` to each **row**'s `'Playoffs'` value of the `rays_df` DataFrame.

In [28]:
# Convert numeric playoffs to text by applying text_playoffs()
textual_playoffs = rays_df.apply(lambda row: text_playoffs(row['Playoffs']), axis=1)
print(textual_playoffs)

2008    Yes
2009     No
2010    Yes
2011    Yes
2012     No
dtype: object


Great work! The `.apply()` method lets you apply functions to all rows or columns of a DataFrame by specifying an axis.  
  
If you've been using `pandas` for some time, you may have noticed that a better way to find these stats would use the `pandas` built-in `.sum()` method.  
  
You could have used `rays_df.sum(axis=0)` to get columnar sums and `rays_df[['RS', 'RA']].sum(axis=1)` to get row sums.  
  
You could have also used `.apply()` **directly** on a Series (or column) of the DataFrame. For example, you could use `rays_df['Playoffs'].apply(text_playoffs)` to convert the `'Playoffs'` column to text.
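
The sketch below puts the alternatives mentioned above side by side with the `.apply()` versions from the exercise, so you can confirm they agree. It rebuilds `rays_df` from the values shown earlier; the comparison variable names are chosen here for illustration.

```python
import pandas as pd

def text_playoffs(num_playoffs):
    if num_playoffs == 1:
        return 'Yes'
    else:
        return 'No'

# Same data as rays_dict in the exercise.
rays_df = pd.DataFrame({
    'Playoffs': {2008: 1, 2009: 0, 2010: 1, 2011: 1, 2012: 0},
    'RA': {2008: 671, 2009: 754, 2010: 649, 2011: 614, 2012: 577},
    'RS': {2008: 774, 2009: 803, 2010: 802, 2011: 707, 2012: 697},
    'W': {2008: 97, 2009: 84, 2010: 96, 2011: 91, 2012: 90},
})

# Column sums: .apply(sum, axis=0) versus the built-in .sum(axis=0)
col_sums_apply = rays_df.apply(sum, axis=0)
col_sums_builtin = rays_df.sum(axis=0)

# Row sums over RS and RA: .apply(sum, axis=1) versus .sum(axis=1)
row_sums_apply = rays_df[['RS', 'RA']].apply(sum, axis=1)
row_sums_builtin = rays_df[['RS', 'RA']].sum(axis=1)

# Text conversion: row-wise lambda versus .apply() directly on the column
playoffs_rowwise = rays_df.apply(lambda row: text_playoffs(row['Playoffs']), axis=1)
playoffs_series = rays_df['Playoffs'].apply(text_playoffs)

print((col_sums_apply == col_sums_builtin).all())
print((row_sums_apply == row_sums_builtin).all())
print((playoffs_rowwise == playoffs_series).all())
```

Each comparison should print `True`. The built-in and Series-level forms are shorter and avoid building a Series object for every row.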

In [30]:
# Variables

dbacks_dict = {'G': {0: 162,
       1: 162,
       2: 162,
       3: 162,
       4: 162,
       5: 162,
       6: 162,
       7: 162,
       8: 162,
       9: 162,
       10: 162,
       11: 162,
       12: 162,
       13: 162,
       14: 162},
 'League': {0: 'NL',
            1: 'NL',
            2: 'NL',
            3: 'NL',
            4: 'NL',
            5: 'NL',
            6: 'NL',
            7: 'NL',
            8: 'NL',
            9: 'NL',
            10: 'NL',
            11: 'NL',
            12: 'NL',
            13: 'NL',
            14: 'NL'},
 'Playoffs': {0: 0,
              1: 1,
              2: 0,
              3: 0,
              4: 0,
              5: 1,
              6: 0,
              7: 0,
              8: 0,
              9: 0,
              10: 1,
              11: 1,
              12: 0,
              13: 1,
              14: 0},
 'RA': {0: 688,
        1: 662,
        2: 836,
        3: 782,
        4: 706,
        5: 732,
        6: 788,
        7: 856,
        8: 899,
        9: 685,
        10: 674,
        11: 677,
        12: 754,
        13: 676,
        14: 812},
 'RS': {0: 734,
        1: 731,
        2: 713,
        3: 720,
        4: 720,
        5: 712,
        6: 773,
        7: 696,
        8: 615,
        9: 717,
        10: 819,
        11: 818,
        12: 792,
        13: 908,
        14: 665},
 'Team': {0: 'ARI',
          1: 'ARI',
          2: 'ARI',
          3: 'ARI',
          4: 'ARI',
          5: 'ARI',
          6: 'ARI',
          7: 'ARI',
          8: 'ARI',
          9: 'ARI',
          10: 'ARI',
          11: 'ARI',
          12: 'ARI',
          13: 'ARI',
          14: 'ARI'},
 'W': {0: 81,
       1: 94,
       2: 65,
       3: 70,
       4: 82,
       5: 90,
       6: 76,
       7: 77,
       8: 51,
       9: 84,
       10: 98,
       11: 92,
       12: 85,
       13: 100,
       14: 65},
 'Year': {0: 2012,
          1: 2011,
          2: 2010,
          3: 2009,
          4: 2008,
          5: 2007,
          6: 2006,
          7: 2005,
          8: 2004,
          9: 2003,
          10: 2002,
          11: 2001,
          12: 2000,
          13: 1999,
          14: 1998}}

dbacks_df = pd.DataFrame(dbacks_dict)

In [31]:
# Display the first five rows of the DataFrame
print(dbacks_df.head())

     G League  Playoffs   RA   RS Team   W  Year
0  162     NL         0  688  734  ARI  81  2012
1  162     NL         1  662  731  ARI  94  2011
2  162     NL         0  836  713  ARI  65  2010
3  162     NL         0  782  720  ARI  70  2009
4  162     NL         0  706  720  ARI  82  2008


Settle a debate with .apply()
=============================

Word has gotten to the Arizona Diamondbacks about your awesome analytics skills. They'd like for you to help settle a debate amongst the managers. One manager claims that the team has made the playoffs every year they have had a win percentage of `0.50` or greater. Another manager says this is not true.

Let's use the below function and the `.apply()` method to see which manager is correct.

In [32]:
def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc,2)

A DataFrame named `dbacks_df` has been loaded into your session.

### Instructions 1/4

*   Print the first five rows of the `dbacks_df` DataFrame to see what the data looks like.

In [33]:
# Display the first five rows of the DataFrame
print(dbacks_df.head())

     G League  Playoffs   RA   RS Team   W  Year
0  162     NL         0  688  734  ARI  81  2012
1  162     NL         1  662  731  ARI  94  2011
2  162     NL         0  836  713  ARI  65  2010
3  162     NL         0  782  720  ARI  70  2009
4  162     NL         0  706  720  ARI  82  2008


### Instructions 2/4

*   Create a `pandas` Series called `win_percs` by _applying_ the `calc_win_perc()` function to each **row** of the DataFrame with a `lambda` function.

In [36]:
# Display the first five rows of the DataFrame
print(dbacks_df.head())

# Create a win percentage Series 
win_percs = dbacks_df.apply(lambda row: calc_win_perc(row['W'], row['G']), axis=1)
print(win_percs, '\n')

     G League  Playoffs   RA   RS Team   W  Year
0  162     NL         0  688  734  ARI  81  2012
1  162     NL         1  662  731  ARI  94  2011
2  162     NL         0  836  713  ARI  65  2010
3  162     NL         0  782  720  ARI  70  2009
4  162     NL         0  706  720  ARI  82  2008
0     0.50
1     0.58
2     0.40
3     0.43
4     0.51
5     0.56
6     0.47
7     0.48
8     0.31
9     0.52
10    0.60
11    0.57
12    0.52
13    0.62
14    0.40
dtype: float64 



### Instructions 3/4

*   Create a new column in `dbacks_df` called `WP` that contains the win percentages you calculated in the above step.

In [37]:
# Display the first five rows of the DataFrame
print(dbacks_df.head())

# Create a win percentage Series 
win_percs = dbacks_df.apply(lambda row: calc_win_perc(row['W'], row['G']), axis=1)
print(win_percs, '\n')

# Append a new column to dbacks_df
dbacks_df['WP'] = win_percs
print(dbacks_df, '\n')

# Display dbacks_df where WP is greater than or equal to 0.50
print(dbacks_df[dbacks_df['WP'] >= 0.50])

     G League  Playoffs   RA   RS Team   W  Year
0  162     NL         0  688  734  ARI  81  2012
1  162     NL         1  662  731  ARI  94  2011
2  162     NL         0  836  713  ARI  65  2010
3  162     NL         0  782  720  ARI  70  2009
4  162     NL         0  706  720  ARI  82  2008
0     0.50
1     0.58
2     0.40
3     0.43
4     0.51
5     0.56
6     0.47
7     0.48
8     0.31
9     0.52
10    0.60
11    0.57
12    0.52
13    0.62
14    0.40
dtype: float64 

      G League  Playoffs   RA   RS Team    W  Year    WP
0   162     NL         0  688  734  ARI   81  2012  0.50
1   162     NL         1  662  731  ARI   94  2011  0.58
2   162     NL         0  836  713  ARI   65  2010  0.40
3   162     NL         0  782  720  ARI   70  2009  0.43
4   162     NL         0  706  720  ARI   82  2008  0.51
5   162     NL         1  732  712  ARI   90  2007  0.56
6   162     NL         0  788  773  ARI   76  2006  0.47
7   162     NL         0  856  696  ARI   77  2005  0.48
8   162    

### Instructions 4/4

#### Question

*   Which manager was correct in their claim?

#### Possible answers

*   The manager who claimed the team **made** the playoffs every year they've had a win percentage of `0.50` or greater.

*   **The manager who claimed the team **has not made** the playoffs every year they've had a win percentage of `0.50` or greater.**

*   Both managers are crazy! The Arizona Diamondbacks have never made the playoffs.
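
One way to check the managers' claims directly is to look for counterexamples: seasons with a win percentage of at least `0.50` and no playoff appearance. The sketch below is only an illustration. The `dbacks` DataFrame is a trimmed copy of the `'Year'`, `'W'`, `'G'`, and `'Playoffs'` values from `dbacks_dict`, and the win percentage is computed on whole columns rather than with `.apply()`.

```python
import numpy as np
import pandas as pd

# Trimmed copy of dbacks_dict: only the columns needed for this check.
dbacks = pd.DataFrame({
    'Year': [2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005,
             2004, 2003, 2002, 2001, 2000, 1999, 1998],
    'W': [81, 94, 65, 70, 82, 90, 76, 77, 51, 84, 98, 92, 85, 100, 65],
    'G': [162] * 15,
    'Playoffs': [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0],
})

# Win percentage rounded to two decimals, as calc_win_perc() does.
dbacks['WP'] = np.round(dbacks['W'] / dbacks['G'], 2)

# Seasons at or above 0.50 that still missed the playoffs.
counterexamples = dbacks[(dbacks['WP'] >= 0.50) & (dbacks['Playoffs'] == 0)]
print(counterexamples[['Year', 'W', 'WP', 'Playoffs']])
```

If `counterexamples` is non-empty, the first manager's claim cannot hold for every qualifying season.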

In [39]:
# Load baseball_csv

baseball_df = pd.read_csv('baseball_stats.csv')

Replacing .iloc with underlying arrays
======================================

Now that you have a better grasp of a DataFrame's internals, let's update one of your previous analyses to leverage a DataFrame's underlying arrays. You'll revisit the win percentage calculations you performed row by row with the `.iloc` indexer:

In [40]:
def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc,2)

win_percs_list = []

for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]

    wins = row['W']
    games_played = row['G']

    win_perc = calc_win_perc(wins, games_played)

    win_percs_list.append(win_perc)

baseball_df['WP'] = win_percs_list

Let's update this analysis to use arrays instead of the `.iloc` indexer. A DataFrame (`baseball_df`) has been loaded into your session.

### Instructions 1/3

*   Use _the right method_ to collect the underlying `'W'` and `'G'` arrays of `baseball_df` and pass them **directly into** the `calc_win_perc()` function. Store the result as a variable called `win_percs_np`.

Nicely done! Using the `.apply()` method with a `lambda` function allows you to apply a function to a DataFrame without the need to write a for loop.  
  
Sadly, the second manager was correct. In the years 2012, 2008, 2003, and 2000, the Arizona Diamondbacks had a win percentage greater than or equal to 0.50 but still **did not** make the playoffs.

In [41]:
# Use the W array and G array to calculate win percentages
win_percs_np = calc_win_perc(baseball_df['W'].values, baseball_df['G'].values)
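
The exercise uses the `.values` attribute. The pandas documentation recommends `.to_numpy()` for getting a column's underlying NumPy array, and for plain numeric columns the two should give the same result. A minimal sketch follows, where `sample_df` is a hypothetical stand-in holding the `'W'` and `'G'` values from the first five rows of `dbacks_df` rather than the real `baseball_df`:

```python
import numpy as np
import pandas as pd

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc, 2)

# Hypothetical stand-in: W and G from the first five rows of dbacks_df.
sample_df = pd.DataFrame({
    'W': [81, 94, 65, 70, 82],
    'G': [162, 162, 162, 162, 162],
})

# .values and .to_numpy() both expose the column as a NumPy array here.
wp_values = calc_win_perc(sample_df['W'].values, sample_df['G'].values)
wp_to_numpy = calc_win_perc(sample_df['W'].to_numpy(), sample_df['G'].to_numpy())

print(type(wp_to_numpy))
print(np.array_equal(wp_values, wp_to_numpy))
```

Because `calc_win_perc()` only uses division and `np.round()`, it works on whole arrays at once, which is what removes the need for a loop.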

### Instructions 2/3

*   Create a new column in `baseball_df` called `'WP'` that contains the win percentages you just calculated.

In [42]:
# Use the W array and G array to calculate win percentages
win_percs_np = calc_win_perc(baseball_df['W'].values, baseball_df['G'].values)

# Append a new column to baseball_df that stores all win percentages
baseball_df['WP'] = win_percs_np

print(baseball_df.head())

  Team League  Year   RS   RA   W    OBP    SLG     BA  Playoffs  RankSeason  \
0  ARI     NL  2012  734  688  81  0.328  0.418  0.259         0         NaN   
1  ATL     NL  2012  700  600  94  0.320  0.389  0.247         1         4.0   
2  BAL     AL  2012  712  705  93  0.311  0.417  0.247         1         5.0   
3  BOS     AL  2012  734  806  69  0.315  0.415  0.260         0         NaN   
4  CHC     NL  2012  613  759  61  0.302  0.378  0.240         0         NaN   

   RankPlayoffs    G   OOBP   OSLG    WP  
0           NaN  162  0.317  0.415  0.50  
1           5.0  162  0.306  0.378  0.58  
2           4.0  162  0.315  0.403  0.57  
3           NaN  162  0.331  0.428  0.43  
4           NaN  162  0.335  0.424  0.38  


### Instructions 3/3

#### Question

Use `timeit` in _cell magic mode_ **within your IPython console** to compare the runtimes between the old code block using `.iloc` and the new code you developed using NumPy arrays.

**Don't include the code that defines the `calc_win_perc()` function or the `print()` statements when timing**.

You should include **eight lines of code** when timing the old code block and **two lines of code** when timing the new code you developed. You may need to press `SHIFT+ENTER` when using `timeit` in _cell magic mode_ to get to a new line within your IPython console.

**Which approach was faster?**

#### Possible answers

*   The original code with `.iloc` is much faster than using NumPy arrays.

*   The old code block with `.iloc` and the new code with NumPy arrays have similar runtimes.

*   **The NumPy array approach is faster than the `.iloc` approach.**

In [54]:
%%timeit -r 10 -n 1000

# Use the W array and G array to calculate win percentages
win_percs_np = calc_win_perc(baseball_df['W'].values, baseball_df['G'].values)

# Append a new column to baseball_df that stores all win percentages
baseball_df['WP'] = win_percs_np

75.8 µs ± 7.42 µs per loop (mean ± std. dev. of 10 runs, 1,000 loops each)


In [56]:
%%timeit -r 2 -n 10

win_percs_list = []

for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]

    wins = row['W']
    games_played = row['G']

    win_perc = calc_win_perc(wins, games_played)

    win_percs_list.append(win_perc)

baseball_df['WP'] = win_percs_list

74.4 ms ± 1.59 ms per loop (mean ± std. dev. of 2 runs, 10 loops each)
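
If you are working outside IPython, where the `%%timeit` magic is unavailable, a similar comparison can be sketched with the standard library's `timeit` module. Everything in this example is an assumption made for illustration: `demo_df` is filled with random values rather than the real `baseball_df`, and the row count and repeat counts are arbitrary. The absolute numbers will differ from the timings above.

```python
import timeit

import numpy as np
import pandas as pd

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc, 2)

# Synthetic stand-in for baseball_df (random wins, for timing only).
rng = np.random.default_rng(0)
n_rows = 500
demo_df = pd.DataFrame({
    'W': rng.integers(40, 117, size=n_rows),
    'G': np.full(n_rows, 162),
})

def with_iloc():
    # Row-by-row lookup, as in the original code block.
    win_percs_list = []
    for i in range(len(demo_df)):
        row = demo_df.iloc[i]
        win_percs_list.append(calc_win_perc(row['W'], row['G']))
    return win_percs_list

def with_arrays():
    # One call on the underlying arrays.
    return calc_win_perc(demo_df['W'].values, demo_df['G'].values)

# Best average time per call, in seconds.
iloc_seconds = min(timeit.repeat(with_iloc, number=3, repeat=2)) / 3
array_seconds = min(timeit.repeat(with_arrays, number=100, repeat=3)) / 100

print(f".iloc loop: {iloc_seconds * 1e3:.2f} ms per call")
print(f"arrays:     {array_seconds * 1e6:.2f} µs per call")
```

Wrapping each approach in a function keeps the setup cost (building `demo_df`) out of the measurement.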


Bringing it all together: Predict win percentage
================================================

A `pandas` DataFrame (`baseball_df`) has been loaded into your session. For convenience, a dictionary describing each column within `baseball_df` has been printed into your console. You can reference these descriptions throughout the exercise.

You'd like to attempt to _predict_ a team's win percentage for a given season by using the team's total runs scored in a season (`'RS'`) and total runs allowed in a season (`'RA'`) with the following function:

In [58]:
def predict_win_perc(RS, RA):
    prediction = RS ** 2 / (RS ** 2 + RA ** 2)
    return np.round(prediction, 2)

Let's compare the approaches you've learned to calculate a _predicted win percentage_ for each season (or row) in your DataFrame.
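
Before comparing approaches, it can help to work through the formula for a single season. This form of prediction is commonly known in baseball analytics as the Pythagorean expectation. The sketch below uses the 2012 Arizona Diamondbacks figures from the data shown earlier (`RS` = 734, `RA` = 688, 81 wins in 162 games); the variable names are chosen here for illustration.

```python
import numpy as np

def predict_win_perc(RS, RA):
    prediction = RS ** 2 / (RS ** 2 + RA ** 2)
    return np.round(prediction, 2)

# 2012 Arizona Diamondbacks: runs scored, runs allowed, wins, games.
rs, ra, wins, games = 734, 688, 81, 162

# 734**2 = 538756 and 688**2 = 473344, so the ratio is 538756 / 1012100.
predicted = predict_win_perc(rs, ra)
actual = np.round(wins / games, 2)

print(predicted, actual)
```

The prediction depends only on the ratio of runs scored to runs allowed, so a team that scores more than it allows is always predicted to finish above `0.50`.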

### Instructions 1/4

*   Use a for loop and `.itertuples()` to predict the win percentage for each row of `baseball_df` with the `predict_win_perc()` function. Save each row's predicted wi

In [59]:
win_perc_preds_loop = []

# Use a loop and .itertuples() to collect each row's predicted win percentage
for row in baseball_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)

### Instructions 2/4

*   Apply `predict_win_perc()` to each row of the `baseball_df` DataFrame using a `lambda` function. Save the predicted win percentage as `win_perc_preds_apply`.

In [60]:
win_perc_preds_loop = []

# Use a loop and .itertuples() to collect each row's predicted win percentage
for row in baseball_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)

# Apply predict_win_perc to each row of the DataFrame
win_perc_preds_apply = baseball_df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)

### Instructions 3/4

*   Calculate the predicted win percentages by passing the underlying `'RS'` and `'RA'` **arrays** from `baseball_df` into `predict_win_perc()`. Save these predictions as `win_perc_preds_np`.

In [61]:
win_perc_preds_loop = []

# Use a loop and .itertuples() to collect each row's predicted win percentage
for row in baseball_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)

# Apply predict_win_perc to each row of the DataFrame
win_perc_preds_apply = baseball_df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)

# Calculate the win percentage predictions using NumPy arrays
win_perc_preds_np = predict_win_perc(baseball_df['RS'].values, baseball_df['RA'].values)
baseball_df['WP_preds'] = win_perc_preds_np
print(baseball_df.head())

  Team League  Year   RS   RA   W    OBP    SLG     BA  Playoffs  RankSeason  \
0  ARI     NL  2012  734  688  81  0.328  0.418  0.259         0         NaN   
1  ATL     NL  2012  700  600  94  0.320  0.389  0.247         1         4.0   
2  BAL     AL  2012  712  705  93  0.311  0.417  0.247         1         5.0   
3  BOS     AL  2012  734  806  69  0.315  0.415  0.260         0         NaN   
4  CHC     NL  2012  613  759  61  0.302  0.378  0.240         0         NaN   

   RankPlayoffs    G   OOBP   OSLG    WP  WP_preds  
0           NaN  162  0.317  0.415  0.50      0.53  
1           5.0  162  0.306  0.378  0.58      0.58  
2           4.0  162  0.315  0.403  0.57      0.50  
3           NaN  162  0.331  0.428  0.43      0.45  
4           NaN  162  0.335  0.424  0.38      0.39  


### Instructions 4/4

#### Question

Compare runtimes **within your IPython console** between **all three** approaches used to calculate the predicted win percentages.

Use **`%%timeit`** (_cell magic mode_) to time the **six lines of code** (not including comment lines) for the `.itertuples()` approach. You may need to press `SHIFT+ENTER` after entering `%%timeit` to get to a new line within your IPython console.

Use **`%timeit`** (_line magic mode_) to time the `.apply()` approach and the NumPy array approach separately. Each has only **one line of code** (not including comment lines).

**What is the order of approaches from fastest to slowest?**

#### Possible answers

*   The `.apply()` with a `lambda` function was the **fastest**, followed by the `.itertuples()` approach, and the array approach was **slowest**.

*   Using NumPy arrays was the **fastest** approach, followed by the `.itertuples()` approach, and the `.apply()` approach was **slowest**.

*   The `.itertuples()` approach was **fastest**, followed by the array approach, and the `.apply()` approach was **slowest**.

*   All three approaches had comparable runtimes.


In [73]:
%%timeit -r 10 -n 100

win_perc_preds_loop = []
for row in baseball_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)

10.4 ms ± 558 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)


In [74]:
%%timeit -r 10 -n 100

win_perc_preds_apply = baseball_df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)
baseball_df['WP_preds'] = win_perc_preds_apply

17.2 ms ± 1.49 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)


In [75]:
%%timeit -r 10 -n 100
win_perc_preds_np = predict_win_perc(baseball_df['RS'].values, baseball_df['RA'].values)
baseball_df['WP_preds'] = win_perc_preds_np

96.7 µs ± 19.1 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
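
Speed only matters if the approaches agree, so it is worth confirming that all three produce the same predictions. A minimal check follows, assuming a hypothetical `sample_df` built from the `'RS'` and `'RA'` values in the first five rows of `baseball_df` printed above:

```python
import numpy as np
import pandas as pd

def predict_win_perc(RS, RA):
    prediction = RS ** 2 / (RS ** 2 + RA ** 2)
    return np.round(prediction, 2)

# Hypothetical stand-in: RS and RA from the first five rows of baseball_df.
sample_df = pd.DataFrame({
    'RS': [734, 700, 712, 734, 613],
    'RA': [688, 600, 705, 806, 759],
})

# 1. Loop with .itertuples()
preds_loop = [predict_win_perc(row.RS, row.RA) for row in sample_df.itertuples()]

# 2. Row-wise .apply() with a lambda
preds_apply = sample_df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)

# 3. Underlying NumPy arrays
preds_np = predict_win_perc(sample_df['RS'].values, sample_df['RA'].values)

print(np.allclose(preds_loop, preds_np))
print(np.allclose(preds_apply.to_numpy(), preds_np))
```

Both checks should print `True`, and the values should match the first five entries of the `'WP_preds'` column shown earlier.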


Great job! That's a home run! You practiced using three different approaches to iterate over a `pandas` DataFrame and perform calculations. Did you notice that the `.itertuples()` approach beat the `.apply()` approach? Even though both of these implementations can be useful, you should default to using a DataFrame's underlying arrays to perform calculations.  
  
Take a look at your win percentage predictions (column `'WP_preds'`) and compare them to the actual win percentages (column `'WP'`). Not bad!  
  
You've done a great job throughout the course! Now, you are well on your way to writing efficient Python and `pandas` code!