# MLB opening day salaries

Let's start by poking at some MLB opening day salary data from 2017. The file lives here: `../data/mlb.csv`.

Let's also open the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/) in a new browser tab.

### Import pandas

We've already installed `pandas`, an external Python library that we'll use to analyze data. Now we just need to _import_ it so we can use its functionality in our script.

👉 For more details on installing and importing Python libraries, [see this notebook](../reference/Installing%20and%20importing%20modules%20and%20libraries.ipynb).

In [29]:
import pandas as pd

### Load the CSV

Next, we'll load the CSV into a pandas _data frame_, which is sort of like a virtual spreadsheet with rows and columns.

We'll take a _string_ -- some text sandwiched between two apostrophes, or two quotation marks -- with the path to our CSV and hand it off to the pandas [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function.

We'll assign the result to a variable called `df`. (The name of the `df` variable is arbitrary -- you could call it `banana` and things would still work, though people reading your notebook would be confused.)

👉 For more details on _strings_ (and other data types) and _variable assignment_, [see this notebook](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb).

👉 For more details on loading data into pandas, [see this notebook](../reference/Importing%20data%20into%20pandas.ipynb).

In [30]:
df = pd.read_csv('../data/mlb.csv')
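If you don't have the file handy, here's a minimal sketch of the same idea using a few rows that appear in this notebook's output. The `sample_csv` and `sample_df` names are made up for illustration, and the header row is an assumption based on the column names pandas reports for the real file.

```python
import io

import pandas as pd

# A few rows copied from this notebook's output; the real file at
# ../data/mlb.csv has many more rows. The header row is assumed.
sample_csv = """NAME,TEAM,POS,SALARY,START_YEAR,END_YEAR,YEARS
Clayton Kershaw,LAD,SP,33000000,2014,2020,7
Zack Greinke,ARI,SP,31876966,2016,2021,6
David Price,BOS,SP,30000000,2016,2022,7
"""

# read_csv() accepts a file path or a file-like object, so wrapping
# the text in StringIO lets us skip writing a file to disk
sample_df = pd.read_csv(io.StringIO(sample_csv))

print(sample_df.shape)
```

Either way, what comes back is a data frame, so everything below works the same on `sample_df` as it does on `df`.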

### Use `head()` to check out the data

Now that the dataframe is loaded with data, let's use the [`head()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method to see the first five rows of data.

In [31]:
df.head()

,NAME,TEAM,POS,SALARY,START_YEAR,END_YEAR,YEARS
0,Clayton Kershaw,LAD,SP,33000000,2014,2020,7
1,Zack Greinke,ARI,SP,31876966,2016,2021,6
2,David Price,BOS,SP,30000000,2016,2022,7
3,Miguel Cabrera,DET,1B,28000000,2014,2023,10
4,Justin Verlander,DET,SP,28000000,2013,2019,7


### Other ways to check out the dataframe

- [`.tail()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html) will get you the _last_ 5 rows of data
- `.columns` will list the column names
- [`.dtypes`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html) will list the data types of each column
- [`.info()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) will summarize the dataframe, including the number of non-null values in each column (handy for spotting columns with null values)
- [`.count()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html) will count the non-null values in each column
- [`.sample(5)`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html) will give you a random sample of 5 rows
- [`.shape`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.shape.html) will give you `(number of rows, number of columns)`
- [`.describe()`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.describe.html) will compute summary stats for the values in each numeric column

In [32]:
df.tail()

,NAME,TEAM,POS,SALARY,START_YEAR,END_YEAR,YEARS
863,Steve Selsky,BOS,RF,535000,2017,2017,1
864,Stuart Turner,CIN,C,535000,2017,2017,1
865,Vicente Campos,LAA,RP,535000,2017,2017,1
866,Wandy Peralta,CIN,RP,535000,2017,2017,1
867,Yandy Diaz,CLE,3B,535000,2017,2017,1


In [33]:
df.columns

Index(['NAME', 'TEAM', 'POS', 'SALARY', 'START_YEAR', 'END_YEAR', 'YEARS'], dtype='object')

In [34]:
df.dtypes

NAME          object
TEAM          object
POS           object
SALARY         int64
START_YEAR     int64
END_YEAR       int64
YEARS          int64
dtype: object

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 868 entries, 0 to 867
Data columns (total 7 columns):
NAME          868 non-null object
TEAM          868 non-null object
POS           868 non-null object
SALARY        868 non-null int64
START_YEAR    868 non-null int64
END_YEAR      868 non-null int64
YEARS         868 non-null int64
dtypes: int64(4), object(3)
memory usage: 47.5+ KB


In [36]:
df.count()

NAME          868
TEAM          868
POS           868
SALARY        868
START_YEAR    868
END_YEAR      868
YEARS         868
dtype: int64

In [37]:
df.sample(5)

,NAME,TEAM,POS,SALARY,START_YEAR,END_YEAR,YEARS
816,Manuel Margot,SD,CF,535600,2017,2017,1
75,Yadier Molina,STL,C,14200000,2013,2017,5
723,Michael Ynoa,CWS,RP,540000,2017,2017,1
753,Orlando Arcia,MIL,SS,538100,2017,2017,1
744,Mac Williamson,SF,RF,539000,2017,2017,1


In [38]:
df.shape

(868, 7)

In [39]:
df.describe()

,SALARY,START_YEAR,END_YEAR,YEARS
count,868.0,868.0,868.0,868.0
mean,4468069.0,2016.486175,2017.430876,1.9447
std,5948459.0,1.205923,1.163087,1.916764
min,535000.0,2008.0,2015.0,1.0
25%,545500.0,2017.0,2017.0,1.0
50%,1562500.0,2017.0,2017.0,1.0
75%,6000000.0,2017.0,2017.0,2.0
max,33000000.0,2017.0,2027.0,13.0


### Come up with a list of questions

Now that we have a general idea of our data, let's come up with a list of questions. For starters:

- What's the total, average and median salary for an MLB player?
- How many players are on each team?
- Which catchers make the most money?
- How many players make the league minimum?
- Which teams have the biggest payrolls?

Other questions?

### Q: What's the total, average and median salary for an MLB player?

If we were doing this in Excel, we'd probably scroll to the bottom of the worksheet and enter, in the SALARY column, `=SUM(D2:D869)`, and below that, `=AVERAGE(D2:D869)`, and then below that, `=MEDIAN(D2:D869)`. (With a header in row 1, our 868 players fill rows 2 through 869.) Here, we're going to select the values in the SALARY column and use a few built-in pandas methods to do the same math.

In pandas, to select a column of data, you can use dot notation (`df.SALARY`) or bracket notation (`df['SALARY']`). If your column name has spaces, you must use bracket notation.
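Here's a small sketch of the two notations side by side. The `demo` data frame and its `START YEAR` column (with a space in the name) are hypothetical, made up just to show where dot notation stops working.

```python
import pandas as pd

demo = pd.DataFrame({
    'SALARY': [33000000, 31876966],
    'START YEAR': [2014, 2016],  # hypothetical column name with a space
})

# Both notations hand back the same column
same = demo.SALARY.equals(demo['SALARY'])
print(same)

# A name with a space can only be reached with brackets;
# demo.START YEAR would be a syntax error
start_years = demo['START YEAR']
print(start_years.tolist())
```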

In [40]:
df.SALARY.sum()

3878284045

In [41]:
df.SALARY.mean()

4468069.176267281

In [42]:
df.SALARY.median()

1562500.0

You can also use the [`agg()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.agg.html) method to pass in multiple functions, including ones that you write yourself.

In [43]:
df.SALARY.agg(['mean', 'median', 'sum'])

mean      4.468069e+06
median    1.562500e+06
sum       3.878284e+09
Name: SALARY, dtype: float64
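As a rough sketch of the "ones that you write yourself" part, here's a homemade function passed to `agg()` alongside two built-in ones. The `iqr` function and the `salaries` series are made up for illustration; the five values are the top five salaries shown by `head()`.

```python
import pandas as pd

# The five salaries from the head() output earlier in this notebook
salaries = pd.Series(
    [33000000, 31876966, 30000000, 28000000, 28000000],
    name='SALARY',
)


def iqr(values):
    """Interquartile range: the gap between the 75th and 25th percentiles."""
    return values.quantile(0.75) - values.quantile(0.25)


# Built-in aggregations go in as strings, our own function goes in by name
summary = salaries.agg(['mean', 'median', iqr])
print(summary)
```

The custom function's name becomes its label in the result, so it's worth giving it a readable one.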

### Q: How many players are on each team?

To answer this question, we're going to use a method called [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) on the TEAM column. The equivalent operation in Excel would involve a pivot table. In SQL, it might be something like:

```sql
SELECT TEAM, COUNT(*)
FROM mlb
GROUP BY TEAM
ORDER BY 2 DESC
```

👉 For more details on grouping with `value_counts()`, [see this notebook](../reference/Grouping%20data%20in%20pandas.ipynb#Value-counts).
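If the SQL version is the one that clicks for you, here's a sketch showing `value_counts()` next to a more literal translation of that query with `groupby()`. The `demo` data frame is a made-up stand-in that holds just the five TEAM values from the `head()` output.

```python
import pandas as pd

# TEAM values from the first five rows of the data
demo = pd.DataFrame({'TEAM': ['LAD', 'ARI', 'BOS', 'DET', 'DET']})

# The shortcut: count each distinct value, largest count first
counts = demo.TEAM.value_counts()
print(counts)

# The longer way, closer to the SQL: GROUP BY, COUNT(*), ORDER BY ... DESC
by_group = demo.groupby('TEAM').size().sort_values(ascending=False)
print(by_group)
```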

In [44]:
df.TEAM.value_counts()

TEX    34
TB     32
COL    32
NYM    31
SD     31
SEA    31
LAD    31
CIN    31
BOS    31
STL    30
OAK    30
LAA    30
ATL    30
TOR    29
MIN    29
SF     28
KC     28
ARI    28
CWS    28
BAL    28
CLE    28
MIA    28
HOU    27
NYY    27
WSH    26
PHI    26
MIL    26
DET    26
PIT    26
CHC    26
Name: TEAM, dtype: int64

### Q: Which catchers make the most money?

To answer this question, first we'll _filter_ the dataframe to include only catchers. Then we'll sort the data descending and look at the top 10.

👉 For more details on filtering data in pandas, [see this notebook](../reference/Filtering%20columns%20and%20rows%20in%20pandas.ipynb).

First, we need to figure out how "catcher" is represented in our data. Let's use the [`unique()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) method to get a list of unique values in the `POS` column.

In [45]:
df.POS.unique()

array(['SP', '1B', 'RF', '2B', 'DH', 'CF', 'C', 'LF', '3B', 'SS', 'OF',
       'RP', 'P'], dtype=object)

Looks like we want to target records where the `POS` value is `C`.

To filter rows in a pandas dataframe, we write a condition that's true for the rows we want (`df['POS'] == 'C'`) and put it inside the square brackets of `df[]`. It's a little confusing at first.

In [46]:
catchers = df[df['POS'] == 'C']
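It can help to pull that one-liner apart into two steps. This is a sketch on a tiny made-up data frame (`demo`, with three players who appear elsewhere in this notebook), not the full dataset.

```python
import pandas as pd

demo = pd.DataFrame({
    'NAME': ['Buster Posey', 'Clayton Kershaw', 'Russell Martin'],
    'POS': ['C', 'SP', 'C'],
})

# Step 1: the comparison produces one True/False value per row
is_catcher = demo['POS'] == 'C'
print(is_catcher.tolist())  # [True, False, True]

# Step 2: handing that list of True/False values to demo[...] keeps
# only the rows marked True
demo_catchers = demo[is_catcher]
print(demo_catchers['NAME'].tolist())  # ['Buster Posey', 'Russell Martin']
```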

Now we want to sort these records top to bottom. To do that, we'll use the [`sort_values()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) method, which needs the name of the column to sort by ('SALARY'). We want to sort largest to smallest, so we'll also specify that `ascending=False`. Finally, we want to look at the top 10, so we'll tack on `.head(10)` to our method chain.

In [47]:
catchers.sort_values('SALARY', ascending=False).head(10)

,NAME,TEAM,POS,SALARY,START_YEAR,END_YEAR,YEARS
18,Buster Posey,SF,C,22177778,2013,2021,9
36,Russell Martin,TOR,C,20000000,2015,2019,5
52,Brian McCann,HOU,C,17000000,2014,2018,5
75,Yadier Molina,STL,C,14200000,2013,2017,5
77,Miguel Montero,CHC,C,14000000,2013,2017,5
108,Carlos Santana,CLE,C,12000000,2012,2016,5
129,Matt Wieters,WSH,C,10500000,2017,2017,1
143,Francisco Cervelli,PIT,C,9000000,2017,2019,3
151,Jason Castro,MIN,C,8500000,2017,2019,3
176,Devin Mesoraco,CIN,C,7325000,2015,2018,4


### Q: How many players make the league minimum?

First, we'll need to figure out what the [league minimum](https://www.statista.com/statistics/256187/minimum-salary-of-players-in-major-league-baseball/) is.

By definition, it's the lowest number in the salary data. We could also reasonably expect that number to occur more frequently than other numbers.

So first, let's use the [`min()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html) method to see what the lowest salary value is; then we'll use `value_counts()` to check the frequency. (If we wanted to get crazy, we could also get the [`mode()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html) of the SALARY column, which is the value that shows up most often -- there are always a zillion ways to skin the cat.)

In [48]:
df.SALARY.min()

535000

In [49]:
df.SALARY.mode()

0    535000
dtype: int64

In [50]:
df.SALARY.value_counts().head()

535000     50
540000     21
545000     14
2000000    13
4000000    13
Name: SALARY, dtype: int64

#### Bonus Q: What percentage of MLB players make the league minimum?

First, we can filter to get just the players who make the league minimum. Then we can use the built-in Python function `len()` to get the count. We can also use `len()` to count the records in our main data frame -- `df`, the one with all of the players in it -- and from there the math is straightforward: `(part / whole) * 100`

In [51]:
league_min = df[df.SALARY == df.SALARY.min()]

In [52]:
pct_minimum = (len(league_min) / len(df)) * 100
print(pct_minimum)

5.76036866359447
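As a quick sanity check, here's the same arithmetic with the two counts typed in by hand: 50 players at $535,000 (from the `value_counts()` output) out of 868 total (from `df.shape`). The variable names are made up for illustration.

```python
players_at_minimum = 50  # from the value_counts() output
total_players = 868      # from df.shape

# (part / whole) * 100
pct = (players_at_minimum / total_players) * 100

# Rounding makes the number friendlier to read
print(round(pct, 2))  # 5.76
```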


### Q: Which teams have the biggest payrolls?

To answer this question, we're again going to use the equivalent of an Excel pivot table. Our steps:

1. Select the two columns we're interested in: `['TEAM', 'SALARY']`
2. Use the [`groupby()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method to group the data by team
3. Use the [`sum()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.sum.html) method to sum salaries by team
4. Use the [`sort_values()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) method to sort the results descending

_Furthermore_, we're gonna chain these methods together and do it all in one whack. And we can use `\` at the end of the line to tell Python that we're _not quite done yet_.

👉 For more details on grouping data in pandas, [see this notebook](../reference/Grouping%20data%20in%20pandas.ipynb)

In [53]:
grouped = df[['TEAM', 'SALARY']].groupby('TEAM') \
                                .sum() \
                                .sort_values('SALARY', ascending=False)

In [54]:
grouped.head()

TEAM,SALARY
LAD,187989811
DET,180250600
TEX,178431396
SF,176531278
NYM,176284679
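The backslashes work, but there's another common way to split a long method chain across lines: wrap the whole thing in parentheses, and Python will keep reading until the closing one. Here's a sketch of that style on a tiny made-up data frame (`demo`) with four rows from the top of the salary list.

```python
import pandas as pd

demo = pd.DataFrame({
    'TEAM': ['LAD', 'DET', 'DET', 'ARI'],
    'SALARY': [33000000, 28000000, 28000000, 31876966],
})

# Same four steps as above, with no backslashes needed
# inside the parentheses
payrolls = (
    demo[['TEAM', 'SALARY']]
    .groupby('TEAM')
    .sum()
    .sort_values('SALARY', ascending=False)
)

print(payrolls)
```

Which style you pick is a matter of taste; the result is the same.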


### Reformatting how the salary looks

If you'd like to change how the `SALARY` column is being displayed, you can change the [formatting specification](https://docs.python.org/3/library/string.html#format-examples) by handing off a dictionary to the [`style.format()`](https://pandas.pydata.org/pandas-docs/stable/style.html#Finer-Control:-Display-Values) method of our `grouped` dataframe. The dictionary pairs a column name with the formatting to apply to that column.

👉 For more information on dictionaries, [see this notebook](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#Dictionaries)

👉 For more information on using Python string formatting to display numbers, [see this notebook](../reference/String%20formatting.ipynb#Formatting-numbers).

In [55]:
grouped.style.format(
    {'SALARY': '${:,}'.format}
)

TEAM,SALARY
LAD,"$187,989,811"
DET,"$180,250,600"
TEX,"$178,431,396"
SF,"$176,531,278"
NYM,"$176,284,679"
BOS,"$174,287,098"
NYY,"$170,389,199"
CHC,"$170,088,502"
WSH,"$162,742,157"
TOR,"$162,353,367"
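To see what that format specification does on its own, outside of pandas, here's a small sketch applying it to a single number (the Dodgers payroll from the table above). The variable names are made up for illustration.

```python
payroll = 187989811

# '${:,}' means: a literal dollar sign, then the number with
# commas as thousands separators
formatted = '${:,}'.format(payroll)
print(formatted)  # $187,989,811

# The same thing written as an f-string
also_formatted = f'${payroll:,}'
print(also_formatted)  # $187,989,811
```

Note that styling only changes how the numbers are displayed; the values in `grouped` are still integers you can do math on.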
