# MLB opening day salaries

Let's start by poking at some MLB opening day salary data from 2017. The file lives here: `../data/mlb.csv`.

Let's also open the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/) in a new browser tab.

### Import pandas

We've already installed `pandas`, an external Python library that we'll use to analyze data. Now we just need to _import_ it so we can use its functionality in our script.

👉 For more details on installing and importing Python libraries, [see this notebook](../reference/Installing%20and%20importing%20modules%20and%20libraries.ipynb).
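
Here's a minimal sketch of the import. The `pd` alias is just the common convention, not a requirement; the rest of the sketches in this notebook assume it.

```python
# `pd` is the conventional short alias for pandas
import pandas as pd

# quick sanity check that the import worked
print(pd.__version__)
```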

In [None]:
# import pandas


### Load the CSV

Next, we'll load the CSV into a pandas _data frame_, which is sort of like a virtual spreadsheet with rows and columns.

We'll take a _string_ -- some text sandwiched between two apostrophes, or two quotation marks -- with the path to our CSV and hand it off to the pandas [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function.

We'll assign the result to a variable called `df`. (The name of the `df` variable is arbitrary -- you could call it `banana` and things would still work, though people reading your notebook would be confused.)

👉 For more details on _strings_ (and other data types) and _variable assignment_, [see this notebook](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb).

👉 For more details on loading data into pandas, [see this notebook](../reference/Importing%20data%20into%20pandas.ipynb).
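
A minimal sketch of the loading step. So that it runs anywhere, it reads a few made-up rows from a string instead of the real file. The `NAME` column, the team abbreviations and the `SP` and `1B` position codes are assumptions for illustration; only `TEAM`, `POS` and `SALARY` are column names used in this notebook.

```python
import io
import pandas as pd

# Stand-in for ../data/mlb.csv: made-up rows, assumed column layout
csv_text = """NAME,TEAM,POS,SALARY
Player A,CHC,C,2500000
Player B,CWS,SP,535000
Player C,NYY,1B,12000000
"""

# With the real file, this would be: df = pd.read_csv('../data/mlb.csv')
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

`read_csv()` accepts either a path string or a file-like object, which is why the stand-in works the same way as the real path.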

In [None]:
# read csv into data frame


### Use `head()` to check out the data

Now that the dataframe is loaded with data, let's use the [`head()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method to see the first five rows of data.
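
A quick sketch of `head()` on a small made-up data frame (the rows and the `NAME` column are assumptions, not values from the real file):

```python
import pandas as pd

# Made-up stand-in rows; only the SALARY column name comes from this notebook
df = pd.DataFrame({
    'NAME': ['Player ' + letter for letter in 'ABCDEFG'],
    'SALARY': [535000, 535000, 2500000, 12000000, 6000000, 540000, 9000000],
})

first_five = df.head()   # no argument: the first 5 rows
first_two = df.head(2)   # or ask for a specific number of rows
print(first_five)
```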

In [None]:
# use head() to check out the data


### Other ways to check out the dataframe

- [`.tail()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html) will get you the _last_ 5 rows of data
- `.columns` will list the column names
- [`.dtypes`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html) will list the data types of each column
- [`.info()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) will let us know if any columns have null values in them
- [`.count()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html) will count the non-null values in each column
- [`.sample(5)`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html) will give you a sample of the data
- [`.shape`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.shape.html) will give you `(number of rows, number of columns)`
- [`.describe()`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.describe.html) will compute summary stats for the values in each numeric column
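
Here's a sketch that runs each of these on a tiny made-up data frame. The rows are assumptions, and one `TEAM` value is left empty on purpose so you can see how `info()` and `count()` report it.

```python
import pandas as pd

# Made-up stand-in rows, with one missing TEAM value on purpose
df = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D'],
    'TEAM': ['CHC', 'CWS', 'NYY', None],
    'SALARY': [2500000, 535000, 12000000, 535000],
})

print(df.tail(2))     # last 2 rows (the default is 5)
print(df.columns)     # column names
print(df.dtypes)      # data type of each column
df.info()             # prints its own report, including non-null counts
print(df.count())     # non-null values per column
print(df.sample(2))   # 2 rows picked at random
print(df.shape)       # (number of rows, number of columns)
print(df.describe())  # summary stats for the numeric columns
```

Note that `.columns`, `.dtypes` and `.shape` are attributes, so they take no parentheses.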

In [None]:
# tail()


In [None]:
# columns


In [None]:
# dtypes


In [None]:
# info()


In [None]:
# count()


In [None]:
# sample(5)


In [None]:
# shape


In [None]:
# describe()


### Come up with a list of questions

Now that we have a general idea of our data, let's come up with a list of questions. For starters:

- What's the total, average and median salary for an MLB player?
- How many players are on each team?
- Which catchers make the most money?
- How many players make the league minimum?
- Which teams have the biggest payrolls?

Other questions?

### Q: What's the total, average and median salary for an MLB player?

If we were doing this in Excel, we'd probably scroll to the bottom of the worksheet and enter, in the SALARY column, `=SUM(D2:D868)`, and below that, `=AVERAGE(D2:D868)`, and then below that, `=MEDIAN(D2:D868)`. Here, we're going to select the values in the SALARY column and use a couple of built-in pandas methods to do the same math.

In pandas, to select a column of data, you can use dot notation (`df.SALARY`) or bracket notation (`df['SALARY']`). If your column name has spaces, you must use bracket notation.
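
A minimal sketch of selecting the column and doing the math, using a handful of made-up salaries rather than the real data:

```python
import pandas as pd

# Made-up stand-in salaries
df = pd.DataFrame({
    'SALARY': [535000, 535000, 2500000, 12000000, 6000000],
})

salaries = df['SALARY']      # bracket notation; df.SALARY selects the same column

total = salaries.sum()       # like =SUM() in Excel
average = salaries.mean()    # like =AVERAGE()
middle = salaries.median()   # like =MEDIAN()

print(total, average, middle)
```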

In [None]:
# sum of salary column


In [None]:
# mean of salary column


In [None]:
# median of salary column


You can also use the [`agg()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.agg.html) method to pass in multiple functions, including ones that you write yourself.
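
As a sketch, here's `agg()` with a list of function names on the same kind of made-up salary column. It sticks to built-in names; passing your own function works too, but isn't shown here.

```python
import pandas as pd

# Made-up stand-in salaries
df = pd.DataFrame({
    'SALARY': [535000, 535000, 2500000, 12000000, 6000000],
})

# One call, three results, labeled by function name
summary = df['SALARY'].agg(['sum', 'mean', 'median'])
print(summary)
```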

In [None]:
# agg - sum, mean, median


### Q: How many players are on each team?

To answer this question, we're going to use a method called [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) on the TEAM column. The equivalent operation in Excel would involve a pivot table. In SQL, it might be something like:

```sql
SELECT TEAM, COUNT(*)
FROM mlb
GROUP BY TEAM
ORDER BY 2 DESC
```

👉 For more details on grouping with `value_counts()`, [see this notebook](../reference/Grouping%20data%20in%20pandas.ipynb#Value-counts).
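
A small sketch of `value_counts()`, assuming made-up rows and team abbreviations:

```python
import pandas as pd

# Made-up stand-in rows; the team abbreviations are assumptions
df = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E', 'Player F'],
    'TEAM': ['CHC', 'CHC', 'CHC', 'CWS', 'CWS', 'NYY'],
})

# Counts the rows for each team, sorted from most to fewest by default
team_counts = df['TEAM'].value_counts()
print(team_counts)
```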

In [None]:
# value_counts() of TEAM column


### Q: Which catchers make the most money?

To answer this question, first we'll _filter_ the dataframe to include only catchers. Then we'll sort the data descending by salary and look at the top 10.

👉 For more details on filtering data in pandas, [see this notebook](../reference/Filtering%20columns%20and%20rows%20in%20pandas.ipynb).

First, we need to figure out how "catcher" is represented in our data. Let's use the [`unique()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) method to get a list of unique values in the `POS` column.
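
A quick sketch of `unique()`. The position codes below are assumptions for illustration; the real file may use different ones.

```python
import pandas as pd

# Made-up stand-in rows; position codes are assumptions
df = pd.DataFrame({
    'POS': ['C', 'SP', 'C', '1B', 'SP'],
})

# Each distinct value once, in order of first appearance
positions = df['POS'].unique()
print(positions)
```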

In [None]:
# unique() on POS column


Looks like we want to target records where the `POS` value is `C`.

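To filter data in a pandas dataframe, we'll write the filtering condition inside the square brackets that follow `df`, like `df[condition]`. The condition produces a `True` or `False` value for every row, and pandas keeps only the `True` rows. It's a little confusing at first.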
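
Here's a sketch of the filter in two steps, so you can see the condition on its own first. The rows are made up, and the `is_catcher` name is just for illustration.

```python
import pandas as pd

# Made-up stand-in rows
df = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D'],
    'POS': ['C', 'SP', 'C', '1B'],
})

is_catcher = df['POS'] == 'C'   # a True/False value for each row
catchers = df[is_catcher]       # keeps only the True rows

# Usually written in one line: df[df['POS'] == 'C']
print(catchers)
```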

In [None]:
# filter for catchers on POS column


Now we want to sort these records top to bottom. To do that, we'll use the [`sort_values()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) method, which needs the name of the column to sort by ('SALARY'). We want to sort largest to smallest, so we'll also specify that `ascending=False`. Finally, we want to look at the top 10, so we'll tack on `.head(10)` to our method chain.
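
A sketch of the filter, sort and `head(10)` chain on made-up rows. With only three catchers in the stand-in data, `head(10)` simply returns all three.

```python
import pandas as pd

# Made-up stand-in rows
df = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E'],
    'POS': ['C', 'SP', 'C', '1B', 'C'],
    'SALARY': [2500000, 535000, 7000000, 12000000, 535000],
})

catchers = df[df['POS'] == 'C']

# Largest salary first, then keep at most 10 rows
top_catchers = catchers.sort_values('SALARY', ascending=False).head(10)
print(top_catchers)
```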

In [None]:
# sort values by SALARY column descending and take the top 10


### Q: How many players make the league minimum?

First, we'll need to figure out what the [league minimum](https://www.statista.com/statistics/256187/minimum-salary-of-players-in-major-league-baseball/) is.

By definition, it's the lowest number in the salary data. We could also reasonably expect that number to occur more frequently than other numbers.

So first, let's use the [`min()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html) method to see what the lowest salary value is; then we'll use `value_counts()` to check the frequency. (If we wanted to get crazy, we could also get the [`mode()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html) of the SALARY column -- there are always a zillion ways to skin the cat.)
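
A sketch of all three checks, assuming made-up salaries in which the lowest value is also the most common one:

```python
import pandas as pd

# Made-up stand-in salaries; the lowest value appears three times
df = pd.DataFrame({
    'SALARY': [535000, 535000, 535000, 2500000, 12000000, 540000],
})

lowest = df['SALARY'].min()

# mode() returns a Series because there can be ties; [0] takes the first value
most_common = df['SALARY'].mode()[0]

# How often each salary occurs, most frequent first
salary_counts = df['SALARY'].value_counts().head()

print(lowest, most_common)
print(salary_counts)
```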

In [None]:
# min() value of SALARY column


In [None]:
# take the mode() for kicks


In [None]:
# value_counts() on salary to verify + head()


#### Bonus Q: What percentage of MLB players make the league minimum?

First, we can filter to get just the players who make the league minimum. Then we can use the built-in Python function `len()` to get the count. We can also use `len()` to count the records in our main data frame -- `df`, the one with all of the players in it -- and from there the math is straightforward: `(part / whole) * 100`
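
Here's a sketch of that arithmetic on made-up salaries. The variable names are just for illustration.

```python
import pandas as pd

# Made-up stand-in salaries: 3 of the 6 players make the lowest value
df = pd.DataFrame({
    'SALARY': [535000, 535000, 535000, 2500000, 12000000, 540000],
})

league_min = df['SALARY'].min()
min_players = df[df['SALARY'] == league_min]

# (part / whole) * 100
pct_at_min = (len(min_players) / len(df)) * 100
print(pct_at_min)
```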

In [None]:
# filter to get just league minimum


In [None]:
# calculate the percent who make minimum

# and print it


### Q: Which teams have the biggest payrolls?

To answer this question, we're again going to use the equivalent of an Excel pivot table. Our steps:

1. Select the two columns we're interested in: `['TEAM', 'SALARY']`
2. Use the [`groupby()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method to group the data by team
3. Use the [`sum()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.sum.html) method to sum salaries by team
4. Use the [`sort_values()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) method to sort the results descending

_Furthermore_, we're gonna chain these methods together and do it all in one whack. And we can use `\` at the end of the line to tell Python that we're _not quite done yet_.

👉 For more details on grouping data in pandas, [see this notebook](../reference/Grouping%20data%20in%20pandas.ipynb)
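
The four steps can be sketched as one chain, with `\` continuing each line. The rows and team abbreviations are made up for illustration.

```python
import pandas as pd

# Made-up stand-in rows; the team abbreviations are assumptions
df = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E'],
    'TEAM': ['CHC', 'CHC', 'CWS', 'CWS', 'NYY'],
    'SALARY': [2500000, 535000, 6000000, 535000, 12000000],
})

# select columns, group by team, sum salaries, sort largest first
payrolls = df[['TEAM', 'SALARY']] \
    .groupby('TEAM') \
    .sum() \
    .sort_values('SALARY', ascending=False)

print(payrolls.head())
```

Nothing can follow the `\` on a line, not even a comment or a space, which is why the comment sits above the chain.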

In [None]:
# select TEAM, SALARY
# groupby TEAM
# sum()
# sort values descending on SALARY


In [None]:
# run head() to check top values


### Reformatting how the salary looks

If you'd like to change how the `SALARY` column is displayed, you can change the [formatting specification](https://docs.python.org/3/library/string.html#format-examples) by handing off a dictionary to the [`style.format`](https://pandas.pydata.org/pandas-docs/stable/style.html#Finer-Control:-Display-Values) method of our grouped data.

👉 For more information on dictionaries, [see this notebook](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#Dictionaries)

👉 For more information on using Python string formatting to display numbers, [see this notebook](../reference/String%20formatting.ipynb#Formatting-numbers).
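
A sketch of one possible format spec, assuming we want a dollar sign, comma separators and no decimals. To keep it runnable outside a notebook, the sketch applies the spec with plain Python string formatting so you can see the result as text; the commented-out line shows the `style.format` call that takes the same dictionary. The team totals are made up.

```python
import pandas as pd

# Made-up team totals standing in for the grouped payroll data
payrolls = pd.DataFrame(
    {'SALARY': [12000000, 6535000, 3035000]},
    index=pd.Index(['NYY', 'CWS', 'CHC'], name='TEAM'),
)

# column name -> format spec: dollar sign, comma separators, no decimals
salary_format = {'SALARY': '${:,.0f}'}

# What the spec does to a single number
print(salary_format['SALARY'].format(12000000))

# In a notebook, this changes the display only (the values stay numeric):
# payrolls.style.format(salary_format)

# The same spec applied value by value, producing text
as_text = payrolls['SALARY'].map(salary_format['SALARY'].format)
print(as_text)
```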

In [None]:
# change the formatting of the salary column


# 📚 GROUP HOMEWORK 📚

In your groups, write the code necessary to answer these questions:
- How many starting pitchers make the league minimum?
- Which teams have designated hitters?
- Which Chicago team pays better (Sox or Cubs, you pick the metric for "better")?