First, let's import the libraries we'll be using for this lesson.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# suppress scientific notation (exponents)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
plt.rcParams["axes.formatter.limits"] = (-5, 12)
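
To see what the `display.float_format` setting does, here is a minimal sketch on a few made-up numbers (the values are illustrative, not taken from the salary file). Note that very small values will display as `0.000` under this format.

```python
import pandas as pd

# Same formatter as in the cell above: always show three decimal places.
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Made-up values, only to illustrate the display effect.
demo = pd.Series([36000000.0, 720000.0, 0.00001234])

# to_string() renders the Series the same way a notebook would print it.
formatted = demo.to_string()
print(formatted)
```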

Next, let's load in some data. We'll use pandas' `read_csv` function.

In [None]:
df = pd.read_csv("mlb-salaries-2023.csv")

Let's take a peek and see what the top rows show us:

In [None]:
df.head()

## How many columns and rows in this dataset?

In [None]:
df.shape

### Let's see some descriptive statistics on the salary column

In [None]:
df["salary"].describe()

### Let's see some individual descriptive statistics on the salary column

In [None]:
df["salary"].mean()

In [None]:
df["salary"].median()

In [None]:
df["salary"].mode()
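
Unlike `mean()` and `median()`, `mode()` returns a Series rather than a single number, because several values can tie for most frequent. A small sketch with made-up salaries (not from the dataset) shows the tie case and one way to pull out a single value:

```python
import pandas as pd

# Made-up salaries in which two values tie for most frequent.
salaries = pd.Series([720000, 720000, 1500000, 1500000, 4000000])

# mode() returns every tied value, sorted in ascending order.
modes = salaries.mode()
print(modes)

# Take the first entry if a single number is needed.
first_mode = modes.iloc[0]
print(first_mode)
```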

In [None]:
df["salary"].max()

In [None]:
df["salary"].min()

In [None]:
# count values
df["salary"].count()

In [None]:
df["salary"].sum()

### Let's see all of the unique positions in the position column

In [None]:
df["position"].unique()

### Let's see which positions are most represented in this dataset by counting them

In [None]:
df["position"].value_counts()

### Let's use the `groupby` method to find out which team pays the most (total, median, mean)

In [None]:
# This uses sum() to give the TOTAL amount of salary paid to all of each team's players.
df.groupby("team")["salary"].sum().sort_values(ascending=False)

### Which team has the highest median salary?

In [None]:
# this uses median() to find out which team pays the highest MEDIAN salary.
df.groupby("team")["salary"].median().sort_values(ascending=False)

### Which team has the highest average (mean) salary?

In [None]:
# this uses mean() to find which team has the highest AVERAGE (MEAN) salary.
df.groupby("team")["salary"].mean().sort_values(ascending=False)
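
The three team summaries above can also be computed in one step with `agg`. This is a sketch on a tiny made-up table (the team names "A" and "B" and the salary values are hypothetical); on the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Made-up rows that reuse the "team" and "salary" column names from the lesson.
toy = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "salary": [1000000, 3000000, 720000, 720000, 10000000],
})

# One row per team, one column per statistic.
summary = toy.groupby("team")["salary"].agg(["sum", "median", "mean"])

# Sort by whichever statistic matters; here, total payroll.
summary = summary.sort_values("sum", ascending=False)
print(summary)
```

In this toy table, team B has the larger total and mean but the smaller median, which is why it helps to look at all three side by side.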

### Let's use the `groupby` method to find out which position receives the most total dollars in the league

In [None]:
df.groupby("position")["salary"].sum().sort_values(ascending=False)

### Which position has the highest median salary?

In [None]:
df.groupby("position")["salary"].median().sort_values(ascending=False)

### Let's try filtering. Show me only the designated hitters (DH)

In [None]:
df[ df["position"] == "DH" ]

### Show me players with more than \\$35 million salary

In [None]:
df[ df["salary"] > 35000000]

### Show me only the last names and salaries of the players with more than \\$35 million salary

In [None]:
df[ df["salary"] > 35000000][["name_last","salary"]]
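
The same row filter and column selection can be written as a single `.loc` call, which some readers find easier to follow than two sets of brackets. A sketch with made-up rows (the last names and salaries are hypothetical):

```python
import pandas as pd

# Made-up rows that reuse the "name_last" and "salary" column names.
toy = pd.DataFrame({
    "name_last": ["Alpha", "Bravo", "Charlie"],
    "salary": [40000000, 2000000, 36000000],
    "position": ["P", "DH", "P"],
})

# .loc[rows, columns]: the row condition first, then the columns to keep.
high = toy.loc[toy["salary"] > 35000000, ["name_last", "salary"]]
print(high)
```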

### Show me players with a salary between \\$1 million and \\$1.2 million

In [None]:
df[ (df["salary"] > 1000000) & (df["salary"] < 1200000)]
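
The filter above uses `>` and `<`, so players earning exactly \\$1 million or \\$1.2 million are left out. As an alternative sketch, `Series.between` expresses the same range, and its `inclusive` argument controls the endpoints. This assumes pandas 1.3 or later, where `inclusive` takes the strings `"both"`, `"neither"`, `"left"`, or `"right"`; the salaries are made up.

```python
import pandas as pd

# Made-up salaries, including values exactly on both boundaries.
toy = pd.DataFrame({"salary": [720000, 1000000, 1100000, 1200000, 5000000]})

# Same result as (salary > 1000000) & (salary < 1200000).
strict = toy[toy["salary"].between(1000000, 1200000, inclusive="neither")]

# Keep the endpoints as well (this is the default behavior of between).
with_ends = toy[toy["salary"].between(1000000, 1200000, inclusive="both")]

print(strict)
print(with_ends)
```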

### Show me only San Francisco players (top 10 rows)

In [None]:
df[ df["team"] == "San Francisco" ].head(10)

### Show me the median salary of San Francisco players

In [None]:
df[ df["team"] == "San Francisco" ]["salary"].median()

### Show me the position count of San Francisco

In [None]:
df[ df["team"] == "San Francisco" ]["position"].value_counts()

### Show me the lowest amount a pitcher in the MLB receives

In [None]:
df[ df["position"] == "P" ]["salary"].min()

### Show me the lowest-paid pitchers

In [None]:
# 720000 is hard-coded here; it should match the minimum pitcher salary found in the previous cell
df[ (df["salary"] == 720000) & (df["position"] == "P") ]
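
Typing the minimum by hand works, but it breaks if the data changes. A sketch of one way to avoid the hard-coded number is to store the minimum in a variable and reuse it. The rows and the variable names (`toy`, `pitchers`, `lowest`) are made up for illustration.

```python
import pandas as pd

# Made-up rows that reuse the lesson's column names.
toy = pd.DataFrame({
    "name_last": ["Alpha", "Bravo", "Charlie", "Delta"],
    "position": ["P", "P", "DH", "P"],
    "salary": [720000, 5000000, 720000, 720000],
})

# Filter to pitchers once, then compute their minimum salary.
pitchers = toy[toy["position"] == "P"]
lowest = pitchers["salary"].min()

# Reuse the computed value instead of typing the number.
lowest_paid = pitchers[pitchers["salary"] == lowest]
print(lowest_paid)
```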

### Show me a histogram of the salaries in this dataset with 15 bins

The `xlabel` argument is optional; it puts a label under the chart's x-axis. A `ylabel` argument is also available.

In [None]:
df["salary"].plot.hist(bins=15, xlabel="Salary")
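
The plot call returns a matplotlib `Axes` object, so labels and a title can also be set afterwards. This is a sketch on made-up salaries; the `Agg` backend line is only there so the code also runs outside a notebook, and the label text is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Made-up salaries, only to have something to plot.
salaries = pd.Series(
    [720000, 720000, 900000, 1500000, 4000000, 12000000, 36000000],
    name="salary",
)

# plot.hist returns the Axes the histogram was drawn on.
ax = salaries.plot.hist(bins=15)

# Standard matplotlib Axes methods for labels and title.
ax.set_xlabel("Salary")
ax.set_ylabel("Number of players")
ax.set_title("Salary distribution")
```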

### Make a horizontal bar chart of median salaries by team (groupby), sorted by value

In [None]:
df.groupby("team")["salary"].median().sort_values(ascending=True).plot.barh()