## Session 2: Descriptive Statistics - Mean, Median, Mode

**Objective:** Introduce basic statistical concepts and use Python to calculate the mean, median, and mode for baseball data.

### 1. Concepts Covered
- Descriptive statistics: mean, median, mode.
- Using Python to calculate these statistics for baseball data (e.g., player batting averages).
- Formula references:
   - Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
   - Median: Sort the dataset so that $x_1 \le x_2 \le \ldots \le x_n$. Then:
      - If $n$ is odd, the median is $x_{\frac{n+1}{2}}$.
      - If $n$ is even, the median is $\frac{x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}}{2}$.
   - Mode: The most frequent value. A dataset can have more than one mode.
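
The formulas above can be checked by hand on a small dataset before moving to real data. The sketch below uses only the Python standard library; the `hits_per_game` values and the `median_by_formula` helper are made up for illustration and are not part of the session's dataset.

```python
import statistics

def median_by_formula(values):
    """Median using the odd/even rule, applied to the sorted values."""
    ordered = sorted(values)          # the formula assumes sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the middle value, 1-based position (n + 1) / 2
        return ordered[mid]
    # Even n: average of the values at 1-based positions n/2 and n/2 + 1
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical hits per game over eight games
hits_per_game = [1, 2, 0, 3, 1, 1, 4, 2]

mean_hits = sum(hits_per_game) / len(hits_per_game)
median_hits = median_by_formula(hits_per_game)
mode_hits = statistics.mode(hits_per_game)

print(f"Mean:   {mean_hits}")
print(f"Median: {median_hits}")
print(f"Mode:   {mode_hits}")
```

With eight values, the even-$n$ branch applies: the sorted list is `[0, 1, 1, 1, 2, 2, 3, 4]`, so the median is the average of the 4th and 5th values.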

### 2. Python Code Walkthrough

Import the packages used in this session (install `pybaseball` and `pandas` first if they are not already available):

In [1]:
from pybaseball import batting_stats, playerid_lookup, statcast_pitcher
import pandas as pd
pd.set_option('display.max_columns', None)

#### Descriptive Statistics: Batting Average Analysis

In [2]:
# Season-level batting statistics for 2023
data = batting_stats(2023)
# data.columns.to_list()  # uncomment to list all available columns

In [3]:
mean_batting_avg = data['AVG'].mean()
median_batting_avg = data['AVG'].median()
mode_batting_avg = data['AVG'].mode()[0]

print(f"Mean Batting Average:   {mean_batting_avg:.3f}")
print(f"Median Batting Average: {median_batting_avg:.3f}")
print(f"Mode Batting Average:   {mode_batting_avg:.3f}")

Mean Batting Average:   0.262
Median Batting Average: 0.262
Mode Batting Average:   0.258
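
A note on `.mode()[0]`: pandas returns every tied value from `Series.mode()`, sorted in ascending order, so indexing with `[0]` keeps only the smallest one. The sketch below illustrates this on a small, made-up series of batting averages (the values are hypothetical, not taken from the 2023 data).

```python
import pandas as pd

# Hypothetical batting averages in which two values are tied for most frequent
averages = pd.Series([0.262, 0.258, 0.300, 0.258, 0.262])

modes = averages.mode()      # Series containing all tied modes, sorted ascending
first_mode = modes[0]        # the smallest of the tied modes

print(modes.tolist())
print(first_mode)
```

When a dataset has several modes, it may be worth inspecting the full result of `.mode()` rather than only its first entry.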


##### How much better were the top five hitters in 2023 than the mean, median, and mode batting averages that year?

In [4]:
# Load the 2023 batting stats again (same data as `data` above)
batting = batting_stats(2023)

In [5]:
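# Top 5 hitters by batting average, keeping only the identifying columns.
# .copy() avoids a SettingWithCopyWarning when new columns are added below.
top_5_avg_df = batting.sort_values(
    by='AVG', ascending=False)[:5][
    ['IDfg', 'Name', 'Team', 'AVG']].copy()

# Gap between each hitter's average and each measure of central tendency
top_5_avg_df['mean_diff'] = round(top_5_avg_df['AVG'] - mean_batting_avg, 3)
top_5_avg_df['median_diff'] = round(top_5_avg_df['AVG'] - median_batting_avg, 3)
top_5_avg_df['mode_diff'] = round(top_5_avg_df['AVG'] - mode_batting_avg, 3)

display(top_5_avg_df)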

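|    | IDfg  | Name             | Team | AVG   | mean_diff | median_diff | mode_diff |
|----|-------|------------------|------|-------|-----------|-------------|-----------|
| 13 | 18568 | Luis Arraez      | MIA  | 0.354 | 0.092     | 0.092       | 0.096     |
| 1  | 18401 | Ronald Acuna Jr. | ATL  | 0.337 | 0.075     | 0.075       | 0.079     |
| 5  | 5361  | Freddie Freeman  | LAD  | 0.331 | 0.069     | 0.069       | 0.073     |
| 6  | 16578 | Yandy Diaz       | TBR  | 0.330 | 0.068     | 0.068       | 0.072     |
| 2  | 13624 | Corey Seager     | TEX  | 0.327 | 0.065     | 0.065       | 0.069     |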


#### Descriptive Statistics: Average Fastball Speed Analysis for Clayton Kershaw

In [6]:
# Find Clayton Kershaw's player id
playerid_lookup('kershaw', 'clayton', fuzzy=True) 
# His MLBAM ID is 477132.

# Get Kershaw's stats for a specific date using his ID
kershaw_stats = statcast_pitcher('2017-06-02', '2017-06-02', 477132)

Gathering player lookup table. This may take a moment.
Gathering Player Data


In [11]:
fastball_df = kershaw_stats[kershaw_stats.pitch_name == '4-Seam Fastball']

In [13]:
# Use separate names so the batting-average results above are not overwritten
mean_release_speed = fastball_df.release_speed.mean()
median_release_speed = fastball_df.release_speed.median()
mode_release_speed = fastball_df.release_speed.mode()[0]

print(f"Mean Release Speed:   {mean_release_speed:.3f}")
print(f"Median Release Speed: {median_release_speed:.3f}")
print(f"Mode Release Speed:   {mode_release_speed:.3f}")

Mean Release Speed:   93.409
Median Release Speed: 93.500
Mode Release Speed:   92.900
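
The same statistics can be computed for every pitch type at once instead of filtering one pitch at a time. The following is a minimal sketch using `groupby` on a small, made-up DataFrame; the column names `pitch_name` and `release_speed` match those used above, but the rows are hypothetical and do not come from Statcast.

```python
import pandas as pd

# Hypothetical pitch-level data (not real Statcast rows)
pitches = pd.DataFrame({
    "pitch_name": ["4-Seam Fastball", "Slider", "4-Seam Fastball",
                   "Curveball", "Slider", "4-Seam Fastball"],
    "release_speed": [93.0, 88.0, 94.0, 74.0, 89.0, 93.0],
})

# Mean, median, and number of pitches for each pitch type
summary = pitches.groupby("pitch_name")["release_speed"].agg(
    ["mean", "median", "count"])

print(summary)
```

Including `count` alongside the mean and median helps show how many pitches each statistic is based on, which matters when a pitch type appears only a few times in a game.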


### 3. Exercise:
- Calculate the mean, median, and mode for the batting averages of the top 5 players.
- Questions:
    - Which player has the highest batting average? How does their average compare with the averages of the top 5 players and of all players?
    - Why might the median be different from the mean? (Stretch Goal)

**Hint:** To find the highest batting average, try sorting the data and selecting the first entry.

In [None]:
## Calculate the mean, median, and mode for the batting averages of the top 5 players

In [None]:
## Which player has the highest batting average? How does their average compare with the averages of the top 5 players and of all players?

In [None]:
## Why might the median be different from the mean? (Explain in words)

### 4. Reference:
- [Khan Academy - Descriptive statistics](https://www.khanacademy.org/math/engageny-alg-1/alg1-2)
- [How to Calculate Summary Statistics](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html)