# Assignment 3

**Interactive MLB Hitting Performance Dashboard**

This notebook demonstrates how to build an interactive data exploration dashboard using Plotly and ipywidgets in Jupyter. The main goal is to show how multiple linked visualizations can work together to tell an analytical story.

The dataset contains 2025 MLB batting statistics for qualifying players (502 plate appearances) across several offensive metrics (HR, OBP, SLG, OPS, WAR). The dashboard enables users to filter players by team and minimum plate appearances while multiple visualizations update in real time.


**Visualization Technique** 

The dashboard integrates four visualization types, each revealing a different perspective of offensive performance:

| Visualization | Purpose | Insight |
|---|---|---|
| **Top HR Bar Chart** | Displays players with the most power | Top home run hitters |
| **OBP vs SLG Scatter (bubble size = HR)** | Maps overall offensive skill profiles | Distinguishes balanced hitters vs one-dimensional power hitters |
| **OPS Distribution Histogram** | Shows league-wide offensive performance spread | Reveals how talent clusters or spreads |
| **OPS by Team (Box/Violin)** | Compares overall lineup strength across teams | Shows which rosters have depth vs star-centric hitting |

All together, these visualizations support a narrative analysis:
- Identifying **elite hitters**
- Comparing **team offensive identity**
- Evaluating **consistency vs power**
- Understanding **league-wide performance distribution**


**Visualization Library and Framework** 

The dashboard was built using:

- **Plotly** for interactive visualizations
- **ipywidgets** for UI controls (dropdowns and sliders)
- **Pandas** for data manipulation within Jupyter

**Plotly Origin** 

- Plotly was created by Plotly Technologies Inc., which was founded in 2013 by Alex Johnson, Jack Parmer, Chris Parmer, and Matthew Sundquist.

**Reason For Plotly** 

- Plotly is open source (free for academic and commercial use) and is commonly used for team analytics and dashboards.
- Hover tooltips and other interactions are built in, so no complex code is needed.
- Plotly works well with pandas DataFrames and inside Jupyter notebooks.

**Reason For ipywidgets** 

- ipywidgets allows interaction without leaving the notebook environment.
- It lets multiple charts update simultaneously when filters change, which meets the assignment requirements.
- Plotly is also declarative: you specify what to display (columns and their relationships) rather than how to draw it. This makes Plotly well suited to demonstrations that tell a story.



In [18]:
import pandas as pd 
import numpy as np 

#interactive 
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual, HBox, VBox, Output, Button

#display
from IPython.display import display, Markdown, clear_output

#plotly viz 
import plotly.express as px
import plotly.graph_objects as go

#to ignore 
import warnings
warnings.filterwarnings("ignore")



In [19]:
#load data 
df = pd.read_csv("mlb_batting_2025.csv")
df.head()

Unnamed: 0,Rk,Player,Age,Team,Lg,WAR,G,PA,AB,R,...,Rbat+,TB,GIDP,HBP,SH,SF,IBB,Pos,Awards,Player-additional
0,1,Francisco Lindor#,31,NYM,NL,5.9,160,732,644,117,...,129,300,9,16,0,7,2,*6/DH,AS,lindofr01
1,2,Rafael Devers*,28,2TM,2LG,4.1,163,729,607,99,...,139,291,16,6,0,4,10,*D3/5H,,deverra01
2,2,Rafael Devers*,28,BOS,AL,2.3,73,334,272,47,...,153,137,7,4,0,2,7,D,,deverra01
3,2,Rafael Devers*,28,SFG,NL,1.9,90,395,335,52,...,127,154,9,2,0,2,3,D3/5H,,deverra01
4,3,Shohei Ohtani*,30,LAD,NL,6.6,158,727,611,146,...,175,380,9,3,0,2,20,*D1,AS,ohtansh01


**Data Cleaning** 

This dashboard uses 2025 MLB batting statistics from Baseball Reference (https://www.baseball-reference.com/). The dataset contains comprehensive offensive statistics for all players who appeared in Major League Baseball games during the 2025 season.

The raw data includes 30+ columns covering:
- Basic identification (Player, Age, Team, Position)
- Counting stats (Games, Plate Appearances, At-Bats, Hits, Home Runs, RBIs)
- Rate stats (Batting Average, On-Base Percentage, Slugging Percentage, OPS)
- Advanced metrics (WAR, OPS+, rOBA, Rbat+)

**Data Cleaning Process**

The raw dataset required several cleaning steps to prepare it for dashboard visualization:

1. **Handle duplicate entries:** Some players who were traded mid-season appear multiple times (once per team, plus a combined total row)
2. **Remove missing values:** Filter out rows with missing critical statistics
3. **Data type conversions:** Ensure numeric columns are properly typed for calculations
4. **Filter for qualified players:** Focus on players with meaningful sample sizes (minimum plate appearances)
5. **Calculate derived metrics:** Ensure OPS and other composite stats are accurate
6. **Standardize team abbreviations:** Ensure consistent team naming (example: handling "2TM" for traded players)
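
Steps 2, 3, and 5 above can be sketched as follows. This is an illustrative sketch rather than the notebook's actual cleaning cell: the helper name `clean_batting`, the choice of "critical" columns, and the 0.0015 tolerance are all assumptions, and the rows are placeholder values.

```python
import pandas as pd

# Placeholder rows for illustration only (not real 2025 statistics).
raw = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "PA": ["600", "450", None],          # strings and a missing value on purpose
    "OBP": [0.350, 0.320, 0.300],
    "SLG": [0.500, 0.410, 0.380],
    "OPS": [0.850, 0.730, 0.680],
})

def clean_batting(frame, critical=("PA", "OBP", "SLG", "OPS")):
    """Hypothetical helper: type conversion, missing-value removal, OPS check."""
    out = frame.copy()
    # Step 3: convert to numeric; unparseable values become NaN.
    for col in critical:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Step 2: drop rows missing any critical statistic.
    out = out.dropna(subset=list(critical))
    # Step 5: OPS should equal OBP + SLG. The published values are each rounded
    # to three decimals, so allow a small tolerance (assumed here to be 0.0015).
    out["OPS_check"] = (out["OBP"] + out["SLG"]).round(3)
    out["OPS_matches"] = (out["OPS"] - out["OPS_check"]).abs() <= 0.0015
    return out

cleaned = clean_batting(raw)
print(cleaned[["Player", "PA", "OPS", "OPS_check", "OPS_matches"]])
```

Rows where `OPS_matches` is `False` are worth inspecting by hand before trusting any chart built on OPS.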

**Troubleshooting Common Errors**

While building this dashboard, I faced several issues during the data preparation and development process. Here are the key problems and solutions:

**Error: Incorrect CSV Delimiter When Loading Data**

*Problem:* The MLB batting statistics do not export directly to a file, so I copied and pasted the data into a .txt file and then converted it to a CSV file.

*Initial attempt:*
```
df = pd.read_csv("mlb_batting_2025.txt")  # Assumed comma delimiter
# Result: All data appeared in one column instead of being properly separated
```

*Cause:* The .txt file was tab-delimited, not comma-separated. pandas defaults to a comma delimiter, so it found no commas to split on and read each row as a single column.

**Solution:**

When converting the .txt file to CSV, I made sure to select the tab delimiter instead of comma.

*Lesson learned:* Always inspect the raw data file first to identify the delimiter. Common delimiters include:
- Comma (`,`) for .csv files
- Tab (`\t`) for tab-separated files
- Pipe (`|`) for some database exports
- Semicolon (`;`) for some European formats
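
The delimiter problem can also be handled in code. A minimal sketch, assuming a tab-delimited paste like the one described above; the sample text is a placeholder, not the real file.

```python
import csv
import io
import pandas as pd

# Placeholder tab-delimited text standing in for the pasted .txt file.
sample = "Player\tTeam\tPA\nPlayer A\tNYM\t600\nPlayer B\tLAD\t450\n"

# Default comma delimiter: every row collapses into a single column.
wrong = pd.read_csv(io.StringIO(sample))
print(wrong.shape)

# Passing the tab delimiter explicitly splits the columns correctly.
right = pd.read_csv(io.StringIO(sample), sep="\t")
print(right.shape)

# When the delimiter is unknown, the standard library can guess it from
# a list of candidates (the four delimiters listed above).
dialect = csv.Sniffer().sniff(sample, delimiters=",\t|;")
guessed = pd.read_csv(io.StringIO(sample), sep=dialect.delimiter)
```

Reading the .txt file directly with `sep="\t"` would avoid the intermediate CSV conversion step altogether.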

**General Debugging Tips**

1. **Check your data structure first:** Always use `df.head()` immediately after loading to verify columns loaded correctly
2. **Use `print()` statements:** Print DataFrame shapes and widget values to trace where issues occur
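
Both tips can be combined into one small check to run right after loading. A sketch only: `check_loaded` and its default column list are hypothetical, chosen to match the columns this dashboard relies on.

```python
import pandas as pd

def check_loaded(frame, expected=("Player", "Team", "PA", "HR")):
    """Print the shape and columns, and return any expected columns that are missing."""
    print(f"shape: {frame.shape}")
    print(f"columns: {list(frame.columns)}")
    missing = [c for c in expected if c not in frame.columns]
    if missing:
        print(f"missing expected columns: {missing}")
    return missing

# A frame that loaded with the wrong delimiter has one wide column,
# so every expected column is reported as missing.
bad = pd.DataFrame({"Player\tTeam\tPA\tHR": ["Player A\tNYM\t600\t30"]})
missing_cols = check_loaded(bad)
```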

In [20]:
#clean data 
df.columns = df.columns.str.strip()

#remove # and * next to player names using regex techniques from SIADS 505 Data Manipulation course
df['Player_raw'] = df['Player'].astype(str)
df['Player'] = (df['Player']
                .astype(str)
                .str.replace(r'[#*]+', '', regex=True)
                .str.replace(r'\s+', ' ', regex=True)
                .str.strip())

#coerce stat columns to numeric; values that cannot be parsed become NaN
numeric_candidates = [
    "Rk","Age","WAR","G","PA","AB","R","H","2B","3B","HR","RBI","SB","CS",
    "BB","SO","BA","OBP","SLG","OPS","OPS+","rOBA","Rbat+","TB","GIDP","HBP","SH","SF","IBB"
]
for c in set(numeric_candidates).intersection(df.columns):
    df[c] = pd.to_numeric(df[c], errors="coerce")


#multi codes for teams
multi_map = {"2TM":"TOT", "3TM":"TOT", "4TM":"TOT"}
if "Team" in df.columns:
    df["Team"] = df["Team"].replace(multi_map)

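#drop the combined-total rows so traded players are not counted twice;
#.copy() avoids chained-assignment problems when columns are modified below
df = df[df["Team"].ne("TOT")].copy()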

#use Rk (the Baseball Reference rank) as the index; traded players share one Rk across their team rows
if "Rk" in df.columns:
    df["Rk"] = pd.to_numeric(df["Rk"], errors="coerce").astype("Int64")
    df = df.set_index("Rk").sort_index()
    df.index.name = "Rk"

display(df[['Player','Team','PA','HR']].head())
print(f"Rows: {len(df):,} | Columns: {len(df.columns)}")

Rk,Player,Team,PA,HR
1,Francisco Lindor,NYM,732,31
2,Rafael Devers,BOS,334,15
2,Rafael Devers,SFG,395,20
3,Shohei Ohtani,LAD,727,55
4,Matt Olson,ATL,724,29


Rows: 162 | Columns: 35


**Controls**

To make the dashboard interactive, I incorporated two user controls using **ipywidgets**:

1. **Team Dropdown**  
   This control allows the user to switch between viewing the entire league and filtering to a specific team. When a team is selected, all charts and the summary table update to show only players from that team. This helps compare team-level offensive identity and identify standout hitters within a single roster.

2. **Minimum Plate Appearances Slider**  
   This slider lets the user set a threshold for minimum playing time. Players with very few plate appearances can distort hitting metrics, which is why MLB requires 3.1 plate appearances per team game (502 over a 162-game season) to qualify for rate-stat titles such as the batting title. By adjusting `Min PA`, the user can remove small-sample outliers and focus on everyday starters or high-usage players.
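
The 502 figure can be reproduced with one line of arithmetic. A minimal sketch, assuming the qualification rule of 3.1 plate appearances per scheduled team game and rounding to the nearest whole number; the helper name `qualifying_pa` is hypothetical.

```python
def qualifying_pa(team_games=162, pa_per_game=3.1):
    """Plate appearances needed to qualify for rate-stat titles (rounding is an assumption)."""
    return round(team_games * pa_per_game)

full_season = qualifying_pa()      # 162 games
short_season = qualifying_pa(60)   # a 60-game schedule, for comparison
print(full_season, short_season)
```

Setting the `Min PA` slider near this value approximates the qualified-hitter pool, although the slider's step size may not land on it exactly.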


**Reactive Filtering** 

Both controls are connected to a reactive `render()` function. When the user adjusts either the dropdown or the slider, the dashboard updates by calling the filtering function:

```python
def get_filtered():
    d = df.copy()
    if team_dropdown.value != "ALL":
        d = d[d["Team"] == team_dropdown.value]
    d = d[d["PA"] >= min_pa_slider.value]
    return d
```


In [21]:
#controls 
team_dropdown = widgets.Dropdown(
    options=["ALL"] + sorted(df["Team"].dropna().unique().tolist()),
    value="ALL",
    description="Team:",
    layout=widgets.Layout(width="280px"),
)

#minimum plate appearances (PA column) to screen out small-sample players,
#such as those called up from Triple-A
min_pa_slider = widgets.IntSlider(
    value=200,
    min=0,
    max=int(df["PA"].max()),
    step=25,
    description='Min PA:',
    continuous_update=False, 
    layout=widgets.Layout(width="280px"),
)

display(team_dropdown, min_pa_slider)


Dropdown(description='Team:', layout=Layout(width='280px'), options=('ALL', 'ARI', 'ATH', 'ATL', 'BAL', 'BOS',…

IntSlider(value=200, continuous_update=False, description='Min PA:', layout=Layout(width='280px'), max=732, st…

In [22]:
#reactive filters 
out = widgets.Output()

def get_filtered():
    d = df.copy()
    if team_dropdown.value != "ALL":
        d = d[d["Team"] == team_dropdown.value]
    d = d[d["PA"] >= min_pa_slider.value]
    return d

def kpi_text(d):
    teams = ", ".join(sorted(d["Team"].unique()))
    return f"**Rows:** {len(d)} | **Teams:** {teams}"

def render():
    with out:
        clear_output()
        d = get_filtered()
        display(Markdown(f"**Rows:** {len(d)}  |  **Teams:** {', '.join(sorted(d['Team'].unique())) if len(d)>0 else '—'}"))
        display(d[["Player","Team","PA","HR","R","RBI","OBP","SLG","OPS"]].head(10))

#render when controls change 
team_dropdown.observe(lambda change: render(), names="value")
min_pa_slider.observe(lambda change: render(), names="value")


render() 
display(out) 


Output()

In [23]:
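def render():
    d = get_filtered()
    with out:
        #wait=True holds the old output until the new one is ready, reducing flicker
        clear_output(wait=True)

        #nothing to plot if the filters exclude every player
        if d.empty:
            display(Markdown("**No players match the current filters.**"))
            return

        # KPIs
        display(Markdown(kpi_text(d)))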

        # Compact table
        cols = ["Player","Team","PA","HR","R","RBI","OBP","SLG","OPS"]
        display(
            d[cols].sort_values(["HR","OPS"], ascending=[False, False]).head(25)
        )

        fig_hr = px.bar(
            d.nlargest(10, "HR"),
            x="HR", y="Player", orientation="h",
            title="Top 10 HR (filtered)"
        )
        fig_hr.show()

        fig_slash = px.scatter(
            d, x="OBP", y="SLG", size="HR", color="Team",
            hover_name="Player", title="OBP vs SLG • size=HR"
        )
        fig_slash.show()

        #OPS distribution, matching the histogram described in the visualization table
        fig_dist = px.histogram(d, x="OPS", nbins=20, title="OPS Distribution")
        fig_dist.show()

        if team_dropdown.value == "ALL":
            fig_ops = px.box(d, x="Team", y="OPS", title="OPS by Team")
        else:
            fig_ops = px.violin(d, y="OPS", box=True, title=f"OPS — {team_dropdown.value}")
        fig_ops.show()
        

**Public Dashboard Access & Code Repository**

**Live Interactive Dashboard:**
[Launch Dashboard on Binder](https://mybinder.org/v2/gh/marvcast2027/mlb-dashboard-2025/HEAD)

**Source Code Repository:**
[View on GitHub](https://github.com/marvcast2027/mlb-dashboard-2025)

Repository includes all source code, data files, and documentation.

## Setup & Deployment Instructions

### Running Locally

**Step 1: Clone the repository**
```bash
git clone https://github.com/marvcast2027/mlb-dashboard-2025
cd mlb-dashboard-2025
```

**Step 2: Install required packages**
```bash
pip install pandas numpy plotly ipywidgets jupyter
```

Or use requirements.txt if provided:
```bash
pip install -r requirements.txt
```

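**Step 3: Enable Jupyter widgets (older classic Notebook installs only)**
```bash
jupyter nbextension enable --py widgetsnbextension
```

Recent ipywidgets releases are enabled automatically by `pip install`, and the `jupyter nbextension` command is not available in Notebook 7, so this step can usually be skipped.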

**Step 4: Launch the notebook**
```bash
jupyter notebook assignment3.ipynb
```

**Step 5: Run the dashboard**
- Click "Cell" → "Run All" to execute all cells
- Interact with the Team dropdown and Min PA slider

**Required Dependencies:**
- pandas
- numpy  
- plotly
- ipywidgets
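
Before launching the notebook, it can help to confirm that the packages listed above are installed. A minimal sketch using only the standard library; `installed_versions` is a hypothetical helper and is not part of the repository.

```python
from importlib import metadata

REQUIRED = ["pandas", "numpy", "plotly", "ipywidgets"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

The printed versions are also a convenient starting point for pinning entries in `requirements.txt`.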

### Binder Deployment

The dashboard is already deployed on Binder at the link above. Binder automatically reads `requirements.txt` from the GitHub repository and builds the environment. First launch takes 1-2 minutes to build.

**To deploy your own version:**
1. Ensure your repo has: `assignment3.ipynb`, `mlb_batting_2025.csv`, `requirements.txt`
2. Go to mybinder.org
3. Enter your GitHub repo URL
4. Click "launch"

A 1-minute video demonstration showcases the dashboard's key interactive features.

*The video demonstrates:*
- Team filtering with synchronized visualization updates
- Minimum plate appearances threshold adjustment
- Interactive exploration of MLB 2025 batting statistics
- Key insights from multiple coordinated chart types

**Dashboard Screenshots**

**Full Dashboard View (All Teams)**
![Dashboard showing all teams](screenshot1.png)

**Filtered View (Single Team)**
![Dashboard filtered to Dodgers](screenshot2.png)

**Adjusted PA Threshold**
![Dashboard with minimum 500 PA](screenshot3.png)

In [24]:
#the observers were registered in an earlier cell; their lambdas look up render() by name,
#so they already call the redefined version (registering again would render twice per change)

render()
display(widgets.VBox([widgets.HBox([team_dropdown, min_pa_slider]), out]))

VBox(children=(HBox(children=(Dropdown(description='Team:', layout=Layout(width='280px'), options=('ALL', 'ARI…