# NCAA March Madness - Data Exploration 🏀📊

## Project Overview
This notebook explores historical **NCAA tournament data** to analyze team performance and trends.  
We'll load the datasets, clean them, check for missing values, and visualize key patterns.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

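# Set visualization style
# Note: sns.set_theme() runs last, so wherever both styles set the same
# rcParams, seaborn's values take precedence over the ggplot style.
plt.style.use("ggplot")
sns.set_theme()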


## 🔍 Loading the Datasets
The datasets include:
- **Regular Season Results:** All games played before the tournament.
- **Tournament Results:** Past NCAA tournament games.
- **Seeds:** Team seed rankings for each tournament.
- **Teams:** Team ID and names.

We'll start by loading these datasets into Pandas DataFrames.


In [None]:
# Load datasets
# These CSV files contain historical NCAA tournament and season results
season_results = pd.read_csv("../data/MRegularSeasonCompactResults.csv")  # Regular season game results
tourney_results = pd.read_csv("../data/MNCAATourneyCompactResults.csv")   # Tournament results
seeds = pd.read_csv("../data/MNCAATourneySeeds.csv")                      # Team seeding information
teams = pd.read_csv("../data/MTeams.csv")                                 # Team names and IDs

# Display first few rows of each dataset
print("Regular Season Results:")
display(season_results.head())

print("\nTournament Results:")
display(tourney_results.head())

print("\nSeeds:")
display(seeds.head())

print("\nTeams:")
display(teams.head())

In [None]:
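# List column names as plain Python lists (more readable than the Index repr)
print("Season Results Columns:", season_results.columns.tolist())
print("Tournament Results Columns:", tourney_results.columns.tolist())
print("Seeds Columns:", seeds.columns.tolist())
print("Teams Columns:", teams.columns.tolist())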



In [None]:
print("\nMissing values in Regular Season Results:")
print(season_results.isnull().sum())

print("\nMissing values in Tournament Results:")
print(tourney_results.isnull().sum())

print("\nMissing values in Seeds:")
print(seeds.isnull().sum())

print("\nMissing values in Teams:")
print(teams.isnull().sum())

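The per-dataset checks above can also be collected into a single table, which is easier to scan when most counts are zero. The sketch below is one possible way to do that. The helper name `missing_summary` and the toy frames are illustrative assumptions, not part of the original notebook; in practice the dictionary would hold `season_results`, `tourney_results`, `seeds`, and `teams`.

```python
import numpy as np
import pandas as pd

def missing_summary(frames):
    """Return one row per (dataset, column) with missing count and percent.

    `frames` maps a label to a DataFrame. Hypothetical helper for illustration.
    """
    rows = []
    for name, df in frames.items():
        counts = df.isnull().sum()
        for col, n_missing in counts.items():
            rows.append({
                "dataset": name,
                "column": col,
                "n_missing": int(n_missing),
                # Guard against empty frames to avoid dividing by zero
                "pct_missing": 100.0 * n_missing / len(df) if len(df) else 0.0,
            })
    return pd.DataFrame(rows)

# Toy frames standing in for the real CSVs (all values are made up)
toy_frames = {
    "season": pd.DataFrame({"WScore": [70, np.nan, 65], "LScore": [60, 58, 50]}),
    "teams": pd.DataFrame({"TeamID": [1, 2], "TeamName": ["Alpha", None]}),
}
summary = missing_summary(toy_frames)
print(summary)
```

Filtering the result with `summary[summary["n_missing"] > 0]` would show only the columns that need attention.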


In [None]:
print("\nRegular Season Results Summary:")
print(season_results.describe())

print("\nTournament Results Summary:")
print(tourney_results.describe())

print("\nSeeds Summary:")
print(seeds.describe())

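By default, `describe()` summarizes only numeric columns, so a seed stored as a string would not appear in the seeds summary. The sketch below shows one way to derive a numeric seed. It assumes the column is named `Seed` and holds strings such as `"W01"` or `"X16a"` (a region letter, a two-digit seed, and an optional play-in suffix); that format is an assumption and should be checked against `seeds.head()`. The toy rows are made up.

```python
import pandas as pd

# Toy stand-in for the seeds table (values are made up)
toy_seeds = pd.DataFrame({
    "Season": [2019, 2019, 2019],
    "Seed": ["W01", "X16a", "Z08"],
    "TeamID": [1, 2, 3],
})

# Extract the run of digits from each seed string and convert it to an integer
toy_seeds["SeedNum"] = (
    toy_seeds["Seed"].str.extract(r"(\d+)", expand=False).astype(int)
)

print(toy_seeds[["Seed", "SeedNum"]])
print(toy_seeds["SeedNum"].describe())
```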


## 📊 Visualizing Score Distributions
We look at the distributions of **winning and losing scores** across all regular-season games.


In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(season_results['WScore'], bins=50, kde=True)
plt.title("Distribution of Winning Scores")
plt.xlabel("Winning Score")
plt.ylabel("Frequency")
plt.show()

plt.figure(figsize=(10, 5))
sns.histplot(season_results['LScore'], bins=50, kde=True)
plt.title("Distribution of Losing Scores")
plt.xlabel("Losing Score")
plt.ylabel("Frequency")
plt.show()

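The two histograms show winning and losing scores separately; the gap between them within each game is often the more informative quantity. The sketch below computes a margin of victory from the `WScore` and `LScore` columns used above. The toy rows are made up, and the 5-point cutoff for a "close" game is an arbitrary illustrative choice.

```python
import pandas as pd

# Toy stand-in for season_results (values are made up)
toy_games = pd.DataFrame({
    "WScore": [81, 70, 65, 90],
    "LScore": [77, 60, 64, 70],
})

# Margin of victory per game
toy_games["Margin"] = toy_games["WScore"] - toy_games["LScore"]

# Share of games decided by 5 points or fewer (threshold is an assumption)
close_share = (toy_games["Margin"] <= 5).mean()

print(toy_games["Margin"].describe())
print(f"Share of games decided by 5 points or fewer: {close_share:.2f}")
```

On the real data, `sns.histplot(season_results["WScore"] - season_results["LScore"])` would plot the same quantity in the style of the cells above.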


In [None]:
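# Count tournament wins per team and keep the ten highest
top_teams = tourney_results['WTeamID'].value_counts().head(10)
plt.figure(figsize=(10, 5))
# Pass `order` explicitly: seaborn may otherwise arrange numeric categories
# by team ID rather than by win count
sns.barplot(x=top_teams.index, y=top_teams.values, order=top_teams.index, palette="viridis")
plt.title("Top 10 Teams with Most Tournament Wins")
plt.xlabel("Team ID")
plt.ylabel("Win Count")
plt.show()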

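The bar chart above labels teams by numeric ID, which is hard to read. One option is to join the win counts to the teams table before plotting. The sketch below assumes the teams table has columns named `TeamID` and `TeamName`; the notebook only says it holds team IDs and names, so the exact column names should be confirmed from the column listing. The toy data is made up.

```python
import pandas as pd

# Toy stand-ins for tourney_results and teams (values are made up)
toy_tourney = pd.DataFrame({"WTeamID": [1, 2, 1, 3, 1, 2]})
toy_teams = pd.DataFrame({
    "TeamID": [1, 2, 3],
    "TeamName": ["Alpha", "Beta", "Gamma"],
})

# Count wins per team and turn the result into a two-column DataFrame
win_counts = (
    toy_tourney["WTeamID"].value_counts()
    .rename_axis("TeamID")
    .reset_index(name="Wins")
)

# Attach names; a left join keeps teams even if a name is missing
named = win_counts.merge(toy_teams, on="TeamID", how="left")
named = named.sort_values("Wins", ascending=False).head(10)

print(named[["TeamName", "Wins"]])
```

With the real tables, `named["TeamName"]` could be passed as the x-axis values in the bar plot in place of the raw IDs.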