Imagine you're a high school football coach trying to predict if your team will win a game based on two clues: whether your star player is healthy and whether the weather is good. You’ve tracked past games:

- Out of 20 games, you won 10.
- In the 10 wins: 8 had a healthy star player, 7 had good weather.
- In the 10 losses: 3 had a healthy star player, 4 had good weather.

Naive Bayes helps you guess the chance of winning by combining these clues, assuming the star player’s health and the weather don’t affect each other once you know whether the game was a win or a loss. Here’s how:

1. **Chance of winning?** You won 10 out of 20 games, so 50%.
2. **If you win, how often is the star player healthy?** 8 out of 10 wins, so 80%.
3. **If you win, how often is the weather good?** 7 out of 10 wins, so 70%.
4. **If you lose, how often is the star player healthy?** 3 out of 10 losses, so 30%.
5. **If you lose, how often is the weather good?** 4 out of 10 losses, so 40%.

Now, suppose today your star player is healthy, and the weather is good. Naive Bayes multiplies the chances:
- For a win: 50% (chance of winning) × 80% (healthy player in wins) × 70% (good weather in wins) = 0.5 × 0.8 × 0.7 = 0.28.
- For a loss: 50% (chance of losing) × 30% (healthy player in losses) × 40% (good weather in losses) = 0.5 × 0.3 × 0.4 = 0.06.

Since 0.28 (win) is bigger than 0.06 (loss), Naive Bayes predicts you’re more likely to win!
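The same arithmetic can be checked in a few lines of Python. This is a minimal sketch that uses only the counts listed above; the variable names are illustrative, and the final normalization step (dividing the win score by the sum of both scores to get a probability) is an addition that the walkthrough does not spell out.

```python
# Counts taken from the 20 tracked games above
n_games = 20
n_wins, n_losses = 10, 10

# Prior: how often each outcome occurred
p_win = n_wins / n_games
p_loss = n_losses / n_games

# Likelihoods: how often each clue appeared within wins and within losses
p_healthy_given_win = 8 / n_wins
p_weather_given_win = 7 / n_wins
p_healthy_given_loss = 3 / n_losses
p_weather_given_loss = 4 / n_losses

# Naive Bayes score = prior x product of the per-clue likelihoods
score_win = p_win * p_healthy_given_win * p_weather_given_win
score_loss = p_loss * p_healthy_given_loss * p_weather_given_loss

# Normalize so the two scores sum to 1 and can be read as probabilities
posterior_win = score_win / (score_win + score_loss)

print(f"Win score:  {score_win:.2f}")
print(f"Loss score: {score_loss:.2f}")
print(f"Estimated chance of winning: {posterior_win:.0%}")
```

The two scores, 0.28 and 0.06, are not probabilities on their own because they do not sum to 1. Dividing 0.28 by 0.34 gives an estimated chance of winning of about 82%.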

It’s like making a quick prediction by combining simple patterns from past games, assuming the clues work independently.

***Naive Bayes is based on independent clues.***

***Independent*** is the key.
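
Written as a formula, the independence assumption is what allows the clue probabilities to be multiplied. The symbols $H$ (star player healthy) and $G$ (good weather) are introduced here for illustration only:

$$
P(\text{win} \mid H, G) \;\propto\; P(\text{win}) \, P(H \mid \text{win}) \, P(G \mid \text{win})
$$

With the numbers from the football example, the right-hand side is $0.5 \times 0.8 \times 0.7 = 0.28$. Strictly speaking, the clues only need to be independent within each outcome (within wins, and within losses), not independent overall.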

Let’s dive into a Python example that uses Naive Bayes to predict whether a high school football team will win a game based on time series data, namely its performance over past games. We’ll use two simple features: the team’s average points scored in the last 3 games and the opponent’s average points allowed in the last 3 games. Three assignments follow to deepen understanding.

Here’s the Python example using scikit-learn’s Gaussian Naive Bayes, assuming we have game data with these features and win/loss outcomes. The code generates synthetic data for simplicity, but the concept applies to real data. Note that each synthetic game is drawn independently of the others, so the 3-game averages carry no real information about the next result; the accuracy printed below illustrates the workflow rather than genuine predictive skill.

In [2]:
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate synthetic time series data: 20 games
np.random.seed(42)
n_games = 20
team_points = np.random.normal(24, 5, n_games)  # Team's points scored
opponent_points_allowed = np.random.normal(20, 4, n_games)  # Opponent's points allowed
wins = np.where((team_points - opponent_points_allowed) > 0, 1, 0)  # Win if team scores more than opponent allows

# Create features: average points scored and allowed over last 3 games
X = []
y = []
for i in range(3, n_games):
    avg_team_points = np.mean(team_points[i-3:i])
    avg_opponent_points_allowed = np.mean(opponent_points_allowed[i-3:i])
    X.append([avg_team_points, avg_opponent_points_allowed])
    y.append(wins[i])

X = np.array(X)
y = np.array(y)

# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gaussian Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Example prediction for a new game
new_game = np.array([[25.0, 18.0]])  # Team averaged 25 points, opponent allowed 18
prediction = model.predict(new_game)
print(f"Predicted outcome (1=win, 0=loss): {prediction[0]}")

Model Accuracy: 0.67
Predicted outcome (1=win, 0=loss): 1


This code:

- Creates synthetic data for 20 games with team points scored and opponent points allowed.
- Builds time series features by averaging the last 3 games for both.
- Trains a Gaussian Naive Bayes model to predict wins (1) or losses (0).
- Tests the model and makes a prediction for a new game.
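
To see what the model actually learned, it can help to look inside it. The sketch below rebuilds the same synthetic features, fits on all 17 rows (an assumption made here because the goal is inspection, not evaluation), and prints the per-class priors and feature means along with class probabilities for the new game. It relies on the `classes_`, `class_prior_`, and `theta_` attributes and the `predict_proba` method of `GaussianNB`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Rebuild the same synthetic data and rolling-average features as above
np.random.seed(42)
n_games = 20
team_points = np.random.normal(24, 5, n_games)
opponent_points_allowed = np.random.normal(20, 4, n_games)
wins = (team_points > opponent_points_allowed).astype(int)

X = np.array([
    [team_points[i-3:i].mean(), opponent_points_allowed[i-3:i].mean()]
    for i in range(3, n_games)
])
y = wins[3:]

# Fit on all rows here, since the goal is inspection rather than evaluation
model = GaussianNB().fit(X, y)

# Per-class prior and feature means learned by the model (one row per class)
for label, prior, means in zip(model.classes_, model.class_prior_, model.theta_):
    print(f"class {label}: prior={prior:.2f}, "
          f"mean avg_team_points={means[0]:.1f}, "
          f"mean avg_opponent_points_allowed={means[1]:.1f}")

# Class probabilities for the same hypothetical new game used above
new_game = np.array([[25.0, 18.0]])
proba = model.predict_proba(new_game)[0]
for label, p in zip(model.classes_, proba):
    print(f"P(class {label} | new game) = {p:.2f}")
```

Gaussian Naive Bayes plays the same role as the hand calculation earlier, except that each "clue" is a number, so the per-class likelihood comes from a bell curve fitted to that feature instead of a simple count.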

**Data Exploration Assignment**

Modify the code to load a CSV file with real or simulated football game data (e.g., columns: team_points, opponent_points_allowed, win). Create a new feature, like the difference between the team’s average points scored and the opponent’s average points allowed over the last 3 games. Train the model with this feature and compare its accuracy to the original model.
Hint: Use pandas.read_csv() to load data and compute the difference as avg_team_points - avg_opponent_points_allowed.
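
As a possible starting point, here is a small sketch of the feature step in pandas. The DataFrame below is a made-up stand-in for whatever `pandas.read_csv()` would return (the values are illustrative, not real game data), and the column name `avg_point_diff` is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv(...); the values are made up
games = pd.DataFrame({
    "team_points": [21, 28, 14, 35, 24, 17, 31],
    "opponent_points_allowed": [20, 24, 17, 27, 21, 30, 14],
    "win": [1, 1, 0, 1, 1, 0, 1],
})

# Average of the previous 3 games: shift(1) excludes the current game
games["avg_team_points"] = games["team_points"].shift(1).rolling(3).mean()
games["avg_opponent_points_allowed"] = (
    games["opponent_points_allowed"].shift(1).rolling(3).mean()
)

# New feature suggested in the hint
games["avg_point_diff"] = (
    games["avg_team_points"] - games["avg_opponent_points_allowed"]
)

# The first 3 games have no full 3-game history, so drop them
features = games.dropna().reset_index(drop=True)
print(features[["avg_team_points", "avg_opponent_points_allowed",
                "avg_point_diff", "win"]])
```

The `shift(1)` matters: without it, the rolling window would include the game being predicted.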

In [3]:
# Your code here

**Feature Engineering Assignment**

Add a new time series feature, such as the team’s win streak (number of consecutive wins before the current game). Update the code to include this feature in the Naive Bayes model. Test if it improves accuracy.
Hint: Loop through the wins array to count consecutive 1s before each game.
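
One way to read the hint is sketched below. The helper name `win_streaks` and the example results are hypothetical; plugging the streak into the feature matrix is left as the exercise.

```python
import numpy as np

def win_streaks(wins):
    """Return, for each game, the number of consecutive wins just before it."""
    streaks = []
    current = 0
    for won in wins:
        streaks.append(current)               # streak going into this game
        current = current + 1 if won else 0   # a loss resets the streak
    return np.array(streaks)

# Made-up results for illustration (1 = win, 0 = loss)
example_wins = np.array([1, 1, 0, 1, 1, 1, 0])
print(win_streaks(example_wins))  # [0 1 2 0 1 2 3]
```

The streak for each game uses only earlier results, so it does not leak the outcome being predicted.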

In [4]:
# Your code here

**Model Comparison Assignment**

Replace the Gaussian Naive Bayes model with a different classifier, like Logistic Regression (sklearn.linear_model.LogisticRegression). Compare the accuracy of both models on the same data. Write a short explanation (in a text file) of why one might perform better.
Hint: Use sklearn.linear_model.LogisticRegression and ensure the same train-test split for fair comparison.

In [5]:
# Your code here