# Introduction to Machine Learning

This notebook provides practical examples for the concepts covered in Episode 01: Introduction to Machine Learning.

## Learning Objectives
- Understand the essence of machine learning
- Learn the core concepts: the function, the data, and the learning process
- Distinguish between supervised and unsupervised learning
- Understand the ML lifecycle and key concepts like parameters vs hyperparameters

## Part 1: ML as a Mathematical Function

A machine learning model is essentially a mathematical function: `f(X) -> y`
- Input: features (X)
- Output: prediction (y)
- All data must be converted to numerical values
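
To make the `f(X) -> y` idea concrete, here is a minimal sketch of a hand-written linear function applied to one encoded event. The `PROTOCOL_CODES` table, the `encode_event` helper, and the weight values are hypothetical and chosen only for illustration; nothing is learned here.

```python
import numpy as np

# Hypothetical encoding table: maps a categorical protocol name to a number.
PROTOCOL_CODES = {"tcp": 0.0, "udp": 1.0, "icmp": 2.0}

def encode_event(protocol, packet_count):
    """Turn one raw event into a numeric feature vector X."""
    return np.array([PROTOCOL_CODES[protocol], float(packet_count)])

def f(X, weights, bias):
    """A linear model: weighted sum of the features plus a bias."""
    return float(np.dot(weights, X) + bias)

# Illustrative parameter values, chosen by hand rather than learned.
weights = np.array([0.5, 0.01])
bias = 1.0

X = encode_event("udp", 200)
y = f(X, weights, bias)
print(X, y)
```

Integer codes imply an ordering between categories, so one-hot encoding is often a better fit for nominal features; the integer table is used here only to keep the sketch short.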

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression, make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

## Part 2: Continuous vs Discrete Functions

### 2.1 Regression (Continuous Output)
Predicting continuous values like cyberthreat risk, network latency, or house prices.
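
As a rough picture of what fitting a linear regression involves, the sketch below solves an ordinary least-squares problem directly with NumPy. The true line `y = 3x + 2` and the noise level are assumed values for illustration, not data from this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data from a known line y = 3x + 2 plus small noise (assumed values).
x = rng.uniform(-1.0, 1.0, size=50)
y = 3.0 * x + 2.0 + rng.normal(0.0, 0.1, size=50)

# Design matrix with a column of ones so the intercept is estimated too.
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution of A @ [slope, intercept] ~= y.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

The recovered slope and intercept should land close to 3 and 2; they are the learned parameters of the model.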

In [None]:
# Generate synthetic regression data
# Simulating cyberthreat risk prediction based on network features
X_reg, y_reg = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

# Create and train a linear regression model
model_reg = LinearRegression()
model_reg.fit(X_reg, y_reg)

# Make predictions
y_pred_reg = model_reg.predict(X_reg)

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_reg, y_reg, alpha=0.6, label='Actual data')
plt.plot(X_reg, y_pred_reg, color='red', linewidth=2, label='Model prediction')
plt.xlabel('Network Feature (e.g., number of servers)')
plt.ylabel('Cyberthreat Risk Score')
plt.title('Regression: Continuous Output Example')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Model equation: y = {model_reg.coef_[0]:.4f} * X + {model_reg.intercept_:.4f}")

### 2.2 Classification (Discrete Output)
Predicting discrete categories like benign vs malicious network events.
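
A classifier such as logistic regression can be thought of as a continuous score that is squashed into a probability and then thresholded. The sketch below shows that idea with hand-picked weights; the `classify` helper and its parameter values are assumptions for illustration, not the library's internals.

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def classify(X, weights, bias, threshold=0.5):
    """Return (probabilities, labels) for each row of X."""
    scores = X @ weights + bias                 # continuous linear score
    probs = sigmoid(scores)                     # probability of class 1
    labels = (probs >= threshold).astype(int)   # discrete decision
    return probs, labels

# Hand-picked illustrative parameters, not learned from data.
weights = np.array([1.0, -1.0])
bias = 0.0

X = np.array([[2.0, 0.0],    # score  2 -> class 1
              [0.0, 3.0],    # score -3 -> class 0
              [1.0, 1.0]])   # score  0 -> probability 0.5 -> class 1
probs, labels = classify(X, weights, bias)
print(labels)
```

Moving `threshold` away from 0.5 trades false positives against false negatives, which matters when a missed malicious event costs more than a false alarm.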

In [None]:
# Generate synthetic classification data
# Simulating network event classification (benign=0, malicious=1)
X_clf, y_clf = make_classification(n_samples=200, n_features=2, n_informative=2,
                                    n_redundant=0, random_state=42)

# Create and train a logistic regression model
model_clf = LogisticRegression(random_state=42)
model_clf.fit(X_clf, y_clf)

# Make predictions
y_pred_clf = model_clf.predict(X_clf)

# Visualize the data, colored by true class (the decision boundary itself is not drawn)
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_clf[:, 0], X_clf[:, 1], c=y_clf, cmap='coolwarm', alpha=0.6)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Classification: Discrete Output Example (Benign vs Malicious)')
plt.colorbar(scatter, label='Class (0=Benign, 1=Malicious)')
plt.grid(True, alpha=0.3)
plt.show()

# Note: this is accuracy on the same data the model was trained on, so it is an optimistic estimate
accuracy = accuracy_score(y_clf, y_pred_clf)
print(f"Training Accuracy: {accuracy:.4f}")

## Part 3: Supervised vs Unsupervised Learning

In [None]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Supervised Learning: We have labels (y)
print("=== SUPERVISED LEARNING ===")
print(f"Training data shape: {X_clf.shape}")
print(f"Labels shape: {y_clf.shape}")
print(f"Sample labels: {y_clf[:5]}")
print()

# Unsupervised Learning: We only have features (X), no labels
print("=== UNSUPERVISED LEARNING ===")
print(f"Training data shape: {X_clf.shape}")
print("No labels provided - algorithm finds structure in data")
print()

# Example: K-means clustering (unsupervised)
# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_clf)

plt.figure(figsize=(12, 5))

# Supervised
plt.subplot(1, 2, 1)
plt.scatter(X_clf[:, 0], X_clf[:, 1], c=y_clf, cmap='coolwarm', alpha=0.6)
plt.title('Supervised: True Labels')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Unsupervised
plt.subplot(1, 2, 2)
plt.scatter(X_clf[:, 0], X_clf[:, 1], c=clusters, cmap='viridis', alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='X', s=200, edgecolors='black', linewidths=2)
plt.title('Unsupervised: K-means Clusters')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.tight_layout()
plt.show()

## Part 4: The ML Lifecycle

The machine learning process involves:
1. **Train**: Adjust model parameters on training set
2. **Validate**: Evaluate on validation set
3. **Adjust**: Tune hyperparameters
4. **Test**: Final evaluation on test set
5. **Deploy**: Use model for predictions
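
The first four steps above can be sketched end to end with NumPy alone. This is a minimal illustration under assumed conditions: the data is a synthetic noisy sine wave, the model is a polynomial, and the polynomial degree plays the role of the hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a noisy sine wave (assumed purely for illustration).
x = rng.uniform(0.0, 3.0, size=150)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.2, size=150)

# Shuffle once, then split 60% / 20% / 20%.
idx = rng.permutation(len(x))
train, val, test = idx[:90], idx[90:120], idx[120:]

def mse(coeffs, rows):
    """Mean squared error of a fitted polynomial on the given rows."""
    return float(np.mean((np.polyval(coeffs, x[rows]) - y[rows]) ** 2))

candidate_degrees = [1, 3, 5, 7]   # hyperparameter values to try
results = {}
for degree in candidate_degrees:
    coeffs = np.polyfit(x[train], y[train], degree)   # 1. train
    results[degree] = (coeffs, mse(coeffs, val))      # 2. validate

# 3. adjust: keep the degree with the lowest validation error
best_degree = min(results, key=lambda d: results[d][1])
best_coeffs = results[best_degree][0]

# 4. test: a single final evaluation on data not used so far
test_error = mse(best_coeffs, test)
print(f"best degree={best_degree}, test MSE={test_error:.4f}")
```

The test rows are touched exactly once, after the degree has been chosen, so the reported test error is not biased by the tuning step.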

In [None]:
# Split data into train, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X_clf, y_clf, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("Data Split for ML Lifecycle:")
print(f"Training set: {X_train.shape[0]} samples (60%)")
print(f"Validation set: {X_val.shape[0]} samples (20%)")
print(f"Test set: {X_test.shape[0]} samples (20%)")
print()
print("IMPORTANT: Never mix training, validation, and test sets!")

## Part 5: Parameters vs Hyperparameters

In [None]:
from sklearn.tree import DecisionTreeClassifier

print("=== PARAMETERS vs HYPERPARAMETERS ===")
print()
print("PARAMETERS:")
print("- Learned automatically during training")
print("- Example: weights (coefficients) of a linear or logistic regression model")
print(f"- Logistic Regression coefficients: {model_clf.coef_}")
print()
print("HYPERPARAMETERS:")
print("- Set before training, not learned automatically")
print("- Tuned (by hand or by automated search) based on validation performance")
print()

# Example: Decision Tree with different hyperparameters
print("Decision Tree Hyperparameters:")
dt1 = DecisionTreeClassifier(max_depth=2, random_state=42)
dt2 = DecisionTreeClassifier(max_depth=10, random_state=42)

dt1.fit(X_train, y_train)
dt2.fit(X_train, y_train)

acc1_train = accuracy_score(y_train, dt1.predict(X_train))
acc1_val = accuracy_score(y_val, dt1.predict(X_val))

acc2_train = accuracy_score(y_train, dt2.predict(X_train))
acc2_val = accuracy_score(y_val, dt2.predict(X_val))

print("\nModel 1 (max_depth=2):")
print(f"  Training accuracy: {acc1_train:.4f}")
print(f"  Validation accuracy: {acc1_val:.4f}")

print("\nModel 2 (max_depth=10):")
print(f"  Training accuracy: {acc2_train:.4f}")
print(f"  Validation accuracy: {acc2_val:.4f}")

## Part 6: Bias-Variance Tradeoff (Underfitting vs Overfitting)

In [None]:
# Demonstrate underfitting, overfitting, and good fit
max_depths = [1, 3, 10, 20]
train_accs = []
val_accs = []

for depth in max_depths:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    train_accs.append(accuracy_score(y_train, dt.predict(X_train)))
    val_accs.append(accuracy_score(y_val, dt.predict(X_val)))

plt.figure(figsize=(10, 6))
plt.plot(max_depths, train_accs, marker='o', label='Training Accuracy', linewidth=2)
plt.plot(max_depths, val_accs, marker='s', label='Validation Accuracy', linewidth=2)
plt.xlabel('Tree Depth (Hyperparameter)')
plt.ylabel('Accuracy')
plt.title('Bias-Variance Tradeoff: Underfitting vs Overfitting')
plt.grid(True, alpha=0.3)
plt.xticks(max_depths)
plt.ylim([0.5, 1.05])

# Annotate regions (boundaries are illustrative, not derived from the data)
plt.axvspan(0.5, 2, alpha=0.1, color='red', label='Underfitting')
plt.axvspan(2, 5, alpha=0.1, color='green', label='Good Fit')
plt.axvspan(5, 20.5, alpha=0.1, color='orange', label='Overfitting')

# Build the legend last so the shaded regions are included in it
plt.legend()

plt.show()

print("Bias-Variance Analysis:")
for i, depth in enumerate(max_depths):
    gap = train_accs[i] - val_accs[i]
    print(f"Depth {depth}: Train={train_accs[i]:.4f}, Val={val_accs[i]:.4f}, Gap={gap:.4f}")

## Part 7: Evaluation Metrics - Precision, Recall, and Confusion Matrix

In [None]:
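# Pick the depth with the highest validation accuracy from Part 6
best_depth = max_depths[int(np.argmax(val_accs))]
print(f"Selected max_depth: {best_depth}")
best_dt = DecisionTreeClassifier(max_depth=best_depth, random_state=42)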
best_dt.fit(X_train, y_train)
y_pred_test = best_dt.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred_test)
precision = precision_score(y_test, y_pred_test)
recall = recall_score(y_test, y_pred_test)
cm = confusion_matrix(y_test, y_pred_test)

print("=== EVALUATION METRICS ===")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f} (share of predicted malicious events that are truly malicious)")
print(f"Recall: {recall:.4f} (share of truly malicious events that were detected)")
print()

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Benign', 'Malicious'],
            yticklabels=['Benign', 'Malicious'])
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')
plt.show()

print("Confusion Matrix Interpretation:")
print(f"True Negatives (TN): {cm[0, 0]} - Correctly identified benign")
print(f"False Positives (FP): {cm[0, 1]} - Benign misclassified as malicious")
print(f"False Negatives (FN): {cm[1, 0]} - Malicious misclassified as benign")
print(f"True Positives (TP): {cm[1, 1]} - Correctly identified malicious")
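
The four confusion-matrix counts combine into accuracy, precision, and recall through simple ratios. The sketch below works through the arithmetic; the `metrics_from_counts` helper is hypothetical and the counts are made up for illustration, not results from this notebook.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up counts for illustration only.
accuracy, precision, recall = metrics_from_counts(tn=50, fp=10, fn=5, tp=35)
print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```

With these counts, accuracy is 85/100, precision is 35/45, and recall is 35/40, which shows how a model can have noticeably higher recall than precision at the same accuracy.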

## Summary

Key Takeaways:
1. **ML is a function**: f(X) -> y, where X are features and y is the prediction
2. **Two main types**: Regression (continuous) and Classification (discrete)
3. **Supervised vs Unsupervised**: Supervised has labels, unsupervised finds patterns
4. **ML Lifecycle**: Train → Validate → Adjust → Test → Deploy
5. **Parameters vs Hyperparameters**: Parameters are learned, hyperparameters are tuned
6. **Bias-Variance**: Balance between underfitting and overfitting
7. **Evaluation**: Use appropriate metrics (accuracy, precision, recall) for your problem