# 🎓 Complete Guide to Building Machine Learning Models

Welcome! In this tutorial, you'll learn how to build, train, and evaluate machine learning models with scikit-learn, starting from the basics.

## What You'll Learn:
1. Data preparation and exploration
2. Classification models (predicting categories)
3. Regression models (predicting numbers)
4. Clustering models (finding patterns)
5. Model evaluation and selection

Let's get started! 🚀

## 📦 Step 1: Import Required Libraries

In [None]:
# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Classification Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Regression Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor

# Clustering Models
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_classification, make_regression, make_blobs

# Settings
# Requires Matplotlib 3.6+; older versions name this style 'seaborn-darkgrid'
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

print("✅ All libraries imported successfully!")

## 📊 Part 1: Classification Models

Classification is used when you want to predict **categories** (e.g., spam/not spam, disease/healthy).

### Example: Customer Purchase Prediction

In [None]:
# Generate a synthetic classification dataset
# The two features stand in for a customer's age and income;
# the label marks whether the customer made a purchase
X_class, y_class = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_classes=2,
    random_state=42
)

# Create a DataFrame for better understanding
df_class = pd.DataFrame(X_class, columns=['Age (normalized)', 'Income (normalized)'])
df_class['Purchase'] = y_class

print("Dataset shape:", df_class.shape)
print("\nFirst few rows:")
print(df_class.head())
print("\nClass distribution:")
print(df_class['Purchase'].value_counts())

In [None]:
# Visualize the data
plt.figure(figsize=(10, 6))
scatter = plt.scatter(df_class['Age (normalized)'], 
                     df_class['Income (normalized)'], 
                     c=df_class['Purchase'], 
                     cmap='viridis', 
                     alpha=0.6,
                     edgecolors='black')
plt.xlabel('Age (normalized)', fontsize=12)
plt.ylabel('Income (normalized)', fontsize=12)
plt.title('Customer Data: Will They Purchase?', fontsize=14, fontweight='bold')
plt.colorbar(scatter, label='Purchase (0=No, 1=Yes)')
plt.grid(True, alpha=0.3)
plt.show()

### 🔄 Train-Test Split

We split data into:
- **Training set (80%)**: To teach the model
- **Testing set (20%)**: To evaluate how well it learned

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
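
The split above is purely random, so the class balance in the two subsets can drift slightly. A minimal sketch of a stratified split follows, assuming the same synthetic data as this section; the names `X_tr`, `X_te`, `train_ratio`, and `test_ratio` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Recreate the synthetic classification data used in this section
X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2,
    n_redundant=0, n_classes=2, random_state=42
)

# stratify=y keeps the class proportions nearly identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_ratio = np.mean(y_tr)  # share of class 1 in the training set
test_ratio = np.mean(y_te)   # share of class 1 in the test set
print(f"Class 1 share: train={train_ratio:.3f}, test={test_ratio:.3f}")
```

Stratification matters most when one class is rare; with roughly balanced classes like these, the effect is small.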

### 🤖 Model 1: Logistic Regression

Simple but effective for binary classification.

In [None]:
# Create and train the model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

# Make predictions
y_pred_lr = log_reg.predict(X_test)

# Evaluate
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print(f"Logistic Regression Accuracy: {accuracy_lr:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))
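
Logistic regression also exposes class probabilities, not only hard labels. The sketch below, which rebuilds the same synthetic data so it runs on its own, shows `predict_proba` and a stricter decision threshold. The value `0.7` and the names `proba` and `y_pred_strict` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2,
    n_redundant=0, n_classes=2, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(random_state=42).fit(X_tr, y_tr)

# One row per sample; columns follow model.classes_, here [P(class 0), P(class 1)]
proba = model.predict_proba(X_te)

# Hypothetical stricter rule: predict "purchase" only when P(class 1) >= 0.7
threshold = 0.7
y_pred_strict = (proba[:, 1] >= threshold).astype(int)

print("Positives at default threshold:", int(model.predict(X_te).sum()))
print("Positives at 0.7 threshold:", int(y_pred_strict.sum()))
```

Raising the threshold trades recall for precision: fewer customers are flagged, but those that are flagged carry a higher predicted probability.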

### 🌳 Model 2: Decision Tree

Makes decisions like a flowchart.

In [None]:
# Create and train
dt_clf = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_clf.fit(X_train, y_train)

# Predict and evaluate
y_pred_dt = dt_clf.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f"Decision Tree Accuracy: {accuracy_dt:.2%}")
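
The `max_depth` setting controls how closely a tree can fit the training data. As a rough sketch (the depth values and the `results` dictionary are illustrative choices, not part of the original notebook), comparing training and test accuracy across depths makes overfitting visible:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2,
    n_redundant=0, n_classes=2, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
# max_depth=None lets the tree keep splitting until leaves are pure (default settings)
for depth in [1, 3, 5, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this depth
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.2%}, test={test_acc:.2%}")
```

A widening gap between training and test accuracy as depth grows is the usual sign that the tree is memorizing noise.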

### 🌲 Model 3: Random Forest

Combines multiple decision trees for better predictions.

In [None]:
# Create and train
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict and evaluate
y_pred_rf = rf_clf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf:.2%}")

### 📊 Compare Classification Models

In [None]:
# Compare all models
models_comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest'],
    'Accuracy': [accuracy_lr, accuracy_dt, accuracy_rf]
})

print(models_comparison)

# Visualize comparison
plt.figure(figsize=(10, 6))
bars = plt.bar(models_comparison['Model'], models_comparison['Accuracy'], 
               color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
plt.ylabel('Accuracy', fontsize=12)
plt.title('Classification Models Comparison', fontsize=14, fontweight='bold')
plt.ylim(0, 1)
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.2%}',
             ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

### 🎯 Confusion Matrix

A confusion matrix counts correct and incorrect predictions for each class, showing which classes the model confuses.

In [None]:
# Confusion matrix for the Random Forest model
# (swap in the predictions of whichever model scored best above)
cm = confusion_matrix(y_test, y_pred_rf)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['No Purchase', 'Purchase'],
            yticklabels=['No Purchase', 'Purchase'])
plt.ylabel('Actual', fontsize=12)
plt.xlabel('Predicted', fontsize=12)
plt.title('Confusion Matrix - Random Forest', fontsize=14, fontweight='bold')
plt.show()

print(f"\nTrue Negatives: {cm[0][0]}")
print(f"False Positives: {cm[0][1]}")
print(f"False Negatives: {cm[1][0]}")
print(f"True Positives: {cm[1][1]}")
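
The four counts above are the building blocks of precision and recall. A small sketch with hand-made labels (hypothetical values, not the tutorial's data) shows how the two metrics follow from the matrix cells:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made example labels, chosen only for illustration
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

These values match what `precision_score` and `recall_score` return for the same labels.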

## 📈 Part 2: Regression Models

Regression is used to predict **continuous values** (e.g., prices, temperatures, sales).

### Example: House Price Prediction

In [None]:
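# Generate a synthetic regression dataset
# Note: the feature is standardized (centered near 0), so the "House Size"
# and "Price" labels below are illustrative and values can be negative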
X_reg, y_reg = make_regression(
    n_samples=500,
    n_features=1,
    noise=20,
    random_state=42
)

# Create DataFrame
df_reg = pd.DataFrame(X_reg, columns=['House Size (sq ft)'])
df_reg['Price ($1000s)'] = y_reg

print("Dataset shape:", df_reg.shape)
print("\nFirst few rows:")
print(df_reg.head())
print("\nStatistics:")
print(df_reg.describe())

In [None]:
# Visualize the data
plt.figure(figsize=(10, 6))
plt.scatter(df_reg['House Size (sq ft)'], df_reg['Price ($1000s)'], 
           alpha=0.5, edgecolors='black')
plt.xlabel('House Size (sq ft)', fontsize=12)
plt.ylabel('Price ($1000s)', fontsize=12)
plt.title('House Size vs Price', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Split the data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train_reg)}")
print(f"Testing samples: {len(X_test_reg)}")

### 📏 Linear Regression

Fits a straight line through the data.

In [None]:
# Create and train
lin_reg = LinearRegression()
lin_reg.fit(X_train_reg, y_train_reg)

# Predict
y_pred_lin = lin_reg.predict(X_test_reg)

# Evaluate
mse_lin = mean_squared_error(y_test_reg, y_pred_lin)
rmse_lin = np.sqrt(mse_lin)
r2_lin = r2_score(y_test_reg, y_pred_lin)
mae_lin = mean_absolute_error(y_test_reg, y_pred_lin)

print(f"Linear Regression Metrics:")
print(f"  R² Score: {r2_lin:.4f} (closer to 1 is better)")
print(f"  RMSE: {rmse_lin:.2f}")
print(f"  MAE: {mae_lin:.2f}")
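
To see what these three metrics measure, the sketch below computes them from their definitions on a tiny hand-made example (the numbers are illustrative only) and can be compared with scikit-learn's functions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Tiny hand-made example, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))           # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))    # square root of the mean squared error

ss_res = np.sum(errors ** 2)                      # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.4f}, RMSE: {rmse:.4f}, R2: {r2:.4f}")
```

RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging.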

In [None]:
# Visualize predictions
plt.figure(figsize=(10, 6))
plt.scatter(X_test_reg, y_test_reg, alpha=0.5, label='Actual', edgecolors='black')
plt.scatter(X_test_reg, y_pred_lin, alpha=0.5, label='Predicted', edgecolors='black')
plt.plot(X_test_reg, y_pred_lin, color='red', linewidth=2, label='Regression Line')
plt.xlabel('House Size (sq ft)', fontsize=12)
plt.ylabel('Price ($1000s)', fontsize=12)
plt.title('Linear Regression: Actual vs Predicted', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

### 🌳 Decision Tree Regressor

In [None]:
# Create and train
dt_reg = DecisionTreeRegressor(max_depth=5, random_state=42)
dt_reg.fit(X_train_reg, y_train_reg)

# Predict and evaluate
y_pred_dt_reg = dt_reg.predict(X_test_reg)
r2_dt = r2_score(y_test_reg, y_pred_dt_reg)
rmse_dt = np.sqrt(mean_squared_error(y_test_reg, y_pred_dt_reg))

print(f"Decision Tree Regressor Metrics:")
print(f"  R² Score: {r2_dt:.4f}")
print(f"  RMSE: {rmse_dt:.2f}")

### 📊 Compare Regression Models

In [None]:
# Create comparison DataFrame
reg_comparison = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree'],
    'R² Score': [r2_lin, r2_dt],
    'RMSE': [rmse_lin, rmse_dt]
})

print(reg_comparison)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# R² Score comparison
axes[0].bar(reg_comparison['Model'], reg_comparison['R² Score'], 
            color=['#FF6B6B', '#4ECDC4'])
axes[0].set_ylabel('R² Score', fontsize=11)
axes[0].set_title('R² Score Comparison', fontsize=12, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# RMSE comparison
axes[1].bar(reg_comparison['Model'], reg_comparison['RMSE'], 
            color=['#FF6B6B', '#4ECDC4'])
axes[1].set_ylabel('RMSE', fontsize=11)
axes[1].set_title('RMSE Comparison (lower is better)', fontsize=12, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 🎨 Part 3: Clustering Models

Clustering finds **natural groups** in data without labels.

### Example: Customer Segmentation

In [None]:
# Generate sample clustering dataset
X_cluster, y_cluster = make_blobs(
    n_samples=300,
    n_features=2,
    centers=3,
    cluster_std=1.0,
    random_state=42
)

df_cluster = pd.DataFrame(X_cluster, columns=['Annual Income', 'Spending Score'])

print("Dataset shape:", df_cluster.shape)
print("\nFirst few rows:")
print(df_cluster.head())

In [None]:
# Visualize unlabeled data
plt.figure(figsize=(10, 6))
plt.scatter(df_cluster['Annual Income'], df_cluster['Spending Score'], 
           alpha=0.6, edgecolors='black')
plt.xlabel('Annual Income', fontsize=12)
plt.ylabel('Spending Score', fontsize=12)
plt.title('Customer Data (Unlabeled)', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

### 🎯 K-Means Clustering

In [None]:
# Find optimal number of clusters using Elbow Method
inertias = []
K_range = range(1, 10)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_cluster)
    inertias.append(kmeans.inertia_)

# Plot Elbow curve
plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (k)', fontsize=12)
plt.ylabel('Inertia', fontsize=12)
plt.title('Elbow Method: Finding Optimal k', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

print("Look for the 'elbow' point where the curve starts to flatten.")
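
Reading an elbow by eye can be ambiguous. One common complement, shown here as a sketch rather than as part of the original notebook, is the silhouette score, which is higher when clusters are compact and well separated. The range of `k` values and the names `scores` and `best_k` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Recreate the synthetic clustering data used in this section
X, _ = make_blobs(
    n_samples=300, n_features=2, centers=3,
    cluster_std=1.0, random_state=42
)

scores = {}
# The silhouette score needs at least 2 clusters, so start at k=2
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Highest silhouette score at k={best_k}")
```

On real data the elbow and the silhouette score do not always agree, so treat both as guidance rather than a definitive answer.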

In [None]:
# Apply K-Means with optimal k=3
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_cluster)

# Add cluster labels to DataFrame
df_cluster['Cluster'] = clusters

print("Cluster distribution:")
print(df_cluster['Cluster'].value_counts().sort_index())

In [None]:
# Visualize clusters
plt.figure(figsize=(12, 6))

scatter = plt.scatter(df_cluster['Annual Income'], 
                     df_cluster['Spending Score'],
                     c=df_cluster['Cluster'],
                     cmap='viridis',
                     s=100,
                     alpha=0.6,
                     edgecolors='black')

# Plot cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], 
           c='red', s=300, alpha=0.8, 
           marker='*', edgecolors='black', linewidth=2,
           label='Centroids')

plt.xlabel('Annual Income', fontsize=12)
plt.ylabel('Spending Score', fontsize=12)
plt.title('Customer Segments (K-Means Clustering)', fontsize=14, fontweight='bold')
plt.colorbar(scatter, label='Cluster')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("\n✅ Successfully identified 3 customer segments!")

## 🔄 Part 4: Cross-Validation

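Cross-validation trains and scores a model on several different train/validation splits and averages the results, which gives a more reliable estimate of performance than a single train-test split.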

In [None]:
# Let's use our classification data
models_cv = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVM': SVC(random_state=42)
}

cv_results = []

for name, model in models_cv.items():
    # 5-fold cross-validation
    scores = cross_val_score(model, X_class, y_class, cv=5, scoring='accuracy')
    cv_results.append({
        'Model': name,
        'Mean Accuracy': scores.mean(),
        'Std Dev': scores.std()
    })
    print(f"{name}:")
    print(f"  Accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
    print()
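
Distance-based models such as KNN and SVM are sensitive to feature scale. The synthetic features here are already on similar scales, but with real data a scaler is usually needed. A minimal sketch, assuming the same synthetic data, wraps `StandardScaler` and KNN in a pipeline so the scaler is fit only on each fold's training portion; the name `knn_pipeline` is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2,
    n_redundant=0, n_classes=2, random_state=42
)

# Inside cross_val_score the whole pipeline is refit per fold, so the scaler
# never sees the validation fold (this avoids data leakage)
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

scores = cross_val_score(knn_pipeline, X, y, cv=5, scoring='accuracy')
print(f"Scaled KNN accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
```

Scaling the full dataset before cross-validation would let information from the validation folds leak into training, which is why the scaler belongs inside the pipeline.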

In [None]:
# Visualize cross-validation results
cv_df = pd.DataFrame(cv_results)

plt.figure(figsize=(12, 6))
x_pos = np.arange(len(cv_df))
plt.bar(x_pos, cv_df['Mean Accuracy'], 
        yerr=cv_df['Std Dev'], 
        capsize=5,
        color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7'],
        edgecolor='black')
plt.xticks(x_pos, cv_df['Model'], rotation=15, ha='right')
plt.ylabel('Accuracy', fontsize=12)
plt.title('5-Fold Cross-Validation Results', fontsize=14, fontweight='bold')
plt.ylim(0, 1)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nBest model based on cross-validation:")
best_idx = cv_df['Mean Accuracy'].idxmax()
print(f"{cv_df.loc[best_idx, 'Model']}: {cv_df.loc[best_idx, 'Mean Accuracy']:.2%}")

## 🎯 Part 5: Feature Importance

Feature importance scores show which input features a trained model relies on most.

In [None]:
# Create a dataset with named features
from sklearn.datasets import load_iris

# Load iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
feature_names = iris.feature_names

# Train a Random Forest
rf_iris = RandomForestClassifier(n_estimators=100, random_state=42)
rf_iris.fit(X_iris, y_iris)

# Get feature importances
importances = rf_iris.feature_importances_
indices = np.argsort(importances)[::-1]

# Plot
plt.figure(figsize=(10, 6))
plt.bar(range(X_iris.shape[1]), importances[indices], 
        color='skyblue', edgecolor='black')
plt.xticks(range(X_iris.shape[1]), 
          [feature_names[i] for i in indices], 
          rotation=45, ha='right')
plt.ylabel('Importance', fontsize=12)
plt.title('Feature Importance (Iris Dataset)', fontsize=14, fontweight='bold')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("Feature Ranking:")
for i, idx in enumerate(indices):
    print(f"{i+1}. {feature_names[idx]}: {importances[idx]:.4f}")
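
The scores above are impurity-based and computed from the training data. As an alternative sketch (not the method used in the original notebook), permutation importance measures how much held-out accuracy drops when one feature's values are shuffled. The 70/30 split and `n_repeats=10` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3,
    random_state=42, stratify=iris.target
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out set and record the score drop
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)

# Print features from most to least important
for idx in result.importances_mean.argsort()[::-1]:
    print(f"{iris.feature_names[idx]}: "
          f"{result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```

When features are strongly correlated, as the iris petal measurements are, both kinds of importance can spread credit between them, so rankings should be read with some caution.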

## 📝 Summary & Key Takeaways

### What You've Learned:

1. **Classification Models**
   - Logistic Regression: Simple, fast, interpretable
   - Decision Trees: Easy to understand, can overfit
   - Random Forest: Often more accurate than a single tree, but less interpretable
   - SVM & KNN: Different approaches with their own strengths

2. **Regression Models**
   - Linear Regression: Best for linear relationships
   - Decision Tree Regressor: Can capture non-linear patterns
   
3. **Clustering**
   - K-Means: Find natural groups in data
   - Elbow method: Determine optimal number of clusters

4. **Model Evaluation**
   - Train/Test Split: Basic validation
   - Cross-Validation: More robust evaluation
   - Multiple metrics: Accuracy, R², RMSE, MAE

5. **Best Practices**
   - Always split your data
   - Use cross-validation for reliable results
   - Compare multiple models
   - Understand feature importance
   - Visualize your results

### Next Steps:
- Try with your own datasets
- Experiment with hyperparameter tuning
- Learn about feature engineering
- Explore deep learning for complex problems

Happy Learning! 🎉

## 🏋️ Practice Exercises

Try these on your own:

1. **Modify hyperparameters**: Change `max_depth` in Decision Trees and see how it affects accuracy
2. **Try different splits**: Use 70-30 or 60-40 train-test splits
3. **Add more features**: Generate datasets with more features and see how models perform
4. **Real datasets**: Try sklearn's built-in datasets (`load_wine`, `load_breast_cancer`, `load_diabetes`)
5. **Ensemble methods**: Combine multiple models for better predictions
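
As a possible starting point for exercise 5, the sketch below combines the three classifiers from Part 1 with scikit-learn's `VotingClassifier`. Soft voting (averaging predicted probabilities) is one choice among several, and the short estimator names `lr`, `dt`, and `rf` are illustrative labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2,
    n_redundant=0, n_classes=2, random_state=42
)

# voting='soft' averages the class probabilities predicted by each model
ensemble = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('dt', DecisionTreeClassifier(max_depth=5, random_state=42)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    voting='soft'
)

scores = cross_val_score(ensemble, X, y, cv=5, scoring='accuracy')
print(f"Voting ensemble accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
```

An ensemble is not guaranteed to beat its best member, so compare its cross-validated score against the individual models from Part 4.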