## Introduction to Machine Learning with Scikit-learn

**Basic ML Concepts**

*   **Supervised vs. Unsupervised Learning:**
    *   **Supervised Learning:**  You have labeled data (input features and corresponding output labels).  The goal is to learn a mapping from inputs to outputs.
        *   **Regression:** Predict a continuous output variable (e.g., house price, temperature).
        *   **Classification:** Predict a categorical output variable (e.g., spam/not spam, image category).
    *   **Unsupervised Learning:** You have unlabeled data (only input features).  The goal is to discover patterns or structure in the data.
        *   **Clustering:** Group similar data points together (e.g., customer segmentation).
        *   **Dimensionality Reduction:** Reduce the number of features while preserving important information (e.g., PCA).

*   **Model Evaluation:**
    *   **Regression Metrics:**
        *   **Mean Squared Error (MSE):** Average squared difference between predicted and actual values.
        *   **Root Mean Squared Error (RMSE):** Square root of MSE. It is expressed in the same units as the target variable, which makes it easier to interpret.
        *   **R-squared (Coefficient of Determination):**  Proportion of variance in the output variable explained by the model.
    *   **Classification Metrics:**
        *   **Accuracy:** Proportion of correctly classified instances.
        *   **Precision:**  Proportion of true positives among all predicted positives (how many predicted "spam" emails are actually spam?).
        *   **Recall:** Proportion of true positives among all actual positives (how many actual spam emails were correctly identified?).
        *   **F1-score:** Harmonic mean of precision and recall (balances the two).
        *   **ROC Curve (Receiver Operating Characteristic):**  Plots the true positive rate against the false positive rate at various threshold settings.
        *   **AUC (Area Under the ROC Curve):**  A single number summarizing the ROC curve (0.5 corresponds to random guessing, 1.0 to a perfect ranking; higher is better).
    *   **Cross-Validation:**  A technique for estimating model performance more reliably by splitting the data into multiple folds and training/testing on different combinations of folds.  Because the estimate no longer depends on one particular train/test split, overfitting becomes easier to detect. Common types:
        *   **K-Fold Cross-Validation:** Split data into K folds, train on K-1, test on the remaining fold, repeat K times.
        *   **Stratified K-Fold:** K-Fold, but ensures class proportions are maintained in each fold (important for imbalanced datasets).
        *   **Leave-One-Out Cross-Validation (LOOCV):** Extreme case of K-Fold, where K = number of samples.

*   **Bias-Variance Tradeoff:**
    *   **Bias:**  Error due to overly simplistic assumptions in the model (underfitting).
    *   **Variance:**  Error due to sensitivity to fluctuations in the training data (overfitting).
    *   **Tradeoff:**  Finding the right balance between bias and variance to achieve good generalization performance.

*   **Overfitting and Underfitting:**
    *   **Overfitting:**  The model learns the training data too well, including noise and irrelevant details.  Performs poorly on unseen data.
    *   **Underfitting:**  The model is too simple to capture the underlying patterns in the data.  Performs poorly on both training and unseen data.
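
The cross-validation, bias-variance, and overfitting ideas above can be observed together in a small experiment. The following is a minimal sketch, assuming a synthetic dataset from `make_classification` with some label noise (`flip_y`) and decision trees of different depths as the models; the dataset, the depths, and the variable names are illustrative choices, not part of the original material. A shallow tree typically scores modestly on both training and validation folds (underfitting), while an unrestricted tree fits the training folds almost perfectly but does worse on the validation folds (overfitting).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): flip_y adds label noise that a deep tree can memorize.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

# Stratified 5-fold cross-validation keeps class proportions in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for depth in [1, 3, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_validate(tree, X, y, cv=cv, return_train_score=True)
    # Store (mean training accuracy, mean validation accuracy) for this depth.
    results[depth] = (scores["train_score"].mean(), scores["test_score"].mean())
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, "
          f"validation={results[depth][1]:.3f}")
```

The gap between the training and validation scores is a practical indicator of variance: the larger the gap, the more the model has adapted to the particular training folds.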

**Scikit-learn API**

*   **Estimators:**  Objects that learn from data (e.g., `LinearRegression`, `LogisticRegression`, `DecisionTreeClassifier`).  They have a `fit()` method to train the model and a `predict()` method to make predictions.

*   **Transformers:** Objects that transform data (e.g., `StandardScaler`, `OneHotEncoder`).  They have a `fit()` method to learn parameters from the data and a `transform()` method to apply the transformation.  `fit_transform()` combines both steps.

*   **Pipelines:**  Combine multiple steps (transformers and a final estimator) into a single object.  This simplifies the workflow and ensures that preprocessing is fitted on the training data only and then applied identically to the test data.
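
To make the `fit()` / `transform()` / `predict()` contract concrete, here is a minimal sketch of the manual workflow that a pipeline automates. It assumes a synthetic dataset and uses `StandardScaler` with `LogisticRegression` purely as examples; the variable names are illustrative. The key point is that the scaler learns its parameters from the training split only and reuses them on the test split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption), split into training and test sets.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Transformer: fit() learns per-feature mean and std from the training data only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform on training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# Estimator: fit() trains the model, predict() produces labels for new data.
clf = LogisticRegression()
clf.fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)

print("Learned feature means:", scaler.mean_)
print("Test accuracy:", clf.score(X_test_scaled, y_test))
```

Calling `fit_transform()` on the test data instead would leak test-set statistics into preprocessing, which is the mistake pipelines are designed to prevent.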

**Data Preprocessing**

*   **Scaling:**  Transform features to a comparable scale (e.g., the range 0 to 1, or zero mean and unit variance).  Important for algorithms that are sensitive to feature scales (e.g., SVM, k-NN).

    ```python
    from sklearn.preprocessing import StandardScaler, MinMaxScaler
    import numpy as np

    data = np.array([[1, 10], [2, 20], [3, 30]])

    # Standard scaling (mean=0, std=1)
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(data)
    print("Standard Scaled:\n", scaled_data)

    # Min-max scaling (range [0, 1])
    min_max_scaler = MinMaxScaler()
    min_max_scaled_data = min_max_scaler.fit_transform(data)
    print("MinMax Scaled:\n", min_max_scaled_data)
    ```

*   **Normalization:**  Scale individual samples to have unit norm (useful for text classification or clustering).

    ```python
    from sklearn.preprocessing import Normalizer

    normalizer = Normalizer()
    normalized_data = normalizer.fit_transform(data)
    print("Normalized:\n", normalized_data)
    ```

*   **One-Hot Encoding:**  Convert categorical features into numerical features by creating binary columns for each category.

    ```python
    from sklearn.preprocessing import OneHotEncoder

    data = np.array([['Red'], ['Green'], ['Blue'], ['Red']])

    encoder = OneHotEncoder()
    encoded_data = encoder.fit_transform(data)  # Returns a sparse matrix
    print("One-Hot Encoded:\n", encoded_data.toarray()) # Convert to dense array for display

    # Handle unknown categories: they are encoded as all zeros instead of raising an error
    encoder = OneHotEncoder(handle_unknown='ignore')
    encoder.fit(data)
    new_data = np.array([['Red'], ['Yellow']])
    encoded_new_data = encoder.transform(new_data)
    print("Encoded new data (with unknown handling):\n", encoded_new_data.toarray())
    ```

*   **Feature Engineering:**  Creating new features from existing ones to improve model performance. This often requires domain knowledge. Examples:
    *   Polynomial features: Create powers and interaction terms (e.g., x1^2, x1 * x2).
    *   Binning: Convert continuous features into categorical features.
    *   Text features: Extract features like word counts, TF-IDF.

    ```python
    from sklearn.preprocessing import PolynomialFeatures

    data = np.array([[1, 2], [3, 4]])
    poly = PolynomialFeatures(degree=2, include_bias=False)
    poly_data = poly.fit_transform(data)
    print("Polynomial Features:\n", poly_data)
    ```

**Model Training and Evaluation**

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error, accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import numpy as np

# --- Regression Example (Linear Regression) ---
# Generate some sample data (seeded so the results are reproducible)
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * X.ravel() + 1 + np.random.randn(100)  # 1-D target: y = 2x + 1 + noise

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = model.score(X_test, y_test)  # R-squared

print("Regression Results:")
print("MSE:", mse)
print("RMSE:", rmse)
print("R-squared:", r2)

# --- Classification Example (Logistic Regression) ---
# Create a sample dataset for classification
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # Probabilities for the positive class

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print("\nClassification Results (Logistic Regression):")
print("Accuracy:", accuracy)
print("Classification Report:\n", report)
print("Confusion Matrix:\n", conf_matrix)
print("ROC AUC:", roc_auc)

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.figure()
plt.plot(fpr, tpr, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

# --- Other Models (Decision Tree, SVM, k-NN) ---
# (Similar workflow - fit, predict, evaluate)

# Decision Tree
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
print("\nDecision Tree Accuracy:", accuracy_score(y_test, dt_pred))

# SVM
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)
print("\nSVM Accuracy:", accuracy_score(y_test, svm_pred))

# k-NN
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)
print("\nk-NN Accuracy:", accuracy_score(y_test, knn_pred))
```
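
To connect the classification metrics defined earlier with the numbers reported by Scikit-learn, the following sketch computes precision, recall, and F1 by hand from the confusion matrix and compares them with the library functions. The two label arrays are hypothetical toy values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions (1 = positive class).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # true positives among predicted positives
recall = tp / (tp + fn)     # true positives among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print("By hand :", precision, recall, f1)
print("sklearn :", precision_score(y_true, y_hat),
      recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```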

**Model Selection and Hyperparameter Tuning**

*   **Grid Search:**  Exhaustively search over a specified parameter grid to find the best combination of hyperparameters.

    ```python
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier # Example with a different model

    # Define the parameter grid
    param_grid = {
        'n_estimators': [50, 100, 200],  # Number of trees
        'max_depth': [None, 10, 20, 30],    # Maximum depth of trees
        'min_samples_split': [2, 5, 10]     # Minimum samples required to split a node
    }

    # Create a Random Forest classifier
    rf = RandomForestClassifier(random_state=42)

    # Create a GridSearchCV object
    grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                              cv=3, scoring='accuracy', n_jobs=-1) # Use all available cores

    # Fit the grid search to the data
    grid_search.fit(X_train, y_train)

    # Print the best parameters and the best mean cross-validated score
    print("Best Parameters:", grid_search.best_params_)
    print("Best Score:", grid_search.best_score_)

    # Get the best estimator
    best_rf = grid_search.best_estimator_

    # Evaluate the best estimator on the test set
    y_pred = best_rf.predict(X_test)
    print("Accuracy on Test Set:", accuracy_score(y_test, y_pred))
    ```

*   **Randomized Search:**  Sample a fixed number of parameter settings from specified distributions.  More efficient than grid search when you have many hyperparameters.

    ```python
    from sklearn.model_selection import RandomizedSearchCV
    from scipy.stats import randint  # For integer parameters

    param_dist = {
        'n_estimators': randint(50, 201),      # Integers 50..200 (upper bound is exclusive)
        'max_depth': [None, 10, 20, 30],       # Lists are sampled uniformly
        'min_samples_split': randint(2, 11)    # Integers 2..10
    }

    random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_dist,
                                      n_iter=10, cv=3, scoring='accuracy',
                                      random_state=42, n_jobs=-1)

    random_search.fit(X_train, y_train)
    print("Best Parameters (Randomized Search):", random_search.best_params_)
    ```

*   **Cross-Validation with Hyperparameter Tuning:** Grid search and randomized search use cross-validation internally to evaluate each parameter combination (`best_score_` is the mean cross-validated score of the best combination). This keeps hyperparameter selection from depending on a single train/validation split; the held-out test set should still be used only for the final evaluation.
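
The per-candidate cross-validation scores are stored on the fitted search object in `cv_results_`. The sketch below, which assumes a synthetic dataset and a deliberately tiny grid over `max_depth` of a decision tree, prints the mean and standard deviation of the fold scores for each candidate and shows where `best_score_` comes from.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption); names carry a _demo suffix to avoid clobbering earlier variables.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 4, 8]},
                      cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

# One entry per parameter combination, averaged over the 5 folds.
for params, mean, std in zip(search.cv_results_["params"],
                             search.cv_results_["mean_test_score"],
                             search.cv_results_["std_test_score"]):
    print(params, f"mean={mean:.3f} std={std:.3f}")

# best_score_ is the highest mean cross-validated score among the candidates.
print("Best:", search.best_params_, f"{search.best_score_:.3f}")
```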

**Pipelines**

*   **Combine Preprocessing and Model Training:**

    ```python
    from sklearn.pipeline import Pipeline

    # Create a pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),  # Step 1: Standard scaling
        ('svm', SVC())                # Step 2: SVM classifier
    ])

    # Fit the pipeline to the training data
    pipeline.fit(X_train, y_train)

    # Make predictions
    y_pred = pipeline.predict(X_test)

    # Evaluate the pipeline
    print("Pipeline Accuracy:", accuracy_score(y_test, y_pred))

    # Pipeline with GridSearchCV
    param_grid = {
      'scaler__with_mean': [True, False], # Accessing parameters within the pipeline
      'svm__C': [0.1, 1, 10],
      'svm__kernel': ['linear', 'rbf']
    }

    grid_search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    print("Best Pipeline Parameters:", grid_search.best_params_)
    ```
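
Because pipeline steps are themselves parameters, a grid search can also swap out an entire step, for example to compare scalers. The following is a minimal sketch under the assumption of a synthetic dataset; the step names `scaler` and `svm` mirror the pipeline above, and the string `"passthrough"` skips the step altogether.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Synthetic data (assumption), with names chosen to avoid clobbering earlier variables.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

param_grid = {
    # Replace the whole 'scaler' step: two scalers, or no scaling at all.
    "scaler": [StandardScaler(), MinMaxScaler(), "passthrough"],
    "svm__C": [0.1, 1, 10],
}

# Each candidate pipeline is refit per fold, so scaling never sees validation data.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_tr, y_tr)

print("Best scaler step:", search.best_params_["scaler"])
print("Best C:", search.best_params_["svm__C"])
print("Held-out accuracy:", search.score(X_te, y_te))
```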

**Practice:**

1.  **Work through Scikit-learn Tutorials:**  The official documentation has excellent tutorials that cover many of these topics.  This is a great way to get hands-on experience.

2.  **Apply Different Algorithms:**  Use a real-world dataset (e.g., from Kaggle, UCI Machine Learning Repository, or the datasets included in Scikit-learn) and apply different algorithms (linear regression, logistic regression, decision trees, SVM, k-NN) to the same dataset.  Compare their performance using appropriate metrics.

3.  **Implement Cross-Validation:**  Use cross-validation to evaluate the performance of your models more robustly.  Experiment with different cross-validation strategies (k-fold, stratified k-fold).

4.  **Hyperparameter Tuning:**  Use grid search or randomized search to tune the hyperparameters of at least one of your models.  Compare the performance of the tuned model to the default model.

5.  **Pipelines:**  Build a pipeline that combines a scaler and a classifier.  Train and evaluate it, then experiment with different scalers.
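
As a possible starting point for exercises 2 and 3, the sketch below compares several classifiers on the breast cancer dataset bundled with Scikit-learn, using stratified 5-fold cross-validation. The choice of dataset, models, and settings (such as `max_iter=1000`) is an assumption made for illustration; substitute your own dataset and metrics as needed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scale-sensitive models are wrapped in a pipeline so scaling is refit per fold.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_bc, y_bc, cv=cv, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```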

This course provides a practical introduction to machine learning with Scikit-learn.  By working through the code examples and practice exercises, you'll gain a solid understanding of the key concepts and techniques needed to start building and evaluating your own machine learning models. Remember to consult the Scikit-learn documentation for more details and advanced features.
