# Intermediate Machine Learning with GPU Acceleration

This notebook demonstrates intermediate machine learning techniques using GPU acceleration with RAPIDS cuML: model training and evaluation, feature importance analysis, and hyperparameter tuning with a grid search.

In [None]:
import cudf
import numpy as np
from cuml.model_selection import train_test_split
from cuml.ensemble import RandomForestClassifier
from cuml.preprocessing import StandardScaler
from cuml.metrics import accuracy_score
from time import time

# Create a synthetic dataset
n_samples = 100000
n_features = 20

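# Seed the generator so the dataset is reproducible
rng = np.random.default_rng(42)

# cuML works best with float32 features and int32 labels.
# The label depends only on the first two features; the rest are noise.
X = rng.standard_normal((n_samples, n_features)).astype(np.float32)
y = (X[:, 0] + X[:, 1] > 0).astype(np.int32)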

# Convert to cuDF DataFrame
X = cudf.DataFrame(X)
y = cudf.Series(y)

print(f"Created dataset with {n_samples:,} samples and {n_features} features")

## Model Training and Evaluation

Let's train a Random Forest model and evaluate its performance:

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
start = time()
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

# Make predictions
y_pred = rf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Training and prediction completed in {time() - start:.2f} seconds")
print(f"Accuracy: {accuracy:.4f}")
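Accuracy alone hides which kind of mistakes the model makes. As a minimal sketch, assuming the labels have been copied to NumPy arrays, the snippet below builds a confusion matrix by hand and recovers accuracy from it. The demo labels are made up for illustration and are not output from the model above.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=2):
    """Count (true class, predicted class) pairs; rows are true labels."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    # np.add.at accumulates correctly even when index pairs repeat
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def accuracy_from_confusion(cm):
    """Accuracy is the diagonal (correct predictions) over all predictions."""
    return np.trace(cm) / cm.sum()

# Tiny hand-made example (hypothetical labels, not produced by the model above)
y_true_demo = np.array([0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 1, 1, 0, 0])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
acc_demo = accuracy_from_confusion(cm_demo)
print(cm_demo)
print(f"Accuracy: {acc_demo:.4f}")
```

To apply this to the model's predictions, something like `confusion_matrix(y_test.to_numpy(), y_pred.to_numpy())` should work, assuming both are cuDF Series holding integer labels.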

## Feature Importance Analysis

Let's analyze which features are most important for our model's predictions:

In [None]:
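import pandas as pd

# Get feature importance scores
# Note: older cuML releases do not expose feature_importances_ on
# RandomForestClassifier; check your installed version if this raises an AttributeError
feature_importance = pd.DataFrame({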
    'feature': [f'Feature_{i}' for i in range(n_features)],
    'importance': rf.feature_importances_
})

# Sort by importance
feature_importance = feature_importance.sort_values('importance', ascending=False)

print("Top 10 most important features:")
print(feature_importance.head(10))
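If the model does not provide built-in importance scores, permutation importance is a model-agnostic alternative: shuffle one feature at a time and measure how much accuracy drops. The sketch below is a NumPy-only illustration, not the cuML implementation. The `rule_predict` function is a hypothetical stand-in for a trained model's `predict`, chosen so the example runs without a GPU.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=3, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its link to the labels
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - np.mean(predict(X_perm) == y))
        importances[j] = np.mean(drops)
    return importances

# Small dataset built with the same rule as the notebook's synthetic data
rng = np.random.default_rng(42)
X_demo = rng.standard_normal((2000, 5)).astype(np.float32)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(np.int32)

def rule_predict(X):
    """Hypothetical stand-in for model.predict: uses only features 0 and 1."""
    return (X[:, 0] + X[:, 1] > 0).astype(np.int32)

perm_scores = permutation_importance(rule_predict, X_demo, y_demo)
for j, s in enumerate(perm_scores):
    print(f"Feature_{j}: {s:.3f}")
```

Features 0 and 1 should show a clear accuracy drop, while the noise features score zero because the stand-in predictor never reads them. With a real model, the noise features would typically score close to zero rather than exactly zero.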

## Hyperparameter Tuning

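Let's run a simple grid search to find the best hyperparameters for our model. To keep the example short, each combination is trained once and scored on the held-out split rather than with cross-validation, so the reported best score is an optimistic estimate: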

In [None]:
# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

best_score = 0
best_params = None
start = time()

# Simple grid search; each combination is scored once on the held-out split
for n_est in param_grid['n_estimators']:
    for depth in param_grid['max_depth']:
        for min_split in param_grid['min_samples_split']:
            rf = RandomForestClassifier(
                n_estimators=n_est,
                max_depth=depth,
                min_samples_split=min_split
            )
            rf.fit(X_train, y_train)
            score = rf.score(X_test, y_test)
            
            if score > best_score:
                best_score = score
                best_params = {
                    'n_estimators': n_est,
                    'max_depth': depth,
                    'min_samples_split': min_split
                }

print(f"Grid search completed in {time() - start:.2f} seconds")
print(f"Best parameters: {best_params}")
print(f"Best score: {best_score:.4f}")
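Selecting hyperparameters on the same split used for the final score tends to overstate performance. One common remedy is k-fold cross-validation on the training data only. The sketch below shows the idea with NumPy arrays: `iter_param_grid`, `kfold_indices`, and `cross_val_accuracy` are illustrative helper names, and `FirstKSignClassifier` is a hypothetical stand-in model used so the example runs without a GPU.

```python
import itertools
import numpy as np

def iter_param_grid(param_grid):
    """Yield one {name: value} dict per combination in the grid."""
    names = list(param_grid)
    for values in itertools.product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

def kfold_indices(n_samples, n_splits=3, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold splitting."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

def cross_val_accuracy(make_model, params, X, y, n_splits=3):
    """Mean validation accuracy of make_model(**params) over the folds."""
    scores = []
    for train_idx, val_idx in kfold_indices(len(X), n_splits):
        model = make_model(**params)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        scores.append(float(np.mean(pred == y[val_idx])))
    return float(np.mean(scores))

class FirstKSignClassifier:
    """Hypothetical stand-in: predicts 1 when the first k features sum above 0."""
    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y):
        return self  # nothing to learn; k is the only "hyperparameter"

    def predict(self, X):
        return (X[:, :self.k].sum(axis=1) > 0).astype(np.int32)

# Small dataset built with the same rule as the notebook's synthetic data
rng = np.random.default_rng(0)
X_cv = rng.standard_normal((3000, 5)).astype(np.float32)
y_cv = (X_cv[:, 0] + X_cv[:, 1] > 0).astype(np.int32)

results = [
    (cross_val_accuracy(FirstKSignClassifier, params, X_cv, y_cv), params)
    for params in iter_param_grid({"k": [1, 2, 3]})
]
cv_best_score, cv_best_params = max(results, key=lambda r: r[0])
print(f"Best parameters: {cv_best_params}, mean CV accuracy: {cv_best_score:.4f}")
```

To use this with cuML, you could pass `RandomForestClassifier` as `make_model` and select rows with `.iloc[...]` on the cuDF objects instead of NumPy indexing; treat that adaptation as an assumption to verify against your cuML version. The test split should then be scored only once, with the winning parameters.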

## Conclusion

In this notebook, we've explored intermediate machine learning techniques with GPU acceleration:

1. Model training and evaluation with cuML
2. Feature importance analysis
3. Hyperparameter tuning with grid search

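cuML runs these workflows on the GPU, which typically yields large speedups over CPU-based implementations such as scikit-learn, especially on larger datasets. This notebook does not include a CPU baseline, so time the equivalent CPU code on your own hardware to quantify the gain.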
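When comparing GPU and CPU timings, a single measurement can mislead, because the first GPU call often includes one-time setup costs. A minimal sketch of a timing helper that keeps the best of several runs is shown below. `time_call` is an illustrative name, and the demo times a plain Python `sum` rather than a model.

```python
from time import perf_counter

def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn several times; return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        t0 = perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, perf_counter() - t0)
    return best, result

# Demo on a cheap CPU function; swap in a fit/predict wrapper to time a model
elapsed, total = time_call(sum, range(1000))
print(f"Best of 3 runs: {elapsed:.6f} seconds (result: {total})")
```

To benchmark training, wrap the fit in a small function (for example, one that builds the model and calls `fit`) and pass it to `time_call` once for the GPU version and once for the CPU version.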