# Introduction to Scikit-learn (sklearn)

## What is Scikit-learn?
- Widely used open-source machine learning library for Python
- Built on NumPy, SciPy, and matplotlib
- Provides simple and efficient tools for:
  - **Classification**: Identifying which category an object belongs to
  - **Regression**: Predicting continuous-valued attributes
  - **Clustering**: Grouping similar objects
  - **Dimensionality Reduction**: Reducing the number of random variables
  - **Model Selection**: Comparing, validating, and choosing parameters
  - **Preprocessing**: Feature extraction and normalization

## Key Features
- **Consistent API**: All estimators follow the same interface pattern
- **Well-documented**: Comprehensive docs with examples
- **Performance**: Optimized with Cython for speed
- **BSD License**: Free for commercial use
- **Active Community**: Regular updates and support

## Installation & Setup

```bash
# Using pip
pip install scikit-learn

# Using conda
conda install scikit-learn
```

**Verify Installation:**

```python
import sklearn
import numpy as np
import pandas as pd

print(f"scikit-learn version: {sklearn.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
```

## Machine Learning Workflow with sklearn

```
1. Load Data          → from sklearn.datasets import ...
2. Split Data         → from sklearn.model_selection import train_test_split
3. Preprocess Data    → from sklearn.preprocessing import ... (fit on training data only)
4. Choose Model       → from sklearn.linear_model import ...
5. Train Model        → model.fit(X_train, y_train)
6. Make Predictions   → model.predict(X_test)
7. Evaluate Model     → from sklearn.metrics import ...
```

## Quick Example: Iris Classification

Let's build a simple classifier to predict iris flower species:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# 1. Load dataset
iris = load_iris()
X, y = iris.data, iris.target

print("Dataset shape:", X.shape)
print("Classes:", iris.target_names)
print("Features:", iris.feature_names)

# 2. Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Preprocess (scale features; fit the scaler on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Train model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)

# 5. Predict
y_pred = model.predict(X_test_scaled)

# 6. Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

## Key Concepts

### Estimators
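Any object that learns from data. Supervised estimators (classifiers and regressors) typically provide:
- `fit(X, y)`: Train the model on features `X` and targets `y`
- `predict(X)`: Make predictions for new samples
- `score(X, y)`: Evaluate performance with a default metric (mean accuracy for classifiers, R² for regressors)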
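
As a minimal sketch of this interface, the snippet below fits a `DecisionTreeClassifier` on the iris data and checks that, for a classifier, `score` matches `accuracy_score` computed from `predict`. The choice of model and the split parameters are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative model choice; any classifier exposes the same three methods
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)       # learn from the training data
y_pred = clf.predict(X_test)    # predicted class labels for unseen samples

# For classifiers, score() is the mean accuracy on the given data
score = clf.score(X_test, y_test)
manual = accuracy_score(y_test, y_pred)
print(f"score(): {score:.3f}, accuracy_score(): {manual:.3f}")
```

Swapping in another classifier should only require changing the line that constructs `clf`.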

### Transformers
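Estimators that transform data:
- `fit(X)`: Learn the transformation parameters (for example, per-feature mean and standard deviation)
- `transform(X)`: Apply the learned transformation
- `fit_transform(X)`: Fit and transform in one step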
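
The difference between fitting and transforming can be sketched with `StandardScaler` on a small made-up array (the values below are invented for illustration). The scaler learns its statistics once from the training rows and then reuses them on new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): 4 training rows, 2 features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_new = np.array([[2.5, 25.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std, then scales

print("Learned means:", scaler.mean_)
print("Learned scales:", scaler.scale_)

# transform() reuses the statistics learned from X_train; it does not refit
X_new_scaled = scaler.transform(X_new)
print("New sample scaled:", X_new_scaled)
```

Because `X_new` happens to sit exactly at the training mean, it maps to zero in both features.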

### Pipelines
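Chain multiple steps (preprocessors followed by a final model) into a single estimator:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, estimator) pair; the last step is the model
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
```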
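
A pipeline is used like any other estimator. The sketch below, which assumes the iris data and the same 80/20 split as the quick example, fits the whole chain with one `fit` call and scores it on held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() fits the scaler on X_train, transforms X_train, then fits the model
pipe.fit(X_train, y_train)

# score() applies the already-fitted scaler to X_test before scoring
test_accuracy = pipe.score(X_test, y_test)
print(f"Pipeline test accuracy: {test_accuracy:.2%}")

# Individual fitted steps stay accessible by name
fitted_scaler = pipe.named_steps["scaler"]
print("Scaler means:", fitted_scaler.mean_)
```

Keeping the scaler inside the pipeline means it only ever sees the data passed to `fit`, which helps avoid leaking test-set statistics into preprocessing.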

## Important Modules

| Module | Purpose | Examples |
|--------|---------|----------|
| `sklearn.datasets` | Sample datasets | `load_iris()`, `load_digits()` |
| `sklearn.model_selection` | Split, validate | `train_test_split`, `cross_val_score` |
| `sklearn.preprocessing` | Feature scaling and encoding | `StandardScaler`, `MinMaxScaler`, `OneHotEncoder` |
| `sklearn.linear_model` | Linear models | `LinearRegression`, `LogisticRegression` |
| `sklearn.tree` | Decision trees | `DecisionTreeClassifier` |
| `sklearn.ensemble` | Ensemble methods | `RandomForestClassifier`, `GradientBoostingClassifier` |
| `sklearn.svm` | Support Vector Machines | `SVC`, `SVR` |
| `sklearn.neighbors` | Nearest neighbors | `KNeighborsClassifier` |
| `sklearn.cluster` | Clustering | `KMeans`, `DBSCAN` |
| `sklearn.metrics` | Evaluation metrics | `accuracy_score`, `mean_squared_error` |
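
Because classifiers from different modules share one interface, they can be compared in a single loop. The sketch below is one possible way to do this with 5-fold `cross_val_score` on the iris data. The model selection and the hyperparameters (such as `max_iter=1000`) are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One classifier per module from the table above (illustrative settings)
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "SVC": SVC(),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    # cross_val_score clones the model and fits it once per fold
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name:<24} mean accuracy: {scores.mean():.3f}")
```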

## Next Steps

In the next notebook, we'll explore:
- Detailed sklearn API patterns
- Working with different datasets
- Model evaluation techniques
- Cross-validation strategies
- Hyperparameter tuning