
---

# **📘 Scikit-learn Cheatsheet**  
🔍 For ML modeling, pipelines, preprocessing, evaluation & more

---

### 🏗️ 1. Basic Workflow  
```python
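from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# X: feature matrix, y: labels (assumed to be loaded already)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # fixed seed for a reproducible split
)

model = LogisticRegression(max_iter=1000)  # any estimator with fit/predict works here
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)  # fraction of correct predictions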
```

---

### 🧠 2. Supervised Models

| Type         | Models                              |
|--------------|--------------------------------------|
| 🔮 Classifier | `LogisticRegression`, `RandomForestClassifier`, `SVC`, `KNeighborsClassifier`, `GaussianNB` |
| 📈 Regressor  | `LinearRegression`, `Ridge`, `Lasso`, `SVR`, `RandomForestRegressor` |

```python
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
```

---

### ⚙️ 3. Unsupervised Models

| Task         | Models                              |
|--------------|--------------------------------------|
| 📊 Clustering | `KMeans`, `DBSCAN`, `AgglomerativeClustering` |
| 🧩 Decomposition | `PCA`, `TruncatedSVD`, `NMF`           |

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
```
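
A minimal sketch of how the two imports above might be combined, assuming the Iris dataset as stand-in data; the choice of 2 components and 3 clusters is illustrative, not a recommendation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # labels are ignored: this is unsupervised

# Project the 4 original features onto the 2 directions of highest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Cluster in the reduced space; n_init and random_state are set explicitly
# so the result does not depend on library defaults
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)

explained = pca.explained_variance_ratio_.sum()  # share of variance kept
```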

---

### 🧹 4. Preprocessing

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer

scaler = StandardScaler()         # 📊 Standardize features (zero mean, unit variance)
encoder = OneHotEncoder()         # 🔤 One-hot encode categorical columns
imputer = SimpleImputer()         # 🧼 Fill missing values (column mean by default)
```
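
The three transformers above can be tried on a toy dataset. This is a small sketch; the arrays `X_num` and `X_cat` are made-up illustrative data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns (one missing value) and one categorical column
X_num = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_cat = np.array([["red"], ["blue"], ["red"]])

# Replace the NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_num)

# Rescale each column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_filled)

# One column per category (sorted alphabetically: blue, red)
encoder = OneHotEncoder(handle_unknown="ignore")
X_onehot = encoder.fit_transform(X_cat).toarray()
```

Fit these on the training split only and reuse the fitted objects on the test split, otherwise test statistics leak into training.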

---

### 🧪 5. Model Selection

```python
from sklearn.model_selection import cross_val_score, GridSearchCV

scores = cross_val_score(model, X, y, cv=5)  # 🔁 Cross-validation
search = GridSearchCV(model, param_grid, cv=5)  # 🔍 Hyperparameter tuning
```
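
The snippet above leaves `model` and `param_grid` undefined. A runnable sketch follows, assuming Iris data and a `KNeighborsClassifier`; the grid values are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier()

# 5-fold cross-validated accuracy of the default model
scores = cross_val_score(model, X, y, cv=5)

# param_grid maps parameter names to the candidate values to try
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_  # best combination found on the grid
best_score = search.best_score_    # its mean cross-validated accuracy
```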

---

### 🧰 6. Model Evaluation

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, mean_squared_error

classification_report(y_test, y_pred)      # 📄 Per-class precision, recall, F1
confusion_matrix(y_test, y_pred)           # 🔳 Confusion matrix
roc_auc_score(y_test, y_probs)             # 🚦 AUC-ROC; y_probs = model.predict_proba(X_test)[:, 1] (binary case)
mean_squared_error(y_test, y_pred)         # 📉 Regression loss (for regressors, not classifiers)
```
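
`roc_auc_score` needs scores or probabilities, not hard class labels. A sketch of where `y_probs` might come from, assuming a binary problem (the breast cancer dataset) and a scaled logistic regression:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard labels for the confusion matrix
y_probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class for AUC

cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_probs)
```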

---

### 🔄 7. Pipelines

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier())
])
pipe.fit(X_train, y_train)
```
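
Pipeline step names double as prefixes for hyperparameter tuning via the `step__param` syntax. A sketch assuming Iris data and a small hypothetical grid:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# "clf__max_depth" targets the max_depth parameter of the step named "clf"
param_grid = {"clf__n_estimators": [10, 50], "clf__max_depth": [2, None]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

best_clf = search.best_estimator_.named_steps["clf"]  # the tuned classifier step
```

Because the scaler sits inside the pipeline, it is refit on each training fold, which keeps validation data out of the scaling statistics.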

---

### 🔎 8. Feature Selection

```python
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=10)
X_new = selector.fit_transform(X, y)
```
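
`k=10` only works when the data has at least 10 features. A sketch on the 30-feature breast cancer dataset (an assumed stand-in) that also recovers which columns were kept:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

data = load_breast_cancer()

# Score each feature with an ANOVA F-test against the labels, keep the top 10
selector = SelectKBest(score_func=f_classif, k=10)
X_new = selector.fit_transform(data.data, data.target)

# get_support() returns a boolean mask over the original columns
kept = data.feature_names[selector.get_support()]
```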

---

### 🧠 9. Ensemble Learning

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.svm import SVC

model = VotingClassifier(estimators=[
    ('rf', RandomForestClassifier()),
    ('svc', SVC(probability=True))  # probability=True enables predict_proba, required for soft voting
], voting='soft')
```

---

### 🧬 10. Custom Scoring & Metrics

```python
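from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import GridSearchCV

f1_scorer = make_scorer(f1_score)  # binary F1 by default; pass average='macro' for multiclass
GridSearchCV(model, param_grid, scoring=f1_scorer)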
```
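
`f1_score` defaults to `average='binary'`, which raises an error on multiclass targets. A sketch of a macro-averaged scorer, assuming Iris data and a logistic regression as the model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Extra keyword arguments to make_scorer are forwarded to the metric
macro_f1_scorer = make_scorer(f1_score, average="macro")

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring=macro_f1_scorer)
```

The built-in string `scoring="f1_macro"` is equivalent; `make_scorer` earns its keep when the metric is one you wrote yourself.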

---

### 🧵 11. Useful Utilities

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance

data = load_iris()
importance = permutation_importance(model, X, y)  # model must already be fitted
```
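
A sketch of reading the result, assuming a random forest fitted on Iris; `n_repeats=5` is an arbitrary choice to keep it quick.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each feature n_repeats times on held-out data and record the score drop
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)

# Pair feature names with their mean importance, most important first
ranking = sorted(
    zip(data.feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
```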

---

### 🧠 Real ML Project Flow:

```python
# Load Data → Preprocess → Train/Test Split → Build Pipeline →
# Fit Model → Tune Hyperparams → Evaluate → Save/Deploy
```
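
The flow above, sketched end to end. Assumptions: Iris as the dataset, a scaled logistic regression as the model, a tiny hypothetical grid over `C`, and `joblib` with a temporary directory for the save step.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load data, then split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Preprocessing and model live in one pipeline
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune on the training split only; fit() refits the best pipeline at the end
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Evaluate once on the untouched test split
test_accuracy = accuracy_score(y_test, search.predict(X_test))

# Save the fitted pipeline and load it back
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(search.best_estimator_, model_path)
reloaded = joblib.load(model_path)
```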

---

# **🤝 Scikit-learn + SciPy Pipeline Integration**
🔗 Use SciPy's power **inside sklearn pipelines** like a pro!

---

### ✅ 1. Custom Feature Selector using `scipy.stats` + `sklearn`

```python
from sklearn.base import BaseEstimator, TransformerMixin
from scipy.stats import ttest_ind
import numpy as np

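class TTestFeatureSelector(TransformerMixin, BaseEstimator):
    """Keep features whose Welch t-test p-value is below `threshold` (binary labels 0/1)."""

    def __init__(self, threshold=0.05):
        self.threshold = threshold

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)  # boolean-mask indexing needs arrays
        self.selected_ = [
            i for i in range(X.shape[1])
            if ttest_ind(X[y == 0, i], X[y == 1, i], equal_var=False).pvalue < self.threshold
        ]
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.selected_]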
```

✅ Plug into `Pipeline()` as a preprocessing step.
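
A sketch of that plug-in step on synthetic data. The class is repeated so the block runs on its own; the data (only column 0 carries signal) and the `LogisticRegression` final step are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class TTestFeatureSelector(TransformerMixin, BaseEstimator):
    """Keep features whose Welch t-test p-value is below `threshold` (labels 0/1)."""

    def __init__(self, threshold=0.05):
        self.threshold = threshold

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.selected_ = [
            i for i in range(X.shape[1])
            if ttest_ind(X[y == 0, i], X[y == 1, i], equal_var=False).pvalue < self.threshold
        ]
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.selected_]


# Synthetic binary problem: 5 noise features, then shift column 0 for class 1
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 5))
X[:, 0] += 4.0 * y

pipe = Pipeline([
    ("select", TTestFeatureSelector(threshold=0.01)),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)

selected = pipe.named_steps["select"].selected_  # indices of the kept columns
train_accuracy = pipe.score(X, y)
```

If no feature passes the threshold, `transform` returns an array with zero columns and the downstream estimator will fail, so a real implementation may want a fallback.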

---

### 🔍 2. Optimize Hyperparameters with `scipy.optimize`

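```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from scipy.optimize import minimize

def objective(params):
    max_depth = int(round(params[0]))  # max_depth must be an integer
    model = RandomForestClassifier(max_depth=max_depth, random_state=0)  # fixed seed keeps the objective deterministic
    score = cross_val_score(model, X, y, cv=3).mean()
    return -score  # ❗ minimize() minimizes, so negate the accuracy

# Rounding makes the objective piecewise constant, so gradient-based methods
# (the default when bounds are given) stall at the start point.
# Use a derivative-free method such as Powell instead.
res = minimize(objective, [5], method='Powell', bounds=[(1, 20)])
best_depth = int(round(np.atleast_1d(res.x)[0]))
```

🎯 Searches a range without enumerating every value as `GridSearchCV` does. It finds a local optimum, not a guaranteed global one.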

---

### 🧠 3. Use `scipy.optimize.curve_fit` as Custom Regressor

```python
from scipy.optimize import curve_fit
import numpy as np

# Define custom model
def custom_func(x, a, b):
    return a * x + b

# Fit model using SciPy (ravel() assumes X has a single feature column)
params, _ = curve_fit(custom_func, X_train.ravel(), y_train)

# Predict with fitted params
y_pred = custom_func(X_test.ravel(), *params)
```

✅ Great for domain-specific curves or interpretable models.
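
To use such a fit inside `Pipeline` or `cross_val_score`, it can be wrapped in an estimator class. The sketch below is one possible wrapper; `CurveFitRegressor` and `linear_func` are hypothetical names, and it assumes a single input feature.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.base import BaseEstimator, RegressorMixin


def linear_func(x, a, b):
    """Example parametric model: a straight line."""
    return a * x + b


class CurveFitRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper exposing curve_fit through fit/predict (single feature)."""

    def __init__(self, func=linear_func):
        self.func = func

    def fit(self, X, y):
        x = np.asarray(X, dtype=float).ravel()
        # params_ holds the fitted coefficients, pcov_ their covariance estimate
        self.params_, self.pcov_ = curve_fit(self.func, x, np.asarray(y, dtype=float))
        return self

    def predict(self, X):
        x = np.asarray(X, dtype=float).ravel()
        return self.func(x, *self.params_)


# Synthetic data from y = 2x + 1 with a little noise
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.1, size=50)

reg = CurveFitRegressor().fit(X, y)
a, b = reg.params_
r2 = reg.score(X, y)  # R^2, inherited from RegressorMixin
```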

---

### 📏 4. Precompute Distance Matrix with `scipy.spatial`

```python
from scipy.spatial.distance import cdist
from sklearn.cluster import AgglomerativeClustering

D = cdist(X, X, metric='euclidean')  # Pairwise distances; swap in e.g. 'cityblock' or 'cosine'
model = AgglomerativeClustering(metric='precomputed', linkage='average')  # `affinity=` before scikit-learn 1.2
model.fit(D)
```

✅ For clustering with a distance other than Euclidean. Note that `'ward'` linkage is not available with precomputed distances.
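
A sketch with a non-Euclidean metric on two synthetic blobs. The keyword for precomputed input was renamed from `affinity` to `metric` in scikit-learn 1.2, so the example picks whichever the installed version accepts; the data and the `cityblock` metric are assumptions.

```python
import inspect

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import AgglomerativeClustering

# Two well-separated blobs of 10 points each
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(10, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(10, 2)),
])

# Manhattan (city-block) distances between all pairs of points
D = cdist(X, X, metric="cityblock")

# Use `metric` where it exists, fall back to the older `affinity` keyword
params = inspect.signature(AgglomerativeClustering).parameters
keyword = "metric" if "metric" in params else "affinity"

model = AgglomerativeClustering(
    n_clusters=2, linkage="average", **{keyword: "precomputed"}
)
labels = model.fit_predict(D)  # pass the distance matrix, not X
```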

---

### 📊 5. Use `scipy.stats` for Custom Metrics in `make_scorer`

```python
from sklearn.metrics import make_scorer
from scipy.stats import pearsonr

def pearson_score(y_true, y_pred):
    return pearsonr(y_true, y_pred)[0]  # return only correlation

pearson_scorer = make_scorer(pearson_score, greater_is_better=True)
```

✅ Use in `GridSearchCV` or model validation.
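
A sketch of the scorer in `cross_val_score`, assuming synthetic linear data and a `Ridge` model:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score


def pearson_score(y_true, y_pred):
    return pearsonr(y_true, y_pred)[0]  # correlation coefficient only


pearson_scorer = make_scorer(pearson_score, greater_is_better=True)

# Synthetic regression data: linear signal plus small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

# One Pearson correlation per fold
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=3, scoring=pearson_scorer)
```

Pearson correlation ignores scale and offset, so a model can score near 1 while its predictions are systematically biased; pair it with an error metric.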

---

### 🧼 6. Smooth Input Data with `scipy.signal` Before Training

```python
from scipy.signal import savgol_filter
from sklearn.base import BaseEstimator, TransformerMixin

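class SmoothingFilter(TransformerMixin, BaseEstimator):
    """Savitzky-Golay smoothing; polyorder must be less than window."""

    def __init__(self, window=5, polyorder=2):
        self.window = window
        self.polyorder = polyorder

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        # axis=0 smooths each column down the rows, so rows must be ordered (e.g. in time)
        return savgol_filter(X, self.window, self.polyorder, axis=0)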
```

✅ Add to pipeline for time series or noisy data.
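
A quick check that the filter does what it promises, sketched on a synthetic noisy sine wave (the signal and noise level are assumptions). The class is repeated so the block runs on its own.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.base import BaseEstimator, TransformerMixin


class SmoothingFilter(TransformerMixin, BaseEstimator):
    def __init__(self, window=5, polyorder=2):
        self.window = window
        self.polyorder = polyorder

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Smooth each column along the rows (time axis)
        return savgol_filter(X, self.window, self.polyorder, axis=0)


# One column, 200 time steps: clean sine plus Gaussian noise
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
clean = np.sin(t).reshape(-1, 1)
noisy = clean + rng.normal(scale=0.3, size=clean.shape)

smoothed = SmoothingFilter(window=7, polyorder=2).fit_transform(noisy)

# Mean squared error against the clean signal, before and after smoothing
noisy_mse = float(np.mean((noisy - clean) ** 2))
smoothed_mse = float(np.mean((smoothed - clean) ** 2))
```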

---

### 🔄 7. Full Pipeline Example

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# SmoothingFilter, TTestFeatureSelector and best_depth are defined in the sections above
pipe = Pipeline([
    ('smooth', SmoothingFilter(window=7)),
    ('select', TTestFeatureSelector(threshold=0.01)),
    ('clf', RandomForestClassifier(max_depth=best_depth))
])

pipe.fit(X_train, y_train)
```

📦 Unified SciPy + Sklearn = 💪 custom + reproducible ML!

---

### ✅ Summary Table

| SciPy Tool          | Sklearn Use Case               |
|---------------------|-------------------------------|
| `ttest_ind`         | Feature Selection             |
| `curve_fit`         | Custom regression             |
| `minimize`          | Hyperparameter tuning         |
| `cdist`             | Custom clustering distances   |
| `savgol_filter`     | Time-series smoothing         |
| `pearsonr`          | Custom scoring                |

---
