Here is a **Scikit-Learn (sklearn) Reference Sheet** that you can use as a single-page cheat sheet for **interviews, projects, and learning**. It is organized **by category, with brief explanations and syntax** for each entry.

---

# **Ultimate Scikit-Learn Reference Sheet**

---

## **1️⃣ Data Preprocessing / Feature Engineering**

| Function / Class                                    | Module          | Purpose                                    | Example                                        |
| --------------------------------------------------- | --------------- | ------------------------------------------ | ---------------------------------------------- |
| `StandardScaler()`                                  | `preprocessing` | Standardize features (mean=0, std=1)       | `X_scaled = StandardScaler().fit_transform(X)` |
| `MinMaxScaler()`                                    | `preprocessing` | Scale to \[0,1]                            | `X_scaled = MinMaxScaler().fit_transform(X)`   |
| `RobustScaler()`                                    | `preprocessing` | Scale using median & IQR (outlier-resistant) | `X_scaled = RobustScaler().fit_transform(X)` |
| `Normalizer()`                                      | `preprocessing` | Normalize rows to unit norm (useful for text data) | `X_norm = Normalizer().fit_transform(X)` |
| `Binarizer(threshold=0.0)`                          | `preprocessing` | Convert values > threshold to 1 else 0     | `Binarizer(threshold=0.5)`                     |
| `PolynomialFeatures(degree=2)`                      | `preprocessing` | Generate polynomial & interaction features | Feature expansion                              |
| `OneHotEncoder()`                                   | `preprocessing` | Convert categorical to one-hot             | `OneHotEncoder().fit_transform(df_cat)`        |
| `LabelEncoder()`                                    | `preprocessing` | Encode labels to integers                  | `LabelEncoder().fit_transform(y)`              |
| `LabelBinarizer()`                                  | `preprocessing` | Binarize class labels in a one-vs-rest fashion | `LabelBinarizer().fit_transform(y)`        |
| `PowerTransformer(method='yeo-johnson')`            | `preprocessing` | Make data Gaussian-like                    | Reduce skewness                                |
| `QuantileTransformer(output_distribution='normal')` | `preprocessing` | Map data to uniform / normal               | Robust scaling                                 |
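A point the table does not spell out is where `fit` should happen. The sketch below, which assumes a small synthetic NumPy array (not data from this sheet), fits `StandardScaler` on the training rows only and then reuses those statistics on the test rows, which is the usual way to avoid leaking test information into preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 100 samples, 3 features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 200.0, -5.0], scale=[1.0, 50.0, 0.1], size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(X_train_scaled.std(axis=0).round(6))   # approximately 1 for every column
```

The test set will generally not have exactly zero mean and unit variance after this step, and that is expected: it is scaled with the training statistics.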

---

## **2️⃣ Model Selection & Evaluation**

| Function / Class                                     | Module            | Purpose                              |
| ---------------------------------------------------- | ----------------- | ------------------------------------ |
| `train_test_split()`                                 | `model_selection` | Split data into train/test           |
| `KFold(n_splits=5)`                                  | `model_selection` | K-fold cross-validation              |
| `StratifiedKFold()`                                  | `model_selection` | Maintain class distribution in folds |
| `cross_val_score(estimator, X, y, cv=5)`             | `model_selection` | Evaluate model using CV              |
| `cross_val_predict()`                                | `model_selection` | Get cross-validated predictions      |
| `GridSearchCV(estimator, param_grid)`                | `model_selection` | Exhaustive hyperparameter tuning     |
| `RandomizedSearchCV(estimator, param_distributions)` | `model_selection` | Randomized hyperparameter tuning     |
| `ShuffleSplit()`                                     | `model_selection` | Random splitting multiple times      |
| `StratifiedShuffleSplit()`                           | `model_selection` | Maintain class balance in splits     |
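To illustrate what "maintain class distribution in folds" means in practice, here is a minimal sketch on a deliberately imbalanced toy label vector (an assumption made for illustration). It counts how many positive samples end up in each validation fold for `StratifiedKFold` versus an unshuffled `KFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels (assumption): 90 negatives followed by 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # feature values do not matter for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
strat_pos = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]

kf = KFold(n_splits=5)  # no shuffling: folds are contiguous blocks of rows
plain_pos = [int(y[val_idx].sum()) for _, val_idx in kf.split(X)]

print(strat_pos)  # [2, 2, 2, 2, 2]
print(plain_pos)  # [0, 0, 0, 0, 10]
```

With sorted labels, the unshuffled `KFold` puts every positive sample in the last fold, so four of the five validation folds contain no positives at all.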

---

## **3️⃣ Supervised Learning**

### **Regression**

| Model                     | Module         | Use Case                  | Syntax                                    |
| ------------------------- | -------------- | ------------------------- | ----------------------------------------- |
| LinearRegression          | `linear_model` | Continuous prediction     | `LinearRegression().fit(X,y)`             |
| Ridge                     | `linear_model` | L2-regularized regression | `Ridge(alpha=1.0)`                        |
| Lasso                     | `linear_model` | L1-regularized regression | `Lasso(alpha=0.1)`                        |
| ElasticNet                | `linear_model` | L1+L2 regularization      | `ElasticNet(alpha=0.1,l1_ratio=0.5)`      |
| DecisionTreeRegressor     | `tree`         | Non-linear regression     | `DecisionTreeRegressor(max_depth=5)`      |
| RandomForestRegressor     | `ensemble`     | Ensemble trees            | `RandomForestRegressor(n_estimators=100)` |
| GradientBoostingRegressor | `ensemble`     | Boosted regression        | `GradientBoostingRegressor()`             |

### **Classification**

| Model                      | Module         | Use Case                  | Syntax                                     |
| -------------------------- | -------------- | ------------------------- | ------------------------------------------ |
| LogisticRegression         | `linear_model` | Binary / multi-class      | `LogisticRegression()`                     |
| DecisionTreeClassifier     | `tree`         | Tree-based classification | `DecisionTreeClassifier()`                 |
| RandomForestClassifier     | `ensemble`     | Ensemble classifier       | `RandomForestClassifier(n_estimators=100)` |
| GradientBoostingClassifier | `ensemble`     | Boosted classifier        | `GradientBoostingClassifier()`             |
| KNeighborsClassifier       | `neighbors`    | kNN classifier            | `KNeighborsClassifier(n_neighbors=5)`      |
| SVC                        | `svm`          | SVM classifier            | `SVC(kernel='rbf')`                        |
| GaussianNB                 | `naive_bayes`  | Probabilistic classifier  | `GaussianNB()`                             |
| AdaBoostClassifier         | `ensemble`     | Boosted ensemble          | `AdaBoostClassifier()`                     |

---

## **4️⃣ Unsupervised Learning / Dimensionality Reduction**

| Model / Function        | Module          | Use Case                            |
| ----------------------- | --------------- | ----------------------------------- |
| KMeans                  | `cluster`       | Partition into k clusters           |
| DBSCAN                  | `cluster`       | Density-based clustering            |
| AgglomerativeClustering | `cluster`       | Hierarchical clustering             |
| MeanShift               | `cluster`       | Mode-seeking clustering (shifts centroids toward density peaks) |
| PCA                     | `decomposition` | Reduce dimensions linearly          |
| TruncatedSVD            | `decomposition` | Linear dimensionality reduction that works on sparse matrices |
| KernelPCA               | `decomposition` | Non-linear dimensionality reduction |
| NMF                     | `decomposition` | Non-negative matrix factorization   |

---

## **5️⃣ Metrics / Model Evaluation**

### **Regression**

| Metric                | Purpose            | Syntax                                |
| --------------------- | ------------------ | ------------------------------------- |
| mean\_squared\_error  | Measures MSE       | `mean_squared_error(y_true, y_pred)`  |
| mean\_absolute\_error | Measures MAE       | `mean_absolute_error(y_true, y_pred)` |
| r2\_score             | Variance explained | `r2_score(y_true, y_pred)`            |

### **Classification**

| Metric                 | Purpose                              | Syntax                                  |
| ---------------------- | ------------------------------------ | --------------------------------------- |
| accuracy\_score        | Correct prediction rate              | `accuracy_score(y_true, y_pred)`        |
| precision\_score       | True positives / predicted positives | `precision_score(y_true, y_pred)`       |
| recall\_score          | True positives / actual positives    | `recall_score(y_true, y_pred)`          |
| f1\_score              | Harmonic mean of precision & recall  | `f1_score(y_true, y_pred)`              |
| roc\_auc\_score        | Area under ROC curve                 | `roc_auc_score(y_true, y_prob)`         |
| confusion\_matrix      | Count of TP, FP, FN, TN              | `confusion_matrix(y_true, y_pred)`      |
| classification\_report | Summary metrics                      | `classification_report(y_true, y_pred)` |

### **Clustering**

| Metric                    | Purpose                               | Syntax                               |
| ------------------------- | ------------------------------------- | ------------------------------------ |
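| silhouette\_score         | Cohesion vs. separation per sample, in \[-1, 1] (higher is better) | `silhouette_score(X, labels)`        |
| davies\_bouldin\_score    | Average similarity between each cluster and its closest one (lower is better) | `davies_bouldin_score(X, labels)`    |
| calinski\_harabasz\_score | Ratio of between-cluster to within-cluster dispersion (higher is better) | `calinski_harabasz_score(X, labels)` |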
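The three clustering metrics are often used together to compare candidate values of `k`. The following sketch assumes synthetic blobs from `make_blobs` and a hypothetical candidate range of 2 to 5 clusters; on real data the metrics may disagree with one another.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data (assumption): three compact 2-D blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = {
        "silhouette": silhouette_score(X, labels),                 # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),         # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),   # higher is better
    }

# Pick the k with the highest silhouette score
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
for k, s in scores.items():
    print(k, {name: round(float(value), 3) for name, value in s.items()})
print("best k by silhouette:", best_k)
```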

---

## **6️⃣ Pipeline & Utilities**

| Function / Class                  | Module                     | Purpose                                     |
| --------------------------------- | -------------------------- | ------------------------------------------- |
| `Pipeline(steps)`                 | `pipeline`                 | Combine preprocessing + model in one object |
| `ColumnTransformer(transformers)` | `compose`                  | Apply transformations to specific columns   |
| `FunctionTransformer(func)`       | `preprocessing`            | Custom feature transformation               |
| `joblib.dump()` / `joblib.load()` | `joblib` (separate package) | Save / load fitted models                  |
| `make_pipeline()`                 | `pipeline`                 | Shortcut to create pipelines (step names are generated automatically) |
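As a rough illustration of how these utilities fit together, the sketch below builds a pipeline with `make_pipeline`, wraps `np.log1p` in a `FunctionTransformer`, and round-trips the fitted model through `joblib`. The synthetic data and the file name `model.joblib` are assumptions chosen for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic data (assumption): y depends linearly on log(1 + x)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 100.0, size=(200, 1))
y = 3.0 * np.log1p(X[:, 0]) + rng.normal(scale=0.1, size=200)

# make_pipeline derives step names from the lowercased class names
model = make_pipeline(
    FunctionTransformer(np.log1p),  # custom, stateless transformation
    StandardScaler(),
    LinearRegression(),
)
model.fit(X, y)

# Save and reload the whole fitted pipeline (hypothetical file name)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

same = bool(np.allclose(model.predict(X), restored.predict(X)))
print(list(model.named_steps))  # ['functiontransformer', 'standardscaler', 'linearregression']
print(same)                     # True
```

Persisting the whole pipeline, rather than only the final estimator, keeps the preprocessing and the model in sync at prediction time.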

---

## **7️⃣ Feature Selection**

| Function / Class                        | Purpose                                  |
| --------------------------------------- | ---------------------------------------- |
| SelectKBest(score\_func, k=10)          | Top-k features based on univariate stats |
| RFE(estimator, n\_features\_to\_select) | Recursive feature elimination            |
| VarianceThreshold(threshold=0.0)        | Remove low variance features             |
| SelectFromModel(estimator, threshold)   | Select features using model importance   |

---



The second part is a **coding reference sheet for scikit-learn** with **actual code examples** for the major tasks listed above: preprocessing, model building, evaluation, pipelines, and feature selection. The snippets assume that `X`, `y`, and (where used) `X_cat` are already loaded as NumPy arrays or DataFrames.

---

# **Master-Level Scikit-Learn Coding Reference Sheet**

---

## **1️⃣ Data Preprocessing / Feature Engineering**

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, PolynomialFeatures
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Min-Max Scaling
X_minmax = MinMaxScaler().fit_transform(X)

# Robust Scaling (resistant to outliers)
X_robust = RobustScaler().fit_transform(X)

# Row-wise Normalization
X_norm = Normalizer().fit_transform(X)

# Polynomial Features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# One-Hot Encoding (categorical)
# `sparse_output` replaced the older `sparse` argument in scikit-learn 1.2
ohe = OneHotEncoder(sparse_output=False)
X_ohe = ohe.fit_transform(X_cat)

# Label Encoding (target)
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Power Transform (Yeo-Johnson)
pt = PowerTransformer()
X_pt = pt.fit_transform(X)

# Quantile Transformation (Gaussian)
qt = QuantileTransformer(output_distribution='normal')
X_qt = qt.fit_transform(X)
```
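The encoders above are easier to reason about with concrete values. This is a small sketch on toy string data (an assumption for illustration) showing the column order `OneHotEncoder` produces, what `handle_unknown="ignore"` does with a category that was not seen during `fit`, and how `LabelEncoder` can map integers back to the original labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy categorical feature (assumption): one column of colour names
X_cat = np.array([["red"], ["green"], ["blue"], ["green"]])

ohe = OneHotEncoder(handle_unknown="ignore")
X_ohe = ohe.fit_transform(X_cat).toarray()  # default output is sparse; densify for display
print(ohe.categories_[0])  # ['blue' 'green' 'red'] (sorted alphabetically)
print(X_ohe)               # rows: [0,0,1], [0,1,0], [1,0,0], [0,1,0]

# A category unseen during fit becomes an all-zero row instead of raising an error
unseen = ohe.transform(np.array([["purple"]])).toarray()
print(unseen)              # [[0. 0. 0.]]

# LabelEncoder is meant for the target; classes are sorted before numbering
le = LabelEncoder()
y_encoded = le.fit_transform(["cat", "dog", "cat", "bird"])
y_decoded = le.inverse_transform(y_encoded)
print(y_encoded)           # [1 2 1 0]
print(list(y_decoded))     # ['cat', 'dog', 'cat', 'bird']
```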

---

## **2️⃣ Train/Test Split & Cross-Validation**

```python
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# K-Fold CV
# Positional indexing below assumes NumPy arrays; for DataFrames use X.iloc[train_idx]
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(X):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]

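# Cross-validated score (`estimator` can be any unfitted scikit-learn estimator)
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression(max_iter=1000)
scores = cross_val_score(estimator, X, y, cv=5)

# Hyperparameter tuning (param_grid keys must be valid parameters of the estimator)
grid = GridSearchCV(estimator, param_grid={'C': [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_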
```
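`RandomizedSearchCV` is imported above but not demonstrated. A minimal sketch follows, assuming synthetic data from `make_classification` and a log-uniform prior over `C` (both are illustrative choices, not recommendations from this sheet). Instead of trying every grid point, it samples `n_iter` candidates from the given distribution.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # sample C on a log scale
    n_iter=10,        # number of sampled candidates
    cv=5,
    random_state=42,  # makes the sampled candidates reproducible
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated accuracy of the best candidate
```

Randomized search tends to be the more practical option when the search space has several continuous hyperparameters, because the cost is fixed by `n_iter` rather than by the grid size.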

---

## **3️⃣ Supervised Learning: Regression**

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# Lasso Regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# ElasticNet
en = ElasticNet(alpha=0.1, l1_ratio=0.5)
en.fit(X_train, y_train)

# Decision Tree Regressor
dt = DecisionTreeRegressor(max_depth=5)
dt.fit(X_train, y_train)

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

# Gradient Boosting Regressor
gb = GradientBoostingRegressor()
gb.fit(X_train, y_train)
```
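The practical difference between L1 and L2 regularization shows up in the fitted coefficients. The sketch below assumes synthetic data from `make_regression` in which only 3 of 10 features carry signal; the `alpha` values are illustrative and not tuned.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): 10 features, only 3 of them informative
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=1.0, random_state=42
)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))

print("Ridge zero coefficients:", ridge_zeros)  # L2 shrinks coefficients but rarely to exactly 0
print("Lasso zero coefficients:", lasso_zeros)  # L1 can set uninformative coefficients to exactly 0
```

This is why Lasso is sometimes used as a feature selector, while Ridge is usually preferred when all features are expected to contribute a little.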

---

## **4️⃣ Supervised Learning: Classification**

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Logistic Regression
logr = LogisticRegression()
logr.fit(X_train, y_train)
y_pred = logr.predict(X_test)

# Decision Tree
dtc = DecisionTreeClassifier(max_depth=5)
dtc.fit(X_train, y_train)

# Random Forest
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)

# Gradient Boosting
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)

# AdaBoost
abc = AdaBoostClassifier(n_estimators=100)
abc.fit(X_train, y_train)

# KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# SVM
svc = SVC(kernel='rbf', probability=True)
svc.fit(X_train, y_train)

# Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
```

---

## **5️⃣ Unsupervised Learning / Dimensionality Reduction**

```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering, MeanShift
from sklearn.decomposition import PCA, TruncatedSVD, KernelPCA, NMF

# K-Means Clustering
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(X)

# DBSCAN
db = DBSCAN(eps=0.5, min_samples=5)
labels_db = db.fit_predict(X)

# Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=3)
labels_agg = agg.fit_predict(X)

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Truncated SVD (for sparse matrices)
svd = TruncatedSVD(n_components=2)
X_svd = svd.fit_transform(X)

# Kernel PCA
kpca = KernelPCA(n_components=2, kernel='rbf')
X_kpca = kpca.fit_transform(X)

# NMF (requires all entries of X to be non-negative)
nmf = NMF(n_components=2)
X_nmf = nmf.fit_transform(X)
```
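A common follow-up question for PCA is how many components to keep. One possible approach, sketched here on the bundled iris dataset (an assumption chosen because it needs no download), is to look at the cumulative explained variance and keep the smallest number of components that reaches a target such as 95%.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_std = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA().fit(X_std)  # keep all components to inspect the variance profile
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3))
print(n_keep)  # 2 for this dataset

# Passing a float lets PCA make the same choice internally
pca_95 = PCA(n_components=0.95).fit(X_std)
print(pca_95.n_components_)  # 2
```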

---

## **6️⃣ Metrics & Evaluation**

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Regression Metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

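# Classification Metrics
# y_pred holds predicted class labels; y_prob holds positive-class probabilities
y_pred = logr.predict(X_test)
y_prob = logr.predict_proba(X_test)[:, 1]
acc = accuracy_score(y_test, y_pred)
# precision/recall/f1 default to binary targets; pass average='macro' (or similar) for multi-class
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc = roc_auc_score(y_test, y_prob)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)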

# Clustering Metrics
sil = silhouette_score(X, labels)
dbi = davies_bouldin_score(X, labels)  # named dbi so it does not overwrite the DBSCAN object `db`
ch = calinski_harabasz_score(X, labels)
```
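To connect the classification metrics to the confusion matrix, here is a self-contained sketch on synthetic binary data (an assumption for illustration). It derives precision, recall, and F1 by hand from the four confusion-matrix counts and shows where `y_prob` for `roc_auc_score` comes from.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)              # hard class labels
y_prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# For binary labels {0, 1}, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
auc = roc_auc_score(y_test, y_prob)       # needs scores or probabilities, not labels

print(round(float(precision), 3), round(float(precision_score(y_test, y_pred)), 3))
print(round(float(recall), 3), round(float(recall_score(y_test, y_pred)), 3))
print(round(float(f1), 3), round(float(f1_score(y_test, y_pred)), 3))
print(round(float(auc), 3))
```

Each printed pair should agree, since the library functions compute the same ratios.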

---

## **7️⃣ Pipelines & Column Transformers**

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Preprocessing pipeline for numeric and categorical columns
# (column names are examples; X_train / X_test must be DataFrames containing them)
numeric_features = ['age','income']
categorical_features = ['gender','city']

numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
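# handle_unknown='ignore' keeps predict() from failing on categories unseen during fit
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])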

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Full pipeline with classifier
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier())])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```
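Pipelines can also be tuned as a single estimator. The sketch below assumes synthetic numeric data and a two-step pipeline with the step names `scaler` and `classifier` (names chosen for this example). Parameters of a step are addressed as `<step name>__<parameter>` in the grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("classifier", SVC())])

# Keys follow the "<step name>__<parameter>" convention
param_grid = {
    "classifier__C": [0.1, 1, 10],
    "classifier__kernel": ["linear", "rbf"],
}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X, y)  # the scaler is re-fitted inside every CV fold

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

Because the scaler lives inside the pipeline, each cross-validation fold fits it on that fold's training portion only, which avoids leaking validation data into preprocessing.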

---

## **8️⃣ Feature Selection**

```python
from sklearn.feature_selection import SelectKBest, f_classif, RFE, VarianceThreshold, SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# Select top-k features using ANOVA F-test
skb = SelectKBest(score_func=f_classif, k=5)
X_new = skb.fit_transform(X, y)

# Recursive Feature Elimination (RFE)
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)

# Variance Threshold
vt = VarianceThreshold(threshold=0.1)
X_vt = vt.fit_transform(X)

# Select from model (feature importance)
sfm = SelectFromModel(estimator=RandomForestClassifier(), threshold='median')
X_sfm = sfm.fit_transform(X, y)
```
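After fitting a selector it is usually helpful to know which columns survived. A minimal sketch, assuming synthetic data and hypothetical feature names `f0` to `f9`, uses `get_support()` to recover a boolean mask over the original columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (assumption): 10 features, 3 informative, the rest noise
X, y = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    shuffle=False,  # keeps the informative features in the first columns
    random_state=42,
)
feature_names = np.array([f"f{i}" for i in range(X.shape[1])])  # hypothetical names

skb = SelectKBest(score_func=f_classif, k=3)
X_new = skb.fit_transform(X, y)

mask = skb.get_support()        # boolean mask over the original columns
selected = feature_names[mask]  # names of the columns that were kept

print(X_new.shape)              # (300, 3)
print(selected)
print(skb.scores_.round(1))     # one F-statistic per original feature
```

The same `get_support()` method is available on `RFE`, `VarianceThreshold`, and `SelectFromModel`.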

---

