View in JupyterLab with Open With → Markdown Preview.

Machine Learning Plan

$$y = mx + b$$

Classes

  • Introduction to Machine Learning for Coders, fast.ai [DONE]
  • Deep Learning, fast.ai [in progress]
  • Computational Linear Algebra, fast.ai [TBD]
  • mlcourse.ai [Scheduled to start Sept 9, 2019]

Reading

Lectures

  • 3blue1brown Linear Algebra.
  • Owen Zhang talk at NYC Data Science Academy (link). Key ideas: model stacking (fit a GLM on sparse features and feed its predictions into XGBoost), leave-one-out target encoding for high-cardinality categorical variables, and GBM tuning.
  • raddar My Journey to Kaggle Grandmaster, Kaggle Days talk link.
  • raddar NCAA March Madness competition 1st place model approach; paris madness kernel. link
  • CPMP Beyond Feature Engineering and HPO, Kaggle Days talk link.
  • Vincent W. Winning with Linear Models link.
  • Vincent W. The Duct Tape of Heroes (Bayesian stats; pomegranate) link.
  • Szilard, @datascienceLA, On Machine Learning Software link.
  • Szilard, @datascienceLA, Better than Deep Learning: GBM link.
  • Tianqi Chen, XGBoost: A Scalable Tree Boosting System, June 2016 talk at DataScienceLA link.

Machine Learning Pipeline

EDA

```python
from pandas_profiling import ProfileReport
```
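A minimal usage sketch, assuming `train_df` is the raw training DataFrame (the file name is a placeholder):

```python
import pandas as pd
from pandas_profiling import ProfileReport

train_df = pd.read_csv("train.csv")           # placeholder path to the raw training data
profile = ProfileReport(train_df, title="Train EDA")
profile.to_file("train_eda.html")             # or profile.to_notebook_iframe() inside JupyterLab
```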

Data Cleaning
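No specifics are listed for this step; a generic pandas sketch of the usual chores (the column names here are hypothetical, not from the source):

```python
import pandas as pd

df = pd.read_csv("train.csv")

df = df.drop_duplicates()                                   # remove exact duplicate rows
df.columns = df.columns.str.strip().str.lower()             # normalize column names
df["date"] = pd.to_datetime(df["date"], errors="coerce")    # hypothetical date column
df["category"] = df["category"].str.strip().astype("category")  # hypothetical categorical column
```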

Baseline Random Model

```python
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import RepeatedKFold, cross_val_score

baseline = DummyRegressor(strategy="mean")   # always predicts the training-set mean
scorer = make_scorer(mean_squared_error)     # reports raw MSE per fold
scores_dummy = cross_val_score(baseline, train_df.values, y, cv=RepeatedKFold(n_repeats=100), scoring=scorer)
```

Feature Importance and Explainability

```python
from eli5 import show_weights
from eli5.sklearn import PermutationImportance

# top_columns: list of feature names matching the columns of X_train
perm = PermutationImportance(model, random_state=1).fit(X_train, y_train)
show_weights(perm, top=50, feature_names=top_columns)
```
  • treeinterpreter
  • shap
  • Jeremy's dendrogram code to inspect for redundant features (see the sketch after this list)
  • Jeremy's RF code to see if a feature can predict whether a sample is in or out of the test set. If it can, the train and test distributions differ on that feature, so it is a leakage/drift candidate to consider dropping.
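A sketch along the lines of the fast.ai dendrogram approach: cluster features by Spearman rank correlation and inspect branches that merge early as candidates for dropping one of the pair (assumes `train_df` holds only the feature columns):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

corr = np.round(spearmanr(train_df).correlation, 4)   # pairwise rank correlations between features
corr_condensed = squareform(1 - corr)                 # turn similarity into a condensed distance matrix
z = hc.linkage(corr_condensed, method="average")

plt.figure(figsize=(14, 8))
hc.dendrogram(z, labels=list(train_df.columns), orientation="left", leaf_font_size=14)
plt.show()
```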

Feature Engineering and Encoding

```python
import category_encoders as ce

encoder = ce.LeaveOneOutEncoder(cols=[...])
encoder.fit(X, y)
X_cleaned = encoder.transform(X_dirty)
```

Imputing

```python
from fancyimpute import KNN

X_filled_knn = KNN(k=3).fit_transform(X_incomplete)
```

Imbalanced Data and Data Augmentation
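Nothing specific is listed here; one common option (an assumption, not from the source) is SMOTE oversampling from imbalanced-learn for classification tasks, with class weights as the alternative when resampling is undesirable:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

# X_train, y_train: training features and (classification) labels
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train), Counter(y_resampled))   # class counts before and after oversampling
```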

Scaling and Outlier Handling
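A minimal sketch using scikit-learn scalers; RobustScaler is one way to limit the influence of outliers, and clipping to quantiles before a plain StandardScaler is another (the percentile thresholds are assumptions):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# RobustScaler centers on the median and scales by the IQR, so extreme values dominate less
X_scaled = RobustScaler().fit_transform(X_train)

# Alternatively, clip to the 1st/99th percentiles, then standardize
low, high = np.percentile(X_train, [1, 99], axis=0)
X_clipped = np.clip(X_train, low, high)
X_scaled_std = StandardScaler().fit_transform(X_clipped)
```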

Model Stacking
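A sketch of out-of-fold stacking in the spirit of the Owen Zhang talk above: base models are fit on K-1 folds, predict on the held-out fold, and a simple meta-model is trained on those out-of-fold predictions (the model choices and synthetic data are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)
base_models = [Ridge(alpha=1.0), RandomForestRegressor(n_estimators=100, random_state=0)]

kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros((len(X), len(base_models)))   # out-of-fold predictions, one column per base model

for train_idx, valid_idx in kf.split(X):
    for j, model in enumerate(base_models):
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx, j] = model.predict(X[valid_idx])

meta_model = Ridge(alpha=1.0)                # level-2 model trained on the OOF predictions
meta_model.fit(oof, y)
```

At prediction time the base models are refit on all of the training data, and their test-set predictions are passed through the meta-model.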

Cross Validation
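The baseline section above already uses RepeatedKFold; the main judgment call is matching the splitter to the data. A brief sketch of the usual choices (the group column is a hypothetical example):

```python
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold

cv_regression = KFold(n_splits=5, shuffle=True, random_state=0)                  # i.i.d. regression rows
cv_classification = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)    # preserves class balance per fold
cv_grouped = GroupKFold(n_splits=5)   # keeps all rows of one entity (e.g. one customer) in the same fold

# scores = cross_val_score(model, X, y, cv=cv_grouped, groups=df["customer_id"])  # hypothetical group column
```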

Hyperparameter Optimization

  • BayesOptCV
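Assuming the BayesOptCV bullet refers to Bayesian optimization over cross-validated scores (e.g. the bayes_opt package), a minimal sketch tuning two GBM parameters; `X`, `y` are the prepared training matrix and target:

```python
from bayes_opt import BayesianOptimization
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def cv_score(max_depth, learning_rate):
    # BayesianOptimization maximizes, so use negative MSE as the objective
    model = GradientBoostingRegressor(max_depth=int(max_depth), learning_rate=learning_rate, random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()

optimizer = BayesianOptimization(
    f=cv_score,
    pbounds={"max_depth": (2, 8), "learning_rate": (0.01, 0.3)},
    random_state=1,
)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)   # best score and parameters found
```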

Blending
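A minimal sketch: blend predictions from two already-fit models with a weight chosen on a holdout set (`model_a`, `model_b`, `X_holdout`, and `y_holdout` are hypothetical names):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

pred_a = model_a.predict(X_holdout)   # e.g. a GBM fit on the training folds
pred_b = model_b.predict(X_holdout)   # e.g. a linear model fit on the same folds

weights = np.linspace(0, 1, 21)
rmse = [np.sqrt(mean_squared_error(y_holdout, w * pred_a + (1 - w) * pred_b)) for w in weights]
best_w = weights[int(np.argmin(rmse))]
blend = best_w * pred_a + (1 - best_w) * pred_b
```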

Using Neptune

AutoML

  • Kaggle Days SF: H2O AutoML gets 8th place in the hackathon link
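A minimal H2O AutoML sketch (the file name and target column are placeholders):

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")   # placeholder path
target = "target"                      # placeholder target column
features = [c for c in train.columns if c != target]

aml = H2OAutoML(max_models=20, max_runtime_secs=3600, seed=1)
aml.train(x=features, y=target, training_frame=train)
print(aml.leaderboard.head())          # ranked models, including stacked ensembles
```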

Bayesian Learning