-
Notifications
You must be signed in to change notification settings - Fork 3
/
pipeline.md~
111 lines (61 loc) · 2.7 KB
/
pipeline.md~
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
# Machine Learning Plan #
## Classes ##
- Introduction to Machine Learning for Coders, fast.ai [DONE]
- Deep Learning, fast.ai [in progress]
- Computational Linear Algebra, fast.ai [TBD]
- [mlcourse.ai](https://mlcourse.ai/) [Schedule to start Sept 9, 2019]
## Reading ##
- Deep Learning Book, [math prerequesites](http://www.deeplearningbook.org/). [TBD]
- Feature Engineering Talk, [slides link](https://www.slideshare.net/HJvanVeen/feature-engineering-72376750). [TBD]
## Lectures ##
- [Owen Zhang](https://www.kaggle.com/owenzhang1) talk at NYC Data Academy ([link](https://www.youtube.com/watch?v=LgLcfZjNF44)). Key ideas on model stacking (using glm on sparse and then feeding into xgb); using leave-one-out target encoding for high cardinality categorical variables; gbm tuning.
- [raddar]() My Journey to Kaggle Grandmasster, Kaggle Days talk [link](https://www.youtube.com/watch?v=7XEMPU17-Wo).
- [CPMP]() Beyond Feature Engineering and HPO, Kaggle Days talk [link](https://www.youtube.com/watch?v=fH_FiquKhiI).
- Vincent W. Winning with Linear Models [link](https://www.youtube.com/watch?v=68ABAU_V8qI).
- Vincent W. The Duct Tape of Heroes (Bayesian stats; pomegranate) [link](https://www.youtube.com/watch?v=dE5j6NW-Kzg).
# Machine Learning Pipeline #
## EDA ##
```python
from pandas_profiling import ProfileReport
```
## Data Cleaning ##
- `pyjanitor` https://github.com/ericmjl/pyjanitor
## Baseline Random Model ##
```python
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer
scorer = make_scorer(mean_squared_error)
scores_dummy = cross_val_score(baseline, train_df.values, y, cv=RepeatedKFold(n_repeats=100), scoring=scorer)
```
## Feature Importance and Explanability ##
```python
from eli5.sklearn import PermutationImportance
```
- `tree_iterpreter`
- `shap`
- Jeremy's dendrogram code to inspect for redundant features
- Jeremy's RF code to see if feature can predict if a sample is in/out of the test set. If it can, this means that
## Feature Engineering and Encoding ##
- Category Encoding http://contrib.scikit-learn.org/categorical-encoding/index.html
```python
import category_encoders as ce
encoder = ce.LeaveOneOutEncoder(cols=[...])
encoder.fit(X, y)
X_cleaned = encoder.transform(X_dirty)
```
## Imputing ##
~~~{.python}
from fancyimpute import KNN
X_filled_knn = KNN(k=3).fit_transform(X_incomplete)
~~~
## Imbalanced Data and Data Augmentation ##
- https://imbalanced-learn.org/en/stable/index.html
## Scaling and Outlier Handling ##
## Model Stacking ##
- `vectack` package compat with `sklearn` api, https://github.com/vecxoz/vecstack
## Hyperparameter Optimzation ##
- `BayesOptCV`
## Blending ##
## Using Neptune ##
# Bayesian Learning #
-