# Notes on Machine Learning Models - Part 3 - Best Practices

This three-part series of notes mainly covers the following points to keep in mind when using Machine Learning models.

* Feature Engineering 
* Model Validation and
* Coding Best Practices

This is the third and final part, Best Practices.

In Data Science, even though tools like prompt engineering and AutoML have become common lately, we still have to write code for now.

So this part covers ...

* `sklearn.pipeline` and other `sklearn` utilities
* version control systems (VCS), and
* other coding-related dos and don'ts.

## `sklearn` Like a Boss

### The Ugly Way

Up to now, we have been writing messy code to do Feature Engineering.

> For experimenting in phases like Exploratory Data Analysis and the idea phase, you may well have written code like this.

In [1]:
from sklearn import preprocessing as sk_pp
from sklearn import datasets as sk_ds

import numpy as np
import pandas as pd

In [2]:
df_X, ds_y = sk_ds.fetch_openml(name="credit-g", as_frame=True, return_X_y=True)
df_X.head()

Unnamed: 0,checking_status,duration,credit_history,purpose,credit_amount,savings_status,employment,installment_commitment,personal_status,other_parties,residence_since,property_magnitude,age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker
0,<0,6.0,critical/other existing credit,radio/tv,1169.0,no known savings,>=7,4.0,male single,none,4.0,real estate,67.0,none,own,2.0,skilled,1.0,yes,yes
1,0<=X<200,48.0,existing paid,radio/tv,5951.0,<100,1<=X<4,2.0,female div/dep/mar,none,2.0,real estate,22.0,none,own,1.0,skilled,1.0,none,yes
2,no checking,12.0,critical/other existing credit,education,2096.0,<100,4<=X<7,2.0,male single,none,3.0,real estate,49.0,none,own,1.0,unskilled resident,2.0,none,yes
3,<0,42.0,existing paid,furniture/equipment,7882.0,<100,4<=X<7,2.0,male single,guarantor,4.0,life insurance,45.0,none,for free,1.0,skilled,2.0,none,yes
4,<0,24.0,delayed previously,new car,4870.0,<100,1<=X<4,3.0,male single,none,4.0,no known property,53.0,none,for free,2.0,skilled,2.0,none,yes


In [3]:
df_X_category = df_X.select_dtypes(include=["category"])
df_X_number = df_X.select_dtypes(include=["number"])

In [4]:
from sklearn import model_selection as sk_ms

df_X_tr, df_X_ts, ds_y_tr, ds_y_ts = sk_ms.train_test_split(df_X, ds_y, test_size=0.2, shuffle=True, random_state=42)

df_feat_tr = pd.DataFrame(data=None, index=df_X_tr.index)
df_feat_ts = pd.DataFrame(data=None, index=df_X_ts.index)

In [5]:
ordinal_columns = ["credit_history", "savings_status"]
oe = sk_pp.OrdinalEncoder(
    categories=[
        ['no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'],
        ['no known savings', '<100', '100<=X<500', '500<=X<1000', '>=1000']
    ], 
    handle_unknown="use_encoded_value", 
    unknown_value=np.nan
)

df_feat_tr.loc[:, ["oe_{}".format(c) for c in ordinal_columns]] = oe.fit_transform(df_X_tr[ordinal_columns])
df_feat_ts.loc[:, ["oe_{}".format(c) for c in ordinal_columns]] = oe.transform(df_X_ts[ordinal_columns])

df_feat_tr.head()

Unnamed: 0,oe_credit_history,oe_savings_status
29,3.0,1.0
535,4.0,1.0
695,2.0,3.0
557,0.0,0.0
836,2.0,0.0


In [6]:
norminal_columns = [c for c in df_X_category.columns if c not in ordinal_columns]

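# NOTE: on scikit-learn >= 1.2 this argument is named `sparse_output`; `sparse` was removed in 1.4.
ohe = sk_pp.OneHotEncoder(sparse=False, handle_unknown="ignore")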
ohe.fit(df_X_tr[norminal_columns])

norminal_features = ohe.get_feature_names_out()
df_feat_tr.loc[:, norminal_features] = ohe.transform(df_X_tr[norminal_columns])
df_feat_ts.loc[:, norminal_features] = ohe.transform(df_X_ts[norminal_columns])
df_feat_tr.head()

Unnamed: 0,oe_credit_history,oe_savings_status,checking_status_0<=X<200,checking_status_<0,checking_status_>=200,checking_status_no checking,purpose_business,purpose_domestic appliance,purpose_education,purpose_furniture/equipment,...,housing_own,housing_rent,job_high qualif/self emp/mgmt,job_skilled,job_unemp/unskilled non res,job_unskilled resident,own_telephone_none,own_telephone_yes,foreign_worker_no,foreign_worker_yes
29,3.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
535,4.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
695,2.0,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
557,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
836,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0


And that's not all: there is still the Feature Engineering from Part 1 to do.

Instead of writing messy code like this ... isn't there a better way?

### `sklearn.pipeline`

In [7]:
from sklearn import pipeline as sk_pipe
from sklearn import base as sk_base

In [8]:
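class ColumnSelector(sk_base.TransformerMixin):
    """Pipeline step that keeps only the given DataFrame columns."""

    def __init__(self, cols_to_select) -> None:
        super().__init__()
        self.cols_to_select = cols_to_select

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn: the column list is fixed at construction time.
        return self

    def transform(self, X, y=None):
        # X is expected to be a pandas DataFrame.
        return X[self.cols_to_select]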
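
As a quick check of what `ColumnSelector` does on its own, here is a minimal sketch. The class is repeated from the cell above so the snippet runs by itself, and the `toy` frame with its values is made up for illustration.

```python
import pandas as pd
from sklearn import base as sk_base


class ColumnSelector(sk_base.TransformerMixin):
    """Same transformer as in the cell above, repeated to keep this self-contained."""

    def __init__(self, cols_to_select) -> None:
        super().__init__()
        self.cols_to_select = cols_to_select

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None):
        return X[self.cols_to_select]


# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "housing": ["own", "rent", "own"],
    "age": [67.0, 22.0, 49.0],
    "duration": [6.0, 48.0, 12.0],
})

selector = ColumnSelector(cols_to_select=["age", "duration"])
# fit() learns nothing; transform() simply slices the requested columns.
selected = selector.fit(toy).transform(toy)
print(list(selected.columns))
```

Because the step only slices columns, it can sit at the front of any sub-pipeline that should see just part of the frame.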


In [9]:
ordinal_columns = ['credit_history', 'savings_status']
norminal_columns = [c for c in df_X_tr.select_dtypes(include=["category"]).columns if c not in ordinal_columns]
numeric_columns = list(df_X_tr.select_dtypes(include=["number"]).columns)

one_hot_pipeline = sk_pipe.Pipeline(steps=[
    ("norminal_selector", ColumnSelector(cols_to_select=norminal_columns)),
    # NOTE: on scikit-learn >= 1.2 this argument is named `sparse_output`; `sparse` was removed in 1.4.
    ("one_hot_encoder", sk_pp.OneHotEncoder(sparse=False, handle_unknown="ignore"))
], verbose=True)

ordinal_pipeline = sk_pipe.Pipeline(steps=[
    ("ordinal_selector", ColumnSelector(cols_to_select=ordinal_columns)),
    ("ordinal_encoder", sk_pp.OrdinalEncoder(
        categories=[
            ['no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'],
            ['no known savings', '<100', '100<=X<500', '500<=X<1000', '>=1000']
        ], 
        handle_unknown="use_encoded_value", unknown_value=np.nan
    ))
], verbose=True)

numeric_pipeline = sk_pipe.Pipeline(steps=[
    ("to_scale", sk_pipe.FeatureUnion(transformer_list=[
        ("ordinal_pipeline", ordinal_pipeline),
        ("numeric_selector", ColumnSelector(cols_to_select=numeric_columns))
    ], n_jobs=2)),
    ("scalers", sk_pipe.FeatureUnion(transformer_list=[
        ("ss", sk_pp.StandardScaler()),
        ("mms", sk_pp.MinMaxScaler()),
        ("mas", sk_pp.MaxAbsScaler())
    ], n_jobs=2, verbose=True))
])

In [10]:
from sklearn import feature_selection as sk_fs
from sklearn import svm
from sklearn import ensemble

model_pipeline = sk_pipe.Pipeline(steps=[
    ("preprocessing", sk_pipe.FeatureUnion(transformer_list=[
        # NOTE: despite its name, this step holds the one-hot (nominal) pipeline.
        ("ordinal_pipe", one_hot_pipeline),
        ("numeric_pipe", numeric_pipeline)
    ])),
    ("feature_select", sk_fs.SequentialFeatureSelector(
        estimator=svm.NuSVC(class_weight={0: 1, 1: 5}, random_state=42), n_features_to_select=50, direction="backward", cv=5, n_jobs=4)
    ),
    ("gbm", ensemble.GradientBoostingClassifier(n_estimators=100))
])

In [11]:
y_tr = [0 if _y == "good" else 1 for _y in ds_y_tr]
y_ts = [0 if _y == "good" else 1 for _y in ds_y_ts]

model_pipeline.fit(df_X_tr, y_tr)

[Pipeline] . (step 1 of 2) Processing norminal_selector, total=   0.0s
[Pipeline] ... (step 2 of 2) Processing one_hot_encoder, total=   0.0s
[Pipeline] .. (step 1 of 2) Processing ordinal_selector, total=   0.0s
[Pipeline] ... (step 2 of 2) Processing ordinal_encoder, total=   0.0s
[FeatureUnion] ............ (step 1 of 3) Processing ss, total=   0.0s
[FeatureUnion] ........... (step 2 of 3) Processing mms, total=   0.0s
[FeatureUnion] ........... (step 3 of 3) Processing mas, total=   0.0s


Pipeline(steps=[('preprocessing',
                 FeatureUnion(transformer_list=[('ordinal_pipe',
                                                 Pipeline(steps=[('norminal_selector',
                                                                  <__main__.ColumnSelector object at 0x7ff1d8ed7b38>),
                                                                 ('one_hot_encoder',
                                                                  OneHotEncoder(handle_unknown='ignore',
                                                                                sparse=False))],
                                                          verbose=True)),
                                                ('numeric_pipe',
                                                 Pipeline(steps=[('to_scale',
                                                                  FeatureUnion(n_jobs=2,
                                                                               transformer_list=[('ord

In [12]:
from sklearn import metrics
y_hat = model_pipeline.predict(df_X_ts)
# Weight each class-1 sample 5x, mirroring the {0: 1, 1: 5} class_weight given to NuSVC above.
print(metrics.classification_report(y_ts, y_hat, sample_weight=[1 if _y==0 else 5 for _y in y_ts]))

              precision    recall  f1-score   support

           0       0.43      0.89      0.58     141.0
           1       0.90      0.44      0.59     295.0

    accuracy                           0.59     436.0
   macro avg       0.66      0.67      0.59     436.0
weighted avg       0.75      0.59      0.59     436.0



Compare the readability of the Pipeline version with the earlier piece-by-piece version. The difference should be clear.
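
For comparison only: scikit-learn also ships `sklearn.compose.ColumnTransformer`, which can play the role of the selector-plus-`FeatureUnion` combination used in these notes. The sketch below is not the approach taken here, just an alternative shown on a made-up `toy` frame; the step names `one_hot` and `scaled` are hypothetical.

```python
import pandas as pd
from sklearn import compose as sk_compose
from sklearn import preprocessing as sk_pp

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "housing": ["own", "rent", "for free", "own"],
    "age": [67.0, 22.0, 49.0, 45.0],
})

preprocess = sk_compose.ColumnTransformer(
    transformers=[
        # (name, transformer, columns): the column list replaces a selector step.
        ("one_hot", sk_pp.OneHotEncoder(handle_unknown="ignore"), ["housing"]),
        ("scaled", sk_pp.StandardScaler(), ["age"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(toy)
print(features.shape)  # three one-hot columns for housing plus one scaled age column
```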

> Debugging pipelines is laborious, though. One way to do it is to run `fit`/`transform`/`predict` on each step separately.
>
> It is also common to rewrite piece-by-piece code as a pipeline just before shipping it to production. You should do the same.
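
One way to do the step-by-step debugging mentioned above is to pull a fitted step out of the pipeline by name. This is a minimal sketch on synthetic data; the two-step pipeline, the step names `scaler` and `clf`, and the use of `LogisticRegression` are stand-ins chosen for brevity, not the model from these notes.

```python
from sklearn import datasets as sk_ds
from sklearn import linear_model
from sklearn import pipeline as sk_pipe
from sklearn import preprocessing as sk_pp

# Synthetic data, only so the snippet runs on its own.
X, y = sk_ds.make_classification(n_samples=100, n_features=5, random_state=42)

pipe = sk_pipe.Pipeline(steps=[
    ("scaler", sk_pp.StandardScaler()),
    ("clf", linear_model.LogisticRegression()),
])
pipe.fit(X, y)

# Pull one fitted step out by name and run it alone to inspect its output.
scaled = pipe.named_steps["scaler"].transform(X)
print(scaled.shape)

# Feeding that intermediate output to the last step by hand should give
# the same predictions as calling predict() on the whole pipeline.
manual_pred = pipe.named_steps["clf"].predict(scaled)
print((manual_pred == pipe.predict(X)).all())
```

The same pattern applies to nested steps: each `FeatureUnion` or inner `Pipeline` is itself a fitted object that can be taken out and called directly.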

### Searching for Hyper-parameters

Until now, we have searched for hyper-parameters only by tweaking them a little at a time by hand.

> We have not yet searched with a systematic method.
>
> This section covers systematic hyper-parameter search.

In [22]:
feature_pipeline = sk_pipe.Pipeline(steps=[
    ("preprocessing", sk_pipe.FeatureUnion(transformer_list=[
        ("ordinal_pipe", one_hot_pipeline),
        ("numeric_pipe", numeric_pipeline)
    ])),
    ("feature_select", sk_fs.SequentialFeatureSelector(
        estimator=svm.NuSVC(class_weight={0: 1, 1: 5}, random_state=42), n_features_to_select=50, direction="backward", cv=5, n_jobs=4)
    )
])

feature_values_tr = feature_pipeline.fit_transform(df_X_tr, y_tr)
feature_values_ts = feature_pipeline.transform(df_X_ts)

[Pipeline] . (step 1 of 2) Processing norminal_selector, total=   0.0s
[Pipeline] ... (step 2 of 2) Processing one_hot_encoder, total=   0.0s
[Pipeline] .. (step 1 of 2) Processing ordinal_selector, total=   0.0s
[Pipeline] ... (step 2 of 2) Processing ordinal_encoder, total=   0.0s
[FeatureUnion] ............ (step 1 of 3) Processing ss, total=   0.0s
[FeatureUnion] ........... (step 3 of 3) Processing mas, total=   0.0s
[FeatureUnion] ........... (step 2 of 3) Processing mms, total=   0.0s


Suppose we want to search for hyper-parameters for `sklearn.svm.NuSVC`.

First, let's think it through like this.

* For `nu`, try 0.4, 0.5 and 0.6.
* For `degree`, try 2, 3, 4, 5 and 6 (`degree` only applies to the polynomial kernel, hence `kernel="poly"` in the grid).

Then we write it like this.

In [28]:
param_grid = [{
    "nu": [0.4, 0.5, 0.6],
    "kernel": ["poly"], 
    "degree": [2, 3, 4, 5, 6],
    "class_weight": [{0:1, 1:5}], 
    "max_iter": [1000]
}]
gridsearch = sk_ms.GridSearchCV(estimator=svm.NuSVC(), param_grid=param_grid, n_jobs=2, cv=3, verbose=1)
gridsearch.fit(feature_values_tr, y_tr)
y_hat = gridsearch.predict(feature_values_ts)
print (metrics.classification_report(y_ts, y_hat, sample_weight=[1 if _y==0 else 5 for _y in y_ts]))

Fitting 3 folds for each of 15 candidates, totalling 45 fits




              precision    recall  f1-score   support

           0       0.44      0.91      0.59     141.0
           1       0.92      0.44      0.59     295.0

    accuracy                           0.59     436.0
   macro avg       0.68      0.68      0.59     436.0
weighted avg       0.76      0.59      0.59     436.0





In [29]:
gridsearch.best_params_

{'class_weight': {0: 1, 1: 5},
 'degree': 2,
 'kernel': 'poly',
 'max_iter': 1000,
 'nu': 0.5}
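
The `Fitting 3 folds for each of 15 candidates, totalling 45 fits` line in the log comes from the Cartesian product of the grid: 3 values of `nu` times 5 values of `degree`, with every other entry fixed to a single value. A small sketch that enumerates the same grid with `ParameterGrid`, without fitting anything:

```python
from sklearn import model_selection as sk_ms

param_grid = [{
    "nu": [0.4, 0.5, 0.6],
    "kernel": ["poly"],
    "degree": [2, 3, 4, 5, 6],
    "class_weight": [{0: 1, 1: 5}],
    "max_iter": [1000],
}]

# ParameterGrid yields one dict per combination, the same ones GridSearchCV tries.
candidates = list(sk_ms.ParameterGrid(param_grid))
n_candidates = len(candidates)
n_fits = n_candidates * 3  # cv=3 means three fits per candidate
print(n_candidates, n_fits)
```

Counting candidates this way before launching a search is a cheap check on how long the search is likely to take.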

Likewise, try searching for hyper-parameters with `GridSearchCV` using `GradientBoostingClassifier` and `neural_network.MLPClassifier`.

In [None]:
# write your code here

## VCS

Version Control Systems (VCS) are indispensable for anyone who writes code today. Whatever project you work on, make a habit of always saving your project files (those under 50MB) with a GitHub/GitLab/Bitbucket account.

> See week 13 for the details.

## Coding Best Practices

### Refactor, Refactor and Refactor


There are three important things in Software Engineering. They are

1. Refactor
2. Refactor and
3. Refactor.

Split your code into functions as much as you can.

> As a rule of thumb, keep each function to 10 - 50 lines of code.
>
> Ideally, 20 - 30 lines.
>
> Once it goes past 100 lines, it is time to split the function.
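
As a small illustration of splitting code into functions, the label-encoding and sample-weight list comprehensions that are repeated in the earlier cells could be pulled out into two helpers. The names `encode_target` and `cost_weights` are hypothetical, and the toy labels are made up.

```python
def encode_target(labels, negative_label="good"):
    """Map class labels to 0/1: 0 for `negative_label`, 1 for anything else."""
    return [0 if label == negative_label else 1 for label in labels]


def cost_weights(y, positive_weight=5):
    """Give each class-1 sample `positive_weight` times the weight of a class-0 one."""
    return [1 if value == 0 else positive_weight for value in y]


# Toy labels, made up for illustration only.
labels = ["good", "bad", "good"]
y = encode_target(labels)
weights = cost_weights(y)
print(y, weights)
```

Each helper has one job and a name that says what it does, so the calling cell reads as a sequence of steps rather than a block of comprehensions.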

Split your code into files as much as you can.

> As a rule of thumb, keep each file to 100 - 500 lines of code.
>
> Ideally, 100 - 200 lines.
>
> Once it goes past 1000 lines, it is time to split the file.

If you know how to write object-oriented code, write it object-oriented.

> If you don't, that's fine too.

### Zen of Python

Always keep the 19 aphorisms of the Zen of Python in mind.

> 1. Beautiful is better than ugly.
> 
> 2. Explicit is better than implicit.
> 
> 3. Simple is better than complex.
> 
> 4. Complex is better than complicated.
> 
> 5. Flat is better than nested.
> 
> 6. Sparse is better than dense.
> 
> 7. Readability counts.
> 
> 8. Special cases aren't special enough to break the rules.
> 
> 9. Although practicality beats purity.
> 
> 10. Errors should never pass silently.
> 
> 11. Unless explicitly silenced.
> 
> 12. In the face of ambiguity, refuse the temptation to guess.
> 
> 13. There should be one-- and preferably only one --obvious way to do it.
> 
> 14. Although that way may not be obvious at first unless you're Dutch.
> 
> 15. Now is better than never.
> 
> 16. Although never is often better than *right* now.
> 
> 17. If the implementation is hard to explain, it's a bad idea.
> 
> 18. If the implementation is easy to explain, it may be a good idea.
> 
> 19. Namespaces are one honking great idea -- let's do more of those!
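
Python ships the Zen as an easter egg: running `import this` in a REPL or notebook prints it. The sketch below captures that output and decodes the ROT13-encoded text stored in `this.s`, so the aphorisms can be counted rather than just read.

```python
import codecs
import contextlib
import io

# Importing `this` prints the Zen as a side effect; capture the output so
# the snippet stays quiet and we can work with the text instead.
with contextlib.redirect_stdout(io.StringIO()):
    import this

# The module keeps the text ROT13-encoded in `this.s`.
zen_text = codecs.decode(this.s, "rot13")

# Skip the title line, then drop blank lines to keep only the aphorisms.
aphorisms = [line for line in zen_text.splitlines()[1:] if line.strip()]
print(len(aphorisms))
print(aphorisms[0])
```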