# Step 2: Create pipelines
After prototyping in the notebook, we break the code into .py modules that describe functionality (classes and functions) and logic (pipelines), all controlled by params.yaml.

This gives us version control over the code, logging, and the opportunity to cover the code with tests.

In [1]:
!pwd

/Users/antongusarov/ML_REPA/github/predict-device-change


In [4]:
%%bash
cd src/data
tree -L 1

.
├── __init__.py
├── load.py
└── process.py

0 directories, 3 files


In [9]:
%%bash
cd src/evaluate
tree -L 1

.
├── __pycache__
└── metrics.py

1 directory, 1 file


In [10]:
%%bash
cd src/train
tree -L 1

.
├── __pycache__
├── test
└── train.py

2 directories, 1 file


In [11]:
%%bash
cd src/utils
tree -L 1

.
├── __pycache__
├── logging.py
└── test_environment.py

1 directory, 2 files


In [12]:
%%bash
cd src/pipelines
tree -L 1

.
├── __pycache__
├── featurize.py
├── load_data.py
└── train.py

1 directory, 3 files


In [20]:
import yaml
with open('./config/params.yaml') as f:  # close the file once it is parsed
    params = yaml.safe_load(f)
params

{'base': {'project_dir': '.', 'random_state': 42, 'log_level': 'DEBUG'},
 'data_load': {'target': 'data/raw/target.feather',
  'dataset': 'data/raw/user_features.feather',
  'target_processed': 'data/processed/target.feather',
  'dataset_processed': 'data/processed/user_features.feather'},
 'featurize': {'features_path': 'data/processed/features.feather',
  'categories': ['feature_17',
   'feature_21',
   'feature_11',
   'feature_16',
   'feature_22']},
 'data_split': {'split_oos': True,
  'test_size': 1,
  'train_index_path': 'data/processed/train_index.csv',
  'test_index_path': 'data/processed/test_index.csv'},
 'train': {'catboost_params': {'iterations': 20,
   'thread_count': 20,
   'has_time': True,
   'allow_writing_files': False},
  'top_K_coef': 0.05,
  'model_path': 'models/model.joblib',
  'train_metrics': 'reports/train_metrics.json',
  'train_metrics_path': 'reports/train_metrics.json',
  'train_metrics_png': 'reports/train_metrics.png',
  'train_plots_path': 'reports/tra
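
Because the parsed config is a plain nested dict, each pipeline can combine the shared `base` section with its own stage section. The following is a minimal sketch under that assumption: `stage_config` is a hypothetical helper, not part of the project, and the dict reproduces only a few of the values printed above.

```python
# Excerpt of the parsed params.yaml shown above (only a few keys reproduced).
params = {
    "base": {"project_dir": ".", "random_state": 42, "log_level": "DEBUG"},
    "featurize": {"features_path": "data/processed/features.feather"},
    "train": {"top_K_coef": 0.05, "model_path": "models/model.joblib"},
}


def stage_config(params: dict, stage: str) -> dict:
    """Hypothetical helper: merge the shared 'base' section with one stage section."""
    if stage not in params:
        raise KeyError(f"No '{stage}' section in config; available: {sorted(params)}")
    # Stage-specific keys take precedence over base keys if names collide.
    return {**params.get("base", {}), **params[stage]}


train_cfg = stage_config(params, "train")
print(train_cfg["random_state"], train_cfg["model_path"])
```

Failing early with a readable `KeyError` makes a misspelled stage name easier to spot than a failure deep inside a pipeline.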

## Load and process data for training
Call the load_data.py pipeline. Note that we now have logs to track and monitor processing.

In [1]:
! pwd

/Users/antongusarov/ML_REPA/github/predict-device-change


In [2]:
! python -m src.pipelines.load_data --config=config/params.yaml

2021-01-28 16:06:22,357 — DATA_LOAD — INFO — Load dataset
2021-01-28 16:06:22,834 — DATA_LOAD — INFO — Process target
2021-01-28 16:06:22,863 — DATA_LOAD — INFO — Process dataset
2021-01-28 16:06:23,138 — DATA_LOAD — INFO — Save processed data and target
2021-01-28 16:06:23,374 — DATA_LOAD — DEBUG — Processed data path: data/processed/user_features.feather
2021-01-28 16:06:23,374 — DATA_LOAD — DEBUG — Processed data path: data/processed/target.feather
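
The log lines above are consistent with a standard-library `logging` setup that prints the timestamp, logger name, level, and message. Below is a minimal sketch of such a setup; the helper name `get_logger` and the exact format string are assumptions inferred from the output, not the contents of the project's `src/utils/logging.py`.

```python
import logging
import sys

# Assumed format, inferred from the log lines above.
LOG_FORMAT = "%(asctime)s — %(name)s — %(levelname)s — %(message)s"


def get_logger(name: str, log_level: str = "INFO") -> logging.Logger:
    """Hypothetical helper: return a console logger that uses the assumed format."""
    logger = logging.getLogger(name)
    logger.setLevel(log_level)  # accepts level names such as "DEBUG"
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    logger.propagate = False  # keep messages from being emitted again by the root logger
    return logger


logger = get_logger("DATA_LOAD", log_level="DEBUG")
logger.info("Load dataset")
logger.debug("Processed data path: %s", "data/processed/user_features.feather")
```

Passing the `log_level` value from the `base` section of params.yaml would let the verbosity be changed without touching the code.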


## Build features for training
Call the featurize.py pipeline.

In [3]:
! python -m src.pipelines.featurize --config=config/params.yaml

2021-01-28 16:07:56,664 — FEATURIZE — INFO — Load dataset
2021-01-28 16:07:57,179 — FEATURIZE — INFO — Process dataset
2021-01-28 16:07:57,384 — FEATURIZE — INFO — Add target column
2021-01-28 16:07:58,151 — FEATURIZE — INFO — Process nulls
2021-01-28 16:07:58,612 — FEATURIZE — INFO — Save features
2021-01-28 16:07:58,815 — FEATURIZE — DEBUG — Features path: data/processed/features.feather
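
Each pipeline here is started with `python -m src.pipelines.<name> --config=config/params.yaml`. A module that supports this kind of call could parse the option with `argparse`, as in the sketch below. The function names `parse_args` and `featurize` are hypothetical, and the stage body is a placeholder rather than the project's actual code.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the --config option used by the pipeline commands above."""
    parser = argparse.ArgumentParser(description="Run one pipeline stage")
    parser.add_argument("--config", dest="config", required=True,
                        help="Path to params.yaml")
    return parser.parse_args(argv)


def featurize(config_path: str) -> str:
    """Placeholder stage body (hypothetical): only reports which config it received."""
    return f"Would run featurize with config: {config_path}"


# In a real module this call would sit under `if __name__ == "__main__":`
# and use parse_args() with no arguments so that sys.argv is read.
args = parse_args(["--config=config/params.yaml"])
print(featurize(config_path=args.config))
```

Accepting an explicit `argv` list keeps the parser easy to exercise from tests without spawning a subprocess.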


## Train and save model together with scores
Call the pipelines/train.py pipeline, which trains the model, saves the metrics calculated during training and testing, and finally saves the trained model.
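
The config points the metrics at a JSON file (`reports/train_metrics.json`). As a rough sketch of that saving step, assuming the metrics are held in a plain dict, the hypothetical helper `save_metrics` below writes them with the standard `json` module. The metric names and values are placeholders for illustration, not results from this project.

```python
import json
import tempfile
from pathlib import Path


def save_metrics(metrics: dict, path: str) -> Path:
    """Hypothetical helper: write a metrics dict to a JSON file, creating parent folders."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w") as f:
        json.dump(metrics, f, indent=2)
    return out


# Placeholder values for illustration only, not results from this project.
example_metrics = {"fold_1": {"metric_a": 0.0, "metric_b": 0.0}}

# A temporary directory stands in for the project root in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_metrics(example_metrics, f"{tmp}/reports/train_metrics.json")
    print(json.loads(saved.read_text()))
```

Keeping metrics in a small JSON file makes them easy to diff between runs under version control.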

In [4]:
! python -m src.pipelines.train --config=config/params.yaml

{'project_dir': '.', 'random_state': 42, 'log_level': 'DEBUG'}
2021-01-28 16:10:03,924 — TRAIN — INFO — Load data
2021-01-28 16:10:04,501 — TRAIN — INFO — Instantiate model
2021-01-28 16:10:04,588 — TRAIN — INFO — Top_K 5.0% of the dataset size: 37606
2021-01-28 16:10:04,591 — TRAIN — INFO — Fold 1:
2021-01-28 16:10:04,591 — TRAIN — INFO — Train: 2020-04-30 00:00:00 - 2020-04-30 00:00:00
2021-01-28 16:10:04,591 — TRAIN — INFO — Test: 2020-05-31 00:00:00 

2021-01-28 16:10:04,717 — TRAIN — INFO — Train shapes: X - (150484, 30), y - (150484,)
2021-01-28 16:10:04,717 — TRAIN — INFO — Test shapes: X - (150411, 30), y - (150411,)
Learning rate set to 0.5
0:	learn: 0.6136792	total: 248ms	remaining: 4.72s
1:	learn: 0.5580362	total: 349ms	remaining: 3.14s
2:	learn: 0.5270051	total: 446ms	remaining: 2.53s
3:	learn: 0.5080045	total: 531ms	remaining: 2.12s
4:	learn: 0.4978499	total: 600ms	remaining: 1.8s
5:	learn: 0.4870497	total: 654ms	remaining: 1.53s
6:	learn: 0.4816503	total: 713ms	remaining: