
In [4]:
%cd /content/drive/MyDrive/Colab Notebooks/Approaching-Any-ML-Problem/
!mkdir src input models notebooks

/content/drive/MyDrive/Colab Notebooks/Approaching-Any-ML-Problem
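Plain `mkdir` complains if a folder already exists, which is easy to hit when re-running a notebook cell. A minimal Python sketch of the same step that is safe to repeat; the helper name `make_project_dirs` is my own and not part of the notebook:

```python
from pathlib import Path

# Folder names taken from the mkdir cell above.
PROJECT_DIRS = ("src", "input", "models", "notebooks")


def make_project_dirs(root="."):
    """Create the project folders under root, skipping any that already exist."""
    created = []
    for name in PROJECT_DIRS:
        path = Path(root) / name
        # exist_ok=True makes the call idempotent, unlike a bare `mkdir`.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_project_dirs()` from the project root should leave the same four folders in place no matter how many times the cell is run.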


In [7]:
%%writefile src/train.py
import joblib
import pandas as pd
from sklearn import metrics
from sklearn import tree

def run(fold):
  # read the training data with folds
  df = pd.read_csv('data/mnist_train_folds.csv')

  # training data is where kfold isn't equal to provided fold
  # also, note that we reset the index
  df_train = df[df.kfold != fold].reset_index(drop=True)

  # validation data is where kfold is equal to provided fold
  df_valid = df[df.kfold == fold].reset_index(drop=True)

  # drop the label column from dataframe and convert it to
  # a numpy array by using .values
  # target is label column in the dataframe
  x_train = df_train.drop('label', axis=1).values
  y_train = df_train.label.values

  # similarly, for validation, we have
  x_valid = df_valid.drop('label', axis=1).values
  y_valid = df_valid.label.values

  # initialize simple decision tree classifier
  clf = tree.DecisionTreeClassifier()

  # fit the model on training data
  clf.fit(x_train, y_train)

  # create predictions for validation samples
  preds = clf.predict(x_valid)

  # accuracy
  accuracy = metrics.accuracy_score(y_valid, preds)
  print(f"Fold={fold}, Accuracy={accuracy}")

  # save the model
  joblib.dump(clf, f"models/dt_{fold}.bin")

if __name__ == "__main__":
  run(fold=0)
  run(fold=1)
  run(fold=2)
  run(fold=3)
  run(fold=4)


Overwriting src/train.py


In [8]:
!python src/train.py

Fold=0, Accuracy=0.8645833333333334
Fold=1, Accuracy=0.86975
Fold=2, Accuracy=0.8684166666666666
Fold=3, Accuracy=0.8715833333333334
Fold=4, Accuracy=0.8683333333333333


Some things are still hardcoded: the fold numbers, the training file path, and the output folder. We can move the paths into a config file.

As you can see, we call the `run` function once for every fold. Running multiple folds in the same script is not always advisable, since memory consumption may keep increasing and your program may crash. To take care of this problem, we can pass the fold as an argument to the training script. I like doing it using `argparse`.
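As a rough sketch of the idea, the parser below accepts a single `--fold` argument. The `required=True` and `choices=range(5)` settings and the `parse_args` wrapper are assumptions added for illustration (the notebook's own script uses only `type=int`); they reject a missing or out-of-range fold before any data is loaded.

```python
import argparse


def parse_args(argv=None):
    """Parse the training script's command-line arguments."""
    parser = argparse.ArgumentParser(description="Train a model on one fold")
    # required and choices are illustrative additions, not in the notebook:
    # they make argparse exit with a usage message for a bad or missing fold.
    parser.add_argument("--fold", type=int, required=True, choices=range(5))
    # argv=None means "read sys.argv"; passing a list makes this easy to try out.
    return parser.parse_args(argv)


args = parse_args(["--fold", "2"])
print(args.fold)
```

Passing the argument list explicitly, as in the last two lines, is only a convenience for experimenting in a notebook; a script would call `parse_args()` with no arguments.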

In [10]:
%%writefile src/config.py
TRAINING_FILE = "../data/mnist_train_folds.csv"
MODEL_OUTPUT = "../models/"

Overwriting src/config.py
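Note that these relative paths only resolve correctly when the script is started from inside `src`. One possible alternative, sketched here under the assumption that `config.py` lives in `<project>/src/`, is to derive the paths from the config file's own location. `build_paths` is a hypothetical helper, not something the notebook defines:

```python
from pathlib import Path


def build_paths(config_file):
    """Return data and model paths relative to the project root.

    Assumes config_file lives in <project>/src/, as in this notebook.
    """
    project_root = Path(config_file).resolve().parent.parent
    return {
        "TRAINING_FILE": project_root / "data" / "mnist_train_folds.csv",
        "MODEL_OUTPUT": project_root / "models",
    }


# Inside src/config.py this could be called as build_paths(__file__).
# The path below is only a placeholder for demonstration.
paths = build_paths("/tmp/project/src/config.py")
print(paths["TRAINING_FILE"])
```

With paths built this way, the training script can be launched from any working directory.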


Then, update `train.py` to read these values from `config`.

In [12]:
%cd src

/content/drive/MyDrive/Colab Notebooks/Approaching-Any-ML-Problem/src


In [20]:
%%writefile train.py
import os
import argparse
import joblib
import config
import pandas as pd
from sklearn import metrics
from sklearn import tree

def run(fold):
  # read the training data with folds
  df = pd.read_csv(config.TRAINING_FILE)

  # training data is where kfold isn't equal to provided fold
  # also, note that we reset the index
  df_train = df[df.kfold != fold].reset_index(drop=True)

  # validation data is where kfold is equal to provided fold
  df_valid = df[df.kfold == fold].reset_index(drop=True)

  # drop the label column from dataframe and convert it to
  # a numpy array by using .values
  # target is label column in the dataframe
  x_train = df_train.drop('label', axis=1).values
  y_train = df_train.label.values

  # similarly, for validation, we have
  x_valid = df_valid.drop('label', axis=1).values
  y_valid = df_valid.label.values

  # initialize simple decision tree classifier
  clf = tree.DecisionTreeClassifier()

  # fit the model on training data
  clf.fit(x_train, y_train)

  # create predictions for validation samples
  preds = clf.predict(x_valid)

  # accuracy
  accuracy = metrics.accuracy_score(y_valid, preds)
  print(f"Fold={fold}, Accuracy={accuracy}")

  # save the model
  joblib.dump(
      clf,
      os.path.join(config.MODEL_OUTPUT, f"dt_{fold}.bin")
  )

if __name__ == "__main__":
  # initialize ArgumentParser class of argparse
  parser = argparse.ArgumentParser()

  # add the different arguments you need and their type
  parser.add_argument(
      "--fold",
      type=int
  )
  args = parser.parse_args()

  run(fold=args.fold)


Overwriting train.py


In [21]:
!python train.py --fold 0

Fold=0, Accuracy=0.86925


In [23]:
%%writefile train_5.sh
#!/bin/sh
python train.py --fold 0
python train.py --fold 1
python train.py --fold 2
python train.py --fold 3
python train.py --fold 4

Overwriting train_5.sh


In [25]:
!sh train_5.sh

Fold=0, Accuracy=0.8660833333333333
Fold=1, Accuracy=0.8683333333333333
Fold=2, Accuracy=0.8695
Fold=3, Accuracy=0.86925
Fold=4, Accuracy=0.871
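When comparing runs, it can help to collapse the per-fold numbers into a single summary. A small sketch that parses the printed lines and reports the mean and standard deviation; the log text is copied from the output above, and `fold_accuracies` is a hypothetical helper:

```python
import re
import statistics

# Output of train_5.sh, copied from the cell above.
LOG = """\
Fold=0, Accuracy=0.8660833333333333
Fold=1, Accuracy=0.8683333333333333
Fold=2, Accuracy=0.8695
Fold=3, Accuracy=0.86925
Fold=4, Accuracy=0.871
"""


def fold_accuracies(log_text):
    """Map fold number to accuracy for every 'Fold=..., Accuracy=...' line."""
    pattern = re.compile(r"Fold=(\d+), Accuracy=([0-9.]+)")
    return {int(fold): float(acc) for fold, acc in pattern.findall(log_text)}


scores = fold_accuracies(LOG)
values = list(scores.values())
print(f"mean={statistics.mean(values):.4f}, std={statistics.stdev(values):.4f}")
```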


If we look at our training script, we still are limited by a few things, for example, the model. The model is hardcoded in the training script, and the only way to change it is to modify the script. So, we will create a new python script called `model_dispatcher.py`. `model_dispatcher.py`, as the name suggests, will dispatch our models to our training script.

In [26]:
%%writefile model_dispatcher.py
from sklearn import tree

models = {
    "decision_tree_gini": tree.DecisionTreeClassifier(
        criterion="gini"
    ),
    "decision_tree_entropy": tree.DecisionTreeClassifier(
        criterion="entropy"
    ),
}

Writing model_dispatcher.py
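One thing to keep in mind with a plain dictionary lookup is that a mistyped model name surfaces as a bare `KeyError`. Below is a minimal sketch of a lookup helper with a friendlier message. `get_model` is a hypothetical name, and plain strings stand in for the scikit-learn estimators so the example runs without any dependencies:

```python
# Stand-in registry: in model_dispatcher.py the values are scikit-learn
# estimators; plain strings are used here so the sketch has no dependencies.
models = {
    "decision_tree_gini": "DecisionTreeClassifier(criterion='gini')",
    "decision_tree_entropy": "DecisionTreeClassifier(criterion='entropy')",
}


def get_model(name):
    """Look up a model by name, listing the valid names on failure."""
    try:
        return models[name]
    except KeyError:
        valid = ", ".join(sorted(models))
        raise ValueError(f"Unknown model {name!r}. Available: {valid}") from None
```

The training script could then call `get_model(model)` instead of indexing the dictionary directly.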


In [27]:
%%writefile train.py
import os
import argparse

import joblib
import pandas as pd
from sklearn import metrics

import config
import model_dispatcher

def run(fold, model):
  # read the training data with folds
  df = pd.read_csv(config.TRAINING_FILE)

  # training data is where kfold isn't equal to provided fold
  # also, note that we reset the index
  df_train = df[df.kfold != fold].reset_index(drop=True)

  # validation data is where kfold is equal to provided fold
  df_valid = df[df.kfold == fold].reset_index(drop=True)

  # drop the label column from dataframe and convert it to
  # a numpy array by using .values
  # target is label column in the dataframe
  x_train = df_train.drop('label', axis=1).values
  y_train = df_train.label.values

  # similarly, for validation, we have
  x_valid = df_valid.drop('label', axis=1).values
  y_valid = df_valid.label.values

  # fetch the requested model from model_dispatcher
  clf = model_dispatcher.models[model]

  # fit the model on training data
  clf.fit(x_train, y_train)

  # create predictions for validation samples
  preds = clf.predict(x_valid)

  # accuracy
  accuracy = metrics.accuracy_score(y_valid, preds)
  print(f"Fold={fold}, Accuracy={accuracy}")

  # save the model under its dispatcher key so that different
  # models do not overwrite each other's files
  joblib.dump(
      clf,
      os.path.join(config.MODEL_OUTPUT, f"{model}_{fold}.bin")
  )

if __name__ == "__main__":
  # initialize ArgumentParser class of argparse
  parser = argparse.ArgumentParser()

  # add the different arguments you need and their type
  parser.add_argument(
      "--fold",
      type=int
  )
  parser.add_argument(
      "--model",
      type=str
  )

  args = parser.parse_args()

  run(
      fold=args.fold,
      model=args.model)


Overwriting train.py


In [28]:
!python train.py --fold 0 --model decision_tree_gini

Fold=0, Accuracy=0.8661666666666666


In [29]:
!python train.py --fold 0 --model decision_tree_entropy

Fold=0, Accuracy=0.871


We can always add a new model to `model_dispatcher.py`:

In [30]:
%%writefile model_dispatcher.py
from sklearn import tree
from sklearn import ensemble

models = {
    "decision_tree_gini": tree.DecisionTreeClassifier(
        criterion="gini"
    ),
    "decision_tree_entropy": tree.DecisionTreeClassifier(
        criterion="entropy"
    ),
    "rf": ensemble.RandomForestClassifier(),
}

Overwriting model_dispatcher.py


In [31]:
!python train.py --fold 0 --model rf

Fold=0, Accuracy=0.9681666666666666


In [32]:
%%writefile train_5_rf.sh
#!/bin/sh
python train.py --fold 0 --model rf
python train.py --fold 1 --model rf
python train.py --fold 2 --model rf
python train.py --fold 3 --model rf
python train.py --fold 4 --model rf

Writing train_5_rf.sh


In [33]:
!sh train_5_rf.sh

Fold=0, Accuracy=0.9686666666666667
Fold=1, Accuracy=0.969
Fold=2, Accuracy=0.96625
Fold=3, Accuracy=0.9674166666666667
Fold=4, Accuracy=0.9674166666666667
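The shell script repeats one line per fold. The same thing can be sketched in Python by launching `train.py` once per fold in a separate process, so each fold's memory is released when its process exits. The helper names `fold_commands` and `run_all` are my own, and the sketch assumes `train.py` accepts `--fold` and `--model` as shown above:

```python
import subprocess
import sys


def fold_commands(model, n_folds=5, script="train.py"):
    """Build one command per fold, mirroring the lines of train_5_rf.sh."""
    return [
        [sys.executable, script, "--fold", str(fold), "--model", model]
        for fold in range(n_folds)
    ]


def run_all(model, n_folds=5):
    """Run each fold in its own Python process, stopping on the first failure."""
    for cmd in fold_commands(model, n_folds):
        # check=True raises CalledProcessError if a fold exits with an error.
        subprocess.run(cmd, check=True)


# Usage, from the directory containing train.py:
# run_all("rf")
```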


Please note that I did not use `import *`, and neither should you. If I
had, you would never have known where the `models` dictionary came
from. Writing good, understandable code is an essential skill, and
many data scientists ignore it. If others can understand and use your project
without consulting you, you save their time and your own, and you can
invest that time in improving your project or working on a new one.

You can also use [cookiecutter](https://drivendata.github.io/cookiecutter-data-science/#getting-started) to set up the project directories for you.

In [34]:
!pip install cookiecutter

Collecting cookiecutter
  Downloading cookiecutter-1.7.3-py2.py3-none-any.whl (34 kB)
Collecting jinja2-time>=0.2.0
  Downloading jinja2_time-0.2.0-py2.py3-none-any.whl (6.4 kB)
Collecting poyo>=0.5.0
  Downloading poyo-0.5.0-py2.py3-none-any.whl (10 kB)
Collecting binaryornot>=0.4.4
  Downloading binaryornot-0.4.4-py2.py3-none-any.whl (9.0 kB)
Collecting arrow
  Downloading arrow-1.2.2-py3-none-any.whl (64 kB)
Installing collected packages: arrow, poyo, jinja2-time, binaryornot, cookiecutter
Successfully installed arrow-1.2.2 binaryornot-0.4.4 cookiecutter-1.7.3 jinja2-time-0.2.0 poyo-0.5.0


In [35]:
%cd /content/

/content


In [37]:
!cookiecutter https://github.com/drivendata/cookiecutter-data-science


project_name [project_name]: test
repo_name [test]: test
author_name [Your name (or your organization/company/team)]: indrik
description [A short description of the project.]: testing data science project directory
Select open_source_license:
1 - MIT
2 - BSD-3-Clause
3 - No license file
Choose from 1, 2, 3 [1]: 1
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]: 
aws_profile [default]: 
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 [1]: 1



Cookiecutter data science is moving to v2 soon, which will entail using
the command `ccds ...` rather than `cookiecutter ...`. The cookiecutter command
will continue to work, and this version of the template will still be available.
To use the legacy template, you will need to explicitly use `-c v1` to select it.

Please update any scripts/automation you have to append the `-c v1` option,
which is available now.

For example:
    cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-scie

```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
```
