# Pipeline of Digits

This is a starting notebook for solving the "Pipeline of Digits" assignment.


This notebook was created by [Santiago L. Valdarrama](https://twitter.com/svpino) as part of the [Machine Learning School](https://www.ml.school) program.

Let's make sure we are running the latest version of the SageMaker SDK. **Restart the notebook** after you upgrade the library.

In [1]:
# !pip install -q --upgrade awscli
# !pip install -q --upgrade pip
# !pip install -q --upgrade sagemaker
# !pip show sagemaker

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
# import boto3
# import sagemaker
# import pandas as pd
#
# from pathlib import Path
#
# role = sagemaker.get_execution_role()
# region = boto3.Session().region_name
# sagemaker_session = sagemaker.session.Session()

## Creating the S3 Bucket

Let's create an S3 bucket where you will upload all the information generated by the pipeline. Make sure you set `BUCKET` to the name of the bucket you want to use. This name has to be unique.

The cell below creates the bucket in `us-east-1`. If you want to create the bucket in any other region, use this command instead:

```
!aws s3api create-bucket --bucket $BUCKET --create-bucket-configuration LocationConstraint=$region
```

The `LocationConstraint` argument should specify the region where you want to create the bucket.
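If you prefer to create the bucket from Python rather than the CLI, the same region rule can be sketched with `boto3`. The helper names `bucket_request` and `create_bucket` are hypothetical and only serve this illustration; `create_bucket` needs `boto3` installed and valid AWS credentials.

```python
def bucket_request(bucket, region):
    """Build the keyword arguments for the S3 CreateBucket call."""
    params = {"Bucket": bucket}
    # The LocationConstraint is only supplied for regions other than us-east-1.
    if region != "us-east-1":
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return params


def create_bucket(bucket, region):
    """Create the bucket. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so bucket_request works without boto3

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**bucket_request(bucket, region))


# Inspect the request without touching AWS.
print(bucket_request("mlschool", "eu-west-1"))
```

Keeping the request-building step separate makes it easy to check the arguments before any call reaches AWS.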

In [4]:
# BUCKET = "mlschool"
#
# !aws s3api create-bucket --bucket $BUCKET

In [5]:
import os
current_folder = os.getcwd()
print(current_folder)
%ls

C:\Users\kamil\Desktop\ml-school\ml.school\mnist
 Volume in drive C has no label.
 Volume Serial Number is 48D1-EA57

 Directory of C:\Users\kamil\Desktop\ml-school\ml.school\mnist

22.04.2023  17:42    <DIR>          .
22.04.2023  17:42    <DIR>          ..
22.04.2023  17:39    <DIR>          __pycache__
22.04.2023  17:03    <DIR>          baseline
22.04.2023  17:39    <DIR>          dataset
19.04.2023  21:08        15 491 438 dataset.tar.gz
22.04.2023  17:42            97 349 mnist.ipynb
22.04.2023  17:04    <DIR>          pipeline
22.04.2023  17:39             4 607 preprocessor.py
22.04.2023  17:04    <DIR>          test
22.04.2023  17:03    <DIR>          train
22.04.2023  17:39             2 690 train.py
22.04.2023  17:04    <DIR>          validation
               4 File(s)     15 596 084 bytes
               9 Dir(s)  610 723 434 496 bytes free


## Loading the dataset

We have two CSV files containing the MNIST dataset. These files come from the [MNIST in CSV](https://www.kaggle.com/datasets/oddrationale/mnist-in-csv) Kaggle dataset.

The `mnist_train.csv` file contains 60,000 training examples and labels. The `mnist_test.csv` contains 10,000 test examples and labels. Each row consists of 785 values: the first value is the label (a number from 0 to 9) and the remaining 784 values are the pixel values (a number from 0 to 255).
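To make the row layout concrete, here is a minimal sketch that splits one row into its label and a 28x28 image. The row is synthetic, not real MNIST data, and the row-major pixel order is an assumption based on the column names (`1x1`, `1x2`, ..., `28x28`).

```python
import numpy as np

# A synthetic row in the layout described above: label first, then 784 pixels.
row = np.concatenate(([7], np.arange(784) % 256))

label = int(row[0])
image = row[1:].reshape(28, 28)  # 784 pixel values -> 28 rows of 28 pixels

print(label, image.shape, image.min(), image.max())
```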

Let's extract the `dataset.tar.gz` file.

In [6]:
from pathlib import Path

MNIST_FOLDER = "C:\\Users\\kamil\\Desktop\\ml-school\\ml.school\\mnist"
DATASET_FOLDER = Path(MNIST_FOLDER) / "dataset"

!tar -xvzf $MNIST_FOLDER\dataset.tar.gz -C $MNIST_FOLDER --no-same-owner

x dataset/
x dataset/mnist_test.csv
x dataset/mnist_train.csv
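The `tar` command above depends on the shell that the notebook runs in. As a portable alternative, the extraction could be sketched with Python's standard `tarfile` module. The function name `extract_archive` is hypothetical.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, destination):
    """Extract a .tar.gz archive into `destination` and return the member names."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(destination)
    return names


# Example usage with the folders defined in this notebook:
# extract_archive(Path(MNIST_FOLDER) / "dataset.tar.gz", MNIST_FOLDER)
```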


Let's load the first 10 rows of the train set.

In [7]:
import pandas as pd

df = pd.read_csv(DATASET_FOLDER / "mnist_train.csv", nrows=10)
df

Unnamed: 0,label,1x1,1x2,1x3,1x4,1x5,1x6,1x7,1x8,1x9,...,28x19,28x20,28x21,28x22,28x23,28x24,28x25,28x26,28x27,28x28
0,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Uploading dataset to S3

In [8]:
# S3_FILEPATH = f"s3://{BUCKET}/{MNIST_FOLDER}"
#
#
# TRAIN_SET_S3_URI = sagemaker.s3.S3Uploader.upload(
#     local_path=str(DATASET_FOLDER / "mnist_train.csv"),
#     desired_s3_uri=S3_FILEPATH,
# )
#
# TEST_SET_S3_URI = sagemaker.s3.S3Uploader.upload(
#     local_path=str(DATASET_FOLDER / "mnist_test.csv"),
#     desired_s3_uri=S3_FILEPATH,
# )
#
# print(f"Train set S3 location: {TRAIN_SET_S3_URI}")
# print(f"Test set S3 location: {TEST_SET_S3_URI}")

In [9]:
MNIST_FOLDER="C:\\Users\\kamil\\Desktop\\ml-school\\ml.school\\mnist"

In [10]:
%%writefile {MNIST_FOLDER}/preprocessor.py

import os
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from pickle import dump
from pathlib import Path


# Local folder that contains the extracted dataset. When this script
# runs as a SageMaker Processing job, this should point to the location
# where the job saves the input dataset.
BASE_DIR = "C:\\Users\\kamil\\Desktop\\ml-school\\ml.school\\mnist"
DATA_FILEPATH_TRAIN = Path(BASE_DIR) / "dataset/" / "mnist_train.csv"
DATA_FILEPATH_TEST = Path(BASE_DIR) / "dataset/" / "mnist_test.csv"


def save_splits(base_dir, train, validation, test):
    """
    One of the goals of this script is to output the three
    dataset splits. This function will save each of these
    splits to disk.
    """

    train_path = Path(base_dir) / "train"
    validation_path = Path(base_dir) / "validation"
    test_path = Path(base_dir) / "test"

    train_path.mkdir(parents=True, exist_ok=True)
    validation_path.mkdir(parents=True, exist_ok=True)
    test_path.mkdir(parents=True, exist_ok=True)

    pd.DataFrame(train).to_csv(train_path / "train.csv", header=False, index=False)
    pd.DataFrame(validation).to_csv(validation_path / "validation.csv", header=False, index=False)
    pd.DataFrame(test).to_csv(test_path / "test.csv", header=False, index=False)


def save_pipeline(base_dir, pipeline):
    """
    Saves the Scikit-Learn pipeline that we used to
    preprocess the data.
    """
    pipeline_path = Path(base_dir) / "pipeline"
    pipeline_path.mkdir(parents=True, exist_ok=True)
    # Use a context manager so the file is closed after writing.
    with open(pipeline_path / "pipeline.pkl", "wb") as f:
        dump(pipeline, f)


def generate_baseline(base_dir, X_train, y_train):
    """
    Generates a baseline for our model using the train set.
    It saves the baseline in a JSON file where every line is
    a JSON object.
    """
    baseline_path = Path(base_dir) / "baseline"
    baseline_path.mkdir(parents=True, exist_ok=True)

    df = X_train.copy()
    df["groundtruth"] = y_train

    df.to_json(baseline_path / "baseline.json", orient='records', lines=True)


def preprocess(base_dir, data_filepath_train, data_filepath_test):
    """
    Preprocesses the supplied raw dataset and splits it into a train, validation,
    and a test set.
    """

    df_train = pd.read_csv(data_filepath_train)
    df_test = pd.read_csv(data_filepath_test)

    numerical_columns = df_train.select_dtypes(include=['number']).drop(['label'], axis=1).columns

    numerical_preprocessor = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler())
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ("numerical", numerical_preprocessor, numerical_columns),
        ]
    )

    # Hold out 20% of the training data for validation. The test set
    # comes from its own CSV file.
    train_split, validation_split = train_test_split(df_train, test_size=0.2, random_state=12)

    # Separate the label column from the pixel columns in each split.
    y_train = train_split["label"].astype(int)
    y_validation = validation_split["label"].astype(int)
    y_test = df_test["label"].astype(int)

    X_train = train_split.drop(columns=["label"])
    X_validation = validation_split.drop(columns=["label"])
    X_test = df_test.drop(columns=["label"])

    # Let's use the train set to generate a baseline that we can
    # later use to measure the quality of our model. This baseline
    # will use the original data.
    generate_baseline(base_dir, X_train, y_train)

    # Transform the data using the Scikit-Learn pipeline.
    X_train = preprocessor.fit_transform(X_train)
    X_validation = preprocessor.transform(X_validation)
    X_test = preprocessor.transform(X_test)

    train = np.concatenate((X_train, np.expand_dims(y_train, axis=1)), axis=1)
    validation = np.concatenate((X_validation, np.expand_dims(y_validation, axis=1)), axis=1)
    test = np.concatenate((X_test, np.expand_dims(y_test, axis=1)), axis=1)

    save_splits(base_dir, train, validation, test)
    save_pipeline(base_dir, pipeline=preprocessor)


if __name__ == "__main__":
    preprocess(BASE_DIR, DATA_FILEPATH_TRAIN, DATA_FILEPATH_TEST)

Overwriting C:\Users\kamil\Desktop\ml-school\ml.school\mnist/preprocessor.py
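The script fits the scaler on the train set only and then reuses the fitted statistics for the validation and test sets. The following NumPy-only sketch illustrates that idea on a toy array. It is not how Scikit-Learn implements `StandardScaler`, and the helper names `fit_scaler` and `apply_scaler` are hypothetical.

```python
import numpy as np


def fit_scaler(X):
    """Compute the per-column mean and standard deviation from the train set."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Columns that never change (e.g. border pixels that are always 0) have a
    # standard deviation of 0; use 1 instead to avoid dividing by zero.
    std[std == 0] = 1.0
    return mean, std


def apply_scaler(X, mean, std):
    """Standardize X with statistics that were computed elsewhere."""
    return (X - mean) / std


X_train = np.array([[0.0, 10.0], [0.0, 20.0], [0.0, 30.0]])
X_validation = np.array([[0.0, 40.0]])

# Fit on the train set only, then reuse the same statistics for validation.
mean, std = fit_scaler(X_train)
scaled_train = apply_scaler(X_train, mean, std)
scaled_validation = apply_scaler(X_validation, mean, std)

print(scaled_train)
print(scaled_validation)
```

Reusing the train statistics keeps information from the validation and test sets out of the preprocessing step.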


In [11]:
import os
from pathlib import Path

from mnist.preprocessor import preprocess
import tempfile

LOCAL_FILEPATH_TRAIN="C:\\Users\\kamil\\Desktop\\ml-school\\ml.school\\mnist\\dataset\\mnist_train.csv"
LOCAL_FILEPATH_TEST="C:\\Users\\kamil\\Desktop\\ml-school\\ml.school\\mnist\\dataset\\mnist_test.csv"
ROOT_DIR=r"C:\Users\kamil\Desktop\ml-school\ml.school\mnist"


# preprocess(
#     base_dir=ROOT_DIR,
#     data_filepath_train=LOCAL_FILEPATH_TRAIN,
#     data_filepath_test=LOCAL_FILEPATH_TEST,
# )
#
# print(f"Folders: {os.listdir(ROOT_DIR)}")
#
# print()
# print("Baseline:")
#
# with open(Path(ROOT_DIR) / 'baseline' / 'baseline.json') as baseline:
#     lines = [next(baseline) for _ in range(5)]
#
# for l in lines:
#     print(l[:-1])
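The baseline file uses the JSON Lines layout, where every line is a separate JSON object. A minimal sketch of parsing that layout with the standard library follows. The two records are made up for illustration and use an in-memory buffer instead of the real `baseline.json`.

```python
import io
import json

# Two records in the JSON Lines layout: one JSON object per line.
# The values are made up; only two pixel columns are shown.
sample = io.StringIO(
    '{"1x1":0,"1x2":0,"groundtruth":5}\n'
    '{"1x1":0,"1x2":0,"groundtruth":0}\n'
)

# Parse each non-empty line independently.
records = [json.loads(line) for line in sample if line.strip()]
labels = [record["groundtruth"] for record in records]

print(labels)
```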

In [19]:
%%writefile {ROOT_DIR}/train.py

import os
import argparse

import numpy as np
import pandas as pd
import tensorflow as tf

from pathlib import Path
from sklearn.metrics import accuracy_score

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD


def train(base_directory, train_path, validation_path, epochs=50, batch_size=32):
    # The splits were saved without a header row, so tell pandas not to
    # treat the first sample as column names. The label is the last column.
    X_train = pd.read_csv(Path(train_path) / "train.csv", header=None)
    y_train = X_train[X_train.columns[-1]]
    X_train.drop(X_train.columns[-1], axis=1, inplace=True)

    X_validation = pd.read_csv(Path(validation_path) / "validation.csv", header=None)
    y_validation = X_validation[X_validation.columns[-1]]
    X_validation.drop(X_validation.columns[-1], axis=1, inplace=True)

    model = Sequential([
        Dense(10, input_shape=(X_train.shape[1],), activation="relu"),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10)
    ])

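    # The last layer outputs raw logits (it has no softmax activation),
    # so the loss function is configured with from_logits=True.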
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

    model.compile(
        optimizer=SGD(learning_rate=0.01),
        loss=loss_fn,
        metrics=["accuracy"]
    )

    model.fit(
        X_train,
        y_train,
        validation_data=(X_validation, y_validation),
        epochs=epochs,
        batch_size=batch_size,
        verbose=2,
    )

    predictions = np.argmax(model.predict(X_validation), axis=-1)
    print(f"Validation accuracy: {accuracy_score(y_validation, predictions)}")

    model_filepath = Path(base_directory) / "model" / "001"
    model.save(model_filepath)

if __name__ == "__main__":
    # Any hyperparameters provided by the training job are passed to the entry point
    # as script arguments. SageMaker will also provide a list of special parameters
    # that you can capture here. Here is the full list:
    # https://github.com/aws/sagemaker-training-toolkit/blob/master/src/sagemaker_training/params.py
    parser = argparse.ArgumentParser()
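    # ROOT_DIR only exists in the notebook, not in this script, so default to
    # the folder that contains this file when running locally. In a SageMaker
    # training job, use "/opt/ml/" instead.
    parser.add_argument("--base_directory", type=str, default=str(Path(__file__).resolve().parent))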
    parser.add_argument("--train_path", type=str, default=os.environ.get("SM_CHANNEL_TRAIN", None))
    parser.add_argument("--validation_path", type=str, default=os.environ.get("SM_CHANNEL_VALIDATION", None))
    parser.add_argument("--epochs", type=int, default=50)
    parser.add_argument("--batch_size", type=int, default=32)
    args, _ = parser.parse_known_args()

    train(
        base_directory=args.base_directory,
        train_path=args.train_path,
        validation_path=args.validation_path,
        epochs=args.epochs,
        batch_size=args.batch_size
    )

Overwriting C:\Users\kamil\Desktop\ml-school\ml.school\mnist/train.py
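The entry point of the script combines command-line arguments with environment variables such as `SM_CHANNEL_TRAIN`. Here is a minimal sketch of that pattern in isolation. The `parse_args` helper, the `/tmp/train` location, and the `--some_other_flag` argument are hypothetical and exist only for the demo.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Fall back to the environment variable when the flag is not supplied.
    parser.add_argument("--train_path", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--epochs", type=int, default=50)
    # parse_known_args returns undeclared arguments instead of failing on them.
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


os.environ["SM_CHANNEL_TRAIN"] = "/tmp/train"  # hypothetical location, demo only
args, unknown = parse_args(["--epochs", "10", "--some_other_flag", "1"])

print(args.train_path, args.epochs, unknown)
```

Using `parse_known_args` rather than `parse_args` lets the script tolerate extra parameters that it does not declare.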


In [20]:
from mnist.preprocessor import preprocess
from mnist.train import train


# First, we preprocess the data and create the
# dataset splits.
preprocess(
    base_dir=ROOT_DIR,
    data_filepath_train=LOCAL_FILEPATH_TRAIN,
    data_filepath_test=LOCAL_FILEPATH_TEST,
)

In [21]:
# Then, we train a model using the train and
# validation splits.
train(
    base_directory=ROOT_DIR,
    train_path=Path(ROOT_DIR) / "train",
    validation_path=Path(ROOT_DIR) / "validation",
    epochs=10
)

Epoch 1/10
1500/1500 - 1s - loss: 0.8383 - accuracy: 0.7425 - val_loss: 0.3979 - val_accuracy: 0.8856 - 1s/epoch - 945us/step
Epoch 2/10
1500/1500 - 1s - loss: 0.3483 - accuracy: 0.8963 - val_loss: 0.3234 - val_accuracy: 0.9069 - 1s/epoch - 688us/step
Epoch 3/10
1500/1500 - 1s - loss: 0.2948 - accuracy: 0.9128 - val_loss: 0.2971 - val_accuracy: 0.9159 - 1s/epoch - 705us/step
Epoch 4/10
1500/1500 - 1s - loss: 0.2680 - accuracy: 0.9207 - val_loss: 0.2821 - val_accuracy: 0.9205 - 1s/epoch - 751us/step
Epoch 5/10
1500/1500 - 1s - loss: 0.2512 - accuracy: 0.9267 - val_loss: 0.2737 - val_accuracy: 0.9238 - 1s/epoch - 732us/step
Epoch 6/10
1500/1500 - 1s - loss: 0.2395 - accuracy: 0.9289 - val_loss: 0.2650 - val_accuracy: 0.9277 - 1s/epoch - 699us/step
Epoch 7/10
1500/1500 - 1s - loss: 0.2281 - accuracy: 0.9326 - val_loss: 0.2609 - val_accuracy: 0.9293 - 1s/epoch - 704us/step
Epoch 8/10
1500/1500 - 1s - loss: 0.2222 - accuracy: 0.9350 - val_loss: 0.2538 - val_accuracy: 0.9319 - 1s/epoch - 731



INFO:tensorflow:Assets written to: C:\Users\kamil\Desktop\ml-school\ml.school\mnist\model\001\assets
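The `1500/1500` counter in the training log is the number of batches per epoch. As a quick sanity check, it follows from the dataset size, the 20% validation split, and the default batch size:

```python
import math

train_rows = 60_000          # rows in mnist_train.csv
validation_fraction = 0.2    # test_size passed to train_test_split
batch_size = 32              # default batch size in train()

# Rows left for fitting after holding out the validation split.
fit_rows = round(train_rows * (1 - validation_fraction))
steps_per_epoch = math.ceil(fit_rows / batch_size)

print(fit_rows, steps_per_epoch)
```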


