# Your Second Image Classifier: Using a CNN to Classify Images
# Data Segregation

The goal with this dataset is to correctly classify each image as containing a dog, a cat, or a panda.
With only 3,000 images, the Animals dataset is another **introductory** dataset
on which we can quickly train a CNN model.

Let's take the following steps:

1. Data segregation
2. Split clean data into train, validation and test

<center><img width="900" src="https://drive.google.com/uc?export=view&id=1haMB_Zt6Et9q9sPHxfuR4g3FT5QRXlTI"></center>


## Step 01: Setup

Start out by installing the experiment tracking library and setting up your free W&B account:


*   **pip install wandb** – Install the W&B library
*   **import wandb** – Import the wandb library
*   **wandb login** – Log in to your W&B account so you can log all your metrics in one place

In [None]:
!pip install wandb -qU

### Import Packages

In [2]:
# import the necessary packages
import logging
import joblib
from sklearn.model_selection import train_test_split
import wandb

In [3]:
wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.



[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [4]:
# configure logging
# reference for a logging obj
logger = logging.getLogger()

# set level of logging
logger.setLevel(logging.INFO)

# create handlers
c_handler = logging.StreamHandler()
c_format = logging.Formatter(fmt="%(asctime)s %(message)s", datefmt="%d-%m-%Y %H:%M:%S")
c_handler.setFormatter(c_format)

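# install the handler on the root logger: replace the default handler
# if one exists (as in Colab), otherwise add ours
if logger.handlers:
    logger.handlers[0] = c_handler
else:
    logger.addHandler(c_handler)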

## Step 02: Data Segregation

In [5]:
# since we are using Jupyter Notebooks we can replace our argument
# parsing code with *hard coded* arguments and values
args = {
  "project_name": "alexnet",
  "artifact_name_feature": "clean_features:latest",
  "artifact_name_target": "labels:latest",
  "train_feature_artifact": "train_x",
  "train_target_artifact": "train_y",
  "val_feature_artifact": "val_x",
  "val_target_artifact": "val_y",
  "test_feature_artifact": "test_x",
  "test_target_artifact": "test_y",
}

In [6]:
# open the W&B project created in the Fetch step
run = wandb.init(entity="ivanovitch-silva",
                 project=args["project_name"],
                 job_type="data_segregation")

logger.info("Downloading and reading clean data artifact")
clean_data = run.use_artifact(args["artifact_name_feature"])
clean_data_path = clean_data.file()

logger.info("Downloading and reading label data artifact")
label_data = run.use_artifact(args["artifact_name_target"])
label_data_path = label_data.file()

# unpacking the artifacts
data = joblib.load(clean_data_path)
label = joblib.load(label_data_path)

[34m[1mwandb[0m: Currently logged in as: [33mivanovitch-silva[0m. Use [1m`wandb login --relogin`[0m to force relogin


24-10-2022 23:57:56 Downloading and reading clean data artifact
24-10-2022 23:58:40 Downloading and reading label data artifact



<center><img width="600" src="https://drive.google.com/uc?export=view&id=15ynGAo9KLIOB_6fNv5dh-hAS30YT_mMd"></center>

In [7]:
# partition the data into training and test splits, using 75% of
# the data for training and the remaining 25% for test
(train_x, test_x, train_y, test_y) = train_test_split(data,
                                                      label,
                                                      test_size=0.25, random_state=42)

In [8]:
# partition the training set into training and validation splits, using 75% of
# the training set for training and the remaining 25% for validation
(train_x, val_x, train_y, val_y) = train_test_split(train_x,
                                                    train_y,
                                                    test_size=0.25,
                                                    random_state=42)

In [9]:
logger.info("Train x: {}".format(train_x.shape))
logger.info("Train y: {}".format(train_y.shape))
logger.info("Validation x: {}".format(val_x.shape))
logger.info("Validation y: {}".format(val_y.shape))
logger.info("Test x: {}".format(test_x.shape))
logger.info("Test y: {}".format(test_y.shape))

24-10-2022 23:59:05 Train x: (1687, 227, 227, 3)
24-10-2022 23:59:05 Train y: (1687,)
24-10-2022 23:59:05 Validation x: (563, 227, 227, 3)
24-10-2022 23:59:05 Validation y: (563,)
24-10-2022 23:59:05 Test x: (750, 227, 227, 3)
24-10-2022 23:59:05 Test y: (750,)
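The shapes logged above follow from applying a 25% hold-out twice. The sketch below reproduces the arithmetic under stated assumptions: it uses small placeholder arrays instead of the real 227×227×3 images, and the names `n_samples`, `features`, `labels`, `sizes` and `fractions` are illustrative, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 3000  # size of the Animals dataset stated above

# placeholder features and labels; the real arrays hold images and class names
features = np.zeros((n_samples, 4), dtype=np.float32)
labels = np.arange(n_samples) % 3

# first split: 75% kept for training + validation, 25% held out for test
rest_x, test_x, rest_y, test_y = train_test_split(
    features, labels, test_size=0.25, random_state=42)

# second split: 75% of the remainder for training, 25% for validation
train_x, val_x, train_y, val_y = train_test_split(
    rest_x, rest_y, test_size=0.25, random_state=42)

sizes = {"train": len(train_x), "val": len(val_x), "test": len(test_x)}
fractions = {name: size / n_samples for name, size in sizes.items()}
print(sizes)
print(fractions)
```

With 3,000 samples this gives 1,687 / 563 / 750 examples, that is, roughly 56% / 19% / 25% of the full dataset for train / validation / test, so the "75%" in the second split refers to the remaining 2,250 samples rather than to the whole dataset.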


In [10]:
# Save the artifacts using joblib
joblib.dump(train_x, args["train_feature_artifact"])
joblib.dump(train_y, args["train_target_artifact"])
joblib.dump(val_x, args["val_feature_artifact"])
joblib.dump(val_y, args["val_target_artifact"])
joblib.dump(test_x, args["test_feature_artifact"])
joblib.dump(test_y, args["test_target_artifact"])

logger.info("Dumping the train and validation data artifacts to the disk")

24-10-2022 23:59:24 Dumping the train and validation data artifacts to the disk
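As a quick sanity check on the serialization step, the following sketch shows a `joblib` round trip on a small placeholder array. It assumes a temporary directory and a toy array (`split`) rather than the real splits; only `joblib.dump` and `joblib.load` are taken from the notebook itself.

```python
import os
import tempfile

import joblib
import numpy as np

# small placeholder array standing in for one of the splits
split = np.arange(12, dtype=np.float32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp_dir:
    # same extension-less file naming as the artifacts above
    path = os.path.join(tmp_dir, "train_x")
    joblib.dump(split, path)
    restored = joblib.load(path)

# the restored array should match the original in values and dtype
round_trip_ok = np.array_equal(split, restored) and split.dtype == restored.dtype
print(round_trip_ok)
```

The files written this way are joblib (pickle-based) files, so they should be read back with `joblib.load`, as the cell that loads the clean data artifact does.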


In [11]:
# train_x artifact
artifact = wandb.Artifact(args["train_feature_artifact"],
                          type="TRAIN_DATA",
                          description="A joblib file representing train_x"
                          )

logger.info("Logging train_x artifact")
artifact.add_file(args["train_feature_artifact"])
run.log_artifact(artifact)

24-10-2022 23:59:31 Logging train_x artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7faedddb6490>

In [12]:
# train_y artifact
artifact = wandb.Artifact(args["train_target_artifact"],
                          type="TRAIN_DATA",
                          description="A joblib file representing train_y"
                          )

logger.info("Logging train_y artifact")
artifact.add_file(args["train_target_artifact"])
run.log_artifact(artifact)

25-10-2022 00:00:00 Logging train_y artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7faedddc8950>

In [13]:
# val_x artifact
artifact = wandb.Artifact(args["val_feature_artifact"],
                          type="VAL_DATA",
                          description="A joblib file representing val_x"
                          )

logger.info("Logging val_x artifact")
artifact.add_file(args["val_feature_artifact"])
run.log_artifact(artifact)

25-10-2022 00:00:03 Logging val_x artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7faedddc8110>

In [14]:
# val_y artifact
artifact = wandb.Artifact(args["val_target_artifact"],
                          type="VAL_DATA",
                          description="A joblib file representing val_y"
                          )

logger.info("Logging val_y artifact")
artifact.add_file(args["val_target_artifact"])
run.log_artifact(artifact)

25-10-2022 00:00:15 Logging val_y artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7faeddddf750>

In [15]:
# test_x artifact
artifact = wandb.Artifact(args["test_feature_artifact"],
                          type="TEST_DATA",
                          description="A joblib file representing test_x"
                          )

logger.info("Logging test_x artifact")
artifact.add_file(args["test_feature_artifact"])
run.log_artifact(artifact)

25-10-2022 00:00:18 Logging test_x artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7faeddddf310>

In [16]:
# test_y artifact
artifact = wandb.Artifact(args["test_target_artifact"],
                          type="TEST_DATA",
                          description="A joblib file representing test_y"
                          )

logger.info("Logging test_y artifact")
artifact.add_file(args["test_target_artifact"])
run.log_artifact(artifact)

25-10-2022 00:00:32 Logging test_y artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7faeddddcf50>
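The six cells above repeat the same three calls with different names and types, so they could be folded into one loop. The helper below is a hypothetical sketch, not part of the notebook: `log_split_artifacts`, `make_artifact` and `SPLITS` are names introduced here, and the artifact constructor is passed in as a parameter so the function can be exercised without a W&B connection.

```python
def log_split_artifacts(run, make_artifact, splits):
    """Log one file-backed artifact per (name, type) pair and return the names."""
    logged = []
    for name, artifact_type in splits:
        artifact = make_artifact(
            name,
            type=artifact_type,
            description="A joblib file representing {}".format(name),
        )
        # assumes the split was already dumped to disk under this file name
        artifact.add_file(name)
        run.log_artifact(artifact)
        logged.append(name)
    return logged


# (artifact name, artifact type) pairs matching the cells above
SPLITS = [
    ("train_x", "TRAIN_DATA"),
    ("train_y", "TRAIN_DATA"),
    ("val_x", "VAL_DATA"),
    ("val_y", "VAL_DATA"),
    ("test_x", "TEST_DATA"),
    ("test_y", "TEST_DATA"),
]
```

In this notebook the call would be `log_split_artifacts(run, wandb.Artifact, SPLITS)`, made before the run is finished.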

In [17]:
run.finish()