# 5 Data Segregation

In this section we split the dataset into two blocks. The first is used for training and the second is reserved for testing the model. Both blocks are uploaded to W&B as artifacts, named train.csv and test.csv.

## 5.1 Install WandB


In [1]:
!pip install wandb


## 5.2 Loading libraries

In [6]:
import logging
import tempfile
import pandas as pd
import os
import wandb
import tensorflow as tf
from sklearn.model_selection import train_test_split

## 5.3 Login to wandb

In [3]:
!wandb login --relogin

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## 5.4 Data Segregation

In [4]:
# global variables

# fraction of the data reserved for the test set
test_size = 0.20

# seed used for reproducibility
seed = 57

# reference column used to stratify the data
stratify = "status"

# name of the input artifact
artifact_input_name = "phishing-detection-2/preprocessed_data.csv:latest"

# type of the artifact
artifact_type = "data"

In [5]:
# configure logging
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(message)s",
                    datefmt='%Y-%m-%d %H:%M:%S')

# reference to the root logger
logger = logging.getLogger()

# initiate wandb project
run = wandb.init(project="phishing-detection-2", job_type="split_data")

logger.info("Downloading and reading artifact")
artifact = run.use_artifact(artifact_input_name)
artifact_path = artifact.file()
df = pd.read_csv(artifact_path)

[34m[1mwandb[0m: Currently logged in as: [33mlupamedeiros[0m. Use [1m`wandb login --relogin`[0m to force relogin


2022-07-15 17:43:51 Downloading and reading artifact
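Before running the split, it may help to see what stratification does on a small example. The sketch below uses a toy DataFrame, not the real artifact: the `url_length` column and the class labels `legitimate` and `phishing` are assumptions made for illustration, and only the `status` column name, the test fraction, and the seed come from the configuration above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical values): 80 "legitimate" rows and 20 "phishing" rows
toy = pd.DataFrame({
    "url_length": range(100),
    "status": ["legitimate"] * 80 + ["phishing"] * 20,
})

# Same call pattern as the notebook: 20% test, fixed seed, stratified on "status"
toy_train, toy_test = train_test_split(
    toy,
    test_size=0.20,
    random_state=57,
    stratify=toy["status"],
)

# Class proportions in each split
train_ratio = toy_train["status"].value_counts(normalize=True)
test_ratio = toy_test["status"].value_counts(normalize=True)

print(len(toy_train), len(toy_test))
print(train_ratio.to_dict())
print(test_ratio.to_dict())
```

With `stratify` set, both splits should keep roughly the same class ratio as the full toy dataset, which is the property we want for the train and test artifacts.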


In [7]:
# Split first into train/test; the train split can later be divided further into train and validation
logger.info("Splitting data into train and test")
splits = {}
splits["train"], splits["test"] = train_test_split(df,
                                                   test_size=test_size,
                                                   random_state=seed,
                                                   stratify=df[stratify])

# Save the artifacts. We use a temporary directory so we do not leave any trace behind
with tempfile.TemporaryDirectory() as tmp_dir:
    
    for split, split_df in splits.items():

        # Build the artifact name from the name of the split
        artifact_name = f"{split}.csv"

        # Get the path on disk within the temporary directory
        temp_path = os.path.join(tmp_dir, artifact_name)

        logger.info(f"Uploading the {split} dataset to {artifact_name}")

        # Save then upload to W&B
        split_df.to_csv(temp_path, index=False)

        artifact = wandb.Artifact(name=artifact_name,
                                  type=artifact_type,
                                  description=f"{split} split of dataset {artifact_input_name}")
        artifact.add_file(temp_path)

        logger.info("Logging artifact")
        run.log_artifact(artifact)

        # This waits for the artifact to be uploaded to W&B. If you do not add
        # this, the temp directory might be removed before W&B has had a chance
        # to upload the datasets, and the upload might fail
        artifact.wait()

2022-07-15 17:58:06 Splitting data into train and test
2022-07-15 17:58:06 Uploading the train dataset to train.csv
2022-07-15 17:58:07 Logging artifact
2022-07-15 17:58:09 Uploading the test dataset to test.csv
2022-07-15 17:58:09 Logging artifact
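The cell above writes each split into a temporary directory and waits for the upload before leaving the `with` block. The following is a minimal sketch of why that ordering matters, using only the standard library; the file name `demo.csv` and its contents are made up for illustration, and no W&B call is involved.

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    # Hypothetical file standing in for one of the split CSVs
    demo_path = os.path.join(demo_dir, "demo.csv")
    with open(demo_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["url", "status"])
        writer.writerow(["http://example.com", "legitimate"])

    # Inside the block the file is still on disk
    existed_inside = os.path.exists(demo_path)

# Once the block exits, the directory and everything in it are deleted
existed_after = os.path.exists(demo_path)

print(existed_inside, existed_after)
```

Any work that needs the file, such as an upload, has to complete before the `with` block ends, which is the role `artifact.wait()` plays in the notebook.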


In [8]:
# close the run
run.finish()
