# DataOps onto S3 Bucket in AWS
The intent of this notebook is to develop train/test datasets for SageMaker Studio. It goes step by step through ingesting data into an S3 bucket that SageMaker can read from. This component covers data engineering, feature selection, and feature extraction, and gives a data scientist the train/test datasets needed to begin experimentation and model training.

In [1]:
import sagemaker
from sklearn.model_selection import train_test_split
import boto3
import pandas as pd

# creating a client for sagemaker
sm_boto3 = boto3.client('sagemaker')
# creating a session for sagemaker
sess = sagemaker.Session()
region = sess.boto_session.region_name

# Name of the S3 bucket to use (it must exist before data is uploaded)
bucket = 'martymdlregistry'  # project-specific S3 bucket
print('Using bucket ' + bucket)


sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/martinlopez/Library/Application Support/sagemaker/config.yaml
Using bucket martymdlregistry
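
The cell above assumes the bucket already exists. As a minimal sketch, one way to check for the bucket and create it when it is missing is to combine the boto3 S3 client calls `head_bucket` and `create_bucket`. The helper names `create_bucket_kwargs` and `ensure_bucket` are hypothetical and not part of this notebook, and treating a 404 error code as "bucket missing" is an assumption.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for the S3 client's create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a
    # LocationConstraint; every other region has to be named explicitly.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def ensure_bucket(s3_client, bucket_name, region):
    """Create the bucket if head_bucket reports it missing.

    Returns True when a bucket was created, False when it already existed.
    """
    # Imported lazily so the pure helper above works without botocore.
    from botocore.exceptions import ClientError

    try:
        s3_client.head_bucket(Bucket=bucket_name)
        return False  # bucket exists and is accessible
    except ClientError as err:
        # Assumption: a 404 code means "missing"; anything else
        # (for example 403, access denied) is re-raised.
        if err.response["Error"]["Code"] not in ("404", "NoSuchBucket"):
            raise
    s3_client.create_bucket(**create_bucket_kwargs(bucket_name, region))
    return True
```

In this notebook the call would look like `ensure_bucket(boto3.client('s3'), bucket, region)`. Bucket names are globally unique, so creation can still fail if another account already owns the name.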


## Source Information
We have downloaded the Metro DC Truck routes from the following source: https://opendata.dc.gov/datasets/DCGIS::truck-and-bus-through-route/about 

We now load the dataset, carry out feature development, and generate synthetic data. The synthetic columns are the LABEL column and the TRUCK_BREAK_OFF flag, which is meant to emulate whether a truck successfully broke off from the route.
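
Because the synthetic columns are drawn at random, each run of the notebook produces different labels unless the generator is seeded. The following is a small sketch of a seeded 0/1 generator. The helper name `synthetic_flags` and the seed values are assumptions chosen for illustration (200 mirrors the `random_state` used for the split later in this notebook).

```python
import random


def synthetic_flags(n_rows, seed):
    """Return n_rows random 0/1 values that are reproducible for a given seed."""
    rng = random.Random(seed)  # private generator; leaves global random state alone
    return [rng.randint(0, 1) for _ in range(n_rows)]


# Hypothetical usage: one seed per synthetic column so the columns differ
labels = synthetic_flags(5, seed=200)
break_off = synthetic_flags(5, seed=201)
```

With a dataframe, the same helper could be applied as `df['LABEL'] = synthetic_flags(len(df), seed=200)`.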

In [2]:
import os
import random
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

print(os.getcwd())
# Change the directory to 'dataset'
os.chdir('../dataset')
# Path to the CSV file
csv_file = 'Truck_and_Bus_Through_Route.csv'

# Read the CSV file into a dataframe
df = pd.read_csv(csv_file)

# Synthetic label column: random 0/1 per row
df['LABEL'] = [random.randint(0, 1) for _ in range(len(df))]

# Synthetic TRUCK_BREAK_OFF feature: random 0/1 per row
df['TRUCK_BREAK_OFF'] = [random.randint(0, 1) for _ in range(len(df))]

# Data preprocessing
df['LAST_EDITED_DATE'] = pd.to_datetime(df['LAST_EDITED_DATE'])
# Convert datetime to an integer (nanoseconds since the Unix epoch)
df['LAST_EDITED_DATE'] = df['LAST_EDITED_DATE'].astype('int64')
# Encode ROUTEID as integer category codes
df['ROUTEID'] = df['ROUTEID'].astype('category').cat.codes

# Normalize the selected features to the [0, 1] range
scaler = MinMaxScaler()
feature_cols = ['ROUTEID', 'LAST_EDITED_DATE', 'TRUCK_BREAK_OFF']
df[feature_cols] = scaler.fit_transform(df[feature_cols])

# Data preprocessing complete
print('Dataset:\n', df.head(5))

# create train test split
train, test = train_test_split(df, test_size=0.2, random_state=200)


print('Train set:\n', train.shape)
print('Test set:\n', test.shape)

/Users/martinlopez/Documents/GitHub/truck_break_off_rl/src
Dataset:
             NAME   ROUTEID  FROMMEASURE   TOMEASURE                FROMDATE  \
0  Primary Route  0.032468    9899.5123   9988.3999  2019/01/01 00:00:00+00   
1  Primary Route  0.032468    7773.7123   7797.1786  2019/01/01 00:00:00+00   
2  Primary Route  0.032468    1104.9776   1187.2989  2019/01/01 00:00:00+00   
3  Primary Route  0.032468   10277.8444  10287.2693  2019/01/01 00:00:00+00   
4  Primary Route  0.032468    7972.7124   8057.2873  2019/01/01 00:00:00+00   

   TODATE                                 EVENTID  LOCERROR CREATED_USER  \
0     NaN  {95A11B7D-E871-4A60-841D-4EE8B490B7D1}  NO ERROR          NaN   
1     NaN  {7108E69C-3464-4FF5-8A2E-38EF961B2E23}  NO ERROR          NaN   
2     NaN  {9AE842B8-98E4-4D5B-9EE8-73AE4CE25B25}  NO ERROR          NaN   
3     NaN  {0555805C-A411-4FD0-BD99-6E3D5AE1446C}  NO ERROR          NaN   
4     NaN  {279C06E7-BC10-4BF6-A49D-A185FC44132D}  NO ERROR          NaN   
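
The cell above fits `MinMaxScaler` on the full dataframe before splitting, so the test rows influence the scaling parameters. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on toy data, not on the truck dataset; the function name `split_then_scale` is hypothetical, and whether this ordering matters for this project is a judgment call.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def split_then_scale(features, test_size=0.2, random_state=200):
    """Split first, then fit the scaler on the training rows only."""
    train, test = train_test_split(
        features, test_size=test_size, random_state=random_state
    )
    scaler = MinMaxScaler()
    train_scaled = scaler.fit_transform(train)  # min/max learned from train only
    test_scaled = scaler.transform(test)        # reuse the train min/max
    return train_scaled, test_scaled


# Toy data: 10 rows, 2 numeric columns
demo = np.arange(20, dtype=float).reshape(10, 2)
train_scaled, test_scaled = split_then_scale(demo)
```

With this ordering the scaled test values can fall slightly outside [0, 1], which is expected because the test rows did not contribute to the fitted range.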


## Create training and testing datasets

In [3]:
train.to_csv('train-V1.csv', index=False)
test.to_csv('test-V1.csv', index=False)

## Send test/train datasets to SageMaker
We upload the train/test CSV files to the S3 bucket, from which SageMaker reads its training data.

In [4]:
# Upload to the S3 bucket; SageMaker reads its training data from S3
sk_prefix = "sagemaker/truck-break-off-rl_markov"
train_path = sess.upload_data(path='train-V1.csv', bucket=bucket, key_prefix=sk_prefix)
test_path = sess.upload_data(path='test-V1.csv', bucket=bucket, key_prefix=sk_prefix)

print(train_path)
print(test_path)

s3://martymdlregistry/sagemaker/truck-break-off-rl_markov/train-V1.csv
s3://martymdlregistry/sagemaker/truck-break-off-rl_markov/test-V1.csv
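
The printed paths are S3 URIs of the form `s3://<bucket>/<key>`. As a hedged sketch, such a URI can be split into its bucket and key, and the upload can then be confirmed with the boto3 S3 client call `head_object`. The helper names `parse_s3_uri` and `object_exists` are hypothetical, and treating a 404 error code as "object missing" is an assumption.

```python
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def object_exists(s3_client, uri):
    """Return True if head_object succeeds for the object behind the URI."""
    # Imported lazily so parse_s3_uri works without botocore installed.
    from botocore.exceptions import ClientError

    bucket_name, key = parse_s3_uri(uri)
    try:
        s3_client.head_object(Bucket=bucket_name, Key=key)
        return True
    except ClientError as err:
        # Assumption: 404 means the object is missing; other errors propagate.
        if err.response["Error"]["Code"] in ("404", "NoSuchKey"):
            return False
        raise


example_uri = "s3://martymdlregistry/sagemaker/truck-break-off-rl_markov/train-V1.csv"
bucket_name, key = parse_s3_uri(example_uri)
```

In this notebook the check would be `object_exists(boto3.client('s3'), train_path)`, which requires valid AWS credentials with read access to the bucket.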
