## Uploading to Bucket

In [1]:
%pip install -U sagemaker
%pip install scikit-learn==1.2.1

Note: you may need to restart the kernel to use updated packages.


In [3]:
import boto3
import pandas as pd
import numpy as np
import os

# Create an S3 client
s3_client = boto3.client('s3')

# Specify a unique bucket name
bucket_name = 'crisis-detection-bucket'
prefix = "model"

In [4]:
# Create the S3 bucket, reusing it if this account already owns it
from botocore.exceptions import ClientError

try:
    s3_client.create_bucket(Bucket=bucket_name)
    print("Bucket created successfully!")
except ClientError as e:
    # Check if the error is due to the bucket already existing
    error_code = e.response['Error']['Code']
    if error_code == 'BucketAlreadyOwnedByYou':
        print("Bucket already exists. Continuing with the existing bucket.")
    else:
        print("An error occurred while creating the bucket:", error_code)
        raise

Bucket created successfully!
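
The call above passes only `Bucket`, which works in `us-east-1` (the region that appears in the image URI later in this notebook). In other regions, S3 expects a `CreateBucketConfiguration` with a `LocationConstraint`. A minimal sketch of building the arguments for either case; `create_bucket_kwargs` is a hypothetical helper, not part of boto3:

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for s3_client.create_bucket (hypothetical helper)."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a constraint
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs

print(create_bucket_kwargs("crisis-detection-bucket", "us-east-1"))
print(create_bucket_kwargs("crisis-detection-bucket", "eu-west-1"))
```

It could then be used as `s3_client.create_bucket(**create_bucket_kwargs(bucket_name, s3_client.meta.region_name))`.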


In [5]:
# Read the local CSV file into a DataFrame and keep only the label and text columns
local_file_path = 'data.csv'
df = pd.read_csv(local_file_path)
df = df[['target','text']]
df.head()

Unnamed: 0,target,text
0,1,Our Deeds are the Reason of this #earthquake M...
1,1,Forest fire near La Ronge Sask. Canada
2,1,All residents asked to 'shelter in place' are ...
3,1,"13,000 people receive #wildfires evacuation or..."
4,1,Just got sent this photo from Ruby #Alaska as ...


In [6]:
# Create a sagemaker session to upload data to S3
import sagemaker
from sklearn.model_selection import train_test_split
from sagemaker.inputs import TrainingInput

sagemaker_session = sagemaker.Session()
batch_1, batch_2 = train_test_split(df, test_size=0.2, random_state=2023)

# Save the two batches to local files, then upload each under its own key prefix
batch_1.to_csv('batch_1.csv', index=False)
batch_2.to_csv('batch_2.csv', index=False)
batch_1_input_path = sagemaker_session.upload_data("batch_1.csv", bucket=bucket_name, key_prefix="batch_1")
batch_2_input_path = sagemaker_session.upload_data("batch_2.csv", bucket=bucket_name, key_prefix="batch_2")

print(batch_1_input_path)
print(batch_2_input_path)

s3://crisis-detection-bucket/batch_1/batch_1.csv
s3://crisis-detection-bucket/batch_2/batch_2.csv
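
The printed paths follow from how the upload destination is composed: bucket, then key prefix, then the local file name. A minimal sketch of that composition, useful for checking a path before uploading; `expected_s3_uri` is a hypothetical helper and not part of the SageMaker SDK:

```python
import posixpath

def expected_s3_uri(bucket, key_prefix, local_path):
    """Return the S3 URI a local file is expected to land at (hypothetical helper)."""
    file_name = posixpath.basename(local_path)
    return f"s3://{bucket}/{posixpath.join(key_prefix, file_name)}"

uri = expected_s3_uri("crisis-detection-bucket", "batch_1", "batch_1.csv")
print(uri)
```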


## Train model
The model is trained using the SageMaker SDK's Estimator class. First, get the execution role for training. This role grants the training job access to the S3 bucket created in the previous step, where the two data batches are stored.

In [7]:
# Use the current execution role for training. It needs access to S3
role = sagemaker.get_execution_role()
print(role)

arn:aws:iam::823616654574:role/LabRole


Next, define the SageMaker SDK estimator. We use `SKLearn`, an estimator class designed specifically for training scikit-learn models. In this estimator, we define the following parameters:
1. The script used to train the model (i.e. `entry_point`). This is the heart of the Script Mode method. Additionally, set the `script_mode` parameter to `True`.
1. The role that grants access to the S3 bucket containing the data (i.e. `role`)
1. How many instances to use for training (i.e. `instance_count`) and which instance type to use (i.e. `instance_type`)
1. Which version of scikit-learn to use (i.e. `framework_version`)
1. Training hyperparameters (i.e. `hyperparameters`); these are optional and are not set in this notebook

After setting these parameters, the `fit` function is invoked to train the model.
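
The training script itself (`train.py`) is not shown in this notebook. As a rough sketch of how a Script Mode script commonly receives its inputs: the container exposes the model output directory and each input channel through environment variables (`SM_MODEL_DIR`, and `SM_CHANNEL_TRAIN` for the `train` channel passed to `fit`), and hyperparameters arrive as command-line arguments. The `--max-iter` argument below is a hypothetical hyperparameter chosen for illustration, not something the notebook passes.

```python
import argparse
import os

def parse_args(argv=None):
    """Parse Script Mode inputs, falling back to the container's usual paths."""
    parser = argparse.ArgumentParser()
    # Directory where the trained model must be saved so it ends up in model.tar.gz
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    # Directory holding the files of the "train" channel
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    # Hypothetical hyperparameter, for illustration only
    parser.add_argument("--max-iter", type=int, default=3500)
    return parser.parse_args(argv)

args = parse_args(["--max-iter", "1000"])
print(args.model_dir, args.train, args.max_iter)
```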

In [8]:
# Docs: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html

from sagemaker.sklearn import SKLearn

# Define the training script and dependencies
train_script = 'train.py'  # Replace with your actual training script name
dependencies = ['utils.py']  # Replace with your required dependencies

# Set up the SKLearn estimator with dependencies
sk_estimator = SKLearn(
    entry_point=train_script,
    dependencies=dependencies,
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    framework_version="1.2-1",
    script_mode=True,
    py_version='py3',
    sagemaker_session=sagemaker_session,
    output_path="s3://{}/{}".format(bucket_name, prefix),
    base_job_name="sagemaker-crisis-detection",
    code_location="s3://{}/{}".format(bucket_name, "jobs"),
)

# Train the estimator
print(batch_1_input_path)
s3_input_train = TrainingInput(batch_1_input_path)
sk_estimator.fit({"train": s3_input_train})

INFO:sagemaker:Creating training-job with name: sagemaker-crisis-detection-2023-05-30-16-59-10-218


s3://crisis-detection-bucket/batch_1/batch_1.csv
Using provided s3_resource
2023-05-30 16:59:10 Starting - Starting the training job...
2023-05-30 16:59:25 Starting - Preparing the instances for training......
2023-05-30 17:00:17 Downloading - Downloading input data...
2023-05-30 17:00:58 Training - Training image download completed. Training in progress..
2023-05-30 17:01:04,389 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2023-05-30 17:01:04,392 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-05-30 17:01:04,399 sagemaker_sklearn_container.training INFO     Invoking user training script.
2023-05-30 17:01:04,603 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-05-30 17:01:04,614 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-05-30 17:01:04,624 sagemaker-training-toolkit INFO     

In [9]:
model_data = sk_estimator.model_data
image_uri = sk_estimator.image_uri
model_role = sk_estimator.role

print(f"Model Data: {model_data}\nImage URI: {image_uri}\nModel Role: {model_role}")

# print(sk_estimator.__dict__)


Model Data: s3://crisis-detection-bucket/model/sagemaker-crisis-detection-2023-05-30-16-59-10-218/output/model.tar.gz
Image URI: 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3
Model Role: arn:aws:iam::823616654574:role/LabRole


## Trying out Incremental Learning on SageMaker

In [82]:
# import joblib
# import tarfile

# # Download the model tar.gz file from S3
# s3_client.download_file('crisis-detection-bucket', model_data.split('s3://{}/'.format(bucket_name))[1], 'model.tar.gz')

# # Extract the model file from the tar.gz archive
# with tarfile.open('model.tar.gz', 'r:gz') as tar:
#     tar.extractall('.')

# # Load the trained model
# model = joblib.load('model.joblib')


In [83]:
# test_data = [
#     'This is the first text.', 'This is the second text.',
#     'Another text for prediction.', 'One more text to predict.',
#     'disaster is imminent', 'omg an earthquake'
# ]

# # Create a DataFrame with the test data and feature names
# test_df = pd.DataFrame({'text': test_data})
# print(test_df)

# # Use the transformed test data in prediction
# predictions = model.predict(test_df)

# print(predictions)

                           text
0       This is the first text.
1      This is the second text.
2  Another text for prediction.
3     One more text to predict.
4          disaster is imminent
5             omg an earthquake
[0 0 0 0 1 1]


In [85]:
# print(model)

Pipeline(steps=[('col_transformer',
                 ColumnTransformer(transformers=[('vect',
                                                  TfidfVectorizer(ngram_range=(1,
                                                                               2),
                                                                  stop_words='english',
                                                                  token_pattern='(?u)\\b\\w[\\w-]*\\w\\b|\\b\\w+\\b'),
                                                  'text')])),
                ('clf',
                 SGDClassifier(loss='log_loss', max_iter=3500, penalty='l1'))])


In [84]:
# df2 = pd.read_csv('test_data.csv')
# df2.head()

Unnamed: 0,target,text
0,0,Having trouble understanding the rotations wit...
1,0,Wow Crackdown 3 uses multiple servers in multi...
2,0,Heart disease prevention: What about secondhan...
3,1,Cape Coral city leaders take part in mock hurr...
4,0,#np Avenged Sevenfold - Hail To The King


In [78]:
# X = df2.drop(['id', 'target', 'keyword','location'], axis=1, errors='ignore')
# y = df2['target']
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2023, stratify=y)
# X_train = model.named_steps['col_transformer'].transform(X_train)

In [79]:
# model.named_steps['clf'].partial_fit(X_train, y_train)

In [81]:
# from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report

# y_pred = model.predict(X_test)
# f1 = f1_score(y_test, y_pred, average='binary')
# print('F1 Score: {:.4f}'.format(f1))

F1 Score: 0.6489
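
The local experiment above updates only the classifier step: the already fitted vectorizer transforms the new batch, and `partial_fit` adjusts the existing weights. A self-contained sketch of that pattern on toy data follows. The texts and labels are made up for illustration, and the classifier uses default settings rather than the pipeline's exact configuration. One consequence worth noting is that the vocabulary stays frozen, so words that first appear in the second batch are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Toy data standing in for batch 1 and batch 2 (made up for illustration)
batch_1_texts = ["earthquake hits city", "nice sunny day",
                 "forest fire spreading", "great coffee today"]
batch_1_labels = [1, 0, 1, 0]
batch_2_texts = ["flood warning issued for city", "lovely day at the park"]
batch_2_labels = [1, 0]

# Initial training: fit the vectorizer and the classifier on batch 1
vect = TfidfVectorizer()
X1 = vect.fit_transform(batch_1_texts)
clf = SGDClassifier(random_state=0)
clf.fit(X1, batch_1_labels)
shape_before = clf.coef_.shape

# Incremental step: reuse the fitted vectorizer, update only the classifier
X2 = vect.transform(batch_2_texts)
clf.partial_fit(X2, batch_2_labels)

print(shape_before, clf.coef_.shape)
```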


In [11]:
# Docs: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html

from sagemaker.sklearn import SKLearn

# Define the training script and dependencies
incremental_script = 'incremental.py'  # Replace with your actual training script name
dependencies = ['utils.py']  # Replace with your required dependencies

# Set up the SKLearn estimator with dependencies
sk_estimator_incremental = SKLearn(
    entry_point=incremental_script,
    dependencies=dependencies,
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    framework_version="1.2-1",
    script_mode=True,
    py_version='py3',
    sagemaker_session=sagemaker_session,
    output_path="s3://{}/{}".format(bucket_name, prefix),
    base_job_name="sagemaker-crisis-detection",
    code_location="s3://{}/{}".format(bucket_name, "jobs"),
)

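# Train the estimator on the second batch
# Note: model_data is only printed here; it is not passed to the estimator or to fit()
print(batch_2_input_path, model_data)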
s3_input_train_incremental = TrainingInput(batch_2_input_path)
sk_estimator_incremental.fit({"train": s3_input_train_incremental})

INFO:sagemaker:Creating training-job with name: sagemaker-crisis-detection-2023-05-30-17-08-10-997


s3://crisis-detection-bucket/batch_2/batch_2.csv s3://crisis-detection-bucket/model/sagemaker-crisis-detection-2023-05-30-16-59-10-218/output/model.tar.gz
Using provided s3_resource
2023-05-30 17:08:11 Starting - Starting the training job...
2023-05-30 17:08:27 Starting - Preparing the instances for training......
2023-05-30 17:09:22 Downloading - Downloading input data...
2023-05-30 17:10:03 Training - Training image download completed. Training in progress..
2023-05-30 17:10:07,949 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2023-05-30 17:10:07,951 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-05-30 17:10:07,959 sagemaker_sklearn_container.training INFO     Invoking user training script.
2023-05-30 17:10:08,146 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-05-30 17:10:08,156 sagemaker-training-toolkit INFO     No GPUs de

In [12]:
model_data_incre = sk_estimator_incremental.model_data
image_uri_incre = sk_estimator_incremental.image_uri
model_role_incre = sk_estimator_incremental.role

print(f"Model Data: {model_data_incre}\nImage URI: {image_uri_incre}\nModel Role: {model_role_incre}")

# print(sk_estimator.__dict__)


Model Data: s3://crisis-detection-bucket/model/sagemaker-crisis-detection-2023-05-30-17-08-10-997/output/model.tar.gz
Image URI: 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3
Model Role: arn:aws:iam::823616654574:role/LabRole


## Deploy and test endpoint
After training the model, it is time to deploy it as an endpoint. To do so, we call the `deploy` method of the scikit-learn estimator. As shown in the code below, you can set the number of instances (i.e. `initial_instance_count`) and the instance type (i.e. `instance_type`) used to host the model.

In [13]:
import time

sk_endpoint_name = "sklearn-rf-model" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
sk_predictor = sk_estimator_incremental.deploy(
    initial_instance_count=1, instance_type="ml.m5.large", endpoint_name=sk_endpoint_name
)

INFO:sagemaker:Creating model with name: sagemaker-crisis-detection-2023-05-30-17-11-12-833
INFO:sagemaker:Creating endpoint-config with name sklearn-rf-model2023-05-30-17-11-12
INFO:sagemaker:Creating endpoint with name sklearn-rf-model2023-05-30-17-11-12


----!
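
The request sent later in this notebook is a JSON object with an `Input` list, and the response is a JSON object as well. The SageMaker scikit-learn serving container lets the entry-point script override request handling through functions such as `input_fn` and `output_fn`. The author's script is not shown, so the following is only a guess at what such handlers might look like, assuming `predict_fn` returns a plain dictionary:

```python
import json

def input_fn(request_body, request_content_type):
    """Decode a JSON request into a list of texts (assumed request shape)."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return json.loads(request_body)["Input"]

def output_fn(prediction, accept):
    """Encode the prediction dictionary as a JSON string."""
    return json.dumps(prediction)

texts = input_fn('{"Input": ["This is a disaster", "Hello world"]}', "application/json")
body = output_fn({"Output": [1, 0]}, "application/json")
print(texts)
print(body)
```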

Once the endpoint is fully deployed, it can be invoked using the [SageMaker Runtime Client](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html) (the method used in the code cell below) or the [Scikit Learn Predictor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-predictor). If you plan to use the latter, make sure to use a [Serializer](https://sagemaker.readthedocs.io/en/stable/api/inference/serializers.html) so that your data is serialized properly.

In [14]:
import json

client = sagemaker_session.sagemaker_runtime_client

# Define the input data in the desired format
input_data = {"Input": ["This is a disaster","Hello world"]}

# Convert the input data to JSON payload
payload = json.dumps(input_data)

# Invoke the endpoint to get the prediction
response = client.invoke_endpoint(
    EndpointName=sk_endpoint_name,
    ContentType='application/json',
    Body=payload
)

# Parse the prediction response
response_body = response['Body'].read().decode('utf-8')
prediction_result = json.loads(response_body)['Output']
print(response_body)


{"Output":[1,0],"Probabilities":[[0.1585506774886135,0.8414493225113865],[0.6825362070930563,0.31746379290694365]]}
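
The response carries one label and one probability pair per input text. A short sketch of pairing them back up with the inputs, using the response body printed above. It assumes the probability columns are ordered by class label (`0`, then `1`), which is consistent with the labels in this response but is not stated by the endpoint itself.

```python
import json

texts = ["This is a disaster", "Hello world"]
response_body = (
    '{"Output":[1,0],"Probabilities":[[0.1585506774886135,0.8414493225113865],'
    '[0.6825362070930563,0.31746379290694365]]}'
)

def summarize(texts, response_body):
    """Pair each input text with its predicted label and that label's probability."""
    parsed = json.loads(response_body)
    rows = []
    for text, label, probs in zip(texts, parsed["Output"], parsed["Probabilities"]):
        # Assumes probs is ordered [P(class 0), P(class 1)]
        rows.append({"text": text, "label": label, "confidence": probs[label]})
    return rows

rows = summarize(texts, response_body)
for row in rows:
    print(row)
```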



## Cleanup
If the model and endpoint are no longer in use, they should be deleted to save costs and free up resources.

In [15]:
# sk_predictor.delete_model()
# sk_predictor.delete_endpoint()

INFO:sagemaker:Deleting model with name: sagemaker-crisis-detection-2023-05-30-17-11-12-833
INFO:sagemaker:Deleting endpoint configuration with name: sklearn-rf-model2023-05-30-17-11-12
INFO:sagemaker:Deleting endpoint with name: sklearn-rf-model2023-05-30-17-11-12


In [147]:
# import boto3

# # Specify the endpoint name to delete
# endpoint_name = 'sklearn-rf-model2023-05-20-09-06-08'

# # Create a SageMaker client
# sagemaker_client = boto3.client('sagemaker')

# # Delete the endpoint
# sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
