In [3]:
%matplotlib inline

In [15]:
from sagemaker import get_execution_role
from sagemaker.session import Session
import sagemaker
import boto3

role = get_execution_role()
bucket = Session().default_bucket()

In [4]:
import pickle, gzip, numpy, boto3, json

# Load the dataset
region = boto3.Session().region_name
s3 = boto3.client("s3")
s3.download_file(f"sagemaker-sample-data-{region}", "algorithms/kmeans/mnist/mnist.pkl.gz", "mnist.pkl.gz")
with gzip.open("mnist.pkl.gz", "rb") as f:
    train_set, valid_set, test_set = pickle.load(f, encoding="latin1")
    
import pandas as pd

# Read the training CSV straight from S3 (pandas needs s3fs installed for s3:// paths).
df = pd.read_csv('s3://sagemaker-us-east-1-346023323361/knn/train/train.csv')
# The training and validation channels are defined in the next cell.

In [21]:
train_data = 's3://sagemaker-us-east-1-346023323361/knn/train/train.csv'
validation_data = 's3://sagemaker-us-east-1-346023323361/knn/valid/valid.csv'
train_set =  sagemaker.inputs.TrainingInput(train_data, content_type='text/csv')
valid_set = sagemaker.inputs.TrainingInput(validation_data, content_type='text/csv')
data_channels = {'train': train_set, 'validation': valid_set}

## Training the K-Means model

Once the data is preprocessed and available in the correct format, the next step is to train the model on it. Because this dataset is relatively small, it is not meant to show off the performance of the k-means training algorithm, but Amazon SageMaker's k-means has been tested on, and scales well with, multi-terabyte datasets.

After setting training parameters, we kick off training and poll for status until it completes, which in this example takes around 4 minutes.
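
As background for what the training job optimizes, here is a minimal sketch of plain Lloyd's k-means in standard-library Python. It is an illustration only and is not SageMaker's implementation, which is a different, scalable variant; the function name `lloyd_kmeans` and the toy points are made up for this example.

```python
import math
import random


def lloyd_kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Start from k distinct observations chosen at random.
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for center, group in zip(centers, groups):
            if group:
                new_centers.append(tuple(sum(c) / len(group) for c in zip(*group)))
            else:
                new_centers.append(center)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # nothing moved, so the assignments can no longer change
        centers = new_centers
    return centers


# Two well-separated toy blobs (made-up data, not the notebook's dataset).
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_centers = sorted(lloyd_kmeans(toy_points, k=2))
print(toy_centers)
```

On this toy input the two centers settle on the means of the two blobs. The `k=10` passed to the estimator below plays the same role as `k` here.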

In [22]:
from sagemaker import KMeans

data_location = "s3://sagemaker-us-east-1-346023323361/knn/kmeans-output"
output_location = f"s3://{bucket}/kmeans_example/output"

print(f"training data will be uploaded to: {data_location}")
print(f"training artifacts will be uploaded to: {output_location}")

kmeans = KMeans(
    role=role,
    instance_count=2,
    instance_type="ml.c4.xlarge",
    output_path=output_location,
    k=10,
    data_location=data_location,
)

training data will be uploaded to: s3://sagemaker-us-east-1-346023323361/knn/kmeans-output
training artifacts will be uploaded to: s3://sagemaker-us-east-1-346023323361/kmeans_example/output


In [28]:
%%time
from time import gmtime, strftime
import pandas as pd
import numpy as np

df_train = pd.read_csv('s3://sagemaker-us-east-1-346023323361/knn/train/train.csv')
# The built-in k-means algorithm expects float32 features.
np_array = np.array(df_train.values, dtype=np.float32)
job_name = f'jumpstart-example-kmeans-{strftime("%Y-%m-%d-%H-%M-%S", gmtime())}'
# record_set() uploads the array to S3 under data_location before training starts.
kmeans.fit(kmeans.record_set(np_array), job_name=job_name)

INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: jumpstart-example-kmeans-2023-04-16-04-22-06


2023-04-16 04:22:10 Starting - Starting the training job...
2023-04-16 04:22:33 Starting - Preparing the instances for training......
2023-04-16 04:23:25 Downloading - Downloading input data...
2023-04-16 04:23:50 Training - Downloading the training image......
2023-04-16 04:25:06 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[04/16/2023 04:25:15 INFO 140652233082688] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'init_method': 'random', 'mini_batch_size': '5000', 'epochs': '1', 'extra_center_factor': 'auto', 'local_lloyd_max_iter': '300', 'local_lloyd_tol': '0.0001', 'local_lloyd_init_method': 'kmeans++', 'local_lloyd_num_trials': 'auto', 'half_life_time_size': '0', 'eval_metrics': '["msd"]', 'force_dense': 'true', '_disable_wait_to_read': 'false', '_enable_profiler': 'false'

## Set up hosting for the model
Now we can deploy the model we just trained behind a real-time hosted endpoint. This step takes 7 to 11 minutes on average; in the run shown below it took about 5 minutes.

In [29]:
%%time

endpoint_name = f'jumpstart-example-kmeans-{strftime("%Y-%m-%d-%H-%M-%S", gmtime())}'
kmeans_predictor = kmeans.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge", endpoint_name = endpoint_name)

INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating model with name: kmeans-2023-04-16-04-26-13-179
INFO:sagemaker:Creating endpoint-config with name jumpstart-example-kmeans-2023-04-16-04-26-13
INFO:sagemaker:Creating endpoint with name jumpstart-example-kmeans-2023-04-16-04-26-13


---------!
CPU times: user 212 ms, sys: 13.8 ms, total: 226 ms
Wall time: 5min 2s


## Validate the model for use
Finally, we'll validate the model for use. Let's get the cluster assignment for a single observation from the trained model using the endpoint we just created.

In [49]:
# train_set is a TrainingInput (a pointer to S3), not a DataFrame, so send
# the first row of the float32 training array instead.
result = kmeans_predictor.predict(np_array[:1])
print(result)

OK, a single prediction works.
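
Conceptually, a k-means prediction is a nearest-centroid lookup. The sketch below shows that idea in standard-library Python; it is not the endpoint's code, the helper name `closest_cluster` is hypothetical, and the three centroids are illustrative (latitude, longitude) pairs rather than the trained model's actual centers.

```python
import math


def closest_cluster(row, centroids):
    """Return (index, distance) of the centroid nearest to one observation."""
    distances = [math.dist(row, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


# Illustrative 2-D centroids (latitude, longitude); assumed values, not model output.
example_centroids = [(40.9, -74.3), (32.1, -93.6), (39.2, -120.2)]
label, distance = closest_cluster((33.0, -95.0), example_centroids)
print(label, round(distance, 3))
```

The observation lands in the cluster with index 1 because that centroid is by far the nearest of the three.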

Let's do a whole batch and see how well the clustering works.

## Predict on Validation Set

In [67]:
%%time 

df_valid = pd.read_csv('s3://sagemaker-us-east-1-346023323361/knn/valid/valid.csv')
np_array = np.array(df_valid.values, dtype=np.float32)

# predict() returns one Record per input row
result = kmeans_predictor.predict(np_array)

# Extract the index of the closest cluster for each row of the validation set
clusters = [r.label["closest_cluster"].float32_tensor.values[0] for r in result]
df_valid['label'] = clusters

CPU times: user 614 ms, sys: 4.4 ms, total: 618 ms
Wall time: 975 ms


In [64]:
print(len(result))
print(len(clusters))
print(len(df_valid))

5563
5563
5563


In [68]:
# Group the validation rows by cluster label and take the mean of every column.
# Each row of `means` then describes the typical member of one cluster.
means = df_valid.groupby('label').mean()

#print(means)
means.to_csv('s3://sagemaker-us-east-1-346023323361/kmeans/means.csv')

              0    35.1088    -106.533        1.0         1  0.1       0.2  \
label                                                                        
0.0    0.362292  40.917964  -74.250503   1.254159  0.759704  0.0  0.079482   
1.0    0.580982  32.059584  -93.616420   1.125392  0.334378  0.0  0.113898   
2.0    0.479381  39.202546 -120.207368   1.317869  0.721649  0.0  0.187285   
3.0    0.518703  28.262854  -81.866735   1.144638  0.381546  0.0  0.137157   
4.0    0.554360  34.521437  -82.660040   1.107643  0.331539  0.0  0.097955   
5.0    0.460203  40.868700  -90.392587   1.254703  0.529667  0.0  0.128799   
6.0    0.609053  36.642981 -107.932765   1.168724  0.390947  0.0  0.242798   
7.0    0.000000  37.318522  -94.665367  41.666667  0.444444  0.0  0.111111   
8.0    0.416667  20.792475 -157.117500   1.000000  1.000000  0.0  0.250000   
9.0    0.526636  40.568290  -83.338041   1.280061  0.467275  0.0  0.124810   

            0.3       0.4       0.5  ...      0.21      0.22   
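
The column labels in the output above (`0`, `35.1088`, `-106.533`, ...) look like data values rather than names. That suggests, as an assumption not confirmed by the source, that the CSV has no header row and that `pd.read_csv` used the first record as the header. The sketch below shows the difference on made-up data; the column name `flag` is hypothetical, while `lat` and `long` follow the names used in the cluster listings.

```python
import io

import pandas as pd

# Three made-up records standing in for a header-less CSV.
csv_text = "0,35.1,-106.5\n1,40.9,-74.3\n0,32.1,-93.6\n"

# Default behaviour: the first record is consumed as the header row.
df_default = pd.read_csv(io.StringIO(csv_text))

# With header=None every record is kept as data and the columns are named explicitly.
# "flag" is a hypothetical name used only for this illustration.
df_named = pd.read_csv(io.StringIO(csv_text), header=None, names=["flag", "lat", "long"])

print(len(df_default), len(df_named))
```

If the assumption holds, reading with `header=None` would also keep the first record of the training and validation files, which would change the row counts shown in this notebook.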

In [69]:
print(df_valid['label'].value_counts())

0.0    1082
1.0     957
4.0     929
5.0     691
9.0     657
2.0     582
3.0     401
6.0     243
8.0      12
7.0       9
Name: label, dtype: int64
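
The counts above are very uneven. A small sketch, using only the numbers printed above, can flag clusters that hold a tiny share of the rows; the helper `small_clusters` and the 1% cutoff are illustrative choices, not part of the original analysis.

```python
from collections import Counter

# Cluster sizes copied from the value_counts() output above.
cluster_sizes = Counter({0.0: 1082, 1.0: 957, 4.0: 929, 5.0: 691, 9.0: 657,
                         2.0: 582, 3.0: 401, 6.0: 243, 8.0: 12, 7.0: 9})


def small_clusters(sizes, min_share=0.01):
    """Return the labels whose share of all rows is below min_share."""
    total = sum(sizes.values())
    return sorted(label for label, n in sizes.items() if n / total < min_share)


print(sum(cluster_sizes.values()))    # total number of validation rows
print(small_clusters(cluster_sizes))  # labels holding under 1% of the rows
```

Labels 7.0 and 8.0 fall below the cutoff, so their column means rest on only 9 and 12 rows and should be read with caution.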


In [57]:
from collections import Counter

# Count how many validation rows fall in each cluster.
counts = dict(Counter(clusters))
print(counts)

{9.0: 657, 2.0: 582, 3.0: 401, 4.0: 929, 1.0: 957, 5.0: 691, 0.0: 1082, 6.0: 243, 8.0: 12, 7.0: 9}


## Analysis of Clusters - 10 in Total

### The bottom line

This dataset is tabular rather than image data, and with k=10 K-Means still builds reasonable clusters, although two of them (labels 7.0 and 8.0) contain only 9 and 12 of the 5,563 validation rows.

## Cluster 1 - Characteristics

This group (label 0.0 in the output above) centers around Westwood, New Jersey.  Mostly Democratic districts.  Stolen guns.  Not specific to any region.

lat = 40.91
long = -74.25   ---> Westwood, NJ  07675, United States
guns_involved = 1.2541589648798521	
Democrat = 0.7597042513863216	
ohe_drug = 0.0	
ohe_officer = 0.07948243992606285	
ohe_gang = 0.015711645101663587	
ohe_accident = 0.11090573012939002	
ohe_murder = 0.0933456561922366	
ohe_suicide = 0.09704251386321626	
ohe_arrest = 0.29390018484288355	
ohe_brandishing = 0.11367837338262476	
ohe_felon = 0.17467652495378927	
ohe_drive = 0.022181146025878003	
ohe_home = 0.056377079482439925	
ohe_homeinvasion = 0.08872458410351201	
ohe_stolen = 0.7781885397412199	
ohe_misc = 0.24306839186691312	
ohe_drugs = 0.1534195933456562	
ohe_carjacking = 0.025878003696857672	
ohe_defensive = 0.06561922365988909	
ohe_robbery = 0.0009242144177449168	
ohe_family = 0.04805914972273567	
ohe_institution = 0.0027726432532347504	
ohe_child = 0.0018484288354898336	
ohe_mass = 0.06377079482439926	
ohe_domestic = 0.07208872458410351	
suspect_age = 0.28188539741219965	
Young Adult = 0.47412199630314233	
Mid-Adult = 0.16081330868761554	
Adult = 0.011090573012939002	
Senior = 0.0	
East_south_Central = 0.47689463955637706	
Mid-Atlantic = 0.0	
Mountain = 0.3345656192236599	
Western = 0.0	
South-Atlantic = 0.18853974121996303	
West_NorthEastern = 0.0	
South_Central = 0.0
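
A summary such as "Mostly Democratic districts. Stolen guns." can be read off the listing mechanically by keeping the binary features whose mean exceeds one half. The sketch below does this for a handful of values copied from the Cluster 1 listing; the helper `dominant_features` and the 0.5 threshold are assumptions made for illustration.

```python
# A few binary feature means copied from the Cluster 1 listing above.
cluster_1_means = {
    "Democrat": 0.7597042513863216,
    "ohe_officer": 0.07948243992606285,
    "ohe_arrest": 0.29390018484288355,
    "ohe_felon": 0.17467652495378927,
    "ohe_stolen": 0.7781885397412199,
    "ohe_misc": 0.24306839186691312,
}


def dominant_features(means, threshold=0.5):
    """Return (name, mean) pairs above the threshold, largest first."""
    hits = [(name, value) for name, value in means.items() if value > threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


for name, value in dominant_features(cluster_1_means):
    print(f"{name}: {value:.2f}")
```

For a 0/1 feature the cluster mean is the fraction of rows where the feature is set, so a mean above 0.5 means the attribute holds for most members of the cluster.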

## Cluster 2 - Characteristics

South Central Region.  Logansport, LA  71049, United States.  1 gun involved.  Stolen gun.


0.58098223615465	
32.059583594566355	
-93.6164198537095	
1.1253918495297806	
0.3343782654127482	
0.0	
0.11389759665621735	
0.012539184952978056	
0.22152560083594566
0.2006269592476489	
0.2027168234064786	
0.08881922675026123	
0.10553814002089865	
0.0877742946708464	
0.042842215256008356	
0.05956112852664577	
0.055381400208986416	
0.722048066875653	
0.0877742946708464	
0.1922675026123302	
0.07210031347962383	
0.12539184952978055	
0.003134796238244514	
0.05747126436781609	
0.008359456635318705	
0.0073145245559038665	
0.07523510971786834	
0.05956112852664577	
0.19017763845350052	
0.5600835945663531	
0.1703239289446186	
0.01567398119122257	
0.17345872518286312	
0.0	
0.0	
0.0	
0.0	
0.0	
0.03761755485893417	
0.7889237199582028

## Cluster 3 - Characteristics

Pacific Region. 1 gun involved and gun stolen.

0.4793814432989691	
39.202546048109966	
-120.20736769759449	
1.3178694158075601	- 1 gun involved
0.7216494845360825
0.0	
0.1872852233676976	
0.08419243986254296	
0.22852233676975944	
0.2027491408934708	
0.20618556701030927	
0.12886597938144329	
0.12199312714776632	
0.17353951890034364	
0.029209621993127148	
0.03264604810996564	
0.06013745704467354	
0.8127147766323024	-- gun stolen
0.140893470790378	
0.18384879725085912	
0.05326460481099656	
0.08762886597938144	
0.0	
0.05154639175257732	
0.00859106529209622	
0.003436426116838488	
0.06701030927835051	
0.04982817869415808	
0.17010309278350516	
0.5378006872852233	
0.218213058419244	
0.020618556701030927	
0.0	
0.0	
0.09278350515463918	
0.0	
0.9072164948453608	--> Pacific Region
0.0	
0.0	
0.0

Most clusters average about 1 gun involved per incident.  Cluster 8 (label 7.0, only 9 rows) has a mean of about 41.7 guns involved.
The stolen-gun attribute appears in every cluster.

Cluster 7: Mountain region.

Clusters 5 and 6: South Atlantic.
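
The unusually high guns-involved mean for one cluster can be checked against the others with a quick comparison to the median. The values below are copied from the groupby output (the column that the Cluster 1 listing names `guns_involved`); the helper `outlier_labels` and the factor of 10 are illustrative assumptions.

```python
from statistics import median

# Mean guns_involved per cluster label, copied from the groupby output.
guns_by_label = {0.0: 1.254159, 1.0: 1.125392, 2.0: 1.317869, 3.0: 1.144638,
                 4.0: 1.107643, 5.0: 1.254703, 6.0: 1.168724, 7.0: 41.666667,
                 8.0: 1.000000, 9.0: 1.280061}


def outlier_labels(means, factor=10.0):
    """Return labels whose mean exceeds factor times the median of all means."""
    typical = median(means.values())
    return [label for label, value in means.items() if value > factor * typical]


print(outlier_labels(guns_by_label))
```

Only label 7.0 is flagged, which is the cluster counted as "Cluster 8" when clusters are numbered from 1.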





### (Optional) Delete the Endpoint
If you're ready to be done with this notebook, make sure to run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [12]:
print(kmeans_predictor.endpoint_name)


jumpstart-example-kmeans-2023-04-14-16-23-06


In [13]:
import sagemaker

sagemaker.Session().delete_endpoint(kmeans_predictor.endpoint_name)

INFO:sagemaker:Deleting endpoint with name: jumpstart-example-kmeans-2023-04-14-16-23-06
