Using Amazon SageMaker to Access AWS S3 Datasets Defined in AWS Glue Data Catalog
====================

This notebook demonstrates accessing data defined in the AWS Glue Data Catalog from an Amazon SageMaker notebook. 

Access occurs via:
* a [Sparkmagic](https://github.com/jupyter-incubator/sparkmagic) notebook (PySpark) 
* [Apache Livy](https://livy.incubator.apache.org/), running on an [Amazon EMR](https://aws.amazon.com/emr/) cluster
* to [AWS S3](https://aws.amazon.com/s3/) datasets defined in an [AWS Glue Data Catalog](https://aws.amazon.com/glue/)

The [accompanying blog post](https://aws.amazon.com/blogs/machine-learning/how-to-access-amazon-s3-data-managed-by-aws-glue-data-catalog-from-amazon-sagemaker-notebooks/) provides instructions, along with an [AWS CloudFormation](https://aws.amazon.com/cloudformation/) template that sets up the environment for you. The template also sets up an AWS Glue crawler to crawl some sample data on S3; that data (on US legislators) is used in the sample scripts below. 


## Table of Contents
1. [Set up Access to the AWS Glue Data Catalog](#emr_setup)
1. [Access AWS Glue Data Catalog Tables using SQL magics](#glue_access)
1. [Using the Data on the Local Notebook](#local_access)

## Set up Access to the AWS Glue Data Catalog<a name='emr_setup'></a>

This Jupyter notebook is written to run on an Amazon SageMaker notebook instance. It uses SparkMagic (PySpark) to access Apache Spark, running on Amazon EMR. 

The EMR cluster runs Spark and Apache Livy, and must be set up to use the AWS Glue Data Catalog as its Hive metastore.

In addition, the Amazon SageMaker notebook instance must be configured to access Livy.

This configuration is set up automatically by the CloudFormation template in the [accompanying blog post](https://aws.amazon.com/blogs/machine-learning/how-to-access-amazon-s3-data-managed-by-aws-glue-data-catalog-from-amazon-sagemaker-notebooks/).

**Alternatively**, to set up the EMR cluster manually:

* Set up the EMR cluster as described in [Build Amazon SageMaker notebooks backed by Spark in Amazon EMR](https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/), with the following differences:

    * Select the following two AWS Glue Data Catalog options:
        * Use for Hive table metadata
        * Use for Spark table metadata
    * Add the following EMR configuration option:

```
{
    "Classification": "livy-conf",
    "Properties": {
        "livy.server.request-log-retain.days": "10",
        "livy.repl.enable-hive-context": "true"
    }
}
```

* Modify the Amazon SageMaker notebook options, as described in the blog post, to point to the master node of your EMR cluster.
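
The configuration above can also be assembled programmatically, for example to pass it to a cluster-creation script. The sketch below is a minimal illustration and not the blog post's template. It rebuilds the `livy-conf` classification shown above and, as an assumption, adds the `hive-site` and `spark-hive-site` classifications that AWS documents for pointing Hive and Spark at the Glue Data Catalog. The helper name `build_emr_configurations` is hypothetical.

```python
import json

# Metastore client factory class that AWS documents for the Glue Data Catalog
# (assumption: this is what the two console checkboxes configure).
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def build_emr_configurations(log_retain_days=10):
    """Return a list of EMR configuration classifications (hypothetical helper)."""
    glue_props = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    return [
        # Use the Glue Data Catalog for Hive table metadata.
        {"Classification": "hive-site", "Properties": dict(glue_props)},
        # Use the Glue Data Catalog for Spark table metadata.
        {"Classification": "spark-hive-site", "Properties": dict(glue_props)},
        # Livy settings copied from the configuration option above.
        {
            "Classification": "livy-conf",
            "Properties": {
                "livy.server.request-log-retain.days": str(log_retain_days),
                "livy.repl.enable-hive-context": "true",
            },
        },
    ]


def to_json(configurations):
    """Serialize the configurations as indented JSON text."""
    return json.dumps(configurations, indent=2)
```

Calling `print(to_json(build_emr_configurations()))` prints a JSON array that you could compare against the settings of an existing cluster.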


Now:
* Check that this notebook is using SparkMagic (PySpark).
* Ensure that you've restarted the kernel of this notebook.

Then, print out the list of Livy magic commands.

In [1]:
%%help

Magic,Example,Explanation
info,%%info,Outputs session information for the current Livy endpoint.
cleanup,%%cleanup -f,"Deletes all sessions for the current Livy endpoint, including this notebook's session. The force flag is mandatory."
delete,%%delete -f -s 0,Deletes a session by number for the current Livy endpoint. Cannot delete this kernel's session.
logs,%%logs,Outputs the current session's Livy logs.
configure,"%%configure -f {""executorMemory"": ""1000M"", ""executorCores"": 4}",Configure the session creation parameters. The force flag is mandatory if a session has already been  created and the session will be dropped and recreated. Look at Livy's POST /sessions Request Body for a list of valid parameters. Parameters must be passed in as a JSON string.
spark,%%spark -o df df = spark.read.parquet('...,"Executes spark commands.  Parameters:  -o VAR_NAME: The Spark dataframe of name VAR_NAME will be available in the %%local Python context as a  Pandas dataframe with the same name.  -m METHOD: Sample method, either take or sample.  -n MAXROWS: The maximum number of rows of a dataframe that will be pulled from Livy to Jupyter.  If this number is negative, then the number of rows will be unlimited.  -r FRACTION: Fraction used for sampling."
sql,%%sql -o tables -q SHOW TABLES,"Executes a SQL query against the variable sqlContext (Spark v1.x) or spark (Spark v2.x).  Parameters:  -o VAR_NAME: The result of the SQL query will be available in the %%local Python context as a  Pandas dataframe.  -q: The magic will return None instead of the dataframe (no visualization).  -m, -n, -r are the same as the %%spark parameters above."
local,%%local a = 1,All the code in subsequent lines will be executed locally. Code must be valid Python code.


If everything is set up correctly, the following cells should run against the EMR cluster.

The following cell prints out info about the current Spark sessions.  

In [2]:
%%info

There may be "no active sessions" so far. That's OK; you'll start one below. If you or someone else has run other sessions recently, you may receive a list of existing sessions.   

If you receive an error such as "Error sending http request and maximum retry encountered", try restarting the kernel, then re-executing these cells. 

If you continue to see an error, check the configuration:
* Use a terminal to check that `.sparkmagic/config.json` on this Amazon SageMaker notebook instance has the correct IP address and port for the EMR cluster.
* Check that the EMR cluster has started correctly.
* Check that the EMR cluster's port is open and accessible to the Amazon SageMaker instance.
* Try accessing your EMR cluster on the Livy port (default 8998); you should see the Livy welcome page.

Remember to restart the kernel after every change!
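
The address and port checks can be scripted from a terminal on the notebook instance. The following is a minimal sketch, assuming the config file follows the layout of the Sparkmagic example config, where the Livy URL is stored under `kernel_python_credentials` → `url`. The function names are hypothetical.

```python
import json
import os
import socket
from urllib.parse import urlparse


def livy_endpoint(config):
    """Extract (host, port) of the Livy server from a sparkmagic config dict."""
    url = config["kernel_python_credentials"]["url"]
    parsed = urlparse(url)
    # Fall back to Livy's default port when the URL does not name one.
    return parsed.hostname, parsed.port or 8998


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_sparkmagic_config(path="~/.sparkmagic/config.json"):
    """Read the config file and report whether the Livy port is reachable."""
    with open(os.path.expanduser(path)) as f:
        host, port = livy_endpoint(json.load(f))
    return host, port, port_is_open(host, port)
```

Running `check_sparkmagic_config()` returns the configured host and port along with a reachability flag. A `False` flag points to a stopped cluster, a wrong address, or a closed port rather than a notebook problem.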

Once this cell executes correctly, move on! The next cell should start a Spark session, and an application if needed. It'll also print out the hostname of the EMR cluster's master node (which contains its IP address), and the current setting of the SPARK_HOME environment variable. 

In [3]:
import os
import platform
# Print some characteristics of the remote system
print(platform.node())
print(platform.platform(aliased=0, terse=0))
print("Spark home currently set to", os.environ.get('SPARK_HOME', None))

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
0,,pyspark,idle,,,✔


SparkSession available as 'spark'.
ip-172-31-19-148
Linux-4.14.33-51.37.amzn1.x86_64-x86_64-with-glibc2.2.5
('Spark home currently set to', '/usr/lib/spark')

After running the previous cell (it may take a few minutes to complete), you should see a message saying "SparkSession available as 'spark'". 

If you don't, restart the kernel for this notebook and try running the above cells again. 

Now, run a basic Spark function. This cell shows how you can execute Spark parallelized functions on your EMR cluster.

In [4]:
# Run a basic Spark function, to show that's working
sc.parallelize(range(1000)).count()

1000

Now, we can see the logs from this Spark task. This capability allows us to debug our Spark tasks without leaving the notebook environment.

In [5]:
%%logs

stdout: 

stderr: 
18/06/19 18:01:03 INFO RSCDriver: Connecting to: ip-172-31-19-148.us-west-2.compute.internal:10001
18/06/19 18:01:03 INFO RSCDriver: Starting RPC server...
18/06/19 18:01:03 INFO RpcServer: Connected to the port 10003
18/06/19 18:01:03 WARN RSCConf: Your hostname, ip-172-31-19-148.us-west-2.compute.internal, resolves to a loopback address, but we couldn't find any external IP address!
18/06/19 18:01:03 WARN RSCConf: Set livy.rsc.rpc.server.address if you need to bind to another address.
18/06/19 18:01:04 INFO RSCDriver: Received job request e891cc6a-da46-41f2-8a5d-127c40c9ea24
18/06/19 18:01:04 INFO RSCDriver: SparkContext not yet up, queueing job request.
18/06/19 18:01:09 INFO RMProxy: Connecting to ResourceManager at ip-172-31-19-148.us-west-2.compute.internal/172.31.19.148:8032
18/06/19 18:01:22 INFO YarnClientImpl: Submitted application application_1529430545282_0001

YARN Diagnostics: 

## Access AWS Glue Data Catalog Tables using SQL magics<a name='glue_access'></a> 

Let's switch to the topic at hand. First, list the databases in the EMR cluster's Hive metastore. If all is set up correctly, this is actually the AWS Glue Data Catalog. Then list the tables and run some SQL commands to retrieve some data. 

The following examples run against the AWS Glue 'legislators' database. If you have not already done so, follow the instructions in the AWS Glue tutorial under [“Crawling the Sample Data Used in the Tutorials”](https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-prerequisites.html#dev-endpoint-tutorial-prerequisites-crawl-data). You can stop once you’ve completed the steps to create the crawler, and can see the six tables created by the crawler in your data catalog, containing metadata that the crawler retrieved. 


In [6]:
%%sql 
show databases like 'legislators' 

Unnamed: 0,databaseName
0,legislators


In [7]:
%%sql
show tables in legislators

Unnamed: 0,database,tableName,isTemporary
0,legislators,areas_json,False
1,legislators,countries_json,False
2,legislators,events_json,False
3,legislators,memberships_json,False
4,legislators,organizations_json,False
5,legislators,persons_json,False


In [8]:
%%sql
select * from legislators.persons_json limit 10

Unnamed: 0,family_name,name,links,gender,image,identifiers,other_names,sort_name,images,given_name,birth_date,id,contact_details,death_date
0,Collins,Mac Collins,"[{'note': 'Wikipedia (de)', 'url': 'https://de...",male,https://theunitedstates.io/images/congress/ori...,"[{'scheme': 'bioguide', 'identifier': 'C000640...","[{'lang': 'bar', 'note': 'multilingual', 'name...","Collins, Michael",[{'url': 'https://theunitedstates.io/images/co...,Michael,1944-10-15,0005af3a-9471-4d1f-9299-737fff4b9b46,,NaT
1,Huizenga,Bill Huizenga,"[{'note': 'Wikipedia (de)', 'url': 'https://de...",male,https://theunitedstates.io/images/congress/ori...,"[{'scheme': 'ballotpedia', 'identifier': 'Bill...","[{'lang': 'da', 'note': 'multilingual', 'name'...","Huizenga, Bill",[{'url': 'https://theunitedstates.io/images/co...,Bill,1969-01-31,00aa2dc0-bfb6-4412-a7fc-4f0cfdc00ebf,"[{'type': 'fax', 'value': '202-226-0779'}, {'t...",NaT
2,Clawson,Curt Clawson,"[{'note': 'Wikipedia (commons)', 'url': 'https...",male,https://theunitedstates.io/images/congress/ori...,"[{'scheme': 'bioguide', 'identifier': 'C001102...","[{'lang': 'bar', 'note': 'multilingual', 'name...","Clawson, Curtis",[{'url': 'https://theunitedstates.io/images/co...,Curtis,1959-09-28,00aca284-9323-4953-bb7a-1bf6f5eefe95,"[{'type': 'phone', 'value': '202-225-2536'}, {...",NaT
3,Solomon,Gerald Solomon,"[{'note': 'Wikipedia (de)', 'url': 'https://de...",male,https://theunitedstates.io/images/congress/ori...,"[{'scheme': 'bioguide', 'identifier': 'S000675...","[{'note': 'alternate', 'name': 'Gerald B. H. S...","Solomon, Gerald",[{'url': 'https://theunitedstates.io/images/co...,Gerald,1930-08-14,00b73df5-4180-4418-8b21-b6367add06c9,,2001-10-26
4,Rigell,E. Scott Rigell,"[{'note': 'Wikipedia (de)', 'url': 'https://de...",male,https://theunitedstates.io/images/congress/ori...,"[{'scheme': 'bioguide', 'identifier': 'R000589...","[{'note': 'alternate', 'name': 'Scott Rigell'}...","Rigell, Edward",[{'url': 'https://theunitedstates.io/images/co...,Edward,1960-05-28,00bee44f-db04-4a7d-972d-041f77dfa50d,"[{'type': 'fax', 'value': '202-225-4218'}, {'t...",NaT
5,Crapo,Mike Crapo,"[{'note': 'Wikipedia (da)', 'url': 'https://da...",male,https://theunitedstates.io/images/congress/ori...,"[{'scheme': 'ballotpedia', 'identifier': 'Mike...","[{'lang': 'da', 'note': 'multilingual', 'name'...","Crapo, Michael",[{'url': 'https://theunitedstates.io/images/co...,Michael,1951-05-20,00f8f12d-6e27-4a21-910d-02857dda9e27,"[{'type': 'twitter', 'value': 'MikeCrapo'}]",NaT
6,Hutto,Earl Hutto,"[{'note': 'Wikipedia (de)', 'url': 'https://de...",male,https://theunitedstates.io/images/congress/ori...,"[{'scheme': 'bioguide', 'identifier': 'H001018...","[{'note': 'alternate', 'name': 'Earl Dewitt Hu...","Hutto, Earl",[{'url': 'https://theunitedstates.io/images/co...,Earl,1926-05-12,015d77c8-6edb-4eda-b2cb-cf159cce055e,,NaT
7,Ertel,Allen Ertel,"[{'note': 'Wikipedia (de)', 'url': 'https://de...",male,https://theunitedstates.io/images/congress/ori...,"[{'scheme': 'bioguide', 'identifier': 'E000208...","[{'note': 'alternate', 'name': 'Allen E. Ertel...","Ertel, Allen",[{'url': 'https://theunitedstates.io/images/co...,Allen,1937-11-07,01679bc3-da21-482d-b0fd-7adff5df5717,,2015-11-19
8,Minish,Joseph Minish,"[{'note': 'Wikipedia (de)', 'url': 'https://de...",male,https://theunitedstates.io/images/congress/ori...,"[{'scheme': 'bioguide', 'identifier': 'M000796...","[{'lang': 'bar', 'note': 'multilingual', 'name...","Minish, Joseph",[{'url': 'https://theunitedstates.io/images/co...,Joseph,1916-09-01,018247d0-2961-4230-b154-b3e36ffcd7ce,,2007-11-24
9,Andrews,Robert E. Andrews,"[{'note': 'Wikipedia (de)', 'url': 'https://de...",male,https://theunitedstates.io/images/congress/ori...,"[{'scheme': 'bioguide', 'identifier': 'A000210...","[{'note': 'alternate', 'name': 'Rob Andrews'},...","Andrews, Robert",[{'url': 'https://theunitedstates.io/images/co...,Robert,1957-08-04,01b100ac-192e-4b5a-a4a4-b4cd8698f99d,"[{'type': 'phone', 'value': '202-225-6501'}]",NaT


Retrieve some data from the table into a data frame that you can access locally (you'll see how a little later). The "-o" parameter specifies the data frame name. 

In [9]:
%%sql -o party_counts
select on_behalf_of_id as party, count(*) as count from legislators.memberships_json group by on_behalf_of_id

Unnamed: 0,party,count
0,party/republican-conservative,1
1,party/al,1
2,party/republican,4980
3,party/democrat,5423
4,party/popular_democrat,1
5,party/new_progressive,2
6,party/democrat-liberal,1
7,party/independent,30


Remember that the `%%sql` magic, by default, limits the number of result rows pulled into the notebook to 2500. You can modify the number of rows returned by using the '-n' flag.

-n MAXROWS: The maximum number of rows of a SQL query that will be pulled from Livy to Jupyter. If this number is negative, then the number of rows will be unlimited.
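
Because of this cap, a local data frame that contains exactly the maximum number of rows has probably been cut off. A minimal sketch of that check follows. `may_be_truncated` is a hypothetical helper name, and the default of 2500 is taken from the text above.

```python
def may_be_truncated(row_count, max_rows=2500):
    """Return True if a result with row_count rows may have hit the row cap.

    A negative max_rows means "unlimited", mirroring the -n flag described above.
    """
    if max_rows < 0:
        return False
    return row_count >= max_rows
```

In a `%%local` cell you could call `may_be_truncated(len(party_counts))` on a data frame retrieved with `-o`. If it returns `True`, consider re-running the query with a larger `-n` value or a more selective query.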

Now, run a more complex SQL query that joins and projects across several datasets, and returns the result in a dataframe called "members".

You can also add filter conditions to the SQL, to limit the data returned to only the subset of interest, or, to a useful subset for initial exploration.

In [10]:
%%sql -o members
select p.name as membername, p.gender, p.birth_date, p.death_date, 
    m.on_behalf_of_id, m.legislative_period_id, m.start_date, m.end_date, 
    o.name as orgname, o.classification
from legislators.persons_json p
join legislators.memberships_json m on m.person_id = p.id
join legislators.organizations_json o on m.organization_id = o.id

Unnamed: 0,membername,gender,birth_date,death_date,on_behalf_of_id,legislative_period_id,start_date,end_date,orgname,classification
0,Jackie Walorski,female,1963-08-17,NaT,party/republican,term/113,NaT,NaT,House of Representatives,legislature
1,Jackie Walorski,female,1963-08-17,NaT,party/republican,term/114,NaT,NaT,House of Representatives,legislature
2,Jackie Walorski,female,1963-08-17,NaT,party/republican,term/115,NaT,NaT,House of Representatives,legislature
3,Juanita Millender-McDonald,female,1938-09-07,2007-04-22,party/democrat,term/104,1996-03-26,NaT,House of Representatives,legislature
4,Juanita Millender-McDonald,female,1938-09-07,2007-04-22,party/democrat,term/105,1997-01-07,NaT,House of Representatives,legislature
5,Juanita Millender-McDonald,female,1938-09-07,2007-04-22,party/democrat,term/106,NaT,NaT,House of Representatives,legislature
6,Juanita Millender-McDonald,female,1938-09-07,2007-04-22,party/democrat,term/107,NaT,NaT,House of Representatives,legislature
7,Juanita Millender-McDonald,female,1938-09-07,2007-04-22,party/democrat,term/108,2003-01-07,NaT,House of Representatives,legislature
8,Juanita Millender-McDonald,female,1938-09-07,2007-04-22,party/democrat,term/109,NaT,NaT,House of Representatives,legislature
9,Juanita Millender-McDonald,female,1938-09-07,2007-04-22,party/democrat,term/110,NaT,2007-04-22,House of Representatives,legislature


Now, take a randomly shuffled subset of this joined data. Run the next cell several times; you should see a different result each time. 

In [11]:
%%sql -o membersshuffle
select p.name as membername, p.gender, p.birth_date, p.death_date, 
    m.on_behalf_of_id, m.legislative_period_id, m.start_date, m.end_date, 
    o.name as orgname, o.classification 
from legislators.persons_json p
join legislators.memberships_json m on m.person_id = p.id
join legislators.organizations_json o on m.organization_id = o.id
distribute by rand()
sort by rand()
limit 10

Unnamed: 0,membername,gender,birth_date,death_date,on_behalf_of_id,legislative_period_id,orgname,classification
0,Gus Yatron,male,1927-10-16,2003-03-13,party/democrat,term/97,House of Representatives,legislature
1,Robert Davis,male,1932-07-31,2009-10-16,party/republican,term/98,House of Representatives,legislature
2,Doug Bereuter,male,1939-10-06,NaT,party/republican,term/99,House of Representatives,legislature
3,Nancy Johnson,female,1935-01-05,NaT,party/republican,term/109,House of Representatives,legislature
4,Jamie Whitten,male,1910-04-18,1995-09-09,party/democrat,term/98,House of Representatives,legislature
5,Jaime Fuster,male,1941-01-12,2007-12-03,party/democrat,term/102,House of Representatives,legislature
6,William Armstrong,male,1937-03-16,2016-07-05,party/republican,term/100,Senate,legislature
7,Neal Smith,male,1920-03-23,NaT,party/democrat,term/101,House of Representatives,legislature
8,Gwen Moore,female,1951-04-18,NaT,party/democrat,term/112,House of Representatives,legislature
9,Robert B. Aderholt,male,1965-07-22,NaT,party/republican,term/112,House of Representatives,legislature


You can see that some of the data is not filled in (start_date, end_date of legislative period). You can also transform the data in SQL to strip redundant information before it reaches the notebook environment, such as the 'party/' and 'term/' prefixes. That will reduce the processing you'll need to do in the notebook.
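
The next query strips these prefixes with Spark SQL's `split(..., '/')[1]`, which keeps the element after the first slash. As an illustration only, here is a rough plain-Python equivalent. `strip_prefix` is a hypothetical name, and returning `None` for a value without a separator is an assumption meant to mirror Spark's null result for an out-of-range array index in its default (non-ANSI) mode.

```python
def strip_prefix(value, sep="/"):
    """Return the element after the first separator, or None if there is none."""
    parts = value.split(sep)
    return parts[1] if len(parts) > 1 else None


# Example values taken from the query results above.
party = strip_prefix("party/republican")   # "republican"
term = strip_prefix("term/113")            # "113"
```

Doing this on the cluster rather than locally means the shorter values are what travel from Livy to the notebook.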


In [12]:
%%sql -o membersadj
select p.name as membername, p.gender, p.birth_date, p.death_date, 
    split(m.on_behalf_of_id,'/')[1] as party, split(m.legislative_period_id, '/')[1] as legislature,  
    o.name as orgname
from legislators.persons_json p
join legislators.memberships_json m on m.person_id = p.id
join legislators.organizations_json o on m.organization_id = o.id

Unnamed: 0,membername,gender,birth_date,death_date,party,legislature,orgname
0,Duncan Hunter,male,1948-05-31,NaT,republican,100,House of Representatives
1,Duncan Hunter,male,1948-05-31,NaT,republican,101,House of Representatives
2,Duncan Hunter,male,1948-05-31,NaT,republican,102,House of Representatives
3,Duncan Hunter,male,1948-05-31,NaT,republican,103,House of Representatives
4,Duncan Hunter,male,1948-05-31,NaT,republican,104,House of Representatives
5,Duncan Hunter,male,1948-05-31,NaT,republican,105,House of Representatives
6,Duncan Hunter,male,1948-05-31,NaT,republican,106,House of Representatives
7,Duncan Hunter,male,1948-05-31,NaT,republican,107,House of Representatives
8,Duncan Hunter,male,1948-05-31,NaT,republican,108,House of Representatives
9,Duncan Hunter,male,1948-05-31,NaT,republican,109,House of Representatives


## Using the Data on the Local Notebook<a name='local_access'></a>

Next, let's check the characteristics of the local system. Check the IP address printed below; it should be different from the IP address of the EMR system above.

In [13]:
%%local
# Print some characteristics of the local system
import platform
print(platform.machine())
print(platform.node())
print(platform.platform(aliased=0, terse=0))


x86_64
ip-172-16-34-154
Linux-4.9.93-41.60.amzn1.x86_64-x86_64-with-glibc2.9


On the notebook instance, look at some characteristics of the data frames that you created above.

In [14]:
%%local
# print the columns and types of the retrieved data file(s)
membersadj.info()

# show the top few rows
display(membersadj.head())

# describe the data object
display(membersadj.describe())

# Summarize the categorical field party 
display(membersadj.party.value_counts())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 7 columns):
membername     2500 non-null object
gender         2500 non-null object
birth_date     2500 non-null datetime64[ns]
death_date     518 non-null datetime64[ns]
party          2500 non-null object
legislature    2500 non-null int64
orgname        2500 non-null object
dtypes: datetime64[ns](2), int64(1), object(4)
memory usage: 136.8+ KB


Unnamed: 0,membername,gender,birth_date,death_date,party,legislature,orgname
0,Duncan Hunter,male,1948-05-31,NaT,republican,100,House of Representatives
1,Duncan Hunter,male,1948-05-31,NaT,republican,101,House of Representatives
2,Duncan Hunter,male,1948-05-31,NaT,republican,102,House of Representatives
3,Duncan Hunter,male,1948-05-31,NaT,republican,103,House of Representatives
4,Duncan Hunter,male,1948-05-31,NaT,republican,104,House of Representatives


Unnamed: 0,legislature
count,2500.0
mean,105.7904
std,5.514874
min,97.0
25%,101.0
50%,106.0
75%,111.0
max,115.0


democrat                   1343
republican                 1146
independent                  10
republican-conservative       1
Name: party, dtype: int64

You can see that 'membersadj' is now a Pandas dataframe, ready to be accessed locally.

An example of clustering this data is included below as a demonstration of how a data set processed and retrieved in this way can be used in a machine learning project. 

To analyze the data further, you'll add some calculated columns:
- convert gender to a numeric ('gen')
- convert party to a numeric ('partynum')
- calculate an estimated age when seated in the legislature ('age')

You'll then use scikit-learn to cluster the data.

Note that the utility of the results varies greatly depending on the quality of the input data and the intended use of its outputs; in this case, it's questionable. 
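
The age estimate depends on mapping a legislature (Congress) number to the year it convened. As a cross-check, here is a minimal sketch that assumes the 1st United States Congress convened in 1789 and that each Congress lasts two years. The function names are hypothetical. Under that assumption the start year is 1787 + 2 × n, which is worth comparing against whatever offset you use when computing 'legstart'.

```python
FIRST_CONGRESS_YEAR = 1789  # assumption: year the 1st U.S. Congress convened


def congress_start_year(congress_number):
    """Return the year in which the given Congress convened."""
    if congress_number < 1:
        raise ValueError("congress_number must be >= 1")
    return FIRST_CONGRESS_YEAR + 2 * (congress_number - 1)


def estimated_age_when_seated(birth_year, congress_number):
    """Estimate a member's age at the start of the given Congress."""
    return congress_start_year(congress_number) - birth_year
```

For example, `congress_start_year(100)` gives 1987, so a member born in 1948 would be estimated at 39 when the 100th Congress was seated.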

In [15]:
%%local
membersadj['gen'] = membersadj['gender'].map({'female': 1, 'male': 0})
membersadj['partynum'] = membersadj['party'].map({
    'republican-conservative':0, 
    'al':1,
    'republican':2,
    'democrat': 3,
    'popular_democrat':4,
    'new_progressive':5,
    'democrat-liberal':6,
    'independent':7  })

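# Estimated first year of the legislature. Note: the 1st Congress convened in 1789,
# so Congress n began in 1787 + 2*n; the 1767 offset below runs 20 years early.
membersadj['legstart'] = membersadj['legislature']*2 + 1767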
membersadj['birthyear'] = membersadj['birth_date'].dt.year
membersadj['age'] = membersadj['legstart'] - membersadj['birthyear']

membersadj.head()

Unnamed: 0,membername,gender,birth_date,death_date,party,legislature,orgname,gen,partynum,legstart,birthyear,age
0,Duncan Hunter,male,1948-05-31,NaT,republican,100,House of Representatives,0,2,1967,1948,19
1,Duncan Hunter,male,1948-05-31,NaT,republican,101,House of Representatives,0,2,1969,1948,21
2,Duncan Hunter,male,1948-05-31,NaT,republican,102,House of Representatives,0,2,1971,1948,23
3,Duncan Hunter,male,1948-05-31,NaT,republican,103,House of Representatives,0,2,1973,1948,25
4,Duncan Hunter,male,1948-05-31,NaT,republican,104,House of Representatives,0,2,1975,1948,27


In [16]:
%%local
data = membersadj[['gen','partynum','age']]
data.head()

Unnamed: 0,gen,partynum,age
0,0,2,19
1,0,2,21
2,0,2,23
3,0,2,25
4,0,2,27


In [17]:
%%local
from sklearn.cluster import MiniBatchKMeans

est = MiniBatchKMeans(n_clusters=8)
est.fit(data)
centers = est.cluster_centers_
print(str(centers))

[[ 0.14936248  2.51912568 27.55919854]
 [ 0.1552795   2.65838509 49.92236025]
 [ 0.12310606  2.59090909 37.95454545]
 [ 0.09677419  2.4562212  15.05990783]
 [ 0.12790698  2.84883721 60.03488372]
 [ 0.15939597  2.50167785 33.07214765]
 [ 0.15555556  2.54222222 43.16666667]
 [ 0.11061947  2.52876106 21.65707965]]


The results of the clustering are shown above. 
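
In the centers above, the third coordinate (age) varies far more than the first two, which suggests the clusters are separated mostly by age. k-means relies on Euclidean distance, so features with larger numeric ranges dominate. One common remedy, not used in this notebook, is to standardize each feature before clustering. A minimal standard-library sketch of that idea follows. `standardize` is a hypothetical helper; in practice a library scaler would typically be used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Example: the 'age' values from the first rows shown earlier.
scaled_ages = standardize([19, 21, 23, 25, 27])
```

After standardization each feature has zero mean and unit variance, so gender, party and age contribute on comparable scales to the distance computation.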

You've now:
* Retrieved data from your data lake
* Joined, filtered and aggregated it using the power of Spark on EMR, from the comfort of your local Jupyter notebook
* Applied a machine learning technique locally, to the data retrieved that way.

  Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
  
  Licensed under the Apache License, Version 2.0 (the "License").
  You may not use this file except in compliance with the License.
  A copy of the License is located at
  
      http://www.apache.org/licenses/LICENSE-2.0
  
  or in the "license" file accompanying this file. This file is distributed 
  on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either 
  express or implied. See the License for the specific language governing 
  permissions and limitations under the License.
