## Clustering
In this exercise, you will use K-Means clustering to segment customer data into five clusters.

### Import the Libraries
You will use the **KMeans** class to create your model. This will require a vector of features, so you will also use the **VectorAssembler** class.


In [1]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

Starting Spark application



SparkSession available as 'spark'.


### Load Source Data
The source data for your clusters is in a comma-separated values (CSV) file, and includes the following features:
- CustomerName: The customer's name
- Age: The customer's age in years
- MaritalStatus: The customer's marital status (1 = Married, 0 = Unmarried)
- IncomeRange: The upper limit of the customer's income range (for example, a value of 25,000 means the customer earns up to 25,000)
- Gender: A numeric value indicating gender (1 = female, 2 = male)
- TotalChildren: The total number of children the customer has
- ChildrenAtHome: The number of children the customer has living at home.
- Education: A numeric value indicating the highest level of education the customer has attained (1 = Started High School to 5 = Post-Graduate Degree)
- Occupation: A numeric value indicating the type of occupation of the customer (0=Unskilled manual work to 5=Professional)
- HomeOwner: A numeric code to indicate home ownership (1 = home owner, 0 = not a home owner)
- Cars: The number of cars owned by the customer.

In [2]:
customers = spark.read.csv('wasb:///data/customers.csv', inferSchema=True, header=True)
customers.show()

+---------------+---+-------------+-----------+------+-------------+--------------+---------+----------+---------+----+
|   CustomerName|Age|MaritalStatus|IncomeRange|Gender|TotalChildren|ChildrenAtHome|Education|Occupation|HomeOwner|Cars|
+---------------+---+-------------+-----------+------+-------------+--------------+---------+----------+---------+----+
|    Aaron Adams| 42|            0|      50000|     0|            0|             0|        3|         2|        1|   1|
|Aaron Alexander| 40|            1|      50000|     0|            0|             0|        2|         2|        1|   2|
|    Aaron Allen| 63|            0|      25000|     0|            2|             1|        2|         1|        1|   2|
|    Aaron Baker| 56|            1|      50000|     0|            4|             2|        2|         2|        1|   2|
|   Aaron Bryant| 72|            0|      75000|     0|            4|             0|        4|         4|        1|   2|
|   Aaron Butler| 42|            1|     

### Create the K-Means Model
You will use the features in the customer data to create a K-Means model with a k value of 5. This will be used to generate 5 clusters.

In [3]:
assembler = VectorAssembler(
    inputCols=["Age", "MaritalStatus", "IncomeRange", "Gender", "TotalChildren",
               "ChildrenAtHome", "Education", "Occupation", "HomeOwner", "Cars"],
    outputCol="features")
train = assembler.transform(customers)

kmeans = KMeans(featuresCol=assembler.getOutputCol(), predictionCol="cluster", k=5, seed=0)
model = kmeans.fit(train)
print("Model Created!")

Model Created!
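
To make the fitting step less opaque, here is a minimal pure-Python sketch of the basic K-Means iteration (often called Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its assigned points. It is an illustration only, not Spark's implementation; Spark's `KMeans` runs distributed and uses the k-means|| initialization by default. The function names and the toy two-dimensional points are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10, seed=0):
    """Toy K-Means: random initial centers, then repeated assign/update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct data points as starting centers
    for _ in range(iterations):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        for i, group in enumerate(groups):
            if group:  # keep the old center if no points were assigned
                centers[i] = tuple(sum(dim) / len(group) for dim in zip(*group))
    return centers


# Hypothetical toy data: two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers = kmeans(points, k=2)
print(sorted(centers))
```

On this toy data the two centers settle near the middle of each group. The real model does the same kind of work on the ten-dimensional `features` vectors.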

### Get the Cluster Centers
Each cluster center is returned as a vector with one coordinate per input feature, in the same order as the columns passed to the assembler.

In [4]:
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Cluster Centers: 
[  5.19737441e+01   5.26868545e-01   5.00000000e+04   4.93961141e-01
   1.34552774e+00   4.98337126e-01   3.23035183e+00   2.77927534e+00
   6.62699107e-01   1.14615789e+00]
[  5.82794840e+01   6.22850123e-01   1.50000000e+05   4.79729730e-01
   2.07248157e+00   3.20638821e+00   3.41461916e+00   4.34705160e+00
   6.48648649e-01   3.10995086e+00]
[  5.53417813e+01   5.72411296e-01   1.00000000e+05   4.97103548e-01
   2.54380883e+00   1.54272266e+00   3.46198407e+00   4.19116582e+00
   7.16509776e-01   1.94532947e+00]
[  5.60711289e+01   5.83804487e-01   7.50000000e+04   5.03921211e-01
   2.17308043e+00   8.16706183e-01   3.73244574e+00   3.92759438e+00
   7.23326646e-01   1.38063104e+00]
[  5.31013005e+01   4.17180014e-01   2.50000000e+04   4.80492813e-01
   1.41512663e+00   6.08487337e-01   2.31622177e+00   1.45448323e+00
   5.93086927e-01   1.11464750e+00]
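
Because the coordinates follow the order of the `inputCols` list passed to the assembler, each value can be paired with its feature name to make a center easier to read. The sketch below does this in plain Python for the first center printed above, with values rounded to four decimal places; the variable names are hypothetical.

```python
# Feature names in the same order as the assembler's inputCols.
feature_names = ["Age", "MaritalStatus", "IncomeRange", "Gender", "TotalChildren",
                 "ChildrenAtHome", "Education", "Occupation", "HomeOwner", "Cars"]

# First center from the output above, rounded to four decimal places.
center_0 = [51.9737, 0.5269, 50000.0, 0.4940, 1.3455,
            0.4983, 3.2304, 2.7793, 0.6627, 1.1462]

# Pair each coordinate with the feature it belongs to.
labeled_center = dict(zip(feature_names, center_0))
for name in feature_names:
    print("{:<15} {:>12.4f}".format(name, labeled_center[name]))
```

Read this way, the first center describes customers with an average age of about 52 and an IncomeRange of 50,000. With a live model, the same pairing could be applied to each element of `model.clusterCenters()`.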

### Predict Clusters
Now that you have trained the model, you can use it to segment the customer data into 5 clusters and show each customer with their allocated cluster.

In [5]:
prediction = model.transform(train)
prediction.groupBy("cluster").count().orderBy("cluster").show()

+-------+-----+
|cluster|count|
+-------+-----+
|      0| 5713|
|      1| 1628|
|      2| 2762|
|      3| 5483|
|      4| 2922|
+-------+-----+
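
Conceptually, `transform` assigns each row to the cluster whose center is closest to the row's feature vector, measured by Euclidean distance. The following plain-Python sketch reproduces that step for two rows from the customer table shown earlier, using the centers printed above rounded to four decimal places. The helper and variable names are hypothetical, and this is an illustration rather than Spark's code.

```python
# Cluster centers from the earlier output, rounded to four decimal places.
centers = [
    [51.9737, 0.5269, 50000.0, 0.4940, 1.3455, 0.4983, 3.2304, 2.7793, 0.6627, 1.1462],
    [58.2795, 0.6229, 150000.0, 0.4797, 2.0725, 3.2064, 3.4146, 4.3471, 0.6486, 3.1100],
    [55.3418, 0.5724, 100000.0, 0.4971, 2.5438, 1.5427, 3.4620, 4.1912, 0.7165, 1.9453],
    [56.0711, 0.5838, 75000.0, 0.5039, 2.1731, 0.8167, 3.7324, 3.9276, 0.7233, 1.3806],
    [53.1013, 0.4172, 25000.0, 0.4805, 1.4151, 0.6085, 2.3162, 1.4545, 0.5931, 1.1146],
]


def nearest_center(row, centers):
    """Return the index of the center with the smallest squared distance to row."""
    def squared_distance(center):
        return sum((x - c) ** 2 for x, c in zip(row, center))
    return min(range(len(centers)), key=lambda i: squared_distance(centers[i]))


# Feature values (Age through Cars) copied from the customer table.
aaron_adams = [42, 0, 50000, 0, 0, 0, 3, 2, 1, 1]
aaron_allen = [63, 0, 25000, 0, 2, 1, 2, 1, 1, 2]

print(nearest_center(aaron_adams, centers))  # 0
print(nearest_center(aaron_allen, centers))  # 4
```

These results agree with the clusters the model reports for the same two customers. Because IncomeRange is in the tens of thousands while the other features are far smaller, it accounts for nearly all of the distance in this unscaled feature vector.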

In [6]:
prediction.select("CustomerName", "cluster").show(50)

+----------------+-------+
|    CustomerName|cluster|
+----------------+-------+
|     Aaron Adams|      0|
| Aaron Alexander|      0|
|     Aaron Allen|      4|
|     Aaron Baker|      0|
|    Aaron Bryant|      3|
|    Aaron Butler|      3|
|  Aaron Campbell|      3|
|    Aaron Carter|      0|
|      Aaron Chen|      3|
|   Aaron Coleman|      0|
|   Aaron Collins|      1|
|      Aaron Diaz|      2|
|   Aaron Edwards|      1|
|     Aaron Evans|      3|
|    Aaron Flores|      3|
|    Aaron Foster|      3|
|  Aaron Gonzales|      3|
|  Aaron Gonzalez|      0|
|     Aaron Green|      0|
|     Aaron Green|      0|
|   Aaron Griffin|      4|
|      Aaron Hall|      0|
|     Aaron Hayes|      2|
| Aaron Henderson|      0|
| Aaron Hernandez|      0|
|      Aaron Hill|      2|
|    Aaron Hughes|      2|
|       Aaron Jai|      3|
|   Aaron Jenkins|      0|
|      Aaron King|      3|
|     Aaron Kumar|      3|
|       Aaron Lal|      0|
|        Aaron Li|      3|
|  Aaron McDonald|      0|
|