# Clustering - Example

<h2 id="k-means">K-means</h2>

<p><a href="http://en.wikipedia.org/wiki/K-means_clustering">k-means</a> is one of the most commonly used clustering algorithms; it groups the data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of the <a href="http://en.wikipedia.org/wiki/K-means%2B%2B">k-means++</a> method, called
<a href="http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf">kmeans||</a>.</p>
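
<p>To make the idea of grouping points into a fixed number of clusters concrete, the following is a minimal pure-Python sketch of the classic Lloyd iteration behind k-means. It is not MLlib's implementation: the function names (<code>assign</code>, <code>update</code>, <code>kmeans</code>), the toy points and the hand-picked initial centers are all assumptions made for illustration, and the k-means++ / kmeans|| seeding step is left out.</p>

```python
# Minimal sketch of Lloyd's k-means iteration (illustrative, not MLlib's code).
from math import dist  # Euclidean distance, available in Python 3.8+


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: dist(p, centers[j]))
        for p in points
    ]


def update(points, labels, centers):
    """Move each center to the mean of its assigned points.

    A center with no assigned points is left where it is.
    """
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers


def kmeans(points, centers, max_iter=20):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical toy data: two well-separated groups in 2-D.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Initial centers are picked by hand here; k-means++ would choose them randomly.
centers, labels = kmeans(points, centers=[points[0], points[3]])
print(labels)
print(centers)
```

<p>The number of clusters is fixed up front by the number of initial centers, which is the role the <code>k</code> parameter plays in <code>KMeans</code>.</p>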

<p><code>KMeans</code> is implemented as an <code>Estimator</code> and generates a <code>KMeansModel</code> as the base model.</p>

<h3 id="input-columns">Input columns</h3>

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>featuresCol</td>
      <td>Vector</td>
      <td>"features"</td>
      <td>Feature vector</td>
    </tr>
  </tbody>
</table>

<h3 id="output-columns">Output columns</h3>

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>predictionCol</td>
      <td>Int</td>
      <td>"prediction"</td>
      <td>Predicted cluster center</td>
    </tr>
  </tbody>
</table>

In [None]:
# Example: create (or reuse) a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cluster').getOrCreate()

In [None]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Load the data.
dataset = spark.read.format("libsvm").load("../../Data/sample_kmeans_data.txt")

# Train a k-means model with k = 2 clusters and a fixed seed.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Assign each row to a cluster (adds the "prediction" column).
predictions = model.transform(dataset)

# Evaluate the clustering by computing the silhouette score.

# computeCost is deprecated as of Spark 3.0.0; ClusteringEvaluator is used instead:
# wssse = model.computeCost(dataset)
# print("Within Set Sum of Squared Errors = " + str(wssse))

evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Show the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
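
<p>The evaluation step above reports a silhouette score based on squared Euclidean distance. As a rough illustration of what that number measures, here is a naive pure-Python sketch of the mean silhouette coefficient. It is an assumption-laden sketch, not Spark's implementation: the helper names (<code>sq_dist</code>, <code>silhouette</code>) and the toy data are hypothetical, singleton clusters are scored 0 by a common convention, and at least two clusters are assumed.</p>

```python
# Naive sketch of the mean silhouette score with squared Euclidean distance.
# Illustrative only; ClusteringEvaluator computes this in a distributed way.


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point itself contributes 0 to the sum, hence len(own) - 1).
        a = sum(sq_dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(sq_dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight, well-separated clusters.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
toy_labels = [0, 0, 1, 1]
score = silhouette(toy_points, toy_labels)
print("Silhouette with squared euclidean distance = " + str(score))
```

<p>Scores close to 1 indicate that points sit much closer to their own cluster than to the nearest other cluster; scores near 0 or below suggest overlapping or poorly separated clusters.</p>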