# Foundations of Data Science (GDW) 2023



# Exercise VI: k-means & DBSCAN

This week's exercise will be all about clustering.

## Part 1: Distance measures

An important part of clustering is measuring the distance between n-dimensional data points. A typical distance function is the Euclidean distance.
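
For reference, the Euclidean distance between two points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ is commonly written as follows (the symbols $p$, $q$ and $d$ are notation chosen here for illustration):

$$
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
$$

In the two-dimensional case this reduces to $d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}$. For example, the points $(0,0)$ and $(3,4)$ have distance $\sqrt{9 + 16} = 5$.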

### Task 1.1
Implement the Euclidean distance function for two-dimensional data below.

In [None]:
# write your code here

### Task 1.2
Which `numpy` function that we know of could be used instead?

*write your answer here*

### Task 1.3
Given a set of data points $\{(0,0),(1,0),(-1,2),(2,0),(3,3),(4,-1)\}$, compute the distance matrix for the Euclidean distance either by hand or programmatically.

In [None]:
# write your code here

## Part 2: k-means Clustering

Now we want to run the k-means clustering algorithm.
First we need to import data that we can cluster. For this, we will import the iris data set and convert it to a pandas data frame.

We further want to use the `sklearn` library, which contains many machine learning algorithms. You can find more information here:
https://scikit-learn.org/stable/

You can install the package as usual by calling

In [None]:
!pip install scikit-learn

Then, we can import the *iris* dataset.

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])
df.head()

It could be useful to visualize the data before we decide how to proceed.

### Task 2.1
Visualize the data as a scatter plot. For this, you need to choose two attributes (column names).

Just as we did before, you can use `matplotlib.pyplot` or the built-in pandas function `df.plot.scatter()`.

How many potential clusters can you visually identify?

In [1]:
# write your code here

*Write your answer here*

We can now run k-means by importing it and calling the function below. You will notice that we need to fit the model to our data.

In [None]:
from sklearn.cluster import KMeans
k = 2 # change this according to your guess
kmeans = KMeans(n_clusters=k, n_init='auto')
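# fit on the four feature columns only, so the known species label ('target') is not used
kmeans.fit(df[iris['feature_names']])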

We can also access further information about our clusters.

In [None]:
kmeans.cluster_centers_

In [None]:
kmeans.labels_

### Task 2.2
In your own words, explain the role of cluster centers and the labels seen above.

*Write your notes here*

...

Since we would like to visualize our labeled data points, we need to find a way to do so.

The library `seaborn` is an easy (and pretty) way to do this. We probably need to install it first by calling

In [None]:
!pip install seaborn

We need to specify how to color our data points for the new scatter plot. This can be done by adding a column containing the cluster information to our data frame.

In [None]:
import seaborn as sns

df['clusters'] = kmeans.labels_
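# x1 and y1 are the two column names chosen in Task 2.1; the pair below is only an example
x1, y1 = 'petal length (cm)', 'petal width (cm)'
sns.scatterplot(x=x1, y=y1, data=df, hue='clusters', palette='coolwarm')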

## Part 3: DBSCAN

DBSCAN is a density-based clustering algorithm.
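
To get a feeling for what "density-based" means before working with the real data, here is a minimal sketch on a tiny, made-up array. The names `points` and `toy_dbscan` and the parameter values are assumptions chosen only for illustration. The sketch shows that points in dense groups receive a cluster label, while an isolated point is marked as noise with the label `-1`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups and one isolated point
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # first dense group
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # second dense group
    [10.0, 0.0],                          # far away from everything else
])

# eps is the neighborhood radius, min_samples the number of points
# (including the point itself) needed to form a dense region.
# Both values are illustrative, not recommendations.
toy_dbscan = DBSCAN(eps=0.5, min_samples=2).fit(points)

# One label per point; -1 marks points that DBSCAN treats as noise
print(toy_dbscan.labels_)
```

Unlike k-means, the number of clusters is not passed in; it follows from the chosen neighborhood radius and the minimum number of points.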

We'll start by generating a dataset with the code below:

In [None]:
from sklearn.datasets import make_moons

# Generate random data with two moon-shaped clusters
X, y = make_moons(n_samples=1000, noise=0.05, random_state=0)

### Task 3.1
Run k-means on the data `X`.

What do you notice?

In [None]:
# write your code here

### Task 3.2
Define an $\varepsilon$ (`eps`) and `min_samples`, then run the *DBSCAN* algorithm on the dataset above. You can import it with

`from sklearn.cluster import DBSCAN`.

*Note: Plotting the result is certainly helpful here!*

In [2]:
# write your code here