<a href="https://colab.research.google.com/github/kalue23/Exercises-Uni/blob/main/05_homework_EN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework for Session 5

The tasks presented here relate to the contents of the notebook [Unsupervised Learning](https://colab.research.google.com/drive/1wWG76ET42fxPAMAQW1Cjk2OTIcSwl7Mw?usp=sharing).

Remember: **To keep any changes you make to the notebook, you must first save a copy for yourself. To do this, go to `File` → `Save a copy in Drive`.** If you are using a local Python installation, download the notebook before making any changes.

## Exercise 1
In this exercise, we want to explore the farm data a bit more closely using Principal Component Analysis (PCA).
* How much variance is explained by the first component, and how much by the second component? Consequently, how much variance is **not** explained by the first two principal components? What can you conclude from this about how successfully PCA was able to shift the variance into the first two principal components?
* Examine which features are particularly important for the first two principal components. Filter for features whose importance is >0.1 or < -0.1, as otherwise there will be too many features. Display the importance of the features for the first and second principal components (PC1 and PC2) each in a bar chart.
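
As a rough illustration of the quantities asked for above, the following sketch runs a PCA on a small synthetic table rather than on the farm data. The toy data, the column names `a` to `e`, and the variable names are assumptions made for this example. It also assumes that the "importance" of a feature corresponds to its entry in `components_` (often called a loading).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 200 rows, 5 features, two of them strongly correlated
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
toy["b"] = toy["a"] + 0.1 * rng.normal(size=200)

toy_scaled = StandardScaler().fit_transform(toy)
pca_toy = PCA(n_components=2).fit(toy_scaled)

# Share of the total variance captured by each component
explained = pd.DataFrame({
    "component": ["PC1", "PC2"],
    "explained_variance": pca_toy.explained_variance_ratio_,
})
unexplained = 1 - explained["explained_variance"].sum()

# Loadings: one row per feature, one column per component
loadings = pd.DataFrame(
    pca_toy.components_.T, index=toy.columns, columns=["PC1", "PC2"]
)

# Keep only features whose PC1 loading exceeds 0.1 in absolute value
mask = loadings["PC1"].abs() > 0.1

print(explained)
print(f"not explained by PC1 and PC2: {unexplained:.2f}")
print(loadings[mask])
```

The same pattern should carry over to `pca_farm` and `farm_scaled`, with the filtered loadings passed to a bar chart instead of being printed.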

In [None]:
# we load the farm dataset ...
import pandas as pd
url = "https://drive.google.com/uc?id=1mNO7yf89ReYPvjJgfD3YdY5MdIVGvCtp"
farm = pd.read_csv(url)

# ... and scale it as in the course
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(farm)
farm_scaled = scaler.transform(farm)
farm_scaled = pd.DataFrame(farm_scaled, columns=farm.columns)

farm.head()

In [None]:
from sklearn.decomposition import PCA

pca_farm = PCA(n_components=2)
pca_farm.fit(farm_scaled)
farm_pca = pca_farm.transform(farm_scaled)

In [None]:
explained_variance = pd.DataFrame()
# Your code here

In [None]:
feature_importance = pd.DataFrame()
# Your code here

In [None]:
import seaborn as sns

mask = # Your code here
filtered = feature_importance[mask]
sns.barplot(# Your code here)

<font color='purple'>**Optional (Advanced)**</font>  

<font color='purple'>To determine the optimal number of components for a PCA, we can use what is called a "scree plot". A scree plot looks as follows:</font>

<img src="https://www.statology.org/wp-content/uploads/2021/09/scree_plot_python.png" alt="scree" width="400"/>

<font color='purple'>On the y axis is the explained variance of every <i>additional</i> component. On the x axis is the number of components. The optimal number of components for a PCA is where the line of the scree plot "bends", i.e. where additional components do not contribute much to the overall explained variance.</font>

<font color='purple'>Create such a scree plot to determine the optimal number of components for the farm data. For that, do the following:</font>

<ol class="outside">   
<font color='purple'><li>Create two empty lists <i>comp</i> and <i>var</i>.</li>
<li>Loop over range(1, 21) for components 1 to 20.</li>
<li>Initialize and fit the PCA like above in each iteration of the loop.</li>
<li>Store the number of components and explained variance for the last component in lists <i>comp</i> and <i>var</i> in each iteration. <b>Note</b>: You can use the <i>append</i> function. To obtain the explained variance of the additional component use <i>pca.explained_variance_ratio_[-1]</i>.</li>
<li>Call the matplotlib function <i>plt.plot(comp, var, 'o-')</i> like so. Don't forget to <i>import matplotlib.pyplot as plt</i> first.</li></font>
</ol>
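
The steps above can be sketched on synthetic data as follows. The toy matrix (10 features driven by 3 hidden factors) and the loop range `range(1, 11)` are assumptions for this example. For the farm data, the range and the input would follow the list above.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 10 features generated from 3 hidden factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
toy = latent @ rng.normal(size=(3, 10)) + 0.3 * rng.normal(size=(300, 10))

comp = []
var = []
for n in range(1, 11):
    pca = PCA(n_components=n)
    pca.fit(toy)
    comp.append(n)
    # explained variance contributed by the n-th (last) component
    var.append(pca.explained_variance_ratio_[-1])

plt.plot(comp, var, "o-")
plt.xlabel("number of components")
plt.ylabel("explained variance of the added component")
```

With data built this way, the curve would be expected to flatten after the third component, which is the "bend" described above.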

## Exercise 2
In this exercise, we want to systematically investigate which number of clusters best fits the farm data. For this, use the data projected onto the first two principal components `farm_pca` from Exercise 1.

* Use the K-Means algorithm and perform clustering for all numbers of clusters between `n_clusters=2` and `n_clusters=20`. Calculate the silhouette coefficient for each clustering and save the silhouette coefficient and the number of clusters in separate lists. **Note:** use a `for` loop. You can proceed very similarly to how we systematically evaluated classification performance in Course 4.
* Plot the silhouette coefficient values against the number of clusters in a line chart. Which silhouette coefficient is the highest? Based on this, what number of clusters is best to divide the existing dataset into clusters?
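
A minimal sketch of such a loop, run on synthetic data instead of `farm_pca`, might look as follows. The toy data from `make_blobs`, the chosen centers, the range of `k`, and the settings `n_init=10` and `random_state=0` are assumptions for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical toy data: four well-separated groups in two dimensions
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.8, random_state=0)

silhouette_coefficients = []
number_clusters = []
for k in range(2, 9):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)
    silhouette_coefficients.append(silhouette_score(X, labels))
    number_clusters.append(k)

# number of clusters with the highest silhouette coefficient
best_index = silhouette_coefficients.index(max(silhouette_coefficients))
best_k = number_clusters[best_index]
print(best_k)
```

The two lists can then be passed to a line plot, with the number of clusters on the x axis and the silhouette coefficient on the y axis.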

In [None]:
from sklearn.cluster import KMeans
# Your code here

In [None]:
from sklearn.metrics import silhouette_score
# assumes `cluster` holds the K-Means cluster labels computed in the cell above
sil_score_3 = silhouette_score(farm_pca, cluster)

In [None]:
from sklearn.metrics import silhouette_score

# lists for saving the results
silhouette_coefficients = []
number_clusters = []

# Your code here

In [None]:
sns.lineplot(# Your code here)

<font color='purple'>**Optional (Advanced)**</font>  
<font color='purple'>Remember that the k-means algorithm is initialized randomly: it starts from random cluster means, assigns each point to the nearest mean, and then repeatedly updates the means and reassigns points to refine the clusters.

Repeat the above calculations for different random seeds of k-means. For that, simply add a parameter `random_state` to `KMeans` and put an integer of your choice (e.g., `random_state=15`). Repeat the calculations for three seeds of your choice. How much do the results for the silhouette coefficient differ? Do you always get the same optimal number of clusters?</font>
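
One possible way to organize the comparison across seeds is sketched below on synthetic data. The helper `silhouette_curve`, the toy data, and the seeds are hypothetical choices for this example. `n_init=1` is set here only so that each run uses a single random start, which makes the effect of the seed easier to see.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical toy data with overlapping groups, so that initialization matters
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.5, random_state=1)
ks = list(range(2, 8))

def silhouette_curve(data, seed):
    """Silhouette coefficient for each k in ks, using one k-means run per k."""
    scores = []
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(data)
        scores.append(silhouette_score(data, labels))
    return scores

# one silhouette curve per seed
results = {seed: silhouette_curve(X, seed) for seed in [0, 15, 42]}
for seed, scores in results.items():
    best_k = ks[scores.index(max(scores))]
    print(f"seed={seed}: best k={best_k}, silhouette={max(scores):.3f}")
```

Running the helper twice with the same seed should reproduce the same curve, which is the purpose of fixing `random_state`.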

## Exercise 3

In the [lecture](https://janalasser.at/lectures/MD_KI/VO3_4_algorithms_unsupervised_learning/#/4/4/5), we discussed that the K-Means algorithm does not work well when clusters are not "round" (i.e., do not have a convex shape). In such cases, density-based clustering can be an alternative.

Below, a synthetic example dataset is generated to illustrate this problem:

In [None]:
# generating sample data with two interleaving "half moons"
from sklearn import datasets
moons = datasets.make_moons(n_samples=200, noise=0.05, random_state=42)[0]
moons = pd.DataFrame(moons, columns=["x", "y"])
sns.scatterplot(moons, x="x", y="y");

We attempt to find clusters using the K-Means algorithm. Obviously, there are two groups of observations in the dataset, so we choose `n_clusters=2`.

In [None]:
# Clustering with K-Means algorithm and n_clusters=2
kmeans_moons_2 = KMeans(n_clusters=2, random_state=42)
kmeans_moons_2.fit(moons)
cluster_moons_2 = kmeans_moons_2.predict(moons)

# Creating a DataFrame for visualization
plot_data_moons = moons.copy()
plot_data_moons["cluster_2"] = cluster_moons_2

# Visualizing the clustering with a scatter plot
sns.scatterplot(plot_data_moons, x="x", y="y", hue="cluster_2");

In [None]:
from sklearn.metrics import silhouette_score
silhouette_score(moons, cluster_moons_2)

Although the silhouette coefficient is not bad, the clustering clearly does not match the structure of the data.

To improve the situation, we apply density-based clustering. Scikit-learn provides the [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) algorithm for this. Unlike K-Means, DBSCAN does **not** require us to specify the number of clusters in advance: it determines the number of clusters from the data. Its hyperparameters (`eps` and `min_samples`) have default values, which we use here.

* Perform clustering on the `moons` data using `DBSCAN`, predict the cluster membership of each data point, and visualize the data with a scatter plot as in the course. How many clusters does DBSCAN find? **Note:** `DBSCAN` does not have separate `fit()` and `predict()` functions, but a combined `fit_predict()` function that takes the data as input and returns the clustering.
* Scale the data using `StandardScaler()` as in the lecture and perform clustering again with `DBSCAN`. How many clusters are found now? Does the clustering meet your expectations? What is the silhouette coefficient of the resulting clustering?
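
To illustrate the `fit_predict()` call and how to count the resulting clusters, here is a small sketch on a different synthetic dataset (two concentric circles). The dataset and the values `eps=0.15` and `min_samples=5` are assumptions chosen for this toy example and are not recommendations for the `moons` data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles

# Hypothetical toy data: two concentric circles
X, _ = make_circles(n_samples=300, factor=0.4, noise=0.03, random_state=0)

# eps and min_samples are illustrative choices for this toy data
labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

# DBSCAN labels noise points as -1, so that label is excluded when counting clusters
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```

Because `eps` is a distance measured in the units of the data, rescaling the features can change which points count as neighbors, and with that the clustering.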

In [None]:
from sklearn.cluster import DBSCAN
dbscan = DBSCAN()
# Your code here

In [None]:
from sklearn.preprocessing import StandardScaler
# Your code here

In [None]:
from sklearn.metrics import silhouette_score
# Your code here

<font color='purple'>**Optional (Advanced)**</font>  
<font color='purple'>The silhouette coefficient is a statistical indicator that measures how "good" a clustering is. What disadvantages could there be when measuring the quality of a clustering purely with a statistical indicator? Can you think of other ways to evaluate how "good" or "true" the classes are that you found through clustering?</font>

## Source and License

This notebook was created by Jana Lasser for Course "B1 - Technical Aspects" of the Microcredential "AI and Society" at the University of Graz.

The notebook may be used, modified, and redistributed under the terms of the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0) license.

This notebook was translated from German using GPT-5 and cross-checked by Alina Herderich.