# Unsupervised learning
Unsupervised learning is a type of machine learning that draws inferences from datasets consisting of input data without labeled responses [wiki].

### Clustering

Clustering is the task of gathering samples into groups of similar samples according to some predefined similarity or distance (dissimilarity) measure, such as the Euclidean distance.
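
To make the notion of a distance measure concrete, here is a minimal sketch that computes pairwise Euclidean distances with NumPy. The `samples` array holds made-up values chosen only for illustration; it is not part of the datasets used in this notebook.

```python
import numpy as np

# Three hypothetical samples with two features each (made-up values).
samples = np.array([[0.0, 0.0],
                    [3.0, 4.0],
                    [6.0, 8.0]])

# Pairwise differences via broadcasting: diff[i, j] = samples[i] - samples[j].
diff = samples[:, None, :] - samples[None, :, :]

# Euclidean distance: square root of the summed squared differences.
distances = np.sqrt((diff ** 2).sum(axis=-1))
print(distances)
```

Samples with a small distance between them are candidates for the same cluster; a different distance measure would generally lead to a different grouping.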

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
matplotlib.rcParams['figure.figsize'] = [10, 8]

In [None]:
df_un = pd.read_csv('data/country_total.csv')
# Show the first 5 rows
df_un.head()

### Plot unemployment rate (UR) in Austria over time

In [None]:
austria = df_un[df_un['country']=='at']
plt.scatter(austria['month'], austria['unemployment_rate'])

### Cluster countries according to their UR

- K-means is one of the simplest clustering algorithms. It is an iterative algorithm that searches for cluster centers such that the total squared distance from each point to its assigned cluster center is minimized.
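
The iteration can be sketched in plain NumPy as alternating assignment and update steps. This is an illustrative sketch only, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy `points` array are all hypothetical.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=10, seed=0):
    """Illustrative K-means iteration (hypothetical helper, not sklearn's code)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (a center with no points stays where it is).
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    # Final assignment so the labels match the returned centers.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)

# Made-up points forming two well-separated groups.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, sketch_labels = kmeans_sketch(points, k=2)
print(centers)
print(sketch_labels)
```

In practice `KMeans` from scikit-learn is the better choice, since it adds smarter initialization and a convergence check that this sketch leaves out.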

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42)

### Prepare data, run clustering and get the labels

In [None]:
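from sklearn.preprocessing import StandardScaler
# Keep the two numeric features and drop rows with missing values
X = df_un[['unemployment_rate','unemployment']].dropna()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit K-means on the standardized features; one cluster label per row
labels = kmeans.fit_predict(X_scaled)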

### Plot the data, colored by cluster label

In [None]:
plt.scatter(X['unemployment_rate'], X['unemployment'], c=labels)

### Countries

In [None]:
df_countries = pd.read_csv('data/countries.csv')
df_countries.head()

In [None]:
plt.scatter(df_countries['latitude'], df_countries['longitude'])

### _Exercise: Get dominant cluster label for each country_
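
One possible approach, shown here on a small made-up table rather than the real data: group the rows by country and take the most frequent cluster label in each group. The `df_demo` frame and its `label` column are hypothetical stand-ins; with the real data, the cluster labels would first need to be attached to the rows they were computed for.

```python
import pandas as pd

# Hypothetical rows: one cluster label per (country, observation).
df_demo = pd.DataFrame({
    "country": ["at", "at", "at", "be", "be", "be"],
    "label":   [0,    0,    1,    2,    2,    2],
})

# Most frequent label per country; on ties, mode() is sorted,
# so the smallest label is picked.
dominant = df_demo.groupby("country")["label"].agg(lambda s: s.mode().iloc[0])
print(dominant)
```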

### DBSCAN algorithm

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X = df_countries[['latitude','longitude']].to_numpy()
X_scaled = scaler.fit_transform(X)
# cluster the data; DBSCAN takes no cluster count, the number of clusters
# follows from eps and min_samples (points labeled -1 are noise)
dbscan = DBSCAN(eps=0.55, min_samples=2)
clusters = dbscan.fit_predict(X_scaled)
# plot the cluster assignments
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap="plasma")
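
Unlike K-means, DBSCAN is not given a number of clusters, and it may mark some points as noise with the label `-1`. The sketch below shows how the labels could be inspected; the `points` array and the `eps` value are made up for illustration and are unrelated to the country data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up points: two tight groups plus one isolated point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
                   [20.0, 20.0]])

demo_labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)

# Noise points carry the label -1, so exclude it when counting clusters.
n_clusters = len(set(demo_labels) - {-1})
n_noise = int((demo_labels == -1).sum())
print(n_clusters, n_noise)
```

Running the same counts on `clusters` from the cell above is a quick way to see how the choice of `eps` and `min_samples` changes the result.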

In [None]:
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()