In [None]:
%run ../common-imports.ipynb

# Old Faithful geyser: DBSCAN



In [None]:
source = '../datasets/old-faithful-geyser.csv'
raw_data = pd.read_csv(source, sep='\t')
scaler = StandardScaler()
scaled = scaler.fit_transform(raw_data)
data = pd.DataFrame(data={'eruptions':scaled[:, 0], 'waiting': scaled[:,1]})
data.describe(include="all").transpose().style.set_table_styles(sv_table_styles())

**Sample rows**

## Data Visualization

To refresh our recollection of the key data exploration findings, let us revisit some of the plots of the dataset.

Let us first visualize a scatter plot of the data and look at the histograms of the features. Finally, let us pull all of this together with a kernel density plot, into a single plot with subplots.

In [None]:
plt.scatter(data['eruptions'], data['waiting'], alpha=0.5, s=250, color='salmon')
plt.title(r'\textbf{Scatter-plot of $waiting$ vs $eruptions$}')
plt.xlabel(r'eruptions $\longrightarrow$');
plt.ylabel(r'waiting $\longrightarrow$');
plt.show();
data.hist(bins=50, alpha = 0.4, color='salmon', xrot=90, figsize=(10,5));
plt.tight_layout()

In [None]:
sns.set_palette("Reds")

sample = data
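# Pair plot: scatter plots off the diagonal, kernel density estimates on it
g = sns.pairplot(sample, diag_kind='kde',
                 plot_kws={'alpha': 0.20, 's': 80, 'edgecolor': 'k', 'color': 'salmon'},
                 height=5);  # height of each facet, in inches
g.map_diag(sns.kdeplot, color='salmon', fill=True);  # filled density curves on the diagonal
g.map_upper(plt.scatter, color='salmon', alpha=0.5);
g.map_lower(sns.kdeplot, fill=False, cbar=True);  # unfilled density contours below the diagonal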

plt.tight_layout()

## DBSCAN

Let us now cluster this data for various values of the hyperparameters $\epsilon$ and $nPts$: the neighborhood radius, and the number of neighbors needed for a point to qualify as an interior (core) point.
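To make the role of the two hyperparameters concrete, here is a minimal from-scratch sketch of the DBSCAN idea in plain Python. It is an illustration only, not the code behind `dbscan_common.ipynb`: the function names `region_query` and `dbscan` and the toy points are assumptions made for this example. The sketch also assumes that a point counts as its own neighbor, which is the convention scikit-learn uses for `min_samples`.

```python
import math

NOISE = -1  # label used here for outliers

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, n_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already assigned
        neighbors = region_query(points, i, eps)
        if len(neighbors) < n_pts:
            labels[i] = NOISE             # provisional: may become a border point later
            continue
        cluster += 1                      # i is an interior (core) point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= n_pts:
                queue.extend(j_neighbors) # j is also a core point: keep expanding
    return labels

# Toy data: two tight groups and one isolated point.
points = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1),
          (5, 5), (5, 5.1), (5.1, 5), (5.1, 5.1),
          (10, 10)]
labels = dbscan(points, eps=0.2, n_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

A smaller $\epsilon$ or a larger $nPts$ makes it harder for a point to qualify as an interior point, so more points end up labeled as outliers.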

In [None]:
%run dbscan_common.ipynb

In [None]:
# Different values of the hyperparameters
epsilons = [0.1, 0.2, 0.4, 0.6]
neighbors = [3, 5, 7]

quality = dbscan_cluster(epsilons=epsilons, neighbors=neighbors, data=data)

In the above plots, the outliers are marked in grey.

We observe that the clustering is markedly different for different values of the hyperparameters. **Which of these clusters would you consider optimal?**
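The helper `dbscan_cluster` is defined in `dbscan_common.ipynb` and is not shown here. As a rough idea of what such a sweep could look like, the sketch below runs scikit-learn's `DBSCAN` over the same grid of hyperparameters and tabulates the outcome. The function name `sweep_dbscan`, the column names, the choice to exclude outliers from the silhouette score, and the synthetic two-blob data are all assumptions for illustration, not the notebook's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def sweep_dbscan(X, epsilons, neighbors):
    """Hypothetical helper: run DBSCAN for each (eps, min_samples) pair and tabulate results."""
    rows = []
    for eps in epsilons:
        for n_pts in neighbors:
            labels = DBSCAN(eps=eps, min_samples=n_pts).fit_predict(X)
            mask = labels != -1                    # scikit-learn marks outliers with -1
            n_clusters = len(set(labels[mask]))
            # The silhouette score is only defined for two or more clusters.
            score = (silhouette_score(X[mask], labels[mask])
                     if n_clusters >= 2 else float("nan"))
            rows.append({"epsilon": eps, "nPts": n_pts,
                         "clusters": n_clusters,
                         "outliers": int((~mask).sum()),
                         "silhouette score": score})
    return pd.DataFrame(rows)

# Synthetic stand-in for the standardized geyser data: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.2, size=(100, 2)),
               rng.normal(+1.0, 0.2, size=(100, 2))])
table = sweep_dbscan(X, epsilons=[0.1, 0.2, 0.4, 0.6], neighbors=[3, 5, 7])
print(table)
```

Comparing the `clusters` and `outliers` columns across rows gives a quick numerical counterpart to the visual impression from the plot-grid.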

### Clustering quality metrics
Let us now observe the clustering quality metrics, and see how they agree with your intuition.
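As a reminder of what the silhouette score measures, the following plain-Python sketch computes it by hand: for each point, $a$ is the mean distance to the other members of its own cluster, $b$ is the mean distance to the nearest other cluster, and the point's silhouette is $(b - a)/\max(a, b)$. The function name `silhouette` and the toy data are assumptions for this example, and skipping outliers (label `-1`) is a choice made here; how the notebook's helper treats outliers is not shown.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all non-outlier points (label -1 is skipped)."""
    clusters = {}
    for p, lab in zip(points, labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")
    scores = []
    for lab, members in clusters.items():
        for p in members:
            if len(members) == 1:
                scores.append(0.0)        # common convention for single-point clusters
                continue
            # a: mean distance from p to the other members of its own cluster
            a = sum(math.dist(p, q) for q in members) / (len(members) - 1)
            # b: mean distance from p to the members of the nearest other cluster
            b = min(sum(math.dist(p, q) for q in other) / len(other)
                    for other_lab, other in clusters.items() if other_lab != lab)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data: two tight groups and one outlier.
points = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1),
          (5, 5), (5, 5.1), (5.1, 5), (5.1, 5.1),
          (10, 10)]
good = [0, 0, 0, 0, 1, 1, 1, 1, -1]   # clusters match the two groups
bad = [0, 1, 0, 1, 0, 1, 0, 1, -1]    # clusters cut across the two groups
score = silhouette(points, good)
print(score, silhouette(points, bad))
```

A score close to 1 indicates compact, well-separated clusters, while a labeling that cuts across the natural groups scores much lower.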

In [None]:
quality.style.highlight_max(color = 'lightgreen', axis = 0, subset =['silhouette score'])

### Conclusion

Inspecting the clustering plots and the silhouette score table, we see that the best clustering is obtained for the hyperparameter values:
* $\epsilon = 0.4, nPts = 3$
* $\epsilon = 0.4, nPts = 5$


The optimal number of clusters is therefore $k=2$, i.e. two clusters, as we saw visually in the plot-grid of the previous section.