# Protein Consumption - Hierarchical Clustering
- Author: Oliver Mueller
- Last update: 26.01.2024

## Initialize notebook
Load the required packages and set up the workspace, e.g., set the plotting theme. Hierarchical clustering is deterministic, so no random seed is needed.

In [None]:
# Suppress all warnings to keep the notebook output readable
import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster


In [None]:
plt.style.use('fivethirtyeight')

## Problem description

We have data about the protein consumption in twenty-five European countries for nine food groups. We want to find out whether there are any groups of countries with similar protein consumption patterns. 

## Load data

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/olivermueller/vhbprodok_datascience/main/protein/data/protein.txt', sep='\t')

In [None]:
data.head()

## Prepare data

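Drop the categorical variable `Country` and standardize the remaining columns so that each food group contributes on the same scale to the distance computation.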

In [None]:
data_prep = data.drop(['Country'], axis=1) 

In [None]:
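# Standardize every column to zero mean and unit variance
scaler = StandardScaler()
feature_names = data_prep.columns  # keep the column names; fit_transform returns a NumPy array
data_prep = pd.DataFrame(scaler.fit_transform(data_prep), columns=feature_names)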

In [None]:
data_prep.head()
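
The scaling step can be checked by hand. The following is a minimal sketch on a hypothetical toy array (the values are invented and are not taken from the protein data). It assumes only that `StandardScaler` applies a column-wise z-score using the population standard deviation, and compares the result with a manual computation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

# Manual z-score: subtract the column mean, divide by the
# population standard deviation (ddof=0, NumPy's default)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# The same transformation via scikit-learn
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Without this step, a feature with a large numeric range would dominate the Euclidean distances used for clustering.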

## Hierarchical clustering

In [None]:
hclust = linkage(data_prep, method="complete", metric="euclidean")
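
To see what `linkage` returns, here is a small sketch on hypothetical one-dimensional points (invented for illustration, not part of the protein data). Each row of the result describes one merge: the two cluster indices that were joined, the distance at which they were joined, and the size of the new cluster. With complete linkage, the distance between two clusters is the largest pairwise distance between their members.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy points: two pairs that are far apart
points = np.array([[0.0], [1.0], [5.0], [7.0]])

# Complete linkage: cluster distance = maximum pairwise distance
Z = linkage(points, method="complete", metric="euclidean")
print(Z)

# Single linkage for comparison: cluster distance = minimum pairwise distance
Z_single = linkage(points, method="single", metric="euclidean")
print(Z_single[-1, 2])
```

In this toy case the final merge happens at distance 7 under complete linkage (between the points 0 and 7) but at distance 4 under single linkage (between the points 1 and 5), which shows how the linkage method changes the merge heights in the dendrogram.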

In [None]:
# Plot the merge tree with country names as leaf labels
dendrogram(hclust, labels=data['Country'].values, leaf_rotation=90, leaf_font_size=10)
plt.show()

In [None]:
# Cut the tree so that at most 5 flat clusters are formed
cluster_membership = fcluster(hclust, t=5, criterion='maxclust')

In [None]:
cluster_membership
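
The meaning of `t` in `fcluster` depends on `criterion`. The sketch below uses hypothetical toy points (invented for illustration) to contrast `'maxclust'`, where `t` is an upper bound on the number of clusters, with `'distance'`, where `t` is the height at which the tree is cut.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy points: two tight pairs and one outlier
points = np.array([[0.0], [1.0], [5.0], [7.0], [20.0]])
Z = linkage(points, method="complete", metric="euclidean")

# 'maxclust': request at most 2 flat clusters
labels_k = fcluster(Z, t=2, criterion="maxclust")

# 'distance': cut the tree at height 3, keeping merges at or below that height
labels_d = fcluster(Z, t=3.0, criterion="distance")

print(labels_k)
print(labels_d)
```

Here the `'maxclust'` cut separates the outlier from the other four points, while the cut at height 3 keeps the two pairs apart and therefore yields three clusters.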

In [None]:
data['Cluster'] = cluster_membership
data = data.sort_values('Cluster')

In [None]:
data
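
One possible way to read the resulting table is to average each food group within a cluster. The following is a sketch on a hypothetical stand-in table; the country labels, column names, and values are invented and only mimic the shape of `data` after the `Cluster` column has been added.

```python
import pandas as pd

# Hypothetical stand-in for the clustered table (invented values)
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "RedMeat": [10.0, 12.0, 6.0, 8.0],
    "Fish": [2.0, 4.0, 9.0, 7.0],
    "Cluster": [1, 1, 2, 2],
})

# Mean consumption per food group within each cluster
profile = toy.drop(columns="Country").groupby("Cluster").mean()

# Number of countries per cluster
sizes = toy["Cluster"].value_counts().sort_index()

print(profile)
print(sizes)
```

Applied to the real data, the same pattern would show which food groups distinguish one group of countries from another.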