
# 🐊 Hierarchical clustering and PCA of zoo animals

This notebook contains a hierarchical cluster analysis and principal component analysis (PCA) of the Zoo animal data set provided by UCI MACHINE LEARNING that can be found [here](https://www.kaggle.com/datasets/uciml/zoo-animal-classification). I hope you will enjoy it!

## Table of contents
1. [**Setup**](#1)

1. [**Data preparation**](#2)

1. [**Hierarchical clustering**](#3)
    
1. [**Principal component analysis (PCA)**](#4)
    
1. [**Conclusion**](#5)

**Goal of the analysis:**
We will try to predict the class of each animal based on the provided features using hierarchical clustering and PCA.

<a id="1"></a>
## ⚙️ Setup

In [None]:
suppressPackageStartupMessages(library(tidyverse)) # metapackage of all tidyverse packages
suppressPackageStartupMessages(library(visdat)) # visualize aspects of a data frame (here: NAs)
install.packages('factoextra') # for visualizing clustering results
suppressPackageStartupMessages(library(factoextra))  # for visualizing clustering results

<a id="2"></a>
## 💾  Data preparation
This is a data set containing information about 101 zoo animals. In total, there are 16 variables that describe the traits of each animal. Furthermore, we know that each animal belongs to one of 7 classes: mammal, bird, reptile, fish, amphibian, bug and invertebrate.

In [None]:
# Load data
class <- read.csv("../input/zoo-animal-classification/class.csv")
Zoo <- read.csv("../input/zoo-animal-classification/zoo.csv")

Quick check of data structure and details.

In [None]:
head(Zoo)
class

### Data cleaning

The data is quite messy and contains errors. We will have to go over every column to check if it contains correct values. Here are the main issues:

1. With a basic understanding of biology we can quickly see that the proposed classification of "Bug" and "Invertebrate" is problematic. The class "Bug" contains terrestrial invertebrates, while the class "Invertebrate" contains aquatic invertebrates as well as terrestrial invertebrates that are not bugs (slug, worm, scorpion). I have run this analysis with the default classes and the faulty assignment causes problems. Therefore, I will fix this before we proceed by replacing "Bug" and "Invertebrate" with the new categories "terrestrial_invertebrate" and "aquatic_invertebrate".

2. Further manual corrections include:
    * The clam is incorrectly labelled as not aquatic.
    * The seasnake is incorrectly labelled as not breathing.
    * The platypus actually produces venom.
    * Seals do have legs.
    * Aardvark, bear and seal have a tail.
    
3. The fish "tail" is actually a caudal fin. We will set "tail" to 0 for fish and record this trait in a new variable, "caudal_fin". This turns out to be very helpful, as you will see.

4. The "legs" variable has several mistakes:
   * a wallaby has 4 legs, not 2
   * a sealion has 4 legs, not 2
   * crabs have 10 legs, not 4
   * crayfish have 10 legs, not 6
   * lobsters have 10 legs, not 6
   
5. We will drop the variables "domestic" and "catsize" since they introduce unnecessary noise.

In [None]:
# drop columns and fix individual entries
Zoo <- Zoo %>%
select(-c("catsize", "domestic"))%>%
mutate(aquatic = ifelse(animal_name == "clam",1,aquatic),# fix clam entry
       breathes = ifelse(animal_name == "seasnake",1,breathes), #fix seasnake entry
       venomous = ifelse(animal_name == "platypus",1,venomous), # fix platypus entry
       legs = ifelse(animal_name == "seal",4,legs), # fix seal entry
      )%>%

# add information about class type
mutate(type = case_when(
class_type == 1 ~ "Mammal",
class_type == 2 ~ "Bird",
class_type == 3 ~ "Reptile",
class_type == 4 ~ "Fish",
class_type == 5 ~ "Amphibian",
class_type == 6 ~ "Bug",    
class_type == 7 ~ "Invertebrate", 
)) %>%

# Fix invertebrate assignment
mutate(type = case_when(
    animal_name == "flea"~ "terrestrial_invertebrate",
    animal_name =="gnat"~ "terrestrial_invertebrate",
    animal_name =="honeybee"~ "terrestrial_invertebrate",
    animal_name =="housefly"~ "terrestrial_invertebrate",
    animal_name =="ladybird"~ "terrestrial_invertebrate",
    animal_name =="moth"~ "terrestrial_invertebrate",
    animal_name =="termite"~ "terrestrial_invertebrate",
    animal_name =="wasp"~ "terrestrial_invertebrate",
    animal_name =="scorpion"~ "terrestrial_invertebrate",
    animal_name == "slug"~ "terrestrial_invertebrate",
    animal_name =="worm" ~ "terrestrial_invertebrate",
    animal_name == "clam"~ "aquatic_invertebrate",
    animal_name =="crab"~ "aquatic_invertebrate",
    animal_name =="crayfish"~ "aquatic_invertebrate",
    animal_name =="lobster"~ "aquatic_invertebrate",
    animal_name =="octopus"~ "aquatic_invertebrate",
    animal_name =="seawasp"~ "aquatic_invertebrate",
    animal_name =="starfish"~ "aquatic_invertebrate",
     TRUE ~ as.character(type)
), class_type = case_when(
    type == "terrestrial_invertebrate" ~ 6,
    type == "aquatic_invertebrate" ~ 7,
    TRUE ~ as.numeric(class_type)))%>%

    # Fix tail variable
mutate(
    tail = case_when(
    type == "Fish" ~ 0,
    animal_name == "aardvark" ~ 1, # name must match the lowercase spelling used in the data
    animal_name == "bear"~ 1,
    animal_name == "seal"~ 1,
        TRUE ~ as.numeric(tail)),
    # introduce caudal fin variable
caudal_fin = case_when(
    type == "Fish" ~ 1,
    TRUE ~ as.numeric(0)),
    
 # Fix the "legs" variable
 legs= case_when(
    animal_name == "wallaby" ~ 4,
    animal_name == "sealion" ~ 4,
    animal_name == "crab" ~ 10,
    animal_name == "crayfish" ~ 10,
    animal_name == "lobster" ~ 10,
    TRUE ~ as.numeric(legs)))
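The long list of one-line-per-animal conditions above can be shortened by collecting the names in vectors and testing membership with `%in%`. The following is only a minimal base R sketch on a made-up toy data frame (`toy`, `terrestrial` and `aquatic_inv` are hypothetical names, not part of the notebook); the same idea should carry over to `case_when()`.

```r
# Toy stand-in for the Zoo data frame (values are illustrative only)
toy <- data.frame(
  animal_name = c("flea", "slug", "clam", "crab", "bear"),
  type = c("Bug", "Invertebrate", "Invertebrate", "Invertebrate", "Mammal"),
  stringsAsFactors = FALSE
)

# One name vector per new category instead of one condition per animal
terrestrial <- c("flea", "slug")
aquatic_inv <- c("clam", "crab")

# Reassign the type where the name matches, keep the old type otherwise
toy$type <- ifelse(
  toy$animal_name %in% terrestrial, "terrestrial_invertebrate",
  ifelse(toy$animal_name %in% aquatic_inv, "aquatic_invertebrate", toy$type)
)

toy
```

Keeping the names in vectors also makes it easier to review which animal ended up in which category.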


### Check for NAs and duplicate entries
Before we proceed, we will check if there are any missing values or duplicate entries.

In [None]:
# check for NAs in all columns
apply(Zoo, 2, function(x) any(is.na(x)))
      
# check for duplicates in "animal_name" 
Zoo%>%
group_by(animal_name) %>% 
  filter(n()>1)

The data frame is complete but there are two "frogs" in the data set. One is "venomous" (well, technically poisonous) while the other is not. We will fix this by appending a numerical index to both names, which gives us frog1 and frog2.

### Fix duplicates

Fix the duplicate entries by appending a numerical index.

In [None]:
Zoo <- Zoo%>%
group_by(animal_name)%>% # group by animal_name
mutate(animal_name = if(n() > 1) {paste0(animal_name, row_number())} else {paste0(animal_name)})%>% # names that occur more than once get a numerical index appended
ungroup()%>%
mutate(
    # Add combined label of type and animal_name (for plotting)
    animal = paste0(type,"_",animal_name))

In [None]:
# Drop the name, class and label columns and convert to matrix
ZooMatrix <- Zoo %>%
select(-c("animal_name","class_type","animal", "type"))%>%
as.matrix()

# Add animal names
rownames(ZooMatrix) <- Zoo$animal_name

head(ZooMatrix)

### Scale data
If the features are on different scales, it is recommended to scale them so that they all have the same mean and standard deviation. To do this, you subtract the mean of a feature from all of its observations and divide the result by the feature's standard deviation. The resulting standardized features have a mean of zero and a standard deviation of one.
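To make the standardization concrete, here is a small sketch on a made-up vector (`x` is hypothetical and only loosely resembles a "legs"-like feature). It computes the z-scores by hand and checks that they agree with base R's `scale()`.

```r
# Hypothetical feature values on a wider scale than the 0/1 traits
x <- c(0, 2, 4, 4, 10)

# Manual standardization: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() centers and scales by the sample standard deviation by default
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)  # both approaches should agree
mean(z_manual)                # approximately 0
sd(z_manual)                  # 1
```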

In [None]:
# Check if you need to scale the data
mean_sd <- cbind(as.data.frame(colMeans(ZooMatrix)),  as.data.frame(apply(ZooMatrix, 2, sd)))
colnames(mean_sd) <- c("Mean","SD")

mean_sd

We need to scale this data!

In [None]:
# scale data using the scale() function

ScaledZooMatrix <- scale(ZooMatrix)

# Check if scaling worked
mean_sd_norm <- cbind(as.data.frame(colMeans(ScaledZooMatrix)),  as.data.frame(apply(ScaledZooMatrix, 2, sd)))
colnames(mean_sd_norm) <- c("Mean_scaled","SD_scaled")

mean_sd_norm

Good! The columns have a mean of 0 (within floating point precision) and an SD of 1.

In [None]:
head(ScaledZooMatrix)

<a id="3"></a>
## 🧮 Hierarchical clustering

Four commonly used **linkage functions** for measuring the distance between two clusters are:
1. **"Single linkage clustering"**: Also known as nearest neighbor. Measures the distance between the closest elements of two clusters (method = "single").
2. **"Average linkage clustering"**: Calculates the average distance between all pairs of objects from the two clusters, which makes it less sensitive to outliers (method = "average").
3. **"Complete linkage clustering"**: Measures the distance between the farthest elements in two clusters (method = "complete").
4. **"Centroid linkage clustering"**: First the centers of clusters x and y are calculated, and then the distance between the two cluster centers is assessed (method = "centroid").

Usually the "complete" and "average" methods are the ones used because they tend to produce more balanced trees. However, if you want to detect outliers, you might want to go for a different method. Here we will try these two methods and compare the results.
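The four linkage definitions can be illustrated with two tiny one-dimensional clusters. This is just a hand calculation on made-up numbers (`a` and `b` are hypothetical clusters), not what `hclust()` does internally step by step.

```r
# Two hypothetical clusters of points on a line
a <- c(0, 1)
b <- c(4, 7)

# All pairwise distances between members of a and members of b
pairwise <- abs(outer(a, b, "-"))

linkage <- c(
  single   = min(pairwise),            # closest pair
  complete = max(pairwise),            # farthest pair
  average  = mean(pairwise),           # mean over all pairs
  centroid = abs(mean(a) - mean(b))    # distance between cluster centers
)

linkage
# single = 3, complete = 7, average = 5, centroid = 5
```

The same two clusters are 3 units apart under single linkage but 7 units apart under complete linkage, which is why the choice of method can change the shape of the tree.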

hclust() requires a dissimilarity (distance) matrix as input. We therefore pass our scaled data through the dist() function, which computes Euclidean distances by default.

In [None]:
Zoo_clust_average <- hclust(dist(ScaledZooMatrix), method = "average")
Zoo_clust_complete <- hclust(dist(ScaledZooMatrix), method = "complete")

Extract the known class ID of each animal from the Zoo data set. These serve as the expected clusters.

In [None]:
# Extract cluster IDs
expected_clusters  <- setNames(as.character(Zoo$class_type), Zoo$animal_name)

### Plot dendrograms
Plot the dendrograms for the two models with 7 clusters. Use the extracted IDs to color code each animal's expected cluster membership.

In [None]:
options(repr.plot.width = 10, repr.plot.height = 20) # set dimensions of plots

# Plot results for "average"
fviz_dend(Zoo_clust_average, k = 7, k_colors = c("#1B9E77", "#D95F02", "#7570B3", "#E7298A", "#66A61E", "#E6AB02", "#A6761D"),
         label_cols =  expected_clusters[Zoo_clust_average$order],rect = TRUE, horiz = TRUE, cex = 1.2, main = "Clustering using average linkage")

# Plot results for "complete"
fviz_dend(Zoo_clust_complete, k = 7, k_colors = c("#1B9E77", "#D95F02", "#7570B3", "#E7298A", "#66A61E", "#E6AB02", "#A6761D"),
         label_cols =  expected_clusters[Zoo_clust_complete$order], rect = TRUE, horiz = TRUE, cex = 1.3, main = "Clustering using complete linkage")

In [None]:
Animals <- Zoo %>%
group_by(type)%>%
count()%>%
arrange(desc(n))

Animals

Looking at the expected class membership in `Animals` we can see that both methods perform reasonably well. The main issue with the clustering appears to be due to the "venomous" attribute that some animals have. Both methods cluster the venomous animals together, although they belong to different classes.
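One way to put a number on "reasonably well" is to cut the tree into k groups with `cutree()` and cross-tabulate the result against the known classes. The sketch below uses the built-in `iris` data as a stand-in so that it runs on its own; with the objects from this notebook, the analogous call would presumably be `table(cutree(Zoo_clust_average, k = 7), Zoo$type)`.

```r
# Stand-in data: four numeric features and a known grouping (Species)
features <- scale(iris[, 1:4])

# Same recipe as above: distance matrix, then hierarchical clustering
tree <- hclust(dist(features), method = "average")

# Cut the dendrogram into as many clusters as there are known classes
clusters <- cutree(tree, k = 3)

# Rows are the found clusters, columns are the expected classes
agreement <- table(cluster = clusters, species = iris$Species)
agreement
```

A cluster that lines up well with one class shows a single large count in its row, while mixed rows point to the kind of confusion described above.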

<a id="4"></a>
## 🔍 Principal component analysis (PCA)

We will use principal component analysis (PCA) to see whether similar animals group together in a low-dimensional projection and compare the results to the hierarchical clustering approach.

In [None]:
# Create a matrix
ZooMatrix_PCA <- Zoo%>%
select(-c("animal_name","class_type", "type", "animal"))%>%
as.matrix()


# Add the combined type and name label as row names
rownames(ZooMatrix_PCA) <- Zoo$animal

In [None]:
# Perform PCA
pca <- prcomp(x = ZooMatrix_PCA, scale. = TRUE, center = TRUE) # scale. is the full argument name
summary(pca)

### Biplot
Plot loadings and mapping of observations for the first two PCs.

In [None]:
options(repr.plot.width = 15, repr.plot.height = 7.5) # set dimensions of plots
#fviz_pca_biplot(pca, jitter = list(what = "label", width = NULL, height = NULL))
set.seed(3)
fviz_pca_biplot(pca, repel = TRUE, habillage=Zoo$type,addEllipses=TRUE,ellipse.level=0.95, title = "PCA - Biplot")

Plotting PC1 vs PC2 shows that, as with the hierarchical clustering, we can separate several of the groups along PC1 and PC2. 

### Plot proportion of variance explained

In [None]:
# Visualize eigenvalues/variances
fviz_screeplot(pca, addlabels = TRUE, ylim = c(0, 35))
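The percentages in a scree plot can also be computed by hand from the standard deviations that `prcomp()` returns. Here is a minimal sketch on the built-in `USArrests` data (used only as a stand-in; `pca_toy` is a hypothetical name): each squared standard deviation is the variance of one PC, and dividing by the total gives the proportion explained.

```r
# Stand-in PCA on a built-in data set, centered and scaled as above
pca_toy <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variance of each principal component
eigenvalues <- pca_toy$sdev^2

# Proportion of the total variance explained by each PC
var_explained <- eigenvalues / sum(eigenvalues)

round(100 * var_explained, 1)  # percentages, as shown on a scree plot
cumsum(var_explained)          # cumulative proportion across PCs
```

Applied to the `pca` object of this notebook, the same two lines should reproduce the "Proportion of Variance" row of `summary(pca)`.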

In [None]:
# Contributions of features to PC1
fviz_contrib(pca, choice = "var", axes = 1, top = 10)
# Contributions of features to PC2
fviz_contrib(pca, choice = "var", axes = 2, top = 10)

The high-ranking features contributing to PC1 indicate whether an animal is a mammal and whether it is aquatic. For PC2 the top contributing features describe bird- and fish-specific attributes.
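As far as I understand, the contribution of a feature to a PC is its squared loading expressed as a percentage, so the contributions within one PC add up to 100. The sketch below checks this on the built-in `USArrests` data (a stand-in; `pca_toy` and `contrib` are hypothetical names). It is an assumption on my side that this matches exactly what `fviz_contrib()` plots.

```r
# Stand-in PCA on a built-in data set
pca_toy <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Loadings are the columns of the rotation matrix; each column has unit length,
# so the squared loadings of one PC sum to 1
contrib <- 100 * pca_toy$rotation^2

colSums(contrib)                           # each PC sums to 100
sort(contrib[, "PC1"], decreasing = TRUE)  # features ranked by contribution to PC1
```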

<a id="5"></a>
## 🏆 Conclusion
The goal of this analysis was to predict the class of each animal based on the provided features using hierarchical clustering and PCA. After a thorough cleanup of the initially messy data set we were able to separate the majority of animals into their expected clusters. However, this did not work for the venomous animals, which were clustered together despite belonging to different classes.

**Any feedback is much appreciated. Thank you!**