The purpose of this dataset is to predict each animal's class from its trait variables.

This dataset consists of 101 animals from a zoo.
There are 16 variables with various traits to describe the animals.
The 7 Class Types are: Mammal, Bird, Reptile, Fish, Amphibian, Bug and Invertebrate.

zoo.csv

Attribute Information: (name of attribute and type of value domain)

animal_name: Unique for each instance
hair: Boolean
feathers: Boolean
eggs: Boolean
milk: Boolean
airborne: Boolean
aquatic: Boolean
predator: Boolean
toothed: Boolean
backbone: Boolean
breathes: Boolean
venomous: Boolean
fins: Boolean
legs: Numeric (set of values: {0, 2, 4, 5, 6, 8})
tail: Boolean
domestic: Boolean
catsize: Boolean
class_type: Numeric (integer values in range [1, 7])
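
The value domains listed above can be checked mechanically once the file is loaded. The sketch below is one possible way to do that. It assumes the Boolean traits are stored as 0/1, runs on a three-row stand-in frame (made-up rows covering only a few columns, not the real zoo.csv), and the helper name `check_domains` is hypothetical.

```python
import pandas as pd

# Allowed leg counts, as listed in the attribute description.
ALLOWED_LEGS = {0, 2, 4, 5, 6, 8}

def check_domains(df, bool_cols, legs_col="legs", class_col="class_type"):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for col in bool_cols:
        # Assumption: Boolean traits are stored as 0/1.
        if not df[col].isin([0, 1]).all():
            problems.append(f"{col}: values outside {{0, 1}}")
    if not df[legs_col].isin(ALLOWED_LEGS).all():
        problems.append(f"{legs_col}: unexpected leg count")
    if not df[class_col].between(1, 7).all():
        problems.append(f"{class_col}: outside [1, 7]")
    return problems

# Stand-in rows for illustration only (not taken from zoo.csv).
toy = pd.DataFrame({
    "hair": [1, 0, 0],
    "feathers": [0, 1, 0],
    "legs": [4, 2, 3],        # 3 is deliberately invalid
    "class_type": [1, 2, 7],
})

print(check_domains(toy, bool_cols=["hair", "feathers"]))
```

In the notebook itself the same helper could be called on `zoo_df` with the full list of Boolean column names.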

class.csv

This CSV describes the classes used in the dataset.

Class_Number: Numeric (integer values in range [1, 7])
Number_Of_Animal_Species_In_Class: Numeric
Class_Type: Character (the word description of the class)
Animal_Names: Character (the list of animals that fall in the class)

In [None]:
import pandas as pd
import numpy as np
import zipfile

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

List the Files Inside the ZIP

In [None]:
zip_path = "Zoo Animal Dataset.zip"

# Open the ZIP file and check its contents

with zipfile.ZipFile(zip_path, "r") as z:
    print(z.namelist())  # Lists all files inside the ZIP

In [None]:
zoo_filename = "zoo.csv"  # Replace with actual file name
with zipfile.ZipFile(zip_path, "r") as z:
    with z.open(zoo_filename) as f:
        zoo_df = pd.read_csv(f)

# Display the first few rows
zoo_df.head()

In [None]:
class_filename = "class.csv"  # Replace with actual file name
with zipfile.ZipFile(zip_path, "r") as z:
    with z.open(class_filename) as f:
        class_df = pd.read_csv(f)

# Display the first few rows
class_df.head()

In [None]:
missing_values = zoo_df.isnull().sum()

print(missing_values)

In [None]:
zoo_df.describe()

The 'legs' field is the only numeric feature with more than two distinct values, apart from the class label (class_type).
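
An observation like this can be confirmed by counting the distinct values in each column. A minimal sketch, run on a made-up four-row frame rather than the real data:

```python
import pandas as pd

# Made-up rows for illustration; the notebook would use zoo_df instead.
toy = pd.DataFrame({
    "hair": [1, 0, 0, 1],
    "eggs": [0, 1, 1, 0],
    "legs": [4, 2, 0, 6],
    "class_type": [1, 2, 4, 6],
})

# Number of distinct values in each column.
distinct = toy.nunique()

# Columns that are not binary, ignoring the target column.
non_binary = distinct[distinct > 2].drop("class_type", errors="ignore")
print(non_binary.index.tolist())
```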

In [None]:
# Count the distinct values in each column
unique_counts = zoo_df.nunique()
print(unique_counts)

In [None]:
num_duplicates = zoo_df.duplicated().sum()

print(num_duplicates)

In [None]:
# Map class_type number to actual class name
class_mapping = class_df.set_index("Class_Number")["Class_Type"].to_dict()
zoo_df["class_name"] = zoo_df["class_type"].map(class_mapping)
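
`Series.map` with a dict leaves NaN wherever a key is missing, so it can be worth checking that every class number was actually mapped. A small sketch on made-up stand-ins for `class_df` and `zoo_df` (the rows and the class number 9 are illustrative only):

```python
import pandas as pd

# Made-up stand-ins for class_df and zoo_df.
toy_classes = pd.DataFrame({"Class_Number": [1, 2], "Class_Type": ["Mammal", "Bird"]})
toy_zoo = pd.DataFrame({"animal_name": ["a", "b", "c"], "class_type": [1, 2, 9]})

mapping = toy_classes.set_index("Class_Number")["Class_Type"].to_dict()
toy_zoo["class_name"] = toy_zoo["class_type"].map(mapping)

# Rows whose class number had no entry in the mapping end up as NaN.
unmapped = toy_zoo[toy_zoo["class_name"].isna()]
print(unmapped["class_type"].tolist())
```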

In [None]:
# Create a boxplot of number of legs by class
plt.figure(figsize=(10, 6))
sns.boxplot(data=zoo_df, x='class_name', y='legs', palette='Set3')
plt.title('Number of Legs by Animal Class')
plt.xlabel('Animal Class')
plt.ylabel('Number of Legs')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.tight_layout()
plt.show()

Suspicious: which mammals are recorded with 2 legs or 0 legs?

In [None]:
# Select mammals recorded with fewer than four legs
non_fourleg_mam = zoo_df[(zoo_df['class_name'] == 'Mammal') & (zoo_df['legs'] < 4)]


# print(non_fourleg_mam[['animal_name', 'legs','class_name']] )

non_fourleg_mam

In [None]:
target_animal_changelegs = ['wallaby','vampire','squirrel','sealion','seal','gorilla','fruitbat']

zoo_df.loc[zoo_df['animal_name'].isin(target_animal_changelegs), 'legs'] = 4


In [None]:
mask = zoo_df['animal_name'].isin(target_animal_changelegs)

zoo_df_updated = zoo_df[mask]

print(zoo_df_updated[['animal_name', 'legs', 'class_name']].to_string(index=False))


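Scale the "legs" feature, since having 8 legs is not 8x more important than having 1 leg.
This can help training speed and convergence in models that are sensitive to feature scale (for example gradient-based or distance-based ones); tree-based models such as the decision tree used below are not affected by it.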

StandardScaler standardizes the values so that:

Mean = 0
Standard deviation = 1
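
The standardization described above amounts to z = (x - mean) / std for each value in the column. As a small worked check, the sketch below applies that formula by hand to the six possible leg counts (used here only as a tiny example column, not the real data) and compares the result with `StandardScaler`, which uses the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The six leg counts from the attribute list, used as a tiny example column.
legs = np.array([0, 2, 4, 5, 6, 8], dtype=float).reshape(-1, 1)

# Manual standardization: z = (x - mean) / std, with the population std (ddof=0).
manual = (legs - legs.mean()) / legs.std()

# StandardScaler applies the same formula column by column.
scaled = StandardScaler().fit_transform(legs)

print(np.allclose(manual, scaled))
```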


In [None]:
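scaler = StandardScaler()
# Scale the column on zoo_df, because X is only defined further down in the notebook
zoo_df['legs'] = scaler.fit_transform(zoo_df[['legs']]).ravel()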

In [None]:
# Plot class distribution
plt.figure(figsize=(10, 6))
zoo_df["class_name"].value_counts().plot(kind="bar", color="skyblue", edgecolor="black")
plt.title("Distribution of Animal Classes")
plt.xlabel("Animal Class")
plt.ylabel("Number of Animals")
plt.xticks(rotation=45)
plt.grid(axis="y")
plt.tight_layout()
plt.show()

Preprocessing

In [None]:
zoo_df

In [None]:
class_df

Drop Non Predictive Columns

In [None]:
df_model = zoo_df.drop(columns=['animal_name', 'class_name'])

df_model

Separate Features from Target

In [None]:
X = df_model.drop(columns = 'class_type')
y = df_model['class_type']

Training and Testing Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 22)

# print(X_train.head())
print(y_train.head())
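
With only 101 animals spread over 7 classes, a plain random split can leave a rare class under-represented in one of the two parts. One possible alternative, not used in this notebook, is a stratified split via the `stratify` argument of `train_test_split`. The sketch below illustrates the effect on made-up labels (40 of one class, 10 of another), not on the real zoo counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 40 of class 1, 10 of class 2.
y_toy = np.array([1] * 40 + [2] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

# stratify=y_toy asks for the same class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=22, stratify=y_toy
)

# Class counts in the training and test parts.
print(np.bincount(y_tr)[1:], np.bincount(y_te)[1:])
```

Stratification requires at least two members in every class, so it is worth checking the class counts before applying it to `y`.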

Modelling Classifiers: Decision Tree & Random Forest

In [None]:
clf1 = DecisionTreeClassifier(random_state = 22)

clf1.fit(X_train, y_train)


Predict

In [None]:
y_pred = clf1.predict(X_test)

Evaluate

In [None]:
# 1. Print the Classification Report
print(classification_report(y_test, y_pred))

# 2. Print the Confusion Matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# 3. Print the Accuracy Score
print("Accuracy Score:", accuracy_score(y_test, y_pred))


In [None]:
# Build the confusion matrix with a fixed label order so it is always 7x7,
# even if a class is missing from the test set.
# Assumes class numbers 1..7 follow the order of the class names listed below.
class_ids = [1, 2, 3, 4, 5, 6, 7]
labels = ['Mammal', 'Bird', 'Reptile', 'Fish', 'Amphibian', 'Bug', 'Invertebrate']
cm = confusion_matrix(y_test, y_pred, labels=class_ids)

# Plot the heatmap with class names on both axes
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=labels, yticklabels=labels)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()
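
A test set of about 20 animals may not contain every class, in which case `confusion_matrix` returns a matrix smaller than 7x7 and seven tick labels no longer line up with it. The sketch below shows, on made-up labels, how passing `labels` fixes the shape and the row/column order.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: classes 3, 5, 6 and 7 never occur.
y_true_toy = [1, 1, 2, 4]
y_pred_toy = [1, 2, 2, 4]

# Without `labels`, only the classes that occur get a row and a column.
cm_small = confusion_matrix(y_true_toy, y_pred_toy)

# With `labels`, every class 1..7 gets a row and a column, in that order.
cm_full = confusion_matrix(y_true_toy, y_pred_toy, labels=list(range(1, 8)))

print(cm_small.shape, cm_full.shape)
```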

