# Feature Extraction (Standard Features)

**Author**: Maleakhi Agung Wijaya  
**Email**: maw219@cam.ac.uk  
**Description**: This notebook works with standard features extracted using conchology domain knowledge and evaluates how well classifiers perform on them.

In [204]:
import warnings

import numpy as np
import pandas as pd
import scipy.io
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display, HTML

from sklearn.decomposition import PCA
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             ConfusionMatrixDisplay, f1_score)
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler, scale
from sklearn.svm import SVC

import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras import layers, models, backend as K, callbacks

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [205]:
%run Utilities.ipynb

## Load Dataset

In this section, we load features that were previously extracted by following the feature extraction steps described in Zhang et al. (https://www.nature.com/articles/s41597-019-0230-3.pdf). The goal of the project is to build a classifier that outperforms their method by exploring different feature extraction methods.

In [182]:
# Load data (contains data that we constructed after feature extraction)
X_color, y_color = load_domain_knowledge_data(color_domain_knowledge)
X_shape, y_shape = load_domain_knowledge_data(shape_domain_knowledge)
X_texture, y_texture = load_domain_knowledge_data(texture_domain_knowledge)
X_all, y_all = load_domain_knowledge_data(all_domain_knowledge)

In [183]:
print(f"X_color: {X_color.shape}")
print(f"X_shape: {X_shape.shape}")
print(f"X_texture: {X_texture.shape}")
print(f"X_all: {X_all.shape}")

X_color: (1340, 12)
X_shape: (1340, 142)
X_texture: (1340, 4000)
X_all: (1340, 164)


In [188]:
len(np.unique(y_all))

134
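
With 1340 samples and 134 classes, it is worth checking how the samples are spread across classes, since that determines what a majority-class baseline can achieve. The sketch below uses synthetic labels (`y_demo` is a hypothetical stand-in for `y_all`, assuming 10 samples per class, which the shapes above are consistent with but do not prove).

```python
import numpy as np

# Synthetic stand-in for y_all: 134 classes with 10 samples each (an assumption)
y_demo = np.repeat(np.arange(134), 10)

# Count samples per class
classes, counts = np.unique(y_demo, return_counts=True)
print(f"classes: {len(classes)}, min per class: {counts.min()}, max per class: {counts.max()}")

# Accuracy of always predicting the most frequent class on this label vector
majority_acc = counts.max() / counts.sum()
print(f"majority-class accuracy: {majority_acc:.4f}")
```

Running the same two lines on `y_all` would show whether the real labels are balanced in this way.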

## Preprocessing

In this section, we further expand the feature sets by introducing whitening (implemented here as per-feature standardization with `StandardScaler`) and, where appropriate, dimensionality reduction using PCA. The experiments are set up to allow a fair comparison with the pipeline discussed in the Zhang et al. paper.

In [184]:
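# Note: the scaler and PCA below are fit on the full dataset, before any train/test split.
scaler = StandardScaler()

# Standardize color features (zero mean, unit variance per feature)
X_color_whitening = scaler.fit_transform(X_color)
y_color_whitening = y_color

# Standardize shape features
X_shape_whitening = scaler.fit_transform(X_shape)
y_shape_whitening = y_shape

# Texture features: standardize, then keep the components explaining 99% of the variance
pca = PCA(n_components=0.99)
X_texture_pca = pca.fit_transform(scaler.fit_transform(X_texture))
y_texture_pca = y_texture

# Standardize the PCA scores as well
X_texture_pca_whitening = scaler.fit_transform(X_texture_pca)
y_texture_pca_whitening = y_texture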
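
The feature sets named `*_whitening` are produced with `StandardScaler`, which rescales each feature independently. The sketch below, on synthetic data (`X_demo` is hypothetical), illustrates how that differs from whitening in the stricter sense, which also removes correlation between features. `PCA(whiten=True)` is used here only as one readily available way to decorrelate; it is not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two strongly correlated synthetic features
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X_demo = np.column_stack([x1, 0.8 * x1 + 0.2 * rng.normal(size=500)])

# Per-feature standardization: zero mean, unit variance, correlation unchanged
X_std = StandardScaler().fit_transform(X_demo)

# PCA whitening: rotates onto principal axes and rescales, removing correlation
X_white = PCA(whiten=True).fit_transform(X_demo)

corr_std = np.corrcoef(X_std, rowvar=False)[0, 1]
corr_white = np.corrcoef(X_white, rowvar=False)[0, 1]
print(f"correlation after StandardScaler: {corr_std:.3f}")
print(f"correlation after PCA whitening:  {corr_white:.3f}")
```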

In [185]:
print(f"Number of components explaining 99% variance {X_texture_pca.shape[1]}")

Number of components explaining 99% variance 14
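
When `n_components` is a float between 0 and 1, scikit-learn's `PCA` keeps the smallest number of components whose cumulative explained variance ratio exceeds that fraction. The sketch below reproduces that count by hand on synthetic data (`X_demo` is hypothetical and unrelated to the texture features) as a way to sanity-check a result like the one above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 50 observed features driven by 5 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
X_demo = latent @ rng.normal(size=(5, 50)) + 0.01 * rng.normal(size=(200, 50))
X_std = StandardScaler().fit_transform(X_demo)

# Fit all components and find where the cumulative ratio first exceeds 0.99
pca_full = PCA(svd_solver="full").fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.99, side="right") + 1)

# Let PCA pick the count itself from the variance threshold
pca_99 = PCA(n_components=0.99, svd_solver="full").fit(X_std)
print(f"manual count: {n_needed}, PCA(n_components=0.99): {pca_99.n_components_}")
```

Plotting `cumulative` against the component index is a common way to see how sharply the variance concentrates in the leading components.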


## Evaluation

In this section, we train classifiers and evaluate their performance on the domain-knowledge feature sets, both in isolation and in combination.

**Traditional ML Models**  

We consider a dummy classifier (as a baseline), an SVC, and a random forest.

In [207]:
## Hyperparameter configurations and result storage
param_grid_svc = {
    # Candidate values tried for each hyperparameter
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': ["scale", "auto"],
    'kernel': ["rbf", "linear"]
}

param_dummy = {
    "strategy": ["most_frequent"] # baseline
}

param_grid_rf = { 
    'n_estimators': [10, 100, 200],
    'criterion' :['gini', 'entropy']
}


# Used to store results for different feature sets
list_dict_results = []

In [208]:
## Loop configuration
feature_sets = [
    (X_all, y_all),
    (X_color, y_color),
    (X_shape, y_shape),
    (X_texture, y_texture),
    (X_color_whitening, y_color_whitening),
    (X_shape_whitening, y_shape_whitening),
    (X_texture_pca, y_texture_pca),
    (X_texture_pca_whitening, y_texture_pca_whitening)
]
feature_sets_name = [
    "all",
    "color",
    "shape",
    "texture",
    "color_whiten",
    "shape_whiten",
    "texture_pca",
    "texture_pca_whiten"
]

## Classifier and hyperparameter loops
param_grids = [param_dummy, param_grid_svc, param_grid_rf]
classifiers_name = ["dummy", "svc", "rf"]
classifiers = [DummyClassifier(), SVC(), RandomForestClassifier()]
cmaps = [None, "plasma", "viridis"]

In [None]:
## Evaluation
# Iterate over different feature sets
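for (X, y), feature_set_name in zip(feature_sets, feature_sets_name):

    print("*"*50)
    print(feature_set_name)

    # Keep the MLP input size in sync with the current feature set.
    # param_grid_mlp is not defined in this section (assumed to come from
    # Utilities.ipynb) and is not used by the sklearn classifiers below.
    param_grid_mlp["input_dim"][0] = X.shape[1]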
    
    dict_results = generate_dict_results()
    # Iterate over different classifiers
    for classifier, classifier_name, param_grid, cmap in zip(classifiers, 
                                                             classifiers_name, 
                                                             param_grids,
                                                             cmaps):
        print("-"*30)
        print(classifier_name)

        list_acc, list_cm, list_f1, list_cv_results = nested_cv_sklearn(classifier, param_grid, X, y, 5)
        # Add data to dict_results
        dict_results["accuracy"].append(list_acc)
        dict_results["f1"].append(list_f1)
        dict_results["cv_results"].append(list_cv_results)
        dict_results["cm"].append(list_cm)

        # Display accuracy, f1, confusion matrix
        print(f"Accuracy: {np.mean(list_acc)}")
        print(f"F1: {np.mean(list_f1)}")
        cm = sum(list_cm)

        if classifier_name != "dummy":
            plot_confusion_matrix(cm, cmap=cmap)

        print()
    
    list_dict_results.append(dict_results)

**************************************************
all
------------------------------
dummy
Accuracy: 0.007462686567164178
F1: 0.00011055831951354338

------------------------------
svc


**Neural Networks**  

We consider neural network models.