<a href="https://colab.research.google.com/github/mohityadav11a/asteroid_spectra/blob/main/5_binary_svm_ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Support Vector Machine - Binary Classes

The multiclass classification problem of the Main Group classes

- C
- S
- X
- Other

is transformed into a binary problem, e.g. X (1) versus Not-X (0). In this notebook we train a Support Vector Machine (SVM) on the asteroid spectra to solve this binary classification task.

In [24]:
# Importing libraries
import os
import numpy as np
import pandas as pd
import sklearn
import sklearn.metrics  # the metric functions are called as sklearn.metrics.<name> further below

In [25]:
# Mount the Google Drive
try:
    from google.colab import drive
    drive.mount('/gdrive')
    core_path = "/gdrive/MyDrive/colab/asteroid_taxonomy/"
except ModuleNotFoundError:
    core_path = ""

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


In [26]:
# Loading level 2 asteroid data
asteroids_df = pd.read_pickle(os.path.join(core_path, "data/lvl2/", "asteroids.pkl"))

## Data Preparation



In [27]:
# Now we add a binary classification schema, where we distinguish between e.g., X and non-X classes
asteroids_df.loc[:, "Class"] = asteroids_df["Main_Group"].apply(lambda x: 1 if x=="X" else 0)
asteroids_df

Unnamed: 0,Name,Bus_Class,SpectrumDF,Main_Group,Class
0,1 Ceres,C,Wavelength_in_microm Reflectance_norm550n...,C,0
1,2 Pallas,B,Wavelength_in_microm Reflectance_norm550n...,C,0
2,3 Juno,Sk,Wavelength_in_microm Reflectance_norm550n...,S,0
3,4 Vesta,V,Wavelength_in_microm Reflectance_norm550n...,Other,0
4,5 Astraea,S,Wavelength_in_microm Reflectance_norm550n...,S,0
...,...,...,...,...,...
1334,1996 UK,Sq,Wavelength_in_microm Reflectance_norm550n...,S,0
1335,1996 VC,S,Wavelength_in_microm Reflectance_norm550n...,S,0
1336,1997 CZ5,S,Wavelength_in_microm Reflectance_norm550n...,S,0
1337,1997 RD1,Sq,Wavelength_in_microm Reflectance_norm550n...,S,0
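
The same one-vs-rest labelling can also be written as a vectorized comparison instead of `apply`. The following is a minimal sketch, assuming a small hypothetical frame `toy_df` that only mimics the `Main_Group` column; `make_binary_labels` is an illustrative helper and not part of the notebook.

```python
import pandas as pd


def make_binary_labels(main_group: pd.Series, positive_class: str = "X") -> pd.Series:
    """Return 1 where the main group equals positive_class, otherwise 0."""
    return (main_group == positive_class).astype(int)


# Hypothetical toy data, not the real asteroid table
toy_df = pd.DataFrame({"Main_Group": ["C", "S", "X", "Other", "X"]})
toy_df["Class"] = make_binary_labels(toy_df["Main_Group"])
print(toy_df["Class"].tolist())  # [0, 0, 1, 0, 1]
```

Passing a different `positive_class` (for example `"S"`) would set up the corresponding S versus Not-S problem.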


In [28]:
# Allocate the spectra to one array and the classes to another one
asteroids_X = np.array([k["Reflectance_norm550nm"].tolist() for k in asteroids_df["SpectrumDF"]])
asteroids_y = np.array(asteroids_df["Class"].to_list())


In [29]:
# Creating a single test-training split with a ratio of 0.8 / 0.2
# The StratifiedShuffleSplit is needed to preserve the ratio of the classes!
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

for train_index, test_index in sss.split(asteroids_X, asteroids_y):
    X_train, X_test = asteroids_X[train_index], asteroids_X[test_index]
    y_train, y_test = asteroids_y[train_index], asteroids_y[test_index]

In [30]:
# Checking whether the class imbalance has been preserved in both sets
print(f"Ratio of positive training classes: {round(sum(y_train) / len(X_train), 2)}")
print(f"Ratio of positive test classes: {round(sum(y_test) / len(X_test), 2)}")

Ratio of positive training classes: 0.18
Ratio of positive test classes: 0.18
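
For a single split, `train_test_split` with the `stratify` argument is a possible shorthand for the `StratifiedShuffleSplit` loop above. The sketch below uses synthetic stand-in data (`demo_X`, `demo_y` are assumptions, not the asteroid spectra) with 18 % positives and checks that both subsets keep that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 500 samples, 90 of them positive (18 %)
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(500, 5))
demo_y = np.array([1] * 90 + [0] * 410)

# stratify=demo_y keeps the class ratio in the training and the test subset
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, stratify=demo_y, random_state=0
)

print(f"Ratio of positive training classes: {demo_y_train.mean():.2f}")
print(f"Ratio of positive test classes: {demo_y_test.mean():.2f}")
```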


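## Imbalanced Datasets
The positive class makes up only about 18 % of the training set. To compensate for this imbalance, we compute a weight for the positive class and pass it to the SVM during fitting.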

In [31]:
# Compute the class weighting: inverse of the positive class ratio, truncated to an integer
positive_class_weight = int(1.0 / (sum(y_train) / len(X_train)))
print(f"Positive class weighting: {positive_class_weight}")

Positive class weighting: 5
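
As a point of comparison, scikit-learn offers a `"balanced"` heuristic that sets each class weight to `n_samples / (n_classes * count_of_class)`. The sketch below applies it to synthetic labels (`demo_y` is an assumption that merely mirrors the 18 % ratio printed earlier) and derives the weight of the positive class relative to the negative one.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels (assumption): 18 positives and 82 negatives
demo_y = np.array([1] * 18 + [0] * 82)

# "balanced" weights: n_samples / (n_classes * np.bincount(y))
balanced = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=demo_y)

# Weight of the positive class relative to the negative class (here 82 / 18)
relative_positive_weight = balanced[1] / balanced[0]

print(f"Balanced weights: {balanced}")
print(f"Relative positive weight: {relative_positive_weight:.2f}")
```

The notebook's formula uses the inverse of the positive ratio and truncates it with `int()`, so its result can differ slightly from this heuristic.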


## Scaling


In [32]:
from sklearn import preprocessing

# Instantiate the StandardScaler (mean 0, standard deviation 1) and use the training data to fit
# the scaler
scaler = preprocessing.StandardScaler().fit(X_train)

# Transforming the training data
X_train_scaled = scaler.transform(X_train)
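
To illustrate what the fitted scaler does, the sketch below compares `StandardScaler.transform` with a manual z-score that uses the mean and standard deviation of the training data only. The arrays `demo_train` and `demo_test` are synthetic assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), not the asteroid spectra
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
demo_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Fit on the training data only
demo_scaler = StandardScaler().fit(demo_train)

# Manual z-score of the test data with the training statistics
manual = (demo_test - demo_train.mean(axis=0)) / demo_train.std(axis=0)

print(np.allclose(demo_scaler.transform(demo_test), manual))  # True
```

Reusing the training statistics for the test data keeps information about the test set out of the fitting step.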

## Training

In [33]:
# Importing SVM class
from sklearn import svm

# Using an RBF kernel and applying the class weighting
wclf = svm.SVC(kernel='rbf', class_weight={1: positive_class_weight}, C=100)

# Performing training
wclf.fit(X_train_scaled, y_train)
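
The value `C=100` is fixed in the cell above. One possible way to check such a choice is a small cross-validated grid search scored by F1. This is only a sketch on synthetic data: `demo_X`, `demo_y`, the candidate values in `param_grid`, and the class weight of 5 are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 160 negatives and 40 positives
rng = np.random.default_rng(0)
demo_X = np.vstack([
    rng.normal(0.0, 1.0, size=(160, 4)),
    rng.normal(1.5, 1.0, size=(40, 4)),
])
demo_y = np.array([0] * 160 + [1] * 40)

# Hypothetical candidate values for the regularization parameter C
param_grid = {"C": [1, 10, 100]}

search = GridSearchCV(
    SVC(kernel="rbf", class_weight={1: 5}),
    param_grid=param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(demo_X, demo_y)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```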

In [34]:
# Scaling testing data
X_test_scaled = scaler.transform(X_test)

# Predicting the classes of the test data
y_test_pred = wclf.predict(X_test_scaled)
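
The separate steps above (fit the scaler on the training data, transform both sets, fit the SVM, predict) can also be bundled into a scikit-learn `Pipeline`, which applies the training statistics to new data automatically. A minimal sketch on synthetic data follows; `demo_X`, `demo_y` and `demo_pipeline` are illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): two well separated Gaussian blobs
rng = np.random.default_rng(0)
demo_X = np.vstack([
    rng.normal(0.0, 1.0, size=(80, 4)),
    rng.normal(3.0, 1.0, size=(20, 4)),
])
demo_y = np.array([0] * 80 + [1] * 20)

# Scaling and classification as one estimator
demo_pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
demo_pipeline.fit(demo_X, demo_y)

# The scaler inside the pipeline stores the mean of the data it was fitted on
fitted_scaler = demo_pipeline.named_steps["standardscaler"]
print(fitted_scaler.mean_.shape)  # (4,)

# predict() scales the input with the stored statistics before classifying
print(demo_pipeline.predict(demo_X[:3]))
```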

In [35]:
# Importing confusion matrix and performing computation
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_test, y_test_pred)

print(conf_mat)

# The order of the confusion matrix is:
#     - true negative (top left, tn)
#     - false positive (top right, fp)
#     - false negative (bottom left, fn)
#     - true positive (bottom right, tp)
tn, fp, fn, tp = conf_mat.ravel()

[[217   4]
 [  2  45]]


In [36]:
# (recall = tp / (tp + fn))
recall_score = round(sklearn.metrics.recall_score(y_test, y_test_pred), 3)
print(f"Recall Score: {recall_score}")

# (precision = tp / (tp + fp))
precision_score = round(sklearn.metrics.precision_score(y_test, y_test_pred), 3)
print(f"Precision Score: {precision_score}")

# F1 score: harmonic mean of precision and recall
f1_score = round(sklearn.metrics.f1_score(y_test, y_test_pred), 3)
print(f"F1 Score: {f1_score}")

Recall Score: 0.957
Precision Score: 0.918
F1 Score: 0.938
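
As a cross-check, the three scores can be recomputed by hand from the confusion matrix printed earlier. The sketch below hard-codes that matrix and applies the formulas from the comments; the variable names with the `_ex` suffix are illustrative.

```python
import numpy as np

# Confusion matrix as printed above: [[tn, fp], [fn, tp]]
conf_mat_ex = np.array([[217, 4], [2, 45]])
tn_ex, fp_ex, fn_ex, tp_ex = conf_mat_ex.ravel()

recall_ex = tp_ex / (tp_ex + fn_ex)        # 45 / 47
precision_ex = tp_ex / (tp_ex + fp_ex)     # 45 / 49
# Harmonic mean of precision and recall
f1_ex = 2 * precision_ex * recall_ex / (precision_ex + recall_ex)

print(f"Recall: {recall_ex:.4f}")
print(f"Precision: {precision_ex:.4f}")
print(f"F1: {f1_ex:.4f}")
```

Up to rounding, these values agree with the scores reported by `sklearn.metrics`.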


In [37]:
# Copying the original labels and shuffling them randomly to obtain a naive baseline
asteroids_random_y = asteroids_y.copy()
np.random.shuffle(asteroids_random_y)

In [38]:

# F1 score of the randomly shuffled labels against the true labels
f1_score_naive = round(sklearn.metrics.f1_score(asteroids_y, asteroids_random_y), 3)
print(f"Naive F1 Score: {f1_score_naive}")

Naive F1 Score: 0.19
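
A single shuffle gives a noisy baseline. Because shuffled labels contain exactly as many positives as the true labels, precision equals recall, and the expected F1 score equals the positive class ratio. The sketch below averages over repeated shuffles of synthetic labels (`demo_y` with 18 % positives is an assumption) to illustrate this.

```python
import numpy as np
from sklearn.metrics import f1_score

# Synthetic labels (assumption): 180 positives and 820 negatives
demo_y = np.array([1] * 180 + [0] * 820)
rng = np.random.default_rng(0)

naive_scores = []
for _ in range(200):
    # A shuffled copy of the labels acts as a random "prediction"
    shuffled = rng.permutation(demo_y)
    naive_scores.append(f1_score(demo_y, shuffled))

mean_naive_f1 = float(np.mean(naive_scores))
print(f"Mean naive F1 over 200 shuffles: {mean_naive_f1:.3f}")
print(f"Positive class ratio: {demo_y.mean():.3f}")
```

A naive F1 score close to the positive ratio is therefore expected, which puts the SVM result above into perspective.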
