# Stellar classification

Data source: [Kaggle](https://www.kaggle.com/vinesmsuic/preprocessing-the-stardataset)

Stellar Classification uses the spectral data of stars to categorize them into different categories. The modern stellar classification system is known as the Morgan–Keenan (MK) classification system. It uses the old HR classification system to categorize stars with their chromaticity and uses Roman numerals to categorize the star’s size.

In this Dataset, we will be using Absolute Magnitude and B-V Color Index to Identify Giants and Dwarfs.

Columns:
- Vmag - Visual Apparent Magnitude of the Star
- Plx - Distance Between the Star and the Earth
- e_Plx - Standard Error of `Plx` (Drop the Row if you find the e_Plx is too high!)
- B-V - B-V color index.
- SpType - Spectral type
- Amag - Absolute Magnitude of the Star
- TargetClass - Whether the Star is Dwarf (0) or Giant (1)

Goal:
Predict star type

For the predictions, we will use `B-V` and `Amag` columns.

In [1]:
# Imports
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # seaborne is a package built on top of matplotlib.
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

sns.set() # activate seaborn to override all the matplotlib graphics

In [2]:
# Load data
df = pd.read_csv("Star39552_balanced.csv", delimiter=",")

In [3]:
# Select Target
y = df['TargetClass']

# Select Features
x = df[['B-V','Amag']]

In [4]:
df.describe()

Unnamed: 0,Vmag,Plx,e_Plx,B-V,Amag,TargetClass
count,39552.0,39552.0,39552.0,39552.0,39552.0,39552.0
mean,7.921309,7.117378,1.109705,0.744336,16.050687,0.5
std,1.308857,12.446291,0.788133,0.513987,2.443937,0.500006
min,-0.62,-27.84,0.42,-0.4,-0.35,0.0
25%,7.21,2.43,0.8,0.358,14.756514,0.0
50%,8.16,4.44,0.99,0.703,16.020827,0.5
75%,8.83,8.2325,1.23,1.129,17.590542,1.0
max,12.85,772.33,40.63,3.44,30.449015,1.0


Splitting the dataset into two parts, 25% will be used for testing and the rest for training.

In [5]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.25)

We will use scaler to get better results.

In [6]:
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

In [7]:
def fit_and_predict(classifier, x_train, y_train, x_test, y_test):
    tic = time.perf_counter()
    classifier.fit(x_train, y_train)
    toc = time.perf_counter()

    print(f"Fitting took {toc - tic:0.4f} seconds")

    tic = time.perf_counter()
    y_pred = classifier.predict(x_test)
    toc = time.perf_counter()

    print(f"Predicting on the test data took {toc - tic:0.4f} seconds")

    print()
    print('The score on train dataset is {}'.format(classifier.score(x_train, y_train)))
    print('The test accuracy is {}'.format(accuracy_score(y_test, y_pred)))
    print()
    

    cm = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix : \n", cm)    
    
    return y_pred

## A random Tree

Decision Trees are a non-parametric supervised learning classifiers. The model predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.

In [8]:
from sklearn import tree

tree = tree.DecisionTreeClassifier(max_depth=5)

tic = time.perf_counter()
tree.fit(x_train, y_train)
toc = time.perf_counter()

print(f"Fitting took {toc - tic:0.4f} seconds")

tic = time.perf_counter()
y_pred = tree.predict(x_test)
toc = time.perf_counter()

print(f"Predicting on the test data took {toc - tic:0.4f} seconds")
print()

print('The score on train dataset is {}'.format(tree.score(x_train, y_train)))
print('The test accuracy is {}'.format(accuracy_score(y_test, y_pred)))
print()

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix : \n", cm)    

Fitting took 0.0483 seconds
Predicting on the test data took 0.0014 seconds

The score on train dataset is 0.8816073354908306
The test accuracy is 0.8782362459546925

Confusion Matrix : 
 [[4159  780]
 [ 424 4525]]


The decision tree was really fast and it classifies our data fairly well.

## Random Forest Classification

A random forest is an ensamble learning method that uses a number of decision tree classifiers on various sub-samples of the dataset. These decision trees than vote to improve the predictive accuracy and control over-fitting.

In [9]:
from sklearn.ensemble import RandomForestClassifier

In [10]:
forest = RandomForestClassifier()

tic = time.perf_counter()
forest.fit(x_train, y_train)
toc = time.perf_counter()

print(f"Fitting took {toc - tic:0.4f} seconds")

tic = time.perf_counter()
y_pred = forest.predict(x_test)
toc = time.perf_counter()

print(f"Predicting on the test data took {toc - tic:0.4f} seconds")
print()

print('The score on train dataset is {}'.format(forest.score(x_train, y_train)))
print('The test accuracy is {}'.format(accuracy_score(y_test, y_pred)))
print()

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix : \n", cm)    

Fitting took 4.7286 seconds
Predicting on the test data took 0.2432 seconds

The score on train dataset is 0.9997977346278317
The test accuracy is 0.8631674757281553

Confusion Matrix : 
 [[4233  706]
 [ 647 4302]]


The training accuracy without any parameters is exceptionaly high, but it drops dramatically, during testing. This suggests that our model is overfitting. We will tweak the classifier's parameters to fix this.

In [11]:
forest = RandomForestClassifier(max_depth=5)

tic = time.perf_counter()
forest.fit(x_train, y_train)
toc = time.perf_counter()

print(f"Fitting took {toc - tic:0.4f} seconds")

tic = time.perf_counter()
y_pred = forest.predict(x_test)
toc = time.perf_counter()

print(f"Predicting on the test data took {toc - tic:0.4f} seconds")

print()
print('The score on train dataset is {}'.format(forest.score(x_train, y_train)))
print('The test accuracy is {}'.format(accuracy_score(y_test, y_pred)))
print()

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix : \n", cm)

Fitting took 1.7115 seconds
Predicting on the test data took 0.0544 seconds

The score on train dataset is 0.8832928802588996
The test accuracy is 0.8789441747572816

Confusion Matrix : 
 [[4195  744]
 [ 453 4496]]


Limiting the maximal depth of trees lowered the training accuracy, but the model did a little bit better on the test data. This also decreased the training and predicting times. Unfortunatelly, it doesn't provide any advatage over using a single decision tree. The accuracy is comparable and training times are much longer.

## Naive Bayes

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. 

The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of probability. Gaussian Naive Bayes assumes Gaussian distribution. In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations.

In [12]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
tic = time.perf_counter()
gnb.fit(x_train, y_train)
toc = time.perf_counter()

print(f"Fitting took {toc - tic:0.4f} seconds")

tic = time.perf_counter()
y_pred = gnb.fit(x_train, y_train).predict(x_test)
toc = time.perf_counter()

print(f"Predicting on the test data took {toc - tic:0.4f} seconds")
print()


print('The score on train dataset is {}'.format(gnb.score(x_train, y_train)))
print('The test accuracy is {}'.format(accuracy_score(y_test, y_pred)))

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix : \n", cm)

Fitting took 0.0065 seconds
Predicting on the test data took 0.0051 seconds

The score on train dataset is 0.8806634304207119
The test accuracy is 0.8777305825242718
Confusion Matrix : 
 [[4151  788]
 [ 421 4528]]


The accuracy of Naive Bayes is comparable to that of the decision tree and the random forest, and training times comparable to the decision tree. 

## Logistic regression

Logistic regression, despite its name, is a linear model for classification rather than regression. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

In [13]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
tic = time.perf_counter()
log_reg.fit(x_train, y_train)
toc = time.perf_counter()

print(f"Fitting took {toc - tic:0.4f} seconds")

tic = time.perf_counter()
y_pred = log_reg.fit(x_train, y_train).predict(x_test)
toc = time.perf_counter()

print(f"Predicting on the test data took {toc - tic:0.4f} seconds")
print()


print('The score on train dataset is {}'.format(log_reg.score(x_train, y_train)))
print('The test accuracy is {}'.format(accuracy_score(y_test, y_pred)))

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix : \n", cm)

Fitting took 0.0295 seconds
Predicting on the test data took 0.0153 seconds

The score on train dataset is 0.8797532362459547
The test accuracy is 0.8767192556634305
Confusion Matrix : 
 [[4229  710]
 [ 509 4440]]


The logistic regression was so far the fastest of all the methods, which we employed and the accuracy was comparable with the previous algorithms.

## Voting Classifier

This is an ensamble learning technique that combines different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing model in order to balance out their individual weaknesses.

In [14]:
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(estimators=[('lr', log_reg), ('rf', forest), ('gnb', gnb)], voting='hard')

tic = time.perf_counter()
voting_clf = voting_clf.fit(x_train, y_train)
toc = time.perf_counter()
print(f"Fitting took {toc - tic:0.4f} seconds")

tic = time.perf_counter()
y_pred = voting_clf.fit(x_train, y_train).predict(x_test)
toc = time.perf_counter()

print(f"Predicting on the test data took {toc - tic:0.4f} seconds")
print()


print('The score on train dataset is {}'.format(voting_clf.score(x_train, y_train)))
print('The test accuracy is {}'.format(accuracy_score(y_test, y_pred)))

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix : \n", cm)

Fitting took 1.5379 seconds
Predicting on the test data took 1.4454 seconds

The score on train dataset is 0.8820118662351673
The test accuracy is 0.879247572815534
Confusion Matrix : 
 [[4184  755]
 [ 439 4510]]


## Conclusion

We used several classifiers (namely: `DecisionTreeClassifier`, `RandomForestClassifier`, `GaussianNB`, and `LogisticRegression`)to fit the Stellar classification data. At the end we combined our classifiers into a single one using the `VotingClassifier`. 

The accuracy of all the classifier we've used was similar and therefore, the decision which one we should use hast to depend on something else. We found that the fitting and the prediction times varied a lot. The fastest method was `GaussianNB`, followed by `LogisticRegression`. The third place goes to `DecisionTreeClassifier`.

The `RandomForestClassifier` and `VotingClassifier` were the slowest and they provided no additional benefits.