# Introduction

The data consists of 100,000 observations of space taken by the SDSS (Sloan Digital Sky Survey). Every observation is described by 17 feature columns and 1 class column, which identifies it as a star, galaxy, or quasar.


obj_ID = Object Identifier, the unique value that identifies the object in the image catalog used by the CAS

alpha = Right Ascension angle (at J2000 epoch)

delta = Declination angle (at J2000 epoch)

u = Ultraviolet filter in the photometric system

g = Green filter in the photometric system

r = Red filter in the photometric system

i = Near Infrared filter in the photometric system

z = Infrared filter in the photometric system

run_ID = Run Number used to identify the specific scan

rerun_ID = Rerun Number to specify how the image was processed

cam_col = Camera column to identify the scanline within the run

field_ID = Field number to identify each field

spec_obj_ID = Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same spec_obj_ID must share the output class)

class = object class (galaxy, star or quasar object)

redshift = redshift value based on the increase in wavelength

plate = plate ID, identifies each plate in SDSS

MJD = Modified Julian Date, used to indicate when a given piece of SDSS data was taken

fiber_ID = fiber ID that identifies the fiber that pointed the light at the focal plane in each observation
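
The redshift column in the list above measures the fractional increase in wavelength, z = (observed - rest) / rest. The following is a minimal sketch of that definition; the function name `redshift` and the example wavelengths are illustrative assumptions, not values taken from the dataset.

```python
def redshift(observed_nm: float, rest_nm: float) -> float:
    """Return z = (observed - rest) / rest for a single spectral line."""
    if rest_nm <= 0:
        raise ValueError("rest wavelength must be positive")
    return (observed_nm - rest_nm) / rest_nm


# Illustrative values only: a line with a rest wavelength of 656.28 nm
# that is observed at 721.908 nm has been stretched by about 10%.
z_example = redshift(721.908, 656.28)
print(round(z_example, 3))
```

A positive value means the light has been shifted toward longer wavelengths, and a value of zero means no shift.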

# Import Libraries


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

# Read Data


In [None]:
data= pd.read_csv("/kaggle/input/stellar-classification-dataset-sdss17/star_classification.csv")
data

# Data Describe


In [None]:
data.head(6)

In [None]:
data.info()

In [None]:
data.describe()

# Data Preprocessing & Cleaning

In [None]:
df=data.copy()
df

In [None]:
df.isnull().sum()

There are no null values.

In [None]:
df['class'].value_counts()

In [None]:
b=sns.countplot(x= 'class' ,data = df ,palette='coolwarm'  )
plt.show()

The classes are imbalanced, which means we need to rebalance the dataset before modeling.
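
One way to put a number on the imbalance is to compute each class's share and the ratio between the largest and smallest class. This is a small sketch on made-up labels; the helper `class_shares` and the toy counts are assumptions for illustration, not the dataset's actual counts.

```python
from collections import Counter


def class_shares(labels):
    """Return (share per class, ratio of largest to smallest class)."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    ratio = max(counts.values()) / min(counts.values())
    return shares, ratio


# Toy labels, not the real class counts.
toy_labels = ["GALAXY"] * 6 + ["STAR"] * 3 + ["QSO"] * 1
shares, ratio = class_shares(toy_labels)
print(shares, ratio)
```

A ratio close to 1 indicates balanced classes; the further it is from 1, the stronger the case for resampling.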

# Data Visualization

In [None]:
df.hist(bins = 10 , figsize= (14,14))
plt.show()

In [None]:
df.columns

In [None]:
from astropy.coordinates import SkyCoord
import astropy.units as u
coords = SkyCoord(ra=df['alpha']*u.degree, dec=df['delta']*u.degree, frame='icrs')

fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(111, projection='mollweide')
ax.scatter(coords.ra.wrap_at(180*u.degree).radian, coords.dec.radian, s=1)
ax.grid()
plt.show()

The positions of the objects can be plotted on a celestial sphere (here in a Mollweide projection), showing how the objects are distributed across the sky.

In [None]:
fig = px.pie(df, names='cam_col', title='cam_col',color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()

The pie chart shows the share of observations recorded in each camera column: 19.6% come from column 4, 18.6% from column 3, and 18.5% from column 5.
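
The same percentages can be read off numerically instead of from the chart. The sketch below uses a small made-up frame (the `toy` data is an assumption); on the notebook's `df` the same `value_counts(normalize=True)` call would apply to `df['cam_col']`.

```python
import pandas as pd

# Toy stand-in for the cam_col column, not the real data.
toy = pd.DataFrame({"cam_col": [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]})

# Share of rows per camera column, as percentages ordered by column number.
share_pct = (
    toy["cam_col"].value_counts(normalize=True).sort_index() * 100
).round(1)
print(share_pct)
```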

In [None]:
size = [s if s >= 0 else 0 for s in df['redshift']]
fig = px.scatter(df, x='alpha', y='delta', color='class', size= size)
fig.update_layout(title='SDSS Spectroscopic Observations',
                  xaxis_title='Right Ascension (deg)',
                  yaxis_title='Declination (deg)')

fig.show()

Plotting each observation by right ascension and declination, colored by class and sized by redshift, shows how the spectroscopic observations are distributed across the sky.

In [None]:
fig = px.histogram(df, x="redshift")
fig.show()

A redshift histogram shows the distribution of objects at different redshift values, which can provide insights into the large-scale structure of the universe.

In [None]:
sns.scatterplot(data=df, x="class", y="redshift")

In [None]:
df['spec_obj_ID'].value_counts()

In [None]:
fig = px.histogram(df, x="spec_obj_ID")
fig.show()

spec_obj_ID Distribution

In [None]:
sns.scatterplot(data=df, x="class", y="spec_obj_ID")

This means that two different observations with the same spec_obj_ID must share the output class, which is either a galaxy, star, or quasar.
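
That property can be checked directly rather than read from a plot. The following is a hedged sketch: the helper `ids_with_conflicting_class` and the toy frame are assumptions for illustration, and the toy data deliberately contains one conflicting ID.

```python
import pandas as pd


def ids_with_conflicting_class(frame, id_col="spec_obj_ID", label_col="class"):
    """Return the IDs that appear with more than one distinct class label."""
    n_classes = frame.groupby(id_col)[label_col].nunique()
    return n_classes[n_classes > 1].index.tolist()


# Toy data: ID 303 is labelled both QSO and STAR on purpose.
toy = pd.DataFrame({
    "spec_obj_ID": [101, 101, 202, 303, 303],
    "class": ["GALAXY", "GALAXY", "STAR", "QSO", "STAR"],
})
conflicts = ids_with_conflicting_class(toy)
print(conflicts)
```

An empty list means every spec_obj_ID maps to a single class.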

In [None]:
sns.histplot(x=df.plate)
plt.title("plate Distribution", color="red", fontsize=18);

The plate ID identifies each plate used in the SDSS spectroscopic survey. Each plate contains multiple fibers that collect the light from different objects. For example, the plates numbered from 0 to 2000 contain approximately 2,500 fibers collecting light from different objects.

In [None]:
# Define colors
color = df['u'] - df['g']
# Define magnitude
mag = df['r']

# Plot CMD
fig = px.scatter(x=color, y=mag, color=color, opacity=0.5)
fig.update_layout(xaxis_title='u - g', yaxis_title='r')
fig.show()

Using photometric filters: CMDs (Color-Magnitude Diagrams) plot an object's brightness in one filter (magnitude) against the difference between its magnitudes in two filters (color); here r is plotted against u - g. This can reveal information about the properties of the objects in the dataset, such as their temperature, metallicity, or age.
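
As a small worked example of the color axis: a color index is the magnitude in a shorter-wavelength filter minus the magnitude in a longer-wavelength one, and because smaller magnitudes mean brighter objects, a smaller u - g means a bluer object. The helper `color_index` and the magnitudes below are made up for illustration.

```python
import numpy as np


def color_index(mag_short, mag_long):
    """Return the color index: short-wavelength minus long-wavelength magnitude."""
    return np.asarray(mag_short, dtype=float) - np.asarray(mag_long, dtype=float)


# Made-up magnitudes for three objects, not rows from the dataset.
u_mag = np.array([19.5, 22.0, 18.0])
g_mag = np.array([19.0, 20.0, 17.9])

u_minus_g = color_index(u_mag, g_mag)
bluest = int(np.argmin(u_minus_g))  # position of the object with the smallest u - g
print(u_minus_g, bluest)
```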

In [None]:
fig = px.scatter(df, x='delta', y='redshift', color='class', color_discrete_sequence=px.colors.qualitative.Dark24, hover_name='obj_ID')
fig.update_layout(title='Declination Angle vs Redshift', xaxis_title='Declination Angle', yaxis_title='Redshift')
fig.show()

In [None]:
fig = px.scatter(df, x='alpha', y='redshift', color='class', color_discrete_sequence=px.colors.qualitative.Dark2, hover_name='obj_ID')
fig.update_layout(title='Right Ascension vs Redshift', xaxis_title='Right Ascension', yaxis_title='Redshift')
fig.show()

# Encoding the Object Columns

In [None]:
df.describe(include=object)

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
class_le=LabelEncoder()

In [None]:
df['class'] = class_le.fit_transform(df['class'])

In [None]:
df.info()

# Feature Selection

In [None]:
plt.figure(figsize = (20,10))
sns.heatmap(df.corr() , annot = True , cmap = "YlGnBu")

# Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x=df.drop('class',axis=1).values
y=df['class'].values

In [None]:
x_train , x_test , y_train , y_test = train_test_split(x,y, test_size= 0.33 , random_state= 42)

# Dealing with the Imbalance in the Target Column

# Sampling With SMOTE

In [None]:
from imblearn.over_sampling import SMOTE
from collections import Counter
smote = SMOTE(random_state=42)
print('Original dataset shape %s' % Counter(y))
print('Original ytrain dataset shape %s' % Counter(y_train))
x_train_smote, y_train_smote = smote.fit_resample(x_train, y_train)
print('Resampled ytrain dataset shape %s' % Counter(y_train_smote))

The classes in the training set are now balanced.
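
For intuition about what the resampling does, SMOTE creates each synthetic minority point by interpolating between a minority sample and one of its nearest minority neighbors. The sketch below is a simplified NumPy illustration of that interpolation step only, not imblearn's implementation; the function `synth_minority_samples`, its parameters, and the toy points are all assumptions.

```python
import numpy as np


def synth_minority_samples(X_min, n_new, k=2, seed=0):
    """Create n_new points, each on the segment between a minority sample
    and one of its k nearest minority neighbors (requires k < len(X_min))."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances; a point is never its own neighbor.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)   # starting samples
    pick = rng.integers(0, k, size=n_new)            # which neighbor to use
    lam = rng.random((n_new, 1))                     # interpolation weight in [0, 1)

    chosen = X_min[neighbors[base, pick]]
    return X_min[base] + lam * (chosen - X_min[base])


# Toy minority class with two features.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = synth_minority_samples(X_minority, n_new=5)
print(new_points)
```

Because every new point lies between two existing minority points, the synthetic samples stay inside the region the minority class already occupies. Resampling only the training split, as done above, keeps the test set free of synthetic points.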

# Modeling

# RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier(max_depth=7 , max_features=3,n_estimators= 100)
rf.fit(x_train_smote, y_train_smote )

In [None]:
rf.score(x_train_smote, y_train_smote)

In [None]:
rf.score(x_test,y_test)

# Model Evaluation

In [None]:
from sklearn.metrics import confusion_matrix , classification_report

In [None]:
v = confusion_matrix(y_test , rf.predict(x_test))
v

In [None]:
from mlxtend.plotting import plot_confusion_matrix
plot_confusion_matrix(v , class_names=["GALAXY","QSO","STAR"], cmap='YlOrRd')

In [None]:
print (classification_report(y_test  , rf.predict(x_test)))