## Star Classification Dataset on Kaggle

**Source:** Kaggle.com

This dataset is used to predict the type of a star from its physical properties. The target variable is **Star Type**, which assigns each star to one of six categories (0: Brown Dwarf, 1: Red Dwarf, 2: White Dwarf, 3: Main Sequence, 4: Supergiant, 5: Hypergiant) based on its temperature, size, and luminosity.

**Features:**

* **Temperature (K):** The star's temperature in Kelvin, influencing its color and energy output.
* **Luminosity (L/Lo):** The star's brightness compared to the Sun, indicating its energy production.
* **Radius (R/Ro):** The star's size relative to the Sun, providing insight into its physical structure.
* **Absolute Magnitude (Mv):** A standardized measure of the star's intrinsic brightness for comparison across vast distances.
* **Star Color:** An indicator of the star's temperature and composition.
* **Spectral Class:** Classification based on spectral lines (values such as O, A, and M appear in the data), which mainly reflects the star's surface temperature.

## Objective of this Exploratory Data Analysis (EDA) for the Stellar Dataset

The primary objective of this EDA is to uncover relationships between the features and the target variable, **Star Type**. The analysis should surface patterns, trends, and anomalies in the data and guide the development of effective predictive models.

**Specific Goals of the EDA:**

* **Examine Feature Distributions:** Investigate the statistical distributions of key features like Temperature, Luminosity, Radius, and Absolute Magnitude to understand their range, central tendency, and variance.
* **Analyze Correlations:** Assess the correlations between numerical features (e.g., Temperature vs. Luminosity, Radius vs. Absolute Magnitude) to identify potential relationships that can influence the Star Type.
* **Explore Target Variable Distribution:** Understand the distribution of the Star Type categories within the dataset, analyzing how different star types are represented in terms of their feature values.
* **Identify Data Quality Issues:** Detect and address any outliers or missing values that may impact the analysis and modeling process.
* **Visualize Relationships:** Utilize various plots (scatter plots, histograms, pair plots, heatmaps) to visualize relationships between features and their impact on Star Type predictions.
* **Feature Engineering:** Consider creating new features (e.g., temperature ranges, luminosity bins) that might enhance predictive performance in subsequent modeling stages.
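
The feature-engineering goal above can be sketched as follows. This is a minimal illustration, not the notebook's actual method: the `sample` frame holds a handful of values copied from rows printed later in this notebook, and the bin edges, labels, and new column names (`temp_range`, `log_luminosity`, `lum_bin`) are assumptions chosen only for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with values copied from rows shown in this notebook's output.
# It is NOT the full dataset.
sample = pd.DataFrame({
    "Temperature (K)": [3042, 2600, 1939, 38940, 30839, 8829],
    "Luminosity(L/Lo)": [0.0005, 0.0003, 0.000138, 374830, 834042, 537493],
})

# Hypothetical temperature bin edges (Kelvin), picked for illustration only.
temp_edges = [0, 3500, 10000, float("inf")]
temp_labels = ["cool", "intermediate", "hot"]
sample["temp_range"] = pd.cut(
    sample["Temperature (K)"], bins=temp_edges, labels=temp_labels
)

# Luminosity spans many orders of magnitude, so bin it on a log10 scale.
sample["log_luminosity"] = np.log10(sample["Luminosity(L/Lo)"])
sample["lum_bin"] = pd.cut(
    sample["log_luminosity"], bins=3, labels=["low", "mid", "high"]
)

print(sample[["Temperature (K)", "temp_range", "lum_bin"]])
```

Binning on the log scale keeps the three luminosity bins from being dominated by the few extremely bright stars; equal-width bins on the raw values would place almost every dim star in the lowest bin.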

In [1]:
# Importing the necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

In [2]:
# Load the dataset
df = pd.read_csv('../data/star_data.csv')

In [3]:
# Understanding the dataset
print(df.head(10))

  Temperature (K) Luminosity(L/Lo) Radius(R/Ro) Absolute magnitude(Mv)  \
0             NaN              NaN          NaN                    NaN   
1            3042           0.0005       0.1542                   16.6   
2            2600           0.0003        0.102                   18.7   
3            2800           0.0002                               16.65   
4            1939         0.000138        0.103                  20.06   
5            2840                          0.11                  16.98   
6            2637          0.00073        0.127                  17.22   
7            2600           0.0004        0.096                   17.4   
8            2650          0.00069         0.11                  17.45   
9            2700          0.00018         0.13                  16.05   

   Star type Star color Spectral Class  
0        NaN        NaN            NaN  
1        0.0        Red              M  
2        0.0        Red              M  
3        0.0        R

In [4]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Temperature (K)         239 non-null    object 
 1   Luminosity(L/Lo)        239 non-null    object 
 2   Radius(R/Ro)            239 non-null    object 
 3   Absolute magnitude(Mv)  239 non-null    object 
 4   Star type               239 non-null    float64
 5   Star color              239 non-null    object 
 6   Spectral Class          239 non-null    object 
dtypes: float64(1), object(6)
memory usage: 13.3+ KB
None


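The dataset contains 240 entries and 7 columns. Six columns are stored with the `object` dtype, including the four numerical measurements (Temperature, Luminosity, Radius, and Absolute magnitude), which suggests they contain non-numeric entries such as blank strings. Only "Star type" is parsed as a number (`float64`). Each column has 239 non-null values, so every column has one missing entry.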
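
One possible way to recover numeric dtypes for the measurement columns is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN`. The sketch below uses a small stand-in frame (`raw`) that mimics the first rows printed above; the frame, the `numeric_cols` list, and the assumption that the blank Radius cell is a whitespace string are illustrative, not confirmed properties of the real file.

```python
import pandas as pd

# Stand-in frame mimicking the rows shown above: numbers stored as strings,
# one blank cell, and one fully empty row. Not the real file.
raw = pd.DataFrame({
    "Temperature (K)": [None, "3042", "2600", "2800"],
    "Luminosity(L/Lo)": [None, "0.0005", "0.0003", "0.0002"],
    "Radius(R/Ro)": [None, "0.1542", "0.102", " "],  # blank cell assumed to be whitespace
    "Absolute magnitude(Mv)": [None, "16.6", "18.7", "16.65"],
    "Star color": [None, "Red", "Red", "Red"],
})

# Columns expected to hold numbers (assumed from the column names).
numeric_cols = [
    "Temperature (K)",
    "Luminosity(L/Lo)",
    "Radius(R/Ro)",
    "Absolute magnitude(Mv)",
]

converted = raw.copy()
for col in numeric_cols:
    # errors="coerce" replaces anything that cannot be parsed with NaN
    converted[col] = pd.to_numeric(converted[col], errors="coerce")

print(converted.dtypes)
print(converted.isna().sum())
```

After conversion, the blank Radius cell is counted as missing, which the `object`-typed column would otherwise hide.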

In [5]:
print(df.describe)

<bound method NDFrame.describe of     Temperature (K) Luminosity(L/Lo) Radius(R/Ro) Absolute magnitude(Mv)  \
0               NaN              NaN          NaN                    NaN   
1              3042           0.0005       0.1542                   16.6   
2              2600           0.0003        0.102                   18.7   
3              2800           0.0002                               16.65   
4              1939         0.000138        0.103                  20.06   
..              ...              ...          ...                    ...   
235           38940           374830         1356                  -9.93   
236           30839           834042         1194                 -10.63   
237            8829           537493         1423                 -10.73   
238            9235           404940         1112                 -11.23   
239           37882           294903         1783                   -7.8   

     Star type Star color Spectral Class  
0         


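The output above is not a statistical summary: `df.describe` was referenced without parentheses, so `print` shows the bound method's repr, which embeds the full DataFrame. Calling `df.describe()` returns the summary statistics instead. The printout still confirms that the dataset has 240 rows and 7 columns mixing numerical and categorical star properties, and that some entries are missing or improperly formatted (for example, the all-NaN row 0 and the blank Radius in row 3).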
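
The difference between referencing `describe` and calling it can be seen on a small example. The `toy` frame below is a hypothetical stand-in built from four rows printed earlier, already stored as floats; on the real data the measurement columns would first need converting from `object` to numeric for `describe()` to report statistics such as the mean.

```python
import pandas as pd

# Hypothetical stand-in frame using four rows shown earlier in the notebook.
toy = pd.DataFrame({
    "Temperature (K)": [3042.0, 2600.0, 2800.0, 1939.0],
    "Absolute magnitude(Mv)": [16.6, 18.7, 16.65, 20.06],
})

# Without parentheses: a bound method object, which is what In [5] printed.
method_repr = repr(toy.describe)

# With parentheses: a DataFrame of count, mean, std, min, quartiles, and max.
summary = toy.describe()

print(method_repr[:40])
print(summary)
```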

In [6]:
print(df.isnull().sum())

Temperature (K)           1
Luminosity(L/Lo)          1
Radius(R/Ro)              1
Absolute magnitude(Mv)    1
Star type                 1
Star color                1
Spectral Class            1
dtype: int64
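
`isnull()` counts only true `NaN`/`None` values, so blank cells like the empty Radius in row 3 would not appear in the totals above. A minimal sketch of one way to expose them, assuming the blanks are empty or whitespace-only strings (the `raw` frame here is illustrative, not the real file):

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 0 is entirely missing, rows 2 and 3 contain blank strings.
raw = pd.DataFrame({
    "Luminosity(L/Lo)": [np.nan, "0.0005", "0.0002", " "],
    "Radius(R/Ro)": [np.nan, "0.1542", " ", "0.11"],
    "Star color": [np.nan, "Red", "Red", "Red"],
})

before = raw.isnull().sum()  # blank strings are not counted here

# Treat empty or whitespace-only strings as missing, then drop rows
# in which every column is missing (such as row 0).
cleaned = raw.replace(r"^\s*$", np.nan, regex=True).dropna(how="all")

after = cleaned.isnull().sum()

print(before)
print(after)
```

Dropping only the all-missing row keeps the partially filled rows available for imputation or a later, more targeted decision.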


In [7]:
df

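| | Temperature (K) | Luminosity(L/Lo) | Radius(R/Ro) | Absolute magnitude(Mv) | Star type | Star color | Spectral Class |
|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 3042 | 0.0005 | 0.1542 | 16.6 | 0.0 | Red | M |
| 2 | 2600 | 0.0003 | 0.102 | 18.7 | 0.0 | Red | M |
| 3 | 2800 | 0.0002 | | 16.65 | 0.0 | Red | M |
| 4 | 1939 | 0.000138 | 0.103 | 20.06 | 0.0 | Red | M |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 235 | 38940 | 374830 | 1356 | -9.93 | 5.0 | Blue | O |
| 236 | 30839 | 834042 | 1194 | -10.63 | 5.0 | Blue | O |
| 237 | 8829 | 537493 | 1423 | -10.73 | 5.0 | White | A |
| 238 | 9235 | 404940 | 1112 | -11.23 | 5.0 | White | A |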
