## **Importing the libraries required**

In [1]:
# Importing the basic libraries we will require for the project

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import csv
import json
import os

# Importing the Machine Learning models we require from Scikit-Learn
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Importing the other functions we may require from Scikit-Learn
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import recall_score, roc_curve, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn import metrics, model_selection

# Setting the random seed to 1 for reproducibility of results
import random
random.seed(1)
np.random.seed(1)

# Code to ignore warnings from function usage
import warnings
warnings.filterwarnings('ignore')

In [5]:
df_astro = pd.read_csv('Skyserver250k.csv');
df_astro.head()

Unnamed: 0,objid,ra,dec,u,g,r,i,z,run,rerun,camcol,field,specobjid,class,redshift,plate,mjd,fiberid
0,1237661976015274033,196.362072,7.667016,19.32757,19.20759,19.16249,19.07652,18.86196,3842,301,4,102,2020027785916999680,QSO,1.984419,1794,54504,594
1,1237661362373066810,206.614664,45.924279,18.95918,17.09173,16.25019,15.83413,15.55686,3699,301,5,121,1649585252231833600,GALAXY,0.064456,1465,53082,516
2,1237661360767238272,220.294728,40.894575,17.75587,16.547,16.67694,16.7778,16.88097,3699,301,2,194,3812387877359296512,STAR,-0.000509,3386,54952,330
3,1237665440983416884,206.315349,27.438152,19.29195,19.1272,19.03992,18.76714,18.73874,4649,301,2,152,6762291282364878848,QSO,1.882893,6006,56105,496
4,1237665531717812262,228.092653,20.807371,19.19731,18.26143,17.89954,17.7613,17.68726,4670,301,3,201,4454292673071960064,STAR,-0.000295,3956,55656,846


In [6]:
df_astro.shape

(250000, 18)

- This dataset has **250,000 rows and 18 columns**.
- The dataset is quite voluminous, and has a high rows-to-columns ratio. This is quite **typical of astronomical datasets**, due to the vast number of celestial objects in the universe that can be detected today by modern telescopes and observatories.

In [7]:
df_astro['class'].value_counts()

GALAXY    127117
STAR       96116
QSO        26767
Name: class, dtype: int64

## **Data Overview**
### **First 5 & Last 5 Rows of the Dataset**
Let's **view the first few rows and last few rows** of the dataset in order to understand its structure a little better.

We will use the head() and tail() methods from Pandas to do this.

In [8]:
# first 5 rows of the dataset
df_astro.head()

Unnamed: 0,objid,ra,dec,u,g,r,i,z,run,rerun,camcol,field,specobjid,class,redshift,plate,mjd,fiberid
0,1237661976015274033,196.362072,7.667016,19.32757,19.20759,19.16249,19.07652,18.86196,3842,301,4,102,2020027785916999680,QSO,1.984419,1794,54504,594
1,1237661362373066810,206.614664,45.924279,18.95918,17.09173,16.25019,15.83413,15.55686,3699,301,5,121,1649585252231833600,GALAXY,0.064456,1465,53082,516
2,1237661360767238272,220.294728,40.894575,17.75587,16.547,16.67694,16.7778,16.88097,3699,301,2,194,3812387877359296512,STAR,-0.000509,3386,54952,330
3,1237665440983416884,206.315349,27.438152,19.29195,19.1272,19.03992,18.76714,18.73874,4649,301,2,152,6762291282364878848,QSO,1.882893,6006,56105,496
4,1237665531717812262,228.092653,20.807371,19.19731,18.26143,17.89954,17.7613,17.68726,4670,301,3,201,4454292673071960064,STAR,-0.000295,3956,55656,846


In [9]:
# view the last 5 rows of the dataset
df_astro.tail()

Unnamed: 0,objid,ra,dec,u,g,r,i,z,run,rerun,camcol,field,specobjid,class,redshift,plate,mjd,fiberid
249995,1237661360767565997,221.101714,40.641045,19.30748,18.22145,17.61426,17.3224,17.02841,3699,301,2,199,1572955889466370048,GALAXY,0.158805,1397,53119,268
249996,1237667783903084693,171.645089,22.797546,19.19911,17.79553,17.03988,16.63705,16.31786,5194,301,6,381,2814852916416899072,GALAXY,0.03443,2500,54178,375
249997,1237648704591233226,215.751103,0.044486,18.88386,17.51738,16.89393,16.39914,16.07888,752,301,4,482,342430828837496832,GALAXY,0.078681,304,51609,572
249998,1237660634386923630,163.501339,44.470389,18.49867,16.73666,15.88036,15.48524,15.14631,3530,301,1,286,1614541342266910720,GALAXY,0.048519,1434,53053,3
249999,1237668271376236954,236.495641,12.142432,19.39894,18.4055,18.06466,17.93771,17.87524,5308,301,2,295,5517031135849500672,STAR,0.000864,4900,55739,442


### **Datatypes of the Features**
Next, **let's check the datatypes** of the columns in the dataset. 

We are interested to know how many numerical and how many categorical features this dataset possesses.

In [10]:
# check the datatypes of the columns in the dataset
df_astro.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 18 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   objid      250000 non-null  int64  
 1   ra         250000 non-null  float64
 2   dec        250000 non-null  float64
 3   u          250000 non-null  float64
 4   g          250000 non-null  float64
 5   r          250000 non-null  float64
 6   i          250000 non-null  float64
 7   z          250000 non-null  float64
 8   run        250000 non-null  int64  
 9   rerun      250000 non-null  int64  
 10  camcol     250000 non-null  int64  
 11  field      250000 non-null  int64  
 12  specobjid  250000 non-null  uint64 
 13  class      250000 non-null  object 
 14  redshift   250000 non-null  float64
 15  plate      250000 non-null  int64  
 16  mjd        250000 non-null  int64  
 17  fiberid    250000 non-null  int64  
dtypes: float64(8), int64(8), object(1), uint64(1)
memory usage: 34.3

- As we can see above, apart from the `class` variable (the target variable), which is of the **object** datatype and is **categorical** in nature, all the predictor variables here are **numerical**: they have **int64**, **uint64** and **float64** datatypes.
- So this is a **classification problem where the original feature set is entirely numerical.** Numerical datasets like this, made up of measurement values, are **quite often found in astronomy** and are well suited to machine learning, since the algorithms operate directly on numbers.
- The above table also confirms what we found earlier, that there are 250,000 rows and 18 columns in the original dataset. Since every column here has the same number (250,000) of non-null values, we can also conclude that **there is no missing data in the table** (due to the high quality of the data source), and we can proceed without needing to worry about missing value imputation techniques.
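The split between numerical and categorical columns described above can also be obtained programmatically. The following is a minimal sketch on a small illustrative frame (the `toy` DataFrame and its values are assumptions made for the example, not the real dataset), using the Pandas `select_dtypes()` method.

```python
import pandas as pd

# Toy frame mimicking a few columns of the dataset (illustrative values only)
toy = pd.DataFrame({
    "ra": [196.36, 206.61, 220.29],
    "run": [3842, 3699, 3699],
    "class": ["QSO", "GALAXY", "STAR"],
})

# Columns with any numeric dtype (int64, uint64, float64, ...)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical here
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

On the real data, the same two calls on `df_astro` would be expected to place only `class` in the non-numeric group.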

In [12]:
# Continue with a random sample of 50,000 of the 250,000 rows
df_astro = df_astro.sample(n=50000)
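The sampling cell above passes no `random_state`, so its result depends on the global NumPy seed set in the first cell and on the order in which cells are run. As a hedged alternative, the sketch below passes `random_state` explicitly so the same rows are drawn every time; the `toy` frame and the seed value 1 are assumptions for illustration, and using this on the real data would not necessarily reproduce the exact sample shown in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the full dataset (illustrative only)
toy = pd.DataFrame({"value": np.arange(1000)})

# Two independent draws with the same explicit seed
sample_a = toy.sample(n=100, random_state=1)
sample_b = toy.sample(n=100, random_state=1)

# Both draws select exactly the same rows, in the same order
same = sample_a.index.equals(sample_b.index)
print(same)
```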

### **Missing Values**

In [14]:
# Checking for any missing values just in case
df_astro.isnull().sum()

objid        0
ra           0
dec          0
u            0
g            0
r            0
i            0
z            0
run          0
rerun        0
camcol       0
field        0
specobjid    0
class        0
redshift     0
plate        0
mjd          0
fiberid      0
dtype: int64

- It is hence confirmed that there are **no missing values** in this dataset.

### **Duplicate Rows**
Let's also do a quick check to see whether any rows in this dataset are duplicates of each other, even though we do not expect any given the source of this data.

In [15]:
# Let's also check for duplicate rows in the dataset
df_astro.duplicated().sum()

0

As seen above, there are **no duplicate rows** in the dataset either.
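Note that `duplicated()` with no arguments only flags rows that match on every column. A minimal sketch, on an illustrative `toy` frame that is not part of the dataset, shows how the `subset` parameter can be used to check for repeats on a key column alone:

```python
import pandas as pd

# Toy frame with one fully duplicated row and one repeated ID (illustrative only)
toy = pd.DataFrame({
    "objid": [1, 2, 2, 3, 3],
    "u": [19.3, 18.9, 18.9, 17.7, 17.8],
})

# Rows identical to an earlier row in every column
full_duplicates = int(toy.duplicated().sum())

# Rows whose objid already appeared earlier, regardless of other columns
id_duplicates = int(toy.duplicated(subset=["objid"]).sum())

print(full_duplicates, id_duplicates)
```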

### **Class Distribution**
Let's now look at the percentage class distribution of the target variable `class` in this classification dataset.

In [17]:
# Percentage class distribution of the target variable "class"
df_astro['class'].value_counts(normalize=True) * 100

GALAXY    51.074
STAR      38.368
QSO       10.558
Name: class, dtype: float64

- **More than 50%** of the rows in this dataset are **Galaxies**.
- Over **38%** of the instances are **Stars**, and just over **10%** of the rows belong to the **QSO (Quasar)** class.
- As mentioned while giving the context for this problem statement, although they are among the most luminous objects in the universe, **quasars are very rare** for astronomers to observe. So it makes sense that they comprise the smallest percentage of the data points present in the `class` variable.
- This can hence be considered a **somewhat imbalanced** classification problem. In the full 250,000-row dataset, the smallest class (QSO) has over 25,000 examples; in the 50,000-row sample used from here on, it has about 5,300 (10.558% of 50,000). Even after train-test splits, that should be enough training data for a machine learning algorithm to learn the patterns leading to that classification.
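Because the classes are imbalanced, one common way to keep the class percentages the same in the training and test sets is a stratified split. The sketch below is an illustration on synthetic labels with roughly the proportions reported above; the `toy` frame, the 80/20 split and the seed are assumptions, not the split used later in this project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the observed proportions (illustrative only)
labels = np.array(["GALAXY"] * 510 + ["STAR"] * 380 + ["QSO"] * 110)
toy = pd.DataFrame({"feature": np.arange(len(labels)), "class": labels})

# stratify= keeps the class proportions (nearly) identical in both parts
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["class"], random_state=1
)

train_share = train["class"].value_counts(normalize=True)
test_share = test["class"].value_counts(normalize=True)

print(train_share)
print(test_share)
```

Without `stratify`, the QSO share in a small test set could drift away from its share in the full data purely by chance.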

In [19]:
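# Encode the target labels as integers
# (LabelEncoder assigns codes in sorted order: GALAXY=0, QSO=1, STAR=2)
le = LabelEncoder()
df_astro["class"] = le.fit_transform(df_astro["class"])
df_astro["class"] = df_astro["class"].astype(int)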

In [20]:
df_astro['class']

240208    2
18744     0
207175    0
18669     0
189086    2
         ..
12026     0
101461    2
146611    1
152140    0
168265    1
Name: class, Length: 50000, dtype: int32
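The integer codes in the output above are only meaningful together with the label they stand for. `LabelEncoder` stores the original labels in sorted order in its `classes_` attribute, so the mapping can be recovered as in this minimal sketch (the short label list is illustrative, not taken from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Fit on a small illustrative list of class labels
le = LabelEncoder()
encoded = le.fit_transform(["QSO", "GALAXY", "STAR", "GALAXY"])

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'GALAXY': 0, 'QSO': 1, 'STAR': 2}

# inverse_transform converts codes back to the original labels
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

Keeping the fitted encoder around makes it possible to translate model predictions back into class names when reading a confusion matrix or classification report.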

### **Statistical Summary**
Since the predictor variables in this machine learning problem are all numerical, a **statistical summary** is a useful way to understand the statistical properties of the features of our dataset.

In [21]:
# We would like the format of the values in the table to be simple float numbers with 5 decimal places, hence the code below
pd.set_option('display.float_format', lambda x: '%.5f' % x)

# Let's view the statistical summary of the columns in the dataset
df_astro.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
objid,50000.0,1.2376625924027167e+18,7207093862089.521,1.2376459429054385e+18,1.237657629514236e+18,1.2376622680743939e+18,1.2376672115999048e+18,1.2376805308128955e+18
ra,50000.0,178.38919,77.87886,0.01518,138.12116,181.08311,224.55207,359.99357
dec,50000.0,24.46484,20.08817,-19.50182,6.84518,23.1235,39.70158,84.79483
u,50000.0,18.63623,0.82798,11.41754,18.20821,18.86979,19.26691,19.59998
g,50000.0,17.40655,0.98268,9.66834,16.84567,17.5123,18.05566,19.99148
r,50000.0,16.88019,1.12646,9.05049,16.19581,16.8897,17.58307,31.41264
i,50000.0,16.62614,1.20586,8.80997,15.863,16.59721,17.3432,29.09998
z,50000.0,16.46675,1.27357,9.22884,15.62477,16.42961,17.23214,28.75626
run,50000.0,3985.58622,1678.04277,109.0,2830.0,3910.0,5061.0,8162.0
rerun,50000.0,301.0,0.0,301.0,301.0,301.0,301.0,301.0


**Observations:**
- The maximum value of `redshift` is 6.4. Its minimum is negative (see the last observation below), so the variable spans both sides of zero.
- The mean of alpha (`ra`) is 178.39 and its standard deviation is 77.88, whereas the mean and standard deviation of delta (`dec`) are 24.46 and 20.09.
- The statistical summaries of the `r`, `i` and `z` variables are broadly similar, and their ranges of values are comparable (roughly 9 to 30).
- The `dec` and `redshift` features contain negative values.
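The observation about negative values can be checked directly rather than read off the summary table. A minimal sketch, assuming a small illustrative `toy` frame (a few values are copied from the rows and summary shown earlier, but it is not the real dataset):

```python
import pandas as pd

# Toy frame with a handful of values for three columns (illustrative only)
toy = pd.DataFrame({
    "dec": [7.667016, -19.50182, 40.894575],
    "redshift": [1.984419, 0.064456, -0.000509],
    "u": [19.32757, 18.95918, 17.75587],
})

# For each column, True if at least one value is below zero
has_negative = (toy < 0).any()

# Names of the columns that contain negative values
negative_cols = has_negative[has_negative].index.tolist()
print(negative_cols)
```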

In [22]:
# Number of unique values in each column
df_astro.nunique()

objid        50000
ra           50000
dec          50000
u            44423
g            46340
r            46771
i            47078
z            47286
run            527
rerun            1
camcol           6
field          817
specobjid    50000
class            3
redshift     49713
plate         5726
mjd           2134
fiberid        996
dtype: int64

- The `objid` and `specobjid` columns are clearly unique IDs, which is why they have as many unique values as there are rows in the dataset. (`ra` and `dec` are also unique in every row, but they are continuous sky coordinates rather than identifiers.)

### **Data Preprocessing - Removal of ID columns**
Since the `objid` and `specobjid` columns are unique IDs, they will not add any predictive power to the machine learning model, and they can hence be removed.

In [23]:
# Removing the objid and specobjid columns from the dataset
df_astro.drop(columns=['objid', 'specobjid'], inplace=True)
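The ID columns above were identified by inspecting the `nunique()` output by eye. As a rough heuristic, the same idea can be sketched in code: integer columns that are unique in every row are flagged as ID-like, and columns with a single value are flagged as constant. The `toy` frame and the variable names below are assumptions for illustration, and the heuristic would still need a manual check on real data.

```python
import pandas as pd

# Toy frame with an ID column, a constant column and two ordinary ones
toy = pd.DataFrame({
    "objid": [101, 102, 103, 104],
    "rerun": [301, 301, 301, 301],
    "camcol": [4, 5, 2, 2],
    "u": [19.33, 18.96, 17.76, 19.29],
})

n_unique = toy.nunique()

# Integer columns with one distinct value per row look like identifiers.
# Restricting to integers avoids flagging continuous measurements,
# which are often unique in every row as well.
id_like_cols = [
    col for col in toy.columns
    if n_unique[col] == len(toy) and pd.api.types.is_integer_dtype(toy[col])
]

# Columns with a single distinct value carry no information
constant_cols = n_unique[n_unique == 1].index.tolist()

# drop() without inplace=True returns a new frame and leaves toy unchanged
reduced = toy.drop(columns=id_like_cols)

print(id_like_cols, constant_cols, reduced.columns.tolist())
```

In the `nunique()` output shown earlier, `rerun` has a single unique value, so a check like `constant_cols` would flag it on the real data.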