# Special Topics I Problem Set 5 (Classification)

First, download the dataset for this exercise from this <a href="https://github.com/mzhoolideh/KNTU_ML_2023/blob/main/data/ps5data.csv" download>link</a>.

The dataset provided to you includes several features of stars.

Some of them are:

- Absolute Temperature (in $K$)
- Relative Luminosity ($L/L_{\odot}$)
- Relative Radius ($R/R_{\odot}$)
- Absolute Magnitude ($M_{v}$)
- Star Color (**White**, **Red**, **Blue**, **Yellow**, **yellow-orange** etc)
- Spectral Class (**O**, **B**, **A**, **F**, **G**, **K**, **M**)
- Star Type (**Red Dwarf**, **Brown Dwarf**, **White Dwarf**, **Main Sequence**, **SuperGiants**, **HyperGiants**)

The relative quantities are expressed in solar units:

- $L_{\odot} = 3.828 \times 10^{26} \; \text{W}$ (average luminosity of the Sun)
- $R_{\odot} = 6.9551 \times 10^{8} \; \text{m}$ (average radius of the Sun)
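
Because luminosity and radius are given relative to the Sun, multiplying by the two solar constants recovers SI values. A minimal sketch; the helper names `to_watts` and `to_metres` and the input values are illustrative and not part of the exercise:

```python
# Solar reference values quoted in this problem set
L_SUN_WATTS = 3.828e26   # average luminosity of the Sun, in W
R_SUN_METRES = 6.9551e8  # average radius of the Sun, in m


def to_watts(relative_luminosity):
    """Convert a luminosity in solar units (L/Lo) to watts."""
    return relative_luminosity * L_SUN_WATTS


def to_metres(relative_radius):
    """Convert a radius in solar units (R/Ro) to metres."""
    return relative_radius * R_SUN_METRES


# Hypothetical input: a star with L/Lo = 0.0024 and R/Ro = 0.17
luminosity_w = to_watts(0.0024)
radius_m = to_metres(0.17)
print(f"{luminosity_w:.4e} W, {radius_m:.4e} m")
```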

Import the necessary libraries.

In [61]:
# Do it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

Read the dataset and store it in a variable.

In [62]:
# Do it.
df = pd.read_csv('../../data/ps5data.csv')

## Part 1: Data Analysis and Preprocessing

Display the first $5$ rows of the dataset.

In [63]:
# Do it.
df.head(5)

| | Temperature (K) | Luminosity (L/Lo) | Radius (R/Ro) | Absolute magnitude (Mv) | Star type | Star category | Star color | Spectral Class |
|---|---|---|---|---|---|---|---|---|
| 0 | 3068 | 0.0024 | 0.17 | 16.12 | 0 | Brown Dwarf | Red | M |
| 1 | 3042 | 0.0005 | 0.1542 | 16.6 | 0 | Brown Dwarf | Red | M |
| 2 | 2600 | 0.0003 | 0.102 | 18.7 | 0 | Brown Dwarf | Red | M |
| 3 | 2800 | 0.0002 | 0.16 | 16.65 | 0 | Brown Dwarf | Red | M |
| 4 | 1939 | 0.000138 | 0.103 | 20.06 | 0 | Brown Dwarf | Red | M |


Is there any missing data in the dataset provided to you? Show the number of missing values in each column of the dataset.

In [64]:
# Do it.
missing_data = df.isnull().sum()
print(missing_data)
print(df.shape)

Temperature (K)            0
Luminosity (L/Lo)          0
Radius (R/Ro)              0
Absolute magnitude (Mv)    0
Star type                  0
Star category              0
Star color                 0
Spectral Class             0
dtype: int64
(240, 8)


Display the number of non-null values in each column, the data type of each column, and the amount of memory (RAM) the dataset occupies.

In [65]:
# Do it.
df.info()
total_memory_usage = df.memory_usage(deep=True).sum()
print(f"Total Memory Usage: {total_memory_usage} bytes")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Temperature (K)          240 non-null    int64  
 1   Luminosity (L/Lo)        240 non-null    float64
 2   Radius (R/Ro)            240 non-null    float64
 3   Absolute magnitude (Mv)  240 non-null    float64
 4   Star type                240 non-null    int64  
 5   Star category            240 non-null    object 
 6   Star color               240 non-null    object 
 7   Spectral Class           240 non-null    object 
dtypes: float64(3), int64(2), object(3)
memory usage: 15.1+ KB
Total Memory Usage: 54824 bytes
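
The two figures differ because `df.info()` by default counts only the references stored in `object` columns (hence the `+` in `15.1+ KB`), while `memory_usage(deep=True)` also measures the Python strings those references point to. A small sketch on a made-up frame; the values are illustrative only:

```python
import pandas as pd

# Made-up frame with one numeric and one object (string) column
toy = pd.DataFrame({
    "Temperature (K)": [3068, 3042, 2600],
    "Star color": pd.Series(["Red", "Red", "Blue"], dtype=object),
})

shallow = toy.memory_usage(deep=False).sum()  # references only
deep = toy.memory_usage(deep=True).sum()      # references plus string contents
print(f"shallow: {shallow} bytes, deep: {deep} bytes")

# Passing memory_usage="deep" makes info() report the deep figure as well
toy.info(memory_usage="deep")
```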


Compute the **count**, **mean**, **standard deviation**, **minimum**, **first quartile**, **median**, **third quartile**, and **maximum** of the numeric columns, store them in a new DataFrame, and display it.

In [66]:
# Do it.
# describe() returns a new DataFrame holding the summary statistics
newDf = df.describe()
newDf

| | Temperature (K) | Luminosity (L/Lo) | Radius (R/Ro) | Absolute magnitude (Mv) | Star type |
|---|---|---|---|---|---|
| count | 240.0 | 240.0 | 240.0 | 240.0 | 240.0 |
| mean | 10497.4625 | 107188.361635 | 237.157781 | 4.382396 | 2.5 |
| std | 9552.425037 | 179432.24494 | 517.155763 | 10.532512 | 1.711394 |
| min | 1939.0 | 8e-05 | 0.0084 | -11.92 | 0.0 |
| 25% | 3344.25 | 0.000865 | 0.10275 | -6.2325 | 1.0 |
| 50% | 5776.0 | 0.0705 | 0.7625 | 8.313 | 2.5 |
| 75% | 15055.5 | 198050.0 | 42.75 | 13.6975 | 4.0 |
| max | 40000.0 | 849420.0 | 1948.5 | 20.06 | 5.0 |
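
`describe()` returns a separate DataFrame whose index holds the statistic names, so individual values can be looked up with `.loc`. A sketch on made-up numbers:

```python
import pandas as pd

toy = pd.DataFrame({
    "Temperature (K)": [2000, 4000, 6000, 8000],
    "Star color": ["Red", "Red", "Yellow", "White"],  # non-numeric, skipped by default
})

summary = toy.describe()                        # a new DataFrame, independent of toy
print(summary.index.tolist())                   # names of the statistics
print(summary.loc["50%", "Temperature (K)"])    # the median
```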


Display the names of all the columns in the dataset in the form of an Index object.

In [67]:
# Do it.
# df.columns is the Index of column labels; df.keys() returns the same object
colNames = df.columns
colNames


Index(['Temperature (K)', 'Luminosity (L/Lo)', 'Radius (R/Ro)',
       'Absolute magnitude (Mv)', 'Star type', 'Star category', 'Star color',
       'Spectral Class'],
      dtype='object')

How many stars in this dataset belong to each spectral class? Display the counts.

In [68]:
# Do it.
spectralClassCounts = df["Spectral Class"].value_counts()
spectralClassCounts

M    111
B     46
O     40
A     19
F     17
K      6
G      1
Name: Spectral Class, dtype: int64

Replace the most frequent spectral class in the dataset with the value $0$ and all other spectral classes with the value $1$.

In [69]:
# Do it.
largestSpectral = spectralClassCounts.idxmax()
df['Spectral Class'] = df['Spectral Class'].replace(largestSpectral, 0)
df['Spectral Class'] = df['Spectral Class'].apply(lambda x: 1 if x != 0 else 0)
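
The same relabelling can also be written as a single vectorized comparison, which avoids the intermediate state where the column mixes the integer `0` with class letters. A sketch on a made-up Series, offered as an alternative rather than a required change:

```python
import pandas as pd

# Made-up sample of spectral classes
spectral = pd.Series(["M", "B", "M", "O", "A", "M"], name="Spectral Class")

most_common = spectral.value_counts().idxmax()   # most frequent label in the sample
binary = (spectral != most_common).astype(int)   # 0 = most common class, 1 = the rest
print(binary.tolist())
```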

In your opinion, why did we replace the most frequent spectral class with the value $0$ and all other spectral classes with the value $1$? Write your analysis.

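>
> Logistic regression in its basic form is a binary classifier: it passes a linear combination of the features through the sigmoid function to estimate the probability of class $1$. Collapsing the seven spectral classes into "M" ($0$) versus "not M" ($1$) turns the task into a binary problem, and choosing the most frequent class gives the most balanced one-vs-rest split available here ($111$ stars versus $129$).
>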

How many stars of each type are there in this dataset? Display the counts.

In [70]:
# Do it.
starTypeCounts = df['Star type'].value_counts()
starTypeCounts

0    40
1    40
2    40
3    40
4    40
5    40
Name: Star type, dtype: int64

In the star color column, replace **red** color with value $0$, **yellow** color with value $1$, **white** color with value $2$ and **blue** color with value $3$.

In [71]:
# Do it.
df['Star color'] = df['Star color'].replace({'Red': 0, 'Yellow': 1, 'White': 2, 'Blue': 3})
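
The color column also contains values outside these four names (the feature list mentions **yellow-orange**, for example), and `replace` leaves any label it does not recognize unchanged as a string. One possible way to surface those leftovers, assuming that normalizing case and whitespace is acceptable for this exercise, is to use `map`, which turns unmapped labels into `NaN`. The sample values below are made up:

```python
import pandas as pd

# Made-up sample with inconsistent spelling and an unmapped color
colors = pd.Series(["Red", "red ", "Blue", "yellow-orange", "White"])

color_codes = {"red": 0, "yellow": 1, "white": 2, "blue": 3}
normalized = colors.str.strip().str.lower()
encoded = normalized.map(color_codes)         # labels missing from the dict become NaN

print(encoded.tolist())
print(normalized[encoded.isna()].unique())    # labels that still need a decision
```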

## Part 2: Machine Learning

Consider the **temperature**, **luminosity**, **radius** and **absolute magnitude** of the star as *features* and the **spectral class** of the star as *target*.

In [72]:
# Do it.
features = df[['Temperature (K)', 'Luminosity (L/Lo)', 'Radius (R/Ro)', 'Absolute magnitude (Mv)']]
target = df['Spectral Class']
Data = features.join(target)
Data

| | Temperature (K) | Luminosity (L/Lo) | Radius (R/Ro) | Absolute magnitude (Mv) | Spectral Class |
|---|---|---|---|---|---|
| 0 | 3068 | 0.002400 | 0.1700 | 16.12 | 0 |
| 1 | 3042 | 0.000500 | 0.1542 | 16.60 | 0 |
| 2 | 2600 | 0.000300 | 0.1020 | 18.70 | 0 |
| 3 | 2800 | 0.000200 | 0.1600 | 16.65 | 0 |
| 4 | 1939 | 0.000138 | 0.1030 | 20.06 | 0 |
| ... | ... | ... | ... | ... | ... |
| 235 | 38940 | 374830.000000 | 1356.0000 | -9.93 | 1 |
| 236 | 30839 | 834042.000000 | 1194.0000 | -10.63 | 1 |
| 237 | 8829 | 537493.000000 | 1423.0000 | -10.73 | 1 |
| 238 | 9235 | 404940.000000 | 1112.0000 | -11.23 | 1 |


Consider $80\%$ of the data as *train* data and the remaining $20\%$ as *test* data.

> Make sure the split does not follow the order of the rows: the data must be shuffled and separated completely at random.

In [77]:
# Do it.
# shuffle=True (the default) randomizes row order; random_state makes the split reproducible
train_data, test_data = train_test_split(Data, test_size=0.2, random_state=42, shuffle=True)
X_train, X_test, Y_train, Y_test = train_test_split(features, target, test_size=0.2, random_state=42)
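
With $240$ rows, an 80/20 split yields $192$ training rows and $48$ test rows. The sketch below checks those sizes on synthetic data and, as an optional extra that the exercise does not require, passes `stratify` so both parts keep roughly the same class proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(240, 4))            # synthetic stand-in for the four features
y = np.array([0] * 111 + [1] * 129)      # same class counts as the binarized target

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())          # share of class 1 in each part
```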

Use the Logistic regression model and fit it on the *train* data.

In [74]:
# Do it.
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
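
The four features span very different scales (luminosity reaches $849420$ while radius can be as small as $0.0084$), which can slow or prevent convergence of the default solver. One common remedy, shown here as a sketch on synthetic data rather than as part of the required solution, is to standardize the features in a pipeline before fitting:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: two features on very different scales
X = np.column_stack([rng.normal(10_000, 5_000, 200), rng.normal(0, 1, 200)])
y = (X[:, 1] > 0).astype(int)            # label depends only on the small-scale feature

# The scaler is fitted on the training data and applied before the classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.score(X, y))                 # training accuracy
```

Keeping the scaler inside the pipeline means the same transformation is applied automatically whenever `predict` or `score` is called on new data.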

Using the model you built, make predictions for the *test* data and display your predictions.

In [75]:
# Do it.

Get the accuracy of your model to predict unseen data.

In [76]:
# Do it.
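
Accuracy is the fraction of test rows whose predicted label matches the true one. The sketch below, again on synthetic data, shows two equivalent ways to obtain it plus a confusion matrix; in the notebook the same calls would take `X_test` and `Y_test`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the star features and binary target
X, y = make_classification(
    n_samples=240, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

acc_metric = accuracy_score(y_te, y_pred)   # fraction of correct predictions
acc_score = clf.score(X_te, y_te)           # same quantity via the estimator
cm = confusion_matrix(y_te, y_pred)         # rows: true class, columns: predicted class
print(acc_metric, acc_score)
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the number of test rows equals the accuracy.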