My name is Utsav Agarwal, and some of the things I am interested in are Rubik's Cubes and Astronomy.

I have written a program that explores how a star's temperature, luminosity, radius, and absolute magnitude relate to the star's color. The code also tests a few different machine learning algorithms to see which one fits the data best. I decided to use this dataset and do this project because the dataset I found on Kaggle looked really interesting. When I talked to JoJo during office hours, he suggested I answer the question, "Does having more features in your dataset always result in a better model?"

I only have one codebox with my final code, but I added comments throughout explaining what each section does and any changes I had to make. There is also a section at the end where I answer the question above, list 2 problems I faced, and list some of the extra resources I used.

In [4]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [6]:
# read_csv already returns a DataFrame, so no extra wrapping is needed
og_df = pd.read_csv('Stars.csv')

"""
The Color feature of the dataset had a lot of colors that were essentially the same but used different names.
In this next part I give all the colors that are similar to each other the same name.
"""
#Change colors related to yellow and white to Yellowish White
og_df = og_df.replace(
    to_replace=["White Yellow", "Yellowish", "White-Yellow", "Whitish", "White",
                "white Yellow", "yellowish", "white", "yellow-white"],
    value="Yellowish White",
)
#Change colors related to orange to Orange
og_df = og_df.replace(to_replace="Pale yellow orange", value="Orange")
og_df = og_df.replace(to_replace="Orange-Red", value="Orange")
#Change Blue-white to Blue White
og_df = og_df.replace(to_replace="Blue-white", value="Blue White")
og_df = og_df.replace(to_replace="Blue-White", value="Blue White")

"""
I create dataframes with the features I intend to use
I turn the color dataframe into a NumPy array so I can use .ravel(), which flattens the array to 1d, the shape the Machine Learning Models expect for the labels
"""
T = og_df[['Temperature']].copy()
L = og_df[['L']].copy()
color = og_df[['Color']].copy()
color = color.to_numpy()
radius = og_df[['R']].copy()
absolute_magnitude = og_df[['A_M']].copy()

"""
I use train_test_split() to create the train and test sets for the features I intend to use: Temperature, Luminosity (L)*, Relative Radius (R)*, Absolute Magnitude (A_M), and color
I then make the dataframes that hold the combinations of features I want to use to predict the color

If the beginning of the variable name has:
TL - Combination of Temperature and Luminosity
TLR - Combination of Temperature, Luminosity, and Relative Radius
TLRA - Combination of Temperature, Luminosity, Relative Radius, and Absolute Magnitude

If the variable name has 'C' after the letters above, it refers to the Machine Learning model created


*Luminosity and Radius here are relative to the Sun
"""
(temp_train, temp_test, l_train, l_test, color_train, color_test,
 radi_train, radi_test, mag_train, mag_test) = train_test_split(
    T, L, color, radius, absolute_magnitude, test_size=0.2)
TL_train = pd.concat([temp_train, l_train], axis=1)
TL_test = pd.concat([temp_test, l_test], axis=1)
TLR_train = pd.concat([temp_train, l_train, radi_train], axis=1)
TLR_test = pd.concat([temp_test, l_test, radi_test], axis=1)
TLRA_train = pd.concat([temp_train, l_train, radi_train, mag_train], axis=1)
TLRA_test = pd.concat([temp_test, l_test, radi_test, mag_test], axis=1)

color_train = color_train.ravel()

"""
Before comparing the features, I had to find the best model to use
I tried K Nearest Neighbors first, but the accuracy of the model was lower than anticipated
The Random Forest model yielded a much better accuracy
The Decision Tree model was better than KNN, but not as good as Random Forest*

*This is most likely because the Random Forest Machine Learning Model combines multiple Decision Trees within its algorithm, making it a more refined version of a single Decision Tree
"""
print("TESTING MODELS")
print('------------------------------------------------------------------------------------------------------------------')

#Testing different models, K Nearest Neighbors
n=8
TLC = KNeighborsClassifier(n_neighbors = n)
TLC.fit(TL_train, color_train)
TLC_pre = TLC.predict(TL_test)

print("1. Accuracy Score of KNN (Temperature+Luminosity->Color) is {}".format(accuracy_score(color_test, TLC_pre)))

"""
Before trying other models, I wanted to make sure that the low accuracy score wasn't due to the data (as in the star's temperature and luminosity didn't correlate with
its color), so I tried the KNN model with just Temperature and with just Luminosity. Both models were at about the same accuracy, or lower, so I knew the problem was with the model.
It was here that I discovered I had initially made a mistake with the replace statements I used earlier, and was able to clean up the data properly.
"""
TC = KNeighborsClassifier(n_neighbors = n)
TC.fit(temp_train, color_train)
TC_pre = TC.predict(temp_test)

print("2. Accuracy Score of KNN (Temperature->Color) is {}".format(accuracy_score(color_test, TC_pre)))

LC = KNeighborsClassifier(n_neighbors = n)
LC.fit(l_train, color_train)
LC_pre = LC.predict(l_test)

print("3. Accuracy Score of KNN (Luminosity->Color) is {}".format(accuracy_score(color_test,LC_pre)))

print('------------------------------------------------------------------------------------------------------------------')

# Testing Random Forest
rTLC = RandomForestClassifier(max_depth = 4)
rTLC.fit(TL_train, color_train)
rTLC_pre = rTLC.predict(TL_test)

print("4. Accuracy Score of Random Forest (Temperature+Luminosity->Color) is {}".format(accuracy_score(color_test, rTLC_pre)))
"""
The accuracy of the Random Forest Model lands in a nice range across runs, from 0.8 to 0.95. I decided to try one more model to see if I could improve the accuracy.
"""
print('------------------------------------------------------------------------------------------------------------------')
# Trying Decision Tree
dTLC = DecisionTreeClassifier(max_depth = 4, criterion = 'entropy')
dTLC.fit(TL_train, color_train)
dTLC_pre = dTLC.predict(TL_test)

print("5. Accuracy Score of Decision Tree (Temperature+Luminosity->Color) is {}".format(accuracy_score(color_test, dTLC_pre)))

"""
The accuracy of the Decision Tree model was lower than that of the Random Forest, ranging from 0.75 to 0.85. This makes sense, as I learned later that the
Random Forest algorithm uses multiple Decision Trees, hence the name forest. So there was a high chance of this model having a lower accuracy than
that of the Random Forest.
"""
print('------------------------------------------------------------------------------------------------------------------')
print('TESTING FEATURES')
print('------------------------------------------------------------------------------------------------------------------')
"""
With the model decided, I wanted to check the relationships between different features of a star and its color.
I started by testing models that only use Temperature and only use Luminosity, and added on the Relative Radius and Absolute Magnitude to see if they
had an effect on the accuracy.
"""

rTC = RandomForestClassifier(max_depth = 4)
rTC.fit(temp_train, color_train)
rTC_pre = rTC.predict(temp_test)

print("6. Accuracy Score of Random Forest (Temperature->Color) is {}".format(accuracy_score(color_test, rTC_pre)))

print('------------------------------------------------------------------------------------------------------------------')

rLC = RandomForestClassifier(max_depth = 4)
rLC.fit(l_train, color_train)
rLC_pre = rLC.predict(l_test)

print("7. Accuracy Score of Random Forest (Luminosity->Color) is {}".format(accuracy_score(color_test,rLC_pre)))


print('------------------------------------------------------------------------------------------------------------------')

"""
With the accuracy of the Temperature and Luminosity solo models being lower than the model which included both, I started to add Relative Radius to the 
model to see its effect.
"""

rTLRC = RandomForestClassifier(max_depth = 4)
rTLRC.fit(TLR_train, color_train)
rTLRC_pre = rTLRC.predict(TLR_test)

print("8. Accuracy Score of Random Forest (Temperature+Luminosity+Radius->Color) is {}".format(accuracy_score(color_test, rTLRC_pre)))

print('------------------------------------------------------------------------------------------------------------------')

"""
The model with Temperature, Luminosity, and Relative Radius had an accuracy of about 0.875 to 0.98, so I believe the relative radius of a star helps slightly in predicting its color.
With that completed, I added Absolute Magnitude to the model to see if that had any effect.
"""
rTLRAC = RandomForestClassifier(max_depth = 4)
rTLRAC.fit(TLRA_train, color_train)
rTLRAC_pre = rTLRAC.predict(TLRA_test)

print("9. Accuracy Score of Random Forest (Temperature+Luminosity+Radius+Absolute Magnitude->Color) is {}".format(accuracy_score(color_test, rTLRAC_pre)))

"""
The model with Absolute Magnitude included with Temperature, Luminosity, and Relative Radius had an accuracy that was usually lower than that of the model without it, sometimes
equal, and rarely higher. So I conclude that Absolute Magnitude doesn't add useful information for predicting a star's color beyond what the other three features already provide.
"""
print('------------------------------------------------------------------------------------------------------------------')
print('Task complete. Thank you.')
print('Program made by Utsav Agarwal')

TESTING MODELS
------------------------------------------------------------------------------------------------------------------
1. Accuracy Score of KNN (Temperature+Luminosity->Color) is 0.7708333333333334
2. Accuracy Score of KNN (Temperature->Color) is 0.7916666666666666
3. Accuracy Score of KNN (Luminosity->Color) is 0.6875
------------------------------------------------------------------------------------------------------------------
4. Accuracy Score of Random Forest (Temperature+Luminosity->Color) is 0.8333333333333334
------------------------------------------------------------------------------------------------------------------
5. Accuracy Score of Decision Tree (Temperature+Luminosity->Color) is 0.7916666666666666
------------------------------------------------------------------------------------------------------------------
TESTING FEATURES
------------------------------------------------------------------------------------------------------------------
6. Accuracy S

-------
Question: Does having more features always result in a better model?

The answer I got to the question is no. Adding features often helps, as seen by the models with only one feature (Models 6 and 7) having the lowest accuracy, but the accuracy of Model 9, which has the most features, shows that it doesn't always help. This could be because the feature Absolute Magnitude doesn't add any information about the color of the star that the other features don't already provide.
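Because a single train/test split can move the accuracy by several points (the ranges noted in the code comments), one way to compare feature sets more steadily is cross-validation. The sketch below is only an illustration on synthetic data, not the star dataset; the names `informative`, `redundant`, and `noise` are made up for this example, and the redundant feature is just an assumed stand-in for a feature that repeats what another one already says.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the star dataset): one informative feature,
# one noisy copy of it, and one pure-noise feature
n = 200
informative = rng.normal(size=n)
redundant = informative + rng.normal(scale=0.1, size=n)
noise = rng.normal(size=n)
y = (informative > 0).astype(int)

feature_sets = {
    "informative": np.column_stack([informative]),
    "informative+redundant": np.column_stack([informative, redundant]),
    "informative+redundant+noise": np.column_stack([informative, redundant, noise]),
}

# 5-fold cross-validation averages over several splits, so the comparison
# depends less on one lucky or unlucky train/test split
mean_scores = {}
for name, X in feature_sets.items():
    model = RandomForestClassifier(max_depth=4, random_state=0)
    mean_scores[name] = cross_val_score(model, X, y, cv=5).mean()

for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
```

If the extra features carry nothing new, the mean scores should stay close to each other, which is the same pattern I saw when adding Absolute Magnitude.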

---------------------

Problems I Faced:

Problem 1:

I faced my first problem when I was starting to test out the models. I couldn't get any of them to work because the .fit() method was expecting a 1d array but an array with a different shape was passed. I looked up the error on stackoverflow and saw the solution was the .ravel() method, which flattens a NumPy array to 1d.

The error message I encountered: "DataConversionWarning: A column-vector y was passed when a 1d array was expected."
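A minimal sketch of the shape issue, using a made-up three-row label column (`labels_df` is hypothetical, not the real Stars.csv data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Color column of the dataset
labels_df = pd.DataFrame({"Color": ["Red", "Blue", "Red"]})

# Selecting with double brackets keeps a DataFrame, so to_numpy() gives a column vector
labels_2d = labels_df[["Color"]].to_numpy()
print(labels_2d.shape)  # (3, 1)

# ravel() flattens it into the 1d array that fit() expects for the labels
labels_1d = labels_2d.ravel()
print(labels_1d.shape)  # (3,)
```

Selecting the column with single brackets (`labels_df["Color"]`) returns a Series that is already 1d, so that would be another way to avoid the warning.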

Problem 2:

After the KNN models weren't as accurate as I had hoped, and after I found out that the problem wasn't with how well the temperature and luminosity features correlated with color, I displayed the different train and test sets to see if the problem lay in the data. It turned out that the .replace() calls I was previously using weren't doing what they were supposed to, and the train and test sets still contained the inconsistently named colors. I checked stackoverflow and the Pandas documentation and found I was using .replace() wrong: it returns a new dataframe instead of changing the original, so I had to assign the result back to the dataframe.

What I was doing before: og_df.replace(to_replace="Blue-White", value="Blue White")

What I had to do: og_df = og_df.replace(to_replace="Blue-White", value="Blue White")
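The difference can be seen on a tiny example. This is a sketch with a hypothetical two-row dataframe `df`, not the real data:

```python
import pandas as pd

# Hypothetical two-row frame standing in for the star data
df = pd.DataFrame({"Color": ["Blue-White", "Red"]})

# Without assignment: replace() returns a new DataFrame and df is left unchanged
df.replace(to_replace="Blue-White", value="Blue White")
unchanged = df["Color"].tolist()

# With assignment: the cleaned result is kept
df = df.replace(to_replace="Blue-White", value="Blue White")
cleaned = df["Color"].tolist()

print(unchanged)  # ['Blue-White', 'Red']
print(cleaned)    # ['Blue White', 'Red']
```

Calling `df["Color"].value_counts()` after cleaning is a quick way to check which color names are still left.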



----------
Extra Sources I Used in this Project:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html (and other pages on the Pandas Documentation)

https://www.kaggle.com/datasets/brsdincer/star-type-classification (Dataset)

stackoverflow.com (Reference)

https://medium.com/codex/credit-card-fraud-detection-with-machine-learning-in-python-ac7281991d87 (Reference)

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html (and other pages on the Sklearn Documentation)