## 1. Selection of Samples:
SMOTE first selects a random sample from the minority class.

## 2. Finding Neighbors:
It then finds the k nearest neighbors of this sample in the feature space. The search is restricted to the minority class, so every neighbor is itself a minority-class sample.
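
The neighbor search can be sketched in a few lines of NumPy. This is only an illustration under assumptions, not the code imbalanced-learn runs internally: the helper name `minority_neighbors`, the use of Euclidean distance, and the toy points are all hypothetical choices made for the example.

```python
import numpy as np

def minority_neighbors(X_min, index, k):
    """Return the indices of the k nearest minority-class neighbors of X_min[index].

    X_min is assumed to contain only minority-class samples (one row per sample).
    """
    # Euclidean distance from the chosen sample to every minority sample
    distances = np.linalg.norm(X_min - X_min[index], axis=1)
    # A sample is not its own neighbor, so push its distance to infinity
    distances[index] = np.inf
    # Indices of the k smallest distances
    return np.argsort(distances)[:k]

# Toy minority class: four 2-D points (made-up values)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
neighbors = minority_neighbors(X_min, index=0, k=2)
print(neighbors)  # indices of the two points closest to X_min[0]
```

Note that k can be at most one less than the number of minority samples, since the chosen sample itself is excluded.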

## 3. Synthesis of New Samples:
To generate a synthetic sample, SMOTE chooses one of the k nearest neighbors at random and forms the vector in feature space that connects the selected sample to that neighbor. The new sample is placed at a random point along this line segment. Specifically, SMOTE takes the difference between the neighbor's feature vector and the selected sample's feature vector, multiplies this difference by a random number between 0 and 1, and adds the result to the feature vector of the selected sample.
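
The interpolation step can be written out directly. The snippet below is a minimal sketch of the arithmetic only, not the library's implementation: the function name `synthesize` and the two example points are hypothetical.

```python
import numpy as np

def synthesize(sample, neighbor, rng):
    """Create one synthetic point on the segment between sample and neighbor."""
    # Random interpolation factor in [0, 1)
    gap = rng.uniform(0.0, 1.0)
    # Move from the sample toward the neighbor by that fraction of the difference
    return sample + gap * (neighbor - sample)

rng = np.random.default_rng(42)
sample = np.array([1.0, 2.0])    # selected minority sample (made-up values)
neighbor = np.array([3.0, 6.0])  # one of its minority-class neighbors (made-up values)

synthetic = synthesize(sample, neighbor, rng)
print(synthetic)
```

Because the factor lies between 0 and 1, the synthetic point always falls between the two original points and never outside the segment that joins them.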

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# This script assumes that you have loaded your data and instantiated the variables X_train, X_test, y_train, y_test
# using the train_test_split function from sklearn.model_selection. You can change lines 5-9 below to use
# the script on other data, or leave them as they are to run SMOTE on the original Kaggle Stellar Classification Set: https://www.kaggle.com/datasets/deepu1109/star-dataset/data

file_path = "/content/drive/MyDrive/EC503 Final Project/Datasets/MK csv #1.csv"  # MK1, i.e. the original dataset
df = pd.read_csv(file_path)  # load the CSV file into a DataFrame
X = df[['Temperature (K)', 'Luminosity(L/Lo)', 'Radius(R/Ro)', 'Absolute magnitude(Mv)', 'Color']]  # MK1 feature columns
Y = df['Spectral Class']  # target labels
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

# Check initial class distribution
print("Class distribution in training set before any operation:")
print(y_train.value_counts())

# Remove samples from classes with only one instance
class_counts = y_train.value_counts()
single_instance_classes = class_counts[class_counts == 1].index
X_train = X_train[~y_train.isin(single_instance_classes)]
y_train = y_train[~y_train.isin(single_instance_classes)]

# Check class distribution after removing single-instance classes
print("Class distribution in training set after removing single-instance classes:")
print(y_train.value_counts())

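# Applying SMOTE
# k_neighbors must be smaller than the number of samples in the smallest class,
# so it is set to min_count - 1, the largest value that class allows
min_count = y_train.value_counts().min()
if min_count > 1:
    smote = SMOTE(random_state=42, k_neighbors=min_count-1)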
    X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
    print("Class distribution in training set after SMOTE:")
    print(y_train_smote.value_counts())
else:
    print("Cannot apply SMOTE as one of the classes has fewer than 2 samples.")