# Nearest Earth Object Classification

### By Michael D'Arcy-Evans and Isabel Tilles

## Introduction:
The solar system is a dangerous place. One fear that scientists have is that we will be harmed by an object as it hurtles towards us from space. We spot objects like these coming towards Earth all the time, but how do we know whether they are going to devastate us or simply provide a good comet show? We seek to answer that question by creating a reliable classifier that determines whether or not a Near Earth Object (NEO) is hazardous.  
We got our data from the [Nearest Earth Objects Dataset (1910-2024)](https://www.kaggle.com/datasets/ivansher/nasa-nearest-earth-objects-1910-2024). Based on these instances, we hope to predict whether or not unseen NEOs will be classified as `is_hazardous`.  
#TODO: What classifier performed the best?

## Imports for Analysis

In [34]:
# some useful mysklearn package import statements and reloads
import importlib

import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

import mysklearn.mypytable
importlib.reload(mysklearn.mypytable)
from mysklearn.mypytable import MyPyTable 

import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MyKNeighborsClassifier, MyDummyClassifier, MyNaiveBayesClassifier, MyDecisionTreeClassifier, MyRandomForestClassifier

import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
import mysklearn.myevaluation as myevaluation


## Data Analysis



In [35]:
labels = ["True","False"]
space_table = MyPyTable().load_from_file('input_data/space.csv')

space_table.remove_rows_with_missing_values()

space_dict = myutils.count_label_occurrences(space_table.get_column("is_hazardous"))
for value in sorted(space_dict.keys()):
    print(value, space_dict[value])

discretized_label_dict = myutils.preprocess_table(space_table)
print(discretized_label_dict)
# Locate the class column once, then split the table into labels and features.
class_index = space_table.column_names.index("is_hazardous")
y_true = space_table.get_column(class_index)
# Every column except the class label becomes a feature.
X_train = [row[:class_index] + row[class_index + 1:] for row in space_table.data]

tree_classifier = MyDecisionTreeClassifier()
y_pred = myevaluation.cross_val_predict(tree_classifier,X_train,y_true,10)
print(myevaluation.pseudo_classification_report(y_true,y_pred,labels))
# print("Decision Tree Structure:", tree_classifier.tree)
 
header = space_table.column_names
tree_classifier.print_decision_rules(attribute_names=header)
tree_classifier.visualize_tree(
    "space_tree.dot", "space_tree", attribute_names=header
)


False 295009
True 43162
{'min_labels': ['0.0 to 0.1', '0.1 to 1.1'], 'max_labels': ['0.0 to 0.3', '0.3 to 2.5'], 'velocity_labels': ['0.0 to 53433.8', '53433.8 to 138171.4'], 'miss_labels': ['0.0 to 43869660.1', '43869660.1 to 74715777.4']}
Decision Tree Classifier: Accuracy = 0.74, Error Rate = 0.26, Precision = 0.7159, Recall = 0.8040, F1 Score = 0.7574
╒═════════════════╤════════╤═════════╕
│ Is Hazardous:   │   True │   False │
╞═════════════════╪════════╪═════════╡
│ True            │    804 │     196 │
├─────────────────┼────────┼─────────┤
│ False           │    319 │     681 │
╘═════════════════╧════════╧═════════╛
IF estimated_diameter_min == 0 AND relative_velocity == 0 AND miss_distance == 0 THEN class = False.
IF estimated_diameter_min == 0 AND relative_velocity == 0 AND miss_distance == 1 THEN class = False.
IF estimated_diameter_min == 0 AND relative_velocity == 1 AND miss_distance == 0 THEN class = True.
IF estimated_diameter_min == 0 AND relative_velocity == 1 AND miss_

In [36]:
myutils.print_dataset_info(space_table)

This dataset has 2000 instances and 5 attributes
Dataset attribute breakdown:  estimated_diameter_min is of type <class 'int'> and estimated_diameter_max is of type <class 'int'> and relative_velocity is of type <class 'int'> and miss_distance is of type <class 'int'> and is_hazardous is of type <class 'str'> .
The attribute we are trying to predict is is_hazardous. It can be {'False', 'True'}.
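
The integer attribute types and the two-entry label lists printed earlier suggest that each numeric attribute was split into two bins, coded 0 and 1. The sketch below illustrates that idea only; it is not the code inside `myutils.preprocess_table`. The function name `discretize_two_bins` and the choice to put a value equal to the cut-off in the upper bin are assumptions, and the cut-off values are the rounded boundaries shown in the printed labels.

```python
# Hypothetical two-bin discretizer; NOT the actual myutils.preprocess_table code.
# Cut-offs are the (rounded) bin boundaries from the printed label dictionary.
CUTOFFS = {
    "estimated_diameter_min": 0.1,
    "estimated_diameter_max": 0.3,
    "relative_velocity": 53433.8,
    "miss_distance": 43869660.1,
}


def discretize_two_bins(value, cutoff):
    """Return 0 for values below the cut-off and 1 otherwise (assumed convention)."""
    return 0 if value < cutoff else 1


# Example row with made-up raw measurements, for illustration only.
raw_row = {
    "estimated_diameter_min": 0.05,
    "estimated_diameter_max": 0.45,
    "relative_velocity": 60000.0,
    "miss_distance": 1.0e7,
}

coded_row = {
    name: discretize_two_bins(raw_row[name], CUTOFFS[name]) for name in CUTOFFS
}
print(coded_row)
```

Under this reading, a rule such as `relative_velocity == 1` would correspond to a velocity in the upper bin, '53433.8 to 138171.4'.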


To evaluate the quality of our decision tree, we chose to focus on maximizing recall. It is important to flag a dangerous NEO as often as possible: a false negative could be devastating, while the consequences of a false alarm are far less severe.
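
To make the recall trade-off concrete, here is a minimal sketch that recomputes the reported metrics directly from the confusion matrix printed above. It assumes that rows are actual classes and columns are predicted classes, a reading that is consistent with the reported precision and recall. The helper name `summarize` is ours and is not part of `myevaluation`.

```python
# Counts copied from the confusion matrix in the report above.
# Assumed layout: rows = actual class, columns = predicted class.
tp, fn = 804, 196  # actual True:  predicted True / predicted False
fp, tn = 319, 681  # actual False: predicted True / predicted False


def summarize(tp, fn, fp, tn):
    """Compute standard binary-classification metrics from raw counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)  # of the NEOs flagged hazardous, how many were
    recall = tp / (tp + fn)     # of the hazardous NEOs, how many were flagged
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


metrics = summarize(tp, fn, fp, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The 196 false negatives are hazardous objects that the tree labeled as safe, which is the error that recall penalizes and that matters most here.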