# Potentially Hazardous Asteroid Prediction

## Phase 1 - Data Cleaning and Normalization

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
%matplotlib inline
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint




### Dataset

The dataset used for this agent is a subset of the NASA NEO database. This subset is noted as being well maintained and clean, but it contains relatively few datapoints. Other datasets contain vastly more datapoints but require a fair bit more cleaning; they could serve as alternatives should the initial dataset prove insufficient.

In [2]:
#loading dataset, source: https://www.kaggle.com/datasets/lovishbansal123/nasa-asteroids-classification
df = pd.read_csv('../nasa.csv')

#checking data shape
print("Number of Datapoints: " + str(df.shape[0]))
print("Number of Features: " + str(df.shape[1]))

Number of Datapoints: 4687
Number of Features: 40


In [3]:
#examining dataset
df.head()

Unnamed: 0,Neo Reference ID,Name,Absolute Magnitude,Est Dia in KM(min),Est Dia in KM(max),Est Dia in M(min),Est Dia in M(max),Est Dia in Miles(min),Est Dia in Miles(max),Est Dia in Feet(min),...,Asc Node Longitude,Orbital Period,Perihelion Distance,Perihelion Arg,Aphelion Dist,Perihelion Time,Mean Anomaly,Mean Motion,Equinox,Hazardous
0,3703080,3703080,21.6,0.12722,0.284472,127.219879,284.472297,0.079051,0.176763,417.388066,...,314.373913,609.599786,0.808259,57.25747,2.005764,2458162.0,264.837533,0.590551,J2000,True
1,3723955,3723955,21.3,0.146068,0.326618,146.067964,326.617897,0.090762,0.202951,479.22562,...,136.717242,425.869294,0.7182,313.091975,1.497352,2457795.0,173.741112,0.84533,J2000,False
2,2446862,2446862,20.3,0.231502,0.517654,231.502122,517.654482,0.143849,0.321655,759.521423,...,259.475979,643.580228,0.950791,248.415038,1.966857,2458120.0,292.893654,0.559371,J2000,True
3,3092506,3092506,27.4,0.008801,0.019681,8.801465,19.680675,0.005469,0.012229,28.876199,...,57.173266,514.08214,0.983902,18.707701,1.527904,2457902.0,68.741007,0.700277,J2000,False
4,3514799,3514799,21.6,0.12722,0.284472,127.219879,284.472297,0.079051,0.176763,417.388066,...,84.629307,495.597821,0.967687,158.263596,1.483543,2457814.0,135.142133,0.726395,J2000,True


In [4]:
#checking data for null values
df.isna().sum()

Neo Reference ID                0
Name                            0
Absolute Magnitude              0
Est Dia in KM(min)              0
Est Dia in KM(max)              0
Est Dia in M(min)               0
Est Dia in M(max)               0
Est Dia in Miles(min)           0
Est Dia in Miles(max)           0
Est Dia in Feet(min)            0
Est Dia in Feet(max)            0
Close Approach Date             0
Epoch Date Close Approach       0
Relative Velocity km per sec    0
Relative Velocity km per hr     0
Miles per hour                  0
Miss Dist.(Astronomical)        0
Miss Dist.(lunar)               0
Miss Dist.(kilometers)          0
Miss Dist.(miles)               0
Orbiting Body                   0
Orbit ID                        0
Orbit Determination Date        0
Orbit Uncertainity              0
Minimum Orbit Intersection      0
Jupiter Tisserand Invariant     0
Epoch Osculation                0
Eccentricity                    0
Semi Major Axis                 0
Inclination   

In [5]:
#NOTE: Review the NEO glossary definitions and determine if any additional columns can be dropped from the dataset;
#some of these may be unnecessary in determining classification or even just redundant, be sure to note any extra columns that
#are being dropped and why, can document in more detail in the report as necessary. May also be a good idea to reorganize the
#cells so that columns that are being dropped get dropped before visuals, and another reiteration of the shape of final data set
#including features and such afterward

In [6]:
#examining data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4687 entries, 0 to 4686
Data columns (total 40 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Neo Reference ID              4687 non-null   int64  
 1   Name                          4687 non-null   int64  
 2   Absolute Magnitude            4687 non-null   float64
 3   Est Dia in KM(min)            4687 non-null   float64
 4   Est Dia in KM(max)            4687 non-null   float64
 5   Est Dia in M(min)             4687 non-null   float64
 6   Est Dia in M(max)             4687 non-null   float64
 7   Est Dia in Miles(min)         4687 non-null   float64
 8   Est Dia in Miles(max)         4687 non-null   float64
 9   Est Dia in Feet(min)          4687 non-null   float64
 10  Est Dia in Feet(max)          4687 non-null   float64
 11  Close Approach Date           4687 non-null   object 
 12  Epoch Date Close Approach     4687 non-null   int64  
 13  Rel

### Data Preparation

The 'Name' and 'Neo Reference ID' columns will be dropped: they identify objects but have no bearing on whether an asteroid is hazardous. The 'Close Approach Date', 'Orbiting Body', 'Orbit Determination Date', and 'Equinox' columns will also be removed, because these non-numeric columns are difficult to process and are not listed among the determining factors for a PHA.

In [7]:
df.drop(['Neo Reference ID', 'Name', 'Close Approach Date', 'Orbiting Body', 
         'Orbit Determination Date', 'Equinox'], axis=1, inplace=True)
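As a rough illustration of how non-numeric columns could be identified before dropping them, the sketch below uses `select_dtypes` on a small toy frame. The toy values are illustrative, not rows from the dataset. Note that an identifier such as 'Name' is stored as an integer, so a dtype check alone would not flag it; it is dropped for the separate reason given above.

```python
import pandas as pd

# Toy frame (illustrative values only) mimicking a few of the column types
toy = pd.DataFrame({
    "Name": [3703080, 3723955],          # integer identifier
    "Equinox": ["J2000", "J2000"],       # text column
    "Absolute Magnitude": [21.6, 21.3],  # numeric feature
    "Hazardous": [True, False],          # boolean label
})

# Keep only columns that are neither numeric nor boolean
non_numeric = list(toy.select_dtypes(exclude=["number", "bool"]).columns)
print(non_numeric)
```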

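Predicting Potentially Hazardous Asteroids (PHAs) is a complex problem with a number of factors. For the sake of simplicity in this project, and for greater control over the model and its features, we will focus on the most vital aspects of determining which asteroids/comets are Near Earth Objects (NEOs) and which of these objects are PHAs. The columns dropped in the next cell are either the same measurement in another unit (for example, diameters in kilometers, miles, and feet alongside meters) or features outside that focus.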

In [8]:
df.drop(['Est Dia in KM(min)', 'Est Dia in KM(max)', 'Est Dia in Miles(min)', 'Est Dia in Miles(max)',
        'Est Dia in Feet(min)', 'Est Dia in Feet(max)', 'Relative Velocity km per hr',
        'Miles per hour', 'Miss Dist.(miles)', 'Miss Dist.(lunar)', 'Miss Dist.(Astronomical)', 'Epoch Date Close Approach',
        'Orbit ID', 'Jupiter Tisserand Invariant', 'Epoch Osculation', 'Relative Velocity km per sec',
        'Orbit Uncertainity', 'Eccentricity', 'Inclination', 'Asc Node Longitude', 'Perihelion Arg',
        'Perihelion Time', 'Mean Anomaly', 'Mean Motion'], axis=1, inplace=True)
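Several of the columns dropped above repeat a measurement in a different unit. One way such redundancy could be checked, sketched here under assumptions, is to look for column pairs whose absolute correlation is essentially 1. The helper name `perfectly_correlated_pairs`, the threshold, and the toy column names and values are all hypothetical.

```python
import pandas as pd

def perfectly_correlated_pairs(frame, threshold=0.9999):
    """Return column pairs whose absolute Pearson correlation meets the threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] >= threshold:
                pairs.append((a, b))
    return pairs

# Toy data: dia_m is dia_km rescaled by 1000, magnitude is a different quantity
toy = pd.DataFrame({
    "dia_km": [0.127, 0.146, 0.232, 0.009],
    "dia_m": [127.0, 146.0, 232.0, 9.0],
    "magnitude": [21.6, 21.3, 20.3, 27.4],
})

pairs = perfectly_correlated_pairs(toy)
print(pairs)
```

A pair flagged this way carries the same information twice, so keeping one member of the pair is usually enough.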

The output label is currently a boolean data type. We will change the output label to a numeric data type.

In [9]:
df['Hazardous'] = df['Hazardous'].astype(int)

#### Final Dataset Shape

In [10]:
#verifying final counts of rows and columns
print("Number of Datapoints: " + str(df.shape[0]))
print("Number of Features: " + str(df.shape[1]))

Number of Datapoints: 4687
Number of Features: 10


### Examining Dataset Features

In [11]:
df.describe()

Unnamed: 0,Absolute Magnitude,Est Dia in M(min),Est Dia in M(max),Miss Dist.(kilometers),Minimum Orbit Intersection,Semi Major Axis,Orbital Period,Perihelion Distance,Aphelion Dist,Hazardous
count,4687.0,4687.0,4687.0,4687.0,4687.0,4687.0,4687.0,4687.0,4687.0,4687.0
mean,22.267865,204.604203,457.508906,38413470.0,0.08232,1.400264,635.582076,0.813383,1.987144,0.161084
std,2.890972,369.573402,826.391249,21811100.0,0.0903,0.524154,370.954727,0.242059,0.951519,0.367647
min,11.16,1.010543,2.259644,26609.89,2e-06,0.61592,176.557161,0.080744,0.803765,0.0
25%,20.1,33.462237,74.823838,19959280.0,0.014585,1.000635,365.605031,0.630834,1.266059,0.0
50%,21.9,110.803882,247.765013,39647710.0,0.047365,1.240981,504.947292,0.833153,1.618195,0.0
75%,24.5,253.837029,567.596853,57468630.0,0.123593,1.678364,794.195972,0.997227,2.451171,0.0
max,32.1,15579.552413,34836.938254,74781600.0,0.477891,5.072008,4172.231343,1.299832,8.983852,1.0
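The mean of the 'Hazardous' column in the summary above (0.161084) is the fraction of positive labels, so the classes are imbalanced. The sketch below rebuilds a label vector with that proportion and computes the accuracy of always predicting the majority class, which is a useful reference when reading the accuracy metric later. The count of hazardous rows is derived from the reported mean and row count rather than read from the data.

```python
import numpy as np

n_rows = 4687
# Derived from the reported mean of the label column (0.161084 * 4687 is about 755)
n_hazardous = round(0.161084 * n_rows)

# Rebuild a label vector with the same class proportions
labels = np.concatenate([
    np.ones(n_hazardous, dtype=int),
    np.zeros(n_rows - n_hazardous, dtype=int),
])

positive_rate = labels.mean()
# Accuracy obtained by always predicting the more common class
baseline_accuracy = max(positive_rate, 1 - positive_rate)

print(f"hazardous fraction: {positive_rate:.4f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```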


## Phase 2 - Overfitting

In [12]:
#starting by separating input features and output label into separate variables
#get all rows and all columns except last column for input features
X = df.drop(['Hazardous'], axis='columns')
#get all rows and only last column for output label
Y = df['Hazardous']

In [13]:
#verifying datasets separated correctly
X.shape

(4687, 9)

In [14]:
Y.shape

(4687,)

### Data Normalization

#### Normalization statistics should be computed AFTER the data is split into training and testing datasets, using ONLY the training dataset; the same statistics are then applied to the testing dataset.

In [15]:
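# Mean normalization: subtract each feature's mean and divide by its range
# (variable names avoid shadowing the built-in min and max)
x_min = X.min(axis=0)
x_max = X.max(axis=0)
x_mean = X.mean(axis=0)
X = (X - x_mean) / (x_max - x_min)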
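The cell above normalizes the full dataset. A minimal sketch of the split-then-normalize order described in the note above might look like the following, assuming a simple random split. The function name `split_then_normalize`, the 20% test fraction, the seed, and the toy data are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def split_then_normalize(X, y, test_fraction=0.2, seed=0):
    """Randomly split, then mean-normalize using training statistics only."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]

    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # Statistics come from the training rows only
    train_mean = X_train.mean(axis=0)
    train_range = X_train.max(axis=0) - X_train.min(axis=0)

    # The same statistics are applied to both splits
    X_train_norm = (X_train - train_mean) / train_range
    X_test_norm = (X_test - train_mean) / train_range
    return X_train_norm, X_test_norm, y_train, y_test

# Toy data standing in for the feature matrix and label column
X_toy = pd.DataFrame({"a": np.arange(10.0), "b": np.arange(10.0) ** 2})
y_toy = pd.Series([0, 1] * 5)

X_train_norm, X_test_norm, y_train, y_test = split_then_normalize(X_toy, y_toy)
print(X_train_norm.shape, X_test_norm.shape)
```

A feature that is constant within the training rows would have a range of zero and would need separate handling; this sketch does not cover that case.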

In [16]:
#initializing model
model = Sequential()
model.add(Dense(256, input_dim = 9, activation = 'relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation = 'relu'))
model.add(Dense(256, activation = 'relu'))
model.add(Dense(256, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))




In [17]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 256)               2560      
                                                                 
 dense_1 (Dense)             (None, 256)               65792     
                                                                 
 dense_2 (Dense)             (None, 256)               65792     
                                                                 
 dense_3 (Dense)             (None, 256)               65792     
                                                                 
 dense_4 (Dense)             (None, 256)               65792     
                                                                 
 dense_5 (Dense)             (None, 1)                 257       
                                                                 
Total params: 265985 (1.01 MB)
Trainable params: 265985 
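The parameter counts in the summary above can be reproduced by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The short sketch below applies that formula to the layer widths defined earlier; the helper name `dense_params` is illustrative.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 9 input features, five hidden layers of 256 units, one sigmoid output unit
layer_sizes = [9, 256, 256, 256, 256, 256, 1]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [2560, 65792, 65792, 65792, 65792, 257]
print(total)      # 265985
```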

In [18]:
#compiling model
model.compile(loss = 'binary_crossentropy', optimizer='adam', metrics=['accuracy'])




In [19]:
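#early stopping callback: stops training once the training loss has not improved for 20 consecutive epochs
#(only takes effect when passed to model.fit via callbacks=[callback_x])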
callback_x = EarlyStopping(monitor='loss', mode='min', patience=20, verbose=1)
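To make the `patience` setting concrete, the sketch below simulates the stopping rule on a list of per-epoch losses in plain Python. It is a simplified illustration of the idea (stop after `patience` consecutive epochs without a new best loss), not Keras's actual implementation. The function name `epoch_of_early_stop` and the loss values are hypothetical.

```python
def epoch_of_early_stop(losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            # New best loss: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical loss curve: improves twice, then stalls
toy_losses = [0.50, 0.40, 0.45, 0.41, 0.42]

stop_epoch = epoch_of_early_stop(toy_losses, patience=2)
never_stops = epoch_of_early_stop([0.5, 0.4, 0.3], patience=2)
print(stop_epoch, never_stops)
```

With `patience=20`, as in the cell above, a run would continue for 20 epochs past its best loss before stopping.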

In [20]:
model.fit(x = X, y = Y, epochs = 256, verbose = 1)

Epoch 1/256
...
Epoch 77/256
[per-epoch training output truncated]

<keras.src.callbacks.History at 0x26b36076cd0>