# Predicting Hazardous Asteroids

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Intro" data-toc-modified-id="Intro-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Intro</a></span><ul class="toc-item"><li><span><a href="#What" data-toc-modified-id="What-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>What</a></span></li><li><span><a href="#Why" data-toc-modified-id="Why-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Why</a></span></li></ul></li><li><span><a href="#Imports" data-toc-modified-id="Imports-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Process-Data" data-toc-modified-id="Process-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Process Data</a></span><ul class="toc-item"><li><span><a href="#Load-data" data-toc-modified-id="Load-data-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Load data</a></span></li><li><span><a href="#Understand-data" data-toc-modified-id="Understand-data-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Understand data</a></span></li><li><span><a href="#Clean-data" data-toc-modified-id="Clean-data-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Clean data</a></span></li><li><span><a href="#Visualize-data" data-toc-modified-id="Visualize-data-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Visualize data</a></span></li></ul></li><li><span><a href="#Split-data" data-toc-modified-id="Split-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Split data</a></span></li><li><span><a href="#Find-Model" data-toc-modified-id="Find-Model-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Find Model</a></span></li><li><span><a href="#Model-1" data-toc-modified-id="Model-1-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Model 1</a></span></li><li><span><a href="#Model-2" data-toc-modified-id="Model-2-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Model 2</a></span></li></ul></div>

***

## Intro

### What

NASA (National Aeronautics and Space Administration) is a world-renowned organization headquartered in the United States.  The agency was created in 1958 during the Space Race with the Soviet Union, in response to Sputnik 1 being launched into orbit around Earth.  Since then, NASA has landed people on the moon and rovers on Mars, sent telescopes into deep space, built (with international partners) a space station, and continues to develop new technologies every day.  Part of its research looks at asteroids and attempts to determine whether they are hazardous or safe to Earth, so that proper action can be taken should it be necessary.  The data used in this project comes from [Kaggle](https://www.kaggle.com/shrutimehta/nasa-asteroids-classification), but originates from NASA.

This data contains 40 variables which can be summarized as follows:

- The first two columns contain identical identifier values and are not beneficial for a model.
- The next feature is the absolute magnitude, which measures the brightness of a celestial object as it would be seen at a distance of 10 parsecs (equal to 1.9174E+14 miles).
- The next set of features relates to the diameter of asteroids.  Estimates are given in kilometers (km), meters (m), miles (mi), and feet (ft), with a minimum and a maximum estimate in each unit.
- There are two columns addressing when an asteroid approaches Earth: one as a calendar date and one as an epoch timestamp.
- Features also include the speed of the asteroid and the distance from Earth at which the asteroid will pass, measured in astronomical units, lunar distances, km, and mi.
- There are a number of columns dealing with the orbit pattern, including the orbital period, perihelion distance, aphelion distance, eccentricity, and the like.
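
As a quick sanity check on the parsec figure quoted in the list above, here is a minimal sketch of the conversion.  The constants are standard approximate values and the helper name `parsecs_to_miles` is only illustrative; neither comes from the dataset.

```python
# Approximate conversion constants (standard values, not taken from the dataset).
KM_PER_PARSEC = 3.0857e13
KM_PER_MILE = 1.609344


def parsecs_to_miles(parsecs: float) -> float:
    """Convert a distance in parsecs to miles (hypothetical helper)."""
    return parsecs * KM_PER_PARSEC / KM_PER_MILE


ten_parsecs_in_miles = parsecs_to_miles(10)
print(f"{ten_parsecs_in_miles:.4e} miles")
```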

There is one target in the 40 columns:

- The target variable is the `Hazardous` column, which shows whether the asteroid is hazardous or not, based on size, speed, and orbit.

### Why

My grandfather worked for NASA during the Apollo missions.  That alone got me interested in a career in aeronautics, as did witnessing SpaceX's many successes (and failures) and its eventual partnership with NASA to launch American astronauts from American soil.  Today, my interest in space continues to grow, and one of my goals is to work for an aeronautics company in the future.  My chances of getting a role like that are slim right now, so the best thing I can do is work with the data they provide to continue building my skills and experience.

Let's get started!

***

## Imports

First, we will import "global" packages that we will use throughout our code.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

***

## Process Data

### Load data

Because this data comes from Kaggle as a CSV file, no special steps or arguments are necessary to load it into a data frame.  We can just use the `read_csv()` function included in the Pandas library.

In [2]:
# load csv into pandas dataframe
asteroids = pd.read_csv("nasa.csv")

### Understand data

Now that the data has been loaded, let's run through its contents.  Again, this data comes from Kaggle and thus has already been preprocessed, but we might want to do more to make it our own.

In [3]:
# view basic info about columns
asteroids.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4687 entries, 0 to 4686
Data columns (total 40 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Neo Reference ID              4687 non-null   int64  
 1   Name                          4687 non-null   int64  
 2   Absolute Magnitude            4687 non-null   float64
 3   Est Dia in KM(min)            4687 non-null   float64
 4   Est Dia in KM(max)            4687 non-null   float64
 5   Est Dia in M(min)             4687 non-null   float64
 6   Est Dia in M(max)             4687 non-null   float64
 7   Est Dia in Miles(min)         4687 non-null   float64
 8   Est Dia in Miles(max)         4687 non-null   float64
 9   Est Dia in Feet(min)          4687 non-null   float64
 10  Est Dia in Feet(max)          4687 non-null   float64
 11  Close Approach Date           4687 non-null   object 
 12  Epoch Date Close Approach     4687 non-null   int64  
 13  Rel

Each column has the same number of non-null rows, which matches the number of rows in the dataframe (see the RangeIndex value at the top of the output).  That means there are likely no nulls in the data.  We can see here that there are four (4) columns with the `object` data type and five (5) columns with the `int64` data type.  Let's look at what these columns contain to see if they are helpful.  We already plan on dropping the first two `int64` columns, but we will take a look to understand why.  We will then look at the others in order, for readability.
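
The two observations above (no nulls, and a count of columns per data type) can also be checked directly instead of being read off the `info()` output.  The sketch below uses a tiny stand-in frame named `sample` so it runs on its own; the `Absolute Magnitude` values are hypothetical, and on the real data the same calls would be made on `asteroids`.

```python
import pandas as pd

# Tiny stand-in for the real dataframe (magnitude values are hypothetical).
sample = pd.DataFrame({
    "Neo Reference ID": [3703080, 3723955],
    "Absolute Magnitude": [21.6, 21.3],
    "Close Approach Date": ["1995-01-01", "1995-01-01"],
})

# Total number of missing values across every column.
total_nulls = int(sample.isna().sum().sum())

# Number of columns per data type.
dtype_counts = sample.dtypes.value_counts()

print(total_nulls)
print(dtype_counts)
```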

First, `Neo Reference ID`.

In [4]:
# look at 'Neo Reference ID'
asteroids["Neo Reference ID"]

#asteroids["Neo Reference ID"].sum() = 15337259656

0       3703080
1       3723955
2       2446862
3       3092506
4       3514799
         ...   
4682    3759007
4683    3759295
4684    3759714
4685    3759720
4686    3772978
Name: Neo Reference ID, Length: 4687, dtype: int64

In [5]:
# see unique values in column
asteroids["Neo Reference ID"].unique()

#asteroids["Neo Reference ID"].unique().sum() = 12150256364

array([3703080, 3723955, 2446862, ..., 3759714, 3759720, 3772978],
      dtype=int64)

As mentioned before, these values just identify asteroids.  Interestingly enough, appending `.sum()` to the column gives the sum of all of its values, as you would expect.  The interesting part is that appending `.sum()` to `.unique()` gives a smaller value (12150256364 versus 15337259656) when you would expect the same one.  This means some asteroids appear more than once in this data set.  We might want to remove duplicates, as they may skew the model or cause it to overfit.
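
Comparing sums works, but counting repeated identifiers is a more direct check.  Here is a minimal sketch on a hypothetical four-row series in which one ID is deliberately repeated; on the real data the same calls would be made on `asteroids["Neo Reference ID"]`.

```python
import pandas as pd

# Hypothetical stand-in: the ID 3703080 is repeated on purpose.
ids = pd.Series([3703080, 3723955, 3703080, 2446862], name="Neo Reference ID")

n_rows = len(ids)                         # total number of rows
n_unique = ids.nunique()                  # number of distinct IDs
n_repeated_rows = ids.duplicated().sum()  # rows whose ID was already seen

print(n_rows, n_unique, n_repeated_rows)
```

If `n_unique` is smaller than `n_rows`, at least one asteroid appears more than once.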

The `Name` column should contain the same values, but we can check just to make sure.

In [6]:
# look at 'Name'
asteroids["Name"]

#asteroids["Name"].sum() = 15337259656

0       3703080
1       3723955
2       2446862
3       3092506
4       3514799
         ...   
4682    3759007
4683    3759295
4684    3759714
4685    3759720
4686    3772978
Name: Name, Length: 4687, dtype: int64

In [7]:
# see unique values in column
asteroids["Name"].unique()

#asteroids["Name"].unique().sum() = 12150256364

array([3703080, 3723955, 2446862, ..., 3759714, 3759720, 3772978],
      dtype=int64)

Sure enough, as expected, both `Neo Reference ID` and `Name` contain the same values.  We can drop these columns after we use them to remove duplicates.  
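
Eyeballing the two outputs is convincing, but an element-wise comparison removes any doubt.  This is a small sketch on a stand-in frame built from the first three rows shown above; the name `sample` is only for illustration.

```python
import pandas as pd

# Stand-in frame using the first three rows displayed above.
sample = pd.DataFrame({
    "Neo Reference ID": [3703080, 3723955, 2446862],
    "Name": [3703080, 3723955, 2446862],
})

# True only if every row holds the same value in both columns.
columns_identical = bool((sample["Neo Reference ID"] == sample["Name"]).all())
print(columns_identical)
```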

Let's look at `Close Approach Date`.

In [8]:
# look at 'Close Approach Date'
asteroids["Close Approach Date"].head()

0    1995-01-01
1    1995-01-01
2    1995-01-08
3    1995-01-15
4    1995-01-15
Name: Close Approach Date, dtype: object

We don't have access to the metadata, but judging from the column name, `Close Approach Date` is most likely the date the asteroid passes closest to Earth.  While it might be interesting to see whether there is a pattern between these dates and whether an asteroid is hazardous, that exploration is for another project.  If we wanted to use this column, we could one-hot encode it, but that would add one new column for every distinct date.  Instead of doing that, for this project, we will drop this column.
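
To gauge how wide an encoded version of this column would be, we can count its distinct values, since one-hot encoding produces one column per distinct value.  The sketch below uses only the five dates displayed above as a stand-in series; the real column would be passed in the same way.

```python
import pandas as pd

# Stand-in series: the five dates shown by .head() above.
dates = pd.Series(
    ["1995-01-01", "1995-01-01", "1995-01-08", "1995-01-15", "1995-01-15"],
    name="Close Approach Date",
)

n_distinct = dates.nunique()     # number of distinct dates
encoded = pd.get_dummies(dates)  # one indicator column per distinct date

print(n_distinct, encoded.shape)
```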

Let's look at `Epoch Date Close Approach`.

In [9]:
# look at 'Epoch Date Close Approach'
asteroids["Epoch Date Close Approach"]

0        788947200000
1        788947200000
2        789552000000
3        790156800000
4        790156800000
            ...      
4682    1473318000000
4683    1473318000000
4684    1473318000000
4685    1473318000000
4686    1473318000000
Name: Epoch Date Close Approach, Length: 4687, dtype: int64

In [10]:
# see unique values in column
asteroids["Epoch Date Close Approach"].unique()

array([ 788947200000,  789552000000,  790156800000,  790761600000,
        792230400000,  792835200000,  793440000000,  794649600000,
        795254400000,  795859200000,  797324400000,  797929200000,
        798534000000,  799916400000,  800521200000,  801126000000,
        802594800000,  803199600000,  803804400000,  805186800000,
        805791600000,  806396400000,  807865200000,  808470000000,
        809074800000,  810543600000,  811148400000,  811753200000,
        813135600000,  813740400000,  814345200000,  815817600000,
        816422400000,  817027200000,  818409600000,  819014400000,
        819619200000,  821088000000,  821692800000,  822297600000,
        823766400000,  824371200000,  824976000000,  826272000000,
        827481600000,  828946800000,  829551600000,  830156400000,
        831538800000,  832143600000,  832748400000,  834217200000,
        834822000000,  835426800000,  836809200000,  837414000000,
        838018800000,  839487600000,  840092400000,  840697200

There are a number of unique values, and they are probably the `Close Approach Date` in epoch format (likely milliseconds since January 1, 1970).  Since we are not completely sure, we will drop this column too.
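
One way to test that guess is to convert the epoch values to dates and compare them with `Close Approach Date`.  The sketch below assumes the values are milliseconds since the Unix epoch (an assumption, since we have no metadata) and uses three pairs of values copied from the outputs above.

```python
import pandas as pd

# Three (date, epoch) pairs copied from the outputs shown above.
sample = pd.DataFrame({
    "Close Approach Date": ["1995-01-01", "1995-01-08", "1995-01-15"],
    "Epoch Date Close Approach": [788947200000, 789552000000, 790156800000],
})

# Assumption: the epoch values are milliseconds since 1970-01-01 (UTC).
from_epoch = pd.to_datetime(sample["Epoch Date Close Approach"], unit="ms")
from_text = pd.to_datetime(sample["Close Approach Date"])

# Drop the time of day before comparing, since only the date matters here.
dates_match = bool((from_epoch.dt.normalize() == from_text).all())
print(dates_match)
```

If this holds across the whole column, the two columns carry the same information and only one of them would ever be needed.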

Let's look at `Orbiting Body`.

In [11]:
# look at 'Orbiting Body'
asteroids["Orbiting Body"]

0       Earth
1       Earth
2       Earth
3       Earth
4       Earth
        ...  
4682    Earth
4683    Earth
4684    Earth
4685    Earth
4686    Earth
Name: Orbiting Body, Length: 4687, dtype: object

In [12]:
# see unique values in column
asteroids["Orbiting Body"].unique()

array(['Earth'], dtype=object)

Looking at the data in the `Orbiting Body` column, we can tell that the value is the body the asteroid is orbiting.  The rows we can see all say 'Earth', which gives us an idea of what the hidden rows contain.  Calling the `.unique()` method returns each distinct value in a column, and it confirms that 'Earth' is the only one.  Having all 4687 rows share the same value seems odd, given that everything in our solar system, Earth included, orbits the sun, but since this data is about whether an asteroid is hazardous to Earth and not to the entire solar system, it makes sense.  We can remove this column, as a constant provides no helpful information for classification.
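
Rather than inspecting columns one at a time, constant columns can be found in a single pass by looking for columns with exactly one distinct value.  This is a minimal sketch on a stand-in frame (the first four `Orbit ID` values shown later in this notebook are used as the non-constant example).

```python
import pandas as pd

# Stand-in frame: one constant column and one varying column.
sample = pd.DataFrame({
    "Orbiting Body": ["Earth", "Earth", "Earth", "Earth"],
    "Orbit ID": [17, 21, 22, 7],
})

# A column with a single distinct value carries no information for a model.
constant_columns = [col for col in sample.columns if sample[col].nunique() == 1]
print(constant_columns)
```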

Let's take a look at `Orbit ID`.

In [13]:
# look at 'Orbit ID'
asteroids["Orbit ID"]

0       17
1       21
2       22
3        7
4       25
        ..
4682     4
4683     2
4684    17
4685     6
4686    13
Name: Orbit ID, Length: 4687, dtype: int64

In [14]:
# see unique values
asteroids["Orbit ID"].unique()

array([ 17,  21,  22,   7,  25,  40,  43, 100,  30,  12,  23,   5,  42,
        26,   4,  27,  16,  29,  13,   8,  32,  10,   2, 117,  14,  34,
         6,  41,  80,  39,  48,  11,   9,  69,  36,  44,  45,  52,  18,
        24,  19,  72, 253,  50,  75,  38, 121,  67,  37,  28,  94,  60,
        55,  15,  57, 101,  78,   3,  51,  20,  33, 109,  49, 167,  47,
        65, 115,  59,  68,  97,  77,  83,  54,  56,  84,  31,  70,  73,
        87, 236,  53, 193, 164,  64, 271,  35, 412, 138,  85,  88,  96,
       184,  74, 143, 128,  61,   1, 154, 104, 133, 328, 120, 192,  62,
        46, 111, 112,  91, 370,  92,  93, 137,  95,  81, 105, 190, 134,
        71, 122, 182,  89, 146, 350, 102,  66,  58, 132,  63, 131, 165,
       238,  99, 159, 214, 140, 185, 147, 229,  90, 213,  82, 108, 116,
       149, 113, 289, 211, 158, 156,  76, 188,  79, 611, 175, 212, 264,
       114, 130, 170, 324, 119, 127, 259, 453, 285, 123, 337, 103, 106,
       362, 386, 335, 125, 126, 157, 148, 163, 176, 422, 243, 20

If we were to look at a map of asteroids orbiting Earth, we would see rings going around the planet.  Without the metadata, it is difficult to be certain what this column represents, but it could be an identifier of an orbital pattern.  Because we don't know exactly what this column means, we will remove it from this classification project.

Let's look at `Orbit Determination Date`.

In [15]:
# look at 'Orbit Determination Date'
asteroids["Orbit Determination Date"].head()

0    2017-04-06 08:36:37
1    2017-04-06 08:32:49
2    2017-04-06 09:20:19
3    2017-04-06 09:15:49
4    2017-04-06 08:57:58
Name: Orbit Determination Date, dtype: object

This column also contains dates, this time with times attached.  Again, without metadata we cannot say for sure how this column is defined, but the name suggests it records the date and time at which the asteroid's orbit was determined (calculated).  Much like `Close Approach Date`, `Orbit Determination Date` is not helpful for this classification project, though it might provide helpful insight for other projects.

Next, we will check out the `Orbit Uncertainity` column.

In [16]:
# look at 'Orbit Uncertainty'
asteroids["Orbit Uncertainity"]

0       5
1       3
2       0
3       6
4       1
       ..
4682    8
4683    6
4684    6
4685    5
4686    6
Name: Orbit Uncertainity, Length: 4687, dtype: int64

In [17]:
asteroids["Orbit Uncertainity"].unique()

array([5, 3, 0, 6, 1, 4, 2, 7, 8, 9], dtype=int64)

The column name has a spelling error (an extra 'i'), so if we were keeping it we could correct that.  However, we do not have much information about this column, and therefore we will not use it in this classification project.
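
For reference, if the column were kept, the spelling could be corrected with `rename()`.  The sketch below works on a stand-in frame holding the first five values shown above; `rename()` returns a new frame and leaves the original untouched.

```python
import pandas as pd

# Stand-in frame with the misspelled column name from the dataset.
sample = pd.DataFrame({"Orbit Uncertainity": [5, 3, 0, 6, 1]})

# Map the misspelled name to the corrected one.
renamed = sample.rename(columns={"Orbit Uncertainity": "Orbit Uncertainty"})
print(list(renamed.columns))
```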

Finally, let's look at `Equinox`.

In [18]:
# look at 'Equinox'
asteroids["Equinox"]

0       J2000
1       J2000
2       J2000
3       J2000
4       J2000
        ...  
4682    J2000
4683    J2000
4684    J2000
4685    J2000
4686    J2000
Name: Equinox, Length: 4687, dtype: object

In [19]:
# see unique values in `Equinox` column
asteroids["Equinox"].unique()

array(['J2000'], dtype=object)

An "equinox" is the moment when the sun crosses the celestial equator, so that it sits directly above Earth's equator and day and night are nearly equal in length.  In this column, however, the value is not an ordinary date and time but a reference epoch.  This format is called the "standard equinox (and epoch)", where "J" stands for "Julian epoch" and "2000" refers to January 1, 2000, 12:00 Terrestrial Time ([more here](https://community.esri.com/t5/coordinate-reference-systems/drifting-of-the-celestial-sphere-what-is-j2000/ba-p/902058)).  It has been the standard value since 1984, and because it is identical in every row it does not help determine whether an asteroid is hazardous, so we will remove this column from our project.

None of the columns with an `object` or `int64` data type are helpful when determining if an asteroid is hazardous or not so we will drop them when we clean the data.

Also, earlier we mentioned removing the first two columns as they are identifiers and contain the exact same data.  We will do this in the next step as well.

### Clean data 

In [20]:
# drop columns with 'object' or 'int64' dtypes
asteroids = asteroids.select_dtypes(exclude = ["object", "int64"])    

[Code from Stack Overflow](https://stackoverflow.com/questions/48817592/how-to-drop-dataframe-columns-based-on-dtype)
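
Earlier we noted that duplicates should be removed before the identifier columns are dropped.  Here is one possible sketch of that ordering on a hypothetical stand-in frame.  Keeping only the first row per `Neo Reference ID` is an assumption: in the real data, repeated IDs may be separate close approaches of the same asteroid, so whether to drop them is a judgment call.

```python
import pandas as pd

# Hypothetical stand-in frame: the first and third rows share an ID.
sample = pd.DataFrame({
    "Neo Reference ID": [3703080, 3723955, 3703080],
    "Absolute Magnitude": [21.6, 21.3, 21.6],
    "Hazardous": [True, False, True],
})

# Step 1: keep the first row for each asteroid while the ID is still available.
deduplicated = sample.drop_duplicates(subset="Neo Reference ID", keep="first")

# Step 2: drop the identifier columns by dtype (the real data also has
# 'object' columns, which the same call removes).
cleaned = deduplicated.select_dtypes(exclude=["object", "int64"])

print(cleaned)
```

The boolean `Hazardous` column survives the dtype filter, so it is still available as the target afterwards.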

### Visualize data

***

## Split data

In [22]:
from sklearn.model_selection import train_test_split

In [23]:
X = asteroids.drop(columns = ["Hazardous"])
y = asteroids["Hazardous"]

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3)
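
The split above is random on every run and does not account for class balance.  As an optional variation (not the split used in this notebook), the sketch below fixes `random_state` for reproducibility and passes `stratify` so both sets keep a similar share of hazardous asteroids.  The frame `toy` and all of its values are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 3 hazardous and 7 non-hazardous rows.
toy = pd.DataFrame({
    "Absolute Magnitude": [21.6, 21.3, 20.3, 27.4, 21.6, 19.6, 19.6, 19.2, 17.8, 21.5],
    "Hazardous": [True, False, True, False, True, False, False, False, False, False],
})

X_toy = toy.drop(columns=["Hazardous"])
y_toy = toy["Hazardous"]

# random_state makes the split repeatable; stratify keeps class proportions
# roughly equal in the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))
```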

***

## Find Model

In [25]:
from sklearn.model_selection import GridSearchCV

***

## Model 1

***

## Model 2