# Asteroid Logistic Regression
The data describes asteroids and comes from NeoWs.
NeoWs (Near Earth Object Web Service) is a RESTful web service for near-Earth asteroid information. With NeoWs a user can search for asteroids by their closest approach date to Earth, look up a specific asteroid by its NASA JPL small body ID, and browse the overall data-set.

## Acknowledgements
Data-set: All the data is from http://neo.jpl.nasa.gov/. This API is maintained by the SpaceRocks Team: David Greenfield, Arezu Sarvestani, Jason English, and Peter Baunach.

## Inspiration
* Finding potentially hazardous and non-hazardous asteroids
* Identifying the features responsible for classifying an asteroid as hazardous

## Variables: (n=21)

* **Absolute Magnitude**: Measure of the asteroid's intrinsic mean brightness
* **Estimated Diameter**: Approximate asteroid size, measured by diameter
* **Minimum Orbit Intersection**: Minimum distance between the osculating orbits of Earth and the asteroid
* **Orbital Period**: Time it takes the asteroid to complete one full orbit
* **Perihelion Distance**: Distance of the asteroid from the Sun at its closest point

In [352]:
#from google.colab import drive
#drive.mount('/content/drive')

In [353]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(rc={"figure.figsize": (10, 6)})
import matplotlib.pyplot as plt
import scipy.stats as stats

plt.style.use('ggplot') # setting the plot style
%matplotlib inline

# ignore various warnings
import warnings
warnings.filterwarnings('ignore')

In [354]:
asteroidData = pd.read_csv("nasa.csv")

# Exploratory Data Analysis


In [355]:
asteroidData.head()


Unnamed: 0,Neo Reference ID,Name,Absolute Magnitude,Est Dia in KM(min),Est Dia in KM(max),Est Dia in M(min),Est Dia in M(max),Est Dia in Miles(min),Est Dia in Miles(max),Est Dia in Feet(min),...,Asc Node Longitude,Orbital Period,Perihelion Distance,Perihelion Arg,Aphelion Dist,Perihelion Time,Mean Anomaly,Mean Motion,Equinox,Hazardous
0,3703080,3703080,21.6,0.12722,0.284472,127.219879,284.472297,0.079051,0.176763,417.388066,...,314.373913,609.599786,0.808259,57.25747,2.005764,2458162.0,264.837533,0.590551,J2000,True
1,3723955,3723955,21.3,0.146068,0.326618,146.067964,326.617897,0.090762,0.202951,479.22562,...,136.717242,425.869294,0.7182,313.091975,1.497352,2457795.0,173.741112,0.84533,J2000,False
2,2446862,2446862,20.3,0.231502,0.517654,231.502122,517.654482,0.143849,0.321655,759.521423,...,259.475979,643.580228,0.950791,248.415038,1.966857,2458120.0,292.893654,0.559371,J2000,True
3,3092506,3092506,27.4,0.008801,0.019681,8.801465,19.680675,0.005469,0.012229,28.876199,...,57.173266,514.08214,0.983902,18.707701,1.527904,2457902.0,68.741007,0.700277,J2000,False
4,3514799,3514799,21.6,0.12722,0.284472,127.219879,284.472297,0.079051,0.176763,417.388066,...,84.629307,495.597821,0.967687,158.263596,1.483543,2457814.0,135.142133,0.726395,J2000,True


Many variables are redundant: for example, the estimated diameter is reported in kilometers, meters, miles, and feet. We are going to drop some columns.
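One way to spot this kind of redundancy is to look for columns that are almost perfectly correlated with an earlier column, which is what a unit conversion produces. The sketch below uses a small made-up frame rather than `nasa.csv`; the column names, the values, the helper `redundant_columns`, and the 0.999 threshold are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one diameter in three units plus an unrelated column.
toy = pd.DataFrame({
    "dia_km": [0.127, 0.146, 0.232, 0.009, 0.300],
    "velocity": [22017.0, 65210.0, 27326.0, 40225.0, 35426.0],
})
toy["dia_m"] = toy["dia_km"] * 1000.0
toy["dia_miles"] = toy["dia_km"] * 0.621371


def redundant_columns(df, threshold=0.999):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair of columns is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


redundant = redundant_columns(toy)
print(redundant)
```

A correlation check only flags linear duplicates, so identifier columns such as `Name` still have to be dropped by hand.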




In [356]:
#Let's get the columns of our dataset
asteroidData.columns

Index(['Neo Reference ID', 'Name', 'Absolute Magnitude', 'Est Dia in KM(min)',
       'Est Dia in KM(max)', 'Est Dia in M(min)', 'Est Dia in M(max)',
       'Est Dia in Miles(min)', 'Est Dia in Miles(max)',
       'Est Dia in Feet(min)', 'Est Dia in Feet(max)', 'Close Approach Date',
       'Epoch Date Close Approach', 'Relative Velocity km per sec',
       'Relative Velocity km per hr', 'Miles per hour',
       'Miss Dist.(Astronomical)', 'Miss Dist.(lunar)',
       'Miss Dist.(kilometers)', 'Miss Dist.(miles)', 'Orbiting Body',
       'Orbit ID', 'Orbit Determination Date', 'Orbit Uncertainity',
       'Minimum Orbit Intersection', 'Jupiter Tisserand Invariant',
       'Epoch Osculation', 'Eccentricity', 'Semi Major Axis', 'Inclination',
       'Asc Node Longitude', 'Orbital Period', 'Perihelion Distance',
       'Perihelion Arg', 'Aphelion Dist', 'Perihelion Time', 'Mean Anomaly',
       'Mean Motion', 'Equinox', 'Hazardous'],
      dtype='object')

Let's trim down the unwanted ones.


In [357]:
asteroidData.drop(
    columns=[
        'Neo Reference ID', 'Name',
        'Est Dia in KM(max)', 'Est Dia in M(min)', 'Est Dia in M(max)',
        'Est Dia in Miles(min)', 'Est Dia in Miles(max)',
        'Est Dia in Feet(min)', 'Est Dia in Feet(max)',
        'Close Approach Date', 'Epoch Date Close Approach',
        'Relative Velocity km per sec', 'Miles per hour',
        'Miss Dist.(lunar)', 'Miss Dist.(kilometers)', 'Miss Dist.(miles)',
        'Orbit Determination Date', 'Orbiting Body', 'Equinox',
    ],
    inplace=True,
)

We check for missing values


In [358]:
print(asteroidData.isnull().sum())

Absolute Magnitude             0
Est Dia in KM(min)             0
Relative Velocity km per hr    0
Miss Dist.(Astronomical)       0
Orbit ID                       0
Orbit Uncertainity             0
Minimum Orbit Intersection     0
Jupiter Tisserand Invariant    0
Epoch Osculation               0
Eccentricity                   0
Semi Major Axis                0
Inclination                    0
Asc Node Longitude             0
Orbital Period                 0
Perihelion Distance            0
Perihelion Arg                 0
Aphelion Dist                  0
Perihelion Time                0
Mean Anomaly                   0
Mean Motion                    0
Hazardous                      0
dtype: int64


In [359]:
asteroidData.head()

Unnamed: 0,Absolute Magnitude,Est Dia in KM(min),Relative Velocity km per hr,Miss Dist.(Astronomical),Orbit ID,Orbit Uncertainity,Minimum Orbit Intersection,Jupiter Tisserand Invariant,Epoch Osculation,Eccentricity,...,Inclination,Asc Node Longitude,Orbital Period,Perihelion Distance,Perihelion Arg,Aphelion Dist,Perihelion Time,Mean Anomaly,Mean Motion,Hazardous
0,21.6,0.12722,22017.003799,0.419483,17,5,0.025282,4.634,2458000.5,0.425549,...,6.025981,314.373913,609.599786,0.808259,57.25747,2.005764,2458162.0,264.837533,0.590551,True
1,21.3,0.146068,65210.346095,0.383014,21,3,0.186935,5.457,2458000.5,0.351674,...,28.412996,136.717242,425.869294,0.7182,313.091975,1.497352,2457795.0,173.741112,0.84533,False
2,20.3,0.231502,27326.560182,0.050956,22,0,0.043058,4.557,2458000.5,0.348248,...,4.237961,259.475979,643.580228,0.950791,248.415038,1.966857,2458120.0,292.893654,0.559371,True
3,27.4,0.008801,40225.948191,0.285322,7,6,0.005512,5.093,2458000.5,0.216578,...,7.905894,57.173266,514.08214,0.983902,18.707701,1.527904,2457902.0,68.741007,0.700277,False
4,21.6,0.12722,35426.991794,0.407832,25,1,0.034798,5.154,2458000.5,0.210448,...,16.793382,84.629307,495.597821,0.967687,158.263596,1.483543,2457814.0,135.142133,0.726395,True


In a classification problem, the target variable (also known as the dependent variable or label) is the class you want to predict. The target is often categorical, taking non-numeric values such as class names (e.g., "dog", "cat", "car") or categories (e.g., "yes", "no", "maybe"). Here the target, `Hazardous`, is a boolean column (True/False). Most machine learning algorithms operate on numbers, so class labels are commonly converted to integers before training. This is where LabelEncoder comes in.

# Why Use LabelEncoder on the Target Variable
*  Machine Learning Models Typically Require Numeric Input:

Most machine learning algorithms (such as logistic regression, decision trees, random forests, SVMs, etc.) operate on numbers internally. scikit-learn classifiers can accept string or boolean class labels and encode them for you, but encoding the target explicitly makes the mapping visible and keeps it under your control.
LabelEncoder converts categorical class labels into integers, assigned in sorted order of the labels. For example, if your target has classes ["dog", "cat", "bird"], LabelEncoder maps "bird" to 0, "cat" to 1, and "dog" to 2, so the list becomes [2, 1, 0].
*  Consistency:

Label encoding ensures a consistent numeric representation for each class. For instance, with the categories ["yes", "no"], LabelEncoder maps "no" to 0 and "yes" to 1, because labels are sorted before codes are assigned. This consistent mapping is what lets you relate predictions back to the original classes.
*  Efficient Model Training:

Many machine learning models rely on mathematical operations, which are more efficient with numeric representations. Encoding categorical labels numerically makes the training process faster and more efficient.
*  Model Interpretability:

Some models, such as logistic regression, can provide probabilities for each class. With numeric labels, the models know how to assign and interpret probabilities for class predictions more easily, making the results interpretable.
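The mapping described above can be checked directly. This is a minimal sketch on toy labels, not on the asteroid data; the variable names `demo_encoder`, `bool_encoder`, `encoded`, `hazard_codes`, and `decoded` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# String labels: codes follow the sorted order of the distinct labels.
demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(["dog", "cat", "bird", "cat"])
print(demo_encoder.classes_)  # distinct labels in sorted order
print(encoded)                # integer code for each input label

# Boolean labels, as in the 'Hazardous' column: False sorts before True.
bool_encoder = LabelEncoder()
hazard_codes = bool_encoder.fit_transform(np.array([True, False, True]))
print(hazard_codes)

# inverse_transform recovers the original labels from the codes.
decoded = demo_encoder.inverse_transform(encoded)
print(decoded)
```

Because the codes depend on sort order, the fitted `classes_` attribute is the reliable place to look up which integer stands for which class.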

In [360]:
# Encode the 'Hazardous' column (False -> 0, True -> 1).
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [361]:
asteroidData['Hazardous'] = le.fit_transform(asteroidData['Hazardous'])

In [362]:
asteroidData['Hazardous'].value_counts()


Hazardous
0    3932
1     755
Name: count, dtype: int64

Having an imbalanced dataset is a problem in classification tasks because the distribution of the target classes is heavily skewed. This means one class (typically called the majority class) has significantly more examples than the other class(es) (the minority class). This imbalance can lead to several issues during model training, evaluation, and real-world performance.

Key Reasons Why an Imbalanced Dataset is a Problem:

1. **Biased Model Predictions:**
Machine learning models tend to be biased toward the majority class because it dominates the training data. As a result, a model may predict the majority class for most inputs and reach high overall accuracy without learning to distinguish the minority class.
For example, if 95% of the data belongs to class A and 5% to class B, a model reaches 95% accuracy by predicting class A for every instance, even though it never identifies class B.
2. **Misleading Accuracy:**
Accuracy is not a good metric for evaluating models trained on imbalanced datasets, because a high score can hide poor performance on the minority class, which is often the class of interest.
Example: in this dataset 3,932 of 4,687 asteroids (about 84%) are non-hazardous, so labelling every asteroid non-hazardous would score about 84% accuracy while detecting no hazardous asteroid at all.
3. **Poor Generalization for the Minority Class:**
The model might not generalize well to the minority class because it sees far fewer examples of that class during training. This can lead to poor recall or precision for the minority class in real-world applications where recognizing rare events (like fraud detection, disease diagnosis, or defect detection) is critical.
4. **Skewed Decision Boundaries:**
Many classification algorithms, especially those based on decision boundaries (like logistic regression or SVM), might learn a decision boundary that favors the majority class. This happens because the model minimizes the overall error, and most of that error comes from the majority class.
As a result, the boundary shifts toward the minority class, making it harder for the model to classify minority examples correctly.
5. **Evaluation Metrics Are Misleading:**
With an imbalanced dataset, accuracy can mislead because it summarizes overall performance only. Instead, rely on metrics such as:
   - **Precision**: The proportion of true positives among the predicted positives.
   - **Recall**: The proportion of true positives among the actual positives (important for minority class detection).
   - **F1-score**: The harmonic mean of precision and recall, balancing both.
   - **ROC-AUC (Area Under the Curve)**: Evaluates how well the model can distinguish between classes across different thresholds.
   - **Confusion Matrix**: Shows the true positives, false positives, true negatives, and false negatives, providing a clearer picture of performance for both classes.
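To make the accuracy problem concrete for this dataset, the sketch below rebuilds the label vector from the class counts printed by `value_counts()` and scores a hypothetical "model" that always predicts the majority class. The names `y_majority` and `baseline_*` are illustrative, and no real classifier is involved.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Class counts reported by value_counts() above.
n_safe, n_hazardous = 3932, 755
y_true = np.array([0] * n_safe + [1] * n_hazardous)

# A "model" that always predicts the majority class (non-hazardous).
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
baseline_recall = recall_score(y_true, y_majority)
# zero_division=0 avoids a warning: precision is undefined with no positive predictions.
baseline_f1 = f1_score(y_true, y_majority, zero_division=0)

print(f"accuracy: {baseline_accuracy:.3f}")
print(f"recall (hazardous): {baseline_recall:.3f}")
print(f"F1 (hazardous): {baseline_f1:.3f}")
```

Any real model should be judged against this baseline: accuracy near 84% alone says nothing about hazardous asteroids, while recall and F1 for class 1 expose the failure immediately.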

In [363]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, f1_score
import time
import matplotlib.pyplot as plt

# Load dataset and preprocess the data
asteroidData = pd.read_csv("nasa.csv")

# Drop unnecessary columns
drop_cols = ['Neo Reference ID', 'Name', 'Est Dia in M(min)', 'Est Dia in M(max)', 
             'Close Approach Date', 'Epoch Date Close Approach', 'Est Dia in Miles(max)', 
             'Est Dia in Miles(min)', 'Est Dia in Feet(min)', 'Est Dia in Feet(max)', 
             'Relative Velocity km per sec', 'Orbit Determination Date', 
             'Orbiting Body', 'Equinox', 'Miss Dist.(lunar)', 'Miss Dist.(kilometers)', 
             'Miss Dist.(miles)', 'Miles per hour', 'Est Dia in KM(max)']

asteroidData.drop(columns=drop_cols, inplace=True)

# Encode target variable and scale features
le = LabelEncoder()
asteroidData['Hazardous'] = le.fit_transform(asteroidData['Hazardous'])
y = asteroidData['Hazardous']
X = asteroidData.drop(columns=['Hazardous'])
sc = StandardScaler()
X_scaled = sc.fit_transform(X)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0)

# Helper that fits a model, times the fit, and scores it on the test set
def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit `model` and return its training time and test-set metrics."""
    start_time = time.time()
    model.fit(X_train, y_train)
    delta_time = time.time() - start_time  # training time in seconds
    y_pred = model.predict(X_test)

    # Precision, recall, and F1 are reported for the positive (hazardous) class
    return {
        'delta_time': delta_time,
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1_score': f1_score(y_test, y_pred),
        'confusion_matrix': confusion_matrix(y_test, y_pred),
        'classification_report': classification_report(y_test, y_pred, output_dict=True)
    }

# Create models
models = {
    'Logistic Regression': LogisticRegression(),
    'SVM': SVC(kernel='linear', probability=True),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

# Evaluate each model and store results
model_results = []
for name, model in models.items():
    result = evaluate_model(model, X_train, y_train, X_test, y_test)
    result['model'] = name
    model_results.append(result)

# Evaluate ensemble model
ensemble_model = VotingClassifier(estimators=[
    ('lr', models['Logistic Regression']),
    ('svm', models['SVM']),
    ('tree', models['Decision Tree'])], voting='soft')

ensemble_result = evaluate_model(ensemble_model, X_train, y_train, X_test, y_test)
ensemble_result['model'] = 'Ensemble'
model_results.append(ensemble_result)

### Create and train a Random Forest model

In [364]:
# Create and train a Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = rf.predict(X_test)

# Calculate accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf * 100:.2f}%")

# Confusion Matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
print("Confusion Matrix:\n", cm_rf)

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred_rf))

Random Forest Accuracy: 99.25%
Confusion Matrix:
 [[792   1]
 [  6 139]]
Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      1.00       793
           1       0.99      0.96      0.98       145

    accuracy                           0.99       938
   macro avg       0.99      0.98      0.99       938
weighted avg       0.99      0.99      0.99       938



To append the Random Forest model and its results to the existing models dictionary and model_results list, you can follow these steps:

In [365]:
# Step 1: Append the Random Forest model to the models dictionary
models['Random Forest'] = rf

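# Step 2: Create a results dictionary for the Random Forest model,
# using the same metric keys as evaluate_model (fit time was not measured here)
result_rf = {
    'model': 'Random Forest',
    'accuracy': accuracy_rf,
    'precision': precision_score(y_test, y_pred_rf),
    'recall': recall_score(y_test, y_pred_rf),
    'f1_score': f1_score(y_test, y_pred_rf),
    'confusion_matrix': cm_rf,
    'classification_report': classification_report(y_test, y_pred_rf, output_dict=True)  # dict for consistency
}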

# Step 3: Append the Random Forest result to the model_results list
model_results.append(result_rf)

# Results

In [366]:
# Convert model results to DataFrame for easy analysis
results_df = pd.DataFrame(model_results)

# Print classification reports and confusion matrices for each model
for model_info in model_results:
    print(f"---------- {model_info['model']} TRAINING TIME --------------")
    print("-------- Confusion Matrix --------")
    print(model_info['confusion_matrix'])
    print("-------- Classification Report --------")
    print(classification_report(y_test, models.get(model_info['model'], ensemble_model).predict(X_test)))
    print("\n")

---------- Logistic Regression TRAINING TIME --------------
-------- Confusion Matrix --------
[[780  13]
 [ 21 124]]
-------- Classification Report --------
              precision    recall  f1-score   support

           0       0.97      0.98      0.98       793
           1       0.91      0.86      0.88       145

    accuracy                           0.96       938
   macro avg       0.94      0.92      0.93       938
weighted avg       0.96      0.96      0.96       938



---------- SVM TRAINING TIME --------------
-------- Confusion Matrix --------
[[778  15]
 [ 16 129]]
-------- Classification Report --------
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       793
           1       0.90      0.89      0.89       145

    accuracy                           0.97       938
   macro avg       0.94      0.94      0.94       938
weighted avg       0.97      0.97      0.97       938



---------- Decision Tree TRAINING TIME ---

In this dataset, the confusion matrix provides detailed insight into how well each model differentiates between hazardous and non-hazardous objects:

1. **Confusion Matrix Analysis**:
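   - The entries in the confusion matrix represent the following (scikit-learn lays the matrix out as [[TN, FP], [FN, TP]], with rows for actual classes and columns for predicted classes):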
     - **True Negatives (TN)**: Objects correctly identified as non-hazardous.
     - **False Positives (FP)**: Non-hazardous objects incorrectly classified as hazardous.
     - **False Negatives (FN)**: Hazardous objects incorrectly classified as non-hazardous.
     - **True Positives (TP)**: Objects correctly identified as hazardous.

   Given this context:
   - Models with fewer **False Negatives** are preferable, as misclassifying hazardous objects (FN) has a higher potential risk. For instance, in the `Decision Tree` model's confusion matrix:
     \[
     \begin{bmatrix}
     792 & 1 \\
     5 & 140
     \end{bmatrix}
     \]
     There are only 5 false negatives, so 140 of the 145 hazardous objects are detected (recall of about 0.97).
   - Similarly, models with fewer **False Positives** are beneficial since minimizing them ensures fewer resources are wasted on objects misclassified as hazardous.

2. **Choosing the Right Metric for Hyperparameter Tuning**:
   - **Recall** is essential here, particularly for the hazardous class, because a high recall means that the model correctly identifies most hazardous objects, minimizing the risk of overlooking potential threats.
   - **Precision** is also significant, as it measures the share of objects flagged as hazardous that really are hazardous, so a high value means few false alarms. However, in high-stakes scenarios, recall often takes priority.
   - **F1 Score** is a balanced metric (harmonic mean of precision and recall) and is useful when there’s a need to balance both precision and recall, especially if neither metric can be compromised.

   For this dataset, focusing on **recall for the hazardous class** is likely the best choice since it ensures that the model identifies as many true hazardous objects as possible. Alternatively, using **F1 Score** can provide a balance if both recall and precision are important.
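The metrics discussed above can be derived by hand from any of the confusion matrices printed in the Results section. The sketch below hard-codes three of those matrices and computes the hazardous-class metrics; the helper name `hazardous_metrics` and the dictionary layout are illustrative choices.

```python
import numpy as np

# Confusion matrices copied from the outputs above, laid out as [[TN, FP], [FN, TP]].
reported = {
    "Logistic Regression": np.array([[780, 13], [21, 124]]),
    "SVM": np.array([[778, 15], [16, 129]]),
    "Random Forest": np.array([[792, 1], [6, 139]]),
}


def hazardous_metrics(cm):
    """Compute precision, recall, and F1 for the positive (hazardous) class."""
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
        "false_negatives": int(fn),
    }


summary = {name: hazardous_metrics(cm) for name, cm in reported.items()}
for name, m in summary.items():
    print(f"{name}: recall={m['recall']:.2f} precision={m['precision']:.2f} "
          f"f1={m['f1']:.2f} missed={m['false_negatives']}")
```

Sorting the models by `false_negatives` (or by `recall`) gives the ranking that matters most when a missed hazardous object is the costliest error.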

### Recommendation for Hyperparameter Tuning
- **Recall** for the hazardous class (`1`) should be the main metric for hyperparameter tuning.
- **F1 Score** is a good secondary metric if you want to balance precision and recall without overly sacrificing either.

This tuning approach will help ensure the model’s effectiveness in identifying hazardous objects while keeping false positives within acceptable limits.
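As a rough illustration of this recommendation, the sketch below tunes a logistic regression with `GridSearchCV`, selecting on recall for class `1` and tracking F1 as a secondary metric. It runs on synthetic data from `make_classification` with a class split similar to the asteroid data, not on `nasa.csv`; the parameter grid and all variable names are assumptions chosen for the example. Putting the scaler inside a `Pipeline` means it is refit on each training fold, so no validation data influences the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asteroid features (roughly an 84% / 16% class split).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.84, 0.16], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Scaling lives inside the pipeline so it is fit on training folds only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: regularization strength and optional class re-weighting.
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "clf__class_weight": [None, "balanced"],
}

# Recall on class 1 selects the winner; F1 is recorded alongside it.
search = GridSearchCV(
    pipe, param_grid,
    scoring={"recall": "recall", "f1": "f1"},
    refit="recall", cv=5,
)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("cv recall:", search.best_score_)
print("cv f1 at best recall:", search.cv_results_["mean_test_f1"][search.best_index_])
print("held-out accuracy:", search.score(X_te, y_te))
```

Selecting purely on recall tends to favor settings that raise false positives, which is why the F1 column is worth inspecting before accepting the winning configuration.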