# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before you start coding, provide the link to your dataset below.

My dataset:

Import the necessary libraries and create your dataframe(s).

In [9]:
import pandas as pd
import numpy as np
# Load dataset
df = pd.read_csv('neo_v2.csv')
# Quick check to confirm load
df.head()


Unnamed: 0,id,name,est_diameter_min,est_diameter_max,relative_velocity,miss_distance,orbiting_body,sentry_object,absolute_magnitude,hazardous
0,2162635,162635 (2000 SS164),1.198271,2.679415,13569.249224,54839740.0,Earth,False,16.73,False
1,2277475,277475 (2005 WK4),0.2658,0.594347,73588.726663,61438130.0,Earth,False,20.0,True
2,2512244,512244 (2015 YE18),0.72203,1.614507,114258.692129,49798720.0,Earth,False,17.83,False
3,3596030,(2012 BV13),0.096506,0.215794,24764.303138,25434970.0,Earth,False,22.2,False
4,3667127,(2014 GE35),0.255009,0.570217,42737.733765,46275570.0,Earth,False,20.09,True


## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [10]:
# Check for missing values in each column
df.isnull().sum()
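# Extract the four-digit number that follows the opening parenthesis in `name`
# (for example, 2000 from "162635 (2000 SS164)") and store it as `year`
df['year'] = df['name'].str.extract(r'\((\d{4})')
# Convert to numeric; anything that cannot be parsed becomes NaN
df['year'] = pd.to_numeric(df['year'], errors='coerce')
# Count the rows where no year could be extracted
df['year'].isnull().sum()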


np.int64(6)
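
The pattern above accepts any four digits after the opening parenthesis, and the summary statistics later in this notebook report a maximum `year` of 6743, which cannot be a calendar year. One possible tightening, shown here as a sketch on a toy series rather than on the full dataset, is to accept only values that start with 19 or 20 and are followed by a space. The last two names in the sample are hypothetical and exist only to show the failure cases.

```python
import pandas as pd

# Toy sample: the first two names appear in the df.head() output;
# the last two are hypothetical and only illustrate values that should be rejected.
names = pd.Series([
    "162635 (2000 SS164)",
    "(2012 BV13)",
    "(6743 P-L)",   # hypothetical: four digits that are not a year
    "433 Eros",     # hypothetical: no parenthesised designation at all
])

# Accept only four digits beginning with 19 or 20, followed by a space.
year = names.str.extract(r"\(((?:19|20)\d{2}) ", expand=False)

# The nullable Int64 dtype keeps years as whole numbers and still allows missing values.
year = pd.to_numeric(year, errors="coerce").astype("Int64")

print(year.tolist())
print("missing years:", int(year.isna().sum()))
```

Rows left without a year can stay as missing values if `year` is only used for grouping or plotting, since six rows out of 90,836 is a very small share.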

## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [11]:
# Review extreme values in diameter
df[['est_diameter_min', 'est_diameter_max']].describe()
# Diameter values are extremely skewed.
# A small number of very large asteroids (40–85 km) distort analysis.
# These are real observations so I should not delete them.
# I will cap them for visualization and modeling purposes.
# Create capped versions of diameter columns for analysis
df['est_diameter_max_capped'] = df['est_diameter_max'].clip(upper=5)
df['est_diameter_min_capped'] = df['est_diameter_min'].clip(upper=5)
# Check velocity and miss distance for extreme values
df[['relative_velocity', 'miss_distance']].describe()



Unnamed: 0,relative_velocity,miss_distance
count,90836.0,90836.0
mean,48066.918918,37066550.0
std,25293.296961,22352040.0
min,203.346433,6745.533
25%,28619.020645,17210820.0
50%,44190.11789,37846580.0
75%,62923.604633,56549000.0
max,236990.128088,74798650.0
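
As a complement to the fixed cap of 5 used above, the sketch below shows two common alternatives: flagging outliers with the 1.5 × IQR rule, and applying a log transform that compresses the long tail without discarding or altering the ordering of any value. It runs on a toy series built from the five `est_diameter_max` values in `df.head()` plus the maximum reported for that column in this notebook, so the threshold it prints is illustrative only.

```python
import numpy as np
import pandas as pd

# Toy values: five rows from df.head() plus the column maximum reported by describe().
diam = pd.Series(
    [2.679415, 0.594347, 1.614507, 0.215794, 0.570217, 84.730541],
    name="est_diameter_max",
)

# Tukey's rule: anything above Q3 + 1.5 * IQR is flagged as an outlier.
q1, q3 = diam.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
is_outlier = diam > upper_fence

# log1p compresses large values while preserving the order of the observations.
diam_log = np.log1p(diam)

print("upper fence:", round(upper_fence, 3))
print("flagged values:", diam[is_outlier].tolist())
```

A flag column keeps the original measurements intact, so the same dataframe can feed both a capped plot and an uncapped model.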


## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address it as needed. Make sure to use code comments to illustrate your thought process.

In [12]:
df.columns

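# Drop unnecessary columns:
# - id and name are identifiers, and the year has already been extracted from name
# - orbiting_body shows only 'Earth' in the rows inspected above
df_clean = df.drop(columns=['id', 'name', 'orbiting_body'])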
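
Before dropping columns, it can help to confirm which ones carry no information and whether any rows are repeated. The sketch below runs those two checks on a toy frame built from three rows of `df.head()` plus one deliberately duplicated row; the duplicate is an assumption added for illustration, not something observed in the dataset.

```python
import pandas as pd

# Toy frame: three rows from df.head(), with the last row repeated on purpose.
sample = pd.DataFrame({
    "id": [2162635, 2277475, 2512244, 2512244],
    "orbiting_body": ["Earth"] * 4,
    "sentry_object": [False] * 4,
    "absolute_magnitude": [16.73, 20.0, 17.83, 17.83],
    "hazardous": [False, True, False, False],
})

# Columns with a single distinct value cannot help distinguish rows.
n_unique = sample.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()

# Fully repeated rows are another form of unnecessary data.
n_duplicates = int(sample.duplicated().sum())

print("constant columns:", constant_cols)
print("duplicate rows:", n_duplicates)
```

On this sample both `orbiting_body` and `sentry_object` come out constant. Whether that holds for the full file has to be checked on `df` itself.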


## Inconsistent Data

Check for inconsistent data and address any that arises. As always, use code comments to illustrate your thought process.

In [13]:
# Check unique values for boolean-like columns 
df_clean['hazardous'].unique()
df_clean['sentry_object'].unique()

# Ensure numeric columns are correctly typed
numeric_cols = [
    'absolute_magnitude',
    'est_diameter_min',
    'est_diameter_max',
    'relative_velocity',
    'miss_distance',
    'year'
]

df_clean[numeric_cols] = df_clean[numeric_cols].apply(
    pd.to_numeric, errors='coerce'
)


df_clean.info()
df_clean.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90836 entries, 0 to 90835
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   est_diameter_min         90836 non-null  float64
 1   est_diameter_max         90836 non-null  float64
 2   relative_velocity        90836 non-null  float64
 3   miss_distance            90836 non-null  float64
 4   sentry_object            90836 non-null  bool   
 5   absolute_magnitude       90836 non-null  float64
 6   hazardous                90836 non-null  bool   
 7   year                     90830 non-null  float64
 8   est_diameter_max_capped  90836 non-null  float64
 9   est_diameter_min_capped  90836 non-null  float64
dtypes: bool(2), float64(8)
memory usage: 5.7 MB


Unnamed: 0,est_diameter_min,est_diameter_max,relative_velocity,miss_distance,absolute_magnitude,year,est_diameter_max_capped,est_diameter_min_capped
count,90836.0,90836.0,90836.0,90836.0,90836.0,90830.0,90836.0,90836.0
mean,0.127432,0.284947,48066.918918,37066550.0,23.527103,2014.048552,0.27898,0.126301
std,0.298511,0.667491,25293.296961,22352040.0,2.894086,32.827234,0.479288,0.234276
min,0.000609,0.001362,203.346433,6745.533,9.23,1929.0,0.001362,0.000609
25%,0.019256,0.043057,28619.020645,17210820.0,21.34,2010.0,0.043057,0.019256
50%,0.048368,0.108153,44190.11789,37846580.0,23.7,2016.0,0.108153,0.048368
75%,0.143402,0.320656,62923.604633,56549000.0,25.7,2019.0,0.320656,0.143402
max,37.89265,84.730541,236990.128088,74798650.0,33.2,6743.0,5.0,5.0
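
Type coercion alone does not catch values that are numeric but implausible, such as the maximum `year` of 6743 in the table above. A minimal sketch of two extra consistency checks follows, using a toy frame; the year bounds of 1900 and 2030 are assumptions chosen for illustration, and the third year is set to the reported maximum to show a failing row.

```python
import pandas as pd

# Toy frame: diameters from the first three rows of df.head();
# the third year is replaced by the describe() maximum to show a failing check.
sample = pd.DataFrame({
    "est_diameter_min": [1.198271, 0.2658, 0.72203],
    "est_diameter_max": [2.679415, 0.594347, 1.614507],
    "year": [2000.0, 2005.0, 6743.0],
})

# Cross-column check: the minimum estimate should never exceed the maximum.
min_exceeds_max = sample["est_diameter_min"] > sample["est_diameter_max"]

# Range check: flag non-missing years outside an assumed plausible window.
implausible_year = sample["year"].notna() & ~sample["year"].between(1900, 2030)

print("rows where min > max:", int(min_exceeds_max.sum()))
print("rows with implausible year:", int(implausible_year.sum()))
```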


## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset?
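Yes. I found missing data (6 names with no extractable year), irregular data (a few diameters far above the rest, up to about 85 km), unnecessary data (the id, name, and orbiting_body columns), and inconsistent data.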
2. Did the process of cleaning your data give you new insights into your dataset?
Yes. Cleaning confirmed that extreme asteroid sizes are rare but real and should not be deleted. It also showed that hazard classification is not strongly tied to size alone.
3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations?
Yes. My visualizations require capped or transformed diameter values to avoid distortion.