# Cleaning the data:

The original dataset, `Asteroid_Dataset.csv`, is available in the data directory. Let's see how it is organized and what it contains:

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("../data/Asteroid_Dataset.csv")


## Dropping columns:

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126131 entries, 0 to 126130
Data columns (total 35 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   spkid           126131 non-null  int64  
 1   full_name       126131 non-null  object 
 2   pha             126131 non-null  object 
 3   H               126131 non-null  float64
 4   diameter        126131 non-null  object 
 5   albedo          126131 non-null  float64
 6   diameter_sigma  126131 non-null  object 
 7   e               126131 non-null  float64
 8   a               126131 non-null  float64
 9   q               126131 non-null  float64
 10  i               126131 non-null  float64
 11  om              126131 non-null  float64
 12  w               126131 non-null  float64
 13  ma              126131 non-null  float64
 14  ad              126131 non-null  float64
 15  n               126131 non-null  float64
 16  tp              126131 non-null  float64
 17  tp_cal    

---

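Many of these columns are not useful for our analysis:
- We drop the uncertainty ($\sigma$) columns of the orbital elements (`sigma_e` through `sigma_per`), together with `moid`, `moid_ld` and `rms`. The `diameter_sigma` column is kept.
- We also drop some of the orbital parameters (`om`, `w`, `ma`, `n`, `tp`, `tp_cal`).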
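As an alternative to selecting the uncertainty columns by position, they can be picked out by name. The following is a minimal sketch on a made-up toy frame (`toy`, `sigma_cols` and `trimmed` are illustrative names, not part of the notebook), assuming the uncertainty columns all start with the prefix `sigma_`, as they do in the column list of this dataset:

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "e": [0.08, 0.23],
    "a": [2.77, 2.77],
    "sigma_e": [1e-9, 2e-9],
    "sigma_a": [3e-9, 4e-9],
    "diameter_sigma": [0.2, 18.0],
})

# Select by prefix: "diameter_sigma" does not start with "sigma_", so it is kept.
sigma_cols = [c for c in toy.columns if c.startswith("sigma_")]

# drop() returns a new DataFrame; the original is left untouched.
trimmed = toy.drop(columns=sigma_cols)

print(sigma_cols)
print(list(trimmed.columns))
```

Selecting by name does not depend on the order of the columns in the file, which a label slice such as `"moid":"sigma_per"` does.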

In [4]:
col = df.columns
col

Index(['spkid', 'full_name', 'pha', 'H', 'diameter', 'albedo',
       'diameter_sigma', 'e', 'a', 'q', 'i', 'om', 'w', 'ma', 'ad', 'n', 'tp',
       'tp_cal', 'per', 'per_y', 'moid', 'moid_ld', 'sigma_e', 'sigma_a',
       'sigma_q', 'sigma_i', 'sigma_om', 'sigma_w', 'sigma_ma', 'sigma_ad',
       'sigma_n', 'sigma_tp', 'sigma_per', 'class', 'rms'],
      dtype='object')

In [5]:
drop1 = df.loc[:, "om":"ma"].columns
drop2 = df.loc[:, "n":"tp_cal"].columns
drop3 = df.loc[:, "moid": "sigma_per"].columns
drop4 = ["rms"]
drop1

Index(['om', 'w', 'ma'], dtype='object')

In [6]:
df.drop(drop1, axis = 1, inplace = True)
df.drop(drop2, axis = 1, inplace = True)
df.drop(drop3, axis = 1, inplace = True)
df.drop(drop4, axis = 1, inplace = True)
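The four `drop` calls above could also be merged into a single, non-mutating call. This is only a sketch on a toy frame with a reduced, made-up set of columns (`toy`, `to_drop` and `cleaned` are hypothetical names):

```python
import pandas as pd

# One dummy row; only the column names matter for this illustration.
names = ["spkid", "e", "om", "w", "ma", "ad", "n", "tp", "tp_cal", "per", "rms"]
toy = pd.DataFrame([[0] * len(names)], columns=names)

# Label slices are inclusive on both ends, so "om":"ma" covers om, w and ma.
to_drop = (
    list(toy.loc[:, "om":"ma"].columns)
    + list(toy.loc[:, "n":"tp_cal"].columns)
    + ["rms"]
)

# A single call without inplace=True returns a new frame and keeps `toy` intact.
cleaned = toy.drop(columns=to_drop)
print(list(cleaned.columns))
```

Keeping the original frame around makes it easy to re-run the cell without reloading the CSV.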

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126131 entries, 0 to 126130
Data columns (total 15 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   spkid           126131 non-null  int64  
 1   full_name       126131 non-null  object 
 2   pha             126131 non-null  object 
 3   H               126131 non-null  float64
 4   diameter        126131 non-null  object 
 5   albedo          126131 non-null  float64
 6   diameter_sigma  126131 non-null  object 
 7   e               126131 non-null  float64
 8   a               126131 non-null  float64
 9   q               126131 non-null  float64
 10  i               126131 non-null  float64
 11  ad              126131 non-null  float64
 12  per             126131 non-null  float64
 13  per_y           126131 non-null  float64
 14  class           126131 non-null  object 
dtypes: float64(9), int64(1), object(5)
memory usage: 14.4+ MB


## Filling values:

There are no null values in the dataset. Let's have a look at the first five rows:

In [8]:
df.head()

|   | spkid | full_name | pha | H | diameter | albedo | diameter_sigma | e | a | q | i | ad | per | per_y | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000001 | ' 1 Ceres' | N | 3.4 | 939.4 | 0.09 | 0.2 | 0.077557 | 2.767657 | 2.553006 | 10.588621 | 2.982308 | 1681.77074 | 4.604437 | MBA |
| 1 | 2000002 | ' 2 Pallas' | N | 4.2 | 545.0 | 0.101 | 18.0 | 0.229972 | 2.773841 | 2.135935 | 34.832934 | 3.411748 | 1687.410991 | 4.61988 | MBA |
| 2 | 2000003 | ' 3 Juno' | N | 5.33 | 246.596 | 0.214 | 10.594 | 0.256936 | 2.668285 | 1.982706 | 12.991044 | 3.353865 | 1592.01377 | 4.358696 | MBA |
| 3 | 2000004 | ' 4 Vesta' | N | 3.0 | 525.4 | 0.4228 | 0.2 | 0.088516 | 2.362014 | 2.152938 | 7.141893 | 2.57109 | 1325.934723 | 3.630211 | MBA |
| 4 | 2000005 | ' 5 Astraea' | N | 6.9 | 106.699 | 0.274 | 3.14 | 0.190913 | 2.574037 | 2.082619 | 5.367427 | 3.065455 | 1508.414423 | 4.129814 | MBA |


The third column, *pha*, is a classification flag: it indicates whether the asteroid is considered potentially hazardous, i.e. whether it could come dangerously close to Earth. The value is `'Y'` if it is and `'N'` if it is not. It would be better to store boolean values in that column instead of single-character strings.

In [9]:
df["pha"].value_counts()

pha
N    125975
Y       156
Name: count, dtype: int64

In [10]:
df["pha"] = df["pha"] == 'Y'

In [11]:
df["pha"].value_counts()

pha
False    125975
True        156
Name: count, dtype: int64
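The comparison `df["pha"] == 'Y'` turns every value other than `'Y'` into `False`, including typos or blanks. Here `value_counts()` shows only `'N'` and `'Y'`, so this is safe. For a file whose contents are less certain, a stricter variant is to map the two known flags explicitly and inspect whatever is left over. A minimal sketch on made-up values (`flags`, `mapped` and `unexpected` are illustrative names):

```python
import pandas as pd

# Made-up flags; "?" plays the role of an unexpected entry.
flags = pd.Series(["N", "Y", "N", "?"], name="pha")

# Values missing from the mapping become NaN instead of silently becoming False.
mapped = flags.map({"Y": True, "N": False})

# Raw entries that were not recognised.
unexpected = flags[mapped.isna()]
print(unexpected.tolist())
```

If `unexpected` is empty, the mapped column can then be converted with `mapped.astype(bool)`.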

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126131 entries, 0 to 126130
Data columns (total 15 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   spkid           126131 non-null  int64  
 1   full_name       126131 non-null  object 
 2   pha             126131 non-null  bool   
 3   H               126131 non-null  float64
 4   diameter        126131 non-null  object 
 5   albedo          126131 non-null  float64
 6   diameter_sigma  126131 non-null  object 
 7   e               126131 non-null  float64
 8   a               126131 non-null  float64
 9   q               126131 non-null  float64
 10  i               126131 non-null  float64
 11  ad              126131 non-null  float64
 12  per             126131 non-null  float64
 13  per_y           126131 non-null  float64
 14  class           126131 non-null  object 
dtypes: bool(1), float64(9), int64(1), object(4)
memory usage: 13.6+ MB


**Watch out!** Have a look at the data types of columns *diameter* and *diameter_sigma*:

In [13]:
df["diameter"].dtype

dtype('O')

In [14]:
df["diameter_sigma"].dtype

dtype('O')

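They both have `dtype = 'object'`, which usually means that some entries could not be parsed as numbers, so pandas stored the values as generic Python objects. We convert both columns with `pd.to_numeric`; with `errors = "coerce"`, any entry that cannot be parsed becomes `NaN`.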

In [22]:
df["diameter"] = pd.to_numeric(df["diameter"], errors = "coerce")
df["diameter_sigma"] = pd.to_numeric(df["diameter_sigma"], errors = "coerce")
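Because `errors = "coerce"` discards unparsable entries silently, it can be worth looking at them before overwriting the column. The sketch below uses made-up strings (the actual offending values in the dataset are not shown in this notebook); `raw`, `converted` and `bad` are hypothetical names:

```python
import pandas as pd

# Made-up raw column as it might look after read_csv; "n/a" is not a number.
raw = pd.Series(["939.4", "545.0", "n/a", "246.596"], name="diameter")

# Convert on a copy first, so the original strings stay available.
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but became NaN after coercion.
bad = raw[converted.isna() & raw.notna()]
print(bad.tolist())
```

Inspecting `bad` shows whether the lost entries are placeholders that are safe to drop or values that need manual repair.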

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126131 entries, 0 to 126130
Data columns (total 15 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   spkid           126131 non-null  int64  
 1   full_name       126131 non-null  object 
 2   pha             126131 non-null  bool   
 3   H               126131 non-null  float64
 4   diameter        126128 non-null  float64
 5   albedo          126131 non-null  float64
 6   diameter_sigma  126035 non-null  float64
 7   e               126131 non-null  float64
 8   a               126131 non-null  float64
 9   q               126131 non-null  float64
 10  i               126131 non-null  float64
 11  ad              126131 non-null  float64
 12  per             126131 non-null  float64
 13  per_y           126131 non-null  float64
 14  class           126131 non-null  object 
dtypes: bool(1), float64(11), int64(1), object(2)
memory usage: 13.6+ MB


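Problem solved: all the columns now have the correct data type. Note, however, that the coercion introduced missing values: *diameter* now has 126128 non-null entries (3 missing) and *diameter_sigma* has 126035 (96 missing), so the dataset is no longer free of nulls.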
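A quick way to count missing values per column after a conversion like this is `isna().sum()`. The following sketch runs on a small made-up frame rather than the real data (`toy` and `missing` are illustrative names):

```python
import numpy as np
import pandas as pd

# Made-up values; NaN marks entries lost during coercion.
toy = pd.DataFrame({
    "diameter": [939.4, np.nan, 246.596],
    "diameter_sigma": [0.2, np.nan, np.nan],
    "albedo": [0.09, 0.101, 0.214],
})

# Number of missing values in each column.
missing = toy.isna().sum()

# Show only the columns that actually contain missing values.
print(missing[missing > 0])
```

Whether to drop or fill the affected rows depends on the later analysis and is not decided here.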

## Save the new DataFrame:

In [24]:
df.to_csv("../data/clean_data.csv", index = False) # index = False --> Doesn't store the Index in the csv
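CSV files do not store data types, so it may be worth checking that the cleaned columns are read back as expected. The sketch below does the round trip in memory with `io.StringIO` on a made-up frame instead of touching `../data/clean_data.csv`; `toy`, `buffer` and `reloaded` are hypothetical names:

```python
import io

import pandas as pd

# Made-up frame with the same kinds of columns as the cleaned dataset.
toy = pd.DataFrame({
    "spkid": [2000001, 2000002],
    "pha": [False, True],
    "diameter": [939.4, None],  # None is stored as NaN in a float column
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Read it back and inspect the inferred dtypes.
reloaded = pd.read_csv(buffer)
print(reloaded.dtypes)
```

With default settings `read_csv` infers booleans from `True`/`False` and reads empty fields as `NaN`, so the boolean and float columns should survive the round trip.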