In [1]:
import pandas as pd

DATA = '/kaggle/input/meteorite-landings-on-earth-data/Meteorite_Landings.csv'
df = pd.read_csv(filepath_or_buffer=DATA).drop(columns=['Unnamed: 10', 'GeoLocation'])
df.head()

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong
0,Aachen,1,Valid,L5,21.0,Fell,1880.0,50.775,6.08333
1,Aarhus,2,Valid,H6,720.0,Fell,1951.0,56.18333,10.23333
2,Abee,6,Valid,EH4,107000.0,Fell,1952.0,54.21667,-113.0
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,1976.0,16.88333,-99.9
4,Achiras,370,Valid,L6,780.0,Fell,1902.0,-33.16667,-64.95


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45716 entries, 0 to 45715
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      45716 non-null  object 
 1   id        45716 non-null  int64  
 2   nametype  45716 non-null  object 
 3   recclass  45716 non-null  object 
 4   mass (g)  45585 non-null  float64
 5   fall      45716 non-null  object 
 6   year      45425 non-null  float64
 7   reclat    38401 non-null  float64
 8   reclong   38401 non-null  float64
dtypes: float64(4), int64(1), object(4)
memory usage: 3.1+ MB


In [3]:
df.nunique()

name        45716
id          45716
nametype        2
recclass      466
mass (g)    12576
fall            2
year          265
reclat      12738
reclong     14640
dtype: int64

Let's take a quick look at a couple of columns that only have two different values.

In [4]:
df['nametype'].value_counts().to_dict()

{'Valid': 45641, 'Relict': 75}

In [5]:
df['fall'].value_counts().to_dict()

{'Found': 44609, 'Fell': 1107}

Let's start by making some exploratory graphs.

In [6]:
from plotly import express
from plotly import io

io.renderers.default = 'iframe'
express.bar(data_frame=df['recclass'].value_counts().to_frame().reset_index().head(n=40), x='recclass', y='count')

Okay, that's interesting. Our dataset is dominated by eight classes, followed by a long tail.
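To put a number on "dominated", one option is to compute the share of rows covered by the top k classes. The following is a minimal sketch: `top_k_share` is a hypothetical helper, not part of the original notebook, and the toy labels are made up purely to show the call.

```python
import pandas as pd


def top_k_share(labels: pd.Series, k: int) -> float:
    """Return the fraction of non-null rows covered by the k most frequent labels."""
    counts = labels.value_counts()  # sorted in descending order by default
    return counts.head(k).sum() / counts.sum()


# Made-up toy labels, only to show the call; not the real recclass column.
toy = pd.Series(['L6'] * 5 + ['H5'] * 3 + ['L5'] * 1 + ['EH4'] * 1)
share = top_k_share(toy, k=2)
print(f'top 2 classes cover {share:.0%} of rows')
```

In the notebook, `top_k_share(df['recclass'], k=8)` would give the corresponding figure for the eight leading classes.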

Let's take a look at the mass data; how do we expect it to be distributed? There are probably meteorites too small to be noticed, but also a practical limit on how large a meteorite could be, so we might expect a normal distribution.

In [7]:
from warnings import filterwarnings

filterwarnings(action='ignore', category=RuntimeWarning)
express.histogram(data_frame=df[df['mass (g)'] < 100], x='mass (g)')

That's not a normal distribution, but it definitely has only one mode. We cut off the right tail (everything at or above 100 g) for clarity, but we didn't really lose anything crucial to the meaning of this graph by doing so.
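An alternative to cutting off the tail is to plot the logarithm of the mass, which keeps the whole range on one axis. This is a sketch under assumptions: `with_log_mass` and the `log10 mass` column name are hypothetical, and the sample uses the five masses shown in `df.head()` plus one missing and one zero value added for illustration.

```python
import numpy as np
import pandas as pd


def with_log_mass(frame: pd.DataFrame, column: str = 'mass (g)') -> pd.DataFrame:
    """Return rows with positive mass, plus a log10 column (hypothetical helper)."""
    positive = frame[frame[column] > 0].copy()  # drops NaN and zero masses
    positive['log10 mass'] = np.log10(positive[column])
    return positive


# The five masses from df.head(), plus a missing and a zero value (illustrative)
sample = pd.DataFrame({'mass (g)': [21.0, 720.0, 107000.0, 1914.0, 780.0, None, 0.0]})
logged = with_log_mass(sample)
print(logged)
```

In the notebook, something like `express.histogram(data_frame=with_log_mass(df), x='log10 mass')` should show the full distribution without truncating at 100 g.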

We have some annual data, so let's take a look at that too.

In [8]:
express.histogram(data_frame=df[(df['year'] > 1850) & (df['year'] < 2025)], x='year')

That's an interesting plot. Do we believe that there were almost no meteorites prior to 1974 and then suddenly there were thousands per year? No, it seems more likely there was a sudden change in people looking for meteorites and in the methods they used.
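One way to probe this explanation is to split the yearly counts by the `fall` column: if the jump reflects search effort, one might expect it to show up mostly among `Found` records rather than `Fell` ones. The sketch below assumes a hypothetical helper `yearly_counts_by_fall`; the toy rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd


def yearly_counts_by_fall(frame: pd.DataFrame) -> pd.DataFrame:
    """Count records per year, with one column per value of 'fall'."""
    return (
        frame.dropna(subset=['year'])   # rows without a year cannot be placed
        .groupby(['year', 'fall'])
        .size()
        .unstack(fill_value=0)          # years with no 'Fell' or 'Found' get 0
    )


# Made-up rows, for illustration only
toy = pd.DataFrame({
    'year': [1974.0, 1974.0, 1974.0, 1951.0, 1951.0, None],
    'fall': ['Found', 'Found', 'Fell', 'Fell', 'Found', 'Found'],
})
counts = yearly_counts_by_fall(toy)
print(counts)
```

Calling `yearly_counts_by_fall(df)` in the notebook would give a table that can be plotted or inspected around 1974.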

We knew this moment was coming; let's make a map.

In [9]:
express.scatter_mapbox(data_frame=df[(df['year'] > 1970) & (df['year'] < 2025)], lat='reclat', lon='reclong', color='year',
                       mapbox_style='open-street-map', height=800, zoom=1, center={'lat': 0, 'lon': 0}, hover_name='name')

Interestingly, there are parts of the world whose meteorites we didn't know much about until very recently.

Let's take another look, but color by the mass.

In [10]:
express.scatter_mapbox(data_frame=df[df['mass (g)'] < 1000], lat='reclat', lon='reclong', color='mass (g)',
                       mapbox_style='open-street-map', height=800, zoom=1, center={'lat': 0, 'lon': 0}, hover_name='name')

The distribution of mass doesn't show a clear geographic pattern, although if we squint we can maybe see that smaller meteorites are more prevalent near populated areas.