Let's load our data and do a little data cleansing. We need to strip out some stray characters that have crept into the latitude and duration fields, and we want to cap our durations at one hour. If we don't cap them, a handful of extreme outliers will dominate the scale and almost every other duration will look like it is nearly zero by comparison.

Also, it looks like we have pretty solid date/time data, so we're going to look at seasonality in several dimensions. That means we need to extract several fields (month, year, day of week, and hour) from our date/time data.

In [1]:
import pandas as pd

UFO = '/kaggle/input/ufo-sightings/ufo_sightings_scrubbed.csv'
df = pd.read_csv(filepath_or_buffer=UFO, parse_dates=['datetime', 'date posted'], low_memory=False)
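# Some column names carry stray whitespace.
df.columns = [column.strip() for column in df.columns]
# Strip a stray 'q' from latitude values before converting to float.
df['latitude'] = df['latitude'].apply(func=lambda x: float(x.replace('q', '')) if 'q' in x else float(x))
# Strip stray backticks from durations before converting to float.
df['duration (seconds)'] = df['duration (seconds)'].apply(func=lambda x: float(x.replace('`', '')))
# Cap durations at one hour (3600 seconds).
df['duration (seconds)'] = df['duration (seconds)'].apply(func=lambda x: x if x < 3600 else 3600)
# Date/time parts for the seasonality plots; dayofweek runs from Monday=0 to Sunday=6.
df['month'] = df['datetime'].dt.month
df['year'] = df['datetime'].dt.year
df['day_of_week'] = df['datetime'].dt.dayofweek
df['hour'] = df['datetime'].dt.hour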
df.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude,month,year,day_of_week,hour
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941111,10,1949,0,20
1,1949-10-10 21:00:00,lackland afb,tx,,light,3600.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.38421,-98.581082,10,1949,0,21
2,1955-10-10 17:00:00,chester (uk/england),,gb,circle,20.0,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.2,-2.916667,10,1955,0,17
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645833,10,1956,2,21
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803611,10,1960,0,20
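
As an aside, the same cleanup can be written with vectorized pandas calls instead of `apply`. This is only a sketch on a made-up three-row frame (the name `toy` and its values are invented for illustration); it mimics the stray characters that the cell above strips out.

```python
import pandas as pd

# Hypothetical toy data: the values are made up, but the stray 'q' and
# backtick mimic the characters stripped in the cleanup cell.
toy = pd.DataFrame({
    'latitude': ['29.883056', '33q.200088', '53.2'],
    'duration (seconds)': ['2700', '2`', '37800'],
})

# Remove the stray character, then convert the whole column at once.
toy['latitude'] = pd.to_numeric(toy['latitude'].str.replace('q', '', regex=False))

# Same idea for durations, followed by a cap at one hour (3600 seconds).
toy['duration (seconds)'] = pd.to_numeric(
    toy['duration (seconds)'].str.replace('`', '', regex=False)
).clip(upper=3600)

print(toy)
```

One difference worth knowing: `clip` leaves missing values as NaN, while the lambda used in the cell above would turn a NaN duration into 3600, because the comparison `x < 3600` is false for NaN.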


Let's make a map. We have global data, but we already suspect sightings are not randomly or evenly distributed around the globe.

In [2]:
from plotly import express
from plotly import io

io.renderers.default = 'iframe'
express.scatter_mapbox(data_frame=df, mapbox_style='open-street-map', height=800, zoom=1, lat='latitude', lon='longitude', color='duration (seconds)')

This map makes it look like almost all UFO sightings are in the United States.

Here's what our durations look like. 

In [3]:
express.histogram(data_frame=df, x='duration (seconds)', log_y=False)

The way those peaks are distributed suggests they've been quantized, maybe to five-minute intervals. 
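
One rough way to test the quantization hunch is to count the most common duration values and measure what share of them are exact multiples of five minutes. The sketch below uses a small made-up series (`durations` and its values are invented for illustration), not the real column.

```python
import pandas as pd

# Hypothetical durations in seconds; the values are invented for illustration.
durations = pd.Series(
    [5, 60, 120, 300, 300, 600, 900, 1800, 45, 3600],
    name='duration (seconds)',
)

# The most frequent values show where the peaks sit.
top_values = durations.value_counts().head(5)

# Share of observations that are exact multiples of five minutes (300 seconds).
share_five_min = (durations % 300 == 0).mean()

print(top_values)
print(f'{share_five_min:.0%} of values are multiples of five minutes')
```

On the real data, keep in mind that every duration capped at 3600 seconds also counts as a multiple of 300, so the capped values inflate this share somewhat.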

Let's look at the distribution by country.

In [4]:
express.bar(data_frame=df['country'].value_counts().to_frame().reset_index(), x='country', y='count')

This distribution says more about how completely the country field was filled in than about where the sightings actually occurred: the map above shows plenty of sightings outside these five countries.
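
To see how incomplete a field like this is, we can count missing values alongside the filled-in ones. This is a minimal sketch on a made-up frame (`toy` and its values are invented for illustration); on the real data the same two calls would run against `df['country']`.

```python
import pandas as pd

# Hypothetical toy frame; None stands in for a blank country field.
toy = pd.DataFrame({'country': ['us', None, 'gb', 'us', None]})

# dropna=False keeps the missing values as their own row in the counts.
counts = toy['country'].value_counts(dropna=False)

# Fraction of rows with no country recorded.
missing_share = toy['country'].isna().mean()

print(counts)
print(f'{missing_share:.0%} of rows have no country')
```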

We get a smoother histogram if we plot the distribution by year instead of by date.

In [5]:
express.histogram(data_frame=df, x='year')

It's crazy how sightings accelerate so much after 1990. I wonder what caused that.

Do we see any seasonality from month to month?

In [6]:
express.histogram(data_frame=df, x='month')

We don't see a lot of seasonality, but it is probably not surprising that we see an increase in sightings during the warmer months in the northern hemisphere.

How about seasonality relative to the day of the week?

In [7]:
express.histogram(data_frame=df, x='day_of_week')

There are more sightings on weekends, but not a lot more.
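
When reading the day-of-week axis, remember that pandas numbers the days from Monday=0 to Sunday=6, so the weekend is 5 and 6. The sketch below shows the mapping on a few timestamps (the first two dates appear in the table above, the last two are chosen only because they fall on a weekend; the names `stamps`, `day_name`, and `is_weekend` are ours).

```python
import pandas as pd

# A few sample timestamps: two weekdays and two weekend days.
stamps = pd.Series(pd.to_datetime([
    '1949-10-10 20:30',
    '1956-10-10 21:00',
    '2004-04-24 22:00',
    '2004-04-25 01:00',
]))

# Integer day of week: Monday=0 ... Sunday=6.
day_of_week = stamps.dt.dayofweek

# Human-readable names make a histogram axis easier to read.
day_name = stamps.dt.day_name()

# Saturday (5) and Sunday (6) count as the weekend.
is_weekend = day_of_week >= 5

print(pd.DataFrame({'day_of_week': day_of_week, 'day_name': day_name, 'is_weekend': is_weekend}))
```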

How about seasonality by hour of the day? 

In [8]:
express.histogram(data_frame=df, x='hour')

Not surprisingly there are far more sightings at night. 

Let's add a map of just the sightings in the United States. How would we expect them to be concentrated?

In [9]:

express.scatter_mapbox(data_frame=df[df['country'] == 'us'], mapbox_style='open-street-map', height=800, zoom=3, lat='latitude', lon='longitude', color='duration (seconds)')

This looks a lot like a population density map. Is that surprising? Probably not, because this dataset captures sightings, not UFOs. And because sightings require an observer as well as a phenomenon, we should probably expect them to be distributed like population. Also, to the extent UFOs are real, we don't have any idea how they are distributed.