Ideas from Kaggle site:
- What areas of the country are most likely to have UFO sightings?
- Are there any trends in UFO sightings over time? Do they tend to be clustered or seasonal?
- Do clusters of UFO sightings correlate with landmarks, such as airports or government research centers?
- What are the most common UFO descriptions?

New ideas:
- Add weather, population, and similar context for each sighting?
- Is there a military base or airport near the sighting?
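
For the "base or airport near the sighting" idea, one possible starting point is a great-circle distance between each sighting and a list of landmarks. The sketch below is only an illustration: `haversine_km`, `nearest_landmark`, and the `landmarks` list are hypothetical names, and the landmark coordinates are placeholders rather than real sites. Only the sighting coordinates are taken from the data shown in this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_landmark(lat, lon, landmarks):
    """Return (name, distance_km) for the closest entry in a list of (name, lat, lon) tuples."""
    return min(
        ((name, haversine_km(lat, lon, la, lo)) for name, la, lo in landmarks),
        key=lambda pair: pair[1],
    )


# Hypothetical landmark list; coordinates are placeholders, not real sites.
landmarks = [("landmark_a", 29.5, -98.5), ("landmark_b", 40.0, -75.0)]

# Sighting coordinates from the "lackland afb" row of the dataset.
name, dist = nearest_landmark(29.38421, -98.581082, landmarks)
```

A real landmark table (airports, bases) would have to come from an external dataset, which this notebook does not yet include.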

Some new data links:
- https://www.kaggle.com/sogun3/uspollution

In [98]:
%matplotlib inline

import warnings
import pandas as pd
import numpy as np
import seaborn as sns

### Reading data
- location not found or blank (0.8146%) 
- erroneous or blank time (8.0237%)

In [99]:
# Some rows contain an extra comma and fail to parse.
# For now we skip them (~300 rows).
# TODO: parse the warning text and fix them
# Note: error_bad_lines/warn_bad_lines were removed in pandas 2.0;
# on_bad_lines="skip" (pandas >= 1.3) skips the same rows without warnings.
df = pd.read_csv("../data/complete.csv", on_bad_lines="skip", low_memory=False)
df.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611
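
As a possible way to address the TODO above instead of skipping rows, the sketch below merges the surplus fields of an over-long row back into one column. It assumes, without having checked the skipped rows, that the extra comma sits in the free-text `comments` field (position 7 of the 11 columns in the file); `repair_row`, `N_COLUMNS`, and `COMMENTS_IDX` are hypothetical names.

```python
N_COLUMNS = 11    # number of columns in the CSV header
COMMENTS_IDX = 7  # assumed position of the free-text column that contains the stray comma


def repair_row(fields, n_columns=N_COLUMNS, text_idx=COMMENTS_IDX):
    """Merge surplus fields into the text column so the row has n_columns fields again."""
    extra = len(fields) - n_columns
    if extra <= 0:
        return fields  # nothing to repair
    merged = ",".join(fields[text_idx:text_idx + extra + 1])
    return fields[:text_idx] + [merged] + fields[text_idx + extra + 1:]


# A made-up 12-field row: the text column was split into "left" and " right".
broken = ["a", "b", "c", "d", "e", "f", "g", "left", " right", "h", "i", "j"]
fixed = repair_row(broken)
```

With pandas 1.4 or later, a function of this shape can be passed as `on_bad_lines=repair_row` together with `engine="python"`; the callable receives the split fields of each bad line as a list of strings. Whether the merged rows are correct should be verified by inspecting a few of them.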


#### Not useful columns

In [100]:
# "date posted" does not seem useful
df.drop(columns=["date posted"], inplace=True)

# this column is the same as duration (seconds)
df.drop(columns=["duration (hours/min)"], inplace=True)

# Save comments to a separate variable for NLP
comments = df.loc[df.comments.notna(), "comments"]
df.drop(columns=["comments"], inplace=True)

#### Casting column types

In [101]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88679 entries, 0 to 88678
Data columns (total 8 columns):
datetime              88679 non-null object
city                  88679 non-null object
state                 81270 non-null object
country               76314 non-null object
shape                 85757 non-null object
duration (seconds)    88677 non-null object
latitude              88679 non-null object
longitude             88679 non-null float64
dtypes: float64(1), object(7)
memory usage: 5.4+ MB


In [102]:
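# One row has a malformed latitude value; drop it, then cast the column to float
# (.copy() avoids a SettingWithCopyWarning on the assignment below)
df = df[df.latitude != '33q.200088'].copy()
df["latitude"] = df.latitude.astype(float)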
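
The malformed value above was hard-coded. A more general check, sketched here under the assumption that the column values are strings, lists every entry that cannot be parsed as a float. `find_unparseable` is a hypothetical helper, not part of pandas.

```python
def find_unparseable(values):
    """Return (position, value) pairs for entries that float() cannot parse."""
    bad = []
    for pos, value in enumerate(values):
        try:
            float(value)
        except (TypeError, ValueError):
            bad.append((pos, value))
    return bad


# Two valid latitudes from the sample rows plus the malformed one.
sample = ["29.8830556", "33q.200088", "53.2"]
bad = find_unparseable(sample)
```

Calling it on a column before the cast (for example `find_unparseable(df.latitude)`) returns positions, not index labels. `pd.to_numeric(..., errors="coerce")` is an alternative that turns such values into NaN instead of listing them.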

In [103]:
df[df["duration (seconds)"].isna()]

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),latitude,longitude
41825,4/19/2009 05:22,new orleans,la,us,rectangle,,29.954444,-90.075
74039,8/16/2002 01:30,le garn (ard&egrave;che) (france),,,formation,,44.307257,4.472312


In [104]:
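# "duration (seconds)" had mixed float and string values
# -> fixed by using low_memory=False on read_csv, so the whole column is read as strings
# Some values contain a stray backtick; strip it before casting to float
df["duration"] = df["duration (seconds)"].str.replace("`", "", regex=False)
df["duration"] = df["duration"].astype(np.float32)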

In [105]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 88678 entries, 0 to 88678
Data columns (total 9 columns):
datetime              88678 non-null object
city                  88678 non-null object
state                 81269 non-null object
country               76314 non-null object
shape                 85756 non-null object
duration (seconds)    88676 non-null object
latitude              88678 non-null float64
longitude             88678 non-null float64
duration              88676 non-null float32
dtypes: float32(1), float64(2), object(6)
memory usage: 6.4+ MB


In [106]:
df.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),latitude,longitude,duration
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,29.883056,-97.941111,2700.0
1,10/10/1949 21:00,lackland afb,tx,,light,7200,29.38421,-98.581082,7200.0
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,53.2,-2.916667,20.0
3,10/10/1956 21:00,edna,tx,us,circle,20,28.978333,-96.645833,20.0
4,10/10/1960 20:00,kaneohe,hi,us,light,900,21.418056,-157.803611,900.0


#### Fill NAs

In [107]:
# TODO:
# Infer state from city
# Infer country from state
# Infer duration (seconds) from duration (hours/min)?
df["country"].isna()

0        False
1         True
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18        True
19        True
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
88649    False
88650    False
88651    False
88652    False
88653    False
88654    False
88655    False
88656    False
88657    False
88658    False
88659     True
88660    False
88661    False
88662    False
88663     True
88664    False
88665    False
88666     True
88667    False
88668     True
88669    False
88670    False
88671    False
88672    False
88673    False
88674    False
88675    False
88676    False
88677    False
88678    False
Name: country, Length: 88678, dtype: bool
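
For the "infer country from state" TODO, a minimal sketch could fill a missing country with `"us"` whenever the state is a US state code. It assumes the `state` column uses lowercase two-letter postal abbreviations, as in the sample rows (`tx`, `hi`, `la`); `US_STATES` and `infer_country` are hypothetical names, and non-US rows are left untouched.

```python
import math

# The 50 US state postal codes plus "dc", in lowercase as in the dataset sample.
US_STATES = frozenset("""
al ak az ar ca co ct de fl ga hi id il in ia ks ky la me md
ma mi mn ms mo mt ne nv nh nj nm ny nc nd oh ok or pa ri sc
sd tn tx ut vt va wa wv wi wy dc
""".split())


def _is_missing(value):
    """True for None or a float NaN, the two ways a missing cell may appear."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def infer_country(state, country):
    """Keep an existing country; otherwise return 'us' for a US state code, else None."""
    if not _is_missing(country):
        return country
    if not _is_missing(state) and str(state).strip().lower() in US_STATES:
        return "us"
    return None


# The "lackland afb" row has state "tx" and a missing country.
filled = infer_country("tx", float("nan"))
```

It could be applied row by row, for example `[infer_country(s, c) for s, c in zip(df["state"], df["country"])]`. Rows with neither a state nor a country stay missing and would need another source, such as the coordinates.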