### Data Cleaning

Tasks:
- Convert `datetime` to an actual datetime object formatted as `%Y-%m-%d %H:%M:%S`, to ensure maximum compatibility with existing code.
- Clean `comment`: remove every `,` and `[ ]`, and wrap the string in quotation marks so it does not break the CSV.
- Create a country field. For cities in the US it's fairly easy: select every row whose 'state' value is not `<br/>`. For cities outside the US it's a bit more complicated; it may be possible to take the string between parentheses in the 'city' field, which usually contains information about the country.
- Add coordinates to cities (for now, a feasible solution is geopy: https://geopy.readthedocs.io/en/stable/#).
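
The comment-cleaning and country tasks could be sketched roughly as follows. This is only a minimal sketch, not the final implementation: `clean_comment`, `add_country`, the `"USA"` label and the sample rows (including the `London (UK/England)` city value) are assumptions made up for illustration. Commas and brackets are replaced by a space rather than deleted outright, so that words on either side do not get glued together.

```python
import csv
import io
import re

import pandas as pd


def clean_comment(text):
    """Remove commas and square brackets from one comment (hypothetical helper)."""
    if pd.isna(text):
        return ""
    # Swap the unwanted characters for a space, then collapse repeated whitespace.
    without_symbols = re.sub(r"[,\[\]]", " ", str(text))
    return re.sub(r"\s+", " ", without_symbols).strip()


def add_country(df, us_label="USA"):
    """Add a 'country' column (hypothetical helper, label is an assumption)."""
    out = df.copy()
    # Rows with a real 'state' value (anything but '<br/>') are treated as US.
    in_us = out["state"].notna() & out["state"].ne("<br/>")
    # Elsewhere, take the first parenthesised chunk of 'city', if there is one.
    from_city = out["city"].str.extract(r"\(([^()]+)\)", expand=False)
    out["country"] = from_city.where(~in_us, us_label)
    return out


# Made-up rows that mimic the shape of the scraped data.
sample = pd.DataFrame({
    "city": ["Canadian Lakes", "London (UK/England)", "Unknown"],
    "state": ["MI", "<br/>", "<br/>"],
    "comment": ["light in sky ,no sound", "[hoax] lights, moving", None],
})
sample["comment"] = sample["comment"].map(clean_comment)
sample = add_country(sample)

# QUOTE_NONNUMERIC wraps every string field in quotation marks on export.
buffer = io.StringIO()
sample.to_csv(buffer, index=False, quoting=csv.QUOTE_NONNUMERIC)
csv_text = buffer.getvalue()
```

Rows outside the US without a parenthesised part in 'city' end up with a missing country and would still need manual handling.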

In [1]:
import pandas as pd
from geopy.geocoders import Nominatim
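
The `Nominatim` import is there for the coordinates task. A possible shape for that step is sketched below, under a few assumptions: `build_nominatim_geocoder`, `add_coordinates`, the `user_agent` string and the "city, state" query format are all hypothetical choices, not something already in this notebook. Lookups are cached per unique query and rate-limited to one request per second, since the public Nominatim service is slow and throttled.

```python
import pandas as pd


def build_nominatim_geocoder(user_agent="ufo-sightings-cleaning"):
    """Return a rate-limited geocode callable (user_agent is a placeholder)."""
    # Imported here so the rest of the sketch runs without geopy installed.
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)


def add_coordinates(df, geocode):
    """Add 'latitude' and 'longitude' columns using any geocode callable."""
    out = df.copy()
    # Build "city, state" queries; the '<br/>' placeholder state is dropped.
    state = out["state"].fillna("").replace("<br/>", "")
    queries = (out["city"].fillna("") + ", " + state).str.strip(", ")

    # Query each distinct place once; a failed lookup stays empty.
    cache = {}
    for query in queries.unique():
        location = geocode(query) if query else None
        if location is None:
            cache[query] = (None, None)
        else:
            cache[query] = (location.latitude, location.longitude)

    out["latitude"] = queries.map(lambda q: cache[q][0])
    out["longitude"] = queries.map(lambda q: cache[q][1])
    return out
```

Usage would look like `add_coordinates(allYears, build_nominatim_geocoder())`, which needs network access and can take a long time on the full dataset.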

In [19]:
allYears = pd.read_csv('data/allyears.csv')

In [52]:
allYears.head(5)

Unnamed: 0.1,Unnamed: 0,datetime,city,state,shape,comment
0,0,12/23/20 07:30,Canadian Lakes,MI,Light,"light in sky ,no sound"
1,1,12/23/20 03:18,Grants,CA,Light,Bright Orb seen below plane level at Mach 1+ S...
2,2,12/23/20 03:00,East Quogue,NY,Other,"flickering colored lights in the night sky, in..."
3,3,12/22/20 22:45,Indian River Shores,FL,Changing,Orange orb over Indian River Shores Florida
4,4,12/22/20 22:05,lake elsinore,CA,Light,saw a glowing object.shape not discerned. trav...


Dropping accidentally scraped duplicates

In [26]:
noDuplicates = allYears.drop_duplicates(subset="comment", keep="first")
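
Note that deduplicating on `comment` alone also drops distinct sightings that happen to share the same short comment, and keeps only one row with a missing comment. A stricter variant, assuming a duplicate means a row that matches on every content column, might look like the sketch below (`drop_scraped_duplicates` and `CONTENT_COLUMNS` are hypothetical names).

```python
import pandas as pd

# Columns that describe a sighting; the scraped index columns are left out.
CONTENT_COLUMNS = ["datetime", "city", "state", "shape", "comment"]


def drop_scraped_duplicates(df, columns=CONTENT_COLUMNS):
    """Drop rows that repeat an earlier row on all content columns."""
    return df.drop_duplicates(subset=columns, keep="first").reset_index(drop=True)
```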

In [41]:
consistentStructure = noDuplicates.drop(['Unnamed: 0'], axis=1)

In [39]:
#check nan
noDuplicates.isna().sum()

Unnamed: 0    0
datetime      0
city          0
state         0
shape         0
comment       1
dtype: int64

Convert datetime from string to datetime object

In [47]:
consistentStructure['datetime'] = pd.to_datetime(consistentStructure['datetime'])
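
For reference, a small sketch of the same conversion with an explicit input format, plus the way back to a `%Y-%m-%d %H:%M:%S` string for export. `RAW_FORMAT` is an assumption inferred from rows such as `12/23/20 07:30`, and the sample values (including the invalid one) are made up; `errors="coerce"` turns anything that does not match into `NaT` instead of raising.

```python
import pandas as pd

RAW_FORMAT = "%m/%d/%y %H:%M"        # assumed layout of the scraped strings
TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"  # layout wanted for the cleaned CSV

raw = pd.Series(["12/23/20 07:30", "12/22/20 22:45", "not a date"])

# Parse with a fixed format; unparseable values become NaT.
parsed = pd.to_datetime(raw, format=RAW_FORMAT, errors="coerce")

# Render back to text only when writing out; keep the datetime dtype otherwise.
as_text = parsed.dt.strftime(TARGET_FORMAT)
```

Counting `parsed.isna()` afterwards shows how many rows failed to parse.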

In [51]:
consistentStructure.head(5)

Unnamed: 0,datetime,city,state,shape,comment
0,2020-12-23 07:30:00,Canadian Lakes,MI,Light,"light in sky ,no sound"
1,2020-12-23 03:18:00,Grants,CA,Light,Bright Orb seen below plane level at Mach 1+ S...
2,2020-12-23 03:00:00,East Quogue,NY,Other,"flickering colored lights in the night sky, in..."
3,2020-12-22 22:45:00,Indian River Shores,FL,Changing,Orange orb over Indian River Shores Florida
4,2020-12-22 22:05:00,lake elsinore,CA,Light,saw a glowing object.shape not discerned. trav...


In [50]:
#check formats
consistentStructure.dtypes

datetime    datetime64[ns]
city                object
state               object
shape               object
comment             object
dtype: object