# UFO Data Preprocessing Notebook 🚀👽🏰

> [!WARNING]
> Download the data before running this notebook. Check the documentation.

## Data Loading 📥

The cells below set up the raw and processed data directories and load the raw UFO dataset:

- **UFO Data:** Loaded from the CSV file `nuforc_reports.csv` in the raw data folder.

In [7]:
import pandas as pd
import numpy as np
import os

# ============================================================
# SETUP: Define output directory relative to this script
# ============================================================
# Get the absolute path of the directory where this script is located
# In a notebook, __file__ is not defined so we use os.getcwd() as a fallback.
try:
    BASE_DIR = os.path.dirname(os.path.abspath(__file__))
except NameError:
    BASE_DIR = os.getcwd()

# Define the folder where raw data is stored (assumed to be "../data/raw")
RAW_DIR = os.path.join(BASE_DIR, "..", "data", "raw")

# Define the folder where processed data will be saved (assumed to be "../data/processed")
PROCESSED_DIR = os.path.join(BASE_DIR, "..", "data", "processed")
os.makedirs(PROCESSED_DIR, exist_ok=True)  # Create the folder if it doesn't exist

# -------------------------------------------------------------------
# Cell 1: Data Loading 📥
# -------------------------------------------------------------------
# Build the path to the raw UFO dataset
ufo_path = os.path.join(RAW_DIR, "nuforc_reports.csv")

In [8]:
# Load UFO data from CSV using the absolute path
ufo_df = pd.read_csv(ufo_path)
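
Because the notebook depends on a file that must be downloaded beforehand, it can help to fail early with a descriptive message instead of a bare `pd.read_csv` error. The following is a minimal sketch; `load_raw_csv` is a hypothetical helper name, not part of the original notebook.

```python
import os

import pandas as pd


def load_raw_csv(path):
    """Load a CSV, raising a descriptive error if the file is not there.

    Hypothetical helper: wraps pd.read_csv with an explicit existence check.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Raw data not found at {path!r}. Download it first (see the documentation)."
        )
    return pd.read_csv(path)
```

Under this assumption, the load step would read `ufo_df = load_raw_csv(ufo_path)`.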

## Filtering UFO Data for USA Reports 🇺🇸

This cell keeps only reports from the USA and drops rows in which both the city and the state are missing. Rows with just one of the two missing are kept.

In [9]:
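# Filter UFO data to include only reports from the USA.
# .copy() returns an independent DataFrame, so later in-place edits and
# column assignments do not trigger pandas' SettingWithCopyWarning.
ufo_df = ufo_df.loc[ufo_df['country'] == "USA"].copy()

# Drop rows where both 'city' and 'state' are missing
ufo_df.dropna(subset=['city', 'state'], how='all', inplace=True)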
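
To illustrate what `how='all'` does in the filter above, here is a small sketch on made-up data (the rows in `toy` are invented for illustration and are not taken from the UFO dataset). It contrasts `how='all'`, which drops a row only when every listed column is missing, with `how='any'`, which drops a row when at least one is missing.

```python
import numpy as np
import pandas as pd

# Invented sample rows, for illustration only
toy = pd.DataFrame({
    "city": ["Austin", np.nan, "Toronto", np.nan],
    "state": ["TX", "CA", "ON", np.nan],
    "country": ["USA", "USA", "Canada", "USA"],
})

# Keep USA rows only (index labels 0, 1, and 3 remain)
usa = toy.loc[toy["country"] == "USA"].copy()

# how="all": drop only the row where city AND state are both missing (label 3)
kept_all = usa.dropna(subset=["city", "state"], how="all")

# how="any": drop every row where city OR state is missing (labels 1 and 3)
kept_any = usa.dropna(subset=["city", "state"], how="any")

print(len(kept_all), len(kept_any))  # prints: 2 1
```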

## Geocoding UFO Report Locations 🌍

We use the geopy library to obtain geographic coordinates (latitude and longitude) for each UFO report based on its city, state, and country.

- **Nominatim** is used as the geocoder.
- **RateLimiter** spaces calls at least one second apart, in line with Nominatim's usage policy of at most one request per second.

In [None]:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# Initialize the Nominatim geocoder with a custom user agent
geolocator = Nominatim(user_agent="ufo_project")
# Use RateLimiter to enforce a delay of at least one second between geocoding calls
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

def get_coordinates(row):
    # Construct the location string from city, state, and country,
    # skipping missing parts so the query never contains the literal "nan"
    parts = [row['city'], row['state'], row['country']]
    location = geocode(", ".join(str(p) for p in parts if pd.notna(p)))
    print(location)  # Debug: print the location result
    if location:
        return location.latitude, location.longitude
    else:
        return np.nan, np.nan  # Return NaN if no location is found

# Apply the geocoding function to each row and create new columns for latitude and longitude
ufo_df[["latitude", "longitude"]] = ufo_df.apply(get_coordinates, axis=1, result_type="expand")
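
Geocoding row by row sends one request per report, even when many reports share the same city and state. With a one-second delay per call, that can take a long time. One possible alternative, sketched below, is to geocode each distinct location once and merge the coordinates back onto every row. `geocode_unique` is a hypothetical helper, not part of the original notebook; it takes the geocoding callable as a parameter so it can be tried with a stand-in before making real requests.

```python
import numpy as np
import pandas as pd


def geocode_unique(df, geocode_fn, cols=("city", "state", "country")):
    """Geocode each distinct location once, then map results back to all rows.

    Hypothetical helper. geocode_fn is any callable that takes a query string
    and returns an object with .latitude and .longitude, or None.
    """
    cols = list(cols)
    # One row per distinct (city, state, country) combination
    unique = df[cols].drop_duplicates().reset_index(drop=True)

    lats, lons = [], []
    for values in unique.itertuples(index=False, name=None):
        # Skip missing parts so the query never contains the literal "nan"
        query = ", ".join(str(v) for v in values if pd.notna(v))
        location = geocode_fn(query)
        lats.append(location.latitude if location is not None else np.nan)
        lons.append(location.longitude if location is not None else np.nan)

    unique["latitude"] = lats
    unique["longitude"] = lons

    # Left merge keeps every original row, in order, with a fresh default index
    return df.merge(unique, on=cols, how="left")
```

With the rate-limited `geocode` callable defined above, usage would look like `ufo_df = geocode_unique(ufo_df, geocode)`. The sketch assumes `ufo_df` does not already contain `latitude` or `longitude` columns; otherwise the merge would add suffixed duplicates.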

## Cleaning and Saving the Processed UFO Data 💾

We drop rows without valid coordinates and save the cleaned UFO data to a CSV file.

In [11]:
# Drop rows in the UFO dataset where both 'latitude' and 'longitude' are missing
ufo_df.dropna(subset=['latitude', 'longitude'], how='all', inplace=True)

# Build the absolute path for the output file
ufo_processed_path = os.path.join(PROCESSED_DIR, "ufo_processed.csv")

# Save the processed UFO data to a CSV file in the processed data folder
ufo_df.to_csv(ufo_processed_path, index=False)

print("UFO processed data saved to:", ufo_processed_path)

UFO processed data saved to: /home/lferr10/code/leovsferreira/ufo-conspiracy/notebooks/../data/processed/ufo_processed.csv
