# UFO Data Preprocessing Notebook 🚀👽🏰

[!WARNING]  
Download the raw data before running this notebook; see the project documentation for instructions.

## Data Loading 📥

This cell loads the raw UFO dataset:

- **UFO Data:** Loaded from the CSV file `nuforc_reports.csv` in the raw data folder.
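Because the notebook depends on a file that has to be downloaded first, a small guard before loading can fail early with a pointer to the download step. The following is only a sketch: the helper name `require_file` and its message are illustrative assumptions, not part of the notebook, and the demo uses a temporary folder rather than the real `data/raw` directory.

```python
import os
import tempfile


def require_file(path):
    """Return path unchanged if it is an existing file, else raise FileNotFoundError."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Expected raw data at {path!r}. Download it first (see the documentation)."
        )
    return path


# Demo with a throwaway folder so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "nuforc_reports.csv")
    open(present, "w").close()            # create an empty placeholder file
    checked_path = require_file(present)  # passes silently

    try:
        require_file(os.path.join(tmp, "missing.csv"))
    except FileNotFoundError as err:
        error_message = str(err)          # explains what to do next
```

In the notebook itself, the same check could wrap `ufo_path` just before the call to `pd.read_csv`.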

In [1]:
import pandas as pd
import numpy as np
import os

# ============================================================
# SETUP: Define raw and processed data directories relative to this script
# ============================================================
# Get the absolute path of the directory where this script is located
# In a notebook, __file__ is not defined so we use os.getcwd() as a fallback.
try:
    BASE_DIR = os.path.dirname(os.path.abspath(__file__))
except NameError:
    BASE_DIR = os.getcwd()

# Define the folder where raw data is stored (assumed to be "../data/raw")
RAW_DIR = os.path.join(BASE_DIR, "..", "data", "raw")

# Define the folder where processed data will be saved (assumed to be "../data/processed")
PROCESSED_DIR = os.path.join(BASE_DIR, "..", "data", "processed")
os.makedirs(PROCESSED_DIR, exist_ok=True)  # Create the folder if it doesn't exist


# Build the absolute path to the raw UFO dataset
ufo_path = os.path.join(RAW_DIR, "nuforc_reports.csv")

# Load UFO data from CSV using the absolute path
ufo_df = pd.read_csv(ufo_path)

print("✅ Dataset loaded!")

print(ufo_df.head())

✅ Dataset loaded!
           datetime             city             state country     shape  \
0  01/31/2023 19:49       Gothenburg                NE     USA  Fireball   
1  01/31/2023 19:15  Sulphur Springs                IN     USA    Circle   
2  01/31/2023 18:38       El Granada                CA     USA     Light   
3  01/31/2023 14:30           Layton                UT     USA    Circle   
4  01/31/2023 08:20          Larnaca  Larnaca District  Cyprus     Other   

                                             summary  
0  Saw trailblazer that moves very rapid and move...  
1                        Fast spherical orange light  
2  Image of these UFO's were capture with Nikon D...  
3  Working outside with two dozen others looked u...  
4                               Something like drone  


## Filtering UFO Data

This cell filters the UFO dataset by removing rows in which city, state, and country are all missing. Rows with at least one of these fields present are kept (`how='all'`).

In [2]:
# Drop rows where 'city', 'state', and 'country' are ALL missing (how='all')
ufo_df.dropna(subset=['city', 'state', 'country'], how='all', inplace=True)

print("Rows with missing values in 'city', 'state', and 'country' dropped!")

print(ufo_df.head())

Rows with missing values in 'city', 'state', and 'country' dropped!
           datetime             city             state country     shape  \
0  01/31/2023 19:49       Gothenburg                NE     USA  Fireball   
1  01/31/2023 19:15  Sulphur Springs                IN     USA    Circle   
2  01/31/2023 18:38       El Granada                CA     USA     Light   
3  01/31/2023 14:30           Layton                UT     USA    Circle   
4  01/31/2023 08:20          Larnaca  Larnaca District  Cyprus     Other   

                                             summary  
0  Saw trailblazer that moves very rapid and move...  
1                        Fast spherical orange light  
2  Image of these UFO's were capture with Nikon D...  
3  Working outside with two dozen others looked u...  
4                               Something like drone  
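The `how` argument decides how strict the filter is. As a small illustration on made-up rows (the `demo` frame below is not taken from the dataset), `how="all"` drops a row only when every listed column is missing, while `how="any"` drops it as soon as one is missing:

```python
import numpy as np
import pandas as pd

# Toy rows: complete, partially missing, fully missing
demo = pd.DataFrame({
    "city": ["Layton", np.nan, np.nan],
    "state": ["UT", "CA", np.nan],
    "country": ["USA", "USA", np.nan],
})
cols = ["city", "state", "country"]

kept_all = demo.dropna(subset=cols, how="all")  # drops only the fully empty row
kept_any = demo.dropna(subset=cols, how="any")  # drops every row with a gap

print(len(kept_all), len(kept_any))  # prints: 2 1
```

With `how="all"`, partially specified locations such as a state and country without a city remain in the data and are still passed on to later steps.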


## Geocoding UFO Report Locations 🌍

We use the geopy library to obtain geographic coordinates (latitude and longitude) for each UFO report based on its city, state, and country.

- **Nominatim** is used as the geocoder.
- **RateLimiter** is applied to avoid hitting the request limit.
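To make the idea of a minimum delay between calls concrete, here is a rough pure-Python sketch of a throttling wrapper. It is an illustration of the concept only and not geopy's implementation; the name `rate_limited` and its `clock` and `sleep` parameters are hypothetical.

```python
import time


def rate_limited(func, min_delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
    """Wrap func so that consecutive calls start at least min_delay_seconds apart."""
    last_call = None  # time at which the previous call started

    def wrapper(*args, **kwargs):
        nonlocal last_call
        if last_call is not None:
            # Sleep for whatever part of the minimum delay has not yet elapsed
            wait = min_delay_seconds - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        return func(*args, **kwargs)

    return wrapper


# Demo with a short delay so it finishes quickly
throttled_double = rate_limited(lambda x: x * 2, min_delay_seconds=0.1)
first = throttled_double(2)   # runs immediately
second = throttled_double(3)  # waits roughly 0.1 s before running
```

The `clock` and `sleep` arguments exist only so the wrapper can be exercised without real waiting; with a one-second delay, geocoding N rows takes at least about N seconds, which is why the geocoding step is slow.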

[!WARNING]  
The output of the next cells is not displayed because it is very long and the cells take a long time to run.

In [None]:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# Initialize the Nominatim geocoder with a custom user agent
geolocator = Nominatim(user_agent="ufo_project")
# Use RateLimiter to ensure at least one second delay between geocoding calls
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

def get_coordinates(row):
    """
    Converts a location (city, state, country) into geographic coordinates (latitude, longitude).

    Parameters:
        row (pd.Series): A row from a DataFrame containing 'city', 'state', and 'country' columns.

    Returns:
        tuple: A tuple containing latitude and longitude as floats. If the location cannot be found,
               returns (np.nan, np.nan) to indicate missing data.

    Process:
        - Constructs a location string from the city, state, and country fields in the row.
        - Uses the Nominatim geocoder (via RateLimiter) to query the coordinates for the location.
        - Handles cases where the geocoder cannot find a match by returning NaN values.
        - Prints the geocoding result for debugging purposes.
    """
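    # Construct the location string from the non-missing city, state, and country fields
    # (a missing value would otherwise be formatted as the literal text "nan")
    parts = [str(row[col]) for col in ("city", "state", "country") if pd.notna(row[col])]
    location = geocode(", ".join(parts))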
    print(location)  # Debug: print the location result
    if location:
        return location.latitude, location.longitude
    else:
        return np.nan, np.nan  # Return NaN if no location is found

# Apply the geocoding function to each row and create new columns for latitude and longitude
ufo_df[["latitude", "longitude"]] = ufo_df.apply(get_coordinates, axis=1, result_type="expand")

print("UFO data updated with lat/lon! 📍")

print(ufo_df.head())
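Many reports are likely to share the same city, state, and country, so one possible way to cut down on requests is to cache results per location string. The sketch below assumes nothing about the real geocoder: `stub_geocode` is a hypothetical stand-in for the rate-limited `geocode` callable, and the coordinates are placeholders rather than real values.

```python
from functools import lru_cache

calls = []  # records every query that actually reaches the stub geocoder


def stub_geocode(query):
    """Hypothetical stand-in for the rate-limited geocode callable."""
    calls.append(query)
    known = {"Layton, UT, USA": (1.0, 2.0)}  # placeholder coordinates
    return known.get(query)  # None when the location is unknown


@lru_cache(maxsize=None)
def cached_geocode(query):
    # Repeated queries are answered from the cache, including unknown ones
    return stub_geocode(query)


queries = ["Layton, UT, USA", "Layton, UT, USA", "Nowhere, ZZ, USA", "Layton, UT, USA"]
results = [cached_geocode(q) for q in queries]

print(len(queries), len(calls))  # prints: 4 2
```

Four lookups result in only two calls to the underlying geocoder; how much this would help on the real dataset depends on how often locations repeat.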

## Cleaning and Saving the Processed UFO Data 💾

We drop rows without valid coordinates and save the cleaned UFO data to a CSV file.

In [None]:
# Drop rows in the UFO dataset where both 'latitude' and 'longitude' are missing
ufo_df.dropna(subset=['latitude', 'longitude'], how='all', inplace=True)

# Build the absolute path for the output file
ufo_processed_path = os.path.join(PROCESSED_DIR, "ufo_processed.csv")

# Save the processed UFO data to a CSV file in the processed data folder
ufo_df.to_csv(ufo_processed_path, index=False)

print("💾 Saved processed UFO dataset!")

print(ufo_df.head())

print(ufo_df.shape)
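A quick way to gain confidence in the saved file is to read it back and compare it with the frame that was written. The following is a minimal sketch on made-up rows with placeholder coordinates; it writes to a temporary folder rather than to `PROCESSED_DIR`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy data: one geocoded row and one row where geocoding found nothing
demo = pd.DataFrame({
    "city": ["Layton", "Nowhere"],
    "latitude": [1.0, np.nan],   # placeholder values
    "longitude": [2.0, np.nan],  # placeholder values
})
cleaned = demo.dropna(subset=["latitude", "longitude"], how="all")

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "ufo_processed.csv")
    cleaned.to_csv(out_path, index=False)  # index=False avoids an extra index column
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == cleaned.shape
same_columns = list(reloaded.columns) == list(cleaned.columns)
print(same_shape, same_columns)  # prints: True True
```

Without `index=False`, the reloaded file would gain an unnamed index column and the shapes would no longer match.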