# <p style="text-align:center;">Notebook I</p>

<p style="text-align:center;">Observing the features of the NOAA Marine Mammal Strandings dataset and its characteristics</p>


## Imports

Standard imports: pandas for data handling and `pathlib` for file paths.


In [None]:
import pandas as pd
from pathlib import Path

## Loading

Read the data into a DataFrame. Because the source is an Excel workbook, the sheet that holds the data is specified by name.


In [None]:
df = pd.read_excel("../data/raw/UNC-DataRequest-01302026.xlsx", sheet_name="2015-2024")

## Shape

The dataset does not have many rows, but additional contextual features can be derived to expand the set of columns.


In [None]:
print("Data is {} rows by {} columns".format(*df.shape))

In [None]:
print(df.columns)

List the 25 columns with the highest share of missing values. Most of the missingness appears in fields that are only filled in for special cases, but for the sake of training these columns will be dropped.


In [None]:
print(
    df.isna()
    .sum()
    .sort_values(ascending=False)
    .apply(lambda x: f"{(x / len(df)):.2%} missing")
    .head(25)
)
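The missingness summary above can also be kept numeric, which makes it easier to sort, filter, or reuse later. The following is a minimal sketch on a made-up toy frame: the values, and the names `toy`, `missing`, `n_missing`, and `frac_missing`, are illustrative assumptions and not part of the strandings data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the strandings data (all values are made up).
toy = pd.DataFrame(
    {
        "Weight": [np.nan, np.nan, 12.5, np.nan],
        "Sick Flag": [np.nan, "Y", np.nan, "N"],
        "State": ["NC", "NC", "SC", "VA"],
    }
)

# isna().sum() counts missing values per column; isna().mean() gives the
# fraction missing, because True counts as 1 and False as 0.
missing = pd.DataFrame(
    {
        "n_missing": toy.isna().sum(),
        "frac_missing": toy.isna().mean(),
    }
).sort_values("frac_missing", ascending=False)

print(missing)
```

Keeping the fraction as a float, and formatting it as a percentage only when printing, lets the same table drive later decisions such as which columns to drop.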

Drop columns with a large share of missing values, along with those not relevant to modeling or analysis.


In [None]:
df = df.drop(
    columns=[
        "Condition Comments",
        "Weight Units",
        "Report Type",
        "Weight",
        "Euthanized During Transport",
        "Relocated Flag",
        "Location Hazardous Public",
        "Inaccesible Flag",
        "Abandoned/Orphaned Flag",
        "Additional Remarks",
        "Unknown/CBD Flag",
        "Out of Habitat Flag",
        "Sick Flag",
        "2007 Partial Carcass",
        "Necropy Carcass Fresh/Frozen",
        "Date of Exam",
        "Extent of Exam",
        "Examined",
        "Weight actual/estimate",
        "Condition at Examination",
        "Date of Examination",
        "Length Units",
        "2007 Whole Carcass",
        "Died During Transport Flag",
        "Latitude Units",
        "Longitude Units",
        "Common Name",
    ]
)
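The list above is hand-picked. For the missingness part of that decision, one alternative is to select sparse columns by a cutoff. The sketch below runs on a made-up toy frame; the cutoff `MAX_MISSING_FRAC` and the names `toy`, `sparse_columns`, and `trimmed` are hypothetical and not taken from this notebook. A threshold cannot capture columns dropped for relevance, so those would still need to be listed by hand.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the strandings data (all values are made up).
toy = pd.DataFrame(
    {
        "Weight": [np.nan, np.nan, 12.5, np.nan],
        "Sick Flag": [np.nan, "Y", np.nan, "N"],
        "State": ["NC", "NC", "SC", "VA"],
    }
)

# Hypothetical cutoff: drop any column with more than 60% missing values.
MAX_MISSING_FRAC = 0.6

# Fraction of missing values per column.
frac_missing = toy.isna().mean()

# Names of the columns whose missing fraction exceeds the cutoff.
sparse_columns = frac_missing[frac_missing > MAX_MISSING_FRAC].index.tolist()

# Drop them; the remaining columns keep their original order.
trimmed = toy.drop(columns=sparse_columns)

print(sparse_columns)
print(trimmed.columns.tolist())
```

Printing `sparse_columns` before dropping leaves a record of what was removed, which is useful when the cutoff is tuned later.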

In [None]:
df.head(5)

In [None]:
df[
    (df["Latitude Actual/Estimate"] == "actual")
    & (df["Longitude Actual/Estimate"] == "actual")
]

In [None]:
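# "Latitude Units" and "Longitude Units" were dropped in an earlier cell, so
# selecting them directly would raise a KeyError. Confirm instead that they
# are no longer present (an empty Index is expected).
df.columns.intersection(["Latitude Units", "Longitude Units"])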