## Question 1
The goal is to analyze the FAA Wildlife Strike Database to identify patterns and trends in wildlife
strikes to civil aircraft. We must clean and reduce the dataset to only include relevant features for
the analysis. We must examine factors such as aircraft type, wildlife involved, location, and time.
After extracting these factors, we run hypothesis tests and create visualizations
to help reduce the occurrence and impact of wildlife strikes on civil aircraft in the future.

## Question 2

Load the data using pandas and inspect it.

Perform the initial inspection of the data, its shape, types, etc.

Evaluate the dataset, perform at least three types of data preparation, and justify the approach taken to prepare the data for analysis. Data prep can include, but is not limited to: handling missing values, data types, duplicates, etc. You will need to ensure that your data preparation addresses issues in at least 7 fields in the data.

Prepare meaningful* summary statistics for 3 continuous variables and 3 categorical variables.
Note: meaningful summary statistics explain the statistical summary of relevant fields in a coherent manner.

In [1]:
import pandas as pd 
import numpy as np
data = pd.read_csv("Bird_Strikes_1990_2023.csv")
data

FileNotFoundError: [Errno 2] No such file or directory: 'Bird_Strikes_1990_2023.csv'

In [None]:
data.dtypes

In [None]:
data.info(verbose=True)

In [None]:
print("\nMissing values per column (top 10):")
print(data.isnull().sum().sort_values(ascending=False).head(10))

In [None]:
# re-inspect column dtypes before fixing the categorical fields
data.info(verbose=True)

In [None]:
# Identify object-type columns (potential categorical variables)
cat_cols = data.select_dtypes(include='object').columns
print("Possible categorical fields:", len(cat_cols))
print(cat_cols.tolist())


In [None]:
data = data.copy()
# standardize
for col in cat_cols:
    data[col] = data[col].astype(str).str.strip()
    data[col] = data[col].replace(['nan', 'NaN', 'None', 'UNKNOWN', 'Unknown'], np.nan)

# normalize text capitalization for key fields
data['STATE'] = data['STATE'].str.upper()
data['AIRPORT'] = data['AIRPORT'].str.title()
data['SPECIES'] = data['SPECIES'].str.title()
data['OPERATOR'] = data['OPERATOR'].str.title()
data['PHASE_OF_FLIGHT'] = data['PHASE_OF_FLIGHT'].str.title()
data['TIME_OF_DAY'] = data['TIME_OF_DAY'].str.title()
data['SIZE'] = data['SIZE'].str.title()
data['WARNED'] = data['WARNED'].str.capitalize()


for col in ['STATE', 'AIRPORT', 'SPECIES', 'PHASE_OF_FLIGHT', 'TIME_OF_DAY', 'SIZE']:
    data[col] = data[col].fillna('Unknown')

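# convert them to category dtype
# (TIME and INCIDENT_DATE are skipped here because they are parsed as times/dates below)
for col in cat_cols.drop(['TIME', 'INCIDENT_DATE'], errors='ignore'):
    data[col] = data[col].astype('category')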

# take care of dates and times 
data['TIME'] = pd.to_datetime(data['TIME'], format='%H:%M', errors='coerce').dt.time

data['INCIDENT_DATE'] = pd.to_datetime(data['INCIDENT_DATE'], errors='coerce') 

data

In [None]:
# check changes 
data.info(verbose=True)

In [None]:
data

In [None]:
# handling null values
# drop any column that is missing more than 50% of its values
# (thresh is the minimum number of non-null values a column needs to be kept)
threshold = int(len(data) * 0.5)
data = data.dropna(thresh=threshold, axis=1)
print("Remaining columns:", data.shape[1])
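The `thresh` argument is easy to misread: it counts non-missing values, not missing ones. The sketch below shows this on a small toy frame whose column names and values are made up for illustration and are not taken from the FAA data. Note also that the notebook filled several fields with `'Unknown'` earlier, so those fields count as non-missing here.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical columns and values, not from the FAA data).
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "half_missing": [np.nan, np.nan, 2.0, 3.0],       # 50% missing
    "complete": [1, 2, 3, 4],                         # 0% missing
})

# Share of missing values per column, useful to inspect before dropping.
missing_share = toy.isna().mean()

# thresh = minimum number of NON-missing values a column needs to be kept.
min_non_missing = int(len(toy) * 0.5)
kept = toy.dropna(thresh=min_non_missing, axis=1)

print(missing_share)
print(kept.columns.tolist())
```

A column that is exactly 50% missing is kept under this rule; only columns with fewer non-missing values than the threshold are dropped.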

In [None]:
data

In [None]:
# check for duplicates 
data.duplicated().sum()

In [None]:
data['TIME_OF_DAY'].unique()

In [None]:
# The data has 'Unknown' values for time of day even where the clock time is known.
# Here I map clock hours to the existing time-of-day categories (Dawn, Day, Dusk, Night)
# so that values we can infer directly are filled in while keeping the data's integrity.
data = data.copy()

def infer_time_of_day(t):
    if pd.isna(t):
        return np.nan
    h = t.hour
    if 5 <= h < 7:
        return 'Dawn'
    elif 7 <= h < 18:
        return 'Day'
    elif 18 <= h < 20:
        return 'Dusk'
    else:
        return 'Night'

# Create a new inferred column from TIME
data['TIME_OF_DAY_INFERRED'] = data['TIME'].apply(infer_time_of_day)

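# Replace 'Unknown' only where a time of day could be inferred from TIME.
# TIME_OF_DAY is categorical, so assign on a plain object copy to avoid an error
# if an inferred label is not already one of its categories, then convert back.
mask = (data['TIME_OF_DAY'] == 'Unknown') & data['TIME_OF_DAY_INFERRED'].notna()
time_of_day = data['TIME_OF_DAY'].astype(object)
time_of_day[mask] = data.loc[mask, 'TIME_OF_DAY_INFERRED']
data['TIME_OF_DAY'] = time_of_day.astype('category')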

# Drop helper column
data.drop(columns='TIME_OF_DAY_INFERRED', inplace=True)
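As a possible alternative to the row-by-row `apply`, the same hour cutoffs (5, 7, 18, 20) can be applied in a vectorized way. This is only a sketch on a made-up series of times, not the notebook's data. Fixed clock hours are also an approximation, since actual dawn and dusk shift with season and latitude.

```python
import datetime as dt

import numpy as np
import pandas as pd

# Made-up times for illustration; None stands in for a missing TIME value.
times = pd.Series([
    dt.time(5, 30), dt.time(12, 0), dt.time(18, 45),
    dt.time(23, 10), dt.time(2, 0), None,
])

# Extract the hour, leaving missing times as NaN.
hours = times.map(lambda t: t.hour if pd.notna(t) else np.nan)

# Same cutoffs as infer_time_of_day: [5, 7) Dawn, [7, 18) Day, [18, 20) Dusk, else Night.
conditions = [hours.between(5, 6), hours.between(7, 17), hours.between(18, 19)]
labels = ["Dawn", "Day", "Dusk"]
inferred = pd.Series(np.select(conditions, labels, default="Night"), index=times.index)

# Missing times must stay missing rather than fall into the 'Night' default.
inferred = inferred.where(hours.notna())

print(inferred)
```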


In [None]:
data

In [None]:
data.describe()

In [None]:
data.columns

In [None]:
data.isna().sum()
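Question 2 also asks for summary statistics for three continuous and three categorical variables, and `data.describe()` by default covers only the numeric columns. The sketch below shows one way to produce both kinds of summary on a toy frame. `HEIGHT`, `SPEED`, and `DISTANCE` are assumed column names (they do not appear in the cells above), and all values are made up.

```python
import pandas as pd

# Toy frame; HEIGHT, SPEED and DISTANCE are assumed names, values are invented.
toy = pd.DataFrame({
    "HEIGHT": [0, 50, 500, 1500, 3000, 0],
    "SPEED": [20, 120, 140, 180, 250, 30],
    "DISTANCE": [0, 0, 1, 5, 10, 0],
    "PHASE_OF_FLIGHT": ["Landing Roll", "Approach", "Approach",
                        "Climb", "Climb", "Landing Roll"],
    "TIME_OF_DAY": ["Day", "Day", "Night", "Dusk", "Day", "Night"],
    "SIZE": ["Small", "Small", "Medium", "Large", "Small", "Medium"],
})

continuous = ["HEIGHT", "SPEED", "DISTANCE"]
categorical = ["PHASE_OF_FLIGHT", "TIME_OF_DAY", "SIZE"]

# Continuous fields: center, spread, range and skewness, one row per variable.
cont_summary = toy[continuous].agg(
    ["count", "mean", "median", "std", "min", "max", "skew"]
).T

# Categorical fields: relative frequency of each level.
cat_summary = {
    col: toy[col].value_counts(normalize=True).round(2) for col in categorical
}

print(cont_summary)
for col, freq in cat_summary.items():
    print(f"\n{col}\n{freq}")
```

Reporting the median and skewness next to the mean helps show whether a field such as height is skewed, which matters for the later statistical tests.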

## Question 4: Hypothesis Testing (30 points)

Perform pairwise analysis of select features and evaluate the significance of the pattern or trend. A suitable value for alpha is 5%. Explain all results.

1. Create a scatterplot that shows the relationship between aircraft height and speed. Evaluate the correlation, the strength and the significance of the results.
2. Visualize the distribution of the aircraft speed during: 1) the approach phase of flight and 2) the landing roll phase of flight. Perform a 2 sample t-test and evaluate if there is a statistical difference between the speed during these two flight phases. Tip: if the data is skewed, you will need to address this prior to the statistical analysis.
3. Create a visualization of the aircraft damage grouped by phase of flight.
   - Evaluate if the results are statistically significant. Ensure that you use the appropriate test.
4. Perform ONE (1) additional statistical test.
   - Explain what you are testing and the reason this information is useful.
   - Visualize the data, state the hypothesis and explain if it is statistically significant.
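
The three required tests could be sketched as follows, using SciPy on synthetic data rather than the FAA records. `HEIGHT`, `SPEED`, and `DAMAGED` are assumed column names, the numbers are randomly generated, and the choice of tests (Spearman correlation, Welch's t-test on log-transformed speed, chi-square test of independence) is one reasonable option, not the only valid one.

```python
import numpy as np
import pandas as pd
from scipy import stats

ALPHA = 0.05  # significance level given in the assignment
rng = np.random.default_rng(0)
n = 200

# Synthetic, right-skewed speeds for two flight phases (invented values).
speed = np.concatenate([
    rng.lognormal(mean=np.log(140), sigma=0.2, size=n),  # "Approach"
    rng.lognormal(mean=np.log(80), sigma=0.3, size=n),   # "Landing Roll"
])
toy = pd.DataFrame({
    "PHASE_OF_FLIGHT": ["Approach"] * n + ["Landing Roll"] * n,
    "SPEED": speed,
    "HEIGHT": speed * 10 + rng.normal(0, 200, size=2 * n),  # assumed column
    "DAMAGED": rng.random(2 * n) < 0.2,                     # hypothetical flag
})

# 1) Height vs. speed: Spearman rank correlation (robust to skew and outliers).
rho, p_corr = stats.spearmanr(toy["HEIGHT"], toy["SPEED"])

# 2) Speed by phase: log-transform to reduce skew, then Welch's t-test,
#    which does not assume equal variances in the two groups.
log_speed = np.log(toy["SPEED"])
approach = log_speed[toy["PHASE_OF_FLIGHT"] == "Approach"]
landing = log_speed[toy["PHASE_OF_FLIGHT"] == "Landing Roll"]
t_stat, p_ttest = stats.ttest_ind(approach, landing, equal_var=False)

# 3) Damage by phase: both variables are categorical, so use a chi-square
#    test of independence on the contingency table.
table = pd.crosstab(toy["PHASE_OF_FLIGHT"], toy["DAMAGED"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

for name, p in [("spearman", p_corr), ("welch t-test", p_ttest), ("chi-square", p_chi2)]:
    verdict = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"{name}: p = {p:.4g} -> {verdict}")
```

On the real data, missing heights and speeds would need to be dropped before each test, and the log transform requires strictly positive speeds.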