## Final Project Submission

Please fill out:
* Student name: Kacurubison Meeme
* Student pace: full time
* Scheduled project review date/time: 3/10/2025
* Instructor name: Samuel Karu
* Blog post URL:


# ✈️ Introduction

Air travel is one of the safest modes of transportation, but accidents — though rare — can have massive consequences. For a company looking to enter the aviation business, understanding historical risks is essential.

In this notebook, we’ll work with aviation incident data to clean and prepare the dataset for analysis.


Our journey starts with raw, messy data and ends with actionable knowledge.

# 🛠️ Getting Started with Tools

To work with aviation data efficiently, we need the right toolkit.
Our first step is importing **pandas**, the go-to Python library for cleaning and analyzing structured data.




In [2]:
import pandas as pd

Now that pandas is on board, let’s bring in our raw flight records:

A quick .head() lets us peek into the cockpit — the first 5 rows — before we start cleaning and exploring.

In [4]:
# Load the raw incident records and preview the first five rows
df = pd.read_csv('Aviation_Data.csv')
df.head()



|  | Event.Id | Investigation.Type | Accident.Number | Event.Date | Location | Country | Latitude | Longitude | Airport.Code | Airport.Name | ... | Purpose.of.flight | Air.carrier | Total.Fatal.Injuries | Total.Serious.Injuries | Total.Minor.Injuries | Total.Uninjured | Weather.Condition | Broad.phase.of.flight | Report.Status | Publication.Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20001218X45444 | Accident | SEA87LA080 | 1948-10-24 | MOOSE CREEK, ID | United States | NaN | NaN | NaN | NaN | ... | Personal | NaN | 2.0 | 0.0 | 0.0 | 0.0 | UNK | Cruise | Probable Cause | NaN |
| 1 | 20001218X45447 | Accident | LAX94LA336 | 1962-07-19 | BRIDGEPORT, CA | United States | NaN | NaN | NaN | NaN | ... | Personal | NaN | 4.0 | 0.0 | 0.0 | 0.0 | UNK | Unknown | Probable Cause | 19-09-1996 |
| 2 | 20061025X01555 | Accident | NYC07LA005 | 1974-08-30 | Saltville, VA | United States | 36.9222 | -81.8781 | NaN | NaN | ... | Personal | NaN | 3.0 | NaN | NaN | NaN | IMC | Cruise | Probable Cause | 26-02-2007 |
| 3 | 20001218X45448 | Accident | LAX96LA321 | 1977-06-19 | EUREKA, CA | United States | NaN | NaN | NaN | NaN | ... | Personal | NaN | 2.0 | 0.0 | 0.0 | 0.0 | IMC | Cruise | Probable Cause | 12-09-2000 |
| 4 | 20041105X01764 | Accident | CHI79FA064 | 1979-08-02 | Canton, OH | United States | NaN | NaN | NaN | NaN | ... | Personal | NaN | 1.0 | 2.0 | NaN | 0.0 | VMC | Approach | Probable Cause | 16-04-1980 |
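
The preview shows that `Event.Date` arrives as plain text. As a minimal sketch, assuming a tiny in-memory extract in place of the real `Aviation_Data.csv`, the load step could parse dates up front. The names `sample_csv` and `sample` are hypothetical, and `low_memory=False` is an optional setting that asks pandas to process the file in one piece so a column's type is inferred consistently.

```python
import io

import pandas as pd

# Hypothetical three-column extract standing in for Aviation_Data.csv
sample_csv = io.StringIO(
    "Event.Id,Event.Date,Location\n"
    '20001218X45444,1948-10-24,"MOOSE CREEK, ID"\n'
    '20001218X45447,1962-07-19,"BRIDGEPORT, CA"\n'
)

# parse_dates converts Event.Date from text to a datetime column on load
sample = pd.read_csv(sample_csv, low_memory=False, parse_dates=["Event.Date"])

print(sample.dtypes)
print(sample.head())
```

With a datetime column in place, later steps such as grouping incidents by year become a one-liner (`sample["Event.Date"].dt.year`).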


# 📐 Dataset Dimensions

How big is our runway of data?
With `.shape`, pandas tells us the number of **rows** (records) and **columns** (features) we’re working with:

Think of it as the blueprint — knowing the size of the dataset before we start flying through it.


In [5]:
df.shape

(90348, 31)
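
Because `.shape` is an ordinary tuple, it can be unpacked into named values for later reporting. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in for `df`), so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame standing in for the aviation DataFrame
toy = pd.DataFrame(
    {
        "Make": ["Cessna", "Piper", "Boeing"],
        "Total.Fatal.Injuries": [2.0, 0.0, None],
    }
)

# .shape is a (rows, columns) tuple, so it unpacks directly
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")
```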

# 🔍 Checking for Duplicates

Before analysis, we need to know if our data has repeated records.
`.duplicated().sum()` gives us the total number of duplicate rows:

Duplicates can distort insights — better to catch them early.


In [6]:
df.duplicated().sum()

1390
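
Before deleting anything, it can help to look at what counts as a duplicate. The sketch below, on a hypothetical toy frame, shows three views: the default count (later copies of fully identical rows), `keep=False` to see every member of a duplicate group, and `subset=` to compare on selected columns only.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are identical,
# rows 3 and 4 share an Event.Id but differ in Make
toy = pd.DataFrame(
    {
        "Event.Id": ["A1", "A1", "B2", "C3", "C3"],
        "Make": ["Cessna", "Cessna", "Piper", "Boeing", "Airbus"],
    }
)

# Default (keep="first"): only the repeat of a fully identical row is flagged
n_dupes = int(toy.duplicated().sum())

# keep=False flags every row that belongs to a duplicate group
all_copies = toy[toy.duplicated(keep=False)]

# subset= compares only the listed columns
n_dupe_ids = int(toy.duplicated(subset=["Event.Id"]).sum())

print(n_dupes, len(all_copies), n_dupe_ids)
```

Whether rows that share an `Event.Id` but differ elsewhere should count as duplicates is a judgment call; this notebook only removes fully identical rows.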

# ✂️ Dropping Duplicates

Now we clear out the repeated rows to keep our dataset lean and trustworthy.


After dropping duplicates, checking `.shape` confirms our new dataset size.


In [12]:
df = df.drop_duplicates()
df.shape

(88958, 31)
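
The numbers above line up: 90348 rows minus 1390 duplicates leaves 88958. A small sketch of turning that sanity check into code, using a hypothetical toy frame; `ignore_index=True` is an optional extra that renumbers the rows after the drop.

```python
import pandas as pd

# Hypothetical toy frame with one fully duplicated row
toy = pd.DataFrame(
    {
        "Event.Id": ["A1", "A1", "B2"],
        "Make": ["Cessna", "Cessna", "Piper"],
    }
)

rows_before = len(toy)
n_dupes = int(toy.duplicated().sum())

# ignore_index=True renumbers the remaining rows 0..n-1
deduped = toy.drop_duplicates(ignore_index=True)

# The rows removed should equal the duplicates counted beforehand
assert len(deduped) == rows_before - n_dupes
print(deduped)
```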

# 🕳️ Checking Missing Values

Next, we scan for gaps in our data.

`.isna().sum()` counts the missing values in each column, which tells us where the dataset needs cleanup or imputation.


In [14]:
df.isna().sum()

Event.Id                     69
Investigation.Type            0
Accident.Number              69
Event.Date                   69
Location                    121
Country                     295
Latitude                  54576
Longitude                 54585
Airport.Code              38709
Airport.Name              36168
Injury.Severity            1069
Aircraft.damage            3263
Aircraft.Category         56671
Registration.Number        1386
Make                        132
Model                       161
Amateur.Built               171
Number.of.Engines          6153
Engine.Type                7146
FAR.Description           56935
Schedule                  76376
Purpose.of.flight          6261
Air.carrier               72310
Total.Fatal.Injuries      11470
Total.Serious.Injuries    12579
Total.Minor.Injuries      12002
Total.Uninjured            5981
Weather.Condition          4561
Broad.phase.of.flight     27234
Report.Status              6450
Publication.Date          15299
dtype: int64
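
Raw counts are easier to judge as a share of all rows. For example, `Schedule` is missing in 76376 of 88958 rows, roughly 86%. The sketch below computes that share per column on a hypothetical toy frame; `missing_pct` is an illustrative name, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame with different amounts of missing data per column
toy = pd.DataFrame(
    {
        "Latitude": [36.9222, None, None, None],
        "Country": ["United States", "United States", None, "United States"],
        "Investigation.Type": ["Accident"] * 4,
    }
)

# The mean of a boolean mask is the fraction of True values,
# so isna().mean() gives the share of missing entries per column
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```

Sorting puts the sparsest columns first, which makes it easier to decide which ones are worth keeping.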

# 🛠️ Handling Missing Values

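Instead of leaving blanks, we’ll fill all missing entries with `"Unknown"`.
This keeps every row in the dataset and avoids the bias that dropping incomplete rows could introduce. One trade-off: numeric columns such as `Total.Fatal.Injuries` now hold a string alongside their numbers, so pandas stores them as `object` and they must be converted back before any arithmetic.

Now all gaps are handled and no missing values remain.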


In [17]:
df = df.fillna("Unknown")
df.isna().sum()

Event.Id                  0
Investigation.Type        0
Accident.Number           0
Event.Date                0
Location                  0
Country                   0
Latitude                  0
Longitude                 0
Airport.Code              0
Airport.Name              0
Injury.Severity           0
Aircraft.damage           0
Aircraft.Category         0
Registration.Number       0
Make                      0
Model                     0
Amateur.Built             0
Number.of.Engines         0
Engine.Type               0
FAR.Description           0
Schedule                  0
Purpose.of.flight         0
Air.carrier               0
Total.Fatal.Injuries      0
Total.Serious.Injuries    0
Total.Minor.Injuries      0
Total.Uninjured           0
Weather.Condition         0
Broad.phase.of.flight     0
Report.Status             0
Publication.Date          0
dtype: int64
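
A blanket `fillna("Unknown")` also writes text into numeric columns. One possible alternative, sketched on a hypothetical toy frame and not the approach used in this notebook, is to fill only the non-numeric columns and leave numeric gaps as `NaN` so sums and means still work.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy frame with one text column and one numeric column
toy = pd.DataFrame(
    {
        "Weather.Condition": ["VMC", None, "IMC"],
        "Total.Fatal.Injuries": [2.0, None, 0.0],
    }
)

# Blanket fill: the numeric column now mixes floats and a string
blanket = toy.fillna("Unknown")
print(is_numeric_dtype(blanket["Total.Fatal.Injuries"]))

# Targeted fill: only the non-numeric columns receive "Unknown"
targeted = toy.copy()
text_cols = targeted.select_dtypes(exclude="number").columns
targeted[text_cols] = targeted[text_cols].fillna("Unknown")

# The numeric column keeps its NaN and can still be summed
print(targeted["Total.Fatal.Injuries"].sum())
```

The trade-off is that `isna().sum()` no longer reports zero everywhere, so the remaining numeric gaps need their own decision later (for example leaving them missing, or imputing).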

# 🛬 Conclusion

We started with raw aviation data full of duplicates, missing values, and inconsistencies. Through systematic cleaning and exploration, we transformed it into a dataset ready for analysis.

Key takeaways:

* Pandas made it possible to efficiently clean, organize, and explore the data
* We identified missing and duplicate records and handled them
* The cleaned dataset is now ready to be explored for patterns that can inform risk assessment in aviation

This notebook provided a foundation: from here, we can move into deeper statistical modeling, clustering aircraft types, or building predictive models for accident risks.

In short, we’ve taxied from raw data to a clear runway for deeper insights.
