In this notebook, we clean and transform the data, then carry out some exploratory data analysis.

We use the `nycflights13` dataset, which contains information on all flights that departed from New York City in 2013.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

---
# The `nycflights13` datasets

The [Python nycflights13](https://pypi.org/project/nycflights13/) data package provides the same data as the [R nycflights13](https://cran.r-project.org/web/packages/nycflights13/index.html) package.

In [2]:
# install the package
!pip install nycflights13

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nycflights13
  Downloading nycflights13-0.0.3.tar.gz (8.7 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: nycflights13
  Building wheel for nycflights13 (setup.py) ... done
  Created wheel for nycflights13: filename=nycflights13-0.0.3-py3-none-any.whl size=8732740 sha256=c1dc9450ee310d909457329c55fb05d00c5e6294902449d1e43d31bd8d4db16e
  Stored in directory: /root/.cache/pip/wheels/0e/b7/7b/c129c6a2717d8825caa178f3b07e260cfb12c39f95fd165ff1
Successfully built nycflights13
Installing collected packages: nycflights13
Successfully installed nycflights13-0.0.3


In [3]:
# load the `flights` table
from nycflights13 import flights
print(type(flights))
flights.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z


In [4]:
flights.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   year            336776 non-null  int64  
 1   month           336776 non-null  int64  
 2   day             336776 non-null  int64  
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64  
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64  
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object 
 10  flight          336776 non-null  int64  
 11  tailnum         334264 non-null  object 
 12  origin          336776 non-null  object 
 13  dest            336776 non-null  object 
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64  
 16  hour            336776 non-null  int64  
 17  minute    

---
# Data cleaning

**Task-1**: There are some missing values in the `dep_delay` and `arr_delay` columns. How many flight records have missing values in `dep_delay`? How many have missing values in `arr_delay`?

The expected answers are 8,255 and 9,430, respectively.

In [5]:
# ANSWER

# Count the missing values in dep_delay using the isna() function
dep_delay_missing = pd.isna(flights["dep_delay"]).sum()
print(dep_delay_missing)

# Count the missing values in arr_delay using the isna() function
arr_delay_missing = pd.isna(flights["arr_delay"]).sum()
print(arr_delay_missing)


8255
9430
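
Both counts can also be obtained in one call by selecting the two columns first. The sketch below is a minimal illustration on a small made-up DataFrame (`toy` and its values are hypothetical, not taken from `nycflights13`), so it runs without the package installed:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the `flights` table
toy = pd.DataFrame({
    "dep_delay": [2.0, np.nan, -1.0, np.nan],
    "arr_delay": [11.0, np.nan, np.nan, np.nan],
})

# isna() marks missing cells as True; sum() counts the True values per column
missing_counts = toy[["dep_delay", "arr_delay"]].isna().sum()
print(missing_counts)
```

Applied to `flights`, the same pattern should give both Task-1 numbers as a single two-entry Series.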


---
**Task-2**: Clean the flights data by removing flight records that contain missing values in `dep_delay`, `arr_delay`, or both, and save the non-canceled flights in a new pandas DataFrame `not_canceled`. How many rows remain in `not_canceled`?

The expected answer is 327,346.

In [6]:
# ANSWER
# Removing missing values in dep_delay or arr_delay or both & saving
not_canceled = flights.dropna(subset=['dep_delay', 'arr_delay'])

#The number of rows in not_canceled flights
print(len(not_canceled))

327346
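
Note that 327,346 is also the non-null count of `arr_delay` reported by `flights.info()`, which implies that every record missing `dep_delay` is also missing `arr_delay` in this data. To make the behavior of `dropna(subset=...)` concrete, the sketch below checks on a made-up DataFrame (`toy` is hypothetical) that it keeps exactly the rows where both columns are present:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the `flights` table
toy = pd.DataFrame({
    "dep_delay": [2.0, np.nan, -1.0, 5.0],
    "arr_delay": [11.0, np.nan, np.nan, 3.0],
})

# Drop a row if either listed column is missing (how="any" is the default)
dropped = toy.dropna(subset=["dep_delay", "arr_delay"])

# Equivalent explicit mask: keep rows where both values are present
mask = toy["dep_delay"].notna() & toy["arr_delay"].notna()
masked = toy[mask]

print(len(dropped), dropped.equals(masked))
```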


---
# Data transformation 

**For this section, we use `not_canceled` flights only.**

Find the **non-canceled** flights that satisfy each of the following conditions.



**Task-1**: Flew during the winter months (December, January, February).

The resulting DataFrame should contain 77,029 rows.

In [7]:
# ANSWER
# Select non-canceled flights in the winter months (December, January, February)
winter_flights = not_canceled[(not_canceled.month == 12) | (not_canceled.month == 1) | (not_canceled.month == 2)]

# Counting the number of rows in winter_flights
print(len(winter_flights))


77029
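
When a column is compared against several values, `Series.isin` is a compact alternative to chaining `==` tests with `|`. A minimal sketch on made-up data (`toy` and `winter_months` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data: one month value per flight
toy = pd.DataFrame({"month": [1, 2, 5, 12, 7, 12]})

winter_months = [12, 1, 2]

# isin() is True wherever the month is one of the listed values
winter = toy[toy["month"].isin(winter_months)]
print(len(winter))
```

The list form also makes it easy to change the set of months without rewriting the condition.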


---
**Task-2**: Find non-canceled flights that were operated by United Airlines, and had an arrival delay of four or more hours.

The resulting DataFrame should contain 252 rows.

In [8]:
# ANSWER
# Non-canceled flights operated by United Airlines with arr_delay >= 240 minutes
un_delays = not_canceled[(not_canceled.carrier == 'UA') & (not_canceled.arr_delay >= 240)]

# Count the number of rows in un_delays
print(len(un_delays))


252
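
In Python, `&` binds more tightly than comparison operators such as `==` and `>=`, which is why each comparison in the filter needs its own parentheses. Naming the masks is one way to keep such conditions readable. The sketch below uses made-up data; `toy`, `MIN_DELAY_MINUTES`, and the mask names are hypothetical:

```python
import pandas as pd

# Hypothetical toy data standing in for `not_canceled`
toy = pd.DataFrame({
    "carrier": ["UA", "UA", "AA", "DL"],
    "arr_delay": [250.0, 30.0, 300.0, 240.0],  # minutes
})

MIN_DELAY_MINUTES = 4 * 60  # four hours expressed in minutes

# Build each condition as a named boolean mask, then combine with &
is_united = toy["carrier"] == "UA"
is_long_delay = toy["arr_delay"] >= MIN_DELAY_MINUTES

united_long = toy[is_united & is_long_delay]
print(united_long)
```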


---
**Task-3**: Departed from LGA, and had an average flight speed greater than 150 mph (miles per hour).

The resulting DataFrame should contain 100,922 rows.

In [9]:
# Calculate the average flight speed in mph (distance in miles, air_time in minutes).
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
not_canceled = not_canceled.assign(avg_speed=not_canceled.distance / not_canceled.air_time * 60)

# Filter non-canceled flights that departed from LGA with avg_speed > 150.
# The query() call and the commented-out .loc version below give the same result.
lga_fast_flights = not_canceled.query("origin == 'LGA' & avg_speed > 150")
# lga_fast_flights = not_canceled.loc[(not_canceled.origin == "LGA") &
#                                     (not_canceled.avg_speed > 150), :]

# Count the number of rows in lga_fast_flights
print(len(lga_fast_flights))


100922
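
Adding a derived column to a filtered DataFrame is a common trigger for pandas' `SettingWithCopyWarning`. One way to avoid it is to take an explicit `.copy()` of the filtered frame before assigning. The sketch below also spells out the unit conversion (miles divided by hours). The three rows reuse the `origin`, `distance`, and `air_time` values shown in the `flights.head()` output above; the names `toy` and `lga` are hypothetical:

```python
import pandas as pd

# Three rows using values from the flights.head() output
toy = pd.DataFrame({
    "origin": ["EWR", "LGA", "LGA"],
    "distance": [1400, 1416, 762],       # miles
    "air_time": [227.0, 227.0, 116.0],   # minutes
})

# Explicit copy: later column assignments modify this frame only
lga = toy[toy["origin"] == "LGA"].copy()

# mph = miles / hours, where hours = minutes / 60
lga["avg_speed"] = lga["distance"] / (lga["air_time"] / 60)
print(lga)
```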


---
**Task-4**: Had the longest departure delay in May.

The expected answer is flight MQ3744 on May 3, with a departure delay of 878 minutes.

In [10]:
#ANSWER
# Filtering non-canceled flights that departed in May
may_flights = not_canceled[not_canceled.month == 5]

# Looking for the flight with the longest departure delay in May
longest_delay = may_flights[may_flights.dep_delay == may_flights.dep_delay.max()]

# Showing the flight information
print(longest_delay[['carrier', 'flight', 'dep_delay']])
# Note: the leading 195711 in the output is the row index, not flight data

       carrier  flight  dep_delay
195711      MQ    3744      878.0
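
An alternative to comparing against `max()` is `idxmax()`, which returns the index label of the largest value and can be passed straight to `.loc`. On ties it returns only the first occurrence, whereas the boolean-mask comparison keeps every tied row. A minimal sketch on made-up data (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with a non-default row index
toy = pd.DataFrame(
    {
        "carrier": ["XX", "YY", "ZZ"],
        "flight": [101, 202, 303],
        "dep_delay": [15.0, 120.0, 45.0],  # minutes
    },
    index=[101, 102, 103],
)

# Index label of the row with the largest departure delay
worst_label = toy["dep_delay"].idxmax()

# Look the full row up by that label
worst = toy.loc[worst_label]
print(worst_label, worst["carrier"], worst["dep_delay"])
```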
