
## Exploring and cleaning the air travel file
### Coursera Tableau project

In [1]:
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('drive')

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).


In [3]:
df = pd.read_csv('/content/drive/MyDrive/deite/us_monthly_air_passengers.csv')

In [4]:
df.shape

(6278820, 18)

The data has 6,278,820 rows and 18 columns.

In [5]:
df.columns

Index(['passengers', 'airline_id', 'carrier_name', 'origin',
       'origin_city_name', 'origin_state_abr', 'origin_state_nm',
       'origin_country', 'origin_country_name', 'dest', 'dest_city_name',
       'dest_state_abr', 'dest_state_nm', 'dest_country', 'dest_country_name',
       'year', 'month', 'yearmonth'],
      dtype='object')

Let's count the missing values in each column:

In [6]:
df.isnull().sum()

passengers                  0
airline_id                448
carrier_name              448
origin                      0
origin_city_name            0
origin_state_abr       561441
origin_state_nm        561441
origin_country              2
origin_country_name         0
dest                        0
dest_city_name              0
dest_state_abr         589465
dest_state_nm          589465
dest_country                7
dest_country_name           0
year                        0
month                       0
yearmonth                   0
dtype: int64
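Raw counts are hard to judge against 6,278,820 rows; in the output above the state columns are missing in roughly 9% of the rows, while the other gaps are tiny. A minimal sketch of getting those shares directly, using a made-up toy frame (the name `toy` and its values are illustrative only, not part of the real data):

```python
import pandas as pd

# Toy stand-in for the real table; all values here are invented.
toy = pd.DataFrame({
    "passengers": [10, 20, 30, 40],
    "airline_id": [1.0, None, 3.0, 4.0],
    "dest_state_abr": ["TX", None, None, "NY"],
})

# isnull() yields a boolean frame, so the column mean is the fraction of nulls.
missing_share = toy.isnull().mean().mul(100).round(1)
print(missing_share)
```

On the real data the same expression would be applied to `df` instead of `toy`.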

Null values appear to come in pairs: airline_id with carrier_name (448 each), origin_state_abr with origin_state_nm (561,441 each), and dest_state_abr with dest_state_nm (589,465 each). In addition, 2 records lack origin_country and 7 lack dest_country.
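Equal counts suggest that the paired columns are null in the same rows, but they do not prove it. One way to check this is to compare the null masks directly. The sketch below uses an invented toy frame; the names `toy`, `pairs` and `aligned` are assumptions for illustration:

```python
import pandas as pd

# Toy stand-in; only the column names come from the real data.
toy = pd.DataFrame({
    "airline_id": [None, 20095.0, None],
    "carrier_name": [None, "World Airways Inc.", None],
    "dest_state_abr": ["LA", None, "TX"],
    "dest_state_nm": ["Louisiana", None, "Texas"],
})

pairs = [("airline_id", "carrier_name"), ("dest_state_abr", "dest_state_nm")]

# A pair is aligned when both columns are null in exactly the same rows.
aligned = {
    (a, b): bool(toy[a].isnull().equals(toy[b].isnull()))
    for a, b in pairs
}
print(aligned)
```

If any pair came back `False` on the real data, the two columns would need to be inspected separately rather than filled together.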

I will deal with these paired columns first and see whether the rows with missing values follow any pattern.

In [8]:
df[df.airline_id.isnull()].head()

Unnamed: 0,passengers,airline_id,carrier_name,origin,origin_city_name,origin_state_abr,origin_state_nm,origin_country,origin_country_name,dest,dest_city_name,dest_state_abr,dest_state_nm,dest_country,dest_country_name,year,month,yearmonth
0,0,,,AEX,"Alexandria, LA",LA,Louisiana,US,United States,AEX,"Alexandria, LA",LA,Louisiana,US,United States,2015,3,2015-03
1,0,,,AEX,"Alexandria, LA",LA,Louisiana,US,United States,AFW,"Dallas/Fort Worth, TX",TX,Texas,US,United States,2015,4,2015-04
2,0,,,AEX,"Alexandria, LA",LA,Louisiana,US,United States,ATL,"Atlanta, GA",GA,Georgia,US,United States,2015,3,2015-03
3,89,,,AEX,"Alexandria, LA",LA,Louisiana,US,United States,BOG,"Bogota, Colombia",,,CO,Colombia,2015,1,2015-01
4,108,,,AEX,"Alexandria, LA",LA,Louisiana,US,United States,BOG,"Bogota, Colombia",,,CO,Colombia,2015,3,2015-03


In [9]:
df[df.airline_id.isnull()].tail()

Unnamed: 0,passengers,airline_id,carrier_name,origin,origin_city_name,origin_state_abr,origin_state_nm,origin_country,origin_country_name,dest,dest_city_name,dest_state_abr,dest_state_nm,dest_country,dest_country_name,year,month,yearmonth
443,0,,,YIP,"Detroit, MI",MI,Michigan,US,United States,YIP,"Detroit, MI",MI,Michigan,US,United States,2000,6,2000-06
444,174,,,YOW,"Ottawa, Canada",ON,Ontario,CA,Canada,LAS,"Las Vegas, NV",NV,Nevada,US,United States,2006,11,2006-11
445,1416,,,YVR,"Vancouver, Canada",BC,British Columbia,CA,Canada,PDX,"Portland, OR",OR,Oregon,US,United States,2000,6,2000-06
446,1274,,,YVR,"Vancouver, Canada",BC,British Columbia,CA,Canada,SLC,"Salt Lake City, UT",UT,Utah,US,United States,2000,6,2000-06
447,1227,,,YYZ,"Toronto, Canada",ON,Ontario,CA,Canada,LAS,"Las Vegas, NV",NV,Nevada,US,United States,2006,11,2006-11


I can't see any pattern in the rows that lack airline_id and carrier_name. Since these columns are not the most important ones, I am not going to delete these rows; instead I will fill the null values with the string 'undefined'.

In [10]:
df.airline_id = df.airline_id.fillna('undefined')
df.carrier_name = df.carrier_name.fillna('undefined')
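One side effect worth keeping in mind: airline_id looks numeric in the outputs (for example 21569), so filling it with a string leaves a column that mixes numbers and text. A possible alternative, sketched here on an invented toy frame and assuming the IDs are whole numbers, is to turn the ID into a text label first and fill afterwards:

```python
import pandas as pd

# Toy stand-in; the real column is assumed to hold whole-number IDs.
toy = pd.DataFrame({
    "airline_id": [20095.0, None],
    "carrier_name": ["World Airways Inc.", None],
})

filled = toy.copy()
# float -> nullable integer -> string, so 20095.0 becomes "20095" rather than "20095.0".
filled["airline_id"] = (
    filled["airline_id"].astype("Int64").astype("string").fillna("undefined")
)
filled["carrier_name"] = filled["carrier_name"].fillna("undefined")
print(filled)
```

This keeps the column consistently textual, which tends to be easier to treat as a dimension in a tool such as Tableau; whether that matters here depends on how the ID is used later.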

Let's now turn to origin and destination states.

In [11]:
df[df.dest_state_abr.isnull()].head()

Unnamed: 0,passengers,airline_id,carrier_name,origin,origin_city_name,origin_state_abr,origin_state_nm,origin_country,origin_country_name,dest,dest_city_name,dest_state_abr,dest_state_nm,dest_country,dest_country_name,year,month,yearmonth
3,89,undefined,undefined,AEX,"Alexandria, LA",LA,Louisiana,US,United States,BOG,"Bogota, Colombia",,,CO,Colombia,2015,1,2015-01
4,108,undefined,undefined,AEX,"Alexandria, LA",LA,Louisiana,US,United States,BOG,"Bogota, Colombia",,,CO,Colombia,2015,3,2015-03
5,83,undefined,undefined,AEX,"Alexandria, LA",LA,Louisiana,US,United States,BOG,"Bogota, Colombia",,,CO,Colombia,2015,4,2015-04
27,90,undefined,undefined,AEX,"Alexandria, LA",LA,Louisiana,US,United States,GUA,"Guatemala City, Guatemala",,,GT,Guatemala,2015,1,2015-01
28,217,undefined,undefined,AEX,"Alexandria, LA",LA,Louisiana,US,United States,GUA,"Guatemala City, Guatemala",,,GT,Guatemala,2015,2,2015-02


It seems that the rows with missing dest_state_abr and dest_state_nm are simply flights to countries other than the United States.

In [12]:
df[df.origin_state_abr.isnull()].head()

Unnamed: 0,passengers,airline_id,carrier_name,origin,origin_city_name,origin_state_abr,origin_state_nm,origin_country,origin_country_name,dest,dest_city_name,dest_state_abr,dest_state_nm,dest_country,dest_country_name,year,month,yearmonth
70,787,undefined,undefined,AMS,"Amsterdam, Netherlands",,,NL,Netherlands,JFK,"New York, NY",NY,New York,US,United States,2009,5,2009-05
71,1036,undefined,undefined,AMS,"Amsterdam, Netherlands",,,NL,Netherlands,JFK,"New York, NY",NY,New York,US,United States,2009,7,2009-07
72,1026,undefined,undefined,ARN,"Stockholm, Sweden",,,SE,Sweden,MIA,"Miami, FL",FL,Florida,US,United States,2000,1,2000-01
73,703,undefined,undefined,ARN,"Stockholm, Sweden",,,SE,Sweden,MIA,"Miami, FL",FL,Florida,US,United States,2000,2,2000-02
74,707,undefined,undefined,ARN,"Stockholm, Sweden",,,SE,Sweden,MIA,"Miami, FL",FL,Florida,US,United States,2000,3,2000-03


Likewise, the rows with null values in origin_state_abr and origin_state_nm are flights that originate outside the United States.
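Looking at `head()` only shows five rows, so the observation can be checked across the whole table by cross-tabulating "state is missing" against "country is US". Note that the converse does not hold: the Canadian rows shown earlier carry province codes such as ON and BC. A minimal sketch on an invented toy frame (names `toy`, `state_missing`, `is_us` are illustrative):

```python
import pandas as pd

# Toy stand-in mirroring the patterns seen in the outputs above.
toy = pd.DataFrame({
    "dest_state_abr": ["LA", None, None, "ON"],
    "dest_country": ["US", "CO", "GT", "CA"],
})

state_missing = toy["dest_state_abr"].isnull()
is_us = toy["dest_country"].eq("US")

# Rows: is the state missing? Columns: is the destination in the US?
table = pd.crosstab(
    state_missing, is_us,
    rownames=["state missing"], colnames=["country is US"],
)
print(table)

# The claim holds if no US row lacks a state.
us_rows_without_state = int((state_missing & is_us).sum())
non_us_rows_with_state = int((~state_missing & ~is_us).sum())
```

A non-zero `us_rows_without_state` on the real data would mean that some missing states are genuine gaps rather than foreign airports.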

It makes sense to group all these rows by filling the missing values with 'Non-USA'.

In [14]:
df.origin_state_abr = df.origin_state_abr.fillna('Non-USA')
df.origin_state_nm = df.origin_state_nm.fillna('Non-USA')
df.dest_state_abr = df.dest_state_abr.fillna('Non-USA')
df.dest_state_nm = df.dest_state_nm.fillna('Non-USA')


Let's now consider the few other columns with missing values.

In [15]:
df[df.origin_country.isnull()]

Unnamed: 0,passengers,airline_id,carrier_name,origin,origin_city_name,origin_state_abr,origin_state_nm,origin_country,origin_country_name,dest,dest_city_name,dest_state_abr,dest_state_nm,dest_country,dest_country_name,year,month,yearmonth
6203035,2,21569,Amira Air GmbH,WDH,"Windhoek, Namibia",Non-USA,Non-USA,,Namibia,FLL,"Fort Lauderdale, FL",FL,Florida,US,United States,2015,4,2015-04
6232174,48,21630,Comlux Aruba NV,WDH,"Windhoek, Namibia",Non-USA,Non-USA,,Namibia,MCO,"Orlando, FL",FL,Florida,US,United States,2016,7,2016-07


In [16]:
df[df.origin_country_name == 'Namibia']

Unnamed: 0,passengers,airline_id,carrier_name,origin,origin_city_name,origin_state_abr,origin_state_nm,origin_country,origin_country_name,dest,dest_city_name,dest_state_abr,dest_state_nm,dest_country,dest_country_name,year,month,yearmonth
6203035,2,21569,Amira Air GmbH,WDH,"Windhoek, Namibia",Non-USA,Non-USA,,Namibia,FLL,"Fort Lauderdale, FL",FL,Florida,US,United States,2015,4,2015-04
6232174,48,21630,Comlux Aruba NV,WDH,"Windhoek, Namibia",Non-USA,Non-USA,,Namibia,MCO,"Orlando, FL",FL,Florida,US,United States,2016,7,2016-07


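It seems that Namibia has no abbreviation in the loaded data. The most likely reason is that Namibia's two-letter code is 'NA', which pandas.read_csv treats as a missing value by default. Let's check that 'NA' is not already present as a string in this column; if it isn't, we'll set it as the abbreviation for Namibia.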
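The default behaviour of `pd.read_csv` can be seen on a tiny in-memory CSV. The sketch below assumes the source file stores Namibia's code as the literal text NA, which is not confirmed here; the CSV text is invented:

```python
import io
import pandas as pd

# Invented three-column sample; "NA" stands for Namibia's country code.
csv_text = (
    "origin,origin_country,origin_country_name\n"
    "WDH,NA,Namibia\n"
    "AMS,NL,Netherlands\n"
)

# Default parsing: the string "NA" is in pandas' built-in list of missing markers.
default = pd.read_csv(io.StringIO(csv_text))

# Keep "NA" as text and treat only empty fields as missing.
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(default["origin_country"].isnull().tolist())
print(literal["origin_country"].tolist())
```

Reading the file with `keep_default_na=False` would avoid the problem at the source, at the cost of having to list explicitly which markers should still count as missing.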

In [17]:
df[df.origin_country == 'NA']

Unnamed: 0,passengers,airline_id,carrier_name,origin,origin_city_name,origin_state_abr,origin_state_nm,origin_country,origin_country_name,dest,dest_city_name,dest_state_abr,dest_state_nm,dest_country,dest_country_name,year,month,yearmonth


In [18]:
# no existing row uses 'NA', so it is safe to use it for Namibia
df.origin_country = df.origin_country.fillna('NA')

Now let's take a look at the seven flights that lack the two-letter abbreviation in the dest_country column.

In [19]:
df[df.dest_country.isnull()]

Unnamed: 0,passengers,airline_id,carrier_name,origin,origin_city_name,origin_state_abr,origin_state_nm,origin_country,origin_country_name,dest,dest_city_name,dest_state_abr,dest_state_nm,dest_country,dest_country_name,year,month,yearmonth
2620229,0,20095,World Airways Inc.,ATL,"Atlanta, GA",GA,Georgia,US,United States,WDH,"Windhoek, Namibia",Non-USA,Non-USA,,Namibia,2008,2,2008-02
2883909,0,20110,Antonov Company,IAH,"Houston, TX",TX,Texas,US,United States,WDH,"Windhoek, Namibia",Non-USA,Non-USA,,Namibia,2014,4,2014-04
2914851,0,20151,Amerijet International,MIA,"Miami, FL",FL,Florida,US,United States,ERS,"Windhoek, Namibia",Non-USA,Non-USA,,Namibia,2013,12,2013-12
2914852,0,20151,Amerijet International,MIA,"Miami, FL",FL,Florida,US,United States,ERS,"Windhoek, Namibia",Non-USA,Non-USA,,Namibia,2014,2,2014-02
3037765,0,20195,Tradewinds Airlines,MIA,"Miami, FL",FL,Florida,US,United States,WDH,"Windhoek, Namibia",Non-USA,Non-USA,,Namibia,2011,4,2011-04
5585804,0,20428,Volga-Dnepr Airlines,IAH,"Houston, TX",TX,Texas,US,United States,WDH,"Windhoek, Namibia",Non-USA,Non-USA,,Namibia,2014,3,2014-03
5585805,0,20428,Volga-Dnepr Airlines,IAH,"Houston, TX",TX,Texas,US,United States,WDH,"Windhoek, Namibia",Non-USA,Non-USA,,Namibia,2014,5,2014-05


Again, all of these flights have Namibia as their destination, so we'll fill them with 'NA' as well.

In [20]:
df.dest_country = df.dest_country.fillna('NA')
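A blanket `fillna('NA')` works here because every null row was inspected and found to be Namibia. A slightly more defensive variant, sketched on an invented toy frame, fills only the rows whose country name confirms it, so an unrelated gap would stay visible:

```python
import pandas as pd

# Toy stand-in; the "Unknown" row is invented to show what stays untouched.
toy = pd.DataFrame({
    "dest_country": [None, "CO", None],
    "dest_country_name": ["Namibia", "Colombia", "Unknown"],
})

# Fill only where the code is missing AND the name says Namibia.
mask = toy["dest_country"].isnull() & toy["dest_country_name"].eq("Namibia")
toy.loc[mask, "dest_country"] = "NA"
print(toy)
```

With this approach a later `isnull().sum()` would still flag any missing code that does not belong to Namibia.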

Finally, let's check whether any missing values are left:

In [21]:
df.isnull().sum()

passengers             0
airline_id             0
carrier_name           0
origin                 0
origin_city_name       0
origin_state_abr       0
origin_state_nm        0
origin_country         0
origin_country_name    0
dest                   0
dest_city_name         0
dest_state_abr         0
dest_state_nm          0
dest_country           0
dest_country_name      0
year                   0
month                  0
yearmonth              0
dtype: int64

Success!

In [22]:
from google.colab import files
df.to_csv('monthly_air_passengers_full_2.csv', index=False)
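One caveat about the saved file: the 'NA' codes filled in above are written as plain text, so a later `pd.read_csv` with default settings would turn them back into missing values. A small round-trip sketch using an in-memory buffer instead of a real file (the frame `cleaned` is invented):

```python
import io
import pandas as pd

# Toy stand-in for the cleaned table.
cleaned = pd.DataFrame({
    "dest_state_abr": ["Non-USA", "FL"],
    "dest_country": ["NA", "US"],
})

buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Default re-read: "NA" becomes NaN again.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Re-read without the default missing-value markers: "NA" survives as text.
buffer.seek(0)
careful = pd.read_csv(buffer, keep_default_na=False)

print(naive.isnull().sum().to_dict())
print(careful.isnull().sum().to_dict())
```

Whether this matters downstream depends on how the consuming tool parses the CSV; the 'Non-USA' and 'undefined' fills are not affected.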