
## Exploring and cleaning the air travel file
### Coursera Tableau project

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('/content/US Monthly Air Passengers.csv')

In [4]:
df.shape

(271336, 17)

The data has 271,336 rows and 17 columns.

In [5]:
df.columns

Index(['Sum_PASSENGERS', 'AIRLINE_ID', 'CARRIER_NAME', 'ORIGIN',
       'ORIGIN_CITY_NAME', 'ORIGIN_STATE_ABR', 'ORIGIN_STATE_NM',
       'ORIGIN_COUNTRY', 'ORIGIN_COUNTRY_NAME', 'DEST', 'DEST_CITY_NAME',
       'DEST_STATE_ABR', 'DEST_STATE_NM', 'DEST_COUNTRY', 'DEST_COUNTRY_NAME',
       'YEAR', 'MONTH'],
      dtype='object')

Let's count the missing values in each column:

In [6]:
df.isnull().sum()

Sum_PASSENGERS             0
AIRLINE_ID               448
CARRIER_NAME             448
ORIGIN                     0
ORIGIN_CITY_NAME           0
ORIGIN_STATE_ABR        9857
ORIGIN_STATE_NM         9857
ORIGIN_COUNTRY             0
ORIGIN_COUNTRY_NAME        0
DEST                       0
DEST_CITY_NAME             1
DEST_STATE_ABR         10351
DEST_STATE_NM          10351
DEST_COUNTRY               1
DEST_COUNTRY_NAME          1
YEAR                       1
MONTH                      1
dtype: int64

The null counts suggest that missing values occur in pairs: AIRLINE_ID with CARRIER_NAME (448 each), ORIGIN_STATE_ABR with ORIGIN_STATE_NM (9857 each), and DEST_STATE_ABR with DEST_STATE_NM (10351 each). Five columns have a single null value each: DEST_CITY_NAME, DEST_COUNTRY, DEST_COUNTRY_NAME, YEAR and MONTH.

I will first deal with the paired missing values and see whether I can detect any patterns in them.
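Equal null counts alone do not prove that the nulls sit in the same rows. A minimal sketch of how this could be checked, using a small made-up frame (`toy` and its values are hypothetical stand-ins for the real file):

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "AIRLINE_ID": [101.0, None, 102.0],
    "CARRIER_NAME": ["Carrier A", None, "Carrier B"],
    "ORIGIN_STATE_ABR": ["TX", None, "NY"],
    "ORIGIN_STATE_NM": ["Texas", None, "New York"],
})

pairs = [
    ("AIRLINE_ID", "CARRIER_NAME"),
    ("ORIGIN_STATE_ABR", "ORIGIN_STATE_NM"),
]

# For each pair, True means both columns are null in exactly the same rows.
same_rows = {
    (a, b): bool((toy[a].isnull() == toy[b].isnull()).all())
    for a, b in pairs
}
print(same_rows)
```

On the real data, the same comparison could be run on `df` with all three column pairs.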

In [17]:
df[df.AIRLINE_ID.isnull()].head()

Unnamed: 0,Sum_PASSENGERS,AIRLINE_ID,CARRIER_NAME,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_NM,ORIGIN_COUNTRY,ORIGIN_COUNTRY_NAME,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_NM,DEST_COUNTRY,DEST_COUNTRY_NAME,YEAR,MONTH


In [16]:
df[df.AIRLINE_ID.isnull()].tail()

Unnamed: 0,Sum_PASSENGERS,AIRLINE_ID,CARRIER_NAME,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_NM,ORIGIN_COUNTRY,ORIGIN_COUNTRY_NAME,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_NM,DEST_COUNTRY,DEST_COUNTRY_NAME,YEAR,MONTH


I can't detect a pattern in the rows that lack AIRLINE_ID and CARRIER_NAME. Since these columns are not the most important, I am not going to delete these rows. Instead, I will fill the null values with the string 'undefined'.

In [25]:
df.AIRLINE_ID = df.AIRLINE_ID.fillna('undefined')
df.CARRIER_NAME = df.CARRIER_NAME.fillna('undefined')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
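The message above is the text of pandas' SettingWithCopyWarning. Judging by the cell numbers, this cell probably ran after `df` had already been replaced by a filtered slice of itself, though that is my assumption. A minimal sketch of one way to avoid the warning is to take an explicit `.copy()` when filtering (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "CARRIER_NAME": ["Carrier A", None, "Carrier B"],
    "DEST_CITY_NAME": ["Baltimore, MD", "Chicago, IL", None],
})

# .copy() makes the filtered result an independent frame, not a slice of toy.
kept = toy[toy["DEST_CITY_NAME"].notna()].copy()

# Assigning a column on the independent copy does not raise the warning.
kept["CARRIER_NAME"] = kept["CARRIER_NAME"].fillna("undefined")
print(kept)
```

The original `toy` frame is left untouched, so the fill only affects `kept`.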


Let's now turn to origin and destination states.

In [20]:
df[df.DEST_STATE_ABR.isnull()].head()

Unnamed: 0,Sum_PASSENGERS,AIRLINE_ID,CARRIER_NAME,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_NM,ORIGIN_COUNTRY,ORIGIN_COUNTRY_NAME,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_NM,DEST_COUNTRY,DEST_COUNTRY_NAME,YEAR,MONTH


It seems that the rows with missing DEST_STATE_ABR and DEST_STATE_NM are simply flights to destinations outside the United States.

In [21]:
df[df.ORIGIN_STATE_ABR.isnull()].head()

Unnamed: 0,Sum_PASSENGERS,AIRLINE_ID,CARRIER_NAME,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_NM,ORIGIN_COUNTRY,ORIGIN_COUNTRY_NAME,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_NM,DEST_COUNTRY,DEST_COUNTRY_NAME,YEAR,MONTH


Likewise, the rows with null values in ORIGIN_STATE_ABR and ORIGIN_STATE_NM are flights that originate outside the United States.

It makes sense to group all these rows by filling the missing values with 'Non-USA'.
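Looking at the first five rows only hints at this. A rough sketch of how the hypothesis could be checked over every row, assuming the country code for the United States is 'US' (as it appears in ORIGIN_COUNTRY elsewhere in this notebook); the `toy` frame is a hypothetical stand-in:

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "DEST_STATE_ABR": ["MD", None, None],
    "DEST_COUNTRY": ["US", "MX", "CA"],
})

missing_state = toy["DEST_STATE_ABR"].isnull()

# Which destination countries occur among the rows without a state?
countries = toy.loc[missing_state, "DEST_COUNTRY"].value_counts()

# Number of US destinations that nevertheless lack a state; 0 supports the hypothesis.
us_without_state = int((missing_state & (toy["DEST_COUNTRY"] == "US")).sum())
print(countries)
print(us_without_state)
```

The same check would apply to the ORIGIN columns by swapping the column names.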

In [19]:
df.ORIGIN_STATE_ABR = df.ORIGIN_STATE_ABR.fillna('Non-USA')
df.ORIGIN_STATE_NM = df.ORIGIN_STATE_NM.fillna('Non-USA')
df.DEST_STATE_ABR = df.DEST_STATE_ABR.fillna('Non-USA')
df.DEST_STATE_NM = df.DEST_STATE_NM.fillna('Non-USA')
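As a side note, the four assignments can be collapsed into a single call, since `DataFrame.fillna` accepts a dict mapping column names to fill values. A small sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "ORIGIN_STATE_ABR": ["TX", None],
    "ORIGIN_STATE_NM": ["Texas", None],
    "DEST_STATE_ABR": [None, "MD"],
    "DEST_STATE_NM": [None, "Maryland"],
})

state_cols = ["ORIGIN_STATE_ABR", "ORIGIN_STATE_NM",
              "DEST_STATE_ABR", "DEST_STATE_NM"]

# One fillna call; only the columns named in the dict are touched.
filled = toy.fillna({col: "Non-USA" for col in state_cols})
print(filled)
```

`fillna` returns a new frame here, so the result has to be assigned back if it should replace the original.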


Let's now consider the columns that each have a single missing value.

In [22]:
df[df.DEST_CITY_NAME.isnull()].head()

Unnamed: 0,Sum_PASSENGERS,AIRLINE_ID,CARRIER_NAME,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_NM,ORIGIN_COUNTRY,ORIGIN_COUNTRY_NAME,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_NM,DEST_COUNTRY,DEST_COUNTRY_NAME,YEAR,MONTH
271335,3591,19393,Southwest Airlines Co.,DAL,"Dallas, TX",TX,Texas,US,United States,BWI,,Non-USA,Non-USA,,,,


All of these single missing values belong to one row: a Southwest Airlines Co. record originating in Dallas. Since this row lacks even the year and month, it only makes sense to remove it.

In [23]:
df = df[df.DEST_CITY_NAME.notna()]
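An alternative way to express the same removal is `dropna` with a `subset`, which names the columns that must be present. This is only a sketch on a hypothetical `toy` frame, not what the cell above does:

```python
import pandas as pd

# Hypothetical toy frame; the first row is invented, the second mimics the incomplete row.
toy = pd.DataFrame({
    "DEST": ["BWI", "BWI"],
    "DEST_CITY_NAME": ["Baltimore, MD", None],
    "YEAR": [2019.0, None],
    "MONTH": [1.0, None],
})

# Drop any row that is missing at least one of the listed columns.
cleaned = toy.dropna(subset=["DEST_CITY_NAME", "YEAR", "MONTH"])
print(cleaned)
```

Listing YEAR and MONTH explicitly would also catch rows that have a city name but no date.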

Finally, let's check whether any missing values remain:

In [26]:
df.isnull().sum()

Sum_PASSENGERS         0
AIRLINE_ID             0
CARRIER_NAME           0
ORIGIN                 0
ORIGIN_CITY_NAME       0
ORIGIN_STATE_ABR       0
ORIGIN_STATE_NM        0
ORIGIN_COUNTRY         0
ORIGIN_COUNTRY_NAME    0
DEST                   0
DEST_CITY_NAME         0
DEST_STATE_ABR         0
DEST_STATE_NM          0
DEST_COUNTRY           0
DEST_COUNTRY_NAME      0
YEAR                   0
MONTH                  0
dtype: int64

Success!

In [27]:
from google.colab import files
df.to_csv('monthly_air_passengers.csv', index=False)