# Practical Pandas

Now we will apply our pandas skills to a practical case.

## 1. Load Data

Let's load all the data from the APP.ANALYST_TRAININGS table, which the cell below reads from a CSV file:

In [None]:
import pandas as pd

df = pd.read_csv('Data/airlineDT.csv', sep=',')
df.head(5)

First we need to convert the data to a pandas dataframe. Convert "rows" to a pandas dataframe called "df":
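
A minimal sketch of the conversion, assuming "rows" is a list of row tuples such as a database cursor would return. Both `rows` and `column_names` below are hypothetical stand-ins, not values from the real table. Note that `pd.read_csv`, as used in the cell above, already returns a dataframe, so this step only applies when the data arrives as raw rows.

```python
import pandas as pd

# Hypothetical stand-ins for query results: row tuples plus their column names.
column_names = ['CARRIER', 'ORIGIN', 'DEST', 'DEP_DELAY']
rows = [('UA', 'EWR', 'IAH', 2.0), ('AA', 'JFK', 'MIA', -4.0), ('DL', 'LGA', 'ATL', 11.0)]

# Each tuple becomes one row of the dataframe.
df = pd.DataFrame(rows, columns=column_names)
print(df)
```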

To get a better understanding of the content of the dataframe, we can use the pandas describe function. Use the pandas describe function:
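
One way this could look, shown on a small made-up sample so the sketch runs on its own. In the notebook, `describe` would be called on `df` instead of `sample`.

```python
import pandas as pd

# Illustrative sample only; the values are invented.
sample = pd.DataFrame({
    'CARRIER': ['UA', 'AA', 'DL', 'UA'],
    'DEP_DELAY': [2.0, -4.0, 11.0, None],
    'DISTANCE': [1400, 1089, 762, 1400],
})

# By default describe() summarizes the numeric columns only.
summary = sample.describe()

# include='all' adds the non-numeric columns (unique, top, freq).
summary_all = sample.describe(include='all')
print(summary)
```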

We can also use the agg function, and we can show the column types as well. Use the pandas agg function and show the type information ("dtypes"), count ("count") and mean ("mean"):
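
A possible sketch on a made-up sample. It assumes the type information is read from the `dtypes` attribute and that the mean is only wanted for numeric columns, so the text column is excluded before calling `agg`.

```python
import pandas as pd

# Illustrative sample only; in the notebook, use df instead.
sample = pd.DataFrame({
    'CARRIER': ['UA', 'AA', 'DL', 'UA'],
    'DEP_DELAY': [2.0, -4.0, 11.0, None],
    'DISTANCE': [1400, 1089, 762, 1400],
})

# The column types are stored in the dtypes attribute.
print(sample.dtypes)

# agg accepts a list of function names; missing values are ignored by count and mean.
stats = sample.select_dtypes(include='number').agg(['count', 'mean'])
print(stats)
```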

Let us also check the first 10 rows of the dataframe. Print the first 10 rows of the dataframe:
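
A short sketch on a made-up 15-row sample, assuming "top 10" means the first 10 rows in their current order.

```python
import pandas as pd

# Illustrative sample with 15 rows; in the notebook, use df instead.
sample = pd.DataFrame({
    'DISTANCE': range(100, 1600, 100),
    'AIR_TIME': range(20, 170, 10),
})

# head(10) returns the first 10 rows; no sorting is applied.
first_ten = sample.head(10)
print(first_ten)
```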

## 2. Clean Data

We want to get rid of the non-significant variables "YEAR", "MONTH" and "DAY", since they are already contained in "TIME_HOUR". (Columns that are not needed should normally be excluded in the SQL query instead.)
Write the code to drop the columns:
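
A minimal sketch of the drop on a made-up sample. In the notebook, the same call would be applied to `df`.

```python
import pandas as pd

# Illustrative sample only; the values are invented.
sample = pd.DataFrame({
    'YEAR': [2013, 2013],
    'MONTH': [1, 1],
    'DAY': [1, 1],
    'TIME_HOUR': ['2013-01-01 05:00:00', '2013-01-01 06:00:00'],
    'DEP_DELAY': [2.0, -4.0],
})

# drop returns a new dataframe without the listed columns.
sample = sample.drop(columns=['YEAR', 'MONTH', 'DAY'])
print(sample.columns)
```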

Let's check which columns we now have:

In [None]:
df.columns

Some of the columns can be hard to read or understand. Let's rename them:

In [None]:
df.rename(columns={'DEP_TIME': 'DEPARTURE_TIME', 'SCHED_DEP_TIME': 'SCHEDULED_DEPARTURE_TIME',
                   'DEP_DELAY': 'DEPARTURE_DELAY', 'ARR_TIME': 'ARRIVAL_TIME',
                   'SCHED_ARR_TIME': 'SCHEDULED_ARRIVAL_TIME', 'ARR_DELAY': 'ARRIVAL_DELAY',
                   'DEST': 'DESTINATION', 'HOUR': 'SCHEDULED_DEPARTURE_HOUR',
                   'MINUTE': 'SCHEDULED_DEPARTURE_MINUTE'}, inplace=True)

This is now the new dataframe:

In [None]:
df.head(10)

## 3. Feature Engineering

"DEPARTURE_TIME", "SCHEDULED_DEPARTURE_TIME", "ARRIVAL_TIME" and "SCHEDULED_ARRIVAL_TIME" are in HHMM format (hour and minutes combined as one number). "SCHEDULED_DEPARTURE_TIME" is a combination of "SCHEDULED_DEPARTURE_HOUR" and "SCHEDULED_DEPARTURE_MINUTE". Let's create the corresponding hour and minute columns for "DEPARTURE_TIME", "ARRIVAL_TIME" and "SCHEDULED_ARRIVAL_TIME".

Run the given loop. Why does it crash? 

In [None]:
columns = ['DEPARTURE_TIME', 'ARRIVAL_TIME', 'SCHEDULED_ARRIVAL_TIME']

for col in columns:
    HHMM = df[col] # Here we fetch the current column in the dataframe
    hour = (HHMM/100).astype(int) # We can find the hour by dividing by 100 and converting the result to an integer.
    minute = HHMM - hour*100 # Minutes can be found by subtracting the hour multiplied by 100 from the original column.
    
    new_column_name_hour = col[:col.find('_TIME')]+'_HOUR' # Here we delete "_TIME" in the name and replace it with "_HOUR"
    new_column_name_minute = col[:col.find('_TIME')]+'_MINUTE' # Here we delete "_TIME" in the name and replace it with "_MINUTE"
    df[new_column_name_hour] = hour  # Now we can add a new column with the values in hour
    df[new_column_name_minute] = minute # Now we can add a new column with the values in minute

Fix the crash:
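
One possible fix, under the assumption (not stated in the notebook) that the loop fails because some of the time columns contain missing values: NaN cannot be converted to an integer, so `astype(int)` raises an error. The sketch below reproduces this on a made-up sample and then drops the affected rows. In the notebook, the equivalent would be `df = df.dropna(subset=columns)`.

```python
import pandas as pd

# Illustrative sample with one row of missing times; the values are invented.
sample = pd.DataFrame({
    'DEPARTURE_TIME': [517.0, None, 1542.0],
    'ARRIVAL_TIME': [830.0, None, 1801.0],
    'SCHEDULED_ARRIVAL_TIME': [819, 1200, 1755],
})
columns = ['DEPARTURE_TIME', 'ARRIVAL_TIME', 'SCHEDULED_ARRIVAL_TIME']

# Reproduce the failure: NaN has no integer representation.
try:
    (sample['DEPARTURE_TIME'] / 100).astype(int)
    crashed = False
except ValueError:
    crashed = True

# Possible fix: drop the rows where any of the time columns is missing.
cleaned = sample.dropna(subset=columns)
hour = (cleaned['DEPARTURE_TIME'] / 100).astype(int)
minute = cleaned['DEPARTURE_TIME'] - hour * 100
print(crashed, hour.tolist(), minute.tolist())
```

Dropping rows discards those flights from every later step, so whether this is the right fix depends on the analysis.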

Try to rerun the loop:

In [None]:
columns = ['DEPARTURE_TIME', 'ARRIVAL_TIME', 'SCHEDULED_ARRIVAL_TIME']

for col in columns:
    HHMM = df[col] # Here we fetch the current column in the dataframe
    hour = (HHMM/100).astype(int) # We can find the hour by dividing by 100 and converting the result to an integer.
    minute = HHMM - hour*100 # Minutes can be found by subtracting the hour multiplied by 100 from the original column.
    
    new_column_name_hour = col[:col.find('_TIME')]+'_HOUR' # Here we delete "_TIME" in the name and replace it with "_HOUR"
    new_column_name_minute = col[:col.find('_TIME')]+'_MINUTE' # Here we delete "_TIME" in the name and replace it with "_MINUTE"
    df[new_column_name_hour] = hour  # Now we can add a new column with the values in hour
    df[new_column_name_minute] = minute # Now we can add a new column with the values in minute

Let's now check that the columns have been successfully created:

In [None]:
df.columns

And check that the logic works:

In [None]:
df[['DEPARTURE_TIME', 'DEPARTURE_HOUR', 'DEPARTURE_MINUTE']].head()

Let's delete "DEPARTURE_TIME", "SCHEDULED_DEPARTURE_TIME", "ARRIVAL_TIME" and "SCHEDULED_ARRIVAL_TIME", since we have now split them into the new "_HOUR" and "_MINUTE" columns. Delete the columns:
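
A minimal sketch of the deletion on a made-up one-row sample. In the notebook, the same call would be applied to `df`.

```python
import pandas as pd

# Illustrative sample only; the values are invented.
sample = pd.DataFrame({
    'DEPARTURE_TIME': [517.0], 'DEPARTURE_HOUR': [5], 'DEPARTURE_MINUTE': [17.0],
    'SCHEDULED_DEPARTURE_TIME': [515], 'SCHEDULED_DEPARTURE_HOUR': [5],
    'SCHEDULED_DEPARTURE_MINUTE': [15],
    'ARRIVAL_TIME': [830.0], 'ARRIVAL_HOUR': [8], 'ARRIVAL_MINUTE': [30.0],
    'SCHEDULED_ARRIVAL_TIME': [819], 'SCHEDULED_ARRIVAL_HOUR': [8],
    'SCHEDULED_ARRIVAL_MINUTE': [19],
})

# The four combined HHMM columns are now redundant.
time_columns = ['DEPARTURE_TIME', 'SCHEDULED_DEPARTURE_TIME',
                'ARRIVAL_TIME', 'SCHEDULED_ARRIVAL_TIME']
sample = sample.drop(columns=time_columns)
print(sample.columns)
```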

Check the dataframe:

In [None]:
df.head()

## 4. Analyze and Visualize

What is the mean departure delay, arrival delay and air time for the different carriers on each route from origin to destination?
We check this by using groupby, where the groups are "CARRIER", "ORIGIN", "DESTINATION" and "DISTANCE". Then we use the mean function and pick out the columns "DEPARTURE_DELAY", "ARRIVAL_DELAY" and "AIR_TIME". Use pandas groupby with mean() and pick out the columns:
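
A sketch of the grouping on a made-up sample. The sketch selects the columns before taking the mean, which gives the same table as selecting them afterwards and avoids averaging columns that are not needed.

```python
import pandas as pd

# Illustrative sample only; the values are invented.
sample = pd.DataFrame({
    'CARRIER': ['UA', 'UA', 'AA', 'AA'],
    'ORIGIN': ['EWR', 'EWR', 'JFK', 'JFK'],
    'DESTINATION': ['IAH', 'IAH', 'MIA', 'MIA'],
    'DISTANCE': [1400, 1400, 1089, 1089],
    'DEPARTURE_DELAY': [2.0, 4.0, -4.0, 0.0],
    'ARRIVAL_DELAY': [11.0, 21.0, -2.0, 6.0],
    'AIR_TIME': [227.0, 227.0, 160.0, 150.0],
})

group_keys = ['CARRIER', 'ORIGIN', 'DESTINATION', 'DISTANCE']
value_columns = ['DEPARTURE_DELAY', 'ARRIVAL_DELAY', 'AIR_TIME']

# One row per (carrier, origin, destination, distance) combination.
route_means = sample.groupby(group_keys)[value_columns].mean()
print(route_means)
```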

DataFrame.plot (the pandas plotting tool) is an easy and quick way to plot the data. For more complex plots, other plotting tools are required (e.g. matplotlib, bokeh). See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

First, let's plot the distribution of departure delay and arrival delay.

Plot the distribution for "DEPARTURE_DELAY"

Plot the distribution for "ARRIVAL_DELAY"
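
A sketch of both distributions as histograms, on a made-up sample and assuming matplotlib is installed (pandas plotting relies on it). The number of bins is an arbitrary choice. Here both columns are drawn in one call with `subplots=True`; in separate notebook cells, `df['DEPARTURE_DELAY'].plot(kind='hist')` and `df['ARRIVAL_DELAY'].plot(kind='hist')` would do the same one at a time.

```python
import pandas as pd

# Illustrative sample only; the values are invented.
sample = pd.DataFrame({
    'DEPARTURE_DELAY': [-5.0, 0.0, 2.0, 3.0, 10.0, 45.0],
    'ARRIVAL_DELAY': [-12.0, -3.0, 1.0, 5.0, 15.0, 50.0],
})

# kind='hist' draws a histogram; subplots=True gives each column its own axes.
axes = sample[['DEPARTURE_DELAY', 'ARRIVAL_DELAY']].plot(
    kind='hist', bins=10, subplots=True, sharex=False
)
```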

Is there any correlation between departure delay and arrival delay when grouped by scheduled departure hour?
An easy check is a line plot, where the data is grouped by "SCHEDULED_DEPARTURE_HOUR" and "DEPARTURE_DELAY" and "ARRIVAL_DELAY" are picked out before plotting. Use the mean to summarize the data (it makes the Y-axis easier to interpret). Make the plot:
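
A sketch of the grouped line plot on a made-up sample, assuming matplotlib is installed. In the notebook, the same chain would start from `df`.

```python
import pandas as pd

# Illustrative sample only; the values are invented.
sample = pd.DataFrame({
    'SCHEDULED_DEPARTURE_HOUR': [6, 6, 12, 12, 18, 18],
    'DEPARTURE_DELAY': [0.0, 2.0, 5.0, 9.0, 20.0, 30.0],
    'ARRIVAL_DELAY': [-4.0, 0.0, 3.0, 7.0, 18.0, 26.0],
})

# Mean delay per scheduled departure hour; the hour becomes the index (x-axis).
hourly_means = sample.groupby('SCHEDULED_DEPARTURE_HOUR')[
    ['DEPARTURE_DELAY', 'ARRIVAL_DELAY']
].mean()

# One line per delay column.
ax = hourly_means.plot(kind='line')
```

Swapping the grouping column is all that is needed to repeat the check for another variable.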

Do we see the same pattern when the group is "AIR_TIME"?

Let's check the correlation between "DEPARTURE_DELAY" and "ARRIVAL_DELAY". (The closer the points lie to a straight line, the stronger the correlation.)

Plot "DEPARTURE_DELAY" against "ARRIVAL_DELAY":
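
A sketch of the scatter plot on a made-up sample, assuming matplotlib is installed.

```python
import pandas as pd

# Illustrative sample only; the values are invented.
sample = pd.DataFrame({
    'DEPARTURE_DELAY': [0.0, 2.0, 5.0, 9.0, 20.0],
    'ARRIVAL_DELAY': [-4.0, 0.0, 3.0, 7.0, 18.0],
})

# Each flight becomes one point: departure delay on x, arrival delay on y.
ax = sample.plot(kind='scatter', x='DEPARTURE_DELAY', y='ARRIVAL_DELAY')
```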

Is there any correlation between "DEPARTURE_DELAY" and "ARRIVAL_DELAY" when grouped by "SCHEDULED_DEPARTURE_HOUR", looking only at the "CARRIER" UA?
(The only change from before is to add "CARRIER" to the grouping and, after finding the mean, to look up "UA" in the index.)
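
A sketch of the grouping and index lookup on a made-up sample. Only the data step is shown; the result can be plotted with `.plot()`.

```python
import pandas as pd

# Illustrative sample only; the values are invented.
sample = pd.DataFrame({
    'CARRIER': ['UA', 'UA', 'UA', 'AA', 'AA'],
    'SCHEDULED_DEPARTURE_HOUR': [6, 6, 18, 6, 18],
    'DEPARTURE_DELAY': [0.0, 2.0, 20.0, 5.0, 9.0],
    'ARRIVAL_DELAY': [-4.0, 0.0, 18.0, 3.0, 7.0],
})

# Grouping by two columns gives a two-level index: (carrier, hour).
carrier_hourly = sample.groupby(['CARRIER', 'SCHEDULED_DEPARTURE_HOUR'])[
    ['DEPARTURE_DELAY', 'ARRIVAL_DELAY']
].mean()

# Looking up 'UA' selects that carrier and leaves the hour as the index.
ua_hourly = carrier_hourly.loc['UA']
print(ua_hourly)
```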

What is the true correlation between "DEPARTURE_DELAY", "ARRIVAL_DELAY", "AIR_TIME" and "DISTANCE" in the dataset?
This can be shown by picking out the wanted columns and then using the pandas corr function.
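
A sketch of the correlation matrix on a made-up sample. `corr` computes pairwise Pearson correlation by default; the numbers from this sample say nothing about the real dataset.

```python
import pandas as pd

# Illustrative sample only; the values are invented.
sample = pd.DataFrame({
    'DEPARTURE_DELAY': [0.0, 2.0, 5.0, 9.0, 20.0],
    'ARRIVAL_DELAY': [-4.0, 0.0, 3.0, 7.0, 18.0],
    'AIR_TIME': [227.0, 160.0, 150.0, 140.0, 53.0],
    'DISTANCE': [1400, 1089, 1089, 762, 200],
})

wanted = ['DEPARTURE_DELAY', 'ARRIVAL_DELAY', 'AIR_TIME', 'DISTANCE']

# Square matrix with one row and one column per selected variable.
corr_matrix = sample[wanted].corr()
print(corr_matrix)
```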