# C3M5 Lesson 1 Practice Lab: Flight delays and cancellations - Plotting the time series

In this lesson, you will continue working with the Flight delays and cancellations dataset that you already saw in Module 3. Now you will look at some time series data and explore how the different variables change over time.

You will be working with the following columns:

- `Airline`: Name of the operating airline. If the value is “All Airlines”, the data given represents aggregated values.
- `Month`: Month of the flight in Month-Year format (for example, `Jan-04`)
- `Sectors_Scheduled`: How many flights were scheduled for the given airline and route for the given month
- `Cancellations`: Number of cancellations
- `Arrivals_Delayed`: Number of flights that arrived at the gate 15 minutes or more after the scheduled arrival time shown in the carriers' schedule.


## General instructions
- **Replace any instances of `None` with your own code**. All `None`s must be replaced.
- **Compare your results with the expected output** shown below the code.
- **Check the solution** using the expandable cell to verify your answer.

Happy coding!

<div style="background-color: #FAD888; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%">
<strong>Important note</strong>: Code blocks with None will not run properly. If you run them before completing the exercise, you will likely get an error. 
</div>

## Table of contents
- [Step 1: Import libraries](#import-libraries)
- [Step 2: Load the data](#load-the-data)
    - [Filter and process the dataset](#filter-and-process-the-dataset)
- [Step 3: Plotting the time series](#plotting-the-time-series)

<a id="import-libraries"></a>

## Step 1: Import libraries
Begin by importing the pandas and seaborn libraries.

In [None]:
import pandas as pd
import seaborn as sns

<a id="load-the-data"></a>

## Step 2: Load the data
Run the cell below to load the data and preview the first few rows.

In [None]:
df = pd.read_csv("otp_time_series_web.csv")
df.head()

The "Month" column should be a datetime, but it was most likely read as an object column that contains strings. Run the cell below to see the data type of each column, including "Month".

In [None]:
df.dtypes

The "Month" column is indeed an object, so you need to fix that!

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
    <strong>▶▶▶ Directions</strong> 
        <ol>
            <li>Use the next cell to convert the "Month" column to datetime</li>
        </ol>
</div>

You will need to specify the format, since the one in the DataFrame is not very standard. To learn more about the different formats you can define, and what they represent, you can take a look at the [🔗strftime documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).

In [None]:
### START CODE HERE ###

# convert the Month column to datetime
df["Month"] = None(df["Month"], format="%b-%y")

### END CODE HERE ###

# check the new data type
df.dtypes

<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 

```python
Route                         object
Departing_Port                object
Arriving_Port                 object
Airline                       object
Month                 datetime64[ns]
Sectors_Scheduled            float64
Sectors_Flown                  int64
Cancellations                float64
Departures_On_Time           float64
Arrivals_On_Time             float64
Departures_Delayed           float64
Arrivals_Delayed             float64
Year                           int64
Month_Num                      int64
dtype: object
```
</details>


<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# convert the "Month" column to datetime
df["Month"] = pd.to_datetime(df["Month"], format="%b-%y")
```
</ul>
</details>

As you can see, the Month column changed from Jan-04 to a numeric format 2004-01-01, but still conveys the same information.
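
To see what the two format codes do, here is a minimal sketch on a small made-up Series (the values are illustrative and not taken from the dataset). `%b` matches an abbreviated month name and `%y` a two-digit year. Since the strings carry no day, pandas fills in the first day of the month. The sketch assumes an English locale for the month abbreviations.

```python
import pandas as pd

# Hypothetical sample values in the same "Mon-YY" style as the "Month" column
months = pd.Series(["Jan-04", "Feb-04", "Dec-19"])

# %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(months, format="%b-%y")

# Once the column is a datetime, the .dt accessor exposes its components
print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```

Having real datetimes rather than strings is what lets the later steps sort and plot the months in chronological order.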

### Filter and process the dataset

As a first step in your analysis, you will examine the overall trends in delays and cancellations across all airlines. By aggregating the data by month, you can uncover seasonal patterns or changes over time that may impact flight reliability.

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
    <strong>▶▶▶ Directions</strong> 
        <ol>
            <li>Use the next cell to filter the DataFrame and keep only the observations where "Airline" equals "All Airlines".</li>
            <li>Group the data by month. Keep columns "Cancellations", "Arrivals_Delayed", and "Sectors_Scheduled" and aggregate with the sum. (already implemented)</li>
            <ul>
                <li>Note that this automatically sets "Month" as the index.</li>
            </ul>
        </ol>
</div>


In [None]:
### START CODE HERE ###

# Filter the data
df_all_airlines = df[None]

### END CODE HERE ###

# Group by month and aggregate with the sum.
# Keep columns "Cancellations", "Arrivals_Delayed", and "Sectors_Scheduled".
df_per_month = df_all_airlines.groupby("Month")[[
    "Cancellations", "Arrivals_Delayed", "Sectors_Scheduled"]].sum()

# visualize the dataframe
df_per_month.head()

<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 
<br>
<img src="imgsL1/grouped_df.png" width="350">

</details>

<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# Filter the data
df_all_airlines = df[df["Airline"]=="All Airlines"]
```
</ul>
</details>
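
The filter-then-group pattern can be tried out on a toy DataFrame. The sketch below uses made-up numbers and a hypothetical airline name ("Airline A"); only the column names mirror the lab's dataset. It shows how the boolean mask selects rows and how `groupby` moves "Month" into the index.

```python
import pandas as pd

# Made-up toy data; "Airline A" is a hypothetical airline name
toy = pd.DataFrame({
    "Airline": ["All Airlines", "Airline A", "All Airlines", "All Airlines"],
    "Month": pd.to_datetime(
        ["2004-01-01", "2004-01-01", "2004-01-01", "2004-02-01"]
    ),
    "Cancellations": [2.0, 1.0, 3.0, 4.0],
})

# Boolean mask: True for the rows that hold aggregated values
mask = toy["Airline"] == "All Airlines"
toy_all = toy[mask]

# Sum per month; "Month" becomes the index of the result
toy_per_month = toy_all.groupby("Month")[["Cancellations"]].sum()
print(toy_per_month)
```

The two January rows for "All Airlines" (for example, two different routes) are added together, while the "Airline A" row is left out by the mask.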

<a id="plotting-the-time-series"></a>

## Step 3: Plotting the time series
Now that you've aggregated the data by month, plotting the time series will help you quickly identify trends in delays and cancellations, such as seasonal spikes or consistent patterns. Start by looking at the number of delayed arrivals over time.

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
    <strong>▶▶▶ Directions</strong> 
        <ol>
            <li>Use the cell below to plot the <code>Arrivals_Delayed</code> column of the time series using seaborn's <code>lineplot()</code> function</li>
        </ol>
</div>

In [None]:
### START CODE HERE ###

# plot the time series
sns.None(df_per_month["Arrivals_Delayed"])

### END CODE HERE ###

<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 
<br>

<img src="imgsL1/single_line.png" width="400">

</details>

<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# plot the time series
sns.lineplot(df_per_month["Arrivals_Delayed"])
```
</ul>
</details>

There seems to be an overall increasing trend. However, there is a period of about three years in which the number of delayed flights plummets. Were these exceptionally good years for the airlines, with so few delays? Or are you perhaps missing some context? Can you guess what caused these changes?

When looking at the delays, you should consider not only the total number of delayed flights but also the broader context: what is the total number of flights, and how many were cancelled? `sns.lineplot()` lets you plot more than one column at once: if you pass it the whole DataFrame, it draws one line per column on the same plot, with the index on the x-axis.

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
    <strong>▶▶▶ Directions</strong> 
        <ol>
            <li>Use the cell below to plot the "Cancellations", "Arrivals_Delayed", and "Sectors_Scheduled" columns of the time series. You can use <code>sns.lineplot()</code> again</li>
        </ol>
</div>

In [None]:
### START CODE HERE ###

# plot the time series
sns.None

### END CODE HERE ###

<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 
<br>

<img src="imgsL1/ts_plot.png" width="400">

</details>

<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# plot the time series
sns.lineplot(df_per_month)
```

Alternatively, you could set the column names explicitly like this:

```python
# plot the time series
sns.lineplot(df_per_month[["Cancellations", "Arrivals_Delayed", "Sectors_Scheduled"]])
```
</ul>
</details>

Now you can see that in 2020 not only do the total delays plummet, but the number of scheduled flights plummets as well, while the number of cancellations is much higher. This broader context helps you understand that these were probably not great years for the airlines, but terrible ones. Can you guess what event caused this?
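
One possible way to make this context explicit is to express the counts as a share of scheduled flights. This is a sketch on made-up numbers, not a step of the lab. Dividing by `Sectors_Scheduled` is an assumption chosen because it is the total kept in `df_per_month`; other denominators could be argued for.

```python
import pandas as pd

# Made-up monthly totals; column names follow df_per_month
toy = pd.DataFrame(
    {
        "Cancellations": [10.0, 50.0],
        "Arrivals_Delayed": [200.0, 40.0],
        "Sectors_Scheduled": [1000.0, 400.0],
    },
    index=pd.to_datetime(["2019-04-01", "2020-04-01"]),
)

# Share of scheduled flights that were cancelled or arrived late
rates = toy[["Cancellations", "Arrivals_Delayed"]].div(
    toy["Sectors_Scheduled"], axis=0
)
print(rates)
```

In this toy example the raw count of delayed arrivals drops sharply in the second row, but so does the number of scheduled flights, and the share of cancellations rises.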

Congratulations on making it to the end of this lab. We hope you enjoyed it!