# Tunisair Exploratory Data Analysis

In [2]:
import pandas as pd
from ydata_profiling import ProfileReport

### Load Dataset and Get Overview of the data

In [3]:
tunisair_df = pd.read_csv('datasets/tunisair_flights_dataset.csv')
df = tunisair_df.copy()
df.head()

Unnamed: 0,Filght_date,Flight_ID,Departure point,Arrival point,Scheduled_departure_time,Scheduled_arrival_time,STATUS,Aircraft_code,Arrival delay
0,2016-01-03,TU 0712,CMN,TUN,2016-01-03 10:30:00,2016-01-03 12.55.00,ATA,TU 32AIMN,260.0
1,2016-01-13,TU 0757,MXP,TUN,2016-01-13 15:05:00,2016-01-13 16.55.00,ATA,TU 31BIMO,20.0
2,2016-01-16,TU 0214,TUN,IST,2016-01-16 04:10:00,2016-01-16 06.45.00,ATA,TU 32AIMN,0.0
3,2016-01-17,TU 0480,DJE,NTE,2016-01-17 14:10:00,2016-01-17 17.00.00,ATA,TU 736IOK,0.0
4,2016-01-17,TU 0338,TUN,ALG,2016-01-17 14:30:00,2016-01-17 15.50.00,ATA,TU 320IMU,22.0


The dataset has 107833 rows and 9 columns

In [6]:
df.shape

(107833, 9)

There appear to be no missing values in the dataset, and every column is of `object` type except the arrival delay, which is `float64`

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107833 entries, 0 to 107832
Data columns (total 9 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Filght_date               107833 non-null  object 
 1   Flight_ID                 107833 non-null  object 
 2   Departure point           107833 non-null  object 
 3   Arrival point             107833 non-null  object 
 4   Scheduled_departure_time  107833 non-null  object 
 5   Scheduled_arrival_time    107833 non-null  object 
 6   STATUS                    107833 non-null  object 
 7   Aircraft_code             107833 non-null  object 
 8   Arrival delay             107833 non-null  float64
dtypes: float64(1), object(8)
memory usage: 7.4+ MB
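
Since the date and time columns are loaded as `object`, they could be converted to datetime before any time-based analysis. The following is a minimal sketch, assuming every row follows the formats seen in the `df.head()` preview (colons in the departure time, dots in the arrival time). The helper name `parse_schedule_columns` is hypothetical, and the sample rows are copied from the preview.

```python
import pandas as pd

def parse_schedule_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the date and schedule columns converted to datetime."""
    out = frame.copy()
    out["Filght_date"] = pd.to_datetime(out["Filght_date"], format="%Y-%m-%d")
    out["Scheduled_departure_time"] = pd.to_datetime(
        out["Scheduled_departure_time"], format="%Y-%m-%d %H:%M:%S"
    )
    # Assumption: arrival times always use dots between hours, minutes, seconds
    out["Scheduled_arrival_time"] = pd.to_datetime(
        out["Scheduled_arrival_time"], format="%Y-%m-%d %H.%M.%S"
    )
    return out

# Two rows copied from the df.head() preview
sample = pd.DataFrame({
    "Filght_date": ["2016-01-03", "2016-01-13"],
    "Scheduled_departure_time": ["2016-01-03 10:30:00", "2016-01-13 15:05:00"],
    "Scheduled_arrival_time": ["2016-01-03 12.55.00", "2016-01-13 16.55.00"],
})
parsed = parse_schedule_columns(sample)
print(parsed.dtypes)
```

If some rows in the full dataset deviate from these formats, passing `errors="coerce"` to `pd.to_datetime` would turn them into `NaT` instead of raising an error.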


In [7]:
df.isnull().sum()

Filght_date                 0
Flight_ID                   0
Departure point             0
Arrival point               0
Scheduled_departure_time    0
Scheduled_arrival_time      0
STATUS                      0
Aircraft_code               0
Arrival delay               0
dtype: int64
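
The missing value check can be paired with a duplicate row check in one small helper. This is only a sketch: `quality_summary` is a hypothetical name, and the toy rows below are made up for illustration rather than taken from the dataset.

```python
import pandas as pd

def quality_summary(frame: pd.DataFrame) -> dict:
    """Count rows, missing cells, and fully duplicated rows in a DataFrame."""
    return {
        "rows": len(frame),
        "missing_cells": int(frame.isnull().sum().sum()),
        # duplicated() flags every repeat of an earlier, identical row
        "duplicate_rows": int(frame.duplicated().sum()),
    }

# Toy data: the second row repeats the first, the third has a missing delay
toy = pd.DataFrame({
    "Flight_ID": ["TU 0712", "TU 0712", "TU 0757"],
    "Arrival delay": [260.0, 260.0, None],
})
summary = quality_summary(toy)
print(summary)
```

On the real data this would be called as `quality_summary(df)`.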

In [8]:
# Get summary statistics of the dataset
df.describe(include='all')

Unnamed: 0,Filght_date,Flight_ID,Departure point,Arrival point,Scheduled_departure_time,Scheduled_arrival_time,STATUS,Aircraft_code,Arrival delay
count,107833,107833,107833,107833,107833,107833,107833,107833,107833.0
unique,1011,1861,132,128,81697,85136,5,68,
top,2018-08-31,WKL 0000,TUN,TUN,2017-06-23 06:00:00,2016-01-19 01.00.00,ATA,TU 320IMU,
freq,183,3105,42522,42572,8,6,93679,4724,
mean,,,,,,,,,48.733013
std,,,,,,,,,117.135562
min,,,,,,,,,0.0
25%,,,,,,,,,0.0
50%,,,,,,,,,14.0
75%,,,,,,,,,43.0
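
The 25% quantile of 0 means at least a quarter of the flights recorded no arrival delay. The exact share can be computed directly, as in this small sketch. `on_time_share` is a hypothetical helper, and the sample delays are the five values from the `df.head()` preview, so the printed share describes those five rows only.

```python
import pandas as pd

def on_time_share(delays: pd.Series) -> float:
    """Fraction of flights whose recorded arrival delay is exactly zero."""
    return float((delays == 0).mean())

# Arrival delays of the five rows shown in df.head()
sample_delays = pd.Series([260.0, 20.0, 0.0, 0.0, 22.0], name="Arrival delay")
print(f"{on_time_share(sample_delays):.0%}")  # 40%
```

On the full dataset the call would be `on_time_share(df['Arrival delay'])`.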


### Use ydata profiling to generate the report

In [9]:
profile = ProfileReport(df, title='Tunisair EDA')
profile.to_file(output_file='profile-report.html')



#### Report Findings

* The ydata profiling report shows no missing values and no duplicate rows
* Unlike `df.info()`, which lists the date and time columns as `object`, the profile report infers a distinct data type for each column, e.g. it shows a datetime column as a datetime type
* The number of distinct values for each column:
   * Flight date - 1011
   * Flight id - 1861
   * Departure point - 132
   * Arrival point - 128
   * Scheduled departure time - 81697
   * Scheduled arrival time - 85136
   * Aircraft code - 68
* 35% of the flights arrived with no delay
* There are 5 distinct status values:
  * ATA - Actual Time of Arrival
  * SCH - Scheduled
  * DEP - Departure
  * RTR - Radio Telephony Restricted
  * DEL - Delayed
* The report records a coefficient of 0.030 between arrival delay and flight status, which is a very weak positive association. In practice, this indicates that the status carries almost no information about how long a flight's arrival is delayed.
* The top 3 departure and arrival airports, by IATA code, are:
  * TUN - Tunis-Carthage International Airport
  * DJE - Djerba-Zarzis International Airport
  * ORY - Paris Orly Airport
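
Some of these findings can be cross-checked in pandas without the HTML report. The sketch below ranks airports and summarises delays per status. Both helper names are hypothetical, and counting departures and arrivals together is an assumption, since the report may rank the two columns separately. The sample rows are copied from the `df.head()` preview.

```python
import pandas as pd

def top_airports(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """Count how often each IATA code appears as a departure or arrival point."""
    codes = pd.concat([frame["Departure point"], frame["Arrival point"]])
    return codes.value_counts().head(n)

def delay_by_status(frame: pd.DataFrame) -> pd.DataFrame:
    """Number of flights and mean arrival delay for each status value."""
    return frame.groupby("STATUS")["Arrival delay"].agg(["count", "mean"])

# Five rows copied from the df.head() preview
sample = pd.DataFrame({
    "Departure point": ["CMN", "MXP", "TUN", "DJE", "TUN"],
    "Arrival point": ["TUN", "TUN", "IST", "NTE", "ALG"],
    "STATUS": ["ATA", "ATA", "ATA", "ATA", "ATA"],
    "Arrival delay": [260.0, 20.0, 0.0, 0.0, 22.0],
})
airports = top_airports(sample)
status_delays = delay_by_status(sample)
print(airports)
print(status_delays)
```

Comparing the per-status mean delays on the full dataset is a more direct way to read the relationship between status and delay than a single coefficient.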