## Project- AirFares
--- -------------------

### Introduction
--- -------------------

This project studies German domestic airfares between major airports in Germany for the period 25-10-2019 to 24-04-2020. The aim is to apply machine learning techniques to the dataset and understand pricing trends with respect to features such as the booking date, departure and arrival cities, and departure time.
#### Details of dataset:
-- -------------------
1. Source: [https://www.kaggle.com/datasets/darjand/domestic-german-air-fares](https://www.kaggle.com/datasets/darjand/domestic-german-air-fares)
2. Generation mode: web scraping
3. Time period considered: 25-10-2019 to 24-04-2020 (6 months).
4. Total entries: about 63,000 (62,626 rows in the loaded file)
5. Features:
    * departure_city: The city from which the flight departs.
    * arrival_city: The city to which the flight arrives.
    * scrape_date: The date when flight price information was retrieved.
    * departure_date: The departure date of the flight (25-10-2019 to 24-04-2020).
    * departure_date_distance: How far in advance (e.g., "1 week") the flight was booked.
    * departure_time: The departure time of the flight.
    * arrival_time: The arrival time of the flight.
    * airline: The airline that operates the flight.
    * stops: The number of layovers or stops during the flight.
    * price (€): The price of the flight ticket in Euros.

#### Imports:
-- ----------

In [1]:
#imports
import numpy as np
import pandas as pd
import sklearn.model_selection as ms
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score
from datetime import datetime

pd.set_option('display.max_colwidth', 50)


#### Load the dataset:
-- ------------------

In [2]:
#Load the data set
df = pd.read_csv('./data/German Air Fares.csv');


#### Data Understanding:
-- ----------------------

##### Basic statistics:
-- ----------------------

In [3]:
#Basic statistics
df.shape;                                           #--> (62626, 10)
df.info();df.isna().sum();                          #--> (No null objects)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62626 entries, 0 to 62625
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   departure_city           62626 non-null  object
 1   arrival_city             62626 non-null  object
 2   scrape_date              62626 non-null  object
 3   departure_date           62626 non-null  object
 4   departure_date_distance  62626 non-null  object
 5   departure_time           62626 non-null  object
 6   arrival_time             62626 non-null  object
 7   airline                  62626 non-null  object
 8   stops                    62626 non-null  object
 9   price (€)                62626 non-null  object
dtypes: object(10)
memory usage: 4.8+ MB


##### Rename columns
-- ----------------------

In [4]:
#-----------------------------------------------------------------------------------
df = df.rename(columns={'price (€)': 'price'});                              #--> Rename 'price (€)' to 'price'
df.departure_city.unique();df.arrival_city.unique();
# df['departure_city'] = df['departure_city'].astype(str).str.split().str[1]  #--> Rename departure city
# df['arrival_city'] = df['arrival_city'].astype(str).str.split().str[1]      #--> Rename arrival city


##### Understanding departure and scrape dates
-- ----------------------

In [5]:
#Departure dates
dep_dates = pd.to_datetime(df['departure_date'], format='%d.%m.%Y');
dep_dates = sorted(dep_dates.unique()); 
dep_dates = pd.Series(dep_dates);
dep_dates.shape;                                    
dep_dates.diff(periods=1).unique();                

#Scrape dates
scrape_dates = pd.to_datetime(df['scrape_date'], format='%d.%m.%Y');
scrape_dates = sorted(scrape_dates.unique()); 
scrape_dates = pd.Series(scrape_dates);
scrape_dates.shape;                                    
scrape_dates.diff(periods=1).unique();        


###### Conclusion: 
-- --------------
* **Departure dates:**
    * 42 unique departure dates
    * gaps between consecutive departure dates are irregular -> ['1 days', '11 days', '5 days', '44 days', '85 days']
* **Scrape dates:**
    * scrape dates cover only 18.10.2019 to 24.10.2019
    * gaps between consecutive scrape dates are uniform -> 1 day
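
The gap check behind these conclusions can also count how often each gap occurs, not only which gaps exist. The sketch below is a minimal illustration on made-up dates (they are not taken from the dataset); the variable names `dates`, `gaps` and `gap_counts` are assumptions introduced here.

```python
import pandas as pd

# Illustrative dates only; these are NOT rows from the dataset
dates = pd.Series(pd.to_datetime(
    ['25.10.2019', '26.10.2019', '27.10.2019', '07.11.2019', '08.11.2019'],
    format='%d.%m.%Y'))

# Gap in whole days between each date and the previous one
gaps = dates.sort_values().diff().dropna().dt.days

# How many times each gap length occurs
gap_counts = gaps.value_counts().sort_index()
print(gap_counts)
```

Applied to the notebook's `dep_dates` series, the same two lines would show whether the large gaps are isolated jumps or a recurring pattern.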

#####  Understanding departure-date distance
-- ----------------------------------------

In [6]:
df.departure_date_distance.value_counts();           
#--> 
# 6 months    12672
# 6 weeks     11222
# 1 month     10092
# 1 week       9949
# 3 month      9748
# 2 weeks      7850
# 2 week       1093

###### Conclusion: 
-- --------------
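* **Departure-date distance:**
    * labels are inconsistent: '2 week' and '2 weeks' are counted separately, and '3 month' is singular
    * scrape_date is possibly redundant given departure_date and departure_date_distance and could probably be omitted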
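
One way to test the redundancy question is to compute the actual number of days between scraping and departure and see whether each distance label maps to a single value. The value counts above also list '2 week' and '2 weeks' separately, so the sketch merges them first. This is a minimal sketch on toy rows, not the notebook's method: the `toy` frame, the `label_fix` mapping and the `days_ahead` column are assumptions made for illustration.

```python
import pandas as pd

# Toy rows mimicking the notebook's columns (values are illustrative only)
toy = pd.DataFrame({
    'scrape_date': ['18.10.2019', '18.10.2019', '19.10.2019'],
    'departure_date': ['25.10.2019', '01.11.2019', '02.11.2019'],
    'departure_date_distance': ['1 week', '2 week', '2 weeks'],
})

# Hypothetical mapping that merges the singular/plural label variants
label_fix = {'2 week': '2 weeks', '3 month': '3 months'}
toy['departure_date_distance'] = toy['departure_date_distance'].replace(label_fix)

# Days between the scrape date and the departure date
scrape = pd.to_datetime(toy['scrape_date'], format='%d.%m.%Y')
depart = pd.to_datetime(toy['departure_date'], format='%d.%m.%Y')
toy['days_ahead'] = (depart - scrape).dt.days

# Number of distinct day counts per label; all 1 would suggest that
# scrape_date adds little beyond departure_date and the label
spread = toy.groupby('departure_date_distance')['days_ahead'].nunique()
print(spread)
```

If some labels map to several different day counts on the real data, scrape_date carries extra information and dropping it would lose detail.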

##### Understanding departure and arrival times
-- ---------------------------------------------

In [7]:
#To-Do:
#Time formats are not unique:
#--> am/pm and 24 hr formats mixed ->DONE
#--> no consistent display schema ->DONE
#--> probably convert into categorical data: early-morning, morning, day, evening, night depending on time values
def parse_time(value):
    #--> values containing 'Uhr' use the 24 hr format, all others the am/pm format
    if 'Uhr' in value:
        return datetime.strptime(value, '%H:%M Uhr').time()
    return datetime.strptime(value, '%I:%M%p').time()

#Assign whole columns; df[col][i] = ... is chained assignment and may not update df
df['departure_time'] = df['departure_time'].apply(parse_time)
df['arrival_time'] = df['arrival_time'].apply(parse_time)
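
The To-Do above suggests converting the parsed times into categories (early-morning, morning, day, evening, night). A minimal sketch of such a binning follows. The hour boundaries and the `time_of_day` helper are assumptions chosen for illustration, since the notebook names the categories but does not define where they start and end.

```python
from datetime import time

def time_of_day(t: time) -> str:
    """Map a datetime.time to a coarse category (boundaries are hypothetical)."""
    h = t.hour
    if h < 5:
        return 'night'
    if h < 8:
        return 'early-morning'
    if h < 12:
        return 'morning'
    if h < 17:
        return 'day'
    if h < 22:
        return 'evening'
    return 'night'

# Illustrative times, not taken from the dataset
examples = [time(6, 30), time(9, 0), time(13, 15), time(19, 45), time(23, 5)]
labels = [time_of_day(t) for t in examples]
print(labels)
```

Once the time columns hold `datetime.time` objects, something like `df['departure_time'].apply(time_of_day)` would produce the categorical feature.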

##### Understanding Airlines
-- ---------------------------------------------

In [8]:
df.airline.value_counts();
#To-Do:
#--> Rename easyjet to Easyjet
pd.options.mode.chained_assignment = None
#Rename on the whole column instead of assigning element by element
df['airline'] = df['airline'].str.replace('easyJet', 'EasyJet', regex=False)
#Drop rows that do not name a single airline
mul = df.index[df['airline'].isin(['Mehrere Fluglinien', 'Multiple Airlines'])]
df.drop(mul, axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)
df.airline.value_counts()

airline
Lufthansa                        45912
Eurowings                        12289
EasyJet                           2935
KLM                                341
Luxair                             290
British Airways                    197
Air France                         194
Swiss International Air Lines      140
Austrian Airlines                   56
LOT-Polish Airlines                 44
Flybe                                3
SAS                                  2
Alitalia                             1
Name: count, dtype: int64

###### Conclusion: 
-- --------------
* **Airlines:**
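    * There are 69 'Mehrere Fluglinien' (German for 'multiple airlines') entries. These and the 'Multiple Airlines' entries are dropped in the cell above, which reduces the data from 62,626 to 62,404 rows.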

##### Understanding stops
---------------------------------------------

In [9]:
df.stops.value_counts();
#To-Do:
#--> Rename stops properly
def parse_stops(value):
    #--> values containing 'd' are treated as direct (0 stops), values containing '2' as 2 stops, the rest as 1 stop
    if 'd' in value:
        return 0
    if '2' in value:
        return 2
    return 1

df['stops'] = df['stops'].apply(parse_stops).astype(int)
df.stops.value_counts()

stops
0    29278
1    29123
2     4003
Name: count, dtype: int64

##### Understanding prices
--------------------------


In [10]:
df.price.value_counts();
#To-Do:
#--> format into int properly
#Strip the thousands separator from the whole column, then cast
df['price'] = df['price'].str.replace(',', '', regex=False).astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62404 entries, 0 to 62403
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   departure_city           62404 non-null  object
 1   arrival_city             62404 non-null  object
 2   scrape_date              62404 non-null  object
 3   departure_date           62404 non-null  object
 4   departure_date_distance  62404 non-null  object
 5   departure_time           62404 non-null  object
 6   arrival_time             62404 non-null  object
 7   airline                  62404 non-null  object
 8   stops                    62404 non-null  int32 
 9   price                    62404 non-null  int32 
dtypes: int32(2), object(8)
memory usage: 4.3+ MB
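
The integer cast above raises an error if any price string is not purely numeric once commas are removed. A hedged sketch of a more defensive check is shown below: `pd.to_numeric` with `errors='coerce'` turns unparsable values into NaN so they can be inspected before casting. The toy values, including the 'n/a' entry, are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy price strings; 'n/a' is a made-up example of an unparsable value
raw = pd.Series(['49', '1,024', 'n/a'])

# Remove the thousands separator, then convert; failures become NaN
numeric = pd.to_numeric(raw.str.replace(',', '', regex=False), errors='coerce')

# Rows that could not be parsed, for manual inspection
bad = raw[numeric.isna()]

# Cast only the rows that parsed cleanly
clean = numeric.dropna().astype(int)
print(bad.tolist(), clean.tolist())
```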


In [11]:
df['arrival_time']

0        07:45:00
1        07:55:00
2        08:00:00
3        07:30:00
4        08:10:00
           ...   
62399    16:05:00
62400    11:40:00
62401    08:05:00
62402    16:10:00
62403    07:25:00
Name: arrival_time, Length: 62404, dtype: object