# Flugpreis Vorhersage - Kaufen oder Warten?
## Projektarbeit Data Mining
___
### Wintersemester 2021/22
### Gruppe G:
Max Grundmann - s0559326
### Inhalte
1. Problemanalyse
2. Explorative Datenanalyse
3. Weitere Features
4. Praktische Überlegungen
___
## 1. Datenvorbereitung

In [14]:
import pandas as pd
import numpy as np
import os
from sklearn.impute import SimpleImputer
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [15]:
dirname = os.getcwd()
filename = os.path.join(dirname, '../Data/train_set.csv')

data = pd.read_csv(filename, index_col=0)



In [16]:
data.head(20)

Unnamed: 0_level_0,Request_Date,Flight_Date,Departure_hour,flight_unique_id,route_abb,Price_In_Eur,min_future_price_in_Eur,buy
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,2019-06-03T11:00:00Z,2019-06-05,19,2019-06-05 FR 146,SXF-STN,208.07,259.07,1
2,2019-06-03T23:00:00Z,2019-06-05,19,2019-06-05 FR 146,SXF-STN,259.07,259.07,1
3,2019-06-04T11:00:00Z,2019-06-05,19,2019-06-05 FR 146,SXF-STN,259.07,259.07,1
4,2019-06-04T23:00:00Z,2019-06-05,19,2019-06-05 FR 146,SXF-STN,259.07,259.07,1
5,2019-06-03T11:00:00Z,2019-06-05,21,2019-06-05 FR 147,STN-SXF,143.86,251.72,1
6,2019-06-03T23:00:00Z,2019-06-05,21,2019-06-05 FR 147,STN-SXF,252.06,251.72,0
7,2019-06-04T11:00:00Z,2019-06-05,21,2019-06-05 FR 147,STN-SXF,251.72,251.72,1
8,2019-06-04T23:00:00Z,2019-06-05,21,2019-06-05 FR 147,STN-SXF,251.72,251.72,1
9,2019-06-03T11:00:00Z,2019-06-05,22,2019-06-05 FR 8545,SXF-STN,22.17,22.17,1
10,2019-06-03T23:00:00Z,2019-06-05,22,2019-06-05 FR 8545,SXF-STN,22.17,28.55,1


In [17]:
def prep_data(data):
    # Datentypen ändern
    data['Flight_Date'] = pd.to_datetime(data['Flight_Date'])
    data['Request_Date'] = pd.to_datetime(data['Request_Date'])
    
    # One Hot Encoding für Routen-Bezeichnungen
    data = pd.get_dummies(data,prefix=['route'], columns = ['route_abb'], drop_first=False)
    
    # Flag, wenn die Anfrage die letzte Anfrage vor dem Flug ist
    is_last_request = pd.DataFrame(data.groupby('flight_unique_id')['Request_Date'].max()).reset_index()
    is_last_request['is_last_request'] = 1

    data = data.merge(is_last_request, 
                      left_on=['flight_unique_id', 'Request_Date'], 
                      right_on=['flight_unique_id', 'Request_Date'], 
                      how='left')
    data['is_last_request'] = data['is_last_request'].fillna(0)
    data['is_last_request'] = data['is_last_request'].astype(int)
    
    # Anzahl der bisherigen Requests als Feature hinzufügen
    data = data.sort_values(['flight_unique_id', 'Request_Date'])
    unique_flights = data['flight_unique_id'].unique()

    requests_counter = 0
    flight_id_index = 0
    current_flight = unique_flights[flight_id_index]
    number_of_requests_per_row = []

    for index, row in  data.iterrows():
        if row['flight_unique_id'] != current_flight:       
            flight_id_index += 1
            current_flight = unique_flights[flight_id_index]
            requests_counter = 0
        number_of_requests_per_row.append(requests_counter)
        requests_counter += 1

    data['previous_requests'] = number_of_requests_per_row
    
    # Datumsfelder in einzelne Bestandteile zerlegen
    data['flight_weekday'] = data['Flight_Date'].dt.weekday
    data['flight_day'] = data['Flight_Date'].dt.day
    data['flight_month'] = data['Flight_Date'].dt.month 
    data['flight_is_weekend'] = data['flight_weekday'] >= 5

    data['request_weekday'] = data['Request_Date'].dt.weekday
    data['request_day'] = data['Request_Date'].dt.day
    data['request_month'] = data['Request_Date'].dt.month
    data['request_is_weekend'] = data['request_weekday'] >= 5
    
    data['request_hour'] = data['Request_Date'].dt.hour
    
    # Cyclische Features in Sinus und Cosinus Repräsentation umwandeln
    # Quelle: https://www.mikulskibartosz.name/time-in-machine-learning/
    def encode(data, col, max_val):
        data[col + '_sin'] = np.sin(2 * np.pi * data[col]/max_val)
        data[col + '_cos'] = np.cos(2 * np.pi * data[col]/max_val)
        return data

    data = encode(data, 'request_weekday', 7)
    data = encode(data, 'request_month', 12)
    data = encode(data, 'request_day', 365)
    data = encode(data, 'request_hour', 24)

    data = encode(data, 'flight_weekday', 7)
    data = encode(data, 'flight_month', 12)
    data = encode(data, 'flight_day', 365)
    data = encode(data, 'Departure_hour', 24)
    
    # Tage bis zum Flug berechnen
    data['Request_Date_w/o_Time'] = pd.to_datetime(data['Request_Date']).dt.date
    data['days_remaining'] = (pd.to_datetime(data['Flight_Date']).dt.date - data['Request_Date_w/o_Time']).dt.days
    data.drop(['Request_Date_w/o_Time'],1, inplace=True)
    
    # Relevante Feiertage im Zeitraum der Daten, die in Berlin und oder Frankfurt gelten
    # sowie Public Holidays in Großbritannien. 
    feiertage = {
        '2019-06-09':'Pfingstsonntag',
        '2019-06-10':'Pfingstmontag',
        '2019-06-20':'Fronleichnam',
        '2019-06-20':'Schulferien Beginn',
        '2019-08-02':'Schulferien Ende',
        '2019-08-26':'Summer Bank Holidays',
        '2019-07-15':'School Summer Holidays Beginn',
        '2019-09-06':'School Summer Holidays End'}

    feiertage_df = pd.DataFrame(feiertage.items(), columns=['Datum_Feiertag', 'Feiertag_Bezeichnung'])
    feiertage_df['Datum_Feiertag'] = pd.to_datetime(feiertage_df['Datum_Feiertag'])
    
    from datetime import datetime

    day_diff_list = []
    for index, row in feiertage_df.iterrows():
        day_diff_list.append(abs((data['Flight_Date'] - row['Datum_Feiertag']).dt.days))
        
    feiertage_diff_df = pd.concat(day_diff_list, axis=1)
    feiertage_diff_df = feiertage_diff_df.min(axis=1)
    feiertage_diff_df = feiertage_diff_df.reset_index().drop('index', 1)
    feiertage_diff_df.columns = ['Days_Untill_Event']

    data = pd.concat([data, feiertage_diff_df], axis=1)

    # Features skalieren
    scaler = MinMaxScaler()
    columns_to_be_scaled = ['Price_In_Eur', 
                         'min_future_price_in_Eur', 
                         'days_remaining', 
                         'previous_requests', 
                         'Days_Untill_Event']
    to_be_scaled = data[columns_to_be_scaled]
    scaled = pd.DataFrame(scaler.fit_transform(to_be_scaled), columns=columns_to_be_scaled)
    data = pd.concat([data.reset_index(), scaled.reset_index()], axis=1)
    
    # Nicht mehr benötigte Spalten entfernen
    data.drop(['Request_Date', 
               'Flight_Date', 
               'Price_In_Eur', 
               'min_future_price_in_Eur', 
               'index', 
               'Departure_hour', 
               'flight_weekday', 
               'flight_day', 
               'flight_month', 
               'request_weekday', 
               'request_day', 
               'request_month', 
               'days_remaining',
               'request_hour', 
               'Days_Untill_Event',
               'previous_requests',
               'flight_unique_id'], inplace=True, axis=1)
    
    # Boolean in Int umwandeln
    data['request_is_weekend'] = data['request_is_weekend'].astype(int)
    data['flight_is_weekend'] = data['flight_is_weekend'].astype(int)
    
    return data

In [18]:
data = prep_data(data)

  data.drop(['Request_Date_w/o_Time'],1, inplace=True)
  feiertage_diff_df = feiertage_diff_df.reset_index().drop('index', 1)


In [19]:
data

Unnamed: 0,buy,route_FRA-STN,route_STN-FRA,route_STN-SXF,route_SXF-STN,is_last_request,flight_is_weekend,request_is_weekend,request_weekday_sin,request_weekday_cos,...,request_hour_sin,request_hour_cos,flight_weekday_sin,flight_weekday_cos,flight_month_sin,flight_month_cos,flight_day_sin,flight_day_cos,Departure_hour_sin,Departure_hour_cos
0,1,0,0,0,1,0,0,0,0.000000,1.000000,...,0.258819,-0.965926,0.974928,-0.222521,1.224647e-16,-1.000000e+00,0.085965,0.996298,-0.965926,0.258819
1,1,0,0,0,1,0,0,0,0.000000,1.000000,...,-0.258819,0.965926,0.974928,-0.222521,1.224647e-16,-1.000000e+00,0.085965,0.996298,-0.965926,0.258819
2,1,0,0,0,1,0,0,0,0.781831,0.623490,...,0.258819,-0.965926,0.974928,-0.222521,1.224647e-16,-1.000000e+00,0.085965,0.996298,-0.965926,0.258819
3,1,0,0,0,1,1,0,0,0.781831,0.623490,...,-0.258819,0.965926,0.974928,-0.222521,1.224647e-16,-1.000000e+00,0.085965,0.996298,-0.965926,0.258819
4,1,0,0,1,0,0,0,0,0.000000,1.000000,...,0.258819,-0.965926,0.974928,-0.222521,1.224647e-16,-1.000000e+00,0.085965,0.996298,-0.707107,0.707107
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
83619,1,0,0,0,1,0,0,0,0.433884,-0.900969,...,0.258819,-0.965926,0.781831,0.623490,-1.000000e+00,-1.836970e-16,0.171293,0.985220,0.500000,-0.866025
83620,0,0,0,0,1,0,0,0,0.433884,-0.900969,...,-0.258819,0.965926,0.781831,0.623490,-1.000000e+00,-1.836970e-16,0.171293,0.985220,0.500000,-0.866025
83621,0,0,0,0,1,0,0,0,-0.433884,-0.900969,...,0.258819,-0.965926,0.781831,0.623490,-1.000000e+00,-1.836970e-16,0.171293,0.985220,0.500000,-0.866025
83622,1,0,0,0,1,0,0,0,-0.433884,-0.900969,...,-0.258819,0.965926,0.781831,0.623490,-1.000000e+00,-1.836970e-16,0.171293,0.985220,0.500000,-0.866025


In [20]:
# Daten in Trainings- und Testdaten aufteilen
y = data['buy']
X = data.drop(['buy'], 1)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99, stratify=y)

  X = data.drop(['buy'], 1)


In [42]:
from random import sample
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

In [None]:
def evaluate_model(model, x_train, y_train, x_test, y_test):
    model = model.fit(x_train, y_train)
    print('Training Score:', model.score(x_train, y_train))
    
    # Predictions on the test dataset
    predicted = pd.DataFrame(model.predict(x_test))
    # Probabilities on the test dataset
    probs = pd.DataFrame(model.predict_proba(x_test))
    print('Test Score:', metrics.accuracy_score(y_test, predicted))
    
    print(metrics.classification_report(y_test, predicted))
    return model

In [None]:
list_of_models = [RandomForestClassifier(), 
                  #SVC(), 
                  tree.DecisionTreeClassifier(max_depth=10), 
                  LogisticRegression(), 
                  KNeighborsClassifier(n_neighbors=5), 
                  GaussianNB()]

for model in list_of_models:
    evaluate_model(model, x_train, y_train, x_test, y_test)

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
scores = cross_val_score(RandomForestClassifier(), x_train, y_train,
 scoring="neg_mean_squared_error", cv=10)

tree_rmse_scores = np.sqrt(-scores)

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from numpy import mean
from numpy import std

In [None]:
cv = KFold(n_splits=10, random_state=1, shuffle=True)

model = RandomForestClassifier()

scores = cross_val_score(model, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)

print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rf = RandomForestClassifier()

# rf_random = RandomizedSearchCV(estimator = rf, 
#                                param_distributions = random_grid, 
#                                n_iter = 100, 
#                                cv = 3, 
#                                verbose=2, 
#                                random_state=42, 
#                                n_jobs = -1)

# rf_random.fit(x_train, y_train)
# rf_random.best_params_

In [None]:
random_grid = {'n_estimators': [10, 30, 100, 200],
               'max_features': ['auto', 'sqrt']}
#                'max_depth': max_depth,
#                'min_samples_split': min_samples_split,
#                'min_samples_leaf': min_samples_leaf,
#                'bootstrap': bootstrap}

rf = RandomForestClassifier()

rf_random = RandomizedSearchCV(estimator = rf, 
                               param_distributions = random_grid, 
                               n_iter = 100, 
                               cv = 5, 
                               verbose=2, 
                               random_state=42, 
                               n_jobs = -1)

rf_random.fit(x_train, y_train)
rf_random.best_params_

In [None]:
cv = KFold(n_splits=10, random_state=1, shuffle=True)

model = RandomForestClassifier(n_estimators= 200, 
                               max_features= 'sqrt')

scores = cross_val_score(model, x_train.drop(['request_is_weekend', 
                                              'request_month_sin', 
                                              'request_month_cos'],1), y_train, scoring='accuracy', cv=cv, n_jobs=-1)

model = model.fit(x_train, y_train)

In [None]:
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

# Predictions on the test dataset
predicted = pd.DataFrame(model.predict(x_test))
# Probabilities on the test dataset
probs = pd.DataFrame(model.predict_proba(x_test))
print('Test Score:', metrics.accuracy_score(y_test, predicted))

print(metrics.classification_report(y_test, predicted))
print(model.feature_importances_)

In [None]:
from matplotlib import pyplot as plt
from matplotlib.pyplot import figure
figure(figsize=(10, 10), dpi=80)

sorted_idx = model.feature_importances_.argsort()
plt.barh(x_train.columns[sorted_idx], model.feature_importances_[sorted_idx])
plt.xlabel("Random Forest Feature Importance")

___

Durchschnittlich liegen zwischen dem Abfragedatum und dem Flug 38 Tage. Der größte Abstand beträgt 99 Tage und der geringste einen Tag. 

## 3. Neural Network

In [62]:
from tensorflow import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.callbacks import EarlyStopping

In [65]:
n_cols = x_train.shape[1]

# Set up the model: model
model = Sequential()

# Add the first layer
model.add(Dense(1000, activation='relu', input_shape=(n_cols,)))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='softmax'))

opt = keras.optimizers.SGD(learning_rate=0.1)

model.compile(
    optimizer=opt, 
    loss='categorical_crossentropy', 
    metrics=['accuracy'])

early_stopping_monitor = EarlyStopping(patience=3)

model.fit(
    X, 
    y, 
    validation_split=0.3, 
    epochs=15, 
    callbacks=[early_stopping_monitor])

Epoch 1/15
Epoch 2/15
Epoch 3/15


<keras.callbacks.History at 0x7f92c10b62b0>

In [56]:
import matplotlib.pyplot as plt

plt.plot(model.history['val_loss'], 'r')
plt.xlabel('Epochs')
plt.ylabel('Validation score')
plt.show()

TypeError: 'History' object is not subscriptable

In [40]:
pred = model.predict(x_test)

In [43]:
print(metrics.classification_report(y_test, pred))

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

## 3. Feature Engineering

Für eine optimale Vorhersage wären natürlich Informationen zur Anzahl tatsächlich verfügbarer Sitzplätze oder detailiertere Informationen zu verkauften Ticket-Typen (Business Class, Economy Class, Geschäftsreisende, etc. - wenn auch vermutlich auf kürzeren Strecken wie denen im Datensatz weniger relevant) hilfreich, die jedoch typischerweise nicht öffentlich verfügbar sind (vgl. Manolis Papadakis 2021: Predicting Airfare Prices, S. 1. http://cs229.stanford.edu/proj2012/Papadakis-PredictingAirfarePrices.pdf Letzter Zugriff: 06.12.2021).

In [None]:
data.skew()

Die Verteilung der Features (mit Ausnahme der Preisbasierten Spalten) zeigt mit Werten zwischen -0,5 und 0,5 keine Hinweise auf deutlich asymetrische Verteilungen (vgl. https://medium.com/@atanudan/kurtosis-skew-function-in-pandas-aa63d72e20de) . Grundsätzlich besteht bei den vorhandenen Features keine Notwendigkeit zur Normalisierung. Nur kategorische Werte müssen für die Verwendung in den Modellen noch codiert werden.

Über die bestehenden Features hinaus werden folgende Features für die Vorhersage ergänzt:

- Wochentag
- Tage bis zum Flug
- Tage bis zum nächsten Feiertag

Ob diese Features einen tatsächlichen Mehrwert für die Vorhersage liefern, muss im zweiten Teil der Projektarbeit ermittelt werden.