# Flight Price Prediction - Buy or Wait?
## Data Mining Project
___
### Winter Semester 2021/22
### Group G:
Max Grundmann - s0559326
### Contents
1. Exploratory Data Analysis (EDA)
2. Data Preparation
3. Model Selection
### 4. Evaluation
4.1 Monetary Quality Measure <br>
4.2 Test Dataset
___
This notebook evaluates the trained classifiers and applies them to the test data.
___

## 4.1. Monetary Quality Measure

##### Imports

In [1]:
import pandas as pd
import pickle
import os
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

from keras.models import load_model

Load the classifiers

In [4]:
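# Open the pickle file in a context manager so the file handle is closed again.
with open('../Models/final_models/random_forest_v4.pkl', 'rb') as model_file:
    rf_model = pickle.load(model_file)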
nn_model = load_model('../Models/final_models/nn_v5.h5')

Load the training data

In [5]:
n = 15

dirname = os.getcwd()
filename = os.path.join(dirname, f'../Data/prepped/train_set_n{n}.csv')

try:
    # The 'buy' column is kept here: the ground-truth baseline below needs it.
    original_data = pd.read_csv('../Data/raw/train_set.csv')

    train_data = pd.read_csv(filename, index_col=0)
    X = train_data.drop(columns=['buy'])
except FileNotFoundError:
    print('File could not be found.')
    raise

X.columns = X.columns.map(str)
# Note: fillna(0) runs first, so the loop below no longer finds missing values.
# The order is kept unchanged to match the inputs used for the reported scores.
X = X.fillna(0)

for i in range(n):
    X[str(i)] = np.where(X[str(i)].isnull(), X.Price_In_Eur, X[str(i)])
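The loop above appears intended to replace missing lag prices with the row's current price, but because `fillna(0)` runs before it, the loop finds nothing left to replace. The following is a minimal sketch of how the order affects the result. The miniature dataframe and the helper name `fill_lags_with_current_price` are hypothetical, and it is an assumption that the lag columns `'0'`, `'1'`, ... are meant to receive the current price.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the prepared data: two lag columns ('0', '1')
# plus the current price. NaN marks a lag without an earlier observation.
toy = pd.DataFrame({
    'Price_In_Eur': [100.0, 120.0, 90.0],
    '0': [np.nan, 100.0, 120.0],
    '1': [np.nan, np.nan, 100.0],
})

def fill_lags_with_current_price(df, n_lags):
    """Replace missing lag prices with the row's current price."""
    filled = df.copy()
    for i in range(n_lags):
        col = str(i)
        filled[col] = filled[col].fillna(filled['Price_In_Eur'])
    # Any NaN left in other columns is set to 0 afterwards.
    return filled.fillna(0)

filled = fill_lags_with_current_price(toy, n_lags=2)

# Order used in the cell above: zeros are written first,
# so missing lags end up as 0 instead of the current price.
zero_first = toy.fillna(0)
```

Switching to the first variable would change the model inputs, so the models would have to be retrained on data prepared the same way.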

Predictions for the training data

In [14]:
rf_prediction = rf_model.predict(X)
nn_prediction = nn_model.predict(X)

# The NN outputs values between 0 and 1 that still have to be discretized.
def zero_or_one(x):
    return 1 if x > 0.5 else 0

vec = np.vectorize(zero_or_one)

# Flatten the (n_samples, 1) Keras output to one label per row.
nn_prediction = vec(nn_prediction).ravel()
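The same discretization can be written as a single NumPy comparison. This is a minimal sketch with hypothetical probabilities; it assumes the network has one sigmoid output unit, so that `predict` returns an array of shape `(n_samples, 1)`.

```python
import numpy as np

# Hypothetical sigmoid outputs with the (n_samples, 1) shape that
# a single output unit produces.
probabilities = np.array([[0.12], [0.50], [0.51], [0.97]])

# Strictly greater than 0.5 maps to 1, everything else to 0,
# matching zero_or_one; ravel() flattens to one label per row.
labels = (probabilities > 0.5).astype(int).ravel()
print(labels)  # [0 0 1 1]
```

A value of exactly 0.5 is mapped to 0, just as in `zero_or_one`.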

Implementation of Prof. Spott's monetary quality measure

In [7]:
# The model quality evaluation function expects a Pandas dataframe with at least the following columns:
# Request_Date          int64
# flight_unique_id     object
# Price_In_Eur        float64
# buy                    bool

def model_quality_evaluation(df_input):
    # Make a copy of the provided dataframe as to not modify the original.
    df = df_input.copy()

    # Convert 'Price_In_Eur' to whole cents and store as integers to avoid floating point errors.
    df['Price_In_Eur'] = df['Price_In_Eur'] * 100
    df['Price_In_Eur'] = df['Price_In_Eur'].astype(int)

    # Initialize a variable that stores the sum of all our balances.
    sum_balances = 0

    # Get a list of all 'flight_unique_id'.
    flight_unique_ids = df['flight_unique_id'].unique()

    # Iterate over all 'flight_unique_id'.
    for flight_unique_id in flight_unique_ids:
        # Get a subset of the data for the specified 'flight_unique_id'.
        df_subset = df[df['flight_unique_id'] == flight_unique_id]

        # Get all request dates except for the latest request date before departure.
        # At the latest request date before departure we need to buy a ticket anyway,
        # so we don't care about this specific request date.
        request_dates = df_subset[df_subset['Request_Date'] != df_subset['Request_Date'].max()]

        # Make sure request dates are sorted in descending order.
        # Reassigning instead of sorting in place avoids pandas' SettingWithCopyWarning.
        request_dates = request_dates.sort_values(by='Request_Date', ascending=False)

        # Get the ticket price from the latest request date before departure,
        # because we certainly have to buy a ticket at this date.
        last_buying_price = df_subset[df_subset['Request_Date'] == df_subset['Request_Date'].max()]['Price_In_Eur'].values[0]

        # Iterate over the remaining request dates
        for _, row in request_dates.iterrows():
            # and check whether the model wants to buy a ticket at the specific request date.
            if row['buy']:
                # If the model decides to buy a ticket the last buying price is set to the
                # price point of this request date and the balance doesn't change.
                last_buying_price = row['Price_In_Eur']
            else:
                # If the model decides not to buy a ticket the balance equals
                # the current ticket price minus the last buying price.
                current_price = row['Price_In_Eur']
                balance = current_price - last_buying_price

                # The balance is added to the sum of all balances.
                sum_balances = sum_balances + balance

    # Return the sum of all our previously calculated balances.
    return sum_balances / 100
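To make the measure concrete, here is a compact re-implementation applied to a single hypothetical flight. The name `monetary_score` and the toy data are illustrative assumptions, not part of the provided function. Unlike the function above, the sketch rounds to whole cents instead of truncating and assumes exactly one row per request date and flight.

```python
import pandas as pd

def monetary_score(df):
    """Sum of (current price - last buying price) over all 'wait' decisions."""
    total_cents = 0
    for _, flight in df.groupby('flight_unique_id'):
        # Walk from the latest request date back to the earliest one.
        flight = flight.sort_values('Request_Date', ascending=False)
        cents = (flight['Price_In_Eur'] * 100).round().astype(int).tolist()
        buys = flight['buy'].astype(bool).tolist()

        # At the latest request date a ticket has to be bought anyway.
        last_buying_price = cents[0]
        for price, buy in zip(cents[1:], buys[1:]):
            if buy:
                last_buying_price = price
            else:
                total_cents += price - last_buying_price
    return total_cents / 100

# Hypothetical flight with four request dates.
toy = pd.DataFrame({
    'flight_unique_id': ['F1'] * 4,
    'Request_Date': [1, 2, 3, 4],
    'Price_In_Eur': [100.0, 120.0, 90.0, 110.0],
    'buy': [False, False, True, False],
})

print(monetary_score(toy))                    # 40.0
print(monetary_score(toy.assign(buy=False)))  # -20.0
```

Buying at the cheapest date (90 EUR) turns the two earlier "wait" decisions into gains of 30 EUR and 10 EUR. Never buying before the last date (110 EUR) yields balances of -20, 10 and -10 EUR, which sum to -20 EUR.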

Ground Truth Baseline

In [8]:
gt_score = model_quality_evaluation(original_data)



To obtain the format required for computing the monetary quality measure, the predictions are written into the `buy` column of a copy of the original dataframe, because the training data only contains the encoded columns.

In [28]:
df_with_rf_pred = original_data.copy()
df_with_nn_pred = original_data.copy()
df_with_rf_pred['buy'] = list(rf_prediction)
df_with_nn_pred['buy'] = list(nn_prediction)

df_with_rf_pred

| Unnamed: 0 | index | Request_Date | Flight_Date | Departure_hour | flight_unique_id | route_abb | Price_In_Eur | min_future_price_in_Eur | buy |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2019-06-03T11:00:00Z | 2019-06-05 | 19 | 2019-06-05 FR 146 | SXF-STN | 208.07 | 259.07 | 0 |
| 1 | 2 | 2019-06-03T23:00:00Z | 2019-06-05 | 19 | 2019-06-05 FR 146 | SXF-STN | 259.07 | 259.07 | 0 |
| 2 | 3 | 2019-06-04T11:00:00Z | 2019-06-05 | 19 | 2019-06-05 FR 146 | SXF-STN | 259.07 | 259.07 | 0 |
| 3 | 4 | 2019-06-04T23:00:00Z | 2019-06-05 | 19 | 2019-06-05 FR 146 | SXF-STN | 259.07 | 259.07 | 1 |
| 4 | 5 | 2019-06-03T11:00:00Z | 2019-06-05 | 21 | 2019-06-05 FR 147 | STN-SXF | 143.86 | 251.72 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83619 | 83620 | 2019-08-01T11:00:00Z | 2019-09-10 | 10 | 2019-09-10 FR 8543 | SXF-STN | 35.69 | 39.69 | 1 |
| 83620 | 83621 | 2019-08-01T23:00:00Z | 2019-09-10 | 10 | 2019-09-10 FR 8543 | SXF-STN | 46.83 | 39.69 | 0 |
| 83621 | 83622 | 2019-08-02T11:00:00Z | 2019-09-10 | 10 | 2019-09-10 FR 8543 | SXF-STN | 46.83 | 39.69 | 0 |
| 83622 | 83623 | 2019-08-02T23:00:00Z | 2019-09-10 | 10 | 2019-09-10 FR 8543 | SXF-STN | 39.69 | 39.69 | 1 |
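Assigning the predictions to the `buy` column relies on the predictions being in the same row order as the original dataframe and on both having the same length. A minimal sketch of a guarded version follows; the helper name `attach_predictions` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def attach_predictions(original, predictions):
    """Return a copy of `original` whose 'buy' column holds `predictions`."""
    # Flatten so that (n_samples, 1) model outputs are accepted as well.
    flat = np.asarray(predictions).ravel()
    if len(flat) != len(original):
        raise ValueError(f'{len(flat)} predictions for {len(original)} rows')
    result = original.copy()
    result['buy'] = flat.astype(bool)
    return result

# Hypothetical three-row stand-in for the original training data.
toy_original = pd.DataFrame({
    'flight_unique_id': ['F1', 'F1', 'F1'],
    'Request_Date': [1, 2, 3],
    'Price_In_Eur': [100.0, 120.0, 90.0],
    'buy': [True, False, True],
})
toy_with_pred = attach_predictions(toy_original, np.array([[0], [1], [1]]))
```

The length check catches the case where rows were dropped during data preparation, which would otherwise shift every prediction to the wrong request date. It cannot detect reordered rows.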


In [11]:
# Random Forest v4 model
rf_score = model_quality_evaluation(df_with_rf_pred)



In [16]:
# Neural Network v5 model
nn_score = model_quality_evaluation(df_with_nn_pred)



In [17]:
print(f"Ground Truth Baseline: {gt_score:.2f}")
print(f"Random Forest v4: {rf_score:.2f}")
print(f"Neural Net v5: {nn_score:.2f}")
print("------------------------")
print(f"Difference RF: {rf_score - gt_score:.2f}")
print(f"Difference NN: {nn_score - gt_score:.2f}")

Ground Truth Baseline: 1388860.66
Random Forest v4: 1279392.33
Neural Net v5: 1244365.11
------------------------
Difference RF: -109468.33
Difference NN: -144495.55


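So while the neural network delivers slightly better accuracy values, it loses to the Random Forest classifier on the monetary quality measure: the Random Forest falls about 109,468 EUR short of the ground-truth baseline, the neural network about 144,496 EUR.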
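The gap to the baseline can also be expressed as a share of the ground-truth score. The sketch below copies the three scores from the output above. The helper `share_of_baseline` is hypothetical, and reading the ground-truth score as the best achievable value is an assumption.

```python
# Scores copied from the printed output above (in EUR).
gt_score = 1388860.66
rf_score = 1279392.33
nn_score = 1244365.11

def share_of_baseline(score, baseline):
    """Fraction of the ground-truth score that a model reaches."""
    return score / baseline

rf_share = share_of_baseline(rf_score, gt_score)
nn_share = share_of_baseline(nn_score, gt_score)

print(f"Random Forest v4: {rf_share:.1%}")
print(f"Neural Net v5: {nn_share:.1%}")
```

A relative figure like this makes results comparable across datasets of different size, where the absolute sums are not.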

## 4.2. Test Dataset

##### Load the test data

In [25]:
n = 15

dirname = os.getcwd()
filename = os.path.join(dirname, f'../Data/prepped/test_set_n{n}.csv')

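try:
    test_data = pd.read_csv(filename, index_col=0)
    X = test_data
except FileNotFoundError:
    print('File could not be found.')
    raise

X.columns = X.columns.map(str)
# Note: fillna(0) runs first, so the loop below no longer finds missing values.
# The order is kept identical to the preparation of the training data.
X = X.fillna(0)

for i in range(n):
    X[str(i)] = np.where(X[str(i)].isnull(), X.Price_In_Eur, X[str(i)])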

Predictions for the test dataset

In [26]:
prediction = rf_model.predict(X)
prediction = pd.Series(prediction)
prediction.to_csv('../Predictions/rf_pred.csv')
prediction

0       0
1       0
2       0
3       0
4       0
       ..
5578    1
5579    1
5580    1
5581    1
5582    1
Length: 5583, dtype: int64

In [27]:
prediction = nn_model.predict(X)
prediction = vec(prediction)
prediction = pd.DataFrame(prediction)[0]
prediction.to_csv('../Predictions/nn_pred.csv')
prediction

0       1
1       0
2       0
3       0
4       1
       ..
5578    1
5579    1
5580    1
5581    1
5582    1
Name: 0, Length: 5583, dtype: int32
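The exported files contain an unnamed index column and a value column named `0`. As a minimal sketch, the predictions could be written with explicit column names so the file is easier to read back. The file name `example_pred.csv`, the column names `row` and `buy`, and the sample labels are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical labels standing in for a model's test predictions.
predictions = pd.Series([0, 0, 1, 1], name='buy')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_pred.csv')
    # index_label names the index column; the Series name becomes the header.
    predictions.to_csv(path, index_label='row')
    # Read the file back to confirm that the labels survive the round trip.
    restored = pd.read_csv(path, index_col='row')['buy']
```

Naming the columns avoids the `Unnamed: 0` column that pandas creates when a file with an unlabeled index is read back.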