<div style="text-align: center"><h1 style="text-decoration: underline;">DSML Project</h1></div>



This is the official Notebook of the DSML Project from Marc Rennefort, Kilian Lipinsky, Timo Hagelberg, Jan Behrendt-Emden and Paul Severin. In order to create this Project we used the following dataset: https://data.cityofchicago.org/Transportation/Transportation-Network-Providers-Trips-2023-2024-/n26f-ihde/about_data

In [None]:
#Note all your imports here

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
%matplotlib inline
from sklearn.tree import DecisionTreeClassifier
import seaborn as sns
from sklearn.linear_model import LinearRegression
from datetime import datetime
from sklearn.model_selection import train_test_split
from meteostat import Hourly, Point
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score, precision_score, f1_score


<h4 style="text-decoration: underline;">1. Business Understanding</h4>
The given dataset contains all trips in 2023-2024 reported to the City of Chicago by rideshare companies such as Uber. With this data we want to train a model that helps us to predict the tip in ride-hailing trips as accurately as possible. This could be interesting for Uber drivers who want to plan their rides on routes and days that yield the highest tips. In addition, it could help Uber as a company to make appropriate sales forecasts. For our prediction we want to include the following features: travel time, distance, fare amount, weather conditions, and whether the customer shared the ride. At this point it should be said that the tip values are not exact because they are rounded to the nearest $1.00, and only digital tips are included in the dataset, as tips paid in cash are not tracked.


<h4 style="text-decoration: underline;">2. Data Understanding and Data Preparation</h4>

<h4 style="text-decoration: underline;">2.1 Some Basic Data Preparation</h4>
In the first step we do some basic data preparation and data understanding: we load our dataset with the columns we need, drop all rows with null values, and convert our timestamps to datetime format.

In [None]:
#Loading our dataset with the columns we need
data_cleaned = pd.read_csv('Data/Chicago_RideHailing_Data.csv', usecols= ['Trip End Timestamp', 'Trip Seconds', 'Trip Miles', 'Tip', 'Trip Total','Dropoff Centroid Latitude', 'Dropoff Centroid Longitude', 'Shared Trip Authorized', 'Shared Trip Match'])

#We can't use the Total including Tip, so we will calculate the cost (Fare + other Charges) as Total - Tip
data_cleaned['Cost'] = data_cleaned['Trip Total'] - data_cleaned['Tip']
data_cleaned = data_cleaned.drop(columns=['Trip Total'])

In [None]:
#Get some basic understanding of our data
print('Null Values: ', data_cleaned.isnull().sum())
data_cleaned.info()
data_cleaned.head()

In [None]:
#Drop all rows with null values
data_cleaned = data_cleaned.dropna(axis = 0)

In [None]:
#Changing our timestamp to datetime format
data_cleaned['Trip End Timestamp'] = pd.to_datetime(data_cleaned['Trip End Timestamp'],  format='%m/%d/%Y %I:%M:%S %p', errors='coerce')

In [None]:
#Check if everything worked correctly
print('Null Values: ', data_cleaned.isnull().sum())
data_cleaned.info()

<h4 style="text-decoration: underline;">2.2 Including weather data</h4>
In order to add the weather data we need to group our data, because otherwise we would run into runtime issues if we made API calls for nearly 25 million rows. This should be fine for our prediction purposes because there won't be large differences in temperature or rain if we round the coordinates to the second decimal place.
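
To get a feeling for how coarse this rounding is, the following sketch estimates the size of one grid cell. The constant of roughly 111.32 km per degree of latitude and the latitude 41.88 for central Chicago are assumptions made for illustration, and the helper `grid_cell_size_km` is hypothetical.

```python
import math

KM_PER_DEGREE_LATITUDE = 111.32  # approximate length of one degree of latitude (assumption)

def grid_cell_size_km(latitude_deg, decimals=2):
    """Approximate north-south and east-west size (in km) of one rounding cell."""
    step = 10 ** (-decimals)
    north_south = step * KM_PER_DEGREE_LATITUDE
    # Degrees of longitude shrink with the cosine of the latitude
    east_west = step * KM_PER_DEGREE_LATITUDE * math.cos(math.radians(latitude_deg))
    return north_south, east_west

# 41.88 is an assumed latitude for central Chicago
ns_km, ew_km = grid_cell_size_km(41.88)
print(f"One cell is roughly {ns_km:.2f} km x {ew_km:.2f} km")
```

Under these assumptions a cell is about one kilometre wide, which supports the idea that weather differences inside a cell are small.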

In [None]:
#Round the Latitude and Longitude by the second decimal place and insert it in a new column
data_cleaned["Latitude rounded"] =  data_cleaned["Dropoff Centroid Latitude"].round(2)
data_cleaned["Longitude rounded"] = data_cleaned["Dropoff Centroid Longitude"].round(2)

#Group the data by Latitude and Longitude
data_grouped = data_cleaned.groupby(["Latitude rounded", "Longitude rounded"])["Trip End Timestamp"].agg(["min", "max"]).reset_index()
data_grouped.head()

In [None]:
weather_list = []
for i in range(len(data_grouped)):
    #Initialise variables
    latitude = data_grouped["Latitude rounded"].iloc[i]
    longitude = data_grouped["Longitude rounded"].iloc[i]
    location = Point(latitude, longitude)
    timestamp_min = data_grouped["min"].iloc[i]
    timestamp_max = data_grouped["max"].iloc[i]
    
    #Round the min and max timestamps down to the full hour in order to extract the weather data correctly
    timestamp_min_rounded = timestamp_min.replace(minute = 0, second = 0) 
    timestamp_max_rounded = timestamp_max.replace(minute = 0, second = 0)
    
    #Extract the weather data per location
    weather = Hourly(location, timestamp_min_rounded, timestamp_max_rounded).fetch()

    #Merge the extracted weather data with the fitting timestamps and locations
    for j in range(len(weather)):
       weather_list.append({"Timestamp": weather.index[j], "Latitude rounded": latitude, "Longitude rounded": longitude, "Temperature": weather["temp"].iloc[j], "Rain in mm": weather["prcp"].iloc[j]})

#Convert the list to a DataFrame
weather_data = pd.DataFrame(weather_list)
weather_data.head()

In [None]:
#Now we prepare the merge of the weather data and the other data. For this we need to round our timestamps down to the full hour because our weather data is given hourly
data_cleaned["Trip End Timestamp Rounded"] = data_cleaned["Trip End Timestamp"].dt.floor("h")
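
The difference between rounding down and rounding to the nearest hour matters for the merge key. A minimal sketch with two made-up timestamps shows how `dt.floor` and `dt.round` behave:

```python
import pandas as pd

# Two hypothetical trip end times within the same hour
ts = pd.Series(pd.to_datetime(["2023-05-01 14:05:00", "2023-05-01 14:45:00"]))

floored = ts.dt.floor("h")  # always rounds down to the start of the hour
rounded = ts.dt.round("h")  # rounds to the nearest hour, which can move a trip into the next hour

print(pd.DataFrame({"original": ts, "floor": floored, "round": rounded}))
```

With `floor`, both trips are matched to the 14:00 weather observation, whereas `round` would assign the second trip to 15:00.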


In [None]:
#In the next step we can start with the merge
data_merged = pd.merge(data_cleaned, weather_data, left_on=["Trip End Timestamp Rounded", "Latitude rounded", "Longitude rounded"], right_on =["Timestamp", "Latitude rounded", "Longitude rounded"], how = "inner")

In [None]:
# Now we check if we introduced any null values with our weather data or during the merge
print('Null Values: ', data_merged.isnull().sum())
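
An inner merge silently drops every ride for which no weather observation exists, so a null check afterwards cannot reveal those rides. One possible way to count them, sketched here on small made-up frames (the column name `hour` is hypothetical), is a left merge with `indicator=True`:

```python
import pandas as pd

# Toy data: three trips, but weather observations for only two of the hours
trips = pd.DataFrame({
    "hour": pd.to_datetime(["2023-01-01 10:00", "2023-01-01 11:00", "2023-01-01 12:00"]),
    "Tip": [0, 2, 1],
})
weather = pd.DataFrame({
    "hour": pd.to_datetime(["2023-01-01 10:00", "2023-01-01 11:00"]),
    "Temperature": [1.5, 2.0],
})

# A left merge keeps all trips; the _merge column tells us which ones found a weather row
checked = trips.merge(weather, on="hour", how="left", indicator=True)
unmatched = int((checked["_merge"] == "left_only").sum())

# The inner merge only keeps trips with a matching weather row
inner = trips.merge(weather, on="hour", how="inner")
print(f"{unmatched} trip(s) without weather data, {len(inner)} trip(s) kept by the inner merge")
```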


In [None]:
#After that we can drop all the columns we just needed to merge our data
data_merged = data_merged.drop(columns = ["Dropoff Centroid Latitude", "Dropoff Centroid Longitude", "Latitude rounded", "Longitude rounded", "Timestamp" ,"Trip End Timestamp Rounded"])
data_merged.head()

<h4 style="text-decoration: underline;">2.3 Creation of dummy variables</h4>
To perform our regression later on we need to convert the 'Shared Trip Authorized', 'Shared Trip Match' and 'Trip End Timestamp' columns to a numeric datatype. For this we make use of dummy variables, where 1 stands for true and 0 stands for false for the Shared Trip variables, and separate variables for the time of day in the case of the Trip End variable. In addition we need to create a dummy variable for the rain, because we are not interested in the amount of rain on a certain day; rather, we want to capture whether it rained or not.

In [None]:
#Create dummy variable for 'Shared Trip Authorized' and 'Shared Trip Match' (1 = True and 0 = False)
data_merged["Shared Trip Authorized"] = data_merged["Shared Trip Authorized"].astype(int)
data_merged["Shared Trip Match"] = data_merged["Shared Trip Match"].astype(int)
print("📋First 5 Rows:")
data_merged.head()


In [None]:
#Create dummy variables for the Trip End Timestamp: 22:00-6:00 for night, 6:00-18:00 for day and 18:00-22:00 for evening
data_merged["Trip End Hour"] = data_merged["Trip End Timestamp"].dt.hour
data_merged["Night"] = np.where((data_merged["Trip End Hour"] >= 22) | (data_merged["Trip End Hour"] < 6), 1, 0)
data_merged["Day"] = np.where((data_merged["Trip End Hour"] >= 6) & (data_merged["Trip End Hour"] < 18), 1, 0)
data_merged["Evening"] = np.where((data_merged["Trip End Hour"] >= 18) & (data_merged["Trip End Hour"] < 22), 1, 0)
#Drop the Trip End Hour & Trip End Timestamp column as we don't need them anymore
data_merged = data_merged.drop(columns=["Trip End Hour", "Trip End Timestamp"])

data_merged.head()

In [None]:
#We see that the column 'Rain in mm' has the datatype object but we need a numeric datatype, so we convert this column to the right datatype
data_merged['Rain in mm'] = pd.to_numeric(data_merged['Rain in mm'], errors='coerce')
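
The cell above keeps the rain as an amount in millimetres. A rain dummy as described at the beginning of this section could be derived as in the following sketch. The column name `Rained` and the toy values are assumptions, and missing measurements are treated as "no rain" here, which may or may not be appropriate for the real data.

```python
import pandas as pd

# Toy frame with hypothetical rain amounts, including one missing measurement
rain = pd.DataFrame({"Rain in mm": [0.0, 0.3, None, 2.1]})

# 1 if any rain was measured, otherwise 0 (NaN > 0 evaluates to False)
rain["Rained"] = (rain["Rain in mm"] > 0).astype(int)
print(rain)
```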

In [None]:
#Switching the columns so that all dummy variables will be after all numeric variables
data_merged = data_merged[['Tip', 'Trip Seconds', 'Trip Miles', 'Cost', 'Temperature', 'Rain in mm', 'Shared Trip Authorized', 'Shared Trip Match', 'Day', 'Evening', 'Night']]

<h4 style="text-decoration: underline;">2.4 Dealing with outliers</h4>
Outliers are often caused by errors in data collection. Therefore it is important to identify and remove them, as they can distort the performance of predictive models by representing values that do not reflect typical or real-world scenarios. In the following we remove outliers using the standard deviation method: every value that lies more than three standard deviations away from the mean of its column is treated as an outlier. However, it must be acknowledged that we cannot say with complete certainty whether each detected outlier represents a data collection error or simply reflects a rare but valid case, since we did not collect the data ourselves. Nevertheless, this approach helps us prepare the dataset as effectively as possible for building robust predictive models.
 

In [None]:

#We don't need to deal with outliers in the dummy columns because they only contain the values 0 or 1
numeric_data = data_merged.drop(columns = ["Shared Trip Authorized", "Shared Trip Match", "Night", "Day", "Evening"])
for column in numeric_data.columns:
    #Calculate mean and standard deviation of the current column
    mean = numeric_data[column].mean()
    std = numeric_data[column].std()
    
    #Calculate upper and lower limit
    upperlimit = mean + 3 * std
    lowerlimit = mean - 3 * std

    #Replace all outliers with null values so we can remove them later
    data_merged.loc[(data_merged[column] > upperlimit) | (data_merged[column] < lowerlimit), column] = np.nan

#Remove all rows that contain at least one null value (delete all outliers)
data_without_outliers = data_merged.dropna(axis = 0)
print(f"❌ Deleted {len(data_merged) - len(data_without_outliers)} rows containing outliers")
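
The three-sigma rule used above can be tried out on a small made-up series. This is only a sketch: the helper `three_sigma_mask` and the values are hypothetical.

```python
import pandas as pd

def three_sigma_mask(series):
    """Return True for values within three standard deviations of the mean."""
    mean, std = series.mean(), series.std()
    return (series - mean).abs() <= 3 * std

# 29 ordinary fares and one extreme value (made-up numbers)
fares = pd.Series([10.0] * 29 + [1000.0])
kept = fares[three_sigma_mask(fares)]
print(f"Kept {len(kept)} of {len(fares)} values")
```

Note that extreme values inflate the mean and the standard deviation themselves, so in very small or heavily skewed samples the rule can fail to flag an obvious outlier.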


<h4 style="text-decoration: underline;">2.5 Saving the data</h4>
We will save the prepared data here so that we can continue working with it without having to rerun all the code above every time we start.

In [None]:
data_without_outliers.to_csv('Data/Chicago_RideHailing_Data_Cleaned.csv', index=False)

<h4 style="text-decoration: underline;">3. Data Modeling</h4>
If you have already saved the data you can use this shortcut to save some runtime.

In [None]:
data_without_outliers = pd.read_csv('Data/Chicago_RideHailing_Data_Cleaned.csv')
#Check if everything worked correctly
data_without_outliers.head()

<h4 style="text-decoration: underline;">3.1 Train Test Validation Split</h4>
First of all we need to split our data into training, validation and test data.

In [None]:
#Define x and y vectors
x = data_without_outliers.drop(columns = ["Tip"])
y = data_without_outliers["Tip"]

#Perform train test validation split
x_train_data, x_test_data, y_train_data, y_test_data = train_test_split(x, y, test_size = 0.5, random_state = 42)
x_val_data, x_test_data, y_val_data, y_test_data = train_test_split(x_test_data, y_test_data, test_size = 0.6, random_state = 42)
print("Datasplit:")
print(f"🏋️ Training: {len(x_train_data)} Samples ({len(x_train_data)/len(data_without_outliers)*100:.1f}%)")
print(f"🔬 Testing: {len(x_test_data)} Samples ({len(x_test_data)/len(data_without_outliers)*100:.1f}%)")
print(f"✅ Validation: {len(x_val_data)} Samples ({len(x_val_data)/len(data_without_outliers)*100:.1f}%)")
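
Because the second split is applied to the remainder of the first one, the resulting shares are products of the two `test_size` values. The following sketch (the helper `split_shares` is hypothetical) works through the arithmetic for the values used above:

```python
def split_shares(first_test_size, second_test_size):
    """Shares of the full dataset after two chained train_test_split calls."""
    train = 1 - first_test_size
    # The second call splits the held-out part again: its "train" side becomes the validation set
    validation = first_test_size * (1 - second_test_size)
    test = first_test_size * second_test_size
    return train, validation, test

train_share, val_share, test_share = split_shares(0.5, 0.6)
print(f"Train: {train_share:.0%}, Validation: {val_share:.0%}, Test: {test_share:.0%}")
```

So the split is roughly 50% training, 20% validation and 30% test data.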

In [None]:
# This second split will only be important for our second model

x_train1_data, x_train2_data, y_train1_data, y_train2_data = train_test_split(x_train_data, y_train_data, test_size = 0.5, random_state = 42)
print("Datasplit:")
print(f"Train Model 1: {len(x_train1_data)} Samples ({len(x_train1_data)/len(x_train_data)*100:.1f}%)")
print(f"Train Model 2: {len(x_train2_data)} Samples ({len(x_train2_data)/len(x_train_data)*100:.1f}%)")

<h4 style="text-decoration: underline;">3.2 Descriptive Analysis</h4>
Now we will look at key statistics about our data. We only use the training data for this, in order to rule out accidental data leakage.

In [None]:
#Create a table with a column for each feature and key statistics as rows
def create_statistics_table(df):
    stats = pd.DataFrame(index=['Mean', 'Median', 'Standard Deviation', 'Min', 'Max', 'Range'])
    for column in df.columns:
        stats[column] = [
            df[column].mean(),
            df[column].median(),
            df[column].std(),
            df[column].min(),
            df[column].max(),
            df[column].max() - df[column].min()
        ]
    return stats

#Create statistics table for the training data
features_stats_table = create_statistics_table(x_train_data.drop(columns=['Shared Trip Authorized', 'Shared Trip Match', 'Night', 'Day', 'Evening']))
target_stats_table = create_statistics_table(y_train_data.to_frame(name='Tip'))
stats_table = pd.concat([target_stats_table.rename(columns={'Tip': 'Tip (Target)'}), features_stats_table], axis=1)
stats_table.head(len(stats_table))

In [None]:
# Now we create a table presenting the share of 1s for each dummy variable in the training data
dummy_share = pd.DataFrame({
    'Shared Trip Authorized': x_train_data['Shared Trip Authorized'].value_counts(normalize=True),
    'Shared Trip Match': x_train_data['Shared Trip Match'].value_counts(normalize=True),
    'Night': x_train_data['Night'].value_counts(normalize=True),
    'Day': x_train_data['Day'].value_counts(normalize=True),
    'Evening': x_train_data['Evening'].value_counts(normalize=True)
})

dummy_share.head(len(dummy_share))


In the next step we plot the data to get a first overview of our predictors. For this we use the first 5000 values of each predictor, because if we used all values we could not see any possible linear relationship: there would be too much data in one scatter plot. In addition, it leads to a much shorter runtime.

In [None]:
#Create Scatterplots for the first 4 predictors
fig_1, axes_1 = plt.subplots(nrows = 1, ncols = 4, figsize= (21,6))
fig_1.suptitle("Scatterplots of all predictors", fontsize=26)
for i, ax in enumerate(axes_1):
    ax.scatter(x = x_train_data.iloc[:5000, i], y = y_train_data[:5000], color = f'C{i}')
    ax.set_title(x_train_data.columns[i])
plt.tight_layout()
plt.show()

#Create Scatterplots for predictor 5-7
fig_2, axes_2 = plt.subplots(nrows = 1, ncols = 3, figsize= (21,6))
for i, ax in enumerate(axes_2):
    x_values = x_train_data.iloc[:5000, i + 4]
    y_values = y_train_data[:5000]
    
    # Combine x and y values for counting
    points_df = pd.DataFrame({'x': x_values, 'y': y_values})
    
    # Count the frequency of each unique combination
    point_counts = points_df.groupby(['x', 'y']).size().reset_index(name='count')
    
    # Use the count of each combination as the marker size
    sizes = point_counts['count']
    
    ax.scatter(x = point_counts['x'], y = point_counts['y'], 
               s = sizes, color = f'C{i + 4}')
    ax.set_title(x_train_data.columns[i + 4])
plt.tight_layout()
plt.show()

#Create Scatterplots for predictor 8-10
fig_3, axes_3 = plt.subplots(nrows = 1, ncols = 3, figsize= (21,6))
for i, ax in enumerate(axes_3):
    x_values = x_train_data.iloc[:5000, i + 7]
    y_values = y_train_data[:5000]
    
    # Combine x and y values for counting
    points_df = pd.DataFrame({'x': x_values, 'y': y_values})
    
    # Count the frequency of each unique combination
    point_counts = points_df.groupby(['x', 'y']).size().reset_index(name='count')
    
    # Use the count of each combination as the marker size
    sizes = point_counts['count']
    
    ax.scatter(x = point_counts['x'], y = point_counts['y'], 
               s = sizes, color = f'C{i + 7}')
    ax.set_title(x_train_data.columns[i + 7])
plt.tight_layout()
plt.show()

<h4 style="text-decoration: underline;">3.3 Data normalization</h4>
After plotting the data we get some interesting information. It looks like there are several linear correlations, for example between the cost and the tip. Moreover, it is noticeable that the predictors are on different scales, so we need to normalize them to get meaningful results in our regression later on. For the normalization we make use of the StandardScaler class from scikit-learn, which uses the following formula:
\[
z = \frac{x - \mu}{\sigma}
\] 

In [None]:
#Select which Data to normalize (no Dummy Variables)
columns_to_normalize = ["Trip Seconds", "Trip Miles", "Cost", "Temperature", "Rain in mm"]


x_train_data_scaled = x_train_data.copy()
x_val_data_scaled = x_val_data.copy()
x_test_data_scaled = x_test_data.copy()
x_train1_data_scaled = x_train1_data.copy()
x_train2_data_scaled = x_train2_data.copy()


#Normalize data
scaler = StandardScaler()
x_train_data_scaled[columns_to_normalize] = scaler.fit_transform(x_train_data[columns_to_normalize])
#Only use transform from here on, so that mean and standard deviation are not recomputed
x_val_data_scaled[columns_to_normalize] = scaler.transform(x_val_data[columns_to_normalize])
x_test_data_scaled[columns_to_normalize] = scaler.transform(x_test_data[columns_to_normalize])
x_train1_data_scaled[columns_to_normalize] = scaler.transform(x_train1_data[columns_to_normalize])
x_train2_data_scaled[columns_to_normalize] = scaler.transform(x_train2_data[columns_to_normalize])


x_train_data_scaled.head()
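
The formula can be reproduced by hand to see why the statistics are fitted on the training data only. The following sketch uses made-up trip durations and plain NumPy instead of the scaler:

```python
import numpy as np

# Hypothetical trip durations in seconds
train = np.array([600.0, 900.0, 1200.0, 1500.0])
validation = np.array([750.0, 1800.0])

# Mean and standard deviation come from the training data only
mu = train.mean()
sigma = train.std()  # ddof=0, the population standard deviation that StandardScaler also uses

train_scaled = (train - mu) / sigma
validation_scaled = (validation - mu) / sigma  # reuses mu and sigma, nothing is refitted

print("Train:", train_scaled.round(3))
print("Validation:", validation_scaled.round(3))
```

The scaled training data has mean 0 and standard deviation 1 by construction. The validation data generally does not, which is expected and not an error.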


<h4 style="text-decoration: underline;">3.4 Linear Regression</h4>

After finishing the data preparation and completing our descriptive task, we now want to start with our predictive models. First of all we want to create a simple linear regression, because we have seen a few linear relationships between our predictors and the tip in the descriptive task above. A linear regression is a good first attempt at an understandable machine learning model because it is easy to implement and to interpret. Especially for approximately linear relationships (as we have seen in our scatter plots before) this model is robust and gives good predictions. For complex problems (where the relationships are not linear) this approach can be too simple to represent reality. So this step is used to see whether our prediction problem can be solved well by an easily understandable model or whether it needs more complex approaches to create appropriate predictions.
 
We define our linear regression model with the following input vector:
\begin{equation}
x^{(i)} = \left[ \begin{array}{c} 1 \\ \mathrm{Trip\ Seconds}^{(i)} \\ \mathrm{Trip\ Miles}^{(i)} \\ \mathrm{Shared\ Trip\ Authorized}^{(i)} \\ \mathrm{Shared\ Trip\ Matched}^{(i)} \\ \mathrm{Cost}^{(i)} \\ \mathrm{Temperature}^{(i)} \\ \mathrm{Rain\ in\ mm}^{(i)} \\ \mathrm{Night}^{(i)} \\ \mathrm{Day}^{(i)} \\
\mathrm{Evening}^{(i)} \end{array} \right] \end{equation}

In addition, our linear hypothesis function $tip_\beta(x) = \beta^Tx$ is given by:

\begin{equation}
\begin{aligned}
tip_\beta(x^{(i)}) = {} & \beta_0 + \beta_1 \cdot \mathrm{Trip\ Seconds}^{(i)} + \beta_2 \cdot \mathrm{Trip\ Miles}^{(i)} + \beta_3 \cdot \mathrm{Shared\ Trip\ Authorized}^{(i)} \\
& + \beta_4 \cdot \mathrm{Shared\ Trip\ Matched}^{(i)} + \beta_5 \cdot \mathrm{Cost}^{(i)} + \beta_6 \cdot \mathrm{Temperature}^{(i)} + \beta_7 \cdot \mathrm{Rain\ in\ mm}^{(i)} \\
& + \beta_8 \cdot \mathrm{Night}^{(i)} + \beta_9 \cdot \mathrm{Day}^{(i)} + \beta_{10} \cdot \mathrm{Evening}^{(i)}
\end{aligned}
\end{equation}

After defining our linear regression model, we now want to find the right values for our coefficients $ \beta_0, \beta_1, \dots, \beta_{10} $.


In [None]:
linear_model = LinearRegression()
linear_model.fit(x_train_data_scaled, y_train_data)
print("📋Values for our predictors:")
print(str(linear_model.coef_))

In [None]:
linear_model_prediction = linear_model.predict(x_val_data_scaled)
print("❌Mean absolute Error:", round(mean_absolute_error(y_val_data, linear_model_prediction), 4), "\n🤖R²:", round(r2_score(y_val_data, linear_model_prediction), 4))
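
To judge whether these numbers are good or bad, it can help to compare them with trivial baselines. The sketch below implements MAE and R² by hand and evaluates two constant predictors on made-up tip values. The numbers are illustrative only and are not results from our dataset.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Made-up tips with many zeros, loosely mimicking the shape of the target
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 2, 3, 4], dtype=float)

always_zero = np.zeros_like(y_true)                 # baseline 1: never predict a tip
always_mean = np.full_like(y_true, y_true.mean())   # baseline 2: always predict the mean tip

print("Always zero: MAE =", mae(y_true, always_zero), "R² =", r2(y_true, always_zero))
print("Always mean: MAE =", mae(y_true, always_mean), "R² =", r2(y_true, always_mean))
```

A predictor that always outputs the mean has an R² of 0 by definition, so an R² close to 0 means the model is barely better than that baseline. With many zeros in the target, always predicting 0 can even reach a lower MAE than predicting the mean.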

As we can see, our model has a low R² value and a high error which indicates that the predictions from our linear model differ significantly from the actual values. We can visualize this by creating a scatter plot of the predicted tips versus the actual tip amounts and adding an ideal line that represents perfect predictions. This allows us to see how far off our model's predictions are from the ideal.

It becomes especially apparent that the model struggles with higher tip amounts (greater than \$4), since it never predicts values above that threshold. This is probably caused by the large number of zero values in our dataset, which lowers the slope of the regression line and makes the model systematically underestimate higher tips.

Another possible reason why our predictions differ a lot from the actual tip is that the tips are rounded to the nearest \$1.00 (see Business Understanding above), which cannot be represented well by our linear model. Because the tip values are rounded to full dollars, we suspect that a classification approach will work better than regression. The large gaps between the data points make it difficult to fit a precise regression line. Instead it seems more effective to divide the data into 10 classes (from 0 to \$9) and predict one of these classes using classification algorithms.

In conclusion, it becomes visible that our simple regression model is not appropriate to depict the real world, and we need to try out other (classification) approaches to create a better and more robust model.

In [None]:
plt.figure(figsize=(6, 6))
#Plot ideal line
plt.plot([y_val_data.min(), y_val_data.max()], [y_val_data.min(), y_val_data.max()], 'r--')
#Scatter plot prediction versus actual tip
plt.scatter(linear_model_prediction[:5000], y_val_data[:5000])
plt.xlabel('Prediction')
plt.ylabel('Actual Tip')
plt.title('Actual versus predicted Tip')
plt.show()

<h4 style="text-decoration: underline;">3.5 Decision Tree</h4>

 

As a first classification attempt we try a decision tree.

In [None]:
# Calculate the share of rows in the test data where "Tip" = 0
zero_tip_count = (y_test_data == 0).sum()
total_count = len(y_test_data)
zero_tip_percentage = (zero_tip_count / total_count) * 100
print(f"📊 Share of rows in the test data where 'Tip' = 0: {zero_tip_percentage:.1f}%")
print(zero_tip_count, "rows have 'Tip' = 0")

In [None]:
def tree_performance_test(predictions, y_data, isBinary=False):
    print(f"Accuracy: {accuracy_score(y_data, predictions):.4f}")
    if isBinary:
        print(f"Recall (share of actual 1s that were detected): {recall_score(y_data, predictions):.4f}")
        print(f"Precision (share of predicted 1s that were correct): {precision_score(y_data, predictions):.4f}")
        print(f"F1-Score: {f1_score(y_data, predictions):.4f}")
        print("---")
    else:
        print(f"Recall (average share of actual values per class that were predicted correctly): {recall_score(y_data, predictions, average='macro'):.4f}")
        print(f"Precision (average share of predictions per class that were correct): {precision_score(y_data, predictions, average='macro'):.4f}")
        print(f"F1-Score: {f1_score(y_data, predictions, average='macro'):.4f}")
    non_zero_predictions = (predictions != 0).sum()
    zero_predictions = (predictions == 0).sum()
    print(f"Number of non-zero predictions: {non_zero_predictions}")
    print(f"Number of zero predictions: {zero_predictions}")
    print()

def show_correctness(predictions, y_val_data, max=None):
    # max is an optional upper limit for the y-axis
    plt.figure(figsize=(6, 6))
    right_counts = []
    total_counts = []
    for i in range(int(y_val_data.max()) + 1):
        # Count the correct predictions and all actual occurrences of tip amount i
        right_predictions = ((predictions == i) & (y_val_data == i)).sum()
        total_occurrences = (y_val_data == i).sum()
        right_counts.append(right_predictions)
        total_counts.append(total_occurrences)
    tips = range(int(y_val_data.max()) + 1) 
    plt.bar(tips, total_counts, label = "Total Tips")
    plt.bar(tips, right_counts, label = "Correctly predicted")
    plt.ylabel("Total Tips in million")
    plt.xlabel("Tip amounts")
    plt.legend()
    if max is not None:
        plt.ylim(0, max)
    plt.show()

In [None]:
model = DecisionTreeClassifier(max_depth=5, random_state = 42)
model.fit(x_train_data_scaled, y_train_data)
predictions = model.predict(x_val_data_scaled)
tree_performance_test(predictions, y_val_data)

As we can see, due to our high share of 0 in the tips (75%), a simple tree will always predict 0, as that prediction has by far the highest chance of being correct.
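
The effect of such a dominant class on the metrics can be illustrated with a small sketch. The label distribution below is made up (75% zeros), and the helper `majority_baseline` is hypothetical.

```python
from collections import Counter

def majority_baseline(y_true):
    """Accuracy and macro recall of a model that always predicts the most frequent class."""
    counts = Counter(y_true)
    majority_class, majority_count = counts.most_common(1)[0]
    accuracy = majority_count / len(y_true)
    # Recall is 1 for the majority class and 0 for every other class
    macro_recall = 1 / len(counts)
    return majority_class, accuracy, macro_recall

# Made-up labels: 75 rides without tip and 25 rides spread over three tip amounts
labels = [0] * 75 + [1] * 10 + [2] * 10 + [3] * 5
majority_class, accuracy, macro_recall = majority_baseline(labels)
print(f"Always predicting {majority_class}: accuracy = {accuracy:.2f}, macro recall = {macro_recall:.2f}")
```

This is why accuracy alone looks deceptively good here, while macro-averaged recall exposes that all other classes are missed.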

In [None]:
model = DecisionTreeClassifier(max_depth=60, random_state = 42)
model.fit(x_train_data_scaled, y_train_data)
predictions = model.predict(x_val_data_scaled)
tree_performance_test(predictions, y_val_data)
show_correctness(predictions, y_val_data)
show_correctness(predictions, y_val_data, max=500000)

So a more complex tree does predict other values, but for any tip that is not 0 it doesn't get it right very often.

Now our idea is to first predict whether a customer tips at all (regardless of the amount) and to predict the amount in a second step. Our hope is that with only two categories, our model has a 75% chance of hitting no tip and a 25% chance of hitting a ride with a tip, instead of a much lower chance for each individual amount. Once we have filtered out a lot of rides without tips, we think that a prediction of the tip amount will work much better, because the class tip = 0 will not be as dominant anymore.

In [None]:
y_train1_binary = (y_train1_data != 0).astype(int)  # 1 if the tip is not 0, otherwise 0
y_val_binary = (y_val_data != 0).astype(int)

In [None]:

binary_model = RandomForestClassifier(
    n_estimators=5,
    max_depth=17, 
    random_state=42, 
    class_weight={0: 1, 1: 3.5} # Weight non-zero tips higher to reduce the dominance of the zeros
)

binary_model.fit(x_train1_data_scaled, y_train1_binary)
binary_predictions = binary_model.predict(x_val_data_scaled)

tree_performance_test(binary_predictions, y_val_binary, isBinary=True)
show_correctness(binary_predictions, y_val_binary)
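
As a point of reference for the manually chosen class weights, scikit-learn's `class_weight="balanced"` option computes weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies this formula to a made-up label vector with 75% zeros. The helper `balanced_class_weights` is hypothetical.

```python
import numpy as np

def balanced_class_weights(y):
    """Weights following the formula n_samples / (n_classes * np.bincount(y))."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

# Made-up binary labels: 75 rides without tip, 25 rides with tip
y_binary = np.array([0] * 75 + [1] * 25)
weights = balanced_class_weights(y_binary)
ratio = weights[1] / weights[0]
print(f"Weights: {weights.round(3)}, ratio of class 1 to class 0: {ratio:.1f}")
```

For a 75/25 distribution the balanced ratio is 3, so a weight of 3.5 for the tip class pushes the model slightly further towards predicting tips than pure balancing would.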

This model identifies 72% of rides with tips, which means that 28% of rides with tips will be predicted with no tip. This is the best we could achieve.

For the rides where the model predicts a tip, we still have to predict how much will be tipped. Our hope of having a dataset with a less dominant class of 0 did not get fulfilled (now we have 70% of zero values instead of 75% before), but we still train another regression model because we might have a better chance of understanding the relationships of the data with a lot of clear zeros already being ruled out by the first model, so the regression doesn't have to deal with it.

In [None]:
# Use model 1 to predict on the scaled second training split, so that the regression model is only trained on rides where a tip was predicted
tip_predicted_mask = binary_model.predict(x_train2_data_scaled) == 1
x_train2_data = x_train2_data[tip_predicted_mask]
x_train2_data_scaled = x_train2_data_scaled[tip_predicted_mask]
y_train2_data = y_train2_data[tip_predicted_mask]


In [None]:
linear_model2 = LinearRegression()
linear_model2.fit(x_train2_data_scaled, y_train2_data)
print("📋Values for our predictors:")
print(str(linear_model2.coef_))

In [None]:
# Binary predictions
binary_predictions = binary_model.predict(x_val_data_scaled)

# For rows where a tip is predicted (binary_prediction == 1),
# get the continuous predictions
tip_amount_predictions = linear_model2.predict(x_val_data_scaled[binary_predictions == 1])

# Combine both: 0 for no tip, predicted values for tips
final_predictions = np.zeros(len(binary_predictions))
final_predictions[binary_predictions == 1] = tip_amount_predictions
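
The combination step and a possible error metric for the combined prediction can be sketched on a few made-up values. The helper `combine_two_stage` and all numbers are hypothetical and only illustrate the mechanics.

```python
import numpy as np

def combine_two_stage(has_tip, amount_if_tip):
    """has_tip: 0/1 array from stage 1; amount_if_tip: stage 2 amounts for rows with has_tip == 1."""
    combined = np.zeros(len(has_tip), dtype=float)
    combined[has_tip == 1] = amount_if_tip
    return combined

stage1 = np.array([0, 1, 0, 1])            # made-up classifier output
stage2 = np.array([2.4, 1.1])              # made-up regression output for the two predicted tips
y_actual = np.array([0.0, 3.0, 1.0, 1.0])  # made-up true tips

final = combine_two_stage(stage1, stage2)
two_stage_mae = np.mean(np.abs(y_actual - final))
print("Combined predictions:", final, "MAE:", round(two_stage_mae, 3))
```

Computing the same metric as for the linear model would make the two approaches directly comparable.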

In [None]:
# Show a graph of predicted vs actual tips
plt.figure(figsize=(6, 6))
plt.plot([y_val_data.min(), y_val_data.max()], [y_val_data.min(), y_val_data.max()], 'r--')
plt.scatter(final_predictions[:5000], y_val_data[:5000])
plt.xlabel('Prediction')
plt.ylabel('Actual Tip')
plt.title('Actual versus predicted Tip')
plt.show()


<h4 style="text-decoration: underline;">4. Implication/Reflection</h4>
Let's take a look at the final performance of our predictive models using the test dataset.

In [None]:
linear_model_prediction = linear_model.predict(x_test_data_scaled)
print("❌Mean absolute Error:", round(mean_absolute_error(y_test_data, linear_model_prediction), 4), "\n🤖R²:", round(r2_score(y_test_data, linear_model_prediction), 4))

Assuming that our data was collected correctly, the results of our linear regression show that the model can explain only about 8% of the total variance in the target values (R² = 0.08). Additionally, the mean absolute error (MAE) is relatively high, which means that the predictions deviate significantly from the actual values on average. Such a low explanatory rate suggests that essential influencing factors are missing or that relationships are not correctly depicted. This means for all decision makers that the model does not make reliable predictions for any strategic or operational decisions and should therefore not be used. 

In order to improve the model's performance, it should be evaluated whether there are additional relevant features that could be included. This might involve incorporating other variables from the existing dataset or creating new features (e.g. polynomial features). Additionally, it would be reasonable to test non-linear modeling approaches as they might capture the real-world relationships more accurately than the current linear model (e.g. Radial Basis Function).

Moreover, it could make sense to include further analyses. For example, we could review the existing features and remove those that do not significantly contribute to the prediction. To do this we could use Lasso regression which is particularly useful after implementing the improvements mentioned above if our model ends up containing a large number of features (feature selection).
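
The feature selection effect of Lasso regression can be sketched on synthetic data. The data, the regularization strength `alpha=0.5` and the seed below are assumptions chosen for illustration and are not tuned for our dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: three features, but only the first one influences the target
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=1000)

# The L1 penalty shrinks coefficients and sets irrelevant ones exactly to zero
lasso = Lasso(alpha=0.5)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_)
print("Coefficients:", lasso.coef_.round(3))
print("Selected feature indices:", selected)
```

Features whose coefficient ends up at exactly zero can be dropped. The price is that the remaining coefficients are shrunk towards zero as well, so `alpha` would need to be chosen carefully, for example on the validation data.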
