<div style="text-align: center"><h1 style="text-decoration: underline;">DSML Project</h1></div>



This is the official notebook of the DSML project by Marc Rennefort, Kilian Lipinsky, Timo Hagelberg, Jan Behrendt-Emden and Paul Severin. The project is based on the following dataset: https://data.cityofchicago.org/Transportation/Transportation-Network-Providers-Trips-2023-2024-/n26f-ihde/about_data
<h4>1. Description</h4>
The goal of this project is to predict ride-hailing tips in Chicago based on travel time, distance, fare amount, weather conditions, and whether the customer shared the ride.


In [None]:
#Note all your imports here

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.linear_model import LinearRegression
from datetime import datetime
from meteostat import Hourly, Point


In [None]:
# Load the entire dataset
# With little RAM: skip ahead and run the cell after the next one directly

data = pd.read_csv('Data/Chicago_RideHailing_Data.csv')  # be patient, loading takes a while

In [None]:
# Show basic information
data.head()
data.info()
data.isnull().sum()
# Remove columns that are not needed
data_cleaned = data.drop(columns=[
    'Percent Time Chicago', 'Percent Distance Chicago',
    'Pickup Census Tract', 'Dropoff Census Tract',
    'Pickup Community Area', 'Dropoff Community Area',
    'Additional Charges', 'Trips Pooled',
    'Pickup Centroid Latitude', 'Pickup Centroid Longitude', 'Pickup Centroid Location',
    'Dropoff Centroid Latitude', 'Dropoff Centroid Longitude', 'Dropoff Centroid Location',
])
data_cleaned.head()


In [None]:
# Load only the required columns directly (my laptop does not have enough RAM for the entire dataset)
# The previous two cells do not need to be executed!

data_cleaned = pd.read_csv('Data/Chicago_RideHailing_Data.csv', usecols= ['Trip ID', 'Trip Start Timestamp', 'Trip End Timestamp', 'Trip Seconds', 'Trip Miles', 'Fare', 'Tip', 'Trip Total', 'Shared Trip Authorized', 'Shared Trip Match'])
data_cleaned.head()
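If even the reduced column set does not fit into memory, the file can also be read in pieces. The following is only a minimal sketch of that idea: it uses a tiny in-memory CSV with made-up values as a stand-in for `Data/Chicago_RideHailing_Data.csv`, and the names `sample_csv`, `wanted_columns` and `data_small` are hypothetical. It combines `usecols` with the `chunksize` parameter of `pd.read_csv` and drops incomplete rows per chunk.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV file (all values are made up)
sample_csv = io.StringIO(
    "Trip ID,Trip Seconds,Trip Miles,Fare,Tip,Pickup Census Tract\n"
    "a1,600,2.5,10.0,2.0,17031\n"
    "a2,900,4.0,15.0,,17031\n"
    "a3,300,1.0,5.0,0.0,17031\n"
    "a4,1200,6.5,20.0,3.0,17031\n"
)

# Only these columns are kept; 'Pickup Census Tract' is never loaded
wanted_columns = ['Trip ID', 'Trip Seconds', 'Trip Miles', 'Fare', 'Tip']

# Read two rows at a time, drop rows with null values per chunk, then combine
chunks = []
for chunk in pd.read_csv(sample_csv, usecols=wanted_columns, chunksize=2):
    chunks.append(chunk.dropna(axis=0))

data_small = pd.concat(chunks, ignore_index=True)
print(data_small)
```

For the real file, a much larger `chunksize` (for example several hundred thousand rows) would be more sensible; the value 2 only serves the illustration.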

In [None]:
#drop all rows with null values
data_cleaned = data_cleaned.dropna(axis = 0)

In [None]:
# Change the format of the dates (Trip Start Timestamp and Trip End Timestamp)
# Works fine for some group members, but for some reason not at all for me
data_cleaned['Trip Start Timestamp'] = pd.to_datetime(data_cleaned['Trip Start Timestamp'],  format='%m/%d/%Y %I:%M:%S %p', errors='coerce')
data_cleaned['Trip End Timestamp'] = pd.to_datetime(data_cleaned['Trip End Timestamp'],  format='%m/%d/%Y %I:%M:%S %p', errors='coerce')
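Because `errors='coerce'` silently turns every value that does not match the format into `NaT`, a mismatch between the format string and the actual file contents does not raise an error. This may be one reason why the conversion appears to work on one machine and not on another. A small sketch with made-up sample strings (the names `raw`, `parsed` and `n_failed` are hypothetical) shows how to count and inspect the values that failed to parse:

```python
import pandas as pd

# Made-up sample values; the last two do not match the 12-hour format
raw = pd.Series([
    '01/01/2023 12:15:00 AM',
    '06/30/2024 11:45:00 PM',
    '2023-01-01 00:15:00',     # ISO style instead of MM/DD/YYYY
    '01/01/2023 13:15:00 PM',  # hour 13 is not valid for %I
])

parsed = pd.to_datetime(raw, format='%m/%d/%Y %I:%M:%S %p', errors='coerce')

# Count the values that could not be parsed and list the original strings
n_failed = parsed.isna().sum()
print('Unparsed values:', n_failed)
print(raw[parsed.isna()].tolist())
```

Running the same check on `data_cleaned['Trip Start Timestamp']` before overwriting the column would show whether the timestamps in the local copy of the file really use this format.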

In [None]:
print('Null values: ', data_cleaned.isnull().sum())
data_cleaned.info()
data_cleaned.head()

In [None]:
# Load weather data for Chicago for the entire period:
wetter = Hourly(Point(41.8781, -87.6298), datetime(2023, 1, 1), datetime(2025, 1, 1))
wetter = wetter.fetch()
wetter.head()

In [17]:
# Clean the weather data:
wetter_cleaned = wetter.drop(columns=['dwpt', 'rhum', 'wdir', 'pres'])
# Note: 'time' is the index, not a column, so its entry in the mapping has no effect
wetter_cleaned = wetter_cleaned.rename(columns={'time': 'Time', 'temp': 'Temperature in C', 'prcp': 'Rain in mm', 'snow': 'Snow in mm', 'wspd': 'Wind Speed in km/h', 'wpgt': 'Peak Wind Speed in km/h', 'tsun': 'Sunshine Duration in min', 'coco': 'Weather Condition Code'})

# Only for checking the data:
#wetter_cleaned.head()
#wetter_cleaned.info()
#print('Null values: ', wetter_cleaned.isnull().sum())

# Snow, Peak Wind Speed and Sunshine Duration contain only null values:
wetter_cleaned = wetter_cleaned.drop(columns=['Snow in mm', 'Peak Wind Speed in km/h', 'Sunshine Duration in min'])

wetter_cleaned.head()

| time | Temperature in C | Rain in mm | Wind Speed in km/h | Weather Condition Code |
|------|------------------|------------|--------------------|------------------------|
| 2023-01-01 00:00:00 | 2.0 | 0.0 | 6.0 | 3.0 |
| 2023-01-01 01:00:00 | 2.2 | 0.0 | 5.4 | 3.0 |
| 2023-01-01 02:00:00 | 2.8 | 0.0 | 14.8 | 3.0 |
| 2023-01-01 03:00:00 | 2.8 | 0.0 | 0.0 | 3.0 |
| 2023-01-01 04:00:00 | 2.8 | 0.0 | 0.0 | 12.0 |
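To use the weather as a predictor, each trip has to be matched with the weather of the hour in which it started. The notebook does not show this step here, so the following is only a sketch of one possible approach under stated assumptions: the trips are made up, the weather values are copied from the first rows of the table above, and the names `trips`, `weather`, `Weather Hour` and `trips_with_weather` are hypothetical. The trip start is rounded down to the full hour and then joined against the hourly weather index.

```python
import pandas as pd

# Made-up trips; column names follow data_cleaned
trips = pd.DataFrame({
    'Trip ID': ['a1', 'a2', 'a3'],
    'Trip Start Timestamp': pd.to_datetime(
        ['2023-01-01 00:15:00', '2023-01-01 00:45:00', '2023-01-01 02:30:00']),
    'Tip': [2.0, 0.0, 3.0],
})

# Hourly weather indexed by 'time', as in wetter_cleaned
weather = pd.DataFrame(
    {'Temperature in C': [2.0, 2.2, 2.8],
     'Rain in mm': [0.0, 0.0, 0.0]},
    index=pd.DatetimeIndex(
        ['2023-01-01 00:00:00', '2023-01-01 01:00:00', '2023-01-01 02:00:00'],
        name='time'),
)

# Round each trip start down to the full hour so it matches the weather index
trips['Weather Hour'] = trips['Trip Start Timestamp'].dt.floor('h')

# Left join: every trip is kept, even if no weather row exists for its hour
trips_with_weather = trips.merge(
    weather, how='left', left_on='Weather Hour', right_index=True)
print(trips_with_weather[['Trip ID', 'Weather Hour', 'Temperature in C']])
```

One caveat worth checking before applying this to the real data: Meteostat's `Hourly` should return timestamps in UTC unless a `timezone` argument is passed, while the trip timestamps are presumably in local Chicago time, so the two time axes may need to be aligned first.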


#### Explanation of the Weather Condition Codes

| Code | Weather Condition |
|------|-------------------|
| 1    | Clear             |
| 2    | Fair              |
| 3    | Cloudy            |
| 4    | Overcast          |
| 5    | Fog               |
| 6    | Freezing Fog      |
| 7    | Light Rain        |
| 8    | Rain              |
| 9    | Heavy Rain        |
| 10   | Freezing Rain     |
| 11   | Heavy Freezing Rain |
| 12   | Sleet             |
| 13   | Heavy Sleet       |
| 14   | Light Snowfall    |
| 15   | Snowfall          |
| 16   | Heavy Snowfall    |
| 17   | Rain Shower       |
| 18   | Heavy Rain Shower |
| 19   | Sleet Shower      |
| 20   | Heavy Sleet Shower |
| 21   | Snow Shower       |
| 22   | Heavy Snow Shower |
| 23   | Lightning         |
| 24   | Hail              |
| 25   | Thunderstorm      |
| 26   | Heavy Thunderstorm |
| 27   | Storm             |

Source: https://dev.meteostat.net/formats.html#weather-condition-codes
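The numeric codes are categories, not quantities, so it can help to translate them into labels or coarse flags before modelling. The sketch below assumes a made-up series of codes and uses only a subset of the table above; the names `condition_labels`, `rain_codes`, `labels` and `is_rain` are hypothetical, and which codes count as "rain" is an illustrative choice, not a definition from Meteostat.

```python
import pandas as pd

# Subset of the condition codes from the table above
condition_labels = {
    1: 'Clear', 2: 'Fair', 3: 'Cloudy', 4: 'Overcast',
    7: 'Light Rain', 8: 'Rain', 9: 'Heavy Rain', 12: 'Sleet',
}

# Made-up hourly codes; NaN stands for an hour without a reported condition
codes = pd.Series([3.0, 3.0, 12.0, 8.0, float('nan')],
                  name='Weather Condition Code')

# Translate codes to labels; codes that are missing or not in the dict become NaN
labels = codes.map(condition_labels)

# Illustrative coarse feature: hours with rain or rain showers (assumed grouping)
rain_codes = [7, 8, 9, 17, 18]
is_rain = codes.isin(rain_codes)

print(pd.DataFrame({'code': codes, 'label': labels, 'is_rain': is_rain}))
```

A boolean flag like `is_rain` can be fed directly into a regression, whereas the raw code would wrongly be treated as an ordered number.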