# Report: Train delays and weather correlation in Germany

## 1. Project Idea

Every day, 7.3 million people in Germany rely on DB trains[^1], which is roughly every tenth person in the country. It is rare to find someone who has never had a problem with a DB train. Personally, every time I take the train to get somewhere, I experience delays, missed connections, or even cancellations. This made me wonder about the underlying causes. My impression is that heavy rain or thunderstorms coincide with an increase in train delays, so this project looks into a possible correlation between bad weather conditions and train delays.

Uncovering the reasons behind these delays would be of great benefit to the 7.3 million daily passengers who rely on DB trains. With this knowledge, people could plan their trips more efficiently: if bad weather is forecast for the day they want to travel, they can allow extra time for connecting trains or look for alternative means of transport. Germans, known for their need for planning and punctuality, would greatly benefit from being aware of such a correlation.


While it may be unrealistic to find a single cause of train delays, it would still be useful to establish a connection between train delays and rainy weather. Therefore, this report attempts to answer the following question:

**Is rainy weather correlated with more DB trains running late or with more minutes of delay?**

[^1]: https://www.germany.travel/de/trade/global-trade-corner/deutsche-bahn-ag-db.html

## 2. Data Sources and Data Pipeline

To collect the necessary data for this project, different APIs and files were used.
Mobilithek provides links to various DB APIs for train information, including the Timetable API [^2]. This API offers timetable information containing planned and changed timetables for long-distance trains [^2]. The data is offered by DB Station&Service AG and is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license [^3]. An overview of the API can be found on the DB API Marketplace [^4].

<img src="project/timetable_api.png" alt="timetable api" title="DB timetable API" width="200">

The desired data is collected using the Timetables API, version 1.0.213. The endpoint /fchg returns the current full changes to the timetable of a specific station, and the endpoint /station is used to obtain all stations in Germany.

Furthermore, geographical data, such as the longitude and latitude of each train station, has to be collected separately from the endpoint /stations of the Station Data API, version 2.7.423.

<img src="project/meteostat.png" alt="meteostat" title="Meteostat" width="200">

Weather data was collected from Meteostat Developers [^5], which offers historical weather and climate data, also under the CC BY 4.0 license. The endpoint /station was used to obtain all weather stations and the endpoint /hourly to get the weather data for the desired stations.

The data pipeline orders its steps by dependency, so that data is loaded efficiently and the run is not slowed down by fetching unnecessary data. The key steps are getting all main train stations in Germany and then retrieving the matching timetables and geo data. Next, all weather stations in Germany are obtained and matched with the train stations, so that weather data is collected solely for the matched stations.

Data Sources: [Meteostat](https://meteostat.net/en/) and [DB Station&Service AG](https://www.bahnhof.de/)

<img src="project/pipeline.drawio.png" alt="pipeline" title="pipeline" width="1000">

[^2]: https://mobilithek.info/offers/-3916716856299319220
[^3]: https://creativecommons.org/licenses/by/4.0/
[^4]: https://developers.deutschebahn.com/db-api-marketplace/apis/product/timetables/api/26494#/Timetables_10213/overview 
[^5]: https://dev.meteostat.net/


## 3. Main Problems and Mitigations

### 3.1 Train stations
Given the large number of train stations in Germany (17,081 according to the API), it was necessary to restrict the data to the main stations. To simplify the task, only train stations with "Hbf" (Hauptbahnhof, main station) in their name were kept.
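
The filter can be illustrated with a minimal sketch. The record layout (dicts with `id` and `name`) and the sample entries are assumptions made for illustration, not the actual API response. Matching "Hbf" as a whole word also catches names written without a space, such as "Frankfurt(Main)Hbf".

```python
import re

# Matches "Hbf" as a whole word, including names written without a space.
HBF_PATTERN = re.compile(r"\bHbf\b")

# Hypothetical sample records; the real API response has a different shape.
SAMPLE_STATIONS = [
    {"id": 1, "name": "Berlin Hbf"},
    {"id": 2, "name": "Berlin Ostbahnhof"},
    {"id": 3, "name": "Frankfurt(Main)Hbf"},
    {"id": 4, "name": "Erlangen"},
]


def filter_main_stations(stations):
    """Return only the stations whose name contains the word 'Hbf'."""
    return [s for s in stations if HBF_PATTERN.search(s["name"])]


main_stations = filter_main_stations(SAMPLE_STATIONS)
print([s["name"] for s in main_stations])
```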

### 3.2 Timetables
The main problem with the API is that it provides no historical data on train delays. To still obtain some data to work with, I set up a cron job that ran the pipeline every hour over the past weeks. However, this approach produced duplicate entries for trains. To address this, I kept only the most recent timetable update for each train.

The timetables obtained from the API come in an unusual XML format. To extract the desired information, I filtered for train departures and retrieved the planned and changed times, from which the delay can be calculated. I considered a train delayed if its delay exceeded 5 minutes.
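
A minimal sketch of this deduplication step, assuming each collected record carries a train identifier and the time at which it was fetched. The field names `train_id`, `fetched_at`, and `delay_min` are hypothetical, and the values are made up for illustration.

```python
from datetime import datetime

# Hypothetical records as they might accumulate from hourly pipeline runs.
SAMPLE_RECORDS = [
    {"train_id": "ICE-1", "fetched_at": datetime(2023, 7, 1, 10, 0), "delay_min": 0},
    {"train_id": "ICE-1", "fetched_at": datetime(2023, 7, 1, 11, 0), "delay_min": 7},
    {"train_id": "RE-2", "fetched_at": datetime(2023, 7, 1, 10, 0), "delay_min": 3},
]


def keep_latest(records):
    """Keep only the most recently fetched record for each train."""
    latest = {}
    for rec in records:
        key = rec["train_id"]
        if key not in latest or rec["fetched_at"] > latest[key]["fetched_at"]:
            latest[key] = rec
    return list(latest.values())


for rec in keep_latest(SAMPLE_RECORDS):
    print(rec["train_id"], rec["delay_min"])
```

Keying on the train identifier alone assumes that the identifier is unique per journey; if it repeats daily, the planned departure date would need to be part of the key.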
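
The delay calculation could look roughly like the sketch below. It assumes that each stop is an `s` element with a `dp` (departure) child whose `pt` and `ct` attributes hold the planned and changed times in `YYMMDDHHMM` format. This layout and the sample document are assumptions for illustration and should be checked against the API documentation.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

TIME_FORMAT = "%y%m%d%H%M"  # assumed timestamp layout, e.g. "2307011200"
DELAY_THRESHOLD_MIN = 5     # a train counts as delayed above this value

# Made-up sample document following the assumed structure.
SAMPLE_XML = """<timetable station="Example Hbf">
  <s id="train-a"><dp pt="2307011200" ct="2307011212"/></s>
  <s id="train-b"><dp pt="2307011230" ct="2307011233"/></s>
  <s id="train-c"><ar pt="2307011300" ct="2307011320"/></s>
</timetable>"""


def departure_delays(xml_text):
    """Return one row per departure that has both a planned and a changed time."""
    rows = []
    for stop in ET.fromstring(xml_text).iter("s"):
        dp = stop.find("dp")
        if dp is None:
            continue  # arrival-only stop, no departure to evaluate
        planned, changed = dp.get("pt"), dp.get("ct")
        if planned is None or changed is None:
            continue  # the delay cannot be computed without both times
        delta = datetime.strptime(changed, TIME_FORMAT) - datetime.strptime(planned, TIME_FORMAT)
        delay_min = delta.total_seconds() / 60
        rows.append({
            "stop_id": stop.get("id"),
            "delay_min": delay_min,
            "delayed": delay_min > DELAY_THRESHOLD_MIN,
        })
    return rows


for row in departure_delays(SAMPLE_XML):
    print(row)
```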

### 3.3 Weather stations
Obtaining weather data for all available weather stations was not feasible due to the large amount of data. To mitigate this, I matched each train station to its nearest weather station using a nearest-neighbor approach (TODO) and obtained weather data solely for those weather stations. Furthermore, I restricted the data to the year 2023.

Unfortunately, there was very little rain during the weeks of data collection. The limited timeframe and the small number of rainy days make it challenging to identify a significant correlation.
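
One possible way to do the nearest-neighbor matching is a brute-force search over great-circle (haversine) distances, sketched below. This is not necessarily the approach used in the pipeline. The record layout, the station identifiers, and the coordinates are rough illustrative assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_weather_station(train_station, weather_stations):
    """Return the weather station closest to the given train station."""
    return min(
        weather_stations,
        key=lambda w: haversine_km(
            train_station["lat"], train_station["lon"], w["lat"], w["lon"]
        ),
    )


# Hypothetical records with approximate coordinates.
train = {"name": "Berlin Hbf", "lat": 52.525, "lon": 13.369}
weather = [
    {"id": "W1", "lat": 52.47, "lon": 13.40},
    {"id": "W2", "lat": 48.35, "lon": 11.79},
]
print(nearest_weather_station(train, weather)["id"])
```

With only the main stations on one side, a brute-force search over all weather stations is cheap enough; a spatial index would only matter for much larger inputs.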

## 4. Answer Data Science Question (TODO)

### Install dependencies

### Load data

### Questions: 

### Where are the main stations in Germany?

### Where are the matching weather and train stations?

### How does the number of delayed trains develop over time?
Also the minutes of delay?

### How does the rain develop over time?

### Both together

### Calculate correlation

Plots for the correlation.

Train delay correlation by time of day, or by weekday versus weekend?

Correlation by time of day: maybe trains in the evening are delayed more often than in the morning.
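
For the correlation step, a minimal sketch of the Pearson correlation coefficient between hourly rainfall and the number of delayed trains is shown below. The coefficient is implemented by hand to avoid extra dependencies. The variable names and the toy values are purely illustrative and are not results from the collected data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy values for illustration only: rainfall in mm and delayed trains per hour.
rain_mm = [0.0, 0.5, 2.0, 4.0]
delayed_trains = [3, 4, 6, 9]
print(round(pearson(rain_mm, delayed_trains), 3))
```

A coefficient close to 1 or -1 would indicate a strong linear relationship, while a value near 0 would indicate none. With few rainy hours in the data, the value should be interpreted with caution.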


## 5. Conclusion 

Unfortunately, the limited timeframe made it hard to identify a significant correlation and draw a definitive conclusion. Nevertheless, I have tried to show what would be possible with more data and a longer timeframe.

In addition, it has to be considered that train delays have other causes. Even during the past week of really good weather, delays still occurred. Signal failures, technical issues, accidents requiring emergency services on the tracks, and staff-related factors, such as illness or waiting for staff from other trains, can all independently contribute to delays. Delays can also propagate, creating a cascading effect.

While the correlation between weather and train delays could not be fully established, it is clear that multiple causes influence the DB train timetables.