# The Influence of the Weather on Travel Behaviour in the Netherlands
### Project Group 04

*Members and Student Numbers:*

Hsuan-An Chu: 5914647\
Jarrik Overbosch: 6105734\
Mats Poppe: 5883245\
Cian Rippen: 5054141\
Sam Terstappen: 6078720



## 1. Introduction

The Netherlands is a country known for its efficient transportation system and its dynamic weather patterns. The interplay between these two elements often goes unnoticed, yet it has a profound impact on the daily routines and travel behaviors of its residents. Understanding how weather influences travel behavior holds practical significance for transportation planning and climate adaptation. This report delves into the relationship between weather conditions and travel choices, seeking to answer critical questions about how the Dutch people adapt their travel behaviors in response to varying weather conditions.

This study focuses on two primary domains: public transportation and road traffic. We aim to explore the extent to which weather variables such as rain, wind speed, and temperature impact the number of passengers using public transport services, as recorded through OV chipcard check-ins. Additionally, we seek to ascertain how these weather conditions affect road traffic, with an emphasis on traffic congestion. Moreover, we aim to investigate potential variations in travel behavior between urban and rural areas, recognizing that the geography and infrastructure of different regions may influence the choices people make when it comes to commuting.

## 2. Research Objective

*The research objective is to investigate the impact of various weather factors, including precipitation, temperature, wind speed, rain duration, rain amount, and visibility, on travel behavior in the Netherlands, with a specific focus on public transportation usage, road traffic patterns, the correlation with congestion, and potential variations in behavior between urban and rural areas*.

### SMART Criteria
#### Specific:
The research objective precisely defines the scope and nature of the study, focusing on the specific influence of multiple weather variables on travel behavior in the Netherlands.
#### Measurable:
The objective is quantifiable, as it seeks to measure the impact of weather factors on travel behavior through observable and recorded data.
#### Achievable:
It is realistic to investigate the relationship between weather and travel behavior, considering the available resources and the feasibility of data collection and analysis.
#### Relevant:
This research is highly relevant in the context of transportation planning and climate adaptation, addressing a critical issue with practical implications.
#### Time-bound: 
Although the objective itself does not state a deadline, the research is conducted within a defined timeframe of one month, ensuring it progresses in an organized and timely manner.




## What is the influence of the weather on travel behaviour in The Netherlands?
* What is the effect of weather on the number of public transport passengers, based on OV chipcard check-ins? And is there a correlation?
* Is there a difference in public transport behaviour between weekdays and weekend?
* What is the effect of weather on road traffic, based on highway congestion data?  And is there a correlation?
* Is there a difference in road traffic behaviour between weekdays and weekend?
* Is there a difference in road traffic behaviour between the Randstad and the area outside of the Randstad?


(These may need some adjustments in the way they are formulated)

## 3. Data Pipeline
(Explain what data we used and why)


### Packages

For this project, a number of packages are used for data processing and displaying of the results. They are imported below. 

In [38]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import geopandas as gpd
from shapely.geometry import MultiPolygon
from plotly.offline import plot
from plotly.subplots import make_subplots

### Weather Data

Weather data from the Royal Netherlands Meteorological Institute (KNMI) is used. For the report, we selected three main weather variables from the daily KNMI export for our analysis: the daily rain amount (RH), the daily mean wind speed (FG), and the daily mean temperature (TG).

In [39]:
#reading weather data
df_weather = pd.read_csv("Literature and data/KNMI_daily_0123-0923.csv")
df_weather.columns = df_weather.columns.str.strip()
df_station_name = pd.read_csv("Literature and data/KNMI_station.csv")
df_station_name.columns = df_station_name.columns.str.strip()
df_weather.head()

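| | # STN | YYYYMMDD | FG | FXX | TG | DR | RH | VVN |
|---|---|---|---|---|---|---|---|---|
| 0 | 209 | 20230101 | 96 | 200 | | | | |
| 1 | 209 | 20230102 | 69 | 100 | | | | |
| 2 | 209 | 20230103 | 88 | 200 | | | | |
| 3 | 209 | 20230104 | 141 | 220 | | | | |
| 4 | 209 | 20230105 | 90 | 160 | | | | |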


As can be seen from the head of the dataframe above, some weather stations do not collect every type of weather data. To confirm that there is enough data coverage of the whole Netherlands, all weather stations are plotted on a map, coloured by whether they collect precipitation data.

In [40]:
#plotting the weather stations
df_weather.rename(columns={"# STN":"STN"}, inplace=True)
# astype returns a new Series, so the result is assigned back to make both merge keys integers
df_weather["STN"] = df_weather["STN"].astype(int)
df_station_name["STN"] = df_station_name["STN"].astype(int)

df_station_data = df_station_name.merge(df_weather, on="STN")
df_station_data.drop_duplicates(subset="STN",inplace=True)
df_station_data["RH"] = pd.to_numeric(df_station_data["RH"], errors="coerce")
df_station_data["hasraindata"] = np.where(df_station_data["RH"]>=0, True, False)
df_station_data = df_station_data.iloc[:,[0,1,2,3,4,12]]
df_station_data.drop_duplicates(inplace=True)

px.set_mapbox_access_token("pk.eyJ1IjoiaHN1YW4tc2hhbmUiLCJhIjoiY2xvMnB3b2NqMDl3YzJpbW56eWxnNHRrNSJ9.zJi-0PqhXOPzbqZ973FdxA")
fig = px.scatter_mapbox(df_station_data, lat="LAT(north)", lon="LON(east)", color="hasraindata", zoom = 5, hover_name="NAME", labels={"hasraindata":"Rain data collection"})
fig.update_layout(title = "Weather Station Data used")
fig.show()

From the figure above, we can confirm that there is enough data coverage. The data is therefore processed further: the columns are renamed, the values are averaged over all stations per date, and the averages are multiplied by 0.1, since according to the official documentation the numbers are provided in units of 0.1. Visibility is left unscaled.

In [41]:
#process data
df_weather.columns = ["STN", "dates", "windspeed","windspeed_max","temperature","rain_duration","rain_amount", "visibility"]  # rename columns
df_weather = df_weather.drop(columns=["STN"])
df_weather[df_weather.columns.difference(["dates"])] = df_weather[df_weather.columns.difference(["dates"])].apply(pd.to_numeric, errors="coerce")

df_weather_mean = df_weather.groupby(["dates"], as_index=False).mean(numeric_only=True)
df_weather_mean[df_weather_mean.columns.difference(["dates", "visibility"])] = df_weather_mean[df_weather_mean.columns.difference(["dates", "visibility"])]*0.1


df_weather_mean["dates"] = pd.to_datetime(df_weather_mean["dates"], format="%Y%m%d")
df_weather_mean.head()

| | dates | windspeed | windspeed_max | temperature | rain_duration | rain_amount | visibility |
|---|---|---|---|---|---|---|---|
| 0 | 2023-01-01 | 7.282609 | 18.608696 | 11.714706 | 5.363636 | 4.330303 | 52.36 |
| 1 | 2023-01-02 | 4.986957 | 12.369565 | 8.447059 | 4.651515 | 3.918182 | 38.52 |
| 2 | 2023-01-03 | 6.228261 | 16.152174 | 6.338235 | 1.469697 | 1.424242 | 14.36 |
| 3 | 2023-01-04 | 10.956522 | 19.521739 | 11.044118 | 8.351515 | 11.018182 | 33.64 |
| 4 | 2023-01-05 | 6.521739 | 14.217391 | 10.144118 | 1.621212 | 0.818182 | 36.24 |
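
The processing steps above can be checked on a few made-up rows. The sketch below is illustrative only: the values are invented, and it assumes that this export follows the KNMI daily-data convention in which `RH = -1` marks a precipitation sum below 0.05 mm. If that assumption holds, such entries pull the station average down slightly unless they are set to zero before averaging.

```python
import pandas as pd

# Invented rows in the style of the raw export: strings in units of 0.1 mm,
# blanks where a station does not measure rain, -1 for a trace amount (assumption).
raw = pd.DataFrame({
    "dates": ["20230101", "20230101", "20230101", "20230102", "20230102"],
    "rain_amount": ["   50", "   -1", "     ", "    0", "   20"],
})

# Blanks cannot be parsed as numbers, so errors="coerce" turns them into NaN.
rain = pd.to_numeric(raw["rain_amount"].str.strip(), errors="coerce")

# Treat the -1 sentinel as 0 mm; NaN (no measurement) stays NaN.
raw["rain_mm"] = rain.replace(-1, 0) * 0.1

# mean() skips NaN, so stations without rain data do not affect the average.
daily = raw.groupby("dates", as_index=False)["rain_mm"].mean()
daily["dates"] = pd.to_datetime(daily["dates"], format="%Y%m%d")
print(daily)
```

In this toy sample, leaving the `-1` in place would give an average of 2.45 mm for the first day instead of 2.5 mm. Whether this matters for the real data depends on how often the sentinel occurs.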


After processing the data, graphs can be made showing the windspeed, temperature and rain_amount in 2023 so far. 

In [42]:
#plot the weather
fig = px.line(df_weather_mean, x="dates", y="windspeed", title="Average windspeed data (m/s)")
fig.show()

In [43]:
fig = px.line(df_weather_mean, x="dates", y="temperature", title="Average temperature data (C)")
fig.show()

In [44]:
fig = px.line(df_weather_mean, x="dates", y="rain_amount", title="Average rain data (mm/day)")
fig.show()

### OV chipkaart data

Here, the OV check-in data is imported. The number of check-ins is given in thousands, so that column is multiplied by 1000. The data is then grouped per day and summed. A graph displaying the check-ins per day in 2023 so far is shown.

In [45]:
#reading
file = "Literature and Data/20230908_Instappers_per_uur_export_V3.csv"
df_OV = pd.read_csv(file)

#processing
df_OV["Aantal_check_ins"] = df_OV["Aantal_check_ins"] * 1000
df_OV["Aantal_check_ins"] = df_OV["Aantal_check_ins"].astype("int")
df_OV_sum = df_OV.groupby(by="Datum", sort=False)["Aantal_check_ins"].sum().reset_index()
df_OV_sum["Datum"] = pd.to_datetime(df_OV_sum["Datum"], format="%d-%m-%Y")
df_OV_sum.head()

#plotting
fig = px.line(df_OV_sum, x="Datum", y="Aantal_check_ins", title="Number of check-ins in 2023")
fig.update_layout(xaxis_title="Date", yaxis_title="Number of OV check-ins")
fig.show()
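
One detail of the conversion above is worth checking: `astype("int")` truncates instead of rounding, so a product that lands just below a whole number in floating point loses one check-in. The sketch below shows this with invented values (the numbers are not from the data set). The daily totals shown later in this report that end in 99, such as 1003699, may be a sign of this effect, although that has not been verified against the raw file.

```python
import pandas as pd

# Invented hourly check-in counts, expressed in thousands.
thousands = pd.Series([1.003, 2.5, 0.75])

scaled = thousands * 1000
# 1.003 * 1000 evaluates to 1002.9999999999999 in binary floating point.
truncated = scaled.astype("int")          # cuts off the fractional part
rounded = scaled.round().astype("int")    # rounds to the nearest whole number first

print(pd.DataFrame({"truncated": truncated, "rounded": rounded}))
```

Rounding before the integer conversion is a small change that keeps the totals exact.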

### Congestion Data

The following code was used to convert 10 separate Excel files, containing the congestion data for each month of 2023 up to and including October, into one Pickle file. It was run once and then commented out, as it does not need to be run again. The Pickle file is then unpickled and the data is stored in a dataframe. The data is processed and then summed per day.

In [46]:
# df_cong = pd.DataFrame()
# months = ["jan", "feb", "mar", "apr", "mei", "jun", "jul", "aug", "sep", "okt"]
# for i in months:
#     month = pd.read_excel("Literature and Data/Congestion_data_2022/" + i + ".xlsx")
#     df_cong = pd.concat([df_cong, month])
#     print(i)

# df_cong.to_pickle("Literature and data/df_cong_pickle.pkl")

In [47]:
unpickled_df_cong = pd.read_pickle("Literature and data/df_cong_pickle.pkl").reset_index()
# .copy() makes an independent dataframe, so the assignment below does not raise a SettingWithCopyWarning
df_cong_filt = unpickled_df_cong[["DatumFileBegin", "TijdFileBegin", "TijdFileEind", "FileZwaarte", "Oorzaak_4"]].copy()

# replace decimal commas with points to be able to sum
df_cong_filt["FileZwaarte"] = df_cong_filt["FileZwaarte"].str.replace(",", ".", regex=False).astype(float)
# df_cong_filt.head(30)
df_cong_grouped = df_cong_filt.groupby("DatumFileBegin")["FileZwaarte"].sum().reset_index()
df_cong_grouped.head()



| | DatumFileBegin | FileZwaarte |
|---|---|---|
| 0 | 2023-01-01 | 93.5 |
| 1 | 2023-01-02 | 2549.248 |
| 2 | 2023-01-03 | 4404.093 |
| 3 | 2023-01-04 | 6634.145 |
| 4 | 2023-01-05 | 7113.57 |
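
The decimal-comma conversion and the daily sum can be illustrated on a small stand-in for the congestion table. The rows below are invented for the example; only the column names are taken from the notebook.

```python
import pandas as pd

# Invented congestion records with Dutch-style decimal commas stored as text.
raw = pd.DataFrame({
    "DatumFileBegin": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "FileZwaarte": ["43,5", "50,0", "2549,248"],
})

# Work on an explicit copy so the column assignment is unambiguous.
clean = raw.copy()
# regex=False treats "," as a literal character rather than a pattern.
clean["FileZwaarte"] = (
    clean["FileZwaarte"].str.replace(",", ".", regex=False).astype(float)
)

# One row per day with the summed congestion severity.
per_day = clean.groupby("DatumFileBegin", as_index=False)["FileZwaarte"].sum()
print(per_day)
```

If the values were left as text, summing them would concatenate the strings instead of adding the numbers, which is why the conversion has to happen before the `groupby`.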


The congestion severity data is displayed in a graph. Two days with extraordinarily bad weather are highlighted. It can be seen that these days are outliers in terms of congestion. It should be noted that on April 6th, congestion was also made worse because many people were travelling for the Easter weekend.

In [48]:
fig = px.line(df_cong_grouped, x="DatumFileBegin", y="FileZwaarte", title="Congestion severity per day")
fig.update_layout(xaxis_title="Date", yaxis_title="Congestion severity (km*min)")
# df_cong_grouped["DatumFileBegin"] = pd.to_datetime(df_cong_grouped["DatumFileBegin"])
# print(type(df_cong_grouped.iloc[95,0]))
fig.add_trace(
    go.Scatter(
        x=[df_cong_grouped.iloc[95,0], df_cong_grouped.iloc[18,0]],
        y=[df_cong_grouped.iloc[95,1], df_cong_grouped.iloc[18,1]],
        mode="markers",
        name="Extraordinarily bad weather days",
        showlegend=True)
)
fig.update_layout(legend=dict(
    orientation="h", 
    xanchor="right",
    y=1,
    x=1
))
fig.show()
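
The two highlighted days are picked by row position (`iloc[18]` and `iloc[95]`), which only points at the intended days as long as no date is missing from the table. A more robust option is to select them by date. The sketch below uses an invented stand-in for `df_cong_grouped` and assumes the date column has been converted to datetimes; the date 2023-01-19 is inferred from the row position and is an assumption, not something stated in the report.

```python
import pandas as pd

# Invented stand-in for df_cong_grouped: one row per day from 1 January 2023.
demo = pd.DataFrame({
    "DatumFileBegin": pd.date_range("2023-01-01", periods=100, freq="D"),
    "FileZwaarte": range(100),
})

# Days to highlight, written as dates instead of row numbers.
bad_weather_days = pd.to_datetime(["2023-01-19", "2023-04-06"])
highlight = demo[demo["DatumFileBegin"].isin(bad_weather_days)]
print(highlight)
```

The `x` and `y` arguments of the marker trace could then be taken from `highlight["DatumFileBegin"]` and `highlight["FileZwaarte"]`.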

## 4. Results

In the following sub-chapters, the subquestions will be explored and the results dissected.



#### 1. What is the effect of weather on the number of public transport passengers, based on OV chipcard check-ins? And is there a correlation?

In this sub-chapter, the number of check-ins is compared with the weather data and tested for correlation.

Based on the five days below, no clear correlation can be drawn between the number of check-ins and the amount of precipitation or its duration. The highest numbers of check-ins were recorded on January 3rd and 5th, when there was barely any precipitation. January 4th experienced a significant amount of precipitation, yet its number of check-ins was only slightly lower than on January 3rd. Since January 1st is a Sunday, the low number of check-ins on that day can be clearly linked to the weekend. The temperature fluctuated little over the week, and no clear patterns can be drawn from this information. Wind speed remained fairly steady as well, so no correlation can be drawn from the table below.

In [49]:
df_OV_sum.rename(columns={"Datum":"dates"}, inplace=True)
df_OV_weather = df_OV_sum.merge(df_weather_mean, on = "dates")
df_OV_weather.head()

| | dates | Aantal_check_ins | windspeed | windspeed_max | temperature | rain_duration | rain_amount | visibility |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-01 | 1003699 | 7.282609 | 18.608696 | 11.714706 | 5.363636 | 4.330303 | 52.36 |
| 1 | 2023-01-02 | 2074400 | 4.986957 | 12.369565 | 8.447059 | 4.651515 | 3.918182 | 38.52 |
| 2 | 2023-01-03 | 2465899 | 6.228261 | 16.152174 | 6.338235 | 1.469697 | 1.424242 | 14.36 |
| 3 | 2023-01-04 | 2446900 | 10.956522 | 19.521739 | 11.044118 | 8.351515 | 11.018182 | 33.64 |
| 4 | 2023-01-05 | 2643299 | 6.521739 | 14.217391 | 10.144118 | 1.621212 | 0.818182 | 36.24 |


In [50]:
#OV v Rain
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=df_OV_weather["dates"], y=df_OV_weather["Aantal_check_ins"], name="Check-ins"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df_OV_weather["dates"], y=df_OV_weather["rain_amount"], name="Rain amount"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Rain Amount vs Check-ins"
)

# Set x-axis title
fig.update_xaxes(title_text="Dates")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Check-ins</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Avg Rain amount (mm/day)</b>", secondary_y=True)

fig.show()

For the period from January to September, the check-in data is compared with the precipitation levels. Once again, no clear conclusions can be drawn. Surprisingly, some of the heaviest rain days still show a relatively average number of check-ins. Even during the dry spell between May and July, when almost no rain fell, the number of check-ins remained relatively average. Although there was a decrease, it cannot be directly attributed to precipitation.

The highest level of precipitation occurred in July, yet this month has the lowest number of check-ins. This is likely due to the summer vacation period, leading to lower train usage. Train users are likely on vacation, and students are not traveling to school during this time.

Then the check-in data is plotted together with the wind speed data.

In [51]:
#OV v wind
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=df_OV_weather["dates"], y=df_OV_weather["Aantal_check_ins"], name="Check-ins"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df_OV_weather["dates"], y=df_OV_weather["windspeed"], name="Wind Speed"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Wind speed vs Check-ins"
)

# Set x-axis title
fig.update_xaxes(title_text="Dates")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Check-ins</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Avg Wind Speed(m/s)</b>", secondary_y=True)

fig.show()

Again, the number of check-ins stays relatively steady throughout the year, with the exception of the summer months. The wind speed fluctuates during these months, and there is no clear indication that the number of check-ins is directly related to the wind.

Next, the OV and temperature data are plotted.

In [52]:
#OV v Temps
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=df_OV_weather["dates"], y=df_OV_weather["Aantal_check_ins"], name="Check-ins"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df_OV_weather["dates"], y=df_OV_weather["temperature"], name="Temperature"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Temperature vs Check-ins"
)

# Set x-axis title
fig.update_xaxes(title_text="Dates")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Check-ins</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Avg temperature(C)</b>", secondary_y=True)

fig.show()

Once more, the number of check-ins remains consistent throughout the months, showing no significant correlation with temperature change. Whether the temperature increases or decreases, no noticeable spikes in check-ins can be directly linked to temperature changes. As in the rainfall plot, the dip in check-ins during the summer months stands out.

Since it is hard to find a correlation by observing the plots alone, the data is processed further for a correlation analysis.

In [53]:
#OV correlation test all
OV_corr_rains = df_OV_weather["rain_amount"].corr(df_OV_weather["Aantal_check_ins"])
OV_corr_wind = df_OV_weather["windspeed"].corr(df_OV_weather["Aantal_check_ins"])
OV_corr_temps = df_OV_weather["temperature"].corr(df_OV_weather["Aantal_check_ins"])

OV_corr_data = {"WeatherTypes": ["Rain", "Wind", "Temps"], "Correlation": [OV_corr_rains,OV_corr_wind,OV_corr_temps]}
df_OV_corr = pd.DataFrame(OV_corr_data)

fig = px.bar(df_OV_corr, x="WeatherTypes", y="Correlation", color="WeatherTypes",text_auto=True, title="Correlation between Weather conditions and OV-checkins (All data)")
fig.show()

When the three correlations are compared, some weak patterns emerge. Rain shows a negative correlation, meaning that as the amount of rain increases, train usage tends to decrease. However, it is essential to note that the inclusion of the summer months, which contain outlier data, may have influenced the overall correlation level. With a correlation coefficient of -0.02, the relationship is negative but very weak, and no significant conclusions can be drawn from it. Even the negative direction is unexpected, as in bad weather one would assume that people prefer a mode of transport in which they can stay dry.

Moving on to the correlation between wind and check-ins, a very weak positive relationship is observed (0.07), meaning that an increase in wind corresponds to a slight increase in train usage. Nevertheless, the correlation is too weak to be considered significant.

Lastly, the correlation between temperature and check-ins reveals a weak negative correlation (-0.19). While this is the strongest correlation among the three, it remains weak, and no clear or significant conclusions can be drawn. Once again, the negative correlation might be influenced by the presence of large data outliers in the summer months.

So, to answer the question "What is the effect of weather on the number of public transport passengers, based on OV chipcard check-ins? And is there a correlation?": based on the results analysed above, no clear or sufficiently strong correlation can be found. The weather does not appear to have a significant effect on the number of check-ins.
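
A rough way to judge how far such coefficients are from zero is the t-statistic for a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)). The sketch below applies it to the coefficients quoted in this report (-0.02, 0.07 and -0.19). The sample size of 270 days is an assumption (roughly January to September), and the test treats the days as independent observations, which daily travel data usually is not, so the outcome is indicative only.

```python
import math

def correlation_t_statistic(r: float, n: int) -> float:
    """t-statistic for testing whether a Pearson correlation r,
    computed from n paired observations, differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))

N_DAYS = 270  # assumption: about one observation per day, January to September 2023

for label, r in [("Rain", -0.02), ("Wind", 0.07), ("Temps", -0.19)]:
    t = correlation_t_statistic(r, N_DAYS)
    print(f"{label}: r = {r:+.2f}, t = {t:+.2f}")
```

With about 270 observations, an absolute t-value above roughly 1.97 corresponds to the usual 5% level. Under these assumptions only the -0.19 coefficient exceeds that threshold, which illustrates that the strength of a correlation and its statistical significance are separate questions. Neither says anything about cause.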

#### 2. Is there a difference in public transport behaviour between weekdays and weekend?

In the upcoming graphs, we will explore the effects of weather on check-ins during the weekend. Given that travel patterns between weekdays and weekends can vary significantly due to the absence of work and school-related travel, it is important to investigate these differences.

In [54]:
df_OV_sum["dates"] = pd.to_datetime(df_OV_sum["dates"], format="%d-%m-%Y")

df_OV_weekends = df_OV_sum[(df_OV_sum["dates"].dt.dayofweek == 5) | (df_OV_sum["dates"].dt.dayofweek == 6)]
fig = px.line(df_OV_weekends, x="dates", y="Aantal_check_ins", title="Number of check-ins on weekend days 2023")
fig.update_xaxes(title_text="Date")
fig.update_yaxes(title_text="Number of OV check-ins")

fig.show()

In the graph above, the number of check-ins during the weekend remains relatively consistent throughout the months, with notably more check-ins on Saturdays than on Sundays. This difference causes the graph's fluctuation. Notably, this graph exhibits fewer outliers during the vacation months than the graphs in the previous subquestion. This is likely because weekend travel contains little work and school-related travel to begin with, so vacation periods change it less.

Then, the same correlation analysis is done with the weekend data.

In [55]:
# rename returns a new dataframe here, which avoids the SettingWithCopyWarning that inplace=True raises on a filtered slice
df_OV_weekends = df_OV_weekends.rename(columns={"Datum":"dates"})
df_OV_weather_weekends = df_OV_weekends.merge(df_weather_mean, on = "dates")

#OV correlation test weekends
OV_corr_rains_weekends = df_OV_weather_weekends["rain_amount"].corr(df_OV_weather_weekends["Aantal_check_ins"])
OV_corr_wind_weekends = df_OV_weather_weekends["windspeed"].corr(df_OV_weather_weekends["Aantal_check_ins"])
OV_corr_temps_weekends = df_OV_weather_weekends["temperature"].corr(df_OV_weather_weekends["Aantal_check_ins"])

OV_corr_data_weekends = {"WeatherTypes": ["Rain", "Wind", "Temps"], "Correlation": [OV_corr_rains_weekends,OV_corr_wind_weekends,OV_corr_temps_weekends]}
df_OV_corr_weekends = pd.DataFrame(OV_corr_data_weekends)

fig = px.bar(df_OV_corr_weekends, x="WeatherTypes", y="Correlation", color="WeatherTypes",text_auto=True, title="Correlation between Weather conditions and OV-checkins (weekends data)")
fig.show()



Above are the correlations between different weather conditions and the number of check-ins on weekends. Once again, all three correlations are weak to very weak. Both rain and wind exhibit a negative correlation, indicating that an increase in rain or wind corresponds to fewer people traveling by train. Conversely, there is a weak positive correlation between temperature and train usage, suggesting that more people tend to travel by train when the temperature increases. However, as previously stated, these correlations are weak to very weak, and no significant conclusions can be drawn.

Addressing the question of whether there is a difference in public transport behaviour between weekdays and weekends, the weekend correlations differ from those for all days in the previous subquestion. Rain, for instance, shifted from a correlation of -0.02 to -0.16 on weekends, meaning that slightly fewer people travel by train when it rains on weekends. This result is unexpected. Similarly, wind went from a correlation of 0.07 to -0.18, indicating that more wind is associated with slightly more train usage over all days but with less train usage on weekends. Temperature also relates to train usage differently, with a correlation of -0.19 over all days, indicating a decrease, and a correlation of 0.02 on weekends, suggesting a marginal increase. However, all these correlations remain too weak to draw any conclusive findings. In summary, while there is a difference in the strength of the correlations, they are still too weak to assert that weather significantly influences the number of check-ins.
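
The comparison above sets weekends against all days. A direct weekday versus weekend comparison could be made by splitting the merged data on the day of the week first. The sketch below shows one possible way to do that; the helper name `correlations_by_day_type` and the sample data are invented for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlations_by_day_type(df, target, weather_cols, date_col="dates"):
    """Pearson correlation between each weather column and `target`,
    computed separately for weekdays and weekend days."""
    is_weekend = df[date_col].dt.dayofweek >= 5  # Monday=0, ..., Saturday=5, Sunday=6
    parts = {"weekdays": df[~is_weekend], "weekends": df[is_weekend]}
    return pd.DataFrame(
        {name: part[weather_cols].corrwith(part[target]) for name, part in parts.items()}
    )

# Invented two-week sample starting on Monday 2023-01-02.
demo = pd.DataFrame({
    "dates": pd.date_range("2023-01-02", periods=14, freq="D"),
    "rain_amount": [0, 1, 2, 3, 4, 0, 5, 1, 0, 2, 4, 3, 6, 1],
})
weekend = demo["dates"].dt.dayofweek >= 5
# Constructed so that rain raises check-ins on weekdays and lowers them on weekends.
demo["Aantal_check_ins"] = np.where(
    weekend, 100 - 10 * demo["rain_amount"], 200 + 10 * demo["rain_amount"]
)

result = correlations_by_day_type(demo, "Aantal_check_ins", ["rain_amount"])
print(result)
```

Applied to the notebook's data, the call would be `correlations_by_day_type(df_OV_weather, "Aantal_check_ins", ["rain_amount", "windspeed", "temperature"])`, which returns one column of coefficients for weekdays and one for weekends.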

#### 3. What is the effect of weather on road traffic, based on highway congestion data?  And is there a correlation?

Considering the surprising lack of impact of weather on check-ins, this sub-chapter explores whether congestion levels increase due to bad weather. This can provide an indication of whether travelers opt for cars on days with bad weather.

In [56]:
df_cong_grouped.rename(columns={"DatumFileBegin":"dates"}, inplace=True)
df_cong_weather = df_cong_grouped.merge(df_weather_mean, on = "dates")
df_cong_weather.head()

| | dates | FileZwaarte | windspeed | windspeed_max | temperature | rain_duration | rain_amount | visibility |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-01 | 93.5 | 7.282609 | 18.608696 | 11.714706 | 5.363636 | 4.330303 | 52.36 |
| 1 | 2023-01-02 | 2549.248 | 4.986957 | 12.369565 | 8.447059 | 4.651515 | 3.918182 | 38.52 |
| 2 | 2023-01-03 | 4404.093 | 6.228261 | 16.152174 | 6.338235 | 1.469697 | 1.424242 | 14.36 |
| 3 | 2023-01-04 | 6634.145 | 10.956522 | 19.521739 | 11.044118 | 8.351515 | 11.018182 | 33.64 |
| 4 | 2023-01-05 | 7113.57 | 6.521739 | 14.217391 | 10.144118 | 1.621212 | 0.818182 | 36.24 |


Once again, the same five days are examined, but this time against highway congestion. While the weather fluctuates over these five days, highway congestion steadily increases throughout the week. No decisive correlation can be drawn between congestion and the weather on these five days. The congestion is at its lowest on the first of January, likely because it is a Sunday. Other than that, it consistently appears to increase during the week.

In [57]:
#congestion v rain
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=df_cong_weather["dates"], y=df_cong_weather["FileZwaarte"], name="Congestions"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df_cong_weather["dates"], y=df_cong_weather["rain_amount"], name="Rain amount"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Rain amount vs Congestion severity"
)

# Set x-axis title
fig.update_xaxes(title_text="Dates")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Congestion severity (km*min)</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Avg Rain amount (mm/day)</b>", secondary_y=True)

fig.show()

Surprisingly, the highest peaks of rain do not correspond with increased congestion, and vice versa: lower rainfall does not align with reduced congestion. At first glance, this graph does not show a clear correlation between congestion and the amount of rain. This is again surprising, given the common expectation that bad weather increases car accidents, potentially leading to higher congestion. Once again, there is a noticeable decrease during the summer vacation months, likely attributed to reduced work and school-related travel.

In [58]:
#congestion v wind
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=df_cong_weather["dates"], y=df_cong_weather["FileZwaarte"], name="Congestions"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df_cong_weather["dates"], y=df_cong_weather["windspeed"], name="Wind speed"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Wind speed vs Congestion severity"
)

# Set x-axis title
fig.update_xaxes(title_text="Dates")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Congestion severity (km*min)</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Avg Wind Speed(m/s)</b>", secondary_y=True)

fig.show()

Again, the peaks in wind speed do not coincide with the peaks in congestion, and there is again less congestion in the summer months.

In [59]:
#Congestion v Temps
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=df_cong_weather["dates"], y=df_cong_weather["FileZwaarte"], name="Congestions"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df_cong_weather["dates"], y=df_cong_weather["temperature"], name="Temperature"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Temperature vs Congestion severity"
)

# Set x-axis title
fig.update_xaxes(title_text="Dates")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Congestion severity (km*min)</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Avg Temperature(C)</b>", secondary_y=True)

fig.show()

Once more, the peaks in temperature do not coincide with the peaks in congestion. Again, there is less congestion in the summer months.

The same correlation test is then conducted for further investigation.

In [60]:
#correlation test all
cong_corr_rains = df_cong_weather["rain_amount"].corr(df_cong_weather["FileZwaarte"])
cong_corr_wind = df_cong_weather["windspeed"].corr(df_cong_weather["FileZwaarte"])
cong_corr_temps = df_cong_weather["temperature"].corr(df_cong_weather["FileZwaarte"])

cong_corr_data = {"WeatherTypes": ["Rain", "Wind", "Temps"], "Correlation": [cong_corr_rains,cong_corr_wind,cong_corr_temps]}
df_cong_corr = pd.DataFrame(cong_corr_data)

fig = px.bar(df_cong_corr, x="WeatherTypes", y="Correlation", color="WeatherTypes",text_auto=True, title="Correlation between Weather conditions and congestion (All data)")
fig.show()

To answer the question regarding the effect of weather on road traffic, based on highway congestion data, we find that an increase in rain is associated with an increase in congestion, indicated by a positive correlation coefficient of 0.10. From this, one could infer a slight increase in car travel during rainy conditions. It could also be that rain contributes to more car accidents, subsequently leading to higher congestion. However, the correlation is so weak that no significant conclusions can be drawn.

Wind exhibits a negative but negligible correlation of -0.002, which means there is effectively no linear relationship between wind speed and congestion. This result is unexpected, as one might anticipate increased car usage or accidents during strong winds.

Lastly, an increase in temperature correlates with a decrease in congestion, with a correlation of -0.05. This could be influenced by the summer months, which combine lower car usage with higher temperatures. Once again, all correlations are so low that no significant conclusions can be reached. Peaks in congestion may be influenced by various other factors such as roadwork, vacation periods, or people choosing to stay home during adverse weather conditions. Further research is needed to clearly identify the primary contributors to peak car congestion.

#### 4. Is there a difference in road traffic behaviour between weekdays and weekend?

In the upcoming graphs, we will explore the effects of weather on car congestion during the weekend. Given that travel patterns between weekdays and weekends can vary significantly due to the absence of work and school-related travel, it is important to investigate these differences.

In [61]:
df_cong_grouped["dates"] = pd.to_datetime(df_cong_grouped["dates"])
df_cong_grouped["DayOfWeek"] = df_cong_grouped["dates"].dt.day_name()

# Filter weekends in 2023
df_cong_weekends = df_cong_grouped[(df_cong_grouped["DayOfWeek"].isin(["Saturday", "Sunday"]))]

fig = px.line(df_cong_weekends, x="dates", y="FileZwaarte", title="Congestion in the weekends")
fig.update_layout(xaxis_title="Date", yaxis_title="Congestion severity (km*min)")

df_cong_weekends.head()
fig.show()

The graph above shows the congestion on Saturdays and Sundays. No clear pattern can be immediately identified, except that Saturdays generally have more congestion than Sundays, with the exception of a few outliers such as the peak on May 21st. Next, a correlation test is done.

In [62]:
# rename returns a new dataframe here, which avoids the SettingWithCopyWarning that inplace=True raises on a filtered slice
df_cong_weekends = df_cong_weekends.rename(columns={"DatumFileBegin":"dates"})
df_cong_w_weekends = df_cong_weekends.merge(df_weather_mean, on = "dates")

#correlation test weekends
cong_corr_rains_weekends = df_cong_w_weekends["rain_amount"].corr(df_cong_w_weekends["FileZwaarte"])
cong_corr_wind_weekends = df_cong_w_weekends["windspeed"].corr(df_cong_w_weekends["FileZwaarte"])
cong_corr_temps_weekends = df_cong_w_weekends["temperature"].corr(df_cong_w_weekends["FileZwaarte"])

cong_corr_data_weekends = {"WeatherTypes": ["Rain", "Wind", "Temps"], "Correlation": [cong_corr_rains_weekends,cong_corr_wind_weekends,cong_corr_temps_weekends]}
df_cong_corr_weekends = pd.DataFrame(cong_corr_data_weekends)

fig = px.bar(df_cong_corr_weekends, x="WeatherTypes", y="Correlation", color="WeatherTypes",text_auto=True, title="Correlation between Weather conditions and congestion (weekends data)")
fig.show()



To answer the question "Is there a difference in road traffic behaviour between weekdays and weekends?": over the whole week, rain has a positive correlation of 0.10 with car congestion. When only weekends are considered, this correlation shifts to -0.06, so on weekends more rainfall appears to go together with slightly less congestion. This is unexpected, as one would anticipate that higher rainfall leads to more car travel and more accidents, and therefore to more congestion. The correlation for wind changes from a very weak -0.002 to a weak -0.12, indicating that windier weekend days tend to have slightly less congestion.

The correlation for temperature shifts notably from -0.05 to 0.50 on weekends, a moderate positive relationship: when the temperature increases, car congestion also increases. This finding is curious, as one might expect people to opt for alternative transportation, such as cycling, on hot days.

The correlations for rain and wind remain too weak to indicate a meaningful effect, but the moderate correlation for temperature suggests that temperature is related to the level of car congestion during weekends.
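Whether a coefficient such as 0.50 or -0.06 can be distinguished from zero depends on how many weekend days it is based on. The sketch below is not part of the original analysis. It uses the Fisher z-transformation to attach an approximate 95% confidence interval to a Pearson correlation. The function name `pearson_ci` and the sample size `n_days = 78` are assumptions for illustration; the actual number of weekend days should be taken from `len(df_cong_w_weekends)`.

```python
import math
from statistics import NormalDist

def pearson_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation (Fisher z)."""
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)                              # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)                    # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

n_days = 78  # hypothetical number of weekend days, NOT taken from the data
for name, r in [("Rain", -0.06), ("Wind", -0.12), ("Temps", 0.50)]:
    lo, hi = pearson_ci(r, n_days)
    verdict = "excludes 0" if lo > 0 or hi < 0 else "includes 0"
    print(f"{name}: r={r:+.2f}, 95% CI=({lo:+.2f}, {hi:+.2f}) -> {verdict}")
```

Under this assumed sample size, an interval that includes zero would be in line with the reading above that the rain and wind correlations are too weak to interpret.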

#### 5. Is there a difference in road traffic behaviour between the Randstad and the area outside the Randstad?

To investigate the difference between areas, the original weather data needs to be split. The shapefile of the Randstad area and the weather stations are drawn on the same map, so that it can be checked visually whether each station lies inside the area.

In [63]:
# Read the Randstad shapefile and reproject it to WGS84 (lat/lon) for plotting
path_randstad = "Literature and data/Randstad_SHP/Randstad.shp"
gdf_randstad = gpd.read_file(path_randstad)
gdf_randstad = gdf_randstad.to_crs("WGS84")

In [64]:
px.set_mapbox_access_token("pk.eyJ1IjoiaHN1YW4tc2hhbmUiLCJhIjoiY2xvMnB3b2NqMDl3YzJpbW56eWxnNHRrNSJ9.zJi-0PqhXOPzbqZ973FdxA")
fig = px.scatter_mapbox(df_station_data, lat="LAT(north)", lon="LON(east)", color="hasraindata", zoom = 5, hover_name="NAME", labels={"hasraindata":"Rain data collection"})
# Overlay the Randstad boundary as a line layer on the station map
fig.update_layout(
    coloraxis_showscale=False,
    title = "Weather stations and Randstad area",
    mapbox={
        "style":"carto-positron",
        "layers": [
            {
                "source": gdf_randstad["geometry"].__geo_interface__,
                "type": "line",
                "color": "orange"
            }
        ]
    },
)
fig.show()

From the map above, the stations in the Randstad area are: Hoek van Holland, Rotterdam Geulhaven, Cabauw Mast, De Bilt, Voorschoten, Schiphol, and Wijk aan Zee. The weather data are then given a boolean flag indicating whether each station lies in the Randstad area.

In [65]:
#filter randstad stations
list_stn_randstad = ["Hoek van Holland", "Rotterdam Geulhaven", "Cabauw Mast", "De Bilt", "Voorschoten", "Schiphol", "Wijk aan Zee"]
df_station_name["randstad"] = df_station_name["NAME"].str.contains("|".join(list_stn_randstad))
df_station_name.head()        

| | STN | LON(east) | LAT(north) | ALT(m) | NAME | randstad |
|---|---|---|---|---|---|---|
| 0 | 209 | 4.518 | 52.465 | 0.0 | IJmond | False |
| 1 | 210 | 4.43 | 52.171 | -0.2 | Valkenburg Zh | False |
| 2 | 215 | 4.437 | 52.141 | -1.1 | Voorschoten | True |
| 3 | 225 | 4.555 | 52.463 | 4.4 | IJmuiden | False |
| 4 | 235 | 4.781 | 52.928 | 1.2 | De Kooy | False |
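The stations were assigned to the Randstad by visual inspection of the map. As a cross-check, the same decision can be made programmatically with a point-in-polygon test. The sketch below is a plain-Python ray-casting version. The function name `point_in_polygon` and the rectangular `ring` are hypothetical stand-ins, not the real Randstad boundary from the shapefile; on real geometries, geopandas provides `GeoSeries.within` for the same kind of test.

```python
def point_in_polygon(lon, lat, ring):
    """Ray casting: count how often a ray going east from the point crosses the ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical bounding box as (lon, lat) pairs, NOT the real Randstad outline
ring = [(4.0, 51.8), (5.3, 51.8), (5.3, 52.6), (4.0, 52.6)]

# Coordinates copied from the station table above
stations = {"Voorschoten": (4.437, 52.141), "De Kooy": (4.781, 52.928)}
flags = {name: point_in_polygon(lon, lat, ring) for name, (lon, lat) in stations.items()}
print(flags)
```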


In [66]:
df_weather = pd.read_csv("Literature and data/KNMI_daily_0123-0923.csv")
df_weather.columns = df_weather.columns.str.strip()
df_weather.columns = ["STN", "dates", "windspeed","windspeed_max","temperature","rain_duration","rain_amount", "visibility"]#rename col
df_weather_randstad = df_station_name.merge(df_weather,on="STN")

# Convert every column except the dates to numeric; non-numeric entries (including NAME) become NaN
cols_numeric = df_weather_randstad.columns.difference(["dates"])
df_weather_randstad[cols_numeric] = df_weather_randstad[cols_numeric].apply(pd.to_numeric, errors="coerce")

# Daily mean per area (Randstad vs. others)
df_weather_randstad_mean = df_weather_randstad.groupby(["randstad","dates"], as_index=False).mean(numeric_only=True)
# Rescale the measurements by 0.1. Note: this also scales the station columns
# (STN, LON, LAT, ALT), which are not used in the analysis below.
cols_scale = df_weather_randstad_mean.columns.difference(["dates", "visibility","randstad"])
df_weather_randstad_mean[cols_scale] = df_weather_randstad_mean[cols_scale]*0.1

df_weather_randstad_mean["dates"] = pd.to_datetime(df_weather_randstad_mean["dates"], format="%Y%m%d")
df_weather_randstad_mean.head()

| | randstad | dates | STN | LON(east) | LAT(north) | ALT(m) | NAME | windspeed | windspeed_max | temperature | rain_duration | rain_amount | visibility |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 2023-01-01 | 29.9575 | 0.514898 | 5.217923 | 1.01275 | NaN | 7.2975 | 18.45 | 11.696429 | 5.551852 | 4.059259 | 52.0 |
| 1 | False | 2023-01-02 | 29.9575 | 0.514898 | 5.217923 | 1.01275 | NaN | 5.0525 | 12.425 | 8.5 | 4.803704 | 4.074074 | 37.952381 |
| 2 | False | 2023-01-03 | 29.9575 | 0.514898 | 5.217923 | 1.01275 | NaN | 6.2725 | 16.175 | 6.278571 | 1.348148 | 1.188889 | 17.0 |
| 3 | False | 2023-01-04 | 29.9575 | 0.514898 | 5.217923 | 1.01275 | NaN | 10.96 | 19.425 | 10.989286 | 8.618519 | 11.411111 | 34.285714 |
| 4 | False | 2023-01-05 | 29.9575 | 0.514898 | 5.217923 | 1.01275 | NaN | 6.4975 | 14.15 | 10.089286 | 1.685185 | 0.874074 | 35.666667 |
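The output above shows that the station columns (STN, LON, LAT, ALT) were multiplied by 0.1 along with the measurements. This does not affect the later correlations, since those columns are not used, but a variant that rescales only the measurement columns keeps the table easier to read. The sketch below uses a small made-up frame; its values and the `measure_cols` list are assumptions for illustration.

```python
import pandas as pd

# Made-up raw values in units of 0.1, mimicking the merged station/weather frame
df = pd.DataFrame({
    "randstad":    [True, True, False, False],
    "dates":       ["20230101", "20230101", "20230101", "20230101"],
    "STN":         [215, 240, 209, 235],
    "temperature": [110, 120, 100, 90],
    "rain_amount": [40, 60, 20, 0],
})

measure_cols = ["temperature", "rain_amount"]  # only these need rescaling

# Average the measurements per area and day, leaving the station columns out
df_mean = df.groupby(["randstad", "dates"], as_index=False)[measure_cols].mean()
df_mean[measure_cols] = df_mean[measure_cols] * 0.1
df_mean["dates"] = pd.to_datetime(df_mean["dates"], format="%Y%m%d")
print(df_mean)
```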


A plot is then drawn to confirm that the data have been separated into the two areas.

In [67]:
fig = px.line(df_weather_randstad_mean, x="dates", y="rain_amount", color="randstad", title="Average rain amount of Randstad area vs others")
fig.show()

The congestion data are then processed by flagging each record against a list of cities and towns in the Randstad. The column "KopWegvakVan" describes the area just before the congestion, so it is taken as the origin of the traffic for our purposes.

In [68]:
unpickled_df_cong = pd.read_pickle("Literature and data/df_cong_pickle.pkl").reset_index()
# Take an explicit copy so the column assignments below do not trigger SettingWithCopyWarning
df_cong_filt = unpickled_df_cong[["DatumFileBegin", "TijdFileBegin", "TijdFileEind", "FileZwaarte", "Oorzaak_4", "KopWegvakVan"]].copy()
# FileZwaarte uses a decimal comma; convert it to a float
df_cong_filt["FileZwaarte"] = df_cong_filt["FileZwaarte"].str.replace(",", ".", regex=False).astype(float)

list_cong_randstad = ["Almere", "Amsterdam", "Delft", "Dordrecht", "Haarlem",
                      "Den Haag", "Leiden", "Rotterdam", "Utrecht", "Zoetermeer", 
                      "Alkmaar", "Alphen aan den Rijn", "Amersfoort", "Amstelveen", 
                      "Capelle aan den IJssel", "Gouda", "Heerhugowaard", "Hilversum", 
                      "Hoofddorp", "Hoorn", "Lelystad", "Nieuwegein", "Purmerend", 
                      "Rijswijk", "Schiedam", "Spijkenisse", "Vlaardingen", "Zaandam", "Zeist"]

df_cong_filt["randstad"] = df_cong_filt["KopWegvakVan"].str.contains("|".join(list_cong_randstad))
# df_cong_filt.head(30)
df_cong_randstad_grouped = df_cong_filt.groupby(["randstad","DatumFileBegin"])["FileZwaarte"].sum().reset_index()
df_cong_randstad_grouped.head()




| | randstad | DatumFileBegin | FileZwaarte |
|---|---|---|---|
| 0 | False | 2023-01-01 | 93.5 |
| 1 | False | 2023-01-02 | 2525.94 |
| 2 | False | 2023-01-03 | 3690.496 |
| 3 | False | 2023-01-04 | 5675.566 |
| 4 | False | 2023-01-05 | 4226.861 |
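Because `str.contains` does substring matching, a short place name can also match inside a longer one, and missing values in `KopWegvakVan` yield a missing flag rather than `False`. A stricter variant, sketched here with made-up road-section names, escapes each city name, adds word boundaries, and sets `na=False`. Whether this would change any flags in the real data is not known here.

```python
import re
import pandas as pd

cities = ["Hoorn", "Utrecht", "Den Haag"]  # short excerpt of the city list

# Made-up values standing in for the KopWegvakVan column
origins = pd.Series(["Hoorn-Noord", "Hoornaar", "knp. Utrecht", None, "Zwolle"])

# \b anchors each name at word boundaries; re.escape guards against regex metacharacters
pattern = r"\b(?:" + "|".join(re.escape(c) for c in cities) + r")\b"

loose = origins.str.contains("|".join(cities))      # substring match, missing stays missing
strict = origins.str.contains(pattern, na=False)    # whole-word match, missing -> False

print(pd.DataFrame({"origin": origins, "loose": loose, "strict": strict}))
```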


In [69]:
fig = px.line(df_cong_randstad_grouped, x="DatumFileBegin", y="FileZwaarte",color="randstad", title="Congestion severity per day (Randstad area vs others)")
fig.update_layout(xaxis_title="Date", yaxis_title="Congestion severity (km*min)")

After processing, the two dataframes are merged on the date column and the Randstad boolean flag, for the subsequent correlation calculations and plotting.

In [70]:
df_cong_randstad_grouped.rename(columns={"DatumFileBegin":"dates"}, inplace=True)
df_cong_weather_randstad = df_cong_randstad_grouped.merge(df_weather_randstad_mean, on = ["dates","randstad"])
df_weather_randstad_mean.head()

| | randstad | dates | STN | LON(east) | LAT(north) | ALT(m) | NAME | windspeed | windspeed_max | temperature | rain_duration | rain_amount | visibility |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 2023-01-01 | 29.9575 | 0.514898 | 5.217923 | 1.01275 | NaN | 7.2975 | 18.45 | 11.696429 | 5.551852 | 4.059259 | 52.0 |
| 1 | False | 2023-01-02 | 29.9575 | 0.514898 | 5.217923 | 1.01275 | NaN | 5.0525 | 12.425 | 8.5 | 4.803704 | 4.074074 | 37.952381 |
| 2 | False | 2023-01-03 | 29.9575 | 0.514898 | 5.217923 | 1.01275 | NaN | 6.2725 | 16.175 | 6.278571 | 1.348148 | 1.188889 | 17.0 |
| 3 | False | 2023-01-04 | 29.9575 | 0.514898 | 5.217923 | 1.01275 | NaN | 10.96 | 19.425 | 10.989286 | 8.618519 | 11.411111 | 34.285714 |
| 4 | False | 2023-01-05 | 29.9575 | 0.514898 | 5.217923 | 1.01275 | NaN | 6.4975 | 14.15 | 10.089286 | 1.685185 | 0.874074 | 35.666667 |
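Since both frames are expected to hold exactly one row per date and area, the merge can be asked to verify that, and an outer join with `indicator=True` shows which dates are missing on either side. This is a sketch on made-up data: `validate` and `indicator` are standard `DataFrame.merge` parameters, while the frames and values are illustrative only.

```python
import pandas as pd

# Made-up congestion and weather frames with one row per (dates, randstad)
cong = pd.DataFrame({
    "dates": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    "randstad": [True, True, True],
    "FileZwaarte": [10.0, 20.0, 30.0],
})
weather = pd.DataFrame({
    "dates": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "randstad": [True, True],
    "rain_amount": [1.0, 2.0],
})

# validate raises MergeError if either side has duplicate (dates, randstad) keys
merged = cong.merge(weather, on=["dates", "randstad"], how="outer",
                    validate="one_to_one", indicator=True)

# Rows without a partner would be dropped silently by the default inner merge
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["dates", "randstad", "_merge"]])
```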


In [71]:
df_cong_corr_randstad_rain = df_cong_weather_randstad.groupby("randstad")[["rain_amount","FileZwaarte"]].corr().unstack()[[("rain_amount", "FileZwaarte")]]\
    .reset_index()
df_cong_corr_randstad_wind = df_cong_weather_randstad.groupby("randstad")[["windspeed","FileZwaarte"]].corr().unstack()[[("windspeed", "FileZwaarte")]]\
    .reset_index()
df_cong_corr_randstad_temp = df_cong_weather_randstad.groupby("randstad")[["temperature","FileZwaarte"]].corr().unstack()[[("temperature", "FileZwaarte")]]\
    .reset_index()

df_cong_corr_randstad = df_cong_corr_randstad_rain.merge(df_cong_corr_randstad_wind)
df_cong_corr_randstad = df_cong_corr_randstad.merge(df_cong_corr_randstad_temp)
df_cong_corr_randstad.columns = ["Randstad", "Rain", "Wind", "Temps"]
df_cong_corr_randstad = df_cong_corr_randstad.melt(id_vars=["Randstad"])

In [72]:
fig = px.bar(df_cong_corr_randstad, x="variable", y="value", color="Randstad",
             barmode="group", text_auto=True, labels={
                     "variable": "Weather Types",
                     "value": "Correlation",
                     "Randstad": "In Randstad Area"},
             title="Correlation between weather and congestion (Randstad area vs others)")

fig.show()

As the figure above shows, the correlations do differ between the two regions. Rain has a larger positive correlation with congestion in the Randstad than elsewhere, meaning that congestion is more likely on rainy days inside the Randstad than outside it. For wind, the correlation values differ between the regions but remain very small, so there is little to no correlation between wind and congestion either inside or outside the Randstad. For temperature, the negative correlation is stronger inside the Randstad, meaning that congestion drops more on hot days inside the Randstad than outside it. This could also be due to the summer holidays.
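The per-region coefficients plotted above can also be obtained in a single pass with `DataFrame.corrwith`, which avoids the three separate `unstack` steps. The sketch below uses made-up numbers chosen so that the signs for rain are obvious; the frame `df` merely mimics the columns of `df_cong_weather_randstad`.

```python
import pandas as pd

df = pd.DataFrame({
    "randstad":    [True] * 4 + [False] * 4,
    "rain_amount": [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0],
    "windspeed":   [5.0, 3.0, 4.0, 6.0, 2.0, 7.0, 4.0, 5.0],
    "temperature": [10.0, 12.0, 9.0, 11.0, 8.0, 7.0, 9.0, 6.0],
    # Made-up severities: rise with rain inside the Randstad, fall outside
    "FileZwaarte": [100.0, 200.0, 300.0, 400.0, 90.0, 80.0, 70.0, 60.0],
})

weather_cols = ["rain_amount", "windspeed", "temperature"]

# One Pearson coefficient per weather variable, per region
corr_by_region = pd.DataFrame({
    in_randstad: group[weather_cols].corrwith(group["FileZwaarte"])
    for in_randstad, group in df.groupby("randstad")
}).T.rename_axis("Randstad")

# Long format, ready for px.bar(x="variable", y="value", color="Randstad")
corr_long = corr_by_region.reset_index().melt(id_vars=["Randstad"])
print(corr_long)
```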


## 5. Conclusion 
In conclusion, the analysis of weather conditions and their correlation with public transport usage and road traffic behaviour yielded unexpected results. Examining the overall data for the entire week, weak to very weak correlations were found between rain, wind, temperature, and public transport check-ins. Despite some fluctuations in correlation strength between weekdays and weekends, the overall impact of weather on public transport usage remains inconclusive. The unexpected shifts in correlations during weekends suggest that other factors may play a role in influencing travel behaviour during these specific periods.

Similarly, when considering highway congestion data, the correlations between weather conditions and road traffic show weak associations. While rain appears to have a slight positive correlation with congestion over the entire week, this relationship reverses on weekends. The unexpected findings regarding wind and temperature further highlight the complexity of factors influencing road traffic behaviour.

The correlation analysis also reveals regional differences, particularly in the Randstad area. Rain has a more substantial positive correlation with congestion in this region compared to others, suggesting a higher likelihood of traffic congestion during rainy conditions. On the other hand, wind shows minimal correlation both inside and outside the Randstad area, emphasizing its limited impact on road congestion. Temperature, however, exhibits a stronger negative correlation inside the Randstad, indicating less congestion during hot weather, possibly influenced by summer holidays.

In summary, while the analysis provides insights into potential associations between weather conditions and travel behaviour, the overall correlations are weak, and no definitive conclusions can be drawn. The complexity of the factors influencing public transport and road traffic, such as summer vacation, along with regional variations, shows the need for further research into the impact of weather on transportation patterns.

## 6. Discussion 

While the analysis provides valuable insights, a note of caution is due for the weak overall correlations. The unexpected shifts and regional variations underscore the need for further research. Understanding the interplay between weather and travel behavior requires a more comprehensive exploration, considering factors like cultural events, holidays, and specific regional characteristics that might influence transportation choices. 

In addition, the data used for this research was limited to one year of weather and transport data. A single year might not capture long-term trends or cyclical patterns, which limits the robustness of the conclusions. To improve the study's validity and generalizability, future research should consider obtaining a larger dataset that spans many years. Expanding the time scope would provide a more thorough understanding of the underlying dynamics, making it easier to detect overarching patterns and to reduce the influence of anomalies that might distort the findings. Such a strategy would help establish stronger and more trustworthy associations between meteorological factors and travel patterns.

## 7. Contribution Statement

Hsuan-An Chu: Weather data collection and processing, Weather station data visualization, Correlation calculation and visualization, 2-axis line chart template.

Cian Rippen: OV check-in and congestion data processing, OV check-in and congestion data plotting over time, adding bad weather indicators to peak congestion days.

Mats Poppe: congestion data processing, weekends and weekdays data pipeline, sub-question and document structuring.

Sam Terstappen: Weather data collection and processing, Problem structuring and discussion, document formatting.

Jarrik Overbosch: OV chipkaart data collection, Problem structuring and result interpretation, Conclusion, Document formatting.

### References

OV Chipcard checkins (TransLink) \
https://www.translink.nl/library \
Weather data (Koninklijk Nederlands Meteorologisch Instituut, KNMI)\
https://dataplatform.knmi.nl/group/precipitation \
https://www.daggegevens.knmi.nl/klimatologie/uurgegevens \
Congestion data (Rijkswaterstaat)\
https://downloads.rijkswaterstaatdata.nl/filedata/