# The Influence of the Weather on Travel Behaviour in the Netherlands
### Project Group 04

*Members and Student Numbers:*

Hsuan-An Chu: 5914647\
Jarrik Overbosch: 6105734\
Mats Poppe: 5883245\
Cian Rippen: 5054141\
Sam Terstappen: 6078720



## 1. Introduction

The Netherlands is a country known for its efficient transportation system and its dynamic weather patterns. The interplay between these two elements often goes unnoticed, yet it has a profound impact on the daily routines and travel behaviors of its residents. Understanding how weather influences travel behavior holds practical significance for transportation planning and climate adaptation. This report delves into the relationship between weather conditions and travel choices, seeking to answer critical questions about how the Dutch people adapt their travel behaviors in response to varying weather conditions.

This study focuses on two primary domains: public transportation and road traffic. We explore the extent to which weather variables such as precipitation, wind speed, and temperature affect the number of passengers using public transport services, as recorded through OV chipcard check-ins. We also examine how these weather conditions affect road traffic, with an emphasis on traffic congestion. Finally, we investigate potential variations in travel behavior between urban and rural areas, since the geography and infrastructure of a region may influence the choices people make when commuting.

## 2. Research Objective

*The research objective is to investigate the impact of various weather factors, including precipitation, temperature, wind speed, rain duration, rain amount, and visibility, on travel behavior in the Netherlands, with a specific focus on public transportation usage, road traffic patterns, the correlation with congestion, and potential variations in behavior between urban and rural areas*.

###### SMART Criteria 
#### Specific:
The research objective precisely defines the scope and nature of the study, focusing on the specific influence of multiple weather variables on travel behavior in the Netherlands.
#### Measurable:
The objective is quantifiable, as it seeks to measure the impact of weather factors on travel behavior through observable and recorded data.
#### Achievable:
It is realistic to investigate the relationship between weather and travel behavior, considering the available resources and the feasibility of data collection and analysis.
#### Relevant:
This research is highly relevant in the context of transportation planning and climate adaptation, addressing a critical issue with practical implications.
#### Time-bound: 
Although the objective does not state a deadline explicitly, the research is expected to be completed within a timeframe of one month, which keeps it organized and on schedule.




## What is the influence of the weather on travel behaviour in The Netherlands?
* What is the effect of weather on the number of public transport passengers, based on OV chipcard check-ins? Is there a correlation?
* Is there a difference in public transport behaviour between weekdays and weekends?
* What is the effect of weather on road traffic, based on highway congestion data? Is there a correlation?
* Is there a difference in road traffic behaviour between weekdays and weekends?
* Is there a difference in road traffic behaviour between the Randstad and the area outside the Randstad?


(These may still need some adjustment in how they are formulated.)

## 3. Data Pipeline
(Explain what data we used and why)


### Packages

For this project, a number of packages are used for processing the data and displaying the results. They are imported below.

In [4]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import geopandas as gpd
from shapely.geometry import MultiPolygon
from plotly.offline import plot
from plotly.subplots import make_subplots

### Weather Data

Weather data from the Royal Netherlands Meteorological Institute (KNMI) is used. For the report, we selected three weather variables from the daily KNMI files for our analysis: the daily rain amount (RH), the daily mean wind speed (FG), and the daily mean temperature (TG).

In [None]:
#reading weather data
df_weather = pd.read_csv("Literature and data/KNMI_daily_0123-0923.csv")
df_weather.columns = df_weather.columns.str.strip()
df_station_name = pd.read_csv("Literature and data/KNMI_station.csv")
df_station_name.columns = df_station_name.columns.str.strip()
df_weather.head()

As can be seen from the head of the dataframe above, some weather stations do not collect every type of weather data. To confirm that the data covers the whole of the Netherlands sufficiently, all weather stations are plotted on a map, coloured by whether they collect precipitation data.

In [None]:
#plotting the weather stations
df_weather.rename(columns={"# STN":"STN"}, inplace=True)
df_station_name.columns = df_station_name.columns.str.strip()
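#astype returns a new Series, so the result has to be assigned back
df_weather["STN"] = df_weather["STN"].astype(int)
df_station_name["STN"] = df_station_name["STN"].astype(int)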

df_station_data = df_station_name.merge(df_weather, on="STN")
df_station_data.drop_duplicates(subset="STN",inplace=True)
df_station_data["RH"] = pd.to_numeric(df_station_data["RH"], errors="coerce")
df_station_data["hasraindata"] = df_station_data["RH"].notna() #KNMI codes rain below 0.05 mm as -1, so test for presence instead of >= 0
df_station_data = df_station_data.iloc[:,[0,1,2,3,4,12]]
df_station_data.drop_duplicates(inplace=True)

px.set_mapbox_access_token("pk.eyJ1IjoiaHN1YW4tc2hhbmUiLCJhIjoiY2xvMnB3b2NqMDl3YzJpbW56eWxnNHRrNSJ9.zJi-0PqhXOPzbqZ973FdxA")
fig = px.scatter_mapbox(df_station_data, lat="LAT(north)", lon="LON(east)", color="hasraindata", zoom = 5, hover_name="NAME", labels={"hasraindata":"Rain data collection"})
fig.update_layout(title = "Weather Station Data used")
fig.show()

From the figure above, we can confirm that the data coverage is sufficient. The data is therefore processed further: non-numeric entries are converted to missing values, the measurements are averaged over all stations per date, and the values are scaled, since according to the official documentation they are recorded in units of 0.1.

In [None]:
#process data
df_weather.columns = ["STN", "dates", "windspeed","windspeed_max","temperature","rain_duration","rain_amount", "visibility"]#rename col
df_weather = df_weather.drop(columns=["STN"])
value_cols = df_weather.columns.difference(["dates"])
df_weather[value_cols] = df_weather[value_cols].apply(pd.to_numeric, errors="coerce") #non-numeric entries become NaN

df_weather_mean = df_weather.groupby(["dates"], as_index=False).mean(numeric_only=True)
#values are recorded in units of 0.1; visibility is excluded from the scaling
scaled_cols = df_weather_mean.columns.difference(["dates", "visibility"])
df_weather_mean[scaled_cols] = df_weather_mean[scaled_cols]*0.1


df_weather_mean["dates"] = pd.to_datetime(df_weather_mean["dates"], format="%Y%m%d")
df_weather_mean.head()
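
The unit adjustment can be illustrated on a few made-up rows. The sketch below uses a hypothetical helper, `to_physical_units`, and invented values; it is not part of the pipeline above. It also shows one possible way to handle the value -1, which the KNMI documentation uses in the RH column for days with less than 0.05 mm of rain. Counting such days as 0 mm is an assumption made here for illustration.

```python
import pandas as pd

def to_physical_units(df, tenth_cols, trace_col="rain_amount"):
    """Return a copy with tenth-unit columns converted to whole units."""
    out = df.copy()
    # Assumption: the -1 sentinel (rain below 0.05 mm) is counted as 0 mm.
    # mask() leaves genuinely missing values (NaN) untouched.
    out[trace_col] = out[trace_col].mask(out[trace_col] == -1, 0)
    out[tenth_cols] = out[tenth_cols] * 0.1
    return out

# Invented raw values in units of 0.1 (m/s, degrees Celsius, mm)
raw = pd.DataFrame({
    "dates": [20230101, 20230102, 20230103],
    "windspeed": [45, 30, 62],
    "temperature": [101, 87, -12],
    "rain_amount": [12, -1, 0],
})
clean = to_physical_units(raw, ["windspeed", "temperature", "rain_amount"])
print(clean)
```

Without such a step, a trace-rain day would enter the station average as -0.1 mm after scaling, which slightly lowers the mean.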

After processing the data, graphs can be made showing the windspeed, temperature and rain_amount in 2023 so far. 

In [None]:
#plot the weather
fig = px.line(df_weather_mean, x="dates", y="windspeed", title="Average windspeed data (m/s)")
fig.show()

In [None]:
fig = px.line(df_weather_mean, x="dates", y="temperature", title="Average temperature data (C)")
fig.show()

In [None]:
fig = px.line(df_weather_mean, x="dates", y="rain_amount", title="Average rain data (mm/day)")
fig.show()

### OV chipkaart data

Here, the OV check-in data is imported. The number of check-ins is given in thousands, so that column is multiplied by 1000. The data is then summed per day, and a graph of the check-ins per day in 2023 so far is shown.

In [None]:
#reading
file = "Literature and Data/20230908_Instappers_per_uur_export_V3.csv"
df_OV = pd.read_csv(file)

#processing
#round before casting, so that floating-point error cannot truncate a count
df_OV["Aantal_check_ins"] = (df_OV["Aantal_check_ins"] * 1000).round().astype("int")
df_OV_sum = df_OV.groupby(by="Datum", sort=False)["Aantal_check_ins"].sum().reset_index()
df_OV_sum["Datum"] = pd.to_datetime(df_OV_sum["Datum"], format="%d-%m-%Y")
df_OV_sum.head()

#plotting
fig = px.line(df_OV_sum, x="Datum", y="Aantal_check_ins", title="Number of check-ins in 2023")
fig.update_layout(xaxis_title="Date", yaxis_title="Number of OV check-ins")
fig.show()

### Congestion Data

The following code was used to convert 10 separate Excel files, each containing the congestion data of one month of 2023 up to and including October, into one pickle file. It was run once and then commented out, as it does not need to be run again. The pickle file is then loaded into a dataframe, and the data is processed and summed per day.

In [None]:
# df_cong = pd.DataFrame()
# months = ["jan", "feb", "mar", "apr", "mei", "jun", "jul", "aug", "sep", "okt"]
# for i in months:
#     month = pd.read_excel("Literature and Data/Congestion_data_2022/" + i + ".xlsx")
#     df_cong = pd.concat([df_cong, month])
#     print(i)

# df_cong.to_pickle("Literature and data/df_cong_pickle.pkl")
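
Instead of commenting the conversion out after the first run, the same effect can be obtained by checking whether the cache file already exists. The sketch below is a generic illustration with a hypothetical helper, `load_or_build`, and a toy dataframe written to a temporary folder; the file name and the `build_demo` function are made up.

```python
from pathlib import Path
import tempfile

import pandas as pd

def load_or_build(cache_path, build):
    """Read the pickle at cache_path if it exists, otherwise build and store it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_pickle(cache_path)
    df = build()
    df.to_pickle(cache_path)
    return df

calls = []

def build_demo():
    # Stand-in for the slow step of reading and concatenating the Excel files
    calls.append(1)
    return pd.DataFrame({"FileZwaarte": ["1,5", "2,0"]})

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "demo.pkl"
    first = load_or_build(cache, build_demo)   # builds the frame and writes the cache
    second = load_or_build(cache, build_demo)  # reads the cache, build_demo is not called

print(len(calls))
```

With this pattern the notebook can be run from top to bottom on a new machine without uncommenting anything.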

In [None]:
unpickled_df_cong = pd.read_pickle("Literature and data/df_cong_pickle.pkl").reset_index() 
df_cong_filt = unpickled_df_cong[["DatumFileBegin", "TijdFileBegin", "TijdFileEind", "FileZwaarte", "Oorzaak_4"]].copy() #copy, so the assignment below does not raise a SettingWithCopyWarning

df_cong_filt["FileZwaarte"] = df_cong_filt["FileZwaarte"].str.replace(",", ".", regex=False).astype(float) #replace decimal commas to be able to sum
# df_cong_filt.head(30)
df_cong_grouped = df_cong_filt.groupby("DatumFileBegin")["FileZwaarte"].sum().reset_index()
df_cong_grouped.head()

The congestion severity data is displayed in a graph. Two days with severe bad weather are highlighted, and it can be seen that these days are outliers in terms of congestion. It should be noted that on April 6th the congestion was also made worse because many people were travelling for the Easter weekend.

In [None]:
fig = px.line(df_cong_grouped, x="DatumFileBegin", y="FileZwaarte", title="Congestion severity per day")
fig.update_layout(xaxis_title="Date", yaxis_title="Congestion severity (km*min)")
# df_cong_grouped["DatumFileBegin"] = pd.to_datetime(df_cong_grouped["DatumFileBegin"])
# print(type(df_cong_grouped.iloc[95,0]))
fig.add_trace(
    go.Scatter(
        x=[df_cong_grouped.iloc[95,0], df_cong_grouped.iloc[18,0]],
        y=[df_cong_grouped.iloc[95,1], df_cong_grouped.iloc[18,1]],
        mode="markers",
        name="Extraordinarily bad weather days",
        showlegend=True)
)
fig.update_layout(legend=dict(
    orientation="h", 
    xanchor="right",
    y=1,
    x=1
))
fig.show()
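
The highlighted points above are selected by row position (rows 95 and 18), which only works as long as the order and completeness of the daily data stay the same. A more robust alternative is to select the days by date. The sketch below shows the idea on invented data; the dates and values are placeholders, not the actual bad-weather days.

```python
import pandas as pd

# Invented daily congestion totals
daily = pd.DataFrame({
    "DatumFileBegin": pd.date_range("2023-01-01", periods=5, freq="D"),
    "FileZwaarte": [100.0, 250.0, 90.0, 400.0, 120.0],
})

# Placeholder dates standing in for the days with severe weather
highlight_dates = pd.to_datetime(["2023-01-02", "2023-01-04"])
highlight = daily[daily["DatumFileBegin"].isin(highlight_dates)]
print(highlight)
```

The resulting `highlight` frame can be passed to `go.Scatter` as `x=highlight["DatumFileBegin"]` and `y=highlight["FileZwaarte"]`.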

## 4. Results
(Answering the Sub Questions)


#### 1. What is the effect of weather on the number of public transport passengers, based on OV chipcard check-ins? Is there a correlation?

In [None]:
df_OV_sum.rename(columns={"Datum":"dates"}, inplace=True)
df_OV_weather = df_OV_sum.merge(df_weather_mean, on = "dates")
df_OV_weather.head()

In [None]:
#OV v Rain
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=df_OV_weather["dates"], y=df_OV_weather["Aantal_check_ins"], name="Check-ins"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df_OV_weather["dates"], y=df_OV_weather["rain_amount"], name="Rain amount"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Rain Amount vs Check-ins"
)

# Set x-axis title
fig.update_xaxes(title_text="Dates")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Check-ins</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Avg Rain amount (mm/day)</b>", secondary_y=True)

fig.show()

In [None]:
#OV v wind
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=df_OV_weather["dates"], y=df_OV_weather["Aantal_check_ins"], name="Check-ins"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df_OV_weather["dates"], y=df_OV_weather["windspeed"], name="Wind Speed"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Wind speed vs Check-ins"
)

# Set x-axis title
fig.update_xaxes(title_text="Dates")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Check-ins</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Avg Wind Speed(m/s)</b>", secondary_y=True)

fig.show()

In [None]:
#OV v Temps
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=df_OV_weather["dates"], y=df_OV_weather["Aantal_check_ins"], name="Check-ins"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df_OV_weather["dates"], y=df_OV_weather["temperature"], name="Temperature"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Temperature vs Check-ins"
)

# Set x-axis title
fig.update_xaxes(title_text="Dates")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Check-ins</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Avg temperature(C)</b>", secondary_y=True)

fig.show()

In [None]:
#OV correlation test all
OV_corr_rains = df_OV_weather["rain_amount"].corr(df_OV_weather["Aantal_check_ins"])
OV_corr_wind = df_OV_weather["windspeed"].corr(df_OV_weather["Aantal_check_ins"])
OV_corr_temps = df_OV_weather["temperature"].corr(df_OV_weather["Aantal_check_ins"])

OV_corr_data = {"WeatherTypes": ["Rain", "Wind", "Temps"], "Correlation": [OV_corr_rains,OV_corr_wind,OV_corr_temps]}
df_OV_corr = pd.DataFrame(OV_corr_data)

fig = px.bar(df_OV_corr, x="WeatherTypes", y="Correlation", color="WeatherTypes",text_auto=True, title="Correlation between weather conditions and OV check-ins (all data)")
fig.show()
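
The correlations above are Pearson coefficients, which measure linear association. As a hedged illustration of how the calculation could be condensed and cross-checked, the sketch below computes all coefficients in one call with `DataFrame.corrwith`, and adds a rank-based (Spearman) variant obtained by ranking the data first. The numbers are invented and chosen so that the relation is monotonic but not linear.

```python
import pandas as pd

# Invented data: check-ins rise with rain, but not linearly
demo = pd.DataFrame({
    "rain_amount": [0.0, 1.0, 2.0, 3.0, 4.0],
    "windspeed": [5.0, 4.0, 3.0, 2.0, 1.0],
    "Aantal_check_ins": [10, 20, 40, 80, 160],
})
weather_cols = ["rain_amount", "windspeed"]
target = demo["Aantal_check_ins"]

# Pearson correlation of every weather column with the target
pearson = demo[weather_cols].corrwith(target)

# Spearman correlation is the Pearson correlation of the ranks
spearman = demo[weather_cols].rank().corrwith(target.rank())

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}))
```

A clear gap between the two coefficients would hint that the relation is not linear or that a few extreme days dominate the Pearson value.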

#### 2. Is there a difference in public transport behaviour between weekdays and weekends?

In [None]:
df_OV_sum["dates"] = pd.to_datetime(df_OV_sum["dates"], format="%d-%m-%Y")

df_OV_weekends = df_OV_sum[df_OV_sum["dates"].dt.dayofweek.isin([5, 6])].copy() #Saturday = 5, Sunday = 6; copy to get an independent dataframe
fig = px.line(df_OV_weekends, x="dates", y="Aantal_check_ins", title="Number of check-ins on weekend days 2023")
fig.update_xaxes(title_text="Date")
fig.update_yaxes(title_text="Number of OV check-ins")

fig.show()

In [None]:
df_OV_weather_weekends = df_OV_weekends.merge(df_weather_mean, on = "dates")

#OV correlation test weekends
OV_corr_rains_weekends = df_OV_weather_weekends["rain_amount"].corr(df_OV_weather_weekends["Aantal_check_ins"])
OV_corr_wind_weekends = df_OV_weather_weekends["windspeed"].corr(df_OV_weather_weekends["Aantal_check_ins"])
OV_corr_temps_weekends = df_OV_weather_weekends["temperature"].corr(df_OV_weather_weekends["Aantal_check_ins"])

OV_corr_data_weekends = {"WeatherTypes": ["Rain", "Wind", "Temps"], "Correlation": [OV_corr_rains_weekends,OV_corr_wind_weekends,OV_corr_temps_weekends]}
df_OV_corr_weekends = pd.DataFrame(OV_corr_data_weekends)

fig = px.bar(df_OV_corr_weekends, x="WeatherTypes", y="Correlation", color="WeatherTypes",text_auto=True, title="Correlation between weather conditions and OV check-ins (weekend data)")
fig.show()
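
To compare weekdays with weekends directly, both groups have to be computed side by side. One way to do this, sketched below on invented data, is to label every day as "weekday" or "weekend" and group on that label. The column name `day_type` is an assumption introduced for this example.

```python
import numpy as np
import pandas as pd

# Invented check-ins for two weeks, starting on Monday 2 January 2023
demo = pd.DataFrame({
    "dates": pd.date_range("2023-01-02", periods=14, freq="D"),
    "Aantal_check_ins": [100, 110, 105, 108, 95, 40, 35] * 2,
})

# dayofweek: Monday = 0 ... Sunday = 6
demo["day_type"] = np.where(demo["dates"].dt.dayofweek >= 5, "weekend", "weekday")

summary = demo.groupby("day_type")["Aantal_check_ins"].agg(["mean", "count"])
print(summary)
```

The same label can be used as a grouping key when computing the weather correlations, so that the weekday and weekend coefficients end up in one table.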

#### 3. What is the effect of weather on road traffic, based on highway congestion data? Is there a correlation?

In [None]:
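df_cong_grouped.rename(columns={"DatumFileBegin":"dates"}, inplace=True)
df_cong_grouped["dates"] = pd.to_datetime(df_cong_grouped["dates"]) #make sure both merge keys are datetimes
df_cong_weather = df_cong_grouped.merge(df_weather_mean, on = "dates")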
df_cong_weather.head()

In [None]:
#congestion v rain
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=df_cong_weather["dates"], y=df_cong_weather["FileZwaarte"], name="Congestions"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df_cong_weather["dates"], y=df_cong_weather["rain_amount"], name="Rain amount"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Rain amount vs Congestion severity"
)

# Set x-axis title
fig.update_xaxes(title_text="Dates")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Congestion severity (km*min)</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Avg Rain amount (mm/day)</b>", secondary_y=True)

fig.show()

In [None]:
#congestion v wind
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=df_cong_weather["dates"], y=df_cong_weather["FileZwaarte"], name="Congestions"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df_cong_weather["dates"], y=df_cong_weather["windspeed"], name="Wind speed"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Wind speed vs Congestion severity"
)

# Set x-axis title
fig.update_xaxes(title_text="Dates")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Congestion severity (km*min)</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Avg Wind Speed(m/s)</b>", secondary_y=True)

fig.show()

In [None]:
#Congestion v Temps
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=df_cong_weather["dates"], y=df_cong_weather["FileZwaarte"], name="Congestions"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df_cong_weather["dates"], y=df_cong_weather["temperature"], name="Temperature"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Temperature vs Congestion severity"
)

# Set x-axis title
fig.update_xaxes(title_text="Dates")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Congestion severity (km*min)</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Avg Temperature(C)</b>", secondary_y=True)

fig.show()

In [None]:
#correlation test all
cong_corr_rains = df_cong_weather["rain_amount"].corr(df_cong_weather["FileZwaarte"])
cong_corr_wind = df_cong_weather["windspeed"].corr(df_cong_weather["FileZwaarte"])
cong_corr_temps = df_cong_weather["temperature"].corr(df_cong_weather["FileZwaarte"])

cong_corr_data = {"WeatherTypes": ["Rain", "Wind", "Temps"], "Correlation": [cong_corr_rains,cong_corr_wind,cong_corr_temps]}
df_cong_corr = pd.DataFrame(cong_corr_data)

fig = px.bar(df_cong_corr, x="WeatherTypes", y="Correlation", color="WeatherTypes",text_auto=True, title="Correlation between weather conditions and congestion (all data)")
fig.show()

#### 4. Is there a difference in road traffic behaviour between weekdays and weekends?

In [None]:
df_cong_grouped["dates"] = pd.to_datetime(df_cong_grouped["dates"])
df_cong_grouped["DayOfWeek"] = df_cong_grouped["dates"].dt.day_name()

# Filter the data for weekend days in 2023
df_cong_weekends = df_cong_grouped[df_cong_grouped["DayOfWeek"].isin(["Saturday", "Sunday"])].copy()

fig = px.line(df_cong_weekends, x="dates", y="FileZwaarte", title="Congestion in the weekends")
fig.update_layout(xaxis_title="Date", yaxis_title="Congestion severity (km*min)")

df_cong_weekends.head()
fig.show()

In [None]:
df_cong_w_weekends = df_cong_weekends.merge(df_weather_mean, on = "dates")

#correlation test weekends
cong_corr_rains_weekends = df_cong_w_weekends["rain_amount"].corr(df_cong_w_weekends["FileZwaarte"])
cong_corr_wind_weekends = df_cong_w_weekends["windspeed"].corr(df_cong_w_weekends["FileZwaarte"])
cong_corr_temps_weekends = df_cong_w_weekends["temperature"].corr(df_cong_w_weekends["FileZwaarte"])

cong_corr_data_weekends = {"WeatherTypes": ["Rain", "Wind", "Temps"], "Correlation": [cong_corr_rains_weekends,cong_corr_wind_weekends,cong_corr_temps_weekends]}
df_cong_corr_weekends = pd.DataFrame(cong_corr_data_weekends)

fig = px.bar(df_cong_corr_weekends, x="WeatherTypes", y="Correlation", color="WeatherTypes",text_auto=True, title="Correlation between weather conditions and congestion (weekend data)")
fig.show()

#### 5. Is there a difference in road traffic behaviour between the Randstad and the area outside the Randstad?

To investigate the difference between areas, the original weather data needs to be split. The shapefile of the Randstad area and the weather stations are drawn on the same map, so that it can be checked visually whether a station lies inside the area.

In [None]:
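# read the Randstad shapefile into a GeoDataFrame
randstad_shp_path = "Literature and data/Randstad_SHP/Randstad.shp"
gdf_randstad = gpd.read_file(randstad_shp_path)
gdf_randstad = gdf_randstad.to_crs("EPSG:4326") # WGS84 longitude/latitude, matching the station coordinates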

In [None]:
px.set_mapbox_access_token("pk.eyJ1IjoiaHN1YW4tc2hhbmUiLCJhIjoiY2xvMnB3b2NqMDl3YzJpbW56eWxnNHRrNSJ9.zJi-0PqhXOPzbqZ973FdxA")
fig = px.scatter_mapbox(df_station_data, lat="LAT(north)", lon="LON(east)", color="hasraindata", zoom = 5, hover_name="NAME", labels={"hasraindata":"Rain data collection"})

fig.update_layout(
    coloraxis_showscale=False,
    title = "Weather stations and randstad area",
    mapbox={
        "style":"carto-positron",
        "layers": [
            {
                "source": gdf_randstad["geometry"].__geo_interface__,
                "type": "line",
                "color": "orange"
            }
        ]
    },
)
fig.show()

From the map above, the stations in the Randstad area are: Hoek van Holland, Rotterdam Geulhaven, Cabauw Mast, De Bilt, Voorschoten, Schiphol, and Wijk aan Zee. Each station is then given a boolean flag that indicates whether it lies in the Randstad area.

In [None]:
#filter randstad stations
list_stn_randstad = ["Hoek van Holland", "Rotterdam Geulhaven", "Cabauw Mast", "De Bilt", "Voorschoten", "Schiphol", "Wijk aan Zee"]
df_station_name["randstad"] = df_station_name["NAME"].str.contains("|".join(list_stn_randstad))
df_station_name.head()        
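
The station list above was read off the map by eye. As an alternative, the membership test could be done geometrically with a point-in-polygon check. The sketch below uses shapely with a made-up rectangle in place of the real Randstad outline and two invented stations, so the coordinates and names are placeholders only.

```python
import pandas as pd
from shapely.geometry import Point, Polygon

# Hypothetical rectangle standing in for the Randstad outline, as (lon, lat) pairs
area = Polygon([(4.0, 51.8), (5.3, 51.8), (5.3, 52.6), (4.0, 52.6)])

# Invented stations using the same coordinate column names as the station file
stations = pd.DataFrame({
    "NAME": ["Station A", "Station B"],
    "LON(east)": [4.5, 6.6],
    "LAT(north)": [52.1, 53.1],
})

# A station is flagged when its coordinate lies inside the polygon
stations["randstad"] = [
    area.contains(Point(lon, lat))
    for lon, lat in zip(stations["LON(east)"], stations["LAT(north)"])
]
print(stations)
```

With the actual shapefile, the polygon would come from the geometry column of the GeoDataFrame, provided the polygon and the station coordinates share the same coordinate reference system.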

In [None]:
df_weather = pd.read_csv("Literature and data/KNMI_daily_0123-0923.csv")
df_weather.columns = df_weather.columns.str.strip()
df_weather.columns = ["STN", "dates", "windspeed","windspeed_max","temperature","rain_duration","rain_amount", "visibility"]#rename col
df_weather_randstad = df_station_name.merge(df_weather,on="STN")

weather_cols = ["windspeed", "windspeed_max", "temperature", "rain_duration", "rain_amount", "visibility"]
#convert only the measurement columns, so the station names and the randstad flag stay intact
df_weather_randstad[weather_cols] = df_weather_randstad[weather_cols].apply(pd.to_numeric, errors="coerce")

df_weather_randstad_mean = df_weather_randstad.groupby(["randstad","dates"], as_index=False)[weather_cols].mean()
scaled_cols = [col for col in weather_cols if col != "visibility"]
df_weather_randstad_mean[scaled_cols] = df_weather_randstad_mean[scaled_cols]*0.1 #values are recorded in units of 0.1

df_weather_randstad_mean["dates"] = pd.to_datetime(df_weather_randstad_mean["dates"], format="%Y%m%d")
df_weather_randstad_mean.head()

In [None]:
fig = px.line(df_weather_randstad_mean, x="dates", y="rain_amount", color="randstad", title="Average rain amount of Randstad area vs others")
fig.show()

Then the congestion data is processed by filtering it with a list of cities and towns in the Randstad. The column "KopWegvakVan" names the area just before the congestion, so it is used as the origin of the traffic for our purpose.

In [None]:
unpickled_df_cong = pd.read_pickle("Literature and data/df_cong_pickle.pkl").reset_index() 
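df_cong_filt = unpickled_df_cong[["DatumFileBegin", "TijdFileBegin", "TijdFileEind", "FileZwaarte", "Oorzaak_4", "KopWegvakVan"]].copy() #copy, so the assignments below do not raise a SettingWithCopyWarning
df_cong_filt["FileZwaarte"] = df_cong_filt["FileZwaarte"].str.replace(",", ".", regex=False).astype(float) #replace decimal commas to be able to sum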

list_cong_randstad = ["Almere", "Amsterdam", "Delft", "Dordrecht", "Haarlem",
                      "Den Haag", "Leiden", "Rotterdam", "Utrecht", "Zoetermeer", 
                      "Alkmaar", "Alphen aan den Rijn", "Amersfoort", "Amstelveen", 
                      "Capelle aan den IJssel", "Gouda", "Heerhugowaard", "Hilversum", 
                      "Hoofddorp", "Hoorn", "Lelystad", "Nieuwegein", "Purmerend", 
                      "Rijswijk", "Schiedam", "Spijkenisse", "Vlaardingen", "Zaandam", "Zeist"]

df_cong_filt["randstad"] = df_cong_filt["KopWegvakVan"].str.contains("|".join(list_cong_randstad))
# df_cong_filt.head(30)
df_cong_randstad_grouped = df_cong_filt.groupby(["randstad","DatumFileBegin"])["FileZwaarte"].sum().reset_index()
df_cong_randstad_grouped.head()
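
Two details of the name matching are worth checking. The joined list is interpreted as a regular expression, and rows without a location name produce a missing value instead of `True` or `False`, which a later `groupby` silently drops. The sketch below shows one possible way to handle both, on invented location names; treating missing names as "outside the Randstad" through `na=False` is an assumption, not the choice made in the cell above.

```python
import re

import pandas as pd

places = ["Den Haag", "Rotterdam"]
# Escape each name so that any special characters are matched literally
pattern = "|".join(re.escape(place) for place in places)

# Invented location names, including one missing value
origins = pd.Series(["Rotterdam-Centrum", "Groningen", None, "Den Haag-Zuid"])

# na=False flags rows without a name as False instead of leaving them missing
in_area = origins.str.contains(pattern, na=False)
print(in_area.tolist())
```

Whether rows without a location should count as outside the Randstad or be left out entirely affects the daily totals of the "other" group, so the choice is best stated explicitly.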

In [None]:
fig = px.line(df_cong_randstad_grouped, x="DatumFileBegin", y="FileZwaarte",color="randstad", title="Congestion severity per day (Randstad area vs others)")
fig.update_layout(xaxis_title="Date", yaxis_title="Congestion severity (km*min)")
fig.show()

After processing, the two dataframes are merged on the date column and the Randstad boolean flag, so that they can be used for the correlation calculations and plots that follow.

In [None]:
df_cong_randstad_grouped.rename(columns={"DatumFileBegin":"dates"}, inplace=True)
df_cong_weather_randstad = df_cong_randstad_grouped.merge(df_weather_randstad_mean, on = ["dates","randstad"])
df_cong_weather_randstad.head() #inspect the merged result

In [None]:
df_cong_corr_randstad_rain = df_cong_weather_randstad.groupby("randstad")[["rain_amount","FileZwaarte"]].corr().unstack()[[("rain_amount", "FileZwaarte")]]\
    .reset_index()
df_cong_corr_randstad_wind = df_cong_weather_randstad.groupby("randstad")[["windspeed","FileZwaarte"]].corr().unstack()[[("windspeed", "FileZwaarte")]]\
    .reset_index()
df_cong_corr_randstad_temp = df_cong_weather_randstad.groupby("randstad")[["temperature","FileZwaarte"]].corr().unstack()[[("temperature", "FileZwaarte")]]\
    .reset_index()

df_cong_corr_randstad = df_cong_corr_randstad_rain.merge(df_cong_corr_randstad_wind)
df_cong_corr_randstad = df_cong_corr_randstad.merge(df_cong_corr_randstad_temp)
df_cong_corr_randstad.columns = ["Randstad", "Rain", "Wind", "Temps"]
df_cong_corr_randstad = df_cong_corr_randstad.melt(id_vars=["Randstad"])
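
The per-region correlations can also be computed in a more compact way by looping over the groups and calling `corrwith` on each of them. The sketch below is an alternative formulation on invented data, not the method used in the cell above; it should give the same kind of long-format table for plotting.

```python
import pandas as pd

# Invented data: rain and congestion move together inside, and oppositely outside
demo = pd.DataFrame({
    "randstad": [True] * 4 + [False] * 4,
    "rain_amount": [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0],
    "windspeed": [3.0, 1.0, 4.0, 2.0, 2.0, 4.0, 1.0, 3.0],
    "FileZwaarte": [10.0, 20.0, 30.0, 40.0, 40.0, 30.0, 20.0, 10.0],
})
weather_cols = ["rain_amount", "windspeed"]

# One row per region, one column per weather variable
corr_by_region = pd.DataFrame({
    region: group[weather_cols].corrwith(group["FileZwaarte"])
    for region, group in demo.groupby("randstad")
}).T
corr_by_region.index.name = "randstad"

# Long format, suitable for a grouped bar chart
long_form = corr_by_region.reset_index().melt(id_vars="randstad")
print(long_form)
```

Adding a weather variable then only requires extending `weather_cols`.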

In [None]:
fig = px.bar(df_cong_corr_randstad, x="variable", y="value", color="Randstad",
             barmode="group", text_auto=True, labels={
                     "variable": "Weather Types",
                     "value": "Correlation",
                     "Randstad": "In Randstad Area"},
             title="Correlation between weather and congestion (Randstad area vs others)")

fig.show()

The figures above show that the correlations differ between the two regions. Rain has a stronger positive correlation with congestion in the Randstad than elsewhere, which suggests that rainy days are more likely to coincide with congestion inside the Randstad than outside it. For wind, the correlation values differ between the regions, but they remain very small, so there is little to no correlation between wind and congestion either inside or outside the Randstad. Temperature shows a stronger negative correlation inside the Randstad, meaning that warm days coincide with a larger drop in congestion there than outside. This could also be due to the summer holidays.


## 5. Conclusion 
(Answering the Central Research Question)

## 6. Discussion 
(Limitations and directions for further research)

## 7. Contribution Statement

Hsuan-An Chu: Weather data collection and processing, weather station data visualization, correlation calculation and visualization, 2-axis line chart template.\
Cian Rippen: OV check-in and congestion data processing, plotting OV check-in and congestion data over time, adding bad weather indicators to peak congestion days.

### References