# **INF161 - Bike Traffic Prediction Project**
### *Ole Kristian Westby | owe009@uib.no | H23*

This project uses data from Statens vegvesen and Geofysisk institutt. The goal is to build a model that predicts the number of cyclists crossing Nygårdsbroen at a given time. This Jupyter notebook prepares the data, keeping only what I consider valuable for that task, and I explain my steps along the way. At the end, the cleaned data is saved to /ready_data/, ready for the model to work on.

I recognize that over the years there have been periods when people used their bikes more or less often than usual because of specific events. I will keep a list here and update it as I find them.
- Covid-19 likely kept more people at home, especially at peak times. Fewer people cycled to work because they were working from home. I'm only interested in the peak Covid-19 periods, though.
- 2017 UCI Road World Championships. I've checked the routes and can't see that any of them crossed Nygårdsbroen, but I will look closer at the data later.

#### **Let's start by importing some libraries.**

In [63]:
import numpy as np
import pandas as pd
import os

#### **We'll handle the weather data first.**

In [64]:
dir_weather = "raw_data/weather_data/"

# Sort the file names so the files are read in a predictable order
files = sorted(f for f in os.listdir(dir_weather) if f.endswith(".csv"))

# Interesting columns
columns = ["Dato", "Tid", "Solskinstid", "Lufttemperatur", "Vindstyrke", "Vindkast"]

# Read only the interesting columns from each file
dfs = []
for file in files:
    file_path = os.path.join(dir_weather, file)
    df = pd.read_csv(file_path, usecols=columns)
    dfs.append(df)

# Stack all files into one dataframe with a fresh index
merged_df = pd.concat(dfs, ignore_index=True)
print(merged_df)

              Dato    Tid  Solskinstid  Lufttemperatur  Vindstyrke  Vindkast
0       2010-01-01  00:00          0.0            -4.6         1.1       NaN
1       2010-01-01  00:10          0.0            -4.1         1.6       NaN
2       2010-01-01  00:20          0.0            -3.5         1.3       NaN
3       2010-01-01  00:30          0.0            -4.1         0.7       NaN
4       2010-01-01  00:40          0.0            -4.4         0.8       NaN
...            ...    ...          ...             ...         ...       ...
709216  2023-06-30  23:10          0.0            13.7         2.3       3.6
709217  2023-06-30  23:20          0.0            13.6         1.9       3.3
709218  2023-06-30  23:30          0.0            13.6         1.7       3.0
709219  2023-06-30  23:40          0.0            13.6         1.9       3.3
709220  2023-06-30  23:50          0.0            13.5         1.9       3.0

[709221 rows x 6 columns]
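The loading step above could also be wrapped in a small helper. This is only a sketch: `load_weather` is a name I made up, and the two demo files (including the extra `Ekstra` column and the 2011 values) are invented so the example runs without the real raw data.

```python
import tempfile
from pathlib import Path

import pandas as pd

COLUMNS = ["Dato", "Tid", "Solskinstid", "Lufttemperatur", "Vindstyrke", "Vindkast"]


def load_weather(directory, columns=COLUMNS):
    """Read every CSV in `directory` and stack the files into one dataframe."""
    # sorted() makes the read order deterministic
    paths = sorted(Path(directory).glob("*.csv"))
    if not paths:
        raise FileNotFoundError(f"no CSV files found in {directory}")
    frames = [pd.read_csv(path, usecols=columns) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Toy demonstration with two made-up files (hypothetical names and values)
with tempfile.TemporaryDirectory() as tmp:
    header = "Dato,Tid,Solskinstid,Lufttemperatur,Vindstyrke,Vindkast,Ekstra\n"
    Path(tmp, "b_2011.csv").write_text(header + "2011-01-01,00:00,0.0,1.0,2.0,3.0,x\n")
    Path(tmp, "a_2010.csv").write_text(header + "2010-01-01,00:00,0.0,-4.6,1.1,,x\n")
    demo = load_weather(tmp)

print(demo)
```

With the real data the call would presumably be `load_weather("raw_data/weather_data/")`. Raising an error on an empty folder is my own choice; it avoids a less obvious failure inside `pd.concat`.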


#### **Now we have one big dataframe containing all the interesting weather data from 2010 to 2023. However, the traffic data only covers 2015-2023, so I want to drop the weather data from before 2015. The model may get slightly less accurate with less data, but I think 2015-2023 is enough, and it keeps things simpler. I won't delete the older files from the raw_data folder, because I still consider them part of the raw data and I want to compare predictions with and without them later, but for now I won't focus on that.**

In [65]:
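merged_df["Dato"] = pd.to_datetime(merged_df["Dato"])

# Keep 2015 and later; .copy() makes the result an independent dataframe
# so later in-place changes don't trigger a SettingWithCopyWarning
merged_df = merged_df[merged_df["Dato"].dt.year >= 2015].copy()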

print(merged_df)

             Dato    Tid  Solskinstid  Lufttemperatur  Vindstyrke  Vindkast
262467 2015-01-01  00:00          0.0             6.6         4.2       NaN
262468 2015-01-01  00:10          0.0             6.6         4.0       NaN
262469 2015-01-01  00:20          0.0             6.6         3.1       NaN
262470 2015-01-01  00:30          0.0             6.6         3.7       NaN
262471 2015-01-01  00:40          0.0             6.7         2.9       NaN
...           ...    ...          ...             ...         ...       ...
709216 2023-06-30  23:10          0.0            13.7         2.3       3.6
709217 2023-06-30  23:20          0.0            13.6         1.9       3.3
709218 2023-06-30  23:30          0.0            13.6         1.7       3.0
709219 2023-06-30  23:40          0.0            13.6         1.9       3.3
709220 2023-06-30  23:50          0.0            13.5         1.9       3.0

[446754 rows x 6 columns]


#### **At the start of the merged dataframe, Vindkast is missing on 2015-01-01. I want clean, complete data, and who knows how many rows are missing data in one or more columns. Let's find out.**

In [66]:
rows_missing_data = merged_df[merged_df.isna().any(axis=1)].shape[0]
print(rows_missing_data)

3786
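The single number above doesn't say which columns the gaps are in. A minimal sketch of a per-column breakdown, using a made-up toy dataframe rather than the real `merged_df`:

```python
import pandas as pd

# Toy data (made-up values) with gaps in two columns
toy = pd.DataFrame({
    "Lufttemperatur": [6.6, 6.6, None, 4.4],
    "Vindstyrke": [4.2, 4.0, 3.1, 1.3],
    "Vindkast": [None, None, None, 3.6],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number and share of rows with at least one missing value
rows_with_gaps = int(toy.isna().any(axis=1).sum())
share_of_rows = rows_with_gaps / len(toy)

print(missing_per_column)
print(f"{rows_with_gaps} of {len(toy)} rows ({share_of_rows:.2%}) have at least one gap")
```

Swapping `toy` for `merged_df` should give the same breakdown for the weather data, assuming it is run before any rows are dropped.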


#### **As we can see, 3786 rows are missing some data. That is only about 0.85% of the 446754 rows in the merged dataframe. I think we can afford to get rid of them.**

In [67]:
# Drop every row with at least one missing value, then renumber the index from 0
merged_df = merged_df.dropna().reset_index(drop=True)

print(merged_df)

             Dato    Tid  Solskinstid  Lufttemperatur  Vindstyrke  Vindkast
0      2015-01-08  15:30          0.0             4.4         1.3       3.6
1      2015-01-08  15:40          0.0             4.7         1.6       2.7
2      2015-01-08  15:50          0.0             4.5         1.9       2.7
3      2015-01-08  16:00          0.0             4.2         3.2       5.4
4      2015-01-08  16:10          0.0             4.5         2.8       4.5
...           ...    ...          ...             ...         ...       ...
442963 2023-06-30  23:10          0.0            13.7         2.3       3.6
442964 2023-06-30  23:20          0.0            13.6         1.9       3.3
442965 2023-06-30  23:30          0.0            13.6         1.7       3.0
442966 2023-06-30  23:40          0.0            13.6         1.9       3.3
442967 2023-06-30  23:50          0.0            13.5         1.9       3.0

[442968 rows x 6 columns]
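The first row that survives is from 2015-01-08 15:30, so at least the first week of 2015 was dropped as one block. To see whether the rest of the gaps are also clustered, one could count incomplete rows per day before dropping them. A sketch on made-up toy data, assuming "Dato" has already been converted to datetime:

```python
import pandas as pd

# Toy data (made-up values); in the notebook this would be merged_df before dropna
toy = pd.DataFrame({
    "Dato": pd.to_datetime(
        ["2015-01-01", "2015-01-01", "2015-01-08", "2015-01-08", "2015-01-09"]
    ),
    "Vindkast": [None, None, None, 3.6, 2.7],
})

# True for every row with at least one missing value
has_gap = toy.isna().any(axis=1)

# Count incomplete rows per day and keep only the days that have any
gaps_per_day = has_gap.groupby(toy["Dato"]).sum()
gaps_per_day = gaps_per_day[gaps_per_day > 0]

print(gaps_per_day)
```

If most gaps sit in a few long stretches, the model simply never sees those days. That may be fine, but it is worth knowing about.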


#### **The next thing I want to do is combine the columns "Dato" and "Tid" to get a single datetime column. This will be useful later.**

In [68]:
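# Turn both columns into strings so they can be joined with a space
merged_df["Dato"] = merged_df["Dato"].astype(str)
merged_df["Tid"] = merged_df["Tid"].astype(str)

# "2015-01-08" + " " + "15:30" -> "2015-01-08 15:30", then parse as datetime
merged_df["Datotid"] = merged_df["Dato"] + " " + merged_df["Tid"]
merged_df["Datotid"] = pd.to_datetime(merged_df["Datotid"])

# The separate date and time columns are no longer needed
merged_df.drop(["Dato", "Tid"], axis=1, inplace=True)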

print(merged_df)

        Solskinstid  Lufttemperatur  Vindstyrke  Vindkast             Datotid
0               0.0             4.4         1.3       3.6 2015-01-08 15:30:00
1               0.0             4.7         1.6       2.7 2015-01-08 15:40:00
2               0.0             4.5         1.9       2.7 2015-01-08 15:50:00
3               0.0             4.2         3.2       5.4 2015-01-08 16:00:00
4               0.0             4.5         2.8       4.5 2015-01-08 16:10:00
...             ...             ...         ...       ...                 ...
442963          0.0            13.7         2.3       3.6 2023-06-30 23:10:00
442964          0.0            13.6         1.9       3.3 2023-06-30 23:20:00
442965          0.0            13.6         1.7       3.0 2023-06-30 23:30:00
442966          0.0            13.6         1.9       3.3 2023-06-30 23:40:00
442967          0.0            13.5         1.9       3.0 2023-06-30 23:50:00

[442968 rows x 5 columns]
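Since "Dato" is already a datetime at this point, another way to build the combined column is to add the clock time as a timedelta instead of going through strings. This is an alternative sketch on toy data, not what the cell above does, and it assumes "Tid" is always in "HH:MM" form:

```python
import pandas as pd

# Toy data shaped like merged_df before the columns are combined
toy = pd.DataFrame({
    "Dato": pd.to_datetime(["2015-01-08", "2015-01-08"]),
    "Tid": ["15:30", "15:40"],
    "Lufttemperatur": [4.4, 4.7],
})

# Append ":00" so every time is an unambiguous "HH:MM:SS" before converting
toy["Datotid"] = toy["Dato"] + pd.to_timedelta(toy["Tid"] + ":00")

# The separate date and time columns are no longer needed
toy = toy.drop(columns=["Dato", "Tid"])

print(toy)
```

Both approaches should give the same "Datotid" values. This one just skips converting the dates to strings and parsing them again.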
