# **INF161 - Bike Traffic Prediction Project**
### *Ole Kristian Westby | owe009@uib.no | H23*

This project uses data from Statens vegvesen and Geofysisk institutt. The goal is to build a model that predicts the number of cyclists crossing Nygårdsbroen at a given time. This Jupyter notebook prepares the data, keeping only what I consider useful for that task, and I explain my steps along the way. At the end, we'll have some clean, ready data that is written to /ready_data/ for the model to work on.

I recognize that over the years there have been periods when people may have cycled more or less than usual because of external factors. I'll keep a list here and update it as I find more.
- Covid-19 likely kept more people at home, especially at peak times. Fewer people cycled to work because they were working from home. I'm only interested in the peak of the pandemic, though.
- 2017 UCI Road World Championships. I've checked the routes and can't see that any of the races crossed Nygårdsbroen, but I'll look closer at the data later.

#### **Let's start by importing some libraries.**

In [176]:
import numpy as np
import pandas as pd
import os

#### **We'll handle the weather data first.**

In [177]:
dir_weather = "raw_data/weather_data/"

files = [f for f in os.listdir(dir_weather) if f.endswith('.csv')]

# Interesting columns
columns = ["Dato", "Tid", "Solskinstid", "Lufttemperatur", "Vindstyrke", "Vindkast"]

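# Read only the interesting columns from each file
dfs = []
for file in files:
    file_path = os.path.join(dir_weather, file)
    df = pd.read_csv(file_path, usecols=columns)
    dfs.append(df)

# Stack the per-file dataframes into one, with a fresh 0..n-1 index
merged_weather_df = pd.concat(dfs, ignore_index=True)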
merged_weather_df.tail()

|  | Dato | Tid | Solskinstid | Lufttemperatur | Vindstyrke | Vindkast |
|---|---|---|---|---|---|---|
| 709216 | 2023-06-30 | 23:10 | 0.0 | 13.7 | 2.3 | 3.6 |
| 709217 | 2023-06-30 | 23:20 | 0.0 | 13.6 | 1.9 | 3.3 |
| 709218 | 2023-06-30 | 23:30 | 0.0 | 13.6 | 1.7 | 3.0 |
| 709219 | 2023-06-30 | 23:40 | 0.0 | 13.6 | 1.9 | 3.3 |
| 709220 | 2023-06-30 | 23:50 | 0.0 | 13.5 | 1.9 | 3.0 |
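
The loading step above can also be sketched with `pathlib` and a sorted file list, so that the concatenation order does not depend on how the operating system happens to list the directory. This is only a sketch, not the notebook's own code: the helper name `load_weather_csvs`, the temporary sample files and the extra column `Ekstra` are made up for illustration. Only the six column names come from the cell above.

```python
import tempfile
from pathlib import Path

import pandas as pd

COLUMNS = ["Dato", "Tid", "Solskinstid", "Lufttemperatur", "Vindstyrke", "Vindkast"]


def load_weather_csvs(directory, columns=COLUMNS):
    """Read every .csv file in `directory`, sorted by file name, and stack them."""
    paths = sorted(Path(directory).glob("*.csv"))
    frames = [pd.read_csv(path, usecols=columns) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Hypothetical sample files, only here to make the sketch runnable on its own.
sample_dir = Path(tempfile.mkdtemp())
header = "Dato,Tid,Ekstra,Solskinstid,Lufttemperatur,Vindstyrke,Vindkast\n"
(sample_dir / "2022.csv").write_text(header + "2022-01-01,00:00,1,0.0,2.5,3.1,5.0\n")
(sample_dir / "2021.csv").write_text(header + "2021-01-01,00:00,1,0.0,1.5,2.0,4.0\n")

sample_df = load_weather_csvs(sample_dir)
print(sample_df)
```

Because the paths are sorted, the 2021 row ends up before the 2022 row even though the 2022 file was written first, and `usecols` drops the made-up `Ekstra` column.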


#### **We now have one big dataframe containing all the interesting weather data from 2010 to 2023. The traffic data only covers 2015-2023, however, so I want to drop all weather data from before 2015. The model may get slightly less accurate with less data, but I think 2015-2023 is enough, and it keeps things simpler. I won't delete the older files from the raw_data folder, because I may later want to compare predictions with and without them, but for now I won't focus on that.**

In [178]:
merged_weather_df["Dato"] = pd.to_datetime(merged_weather_df["Dato"])

merged_weather_df = merged_weather_df[merged_weather_df["Dato"].dt.year >= 2015]
merged_weather_df = merged_weather_df.reset_index(drop=True)

merged_weather_df.tail()

|  | Dato | Tid | Solskinstid | Lufttemperatur | Vindstyrke | Vindkast |
|---|---|---|---|---|---|---|
| 446749 | 2023-06-30 | 23:10 | 0.0 | 13.7 | 2.3 | 3.6 |
| 446750 | 2023-06-30 | 23:20 | 0.0 | 13.6 | 1.9 | 3.3 |
| 446751 | 2023-06-30 | 23:30 | 0.0 | 13.6 | 1.7 | 3.0 |
| 446752 | 2023-06-30 | 23:40 | 0.0 | 13.6 | 1.9 | 3.3 |
| 446753 | 2023-06-30 | 23:50 | 0.0 | 13.5 | 1.9 | 3.0 |


#### **Near the start of the merged dataframe, some values are missing for Vindkast on 2015-01-01. I want clean, complete data, and who knows how many rows are missing data in one or more columns. Let's find out.**

In [179]:
# Count the rows that have at least one missing value
rows_missing_data = merged_weather_df.isna().any(axis=1).sum()
print(rows_missing_data)

3786
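
The count above says how many rows are incomplete, but not which columns are responsible. A per-column breakdown can be sketched as follows. The small dataframe is made-up sample data that only borrows the column names from the weather data; it is not taken from the real files.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same column names as the weather data.
sample_df = pd.DataFrame({
    "Solskinstid": [0.0, np.nan, 0.0, 10.0],
    "Lufttemperatur": [1.2, 1.1, 1.0, 0.9],
    "Vindstyrke": [2.3, 2.1, np.nan, 1.9],
    "Vindkast": [np.nan, np.nan, 3.0, 3.3],
})

# Number of missing values in each column.
missing_per_column = sample_df.isna().sum()

# Number of rows with at least one missing value.
rows_with_missing = int(sample_df.isna().any(axis=1).sum())

print(missing_per_column)
print(rows_with_missing)
```

Running the same two expressions on `merged_weather_df` would show whether the gaps are concentrated in a single column such as Vindkast or spread across several.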


#### **As we can see, 3786 rows are missing some data. That is only about 0.85% of the merged dataframe, so I think we can afford to get rid of them.**

In [180]:
merged_weather_df.dropna(inplace=True)

merged_weather_df = merged_weather_df.reset_index(drop=True)

merged_weather_df.tail()

|  | Dato | Tid | Solskinstid | Lufttemperatur | Vindstyrke | Vindkast |
|---|---|---|---|---|---|---|
| 442963 | 2023-06-30 | 23:10 | 0.0 | 13.7 | 2.3 | 3.6 |
| 442964 | 2023-06-30 | 23:20 | 0.0 | 13.6 | 1.9 | 3.3 |
| 442965 | 2023-06-30 | 23:30 | 0.0 | 13.6 | 1.7 | 3.0 |
| 442966 | 2023-06-30 | 23:40 | 0.0 | 13.6 | 1.9 | 3.3 |
| 442967 | 2023-06-30 | 23:50 | 0.0 | 13.5 | 1.9 | 3.0 |
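
As a sanity check, the row counts can be reconstructed from the outputs shown so far. The numbers below are read off the notebook outputs (the last index before dropping was 446753, and 3786 rows had missing data); the variable names are just for illustration.

```python
rows_before = 446_754  # last index 446753 before dropping, plus one
rows_missing = 3_786   # printed by the missing-data cell
rows_after = rows_before - rows_missing

share_missing = rows_missing / rows_before * 100

print(rows_after)               # 442968
print(f"{share_missing:.2f}%")  # 0.85%
```

The result of 442968 rows matches the last index of 442967 in the output above, and the dropped share comes out at just under 0.85%.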


#### **The next thing I want to do is combine the columns "Dato" and "Tid" to get a single datetime column. This will be useful later.**

In [181]:
merged_weather_df["Dato"] = merged_weather_df["Dato"].astype(str)
merged_weather_df["Tid"] = merged_weather_df["Tid"].astype(str)

merged_weather_df["Datotid"] = merged_weather_df["Dato"] + " " + merged_weather_df["Tid"]

merged_weather_df["Datotid"] = pd.to_datetime(merged_weather_df["Datotid"])

merged_weather_df.drop(["Dato", "Tid"], axis=1, inplace=True)

merged_weather_df.tail()

|  | Solskinstid | Lufttemperatur | Vindstyrke | Vindkast | Datotid |
|---|---|---|---|---|---|
| 442963 | 0.0 | 13.7 | 2.3 | 3.6 | 2023-06-30 23:10:00 |
| 442964 | 0.0 | 13.6 | 1.9 | 3.3 | 2023-06-30 23:20:00 |
| 442965 | 0.0 | 13.6 | 1.7 | 3.0 | 2023-06-30 23:30:00 |
| 442966 | 0.0 | 13.6 | 1.9 | 3.3 | 2023-06-30 23:40:00 |
| 442967 | 0.0 | 13.5 | 1.9 | 3.0 | 2023-06-30 23:50:00 |
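
One reason a single datetime column is handy is that time features can be derived from it directly. Here is a small sketch on made-up rows shaped like the weather data. The explicit `format` string is an assumption based on how the dates and times look in the outputs above, and the derived column names `hour` and `weekday` are hypothetical.

```python
import pandas as pd

# Made-up rows in the same shape as the weather data.
sample_df = pd.DataFrame({
    "Dato": ["2023-06-30", "2023-06-30"],
    "Tid": ["23:40", "23:50"],
    "Lufttemperatur": [13.6, 13.5],
})

# Join date and time as text, then parse with an explicit format.
sample_df["Datotid"] = pd.to_datetime(
    sample_df["Dato"] + " " + sample_df["Tid"], format="%Y-%m-%d %H:%M"
)
sample_df = sample_df.drop(columns=["Dato", "Tid"])

# Derive time features from the datetime column (Monday is 0).
sample_df["hour"] = sample_df["Datotid"].dt.hour
sample_df["weekday"] = sample_df["Datotid"].dt.dayofweek

print(sample_df)
```

Passing `format` makes the parsing explicit instead of leaving pandas to infer it, which can also help on large frames.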


#### **Next, we'll look at the traffic data.**

In [182]:
dir_traffic = "raw_data/traffic_data/trafikkdata.csv"
traffic_df = pd.read_csv(dir_traffic)

traffic_df.tail()


Unnamed: 0,Trafikkregistreringspunkt;Navn;Vegreferanse;Fra;Til;Dato;Fra tidspunkt;Til tidspunkt;Felt;Trafikkmengde;Dekningsgrad (%)|Antall timer total|Antall timer inkludert|Antall timer ugyldig|Ikke gyldig lengde|Lengdekvalitetsgrad (%)|< 5,6m|>= 5,6m|5,6m - 7,6m|7,6m - 12,5m|12,5m - 16,0m|>= 16,0m|16,0m - 24,0m|>= 24,0m
348635,17510B2483952;Gamle Nygårdsbru sykkel;KV256 S2...,,,,,,,,,,,,
348636,17510B2483952;Gamle Nygårdsbru sykkel;KV256 S2...,,,,,,,,,,,,
348637,17510B2483952;Gamle Nygårdsbru sykkel;KV256 S2...,,,,,,,,,,,,
348638,17510B2483952;Gamle Nygårdsbru sykkel;KV256 S2...,,,,,,,,,,,,
348639,17510B2483952;Gamle Nygårdsbru sykkel;KV256 S2...,,,,,,,,,,,,
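
The output above puts almost everything into a single column, because the file is not comma-separated: the header shows `;` and `|` between the column names, while the commas only appear inside names such as `< 5,6m`. One possible way to read such a file is to pass a regular expression as the separator. This is a sketch under assumptions: the sample text below is made up and shortened, and it assumes the data rows use the same two delimiters as the header, which the truncated output cannot confirm.

```python
import io

import pandas as pd

# Made-up, shortened sample that mimics the mixed ";" and "|" delimiters.
sample_csv = (
    "Navn;Dato;Fra tidspunkt;Trafikkmengde;Dekningsgrad (%)|< 5,6m|>= 5,6m\n"
    "Gamle Nygårdsbru sykkel;2023-06-30;23:00;12;100,0|10|2\n"
)

# A separator longer than one character is treated as a regular expression,
# which requires the slower Python parsing engine.
sample_df = pd.read_csv(io.StringIO(sample_csv), sep=r"[;|]", engine="python")

print(sample_df.columns.tolist())
print(sample_df)
```

With the real file, the same `sep` and `engine` arguments would be passed to `pd.read_csv(dir_traffic, ...)`. Values written with a decimal comma, like `100,0` here, are still read as text unless they are converted separately.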
