MUNICIPALITY_ID, TIMESTAMP, USAGE, TOTAL_CAPACITY
where municipality_id is an anonymized identifier that disguises the actual names, timestamp is the exact time of the measurement, usage is the number of buses in use at the time of measurement, and total_capacity is the total number of buses in the municipality. There are 10 municipalities (ids from 0 to 9), and two measurements per hour.
The committee asks you to forecast the hourly bus usage for the next week for each municipality. You can therefore aggregate the two measurements in each hour by taking their max value (a sum would not be a good idea, since it could count the same buses twice), and you should model this data with a time series model of your choice. (It is a good idea to implement a very simple baseline model first, and then try to improve the accuracy by introducing more complex methods. The bare minimum requirement of the task is one simple baseline and one complex method.)
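The hourly aggregation described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: `hourly_max` is a hypothetical helper name, and the sample rows are made up rather than taken from the real file.

```python
import pandas as pd

def hourly_max(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse the measurements within each hour to their maximum usage."""
    out = df.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    # Floor each timestamp to the start of its hour, e.g. 08:25:42 -> 08:00:00.
    out["hour"] = out["timestamp"].dt.floor("h")
    # One row per (municipality, hour), keeping the larger of the measurements.
    return out.groupby(["municipality_id", "hour"], as_index=False)["usage"].max()

# Made-up sample: two measurements in the 08:00 hour, one in the 09:00 hour.
sample = pd.DataFrame({
    "timestamp": ["2017-06-04 08:25:42", "2017-06-04 08:59:42", "2017-06-04 09:25:42"],
    "municipality_id": [0, 0, 0],
    "usage": [204, 332, 485],
    "total_capacity": [2813, 2813, 2813],
})
hourly = hourly_max(sample)
print(hourly)
```

Grouping by `municipality_id` as well as by hour keeps the ten series separate, so each municipality can later be modelled on its own.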
The committee says that they will use the last two weeks (from 2017-08-05 to 2017-08-19) as assessment (test) data, so your code should report the error (in the criterion you chose for the task) for the last two weeks. You may use the true values when predicting the last week of the test data; compute the error for the first and the last test week separately, then combine them.
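One possible reading of this evaluation protocol is sketched below with a seasonal-naive baseline (repeat the value from one week earlier). Several things here are assumptions and not part of the task statement: mean absolute error as the criterion, the plain average as the way to combine the two weekly errors, the interpretation that "true values" means the observed first test week, and the helper names `seasonal_naive` and `mae`. The series is synthetic.

```python
import pandas as pd

def seasonal_naive(history: pd.Series, target_index: pd.DatetimeIndex) -> pd.Series:
    """Forecast each timestamp with the value observed exactly 7 days earlier."""
    lagged = history.copy()
    lagged.index = lagged.index + pd.Timedelta(days=7)
    return lagged.reindex(target_index)

def mae(y_true: pd.Series, y_pred: pd.Series) -> float:
    """Mean absolute error; timestamps without a forecast are skipped."""
    return float((y_true - y_pred).abs().mean())

# Synthetic hourly series with a weekly pattern: 2 training weeks + 2 test weeks.
idx = pd.date_range("2017-01-02", periods=28 * 24, freq="h")
y = pd.Series((idx.dayofweek * 10 + idx.hour).to_numpy(dtype=float), index=idx)
y.iloc[14 * 24:] += 5.0  # the level shifts up by 5 during the test period

train, week1, week2 = y.iloc[:336], y.iloc[336:504], y.iloc[504:]

# First test week: forecast from the training data only.
pred1 = seasonal_naive(train, week1.index)
# Last test week: the true values of the first test week are added to the history.
pred2 = seasonal_naive(pd.concat([train, week1]), week2.index)

mae_week1 = mae(week1, pred1)
mae_week2 = mae(week2, pred2)
combined = (mae_week1 + mae_week2) / 2  # assumed way of combining the two errors
print(mae_week1, mae_week2, combined)
```

In this toy series the second week scores better only because the level shift is already visible in the first test week; with real data the gap would depend on how stable the weekly pattern is.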
Keep in mind that the dataset has missing data, so a suitable interpolation of the missing values would be useful.
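As one possible interpolation, the sketch below places an hourly series on a complete hourly grid and fills the gaps linearly in time. It assumes the series has already been aggregated to one value per hour for a single municipality; `fill_hourly_gaps` is a hypothetical helper and the values are made up.

```python
import pandas as pd

def fill_hourly_gaps(series: pd.Series) -> pd.Series:
    """Insert missing hours as NaN, then fill them by time-weighted linear interpolation."""
    # asfreq("h") builds a regular hourly index from the first to the last timestamp.
    full = series.sort_index().asfreq("h")
    return full.interpolate(method="time")

# Made-up series in which the 10:00 value is missing.
idx = pd.to_datetime([
    "2017-06-05 08:00", "2017-06-05 09:00", "2017-06-05 11:00", "2017-06-05 12:00",
])
usage = pd.Series([100.0, 120.0, 160.0, 180.0], index=idx)
filled = fill_hourly_gaps(usage)
print(filled)
```

Applied to a span of several days, this would also fill the hours between the last measurement of one day and the first of the next. If measurements only exist for part of the day, it is probably better to restrict the grid to those hours before interpolating.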

In [27]:
import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt

from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

In [28]:
df = pd.read_csv("municipality_bus_utilization.csv")

In [29]:
df.head()

Unnamed: 0,timestamp,municipality_id,usage,total_capacity
0,2017-06-04 07:59:42,9,454,1332
1,2017-06-04 07:59:42,8,556,2947
2,2017-06-04 07:59:42,4,1090,3893
3,2017-06-04 07:59:42,0,204,2813
4,2017-06-04 07:59:42,7,718,2019


In [30]:
df.shape

(13070, 4)

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13070 entries, 0 to 13069
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   timestamp        13070 non-null  object
 1   municipality_id  13070 non-null  int64 
 2   usage            13070 non-null  int64 
 3   total_capacity   13070 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 408.6+ KB


In [32]:
df.isnull().sum()

timestamp          0
municipality_id    0
usage              0
total_capacity     0
dtype: int64
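Every column reports zero nulls, so the missing data mentioned in the task would have to appear as absent measurements, not as NaN cells. A rough way to look for such gaps, sketched here under that assumption, is to count the distinct measurement times per calendar day. `measurements_per_day` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd

def measurements_per_day(df: pd.DataFrame) -> pd.Series:
    """Count distinct measurement timestamps per calendar day, including days with none."""
    # Each timestamp is repeated once per municipality, so deduplicate first.
    ts = pd.to_datetime(df["timestamp"]).drop_duplicates()
    counts = ts.dt.normalize().value_counts().sort_index()
    # Reindex over every day in the observed span so fully missing days show up as 0.
    all_days = pd.date_range(counts.index.min(), counts.index.max(), freq="D")
    return counts.reindex(all_days, fill_value=0)

# Made-up sample: two measurement times on 06-04, none on 06-05, one on 06-06.
sample = pd.DataFrame({
    "timestamp": [
        "2017-06-04 07:59:42", "2017-06-04 07:59:42",
        "2017-06-04 08:25:42",
        "2017-06-06 08:02:03",
    ],
    "municipality_id": [9, 8, 9, 9],
})
per_day = measurements_per_day(sample)
print(per_day)
```

Days whose count is clearly below the typical value are candidates for interpolation.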

In [34]:
df

Unnamed: 0,timestamp,municipality_id,usage,total_capacity
0,2017-06-04 07:59:42,9,454,1332
1,2017-06-04 07:59:42,8,556,2947
2,2017-06-04 07:59:42,4,1090,3893
3,2017-06-04 07:59:42,0,204,2813
4,2017-06-04 07:59:42,7,718,2019
...,...,...,...,...
13065,2017-08-19 16:30:35,2,548,697
13066,2017-08-19 16:30:35,8,1193,2947
13067,2017-08-19 16:30:35,7,1354,2019
13068,2017-08-19 16:30:35,6,1680,3113


In [35]:
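# Use the last two weeks (from 2017-08-05 to 2017-08-19) as assessment (test) data.
# "timestamp" holds strings, so these comparisons are lexicographic: a value such as
# "2017-08-19 16:30:35" sorts after "2017-08-19", so measurements taken on 2017-08-19
# itself are not selected and the selection ends on 2017-08-18.
mask = (df["timestamp"] > "2017-08-05") & (df["timestamp"] <= "2017-08-19")
test_data = df.loc[mask]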

In [36]:
test_data

Unnamed: 0,timestamp,municipality_id,usage,total_capacity
10390,2017-08-05 08:02:03,1,141,397
10391,2017-08-05 08:02:03,6,494,3113
10392,2017-08-05 08:02:03,7,581,2019
10393,2017-08-05 08:02:03,4,1782,3893
10394,2017-08-05 08:02:03,8,453,2947
...,...,...,...,...
12885,2017-08-18 16:30:25,5,215,587
12886,2017-08-18 16:30:25,4,1367,3893
12887,2017-08-18 16:30:25,7,1272,2019
12888,2017-08-18 16:30:25,1,374,397


In [38]:
test_data.tail()

Unnamed: 0,timestamp,municipality_id,usage,total_capacity
12885,2017-08-18 16:30:25,5,215,587
12886,2017-08-18 16:30:25,4,1367,3893
12887,2017-08-18 16:30:25,7,1272,2019
12888,2017-08-18 16:30:25,1,374,397
12889,2017-08-18 16:30:25,9,763,1332
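The tail above stops at 2017-08-18 16:30:25 although `df` itself runs to 2017-08-19, because a string such as "2017-08-19 16:30:35" compares greater than "2017-08-19". If the test window is meant to include 2017-08-19 (an interpretation of the task statement, not something it states outright), a split on parsed datetimes with an exclusive end date is one option. `split_by_date` and its parameters are hypothetical, and the sample rows are made up.

```python
import pandas as pd

def split_by_date(df: pd.DataFrame,
                  test_start: str = "2017-08-05",
                  test_end_exclusive: str = "2017-08-20"):
    """Return (train, test), where test covers test_start <= timestamp < test_end_exclusive."""
    ts = pd.to_datetime(df["timestamp"])
    mask = (ts >= pd.Timestamp(test_start)) & (ts < pd.Timestamp(test_end_exclusive))
    # Everything outside the window is training data; no row indices are hard-coded.
    return df.loc[~mask], df.loc[mask]

# Made-up sample with one row before the window and three rows inside it.
sample = pd.DataFrame({
    "timestamp": [
        "2017-08-04 16:30:00", "2017-08-05 08:02:03",
        "2017-08-18 16:30:25", "2017-08-19 16:30:35",
    ],
    "usage": [10, 20, 30, 40],
})
train_part, test_part = split_by_date(sample)
print(len(train_part), len(test_part))
```

Passing `test_end_exclusive="2017-08-19"` instead would give a window of exactly 14 days ending on 2017-08-18, which matches the selection shown above.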


In [39]:
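# Training data is everything outside the test selection. Dropping by test_data.index
# avoids hard-coding the row range 10390..12889.
# Note: the rows from 2017-08-19 are not part of test_data, so they remain in train_data.
train_data = df.drop(test_data.index)
train_data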

Unnamed: 0,timestamp,municipality_id,usage,total_capacity
0,2017-06-04 07:59:42,9,454,1332
1,2017-06-04 07:59:42,8,556,2947
2,2017-06-04 07:59:42,4,1090,3893
3,2017-06-04 07:59:42,0,204,2813
4,2017-06-04 07:59:42,7,718,2019
...,...,...,...,...
13065,2017-08-19 16:30:35,2,548,697
13066,2017-08-19 16:30:35,8,1193,2947
13067,2017-08-19 16:30:35,7,1354,2019
13068,2017-08-19 16:30:35,6,1680,3113
