# The bike barometer

The region of Flanders has been building a lot of bike highways: long stretches of really fine cycling paths. To see how well they are used, counters were installed along them, like the one in the following picture:

<img src="../files/2023-10-02-23-00-11.png" alt="image" width="600"/>

All data of these counters can be found on the following website:

[link](https://fietsbarometer.provincieantwerpen.be/geoloketten/?viewer=fietsbarometer)

We've downloaded a lot of the data from these counters. Let's start by importing them and cleaning them.

Tryout 1: can we read one file?

In [None]:
import pandas as pd

df_1 = pd.read_csv('../files/bike_counters_data/Measured data-nl-Geel_FMN GV 21 Geel.csv')
df_1.head()

But why read one file if we can read all the files? Carefully type all the filenames into an array and load all the files in one big dataframe. Or maybe Python has some sort of way of returning a list of all files in a folder that you can use?

Combining dataframes used to be done with "append", but that method was deprecated and has been removed in pandas 2.0. Use "concat" instead.
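
One possible way to go about it, sketched on throwaway files rather than the real data: the two tiny CSV files and their names are invented for the example, and the real folder is '../files/bike_counters_data/'.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the data folder: two tiny CSV files in a temporary directory.
# File names and values are made up for this sketch.
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "counter_a.csv").write_text("Aantal fietsers\n3\n5\n")
(tmp_dir / "counter_b.csv").write_text("Aantal fietsers\n7\n")

# Path.glob returns every file matching the pattern, so no filename has to be typed.
csv_files = sorted(tmp_dir.glob("*.csv"))

# Read each file and stack the results; ignore_index gives one continuous index.
df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)
print(len(df))
```

For the real data, pointing `Path` at the data folder instead of `tmp_dir` should be enough, assuming all files share the same columns.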

In [None]:
# Up to you!



This should yield a dataframe of 725,740 rows. Check out the datatypes next.

In [None]:
df.dtypes

The dates are not dates (but objects, which comes down to strings) and the total number of cyclists ("Aantal fietsers") is a float while the other two counts are integers. That is fishy. Check if "Aantal fietsers" is always equal to "Aantal fietsers van" plus "Aantal fiets naar".
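
A minimal sketch of such a check, on a toy dataframe that reuses the column names above (the values are invented):

```python
import pandas as pd

# Toy rows using the column names from the text; the numbers are made up.
df = pd.DataFrame({
    "Aantal fietsers": [5.0, 9.0, 2.0],
    "Aantal fietsers van": [2, 4, 1],
    "Aantal fiets naar": [3, 5, 1],
})

# Element-wise comparison of the total with the sum of both directions.
matches = df["Aantal fietsers"] == df["Aantal fietsers van"] + df["Aantal fiets naar"]

# Rows where the check fails; an empty result means everything adds up.
mismatches = df[~matches]
print(mismatches)
print(matches.all())
```

Note that a missing total (NaN) never compares as equal, so rows with NaN would show up among the mismatches.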

In [None]:
# Up to you!



Glad that checks out. We do see some NaNs, which is annoying. Are there a lot of them?
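
Counting missing values could look roughly like this (toy data, invented values):

```python
import numpy as np
import pandas as pd

# Toy dataframe with two missing totals; the numbers are made up.
df = pd.DataFrame({
    "Aantal fietsers": [5.0, np.nan, 2.0, np.nan],
    "Aantal fietsers van": [2, 4, 1, 0],
})

# Number of NaNs per column.
nan_per_column = df.isna().sum()

# Number of rows that contain at least one NaN.
rows_with_nan = df.isna().any(axis=1).sum()

print(nan_per_column)
print(rows_with_nan)
```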

In [None]:
# Up to you!



Nope. Those rows were the only ones. That means we can safely drop all rows containing NaNs in the dataframe. Print the number of rows in the dataframe after the operation.
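
As a small sketch on toy data (invented values), dropping those rows might go like this:

```python
import numpy as np
import pandas as pd

# Toy dataframe with two incomplete rows; the numbers are made up.
df = pd.DataFrame({
    "Aantal fietsers": [5.0, np.nan, 2.0, np.nan],
    "Aantal fietsers van": [2, 4, 1, 0],
})

# dropna removes every row that has a NaN in any column.
# reset_index(drop=True) renumbers the remaining rows from 0.
df = df.dropna().reset_index(drop=True)

print(len(df))
```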

In [None]:
# Up to you!



Now fix the dates. Make sure the columns "Datum" and "Tijd" are saved in a single column as a datetime-field. Print the head of this new field.
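
A sketch of one way to do this. Two things are assumptions here: the date and time layout used in the toy data (the real files may use a different one, so the `format` string would need adjusting) and the name "Tijdstip" for the new column, which is made up.

```python
import pandas as pd

# Toy data; the date/time layout is an assumption about the real files.
df = pd.DataFrame({
    "Datum": ["2023-10-02", "2023-10-02"],
    "Tijd": ["07:00", "08:00"],
    "Aantal fietsers": [12.0, 30.0],
})

# Glue both strings together and parse them in one go.
# "Tijdstip" is a hypothetical column name chosen for this example.
df["Tijdstip"] = pd.to_datetime(df["Datum"] + " " + df["Tijd"], format="%Y-%m-%d %H:%M")

# The separate string columns are no longer needed.
df = df.drop(columns=["Datum", "Tijd"])

print(df["Tijdstip"].head())
```

Passing an explicit `format` avoids pandas guessing whether the day or the month comes first.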

In [None]:
# Up to you!



In [None]:
df.dtypes

## Analyzing

The data is now fully loaded and ready for some analysis. But seven hundred thousand rows is such an intimidatingly large number. Maybe select only the measurements of the counter nearest the school, at Rauwelkoven? The ID ("Meetpunt code") there is "FMN GV 21 Geel". Group those rows by day of week and show the averages in a graph.
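
The filtering and grouping could be sketched as follows. The toy rows are invented, and "Tijdstip" is a hypothetical name for the combined datetime column.

```python
import pandas as pd

# Toy data: four rows for the Geel counter, one for another (made-up) counter.
df = pd.DataFrame({
    "Meetpunt code": ["FMN GV 21 Geel"] * 4 + ["Other counter"],
    "Tijdstip": pd.to_datetime([
        "2023-10-02 08:00",  # Monday
        "2023-10-02 09:00",  # Monday
        "2023-10-07 08:00",  # Saturday
        "2023-10-08 08:00",  # Sunday
        "2023-10-02 08:00",  # Monday, other counter
    ]),
    "Aantal fietsers": [10.0, 20.0, 4.0, 6.0, 99.0],
})

# Keep only the rows of the counter we are interested in.
geel = df[df["Meetpunt code"] == "FMN GV 21 Geel"]

# Average number of cyclists per weekday name.
daily_avg = geel.groupby(geel["Tijdstip"].dt.day_name())["Aantal fietsers"].mean()

print(daily_avg)
```

Calling `daily_avg.plot(kind="bar")` on the result then draws the chart (this needs matplotlib installed).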

In [None]:
# Up to you!



This plot shows that most of the traffic occurs on weekdays and less on weekends (though slightly more on Sundays than on Saturdays). Annoying: the days are out of order:

![](../files/2023-10-02-23-26-39.png)

Fix that one first! (And you are [never](https://stackoverflow.com/questions/47255746/change-order-on-x-axis-for-matplotlib-chart) the first with a problem.)
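
One way to get the days in calendar order is to reindex the grouped result with an explicit list. A sketch, with invented averages in the alphabetical order a groupby on day names would produce:

```python
import pandas as pd

# Made-up averages, sorted alphabetically like a groupby on day names would be.
daily_avg = pd.Series(
    {
        "Friday": 50.0, "Monday": 55.0, "Saturday": 20.0, "Sunday": 25.0,
        "Thursday": 52.0, "Tuesday": 57.0, "Wednesday": 54.0,
    },
    name="Aantal fietsers",
)

weekday_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
]

# reindex rearranges the rows to follow the given list of labels.
daily_avg = daily_avg.reindex(weekday_order)

print(daily_avg)
```

An alternative is to group on the day number (`dt.dayofweek`, Monday is 0) instead of the day name, which sorts correctly by itself but leaves numbers on the axis.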

In [None]:
# Up to you!



Should this be a line or a bar graph, by the way? There are nice graphs about which graph to use:

![](../files/2023-10-03-19-20-09.png)

So we can use bars for days, but not for hours of the day or months (too many bars will decrease readability).

So we know weekdays are popular, which implies that the road is used more for traffic to and from school/work. This should also show up in the hours of the day it is being used.

Show the same graph but grouped by hour of the day. Also, show all hours of the day, something like:

![](../files/2023-10-03-10-38-14.png)
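
A rough sketch of the hourly grouping on toy data. The values are invented and "Tijdstip" is a hypothetical name for the datetime column.

```python
import pandas as pd

# Toy data covering only two different hours.
df = pd.DataFrame({
    "Tijdstip": pd.to_datetime(
        ["2023-10-02 07:00", "2023-10-02 08:00", "2023-10-03 08:00"]
    ),
    "Aantal fietsers": [30.0, 40.0, 60.0],
})

# Average number of cyclists per hour of the day.
hourly_avg = df.groupby(df["Tijdstip"].dt.hour)["Aantal fietsers"].mean()

# Make sure all 24 hours are present; hours without data become NaN.
hourly_avg = hourly_avg.reindex(range(24))

print(hourly_avg.loc[[7, 8]])
```

When plotting, `hourly_avg.plot(xticks=range(24))` puts a tick at every hour instead of letting matplotlib pick a few.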

In [None]:
# Up to you!



There aren't many cyclists on weekends, but maybe they account for the night-time rides, while weekday traffic is concentrated during the day. We also see a spike around 7 and 8, which is when the commuters are using the road. That spike should be gone on Saturday and Sunday, no?

Draw a graph showing the average of the total number of cyclists ("Aantal fietsers", not "van" and "naar") for weekends and weekdays. ([tip](https://datascienceparichay.com/article/pandas-check-weekday-or-weekend/))
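
A possible sketch, splitting the hourly averages into a weekday and a weekend column. The data is invented, and "Tijdstip", "hour" and "day_type" are names made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Tijdstip": pd.to_datetime([
        "2023-10-02 08:00",  # Monday
        "2023-10-03 08:00",  # Tuesday
        "2023-10-07 08:00",  # Saturday
        "2023-10-08 14:00",  # Sunday
        "2023-10-02 14:00",  # Monday
    ]),
    "Aantal fietsers": [40.0, 60.0, 10.0, 30.0, 20.0],
})

# dayofweek counts from Monday=0, so 5 and 6 are Saturday and Sunday.
day_type = (
    (df["Tijdstip"].dt.dayofweek >= 5)
    .map({True: "weekend", False: "weekday"})
    .rename("day_type")
)
hour = df["Tijdstip"].dt.hour.rename("hour")

# One row per hour, one column per day type.
hourly = df.groupby([hour, day_type])["Aantal fietsers"].mean().unstack("day_type")

print(hourly)
```

Calling `hourly.plot()` on this table draws one line per column, so weekdays and weekends end up in the same graph.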

In [None]:
# Up to you!



What about the impact of the seasons? Are there more cyclists in the summer, or less because we have fewer commuters?

Show a graph of the average number of cyclists per month, for weekends and for weekdays.
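
The same split per month can also be written with a pivot table. A sketch on invented data, where "Tijdstip", "month" and "day_type" are hypothetical names:

```python
import pandas as pd

df = pd.DataFrame({
    "Tijdstip": pd.to_datetime([
        "2023-01-02 08:00",  # Monday
        "2023-01-09 08:00",  # Monday
        "2023-01-07 08:00",  # Saturday
        "2023-07-03 08:00",  # Monday
        "2023-07-08 08:00",  # Saturday
    ]),
    "Aantal fietsers": [30.0, 50.0, 5.0, 20.0, 25.0],
})

# Add helper columns for the month and for weekday/weekend.
df = df.assign(
    month=df["Tijdstip"].dt.month,
    day_type=(df["Tijdstip"].dt.dayofweek >= 5).map(
        {True: "weekend", False: "weekday"}
    ),
)

# Rows are months, columns are day types, cells are averages.
monthly = df.pivot_table(
    index="month", columns="day_type", values="Aantal fietsers", aggfunc="mean"
)

print(monthly)
```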

In [None]:
# Up to you!



If ever a graph explained how humans work, this would be it.

* Weekend cycling is up when it's nice and warm (April-September).
* Weekday cycling has a spike in September ("This year, I'll commute by bike!") and a quick and sharp fall afterwards.
* In February the New Year's resolutions kick in and people start cycling again.
* Weekday cycling is up when it's warm, but dips during the holidays (July and August).

Finally, recreate this graph but with all records, not just the ones from "fmn_gv_21".

In [None]:
# Up to you!



The spikes are still there, but the difference between week and weekend is somewhat smoothed over.

## Boxplots

We've done a lot of line charts, but maybe a boxplot would be interesting as well. Could we show a boxplot of all daily averages for all the measuring points per month?
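
The question can be read in more than one way. The sketch below assumes this interpretation: first sum the counts per measuring point per day, then average those daily totals per measuring point per month. The data is invented and "Tijdstip", "date" and "month" are made-up names.

```python
import pandas as pd

df = pd.DataFrame({
    "Meetpunt code": ["A", "A", "A", "B", "B"],
    "Tijdstip": pd.to_datetime([
        "2023-01-02 08:00", "2023-01-02 09:00", "2023-01-03 08:00",
        "2023-01-02 08:00", "2023-01-03 08:00",
    ]),
    "Aantal fietsers": [10.0, 20.0, 30.0, 4.0, 8.0],
})

# Step 1: total number of cyclists per measuring point per calendar day.
# normalize() sets the time to midnight, which leaves just the date.
date = df["Tijdstip"].dt.normalize().rename("date")
daily = df.groupby(["Meetpunt code", date])["Aantal fietsers"].sum().reset_index()

# Step 2: average of those daily totals per measuring point per month.
daily["month"] = daily["date"].dt.month
monthly_avg = (
    daily.groupby(["Meetpunt code", "month"])["Aantal fietsers"].mean().reset_index()
)

print(monthly_avg)
```

With matplotlib installed, `monthly_avg.boxplot(column="Aantal fietsers", by="month")` then draws one box per month, with one data point per measuring point.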

In [None]:
# Up to you!



There seems to be one outlier, a bike path with a monthly average of over 800 cyclists. Which is it?
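
Looking up the culprit could be done along these lines. The table is a toy stand-in with invented counters and values; "month" is a made-up column name.

```python
import pandas as pd

# Toy table of monthly averages per measuring point.
monthly_avg = pd.DataFrame({
    "Meetpunt code": ["Counter A", "Counter B", "Counter C"],
    "month": [5, 5, 5],
    "Aantal fietsers": [120.0, 850.0, 140.0],
})

# idxmax gives the index label of the largest value; loc fetches that row.
busiest = monthly_avg.loc[monthly_avg["Aantal fietsers"].idxmax()]

# Alternatively, filter on the threshold read from the boxplot.
above_800 = monthly_avg[monthly_avg["Aantal fietsers"] > 800]

print(busiest)
print(above_800)
```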

In [None]:
# Up to you!



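There are several rules for identifying outliers, but the most common one is the 1.5 * IQR rule: a value counts as an outlier if it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile. This datapoint is definitely an outlier. Remove it!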
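
The 1.5 * IQR rule can be sketched like this, on a toy table with invented counters and values:

```python
import pandas as pd

# Toy monthly averages; "F" plays the role of the suspicious bike path.
monthly_avg = pd.DataFrame({
    "Meetpunt code": ["A", "B", "C", "D", "E", "F"],
    "Aantal fietsers": [100.0, 120.0, 130.0, 150.0, 160.0, 850.0],
})
values = monthly_avg["Aantal fietsers"]

# The interquartile range is the distance between the 1st and 3rd quartile.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything further than 1.5 * IQR outside the quartiles counts as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Keep only the rows that are not outliers.
cleaned = monthly_avg[~is_outlier]

print(lower, upper)
print(cleaned)
```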

In [None]:
# Up to you!



Now look at only the averages and explain your data.