# Bike highways - revisit in AWS

We've let PyCaret do its magic, but can we also get these results ourselves?

We'll load the data for the same counting point locally and prepare it for use on AWS (as we saw in the lab).

In [None]:
import pandas as pd

df = pd.read_csv('files/bike_counters_data/Measured data-nl-Geel_FMN GV 21 Geel.csv')
df.head()

AWS tells us to:

- Removes instances with missing values
- Sets the index to the InvoiceDate feature
- Only keeps instances that are from the United Kingdom
- Only keeps instances that use the target stock code (21232)
- Keeps instances where the price is greater than 0

(These steps were written for the other dataset. Translate them to our dataset!)

In [None]:
# DELETE

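# Removes instances with missing values -> not needed for this dataset
# dropna() returns a new DataFrame, so the result has to be assigned to have any effect
df = df.dropna()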

# Sets the index to the InvoiceDate feature -> first combine the date, then set it as index
df["date_time"] = df["Datum"] + " " + df["Tijd"]
df["date_time"] = pd.to_datetime(df["date_time"])
df = df.set_index("date_time")
df.head()

# Only keeps instances that are from the United Kingdom -> ignore
# Only keeps instances that use the target stock code (21232) -> ignore
# Keeps instances where the price is greater than 0 -> don't delete hours on which there were no cyclists! A price of 0 is an error, but 0 cyclists is a valid count
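
The translation of the AWS cleaning steps can be tried out on a few made-up rows first. This is only a sketch: the sample values and the ISO date format are assumptions, and the real export may use a different date layout, in which case the `format` argument has to change.

```python
import pandas as pd

# Made-up rows that mimic the counter export; the real file's date format may differ.
sample = pd.DataFrame({
    "Datum": ["2022-07-01", "2022-07-01", "2022-07-02"],
    "Tijd": ["08:00", "09:00", "08:00"],
    "Aantal fietsers": [12, 0, 7],
})

# Count missing values per column before deciding whether dropna() is needed.
missing_per_column = sample.isna().sum()

# An explicit format avoids silent day/month mix-ups while parsing.
sample["date_time"] = pd.to_datetime(
    sample["Datum"] + " " + sample["Tijd"], format="%Y-%m-%d %H:%M"
)
sample = sample.set_index("date_time")

# Hours with 0 cyclists are valid readings, so they stay in the data.
print(missing_per_column)
print(sample)
```

Checking `isna().sum()` first shows whether dropping rows would change anything at all.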


Next part: generating training and test data frames. AWS says the following:

- Splits the data into a time series DataFrame and a related time series DataFrame.
- Downsamples the data from multiple sales entries per day into a single daily value. The **Quantity** column is summed, and the mean is used for the **Price** column.
- Splits the DataFrames into a training set that contains data from January 2010–October 2010, and a test set that contains data from November 2010–December 2010

Don't do it all at once. First resample to a daily value and then plot your data. That makes it easier to see which timespan to use for testing and training.

In [None]:
# DELETE

# Splits the data into a time series DataFrame and a related time series DataFrame.
df_time_series = df[["Aantal fietsers"]]
# df_time_series.head()

# Downsamples the data from multiple sales entries per day into a single daily value. The **Quantity** column is summed, and the mean is used for the **Price** column.
# We'll resample to a daily value, summing the number of cyclists
# (resample keeps date_time as the index, so no reset_index/set_index is needed)
df_time_series = df_time_series.resample('D').sum()
# df_time_series.head()
df_time_series.plot()
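
The AWS description sums one column and averages another while downsampling. We only have one column to sum, but for reference, a mixed aggregation could look like the sketch below. The data is made up, and the `cyclists` and `speed` columns are hypothetical names that do not exist in our file.

```python
import pandas as pd

# Two days of made-up hourly readings; "speed" is a hypothetical second column.
idx = pd.date_range("2022-07-01", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame(
    {"cyclists": [1] * 24 + [2] * 24, "speed": [10.0] * 24 + [20.0] * 24},
    index=idx,
)

# One aggregation per column: counts are summed, the other column is averaged.
daily = hourly.resample("D").agg({"cyclists": "sum", "speed": "mean"})
print(daily)
```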


Notice the lines that are stuck to the bottom? Those are days on which the meter was broken. We should delete these, as they're not readings but errors. Also note that the error values aren't always zero: around January 2021 there were some values that are way too small. Filter them out and plot again...

In [None]:
# df_time_series[ df_time_series["Aantal fietsers"] == df_time_series["Aantal fietsers"].max() ]

# df_time_series.loc[ df_time_series["Aantal fietsers"] <= 100 ].plot() -> arbitrary threshold, check which one works by trial and error.

df_time_series = df_time_series.loc[ df_time_series["Aantal fietsers"] >= 30 ]
df_time_series.plot()

See how the flat lines have been replaced by diagonal lines?

![](files/2023-10-09-15-51-09.png)

There aren't any values there, but a line graph connects consecutive points, so the plot draws a straight line between the values on either side of the gap. The model normally won't be bothered by it.
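
If the diagonal lines are distracting, one option is to put the removed days back as missing values just for plotting. The sketch below uses made-up daily counts and the same kind of threshold filter; it is an illustration, not a required step.

```python
import pandas as pd

# Six made-up daily totals, three of which look like a broken meter.
idx = pd.date_range("2022-07-01", periods=6, freq="D")
daily = pd.DataFrame({"Aantal fietsers": [120, 0, 0, 15, 130, 140]}, index=idx)

# Same kind of filter as above: drop days below an arbitrary threshold.
filtered = daily.loc[daily["Aantal fietsers"] >= 30]

# asfreq("D") reinserts the dropped days as NaN rows between the first and last date.
with_gaps = filtered.asfreq("D")
print(with_gaps)
```

Calling `with_gaps.plot()` then leaves the missing days empty instead of bridging them with a straight line.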

But back to AWS! Next part: generating training and test data frames. We were left with:

- Splits the DataFrames into a training set that contains data from January 2010–October 2010, and a test set that contains data from November 2010–December 2010

In [None]:
# DELETE

df_time_series

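# Training set: everything up to and including July 2022
start_July_2022 = df_time_series.loc[:'2022-07']
# Test set: everything from August 2022 onwards
August_2022_end = df_time_series.loc['2022-08':]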
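
A quick check on made-up data shows how this kind of split behaves: slicing a `DatetimeIndex` with a partial date string such as `'2022-07'` includes the whole month, so the two parts neither overlap nor drop rows. The names `train` and `test` and the date range are only for illustration.

```python
import pandas as pd

# Made-up daily data that spans the boundary between July and August 2022.
idx = pd.date_range("2022-06-29", "2022-08-02", freq="D")
daily = pd.DataFrame({"Aantal fietsers": range(len(idx))}, index=idx)

# A partial date string covers the whole month, so July 31 ends up in train.
train = daily.loc[:"2022-07"]
test = daily.loc["2022-08":]

print(train.index.max(), test.index.min())
print(len(train), len(test), len(daily))
```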

Next up is uploading to Amazon. It would be good to try this out in the AWS-canvas course and compare the results to those from PyCaret and our manual experiments (from the next notebook).