# Script for TMS Data Cleaning - Part 3

<strong><em>Important: This is a guide that explains the data cleaning we did before this Hack-a-thon. Some parts can, and sometimes should, be copied and pasted directly. However, you won't be able to copy the whole notebook and run it within your project.</em></strong>

## Creating the Client Connection to the Cloud Object Storage and the "smart-city-live-vehicle-positions" bucket

The following code cell can be inserted automatically through the Notebook UI. To do so, click the data button (top right corner), where you will find the *files* and *connections* tabs. Go to *connections*, since we want to create a client for our Cloud Object Storage.

There you will find the connection we created before. Click "insert to code" and choose the "StreamingBody object" option. A pop-up then opens that shows the folder structure of your underlying cloud bucket. Navigate through the folders and subfolders until you reach the last subfolder, which contains all the .json files we need. Choose one file and click *Select*. You will then see an automatically inserted code cell that looks like this one, except that it contains the correct API keys etc.

> It doesn't matter which .json file you choose, because later on we will only use the created client object to access more than one .json file.

# TMS Data Cleaning (the real part 😉)

We kind of started the cleaning process a bit early by deleting the duplicated rows as described above. Now we take a really close look at the documentation and the features we have at hand.

## Delete all the irrelevant features 

After looking at the documentation __(https://www.digitraffic.fi/en/road-traffic/lam/)__ and deciding for ourselves which features are irrelevant, because of the way they were measured or what they describe, we decided to drop the following ones:

- oldName
- sensorUnit
- shortName
- lastUpdated
- lastError
- type
- status
- timeWindowStart
- timeWindowEnd

> ____‼️____ Note that you don't have to decide the same way we did. Think about which features are important enough for you to keep.

In [18]:
# Besides the features listed above, this also drops 'name' and the 'index' column
df_tms_clean = df_tms.drop(['name', 'timeWindowStart', 'timeWindowEnd', 'oldName', 'sensorUnit', 'shortName', 'lastUpdated', 'lastError', 'type', 'status', 'index'], axis=1)

In [19]:
df_tms_clean.head()

Unnamed: 0,measuredTime,roadStationId,id,sensorValue,timestamp,latitude,longitude
0,2022-03-12T23:59:35Z,23575.0,5158.0,139.0,2022-03-12 23:00:00,60.486397,26.54646
1,2022-03-12T23:59:35Z,23575.0,5164.0,1.0,2022-03-12 23:00:00,60.486397,26.54646
2,2022-03-12T23:59:35Z,23575.0,5071.0,1.0,2022-03-12 23:00:00,60.486397,26.54646
3,2022-03-12T23:59:35Z,23575.0,5057.0,103.0,2022-03-12 23:00:00,60.486397,26.54646
4,2022-03-12T23:59:35Z,23575.0,5168.0,1.0,2022-03-12 23:00:00,60.486397,26.54646
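
The column drop above can be tried on a toy frame first. The following is a minimal sketch, assuming a tiny hand-made DataFrame (`df_demo`, `irrelevant` and all values are made up for illustration). It passes `errors="ignore"`, so a label that is missing from the frame is skipped instead of raising a `KeyError`, which can help if you run the cell a second time.

```python
import pandas as pd

# Hypothetical mini version of df_tms (values are made up for illustration)
df_demo = pd.DataFrame({
    "roadStationId": [23575, 23575],
    "id": [5158, 5071],
    "sensorValue": [139, 1],
    "oldName": ["a", "b"],
    "status": ["OK", "OK"],
})

# Columns we consider irrelevant; 'lastError' does not exist in df_demo
irrelevant = ["oldName", "status", "lastError"]

# errors="ignore" skips labels that are not present instead of raising KeyError
df_demo_clean = df_demo.drop(columns=irrelevant, errors="ignore")

print(list(df_demo_clean.columns))
```

Whether silently skipping missing columns is what you want depends on your workflow; the default (`errors="raise"`) is stricter and catches typos in column names.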


# Subsample to the hourly data
(https://www.digitraffic.fi/en/road-traffic/lam/)

If you took a look at the documentation of the API, you found that there are IDs for five minute and sixty minute measurements. We decided to go with the higher aggregation of the data, which contains enough information for now.

All the IDs that contain the data for the sixty minute measurements: 
- 5056
- 5057
- 5054
- 5055
- 5067
- 5071

So we subset our `df_tms_clean` into a new DataFrame, `df_tms_hour`, which now contains only the hourly data.

In [20]:
df_tms_hour = df_tms_clean[df_tms_clean.id.isin([5056, 5057, 5054, 5055, 5067, 5071])]

In [21]:
df_tms_hour.head()

Unnamed: 0,measuredTime,roadStationId,id,sensorValue,timestamp,latitude,longitude
2,2022-03-12T23:59:35Z,23575.0,5071.0,1.0,2022-03-12 23:00:00,60.486397,26.54646
3,2022-03-12T23:59:35Z,23575.0,5057.0,103.0,2022-03-12 23:00:00,60.486397,26.54646
7,2022-03-12T23:59:35Z,23575.0,5067.0,2.0,2022-03-12 23:00:00,60.486397,26.54646
13,2022-03-12T23:59:45Z,23576.0,5054.0,69.0,2022-03-12 23:00:00,60.514416,26.926898
17,2022-03-12T23:59:45Z,23576.0,5071.0,3.0,2022-03-12 23:00:00,60.514416,26.926898


In [22]:
df_tms_hour = df_tms_hour.reset_index(drop=True)

In [23]:
df_tms_hour.head()

Unnamed: 0,measuredTime,roadStationId,id,sensorValue,timestamp,latitude,longitude
0,2022-03-12T23:59:35Z,23575.0,5071.0,1.0,2022-03-12 23:00:00,60.486397,26.54646
1,2022-03-12T23:59:35Z,23575.0,5057.0,103.0,2022-03-12 23:00:00,60.486397,26.54646
2,2022-03-12T23:59:35Z,23575.0,5067.0,2.0,2022-03-12 23:00:00,60.486397,26.54646
3,2022-03-12T23:59:45Z,23576.0,5054.0,69.0,2022-03-12 23:00:00,60.514416,26.926898
4,2022-03-12T23:59:45Z,23576.0,5071.0,3.0,2022-03-12 23:00:00,60.514416,26.926898
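
The subsetting and re-indexing steps above can also be written as a single chain. This is a minimal sketch on a hypothetical sample frame (`df_demo`, `HOURLY_IDS` and `df_demo_hour` are illustrative names, not part of the original notebook). It also shows that `isin` compares by value, so the float ids seen in the output above still match the integer IDs from the list.

```python
import pandas as pd

# Sensor IDs for the sixty minute measurements (taken from the list above)
HOURLY_IDS = [5056, 5057, 5054, 5055, 5067, 5071]

# Hypothetical sample; the ids are floats here, as in the output above
df_demo = pd.DataFrame({
    "id": [5158.0, 5164.0, 5071.0, 5057.0, 5168.0],
    "sensorValue": [139.0, 1.0, 1.0, 103.0, 1.0],
})

# isin compares by value, so 5071.0 matches the integer 5071
mask = df_demo["id"].isin(HOURLY_IDS)

# Keep the matching rows and renumber them from 0 in one step
df_demo_hour = df_demo[mask].reset_index(drop=True)

print(df_demo_hour)
```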


# Calculate geographical distance to our chosen central geolocation

You may have found that the TMS data is spread all across Finland. Knowing that the other data sources at hand focus on the Helsinki area, we decided to subsample TMS once again. We only wanted to keep TMS data that lies within ~20km of the center of Helsinki.

To get a distance measurement from (latitude, longitude) locations, we needed a library called `geopy`. That's the reason why we started by defining a custom environment. If we installed the package only now (which is possible), we would have to restart the kernel, which would result in an unnecessary re-calculation of the work already done.

So we went to Google Maps and chose a fixed Lat,Long pair as our central point of measurement. We then calculated the distance from every `roadStationId` to this point and extended `df_tms_hour` by one column that contains the calculated distance.

In [24]:
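import geopy
from geopy import distance

# geopy expects (Lat, Long), in that order!
# fix ~ middle of Helsinki
fix = (60.192059, 24.945831)

abstand = []  # "Abstand" is German for "distance"
for x in range(len(df_tms_hour)):
    # Distance in km from the station to the fixed point
    # (named dist_km so it does not shadow the imported `distance` module)
    dist_km = geopy.distance.distance((df_tms_hour.latitude[x], df_tms_hour.longitude[x]), fix).km
    abstand.append(dist_km)

df_tms_hour['distance'] = abstand
df_tms_hour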

Unnamed: 0,measuredTime,roadStationId,id,sensorValue,timestamp,latitude,longitude,distance
0,2022-03-12T23:59:35Z,23575.0,5071.0,1.0,2022-03-12 23:00:00,60.486397,26.546460,94.283255
1,2022-03-12T23:59:35Z,23575.0,5057.0,103.0,2022-03-12 23:00:00,60.486397,26.546460,94.283255
2,2022-03-12T23:59:35Z,23575.0,5067.0,2.0,2022-03-12 23:00:00,60.486397,26.546460,94.283255
3,2022-03-12T23:59:45Z,23576.0,5054.0,69.0,2022-03-12 23:00:00,60.514416,26.926898,115.104429
4,2022-03-12T23:59:45Z,23576.0,5071.0,3.0,2022-03-12 23:00:00,60.514416,26.926898,115.104429
...,...,...,...,...,...,...,...,...
6331,2022-03-13T01:32:05Z,23142.0,5067.0,1.0,2022-03-13 01:00:00,60.736118,25.448445,66.627738
6332,2022-03-13T01:32:05Z,23142.0,5056.0,107.0,2022-03-13 01:00:00,60.736118,25.448445,66.627738
6333,2022-03-13T01:32:05Z,23142.0,5057.0,107.0,2022-03-13 01:00:00,60.736118,25.448445,66.627738
6334,2022-03-13T01:32:05Z,23142.0,5071.0,1.0,2022-03-13 01:00:00,60.736118,25.448445,66.627738
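
If the row-by-row loop above becomes slow on larger data, the distance can also be computed for the whole column at once. The following is a sketch of a swapped-in technique, the haversine formula on a sphere, not what `geopy.distance.distance` does (geopy uses an ellipsoidal model). Results should therefore differ from the values above by a fraction of a percent, which is usually acceptable for a rough 20km cut-off. The helper `haversine_km`, the constant `EARTH_RADIUS_KM` and the frame `df_demo` are illustrative names.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; a spherical approximation


def haversine_km(lat, lon, lat0, lon0):
    """Great-circle distance in km from (lat, lon) arrays to one fixed point."""
    lat, lon = np.radians(lat), np.radians(lon)
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    a = (np.sin((lat - lat0) / 2) ** 2
         + np.cos(lat) * np.cos(lat0) * np.sin((lon - lon0) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


fix = (60.192059, 24.945831)  # (Lat, Long), ~ middle of Helsinki

# Two station locations copied from the output above
df_demo = pd.DataFrame({
    "latitude": [60.486397, 60.514416],
    "longitude": [26.546460, 26.926898],
})

# One call for the whole column instead of a Python loop
df_demo["distance"] = haversine_km(
    df_demo["latitude"], df_demo["longitude"], fix[0], fix[1]
)
print(df_demo)
```

If you need distances that match geopy exactly, keep the loop and treat this only as a quick approximation.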


We then decided to limit the distance between a `roadStationId` and the central point to 20km, and to delete the distance column afterwards, since we have no further use for it.

In [25]:
df_tms_core = df_tms_hour[df_tms_hour.distance <= 20]

In [26]:
df_tms_core = df_tms_core.reset_index(drop=True)

In [27]:
df_tms_core = df_tms_core.drop(['distance'], axis=1)
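
The three cells above can also be combined into one chained expression. This is a minimal sketch on a hypothetical frame (`df_demo`, `MAX_DISTANCE_KM`, the station IDs and the distances are made up for illustration); it is meant to show the pattern, not to replace the cells above.

```python
import pandas as pd

MAX_DISTANCE_KM = 20  # radius around the central point

# Hypothetical stations with precomputed distances (values made up)
df_demo = pd.DataFrame({
    "roadStationId": [23575, 23101, 23142, 23102],
    "distance": [94.3, 5.2, 66.6, 19.9],
})

df_demo_core = (
    df_demo[df_demo["distance"] <= MAX_DISTANCE_KM]  # keep nearby stations
    .drop(columns=["distance"])                      # helper column no longer needed
    .reset_index(drop=True)                          # renumber rows from 0
)
print(df_demo_core)
```

Keeping the radius in a named constant makes it easy to experiment with a different cut-off later.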


# Save the data back to our project

Once we think we are finished with cleaning our data (your result doesn't have to be exactly the same as ours), we want to get the data out of the notebook and back into our project space. There we can use it as a data asset.

To do so, we use the Python library `project_lib` and import `Project` from it. This gives us the functionality needed to save the data back to our project space, where it can be found as a data asset. The DataFrame is converted into CSV format through the pandas.DataFrame method `to_csv()`.

In [None]:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='project-id', project_access_token='project-access-token')
pc = project.project_context


# `dateloading` is not defined in this part; it is assumed to come from an earlier cell
project.save_data(data=df_tms_core.to_csv(index=False), file_name=str(dateloading) + "-only_TMS_hour.csv", overwrite=True)
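
To see what is actually handed to `save_data`, it can help to inspect the CSV text locally first. The sketch below does not use `project_lib` at all; it only assumes a small made-up frame (`df_demo`) and shows that `to_csv()` returns a string when called without a path, and that `index=False` keeps the row index out of the file.

```python
import io
import pandas as pd

# Hypothetical stand-in for df_tms_core (values made up)
df_demo = pd.DataFrame({
    "roadStationId": [23101, 23102],
    "sensorValue": [103.0, 69.0],
})

# Without a path, to_csv returns the CSV text as a string
csv_text = df_demo.to_csv(index=False)

# Read the text back to confirm nothing was lost and no index column was added
df_roundtrip = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(df_roundtrip.equals(df_demo))
```

Leaving out `index=False` would write the row index as an extra unnamed column, which then shows up as `Unnamed: 0` when the file is read back.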