# Script for Live Vehicle Data Cleaning - Part 2

<strong><em>Important: This is a guide that explains the data cleaning we did before this hackathon. Some parts can, and sometimes should, be copied and pasted directly. However, you won't be able to copy the whole notebook and run it unchanged within your project.</em></strong>

## Creating the Client Connection to the Cloud Object Storage and the "smart-city-live-vehicle-positions" bucket

The following code cell can be inserted automatically through the notebook UI. To do so, click the data button (top right corner); there you will find the *files* and *connections* tabs. Go to *connections*, since we want to create a client for our Cloud Object Storage.

There you will find the connection we created earlier. Click "insert to code" and choose the "StreamingBody object" option. A pop-up then opens that shows the folder structure of the underlying cloud bucket. Navigate through the folders and subfolders until you reach the last subfolder, which contains all the .json files we need. Choose one file and click *Select*. A code cell is then inserted automatically; it looks like this one, except that it contains the correct API keys etc.

> It doesn't matter which .json file you choose, because later on we will only use the created client object to access more than one .json file.
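
As a rough sketch of how a single client can serve several files: the helper below assumes the inserted client follows the S3-style interface, where `get_object(Bucket=..., Key=...)` returns a dictionary whose `"Body"` entry has a `read()` method. The helper name `read_json_objects`, the `FakeClient` stand-in and the file names are hypothetical and exist only so that the example runs without credentials.

```python
import io
import json


def read_json_objects(client, bucket, keys):
    """Read several JSON objects from one bucket with an S3-style client."""
    payloads = []
    for key in keys:
        # Assumption: the client returns {"Body": <file-like object>}
        body = client.get_object(Bucket=bucket, Key=key)["Body"]
        payloads.append(json.loads(body.read()))
    return payloads


class FakeClient:
    """Stand-in for the real client, used here for illustration only."""

    def __init__(self, objects):
        self._objects = objects

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self._objects[Key])}


# Hypothetical file names and contents
fake = FakeClient({"a.json": b'{"veh": 820}', "b.json": b'{"veh": 1051}'})
payloads = read_json_objects(
    fake, "smart-city-live-vehicle-positions", ["a.json", "b.json"]
)
print(payloads)
```

With the real client from the inserted cell, you would pass that client object and the actual object keys instead of the stand-in.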

# Clean the DataFrame

Let's start with the cleaning process of the data! :) 

## Delete rows with missing information about location

LVD data is all about vehicle positions, so every record without position (geo-location) information is of no interest to us. We start the cleaning procedure by eliminating all rows that are missing `lat` or `long` (latitude, longitude).

In [11]:
df_lvd = df_lvd[df_lvd['lat'].notna() & df_lvd['long'].notna()]
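
The same filter can also be written with `dropna`. Here is a minimal sketch on a made-up three-row frame; the values and the name `toy` are invented for illustration, only the column names come from the data set.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "veh": [820, 1051, 41],
    "lat": [60.209848, None, 60.233694],
    "long": [25.077927, 25.031280, None],
})

# Keep only rows where both coordinates are present
toy_located = toy.dropna(subset=["lat", "long"]).reset_index(drop=True)
print(toy_located)
```

Both variants keep a row only if latitude and longitude are both available.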

## Deciding which features/columns are relevant

Not all of the 24 columns, or so-called features, are relevant to us. To determine which are and which are not, we look at the share of missing values in each feature and consult the documentation of the API the data originally came from
(https://digitransit.fi/en/developers/apis/4-realtime-api/vehicle-positions/).

Let's start by calculating the percentage of missing values per column:

In [12]:
for column in df_lvd.columns:
    percent = (df_lvd[column].isna().sum()/len(df_lvd))*100
    print(column + " has this many NA values: " + str(percent) + "%")

acc has this many NA values: 4.624852762584805%
drst has this many NA values: 16.740779970232932%
loc has this many NA values: 0.0%
spd has this many NA values: 0.06756941844645123%
line has this many NA values: 3.760146826520084%
jrn has this many NA values: 4.5572833441383525%
dl has this many NA values: 9.742779659778847%
start has this many NA values: 0.0%
hdg has this many NA values: 1.4545686970972542%
tsi has this many NA values: 0.0%
dir has this many NA values: 0.0%
long has this many NA values: 0.0%
route has this many NA values: 0.0%
tst has this many NA values: 0.0%
stop has this many NA values: 58.72056393071395%
occu has this many NA values: 0.0%
veh has this many NA values: 0.0%
desi has this many NA values: 0.0%
oper has this many NA values: 0.0%
odo has this many NA values: 19.650830464676716%
lat has this many NA values: 0.0%
oday has this many NA values: 0.0%
seq has this many NA values: 95.49567647031968%
label has this many NA values: 99.94704018554197%
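
The loop above can also be expressed without an explicit loop, because the mean of a boolean mask is the fraction of `True` values. A small sketch on invented toy data:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "spd": [0.0, 12.14, None, 38.12],
    "stop": [1453132, None, None, None],
    "lat": [60.21, 60.19, 60.23, 60.37],
})

# isna() yields True/False per cell; mean() turns that into the NA share per column
na_percent = (toy.isna().mean() * 100).sort_values(ascending=False)
print(na_percent)
```

Sorting the result puts the columns with the most missing values at the top, which makes candidates for removal easier to spot.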


Seeing that some columns have a high percentage of missing values, we act on it.
We decide to delete `seq` and `label` because each of them is missing in more than 95% of all rows.
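
If you prefer to derive such candidates programmatically rather than reading them off the printout, a threshold on the NA share is one option. The helper name `mostly_missing_columns` and the toy data are hypothetical; the 95% threshold mirrors the decision above.

```python
import pandas as pd


def mostly_missing_columns(df, threshold=0.95):
    """Return the columns whose share of missing values exceeds the threshold."""
    na_share = df.isna().mean()
    return na_share[na_share > threshold].index.tolist()


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "lat": [60.21, 60.19, 60.23, 60.37],
    "stop": [1453132, None, None, 1431183],
    "seq": [None, None, None, None],
})

to_drop = mostly_missing_columns(toy)
toy_reduced = toy.drop(columns=to_drop)
print(to_drop)
```

A threshold only covers missing values; columns that are complete but unreliable or irrelevant still need a look at the documentation.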

After looking at the documentation, we also decide to delete `odo` because its value is not reliable. `oday`, `line` and `jrn` are listed as values with an internal purpose only, so we delete them too.

**Now take a moment, look into the documentation yourself and decide what else you might want to remove from your data set. Take your analytical question into account and think about the goal you want to achieve with the given data.**

In [14]:
columns_to_drop = ['seq', 'label', 'odo', 'oday', 'jrn', 'line']
df_lvd_cleaned = df_lvd.drop(columns=columns_to_drop, errors='ignore')  # errors='ignore' skips columns that are absent, e.g. 'seq'

df_lvd_cleaned = df_lvd_cleaned.reset_index(drop=True)
print(df_lvd_cleaned.head())

   acc  drst  loc    spd     dl  start    hdg         tsi dir       long  \
0  0.0   0.0  GPS   0.00  299.0  10:04  237.0  1646899191   1  25.077927   
1  0.0   1.0  GPS   0.00  119.0  10:01  224.0  1646899191   1  25.031280   
2  0.1   0.0  GPS  12.14 -129.0  09:14   63.0  1646899191   2  25.030158   
3  0.5   NaN  GPS  12.08 -166.0  09:40  335.0  1646899191   1  24.888061   
4 -0.1   NaN  GPS  38.12  -50.0  09:40   25.0  1646899191   1  25.090595   

   route                       tst     stop  occu   veh desi  oper        lat  
0   1801  2022-03-10T07:59:51.305Z  1453132     0   820  801    47  60.209848  
1   1086  2022-03-10T07:59:51.309Z  1431183     0  1051   86    22  60.194654  
2   1506  2022-03-10T07:59:51.308Z     None     0    41  506    30  60.233694  
3   4322  2022-03-10T07:59:51.312Z     None     0  1124  322    22  60.210779  
4  3001R  2022-03-10T07:59:51.258Z     None     0  6317    R    90  60.367697  


## Important information about the features
- spd => speed in m/s
- acc => acceleration in m/s^2
- dl => delay in s
- drst => door status (0 = closed, 1 = open)
- ... Feel free to expand
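
Knowing the units makes it easy to derive more readable columns. The sketch below uses a few values from the sample output above; the new column names `spd_kmh`, `dl_min` and `tsi_utc` are hypothetical. It also assumes that `tsi` is a Unix timestamp in seconds, which is consistent with the `tst` values in the sample but should be checked against the documentation.

```python
import pandas as pd

# A few values taken from the sample output above
toy = pd.DataFrame({
    "spd": [0.00, 12.14, 38.12],
    "dl": [299.0, -129.0, -50.0],
    "tsi": [1646899191, 1646899191, 1646899191],
})

toy["spd_kmh"] = toy["spd"] * 3.6  # m/s -> km/h
toy["dl_min"] = toy["dl"] / 60     # seconds -> minutes
# Assumption: tsi counts seconds since 1970-01-01 UTC
toy["tsi_utc"] = pd.to_datetime(toy["tsi"], unit="s", utc=True)
print(toy)
```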

When you take a closer look at the documentation, you will find that the location source (`loc`) can take several values: besides `GPS`, there are also `ODO` and `MAN`. We normally don't "trust" an estimated location, nor do we want manually entered information. But before we consider deleting these rows, we check how often the `GPS`-based technique is used.

In [15]:
(df_lvd_cleaned['loc'] == "GPS").sum() / len(df_lvd_cleaned) * 100

95.1240446688642

We see that just over 95% of the records are `GPS`-based. That is an overwhelming majority, so we drop the others.
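
To see how the remaining records are split across the other location sources, `value_counts(normalize=True)` gives the share of every value in one call. A sketch on invented toy data (the 18/1/1 split is made up and does not reflect the real distribution):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"loc": ["GPS"] * 18 + ["ODO", "MAN"]})

# Share of each location source in percent
loc_share = (toy["loc"].value_counts(normalize=True) * 100).round(2)
print(loc_share)
```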

In [16]:
df_lvd_cleaned = df_lvd_cleaned[df_lvd_cleaned['loc'] == "GPS"]
df_lvd_cleaned = df_lvd_cleaned.reset_index(drop=True)

Since every location is now measured through `GPS`, this column no longer provides any additional information, so we drop the whole column.

In [17]:
df_lvd_cleaned = df_lvd_cleaned.drop(['loc'], axis=1)
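
The filter and the column drop can be bundled into one small function that also verifies the column is constant before removing it. The function name `keep_gps_only` and the toy data are hypothetical.

```python
import pandas as pd


def keep_gps_only(df):
    """Keep GPS-based rows, then drop the now-constant 'loc' column."""
    gps_only = df[df["loc"] == "GPS"].reset_index(drop=True)
    # Sanity check: after filtering, at most one distinct value may remain
    assert gps_only["loc"].nunique() <= 1
    return gps_only.drop(columns=["loc"])


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "loc": ["GPS", "ODO", "GPS", "MAN"],
    "veh": [820, 1051, 41, 1124],
})

toy_gps = keep_gps_only(toy)
print(toy_gps)
```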

Now our DataFrame looks like this:

In [18]:
df_lvd_cleaned.head()

|   | acc | drst | spd | dl | start | hdg | tsi | dir | long | route | tst | stop | occu | veh | desi | oper | lat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 299.0 | 10:04 | 237.0 | 1646899191 | 1 | 25.077927 | 1801 | 2022-03-10T07:59:51.305Z | 1453132.0 | 0 | 820 | 801 | 47 | 60.209848 |
| 1 | 0.0 | 1.0 | 0.0 | 119.0 | 10:01 | 224.0 | 1646899191 | 1 | 25.03128 | 1086 | 2022-03-10T07:59:51.309Z | 1431183.0 | 0 | 1051 | 86 | 22 | 60.194654 |
| 2 | 0.1 | 0.0 | 12.14 | -129.0 | 09:14 | 63.0 | 1646899191 | 2 | 25.030158 | 1506 | 2022-03-10T07:59:51.308Z | NaN | 0 | 41 | 506 | 30 | 60.233694 |
| 3 | 0.5 | NaN | 12.08 | -166.0 | 09:40 | 335.0 | 1646899191 | 1 | 24.888061 | 4322 | 2022-03-10T07:59:51.312Z | NaN | 0 | 1124 | 322 | 22 | 60.210779 |
| 4 | -0.1 | NaN | 38.12 | -50.0 | 09:40 | 25.0 | 1646899191 | 1 | 25.090595 | 3001R | 2022-03-10T07:59:51.258Z | NaN | 0 | 6317 | R | 90 | 60.367697 |
