# Script for Live Vehicle Data Cleaning - Part 1

<strong><em>Important: This guide explains the data cleaning we did before this Hack-a-thon. Some parts can, and sometimes should, be copied and pasted directly. You won't be able to copy the whole notebook and run it within your project.</em></strong>

## Creating the Client Connection to the Cloud Object Storage and the "smart-city-live-vehicle-positions" bucket

The following code cell can be inserted automatically through the Notebook UI. To do so, click the data button (top right corner), where you will find the *files* and *connections* tabs. Go to *connections*, since we want to create a client for our Cloud Object Storage.

There you will find the connection we created before. Click "insert to code" and choose the "StreamingBody object" option. A pop-up then opens that shows the folder structure of the underlying cloud bucket. Navigate through the folders and subfolders until you reach the last subfolder, which contains all the .json files we need. Choose one file and click *Select*. A code cell is then inserted automatically. It looks like the one below, except that it contains your actual API key and other credentials.

> It doesn't matter which .json file you choose, because later on we only use the created client object, which lets us access more than one .json file.

In [1]:
# @hidden_cell


# This connection object is used to access your data and contains your credentials or project token.
# You might want to remove those credentials before you share your notebook.


import types
import pandas as pd
import ibm_boto3
from botocore.client import Config

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share your notebook.

Cloud_Object_Storage_Connection_client = ibm_boto3.client(
    service_name='s3',
    ibm_api_key_id='api-key',
    ibm_service_instance_id='service-instance-id',
    ibm_auth_endpoint='https://iam.cloud.ibm.com/identity/token',
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.eu-de.cloud-object-storage.appdomain.cloud'
)

body = Cloud_Object_Storage_Connection_client.get_object(Bucket='smart-city-live-vehicle-positions', Key='topics/open_HFP_API/partition=0/open_HFP_API+0+0054722848.json')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object 

if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

# Since JSON data can be semi-structured and contain additional metadata, it is possible that you might face an error during data loading.
# Refer to the documentation of 'pandas.read_json()' and 'pandas.io.json.json_normalize' for more possibilities to adjust the data loading.
# pandas documentation: http://pandas.pydata.org/pandas-docs/stable/io.html#io-json-reader
# and http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.json.json_normalize.html


Now that the client has been created successfully, we need a function that can access more than one .json file. Your client might be named differently than `Cloud_Object_Storage_Connection_client`. Either rename it in the cell above or remember to change the name wherever the client is used.

As we know, the data is stored as S3 objects. The following cell shows the function we use to access the files that were written to the storage within a given timespan. (Don't stress about the function and how it works in detail 😉) Just copy and paste it.


In [2]:
import argparse
import boto3
import dateutil.parser
import logging
import pytz
from collections import namedtuple

import pandas as pd
from datetime import datetime, timezone, timedelta



logger = logging.getLogger(__name__)


Rule = namedtuple('Rule', ['has_min', 'has_max'])
last_modified_rules = {
    Rule(has_min=True, has_max=True):
        lambda min_date, date, max_date: min_date <= date <= max_date,
    Rule(has_min=True, has_max=False):
        lambda min_date, date, max_date: min_date <= date,
    Rule(has_min=False, has_max=True):
        lambda min_date, date, max_date: date <= max_date,
    Rule(has_min=False, has_max=False):
        lambda min_date, date, max_date: True,
}

def get_s3_objects(s3, bucket, prefixes=None, suffixes=None, last_modified_min=None, last_modified_max=None):
    
    if last_modified_min and last_modified_max and last_modified_max < last_modified_min:
        raise ValueError(
            "When using both, last_modified_max: {} must be greater than last_modified_min: {}".format(
                last_modified_max, last_modified_min
            )
        )
    # Use the last_modified_rules dict to lookup which conditional logic to apply
    # based on which arguments were supplied
    last_modified_rule = last_modified_rules[bool(last_modified_min), bool(last_modified_max)]

    if not prefixes:
        prefixes = ('',)
    else:
        prefixes = tuple(set(prefixes))
    if not suffixes:
        suffixes = ('',)
    else:
        suffixes = tuple(set(suffixes))

    kwargs = {'Bucket': bucket}

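    for prefix in prefixes:
        kwargs['Prefix'] = prefix
        # Start each prefix at its first page. Without this, a token left
        # over from the previous prefix would be sent again.
        kwargs.pop('ContinuationToken', None)
        while True: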
            # The S3 API response is a large blob of metadata.
            # 'Contents' contains information about the listed objects.
            resp = s3.list_objects_v2(**kwargs)
            for content in resp.get('Contents', []):
                last_modified_date = content['LastModified']
                if (
                    content['Key'].endswith(suffixes) and
                    last_modified_rule(last_modified_min, last_modified_date, last_modified_max)
                ):
                    yield content

            # The S3 API is paginated, returning up to 1000 keys at a time.
            # Pass the continuation token into the next response, until we
            # reach the final page (when this field is missing).
            try:
                kwargs['ContinuationToken'] = resp['NextContinuationToken']
            except KeyError:
                break
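
The pagination loop inside `get_s3_objects` can be tried without any cloud access. The following is a minimal sketch, not the notebook's own code: `FakeS3` and `list_all_keys` are hypothetical names, and the fake client only imitates the two response fields the loop relies on (`Contents` and `NextContinuationToken`).

```python
class FakeS3:
    """Hypothetical stand-in for the S3 client; serves keys in small pages."""

    def __init__(self, keys, page_size=2):
        self.keys = keys
        self.page_size = page_size

    def list_objects_v2(self, Bucket, Prefix='', ContinuationToken=None):
        matching = [k for k in self.keys if k.startswith(Prefix)]
        start = int(ContinuationToken or 0)
        page = matching[start:start + self.page_size]
        resp = {'Contents': [{'Key': k} for k in page]}
        if start + self.page_size < len(matching):
            # Only non-final pages carry a continuation token
            resp['NextContinuationToken'] = str(start + self.page_size)
        return resp


def list_all_keys(s3, bucket, prefix=''):
    """Follow continuation tokens until the last page has been read."""
    kwargs = {'Bucket': bucket, 'Prefix': prefix}
    keys = []
    while True:
        resp = s3.list_objects_v2(**kwargs)
        keys.extend(content['Key'] for content in resp.get('Contents', []))
        token = resp.get('NextContinuationToken')
        if token is None:
            break
        kwargs['ContinuationToken'] = token
    return keys


fake = FakeS3(['a/1.json', 'a/2.json', 'a/3.json', 'b/1.json', 'a/4.json'])
all_a = list_all_keys(fake, bucket='demo-bucket', prefix='a/')
print(all_a)
```

With a page size of 2, the four keys under `a/` arrive in two pages, and the loop stops as soon as a response carries no token.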

## Defining the Timespan for Which JSONs/S3 Objects Are Collected

As just introduced, we can access the live vehicle data (LVD) through our client and define a timespan for which we want to get the data.

We decided to create three variables, `dateloading`, `starttime` and `endtime`, to define this timespan.

> Even a few hours of collected data can add up to 1,000,000+ rows. To get a feeling for the data cleaning process, a small timespan (one hour) is more than enough. Also remember that if you change the date and time, there may be no data available.

In [4]:
dateloading = "2022-02-26"
starttime = datetime.fromisoformat(dateloading + ' 12:00:00.000+00:00')
endtime = datetime.fromisoformat(dateloading + ' 13:00:00.000+00:00')
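
The `+00:00` offset in these strings is worth keeping. The sketch below assumes that the `LastModified` values in the listing are timezone-aware datetimes, as they are with boto3. In that case a bound without an offset cannot be compared against them. The sample timestamp is made up.

```python
from datetime import datetime, timezone

# Stand-in for content['LastModified']; assumed to be timezone-aware (UTC)
last_modified = datetime(2022, 2, 26, 12, 30, tzinfo=timezone.utc)

aware_start = datetime.fromisoformat('2022-02-26 12:00:00.000+00:00')
naive_start = datetime.fromisoformat('2022-02-26 12:00:00.000')

# Both sides carry a timezone, so the comparison works
in_window = aware_start <= last_modified

# Mixing naive and aware datetimes raises a TypeError
try:
    naive_start <= last_modified
    naive_comparison_failed = False
except TypeError:
    naive_comparison_failed = True

print(in_window, naive_comparison_failed)
```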

Because the amount of data is so large, we had to access it day by day, and within each day in steps of a few hours.
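
Stepping through a day in blocks of a few hours could be sketched as follows. `time_windows` and `step_hours` are hypothetical names that do not appear in the original notebook, and the day is assumed to be given in UTC, as in the cell above.

```python
from datetime import datetime, timedelta


def time_windows(day, step_hours=3):
    """Yield (start, end) pairs that cover one UTC day in `step_hours` steps."""
    start = datetime.fromisoformat(day + ' 00:00:00.000+00:00')
    day_end = start + timedelta(days=1)
    step = timedelta(hours=step_hours)
    while start < day_end:
        end = min(start + step, day_end)
        yield start, end
        start = end


windows = list(time_windows('2022-02-26', step_hours=6))
for window_start, window_end in windows:
    print(window_start.isoformat(), '->', window_end.isoformat())
```

Each pair could then be passed as `last_modified_min` and `last_modified_max`. Because `get_s3_objects` treats both bounds as inclusive, an object whose timestamp falls exactly on a boundary would show up in two neighbouring windows.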

Through the `Cloud_Object_Storage_Connection_client` and the function `get_s3_objects` defined above, we can now access the S3 objects that were written within the defined time window.

In [6]:
objs=get_s3_objects(s3=Cloud_Object_Storage_Connection_client,bucket="smart-city-live-vehicle-positions", last_modified_min=starttime, last_modified_max=endtime)

The variable `objs` is now a generator that yields these S3 object listings (keys and metadata). It does not contain the .json data itself, so one more step is needed to extract the data into our pandas.DataFrame.
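
Since `get_s3_objects` uses `yield`, its result can only be iterated once. The sketch below shows this with a hypothetical stand-in, `fake_objects`, and made-up keys, so it runs without a cloud connection.

```python
def fake_objects():
    """Hypothetical stand-in for get_s3_objects(...); yields two listing entries."""
    yield {'Key': 'topics/demo/file-1.json'}
    yield {'Key': 'topics/demo/file-2.json'}


objs_demo = fake_objects()
first_pass = [obj['Key'] for obj in objs_demo]
second_pass = [obj['Key'] for obj in objs_demo]  # the generator is already exhausted

# Materialising the generator keeps the entries around for repeated use
objs_list = list(fake_objects())

print(first_pass, second_pass, len(objs_list))
```

If a later cell appears to find no objects, re-running the cell that creates `objs` (or storing `list(objs)` once) is a likely fix.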

## Reading the Vehicle Positions from the Variable objs and Storing Them in a DataFrame

To retrieve our data, we iterate through every `obj` that the variable `objs` contains. We have defined an empty DataFrame (`df_lvd`), which is filled step by step with the data we want to extract. To do so, we use our client again. We call `.get_object()` with the bucket we want to access and the key of the object (`obj['Key']`), and read the `['Body']` of the response, which holds lines in JSON format. Around that call, we use `pd.read_json()` to parse the JSON.

The result (now a DataFrame) is passed to the self-defined function `extract_data()`, which selects all the "VP" (VehiclePosition) data, drops all empty rows, turns the column into a list (to avoid format issues) and writes it back into a pd.DataFrame.

This is passed back to the loop, where it is appended to `df_lvd`. After extracting the JSON from every `obj` in `objs`, we finally reset the index and drop the old one.

In [7]:
def extract_data(df_temp):
    df_temp = df_temp['VP']
    df_temp = df_temp.dropna()
    df_temp = df_temp.tolist()
    df_temp = pd.DataFrame.from_records(df_temp)
    return df_temp

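import io

df_lvd = pd.DataFrame()
for obj in objs:
    # Download one S3 object; its body holds one JSON document per line
    s3_body = Cloud_Object_Storage_Connection_client.get_object(Bucket='smart-city-live-vehicle-positions', Key=obj['Key'])['Body']
    df_json = pd.read_json(io.BytesIO(s3_body.read()), lines=True)
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    df_lvd = pd.concat([df_lvd, extract_data(df_json)])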

df_lvd = df_lvd.reset_index(drop = True)
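
To see what the extraction step does to a single file, here is a pandas-free sketch of the same idea using only the standard library. The sample lines and their values are made up, and the layout (one JSON document per line, with vehicle positions nested under a `"VP"` key) is an assumption based on how `extract_data` accesses the data.

```python
import json

# Made-up sample in the assumed layout: one JSON document per line
raw = b'\n'.join([
    b'{"VP": {"veh": 820, "spd": 0.0, "lat": 60.209848}}',
    b'{"OTHER": {"veh": 41}}',
    b'{"VP": {"veh": 1051, "spd": 12.14, "lat": 60.194654}}',
])

records = []
for line in raw.decode('utf-8').splitlines():
    message = json.loads(line)
    # Keep only messages that carry a "VP" payload, similar to dropna() on that column
    if 'VP' in message:
        records.append(message['VP'])

print(records)
```

Each entry in `records` is a flat dictionary, which is the shape `pd.DataFrame.from_records` expects in `extract_data`.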

Now, let's check whether that worked.

In [8]:
df_lvd

| | acc | drst | loc | spd | line | jrn | dl | start | hdg | tsi | ... | stop | occu | veh | desi | oper | odo | lat | oday | seq | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | GPS | 0.00 | 588 | 8.0 | 299.0 | 10:04 | 237.0 | 1646899191 | ... | 1453132 | 0 | 820 | 801 | 47 | 0.0 | 60.209848 | 2022-03-10 | | |
| 1 | 0.0 | 1.0 | GPS | 0.00 | 120 | 895.0 | 119.0 | 10:01 | 224.0 | 1646899191 | ... | 1431183 | 0 | 1051 | 86 | 22 | 94.0 | 60.194654 | 2022-03-10 | | |
| 2 | 0.1 | 0.0 | GPS | 12.14 | 964 | 923.0 | -129.0 | 09:14 | 63.0 | 1646899191 | ... | | 0 | 41 | 506 | 30 | 12966.0 | 60.233694 | 2022-03-10 | | |
| 3 | 0.5 | | GPS | 12.08 | 769 | 57.0 | -166.0 | 09:40 | 335.0 | 1646899191 | ... | | 0 | 1124 | 322 | 22 | | 60.210779 | 2022-03-10 | | |
| 4 | -0.1 | | GPS | 38.12 | 284 | 9651.0 | -50.0 | 09:40 | 25.0 | 1646899191 | ... | | 0 | 6317 | R | 90 | | 60.367697 | 2022-03-10 | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 111027 | 0.0 | 1.0 | GPS | 0.00 | 244 | 452.0 | -346.0 | 09:12 | 234.0 | 1646899313 | ... | 1130113.0 | 0 | 1015 | 212 | 22 | 19654.0 | 60.170823 | 2022-03-10 | | |
| 111028 | | 1.0 | ODO | | 636 | 8636.0 | 59.0 | 09:28 | | 1646899313 | ... | 4530501.0 | 0 | 1035 | P | 90 | 25909.0 | | 2022-03-10 | | |
| 111029 | 0.0 | 1.0 | GPS | 0.00 | 973 | 921.0 | 299.0 | 09:35 | 94.0 | 1646899309 | ... | 1370108.0 | 0 | 833 | 61T | 6 | 10204.0 | 60.241650 | 2022-03-10 | | |
| 111030 | 0.0 | 1.0 | GPS | 0.00 | 973 | 921.0 | 299.0 | 09:35 | 94.0 | 1646899310 | ... | 1370108.0 | 0 | 833 | 61T | 6 | 10204.0 | 60.241650 | 2022-03-10 | | |


How many rows and columns do we have?

In [9]:
df_lvd.shape

(111032, 24)