# Data extraction of trips using Dask dataframe

# Purpose
As a first step, the time series data is divided into trips as a form of data reduction. Energy consumption can then be calculated for each trip, together with other aggregated quantities such as mean values and standard deviations. These quantities are used to analyze how much trips differ from each other over the year.
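
A minimal sketch of the per-trip aggregation described above, written in plain pandas. It assumes the trips have already been numbered in a column called `trip_no` (a hypothetical name, not taken from the project code) and uses the `Timestamp [UTC]` and `Power EM Thruster Total (kW)` columns from the dataset. Energy is approximated by trapezoidal integration of power over time; the project's own implementation may differ.

```python
import pandas as pd


def trip_statistics(df, time_col="Timestamp [UTC]",
                    power_col="Power EM Thruster Total (kW)",
                    trip_col="trip_no"):
    """Aggregate energy, mean power and power std for each trip.

    `trip_col` is an assumed column holding the trip number of each row.
    """
    rows = {}
    for trip_no, trip in df.groupby(trip_col):
        trip = trip.sort_values(time_col)
        # Time step between consecutive samples, in hours (first one is NaN).
        hours = trip[time_col].diff().dt.total_seconds() / 3600.0
        # Trapezoidal rule: average power over each step times the step length.
        step_power = (trip[power_col] + trip[power_col].shift()) / 2.0
        rows[trip_no] = {
            "energy_kwh": (step_power * hours).sum(),
            "power_mean_kw": trip[power_col].mean(),
            "power_std_kw": trip[power_col].std(),
        }
    return pd.DataFrame.from_dict(rows, orient="index").rename_axis(trip_col)


# Small made-up example with two trips.
demo = pd.DataFrame({
    "Timestamp [UTC]": pd.to_datetime([
        "2020-01-01 08:00:00", "2020-01-01 09:00:00",
        "2020-01-01 10:00:00", "2020-01-01 10:30:00",
    ]),
    "Power EM Thruster Total (kW)": [100.0, 200.0, 400.0, 400.0],
    "trip_no": [1, 1, 2, 2],
})
stats = trip_statistics(demo)
print(stats)
```

The result has one row per trip, which is the data reduction this notebook is aiming for: a full year of second-by-second samples collapses to a small table that fits comfortably in memory.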

The file is too large to fit in memory, so this solution uses a Dask dataframe instead.

# Methodology
* Loop over the Dask dataframe partitions, number the trips, and save to Parquet in each loop iteration.
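
The loop above can be sketched as follows. This is not the code in `trip_id.save_numbered_trips`; it is an illustrative outline under two stated assumptions: a new trip starts whenever the gap between consecutive samples exceeds a threshold (`max_gap`, a hypothetical parameter), and the partitions arrive in chronological order. The point of the sketch is the state carried between partitions, so that trip numbers continue across partition boundaries.

```python
import pandas as pd


def number_trips(partitions, time_col="Timestamp [UTC]", max_gap="5min"):
    """Yield each partition with a 'trip_no' column that continues across partitions.

    Assumption: a time gap larger than `max_gap` marks the start of a new trip.
    """
    max_gap = pd.Timedelta(max_gap)
    trip_no = 0        # running trip counter carried between partitions
    last_time = None   # last timestamp seen in the previous partition
    for part in partitions:
        if part.empty:
            continue
        part = part.sort_values(time_col).copy()
        gaps = part[time_col].diff()
        if last_time is not None:
            # Gap between this partition's first row and the previous partition.
            gaps.iloc[0] = part[time_col].iloc[0] - last_time
        new_trip = gaps > max_gap          # a missing gap (NaT) compares as False
        if last_time is None:
            new_trip.iloc[0] = True        # the very first row starts trip 1
        part["trip_no"] = trip_no + new_trip.cumsum()
        trip_no = int(part["trip_no"].iloc[-1])
        last_time = part[time_col].iloc[-1]
        yield part


# Two made-up partitions; the second contains a one-hour gap.
part_a = pd.DataFrame({"Timestamp [UTC]": pd.to_datetime(
    ["2020-01-01 08:00:00", "2020-01-01 08:00:01"])})
part_b = pd.DataFrame({"Timestamp [UTC]": pd.to_datetime(
    ["2020-01-01 08:00:02", "2020-01-01 09:00:02", "2020-01-01 09:00:03"])})

numbered = list(number_trips([part_a, part_b]))
trip_numbers = [n for part in numbered for n in part["trip_no"]]
print(trip_numbers)
```

With a Dask dataframe, the partitions could be obtained one at a time (for example with `df.partitions[i].compute()` for `i` in `range(df.npartitions)`), and each numbered partition written with pandas' `to_parquet` inside the loop, so that only one partition is held in memory at a time.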

# Setup

In [1]:
#%load imports.py
%matplotlib inline
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20,3)

#import seaborn as sns
import os
from collections import OrderedDict

from IPython.display import display

pd.options.display.max_rows = 999
pd.set_option("display.max_columns", None)  # show all columns

import folium
import plotly.express as px
import plotly.graph_objects as go

import sys
sys.path.append('../../../')
from src.visualization import visualize

sys.path.append('../../../src/models/pipelines/longterm/scripts/prepdata/trip')
import prepare_dataset, trips, trip_id

import scipy.integrate
import seaborn as sns

import pyarrow as pa
import pyarrow.parquet as pq


## Parameters

In [2]:
name='tycho_short_parquet'
n_rows=None

In [3]:
from dask.distributed import Client, progress, TimeoutError
client = Client(n_workers=4, threads_per_worker=1, memory_limit='2GB')
client

Client: Scheduler: tcp://127.0.0.1:52592, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 7.45 GiB


In [4]:
df = prepare_dataset.get_dask(name=name)

Note, we have launched a browser for you to login. For old experience with device code, use "az login --use-device-code"


Performing interactive authentication. Please follow the instructions on the terminal.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.


Method filter: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Method to_dask_dataframe: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


In [5]:
df.head()

Unnamed: 0,Timestamp [UTC],Latitude (deg),Longitude (deg),Speed over ground (kts),Heading (deg),Power EM Thruster 1 (kW),Power EM Thruster 2 (kW),Power EM Thruster 3 (kW),Power EM Thruster 4 (kW),Power EM Thruster Total (kW),Course over ground (deg),Sin PM1 (),Sin PM2 (),Sin PM3 (),Sin PM4 (),Cos PM1 (),Cos PM2 (),Cos PM3 (),Cos PM4 (),Power heeling (kW),Power Pitch Thruster 1 (kW),Power Pitch Thruster 2 (kW),Power Pitch Thruster 3 (kW),Power Pitch Thruster 4 (kW),Power Steer Thruster 1 (kW),Power Steer Thruster 2 (kW),Power Steer Thruster 3 (kW),Power Steer Thruster 4 (kW),Power Propulsion Total (kW),Power hotel Total (kW)
0,2020-01-01 08:31:19+00:00,56.0331,12.61723,0.42,77.7,146.0,123.0,148.0,164.0,581.0,89.04,-0.2023,-0.15491,0.01044,-0.0188,-0.9805,-0.99164,-1.0,-0.99954,0.0,,,,,,,,,581.0,
1,2020-01-01 08:31:20+00:00,56.0331,12.61723,0.4,77.6,164.0,166.0,150.0,162.0,642.0,86.49,-0.10016,-0.09088,0.02536,-0.00851,-0.98849,-0.99707,-1.0,-0.99966,0.0,,,,,,,,,642.0,
2,2020-01-01 08:31:21+00:00,56.0331,12.61723,0.48,77.5,171.0,177.0,146.0,162.0,656.0,84.19,-0.07849,-0.02191,0.02719,-0.00839,-0.99469,-0.99997,-1.0,-0.99969,0.0,,,,,,,,,656.0,
3,2020-01-01 08:31:22+00:00,56.0331,12.61724,0.56,77.4,182.0,186.0,152.0,159.0,679.0,81.75,-0.05557,0.00128,0.0271,-0.00848,-0.99591,-0.99997,-1.0,-0.99969,0.0,,,,,,,,,679.0,
4,2020-01-01 08:31:23+00:00,56.0331,12.61724,0.56,77.3,203.0,205.0,150.0,158.0,716.0,80.01,-0.048,0.00131,0.02771,-0.0083,-0.99615,-0.99994,-1.0,-0.99954,0.0,,,,,,,,,716.0,


In [6]:
ds = prepare_dataset.get_dataset(name=name, n_rows=n_rows)

df = ds.to_dask_dataframe(sample_size=1000000, dtypes=None, on_error='null', out_of_range_datetime='null')

Method filter: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Method to_dask_dataframe: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


In [7]:
df.npartitions

5

In [8]:
# Number the trips partition by partition and save the result to Parquet.
output_path = 'id.parquet'
trip_id.save_numbered_trips(df=df, output_path=output_path)

In [9]:
client.close()