## mlflowrate: Tutorial

This notebook is a brief guide to using the data integration part of the software in a local Jupyter notebook environment.

The same methods and workflow also apply in an Azure Databricks notebook.

To import the software, run:

In [None]:
from mlflowrate.workflow import WorkFlow

We also import the relevant Spark modules and create a Spark session. Within an Azure Databricks notebook the session already exists as `spark`, so creating it is not necessary!

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, IntegerType
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import lit, unix_timestamp
from pyspark.sql import functions as F  # the example further down calls F.to_timestamp and F.col

Here, we construct some dummy datasets for processing!

In [None]:
dummy1 = [("JJ", 1.0, 2, 3, 8, 5), ("Bizarre", 1.8, 3, 3, 5, None)]
df1 = spark.createDataFrame(dummy1, ["datetime", "a", "b", "c", "d", "e"])

dummy2 = [("John", 1.0, 2, 3, 4, 5), ("Snow", 1.3, None, 4, 5, 6), ("JJ", 1.0, 2, 3, 8, 5), ("Bizarre", 1.8, 3, 3, 5, 6)]
df2 = spark.createDataFrame(dummy2, ["datetime", "a", "b", "c", "d", "e"])

To start using the package, we must instantiate the data management class `WorkFlow` with a dictionary that maps a name to each Spark DataFrame:

In [None]:
dfs = {"d1":df1, 
       "d2":df2}

flow = WorkFlow(dfs)

Let's merge the two dummy datasets into a single dataframe called "df":

In [None]:
flow.integrate.merge_data(newname="df", first="d1", second="d2", axis=0)
flow.integrate.status("df")

You've now integrated data into the program! In practice the DataFrames would be loaded from csv files, as in the example that follows.
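To make `axis=0` concrete: it corresponds to stacking the two frames vertically, with the rows of the second placed below the rows of the first. The following is a minimal plain-Python sketch of that idea using the dummy rows from above. `stack_rows` is a hypothetical helper, not part of mlflowrate, and whether `merge_data` also drops duplicate rows is not stated here, so the sketch keeps them.

```python
columns = ["datetime", "a", "b", "c", "d", "e"]
d1 = [("JJ", 1.0, 2, 3, 8, 5), ("Bizarre", 1.8, 3, 3, 5, None)]
d2 = [("John", 1.0, 2, 3, 4, 5), ("Snow", 1.3, None, 4, 5, 6),
      ("JJ", 1.0, 2, 3, 8, 5), ("Bizarre", 1.8, 3, 3, 5, 6)]


def stack_rows(first, second, n_columns):
    """Hypothetical helper: place the rows of `second` below those of `first`."""
    for row in first + second:
        if len(row) != n_columns:
            raise ValueError("all rows must have the same number of columns")
    return first + second


merged = stack_rows(d1, d2, len(columns))
print(len(merged))  # 6 rows: 2 from d1 followed by 4 from d2
```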

## Real Life Example:

For better intuition on how to use the program, the code below shows how the software was used to apply different machine learning models to data in an Azure Databricks notebook:

### 1. Load csv data as Spark DataFrames

In [None]:
from pyspark.sql import functions as F  # needed for F.to_timestamp and F.col in this cell

# Load data into the notebook
df_01 = spark.read.format('csv').options(header='true', inferSchema='true').load('/FileStore/tables/newdump_01.csv')
df_02 = spark.read.format('csv').options(header='true', inferSchema='true').load('/FileStore/tables/newdump_02.csv')
df_OW1ql = spark.read.format('csv').options(header='true', inferSchema='true').load('/FileStore/tables/Qliq.csv')
df_records = spark.read.format('csv').options(header='true', inferSchema='true', delimiter='|', encoding='iso-8859-1').load('/FileStore/tables/interences_filtered.csv')
test_OW1 = spark.read.format('csv').options(header='true', inferSchema='true').load('/FileStore/tables/OW1test_1.csv')

# Convert date columns to timestamps
# Rename columns so that they fit the flow package.
df_01 = df_01.select(
      F.to_timestamp(F.col("ts").cast("string"), "dd-MMM-yy HH:mm:ss").alias("datetime"),
      df_01["name"].alias("tag"),
      df_01["value"])

df_02 = df_02.select(
      F.to_timestamp(F.col("ts").cast("string"), "dd-MMM-yy HH:mm:ss").alias("datetime"),
      df_02["name"].alias("tag"),
      df_02["value"])

df_OW1ql = df_OW1ql.select(
      F.to_timestamp(F.col("DATE").cast("string"), "MM/dd/yyyy").alias("datetime"),
      *[feat for feat in df_OW1ql.columns if feat not in ["DATE"]])

test_OW1 = test_OW1.select(
      F.to_timestamp(F.col("DATE").cast("string"), "MM/dd/yyyy HH:mm").alias("datetime"),
      *[n for n in test_OW1.schema.names if not n == "DATE"])

df_records = df_records.select(
      F.to_timestamp(F.col("Date").cast("string"), "MM/dd/yyyy").alias("datetime"),
      *[n for n in df_records.schema.names if not n == "Date"])
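The three timestamp patterns used above can be easier to read next to concrete strings. The sketch below uses Python's `datetime.strptime` with roughly equivalent format codes to show what each Spark pattern is meant to match. The sample strings are made up for illustration, and the mapping between Spark patterns and `strptime` codes is an approximation, not something taken from the package.

```python
from datetime import datetime

# Spark pattern -> roughly equivalent Python strptime pattern
PATTERNS = {
    "dd-MMM-yy HH:mm:ss": "%d-%b-%y %H:%M:%S",
    "MM/dd/yyyy": "%m/%d/%Y",
    "MM/dd/yyyy HH:mm": "%m/%d/%Y %H:%M",
}

# Made-up sample strings, one per pattern
# (%b expects an English month abbreviation under the default C locale)
samples = {
    "dd-MMM-yy HH:mm:ss": "01-Jun-18 13:45:00",
    "MM/dd/yyyy": "06/01/2018",
    "MM/dd/yyyy HH:mm": "06/01/2018 13:45",
}

parsed = {p: datetime.strptime(samples[p], PATTERNS[p]) for p in PATTERNS}
for pattern, value in parsed.items():
    print(pattern, "->", value.isoformat())
```

If a column does not match its pattern, it is worth checking a few raw strings against the pattern in this way before loading the full file.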

### 2. Clean and Integrate the data

In [None]:
# Create a dictionary that assigns tag names to oil wells:
do = {"OW1" : ["EXAMPLE", "EXAMPLE"],
      "OW2" : ["EXAMPLE", "EXAMPLE"],
      "OW3" : ["EXAMPLE"]}

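# Create a dictionary that maps tag names to measurement names.
# Replace each "EXAMPLE" placeholder with a distinct tag name:
# repeated keys in a dict literal keep only the last value.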
nn = {"EXAMPLE" : "WHP",
      "EXAMPLE" : "WHT",
      "EXAMPLE" : "GLR",
      "EXAMPLE" : "GLP",
      "EXAMPLE" : "DHP",
      "EXAMPLE" : "DHT",
      "EXAMPLE" : "Choke",
      "EXAMPLE" : "ASD"}

# Collect up data into a dictionary with corresponding names
data = {"df_01":df_01, "df_02":df_02, "df_OW1ql":df_OW1ql, "test_OW1":test_OW1, "df_records":df_records}

# Make new workflow object by passing in the data dictionary 
flow = WorkFlow(data)

# Cache data to speed up spark queries
flow.integrate.cache_data("df_01", "df_02", "df_OW1ql", "test_OW1", "df_records")

# Merge dataframes vertically
flow.integrate.merge_data("df_dump", "df_01", "df_02", axis=0)

# select the stored data, and remove nulls
flow.integrate.clean_data("df_dump", remove_nulls=True)

# select the stored data, and reorganise it into separate datasets, one per oil well
flow.integrate.organise_data("df_dump", "date_tag_val_col", distinct_oilwells=do, change_sensor_names=nn) 

# We get three new datasets, OW1, OW2 and OW3, but their dataframes will be out of shape
flow.integrate.clean_data("OW1", avg_over="day", is_dict=True)  # average the data on the dict format
flow.integrate.organise_data("OW1", "dict_col")  # reorganise the data again so the dictionary matches the dataframe
flow.integrate.set_organised("OW1") # Tell the class that this data is ready for dataset assembly
flow.integrate.cache_data("OW1") # Cache data frame for fast querying
flow.integrate.status("OW1") # check the status of the data

flow.integrate.clean_data("df_OW1ql", remove_nulls=True) # remove nulls in the data
flow.integrate.edit_col("df_OW1ql", "Daily liquid rate [Sm3/d]", newname="Qliq") # rename the column
flow.integrate.select_col("df_OW1ql", "Qliq") # reduce the dataframe to the selected column
flow.integrate.organise_data("df_OW1ql", "mult_col") # reorganise the data so that we have a corresponding group dict format
flow.integrate.set_organised("df_OW1ql") # Tell the class that this data is ready for assembly
flow.integrate.cache_data("df_OW1ql") # cache data frame for fast querying
flow.integrate.status("df_OW1ql") # check the status of the data

flow.integrate.drop_col("test_OW1", "WELLNAME", "NUMBER", "Qo", "Qg", "Qw", "GOR", "WOR", "WTC", "PI", "Pres", "DCP", "ChokeD", "DCTcalc")
flow.integrate.clean_data("test_OW1", remove_nulls=True)
flow.integrate.clean_data("test_OW1", char_col="WHT", remove_char="-")
flow.integrate.edit_col("test_OW1", "pbh", newname="DHP")
flow.integrate.edit_col("test_OW1", "Qgl", newname="GLR")
flow.integrate.edit_col("test_OW1", "Q liq", newname="Qliq")
flow.integrate.edit_col("test_OW1", "WHT", typ="double")
flow.integrate.organise_data("test_OW1", "mult_col")
flow.integrate.set_organised("test_OW1")
flow.integrate.cache_data("test_OW1")
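To picture what the two dictionaries and the daily averaging are for, here is a plain-Python sketch under stated assumptions: long-format rows of (datetime, tag, value) are assigned to an oil well, renamed to a measurement, and averaged per day. The tag names, values and nested-dict layout are hypothetical, and this is not how `organise_data` or `clean_data` are implemented internally.

```python
from collections import defaultdict

# Hypothetical tag names standing in for the "EXAMPLE" placeholders
wells = {"OW1": ["TAG_1", "TAG_2"], "OW2": ["TAG_3"]}
sensor_names = {"TAG_1": "WHP", "TAG_2": "WHT", "TAG_3": "WHP"}

# Made-up long-format rows: (datetime, tag, value)
rows = [
    ("2018-06-01 00:00", "TAG_1", 10.0),
    ("2018-06-01 12:00", "TAG_1", 14.0),
    ("2018-06-01 00:00", "TAG_2", 50.0),
    ("2018-06-02 00:00", "TAG_3", 7.0),
]

# Invert the well dictionary so each tag points at its well
tag_to_well = {tag: well for well, tags in wells.items() for tag in tags}

# well -> measurement -> day -> list of readings
grouped = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for ts, tag, value in rows:
    day = ts.split(" ")[0]
    grouped[tag_to_well[tag]][sensor_names[tag]][day].append(value)

# Average the readings that fall on the same day
daily = {
    well: {
        sensor: {day: sum(vals) / len(vals) for day, vals in days.items()}
        for sensor, days in sensors.items()
    }
    for well, sensors in grouped.items()
}
print(daily["OW1"]["WHP"])  # {'2018-06-01': 12.0}
```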

### 3. Make datasets for machine learning

In [None]:
flow.next_phase()
# Make Spark DataFrame sets: Pick the features we want in our datasets from the different sources of data!
flow.datasets.make_set("Sep", align_dates="day", feats={"OW1": ["GLP", "ASD"], "test_OW1": ["WHP", "Choke", "GLR", "DHP", "Qliq", "WHT"]})
flow.datasets.make_set("Field", align_dates="day", feats={"OW1": ["DHP", "WHT", "GLR", "ASD", "WHP", "GLP", "Choke"], "df_OW1ql":["Qliq"]})

flow.datasets.make_set("Sep_2018", align_dates="day", feats={"OW1": ["GLP", "ASD"], "test_OW1": ["WHP", "Choke", "GLR", "DHP", "Qliq", "WHT"]})
flow.datasets.make_set("Field_2018", align_dates="day", feats={"OW1": ["DHP", "WHT", "GLR", "ASD", "WHP", "GLP", "Choke"], "df_OW1ql":["Qliq"]})

flow.datasets.make_set("OrigSep", align_dates="day", feats={"test_OW1": ["WHP", "Choke", "GLR", "DHP", "Qliq", "WHT"]})
flow.datasets.make_set("OrigField", align_dates="day", feats={"OW1": ["DHP", "WHT", "GLR", "WHP", "Choke"], "df_OW1ql":["Qliq"]})

flow.datasets.make_set("OrigSep_2018", align_dates="day", feats={"test_OW1": ["WHP", "Choke", "GLR", "DHP", "Qliq", "WHT"]})
flow.datasets.make_set("OrigField_2018", align_dates="day", feats={"OW1": ["DHP", "WHT", "GLR", "WHP", "Choke"], "df_OW1ql":["Qliq"]})
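Each `make_set` call above picks features from several sources and aligns them by day. The sketch below illustrates one way such an alignment could work: keep only the days present in every source and collect the requested features into one row per day. The inner-join behaviour, the `make_rows` helper and all values are assumptions for illustration; the package may handle missing days differently.

```python
# source -> day -> {feature: value}; values are made up
sources = {
    "OW1": {
        "2018-06-01": {"GLP": 80.0, "ASD": 1.0, "WHP": 12.0},
        "2018-06-02": {"GLP": 82.0, "ASD": 1.0, "WHP": 13.0},
    },
    "test_OW1": {
        "2018-06-01": {"Qliq": 45.0, "WHT": 60.0},
    },
}
feats = {"OW1": ["GLP", "ASD"], "test_OW1": ["Qliq"]}


def make_rows(sources, feats):
    """Hypothetical helper: one row per day that exists in every source."""
    common_days = set.intersection(*(set(sources[name]) for name in feats))
    rows = []
    for day in sorted(common_days):
        row = {"datetime": day}
        for name, columns in feats.items():
            for col in columns:
                row[col] = sources[name][day][col]
        rows.append(row)
    return rows


rows = make_rows(sources, feats)
print(rows)
# [{'datetime': '2018-06-01', 'GLP': 80.0, 'ASD': 1.0, 'Qliq': 45.0}]
```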

Manipulate the datasets: we want to move some rows of data into our other Spark datasets.

In [None]:
flow.datasets.cache_data("Sep", "Field")
condition = lambda df : df.where(df.Qliq <= 50)
flow.datasets.append_rows("Field", condition, "Sep")
flow.datasets.date_range("OrigSep_2018", "2018-06-01", "2020-01-01")
flow.datasets.date_range("OrigField_2018", "2018-06-01", "2020-01-01")

condition = lambda df : df.where(df.Qliq <= 50)
flow.datasets.append_rows("Field_2018", condition, "Sep_2018")
flow.datasets.date_range("Sep_2018", "2018-06-01", "2020-01-01")
flow.datasets.date_range("Field_2018", "2018-06-01", "2020-01-01")
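The two operations in this cell, appending rows that satisfy a condition and restricting a dataset to a date range, can be sketched in plain Python as follows. The helper names, the sample rows and the half-open date range are assumptions for illustration. Which of the two named datasets receives the rows in `append_rows` is not stated here, so the sketch uses explicit `target` and `source` arguments.

```python
from datetime import date

# Made-up rows for two datasets
field = [
    {"datetime": date(2018, 5, 20), "Qliq": 40.0},
    {"datetime": date(2018, 7, 1), "Qliq": 120.0},
    {"datetime": date(2019, 3, 15), "Qliq": 30.0},
]
sep = [{"datetime": date(2018, 8, 10), "Qliq": 55.0}]

# Same condition as in the notebook, written for dict rows
condition = lambda row: row["Qliq"] <= 50


def append_matching(target, source, predicate):
    """Hypothetical helper: add the rows of `source` that satisfy `predicate`."""
    return target + [row for row in source if predicate(row)]


def in_date_range(rows, start, end):
    """Keep rows with start <= datetime < end (half-open range is an assumption)."""
    return [row for row in rows if start <= row["datetime"] < end]


combined = append_matching(sep, field, condition)
windowed = in_date_range(combined, date(2018, 6, 1), date(2020, 1, 1))
print([row["Qliq"] for row in windowed])  # [55.0, 30.0]
```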

Make our dataset objects!

In [None]:
# Convert the Spark DataFrames to pandas datasets, with "Qliq" as the label
flow.datasets.make_dataset("Sep", label="Qliq", pandas=True)
flow.datasets.make_dataset("Field", label="Qliq", pandas=True)
flow.datasets.make_dataset("Sep_2018", label="Qliq", pandas=True)
flow.datasets.make_dataset("Field_2018", label="Qliq", pandas=True)
flow.datasets.make_dataset("OrigSep", label="Qliq", pandas=True)
flow.datasets.make_dataset("OrigField", label="Qliq", pandas=True)
flow.datasets.make_dataset("OrigSep_2018", label="Qliq", pandas=True)
flow.datasets.make_dataset("OrigField_2018", label="Qliq", pandas=True)

flow.next_phase()
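The `label="Qliq"` argument marks which column the models should predict. As a rough illustration of that split, the sketch below separates made-up rows into a feature matrix and a label vector. `split_features_label` is a hypothetical helper and says nothing about how the package stores its dataset objects.

```python
# Made-up rows with two features and the label column
rows = [
    {"WHP": 12.0, "Choke": 30.0, "Qliq": 45.0},
    {"WHP": 13.5, "Choke": 32.0, "Qliq": 52.0},
]


def split_features_label(rows, label):
    """Hypothetical helper: separate the label column from the feature columns."""
    feature_names = [name for name in rows[0] if name != label]
    X = [[row[name] for name in feature_names] for row in rows]
    y = [row[label] for row in rows]
    return feature_names, X, y


feature_names, X, y = split_features_label(rows, "Qliq")
print(feature_names)  # ['WHP', 'Choke']
print(y)              # [45.0, 52.0]
```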

### 4. Apply a range of ML models to one of our datasets

In [None]:
datasets = flow.dataexplore.get_datasets()
eval_preds = flow.dataexplore.fitpredict_naivemodels(datasets["OrigSep_2018"], datasets["OrigField_2018"], eval_all=True)
flow.dataexplore.plotmodels(datasets["OrigSep_2018"], datasets["OrigField_2018"], eval_preds)