## Create an Experiment

As part of the setup, you have already created an Azure ML `Workspace` object. For AutoML you also need an `Experiment` object: a named container within a `Workspace` that groups related runs.

In [1]:
import random
from matplotlib.pyplot import imshow
import numpy as np
from sklearn import datasets
from matplotlib import pyplot as plt
import pandas as pd
import os
import azureml.core
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset

import json
import logging

Accessing the Azure ML workspace requires authentication with Azure.

The default is interactive authentication against the default tenant. Running the `ws = Workspace.from_config()` line in the cell below prompts for authentication the first time it runs.

If you have multiple Azure tenants, you can specify the tenant by replacing the `ws = Workspace.from_config()` line in the cell below with the following:

```
from azureml.core.authentication import InteractiveLoginAuthentication
auth = InteractiveLoginAuthentication(tenant_id = 'mytenantid')
ws = Workspace.from_config(auth = auth)
```

If you need to run in an environment where interactive login is not possible, you can use Service Principal authentication by replacing the `ws = Workspace.from_config()` line in the cell below with the following:

```
from azureml.core.authentication import ServicePrincipalAuthentication
auth = ServicePrincipalAuthentication('mytenantid', 'myappid', 'mypassword')
ws = Workspace.from_config(auth = auth)
```
For more details, see [aka.ms/aml-notebook-auth](http://aka.ms/aml-notebook-auth)
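
Hard-coding a service principal secret in a notebook makes it easy to leak. Below is a minimal sketch of one alternative, assuming the credentials are exported as environment variables. The variable names `AML_TENANT_ID`, `AML_APP_ID`, and `AML_APP_SECRET` and both helper functions are hypothetical; they are not part of the SDK.

```python
import os

# Hypothetical variable names; use whatever your deployment defines.
REQUIRED_VARS = ("AML_TENANT_ID", "AML_APP_ID", "AML_APP_SECRET")


def load_sp_credentials(env=None):
    """Return the service principal settings, failing loudly if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


def workspace_from_service_principal(env=None):
    """Build a Workspace using credentials read from the environment."""
    # Imported lazily so the credential helper works without the SDK installed.
    from azureml.core import Workspace
    from azureml.core.authentication import ServicePrincipalAuthentication

    creds = load_sp_credentials(env)
    auth = ServicePrincipalAuthentication(
        tenant_id=creds["AML_TENANT_ID"],
        service_principal_id=creds["AML_APP_ID"],
        service_principal_password=creds["AML_APP_SECRET"],
    )
    return Workspace.from_config(auth=auth)
```

With the variables set, `ws = workspace_from_service_principal()` would take the place of the `ws = Workspace.from_config()` line.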

In [2]:
print("This notebook was created using version 1.37.0 of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

This notebook was created using version 1.37.0 of the Azure ML SDK
You are currently using version 1.37.0 of the Azure ML SDK


In [3]:
ws = Workspace.from_config()

project_folder = './sample_projects/pdm-automl'

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
pd.set_option('display.max_colwidth', None)  # None = no truncation; -1 is deprecated since pandas 1.0
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

Unnamed: 0,Unnamed: 1
SDK version,1.37.0
Subscription ID,321dcb27-13a1-47b1-9ea3-c51d0eb1617e
Workspace Name,pdm-ws
Resource Group,pdm-rg
Location,eastus
Project Directory,./sample_projects/pdm-automl
Experiment Name,pdm-automl


## Diagnostics

Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [4]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics = True)

Turning diagnostics collection on. 


#### <font color='blue'> Challenge 1</font>:
You will need to create a compute target for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource.
1. Define a variable for your cluster name
2. Verify that the cluster does not exist already. If the cluster doesn't exist, create one.

Creating an AmlCompute cluster takes approximately 5 minutes. If a cluster with that name already exists in your workspace, your code should skip the creation step. As with other Azure services, certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service are subject to limits. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

Tips: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-cluster?tabs=python

> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

In [1]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = ""

# Verify that the cluster does not exist already. If it doesn't exist, create one. 
'''
Code goes here 
'''

'\nCode goes here \n'

## Load Training Data

In [6]:
# Get the default datastore
default_ds = ws.get_default_datastore()

# Enumerate all datastores, indicating which is the default
for ds_name in ws.datastores:
    print(ds_name, "- Default =", ds_name == default_ds.name)

azureml_globaldatasets - Default = False
workspaceworkingdirectory - Default = False
workspaceartifactstore - Default = False
workspaceblobstore - Default = True
workspacefilestore - Default = False


In [None]:
# Upload the data into the data store
default_ds.upload_files(files=['./data/train_FD001.txt', './data/test_FD001.txt', './data/RUL_FD001.txt'], # Upload the FD001 train, test, and RUL text files from ./data
                       target_path='pdm-data/', # Put it in a folder path in the datastore
                       overwrite=True, # Replace existing files of the same name
                       show_progress=True)

In [17]:
from azureml.core import Dataset

default_ds = ws.get_default_datastore()


#Create a tabular dataset from the path on the datastore (this may take a short while)
tab_data_set_RUL = Dataset.Tabular.from_delimited_files(path=(default_ds, 'pdm-data/RUL_FD001.txt'),separator = " ",
                                                        header = False)
tab_data_set_train = Dataset.Tabular.from_delimited_files(path=(default_ds, 'pdm-data/train_FD001.txt'),separator = " ",
                                                        header = False)
tab_data_set_test = Dataset.Tabular.from_delimited_files(path=(default_ds, 'pdm-data/test_FD001.txt'), separator = " ",
                                                        header = False)

# Register the tabular dataset
try:
    tab_data_set_RUL = tab_data_set_RUL.register(workspace=ws, name='RULL dataset', description='pdm data', tags = {'format':'txt'},
                            create_new_version=True)
    tab_data_set_train = tab_data_set_train.register(workspace=ws, name='Train dataset', description='pdm data', tags = {'format':'txt'},
                            create_new_version=True)
    tab_data_set_test = tab_data_set_test.register(workspace=ws, name='Test dataset', description='pdm data', tags = {'format':'txt'},
                            create_new_version=True)
    print('Datasets registered.')
except Exception as ex:
    print(ex)


Datasets registered.


In [7]:
print("Datasets:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name, 'version', dataset.version)

Datasets:
	 Test dataset version 1
	 Train dataset version 1
	 RULL dataset version 1


In [8]:
dataset_RULL = Dataset.get_by_name(ws, "RULL dataset")
dataset_Train = Dataset.get_by_name(ws, "Train dataset")
dataset_Test = Dataset.get_by_name(ws, "Test dataset")

In [9]:
rul_df = dataset_RULL.to_pandas_dataframe()
train_df = dataset_Train.to_pandas_dataframe()
test_df = dataset_Test.to_pandas_dataframe()

In [10]:
rul_df.head()

Unnamed: 0,Column1,Column2
0,112,
1,98,
2,69,
3,82,
4,91,


In [11]:
train_df.head()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,...,Column19,Column20,Column21,Column22,Column23,Column24,Column25,Column26,Column27,Column28
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,...,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419,,
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236,,
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,...,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442,,
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,...,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739,,
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044,,


In [12]:
test_df.head()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,...,Column19,Column20,Column21,Column22,Column23,Column24,Column25,Column26,Column27,Column28
0,1,1,0.0023,0.0003,100.0,518.67,643.02,1585.29,1398.21,14.62,...,8125.55,8.4052,0.03,392,2388,100.0,38.86,23.3735,,
1,1,2,-0.0027,-0.0003,100.0,518.67,641.71,1588.45,1395.42,14.62,...,8139.62,8.3803,0.03,393,2388,100.0,39.02,23.3916,,
2,1,3,0.0003,0.0001,100.0,518.67,642.46,1586.94,1401.34,14.62,...,8130.1,8.4441,0.03,393,2388,100.0,39.08,23.4166,,
3,1,4,0.0042,0.0,100.0,518.67,642.44,1584.12,1406.42,14.62,...,8132.9,8.3917,0.03,391,2388,100.0,39.0,23.3737,,
4,1,5,0.0014,0.0,100.0,518.67,642.51,1587.19,1401.92,14.62,...,8129.54,8.4031,0.03,390,2388,100.0,38.99,23.413,,


In [13]:
from sklearn import preprocessing
import pickle
import io

dataColumns = ['id', 'cycle', 'setting1', 'setting2', 'setting3', 's1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14', 's15', 's16', 's17', 's18', 's19', 's20', 's21']

In [14]:
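# Rename training data columns:
# round-trip through CSV to discard the auto-generated Column1..Column28 header,
# drop the two empty trailing columns, then apply the real column names
train_df = pd.read_csv(io.StringIO(u""+train_df.to_csv(index=False)), header=None, skiprows=1)
train_df.drop(train_df.columns[[26, 27]], axis=1, inplace=True)
train_df.columns = dataColumns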

In [15]:
train_df.head()

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,...,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,...,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,...,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,...,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,522.19,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044
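
The two empty trailing columns (`Column27` and `Column28`) are what you would expect if each line of the source file ends in extra spaces, because every single space is treated as a delimiter. The sketch below shows that effect on made-up data (the sample text is an assumption, not taken from `train_FD001.txt`) and drops all-empty columns without hard-coding their positions.

```python
import io

import pandas as pd

# Made-up sample: three values per line followed by two trailing spaces.
sample = "1 2 3  \n4 5 6  \n"

raw = pd.read_csv(io.StringIO(sample), sep=" ", header=None)
print(raw.shape)  # includes two extra, entirely empty columns

# Drop every column that holds only missing values.
cleaned = raw.dropna(axis=1, how="all")
cleaned.columns = ["a", "b", "c"]
print(cleaned)
```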


In [16]:
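# Rename test data columns (same steps as for the training data):
# discard the auto-generated header, drop the two empty trailing columns, apply the real names
test_df = pd.read_csv(io.StringIO(u""+test_df.to_csv(index=False)), header=None, skiprows=1)
test_df.drop(test_df.columns[[26, 27]], axis=1, inplace=True)
test_df.columns = dataColumns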

In [17]:
test_df.head()

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,...,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21
0,1,1,0.0023,0.0003,100.0,518.67,643.02,1585.29,1398.21,14.62,...,521.72,2388.03,8125.55,8.4052,0.03,392,2388,100.0,38.86,23.3735
1,1,2,-0.0027,-0.0003,100.0,518.67,641.71,1588.45,1395.42,14.62,...,522.16,2388.06,8139.62,8.3803,0.03,393,2388,100.0,39.02,23.3916
2,1,3,0.0003,0.0001,100.0,518.67,642.46,1586.94,1401.34,14.62,...,521.97,2388.03,8130.1,8.4441,0.03,393,2388,100.0,39.08,23.4166
3,1,4,0.0042,0.0,100.0,518.67,642.44,1584.12,1406.42,14.62,...,521.38,2388.05,8132.9,8.3917,0.03,391,2388,100.0,39.0,23.3737
4,1,5,0.0014,0.0,100.0,518.67,642.51,1587.19,1401.92,14.62,...,522.15,2388.03,8129.54,8.4031,0.03,390,2388,100.0,38.99,23.413


In [18]:
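# Rename RUL data columns
rul_df = pd.read_csv(io.StringIO(u""+rul_df.to_csv(index=False)), header=None, skiprows=1)
rul_df.drop(rul_df.columns[[1]], axis=1, inplace=True)  # drop the empty trailing column
rul_df.columns = ['more']         # cycles remaining after each engine's last test cycle
rul_df['id'] = rul_df.index + 1   # engine ids are 1-based and follow row order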


In [19]:
rul_df.head()

Unnamed: 0,more,id
0,112,1
1,98,2
2,69,3
3,82,4
4,91,5


In [20]:
# train set, calculate RUL
train_df = train_df.sort_values(['id','cycle'])
rul = pd.DataFrame(train_df.groupby('id')['cycle'].max()).reset_index()
rul.columns = ['id', 'max']

In [21]:
rul.head()

Unnamed: 0,id,max
0,1,192
1,2,287
2,3,179
3,4,189
4,5,269


In [22]:
train_df = train_df.merge(rul, on=['id'], how='left')

In [23]:
train_df.head()

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,...,s13,s14,s15,s16,s17,s18,s19,s20,s21,max
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,...,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419,192
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236,192
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,...,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442,192
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,...,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739,192
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044,192


In [24]:
train_df['RUL'] = train_df['max'] - train_df['cycle']

In [25]:
train_df.head()

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,...,s14,s15,s16,s17,s18,s19,s20,s21,max,RUL
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,...,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419,192,191
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236,192,190
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,...,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442,192,189
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,...,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739,192,188
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044,192,187


In [26]:
train_df.drop('max', axis=1, inplace=True)

In [27]:
train_df.head()

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,...,s13,s14,s15,s16,s17,s18,s19,s20,s21,RUL
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,...,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419,191
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236,190
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,...,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442,189
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,...,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739,188
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044,187
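
In run-to-failure training data, an engine's last recorded cycle is its failure point, so RUL is the engine's maximum cycle minus the current cycle. The sketch below is a compact equivalent of the merge-based steps above, shown on a made-up two-engine frame (the toy values are illustrative only). It uses `groupby(...).transform('max')` to avoid the temporary `max` column.

```python
import pandas as pd

toy = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2],
    "cycle": [1, 2, 3, 1, 2],
})

# Broadcast each engine's final cycle back onto its rows, then subtract.
toy["RUL"] = toy.groupby("id")["cycle"].transform("max") - toy["cycle"]
print(toy)
```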


In [28]:
# test set, use ground truth to calculate RUL
test_df = test_df.sort_values(['id','cycle'])
rul = pd.DataFrame(test_df.groupby('id')['cycle'].max()).reset_index()
rul.columns = ['id', 'max']
rul_df['max'] = rul['max'] + rul_df['more']
rul_df.drop('more', axis=1, inplace=True)
test_df = test_df.merge(rul_df, on=['id'], how='left')
test_df['RUL'] = test_df['max'] - test_df['cycle']
test_df.drop('max', axis=1, inplace=True)

In [29]:
test_df.head()

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,...,s13,s14,s15,s16,s17,s18,s19,s20,s21,RUL
0,1,1,0.0023,0.0003,100.0,518.67,643.02,1585.29,1398.21,14.62,...,2388.03,8125.55,8.4052,0.03,392,2388,100.0,38.86,23.3735,142
1,1,2,-0.0027,-0.0003,100.0,518.67,641.71,1588.45,1395.42,14.62,...,2388.06,8139.62,8.3803,0.03,393,2388,100.0,39.02,23.3916,141
2,1,3,0.0003,0.0001,100.0,518.67,642.46,1586.94,1401.34,14.62,...,2388.03,8130.1,8.4441,0.03,393,2388,100.0,39.08,23.4166,140
3,1,4,0.0042,0.0,100.0,518.67,642.44,1584.12,1406.42,14.62,...,2388.05,8132.9,8.3917,0.03,391,2388,100.0,39.0,23.3737,139
4,1,5,0.0014,0.0,100.0,518.67,642.51,1587.19,1401.92,14.62,...,2388.03,8129.54,8.4031,0.03,390,2388,100.0,38.99,23.413,138
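
Test trajectories stop before failure, so the ground-truth file supplies the cycles remaining after each engine's last observed cycle. The RUL of a test row is therefore (last observed cycle + remaining cycles) minus the current cycle. The sketch below uses made-up values and joins on `id` explicitly instead of relying on row order; the toy numbers are assumptions.

```python
import pandas as pd

test_toy = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2],
    "cycle": [1, 2, 3, 1, 2],
})
# Remaining cycles after the last observation, one row per engine.
truth = pd.DataFrame({"id": [1, 2], "more": [10, 5]})

# Last observed cycle per engine, combined with the ground truth by engine id.
last_seen = test_toy.groupby("id")["cycle"].max().rename("last").reset_index()
failure = last_seen.merge(truth, on="id")
failure["max"] = failure["last"] + failure["more"]  # cycle at which the engine fails

test_toy = test_toy.merge(failure[["id", "max"]], on="id", how="left")
test_toy["RUL"] = test_toy["max"] - test_toy["cycle"]
test_toy = test_toy.drop(columns="max")
print(test_toy)
```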


In [30]:
# label data
w1 = 30
train_df['label1'] = np.where(train_df['RUL'] <= w1, 1, 0 )
test_df['label1'] = np.where(test_df['RUL'] <= w1, 1, 0 )


In [31]:
train_df.head()

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,...,s14,s15,s16,s17,s18,s19,s20,s21,RUL,label1
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,...,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419,191,0
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236,190,0
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,...,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442,189,0
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,...,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739,188,0
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044,187,0


In [32]:
rslt_df = train_df[train_df['label1'] == 1] 

In [33]:
rslt_df.head()

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,...,s14,s15,s16,s17,s18,s19,s20,s21,RUL,label1
161,1,162,-0.0005,0.0004,100.0,518.67,643.15,1592.22,1423.48,14.62,...,8123.77,8.5015,0.03,394,2388,100.0,38.78,23.1538,30,1
162,1,163,0.0003,-0.0004,100.0,518.67,642.85,1600.54,1421.09,14.62,...,8124.06,8.5129,0.03,393,2388,100.0,38.65,23.1419,29,1
163,1,164,0.0005,-0.0002,100.0,518.67,643.17,1598.96,1416.76,14.62,...,8124.63,8.4803,0.03,394,2388,100.0,38.62,23.1761,28,1
164,1,165,0.001,0.0004,100.0,518.67,642.76,1597.03,1408.09,14.62,...,8126.53,8.4922,0.03,393,2388,100.0,38.59,23.2129,27,1
165,1,166,-0.0022,-0.0003,100.0,518.67,643.34,1596.72,1422.37,14.62,...,8119.14,8.4663,0.03,395,2388,100.0,38.62,23.145,26,1


In [34]:
# normalize train data
train_df['cycle_norm'] = train_df['cycle']
cols_normalize = train_df.columns.difference(['id','cycle','RUL','label1'])   # feature columns
min_max_scaler = preprocessing.MinMaxScaler()
norm_train_df = pd.DataFrame(min_max_scaler.fit_transform(train_df[cols_normalize]), 
                             columns=cols_normalize, 
                             index=train_df.index)
with open('min_max_scaler.pickle','wb') as f:
    pickle.dump(min_max_scaler, f)
join_df = train_df[train_df.columns.difference(cols_normalize)].join(norm_train_df)

In [35]:
cols_normalize

Index(['cycle_norm', 's1', 's10', 's11', 's12', 's13', 's14', 's15', 's16',
       's17', 's18', 's19', 's2', 's20', 's21', 's3', 's4', 's5', 's6', 's7',
       's8', 's9', 'setting1', 'setting2', 'setting3'],
      dtype='object')

In [36]:
join_df.head()

Unnamed: 0,RUL,cycle,id,label1,cycle_norm,s1,s10,s11,s12,s13,...,s3,s4,s5,s6,s7,s8,s9,setting1,setting2,setting3
0,191,1,1,0,0.0,0.0,0.0,0.369048,0.633262,0.205882,...,0.406802,0.309757,0.0,1.0,0.726248,0.242424,0.109755,0.45977,0.166667,0.0
1,190,2,1,0,0.00277,0.0,0.0,0.380952,0.765458,0.279412,...,0.453019,0.352633,0.0,1.0,0.628019,0.212121,0.100242,0.609195,0.25,0.0
2,189,3,1,0,0.00554,0.0,0.0,0.25,0.795309,0.220588,...,0.369523,0.370527,0.0,1.0,0.710145,0.272727,0.140043,0.252874,0.75,0.0
3,188,4,1,0,0.00831,0.0,0.0,0.166667,0.889126,0.294118,...,0.256159,0.331195,0.0,1.0,0.740741,0.318182,0.124518,0.54023,0.5,0.0
4,187,5,1,0,0.01108,0.0,0.0,0.255952,0.746269,0.235294,...,0.257467,0.404625,0.0,1.0,0.668277,0.242424,0.14996,0.390805,0.333333,0.0


In [37]:
norm_train_df.head()

Unnamed: 0,cycle_norm,s1,s10,s11,s12,s13,s14,s15,s16,s17,...,s3,s4,s5,s6,s7,s8,s9,setting1,setting2,setting3
0,0.0,0.0,0.0,0.369048,0.633262,0.205882,0.199608,0.363986,0.0,0.333333,...,0.406802,0.309757,0.0,1.0,0.726248,0.242424,0.109755,0.45977,0.166667,0.0
1,0.00277,0.0,0.0,0.380952,0.765458,0.279412,0.162813,0.411312,0.0,0.333333,...,0.453019,0.352633,0.0,1.0,0.628019,0.212121,0.100242,0.609195,0.25,0.0
2,0.00554,0.0,0.0,0.25,0.795309,0.220588,0.171793,0.357445,0.0,0.166667,...,0.369523,0.370527,0.0,1.0,0.710145,0.272727,0.140043,0.252874,0.75,0.0
3,0.00831,0.0,0.0,0.166667,0.889126,0.294118,0.174889,0.166603,0.0,0.333333,...,0.256159,0.331195,0.0,1.0,0.740741,0.318182,0.124518,0.54023,0.5,0.0
4,0.01108,0.0,0.0,0.255952,0.746269,0.235294,0.174734,0.402078,0.0,0.416667,...,0.257467,0.404625,0.0,1.0,0.668277,0.242424,0.14996,0.390805,0.333333,0.0


In [38]:
train_df = join_df.reindex(columns = train_df.columns)

In [39]:
train_df.head()

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,...,s15,s16,s17,s18,s19,s20,s21,RUL,label1,cycle_norm
0,1,1,0.45977,0.166667,0.0,0.0,0.183735,0.406802,0.309757,0.0,...,0.363986,0.0,0.333333,0.0,0.0,0.713178,0.724662,191,0,0.0
1,1,2,0.609195,0.25,0.0,0.0,0.283133,0.453019,0.352633,0.0,...,0.411312,0.0,0.333333,0.0,0.0,0.666667,0.731014,190,0,0.00277
2,1,3,0.252874,0.75,0.0,0.0,0.343373,0.369523,0.370527,0.0,...,0.357445,0.0,0.166667,0.0,0.0,0.627907,0.621375,189,0,0.00554
3,1,4,0.54023,0.5,0.0,0.0,0.343373,0.256159,0.331195,0.0,...,0.166603,0.0,0.333333,0.0,0.0,0.573643,0.662386,188,0,0.00831
4,1,5,0.390805,0.333333,0.0,0.0,0.349398,0.257467,0.404625,0.0,...,0.402078,0.0,0.416667,0.0,0.0,0.589147,0.704502,187,0,0.01108


In [40]:
# normalize test data
test_df['cycle_norm'] = test_df['cycle']
norm_test_df = pd.DataFrame(min_max_scaler.transform(test_df[cols_normalize]), 
                            columns=cols_normalize, 
                            index=test_df.index)
test_join_df = test_df[test_df.columns.difference(cols_normalize)].join(norm_test_df)
test_df = test_join_df.reindex(columns = test_df.columns)
test_df = test_df.reset_index(drop=True)


In [41]:
test_df.head()

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,...,s15,s16,s17,s18,s19,s20,s21,RUL,label1,cycle_norm
0,1,1,0.632184,0.75,0.0,0.0,0.545181,0.310661,0.269413,0.0,...,0.308965,0.0,0.333333,0.0,0.0,0.55814,0.661834,142,0,0.0
1,1,2,0.344828,0.25,0.0,0.0,0.150602,0.379551,0.222316,0.0,...,0.213159,0.0,0.416667,0.0,0.0,0.682171,0.686827,141,0,0.00277
2,1,3,0.517241,0.583333,0.0,0.0,0.376506,0.346632,0.322248,0.0,...,0.458638,0.0,0.416667,0.0,0.0,0.728682,0.721348,140,0,0.00554
3,1,4,0.741379,0.5,0.0,0.0,0.370482,0.285154,0.408001,0.0,...,0.257022,0.0,0.25,0.0,0.0,0.666667,0.66211,139,0,0.00831
4,1,5,0.58046,0.5,0.0,0.0,0.391566,0.352082,0.332039,0.0,...,0.300885,0.0,0.166667,0.0,0.0,0.658915,0.716377,138,0,0.01108
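
The scaler is fitted on the training data only and then reused for the test data. As a result, test values outside the training range map outside [0, 1], and a column that never changes in training maps to 0. The sketch below uses made-up numbers and scikit-learn's `MinMaxScaler`, as in the cells above, and also checks that a pickled copy reproduces the same transform.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train_toy = pd.DataFrame({"s2": [0.0, 5.0, 10.0], "s1": [7.0, 7.0, 7.0]})
test_toy = pd.DataFrame({"s2": [2.5, 15.0], "s1": [7.0, 7.0]})

scaler = MinMaxScaler().fit(train_toy)         # learn min/max from training data only
restored = pickle.loads(pickle.dumps(scaler))  # round trip, like saving to a .pickle file

scaled_test = pd.DataFrame(restored.transform(test_toy),
                           columns=test_toy.columns, index=test_toy.index)
print(scaled_test)
```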


In [42]:
# describe data and use only some columns
def describe():
    print('train set', train_df.shape)
    print('test set', test_df.shape)
    stats = train_df.describe().T
    unchanging_cols = list(stats[stats['std']==0].index)
    print('unchanging cols', unchanging_cols)
    # ['setting3', 's1', 's5', 's10', 's16', 's18', 's19']

print('Describe data:')
describe()
    

Describe data:
train set (20631, 29)
test set (13096, 29)
check distribution 
 0    17531
1    3100 
Name: label1, dtype: int64
unchanging cols ['setting3', 's1', 's5', 's10', 's16', 's18', 's19']


#### <font color='blue'> Challenge 2</font>:

Check the distribution of labels

In [None]:
'''
Code goes here
'''

In [43]:
#remove unchanging columns
feature_cols = ['cycle_norm', 'setting1', 'setting2', 'setting3', 's1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14', 's15', 's16', 's17', 's18', 's19', 's20', 's21']
feature_cols = [s for s in feature_cols if s not in ['setting3', 's1', 's5', 's10', 's16', 's18', 's19']]


In [44]:
feature_cols

['cycle_norm',
 'setting1',
 'setting2',
 's2',
 's3',
 's4',
 's6',
 's7',
 's8',
 's9',
 's11',
 's12',
 's13',
 's14',
 's15',
 's17',
 's20',
 's21']
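
The list of excluded columns is hard-coded from the `describe()` output. The sketch below derives it from the data instead, using a made-up frame; `constant_columns` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def constant_columns(df):
    """Return the names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


toy = pd.DataFrame({
    "setting3": [0.0, 0.0, 0.0],
    "s1":       [0.0, 0.0, 0.0],
    "s2":       [0.18, 0.28, 0.34],
})

unchanging = constant_columns(toy)
feature_cols_toy = [c for c in toy.columns if c not in unchanging]
print(unchanging, feature_cols_toy)
```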

In [45]:
cols = ['id','cycle','RUL','label1'] + feature_cols    
train_df = train_df[cols]
test_df = test_df[cols]

In [46]:
cols

['id',
 'cycle',
 'RUL',
 'label1',
 'cycle_norm',
 'setting1',
 'setting2',
 's2',
 's3',
 's4',
 's6',
 's7',
 's8',
 's9',
 's11',
 's12',
 's13',
 's14',
 's15',
 's17',
 's20',
 's21']

## Feature Engineering

In [47]:
import pandas as pd
import numpy as np

lag_window = 5
lag_cols = [s for s in feature_cols if s not in ['cycle_norm','setting1','setting2','setting3']]


In [48]:
feature_cols

['cycle_norm',
 'setting1',
 'setting2',
 's2',
 's3',
 's4',
 's6',
 's7',
 's8',
 's9',
 's11',
 's12',
 's13',
 's14',
 's15',
 's17',
 's20',
 's21']

In [49]:
lag_cols

['s2',
 's3',
 's4',
 's6',
 's7',
 's8',
 's9',
 's11',
 's12',
 's13',
 's14',
 's15',
 's17',
 's20',
 's21']

In [50]:
# build lagging features - train data set
df_mean = train_df[lag_cols].rolling(window=lag_window).mean()
df_std = train_df[lag_cols].rolling(window=lag_window).std()
df_mean.columns = ['MA'+s for s in lag_cols]
df_std.columns = ['STD'+s for s in lag_cols]
df_train = pd.concat([train_df,df_mean,df_std], axis=1, join='inner')


In [51]:
df_mean.head(10)

Unnamed: 0,MAs2,MAs3,MAs4,MAs6,MAs7,MAs8,MAs9,MAs11,MAs12,MAs13,MAs14,MAs15,MAs17,MAs20,MAs21
0,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,
4,0.300602,0.348594,0.353747,1.0,0.694686,0.257576,0.124904,0.284524,0.765885,0.247059,0.176767,0.340285,0.316667,0.634109,0.688788
5,0.31747,0.32579,0.346219,1.0,0.70467,0.245455,0.128036,0.247619,0.766738,0.25,0.170812,0.33359,0.3,0.621705,0.674399
6,0.337349,0.32797,0.328089,1.0,0.723671,0.239394,0.141551,0.232143,0.768443,0.238235,0.171669,0.307118,0.3,0.637209,0.66164
7,0.35,0.306039,0.317184,1.0,0.710467,0.215152,0.130656,0.228571,0.770576,0.238235,0.16944,0.299269,0.316667,0.64031,0.652361
8,0.336145,0.341748,0.293315,1.0,0.68599,0.19697,0.127946,0.247619,0.724947,0.229412,0.161038,0.302809,0.316667,0.666667,0.661392
9,0.296386,0.37833,0.273869,1.0,0.672786,0.193939,0.12485,0.217857,0.707889,0.235294,0.156476,0.302193,0.316667,0.674419,0.679343


In [52]:
df_std.head()

Unnamed: 0,STDs2,STDs3,STDs4,STDs6,STDs7,STDs8,STDs9,STDs11,STDs12,STDs13,STDs14,STDs15,STDs17,STDs20,STDs21
0,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,
4,0.070716,0.088853,0.036452,0.0,0.046092,0.040087,0.020584,0.089918,0.092233,0.038065,0.013682,0.099856,0.091287,0.05707,0.046256


In [53]:
df_train.head()

Unnamed: 0,id,cycle,RUL,label1,cycle_norm,setting1,setting2,s2,s3,s4,...,STDs8,STDs9,STDs11,STDs12,STDs13,STDs14,STDs15,STDs17,STDs20,STDs21
0,1,1,191,0,0.0,0.45977,0.166667,0.183735,0.406802,0.309757,...,,,,,,,,,,
1,1,2,190,0,0.00277,0.609195,0.25,0.283133,0.453019,0.352633,...,,,,,,,,,,
2,1,3,189,0,0.00554,0.252874,0.75,0.343373,0.369523,0.370527,...,,,,,,,,,,
3,1,4,188,0,0.00831,0.54023,0.5,0.343373,0.256159,0.331195,...,,,,,,,,,,
4,1,5,187,0,0.01108,0.390805,0.333333,0.349398,0.257467,0.404625,...,0.040087,0.020584,0.089918,0.092233,0.038065,0.013682,0.099856,0.091287,0.05707,0.046256


In [54]:
df_train.shape

(20631, 52)

In [55]:
# cut head by id, due to lagging transformation
#train_array = [df_train[df_train['id']==id].values[lag_window+40:,:] for id in df_train['id'].unique()]

# Keep each engine's rows from position lag_window+40 onward, then stack them.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
train_array = pd.concat(
    [df_train[df_train['id'] == engine_id].iloc[lag_window+40:]
     for engine_id in df_train['id'].unique()]
)

In [56]:
train_array.head()

Unnamed: 0,id,cycle,RUL,label1,cycle_norm,setting1,setting2,s2,s3,s4,...,STDs8,STDs9,STDs11,STDs12,STDs13,STDs14,STDs15,STDs17,STDs20,STDs21
45,1,46,146,0,0.124654,0.517241,0.583333,0.36747,0.376063,0.321067,...,0.034551,0.025044,0.071602,0.079474,0.051576,0.009221,0.031995,0.045644,0.079547,0.079811
46,1,47,145,0,0.127424,0.5,0.916667,0.301205,0.202311,0.348413,...,0.029536,0.026466,0.071602,0.086421,0.051576,0.010189,0.03643,0.045644,0.07875,0.099588
47,1,48,144,0,0.130194,0.609195,0.583333,0.204819,0.380859,0.285449,...,0.025353,0.032236,0.073939,0.046348,0.043376,0.010094,0.065087,0.045644,0.077014,0.079744
48,1,49,143,0,0.132964,0.41954,0.916667,0.307229,0.290168,0.261816,...,0.015152,0.029407,0.069975,0.053147,0.022303,0.017984,0.06348,0.045644,0.077014,0.072424
49,1,50,142,0,0.135734,0.316092,0.416667,0.46988,0.380423,0.360736,...,0.013552,0.0253,0.073915,0.074147,0.030495,0.018657,0.063387,0.037268,0.060196,0.070976


In [57]:
train_array.shape

(16131, 52)

In [58]:
df_train.shape

(20631, 52)
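
Because `rolling` is applied to the whole frame, the first `lag_window - 1` rows of every engine after the first get windows that reach back into the previous engine's last rows. Dropping the first `lag_window+40` rows per `id` discards those rows. The shapes above are consistent with 45 rows removed from each of the 100 engines in FD001 (20631 - 16131 = 4500). As an alternative, the sketch below (made-up data) computes the window per engine so that no cross-engine values appear in the first place.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "s2": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
window = 2

# Global window: row 3 (first row of engine 2) averages in engine 1's last value.
global_ma = toy["s2"].rolling(window=window).mean()

# Per-engine window: each engine starts with its own NaN warm-up rows.
grouped_ma = toy.groupby("id")["s2"].transform(
    lambda s: s.rolling(window=window).mean()
)

print(pd.DataFrame({"id": toy["id"], "global": global_ma, "per_engine": grouped_ma}))
```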

In [59]:
import numpy as np
import pandas as pd

In [60]:
#train_array = np.concatenate(train_array).astype(np.float32)
# Drop the first three columns (id, cycle, RUL), keeping label1 and the features
train_set = train_array.drop(columns=train_array.columns[:3])

In [61]:
train_set.head()

Unnamed: 0,label1,cycle_norm,setting1,setting2,s2,s3,s4,s6,s7,s8,...,STDs8,STDs9,STDs11,STDs12,STDs13,STDs14,STDs15,STDs17,STDs20,STDs21
45,0,0.124654,0.517241,0.583333,0.36747,0.376063,0.321067,1.0,0.62963,0.242424,...,0.034551,0.025044,0.071602,0.079474,0.051576,0.009221,0.031995,0.045644,0.079547,0.079811
46,0,0.127424,0.5,0.916667,0.301205,0.202311,0.348413,1.0,0.729469,0.242424,...,0.029536,0.026466,0.071602,0.086421,0.051576,0.010189,0.03643,0.045644,0.07875,0.099588
47,0,0.130194,0.609195,0.583333,0.204819,0.380859,0.285449,1.0,0.671498,0.272727,...,0.025353,0.032236,0.073939,0.046348,0.043376,0.010094,0.065087,0.045644,0.077014,0.079744
48,0,0.132964,0.41954,0.916667,0.307229,0.290168,0.261816,1.0,0.705314,0.257576,...,0.015152,0.029407,0.069975,0.053147,0.022303,0.017984,0.06348,0.045644,0.077014,0.072424
49,0,0.135734,0.316092,0.416667,0.46988,0.380423,0.360736,1.0,0.655395,0.242424,...,0.013552,0.0253,0.073915,0.074147,0.030495,0.018657,0.063387,0.037268,0.060196,0.070976


## Create and Register Dataset Object

#### <font color='blue'> Challenge 3</font>:

1. Create and register the processed data (`train_set`) as a dataset in the cell below. The name of the dataset should be `dataset_from_pandas_df`.
2. What's the benefit of registering datasets?

Tips: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-version-track-datasets

In [3]:
from azureml.core import Workspace, Dataset

datastore = ws.get_default_datastore()

'''
Code goes here ..
'''

'\nCode goes here ..\n'