<ul style="list-style-type:circle;font-size:14px;line-height:27px;">
    <li><b>from streamanalytix.python.dataset import Dataset:</b> Imports the Dataset class from the StreamAnalytix API.</li>
	<li><b>Dataset(source_name):</b> Creates a Dataset object for the given source.
		<ul>
			<li><b>Dataset.get_dataframe():</b> Reads the data source and returns a pandas DataFrame.</li>
		</ul>
	</li>
	<li><b>from streamanalytix.utilities import sax_utils:</b> Imports the sax_utils module, which provides the following functions:
		<ul>
			<li><b>sax_utils.save_and_download_model(model_name, model_object):</b> Lets the user save and download the trained model with StreamAnalytix. The model can then be used for training and/or scoring as part of a StreamAnalytix pipeline.
				<ul>
					<li><b>model_name:</b> Name of the model. String values only, e.g. <i>"DecisionTreeModel"</i></li>
					<li><b>model_object:</b> The trained model object</li>
				</ul>
			</li>
			<li><b>sax_utils.create_h2o_frame(source_name, cluster_name):</b> Reads the data source and returns an H2O frame using the notebook environment.
				<ul>
					<li><b>source_name:</b> Name of the source. String values only, e.g. <i>"MyDataSource"</i></li>
					<li><b>cluster_name:</b> Name of the cluster. String values only, e.g. <i>"TrainingCluster"</i></li>
				</ul>
			</li>
			<li><b>sax_utils.upload_and_register_h2o_model(model_object, model_name, model_type, project_name, project_version, workspace_name):</b> Lets the user upload an H2O model in 'mojo' format and register it in StreamAnalytix.
				<ul>
					<li><b>model_object:</b> The trained H2O model object</li>
					<li><b>model_name:</b> Name of the model. String values only, e.g. <i>"H2OTreeModel"</i></li>
					<li><b>model_type:</b> Type of the trained model. String values only. Supported H2O model types: <i>"DistributedRandomForest"</i>, <i>"GeneralizedLinearModelling"</i>, <i>"IsolationForest"</i>, <i>"GradientBoostingMachine"</i></li>
					<li><b>project_name:</b> Name of the project in which the model should be registered. String values only, e.g. <i>"MyProject"</i></li>
					<li><b>project_version:</b> Version of the given project in which the model should be registered</li>
					<li><b>workspace_name:</b> Name of the workspace in which the model should be registered. String values only, e.g. <i>"MyWorkspace"</i></li>
				</ul>
			</li>
		</ul>
	</li>
</ul>
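
As a rough illustration of the argument rules listed above, the sketch below checks the string arguments before a call to `sax_utils.upload_and_register_h2o_model`. The helper name `check_h2o_registration_args` and the dictionary it returns are hypothetical and not part of the StreamAnalytix API; only the parameter names and the four supported model types come from the list above. Whether `sax_utils` performs its own validation is not stated, so treat this as an optional pre-check.

```python
# Hypothetical pre-check; not part of the StreamAnalytix API.
SUPPORTED_H2O_MODEL_TYPES = (
    "DistributedRandomForest",
    "GeneralizedLinearModelling",
    "IsolationForest",
    "GradientBoostingMachine",
)


def check_h2o_registration_args(model_name, model_type, project_name, workspace_name):
    """Validate the string arguments described above and return them as a dict."""
    named = {
        "model_name": model_name,
        "model_type": model_type,
        "project_name": project_name,
        "workspace_name": workspace_name,
    }
    # Every one of these parameters is documented as "String values only"
    for key, value in named.items():
        if not isinstance(value, str) or not value:
            raise TypeError(f"{key} must be a non-empty string, got {value!r}")
    # Only the four listed H2O model types are supported
    if model_type not in SUPPORTED_H2O_MODEL_TYPES:
        raise ValueError(f"unsupported model_type {model_type!r}")
    return named
```

The validated values can then be passed on to `upload_and_register_h2o_model` together with the model object and the project version.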


In [1]:
from streamanalytix.python.dataset import Dataset
from streamanalytix.utilities import sax_utils


dataset_1 = Dataset("compressor_healthy_data")
dataset_2 = Dataset("compressor_damage_data")

# get_dataframe() reads each source into a pandas DataFrame
healthy_df = dataset_1.get_dataframe()
damaged_df = dataset_2.get_dataframe()

Dataframe created
Dataframe created


In [2]:
# add label
healthy_df['label'] = 'HEALTHY'
damaged_df['label'] = 'DAMAGED'

In [3]:
# print datasets
print(healthy_df.head())
print(damaged_df.head())

        AN2      AN3      AN4       AN5       AN6      AN7    label
0   6.92610 -0.90846 -0.33036  0.366540  1.581900  0.65208  HEALTHY
1   2.11830  2.86120  0.35087 -0.408250 -2.693900 -0.41473  HEALTHY
2   0.26564  1.63200  0.64668  0.638510  0.018625  0.37443  HEALTHY
3  10.76700  1.60540 -0.93881 -0.042284 -0.729160 -1.02010  HEALTHY
4   3.19520  7.72530 -2.25360 -0.696960 -3.156400 -0.61025  HEALTHY
       AN2     AN3      AN4     AN5      AN6     AN7    label
0  -2.3892 -1.0360 -1.38470 -2.9812 -0.23971 -2.3579  DAMAGED
1  13.2220  6.0517  1.83150  1.1092  3.44000  0.3012  DAMAGED
2   4.7969  4.1318 -0.69563  0.1667 -3.11800  1.8308  DAMAGED
3  -1.5859  1.8773  1.78470 -4.6230  1.21870  3.2342  DAMAGED
4  -2.0814  1.9095 -1.79360  5.7990 -1.49020  1.6979  DAMAGED


In [4]:
# filter datasets: keep rows whose value lies in [-10, 10]
# (healthy data is filtered on AN3, damaged data on AN2)
healthy_df_filter = healthy_df[(-10 <= healthy_df.AN3) & (healthy_df.AN3 <= 10)]
damaged_df_filter = damaged_df[(-10 <= damaged_df.AN2) & (damaged_df.AN2 <= 10)]
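
The boolean-mask filter above can also be written with `Series.between`, which is inclusive at both ends by default. The sketch below uses a small made-up frame (the values are illustrative, not taken from the compressor datasets) and a hypothetical helper `keep_in_range`, and compares the result with the mask form.

```python
import pandas as pd


def keep_in_range(df, column, low=-10, high=10):
    """Return the rows of df whose value in `column` lies in [low, high]."""
    return df[df[column].between(low, high)]


# Made-up sample values, for illustration only
sample = pd.DataFrame({
    "AN2": [-12.0, -10.0, 0.5, 10.0, 13.2],
    "AN3": [1.0, 2.0, 3.0, 4.0, 5.0],
})

filtered = keep_in_range(sample, "AN2")
# Same filter written as an explicit boolean mask, as in the cell above
mask_filtered = sample[(-10 <= sample.AN2) & (sample.AN2 <= 10)]
```

Both forms keep the boundary values -10 and 10, so they select the same rows.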

In [5]:
import pandas as pd
from sklearn.pipeline import Pipeline
import joblib  # sklearn.externals.joblib is deprecated and removed in scikit-learn 0.23+
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [6]:
# join datasets
final_df = pd.concat([healthy_df_filter, damaged_df_filter])

In [7]:
# train and test split
_test_data_size = 0.2
y = final_df['label']  # Labels
X = final_df.drop(['label'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=_test_data_size)
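
The split above is drawn at random on every run and is not stratified by label, so the score can differ between runs. A minimal sketch, assuming reproducibility and equal class proportions in both splits are wanted (the notebook does not say so): pass `random_state` and `stratify` to `train_test_split`. The toy frame and the names `toy_df`, `X_toy` and `y_toy` are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 HEALTHY rows and 10 DAMAGED rows
toy_df = pd.DataFrame({
    "AN2": [float(i) for i in range(20)],
    "label": ["HEALTHY"] * 10 + ["DAMAGED"] * 10,
})
X_toy = toy_df.drop(columns=["label"])
y_toy = toy_df["label"]

# stratify keeps the label proportions equal in train and test;
# random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
```

With a fixed `random_state`, rerunning the cell returns the same rows in each split.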

In [8]:
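# create pipeline model: scale the features, then classify with a random forest
# (the final step is named 'regressor' but holds a classifier)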
training_pipeline = Pipeline(steps=[('scaler', StandardScaler()), ('regressor', RandomForestClassifier())])

In [9]:
# fit model
training_pipeline.fit(X_train, y_train)



Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('regressor', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

In [10]:
# score
print(training_pipeline.score(X_test, y_test))

0.807909604519774
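
For a pipeline that ends in a classifier, `score` returns the mean accuracy on the given test data, that is, the fraction of test rows whose predicted label matches the true label. The sketch below recomputes that figure by hand and also counts (true, predicted) pairs, which shows which class the errors come from. The label lists are made up for illustration; in the notebook they would be `y_test` and `training_pipeline.predict(X_test)`.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


# Made-up labels, for illustration only
true_labels = ["HEALTHY", "HEALTHY", "DAMAGED", "DAMAGED", "DAMAGED"]
pred_labels = ["HEALTHY", "DAMAGED", "DAMAGED", "DAMAGED", "HEALTHY"]

print(accuracy(true_labels, pred_labels))
print(confusion_counts(true_labels, pred_labels))
```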
