# 4. Machine Learning for IoT Data

See the IoT Data Analysis guide: [`IoT_Datacamp.md`](../IoT_Datacamp.md).

## 4.1 Basic Model Training: Split, Scale, Train, Evaluate

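Supervised machine learning algorithms learn a mapping from independent variables (features) `X` to one or more target (dependent) variables `y`. We split the data into `train` and `test` subsets; the model must not see the `test` subset during training.

Time series data must not be split randomly, because a random split would let the model train on observations recorded after some of the test observations. Instead, we split chronologically and take the last 20% as the test subset.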

```python
import pandas as pd

# `environment` is assumed to be a DataFrame with a sorted DatetimeIndex
environment.columns
# ['precipitation', 'wind-gust-speed', 'humidity', 'radiation', 'sunshine', 'wind-direction', 'wind-speed', 'pressure', 'temperature', 'target']

environment.shape # (2972, 10)

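# Define the split day (roughly 80% of the way through the index)
# limit_day = environment.index[int(environment.shape[0]*0.8)].date()
limit_day = "2018-10-27"

# Split the data chronologically.
# Slicing by date label includes the boundary day on both sides, so the
# test set starts strictly after the last training timestamp to avoid overlap.
train_env = environment.loc[:limit_day]
test_env = environment.loc[environment.index > train_env.index[-1]]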

# Print start and end dates
# (show_start_end is a helper that is not defined in this section)
print(show_start_end(train_env))
print(show_start_end(test_env))

# Split the data into X and y
X_train = train_env.drop("target", axis=1)
y_train = train_env["target"]
X_test = test_env.drop("target", axis=1)
y_test = test_env["target"]

# Scale
from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler
sc = StandardScaler()

# Fit the scaler
sc.fit(X_train)

# Transform the data
X_train_s = sc.transform(X_train)
X_test_s = sc.transform(X_test)
X_train_s = pd.DataFrame(X_train_s, 
                         columns=X_train.columns, 
                         index=X_train.index)
X_test_s = pd.DataFrame(X_test_s, 
                        columns=X_test.columns, 
                        index=X_test.index)

# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Initialize the model
logreg = LogisticRegression()

# Fit the model
logreg.fit(X_train_s, y_train)

# Predict classes
print(logreg.predict(X_test_s))

# Score the model
print(logreg.score(X_train_s, y_train))
print(logreg.score(X_test_s, y_test))
```
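
To make the chronological split concrete, here is a minimal, self-contained sketch on synthetic data. The DataFrame `env`, its columns, and the body of `show_start_end` are assumptions for illustration only; the original helper is not shown in this guide and may behave differently. The sketch splits by position instead of by date label, which is one way to guarantee that the two subsets do not overlap.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `environment`: 15-minute readings with a binary target
idx = pd.date_range("2018-09-01", periods=1000, freq="15min")
rng = np.random.default_rng(0)
env = pd.DataFrame(
    {
        "humidity": rng.uniform(40, 100, len(idx)),
        "temperature": rng.normal(10, 5, len(idx)),
        "target": rng.integers(0, 2, len(idx)),
    },
    index=idx,
)


def show_start_end(df):
    """Hypothetical helper: return the first and last timestamp of df."""
    return df.index.min(), df.index.max()


# Positional split: first 80% of the rows for training, last 20% for testing
split_pos = int(len(env) * 0.8)
train_env = env.iloc[:split_pos]
test_env = env.iloc[split_pos:]

print(show_start_end(train_env))
print(show_start_end(test_env))
```

Because `iloc` slices are half-open, every row lands in exactly one subset, and all training timestamps precede all test timestamps as long as the index is sorted.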

## 4.2 Develop Machine Learning Pipeline

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Initialize objects
sc = StandardScaler()
logreg = LogisticRegression()

# Create pipeline
pl = Pipeline([
    ("scale", sc),
    ("logreg", logreg),
])

# Train and predict
pl.fit(X_train, y_train)
print(pl.predict(X_test))

# Persist the model
import pickle
from pathlib import Path

with Path("pipeline_model.pkl").open("wb") as f:
    pickle.dump(pl, f)

# Load the persisted model
with Path("pipeline_model.pkl").open("rb") as f:
    pl = pickle.load(f)
```
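
As a sanity check on what the pipeline bundles, the following sketch compares it with the manual scale-then-fit steps from section 4.1 and then round-trips it through `pickle` in memory. The data here is synthetic and the variable names are illustrative assumptions, not part of the original dataset.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features, binary target, chronological 80/20 split
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test = X[:160], X[160:]
y_train = y[:160]

# Manual approach: fit the scaler on training data only, then fit the model
sc = StandardScaler().fit(X_train)
logreg = LogisticRegression().fit(sc.transform(X_train), y_train)
manual_pred = logreg.predict(sc.transform(X_test))

# Pipeline approach: the same two steps behind a single fit/predict interface
pl = Pipeline([("scale", StandardScaler()), ("logreg", LogisticRegression())])
pl.fit(X_train, y_train)
pipeline_pred = pl.predict(X_test)

# Round trip through pickle (in memory here, instead of a file)
restored = pickle.loads(pickle.dumps(pl))
restored_pred = restored.predict(X_test)

print((manual_pred == pipeline_pred).all())
print((restored_pred == pipeline_pred).all())
```

The practical benefit is that the fitted scaler travels with the model, so new data only needs a single `predict()` call on unscaled features.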

## 4.3 Apply the Trained Machine Learning Model to New Data

The following snippet shows how to apply the ML model to the data stream. The steps are:

- Receive each message in the `on_message()` callback.
- Extract the record and convert it to a single-row DataFrame.
- Predict the target from that DataFrame.
- Pass the prediction to a function that acts on it.

```python
import json

import pandas as pd
import paho.mqtt.subscribe as subscribe

# `pl`, `cols`, `maybe_alert`, `topic` and `MQTT_HOST` are assumed to be defined elsewhere

def on_message(client, userdata, message):
    # Extract data: single JSON record
    data = json.loads(message.payload)
    # {'timestamp': '2018-11-30 18:15:00',
    #  'humidity': 81.7,
    #  'pressure': 1019.8,
    #  'temperature': 1.5}
    # Create a single-row DataFrame indexed by timestamp
    df = pd.DataFrame.from_records([data],
                                   index="timestamp",
                                   columns=cols)
    # Predict
    category = pl.predict(df)
    # Pass prediction to function
    # predict() returns an array; one input row yields one entry
    maybe_alert(category[0])

# Subscribe to topic
subscribe.callback(on_message, topic, hostname=MQTT_HOST)
```
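
The steps above can be tried without a broker by calling the callback directly with a fake message. This is a minimal sketch under stated assumptions: `cols`, the synthetic training data, the `received` list, and the body of `maybe_alert` are all hypothetical stand-ins, since the original definitions are not shown in this guide.

```python
import json
from types import SimpleNamespace

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed column list: the index column plus the three features in the payload
cols = ["timestamp", "humidity", "pressure", "temperature"]

# Train a small pipeline on synthetic data (stand-in for the real model)
rng = np.random.default_rng(1)
X_train = pd.DataFrame({
    "humidity": rng.uniform(40, 100, 200),
    "pressure": rng.normal(1015, 8, 200),
    "temperature": rng.normal(8, 6, 200),
})
y_train = (X_train["temperature"] < 5).astype(int)  # 1 = "cold" category
pl = Pipeline([("scale", StandardScaler()), ("logreg", LogisticRegression())])
pl.fit(X_train, y_train)

received = []  # records every prediction so the behavior can be inspected


def maybe_alert(category):
    """Hypothetical handler: record the prediction and alert on category 1."""
    received.append(category)
    if category == 1:
        print("ALERT: category 1 predicted")


def on_message(client, userdata, message):
    data = json.loads(message.payload)
    df = pd.DataFrame.from_records([data], index="timestamp", columns=cols)
    category = pl.predict(df)
    maybe_alert(category[0])


# Simulate one MQTT message: only the `payload` attribute is used here
payload = json.dumps({
    "timestamp": "2018-11-30 18:15:00",
    "humidity": 81.7,
    "pressure": 1019.8,
    "temperature": 1.5,
}).encode()
on_message(None, None, SimpleNamespace(payload=payload))
```

Calling the callback directly like this keeps the prediction logic testable independently of the network connection.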