# ML Lab Solution

We can try this ourselves using the *Combined Cycle Power Plant Data Set* from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant).

This dataset consists of about 10,000 records of measurements from a combined cycle power plant. Each record contains:
- Temperature (AT), in the range 1.81-37.11 °C
- Ambient Pressure (AP), in the range 992.89-1033.30 millibar
- Relative Humidity (RH), in the range 25.56-100.16%
- Exhaust Vacuum (V), in the range 25.36-81.56 cm Hg
- Net hourly electrical energy output (PE), in the range 420.26-495.76 MW

We want to model the power output (PE) as a function of the other four variables.
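
As a rough illustration of what this means for a linear model, the sketch below predicts PE as an intercept plus a weighted sum of AT, V, AP and RH. The function name `predict_pe`, the coefficient values and the example record are all made up for illustration; none of them are fitted to or taken from the dataset.

```python
# Hypothetical coefficients, chosen only to illustrate the model's form;
# they are NOT fitted to the power plant data.
INTERCEPT = 450.0
WEIGHTS = {"AT": -2.0, "V": -0.25, "AP": 0.05, "RH": -0.1}


def predict_pe(record):
    """Return intercept + sum of (weight * feature value) for one record."""
    return INTERCEPT + sum(WEIGHTS[name] * record[name] for name in WEIGHTS)


# A made-up record whose values lie inside the ranges listed above.
example = {"AT": 20.0, "V": 50.0, "AP": 1010.0, "RH": 70.0}
print(predict_pe(example))
```

Fitting the model amounts to choosing the intercept and weights that make such predictions match the observed PE values as closely as possible.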

In [None]:
from dask.distributed import Client

client = Client(n_workers=2, threads_per_worker=1, memory_limit='1GB')

client

In [None]:
import dask.dataframe

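# Read the CSV lazily; a small blocksize (in bytes, given as an integer)
# splits this small file into several partitions.
ddf = dask.dataframe.read_csv('data/powerplant.csv', sample=512000, blocksize=40_000)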
ddf

In [None]:
y = ddf.PE
y

In [None]:
X = ddf.drop(columns=['PE'])
X

In [None]:
# lengths=True computes the partition lengths so the array has known chunk sizes
X = X.to_dask_array(lengths=True)
X

In [None]:
y = y.to_dask_array(lengths=True)

In [None]:
y

In [None]:
from dask_ml.model_selection import train_test_split

# Hold out 10% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

X_train

In [None]:
y_train

In [None]:
from dask_ml.linear_model import LinearRegression

# Linear regression fitted with the L-BFGS solver, capped at 10 iterations
lr = LinearRegression(solver='lbfgs', max_iter=10)
lr_model = lr.fit(X_train, y_train)

In [None]:
y_predicted = lr_model.predict(X_test)

In [None]:
# Range of the target (in MW), for context when interpreting the error below
(y.max() - y.min()).compute()

In [None]:
from dask_ml.metrics import mean_squared_error
from math import sqrt

# Root-mean-squared error on the held-out test set, in MW
sqrt(mean_squared_error(y_test, y_predicted))
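
The range of PE computed earlier gives this error some context: a root-mean-squared error of a few MW is small when the target spans about 75 MW. A minimal pure-Python sketch of that comparison follows, using toy numbers rather than the dataset. The helper names `rmse` and `rmse_fraction_of_range` are hypothetical and not part of dask-ml.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def rmse_fraction_of_range(actual, predicted):
    """RMSE divided by the spread (max - min) of the actual values."""
    return rmse(actual, predicted) / (max(actual) - min(actual))


# Toy values for illustration only, not taken from the dataset.
actual = [430.0, 450.0, 470.0, 490.0]
predicted = [433.0, 446.0, 470.0, 490.0]
print(rmse(actual, predicted))
print(rmse_fraction_of_range(actual, predicted))
```

Expressing the error as a fraction of the target's range makes it easier to compare across targets measured on different scales.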

In [None]:
client.close()