# Data exploration 🧐

The first step in any ML use case is to explore and get a feel for the data. 
In this notebook we therefore load, inspect and visualize the data.

We are using a [publicly available dataset](https://www.kaggle.com/datasets/berkerisen/wind-turbine-scada-dataset) from a wind turbine in Turkey. It contains several measurements over time, such as the generated power.

In [None]:
# TODO PREP: open in colab
import pandas as pd
import plotly.express as px  # the standalone plotly_express package is deprecated

In [17]:
# TODO PREP: store as CSV and update dependencies
# url = "https://raw.githubusercontent.com/ykerus/experiment-tracking-with-mlflow/main/data/turbine-data.csv"
location = "../data/turbine-data.csv"
data = pd.read_csv(location).set_index("timestamp")
data.index = pd.to_datetime(data.index)
data

| timestamp | active_power | wind_speed | wind_direction | is_curtailed |
|---|---|---|---|---|
| 2018-01-01 00:00:00 | 380.047791 | 5.311336 | 259.994904 | False |
| 2018-01-01 00:10:00 | 453.769196 | 5.672167 | 268.641113 | False |
| 2018-01-01 00:20:00 | 306.376587 | 5.216037 | 272.564789 | False |
| 2018-01-01 00:30:00 | 419.645905 | 5.659674 | 271.258087 | False |
| 2018-01-01 00:40:00 | 380.650696 | 5.577941 | 265.674286 | False |
| ... | ... | ... | ... | ... |
| 2018-12-31 23:10:00 | 2963.980957 | 11.404030 | 80.502724 | False |
| 2018-12-31 23:20:00 | 1684.353027 | 7.332648 | 84.062599 | False |
| 2018-12-31 23:30:00 | 2201.106934 | 8.435358 | 84.742500 | False |
| 2018-12-31 23:40:00 | 2515.694092 | 9.421366 | 84.297913 | False |


It seems we have some columns specifying **wind speed** and **direction**, and a column specifying **how much power was generated** for those values. We also have a column indicating whether the turbine was **curtailed**. What would that mean?
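Before plotting, a quick numeric check can help with questions like the one above. The sketch below counts missing values, summarizes the numeric columns, and computes the share of curtailed rows. The `sample` DataFrame is a small made-up stand-in so that the snippet runs on its own; in the notebook you would apply the same calls to `data`.

```python
import pandas as pd

# Hypothetical stand-in for `data`, with the same column names.
# The values are made up for illustration only.
sample = pd.DataFrame(
    {
        "active_power": [380.0, 453.8, 0.0, 419.6],
        "wind_speed": [5.3, 5.7, 5.2, 5.7],
        "wind_direction": [260.0, 268.6, 272.6, 271.3],
        "is_curtailed": [False, False, True, False],
    },
    index=pd.date_range("2018-01-01", periods=4, freq="10min"),
)

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Count, mean, min, max and quartiles of the numeric columns
numeric_summary = sample.describe()

# Fraction of rows flagged as curtailed (the mean of a boolean column)
curtailed_share = sample["is_curtailed"].mean()

print(missing_per_column)
print(numeric_summary)
print(f"Share of curtailed rows: {curtailed_share:.2%}")
```

Comparing `active_power` between curtailed and non-curtailed rows, for example with `sample.groupby("is_curtailed")["active_power"].mean()`, is one possible next step.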

Let's plot some attributes over time to get a better feel for the data.

In [18]:
# Tip: you can select a region on the graph to zoom in on it
px.line(data, y="active_power", title="Generated power over time")

In [19]:
# Tip: you can select a region on the graph to zoom in on it
fig = px.line(data, y="wind_speed", title="Wind speed over time")
fig.update_traces(line_color="orange")
fig
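
With one measurement every 10 minutes, a full year of data can look noisy in a line plot. One option, shown here as a sketch rather than as part of the original analysis, is to resample to daily means before plotting. The series below is a made-up stand-in for `data["wind_speed"]`.

```python
import pandas as pd

# Hypothetical 10-minute series covering two days (144 samples per day).
index = pd.date_range("2018-01-01", periods=288, freq="10min")
values = [4.0 + 0.01 * i for i in range(288)]  # made-up, slowly rising wind speed
wind_speed = pd.Series(values, index=index, name="wind_speed")

# Aggregate the 10-minute samples into one mean value per calendar day
daily_mean = wind_speed.resample("D").mean()
print(daily_mean)
```

In the notebook, the same `resample("D").mean()` call on `data["wind_speed"]` should work because the index was converted with `pd.to_datetime`. The result can then be plotted with, for example, `px.line(daily_mean.to_frame(), y="wind_speed")`.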

In [20]:
fig = px.scatter(
    data,
    x="wind_speed",
    y="active_power",
    # color="is_curtailed",
    title="Relation between wind speed and power generated",
)
fig.update_layout(xaxis_title="Wind speed (m/s)", yaxis_title="Power (kW)")
fig

Generated power seems to go up the harder the wind blows! <br>
That makes sense for a wind turbine...
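
To put a number on that visual impression, one approach is to bin the wind speed and take the mean power per bin, which gives a rough empirical power curve. This is only a sketch: the `sample` values and the bin edges are made up for illustration, and in the notebook you would use `data` instead.

```python
import pandas as pd

# Hypothetical stand-in for `data`; values are made up for illustration.
sample = pd.DataFrame(
    {
        "wind_speed": [2.0, 3.5, 5.3, 7.3, 9.4, 11.4, 13.0, 15.0],
        "active_power": [0.0, 50.0, 380.0, 1684.0, 2515.0, 2964.0, 3600.0, 3600.0],
    }
)

# Assumed bin edges in m/s; choose edges that suit the real data range
bin_edges = [0, 4, 8, 12, 16]
sample["speed_bin"] = pd.cut(sample["wind_speed"], bins=bin_edges)

# Mean generated power within each wind speed bin
power_curve = sample.groupby("speed_bin", observed=True)["active_power"].mean()
print(power_curve)
```

If the mean power rises from bin to bin, that supports the trend seen in the scatter plot.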

Let's see if we can use this to train a model that predicts the generated power from the available information about the conditions, such as the wind speed.
We'll do this in the next exercise.
