# Machine Learning Analysis in the Earth Systems Sciences


In this module, you are tasked with planning, implementing, and evaluating a machine learning solution for a real-world scenario. Given pre-configured code blocks and prepared data, you will create a problem statement, explore the data, experiment with model development, and ultimately make a recommendation on the utility of machine learning for your scenario.
To get started, first run the cell below to prepare this notebook. While that process runs, watch the following video to learn more about this scenario.

# Missing weather station in western North Carolina

Play the video below to learn about the situation.

`<video>`

`link to transcript`

## Part 1: Problem Framing

Based on the information provided in the video, which type of machine learning analysis is most appropriate for this scenario?

TK Narrative scaffolding to this:

1. Does a simpler solution exist?
2. Can machine learning requirements be met?
3. Which scientific question should be answered?

Because creating a machine learning model involves many steps and iterations, it's important to keep records of all findings and intermediate results. Your Machine Learning Model Handbook is a separate document where you will record your findings.

<div class="alert alert-success" role="alert">
<p class="admonition-title" style="font-weight:bold">Exercise</p>
    <p>In your <b>Machine Learning Model Handbook</b>, type the scientific question to be answered for this situation.</p>
    <p>GENERAL RUBRIC TBD</p>
</div>

## Part 2: Data Handling
### Locate Data of Interest

You will use data from other stations in the <a href="https://econet.climate.ncsu.edu/" target="_blank">NC ECONet</a> for this project. Your colleague is a <a href="https://www.mongodb.com/resources/basics/data-engineering#what-is-data-engineering" target="_blank">data engineer</a> who has done much of the data preparation for you. They have prepared the following document to describe the dataset they are providing for your model-building work.

### Metadata Document for Western North Carolina Weather Station Data

#### General Information

Dataset Name: Western NC Weather Station Time-Series Data

Description: This dataset contains tabular time-series data collected from multiple weather stations in Western North Carolina. The data includes atmospheric and environmental variables recorded at hourly intervals.

Date Range: January 1, 2015, to December 16, 2024

Geographic Coverage: Western North Carolina 

Data Frequency: Hourly

Last Updated: Jan 1, 2025

#### Data Structure

File Format: .parquet

Number of Records: 69,760 per station per feature

Columns (Features) per Station (XXXX):

- observation_datetime_station_XXXX: Date and time of observation in <UTC?>
- airtemp_degF_XXXX_station_XXXX (°F): Air temperature measured at 2 meters above ground level
- windspeed_avg_mph_XXXX_station_XXXX (mph): Average wind speed during the hour at <2? 6? 10?> meters above ground level
- winddgust_mph_XXXX_station_XXXX (mph): <Peak wind gust during the hour at <2?> meters above ground level>
- rhavg_percent_XXXX_station_XXXX (%): Relative humidity
- precip_in_XXXX_station_XXXX (in): Total precipitation accumulated in the hour at <1? 2?> meters above ground level <need snow equivalent info>
- date_station_XXXX: <>
- day_index: <>
- hour_index: <>

Stations:

- BEAR (Bearwallow Mountain)
- BURN (Burnsville Tower)
- FRYI (Frying Pan Mountain)
- JEFF (Mount Jefferson Tower)
- **MITC (Mount Mitchell State Park) - target station**
- NCAT (North Carolina A&T University Research Farm)
- SALI (Piedmont Research Station)
- SASS (Sassafras Mountain)
- UNCA (University of North Carolina - Asheville Weather Tower)
- WINE (Wayah Bald Mountain)

#### Data Quality

Missing Data: Timestamps with no recorded data are marked as <>. <Other info about handling missing data>

Outlier Handling: <outside range handling>

#### Data Provenance

Source: North Carolina State Climate Office ECONet, <a href="https://econet.climate.ncsu.edu/about/" target="_blank">https://econet.climate.ncsu.edu/about/</a>

#### Data Transformations

Time Normalization: <?>

Unit Conversion: <?>

Aggregations: <?>

### Explore Data

While your data engineer colleague prepared the data for your model and created the metadata document, you will still need to familiarize yourself with the data before you use it as input to a machine learning algorithm. In this step, you will take a closer look at the potential features for your model with a few tables and plots. 

First, let's read the data into this notebook.

In [70]:
# Import the Python library that can interpret the data
import pandas as pd

# Path to the prepared data file
file_path = "processed_data/NC_processed_data_1_2.parquet"
df = pd.read_parquet(file_path) 
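
Once the data are loaded, a quick tabular summary can help you check them against the metadata document. The sketch below is one possible approach, not part of the prepared notebook: `summarize_columns` is a hypothetical helper, and `example_df` is a tiny made-up stand-in (with invented values) used only so the example runs on its own.

```python
import numpy as np
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: its dtype, missing-value count, and missing fraction."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "fraction_missing": frame.isna().mean(),
    })


# Hypothetical stand-in for the real data: three hourly records, one missing value.
example_df = pd.DataFrame({
    "observation_datetime_station_MITC": pd.date_range(
        "2015-01-01", periods=3, freq=pd.Timedelta(hours=1)
    ),
    "airtemp_degF_MITC_station_MITC": [28.4, np.nan, 27.9],
})

summary = summarize_columns(example_df)
print(example_df.shape)  # (number of records, number of columns)
print(summary)
```

In this notebook, calling `summarize_columns(df)` on the loaded data would give the same kind of table for every station and variable.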

Our ***target features*** (the features that we are trying to predict with the machine learning model) are temperature, relative humidity, wind speed, wind gust, and precipitation at the Mt. Mitchell station. All other features are possible ***input features*** to the model. 
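
The split between target and input features can be sketched from the column names alone. The example below assumes the naming pattern given in the metadata document (`<variable>_<station>_station_<station>`). `split_feature_columns` is a hypothetical helper, and the short `example_columns` list stands in for `df.columns`. This sketch considers only the five measured variables, leaving the timestamp and index columns out of both groups.

```python
# Prefixes of the five measured variables listed in the metadata document
MEASURED_PREFIXES = (
    "airtemp_degF",
    "windspeed_avg_mph",
    "winddgust_mph",
    "rhavg_percent",
    "precip_in",
)


def split_feature_columns(columns, target_station="MITC"):
    """Split measured-variable columns into (target, input) lists by station suffix."""
    suffix = f"_station_{target_station}"
    measured = [name for name in columns if name.startswith(MEASURED_PREFIXES)]
    targets = [name for name in measured if name.endswith(suffix)]
    inputs = [name for name in measured if not name.endswith(suffix)]
    return targets, inputs


# Hypothetical subset of column names, standing in for df.columns
example_columns = [
    "observation_datetime_station_MITC",
    "airtemp_degF_MITC_station_MITC",
    "precip_in_MITC_station_MITC",
    "airtemp_degF_BURN_station_BURN",
    "rhavg_percent_SASS_station_SASS",
    "day_index",
]

targets, inputs = split_feature_columns(example_columns)
print("Target features:", targets)
print("Input features:", inputs)
```

Whether the time-related columns (`day_index`, `hour_index`) should also serve as input features is a modeling decision this sketch leaves open.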

Let's now explore just the target features at Mt. Mitchell. 

<div class="alert alert-info" role="alert">
<p class="admonition-title" style="font-weight:bold">Note</p>
	<p>To save a plot, hold Shift on your keyboard and right-click the plot. Then select <b>Copy image</b>. You may paste the image into your Machine Learning Model Handbook as needed.</p>
</div>

In [102]:
import ipywidgets as widgets
from IPython.display import display, clear_output
import matplotlib.pyplot as plt

# Variable dropdown
var_dropdown = widgets.Dropdown(
    options=[
        ('Temperature (F)', 'airtemp_degF_MITC_station_MITC'),
        ('Average Wind Speed (mph)', 'windspeed_avg_mph_MITC_station_MITC'),
        ('Wind Gust (mph)', 'winddgust_mph_MITC_station_MITC'),
        ('Relative Humidity (%)', 'rhavg_percent_MITC_station_MITC'),
        ('Precipitation (in)', 'precip_in_MITC_station_MITC')
    ],
    description='Variable:',
    disabled=False
)

# Plot type dropdown
plot_dropdown = widgets.Dropdown(
    options=['Histogram', 'Time Series'],
    description='Plot type:',
    disabled=False
)

# Button for plotting
plot_button = widgets.Button(description="Plot", button_style="success")

# Output widget to render plots
output = widgets.Output()

# Display widgets and output
display(widgets.HTML(value="<h3>Mt. Mitchell</h3>"), var_dropdown, plot_dropdown, plot_button, output)

# Button click event handler
def on_plot_button_click(b):
    # Retrieve current selections: the column name in df and its display label
    var_value = var_dropdown.value
    var_label = var_dropdown.label

    # Draw inside the output widget, replacing any previous plot
    with output:
        clear_output(wait=True)

        # Generate the selected plot
        fig, ax = plt.subplots(1, 1, tight_layout=True)
        if plot_dropdown.value == 'Histogram':
            # Drop missing values so the histogram bins can be computed
            ax.hist(df[var_value].dropna(), bins=30, color='skyblue', edgecolor='black')
            ax.set_title(f"Histogram of {var_label} at Mt. Mitchell (MITC)", fontsize=14)
            ax.set_xlabel(var_label)
            ax.set_ylabel("Number of records")
        elif plot_dropdown.value == 'Time Series':
            # Plot every 100th record to thin out the hourly series
            ax.plot(df['observation_datetime_station_MITC'][::100], df[var_value][::100], label=var_label, color='orange')
            ax.set_title(f"Time Series of {var_label} at Mt. Mitchell (MITC)", fontsize=14)
            ax.set_xlabel("Date")
            ax.set_ylabel(var_label)
        plt.show()

# Attach the event handler to the button
plot_button.on_click(on_plot_button_click)


<div class="alert alert-info" role="alert">
<p class="admonition-title" style="font-weight:bold">Note</p>
	<p>The machine learning algorithm will treat each station + environmental variable pair as a unique feature, i.e., <code>airtemp_degF_BURN_station_BURN</code> is a different feature from <code>airtemp_degF_SASS_station_SASS</code>.</p>
</div>
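
To make the note above concrete, the sketch below lists every station and variable pairing for the nine input stations. It assumes the naming pattern `<variable>_<station>_station_<station>` from the metadata document. `feature_name` is a hypothetical helper introduced only for illustration.

```python
from itertools import product

# Input stations (all stations except the target, MITC) and measured variables
INPUT_STATIONS = ["BEAR", "BURN", "FRYI", "JEFF", "NCAT", "SALI", "SASS", "UNCA", "WINE"]
VARIABLES = ["airtemp_degF", "windspeed_avg_mph", "winddgust_mph", "rhavg_percent", "precip_in"]


def feature_name(variable, station):
    """Build a column name using the assumed <variable>_<station>_station_<station> pattern."""
    return f"{variable}_{station}_station_{station}"


# One feature per (station, variable) pair
input_features = [feature_name(v, s) for s, v in product(INPUT_STATIONS, VARIABLES)]

print(len(input_features))  # 9 stations x 5 variables = 45
print(input_features[:5])   # the five BEAR features
```

Under these assumptions, the nine input stations contribute 45 candidate measured input features, each of which the algorithm treats separately.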

In [100]:
import ipywidgets as widgets
from IPython.display import display, clear_output
import matplotlib.pyplot as plt

# The widget names below are distinct from those in the Mt. Mitchell cell,
# so running this cell does not break the earlier plot button.

# Variable dropdown: (label shown to the user, column-name prefix)
station_var_dropdown = widgets.Dropdown(
    options=[
        ('Temperature (F)', 'airtemp_degF'),
        ('Average Wind Speed (mph)', 'windspeed_avg_mph'),
        ('Wind Gust (mph)', 'winddgust_mph'),
        ('Relative Humidity (%)', 'rhavg_percent'),
        ('Precipitation (in)', 'precip_in')
    ],
    description='Variable:',
    disabled=False
)

# Plot type dropdown
station_plot_dropdown = widgets.Dropdown(
    options=['Histogram', 'Time Series'],
    description='Plot type:',
    disabled=False
)

# Station dropdown
station_dropdown = widgets.Dropdown(
    options=['BEAR', 'BURN', 'FRYI', 'JEFF', 'NCAT', 'SALI', 'SASS', 'UNCA', 'WINE'],
    description='Station:',
    disabled=False
)

# Button for plotting
station_plot_button = widgets.Button(description="Plot", button_style="success")

# Output widget to render plots
station_output = widgets.Output()

# Display widgets and output
display(widgets.HTML(value="<h3>Input Stations</h3>"), station_var_dropdown, station_plot_dropdown, station_dropdown, station_plot_button, station_output)

# Button click event handler
def on_station_plot_button_click(b):
    # Retrieve current selections and build the column name for this station
    station = station_dropdown.value
    var_label = station_var_dropdown.label
    selected_var = f"{station_var_dropdown.value}_{station}_station_{station}"

    # Use the station's own timestamp column if present; otherwise fall back to Mt. Mitchell's
    time_col = f"observation_datetime_station_{station}"
    if time_col not in df.columns:
        time_col = 'observation_datetime_station_MITC'

    # Draw inside the output widget, replacing any previous plot
    with station_output:
        clear_output(wait=True)

        # Generate the selected plot
        fig, ax = plt.subplots(1, 1, tight_layout=True)
        if station_plot_dropdown.value == 'Histogram':
            # Drop missing values so the histogram bins can be computed
            ax.hist(df[selected_var].dropna(), bins=30, color='skyblue', edgecolor='black')
            ax.set_title(f"Histogram of {var_label} at {station}", fontsize=14)
            ax.set_xlabel(var_label)
            ax.set_ylabel("Number of records")
        elif station_plot_dropdown.value == 'Time Series':
            # Plot every 100th record to thin out the hourly series
            ax.plot(df[time_col][::100], df[selected_var][::100], label=var_label, color='orange')
            ax.set_title(f"Time Series of {var_label} at {station}", fontsize=14)
            ax.set_xlabel("Date")
            ax.set_ylabel(var_label)
        plt.show()

# Attach the event handler to the button
station_plot_button.on_click(on_station_plot_button_click)

