# commodAI: Commodity Market Analysis

**Objectives:**

1.  **Primary Goal:** Develop a Python-based (Jupyter Notebook) system for analyzing a commodity market dataset (2000-2023 closing ticks for 18 commodities) to identify and cluster periods of market instability.
2.  **Anomaly Detection:** Utilize machine learning (and potentially deep learning) techniques to detect anomalies within the time series data of each commodity.  Anomalies should be tagged with confidence levels.  Focus on identifying regions, not just individual points.
3.  **Clustering:** Perform ensemble clustering on the identified anomaly regions.  Provide statistical justification for the chosen clustering methods and the resulting clusters.  Analyze and explain the characteristics of each cluster.
4.  **Visualization:** Create a modern, interactive dashboard (within the Jupyter Notebook environment) to visualize the time series data, detected anomalies, clustering results, and statistical analyses.
5.  **Self-Contained Execution:** The entire project MUST be executable on a single machine (MacBook M1 Pro, Sonoma 14.7.3, Anaconda environment, Python 3.9.7) without relying on external web services or APIs (beyond standard package installations).
6.  **Reproducibility:** The Jupyter Notebook must provide all the setup needed to replicate the environment.
7.  **Documentation:** The Jupyter Notebook must contain a balanced mix of code and markdown cells that document the entire process.

## Environment Setup

This section describes the steps to set up the Anaconda environment and install the necessary packages.  It is crucial to follow these steps to ensure the notebook can be executed successfully and reproducibly.

We will create a dedicated Anaconda environment named `commodAI_env` with Python 3.9.7.  This isolates the project's dependencies and avoids conflicts with other Python projects.

**Steps:**

1.  **Create the Anaconda environment:**
    *   Open your terminal.
    *   Run the following command:
    ```bash
    conda create -n commodAI_env python=3.9.7
    ```
    This command creates an environment named `commodAI_env` with Python version 3.9.7.

2.  **Activate the environment:**
    *   In the terminal, run:
    ```bash
    conda activate commodAI_env
    ```
    This activates the `commodAI_env` environment, making it the active Python environment for your terminal.

3.  **Install the required packages:**
    *   You can install the packages individually using `pip install`, or you can use the provided `requirements.txt` file.
    *   To install using `requirements.txt`, first ensure the file is in the same directory as your Jupyter Notebook.
    *   Then, in the terminal (with the environment activated), run:
    ```bash
    pip install -r requirements.txt
    ```
    This command installs all the packages listed in `requirements.txt` along with their dependencies.

**requirements.txt:**
```
pandas==1.5.3
scikit-learn==1.2.2
plotly==5.14.1
tensorflow==2.12.0
ipywidgets==8.0.4
scipy==1.10.1
```

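These are the core packages required for this project.  Specific versions are listed to ensure reproducibility.  NumPy, which the notebook also imports, is installed as a dependency of pandas.  You can create this file manually, or generate it from the activated environment with `pip freeze > requirements.txt` after installing the packages.  Avoid `conda list -e` for this purpose: it writes a conda-specific format that `pip install -r` cannot read.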
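
To confirm that the active environment matches the pinned versions, a check along the following lines can be run in the first notebook cell. This is a minimal sketch: the helper name `check_versions` is hypothetical, and the `PINNED` dictionary simply repeats the entries of `requirements.txt`.

```python
from importlib import metadata

# Pinned versions copied from requirements.txt.
PINNED = {
    "pandas": "1.5.3",
    "scikit-learn": "1.2.2",
    "plotly": "5.14.1",
    "tensorflow": "2.12.0",
    "ipywidgets": "8.0.4",
    "scipy": "1.10.1",
}

def check_versions(pinned):
    """Return {package: (expected, installed)} for every mismatch.

    `installed` is None when the package is not installed at all.
    (Hypothetical helper, not part of the original notebook.)
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

# Print one line per package whose installed version differs from the pin.
for name, (expected, installed) in check_versions(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

If the loop prints nothing, every pinned package is installed at the expected version.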


## Data Loading and Preprocessing

This section loads the commodity market data from the CSV file (`database_v3.csv`, the name used in the code below) and performs initial data preprocessing.

The dataset contains daily closing prices for 18 different commodities from 2000 to 2023.  The goal of preprocessing is to clean the data, handle missing values, and prepare it for anomaly detection and clustering.


In [None]:
# Imports used in this cell
import numpy as np
import pandas as pd
from scipy import stats

# Load the CSV data into a pandas DataFrame (the file is semicolon-delimited)
df = pd.read_csv("database_v3.csv", sep=';')

# Ensure the 'date' column is in the correct datetime format
df['date'] = pd.to_datetime(df['date'])

# Display the first few rows of the DataFrame
print("First few rows of the DataFrame:")
print(df.head())

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# Handle missing values by carrying the last observed value forward.
# Gaps before a column's first observation stay missing and show up in the counts below.
df = df.ffill()
print("\nMissing values after forward fill:")
print(df.isnull().sum())

# Basic check for outliers using z-score (example for the 'Crude oil, Brent-Europe' column;
# adjust the name if the CSV header differs)
commodity_column = "Crude oil, Brent-Europe"
# nan_policy='omit' keeps any remaining NaN from turning every z-score into NaN
z_scores = np.abs(stats.zscore(df[commodity_column], nan_policy='omit'))
outlier_threshold = 3 # Z-score threshold for outlier detection
outliers = df[z_scores > outlier_threshold]

print(f"\nNumber of outliers in '{commodity_column}' column (Z-score > {outlier_threshold}): {len(outliers)}")
if not outliers.empty:
    print("Outlier indices:", outliers.index.tolist())
    print("First few outlier values:")
    print(outliers.head())
else:
    print(f"No outliers detected in '{commodity_column}' column based on Z-score threshold.")

# Further data consistency checks can be added here if needed


## Anomaly Detection

This section focuses on detecting anomalies in the commodity time series data. We use the Isolation Forest algorithm, an unsupervised method that needs no labelled anomalies. Here it is fitted separately to each commodity's price series, so each model sees a single feature (the closing price) and flags unusual price levels.

**Isolation Forest:**
Isolation Forest isolates anomalies by randomly partitioning the data. Anomalies are easier to isolate and therefore require fewer partitions. The `contamination` parameter specifies the expected proportion of outliers in the dataset and sets the threshold at which a point is flagged: a higher value flags more data points as anomalies. The code in this section uses `contamination=0.05`, which flags roughly 5% of each series, so the value should be revisited against what is known about the data.
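
The effect of `contamination` can be seen on a small synthetic series. The sketch below is illustrative only: the data are generated, not taken from the commodity dataset, and the helper names `flagged_fraction` and `anomaly_confidence` are hypothetical. The second helper rescales `decision_function` scores to the range 0 to 1 as one possible way to attach a confidence level to each point; this min-max scaling is a heuristic, not a calibrated probability.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic series: 500 "normal" values followed by 10 injected extremes.
values = np.concatenate([rng.normal(100.0, 5.0, 500), rng.normal(200.0, 5.0, 10)])
X = values.reshape(-1, 1)

def flagged_fraction(X, contamination):
    """Fraction of points predicted as anomalies (-1) for a given contamination."""
    model = IsolationForest(contamination=contamination, random_state=42)
    predictions = model.fit_predict(X)
    return float(np.mean(predictions == -1))

def anomaly_confidence(X, contamination=0.05):
    """Rescale decision_function scores to [0, 1], where 1 is the most anomalous point.

    Heuristic min-max scaling (an assumption of this sketch), not a probability.
    """
    model = IsolationForest(contamination=contamination, random_state=42).fit(X)
    scores = model.decision_function(X)  # lower score = more anomalous
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores.max() - scores) / span

# A higher contamination value flags a larger share of the series.
for c in (0.01, 0.05, 0.10):
    print(f"contamination={c}: flagged fraction={flagged_fraction(X, c):.3f}")

confidence = anomaly_confidence(X)
```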



In [None]:
# Import IsolationForest
from sklearn.ensemble import IsolationForest

# Define a function to detect anomalies for each commodity
def detect_anomalies(df, commodity, contamination=0.05):
    """
    Detects anomalies in a given commodity's time series data using Isolation Forest.
    Args:
        df (pd.DataFrame): The DataFrame containing the commodity data.
        commodity (str): The name of the commodity column.
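        contamination (float): The expected proportion of outliers in the data set.
    Returns:
        tuple: (scores, anomaly_regions). scores are decision_function values (lower means
            more anomalous); anomaly_regions is a list of inclusive (start, end) row positions.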
    """
    # Create a copy of the DataFrame to avoid modifying the original
    df_copy = df.copy()
    # Train Isolation Forest model
    model = IsolationForest(contamination=contamination, random_state=42)
    model.fit(df_copy[[commodity]])
    # Get anomaly scores
    scores = model.decision_function(df_copy[[commodity]])
    # Get anomaly predictions
    predictions = model.predict(df_copy[[commodity]])
    # Identify anomaly regions
    anomaly_regions = []
    start_index = None
    # Iterate through the predictions to identify anomaly regions
    for i, prediction in enumerate(predictions):
        # If the current prediction is an anomaly and we haven't started an anomaly region yet
        if prediction == -1 and start_index is None:
            # Start a new anomaly region
            start_index = i
        # If the current prediction is not an anomaly and we have started an anomaly region
        elif prediction == 1 and start_index is not None:
            # End the anomaly region
            anomaly_regions.append((start_index, i - 1))
            # Reset the start index
            start_index = None
    # If we started an anomaly region but didn't end it
    if start_index is not None:
        # End the anomaly region at the end of the data
        anomaly_regions.append((start_index, len(predictions) - 1))
    return scores, anomaly_regions

# Apply anomaly detection to each commodity
commodity_columns = df.columns[1:]  # Exclude the 'date' column
anomaly_scores = {}
anomaly_regions = {}
for commodity in commodity_columns:
    scores, regions = detect_anomalies(df, commodity)
    anomaly_scores[commodity] = scores
    anomaly_regions[commodity] = regions
    print(f"\nAnomaly regions for '{commodity}': {regions}")
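
Because the objectives call for regions with confidence levels instead of isolated points, the `(start, end)` regions can be post-processed before clustering. The following is a sketch under stated assumptions: `merge_regions`, `summarize_regions`, and the `max_gap` parameter are hypothetical names, and the toy scores stand in for the `decision_function` output.

```python
import numpy as np

def merge_regions(regions, max_gap=2):
    """Merge inclusive (start, end) regions separated by at most `max_gap` normal points."""
    merged = []
    for start, end in sorted(regions):
        # Number of normal points between the previous region and this one
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def summarize_regions(scores, regions):
    """Describe each region by its length and its mean and minimum anomaly score.

    Scores follow the decision_function convention: lower means more anomalous.
    """
    scores = np.asarray(scores, dtype=float)
    summary = []
    for start, end in regions:
        window = scores[start:end + 1]
        summary.append({
            "start": start,
            "end": end,
            "length": end - start + 1,
            "mean_score": float(window.mean()),
            "min_score": float(window.min()),
        })
    return summary

# Toy example: three flagged regions, the first two separated by one normal point.
toy_scores = [0.1, -0.2, -0.3, 0.05, -0.1, 0.2, 0.2, 0.2, -0.4, 0.1]
toy_regions = [(1, 2), (4, 4), (8, 8)]
merged = merge_regions(toy_regions, max_gap=1)
summary = summarize_regions(toy_scores, merged)
print(merged)
```

With `max_gap=1`, the first two toy regions are joined into one, while the isolated point at position 8 stays separate. The per-region mean or minimum score can then serve as a simple region-level confidence.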

## Ensemble Clustering

This section clusters the identified anomaly regions with three algorithms: k-means, DBSCAN, and hierarchical (agglomerative) clustering.

**Why Ensemble Clustering?**
Ensemble clustering combines the results of multiple clustering algorithms to obtain a more robust and stable clustering solution, because different algorithms have different strengths and weaknesses. Note that the code in this section stops short of a full ensemble: it runs the three algorithms on the same data and evaluates each one independently, without merging their labels into a consensus clustering.

Each clustering is scored with the Silhouette score (range -1 to 1, higher is better) and the Davies-Bouldin index (lower is better). The composition of the k-means clusters is then listed by commodity.

**Clustering Methods:**
*   **K-means:** A centroid-based algorithm that partitions n observations into k clusters, assigning each observation to the cluster with the nearest centroid (the cluster mean).
*   **DBSCAN:** A density-based algorithm that groups closely packed points and marks points lying alone in low-density regions as noise (label `-1`).
*   **Hierarchical Clustering:** An algorithm that builds a hierarchy of clusters. The agglomerative variant used here starts with each data point in its own cluster and repeatedly merges the closest pair of clusters, stopping once the requested number of clusters is reached.
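
One common way to combine several clusterings into a single consensus is a co-association matrix: for every pair of samples, count the fraction of clusterings that place them in the same cluster, then cluster that similarity matrix. The sketch below illustrates the idea on toy labels. It is not the notebook's method; `co_association` and `consensus_labels` are hypothetical helper names, and average linkage is one choice among several.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def co_association(label_sets):
    """Fraction of clusterings in which each pair of samples shares a cluster.

    Samples labelled -1 (DBSCAN noise) are never counted as co-clustered.
    """
    label_sets = [np.asarray(labels) for labels in label_sets]
    n = len(label_sets[0])
    matrix = np.zeros((n, n))
    for labels in label_sets:
        same = labels[:, None] == labels[None, :]
        not_noise = labels != -1
        same &= not_noise[:, None] & not_noise[None, :]
        matrix += same
    matrix /= len(label_sets)
    np.fill_diagonal(matrix, 1.0)  # every sample always agrees with itself
    return matrix

def consensus_labels(label_sets, n_clusters):
    """Average-linkage clustering of the distance 1 - co-association.

    Returns labels starting at 1 (SciPy convention); at most n_clusters clusters.
    """
    distance = 1.0 - co_association(label_sets)
    tree = linkage(squareform(distance), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Toy labels for six samples from three hypothetical clusterings.
labels_a = [0, 0, 0, 1, 1, 1]
labels_b = [1, 1, 1, 0, 0, 0]   # same partition, different label names
labels_c = [0, 0, -1, 1, 1, 1]  # DBSCAN-style output with one noise point
consensus = consensus_labels([labels_a, labels_b, labels_c], n_clusters=2)
print(consensus)
```

The consensus depends only on which samples are grouped together, so the arbitrary label names used by each algorithm do not matter.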

In [None]:
# Import clustering algorithms
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
# Import metrics for evaluation
from sklearn.metrics import silhouette_score, davies_bouldin_score
# Prepare data for clustering
anomaly_data = []
for commodity in commodity_columns:
    regions = anomaly_regions[commodity]
    for start, end in regions:
        # Extract the data for the anomaly region
        region_data = df[commodity][start:end+1].values
        anomaly_data.append(region_data)
# Pad the sequences to have the same length
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_anomaly_data = pad_sequences(anomaly_data, padding='post', dtype='float64')
# Define the clustering algorithms
kmeans = KMeans(n_clusters=3, random_state=42, n_init = 'auto')
dbscan = DBSCAN(eps=0.5, min_samples=5)
hierarchical = AgglomerativeClustering(n_clusters=3)
# Fit the clustering algorithms
kmeans_labels = kmeans.fit_predict(padded_anomaly_data)
dbscan_labels = dbscan.fit_predict(padded_anomaly_data)
hierarchical_labels = hierarchical.fit_predict(padded_anomaly_data)
# Evaluate the clustering algorithms
def evaluate_clustering(labels, data):
    """
    Evaluates the clustering algorithms using Silhouette score and Davies-Bouldin index.
    Args:
        labels (np.ndarray): The cluster labels.
        data (np.ndarray): The data used for clustering.
    Returns:
        tuple: A tuple containing the Silhouette score and Davies-Bouldin index.
    """
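    # Filter out noise labels (-1 for DBSCAN)
    if -1 in labels:
        data = data[labels != -1]
        labels = labels[labels != -1]
    # Both metrics need at least two clusters; return NaN rather than -1,
    # which is itself a valid Silhouette value
    if len(set(labels)) < 2:
        return float('nan'), float('nan')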
    silhouette = silhouette_score(data, labels)
    davies_bouldin = davies_bouldin_score(data, labels)
    return silhouette, davies_bouldin

kmeans_silhouette, kmeans_davies_bouldin = evaluate_clustering(kmeans_labels, padded_anomaly_data)
dbscan_silhouette, dbscan_davies_bouldin = evaluate_clustering(dbscan_labels, padded_anomaly_data)
hierarchical_silhouette, hierarchical_davies_bouldin = evaluate_clustering(hierarchical_labels, padded_anomaly_data)

print(f"K-means Silhouette score: {kmeans_silhouette}, Davies-Bouldin index: {kmeans_davies_bouldin}")
print(f"DBSCAN Silhouette score: {dbscan_silhouette}, Davies-Bouldin index: {dbscan_davies_bouldin}")
print(f"Hierarchical Silhouette score: {hierarchical_silhouette}, Davies-Bouldin index: {hierarchical_davies_bouldin}")

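# Analyze cluster characteristics
from collections import defaultdict
# anomaly_data was filled commodity by commodity, so rebuild the commodity of each region
# in that same order to pair it with its cluster label
region_commodities = [c for c in commodity_columns for _ in anomaly_regions[c]]
cluster_characteristics = defaultdict(list)
for label, commodity in zip(kmeans_labels, region_commodities):
    cluster_characteristics[int(label)].append(commodity)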

print("\nCluster characteristics:")
for label, commodities in cluster_characteristics.items():
    print(f"Cluster {label}: {commodities}")
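
Zero-padding raw prices makes the clustering sensitive to region length and to the price scale of each commodity. A possible alternative, sketched here as an assumption and not as the notebook's method, is to summarise every region with a few scale-free features and standardise them before clustering. The helper `region_features` and its feature set are hypothetical, and the log returns assume strictly positive prices.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def region_features(prices, start, end):
    """Summarise prices[start:end+1] as a fixed-length vector.

    Features: [length, total log return, log-return volatility, max drawdown].
    Assumes strictly positive prices so that the logarithm is defined.
    """
    window = np.asarray(prices[start:end + 1], dtype=float)
    log_prices = np.log(window)
    returns = np.diff(log_prices)
    total_return = log_prices[-1] - log_prices[0]
    volatility = returns.std() if returns.size else 0.0
    running_max = np.maximum.accumulate(window)
    max_drawdown = float(np.max(1.0 - window / running_max))
    return np.array([len(window), total_return, volatility, max_drawdown])

# Toy prices and inclusive (start, end) regions, standing in for one commodity.
toy_prices = [100.0, 110.0, 99.0, 120.0, 60.0, 66.0, 70.0, 68.0]
toy_regions = [(1, 4), (5, 7), (0, 0)]
features = np.vstack([region_features(toy_prices, s, e) for s, e in toy_regions])

# Standardise so that no single feature dominates distance-based clustering.
scaled = StandardScaler().fit_transform(features)
print(features)
```

The resulting matrix has one row per region and the same number of columns regardless of region length, so it could be passed to the clustering algorithms in place of the padded sequences.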

## Interactive Dashboard

This section creates an interactive dashboard to visualize the time series data, detected anomalies, and clustering results.

The dashboard allows the user to select a commodity from a dropdown menu and view the corresponding time series data, anomaly regions, and clustering results.  The anomaly regions are highlighted in red, and the clusters are color-coded to distinguish them.


In [None]:
# Import necessary libraries
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import ipywidgets as widgets
from IPython.display import display

# Create a dropdown widget for commodity selection
commodity_dropdown = widgets.Dropdown(
    options=commodity_columns,
    description='Select Commodity:'
)
# Create an output widget for the plot
plot_output = widgets.Output()

# Define a function to update the plot based on the selected commodity
def update_plot(commodity):
    with plot_output:
        plot_output.clear_output()
        # Create the plot
        fig = make_subplots(specs=[[{"secondary_y": True}]])
        # Add time series data
        fig.add_trace(
            go.Scatter(x=df['date'], y=df[commodity], mode='lines', name=commodity),
            secondary_y=False,
        )
        # Add anomaly regions
        for start, end in anomaly_regions[commodity]:
            fig.add_trace(
                go.Scatter(x=df['date'][start:end+1], y=df[commodity][start:end+1], mode='lines', name='Anomaly Region', line=dict(color='red')),
                secondary_y=False,
            )
        # Add clustering results (using the KMeans labels)
        # kmeans_labels holds one label per anomaly region, in the order the regions were
        # collected: commodity by commodity, following commodity_columns. The labels for the
        # selected commodity therefore start after the regions of all preceding commodities.
        offset = 0
        for other in commodity_columns:
            if other == commodity:
                break
            offset += len(anomaly_regions[other])
        colors = ['blue', 'green', 'purple']  # Example colors for clusters
        for i, (start, end) in enumerate(anomaly_regions[commodity]):
            cluster_label = int(kmeans_labels[offset + i])  # Cluster label of this anomaly region
            color = colors[cluster_label % len(colors)]  # Assign a color based on the cluster label
            fig.add_trace(
                go.Scatter(x=df['date'][start:end+1], y=df[commodity][start:end+1], mode='lines', name=f'Cluster {cluster_label}', line=dict(color=color)),
                secondary_y=False,
            )
        # Update layout
        fig.update_layout(
            title=f'Commodity: {commodity} with Anomaly Regions and Clustering',
            xaxis_title='Date',
            yaxis_title='Price',
            template='plotly_white'
        )
        fig.show()

# Link the dropdown to the update function
commodity_dropdown.observe(lambda change: update_plot(change.new), names=['value'])

# Display the widgets
display(commodity_dropdown)
display(plot_output)

# Initialize the plot with the first commodity
update_plot(commodity_columns[0])