## Performing basic exploratory data analysis

In [8]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("../data/hydrogen_prices.csv")
print(df.head())
print("shape: ", df.shape)
print("columns: ", df.columns)
df.info()  # prints its summary directly and returns None, so no print() is needed
print(df.describe())



         date  energy_cost  gov_policy_score  demand_index  \
0  2016-01-03      54.9671                 9        1.0435   
1  2016-01-10      48.6174                 1        1.3892   
2  2016-01-17      56.4769                10        1.1383   
3  2016-01-24      65.2303                 7        1.2572   
4  2016-01-31      47.6585                10        1.1610   

   average_temperature  hydrogen_price  
0                72.13         41.9657  
1                71.68         50.7611  
2                70.78         40.4830  
3                82.33         47.8869  
4                68.55         38.1190  
shape:  (500, 6)
columns:  Index(['date', 'energy_cost', 'gov_policy_score', 'demand_index',
       'average_temperature', 'hydrogen_price'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   date    

In [10]:
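features = ['energy_cost', 'gov_policy_score', 'demand_index', 'average_temperature']
# Standardize the feature columns (zero mean, unit variance).
# Pearson correlation is unaffected by this scaling, so the matrix below
# is the same as it would be on the raw values.
scaler = StandardScaler()
df[features] = scaler.fit_transform(df[features])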

correlation_matrix = df.drop(columns='date').corr()

print("Correlation Matrix:")
correlation_matrix

Correlation Matrix:


| | energy_cost | gov_policy_score | demand_index | average_temperature | hydrogen_price |
|---|---|---|---|---|---|
| energy_cost | 1.0 | 0.041656 | 0.028713 | 0.010008 | 0.679667 |
| gov_policy_score | 0.041656 | 1.0 | 0.06631 | -0.018739 | -0.475712 |
| demand_index | 0.028713 | 0.06631 | 1.0 | 0.009097 | 0.432929 |
| average_temperature | 0.010008 | -0.018739 | 0.009097 | 1.0 | 0.015035 |
| hydrogen_price | 0.679667 | -0.475712 | 0.432929 | 0.015035 | 1.0 |


**During exploratory data analysis (EDA), we found that `average_temperature` has almost no linear correlation with the target (r ≈ 0.015) or with any other feature (|r| < 0.02). It therefore adds little linear predictive signal for hydrogen prices and is a candidate for removal during feature selection.**
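
A minimal sketch of how this screening step might be automated, using the correlations with `hydrogen_price` reported above. The 0.1 cutoff and the helper name `weak_features` are illustrative assumptions, not part of the original analysis.

```python
# Correlations with hydrogen_price, copied from the matrix above.
target_corr = {
    "energy_cost": 0.679667,
    "gov_policy_score": -0.475712,
    "demand_index": 0.432929,
    "average_temperature": 0.015035,
}


def weak_features(corr_with_target, threshold=0.1):
    """Return features whose absolute correlation with the target is below threshold.

    The threshold is a hypothetical cutoff chosen for illustration.
    """
    return [name for name, r in corr_with_target.items() if abs(r) < threshold]


# Rank features from strongest to weakest linear association with the target.
ranked = sorted(target_corr, key=lambda name: abs(target_corr[name]), reverse=True)

print("ranked:", ranked)
print("candidates for removal:", weak_features(target_corr))
```

Because Pearson correlation only measures linear association, a feature flagged this way could still matter through a nonlinear effect, so the result is best treated as a prompt for closer inspection.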

## Data preprocessing and model comparison

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

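# Load and preprocess data
def load_data(path):
    df = pd.read_csv(path)
    # Drop rows with missing values and exact duplicate rows.
    df.dropna(inplace=True)
    df.drop_duplicates(inplace=True)
    # Exclude the date, the target, and average_temperature (dropped after EDA).
    X = df.drop(columns=["date", "hydrogen_price", "average_temperature"])
    y = df["hydrogen_price"]
    # Note: the scaler is fit on all rows, i.e. before the train/test split.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    return X_scaled, y, scaler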

# Train and evaluate multiple models
def evaluate_models(X_train, y_train, X_test, y_test):
    models = {
        "Linear Regression": LinearRegression(),
        "Decision Tree": DecisionTreeRegressor(random_state=42),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42)
    }

    results = []

    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
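        mae = mean_absolute_error(y_test, preds)
        # mean_squared_error returns the mean squared error (no square root),
        # so the value is reported as MSE rather than RMSE.
        mse = mean_squared_error(y_test, preds)
        r2 = r2_score(y_test, preds)

        results.append({
            "Model": name,
            "MAE": mae,
            "MSE": mse,
            "R2 Score": r2
        })
    # Lowest-error model first.
    return pd.DataFrame(results).sort_values(by="MSE")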
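
`load_data` fits the scaler on every row before the data is split, so statistics from the test rows influence the scaling. A common alternative is to estimate the scaling parameters from the training rows only and reuse them for the test rows. The sketch below illustrates that idea without dependencies; the helper names and the sample values are hypothetical. With scikit-learn, the equivalent would be calling `scaler.fit` on `X_train` and `scaler.transform` on `X_test`, or wrapping the scaler and model in a `sklearn.pipeline.Pipeline`.

```python
import statistics


def fit_standardizer(train_column):
    """Estimate mean and population standard deviation from training values only."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)  # population std, as StandardScaler uses
    return mean, std


def standardize(column, mean, std):
    """Apply previously fitted parameters to any column (train or test)."""
    return [(value - mean) / std for value in column]


# Hypothetical energy_cost values, already split into train and test portions.
train = [50.0, 60.0, 70.0]
test = [80.0]

mean, std = fit_standardizer(train)          # parameters come from training rows only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # test rows reuse the training parameters

print("mean:", mean, "std:", std)
print("train:", train_scaled)
print("test:", test_scaled)
```

For the three models compared here, the choice is unlikely to change the ranking, since ordinary least squares and tree-based models are largely insensitive to affine feature scaling. The habit matters more for scale-sensitive models such as regularized regression or k-nearest neighbours.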


In [12]:
X, y, scaler = load_data("../data/hydrogen_prices.csv")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results_df = evaluate_models(X_train, y_train, X_test, y_test)
print("\nModel Comparison Results:\n")
print(results_df.to_string(index=False))


Model Comparison Results:

            Model      MAE       MSE  R2 Score
Linear Regression 1.482624  3.729088  0.914728
    Random Forest 1.800342  5.643280  0.870957
    Decision Tree 2.768175 13.801297  0.684409
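
For reference, the metrics in the table above can be reproduced by hand. The sketch below uses made-up prices and predictions, not the notebook's data, and the function name `regression_metrics` is an assumption. It also makes the distinction between MSE and RMSE explicit: scikit-learn's `mean_squared_error` returns the mean squared error, and the RMSE is its square root.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R2 from paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)  # RMSE is expressed in the same units as the target
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical prices and predictions, for illustration only.
y_true = [40.0, 50.0, 45.0, 55.0]
y_pred = [42.0, 48.0, 45.0, 51.0]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the square root is monotonic, ranking models by MSE or by RMSE gives the same order. Only the scale of the reported number changes.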


### Model Analysis

- Linear Regression performed best on every metric, with the lowest MAE (1.48), the lowest squared error (3.73), and the highest R2 score (0.915). A simple linear model therefore captures most of the structure in this data.
- Random Forest came second (R2 of 0.871) and did not improve on linear regression. It may be overfitting, or it may simply gain little from its extra flexibility on a dataset this small (about 500 rows) with only three features.
- Decision Tree performed worst (R2 of 0.684, squared error of 13.80). This is consistent with an unpruned tree overfitting (low bias, high variance), although confirming that would require comparing training and test error.
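
One way to check the overfitting hypothesis is to compare training and test error for each model. The sketch below is a dependency-free illustration on synthetic data, not the notebook's dataset. A 1-nearest-neighbour memorizer stands in for an unpruned tree (both can typically fit the training set exactly), and all names are hypothetical. With the scikit-learn models used here, the analogous check would compare `model.score(X_train, y_train)` with `model.score(X_test, y_test)`.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error of paired values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def nearest_neighbor_predict(train_xs, train_ys, x):
    """Memorizing model: return the target of the closest training point."""
    idx = min(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))
    return train_ys[idx]


rng = random.Random(42)


def make_data(n):
    """Synthetic linear relationship with Gaussian noise (assumed for the demo)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]
    return xs, ys


train_x, train_y = make_data(300)
test_x, test_y = make_data(300)

slope, intercept = fit_line(train_x, train_y)
line_train = mse(train_y, [slope * x + intercept for x in train_x])
line_test = mse(test_y, [slope * x + intercept for x in test_x])

nn_train = mse(train_y, [nearest_neighbor_predict(train_x, train_y, x) for x in train_x])
nn_test = mse(test_y, [nearest_neighbor_predict(train_x, train_y, x) for x in test_x])

print(f"line:      train MSE {line_train:.3f}, test MSE {line_test:.3f}")
print(f"memorizer: train MSE {nn_train:.3f}, test MSE {nn_test:.3f}")
```

A model that overfits shows a large gap between a very low training error and a much higher test error, while a well-matched model shows similar errors on both sets.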

#### Conclusion
Linear Regression was chosen for deployment due to its strong performance, simplicity, and interpretability.
