In [1]:
import pandas as pd

file_path = '/content/sst.csv'
data = pd.read_csv(file_path)

data.head()


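| Unnamed: 0 | water_temperature | depth | month | salinity | wildlife_seen | wind_speed | cloud_cover | wave_height | oxygen_levels |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.84 | 51.1 | June | 34.8 | 2 | 20 | 79 | 2.2 | 7.19 |
| 1 | 25.32 | 34.4 | January | 36.1 | 7 | 9 | 40 | 1.8 | 6.73 |
| 2 | 28.2 | 41.9 | June | 31.5 | 4 | 4 | 1 | 2.3 | 8.48 |
| 3 | 26.41 | 50.0 | December | 33.4 | 7 | 17 | 36 | 2.0 | 6.51 |
| 4 | 26.68 | 62.0 | January | 36.1 | 3 | 4 | 86 | 1.8 | 7.14 |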


**Regression Analysis:**

1. Fit an initial linear regression model using all predictors. While this may lead to overfitting, it
serves as a starting point.

2. Evaluate the initial model using metrics like MSE and \( R^2 \). This will quantify the model's
performance.
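To make the two metrics in step 2 concrete, here is a minimal sketch that computes them by hand with NumPy. The helper name `mse_and_r2` and the toy values are illustrative assumptions, not part of the assignment data; the notebook itself relies on scikit-learn's `mean_squared_error` and `r2_score`.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return (MSE, R^2) for two equal-length sequences of numbers."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    # MSE: average squared prediction error
    mse = np.mean(residuals ** 2)
    # R^2: 1 minus (residual sum of squares / total sum of squares)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mse, r2

# Hypothetical toy temperatures, chosen only to keep the arithmetic simple
toy_true = [26.0, 27.0, 28.0, 29.0]
toy_pred = [26.5, 26.5, 28.5, 28.5]
toy_mse, toy_r2 = mse_and_r2(toy_true, toy_pred)
print(toy_mse, toy_r2)
```

MSE is expressed in squared units of the target, while \( R^2 \) is unitless, which is why the two are usually reported together.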

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder

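# One-hot encode 'month'; drop='first' omits one category to avoid perfect collinearity
encoder = OneHotEncoder(drop='first', sparse_output=False)
month_encoded = encoder.fit_transform(data[['month']])
# Reuse data's index so the concat below aligns rows correctly
month_encoded_df = pd.DataFrame(month_encoded, columns=encoder.get_feature_names_out(['month']), index=data.index)

# Note: 'Unnamed: 0' (the row index saved in the CSV) remains in X as a predictor here
X = pd.concat([data.drop(['water_temperature', 'month'], axis=1), month_encoded_df], axis=1)
y = data['water_temperature']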

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

y_pred = linear_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

mse, r2


(0.9959089175073893, 0.7180053731650053)

3. Fine-tune the model to avoid overfitting or underfitting by finding a good subset of predictors.
Justify your reasoning as to why you are removing or keeping certain variables.

In [13]:
import statsmodels.api as sm

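def backward_elimination(X, y, significance_level = 0.05):
    """Drop the least significant predictor until all p-values are <= significance_level."""
    features = X.columns.tolist()
    while len(features) > 0:
        # Refit OLS with an intercept on the remaining features
        features_with_constant = sm.add_constant(X[features])
        p_values = sm.OLS(y, features_with_constant).fit().pvalues
        # Ignore the intercept's p-value; only predictors are candidates for removal
        p_values = p_values.drop('const', errors='ignore')
        max_p_value = p_values.max()
        if max_p_value > significance_level:
            # Remove the single weakest predictor, then refit
            excluded_feature = p_values.idxmax()
            features.remove(excluded_feature)
        else:
            break
    return features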


significant_features_all = backward_elimination(X_train, y_train)

# Fit a new model using only the significant features from all original features
X_train_significant_all = X_train[significant_features_all]
X_test_significant_all = X_test[significant_features_all]

linear_model_significant_all = LinearRegression()
linear_model_significant_all.fit(X_train_significant_all, y_train)

y_pred_significant_all = linear_model_significant_all.predict(X_test_significant_all)

# Calculate the performance metrics for the model with significant features from all original features
mse_significant_all = mean_squared_error(y_test, y_pred_significant_all)
r2_significant_all = r2_score(y_test, y_pred_significant_all)

mse_significant_all, r2_significant_all, significant_features_all


(0.9947811096844658,
 0.7183247153664618,
 ['depth', 'salinity', 'wind_speed', 'month_June'])
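The full and reduced models have nearly identical \( R^2 \) (0.7180 versus 0.7183), so the case for the reduced model rests mostly on using fewer predictors. One way to express that is adjusted \( R^2 \), which penalizes each extra predictor. The sketch below is only illustrative: the row count `n_rows` and the full-model predictor count `p_full` are hypothetical placeholders, since neither is printed in this notebook, and should be replaced with the real values.

```python
def adjusted_r2(r2, n_rows, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_rows - n_predictors - 1 <= 0:
        raise ValueError("need more rows than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)

# R^2 values reported by the two linear models above
r2_full = 0.7180053731650053
r2_reduced = 0.7183247153664618

# HYPOTHETICAL sizes, for illustration only
n_rows = 100      # assumed number of evaluated rows
p_full = 19       # assumed: 8 numeric columns + 11 month dummies
p_reduced = 4     # depth, salinity, wind_speed, month_June (from the output above)

adj_full = adjusted_r2(r2_full, n_rows, p_full)
adj_reduced = adjusted_r2(r2_reduced, n_rows, p_reduced)
print(round(adj_full, 3), round(adj_reduced, 3))
```

Under these assumed sizes the reduced model scores higher once the penalty is applied, which supports keeping only the four significant predictors.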

**Regression Tree Analysis**

1. Build an initial regression tree model.

In [8]:
from sklearn.tree import DecisionTreeRegressor

tree_model = DecisionTreeRegressor(random_state=42)

tree_model.fit(X_train, y_train)

y_pred_tree = tree_model.predict(X_test)

mse_tree_initial = mean_squared_error(y_test, y_pred_tree)
r2_tree_initial = r2_score(y_test, y_pred_tree)

mse_tree_initial, r2_tree_initial


(2.0788990000000003, 0.41135345067507756)
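The low test \( R^2 \) of the default tree is typical of an unconstrained tree memorizing its training data. The following sketch illustrates that effect on synthetic data, not on `sst.csv`; the variable names and the data-generating formula are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one informative feature, two noise features, noisy target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# No depth or leaf-size limit: the tree can grow until every leaf is pure
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
train_r2_deep = deep_tree.score(X_tr, y_tr)  # .score returns R^2
test_r2_deep = deep_tree.score(X_te, y_te)

print(f"train R^2 = {train_r2_deep:.3f}, test R^2 = {test_r2_deep:.3f}")
```

A large gap between training and test \( R^2 \) is the usual sign that parameters such as `max_depth` and `min_samples_leaf` need to be constrained.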

2. Tune the model parameters like tree depth and minimum samples per leaf. This ensures that
the model is neither too complex nor too simple.
3. Evaluate the tuned model's performance using MSE and \( R^2 \).

In [10]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': range(1, 20),
    'min_samples_leaf': range(1, 20),
}
tree_model_cv = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid, cv=5, scoring='neg_mean_squared_error')

tree_model_cv.fit(X_train, y_train)

best_params = tree_model_cv.best_params_

y_pred_tree_tuned = tree_model_cv.predict(X_test)

mse_tree_tuned = mean_squared_error(y_test, y_pred_tree_tuned)
r2_tree_tuned = r2_score(y_test, y_pred_tree_tuned)

best_params, mse_tree_tuned, r2_tree_tuned


({'max_depth': 3, 'min_samples_leaf': 10},
 1.0974554042309475,
 0.6892521778409941)
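A tree limited to depth 3 is small enough to read as a set of rules. As a sketch of how one might inspect it, the example below fits a tree with the tuned parameters on synthetic data and prints it with scikit-learn's `export_text`. The feature names are borrowed from the dataset only as labels; the data itself is made up for the demonstration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data; column labels are illustrative only
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 25.0 + 0.5 * X_demo[:, 0] + rng.normal(0, 0.5, size=200)
demo_feature_names = ["depth", "salinity", "wind_speed"]

# Same hyperparameters the grid search selected above
small_tree = DecisionTreeRegressor(
    max_depth=3, min_samples_leaf=10, random_state=42
).fit(X_demo, y_demo)

# Text rendering of the split rules and leaf predictions
rules = export_text(small_tree, feature_names=demo_feature_names)
print(rules)
```

In the notebook, the same call could be applied to `tree_model_cv.best_estimator_` with `feature_names=list(X_train.columns)`.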

**Technical Report:**


**Linear Regression Model Analysis:**<br>
The best linear regression model was identified through backward elimination on p-values at a 0.05 significance level, so that only statistically significant predictors were retained. The resulting model uses four features: depth, salinity, wind speed, and the indicator variable for the month of June. On the held-out test set it achieved a Mean Squared Error (MSE) of approximately 0.995 and a Coefficient of Determination (\( R^2 \)) of 0.718, essentially matching the full model (MSE 0.996, \( R^2 \) 0.718) with only four predictors. The reduced model therefore accounts for approximately 71.8% of the variance in sea surface temperature in the test data.

**Regression Tree Model Analysis:**<br>
For the regression tree analysis, tree depth and minimum samples per leaf were tuned with a 5-fold cross-validated grid search. The optimal model had a maximum depth of 3 and required at least 10 samples per leaf node. The tuned regression tree achieved an MSE of approximately 1.097 and an \( R^2 \) of 0.689 on the test set. This is a substantial improvement over the initial, unconstrained tree (MSE 2.079, \( R^2 \) 0.411), but it still lags behind the linear regression model.

**Best Overall Model:**<br>
Comparing the two approaches, the linear regression model with predictors selected by statistical significance is the best overall model. Its higher \( R^2 \) (0.718 versus 0.689) and lower MSE (0.995 versus 1.097) on the test set make it the more accurate predictor of sea surface temperature. Additionally, the linear model's simplicity and interpretability make it a suitable choice for both scientists and policymakers.
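The comparison can be summarized in one table. The sketch below collects the four test-set results reported in the cells above into a pandas DataFrame, adds RMSE (the square root of MSE, in the same units as `water_temperature`), and ranks the models. The model labels are my own shorthand.

```python
import pandas as pd

# Test-set metrics copied from the notebook outputs above
results = pd.DataFrame(
    {
        "model": [
            "linear (all predictors)",
            "linear (backward elimination)",
            "tree (default)",
            "tree (tuned)",
        ],
        "mse": [
            0.9959089175073893,
            0.9947811096844658,
            2.0788990000000003,
            1.0974554042309475,
        ],
        "r2": [
            0.7180053731650053,
            0.7183247153664618,
            0.41135345067507756,
            0.6892521778409941,
        ],
    }
).set_index("model")

# RMSE is easier to interpret than MSE because it shares the target's units
results["rmse"] = results["mse"] ** 0.5

# Lowest MSE first
ranked = results.sort_values("mse")
best_model = ranked.index[0]
print(ranked.round(3))
```

Note that the two linear models differ only in the third decimal place of MSE, so the ranking between them reflects parsimony more than accuracy.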

**Model Findings and Implications:**<br>
Our analysis indicates that depth, salinity, wind speed, and the month of June are statistically significant predictors of sea surface temperature in this dataset. These are associations rather than demonstrated causes, but they are useful for predicting and preparing for El Niño-related changes. The identified key factors can serve as indicators for potential temperature shifts, assisting sectors like agriculture and fisheries in planning and response. We recommend employing the refined linear regression model for ongoing sea temperature monitoring to support climate-related decision-making.

