# Machine Learning for River Prediction (Exercise Notebook)

This notebook guides you step by step through building and evaluating machine learning models on river data. Follow the instructions, complete the tasks, and fill in the code where indicated.

## 1. Introduction

In this exercise, you will:
- Obtain and merge datasets
- Create new features from the data
- Build, train, and evaluate different machine learning models
- Interpret and visualize the results

**Learning Objectives**:
- Understand how to work with data in a machine learning context.
- Learn how to combine and preprocess datasets.
- Train models such as Linear Regression, Decision Tree, and XGBoost.
- Compare model performances and interpret feature importance.

## 2. Data Acquisition and Preparation

### Task 1: Load the data
Use pandas to load the provided training and test datasets. The two files are time-based splits of the same data, so each covers a different period.


In [None]:
import pandas as pd

# Load the train and test datasets
df_train = pd.read_csv('data/train.csv', index_col=0)
df_test = pd.read_csv('data/test.csv', index_col=0)

# Tag each dataset to distinguish between them
df_train['tt'] = 'train'
df_test['tt'] = 'test'

### Task 2: Merge the datasets
Combine the training and test sets into one dataframe so that we can perform feature engineering on the entire dataset.


In [None]:
# Concatenate the train and test datasets
df = pd.concat([df_train, df_test])

# Display the first few rows to confirm the merge
df.head()
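Beyond `df.head()`, a quick sanity check can confirm that the merge kept every row and that the `tt` tag still separates the two parts. The sketch below uses small made-up frames in place of the real CSV files; the column name `flow` and its values are purely illustrative.

```python
import pandas as pd

# Toy stand-ins for the train and test CSVs (hypothetical values)
toy_train = pd.DataFrame({'flow': [1.0, 2.0, 3.0]})
toy_test = pd.DataFrame({'flow': [4.0, 5.0]})
toy_train['tt'] = 'train'
toy_test['tt'] = 'test'

toy_df = pd.concat([toy_train, toy_test])

# Row counts per tag should match the sizes of the original frames
tag_counts = toy_df['tt'].value_counts()
print(tag_counts)

# concat keeps the original index labels, so labels can repeat across the two parts
has_duplicate_labels = toy_df.index.duplicated().any()
print(has_duplicate_labels)
```

If repeated index labels would get in the way later, `pd.concat(..., ignore_index=True)` renumbers the rows, at the cost of discarding the original index.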

## 3. Feature Creation

### Task 3: Create new features
We will now create new features from the existing data that could improve the performance of our machine learning models.

**Guidelines for feature creation**:
- **Datetime Features**: Extract useful features from the timestamp, such as year, month, day, and hour.
- **Interactions**: Create new features by interacting existing ones (e.g., multiplication or ratios).
- **Aggregations**: If you have categories, compute averages, sums, or other statistics per category.


In [None]:
# Example: creating datetime-based features
# Parse the timestamp column once and reuse the result
timestamps = pd.to_datetime(df['timestamp'])
df['year'] = timestamps.dt.year
df['month'] = timestamps.dt.month
df['day'] = timestamps.dt.day
df['hour'] = timestamps.dt.hour

# Create additional features as needed
# Example: interaction features
df['feature_interaction'] = df['feature1'] * df['feature2']
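The aggregation guideline from Task 3 can be sketched in the same spirit. This is a minimal example on made-up data, assuming a categorical column; the names `station` and `flow` are hypothetical and not taken from the exercise dataset.

```python
import pandas as pd

# Hypothetical toy data: 'station' and 'flow' are illustrative column names
toy = pd.DataFrame({
    'station': ['A', 'A', 'B', 'B', 'B'],
    'flow': [10.0, 20.0, 1.0, 2.0, 3.0],
})

# transform('mean') returns one value per row, aligned with the original index
toy['station_mean_flow'] = toy.groupby('station')['flow'].transform('mean')

# Ratio feature: each reading relative to its station's average
toy['flow_ratio'] = toy['flow'] / toy['station_mean_flow']
print(toy)
```

Using `transform` rather than `agg` keeps the result the same length as the dataframe, so it can be assigned directly as a new column.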

### Task 4: Handle missing values
Check for missing values and decide on a strategy to handle them (e.g., imputation, removal).


In [None]:
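# Check for missing values per column
print(df.isnull().sum())

# You can either fill missing values or drop rows/columns
# Example: filling missing values in numeric columns with the column mean
# (numeric_only=True skips text columns such as 'tt')
df = df.fillna(df.mean(numeric_only=True))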
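One caveat with imputing on the combined dataframe is that the fill values are then computed partly from test rows. A possible alternative, sketched here on made-up data, is to compute the statistics from the training rows only and apply them everywhere. The column name `level` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both the train and the test part (hypothetical values)
toy = pd.DataFrame({
    'level': [1.0, np.nan, 3.0, np.nan],
    'tt': ['train', 'train', 'train', 'test'],
})

# Compute the fill value from training rows only
train_means = toy.loc[toy['tt'] == 'train'].mean(numeric_only=True)

# Apply the same training statistics to every row, including the test rows
toy_filled = toy.fillna(train_means)
print(toy_filled)
```

Whether this matters in practice depends on how much the two periods differ, but it keeps the test set from influencing preprocessing.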

## 4. Data Splitting and Preprocessing

### Task 5: Split the data back into train and test sets
Now that we’ve processed the data, let's split it back into training and testing sets.


In [None]:
# Split the combined dataset back into train and test sets
df_train = df[df['tt'] == 'train'].drop(columns=['tt'])
df_test = df[df['tt'] == 'test'].drop(columns=['tt'])

# Define X and y for both train and test sets
# Replace 'target_column' with the name of your target variable
# Keep only numeric feature columns: the models below cannot use raw text such as 'timestamp'
X_train = df_train.drop(columns=['target_column']).select_dtypes(include='number')
y_train = df_train['target_column']

X_test = df_test.drop(columns=['target_column']).select_dtypes(include='number')
y_test = df_test['target_column']

## 5. Model Training and Evaluation

### Task 6: Train a Linear Regression Model
Use `LinearRegression` from sklearn to fit the model and evaluate its performance.


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
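MSE is expressed in squared units of the target, which can be hard to read. As a small illustration on made-up numbers (not the river data), the root mean squared error and the R² score can be reported alongside it:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed and predicted values
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_hat = np.array([3.0, 4.0, 5.0, 8.0])

mse_toy = mean_squared_error(y_true, y_hat)
rmse_toy = np.sqrt(mse_toy)       # same units as the target
r2_toy = r2_score(y_true, y_hat)  # 1.0 would be a perfect fit

print(f'MSE: {mse_toy:.3f}, RMSE: {rmse_toy:.3f}, R^2: {r2_toy:.3f}')
```

The same three lines can be applied to `y_test` and `y_pred` from the cell above.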

### Task 7: Train a Decision Tree Model
Repeat the steps to train a Decision Tree model and compare its performance.


In [None]:
from sklearn.tree import DecisionTreeRegressor

# Initialize and train the model (a fixed random_state makes the result reproducible)
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train, y_train)

# Make predictions and evaluate
y_tree_pred = tree_model.predict(X_test)
tree_mse = mean_squared_error(y_test, y_tree_pred)
print(f'Decision Tree Mean Squared Error: {tree_mse}')
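To compare models side by side, it can help to evaluate them in one loop and include a trivial baseline that always predicts the training mean. The sketch below runs on synthetic data generated for illustration only; the dictionary of candidates and all variable names are assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for X_train / X_test (not the river data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.1, size=200)
X_tr, X_te = X_toy[:150], X_toy[150:]
y_tr, y_te = y_toy[:150], y_toy[150:]

candidates = {
    'mean baseline': DummyRegressor(strategy='mean'),
    'linear regression': LinearRegression(),
    'decision tree': DecisionTreeRegressor(random_state=0),
}

# Fit each model on the same training data and score it on the same test data
results = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    results[name] = mean_squared_error(y_te, estimator.predict(X_te))

# Print the models from lowest to highest test MSE
for name, score in sorted(results.items(), key=lambda item: item[1]):
    print(f'{name}: {score:.4f}')
```

A model that cannot beat the mean baseline is not learning anything useful from the features, which makes the baseline a handy reference point.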

## 6. Feature Importance

### Task 8: Visualize feature importance
For the Decision Tree model, visualize the importance of each feature.


In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Plot feature importance
feature_names = X_train.columns
plt.figure(figsize=(10, 6))
plt.barh(range(len(feature_names)), tree_model.feature_importances_, align='center')
plt.yticks(np.arange(len(feature_names)), feature_names)
plt.xlabel('Importance')
plt.title('Feature Importance in Decision Tree')
plt.gca().invert_yaxis()  # Invert the y-axis so the first feature appears at the top
plt.show()
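With many features, the chart is easier to read when the importances are sorted. One way to do this, sketched on synthetic data with made-up column names, is to wrap `feature_importances_` in a pandas Series indexed by feature name:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: only 'signal' actually drives the target
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(300, 3)), columns=['signal', 'noise_a', 'noise_b'])
y_toy = 3.0 * X_toy['signal'] + rng.normal(scale=0.1, size=300)

toy_tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_toy, y_toy)

# Pair each importance with its feature name, then sort from most to least important
importances = pd.Series(toy_tree.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.barh()` on the sorted Series would draw a bar chart comparable to the one above.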

## 7. Challenge: Implement XGBoost

### Task 9: Train and evaluate an XGBoost model
Your final task is to implement an XGBoost model. Train the model and evaluate its performance just as you did with the previous models.


In [None]:
# Your XGBoost code here

## 8. Conclusion

Summarize your findings and reflect on the model performance:
- Which model performed the best?
- Which features were most important for the predictions?
- What feature engineering steps made the biggest difference?
