
# 🧠 Baseline Notebook: Simple Regression Model

Welcome to the **Puerto Rico National Olympiad of Artificial Intelligence 2025**! 🎯  
In this task, your goal is to **predict a continuous variable (target)** based on three input features using **supervised machine learning**.

## 📝 How the competition works

- You are given:
  - `train.csv`: a dataset of 10,000 entries with 3 features and the corresponding target value.
  - `eval.csv`: a dataset of 5,000 entries with the same 3 features, but without the target. Each row has a unique `id`.
- Your task is to:
  1. **Train a regression model** using the training dataset.
  2. **Predict the target values** for the evaluation dataset.
  3. **Export your predictions** to a CSV file with two columns: `id` and `prediction`.

You will **submit your predictions**, and the organizers will evaluate how close they are to the real (hidden) targets using the **Mean Squared Error (MSE)** metric.
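
To make the metric concrete, here is a minimal sketch of how MSE is computed. The numbers are toy values made up for illustration, not competition data, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values chosen for illustration only (not competition data)
y_true = np.array([2.0, 4.0, 6.0])
y_pred = np.array([2.5, 3.0, 7.0])

# MSE is the average of the squared differences between truth and prediction
manual_mse = np.mean((y_true - y_pred) ** 2)

# scikit-learn's helper should give the same number
sklearn_mse = mean_squared_error(y_true, y_pred)

print(f"Manual MSE:  {manual_mse:.4f}")
print(f"sklearn MSE: {sklearn_mse:.4f}")
```

Because the errors are squared, a few large misses hurt the score much more than many small ones, and lower values are better.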


In [None]:
import pandas as pd

# Load training data (features and target)
train_df = pd.read_csv("train.csv")
print("Training data:")
print(train_df.head())

# Load evaluation data (id and features, no target)
eval_df = pd.read_csv("eval.csv")
print("\nEvaluation data:")
print(eval_df.head())

Training data:
   feature_1  feature_2  feature_3    target
0   0.496714  -0.138264   0.647689  2.213169
1   1.523030  -0.234153  -0.234137  6.794350
2   1.579213   0.767435  -0.469474  4.067783
3   0.542560  -0.463418  -0.465730  3.321263
4   0.241962  -1.913280  -1.724918  4.523979

Evaluation data:
   id  feature_1  feature_2  feature_3
0   1  -0.471858   1.012702  -0.198187
1   2   0.090569   0.717391  -0.058963
2   3  -1.817848   1.040588   1.254929
3   4  -1.831116  -1.043291   2.023880
4   5   0.232035   0.003551  -0.426383



## 🔧 Step 1: Prepare the data

We'll separate the features and the target for training.  
The evaluation data only has features and an ID — we'll use those features to predict the target.


In [None]:
# Separate features and target
X_train = train_df[['feature_1', 'feature_2', 'feature_3']]
y_train = train_df['target']

# Extract evaluation features and IDs
X_eval = eval_df[['feature_1', 'feature_2', 'feature_3']]
eval_ids = eval_df['id']


## 📊 Step 2: Evaluate your model before submitting

Before using the evaluation dataset, let's split our training data into a smaller training set and a validation set.
This helps us understand how well the model might perform on unseen data — just like in the real evaluation.

We will use **80% for training** and **20% for validation**.


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Split the original training data into training and validation subsets.
# New names are used so that X_train and y_train still hold the full training set.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Train the model on the training subset
model = LinearRegression()
model.fit(X_tr, y_tr)

# Predict and evaluate on the validation set
y_val_pred = model.predict(X_val)
val_mse = mean_squared_error(y_val, y_val_pred)
print(f"Validation MSE: {val_mse:.4f}")

Validation MSE: 1.0345
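
A single 80/20 split gives one estimate that depends on which rows landed in the validation set. One common alternative, not part of the original baseline, is k-fold cross-validation. The sketch below uses synthetic data as a stand-in for `train.csv`; the names `X_demo`, `y_demo`, and the weights used to generate the target are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for train.csv (assumption: 3 features, noisy linear target)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = (
    3.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.5 * X_demo[:, 2]
    + rng.normal(size=1000)  # noise with variance 1
)

# scikit-learn scorers follow "higher is better", so MSE is returned negated
neg_mse_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
cv_mse = -neg_mse_scores  # flip the sign back to ordinary MSE

print("MSE per fold:", np.round(cv_mse, 4))
print(f"Mean MSE: {cv_mse.mean():.4f} (std {cv_mse.std():.4f})")
```

With the real data you would pass the full feature table and target column in place of `X_demo` and `y_demo`. The spread across folds gives a rough sense of how much the validation score can vary by chance.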



## 🤖 Step 3: Train a simple regression model with all the data

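We'll use **Linear Regression**, a common starting point in machine learning.
This time we fit it on the full training set rather than the 80% subset, so the final model learns from every labeled row.


In [None]:
from sklearn.linear_model import LinearRegression

# Re-select the full training set from train_df
# (guards against X_train / y_train having been overwritten by a split)
X_train = train_df[['feature_1', 'feature_2', 'feature_3']]
y_train = train_df['target']

# Create and train the model on all the training data
model = LinearRegression()
model.fit(X_train, y_train)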
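
A linear regression model predicts the target as a weighted sum of the features plus an intercept. The sketch below shows how to read those learned values from a fitted model through its `coef_` and `intercept_` attributes. It uses synthetic data with known weights, chosen here for illustration only; the real competition data may follow a different relationship.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known weights (assumption, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
true_weights = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y_demo = X_demo @ true_weights + true_intercept + 0.1 * rng.normal(size=500)

demo_model = LinearRegression().fit(X_demo, y_demo)

# coef_ holds one weight per feature; intercept_ is the constant term
for i, weight in enumerate(demo_model.coef_, start=1):
    print(f"feature_{i} weight: {weight:.3f}")
print(f"intercept: {demo_model.intercept_:.3f}")
```

On the real notebook, printing `model.coef_` and `model.intercept_` after fitting shows which features the baseline relies on most. If the validation error stays high, that can hint that the target depends on the features in a way a straight line cannot capture.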


## 📈 Step 4: Make predictions on the evaluation data


In [11]:
# Predict using the trained model
y_eval_pred = model.predict(X_eval)


## 💾 Step 5: Save predictions to a CSV file

We create a new DataFrame with the `id` and your model's predictions, then save it.  
Make sure your file is named `predictions.csv` for submission!


In [12]:

# Create submission DataFrame
submission = pd.DataFrame({'id': eval_ids, 'prediction': y_eval_pred})

# Save to CSV
submission.to_csv("predictions.csv", index=False)
print("Predictions saved to predictions.csv!")


Predictions saved to predictions.csv!
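
Before uploading, it can help to sanity-check the submission table. The sketch below uses a hypothetical helper, `check_submission`, which is not part of the competition tooling. Only the two column names come from the task description; the other checks (matching row count, unique ids, no missing predictions) are assumptions about what a valid file should look like.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, expected_ids: pd.Series) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # The task asks for exactly two columns: id and prediction
    if list(submission.columns) != ["id", "prediction"]:
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems
    if len(submission) != len(expected_ids):
        problems.append(f"expected {len(expected_ids)} rows, got {len(submission)}")
    if submission["id"].duplicated().any():
        problems.append("duplicate ids")
    if set(submission["id"]) != set(expected_ids):
        problems.append("ids do not match the evaluation set")
    if submission["prediction"].isna().any():
        problems.append("missing predictions")
    return problems

# Toy examples for illustration only
expected_ids = pd.Series([1, 2, 3])
good = pd.DataFrame({"id": [1, 2, 3], "prediction": [0.1, 0.2, 0.3]})
bad = pd.DataFrame({"id": [1, 2, 2], "prediction": [0.1, None, 0.3]})

print("good:", check_submission(good, expected_ids))
print("bad: ", check_submission(bad, expected_ids))
```

In the notebook, the equivalent call would be `check_submission(submission, eval_ids)`, and an empty list means none of these checks failed.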
