
# 🧠 Task 1 Baseline Notebook: Simple Regression Model

Welcome to the **Puerto Rico National Olympiad of Artificial Intelligence 2025**! 🎯  
In this task, your goal is to **predict a continuous variable (target)** based on three input features using **supervised machine learning**.

## 📝 How the competition works

- You are given:
  - `task1.ipynb`: a baseline Jupyter notebook (Python) that serves as a starting point to solve the task.
  - `train.csv`: a dataset of 1,000 entries with 3 features and the corresponding target value.
  - `eval.csv`: a dataset of 500 entries with the same 3 features, but without the target. Each row has a unique `id`.
- Your task is to:
  1. **Train a regression model** using the training dataset.
  2. **Predict the target values** for the evaluation dataset.
  3. **Export your predictions** to a CSV file with two columns: `id` and `prediction`.

You will **submit your predictions**, and the organizers will evaluate how close they are to the real (hidden) targets using the **Mean Squared Error (MSE)** metric.
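
For reference, MSE is the average of the squared differences between the true values and the predictions, so lower is better and large errors are penalized more heavily. The sketch below uses made-up numbers (not competition data) and checks a hand-written version against scikit-learn's `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration only (not competition data)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 5.0])

# MSE = mean of the squared errors
manual_mse = np.mean((y_true - y_pred) ** 2)
sklearn_mse = mean_squared_error(y_true, y_pred)

print(f"Manual MSE:  {manual_mse:.4f}")
print(f"sklearn MSE: {sklearn_mse:.4f}")
```

Both lines should print the same number, since the two computations are equivalent.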


In [33]:
import pandas as pd

# Load training data (features and target)
train_df = pd.read_csv("train.csv")
print("Training data:")
display(train_df)

# Load evaluation data (id and features, no target)
eval_df = pd.read_csv("eval.csv")
print("\nEvaluation data:")
display(eval_df)

Training data:


Unnamed: 0,feature_1,feature_2,feature_3,target
0,0.103526,-13.118362,-10.651137,-10.016119
1,-15.978721,-6.300250,7.319555,-2.835304
2,19.583475,-12.295496,4.966992,-4.764661
3,2.809919,-6.226995,-2.081223,1.350010
4,-6.540757,-18.306329,5.112026,1.230942
...,...,...,...,...
995,-15.769965,-8.028244,-0.739729,-6.686316
996,-18.437936,-8.903178,-0.640778,-0.563026
997,3.721055,3.891592,-0.531207,-4.071553
998,20.420697,2.952799,-7.764991,-13.854245



Evaluation data:


Unnamed: 0,id,feature_1,feature_2,feature_3
0,1,6.789313,-1.369955,-5.193964
1,2,-0.626396,2.431235,-2.942425
2,3,-1.112261,-9.039076,-7.355299
3,4,16.283966,-3.791277,-2.035804
4,5,7.591552,-5.765104,-25.910423
...,...,...,...,...
495,496,1.946075,-7.424706,-13.200225
496,497,11.889134,7.083038,3.514482
497,498,2.482206,-4.593609,-8.498444
498,499,9.955815,17.031726,-16.380291



## 🔧 Step 1: Prepare the data

We'll separate the features and the target for training.  
The evaluation data only has features and an ID — we'll use those features to predict the target.


In [34]:
# Separate features and target
X_train_full = train_df[['feature_1', 'feature_2', 'feature_3']]
y_train_full = train_df['target']

# Extract evaluation features and IDs
X_eval = eval_df[['feature_1', 'feature_2', 'feature_3']]
eval_ids = eval_df['id']

## 📊 Step 2: Evaluate your model before submitting

Before using the evaluation dataset, let's split our training data into a smaller training set and a validation set.
This helps us understand how well the model might perform on unseen data — just like in the real evaluation.

We'll use **Linear Regression**, a common starting point in machine learning, with **80% of the data for training** and **20% for validation** (the held-out part is named `X_test` and `y_test` in the code).
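
A single 80/20 split gives one estimate, and that estimate depends on which rows happen to land in the validation set. One possible alternative (not part of the baseline) is k-fold cross-validation, sketched here with `cross_val_score`. The data is a synthetic stand-in, and `X_demo` and `y_demo` are hypothetical names. In the notebook you would pass `X_train_full` and `y_train_full` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (assumption: 1,000 rows, 3 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(scale=10.0, size=(1000, 3))
y_demo = X_demo @ np.array([0.3, -0.2, 0.5]) + rng.normal(scale=2.0, size=1000)

# scikit-learn scorers follow "higher is better", so MSE comes back negated
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
cv_mse = -neg_mse

print("MSE per fold:", np.round(cv_mse, 4))
print(f"Mean MSE: {cv_mse.mean():.4f} (std {cv_mse.std():.4f})")
```

The spread across folds gives a rough sense of how much a single split's MSE can vary.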

🚨 **Stop here** 🚨

The procedure below creates a very simple model that is unlikely to give the best results. If you want to improve your score, try some of the following ideas:
- Try more powerful models like `RandomForestRegressor`, `GradientBoostingRegressor`, or `SVR`. See [Machine Learning Map](https://scikit-learn.org/stable/machine_learning_map.html).
- Use feature transformations like polynomial features or normalization. See [Dataset Transformations](https://scikit-learn.org/stable/data_transforms.html).
- Tune model hyperparameters; see each model's documentation for the available options.
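
As a rough illustration of the ideas above, the sketch below compares three candidates on a held-out split: plain linear regression, a pipeline that adds degree-2 polynomial features, and a random forest. The data is synthetic and deliberately nonlinear, so the ranking here says nothing about the competition data. `X_demo`, `y_demo`, `candidates`, and `results` are hypothetical names, and in the notebook you would use `X_train_full` and `y_train_full`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic, deliberately nonlinear stand-in data (not the competition data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(600, 3))
y_demo = (
    X_demo[:, 0] * X_demo[:, 1]
    + X_demo[:, 2] ** 2
    + rng.normal(scale=0.1, size=600)
)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Candidate models; hyperparameters are illustrative, not tuned
candidates = {
    "linear": LinearRegression(),
    "poly2 + linear": make_pipeline(
        StandardScaler(), PolynomialFeatures(degree=2), LinearRegression()
    ),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each candidate on the training part and score it on the validation part
results = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    results[name] = mean_squared_error(y_val, estimator.predict(X_val))
    print(f"{name:>15}: validation MSE = {results[name]:.4f}")
```

Wrapping the transformations and the model in one pipeline means the scaler and the polynomial expansion are fitted only on the training part, which avoids leaking validation information into preprocessing.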

In [35]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Split the original training data
X_train, X_test, y_train, y_test = train_test_split(X_train_full, y_train_full, test_size=0.2, random_state=42)

# Train the model on training subset
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate on the validation set
y_test_pred = model.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_pred)
print(f"Test MSE: {test_mse:.4f}")

Test MSE: 23.4352
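
An MSE value is easier to interpret next to a trivial reference. One possible check (not part of the baseline) is to compare against a model that always predicts the training mean, using scikit-learn's `DummyRegressor`. The sketch uses synthetic stand-in data, and `X_demo`, `y_demo`, `baseline`, and `linear` are hypothetical names. In the notebook you would reuse `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 1,000 rows, 3 features)
rng = np.random.default_rng(1)
X_demo = rng.normal(scale=10.0, size=(1000, 3))
y_demo = X_demo @ np.array([0.3, -0.2, 0.5]) + rng.normal(scale=2.0, size=1000)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Reference model: ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

baseline_mse = mean_squared_error(y_val, baseline.predict(X_val))
linear_mse = mean_squared_error(y_val, linear.predict(X_val))

print(f"Mean-baseline MSE: {baseline_mse:.4f}")
print(f"Linear model MSE:  {linear_mse:.4f}")
```

If a model's validation MSE is close to the mean-baseline MSE, the model is extracting little signal from the features.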


## 🤖 Step 3: Train your model again with all the data

In [39]:
# Retrain the same model on all the training data (training + validation subsets)
model.fit(X_train_full, y_train_full)


## 📈 Step 4: Make predictions on the evaluation data


In [40]:
# Predict using the trained model
y_eval_pred = model.predict(X_eval)


## 💾 Step 5: Save predictions to a CSV file

We create a new DataFrame with the `id` and your model's predictions, then save it.  
Make sure your file is named `predictions.csv` for submission!


In [38]:
# Create submission DataFrame
submission = pd.DataFrame({'id': eval_ids, 'prediction': y_eval_pred})

# Save to CSV
submission.to_csv("predictions.csv", index=False)
print("Predictions saved to predictions.csv!")

Predictions saved to predictions.csv!
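
Before submitting, it can help to read the file back and confirm it has the expected shape. The helper below, `check_submission`, is a hypothetical sketch and not an official validator. It assumes the organizers expect exactly the columns `id` and `prediction` in that order, unique ids, and no missing predictions. The demo runs on a tiny made-up file in a temporary directory.

```python
import os
import tempfile

import pandas as pd


def check_submission(path, expected_rows):
    """Hypothetical helper: raise AssertionError if the file looks malformed."""
    df = pd.read_csv(path)
    assert list(df.columns) == ["id", "prediction"], f"unexpected columns: {list(df.columns)}"
    assert len(df) == expected_rows, f"expected {expected_rows} rows, got {len(df)}"
    assert df["id"].is_unique, "duplicate ids found"
    assert df["prediction"].notna().all(), "missing predictions found"
    return df


# Demo on a tiny made-up file (not real predictions)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "predictions.csv")
    demo = pd.DataFrame({"id": [1, 2, 3], "prediction": [0.1, -2.5, 3.0]})
    demo.to_csv(demo_path, index=False)
    checked = check_submission(demo_path, expected_rows=3)

print(checked)
```

In the notebook, the equivalent call would be `check_submission("predictions.csv", expected_rows=len(eval_df))`.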
