# 🚚 SmartCargo - Optimizarea livrărilor în România (SmartCargo – Delivery Optimization in Romania)

The transport company **“SmartCargo Romania”** needs intelligent solutions to improve delivery time estimates. You are the new data science specialist on the team, and your mission is to build accurate models that predict delivery times between cities in Romania.

## 🎯 Your Main Goal

You need to analyze historical data on trips between cities and understand how factors such as distance, time of day, weather, traffic, or driver experience influence the actual delivery duration.

Your goal is to build a machine learning model that predicts the delivery time for new trips.

## 📦 Transport Data

Each row in the files `train_data.csv` and `test_data.csv` represents a delivery between two cities in Romania.

**Each delivery includes the following details:**

| Column              | Description                                                                                   |
|---------------------|-----------------------------------------------------------------------------------------------|
| `ID`                | Unique identifier for the trip.                                                               |
| `City A`            | Departure city (as text).                                                                     |
| `City B`            | Destination city (as text).                                                                   |
| `Distance`          | Actual distance between cities (in kilometers).                                               |
| `Time of Day`       | Time of day expressed in minutes from midnight when the trip starts (0–1439).                 |
| `Weather`           | Weather during the trip (`Clear`, `Rain`, `Snow`, `Fog`).                                     |
| `Traffic`           | Traffic level on a numeric scale (0.0 – 1000.0), with the maximum meaning the most congested. |
| `Road Quality`      | Road quality on a numeric scale (1 – 1000), with the maximum meaning the best quality.        |
| `Driver Experience` | Driver’s level of experience (1 – 40 years).                                                  |
| `deliver_time`      | Only in `train_data.csv`: the actual delivery time in minutes.                                |
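
`Time of Day` is stored as minutes since midnight, which is compact but awkward to read when inspecting individual rows. The snippet below is a minimal sketch of converting it to a clock string; the helper name `minutes_to_clock` is hypothetical and not part of the original notebook.

```python
def minutes_to_clock(minutes: int) -> str:
    """Convert minutes since midnight (0-1439) to an HH:MM string."""
    if not 0 <= minutes <= 1439:
        raise ValueError(f"expected a value in 0..1439, got {minutes}")
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}"


# 452 and 1386 are the Time of Day values of the first two training rows.
print(minutes_to_clock(452))   # 07:32
print(minutes_to_clock(1386))  # 23:06
```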

## 🎯 Your Mission

The operations manager has given you the following two essential tasks to improve SmartCargo’s operations.

The prediction dataset (`test_data.csv`) contains samples with the same features as the training set but without the `deliver_time` column.

Your model will generate predictions for these samples.

### Subtasks

- Task 1 (20 points) – Bârlad Situation: An important client from Bârlad has reported frequent delays on trips in foggy conditions. The manager wants to know how many trips in the prediction dataset depart from `Bârlad` in foggy (`Fog`) weather. Find and report the number of such trips.

- Task 2 (80 points) – Estimating Times for Unknown Trips: The manager has a list of new trips for which the travel times are unknown. You must use historical data (from the training set) to train prediction models and then estimate the delivery times for all trips in the prediction dataset.

### Output Format

A CSV file `output.csv` that includes the following 3 columns:

- `subtaskID` – the number of the subtask (1, 2).
- `datapointID` – refers to the ID column in `test_data.csv`.
- `answer` – the corresponding answer for the datapoint for that subtask.

**Note:** For subtask 1, which requires a single answer for the entire dataset, output only one row with `datapointID` set to 1.
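
To make the layout concrete, here is a small sketch that writes three rows in the required format to an in-memory buffer. All answer values are made-up placeholders, and the variable names are illustrative only.

```python
import csv
import io

# Placeholder values: one row for subtask 1, one row per test ID for subtask 2.
rows = [
    (1, 1, 7),      # subtask 1: a single count, datapointID fixed to 1
    (2, 1, 355.2),  # subtask 2: predicted minutes for test ID 1
    (2, 2, 529.7),  # subtask 2: predicted minutes for test ID 2
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["subtaskID", "datapointID", "answer"])
writer.writerows(rows)

output_text = buffer.getvalue()
print(output_text)
```

Both subtasks share the same three columns, so the single-answer subtask and the per-trip predictions can live in one file.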

Submit a single CSV file containing answers for all the subtasks you completed. To see an example, download the file `sample_output.csv` (Note: although it has the correct format, it earns 0 points upon submission).

## Scoring

Scores for the subtasks will be calculated as follows:

- Subtask 1: integer answer; you receive 20 points only if it is correct.
- Subtask 2: based on your model's performance, using the Mean Absolute Error (MAE) metric on the test dataset. You can find the exact scoring method based on MAE in the Starter Kit.
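
For reference, MAE is the average absolute difference between actual and predicted values. The sketch below computes it in plain Python; the predictions are made up, and the mapping from MAE to points is defined in the Starter Kit and is not reproduced here.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all datapoints."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [355, 529, 465]           # delivery times in minutes
predicted = [356.0, 527.0, 465.0]  # made-up predictions
print(mean_absolute_error(actual, predicted))  # 1.0
```

Because the errors are in minutes, an MAE of 1.0 means the predictions are off by one minute on average.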

> This is an English translation of the original Romanian task description, generated by ChatGPT.


In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

## Load data

In [2]:
df_train = pd.read_csv("train_data.csv")
df_test = pd.read_csv("test_data.csv")
df_train

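|      | ID   | City A     | City B      | Distance | Time of Day | Weather | Traffic    | Road Quality | Driver Experience | deliver_time |
|------|------|------------|-------------|----------|-------------|---------|------------|--------------|-------------------|--------------|
| 0    | 1    | Satu Mare  | Suceava     | 352      | 452         | Fog     | 154.014691 | 370          | 30                | 355          |
| 1    | 2    | Ploiesti   | Timisoara   | 519      | 1386        | Clear   | 949.697532 | 701          | 2                 | 529          |
| 2    | 3    | Deva       | Bacau       | 457      | 91          | Fog     | 387.019309 | 45           | 26                | 465          |
| 3    | 4    | Hunedoara  | Focsani     | 447      | 1120        | Clear   | 130.544017 | 643          | 6                 | 441          |
| 4    | 5    | Hunedoara  | Arad        | 201      | 1096        | Clear   | 619.557737 | 375          | 20                | 230          |
| ...  | ...  | ...        | ...         | ...      | ...         | ...     | ...        | ...          | ...               | ...          |
| 9995 | 9996 | Targoviste | Targu Mures | 280      | 907         | Rain    | 563.783292 | 283          | 39                | 296          |
| 9996 | 9997 | Brasov     | Iasi        | 320      | 1409        | Fog     | 189.985776 | 646          | 23                | 336          |
| 9997 | 9998 | Botosani   | Deva        | 506      | 890         | Clear   | 337.350595 | 476          | 40                | 479          |
| 9998 | 9999 | Iasi       | Suceava     | 145      | 1086        | Clear   | 454.312259 | 726          | 31                | 161          |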


## Subtask 1

In [3]:
# City names in the data use ASCII spelling ("Barlad"), unlike the task text ("Bârlad").
num_barlad_fog = ((df_test["City A"] == "Barlad") & (df_test["Weather"] == "Fog")).sum().item()
num_barlad_fog

15
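
The task statement spells the city `Bârlad`, while the filter above matches `Barlad`, consistent with the ASCII spellings visible in the training preview (`Ploiesti`, `Timisoara`). When it is unclear which spelling a file uses, one option is to compare names with diacritics stripped. The following is a sketch on made-up rows; `strip_diacritics` and `count_matches` are hypothetical helpers, not part of the original notebook.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'Bârlad' -> 'Barlad'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_matches(rows, city, weather):
    """Count rows matching a departure city (ignoring diacritics and case) and a weather label."""
    target_city = strip_diacritics(city).casefold()
    return sum(
        1
        for row in rows
        if strip_diacritics(row["City A"]).casefold() == target_city
        and row["Weather"] == weather
    )


# Made-up sample rows mixing both spellings.
sample = [
    {"City A": "Barlad", "Weather": "Fog"},
    {"City A": "Bârlad", "Weather": "Fog"},
    {"City A": "Barlad", "Weather": "Clear"},
    {"City A": "Iasi", "Weather": "Fog"},
]
print(count_matches(sample, "Bârlad", "Fog"))  # 2
```

With pandas, the same normalization could be applied through `df_test["City A"].map(strip_diacritics)` before comparing.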

In [4]:
subtask1_rows = [(1, 1, num_barlad_fog)]

## Subtask 2

In [5]:
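# Drop the identifier and the city-name columns; keep the target separately.
drop_cols = ["ID", "City A", "City B"]
X_train = df_train.drop(columns=drop_cols + ["deliver_time"])
y_train = df_train["deliver_time"]
# Drop the same columns from the test set so both frames share one schema.
X_test = df_test.drop(columns=drop_cols)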

In [6]:
X_train.dtypes

Distance               int64
Time of Day            int64
Weather               object
Traffic              float64
Road Quality           int64
Driver Experience      int64
dtype: object

In [7]:
cat_cols = []
num_cols = []
for col in X_train.columns:
    if X_train[col].dtype == "object":
        cat_cols.append(col)
        print(f"{col:<10}", X_train[col].unique())
    else:
        num_cols.append(col)

Weather    ['Fog' 'Clear' 'Rain' 'Snow']


In [8]:
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(drop="first"), cat_cols),
    ("num", StandardScaler(), num_cols)
])
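
To see what `OneHotEncoder(drop="first")` does to the `Weather` column, the sketch below encodes the four labels in isolation. It is only an illustration on a toy input and runs outside the pipeline.

```python
from sklearn.preprocessing import OneHotEncoder

# One sample per weather label, shaped as a single-column 2D input.
weather = [["Fog"], ["Clear"], ["Rain"], ["Snow"]]

encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(weather).toarray()  # sparse matrix -> dense array

print(encoder.categories_[0])  # categories are sorted; the first one is dropped
print(encoded)                 # 4 rows, 3 indicator columns
```

Four categories become three indicator columns, with the dropped category (`Clear`, first in sorted order) represented by all zeros. Dropping one level avoids perfect collinearity between the indicators and the intercept of the linear model.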

In [9]:
pipeline = Pipeline([
    ("pre", preprocessor),
    ("reg", LinearRegression())
])

In [10]:
# scikit-learn reports negated MAE (higher is better), so flip the sign to get MAE in minutes.
-cross_val_score(pipeline, X_train, y_train, cv=5, scoring="neg_mean_absolute_error")

array([1.67558831, 1.65138414, 1.63703101, 1.63356371, 1.59481032])
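
The five fold scores are easier to compare with the final test MAE once summarized. A small sketch, using the fold values printed above:

```python
import numpy as np

# Per-fold MAE values copied from the cross-validation output.
fold_mae = np.array([1.67558831, 1.65138414, 1.63703101, 1.63356371, 1.59481032])

mean_mae = fold_mae.mean()
std_mae = fold_mae.std(ddof=1)  # sample standard deviation across folds
print(f"CV MAE: {mean_mae:.4f} +/- {std_mae:.4f}")
```

The mean is about 1.64 minutes, and the small spread across folds suggests the estimate is stable.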

In [11]:
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

In [12]:
# One (subtaskID, datapointID, answer) row per test trip.
subtask2_rows = [(2, id_, val) for id_, val in zip(df_test["ID"], preds)]

## Save answers

In [13]:
submission_rows = subtask1_rows + subtask2_rows
df_submission = pd.DataFrame(submission_rows, columns=["subtaskID", "datapointID", "answer"])
df_submission.to_csv("submission.csv", index=False)
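
Before uploading, it can help to sanity-check the rows against the output format. The validator below is a hypothetical sketch, not part of the original notebook, and it assumes both subtasks are being submitted.

```python
import math


def validate_submission(rows, test_ids):
    """Return a list of problems found in (subtaskID, datapointID, answer) rows."""
    problems = []
    sub1 = [r for r in rows if r[0] == 1]
    sub2 = [r for r in rows if r[0] == 2]

    # Subtask 1 expects a single row whose datapointID is 1.
    if len(sub1) != 1 or sub1[0][1] != 1:
        problems.append("subtask 1 needs exactly one row with datapointID 1")

    # Subtask 2 expects exactly one prediction per test ID.
    if sorted(r[1] for r in sub2) != sorted(test_ids):
        problems.append("subtask 2 IDs do not match the test IDs one-to-one")

    # NaN or infinite predictions cannot be scored sensibly.
    if any(not math.isfinite(r[2]) for r in sub2):
        problems.append("subtask 2 contains a non-finite prediction")

    return problems


# Made-up rows for two test trips with IDs 10 and 11.
demo_rows = [(1, 1, 7), (2, 10, 300.5), (2, 11, 412.0)]
print(validate_submission(demo_rows, [10, 11]))      # []
print(validate_submission(demo_rows[:2], [10, 11]))  # one problem: missing ID 11
```

In the notebook, this could be called as `validate_submission(submission_rows, df_test["ID"].tolist())`.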

## Submission results

Subtask 1:
- Equality: 1
- Score: 21/21

Subtask 2:
- MAE: 1.606679
- Score: 79/79
