# House Price Prediction

In this problem, you implement a regression model that predicts the price of a house (`Price`) from the provided dataset. The data is supplied as CSV files, and the model's performance is evaluated using the **Mean Absolute Error (MAE)**.

**The dataset contains the following columns:**

- `Square_Footage`: Area of the house.
- `Num_Bedrooms`: Number of bedrooms.
- `Num_Bathrooms`: Number of bathrooms.
- `Year_Built`: Year the house was built.
- `Lot_Size`: Size of the lot.
- `Garage_Size`: Size of the garage.
- `Neighborhood_Quality`: Quality of the neighborhood.
- `Footage_to_Lot_Ratio`: Ratio between house area and lot size.
- `Total_Rooms`: Total number of rooms.
- `Age_of_House`: Age of the house.
- `Garage_to_Footage_Ratio`: Ratio between garage size and house area.
- `Avg_Room_Size`: Average room size.
- `Price`: Target variable, house price (numerical value, prediction objective).
- `House_Orientation_Angle`: Orientation angle of the house.
- `Street_Alignment_Offset`: Street alignment offset.
- `Solar_Exposure_Index`: Solar exposure index.
- `Magnetic_Field_Strength`: Magnetic field strength.
- `Vibration_Level`: Vibration level.

## Tasks

### Subtask 1 (10 points)

For each house in the test set, determine the estimated total area as the sum of the house area (`Square_Footage`), garage size (`Garage_Size`), and lot size (`Lot_Size`).

### Subtask 2 (10 points)

For each house in the test set, calculate the ratio between the garage size (`Garage_Size`) and the total number of rooms (`Total_Rooms`). The result should be added as a new column called `Garage_to_Room_Ratio`.

### Subtask 3 (10 points)

For each house in the test set, calculate the environmental stability index, defined as the difference between the solar exposure index (`Solar_Exposure_Index`) and the vibration level (`Vibration_Level`), divided by the magnetic field strength (`Magnetic_Field_Strength`).

```python
Env_Stability_Index = (Solar_Exposure_Index - Vibration_Level) / Magnetic_Field_Strength
```

### Subtask 4 (10 points)

Using the training dataset, calculate the mean value of the `Square_Footage` column, representing the average house area in the training set.
Then, for each house in the test set, determine the absolute value of the difference between the actual house area (`Square_Footage`) and the average calculated from the training set.

### Subtask 5 (60 points)

The main goal of this task is to build a machine learning regression model that can predict `Price` based on the features provided in the dataset. The model must generalize well to new data and will be evaluated using Mean Absolute Error (MAE).

Implement a regression model to predict the `Price` field, using the training set `train_data.csv`. Generate predictions for the evaluation set provided in `test_data.csv` (this file does not contain the `Price` column).

## Notes about the dataset

The target field is `Price`, a numerical value representing the house price.

Numerical variables (`Square_Footage`, `Num_Bedrooms`, `Num_Bathrooms`, `Year_Built`, `Lot_Size`, `Garage_Size`, `Footage_to_Lot_Ratio`, `Total_Rooms`, `Age_of_House`, `Garage_to_Footage_Ratio`, `Avg_Room_Size`, `House_Orientation_Angle`, `Street_Alignment_Offset`, `Solar_Exposure_Index`, `Magnetic_Field_Strength`, `Vibration_Level`) can be used directly for regression.

It is recommended to check and handle missing values (if any) and to normalize/scale the features to improve model performance.
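
The notebook that follows does not check for missing values explicitly. A minimal sketch of such a check, assuming median imputation with statistics taken from the training set only; the toy frames and the helper name `impute_with_train_median` are hypothetical and not part of the task files:

```python
import numpy as np
import pandas as pd


def impute_with_train_median(train: pd.DataFrame, test: pd.DataFrame):
    """Fill NaNs in both frames using medians computed on the training frame only."""
    medians = train.median(numeric_only=True)
    return train.fillna(medians), test.fillna(medians)


# Toy data standing in for train_data.csv / test_data.csv (values are made up).
toy_train = pd.DataFrame({"Square_Footage": [1000.0, 2000.0, np.nan, 4000.0],
                          "Lot_Size": [1.0, np.nan, 3.0, 5.0]})
toy_test = pd.DataFrame({"Square_Footage": [np.nan, 1500.0],
                         "Lot_Size": [2.0, np.nan]})

# Count missing values per column before deciding how to handle them.
print(toy_train.isna().sum())

filled_train, filled_test = impute_with_train_median(toy_train, toy_test)
```

Taking the medians from the training frame alone keeps test-set information out of preprocessing, which mirrors how the scaler is fitted later in the notebook.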

## Evaluation Criteria

Performance: The model should have as low an MAE as possible.
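
For reference, MAE is the mean of the absolute differences between the true and predicted values. A small sketch with made-up numbers, comparing a manual computation against scikit-learn's `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices, for illustration only.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# MAE = mean(|y_true - y_pred|)
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```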

### Note
If you submit `sample_output.csv`, you will receive 5 points.

> This is an English translation of the original Romanian task description, generated by ChatGPT.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer

## Load Data

In [2]:
df_train = pd.read_csv("train_data.csv")
df_test = pd.read_csv("test_data.csv")
df_train

,ID,Square_Footage,Num_Bedrooms,Num_Bathrooms,Year_Built,Lot_Size,Garage_Size,Neighborhood_Quality,Footage_to_Lot_Ratio,Total_Rooms,Age_of_House,Garage_to_Footage_Ratio,Avg_Room_Size,Price,House_Orientation_Angle,Street_Alignment_Offset,Solar_Exposure_Index,Magnetic_Field_Strength,Vibration_Level
0,1,2028,2,3,1967,1.784790,2,2,1136.268444,5,58,0.000986,405.600000,11184.929934,16.722149,298.409571,235.502857,227.621575,129.770822
1,2,3519,5,3,1966,4.009947,0,10,877.567605,8,59,0.000000,439.875000,13941.315383,340.115663,43.878994,300.292055,46.684432,211.676987
2,3,4507,2,3,2014,4.122337,0,7,1093.311933,5,11,0.000000,901.400000,19686.885572,219.823215,24.542031,186.851621,10.837394,316.769266
3,4,3371,4,2,2000,1.580318,0,1,2133.114532,6,25,0.000000,561.833333,20964.530841,10.361763,147.970249,107.843644,175.620355,244.463978
4,5,2871,5,1,1974,3.426914,2,6,837.780090,6,51,0.000697,478.500000,12180.466278,329.344524,46.114469,357.571806,335.719756,135.850744
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,796,2257,5,1,1968,3.131006,0,2,720.854482,6,57,0.000000,376.166667,9787.329652,147.737852,142.647127,308.584533,260.441214,347.508981
796,797,3894,3,2,1975,1.256532,0,5,3099.006876,5,50,0.000000,778.800000,26089.319670,291.370016,250.193479,37.520929,91.178058,233.555539
797,798,1484,5,1,2010,1.246555,1,6,1190.481338,6,15,0.000674,247.333333,10792.074311,279.189942,279.521081,312.037328,293.281562,225.156042
798,799,1865,4,2,1994,4.354220,0,7,428.320082,6,31,0.000000,310.833333,7715.953491,252.348887,10.790492,140.213482,140.122970,101.016293


## Subtask 1

In [3]:
total_areas = df_test["Square_Footage"] + df_test["Garage_Size"] + df_test["Lot_Size"]
total_areas

0      4015.098092
1      2312.369622
2      4710.792970
3      4937.479598
4      3649.980987
          ...     
195    3770.520335
196     620.461372
197    4494.399155
198    1864.546672
199    3200.913258
Length: 200, dtype: float64

In [4]:
subtask1_rows = []
for id_, val in zip(df_test["ID"], total_areas):
    subtask1_rows.append((1, id_, val))
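
The same row-building loop is repeated for every subtask. One possible alternative, sketched here with a hypothetical helper `to_submission_frame` and made-up IDs and values, builds each block as a DataFrame directly:

```python
import pandas as pd


def to_submission_frame(subtask_id: int, ids, answers) -> pd.DataFrame:
    """Return one subtask's answers in the (subtaskID, datapointID, answer) layout."""
    return pd.DataFrame({
        "subtaskID": subtask_id,
        "datapointID": list(ids),
        "answer": list(answers),
    })


# Made-up IDs and values standing in for df_test["ID"] and total_areas.
example_ids = pd.Series([801, 802, 803])
example_areas = pd.Series([4015.1, 2312.4, 4710.8])

block1 = to_submission_frame(1, example_ids, example_areas)
print(block1)
```

Blocks produced this way could be combined with `pd.concat` instead of list concatenation; the loop-based version used in this notebook yields the same layout.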

## Subtask 2

In [5]:
df_train["Garage_to_Room_Ratio"] = df_train["Garage_Size"] / df_train["Total_Rooms"]
df_test["Garage_to_Room_Ratio"] = df_test["Garage_Size"] / df_test["Total_Rooms"]
df_test["Garage_to_Room_Ratio"]

0      0.250000
1      0.250000
2      0.250000
3      0.333333
4      0.000000
         ...   
195    0.250000
196    0.200000
197    0.000000
198    1.000000
199    1.000000
Name: Garage_to_Room_Ratio, Length: 200, dtype: float64

In [6]:
subtask2_rows = []
for id_, val in zip(df_test["ID"], df_test["Garage_to_Room_Ratio"]):
    subtask2_rows.append((2, id_, val))

## Subtask 3

In [7]:
df_train["Env_Stability_Index"] = (df_train["Solar_Exposure_Index"] - df_train["Vibration_Level"]) / df_train["Magnetic_Field_Strength"]
df_test["Env_Stability_Index"] = (df_test["Solar_Exposure_Index"] - df_test["Vibration_Level"]) / df_test["Magnetic_Field_Strength"]
df_test["Env_Stability_Index"]

0     -0.943669
1      0.260458
2     -0.390373
3     -0.076163
4      1.048315
         ...   
195    2.508925
196    0.965027
197    0.469447
198    0.792108
199    1.397728
Name: Env_Stability_Index, Length: 200, dtype: float64

In [8]:
subtask3_rows = []
for id_, val in zip(df_test["ID"], df_test["Env_Stability_Index"]):
    subtask3_rows.append((3, id_, val))
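
Because this index divides by `Magnetic_Field_Strength`, a zero in that column would produce `inf` or `NaN`, which `LinearRegression` rejects. The notebook runs through without errors, which suggests the real data contains no such rows, but a defensive check could look like the following sketch (toy values, hypothetical helper name `count_non_finite`):

```python
import numpy as np
import pandas as pd


def count_non_finite(values: pd.Series) -> int:
    """Number of entries that are NaN, +inf or -inf."""
    return int((~np.isfinite(values)).sum())


# Toy columns; the zeros in magnetic_field force divisions by zero.
solar = pd.Series([300.0, 120.0, 50.0])
vibration = pd.Series([100.0, 150.0, 50.0])
magnetic_field = pd.Series([200.0, 0.0, 0.0])

env_index = (solar - vibration) / magnetic_field
bad_rows = count_non_finite(env_index)
print(env_index.tolist(), bad_rows)
```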

## Subtask 4

In [9]:
train_sf_mean = df_train["Square_Footage"].mean().item()
train_sf_mean

2813.9

In [10]:
sf_abs = (df_test["Square_Footage"] - train_sf_mean).abs()
sf_abs

0      1198.1
1       503.9
2      1894.1
3      2118.1
4       832.1
        ...  
195     953.1
196    2197.9
197    1677.1
198     955.9
199     381.1
Name: Square_Footage, Length: 200, dtype: float64

In [11]:
subtask4_rows = []
for id_, val in zip(df_test["ID"], sf_abs):
    subtask4_rows.append((4, id_, val))

## Subtask 5

In [12]:
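# ID is only an identifier and Price is the target, so neither is used as a feature.
X_train = df_train.drop(columns=["ID", "Price"])
y_train = df_train["Price"]
# The test set has no Price column; only the identifier is dropped.
X_test = df_test.drop(columns=["ID"])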

In [13]:
X_train.dtypes

Square_Footage               int64
Num_Bedrooms                 int64
Num_Bathrooms                int64
Year_Built                   int64
Lot_Size                   float64
Garage_Size                  int64
Neighborhood_Quality         int64
Footage_to_Lot_Ratio       float64
Total_Rooms                  int64
Age_of_House                 int64
Garage_to_Footage_Ratio    float64
Avg_Room_Size              float64
House_Orientation_Angle    float64
Street_Alignment_Offset    float64
Solar_Exposure_Index       float64
Magnetic_Field_Strength    float64
Vibration_Level            float64
Garage_to_Room_Ratio       float64
Env_Stability_Index        float64
dtype: object

In [14]:
poly = PolynomialFeatures(degree=2)
X_train = poly.fit_transform(X_train)
X_test = poly.transform(X_test)

In [15]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
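
With the 19 input columns listed above, a degree-2 expansion produces 210 columns: one constant term, 19 linear terms, and 190 squares and pairwise products. The sketch below checks that count on random stand-in data. `include_bias=False` is shown as an optional variation, since `LinearRegression` fits its own intercept and the constant column becomes all zeros after standard scaling anyway.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 19))  # random stand-in for the 19 feature columns

# Default settings, as used in this notebook: the constant column is included.
poly_with_bias = PolynomialFeatures(degree=2)
n_with_bias = poly_with_bias.fit_transform(X_demo).shape[1]

# Variation without the constant column.
poly_no_bias = PolynomialFeatures(degree=2, include_bias=False)
n_no_bias = poly_no_bias.fit_transform(X_demo).shape[1]

print(n_with_bias, n_no_bias)
```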

In [16]:
lr = LinearRegression()

In [17]:
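# With greater_is_better=False the scorer returns negated MAE,
# so the result is multiplied by -1 to recover positive MAE values per fold.
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
cross_val_score(lr, X_train, y_train, cv=5, scoring=mae_scorer) * -1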

array([47.62957358, 50.19209595, 51.74388809, 57.62972336, 50.99069956])
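
In the cells above, the scaler is fitted on the full training set before cross-validation, so each validation fold has already influenced the scaling statistics. The effect is usually small, but one way to avoid it is to wrap the steps in a pipeline so that they are refitted inside every fold. A sketch on synthetic data (the data and the names `X_demo` and `y_demo` are made up for illustration, so the scores will not match the notebook's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic regression problem, for illustration only.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = (3.0 * X_demo[:, 0]
          + X_demo[:, 1] * X_demo[:, 2]
          + rng.normal(scale=0.1, size=200))

# Each step is refitted on the training part of every fold.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)

# scikit-learn reports negated MAE so that higher is better; flip the sign back.
fold_mae = -cross_val_score(model, X_demo, y_demo, cv=5,
                            scoring="neg_mean_absolute_error")
print(fold_mae)
```

The built-in `"neg_mean_absolute_error"` scoring string is used here in place of a custom scorer; both approaches yield negated MAE values.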

In [18]:
lr.fit(X_train, y_train)
preds = lr.predict(X_test)

In [19]:
subtask5_rows = []
for id_, val in zip(df_test["ID"], preds):
    subtask5_rows.append((5, id_, val))

## Save answers

In [20]:
submission_rows = subtask1_rows + subtask2_rows + subtask3_rows + subtask4_rows + subtask5_rows
df_submission = pd.DataFrame(submission_rows, columns=["subtaskID", "datapointID", "answer"])
df_submission.to_csv("submission.csv", index=False)
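
Before uploading, it can be worth checking the layout of the submission. With 200 test rows and five subtasks, the frame should hold 1000 rows with no missing answers. A sketch of such a check on a made-up frame (the helper `check_submission` and the toy IDs are hypothetical):

```python
import pandas as pd


def check_submission(df: pd.DataFrame, n_test_rows: int, n_subtasks: int = 5) -> None:
    """Raise AssertionError if the submission frame has an unexpected layout."""
    assert list(df.columns) == ["subtaskID", "datapointID", "answer"]
    assert len(df) == n_test_rows * n_subtasks
    assert df["answer"].notna().all()
    # Every subtask should contain exactly one answer per test row.
    counts = df.groupby("subtaskID")["datapointID"].nunique()
    assert (counts == n_test_rows).all()


# Toy submission: 2 test rows x 5 subtasks, with made-up answers.
toy_rows = [(s, i, float(s * i)) for s in range(1, 6) for i in (801, 802)]
toy_submission = pd.DataFrame(toy_rows, columns=["subtaskID", "datapointID", "answer"])

check_submission(toy_submission, n_test_rows=2)
print(len(toy_submission))
```

In this notebook the equivalent call would be `check_submission(df_submission, n_test_rows=len(df_test))`.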

## Submission results

Subtask 1:
- Equal: 200
- Score: 10/10

Subtask 2:
- Equal: 200
- Score: 10/10

Subtask 3:
- Equal: 200
- Score: 10/10

Subtask 4:
- Equal: 200
- Score: 10/10

Subtask 5:
- MAE: 56.84102
- Score: 60/60
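
For comparison, the five fold scores from cell 17 average to roughly 51.6, about 5.2 below the MAE of 56.84 reported on the test set. The arithmetic, using the values printed earlier in the notebook:

```python
import numpy as np

# Fold MAEs copied from the cross-validation output in cell 17.
cv_mae = np.array([47.62957358, 50.19209595, 51.74388809, 57.62972336, 50.99069956])
test_mae = 56.84102  # MAE reported in the submission results

cv_mean = cv_mae.mean()
gap = test_mae - cv_mean
print(round(cv_mean, 2), round(gap, 2))
```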