## TÖL506M Introduction to deep neural networks - Homework 1
### Due: Thursday 28.8.2025

**Objectives**: Python, numpy and pandas refresher, data preprocessing, linear regression, model fitting and model evaluation, optimization and loss functions.

**Name**: (your name here), **email:** (your email here), **collaborators:** (if any)

Please provide your solutions by filling in the appropriate cells in this notebook, creating new cells as needed. Hand in your solution on Gradescope. Make sure that you are familiar with the course rules on collaboration.

## Predicting Housing Prices in Iceland (100 points)

In this project, you will build models to predict housing prices in Iceland using real-world data from the Icelandic National Registry.

Such models are useful for:

* **Economists and policymakers** who want to understand which factors drive housing prices (especially if the models are interpretable).
* **Investors** who want to identify undervalued or overvalued properties.

### Data

The dataset is based on registered sales contracts (*þinglýstir kaupsamningar*) and includes information such as sales price, location, size, and more.
Download it here: [kaupskrá fasteigna](https://www.skra.is/gogn/grunngogn-til-nidurhals/kaupskra-fasteigna/).

⚠️ Notes about the data:

* It has relatively **few features**, which may make high accuracy difficult.
* Prices have increased sharply over time, so account for inflation, for example by deflating prices with a price index (see Notes & Tips, item 1).
* Restricting the dataset to a single type of property is allowed.
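
One possible way to account for inflation is to rescale every nominal price to the price level of a chosen base month. The sketch below is only an illustration on toy data: the column names `sale_date`, `price`, `month` and `index_value` are assumptions and do not match the real files, which you will need to inspect yourself.

```python
import pandas as pd


def deflate_prices(sales, price_index, base_month):
    """Rescale nominal prices to the price level of `base_month`.

    Hypothetical columns: `sales` has `sale_date` and `price`;
    `price_index` has `month` (e.g. "2024-06") and `index_value`.
    """
    sales = sales.copy()
    # Year-month string of each sale, used as the join key.
    sales["month"] = pd.to_datetime(sales["sale_date"]).dt.strftime("%Y-%m")
    merged = sales.merge(price_index, on="month", how="left")
    base = price_index.loc[price_index["month"] == base_month, "index_value"].iloc[0]
    merged["real_price"] = merged["price"] * base / merged["index_value"]
    return merged


# Toy data, not taken from the real dataset.
sales = pd.DataFrame({
    "sale_date": ["2015-03-10", "2024-06-01"],
    "price": [30_000, 60_000],
})
price_index = pd.DataFrame({
    "month": ["2015-03", "2024-06"],
    "index_value": [100.0, 200.0],
})
adjusted = deflate_prices(sales, price_index, base_month="2024-06")
print(adjusted[["sale_date", "price", "real_price"]])
```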

---

## Tasks

### a) Data exploration (10 points)

* Use **scatter plots** and **histograms** to explore the data.
* Identify missing values, correlated features, and outliers.
* Based on your analysis, create a **feature set** to use in your model.
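
As a starting point for the checks listed above, the following sketch counts missing values and flags outliers with the common 1.5 × IQR rule, then computes pairwise correlations. The column names and values are made up for illustration, and the IQR rule is only one of several reasonable outlier criteria. For the plots themselves, pandas offers `DataFrame.hist()` and `DataFrame.plot.scatter(x=..., y=...)`.

```python
import pandas as pd


def summarize(df):
    """Per-column missing-value and IQR-outlier counts for numeric columns."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Flag values more than 1.5 IQR below Q1 or above Q3 (NaN is never flagged).
    is_outlier = numeric.lt(q1 - 1.5 * iqr) | numeric.gt(q3 + 1.5 * iqr)
    return pd.DataFrame({
        "missing": numeric.isna().sum(),
        "outliers": is_outlier.sum(),
    })


# Toy data with one missing price and one implausible size.
df = pd.DataFrame({
    "size_m2": [50.0, 60.0, 70.0, 80.0, 1000.0],
    "price": [20.0, 25.0, None, 35.0, 40.0],
})
summary = summarize(df)
corr = df.corr()  # pairwise Pearson correlations, ignoring missing values
print(summary)
print(corr)
```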

---

### b) Baseline linear model (10 points)

* Split the dataset into training and test sets, where the **test set = years 2024 and 2025**.
* Train a **linear regression model** (least squares loss) on your feature set.
* Evaluate the model with one or more performance measures (see notes below).
* Which features turned out to be useful? Which did not?
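
The split by year and the least squares fit can be sketched with NumPy alone. The data below is synthetic (a single feature `size` and a made-up linear relation), and the helper names `fit_least_squares` and `predict` are illustrative, so treat this as a template rather than a solution.

```python
import numpy as np


def fit_least_squares(X, y):
    """Return weights (intercept first) that minimize the squared error."""
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w


def predict(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w


# Synthetic data: price = 10 + 0.5 * size + noise, sale years 2020 to 2025.
rng = np.random.default_rng(0)
year = rng.integers(2020, 2026, size=200)
size = rng.uniform(40, 200, size=200)
price = 10 + 0.5 * size + rng.normal(0, 1, size=200)

test_mask = year >= 2024  # test set = years 2024 and 2025
X_train, y_train = size[~test_mask].reshape(-1, 1), price[~test_mask]
X_test, y_test = size[test_mask].reshape(-1, 1), price[test_mask]

w = fit_least_squares(X_train, y_train)
test_mse = np.mean((y_test - predict(w, X_test)) ** 2)
print("weights:", w, "test MSE:", test_mse)
```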

---

### c) Alternative loss functions (10 points)

* Try a different loss function (e.g. mean absolute deviation, Huber, logcosh).
* Does this improve performance?
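
To illustrate how the loss changes the fit, the sketch below trains the same linear model by plain gradient descent with three losses, using each loss's derivative with respect to the residual. It assumes roughly standardized features (otherwise the fixed step size `lr` may be too large), and the names `LOSS_GRADIENTS` and `fit_linear` are illustrative. The mean absolute deviation is left out because it is not smooth, so fixed-step gradient descent oscillates around its minimum.

```python
import numpy as np

# Derivative of each loss with respect to the residual r = prediction - target.
LOSS_GRADIENTS = {
    "squared": lambda r: r,                     # loss 0.5 * r**2
    "huber": lambda r: np.clip(r, -1.0, 1.0),   # Huber loss with delta = 1
    "logcosh": np.tanh,                         # loss log(cosh(r))
}


def fit_linear(X, y, loss="squared", lr=0.5, steps=2000):
    """Fit weights (intercept first) by gradient descent on the mean loss."""
    A = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(A.shape[1])
    grad_r = LOSS_GRADIENTS[loss]
    for _ in range(steps):
        residual = A @ w - y
        gradient = A.T @ grad_r(residual) / len(y)
        w -= lr * gradient
    return w


# Synthetic data: y = 2 + 3x + small noise, with every 10th target corrupted.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = 2 + 3 * x + rng.normal(0, 0.1, size=300)
y[::10] += 50  # gross outliers

fits = {name: fit_linear(x.reshape(-1, 1), y, loss=name) for name in LOSS_GRADIENTS}
for name, w in fits.items():
    print(f"{name:8s} intercept={w[0]:.2f} slope={w[1]:.2f}")
```

On this toy data the squared loss lets the outliers pull the intercept far from 2, while the Huber and log-cosh fits stay close to the true line.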

---

### d) Advanced model: TabPFN (20 points)

* Try using **TabPFN**, a foundation model for tabular data.
* Example Colab notebook: [TabPFN Demo](https://colab.research.google.com/github/PriorLabs/TabPFN/blob/main/examples/notebooks/TabPFN_Demo_Local.ipynb).
* Optional background reading: [Nature paper](https://www.nature.com/articles/s41586-024-08328-6).
* Compare its performance to your linear model.

---

### e) Advanced model: AutoGluon (20 points)

* Try using **AutoGluon** ([GitHub link](https://github.com/autogluon/autogluon)).
* Compare results with linear regression and TabPFN.

---

### f) Learning curves (20 points)

* Plot a **learning curve** comparing all models (linear regression, TabPFN, AutoGluon).
* Train them with different amounts of data.
* Do the models’ performances **saturate** as you add more data?
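
A learning curve can be computed with a small model-agnostic helper that trains on nested random subsets of the training set and records the test error each time. The sketch below uses a least squares model on synthetic data. The function names and the `fit`/`predict` calling convention are assumptions chosen for illustration, so any other model would need to be wrapped to match them.

```python
import numpy as np


def learning_curve(fit, predict, X_train, y_train, X_test, y_test, sizes, seed=0):
    """Test MSE of a model trained on random subsets of increasing size."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X_train))
    errors = []
    for n in sizes:
        idx = order[:n]  # nested subsets: each one contains the previous one
        model = fit(X_train[idx], y_train[idx])
        errors.append(np.mean((y_test - predict(model, X_test)) ** 2))
    return np.array(errors)


def fit_ls(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]


def predict_ls(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w


# Synthetic data: y = 1 + 2x + unit-variance noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(1320, 1))
y = 1 + 2 * X[:, 0] + rng.normal(size=1320)

sizes = [5, 20, 80, 320]
errors = learning_curve(fit_ls, predict_ls, X[:320], y[:320], X[320:], y[320:], sizes)
print(dict(zip(sizes, errors.round(3))))
```

Plotting `errors` against `sizes` for each model (a logarithmic x-axis often helps) shows whether the curves flatten out.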

---

### g) Further improvements (10 points)

* Experiment with other techniques (e.g. feature engineering, transformations, handling categorical variables, better scaling).
* Report any improvements.

---

## Notes & Tips

1. The **Icelandic housing price index** may help correct for inflation: [link](https://hms.is/gogn-og-maelabord/visitolur).
2. You may need to **manually adjust the dataset** to load it into pandas.
3. Carefully inspect the data: outliers, incorrect formatting, missing values, and scaling can all harm your models.

   > *Remember: garbage in → garbage out.*
4. Useful performance metrics:

   * Mean squared error (MSE)
   * Mean absolute deviation (MAD)
   * $R^2$ coefficient
5. Consider categorical variables — do they need encoding?
6. Nonlinear transformations of variables may help.
7. Example code for L1 (MAD) regression: [StackOverflow link](https://stackoverflow.com/questions/51883058/l1-norm-instead-of-l2-norm-for-cost-function-in-regression-model).
8. Diagnostic plots:

   * Plot **$y_{\text{true}}$ vs. $y_{\text{pred}}$**.
   * Plot a histogram of **errors ($y_{\text{true}} - y_{\text{pred}}$)**.
9. Alternative loss functions: MAD, Huber, logcosh (see code snippet provided).
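
The three performance metrics from item 4 can be computed directly with NumPy. This is a minimal sketch with an illustrative function name, where $R^2$ is taken as one minus the ratio of the residual sum of squares to the total sum of squares around the mean of $y_{\text{true}}$.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MSE, MAD and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "mse": np.mean(errors ** 2),
        "mad": np.mean(np.abs(errors)),
        "r2": 1.0 - ss_res / ss_tot,
    }


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(metrics)
```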

---

👉 Deliverables: Submit your code, plots, and a short written report (2–4 pages) explaining your approach, results, and insights. Everything should be contained within a single notebook.

You may get help from AI tools (e.g. ChatGPT), but state clearly that you did so and list the sources you used.

In [None]:
# Insert your code here
# ...